Kube-Reporting via Ansible Operator

Chris Hambridge
Published in The Startup
Apr 29, 2020


Kubernetes is everywhere, permeating the tech world.

While Kubernetes has helped to simplify the orchestration of containers and cloud native applications, no enterprise has just one cluster. We know data is key for business, especially in this critical time, and many top companies are embracing data-driven decision making. An enterprise may want to evaluate utilization, capacity, health, geo-based activity, and showback/chargeback across its hybrid cloud footprint. So how do you gather the data you need, not just for what is happening now or today, but for the last month or quarter?

Data Collection

There are really two strategies you could employ for data collection across an enterprise-scale Kubernetes footprint.

  • Bring all the monitoring data from each cluster to a central location to be queried
  • Query the monitoring data at each cluster and centralize the results to be processed/aggregated

Both approaches have pros and cons and associated open-source tools.

Thanos

If you are familiar with Prometheus, just think of Thanos as a multi-cluster Prometheus. Thanos is a CNCF Sandbox project which will likely continue to grow in popularity and utility as IT footprints continue to scale.

Pros:

  • Open-source tool used by many companies
  • Leverages a known production grade monitoring tool, Prometheus
  • Handles a multi-cluster landscape

Cons:

  • Clusters must have network connectivity to perform the remote write of Prometheus data to Thanos (poor air-gapped support)
  • Storage cost for massive amounts of metric data for each cluster
  • Data storage trade-offs of what metrics to “export” vs. how long you will store the data
  • PromQL may not be a familiar or usable language for the users accessing the data

Metering Operator

The Metering Operator brings big data analytics to Kubernetes. It relies on PrestoDB to query and report on large data sets: Metering uses PrestoDB to collect data from Prometheus, gathering it with reports that run on demand or on a schedule (hourly, for example) and rolling this data up into aggregations. Metering takes care of the scheduling and the partitioning of data to keep queries performant. You can access the results of the reports using a Kubernetes route, requesting JSON or CSV to fit your needs.
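For a concrete picture, a minimal Report custom resource in the shape the upstream Metering documentation describes looks like the following; the name, namespace, query, and start date are illustrative examples, not settings from any particular deployment.

```yaml
# Illustrative Metering Report: runs the pod-cpu-request query every
# hour starting from reportingStart. Metadata values are examples only.
apiVersion: metering.openshift.io/v1
kind: Report
metadata:
  name: pod-cpu-request-hourly
  namespace: metering
spec:
  query: pod-cpu-request
  reportingStart: "2020-04-01T00:00:00Z"
  schedule:
    period: hourly
```

Once a scheduled run completes, the results can be fetched over the reporting API as JSON or CSV.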

Pros:

  • Open-source tool leveraging Prometheus to extract monitoring data
  • Utilizes PrestoDB to query, collect, and combine data
  • PrestoDB is ANSI SQL compliant, which may be more consumable for data engineers (see the ReportQuery sketch after the cons list below)

Cons:

  • Increased cluster resource consumption from running the Metering Operator on each cluster
  • Reports must be created, and results collected, on each cluster across the multi-cluster landscape
  • Collected results still need to be aggregated centrally
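To make the SQL point concrete, here is a simplified sketch of a Metering ReportQuery: the queries Metering runs are plain ANSI SQL executed by PrestoDB. The columns, table name, and SQL body below are illustrative, not one of the shipped queries.

```yaml
# Illustrative ReportQuery: Metering stores report logic as ANSI SQL.
# The columns and the query body here are simplified examples.
apiVersion: metering.openshift.io/v1
kind: ReportQuery
metadata:
  name: example-pod-cpu-request
spec:
  columns:
    - name: pod
      type: varchar
    - name: cpu_request_core_seconds
      type: double
  query: |
    SELECT pod,
           sum(pod_request_cpu_core_seconds) AS cpu_request_core_seconds
    FROM datasource_pod_request_cpu_cores
    GROUP BY pod
```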

Technology Selection

Both of the above technologies are interesting and have valid use cases, but let's break down our technology selection, as it's rooted in the problem my team is attempting to solve.


Project Koku is focused on building a cost management tool for a multi-cluster hybrid cloud IT footprint. It shows spend by correlating public cloud costs or a defined cost model across clusters, projects, and tags.

In order to provide cost management data, we need rich pod-level data (CPU/memory usage, requests, limits) along with associated labels that can be tracked for months. The volume and cardinality of the data make the choice of Thanos prohibitive. We selected the Metering Operator; one can think of this problem in the vein of a newer edge computing scenario: if we process the high-volume data close to where it lives and extract only the essential information, it becomes lighter weight and more practical to aggregate at scale.

Automating Data Collection with an Ansible Operator

One of the cons listed above was the creation of reports and the collection of the resulting data. Our team is attempting to alleviate this pain and simplify onboarding to our data pipeline using an operator.

Why an Ansible Operator?

In a previous story, Testing Ansible Roles: A practical application, I described our team's usage of Ansible to automate the creation of the Metering custom resources and data collection. The story Operator Metering with Look Back: Kubernetes Reports highlights the basics of Metering and breaks down our use of it. As operators have grown in capability, maturity, and popularity within the Kubernetes ecosystem with the rise of OperatorHub, we chose to meet our customers' needs by providing a simpler and more uniform installation and configuration flow.

From Nothing to Published Operator in 2 Months

The team built and published our Cost Management operator on OperatorHub in two months. How did we accomplish this, you may ask?

  • Operator Framework SDK provides a great guide to bootstrap new developers (both Golang & Ansible)
  • Previous knowledge and skill with Ansible & Molecule among the team
  • Metering's out-of-the-box resources and previously built reports, along with a flow for setup and collection

Our biggest gaps were understanding how to best test our operator with Molecule and how to publish it once we had a viable operator.

The initial gap came down to understanding how to best utilize and debug the scaffolded Molecule test setup provided via the Operator Framework SDK. The test-local flow, which was used for unit testing and Travis CI testing, makes use of bsycorp/kind, a single-node cluster for CI. It's important to think of it as a container that, when you exec into it, is running internal containers for each of the Kubernetes components; if you can identify your operator's container name, you are able to navigate some of the complexities when debugging various failures during the development phase.
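For orientation, a stripped-down Molecule scenario along those lines might look like the sketch below; the image tag, ports, and playbook name are assumptions rather than the project's exact configuration.

```yaml
# Rough sketch of a molecule/test-local/molecule.yml for an Ansible operator.
# bsycorp/kind runs a full single-node Kubernetes cluster inside one
# privileged container; the image tag and exposed ports are illustrative.
driver:
  name: docker
platforms:
  - name: kind-test-local
    image: bsycorp/kind:latest-1.16
    privileged: true
    exposed_ports:
      - 8443/tcp
      - 10080/tcp
provisioner:
  name: ansible
  playbooks:
    converge: playbook.yml
```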

The second part of that question had a clear path with only a few hurdles along the way, as OperatorHub has documented the publishing steps. There were a few areas with less than stellar documentation, but nothing unexpected in a rapidly growing and iterating open-source community. We were able to reach out with questions and receive comments within our pull request.

Delivered Functionality

Installed Cost Management operator from OperatorHub on OpenShift 4

Our operator watches two types of custom resources, each of which triggers a different Ansible role (the watches.yaml sketch below shows the mapping). The first role configures the Metering resources using the k8s module.
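In an Ansible operator, that mapping between custom resources and roles lives in watches.yaml; the sketch below uses illustrative group, kind, and role names rather than the operator's published ones.

```yaml
# Hypothetical watches.yaml: each watched kind dispatches to a role.
# Group, kind, and role names are illustrative stand-ins.
- version: v1alpha1
  group: cost-mgmt.example.com
  kind: CostManagementSetup
  role: /opt/ansible/roles/setup
- version: v1alpha1
  group: cost-mgmt.example.com
  kind: CostManagementData
  role: /opt/ansible/roles/collect
```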

Creating Metering resources with k8s module
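The embedded gist isn't reproduced here, but a minimal sketch of the idea, applying a templated Metering resource with the k8s module, looks like this (the template and variable names are assumptions):

```yaml
# Minimal sketch: render a Jinja2 template of a Metering resource and
# apply it with the k8s module. Names here are illustrative.
- name: Create Metering ReportDataSource
  k8s:
    state: present
    namespace: "{{ metering_namespace }}"
    definition: "{{ lookup('template', 'report_datasource.yaml.j2') | from_yaml }}"
```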

Report files are templated, then created and destroyed monthly, to keep the data size from endlessly growing and to keep requests for the results performant.

Reusable task for creating previous, current, and next month reports
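Again as a sketch rather than the original gist: a reusable task file can be included once per month window, with an offset driving the templated report's start and end dates (the file name, variable names, and offsets are assumptions):

```yaml
# Illustrative loop: include a reusable task file once per month window.
# The included file would template a Report spanning the offset month.
- name: Create previous, current, and next month Report resources
  include_tasks: create_month_report.yml
  vars:
    month_offset: "{{ item }}"
  loop:
    - -1   # previous month
    - 0    # current month
    - 1    # next month
```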

The second role is focused on data collection and runs every 6 hours, collecting the last 3 days of data (to cover any delivery gaps or potential operator outages) using the look-back functionality mentioned above. Once the data is retrieved into a persistent volume, it is compressed and uploaded to a target service.
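As a rough sketch of that final step, assuming the results land as CSV files on the persistent volume and are posted with a bearer token (all variable names, paths, and the expected status code below are placeholders):

```yaml
# Placeholder sketch of the compress-and-upload flow in the collection role.
- name: Compress collected report results
  archive:
    path: "{{ collect_dir }}/*.csv"
    dest: "{{ collect_dir }}/cost-report.tar.gz"
    format: gz

- name: Upload the archive to the target service
  uri:
    url: "{{ upload_url }}"
    method: POST
    src: "{{ collect_dir }}/cost-report.tar.gz"
    headers:
      Authorization: "Bearer {{ auth_token }}"
    status_code: 202
```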

Along with these capabilities, we were able to leverage dependency resolution. Since our operator depends on the Metering Operator, OLM will install the Metering Operator if it's not already present, and no activity will take place within the Cost Management operator until Metering has been configured.
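That dependency resolution is driven by the ClusterServiceVersion: declaring a Metering CRD as required tells OLM to first install the operator that owns it. The excerpt below is a hypothetical sketch of that declaration; the CRD name shown is the upstream MeteringConfig one, used here for illustration.

```yaml
# Hypothetical ClusterServiceVersion excerpt: a required CRD makes OLM
# install the Metering Operator before our operator becomes active.
spec:
  customresourcedefinitions:
    required:
      - name: meteringconfigs.metering.openshift.io
        version: v1
        kind: MeteringConfig
        displayName: MeteringConfig
        description: Metering installation that the Cost Management operator depends on.
```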

What’s Next?

Where should we go from here? I’ll mention a few ideas I’ve been thinking through, but we’d love to hear from you. Feel free to reach out to us on Twitter or Gitter.

  • Provide an upstream OperatorHub operator: Today our operator can run on Kubernetes but is wired to collect data and transmit it back to cloud.redhat.com, which isn't very useful for upstream users. We could easily provide a mechanism to supply an alternate URL to post the data to, continuing to utilize either basic authentication or a token, but are there other options that would be preferred?
  • Improve operator capability level: The current operator is only a Level I operator. While it's simple, there are a few improvements that can be made to bring this operator up in capabilities.
  • Create a template repository: Generalize the functionality into a template repository that makes it simple to deliver your own Metering resources to your Kubernetes cluster. We are working to contribute this repository to the kube-reporting organization.
  • Build out a community of report resources: Part of Ansible's success is that not everyone has to be an expert in every aspect; the community has grown large enough that it's likely someone else has already done what you're looking for. Building out curated content similar to Ansible Galaxy would reduce the burden of report writing.

Synopsis

Open-source technologies now exist that afford a view into an enterprise's multi-cluster Kubernetes landscape. Collecting and utilizing this data can and should be an important factor in making data-driven decisions for the business. Kube-Reporting provides the mechanism to query and gather low-level metric data with its Metering Operator, and an Ansible operator can quickly be built to deliver a consistent and simple installation and data collection flow for Kubernetes metric reports. Let our team know about your interest and what we can do next in this space.
