Kuberhealthy & writing the AMI Exists check

Mike Tougeron
Published in GrepMyMind
Sep 19, 2020

Kuberhealthy from Comcast is an incredible tool for running synthetic checks against your Kubernetes cluster. I’ve been using it for several years now and was extremely happy when v2 came out a while back (seems like forever now). Since then there has also been a great post on the Kubernetes blog about using Kuberhealthy to track your KPIs. As my adoption of Kuberhealthy increased I started relying on it more & more to track the overall health of the clusters I was responsible for.

Recently a co-worker and I wanted to start learning Go. I’ve dabbled with it in the past but never really understood what I was doing. We figured that the best way to learn was to take a problem we had and try to solve it from the ground up using Go. We decided to take an open source approach and bring that code into our jobs, instead of the other way around. While our employer embraces and supports open source projects, we felt this was something we should do on our own to better support the community. My co-worker wrote the kuberhealthy-aws-iam-role-check tool and I wrote the kuberhealthy-ami-exists-check tool.

The great thing about some problems is that sometimes you’re not the only one who has them. Kuberhealthy itself ships a check that validates AWS AMIs if you’re using kops, but that doesn’t handle generic use-cases. I use golden images for the clusters I manage and each iteration of testing & deployment generates a new AMI. These can get messy and expensive to keep around, so we wrote a simple tool that finds AMIs with a specific tag and, if they’re too old and not in use, deletes them. Initially this tool only checked LaunchConfigurations to decide whether an AMI was in use. But then the unfortunate happened; we switched to LaunchTemplates and forgot to update the ami-cleaner tool, so a bunch of AMIs that were still in use got deleted. This prevented our AutoScalingGroups from scaling up and we ran out of resources available to the clusters!

But out of that sadness a new Kuberhealthy check was born. While it’s not a synthetic check, the existence of an AMI is critical to the health of the cluster. From a high level, the kuberhealthy-ami-exists-check works like this:

  • gathers the list of AWS instances backing the nodes in a Kubernetes cluster
  • queries the EC2 API for the AMIs those instances were launched from
  • queries the EC2 API to see which of those AMIs still exist
  • reports back to Kuberhealthy whether or not the two AMI counts match

If the numbers don’t match, it reports a failure to Kuberhealthy; otherwise it’s considered a pass. Sounds pretty straightforward, and I could have written the check in less than an hour using Python, but it turns out there was a lot to learn for a new Go programmer.
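
That reporting step is handled by Kuberhealthy’s external check client library. Roughly, the end of the check looks something like this (the function and variable names here are mine for illustration, not the check’s exact code):

import (
    "fmt"

    "github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient"
)

// reportToKuberhealthy compares the two counts and tells Kuberhealthy
// whether this run of the check passed or failed.
func reportToKuberhealthy(instanceAMICount, foundAMICount int) error {
    if instanceAMICount == foundAMICount {
        return checkclient.ReportSuccess()
    }
    return checkclient.ReportFailure([]string{
        fmt.Sprintf("expected %d AMIs but only found %d", instanceAMICount, foundAMICount),
    })
}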

Writing the check involved two basic elements: the Kubernetes API and the AWS API. Interacting with the Kubernetes API was even easier than I anticipated. There are a ton of good examples on the Internet and I had no problems learning how to do what I needed.

Authentication to Kubernetes using Go was super easy and I got that working within a few minutes. Basically all you have to do is import the k8s libraries, use the InClusterConfig() function, and away you go. If you’re testing from outside the cluster then you need to load the local kubeconfig when you get an err back from InClusterConfig() and authenticate that way.

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // InClusterConfig() works when the check is running inside a pod
    config, err := rest.InClusterConfig()
    if err != nil {
        // do error handling (e.g., fall back to a local kubeconfig)
    }

    k8sClient, err := kubernetes.NewForConfig(config)
    if err != nil {
        // do error handling
    }
    // do something with k8sClient
}
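
If you do handle that out-of-cluster fallback yourself, it’s only a few extra lines with client-go’s clientcmd package. A quick sketch, assuming the kubeconfig lives at the usual ~/.kube/config path:

import (
    "os"
    "path/filepath"

    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
)

// loadConfig prefers the in-cluster config and falls back to the local kubeconfig.
func loadConfig() (*rest.Config, error) {
    config, err := rest.InClusterConfig()
    if err == nil {
        return config, nil
    }
    kubeconfigPath := filepath.Join(os.Getenv("HOME"), ".kube", "config")
    return clientcmd.BuildConfigFromFlags("", kubeconfigPath)
}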

Lucky for me, it’s even easier with Kuberhealthy as they provide a helper for doing exactly this logic.

import (
    "os"
    "path/filepath"

    "github.com/Comcast/kuberhealthy/v2/pkg/kubeClient"
    _ "k8s.io/client-go/plugin/pkg/client/auth"
)

// path to a local kubeconfig, used as the fallback when not running in-cluster
var kubeConfigFile = filepath.Join(os.Getenv("HOME"), ".kube", "config")

func main() {
    k8sClient, err := kubeClient.Create(kubeConfigFile)
    if err != nil {
        // do error handling
    }
    // do something with k8sClient
}

Next up was getting the list of nodes from the Kubernetes API server. This was a nice, simple one-liner.

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func getNodeInstanceIDs() []string {
    nodes, err := k8sClient.CoreV1().Nodes().List(metav1.ListOptions{})
    if err != nil {
        // do error handling
    }
    // pull the instance ID out of each entry in nodes.Items and return them
}
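
The rest of that function just pulls the EC2 instance ID out of each node. On AWS a node’s spec.providerID looks like aws:///us-west-2a/i-0123456789abcdef0, so a tiny helper along these lines (illustrative, not the check’s exact code) does the trick:

import (
    "strings"

    v1 "k8s.io/api/core/v1"
)

// instanceIDFromNode extracts the EC2 instance ID from a node's providerID,
// e.g. "aws:///us-west-2a/i-0123456789abcdef0" -> "i-0123456789abcdef0".
func instanceIDFromNode(node v1.Node) string {
    parts := strings.Split(node.Spec.ProviderID, "/")
    return parts[len(parts)-1]
}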

Wow! I’m only a short time into this and I’m already able to interact with the Kubernetes API and get the data I need. I’m feeling pretty good about myself at this stage of the check. I took a quick break and then started working on the AWS portion of the code. This turned out to be just as easy as writing the Kubernetes code so I was flying along. I got the check working both in & out of cluster within just a couple of hours.
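
To give you a feel for the AWS portion, the instance lookup with aws-sdk-go goes roughly like this (a sketch with my own illustrative names, not the check’s exact code):

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

// getInstanceAMIs looks up the instances reported by Kubernetes and collects
// the AMI IDs they were launched from.
func getInstanceAMIs(instanceIDs []string) ([]string, error) {
    sess := session.Must(session.NewSession())
    client := ec2.New(sess)

    out, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
        InstanceIds: aws.StringSlice(instanceIDs),
    })
    if err != nil {
        return nil, err
    }

    amiIDs := []string{}
    for _, reservation := range out.Reservations {
        for _, instance := range reservation.Instances {
            amiIDs = append(amiIDs, aws.StringValue(instance.ImageId))
        }
    }
    return amiIDs, nil
}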

Before I could call it a day though I knew I needed to write some unit tests for this code. It just wouldn’t be right to publish something that wasn’t tested. What if I made a change and broke something?

I knew I needed a way to mock out the AWS API so my test suite could run offline or without an AWS account. It was at this point that my struggles began. How the heck do I do this? I’m just starting out and I have no idea what’s possible. The next five hours were spent banging my head on my desk until I finally figured it out. I had some help from several different blogs & gists but the one that helped me understand it the most was from Convalesco. After hours of trying to understand how the ec2iface.EC2API interface worked I realized that I needed to refactor all of my AWS code.
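
For the curious, the shape of that refactor is roughly this (again with illustrative names, not the check’s exact code): the AWS calls accept the ec2iface.EC2API interface instead of a concrete *ec2.EC2 client, and the unit tests pass in a mock that satisfies it.

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/ec2"
    "github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// findAMIs works with anything that implements the EC2 API, which is what
// makes it testable without talking to AWS.
func findAMIs(client ec2iface.EC2API, amiIDs []string) ([]*ec2.Image, error) {
    out, err := client.DescribeImages(&ec2.DescribeImagesInput{
        ImageIds: aws.StringSlice(amiIDs),
    })
    if err != nil {
        return nil, err
    }
    return out.Images, nil
}

// mockEC2Client embeds the interface and overrides only the calls the test
// suite needs, so the tests run offline without an AWS account.
type mockEC2Client struct {
    ec2iface.EC2API
    images []*ec2.Image
}

func (m *mockEC2Client) DescribeImages(in *ec2.DescribeImagesInput) (*ec2.DescribeImagesOutput, error) {
    return &ec2.DescribeImagesOutput{Images: m.images}, nil
}

The real code hands findAMIs the client from ec2.New(sess); the tests hand it a &mockEC2Client{} populated with whatever images the scenario needs.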

To be totally honest, I’m still not entirely sure how to explain what it’s doing in terms of Go so I’m not going to even try. To take what was complex to me and boil it down into something simple is beyond my skills in Go. Once I’m farther along in my journey I’ll write something up then.

At this point everything had come together and worked the way I wanted. It was finally time to start packaging it up and publishing it to GitHub. I used several GitHub Actions workflows to make this happen. The first workflow builds the code & runs the unit tests. I run a gosec scan as part of another workflow. But since I was planning on publishing a Docker image, I added Azure container-scan as a third workflow to make sure I was covering all my bases (hopefully) from a security perspective. Lastly I added a workflow to publish the container to both Docker Hub & GitHub Container Registry when a new release is published. I would have preferred to just publish to GHCR due to the upcoming Docker Hub restrictions, but from a discoverability perspective I needed it on Docker Hub as well.

I hope you found this journey as interesting as I did. If you want to check out and use the kuberhealthy-ami-exists-check, you can pull the image with:

# Or whatever the latest version is
docker pull ghcr.io/mtougeron/khcheck-ami-exists:v0.0.3

There’s an example of the check itself inside the repo as well.

