Ivan Porta

Posted on • Originally published at gtrekter.Medium

Leveraging AI for Kubernetes Troubleshooting via K8sGPT

Nowadays, there is a lot of excitement around AI and its new applications. For instance, in April/May 2024, there were at least four AI conventions in Seoul with thousands of attendees. So, what about Kubernetes? Can AI help us manage Kubernetes? The answer is yes. In this article, I will introduce K8sGPT.

What does GPT stand for?

GPT stands for Generative Pre-trained Transformer. It refers to a family of deep learning models built on a neural network that is pre-trained on a massive dataset of unlabeled text from sources such as books, articles, websites, and other digital texts. This pre-training enables the model to generate coherent and contextually relevant text. The first GPT was introduced in 2018 by OpenAI.

GPT models are based on the transformer architecture, introduced by researchers at Google, which uses a multi-head attention mechanism. Text is first split into tokens, each identified by a numeric ID. Tokens are also the unit in which usage is typically priced when these models are offered as a service.
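To make the idea of tokens concrete, here is a toy sketch. It assumes a tiny hypothetical vocabulary and plain whitespace splitting, whereas real GPT tokenizers use subword schemes such as byte-pair encoding. The price per 1,000 tokens is a made-up number for illustration, not a real tariff.

```python
# Toy tokenizer: maps whitespace-separated words to integer IDs.
# VOCAB and PRICE_PER_1K_TOKENS are hypothetical values for illustration only.
VOCAB = {"<unk>": 0, "the": 11, "pod": 19, "is": 30, "failing": 82}
PRICE_PER_1K_TOKENS = 0.002  # made-up price, in dollars


def tokenize(text: str) -> list[int]:
    """Return one token ID per word, falling back to <unk> for unknown words."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def estimate_cost(token_count: int) -> float:
    """Estimate the cost of a request from its token count."""
    return token_count / 1000 * PRICE_PER_1K_TOKENS


tokens = tokenize("The pod is failing")
print(tokens)
print(f"{len(tokens)} tokens, estimated cost ${estimate_cost(len(tokens)):.6f}")
```

Because billing follows the token count, longer prompts (for example, long log excerpts) cost proportionally more.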


Each token is then mapped to a vector through a lookup in an embedding table. This table is a pre-trained matrix in which each row corresponds to one token and holds a vector that represents it in a high-dimensional space, preserving semantic information about the token.

Token ID    Embedding Vector
11          [0.12456, -0.00324, 0.45238,...]
19          [-0.28345, 0.13245, 0.02938,...]
30          [0.11234, -0.05678, 0.19834,...]
82          [0.09876, 0.23456, -0.11234,...]
67474       [0.56438, -0.23845, 0.04238,...]
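The lookup itself can be sketched in a few lines of Python. This is only an illustration: the vectors are cut down to the three components shown in the table, the dictionary stands in for the pre-trained matrix, and `embed` is a hypothetical helper name.

```python
# Embedding lookup sketch: each token ID selects one row of a pre-trained table.
# The vectors are truncated to the three components shown in the table above;
# real embeddings have hundreds or thousands of dimensions.
EMBEDDING_TABLE = {
    11: [0.12456, -0.00324, 0.45238],
    19: [-0.28345, 0.13245, 0.02938],
    30: [0.11234, -0.05678, 0.19834],
    82: [0.09876, 0.23456, -0.11234],
    67474: [0.56438, -0.23845, 0.04238],
}


def embed(token_ids: list[int]) -> list[list[float]]:
    """Replace every token ID with its embedding vector."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


vectors = embed([11, 82])
for vector in vectors:
    print(vector)
```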

At each layer, each token is then contextualized against the other tokens in the context window through a multi-head attention mechanism, whose heads run in parallel.
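A minimal sketch of the core operation, scaled dot-product attention, may help here. It covers a single head only and leaves out the learned projection matrices, the multiple heads, and the causal mask that real GPT models use. The input vectors are arbitrary example values.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    Each argument is a list of vectors, one per token in the context window.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score the current token against every token in the window.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


# Simplification: queries, keys, and values are the raw token vectors here.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(tokens, tokens, tokens):
    print([round(x, 3) for x in row])
```

Each output row is a blend of all token vectors, weighted by how strongly the corresponding token attends to the others.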

What is K8sGPT?

K8sGPT is an open-source project written in Go that uses different providers (called backends) to access various AI language models. It scans the Kubernetes cluster to discover issues and reports the results, causes, and solutions in simple sentences. The target audience is site reliability engineers (SREs), whose duty is to maintain and improve service stability.

Installation and Configuration

Before running any analysis, you need to install the tool in an environment where kubectl is available and set up the backend that will be used for the queries. In this example, I will install K8sGPT on Ubuntu x64:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.24/k8sgpt_amd64.deb
sudo dpkg -i k8sgpt_amd64.deb

Once installed, we can configure it with the desired provider that will interact with the AI service’s APIs. In this example, I will use OpenAI.


Next, add the secret key to K8sGPT so that it can authenticate to the AI service:

$ k8sgpt auth add
Warning: backend input is empty, will use the default value: openai
Warning: model input is empty, will use the default value: gpt-3.5-turbo
Enter openai Key: 
openai added to the AI backend provider list

OpenAI is the default provider. To see the available providers and switch the default to a different one, execute the following commands:

$ k8sgpt auth list
Default:
> openai
Active:
> openai
Unused:
> localai
> azureopenai
> noopai
> cohere
> amazonbedrock
> amazonsagemaker

$ k8sgpt auth default --provider amazonsagemaker

Analyze the cluster

K8sGPT uses analyzers to triage and diagnose issues in the cluster. Each one of them will result in a series of requests (and subsequent usage of tokens) to the AI service’s APIs. To review which analyzers are enabled, execute the following:

$ k8sgpt filter list
Active:
> Pod
> ValidatingWebhookConfiguration
> Deployment
> CronJob
> PersistentVolumeClaim
> ReplicaSet
> Ingress
> Node
> MutatingWebhookConfiguration
> Service
Unused:
> HTTPRoute
> StatefulSet
> Gateway
> HorizontalPodAutoScaler
> Log
> PodDisruptionBudget
> NetworkPolicy
> GatewayClass

By enabling and disabling these analyzers, you can limit the requests sent to the AI service APIs and focus on specific types of resources. In this demo, we will analyze data coming from the logs and disable the Pod analyzer. To do so, I will execute the following:

$ k8sgpt filter remove Pod
$ k8sgpt filter add Log
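Besides editing the stored filter list, an analysis can also be restricted for a single run. The sketch below assembles such a command in Python. `build_analyze_command` is a hypothetical helper, and it assumes that the `--filter`, `--namespace`, and `--explain` flags of `k8sgpt analyze` are available in your installed version, so check `k8sgpt analyze --help` first.

```python
import shlex


def build_analyze_command(filters, namespace=None, explain=True):
    """Build the argument list for a one-off, filtered k8sgpt analysis."""
    command = ["k8sgpt", "analyze"]
    if explain:
        command.append("--explain")  # ask the AI backend for explanations
    for name in filters:
        command.append(f"--filter={name}")  # restrict this run to one analyzer
    if namespace:
        command.append(f"--namespace={namespace}")
    return command


command = build_analyze_command(["Log"], namespace="default")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Restricting a run this way keeps the number of requests to the AI service low without changing the persistent configuration.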

Now that K8sGPT is configured, we can start analyzing the cluster. In this example, I will deploy two pods with incorrect configurations and then analyze the cluster with K8sGPT. The first uses an nginx image with a non-existent tag, and the second uses a mysql image without the password option it requires to initialize the database.

$ kubectl run nginx --image=nginx:invalid_tag
$ kubectl run mysql --image=mysql:latest

If you check the pods running on the cluster, you will see that something went wrong:

$ kubectl get pods
NAME    READY   STATUS         RESTARTS   AGE
mysql   0/1     Error          0          6s
nginx   0/1     ErrImagePull   0          17s

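Let’s move forward and analyze the cluster with K8sGPT. The -e (--explain) flag asks the AI backend to explain each problem, --no-cache skips cached responses, and --with-doc adds the official Kubernetes documentation for the fields involved: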

$ k8sgpt analyze -e --no-cache --with-doc
 100% |█████████████████████████████████████████████████████████████████████████████████████████████████| (5/5, 34 it/min)
AI Provider: openai

Warnings :
- [HTTPRoute] failed to get API group resources: unable to retrieve the complete list of server APIs: gateway.networking.k8s.io/v1: the server could not find the requested resource

0 default/mysql(mysql)
- Error: 2024-07-01 07:28:34+00:00 [ERROR] [Entrypoint]: Database is uninitialized and password option is not specified
Error: Database is uninitialized and password option is not specified.
Solution:
1. Specify the password option for the database.
2. Initialize the database to resolve the uninitialized state.

1 default/nginx(nginx)
- Error: Error the server rejected our request for an unknown reason (get pods nginx) from Pod nginx
Error: The server rejected the request for an unknown reason when trying to get pods for the nginx Pod.
Solution:
1. Check the Kubernetes cluster logs for more details on the rejection.
2. Verify the permissions and access rights for the user making the request.
3. Ensure the Kubernetes API server is running and reachable.
4. Retry the request after resolving any issues.

2 kube-system/coredns-7db6d8ff4d-p8bxj(Deployment/coredns)
- Error: [INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: connection refused
Error: Unable to list namespaces in Kubernetes due to connection refusal.
Solution:
1. Check if the Kubernetes API server is running.
2. Verify the network connectivity between the client and API server.
3. Ensure the API server IP and port are correct.
4. Restart the API server if needed.

3 kube-system/kube-controller-manager-minikube(kube-controller-manager-minikube)
- Error: I0701 05:34:48.925627       1 actual_state_of_world.go:543] "Failed to update statusUpdateNeeded field in actual state of world" logger="persistentvolume-attach-detach-controller" err="Failed to set statusUpdateNeeded to needed true, because nodeName=\"minikube\" does not exist"
Error: Failed to update statusUpdateNeeded field in actual state of world because nodeName "minikube" does not exist.
Solution:
1. Check if the node "minikube" exists in the Kubernetes cluster.
2. If the node does not exist, create a new node with the name "minikube".
3. Update the statusUpdateNeeded field in the actual state of world.

4 kube-system/kube-scheduler-minikube(kube-scheduler-minikube)
- Error: W0701 05:34:34.522300       1 authentication.go:368] Error looking up in-cluster authentication configuration: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
Error: The user "system:kube-scheduler" is forbidden to access the configmaps resource in the kube-system namespace.
Solution:
1. Check the RBAC permissions for the user "system:kube-scheduler".
2. Grant the necessary permissions to access the configmaps resource.
3. Verify the changes by attempting to access the configmaps resource again.

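As you can see, it provides a list of errors. Some of them are side effects of the underlying problem or unrelated noise from the cluster’s system components, but the mysql entry correctly explains that pod’s misconfiguration. The nginx entry is less precise: with the Pod analyzer disabled, the Log analyzer can only report that it failed to retrieve the pod’s logs, most likely because the container never started.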
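As an illustration of the fixes this analysis points toward, the sketch below builds corrected Pod manifests in Python and prints them as a single JSON list, a format `kubectl apply -f -` accepts. The helper `pod_manifest`, the `nginx:stable` tag, and the placeholder password are assumptions for this example. In practice, the password would come from a Secret rather than a literal value.

```python
import json


def pod_manifest(name, image, env=None):
    """Return a minimal single-container Pod manifest as a dictionary."""
    container = {"name": name, "image": image}
    if env:
        container["env"] = [{"name": k, "value": v} for k, v in env.items()]
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {"containers": [container]},
    }


pods = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        # An image tag that exists, instead of nginx:invalid_tag.
        pod_manifest("nginx", "nginx:stable"),
        # The official mysql image needs a root password option to initialize.
        pod_manifest("mysql", "mysql:latest", {"MYSQL_ROOT_PASSWORD": "change-me"}),
    ],
}
print(json.dumps(pods, indent=2))
```

The broken pods would need to be deleted first (`kubectl delete pod nginx mysql`), since most Pod fields cannot be changed in place.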

Conclusions

This tool should not be considered the sole source of truth, but rather a good starting point for troubleshooting. It narrows the search for the problem in the cluster. Organizations that don’t want to share their data with OpenAI can use a locally hosted model instead, for example through the localai backend listed earlier.
