Hacker News

Azure Kubernetes Wrangler (SRE) here. Before I turn some LLM loose on my cluster, I need to know what it supports, how it supports it, and how I can integrate it into my workflow.

The videos show a CrashLoopBackOff pod and analyzing its logs. That works if the Pod is writing to stdout, but I've got some stuff going straight to ElasticSearch. Does the LLM speak ElasticSearch? How about log files inside the Pod? (Don't get me started on that nightmare.)

You also show fixing the issue by editing YAML in place. That's great, except my FluxCD is going to revert it since you violated the principle of "everything goes through GitOps". So if you are going to change anything, you need to update the proper git repo. Said GitOps setup also uses Kustomize, so I hope you understand all the interactions there.
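To make the Kustomize point concrete: with an overlay like the sketch below (repo layout and names are hypothetical), the live Deployment spec only exists as base plus patch, so a fix applied with kubectl edit never lands in git and Flux stomps it on the next reconcile.

```yaml
# overlays/prod/kustomization.yaml -- hypothetical repo layout
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: resource-limits-patch.yaml  # the fix has to land here, not in the live object
    target:
      kind: Deployment
      name: my-app                    # hypothetical name
```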

Personally, the stuff that takes the most troubleshooting time is Kubernetes infrastructure. The CNI is acting up. The Ingress Controller is missing proper path-based routing. A NetworkPolicy says no to a Pod talking to the PostgreSQL server. CertManager is on strike and a certificate has expired. If the LLM is quick at identifying those, it has some uses, but selling me on "a dev made a mistake with the Pod config" is not likely to move the needle, because I'm already really quick at identifying that.
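The NetworkPolicy case is a decent illustration of why the infra issues are harder: the fix usually isn't in the failing Pod at all. A sketch of the kind of egress rule that's typically missing (namespace, labels, and the database address are all assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-postgres-egress   # hypothetical
  namespace: app                # hypothetical
spec:
  podSelector:
    matchLabels:
      app: my-app               # hypothetical label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.5.4/32   # assumed address of the external PostgreSQL server
      ports:
        - protocol: TCP
          port: 5432
```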

Maybe I'm not the target market, and the target market is "small dev teams that bought Kubernetes without realizing what they were signing up for".



Your comment brings up a good point (and also one of our big challenges): there is huge diversity in the tools teams use to set up and operate their infra. Right now our platform only speaks to your cluster directly through kubectl commands. We'll build other integrations so it can communicate with things like ElasticSearch to broaden its context as needed, but we'll have to be somewhat thoughtful in picking the highest-ROI integrations to build.

Currently, we only handle the investigation piece and suggest a remediation to the on-call engineer. But to properly move into automatically applying a fix, which we hope to do at some point, we'll need to integrate into CI/CD.

As for the demo example, I agree that the issue itself isn't the most compelling. We used it as an example since it is easy to visualize and set up for a demo. The agent is capable of investigating more complex issues we've seen in our customers' production clusters, but we're still looking for a way to better simulate these in our test environment, so if you or anyone else has ideas we'd love to hear them.

We do think this has more value for engineers and teams with less expertise in k8s, but we think SREs will still find it useful.


>we're still looking for a way to better simulate these on our test environment, so if you/anyone has ideas we’d love to hear them.

Pick a Kubernetes offering from the big 3, deploy it, then blow it up.


On Azure, deploy a Kubernetes cluster with the following:

- Azure CNI with Network Policies
- Application Gateway for Containers
- ExternalDNS hooked to Azure DNS
- Ingress NGINX
- Azure Database for PostgreSQL Flexible Server (outside the cluster)
- FluxCD/Argo
- Something using Workload Identity

Once all that is configured, put some fake workloads on it and start misconfiguring it with your LLM wired up. When the fireworks start, identify the failures and train your LLM properly.
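As one concrete fault to inject (all names here are made up): point a cert-manager Certificate at a ClusterIssuer that doesn't exist. cert-manager just keeps retrying while the old certificate quietly expires, which is exactly the "CertManager is on strike" failure mode from upthread.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls                   # hypothetical
  namespace: workloads            # hypothetical
spec:
  secretName: app-tls-secret
  dnsNames:
    - app.example.com
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-prod-typo   # deliberately references a non-existent issuer
```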


> we think SREs will still find it useful

There are two kinds of outages: people being idiots and legit hard-to-track-down bugs. SREs worth their salt don't need help with the former. They may find an AI bot somewhat useful to find root cause quicker, but usually not so valuable as to justify paying the kind of price you would need to charge to make your business viable to VCs. As for the latter, good luck collecting enough training data.

Otherwise, you're selling a self-driving car to executives who want the chauffeur without the salary. Sounds like a great idea, until you think about the tail cases. Then you wish you had a chauffeur (or picked up driving skills yourself).

Maybe you'll find a market, but as an SRE, I wouldn't want to sell it.


I basically want to +1 this. OP isn't selling to any place that is already spending six figures on SRE salaries. Actual competitors are companies like Komodor and Robusta who sell "we know Kubernetes better than you" solutions to companies that don't want to spend money on SRE salaries. Companies in this situation should just seriously reconsider hosting on Kubernetes and go back to higher-level managed services like ECS/Fargate, Fly/Railway, etc.


> CertManager is on strike and certificate has expired

Had a good chuckle here, hah.


Same. I typically call it "hung", but maybe saying CertManager is on strike will get the point across better.

But, sigh, it really does speak to the state of the Kubernetes ecosystem. All these projects need to be operated; you can't just set it and forget it.


I'm sure this is on their roadmap, but honestly a prerequisite should be a separate piece of software that analyzes and suggests changes to your error handling.

This is a cool proof of concept, but almost useless otherwise in a production system.

I can already feed Claude or ChatGPT my kubectl output pretty easily.

Error handling and logging tailored for consumption by a specific pre-trained model: that's where this will be groundbreaking.


The AI needs to be integrated into the dev IDE. Everything my logging is screaming about is terrible decisions made by devs long ago, but getting them fixed now is impossible because they don't want to do it and no one is going to make them.


That is something we're working on -- good observability is a place where teams usually fall short, and it's often the limiting factor in better incident response. We're building logging integrations as a first step.



