Hacker News

Azure Kubernetes Wrangler (SRE) here. Before I turn some LLM loose on my cluster, I need to know what it supports, how it supports it, and how I can integrate it into my workflow.

The videos show a CrashLoopBackOff pod and analyzing its logs. That works if the Pod is writing to stdout, but I've got some stuff going straight to ElasticSearch. Does the LLM speak ElasticSearch? How about log files inside the Pod? (Don't get me started on that nightmare.)

You also show fixing the issue by editing YAML in place. That's great, except my FluxCD is going to revert it since you violated the principle of "everything goes through GitOps". So if you are going to change anything, you need to update the proper git repo. Said GitOps setup also uses Kustomize, so I hope you understand all the interactions there.
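To make the Kustomize point concrete: with an overlay like the sketch below (repo layout and names are hypothetical), the live Deployment spec only exists as base plus patch, so a fix applied with kubectl edit never lands in git and Flux stomps it on the next reconcile.

```yaml
# overlays/prod/kustomization.yaml -- hypothetical repo layout
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: resource-limits-patch.yaml  # the fix has to land here, not in the live object
    target:
      kind: Deployment
      name: my-app                    # hypothetical name
```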

Personally, the stuff that takes the most troubleshooting time is Kubernetes infrastructure. The CNI is acting up. The Ingress Controller is missing proper path-based routing. A NetworkPolicy says no to a Pod talking to the PostgreSQL server. CertManager is on strike and a certificate has expired. If the LLM is quick at identifying those, it has some uses, but selling me on "a dev made a mistake with the Pod config" is not likely to move the needle, because I'm already really quick at identifying that.
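The NetworkPolicy case is a decent illustration of why the infra issues are harder: the fix usually isn't in the failing Pod at all. A sketch of the kind of egress rule that's typically missing (namespace, labels, and the database address are all assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-postgres-egress   # hypothetical
  namespace: app                # hypothetical
spec:
  podSelector:
    matchLabels:
      app: my-app               # hypothetical label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.5.4/32   # assumed address of the external PostgreSQL server
      ports:
        - protocol: TCP
          port: 5432
```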

Maybe I'm not the target market, and the target market is "small dev teams that bought Kubernetes without realizing what they were signing up for".



Your comment brings up a good point (and also one of our big challenges): there is huge diversity in the tools teams use to set up and operate their infra. Right now our platform only speaks to your cluster directly through kubectl commands. We'll build other integrations so it can communicate with things like ElasticSearch to broaden its context as needed, but we'll have to be somewhat thoughtful in picking the highest-ROI integrations to build.

Currently, we only handle the investigation piece and suggest a remediation to the on-call engineer. But to properly move into automatically applying a fix, which we hope to do at some point, we'll need to integrate into CI/CD.

As for the demo example, I agree that the issue itself isn't the most compelling. We used it as an example since it is easy to visualize and set up for a demo. The agent is capable of investigating more complex issues we've seen in our customers' production clusters, but we're still looking for a way to better simulate these in our test environment, so if you or anyone else has ideas we'd love to hear them.

We do think this has more value for engineers and teams with less expertise in k8s, but we think SREs will still find it useful.


>we're still looking for a way to better simulate these on our test environment, so if you/anyone has ideas we’d love to hear them.

Pick a Kubernetes offering from the big 3, deploy it, then blow it up.


On Azure, deploy a Kubernetes cluster with the following:

- Azure CNI with Network Policies
- Application Gateway for Containers
- ExternalDNS hooked to Azure DNS
- Ingress NGINX
- Azure Database for PostgreSQL Flexible Server (outside the cluster)
- FluxCD/Argo
- Something using Workload Identity

Once all that is configured, put some fake workloads on it and start misconfiguring it with your LLM wired up. When the fireworks start, identify the failures and train your LLM properly.
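As one concrete fault to inject (all names here are made up): point a cert-manager Certificate at a ClusterIssuer that doesn't exist. cert-manager just keeps retrying while the old certificate quietly expires, which is exactly the "CertManager is on strike" failure mode from upthread.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls                   # hypothetical
  namespace: workloads            # hypothetical
spec:
  secretName: app-tls-secret
  dnsNames:
    - app.example.com
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-prod-typo   # deliberately references a non-existent issuer
```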


> we think SREs will still find it useful

There are two kinds of outages: people being idiots and legit hard-to-track-down bugs. SREs worth their salt don't need help with the former. They may find an AI bot somewhat useful to find root cause quicker, but usually not so valuable as to justify paying the kind of price you would need to charge to make your business viable to VCs. As for the latter, good luck collecting enough training data.

Otherwise, you're selling a self-driving car to executives who want the chauffeur without the salary. Sounds like a great idea, until you think about the tail cases. Then you wish you had a chauffeur (or picked up driving skills yourself).

Maybe you'll find a market, but as an SRE, I wouldn't want to sell it.


I basically want to +1 this. OP isn't selling to any place that is already spending six figures on SRE salaries. Actual competitors are companies like Komodor and Robusta who sell "we know Kubernetes better than you" solutions to companies that don't want to spend money on SRE salaries. Companies in this situation should just seriously reconsider hosting on Kubernetes and go back to higher-level managed services like ECS/Fargate, Fly/Railway, etc.


> CertManager is on strike and certificate has expired

Had a good chuckle here, hah.


Same. I typically call it "hung", but maybe saying CertManager is on strike will get the point across better.

But, sigh, it really does speak to the state of the Kubernetes ecosystem. All these projects need to be operated; you can't just set it and forget it.


I'm sure this is on their roadmap, but honestly a prerequisite should be a separate piece of software that analyzes and suggests changes to your error handling.

This is a cool proof of concept, but almost useless otherwise in a production system.

I can already feed Claude or ChatGPT my kubectl output pretty easily.

Error handling and logging tailored for consumption by a specific pre-trained model: that's where this will be groundbreaking.


The AI needs to be integrated into the dev IDE. Everything my logging is screaming about is terrible decisions made by devs long ago, but getting them fixed now is impossible because they don't want to do it and no one is going to make them.


That is something we're working on -- good observability is a place where teams usually fall short, and it's often the limiting factor in better incident response. We're building logging integrations as a first step.



