I'm not sure if this is what the writer was getting at, but I tend to check telemetry for my production applications regularly, not because I'm looking for things that would fire alerts, but to keep a sense of what production looks like. Things like request rate, average latency, top request paths, etc. It's not about knowing something is broken, it's about knowing what healthy looks like.
Understanding what your code looks like in production gives you a lot better sense of how to update it, and how to fix it when it does inevitably break. I think having AI checking for you will make this basically impossible, and that probably makes it a pretty bad idea.
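For anyone wondering what that kind of "what does healthy look like" check boils down to, here's a minimal sketch. It assumes you can pull a window of request records out of your telemetry as dicts with `path` and `latency_ms` keys; the record shape and the `baseline_summary` name are made up for illustration and don't come from any particular tool.

```python
from collections import Counter
from statistics import mean


def baseline_summary(requests, window_seconds, top_n=3):
    """Summarize one window of request records.

    Each record is a hypothetical dict: {"path": str, "latency_ms": float}.
    Returns request rate (per second), average latency, and the top paths.
    """
    if not requests:
        # Nothing to summarize; avoid calling mean() on an empty sequence.
        return {"request_rate": 0.0, "avg_latency_ms": None, "top_paths": []}

    path_counts = Counter(r["path"] for r in requests)
    return {
        "request_rate": len(requests) / window_seconds,
        "avg_latency_ms": mean(r["latency_ms"] for r in requests),
        "top_paths": path_counts.most_common(top_n),
    }


if __name__ == "__main__":
    sample = [
        {"path": "/checkout", "latency_ms": 120.0},
        {"path": "/search", "latency_ms": 45.0},
        {"path": "/search", "latency_ms": 55.0},
    ]
    print(baseline_summary(sample, window_seconds=60))
```

The point isn't the numbers themselves, it's looking at them often enough that you notice when they drift.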
This is a good answer, and I agree that having good production intuition like this is important. You're probably also right that having AI do it doesn't get you that value.
I'm not sure I'd do this once a day. I tend to take note of things to build that intuition when I have other reasons to go and look at dashboards, and we have a weekly SLO review as a team, but perhaps there's a place for this in some way.
Yeah, agreed. Daily isn't really necessary outside of initial launch and maybe a busy season. It's really just often enough to build a good sense of production use, and keep it up to date.
Almost no one actually knows how to set up their monitoring. Like, they know the words but not the full picture or how the pieces should actually fit together. Then they do shit like this to try and make up for that fact.
I read the article as a way for AI to check, classify and potentially partially fix the alerts you see when logging in in the morning.
And for many alerts you need to look at other events around them to properly classify and partially solve them. Because of that, you need to give the AI more than just the alerts.
Though I do see a risk similar to wrongly tuned alerts:
Not everything that resolves by itself and can be ignored _in this moment_ is a non-issue. E.g. it's pretty common for a system with some rare, ignorable warnings/errors to fall completely flat when onboarding a lot of users, introducing a new high-load feature, etc., due to exactly the things you could fully ignore beforehand.
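A rough back-of-the-envelope sketch of why that happens, with made-up numbers. It assumes the per-request error rate stays constant as traffic grows, which is the optimistic case; under real load the rate often gets worse. The `projected_daily_errors` name and the figures are purely illustrative.

```python
def projected_daily_errors(errors_per_million, daily_requests):
    """Expected errors per day, assuming a constant per-request error rate."""
    return errors_per_million * daily_requests / 1_000_000


if __name__ == "__main__":
    rate = 10  # hypothetical: 10 failures per million requests

    # Today: roughly one failure every two days, easy to wave away.
    print(projected_daily_errors(rate, 50_000))      # 0.5

    # After a big onboarding: same rate, but now hundreds of failures a day.
    print(projected_daily_errors(rate, 50_000_000))  # 500.0
```

Same bug, same rate, but one of those you ignore and the other pages you all day.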