Notably, they are not discussing centralized collection and indexing of debug logs; they implicitly leave the logs on the disk of the machine where they were produced, and go out to read them when called for. This is an important lesson, because centralizing and indexing logs is very foolish unless you are a Splunk shareholder.
I'd say retaining logs is the real mistake. You need to retain some, but it's going to be far less than you think. The more interesting data is usually "what's being logged right now", and none of the big logging systems have an answer as good as "tail -f somefile.log" or "journalctl -f -u some-service" on the machine it's running on...which sucks, because technically there's value to be found there! Connect me to the entire cluster I'm running and tail "just these logs" in as close to real-time as you can.
EDIT: I know it can be emulated in a bunch of ways, but it's certainly not a first class feature.
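One of the ways it can be emulated is just fanning `tail -F` out over SSH and multiplexing the streams back. A minimal sketch of the idea (the host names and log path here are hypothetical, not from any real cluster):

```python
import subprocess
import threading


def build_tail_cmd(host: str, path: str) -> list[str]:
    """Build the ssh command that follows one remote log file."""
    # tail -F (capital F) survives log rotation, unlike -f
    return ["ssh", host, "tail", "-F", path]


def stream_host(host: str, path: str) -> None:
    """Follow one remote log and prefix each line with its host."""
    proc = subprocess.Popen(
        build_tail_cmd(host, path), stdout=subprocess.PIPE, text=True
    )
    for line in proc.stdout:
        print(f"{host}: {line}", end="")


def tail_cluster(hosts: list[str], path: str) -> None:
    """Tail the same log path on every host, interleaved, near real-time."""
    threads = [
        threading.Thread(target=stream_host, args=(h, path), daemon=True)
        for h in hosts
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Example (hypothetical hosts):
# tail_cluster(["app-01", "app-02", "app-03"], "/var/log/app/service.log")
```

Which works, but it's duct tape: you're reimplementing session management, reconnection, and backpressure yourself, which is exactly why it should be a first-class feature.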
I'm not sure that is the takeaway here. The processes evolved, but at AWS, centralized logging with CloudWatch Logs was becoming more common than having tools that ran commands on end hosts against locally stored ephemeral logs. Sometimes an agent would aggregate the data locally before publishing, but using tools like CloudWatch Logs Insights was distributed and fast enough (multiple gigabytes/sec) that no indexes were necessary to get quick answers to ad hoc questions.

My read is that they are not advocating for publishing debug logs, which gets expensive quickly. They are suggesting you have the ability to enable verbose logging if need be, for a temporary use case. The two types of logs they describe are service logs, which have one log entry per request (customer, request id, e2e latency, response code, etc.), and application logs, which go into more detail about what happened on a particular request. Service logs are less expensive to store, more information dense, and fast to query; they are what you primarily query for operational visibility. You query application logs to trace a specific request, and that is slower and more costly. For cost savings, I know of some teams that would reprocess application logs after x weeks to strip information that only provided short-term value.
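To make the distinction concrete, here's a sketch of what a one-line-per-request service log record might look like versus the verbose application log trail behind it (field names are illustrative, not any team's actual schema):

```python
import json
import time


def service_log_entry(request_id: str, customer: str,
                      status: int, started_at: float) -> dict:
    """One dense record per request: the fields you query for ops visibility."""
    return {
        "request_id": request_id,
        "customer": customer,
        "response_code": status,
        "e2e_latency_ms": round((time.time() - started_at) * 1000, 1),
    }


# Application logs, by contrast, would record every step taken while
# handling the same request_id in detail -- great for tracing one
# request, but far more expensive to store and query at scale.
entry = service_log_entry("req-123", "cust-42", 200, time.time() - 0.05)
print(json.dumps(entry))
```

The point is the asymmetry: the service log answers "how is the fleet doing?" cheaply, and you only pay the cost of digging through application logs when you need to reconstruct one specific request.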
Centralizing logs can be a HUGE help... You just have to use a tool that doesn't cost what Splunk charges. At the risk of over-selling, log-store.com caps charges at $4k/month! That's still a lot of money, but way less than what Splunk and other SaaS providers charge!
I just think it's pretty foolish. All the resources you own are out at the edge of your infrastructure. You might as well drop application logs right there and leave them there, because pushing your search predicate out to all the relevant machines exploits all that CPU and IO bandwidth you already paid for.
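Pushing the predicate out can be as simple as running the filter remotely and only shipping matches back over the wire. A rough sketch, assuming SSH access to the fleet (the hosts and log path are hypothetical):

```python
import shlex
import subprocess


def remote_grep_cmd(host: str, pattern: str, path: str) -> list[str]:
    """ssh to the host and let its own CPU and disk run the filter."""
    return ["ssh", host, f"grep -h {shlex.quote(pattern)} {shlex.quote(path)}"]


def search_cluster(hosts: list[str], pattern: str, path: str) -> list[str]:
    """Fan the predicate out; only matching lines cross the network."""
    matches = []
    for host in hosts:
        proc = subprocess.run(
            remote_grep_cmd(host, pattern, path),
            capture_output=True, text=True,
        )
        matches.extend(f"{host}: {line}" for line in proc.stdout.splitlines())
    return matches


# Example (hypothetical fleet):
# search_cluster(["app-01", "app-02"], "ERROR", "/var/log/app/service.log")
```

Compare that to centralizing: instead of each machine scanning its own few gigabytes in parallel for free, you pay to ship every log line across the network, then pay again to store and index it on hardware you had to buy for the purpose.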