A word of caution here: This is very impressive, but almost entirely wrong for your organisation.
Most log messages are useless 99.99% of the time. The best likely outcome is that it's turned into a metric. The once-in-a-blue-moon outcome is that it tells you what went wrong when something crashed.
Before you get to shipping _petabytes_ of logs, you really need to start thinking in metrics. Yes, you should log errors, you should also make sure they are stored centrally and are searchable.
But logs shouldn't be your primary source of data, metrics should be.
Things like connection time, upstream service count, memory usage, transactions per second, failed transactions, and upstream/downstream endpoint health should all be metrics emitted by your app (or hosting layer) directly. Don't try to derive them from structured logs. It's fragile, slow and fucking expensive.
Comparing, cutting and slicing metrics across processes or even services is simple; with logs it's not.
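To make "emit metrics directly" concrete, here's a minimal sketch (all names hypothetical) of an in-process registry the app updates itself, instead of deriving counts from log lines after the fact:

```python
import threading
from collections import defaultdict

class Metrics:
    """Minimal in-process metrics registry: counters and gauges,
    emitted directly by the app rather than parsed back out of logs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)
        self._gauges = {}

    def incr(self, name, value=1):
        # Increment a counter at the moment the app knows the fact.
        with self._lock:
            self._counters[name] += value

    def gauge(self, name, value):
        # Record a point-in-time value like connection count or memory.
        with self._lock:
            self._gauges[name] = value

    def snapshot(self):
        # In a real app this would be scraped or pushed to a TSDB.
        with self._lock:
            return dict(self._counters), dict(self._gauges)

metrics = Metrics()
metrics.incr("transactions_total")
metrics.incr("transactions_failed")
metrics.gauge("upstream_connections", 12)
```

A real setup would ship these to statsd/Prometheus-style tooling; the point is that the app increments a counter when it knows the fact, rather than someone regex-parsing it out of a log stream later.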
Metrics are only good when you can disregard some amount of errors without investigation. But they're a financial organization, they have a certain amount of liability. Generalized metrics won't help to understand what happened to that one particular transaction that failed in a cumbersome way and caused some money to disappear.
You can still have logs. What I'm suggesting is that vast amounts of unstructured logs are worse than useless.
Metrics tell you where and when something went wrong. Logs tell you why.
However, a logging framework, which is generally lossy and has the lowest priority in terms of deliverability, is not an audit mechanism, especially as ACLs and verifiability are mentioned nowhere. How do they prove that those logs originated from that machine?
If you're going to have an audit mechanism, some generic logging framework is almost certainly a bad fit.
> You can still have logs. What I'm suggesting is that vast amounts of unstructured logs are worse than useless
Until you need them; then you'd trade anything to get them. Logs are like backups: you don't need them most of the time, but when you do, you really need them.
On the flip side, the tendency is to over-log "just in case". A good compromise is to allocate a per-project storage budget for logs with log expiration, and let the ones close to the coal-face figure out how they use their allocation.
Even at very immature organizations, log data within a service is usually structured.
Even in my personal projects if I'm doing anything parallel structured logging is the first helper function I write. I don't think I'm unrepresentative here.
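For what it's worth, that first helper can be tiny. A hedged sketch of the sort of thing meant here (function names are made up), emitting one JSON object per line:

```python
import json
import sys
import time

def format_record(event, **fields):
    """Build one JSON object per log line: trivially machine-parseable,
    greppable, and safe to aggregate later."""
    return json.dumps({"ts": time.time(), "event": event, **fields}, default=str)

def log(event, **fields):
    # Write the structured record as a single line on stdout.
    sys.stdout.write(format_record(event, **fields) + "\n")

log("order_placed", order_id="1234", amount_cents=995)
```

Even in parallel code, each line stays a self-contained record, so interleaved output from many workers remains parseable.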
> Even at very immature organizations, log data within a service is usually structured.
Unless the framework provides it by default, I've never seen this actually happen in real life. Sure, I've seen a lot of custom telegraf configs, status endpoints and the like, but never actual working structured logging.
When I have seen structured logs, each team did it differently; the "ontology" was different. (Protip: if you're ever discussing ontology in logging then you might as well scream and run away.)
I suspect you and the parent are using different meanings of the word "structured". They're not totally random or they wouldn't be usable. It's a question of what the structuring principle is.
Am I crazy here? We run all of our app logs and error logs through Logstash and just have a few filters in there to normalize stuff like the timestamp. Honestly, the timestamp is the only piece of data that absolutely HAS to be standardized, because that's the piece of data that splits our log indexes, is the primary sorting mechanism, and determines at what point we roll up an index into some aggregates and then compress and send it to cold storage.
> "But they're a financial organization, they have a certain amount of liability."
In the loosest possible sense. Binance is an organization that pretended it doesn't have any physical location in any jurisdiction. Its founder is currently in jail in the United States.
It's always struck me that these are two wildly different concerns though.
Use metrics & SLOs to help diagnose the health of your systems. Derive those directly from logs/traces, keep a sample of the raw data, and now you can point any alert to the sampled data to help go about understanding a client-facing issue.
But, for auditing of a particular transaction, you don't need full indexing of the events? You need a transactional journal for every account/user, likely with a well-defined schema to describe successful changes and failed attempts. Perhaps these come from the same stream of data as the observability tooling, but I can only imagine it must be a much smaller subset of the 100PB that you can avoid doing full inverse indexes on this, because your search pattern is simply answering "what happened to this transaction?"
The reality is that when their service delays something they owe us tens to hundreds of thousands of dollars. This is the tool they’re using but if they can’t even get a precise notion of when a specific request arrived at their gateway they’re in trouble.
As an engineer I generally want logs so I can dive into problems that weren't anticipated. Debugging.
I get a lot of pushback from ops folks. They often don't have the same use case. The logs are for the things that'll be escalated beyond the ops folks to the people that wrote the bug.
Yes, most (> 99.99%) of them will never be looked at. But storage is supposed to be cheap, right? If we can waste bytes on loading a copy of Chromium for each desktop application, surely we can waste bytes on this.
My argument is completely orthogonal to "do we want to generate metrics from structured logs".
Most probably, said ops folks have quite a few war stories to share about logs.
Maybe a JVM-based app went haywire, producing 500GB of logs within 15 minutes, filling the disk, and breaking a critical system because no one anticipated that a disk could go from 75% free to 0% free in 15 minutes.
Maybe another JVM-based app went haywire inside a managed Kubernetes service, producing 4 terabytes of logs, and the company's Google Cloud monthly usage went from $5,000 to $15,000 because storing bytes is supposed to be cheap when they are bytes and not when they are terabytes.
I completely agree that logs are useful, but developers often do not consider what to log and when.
Check your company's cloud costs. I bet you the cost of keeping logs is at least 10%, maybe closer to 25% of the total cost.
Agreed, you need to engineer the logging system and not just pray. "The log service slowed down and our writes to it are synchronous" is one I've seen a few times.
On "do not consider what to log and when" .. I'm not saying don't think about it at all, but if I could anticipate bugs well enough to know exactly what I'll need to debug them, I'd just not write the bug.
Just saw this at work recently - 94% of log disk space for domain controllers was filled by logging what groups users were in (I don't know the specifics, but group membership is pretty static, and if a log-on fails I assume the missing group is logged as part of that failure message).
Sounds like really bad design choices here. #1, logs shouldn't go on the same machine that's running the app; they should be shipped to another server, and if you want local logs, then properly set up log rotation. Both would be good.
Something I’ve discovered is that Azure App Insights can capture memory snapshots when an exception happens. You can download these with a button press and open in Visual Studio with a double-click.
It’s magic!
The stack variables, other threads and most of the heap are right there, as if you had set a breakpoint and it was an interactive debug session.
IMHO this eliminates the need for 99% of the typical detailed tracing seen in large complex apps.
I simply doubt that most of these logs (or anyone’s, usually) are that useful.
I worked at a SaaS observability company (Datadog competitor) that was ingesting, IIRC, multiple GBps of metrics, spread across multiple regions, dozens upon dozens of cells, etc. Our log budget was 650 GB/day.
I have seen – entirely too many times – DEBUG logs running in prod endlessly, messages that are clearly INFO at best classified as ERROR, etc. Not to mention where a 3rd party library is spamming the same line continuously, and no one bothers to track down why and stop it.
You probably don't need full text search, but only exact match search and very efficient time-based retrieval of contiguous log fragments. As an engineer spending quite a lot of time debugging and reading logs, our OpenSearch has been almost useless for me (and a nightmare for our ops folks), since it can miss searches on terms like filenames, and the OSD UX is slow and generally unpleasant. I'd rather have 100MB of text logs downloaded locally.
Please enlighten me, what are use cases for real full-text search (with fuzzy matching, linguistic normalization etc.) in logs and similar machine-generated transactional data? I understand its use for dealing with human-written texts, but these are rarely in TB range, unless you are indexing the Web or logs of some large-scale communication platform.
I agree that fuzzy matching etc. are usually not needed, but in my experience I need at least substring match. A log message may say "XYZ failed for FOO id 123456789" and I want to be able to search logs for 123456789 to see all related information (+ trace id if available).
In systems that deal with asynchronous actions, log entries relating to "123456789" may be spread over minutes, hours or even days. When researching issues, I have found searches like Opensearch, Splunk etc. invaluable and think the additional cost is worth it. But we also don't have PB of logs to handle, so there may be a point where the cost is greater than the benefit.
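Exact substring match really is the workhorse for this access pattern. A sketch with hypothetical log lines, to show why no tokenizer or fuzzy matching is needed:

```python
def grep_logs(lines, needle):
    """Exact substring match over raw log lines. Unlike tokenized
    full-text search, this can't miss IDs embedded in filenames,
    messages, or other compound tokens."""
    return [line for line in lines if needle in line]

# Hypothetical log lines from an asynchronous workflow, spread over time.
logs = [
    "2024-05-01T10:00:00Z payment XYZ failed for FOO id 123456789",
    "2024-05-01T10:05:31Z retry scheduled for FOO id 123456789",
    "2024-05-01T10:05:32Z unrelated heartbeat",
]
matches = grep_logs(logs, "123456789")
```

Splunk/OpenSearch-style systems layer time-range pruning and indexing on top, but the user-facing contract that matters is this one: every line containing the ID, in order.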
My response to that would be that you can enable logging locally, or in your staging environment, but not in production. If an error occurs, your telemetry tooling should gather a stack trace and all related metadata, so you should be able to reproduce or at least locate the error.
But all other logs produced at runtime are breadcrumbs that are only ever useful when an exception occurs, anyway. Thus, you don’t need them otherwise.
Storage is not cheap at this scale. That would be 100s of thousands a year at the very least. (How do I know? I work in an identical area and have huge budget problems with rando verbose logging.)
My system has a version number and inputs + a known starting state DB-wise. Now, assuming I have deterministic, reproducible state, is a log just a replay of that game engine at work?
Interesting you should mention inputs. One of the things I’ve often found useful to log are the data that are inputs into a decision the code is going to make. This can be difficult to reconstruct after the fact, especially if there is a cache between my code and the source of truth.
> Most log messages are useless 99.99% of the time. The best likely outcome is that it's turned into a metric. The once-in-a-blue-moon outcome is that it tells you what went wrong when something crashed.
If it crashes, it's probably some scenario that was not properly handled. If it's not properly handled, it's also likely not properly logged. That's why you need verbose logs -- once in a blue moon you need to have the ability to retrospectively investigate something in the past that was not thought through, without using a time machine.
This is more common in the financial world where audit trail is required to be kept long term for regulation. Some auditor may ask you for proof that you have done a unit test for a function 3 years ago.
Every organization needs to find their balance between storage cost and quality of observability. I prefer to keep as much data as we are financially allowed. If Binance is happy to pay to store 100PB logs, good for them!
"Do we absolutely need this data or not" is a very tough question. Instead, I usually ask "how long do we need to keep this data" and apply proper retention policy. That's a much easier question to answer for everyone.
It is quite unlikely that a regulator will ask you for proof you have a unit test for anything (also, that's not what a unit test is - see [1] for a good summary of why).
It _is_ likely a regulator will ask you to prove that you are developing within the quality assurance framework you have claimed you are, though.
Finally though, logs are not an audit trail, and almost no-one can prove their logs are correct with respect to the state of the system at any given time.
> If it's not properly handled, it's also likely not properly logged
Then your blue-moon probability of it being useful rapidly drops. Verbose logs are simply a pain in the arse unless you have a massive processing system, but even then it either kneecaps your observation window or makes your queries take ages.
I am lucky enough to work at a place that has really ace logging capability, but, and I cannot stress this enough, it is colossally expensive. Literal billions.
But logging is not an audit trail. Even here, where we have fancy PII shields and stuff, logging doesn't have the SLA to record anything critical. If there is a capacity crunch, logging resolution gets turned down. Plus, logging anything of value to the system gets you a significant bollocking.
If you need something that you can hand to a government investigator, if you're pulling logs, you're already in deep shit. An audit framework needs to have a super high SLA, incredible durability and strong authentication for both people and services. All three of those things are generally foreign to logging systems.
Logging is useful, and you should log things, but you should not use it as a way to generate metrics. Verbose logs are just a really efficient way to burn through your infrastructure budget.
> Verbose logs are simply a pain in the arse unless you have a massive processing system, but even then it either kneecaps your observation window or makes your queries take ages.
Which is why this blog post brags about their capability. Technology advances, and something difficult to do today may not be as difficult tomorrow. If your logging infra is overwhelmed, by all means drop some data and protect the system. But if Binance is happily storing and querying their 100PB of logs now, that's their choice and it's totally fine. I won't say they are doing anything wrong. Again, we are talking about blue moon scenarios here, which is all about hedging risks and uncertainties. It's fine if Netflix drops a few frames of a movie, but my bank can't drop my transaction.
I think this works well if you think about sampling traces not logs.
Basically, every log message should be attached to a trace. Then, you might choose to throw away the trace data based on criteria, e.g. throw away 98% of "successful" traces, and 0% of "error" traces.
The (admittedly not particularly hard) challenge then is building the infra that knows how to essentially make one buffer per trace, and keep/discard collections of related logs as required.
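A rough sketch of that buffer-per-trace idea (class name and keep-rate are made up): buffer every log record under its trace id, then on completion keep all of an errored trace and only a sample of the successes:

```python
import random
from collections import defaultdict

class TraceBuffer:
    """Buffer log records per trace id; on trace completion, keep every
    errored trace in full and only a small sample of successful ones."""

    def __init__(self, success_keep_rate=0.02, rng=random.random):
        self.success_keep_rate = success_keep_rate
        self.rng = rng                      # injectable for testing
        self._buffers = defaultdict(list)   # trace_id -> pending records
        self.kept = []                      # records that survive sampling

    def log(self, trace_id, message):
        # Hold the record until we know how the trace ended.
        self._buffers[trace_id].append(message)

    def finish(self, trace_id, ok):
        # Tail-sampling decision: errors always kept, successes sampled.
        records = self._buffers.pop(trace_id, [])
        if not ok or self.rng() < self.success_keep_rate:
            self.kept.extend(records)

# Demo with a deterministic rng so successes are always discarded.
buf = TraceBuffer(success_keep_rate=0.0, rng=lambda: 1.0)
buf.log("t1", "starting payment")
buf.log("t1", "upstream timeout")
buf.finish("t1", ok=False)   # error trace: all its logs are kept
buf.log("t2", "starting payment")
buf.finish("t2", ok=True)    # successful trace: discarded here
```

This is essentially what tail-sampling collectors do; the sketch just shows why the buffering has to happen before the keep/discard decision.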
It sounds nice, but also consider: 1) depending on how your app crashes, are you sure the buffer will be flushed, and 2) if logging is expensive from a performance perspective, your base performance profile may be operating under the assumption that you’re humming along not logging anything. Some errors may beget more errors and have a snowball effect.
Both solved by having a sidecar (think of as a local ingestion point) that records everything (no waiting for flush on error), and then does tail sampling on the spans where status is non OK - i.e. everything thats non OK gets sent to Datadog, Baselime, your Grafana setup, your custom Clickhouse 100PB storage nodes. Or take your pick of any of 1000+ OpenTelemetry compatible providers. https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...
Hogwash. I'll agree that it's not as simple with logs, but they're amazingly powerful, and even more so with distributed tracing.
They both have their places and are both needed.
Without logs, I would not have been able to pinpoint multiple issues that plagued our systems. With logs, we were able to tell Google (Apigee) it was their problem, not ours. With tracing, we were able to tell a legacy team they had an issue and pinpoint it, after they had told us for 6 months that it was our fault. Without logging and tracing, we wouldn't have been able to tell our largest client that we never received a third of the requests they sent us, while our company was running around frantically.
They’re both needed, but for different things…ish.
You're missing my main point: logs should not be your primary source of information.
> Without logs, I would not have been able to pinpoint multiple issues that plagued our systems.
Logs are great for finding out what went wrong, but terrible at telling you there is a problem. This is what I mean by primary information source. If you are sifting through TBs of logs to pinpoint an issue, it sucks. Yes, there are tools, but it's still hard.
Logs are shit for deriving metrics; it usually requires some level of bespoke processing which is easy to break silently, especially for rarer messages.
> You're missing my main point: logs should not be your primary source of information.
I think you're missing my point. They're both needed. Metrics are outside blackbox and logs are inside -- they're both needed. I don't recall saying that logs should be the primary source.
> Logs are shit for deriving metrics, it usually requires some level of bespoke processing which is easy to break silently, especially for rarer messages.
Truthfully, you're probably just doing it wrong if you can't derive actionable metrics from logs / tracing. I'm willing to hear you out, though. Are you using structured logs? If so, please tell me more about how you're having issues deriving metrics from those. If not, that's your first problem.
> logs are great for finding out what went wrong, but terrible at telling there is a problem
> Truthfully, you're probably just doing it wrong if you can't derive actionable metrics from logs
I have ~200 services, each composed of many sub services, each made up of a number of processes. something like 150k processes.
Now, we are going to ship all those logs, where every transaction emits something like 500-2000 bytes of data. Storing that is easy; even storing it in a structured way is easy. Making sure we don't leak PII is a lot harder, so we have to have fairly strict ACLs.
Now, I want to process them to generate metrics and then display them. But that takes a lot of horsepower. Moreover, when I want metrics for more than a week or so, the amount of data I have to process grows linearly. I also need to back up that data and the derived metrics. We are looking at a large cluster just for processing.
Now, if we make sure that our services emit metrics for all useful things, the infra for recording, processing and displaying that is much smaller, maybe two/three instances. Not only that but custom queries are way quicker, and much more resistant to PII leaking. Just like structured logging, it does require some dev effort.
At no point is it _impossible_ to use logs as the data store/transport; it's just either fucking expensive, fragile, or dogshit slow.
or to put it another way:
old system == >£1million in licenses and servers (yearly)
metric system == £100k in licenses and servers + £12k for the metrics servers (yearly)
I would say from my experience, for _application logs_, it's the exact opposite. When you deal with a few GB/day of data, you want to have logs, and metrics can be derived from those logs.
Logs are expensive compared to metrics, but they convey a lot more information about the state of your system. You want to move towards metrics over time only one hotspot at a time to reduce cost while keeping observability of your overall system.
I'll take logs over metrics any day of the week, when cost isn't prohibitive.
I was at a large financial news site; they were a total Splunk shop. We had lots of real-steel machines shipping and chunking _loads_ of logs. Every team had a large screen showing off key metrics. Most of the time they were badly maintained and broken, so only the _really_ key metrics worked. Great for finding out what went wrong, terrible at alerting when it went wrong.
However, over the space of about three years we shifted organically over to Graphite+Grafana. There wasn't a top-down push, but once people realised how easy it was to make a dashboard, do templating and generally keep things working, they moved in droves. It also helped that people put metrics emission into the underlying hosting app library.
What really sealed the deal was the non-tech business owners making or updating dashboards. They managed to take pure tech metrics and turn them into service/business metrics.
It's fair that you had a different experience than I had. However, your experience seems to be very close to what I was describing. Cost got prohibitive (splunk), and you chose a different avenue. It's totally acceptable to do that, but your experience doesn't reflect mine, and I don't think I'm the exception.
I've used both grafana+metrics and logs to different degrees. I've enjoyed using both, but any system I work on starts with logs and gradually add metrics as needed, it feels like a natural evolution to me, and I've worked at different scale, like you.
I feel like I shouldn't need to mention this, but a news site and a financial exchange with money at stake are not comparable. If there is a glitch you need to be able to trace it back, and you can't do that with some abstracted metrics.
Yea, on a news site, the metrics are important. If suddenly you start seeing errors accrue above background noise and it's affecting a number of people you can act on it. If it's affecting one user, you probably don't give a shit.
In finance, if someone puts in an entry for 1,000,000,000 and it changes to 1,000,000, the SEC, fraud investigators, lawyers, banks, and some number of other FLAs are shining a flashlight up your butt as to what happened.
I'm not saying that you can't log. I'm saying that logging _everything_ on debug in an unstructured way and then hoping to divine a signal from it is madness. You will need logs, as they eventually tell you what went wrong. But they are very bad at telling you that something is going wrong now.
It's also exceptionally bad at letting you quickly pinpoint _when_ something changed.
Even in a logging-only environment, you get an alert, you look at the graphs, then dive into the logs. The big issue is that those metrics are out of date, hard to derive and prone to breaking when you make changes.
Verbose logging is not a protection in a financial market, because if something goes wrong you'll need to process those logs for consumption by a third party. You'll then have to explain why the format changed three times in the two weeks leading up to that event.
Moreover, you will need to separate the money audit trail from the verbose application logs, ideally at source. As it's "high value data", you can't be mixing those streams at all.
> Logs are expensive compared to metrics, but they convey a lot more information about the state of your system.
My experience has been kind of the opposite.
Yes, you can put more fields in a log, and you can nest stuff. In my experience, however, metrics tend to give me a clearer picture of the overall state (and behaviour) of my systems. I find them easier and faster to operate, easier to get an automatic chronology going, easier to alert on, etc.
Logs in my apps are mostly relegated to capturing warning and error states for debugging reference, as the metrics give us a quicker and easier indicator of issues.
I’m not well versed in QA/Sysadmin/Logs but surely metrics suffer from Simpson’s paradox compared to properly probed questions only answered through having access to the entirety of the logs?
If you average out metrics across all log files you’re potentially reaching false or worse inverse conclusions about multiple distinct subsets of the logs
It’s part of the reason why statisticians are so pedantic about the wording of their conclusions and to which subpopulation their conclusions actually apply to
When performing forensic analysis, metrics don't usually help that much. I'd rather sift 2PB of logs, knowing that information I'm looking for is in there, than sit at the usual "2 weeks of nginx access logs which roll over".
Obviously running everything with debug logging just burns through money, but having decent logs can help a lot other teams, not just the ones working on the project (developers, sysadmins, etc.)
Metrics are useful when you know what to measure, which implies that you already have a good idea for what can go wrong. If your entire product exists in some cloud servers that you fully control, that's probably feasible. Binance probably could have done something more elegant than storing extraordinary amounts of logs.
However, if you're selling a physical product, and/or a service that integrates deeply with third party products/services, it becomes a lot more difficult to determine what's even worth measuring. A conservative approach to metrics collection will limit the usefulness of the metrics, for obvious reasons. A "kitchen sink" approach will take you right back to the same "data volume" problem you had with logs, but now your developers have to deal with more friction when creating diagnostics. Neither extreme is desirable, and finding the middle ground would require information that you simply don't have.
On a related note, one approach I've found useful (at a certain scale) is to shove metrics inside of the logs themselves. Put a machine-readable suffix on your human-readable log messages. The resulting system requires no more infrastructure than what your logs are already using, and you get a reliable timeline of when certain metrics appear vs. when certain log messages appear.
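A hedged sketch of that suffix trick (the delimiter and field names are invented for illustration): the line stays human-readable, and the metrics ride along in a parseable tail:

```python
import json

def log_line(message, **metrics):
    """Human-readable message plus a machine-readable JSON suffix,
    so metrics travel in the same stream as the logs themselves."""
    return f"{message} |metrics| {json.dumps(metrics, sort_keys=True)}"

def parse_metrics(line):
    """Recover the metrics dict from a line produced by log_line();
    lines without a suffix yield an empty dict."""
    _, _, suffix = line.partition(" |metrics| ")
    return json.loads(suffix) if suffix else {}

line = log_line("batch committed", rows=1500, elapsed_ms=42)
```

Because metrics and messages share one stream, their relative ordering is preserved for free, which is exactly the timeline property the comment above is after.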
I'm trying to offer the perspective of someone who works with products that don't exist entirely on a server. If your product is a web service, the following might not apply to you.
IME creating diagnostic systems for various IoT and industrial devices, the "natural" stuff is relatively easy to implement (battery level, RSSI, connection state, etc) but it's rarely informative. In other words, it doesn't meaningfully correlate with the health of the system unless failure is already imminent.
It's the obscure stuff that tends to be informative (routing table state, delivery ratio, etc). But, complex metrics demand a greater engineering lift in their development and testing. There's also a non-trivial amount of effort involved in developing tools to interpret the resulting data.
Even if natural and informative were tightly correlated, which they aren't, an informative metric isn't necessarily actionable. You have to be able to use the data to improve your product. I can't charge the battery in a customer's device for them. I also can't move their phone closer to a cell tower. If you can't act on a metric, you're just wasting your time.
Fine, but I'm now wondering what sort of "data" is going to help you "charge the battery in a customer's device for them [or] move their phone closer to a cell tower."
A natural metric for a distributed system is connectivity (or conversely partition detection). A metric on connectivity is informative. Can the information help you heal the partition? Maybe, maybe not. Time to hit the logs and see why the partition occurred and if an actionable remedy is possible.
(I'm trying to understand your pov btw, so clarify as you will.)
> I'm now wondering what sort of "data" is going to help you "charge the battery in a customer's device for them [or] move their phone closer to a cell tower."
None. The idea is that you have to think about what you'd actually do with that data once you've collected it. If it's something that far-fetched, it isn't worth collecting. (This philosophy is also convenient for GDPR reasons.)
Distributed systems are one place where metrics can be genuinely useful. They can be good at reducing the complexity of a bunch of interacting nodes down to something a bit more digestible. Distributed systems have their own fascinating technical challenges. One of the less-fascinating difficulties is that you're at the mercy of your client's IT. If they don't want their devices phoning home, you don't get real-time metrics. You might be able to store some stuff for offline diagnostic purposes, but other practical limits arise from there.
How do you detect partitions? You could have each device periodically record a snapshot of its routing table, but then if you wanted to identify the partition, you'd have to go fetch data from each node individually. So, maybe you have them share their routing tables with each other, thereby allowing the partition detection to happen on the fly. That's great, but now you're using precious, precious bandwidth hurling around diagnostic data that you might not even be able to access in practice. There's really no right answer here.
When you have metrics, you should also keep sampled logs.
Ie. 1 per million log entries is kept. Write some rules to try and keep more of the more interesting ones.
One way to do this is to have your logging macro include the source file and line number the logline came from, and then, for each file and line number emit/store no more than 1 logline per minute.
That way you get detailed records of rare events, while filtering most of the noise.
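A minimal sketch of that per-callsite throttle (the interface is invented for illustration); a real logging macro would capture the file and line automatically rather than taking them as arguments:

```python
import time

class CallsiteLimiter:
    """Allow at most one log line per (file, line) callsite per interval.
    Rare events always get through; hot loops get squelched."""

    def __init__(self, interval_s=60.0, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock       # injectable for testing
        self._last = {}          # (file, line) -> last emission time

    def allow(self, filename, lineno):
        key = (filename, lineno)
        now = self.clock()
        last = self._last.get(key)
        if last is None or now - last >= self.interval_s:
            self._last[key] = now
            return True
        return False

# Demo with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
limiter = CallsiteLimiter(interval_s=60.0, clock=lambda: fake_now[0])
first = limiter.allow("worker.py", 42)    # rare event: allowed
second = limiter.allow("worker.py", 42)   # hot loop: squelched
fake_now[0] = 61.0
third = limiter.allow("worker.py", 42)    # interval elapsed: allowed again
```

The state is just one timestamp per callsite, so the memory cost is bounded by the number of distinct log statements in the codebase, not by log volume.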
There are also different types of logs. Maybe you want every transaction action but don't need a full fidelity copy of every load balancer ping from the last ten years.
I’ve got to disagree here. Especially with memoization and streaming, deriving metrics from structured logs is extremely flexible, relatively fast, and can be configured to be as cheap as you need it to be. With streaming you can literally run your workload on a raspberry pi. Granted, you need to write the code to do so yourself; most off-the-shelf services probably are expensive.
Memoization isn't free in logs: you're basically deduping an unbounded queue, and it's difficult to scale beyond one machine. It's both CPU and memory heavy. I mean, sure, you can use Scuba, which is great, but that's basically a database made to look like a log store.
> deriving metrics from structured logs is extremely flexible
Assuming you can actually generate structured logs reliably. But even if you do, it's really easy to silently break.
> With streaming you can literally run your workload on a raspberry pi
No, you really can't. Streaming logs to a centralised place is exceptionally IO heavy. If you want to generate metrics from it, it's CPU heavy as well. If you need speed, then you'll also need lots of RAM; otherwise searching your logs will cause logging to stop (either because you've run out of CPU, or because you've just caused the VFS cache to drop by suddenly doing unpredictable IO).
Graylog exists for streaming logs; hell, even rsyslog does it. Transporting logs is fairly simple; storing and generating signal from them is very much not.
> Most log messages are useless 99.99% of the time.
Things are useless until the first crash happens. The same applies to replication: you don't need replication until your servers start crashing.
> But logs shouldn't be your primary source of data, metrics should be.
There are different types of data related to the product:
* product data - what's in your db
* logs - human readable details of a journey for a single request
* metrics - approximate health state of the overall system, where storing high-cardinality values is bad (e.g. customer_uuid)
* traces - approximate details of a single request to be able to analyze request journey through systems, where storing high cardinality values might still be bad.
Logs are useful, but costly, just like everything else that makes a system more reliable.
To be clear, I'm talking below about application/system logs, not about "our event sourcing uses log storage".
Yes, you probably don't want to store debug logs of 2 years ago, but logs and metrics solve very different problems.
Logs need a defined lifecycle, e.g. the most detailed logs are kept for 7/14/30/release-cadence days, then discarded. When you need to troubleshoot something, metrics give you the signal, but logs give you information about what was actually going on.
> Most log messages are useless 99.99% of the time. Best likely outcome is that its turned into a metric. The once in the blue moon outcome is that it tells you what went wrong when something crashed.
I wonder whether it would be better to just keep the timestamps of each unique textual log entry in a more efficient table? Or rather of each log entry text template, with the arguments stored separately.
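The template/arguments split above could be sketched roughly like this (structure and names are illustrative, not any particular log store's design): store each template once, and per template only a list of (timestamp, args) tuples, from which the original line can be reconstructed on demand.

```python
from collections import defaultdict

# Hypothetical sketch: dedupe log storage by message template.
# Instead of storing "user 42 logged in" thousands of times, store the
# template once and keep (timestamp, args) tuples per template.

store = defaultdict(list)  # template -> [(timestamp, args), ...]

def log(timestamp, template, *args):
    store[template].append((timestamp, args))

log(1000, "user %s logged in", "42")
log(1001, "user %s logged in", "97")
log(1002, "disk %s is %d%% full", "sda1", 93)

# Reconstruct a full log line on demand:
ts, args = store["user %s logged in"][1]
print(ts, "user %s logged in" % args)  # 1001 user 97 logged in
```

The repeated text compresses to nothing, and counting occurrences of a template over time is basically a metric for free.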
"Once in a blue moon" -- you mean the thing that constantly happens? If you're not using logs, you're not practicing engineering. Metrics can't really diagnose problems.
It's also a lot easier to inspect a log stream that maps to an alert with a trace id than it is to assemble a pile of metrics for each user action.
I think the above comment is just saying that you shouldn't use logs to do the job of metrics. Like, if you have an alert that goes off when some HTTP server is sending lots of 5xx, that shouldn't rely on parsing logs.
> But logs shouldn't be your primary source of data, metrics should be.
Metrics, logs, relational data, KVs, indexes, flat files, etc. are all equally valid forms of data for different shapes of data and different access patterns. If you are building for a one-size-fits-all database you are in for a nasty surprise.
With logs you can get an idea of what events happened in what order during some complex process stretched over a long timeframe, and so on. I don't think you can do that with a metric.
> With logs you can get an idea of what events happened in what order
Again, if you're at that point, you need logs. But that's never going to be your primary source of information. If you have more than a few services running at many transactions a second, you can't scale that kind of understanding using logs.
This is my point: if you have >100 services, each with many tens or hundreds of processes, your primary alert that something has gone wrong (well, it shouldn't be primary, you also need pre-SLA-fuckup alerts) is something breaching an SLA. That's almost certainly a metric. Using logs to derive that metric means you have a latency of 60-1500 seconds.
Getting your apps to emit metrics directly means that you are able to make things much more observable. It also forces your devs to think about _how_ their app is observed.
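A minimal in-process sketch of what "emit metrics directly" means, using hand-rolled counters rather than any particular metrics library (a real app would use something like a Prometheus client and have the counters scraped by the hosting layer):

```python
# Minimal sketch of an app emitting metrics directly: counters kept
# in-process and updated on the hot path, to be scraped or flushed by
# the hosting layer, instead of being parsed out of log lines later.

class Counter:
    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        self.value += n

transactions = Counter()
failed_transactions = Counter()

def handle_transaction(ok):
    # Business logic would go here; the metric update is a cheap increment.
    transactions.inc()
    if not ok:
        failed_transactions.inc()

for outcome in (True, True, False, True):
    handle_transaction(outcome)

print(transactions.value, failed_transactions.value)  # 4 1
```

The increment costs nanoseconds and the scrape is a handful of integers, versus shipping and re-parsing a log line per transaction to arrive at the same two numbers.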
I would note that a notional "log store", doesn't have to just be used for things that are literally "logs."
You know what else you could call a log store? A CQRS/ES event store.
(Specifically, a "log store" is a CQRS/ES event store that just so happens to also remember a primary-source textual representation for each structured event-document it ingests — i.e. the original "log line" — so that it can spit "log lines" back out unchanged from their input form when asked. But it might not even have this feature, if it's a structured log store that expects all "log lines" to be "structured logging" formatted, JSON, etc.)
And you know what the most important operation a CQRS/ES event store performs is? A continuous streaming-reduction over particular filtered subsets of the events, to compute CQRS "aggregates" (= live snapshot states / incremental state deltas, which you then continuously load into a data warehouse to power the "query" part of CQRS.)
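That filtered streaming-reduction can be sketched in a few lines. This is an illustrative toy, not any real event store's API: filter the event stream down to one aggregate's events, then fold them into a snapshot state.

```python
# Hypothetical sketch of a CQRS/ES-style reduction: filter an event
# stream by aggregate, then fold the matching events into a live
# snapshot state (here, an account balance), the way a continuous
# reducer would feed the query side.

events = [
    {"type": "deposit", "account": "a1", "amount": 100},
    {"type": "login", "account": "a1"},               # irrelevant event type
    {"type": "withdraw", "account": "a1", "amount": 30},
    {"type": "deposit", "account": "a2", "amount": 50},
]

def reduce_balance(events, account):
    balance = 0
    for e in events:  # in a real store this is an incremental subscription
        if e.get("account") != account:
            continue
        if e["type"] == "deposit":
            balance += e["amount"]
        elif e["type"] == "withdraw":
            balance -= e["amount"]
    return balance

print(reduce_balance(events, "a1"))  # 70
```

The backend question in the bullets below is exactly about who pays for that `continue`: whether non-matching events are skipped by an index, or read, materialized, and discarded.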
Most CQRS/ES event stores are built atop message queues (like Kafka), or row-stores (like Postgres). But neither are actually very good backends for powering the "ad-hoc-filtered incremental large-batch streaming" operation.
• With an MQ backend, streaming is easy, but MQs maintain no indices for events per se, just copies of events in different topics; so filtered streaming would either have the filtering occur mostly client-side; or would involve a bolt-on component that is its own "client-side", ala Kafka Streams. You can use topics for this — but only if you know exactly what reduction event-type-sets you'll need before you start publishing any events. Or if you're willing to keep an archival topic of every-event-ever online, so that you can stream over it to retroactively build new filtered topics.
• With a row-store backend, filtered streaming without pre-indexing is tenable — it's a query plan consisting of a primary-key-index-directed seq scan with a filter node. But it's still a lot more expensive than it'd be to just be streaming through a flat file containing the same data, since a seq scan is going to be reading+materializing+discarding all the rows that don't match the filtering rule. You can create (partial!) indices to avoid this — and nicely-enough, in a row-store, you can do this retroactively, once you figure out what the needs of a given reduction job are. But it's still a DBA task rather than a dev task — the data warehouse needs to be tweaked to respond to the needs of the app, every time the needs of the app change. (I would also mention something about schema flexibility here, but Postgres has a JSON column type, and I presume CQRS/ES event-store backends would just use that.)
A CQRS/ES event store built atop a fully-indexed document store / "index store" like ElasticSearch (or Quickwit, apparently) would have all the same advantages of the RDBMS approach, but wouldn't require any manual index creation.
Such a store would perform as if you took the RDBMS version of the solution, and then wrote a little insert-trigger stored-procedure that reads the JSON documents out of each row, finds any novel keys in them, and creates a new partial index for each such novel key. (Except with much lower storage-overhead — because in an "index store" all the indices share data; and much better ability to combine use of multiple "indices", as in an "index store" these are often not actually separate indices at all, but just one index where the key is part of the index.)
---
That being said, you know what you can use the CQRS/ES model for? Reducing your literal "logs" into metrics, as a continuous write-through reduction — to allow your platform to write log events, but have its associated observability platform read back pre-aggregated metrics time-series data, rather than having to crunch over logs itself at query time.
And AFAIK, this "modelling of log messages as CQRS/ES events in a CQRS/ES event store, so that you can do CQRS/ES reductions to them to compute metrics as aggregates" approach is already widely in use — but just not much talked about.
For example, when you use Google Cloud Logging, Google seems to be shoving your log messages into something approximating an event-store — and specifically, one with exactly the filtered-streaming-cost semantics of an "index store" like ElasticSearch (even though they're actually probably using a structured column-store architecture, i.e. "BigTable but append-only and therefore serverless.") And this event store then powers Cloud Logging's "logs-based metrics" reductions (https://cloud.google.com/logging/docs/logs-based-metrics).
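A toy version of such a write-through logs-to-metrics reduction (names and shapes are made up, not Cloud Logging's actual model): each ingested log event increments a pre-aggregated per-minute bucket, so the observability side reads a tiny time series instead of crunching raw logs at query time.

```python
from collections import Counter

# Hypothetical sketch of a write-through logs-to-metrics reduction:
# each log event updates a pre-aggregated per-minute 5xx counter at
# ingest time, so queries read the counters, never the raw logs.

per_minute_5xx = Counter()

def ingest(log_event):
    if log_event["status"] >= 500:
        minute = log_event["ts"] // 60  # bucket timestamps into minutes
        per_minute_5xx[minute] += 1

for ev in [
    {"ts": 60, "status": 500},
    {"ts": 65, "status": 200},
    {"ts": 70, "status": 503},
    {"ts": 130, "status": 500},
]:
    ingest(ev)

print(dict(per_minute_5xx))  # {1: 2, 2: 1}
```

The raw events can still be retained for the "why" question; the point is that the "where and when" query never has to touch them.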