Logging Sucks - Your Logs Are Lying To You
55 points by pondidum
See also: A Practitioner's Guide to Wide Events, or really just the Observability Engineering book. Charity Majors has been beating this drum for years, and she has skin in the game as the CTO of Honeycomb.
I’ve watched some of her YouTube talks (which is super unusual for me), Charity is super sharp and really changed how I look at logging and observability.
I’d be a Honeycomb customer if they had a deployment at the far end of the internet (Australia).
If the whole point of having a wide event is to put everything relevant that happened to a request into every log statement to get all the context, why is this not solved by having a request ID so different log statements can be linked?
One of the difficulties with linked log lines is trying to query whether something is missing for a given request.
Asking your log system for a request that doesn't contain a specific message/line is hard, but querying a single wide event for whether a property is false or undefined is easy.
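A toy illustration of that asymmetry in Python (field names like `cache_hit` are invented for the example):

```python
# Wide events: one dict per request, so "is this property missing or
# false for some request?" is a single filter.
wide_events = [
    {"request_id": "a", "cache_hit": True, "duration_ms": 12},
    {"request_id": "b", "duration_ms": 340},  # cache_hit never set
]
no_cache = [e for e in wide_events if not e.get("cache_hit")]

# Linked log lines: the same question becomes an anti-join -- collect
# every line, group by request_id, and look for ids where a given
# line never appeared.
log_lines = [
    {"request_id": "a", "msg": "cache hit"},
    {"request_id": "a", "msg": "responded"},
    {"request_id": "b", "msg": "responded"},
]
hit_ids = {l["request_id"] for l in log_lines if l["msg"] == "cache hit"}
all_ids = {l["request_id"] for l in log_lines}
no_cache_ids = all_ids - hit_ids
```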
A request ID helps with correlating related logs, but doesn’t result in a wide event. You still need to add appropriate attributes. Canonical log lines greatly help, too.
Note that this is also useful outside of request contexts. For example, we emit canonical logs for all BullMQ consumer executions and Temporal activity executions.
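For illustration, here is roughly how a canonical log line can be accumulated and emitted once per unit of work. This is a sketch with invented names, not Honeycomb's or any library's actual API:

```python
import json
import time

class CanonicalLog:
    """Accumulates request context; emitted exactly once at the end."""

    def __init__(self, **base):
        self._start = time.time()
        self.event = dict(base)

    def add(self, **attrs):
        self.event.update(attrs)

    def emit(self):
        self.event["duration_ms"] = round((time.time() - self._start) * 1000, 2)
        print(json.dumps(self.event))

# Hypothetical usage in a request handler (or a BullMQ consumer /
# Temporal activity wrapper):
log = CanonicalLog(request_id="req-123", route="/checkout")
log.add(user_id=42, cart_items=3)
log.add(payment_provider="stripe", payment_ok=True)
log.emit()  # one wide event carrying all of the above
```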
tl;dr: use "wide events", i.e. put all the request context into one JSON object.
JSON is an implementation detail; focusing on it is missing the forest for the trees.
The key point is that all of the logged information for a request is logged in a single event structure (which, sure, could be serialised to JSON), which will then be retained or discarded according to a rules engine.
Note that this technique can extend traditional line-based logging: it’s entirely possible for one of the items in the structure to be a time-ordered sequence of log lines, e.g. for printf-style debugging.
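Concretely, the embedded trail can just be an array field on the event (shape invented for illustration):

```python
event = {
    "request_id": "req-123",
    "status": 500,
    # printf-style debugging preserved inside the wide event,
    # with per-line offsets since the start of the request
    "lines": [
        {"t": 0.001, "msg": "loaded user 42"},
        {"t": 0.087, "msg": "payment provider timeout"},
    ],
}
```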
The advice from the post (use structured logging, wide events, and tail sampling) sounds good – I'll give it a try next time this topic comes up for me.
But also, I'm still not sure whether "single wide events" are better than "multiple events for the same request, linked by some request id". I think the latter approach could be easier for adding new log output, and it should also solve the "the process crashed before it could emit the single wide event" problem. But it also requires much more elaborate log querying tools (to link log events together again, effectively recreating the single wide event at querying time); and tail sampling of a log event is probably much more difficult if the final result of the request is not known yet.
Do you have experience with these two approaches? Is this a common tradeoff when doing logging, or did I overlook something that makes the latter approach totally pointless?
Wide events sound great unless you have a tech stack that makes you get OOM killed every once in a while. Then you have zero clue what happened - not even a partial stream of log messages, which is probably about the best you can hope for. Traces are probably also emitted at the end of the request cycle, so those are gone too.
Maybe you could recover them by emitting the trace parts into a temporary log file as you go; at the end of the request cycle you log the wide event to the main log and delete the temporary file. If the process gets OOM killed, the supervisor restarts it, and on startup it gathers up any leftover temporary log files, logs them as wide events with a field marking them as incomplete due to a process crash, and then deletes them.
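A rough sketch of that scheme in Python (the spool path and field names are invented):

```python
import glob
import json
import os
import uuid

SPOOL = "/var/tmp/wide-event-spool"  # invented location

def start_request():
    os.makedirs(SPOOL, exist_ok=True)
    return os.path.join(SPOOL, f"{uuid.uuid4()}.jsonl")

def record(tmp_path, **attrs):
    # Append each trace part as it happens, so a crash loses at most
    # the final assembly step, not the whole trail.
    with open(tmp_path, "a") as f:
        f.write(json.dumps(attrs) + "\n")

def _assemble(path):
    with open(path) as f:
        parts = [json.loads(line) for line in f]
    return {k: v for part in parts for k, v in part.items()}

def finish(tmp_path):
    print(json.dumps(_assemble(tmp_path)))  # the normal wide event
    os.remove(tmp_path)

def recover_on_startup():
    # Anything still in the spool means a crash mid-request.
    for path in glob.glob(os.path.join(SPOOL, "*.jsonl")):
        event = _assemble(path)
        event["incomplete"] = "process_crash"
        print(json.dumps(event))
        os.remove(path)
```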
Or do your log analytics and debugging with an OLAP database that understands request IDs and transparently stitches together multiple log messages into one wide event when you query it.
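The stitching itself can be as simple as a group-by on the request ID; a toy version (a real OLAP store would do this inside its query engine):

```python
from collections import defaultdict

def stitch(log_lines):
    # Merge every line sharing a request_id into one synthetic wide event.
    events = defaultdict(dict)
    for line in log_lines:
        events[line["request_id"]].update(line)
    return list(events.values())
```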
This is a good post, and I have subscribed to the idea of wide events for a while now. But looking at my $DAYJOB as a systems engineer/admin, it only applies to the things I/we develop ourselves. How do we improve the logging stack for the services I care deeply about but have no control over?
One thing I really liked in a past system I worked on was having a trace buffer. Every step in a request logged a message to the trace buffer; but the trace buffer was ONLY emitted if there was an error. Otherwise it would have been far too much log data to emit. Since the system was multithreaded and had interesting issues around lock contention, queues, RPC calls that could fail in weird ways, etc., this was very valuable for debugging after an error. I haven't seen it since.
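Something in that spirit, sketched in Python with a thread-local buffer (all names invented):

```python
import threading

_buf = threading.local()

def trace(msg):
    # Cheap in the happy path: just append to an in-memory list.
    if not hasattr(_buf, "lines"):
        _buf.lines = []
    _buf.lines.append(msg)

def end_request(error=None):
    # Emit the buffered trail only when something went wrong;
    # successful requests discard it, keeping log volume sane.
    if error is not None:
        for line in getattr(_buf, "lines", []):
            print(f"TRACE {line}")
        print(f"ERROR {error}")
    _buf.lines = []
```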
I haven't read anything in a long time that I agree with as much as this.
The AI-style writing with bullet points is very grating to my eyes, which is unfortunate because someone clearly put a lot of effort into this article. That said, I did become convinced that wide events are a good idea; it's very elegant.
I think consciously avoiding AI style should become a prerequisite for quality writing. I don’t want to spend time guessing whether the author is human or not. It would rule out a lot of articles that would have been great if they were written 5 years ago, but I don’t think that’s a big deal. The space of possible writing is so huge that adding constraints like this does not meaningfully limit it. We do this all the time, e.g. changing text to fit a page or typeset better. If in another 5 years AI writing is truly indistinguishable from another unique human voice then that will be different, but for now unedited AI output is easy to spot conservatively (lots of false negatives but few false positives), and has become super grating. I enjoyed a recent NYT article on this, “Why Does A.I. Write Like … That?” by Sam Kriss.
I absolutely get where you’re coming from, but as someone who has used em and en dashes in my writing for years it hurts—my love of punctuation is under attack!
Yeah I still use them too. I don’t think we have to give them up entirely. I would describe AI writing style in two ways. The first is superficial: overuse of em dashes, boldface, emojis, subheadings, lists, and padding/filler text that adds no information. Obviously lots of great blog posts use lists but all these attributes together raise my suspicions (and some I just subjectively dislike even when written by humans). The second is more the AI “voice” that leaks through even if you get it to avoid those superficial things. This is the more grating part for me.
I don't think you even need to change your writing style, just add a disclaimer about your (lack of) use of AI, e.g. using https://notbyai.fyi/
Then I don't need to second guess based on some fuzzy heuristics.
First of all, I really like the interactive design of the post, it's definitely helpful to understand the ideas discussed in the post. More articles should be like that.
One aspect of wide events which isn't clear to me: how does this deal with log events which are issued in a loop? Say, one inbound request triggers ten outbound requests (queries, etc.) in a loop. With an individual event per log call, each one would be another log line, which, for instance, makes it easy to query for slow outbound requests.
Whereas with a wide event, I'd have a set of repeatable columns, which is much less practical to query, in particular if the cardinality of that set isn't known (one inbound request could trigger one outbound request, or five, or ten, etc.). How do people deal with that in practice?
I don't think you can compress that information, it's like a join table in SQL; you're bound to have a lot of repeated "foreign" IDs and little actual content change. But to be able to query (for example, by request ID and ordered by timestamp), you still need all the foreign context.
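One shape I've seen (an assumption, not something the post prescribes) is to keep the repeated calls as an array of sub-records on the wide event and flatten at query time:

```python
event = {
    "request_id": "req-123",
    "outbound": [
        {"target": "inventory", "duration_ms": 12},
        {"target": "pricing", "duration_ms": 480},
        {"target": "pricing", "duration_ms": 9},
    ],
}

# "Show me slow outbound requests" then means flattening the array:
slow = [
    (event["request_id"], call)
    for call in event["outbound"]
    if call["duration_ms"] > 100
]
```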
Wow, I wrote about this recently in the context of Ruby's logger gem missing an API for this.
I was afraid there would be no cost consideration but it is addressed at the end with tail sampling! Very good baseline article for the why of wide events and I think it also has good business arguments here and there such that it's possible to make a case to the business.
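For anyone unfamiliar, tail sampling just means deciding whether to keep an event after its outcome is known; a toy decision function (thresholds invented):

```python
import random

def keep(event):
    # Always keep errors and slow requests; sample the boring rest.
    if event.get("error") or event.get("duration_ms", 0) > 1000:
        return True
    return random.random() < 0.01  # retain ~1% of successes
```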
Good one. I believe there are different needs for logging, from low to higher level: dev debugging, tracing, investigation, monitoring[1], and analysis. The article is right on about the last one.
[1] Really this ad hoc monitoring should only be done via metrics, or a log -> metric conversion. I can't count the number of times "is the log flowing?"-style questions that shouldn't be answered by watching the logs were answered exactly that way.
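A log -> metric conversion can be as thin as counting matching events as they stream past; a sketch using collections.Counter as a stand-in for a real metrics client:

```python
from collections import Counter

counts = Counter()

def on_log_event(event):
    # Answer "is the log flowing?" from a counter, not a log search.
    counts[("events_total", event.get("service", "unknown"))] += 1
```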