Logging Sucks - Your Logs Are Lying To You

55 points by pondidum


quad

See also: A Practitioner's Guide to Wide Events, or really just the Observability Engineering book. Charity Majors has been beating this drum for years, and she has skin in the game as the CTO of Honeycomb.

apromixately

If the whole point of having a wide event is to put everything relevant that happened to a request into every log statement to get all the context, why is this not solved by having a request ID so different log statements can be linked?

ocramz

tl;dr: use "wide events", i.e. put all request context in the JSON object
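For anyone who wants to see the shape of it, here is a minimal sketch of a wide event, assuming a single-process request handler. The `WideEvent` class, the `handle_checkout` handler, and all field names are made up for illustration and are not taken from the article.

```python
import json
import time


class WideEvent:
    """Collects context for one request and emits it as a single JSON line."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Attach more context as the request moves through the code.
        self.fields.update(fields)

    def emit(self):
        # One log line per request, written once, at the end.
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        line = json.dumps(self.fields, sort_keys=True)
        print(line)
        return line


def handle_checkout(request_id, user_id, cart_total):
    # Hypothetical handler; the route and fields are assumptions.
    event = WideEvent(request_id=request_id, route="/checkout")
    try:
        event.add(user_id=user_id, cart_total=cart_total)
        if cart_total <= 0:
            raise ValueError("empty cart")
        event.add(status=200)
    except ValueError as exc:
        event.add(status=400, error=str(exc))
    finally:
        # Emit even if an unexpected exception is propagating.
        line = event.emit()
    return line
```

Calling `handle_checkout("req-1", "u-42", 19.99)` prints one JSON object that carries the request ID, route, user, cart total, status, and duration together, instead of several separate log lines.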

thangalin

On a related note:

https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/

oger

The advice from the post (use structured logging, wide events, and tail sampling) sounds good – I'll give it a try next time this topic comes up for me.

But also, I'm still not sure whether "single wide events" are better than "multiple events for the same request, linked by some request id". I think the latter approach could be easier for adding new log output, and it should also solve the "the process crashed before it could emit the single wide event" problem. But it also requires much more elaborate log querying tools (to link log events together again, effectively recreating the single wide event at querying time); and tail sampling of a log event is probably much more difficult if the final result of the request is not known yet.
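To make the second approach concrete, here is a rough sketch of what the querying side would have to do: stitch events with the same request ID back into one record, and only then decide what to keep. The function names, the `status` field, and the sampling rule are my own assumptions, not anything from the post.

```python
from collections import defaultdict


def stitch(events):
    """Merge events sharing a request_id into one record per request.

    Later events overwrite earlier ones when they reuse a key.
    """
    merged = defaultdict(dict)
    for event in events:
        merged[event["request_id"]].update(event)
    return dict(merged)


def tail_sample(records, keep_every=100):
    """Keep failed and unfinished requests, plus every Nth successful one."""
    kept = []
    ok_seen = 0
    for record in records.values():
        status = record.get("status")
        if status is None or status >= 500:
            # No status means no final event arrived (e.g. the process died).
            kept.append(record)
        else:
            ok_seen += 1
            if ok_seen % keep_every == 0:
                kept.append(record)
    return kept
```

Note that `tail_sample` can only run after `stitch`, which is the difficulty mentioned above: the sampling decision has to wait until the outcome of the request is known, or until you give up waiting for it.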

Do you have experience with these two approaches? Is this a common tradeoff when doing logging, or did I overlook something that makes the latter approach totally pointless?

strugee

Wide events sound great unless you have a tech stack that makes you get OOM killed every once in a while. Then you have zero clue what happened - not even a partial stream of log messages, which is probably about the best you can hope for. Traces are probably also emitted at the end of the request cycle, so those are gone too.

fbegyn

This is a good post, and I have subscribed to the idea of wide events for a while now. But looking at my $DAYJOB as a systems engineer/admin, this only applies to things I/we develop ourselves. How do we improve the logging stack for the services that I have no control over but care deeply about?

df

One thing I really liked in a past system I worked on was having a trace buffer. Every step in a request logged a message to the trace buffer; but the trace buffer was ONLY emitted if there was an error. Otherwise it would have been far too much log data to emit. Since the system was multithreaded and had interesting issues around lock contention, queues, RPC calls that could fail in weird ways, etc., this was very valuable for debugging after an error. I haven't seen it since.
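A minimal sketch of the idea, not the original system: it assumes one buffer per request, handled on a single thread, and the names `TraceBuffer`, `traced_request`, and `sink` are invented for this example.

```python
import sys
from contextlib import contextmanager


class TraceBuffer:
    """Holds trace messages for one request in memory."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.messages = []

    def trace(self, message):
        self.messages.append(message)

    def flush(self, sink):
        # Hand every buffered message to the sink, oldest first.
        for message in self.messages:
            sink(f"[{self.request_id}] {message}")


def _stderr_sink(line):
    print(line, file=sys.stderr)


@contextmanager
def traced_request(request_id, sink=_stderr_sink):
    buf = TraceBuffer(request_id)
    try:
        yield buf
    except Exception as exc:
        # Only a failing request pays the cost of emitting its trace.
        buf.trace(f"error: {exc!r}")
        buf.flush(sink)
        raise
    # On success the buffer is dropped without emitting anything.
```

Usage would be `with traced_request("req-1") as buf: buf.trace("acquired lock")`. A hard kill such as an OOM still loses the buffer, since nothing is written until the exception handler runs.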

hauleth

I haven't read anything in a long time that I agree with as much as this.

mccd

The AI-style writing with bullet points is very grating to my eyes, which is unfortunate because it seems someone put a lot of effort into this article. That said, I did become convinced that wide events are a good idea; the approach is very elegant.