Log level 'error' should mean that something needs to be fixed
23 points by carlana
My rule of thumb is usually:
Info: supplementary information should failure occur later
Warning: inefficient/non-standard operation, but primary functionality is retained
Error: A PICNIC issue; something about the environment the user chose to run the program in is broken, and the system can't continue functioning as normal
Critical: A violation of the program's own logic invariants occurred (such as a tripped assert), indicating a bug in the system itself (in proprietary systems, this would be cause for a Return To Base)
IMO, programs without high availability requirements should always abort on critical since it's usually a sign of memory corruption or other akin-to-UB states.
Interested in comment on this, if anybody does things differently!
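Aborting on critical, as suggested above, might be sketched like this in Python (the invariant and exit code are made up for illustration, not from the comment):

```python
import logging
import sys

log = logging.getLogger("svc")  # hypothetical service name

def check_invariant(queue_len: int, capacity: int) -> None:
    """Abort if an internal invariant is violated (illustrative sketch)."""
    if queue_len > capacity:
        # The process state can no longer be trusted, so log at
        # critical and abort instead of limping on.
        log.critical("invariant violated: queue_len=%d > capacity=%d",
                     queue_len, capacity)
        sys.exit(70)  # or os.abort() if a core dump is wanted
```

`os.abort()` would additionally produce a core dump for post-mortem debugging, which fits the "Return To Base" idea.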
I use syslog() and I do:
Debug: useful for debugging a service; disabled in production
Info, Notice: To me, no real difference. I might use Notice when the service goes operational, but I tend to just use Info.
Warn: usually I will log issues with the configuration, like a duplicated path when configuring a web server. I will also use this level if the server spawns a program and that program returns a status other than 0.
Error: I usually log unsuccessful system calls at this level. You might think that this will cause excessive logging, but in my experience, it does not. System calls fail rarely, and if they do, it's an indication of resource exhaustion. I will also use this to log requests the server can't parse, just on the off chance I missed something when coding (and yes, this has been useful in production, but use with care).
Critical: I only log this if the service can't start for some reason, like a missing configuration file, or a missing required bit of configuration. I'm not entirely consistent here, because I did find some instances where I use Error for this. I realize that Critical might be considered "too high", but syslog also has Alert and Emergency, which I don't use.
Alert, Emergency: I don't use these.
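A minimal sketch of roughly this scheme using Python's standard logging module (the service name and messages are made up; Python has no separate Notice level, so it collapses into Info here):

```python
import logging

def make_logger(production: bool) -> logging.Logger:
    logger = logging.getLogger("mydaemon")  # hypothetical service name
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    # Debug output is disabled in production, per the scheme above.
    logger.setLevel(logging.INFO if production else logging.DEBUG)
    return logger

log = make_logger(production=True)
log.debug("request parse trace")          # dropped in production
log.info("service operational")           # normal activity
log.warning("duplicate path in config")   # degraded but functional
log.error("open() failed: %s", "ENOSPC")  # failed system call
log.critical("missing required config")   # service cannot start
```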
> IMO, programs without high availability requirements should always abort on critical since it's usually a sign of memory corruption or other akin-to-UB states.
Programs with high availability requirements should also do that so another cluster member can take over who hopefully still has the correct state ;)
Well, sure: that's feasible if your system cares about availability but not latency, resource use, etc. and has a fairly relaxed notion of internal consistency (backend software for a large user-facing service like social media, say). Priorities start to change when it's flight control software or running in an embedded environment.
> Priorities start to change when it's flight control software or running in an embedded environment.
Those are fun and terrifying environments. Going through FMEA (failure modes and effects analysis) and really identifying the safest response in various circumstances is a really interesting exercise and sometimes the tradeoffs are complicated.
On the flight control side, whether or not you’re using triple-redundant gear, there’s one case that is incredibly complicated: what happens if the system reboots mid-air? Most unmanned systems default to powering up in a safe/disarmed state that requires an explicit arm command from the ground station before motors and actuators will move. That’s the right approach, except in a mid-air reboot: if the system comes back up disarmed, not only has the aircraft stopped flying its autonomous trajectory, but the pilot also likely has no control until the system is re-armed.
Working in this industry has made me think about failure modes in so much more detail than any other job. It’s great fun but also pretty unnerving to know that a software bug can get people killed or roll a $250k asset into a fiery ball.
I like this! I’m used to thinking about errors in terms of: do we blame the client or the server. Like http 400 vs 500. Sounds like your distinction between “program messed up” and “operator messed up” is similar? Just for a different audience.
What category would you put errors like “S3 is down”? It could be no fault of your own (it’s really down) or it could be your own misconfig (wrong url maybe).
Also do you log errors that are the client’s fault, and which category? Maybe it depends on how closely you work with customers, whether you want to be notified for mistakes on their end.
> Like http 400 vs 500.
I think that's a really useful way to think about it, yes.
> What category would you put errors like “S3 is down”?
That depends on whether one considers S3 to be 'internal' to the system at whatever location the logs are being spat out at, I'd think.
> Also do you log errors that are the client’s fault, and which category?
Most of my work isn't really customer-facing, but I wouldn't put client error any higher than warn. A robust system should be able to deal with pretty much anything a client throws at it without degraded function, so errors should never be possible to trigger via client misbehaviour. In Veloren we log warnings when clients start sending us network packets about non-existent entities, but we don't do anything that could produce broken outcomes.
Syslog is balge † but it has a few problems:
There are roughly three kinds of logging, which correspond to the circumstances and reason for reading the logs.
Information about what the daemon as a whole is doing: things like startup, shutdown, (re)configuration. Useful for ensuring that it is operating basically OK, or for configuration errors.
Information about the traffic the daemon is handling, as in a web server’s access log or a mail server’s delivery log. Useful for metrics and traffic analysis, and for troubleshooting after the fact.
Voluminous debug or trace output that is usually disabled. Useful for operators when the traffic logs are not detailed enough, or for developers when chasing a bug.
Syslog isn’t great at separating the first two. Often that doesn’t matter but it’s useful to be able to do so when there’s a lot of traffic. Subsystem tags would help to distinguish them. (Subsystem tags are also useful for controlling debug verbosity.)
One thing that distinguishes the first two is what kind of problems get reported. In the first case, any errors or warnings should be something that the operator needs to know about and is able to fix (as cks says). In the second case, problems with traffic often relate to remote systems that the operator can’t do anything about. If (because of syslog’s limitations) you can’t separate traffic logs from daemon logs then it’s unhelpful to log remote problems at warning or error priority, because that can hide important actionable messages in heaps of unfixable spam.
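Outside syslog, one way to get that separation is a logger per subsystem, each routed to its own destination, so traffic noise can’t drown actionable daemon messages. A sketch in Python (the subsystem names and messages are illustrative; `StringIO` stands in for separate files or sockets):

```python
import io
import logging

def subsystem_logger(name: str, stream) -> logging.Logger:
    """One logger per subsystem, each with its own handler and tag."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(f"[{name}] %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep subsystems out of the root logger
    return logger

daemon_out = io.StringIO()   # stand-in for a separate daemon log file
traffic_out = io.StringIO()  # stand-in for a separate traffic log file

daemon_log = subsystem_logger("daemon", daemon_out)     # startup/config
traffic_log = subsystem_logger("traffic", traffic_out)  # per-request

daemon_log.error("config: unknown directive 'lsiten'")
traffic_log.warning("upstream smtp.example.net refused connection")
```

Because the actionable configuration error and the unfixable remote-peer noise land in different streams, the former can’t be buried by the latter.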
WRT specific priority levels, my rule of thumb for the first kind of logging is roughly:
debug: if a simple on-off switch isn’t enough, ignore syslog and have a flexible debug-to-stderr mode instead
info: normal activity
notice: info, but louder; maybe use this when startup is complete, when shutting down, when reconfiguring; but you can ignore it and just use info
warning: use for an error that doesn’t prevent normal operation, e.g. deprecated or inconsistent config settings
error: something that prevents the daemon from working as configured; an error typically makes the daemon fail to start; an error during reconfiguration might make the daemon continue using the old config instead of quitting … then fail on restart
crit: assertion failure causing immediate ungraceful exit; but you can ignore it and just use error
alert: system-wide notice/warning; do not use unless you are roughly the kernel or init
emerg: system-wide error/crit; do not use unless you are roughly the kernel or init
The distinctions between info/notice and warn/error/crit are not really that important. Two or three levels would work as well as five.
† by and large good enough
I kind of disagree with the anti-warning stance. Warning is “if you’re investigating this sequence after the fact, then look here first”.
Like the SMTP error in the top-level post, which seems to be the perfect example. Do you want to page your ops team every time you can’t connect to a remote SMTP server? No, that seems a good way to burn people out and have them approach alerts as “it’s probably nothing”. But if you notice that your email delivery metrics are going down and you dig into a few example traces, a level of “this is probably important” is useful.
> If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system and should not be logged at level 'error'. The 'error' log level is for 'I'm not working right, help' …
I kind of agree, but in this case it's not possible to log the 'error' level message in the correct place (the server which cannot be connected to), because it's not aware that someone is trying to connect to it and failing.
That absolutely could be an error that requires local action too. Maybe there's a typo in the hostname in config for example. The process doing the logging usually has no way to know if the situation is temporary or permanent, local or remote.
Absolutely. The better solution here is probably to emit a metric counting connection attempts with metadata indicating if they were successful and which server it was attempting to connect to.
With this information we can set sensible alerts
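A minimal sketch of that idea, using a plain in-memory counter rather than a real metrics library (the server name is made up; in practice something like prometheus_client would likely be used instead):

```python
from collections import Counter

connection_attempts = Counter()  # (server, outcome) -> count

def record_attempt(server: str, success: bool) -> None:
    """Count each attempt, labelled by target server and outcome."""
    outcome = "success" if success else "failure"
    connection_attempts[(server, outcome)] += 1

def failure_ratio(server: str) -> float:
    """Fraction of failed attempts; an alert can fire on a
    sustained high ratio instead of paging on every failure."""
    ok = connection_attempts[(server, "success")]
    bad = connection_attempts[(server, "failure")]
    total = ok + bad
    return bad / total if total else 0.0

for outcome in (True, True, False, True):
    record_attempt("smtp.example.com", outcome)
```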
Being unable to connect to things occasionally is not an error, it's things working as expected.
GIGO makes this a bit complicated. If your software runs more or less continuously, takes data from users and/or other software, and gracefully handles bad data, it’s not obvious whether warning or error is a more appropriate log level when some feature failed to complete its task. The program’s still running fine, but something failed for someone at some point through no fault of your own and it will again because it’s an imperfect world. Where would you want to find a record of that event? Mixed in with random driver noise and deprecation notices? Or with more serious failures?
Throwing out garbage data that comes in is a normal operation. If the garbage data crashes your service, then it's error because you shouldn't crash. If the garbage data is detected and thrown out, then it's a warning or info level, or not even logged.
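That policy can be made concrete in a small input handler (the validation rule and field names here are hypothetical):

```python
import logging

log = logging.getLogger("ingest")

def handle_record(record: dict) -> bool:
    """Return True if the record was processed, False if discarded."""
    try:
        # Hypothetical validation: require a non-negative integer 'id'.
        if not isinstance(record.get("id"), int) or record["id"] < 0:
            log.warning("discarding malformed record: %r", record)
            return False
        # ... normal processing would go here ...
        return True
    except Exception:
        # Garbage data should never take the service down; if control
        # does reach here, that is a bug worth an error-level log.
        log.exception("unexpected failure while processing %r", record)
        return False
```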
I agree with the words of the title and much of the article, but not this example:
> If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system and should not be logged at level 'error'.
It could be an error in the local system’s configuration (e.g., which host to send email to). It could be an error in the local system’s hardware (e.g. the NIC).
And why assume that only local errors matter? If I am running a program which tries to send mail, I very likely control its network connexion to the rest of the world. That’s something I can debug and fix.
> The 'error' log level is for 'I'm not working right, help', things such as 'configuration file error', 'my memory allocation failed', 'unexpected failure to read a data file', and so on.
Or ‘unexpected failure to connect to configured SMTP host.’
Log level ‘error’ should mean errors: programmer errors, deployment errors, operational errors, network errors, peer errors. Errors.
Regarding the subject: Isn't that obvious? The word error means something is incorrect, so obviously it needs fixing.
Regarding the reported issue: Checking for an error is, I guess, a slightly more philosophical question, and not at all what the title of the article indicates. It feels more like a missing feature. The fact that a probe fails is an error case that usually indicates "something needs to be fixed". This seems more like using a feature that doesn't exist.
As for whether it should be logged at all when a probe fails: I think people will disagree on that. However, I can understand the take that you want this to be an error, because then during a disaster something becomes evident from the logs. But then I can also see the probe status being more of an INFO-level matter.
But you can also see this in a completely different way and see blackbox_exporter as the "drop in" for self-reporting of some system, in which case it might make sense.
So in short I do not think that the title does the content justice. If a probe fails something does need to be fixed. The fact that the author wants to check for the error case - unless I am missing something - is a feature request for blackbox_exporter.
An example of this is how various programming languages' stdlibs consider things like a 404 or a missing file an error/exception, which I always found really weird, because checking for the absence of something doesn't feel like something incorrect or exceptional at all.
But consider a port scanner, which is a much closer analogy: nobody would be surprised by a failure to establish a TCP connection being raised as an error/exception, even though such failures are entirely expected if your use case is a port scanner or something similar.
Warnings should also be fixed. User input should usually not trigger warnings (when it's a server and you have no control over it).
We use warnings for "this input is nonsensical, but external constraints require us to proceed anyway", so that someone looking through logs will have the warning jump out at them.