Log level 'error' should mean that something needs to be fixed
23 points by carlana
My rule of thumb is usually:
Info: supplementary information should failure occur later
Warning: inefficient/non-standard operation, but primary functionality is retained
Error: A PICNIC issue; something about the environment the user chose to run the program in is broken, and the system can't continue functioning as normal
Critical: A violation of the program's own logic invariants occurred (such as a tripped assert), indicating a bug in the system itself (in proprietary systems, this would be cause for a Return To Base)
IMO, programs without high availability requirements should always abort on critical since it's usually a sign of memory corruption or other akin-to-UB states.
Interested in comment on this, if anybody does things differently!
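Aborting on critical, as suggested above, might be sketched like this in Python (the invariant and exit code are made up for illustration, not from the comment):

```python
import logging
import sys

log = logging.getLogger("svc")  # hypothetical service name

def check_invariant(queue_len: int, capacity: int) -> None:
    """Abort if an internal invariant is violated (illustrative sketch)."""
    if queue_len > capacity:
        # The process state can no longer be trusted, so log at
        # critical and abort instead of limping on.
        log.critical("invariant violated: queue_len=%d > capacity=%d",
                     queue_len, capacity)
        sys.exit(70)  # or os.abort() if a core dump is wanted
```

`os.abort()` would additionally produce a core dump for post-mortem debugging, which fits the "Return To Base" idea.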
I use syslog() and I do:
Debug: useful for debugging a service; disabled in production
Info, Notice: To me, no real difference. I might use Notice when the service goes operational, but I tend to just use Info.
Warn: usually I will log issues with the configuration, like a duplicated path when configuring a web server. I will also use this level if the server spawns a program and that program returns a status other than 0.
Error: I usually log unsuccessful system calls at this level. You might think that this will cause excessive logging, but in my experience, it does not. System calls fail rarely, and if they do, it's an indication of resource exhaustion. I will also use this to log requests the server can't parse, just on the off chance I missed something when coding (and yes, this has been useful in production, but use with care).
Critical: I only log this if the service can't start for some reason, like a missing configuration file, or a missing required bit of configuration. I'm not entirely consistent here, because I did find some instances where I use Error for this. I realize that Critical might be considered "too high", but syslog also has Alert and Emergency, which I don't use.
Alert, Emergency: I don't use these.
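A minimal sketch of roughly this scheme using Python's standard logging module (the service name and messages are made up; Python has no separate Notice level, so it collapses into Info here):

```python
import logging

def make_logger(production: bool) -> logging.Logger:
    logger = logging.getLogger("mydaemon")  # hypothetical service name
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    # Debug output is disabled in production, per the scheme above.
    logger.setLevel(logging.INFO if production else logging.DEBUG)
    return logger

log = make_logger(production=True)
log.debug("request parse trace")          # dropped in production
log.info("service operational")           # normal activity
log.warning("duplicate path in config")   # degraded but functional
log.error("open() failed: %s", "ENOSPC")  # failed system call
log.critical("missing required config")   # service cannot start
```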
> IMO, programs without high availability requirements should always abort on critical since it's usually a sign of memory corruption or other akin-to-UB states.
Programs with high availability requirements should also do that so another cluster member can take over who hopefully still has the correct state ;)
Well, sure: that's feasible if your system cares about availability but not latency, resource use, etc. and has a fairly relaxed notion of internal consistency (backend software for a large user-facing service like social media, say). Priorities start to change when it's flight control software or running in an embedded environment.
> Priorities start to change when it's flight control software or running in an embedded environment.
Those are fun and terrifying environments. Going through FMEA (failure modes and effects analysis) and really identifying the safest response in various circumstances is a really interesting exercise and sometimes the tradeoffs are complicated.
On the flight control side, whether or not you’re using triple-redundant gear, there’s one case that is incredibly complicated: what happens if the system reboots mid-air? Most unmanned systems default to powering up in a safe/disarmed state that requires an explicit arm command from the ground station before motors and actuators will move. That’s the right approach, except in a mid-air reboot: if the system comes back up disarmed, not only has the aircraft stopped flying its autonomous trajectory, but the pilot also likely has no control until the system is re-armed.
Working in this industry has made me think about failure modes in so much more detail than any other job. It’s great fun but also pretty unnerving to know that a software bug can get people killed or roll a $250k asset into a fiery ball.
I like this! I’m used to thinking about errors in terms of: do we blame the client or the server. Like http 400 vs 500. Sounds like your distinction between “program messed up” and “operator messed up” is similar? Just for a different audience.
What category would you put errors like “S3 is down”? It could be no fault of your own (it’s really down) or it could be your own misconfig (wrong url maybe).
Also do you log errors that are the client’s fault, and which category? Maybe it depends on how closely you work with customers, whether you want to be notified for mistakes on their end.
> Like http 400 vs 500.
I think that's a really useful way to think about it, yes.
> What category would you put errors like “S3 is down”?
That depends on whether one considers S3 to be 'internal' to the system at whatever location the logs are being spat out at, I'd think.
> Also do you log errors that are the client’s fault, and which category?
Most of my work isn't really customer-facing, but I wouldn't put client error any higher than warn. A robust system should be able to deal with pretty much anything a client throws at it without degraded function, so errors should never be possible to trigger via client misbehaviour. In Veloren we log warnings when clients start sending us network packets about non-existent entities, but we don't do anything that could produce broken outcomes.
Syslog is balge † but it has a few problems:
There are roughly three kinds of logging, which correspond to the circumstances and reason for reading the logs.
Information about what the daemon as a whole is doing: things like startup, shutdown, (re)configuration. Useful for ensuring that it is operating basically OK, or for configuration errors.
Information about the traffic the daemon is handling, as in a web server’s access log or a mail server’s delivery log. Useful for metrics and traffic analysis, and for troubleshooting after the fact.
Voluminous debug or trace output that is usually disabled. Useful for operators when the traffic logs are not detailed enough, or for developers when chasing a bug.
Syslog isn’t great at separating the first two. Often that doesn’t matter but it’s useful to be able to do so when there’s a lot of traffic. Subsystem tags would help to distinguish them. (Subsystem tags are also useful for controlling debug verbosity.)
One thing that distinguishes the first two is what kind of problems get reported. In the first case, any errors or warnings should be something that the operator needs to know about and is able to fix (as cks says). In the second case, problems with traffic often relate to remote systems that the operator can’t do anything about. If (because of syslog’s limitations) you can’t separate traffic logs from daemon logs then it’s unhelpful to log remote problems at warning or error priority, because that can hide important actionable messages in heaps of unfixable spam.
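Outside syslog, one way to get that separation is a logger per subsystem, each routed to its own destination, so traffic noise can’t drown actionable daemon messages. A sketch in Python (the subsystem names and messages are illustrative; `StringIO` stands in for separate files or sockets):

```python
import io
import logging

def subsystem_logger(name: str, stream) -> logging.Logger:
    """One logger per subsystem, each with its own handler and tag."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(f"[{name}] %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep subsystems out of the root logger
    return logger

daemon_out = io.StringIO()   # stand-in for a separate daemon log file
traffic_out = io.StringIO()  # stand-in for a separate traffic log file

daemon_log = subsystem_logger("daemon", daemon_out)     # startup/config
traffic_log = subsystem_logger("traffic", traffic_out)  # per-request

daemon_log.error("config: unknown directive 'lsiten'")
traffic_log.warning("upstream smtp.example.net refused connection")
```

Because the actionable configuration error and the unfixable remote-peer noise land in different streams, the former can’t be buried by the latter.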
WRT specific priority levels, my rule of thumb for the first kind of logging is roughly:
debug: if a simple on-off switch isn’t enough, ignore syslog and have a flexible debug-to-stderr mode instead
info: normal activity
notice: info, but louder; maybe use this when startup is complete, when shutting down, when reconfiguring; but you can ignore it and just use info
warning: use for an error that doesn’t prevent normal operation, e.g. deprecated or inconsistent config settings
error: something that prevents the daemon from working as configured; an error typically makes the daemon fail to start; an error during reconfiguration might make the daemon continue using the old config instead of quitting … then fail on restart
crit: assertion failure causing immediate ungraceful exit; but you can ignore it and just use error
alert: system-wide notice/warning; do not use unless you are roughly the kernel or init
emerg: system-wide error/crit; do not use unless you are roughly the kernel or init
The distinctions between info/notice and warn/error/crit are not really that important. Two or three levels would work as well as five.
† by and large good enough
I kind of disagree with the anti-warning stance. Warning is “if you’re investigating this sequence after the fact, then look here first”.
Like the SMTP error in the top-level post, which seems to be the perfect example. Do you want to page your ops team every time you can’t connect to a remote SMTP server? No, that seems a good way to burn people out and have them approach alerts as “it’s probably nothing”. But if you notice that your email delivery metrics are going down and you dig into a few example traces, a level of “this is probably important” is useful.
> If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system and should not be logged at level 'error'. The 'error' log level is for 'I'm not working right, help' …
I kind of agree, but in this case it's not possible to log the 'error' level message in the correct place (the server which cannot be connected to), because it's not aware that someone is trying to connect to it and failing.
That absolutely could be an error that requires local action too. Maybe there's a typo in the hostname in config for example. The process doing the logging usually has no way to know if the situation is temporary or permanent, local or remote.
Absolutely. The better solution here is probably to emit a metric counting connection attempts with metadata indicating if they were successful and which server it was attempting to connect to.
With this information we can set sensible alerts
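A minimal sketch of that idea, using a plain in-memory counter rather than a real metrics library (the server name is made up; in practice something like prometheus_client would likely be used instead):

```python
from collections import Counter

connection_attempts = Counter()  # (server, outcome) -> count

def record_attempt(server: str, success: bool) -> None:
    """Count each attempt, labelled by target server and outcome."""
    outcome = "success" if success else "failure"
    connection_attempts[(server, outcome)] += 1

def failure_ratio(server: str) -> float:
    """Fraction of failed attempts; an alert can fire on a
    sustained high ratio instead of paging on every failure."""
    ok = connection_attempts[(server, "success")]
    bad = connection_attempts[(server, "failure")]
    total = ok + bad
    return bad / total if total else 0.0

for outcome in (True, True, False, True):
    record_attempt("smtp.example.com", outcome)
```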
Being unable to connect to things occasionally is not an error, it's things working as expected.
GIGO makes this a bit complicated. If your software runs more or less continuously, takes data from users and/or other software, and gracefully handles bad data, it’s not obvious whether warning or error is a more appropriate log level when some feature failed to complete its task. The program’s still running fine, but something failed for someone at some point through no fault of your own and it will again because it’s an imperfect world. Where would you want to find a record of that event? Mixed in with random driver noise and deprecation notices? Or with more serious failures?
Throwing out garbage data that comes in is a normal operation. If the garbage data crashes your service, then it's error because you shouldn't crash. If the garbage data is detected and thrown out, then it's a warning or info level, or not even logged.
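That policy can be made concrete in a small input handler (the validation rule and field names here are hypothetical):

```python
import logging

log = logging.getLogger("ingest")

def handle_record(record: dict) -> bool:
    """Return True if the record was processed, False if discarded."""
    try:
        # Hypothetical validation: require a non-negative integer 'id'.
        if not isinstance(record.get("id"), int) or record["id"] < 0:
            log.warning("discarding malformed record: %r", record)
            return False
        # ... normal processing would go here ...
        return True
    except Exception:
        # Garbage data should never take the service down; if control
        # does reach here, that is a bug worth an error-level log.
        log.exception("unexpected failure while processing %r", record)
        return False
```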
I agree with the words of the title and much of the article, but not this example:
> If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system and should not be logged at level 'error'.
It could be an error in the local system’s configuration (e.g., which host to send email to). It could be an error in the local system’s hardware (e.g. the NIC).
And why assume that only local errors matter? If I am running a program which tries to send mail, I very likely control its network connexion to the rest of the world. That’s something I can debug and fix.
> The 'error' log level is for 'I'm not working right, help', things such as 'configuration file error', 'my memory allocation failed', 'unexpected failure to read a data file', and so on.
Or ‘unexpected failure to connect to configured SMTP host.’
Log level ‘error’ should mean errors: programmer errors, deployment errors, operational errors, network errors, peer errors. Errors.
Regarding the subject: Isn't that obvious? The word error means something is incorrect, so obviously it needs fixing.
Regarding the reported issue: Checking for an error is, I guess, a slightly more philosophical question, and not at all what the title of the article indicates. It feels more like a missing feature. The fact that a probe fails is an error case that usually indicates "something needs to be fixed". This seems more like using a feature that doesn't exist.
As for whether it should be logged at all when a probe fails: I think people will disagree on that. However, I can understand the take that you want this to be an error, because then during a disaster something becomes evident from the logs. But then I can also see the probe status being more of an INFO-level matter.
But you can also see this in a completely different way and see blackbox_exporter as the "drop in" for self-reporting of some system, in which case it might make sense.
So in short I do not think that the title does the content justice. If a probe fails something does need to be fixed. The fact that the author wants to check for the error case - unless I am missing something - is a feature request for blackbox_exporter.
An example of this is how various programming languages' stdlibs consider things like a 404 or a missing file an error/exception, which I always found really weird, because checking for the absence of something doesn't feel like something incorrect or exceptional at all.
But consider a port scanner, which is a much closer analogy: nobody would be surprised by a failure to establish a TCP connection being raised as an error/exception, even though such failures are entirely expected if your use case is a port scanner or something similar.
Warnings should also be fixed. User input should usually not trigger warnings (when it's a server and you have no control over it).
We use warnings for "this input is nonsensical, but external constraints require us to proceed anyway", so that someone looking through logs will have the warning jump out at them.