Retry Loop Retry

28 points by sinclairtarget

joshka

Something I recall learning a while back is that the best number of retries is often 1, which makes this not a retry loop but instead an if statement. This was coming from people that really knew exponential backoff with jitter and token bucket approaches well and could explain how these affected extremely large scale systems.

Why? Because failing a second time is highly correlated with failing a third time in many distributed systems. A good mental model for this is that errors is generally either:

transient, which implies that a single retry will get a successful result); or
persistent, so a second retry will not get a successful result, and may contribute negatively to overall system overload.

You shouldn’t just trust me on this, instead go look at some logs of your own systems where you have retries. Hopefully you’re logging the errors with a retry count and you can compare counts of the first retry and the second retry.

oliverpool

Interesting take.

The only counter example I think of is rate-limiting, where you know that you will be unlocked soon (soon enough that some sleep can help), but the server does not tell you exactly when (server does not returns a Retry-After header).

[might sound like a contrived example, but I did actually face such situation at $WORK recently…]
- quad
  
  I’d probably use a circuit breaker rather than a retry loop for that.
peterbourgon

Why? Because failing a second time is highly correlated with failing a third time in many distributed systems. A good mental model for this is that errors is generally either: (a) transient, which implies that a single retry will get a successful result); or (b) persistent, so a second retry will not get a successful result, and may contribute negatively to overall system overload.

Absolutely true! But I’d carry it even further: failing even a single time is highly correlated with failing every time afterwards. By the time an error makes its way back to your application code, your request has almost always run through a whole litany of “retry loops” in middle-boxes, load balancers, application servers, service meshes, etc. etc., all of which will be largely invisible to you as a caller, and all of which having way more domain-specific information that can be leveraged to make much more intelligent retry decisions than you can at the far end of the pipe. This is especially true for transient errors! The best retry policy for application code is usually to not try to retry at all.
- joshka
  
  Yep. The single retry + percentage based circuit breaker approach can be pretty powerful. Especially if your service can run degraded without that call being made.
  
  The last piece of this is to be careful of retries within retries. It’s really easy to have a multiplication effect due to things which also retry internally. Having 3 levels each with 3 retries on failure means that a persistent failure at the bottom level of the stack might mean 444 = 64x your load on the bottom level, which can lead to overload.
peterbourgon
I’ve found that this kind of code becomes much easier to write, and simpler to understand, if instead of working in terms of “max number of retries” you use “max number of attempts”. For exactly the reasons described in the article! It’s still the same retry loop pattern, but attempts is usually a better proxy for what’s actually being expressed.

edit: to wit
```
var retries_left = retry_count;
const result = try for(0..retry_count + 1) {
    const err = if (action()) |ok| break ok else |err| err;
    if (!is_transient_error(err)) break err;
    if (retries_left == 0) break err;
    retries_left -= 1;
    sleep();
} else @panic("runaway loop");
```
vs.
```
for n=attempts; n>0; n-- {
    match action() {
        |ok|            return ok
        |permanent err| return err
        |transient err| sleep/continue
    }
}
return error(attempts exhausted)
```
- matklad
  This fails couple of points:
  
  there’s one extra needless sleep after the last attempt
  
  the original error is not returned if we are out of attempts
  
  Both can be fixed by replacing return error with return action(), but then that’s repetition, which only works if you already have action nicely extracted as a function.
  - benl
    
    You can check the attempt number in the match so that you have attempt == 0 || !is_temporary(err) => return err, but point taken. In that case it’s syntactically clear the number of attempts are bounded but not that you’re sleeping appropriately.
  - peterbourgon
    
    Sorry, I didn’t mean for the example code to be any kind of direct translation of the original program, I only meant for it to be a rough demonstration of my actual point, which was only that it’s often better to model this kind of code using “attempts” (which map directly to invocations) rather than “retries” (which are off-by-one).
    
    Let me make another attempt at demonstrating this point, this time by directly refactoring your original example, or at least my best attempt to do so:
    
    const result = try while (attempts > 0) { const err = if (action()) |ok| break ok else |err| err; if (!is_transient_error(err)) break err; attempts -= 1; if (attempts == 0) break err; sleep(); } else @panic("impossible condition, runaway loop");
    
    If you don’t think this is an improvement vs. the original, then that’s fine.
    
    invlpg
    
    This still needs an else unreachable; or else @panic(...) after the while loop, otherwise it won’t typecheck.
    
    peterbourgon
    
    OK! Again my examples were never intended to be runnable or correct or anything other than an attempt at demonstrating my actual point, which I made in the original paragraph and I feel like have absolutely failed to communicate effectively, my mistakes all around.
    
    invlpg
    
    Sorry for my bluntness, please don’t take it as criticism of you or your communication style. I was just scrolling through the thread reading some new replies, saw some zig code with a minor error, and offered a correction without looking at the surrounding context.
    
    tentacloids
    
    This is fine. They take your software developer license away if you don’t periodically go on the internet and reply to something with an irrelevant nitpick.
    
    peterbourgon
    
    Can confirm.
    
    benl
    
    Came to post exactly the same thing. If you’re time-bounding the loop, you can also nicely express it this way:
    
    match action() { ok => return ok, err if permanent(err) || now().after(deadline) => return err, _ => continue, }
    
    We both hand-waved away saving the first error, but it’s also not ever clear to me if that’s what I want in these situations.
    
    spc476
    
    There’s a difference between your two examples: the first does one retry if the count is 0; the other one doesn’t.
    
    If your language supports tail call optimization, then the following works well (Lua, since it supports it):
    
    local function attempt(data,count) assert(count >= 0) local okay,err = sendpacket(data) if okay then return true,0 end if not_transient_error(err) then return false,err end if count > 0 then sleep() return attempt(data,count-1) end return false,ETRIES end
    
    Even handles the 0 case without the “+1”. The assert() there is to signal that the count should be 0 or positive.
    
    peterbourgon
    
    There’s a difference between your two examples: the first does one retry if the count is 0; the other one doesn’t.
    
    Yeah, but that’s the point: the first one operates on “retries” and the second on “attempts”.
    
    dataangel
    
    I like micro snippets that are hard to express like these, good little case studies for future PL designers
    
    julesjacobs
    
    let rec retry f retries = match f () with | Error(e) if retries >= 0 && is_transient(e) => sleep(); retry f (retries - 1) | other => other
    
    or
    
    let rec retry f retries = match f () with | Error(e) => if retries >= 0 && is_transient(e) then sleep(); retry f (retries - 1) else Error(e) | Ok(x) => Ok(x)
    
    quad
    
    let rec 👌
    
    mikea
    
    Maybe I misunderstand the requirements but wouldn’t the code in the post return the last error, not the original?
    
    I personally prefer to count attempts upward from 1. This removes weird 0 attempts left situation:
    
    Mobile pseudocode:
    
    attemptNo = 1 while True: result = attempt() if attemptNo == maxAttempts: break result sleep() attemptNo++
    
    ocharles
    
    I like to use the iterative monad transformer in Haskell for this:
    
    void $ retract $ cutoff 10 $ untilJust do Dashboard.getProgramState dashboard >>= \case Dashboard.ProgramStopped -> return $ Just () _ -> Nothing <$ liftIO (threadDelay 1_000_000)
    
    Here I:
    
    Run getProgramState, which can fail
    
    If it fails, I add a 1s delay and return Nothing.
    
    I wrap this in untilJust :: Monad m => m (Maybe a) -> IterT m a
    
    I add a cutoff to limit the number of iterations: cutoff :: Monad m => Integer -> IterT m a -> IterT m (Maybe a)
    
    I retract, to run the computation: retract :: Monad m => IterT m a -> m a
    
    However, this doesn’t satisfy the requirements: there is a useless sleep at the end.
    
    I think I’ve done it without the extra sleep by using <|> with an infinite list of delays (fun aside, you can generate this list to vary the delays at each step!). It was something like:
    
    retract $ cutoff 10 $ untilJust tryTheThing <|> delay (untilJust (Nothing <$ threadDelay 1_000_000))
    
    but I can’t find this in our code anymore, I guess I must have gotten rid of it.
    
    kristoff
    
    Here’s my attempt, comes with two caveats:
    
    action() is called in two places
    
    retry_count is literally the retry count, so action is called up to retry_count + 1 times
    
    const result = try for (0..retry_count) |n| { break action() catch |err| { if (!transient(err)) break err; sleep(exp(n)); }; } else action();
    
    quad
    
    How is the original error reported if retrying fails?
    
    invlpg
    
    This is Zig, so action() returns an “error union” which can either be an error or a valid value. When retrying fails, the for ... else action() block evaluates to that error union returned by action(). That error union is then passed to the try statement. If it’s an error, then it is returned from the enclosing function, if it’s a value then it’s assigned to the result variable.
    
    quad
    
    If I understand you correctly, the error (or value) from the last action() call are passed to the try statement?
    
    Am I misunderstanding “original” in “the original error is reported if retrying fails?” I interpreted this as the first error that triggered the retrying.
    
    invlpg
    
    Am I misunderstanding “original” in “the original error is reported if retrying fails?” I interpreted this as the first error that triggered the retrying.
    
    I thought the same at first until I read @matklad’s original blog post that was linked in this one:
    
    In the event of a retry failure, the underlying error is reported. I don’t want to see just that all attempts failed, I want to see an actual error from the last attempt.
    
    quad
    
    Cheers! I’ve revised my code golf accordingly.
    
    invlpg
    
    This doesn’t work because when action() returns type E!T then the catch |err| { ... } needs to be coercible to type T as well. You can avoid this with continue inside the break/catch block, as I did in my similar answer further down the thread.
    
    kristoff
    
    yeah I forgot the continue :^)
    
    invlpg
    
    How about this?
    
    const result = try retry: for (0..retry_count + 1) |attempt| { break action() catch |err| { if (!isTransientError(err) or attempt == retry_count) break err; sleep(); continue :retry; }; } else @panic("runaway loop");
    
    It’s bounded, there’s no additional variable, the immediate break action() makes the happy path obvious, and that catch |err| is an exceptional case that needs to be handled separately.
    
    I’m not entirely happy about the continue inside of a break statement, which is why I added the retry: label. It still has the additional + 1, but that could just be thought of as Zig’s syntax for an inclusive range.
    
    quad
    
    let mut first_err = None; for attempt in 0..=retry_count { if (attempt > 0) { sleep(); } match action() { Err(e) if is_transient_error(e) => { first_err.get_or_insert(e); }, r => return r } } Err(first_err.unwrap())
    
    I don’t speak Zig and thus don’t understand how the original error is reported in the blog post’s code.
    
    quad
    
    Everyone seems to have interpreted the “original error” requirement as allowing the result of the last action() to be returned, so…
    
    for attempt in 0..=retry_count { match action() { Err(e) if attempt < retry_count && is_transient_error(e) => sleep(), r => return r, } } unreachable!()
    
    fanf
    
    I had déjà vu reading this because it’s almost word for word what I sketched out for myself yesterday :-) (Then I started over-engineering intermittent errors and threw it away!)
    
    xq
    
    i’ve got two alternative solutions here:
    
    const result = try for(0..retry_count) |retry| { const is_first = (retry == 0); const is_last = (retry == retry_count-1); if(!is_first) { sleep(); } if(action()) |value| { break value; } else |err| if(!is_last and is_transient(err)) { continue; } else { break err; } } else unreachable;
    
    const result = try blk: { var last_err: E = undefined; break :blk for(0..retry_count) |retry| { const is_first = (retry == 0); if(!is_first) { sleep(); } if(action()) |value| { break value; } else |err| { last_err = err; if(!is_transient(err)) { break err; } } } else last_err; };
    
    quad
    
    How does the first one capture the original error? Do either work if retry_count == 0?
    
    xq
    
    Both won’t work if retry_count is 0, because there won’t be any call. I’ve left out the asser/@compileError for brevity. I guess attempt_count would’ve been better
    
    the first one captures the error by the break err on either a non-retryable error or the last loop iterarion.
    
    robbiev
    
    Bit late to the party, but here’s what I came up with:
    
    fn action_with_retries(comptime retry_count: u32) E!T { var last_err: ?E = null; return loop: switch (@as(u32, 0)) { 0...retry_count => |attempt| { if (last_err) |err| { if (!is_transient_error(err)) break :loop err; std.Thread.sleep(std.time.ns_per_ms); } if (action()) |ok| { break :loop ok; } else |err| { last_err = err; continue :loop attempt + 1; } }, else => return last_err orelse @panic("bug"), }; }
    
    robbiev
    
    A for loop version which I like even more. A bonus is that attempts_retry doesn’t need to be comptime, and that the operation is more obviously bounded.
    
    fn action_with_retries(attempts_retry: u32) E!T { const attempts_first = 1; const attempts_total = attempts_first + attempts_retry; var last_err: ?E = null; return for (0..attempts_total) |_| { if (last_err) |err| { if (!is_transient_error(err)) break err; std.Thread.sleep(std.time.ns_per_ms); } if (action()) |ok| break ok else |err| last_err = err; } else last_err orelse @panic("bug"); }