Retry Loop Retry
28 points by sinclairtarget
28 points by sinclairtarget
Something I recall learning a while back is that the best number of retries is often 1, which makes this not a retry loop but instead an if statement. This was coming from people that really knew exponential backoff with jitter and token bucket approaches well and could explain how these affected extremely large scale systems.
Why? Because failing a second time is highly correlated with failing a third time in many distributed systems. A good mental model for this is that errors is generally either:
You shouldn’t just trust me on this, instead go look at some logs of your own systems where you have retries. Hopefully you’re logging the errors with a retry count and you can compare counts of the first retry and the second retry.
Interesting take.
The only counter example I think of is rate-limiting, where you know that you will be unlocked soon (soon enough that some sleep can help), but the server does not tell you exactly when (server does not returns a Retry-After header).
[might sound like a contrived example, but I did actually face such situation at $WORK recently…]
Why? Because failing a second time is highly correlated with failing a third time in many distributed systems. A good mental model for this is that errors is generally either: (a) transient, which implies that a single retry will get a successful result); or (b) persistent, so a second retry will not get a successful result, and may contribute negatively to overall system overload.
Absolutely true! But I’d carry it even further: failing even a single time is highly correlated with failing every time afterwards. By the time an error makes its way back to your application code, your request has almost always run through a whole litany of “retry loops” in middle-boxes, load balancers, application servers, service meshes, etc. etc., all of which will be largely invisible to you as a caller, and all of which having way more domain-specific information that can be leveraged to make much more intelligent retry decisions than you can at the far end of the pipe. This is especially true for transient errors! The best retry policy for application code is usually to not try to retry at all.
Yep. The single retry + percentage based circuit breaker approach can be pretty powerful. Especially if your service can run degraded without that call being made.
The last piece of this is to be careful of retries within retries. It’s really easy to have a multiplication effect due to things which also retry internally. Having 3 levels each with 3 retries on failure means that a persistent failure at the bottom level of the stack might mean 444 = 64x your load on the bottom level, which can lead to overload.
I’ve found that this kind of code becomes much easier to write, and simpler to understand, if instead of working in terms of “max number of retries” you use “max number of attempts”. For exactly the reasons described in the article! It’s still the same retry loop pattern, but attempts is usually a better proxy for what’s actually being expressed.
edit: to wit
var retries_left = retry_count;
const result = try for(0..retry_count + 1) {
const err = if (action()) |ok| break ok else |err| err;
if (!is_transient_error(err)) break err;
if (retries_left == 0) break err;
retries_left -= 1;
sleep();
} else @panic("runaway loop");
vs.
for n=attempts; n>0; n-- {
match action() {
|ok| return ok
|permanent err| return err
|transient err| sleep/continue
}
}
return error(attempts exhausted)
This fails couple of points:
Both can be fixed by replacing return error
with return action()
, but then that’s repetition, which only works if you already have action
nicely extracted as a function.
You can check the attempt number in the match so that you have attempt == 0 || !is_temporary(err) => return err
, but point taken. In that case it’s syntactically clear the number of attempts are bounded but not that you’re sleeping appropriately.
Sorry, I didn’t mean for the example code to be any kind of direct translation of the original program, I only meant for it to be a rough demonstration of my actual point, which was only that it’s often better to model this kind of code using “attempts” (which map directly to invocations) rather than “retries” (which are off-by-one).
Let me make another attempt at demonstrating this point, this time by directly refactoring your original example, or at least my best attempt to do so:
const result = try while (attempts > 0) {
const err = if (action()) |ok| break ok else |err| err;
if (!is_transient_error(err)) break err;
attempts -= 1;
if (attempts == 0) break err;
sleep();
} else @panic("impossible condition, runaway loop");
If you don’t think this is an improvement vs. the original, then that’s fine.
This still needs an else unreachable;
or else @panic(...)
after the while loop, otherwise it won’t typecheck.
OK! Again my examples were never intended to be runnable or correct or anything other than an attempt at demonstrating my actual point, which I made in the original paragraph and I feel like have absolutely failed to communicate effectively, my mistakes all around.
Sorry for my bluntness, please don’t take it as criticism of you or your communication style. I was just scrolling through the thread reading some new replies, saw some zig code with a minor error, and offered a correction without looking at the surrounding context.
This is fine. They take your software developer license away if you don’t periodically go on the internet and reply to something with an irrelevant nitpick.
Came to post exactly the same thing. If you’re time-bounding the loop, you can also nicely express it this way:
match action() {
ok => return ok,
err if permanent(err) || now().after(deadline) => return err,
_ => continue,
}
We both hand-waved away saving the first error, but it’s also not ever clear to me if that’s what I want in these situations.
There’s a difference between your two examples: the first does one retry if the count is 0; the other one doesn’t.
If your language supports tail call optimization, then the following works well (Lua, since it supports it):
local function attempt(data,count)
assert(count >= 0)
local okay,err = sendpacket(data)
if okay then return true,0 end
if not_transient_error(err) then return false,err end
if count > 0 then sleep() return attempt(data,count-1) end
return false,ETRIES
end
Even handles the 0 case without the “+1”. The assert()
there is to signal that the count should be 0 or positive.
There’s a difference between your two examples: the first does one retry if the count is 0; the other one doesn’t.
Yeah, but that’s the point: the first one operates on “retries” and the second on “attempts”.
I like micro snippets that are hard to express like these, good little case studies for future PL designers
let rec retry f retries =
match f () with
| Error(e) if retries >= 0 && is_transient(e) => sleep(); retry f (retries - 1)
| other => other
or
let rec retry f retries =
match f () with
| Error(e) =>
if retries >= 0 && is_transient(e) then
sleep(); retry f (retries - 1)
else
Error(e)
| Ok(x) => Ok(x)
Maybe I misunderstand the requirements but wouldn’t the code in the post return the last error, not the original?
I personally prefer to count attempts upward from 1. This removes weird 0 attempts left situation:
Mobile pseudocode:
attemptNo = 1
while True:
result = attempt()
if attemptNo == maxAttempts: break result
sleep()
attemptNo++
I like to use the iterative monad transformer in Haskell for this:
void $ retract $ cutoff 10 $ untilJust do
Dashboard.getProgramState dashboard >>= \case
Dashboard.ProgramStopped -> return $ Just ()
_ -> Nothing <$ liftIO (threadDelay 1_000_000)
Here I:
getProgramState
, which can failNothing
.untilJust :: Monad m => m (Maybe a) -> IterT m a
cutoff
to limit the number of iterations: cutoff :: Monad m => Integer -> IterT m a -> IterT m (Maybe a)
retract
, to run the computation: retract :: Monad m => IterT m a -> m a
However, this doesn’t satisfy the requirements: there is a useless sleep at the end.
I think I’ve done it without the extra sleep by using <|>
with an infinite list of delays (fun aside, you can generate this list to vary the delays at each step!). It was something like:
retract $ cutoff 10 $ untilJust tryTheThing <|> delay (untilJust (Nothing <$ threadDelay 1_000_000))
but I can’t find this in our code anymore, I guess I must have gotten rid of it.
Here’s my attempt, comes with two caveats:
action()
is called in two placesretry_count
is literally the retry count, so action
is called up to retry_count + 1
times const result = try for (0..retry_count) |n| {
break action() catch |err| {
if (!transient(err)) break err;
sleep(exp(n));
};
} else action();
How is the original error reported if retrying fails?
This is Zig, so action()
returns an “error union” which can either be an error or a valid value. When retrying fails, the for ... else action()
block evaluates to that error union returned by action()
. That error union is then passed to the try
statement. If it’s an error, then it is returned from the enclosing function, if it’s a value then it’s assigned to the result
variable.
If I understand you correctly, the error (or value) from the last action()
call are passed to the try
statement?
Am I misunderstanding “original” in “the original error is reported if retrying fails?” I interpreted this as the first error that triggered the retrying.
Am I misunderstanding “original” in “the original error is reported if retrying fails?” I interpreted this as the first error that triggered the retrying.
I thought the same at first until I read @matklad’s original blog post that was linked in this one:
In the event of a retry failure, the underlying error is reported. I don’t want to see just that all attempts failed, I want to see an actual error from the last attempt.
How about this?
const result = try retry: for (0..retry_count + 1) |attempt| {
break action() catch |err| {
if (!isTransientError(err) or attempt == retry_count) break err;
sleep();
continue :retry;
};
} else @panic("runaway loop");
It’s bounded, there’s no additional variable, the immediate break action()
makes the happy path obvious, and that catch |err|
is an exceptional case that needs to be handled separately.
I’m not entirely happy about the continue
inside of a break statement, which is why I added the retry:
label. It still has the additional + 1
, but that could just be thought of as Zig’s syntax for an inclusive range.
let mut first_err = None;
for attempt in 0..=retry_count {
if (attempt > 0) { sleep(); }
match action() {
Err(e) if is_transient_error(e) => { first_err.get_or_insert(e); },
r => return r
}
}
Err(first_err.unwrap())
I don’t speak Zig and thus don’t understand how the original error is reported in the blog post’s code.
Everyone seems to have interpreted the “original error” requirement as allowing the result of the last action()
to be returned, so…
for attempt in 0..=retry_count {
match action() {
Err(e) if attempt < retry_count && is_transient_error(e) => sleep(),
r => return r,
}
}
unreachable!()
I had déjà vu reading this because it’s almost word for word what I sketched out for myself yesterday :-) (Then I started over-engineering intermittent errors and threw it away!)
i’ve got two alternative solutions here:
const result = try for(0..retry_count) |retry| {
const is_first = (retry == 0);
const is_last = (retry == retry_count-1);
if(!is_first) {
sleep();
}
if(action()) |value| {
break value;
} else |err| if(!is_last and is_transient(err)) {
continue;
} else {
break err;
}
} else unreachable;
const result = try blk: {
var last_err: E = undefined;
break :blk for(0..retry_count) |retry| {
const is_first = (retry == 0);
if(!is_first) {
sleep();
}
if(action()) |value| {
break value;
} else |err| {
last_err = err;
if(!is_transient(err)) {
break err;
}
}
} else last_err;
};
How does the first one capture the original error? Do either work if retry_count == 0
?
Both won’t work if retry_count is 0, because there won’t be any call. I’ve left out the asser/@compileError
for brevity. I guess attempt_count
would’ve been better
the first one captures the error by the break err
on either a non-retryable error or the last loop iterarion.
Bit late to the party, but here’s what I came up with:
fn action_with_retries(comptime retry_count: u32) E!T {
var last_err: ?E = null;
return loop: switch (@as(u32, 0)) {
0...retry_count => |attempt| {
if (last_err) |err| {
if (!is_transient_error(err)) break :loop err;
std.Thread.sleep(std.time.ns_per_ms);
}
if (action()) |ok| {
break :loop ok;
} else |err| {
last_err = err;
continue :loop attempt + 1;
}
},
else => return last_err orelse @panic("bug"),
};
}
A for
loop version which I like even more. A bonus is that attempts_retry
doesn’t need to be comptime, and that the operation is more obviously bounded.
fn action_with_retries(attempts_retry: u32) E!T {
const attempts_first = 1;
const attempts_total = attempts_first + attempts_retry;
var last_err: ?E = null;
return for (0..attempts_total) |_| {
if (last_err) |err| {
if (!is_transient_error(err)) break err;
std.Thread.sleep(std.time.ns_per_ms);
}
if (action()) |ok| break ok else |err| last_err = err;
} else last_err orelse @panic("bug");
}