Reservoir Sampling

49 points by uncenter


samwho

Hello! o/

I’m the author of this post, glad you’re enjoying it. Happy to answer any questions you have about it :)

imadr

The animations are really smooth and well designed, the rest of the website is a banger too I love this one in particular https://samwho.dev/turing-machines/

justinpombrio

Working through this before reading the post for fun. (I read just up to the description of the problem.)

My interpretation of the problem:

An anthropomorphic wolf keeps handing you cards one at a time. When handed a card, you can either drop it on the ground, or hold on to it (and thereby drop any card you were previously holding). Eventually the wolf will say “woof!” instead of handing you a card. At this point, the card that you’re holding must be chosen uniformly at random from among the cards you were handed.

You must hold onto the first card that you’re given, because the wolf might say “woof!” next. After that, assume by induction that you’ve been handed N cards and you’re currently holding one of the N chosen uniformly at random. Now you’re handed card number N+1. You need to decide whether to keep it or discard it. You know that you want to be holding it with probability 1/(N+1), so keep it with that probability.

We can check whether this results in a uniform probability distribution. Define P(i) to be the probability that you’re holding the ith card after N+1 steps.

I’m guessing the next problem is that you want to hold onto more than one card, say K cards.

You must hold onto the first K cards that you’re given. After that, assume by induction that you’ve been handed N cards and you’re currently holding K of them chosen uniformly at random from among the N. Now you’re handed card number N+1. You need to decide whether to keep or discard it, and if you keep it you need to decide which of your existing cards to toss out. You know that you want to be holding it with probability K/(N+1), so keep it with that probability. And by symmetry, if you do keep it them discard one of your existing K cards uniformly at random.

Again, let’s check whether this results in a uniform probability distribution:

zzing

I love the interactivity of the concept.

owl

I used reservoir sampling in a passphrase generator, to keep memory usage proportional to the number of words requested, rather that the wordlist. First I used mmap for the common case and just read it all into memory if the wordlist was on stdin. It’s slower than mmap now but works well for both cases (and not slow; it’s faster than e.g. rbw generate --diceware, which has a wordlist embedded in the executable.)

Riolku

Very cool concept, thanks for sharing! Reminds me of a statistics problem from my undergrad, where you have 10 keys to open a door and only one works. What’s the probability I get it in k attempts?

When you use a key, you discard it and don’t use it again.

deepchasm

TL;DR: Nice graphics solving “Pick a random item from an unknown number of items”. Summary is pick to ‘save’ one item with probability of 1/n, with n being items seen so far.

swaits

See also: Laplace’s Rule of Succession

itamarst

I have come up with a similar (perhaps already existing) algorithm for reservoir sampling where samples are spread out approximately equally in time, useful for profiling that outputs a timeline (https://sciagraph.com). I should really write it up…