Spaced Repetition Systems Have Gotten Way Better
34 points by eduard
Having recently read Bernoulli’s Fallacy by Aubrey Clayton, I wonder: where are the Bayesian approaches for spaced repetition? (Clayton pulls no punches and argues at length against the many flaws of frequentist approaches and their misinterpretations.)
So far, I’ve only found Ebisu: https://fasiha.github.io/ebisu/ “Behind these two simple functions, Ebisu is using a simple yet powerful model of forgetting, a model that is founded on Bayesian statistics and exponential forgetting.”
The Ebisu page (above) also refers to “Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory” by Mozer, Pashler, Cepeda, Lindsey & Vul: https://home.cs.colorado.edu/~mozer/Research/Selected%20Publications/reprints/MozerPashlerCepedaLindseyVul2009.pdf
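To make the “Bayesian + exponential forgetting” idea concrete, here is a rough sketch of the kind of model the Ebisu page describes, as I read it. This is my own toy code, not the library’s; the real library handles reviews at arbitrary elapsed times with a more careful posterior update behind its two main functions (predictRecall and updateRecall, if I remember the README right).

    import numpy as np
    from scipy.special import betaln

    # Recall probability p at some halflife h gets a Beta(a, b) prior.
    # At elapsed time t, recall is p ** (t / h); its expectation under the
    # prior has a closed form via the Beta function.
    def predict_recall(a, b, halflife, elapsed):
        """E[p ** (elapsed / halflife)] with p ~ Beta(a, b)."""
        delta = elapsed / halflife
        return np.exp(betaln(a + delta, b) - betaln(a, b))

    def update_at_halflife(a, b, success):
        """Bayes update for a review at exactly the halflife (delta == 1),
        where the Beta prior is conjugate to the Bernoulli quiz result."""
        return (a + 1, b) if success else (a, b + 1)

    a, b, h = 3.0, 3.0, 24.0               # prior: ~50% recall at a 24-hour halflife
    print(predict_recall(a, b, h, 48.0))   # expected recall two halflives out (~0.29)
    a, b = update_at_halflife(a, b, True)  # passed a review at the halflife
    print(predict_recall(a, b, h, 48.0))   # the model is now more optimistic (~0.36)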
I imagine a Bayesian approach offers some flexibility over FSRS, especially in relation to user heterogeneity. On the other hand, since the goal isn’t getting reliable estimates for interesting parameters but just presenting a flashcard at the best time, MLE with a well-selected default algorithm makes good sense to me.
Bayesian statistics is logically sound, extending formal logic to handle uncertainty. Frequentist methods, in contrast, are fundamentally flawed; they confuse forward (P(Data|Hypothesis)) with reverse (P(Hypothesis|Data)) probabilities, relying on the imaginary concept of long-run frequencies instead of rationally updating beliefs based on evidence. Since Maximum Likelihood Estimation (MLE) is a frequentist technique, it inherently suffers from all these same critical flaws. If you are unfamiliar with these critiques, I refer you to Clayton, Jaynes, Jeffreys, Savage, Lindley, Box, Berger, Gelman, Pearl, McElreath, Cohen, Meehl, and dozens (or hundreds?) of other scholars. It is eye-opening to review the history of frequentist statistics, ranging from unsavory connections to eugenics to poor reasoning in legal cases to the reproducibility crisis across many experimental fields of science.
This is a somewhat dated view. In the 1960s there was enormous hope that Bayesian statistics was a firm foundation because of the theorems connecting Bayesian procedures to the behavior of rational actors. Unfortunately, it turns out you need to consider classes of priors, not just single priors, and when you do, all the rational-actor connections fall apart. So Bayesian statistics doesn’t provide a means of “rationally updating beliefs based on evidence.” But everyone should read Jimmie Savage’s book because it’s awesome.
Decision theory is the usual foundation for choosing statistical procedures today. Expected risk given a prior is one way of choosing a procedure, but not always a good one. For adversarial situations, minimax is a much better criterion. Sadly, decision theory isn’t usually taught until graduate-level courses in statistics, even though it’s laid out pretty straightforwardly in machine learning and game theory (machine learning being high-dimensional statistical inference, and statistical inference being a one-player game against nature).
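A toy illustration of how the two criteria can disagree (made-up numbers, nothing to do with any real problem): nature picks theta in {0, 1}, we see one coin flip whose bias depends on theta, and a miss costs more than a false alarm.

    # Deterministic decision rules x -> guess, compared by Bayes risk vs. max risk.
    p_x1 = {0: 0.3, 1: 0.7}          # P(X = 1 | theta)
    rules = {
        "always 0": lambda x: 0,
        "always 1": lambda x: 1,
        "guess x":  lambda x: x,
    }

    def risk(theta, rule):
        # expected loss over X given theta: loss 1 for a false alarm, 4 for a miss
        total = 0.0
        for x in (0, 1):
            px = p_x1[theta] if x == 1 else 1 - p_x1[theta]
            guess = rule(x)
            loss = 0 if guess == theta else (1 if theta == 0 else 4)
            total += px * loss
        return total

    prior = {0: 0.9, 1: 0.1}
    for name, rule in rules.items():
        bayes_risk = sum(prior[t] * risk(t, rule) for t in (0, 1))
        max_risk = max(risk(t, rule) for t in (0, 1))
        print(f"{name:9s}  Bayes risk {bayes_risk:.2f}  max risk {max_risk:.2f}")

Under that prior, “guess x” minimizes the Bayes risk, but “always 1” minimizes the maximum risk, so the two criteria genuinely pick different procedures (randomized rules can do even better on the minimax criterion, but that’s beside the point).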
It is eye-opening to review the history of frequentist statistics, ranging from unsavory connections to eugenics to poor reasoning in legal cases to the reproducibility crisis across many experimental fields of science.
This has nothing in particular to do with frequentist statistics. Bayesian procedures are every bit as susceptible to misuse. They just aren’t used as much so there are fewer cases of it.
This has nothing in particular to do with frequentist statistics.
Clayton would disagree and wrote several chapters on this. His argument is that the history matters. Underlying the entire project of claiming that statistics can be objective was a clear motive to veil various socio-political goals as objective. One could attempt to extricate this history and claim the frequentist methods remain in a sort of vacuum, but I think this misses Clayton’s whole point. Someone doing Bayesian statistics transparently cannot claim objectivity (unlike a frequentist). Part of the culture means welcoming further discussion and additional information.
Specifically, the frequentist emphasis on significance testing might be the worst contributor to the reproducibility crisis.
Bayesian procedures are every bit as susceptible to misuse.
On what basis do you claim this? Certainly not the same kinds of misuse!
They just aren’t used as much so there are fewer cases of it.
This is easy to say, but it is actually an empirical, counterfactual question that would have to be studied. Which would lead to a discussion of Pearl and recent work on causality.
I had never heard of Clayton before, so I did some digging through his writing and background. I see no reason why I should take this man seriously. He doesn’t seem to have actually worked as a statistician or a scientist, and anyone who claims Jaynes as his major intellectual influence is fighting against a pretty brutal prior on my part.
In practice the kinds of misuse are basically the same. Inference gets far more attention from philosophers and mathematicians than it really deserves compared to issues like design of experiments. Thinking back through all the papers I’ve dug through over the course of my career, very few died of problems with inference. The reason that, say, lab biology continues to use basically the techniques from Fisher’s old book is that they are good enough. On the other hand, Bayesian online designs in clinical trials are truly an improvement.
I’ve read many (thousands?) of pages of statistics and machine learning books and papers, and I’m quite familiar with a common approach of “not making waves” regarding Frequentist versus Bayesian interpretations, where authors say, more or less, “they both have their place”. I’ve said this kind of thing in the past, but I’ve changed. The “can’t we all get along?” viewpoint feels like bland relativism now.
This change is largely the result of reading Clayton’s “Bernoulli’s Fallacy” (and Judea Pearl’s “The Book of Why”). Learning about the historical context and personalities of Fisher and others was eye-opening. Clayton’s book might not be to everyone’s taste — it is tough to write such a book for a non-statistical audience. Clayton makes clear arguments, doesn’t hide his point of view, and doesn’t hold back. This is refreshing.
When I think back to one of my statistics classes in public policy school, I remember how carefully the professor phrased what a significance test actually claims. It goes something like this:
For p=0.05: If the null hypothesis were true, and you repeated this exact experimental procedure infinitely many times, you would observe a test statistic as extreme or more extreme than what you actually observed in 5% or fewer of those repetitions.
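Spelled out as a toy simulation (my own illustration, not something from the class): draw the test statistic many times under the null and count how often it lands at least as far out as the one observed.

    import numpy as np

    rng = np.random.default_rng(0)
    z_obs = 1.96                                 # the statistic we actually observed
    reps = rng.standard_normal(1_000_000)        # "repeat the experiment" under H0
    p_sim = np.mean(np.abs(reps) >= abs(z_obs))  # fraction as extreme or more extreme
    print(p_sim)                                 # ~0.05, and it says nothing about P(H0 | data)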
At the time, I noticed its awkwardness and made a point not to get it backwards or wrong. Now, I view it almost like a trick. Or at least a footgun waiting to go off. We have many scientific disciplines publishing papers in ways that are pretty much guaranteed to be misinterpreted. It feels unfair to blame people for not understanding this level of complexity! Clayton makes the case to blame frequentism itself.
A frequentist statistical test focuses on P(data | hypothesis), but this is the wrong question. When doing inference, we care about P(hypothesis | data) – to calculate that, we need Bayesian foundations.
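A textbook-style example of the gap (made-up numbers, not from Clayton): a diagnostic test with P(positive | disease) = 0.95 can still leave P(disease | positive) small when the prior is small.

    p_h = 0.01                 # prior: 1% prevalence
    p_d_given_h = 0.95         # P(data | hypothesis): sensitivity
    p_d_given_not_h = 0.05     # false-positive rate
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
    p_h_given_d = p_d_given_h * p_h / p_d
    print(p_h_given_d)         # ~0.16, nowhere near 0.95

Reporting the 0.95 when the question on the table is the 0.16 number is exactly the forward/reverse confusion.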
So why do so many people hang on to frequentism? Clayton offers various explanations. Relative to the history of statistics, modern Bayesian techniques are relatively new, computationally intensive, and require a prior, which feels subjective. So there are many kinds of pushback – but this “pushback” can be addressed.
Putting aside your dismissal of Clayton’s background, did you look at the arguments he makes? These are hardly new, and they are hardly original. In any case, if you disagree, surely your counterpoints have been written down already? Point me to a paper or book that characterizes your views (including criticisms of Jaynes), and I’ll read it.
To your comment above… Thanks for admitting your dislike of Jaynes. Given that, there was probably little chance that you would like Clayton or take him seriously. I don’t think Clayton’s background really mattered much; you likely would have found something you didn’t like and cherry-picked that as your reason. Am I wrong?
Finally, to me, the patterns of misuse between a Frequentist and Bayesian are different enough to matter. Here are some. (1) Bayesians won’t assume a 5% statistical significance test is a good idea, much less a good idea by convention. (2) Bayesians don’t have to carve out special techniques to fight the base rate fallacy. (3) Bayesians don’t have to consult a Byzantine flow chart to do statistical testing. (4) Bayesians openly admit subjectivity whereas Frequentists often claim objectivity while e.g. actually making subjective choices around the choice of reference classes. (These are just some of the differences.)
I’m quite familiar with a common approach of “not making waves” regarding Frequentist versus Bayesian interpretations, where authors say, more or less, “they both have their place”. I’ve said this kind of thing in the past, but I’ve changed. The “can’t we all get along?” viewpoint feels like bland relativism now.
It’s not “they both have their place” or “can’t we all get along.” It’s that the underlying structure is decision theory which unifies them along with a bunch of other stuff.
I don’t think Clayton’s background really mattered much; you likely would have found something you didn’t like and cherry-picked that as your reason. Am I wrong?
Yes, you’re wrong. I read through some of his articles that he has written on his website and was unimpressed enough to not want to bother dipping into his book. Similarly, I read Jaynes’s work years ago, and I think he fell prey to a simple, wrong answer about the structure of science.
A frequentist statistical test focuses on P(data | hypothesis), but this is the wrong question. When doing inference, we care about P(hypothesis | data) – to calculate that, we need Bayesian foundations.
No, you’re still treating this far too narrowly. It’s a two-way decision procedure. You’re choosing a function from data to decision and trying to choose well. The expected loss given a prior is one way to choose, but not the only one. Any statistical approach which can’t encompass minimax procedures is inherently incomplete. Now, every admissible minimax procedure corresponds to some prior, but that doesn’t mean that you choose the prior; it means that Bayes procedures are a superset of the admissible procedures.
A p-value is a summary statistic. It intentionally does not include a prior. Yes, a lot of fields use a really simple, arbitrary decision procedure based on that p-value. But if you are in a lab doing an experiment, you typically don’t have a prior probability on its outcome. You have some features of your belief that correspond to a class of priors. What’s more important than having the perfect decision procedure is doing the experiment again. Again, inference’s importance is vastly overstated.
Thanks for engaging, I really appreciate it.
First, what are some papers or books I should read to better understand your points? Savage? Berger? Others?
Second, what core aspect of science did Jaynes get wrong from your POV?
Regarding decision-theoretic underpinnings of statistics, I have various comments and questions:
1. I’m fairly green to the arguments connecting decision theory to statistics, but I’m interested to read more. Point me in useful directions, please.
2. It seems likely that many/most defenders of frequentist statistics don’t have the understanding of decision theory’s connections that you do. So you might find yourself defending frequentist methods for different reasons than a typical practitioner. Fair? Have you noticed this?
3. I appreciate when one theory can synthesize and compare and contrast other theories. But this alone does not justify the other theories. They still need to make sense in context; an underlying theory hopefully unpacks the rationale. Your comments speak to this, but I’ll need to see it for myself.
4. Nor is a “higher” level theory necessarily better. In particular, I don’t see why (intellectually) decision theory is a “better” foundation — perhaps it is only a different lens? I’ve been exposed to a few decision theories, and it starts to feel pretty abstract. Building bridges to thought experiments is fun, but applying them to daily work is harder.
5. A new “deeper” theory such as decision theory brings a new set of assumptions. Are these considered solid and acceptable to both frequentists and Bayesians? I’m curious, on a sociological level, how well decision theory has succeeded in building bridges to practitioners. Thoughts?
6. How and why did you find the connection between decision theory and statistics?
7. As one example of #4, it seems clear that the only correct probabilistic generalization of logic is Bayesian statistics. I’m not trying to be hyperbolic here. I recognize that there can be a significant computational cost involved.
8. I’m willing to (tentatively) grant that under different decision criteria one might be better served choosing frequentist methods, but I haven’t yet seen a principled reason for doing so (other than, say, “Bayesian computation is harder” or “picking a prior feels messy or arbitrary”).
9. Will studying decision theory help me practically make good tradeoffs? Perhaps even build a kind of decision tree on which techniques to use and when?
I started with Kiefer’s ‘Introduction to Statistical Inference’ (my statistics education was very peculiar), which suffers from baroque notation and from skipping over a lot of results, but I think it does a good job of setting up the structure. I do like Savage’s book, but I sometimes feel like I read a different book than other people did when they talk about it. Berger’s text on Bayesian statistics is a really good read. It’s where I was pointed to the issues with Bayesian inference under classes of priors.
The key thing that Jaynes got wrong is that he wanted to treat the output of experiments as probabilities without recognizing that scientists actually construct symbolic systems that mirror the range of observations they’re trying to reflect. Those symbolic systems may contain probability as an ingredient, but they aren’t themselves the output of an experiment with a Bayesian posterior probability attached. I wrote a book about this (https://www.madhadron.com/science/into_the_sciences.html).
So for your points:
1. Decision theory goes back to Wald in the 1940s. His book is still pretty good. A lot of it arose from trying to apply statistics to problems that weren’t the controlled experimental setups that Gosset, Fisher, et al. were thinking about. It’s a similar path to the one that led into game theory and operations research.
2. This is perhaps true. I will defend fields like agronomy or a lot of lab biology continuing the old Fisherian methods, because I truly don’t think that retraining them is a good use of very scarce time. My experience in infectious disease and microbiology is that misuse of these tools was very rarely the thing that rendered papers invalid. If I had some stats training time for a lab, it would go straight to Pearl’s causal graphs and more design of experiments. I also don’t think that the frequentist “long run of identical trials” interpretation is actually necessary to how the techniques are used in practice. When a microbiologist does a t-test, they are assuming that a given state of nature will produce some range of random outcomes, which works fine under both frequentist and Bayesian interpretations of probability.
3, 4, and part of 5. If you look at graduate-level stats textbooks like Casella & Berger, you’ll find decision theory as one of the topics. This isn’t new stuff; all of inference was recast in this form decades ago, and having it as a structure led to Breiman and folks like him developing whole new ranges of statistical procedures that basically turned into machine learning. They don’t even call it decision theory in machine learning; it’s just chapter 1 of the textbook. For practitioners in traditional sciences it isn’t as relevant, but that’s because a given field of science uses a subset of statistics adapted to its work and updates it very rarely, such as the jump to Bayesian methods in particle physics in the early 2000s because it made systematic computational work on their data so much more straightforward.
The rest of 5. Decision theory has the same set of assumptions as probability theory. Neither of them assumes an interpretation of probability. You can use decision theory with the combinatoric approach where they prove something has non-zero probability to demonstrate the existence of a certain combinatorial object, which just uses probability as a convenient intuition over measure theory. It works fine with statistical quality control, where the frequentist interpretation makes sense because the first notion is that of a process being “in control.” It works fine in purely Bayesian settings.
6. My intro textbook (Kiefer) started with decision theory. Since it was invented by a statistician (Wald) working in statistics, there’s not really a connection to be discovered.
7. I agree that Bayesian probability is the only probabilistic generalization of logic (well, leaving aside various quantum things that people have dreamt up). I’m not convinced that a probabilistic generalization of logic is actually that interesting an object, though. If you treat something like scientific work as a piece of game theory, you are dealing with discrete decisions and you reason about discrete strategies. There is a theory of differential games, but it is built from a limit of discrete games with very closely spaced decisions. And scientists, who are constructing symbolic mirrors of reality (theories), are in the end making discrete decisions about what to do.
8. The classic example of when starting with a prior is the wrong way to go about choosing a procedure isn’t frequentist at all: it’s when you’re playing against a thinking opponent who gets to choose their move to make your life as bad as possible. In that case you want to minimize your maximum loss under any choice of their move (what are called minimax procedures). There is some Bayesian prior that will produce the minimax procedure, but that isn’t the criterion you’re choosing under.
9. Yes, considering that this is what led to decision trees in machine learning, and all their extensions like random forests. I have used it for a number of medical and public health problems, both personal (optimal strategies for my mother’s breast cancer treatment given all the information from the oncologists) and public (some calculations for various groups during covid).
I have a fairly good background, and if I speak to other practitioners over dinner I’m more likely to advocate for Bayes as part of one’s personal philosophy than against it. I rarely reach for Bayesian techniques as part of my toolkit, though.
In this problem we are trying to fit parameters in a practically simple model of forgetting. The spaced-repetition user may not care whether the choice of model comes from a posterior distribution or if it’s merely the best on some benchmark. Unfortunately, the best methods according to the benchmark avoid overfitting by regularization. This is of course nonsense; we ought to specify a prior over the weights in our neural network.
Of course we don’t really care what the parameters are for FSRS. We’re making a pragmatic choice to model the per-item memory state after each review that we think will do a pretty good job of minimizing the number of repetitions. If we really wanted to understand the relative difficulty of items, say, it makes a lot of sense to be good Bayesians and be able to read P(D_i > D_j) right from the model.
In this case we have to calculate an interval exactly. If we’re Bayesians we would use MAP instead of MLE, which sucks because point estimates lose most of what’s good about the whole Bayes thing but c’est la vie, at least it’s rational. Okay what was our prior over those…good grief there are 19 parameters? And half of them are just “crude heuristics” we don’t have or want any intuition for?
At this point I’m tempted to give up and appeal to objectivity: get out the trusty uniform prior, write up some numerical integration code and–wait a minute, doesn’t the uniform prior mean MAP = MLE?
I’m being silly here, but this is roughly what I end up thinking a few times a year when I say to myself, “why not try a Bayesian approach?”
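To make the punchline concrete, here is the algebra in a toy linear-Gaussian setting (made-up data, nothing like FSRS’s actual 19-parameter model): a Gaussian prior on the weights turns MAP into L2-regularized MLE, and flattening the prior makes the penalty vanish, so MAP collapses back to MLE.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + rng.normal(scale=0.5, size=50)

    sigma2 = 0.25   # observation noise variance (assumed known here)
    tau2 = 1.0      # variance of a zero-mean Gaussian prior on each weight

    # MLE (= MAP under a flat prior): ordinary least squares.
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)

    # MAP under w ~ N(0, tau2 * I): ridge regression with lambda = sigma2 / tau2,
    # i.e. L2 regularization is just a Gaussian prior in disguise.
    lam = sigma2 / tau2
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

    print(w_mle)
    print(w_map)    # shrunk toward zero; as tau2 grows the two estimates coincide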
I remember when I viewed it as merely a “philosophical” difference. But now, I view it as a matter of intellectual honesty: are we trying to answer P(hypothesis | data) or P(data | hypothesis)? Once we agree, there are better and worse ways of answering the question.
I’m working on a Japanese learning app for comprehensible input that is based on the Ebisu model. I went specifically looking for Bayesian approaches as well since they are the most intuitive to me.
Great write-up, and the linked repo with the spaced repetition algorithm benchmarks is very interesting too:
While I lived in China I used Anki to learn vocabulary for my daily life needs. Like the author, I also warmly recommend it.
Seeing that it’s been updated with a better learning model makes me feel an itch to get back into language learning. Thanks for posting the article!
I used it for multiple languages. At this point anyone learning a language without spaced repetition for vocabulary is just not being serious about it. Insert rant about language instruction in the US public schools here.
How do you use Anki for language learning? I’d like to start.
There are several pieces to effective language learning, and Anki is one of them. You need to train your ear and mouth to be able to distinguish all the sounds of the language (such as English speakers learning to distinguish the nasalized vowels in French or the tones in Mandarin). You need the grammar of how you put sentences together and modify words to fit into them. You need the body of cultural information about greetings, thank yous, etc. You need conversational and reading practice. And you need a body of vocabulary to fit into the rest. Anki does that last part.
As you learn words, you put them into Anki. The exact things you put in vary from language to language. For French, I would put in English -> French, French -> English, the gender of nouns, and any irregular forms for verbs. For Japanese, I put in English -> hiragana, English -> kanji, kanji -> hiragana, kanji -> English, and hiragana -> English, plus whatever other forms I needed.
Once you’re used to learning new languages and making yourself understood in languages you don’t speak super well, vocabulary becomes the biggest limiting factor, and we don’t know a faster way to build working vocabulary than spaced repetition systems.
Do you have a recommended deck, or did you build your own?
Not the person you asked but I’ll give my personal experience.
I took an A1 class and added everything from there. I was also adding anything that I would learn or see outside of class. Within a few weeks, I already had a good deck.
Also, native speakers are mostly enthusiastic about it and often offer “useful” additions. This way I also learnt some slang that no school would teach to an A1 student, but that was a lot of fun.
It was quite a few years ago, so I would be surprised if there aren’t consolidated and updated decks by now. I was using some standard education decks plus manually made decks from other users, and a deck with the different provinces, and a deck with different foods (including pictures of the foods).
This is actually very exciting, as I always found my Anki decks to be a lot of pain with uncertain gain. Thanks for sharing!
Sure, but Anki is an utterly unusable dogshit piece of software. I’m much happier and much more productive using Wanikani.
I’d be surprised if there aren’t many more people who feel like this, which goes to show you: your algorithm can be the best in the world, but if your UX is bad, it doesn’t really matter that much.