VaultGemma: A differentially private LLM (2025)
7 points by badcryptobitch
Cynical but probably lukewarm take: this isn't about privacy or protecting people's PII if it shows up in training data (they do have liability reasons to want that, but I don't think it was the only reason; they would treat lawsuits from affected individuals as a cost of doing business).
I think it's mainly about being able to deny what data they trained on, because lawsuits from large, wealthy copyright-holding corporations are far harder to defend against.
Idealistically, learning more about overfitting and controlling it is generally a good idea. Might also help against single-source poisoning of larger models?
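(For reference, the formal guarantee is what would limit single-source influence: it bounds how much any one training example, or via group privacy any small set of examples from one source, can shift the trained model's output distribution. In the usual (ε, δ)-DP notation, for any neighboring datasets D and D' differing in one record and any set of outcomes S:

    \Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta

This is a textbook statement of the definition, not something specific to the VaultGemma paper.)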
Cynically, individual lawsuits are peanuts either way compared to a political decision by a majority of European DPAs to raise the question of a willful, structural inability to comply with a GDPR request to fully cease actively processing an individual's PII. It's a question of political priorities, sure. But political priorities can shift quickly right now, and Google is already guilty of enough GDPR violations (and has been fined for some) to need options for correcting things.
Pragmatically, they talk about differential privacy in the context of PII because that's how the ten most relevant papers in the prior literature are written.
[pdf] (2025)
and, to avoid being a completely pedantic comment, here is the abstract to save you a click
We introduce VaultGemma 1B, a 1 billion parameter model within the Gemma family, fully trained with differential privacy. Pretrained on the identical data mixture used for the Gemma 2 series, VaultGemma 1B represents a significant step forward in privacy-preserving large language models. We openly release this model to the community.
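For anyone unfamiliar with what "fully trained with differential privacy" means in practice: the standard recipe is DP-SGD style training, i.e. clip each example's gradient and add calibrated Gaussian noise before the update. Here's a minimal toy sketch on a linear model; the hyperparameters and names are illustrative, not taken from the paper, and the real thing adds privacy accounting, batch sampling schemes, etc.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression data: y = x @ w_true + noise
    n, d = 256, 8
    x = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = x @ w_true + 0.1 * rng.normal(size=n)

    w = np.zeros(d)
    clip_norm = 1.0        # C: max per-example gradient L2 norm
    noise_multiplier = 1.1 # sigma: noise stddev is sigma * C
    lr = 0.1
    batch_size = 32

    for step in range(500):
        idx = rng.choice(n, size=batch_size, replace=False)
        clipped = []
        for i in idx:
            # Gradient of 0.5 * (x_i @ w - y_i)^2 w.r.t. w
            g = (x[i] @ w - y[i]) * x[i]
            # Clip this example's gradient to norm <= C,
            # bounding its individual influence on the update
            g = g / max(1.0, np.linalg.norm(g) / clip_norm)
            clipped.append(g)
        # Sum clipped gradients, add Gaussian noise scaled to sigma * C,
        # then average over the batch
        noisy_grad = (np.sum(clipped, axis=0)
                      + noise_multiplier * clip_norm * rng.normal(size=d)) / batch_size
        w -= lr * noisy_grad

    print("param error:", np.linalg.norm(w - w_true))

The clip + noise step is the whole trick: no single example (or memorized snippet) can dominate any update, which is why DP training also suppresses verbatim regurgitation, at the cost of utility at a given compute budget.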
"release ... to the community" appears to be a click-through license and not one of the friendlier weights licenses