VaultGemma: A differentially private LLM (2025)

7 points by badcryptobitch


dubiouslittlecreature

Cynical but probably lukewarm take: this isn’t about privacy or protecting people’s PII if it shows up in training data. (Though they do have liability reasons to want to do that, I don’t think that was the only reason; they would treat lawsuits from affected individuals as the cost of doing business.)

I think it’s mainly about being able to deny what data they trained on, because lawsuits from large, wealthy copyright-holding corporations are far harder to defend against.

mdaniel

[pdf] (2025)

and, to avoid being a completely pedantic comment, here is the abstract to save you a click

We introduce VaultGemma 1B, a 1 billion parameter model within the Gemma family, fully trained with differential privacy. Pretrained on the identical data mixture used for the Gemma 2 series, VaultGemma 1B represents a significant step forward in privacy-preserving large language models. We openly release this model to the community.

"release ... to the community" appears to be a click-through license and not one of the friendlier weights licenses
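For anyone wondering what "fully trained with differential privacy" means mechanically: the standard approach is DP-SGD, where each example's gradient is clipped to a fixed norm and calibrated Gaussian noise is added to the averaged batch gradient. A minimal sketch of one such update step (function name and parameter values are illustrative, not taken from the paper):

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD-style update: clip each example's gradient to clip_norm,
    average, then add Gaussian noise scaled by noise_multiplier * clip_norm.
    (Illustrative sketch; the paper's actual hyperparameters differ.)"""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds clip_norm; leave others as-is.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Noise standard deviation follows the usual DP-SGD calibration:
    # sigma = noise_multiplier * clip_norm / batch_size.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return mean_grad + noise

# Usage: three fake per-example gradients for a 2-parameter model.
grads = [np.array([3.0, 4.0]), np.array([0.1, 0.2]), np.array([-1.0, 0.5])]
noisy_update = dp_sgd_step(grads)
```

The clipping bounds any single example's influence on the update, and the noise is what yields the formal (epsilon, delta) guarantee; the cost is the utility gap the paper is trying to narrow.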