The future of Siri, or: why private inference isn’t private enough

19 points by achivetta


david_chisnall

This is a great summary. When I worked on confidential cloud computing, there was suddenly a push to do confidential machine learning. We built accelerators that had a strong set of guarantees about how data could flow, precisely what code was running, and what data could be exfiltrated. The part of the team that focused more on the ML side concluded that there were a number of open research problems to determine whether it was even possible to build models that wouldn’t leak data when used in multi-party scenarios.

There were some important use cases for this kind of thing. For example, a consortium of banks wish to build a shared model for fraud detection without sharing transaction details with each other. A set of hospitals want to build diagnostic models without sharing patient data with each other. In both cases, you want guarantees that the trained model cannot leak anything. Ideally, you also want guarantees that one of the entities providing (encrypted, decrypted only in the TEE) data cannot taint it in such a way that they get a good model and everyone else gets a subtly broken one. Both of these turned out to be impossible guarantees to make with any existing techniques (not necessarily impossible in theory, just really hard and we don’t know how to do it). There were some interesting approaches, but they all had the problem that model utility degraded quickly as anonymity increased. This was fine for things where you have millions of users’ data and you want a model that doesn’t leak anything about any individual, or even small group. It doesn’t help at all if you have a set of a few friends / colleagues and you want to avoid leaking anything about any of them to the others.
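
To make the utility/anonymity trade-off concrete, here is a minimal sketch using the textbook Laplace mechanism on a simple mean. It is not one of the approaches referred to above, and it says nothing about deep models; the function names and the choice of values clamped to [0, 1] are assumptions for illustration only. The point is just that the noise needed for a fixed privacy budget scales as 1/n, so it vanishes for millions of records and swamps the signal for a handful.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noise_scale_for_mean(n: int, epsilon: float) -> float:
    # With values clamped to [0, 1], replacing one record moves the mean
    # by at most 1/n, so the Laplace scale for epsilon-DP is 1 / (n * epsilon).
    return 1.0 / (n * epsilon)


def private_mean(values, epsilon: float, rng: random.Random) -> float:
    # Hypothetical helper: clamp, average, then add calibrated noise.
    clamped = [min(1.0, max(0.0, v)) for v in values]
    true_mean = sum(clamped) / len(clamped)
    scale = noise_scale_for_mean(len(clamped), epsilon)
    return true_mean + laplace_noise(scale, rng)


if __name__ == "__main__":
    for n in (5, 1_000, 1_000_000):
        print(n, noise_scale_for_mean(n, epsilon=0.1))
```

With five participants and epsilon = 0.1 the noise scale is larger than the entire [0, 1] range of the data, which is the "few friends / colleagues" case. A production implementation would also need to deal with floating-point issues in the noise sampling, which this sketch ignores.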

For inference, it’s much harder. Designing a model such that you can combine data from two sources, run an arbitrary query (prompt), and not leak the data is impossible. Importantly, no amount of post-facto filtering can prevent the author of the prompt from exfiltrating things via covert channels in the response. The only way you can achieve this kind of isolation is if you start by solving the first problem: multi-party training with strong differential privacy guarantees. And we don’t know how to do that for small sets.
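
A toy illustration of why output filtering cannot close covert channels: the sketch below hides one byte in the choice between synonyms, so the response never contains the secret in any form a filter could match. Everything here (the word list, `encode`, `decode`, `naive_filter`) is hypothetical and stands in for what a prompt author could ask a model to do with far subtler signals such as punctuation, ordering, or phrasing.

```python
# Each slot offers two interchangeable words; picking the first encodes a 0
# bit and picking the second encodes a 1 bit.
SLOTS = [
    ("big", "large"), ("quick", "fast"), ("begin", "start"), ("help", "assist"),
    ("show", "display"), ("often", "frequently"), ("maybe", "perhaps"), ("buy", "purchase"),
]


def encode(byte: int) -> str:
    # Emit one word per slot, least significant bit first.
    return " ".join(pair[(byte >> i) & 1] for i, pair in enumerate(SLOTS))


def decode(text: str) -> int:
    # The prompt author knows the slot table and reads the bits back out.
    words = text.split()
    return sum(pair.index(word) << i for i, (pair, word) in enumerate(zip(SLOTS, words)))


def naive_filter(text: str, secret: int) -> bool:
    # Stand-in for post-facto filtering: pass the response only if the
    # secret does not appear literally.
    return str(secret) not in text


if __name__ == "__main__":
    secret = 173
    response = encode(secret)
    print(response, naive_filter(response, secret), decode(response))
```

A smarter filter could target this particular encoding, but the author of the prompt picks the encoding, and the filter has to anticipate all of them.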

Note that all of this was assuming a perfect TEE: It runs exactly the code that you think it’s running, on the data that you think it’s using. It will send the results, encrypted, to precisely the person who should have them. The key release happens online and there is no possibility of replay attacks. There are no side channels or other mechanisms that allow the operator to violate confidentiality or integrity. That’s something that we’ve only approximated in practice (Apple’s private cloud is the closest I’ve seen anyone get, Google’s confidential computing offerings were a long way away from this). And even with that, there are no guarantees once you bring deep neural networks into the system.
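
As a rough sketch of what "online key release with no replay" means, the toy below has a key-release service hand out the data key only when a quote covers both the expected code measurement and a fresh, single-use nonce. This is an assumption-laden simplification, not how any particular vendor's TEE works: a shared-secret HMAC stands in for a hardware-rooted signature, and `KeyReleaseService`, `make_quote`, and the measurement value are all hypothetical names.

```python
import hashlib
import hmac
import secrets

# Hypothetical stand-in for a hardware-rooted attestation key. Real TEEs use
# asymmetric signatures chained to a vendor root, not a shared secret.
HARDWARE_KEY = secrets.token_bytes(32)
EXPECTED_MEASUREMENT = hashlib.sha256(b"approved inference binary").digest()


def make_quote(measurement: bytes, nonce: bytes, key: bytes = HARDWARE_KEY) -> bytes:
    # The quote binds the code measurement to the verifier's fresh nonce.
    return hmac.new(key, measurement + nonce, hashlib.sha256).digest()


class KeyReleaseService:
    def __init__(self, data_key: bytes) -> None:
        self._data_key = data_key
        self._outstanding = set()

    def challenge(self) -> bytes:
        # A fresh nonce per request is what rules out replaying an old quote.
        nonce = secrets.token_bytes(16)
        self._outstanding.add(nonce)
        return nonce

    def release(self, measurement: bytes, nonce: bytes, quote: bytes):
        if nonce not in self._outstanding:
            return None  # unknown or already-used nonce: treat as a replay
        self._outstanding.discard(nonce)  # single use, even on failure
        expected = make_quote(measurement, nonce)
        if measurement != EXPECTED_MEASUREMENT:
            return None  # not the code we agreed to run
        if not hmac.compare_digest(quote, expected):
            return None  # quote was not produced by the trusted key
        return self._data_key
```

Even if every check here were implemented perfectly in hardware, it only establishes which code sees the data and who gets the result. It says nothing about what the model inside that code reveals through its outputs, which is the gap the paragraph above is pointing at.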

It looks as if Apple has moved into the ‘lying to customers about the security properties of machine learning systems’ phase of the AI bubble.