A Technical Tour of the DeepSeek Models from V3 to V3.2

5 points by refi64


Note that this assumes some prior knowledge of the transformer architecture, but if you know how attention works, you should at least be able to follow the overall idea.