A Technical Tour of the DeepSeek Models from V3 to V3.2
5 points by refi64
Note that this assumes some prior knowledge of transformer architecture, but if you know how attention works, you should at least be able to get the overall idea.