Mercury: Ultra-Fast Language Models Based on Diffusion
3 points by Yogthos
Machine learning researchers have been locked into the autoregressive bottleneck for years. A recent paper argues that diffusion models can instead perform at scale on discrete data. The researchers trained two coding models, Mercury Coder Mini and Mercury Coder Small. The Mini model reached a staggering 1109 tokens per second on H100 GPUs, with the Small model achieving 737. These models eclipsed competing speed-optimized state-of-the-art models in throughput by factors of up to ten, while retaining the ability to perform the coding tasks they were trained on. On real-world human evaluation platforms such as Copilot Arena, the Mini model tied for second place in quality alongside massive models like GPT-4o, with an average latency of only 25 milliseconds. The model matched the performance of established high-speed models such as Claude 3.5 Haiku and Gemini 2.0 Flash Lite across a variety of programming languages, while decoding up to an order of magnitude faster.
Diffusion models have a clear advantage over autoregressive ones in their ability to generate text in parallel, which is far more efficient. Standard language models are hamstrung by a serial decoding process in which answers must be produced one token at a time. Diffusion models abandon that bottleneck entirely: they learn to predict many tokens at once. You start with a sequence of random noise and run a denoising process that refines all the tokens in concert, moving from coarse to fine, until the final text emerges. This parallel generation achieves much higher arithmetic intensity and makes full use of the computational power of modern GPUs.
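The coarse-to-fine loop described above can be sketched in a few lines. This is a toy illustration, not Mercury's actual algorithm: it uses a masked-token style of discrete diffusion, a random "denoiser" in place of a trained network, and made-up names (`VOCAB`, `MASK`, `sample`). The key structural point it shows is that every position is proposed in parallel at each step, and a growing fraction of positions is committed until none remain masked.

```python
import random

# Hypothetical tiny vocabulary and mask symbol for illustration only.
VOCAB = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "+"]
MASK = "<mask>"

def sample(seq_len=8, steps=4, seed=0):
    """Toy coarse-to-fine discrete diffusion sampler (not the paper's method)."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len  # start from pure "noise": all positions masked
    for step in range(1, steps + 1):
        # One denoising pass: propose a token for EVERY position at once --
        # this is the parallel step a serial autoregressive decoder lacks.
        # A real model would produce these from a neural network.
        proposals = [rng.choice(VOCAB) for _ in range(seq_len)]
        # Coarse-to-fine schedule: after this step, step/steps of the
        # sequence should be committed. A trained model would commit its
        # most confident positions; here we pick at random.
        target_committed = round(seq_len * step / steps)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        already_committed = seq_len - len(masked)
        for i in rng.sample(masked, target_committed - already_committed):
            tokens[i] = proposals[i]
    return tokens

print(sample())  # a fully unmasked sequence after `steps` parallel passes
```

Each pass touches all positions simultaneously, so the work per step is a large batched operation rather than one token's worth, which is what lets GPUs run at high arithmetic intensity.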