
The boffins on Google's DeepMind team unveiled an experimental new language model this week that uses techniques originally developed for AI image generators to boost text output performance by as much as 4x when running on resource-constrained consumer hardware. It's free to download and you can run it with just 18 GB of DRAM or VRAM. The model, codenamed DiffusionGemma, is the latest addition to Google's open weights model family. But unlike Gemma 4, which launched this spring, the 26 billion-parameter mixture of experts (MoE) model isn't a large language model in a conventional sense. Instead, it's actually closer to image models like Stable Diffusion or Flux. Rather than generating tokens one after another in an autoregressive fashion, DiffusionGemma generates entire paragraphs' worth of tokens at the same time. The process looks a lot like how a diffusion model turns what's essentially static into an image through a series of denoising steps. As Google explains it, DiffusionGemma works by laying out a canvas of random tokens, and then refining them until the final output is reached. Compared to conventional LLMs, which are memory-bandwidth bound and require a lot of VRAM, diffusion models are a predominantly compute-bound workload, which is why the Chocolate Factory is positioning these models for local deployment. LLMs are autoregressive. During token generation, the model's active parameters need to be streamed from memory for every token generated, making memory bandwidth a major bottleneck. In the cloud, inference providers balance compute and memory bandwidth by processing hundreds or thousands of requests in parallel. As you might have guessed, this isn't something the average user running a local model on their notebook can do. However, many consumer products, like high-end graphics cards, have plenty of excess horsepower, which DiffusionGemma can take advantage of to boost output performance. Diffusion language models aren't perfect. Google isn't the first to explore this tech. Previous models, like DREAM or Mercury 2, demonstrated major speedups over conventional LLMs, but generally underperformed them in benchmarks for their size. DiffusionGemma doesn't appear to be any different. According to Google, the 26 billion-parameter model falls just behind Gemma 4 12B in the GPQA-Diamond benchmark, with its main advantage being output speed, and even then it's not as impressive as Google has made it out to be. The chart shows a roughly 2.25x speedup for DiffusionGemma over the 12B parameter LLM with speculative decode enabled. Compared to Gemma 4 26B-A4B, the speedup is nearly 4x when running a single Nvidia H100. DiffusionGemma is being released as an experimental model rather than an enterprise focused one, like we saw with Gemma 4. The model is available for download on popular model repos like Hugging Face under a highly permissive Apache 2.0 license with support already merged into popular inference engines like vLLM, MLX, and HF Transformers, with support for Llama.cpp coming soon. While local inference has largely been the domain of AI enthusiasts, companies like Google are increasingly leaning on the tech to cut cloud costs associated with their AI services. As you may recall, back in May, Google quietly began shipping a small LLM with its Chrome web browser. (R)