The Context Window Has Been Shattered: Subquadratic Debuts a 12-Million-Token Window
"Fnord666" writes:
Every frontier model in 2026 advertises a context window of at least a million tokens, but almost none of them make good use of all that information. On MRCR v2, the multi-reference retrieval benchmark that labs commonly report, the best model is GPT-5.5, which scores 74.0%. Others, such as Claude Opus 4.7 at 32.2%, are far behind.
At this point, a million tokens seems to be the maximum context window that the major frontier labs are offering. One major reason for the million-token max is the same one that has shaped every transformer-based model since 2017: Attention cost scales quadratically with context length, so doubling the input quadruples the work. Essentially, RAG, agentic decomposition, hybrid model architectures, and every other workaround the industry has built are ways of making tradeoffs to get around this cost.
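To make the scaling concrete, here is a back-of-the-envelope sketch in Python. It counts only the query-key score computations for full (non-causal) dense attention and ignores heads, layers, and hidden size. The function name `dense_attention_pairs` is made up for illustration and is not taken from any library.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense attention over n_tokens tokens.

    Simplifying assumption: every token attends to every token, so the
    count is n * n. Causal masking roughly halves it but stays quadratic.
    """
    return n_tokens * n_tokens


for n in (1_000, 2_000, 1_000_000, 12_000_000):
    print(f"{n:>12,} tokens -> {dense_attention_pairs(n):,} pairs")

# Ratio of work between a 12-million-token and a 1-million-token context.
ratio_12m_vs_1m = dense_attention_pairs(12_000_000) // dense_attention_pairs(1_000_000)
```

Under this simplified count, going from 1,000 to 2,000 tokens quadruples the pairs, and a 12-million-token context needs 144 times the pairwise work of a 1-million-token one.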
Subquadratic, a Miami-based startup, launched its first model on Tuesday and claims it can get around all of this: the model handles a context window of 12 million tokens. What's more, the company says it plans to offer a model with a 50-million-token context window soon.
The company, which has 11 Ph.D. researchers on staff, argues that its architecture, called Subquadratic Selective Attention (SSA), scales linearly in both compute and memory with respect to context length. The company says SSA runs 52 times faster than dense attention at a million tokens, hits 92.1% on needle-in-a-haystack retrieval at 12 million tokens - a context length no frontier model currently gets close to - and scores 83 on MRCR v2, beating OpenAI's GPT-5.5 by nine points.
[...] The quadratic cost of attention is obviously not a new problem, and SSA is not the first attempt to solve it. The research line goes back nearly to the original transformer paper, and the overall pattern has remained consistent. Every approach has traded one necessary property to gain another, and none have been able to replace dense attention at the frontier scale.
[...] DeepSeek's Native Sparse Attention won the ACL 2025 best paper award, for example. Its successor, DeepSeek Sparse Attention (DSA), is shipping in DeepSeek V3.2-Exp. DSA's lightning indexer routes attention to a small subset of selected keys, and the attention over those keys is genuinely sparse. The indexer that picks them, however, has to score every query against every key, meaning the selection step is itself quadratic.
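The indexer bottleneck described above can be illustrated with a toy top-k sparse attention in Python. This is a minimal sketch under stated assumptions, not DeepSeek's implementation: the function `topk_sparse_attention` is hypothetical, it handles a single head with no masking, and it reuses a plain dot product as the indexer score where a real indexer would use its own scoring function. It counts the two costs separately so they can be compared.

```python
import math


def topk_sparse_attention(queries, keys, values, k):
    """Toy single-head sparse attention: an indexer picks k keys per query.

    Returns (outputs, indexer_scores, attended_pairs). Hypothetical
    illustration only; not the DSA lightning indexer.
    """
    outputs = []
    indexer_scores = 0   # work done by the selection step
    attended_pairs = 0   # work done by the sparse attention step
    dim = len(values[0])
    for q in queries:
        # Selection: score this query against every key (n per query, n*n total).
        scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
        indexer_scores += len(keys)
        # Keep the indices of the k highest-scoring keys.
        top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Attention: numerically stable softmax over the selected keys only.
        peak = max(scores[i] for i in top)
        weights = [math.exp(scores[i] - peak) for i in top]
        total = sum(weights)
        attended_pairs += len(top)
        outputs.append(
            [sum(w * values[i][d] for w, i in zip(weights, top)) / total
             for d in range(dim)]
        )
    return outputs, indexer_scores, attended_pairs


# Usage: 8 tokens, each query attends to only 2 keys.
n, k = 8, 2
vecs = [[float(i), 1.0] for i in range(n)]
out, idx, att = topk_sparse_attention(vecs, vecs, vecs, k)
print(idx, att)  # 64 indexer scores, 16 attended pairs
```

In this sketch the attention step grows as n times k, but the selection step still grows as n squared, which is the pattern the paragraph attributes to the indexer.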
Subquadratic CTO Alex Whedon tells The New Stack, "Sparse attention basically means instead of doing what transformers do, which is if you have 1,000 words, you look at every possible relationship between all 1,000 words, which is 1,000 squared combinations. You realize that only a portion of those actually matter and you only process the portion that matter."