Netflix wiz creates app to slash AI bills, then open sources it

from www.theregister.com - Articles on 2026-05-31 07:00 (#75ZYH)

As the COOs from both Uber and Microsoft recently learned, encouraging company engineers to use AI aggressively can lead to hefty usage bills, perhaps even offsetting all the gains from laying off employees.

The AI bills at Netflix may not be so eye-popping thanks to company senior engineer Tejas Chopra, who has created software to prune agent instructions, as measured in tokens, before they hit the LLM. Chopra has estimated that as much as 90% of tokens are redundant to the giant thinking machine of your choice.

Although Project Headroom is not an official Netflix project, several teams there already use it, and a number of external projects rely on it as well.

In a talk at the Open Source Summit last week, Chopra said that Headroom has saved an estimated $700,000 for its users, who collectively now have 200 billion tokens to spend elsewhere. Not bad for an open source application that's been out only since January. Headroom, currently at a still-raw v0.22, has gathered 2,000 stars on GitHub and has been forked more than 120 times.

"A lot of our users are people who have been really burned by token costs, more than anything else," Chopra said in his presentation.

Lossless context compression

A $287 bill from Claude Sonnet first brought Chopra's attention to the idea of token economization. The bill was typical home project stuff: a bit of debugging, some refactoring, MCP tools querying a database.

At the time, Claude Sonnet's token-based pricing seemed pretty generous: $3 for every million input tokens, or $6/million if you went over the 200,000-token limit for your context window. Still, the charges quickly added up to $287.

Upon deeper inspection, Chopra found a lot of this data was highly redundant to the LLM. By and large, his own hand-crafted instructions were not the culprit. Rather, it was all the boilerplate and machine metadata that came along for the ride: needlessly verbose JSON schemas, nested templates within API responses, identical database columns.

"This isn't prose. This isn't creative writing. This is compressible data masquerading as text," Chopra wrote in a blog post introducing his software.

In 2025, a group of researchers found that reading user input accounted for about 76% of all token consumption.

The model providers have their own tools to save tokens. But to date, the settings on these tools are somewhat opaque to end users. By default, Claude has a prefix cache setting of just five minutes. After five minutes of inactivity, the entire context window needs to be refreshed, even if the LLM needs the exact same data.

Another setting is exposed in the API documentation: a one-hour time to live (TTL). But there is a catch. "You pay two times the cost for your writes to get 90% savings for your reads," Chopra told the audience. It's up to you to find the sweet spot.

There are also a number of new commercial token barbers popping up, such as Y Combinator-funded Token Company, which offers token compression as a service. On the open source side there is RTK (Rust Token Killer), which trims the output of verbose commands, such as calls to a repository. Another open source project, LeanCTX, is a variant of RTK.

All these tools are useful, Chopra admitted, but he designed Headroom to keep the operations confined to the developer's workflow. And it had something none of the other apps and services could offer: reversible compression.

Headroom's job is to compress all the source material that is fed into the user's context window - not only the conversation history, but also logs, tool outputs, files, chunks of documentation that the RAG found useful - before it arrives at the LLM.

The context window is the set space for each user session. The latest frontier models are rapidly expanding their context windows, which hold both input and output, toward two million tokens. Such generosity is a mixed blessing, as Pope Leo might point out.

As a unit of measurement, a single token is more or less equivalent to a human word.
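
The caching trade-off Chopra described ("two times the cost for your writes to get 90% savings for your reads") can be put into rough numbers. The sketch below is a back-of-the-envelope calculation, not anything taken from Headroom or from a provider's SDK. It uses only the figures quoted in this article ($3 per million input tokens, a 2x write cost, a 90% read discount) and assumes, for illustration, that every request reuses an identical prefix and arrives before the cache entry expires. The function names are hypothetical.

```python
# Figures quoted in the article; treat them as illustrative, not current pricing.
BASE_INPUT_USD_PER_MTOK = 3.00  # "$3 for every million input tokens"
CACHE_WRITE_MULTIPLIER = 2.0    # "two times the cost for your writes"
CACHE_READ_MULTIPLIER = 0.1     # "90% savings for your reads"


def uncached_cost(prefix_tokens: int, requests: int) -> float:
    """Cost of resending the same prefix at full price on every request."""
    return requests * prefix_tokens / 1_000_000 * BASE_INPUT_USD_PER_MTOK


def cached_cost(prefix_tokens: int, requests: int) -> float:
    """Cost when the first request writes the cache and the rest read it.

    Assumption: every later request hits the cache (no TTL expiry, and the
    prefix is byte-for-byte identical each time).
    """
    if requests <= 0:
        return 0.0
    per_pass = prefix_tokens / 1_000_000 * BASE_INPUT_USD_PER_MTOK
    reads = requests - 1
    return per_pass * (CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * reads)


def break_even_requests() -> int:
    """Smallest number of requests at which caching is strictly cheaper."""
    n = 1
    while cached_cost(1_000_000, n) >= uncached_cost(1_000_000, n):
        n += 1
    return n


if __name__ == "__main__":
    print("Caching pays off from request number", break_even_requests())
    print("10 requests, 100k-token prefix, uncached:", uncached_cost(100_000, 10))
    print("10 requests, 100k-token prefix, cached:  ", cached_cost(100_000, 10))
```

Under these assumptions the cached path only becomes cheaper from the third request onward. A session that sends a prefix once or twice before the cache expires pays more with caching than without it, which is one way to read the "sweet spot" remark.
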
For pay-as-you-go plans, the more you feed the context window, the more you'll pay.

Gobbling tokens like Pac-Man

Running on Python and Node, Headroom operates as a proxy (port 8787) on the engineer's computer. The user wraps their LLM at the command line interface (e.g. "headroom wrap codex") and Headroom then parses the input.

While Headroom does compress a bit of programming code and human instruction, it is best at chopping server logs (90% of which can be jettisoned), MCP tool outputs (70% redundant JSON), database outputs (it's all one schema), and file trees (much repeated metadata).

Headroom's first step is a process called CacheAligner, which looks only for information that has changed within input that's already been entered and ships only the new info. This eliminates the need to replace an entire body of mostly unchanged text in the KV cache, the cache where the AI provider stores the user's context window.

"If your system prompt contains a date field or contains some UUID that changes per session, you are effectively getting a cache miss every single time," he told the audience. "That will blow up your costs."

Then, a router process infers the type of content and sends it to one of a number of compressors. An Abstract Syntax Tree (AST) compressor squishes programming code. JSON and Document Object Model (DOM) compressors snip unneeded JSON and web boilerplate, respectively.

Headroom also has some "squashers" that look at text or JSON input and decide which bits are actually relevant, based on statistical analysis. These tools learn in a feedback loop whether they are over- or under-compressing, based on how often the model has to call back into the original uncompressed prompt.

The final process, called Compress Cache and Retrieve (CCR), gives the LLM the ability to look at the original unsquashed data. It places markers where the data has been compressed, so if the LLM wishes to get the original context, it can call a Headroom MCP to retrieve the needed material from the user's machine. The original context is stored in Redis or SQLite.

There is still work to be done on this software stack, Chopra admitted, particularly on testing accuracy. It should be an easy task because the CCR stores the original prompts. More compressors can also be built for other specific types of data, such as financial data. Audio, image, and video will also have to be tackled (one user has already forked the project for video parsing).

A related project, which Chopra says will be open source soon, is Headlight. Headlight will keep track of the origin of each token, which could be especially handy for ensuring the accuracy of multi-model work.

A token saved is a token earned

Minding your tokens not only saves money, it can also improve results, research suggests. Agents send more context than the model can possibly use, which, in addition to emptying the user's coffers, can actually make the LLM dumber. Like the rest of us, LLMs get confused when presented with too much information.

A group of Stanford University boffins found that LLMs tend to pay more attention to the beginning and the end of the context window, and tend to disregard the middle bits. Likewise, a set of researchers from data integrator Chroma deduced that, across 18 LLMs, "performance grows increasingly unreliable as input length grows." "Context rot," they called this phenomenon.

Trimming prompts can also improve latency. In his presentation, Chopra relayed how one of Headroom's users forked the software for a voice-activated application. With voice, even silence can generate tokens. The user expects a response from the app within 200 milliseconds for the service to sound natural, so the company is using Headroom to help shrink that latency window down as much as possible.
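
For readers wondering what the reversible part of Compress Cache and Retrieve might look like in practice, here is a minimal sketch of the general idea: store the original locally, send a shortened version containing a marker, and let the marker key fetch the original back. This is not Headroom's code or API. The marker format, the function names (`open_store`, `compress_log`, `retrieve`), and the keep-the-head-and-tail trimming rule are all assumptions made for illustration; the only details taken from the article are that markers show where data was compressed and that originals live in a local store such as SQLite.

```python
import hashlib
import sqlite3

# Hypothetical marker format; the real tool's markers are not described in the article.
MARKER_TEMPLATE = "[[ccr:{key} omitted {n} lines]]"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a local SQLite store for uncompressed originals."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS originals ("
        "key TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def compress_log(conn: sqlite3.Connection, text: str, keep: int = 2) -> str:
    """Keep the first and last `keep` lines and replace the middle with a marker.

    The full original is saved under a content-derived key, so nothing is
    lost: the marker can always be resolved back to the exact input.
    """
    lines = text.splitlines()
    if len(lines) <= 2 * keep:
        return text  # too short to be worth trimming
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    conn.execute(
        "INSERT OR REPLACE INTO originals (key, body) VALUES (?, ?)",
        (key, text),
    )
    conn.commit()
    marker = MARKER_TEMPLATE.format(key=key, n=len(lines) - 2 * keep)
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])


def retrieve(conn: sqlite3.Connection, key: str):
    """Return the stored original for a marker key, or None if it is unknown."""
    row = conn.execute(
        "SELECT body FROM originals WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    store = open_store()
    log = "\n".join(f"GET /health 200 request={i}" for i in range(1000))
    print(compress_log(store, log))
```

In the article's description the model reaches the original through an MCP call; here `retrieve` stands in for that step. How often it gets called is also the kind of signal the article says the squashers use to judge whether they are over-compressing.
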
Headroom also offers some good news for those worrying about data centers heating the world into a fiery inferno with their energy usage. Fewer tokens means a smaller context window, which means less energy use - at least until the Jevons Paradox kicks in and people find even more power-hungry ways to render their animated cat movies. ®
