When Order Matters: How A Single DNS Change Broke The Internet For Millions

by janrinok from SoylentNews on (#732E1)

Arthur T Knackerbracket has processed the following story:

On January 8, 2026, a seemingly innocuous code change at Cloudflare triggered a cascade of DNS resolution failures across the internet, affecting millions of users worldwide. The culprit wasn't a cyberattack, a server outage, or a configuration error; it was something far more subtle: the order in which DNS records appeared in responses from 1.1.1.1, one of the world's most popular public DNS resolvers.

[...] The story begins on December 2, 2025, when Cloudflare engineers introduced what appeared to be a routine optimization to their DNS caching system. The change was designed to reduce memory usage, a worthy goal for infrastructure serving millions of queries per second. After more than a month of testing in their development environment, the change began its global rollout on January 7, 2026.

By January 8 at 17:40 UTC, the update had reached 90% of Cloudflare's DNS servers. Within 39 minutes, the company had declared an incident as reports of DNS resolution failures poured in from around the world. The rollback began immediately, but it took another hour and a half to fully restore service.

The affected timeframe was relatively short, less than two hours from incident declaration to resolution, but the impact was significant. Users across multiple platforms and operating systems found themselves unable to access websites and services that relied on CNAME records, a fundamental building block of modern DNS infrastructure.

To understand what went wrong, it's essential to grasp how DNS CNAME (Canonical Name) records work. When you visit a website like www.example.com, your request might follow a chain of aliases before reaching the final destination:

  www.example.com    CNAME    cdn.example.com
  cdn.example.com    A        198.51.100.1

Each step in this chain has its own Time-To-Live (TTL) value, indicating how long the record can be cached. When some records in the chain expire while others remain valid, DNS resolvers like 1.1.1.1 can optimize by resolving only the expired portions and combining them with cached data. If, say, the CNAME record carries a TTL of 3600 seconds while the A record's TTL is only 60, the resolver can keep serving the cached CNAME and re-resolve just the address record once it expires. This optimization is where the trouble began.

The problematic change was deceptively simple. Previously, when merging cached CNAME records with newly resolved data, Cloudflare's code created a new list and placed CNAME records first:

// Merge the cached CNAME chain (self.records) with the freshly
// resolved answers (entry.answer) into a new list, CNAMEs ahead of addresses.
let mut answer_rrs = Vec::with_capacity(entry.answer.len() + self.records.len());
answer_rrs.extend_from_slice(&self.records); // CNAMEs first
answer_rrs.extend_from_slice(&entry.answer); // Then A/AAAA records

To save memory allocations, engineers changed this to append CNAMEs to the existing answer list. This seemingly minor optimization had a profound consequence: CNAME records now sometimes appeared after the final resolved answers instead of before them.
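
The post-change code isn't reproduced in the source, but based on that description and the field names in the snippet above, the memory-saving variant plausibly amounted to something like this hypothetical sketch:

// Hypothetical reconstruction, not Cloudflare's actual code: reuse the
// cached answer list and append the CNAME chain to it. One allocation
// is saved, but the CNAMEs can now trail the A/AAAA records they
// are supposed to precede.
entry.answer.extend_from_slice(&self.records);

Any client that tolerated arbitrary record order never noticed the difference; clients that relied on CNAME-first ordering broke.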

The reason this change caused widespread failures lies in how many DNS client implementations process responses. Some clients, including the widely used getaddrinfo function in glibc (the GNU C Library used by most Linux systems), parse DNS responses sequentially while tracking the expected record name; a code sketch of this loop follows the two walkthroughs below.

When processing a response in the correct order:

  • Find records for www.example.com
  • Encounter www.example.com CNAME cdn.example.com
  • Update expected name to cdn.example.com
  • Find cdn.example.com A 198.51.100.1
  • Success!

But when CNAMEs appear after A records:

  • Find records for www.example.com
  • Ignore cdn.example.com A 198.51.100.1 (doesn't match expected name)
  • Encounter www.example.com CNAME cdn.example.com
  • Update expected name to cdn.example.com
  • No more records found: resolution fails
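
A minimal Rust sketch makes the failure mode concrete. This is illustrative only: glibc's getaddrinfo is C code and considerably more involved, and the Record type, resolve function, and sample data below are invented for the example.

use std::net::Ipv4Addr;

// Simplified resource records: only the two types involved here.
enum Record<'a> {
    Cname { name: &'a str, target: &'a str },
    A { name: &'a str, addr: Ipv4Addr },
}

// One sequential pass over the answer section, tracking the name we
// currently expect. Non-matching records are skipped and never
// revisited; this is exactly why ordering matters.
fn resolve(query: &str, answers: &[Record]) -> Option<Ipv4Addr> {
    let mut expected = query;
    for rr in answers {
        match rr {
            // A CNAME for the expected name redirects the search.
            Record::Cname { name, target } if *name == expected => expected = *target,
            // An A record for the expected name is the final answer.
            Record::A { name, addr } if *name == expected => return Some(*addr),
            // Anything else is ignored.
            _ => {}
        }
    }
    None // ran out of records without an answer: resolution fails
}

fn main() {
    use Record::*;
    let addr: Ipv4Addr = "198.51.100.1".parse().unwrap();
    // Correct order: the CNAME precedes the A record it points to.
    let good = [
        Cname { name: "www.example.com", target: "cdn.example.com" },
        A { name: "cdn.example.com", addr },
    ];
    // Reordered: the A record arrives before the CNAME has updated
    // the expected name, so it is skipped and the lookup fails.
    let bad = [
        A { name: "cdn.example.com", addr },
        Cname { name: "www.example.com", target: "cdn.example.com" },
    ];
    assert_eq!(resolve("www.example.com", &good), Some(addr));
    assert_eq!(resolve("www.example.com", &bad), None);
}

The same loop that succeeds on the CNAME-first answer set returns nothing when the records are reordered, mirroring the failures glibc users experienced.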

This sequential parsing approach, while seemingly fragile, made sense when it was implemented. It's efficient, requires minimal memory, and worked reliably for decades because most DNS implementations naturally placed CNAME records first.

The impact of this change was far-reaching but unevenly distributed. The primary victims were systems that use glibc's getaddrinfo function, a group that includes most traditional Linux distributions not running systemd-resolved as an intermediary caching layer.

Perhaps most dramatically affected were certain Cisco Ethernet switches. Three specific models fell into spontaneous reboot loops when they received responses with reordered CNAMEs from 1.1.1.1. Cisco has since published a service document describing the issue, highlighting how deeply this problem penetrated network infrastructure.

Interestingly, many modern systems were unaffected. Windows, macOS, iOS, and Android all use different DNS resolution libraries that handle record ordering more flexibly. Even on Linux, distributions using systemd-resolved were protected because the local caching resolver reconstructed responses according to its own ordering logic.

Read more of this story at SoylentNews.
