Article 76MS5 Zuck saves Meta bucks by reusing memory from old servers with a custom CXL ASIC

from www.theregister.com - Articles on (#76MS5)
Meta is recovering DDR4 memory from old servers, installing it in new machines, and using a custom Compute Express Link (CXL) ASIC to share the memory across applications - without encountering latency problems.

The social networking giant calls its tech "Vistara" and will present it at ISCA 2026 on Monday, but The Register found the company's paper ahead of the talk. Our sister site, Blocks and Files, also happens to have reported on this on Friday.

The document opens with the admission that Meta can't increase the amount of memory in around 40 percent of its vast server fleet, meaning millions of servers can't handle some of its workloads. That's unfortunate because the expected service life of its servers is three to five years, but memory is useful for seven to ten years.

Meta's response is to rip DDR4 DIMMs from old servers, put them into new machines that rely on DDR5, and turn it all into a pool of capacity - which in theory makes it possible to compose virtual servers that share resources across multiple physical hosts.

The paper points out that CXL is hard to put into production because sharing memory across hosts can mean low bandwidth, high latency, and extra computing overheads to manage additional memory layers. Those problems can arise in systems that combine different memory technologies.

Meta wanted to blend memory types in a single machine but found off-the-shelf CXL kit can't do the job.

"Most CXL solutions bundle DRAM with the controller - preventing DIMM reuse - and often omit DDR4 support, which is a requirement for repurposing older memory," the paper states. "Additionally, their high power consumption and high cost further limit their appeal."

To make CXL sing, Meta created a custom ASIC called "Vistara."

"At its core, the Vistara ASIC is designed to bridge DDR4 memory to host processors via a CXL 2.0/1.1-compliant PCIe Gen5 x16 interface," the paper explains.
"Each Vistara ASIC integrates two independent 72-bit DDR4 memory channels, supporting speeds up to 3,200 MT/s and up to 256 GB per chip with 64 GB DIMMs." A pair of custom RISC-V processors drive the ASICs. Vistara hardware lives in devices Meta calls a "MemServer" powered by an AMD Turin processor packing 158 cores and running 316 threads. Each MemServer combines 768 GB of DDR5 memory alongside 256 GB of DDR4 connected through Vistara ASICs. "The Vistara CXL cards are installed in dedicated rear-accessible slots within each MemServer chassis," the paper reveals. "To manage the increased thermal load from high-density memory and CXL devices, the chassis employs directed airflow with high-capacity fans that channel cool air directly across the Vistara modules, for stable operation under heavy workloads." The software side of Vistara sees the DDR4 presented to the OS "as a distinct, CPU-less NUMA node, separate from the local DRAM nodes directly attached to the processor." Meta's platforms first use all available local DDR4, then employ the CXL-enabled memory when needed. Zuck's house of hyperscale hypnotism makes this happen with custom tweaks to the Linux CXL driver. "All Linux kernel CXL driver code in use for Vistara is either present in the upstream kernel, or is on its way to being included in the upstream kernel," the paper states. The paper says Meta has put this CXL stuff to work "in hyperscale infrastructure with millions of servers, across a variety of production workloads, including disaggregated ML inference (embedding tables in recommendation systems), big data processing, databases, distributed caches, and CI/CD build systems." Some workloads, including big data tools such as Spark and Hive, use terabyte and petabyte-scale datasets, and need hundreds of gigabytes of memory per job. The paper says that if those workloads experience out-of-memory events, it can "disrupt critical business analytics and ML pipelines." 
"The expanded memory headroom provided by CXL enhances system reliability," the paper explains. "By mitigating the risk of out-of-memory (OOM) events, CXL reduces the frequency of job failures and the associated overhead of job restarts and resource fragmentation by 33 percent." Meta says the system also cuts infrastructure costs. "These deployments have demonstrated large benefits, such as reducing the server count by up to 25 percent for disaggregated inference," the paper states. And of course Meta is avoiding the sky-high memory prices caused by the RAMpocalypse. (R)