Ceph: a Journey To 1 TiB/s

EditorDavid

from Slashdot on 2024-01-20 16:34 (#6J0K8)

It's "a free and open-source, software-defined storage platform," according to Wikipedia, providing object storage, block storage, and file storage "built on a common distributed cluster foundation".The charter advisory board for Ceph included people from Canonical, CERN, Cisco, Fujitsu, Intel, Red Hat, SanDisk, and SUSE. And Nite_Hawk (Slashdot reader #1,304) is one of its core engineers - a former Red Hat principal software engineer named Mark Nelson. (He's now leading R&D for a small cloud systems company called Clyso that provides Ceph consulting.) And he's returned to Slashdot to share a blog post describing "a journey to 1 TiB/s". This gnarly tale-from-Production starts while assisting Clyso with "a fairly hip and cutting edge company that wanted to transition their HDD-backed Ceph cluster to a 10 petabyte NVMe deployment" using object-based storage devices [or OSDs]...) I can't believe they figured it out first. That was the thought going through my head back in mid-December after several weeks of 12-hour days debugging why this cluster was slow... Half-forgotten superstitions from the 90s about appeasing SCSI gods flitted through my consciousness... Ultimately they decided to go with a Dell architecture we designed, which quoted at roughly 13% cheaper than the original configuration despite having several key advantages. The new configuration has less memory per OSD (still comfortably 12GiB each), but faster memory throughput. It also provides more aggregate CPU resources, significantly more aggregate network throughput, a simpler single-socket configuration, and utilizes the newest generation of AMD processors and DDR5 RAM. By employing smaller nodes, we halved the impact of a node failure on cluster recovery.... The initial single-OSD test looked fantastic for large reads and writes and showed nearly the same throughput we saw when running FIO tests directly against the drives. As soon as we ran the 8-OSD test, however, we observed a performance drop. Subsequent single-OSD tests continued to perform poorly until several hours later when they recovered. So long as a multi-OSD test was not introduced, performance remained high. Confusingly, we were unable to invoke the same behavior when running FIO tests directly against the drives. Just as confusing, we saw that during the 8 OSD test, a single OSD would use significantly more CPU than the others. A wallclock profile of the OSD under load showed significant time spent in io_submit, which is what we typically see when the kernel starts blocking because a drive's queue becomes full... For over a week, we looked at everything from bios settings, NVMe multipath, low-level NVMe debugging, changing kernel/Ubuntu versions, and checking every single kernel, OS, and Ceph setting we could think of. None these things fully resolved the issue. We even performed blktrace and iowatcher analysis during "good" and "bad" single OSD tests, and could directly observe the slow IO completion behavior. At this point, we started getting the hardware vendors involved. Ultimately it turned out to be unnecessary. There was one minor, and two major fixes that got things back on track. It's a long blog post, but here's where it ends up: Fix One: "Ceph is incredibly sensitive to latency introduced by CPU c-state transitions. A quick check of the bios on these nodes showed that they weren't running in maximum performance mode which disables c-states." Fix Two: [A very clever engineer working for the customer] "ran a perf profile during a bad run and made a very astute discovery: A huge amount of time is spent in the kernel contending on a spin lock while updating the IOMMU mappings. He disabled IOMMU in the kernel and immediately saw a huge increase in performance during the 8-node tests."In a comment below, Nelson adds that "We've never seen the IOMMU issue before with Ceph... I'm hoping we can work with the vendors to understand better what's going on and get it fixed without having to completely disable IOMMU." Fix Three: "We were not, in fact, building RocksDB with the correct compile flags... It turns out that Canonical fixed this for their own builds as did Gentoo after seeing the note I wrote in do_cmake.sh over 6 years ago... With the issue understood, we built custom 17.2.7 packages with a fix in place. Compaction time dropped by around 3X and 4K random write performance doubled."The story has a happy ending, with performance testing eventually showing data being read at 635 GiB/s - and a colleague daring them to attempt 1 TiB/s. They built a new testing configuration targeting 63 nodes - achieving 950GiB/s - then tried some more performance optimizations...

Source	RSS or Atom Feed
Feed Location	https://rss.slashdot.org/Slashdot/slashdotMain
Feed Title	Slashdot
Feed Link	https://slashdot.org/
Feed Copyright	Copyright Slashdot Media. All Rights Reserved.