by Remy Porter on (#56GRW)
"Memory leaks are impossible in a garbage collected language!" is one of my favorite lies. It feels true, but it isn't. Sure, it's much harder to make them, and they're usually much easier to track down, but you can still create a memory leak. Most times, it's when you create objects, dump them into a data structure, and never empty that data structure. Usually, it's just a matter of finding out what object references are still being held. Usually.

A few months ago, I discovered a new variation on that theme. I was working on a C# application that was leaking memory faster than bad waterway engineering in the Imperial Valley.

I don't exactly work in the "enterprise" space anymore, though I still interact with corporate IT departments and get to see some serious internal WTFs.

This is a chandelier we built for the Allegheny Health Network's Cancer Institute which recently opened in Pittsburgh. It's 15 meters tall, weighs about 450kg, and is broken up into 30 segments, each with hundreds of addressable LEDs in a grid. The software we were writing was built to make them blink pretty.

Each of those 30 segments is home to a single-board computer with its GPIO pins wired up to addressable LEDs. Each computer runs a UDP listener, and we blast them with packets containing RGB data, which they dump to the LEDs using a heavily tweaked version of LEDScape.

This is our standard approach to most of our lighting installations. We drop a Beaglebone onto a custom circuit board and let it drive the LEDs, then we have a render-box someplace which generates frame data and chops it up into UDP packets. Depending on the environment, we can drive anything from 30-120 frames per second this way (and probably faster, but that's rarely useful).

Apologies to the networking folks, but this works very well. Yes, we're blasting many megabytes of raw bitmap data across the network, but we're usually on our own dedicated network segment.
We use UDP because, well, we don't care about the data that much. A dropped packet or an out of order packet isn't going to make too large a difference in most cases. We don't care if our destination Beaglebone is up or down, we just blast the packets out onto the network, and they get there reliably enough that the system works.

Now, normally, we do this from Python programs on Linux. For this particular installation, though, we have an interactive kiosk which provides details about cancer treatments and patient success stories, and lets the users interact with the chandelier in real time. We wanted to show them a 3D model of the chandelier on the screen, and show them an animation on the UI that was mirrored in the physical object. After considering our options, we decided this was a good case for Unity and C#. After a quick test of doing multitouch interactions, we also decided that we shouldn't deploy to Linux (Unity didn't really have good Linux multitouch support), so we would deploy a Windows kiosk. This meant we were doing most of our development on MacOS, but our final build would be for Windows.

Months go by. We worked on the software while building the physical pieces, which meant the actual testbed hardware wasn't available for most of the development cycle. Custom electronics were being refined and physical designs were changing as we iterated to the best possible outcome. This is normal for us, but it meant that we didn't start getting real end-to-end testing until very late in the process.

Once we started test-hanging chandelier pieces, we started basic developer testing. You know how it is: you push the run button, you test a feature, you push the stop button. Tweak the code, rinse, repeat. Eventually, though, we had about 2/3rds of the chandelier pieces plugged in, and started deploying to the kiosk computer, running Windows.

We left it running, and the next time someone walked by and decided to give the screen a tap… nothing happened. It was hung.
Well, that could be anything. We rebooted and checked again, and everything seems fine, until a few minutes later, when it's hung… again. We checked the task manager- which hey, everything is really slow, and sure enough, RAM is full and the computer is so slow because it's constantly thrashing to disk.

We're only a few weeks before we actually have to ship this thing, and we've discovered a massive memory leak, and it's such a sudden discovery that it feels like the draining of Lake Agassiz. No problem, though, we go back to our dev machines, fire it up in the profiler, and start looking for the memory leak.

Which wasn't there. The memory leak only appeared in the Windows build, and never happened in the Mac or Linux builds. Clearly, there must be some different behavior, and it must be around object lifecycles. When you see a memory leak in a GCed language, you assume you're creating objects that the GC ends up thinking are in use. In the case of Unity, your assumption is that you're handing objects off to the game engine, and not telling it you're done with them. So that's what we checked, but we just couldn't find anything that fit the bill.

Well, we needed to create some relatively large arrays to use as framebuffers. Maybe that's where the problem lay? We kept digging through the traces, we added a bunch of profiling code, we spent days trying to dig into this memory leak…

… and then it just went away. Our memory leak just became a Heisenbug, our shipping deadline was even closer, and we officially knew less about what was going wrong than when we started. For bonus points, once this kiosk ships, it's not going to be connected to the Internet, so if we need to patch the software, someone is going to have to go onsite.
And we aren't going to have a suitable test environment, because we're not exactly going to build two gigantic chandeliers.

The folks doing assembly had the whole chandelier built up, hanging in three sections (we don't have any 14m tall ceiling spaces), and all connected to the network for a smoke test. There wasn't any smoke, but they needed to do more work. Someone unplugged a third of the chandelier pieces from the network.

And the memory leak came back.

We use UDP because we don't care if our packet sends succeed or not. Frame-by-frame, we just want to dump the data on the network and hope for the best. On MacOS and Linux, our software usually uses a sender thread that just, at the end of the day, wraps around calls to the send system call. It's simple, it's dumb, and it works. We ignore errors.

In C#, though, we didn't do things exactly the same way. Instead, we used the .NET UdpClient object and its SendAsync method. We assumed that it would do roughly the same thing.

We were wrong.
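
For contrast, the "simple, dumb" fire-and-forget sender described above for MacOS and Linux can be sketched in Python roughly like this. It is an illustration, not the production code: the function names, the 1400-byte payload size, and the absence of any packet header are assumptions, and the real packet format expected by the LEDScape-based listeners is not shown.

```python
import socket

# Hypothetical payload size for illustration; chosen to stay under a
# typical Ethernet MTU. The article does not state the real value.
MAX_PAYLOAD = 1400


def chunk_frame(frame: bytes, max_payload: int = MAX_PAYLOAD) -> list[bytes]:
    """Split one frame of raw RGB bytes into datagram-sized payloads."""
    return [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)]


def send_frame(sock: socket.socket, frame: bytes, addr: tuple[str, int]) -> int:
    """Send every chunk of a frame, ignoring send errors.

    Returns the number of chunks the OS accepted. A failed send is
    treated exactly like a dropped packet: skip it and keep going.
    """
    accepted = 0
    for chunk in chunk_frame(frame):
        try:
            sock.sendto(chunk, addr)
            accepted += 1
        except OSError:
            # Fire and forget: no retry, no logging, no queueing.
            pass
    return accepted
```

To use it, create one `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and call `send_frame` once per frame for each segment's address. The point of the design is that nothing accumulates: each `sendto` call either hands the datagram to the OS or fails immediately, so an unreachable segment costs nothing beyond the lost packet.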