Diagnosing a single slow test
In the last blog post, I investigated why my Noda Time tests on Travis were running much slower than those on AppVeyor. I resolved a lot of the problem just by making sure I was running release builds on Travis.
That left a single test which takes up about half of the test time though: TzdbDateTimeZoneSourceTest.GuessZoneIdByTransitionsUncached. It seems to run more slowly on .NET Core on Linux than on Windows. The aim of this post is to work out why - at least to the extent of understanding it better than before.
Step 0: Initial assumptions/knowledge and aim

For the moment, I'll assume that my records of previous investigations are correct. So the situation is:
- I have near-identical hardware to run tests on: gabriel is running Linux; bagpuss is running Windows.
- Under a release build, the test takes 19s on gabriel and 4.3s on bagpuss
- The test does more work on gabriel: it checks 434 zones instead of 135 on bagpuss
- That still shows a per-zone slow-down of about 40%
- I've looked at this code before to try to optimize it. In production, this method is only going to be called for one time zone (the system default) so it's not really a problem - but improving the test speed would be nice. (We can categorize it as a "slow" test if necessary, but I'd rather not.)
Aims of this diagnosis:
- Find out whether the speed difference is in the test code or the production code that it's testing. (This test has a lot of assertions.)
- Find out whether the speed difference is in Noda Time code (e.g. Instant) or in BCL code (e.g. TimeZoneInfo). If it's in Noda Time code, that could lead to another investigation later, with potential optimizations.
- Make another attempt at speeding up the test across the board
Step 1: Validate assumptions

Let's make sure we're still in the same state as we were before. (That seems pretty likely.)
On both machines, pull the current code and run
dotnet run -c Release -f netcoreapp1.0 -- --where=test=~Guess
from the src/NodaTime.Test directory.
Odd: it now takes ~6 seconds on bagpuss. At that point, there's no time discrepancy between the machines. But I'm sure there was when I wrote up the previous post. I didn't make up the 4.3s result, but it's entirely possible it was from the wrong machine. Run the tests several times - yes, it's always around 6 seconds. Odd.
At this point, I nearly gave up. After a little while doing other things, I decided to pursue the rest of the diagnosis anyway, at which point the execution time was back to about 4 seconds. That's very, very strange. I still don't know what was going on there.
Step 2: shorten experiment iteration time

Brutal honesty time: although I ran the tests at this point, I didn't spot the discrepancy until I was at the end of step 2. I then looked back in the command shell buffer and saw the 6 second run, which was consistent with the runs at the end of step 2.
The only test we're interested in is this one. We're going to be changing the code occasionally, so let's ditch everything else.
- Start a new branch in git
- Move TzdbDateTimeZoneSourceTest.cs into NodaTime.Test so it's easier to find. (It was in a subdirectory.)
- Remove all other tests from TzdbDateTimeZoneSourceTest.cs
- Delete all the subdirectories in NodaTime.Test
- Remove everything in NodaTime.Test apart from Program.cs and TzdbDateTimeZoneSourceTest.cs
- Change the target framework to just netcoreapp1.0 so we don't need to specify it. (This also requires changing the csproj file to use <TargetFramework> instead of <TargetFrameworks> - see the sketch after this list.)
- Add a Release property to attempt to make it a release build by default
- Test: run dotnet run - oops, even though it's running from a Release folder, it's taking 13 seconds instead of 3 on Windows.
- Revert previous change, and try dotnet run -c Release - that's more like it
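For concreteness, here's a hedched-as-hypothetical sketch of what those csproj tweaks might look like in an SDK-style project file. The element names are standard MSBuild, but the exact target list in the real NodaTime.Test.csproj is an assumption on my part:

```xml
<PropertyGroup>
  <!-- Before: multi-targeting, which forces a -f argument on the command line.
       (Assumed original value.) -->
  <!-- <TargetFrameworks>netcoreapp1.0;net451</TargetFrameworks> -->

  <!-- After: a single target, so plain "dotnet run" works. -->
  <TargetFramework>netcoreapp1.0</TargetFramework>

  <!-- The attempted (and later reverted) default-to-Release property. -->
  <Configuration Condition="'$(Configuration)' == ''">Release</Configuration>
</PropertyGroup>
```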
Great. We can push the commit to GitHub and fetch it on all the relevant machines. (I'm developing on a third machine, just using gabriel and bagpuss to run tests.)
Step 3: Narrow down the time-consuming code

As I mentioned before, the test has a lot of assertions, and they involve a lot of time zone work. Let's see how much time they're contributing - by removing them entirely.
- Comment out everything after the call to GuessZoneIdByTransitionsUncached (sketched after this list)
- Run locally just to check it does still run
- Commit and push
- Fetch on gabriel/bagpuss and rerun
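I won't reproduce the real test here, but a hypothetical gutted version looks something like this - the shape of the original test (names, structure and all) is an assumption on my part:

```csharp
using System;
using NodaTime.TimeZones;
using NUnit.Framework;

public class TzdbDateTimeZoneSourceTest
{
    [Test]
    public void GuessZoneIdByTransitionsUncached()
    {
        foreach (var bclZone in TimeZoneInfo.GetSystemTimeZones())
        {
            var guess = TzdbDateTimeZoneSource.Default.GuessZoneIdByTransitionsUncached(bclZone);
            // Everything from here on is commented out, leaving only the
            // production call being measured:
            // Assert.IsNotNull(guess);
            // ...more assertions about the guessed zone...
        }
    }
}
```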
Result: the checking part of the test is almost free.
Step 4: Remove NUnit

At this point I'm fairly confident that NUnit is irrelevant. I still want to use the NodaTime.Test project, as NodaTime is already configured to expose internal members to it - and GuessZoneIdByTransitionsUncached is internal. However, it's very easy to remove NUnit - just remove the dependencies, and replace all the code in Program with a simple test:
```csharp
using NodaTime.TimeZones;
using System;
using System.Diagnostics;
using System.Linq;

namespace NodaTime.Test
{
    class Program
    {
        public static void Main(string[] args)
        {
            var zones = TimeZoneInfo.GetSystemTimeZones().ToList();
            // Fetch the source outside the measurement time
            var source = TzdbDateTimeZoneSource.Default;
            // JIT compile before we start
            source.GuessZoneIdByTransitionsUncached(TimeZoneInfo.Local);
            var stopwatch = Stopwatch.StartNew();
            foreach (var zone in zones)
            {
                source.GuessZoneIdByTransitionsUncached(zone);
            }
            stopwatch.Stop();
            Console.WriteLine($"Zones tested: {zones.Count}");
            Console.WriteLine($"Elapsed time: {stopwatch.Elapsed}");
        }
    }
}
```
There are a couple of differences here compared with the NUnit test:
- I'm deliberately fetching TzdbDateTimeZoneSource.Default (which needs to read some resources) outside the measurement
- I'm deliberately calling GuessZoneIdByTransitionsUncached once outside the measurement to remove JIT compilation time
This still isn't a good benchmark by any means, but it's probably sufficient for now.
Commit/push/fetch/run as before.
Result: similar to before, but a little faster. Playing around with the code a little further, it seems that the bullet points above make a difference of about 0.1s each (on bagpuss, at least) - there's quite a lot of code that wouldn't have been JIT-compiled before the first call, and loading time zone data is somewhat expensive. NUnit also adds a little bit of overhead, which is understandable.
We now have more accurate data for just GuessZoneIdByTransitionsUncached:
- bagpuss: 3.6s
- gabriel: 19.0s (just under 19 in most cases, instead of just over when running NUnit)
Now we need to get into Noda Time code…
Step 5: Gut the production code

First, the source code helpfully points to https://github.com/nodatime/nodatime/issues/686 with some comments from my last investigation.
We don't care about correctness at this point, just speed. We can return null at any point. At this point you probably need to know a bit about what the method does. The aim is to guess a matching TZDB (aka IANA, tz, zoneinfo or Olson) time zone from a BCL TimeZoneInfo, when we can't use the names to match. It combines all the time zone transitions (e.g. going into or out of DST) of every TZDB zone for the next 5 years, and then computes a score for each zone by checking how many of the UTC offsets at those transition points match the offsets returned by the BCL zone.
Let's divide the code into 5 sections:
1. Working out the start and end of this UTC year
2. Getting all the time zones
3. Getting all the transitions we're going to test
4. Getting all the offsets from the BCL zone
5. Computing the scores
It's going to be simplest to measure just 1, then 1+2, then 1+2+3, etc. We can just add return null; after the step we're interested in (see the sketch after the list below). We hope the results will tell us:
- Which part of the code is slowest (so we can try to optimize it)
- Which part of the code is slower on Linux than Windows
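To make that concrete, here's a simplified, hypothetical sketch of the method's shape, written against the public Noda Time API purely to illustrate the five sections and the return null; trick - the real internal implementation differs:

```csharp
using System;
using System.Linq;
using NodaTime;

internal static class GuessSketch
{
    internal static string GuessZoneIdByTransitionsUncached(TimeZoneInfo bclZone)
    {
        // 1. Work out the window of time to examine. (Hypothetical bounds here;
        //    the real code works from the start of the current UTC year.)
        int year = SystemClock.Instance.GetCurrentInstant().InUtc().Year;
        Instant start = Instant.FromUtc(year, 1, 1, 0, 0);
        Instant end = Instant.FromUtc(year + 5, 1, 1, 0, 0);
        // return null;  // <-- uncomment to measure section 1 alone, and so on below.

        // 2. Get all the candidate TZDB time zones.
        var candidates = DateTimeZoneProviders.Tzdb.GetAllZones().ToList();

        // 3. Collect every transition of every candidate within the window.
        var transitions = candidates
            .SelectMany(zone => zone.GetZoneIntervals(start, end))
            .Where(interval => interval.HasStart && interval.Start > start)
            .Select(interval => interval.Start)
            .Distinct()
            .OrderBy(instant => instant)
            .ToList();

        // 4. Ask the BCL zone for its UTC offset at each transition point.
        var bclOffsets = transitions
            .Select(instant => bclZone.GetUtcOffset(instant.ToDateTimeUtc()))
            .ToList();

        // 5. Score each candidate by how many of its offsets at those points
        //    match the BCL zone's, and return the best-scoring zone ID.
        string bestId = null;
        int bestScore = -1;
        foreach (var candidate in candidates)
        {
            int score = transitions
                .Zip(bclOffsets, (instant, bclOffset) =>
                    candidate.GetUtcOffset(instant).ToTimeSpan() == bclOffset)
                .Count(match => match);
            if (score > bestScore)
            {
                bestScore = score;
                bestId = candidate.Id;
            }
        }
        return bestId;
    }
}
```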
For simplicity, let's make gabriel use the same number of time zones as bagpuss, just by adding .Take(135) to the zones query in Program.cs above. That way we're comparing the same amount of work on each machine - although admittedly we don't know that each time zone takes the same amount of time. Without any change to the production code, gabriel now takes 6s - which is about what we'd expect if the time per zone is uniform, as 19 * 135/434 = 5.91.
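That tweak is just a one-line change to the zones query in the earlier Program.cs listing:

```csharp
// Limit the run to 135 zones so both machines do the same number of zones.
var zones = TimeZoneInfo.GetSystemTimeZones().Take(135).ToList();
```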
Results on gabriel:
- Step 1: 0s (it's free, effectively)
- Step 1+2: 1.82s
- Step 1+2+3: 3.15s
- Step 1+2+3+4: 3.21s
- Step 1+2+3+4+5: 6s
Results on bagpuss:
- Step 1: 0s
- Step 1+2: 1.21s
- Step 1+2+3: 2.10s
- Step 1+2+3+4: 2.13s
- Step 1+2+3+4+5: 3.63s
Or in other words, stepwise:
- Step 1: 0s on both
- Step 2: 1.82s on gabriel; 1.21s on bagpuss
- Step 3: 1.33s on gabriel; 0.89s on bagpuss
- Step 4: 0.06s on gabriel; 0.03s on bagpuss
- Step 5: 2.79s on gabriel; 1.50s on bagpuss
No particular smoking gun, unfortunately - no obvious part that would be easily optimized. The last part looks like it shows the worst platform difference though… let's see if we can dig into that.
Step 6: Pivot to DateTimeZone.GetUtcOffset

Let's change our test pretty radically:
- Get a list of all the built in TZDB time zones
- Create a list of 100,000 instants, starting at 2000-01-01T00:00:00Z and spaced 1 hour apart. 100,000 is enough instants to make the test last for roughly 10 seconds. (Fast enough to iterate, slow enough to give significant results.)
- For every zone, find the UTC offset for every instant.
To simplify reasoning about what we're testing, we want to remove time zone caching, too. Time zone computations are inherently fiddly, so Noda Time has a cache for each zone. In GuessZoneIdByTransitionsUncached we were coming up with a new DateTimeZone in each call, so while there'd be some caching involved, it wouldn't be as much as a simple version of this new test. For the moment, let's just remove the time zone caching layer entirely:
```csharp
var zones = DateTimeZoneProviders.Tzdb.GetAllZones()
    .Select(zone => zone is CachedDateTimeZone cached ? cached.TimeZone : zone)
    .ToList();
var start = Instant.FromUtc(2000, 1, 1, 0, 0);
var gap = Duration.FromHours(1);
var instants = Enumerable
    .Range(0, 100000)
    .Select(index => start + gap * index)
    .ToList();

var stopwatch = Stopwatch.StartNew();
foreach (var zone in zones)
{
    foreach (var instant in instants)
    {
        zone.GetUtcOffset(instant);
    }
}
stopwatch.Stop();
Console.WriteLine($"Elapsed time: {stopwatch.Elapsed}");
```
There's still one other caching layer that's relevant, and that's in terms of conversions between calendrical values and instants: Noda Time is optimized for the years 1900-2100 in the Gregorian calendar. If our test went outside that range, we'd get very different results, hitting different pieces of code very hard. Our original test was looking at 5 years around now, so let's stick to the optimized range for now.
Results:
- gabriel: 19.5s
- bagpuss: 12.5s
Good - we're still seeing the Linux run being about 50% slower than the Windows run.
Step 7: Break out a profiler!

It's been a while since I've done any profiling in .NET. I had a quick search to see what was available in terms of free tools that handle .NET Core, and came across CodeTrack. I don't think it runs under Linux, but that's okay - we're mostly using it to track down the hotspots first, so we can fine-tune our tests.
I'm not specifically advocating CodeTrack - profiling is an art and science about which I know relatively little - but it was easy to get going. 85% of the time was spent in StandardDaylightAlternatingMap.GetZoneInterval, so that looks like a good place to look next.
Step 8: Write actual benchmarks

I have a benchmark suite for Noda Time already, and at this point we've got to a stage where it's probably worth using proper benchmarking tools. BenchmarkDotNet is my framework of choice - so let's add a couple of benchmarks.
Looking at the code and the profiler, it looks worth testing both StandardDaylightAlternatingMap and ZoneRecurrence. Having added those benchmarks, I might as well put this down until the next day, as the complete benchmark suite runs automatically every night.
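The real benchmarks target those internal types, which I can't usefully reproduce here - but as a sketch of the general shape of a BenchmarkDotNet benchmark, here's one against the public DateTimeZone.GetUtcOffset instead. The zone and instant are arbitrary choices of mine, not taken from the real suite:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using NodaTime;

public class GetUtcOffsetBenchmarks
{
    // An arbitrary zone with DST transitions, and a sample instant within the
    // optimized 1900-2100 range mentioned earlier.
    private static readonly DateTimeZone Zone = DateTimeZoneProviders.Tzdb["Europe/London"];
    private static readonly Instant SampleInstant = Instant.FromUtc(2017, 6, 1, 12, 0);

    // Returning the Offset stops the result from being dead-code-eliminated.
    [Benchmark]
    public Offset GetUtcOffset() => Zone.GetUtcOffset(SampleInstant);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<GetUtcOffsetBenchmarks>();
}
```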
Step 9: Compare the benchmarks

The benchmark results are automatically uploaded to the Noda Time web site, although the UI for browsing them is somewhat unstable (in terms of URLs etc.) - hence the lack of a link here.
Looking at the benchmark results, it turns out I'd already got benchmarks for StandardDaylightAlternatingMap, annoyingly enough - so I needed to tidy those up.
However, they do clearly show the difference between .NET Core on Linux and on Windows, so we've narrowed the problem down further. That doesn't show that only that code has different performance, but it's hopefully a good enough start that we'll be able to continue and get a short but complete example to provide to the .NET Core CLR team.
Diving deeper into the implementation, ZoneYearOffset makes sense to test too… the further we go down, the less code will be involved.
Step 10: take a step back for now

We've made progress, but we're not all the way there. That's okay - the progress is recorded (both here and in benchmarks), and in looking at the test that's taking so long to run, we've confirmed that we're not doing anything stupid. It happens to end up doing a lot more work than any application code normally would - work that would normally only happen once (so can be inefficient without being a problem) is exercised hundreds of times. Such is life - I could decide not to run it in CI if I want, running it occasionally by hand and always just before a release.
I'll come back to this issue at some point in the future, but at this point, other projects have become more important.
Points to note and lessons learned

- Validate assumptions carefully. I still don't quite know why there was the blip in execution time, but it was very surprising.
- Unit tests are no substitute for a good benchmarking framework.
- Profilers exist, and are useful. No specific recommendations here, but make sure you're aware of the tools you could benefit from.
- Sometimes, you need to put down a problem for a while. Try not to lose the information you've already discovered - capture it in an issue, comments, a blog post like this one, a private document… anything to preserve information. My head leaks data very, very quickly.
I'm still investigating why .NET Core runs more slowly on Linux than on Windows here, but I've fixed this test - PR 953 is a pretty minor refactoring, but it allows the tests to pass the same list of time zones into each test case (sketched after the list below). That has two benefits:
- We don't need to load the zones from raw data on each test
- The "get UTC offset at instant X" result is cached in the time zone, so after the first test there's very little work to do
This very small, internal-only change has halved the total time of the Noda Time tests on Travis. Yay.