Diagnosing a Linux-only unit test failure
This is an odd one. I'm currently working on Cloud Firestore support for C#, and I've introduced a GeoPoint struct to represent a latitude/longitude pair, each being represented by a double. It implements IEquatable and overloads == and != in the obvious way. So far, so good, and I have working tests which have passed on CI.
I've added some more types that implement equality, so I thought I'd extract out common code into an EqualityTester class that can handle both the Equals methods and the operators (the latter via reflection).
All went well on my local machine, but on CI one environment failed. We use both AppVeyor and Travis in CI, and only Travis was failing. There were two failures - one testing the equality operators, and one as part of a broader serialization/deserialization test. Odd. Time to go through the normal diagnostic process"
Step 1: Check there's really a problemThe nature of the failing tests don't feel like an intermittent problem, but it's always good to get more of an idea. I forced Travis to rebuild the pull request on Travis a couple of times: yes, it failed the same two tests each time.
Step 2: Reproduce locallyTests that only fail in CI are really annoying. I've had this a couple of times, and it's always taken hours to sort out, because the "try something, get results, think about what they mean" cycle is so slow.
Fortunately, I have a Linux box at home - so I fetched the pull request and ran the tests there - and they failed. Hooray! It's not quite as nice as being able to reproduce it on my Windows laptop, but it's much better than being isolated to CI.
Step 3: Remove irrelevant test codeThere are two parts to the code that's failing: the unit tests, and the production code. Ideally, I'd like to get rid of both, to create a truly minimal example. I usually try to take out one project at a time, and I could have gone either way - but this time I decided to remove the testing first.
So in the same repo, I created a new netcoreapp1.0 console application, with a project reference to the production code. I moved the operator part of my equality tester code into Program.cs, simplified it slightly, introduced by own Assert class with True and False methods that printed out the result, and I was ready to run:
using System;using System.Reflection;using Google.Cloud.Firestore.Data;class Program{ static void Main(string[] args) { var gp1 = new GeoPoint(1.5, 2.5); var gp2 = new GeoPoint(1.5, 2.5); AssertEqualityOperators(gp1, gp2); } internal static void AssertEqualityOperators<T>(T control, T equal) { var typeInfo = typeof(T).GetTypeInfo(); var equalityMethodInfo = typeInfo.GetMethod( "op_Equality", new[] { typeof(T), typeof(T) }); Func<T, T, bool> equality = (lhs, rhs) => (bool) equalityMethodInfo.Invoke(null, new object[] { lhs, rhs }); Assert.True(equality(control, equal)); }}class Assert{ public static void True(bool actual) => Console.WriteLine($"Expected True; was {actual}"); public static void False(bool actual) => Console.WriteLine($"Expected False; was {actual}");}
Sure enough, the output shows the problem:
Expected True; was False
Great. Now let's remove the other dependency"
Step 4: Remove production code dependencyRather than copy the whole of GeoPoint into my test project, I just created a small Test struct which had similar equality code.
Dead-end: a single double valueI made a mistake here - I changed too much in one go. I reduced my Test struct to just one double field:
using System;using System.Reflection;class Program{ static void Main(string[] args) { Test t1 = new Test(1.5); Test t2 = new Test(1.5); var equalityMethodInfo = typeof(Test) .GetMethod("op_Equality", new[] { typeof(Test), typeof(Test) }); Func<Test, Test, bool> equality = (lhs, rhs) => (bool) equalityMethodInfo.Invoke(null, new object[] { lhs, rhs }); Console.WriteLine(t1 == t2); Console.WriteLine(equality(t1, t2)); Console.WriteLine(equalityMethodInfo.Invoke(null, new object[] { t1, t2 })); }}struct Test : IEquatable<Test>{ private readonly double value; public Test(double value) { this.value = value; } public static bool operator==(Test lhs, Test rhs) => lhs.value == rhs.value; public static bool operator!=(Test lhs, Test rhs) => !(lhs == rhs); public override bool Equals(object obj) => obj is Test t && Equals(t); public bool Equals(Test other) => this == other; public override int GetHashCode() => value.GetHashCode();}
That prints "True" three times - which is what we'd like it to do eventually, but it means it's not reproducing the problem we want it do.
Back to reproducing it: two double valuesTaking one step back towards the previous code, I went back to two double values instead of just one, expanding the Test struct like this:
struct Test : IEquatable<Test>{ private readonly double x; private readonly double y; public Test(double x, double y) { this.x = x; this.y = y; } public static bool operator==(Test lhs, Test rhs) => lhs.x == rhs.x && lhs.y == rhs.y; public static bool operator!=(Test lhs, Test rhs) => !(lhs == rhs); public override bool Equals(object obj) => obj is Test t && Equals(t); public bool Equals(Test other) => this == other; public override int GetHashCode() => x.GetHashCode() + y.GetHashCode();}
Change both constructor calls to Test(1.5, 2.5) and bingo: it prints True, False, False.
Step 5: Explore the scope of the problemI tried the exact same code on Windows, and (unsurprisingly, given that my earlier unit tests passed) it didn't reproduce the problem. But sticking with Linux, we can still try multiple frameworks and multiple build configurations.
I wouldn't like to say what prompted me to try different frameworks - intuition can be hard to explain, unfortunately. But let's change the project file a bit:
<Project Sdk="Microsoft.NET.Sdk"> <PropertyGroup> <OutputType>Exe</OutputType> <TargetFrameworks>netcoreapp1.0;netcoreapp1.1;netcoreapp2.0</TargetFrameworks> </PropertyGroup></Project>
And then run it in several different configurations (remember that three "True" lines means "everything is working", and "True", "False", "False" means "this is the problem we're investigating"):
$ dotnet run -c Debug -f netcoreapp1.0TrueFalseFalse$ dotnet run -c Release -f netcoreapp1.0TrueFalseFalse$ dotnet run -c Debug -f netcoreapp1.1TrueTrueTrue$ dotnet run -c Release -f netcoreapp1.1TrueTrueTrue$ dotnet run -c Debug -f netcoreapp2.0TrueTrueTrue$ dotnet run -c Release -f netcoreapp2.0TrueTrueTrue
So, it fails under .NET Core 1.0, but works under .NET Core 1.1 and .NET Core 2.0. That certainly makes it sound like a .NET Core 1.0 bug, but that doesn't mean I want to give up at this point. It's entirely possible that I have users who are using .NET Core 1.0, so I want my code to work under it.
Now, I've only given major/minor version numbers - and that's partly because I don't have a very good handle on the different parts of .NET Core on Linux and how to find the version numbers. The 1.0 packages I've got installed are listed as:
- dotnet-dev-1.0.4/xenial,now 1.0.4-1 amd64 [installed]
- dotnet-hostfxr-1.0.1/xenial,now 1.0.1-1 amd64 [installed]
- dotnet-sharedframework-microsoft.netcore.app-1.0.5/xenial,now 1.0.5-1 amd64 [installed,automatic]
No dotnet-runtime package, which I believe is part of 2.0 instead.
Step 6: Look into the failed comparisonThis part is really simple - just add more diagnostics to the == method:
public static bool operator==(Test lhs, Test rhs){ Console.WriteLine($"lhs=({lhs.x}, {lhs.y}); rhs=({rhs.x}, {rhs.y})"); Console.WriteLine($"{lhs.x == rhs.x}, {lhs.y == rhs.y}"); return lhs.x == rhs.x && lhs.y == rhs.y;}
Now things get really interesting in the output:
lhs=(1.5, 2.5); rhs=(1.5, 2.5)True, TrueTruelhs=(1.5, 1.85492638478664E-316); rhs=(1.5, 6.95251497332956E-310)True, FalseFalselhs=(1.5, 6.9525149733517E-310); rhs=(1.5, 6.9525149733351E-310)True, FalseFalse
The first three lines are as expected - and that's the working comparison. But each of the three-line-segments for the second and third comparisons show the first field (x) being fine, but the second being some tiny double value, and inconsistent at that.
Step 7: Try (and fail) to simplifyMaybe boxing is to blame? How about we try a really simple version of boxing - just override ToString(), box a value and see what comes up. The change in Test is simply:
public override string ToString() => $"{x}, {y}";
And then to test it:
object x = new Test(1.5, 2.5);Console.WriteLine(x);
This prints 1.5, 2.5 - so it appears to be boxed correctly.
Step 8: Try simplifying a bit lessIn our failing program, there are two potential complications at play: we're boxing the value, and we're calling a method via reflection. If just boxing isn't a problem, maybe reflection is?
I decided to try two things at once: an instance method and a static method accepting a Test parameter (like the operator). Now we're not doing any comparisons, we can get rid of a lot of code. Here's the new complete code:
using System;using System.Reflection;class Program{ static void Main(string[] args) { Test t = new Test(1.5, 2.5); var foo = typeof(Test).GetMethod("Foo"); foo.Invoke(t, null); var bar = typeof(Test).GetMethod("Bar"); bar.Invoke(null, new object[] { t }); }}struct Test{ private readonly double x; private readonly double y; public Test(double x, double y) { this.x = x; this.y = y; } public void Foo() => Console.WriteLine($"{x}, {y}"); public static void Bar(Test t) => Console.WriteLine($"{t.x}, {t.y}");}
And now the results:
1.5, 2.51.5, 6.95260190884551E-310
Interesting! At this point, there are lots of things that I could check:
- Does it affect all parameters?
- Does it happen for other field types? Other primitives? References?
- Can I provoke it in any other circumstances other than via
reflection?
Most of these can probably be answered by finding an existing issue that was fixed in netcore1.1" which is reasonably feasible given the Open Source nature of .NET Core.
That can all wait though (it's a Sunday evening for me, after all") - let's see if we can fix the problem first.
Step 9: Fix it with a stroke of luckGiven that this feels like a memory management issue, let's try something I've basically not used before: explicit struct layout.
A struct containing two double fields should be pretty easy to lay out: just fields at offsets 0 and 8, right? Admittedly it also sounds like it should be easy for the CLR to get right itself" but at least laying it out explicitly is low-risk.
using System.Runtime.InteropServices;[StructLayout(LayoutKind.Explicit)]struct Test{ [FieldOffset(0)] private readonly double x; [FieldOffset(8)] private readonly double y; ...}
Normally, things don't get fixed on first attempt, but in this case" it just worked!
Backtracking from the simplification, to the standalone comparison version, to fixing GeoPoint (which involved not using automatically implemented properties any more, as I want to apply the attribute to the field, not the property - boo!) and checking it works locally, to adding a commit to the PR and seeing it pass on Travis. Yay!
Next steps and points to noteMy next step is definitely to try to find some history of this issue. It's a surprisingly big problem to affect the GA .NET Core 1.0 runtime, so I expect there's some record of it somewhere. I want to understand better what was wrong, whether my fix is appropriate, and whether I need to apply it elsewhere.
Update: this is a still an issue in the 1.0.7 runtime, and apparently even in the 1.0.8-servicing daily builds, so I've filed an issue and look forward to seeing what happens there.
Some interesting aspects:
- If you initially see a problem in CI, your top priority is to reproduce it locally.
- As ever, remove as much other code as you can - your production code, your test code, all dependencies.
- It's handy to have multiple execution environments available. If I hadn't had a local Linux box, this would have been a nightmare.
- Try different configurations and different runtimes/frameworks.
- Sometimes the bug really is in the runtime or BCL. It's pretty rare, but it does happen.