Article 2ZQN0 Diagnosing a corner-case tooling bug

Diagnosing a corner-case tooling bug

by
jonskeet
from Jon Skeet's coding blog on (#2ZQN0)
Story Image

Unlike the previous tests which have been based on Noda Time, this post is based on some issues I've had with my main project at work: the Google Cloud Client Libraries for .NET.

Background

This is somewhat detailed for those of you who really care about the chain of events. If you don't, skip to the next subheading.

google-cloud-dotnet is a complicated repo with lots of projects, generated project files, its own Roslyn analyzers, a fairly involved documentation process using docfx, and other bits and pieces. We use Travis and AppVeyor for CI, and the AppVeyor build has historically gone through the documentation build as well.

On the evening of August 17th (UK time), AppVeyor's "Visual Studio 2017" image changed to include the .NET Core 2.0 SDK by default. This broke us in a few ways:

  • Our appveyor.yml invoked a bash script, but without explicitly invoking bash. I don't know why it worked before, but the change broke that. Easily fixed.
  • Our docfx build completely broke on AppVeyor due to differences in how msbuild is found or invoked. I don't understand this yet, but for the moment we've had to remove the docfx part of our build from AppVeyor. It works on our development machines - I believe due to the presence of VS2017 itself, rather than just the build tools.
  • Some aspect of building with .NET Core 2.0 broke our Roslyn analyzers, causing them to throw missing method exceptions. I don't understand enough about the breakage to explain it or say whether it was a bug, but we had to change the analyzers for them to work with the latest compilers. This wasn't a problem in our command-line builds as we had pinned to the 1.0 SDK using global.json, but it seems VS2017 update 3 uses the latest Roslyn version anyway (reasonably) so building in Visual Studio caused problems.

Unfortunately the solution to the last bullet point caused the analyzers to break under the 1.0 SDK" so we were basically in a no-win situation until we upgraded everything to use the 2.0 SDK. (Just to reassure customers, we're still targeting netstandard1.0 - just because we're using the 2.0 SDK doesn't mean you have to.)

Having updated to the 2.0 SDK, we were now guaranteed to have the C# 7.1 compiler available, so after a project file update to include latest, we could use new features.

The problem for this post: docfx and default literals

Our code has a lot of methods accepting cancellation tokens, and the default is always specified as

CancellationToken cancellationToken = default(CancellationToken)

That's quite a mouthful - but C# 7.1 default literals can help! This post isn't about the language feature, but basically it allows you to just use default in places where the C# compiler can work out which type's default value you mean. So we can change all occurrences of the above to:

CancellationToken cancellationToken = default

Great! A colleague put together a pull request, and I was about to give an LGTM and suggest that he merged it, when I remembered that we weren't building the documentation on AppVeyor any more. Given that this was the first build using C# 7.1 features, I thought it would be worth just checking that it worked.

It didn't. It failed with an error message:

Error:Error extracting metadata for c:/Users/jon/Test/Projects/google-cloud-dotnet/apis/Google.Cloud.BigQuery.V2/Google.Cloud.BigQuery.V2/Google.Cloud.BigQuery.V2.csproj: Value cannot be null.
Parameter name: name

This post is a log of the steps I've gone through, not trying to fix the bug, but trying to report the problem in a friendly manner.

On Stack Overflow, when I ask for a short but complete program demonstrating the problem, this is the sort of process I expect people to go through.

This won't be quite as accurate a log as some other posts, as I only started writing it when I was most of the way through the investigation. Hopefully it'll still be useful.

Step 0: Confirm the problem is real

This is where git is really, really useful. I was very easily able to clone my colleague's fork, checkout the branch his PR was created from, and go through the build procedure to see the error shown above.

Once I'd done that, I checked out the master branch, pulled to make sure I was up to date, and rebuilt. No problems - so it was definitely the change in that PR.

Step 1: Try the simplest potential fix

We have a file in our repo specifying the version of docfx to use, so trying a different version is trivial. I knew we weren't using the latest version, and that the latest version used a new version of Roslyn itself. It would make sense for that to fix the problem.

Tried it - no luck. At this point, it seemed unlikely that we'd be able to resolve the problem ourselves, and I was looking for a minimal example to report it.

Step 2: Try to reproduce in a minimal example

A minimal example in docfx is pretty simple. It needs three files:

A csproj for the project:

<Project Sdk="Microsoft.NET.Sdk"> <PropertyGroup> <TargetFramework>netstandard1.0</TargetFramework> <LangVersion>latest</LangVersion> </PropertyGroup></Project>

A docfx.json file, just enough to build the metadata in this case:

{ "metadata": [{ "src": [{ "files": ["Project.csproj"] }], "dest": "api", }]}

And a source file for it to build:

namespace Project{ public class Class1 { public void Foo(int x = default(int)) {} }}

Then I just needed to build the project and then the metadata:

dotnet build && docfx metadata -f

That worked, as expected. I then changed default(int) to just default and built again, hoping to see the failure. Unfortunately, that worked too.

At this point, I have no clue about the differences between the real project which didn't work and the minimal project that did.

Step 3: find a smaller reproduction from real code

As mentioned in the introduction, the repo contains a lot of projects. The BigQuery project mentioned in the error message contains a lot of code - not an ideal candidate to cut down. Looking through the changes in the PR, I spotted that files in Google.Cloud.Translation.V2 had also been changed. That's a much smaller project, and would be simpler to work with.

Instead of running through the full build, I just ran the commands to build that specific project and then its documentation. Bingo - same error as before, basically:

Error:Error extracting metadata for C:/Users/jon/Test/Projects/google-cloud-dotnet/apis/Google.Cloud.Translation.V2/Google.Cloud.Translation.V2/Google.Cloud.Translation.V2.csproj: Value cannot be null.
Parameter name: name

Step 4: isolate the smaller reproduction

I copied the broken code away from the git repo, and hacked the project file a bit:

  • I didn't need the signing key
  • I didn't need the import that prevented building net45 code on Linux
  • I didn't need the Roslyn analyzer

I copied and modified the docfx.json file that was used to create the metadata, and tested it again: same error.

At this point, we've got a reproduction with "only" 10 C# files and a single project. None of the infrastructure used for the rest of the repo is required. That's a good start.

Step 5: Gut the code

The aim is to remove as much code as possible, leaving only what's required to demonstrate the problem.

The project contained two parallel pairs of classes (TranslationClient/TranslationClientImpl and AdvancedTranslationClient/AdvancedTranslationClientImpl). They're very similar (and really only both exist for backwards compatibility) - it was trivial to remove the "advanced" pair and retest: same failure.

Next, remove the TranslationClientImpl class - we weren't running any code, so if we could reproduce the problem with the abstract TranslationClient class, that would be fine. Remove all factory methods and the class itself: same failure.

Next, remove the ListLanguages and DetectLanguage methods (and their related friends) - at which point the Language and Detection classes can go, along with LanguageCodes which is really just helper code. Same failure.

We're down to relatively little code now, but still more than I really want. What's left is a bunch of calls to do translation. It's basically a matrix of 8 methods, for every combination of:

  • Translate a single item or multiple items
  • Translate text or HTML
  • Sync or async

The single/multiple part is just implemented in terms of the "single" method calling the "multiple" method and extracting the result, using an expression-bodied member. Only the async methods used the default literal, so I decided to remove the synchronous code first. Same failure.

The HTML and text methods were basically the same in the abstract class (the real differences, small as they are, lie in the implementation) so it was simple to remove the HTML methods. Same failure.

That left two methods: the two async methods, both using the default literal, and one of which ("single") called the other ("multiple"). I tried removing the "single" method - suddenly, the failure went away. I restored the method, and double-checked: yes, we're back to seeing the failure.

Still, two methods is reasonable. Let's reduce those as far as we can, changing them to return strings instead of TranslationResult objects, for example, and removing the TranslationModel? parameter in each case: same failure.

Okay, this looks like it's enough to reproduce the problem even more minimally.

Step 6: Back to the minimal example code

I went back to the sample project from step 2, with a tiny project file and a tiny docfx.json file. I thought the problem might be one method calling another, so I reproduced that within the test code. (Unfortunately I no longer have the code I tried at this point.)

No failure. Grrr. What was the significant difference between that code and the failing Translation code?

Step 7: Transplant the definitely-failing code

Pretty clueless at this point, I copied the TranslateText methods that were definitely failing in the Translation project into my test project. As I'd removed all the Translation-specific types now, that didn't cause any compilation problems in itself" but the failure was back. Hooray.

So we now have:

  • Slightly complex code that at least slightly resembles the Translation code
  • A project file that's nice and simple
  • A docfx file that's nice and simple

That's still good: we can now jettison the gutted Translation code, and work just within our minimal test project.

Step 8: Reduce to one method

We still had one method ("single") calling another ("multiple"). Removing "single" removed the failure; removing "multiple" would cause a compilation failure in "single". Let's cut that link.

First, I replaced the the expression-bodied member from "single" with a regular method body, awaited a Task.Delay(...) (to avoid a warning about not awaiting anything) and returned null. The failure went away. Weird.

Next, I tried making it a non-async method, and just returned a null task instead. The failure stayed away.

My first "reasonably minimal, single-method" failure code looked like this:

using System.Threading.Tasks;namespace Project{ public class Class1 { /// <param name="x">A parameter</param> public virtual async Task<string> Foo(int x = default) => await Task.FromResult<string>(null); }}

I've since played around a lot and reduced it somewhat:

using System.Threading.Tasks;namespace Project{ public class Class1 { /// <param name="x">A parameter</param> public Task Foo(int x = default) => (Task) null; }}

Note that it's now no longer an async method. This generates the same error as before, and it's reliable, but any one of the following changes makes the failure go away:

  • Change default to default(int), the failure goes away
  • Remove the " documentation, the failure goes away
  • Make it a regular method instead of an expression-bodied method
  • Change (Task) null to just null

That's a pretty sensitive bug! I've still no idea where the problem is, but it's now ready to tell the docfx team.

Step 9: file a bug

With all of this work done, it was simple to file an issue with the repro code. At this point, I can relax"

Points to note
  • The only chance I really had for resolving this problem myself was updating to the latest version of docfx. The C# code in my project is correct, so it really looks like a bug somewhere (whether in docfx or in one of its dependencies is still up for grabs) but it's buried further than I'm likely to go.
  • To be as friendly as I can, I knew I had to provide a way for the team to reproduce the problem.
  • In this case, I tried a simple "let's reproduce from scratch" unsuccessfully before going back to the known-failing code
  • Even though the original scope ("all the code in the repo") was huge, it was fairly simple to reduce it in stages:
    • Find a sub-project which demonstrated the problem
    • Isolate that sub-project from the rest of the repo infrastructure
    • Cut down that sub-project to a reasonably minimal example
    • Transplant that example into a new, completely minimal project
  • Most of the time I "wasted" was due to the very odd nature of the failure - very small, usually-irrelevant perturbations of the code changed the result in an unexpected way. I could probably have realized this earlier, and started making my changes even smaller between tests.
External Content
Source RSS or Atom Feed
Feed Location http://codeblog.jonskeet.uk/feed/
Feed Title Jon Skeet's coding blog
Feed Link https://codeblog.jonskeet.uk/
Reply 0 comments