NuGet package statistics
For a while, I've been considering how useful nuget.org statistics are.
I know there have been issues in the past around accuracy, but that's not what I'm thinking about. I've been
trying to work out what the numbers mean at all and whether that's useful.
I'm pretty sure an older version of the nuget.org gallery gave stats on a per-operation basis, but right now it looks like we can break down the downloads by package version, client name and client version. (NodaTime example)
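For what it's worth, the per-version totals are also available programmatically. Here's a minimal sketch in Python (using the requests library) against the NuGet v3 search service; the field names below match the documented service index and search response as I understand them, so treat them as assumptions, and as far as I know the client name/version breakdown shown on the website isn't exposed this way.

```python
# A sketch of pulling per-version download totals from the NuGet v3 search
# service. The service index and search response shapes below match the
# documented API as I understand it; treat the field names as assumptions.
import requests

def version_downloads(package_id: str) -> dict[str, int]:
    index = requests.get("https://api.nuget.org/v3/index.json").json()
    search_url = next(r["@id"] for r in index["resources"]
                      if r["@type"] == "SearchQueryService")
    result = requests.get(
        search_url,
        params={"q": f"packageid:{package_id}", "prerelease": "true"},
    ).json()
    package = result["data"][0]
    return {v["version"]: v["downloads"] for v in package["versions"]}

if __name__ == "__main__":
    for version, downloads in version_downloads("NodaTime").items():
        print(f"{version}: {downloads}")
```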
In a way, the lack of NuGet "operation" at least makes it simpler to talk about: we only know about "downloads". So, what counts as a download?
What's a download?
Here are a few things that might increment that counter:
- Manual download from the web page
- Adding a new package in Visual Studio
- Adding a new package in Visual Studio Code
- nuget install from the command line
- dotnet restore for a project locally
- dotnet restore in a Continuous Integration (CI) system testing a PR
- dotnet restore in a CI system testing a merged PR
All of them sound plausible, but it's also possible that they wouldn't increment the counter:
- I might have a package in my NuGet cache locally
- A CI system might have its own global package cache
- A CI system might use a mirror service somehow
So what does the number really mean? Some accident of developer behavior, tooling configuration and project lifetime? One natural reaction to this is "The precise meaning of the number doesn't matter, but bigger is better." I'd suggest that's overly complacent.
Suppose I'm right that some CI systems have a package cache, but others don't. Suppose we look at packages X and Y which have download numbers of 1000 and 100,000 respectively. (Let's ignore
which versions those are for, or how long those versions have been out.) Does that mean Y's usage is "better" than X's in some way? Not necessarily. Maybe it means there's a single actively-developed
open source project using Y and a CI system that doesn't have a NuGet cache (and is configured to build every revision of every PR), whereas maybe there are a thousand entirely separate projects using
X, but all using a CI system that just serves up a single version from a cache for everything.
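To put toy numbers on that (entirely made up, just to show the shape of the problem):

```python
# Entirely made-up numbers, just to show how the counts in the example
# above could arise from very different usage patterns.

# Package Y: one actively-developed project, whose CI system has no NuGet
# cache and restores packages for every revision of every PR.
y_projects = 1
y_downloads = 500 * 200          # e.g. 500 PRs x 200 builds each = 100,000

# Package X: a thousand entirely separate projects, each of which only hit
# nuget.org once before a cache took over.
x_projects = 1000
x_downloads = x_projects * 1     # = 1,000

print(f"X: {x_downloads:>7,} downloads from {x_projects} projects")
print(f"Y: {y_downloads:>7,} downloads from {y_projects} project")
# Y shows 100x the downloads of X, yet X is used by 1000x as many projects.
```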
Of course, that's an extreme position. It's reasonable to suggest that on average, if package Y has larger download numbers than package X, then it's likely to be more widely used. But can we do better?
Imagine we had perfect information: a view into every machine on the planet, and every operation any of them performed. What number would we want to report? What does it mean for a package to be "popular" or "widely used"?
Maybe we should think in terms of "number of projects that use package X". Let's consider some situations:
- A project created to investigate a problem, and then deleted - never even committed to a source control system.
- A project which is created and committed to source control, but never used.
- A project created and in production use, maintained by 1 person.
- A project created and in production use, maintained by a team of 100 people.
- A project created by 1 person, but then forked by 10 people and never merged.
- A project created on GitHub by 1 person, and forked by 10 people on GitHub, with them repeatedly creating branches and merging back into the original repo.
- A project which doesn't use package X directly, but uses package Y that depends on package X.
If those all happened for the same package, what number would you want each of those projects to contribute to the package usage?
One first-order approximation could be achieved with "take some hash of the name of the project and propagate it (even past caches) when installing a package". That would allow us to be reasonably confident in some measure of "how many differently-named projects depend on package X", which might at least feel slightly more reasonable, although it's unclear to me how throwaway projects would end up being represented. (Do people tend to use the same names as each other for throwaway projects? I bet Console1 and WindowsForms1 would be pretty popular...)
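Purely as an illustration of that thought experiment (the helper names and the choice of SHA-256 are invented for the example, and nothing here reflects how NuGet actually works), the counting side might look something like this:

```python
# Illustration only: count "differently-named projects" per package by
# hashing the project name. The helper names are invented for this sketch;
# nothing here reflects how NuGet actually works.
import hashlib
from collections import defaultdict

def project_token(project_name: str) -> str:
    # Hashing alone is weak privacy: project names are guessable, so the
    # hashes could be brute-forced. That's one reason this isn't serious.
    return hashlib.sha256(project_name.encode("utf-8")).hexdigest()

# package id -> distinct project tokens seen alongside downloads
distinct_projects: defaultdict[str, set[str]] = defaultdict(set)

def record_download(package_id: str, project_name: str) -> None:
    distinct_projects[package_id].add(project_token(project_name))

# Throwaway projects with default names collapse into a single token:
for name in ["MyRealService", "Console1", "Console1", "WindowsForms1"]:
    record_download("NodaTime", name)

print(len(distinct_projects["NodaTime"]))   # 3 distinct "projects", not 4
```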
That isn't a serious suggestion, by the way - it's not clear to me that hashing alone provides sufficient privacy protection, for a start. There are multiple further issues in terms of cache-busting, too. It's an interesting thought experiment.
What do I actually care about, though?
That's even assuming that "number of projects that use package X" is a useful measure. It's not clear to me that it is.
As an open source contributor, there are two aspects I care about:
- How many people will I upset, and how badly, if I break something?
- How many people will I delight, and to what extent, if I implement a particular new feature?
It's not clear to me that any number is going to answer those questions for me.
So what do you care about? What would you want nuget.org to show if it could? What do you think would be reasonable for it to show in the real world with real world constraints?