National Drug Code (NDC)
The US Food and Drug Administration tracks drugs using an identifier called the NDC, or National Drug Code. It is described as a 10-digit code, but it may be more helpful to think of it as a 12-character code.
An NDC contains 10 digits, separated into three segments by two dashes. The three segments are the labeler code, product code, and package code. The FDA assigns the labeler codes to companies, and each company assigns its own product and package codes.
Format
The segments are of variable length, so the dashes are significant. The labeler code may be 4 or 5 digits, the product code 3 or 4 digits, and the package code 1 or 2 digits. The total number of digits must be 10, so there are three possible combinations:
- 4-4-2
- 5-3-2
- 5-4-1
There's no way to look at just the digits and know how to separate them into three segments. My previous post looked at self-punctuating codes. The digits of NDC codes are not self-punctuating because they require the dashes. The digit combinations are supposed to be unique, but you can't tell how to parse a set of digits from the digits alone.
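As a minimal Python sketch of the point above: the function names `parse_ndc` and `possible_splits` are made up for illustration and are not part of any FDA tooling. The first parses a dashed code into its three segments, and the second lists every documented way the same ten bare digits could be split.

```python
import re

# Segment lengths as documented: labeler 4-5, product 3-4, package 1-2 digits.
NDC_PATTERN = re.compile(r"(\d{4,5})-(\d{3,4})-(\d{1,2})")

def parse_ndc(code):
    """Return (labeler, product, package), or None if code is not a valid dashed NDC."""
    m = NDC_PATTERN.fullmatch(code)
    # The pattern alone would also admit layouts like 4-3-1, so require 10 digits in total.
    if m is None or sum(len(g) for g in m.groups()) != 10:
        return None
    return m.groups()

def possible_splits(digits):
    """List each documented way to split 10 bare digits into three segments."""
    layouts = [(4, 4, 2), (5, 3, 2), (5, 4, 1)]
    return [
        (digits[:a], digits[a:a + b], digits[a + b:])
        for a, b, _ in layouts
    ]
```

For example, `parse_ndc("29500-907-77")` returns `("29500", "907", "77")`, while `possible_splits("2950090777")` returns three candidates, with nothing in the digits themselves to say which one was intended.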
Statistics
I downloaded the NDC data from the FDA to verify whether the codes work as documented and to see the relative frequency of the various formats.
(The data change daily, so you may get different results if you try this yourself.)
Format
All the codes were 12 characters long, and all had the documented format, as verified by the regular expression [1]
\d{4,5}-\d{3,4}-\d{1,2}
Uniqueness exception
I found one exception to the rule that the sequence of digits should be unique. The command
sed "s/-//g" ndc.txt | sort | uniq -d
returned 2950090777.
The set of NDC codes contained both 29500-907-77 and 29500-9077-7.
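The same check can be sketched in Python, assuming the codes have already been read into a list of dashed strings (one per line of a file like ndc.txt). The function name `colliding_codes` is hypothetical. Unlike the `uniq -d` pipeline, this version reports only distinct dashed codes that share a digit string, and it shows which codes collide.

```python
from collections import defaultdict

def colliding_codes(codes):
    """Map each bare digit string to the distinct dashed codes that share it.

    Only digit strings shared by more than one dashed code are returned.
    """
    by_digits = defaultdict(set)
    for code in codes:
        by_digits[code.replace("-", "")].add(code)
    return {
        digits: sorted(dashed)
        for digits, dashed in by_digits.items()
        if len(dashed) > 1
    }
```

Applied to a list containing both of the codes above, the result would have the key `"2950090777"` mapped to the two dashed forms.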
Distribution
About 60% of the codes had the form 5-3-2. About 30% had the form 5-4-1, and the remaining 10% had the form 4-4-2.
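One way to tally the formats, sketched in Python under the assumption that every code in the list is a well-formed dashed NDC. The function name `format_shares` is made up for illustration.

```python
from collections import Counter

def format_shares(codes):
    """Return the fraction of codes having each segment-length layout, e.g. '5-3-2'."""
    layouts = Counter(
        "-".join(str(len(segment)) for segment in code.split("-"))
        for code in codes
    )
    total = sum(layouts.values())
    return {layout: count / total for layout, count in layouts.items()}
```

The keys of the returned dictionary are layout strings such as `"5-3-2"`, and the values sum to 1.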
There were a total of 252,355 NDC codes with 6,532 different labelers (companies).
There were 9,448 NDC codes associated with the most prolific labeler. The 1,424 least prolific labelers had only one NDC code each. In Pareto-like fashion, the top 20% of labelers accounted for about 90% of the codes.
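Labeler statistics like these could be computed along the following lines. This is a sketch, not the method used to produce the numbers above: it assumes a non-empty list of dashed codes, treats the first segment as the labeler, and the name `labeler_stats` and its `top_fraction` parameter are hypothetical.

```python
from collections import Counter

def labeler_stats(codes, top_fraction=0.2):
    """Summarize how codes are distributed across labelers (the first segment)."""
    counts = Counter(code.split("-")[0] for code in codes)
    ranked = sorted(counts.values(), reverse=True)
    # Number of labelers in the top fraction, at least one.
    top_n = max(1, round(len(ranked) * top_fraction))
    return {
        "labelers": len(ranked),                         # distinct labeler codes
        "largest": ranked[0],                            # codes held by the most prolific labeler
        "singletons": sum(1 for n in ranked if n == 1),  # labelers with exactly one code
        "top_share": sum(ranked[:top_n]) / sum(ranked),  # share of codes held by the top fraction
    }
```

With the default `top_fraction=0.2`, the `"top_share"` entry corresponds to the Pareto-style figure quoted above.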
[1] Programming languages like Python and Perl will recognize this regular expression, but by default grep does not support \d for digits. The GNU implementation of grep will with the -P option. It will also understand notation like {4,5} to mean a pattern is repeated 4 or 5 times, with or without -P, but I don't think other implementations of grep necessarily will.