Feed john-d-cook John D. Cook

Link https://www.johndcook.com/blog
Feed http://feeds.feedburner.com/TheEndeavour?format=xml
Updated 2025-03-07 01:01
Typesetting zodiac symbols in LaTeX
Typesetting zodiac symbols in LaTeX is admittedly an unusual thing to do. LaTeX is mostly used for scientific publication, and zodiac symbols are commonly associated with astrology. But occasionally zodiac symbols are used in more respectable contexts. The wasysym package for LaTeX includes miscellaneous symbols, including zodiac symbols. Here are the symbols, their LaTeX commands, […]
Airline flight number parity
I read in Wikipedia this morning that there’s a pattern to the parity of flight numbers. Among airline flight numbers, even numbers typically identify eastbound or northbound flights, and odd numbers typically identify westbound or southbound flights. I never noticed this. I could see how it might be a useful convention. It would mean that […]
Testing Rupert Miller’s suspicion
I was reading Rupert Miller’s book Beyond ANOVA when I ran across this line: I never use the Kolmogorov-Smirnov test (or one of its cousins) or the χ² test as a preliminary test of normality. … I have a feeling they are more likely to detect irregularities in the middle of the distribution than in […]
Why would anyone do that?
There are tools that I’ve used occasionally for many years that I’ve just started to appreciate lately. “Oh, that’s why they did that.” When you see something that looks poorly designed, don’t just exclaim “Why would anyone do that?!” but ask sincerely “Why would someone do that?” There’s probably a good reason, or at least […]
Predicted distribution of Mersenne primes
Mersenne primes are prime numbers of the form 2^p – 1. It turns out that if 2^p – 1 is a prime, so is p; the requirement that p is prime is a theorem, not part of the definition. So far 51 Mersenne primes have been discovered [1]. Maybe that’s all there are, but it is […]
Short video introducing differential privacy
Here is a 12-minute video from Minute Physics, in collaboration with the US Census Bureau, giving an overview of differential privacy and how the 2020 census will use it to protect privacy.
Collatz conjecture skepticism
The Collatz conjecture asks whether the following procedure always terminates at 1. Take any positive integer n. If it’s odd, multiply it by 3 and add 1. Otherwise, divide it by 2. For obvious reasons the Collatz conjecture is also known as the 3n + 1 conjecture. It has been computationally verified that the Collatz […]
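The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the post; the function name `collatz_steps` is made up for this example, and the loop simply assumes the procedure reaches 1 for the input given.

```python
def collatz_steps(n: int) -> int:
    """Count how many steps the 3n + 1 procedure takes to reach 1 from n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        # Odd: multiply by 3 and add 1. Even: divide by 2.
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps


print(collatz_steps(6))  # 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1, i.e. 8 steps
```

If the conjecture were false for some n, this loop would never return for that n, which is exactly why the question is hard to settle by computation alone.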
String interpolation in Python and R
One of the things I liked about Perl was string interpolation. If you use a variable name in a string, the variable will expand to its value. For example, if you have a variable $x which equals 42, then the string "The answer is $x." will expand to “The answer is 42.” Perl requires variables to […]
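For comparison, one way to get a similar effect in Python is a formatted string literal (f-string), available since Python 3.6. The excerpt is truncated, so this is only a sketch of one mechanism and not necessarily the one the post goes on to describe.

```python
x = 42

# An f-string evaluates the expression inside the braces and
# substitutes its value into the string.
message = f"The answer is {x}."
print(message)  # The answer is 42.
```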
Detecting typos with the four color theorem
In my previous post on VIN numbers, I commented that if a check sum has to be one of 11 characters, it cannot detect all possible changes to a string from an alphabet of 33 characters. The number of possible check sum characters must be at least as large as the number of possible characters […]
Vehicle Identification Number (VIN) check sum
A VIN (vehicle identification number) is a string of 17 characters that uniquely identifies a car or motorcycle. These numbers are used around the world and have three standardized formats: one for North America, one for the EU, and one for the rest of the world. Letters that resemble digits The characters used in a […]
Progress on the Collatz conjecture
The Collatz conjecture is for computer science what until recently Fermat’s last theorem was for mathematics: a famous unsolved problem that is very simple to state. The Collatz conjecture, also known as the 3n+1 problem, asks whether the following function terminates for all positive integer arguments n. def collatz(n): if n == 1: return 1 […]
How UTF-8 works
UTF-8 is a clever way of encoding Unicode text. I’ve mentioned it a couple times lately, but I haven’t blogged about UTF-8 per se. Here goes. The problem UTF-8 solves US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode […]
Excel, R, and Unicode
I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened. Say the CSV file looked like this: foo,bar 1,2 3,4 I read the file into R with df <- read.csv("foobar.csv", header=TRUE) and could access […]
How fast were dead languages spoken?
A new paper in Science suggests that all human languages carry about the same amount of information per unit time. In languages with fewer possible syllables, people speak faster. In languages with more syllables, people speak slower. Researchers quantified the information content per syllable in 17 different languages by calculating Shannon entropy. When you multiply […]
Quiet mode
When you start a programming language like Python or R from the command line, you get a lot of initial text that you probably don’t read. For example, you might see something like this when you start Python. Python 2.7.6 (default, Nov 23 2017, 15:49:48) [GCC 4.8.4] on linux2 Type "help", "copyright", "credits" or "license" […]
More bc weirdness
As I mentioned in a footnote to my previous post, I just discovered that variable names in the bc programming language cannot contain capital letters. I think I understand why: Capital letters are reserved for hexadecimal constants, though in a weird sort of way. At first variable names in bc could only be one letter […]
Asimov’s question about π
In 1977, Isaac Asimov [1] asked how many terms of the slowly converging series π = 4 – 4/3 + 4/5 – 4/7 + 4/9 – … would you have to sum before doing better than the approximation π ≈ 355/113. A couple years later Richard Johnsonbaugh [2] answered Asimov’s question in the course of […]
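Asimov's question can also be explored by brute force. The sketch below is not the analysis from the post or from Johnsonbaugh; it assumes "doing better" means the first partial sum whose absolute error is smaller than that of 355/113, and the function name `terms_to_beat` is invented for this example.

```python
import math


def terms_to_beat(target_error: float, max_terms: int = 10_000_000) -> int:
    """Return the number of terms in the first partial sum of
    4 - 4/3 + 4/5 - 4/7 + ... whose error is below target_error."""
    partial_sum = 0.0
    sign = 1.0
    for k in range(max_terms):
        partial_sum += sign * 4.0 / (2 * k + 1)
        sign = -sign
        if abs(partial_sum - math.pi) < target_error:
            return k + 1
    raise RuntimeError("target error not reached within max_terms")


# Error of the rational approximation 355/113, roughly 2.7e-7.
error_355_113 = abs(355 / 113 - math.pi)
print(terms_to_beat(error_355_113))
```

Since the error after N terms is roughly 1/N, the answer should be on the order of a few million terms, which is what makes the series impractical for computing π.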
National Drug Code (NDC)
The US Food and Drug Administration tracks drugs using an identifier called the NDC or National Drug Code. It is described as a 10-digit code, but it may be more helpful to think of it as a 12-character code. An NDC contains 10 digits, separated into three segments by two dashes. The three segments are […]
Prefix code examples
In many offices, you can dial a three digit number to reach someone else in the office. In such offices, you usually have to dial 9 to reach an outside number. There’s no ambiguity because no one can have an extension that begins with 9. After you’ve entered three digits, the phone system knows […]
How many possible Unicode characters there are and why
How many? The previous post showed how the number of Unicode characters has grown over time. You’ll notice there was a big jump between versions 3.0 and 3.1. That will be important later on. Unicode started out relatively small then became much more ambitious. Are they going to run out of room? How many possible […]
Growth of Unicode over time
My previous post quoted Randall Munroe saying Unicode “started out just trying to unify a couple different character sets” and grew much more ambitious. The first version of Unicode, published in 1991, had 7,191 characters. Now the latest version has 137,994 characters and so is about 19 times bigger. Here’s a plot of the number […]
The hopeless task of the Unicode Consortium
Randall Munroe, author of xkcd, discussing Unicode on the Triangulation podcast: I am endlessly delighted by the hopeless task that the Unicode Consortium has created for themselves. … They started out just trying to unify a couple different character sets. And before they quite realized what was happening, they were grappling with decisions at the […]
Regular expressions and special characters
Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular […]
Munging CSV files with standard Unix tools
This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools? Why CSV? In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard […]
Three-digit zip codes and data privacy
Birth date, sex, and five-digit zip code are enough information to uniquely identify a large majority of Americans. See more on this here. So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last two digits of a zip code. And even though three-digit zip […]
Working with wide text files at the command line
Suppose you have a data file with obnoxiously long lines and you’d like to preview it from the command line. For example, the other day I downloaded some data from the American Community Survey and wanted to see what the files contained. I ran something like head data.csv to look at the first few lines […]
Estimating vocabulary size with Heaps’ law
Heaps’ law says that the number of unique words in a text of n words is approximated by V(n) = K n^β where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 […]
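As a quick illustration of the formula, here is a minimal sketch. The parameter values K = 50 and β = 0.5 are assumptions picked only to make the arithmetic easy; they are not fitted to any text, and `heaps_vocabulary` is a name made up for this example.

```python
def heaps_vocabulary(n: int, K: float, beta: float) -> float:
    """Estimate the number of unique words in a text of n words: V(n) = K * n**beta."""
    if n < 0 or K <= 0 or not 0 < beta < 1:
        raise ValueError("need n >= 0, K > 0 and 0 < beta < 1")
    return K * n ** beta


# Hypothetical parameters, chosen for illustration only.
print(heaps_vocabulary(10_000, K=50, beta=0.5))  # about 5000 unique words
```

Because β is less than 1, doubling the length of the text less than doubles the estimated vocabulary.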
Mickey Mouse, Batman, and conformal mapping
A conformal map between two regions in the plane preserves angles [1]. If two curves meet at a given angle in the domain, their images will meet at the same angle in the range. Two subsets of the plane are conformally equivalent if there is a conformal map between them. The Riemann mapping theorem says […]
Star-crossed lovers
A story in The New Yorker quotes the following explanation from Arthur Eddington regarding relativity and the speed of light. Suppose that you are in love with a lady on Neptune and that she returns the sentiment. It will be some consolation for the melancholy separation if you can say to yourself at some—possibly prearranged—moment, […]
Contributing to open source projects
David Heinemeier Hansson presents a very gracious view of open source software in his keynote address at RailsConf 2019. And by gracious, I mean gracious in the theological sense. He says at one point “If I were a Christian …” implying that he is not, but his philosophy of software echoes the Christian idea of […]
Stone-Weierstrass on a disk
A couple weeks ago I wrote about a sort of paradox, that Weierstrass’ approximation theorem could seem to contradict Morera’s theorem. Weierstrass says that the uniform limit of polynomials can be an arbitrary continuous function, and so may have sharp creases. But Morera’s theorem says that the uniform limit of polynomials is analytic and thus […]
Distribution of zip code population
There are three schools of thought regarding power laws: the naive, the enthusiasts, and the skeptics. Of course there are more than three schools of thought, but there are three I want to talk about. The naive haven’t heard of power laws or don’t know much about them. They probably tend to expect things to […]
Landau kernel
The previous post was about the trick Lebesgue used to construct a sequence of polynomials converging to |x| on the interval [-1, 1]. This was the main step in his proof of the Weierstrass approximation theorem. Before that, I wrote a post on Bernstein’s proof that used his eponymous polynomials to prove Weierstrass’ theorem. This […]
Lebesgue’s proof of Weierstrass’ theorem
A couple weeks ago I wrote about the Weierstrass approximation theorem, the theorem that says every continuous function on a closed finite interval can be approximated as closely as you like by a polynomial. The post mentioned above uses a proof by Bernstein. And in that post I used the absolute value function as an […]
Proving that a choice was made in good faith
How can you prove that a choice was made in good faith? For example, if your company selects a cohort of people for random drug testing, how can you convince those who were chosen that they weren’t chosen deliberately? Would a judge find your explanation persuasive? This is something I’ve helped companies with. It may […]
Detecting a short period in an RNG
The last couple posts have been looking at the Cliff random number generator. I introduce the generator here and look at its fixed points. These turn out to be less of a problem in practice than in theory. Yesterday I posted about testing the generator with the DIEHARDER test suite, the successor to George Marsaglia’s […]
Testing Cliff RNG with DIEHARDER
My previous post introduced the Cliff random number generator. The post showed how to find starting seeds where the generator will start out by producing approximately equal numbers. Despite this flaw, the generator works well by some criteria. I produced a file of s billion 32-bit integers by multiplying the output values, which were floating […]
Fixed points of the Cliff random number generator
I ran across the Cliff random number generator yesterday. Given a starting value x_0 in the open interval (0, 1), the generator proceeds by x_{n+1} = | 100 log(x_n) mod 1 | for n > 0. The article linked to above says that this generator passes a test of randomness based on generating points on […]
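The recurrence above is short enough to try directly. The sketch below is an illustration, not the code from the post. It assumes the natural logarithm and Python's `%` operator, which already returns a value in [0, 1) for a positive modulus, so the absolute value is kept only to mirror the formula. The name `cliff_sequence` is invented for this example.

```python
import math


def cliff_sequence(seed: float, count: int) -> list[float]:
    """Return `count` successive values of x -> |100 log(x) mod 1|, starting from seed."""
    if not 0.0 < seed < 1.0:
        raise ValueError("seed must lie in the open interval (0, 1)")
    values = []
    x = seed
    for _ in range(count):
        # math.log raises ValueError if x ever lands exactly on 0.0;
        # this sketch does not guard against that unlikely case.
        x = abs(100.0 * math.log(x) % 1.0)
        values.append(x)
    return values


print(cliff_sequence(0.5, 5))
```

With a C-style `fmod`, the result before taking the absolute value would be negative, so a different language or convention could produce a different sequence from the same seed.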
Ease of learning vs relearning
Much more is written about how easy or hard some technology is to learn than about how hard it is to relearn. Maybe this is because people are more eager to write about something while the excitement or frustration of their first encounter is fresh. Advocates of difficult-to-learn technologies say that tools should be optimized […]
Uniform approximation paradox
What I’m going to present here is not exactly a paradox, but I couldn’t think of a better way to describe it in the space of a title. I’ll discuss two theorems about uniform convergence that seem to contradict each other, then show by an example why there’s no contradiction. Weierstrass approximation theorem One of […]
Nearly parallel is nearly transitive
We begin with a bit of geometry, then show its relevance to statistics. Geometry Let X, Y, and Z be three unit vectors. If X is nearly parallel to Y, and Y is nearly parallel to Z, then X is nearly parallel to Z. Here’s a proof. Think of X, Y, and Z as points […]
Angles in the spiral of Theodorus
The previous post looked at how to plot the spiral of Theodorus shown below. We stopped the construction where we did because the next triangle to be added would overlap the first triangle, which would clutter the image. But we could certainly have kept going. If we do keep going, then the set of hypotenuse […]
How to plot the spiral of Theodorus
You may have seen the spiral of Theodorus. It sticks a sequence of right triangles together to make a sort of spiral. Each triangle has a short side of length 1, and the hypotenuse of each triangle becomes the long leg of the next triangle as shown below. How would you plot this spiral? At […]
Encryption as secure as factoring
RSA encryption is based on the assumption that factoring large integers is hard. However, it’s possible that breaking RSA is easier than factoring. That is, the ability to factor large integers is sufficient for breaking RSA, but it might not be necessary. Two years after the publication of RSA, Michael Rabin created an alternative that […]
Accelerating convergence with Aitken’s method
The previous post looked at Euler’s method for accelerating the convergence of a slowly converging alternating series. Both hypotheses are necessary. The signs must alternate between terms, and applying the method to a series that is already converging quickly can slow down convergence. Aitken’s method This post looks at Aitken’s method for speeding up the […]
Accelerating an alternating series
The most direct way of computing the sum of an alternating series, simply computing the partial sums until the terms get small enough, may not be the most efficient. Euler figured this out in the 18th century. For our demo we’ll evaluate the Struve function defined by the series Note that the terms in […]
Data breach trends
Are data breaches becoming more or less common? This post gives a crude, back-of-the-envelope calculation to address the question. We won’t look at number of breaches per se but number of records breached. There’s a terrific visualization of data breach statistics at Information is Beautiful, and they share their data here. Note that the data […]
Beating the odds on the Diffie-Hellman decision problem
There are a couple variations on the Diffie-Hellman problem in cryptography: the computation problem (CDH) and the decision problem (DDH). This post will explain both and give an example of where the former is hard and the latter easy. The Diffie-Hellman problems The Diffie-Hellman problems are formulated for an Abelian group. The main group we […]
Magic square links and errata
Someone pointed out that what I called a knight’s tour magic square is technically a semi-magic square: the rows and columns add up to the same constant, but the diagonals do not. It turns out there are no strict magic squares formed by knight’s tours. This was proved in 2003. See a news article here. […]
Quaternion reference in the Vulgate
To contemporary ears “quaternion” refers to a number system discovered in the 19th century, but there were a couple precedents. Both refer to something related to a group of four things, but there is no relation to mathematical quaternions other than that they have four dimensions. I’ve written before about Milton’s use of the word […]