Feed john-d-cook John D. Cook


Link https://www.johndcook.com/blog
Feed http://feeds.feedburner.com/TheEndeavour?format=xml
Updated 2025-09-12 01:46
Quantifying privacy loss in a statistical database
In the previous post we looked at a simple randomization procedure to obscure individual responses to yes/no questions in a way that retains the statistical usefulness of the data. In this post we’ll generalize that procedure, quantify the privacy loss, and discuss the utility/privacy trade-off. More general randomized response Suppose we have a binary response […]
Randomized response, privacy, and Bayes theorem
Suppose you want to gather data on an incriminating question. For example, maybe a statistics professor would like to know how many students cheated on a test. Being a statistician, the professor has a clever way to find out what he wants to know while giving each student deniability. Randomized response Each student is asked […]
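The excerpt is cut off before the scheme itself, but the classic mechanism can be sketched in a few lines of Python. This is one common variant of randomized response, not necessarily the exact scheme in the post: each student flips a coin, answers truthfully on heads, and on tails answers by a second coin flip. The observed yes-rate then satisfies P(yes) = p/2 + 1/4, which can be inverted to estimate the true rate p.

```python
import random

def randomized_response(truth, rng):
    """A common variant (assumed here, not necessarily the post's):
    on heads answer truthfully; on tails answer by a second coin."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_proportion(answers):
    """Invert P(yes) = p/2 + 1/4 to recover the true rate p."""
    p_yes = sum(answers) / len(answers)
    return 2 * p_yes - 0.5

rng = random.Random(20170912)
true_rate = 0.3
students = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomized_response(s, rng) for s in students]
print(round(estimate_true_proportion(answers), 2))  # close to 0.3
```

No individual answer incriminates anyone, yet the aggregate estimate recovers the true rate.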
Why don’t you simply use XeTeX?
From an FAQ post I wrote a few years ago: This may seem like an odd question, but it’s actually one I get very often. On my TeXtip twitter account, I include tips on how to create non-English characters, such as using \AA to produce Å. Every time, someone will ask “Why not use XeTeX and just […]
Pascal’s triangle and Fermat’s little theorem
I was listening to My Favorite Theorem when Jordan Ellenberg said something in passing about proving Fermat’s little theorem from Pascal’s triangle. I wasn’t familiar with that, and fortunately Evelyn Lamb wasn’t either and so she asked him to explain. Fermat’s little theorem says that for any prime p and any integer a, aᵖ = a […]
Making a problem easier by making it harder
In the oral exam for my PhD, my advisor asked me a question about a differential equation. I don’t recall the question, but I remember the interaction that followed. I was stuck, and my advisor countered by saying “Let me ask you a harder question.” I was still stuck, and so he said “Let me […]
Quantifying the information content of personal data
It can be surprisingly easy to identify someone from data that’s not directly identifiable. One commonly cited result is that the combination of birth date, zip code, and sex is enough to identify most people. This post will look at how to quantify the amount of information contained in such data. If the answer to […]
Negative correlation introduced by success
Suppose you measure people on two independent attributes, X and Y, and take those for whom X+Y is above some threshold. Then even though X and Y are uncorrelated in the full population, they will be negatively correlated in your sample. This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how […]
Highly cited theorems
Some theorems are cited far more often than others. These are not the most striking theorems, not the most advanced or most elegant, but ones that are extraordinarily useful. I first noticed this when taking complex analysis where the Cauchy integral formula comes up over and over. When I first saw the formula I thought […]
Width of mixture PDFs
Suppose you draw samples from two populations, one of which has a wider probability distribution than the other. How does the width of the distribution of the combined sample vary as you change the proportions of the two populations? The extremes are easy. If you pick only from one population, then the resulting distribution will […]
Team dynamics and encouragement
When you add people to a project, the total productivity of the team as a whole may go up, but the productivity per person usually goes down. Someone suggested that as a rule of thumb, a company needs to triple its number of employees to double its productivity. Fred Brooks summarized this saying “Many hands […]
Relearning from a new perspective
I had a conversation with someone today who said he’s relearning logic from a categorical perspective. What struck me about this was not the specifics but the pattern: Relearning _______ from a _______ perspective. Not relearning something forgotten, but going back over something you already know well, but from a different starting point, a different […]
Hurricane Harvey update
As you may know, I live in the darkest region of the rainfall map below. My family and I are doing fine. Our house has not flooded, and at this point it looks like it will not flood. We’ve only lost electricity for a second or two. Of course not everyone in Houston is doing […]
Defining the Fourier transform on LCA groups
My previous post said that all the familiar variations on Fourier transforms—Fourier series analysis and synthesis, Fourier transforms on the real line, discrete Fourier transforms, etc.—can be unified into a single theory. They’re all instances of a Fourier transform on a locally compact Abelian (LCA) group. The difference between them is the underlying group. Given […]
Unified theory of Fourier transforms
You can take a periodic function and analyze it into its Fourier coefficients, or use the Fourier coefficients in a sum to synthesize a periodic function. You can take the Fourier transform of a function defined on the whole real line and get another such function. And you can compute the discrete Fourier transform via […]
Solving problems we wish we had
There’s a great line from Heather McGaw toward the end of the latest episode of 99 Percent Invisible: Sometimes … we can start to solve problems that we wish were problems because they’re easy to solve. Reminds me of an excerpt from Richard Weaver’s book Ideas Have Consequences: Obsession, according to the canons of psychology, […]
Predicting when an RNG will output a given value
A few days ago I wrote about how to pick the seed of a simple random number generator so that a desired output came n values later. The number n was fixed and we varied the seed. In this post, the seed will be fixed and we’ll solve for n. In other words, we ask when a […]
Programming language life expectancy
The Lindy effect says that what’s been around the longest is likely to remain around the longest. It applies to creative artifacts, not living things. A puppy is likely to live longer than an elderly dog, but a book that has been in press for a century is likely to be in press for another century. […]
Reverse engineering the seed of a linear congruential generator
The previous post gave an example of manipulating the seed of a random number generator to produce a desired result. This post will do something similar for a different generator. A couple times I’ve used the following LCG (linear congruential random number generator) in examples. An LCG starts with an initial value of z and updates z […]
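The constants of the post's LCG are cut off in the excerpt, so as a sketch here is the same update pattern with the Park–Miller "minimal standard" parameters as a stand-in (my choice, not necessarily the post's generator):

```python
# Stand-in constants: the Park-Miller "minimal standard" LCG.
M = 2**31 - 1   # modulus (a Mersenne prime)
A = 16807       # multiplier (a primitive root mod M)

def lcg_step(z):
    """One update: z -> A*z mod M."""
    return (A * z) % M

z = 20170912    # arbitrary seed
samples = []
for _ in range(5):
    z = lcg_step(z)
    samples.append(z)
print(samples)
```

Because the update is a single modular multiplication, running it backwards (to recover a seed from an output) only requires the modular inverse of the multiplier.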
Manipulating a random number generator
With some random number generators, it’s possible to select the seed carefully to manipulate the output. Sometimes this is easy to do. Sometimes it’s hard but doable. Sometimes it’s theoretically possible but practically impossible. In my recent correspondence with Melissa O’Neill, she gave me an example that seeds a random number generator so that the […]
Testing RNGs with PractRand
PractRand is a random number generator test suite, somewhat like the DIEHARDER and NIST tests I’ve written about before, but more demanding. Rather than running to completion, it runs until a test fails with an infinitesimally small p-value. It runs all tests at a given sample size, then doubles the sample and runs the tests again. […]
Random minimum spanning trees
I just ran across a post by John Baez pointing to an article by Alan Frieze on random minimum spanning trees. Here’s the problem. Create a complete graph with n nodes, i.e. connect every node to every other node. Assign each edge a uniform random weight between 0 and 1. Find the minimum spanning tree. Add up […]
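Frieze's theorem says the total weight of this random minimum spanning tree tends to ζ(3) = 1/1³ + 1/2³ + 1/3³ + … ≈ 1.202 as n grows. A Monte Carlo sketch using Prim's algorithm, generating the uniform edge weights lazily (the sizes and seed are arbitrary choices):

```python
import random

def random_mst_weight(n, rng):
    """Weight of the minimum spanning tree of K_n with independent
    uniform(0,1) edge weights, via Prim's algorithm. Edge weights are
    generated lazily: each edge is sampled the first time Prim needs it."""
    best = [rng.random() for _ in range(n)]  # cheapest known edge into the tree
    in_tree = [False] * n
    in_tree[0] = True
    best[0] = 0.0
    total = 0.0
    for _ in range(n - 1):
        # add the cheapest fringe vertex to the tree
        v = min((u for u in range(n) if not in_tree[u]), key=lambda u: best[u])
        in_tree[v] = True
        total += best[v]
        for u in range(n):
            if not in_tree[u]:
                w = rng.random()  # weight of edge (v, u), seen for the first time
                if w < best[u]:
                    best[u] = w
    return total

rng = random.Random(1)
avg = sum(random_mst_weight(400, rng) for _ in range(20)) / 20
print(round(avg, 3))  # Frieze: tends to zeta(3) = 1.2020...
```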
Selecting things in Emacs
You can select blocks of text in Emacs just as you would in most other environments. You could, for example, drag your mouse over a region. You could also hold down the Shift key and use arrow keys. But Emacs also has a number of commands that let you work in larger semantic units. That […]
Random walk on quaternions
The previous post was a riff on a tweet asking what you’d get if you extracted all the i’s, j’s, and k’s from Finnegans Wake and multiplied them as quaternions. This post is a probabilistic variation on the previous one. If you randomly select a piece of English prose, extract the i’s, j’s, and k’s, and multiply them together as quaternions, what […]
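To experiment with questions like this, all you need is the Hamilton product. A minimal sketch in Python (the "text" here is a random string of letters, not Finnegans Wake):

```python
import random

# Represent a quaternion as (a, b, c, d) = a + b*i + c*j + d*k.
def qmul(p, q):
    """Hamilton product of two quaternions."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)

# The defining relations: i*j = k, j*k = i, k*i = j, i*i = -1.
assert qmul(I, J) == K

# Product of a random string of i's, j's, and k's.
rng = random.Random(2017)
letters = rng.choices([I, J, K], k=1000)
prod = (1, 0, 0, 0)
for q in letters:
    prod = qmul(prod, q)
print(prod)  # always one of +/-1, +/-i, +/-j, +/-k
```

The product always lands in the eight-element group {±1, ±i, ±j, ±k}, which is what makes the probabilistic question tractable.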
Wolfram Alpha, Finnegans Wake, and Quaternions
I stumbled on a Twitter account yesterday called Wolfram|Alpha Can’t. It posts bizarre queries that Wolfram Alpha can’t answer. Here’s one that caught my eye. result of extracting the i’s, j’s, and k’s in order from Finnegans Wake and interpreting as a quaternion product — Wolfram|Alpha Can’t (@wacnt) May 17, 2017 Suppose you did extract […]
The cross polytope
There are five regular solids in three dimensions: the tetrahedron, octahedron (pictured above), hexahedron (cube), dodecahedron, and icosahedron. I give a proof here that these are the only five. The first three of these regular solids generalize to all dimensions, and these generalizations are the only regular solids in dimensions 5 and higher. (There are six regular […]
Bayesian methods at Bletchley Park
From Nick Patterson’s interview on Talking Machines: GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptanalytic work, reinvented […]
Sphere packing
The previous couple blog posts touched on a special case of sphere packing. We looked at the proportion of volume contained near the corners of a hypercube. If you take the set of points within a distance 1/2 of a corner of a hypercube, you could rearrange these points to form a full ball centered […]
Is most volume in the corners or not?
I’ve written a couple blog posts that may seem to contradict each other. Given a high-dimensional cube, is most of the volume in the corners or not? I recently wrote that the corners of a cube stick out more in high dimensions. You can quantify this by centering a ball at a corner and looking […]
Corners stick out more in high dimensions
High-dimensional geometry is full of surprises. For example, nearly all the area of a high-dimensional sphere is near the equator, and by symmetry it doesn’t matter which equator you take. Here’s another surprise: corners stick out more in high dimensions. Hypercubes, for example, become pointier as dimension increases. How might we quantify this? Think of […]
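One way to see the pointiness numerically: in the unit cube [0,1]^d the inscribed ball always has radius 1/2, but a corner sits √d/2 from the center, and the inscribed ball occupies a vanishing fraction of the cube's volume. A quick sketch:

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of the unit cube's volume occupied by its inscribed ball
    of radius 1/2: pi^(d/2) / Gamma(d/2 + 1) * (1/2)^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d

for d in [2, 3, 10, 100]:
    corner_distance = math.sqrt(d) / 2   # center-to-corner distance
    print(d, round(corner_distance, 2), inscribed_ball_fraction(d))
```

The corner distance grows without bound while the ball's share of the volume collapses toward zero: nearly everything ends up out in the corners.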
Math diagrams updated
I updated several of the math diagrams on this site today. They’re SVG now, so they resize nicely if you want to zoom in or out: Special functions, Topological vector spaces, Category theory concepts, General topology, Gamma function identities
Discrete example of concentration of measure
The previous post looked at a continuous example of concentration of measure. As you move away from a thin band around the equator, the remaining area in the rest of the sphere decreases as an exponential function of the dimension and the distance from the equator. This post will show a very similar result for […]
Nearly all the area in a high-dimensional sphere is near the equator
Nearly all the area of a high-dimensional sphere is near the equator. And by symmetry, it doesn’t matter which equator you take. Draw any great circle and nearly all of the area will be near that circle. This is the canonical example of “concentration of measure.” What exactly do we mean by “nearly all the […]
DIEHARDER random number generator test results for PCG and MWC
A few days ago I wrote about testing the PCG random number generator using the DIEHARDER test suite. In this post I’ll go into a little more background on this random number generator test suite. I’ll also show that like M. E. O’Neill’s PCG (“permuted congruential generator”), George Marsaglia’s MWC (“multiply with carry”) generator does quite […]
The chaos game and the Sierpinski triangle
The chaos game is played as follows. Pick a starting point at random. Then at each subsequent step, pick a triangle vertex at random and move half way from the current position to that vertex. The result looks like a fractal called the Sierpinski triangle or Sierpinski gasket. Here’s an example: If the random number […]
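The two steps described above fit in a few lines. A minimal sketch (the triangle's coordinates and the seed are arbitrary choices):

```python
import random

vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]  # an arbitrary triangle

rng = random.Random(314159)
x, y = rng.random(), rng.random()  # random starting point
points = []
for _ in range(10_000):
    vx, vy = rng.choice(vertices)      # pick a vertex at random
    x, y = (x + vx) / 2, (y + vy) / 2  # move halfway toward it
    points.append((x, y))
# Scatter-plotting `points` (after a short burn-in) draws the Sierpinski gasket.
```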
Testing the PCG random number generator
M. E. O’Neill’s PCG family of random number generators looks very promising. It appears to have excellent statistical and cryptographic properties. And it takes remarkably little code to implement. (PCG stands for Permuted Congruential Generator.) The journal article announcing PCG gives the results of testing it with the TestU01 test suite. I wanted to try it out […]
Simple random number generator does surprisingly well
I was running the NIST statistical test suite recently. I wanted an example of a random number generator where the tests failed, and so I used a simple generator, a linear congruential generator. But to my surprise, the generator passed nearly all the tests, even though some more sophisticated generators failed some of the same […]
Least common multiple of the first n positive integers
Here’s a surprising result: The least common multiple of the first n positive integers is approximately exp(n). More precisely, let φ(n) equal the log of the least common multiple of the numbers 1, 2, …, n. There are theorems that give upper and lower bounds on how far φ(n) can be from n. We won’t prove or […]
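The claim is easy to check numerically with Python's math.lcm (3.9+); a quick sketch, not the post's code:

```python
import math

def log_lcm(n):
    """phi(n): the natural log of lcm(1, 2, ..., n)."""
    return math.log(math.lcm(*range(1, n + 1)))

for n in [10, 100, 1000]:
    print(n, round(log_lcm(n) / n, 3))  # the ratio phi(n)/n tends to 1
```

The ratio φ(n)/n creeps toward 1 as n grows, which is the sense in which lcm(1, …, n) ≈ exp(n).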
Subscribing by email
You can subscribe to my blog by email or RSS. I also have a brief newsletter you could sign up for. There are links to these in the sidebar of the blog: If you subscribe by email, you’ll get an email each morning containing the post(s) from the previous day. I just noticed a problem […]
Effective sample size for MCMC
In applications we’d like to draw independent random samples from complicated probability distributions, often the posterior distribution on parameters in a Bayesian analysis. Most of the time this is impractical. MCMC (Markov Chain Monte Carlo) gives us a way around this impasse. It lets us draw samples from practically any probability distribution. But there’s a […]
Quicksort and prime numbers
The average number of operations needed for quicksort to sort a list of n items is approximately 10 times the nth prime number. Here’s some data to illustrate this.

|------+-----------------+---------|
|    n | avg. operations | 10*p(n) |
|------+-----------------+---------|
|  100 |          5200.2 |    5410 |
|  200 |         12018.3 |   12230 |
|  300 |         19446.9 […]
Why do linear prediction confidence regions flare out?
Suppose you’re tracking some object based on its initial position x₀ and initial velocity v₀. The initial position and initial velocity are estimated from normal distributions with standard deviations σ_x and σ_v. (To keep things simple, let’s assume our object is moving in only one dimension and that the distributions around initial position and velocity […]
Polynomials evaluated at integers
Let p(x) = a₀ + a₁x + a₂x² + … + aₙxⁿ and suppose at least one of the coefficients aᵢ is irrational for some i ≥ 1. Then a theorem by Weyl says that the fractional parts of p(n) are equidistributed as n varies over the integers. That is, the proportion of values that land in some interval […]
Leading digits of powers of 2
The first digit of a power of 2 is a 1 more often than any other digit. Powers of 2 begin with 1 about 30% of the time. This is because powers of 2 follow Benford’s law. We’ll prove this below. When is the first digit of 2ⁿ equal to k? When 2ⁿ is between […]
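The first digit of 2ⁿ is 1 exactly when the fractional part of n·log₁₀2 falls in [0, log₁₀2), which makes the 30% figure easy to verify numerically (a quick sketch, not the post's code):

```python
import math

log2 = math.log10(2)  # about 0.30103

# 2^n begins with digit 1 exactly when frac(n * log10(2)) < log10(2).
N = 100_000
count = sum(1 for n in range(1, N + 1) if (n * log2) % 1 < log2)
print(count / N)  # about 0.30103 = log10(2)
```

Because log₁₀2 is irrational, the fractional parts n·log₁₀2 mod 1 are equidistributed, so the long-run frequency is exactly log₁₀2, Benford's proportion for a leading 1.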
Extreme beta distributions
A beta probability distribution has two parameters, a and b. You can think of these as the number of successes and failures out of a+b trials. The PDF of a beta distribution is approximately normal if a and b are approximately equal and a + b is large. If a and b are close, they don’t have to be very large for the beta […]
Fractional parts, invariant measures, and simulation
A function f: X → X is measure-preserving if each iteration of f sends the same amount of stuff into a given set. To be more precise, given a measure μ and any μ-measurable set E with μ(E) > 0, we have μ(E) = μ(f⁻¹(E)). You can read the right side of […]
Most popular posts this year so far
These have been the most popular posts for the first half of 2017:
Golden powers are nearly integers
How efficient is Morse code?
Putting SHA1 in perspective
Improving on the Unix shell
Three proofs that 2017 is prime
How areas of math are connected
One practical application of functional programming
Arguments in favor of functional programming are often unconvincing. For example, the most common argument is that functional programming makes it easier to “reason about your code.” That’s true to some extent. All other things being equal, it’s easier to understand a function if all its inputs and outputs are explicit. But all other things […]
Dividing projects into math, statistics, and computing
If you’ve read this blog for long, you know that my work is a combination of math, statistics, and computing. I was looking over my records and tried to see how my work divides into these three areas. In short, it doesn’t. The boundaries between these areas are fuzzy or arbitrary to begin with, but […]
Gaussian correlation inequality
The Gaussian correlation inequality was proven in 2014, but the proof only became widely known this year. You can find Thomas Royen’s remarkably short proof here. Let X be a multivariate Gaussian random variable with mean zero and let E and F be two symmetric convex sets, both centered at the origin. The Gaussian correlation inequality says that Prob(X in E […]
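A Monte Carlo sanity check of the inequality for a simple two-dimensional case (the correlation, the slab-shaped sets, and the sample size are arbitrary choices for illustration):

```python
import math
import random

rng = random.Random(7)
rho = 0.6        # correlation between the two coordinates (arbitrary)
n = 200_000
in_E = in_F = in_both = 0
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x = z1
    y = rho * z1 + math.sqrt(1 - rho ** 2) * z2  # correlated Gaussian pair
    e = abs(x) <= 1   # E: a symmetric convex slab around the origin
    f = abs(y) <= 1   # F: another symmetric convex slab
    in_E += e
    in_F += f
    in_both += e and f
p_E, p_F, p_both = in_E / n, in_F / n, in_both / n
print(p_both >= p_E * p_F)  # the inequality holds: True
```

With correlated coordinates, the joint probability visibly exceeds the product of the marginals, exactly as the inequality asserts.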
Irrational rotations are ergodic
In a blog post yesterday, I mentioned that the golden angle is an irrational portion of a circle, and so a sequence of rotations by the golden angle will not repeat itself. We can say more: rotations by an irrational portion of a circle are ergodic. Roughly speaking, this means that not only does the […]