Feed john-d-cook John D. Cook

Link https://www.johndcook.com/blog
Feed http://feeds.feedburner.com/TheEndeavour?format=xml
Updated 2024-11-23 00:01
Chinese character frequency and entropy
Yesterday I wrote a post looking at the frequency of Koine Greek letters and the corresponding entropy. David Littleboy asked what an analogous calculation would look like for a language like Japanese. This post answers that question. First of all, information theory defines the Shannon entropy of an “alphabet” to be $H = -\sum_i p_i \log_2 p_i$ bits, where $p_i$ is […]
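A minimal sketch of the entropy calculation, assuming the standard definition H = -sum of p_i log2 p_i, with p_i estimated as the relative frequency of each symbol in a sample text. The function name `shannon_entropy` is illustrative and does not come from the post.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Return the entropy, in bits per symbol, of the symbols in text."""
    counts = Counter(text)
    total = sum(counts.values())
    # Each symbol contributes -p * log2(p), where p is its relative frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally likely symbols carry log2(4) bits each.
print(shannon_entropy("abcd"))
```

The same function works on any sequence of characters, so it applies equally to Greek letters or Chinese characters once the text is loaded as a string.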
The science of snow
Kenneth G. Libbrecht has posted a 523-page book on snow to arXiv.
Greek letter frequency and entropy
Would the letters in an ancient Greek text carry more or less information than in modern English? To address this question, I downloaded a copy of the Greek New Testament from Project Gutenberg and ran the word frequency script from my previous post. This led to the following table of letters and percent frequency. α […]
File character counts
Once in a while I need to know what characters are in a file and how often each appears. One reason I might do this is to look for statistical anomalies. Another reason might be to see whether a file has any characters it’s not supposed to have, which is often the case. A few […]
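One way to do this kind of count, sketched in Python rather than whatever tool the post goes on to use. The helper names and the choice of "printable ASCII plus newline and tab" as the allowed set are assumptions for illustration.

```python
from collections import Counter

def char_counts(text):
    """Return (character, count) pairs, most frequent first."""
    return Counter(text).most_common()

def unexpected_chars(text, allowed):
    """Return the set of characters in text that are not in allowed."""
    return set(text) - set(allowed)

# Assumed notion of "supposed to be there": printable ASCII, newline, tab.
ALLOWED = [chr(i) for i in range(32, 127)] + ["\n", "\t"]

sample = "hello world\u00a0"   # ends with a non-breaking space
for ch, n in char_counts(sample):
    print(repr(ch), n)
print(unexpected_chars(sample, ALLOWED))
```

For a real file, the text would come from something like `open(path, encoding="utf-8").read()`, where `path` is a placeholder.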
Survivalist vi
A few days ago I wrote about computational survivalists, people who prepare to be able to work on computers with only software that is available everywhere. Of course nothing is available everywhere, and so each person interprets “everywhere” to mean computers they anticipate using. If you need to edit a text file on a Windows […]
What is a privacy budget?
The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t matter for this post, but it matters that there is an exact definition. Someone designing a […]
Queueing theory and regular expressions
Queueing theory is the study of waiting in line. That may not sound very interesting, but the subject is full of surprises. For example, when a server is near capacity, adding a second server can cut backlog not just in half but by an order of magnitude or more. More on that here. In this […]
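The claim about a second server can be checked numerically under one common model. The sketch below assumes an M/M/c queue (Poisson arrivals, exponential service times) and uses the Erlang C formula for the mean number of customers waiting; the excerpt does not say which model the post has in mind, and the 95% utilization figure is an illustrative choice.

```python
import math

def mean_queue_length(arrival_rate, service_rate, servers):
    """Mean number waiting (not in service) in an M/M/c queue, via Erlang C."""
    a = arrival_rate / service_rate          # offered load
    rho = a / servers                        # utilization per server
    if rho >= 1:
        raise ValueError("queue is unstable: utilization must be below 1")
    # Probability that the system is empty.
    p0 = 1.0 / (
        sum(a**k / math.factorial(k) for k in range(servers))
        + a**servers / (math.factorial(servers) * (1 - rho))
    )
    return p0 * a**servers * rho / (math.factorial(servers) * (1 - rho) ** 2)

one = mean_queue_length(0.95, 1.0, 1)   # single server at 95% utilization
two = mean_queue_length(0.95, 1.0, 2)   # same traffic, two servers
print(one, two, one / two)
```

With these assumed rates the backlog ratio comes out well above ten, which is consistent with the "order of magnitude or more" statement.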
Hyperexponential and hypoexponential distributions
There are a couple different ways to combine random variables into a new random variable: means and mixtures. To take the mean of X and Y you average their values. To take the mixture of X and Y you average their densities. The former makes the tails thinner. The latter makes the tails thicker. When […]
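To make the thinner-versus-thicker contrast concrete, here is a sketch comparing tail probabilities in closed form. It assumes X and Y are independent exponential random variables with rates 1 and 1/3; those rates and the function names are illustrative choices, not taken from the post.

```python
import math

def tail_of_mean(t, rate_x, rate_y):
    """P((X + Y)/2 > t) for independent exponentials with distinct rates."""
    # X/2 and Y/2 are exponential with doubled rates; their sum is hypoexponential.
    a, b = 2 * rate_x, 2 * rate_y
    return (b * math.exp(-a * t) - a * math.exp(-b * t)) / (b - a)

def tail_of_mixture(t, rate_x, rate_y):
    """P(Z > t) where Z has the averaged density (f_X + f_Y)/2."""
    return 0.5 * math.exp(-rate_x * t) + 0.5 * math.exp(-rate_y * t)

for t in (5, 10, 20):
    print(t, tail_of_mean(t, 1.0, 1 / 3), tail_of_mixture(t, 1.0, 1 / 3))
```

Far enough out, the mixture's tail probability exceeds that of the mean, and the gap widens as t grows.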
Primes that don’t look like primes
Primes usually look odd. They’re literally odd [1], but they typically don’t look like they have a pattern, because a pattern would often imply a way to factor the number. However, 12345678910987654321 is prime! I posted this on Twitter, and someone with the handle lagomoof replied that the analogous numbers are prime for bases 2, […]
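A sketch for checking this claim and building the analogous number in another base. It uses a deterministic Miller-Rabin test, which is a swapped-in technique (the post does not say how primality was verified), and it assumes "analogous" means the digits 1, 2, ..., b-1, then "10", then b-1, ..., 1, read in base b.

```python
def is_prime(n):
    """Deterministic Miller-Rabin; these bases suffice for n below about 3.3e24."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in bases:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False   # a is a witness that n is composite
    return True

def count_up_and_down(base):
    """Value of the digit string 1, 2, ..., base-1, 1, 0, base-1, ..., 1 in `base`."""
    digits = list(range(1, base)) + [1, 0] + list(range(base - 1, 0, -1))
    value = 0
    for d in digits:
        value = value * base + d
    return value

print(count_up_and_down(10), is_prime(count_up_and_down(10)))
print(count_up_and_down(2), is_prime(count_up_and_down(2)))
```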
Computational survivalist
Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism. I’m not much of a computational survivalist, but I’ve come to appreciate such […]
What exactly is a day?
How many days are in a year? 365. How many times does the earth rotate on its axis in a year? 366. If you think a day is the time it takes for earth to rotate once around its axis, you’re approximately right, but off by about four minutes. What we typically mean by “day” […]
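The "about four minutes" figure follows from simple arithmetic. The sketch below assumes a year of 365.2422 solar days, a more precise value than the rounded 365 in the text, and uses the fact that one orbit adds exactly one extra rotation relative to the stars.

```python
SOLAR_DAY_SECONDS = 24 * 60 * 60
SOLAR_DAYS_PER_YEAR = 365.2422      # assumed length of the year in solar days

# Going around the sun once adds one rotation beyond the count of solar days.
rotations_per_year = SOLAR_DAYS_PER_YEAR + 1

# Time for one full rotation, and how much shorter it is than a solar day.
rotation_seconds = SOLAR_DAY_SECONDS * SOLAR_DAYS_PER_YEAR / rotations_per_year
difference_minutes = (SOLAR_DAY_SECONDS - rotation_seconds) / 60

print(round(rotation_seconds, 1), round(difference_minutes, 2))
```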
Harmonographs
In the previous post, I said that Lissajous curves are the result of plotting a curve whose x and y coordinates come from (undamped) harmonic oscillators. If we add a little bit of damping, multiplying our cosine terms by negative exponentials, the resulting curve is called a harmonograph. Here’s a bit of Mathematica code to […]
Lissajous curves and knots
Suppose that over time the x and y coordinates of a point are both given by a harmonic oscillator, i.e. $x(t) = \cos(n_x t + \phi_x)$ and $y(t) = \cos(n_y t + \phi_y)$. Then the resulting path is called a Lissajous curve. If you add a z coordinate also given by harmonic oscillator z(t) = cos(n_z […]
Curvature of an ellipsoid
For an ellipsoid with equation $x^2/a^2 + y^2/b^2 + z^2/c^2 = 1$ the Gaussian curvature at each point is given by $K = 1 / \left( a^2 b^2 c^2 \left( x^2/a^4 + y^2/b^4 + z^2/c^4 \right)^2 \right)$. Now suppose a ≥ b ≥ c > 0. Otherwise relabel the coordinate axes so that this is the case. Then the largest curvature occurs at (±a, 0, 0), and the smallest curvature occurs at (0, 0, ±c). You could prove […]
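A quick numerical check of where the extremes fall, assuming the standard ellipsoid x²/a² + y²/b² + z²/c² = 1 and the usual closed form for its Gaussian curvature, K = 1 / (a²b²c² (x²/a⁴ + y²/b⁴ + z²/c⁴)²). The semi-axes 3, 2, 1 are an arbitrary illustrative choice.

```python
def gaussian_curvature(x, y, z, a, b, c):
    """Gaussian curvature of x^2/a^2 + y^2/b^2 + z^2/c^2 = 1 at a surface point."""
    s = x**2 / a**4 + y**2 / b**4 + z**2 / c**4
    return 1.0 / (a**2 * b**2 * c**2 * s**2)

a, b, c = 3.0, 2.0, 1.0   # illustrative semi-axes with a >= b >= c > 0

k_long = gaussian_curvature(a, 0, 0, a, b, c)    # tip of the longest axis
k_mid = gaussian_curvature(0, b, 0, a, b, c)     # tip of the middle axis
k_short = gaussian_curvature(0, 0, c, a, b, c)   # tip of the shortest axis
print(k_long, k_mid, k_short)
```

At the axis tips the formula reduces to a²/(b²c²), b²/(a²c²), and c²/(a²b²), which makes the ordering easy to see.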
Fixed points
Take a calculator and enter any number. Then press the cosine key over and over. Eventually the numbers will stop changing. You will either see 0.99984774 or 0.73908513, depending on whether your calculator was in degree mode or radian mode. This is an example of a fixed point, a point that doesn’t change when you […]
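The calculator experiment is easy to reproduce. This sketch simply applies cosine repeatedly, once treating the input as radians and once as degrees; the starting value 1.0 and the iteration count are arbitrary choices.

```python
import math

def iterate(f, x, n=200):
    """Apply f to x repeatedly, n times, and return the final value."""
    for _ in range(n):
        x = f(x)
    return x

# Radian mode: cos applied directly.
radian_fixed_point = iterate(math.cos, 1.0)

# Degree mode: the input is interpreted as degrees before taking the cosine.
degree_fixed_point = iterate(lambda x: math.cos(math.radians(x)), 1.0)

print(radian_fixed_point, degree_fixed_point)
```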
Number of real roots in an interval
Suppose you have a polynomial p(x) and an interval [a, b] and you want to know how many distinct real roots the polynomial has in the interval. You can answer this question using Sturm’s algorithm. Let p_0(x) = p(x) and let p_1(x) be its derivative p′(x). Then define a series of polynomials for i ≥ 1 […]
Total curvature of a knot
Tie a knot in a rope and join the ends together. At each point in the rope, compute the curvature, i.e. how much the rope bends, and integrate this over the length of the rope. The Fary-Milnor theorem says the result must be greater than 4π. This post will illustrate this theorem by computing numerically […]
A sort of mathematical quine
Julian Havil writes what I think of as serious recreational mathematics. His books are recreational in the sense that they tell a story rather than cover a subject. They are lighter reading than a textbook, but require more advanced mathematics than books by Martin Gardner. Havil’s latest book is Curves for the Mathematically Curious. […]
Control characters
I didn’t realize until recently that there’s a connection between the control key on a computer keyboard and controlling a mechanical device. Both uses of the word control are related via ASCII control characters as I discovered by reading the blog post Four Column ASCII. Computers work with bits in groups of eight, and there […]
Fat tails and the t test
Suppose you want to test whether something you’re doing is having any effect. You take a few measurements and you compute the average. The average is different than what it would be if what you’re doing had no effect, but is the difference significant? That is, how likely is it that you might see the […]
Amendment to CCPA regarding personal information
California’s new privacy law takes effect January 1, 2020, less than 100 days from now. The bill was written in a hurry in order to prevent a similar measure from appearing as a ballot initiative. The thought was that the state legislature would pass something quickly then clean it up later with amendments. Six amendments […]
Right to be forgotten in the news
The GDPR’s right-to-be-forgotten has been in the news this week. This post will look at a couple news stories and how they relate. Forgetting about a stabbing On Monday the New York Times ran a story about an Italian news site that folded as a result of resisting requests to hide a story about a […]
Exception Driven Development
Using program exceptions as a learning tool: When I’m learning something new, I sometimes find myself practicing EDD (exception driven development). I try to evaluate some code, get an exception or error message, and then Google the error message to figure out what the heck happened. From Mastering Clojure Macros
One of these days I’m going to figure this out
If something is outside your grasp, it’s hard to know just how far outside it is. Many times I’ve intended to sit down and understand something thoroughly, and I’ve put it off for years. Maybe it’s a programming language that I just use a few features of, or a book I keep seeing references to. […]
Typesetting zodiac symbols in LaTeX
Typesetting zodiac symbols in LaTeX is admittedly an unusual thing to do. LaTeX is mostly used for scientific publication, and zodiac symbols are commonly associated with astrology. But occasionally zodiac symbols are used in more respectable contexts. The wasysym package for LaTeX includes miscellaneous symbols, including zodiac symbols. Here are the symbols, their LaTeX commands, […]
Airline flight number parity
I read in Wikipedia this morning that there’s a pattern to the parity of flight numbers. Among airline flight numbers, even numbers typically identify eastbound or northbound flights, and odd numbers typically identify westbound or southbound flights. I never noticed this. I could see how it might be a useful convention. It would mean that […]
Testing Rupert Miller’s suspicion
I was reading Rupert Miller’s book Beyond ANOVA when I ran across this line: I never use the Kolmogorov-Smirnov test (or one of its cousins) or the χ² test as a preliminary test of normality. … I have a feeling they are more likely to detect irregularities in the middle of the distribution than in […]
Why would anyone do that?
There are tools that I’ve used occasionally for many years that I’ve just started to appreciate lately. “Oh, that’s why they did that.” When you see something that looks poorly designed, don’t just exclaim “Why would anyone do that?!” but ask sincerely “Why would someone do that?” There’s probably a good reason, or at least […]
Predicted distribution of Mersenne primes
Mersenne primes are prime numbers of the form $2^p - 1$. It turns out that if $2^p - 1$ is a prime, so is p; the requirement that p is prime is a theorem, not part of the definition. So far 51 Mersenne primes have been discovered [1]. Maybe that’s all there are, but it is […]
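Small Mersenne primes can be found with the Lucas-Lehmer test. This is a sketch of that standard test, not something taken from the post, and the helper names are illustrative. Because 2^p - 1 can only be prime when p is prime, the search only needs to try prime exponents.

```python
def is_prime(n):
    """Trial division, adequate for the small exponents tried here."""
    return n >= 2 and all(n % d for d in range(2, int(n**0.5) + 1))

def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for a prime exponent p."""
    if p == 2:
        return True          # 2**2 - 1 = 3; the recurrence below needs odd p
    m = 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Prime exponents below 100 whose Mersenne number is prime.
exponents = [p for p in range(2, 100) if is_prime(p) and lucas_lehmer(p)]
print(exponents)
```

Note that p = 11 is prime but is absent from the result, since 2^11 - 1 = 2047 = 23 × 89: a prime exponent is necessary but not sufficient.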
Short video introducing differential privacy
Here is a 12-minute video from Minute Physics, in collaboration with the US Census Bureau, giving an overview of differential privacy and how the 2020 census will use it to protect privacy. […]
Collatz conjecture skepticism
The Collatz conjecture asks whether the following procedure always terminates at 1. Take any positive integer n. If it’s odd, multiply it by 3 and add 1. Otherwise, divide it by 2. For obvious reasons the Collatz conjecture is also known as the 3n + 1 conjecture. It has been computationally verified that the Collatz […]
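The procedure described above can be sketched as a step counter. The function name `collatz_steps` is illustrative; if the conjecture were false for some input, the loop would never return for it.

```python
def collatz_steps(n):
    """Number of steps for n to reach 1 under the 3n + 1 rule."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        # Odd: multiply by 3 and add 1. Even: divide by 2.
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

for start in (1, 6, 27):
    print(start, collatz_steps(start))
```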
String interpolation in Python and R
One of the things I liked about Perl was string interpolation. If you use a variable name in a string, the variable will expand to its value. For example, if you have a variable $x which equals 42, then the string "The answer is $x." will expand to “The answer is 42.” Perl requires variables to […]
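For comparison, the same example can be written in Python with an f-string (available from Python 3.6). This is one way to do it; the excerpt is cut off before showing which Python mechanism the post discusses.

```python
x = 42

# Expressions inside braces are evaluated and substituted into the string.
message = f"The answer is {x}."
print(message)
```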
Detecting typos with the four color theorem
In my previous post on VIN numbers, I commented that if a check sum has to be one of 11 characters, it cannot detect all possible changes to a string from an alphabet of 33 characters. The number of possible check sum characters must be at least as large as the number of possible characters […]
Vehicle Identification Number (VIN) check sum
A VIN (vehicle identification number) is a string of 17 characters that uniquely identifies a car or motorcycle. These numbers are used around the world and have three standardized formats: one for North America, one for the EU, and one for the rest of the world. Letters that resemble digits The characters used in a […]
Progress on the Collatz conjecture
The Collatz conjecture is for computer science what until recently Fermat’s last theorem was for mathematics: a famous unsolved problem that is very simple to state. The Collatz conjecture, also known as the 3n+1 problem, asks whether the following function terminates for all positive integer arguments n. def collatz(n): if n == 1: return 1 […]
How UTF-8 works
UTF-8 is a clever way of encoding Unicode text. I’ve mentioned it a couple times lately, but I haven’t blogged about UTF-8 per se. Here goes. The problem UTF-8 solves US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode […]
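A short sketch showing the variable-length nature of UTF-8 with Python's built-in encoder. The four sample characters are arbitrary choices that happen to need one, two, three, and four bytes respectively.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a single character as a list of byte values."""
    return list(ch.encode("utf-8"))

# 101 keyboard symbols fit in seven bits because 2**7 = 128.
print(101 <= 2**7)

# Latin A, e with acute accent, a CJK ideograph, and an emoji.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}", len(encoded), " ".join(f"{b:08b}" for b in encoded))
```

In the printed bit patterns, single-byte characters start with 0, continuation bytes start with 10, and the leading byte of a multi-byte sequence starts with as many 1 bits as there are bytes in the sequence.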
Excel, R, and Unicode
I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened. Say the CSV file looked like this: foo,bar 1,2 3,4 I read the file into R with df <- read.csv("foobar.csv", header=TRUE) and could access […]
How fast were dead languages spoken?
A new paper in Science suggests that all human languages carry about the same amount of information per unit time. In languages with fewer possible syllables, people speak faster. In languages with more syllables, people speak slower. Researchers quantified the information content per syllable in 17 different languages by calculating Shannon entropy. When you multiply […]
Quiet mode
When you start a programming language like Python or R from the command line, you get a lot of initial text that you probably don’t read. For example, you might see something like this when you start Python. Python 2.7.6 (default, Nov 23 2017, 15:49:48) [GCC 4.8.4] on linux2 Type "help", "copyright", "credits" or "license" […]
More bc weirdness
As I mentioned in a footnote to my previous post, I just discovered that variable names in the bc programming language cannot contain capital letters. I think I understand why: Capital letters are reserved for hexadecimal constants, though in a weird sort of way. At first variable names in bc could only be one letter […]
Asimov’s question about π
In 1977, Isaac Asimov [1] asked how many terms of the slowly converging series π = 4 – 4/3 + 4/5 – 4/7 + 4/9 – … would you have to sum before doing better than the approximation π ≈ 355/113. A couple years later Richard Johnsonbaugh [2] answered Asimov’s question in the course of […]
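Asimov's question can also be answered by brute force. The sketch below sums the series in double precision until the partial sum is closer to π than 355/113 is. Floating point rounding could shift the count slightly, so treat the result as approximate; the analytical answer is what the cited paper provides.

```python
import math

def terms_to_beat(target_error):
    """Count terms of 4 - 4/3 + 4/5 - ... needed for error below target_error."""
    total, sign, n = 0.0, 1.0, 0
    while True:
        total += sign * 4.0 / (2 * n + 1)
        n += 1
        sign = -sign
        if abs(total - math.pi) < target_error:
            return n

# Error of the rational approximation we are trying to beat.
benchmark_error = abs(355 / 113 - math.pi)
print(benchmark_error)
```

Calling `terms_to_beat(benchmark_error)` takes a few seconds in pure Python and returns a count in the millions, since the error after N terms shrinks only like 1/N.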
National Drug Code (NDC)
The US Food and Drug Administration tracks drugs using an identifier called the NDC or National Drug Code. It is described as a 10-digit code, but it may be more helpful to think of it as a 12-character code. An NDC contains 10 digits, separated into three segments by two dashes. The three segments are […]
Prefix code examples
In many offices, you can dial a three-digit number to reach someone else in the office. In such offices, you usually have to dial 9 to reach an outside number. There’s no ambiguity because no one can have an extension that begins with 9. After you’ve entered three digits, the phone system knows […]
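The "no ambiguity" property is what makes a set of codes a prefix code: no code is the beginning of another. A minimal check, with made-up extension numbers for illustration:

```python
def is_prefix_free(codes):
    """True if no code is a prefix of (or equal to) another code."""
    ordered = sorted(codes)
    # After sorting, a prefix sits immediately before some string it prefixes.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))

# Hypothetical extensions, plus "9" as the outside-line prefix.
print(is_prefix_free(["201", "202", "315", "9"]))

# An extension starting with 9 collides with the outside-line prefix.
print(is_prefix_free(["201", "202", "912", "9"]))
```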
How many possible Unicode characters there are and why
How many? The previous post showed how the number of Unicode characters has grown over time. You’ll notice there was a big jump between versions 3.0 and 3.1. That will be important later on. Unicode started out relatively small then became much more ambitious. Are they going to run out of room? How many possible […]
Growth of Unicode over time
My previous post quoted Randall Munroe saying Unicode “started out just trying to unify a couple different character sets” and grew much more ambitious. The first version of Unicode, published in 1991, had 7,191 characters. Now the latest version has 137,994 characters and so is about 19 times bigger. Here’s a plot of the number […]
The hopeless task of the Unicode Consortium
Randall Munroe, author of xkcd, discussing Unicode on the Triangulation podcast: I am endlessly delighted by the hopeless task that the Unicode Consortium has created for themselves. … They started out just trying to unify a couple different character sets. And before they quite realized what was happening, they were grappling with decisions at the […]
Regular expressions and special characters
Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular […]
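A small illustration of the "what Python sees versus what the regex engine sees" distinction, using `\b` as the example. The sample text is made up.

```python
import re

text = "a cat sat on the catalog"

# Without the r prefix, Python turns \b into a backspace character (0x08)
# before the regular expression engine ever sees the pattern.
print("\b" == "\x08")
plain = re.findall("\bcat\b", text)

# With a raw string, the backslash survives and the regex engine
# interprets \b as a word boundary.
raw = re.findall(r"\bcat\b", text)

print(plain, raw)
```

The first pattern looks for "cat" surrounded by literal backspace characters and finds nothing; the second finds the standalone word but not the "cat" inside "catalog".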
Munging CSV files with standard Unix tools
This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools? Why CSV? In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard […]
Three-digit zip codes and data privacy
Birth date, sex, and five-digit zip code are enough information to uniquely identify a large majority of Americans. See more on this here. So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last two digits of a zip code. And even though three-digit zip […]
Working with wide text files at the command line
Suppose you have a data file with obnoxiously long lines and you’d like to preview it from the command line. For example, the other day I downloaded some data from the American Community Survey and wanted to see what the files contained. I ran something like head data.csv to look at the first few lines […]