Summing an array of floating point numbers
Adding up an array of numbers is simple: just loop over the array, adding each element to an accumulator.
Usually that's good enough, but sometimes you need more precision, either because your application needs a high-precision answer, or because your problem is poorly conditioned and you need a moderate-precision answer [1].
There are many algorithms for summing an array, which may be surprising if you haven't thought about this before. This post will look at Kahan's algorithm, one of the first non-obvious algorithms for summing a list of floating point numbers.
We will sum a list of numbers a couple of ways using C's float type, a 32-bit floating point number. Some might object that this is artificial. Does anyone use floats any more? Didn't moving from float to double get rid of all our precision problems? Yes and no.
In fact, the use of 32-bit float types is on the rise. Moving data in and out of memory is now the bottleneck, not arithmetic operations per se, and float lets you pack more data into a given amount of memory. Not only that, there is growing interest in floating point types even smaller than 32-bit floats.
And while sometimes you need less precision than 32 bits, sometimes you need more precision than 64 bits. See, for example, this post.
Now for some code. Here is the obvious way to sum an array in C:
float naive(float x[], int N) {
    float s = 0.0;
    for (int i = 0; i < N; i++) {
        s += x[i];
    }
    return s;
}
Kahan's algorithm is not much more complicated, and it provides more accuracy, though at first glance it looks a little mysterious.
float kahan(float x[], int N) {
    float s = x[0];
    float c = 0.0;            // running compensation for lost low-order bits
    for (int i = 1; i < N; i++) {
        float y = x[i] - c;   // restore the bits lost on the previous step
        float t = s + y;      // low-order bits of y may be lost here
        c = (t - s) - y;      // (t - s) is what was actually added to the sum,
                              // so c is minus the part of y that was lost
        s = t;
    }
    return s;
}
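Before comparing the methods on a realistic input, here is a contrived demonstration, not from the problem below, of what the compensation term buys. The input values are chosen for illustration: every term after the first is too small to change a float sum equal to 1, so the naive loop never moves, while Kahan's method accumulates the lost amounts in c. It assumes the two functions above are in scope.

#include <stdio.h>
#include <stdlib.h>

int main() {
    // 1.0 followed by ten million terms, each below half an ulp of 1.0f
    int N = 10000001;
    float *x = malloc(N * sizeof(float));
    x[0] = 1.0f;
    for (int i = 1; i < N; i++)
        x[i] = 1e-8f;
    printf("%.9g\n", naive(x, N)); // prints 1: every small term is lost
    printf("%.9g\n", kahan(x, N)); // prints approximately 1.1, the true sum
    free(x);
    return 0;
}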
Now to compare the accuracy of these two methods, we'll use a third method that uses a 64-bit accumulator. We will assume its value is accurate to at least 32 bits and use it as our exact result.
double dsum(float x[], int N) {
    double s = 0.0;
    for (int i = 0; i < N; i++) {
        s += x[i];
    }
    return s;
}
For our test problem, we'll use the reciprocals of the first 100,000 positive integers rounded to 32 bits as our input. This problem was taken from [2].
#include <stdio.h>

int main() {
    int N = 100000;
    float x[N];
    for (int i = 1; i <= N; i++)
        x[i-1] = 1.0/i;
    printf("%.15g\n", naive(x, N));
    printf("%.15g\n", kahan(x, N));
    printf("%.15g\n", dsum(x, N));
}
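If the three functions above are pasted into the same file as main, this compiles with any C99-or-later compiler; note that the variable-length array x requires C99.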
This produces the following output.
12.0908508300781
12.0901460647583
12.0901461953972
The naive method is correct to about five significant figures while Kahan's method is correct to about eight. The error in the former is 5394 times larger. In this example, Kahan's method is as accurate as possible using 32-bit numbers: its result is the 32-bit float closest to the exact sum.
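Incidentally, the 5394 figure is the ratio of the two absolute errors. It can be reproduced by appending a couple of lines to main above (and adding #include <math.h> at the top for fabs):

    double exact = dsum(x, N);
    printf("%g\n", fabs(naive(x, N) - exact) / fabs(kahan(x, N) - exact));
    // prints roughly 5394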
Note that in our example we were not exactly trying to find the Nth harmonic number, the sum of the reciprocals of the first N positive integers. Instead, we were using those reciprocals rounded to 32 bits as our input. Think of them as just data. The data happened to be produced by computing reciprocals and rounding the results to 32-bit floating point precision.
But what if we did want to compute the 100,000th harmonic number? There are methods for computing harmonic numbers directly, as described here.
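The linked post goes into detail. As a sketch of one standard approach (not code from that post), the asymptotic expansion H_n ≈ log(n) + γ + 1/(2n) − 1/(12n^2), where γ is the Euler–Mascheroni constant, gives the 100,000th harmonic number directly, with no summation at all:

#include <stdio.h>
#include <math.h>

/* Approximate the nth harmonic number from the first terms of its
   asymptotic expansion; the truncation error is O(1/n^4). */
double harmonic(double n) {
    const double euler_gamma = 0.5772156649015329;
    return log(n) + euler_gamma + 1/(2*n) - 1/(12*n*n);
}

int main() {
    printf("%.15g\n", harmonic(100000)); // 12.0901461298634
    return 0;
}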
Related posts

- Anatomy of a floating point number
- Why IEEE floating point numbers have +0 and -0
- IEEE floating point exceptions in C++
[1] The condition number of a problem is basically a measure of how much floating point errors are multiplied in the process of computing a result. A well-conditioned problem has a small condition number, and an ill-conditioned problem has a large condition number. Subtracting two nearly equal numbers is a classic ill-conditioned problem: a small rounding error in the inputs can become a large relative error in the difference. With ill-conditioned problems, you may have to use extra precision in your intermediate calculations to get a result that's accurate to ordinary precision.
[2] Handbook of Floating-Point Arithmetic by Jean-Michel Muller et al.