
Efficiency is not associative for matrix multiplication

by
John
from John D. Cook on (#3AG8V)

Here's a fact that has been rediscovered many times in many different contexts: The way you parenthesize matrix products can greatly change the time it takes to compute the product. This is important, for example, for the back propagation algorithm in deep learning.

Let A, B, and C be matrices that are compatible for multiplication. Then (AB)C = A(BC). That is, matrix multiplication is associative. But as far as efficiency is concerned, matrix multiplication is not associative: One side of the equation may be much faster to compute than the other.

For any matrix M, let rows(M) be the number of rows in M and let cols(M) be the number of columns. When we multiply two matrices M and N, we multiply every row of M by every column of N [1]. So the number of vector inner products is rows(M) cols(N), and the number of multiplications [2] in each inner product is cols(M). (We must have cols(M) = rows(N) because we implicitly assumed it's possible to multiply M and N.)

That means the total number of scalar multiplications required to compute MN equals

rows(M) cols(M) cols(N) = rows(M) rows(N) cols(N).
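To see where that count comes from in code, here is a minimal Python sketch of the textbook row-times-column algorithm with a multiplication counter. The function name naive_matmul and the demo shapes are my own; in practice you would just write M @ N in NumPy.

    import numpy as np

    def naive_matmul(M, N):
        """Textbook row-times-column product, counting scalar multiplications."""
        assert M.shape[1] == N.shape[0], "cols(M) must equal rows(N)"
        P = np.zeros((M.shape[0], N.shape[1]))
        count = 0
        for i in range(M.shape[0]):          # one inner product per row of M ...
            for j in range(N.shape[1]):      # ... and per column of N
                for k in range(M.shape[1]):  # each inner product does cols(M) multiplications
                    P[i, j] += M[i, k] * N[k, j]
                    count += 1
        return P, count  # count == rows(M) * cols(M) * cols(N)

    M = np.random.rand(3, 4)
    N = np.random.rand(4, 5)
    P, count = naive_matmul(M, N)
    print(count)                  # 3 * 4 * 5 = 60
    print(np.allclose(P, M @ N))  # True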

Now let's apply this to computing ABC. Suppose A is an m by n matrix and C is a p by q matrix. Then B has to be an n by p matrix in order to be compatible for multiplication.

Computing AB requires mnp multiplications. Once we have AB, an m by p matrix, the number of multiplications to compute (AB)C is mpq. So the total multiplications to compute ABC by first computing AB is mp(n + q).

If we compute BC first, this requires npq multiplications, and then multiplying A by BC requires mnq more, for a total of nq(m + p).

In summary, computing (AB)C requires mp(n + q) multiplications, and computing A(BC) requires (m + p)nq multiplications. I turned the second term around to emphasize that in both expressions you do something to m and p, and something else to n and q. You either multiply and add, or add and multiply.
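Under the same counting assumptions, here is a small sketch (the function names are mine, not from the post) that tallies both orders directly from the dimensions m, n, p, q:

    def cost_AB_first(m, n, p, q):
        # AB is m by p: m*n*p multiplications, then (AB)C adds m*p*q more.
        return m*n*p + m*p*q   # = m*p*(n + q)

    def cost_BC_first(m, n, p, q):
        # BC is n by q: n*p*q multiplications, then A(BC) adds m*n*q more.
        return n*p*q + m*n*q   # = n*q*(m + p)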

If you're trying to minimize these terms, you'd rather add big numbers together than multiply them. So if m and p are large relative to n and q, i.e. both A and C are tall matrices, having more rows than columns, then multiplying A(BC) is going to be faster than multiplying (AB)C.

For example, suppose both A and C have a million rows and a hundred columns. Necessarily B would have a hundred rows and a million columns. Computing (AB)C would require 2 × 10^14 multiplications, but computing A(BC) would take 2 × 10^10 multiplications. That is, the latter would be 10,000 times faster.
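Plugging those dimensions into the sketch above gives cost_AB_first(10**6, 100, 10**6, 100) == 2 * 10**14 and cost_BC_first(10**6, 100, 10**6, 100) == 2 * 10**10. A full-size timing would need matrices in the gigabyte range, but a scaled-down NumPy run (the sizes below are my own choice, not from the post) shows the same effect:

    import time
    import numpy as np

    # Tall A and C, wide B -- a scaled-down version of the example above.
    m, n, p, q = 5000, 50, 5000, 50
    A = np.random.rand(m, n)
    B = np.random.rand(n, p)
    C = np.random.rand(p, q)

    t0 = time.perf_counter()
    X = (A @ B) @ C              # ~ m*p*(n + q) = 2.5e9 multiplications
    t1 = time.perf_counter()
    Y = A @ (B @ C)              # ~ n*q*(m + p) = 2.5e7 multiplications
    t2 = time.perf_counter()

    print(f"(AB)C: {t1 - t0:.3f} s")
    print(f"A(BC): {t2 - t1:.3f} s")
    print(np.allclose(X, Y))     # same product either way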

Related: Applied linear algebra

***

[1] This post assumes we compute matrix products the usual way, taking the product of every row in the left matrix with every column in the right matrix. It's theoretically possible, though not practical, to multiply matrices in fewer operations. If the matrices have some exploitable special structure then there could be a faster way to multiply them, but not in general.

[2] It is common in numerical linear algebra to just count multiplications because this gives a good idea of how efficient an algorithm is. Counting additions and multiplications would usually lead to the same conclusions. If you really want to be precise, you'd need to count memory operations. In practice memory management makes more of a difference than whether we count just multiplications or multiplications and additions.
