
1 Learning Outcomes

In this section, we consider how knowing the underlying design of our cache can actually improve how we write programs.

2 Matrix Multiplication

In this section, we use a matrix multiplication benchmark: multiplying matrix $A$ (4 rows $\times$ 8 columns) by matrix $B$ (8 rows $\times$ 4 columns) to produce the matrix $C$ (4 rows $\times$ 4 columns).[1]

Assume that matrices AA, BB, and CC are stored in row-major order as int A[], int B[], and int C[].

C code, for your reference:
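The reference code is not reproduced here, so below is a minimal sketch of a naive matrix multiplication, assuming the 4×8 and 8×4 dimensions from the example above (the function and macro names are our own):

```c
#define N 4  /* rows of A and of C    */
#define D 8  /* cols of A, rows of B  */
#define M 4  /* cols of B and of C    */

/* Naive matrix multiplication: C = A * B, all matrices row-major. */
void matmul(int A[N][D], int B[D][M], int C[N][M]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            int sum = 0;
            for (int k = 0; k < D; k++) {
                /* B is accessed column-wise: consecutive k values are
                   M ints apart in memory. */
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
```

Note the inner loop walks a row of A (unit stride) but a column of B (stride M), which is the access pattern analyzed below.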

3 Architecture details

Suppose we run our C program on a 32-bit architecture that has a single-level cache:

For our matrix multiplication example, we will assume that on this architecture, sizeof(int) is 4, so each block holds four ints.

4 matmul Memory Access Pattern

How does matrix multiplication fare? Assume the cache starts out cold.

Suppose we first compute $C_{00}$, which is the dot product of the (zero-indexed) zero-th row of $A$ and the (zero-indexed) zero-th column of $B$.

Figure 1: Computing $C_{00}$ as vector multiplication of the zero-th row of $A$ and the zero-th column of $B$.

Cache contents after computing $C_{00}$, in order of most recently used:

Next, suppose we compute $C_{01}$, which is the dot product of the (zero-indexed) zero-th row of $A$ and the (zero-indexed) first column of $B$.

Figure 2: Computing $C_{ij}$ as vector multiplication of the $i$-th row of $A$ and the $j$-th column of $B$.

Cache contents after computing $C_{01}$, in order of most recently used:

In our matmul example, we know that $B$ is stored in row-major layout. To access a column of $B$, as matrix multiplication requires, we must load in blocks from all 8 rows of $B$.

From P&H 4.4 for square matrices (N-by-N):

If the cache can hold one N-by-N matrix and one row of N elements, then at least the $i$-th row of $A$ and the entire matrix $B$ may stay in the cache. With less capacity than that, misses may occur for both $B$ and $C$. In the worst case, there would be $2N^3 + N^2$ memory words accessed for $N^3$ operations.
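To make the worst case concrete, consider an assumed value of $N = 32$:

$$2N^3 + N^2 = 2(32{,}768) + 1{,}024 = 66{,}560 \text{ memory words accessed for } N^3 = 32{,}768 \text{ operations.}$$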

5 Cache Blocking

We have seen in a previous section how, as computer architects, we can reduce these misses by increasing the capacity of our cache. However, as programmers, we often do not have control over the hardware of our computer; it is not trivial to swap out the cache. Instead, we must assume some fixed hardware architecture, then see how we can rewrite our program to maximize use of the hardware.

Cache blocking is a programmer technique that rearranges data accesses to make better use of the data brought into the cache.

5.1 Approach 1: Transpose

One observation is that it would be much better to load just the 8 elements in the desired column of $B$, and not the elements in other columns that are only needed for later computations.

A cache blocking technique could transpose $B$ before matrix multiplication. The transpose of $B$ is written as $B^T$ and is defined by $B^T_{jk} = B_{kj}$ for all indices $j$ and $k$, as shown in Figure 3.

Figure 3: $B^T$ is the matrix transpose of $B$.

If we maintain a copy of B_T (mathematically, $B^T$), we can therefore redefine our matrix multiplication as follows:

// cache-blocking code
// for row i, col j of C
int sum = 0; // sizeof(int) = 4
for (int k = 0; k < size; k++) {
  sum += A[i][k] * B_T[j][k];
}
C[i][j] = sum;
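The copy B_T itself has to be built first. A minimal sketch of that step, assuming the 8×4 dimensions of $B$ from our example (the transpose function name is our own):

```c
#define D 8  /* rows of B    */
#define M 4  /* columns of B */

/* Fill B_T (M x D) with the transpose of B (D x M): B_T[j][k] = B[k][j].
   One of the two matrices is necessarily accessed with a large stride here,
   so this pass incurs the same strided-access behavior we set out to avoid. */
void transpose(int B[D][M], int B_T[M][D]) {
    for (int k = 0; k < D; k++) {
        for (int j = 0; j < M; j++) {
            B_T[j][k] = B[k][j];
        }
    }
}
```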

Notes:

Transposing is quite slow; it requires an $N^2$ overhead to complete and triggers the same types of cache misses we observed in our original computation. We have simply moved our poor cache performance from the matrix multiplication to another part of the program.

5.2 Approach 2: Submatrix computation

Our second cache blocking approach observes that matrix multiplication can be computed piecewise. A submatrix (i.e., tile) of $C$ can be computed as the sum of products of corresponding submatrices of $A$ and $B$.

Figure 4 illustrates cache blocking with a blocking factor, or BLOCKSIZE, of 2. We compute the 2×2 tile of $C$ with elements $C_{ij}$, where $i \in \{0, 1\}$ and $j \in \{0, 1\}$. This tile can be computed as four submatrix multiplications.

Figure 4: Cache blocking.
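A full tiled loop nest in this style might look like the following sketch, for square $N \times N$ matrices with BLOCKSIZE assumed to divide $N$ evenly (the function name is our own):

```c
#define N 4
#define BLOCKSIZE 2

/* Blocked matrix multiplication: C += A * B, computed one
   BLOCKSIZE x BLOCKSIZE tile of C at a time. C must be zeroed first. */
void matmul_blocked(int A[N][N], int B[N][N], int C[N][N]) {
    for (int ii = 0; ii < N; ii += BLOCKSIZE) {
        for (int jj = 0; jj < N; jj += BLOCKSIZE) {
            for (int kk = 0; kk < N; kk += BLOCKSIZE) {
                /* Multiply the (ii,kk) tile of A by the (kk,jj) tile of B,
                   accumulating into the (ii,jj) tile of C. */
                for (int i = ii; i < ii + BLOCKSIZE; i++) {
                    for (int j = jj; j < jj + BLOCKSIZE; j++) {
                        int sum = C[i][j];
                        for (int k = kk; k < kk + BLOCKSIZE; k++) {
                            sum += A[i][k] * B[k][j];
                        }
                        C[i][j] = sum;
                    }
                }
            }
        }
    }
}
```

The three inner loops touch only three BLOCKSIZE×BLOCKSIZE tiles at a time, so a small working set is reused repeatedly before the next tiles are loaded.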

5.3 Summary: Cache Blocking

Note that cache blocking may still replace rows of $B$ that will be needed later; it does not avoid all capacity misses. However, it reduces the total number of memory accesses, thereby reducing the total number of capacity misses.

From P&H 4.4 for square matrices (N-by-N):

Looking only at capacity misses, the total number of memory words accessed is $2N^3/\text{BLOCKSIZE} + N^2$. This total is an improvement by about a factor of BLOCKSIZE.
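For illustration, with assumed values $N = 32$ and $\text{BLOCKSIZE} = 4$:

$$2N^3/\text{BLOCKSIZE} + N^2 = 65{,}536/4 + 1{,}024 = 17{,}408$$

memory words, versus the $2N^3 + N^2 = 66{,}560$ of the unblocked worst case, roughly the factor-of-BLOCKSIZE improvement.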

Blocking exploits a combination of spatial and temporal locality: the neighboring elements in each loaded block are all used before we move on (spatial locality), and each tile is reused across several computations before it is evicted (temporal locality).

Footnotes
  1. Using proper mathematical notation, where $\mathbb{Z}$ is the set of all integers: $A \in \mathbb{Z}^{n \times d}$, $B \in \mathbb{Z}^{d \times m}$, $C \in \mathbb{Z}^{n \times m}$. In our example, $n = m = 4$, $d = 8$.