
1 Learning Outcomes

Review the DGEMM (Double-precision General Matrix Multiplication) algorithm from an earlier section.

DGEMM: Double-precision General Matrix Multiplication

matrix_d_t *dgemm(matrix_d_t *A, matrix_d_t *B) {
  if (A->ncols != B->nrows) return NULL;
  matrix_d_t *C = init_mat_d(A->nrows, B->ncols);
    
  for (int i = 0; i < A->nrows; i++) {
    for (int j = 0; j < B->ncols; j++) {
      double sum = 0; 
      for (int k = 0; k < A->ncols; k++) {
        sum += A->data[i*A->ncols+k]*B->data[k*B->ncols+j];
      }
      C->data[i*C->ncols+j] = sum;
    } 
  }   
  return C;
}

2 Demo Information

GCC Options: For now, unless otherwise stated, all C benchmarks are compiled with no optimization flags turned on in gcc by passing the option -O0. We discuss gcc optimization flags in a later section.

gcc -g3 -std=c11 -Wall -O0 matmul.c run_matmul.c -o run_matmul

Timing: For non-threaded programs (no OpenMP), we use the clock() function from the C standard library header <time.h>. The CLOCKS_PER_SEC name[1] converts the value returned by clock() into seconds.

#include <time.h>
int main() {
  ...

  clock_t start = clock();
  matrix_d_t* C = dgemm(A, B);
  clock_t end = clock();

  // execution time in seconds
  double delta_time = (double) (end - start)/CLOCKS_PER_SEC;
  ...
}

Machine: The demos in this section run on the shared course hive machines: an Intel(R) Core(TM) i7-8700T processor with 6 cores (2 threads per core) and a 12 MiB cache.

For better or for worse, hive machines are shared. Our timing is measured from program start to program end, and we may be sharing the machine with other users, so these times are upper bounds on the true runtimes :-)

For more accurate numbers, we recommend reading Leiserson et al. 2020.[2]

3 DGEMM 1: int vs. double, FLOPs

The double-precision component of DGEMM is challenging because it requires many floating point operations. Integer multiplication and addition are faster than their floating-point analogs:

dgemm:  0.770195 seconds
igemm:  0.734030 seconds

Beyond raw execution time, we may find it useful to use a secondary metric that explicitly measures the throughput of the bottleneck operations.

FLOPS (Floating Point Operations per Second) counts the number of floating point operations a program performs (e.g., floating point adds, floating point multiplications, etc.) and divides by the measured program execution time.

Table 1 shows the FLOPS (and megaflops, and gigaflops) of a standard Python or C implementation of DGEMM.

Table 1: Python vs. C in FLOPS.

| N   | Python (MFLOPS) | Python (GFLOPS) | C (GFLOPS) |
|-----|-----------------|-----------------|------------|
| 32  | 5.4             | 0.0054          | 1.30       |
| 160 | 5.5             | 0.0055          | 1.30       |
| 480 | 5.4             | 0.0054          | 1.32       |
| 960 | 5.3             | 0.0053          | 0.91       |

4 DGEMM 2: C vs. Python

The analogous Python code to the C DGEMM benchmark:

DGEMM: Python

def matmul(A, B, N):
    C = [0 for _ in A]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i*N + j] += A[i + k*N] * B[k + j*N]
    return C

Running and timing C vs. Python on a 512 × 512 matrix multiplication:

C:      0.770195 seconds
Python: 32.67397 seconds

Implementing DGEMM in C yields more than a 40x improvement over implementing DGEMM in Python! This performance comes from reducing the number of operations the program performs. From P&H 2.21:

The reasons for the speedup are fundamentally using a compiler instead of an interpreter and because the type declarations of C allow the compiler to produce much more efficient code.

5 DGEMM 3: C vs. Python NumPy

Should we move back to C for all our mathematical and scientific computations? In practice, no. Modern scientific computing libraries leverage many of the performance optimizations we will discuss in the next few sections.

NumPy is a Python library designed specifically to add support for large, multi-dimensional arrays and matrices. Matrix multiplication in NumPy assumes that matrices A and B are numpy.ndarrays:

DGEMM: Python NumPy

import numpy as np

...
A = np.array(A).reshape((N, N))
B = np.array(B).reshape((N, N))
C = A @ B  # matrix multiplication; A * B would be elementwise

Now with NumPy:

C:      0.770195 seconds
NumPy:  0.000964 seconds

That’s right—using NumPy gives a nearly 800x improvement over our C benchmark! Instead of reinventing the wheel, we suggest using scientific libraries like Python’s NumPy in your future work.

NumPy accelerates matrix multiplication by leveraging parallelism and cache blocking, discussed later in this unit.

6 DGEMM 4: The register keyword

We update our C dgemm benchmark from before to use registers for the innermost loop variable k, the pointer to elements in matrices A and B, and the sum value that will eventually be written to the corresponding element of matrix C.

DGEMM with Registers
matrix_d_t *dgemm(matrix_d_t *A, matrix_d_t *B) {
  if (A->ncols != B->nrows) return NULL;
  matrix_d_t *C = init_mat_d(A->nrows, B->ncols);

  for (int i = 0; i < A->nrows; i++) {
    for (int j = 0; j < B->ncols; j++) {
      register double sum = 0;
      register double *ptr_A = (A->data) + (i*A->ncols);
      register double *ptr_B = (B->data) + j;
      register int k = 0;
      for (; k < A->ncols; ) {
          sum += (*ptr_A) * (*ptr_B);
          ptr_A++;
          ptr_B += B->ncols;
          k++;
      }
      C->data[i*C->ncols+j] = sum;
    }
  }
  return C;
}

On a 512 × 512 matrix multiplication:

C:         0.768672 seconds
registers: 0.277462 seconds

We get an almost 3x improvement by moving frequently accessed values to registers!

7 DGEMM 5: Cache blocking with matrix transpose

In this section, we first transpose the B matrix to store it in column-major order (as the variable B_T). We additionally include the register optimization from before.

Transpose + Registers
matrix_d_t *dgemm(matrix_d_t *A, matrix_d_t *B) {
  if (A->ncols!=B->nrows) return NULL;
  matrix_d_t *C = init_mat_d(A->nrows, B->ncols);
  matrix_d_t *B_T = transpose_mat_d(B);
  for (int i = 0; i < A->nrows; i++) {
    for (int j = 0; j < B->ncols; j++) {
      register double sum = 0;
      register double *ptr_A = A->data+(i*A->ncols);
      register double *ptr_B_T = B_T->data+(j*B_T->ncols);
      register int k = 0;
      for (; k < A->ncols;) {
        sum+=(*ptr_A) * (*ptr_B_T);
        ptr_A++;
        ptr_B_T++;
        k++;
      }
      C->data[i*C->ncols+j] = sum;
    }
  }
  free_mat_d(B_T);
  return C;
}

On a 512 × 512 matrix multiplication:

C:         0.768672 seconds
registers: 0.277462 seconds
transpose: 0.204523 seconds

This performance speedup may be less than you expected. Understandably, the matrix transpose itself (the call to transpose_mat_d) adds slow memory accesses, even though it speeds up accesses in the core triple for-loop of matrix multiplication. The transpose optimization should be faster for large matrices but slower for small matrices.[3]

8 DGEMM 6: GCC optimizations

We run the original DGEMM, the register DGEMM, and the transpose DGEMM with different gcc optimization flags and report the results in Table 2.

Table 2: DGEMM/matrix multiplication runtime (in seconds) with different optimization flags.

| Program                         | -O0      | -O1      | -O2      | -O3      |
|---------------------------------|----------|----------|----------|----------|
| DGEMM (original)                | 0.768672 | 0.193948 | 0.197389 | 0.193935 |
| DGEMM with registers            | 0.277462 | 0.191474 | 0.191921 | 0.126439 |
| DGEMM transpose (and registers) | 0.204523 | 0.132441 | 0.132383 | 0.126460 |

Why doesn’t loop unrolling (-O3) produce wildly faster code? In our current code, it is entirely possible that unrolling is limited because our outer loop is too large (the size of N). To make the code more amenable to loop unrolling, we could introduce a fourth nested loop (that iterates for UNROLL iterations), then let gcc optimize that.[4]

Footnotes
  1. See the UNIX specification of <time.h> for more information on CLOCKS_PER_SEC and clock().

  2. Charles E. Leiserson et al. “There’s plenty of room at the Top: What will drive computer performance after Moore’s law?” Science Volume 368, Issue 6495 (2020). DOI:10.1126/science.aam9744 Code is on GitHub as described on Zenodo.

  3. In practice (if you look at the NumPy code), both a matrix and its transpose are used for matrix multiplication; it is natural to assume that the transpose is computed ahead of time and updated when needed. See the reference open-source code for NumPy at the end of the section if you are curious.

  4. We may implement DGEMM unrolling in a future version of this course. For now, we leave this implementation to those of you who are truly inspired. :-)

References
  1. Leiserson, C. E., Thompson, N. C., Emer, J. S., Kuszmaul, B. C., Lampson, B. W., Sanchez, D., & Schardl, T. B. (2020). There’s plenty of room at the Top: What will drive computer performance after Moore’s law? Science, 368(6495). 10.1126/science.aam9744