1 Learning Outcomes
- Explain why C DGEMM runs faster than Python DGEMM.
- Understand that in practice, library implementations like NumPy DGEMM are plenty fast because they use the optimizations described in this unit.
- Compare sequential optimizations on matrix multiplication: loop unrolling, the `register` keyword, and transposing matrices.
- Compare performance using different gcc optimization flags: `-O0`, `-O2`, `-O3`.
Review the DGEMM (Double GEneral Matrix Multiplication) algorithm in an earlier section.
DGEMM: Double-precision General Matrix Multiplication
```c
matrix_d_t *dgemm(matrix_d_t *A, matrix_d_t *B) {
    if (A->ncols != B->nrows) return NULL;
    matrix_d_t *C = init_mat_d(A->nrows, B->ncols);
    for (int i = 0; i < A->nrows; i++) {
        for (int j = 0; j < B->ncols; j++) {
            double sum = 0;
            for (int k = 0; k < A->ncols; k++) {
                sum += A->data[i*A->ncols+k] * B->data[k*B->ncols+j];
            }
            C->data[i*C->ncols+j] = sum;
        }
    }
    return C;
}
```
2 Demo Information
GCC Options: For now, unless otherwise stated, all C benchmarks are compiled with no optimization flags turned on by passing gcc the option -O0.
We discuss gcc optimization flags in a later section.
```
gcc -g3 -std=c11 -Wall -O0 matmul.c run_matmul.c -o run_matmul
```

Timing: For non-threaded programs (no OpenMP), we use the clock from the C standard library <time.h>. The CLOCKS_PER_SEC name[1] is used to convert the value returned by the clock() function into seconds.
```c
#include <time.h>

int main() {
    ...
    clock_t start = clock();
    matrix_d_t* C = dgemm(A, B);
    clock_t end = clock();
    // execution time in seconds
    double delta_time = (double)(end - start) / CLOCKS_PER_SEC;
    ...
}
```

Machine: The demos in this section run on the shared course hive machines: an Intel(R) Core(TM) i7-8700T processor with 6 cores (2 threads per core) and a 12 MiB cache.
For better or for worse, hive machines are shared. Our timing is measured from program start to program end, and we may be sharing the machine with other users, so these times are really upper bounds on each program's true runtime :-)
For more accurate numbers, we recommend reading Leiserson et al. 2020.[2]
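One lightweight mitigation, not part of the course harness but a common trick: run the benchmark several times and keep the minimum, since interference from other users can only inflate a measurement. A minimal sketch, where the helper name `best_dgemm_seconds` is our own invention:

```c
#include <time.h>

/* Hypothetical helper (not part of the course code): time dgemm()
   `trials` times and keep the fastest run. Interference from other
   users can only slow a run down, so the minimum is the most
   representative measurement. */
double best_dgemm_seconds(matrix_d_t *A, matrix_d_t *B, int trials) {
    double best = -1.0;
    for (int t = 0; t < trials; t++) {
        clock_t start = clock();
        matrix_d_t *C = dgemm(A, B);
        clock_t end = clock();
        double secs = (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0 || secs < best) best = secs;
        free_mat_d(C);  // free each trial's result
    }
    return best;
}
```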
3 DGEMM 1: int vs. double, FLOPs
The double-precision component of DGEMM is challenging because it requires many floating point operations. Integer multiplication and addition are faster than their floating-point analogs:
dgemm: 0.770195 seconds
igemm: 0.734030 seconds

IGEMM (Integer GEneral Matrix Multiplication):
```c
typedef struct {
    int nrows;
    int ncols;
    int* data;
} matrix_t;

matrix_t *init_mat(int nrows, int ncols);
matrix_t *load_mat(FILE* f);
void print_mat(matrix_t *mat);
void free_mat(matrix_t *mat);

matrix_t *igemm(matrix_t *mat_A, matrix_t *mat_B) {
    if (mat_A->ncols != mat_B->nrows) return NULL;
    matrix_t *mat_C = init_mat(mat_A->nrows, mat_B->ncols);
    for (int i = 0; i < mat_A->nrows; i++) {
        for (int j = 0; j < mat_B->ncols; j++) {
            int sum = 0;
            for (int k = 0; k < mat_A->ncols; k++) {
                sum += mat_A->data[i*mat_A->ncols+k] * mat_B->data[k*mat_B->ncols+j];
            }
            mat_C->data[i*mat_C->ncols+j] = sum;
        }
    }
    return mat_C;
}
```
Since raw execution time alone does not tell us how efficiently a program performs its bottleneck operations, we find it useful to use a secondary metric that measures them directly.
FLOPS (Floating Point Operations per Second) counts the number of floating point operations performed (e.g., floating point adds, floating point multiplications, etc.) and divides by the measured program execution time.
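For example, an N × N DGEMM performs N³ multiplications and N³ additions, so a run that takes delta_time seconds achieves 2N³/delta_time FLOPS. A minimal sketch, reusing N and delta_time from the timing code above (and assuming <stdio.h> is included):

```c
// An N x N DGEMM performs 2*N*N*N floating point operations
// (N^3 multiplies + N^3 adds).
double flops = 2.0 * N * N * N / delta_time;
printf("%.4f GFLOPS\n", flops / 1e9);  // 1 GFLOPS = 1e9 FLOPS
```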
Table 1 shows the FLOPS (reported as megaflops and gigaflops) of standard Python and C implementations of DGEMM.
Table 1: Python vs. C in FLOPS.
| N | Python (MFLOPS) | Python (GFLOPS) | C (GFLOPS) |
|---|---|---|---|
| 32 | 5.4 | 0.0054 | 1.30 |
| 160 | 5.5 | 0.0055 | 1.30 |
| 480 | 5.4 | 0.0054 | 1.32 |
| 960 | 5.3 | 0.0053 | 0.91 |
4 DGEMM 2: C vs. Python
The analogous Python code to the C DGEMM benchmark:
DGEMM: Python
```python
# A, B, and C are flat lists storing N x N matrices in row-major
# order, matching the layout of the C benchmark.
def matmul(A, B, N):
    C = [0 for _ in A]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i*N + j] += A[i*N + k] * B[k*N + j]
    return C
```
Running and timing C vs. Python on a 512 × 512 matrix multiplication:
C: 0.770195 seconds
Python: 32.67397 seconds

Implementing DGEMM in C yields more than a 40x improvement over implementing DGEMM in Python! This performance comes from reducing the number of operations the program performs. From P&H 2.21:
> The reasons for the speedup are fundamentally using a compiler instead of an interpreter and because the type declarations of C allow the compiler to produce much more efficient code.
5 DGEMM 3: C vs. Python NumPy
Should we move back to C for all our mathematical and scientific computations? In practice, no. Modern scientific computing libraries leverage many of the performance optimizations we will discuss in the next few sections.
NumPy is a Python library designed specifically to add support for large, multi-dimensional arrays and matrices. Matrix multiplication in NumPy assumes that matrices A and B are numpy.ndarrays:
DGEMM: Python NumPy
```python
import numpy as np
...
A = np.array(A).reshape((N, N))
B = np.array(B).reshape((N, N))
C = A @ B  # matrix multiplication (A * B would be elementwise)
```
Now with NumPy:
C: 0.770195 seconds
NumPy: 0.000964 seconds

That's right: using NumPy gives a nearly 800x improvement over our C benchmark! Instead of reinventing the wheel, we suggest that in your future work you simply use scientific libraries like those offered by Python's NumPy.
NumPy accelerates matrix multiplication by leveraging parallelism and cache blocking, discussed later in this unit.
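As a preview of the cache blocking idea (treated in depth later in this unit), here is a minimal, hypothetical sketch in C: the loops are tiled by a BLOCK factor so that each tile of A, B, and C is reused while it sits in cache. The `dgemm_blocked` name, the flat-array interface, and the BLOCK value are our own illustration, not NumPy's actual implementation.

```c
#define BLOCK 32  /* tile size; chosen so a few BLOCK x BLOCK tiles fit in cache */

/* Hypothetical blocked matrix multiply on flat row-major N x N arrays.
   C must be zero-initialized by the caller. */
void dgemm_blocked(int N, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                /* multiply one tile of A by one tile of B */
                for (int i = ii; i < ii + BLOCK && i < N; i++)
                    for (int j = jj; j < jj + BLOCK && j < N; j++) {
                        double sum = C[i*N + j];
                        for (int k = kk; k < kk + BLOCK && k < N; k++)
                            sum += A[i*N + k] * B[k*N + j];
                        C[i*N + j] = sum;
                    }
}
```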
6 DGEMM 4: The register keyword
We update our C dgemm benchmark from before to use registers for the innermost loop variable k, the pointers to the current elements of matrices A and B, and the running sum that will eventually be written to the corresponding element of matrix C.
```c
matrix_d_t *dgemm(matrix_d_t *A, matrix_d_t *B) {
    if (A->ncols != B->nrows) return NULL;
    matrix_d_t *C = init_mat_d(A->nrows, B->ncols);
    for (int i = 0; i < A->nrows; i++) {
        for (int j = 0; j < B->ncols; j++) {
            register double sum = 0;
            register double *ptr_A = (A->data) + (i*A->ncols);
            register double *ptr_B = (B->data) + j;
            register int k = 0;
            for (; k < A->ncols; ) {
                sum += (*ptr_A) * (*ptr_B);
                ptr_A++;
                ptr_B += B->ncols;
                k++;
            }
            C->data[i*C->ncols+j] = sum;
        }
    }
    return C;
}
```
On a 512 × 512 matrix multiplication:
C: 0.768672 seconds
registers: 0.277462 seconds

We get nearly a 3x improvement by moving frequently accessed values to registers!
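Keep in mind that register is only a hint: the compiler is free to ignore it, and at higher optimization levels gcc allocates registers on its own (see Table 2 below). One concrete rule the keyword does impose in C, shown in this small standalone example, is that you cannot take the address of a register variable:

```c
void register_demo(void) {
    register int k = 0;  // hint: keep k in a CPU register
    k++;                 // ordinary reads and writes are fine
    // int *p = &k;      // compile error: address of register variable requested
}
```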
7 DGEMM 5: Cache blocking with matrix transpose
In this section, we first transpose matrix B to store its elements in column-major order (as the variable B_T). We additionally include the register optimization from before.
```c
matrix_d_t *dgemm(matrix_d_t *A, matrix_d_t *B) {
    if (A->ncols != B->nrows) return NULL;
    matrix_d_t *C = init_mat_d(A->nrows, B->ncols);
    matrix_d_t *B_T = transpose_mat_d(B);                        // Line 4
    for (int i = 0; i < A->nrows; i++) {
        for (int j = 0; j < B->ncols; j++) {
            register double sum = 0;
            register double *ptr_A = A->data + (i*A->ncols);
            register double *ptr_B_T = B_T->data + (j*B_T->ncols);  // Line 9
            register int k = 0;
            for (; k < A->ncols; ) {
                sum += (*ptr_A) * (*ptr_B_T);
                ptr_A++;
                ptr_B_T++;                                       // Line 14
                k++;
            }
            C->data[i*C->ncols+j] = sum;
        }
    }
    free_mat_d(B_T);
    return C;
}
```
Explanation:

- Line 4 transposes B to produce B_T. B_T stores the elements of matrix B in column-major order.
- Line 9 accesses the jth column of B (by pointing to the jth row of B_T).
- Line 14 increments the pointer to the next element in the jth column of B (by incrementing the ptr_B_T pointer).
BTW: We use a straightforward transpose implementation. Nothing fancy.
```c
matrix_d_t *transpose_mat_d(matrix_d_t *mat) {
    // new data array is "transpose", ncols x nrows
    // data_ij = orig_data_ji
    matrix_d_t *transpose = init_mat_d(mat->ncols, mat->nrows);
    for (int i = 0; i < transpose->nrows; i++) {
        for (int j = 0; j < transpose->ncols; j++) {
            transpose->data[i*transpose->ncols+j] = mat->data[j*mat->ncols+i];
        }
    }
    return transpose;
}
```

On a 512 × 512 matrix multiplication:
C: 0.768672 seconds
registers: 0.277462 seconds
transpose: 0.204523 seconds

This performance speedup may be smaller than you expected. The matrix transpose in Line 4 adds its own slow memory accesses, even though it speeds up accesses in the core triple for-loop of matrix multiplication. Because the transpose touches only N² elements while the multiplication performs on the order of N³ operations, its relative overhead shrinks as N grows: the transpose optimization should be faster for large matrices but slower for small matrices.[3]
8 DGEMM 6: GCC optimizations
We run the original, register, and transpose versions of DGEMM with different gcc optimization flags and report the results in Table 2.
Table 2: DGEMM/matrix multiplication runtime (in seconds) with different optimization flags.

| Program | -O0 | -O1 | -O2 | -O3 |
|---|---|---|---|---|
| DGEMM (original) | 0.768672 | 0.193948 | 0.197389 | 0.193935 |
| DGEMM registers | 0.277462 | 0.191474 | 0.191921 | 0.126439 |
| DGEMM transpose (and registers) | 0.204523 | 0.132441 | 0.132383 | 0.126460 |
Why doesn’t loop unrolling (-O3) produce wildly faster code? In our current code, it is entirely possible that unrolling is limited because our outer loop is too large (the size of N). To make the code more amenable to loop unrolling, we could introduce a fourth nested loop (that iterates for UNROLL iterations) and then let gcc optimize that, as sketched below.[4]
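A hypothetical sketch of that fourth loop: fix an UNROLL factor, step k by UNROLL, and accumulate into separate partial sums so gcc can schedule the unrolled iterations independently. The partial-sum style and the assumption that A->ncols is a multiple of UNROLL are ours, not course code.

```c
#define UNROLL 4  /* unroll factor; sketch assumes A->ncols is a multiple of UNROLL */

/* Core loops of DGEMM restructured with a fourth nested loop of UNROLL
   iterations for gcc to unroll. Separate partial sums break the
   dependence chain between unrolled iterations. */
for (int i = 0; i < A->nrows; i++) {
    for (int j = 0; j < B->ncols; j++) {
        double partial[UNROLL] = {0};
        for (int k = 0; k < A->ncols; k += UNROLL) {
            for (int u = 0; u < UNROLL; u++) {  /* the fourth nested loop */
                partial[u] += A->data[i*A->ncols + (k+u)] * B->data[(k+u)*B->ncols + j];
            }
        }
        double sum = 0;
        for (int u = 0; u < UNROLL; u++) sum += partial[u];
        C->data[i*C->ncols + j] = sum;
    }
}
```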
[1] See the UNIX specification of <time.h> for more information on CLOCKS_PER_SEC and clock().

[2] Charles E. Leiserson et al. “There’s plenty of room at the Top: What will drive computer performance after Moore’s law?” Science, Volume 368, Issue 6495 (2020). DOI: 10.1126/science.aam9744. Code is on GitHub as described on Zenodo.
[3] In practice (if you look at the NumPy code), a matrix and its transpose are both used for matrix multiplication; it is natural to assume that the transpose is computed ahead of time and updated when needed. See the reference open-source code for NumPy at the end of the section if you are curious.
[4] We may implement DGEMM unrolling in a future version of this course. For now, we leave this implementation to those of you who are truly inspired. :-)
- Leiserson, C. E., Thompson, N. C., Emer, J. S., Kuszmaul, B. C., Lampson, B. W., Sanchez, D., & Schardl, T. B. (2020). There’s plenty of room at the Top: What will drive computer performance after Moore’s law? Science, 368(6495). 10.1126/science.aam9744