
1 Learning Outcomes

This section is the final and largest step in our incremental performance journey of adapting DGEMM to the underlying hardware of the course hive machines. Here, we use OpenMP to write threaded code that can utilize multiple cores.

Before continuing, we recommend reviewing:

Figure 4: DGEMM for 8 × 8 square matrices A and B. C_{00} is computed as the dot product of row i = 0 of A and column j = 0 of B.

Timing

With non-threaded programs, we used the C standard library <time.h> to measure program time. However, on Linux, clock() measures CPU time summed across all threads, which leads to an over-estimate for threaded code. In the programs below, we will instead use omp_get_wtime() to get the wall time from the main thread; this time is consistent within the main thread but not across threads. Read more in the OpenMP docs.

OpenMP Timing
<time.h> Timing
#include <omp.h>
int main() {
  ...

  double start = omp_get_wtime(); // returns seconds as a double, not clock ticks
  matrix_d_t* C = dgemm(A, B); // threaded
  double end = omp_get_wtime();

  // execution time in seconds; no CLOCKS_PER_SEC division needed
  double delta_time = end - start;
  ...
}

2 DGEMM 11: OpenMP DGEMM

In the code below, a single inserted line, #pragma omp parallel for, is all it takes to make this code run on multiple processors. This OpenMP pragma tells the compiler to spread the iterations of the outermost loop across multiple threads; the compiled program distributes that work over the 12 threads on the course hive machines. For a 512 x 512 matrix multiplication, each thread then processes about 42 iterations of the outermost loop.

DGEMM, MIMD
DGEMM (original)
matrix_d_t *dgemm_openmp(matrix_d_t *A, matrix_d_t *B) {
  if (A->ncols != B->nrows) return NULL;
  matrix_d_t *C = init_mat_d(A->nrows, B->ncols);

  #pragma omp parallel for
  for (int i = 0; i < A->nrows; i++) {
    for (int j = 0; j < B->ncols; j++) {
      double sum = 0;
      for (int k = 0; k < A->ncols; k++) {
        sum += A->data[i*A->ncols+k]*B->data[k*B->ncols+j];
      }
      C->data[i*C->ncols+j] = sum;
    }
  }
  return C;
}

Even with no gcc compiler optimizations, our threaded OpenMP DGEMM is now blazingly fast compared to all of the optimizations we tried earlier, and even compared to NumPy:

C:                 0.768672 seconds
python NumPy:      0.000964 seconds
registers:         0.277462 seconds
simd,naive:        0.416584 seconds
openmp:            0.0000001401 seconds

3 DGEMM 12: OpenMP Tiled SIMD DGEMM

This is our final program optimization. The code below combines the core optimizations we developed earlier: cache blocking (tiling), SIMD AVX intrinsics, and OpenMP threading:

DGEMM, OpenMP Tiled SIMD
DGEMM Tiled SIMD
AVX intrinsics
static inline void matmul_simd_tile(int si, int sj, int sk,
                 matrix_d_t *A, matrix_d_t *B, matrix_d_t *C) {
  for (int i = si; i < si + BLOCKSIZE; i++) {
    for (int j = sj; j < sj + BLOCKSIZE; j += 4) { // 4 doubles at a time
      avx256_t v_C = avx_load(C->data+i*C->ncols+j);

      for (int k = sk; k < sk + BLOCKSIZE; k++) {
        avx256_t s_A = avx_set_num(A->data[(i*A->ncols)+k]);
        avx256_t v_B = avx_load(B->data+k*B->ncols+j);
        v_C = avx_mul_add(s_A, v_B, v_C);
      }
      avx_store(C->data+i*C->ncols+j, v_C);
    }
  }
}

matrix_d_t *dgemm_openmp_simd_block(matrix_d_t *A, matrix_d_t *B) {
  if (A->ncols != B->nrows) return NULL;
  matrix_d_t *C = init_mat_d(A->nrows, B->ncols);

  #pragma omp parallel for // threads take disjoint sj tile columns of C
  for (int sj = 0; sj < B->ncols; sj += BLOCKSIZE) {
    for (int si = 0; si < A->nrows; si += BLOCKSIZE) {
      for (int sk = 0; sk < A->ncols; sk += BLOCKSIZE) {
        matmul_simd_tile(si, sj, sk, A, B, C);
      }
    }
  }
  return C;
}

Combining Intel SIMD Intrinsics and OpenMP gives us our fastest matrix multiplication yet:

C:                 0.768672 seconds
python NumPy:      0.000964 seconds
registers:         0.277462 seconds
simd,naive:        0.416584 seconds
openmp:            0.0000001401 seconds
openmp,simd,tiled: 0.0000000686 seconds

4 DGEMM 13: GCC Optimizations

Again, let's compile using different gcc optimization flags; Table 1 shows the results. The threaded versions win every time.

Table 1: Threaded DGEMM runtime (in seconds) with different optimization flags.

Program                       -O0           -O1           -O2           -O3
DGEMM (original)              0.768672      0.197168      0.197538      0.193889
DGEMM, naive SIMD             0.416584      0.048405      0.049970      0.049045
DGEMM, tile SIMD              0.356344      0.033386      0.035177      0.037030
DGEMM, threaded               0.0000001401  0.0000000347  0.0000000456  0.0000000467
DGEMM, threaded, tile SIMD    0.0000000686  0.0000000124  0.0000000131  0.0000000131
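For reference, here is a sketch of how such programs are typically built with gcc. The flag names are standard gcc options, but the file names are placeholders: -fopenmp enables the OpenMP pragmas (without it they are silently ignored), -mavx2 -mfma enables the AVX intrinsics, and OMP_NUM_THREADS caps the thread count at run time.

```shell
# Placeholder file names; substitute your actual sources.
gcc -O3 -fopenmp dgemm_openmp.c -o dgemm_openmp
gcc -O3 -fopenmp -mavx2 -mfma dgemm_openmp_simd_block.c -o dgemm_openmp_simd_block

# Run with the hive machines' 12 hardware threads.
OMP_NUM_THREADS=12 ./dgemm_openmp_simd_block
```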