
1 Learning Outcomes

Before continuing, we recommend reviewing the DGEMM benchmark and the performance optimizations we tried on a SISD architecture.

Figure 4: DGEMM for 8 × 8 square matrices A and B. C_{00} is computed as the dot product of row i = 0 of A and column j = 0 of B.

2 DGEMM 7: Naive SIMD DGEMM

With the ability to perform SIMD multiplication, it is tempting to use our new 256-bit-wide registers to load in blocks of row A and blocks of column B for the element-wise multiplication necessary for dot products, as shown in Figure 1.


Figure 1: “Naive” SIMD DGEMM that leverages SIMD architecture registers to parallelize multiplications within a single dot product. The outlined boxes indicate which values are loaded into the 256-bit-wide registers.[1]

This naive SIMD implementation does better than our naive DGEMM but worse than specifying narrow registers with the register keyword in C:

C               0.768672 seconds
registers:      0.277462 seconds
simd,naive:     0.416584 seconds

The non-compulsory cache misses persist in our naive SIMD implementation: to gather a column of B, we still touch multiple rows of B instead of a contiguous column.
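To see why, it helps to spell out the access pattern. The sketch below is a plain-C stand-in, not the intrinsics code from the end of this section: each 256-bit register is emulated with a four-double array, and the function name `dot_naive_simd` and the fixed size `N` are illustrative assumptions.

```c
#define N 8  /* illustrative fixed matrix size */

/* Plain-C emulation of one naive SIMD dot product: acc[] stands in for
 * a 256-bit register holding four doubles. */
double dot_naive_simd(double A[N][N], double B[N][N], int i, int j) {
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    for (int k = 0; k < N; k += 4) {
        for (int lane = 0; lane < 4; lane++) {
            double a = A[i][k + lane]; /* contiguous load from row i of A   */
            double b = B[k + lane][j]; /* strided gather from column j of B */
            acc[lane] += a * b;
        }
    }
    /* horizontal sum across the four lanes */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Each "register" load from B touches four different rows of B, which is exactly the access pattern the transpose in the next section removes.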

3 DGEMM 8: Transpose SIMD DGEMM

We next apply cache blocking by first transposing B, then leveraging our 256-bit-wide (four-double) registers for element-wise multiplication:


Figure 2: “Transposed” SIMD DGEMM that uses a transposed B to load in columns of B to streamline memory accesses. The outlined boxes indicate which values are loaded into the 256-bit-wide registers.

C               0.768672 seconds
registers:      0.277462 seconds
simd,naive:     0.416584 seconds
simd,transpose: 0.275622 seconds

This version is certainly faster, but it does not provide much benefit beyond our register keyword approach.
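In the same plain-C style (hypothetical helper names, row-major square matrices with n a multiple of 4, and a four-double array standing in for a 256-bit register), the transposed variant might look like:

```c
/* Transpose B so that column j of B becomes row j of BT. */
void transpose(const double *B, double *BT, int n) {
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            BT[c * n + r] = B[r * n + c];
}

/* With BT in hand, both operands of every 4-wide load are contiguous. */
double dot_transposed(const double *A, const double *BT, int n, int i, int j) {
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    for (int k = 0; k < n; k += 4)
        for (int lane = 0; lane < 4; lane++)
            acc[lane] += A[i * n + k + lane] * BT[j * n + k + lane];
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

The one-time O(n²) cost of the transpose is amortized over the O(n³) work of the multiplication.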

4 DGEMM 9: Tiled SIMD DGEMM

Recall that cache blocking is any re-design of our algorithm to adjust memory accesses. Earlier, we discussed a submatrix tiling approach to matrix multiplication—where we compute multiple elements of C with the current set of rows of A and set of columns of B.

In Figure 3, we assume that the product of a scalar with a vector can be computed as a vector operation, provided that the scalar is copied to each element in a vector of the same length.


Figure 3: Compute four elements of C (starting with C_{ij}) by iteratively adding the result of scaling four elements of the k-th row of B (starting with B_{kj}) with A_{ik}.
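The Figure 3 idea can be sketched in plain C (hypothetical function name; row-major n × n matrices with n a multiple of 4; a four-double array emulating one 256-bit register):

```c
/* Broadcast the scalar A_ik into a 4-wide "register" and multiply-
 * accumulate it against four consecutive elements of row k of B,
 * updating four elements of C at once. */
void dgemm_broadcast_strip(const double *A, const double *B, double *C,
                           int n, int i, int j) {
    double vC[4] = {0.0, 0.0, 0.0, 0.0};     /* accumulates C[i][j..j+3] */
    for (int k = 0; k < n; k++) {
        double sA[4];                        /* A_ik copied to all lanes */
        for (int lane = 0; lane < 4; lane++)
            sA[lane] = A[i * n + k];
        for (int lane = 0; lane < 4; lane++)
            vC[lane] += sA[lane] * B[k * n + j + lane]; /* row k of B */
    }
    for (int lane = 0; lane < 4; lane++)
        C[i * n + j + lane] = vC[lane];
}
```

Note that every load from B is now a contiguous slice of a row, with no transpose required.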

We can extend this idea to a tiled SIMD approach shown in Figure 4.

Figure 4: SIMD DGEMM tiled matrix multiplication. The outlined boxes indicate which values are loaded into the 256-bit-wide registers.
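Extending the same sketch to a 4 × 4 tile of C (again with hypothetical names and a four-double array per emulated register), four accumulators stay live so that each 4-element load from row k of B is reused four times:

```c
/* Compute the 4 x 4 tile C[i..i+3][j..j+3] with the broadcast trick.
 * Assumes row-major n x n matrices with n a multiple of 4. */
void dgemm_tile4x4(const double *A, const double *B, double *C,
                   int n, int i, int j) {
    double vC[4][4] = {{0}};            /* C[i+r][j..j+3] for r = 0..3 */
    for (int k = 0; k < n; k++) {
        double vB[4];
        for (int lane = 0; lane < 4; lane++)
            vB[lane] = B[k * n + j + lane];  /* one load, reused 4 times */
        for (int r = 0; r < 4; r++) {
            double sA = A[(i + r) * n + k];  /* broadcast A_(i+r)k */
            for (int lane = 0; lane < 4; lane++)
                vC[r][lane] += sA * vB[lane];
        }
    }
    for (int r = 0; r < 4; r++)
        for (int lane = 0; lane < 4; lane++)
            C[(i + r) * n + j + lane] = vC[r][lane];
}
```

The design choice is register reuse: the tile trades more live accumulators for fewer loads from B per floating-point operation.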

The “tiled” cache blocking SIMD approach performs slightly worse than the transposed SIMD approach.

C               0.768672 seconds
registers:      0.277462 seconds
simd,naive:     0.416584 seconds
simd,transpose: 0.275622 seconds
simd,tiled:     0.356344 seconds

5 DGEMM 10: GCC Optimization

Finally, let’s compile using different gcc optimization flags. In Table 1, we can see that even with mild gcc optimizations like -O1, SIMD vastly outperforms any SISD approach.

Table 1: DGEMM SIMD runtime (in seconds) with different optimization flags.

Program                   -O0        -O1        -O2        -O3
DGEMM (original)          0.768672   0.197168   0.197538   0.193889
DGEMM with registers      0.277462   0.191474   0.191921   0.126439
DGEMM, naive SIMD         0.416584   0.048405   0.049970   0.049045
DGEMM, transpose SIMD     0.275622   0.031824   0.032188   0.031978
DGEMM, tiled SIMD         0.356344   0.033386   0.035177   0.037030

Generally speaking, most of the speedup comes not from performing multiple arithmetic operations at a time, but from performing one wide memory load/store at a time.

6 DGEMM SIMD Code

The three algorithms discussed in this section leverage Intel SIMD extensions. You will see in the code below that the SIMD instructions are written in C as Intel intrinsics. More on these next!

DGEMM, SIMD naive
DGEMM, SIMD Transpose
DGEMM, SIMD Tiled
DGEMM (original)
AVX intrinsics
matrix_d_t *dgemm_simd(matrix_d_t *A, matrix_d_t *B) {
  if (A->ncols != B->nrows) return NULL;
  matrix_d_t *C = init_mat_d(A->nrows, B->ncols);
  for (int i = 0; i < A->nrows; i++) {
    for (int j = 0; j < B->ncols; j += 4) { // 4 doubles at a time
      avx256_t v_C = avx_load(C->data + i*C->ncols + j);
      for (int k = 0; k < A->ncols; k++) {
        avx256_t s_A = avx_set_num(A->data[i*A->ncols + k]); // broadcast A_ik
        avx256_t v_B = avx_load(B->data + k*B->ncols + j);   // 4 doubles of row k of B
        // C_ij += A_ik * B_kj (for columns j...j+3)
        v_C = avx_mul_add(s_A, v_B, v_C);
      }
      avx_store(C->data + i*C->ncols + j, v_C);
    }
  }
  return C;
}
Footnotes
  1. In Figure 1, we assume a cache that has 256-bit blocks. This carries over the cache assumption from our cache blocking discussion.