
1 Learning Outcomes

We have discussed earlier that Intel SIMD instruction set architectures (ISAs) are extensions to the base Intel x86/x87 architecture. ISAs specify assembly instructions. In this section, we discuss the assembly instruction format at a high level, then focus on how to call these assembly instructions from our high-level C programs using Intel intrinsics.

2 Intel ISA, continued

SIMD instructions act as extensions to a base instruction set, with different systems supporting different SIMD instructions.

2.1 Intel SIMD Registers

The “wide” registers that Intel SIMD architectures use are separate from the general-purpose and floating-point registers used in x86/x87. For example, the SSE2 extension has 128-bit registers. As shown in Figure 1, these 128-bit-wide registers can be interpreted as values packed in different ways: two 64-bit words, four 32-bit words, and so on.


Figure 1: Intel SSE/AVX-128 128-bit-wide registers and AVX 256-bit-wide registers pack different numbers of data types. On Intel architectures, words are 16 bits, so a single-precision floating point value is a doubleword (32-bit) and a double-precision floating point value is a quadword (64-bit).

As a side note, registers from legacy extensions operate on the lower bits of modern extensions, as shown in Figure 2.


Figure 2: AVX 256-bit-wide YMM registers. Legacy SSE instructions (which use the XMM registers) can still be used to operate on the lower 128 bits of the YMM registers.

2.2 Intel SIMD Assembly Instructions

Assembly instructions in a SIMD extension operate on that extension's registers; for example, SSE instructions operate on the XMM registers.

We will not discuss Intel assembly instructions too much in this course. Instead, because we leverage a compiler like gcc to translate C into assembly, we will directly write such instructions into our high-level C programs using Intel intrinsics.

2.3 Final comments

RISC-V doesn’t have a widely deployed standard vector extension, so we’re using x86’s SIMD extensions on the course hive machines. In practice, this doesn’t matter too much, since the arithmetic works similarly to RISC-V.


3 Intel Intrinsics

Intel intrinsics are C functions and procedures that provide access to SIMD assembly instructions. With intrinsics, we can use these assembly instructions without writing assembly directly. There is a one-to-one correspondence between a given Intel intrinsic and an Intel SIMD extension assembly instruction (e.g., SSE, AVX).

3.1 Variable Declaration

C has typed variables (in contrast to assembly, which only has hardware registers that store bits). To use Intel intrinsics, we must declare registers as C variables of a specific Intel intrinsic variable type.


Figure 3: In the Intel AVX family of extensions, registers are 256 bits wide. The corresponding Intel intrinsic declaration __m256d reg; indicates that reg is a 256-bit-wide AVX register that packs four double-precision floating point values.

Table 1: Intel Intrinsic Data Types.

| Type | Intel Intrinsic Family | Description |
|------|------------------------|-------------|
| `__m256` | AVX | 256-bit register for storing floats |
| `__m256d` | AVX | 256-bit register for storing doubles |
| `__m256i` | AVX | 256-bit register for storing integers |
| `__m128`, `__m128d`, `__m128i` | SSE | 128-bit registers |

Once declared, we can use the Intel intrinsic name similarly to C variables. Importantly, the name is associated with available registers, so we can’t just initialize, say, an array of `__m256d`s.

3.2 Procedures

Once we have declared Intel intrinsic variables, we can call Intel intrinsics. While these look like C functions and procedures, each call maps directly to an assembly instruction for the SIMD hardware.

Table 2: SSE example: Intel intrinsics mapped to assembly instructions for the SSE SIMD extension. SSE has Intel intrinsic data type `__m128` (128-bit-wide register).

| Intel Intrinsic | Assembly Instruction | Comments |
|-----------------|----------------------|----------|
| `_mm_load_ps` | `movaps` | Load operation. Aligned, packed single-precision floats |
| `_mm_store_ps` | `movaps` | Store operation. Aligned, packed single-precision floats |
| `_mm_add_pd` | `addpd` | Add packed doubles |
| `_mm_mul_pd` | `mulpd` | Multiply packed doubles |

4 Understanding Intel Intrinsics Format

Luckily, most Intel intrinsic procedures and data types are formatted similarly. Figure 4 provides a guide.


Figure 4: Intel Instructions and Formats.

5 Example

The pseudocode below sums an eight-element integer array four elements at a time:

void simd_example_pseudo() {
    int arr[8] = {3, 1, 4, 1, 5, 9, 2, 6};
    // 1. Initialize the vector sum to zero.
    sse128_t sum_sse = sse_set_zero();
    // sum_sse: {0, 0, 0, 0}

    // 2. Load the first four elements and add them to the sum.
    sse128_t tmp = sse_load(arr);
    sum_sse = sse_add(sum_sse, tmp);
    // sum_sse: {3, 1, 4, 1}

    // 3. Load the next four elements and add them to the sum.
    tmp = sse_load(arr + 4);
    // tmp: {5, 9, 2, 6}
    sum_sse = sse_add(sum_sse, tmp);
    // sum_sse: {3 + 5, 1 + 9, 4 + 2, 1 + 6}
    // sum_sse: {8, 10, 6, 7}

    // 4. Store the vector sum and reduce it to a scalar.
    int tmp_arr[4];
    sse_store(tmp_arr, sum_sse);
    int sum = tmp_arr[0] + tmp_arr[1] + tmp_arr[2] + tmp_arr[3];
    printf("sum: %d\n", sum);
}

Let’s rewrite this pseudocode with Intel intrinsics. Can you explain how the code below implements the vector pseudocode from earlier?

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdio.h>

int simd_example_sse() {
    int arr[8] = {3, 1, 4, 1, 5, 9, 2, 6};

    // 1. Initialize the vector sum to zero.
    __m128i sum_sse = _mm_setzero_si128();
    // sum_sse: {0, 0, 0, 0}

    // 2. Load the first four elements and add them to the sum.
    __m128i tmp = _mm_loadu_si128((__m128i *) arr);
    sum_sse = _mm_add_epi32(sum_sse, tmp);
    // sum_sse: {3, 1, 4, 1}

    // 3. Load the next four elements and add them to the sum.
    tmp = _mm_loadu_si128((__m128i *) (arr + 4));
    sum_sse = _mm_add_epi32(sum_sse, tmp);
    // sum_sse: {3 + 5, 1 + 9, 4 + 2, 1 + 6}

    // 4. Store the vector sum and reduce it to a scalar.
    int tmp_arr[4];
    _mm_storeu_si128((__m128i *) tmp_arr, sum_sse);
    int sum = tmp_arr[0] + tmp_arr[1] + tmp_arr[2] + tmp_arr[3];

    printf("sum: %d\n", sum);
    return sum;
}

Table 3 describes some more common operations. Note that we use vector pseudocode here; this is consistent with many final exam examples.

Table 3: Example SIMD Operations with Intel Intrinsics.

| SIMD Pseudocode | Intel Intrinsic (SSE or AVX) | Description |
|-----------------|------------------------------|-------------|
| `vector vec_load(int32_t *A);` | `__m128i _mm_loadu_si128(__m128i *p)` | Loads four integers at memory address A into a vector. |
| `void vec_store(int32_t *dst, vector src);` | `void _mm_storeu_si128(__m128i *p, __m128i a)` | Stores src to dst. |
| `vector vec_setnum(int32_t num);` | `__m128i _mm_set1_epi32(int a)` | Creates a vector where every element is equal to num. |
| `vec_setnum(0);` | `__m128i _mm_setzero_si128()` | Creates a vector with all elements set to zero. |
| `vec_setnum(*p);` | `__m256d _mm256_broadcast_sd(double const *p)` | Creates a vector with all elements set to a double loaded from memory. |
| `vector vec_add(vector A, vector B);` | `__m128i _mm_add_epi32(__m128i a, __m128i b)` | Returns the result of adding A and B element-wise. |

6 Common mistakes with SIMD instructions

Footnotes
  1. Again, vector architectures are not SIMD architectures, even though SIMD architectures implement some vector operations. One big difference is that SIMD operations are limited to consecutive memory accesses, whereas vector architecture operations are not.

  2. The p in epi32 and ps stands for “packed.” The e in epi32 likely stands for “extended” (e.g., from MMX to SSE), and the s in si64 likely stands for “scalar” (see Stack Overflow).