
1 Learning Outcomes

It is very tempting to jump straight into parallelization, but we first discuss a few simple optimizations to sequential programs[1] that can make them plenty fast.

2 Cost of Operations

As a general rule of thumb, the runtime cost of operations in sequential programs can be broken down into several categories:

  1. I/O operations: Extremely slow. File operations from disk take a long time! prints and other I/O operations are also expensive. As a general rule when measuring program execution time, avoid printing lots of data, and avoid repeatedly reading the same files.

  2. Memory operations: Take about 3-100 clock cycles. RAM is faster than disk, so this category is about 100x faster than file operations. The base speed for memory operations in an architecture can be improved through caching. We discuss programming techniques below.

  3. Branches and jumps: Slower than our last category, arithmetic operations, due to control hazards among other reasons. We discuss programming techniques below.

  4. Arithmetic operations are the fastest operations and take 1 cycle ideally (though some operations, particularly on CISC architectures, take longer).

3 The register keyword

We can save commonly used values in registers. The register keyword can be used to request that a C variable be stored in a register instead of on the stack.

In practice, register is typically used for loop counters and other frequently accessed variables. Toggle between the two code snippets below. Note that the register keyword must appear on the variable declarations, which sit outside the loop.

Register keyword
register int limit = max/4;
register int i;
for (i = 0; i < limit; i++) {
    f(i);
}

We note that the register keyword is like a “suggestion,” meaning that compilers can choose to ignore this keyword.[2] Nevertheless, this optimization can be especially valuable on embedded systems or even x86 architectures, where register files are small.[3] Since different CPUs can have different register files, this optimization is dependent on the architecture.

gcc often places variables in registers implicitly when optimization flags are specified, so results from this optimization can be inconsistent. More on this later.

4 Loop Unrolling

How might we reduce the number of jumps and branches in a program? In our matrix multiplication example, we can try loop unrolling, which increases the amount of work done per iteration of a loop in order to reduce the total number of branches.

Loop Unrolling
int i = 0;
for (; i < max/4*4; i+=4) {
    arr[i] = f(i);
    arr[i+1] = f(i+1);
    arr[i+2] = f(i+2);
    arr[i+3] = f(i+3);
}

for (; i < max; i++) {
    arr[i] = f(i);
}

Loop unrolling is already done by some compilers (including gcc at optimization level -O3), so the speedup may be minimal and hard to predict. Because there are more assembly instructions, the resulting executable can also be larger; if a loop body grows too large, branch targets can exceed the range of branch immediates, incurring a runtime penalty.

Ultimately, loop unrolling makes code much harder to read and modify, so it is best left as a final step (e.g., done by the compiler).

4.1 Function Inlining

Another related optimization is function inlining, where we replace a function call with the body of that function. Function calls require some stack setup (e.g., to follow the calling convention), so inlining function calls can be extremely useful.

We can use the inline keyword to explicitly request function inlining from the compiler (though, as before, compilers can choose to ignore this keyword). This can be useful if a function is declared for the purposes of abstraction but is called in only a few places. Duplicating the instruction code at the assembly level will increase the number of assembly instructions but will not complicate the higher-level code.

5 Cache blocking

Recall from our discussion of caches the concepts of spatial and temporal locality: accessing adjacent or recently used memory is on average faster than accessing nonadjacent memory. In general, we would like to minimize cache misses and perform as much computation as possible on spatially and temporally local data before cache blocks are evicted back to main memory.

To do so, we leverage the idea of cache blocking. Cache blocking is a program optimization that is critical for the programmer to do, because compilers such as gcc do not know the intentions of our program, just the instructions themselves.

6 gcc Optimization Flags

Thus far, we have been avoiding any compiler optimizations by specifying the -O0 option to gcc:

gcc -g3 -std=c11 -Wall -O0 matmul.c run_matmul.c -o run_matmul

We can specify different optimization levels for gcc; the full list is described in the GNU gcc docs.

We encourage you to look through the linked GNU gcc docs. We’ve sampled a few interesting options below.

There is a lovely debugging option called -Og that you should use where possible.
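For example, one might build the same sources at different levels and compare (file names follow the command above; timings will vary by machine):

```shell
# Debugging-friendly build: -Og optimizes without hurting debuggability
gcc -g3 -std=c11 -Wall -Og matmul.c run_matmul.c -o run_matmul

# Aggressive build: -O3 enables loop unrolling, inlining, vectorization, etc.
gcc -g3 -std=c11 -Wall -O3 matmul.c run_matmul.c -o run_matmul

# Time the result; wall-clock numbers depend on the machine
time ./run_matmul
```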

7 Premature Optimization

The optimizations discussed in this section are useful but do not always lead to good code.

The Turing award winner Don Knuth has a great quote on the pitfalls (and benefits!) of efficient, good programming:[4]

The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. ...

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say, about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look at the critical code; but only after that code has been identified.

We therefore encourage the following best practices:

Footnotes
  1. We differentiate these optimizations, which can be done for programs that run on a single-core processor, from parallel optimizations that require special hardware. More later.

  2. In fact, the register keyword is actually deprecated in C++. Read more on Wikipedia.

  3. There are 16 register names in x86-64. In x86, most variables get stored on the stack (unlike RISC-V, which has 32 registers).

  4. Donald E. Knuth. 1974. Structured Programming with go to Statements. ACM Comput. Surv. 6, 4 (Dec. 1974), 261–301. https://doi.org/10.1145/356635.356640
