
1 Learning Outcomes

It is very tempting to jump straight into parallelization, but we first discuss a few simple optimizations to sequential programs[1] that can make them plenty fast.

2 Cost of Operations

As a general rule of thumb, the runtime cost of operations in sequential programs can be broken down into several categories:

  1. I/O operations: Extremely slow. File operations from disk take a long time! prints and other I/O operations are also expensive. As a general rule when measuring program execution time, avoid printing lots of data, and avoid repeatedly reading the same files.

  2. Memory operations: Take about 3-100 clock cycles. RAM is faster than disk, so this category is about 100x faster than file operations. The base speed for memory operations in an architecture can be improved through caching. We discuss programming techniques below.

  3. Branches and jumps: Slower than our last category, arithmetic operations, due to control hazards among other reasons. We discuss programming techniques below.

  4. Arithmetic operations are the fastest operations and take 1 cycle ideally (though some operations, particularly on CISC architectures, take longer).

3 The register keyword

We can save commonly used values in registers. The register keyword can be used to request that a C variable be stored in a register instead of on the stack.

In practice, register is typically used for loop counters and other frequently accessed variables. Toggle between the two code snippets below. Note that the register keyword must appear on the variable declarations, which sit outside the loop.

Register keyword
register int limit = max/4;
register int i;
for (i = 0; i < limit; i++) {
    f(i);
}

We note that the register keyword is like a “suggestion,” meaning that compilers can choose to ignore this keyword.[2] Nevertheless, this optimization can be especially valuable on embedded systems or even x86 architectures, where register files are small.[3] Since different CPUs can have different register files, this optimization is dependent on the architecture.

gcc often places variables in registers implicitly when optimization flags are specified, so results from this optimization can be inconsistent. More on this later.

4 Loop Unrolling

How might we reduce the number of jumps and branches in a program? In our matrix multiplication example, we can try loop unrolling, which increases the amount of work done per iteration of a loop in order to reduce the total number of branches.

Loop Unrolling
int i = 0;
for (; i < max/4*4; i+=4) {
    arr[i] = f(i);
    arr[i+1] = f(i+1);
    arr[i+2] = f(i+2);
    arr[i+3] = f(i+3);
}

for (; i < max; i++) {
    arr[i] = f(i);
}

Loop unrolling is already done by some compilers (including gcc at optimization level -O3), so the speedup may be minimal and hard to predict. Because there are more assembly instructions, the resulting executable can also be larger; if a loop body grows too large, branch targets can exceed the range of branch immediates, incurring a runtime penalty.

Ultimately, loop unrolling makes code much harder to read and modify, so it is best left as a final step (e.g., done by the compiler).

4.1 Function Inlining

Another related optimization is function inlining, where we replace a function call with the body of that function. Function calls require some stack setup (e.g., to follow the calling convention), so inlining function calls can be extremely useful.

We can use the inline keyword to explicitly request function inlining from the compiler (though, as before, compilers can choose to ignore this keyword). This can be useful if a function is declared for the purposes of abstraction but is called in only a few places. Duplicating the instruction code at the assembly level will increase the number of assembly instructions but will not complicate the higher-level code.

5 Cache blocking

Recall from our discussion of caches the concepts of spatial and temporal locality: accessing adjacent or recently used memory is on average faster than accessing nonadjacent memory. In general, we would like to minimize cache misses and perform as much computation as possible on spatially and temporally local data before cache blocks are evicted back to main memory.

To do so, we leverage the idea of cache blocking. Cache blocking is a program optimization that is critical for the programmer to do, because compilers such as gcc do not know the intentions of our program, just the instructions themselves.

6 gcc Optimization Flags

Thus far, we have been avoiding any compiler optimizations by specifying the -O0 option to gcc:

gcc -g3 -std=c11 -Wall -O0 matmul.c run_matmul.c -o run_matmul

We can specify different optimization levels for gcc; the full list is described in the GNU gcc docs.

We encourage you to look through the linked GNU gcc docs. We’ve sampled a few interesting options below.

There is a lovely debugging option called -Og that you should use where possible.
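For example, one might build the same sources at different levels and compare (file names follow the command above; timings will vary by machine):

```shell
# Debugging-friendly build: -Og optimizes without hurting debuggability
gcc -g3 -std=c11 -Wall -Og matmul.c run_matmul.c -o run_matmul

# Aggressive build: -O3 enables loop unrolling, inlining, vectorization, etc.
gcc -g3 -std=c11 -Wall -O3 matmul.c run_matmul.c -o run_matmul

# Time the result; wall-clock numbers depend on the machine
time ./run_matmul
```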

7 Premature Optimization

The optimizations discussed in this section are useful but do not always lead to good code.

The Turing award winner Don Knuth has a great quote on the pitfalls (and benefits!) of efficient, good programming:[4]

The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. ...

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say, about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look at the critical code; but only after that code has been identified.

We therefore encourage the following best practices:

Footnotes
  1. We differentiate these optimizations, which can be done for programs that run on a single-core processor, from parallel optimizations that require special hardware. More later.

  2. In fact, the register keyword is actually deprecated in C++. Read more on Wikipedia.

  3. There are 16 register names in x86-64. In x86, most variables get stored on the stack (unlike RISC-V, which has 32 registers).

  4. Donald E. Knuth. 1974. Structured Programming with go to Statements. ACM Comput. Surv. 6, 4 (Dec. 1974), 261–301. https://doi.org/10.1145/356635.356640
