Pipelining Laundry - CS 61C Course Notes

1Learning Outcomes¶

Use laundry as an analogy to understand pipelined architectures.
Compare latency and throughput of sequential and pipelined architectures.
Understand that pipelining improves performance by increasing instruction throughput, in contrast to decreasing the execution time of an individual instruction. Instruction throughput is the important metric because real programs execute billions of instructions.

🎥 Lecture Video

The concept of pipelining will increase our compute throughput. Before we jump into our RISC-V pipeline, let’s first start with a real-life example of laundry. Anyone who has done a lot of laundry has intuitively used pipelining. The four steps of laundry are shown in Figure 1:

"Laundry-room setup diagram mapping washer, dryer, and folding tasks to pipeline-style stages where each stage takes a sequential 30 minutes." — Figure 1:Laundry analogy.

Washer Stage (30 minutes). Place one dirty load of clothes in the washer.
Dryer Stage (30 minutes). When the washer is finished, place the wet load in the dryer.
“Folder” Stage (30 minutes). When the dryer is finished, place the dry load on a table and fold.
“Stasher” Stage (30 minutes). When folding is finished, put clothes in drawers.

Suppose our laundry benchmark task involves four people in a laundromat, or shared laundry room^[1], who all need to do their laundry on a Sunday evening. In Figure 1, each person is denoted by a laundry bag with their number (1 through 4).

2Sequential Laundry¶

Consider a sequential approach to using laundry, where only one person can use the room at a time,^[2] as shown in Figure 2. Person 1 starts at 6pm and finishes all four steps at 8pm; then, Person 2 starts and finishes 8-10pm, followed by Person 3, and finally Person 4, who doesn’t finish until 2am.

"Sequential laundry schedule where one person's load fully completes all stages (washer, dryer, folder, and stasher) before the next person starts." — Figure 2:Timeline for sequential laundry.

This implementation takes 8 hours for just 4 loads of laundry. If we look closely at the four major resources, they are mostly wasted; after all, each person exclusively uses one of the washer, dryer, counter, and drawers as they are progressing through the four steps. The washer, for example, was only used for 30 minutes of the whole two hours.

We can certainly do better by trying to perform these steps concurrently, where each person performs a different step. This is the intuition behind pipelining.

3Pipelined Laundry¶

Now consider a pipelined approach to using laundry,^[3] where as soon as Person 1 is done with the washer and takes out their wet clothes to move to the dryer, Person 2 can put their own dirty clothes in the washer, and so on. In Figure 3, Person 1 still starts at 6pm and finishes at 8pm, but now Person 4 can start much earlier at 7:30pm and finish by 9:30pm. Phew!

"Pipelined laundry schedule with overlapping washer, dryer, and folding work across different customers to optimize time. The next person can start the washer stage (stage 1) once the person ahead of them moves onto the dryer stage (stage 2)." — Figure 3:Timeline for pipelined laundry.

This implementation takes just 3.5 hours for the same 4 loads of laundry. At different 30-minute intervals, customers share the four resources because they are completing different steps of their laundry routine (Figure 4). At 7:30-8pm, all four resources are occupied by customers.

"Two synchronized views of pipelined laundry: on the left, a timeline showing the four customers each starting their washer stage as soon as the person ahead of them starts the dryer stage, and on the right, a focused progression of the laundry stages in order for a given 30 minute increment, showing how each stage in that 30 minutes is used by a different customer." — Figure 4:Two views of pipelined laundry: Over time (left) versus the snapshot of the laundry room during a given time interval (e.g., 7-7:30pm).

4Laundry: Latency vs. Throughput¶

From P&H 4.6:

The pipelining paradox is that the time from placing a single dirty sock in the washer until it is dried, folded, and put way is not shorter for pipelining; the reason pipelining is faster for many loads is that everything is working in parallel, so more loads are finished per hour. Pipelining improves *throughput* of our laundry system. Hence, pipelining would not decrease the time to complete one load of laundry, but when we have many loads of laundry to do, the improvement in throughput decreases the total time to complete the work.

Table 1 shows performance measures on our laundry benchmark task:

Table 1:Latency vs. Throughput: Sequential vs. Parallel Laundry.

Measure	Laundry Analogy	Sequential	Pipelined
Program execution time	Time to finish all 4 loads	8 hours	3.5 hours
Instruction latency	Time to finish a single load	2 hours	2 hours
Instruction throughput	Average* number of loads per 30 mins	0.25	1*

Note (*): Throughput is approximate for our tiny 4-customer task. In this case, 4 loads complete in seven 30-minute intervals, or approximately 0.57 loads per 30-minute interval. If we had many customers, though, the pipeline is fully “filled,” where exactly 1 customer will finish each 30 minutes. We report this steady-state in the Table 1.

5RISC-V Processor: Sequential and Pipelined¶

Let us translate this laundry analogy back our RISC-V processor.

Consider how an instruction like add t0 t1 t2 accesses the hardware in the single-cycle processor. Like when we computed instruction timing, Figure 5 represents the five phases^[4] with its major hardware unit.

"Single-cycle processor phase icons for one add instruction with read and write shading by hardware resource. The phases include IF with IMEM block, ID with Reg block, EX with ALU, MEM with DMEM block, and WB with Reg block." — Figure 5:High-level diagram of single-cycle processor executing the instruction `add t0 t1 t2`. We adopt a graphic representation of each of the five phases, where each symbol represents the major hardware resource accessed in that phase. The shading illustrates if the unit is read (i.e., the right half is shaded for IMEM in `IF` and RegFile in `ID`) or the unit is written (i.e., the left half is shaded for RegFile in `WB`). For R-Type instructions, MEM is transparent because `add` does not access the data memory.

Suppose we use the single-cycle processor to execute three instructions, in sequence.

add t0 t1 t2
or  t3 t4 t5
lw  t6 8(t7)

In our single-cycle CPU, only one instruction can access any resources in one clock cycle, in sequence. As in Figure 6, executing these three instructions will take $3 \cdot t_{cycle}$ time, where $t_{instruction} = t_{cycle}$ .

"Single-cycle processor timeline for three instructions, each instruction occupying an entire cycle and waiting to begin until the previous instruction has completed all stages." — Figure 6:Single-cycle processor usage for three instructions. The instruction time $t_{instruction}$ is equal to the length of one clock cycle, $t_{cycle}$ .

In a pipelined processor, on the other hand, multiple instructions access different resources in a single cycle. Following our laundry analogy, we would like to design a pipelined processor that can produce the following timeline in Figure 7.

"Five-stage pipelined processor timeline for three instructions overlapped across IF through WB stages." — Figure 7:Pipelined RISC-V processor usage for three instructions. In the first cycle, `add` is in the `IF` stage. In the second cycle, `or` is in the `IF` stage, while `add` has moved onto the `ID` stage. In the third cycle, `lw` is in the `IF` stage, while `or` and `add` have moved onto the `ID` and `EX` stages, respectively. Each instruction now takes **five cycles** to execute, but the clock cycle can be timed to the duration of the longest stage.

If we examine our add t0 t1 t2 instruction and how it accesses the pipelined processor, we observe something like Figure 8.

"High-level pipelined processor resource-use diagram for add showing stage-by-stage hardware activity." — Figure 8:High-level diagram of pipelined processor executing the instruction `add t0 t1 t2`. Shading is described in the caption of Figure 5.

Notes:

We have pipeline registers separating each stage, meaning that each pipeline stage takes one clock cycle. We discuss this design in more detail in the next section.
The add instruction now takes five cycles to execute, so $t_{instruction} = 5 \cdot t_{cycle}$ .
The three instructions together now take $7 \cdot t_{cycle}$ time to execute.
Each cycle is much shorter, so the overall program execution time is shorter than the single-cycle processor.

All stages in the pipelined processor take the same length. Let us assume the same phase durations from the single-cycle processor as Table 2 from our instruction timing. In the pipeline, even though the hardware in the ID and WB stages may compute values faster, a singular clock means they will take as long as the other stages.

Table 2:Step timing: a phase in a single-cycle processor vs. a stage in the five-stage pipelined procssor. In the single-cycle processor, all five steps must occur in a single-cycle, so phase length is operation time. In the pipelined processor, each step is timed to the length of a stage, which must be the length of a clock period (because of pipeline registers).

Step	Operation time	Single-cycle phase duration	Pipelined stage duration
Instruction Fetch (`IF`)	200 ps	200 ps	200 ps
Instruction Decode (`ID`)	100 ps	100 ps	200 ps
Execute (`EX`)	200ps	200 ps	200 ps
Memory Access (`MEM`)	200ps	200 ps	200 ps
Write Back (`WB`)	100ps	100 ps	200 ps

Let’s compare this performance in Table 3.

Table 3:Performance comparison: Single-cycle processor vs. 5-stage pipelined processor.

Comparison	Sequential	Pipelined
High-level diagram	INSERT	INSERT
Step timing	(see Table 2) 200 ps (`IF`, `EX`, `MEM`); 100 ps (`ID`, `WB`)	200 ps (all stages same length)
Clock period, $t_{cycle}$	200 + 100 + 200 + 200 + 100 = 800 ps	200 ps
Instruction latency $t_{instruction}$	$t_{cycle}$ = 800 ps	$5 \cdot t_{cycle}$ = 1000 ps
Clock frequency	1/800 ps = 1.25 GHz	1/200 ps = 5 GHz

5.1Pipeline speedup¶

Based on Table 3, our throughput gain, or pipelining speedup, is close to the ratio of times between instructions. One approachis to time the single-cycle and pipelined processors against a benchmark program like 1 million lw instructions^[5]:

The single-cycle processor takes $10^{6} \cdot t_{cycle}$ = 800,000,000 ps.
The pipelined processor takes $10^{6} \cdot t_cycle + 4 \cdot t_cycle$ = 200,000,800 ps, where there are four cycles to accommodate the “startup” time of the pipeline (where not all stages of pipeline are used).

If we take the ratio of total execution times:

\dfrac{800,000,000 \text{ps}}{200,000,800 \text{ps}} \approx \dfrac{800 \text{ps}}{200 \text{ps}} = 4

5.2Waterfall diagrams¶

Instead of repeatedly drawing the same tiny diagram as in Figure 7, we will use Table 4 to illustrate instruction execution in pipelined processors. This waterfall diagram should be read left-to-right: the leftmost column is the RISC-V instruction; all other columns are indexed by their time step (clock cycle, starting from 1). For each row, an entry in a time step column indicates the stage that the instruction is currently executing, or blank otherwise. For simplicity, we write MEM as M.

Table 4:Waterfall diagram of Figure 7.

Instruction	1	2	3	4	5	6
`add s0 t0 t1`	IF	ID	EX	M	WB
`or t3 t4 t5`		IF	ID	EX	M	WB
`lw t6 8(t7)`			IF	ID	EX	M	WB

Footnotes¶

The shared laundry room has a single washer, a single dryer, a single folding counter, and infinite but single set of drawers mysteriously located in the room itself.
↩
Imagine there is a single key for the shared laundry room. At the beginning of the washing stage, you instantaneously obtain the key, unlock the room and enter, and relock the room so no one else can enter. Then right at the end of the folding stage, you instantaneously unlock the room, exit and relock, and put back the key.
↩
Assume the door is unlocked now. The laundromat owner figured it out.
↩
Steps? Phases? Stages? See this note from when we discussed the five steps to a RISC-V instruction.
↩
We assume no load data hazards in this program.
↩