1 Learning Outcomes
Define a data race.
Specify critical sections in OpenMP to synchronize threads.
🎥 Lecture Video
As discussed earlier, threads operate under a shared memory model, where different threads can read from and write to the same locations in memory.
Consider the program below. What are the possible values of x after running it with four OpenMP threads?
```c
#include <stdio.h>
#include <omp.h>

int main() {
    int x = 0; /* shared variable */
    #pragma omp parallel
    {
        x += 1;
    }
}
```

2 Data Race
Two memory accesses form a data race if:
They are from different threads to the same location;
At least one is a write; and
They occur one after another.
In our multi-thread execution model, instructions from different threads have their execution interleaved in time, which makes data races possible.[1]
Let us translate the parallel section of our example code into the three RISC-V instructions below. With four OpenMP threads, there are four sets of these three instructions to execute. Each thread accesses the same variable x in (shared) memory but has its own copy of register t0 used in arithmetic, resulting in a data race.
```asm
lw   t0, 0(sp)   # load x (stored at sp) into t0
addi t0, t0, 1   # increment the thread's private copy
sw   t0, 0(sp)   # store t0 back to x
```

Because there are many possible interleavings of these twelve instructions, the final value of x is not always the same.[2] Consider the cases below.
Case 1: All threads run sequentially.
| Thread | Instruction | x memory access |
|---|---|---|
| 1 | load | read x: 0 |
| 1 | store | write x: 1 |
| 2 | load | read x: 1 |
| 2 | store | write x: 2 |
| 3 | load | read x: 2 |
| 3 | store | write x: 3 |
| 4 | load | read x: 3 |
| 4 | store | write x: 4 |
Final value of x: 4
Case 2: Threads “perfectly” interleaved.
| Thread | Instruction | x memory access |
|---|---|---|
| 1 | load | read x: 0 |
| 2 | load | read x: 0 |
| 3 | load | read x: 0 |
| 4 | load | read x: 0 |
| 1 | store | write x: 1 |
| 2 | store | write x: 1 |
| 3 | store | write x: 1 |
| 4 | store | write x: 1 |
Final value of x: 1
Case 3: Thread 1’s store is last.
| Thread | Instruction | x memory access |
|---|---|---|
| 1 | load | read x: 0 |
| 2 | load | read x: 0 |
| 2 | store | write x: 1 |
| 3 | load | read x: 1 |
| 3 | store | write x: 2 |
| 4 | load | read x: 2 |
| 4 | store | write x: 3 |
| 1 | store | write x: 1 |
Final value of x: 1
3 Critical Sections in OpenMP
To enforce multithreaded program correctness, we often need to synchronize threads, i.e., coordinate their execution. Most commonly, we must identify when one thread is finished writing so that it is safe for another to read.
Synchronization can be specified in user-level routines, i.e., in higher-level languages. A critical section is a segment of code that must be executed by a single thread at a time, thereby enforcing synchronization. Once a thread enters a critical section, it can safely execute all code in that critical section, knowing that it is the only thread that can execute that section at that time.
We discuss two OpenMP synchronization constructs:
#pragma omp critical: Creates a critical section within a parallel code segment. OpenMP docs
#pragma omp barrier: Forces all threads to wait until all threads have hit the barrier. OpenMP docs
Returning to our example code, we can specify a critical section and prevent any data races:
```c
#include <stdio.h>
#include <omp.h>

int main() {
    int x = 0; /* shared variable */
    #pragma omp parallel
    {
        #pragma omp critical
        {
            x += 1;
        }
    }
}
```

4 Data Race Examples
4.1 OpenMP Hello World: Add
```c
 1  #include <stdio.h>
 2  #include <omp.h>
 3  int main() {
 4      int x = 0; /* shared variable */
 5      #pragma omp parallel
 6      {
 7          int tid = omp_get_thread_num(); /* private variable */
 8          #pragma omp critical
 9          {
10              x++;
11          }
12          printf("Hello World from thread = %d, x = %d\n", tid, x);
13
14          #pragma omp barrier
15          if (tid == 0) {
16              printf("Number of threads = %d\n", omp_get_num_threads());
17          }
18      }
19  }
```

Will this program always produce the same output?
Show Answer
No; it varies. One output run:
```
Hello World from thread = 11, x = 10
Hello World from thread = 1, x = 1
Hello World from thread = 6, x = 11
Hello World from thread = 7, x = 7
Hello World from thread = 9, x = 8
Hello World from thread = 3, x = 9
Hello World from thread = 4, x = 5
Hello World from thread = 10, x = 3
Hello World from thread = 8, x = 2
Hello World from thread = 0, x = 12
Number of threads = 12
Hello World from thread = 2, x = 6
Hello World from thread = 5, x = 4
```
Final value of x: 12

Because Line 10 is in a critical section, x’s value is updated correctly. However, Line 12’s printf is not part of the critical section, and a thread with a larger value of x can print before a thread with a smaller value.
Show Answer
First, review the definition of a data race.
Consider all of the memory accesses in the innermost loop (i.e., for all values of j and k), assuming that two different threads run the i = 0 and i = 1 iterations, respectively. Below, we explain that none of these memory accesses will cause races:
The variable sum is declared within the parallel section; it is not shared across threads.
Read element A[i][k]: No data race, because neither thread’s access is a write.
Read element B[k][j]: No data race, because neither thread’s access is a write.
Write element C[i][j]: No data race. While both threads’ accesses are writes, the elements C[0][j] and C[1][j] are different, so the writes are to different locations.[3]
We present the above argument without loss of generality, meaning that it applies to all possible work-sharing configurations in the DGEMM OpenMP code.
A data race is not a data hazard. While data hazards result from instruction-level parallelism on a pipelined processor architecture, data races can occur even on single-cycle processors. Data races result from the undetermined order of instruction execution between threads, i.e., which thread accesses memory first.
Formally, a multithreaded program is only considered correct if ANY interleaving of threads yields the same result. Our example code is an incorrect program. For those curious, there are 2520 different possible orders of the load and store instruction pairs (or 105 orders if we consider that all threads are identical).
If caches are in the memory hierarchy, it is likely that the two threads will each have their own copy of the same block of C in their respective caches. So while there is no data race, these copies will be “out of sync.” Read more about cache coherency in a later section.