
1 Learning Outcomes

How do we design a cache? From earlier:

In this section, we introduce the fully associative cache and use it as a means to discuss a detailed example and tradeoffs between replacement policies and write policies.

2 Placement policy

Associativity refers to the set of possible cache entries that a particular block of data can be associated with.

3 Identification

How do we determine a cache hit on a memory address? In other words, how do we know if the data at a specific memory address can be accessed from a block in the cache? From Computer Organization: A Quantitative Approach Appendix B.1:

Caches have an address tag on each block frame that gives the block address. The tag of every cache block that might contain the desired information is checked to see if it matches the block address from the processor. As a rule, all possible tags are searched in parallel because speed is critical.

There is a lot in this paragraph.[1] We first explore the relationship between a memory address and the tag of a cache entry. We then explain how we determine cache hits.

3.1 Tag and Offset

We would like to connect the blocks in the cache shown in Figure 1, which is a 16B fully associative cache with 4B blocks, to the 12-bit memory address in Figure 2. We do so by splitting the address into two portions: tag and offset.

"TODO"

Figure 1: Cache tag and offset in a 16B fully associative cache for 12-bit memory addresses. The bytes in the first entry’s block share the same upper 10 bits of their memory addresses: 0b0100001111, or 0x10F, which is the tag. The address of the most significant byte in the first block is therefore 0x43F.

"TODO"

Figure 2: For a fully associative cache, the memory address is split into two fields: the tag and the offset. If blocks are 4 bytes, then a 12-bit memory address is split into a 10-bit tag and a 2-bit offset.

What, then, is a tag? Recall that all of the bytes in each block of data come from the same area of memory, so their addresses share a common set of upper bits. In fully associative caches like in Figure 1, all of these upper bits are placed into the tag associated with each block.

What about the offset? Memory is byte addressable, so each of the bytes in a given block will have different memory addresses. The memory addresses of bytes in a given block will not vary in the upper bits (the tag) but rather in the lowest bits. The offset is the portion of the address needed to describe this variation.
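As a small sketch (in Python, with hypothetical helper names), the tag/offset split in Figure 2 is just a shift and a mask:

```python
# Split a 12-bit address into tag and offset for a fully
# associative cache with 4-byte blocks (2 offset bits).
OFFSET_BITS = 2  # log2(4 bytes per block)

def split_address(addr):
    tag = addr >> OFFSET_BITS                  # upper 10 bits
    offset = addr & ((1 << OFFSET_BITS) - 1)   # lower 2 bits
    return tag, offset

# Address 0x43F from Figure 1: tag 0x10F, offset 0b11.
tag, offset = split_address(0x43F)
print(hex(tag), offset)  # 0x10f 3
```

Note that all four bytes of that block (0x43C through 0x43F) produce the same tag; only the offset differs.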

3.2 Valid bit

There must be a way to know that a cache block does not have valid information. For example, when starting up a program, the cache necessarily does not have valid information for the program. The most common procedure is to add an indicator (i.e., flag) to the tag to tell if each entry in the cache is valid for this particular program.

The valid bit indicates if the tag for the block is valid. If the valid bit is set (1), the tag refers to a valid memory address. If it is not set (0), we should not match to this tag (even if the tag bits match by chance).

"TODO"

Figure 3: A cold snapshot of the fully associative cache in Figure 1, where valid bits for all blocks are unset (i.e., set to 0). We illustrate the valid bit in our tabular visualization as an additional column of metadata.[2]

4 Walkthrough: Warming up the Fully Associative Cache

The following animation traces through five memory accesses to a 12-bit address space on our 16B fully associative cache with 4B blocks. Assume the cache starts out cold, like in Figure 3.

Figure 4: Warming up a fully associative cache.

To keep things simple for now, if we encounter a cache miss, we load the new block from memory into an invalid cache entry. We discuss block replacement policies in the next section.
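The warm-up procedure can be sketched as follows. The entry layout and the access sequence here are hypothetical (not the sequence from Figure 4); the point is that a lookup checks tags only for entries whose valid bit is set, and a miss fills the first invalid entry.

```python
# Sketch: warming up a cold fully associative cache with
# 4 entries, 4-byte blocks, and 12-bit addresses.
OFFSET_BITS = 2
NUM_ENTRIES = 4

cache = [{"valid": False, "tag": None} for _ in range(NUM_ENTRIES)]

def access(addr):
    tag = addr >> OFFSET_BITS
    # Check each entry's tag, but only if its valid bit is set.
    for entry in cache:
        if entry["valid"] and entry["tag"] == tag:
            return "hit"
    # Miss: load the new block into the first invalid entry.
    for entry in cache:
        if not entry["valid"]:
            entry["valid"] = True
            entry["tag"] = tag
            return "miss"
    return "miss (cache full)"

# Hypothetical access sequence:
for addr in [0x43C, 0x43F, 0x200, 0x43D]:
    print(hex(addr), access(addr))  # miss, hit, miss, hit
```

The second and fourth accesses hit because they fall in a block already loaded by an earlier miss, illustrating spatial locality within a block.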

Of these five memory accesses:

5 Replacement Policy

After the previous five memory accesses, our fully associative cache is at capacity (i.e., “full”, because all cache entries are valid), as shown in Figure 5.

"TODO"

Figure 5: After the five memory accesses described above, our small fully associative cache is full.

A replacement policy defines how the cache controller determines which block is replaced on a cache miss. For fully associative caches, there are several options for replacement policies.

A natural replacement policy is called least recently used, or LRU for short. From Computer Organization: A Quantitative Approach Appendix B.1:

To reduce the chance of throwing out information that will be needed soon, access to blocks are recorded. Relying on the past to predict the future, the block replaced is the one that has been unused for the longest time. LRU relies on a corollary of locality: If recently used blocks are likely to be used again, then a good candidate for disposal is the least recently used block.

Figure 6: Fully associative cache with least recently used (LRU) replacement policy.

Common replacement policies:[4]

  1. Least recently used (LRU): Select the least recently used block for replacement.

  2. Random: Select a block randomly for replacement.

  3. First in, first out (FIFO): Select the oldest block for replacement (even if the oldest block has been most recently used). This behaves like a queue (using terminology from Data Structures).
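One possible sketch of an LRU policy follows. The names and structure are illustrative (real hardware uses counters, not an ordered map): we model the cache as an ordered map from tag to block data, where the least recently used tag always sits at the front.

```python
from collections import OrderedDict

# Illustrative LRU replacement for a 4-entry fully associative
# cache; the front of the ordered map is the LRU candidate.
CAPACITY = 4
OFFSET_BITS = 2
lru_cache = OrderedDict()  # tag -> block data

def access(addr):
    tag = addr >> OFFSET_BITS
    if tag in lru_cache:
        lru_cache.move_to_end(tag)     # mark as most recently used
        return "hit"
    if len(lru_cache) == CAPACITY:
        lru_cache.popitem(last=False)  # evict the LRU tag at the front
    lru_cache[tag] = "block data"
    return "miss"

# Fill the cache, touch the first tag again, then force an eviction:
for addr in [0x000, 0x010, 0x020, 0x030]:
    access(addr)   # four misses warm the cache
access(0x000)      # hit; 0x000's tag becomes most recently used
access(0x040)      # miss; evicts the tag of 0x010, not 0x000
```

Because the recently touched tag was moved to the back, the eviction falls on the block that has gone unused the longest, exactly the behavior the quote above describes.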

While the implementation of replacement policies is out of the scope of this course, we note the following:

Table 1: Comparison of cache replacement policies.

| Feature | LRU | FIFO | Random |
| --- | --- | --- | --- |
| Description | Replace the entry that has not been used for the longest time | Replace the oldest block in the set (queue) | Replace a random block |
| Implementation | Bit counters for each cache block; see quick check | FIFO queue or similar approximation[5] | Flip a figurative hardware coin |
| Advantage | Temporal locality | Reasonable approximation to LRU | Easy to implement |
| Disadvantage | Complicated hardware to keep track of access history | Terrible if workload leverages high temporal locality | |

6 Write Policy

So far, we have only focused on memory reads with load instructions. But what about store instructions, which write to data in memory?

Recall that cache blocks are copies of data in lower levels:

With a cache, we need to ensure that our main memory will (eventually) be in sync with our cache if we modify data. There are two basic options when writing to the cache:

  1. Write-through: The information is written to both the block in the cache and to the corresponding location in the lower-level main memory.

  2. Write-back: The information is written only to the cache block. The modified cache block is written to the lower-level main memory only when it is replaced.

To implement write-back efficiently, note that we only need to write back modified blocks to main memory on replacement; unmodified blocks can simply be discarded, which reduces the miss penalty. To track which blocks have been modified, we add a dirty bit to each cache entry.
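A write-back policy with a dirty bit can be sketched as follows (the entry layout and function names are illustrative, not from the text): stores only mark the block dirty, and memory sees the data once, when the dirty block is eventually replaced.

```python
# Sketch of write-back with a dirty bit. `memory` maps a
# block's tag to its data, standing in for main memory.
memory = {}

def store(entry, offset, value):
    entry["data"][offset] = value
    entry["dirty"] = True            # block now differs from memory

def replace(entry, new_tag, new_data):
    if entry["valid"] and entry["dirty"]:
        memory[entry["tag"]] = entry["data"]  # single write-back
    entry.update(valid=True, dirty=False, tag=new_tag, data=new_data)

entry = {"valid": True, "dirty": False, "tag": 0x10F, "data": [0, 0, 0, 0]}
store(entry, 0, 42)   # no memory traffic
store(entry, 1, 7)    # still none: two stores, zero memory writes
replace(entry, 0x080, [0, 0, 0, 0])
# memory now holds the block for tag 0x10F, written back exactly once
```

Under write-through, each of the two stores would have gone to memory immediately; here both are absorbed by the cache and paid for with a single write on eviction.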

Figure 7 animates our fully associative cache with a write-back policy, using the dirty bit.

Figure 7: Write-back, with dirty bit animation.

Table 2: Comparison of cache write policies.

| Feature | Write-through | Write-back |
| --- | --- | --- |
| Description | Write to the cache and memory at the same time | Write to the cache. Only write back to memory when the block is replaced. |
| Implementation | Easy | More complicated |
| “Synchronized” with memory?[6] | Yes | No. Sometimes the cache will have the most current copy of data. |
| Optimization | | Include a dirty bit to only write back modified blocks. |
| Hit time for writes | Each write to a cache block also requires a write to memory. Longer. | Multiple writes within a cache block require only one write to memory (the latter happens only on replacement). Faster. |
| Miss penalty | “Read” misses are fast and never result in a write to memory. Shorter. | A “read” miss has variable time, since a write to memory may also be needed if the block to be replaced is dirty. Longer on average. |
| AMAT | Longer | Shorter |

Which write policy should we use? It depends on the level of memory hierarchy. We discuss this more later with virtual memory.

Final note: Fortunately, reads dominate cache accesses (all instruction accesses are reads; furthermore, most instructions write to registers, not memory). A block can be read from the cache at the same time that the tag is read and compared. Unfortunately, for writes, modifying a block cannot begin until the tag is checked to see if the address is a hit. Nevertheless, processors do not have to wait for writes to complete and can begin executing other instructions during this time (consider the MEM stage of our five-stage pipeline).

7 Walkthrough Summary

In this section, we traced through a cache design for a 12-bit address space with the following features:

"TODO"

Figure 8: Design of the fully associative cache described in this section’s example.

Cache “metadata”:

7.1 Block size and performance

In our tiny cache with 4B-sized blocks, reading a word is equivalent to reading the entire block, but in practice blocks are composed of multiple words (e.g., 16 or 32 words per block).

From P&H 5.3: "The use of a bigger block takes advantage of spatial locality: it decreases the miss rate and improves the efficiency of cache hardware by reducing the amount of tag storage relative to the amount of data storage in the cache. Although a larger block size decreases the miss rate, it could also increase the miss penalty. If the miss penalty increases linearly with block size, larger blocks can easily lead to lower performance."
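This tradeoff can be made concrete with a little average memory access time (AMAT) arithmetic. The numbers below are hypothetical, chosen only to illustrate the point, not taken from the text:

```python
# AMAT = hit time + miss rate * miss penalty (cycles).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# A larger block lowers the miss rate via spatial locality, but
# if the miss penalty grows with block size, AMAT can still worsen:
small_block = amat(hit_time=1, miss_rate=0.10, miss_penalty=20)  # 3.0 cycles
large_block = amat(hit_time=1, miss_rate=0.06, miss_penalty=60)  # 4.6 cycles
```

Here the bigger block cuts the miss rate from 10% to 6%, yet the tripled miss penalty makes the average access slower overall, exactly the failure mode the quote warns about.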

8 Fully Associative: Hardware and Performance

Fully associative caches employ the most flexible placement policy, which is also the most costly to implement in hardware. From P&H 5.4:

To make the search [of a block in a fully associative cache] practical, it is done in parallel with a comparator associated with each entry. These comparators significantly increase the hardware cost, effectively making fully associative placement practical only for caches with small numbers of blocks.

As shown in Figure 9, the hardware is somewhat straightforward: Obtain the tag from the address (i.e., build a wire bundle for the upper bits), then use one comparator per cache entry to compare the cache entry’s tag to the provided tag. OR these comparator results together to determine a cache hit.

"TODO"

Figure 9: A fully associative placement increases hardware cost.
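The parallel comparison in Figure 9 can be mirrored in software as a sketch (entry layout here is illustrative): one comparator output per entry, gated by the valid bit, then OR'd into a single hit signal.

```python
# Software mirror of the hardware in Figure 9: one tag
# comparator per cache entry (conceptually evaluated in
# parallel), with the results OR'd into a hit signal.
def is_hit(cache, query_tag):
    comparators = [entry["valid"] and entry["tag"] == query_tag
                   for entry in cache]
    return any(comparators)  # the OR gate over all comparator outputs

cache = [
    {"valid": True,  "tag": 0x10F},
    {"valid": True,  "tag": 0x080},
    {"valid": False, "tag": 0x10E},  # invalid: comparator output suppressed
    {"valid": True,  "tag": 0x2AA},
]
print(is_hit(cache, 0x10F))  # True
print(is_hit(cache, 0x10E))  # False: tag matches only an invalid entry
```

The cost Figure 9 highlights comes from the fact that every element of `comparators` corresponds to a physical comparator circuit, one per cache entry.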

Fully associative caches are not common in modern processors: because of the additional hardware needed, higher associativity increases not only hardware cost but also power consumption.[7] In the next section, we will see placement policies with lower associativity that reduce the number of comparators needed and simplify each comparator by reducing the width of the cache tag.

Footnotes
  1. Here, “block frame” means the cache entry itself, “block” is the data unit, and “block address” is something that indicates the memory address of the least significant byte of this block. We will more formally define “block address” in the next section.

  2. From Computer Organization: A Quantitative Approach, Appendix B.1: “...add a valid bit to the tag to say whether or not this entry contains a valid address.”

  3. Break ties in some reasonable way, e.g., randomly.

  4. Some of you may be wondering: Why not Most Recently Used (MRU)? Why not last in, first out (LIFO)? While LIFO approximates MRU, both of these policies go entirely against the idea of temporal locality and are consequently a bit silly.

  5. Again, the implementation of a FIFO replacement policy is out of scope. Read Computer Organization: A Quantitative Approach Appendix B.1 for more details.

  6. Also known as data coherency, or cache coherency.

  7. Read more in Computer Organization: A Quantitative Approach, 5th edition, Section 2.1.