當(dāng)前位置：首頁 > 运维知识 > windows >内容正文

windows

深入理解计算机系统——第六章 The Memory Hierarchy

發(fā)布時間：2023/12/29 windows 29 豆豆

生活随笔收集整理的這篇文章主要介紹了深入理解计算机系统——第六章 The Memory Hierarchy 小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

深入理解計算機(jī)系統(tǒng)——第六章 The Memory Hierarchy

6.1 Storage Technologies
- 6.1.1 Random Access Memory
- - Nonvolatile Memory
  - Accessing Main Memory
  - Disk Geometry
  - Connecting I/O Devices
- 6.1.3 Solid State Disks
- 6.1.4 Storage Technology Trends
6.2 Locality
6.3 The Memory Hierarchy
- 6.3.1 Caching in the Memory Hierarchy
6.4 Cache Memories
- 6.4.1 Generic Cache Memory Organization
- 6.4.2 Direct-Mapped Caches
- 6.4.3 Set Associative Caches
- 6.4.4 Fully Associative Caches
- 6.4.5 Issues with Writes
- 6.4.6 Anatomy of a Real Cache Hierarchy
- 6.4.7 Performance Impact of Cache Parameters
- - Impact of Cache Size
  - Impact of Block Size
  - Impact of Associativity
  - Impact of Write Strategy
6.5 Writing Cache-Friendly Code
6.6 Putting It Together: The Impact of Caches on Program Performance
- 6.6.1 The Memory Mountain
- 6.6.2 Rearranging Loops to Increase Spatial Locality
- 6.6.3 Exploiting Locality in Your Programs

資源：

視頻課程
視頻課件1
視頻課件2
請問CPU，內(nèi)核，寄存器，緩存，RAM，ROM的作用和他們之間的聯(lián)系？

前面講匯編語言時，提到將內(nèi)存當(dāng)作一個字節(jié)數(shù)組，可以用地址作為下標(biāo)來訪問該數(shù)組的元素。
但實際上存儲系統(tǒng)（memory system）是一個非常復(fù)雜的設(shè)備層次結(jié)構(gòu)（hierarchy of devices)，它提供一個抽象，將內(nèi)存結(jié)構(gòu)抽象為一個大的線性數(shù)組。

6.1 Storage Technologies

6.1.1 Random Access Memory

特點(diǎn)：

RAM is traditionally packaged as a chip.
Basic storage unit is normally a cell (one bit per cell).
Multiple RAM chips form a memory.

分類，根據(jù)存儲單元（cell) 的實現(xiàn)方式：

SRAM (Static RAM)
DRAM (Dynamic RAM)

兩者區(qū)別見下圖：

DRAM 只需要1個晶體管（transistor) 去存儲1比特（bit)，而 SRAM 更復(fù)雜，需要 4 或者 6個晶體管，因此 SRAM 的成本更高，但訪問速度更快。
DRAM 對干擾很敏感（DRAM memory cell is very sensitive to any disturbance），因此需要刷新（The memory system must periodically refresh every bit of memory by reading it out and then rewriting it. ），做錯誤檢測等。
SRAM 不需要刷新，只要不斷電，能保持穩(wěn)定（Even when a disturbance, such as electrical noise, perturbs the voltages, the circuit will return to the stable value when the disturbance is removed.）。
SRAM 用于那些內(nèi)存容量小但速度快的芯片中，叫高速緩存（cache memory).
DRAM 被廣泛用于主存（main memory)，以及圖形中的幀緩存（frame buffers associated with graphic cards）中。

相同點(diǎn)：
兩者都是易失的（volatile），即斷電將會丟失保存的信息。

Nonvolatile Memory

非易失性存儲器（nonvolatile memory)，即在斷電后也能保存其內(nèi)容，因為歷史的原因，這些存儲器被叫只讀存儲器（read-only memories, ROMs)，但其實有些 ROMs 也能寫數(shù)據(jù)。

類型：

Read-only memory (ROM): programmed during production.
Programmable ROM (PROM) : can be programmed exactly once.
Erasable programmable ROM (EPROM): can be erased and reprogrammed on the order of 1,000 times.
Electrically erasable PROM (EEPROM): electronic erase capability (it does not require a physically separate programming device). An EEPROM can be reprogrammed on the order of $10^5$ times before it wears out.
Flash memory: based on EEPROMs, with partial (block-level) erase capability.

用途：

固件（firmware）中使用
Programs stored in ROM devices are often referred to as firmware. When a computer system is powered up, it runs firmware stored in a ROM.
Some systems provide a small set of primitive input and output functions in firmware—for example, a PC’s BIOS (basic input/output system) routines.
Complicated devices such as graphics cards and disk drive controllers also rely on firmware to translate I/O (input/output) requests from the CPU.
固態(tài)硬盤（Solid state disks, SSD)中使用
系統(tǒng)仍將它當(dāng)作傳統(tǒng)的旋轉(zhuǎn)硬盤（rotating disks），但它更快。

Accessing Main Memory

總線（bus)：a collection of parallel wires that carry address, data, and control signals.

Data flows back and forth between the processor and the DRAM main memory over shared electrical conduits called buses.

Buses are typically shared by multiple devices.

Each transfer of data between the CPU and memory is accomplished with a series of steps called a bus transaction.

示例：

例如執(zhí)行 load 操作，即將主存中的數(shù)據(jù)寫到 CPU 中：

movq A,%rax

操作過程為：
CPU 將地址 A 放到系統(tǒng)總線上，然后通過 I/O bridge 傳輸?shù)酱鎯ζ骺偩€上，主存接收到該信號后讀取地址 A 處的數(shù)據(jù)并寫到存儲器總線上，最后經(jīng)過系統(tǒng)總線傳給 CPU，CPU 讀取總線上的數(shù)據(jù)后復(fù)制到寄存器 %rax 中。

如果執(zhí)行 store 操作則是相反的過程，即將 CPU 中的數(shù)據(jù)寫入到主存，如：

movq %rax,A

操作過程為：
CPU 將地址 A 放到系統(tǒng)總線上，然后通過 I/O bridge 傳輸?shù)酱鎯ζ骺偩€上，主存接收到該信號后讀取地址 A 并等待數(shù)據(jù)到達(dá)；然后 CPU 將寄存器 %rax 的內(nèi)容復(fù)制到系統(tǒng)總線上，最終主存從存儲總線上讀取數(shù)據(jù)并存儲在地址 A 處。

注意：從圖中可以看到寄存器文件（register file）和 算術(shù)邏輯單元 （ALU）很近，因此它們之間的操作很快；但內(nèi)存是離 CPU 相對較遠(yuǎn)的一些芯片組，因此讀寫內(nèi)存的操作相對較慢。

Disk Geometry

這一小節(jié)沒看書，視頻講的很清楚，很好理解。

Notice that manufacturers express disk capacity in units of gigabytes (GB) or terabytes (TB), where 1 GB = $10^9$ bytes and 1 TB = $10^{12}$ bytes. （注意廠商表示磁盤容量不是用二進(jìn)制，而是十進(jìn)制）

這種磁盤訪問速度比 DRAM 慢了接近 250 倍。

現(xiàn)代磁盤控制器將磁盤作為一系列的邏輯塊提供給 CPU，每個塊是扇區(qū)大小的整數(shù)倍，一個扇區(qū)是 512 bytes，最簡單的情況下，一個邏輯塊就是一個扇區(qū)，塊號是一系列增長的數(shù)字，從 0 開始，0， 1，2 ，等。磁盤控制器控制物理扇區(qū)和邏輯塊的映射。

磁盤控制器會將一些柱面保留作為備用柱面，這些區(qū)域沒有被映射為邏輯塊，如果有個柱面的扇區(qū)壞了，磁盤控制器將數(shù)據(jù)復(fù)制到備用的柱面，然后磁盤就能繼續(xù)正常工作。因此磁盤的格式容量（formatted capacity）比實際容量小。

Connecting I/O Devices

Figure 6.11 shows a representative I/O bus structure that connects the CPU, main memory, and I/O devices.

讀取磁盤扇區(qū)的過程：
1、 CPU 通過三個指令初始化磁盤

給磁盤發(fā)送一個讀的命令告訴磁盤執(zhí)行讀操作，同時告訴磁盤當(dāng)讀數(shù)據(jù)結(jié)束后是否給 CPU 發(fā)一個中斷信號
第二條指令告訴磁盤讀取數(shù)據(jù)邏輯塊編號
第三條指令告訴讀取扇區(qū)的內(nèi)容后存放在主存中的地址

2、磁盤讀取數(shù)據(jù)，同時 CPU 繼續(xù)做其他事，磁盤取得總線的控制權(quán)直接復(fù)制數(shù)據(jù)，然后通過 I/O 橋?qū)?shù)據(jù)傳輸給主存，此過程無 CPU 參與，該過程稱為直接存儲訪問 （direct memory access, DMA）。

3、在 DMA 過程結(jié)束后，磁盤控制器給 CPU 發(fā)送一個中斷信號通知 CPU。This causes the CPU to stop what it is currently working on and jump to an operating system routine. The routine records the fact that the I/O has finished and then returns control to the point where the CPU was interrupted. 例如，如果目前有某個程序等著將數(shù)據(jù)寫到內(nèi)存中，那么CPU可執(zhí)行該操作并處理內(nèi)存（之前總線控制交給磁盤）。

這種直接存儲訪問的方式原因是：磁盤讀數(shù)據(jù)過程很慢，CPU 如果等磁盤讀數(shù)據(jù)，則太浪費(fèi)時間。

6.1.3 Solid State Disks

固態(tài)硬盤（SSD）的訪問速度介于旋轉(zhuǎn)磁盤和 DRAM 之間。

對于 CPU 來說，固態(tài)硬盤和旋轉(zhuǎn)磁盤完全相同，它們有相同的接口，但固態(tài)硬盤沒有那些機(jī)械部件，他是完全由閃存構(gòu)建。

見上圖所示，固態(tài)硬盤中有一個固件（firmware)設(shè)備，稱為閃存翻譯層（flash translation layer），充當(dāng)旋轉(zhuǎn)磁盤中磁盤控制器（disk controller）的功能。

固態(tài)硬盤中以頁（page）為單位從閃存中讀寫數(shù)據(jù)，頁的大小根據(jù)技術(shù)的不同，通常在 512 字節(jié) to 4 KB。

一系列的頁組成一塊（block），注意這個塊和前面提到的的邏輯塊不同。

寫數(shù)據(jù)：如果要寫數(shù)據(jù)到某一頁，必須保證該頁所在的塊的內(nèi)容全部被擦除（erased），因此在寫數(shù)據(jù)前必須將該塊的其他頁的數(shù)據(jù)內(nèi)容復(fù)制到一個新的已經(jīng)被擦除的塊中，因此寫操作很復(fù)雜。

讀數(shù)據(jù)：直接讀取。

一個塊被反復(fù)擦除大約10萬次后將被磨損，無法使用。

SSD 的性能：

從上圖可見：

順序訪問（Sequential access）比隨機(jī)訪問（Random access）更快;
隨機(jī)寫數(shù)據(jù)的速度十分慢，因為前面提到的要擦除和復(fù)制的過程。

SSD 和旋轉(zhuǎn)磁盤的比較：

1）優(yōu)點(diǎn)

SSD 沒有移動部分，因此讀寫速度更快，耗電更少，更結(jié)實。

2）缺點(diǎn)

可能磨損，但對于現(xiàn)在的固態(tài)硬盤，可能該問題沒有跟大影響，如 Intel SSD 730 在磨損前可以寫 128 PB（petabyte）次。
比旋轉(zhuǎn)磁盤貴。

6.1.4 Storage Technology Trends

下圖展示了不同存儲設(shè)備相對于 CPU 的性能：

y 軸表示訪問時間，在 2003 以前，制造商通過減小CPU的尺寸，讓各個部分更緊密，按比例增加時鐘頻率，從而使 CPU 的時鐘周期更短。

但由于時鐘頻率越高，消耗的功率越大，因此在 2003 因為功率的問題達(dá)到瓶頸，停止繼續(xù)增加時鐘頻率。

為了時 CPU 更快，2003 年后開始在芯片上放置更多處理器內(nèi)核（process cores），將 CPU 芯片細(xì)分為獨(dú)立的處理器內(nèi)核，每個內(nèi)核可以執(zhí)行自己的指令，通過并行運(yùn)行，提高效率。

現(xiàn)代的 CPU 執(zhí)行時間逐漸趨于穩(wěn)定。

從圖中可以看出，旋轉(zhuǎn)磁盤，SSD，DRAM 和 CPU 訪問時間相差很大，而程序使用的數(shù)據(jù)很多存在磁盤和內(nèi)存中，因此盡管 CPU 的執(zhí)行時間越來越短，但存儲設(shè)備的訪問速度卻基本不變，甚至相對來說越來越慢，那么計算機(jī)的性能實際不會增加，因為受訪問數(shù)據(jù)的時間限制。

6.2 Locality

The key to bridging this CPU-Memory gap is a fundamental property of computer programs known as locality.

Principle locality : Programs tend to use data and instructions with address near or equal to those they have used recently.

局部性有兩種形式：

Temporal locality
Recently referenced items are likely to be referenced again in the near future.

Spatial locality
If a memory location is referenced once, then the program is likely to reference a nearby memory location in the near future.

示例：

上述代碼分析：

循環(huán)內(nèi)部每次引用數(shù)組元素，由于是連續(xù)的步長為1的引用，因此是 Spacial locality。

循環(huán)內(nèi)每次對 sum 變量的引用屬于 temporal locality。

循環(huán)內(nèi)每次迭代引用一系列的指令，屬于 spacial locality。

每次循環(huán)都要循環(huán)使用相同的指令，屬于 temporal locality。

對程序局部性的評估：

圖 6.18 的程序局部性好，而圖 6.19 程序的局部性差。
第二段代碼的 spacial locality 很差，因為不是連續(xù)的訪問數(shù)組元素，數(shù)組的元素在內(nèi)存中是按行存儲的（row-major order，row-wise）。

6.3 The Memory Hierarchy

Some fundamental and enduring properties of storage technology and computer software:

Faster storage technologies cost more per byte than slower ones and have less capacity and require more power (heat!).
The gap between CPU and main memory speed is widening.
Well-written programs tend to exhibit good locality.

Their complementary nature suggests an approach for organizing memory systems, known as the memory hierarchy, that is used in all modern computer systems.

在存儲器的層次結(jié)構(gòu)中，每一層都包含從下一個較低級別層次所檢索的數(shù)據(jù)。如最頂層的寄存器保存從 L1 高速緩存器中檢索的數(shù)據(jù)。

6.3.1 Caching in the Memory Hierarchy

Cache: A small, fast storage device that acts as a staging area (暫存區(qū)) for the data objects stored in a larger, slower device.

Central idea of a memory hierarchy:
For each k, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device at level k+1.

Why do memory hierarchies work?

Because programs tend to exhibit locality, programs tend to access the data at level k more often than they access the data at level k+1. 如果要訪問第 k+1 層存儲單元，我們將會將其拷貝到第 k 層，因為很有可能會再次訪問它。
由于不經(jīng)常訪問 k+1 層的數(shù)據(jù)，因而使用速度更低，更便宜的存儲設(shè)備。

The basic principle of caching in a memory hierarchy：

緩存是一個通用的概念，能應(yīng)用于存儲器層次結(jié)構(gòu)中的所有層。

Cache Hits： 當(dāng)程序需要在 k+1 層的數(shù)據(jù)時，會在 k 層查看是否有該數(shù)據(jù)，如果正好有，則稱為 緩存命中 （cache hit）。

Cache Misses: 當(dāng)程序需要在 k+1 層的數(shù)據(jù)時，會在 k 層查看是否有該數(shù)據(jù)，如果正沒有該數(shù)據(jù)，則稱為 緩存未命中 （cache miss）；例如上圖中如果在 k+1 層查找 塊12 的數(shù)據(jù)，而第 k 層沒有，因此會將 k+1 層 塊12 的數(shù)據(jù)復(fù)制到 k 層，替換第 k 層中塊9 的數(shù)據(jù)。

緩存未命中的種類：

Cold (compulsory) miss
Cold misses occur because the cache is empty.
這是不可避免的，將數(shù)據(jù)慢慢填充到空的緩存中的過程稱為 warming up your cache。

Capacity miss
Occur when the set of active cache blocks (working set) is larger than the cache.
這個是由于緩存的大小有限造成的，例如上圖中緩存只有4塊，如果需要8塊的內(nèi)容，則會造成 Capacity miss。 不斷被程序訪問的一組塊被稱為工作集（working set），因此當(dāng)執(zhí)行不同的程序時，工作集是會變的。

Conflict miss
沖突未命中（conflict miss) 和緩存的實現(xiàn)方式有關(guān)。大多數(shù)緩存，尤其是硬件緩存，由于它們需要設(shè)計的較為簡單，因此限制了塊可以被放置的位置。例如 block i 只能放在 block (i mod cache size) 的地方，以上圖為例，緩存的大小為4，如果要取 block 8的數(shù)據(jù)，則只能放在緩存的第0塊，同樣，block 9 放在緩存的 block 1處。因此，如果要取的塊為 block 0， block 4，block 8，那么計算出來都應(yīng)該放在緩存 block 0的位置，因此放 block 4時，會覆蓋原來 block 0的數(shù)據(jù)，加入需要循環(huán)的訪問 block 0，block 8，block 0，block 8，這樣就會一直不能命中，即使緩存有多余的空間，但位置的限制導(dǎo)致一直覆蓋緩存上 block 0 的數(shù)據(jù)，造成沖突未命中。

Cache Management:

緩存管理：當(dāng)有請求從較低的層次中讀取內(nèi)容時，需要有一個過程決定如何處理這個請求，如何將其放入緩存中的某一位置。

6.4 Cache Memories

The memory hierarchies of early computer systems consisted of only three levels: CPU registers, main memory, and disk storage.

However, because of the increasing gap between CPU and main memory, system designers were compelled to insert a small SRAM cache memory, called an L1 cache (level 1 cache) between the CPU register file and main memory.

緩存存儲器（cache memory）在 CPU 芯片上，完全由硬件管理。
上圖中位于寄存器文件附近的緩存用于存儲主存中經(jīng)常訪問的塊。

6.4.1 Generic Cache Memory Organization

因為緩存由硬件管理，因此硬件必須知道如何查找緩存中的塊，并確定是否包含特定的塊，因此必須以嚴(yán)格且簡單的方式去組織緩存存儲器。

Valid: 初始時，緩存中沒有內(nèi)容，但上圖中 B 字節(jié)的塊中有一些無效的數(shù)據(jù)，因此需要 有效位（Valid） 來識別 B 字節(jié)中的數(shù)據(jù)是否有效，0 則無效，1 有效。

Tag: 標(biāo)志位，幫助搜尋塊。

緩存大小: 一個緩存有 S 組，一組有 E 行，一行中的塊有 B 字節(jié)，因此緩存的大小 C = S * E * B。

塊的大小由內(nèi)存系統(tǒng)決定，是內(nèi)存系統(tǒng)的固定參數(shù)。

緩存硬件讀取過程：
當(dāng)程序執(zhí)行 load 指令時，需要從主存的地址 A 處讀取數(shù)據(jù)，因此 CPU 將地址 A 發(fā)送給緩存，緩存將地址 A 分成多個區(qū)域（由緩存的組織結(jié)構(gòu)決定），如果找到該地址的內(nèi)容則直接讀取數(shù)據(jù)后返回給 CPU，否則從內(nèi)存讀取數(shù)據(jù)并放入緩存中，再將數(shù)據(jù)傳給 CPU。

緩存參數(shù)：

如果有4組，那么 S 為4，s 為 2，s 表示代表組的索引的位數(shù)，組的索引為 00，01，10， 11，因此只需要2位就能表示。

如果內(nèi)存地址是4位，則 m 為 4。

如果一個數(shù)據(jù)塊有 2 個字節(jié)，那么 B 為 2（block[0]，block[1]），b 為1，則將用最低的一位作為偏移量，表示所讀取的字（word）開始的位置，例如地址 7 （0111）的偏移量為 1，將讀取數(shù)據(jù)的地址起始位置為該組相應(yīng)行的 block[1] 處。

標(biāo)志位（Tag）：如果 S 為4，E 為 1，B 為 2，m 為 4，則標(biāo)志為的數(shù)目 t = 4 - (2 + 1) = 1，即標(biāo)志位只有 1 位。標(biāo)志位用于比較緩存中該行的標(biāo)志位和地址 A 的標(biāo)志位是否一致，從而判斷緩存中的根據(jù)索引找到的行的數(shù)據(jù)是否是要查找的地址 A 的數(shù)據(jù)。

（這些參數(shù)看后面的示例了解其意義）

6.4.2 Direct-Mapped Caches

直接映射緩存：E 為1，即一組只有一行。

假設(shè)一塊的大小為 8 字節(jié)，一個 word 大小為 4 字節(jié)：

上圖所示，如果地址中 偏移位 為 4（100），假設(shè)組的索引為（ s bits） 1（00001）：

緩存將查找 Set 1 那組的數(shù)據(jù)，忽略其他組；

然后將比較標(biāo)志位和有效位；

有效位可以知道緩存上塊的數(shù)據(jù)是否有效；

標(biāo)志位則查看緩存上的數(shù)據(jù)是否是想要查找的數(shù)據(jù)；

如果標(biāo)志位有效，則通過偏移4找到數(shù)據(jù)的起始地址處，圖中獲取的數(shù)據(jù)為

w_0w_1w_2w_3

，然后直接返回給 CPU并放入寄存器中。

如果 標(biāo)志位 tag 不匹配，則表示未命中，需要去內(nèi)存中找到相應(yīng)的塊然后將該處的塊覆蓋掉（包括 tag），再取出數(shù)據(jù)返回給 CPU；

關(guān)于索引，前面講過有的硬件對塊存放的位置有嚴(yán)格的要求，有計算公式，因此塊的位置時固定的。

示例2：
Suppose we have a direct-mapped cache described by
$(S, E, B, m) = (4, 1, 2, 4)$

緩存有4組，每組一行，每塊有2字節(jié)，地址為 4 bits；假設(shè)一個 word 為 1 字節(jié)。

從上圖可以看出，一個塊包含內(nèi)存中的兩個地址，如 block 0 包含地址 0 和 1，他們的 Tag 和組的索引都相同，只是偏移量不同，block 0 的偏移量為 0，block 1 的偏移量為 1。

圖 6.30 可看出組的索引號采用地址的中間兩位，這是考慮到前面提過的 spacial locality，如果讀取相鄰地址的內(nèi)容，則索引會分到不同的組，而用高位作為索引則會分到相同組，容易造成 conflict miss：

讀地址 0 （0000）的數(shù)據(jù)：將在緩存 set 0 中查找，初始時緩存為空，因此其有效位（Valid）值為 0，因此未命中，需要從內(nèi)存中找到 block 0 的數(shù)據(jù)放入緩存中，注意因為 block 0 包含兩個地址 0 和 1，因此都會存入緩存，然后返回 m[0] 給 CPU：

放入后有效位變?yōu)?1，Tag 設(shè)置為 0。

讀地址 1 （0001）的數(shù)據(jù)：將在緩存 set 0 中查找，此時已有數(shù)據(jù)，且有效位和 Tag 均符合，因此根據(jù)偏移 1 獲取數(shù)據(jù) m[1] 直接發(fā)送給 CPU。

讀地址 13 （1101）的數(shù)據(jù)：將在 set 2 中查找，無數(shù)據(jù)，因此從內(nèi)存獲取數(shù)據(jù)放入緩存，并根據(jù)偏移 1 獲取 m[13] 發(fā)送給 CPU：

讀地址 8 （1000）的數(shù)據(jù)：將在 set 0 中查找，其標(biāo)志位為 1，和緩存中已有的標(biāo)志位 0 不同，因此需要從內(nèi)存中取數(shù)據(jù)并覆蓋掉緩存中 set 0 的數(shù)據(jù)，返回 m[8] 到 CPU：

如果再讀地址 0 的數(shù)據(jù)，則又會未命中，然后從內(nèi)存取數(shù)據(jù)覆蓋 set 0 中數(shù)據(jù)，盡管還有 set 1 和 set 3 兩組未使用，但仍會選擇相同的 set 從而造成 conflict miss。

因此緩存需要增大 E 的數(shù)目，提高關(guān)聯(lián)性。

6.4.3 Set Associative Caches

當(dāng) E 大于1時，稱為 E-way set associative cache ：

上圖 E 為2，假如入要查詢的地址組索引為 set 1，set 1 中有兩行，緩存將會同時比較兩行的有效位和標(biāo)志位，找到標(biāo)志位符合的一行。

E 為 2時，如果查詢的地址為0，組為 set 0，從內(nèi)存中放入 m[0]和m[1] 到 set 0 的第一行后，還會同時獲取 m[8] 和 m[9] 放到 set[0] 的第二行，因此從內(nèi)存中獲取數(shù)據(jù)塊時，會盡可能的將該組中空的塊寫入數(shù)據(jù)。

6.4.4 Fully Associative Caches

A fully associative cache consists of a single set (i.e., E = C/B) that contains all of the cache lines.

全相聯(lián)高速緩存，只有一組（set），所有行都在這一組中，因此沒有組索引。

全相聯(lián)高速緩存和之前介紹的緩存工作模式相同，但因為所有行都在一組中，而查詢時需要同時比較所有行的 Tag，因此很難做到緩存既大又快，這種設(shè)計比較適合小的緩存（如 translation lookaside buffers (TLBs) in virtual memory systems that cache page table entries）。

6.4.5 Issues with Writes

向內(nèi)存中寫數(shù)據(jù)和讀數(shù)據(jù)相反的過程。

1. write hit
如果要寫的數(shù)據(jù) w 已經(jīng)在緩存中，即是一個 write hit，那么緩存更新了 w 后，有兩種處理方式：

Write-through
將更新后的 w 立刻寫到下一個低等級的塊中（前面講過，存儲器的層次等級，當(dāng)前命中的緩存保存的是它第一等級的部分?jǐn)?shù)據(jù)的副本），這樣很費(fèi)時，因為訪問低層次的時間更長。
Write-back
w 先保存在當(dāng)前緩存中，只有當(dāng)前塊的內(nèi)容要被覆蓋時才將數(shù)據(jù)寫回到低等級塊中。這樣做的缺點(diǎn)是需要額外的 dirty-bit 來表示該緩存塊是否被修改。

2. Write miss
如果緩存中沒有數(shù)據(jù)，有兩種處理方式：

Write-allocate
找到要寫入的數(shù)據(jù)塊放到緩存中，更新緩存的塊。缺點(diǎn)是這樣每次未命中都要額外花時間將數(shù)據(jù)寫到緩存中，這種方式是考慮到 spacial locality，可能接下來用到的數(shù)據(jù)已經(jīng)在緩存中，能命中了。
No-write-allocate
直接將數(shù)據(jù)寫到低層次的塊中，不放入緩存中。

通常的方式是：

Write-through + No-write-allocate
Write-back + Write-allocate (更常用)

6.4.6 Anatomy of a Real Cache Hierarchy

In fact, caches an hold instructions as well as data.

A cache that holds instructions only is called an i-cache.
A cache that holds program data only is called a d-cache.
A cache that holds both instructions and data is known as a unified cache.

Modern processors include separate i-caches and d-caches.

i-caches are typically read-only, and thus simpler.

將指令緩存和數(shù)據(jù)緩存分開的原因：
1、處理器能同時讀指令和數(shù)據(jù)
2、防止數(shù)據(jù)訪問和指令訪問的 conflict miss

Characteristics of the Intel Core i7 cache hierarchy：

見上圖，如果 L1 未命中，將向 L2 發(fā)送請求嘗試在 L2 中查找數(shù)據(jù)，如果 L2 找不到，則向 L3 發(fā)送請求，查看能否在 L3 中找到數(shù)據(jù)，如果 L3 也找不到，則放棄查找。

6.4.7 Performance Impact of Cache Parameters

對緩存性能的評估：

Miss Rate
The fraction of memory references not found in cache.
Hit Time
Time to deliver a line in the cache to the CPU, including the time for set selection, line identification, and word selection. Hit time is on the order of several clock cycles for L1 caches.
Miss Penalty
Any additional time required because of a miss. 如果未命中，需要去取數(shù)據(jù)，然后再寫到緩存中更新緩存等額外花費(fèi)的時間。

Impact of Cache Size

大的緩存趨向增大命中率（hit rate)，因為大的緩存很難運(yùn)行快，因此在存儲器的層級結(jié)構(gòu)中，低層次的緩存比高層次的大。

Impact of Block Size

塊的尺寸對性能的影響：如果程序有更多的 spacial locality 比 temporal locality，那塊的尺寸大會增加命中率；而如果程序更多的是 temporal locality，那么塊的尺寸太大，對應(yīng)的行的數(shù)量就會減少（緩存大小一樣的話），因此反而降低 hit rate；并且塊越大，miss penalty 越大，因為未命中時花費(fèi)的傳輸數(shù)據(jù)的時間更多。

Impact of Associativity

6.4.3 中講過提高關(guān)聯(lián)性，即增大一組中的行數(shù)能減小 conflict miss，但這種設(shè)計對硬件的要求高。而且行數(shù)越多，則需要用于判讀的 Tag 的位越多，因此更復(fù)雜，也會增加 miss penalty 的時間。

通常的做法：更高層次相關(guān)性較低，低層次緩存相關(guān)性更高（為了增加 hit rate），如前面圖 6.39 所示， L3 是 16-way，L1 和 L2 是 8-way。

Impact of Write Strategy

In general, caches further down the hierarchy are more likely to use write-back than write-through.

6.5 Writing Cache-Friendly Code

Good programmers should always try to write code that is cache friendly, in the sense that it has good locality.

Repeated references to local variables are good because the compiler can cache them in the register file (temporal locality).
Stride-1 reference patterns are good because caches at all levels of the memory hierarchy store data as contiguous blocks (spatial locality).

6.6 Putting It Together: The Impact of Caches on Program Performance

6.6.1 The Memory Mountain

Read throughput (read bandwidth)
Number of bytes read from memory per second (MB/s).
Memory mountain
Measured read throughput as a function of spacial and temporal locality.
Compact way to characterize memory system performance.

測試函數(shù)：

上面測試函數(shù)通過改變 size 和 stride 兩個參數(shù)分別控制 temporal locality 和 spacial locality。size 越小，則 temporal locality 越好；stride 越小，則 spacial locality 越好。

通過反復(fù)給不同尺寸的 size 和 stride 參數(shù)來執(zhí)行測試函數(shù)，就能生成一個 two-dimensional function of read throughput versus temporal and spatial locality. 這個函數(shù)就是 memory mountain.

6.6.2 Rearranging Loops to Increase Spatial Locality

矩陣乘法示例：

代碼實現(xiàn)如下：

分析循環(huán)內(nèi)部代碼：
對于版本 (a) ，矩陣 A 的數(shù)據(jù)是按行取，而 B 的數(shù)據(jù)則是按列取，B 的 stride 是 n (假設(shè) n 很大），那么每次循環(huán) B 都不能命中。

改進(jìn)后的版本 (f)，A 和 B 都是按行取，因此降低了未命中率。

Core i7 matrix multiply performance：

6.6.3 Exploiting Locality in Your Programs

In particular, we recommend the following techniques:

Focus your attention on the inner loops, where the bulk of the computations and memory accesses occur.
Try to maximize the spatial locality in your programs by reading data objects sequentially, with stride 1, in the order they are stored in memory.
Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory.

總結(jié)

以上是生活随笔為你收集整理的深入理解计算机系统——第六章 The Memory Hierarchy的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇：健身房APP开发
下一篇： it工种分类_IT岗位中最“和谐”的两个