
MLC: A Memory Latency and Bandwidth Testing Tool

Published: 2024/9/30 · Programming Q&A · 豆豆


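Why MLC

Two important factors affect program performance:

(1) Latency: the time an application spends fetching data from the processor caches and from the memory subsystem, each level of which adds its own delay;

(2) Bandwidth, often written b/w (as in bandwidth, not Bilibili World).

MLC (Intel Memory Latency Checker) measures exactly these two things.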

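What It Tests

Node access speed

In a NUMA (Non-Uniform Memory Access) architecture, memory devices and CPU cores are grouped into nodes, and each node has its own integrated memory controller (IMC). This avoids having every processor reach all memory through one shared path, which relieves bus bandwidth bottlenecks and memory contention.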

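(Note: core = physical CPU, an independent physical execution unit; thread = logical CPU, a hardware thread.

As a simplification, socket = node, i.e. one CPU socket on the motherboard. Within a node, cores reach memory over the IMC bus; different nodes communicate over QPI (QuickPath Interconnect), which newer Intel platforms replace with UPI.)

Same-city express delivery is always faster than international mail, so QPI (remote) latency is noticeably higher than IMC bus (local) latency.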

Test example:

Command to query memory access latency

./mlc --latency_matrix

Result

                Numa node
Numa node            0       1
       0          82.2   129.6
       1         131.1    81.6

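This is the matrix of idle memory access latencies within and between nodes, in ns. The diagonal entries are local accesses; the off-diagonal entries are remote accesses.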
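To put a number on the local versus remote gap, here is a minimal Python sketch that summarizes a latency matrix. The values are copied from the sample result above; the names `latency_ns` and `local_remote_summary` are made up for illustration and are not part of MLC.

```python
# Sketch: summarize a latency matrix such as the one from --latency_matrix.
# Values come from the sample result in this article; names are illustrative.

latency_ns = [
    [82.2, 129.6],   # row for node 0
    [131.1, 81.6],   # row for node 1
]


def local_remote_summary(matrix):
    """Return (average local latency, average remote latency, remote/local ratio)."""
    n = len(matrix)
    # Diagonal entries are local accesses, everything else is remote.
    local = [matrix[i][i] for i in range(n)]
    remote = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    avg_local = sum(local) / len(local)
    avg_remote = sum(remote) / len(remote)
    return avg_local, avg_remote, avg_remote / avg_local


avg_local, avg_remote, ratio = local_remote_summary(latency_ns)
print(f"local  : {avg_local:.1f} ns")
print(f"remote : {avg_remote:.1f} ns")
print(f"remote is {ratio:.2f}x local")
```

On this sample, remote access costs roughly one and a half times as much as local access, which is the "international mail" effect described earlier.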

Bandwidth

Bandwidth is the amount of data transferred per unit time: the wider the road, the less likely a traffic jam.

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      69143.9
3:1 Reads-Writes :      61908.4
2:1 Reads-Writes :      60040.5
1:1 Reads-Writes :      54517.6
Stream-triad like:      57473.4

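Each r:w row shows the memory bandwidth at a different read-to-write ratio.

In general, memory writes are slower than reads (talk is cheap, show me the code).

So as the share of writes grows (the read-to-write ratio falls), bandwidth drops: the road narrows and traffic backs up.

Diagnosis: if bandwidth drops sharply, the workload may be issuing more writes than before, or the writing program may be misbehaving and running too slowly.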
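To see how much bandwidth each mix gives up, the sketch below expresses every row as a fraction of the all-reads figure. The numbers are the sample values from this article; `peak_mb_s` and `relative_to_all_reads` are hypothetical names, not anything MLC provides.

```python
# Sketch: express each read-write mix as a fraction of all-reads bandwidth.
# Numbers are the sample values shown in this article (MB/sec).

peak_mb_s = {
    "ALL Reads": 69143.9,
    "3:1 Reads-Writes": 61908.4,
    "2:1 Reads-Writes": 60040.5,
    "1:1 Reads-Writes": 54517.6,
    "Stream-triad like": 57473.4,
}


def relative_to_all_reads(results):
    """Map each traffic mix to its bandwidth divided by the all-reads bandwidth."""
    base = results["ALL Reads"]
    return {name: value / base for name, value in results.items()}


relative = relative_to_all_reads(peak_mb_s)
for name, frac in relative.items():
    print(f"{name:<18} {peak_mb_s[name]:>9.1f} MB/s  {frac:6.1%} of all-reads")
```

On this machine the 1:1 mix keeps a bit under 80% of the all-reads bandwidth. A much steeper fall than that on comparable hardware would be a reason to look at the write side.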

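Test example

Command to query memory access bandwidth (also useful on its own for checking whether memory access between NUMA nodes is healthy)

./mlc --bandwidth_matrix

Result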

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0       35216.6 32537.9
       1       31875.1 35048.5

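Diagnosis: if the two off-diagonal (anti-diagonal) values differ too much, the bandwidth the two nodes get when accessing each other's memory is unbalanced.

Remedy: when such an imbalance shows up, the usual things to check are how the DIMMs are populated, whether any memory is faulty, and NUMA balancing.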
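The off-diagonal check can be written down as a small Python sketch. The matrix is the sample result from this article. The 10% tolerance is an arbitrary assumption chosen for illustration, not a value recommended by Intel, and the function name is hypothetical.

```python
# Sketch: compare the two cross-node entries of a 2-node bandwidth matrix.
# Values are the sample result from this article (MB/sec).

bandwidth_mb_s = [
    [35216.6, 32537.9],
    [31875.1, 35048.5],
]

# Assumed tolerance for this example only; pick one that suits your platform.
TOLERANCE = 0.10


def cross_node_imbalance(matrix, i=0, j=1):
    """Relative difference between the [i][j] and [j][i] entries."""
    a, b = matrix[i][j], matrix[j][i]
    return abs(a - b) / max(a, b)


imbalance = cross_node_imbalance(bandwidth_mb_s)
status = "looks balanced" if imbalance <= TOLERANCE else "worth investigating"
print(f"cross-node imbalance: {imbalance:.1%} ({status})")
```

For the sample matrix the two directions differ by about 2%, so nothing here points to a DIMM population or hardware problem.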

Relationship between memory bandwidth and memory latency (read operations)

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  523.74    69057.4
 00002  589.55    68668.7
 00008  686.99    68571.4
 00015  549.87    68873.6
 00050  575.48    68673.0
 00100  524.74    68877.5
 00200  197.61    64225.8
 00300  131.60    47141.0
 00400  110.39    36803.0
 00500  117.32    30135.2
 00700  100.90    22179.1
 01000  100.93    15762.8
 01300   91.74    12351.6
 01700   98.61     9475.2
 02500   86.66     6927.8
 03500   88.13     5132.6
 05000   87.68     3818.6
 09000   85.36     2473.5
 20000   84.83     1538.7

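This shows how memory response time changes under load, and whether latency becomes unacceptable once bandwidth reaches a certain level. A larger inject delay means less load: in this sample, latency stays near 85 to 130 ns up to about 47 GB/s, then climbs past 500 ns as bandwidth approaches its 69 GB/s peak.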
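One way to read a loaded-latency table is to ask how much bandwidth the system can deliver while latency stays within a budget. The sketch below does that for the sample figures listed above. The 150 ns budget is a made-up threshold for illustration, and `max_bandwidth_within_budget` is a hypothetical helper.

```python
# Sketch: find the highest bandwidth whose latency stays within a budget.
# Rows are (inject delay, latency in ns, bandwidth in MB/sec) from the
# sample loaded-latency result in this article.

loaded = [
    (0, 523.74, 69057.4), (2, 589.55, 68668.7), (8, 686.99, 68571.4),
    (15, 549.87, 68873.6), (50, 575.48, 68673.0), (100, 524.74, 68877.5),
    (200, 197.61, 64225.8), (300, 131.60, 47141.0), (400, 110.39, 36803.0),
    (500, 117.32, 30135.2), (700, 100.90, 22179.1), (1000, 100.93, 15762.8),
    (1300, 91.74, 12351.6), (1700, 98.61, 9475.2), (2500, 86.66, 6927.8),
    (3500, 88.13, 5132.6), (5000, 87.68, 3818.6), (9000, 85.36, 2473.5),
    (20000, 84.83, 1538.7),
]

# Assumed latency budget for this example only.
LATENCY_BUDGET_NS = 150.0


def max_bandwidth_within_budget(rows, budget_ns):
    """Return the row with the highest bandwidth among rows within the budget."""
    within = [row for row in rows if row[1] <= budget_ns]
    return max(within, key=lambda row: row[2]) if within else None


best = max_bandwidth_within_budget(loaded, LATENCY_BUDGET_NS)
if best is None:
    print("no measurement meets the latency budget")
else:
    delay, latency, bandwidth = best
    print(f"inject delay {delay}: {bandwidth:.1f} MB/s at {latency:.2f} ns")
```

With this budget the answer is the inject-delay-300 row. Pushing for the last 30% of bandwidth beyond that point costs several times the latency.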

Measuring access latency from CPU cache to CPU cache

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        38.6
Local Socket L2->L2 HITM latency        43.6
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                        Reader Socket
Writer Socket            0       1
            0            -   133.4
            1        133.7       -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                        Reader Socket
Writer Socket            0       1
            0            -   133.5
            1        133.7       -

Peak bandwidth

Command

mlc --peak_bandwidth

Result

Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      50035.2
3:1 Reads-Writes :      48119.3
2:1 Reads-Writes :      47434.3
1:1 Reads-Writes :      48325.5
Stream-triad like:      44029.0

Idle memory latency

Command

mlc --idle_latency

Result

Using buffer size of 200.000MB
Each iteration took 260.5 core clocks ( 113.3 ns)
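The result reports the same latency in two units, core clocks and nanoseconds, so dividing one by the other gives the clock rate used for the conversion. A small sketch of that arithmetic, using the figures from the sample result (the function name is hypothetical):

```python
# Sketch: derive the clock frequency implied by an idle-latency result.
# Clocks per nanosecond is numerically the same as GHz.

core_clocks = 260.5   # from the sample result
elapsed_ns = 113.3    # from the sample result


def implied_frequency_ghz(clocks, nanoseconds):
    """Clock frequency in GHz implied by a clocks/nanoseconds pair."""
    return clocks / nanoseconds


freq_ghz = implied_frequency_ghz(core_clocks, elapsed_ns)
print(f"implied clock: {freq_ghz:.2f} GHz")
```

For this sample the implied clock is about 2.3 GHz. If the figure looks far from what the CPU should be running at, it is worth checking frequency scaling before trusting the latency numbers.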

Loaded memory latency

Command

mlc --loaded_latency

Result

Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  217.32    49703.4
 00002  258.98    49482.4
 00008  217.48    49908.1
 00015  220.12    49973.7
 00050  206.33    49185.7
 00100  174.02    43811.8
 00200  141.63    27651.1
 00300  130.65    19614.6
 00400  126.05    15217.0
 00500  122.70    12506.0
 00700  121.46     9253.0
 01000  120.55     6690.6
 01300  118.75     5314.9
 01700  120.18     4148.7
 02500  119.53     3055.7
 03500  119.60     2349.4
 05000  116.60     1816.9
 09000  116.17     1257.8
 20000  116.87      867.6

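Other operations (to be continued)

Measure access latency between specified nodes
Measure CPU cache access latency
Measure access bandwidth within a specified subset of cores/sockets
Measure bandwidth at different read-write ratios
Use a random access pattern in place of the default sequential pattern
Specify the stride used during the test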
