How to optimize GEMM on CPU
(TL;DR) TVM provides abstract interfaces that allow users to describe an algorithm and the algorithm's implementing organization (the so-called schedule) separately. Typically, writing an algorithm directly in a high-performance schedule breaks its readability and modularity, and trying out various seemingly promising schedules is time-consuming. With the help of TVM, we can try these schedules efficiently and improve performance.
This tutorial demonstrates how to use TVM to optimize matrix multiplication and, by simply adding 18 extra lines of code, achieve a 200x speedup over the baseline.
There are two important optimizations for dense computation applications running on the CPU:
- Increase the cache hit rate of memory access. Both complex numerical computation and hot-spot memory access can be accelerated by a high cache hit rate. This requires us to transform the original memory access pattern into a pattern that fits the cache policy.
- SIMD (Single Instruction Multiple Data), also known as the vector processing unit. On every cycle, SIMD processes a small batch of data rather than a single value. This requires us to transform the data access pattern in the loop body into a uniform pattern so that the LLVM backend can lower it to SIMD instructions.
Actually, all the methods used here are a subset of the tricks mentioned in this repo. Some of them have been applied by TVM abstraction automatically, while some of them cannot be simply applied due to TVM constraints.
All the experiment results below were obtained on a 15-inch MacBook equipped with an Intel i7-4770HQ CPU. The cache line size should be 64 bytes for all x86 CPUs.
Preparation and Baseline
In this tutorial, we will demonstrate how to use TVM to optimize matrix multiplication. Before the actual demo, we first define these variables. Then we write a baseline implementation, the simplest way to write a matrix multiplication in TVM.
import tvm
import tvm.testing
from tvm import te
import numpy
import timeit
# The size of the matrix
# (M, K) x (K, N)
# You are free to try out different shapes, sometimes TVM optimization outperforms numpy with MKL.
M = 1024
K = 1024
N = 1024
# The default tensor type in tvm
dtype = "float32"
# using Intel AVX2 (Advanced Vector Extensions) ISA for SIMD
# To get the best performance, please change the following line
# to llvm -mcpu=core-avx2, or specific type of CPU you use
target = "llvm"
dev = tvm.device(target, 0)
# Random generated tensor for testing
a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), dev)
b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), dev)
np_repeat = 100
np_running_time = timeit.timeit(
    setup="import numpy\n"
    "M = " + str(M) + "\n"
    "K = " + str(K) + "\n"
    "N = " + str(N) + "\n"
    'dtype = "float32"\n'
    "a = numpy.random.rand(M, K).astype(dtype)\n"
    "b = numpy.random.rand(K, N).astype(dtype)\n",
    stmt="answer = numpy.dot(a, b)",
    number=np_repeat,
)
print("Numpy running time: %f" % (np_running_time / np_repeat))
answer = numpy.dot(a.numpy(), b.numpy())
# Algorithm
k = te.reduce_axis((0, K), "k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")
# Default schedule
s = te.create_schedule(C.op)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("Baseline: %f" % evaluator(a, b, c).mean)
Output:
Numpy running time: 0.009345
Baseline: 3.291115
In TVM, we can always inspect the lower-level IR to debug or optimize the schedule. Here is the generated IR using the baseline schedule.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (m: int32, 0, 1024) {
    for (n: int32, 0, 1024) {
      C_2[((m*1024) + n)] = 0f32
      for (k: int32, 0, 1024) {
        C_2[((m*1024) + n)] = ((float32*)C_2[((m*1024) + n)] + ((float32*)A_2[((m*1024) + k)]*(float32*)B_2[((k*1024) + n)]))
      }
    }
  }
}
Blocking
A crucial trick to improve the cache hit rate is blocking: the data is computed block by block, so that the memory accesses inside a block form a small neighborhood with high memory locality. In this tutorial, we pick 32 as the blocking factor, so one block fills 32 * 32 * sizeof(float), i.e. 4KB, of the cache, whose total size is 32KB (L1 data cache).
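To make the target loop structure concrete before expressing it with TVM schedule primitives, here is a rough pure-Python sketch of blocked (tiled) matrix multiplication. The helper name blocked_gemm_sketch is hypothetical and only illustrates the loop nest we are aiming for; it is not the code TVM generates and would be far too slow to actually run at this size.
# Illustrative sketch only: blocked GEMM loop nest with block size bn and reduction split kfactor
def blocked_gemm_sketch(A, B, C, M, N, K, bn=32, kfactor=4):
    for mo in range(M // bn):                   # outer loops walk over 32x32 blocks of C
        for no in range(N // bn):
            for ko in range(K // kfactor):      # reduction axis split into outer/inner parts
                for ki in range(kfactor):
                    k = ko * kfactor + ki
                    for mi in range(bn):        # inner loops stay inside one cache-resident block
                        for ni in range(bn):
                            C[mo * bn + mi][no * bn + ni] += A[mo * bn + mi][k] * B[k][no * bn + ni]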
bn = 32
kfactor = 4
s = te.create_schedule(C.op)
# Blocking by loop tiling
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=kfactor)
# Hoist reduction domain outside the blocking loop
s[C].reorder(mo, no, ko, ki, mi, ni)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
# By simply tiling the loop 32x32, and hoisting ko, ki outside the blocking loops,
# we can see big speedup compared with the baseline.
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt1: %f" % evaluator(a, b, c).mean)
Output:
Opt1: 0.310688
Here is the generated IR after blocking.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (m.outer: int32, 0, 32) {
    for (n.outer: int32, 0, 32) {
      for (m.inner.init: int32, 0, 32) {
        for (n.inner.init: int32, 0, 32) {
          C_2[((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)) + n.inner.init)] = 0f32
        }
      }
      for (k.outer: int32, 0, 256) {
        for (k.inner: int32, 0, 4) {
          for (m.inner: int32, 0, 32) {
            for (n.inner: int32, 0, 32) {
              C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = ((float32*)C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] + ((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)]*(float32*)B_2[((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)) + n.inner)]))
            }
          }
        }
      }
    }
  }
}
Vectorization
Another important trick is vectorization. When the memory access pattern is uniform, the compiler can detect this pattern and pass the continuous memory to the vector processor. In TVM, we can use the vectorize interface to hint the compiler about this pattern, so that it can be accelerated greatly.
In this tutorial, we choose to vectorize the inner loop row data since it is cache friendly.
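As a rough numpy analogy (an illustration only, not the code TVM emits; the helper name is hypothetical and A, B, C here are plain numpy arrays), vectorizing the ni axis means the innermost scalar loop over the 32 columns of a block collapses into a single 32-wide vector update per (m, k) pair:
# Illustrative sketch: one vector update of a 32-wide row segment of C replaces the scalar ni loop
def block_update_vectorized(A, B, C, mo, no, ko, bn=32, kfactor=4):
    for ki in range(kfactor):
        k = ko * kfactor + ki
        for mi in range(bn):
            m = mo * bn + mi
            C[m, no * bn:(no + 1) * bn] += A[m, k] * B[k, no * bn:(no + 1) * bn]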
s = te.create_schedule(C.op)
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=kfactor)
s[C].reorder(mo, no, ko, ki, mi, ni)
# Vectorization
s[C].vectorize(ni)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt2: %f" % evaluator(a, b, c).mean)
Output:
Opt2: 0.341067
Here is the generated IR after vectorization.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (m.outer: int32, 0, 32) {
    for (n.outer: int32, 0, 32) {
      for (m.inner.init: int32, 0, 32) {
        C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
      }
      for (k.outer: int32, 0, 256) {
        for (k.inner: int32, 0, 4) {
          for (m.inner: int32, 0, 32) {
            C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)), 1, 32)]))
          }
        }
      }
    }
  }
}
Loop Permutation
Looking at the IR above, we can see that for the inner loop row data, both B and C are vectorized. Next we will look at the access pattern of A. In the current schedule, A is accessed column by column, which is not cache friendly. If we change the nested loop order of ki and the inner axis mi, the access pattern of the A matrix becomes more cache friendly.
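In loop-nest terms (a sketch only, with the same bn and kfactor as above, a hypothetical helper name, and A, B, C as plain numpy arrays), the reorder moves ki inside mi, so for a fixed row m the reduction reads A[m, k], A[m, k+1], ... with stride 1 instead of jumping a full row of 1024 floats between iterations:
# Illustrative sketch of the (ko, mi, ki, ni) order: A is now read row-wise within a block
def block_update_permuted(A, B, C, mo, no, ko, bn=32, kfactor=4):
    for mi in range(bn):
        m = mo * bn + mi
        for ki in range(kfactor):
            k = ko * kfactor + ki
            C[m, no * bn:(no + 1) * bn] += A[m, k] * B[k, no * bn:(no + 1) * bn]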
s = te.create_schedule(C.op)
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=kfactor)
# re-ordering
s[C].reorder(mo, no, ko, mi, ki, ni)
s[C].vectorize(ni)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt3: %f" % evaluator(a, b, c).mean)
Output:
Opt3: 0.111449
Here is the generated IR after loop permutation.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (m.outer: int32, 0, 32) {
    for (n.outer: int32, 0, 32) {
      for (m.inner.init: int32, 0, 32) {
        C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
      }
      for (k.outer: int32, 0, 256) {
        for (m.inner: int32, 0, 32) {
          for (k.inner: int32, 0, 4) {
            C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)), 1, 32)]))
          }
        }
      }
    }
  }
}
Array Packing
Another important trick is array packing. The trick is to reorder the storage of a multi-dimensional array so that it is accessed sequentially after being flattened and stored in one-dimensional memory.
We can use array packing to address the access pattern of B. Observe the flattened array access pattern of B: it is not sequential when we iterate over the K dimension. We can reorder B with dimensions [K][N] so that it has dimensions [N/bn][K][bn], where bn is the blocking factor and also the vector size of B in the inner loop. This reorder splits N into two dimensions, bigN (N/bn) and littleN (bn), and the new dimensions [N/bn][K][bn] match the indexing of B from the outer to the inner loop (no, ko, ki, ni), resulting in a sequential access pattern of B after flattening.
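The layout change can be sanity-checked with a few lines of numpy (an illustration only, using the numpy module imported above; the variable names are hypothetical, and the actual packing is expressed in TVM right below). After packing, packedB[n // bn, k, n % bn] holds the same value as B[k, n], and iterating in (no, ko, ki, ni) order walks packedB sequentially in memory.
# Illustrative numpy check of the [K][N] -> [N/bn][K][bn] layout
bn_check = 32
B_np = numpy.random.rand(1024, 1024).astype("float32")
packedB_np = numpy.ascontiguousarray(
    B_np.reshape(1024, 1024 // bn_check, bn_check).transpose(1, 0, 2)
)
k_idx, n_idx = 3, 100
assert packedB_np[n_idx // bn_check, k_idx, n_idx % bn_check] == B_np[k_idx, n_idx]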
# We have to re-write the algorithm slightly.
packedB = te.compute(
    (N / bn, K, bn), lambda bigN, k, littleN: B[k, bigN * bn + littleN], name="packedB"
)
C = te.compute(
    (M, N),
    lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k),
    name="C",
)
s = te.create_schedule(C.op)
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=kfactor)
s[C].reorder(mo, no, ko, mi, ki, ni)
s[C].vectorize(ni)
bigN, _, littleN = s[packedB].op.axis
s[packedB].vectorize(littleN)
s[packedB].parallel(bigN)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt4: %f" % evaluator(a, b, c).mean)
Output:
Opt4: 0.217310
Here is the generated IR after array packing.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {
    for (bigN: int32, 0, 32) "parallel" {
      for (k: int32, 0, 1024) {
        packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
      }
    }
    for (m.outer: int32, 0, 32) {
      for (n.outer: int32, 0, 32) {
        for (m.inner.init: int32, 0, 32) {
          C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (m.inner: int32, 0, 32) {
            for (k.inner: int32, 0, 4) {
              C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + (k.inner*32)), 1, 32)]))
            }
          }
        }
      }
    }
  }
}
Write cache for blocks
After blocking, the program writes the result to C block by block, and the access pattern is not sequential. So we can use a sequential cache array to hold the block results and write them to C once all the block results are ready.
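The idea can be sketched in plain numpy (an illustration only, not the code TVM generates; the helper name is hypothetical and it uses the numpy module imported above): accumulate one 32x32 block in a small contiguous buffer, then copy it into C once the block is finished.
# Illustrative sketch of a write cache for one (mo, no) block; bn is the blocking factor
def compute_block_with_write_cache(A, B, C, mo, no, bn=32):
    CC_local = numpy.zeros((bn, bn), dtype=C.dtype)  # small sequential buffer for the block
    for k in range(A.shape[1]):
        # rank-1 update of the block: outer product of a column slice of A and a row slice of B
        CC_local += numpy.outer(A[mo * bn:(mo + 1) * bn, k], B[k, no * bn:(no + 1) * bn])
    C[mo * bn:(mo + 1) * bn, no * bn:(no + 1) * bn] = CC_local  # write the block back in one pass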
s = te.create_schedule(C.op)
# Allocate write cache
CC = s.cache_write(C, "global")
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
# Write cache is computed at no
s[CC].compute_at(s[C], no)
# New inner axes
mc, nc = s[CC].op.axis
(kaxis,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(kaxis, factor=kfactor)
s[CC].reorder(ko, mc, ki, nc)
s[CC].vectorize(nc)
# TODO: Add separate optimization step to discuss loop unrolling
# unrolling is a loop optimization strategy which can reduce branch
# prediction failures and increase the chance of concurrent execution
# unroll kfactor loops
s[CC].unroll(ki)
bigN, _, littleN = s[packedB].op.axis
s[packedB].vectorize(littleN)
s[packedB].parallel(bigN)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
print("Opt5: %f" % evaluator(a, b, c).mean)
Output:
Opt5: 0.215912
Here is the generated IR after introducing the write cache.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global;
  allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global {
    for (bigN: int32, 0, 32) "parallel" {
      for (k: int32, 0, 1024) {
        packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
      }
    }
    for (m.outer: int32, 0, 32) {
      for (n.outer: int32, 0, 32) {
        for (m.c.init: int32, 0, 32) {
          C.global[ramp((m.c.init*32), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (m.c: int32, 0, 32) {
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[(((m.outer*32768) + (m.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((n.outer*32768) + (k.outer*128)), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
          }
        }
        for (m.inner: int32, 0, 32) {
          for (n.inner: int32, 0, 32) {
            C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = (float32*)C.global[((m.inner*32) + n.inner)]
          }
        }
      }
    }
  }
}
Parallelization
Furthermore, we can also take advantage of multi-core processors to do thread-level parallelization.
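As a rough analogy (an illustration with numpy and Python threads, not how TVM's runtime actually schedules work; the helper name is hypothetical), parallelizing the outermost block loop means each worker computes an independent panel of rows of C, so no synchronization on C is needed:
# Illustrative sketch: distribute the outer row blocks of C across threads
from concurrent.futures import ThreadPoolExecutor
def parallel_gemm_sketch(A_np, B_np, bn=32, workers=4):
    M_, N_ = A_np.shape[0], B_np.shape[1]
    C_np = numpy.zeros((M_, N_), dtype=A_np.dtype)
    def row_panel(mo):
        # numpy's matmul releases the GIL, so row panels can run concurrently
        C_np[mo * bn:(mo + 1) * bn, :] = A_np[mo * bn:(mo + 1) * bn, :] @ B_np
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(row_panel, range(M_ // bn)))
    return C_np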
s = te.create_schedule(C.op)
CC = s.cache_write(C, "global")
mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
s[CC].compute_at(s[C], no)
mc, nc = s[CC].op.axis
(kaxis,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(kaxis, factor=kfactor)
s[CC].reorder(ko, mc, ki, nc)
s[CC].vectorize(nc)
s[CC].unroll(ki)
# parallel
s[C].parallel(mo)
bigN, _, littleN = s[packedB].op.axis
s[packedB].vectorize(littleN)
s[packedB].parallel(bigN)
func = tvm.build(s, [A, B, C], target=target, name="mmult")
assert func
c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
evaluator = func.time_evaluator(func.entry_name, dev, number=50)
opt6_time = evaluator(a, b, c).mean
print("Opt6: %f" % opt6_time)
Output:
Opt6: 0.066558
Here is the generated IR after parallelization.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Output:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
             A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {
    for (bigN: int32, 0, 32) "parallel" {
      for (k: int32, 0, 1024) {
        packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
      }
    }
    for (m.outer: int32, 0, 32) "parallel" {
      allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global;
      for (n.outer: int32, 0, 32) {
        for (m.c.init: int32, 0, 32) {
          C.global[ramp((m.c.init*32), 1, 32)] = broadcast(0f32, 32)
        }
        for (k.outer: int32, 0, 256) {
          for (m.c: int32, 0, 32) {
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[(((m.outer*32768) + (m.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((n.outer*32768) + (k.outer*128)), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
            C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
          }
        }
        for (m.inner: int32, 0, 32) {
          for (n.inner: int32, 0, 32) {
            C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = (float32*)C.global[((m.inner*32) + n.inner)]
          }
        }
      }
    }
  }
}
Summary
After applying the above simple optimizations with only 18 lines of code, our generated code can achieve 60% of the numpy performance with MKL. Note that the outputs above reflect running times on a non-exclusive Docker container and are therefore unreliable. You are highly encouraged to run the tutorial yourself to observe the performance gain achieved by TVM.
Reference:
https://tvm.apache.org/docs/tutorials/optimize/opt_gemm.html