Auto-scheduling a Neural Network for NVIDIA GPU
Auto-tuning for specific devices and workloads is critical for getting the best performance. This is a tutorial on how to tune a whole neural network for NVIDIA GPU with the auto-scheduler.
To auto-tune a neural network, we divide the network into small subgraphs and tune them independently. Each subgraph is treated as one search task. A task scheduler slices the time and dynamically allocates time resources to these tasks: it predicts the impact of each task on the end-to-end execution time and prioritizes the tasks that can reduce the execution time the most.
For each subgraph, we use the compute declaration in tvm/python/topi to get the computational DAG in tensor expression form. We then use the auto-scheduler to construct a search space for this DAG and search for good schedules (low-level optimizations).
Different from the template-based autotvm, which relies on manual templates to define the search space, the auto-scheduler does not require any schedule templates. In other words, the auto-scheduler only uses the compute declarations in tvm/python/topi and does not use existing schedule templates.
Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you need to wrap the body of this tutorial in an if __name__ == "__main__": block.
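A minimal sketch of that wrapping (the helper name main is illustrative, not part of the tutorial):

# Wrap the tutorial body in a main guard so that the measurement
# subprocesses spawned by the auto-scheduler work on Windows/macOS.
def main():
    # ... put the tutorial code from the sections below here ...
    pass

if __name__ == "__main__":
    main()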
import numpy as np

import tvm
from tvm import relay, auto_scheduler
import tvm.relay.testing
from tvm.contrib import graph_runtime
Define a Network
First, we need to define the network with the Relay frontend API. We can load some pre-defined networks from tvm.relay.testing. We can also load models from MXNet, ONNX, PyTorch, and TensorFlow; a hedged example follows.
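As an illustration, loading a model through a Relay frontend instead of tvm.relay.testing might look like this (the file name "model.onnx" and the input name "data" are placeholders, not part of this tutorial):

import onnx

# Hypothetical example: load an ONNX model and convert it to a Relay module.
onnx_model = onnx.load("model.onnx")
shape_dict = {"data": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)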
For convolutional neural networks, although the auto-scheduler can work correctly with any layout, we found that the best performance is typically achieved with the NHWC layout, and more optimizations for the NHWC layout have been implemented in the auto-scheduler. So it is recommended to convert your models to the NHWC layout to use the auto-scheduler. You can use the ConvertLayout pass to do the layout conversion in TVM, as sketched below.
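A minimal sketch of such a conversion, assuming mod is an existing Relay module (converting only nn.conv2d here is an illustrative choice):

# Convert convolutions to NHWC layout with the ConvertLayout pass.
desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
seq = tvm.transform.Sequential(
    [
        relay.transform.RemoveUnusedFunctions(),
        relay.transform.ConvertLayout(desired_layouts),
    ]
)
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)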
def get_network(name, batch_size, layout="NHWC", dtype="float32"):
    """Get the symbol definition and random weight of a network"""

    # auto-scheduler prefers NHWC layout
    if layout == "NHWC":
        image_shape = (224, 224, 3)
    elif layout == "NCHW":
        image_shape = (3, 224, 224)
    else:
        raise ValueError("Invalid layout: " + layout)

    input_shape = (batch_size,) + image_shape
    output_shape = (batch_size, 1000)

    if name.startswith("resnet-"):
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer,
            batch_size=batch_size,
            layout=layout,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name.startswith("resnet3d-"):
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet_3d.get_workload(
            num_layers=n_layer,
            batch_size=batch_size,
            layout=layout,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(
            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
        )
    elif name == "squeezenet_v1.1":
        assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
        mod, params = relay.testing.squeezenet.get_workload(
            version="1.1",
            batch_size=batch_size,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "mxnet":
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model

        assert layout == "NCHW"

        block = get_model("resnet18_v1", pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(
            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
        )
        mod = tvm.IRModule.from_expr(net)

    return mod, params, input_shape, output_shape
# Define the neural network and compilation target
network = "resnet-18"
batch_size = 1
layout = "NHWC"
target = tvm.target.Target("cuda")
dtype = "float32"
log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)
Extract Search Tasks
Next, we extract the search tasks and their weights from a network. The weight of a task is the number of appearances of the task's subgraph in the whole network. By using the weights, we can approximate the end-to-end latency of the network as sum(latency[t] * weight[t]), where latency[t] is the latency of task t and weight[t] is the weight of task t. The task scheduler will just optimize this objective.
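As a concrete illustration of this objective, here is a tiny sketch with made-up numbers (the latency and weight values are not from this tutorial):

# The task scheduler minimizes the weighted sum of per-task latencies.
latencies = [0.005, 0.010, 0.076]  # ms, hypothetical per-task latencies
weights = [1, 1, 2]                # hypothetical subgraph appearance counts

estimated_total = sum(l * w for l, w in zip(latencies, weights))
print("Estimated end-to-end latency: %.3f ms" % estimated_total)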
# Extract tasks from the network
print("Extract tasks...")
mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

for idx, task in enumerate(tasks):
    print("========== Task %d (workload key: %s) ==========" % (idx, task.workload_key))
    print(task.compute_dag)
Output:
Extract tasks...
========== Task 0 (workload key: ["d7b65649a4dd54becea0a52aabbc5af5", 1, 1000, 1, 1000]) ==========
 placeholder = PLACEHOLDER [1, 1000]
 T_softmax_maxelem(i0) max= placeholder[i0, k]
 T_softmax_exp(i0, i1) = tir.exp((placeholder[i0, i1] - T_softmax_maxelem[i0]))
 T_softmax_expsum(i0) += T_softmax_exp[i0, k]
 T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])
========== Task 1 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 1, 512, 1000, 512, 1000, 1, 1000]) ==========
 placeholder = PLACEHOLDER [1, 512]
 placeholder = PLACEHOLDER [1000, 512]
 T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
 placeholder = PLACEHOLDER [1000]
 T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 2 (workload key: ["69115f188984ae34ede37c3b8ca40b43", 1, 7, 7, 512, 1, 1, 1, 512]) ==========
 placeholder = PLACEHOLDER [1, 7, 7, 512]
tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))
========== Task 3 (workload key: ["ad6cecbf5d85cb1cda3c2bb7af170211", 1, 7, 7, 512, 4, 4, 512, 512, 1, 7, 7, 512, 1, 1, 1, 512, 1, 1, 1, 512, 1, 7, 7, 512]) ==========
 placeholder = PLACEHOLDER [1, 7, 7, 512]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 16), ((floormod(floordiv(p, 4), 4)*2) + eps), ((floormod(p, 4)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 512, 512]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*4)*4) + (floordiv(h, 2)*4)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 7, 7, 512]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 placeholder = PLACEHOLDER [1, 1, 1, 512]
 T_multiply(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3]*placeholder[ax0, 0, 0, ax3])
 placeholder = PLACEHOLDER [1, 1, 1, 512]
 T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 4 (workload key: ["3a69f9fbc63760d99e36b4c17b3bfc57", 1, 7, 7, 512, 4, 4, 512, 512, 1, 1, 1, 512, 1, 7, 7, 512]) ==========
 placeholder = PLACEHOLDER [1, 7, 7, 512]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 16), ((floormod(floordiv(p, 4), 4)*2) + eps), ((floormod(p, 4)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 512, 512]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*4)*4) + (floordiv(h, 2)*4)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 1, 1, 512]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 5 (workload key: ["d730bcd28f0920f6b97245e2a11bd8d6", 1, 7, 7, 512, 4, 4, 512, 512, 1, 7, 7, 512, 1, 7, 7, 512]) ==========
 placeholder = PLACEHOLDER [1, 7, 7, 512]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 16), ((floormod(floordiv(p, 4), 4)*2) + eps), ((floormod(p, 4)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 512, 512]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*4)*4) + (floordiv(h, 2)*4)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 7, 7, 512]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
========== Task 6 (workload key: ["12b88bedece6984af589a28b43e0f3c4", 1, 14, 14, 256, 3, 3, 256, 512, 1, 1, 1, 512, 1, 7, 7, 512]) ==========
 placeholder = PLACEHOLDER [1, 14, 14, 256]
 PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 placeholder = PLACEHOLDER [3, 3, 256, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
 placeholder = PLACEHOLDER [1, 1, 1, 512]
 T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 7 (workload key: ["f3b6c10fcc6ce01ff01add933e4d21e9", 1, 14, 14, 256, 4, 4, 256, 256, 1, 14, 14, 256, 1, 1, 1, 256, 1, 14, 14, 256]) ==========
 placeholder = PLACEHOLDER [1, 14, 14, 256]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 49), ((floormod(floordiv(p, 7), 7)*2) + eps), ((floormod(p, 7)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 256, 256]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*7)*7) + (floordiv(h, 2)*7)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 14, 14, 256]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 placeholder = PLACEHOLDER [1, 1, 1, 256]
 T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 8 (workload key: ["b8b52b9be9df6102466a22a014c44c1f", 1, 14, 14, 256, 4, 4, 256, 256, 1, 1, 1, 256, 1, 14, 14, 256]) ==========
 placeholder = PLACEHOLDER [1, 14, 14, 256]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 49), ((floormod(floordiv(p, 7), 7)*2) + eps), ((floormod(p, 7)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 256, 256]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*7)*7) + (floordiv(h, 2)*7)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 1, 1, 256]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 9 (workload key: ["d374e472bd9d8164892b9e28a0a8cb59", 1, 14, 14, 256, 4, 4, 256, 256, 1, 14, 14, 256, 1, 14, 14, 256]) ==========
 placeholder = PLACEHOLDER [1, 14, 14, 256]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 49), ((floormod(floordiv(p, 7), 7)*2) + eps), ((floormod(p, 7)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 256, 256]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*7)*7) + (floordiv(h, 2)*7)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 14, 14, 256]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
========== Task 10 (workload key: ["12b88bedece6984af589a28b43e0f3c4", 1, 28, 28, 128, 3, 3, 128, 256, 1, 1, 1, 256, 1, 14, 14, 256]) ==========
 placeholder = PLACEHOLDER [1, 28, 28, 128]
 PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 placeholder = PLACEHOLDER [3, 3, 128, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
 placeholder = PLACEHOLDER [1, 1, 1, 256]
 T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 11 (workload key: ["c4500b4e2fd04e695c32d2f31bbdc14a", 1, 28, 28, 128, 4, 4, 128, 128, 1, 28, 28, 128, 1, 1, 1, 128, 1, 28, 28, 128]) ==========
 placeholder = PLACEHOLDER [1, 28, 28, 128]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*2) + eps), ((floormod(p, 14)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 128, 128]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*14)*14) + (floordiv(h, 2)*14)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 28, 28, 128]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 placeholder = PLACEHOLDER [1, 1, 1, 128]
 T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 12 (workload key: ["e4cdf917b876dbdd64488c3818d9c141", 1, 28, 28, 128, 4, 4, 128, 128, 1, 1, 1, 128, 1, 28, 28, 128]) ==========
 placeholder = PLACEHOLDER [1, 28, 28, 128]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*2) + eps), ((floormod(p, 14)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 128, 128]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*14)*14) + (floordiv(h, 2)*14)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 1, 1, 128]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 13 (workload key: ["dac19035dd5fe9424ee8617421b9c817", 1, 28, 28, 128, 4, 4, 128, 128, 1, 28, 28, 128, 1, 28, 28, 128]) ==========
 placeholder = PLACEHOLDER [1, 28, 28, 128]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*2) + eps), ((floormod(p, 14)*2) + nu), ci]
 B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), …(OMITTED)… ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [4, 4, 128, 128]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), …(OMITTED)… ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*14)*14) + (floordiv(h, 2)*14)) + floordiv(w, 2)), co]
 placeholder = PLACEHOLDER [1, 28, 28, 128]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
========== Task 14 (workload key: ["12b88bedece6984af589a28b43e0f3c4", 1, 56, 56, 64, 3, 3, 64, 128, 1, 1, 1, 128, 1, 28, 28, 128]) ==========
 placeholder = PLACEHOLDER [1, 56, 56, 64]
 PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 placeholder = PLACEHOLDER [3, 3, 64, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
 placeholder = PLACEHOLDER [1, 1, 1, 128]
 T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 15 (workload key: ["1e3c4211ffd2f2db91078ae4d04b779d", 1, 56, 56, 64, 6, 6, 64, 64, 1, 56, 56, 64, 1, 1, 1, 64, 1, 56, 56, 64]) ==========
 placeholder = PLACEHOLDER [1, 56, 56, 64]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*4) + eps), ((floormod(p, 14)*4) + nu), ci]
 B(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 6) == 5)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 6) == 4)), …(OMITTED)… (floormod(j, 6) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 6) == 0)), 1f, 0f))))))))))))))))))))))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [6, 6, 64, 64]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 4) == 2)), …(OMITTED)… 6) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 4), floormod(w, 4), ((((n*14)*14) + (floordiv(h, 4)*14)) + floordiv(w, 4)), co]
 placeholder = PLACEHOLDER [1, 56, 56, 64]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 placeholder = PLACEHOLDER [1, 1, 1, 64]
 T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 16 (workload key: ["b818b53148cd450f86569dfc3e04cb8a", 1, 56, 56, 64, 6, 6, 64, 64, 1, 1, 1, 64, 1, 56, 56, 64]) ==========
 placeholder = PLACEHOLDER [1, 56, 56, 64]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*4) + eps), ((floormod(p, 14)*4) + nu), ci]
 B(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 6) == 5)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 6) == 4)), …(OMITTED)… (floormod(j, 6) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 6) == 0)), 1f, 0f))))))))))))))))))))))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [6, 6, 64, 64]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 4) == 2)), …(OMITTED)… 6) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 4), floormod(w, 4), ((((n*14)*14) + (floordiv(h, 4)*14)) + floordiv(w, 4)), co]
 placeholder = PLACEHOLDER [1, 1, 1, 64]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 17 (workload key: ["3ea73fb9b0364374730d09e068821f95", 1, 56, 56, 64, 6, 6, 64, 64, 1, 56, 56, 64, 1, 56, 56, 64]) ==========
 placeholder = PLACEHOLDER [1, 56, 56, 64]
 data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
 input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*4) + eps), ((floormod(p, 14)*4) + nu), ci]
 B(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 6) == 5)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 6) == 4)), …(OMITTED)… (floormod(j, 6) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 6) == 0)), 1f, 0f))))))))))))))))))))))))))))))))))))
 data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
 placeholder = PLACEHOLDER [6, 6, 64, 64]
 bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
 A(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 4) == 2)), …(OMITTED)… 6) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 4), floormod(w, 4), ((((n*14)*14) + (floordiv(h, 4)*14)) + floordiv(w, 4)), co]
 placeholder = PLACEHOLDER [1, 56, 56, 64]
 T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
========== Task 18 (workload key: ["a5612fdeb9db4d579a75ec225ea4c06a", 1, 112, 112, 64, 1, 1, 1, 64, 1, 56, 56, 64]) ==========
 placeholder = PLACEHOLDER [1, 112, 112, 64]
 pad_temp(ax0, ax1, ax2, ax3) = tir.if_then_else(((((ax1 >= 1) && (ax1 < 113)) && (ax2 >= 1)) && (ax2 < 113)), placeholder[ax0, (ax1 - 1), (ax2 - 1), ax3], -3.40282e+38f)
tensor(ax0, ax1, ax2, ax3) max= pad_temp[ax0, ((ax1*2) + dh), ((ax2*2) + dw), ax3]
 placeholder = PLACEHOLDER [1, 1, 1, 64]
 T_add(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 19 (workload key: ["12b88bedece6984af589a28b43e0f3c4", 1, 224, 224, 3, 7, 7, 3, 64, 1, 1, 1, 64, 1, 112, 112, 64]) ==========
 placeholder = PLACEHOLDER [1, 224, 224, 3]
 PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 3) && (i1 < 227)) && (i2 >= 3)) && (i2 < 227)), placeholder[i0, (i1 - 3), (i2 - 3), i3], 0f)
 placeholder = PLACEHOLDER [7, 7, 3, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
 placeholder = PLACEHOLDER [1, 1, 1, 64]
 T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
 T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 20 (workload key: ["7006235cfc29b73be524cf390ed5a977", 1, 56, 56, 64, 1, 1, 64, 64, 1, 56, 56, 64]) ==========
 placeholder = PLACEHOLDER [1, 56, 56, 64]
 PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
 placeholder = PLACEHOLDER [1, 1, 64, 64]
 Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
========== Task 21 (workload key: ["f4380bb1dc62422a69ad4a1a9771f927", 1, 56, 56, 64, 1, 1, 64, 128, 1, 28, 28, 128]) ==========
 placeholder = PLACEHOLDER [1, 56, 56, 64]
 PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
 placeholder = PLACEHOLDER [1, 1, 64, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
========== Task 22 (workload key: ["f4380bb1dc62422a69ad4a1a9771f927", 1, 28, 28, 128, 1, 1, 128, 256, 1, 14, 14, 256]) ==========
 placeholder = PLACEHOLDER [1, 28, 28, 128]
 PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
 placeholder = PLACEHOLDER [1, 1, 128, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
========== Task 23 (workload key: ["f4380bb1dc62422a69ad4a1a9771f927", 1, 14, 14, 256, 1, 1, 256, 512, 1, 7, 7, 512]) ==========
 placeholder = PLACEHOLDER [1, 14, 14, 256]
 PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
 placeholder = PLACEHOLDER [1, 1, 256, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
Begin Tuning
Now, we set some options for tuning and launch the search tasks:
- measure_ctx launches a different process for measurement to provide isolation. It protects the main process from GPU crashes during measurement and avoids other runtime conflicts.
- min_repeat_ms defines the minimum duration of one "repeat" in every measurement. This can warm up the GPU, which is necessary to get accurate measurement results. Typically, we recommend a value >= 300 ms.
- num_measure_trials is the number of measurement trials we can use during the tuning. You can set it to a small number (e.g., 200) for a fast demonstrative run. In practice, we recommend setting it around 900 * len(tasks), which is typically enough for the search to converge. For example, there are 24 tasks in resnet-18, so we can set it to 20000. You can adjust this parameter according to your time budget (see the sketch after this list).
- In addition, we use RecordToFile to dump measurement records into a log file. The measurement records can be used to query the history best, resume the search, and do more analyses later.
- See auto_scheduler.TuningOptions and auto_scheduler.LocalRPCMeasureContext for more parameters.
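As referenced above, a small sketch of deriving num_measure_trials from the rule of thumb (the variable names are illustrative):

# Pick the trial budget: small for a demo, ~900 * len(tasks) for convergence.
num_measure_trials_demo = 200
num_measure_trials_full = 900 * len(tasks)
print("demo: %d trials, full tuning: %d trials"
      % (num_measure_trials_demo, num_measure_trials_full))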
def run_tuning():
    print("Begin tuning...")
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=300, timeout=10)

    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,  # change this to 20000 to achieve the best performance
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )

    tuner.tune(tune_option)


# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

# run_tuning()
Note
Explaining the printed information during the tuning
During the tuning, a lot of information will be printed on the console. It is used for debugging purposes. The most important information is the output of the task scheduler. The following table is a sample output.
 
------------------------------ [ Task Scheduler ]
| ID | Latency (ms) | Speed (GFLOPS) | Trials |
| 0 | 0.005 | 0.88 | 64 |
 | 1 | 0.010 | 99.10 | 64 |
 | 2 | 0.006 | 0.00 | 64 |
 | 3 | 0.145 | 979.78 | 384 |
 | 4 | 0.130 | 1097.02 | 384 |
 | 5 | 0.143 | 992.69 | 384 |
 | 6 | 0.076 | 1526.86 | 192 |
 | 7 | 0.115 | 999.44 | 320 |
 | 8 | 0.079 | 1449.39 | 320 |
 | 9 | 0.122 | 938.73 | 384 |
 | 10 | 0.063 | 1832.98 | 192 |
 | 11 | 0.072 | 1763.62 | 256 |
 | 12 | 0.062 | 2036.40 | 192 |
 | 13 | 0.068 | 1874.44 | 192 |
 | 14 | 0.049 | 2346.50 | 128 |
 | 15 | 0.076 | 1694.31 | 256 |
 | 16 | 0.067 | 1933.30 | 448 |
 | 17 | 0.076 | 1680.90 | 256 |
 | 18 | 0.022 | 98.43 | 64 |
 | 19 | 0.076 | 3112.55 | 192 |
 | 20 | 0.013 | 2026.44 | 64 |
 | 21 | 0.011 | 1136.69 | 64 |
 | 22 | 0.013 | 992.47 | 64 |
 | 23 | 0.020 | 627.56 | 64 |
 
Estimated total latency: 1.587 ms Trials: 4992 Used time : 13296 s Next ID: 3
This table lists the latency and (estimated) speed of all tasks. It also lists the allocation of measurement trials for all tasks. The last line prints the total weighted latency of these tasks, which can be a rough estimation of the end-to-end execution time of the network. The last line also prints the total number of measurement trials, the total time spent on auto-tuning, and the id of the next task to tune.
There will also be some "dmlc::Error"s and CUDA errors, because the auto-scheduler will try some invalid schedules. You can safely ignore them if the tuning continues; these errors are isolated from the main process.
Note
Terminating the tuning earlier
You can terminate the tuning earlier by forcibly killing this process. As long as you get at least one valid schedule for each task in the log file, you should be able to do the compilation (the section below). A hedged sketch of checking this follows.
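A sketch of such a check, using auto_scheduler.load_records to scan the log (illustrative, not part of the original tutorial):

from collections import Counter

# Count the valid measurement records per task in the log file.
valid_counts = Counter()
for inp, res in auto_scheduler.load_records(log_file):
    if res.error_no == 0:  # error_no == 0 means the measurement succeeded
        valid_counts[inp.task.workload_key] += 1
print("Tasks with at least one valid schedule: %d" % len(valid_counts))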
Compile and Evaluate
After auto-tuning, we can compile the network with the best schedules we found. All measurement records are dumped into the log file during auto-tuning, so we can read the log file and load the best schedules.
# Compile with the history best
print("Compile...")
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)
# Create graph runtime
ctx = tvm.context(str(target), 0)
module = graph_runtime.GraphModule(lib["default"](ctx))
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input("data", data_tvm)
# Evaluate
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
prof_res = np.array(ftimer().results) * 1e3  # convert to millisecond
print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res)))
Output:
Compile...
Evaluate inference time cost...
 Mean inference time (std dev): 3.22 ms (0.02 ms)
Other Tips
- During the tuning, the auto-scheduler needs to compile many programs and extract features from them. This part is CPU-intensive, so a high-performance CPU with many cores is recommended for faster search.
- You can use python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json to distill a large log file and only save the best useful records.
- You can resume a search from a previous log file. You just need to add a new argument load_log_file when creating the task scheduler in function run_tuning: tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
- If you have multiple target GPUs, you can use all of them for measurements to parallelize the measurements. Check out the docs on the RPC tracker and RPC server to learn how to do this. To use the RPC tracker in the auto-scheduler, replace the runner in TuningOptions with auto_scheduler.RPCRunner, as sketched after this list.
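As mentioned in the last tip, a hedged sketch of an RPC-based runner (the device key "1080ti" and the tracker address are assumptions for illustration):

# Replace the local runner with an RPC runner; "1080ti" must match the key
# used when registering your devices with the RPC tracker at 127.0.0.1:9190.
measure_runner = auto_scheduler.RPCRunner(
    "1080ti",
    host="127.0.0.1",
    port=9190,
    repeat=1,
    min_repeat_ms=300,
    timeout=10,
)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,
    runner=measure_runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)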