Analysis of TVM Model Deployment with C++
Official TVM C++ deployment tutorial
https://github.com/apache/tvm/tree/main/apps/howto_deploy
https://tvm.apache.org/docs/how_to/deploy/cpp_deploy.html
The official docs say that running the run_example.sh script completes the deployment.
Deploy TVM Modules with the C++ API
apps/howto_deploy provides an example of how to deploy a TVM module. To run the example, use the following commands:
cd apps/howto_deploy
./run_example.sh
Get the TVM runtime library
The only thing you need is to link against the TVM runtime on your target platform. TVM provides a minimal runtime, which costs roughly 300K to 600K depending on how many modules you use. In most cases you can use libtvm_runtime.so, which is produced by the build.
If you find it hard to build libtvm_runtime, check out tvm_runtime_pack.cc. It is an all-in-one example that provides the TVM runtime; you can compile this file with your own build system and include it in your project.
You can also check out the example applications built with TVM on iOS, Android, and other platforms.
Dynamic library vs. system module
TVM provides two ways to use a compiled library. See prepare_test_libs.py for how to generate the library and cpp_deploy.cc for how to use it:
Store the library as a shared object and load it dynamically into your project.
Bundle the compiled library into your project in system-module mode.
Dynamic loading is more flexible and lets you load new modules on the fly; the system module is a more static approach that can be used where dynamic library loading is forbidden, as illustrated in the sketch below.
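For illustration, here is a minimal sketch of the two modes. The library paths and function names follow the example code quoted below; treat them as placeholders for your own module.

#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

void LoadBothWays() {
  // Dynamic loading: the compiled library is a shared object on disk.
  tvm::runtime::Module mod_dylib =
      tvm::runtime::Module::LoadFromFile("lib/test_addone_dll.so");
  tvm::runtime::PackedFunc addone = mod_dylib.GetFunction("addone");

  // System-module mode: the compiled library is linked into the executable,
  // so it is fetched from the global registry instead of from a file.
  tvm::runtime::Module mod_syslib =
      (*tvm::runtime::Registry::Get("runtime.SystemLib"))();
  tvm::runtime::PackedFunc addonesys = mod_syslib.GetFunction("addonesys");
}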
C++ deployment code
https://github.com/apache/tvm/blob/main/apps/howto_deploy/cpp_deploy.cc
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*!
 * \brief Example code on load and run TVM module.s
 * \file cpp_deploy.cc
 */
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

#include <cstdio>

void Verify(tvm::runtime::Module mod, std::string fname) {
  // Get the function from the module.
  tvm::runtime::PackedFunc f = mod.GetFunction(fname);
  ICHECK(f != nullptr);
  // Allocate the DLPack data structures.
  //
  // Note that we use TVM runtime API to allocate the DLTensor in this example.
  // TVM accept DLPack compatible DLTensors, so function can be invoked
  // as long as we pass correct pointer to DLTensor array.
  //
  // For more information please refer to dlpack.
  // One thing to notice is that DLPack contains alignment requirement for
  // the data pointer and TVM takes advantage of that.
  // If you plan to use your customized data container, please
  // make sure the DLTensor you pass in meet the alignment requirement.
  //
  DLTensor* x;
  DLTensor* y;
  int ndim = 1;
  int dtype_code = kDLFloat;
  int dtype_bits = 32;
  int dtype_lanes = 1;
  int device_type = kDLCPU;
  int device_id = 0;
  int64_t shape[1] = {10};
  TVMArrayAlloc(shape, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);
  TVMArrayAlloc(shape, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);
  for (int i = 0; i < shape[0]; ++i) {
    static_cast<float*>(x->data)[i] = i;
  }
  // Invoke the function
  // PackedFunc is a function that can be invoked via positional argument.
  // The signature of the function is specified in tvm.build
  f(x, y);
  // Print out the output
  for (int i = 0; i < shape[0]; ++i) {
    ICHECK_EQ(static_cast<float*>(y->data)[i], i + 1.0f);
  }
  LOG(INFO) << "Finish verification...";
  TVMArrayFree(x);
  TVMArrayFree(y);
}

void DeploySingleOp() {
  // Normally we can directly
  tvm::runtime::Module mod_dylib = tvm::runtime::Module::LoadFromFile("lib/test_addone_dll.so");
  LOG(INFO) << "Verify dynamic loading from test_addone_dll.so";
  Verify(mod_dylib, "addone");
  // For libraries that are directly packed as system lib and linked together with the app
  // We can directly use GetSystemLib to get the system wide library.
  LOG(INFO) << "Verify load function from system lib";
  tvm::runtime::Module mod_syslib = (*tvm::runtime::Registry::Get("runtime.SystemLib"))();
  Verify(mod_syslib, "addonesys");
}

void DeployGraphExecutor() {
  LOG(INFO) << "Running graph executor...";
  // load in the library
  DLDevice dev{kDLCPU, 0};
  tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("lib/test_relay_add.so");
  // create the graph executor module
  tvm::runtime::Module gmod = mod_factory.GetFunction("default")(dev);
  tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
  tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");
  tvm::runtime::PackedFunc run = gmod.GetFunction("run");
  // Use the C++ API
  tvm::runtime::NDArray x = tvm::runtime::NDArray::Empty({2, 2}, DLDataType{kDLFloat, 32, 1}, dev);
  tvm::runtime::NDArray y = tvm::runtime::NDArray::Empty({2, 2}, DLDataType{kDLFloat, 32, 1}, dev);
  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      static_cast<float*>(x->data)[i * 2 + j] = i * 2 + j;
    }
  }
  // set the right input
  set_input("x", x);
  // run the code
  run();
  // get the output
  get_output(0, y);
  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      ICHECK_EQ(static_cast<float*>(y->data)[i * 2 + j], i * 2 + j + 1);
    }
  }
}

int main(void) {
  DeploySingleOp();
  DeployGraphExecutor();
  return 0;
}
The Makefile
https://github.com/apache/tvm/blob/main/apps/howto_deploy/Makefile
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Makefile Example to deploy TVM modules.
TVM_ROOT=$(shell cd ../..; pwd)
DMLC_CORE=${TVM_ROOT}/3rdparty/dmlc-core

PKG_CFLAGS = -std=c++14 -O2 -fPIC\
	-I${TVM_ROOT}/include\
	-I${DMLC_CORE}/include\
	-I${TVM_ROOT}/3rdparty/dlpack/include\
	-DDMLC_USE_LOGGING_LIBRARY=\<tvm/runtime/logging.h\>

PKG_LDFLAGS = -L${TVM_ROOT}/build -ldl -pthread

.PHONY: clean all

all: lib/cpp_deploy_pack lib/cpp_deploy_normal

# Build rule for all in one TVM package library
lib/libtvm_runtime_pack.o: tvm_runtime_pack.cc
	@mkdir -p $(@D)
	$(CXX) -c $(PKG_CFLAGS) -o $@ $^

# The code library built by TVM
lib/test_addone_sys.o: prepare_test_libs.py
	@mkdir -p $(@D)
	python3 prepare_test_libs.py

# Deploy using the all in one TVM package library
lib/cpp_deploy_pack: cpp_deploy.cc lib/test_addone_sys.o lib/libtvm_runtime_pack.o
	@mkdir -p $(@D)
	$(CXX) $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)

# Deploy using pre-built libtvm_runtime.so
lib/cpp_deploy_normal: cpp_deploy.cc lib/test_addone_sys.o
	@mkdir -p $(@D)
	$(CXX) $(PKG_CFLAGS) -o $@ $^ -ltvm_runtime $(PKG_LDFLAGS)

clean:
	rm -rf lib
Reading the Makefile together with the run_example.sh script:
The script first creates the lib directory and then runs sudo make; what make does is defined by the Makefile.
make first builds libtvm_runtime_pack.o, the all-in-one runtime pack object, in the lib folder.
It then runs prepare_test_libs.py to generate the model as three libraries, test_addone_dll.so, test_addone_sys.o, and test_relay_add.so, which cpp_deploy.cc uses, and finally produces two executables, cpp_deploy_pack and cpp_deploy_normal.
Our goal is to take a deep-learning network written in another framework, convert it into a .so file with TVM, deploy it with C++, and call it on the GPU. The code below follows the CPU deployment approach.
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
/*!
 * \brief Example code on load and run TVM module.s
 * \file cpp_deploy.cc
 */
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

template <typename Type>
Type stringToNum(const string& str)
{
    istringstream iss(str);
    Type num;
    iss >> num;
    return num;
}
void DeployGraphRuntime() {
ifstream in("/home/aiteam/tiwang/tvm-tfs-gpu-bkp/data.txt");
//int image[784];
string s;
int image_index=0;
/*
while(getline(in,s))
{
image[i]=stringToNum(s);
++i;
}*/
LOG(INFO) << "Running graph runtime...";
// load in the library
DLContext ctx{kDLGPU, 0};
tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("/home/aiteam/tiwang/tvm-tfs-gpu-bkp/model.so");
// create the graph runtime module
tvm::runtime::Module gmod = mod_factory.GetFunction("default")(ctx);
tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");
tvm::runtime::PackedFunc run = gmod.GetFunction("run");
// Use the C++ API
tvm::runtime::NDArray x = tvm::runtime::NDArray::Empty({1,784}, DLDataType{kDLFloat, 32, 1}, ctx);
tvm::runtime::NDArray y = tvm::runtime::NDArray::Empty({1, 10}, DLDataType{kDLFloat, 32, 1}, ctx);
while(getline(in,s))
{
    static_cast<float*>(x->data)[image_index]=((float)stringToNum<int>(s))/255;
    image_index++;
}
// set the right input
set_input("x", x);
// run the code
run();
// get the output
get_output(0, y);
for(int i=0;i<10;++i)
{
LOG(INFO)<<static_cast<float*>(y->data)[i];
}
/*
for (int i = 0; i < 2; ++i) {
for (int j = 0; j < 2; ++j) {
ICHECK_EQ(static_cast<float*>(y->data)[i * 2 + j], i * 2 + j + 1);
}
}*/
}
int main(void) {
//DeploySingleOp();
DeployGraphRuntime();
return 0;
}
The idea is simple: read the data in, set_input, run, and then get_output. However, after changing the target to cuda, this code does not run successfully on the GPU; it core dumps.
The reason is that to run the model on the GPU, you have to allocate memory on the GPU and copy the data over before running; this code does none of that, so it crashes at runtime. The sketch below shows the missing steps.
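A minimal sketch of those missing steps, using the same TVMArrayAlloc / TVMArrayCopyFromBytes / TVMArrayCopyToBytes calls as the full GPU code later in this post. The 1x784 input and 1x10 output shapes are taken from the model described below.

// Allocate the input and output tensors directly on the GPU.
// kDLGPU is spelled kDLCUDA in newer TVM releases.
DLTensor* x_gpu;
DLTensor* y_gpu;
int64_t in_shape[2] = {1, 784};
int64_t out_shape[2] = {1, 10};
TVMArrayAlloc(in_shape, 2, kDLFloat, 32, 1, kDLGPU, 0, &x_gpu);
TVMArrayAlloc(out_shape, 2, kDLFloat, 32, 1, kDLGPU, 0, &y_gpu);

// Copy host data into GPU memory instead of writing x->data directly.
float host_in[784] = {0};
float host_out[10];
TVMArrayCopyFromBytes(x_gpu, host_in, sizeof(host_in));
// ... set_input("x", x_gpu); run(); get_output(0, y_gpu); ...
TVMArrayCopyToBytes(y_gpu, host_out, sizeof(host_out));
TVMArrayFree(x_gpu);
TVMArrayFree(y_gpu);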
Below is the complete process of deploying with the TVM C++ API and calling the model on the GPU. The deep-learning model is an MNIST handwritten-digit network written in Keras and saved in pb format; the model code is omitted, and the pb file is read and converted directly. The model input is (1, 784) and the output is (1, 10).
Import the packages
import tvm
from tvm import te
from tvm import relay
# os and numpy
import numpy as np
import os.path
# Tensorflow imports
import tensorflow as tf
try:
    tf_compat_v1 = tf.compat.v1
except ImportError:
    tf_compat_v1 = tf
# Tensorflow utility functions
import tvm.relay.testing.tf as tf_testing
from tvm.contrib import graph_runtime
Parameter settings
#cpu
#target = "llvm"
#target_host = "llvm"
#layout = None
#ctx = tvm.cpu(0)
#gpu
target = "cuda"
target_host = "llvm"
layout = "NCHW"
ctx = tvm.gpu(0)
Prepare the data
from tensorflow.python.keras.datasets import mnist
from tensorflow.python.keras.utils import np_utils
(x_train,y_train),(x_test,y_test)=mnist.load_data()
x_test1=x_test.reshape(x_test.shape[0],x_test.shape[1]*x_test.shape[2])
print(x_train.shape,x_test.shape)
print(y_train.shape,y_test.shape)
x_train=x_train.reshape(x_train.shape[0],x_train.shape[1]*x_train.shape[2])
x_test=x_test.reshape(x_test.shape[0],x_test.shape[1]*x_test.shape[2])
x_train=x_train/255
x_test=x_test/255
y_train=np_utils.to_categorical(y_train)
y_test=np_utils.to_categorical(y_test)
print(x_train.shape,x_test.shape)
print(y_train.shape,y_test.shape)
with open("data.txt", 'w') as wf:
    for i in range(784):
        wf.write(str(x_test1[12][i]))
        wf.write('\n')
Load the model
with tf_compat_v1.gfile.GFile('./frozen_models/simple_frozen_graph.pb', "rb") as f:
    graph_def = tf_compat_v1.GraphDef()
    graph_def.ParseFromString(f.read())
    graph = tf.import_graph_def(graph_def, name="")
    # Call the utility to import the graph definition into default graph.
    graph_def = tf_testing.ProcessGraphDefParam(graph_def)
    # Add shapes to the graph.
    config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
    with tf_compat_v1.Session() as sess:
        graph_def = tf_testing.AddShapesToGraphDef(sess, "Identity")

tensor_name_list = [tensor.name for tensor in tf.compat.v1.get_default_graph().as_graph_def().node]
for tensor_name in tensor_name_list:
    print(tensor_name, '\n')
Build the Relay module
shape_dict = {"x": x_train[0:1].shape}
print(shape_dict)
dtype_dict = {"x": "uint8"}
mod, params = relay.frontend.from_tensorflow(graph_def, layout=layout, shape=shape_dict)
print("Tensorflow protobuf imported to relay frontend.")
Compile to a TVM module
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, target_host=target_host, params=params)
Test that the TVM module works
from tvm.contrib import graph_runtime
tt = np.zeros([1, 784])
i = 0
file = open("data.txt")
while 1:
    line = file.readline()
    if not line:
        break
    tt[0][i] = int(line)
    i += 1
file.close()
dtype = "float32"
m = graph_runtime.GraphModule(lib["default"](ctx))
# set inputs
m.set_input("x", tvm.nd.array(tt.astype(dtype)))
# execute
m.run()
# get outputs
tvm_output = m.get_output(0, tvm.nd.empty((1, 10), "float32"))
print(tvm_output.shape, tvm_output)
Save the model
from tvm.contrib import utils
temp=utils.tempdir()
path_lib=temp.relpath("/home/aiteam/test_code/model.so")
print(path_lib)
lib.export_library(path_lib)
print(temp.listdir())
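Before switching to C++, it is worth a quick sanity check that the exported model.so can be loaded back and gives the same result. This is only a sketch; it reuses the tt array, ctx, and the export path from the code above.

import tvm
from tvm.contrib import graph_runtime

# Load the exported shared library back and run it through the graph runtime.
loaded_lib = tvm.runtime.load_module("/home/aiteam/test_code/model.so")
m2 = graph_runtime.GraphModule(loaded_lib["default"](ctx))
m2.set_input("x", tvm.nd.array(tt.astype("float32")))
m2.run()
print(m2.get_output(0, tvm.nd.empty((1, 10), "float32")))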
Then go to the tvm/apps/howto_deploy directory and modify tvm_runtime_pack.cc, adding the following includes:
#include "../../src/runtime/cuda/cuda_device_api.cc"
#include "../../src/runtime/cuda/cuda_module.cc"
Then write another .cc file for your own deployment code and modify the Makefile to build it.
The file name is cpp_deploy_bkp.cc.
The modified Makefile:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Makefile Example to deploy TVM modules.
TVM_ROOT=$(shell cd ../..; pwd)
DMLC_CORE=${TVM_ROOT}/3rdparty/dmlc-core

PKG_CFLAGS = -std=c++14 -g -fPIC\
	-I${TVM_ROOT}/include\
	-I${DMLC_CORE}/include\
	-I${TVM_ROOT}/3rdparty/dlpack/include\
	-I/usr/local/cuda/include

PKG_LDFLAGS = -L${TVM_ROOT}/build -ldl -pthread -L/usr/local/cuda/lib64 -lcudart -lcuda

.PHONY: clean all

all: lib/libtvm_runtime_pack.o lib/cpp_deploy_pack
#all: lib/cpp_deploy_pack lib/cpp_deploy_normal

# Build rule for all in one TVM package library
.PHONY: lib/libtvm_runtime_pack.o
lib/libtvm_runtime_pack.o: tvm_runtime_pack.cc
	@mkdir -p $(@D)
	$(CXX) -c $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)

# Deploy using the all in one TVM package library
.PHONY: lib/cpp_deploy_pack
lib/cpp_deploy_pack: cpp_deploy_bkp.cc lib/libtvm_runtime_pack.o
	@mkdir -p $(@D)
	$(CXX) $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)
It adds the location of the CUDA headers and of the CUDA dynamic libraries.
cpp_deploy_bkp.cc
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
/*!
 * \brief Example code on load and run TVM module.s
 * \file cpp_deploy.cc
 */
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

template <typename Type>
Type stringToNum(const string& str)
{
    istringstream iss(str);
    Type num;
    iss >> num;
    return num;
}
void DeployGraphRuntime() {
constexpr int dtype_code= 2U;
constexpr int dtype_bits=32;
constexpr int dtype_lines=1;
constexpr int device_type= 2;
constexpr int device_id=0;
int ndim=2;
int64_t in_shape[2]={1,784};
int64_t out_shape[2]={1,10};
DLTensor* DLTX=nullptr;
DLTensor* DLTY=nullptr;
TVMArrayAlloc(in_shape,ndim,dtype_code,dtype_bits,dtype_lines,device_type,device_id,&DLTX);
TVMArrayAlloc(out_shape,ndim,dtype_code,dtype_bits,dtype_lines,device_type,device_id,&DLTY);
float img[784];
float rslt[10];
ifstream in("/home/aiteam/tiwang/data.txt");
//int image[784];
string s;
int image_index=0;
/*
while(getline(in,s))
{
image[i]=stringToNum(s);
++i;
}*/
bool enabled = tvm::runtime::RuntimeEnabled("cuda");
if (!enabled)
{
    LOG(INFO) << "Skip heterogeneous test because cuda is not enabled." << "\n";
    return;
}
LOG(INFO) << "Running graph runtime...";
// load in the library
DLContext ctx{kDLGPU, 0};
tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("/home/aiteam/test_code/model.so");
// create the graph runtime module
tvm::runtime::Module gmod = mod_factory.GetFunction("default")(ctx);
tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");
tvm::runtime::PackedFunc run = gmod.GetFunction("run");
// Use the C++ API
while(getline(in,s))
{
    if(image_index%28==0)
        printf("\n");
    //static_cast<float*>(x->data)[image_index]=((float)stringToNum(s))/255;
    img[image_index]=((float)stringToNum<int>(s))/255;
    int a=stringToNum<int>(s);
    printf("%4d",a);
    image_index++;
}
TVMArrayCopyFromBytes(DLTX,&img[0],image_index*sizeof(float));
// set the right input
set_input("x", DLTX);
// run the code
run();
// get the output
get_output(0, DLTY);
TVMArrayCopyToBytes(DLTY,&rslt[0],10*sizeof(float));
for(int i=0;i<10;++i)
{
LOG(INFO)<<rslt[i];
//LOG(INFO)<<static_cast<float*>(y->data)[i];
}
}
int main(void) {
//DeploySingleOp();
DeployGraphRuntime();
return 0;
}
Compared with the earlier CPU deployment code, GPU deployment adds an extra tensor-copy step. The NDArray-based variant sketched below does the same thing.
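The same copy can also be done with the C++ NDArray API instead of raw DLTensor handles. This is a sketch under the same 1x784 / 1x10 shape assumptions; NDArray::Empty and the CopyFromBytes / CopyToBytes members manage the device memory for you.

void RunOnGpuWithNDArray(tvm::runtime::Module gmod, const float* img) {
  // gmod is the graph runtime module created from the "default" function,
  // exactly as in the code above.
  tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
  tvm::runtime::PackedFunc run = gmod.GetFunction("run");
  tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");

  DLContext ctx{kDLGPU, 0};
  tvm::runtime::NDArray x =
      tvm::runtime::NDArray::Empty({1, 784}, DLDataType{kDLFloat, 32, 1}, ctx);
  tvm::runtime::NDArray y =
      tvm::runtime::NDArray::Empty({1, 10}, DLDataType{kDLFloat, 32, 1}, ctx);

  x.CopyFromBytes(img, 784 * sizeof(float));  // host -> device
  set_input("x", x);
  run();
  get_output(0, y);

  float rslt[10];
  y.CopyToBytes(rslt, 10 * sizeof(float));    // device -> host
  for (int i = 0; i < 10; ++i) LOG(INFO) << rslt[i];
}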
Reference:
https://discuss.tvm.apache.org/t/deploy-nnvm-module-using-c-on-gpu-using-opencl-target/229
Final result
First run sudo make in the tvm/apps/howto_deploy directory.
The build succeeds; then run the executable ./lib/cpp_deploy_pack.
References (excerpts from the two discussion threads are included below):
https://www.cnblogs.com/wangtianning1223/p/14662970.html
https://github.com/apache/tvm/tree/main/apps/howto_deploy
https://discuss.tvm.apache.org/t/deploy-nnvm-module-using-c-on-gpu-using-opencl-target/229/17
https://discuss.tvm.apache.org/t/performance-tvm-pytorch-bert-on-cpu/10200/10
[Performance] TVM - pytorch BERT on CPU
@masahi I add ONNX for the experiments in the following and it seems using ONNX-runtime can get the best performance no matter what the sequence length is (without tuning). I use ONNX-runtime with GraphOptimizationLevel.ORT_ENABLE_ALL as shown in this link. Besides, I plot the IR graph for ONNX, which is quite complicated.
Experiment 1 - Different Target
Experiment 3 - Different Length
- ONNX IR Graph: drive link
Also, I have some questions about AutoSchedule tuning.
- Q1: @merrymercy I was confused that when I use AutoSchedule to tune TVM, can I use target like llvm -libs=cblas or I should use only llvm. I found this will give different tasks to tune.
- Q2: @comaniac I think MXNet IR is more friendly than Pytorch IR for AutoSchedule tuning. I set the same parameters for tuning but Pytorch cannot get the results as MXNet (16ms for seq_len=128) The following are their tuning tasks and it seems quite different due to different IR graph. I still work on where the problem comes from, TVM front-end or original code implementation. But I think maybe TVM will have some transforms to generate similar IR graph even if from different framework.
  - 1st difference: MXNet will use nn.bias_add() and Pytorch will use relay.add(), which cause the tuning tasks not include this operation. (task 0,1,2,6)
  - 2nd difference: Their attention softmax operation have different shape, but I think this doesn't cause too much latency difference (task 4)
Tasks for Pytorch AutoSchedule Tuning (Target = llvm)
========== Task 0 (workload key: ["61f56dfd63fda28bc8bcf85739c8e9e3", 128, 3072, 768, 3072, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 3072]
placeholder = PLACEHOLDER [768, 3072]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 1 (workload key: ["61f56dfd63fda28bc8bcf85739c8e9e3", 128, 768, 3072, 768, 128, 3072]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [3072, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 2 (workload key: ["61f56dfd63fda28bc8bcf85739c8e9e3", 128, 768, 768, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 3 (workload key: ["d2a28fdf41e83222456f5a6e5bf8a24a", 12, 128, 128, 12, 64, 128, 12, 128, 64]) ==========
placeholder = PLACEHOLDER [12, 128, 128]
placeholder = PLACEHOLDER [12, 64, 128]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 4 (workload key: ["868c2771b1610bdac0ac73167691f4eb", 1, 12, 128, 128, 1, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [1, 12, 128, 128]
T_softmax_maxelem(i0, i1, i2) max= placeholder[i0, i1, i2, k]
T_softmax_delta(i0, i1, i2, i3) = (placeholder[i0, i1, i2, i3] - T_softmax_maxelem[i0, i1, i2])
T_fast_exp(ax0, ax1, ax2, ax3) = max((tir.reinterpret(tir.shift_left(int32((tir.floor(((max(min(T_softmax_delta[ax0, ax1, ax2, a …(OMITTED)… max_delta[ax0, ax1, ax2, ax3], 88.3763f), -88.3763f)*1.4427f) + 0.5f))*0.693147f))) + 1f)), T_softmax_delta[ax0, ax1, ax2, ax3])
T_softmax_expsum(i0, i1, i2) += T_fast_exp[i0, i1, i2, k]
T_softmax_norm(i0, i1, i2, i3) = (T_fast_exp[i0, i1, i2, i3]/T_softmax_expsum[i0, i1, i2])
========== Task 5 (workload key: ["d2a28fdf41e83222456f5a6e5bf8a24a", 12, 128, 64, 12, 128, 64, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [12, 128, 64]
placeholder = PLACEHOLDER [12, 128, 64]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 6 (workload key: ["61f56dfd63fda28bc8bcf85739c8e9e3", 128, 768, 2304, 768, 128, 2304]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [2304, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 7 (workload key: ["2dde9ffcbf97381c0f0307643e09dac5", 1, 128, 768, 1, 128, 1]) ==========
placeholder = PLACEHOLDER [1, 128, 768]
placeholder_red(ax0, ax1, ax2) += placeholder[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (placeholder_red[ax0, ax1, ax2]/768f)
========== Task 8 (workload key: ["dde89265d3f1a59075cee648386eac1e", 1, 128, 768, 1, 128, 1, 1, 128, 1]) ==========
placeholder = PLACEHOLDER [1, 128, 768]
placeholder = PLACEHOLDER [1, 128, 1]
T_subtract(ax0, ax1, ax2) = (placeholder[ax0, ax1, ax2] - placeholder[ax0, ax1, 0])
T_multiply(ax0, ax1, ax2) = (T_subtract[ax0, ax1, ax2]*T_subtract[ax0, ax1, ax2])
T_multiply_red(ax0, ax1, ax2) += T_multiply[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (T_multiply_red[ax0, ax1, ax2]/768f)
========== Task 9 (workload key: ["9e3bd222d4f8d250aeadf2fef0b15f2b", 1, 768, 768, 768, 768, 1, 768]) ==========
placeholder = PLACEHOLDER [1, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
T_minimum(ax0, ax1) = min(T_add[ax0, ax1], 9f)
T_maximum(ax0, ax1) = max(T_minimum[ax0, ax1], -9f)
T_fast_tanh(ax0, ax1) = ((T_maximum[ax0, ax1]*(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*(((T_maximum[ax0, ax1]*T_maximum[ax0, …(OMITTED)… T_maximum[ax0, ax1])*(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*1.19826e-06f) + 0.000118535f)) + 0.00226843f)) + 0.00489353f))
Tasks for MXNet AutoSchedule Tuning (Target = llvm)
========== Task 0 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 128, 3072, 768, 3072, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 3072]
placeholder = PLACEHOLDER [768, 3072]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 1 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 128, 768, 3072, 768, 3072, 128, 3072]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [3072, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [3072]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 2 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 128, 768, 768, 768, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 3 (workload key: ["d2a28fdf41e83222456f5a6e5bf8a24a", 12, 128, 128, 12, 64, 128, 12, 128, 64]) ==========
placeholder = PLACEHOLDER [12, 128, 128]
placeholder = PLACEHOLDER [12, 64, 128]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 4 (workload key: ["4b5e216f8244b4e8f7b6543c4a9087e5", 1536, 128, 1536, 128]) ==========
placeholder = PLACEHOLDER [1536, 128]
T_softmax_maxelem(i0) max= placeholder[i0, k]
T_softmax_delta(i0, i1) = (placeholder[i0, i1] - T_softmax_maxelem[i0])
T_fast_exp(ax0, ax1) = max((tir.reinterpret(tir.shift_left(int32((tir.floor(((max(min(T_softmax_delta[ax0, ax1], 88.3763f), -88. …(OMITTED)… oor(((max(min(T_softmax_delta[ax0, ax1], 88.3763f), -88.3763f)*1.4427f) + 0.5f))*0.693147f))) + 1f)), T_softmax_delta[ax0, ax1])
T_softmax_expsum(i0) += T_fast_exp[i0, k]
T_softmax_norm(i0, i1) = (T_fast_exp[i0, i1]/T_softmax_expsum[i0])
========== Task 5 (workload key: ["d2a28fdf41e83222456f5a6e5bf8a24a", 12, 128, 64, 12, 128, 64, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [12, 128, 64]
placeholder = PLACEHOLDER [12, 128, 64]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 6 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 128, 768, 2304, 768, 2304, 128, 2304]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [2304, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [2304]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 7 (workload key: ["2dde9ffcbf97381c0f0307643e09dac5", 128, 1, 768, 128, 1, 1]) ==========
placeholder = PLACEHOLDER [128, 1, 768]
placeholder_red(ax0, ax1, ax2) += placeholder[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (placeholder_red[ax0, ax1, ax2]/768f)
========== Task 8 (workload key: ["dde89265d3f1a59075cee648386eac1e", 128, 1, 768, 128, 1, 1, 128, 1, 1]) ==========
placeholder = PLACEHOLDER [128, 1, 768]
placeholder = PLACEHOLDER [128, 1, 1]
T_subtract(ax0, ax1, ax2) = (placeholder[ax0, ax1, ax2] - placeholder[ax0, ax1, 0])
T_multiply(ax0, ax1, ax2) = (T_subtract[ax0, ax1, ax2]*T_subtract[ax0, ax1, ax2])
T_multiply_red(ax0, ax1, ax2) += T_multiply[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (T_multiply_red[ax0, ax1, ax2]/768f)
========== Task 9 (workload key: ["9e3bd222d4f8d250aeadf2fef0b15f2b", 1, 768, 768, 768, 768, 1, 768]) ==========
placeholder = PLACEHOLDER [1, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
T_minimum(ax0, ax1) = min(T_add[ax0, ax1], 9f)
T_maximum(ax0, ax1) = max(T_minimum[ax0, ax1], -9f)
T_fast_tanh(ax0, ax1) = ((T_maximum[ax0, ax1]*(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*(((T_maximum[ax0, ax1]*T_maximum[ax0, …(OMITTED)… T_maximum[ax0, ax1])*(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*1.19826e-06f) + 0.000118535f)) + 0.00226843f)) + 0.00489353f))
Sorry for my many questions. I'll do my best to do more experiments and figure out why Pytorch AutoSchedule is not working as well as MXNet and why TVM is not working as expected when the sequence length increases.
Thanks for the plentiful information.
For Q1, when you extract tasks with llvm -mcpu=skylake-avx512 -libs=cblas, some operators (i.e., dense) will be offloaded to cblas. It means those operators won’t be compiled by the TVM codegen, so AutoScheduler won’t see and tune them.
For Q2, the two differences you pointed out seem not really impactful. Maybe you can try to use the debugger to compare the latency breakdown between the two models: Debugger — tvm 0.8.dev0 documentation
@comaniac I follow your instructions to use debugger and compare the latency in the following IR graphs (latency > 100us with orange color). And I find maybe some operations are hard to tune, such as fused_nn_contrib_dense_pack. All my experiments are done with opt_level=3, required_pass=["FastMath"]
Experiment 1 - Compare BERT on MXNet (68.4ms) and Pytorch (122.9ms)
- MXNet Debug IR Graph: drive link
- Pytorch Debug IR Graph: drive link
- From the above graphs, we can see MXNet uses the fused_nn_contrib_dense_pack_add operation while Pytorch uses fused_nn_contrib_dense_pack. This happens for all FC layers (4 in one transformer block); I use the latency of the first block as an example.
  - FC for Query, Key, and Value (M: 503us, P: 1314us)
  - FC after self-attention (M: 202us, P: 651us)
  - FC after layer normalization (M: 798us, P: 2578us)
  - FC after GELU (M: 786us, P: 2572us)
Experiment 2 - Compare BERT on MXNet (68.4ms) and MXNet-tune (autoschedule) (15.4ms)
- MXNet Debug IR Graph: drive link
- MXNet-tune Debug IR Graph: drive link
- From the above graphs, it is easy to see that tuning reduces the latency of most dense and batch_matmul operations. I take the first transformer block as an example.
  - FC for Query, Key, and Value (M: 503us, M-tune: 229us)
  - FC after self-attention (M: 202us, M-tune: 99us)
  - FC after layer normalization (M: 798us, M-tune: 292us)
  - FC after GELU (M: 786us, M-tune: 367us)
  - batch_matmul for Query and Key (M: 1828us, M-tune: 41us)
  - batch_matmul for Attention and Value (M: 1312us, M-tune: 29us)
Experiment 3 - Compare BERT on Pytorch (122.9ms) and Pytorch-tune (autoschedule) (64.2ms)
- Pytorch Debug IR Graph: drive link
- Pytorch-tune Debug IR Graph: drive link
- From the above graphs, we can see that tuning only reduces the latency of batch_matmul but not of the dense operations. I take the first transformer block as an example.
  - FC for Query, Key, and Value (P: 1314us, P-tune: 1276us)
  - FC after self-attention (P: 651us, P-tune: 403us)
  - FC after layer normalization (P: 2578us, P-tune: 1779us)
  - FC after GELU (P: 2572us, P-tune: 1684us)
  - batch_matmul for Query and Key (P: 1624us, P-tune: 78us)
  - batch_matmul for Attention and Value (P: 1265us, P-tune: 70us)
Therefore, I suspect the problem comes from the fused_nn_contrib_dense_pack operation, but I don't know why it is slower than fused_nn_contrib_dense_pack_add, even though the latter has an additional add operation for the bias. If you want the whole debug files (json and logs), I provide them in this drive link.
@comaniac I changed the TVM version to commit id 91e07e1f3a7 (Feb. 5, 2021), which is the same as this repo, and the problem is solved because fused_nn_batch_matmul is now used for all FC (dense) layers rather than fused_nn_contrib_dense_pack. I think the problem comes from the PR you provided, which also means I cannot use -libs=mkl. I include the debug IR graphs in the following, and now Pytorch can speed up from the Pytorch script's 26.63ms to 17.36ms after tuning with autoschedule.
- Pytorch Debug IR Graph: drive link
- Pytorch-tune Debug IR Graph: drive link
Hmm, I don't think that PR is the root cause of using batch_matmul in the BERT model though. It might be due to this PR that uses dense instead of batch_matmul when the input shapes are 2D and 3D:
github.com/apache/tvm: "Fix PyTorch matmul conversion when given (2-dim, N-dim) input pair" (apache:main ← yuchaoli:main, opened Apr 14, 2021). This PR change the matmul conversion in PyTorch when given (2-dim, N-dim) input …
There’s a known issue that TVM’s dense op and batch_matmul op with Y = X * W^T does have bad performance in some models.
There’re several matmul & batch_matmul ops in bert that takes data tensor as both input and weight(exp. those in multi-head attentions) rather than use const parameter as weight. In such situation, we would see some explicit transpose inserted when the model are imported from TensorFlow or Pytorch(they use X * W for matmul by default). For the MXNet, as far as I know, it uses X * W^T by default.
The PR you found looks like creating a special schedule for dense + transpose, I’m not sure if that’s the key of the performance improving you got because it is written for AutoTVM and AutoScheduler will never use these manual schedule. You can have a more detailed analyse among those dense/batch_matmul ops’ layout and shape.
I would agree with @comaniac that the mis-conversion from dense to batch_matmul caused some waste of computation before.
will auto-schedule work in the case of "llvm -libs=cblas"? e.g. a network may contain matmul ops (leveraging cblas) and other ops (leveraging auto-schedule to get performance)?
Deploy NNVM module using C++ on GPU using OpenCL target
Jul '18
@masahi Thanks for the sample.
In your sample code, is "tvm_input" the CPU byte array copied to "x" (the GPU array)?
That is, is it TVMArrayCopyFromBytes(destination, source, size)?
for (int i = 0; i < n_samples; ++i) {
TVMArrayCopyFromBytes(x, &tvm_input[i * in_size], in_size * sizeof(float));
set_input(input_name.c_str(), x);
run();
get_output(0, y);
TVMArrayCopyToBytes(y, &tvm_output[i * out_size], out_size * sizeof(float));
}
Related topic: C++ deploy example: switch to OpenCL
Jul '18
yes, source is on cpu and x is on gpu. In my code, tvm_input should contain input data coming from your input image, for example.
Jul '18
@masahi
instead of tvm_input I created a DLTensor 'z' of device type = kDLCPU, for storing the input image.
I copied the byte array as shown below:
https://gist.github.com/rajh619/74538a7b3a7e1b89a2ae89db5ab24054
but I couldn't find the copied bytes in the destination (x->data)?
Jul '18
If you already have your data in DLtensor, you should use TVMArrayCopyFromTo.
Jul '18
@masahi
ok, now I tried with TVMArrayCopyFromTo:
TVMArrayCopyFromTo(z, x, nullptr);
but the same issue happens, I couldn't find the bytes copied to x->data.
I think x->data should be the same as z->data (the image data). Please correct me if I am wrong?
if x is on GPU, you should NEVER touch x->data. You either get segfault or complete junk.
If you want to access x->data, copy x to CPU first.
@masahi Thanks, got it working. I copied back from GPU to CPU (x->data to k->data) and validated the data.
After executing "run()", I was able to get the output to the CPU in two ways:
- allocate a tvm array for the output tensor "y" with device type CPU (1), then tvm_output(0, y); y->data contains the output (I think internally tvm copies the output from the device to the cpu host?)
- allocate a tvm array for the output tensor "y" with device type GPU (4), tvm_output(0, y), then copy bytes from GPU to CPU into out_vector[] (similar to your sample code).
Out of both which is the right way to extract output ?
The answer is 2.
See my sample code. The output y is GPU tensor. I copy y to tvm_output, which is on cpu.
Hi, @masahi I am still getting a segfault even when I use your sample code; after allocating memory I am doing memset to 0:
DLTensor* x = nullptr;
DLTensor* y = nullptr;
const int in_ndim = 4;
const int out_ndim = in_ndim;
const int num_slice = 1;
const int num_class = 4;
const int shrink_size[] = { 256, 256 };
const int64_t in_shape[] = { num_slice, 1, shrink_size[0], shrink_size[1] };
const int64_t out_shape[] = { num_slice, num_class, shrink_size[0], shrink_size[1] };
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);
TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);
memset(x->data, 0, 4265265);
It is happening for CUDA and OpenCL; for llvm it's working fine.
you can’t use memset on GPU memory.
hi, @masahi it is still not working; it still throws a memory error.
void FR_TVM_Deploy::forward(float* imgData)
{
int in_size = (1 * 64 * 64 * 3 * 4);
constexpr int dtype_code = kDLFloat;
constexpr int dtype_bits = 32;
constexpr int dtype_lanes = 1;
constexpr int device_type = kDLCPU;
constexpr int device_id = 0;
constexpr int in_ndim = 4;
const int64_t in_shape[in_ndim] = {1, 64, 64, 3};
//Allocating memory to the DLTensor object
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);
TVMArrayCopyFromBytes(input, imgData, in_size);
//Get globl function module for graph runtime
tvm::runtime::Module* mod = (tvm::runtime::Module*)handle;
// get the function from the module(set input data)
tvm::runtime::PackedFunc set_input = mod->GetFunction("set_input");
set_input("input", input);
// get the function from the module(run it)
tvm::runtime::PackedFunc run = mod->GetFunction("run");
run();
int out_ndim = 2;
int64_t out_shape[2] = {1, 256};
TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &output);
// get the function from the module(get output data)
tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
get_output(0, output);
size_t out_size = out_shape[0] * out_shape[1];
std::vector<float> tvm_output(out_size, 0);
TVMArrayCopyToBytes(output, &tvm_output[out_size], out_size);
TVMArrayFree(input);
TVMArrayFree(output);
}
When I print the tvm_output vector I am getting all 0s, meaning the output is 0; in the llvm case I get the correct output. Here I am printing the vector tvm_output in a loop; is there any other way to check the output?