tensorrt基础知识+torch版lenet转c++ trt
Official documentation
API documentation
Docker images
Custom plugin repository
0. Installation
1. Install TensorRT
Download the .deb package from the official site; pay attention to the CUDA version it was built for.
```bash
sudo dpkg -i nv-tensorrt-repo-ubuntu1604-cuda10.0-trt7.0.0.11-ga-20191216_1-1_amd64.deb
sudo apt update
sudo apt install tensorrt
```

Note: engine plan compatibility depends on the GPU's compute capability and the TensorRT version; it does not depend on the CUDA or cuDNN version.
2. Install OpenCV
```bash
sudo apt-get update
sudo apt install libopencv-dev
```

If `apt-get install tensorrt` fails with dependency errors:
https://github.com/NVIDIA/TensorRT/issues/792
```text
tensorrt : Depends: libnvinfer7 (= 7.0.0-1+cuda10.0) but 7.2.2-1+cuda11.1 is to be installed
           Depends: libnvinfer-plugin7 (= 7.0.0-1+cuda10.0) but 7.2.2-1+cuda11.1 is to be installed
           Depends: libnvparsers7 (= 7.0.0-1+cuda10.0) but 7.2.2-1+cuda11.1 is to be installed
           Depends: libnvonnxparsers7 (= 7.0.0-1+cuda10.0) but 7.2.2-1+cuda11.1 is to be installed
           Depends: libnvinfer-bin (= 7.0.0-1+cuda10.0) but it is not going to be installed
           Depends: libnvinfer-dev (= 7.0.0-1+cuda10.0) but 7.2.2-1+cuda11.1 is to be installed
           Depends: libnvinfer-plugin-dev (= 7.0.0-1+cuda10.0) but 7.2.2-1+cuda11.1 is to be installed
           Depends: libnvparsers-dev (= 7.0.0-1+cuda10.0) but 7.2.2-1+cuda11.1 is to be installed
           Depends: libnvonnxparsers-dev (= 7.0.0-1+cuda10.0) but 7.2.2-1+cuda11.1 is to be installed
           Depends: libnvinfer-samples (= 7.0.0-1+cuda10.0) but it is not going to be installed
           Depends: libnvinfer-doc (= 7.0.0-1+cuda10.0) but it is not going to be installed
```

the fix is to move the nvidia-ml apt source out of the way:

```bash
mv /etc/apt/sources.list.d/nvidia-ml.list /etc/apt/sources.list.d/nvidia-ml.list.bak
```
Then run `apt-get install tensorrt` again and it will go through.
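To sanity-check the install, here is a minimal sketch (assuming the .deb put the headers and libnvinfer on the default paths) that prints the version the TensorRT headers and library report:

```cpp
// version_check.cpp: print the TensorRT version seen at compile time and at
// run time. Build with: g++ version_check.cpp -o version_check -lnvinfer
#include <iostream>
#include "NvInfer.h"

int main() {
    // Compile-time version from the installed headers
    std::cout << "header version: " << NV_TENSORRT_MAJOR << "."
              << NV_TENSORRT_MINOR << "." << NV_TENSORRT_PATCH << std::endl;
    // Run-time version from libnvinfer, e.g. 7000 for TensorRT 7.0.0
    std::cout << "library version: " << getInferLibVersion() << std::endl;
    return 0;
}
```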
1. Optimization workflow
TensorRT has five phases in total: create the network, build the inference engine, serialize the engine, deserialize the engine, and execute inference with the engine.
Phases 1-3 cover building: a network written with the C++ API (or imported from another framework's format) is defined through an INetworkDefinition; the builder loads the model weights and applies a series of optimizations, and the engine is then serialized into a "plan" (flow graph), which stores not only the weights needed at compute time but also the kernel execution schedule.
Phases 4-5 are inference: deserialize the engine, create the runtime context, and then run inference.
In other words, once TensorRT has obtained the network's computation graph, it optimizes that graph.
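In code, the five phases map onto the TensorRT 7 C++ API roughly as follows. This is only a condensed sketch (the layer definitions and all error handling are elided; the complete LeNet version appears in section 2.4):

```cpp
// Condensed sketch of the five phases (TensorRT 7 API); "logging.h" providing
// the Logger class is assumed, as in the project later in this article.
#include "NvInfer.h"
#include "logging.h"
#include "cuda_runtime_api.h"
using namespace nvinfer1;
static Logger gLogger;

void fivePhases(void** buffers, cudaStream_t stream) {
    // 1. Create the network definition
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetworkV2(0U);
    // ... addInput / addConvolution / ... / markOutput ...

    // 2. Build the inference engine (weights baked in, kernels selected)
    IBuilderConfig* config = builder->createBuilderConfig();
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // 3. Serialize the engine into a "plan" blob (normally written to disk)
    IHostMemory* plan = engine->serialize();

    // 4. Deserialize the plan back into an engine (normally in the deploy process)
    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine2 = runtime->deserializeCudaEngine(plan->data(), plan->size(), nullptr);

    // 5. Execute inference through an execution context
    IExecutionContext* context = engine2->createExecutionContext();
    context->enqueue(1, buffers, stream, nullptr);
}
```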
When a deep learning framework runs inference, it calls one or more library functions for every layer. Because these operations all run on the GPU, they trigger many CUDA kernel launches. Compared with the kernel-launch overhead and the reads/writes of each layer's tensor data, the kernel computation itself is fast and lightweight, so the program ends up bound by memory bandwidth and GPU utilization suffers.
TensorRT attacks this problem in three ways:
Vertical kernel fusion: fuse consecutive operations to cut kernel-launch overhead and avoid round-trips to device memory between layers. As shown in the figure above, the convolution, bias, and ReLU layers can be fused into a single kernel, referred to as CBR.
Horizontal kernel fusion: TensorRT looks for layers that take the same input and have the same filter size but different weights, and runs them as a single kernel instead of several, shown as the extra-wide 1x1 CBR in the figure above; merging structurally identical layers with different weights into one wider layer reduces CUDA core usage.
Concatenation elimination: by pre-allocating the output buffer and writing into it in a strided fashion, the explicit concatenation copy is avoided.
Through these optimizations TensorRT produces a smaller, faster, more efficient computation graph, with fewer layers and fewer kernel launches. The table below lists the layer counts of several common networks after TensorRT optimization; TensorRT clearly streamlines the network structure and reduces the layer count, which is where the performance gains come from.
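Fusion happens automatically during the build, but its effect can be observed. One way (a sketch added here, not from the original article) is to attach an IProfiler to the execution context: the per-layer names TensorRT then reports show fused layers as single entries with combined names such as "conv1 + relu1":

```cpp
// Sketch: per-layer profiling with nvinfer1::IProfiler (TensorRT 7 API).
// Fused layers appear as one entry with a combined name.
#include <iostream>
#include "NvInfer.h"

struct SimpleProfiler : public nvinfer1::IProfiler {
    // Called once per layer each time a profiled execution finishes
    void reportLayerTime(const char* layerName, float ms) override {
        std::cout << layerName << ": " << ms << " ms" << std::endl;
    }
};

// Usage, inside the inference code of section 2.4:
//   SimpleProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(1, buffers);  // profiling requires the synchronous call
```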
2. Converting the PyTorch LeNet to TRT
2.1 PyTorch code:
lenet.py
```python
# coding:utf-8
import torch
from torch import nn
from torch.nn import functional as F


class Lenet5(nn.Module):
    """for cifar10 dataset."""
    def __init__(self):
        super(Lenet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0)
        self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2, padding=0)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool1(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=1)
        return x


def main():
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    print('cuda device count: ', torch.cuda.device_count())
    torch.manual_seed(1234)
    net = Lenet5()
    net = net.to('cuda:0')
    net.eval()

    import time
    st_time = time.time()
    nums = 10000
    for i in range(nums):
        tmp = torch.ones(1, 1, 32, 32).to('cuda:0')
        out = net(tmp)
    print('lenet out:', out)
    end_time = time.time()
    print('==cost time{}'.format((end_time - st_time)))
    torch.save(net, "lenet5.pth")


if __name__ == '__main__':
    main()
```

The model weights are saved to lenet5.pth, and the measured time was:
2.2 Saving .pth as .onnx
This makes it easy to inspect the network structure.
Add a `model_onnx()` export function to lenet.py (the `Lenet5` definition and `main()` are unchanged from above):

```python
def model_onnx():
    input = torch.ones(1, 1, 32, 32, dtype=torch.float32).cuda()
    model = Lenet5()
    model = model.cuda()
    torch.onnx.export(model, input, "./lenet.onnx", verbose=True)


if __name__ == '__main__':
    # main()
    model_onnx()
```

Converting to ONNX can run into several kinds of problems; the following export call resolved essentially all of them:
```python
torch.onnx.export(model,                  # model being run
                  input,                  # model input (or a tuple for multiple inputs)
                  "./xxxx.onnx",          # where to save the model
                  opset_version=10,
                  verbose=False,
                  training=False,
                  do_constant_folding=True,
                  input_names=['input'],
                  output_names=['output'])
```
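As an aside, once the .onnx file exists, TensorRT can also build an engine directly from it with the ONNX parser instead of re-coding the layers by hand. A minimal sketch with the TensorRT 7 API (the Logger from logging.h is assumed; link against nvinfer and nvonnxparser):

```cpp
// Sketch: build an engine from lenet.onnx via nvonnxparser (TensorRT 7).
#include "NvInfer.h"
#include "NvOnnxParser.h"
#include "logging.h"
using namespace nvinfer1;
static Logger gLogger;

ICudaEngine* engineFromOnnx(const char* onnxPath) {
    IBuilder* builder = createInferBuilder(gLogger);
    // ONNX models must be parsed into an explicit-batch network
    const auto flag = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(flag);

    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
    if (!parser->parseFromFile(onnxPath, static_cast<int>(ILogger::Severity::kWARNING))) {
        return nullptr;  // parse errors have been logged
    }

    IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 20);
    return builder->buildEngineWithConfig(*network, *config);
}
```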
2.3 Saving .pth as .wts

Store the model weights as key/value pairs in a hex-encoded text file: inference.py
```python
import torch
from torch import nn
from lenet5 import Lenet5
import os
import struct


def main():
    print('cuda device count: ', torch.cuda.device_count())
    net = torch.load('lenet5.pth')
    net = net.to('cuda:0')
    net.eval()

    tmp = torch.ones(1, 1, 32, 32).to('cuda:0')
    out = net(tmp)
    print('lenet out:', out)

    f = open("lenet5.wts", 'w')
    print('==net.state_dict().keys():', net.state_dict().keys())
    f.write("{}\n".format(len(net.state_dict().keys())))
    for k, v in net.state_dict().items():
        print('key: ', k)
        print('value: ', v.shape)
        vr = v.reshape(-1).cpu().numpy()
        f.write("{} {}".format(k, len(vr)))
        for vv in vr:
            f.write(" ")
            f.write(struct.pack(">f", float(vv)).hex())  # big-endian float as 8 hex chars
        f.write("\n")
    print('==f:', f)


def test_struct():
    vv = 16
    print(struct.pack(">f", float(vv)))


if __name__ == '__main__':
    main()
    # test_struct()
```
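Each value the script writes is the raw IEEE-754 bytes of one float, big-endian, rendered as 8 hex characters (exactly what struct.pack(">f", v).hex() produces); the loadWeights function in the next section reads them back with `>> std::hex`. A tiny sketch of the same encoding from the C++ side, handy for spot-checking a value:

```cpp
// Sketch: the .wts hex encoding of a single float. 16.0f has bit pattern
// 0x41800000, so it appears in lenet5.wts as "41800000" (cf. test_struct()).
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float v = 16.0f;
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));  // reinterpret the float's bits
    std::printf("%08x\n", bits);           // prints 41800000
    return 0;
}
```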
2.4 Converting .wts to .engine and running inference with the engine

lenet.cpp
```cpp
#include <map>
#include <chrono>
#include <fstream>
#include <iostream>
#include <cassert>
#include "NvInfer.h"
#include "logging.h"
#include "cuda_runtime_api.h"

static const int INPUT_H = 32;
static const int INPUT_W = 32;
static const int BATCH_SIZE = 32;
static const int OUTPUT_SIZE = 10;
static const int INFER_NUMS = 10000;
const char* INPUT_BLOB_NAME = "data";
const char* OUTPUT_BLOB_NAME = "prob";

using namespace nvinfer1;
static Logger gLogger;

#define CHECK(status) \
    do\
    {\
        auto ret = (status);\
        if (ret != 0)\
        {\
            std::cerr << "Cuda failure: " << ret << std::endl;\
            abort();\
        }\
    } while (0)

std::map<std::string, Weights> loadWeights(const std::string file)
{
    std::cout << "Loading weights: " << file << std::endl;
    std::map<std::string, Weights> weightMap;

    // Open weights file
    std::ifstream input(file);
    assert(input.is_open() && "Unable to load weight file.");

    // Read number of weight blobs
    int32_t count;
    input >> count;
    assert(count > 0 && "Invalid weight map file.");

    while (count--)
    {
        Weights wt{DataType::kFLOAT, nullptr, 0};
        uint32_t size;

        // Read name and size of blob
        std::string name;
        input >> name >> std::dec >> size;
        wt.type = DataType::kFLOAT;

        // Load blob (each value is 8 hex chars, one uint32 per float)
        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));
        for (uint32_t x = 0, y = size; x < y; ++x)
        {
            input >> std::hex >> val[x];
        }
        wt.values = val;
        wt.count = size;
        weightMap[name] = wt;
    }
    return weightMap;
}

ICudaEngine* createLenetEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)
{
    // Start defining the network (0U: implicit-batch network)
    INetworkDefinition* network = builder->createNetworkV2(0U);

    ITensor* input = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
    assert(input);

    // Load the weights into weightMap
    std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");

    // Convolution layer
    IConvolutionLayer* conv1 = network->addConvolution(*input, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
    assert(conv1);
    conv1->setStrideNd(DimsHW{1, 1});  // set stride

    // Activation layer
    IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
    assert(relu1);

    // Pooling layer
    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
    assert(pool1);
    pool1->setStrideNd(DimsHW{2, 2});

    // Convolution layer
    IConvolutionLayer* conv2 = network->addConvolution(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap["conv2.weight"], weightMap["conv2.bias"]);
    assert(conv2);
    conv2->setStrideNd(DimsHW{1, 1});

    // Activation layer
    IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);
    assert(relu2);

    // Pooling layer
    IPoolingLayer* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
    assert(pool2);
    pool2->setStrideNd(DimsHW{2, 2});

    // Fully connected layer
    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
    assert(fc1);

    // Activation layer
    IActivationLayer* relu3 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);
    assert(relu3);

    // Fully connected layer
    IFullyConnectedLayer* fc2 = network->addFullyConnected(*relu3->getOutput(0), 84, weightMap["fc2.weight"], weightMap["fc2.bias"]);
    assert(fc2);

    // Activation layer
    IActivationLayer* relu4 = network->addActivation(*fc2->getOutput(0), ActivationType::kRELU);
    assert(relu4);

    // Fully connected layer
    IFullyConnectedLayer* fc3 = network->addFullyConnected(*relu4->getOutput(0), OUTPUT_SIZE, weightMap["fc3.weight"], weightMap["fc3.bias"]);
    assert(fc3);

    // Classification (softmax) layer
    ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
    assert(prob);
    prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
    network->markOutput(*prob->getOutput(0));

    // Build the engine
    builder->setMaxBatchSize(maxBatchSize);
    config->setMaxWorkspaceSize(1 << 20);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // Everything is baked into the engine now, so the network can be destroyed
    network->destroy();

    // Release the host memory holding the weights
    for (auto& mem : weightMap)
    {
        free((void*)(mem.second.values));
    }
    return engine;
}

void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)
{
    // Create the builder (the entry point, analogous to the model in pytorch)
    IBuilder* builder = createInferBuilder(gLogger);
    IBuilderConfig* config = builder->createBuilderConfig();

    // Build the network layer by layer and create the engine
    ICudaEngine* engine = createLenetEngine(maxBatchSize, builder, config, DataType::kFLOAT);
    assert(engine != nullptr);

    // Serialize the engine
    (*modelStream) = engine->serialize();

    // Destroy objects
    engine->destroy();
    builder->destroy();
}

void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
    // Recover the engine from the context passed in
    const ICudaEngine& engine = context.getEngine();

    // There are exactly two bindings: one input and one output
    assert(engine.getNbBindings() == 2);
    void* buffers[2];

    // Get the binding indices of the engine's input/output tensors
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);

    // Allocate device memory for the input/output tensors
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));

    // Create a CUDA stream to manage concurrent copies and computation
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Host to device: input is the data in host memory; buffers[inputIndex] is
    // the device buffer that will hold the input
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));

    // Launch the CUDA kernels: run inference asynchronously
    context.enqueue(batchSize, buffers, stream, nullptr);

    // Device to host: buffers[outputIndex] holds the model output on the
    // device; output is the host-side buffer
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));

    // Wait for the stream's asynchronous work to finish
    cudaStreamSynchronize(stream);

    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

int main(int argc, char** argv)
{
    if (argc != 2)
    {
        std::cerr << "arguments not right!" << std::endl;
        std::cerr << "./lenet -s   // serialize model to plan file" << std::endl;
        std::cerr << "./lenet -d   // deserialize plan file and run inference" << std::endl;
        return -1;
    }

    // Serialize the model into a .engine file
    if (std::string(argv[1]) == "-s")
    {
        // modelStream is a host memory region holding the serialized engine
        IHostMemory* modelStream{nullptr};
        APIToModel(1, &modelStream);
        assert(modelStream != nullptr);

        // Write it out as the .engine file
        std::ofstream p("lenet.engine", std::ios::binary);
        if (!p)
        {
            std::cerr << "can not open plan file" << std::endl;
            return -1;
        }
        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());

        // Destroy the object
        modelStream->destroy();
    }
    else if (std::string(argv[1]) == "-d")
    {
        char* trtModelStream{nullptr};
        size_t size{0};
        std::ifstream file("lenet.engine", std::ios::binary);
        if (file.good())
        {
            file.seekg(0, file.end);
            size = file.tellg();
            file.seekg(0, file.beg);
            trtModelStream = new char[size];
            assert(trtModelStream);
            file.read(trtModelStream, size);
            file.close();
        }
        else
        {
            return -1;
        }

        // Fake input data
        float data[INPUT_H * INPUT_W];
        for (int i = 0; i < INPUT_W * INPUT_H; i++)
        {
            data[i] = 1.0;
        }

        // Create the runtime (IRuntime) object
        IRuntime* runtime = createInferRuntime(gLogger);
        assert(runtime != nullptr);
        ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
        assert(engine != nullptr);

        // Create the execution context, used in doInference to launch the CUDA kernels
        IExecutionContext* context = engine->createExecutionContext();
        assert(context != nullptr);

        // Run inference INFER_NUMS (10000) times and store the result
        float prob[OUTPUT_SIZE];
        auto start = std::chrono::system_clock::now();  // start time
        for (int i = 0; i < INFER_NUMS; i++)
        {
            doInference(*context, data, prob, 1);
        }
        auto end = std::chrono::system_clock::now();  // end time
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;

        context->destroy();
        engine->destroy();
        runtime->destroy();

        std::cout << "prob:";
        for (int i = 0; i < OUTPUT_SIZE; i++)
        {
            std::cout << prob[i] << ",";
        }
    }
    else
    {
        return -1;
    }
    return 0;
}
```

CMakeLists.txt
```cmake
cmake_minimum_required(VERSION 2.6)

project(lenet)

add_definitions(-std=c++11)

set(TARGET_NAME "lenet")

option(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_BUILD_TYPE Debug)

include_directories(${PROJECT_SOURCE_DIR}/include)
# include and link dirs of cuda and tensorrt, you need adapt them if yours are different
# cuda
include_directories(/usr/local/cuda/include)
link_directories(/usr/local/cuda/lib64)
# tensorrt (deb install)
include_directories(/usr/include/x86_64-linux-gnu)
link_directories(/usr/lib/x86_64-linux-gnu)
# tensorrt (tar install)
#include_directories(/red_detection/tensorrt_learn/software/TensorRT-7.0.0.11/include)
#link_directories(/red_detection/tensorrt_learn/software/TensorRT-7.0.0.11/lib)

FILE(GLOB SRC_FILES ${PROJECT_SOURCE_DIR}/lenet.cpp ${PROJECT_SOURCE_DIR}/include/*.h)

add_executable(${TARGET_NAME} ${SRC_FILES})
target_link_libraries(${TARGET_NAME} nvinfer)
target_link_libraries(${TARGET_NAME} cudart)

add_definitions(-O2 -pthread)
```

`./lenet -s` converts the weights into the .engine file,
and `./lenet -d` runs inference.
Inference time:
Compared with the PyTorch version, inference is at least 4x faster, while the results are essentially identical.
Some very good repositories:
https://github.com/wang-xinyu/tensorrtx
https://github.com/zerollzeng/tiny-tensorrt