Leveraging the Hardware JPEG Decoder and NVIDIA nvJPEG Library on NVIDIA A100 GPUs
According to surveys, people collectively capture about 1.2 trillion images with phones or digital cameras. Storing these images, especially in high-resolution raw format, uses a lot of memory.
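JPEG refers to the Joint Photographic Experts Group, which celebrated its 25th birthday in 2017. The JPEG standard specifies the codec, which defines how an image is compressed into a bitstream of bytes and decompressed back into an image.
The main purpose of the JPEG codec is to minimize the file size of photographic image files. JPEG is a lossy compression format, which means that it does not store the full pixel data of the original image. One of the advantages of JPEG is that it lets you fine-tune the amount of compression used. Used correctly, this yields good image quality at the smallest reasonable file size. The key components of JPEG compression are as follows:
Color space conversion allows you to separate the luminance (Y) and chrominance (Cb, Cr) components. Downsampling Cb and Cr reduces the file size with almost no noticeable loss of quality, because human perception is less sensitive to these image components. This is not part of the core standard, but is defined as part of the JFIF format.
Block-based discrete cosine transform (DCT) allows for the compaction of data at lower frequencies.
Quantization allows for the rounding of coefficients for high-frequency details. Losing these details is usually acceptable, because the human eye cannot easily distinguish high-frequency content.
Progressive encoding allows you to preview a low-quality version of the entire image after partial decoding of its bitstream.
The following photos (Figure 1) demonstrate the image quality loss caused by JPEG compression. The original butterfly image is in BMP format (512×512, 24-bit, 769 KB, no compression). The same image is then shown in JPEG format with a quality compression coefficient of 50%, 4:2:0 subsampling, 24-bit, and an image size of 33 KB.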
Figure 1a. Original butterfly image (no compression, Size 512×512, 24-bit), 769 KB.
Figure 1b. Compressed butterfly image (quality compression coefficient 50%, subsampling 4:2:0, 24-bit), 33 KB.
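As a quick sanity check on the Figure 1 numbers, the arithmetic below works out the raw pixel payload and the resulting compression ratio. It assumes 3 bytes per pixel for the 24-bit image and treats the roughly 1 KB difference from the quoted 769 KB as BMP header and rounding overhead, which is an assumption rather than something stated above.

```latex
\begin{align*}
\text{raw pixel data} &= 512 \times 512 \times 3\ \text{bytes} = 786{,}432\ \text{bytes} = 768\ \text{KiB} \\
\text{compression ratio} &\approx \frac{769\ \text{KB}}{33\ \text{KB}} \approx 23.3
\end{align*}
```

In other words, at quality 50 with 4:2:0 subsampling, this particular image shrinks to a little over 4% of its uncompressed size.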
How JPEG works
Figure 2 shows one of the common configurations of a JPEG encoder.
Figure 2. Diagram of the JPEG encoding process employing a parallel utilization of GPU CUDA software and CPU.
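First, JPEG encoding starts with an RGB color image.
The second step is color conversion to the YCbCr color space, where Y represents luminance (brightness) and the Cb and Cr channels represent chrominance (the blue-difference and red-difference projections). The Cb and Cr channels are then downsampled by a predetermined factor, usually two or three. This downsampling gives you the first stage of compression.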
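The color conversion and chroma downsampling step can be sketched on the CPU as follows. This is a minimal illustration, not the nvJPEG implementation: it assumes the full-range BT.601 coefficients used by JFIF, averages 2×2 neighborhoods to halve a chroma plane in each direction (one common choice for 4:2:0), and uses the hypothetical names `rgbToYCbCr` and `downsample2x2`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct YCbCr { std::uint8_t y, cb, cr; };

// Round to nearest and clamp into the 8-bit range [0, 255].
static std::uint8_t clampToByte(double v) {
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, std::lround(v))));
}

// Full-range BT.601 RGB -> YCbCr conversion, as used by JFIF (assumed here).
YCbCr rgbToYCbCr(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const double y  =         0.299    * r + 0.587    * g + 0.114    * b;
    const double cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5      * b;
    const double cr = 128.0 + 0.5      * r - 0.418688 * g - 0.081312 * b;
    return {clampToByte(y), clampToByte(cb), clampToByte(cr)};
}

// Halve a chroma plane in both directions by averaging each 2x2 neighborhood.
// For brevity this sketch requires even width and height.
std::vector<std::uint8_t> downsample2x2(const std::vector<std::uint8_t>& plane,
                                        int width, int height) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(plane.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out(static_cast<std::size_t>(width / 2) * (height / 2));
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            const int sum = plane[y * width + x] + plane[y * width + x + 1] +
                            plane[(y + 1) * width + x] + plane[(y + 1) * width + x + 1];
            out[(y / 2) * (width / 2) + x / 2] = static_cast<std::uint8_t>((sum + 2) / 4);
        }
    }
    return out;
}
```

After this step, a 4:2:0 image carries one Cb and one Cr sample per four luma samples, which halves the total sample count before any further compression.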
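At the next stage, each channel is split into 8×8 blocks and the DCT is computed, a transform into frequency space similar to a Fourier transform. The DCT itself is lossless and reversible. It transforms an 8×8 spatial block into 64 frequency coefficients.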
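To make the block transform concrete, here is a direct (unoptimized) 8×8 DCT-II and its inverse. It is a textbook reference sketch with hypothetical function names. Production codecs, including GPU implementations, use fast factorizations rather than this quadruple loop.

```cpp
#include <array>
#include <cmath>

using Block = std::array<std::array<double, 8>, 8>;

namespace {
const double kPi = 3.14159265358979323846;

// Normalization factor: 1/sqrt(2) for the DC index, 1 otherwise.
double scale(int k) { return k == 0 ? std::sqrt(0.5) : 1.0; }

// Cosine basis function for sample index n and frequency index k.
double basis(int n, int k) { return std::cos((2 * n + 1) * k * kPi / 16.0); }
}  // namespace

// Forward 8x8 DCT-II: 64 spatial samples in, 64 frequency coefficients out.
Block forwardDct(const Block& s) {
    Block f{};
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int x = 0; x < 8; ++x)
                for (int y = 0; y < 8; ++y)
                    sum += s[x][y] * basis(x, u) * basis(y, v);
            f[u][v] = 0.25 * scale(u) * scale(v) * sum;
        }
    return f;
}

// Inverse 8x8 DCT: recovers the spatial samples from the coefficients.
Block inverseDct(const Block& f) {
    Block s{};
    for (int x = 0; x < 8; ++x)
        for (int y = 0; y < 8; ++y) {
            double sum = 0.0;
            for (int u = 0; u < 8; ++u)
                for (int v = 0; v < 8; ++v)
                    sum += scale(u) * scale(v) * f[u][v] * basis(x, u) * basis(y, v);
            s[x][y] = 0.25 * sum;
        }
    return s;
}
```

A flat block ends up entirely in the single DC coefficient `f[0][0]`, and `inverseDct(forwardDct(block))` reproduces the input up to floating-point error. That is the sense in which the DCT itself loses nothing.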
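The DCT coefficients are then quantized, a lossy process that forms the second stage of compression. Quantization is controlled by the JPEG quality parameter. Lower quality settings correspond to more severe compression and result in smaller files.
Quantization thresholds are specific to each spatial frequency and have been carefully designed. Low frequencies are compressed less than high frequencies, because the human eye is more sensitive to subtle errors over broad areas than to changes in the magnitude of high-frequency signals.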
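The sketch below shows what quantization does to a single coefficient. The quality-to-step mapping follows the widely used libjpeg (IJG) convention. That is an assumption made for illustration, since the text above does not say which mapping nvJPEG uses internally. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Scale a base quantization step by a quality setting in [1, 100].
// Uses the libjpeg (IJG) convention: quality 50 leaves the base table as is,
// lower quality enlarges the steps, higher quality shrinks them.
int scaleQuantStep(int baseStep, int quality) {
    assert(quality >= 1 && quality <= 100);
    const int scale = quality < 50 ? 5000 / quality : 200 - 2 * quality;
    return std::min(255, std::max(1, (baseStep * scale + 50) / 100));
}

// Lossy step: divide the DCT coefficient by its step and round to nearest.
int quantize(double coefficient, int step) {
    return static_cast<int>(std::lround(coefficient / step));
}

// Decoder side: multiply back. The rounding error is not recoverable.
double dequantize(int level, int step) {
    return static_cast<double>(level) * step;
}
```

For example, a coefficient of 37 with a step of 16 is stored as level 2 and comes back as 32, while a small high-frequency coefficient of 5 with a step of 40 is stored as 0 and disappears entirely. Long runs of such zeros are what make the later entropy coding stage effective.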
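The final step is lossless compression of the quantized DCT coefficients with Huffman encoding. The result is stored in a JPEG file, shown as image.jpg in Figure 2.
Figure 3 shows the JPEG decoding process on NVIDIA GPUs.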
Figure 3. The JPEG decoding process employs a parallel utilization of GPU CUDA and software. A hybrid (CPU/GPU) approach for Huffman decoding overcomes the serial process stall.
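The JPEG decoding process starts with the compressed JPEG bitstream and extracts the header information.
Then, Huffman decoding runs as a serial process, because the DCT coefficients are decoded from the bitstream one at a time.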
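The reason this stage resists parallelization is that Huffman codes have variable length: you cannot know where the next code starts until the current one has been decoded. The toy decoder below illustrates that dependency. Its four-entry code table is invented for this example (real JPEG tables come from the file itself), and the function name is hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Decode a string of '0'/'1' characters with a toy prefix code.
// Each iteration depends on where the previous code ended, which is
// what makes the loop inherently sequential.
std::vector<char> decodePrefixCode(const std::string& bits) {
    // Hypothetical code table for illustration only.
    static const std::map<std::string, char> table = {
        {"0", 'a'}, {"10", 'b'}, {"110", 'c'}, {"111", 'd'}};

    std::vector<char> symbols;
    std::string current;
    for (char bit : bits) {
        current.push_back(bit);
        const auto it = table.find(current);
        if (it != table.end()) {  // a complete code has been read
            symbols.push_back(it->second);
            current.clear();      // only now is the next code's start known
        }
    }
    return symbols;
}
```

Decoding the bit string 0101100111 with this table yields the symbols a, b, c, a, d. Starting from any other bit offset would give a different and wrong sequence, so the work cannot simply be split across threads at arbitrary positions.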
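The next step handles dequantization and the inverse DCT on 8×8 blocks.
The final step handles chroma upsampling and the YCbCr conversion to produce the decoded RGB image.
NVIDIA has accelerated the JPEG codec with the CUDA-based nvJPEG library. We developed a complete parallel implementation of the JPEG algorithm. The parts of the JPEG codec workflow that are typically GPU-accelerated are shown in Figures 2 and 3.
New JPEG hardware decoder
Recently, we introduced the NVIDIA A100 GPU, which has a dedicated hardware JPEG decoder. Previously, there was no such hardware unit on data center GPUs, and JPEG decoding was a pure software CUDA solution that used both the CPU and the GPU.
Now, the hardware decoder runs simultaneously with the rest of the GPU, which can be performing various compute tasks such as image classification, object detection, and image segmentation. Compared to the NVIDIA Tesla V100, it substantially increases throughput, with a 4-8x JPEG decode speedup.
It is exposed through the nvJPEG library, which is part of the CUDA Toolkit.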
nvJPEG library overview
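nvJPEG is a GPU-accelerated library for JPEG codecs. Together with NVIDIA DALI, a data augmentation and image loading library, it can accelerate deep learning training on image classification models by speeding up the decoding and augmentation of data. The A100 includes a 5-core hardware JPEG decode engine. nvJPEG uses the hardware backend for batched processing of JPEG images.
Figure 4. The JPEG hardware decoding process employs a parallel utilization of hardware decoder and GPU CUDA software. The HW decoder is independent of the CUDA SMs so that software GPU decoders can be used simultaneously.
When you select the hardware decoder with the nvjpegCreateEx init function, nvJPEG provides acceleration of baseline JPEG decoding and of various color conversion formats (for example, YUV 420, 422, 444). As shown in Figure 4, this makes image decoding 20x faster than CPU-only processing. DALI users benefit from this hardware acceleration directly, because nvJPEG is abstracted away. The nvJPEG library supports the following operations: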
· nvJPEG Encoding
· nvJPEG Transcoding
· nvJPEG Decoding (includes HW (A100) support)
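The library supports the following JPEG options:
Baseline and progressive JPEG encoding and decoding (baseline decoding only on the A100 hardware decoder)
8 bits per pixel
Huffman bitstream decoding
Up to four-channel JPEG bitstreams
8-bit and 16-bit quantization tables
The following chroma subsampling modes for the three color channels Y, Cb, Cr (Y, U, V):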
· 4:4:4
· 4:2:2
· 4:2:0
· 4:4:0
· 4:1:1
· 4:1:0
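Each of these subsampling modes fixes how much smaller the Cb and Cr planes are than the Y plane. The helper below is a sketch with hypothetical names (`Subsampling`, `chromaPlaneSize`) that maps a mode to its horizontal and vertical decimation factors and computes the chroma plane size. It rounds up for odd dimensions, which is the usual convention but is assumed here rather than taken from the nvJPEG documentation.

```cpp
#include <cassert>

struct PlaneSize { int width; int height; };

enum class Subsampling { S444, S422, S420, S440, S411, S410 };

// Horizontal (width) and vertical (height) chroma decimation factors per mode.
PlaneSize decimationFactors(Subsampling mode) {
    switch (mode) {
        case Subsampling::S444: return {1, 1};
        case Subsampling::S422: return {2, 1};
        case Subsampling::S420: return {2, 2};
        case Subsampling::S440: return {1, 2};
        case Subsampling::S411: return {4, 1};
        case Subsampling::S410: return {4, 2};
    }
    return {1, 1};  // unreachable for valid enum values
}

// Size of each chroma plane for a given luma size, rounding up.
PlaneSize chromaPlaneSize(int lumaWidth, int lumaHeight, Subsampling mode) {
    assert(lumaWidth > 0 && lumaHeight > 0);
    const PlaneSize f = decimationFactors(mode);
    return {(lumaWidth + f.width - 1) / f.width,
            (lumaHeight + f.height - 1) / f.height};
}
```

For a 640 x 426 image with 4:2:0 subsampling, this gives 320 x 213 chroma planes, which matches the channel sizes that the decoder sample reports for such images.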
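The library has the following features:
Hybrid decoding using both the CPU and the GPU.
Input to the library is in host memory, and the output is in GPU memory.
Single-image and batched image decoding.
User-provided memory manager for device and pinned host memory allocations.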
Performance numbers
For the performance graphs in this section, we used the following test setup and GPU/CPU hardware:
· NVIDIA V100 GPU: CPU – E5-2698 v4 @ 2 GHz, 3.6 GHz Turbo (Broadwell), HT on; GPU – Tesla V100-SXM2-16GB (GV100), 1 × 16160 MiB, 1 × 80 SMs; GPU video clock 1312; batch 128 and single thread
· NVIDIA A100 GPU: CPU – Platinum 8168 @ 2 GHz, 3.7 GHz Turbo (Skylake), HT on; GPU – A100-SXM4-40GB (GA100), 1 × 40557 MiB, 1 × 108 SMs; GPU video clock 1095; batch 128 and single thread
· CPU: Platinum 8168 @ 2 GHz, 3.7 GHz Turbo (Skylake), HT on; TurboJPEG decode for CPU testing
· Image dataset: 2K FHD = 1920 × 1080; 4K UHD = 3840 × 2160
· CUDA Toolkit 11.0
· CUDA driver r450.24
The next two charts show the decode speedup achieved by the hardware JPEG decoder.
Figure 5. Graph showing the speed up achieved by hardware decode on A100 over the CUDA hybrid decode on V100.
Figure 6. The number of CPU threads required by the hybrid decoder on V100 to keep up with hardware decoder throughput on A100.
By offloading decoding to the hardware, you free up valuable CPU cycles for better use.
Figure 7 shows the encoding speedup.
Figure 7a. JPEG baseline encoding throughput comparison between CPU, CUDA (V100, A100) for an image size of 1920×1080 (2K FHD), 3840×2160 (4K UHD).
Figure 7b. JPEG progressive encoding throughput comparison between CPU, CUDA (V100, A100) for an image size of 1920×1080 (2K FHD), 3840×2160 (4K UHD).
Image decoding example
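The following is an image decoding example using the nvJPEG library. It shows how to use the hardware decoder on an A100 GPU and how to fall back to another backend on other NVIDIA GPUs.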
//
// The following code example shows how to use the nvJPEG library for JPEG image decoding.
//
// Libraries used
// nvJPEG decoding
#include <iostream>
#include <nvjpeg.h>

int main()
{
    ...
    // Create the nvJPEG decoder and decoder state
    nvjpegDevAllocator_t dev_allocator = {&dev_malloc, &dev_free};
    nvjpegPinnedAllocator_t pinned_allocator = {&host_malloc, &host_free};

    // Select the A100 hardware decoder
    nvjpegStatus_t status = nvjpegCreateEx(NVJPEG_BACKEND_HARDWARE,
                                           &dev_allocator,
                                           &pinned_allocator,
                                           NVJPEG_FLAGS_DEFAULT,
                                           &params.nvjpeg_handle);
    params.hw_decode_available = true;
    if (status == NVJPEG_STATUS_ARCH_MISMATCH)
    {
        std::cout << "Hardware Decoder not supported. Falling back to default backend" << std::endl;
        // GPU SW decoder selected
        nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT,
                       &dev_allocator,
                       &pinned_allocator,
                       NVJPEG_FLAGS_DEFAULT,
                       &params.nvjpeg_handle);
        params.hw_decode_available = false;
    }

    // Create the JPEG decoder state
    nvjpegJpegStateCreate(params.nvjpeg_handle,
                          &params.nvjpeg_state);

    // Extract bitstream metadata to figure out whether a bitstream can be decoded
    nvjpegJpegStreamParseHeader(params.nvjpeg_handle,
                                (const unsigned char *)img_data[i].data(),
                                img_len[i],
                                params.jpeg_streams[0]);

    // Decode the batch of images
    nvjpegDecodeBatched(params.nvjpeg_handle,
                        params.nvjpeg_state,
                        batched_bitstreams.data(),
                        batched_bitstreams_size.data(),
                        batched_output.data(),
                        params.stream);
    ...
}
$ git clone https://github.com/NVIDIA/CUDALibrarySamples.git
$ cd nvJPEG/nvJPEG-Decoder/
$ mkdir build
$ cd build
$ cmake …
$ make
// Running nvJPEG decoder
$ ./nvjpegDecoder -i …/input_images/ -o ~/tmp
Decoding images in directory: …/input_images/, total 12, batchsize 1
Processing: …/input_images/cat_baseline.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file:/tmp/cat_baseline.bmp
Processing: …/input_images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img8.bmp
Processing: …/input_images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img5.bmp
Processing: …/input_images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img7.bmp
Processing: …/input_images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: /tmp/img2.bmp
Processing: …/input_images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img4.bmp
Processing: …/input_images/cat.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file:/tmp/cat.bmp
Processing: …/input_images/cat_grayscale.jpg
Image is 1 channels.
Channel #0 size: 64 x 64
Grayscale JPEG
Done writing decoded image to file:/tmp/cat_grayscale.bmp
Processing: …/input_images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: /tmp/img1.bmp
Processing: …/input_images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img3.bmp
Processing: …/input_images/img9.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img9.bmp
Processing: …/input_images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img6.bmp
Total decoding time: 14.8286
Avg decoding time per image: 1.23571
Avg images per sec: 0.809248
Avg decoding time per batch: 1.23571
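The summary lines at the end of the log follow from the total time and the image count by simple arithmetic. The sample output does not state time units, so none are assumed here.

```latex
\begin{align*}
\text{avg decoding time per image} &= \frac{14.8286}{12} \approx 1.23571 \\
\text{avg images per time unit} &= \frac{1}{1.23571} \approx 0.809248
\end{align*}
```

Because this run used a batch size of 1, the average decoding time per batch equals the average per image.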
Image resizing example
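This image resizing and watermarking example generates a scaled version of an image based on the client's request. Figure 8 shows the typical workflow for image resizing and watermarking.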
Figure 8. Image resizing and watermarking pipeline employing a parallel utilization of GPU software and CUDA.
The following code example shows how to resize images and watermark them with a logo image.
//
// Libraries used
// nvJPEG decoding, NPP Resize, NPP watermarking, nvJPEG encoding
int main()
{
    …
    // nvJPEG decoder
    nReturnCode = nvjpegDecode(nvjpeg_handle, nvjpeg_decoder_state, dpImage, nSize,
                               oformat, &imgDesc, NULL);

    // NPP image resize (source image)
    st = nppiResize_8u_C3R_Ctx(imgDesc.channel[0], imgDesc.pitch[0], srcSize, srcRoi,
                               imgResize.channel[0], imgResize.pitch[0], dstSize, dstRoi,
                               NPPI_INTER_LANCZOS, nppStreamCtx);

    // NPP image resize (watermark image, scaled to the same destination size)
    st = nppiResize_8u_C3R_Ctx(imgDescW.channel[0], imgDescW.pitch[0], srcSizeW, srcRoiW,
                               imgResizeW.channel[0], imgResizeW.pitch[0], dstSize, dstRoi,
                               NPPI_INTER_LANCZOS, nppStreamCtx);

    // Alpha Blending watermarking
    st = nppiAlphaCompC_8u_C3R_Ctx(imgResize.channel[0], imgResize.pitch[0], 255,
                                   imgResizeW.channel[0], imgResizeW.pitch[0], ALPHA_BLEND,
                                   imgResize.channel[0], imgResize.pitch[0], dstSize,
                                   NPPI_OP_ALPHA_PLUS, nppStreamCtx);

    // nvJPEG encoding
    nvjpegEncodeImage(nvjpeg_handle, nvjpeg_encoder_state, nvjpeg_encode_params,
                      &imgResize, iformat, dstSize.width, dstSize.height, NULL);
    …
}
$ git clone https://github.com/NVIDIA/CUDALibrarySamples.git
$ cd nvJPEG/Image-Resize-WaterMark/
$ mkdir build
$ cd build
$ cmake …
$ make
// Running Image resizer and watermarking
$ ./imageResizeWatermark -i …/input_images/ -o resize_images -q 85 -rw 512 -rh 512
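To give a feel for what the watermarking call computes per pixel, here is a CPU sketch of a Porter-Duff "plus" composition with constant alphas, which is the operation selected by NPPI_OP_ALPHA_PLUS in the sample. Treat it as an approximation. The rounding and saturation details are assumptions and may differ from NPP, the function names are hypothetical, and the value of the sample's ALPHA_BLEND constant is not shown above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// "Plus" composition of two 8-bit samples with constant alphas in [0, 255]:
// result = alphaA * a + alphaB * b, with alphas read as fractions of 255,
// rounded to nearest and saturated to 255.
std::uint8_t alphaPlus(std::uint8_t a, std::uint8_t alphaA,
                       std::uint8_t b, std::uint8_t alphaB) {
    const int weighted = (a * alphaA + b * alphaB + 127) / 255;
    return static_cast<std::uint8_t>(std::min(255, weighted));
}

// Apply the composition in place to an interleaved 8-bit buffer of equal size,
// mirroring how the sample writes the result back into the resized image.
void watermarkInPlace(std::vector<std::uint8_t>& image,
                      const std::vector<std::uint8_t>& logo,
                      std::uint8_t logoAlpha) {
    assert(image.size() == logo.size());
    for (std::size_t i = 0; i < image.size(); ++i)
        image[i] = alphaPlus(image[i], 255, logo[i], logoAlpha);
}
```

With the image alpha fixed at 255, the logo is added on top of the unchanged image, scaled by its own alpha, so bright logo pixels push the result toward white while black logo pixels leave the image untouched.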
Summary
Download the latest version of prebuilt DALI binaries with NVIDIA Ampere architecture support. For a detailed list of new features and enhancements, see the nvJPEG Library documentation and the latest release notes.
To learn more about how DALI uses nvJPEG for accelerating a deep learning data pipeline, see Loading Data Fast with DALI and the New Hardware JPEG Decoder in NVIDIA A100 GPUs.