PyTorch Parallel Computing (1): DataParallel
- 1. Official Example
- 2. How It Works
- 3. Usage
- 4. Source Code Walkthrough
1. Official Example
PyTorch official example
PyTorch official API reference
2. How It Works
```python
torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
```
- module (Module) – module to be parallelized
- device_ids (list of int or torch.device) – CUDA devices (default: all devices)
- output_device (int or torch.device) – device location of output (default: device_ids[0])
First, the model is replicated onto each GPU and the input data is split evenly across them; each GPU then runs its own forward pass, and the results are gathered onto the first GPU; finally, the loss is computed there and backpropagated.
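As a concrete sketch of this flow (the ShapeProbe module below is invented purely for illustration and assumes a machine with at least two GPUs): the wrapper splits a 32-sample batch into two chunks of 16, runs one replica per GPU, and the gathered output lands on output_device.

```python
import torch
import torch.nn as nn

class ShapeProbe(nn.Module):
    """Toy module that reports which slice of the batch each replica receives."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        # Each replica only sees its own chunk of the scattered batch.
        print(f"replica on {x.device} got {x.size(0)} samples")
        return self.fc(x)

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    model = nn.DataParallel(ShapeProbe().cuda(), device_ids=[0, 1], output_device=0)
    out = model(torch.randn(32, 10).cuda())
    # Expected prints (in either order): 16 samples on cuda:0 and 16 on cuda:1.
    print(out.shape, out.device)  # torch.Size([32, 5]) cuda:0
```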
There is a catch here: when all the per-GPU outputs are gathered onto the first GPU (the default output_device=0), its memory usage becomes noticeably higher than that of the others, creating an imbalance, and the loss computation there can become a bottleneck. Computing the loss on each GPU separately might well be faster (see the sketch below). This issue has been raised in the PyTorch community, but since DataParallel is rarely used nowadays (most people use DistributedDataParallel instead), I won't go further into how to optimize it.
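The "compute the loss on each replica" idea can be sketched as follows. This is a minimal illustration of my own, not code from this post or from PyTorch; the ModelWithLoss wrapper and all shapes are made up. Each GPU returns only the loss for its chunk, so only scalars get gathered onto GPU 0.

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Wraps a model and its criterion so the loss is computed on each replica."""
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x, target):
        logits = self.model(x)
        # Return a 1-element tensor so gather concatenates one loss per GPU.
        return self.criterion(logits, target).unsqueeze(0)

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    net = nn.DataParallel(ModelWithLoss(nn.Linear(10, 3)).cuda(), device_ids=[0, 1])
    x = torch.randn(32, 10).cuda()
    target = torch.randint(0, 3, (32,)).cuda()
    loss = net(x, target).mean()  # average the per-GPU losses, then backprop
    loss.backward()
```

DistributedDataParallel avoids the gather imbalance altogether, which is one reason it is the usual recommendation.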
3. Usage
Using it is straightforward, as shown below; nothing else in the training code needs to change.
```python
import os
import torch.nn as nn

# When the number of GPUs is not restricted, all available GPUs are used by default
model = Model()
model = nn.DataParallel(model)
model.cuda()

# When restricting which GPUs are used
# (CUDA_VISIBLE_DEVICES must be set before CUDA is initialized)
os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1'
model = Model()
model = nn.DataParallel(model)
model.cuda()
```
4. Source Code Walkthrough
Here I copied the source code out so I could step through it, and added comments along the way, which makes it easier to follow.
The machine I'm using has two GPUs, i.e. device_ids = [0, 1].
```python
# Imports roughly as in torch/nn/parallel/data_parallel.py (needed if you copy the
# class out of the file to debug it; _check_balance is defined in that same file).
import torch
from itertools import chain
from torch.nn.modules import Module
from torch.nn.parallel.replicate import replicate
from torch.nn.parallel.scatter_gather import scatter_kwargs, gather
from torch.nn.parallel.parallel_apply import parallel_apply
from torch.nn.parallel.data_parallel import _check_balance
from torch._utils import (
    _get_all_device_indices,
    _get_available_device_type,
    _get_device_index,
)


class DataParallel(Module):

    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super(DataParallel, self).__init__()

        # 1. Check whether we are running on GPU or CPU; on CPU, return right away,
        #    otherwise continue. _get_available_device_type() checks
        #    torch.cuda.is_available() and returns "cuda" or None.
        device_type = _get_available_device_type()  # "cuda"
        if device_type is None:
            self.module = module
            self.device_ids = []
            return

        # 2. If device_ids is not given, default to all GPUs; if output_device is
        #    not given, default to device_ids[0].
        #    _get_all_device_indices() returns the GPU ids, e.g. [0, 1].
        if device_ids is None:
            device_ids = _get_all_device_indices()
        if output_device is None:
            output_device = device_ids[0]

        # 3. Resolve the attributes. _get_device_index() maps each entry to a plain
        #    device index. Here: device_ids = [0, 1], output_device = 0
        #    (the first GPU by default).
        self.dim = dim
        self.module = module
        self.device_ids = [_get_device_index(x, True) for x in device_ids]
        self.output_device = _get_device_index(output_device, True)
        self.src_device_obj = torch.device(device_type, self.device_ids[0])

        # 4. Check how balanced the GPUs are. _check_balance warns in two cases:
        #    (1) the ratio of the smallest to the largest total GPU memory is
        #        below 0.75;
        #    (2) the ratio of the smallest to the largest multiprocessor count is
        #        below 0.75.
        _check_balance(self.device_ids)

        # 5. If there is only one GPU, simply move the module onto it.
        if len(self.device_ids) == 1:
            self.module.to(self.src_device_obj)

    def forward(self, *inputs, **kwargs):
        with torch.autograd.profiler.record_function("DataParallel.forward"):
            if not self.device_ids:
                return self.module(*inputs, **kwargs)

            # 6. Iterate over module.parameters() and module.buffers() and check that
            #    they all live on the source GPU; if one of them is still on the CPU
            #    (or on another device), raise an error.
            for t in chain(self.module.parameters(), self.module.buffers()):
                if t.device != self.src_device_obj:
                    raise RuntimeError("module must have its parameters and buffers "
                                       "on device {} (device_ids[0]) but found one of "
                                       "them on device: {}".format(self.src_device_obj, t.device))

            # 7. The rest mirrors the functional data_parallel below; see the
            #    comments there.
            inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
            # for forward function without any inputs, empty list and dict will be created
            # so the module can be executed on one device which is the first one in device_ids
            if not inputs and not kwargs:
                inputs = ((),)
                kwargs = ({},)

            if len(self.device_ids) == 1:
                return self.module(*inputs[0], **kwargs[0])
            replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
            outputs = self.parallel_apply(replicas, inputs, kwargs)
            return self.gather(outputs, self.output_device)

    def replicate(self, module, device_ids):
        return replicate(module, device_ids, not torch.is_grad_enabled())

    def scatter(self, inputs, kwargs, device_ids):
        return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)

    def parallel_apply(self, replicas, inputs, kwargs):
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])

    def gather(self, outputs, output_device):
        return gather(outputs, output_device, dim=self.dim)


def data_parallel(module, inputs, device_ids=None, output_device=None, dim=0, module_kwargs=None):
    r"""Evaluates module(input) in parallel across the GPUs given in device_ids.

    This is the functional version of the DataParallel module.

    Args:
        module (Module): the module to evaluate in parallel
        inputs (Tensor): inputs to the module
        device_ids (list of int or torch.device): GPU ids on which to replicate module
        output_device (list of int or torch.device): GPU location of the output.
            Use -1 to indicate the CPU. (default: device_ids[0])

    Returns:
        a Tensor containing the result of module(input) located on output_device
    """
    # 1. Check whether there is any input (i.e. our batch). A single tensor is
    #    wrapped into a tuple; no input becomes an empty tuple ().
    if not isinstance(inputs, tuple):
        inputs = (inputs,) if inputs is not None else ()

    # 2. Check the current device type: normally "cuda".
    device_type = _get_available_device_type()

    # 3. Check whether the input and output GPUs were specified.
    #    Defaults: input devices = all GPUs; output device = GPU 0, i.e. the first
    #    of the input GPUs.
    if device_ids is None:
        device_ids = _get_all_device_indices()

    if output_device is None:
        output_device = device_ids[0]

    # 4. Resolve the input/output GPUs as well as the source GPU, i.e. GPU 0,
    #    the one the results are eventually gathered on.
    device_ids = [_get_device_index(x, True) for x in device_ids]
    output_device = _get_device_index(output_device, True)
    src_device_obj = torch.device(device_type, device_ids[0])

    # 5. Check that the module's parameters and buffers all live on the source
    #    device (device_ids[0]).
    for t in chain(module.parameters(), module.buffers()):
        if t.device != src_device_obj:
            raise RuntimeError("module must have its parameters and buffers "
                               "on device {} (device_ids[0]) but found one of "
                               "them on device: {}".format(src_device_obj, t.device))

    # 6. scatter_kwargs splits the inputs into m chunks, m = batch_size / num_GPUs,
    #    and returns a tuple of inputs with one chunk per GPU.
    inputs, module_kwargs = scatter_kwargs(inputs, module_kwargs, device_ids, dim)
    # for module without any inputs, empty list and dict will be created
    # so the module can be executed on one device which is the first one in device_ids
    if not inputs and not module_kwargs:
        inputs = ((),)
        module_kwargs = ({},)

    if len(device_ids) == 1:
        return module(*inputs[0], **module_kwargs[0])
    used_device_ids = device_ids[:len(inputs)]

    # 7. replicate copies the model m times (m = number of GPUs) and loads one
    #    replica onto each GPU. outputs corresponds to inputs one-to-one and holds
    #    each GPU's result.
    replicas = replicate(module, used_device_ids)
    outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)

    # 8. gather merges the per-GPU outputs into one: with 2 GPUs there is one
    #    output on each GPU; gather collects them, by default onto GPU 0.
    return gather(outputs, output_device, dim)
```
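For completeness, the same scatter → replicate → parallel_apply → gather pipeline can be driven by hand with the helpers that torch.nn.parallel exports, which is essentially what steps 6 to 8 above do. This is a minimal sketch assuming two GPUs; the toy Linear module is only for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    device_ids = [0, 1]
    module = nn.Linear(10, 5).cuda()        # parameters live on device_ids[0]
    inputs = torch.randn(32, 10).cuda()

    # scatter: split the batch into one chunk per GPU (16 samples each)
    scattered = scatter(inputs, device_ids)
    # replicate + parallel_apply: copy the module onto each GPU and run every
    # replica on its own chunk in parallel
    replicas = replicate(module, device_ids[:len(scattered)])
    outputs = parallel_apply(replicas, scattered, devices=device_ids[:len(scattered)])
    # gather: concatenate the per-GPU outputs back onto GPU 0
    result = gather(outputs, 0)
    print(result.shape, result.device)      # torch.Size([32, 5]) cuda:0
```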