PyTorch Parallel Computing (1): DataParallel
- 1. Official Example
- 2. How It Works
- 3. Usage
- 4. Source Code Walkthrough
1. Official Example
PyTorch official example
PyTorch official API reference
2. How It Works
```python
torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
```
- module (Module) – module to be parallelized
- device_ids (list of int or torch.device) – CUDA devices (default: all devices)
- output_device (int or torch.device) – device location of output (default: device_ids[0])
First, the model is replicated onto each GPU and the input data is split evenly across them; each GPU then runs its own forward pass, and the results are gathered onto the first GPU; finally, the loss is computed there and backpropagated.
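As a concrete sketch of this flow (the ShapeProbe module below is invented purely for illustration and assumes a machine with at least two GPUs): the wrapper splits a 32-sample batch into two chunks of 16, runs one replica per GPU, and the gathered output lands on output_device.

```python
import torch
import torch.nn as nn

class ShapeProbe(nn.Module):
    """Toy module that reports which slice of the batch each replica receives."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        # Each replica only sees its own chunk of the scattered batch.
        print(f"replica on {x.device} got {x.size(0)} samples")
        return self.fc(x)

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    model = nn.DataParallel(ShapeProbe().cuda(), device_ids=[0, 1], output_device=0)
    out = model(torch.randn(32, 10).cuda())
    # Expected prints (in either order): 16 samples on cuda:0 and 16 on cuda:1.
    print(out.shape, out.device)  # torch.Size([32, 5]) cuda:0
```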
There is a catch here: when all the per-GPU outputs are gathered onto the first GPU (the default output_device=0), its memory usage becomes noticeably higher than that of the others, creating an imbalance, and the loss computation there can become a bottleneck. Computing the loss on each GPU separately might well be faster (see the sketch below). This issue has been raised in the PyTorch community, but since DataParallel is rarely used nowadays (most people use DistributedDataParallel instead), I won't go further into how to optimize it.
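The "compute the loss on each replica" idea can be sketched as follows. This is a minimal illustration of my own, not code from this post or from PyTorch; the ModelWithLoss wrapper and all shapes are made up. Each GPU returns only the loss for its chunk, so only scalars get gathered onto GPU 0.

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Wraps a model and its criterion so the loss is computed on each replica."""
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x, target):
        logits = self.model(x)
        # Return a 1-element tensor so gather concatenates one loss per GPU.
        return self.criterion(logits, target).unsqueeze(0)

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    net = nn.DataParallel(ModelWithLoss(nn.Linear(10, 3)).cuda(), device_ids=[0, 1])
    x = torch.randn(32, 10).cuda()
    target = torch.randint(0, 3, (32,)).cuda()
    loss = net(x, target).mean()  # average the per-GPU losses, then backprop
    loss.backward()
```

DistributedDataParallel avoids the gather imbalance altogether, which is one reason it is the usual recommendation.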
3. Usage
Using it is straightforward, as shown below; nothing else in the training code needs to change.
```python
import os
import torch.nn as nn

# When the number of GPUs is not restricted, all available GPUs are used by default
model = Model()
model = nn.DataParallel(model)
model.cuda()

# When restricting which GPUs are used
# (CUDA_VISIBLE_DEVICES must be set before CUDA is initialized)
os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1'
model = Model()
model = nn.DataParallel(model)
model.cuda()
```
4. Source Code Walkthrough
Here I copied the source code out so I could step through it, and added comments along the way, which makes it easier to follow.
The machine I'm using has two GPUs, i.e. device_ids = [0, 1].
```python
# Imports roughly as in torch/nn/parallel/data_parallel.py (needed if you copy the
# class out of the file to debug it; _check_balance is defined in that same file).
import torch
from itertools import chain
from torch.nn.modules import Module
from torch.nn.parallel.replicate import replicate
from torch.nn.parallel.scatter_gather import scatter_kwargs, gather
from torch.nn.parallel.parallel_apply import parallel_apply
from torch.nn.parallel.data_parallel import _check_balance
from torch._utils import (
    _get_all_device_indices,
    _get_available_device_type,
    _get_device_index,
)


class DataParallel(Module):

    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super(DataParallel, self).__init__()

        # 1. Check whether we are running on GPU or CPU; on CPU, return right away,
        #    otherwise continue. _get_available_device_type() checks
        #    torch.cuda.is_available() and returns "cuda" or None.
        device_type = _get_available_device_type()  # "cuda"
        if device_type is None:
            self.module = module
            self.device_ids = []
            return

        # 2. If device_ids is not given, default to all GPUs; if output_device is
        #    not given, default to device_ids[0].
        #    _get_all_device_indices() returns the GPU ids, e.g. [0, 1].
        if device_ids is None:
            device_ids = _get_all_device_indices()
        if output_device is None:
            output_device = device_ids[0]

        # 3. Resolve the attributes. _get_device_index() maps each entry to a plain
        #    device index. Here: device_ids = [0, 1], output_device = 0
        #    (the first GPU by default).
        self.dim = dim
        self.module = module
        self.device_ids = [_get_device_index(x, True) for x in device_ids]
        self.output_device = _get_device_index(output_device, True)
        self.src_device_obj = torch.device(device_type, self.device_ids[0])

        # 4. Check how balanced the GPUs are. _check_balance warns in two cases:
        #    (1) the ratio of the smallest to the largest total GPU memory is
        #        below 0.75;
        #    (2) the ratio of the smallest to the largest multiprocessor count is
        #        below 0.75.
        _check_balance(self.device_ids)

        # 5. If there is only one GPU, simply move the module onto it.
        if len(self.device_ids) == 1:
            self.module.to(self.src_device_obj)

    def forward(self, *inputs, **kwargs):
        with torch.autograd.profiler.record_function("DataParallel.forward"):
            if not self.device_ids:
                return self.module(*inputs, **kwargs)

            # 6. Iterate over module.parameters() and module.buffers() and check that
            #    they all live on the source GPU; if one of them is still on the CPU
            #    (or on another device), raise an error.
            for t in chain(self.module.parameters(), self.module.buffers()):
                if t.device != self.src_device_obj:
                    raise RuntimeError("module must have its parameters and buffers "
                                       "on device {} (device_ids[0]) but found one of "
                                       "them on device: {}".format(self.src_device_obj, t.device))

            # 7. The rest mirrors the functional data_parallel below; see the
            #    comments there.
            inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
            # for forward function without any inputs, empty list and dict will be created
            # so the module can be executed on one device which is the first one in device_ids
            if not inputs and not kwargs:
                inputs = ((),)
                kwargs = ({},)

            if len(self.device_ids) == 1:
                return self.module(*inputs[0], **kwargs[0])
            replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
            outputs = self.parallel_apply(replicas, inputs, kwargs)
            return self.gather(outputs, self.output_device)

    def replicate(self, module, device_ids):
        return replicate(module, device_ids, not torch.is_grad_enabled())

    def scatter(self, inputs, kwargs, device_ids):
        return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)

    def parallel_apply(self, replicas, inputs, kwargs):
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])

    def gather(self, outputs, output_device):
        return gather(outputs, output_device, dim=self.dim)


def data_parallel(module, inputs, device_ids=None, output_device=None, dim=0, module_kwargs=None):
    r"""Evaluates module(input) in parallel across the GPUs given in device_ids.

    This is the functional version of the DataParallel module.

    Args:
        module (Module): the module to evaluate in parallel
        inputs (Tensor): inputs to the module
        device_ids (list of int or torch.device): GPU ids on which to replicate module
        output_device (list of int or torch.device): GPU location of the output.
            Use -1 to indicate the CPU. (default: device_ids[0])

    Returns:
        a Tensor containing the result of module(input) located on output_device
    """
    # 1. Check whether there is any input (i.e. our batch). A single tensor is
    #    wrapped into a tuple; no input becomes an empty tuple ().
    if not isinstance(inputs, tuple):
        inputs = (inputs,) if inputs is not None else ()

    # 2. Check the current device type: normally "cuda".
    device_type = _get_available_device_type()

    # 3. Check whether the input and output GPUs were specified.
    #    Defaults: input devices = all GPUs; output device = GPU 0, i.e. the first
    #    of the input GPUs.
    if device_ids is None:
        device_ids = _get_all_device_indices()

    if output_device is None:
        output_device = device_ids[0]

    # 4. Resolve the input/output GPUs as well as the source GPU, i.e. GPU 0,
    #    the one the results are eventually gathered on.
    device_ids = [_get_device_index(x, True) for x in device_ids]
    output_device = _get_device_index(output_device, True)
    src_device_obj = torch.device(device_type, device_ids[0])

    # 5. Check that the module's parameters and buffers all live on the source
    #    device (device_ids[0]).
    for t in chain(module.parameters(), module.buffers()):
        if t.device != src_device_obj:
            raise RuntimeError("module must have its parameters and buffers "
                               "on device {} (device_ids[0]) but found one of "
                               "them on device: {}".format(src_device_obj, t.device))

    # 6. scatter_kwargs splits the inputs into m chunks, m = batch_size / num_GPUs,
    #    and returns a tuple of inputs with one chunk per GPU.
    inputs, module_kwargs = scatter_kwargs(inputs, module_kwargs, device_ids, dim)
    # for module without any inputs, empty list and dict will be created
    # so the module can be executed on one device which is the first one in device_ids
    if not inputs and not module_kwargs:
        inputs = ((),)
        module_kwargs = ({},)

    if len(device_ids) == 1:
        return module(*inputs[0], **module_kwargs[0])
    used_device_ids = device_ids[:len(inputs)]

    # 7. replicate copies the model m times (m = number of GPUs) and loads one
    #    replica onto each GPU. outputs corresponds to inputs one-to-one and holds
    #    each GPU's result.
    replicas = replicate(module, used_device_ids)
    outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)

    # 8. gather merges the per-GPU outputs into one: with 2 GPUs there is one
    #    output on each GPU; gather collects them, by default onto GPU 0.
    return gather(outputs, output_device, dim)
```
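For completeness, the same scatter → replicate → parallel_apply → gather pipeline can be driven by hand with the helpers that torch.nn.parallel exports, which is essentially what steps 6 to 8 above do. This is a minimal sketch assuming two GPUs; the toy Linear module is only for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    device_ids = [0, 1]
    module = nn.Linear(10, 5).cuda()        # parameters live on device_ids[0]
    inputs = torch.randn(32, 10).cuda()

    # scatter: split the batch into one chunk per GPU (16 samples each)
    scattered = scatter(inputs, device_ids)
    # replicate + parallel_apply: copy the module onto each GPU and run every
    # replica on its own chunk in parallel
    replicas = replicate(module, device_ids[:len(scattered)])
    outputs = parallel_apply(replicas, scattered, devices=device_ids[:len(scattered)])
    # gather: concatenate the per-GPU outputs back onto GPU 0
    result = gather(outputs, 0)
    print(result.shape, result.device)      # torch.Size([32, 5]) cuda:0
```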