RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
Error message:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
There are many possible causes for this error. I won't dwell on fixes like passing find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; the error message itself spells those out clearly enough.
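For completeness, here is a minimal sketch of what that flag looks like in practice. The process-group setup, the model, and local_rank are assumed to already exist; the names are illustrative, not from the original post:

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Assumes torch.distributed.init_process_group(...) has already run
    # and `model` has been moved onto this process's device.
    model = DDP(
        model,
        device_ids=[local_rank],       # illustrative: the GPU owned by this process
        find_unused_parameters=True,   # let DDP tolerate parameters that receive no gradient
    )

Note that this flag adds overhead on every iteration, so it is a workaround rather than a real fix when you know all parameters should be used.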
If flipping a parameter were enough to solve your problem, you wouldn't have ended up on this blog post ^^
Solution (one of several)
The last part of the error message actually deserves attention:
If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
The first time you hit this problem, though, the official hint alone can still leave you in a fog, so let me share my own understanding and how I resolved it.
Put simply, it boils down to one sentence: make sure that every output of every forward function is used in computing the loss.
Note that this applies not only to your model's forward function: your loss may itself be computed through a forward function. In other words, every output of the forward of every module that inherits from nn.Module (not just the model itself) must participate in the loss computation.
The problem I ran into was exactly this: in a multi-task learning setup, the loss was computed by a module that itself inherits from nn.Module, but one task's loss was left out of the total loss while still being returned from forward, which triggered this error.
    class multi_task_loss(nn.Module):
        def __init__(self, device, batch_size):
            super().__init__()
            self.ce_loss_func = nn.CrossEntropyLoss()
            self.l1_loss_func = nn.L1Loss()
            self.contra_loss_func = ContrastiveLoss(batch_size, device)

        def forward(self, rot_p, rot_t, pert_p, pert_t, emb_o, emb_h, emb_p,
                    original_imgs, rect_imgs):
            rot_loss = self.ce_loss_func(rot_p, rot_t)
            pert_loss = self.ce_loss_func(pert_p, pert_t)
            contra_loss = self.contra_loss_func(emb_o, emb_h) \
                + self.contra_loss_func(emb_o, emb_p) \
                + self.contra_loss_func(emb_p, emb_h)
            rect_loss = self.l1_loss_func(original_imgs, rect_imgs)
            # tol_loss = rot_loss + pert_loss + rect_loss  # contra_loss left out of the total, yet all losses were returned -> error
            tol_loss = rot_loss + pert_loss + contra_loss + rect_loss  # fixed: every returned loss participates in the total
            return tol_loss, (rot_loss, pert_loss, contra_loss, rect_loss)

Check your own pipeline end to end (not just the model itself): is every output of every forward function used in computing the loss?
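If it is not obvious which parameters are being left out, one common way to narrow it down is to run a single forward/backward pass on one process, without DDP, and print the parameters whose gradient is still None afterwards: those are exactly the parameters DDP is complaining about. Below is a self-contained toy example of this diagnostic (not from the original post; the model and names are made up for illustration):

    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.used = nn.Linear(4, 4)
            self.unused = nn.Linear(4, 4)  # never called in forward -> gets no gradient

        def forward(self, x):
            return self.used(x)

    model = Toy()
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()

    # Parameters whose .grad is still None never contributed to the loss.
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is None:
            print(name)  # prints: unused.weight, unused.bias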
Ref:
https://discuss.pytorch.org/t/need-help-runtimeerror-expected-to-have-finished-reduction-in-the-prior-iteration-before-starting-a-new-one/119247