[Repost] One Article Explaining Differentiation in PyTorch (backward, autograd.grad)
PyTorch uses a dynamic computation graph: the graph is built as the operations are executed, so results can be inspected at any time. TensorFlow, by contrast, uses a static graph.
A PyTorch computation graph contains only two kinds of elements: data (tensors) and operations.
Operations include addition, subtraction, multiplication, division, roots, powers, exponentials, logarithms, trigonometric functions, and any other differentiable operation.
Data splits into leaf nodes and non-leaf nodes. A leaf node is a tensor created by the user that does not depend on any other node. The visible difference is that after backpropagation finishes, the gradients of non-leaf nodes are freed and only leaf-node gradients are kept, which saves memory. To keep the gradient of a non-leaf node, call its retain_grad() method, as in the sketch below.
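A minimal sketch of retain_grad() (the example values here are my own, not from the original post):

```python
import torch

x = torch.tensor(2., requires_grad=True)  # leaf node created by the user
a = x + 1                                 # non-leaf node
a.retain_grad()                           # ask autograd to keep this non-leaf gradient
y = a * a
y.backward()

print(x.grad)   # tensor(6.)  kept: x is a leaf
print(a.grad)   # tensor(6.)  kept only because of retain_grad()
```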
A torch.tensor has the following attributes:
- requires_grad — whether the tensor can be differentiated
- grad_fn — the operation that produced it
- is_leaf — whether it is a leaf node
- grad — the gradient value
Regarding requires_grad: leaf nodes you create yourself default to False, non-leaf nodes get True automatically whenever any of their inputs requires a gradient, and neural-network weights default to True. A rule of thumb for deciding which nodes need to be True is that there must be a differentiable path from the leaf nodes you want gradients for all the way to the loss node.
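A small sketch of these defaults (my own example, not from the original post):

```python
import torch

x = torch.tensor(1.)              # user-created leaf: requires_grad defaults to False
y = x * 2                         # built only from tensors that don't require grad -> False
w = torch.nn.Linear(1, 1).weight  # a network weight (nn.Parameter): True by default
print(x.requires_grad, y.requires_grad, w.requires_grad)   # False False True

x.requires_grad_()                # make the leaf differentiable
z = x * 2                         # a non-leaf built from it now requires grad automatically
print(z.requires_grad, z.is_leaf)  # True False
```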
When we want the gradient with respect to some tensor, we first have to set its requires_grad attribute to True. There are two main ways to do this:
```python
x = torch.tensor(1.).requires_grad_()       # option 1
x = torch.tensor(1., requires_grad=True)    # option 2
```

PyTorch offers two ways to compute gradients: backward() and torch.autograd.grad(). The difference is that the former fills in the .grad field of the leaf nodes, while the latter returns the gradients to you directly; examples of both follow below. It is also worth knowing that y.backward() is equivalent to torch.autograd.backward(y).
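A quick sketch of that equivalence (the example values are my own):

```python
import torch

x = torch.tensor(2., requires_grad=True)
y = (x + 1) * (x + 2)

torch.autograd.backward(y)   # same effect as y.backward()
print(x.grad)

>>> tensor(7.)
```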
A simple example: $y=(x+1)(x+2)$, compute $\partial y/\partial x$ given $x=2$. First draw the computation graph, with intermediate nodes $a=x+1$ and $b=x+2$:
Computing by hand,
$$\frac{\partial y}{\partial x}=\frac{\partial y}{\partial a}\frac{\partial a}{\partial x}+\frac{\partial y}{\partial b}\frac{\partial b}{\partial x}=(x+2)+(x+1)=7$$
Using backward()
```python
x = torch.tensor(2., requires_grad=True)
a = torch.add(x, 1)
b = torch.add(x, 2)
y = torch.mul(a, b)

y.backward()
print(x.grad)

>>> tensor(7.)
```

Let's look at the attributes of these tensors:
print("requires_grad: ", x.requires_grad, a.requires_grad, b.requires_grad, y.requires_grad) print("is_leaf: ", x.is_leaf, a.is_leaf, b.is_leaf, y.is_leaf) print("grad: ", x.grad, a.grad, b.grad, y.grad)>>>requires_grad: True True True True >>>is_leaf: True False False False >>>grad: tensor(7.) None None None使用backward()函數(shù)反向傳播計(jì)算tensor的梯度時(shí),并不計(jì)算所有tensor的梯度,而是只計(jì)算滿足這幾個(gè)條件的tensor的梯度:
The gradients of all qualifying tensors are stored automatically in their grad attributes.
Using autograd.grad()
```python
x = torch.tensor(2., requires_grad=True)
a = torch.add(x, 1)
b = torch.add(x, 2)
y = torch.mul(a, b)

grad = torch.autograd.grad(outputs=y, inputs=x)
print(grad[0])

>>> tensor(7.)
```

Because we specified y as the output and x as the input, the return value is the gradient $\partial y/\partial x$. The full return value is actually a tuple, with one entry per input, so here we just keep the first element.
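For example, with several inputs the tuple contains one gradient per input (a sketch of my own, reusing the graph above):

```python
x = torch.tensor(2., requires_grad=True)
a = torch.add(x, 1)
b = torch.add(x, 2)
y = torch.mul(a, b)

grads = torch.autograd.grad(outputs=y, inputs=(a, b))   # one gradient per listed input
print(grads)

>>> (tensor(4.), tensor(3.))
```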
A more complicated example with a higher-order derivative: $z=x^2y$, compute $\partial z/\partial x$, $\partial z/\partial y$, and $\partial^2 z/\partial x^2$, given $x=2$, $y=3$.
By hand:
$$\frac{\partial z}{\partial x}=2xy \to 12,\qquad \frac{\partial z}{\partial y}=x^2 \to 4,\qquad \frac{\partial^2 z}{\partial x^2}=2y \to 6$$
The first-order derivatives can be computed with backward().
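The backward() snippet for this step is not shown in the text above, so here is a minimal sketch of what it presumably looks like:

```python
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

z.backward()
print(x.grad, y.grad)

>>> tensor(12.) tensor(4.)
```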
They can also be computed with autograd.grad():
```python
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

grad_x = torch.autograd.grad(outputs=z, inputs=x)
print(grad_x[0])
# grad_y = torch.autograd.grad(outputs=z, inputs=y)  # would fail: y can no longer be differentiated

>>> tensor(12.)
```

Why not also ask for the derivative with respect to y in the same call sequence? Because both backward() and autograd.grad() free the graph after computing gradients once; to keep it around, pass retain_graph=True:
```python
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

grad_x = torch.autograd.grad(outputs=z, inputs=x, retain_graph=True)
grad_y = torch.autograd.grad(outputs=z, inputs=y)

print(grad_x[0], grad_y[0])

>>> tensor(12.) tensor(4.)
```

Now for the higher-order derivative. In principle it is just grad_x differentiated with respect to x again; let's try:
```python
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

grad_x = torch.autograd.grad(outputs=z, inputs=x, retain_graph=True)
grad_xx = torch.autograd.grad(outputs=grad_x, inputs=x)

print(grad_xx[0])

>>> RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

It fails. Although retain_graph=True keeps the computation graph and the gradients of intermediate variables, it does not record how grad_x itself was computed. We need create_graph=True, which builds an extra graph for the gradient computation on top of the original one, i.e. it records an operation like $\partial z/\partial x=2xy$ so that it can be differentiated again.
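The corresponding snippet appears to be missing here, so this is a sketch of what using create_graph=True presumably looks like:

```python
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

grad_x = torch.autograd.grad(outputs=z, inputs=x, create_graph=True)
grad_xx = torch.autograd.grad(outputs=grad_x, inputs=x)

print(grad_xx[0])

>>> tensor(6.)
```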
For grad_xx we can also call backward() directly, which amounts to backpropagating starting from $\partial z/\partial x=2xy$:
```python
# autograd.grad() + backward()
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

grad = torch.autograd.grad(outputs=z, inputs=x, create_graph=True)
grad[0].backward()

print(x.grad)

>>> tensor(6.)
```

We can also call backward() first and then differentiate the first-order derivative x.grad again:
```python
# backward() + autograd.grad()
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

z.backward(create_graph=True)
grad_xx = torch.autograd.grad(outputs=x.grad, inputs=x)

print(grad_xx[0])

>>> tensor(6.)
```

So can we just call backward() twice, backpropagating from x.grad the second time? Let's try:
```python
# backward() + backward()
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

z.backward(create_graph=True)   # x.grad = 12
x.grad.backward()

print(x.grad)

>>> tensor(18., grad_fn=<CopyBackwards>)
```

Something is off: the result is 18, not 6. The first backward pass already wrote 12 into x.grad, and backward() accumulates gradients by default, so we end up with 12 + 6 = 18. The previous gradient has to be zeroed manually first:
```python
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()
z = x * x * y

z.backward(create_graph=True)
x.grad.data.zero_()
x.grad.backward()

print(x.grad)

>>> tensor(6., grad_fn=<CopyBackwards>)
```

Notice that everything so far has differentiated a scalar output. What happens if the output is not a scalar?
```python
x = torch.tensor([1., 2.]).requires_grad_()
y = x + 1

y.backward()
print(x.grad)

>>> RuntimeError: grad can be implicitly created only for scalar outputs
```

It fails, because gradients can only be taken of a scalar with respect to a scalar or of a scalar with respect to a vector: x may be a scalar or a vector, but y must be a scalar. So we first have to turn y into a scalar, and a reduction that does not affect the per-element derivatives is a sum.
Taking the elementwise square y = x * x (as in the code examples below) as the case in point,
$$x=[x_1,x_2],\quad y=[x_1^2,\,x_2^2],\quad y'=y.\mathrm{sum}()=x_1^2+x_2^2,\qquad \frac{\partial y'}{\partial x_1}=2x_1 \to 2,\quad \frac{\partial y'}{\partial x_2}=2x_2 \to 4$$
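The sum-then-backward snippet is not shown in the text above, so here is a sketch of it (assuming y = x * x as in the derivation):

```python
x = torch.tensor([1., 2.]).requires_grad_()
y = x * x

y.sum().backward()
print(x.grad)

>>> tensor([2., 4.])
```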
To explain in more detail, let's write out the Jacobian of this computation. $\boldsymbol y=[y_1,y_2]$ is a vector,
$$\boldsymbol J=\left[\frac{\partial \boldsymbol y}{\partial x_1},\frac{\partial \boldsymbol y}{\partial x_2}\right]=\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix}$$
But the result we actually want is $[\frac{\partial y_1}{\partial x_1}, \frac{\partial y_2}{\partial x_2}]$. How do we get it? Note that $\frac{\partial y_1}{\partial x_2}$ and $\frac{\partial y_2}{\partial x_1}$ are both 0, so
$$\left[\frac{\partial y_1}{\partial x_1}, \frac{\partial y_2}{\partial x_2}\right]^\mathsf{T}=\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
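As a quick sanity check of this Jacobian-vector product (my own addition, not in the original post), we can materialise the Jacobian with torch.autograd.functional.jacobian and multiply it by the ones vector:

```python
import torch

x = torch.tensor([1., 2.])
J = torch.autograd.functional.jacobian(lambda t: t * t, x)
print(J)                  # Jacobian of the elementwise square; off-diagonal entries are zero
print(J @ torch.ones(2))  # row sums, which equal the diagonal because of those zeros

>>> tensor([[2., 0.],
>>>         [0., 4.]])
>>> tensor([2., 4.])
```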
So the other way, without y.sum(), is to feed that vector of ones to backward() directly.
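The snippet for this seems to have been dropped, so here is a sketch (assuming, as in the examples below, y = x * x):

```python
x = torch.tensor([1., 2.]).requires_grad_()
y = x * x

y.backward(torch.ones_like(x))   # the ones vector that right-multiplies the Jacobian
print(x.grad)

>>> tensor([2., 4.])
```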
autograd.grad() works as well. The torch.ones_like(x) in that argument position, both above and here, is exactly the vector by which the Jacobian is right-multiplied.
```python
x = torch.tensor([1., 2.]).requires_grad_()
y = x * x

grad_x = torch.autograd.grad(outputs=y, inputs=x, grad_outputs=torch.ones_like(x))
print(grad_x[0])

>>> tensor([2., 4.])
```

Or:
```python
x = torch.tensor([1., 2.]).requires_grad_()
y = x * x

grad_x = torch.autograd.grad(outputs=y.sum(), inputs=x)
print(grad_x[0])

>>> tensor([2., 4.])
```

Below are a few points worth emphasising, along with some extensions.
- Zeroing gradients
PyTorch's autograd does not clear gradients automatically; they accumulate, so they need to be zeroed manually after each backward pass:
```python
x.grad.zero_()
```
In a neural network, we just call optimizer.zero_grad() (see the sketch after this list).
- Using detach() to cut the graph, so that no gradient is computed further back
Suppose we have model A and model B, A's output is used as B's input, but during training we only want to train B. We can then write input_B = output_A.detach().
If we return to the earlier example and detach a, only the path through b remains, and a becomes a leaf node.

```python
x = torch.tensor([2.], requires_grad=True)
a = torch.add(x, 1).detach()
b = torch.add(x, 2)
y = torch.mul(a, b)

y.backward()

print("requires_grad: ", x.requires_grad, a.requires_grad, b.requires_grad, y.requires_grad)
print("is_leaf: ", x.is_leaf, a.is_leaf, b.is_leaf, y.is_leaf)
print("grad: ", x.grad, a.grad, b.grad, y.grad)

>>> requires_grad:  True False True True
>>> is_leaf:  True True False False
>>> grad:  tensor([3.]) None None None
```
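A minimal sketch tying the two bullet points together (modelA, modelB, and the data here are hypothetical, not from the original post): each step zeroes the gradients, detaches A's output so only B receives gradients, then backpropagates and updates B.

```python
import torch

modelA = torch.nn.Linear(4, 4)
modelB = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(modelB.parameters(), lr=0.1)

x = torch.randn(8, 4)
target = torch.randn(8, 1)

for step in range(3):
    optimizer.zero_grad()              # clear gradients accumulated in previous steps
    input_B = modelA(x).detach()       # cut the graph: no gradient flows back into A
    loss = ((modelB(input_B) - target) ** 2).mean()
    loss.backward()
    optimizer.step()

print(modelA.weight.grad)              # None: A never received a gradient
print(modelB.weight.grad is not None)  # True: B was trained
```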
Summary

backward() fills in the .grad field of leaf tensors, while autograd.grad() returns the gradients directly. Use retain_graph=True to backpropagate through the same graph more than once and create_graph=True for higher-order derivatives, remember that gradients accumulate and must be zeroed, and use detach() to cut a tensor out of the graph so that no gradient flows back past it.