Reinforcement Learning - Part 5
FAU Lecture Notes on Deep Learning
These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video and the matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created largely automatically with deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!
Navigation
Previous Lecture / Watch this Video / Top Level / Next Lecture
Breakout is pretty hard to learn. Image created using gifify. Source: YouTube.

Welcome back to deep learning! Today, we want to talk about deep reinforcement learning. So, I have a couple of slides for you. Of course, we want to build on the concepts that we have seen in reinforcement learning, but today we talk about deep Q-learning.
Image under CC BY 4.0 from the Deep Learning Lecture.

One of the very well-known examples is human-level control through deep reinforcement learning. In [4], this was done by Google DeepMind. They showed that a neural network is able to play Atari games. The idea here is to directly learn the action-value function using a deep network. The inputs are essentially the subsequent video frames from the game, and these are processed by a deep network that produces the best next action. So, the idea is to use this deep reinforcement learning framework to learn the best next controller movements. They use convolutional layers for the frame processing and then fully connected layers for the final decision-making.
Image under CC BY 4.0 from the Deep Learning Lecture.

Here, you see the main idea of the architecture. There are these convolutional layers and ReLUs that process the input frames. Then, you go into fully connected layers and again fully connected layers. Finally, you directly produce the output, and you can see that in Atari games this is a very limited set. You can either do no action, then there are essentially eight directions, there is a fire button, and there are the eight directions combined with the fire button. That is all of the different things that you can do. So, it is a limited domain, and you can then train your system with that.
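As a rough sketch, such a network could look as follows in PyTorch. The input size, layer widths, and four-frame stack below are assumptions following the commonly published DQN setup; the lecture slide only fixes the overall structure of convolutions, ReLUs, fully connected layers, and 18 outputs.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Maps a stack of preprocessed frames to one Q-value per action."""
    def __init__(self, n_actions=18, n_frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per possible controller action
        )

    def forward(self, frame_stack):          # (batch, n_frames, 84, 84), values in [0, 255]
        return self.head(self.features(frame_stack / 255.0))

# Example: one 84x84 frame stack in, 18 action values out
q_values = AtariQNetwork()(torch.zeros(1, 4, 84, 84))
```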
Image under CC BY 4.0 from the Deep Learning Lecture.

Well, it is a deep network that directly applies Q-learning. The state of the game is essentially the current frame plus the three previous frames as an image stack. So, you have a rather fuzzy way of incorporating memory into the state. Then, you have 18 outputs that are associated with the different actions, and each output estimates the action value for the given input. You do not have a label and a cost function, but you update with respect to maximizing the future reward. There is a reward of +1 when the game score is increased and a reward of -1 when the game score is decreased. Otherwise, it is zero. They use an ε-greedy policy with ε decreasing to a low value during the training. They use a semi-gradient form of Q-learning to update the network weights w, and again they use mini-batches to accumulate the weight updates.
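A minimal sketch of the ε-greedy action selection and the reward clipping just described; the concrete annealing schedule is an assumption, since the lecture only states that ε decreases to a low value during training.

```python
import random
import torch

def epsilon_greedy_action(q_net, frame_stack, epsilon, n_actions=18):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(frame_stack.unsqueeze(0)).argmax(dim=1).item())

def clipped_reward(score_delta):
    """+1 if the game score increased, -1 if it decreased, 0 otherwise."""
    return (score_delta > 0) - (score_delta < 0)

def epsilon_schedule(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linear annealing of epsilon to a low value (assumed schedule)."""
    fraction = min(step / anneal_steps, 1.0)
    return start + fraction * (end - start)
```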
Image under CC BY 4.0 from the Deep Learning Lecture.

So, they have this target network, and it is updated using the following rule (see slide). You can see that this is very close to what we have seen in the previous video. Again, you have the weights and you update them with respect to the rewards. Now, the problem is, of course, that the term with γ and the selection of the maximum Q-value is again a function of the weights. So, the maximization in the target depends on the very weights that you are trying to update. Your target changes simultaneously with the weights that you want to learn, and this can actually lead to oscillations or divergence of the weights. So, this is not very good. To solve the problem, they introduce a second, so-called target network. Every C steps, they generate it by copying the weights of the action-value network to a duplicate network and then keep them fixed. So, you use the output q bar of the target network as the target to stabilize the maximization. You do not use q hat, the function that you are trying to learn, but q bar, a fixed version that you keep for a couple of iterations.
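As a sketch, the stabilized update with a frozen target network could look like this. The mean-squared TD loss, the optimizer, and the concrete values of γ and C are assumptions, not taken from the slide.

```python
import copy
import torch
import torch.nn.functional as F

gamma, C = 0.99, 10_000                     # assumed hyperparameters
q_net = AtariQNetwork()                     # q hat: the network being learned (sketched above)
target_net = copy.deepcopy(q_net)           # q bar: frozen copy of the weights
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def dqn_update(batch, step):
    s, a, r, s_next, done = batch           # mini-batch drawn from the replay memory
    # The target uses the frozen network, so the maximization does not chase
    # the very weights that are being updated in this step.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:                       # every C steps: refresh the target network
        target_net.load_state_dict(q_net.state_dict())
```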
Image under CC BY 4.0 from the Deep Learning Lecture.

Another trick they have been using is experience replay. Here, the idea is to reduce the correlation between the updates. So, after performing an action a_t for the current image stack and receiving the reward, you add this transition to the replay memory. You accumulate experiences in this replay memory, and then you update the network with samples drawn randomly from this memory instead of taking only the most recent ones. This way, you can stabilize the training and, at the same time, avoid focusing too much on one particular situation of the game. You try to keep in mind all of the different situations of the game; this removes the dependence on the current weights and increases the stability. I have a small example for you.
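A minimal replay memory along these lines; the capacity and batch size are assumed, typical values.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions and returns uncorrelated random mini-batches."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Drawing uniformly at random breaks the temporal correlation
        # between consecutive frames of the same game situation.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```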
240 hours of training helps. Image created using gifify. Source: YouTube.

So, this is the Atari Breakout game, and you can see that the agent, in the beginning, is not performing very well. If you train it over several iterations, you can see that the game is played better. The system learns how to follow the ball with the paddle and is then able to reflect it. If you keep iterating, you could argue that at some point the reinforcement learning system also figures out the weaknesses of the game. In particular, one situation where you can score a really large number of points is when you manage to bring the ball behind the bricks and have it bounce around there. It is then reflected by the boundaries and not by the paddle, and it generates a large score. So, this supports the claim that the system has learned a good strategy by trying to knock out only the bricks on the left-hand side. Then, it can get the ball into the region behind the other bricks.
Fast-forward of the game Lee Sedol vs. AlphaGo. Image created using gifify. Source: YouTube.

Of course, we need to talk about AlphaGo in this video. We want to look into some of the details of how it is actually implemented. You have already heard about this one. It is from the paper on mastering the game of Go with deep neural networks.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, we already discussed that Go is a much harder problem than chess because it has a very large number of possible moves and, with that, a large number of possible states that can potentially emerge. The idea is that black plays against white for control over the board. The game has simple rules but an extremely high number of possible moves and situations. Reaching the performance of human players was thought to be years away because of the high numerical complexity of the problem. So, we could brute-force chess, but with Go people thought it would be impossible until we have much, much faster computers, orders of magnitude faster. Still, they could show that the system really beats human Go experts. Go is a perfect information game: there is no hidden information and no chance. So theoretically, we could construct a full game tree and traverse it with min-max to find the best moves. The problem is the high number of legal moves. In chess, you have approximately 35; in Go, there are roughly 250 different moves that you can choose from at each step.
Image under CC BY 4.0 from the Deep Learning Lecture.

Also, the game may involve many moves, approximately one hundred and fifty. This means that an exhaustive search is completely infeasible. Well, a search tree can, of course, be pruned if you have an accurate evaluation function. For chess, if you remember Deep Blue, this was already extremely complex and based on massive human input. For Go, the state of the art in 2002 was: "No simple yet reasonable evaluation will ever be found for Go." Yet in 2016 and 2017, AlphaGo beat Lee Sedol and Ke Jie, two of the world's strongest players. So, there is a way of solving this game.
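To get a feeling for the numbers, a rough game-tree estimate with branching factor b and game length d (the chess length of about 80 plies is the usual textbook assumption, not a number from the slide) gives:

chess: b^d ≈ 35^80 ≈ 10^123 possible games, Go: b^d ≈ 250^150 ≈ 10^360 possible games.

So even hardware that is orders of magnitude faster does not close a gap of more than 200 orders of magnitude; a smarter search is needed.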
Image under CC BY 4.0 from the Deep Learning Lecture.

There were several very good ideas in this paper. It was developed by Silver et al., also at DeepMind, and it is a combination of multiple methods. They use, of course, deep neural networks. Then, they use Monte Carlo Tree Search, and they combine supervised learning and reinforcement learning. The first improvement compared to a full tree search was the Monte Carlo Tree Search; they use the networks to support an efficient search through the tree.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, what is Monte Carlo Tree Search? Well, you expand your tree by looking into different possible future moves and focusing on the moves that produce very valuable states. You expand on such a valuable state over a couple of moves into the future and then look at the value of the resulting states. So, you only look into a couple of valuable states and then expand over and over again for a couple of moves. Eventually, you find a situation that probably has a much larger state value. In other words, you try to look a bit into the future and follow moves that are likely to produce a higher state value.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, you start from the root node, which is the current state. Then, you iterate the following steps to extend the search tree and find the best future state. Here is the algorithm: you start at the root and traverse with the tree policy to a leaf node. Then, you expand, i.e., you add one or more child nodes to the current leaf, typically the ones that have valuable states. Next, you simulate, from the current or a child node, an episode with actions chosen according to your rollout policy. So, you also need a policy for these rollouts. Then, you back up and propagate the received rewards backward through the tree. This allows you to find future states that have a large state value. You repeat this for a certain amount of time. Lastly, you stop and choose the action at the root node according to the accumulated statistics. For the next move, you start again with a new root node that reflects the action your opponent has actually taken.
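A compact sketch of the four phases just listed. The `env` object, its methods, and the two policies are generic placeholders for illustration, not the AlphaGo components.

```python
class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                  # action -> child Node
        self.visits, self.value_sum = 0, 0.0

def rollout(state, rollout_policy, env):
    while not env.is_terminal(state):
        state = env.step(state, rollout_policy(state))
    return env.outcome(state)               # e.g. +1 for a win, -1 for a loss

def mcts(root, n_iterations, tree_policy, rollout_policy, env):
    for _ in range(n_iterations):
        node = root
        # 1) Selection: follow the tree policy down to a leaf node
        while node.children:
            node = tree_policy(node)
        # 2) Expansion: add child nodes for the legal actions of the leaf
        for action in env.legal_actions(node.state):
            node.children[action] = Node(env.step(node.state, action), parent=node)
        # 3) Simulation: play out an episode with the (cheap) rollout policy
        reward = rollout(node.state, rollout_policy, env)
        # 4) Backup: propagate the received reward back towards the root
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    # Finally, pick the action at the root with the best accumulated statistics
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```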
Image under CC BY 4.0 from the Deep Learning Lecture.

So, the tree policy guides how far successful paths are followed and how frequently they are revisited. This is a typical exploration/exploitation trade-off. The main problem, however, is that plain Monte Carlo Tree Search is not accurate enough for Go. The idea in AlphaGo was to control the tree expansion with a neural network to find promising actions and then to improve the value estimation with another neural network. This is more efficient in terms of expansion and evaluation than searching the tree exhaustively, and it is what makes the approach feasible for Go.
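For reference, the classical tree policy in plain MCTS balances the two terms with the UCT rule; this is the generic formulation, not AlphaGo's exact variant, which additionally weights the exploration term with a prior from the policy network:

a* = argmax_a [ W(s,a) / N(s,a) + c * sqrt( ln N(s) / N(s,a) ) ]

where W(s,a) is the accumulated value of action a in state s, N(s,a) its visit count, N(s) the visit count of the parent node, and c controls the amount of exploration.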
Image under CC BY 4.0 from the Deep Learning Lecture.

How do they use these deep neural networks? They have three different networks. There is a policy network that suggests the next move at a leaf node for the expansion. Then, there is a value network that looks at the current board situation and essentially computes the chance of winning. Lastly, there is a rollout policy network that guides the action selection during the rollouts. All of these networks are deep convolutional networks, and the input is the current board position plus additional pre-computed features.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, here is the policy network. It has 13 convolutional layers and one output for each point on the Go board. A huge database of 30 million human expert moves was available. They start with supervised learning and train the network to predict the next move in human expert play. Then, they also train this network with reinforcement learning by playing against older versions of itself, with a reward for winning the game. Playing against older versions, of course, helps to avoid correlation and instability. If you look at the training time, it was three weeks on 50 GPUs for the supervised part and one day for the reinforcement learning. So, there is actually quite a bit of supervised learning involved here and not so much reinforcement learning.
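A simplified sketch of such a policy network; the number of input feature planes and filters are assumptions, since the text only fixes the 13 convolutional layers and the one output per board point.

```python
import torch
import torch.nn as nn

class GoPolicyNet(nn.Module):
    """13 convolutional layers, softmax over the 19x19 board points."""
    def __init__(self, in_planes=48, filters=192):        # assumed feature planes / filters
        super().__init__()
        layers = [nn.Conv2d(in_planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                                # 11 intermediate 3x3 layers
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]   # 13th layer: one logit per point
        self.net = nn.Sequential(*layers)

    def forward(self, board_features):                     # (batch, in_planes, 19, 19)
        logits = self.net(board_features).flatten(1)       # (batch, 361)
        return torch.softmax(logits, dim=1)                # probability for every board point

move_probs = GoPolicyNet()(torch.zeros(1, 48, 19, 19))
```

Supervised pre-training would then minimize the cross-entropy between this output distribution and the move actually played by the human expert.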
Image under CC BY 4.0 from the Deep Learning Lecture.

Then, there is the value network. It has the same architecture as the policy network but just one output node. The goal here is to predict the probability of winning the game. They train again on self-play games from the reinforcement learning setup and use Monte Carlo policy evaluation for 30 million positions from these games. The training time was one week on 50 GPUs.
Image under CC BY 4.0 from the Deep Learning Lecture.

Then, they have the rollout policy network that is used to select the moves during the rollouts. The problem here, of course, is that the inference time of the full policy network is comparatively high. The solution was to train a simpler linear network on a subset of the data that provides actions very quickly. This led to a speed-up of approximately a factor of one thousand compared to the policy network. So, if you work with this rollout policy network, you have a much slimmer network, but it is much faster. Hence, you can run more simulations and collect more experience, and this is why they use the rollout policy network.
Image under CC BY 4.0 from the Deep Learning Lecture.

Now, there was quite a bit of supervised learning involved here. So, let's have a look at AlphaGo Zero. AlphaGo Zero does not need human play anymore. The idea is that you train solely with reinforcement learning and self-play. It uses a simpler Monte Carlo Tree Search and no rollout policy network within the tree search. Also, for the self-play games, they introduced multi-task learning: the policy and the value network share the initial layers. This then led to [3], an extension that is also able to play chess and shogi. So, it is not just a system that can solve Go; with this, you can also play chess and shogi at an expert level. Okay, this sums up what we have been doing in reinforcement learning. Of course, we could look at many other things here, but there is just not enough time.
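A sketch of the multi-task idea: one shared trunk with a policy head and a value head. The concrete layer sizes here are illustrative assumptions; the real AlphaGo Zero network uses a much deeper residual trunk.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared trunk with a policy head and a value head (AlphaGo Zero style)."""
    def __init__(self, in_planes=17, filters=64, board=19):   # assumed sizes
        super().__init__()
        self.trunk = nn.Sequential(                            # layers shared by both tasks
            nn.Conv2d(in_planes, filters, 3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(
            nn.Conv2d(filters, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board * board, board * board + 1),   # all board points + pass
        )
        self.value_head = nn.Sequential(
            nn.Conv2d(filters, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board * board, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),                       # predicted outcome in [-1, 1]
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)
```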
Image under CC BY 4.0 from the Deep Learning Lecture.

Next time in deep learning, we want to talk about algorithms that do not even need rewards. So, we look at completely unsupervised training, and we also want to learn how to benefit from adversaries. We will see that there is a very cool concept out there called generative adversarial networks, which is able to generate all kinds of different images. This is a very cool concept that we will talk about in one of the next videos. Then, we look into extensions for performing image processing tasks. So, we move more and more towards the applications.
Image under CC BY 4.0 from the Deep Learning Lecture.

Well, some comprehensive questions: What is a policy? What are value functions? Explain the exploitation versus exploration dilemma, and so on. If you are interested in reinforcement learning, I can definitely recommend having a look at the book "Reinforcement Learning" by Richard Sutton. It is really a great book, and you will learn in much more detail about all the things that we could only scratch the surface of in these videos. So, you can go much deeper into all of the details of reinforcement learning and deep reinforcement learning. There is actually much more to say at this point, but we can only remain at this level for the time being. Well, I also brought you the link and put it into the video description as well. So please enjoy this book; it is very good, and, of course, we have plenty of further references.
So, thank you very much for listening, and I hope that you can now understand at least a bit of what is happening in reinforcement learning and deep reinforcement learning, and what the main ideas are for learning to play games. Thank you very much for watching this video, and I hope to see you in the next one. Bye-bye!
If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.
Links
Link to Sutton's Reinforcement Learning in its 2018 draft, including details on deep Q-learning and AlphaGo
Translated from: https://towardsdatascience.com/reinforcement-learning-part-5-70d10e0ca3d9