Snake Played by a Deep Reinforcement Learning Agent
Ever since I watched the Netflix documentary AlphaGo, I have been fascinated by Reinforcement Learning. Reinforcement Learning is comparable to learning in real life: you see something, you do something, and your action has positive or negative consequences. You learn from those consequences and adjust your behaviour accordingly. Reinforcement Learning has many applications, such as autonomous driving, robotics, trading and gaming. In this post, I will show how a computer can learn to play the game Snake using Deep Reinforcement Learning.
The Basics
If you are familiar with Deep Reinforcement Learning, you can skip the following two sections.
Reinforcement Learning
The concept behind Reinforcement Learning (RL) is easy to grasp. An agent learns by interacting with an environment. The agent chooses an action, and receives feedback from the environment in the form of states (or observations) and rewards. This cycle continues forever or until the agent ends in a terminal state. Then a new episode of learning starts. Schematically, it looks like this:
Reinforcement Learning: an agent interacts with the environment by choosing actions and receiving observations (or states) and rewards.

The goal of the agent is to maximize the sum of the rewards during an episode. In the beginning of the learning phase, the agent explores a lot: it tries different actions in the same state. It needs this information to find the best possible actions for each state. As the learning continues, exploration decreases. Instead, the agent exploits its moves: this means it chooses the action that maximizes the reward, based on its experience.
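In practice, this trade-off is usually implemented with an epsilon-greedy rule, which is also what the epsilon parameters shown later in this article suggest. The sketch below is illustrative rather than the exact code from my repository: with probability epsilon the agent picks a random action, otherwise it picks the action with the highest predicted value.

import random

# Epsilon-greedy action selection (illustrative sketch).
# q_values holds the predicted value of each action for the current state.
def choose_action(q_values, epsilon):
    if random.random() < epsilon:      # explore: try a random action
        return random.randrange(len(q_values))
    # exploit: pick the action with the highest predicted value
    return max(range(len(q_values)), key=lambda a: q_values[a])

# After each step or episode, exploration is reduced, for example:
# epsilon = max(epsilon_min, epsilon * epsilon_decay)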
Deep Reinforcement Learning
Deep Learning uses artificial neural networks to map inputs to outputs. Deep Learning is powerful, because it can approximate any function with only one hidden layer [1]. How does it work? The network consists of layers of nodes. The first layer is the input layer. The hidden layers then transform the data with weights and activation functions. The last layer is the output layer, where the target is predicted. By adjusting the weights, the network can learn patterns and improve its predictions.
As the name suggests, Deep Reinforcement Learning is a combination of Deep Learning and Reinforcement Learning. By using the states as input, the values of the actions as output, and the rewards for adjusting the weights in the right direction, the agent learns to predict the best action for a given state.
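The 'right direction' comes from the Q-learning target: the received reward plus the discounted value of the best action in the next state. The network's prediction for the chosen action is then nudged towards this target. A minimal sketch (using the gamma parameter introduced below; not my literal implementation):

# Q-learning target for one experience (illustrative sketch).
def q_target(reward, next_q_values, done, gamma=0.95):
    if done:                 # terminal state: there are no future rewards
        return reward
    # otherwise: immediate reward plus discounted best future value
    return reward + gamma * max(next_q_values)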
Deep Reinforcement Learning in Action
Let's apply these techniques to the famous game Snake. I bet you know the game: the goal is to grab as many apples as possible while not walking into a wall or the snake's body. I built the game in Python with the turtle library.
Me playing Snake.

Defining Actions, Rewards and States
To prepare the game for an RL agent, let's formalize the problem. Defining the actions is easy: the agent can choose between going up, right, down or left. The rewards and the state space are a bit harder. There are multiple solutions, and one will work better than another. For now, let's try the following. If the snake grabs an apple, give a reward of 10. If the snake dies, the reward is -100. To help the agent, give a reward of 1 if the snake moves closer to the apple, and a reward of -1 if it moves away from the apple.
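As a rough sketch, this reward scheme could look like the hypothetical helper below; the flags and distances stand for values the game already keeps track of.

# Reward scheme described above (illustrative helper, not the exact game code).
def compute_reward(ate_apple, died, prev_distance, new_distance):
    if died:
        return -100          # dying is heavily punished
    if ate_apple:
        return 10            # grabbing an apple is rewarded
    # small shaping reward for moving towards or away from the apple
    return 1 if new_distance < prev_distance else -1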
There are a lot of options for the state: you can give scaled coordinates of the snake and the apple, or give directions to the location of the apple. An important thing to do is to add the locations of obstacles (the wall and the body), so the agent learns to avoid dying. Below is a summary of the actions, state and rewards. Later in the article you can see how adjustments to the state affect performance.
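To make this concrete, here is one possible way to encode such a state as a list of 0/1 features. It follows the description above (direction of the apple, nearby obstacles, current moving direction), but it is an illustrative sketch rather than my exact encoding.

# One possible state encoding (illustrative sketch).
def get_state(head_x, head_y, apple_x, apple_y,
              obstacle_up, obstacle_right, obstacle_down, obstacle_left,
              direction):
    # Where is the apple relative to the snake's head?
    apple_up    = int(apple_y > head_y)
    apple_down  = int(apple_y < head_y)
    apple_right = int(apple_x > head_x)
    apple_left  = int(apple_x < head_x)
    # Is there an obstacle (wall or body) directly next to the head?
    obstacles = [int(obstacle_up), int(obstacle_right),
                 int(obstacle_down), int(obstacle_left)]
    # Current moving direction, one-hot encoded.
    directions = [int(direction == d) for d in ('up', 'right', 'down', 'left')]
    return [apple_up, apple_right, apple_down, apple_left] + obstacles + directions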
Actions, rewards and state.

Creating the Environment and the Agent
By adding some methods to the Snake program, it's possible to create a Reinforcement Learning environment. The added methods are reset(self), step(self, action) and get_state(self). Besides this, it's necessary to calculate the reward every time the agent takes a step (check out run_game(self)).
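Schematically, the wrapper around the game then has the following shape. The method bodies are omitted here and the exact return value of step is an assumption based on the description above; the full implementation is on my GitHub.

# Shape of the Reinforcement Learning environment around the Snake game.
class SnakeEnv:
    def reset(self):
        """Start a new game and return the initial state."""

    def step(self, action):
        """Apply the action, advance the game one step and
        return (next_state, reward, done)."""

    def get_state(self):
        """Return the current state representation."""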
The agent uses a Deep Q Network to find the best actions. The parameters are:
# epsilon sets the level of exploration and decreases over time
param['epsilon'] = 1
param['epsilon_min'] = .01
param['epsilon_decay'] = .995

# gamma: value immediate (gamma=0) or future (gamma=1) rewards
param['gamma'] = .95

# the batch size is needed for replaying previous experiences
param['batch_size'] = 500

# neural network parameters
param['learning_rate'] = 0.00025
param['layer_sizes'] = [128, 128, 128]
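For reference, these parameters could translate into a Q-network roughly as follows, assuming a Keras-style model (the actual implementation may differ in its details):

# Sketch of a Q-network built from the parameters above (Keras assumed).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(state_size, action_size, param):
    model = Sequential()
    model.add(Dense(param['layer_sizes'][0], input_dim=state_size, activation='relu'))
    for size in param['layer_sizes'][1:]:
        model.add(Dense(size, activation='relu'))
    # One output per action: the predicted Q-value of taking that action.
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=param['learning_rate']))
    return model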
If you are interested in the code, you can find it on my GitHub.
Snake Played by the Agent
Now it is time for the key question! Does the agent learn to play the game? Let’s find out by observing how the agent interacts with the environment.
In the first games, the agent has no clue:
The first games.

The first apple! It still seems like the agent doesn't know what it is doing…
Finds the first apple… and hits the wall.

End of game 13 and beginning of game 14:
Improving!

The agent learns: it doesn't take the shortest path, but it finds its way to the apples.
Game 30:
Good job! New high score!

Wow, the agent avoids the body of the snake and finds a fast way to the apples, after playing only 30 games!
Playing with the State Space
The agent learns to play snake (with experience replay), but maybe it’s possible to change the state space and achieve similar or better performance. Let’s try the following four state spaces:
Can you make a guess and rank them from the best state space to the worst after playing 50 games?
An agent playing Snake prevents you from seeing the answer right away :)

Made your guess?
Here is a graph with the performance using the different state spaces:
Defining the right state accelerates learning! This graph shows the mean return of the last twenty games for the different state spaces.

It is clear that the agent using the state space with the directions (the original state space) learns fast and achieves the highest return. But the state space using the coordinates is still improving, and maybe it can reach the same performance when it trains longer. A reason for the slow learning might be the number of possible states: 20⁴ · 2⁴ · 4 = 10,240,000 different states are possible (the snake canvas is 20*20 steps, there are 2⁴ options for the obstacles, and 4 options for the current direction). For the original state space the number of possible states is equal to 3² · 2⁴ · 4 = 576 (3 options each for above/below and left/right). 576 is more than 17,000 times smaller than 10,240,000. This influences the learning process.
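These counts are easy to verify with a few lines of Python:

# State-space sizes for the two representations discussed above.
coordinate_states = 20**4 * 2**4 * 4   # snake x/y and apple x/y, obstacles, direction
direction_states  = 3**2 * 2**4 * 4    # apple above/level/below and left/level/right, obstacles, direction
print(coordinate_states)                     # 10240000
print(direction_states)                      # 576
print(coordinate_states / direction_states)  # roughly 17778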
Playing with the Rewards
What about the rewards? Is there a better way to program them?
Recall how our rewards were defined: a reward of 10 for grabbing an apple, -100 for dying, 1 for moving closer to the apple, and -1 for moving away from it.
Blooper #1: Walk in Circles

What if we change the reward of -1 to 1? By doing this, the agent will receive a reward of 1 every time it survives a time step. This can slow down learning in the beginning, but in the end the agent won't die, and that's a pretty important part of the game!
Well, does it work? The agent quickly learns how to avoid dying:
Agent receives a reward of 1 for surviving a time step.

-1, come back please!
Blooper #2: Hit the Wall

Next try: change the reward for coming closer to the apple to -1 and the reward for grabbing an apple to 100. What will happen? You might think: the agent receives a -1 for every time step, so it will run to the apples as fast as possible! That could be true, but there's something else that might happen…
The agent runs into the nearest wall to minimize the negative return.

Experience Replay
One secret behind the fast learning of the agent (it only needs 30 games) is experience replay. With experience replay, the agent stores previous experiences and uses them to learn faster. At every normal step, a number of replay steps (the batch_size parameter) are performed. This works so well for Snake because, given the same state-action pair, there is low variance in the reward and the next state.
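A simplified sketch of experience replay is shown below. It assumes the Keras-style model sketched earlier; the buffer size and the one-sample-at-a-time update are illustrative choices, not necessarily the ones from my repository.

import random
import numpy as np
from collections import deque

# Replay buffer: store experiences and replay a sample of them at every step.
memory = deque(maxlen=2500)   # maximum buffer size is an assumption

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def replay(model, batch_size, gamma=0.95):
    if len(memory) < batch_size:
        return                # not enough stored experiences yet
    for state, action, reward, next_state, done in random.sample(memory, batch_size):
        target = reward
        if not done:          # add discounted future value for non-terminal steps
            target += gamma * np.max(model.predict(np.array([next_state]), verbose=0)[0])
        q_values = model.predict(np.array([state]), verbose=0)
        q_values[0][action] = target   # only the chosen action's value is updated
        model.fit(np.array([state]), q_values, epochs=1, verbose=0)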
Blooper #3: No Experience Replay

Is experience replay really that important? Let's remove it! For this experiment, a reward of 100 for eating an apple is used.
This is the agent without using experience replay after playing 2500 games:
Training without experience replay. Even though the agent played 2500 (!) games, it can't play Snake. The game is shown at high speed, because otherwise it would take days to reach 10000 games.

After 3000 games, the highest number of apples caught in one game is 2.
After 10000 games, the highest number is 3… Was that 3 the result of learning, or was it luck?
It seems indeed that experience replay helps a lot, at least for these parameters, rewards and this state space. How many replay steps per step are necessary? The answer might surprise you. To answer this question, we can play with the batch_size parameter (mentioned in the section Creating the Environment and the Agent). In the original experiment the value of batch_size was 500.
An overview of returns with different experience replay batch sizes:
Training 200 games with 3 different batch sizes: 1 (no experience replay), 2 and 4. Mean return of the previous 20 episodes.

Even with a batch size of 2 the agent learns to play the game. In the graph you can see the impact of increasing the batch size: the same performance is reached more than 100 games earlier if a batch size of 4 is used instead of a batch size of 2.
Conclusions
The solution presented in this article gives results. The agent learns to play snake and achieves a high score (number of apples eaten) between 40 and 60 after playing 50 games. That is way better than a random agent!
The attentive reader would say: ‘The maximum score for this game is 399. Why doesn’t the agent achieve a score of anything close to 399? There’s a huge difference between 60 and 399!’ That’s right! And there is a problem with the solution from this article: the agent does not learn to avoid enclosing. The agent learns to avoid obstacles directly surrounding the snake’s head, but it can’t see the whole game. So the agent will enclose itself and die, especially when the snake is longer.
Enclosing.

An interesting way to solve this problem is to use pixels and Convolutional Neural Networks in the state space [2]. Then it is possible for the agent to 'see' the whole game, instead of just the nearby obstacles. It can learn to recognize the places it should go to avoid enclosing itself and get the maximum score.
[1] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators (1989), Neural Networks 2(5): 359–366
[2] Mnih et al, Playing Atari with Deep Reinforcement Learning (2013)
Translated from: https://towardsdatascience.com/snake-played-by-a-deep-reinforcement-learning-agent-53f2c4331d36