Maze Solver Using Naive Reinforcement Learning
This is a short maze solver game I wrote from scratch in python (in under 260 lines) using numpy and opencv. Code link included at the end.
I wrote this to understand the fundamentals of Q-Learning and apply the theoretical concepts directly in code from scratch. Follow along if you wanna get your hands dirty with reinforcement learning!
Game Objective -
Find the optimal movement policy which takes an agent from any starting position (shown in black-gray shades on the left) to the closest destination (blue-ish) box while avoiding danger zone (red) and wall (green) boxes.
A “policy” can be thought of as the set of “smart-movement” rules which the agent learns to navigate its environment. In this case, they’re visualized as arrows (shown on left). This is done through Q-Learning.
Significance -
You might ask whether making game-playing AIs like these is relevant at all in practical applications, and that’s fair. Actually, these are toy problems designed in such a way that their solutions are broadly applicable.
For example, the current example of maze solving can further be extended for autonomous navigation in an occupancy grid to get to the nearest EV charging station.
Q學(xué)習(xí)算法和Q表方法- (The Q-Learning Algorithm and the Q-Table approach -)
Q-Learning is centered around the Bellman Equation and finding the q-value for each action at the current state. Finding an optimal policy involves recursively solving this equation multiple times.
(Figure: The Bellman Equation. This can be recursively solved to obtain the “Q-values” or “quality values” of different actions given the agent’s current state.)
Only the main parts of the Bellman Equation relevant to this implementation will be explained in this article. For a more in-depth primer on the Bellman equation, check reference [1].
What is the Q-value?
Imagine you are an unfortunate soul stuck in a simple 2D world like the following -
(Figure: Yes, that’s you. You are sad. The orange arrows dictate the displacements you can make in this 2D world.)
Well, you look sad. You should be. Who wants to be in a 2D world anyway?
Well… let’s put a smile on that face, shall we? 🎃
Given that the only movements you can make are the orange arrows shown in the image on the left (and a no-op operation), you gotta find your way to the nearest exit portal.
Given these conditions, at any given stage, you’ll have to make a decision on one of these actions. To do that, your brain does an internal “ranking” of the actions, taking many things into consideration. This might include things like -
Where is the nearest exit?
Are there any danger zones?
Where dem walls at boi?
Why is it getting hot in here? (We’ll get to this by discussing adding a small -ve reward for every time the agent does nothing)
Now you, being an advanced human, process these implicitly and assign a quality-value, or a “Q-value”, to each of the actions (up, down, left, right, no-op) you can take at that point.
But how can you make a computer do it?
Simple: you somehow assign a numeric q-value to each action in each situation you might encounter. However, this is the naive approach, and as stated in the title, we shall stick to it here. For more advanced stuff, there are tons of other articles where you should be looking.
Pretty much like how we humans form perceptions of “good” and “bad” actions based on real-life experiences, the agent has to be trained in a similar way.
Now, this brings us to the following question -
What is the Q-table?
Simply put, this is the memory of experiences, so to speak, that you’ll be updating and querying every time you have to make a decision and perform an action in the environment.
An accurate visual representation of your relationship with the Q-table is shown on the left.
Now, to build the Q-table, you need to collect information about the world. It needs to know of danger zones, walls it could bump into, and pretty much anything to help you not die soon (much like life itself).
To do this, let’s assume you can die a thousand deaths. Yes, sacrifice is necessary for science.
Armed with this, you will start at random locations and kind-of begin randomly roaming around until you start forming a perception of the world around you. This perception is shaped by what you encounter while roaming around.
(Figure: You wanna avoid pain. In this sense, that means actions in situations which lead to -ve rewards. Therefore, you ‘take note of them’ in the Q-table whenever you encounter them.)
For example, you may hit a wall — that’s bad, cuz you’re bleeding. Now you’ll remember that in that situation, whatever action you took which caused you to bleed shouldn’t be repeated.
Sometimes, you’ll even encounter danger zones raging with fire 🔥🧨 which will end your life as soon as you step on them. This is worse than bleeding, which will be quantified by assigning a more -ve reward value for such experiences.
Now for the better things in life.
Similarly, you’ll also keep track of all the good things (when you receive a +ve reward) which happen during your time in the maze. Well, in this case, there’s only one good thing which can happen - E S C A P E.
This just sounds like another way of dying, but hey, let’s pretend it’s more fun cuz it sounds different from death.
To do all of this, you’ll basically build a table storing the q-values of performing each and every action in every possible scenario in the environment (do remember that this is naive for a reason).
A higher q-value for a given action in a given state means that action will be more likely to be taken by you (the agent).
Shown below are two different states with example q-values for each action that can be performed by you (the agent) at those states.
(Figure: In each state, the agent is located in the boxed region of the checkerboard world. For each state, shown to the right are the different actions (up, left, right, down, no-op respectively from top to bottom) the agent can take, along with their q-values derived from the Q-Table.)
The q-values then act as a guide towards taking the next action to maximize overall reward (which means escape). At every step, the following actions will be performed sequentially in this naive scenario -
Take the action with the highest q-value.
Record the new state and the reward received, and use them to update the Q-table via the Bellman Equation (we’ll get to this shortly; a rough sketch of this loop in code follows right after this list).
學(xué)習(xí)可視化 (Learning Visualization)
(Figure: Final learned representation of the Q-table rendered visually onto the maze world. It is implemented from scratch in the codebase using numpy.)
Given that all state transition rules are defined (which in this case is quite simple given the basic nature of the maze world), after repeating these iterations a sufficient number of times, the agent builds a “vector field map”, so to speak, of the different actions that should be performed at each location of the maze so as to reach the nearest destination in the minimum time.
Shown on the left is the final learned representation of the Q-table.
The arrows are visualized by obtaining a vector sum of the different q-values at each location. For example, if we have the following q-values for up, left, right, down — qu, ql, qr, qd
Then the arrow, on a 2D plane (horizontal is the X-axis, vertical is the Y-axis), will have its x-component as qr - ql and its y-component as qd - qu.
The length of the arrow is the norm of this vector obtained using the following formula -
Therefore, if you start at any location in the maze, you can follow the arrows and reach the nearest destination by avoiding walls and danger zones.
在探索迷宮的同時(shí)更新Q表- (Updating the Q-Table while exploring the maze -)
This is one of the more challenging parts of the problem which greatly affects how soon you’ll be getting your sweet release (it’s not death, let’s remember that haha).
Basically, here is the question —
You take the highest q-value action at your given state, following which you end up in a new state (let’s hope, for simplicity, that you don’t die for now).
Next, you’d like to record in the Q-table whether your action has brought you closer to the nearest destination. How could you do this?
All you have to work with here are the following -
Existing q-values at the new and old states defined for each action. They might have been randomly initialized or obtained from a previous iteration.
The reward you gained for the action you performed to get to the new state from the old state.
The action you performed to get to the new state from the old state.
How would you change the existing Q-table values you obtained for the old state to make a better decision if you come across it in the future?
This is the very basic question which is answered by the Bellman equation in this case -
This can be recursively solved to obtain the “Q-values” or “quality values” of different actions given the agent’s current state. Following are the variable definitions -
a is the action.
s and s’ are the old and new states respectively.
𝛾 is the discount factor, a constant between 0 and 1. You need this to prioritize current reward over expected future reward.
Q(s, a) is the q-value of the action a you just took to reach the new state from the old state s.
Q(s’) is the maximum q-value at the new state s’.
R(s, a) is the reward you immediately receive for performing a to transition from s to s’.
The max term is the secret sauce here. This causes the equation to iterate through every action a until the maximum value of the expression inside the max term is obtained. It finally returns that value q and the corresponding action a.
Every action a performed from state s might lead to a different new state s’ in each iteration. Therefore, each time, the maximum of the q-values defined at s’ is chosen to compute the expression inside the max.
Once the values q and a are obtained, the Q-table value defined for action a at state s is then overwritten by q.
In our case, this representation is the value function (don’t worry if you don’t get this; well, I just pulled an Andrew Ng on you 😈).
在迷宮中運(yùn)行代理- (Running the agent in the maze -)
Finally, you’ve made it here, congrats! Here is an exclusive RL meme for you from my meme page @ml.exe. You deserve it bud.
(Figure: Don’t worry, healthy narcissism won’t kill you.)
After a sufficient number of iterations of the Bellman equation, you’ll converge to optimum q-values for each action at each state.
When you want to run the agent, simply start from any spawn point and blindly do the action with the highest q-value. You’ll reach the nearest destination.
However, there are a few caveats to getting this right -
Reward policies should be carefully designed. This means correct reward values should be assigned for performing each action at each state. Since this case is so simple, a simple scheme like the following works well -
discount_factor = 0.5
default_reward = -0.5
wall_penalty = -0.6
win_reward = 5.0
lose_reward = -10.0
default_reward is the reward obtained for doing nothing at all. Remember the basic question we asked ourselves at the beginning of this article, “Why is it getting hot in here?” Well, here it is. Assigning a small negative reward encourages the agent to seek actions to end its misery rather than sitting around like an obese piece of lard.
wall_penalty is the reward received if you bump into a wall while doing the action from your present state. Whenever you bump into a wall, you remain at your original location while receiving this “reward” 🤣.
win_reward and lose_reward speak for themselves.
You lose a game if you end up on any of the danger zones. Upon dying, you respawn at a randomly chosen location on the grid.
In the codebase, you can play around with rewards to see how this affects solution convergence.
結(jié)論 (Conclusion)
If you correctly understand the steps cited in this article, you’ll be able to fully understand the codebase I wrote from scratch to implement all of this. You can find it here -
The code writes out a video of the agent training and learning as shown in the YouTube video below. You can generate random worlds with varying complexities.
If you found this helpful, feel free to follow me for more upcoming articles :)
I’m the editor of the following publication which publishes Tech articles related to the usage of AI & ML in digital mapping of the Earth. Feel free to follow to stay updated :)
翻譯自: https://towardsdatascience.com/maze-rl-d035f9ccdc63
強(qiáng)化學(xué)習(xí) 求解迷宮問題
總結(jié)
以上是生活随笔為你收集整理的强化学习 求解迷宫问题_使用天真强化学习的迷宫求解器的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 出手!商务部拟将激光雷达等技术列入禁止限
- 下一篇: 5G四足机器人“入职”中国电信核心机房: