
A Supplementary Code Explanation of the Markov Decision Process

  This article is a code supplement to https://www.cnblogs.com/chester-cs/p/12928649.html. Here I only implement value iteration; readers can try implementing policy iteration on their own.

  After learning about MDPs (Markov decision processes), I wanted to implement one in code and watch how the machine gradually improves its own behavior.

  Consider a world like this: there are rewards, traps, and obstacles, and after the agent is given a command, its actual behavior is non-deterministic.

  How does this problem differ from the more common "reinforcement learning problem"?

  1. The state space is finite and discrete, whereas many "reinforcement learning problems" have continuous, infinite state spaces.

  2. The action space is finite and discrete, whereas many "reinforcement learning problems" have continuous actions.

  3. Because of (1) and (2), we can store the $Q$ value of every state-action pair, i.e. $Q(s,a)$, in a matrix. When the state and action spaces are large this is no longer feasible, and we can instead fit $Q$ with methods such as artificial neural networks.

  With these considerations in mind, let's not over-think it and simply see what this typical MDP looks like after optimization. I wrote an ipynb notebook.
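
  For reference, the update that the loop below performs is the standard Bellman optimality backup, repeated until $V$ stops changing:

$$Q(s,a)=R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,V(s'),\qquad V(s)=\max_{a}Q(s,a)$$

  with discount factor $\gamma=0.9$ in the code, and the final greedy policy is simply $\pi(s)=\arg\max_a Q(s,a)$.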

  

## Use value iteration to find the optimal policy; changing the reward function makes the agent adopt a policy that never loses
import numpy as np

P = np.zeros((11,4,11))   # P[s][a][s']: probability of landing in state s' after taking action a in state s
V = np.zeros(11)          # V[s]: value of state s
Q = np.zeros((11,4))      # Q[s][a]: state-action value
R = np.zeros((11,4))      # R[s][a]: expected immediate reward

## Define the transition probability matrix
P[0] = np.array([[0.15,0.8,0,0.05,0,0,0,0,0,0,0],[0.95,0,0,0.05,0,0,0,0,0,0,0],[0.95,0.05,0,0,0,0,0,0,0,0,0],[0.15,0.05,0,0.8,0,0,0,0,0,0,0]])
P[1] = np.array([[0,0.2,0.8,0,0,0,0,0,0,0,0],[0.2,0.8,0,0,0,0,0,0,0,0,0],[0.05,0.9,0.05,0,0,0,0,0,0,0,0],[0.05,0.9,0.05,0,0,0,0,0,0,0,0]])
P[2] = np.array([[0,0,0.95,0,0.05,0,0,0,0,0,0],[0,0.8,0.15,0,0.05,0,0,0,0,0,0],[0,0.05,0.95,0,0,0,0,0,0,0,0],[0,0.05,0.15,0,0.8,0,0,0,0,0,0]])
P[3] = np.array([[0.05,0,0,0.9,0,0.05,0,0,0,0,0],[0.05,0,0,0.9,0,0.05,0,0,0,0,0],[0.8,0,0,0.2,0,0,0,0,0,0,0],[0,0,0,0.2,0,0.8,0,0,0,0,0]])
P[4] = np.array([[0,0,0.05,0,0.9,0,0,0.05,0,0,0],[0,0,0.05,0,0.9,0,0,0.05,0,0,0],[0,0,0.8,0,0.2,0,0,0,0,0,0],[0,0,0,0,0.2,0,0,0.8,0,0,0]])
P[5] = np.array([[0,0,0,0.05,0,0.1,0.8,0,0.05,0,0],[0,0,0,0.05,0,0.9,0,0,0.05,0,0],[0,0,0,0.8,0,0.15,0.05,0,0,0,0],[0,0,0,0,0,0.15,0.05,0,0.8,0,0]])
P[6] = np.array([[0,0,0,0,0,0,0.15,0.8,0,0.05,0],[0,0,0,0,0,0.8,0.15,0,0,0.05,0],[0,0,0,0,0,0.05,0.9,0.05,0,0,0],[0,0,0,0,0,0.05,0.1,0.05,0,0.8,0]])
P[7] = np.array([[0,0,0,0,0.05,0,0,0.9,0,0,0.05],[0,0,0,0,0.05,0,0.8,0.1,0,0,0.05],[0,0,0,0,0.8,0,0.05,0.15,0,0,0],[0,0,0,0,0,0,0.05,0.15,0,0,0.8]])
P[8] = np.array([[0,0,0,0,0,0.05,0,0,0.15,0.8,0],[0,0,0,0,0,0.05,0,0,0.95,0,0],[0,0,0,0,0,0.8,0,0,0.15,0.05,0],[0,0,0,0,0,0,0,0,0.95,0.05,0]])
P[9] = np.array([[0,0,0,0,0,0,0,0,0,1,0],[0,0,0,0,0,0,0,0,0,1,0],[0,0,0,0,0,0,0,0,0,1,0],[0,0,0,0,0,0,0,0,0,1,0]])
P[10] = np.array([[0,0,0,0,0,0,0,0,0,0,1],[0,0,0,0,0,0,0,0,0,0,1],[0,0,0,0,0,0,0,0,0,0,1],[0,0,0,0,0,0,0,0,0,0,1]])

## Define the reward function (expected immediate reward for each state-action pair)
punishment = -100   # penalty for being killed by the dragon
reward = 100        # reward for reaching the trophy
R[0] = np.array([0,0,0,0])
R[1] = np.array([0,0,0,0])
R[2] = np.array([0,0,0,0])
R[3] = np.array([0,0,0,0])
R[4] = np.array([0,0,0,0])
R[5] = np.array([0,0,0,0])
R[6] = np.array([0.05*punishment,0.05*punishment,0,0.8*punishment])
R[7] = np.array([0.05*reward,0.05*reward,0,0.8*reward])
R[8] = np.array([0.8*punishment,0,0.05*punishment,0.05*punishment])
R[9] = np.array([0,0,0,0])
R[10] = np.array([0,0,0,0])

## Value iteration: repeatedly apply the Bellman optimality backup
for j in range(100):
    for s in range(9):           # states 9 and 10 are absorbing, so their values stay 0
        for a in range(4):
            Q[s][a] = R[s][a]+0.9*P[s][a]@V   # discount factor gamma = 0.9
        V[s] = np.max(Q[s])
print('Print policy(With punish:-100,reward:100): ')
for l in range(9):
    print('State' + str(l+1) +': '+ str(np.argmax(Q[l])))   # greedy action index for each state

  There are 11 states in total, numbered 0-10. We define the value $V$ of each state, the state-action value matrix $Q(s,a)$, and the transition probability matrix $P$. Actions 0-3 stand for moving up, down, left and right respectively.
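
  Since actions 0-3 stand for up, down, left and right, a small helper (my addition, not in the original notebook; the ACTION_NAMES list simply mirrors that convention) makes the printed policy easier to read:

ACTION_NAMES = ['up', 'down', 'left', 'right']   # ordering taken from the convention above
greedy = np.argmax(Q, axis=1)                    # greedy action index for every state
for s in range(9):                               # states 9 and 10 are absorbing, no action needed there
    print('State' + str(s+1) + ': ' + ACTION_NAMES[greedy[s]])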

  The value vector has length 11 and the state-action matrix is 11×4; both are easy to understand, right? The state transition matrix, however, I defined as 11×4×11, and it is used like this: there are 11 possible current states, each with 4 available actions, and after choosing an action the next state is again "non-deterministic" (that is a rule of this game!), so the state reached after the transition must itself be described by a probability distribution over the 11 states.
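
  Because every slice P[s][a] is meant to be a probability distribution over the 11 next states, a quick sanity check (a small addition of mine, not in the original notebook) catches typos in the hand-entered matrix:

## Every P[s][a] row must be a valid probability distribution
assert np.allclose(P.sum(axis=2), 1.0), 'some P[s][a] does not sum to 1'
assert (P >= 0).all(), 'negative transition probability found'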

  I ran value iteration twice: once with a reward of 100 for reaching the trophy and a penalty of -100 for being killed by the dragon, and once with the same reward of 100 but a penalty of -10000. The resulting policies are different.

Print policy(With punish:-100,reward:100): 
State1: 0
State2: 0
State3: 3
State4: 3
State5: 3
State6: 0
State7: 0
State8: 3
State9: 2

Print policy(With punish:-10000,reward:100): 
State1: 0
State2: 0
State3: 3
State4: 2
State5: 3
State6: 2
State7: 2
State8: 3
State9: 1

  Take state 9 as an example. When the penalty is small, the agent boldly heads straight left, because that reaches the trophy faster and the probability of getting killed is not that large. When the penalty is large, the agent has learned to be cautious and keeps going down instead: it has learned that going left carries a 0.05 probability of a ruinous penalty, while going down is slower but steady and safe. Isn't this exactly how the agent's intelligence emerges from the iterations~~~
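
  To reproduce the comparison above in one place, the loop can be wrapped in a function parameterized by the penalty. This is just a sketch of mine, not part of the original notebook; it reuses the transition matrix P defined earlier and rebuilds R with the same layout (outcome probability times reward or penalty):

def value_iteration(punishment, reward, gamma=0.9, iters=100):
    ## Same reward layout as in the notebook: probability of the outcome times its reward/penalty
    R = np.zeros((11, 4))
    R[6] = np.array([0.05, 0.05, 0, 0.8]) * punishment
    R[7] = np.array([0.05, 0.05, 0, 0.8]) * reward
    R[8] = np.array([0.8, 0, 0.05, 0.05]) * punishment
    V = np.zeros(11)
    Q = np.zeros((11, 4))
    for _ in range(iters):
        for s in range(9):                    # states 9 and 10 are absorbing
            Q[s] = R[s] + gamma * P[s] @ V    # Bellman optimality backup
            V[s] = Q[s].max()
    return np.argmax(Q[:9], axis=1)           # greedy action per non-terminal state

print(value_iteration(punishment=-100, reward=100))     # the bold policy above
print(value_iteration(punishment=-10000, reward=100))   # the cautious policy above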