A Code Companion to Markov Decision Processes
This post is a code companion to https://www.cnblogs.com/chester-cs/p/12928649.html. Here I implement only value iteration; readers can try implementing policy iteration themselves.
After learning about MDPs (Markov decision processes), I wanted to implement one in code and watch the machine gradually optimize itself.
Consider a world with rewards, traps, and obstacles. After the agent is given a command, its actual behavior is non-deterministic.
How does this problem differ from the more general "reinforcement learning problem"?
1. The state space is finite and discrete, whereas many reinforcement learning problems have continuous or even infinite state spaces.
2. The action space is finite and discrete, whereas many reinforcement learning problems have continuous action spaces.
3. Because of (1) and (2), we can store the value of every action in every state, i.e. $Q(s,a)$, as a matrix. When the state or action space is very large this is infeasible, and we can instead fit $Q$ with, for example, an artificial neural network.
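To make point 1-3 concrete before the full gridworld below, here is a minimal tabular value-iteration sketch on a tiny 2-state, 2-action MDP. All the numbers here are invented purely for illustration; only the update rule matches what follows.

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[s, a] is a distribution over next
# states, R[s, a] the expected immediate reward, gamma the discount.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * P[s, a] @ V  # Bellman backup
        V[s] = Q[s].max()  # greedy (in-place) value update

policy = Q.argmax(axis=1)
print(policy)  # -> [1 1]: both states prefer action 1 here
```

Because everything is finite and discrete, $Q$ fits in a plain 2×4-entry array and the backup is a single dot product per state-action pair.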
With these considerations out of the way, let's not worry further and simply look at this typical MDP and what the optimized policy looks like. I wrote the code in an ipynb file.
```python
import numpy as np

## Use value iteration to find the optimal policy; changing the
## reward function makes the agent adopt a strategy that never loses.
P = np.zeros((11, 4, 11))  # P[s][a] is a distribution over next states
V = np.zeros(11)           # state values
Q = np.zeros((11, 4))      # state-action values
R = np.zeros((11, 4))      # expected immediate reward for each (s, a)

## Define the transition probability tensor
P[0] = np.array([[0.15,0.8,0,0.05,0,0,0,0,0,0,0],
                 [0.95,0,0,0.05,0,0,0,0,0,0,0],
                 [0.95,0.05,0,0,0,0,0,0,0,0,0],
                 [0.15,0.05,0,0.8,0,0,0,0,0,0,0]])
P[1] = np.array([[0,0.2,0.8,0,0,0,0,0,0,0,0],
                 [0.2,0.8,0,0,0,0,0,0,0,0,0],
                 [0.05,0.9,0.05,0,0,0,0,0,0,0,0],
                 [0.05,0.9,0.05,0,0,0,0,0,0,0,0]])
P[2] = np.array([[0,0,0.95,0,0.05,0,0,0,0,0,0],
                 [0,0.8,0.15,0,0.05,0,0,0,0,0,0],
                 [0,0.05,0.95,0,0,0,0,0,0,0,0],
                 [0,0.05,0.15,0,0.8,0,0,0,0,0,0]])
P[3] = np.array([[0.05,0,0,0.9,0,0.05,0,0,0,0,0],
                 [0.05,0,0,0.9,0,0.05,0,0,0,0,0],
                 [0.8,0,0,0.2,0,0,0,0,0,0,0],
                 [0,0,0,0.2,0,0.8,0,0,0,0,0]])
P[4] = np.array([[0,0,0.05,0,0.9,0,0,0.05,0,0,0],
                 [0,0,0.05,0,0.9,0,0,0.05,0,0,0],
                 [0,0,0.8,0,0.2,0,0,0,0,0,0],
                 [0,0,0,0,0.2,0,0,0.8,0,0,0]])
P[5] = np.array([[0,0,0,0.05,0,0.1,0.8,0,0.05,0,0],
                 [0,0,0,0.05,0,0.9,0,0,0.05,0,0],
                 [0,0,0,0.8,0,0.15,0.05,0,0,0,0],
                 [0,0,0,0,0,0.15,0.05,0,0.8,0,0]])
P[6] = np.array([[0,0,0,0,0,0,0.15,0.8,0,0.05,0],
                 [0,0,0,0,0,0.8,0.15,0,0,0.05,0],
                 [0,0,0,0,0,0.05,0.9,0.05,0,0,0],
                 [0,0,0,0,0,0.05,0.1,0.05,0,0.8,0]])
P[7] = np.array([[0,0,0,0,0.05,0,0,0.9,0,0,0.05],
                 [0,0,0,0,0.05,0,0.8,0.1,0,0,0.05],
                 [0,0,0,0,0.8,0,0.05,0.15,0,0,0],
                 [0,0,0,0,0,0,0.05,0.15,0,0,0.8]])
P[8] = np.array([[0,0,0,0,0,0.05,0,0,0.15,0.8,0],
                 [0,0,0,0,0,0.05,0,0,0.95,0,0],
                 [0,0,0,0,0,0.8,0,0,0.15,0.05,0],
                 [0,0,0,0,0,0,0,0,0.95,0.05,0]])
P[9] = np.array([[0,0,0,0,0,0,0,0,0,1,0],
                 [0,0,0,0,0,0,0,0,0,1,0],
                 [0,0,0,0,0,0,0,0,0,1,0],
                 [0,0,0,0,0,0,0,0,0,1,0]])
P[10] = np.array([[0,0,0,0,0,0,0,0,0,0,1],
                  [0,0,0,0,0,0,0,0,0,0,1],
                  [0,0,0,0,0,0,0,0,0,0,1],
                  [0,0,0,0,0,0,0,0,0,0,1]])

## Define the reward function
punishment = -100
reward = 100
R[0] = np.array([0,0,0,0])
R[1] = np.array([0,0,0,0])
R[2] = np.array([0,0,0,0])
R[3] = np.array([0,0,0,0])
R[4] = np.array([0,0,0,0])
R[5] = np.array([0,0,0,0])
R[6] = np.array([0.05*punishment,0.05*punishment,0,0.8*punishment])
R[7] = np.array([0.05*reward,0.05*reward,0,0.8*reward])
R[8] = np.array([0.8*punishment,0,0.05*punishment,0.05*punishment])
R[9] = np.array([0,0,0,0])
R[10] = np.array([0,0,0,0])

## Value iteration: in-place Bellman backups over the 9 non-terminal states
for j in range(100):
    for s in range(9):
        for a in range(4):
            Q[s][a] = R[s][a] + 0.9*P[s][a]@V
        V[s] = np.max(Q[s])

print('Print policy(With punish:-100,reward:100): ')
for l in range(9):
    print('State' + str(l+1) + ': ' + str(np.argmax(Q[l])))
```
There are 11 states in total, numbered 0-10. We define a state-value vector $V$, a state-action value matrix $Q(s,a)$, and a transition probability tensor $P$. Actions 0-3 mean moving up, down, left, and right respectively.
The value vector has length 11 and the state-action matrix is 11×4; both are easy to understand. The transition tensor I defined as 11×4×11, used as follows: there are 11 possible current states, each with 4 possible actions, and after choosing an action the next state is still non-deterministic (by the rules of this game!), so the post-transition state is itself a probability distribution, i.e. a vector.
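Since every slice $P[s][a]$ must be a probability distribution over next states, a quick sanity check (my own habit, not part of the original post) catches typos in those long hand-typed rows. Sketched here on a tiny made-up 2×2×2 tensor:

```python
import numpy as np

def check_transition_tensor(P):
    """Verify P[s][a] is a valid distribution for every state-action pair."""
    assert np.all(P >= 0), "probabilities must be non-negative"
    assert np.allclose(P.sum(axis=2), 1.0), "each P[s][a] row must sum to 1"

# Tiny made-up example: 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.5, 0.5]]])
check_transition_tensor(P)
print("ok")
```

Running the same check on the 11×4×11 tensor above would flag any row that was mistyped.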
I ran value iteration twice: once with a reward of 100 for reaching the trophy and a penalty of 100 for being killed by the dragon, and once with a reward of 100 and a penalty of 10000. The results differ.
```
Print policy(With punish:-100,reward:100):
State1: 0
State2: 0
State3: 3
State4: 3
State5: 3
State6: 0
State7: 0
State8: 3
State9: 2
```

```
Print policy(With punish:-10000,reward:100):
State1: 0
State2: 0
State3: 3
State4: 2
State5: 3
State6: 2
State7: 2
State8: 3
State9: 1
```
Look at state 9. When the penalty is small, the agent boldly walks straight left, since that reaches the trophy faster and the chance of being killed is low. When the penalty is large, the agent has learned caution and always walks down: it has learned that going left carries a 0.05 probability of a ruinous penalty, while going down is slower but steady and safe. Isn't this exactly the agent acquiring intelligence through iteration~~~
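The state-9 trade-off can be sketched with made-up numbers (the 100/80 payoffs and 0.95/0.9 factors below are illustrative only, not the gridworld's actual values): the risky move has a high expected payoff until the penalty term dominates.

```python
import numpy as np

# Toy illustration of reward shaping: "left" is fast but hits the
# penalty with probability 0.05; "down" is slower but safe.
def q_values(penalty):
    q_left = 0.95 * 100 + 0.05 * penalty  # risky shortcut
    q_down = 0.9 * 80                     # safe detour: smaller, discounted payoff
    return np.array([q_left, q_down])

print(np.argmax(q_values(-100)))    # 0: with a mild penalty the shortcut wins
print(np.argmax(q_values(-10000)))  # 1: with a ruinous penalty the safe path wins
```

Value iteration discovers exactly this flip automatically once the penalty is large enough to outweigh the 0.95 chance of success.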