Reinforcement Learning: A Small Maze Example with the Sarsa-lambda Algorithm
阿新 · Published: 2018-11-26
Sarsa-lambda is a way to speed up the Sarsa algorithm.
Whereas Sarsa and Q-learning update only the single step taken right before a reward is received, Sarsa-lambda also updates the steps leading up to that reward, weighted by lambda. lambda takes a value in [0, 1]:
If lambda = 0, Sarsa-lambda reduces to plain Sarsa and updates only the last step taken before the reward.
If lambda = 1, Sarsa-lambda updates every step taken on the way to the reward. (From https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/3-3-tabular-sarsa-lambda/) The update rule behind this is written out below.
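Concretely, in standard backward-view Sarsa(lambda) notation (the original post does not spell this out), every state-action pair keeps an eligibility trace E(s, a), and at each step the same TD error is applied to all pairs in proportion to their traces:

\delta_t = r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
E_t(s, a) = \gamma \lambda\, E_{t-1}(s, a) + \mathbf{1}[s = s_t,\ a = a_t]
Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, E_t(s, a) \quad \text{for all } (s, a)

With lambda = 0 the trace survives for only one step, which is plain Sarsa; with lambda = 1 the trace decays only by gamma, so every step visited on the way to the reward gets updated.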
What lambda means:
lambda is a decay value. It captures the idea that steps taken far from the reward were probably not the ones that got you to the reward quickly. Imagine standing on the treasure and looking back along the path you walked: the footprints nearest the treasure are the clearest, while the distant ones are too faint to make out. So we simply treat the footprints closer to the treasure as more important and update them more strongly. Just like the reward decay value gamma mentioned before, lambda is a step decay value, also a number between 0 and 1. A quick numeric illustration follows.
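As a numeric illustration (my own example, not from the post): a state-action pair visited k steps before the reward ends up with a trace weight of roughly (gamma * lambda)^k, so its share of the update fades geometrically with distance from the reward.

# Illustrative only: the update weight of a step taken k moves before the reward,
# assuming gamma = 0.9 (reward decay) and lambda = 0.9 (trace decay), as in the defaults below.
gamma, lambda_ = 0.9, 0.9
for k in range(6):
    print(f"{k} moves before the reward -> trace weight {(gamma * lambda_) ** k:.3f}")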
Code for the maze example (from Morvan Zhou: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/4_Sarsa_lambda_maze):
import numpy as np
import pandas as pd


class RL(object):
    def __init__(self, action_space, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = action_space  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append new state to q table
            # (note: DataFrame.append was removed in pandas 2.0; use pd.concat with newer pandas)
            self.q_table = self.q_table.append(
                pd.Series(
                    [0] * len(self.actions),
                    index=self.q_table.columns,
                    name=state,
                )
            )

    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection
        if np.random.rand() < self.epsilon:
            # choose best action
            state_action = self.q_table.loc[observation, :]
            # some actions may have the same value; randomly choose one of them
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, *args):
        pass


# backward eligibility traces
class SarsaLambdaTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
        super(SarsaLambdaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
        # backward view, eligibility trace
        self.lambda_ = trace_decay
        self.eligibility_trace = self.q_table.copy()

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append new state to q table
            to_be_append = pd.Series(
                [0] * len(self.actions),
                index=self.q_table.columns,
                name=state,
            )
            self.q_table = self.q_table.append(to_be_append)
            # also update eligibility trace
            self.eligibility_trace = self.eligibility_trace.append(to_be_append)

    def learn(self, s, a, r, s_, a_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, a_]  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        error = q_target - q_predict

        # increase trace amount for the visited state-action pair
        # Method 1: accumulating trace
        # self.eligibility_trace.loc[s, a] += 1
        # Method 2: replacing trace (cap the trace at 1, clear the other actions of this state)
        self.eligibility_trace.loc[s, :] *= 0
        self.eligibility_trace.loc[s, a] = 1

        # Q update: every state-action pair moves in proportion to its trace
        self.q_table += self.lr * error * self.eligibility_trace

        # decay eligibility trace after update
        self.eligibility_trace *= self.gamma * self.lambda_
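To show how the class above is driven, here is a minimal usage sketch. It is not from the original post (the linked repo drives the agent with a tkinter maze environment instead); the 6-cell corridor, its reward scheme, and the toy_episode helper are my own illustrative assumptions, and it runs against the class exactly as written above (so it needs pandas < 2.0 because of DataFrame.append).

# Hypothetical toy environment, for illustration only: a 6-cell corridor where
# cell 5 is the goal; action 0 = move left, action 1 = move right, reward 1 at the goal.
def toy_episode(agent, n_states=6):
    s = 0
    a = agent.choose_action(str(s))
    agent.eligibility_trace *= 0              # clear traces at the start of each episode
    steps = 0
    while True:
        s_ = max(0, s - 1) if a == 0 else s + 1
        done = (s_ == n_states - 1)
        r = 1.0 if done else 0.0
        a_ = agent.choose_action(str(s_)) if not done else 0   # a_ is unused on terminal steps
        agent.learn(str(s), a, r, 'terminal' if done else str(s_), a_)
        s, a = s_, a_
        steps += 1
        if done:
            return steps

agent = SarsaLambdaTable(actions=[0, 1])
for episode in range(50):
    print('episode', episode, 'finished in', toy_episode(agent), 'steps')

With the eligibility traces in place, a single reward at the goal nudges the Q-values of all recently visited corridor cells at once, so the number of steps per episode drops noticeably faster than with one-step Sarsa.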