Reinforcement Learning Paper Study (4): Deep Reinforcement Learning with Double Q-Learning
An algorithm that combines Double Q-learning with DQN.
The paper makes five main contributions:
First, DQN overestimates action values.
Second, this overestimation is harmful.
Third, Double Q-learning reduces the overestimation; it does so by decoupling action selection from action evaluation across two networks.
Fourth, it proposes the Double DQN algorithm structure (three convolutional layers plus fully connected layers) and its parameter-update formula.
Fifth, it demonstrates empirically that Double DQN is effective.
Compared with DQN, the main improvement comes down to one point:
Note that the θ in the two Q's of the target are different: one belongs to the target network and the other to the current (online) network. Because the target network lags the online network in update time, the two estimates are decoupled.
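Concretely, the two targets differ only in which network selects the next action and which evaluates it; in the paper's notation (θ_t for the online network, θ_t^- for the target network):

```latex
% DQN target: the target network both selects and evaluates the action
Y_t^{\mathrm{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a;\, \theta_t^-)

% Double DQN target: the online network selects the action,
% the target network evaluates it
Y_t^{\mathrm{DoubleDQN}} = R_{t+1} + \gamma\, Q\!\left(S_{t+1},\ \operatorname*{arg\,max}_a Q(S_{t+1}, a;\, \theta_t);\ \theta_t^-\right)
```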
Understanding:
Used together with the target network from the Nature DQN paper, this prevents a single network from being used for both action selection and evaluation when estimating values, which is the source of the overestimation.
Double DQN
DQN has a notable problem: its estimated Q values tend to be too large. This is because the Q value is estimated from the maximum Q value at the next state s', but that next-state Q value is itself an estimate, which in turn depends on the Q value of its own next state, and so on; this bootstrapped maximization makes Q values biased upward.
We can verify this empirically: start exploratory rollouts from the same s', measure the actual return of a given action, and compare it with DQN's estimate; the comparison yields the conclusion above.
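A minimal numerical sketch (not from the paper; all names are illustrative) makes this bias concrete: even when every action has the same true value, the max over noisy estimates is biased upward, while decoupled selection and evaluation is not.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_Q = 0.0       # every action's true value (assumption for the demo)
NUM_ACTIONS = 10
NOISE_STD = 1.0    # estimation noise on each Q value
TRIALS = 10000

# Noisy Q estimates: true value plus zero-mean noise
estimates = TRUE_Q + rng.normal(0, NOISE_STD, size=(TRIALS, NUM_ACTIONS))

# DQN-style target takes the max over the same noisy estimates -> biased upward
single_max = estimates.max(axis=1).mean()

# Double-Q style: one set of estimates selects the action,
# an independent set evaluates it -> the bias largely cancels
estimates2 = TRUE_Q + rng.normal(0, NOISE_STD, size=(TRIALS, NUM_ACTIONS))
selected = estimates.argmax(axis=1)
double_q = estimates2[np.arange(TRIALS), selected].mean()

print(f"true value:          {TRUE_Q}")
print(f"max of one estimate: {single_max:.3f}")   # noticeably > 0
print(f"decoupled estimate:  {double_q:.3f}")     # close to 0
```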
This habit of inflating its own estimates is genuinely frustrating, which led to the idea of having two estimators supervise each other.
The idea is intuitive. If a single Q network keeps "boasting", then use two Q networks: because their parameters differ, their evaluations of the same action will also differ slightly. We take the smaller of the two estimates to compute the target, which prevents the boasting. A sketch of this variant follows.
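A minimal sketch of this min-based variant, assuming q1 and q2 are two hypothetical, independently parameterized Q networks (this pessimistic-minimum idea is the one later used in clipped double Q-learning, e.g. in TD3):

```python
import torch

def min_double_target(q1, q2, reward, next_state, gamma=0.99):
    """Target built from the smaller of two independent Q estimates.

    q1, q2: hypothetical Q networks with different parameters,
    each mapping a batch of states to per-action values.
    """
    with torch.no_grad():
        next_q1 = q1(next_state).max(dim=1, keepdim=True)[0]
        next_q2 = q2(next_state).max(dim=1, keepdim=True)[0]
        # Take the pessimistic (smaller) estimate to damp overestimation
        next_q = torch.min(next_q1, next_q2)
    return reward + gamma * next_q
```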
Another approach also uses two Q networks: network Q1 recommends the action with the largest Q value, and network Q2 computes the Q value of that action under its own parameters.
Conveniently, if we are already using Fixed Q-targets, don't we already have two Q networks?
So, as you can see, this optimization is easy to implement on top of DQN. It is the only change between Double DQN and DQN, shown in the sketch below.
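To make that one change explicit, here is a minimal sketch of the two target computations (function names are illustrative; eval_net and target_net match the full implementation below):

```python
import torch

def dqn_target(target_net, reward, next_state, gamma):
    # Vanilla DQN: the target network both selects and evaluates the action
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1, keepdim=True)[0]
    return reward + gamma * next_q

def double_dqn_target(eval_net, target_net, reward, next_state, gamma):
    # Double DQN: the online network selects, the target network evaluates
    with torch.no_grad():
        next_action = eval_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, next_action)
    return reward + gamma * next_q
```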
Code implementation:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
import os

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# hyper-parameters
BATCH_SIZE = 128
LR = 0.01
GAMMA = 0.90
EPISILO = 0.9              # probability of taking the greedy action (epsilon-greedy)
MEMORY_CAPACITY = 2000
Q_NETWORK_ITERATION = 100  # sync target network every 100 learning steps

env = gym.make("CartPole-v0")
env = env.unwrapped
NUM_ACTIONS = env.action_space.n
NUM_STATES = env.observation_space.shape[0]
ENV_A_SHAPE = 0 if isinstance(env.action_space.sample(), int) else env.action_space.sample().shape


class Net(nn.Module):
    """Two-hidden-layer MLP mapping a state to per-action Q values."""
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(NUM_STATES, 50)
        self.fc1.weight.data.normal_(0, 0.1)
        self.fc2 = nn.Linear(50, 30)
        self.fc2.weight.data.normal_(0, 0.1)
        self.out = nn.Linear(30, NUM_ACTIONS)
        self.out.weight.data.normal_(0, 0.1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)


class DQN():
    """Double DQN agent with an online (eval) network and a target network."""
    def __init__(self):
        super(DQN, self).__init__()
        self.eval_net, self.target_net = Net().cuda(), Net().cuda()
        self.learn_step_counter = 0
        self.memory_counter = 0
        # Each row stores (state, action, reward, next_state);
        # action and reward are scalars, hence NUM_STATES * 2 + 2 columns.
        self.memory = np.zeros((MEMORY_CAPACITY, NUM_STATES * 2 + 2))
        self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=LR)
        self.loss_func = nn.MSELoss()

    def choose_action(self, state):
        state = torch.unsqueeze(torch.FloatTensor(state), 0).cuda()
        if np.random.rand() <= EPISILO:  # greedy action (np.random.rand, not randn)
            action_value = self.eval_net(state)
            action = torch.max(action_value, 1)[1].cpu().data.numpy()
            action = action[0] if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)
        else:  # random action
            action = np.random.randint(0, NUM_ACTIONS)
            action = action if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)
        return action

    def store_transition(self, state, action, reward, next_state):
        transition = np.hstack((state, [action, reward], next_state))
        index = self.memory_counter % MEMORY_CAPACITY  # ring buffer
        self.memory[index, :] = transition
        self.memory_counter += 1

    def learn(self):
        # periodically copy the online network's weights into the target network
        if self.learn_step_counter % Q_NETWORK_ITERATION == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())
        self.learn_step_counter += 1

        # sample a batch from memory
        sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE)
        batch_memory = self.memory[sample_index, :]
        batch_state = torch.FloatTensor(batch_memory[:, :NUM_STATES]).cuda()
        batch_action = torch.LongTensor(batch_memory[:, NUM_STATES:NUM_STATES + 1].astype(int)).cuda()
        batch_reward = torch.FloatTensor(batch_memory[:, NUM_STATES + 1:NUM_STATES + 2]).cuda()
        batch_next_state = torch.FloatTensor(batch_memory[:, -NUM_STATES:]).cuda()

        eval_q = self.eval_net(batch_state).gather(1, batch_action)
        with torch.no_grad():  # no gradient flows through the target
            # Double DQN: the online network selects the next action ...
            next_action = self.eval_net(batch_next_state).argmax(dim=1, keepdim=True)
            # ... and the target network evaluates it
            next_q = self.target_net(batch_next_state).gather(1, next_action)
            # note: terminal transitions are not masked (no done flag is stored)
            target_q = batch_reward + GAMMA * next_q

        loss = self.loss_func(eval_q, target_q)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()


def reward_func(env, x, x_dot, theta, theta_dot):
    # shaped reward: closer to the center and more upright -> higher reward
    r1 = (env.x_threshold - abs(x)) / env.x_threshold - 0.5
    r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
    return r1 + r2


def main():
    dqn = DQN()
    episodes = 250
    print("Collecting Experience....")
    for i in range(episodes):
        state = env.reset()
        ep_reward = 0
        while True:
            env.render()
            action = dqn.choose_action(state)
            next_state, _, done, info = env.step(action)
            x, x_dot, theta, theta_dot = next_state
            reward = reward_func(env, x, x_dot, theta, theta_dot)
            dqn.store_transition(state, action, reward, next_state)
            ep_reward += reward
            if dqn.memory_counter >= MEMORY_CAPACITY:  # learn once the buffer is full
                dqn.learn()
            if done:
                print("episode: {} , the episode reward is ,{}".format(i, round(ep_reward, 3)))
                break
            state = next_state


if __name__ == '__main__':
    main()
```
One implementation note: the number of training episodes must not be too small; otherwise the pole will never stay upright, no matter what.