
Reinforcement Learning Paper Study (4): Deep Reinforcement Learning with Double Q-Learning

Tags: paper study, deep learning, reinforcement learning, Python, algorithms

A composite algorithm combining Double Q-learning with DQN.

The paper makes five main contributions:

First, DQN overestimates the values of actions.

Second, this overestimation is harmful.

Third, Double Q-learning can reduce the overestimation. It does so by decoupling the network that evaluates the value from the network that selects the action.

Fourth, the paper proposes the Double DQN architecture (three convolutional layers plus fully connected layers) and its parameter update formula.

Fifth, it shows that Double DQN is effective.

Compared with DQN, the main improvement is a single point:

Note that the θ in the two Q terms are different: one belongs to the target network and the other to the current (online) network. Their parameters are updated at different points in time, which is what provides the decoupling.
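For reference, the target used by Double DQN in the paper can be written as follows, where the online parameters \theta_t select the greedy action and the target parameters \theta_t^- evaluate it:

Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma \, Q\big(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta_t); \theta_t^-\big)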

Useful articles:

Reinforcement Learning (10): Double DQN (DDQN)

Understanding:

Double DQN is used together with the target network from the Nature DQN: it avoids using a single network both to select and to evaluate actions when estimating values, which is what causes the overestimation.

Double DQN

DQN has a notable problem: the Q values it estimates tend to be too large. This is because each Q value is estimated from the maximum Q value of the next state s', but that next-state Q value is itself an estimate that depends on the Q value of its own next state, and so on. Taking the max over such noisy estimates biases the target upward, so the Q values tend to come out too large.

We can check this by starting exploratory rollouts from the same state s', measuring the actual return of a given action, and comparing it with DQN's estimate; the comparison gives the conclusion above.
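A minimal numerical sketch of this effect (my own illustration with made-up numbers, not from the paper): even when every action's true value is 0, the max over noisy estimates comes out positive on average.

import numpy as np

rng = np.random.default_rng(0)
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(10000, 4))  # noisy estimates of 4 actions whose true value is 0

print(noisy_q.mean())              # close to 0: each individual estimate is unbiased
print(noisy_q.max(axis=1).mean())  # roughly 1.0: the max of the estimates is biased upward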

This kind of self-inflating estimation is rather annoying, so people came up with the idea of having two estimators keep each other in check.

The idea is intuitive. A single Q network tends to brag, so use two Q networks instead. Because their parameters differ, their evaluations of the same action will also differ slightly, and we take the smaller of the two estimates to compute the target. This keeps the Q network from over-reporting.
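A minimal sketch of this "take the smaller estimate" idea, with hypothetical networks q_net1 and q_net2 and dummy data (this is the min trick, closer to clipped double Q-learning than to the target actually used in the Double DQN paper):

import torch
import torch.nn as nn

GAMMA = 0.9
q_net1 = nn.Linear(4, 2)                 # stand-ins for two independently trained Q networks
q_net2 = nn.Linear(4, 2)
batch_next_state = torch.randn(32, 4)    # dummy batch of next states
batch_reward = torch.randn(32, 1)        # dummy batch of rewards

# each network estimates the best next-state value; the smaller estimate builds the target
q1_next = q_net1(batch_next_state).max(1, keepdim=True)[0]
q2_next = q_net2(batch_next_state).max(1, keepdim=True)[0]
target_q = batch_reward + GAMMA * torch.min(q1_next, q2_next)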

Another approach also uses two Q networks: network Q1 recommends the action with the largest Q value, and network Q2 computes the Q value of that action under its own parameters.

Conveniently, if we are already using Fixed Q-targets, don't we already have two Q networks?

So as you can see, this optimization is easy to implement on top of DQN, and it is the only change between Double DQN and DQN (see the short sketch below).

———— Source: What is the principle behind Double DQN, and how is it implemented? (with code)
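A minimal sketch of that single change, under stand-in networks and dummy data (eval_net and target_net mirror the naming in the full code below; only the target computation differs from plain DQN):

import torch
import torch.nn as nn

GAMMA = 0.9
eval_net = nn.Linear(4, 2)               # stand-in online network (4 state dims, 2 actions)
target_net = nn.Linear(4, 2)             # stand-in target network
batch_next_state = torch.randn(32, 4)    # dummy batch of next states
batch_reward = torch.randn(32, 1)        # dummy batch of rewards

# DQN target: the target network both selects and evaluates the next action
# next_q = target_net(batch_next_state).max(1, keepdim=True)[0]

# Double DQN target: the online network selects the action, the target network evaluates it
next_action = eval_net(batch_next_state).argmax(dim=1, keepdim=True)
next_q = target_net(batch_next_state).gather(1, next_action)
target_q = batch_reward + GAMMA * next_q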

Code implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
import matplotlib.pyplot as plt
import copy
import os
import random
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

# hyper-parameters
BATCH_SIZE = 128
LR = 0.01
GAMMA = 0.90
EPSILON = 0.9  # epsilon-greedy: probability of taking the greedy action
MEMORY_CAPACITY = 2000
Q_NETWORK_ITERATION = 100

env = gym.make("CartPole-v0")
env = env.unwrapped
NUM_ACTIONS = env.action_space.n
NUM_STATES = env.observation_space.shape[0]
ENV_A_SHAPE = 0 if isinstance(env.action_space.sample(), int) else env.action_space.sample().shape  # 0 for a discrete action space

class Net(nn.Module):
    """docstring for Net"""
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(NUM_STATES, 50)
        self.fc1.weight.data.normal_(0,0.1)
        self.fc2 = nn.Linear(50,30)
        self.fc2.weight.data.normal_(0,0.1)
        self.out = nn.Linear(30,NUM_ACTIONS)
        self.out.weight.data.normal_(0,0.1)

    def forward(self,x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        action_value = self.out(x)  # one Q-value per action (not a probability distribution)
        return action_value

class DQN():
    """docstring for DQN"""
    def __init__(self):
        super(DQN, self).__init__()
        self.eval_net, self.target_net = Net().cuda(), Net().cuda()

        self.learn_step_counter = 0
        self.memory_counter = 0
        self.memory = np.zeros((MEMORY_CAPACITY, NUM_STATES * 2 + 2))
        # why the NUM_STATE*2 +2
        # When we store the memory, we put the state, action, reward and next_state in the memory
        # here reward and action is a number, state is a ndarray
        self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=LR)
        self.loss_func = nn.MSELoss()

    def choose_action(self, state):
        state = torch.unsqueeze(torch.FloatTensor(state), 0).cuda() # get a 1D array
        if np.random.rand() <= EPSILON:  # greedy policy with probability EPSILON
            action_value = self.eval_net.forward(state)
            action = torch.max(action_value, 1)[1].cpu().data.numpy()
            action = action[0] if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)
        else: # random policy
            action = np.random.randint(0,NUM_ACTIONS)
            action = action if ENV_A_SHAPE ==0 else action.reshape(ENV_A_SHAPE)
        return action


    def store_transition(self, state, action, reward, next_state):
        transition = np.hstack((state, [action, reward], next_state))
        index = self.memory_counter % MEMORY_CAPACITY
        self.memory[index, :] = transition
        self.memory_counter += 1


    def learn(self):

        #update the parameters
        if self.learn_step_counter % Q_NETWORK_ITERATION ==0:
            self.target_net.load_state_dict(self.eval_net.state_dict())
        self.learn_step_counter+=1

        #sample batch from memory
        sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE)
        batch_memory = self.memory[sample_index, :]
        batch_state = torch.FloatTensor(batch_memory[:, :NUM_STATES]).cuda()
        batch_action = torch.LongTensor(batch_memory[:, NUM_STATES:NUM_STATES+1].astype(int)).cuda()
        batch_reward = torch.FloatTensor(batch_memory[:, NUM_STATES+1:NUM_STATES+2]).cuda()
        batch_next_state = torch.FloatTensor(batch_memory[:,-NUM_STATES:]).cuda()

        # Double DQN target:
        # the online (eval) network selects the greedy action for the next state,
        # and the target network evaluates that action's value
        actions_value = self.eval_net.forward(batch_next_state)
        next_action = torch.unsqueeze(torch.max(actions_value, 1)[1], 1)
        eval_q = self.eval_net.forward(batch_state).gather(1, batch_action)
        next_q = self.target_net.forward(batch_next_state).gather(1, next_action).detach()  # no gradient through the target network
        target_q = batch_reward + GAMMA * next_q
        loss = self.loss_func(eval_q, target_q)


        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

def reward_func(env, x, x_dot, theta, theta_dot):
    r1 = (env.x_threshold - abs(x))/env.x_threshold - 0.5
    r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
    reward = r1 + r2
    return reward

def main():

    dqn = DQN()
    episodes = 250
    print("Collecting Experience....")
    for i in range(episodes):
        state = env.reset()
        ep_reward = 0
        while True:
            env.render()
            action = dqn.choose_action(state)
            next_state, _, done, info = env.step(action)
            x, x_dot, theta, theta_dot = next_state
            reward = reward_func(env, x, x_dot, theta, theta_dot)

            dqn.store_transition(state, action, reward, next_state)
            ep_reward += reward

            if dqn.memory_counter >= MEMORY_CAPACITY:
                dqn.learn()
                if done:
                    print("episode: {} , the episode reward is ,{}".format(i, round(ep_reward, 3)))
            if done:
                break
            state = next_state


if __name__ == '__main__':
    main()

Note for the implementation: the amount of training must not be too small, i.e., episodes must not be set too low, otherwise the pole will never stay upright no matter what.