The Low-Level Implementation of Reinforcement Learning in Torch

阿新 • Published: 2019-01-11

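Reinforcement learning has become a hot topic; as far as I know, the computer-vision labs at the world's top universities are all working on it. I have been doing related research recently and came across the RL implementation in Torch. Since I could not find any Chinese-language material to refer to, I will try to explain that implementation here myself.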

Theory

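The RL implementation in Torch [2] follows the method of paper [1]. What it implements is the relatively simple immediate-reinforcement setting. Paper [1] introduces immediate RL, and the resulting update rule is similar to the associative reward-inaction (A_{R-I}) method.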

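Suppose the output is a vector. The i-th output is computed as follows.
First compute the net input s_i:
$$ s_i = \sum_j w_{ij} x_j $$

Then pass s_i through a function f_i to obtain the probability p_i:
$$ p_i = f_i(s_i) $$

Here f_i can be the logistic function:
$$ f_i(s_i) = \frac{1}{1 + e^{-s_i}} $$

Assuming the output y_i follows a Bernoulli distribution, the probabilities of outputting 0 and 1 are:
$$ \Pr\{y_i = 0\} = 1 - p_i, \qquad \Pr\{y_i = 1\} = p_i $$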
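
The forward computation of one such unit can be sketched in a few lines. This is plain Python rather than Torch code, and it assumes the usual Bernoulli-logistic unit (s_i as a weighted sum, p_i from the logistic function, y_i sampled with probability p_i); the function names and the `rng` parameter are illustrative, not part of any library.

```python
import math
import random


def logistic(s):
    """Logistic function f_i: maps the net input s_i to a probability p_i."""
    return 1.0 / (1.0 + math.exp(-s))


def bernoulli_logistic_unit(weights, inputs, rng):
    """Return (y_i, p_i) for one stochastic unit.

    weights, inputs: equal-length lists standing in for w_ij and x_j.
    rng: a random.Random instance (hypothetical parameter, for reproducibility).
    """
    s = sum(w * x for w, x in zip(weights, inputs))  # s_i = sum_j w_ij * x_j
    p = logistic(s)                                  # p_i = f_i(s_i)
    y = 1 if rng.random() < p else 0                 # y_i ~ Bernoulli(p_i)
    return y, p


rng = random.Random(0)
y, p = bernoulli_logistic_unit([0.5, -0.25], [1.0, 2.0], rng)
print(y, p)  # s_i is 0 here, so p is 0.5; y is either 0 or 1
```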

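During training, the weights w are updated as follows:
$$ \Delta w_{ij} = \alpha_{ij} (r - b_{ij}) e_{ij} $$

Here r is the reward, b_{ij} is the baseline reward and e_{ij} is the eligibility of w_{ij}. The factor α_{ij} can be a constant, or it can be:
[equation image missing from the source]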

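To compute e_{ij}, first write down the probability mass function g_i:
$$ g_i(y_i, p_i) = \begin{cases} 1 - p_i & \text{if } y_i = 0 \\ p_i & \text{if } y_i = 1 \end{cases} $$
Differentiating its logarithm with respect to p_i gives the following (it has a bit of a cross-entropy feel):
$$ \frac{\partial \ln g_i}{\partial p_i} = \frac{y_i - p_i}{p_i (1 - p_i)} $$

Then differentiate p_i with respect to s_i (for the logistic function):
$$ \frac{\partial p_i}{\partial s_i} = f_i'(s_i) = p_i (1 - p_i) $$

Then differentiate s_i with respect to w_{ij}:
$$ \frac{\partial s_i}{\partial w_{ij}} = x_j $$

Multiplying these together gives e_{ij}:
$$ e_{ij} = \frac{\partial \ln g_i}{\partial w_{ij}} = (y_i - p_i) x_j $$

Adding the reward r and α gives the final update (when the baseline reward is 0):
$$ \Delta w_{ij} = \alpha r (y_i - p_i) x_j $$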
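
The chain-rule result can be checked numerically. The sketch below is plain Python, not Torch code, and assumes a Bernoulli-logistic unit for which the closed form is e_ij = (y_i - p_i) x_j. It compares that closed form with a central finite difference of ln g_i; all function names are illustrative.

```python
import math


def logistic(s):
    """Logistic function used as f_i."""
    return 1.0 / (1.0 + math.exp(-s))


def log_prob(weights, inputs, y):
    """ln g_i(y, w, x) for a Bernoulli-logistic unit."""
    s = sum(w * x for w, x in zip(weights, inputs))
    p = logistic(s)
    return math.log(p if y == 1 else 1.0 - p)


def eligibility(weights, inputs, y):
    """Closed form e_ij = (y_i - p_i) * x_j for every j."""
    s = sum(w * x for w, x in zip(weights, inputs))
    p = logistic(s)
    return [(y - p) * x for x in inputs]


def numeric_eligibility(weights, inputs, y, h=1e-6):
    """Central finite difference of ln g_i with respect to each w_ij."""
    grads = []
    for j in range(len(weights)):
        up = list(weights)
        up[j] += h
        down = list(weights)
        down[j] -= h
        grads.append((log_prob(up, inputs, y) - log_prob(down, inputs, y)) / (2 * h))
    return grads


w, x = [0.3, -0.7, 0.1], [1.0, 0.5, -2.0]
for y in (0, 1):
    # The two lists should agree to several decimal places.
    print(y, eligibility(w, x, y), numeric_eligibility(w, x, y))
```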

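Compare this with the rule known as associative reward-inaction (A_{R-I}), which is obtained by setting λ to 0 in:
$$ \Delta w_{ij} = \alpha \left[ r (y_i - p_i) + \lambda (1 - r)(1 - y_i - p_i) \right] x_j $$

Adding a baseline reward to the derived update then gives:
$$ \Delta w_{ij} = \alpha (r - \bar{r}) (y_i - p_i) x_j $$
Here \bar{r} is an estimate of the reward, obtained from the following formula:
$$ \bar{r}(t) = \gamma r(t-1) + (1 - \gamma) \bar{r}(t-1) $$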
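
A single update step with a baseline can be sketched as follows. This is plain Python, not Torch code, and it assumes the update Δw_ij = α (r - r̄)(y_i - p_i) x_j together with an exponentially weighted reward estimate r̄(t) = γ r(t-1) + (1 - γ) r̄(t-1). The function name and the values of `alpha` and `gamma` are illustrative assumptions.

```python
def reinforce_step(weights, inputs, y, p, reward, baseline, alpha=0.1, gamma=0.9):
    """Apply one weight update and refresh the reward estimate.

    Weight update:   w_ij += alpha * (reward - baseline) * (y - p) * x_j
    Baseline update: new_baseline = gamma * reward + (1 - gamma) * baseline
    alpha and gamma are illustrative values, not taken from the source.
    """
    new_weights = [w + alpha * (reward - baseline) * (y - p) * x
                   for w, x in zip(weights, inputs)]
    new_baseline = gamma * reward + (1.0 - gamma) * baseline
    return new_weights, new_baseline


weights, baseline = [0.0, 0.0], 0.0
weights, baseline = reinforce_step(weights, [1.0, 2.0], y=1, p=0.5,
                                   reward=1.0, baseline=baseline)
print(weights, baseline)
```

When the reward equals the current baseline the weights do not move, which is the point of subtracting the estimate: only rewards that differ from what was expected change the policy.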

Implementation

The Torch code is here: [2]. It collects the RL methods currently implemented for Torch and is based on [1]. This post covers the ReinforceCategorical [3] part of the code.

nn.ReinforceCategorical

module = nn.ReinforceCategorical([stochastic])

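This class inherits from nn.Reinforce and samples from a multinomial (categorical) distribution, one draw per sample. The input is a multinomial distribution, for example the output of a SoftMax (probabilities summing to 1). The output is a one-hot encoding: exactly one element of the output vector is 1 and the rest are 0, which is a natural fit for classification. During training the input can be a batch, so mini-batch gradient training is supported.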

During evaluation, if stochastic=false, the output equals the input. nn.ReinforceCategorical inherits from nn.Reinforce and mainly defines two functions, updateGradInput and updateOutput.

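The functions use the following variables:

f: the categorical probability mass function evaluated on the input probabilities (not a squashing function such as sigmoid).
y: the index sampled for a given sample, represented as a one-hot code.
p: the input probability distribution (p[1], p[2], p[3], ...).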

The eligibility e(ij) used to update the weights is computed from the following derivative:
d ln(f(y,p))     1/p[i]    if i = y  
------------ =   
    d p          0         otherwise
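
The derivative above can be checked numerically. The sketch below is plain Python, not Torch code, and uses 0-based indices where Lua uses 1-based ones. It relies only on ln f(y, p) = ln p[y]; the function names are illustrative.

```python
import math


def grad_log_categorical(p, y):
    """d ln f(y, p) / d p[i]: 1/p[y] at the sampled index y, 0 elsewhere."""
    return [1.0 / p_i if i == y else 0.0 for i, p_i in enumerate(p)]


def numeric_grad(p, y, h=1e-6):
    """Central finite difference of ln p[y], treating each p[i] as a free variable."""
    grads = []
    for i in range(len(p)):
        up = list(p)
        up[i] += h
        down = list(p)
        down[i] -= h
        grads.append((math.log(up[y]) - math.log(down[y])) / (2 * h))
    return grads


p = [0.1, 0.2, 0.7]
print(grad_log_categorical(p, 2))  # nonzero only at index 2
print(numeric_grad(p, 2))          # should agree closely with the line above
```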

Walkthrough of ReinforceCategorical.lua:

local ReinforceCategorical, parent = torch.class("nn.ReinforceCategorical", "nn.Reinforce")

-- Draw one sample from the probability distribution given by input; set the output at the sampled index to 1 and all other entries to 0
function ReinforceCategorical:updateOutput(input)
   self.output:resizeAs(input)
   self._index = self._index or ((torch.type(input) == 'torch.CudaTensor') and torch.CudaTensor() or torch.LongTensor())
   if self.stochastic or self.train ~= false then
      -- sample from categorical with p = input
      self._input = self._input or input.new()
      -- prevent division by zero error (see updateGradInput)
      self._input:resizeAs(input):copy(input):add(0.00000001) 
      input.multinomial(self._index, input, 1)
      -- one hot encoding
      self.output:zero()
      self.output:scatter(2, self._index, 1)
   else
      -- use p for evaluation
      self.output:copy(input)
   end
   return self.output
end

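-- Compute self.gradInput, the gradient that drives the weight update.
-- It starts as the one-hot output of the function above and is divided by input (plus a small constant to avoid division by zero), then multiplied by the reward and by -1.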
function ReinforceCategorical:updateGradInput(input, gradOutput)
   -- Note that gradOutput is ignored
   -- f : categorical probability mass function
   -- x : the sampled indices (one per sample) (self.output)
   -- p : probability vector (p[1], p[2], ..., p[k]) 
   -- derivative of log categorical w.r.t. p
   -- d ln(f(x,p))     1/p[i]    if i = x  
   -- ------------ =   
   --     d p          0         otherwise
   self.gradInput:resizeAs(input):zero()
   self.gradInput:copy(self.output)
   self._input = self._input or input.new()
   -- prevent division by zero error
   self._input:resizeAs(input):copy(input):add(0.00000001) 
   self.gradInput:cdiv(self._input)

   -- multiply by reward 
   self.gradInput:cmul(self:rewardAs(input))
   -- multiply by -1 ( gradient descent on input )
   self.gradInput:mul(-1)
   return self.gradInput
end

-- Reset self._index to nil and delegate the type conversion to the parent class
function ReinforceCategorical:type(type, tc)
   self._index = nil
   return parent.type(self, type, tc)
end
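
The training-mode logic of the two methods above can be mirrored in plain Python for a batch stored as nested lists. This is a sketch of the same arithmetic, not the Torch module: `update_output` and `update_grad_input` are illustrative names, the `rng` argument is an assumption added for reproducibility, and indices are 0-based.

```python
import random

EPS = 1e-8  # same small constant the Lua code adds to avoid division by zero


def update_output(probs, rng):
    """Sample one index per row and return one-hot rows (training-mode branch)."""
    output = []
    for row in probs:
        idx = rng.choices(range(len(row)), weights=row, k=1)[0]
        output.append([1.0 if j == idx else 0.0 for j in range(len(row))])
    return output


def update_grad_input(probs, output, rewards):
    """gradInput = -reward * one_hot / (p + EPS), with one reward per row."""
    return [[-r * o / (p + EPS) for o, p in zip(out_row, p_row)]
            for out_row, p_row, r in zip(output, probs, rewards)]


rng = random.Random(0)
probs = [[0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
rewards = [1.0, -0.5]
output = update_output(probs, rng)
grad = update_grad_input(probs, output, rewards)
print(output)
print(grad)
```

Only the sampled entry of each row receives a nonzero gradient, scaled by the reward of that row and by 1/p of the sampled class.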

So ReinforceCategorical computes the output used during training and the corresponding gradient. The sibling modules differ only in the mass function and in how the output is represented; they are ReinforceBernoulli, ReinforceNormal and ReinforceGamma.

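The code uses two functions, scatter [5] and multinomial [6]. Their meaning is fairly clear at a glance. scatter writes the values of an array, or a single value, into a Tensor at the given positions. multinomial draws samples from a given probability distribution and returns their indices. For example, given [0.1, 0.2, 0.7] and 5 draws, it might return a sequence such as 3, 3, 3, 1, 3. When using multinomial, take care that the distribution is not zero; otherwise it raises an error, because no sample can be drawn.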
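
How such sampling works can be illustrated with a small inverse-CDF sampler. This is a plain-Python sketch, not the Torch implementation: it draws with replacement, returns 1-based indices to match the Lua convention, and raises its own ValueError for an all-zero distribution. The function name and the `rng` parameter are illustrative.

```python
import random
from collections import Counter


def multinomial(probs, n_samples, rng):
    """Draw n_samples 1-based indices with replacement by inverting the CDF."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("cannot sample: probabilities sum to zero")
    samples = []
    for _ in range(n_samples):
        u = rng.random() * total  # uniform point in [0, total)
        cumulative = 0.0
        for index, p in enumerate(probs, start=1):
            cumulative += p
            if u < cumulative:
                samples.append(index)
                break
        else:
            samples.append(len(probs))  # guard against floating-point round-off
    return samples


rng = random.Random(0)
samples = multinomial([0.1, 0.2, 0.7], 10000, rng)
print(Counter(samples))  # index 3 should dominate, roughly 70% of the draws
```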

The interface of scatter is:

[Tensor] scatter(dim, index, src|val)

An example:

x = torch.rand(2, 5)
> x
 0.3227  0.4294  0.8476  0.9414  0.1159
 0.7338  0.5185  0.2947  0.0578  0.1273
[torch.DoubleTensor of size 2x5]

y = torch.zeros(3, 5):scatter(1, torch.LongTensor{{1, 2, 3, 1, 1}, {3, 1, 1, 2, 3}}, x)
> y
 0.3227  0.5185  0.2947  0.9414  0.1159
 0.0000  0.4294  0.0000  0.0578  0.0000
 0.7338  0.0000  0.8476  0.0000  0.1273
[torch.DoubleTensor of size 3x5]
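The first argument is 1, so the index values select positions along the first dimension (the row), and filling proceeds column by column. In the first column the target rows are {1, 3}, so x[1,1] and x[2,1] go to y[1,1] and y[3,1]. In the second column the indices are {2, 1}, so x[1,2] and x[2,2] go to y[2,2] and y[1,2], and so on.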

z = torch.zeros(2, 4):scatter(2, torch.LongTensor{{3}, {4}}, 1.23)
> z
 0.0000  0.0000  1.2300  0.0000
 0.0000  0.0000  0.0000  1.2300
[torch.DoubleTensor of size 2x4]
Here filling is along the second dimension, so each index selects a column within its row: the first element goes to z[1,3] and the other to z[2,4].
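
The indexing rule behind both examples can be written out in plain Python on nested lists. This is an illustrative stand-in for the 2-D case only, not the Torch implementation; the index values stay 1-based as in Lua and are shifted by one when addressing Python lists.

```python
def scatter(dest, dim, index, src):
    """Pure-Python stand-in for Tensor:scatter on 2-D nested lists.

    dim == 1: dest[index[i][j]][j] = src[i][j]
    dim == 2: dest[i][index[i][j]] = src[i][j]
    index holds 1-based positions; src is a nested list or a single number.
    """
    for i, index_row in enumerate(index):
        for j, target in enumerate(index_row):
            value = src[i][j] if isinstance(src, list) else src
            if dim == 1:
                dest[target - 1][j] = value
            else:
                dest[i][target - 1] = value
    return dest


x = [[0.3227, 0.4294, 0.8476, 0.9414, 0.1159],
     [0.7338, 0.5185, 0.2947, 0.0578, 0.1273]]
y = scatter([[0.0] * 5 for _ in range(3)], 1,
            [[1, 2, 3, 1, 1], [3, 1, 1, 2, 3]], x)
z = scatter([[0.0] * 4 for _ in range(2)], 2, [[3], [4]], 1.23)
for row in y:
    print(row)
for row in z:
    print(row)
```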

nn.Reinforce

This is a simple RL implementation derived from paper [10] with some changes. It is essentially the same, apart from supporting additional mass functions.

module = nn.Reinforce([stochastic])

When stochastic=false, the module is stochastic (samples at random) only during training and not during evaluation. The default is false. The reward is computed in a way similar to VRClassReward [7].

reward = a*(R - b)
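In the paper, a is either a constant or computed from some expression (see the paper summary above). R is the raw reward (usually 0 or 1). b is the baseline reward, a prediction of R that uses the reward and the prediction from the previous time step; see the paper summary above.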
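
For a classification task, the formula can be sketched as below. This is plain Python, not the VRClassReward code. It assumes R is 1 for a correct prediction and 0 otherwise, and that one baseline value b is available per sample; `class_reward` and its parameters are hypothetical names.

```python
def class_reward(predictions, targets, baselines, scale=1.0):
    """Return reward = scale * (R - b) for every sample in a batch.

    R is assumed to be 1.0 for a correct prediction and 0.0 otherwise.
    scale plays the role of a; baselines holds one b per sample.
    """
    rewards = []
    for predicted, target, b in zip(predictions, targets, baselines):
        R = 1.0 if predicted == target else 0.0
        rewards.append(scale * (R - b))
    return rewards


# First sample is classified correctly, second is not.
print(class_reward([3, 1], [3, 2], [0.5, 0.5]))
```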

------------------------------------------------------------------------
--[[ Reinforce ]]--
-- Ref A. http://incompleteideas.net/sutton/williams-92.pdf
-- Abstract class for modules that use the REINFORCE algorithm (ref A).
-- The reinforce(reward) method is called by a special Reward Criterion.
-- After which, when backward is called, the reward will be used to 
-- generate gradInputs. The gradOutput is usually ignored.
------------------------------------------------------------------------
local Reinforce, parent = torch.class("nn.Reinforce", "nn.Module")

function Reinforce:__init(stochastic)
   parent.__init(self)
   -- true makes it stochastic during evaluation and training
   -- false makes it stochastic only during training
   self.stochastic = stochastic
end

-- The reward is supplied by the caller (a reward criterion); see VRClassReward [11] for how it is computed.
-- a Reward Criterion will call this
function Reinforce:reinforce(reward)
   parent.reinforce(self, reward)
   self.reward = reward
end

-- Overridden in subclasses
function Reinforce:updateOutput(input)
   self.output:set(input)
end

-- Overridden in subclasses, which use the reward to compute the gradient
function Reinforce:updateGradInput(input, gradOutput)
   local reward = self:rewardAs(input)
   self.gradInput:resizeAs(reward):copy(reward)
end

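-- Computes the reward in the shape of input. Note that training may use batches, so the size of input has to be taken into account (I am not sure about this part and need to study it further)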
-- this can be called by updateGradInput
function Reinforce:rewardAs(input)
   assert(self.reward:dim() == 1)
   if input:isSameSizeAs(self.reward) then
      return self.reward
   else
      if self.reward:size(1) ~= input:size(1) then
         -- assume input is in online-mode
         input = self:toBatch(input, input:dim())
         assert(self.reward:size(1) == input:size(1), self.reward:size(1).." ~= "..input:size(1))
      end
      self._reward = self._reward or self.reward.new()
      self.__reward = self.__reward or self.reward.new()
      local size = input:size():fill(1):totable()
      size[1] = self.reward:size(1)
      self._reward:view(self.reward, table.unpack(size))
      self.__reward:expandAs(self._reward, input)
      return self.__reward
   end
end
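
For a 2-D batch, the view and expandAs calls in rewardAs amount to repeating each sample's scalar reward across that sample's row. The plain-Python sketch below shows only that batch case on nested lists; the online-mode branch that calls toBatch is left out, and `reward_as` is an illustrative name.

```python
def reward_as(reward, batch_input):
    """Expand a per-sample reward list to the shape of a 2-D batch.

    reward: one number per sample (the 1-D reward tensor).
    batch_input: list of rows, one row per sample.
    """
    if len(reward) != len(batch_input):
        raise ValueError("%d ~= %d" % (len(reward), len(batch_input)))
    return [[r] * len(row) for r, row in zip(reward, batch_input)]


probs = [[0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
print(reward_as([1.0, -0.5], probs))
```

The expanded reward can then be multiplied element-wise with a gradient of the same shape, which is what the cmul call in ReinforceCategorical:updateGradInput does.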

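For the reward function, nn also has a more specialised module: [7]. For a complete reinforcement learning example, see [8], which reproduces Google's attention paper. The full network there is far larger than what is covered here; I will write about [8] when I have time.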

RL code often uses ArgMax.lua, which computes the maximum output along each dimension [12], and nn.Collapse, which changes the size of a Tensor [13].
