Reinforcement Learning, Part 3: Two-armed Bandit

阿新 • Published: 2019-01-06

This article is my personal translation of the reinforcement learning tutorial series written and posted by Arthur Juliani on Medium. I translated it purely to share knowledge, and discussion is welcome!

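The translation is entirely voluntary, done only for learning and sharing knowledge. If this tutorial series helps you, please consider giving the project a star on GitHub. Thank you very much!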

Introduction

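Reinforcement learning gives us not only the ability to teach an intelligent agent how to act, but also lets the agent learn through its own interaction with an environment. By combining the complex representations that deep neural networks can learn with the goal-driven learning of an agent, computers have achieved some remarkable results, such as beating human players at a range of Atari games and defeating the world champion at Go.

Learning to build such powerful agents, however, requires people who are used to **supervised learning** to shift their thinking. We are no longer simply getting an algorithm to pair certain stimuli with certain responses. Instead, a reinforcement learning algorithm must let the agent learn those pairings itself through observations, rewards, and actions. Because the agent is never given an absolutely "correct" action for a given state, this is somewhat harder. In this post, I will walk you through the whole process of creating and training reinforcement learning agents. The first agent and task will be simple, so that the concepts stay clear; after that we will work toward more complex tasks and environments.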

Two-Armed Bandit

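The simplest reinforcement learning problem is the n-armed bandit. In essence, an n-armed bandit consists of n slot machines, each with a different fixed payout probability. The goal is to discover the machine with the best payout and to maximize the total reward by always choosing it. We will simplify the problem further by having only two slot machines to choose between. In fact, this problem is so simple that it is more a precursor to reinforcement learning than a reinforcement learning problem proper, because a typical reinforcement learning task involves the following aspects: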

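Different actions yield different rewards. For example, when looking for treasure in a maze, going left may lead to the treasure, whereas going right may lead to a pit of snakes.
Rewards are delayed over time. Even if going left in the maze above is the right choice, we will not know it until we have made the choice and reached a new state.
The reward for an action depends on the state of the environment. Continuing the maze example, going left may be ideal at one fork in the path but not at others.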

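The n-armed bandit is a good starting problem because we do not have to worry about the second and third aspects above. We only need to focus on learning which reward goes with each action and on making sure we always choose the best ones. In reinforcement learning terms, this is called learning a policy. We will use a method called policy gradient, in which a simple neural network learns how to choose actions by adjusting its parameters through gradient descent, based on feedback from the environment. There is another family of approaches to reinforcement learning, in which the agent learns a value function. Rather than learning the best action for a given state, the agent learns to predict how good a given state or action will be. Both approaches can produce agents that perform well, but policy gradient is a little more direct.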
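To make the idea of "learn the reward that goes with each action, then pick the best one" concrete, here is a minimal sketch that is not part of the original tutorial. It assumes a hypothetical two-armed bandit whose arms pay +1 with probabilities 0.4 and 0.6 (made-up values) and -1 otherwise, and it estimates each arm's average reward simply by pulling it a fixed number of times. The names `PAYOUT_PROBS`, `pull`, and `estimate_arm_values` are introduced only for this illustration.

```python
import numpy as np

# Hypothetical payout probabilities for the two arms (illustrative values only).
PAYOUT_PROBS = [0.4, 0.6]

def pull(arm, rng):
    """Return +1 with the arm's payout probability, otherwise -1."""
    return 1 if rng.random() < PAYOUT_PROBS[arm] else -1

def estimate_arm_values(pulls_per_arm, rng):
    """Estimate each arm's mean reward by sampling it pulls_per_arm times."""
    estimates = []
    for arm in range(len(PAYOUT_PROBS)):
        rewards = [pull(arm, rng) for _ in range(pulls_per_arm)]
        estimates.append(float(np.mean(rewards)))
    return estimates

rng = np.random.default_rng(0)
estimates = estimate_arm_values(2000, rng)
best_arm = int(np.argmax(estimates))
print("estimated mean rewards:", estimates)
print("best arm (0-based index):", best_arm)
```

This brute-force estimate is closer in spirit to the value-based approach than to policy gradient; it is shown only to make the bandit setting tangible before the policy gradient agent is introduced.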

Policy Gradient

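The simplest way to think of a policy gradient network is as a neural network that produces explicit outputs. In the bandit case, we do not need to condition those outputs on any state. Our network will therefore consist of just a set of weights, one for each arm that can be pulled, representing how good the agent thinks it is to pull each arm. If we initialize these weights to 1, our agent will be somewhat optimistic about the potential reward of every arm.

To update the network, we will simply try the arms using an **e-greedy policy** (see Part 7 for more on action-selection strategies). This means that most of the time the agent chooses the action with the largest expected reward, but occasionally, with probability e, it acts randomly. In this way the agent can try every arm and keep learning more about each of them. Once the agent takes an action, it receives a reward of either 1 or -1. With this reward, we can update the network using the policy loss equation: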

$Loss = -\log(\pi) \cdot A$

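A is the advantage, an essential part of all reinforcement learning algorithms. Intuitively, it describes how much better an action is than some baseline. In later algorithms we will meet more sophisticated baselines to compare rewards against; for now we assume the baseline is 0, so A can be thought of simply as the reward received for each action.

$\pi$ is the policy. In this case, it corresponds to the weight of the chosen action.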
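As a worked step that is not part of the original tutorial, assume (as in the code later in this post) that $\pi$ is simply the weight $w_a$ of the chosen arm $a$, that the logarithm is the natural logarithm, and that plain gradient descent with learning rate $\alpha$ is used. The gradient of the loss with respect to that weight is then:

$\frac{\partial Loss}{\partial w_a} = -\frac{A}{w_a}$

and one gradient-descent step becomes:

$w_a \leftarrow w_a - \alpha \frac{\partial Loss}{\partial w_a} = w_a + \alpha \frac{A}{w_a}$

Under these assumptions, and as long as $w_a > 0$, a reward of $A = 1$ raises the chosen weight and a reward of $A = -1$ lowers it. The weights of the arms that were not chosen do not appear in the loss, so they stay unchanged on that step.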

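Intuitively, this loss function lets us increase the weights of actions that yield a positive reward and decrease the weights of those that yield a negative reward. This makes the agent more or less likely to take a given action in the future. By looping through taking an action, receiving a reward, and updating the network, we quickly end up with a converged agent that can solve the bandit problem. Don't just take my word for it; try it yourself.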
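Before the full TensorFlow listing, the e-greedy selection step described earlier can be sketched on its own in NumPy. This is an illustrative sketch, not the author's code: the function name `epsilon_greedy` and the example values in `values` are hypothetical, and `epsilon` plays the role of e.

```python
import numpy as np

def epsilon_greedy(values, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the highest-valued arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))

rng = np.random.default_rng(0)
values = np.array([0.1, 0.9, 0.3])  # hypothetical per-arm estimates
choices = [epsilon_greedy(values, 0.1, rng) for _ in range(10000)]
counts = np.bincount(choices, minlength=len(values))
print("times each arm was chosen:", counts)
```

With `epsilon = 0.1`, the highest-valued arm is chosen on most steps, while the other arms are still sampled occasionally, which is what lets the agent keep learning about every arm.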

# Simple Reinforcement Learning in Tensorflow Part 1:
# The Multi-armed bandit
# This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the multi-armed bandit problem. For more information, see this Medium post.

# For more Reinforcement Learning algorithms, including DQN and Model-based learning in Tensorflow, see my Github repo, DeepRL-Agents.



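# This listing uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
# On TensorFlow 2.x, the same API is available through
# `import tensorflow.compat.v1 as tf` followed by `tf.disable_v2_behavior()`.
import tensorflow as tf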
import numpy as np

# The Bandits
# Here we define our bandits. For this example we are using a four-armed bandit. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit that will give that positive reward.

# List out our bandits. Currently bandit 4 (index#3) is set to most often provide a positive reward.
bandits = [0.2,0,-0.2,-5]
num_bandits = len(bandits)
def pullBandit(bandit):
    # Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        # return a positive reward.
        return 1
    else:
        # return a negative reward.
        return -1

# The Agent
# The code below establishes our simple neural agent. It consists of a set of values, one for each of the bandits. Each value is an estimate of the return from choosing that bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the received reward.

tf.reset_default_graph()

# These two lines establish the feed-forward part of the network. This does the actual choosing.
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights,0)

# The next six lines establish the training procedure. We feed the reward and chosen action into the network
# to compute the loss, and use it to update the network.
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)


# Training the Agent
# We will train our agent by taking actions in our environment and receiving rewards. Using the rewards and actions, we can know how to properly update our network in order to more often choose actions that will yield the highest rewards over time.

total_episodes = 1000 #Set total number of episodes to train agent on.
total_reward = np.zeros(num_bandits) #Set scoreboard for bandits to 0.
e = 0.1 #Set the chance of taking a random action.

# tf.initialize_all_variables() is deprecated; global_variables_initializer() replaces it.
init = tf.global_variables_initializer()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    for i in range(total_episodes):

        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)

        reward = pullBandit(bandits[action]) #Get our reward from picking one of the bandits.

        # Update the network.
        _,resp,ww = sess.run([update,responsible_weight,weights], feed_dict={reward_holder:[reward],action_holder:[action]})

        # Update our running tally of scores.
        total_reward[action] += reward
        if i % 50 == 0:
            print("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
print("The agent thinks bandit " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print("...and it was right!")
else:
    print("...and it was wrong!")

Running reward for the 4 bandits: [ 1.  0.  0.  0.]
Running reward for the 4 bandits: [  0.  -2.  -1.  38.]
Running reward for the 4 bandits: [  0.  -4.  -2.  83.]
Running reward for the 4 bandits: [   0.   -6.   -1.  128.]
Running reward for the 4 bandits: [   0.   -8.    1.  172.]
Running reward for the 4 bandits: [  -1.   -9.    2.  219.]
Running reward for the 4 bandits: [  -1.  -10.    4.  264.]
Running reward for the 4 bandits: [   0.  -11.    4.  312.]
Running reward for the 4 bandits: [   2.  -10.    4.  357.]
Running reward for the 4 bandits: [   2.   -9.    4.  406.]
Running reward for the 4 bandits: [   0.  -11.    4.  448.]
Running reward for the 4 bandits: [  -1.  -10.    3.  495.]
Running reward for the 4 bandits: [  -3.  -10.    2.  540.]
Running reward for the 4 bandits: [  -3.  -10.    3.  585.]
Running reward for the 4 bandits: [  -3.   -8.    3.  629.]
Running reward for the 4 bandits: [  -2.   -7.    1.  673.]
Running reward for the 4 bandits: [  -4.   -7.    2.  720.]
Running reward for the 4 bandits: [  -4.   -7.    3.  769.]
Running reward for the 4 bandits: [  -6.   -8.    3.  814.]
Running reward for the 4 bandits: [  -7.   -7.    3.  858.]
The agent thinks bandit 4 is the most promising....
...and it was right!
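For readers without a TensorFlow 1.x installation, the same experiment can be approximated in plain NumPy. The sketch below is not the original author's code. It assumes that the gradient of $-\log(w_a) \cdot A$ with respect to the chosen weight is $-A / w_a$ (derived by hand) and applies the gradient-descent step manually. The helper names `win_probability`, `pull_bandit`, and `train` are introduced here for illustration only. Because rewards are sampled randomly, the tallies will differ from the output above, and with only 1000 episodes the arm that ends up with the largest weight may depend on the random seed.

```python
import math
import numpy as np

# Same thresholds as the listing above: a pull wins when a normal draw exceeds the threshold.
bandits = [0.2, 0, -0.2, -5]

def win_probability(threshold):
    """P(standard normal sample > threshold), via the complementary error function."""
    return 0.5 * math.erfc(threshold / math.sqrt(2))

def pull_bandit(threshold, rng):
    """Return +1 if a standard normal draw exceeds the threshold, else -1."""
    return 1 if rng.standard_normal() > threshold else -1

def train(total_episodes=1000, epsilon=0.1, learning_rate=0.001, seed=0):
    """Run the e-greedy loop with a hand-written gradient step on the chosen weight."""
    rng = np.random.default_rng(seed)
    weights = np.ones(len(bandits))
    total_reward = np.zeros(len(bandits))
    for _ in range(total_episodes):
        # Choose either a random action or the arm with the largest weight.
        if rng.random() < epsilon:
            action = int(rng.integers(len(bandits)))
        else:
            action = int(np.argmax(weights))
        reward = pull_bandit(bandits[action], rng)
        # Assumed gradient of -log(w_a) * reward w.r.t. w_a is -reward / w_a,
        # so gradient descent adds learning_rate * reward / w_a.
        weights[action] += learning_rate * reward / weights[action]
        total_reward[action] += reward
    return weights, total_reward

weights, total_reward = train()
print("win probabilities:", [round(win_probability(b), 3) for b in bandits])
print("final weights:", weights)
print("total reward per bandit:", total_reward)
print("most promising bandit:", int(np.argmax(weights)) + 1)
```

The `win_probability` helper makes the setup explicit: a threshold of 0 gives an even chance of a positive reward, while a threshold of -5 makes a positive reward almost certain, which is why bandit 4 is the best arm.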

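(Update 09/10/2016): I rewrote the iPython code for this tutorial. The previous loss function was not very intuitive; I have replaced it with a more standard and interpretable version, which is also a better reference for those interested in applying policy gradient methods to more complex problems.

If this post has been helpful to you, please consider donating to support more tutorials, articles, and implementations in the future. Any help or contribution is greatly appreciated!

If you would like to follow my work on deep learning, artificial intelligence, and cognitive science, follow me on Medium @Arthur Juliani or on Twitter @awjliani.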

