Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

阿新 • Published: 2021-11-22

Publication date: 2019 (NeurIPS 2019)
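Key points: This paper combines planning with reinforcement learning to solve complex tasks. The main idea is to use goal-conditioned RL to build a graph whose nodes include the start position, the goal position, and intermediate waypoints. This effectively decomposes a distant goal state into a sequence of easier tasks (subgoals). Planning on this graph (graph search) then finds the shortest path to the goal, and the goal-conditioned policy walks to each node along that path in turn until it reaches the goal. Concretely, goal-conditioned RL is used to learn the distances between states.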

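Here \(s\) is the current state and \(s_g\) is the goal state. The reward is -1 at every step, i.e. the negative of the distance, so the value represents the negative of the number of steps needed to reach \(s_g\) from \(s\).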
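To see why a reward of -1 per step turns the value into a negative distance, here is a minimal tabular sketch. The five-state corridor and the function name `goal_conditioned_values` are made up for illustration; the paper itself learns these values with a neural network rather than a table.

```python
def goal_conditioned_values(neighbors, goal, n_sweeps=50):
    """Undiscounted value iteration with reward -1 per step.

    `neighbors` maps each state to the states reachable in one step.
    The goal is treated as absorbing, so its value stays at 0.
    """
    values = {s: 0.0 for s in neighbors}
    for _ in range(n_sweeps):
        for s in neighbors:
            if s == goal:
                values[s] = 0.0  # nothing left to pay once the goal is reached
            else:
                # pay -1 for the step, then follow the best neighbor
                values[s] = max(-1.0 + values[n] for n in neighbors[s])
    return values


# Hypothetical corridor: 0 - 1 - 2 - 3 - 4, with state 4 as the goal.
corridor = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
values = goal_conditioned_values(corridor, goal=4)

# Negating the value recovers the number of steps to the goal.
distances = {s: -v for s, v in values.items()}
```

In this toy case `distances[0]` is 4, the length of the corridor, which is the relationship between value and distance that the post describes.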

The specific RL algorithm used is distributional Q-learning.

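In the figure, the horizontal axis shows distances of 0, 1, 2, and 3 or more. If the goal state has been reached, the probability at 0 is 1; if the goal is too far away, the rightmost bar carries most of the probability. The Q-value is expressed in terms of this distribution over distance bins.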
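As a rough illustration of how a distribution over distance bins can be turned into a single number, the sketch below takes the expected bin index and negates it. Reading the Q-value off as the negative expected distance is an assumption made here for illustration, and the function names are hypothetical.

```python
def expected_distance(bin_probs):
    """Expected number of steps under a distribution over distance bins.

    bin_probs[i] is the predicted probability that the goal is i steps away;
    the last bin stands for "this many steps or more", so the result is a
    lower bound when that bin has mass.
    """
    return sum(i * p for i, p in enumerate(bin_probs))


def q_value(bin_probs):
    # Assumption: the scalar Q-value is the negative expected distance.
    return -expected_distance(bin_probs)


at_goal = [1.0, 0.0, 0.0, 0.0]   # all mass on distance 0
far_away = [0.0, 0.0, 0.0, 1.0]  # all mass on the "3 or more" bin
unsure = [0.0, 0.5, 0.5, 0.0]    # 1 or 2 steps, equally likely
```

With these inputs, `expected_distance(at_goal)` is 0, `expected_distance(far_away)` is 3, and `expected_distance(unsure)` is 1.5.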

The update uses the KL divergence.
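One way such a KL-based update can work is sketched below. It assumes the target distribution is built by shifting the next state's predicted distribution one bin to the right (one more step away), with overflow collecting in the last bin, and by using a one-hot at distance 0 when the goal has been reached. These details and all names are assumptions for illustration, not taken from the post.

```python
import math


def shifted_target(next_probs, reached_goal):
    """Build a target distribution over distance bins (assumed scheme)."""
    n = len(next_probs)
    if reached_goal:
        # At the goal the distance is exactly 0.
        return [1.0] + [0.0] * (n - 1)
    target = [0.0] * n
    for i, p in enumerate(next_probs):
        # One more step than the next state; the last bin absorbs overflow.
        target[min(i + 1, n - 1)] += p
    return target


def kl_divergence(target, predicted, eps=1e-8):
    """KL(target || predicted), skipping bins where the target has no mass."""
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(target, predicted)
        if t > 0.0
    )


next_state_probs = [1.0, 0.0, 0.0, 0.0]  # next state is already at the goal
target = shifted_target(next_state_probs, reached_goal=False)
loss_good = kl_divergence(target, [0.0, 1.0, 0.0, 0.0])  # matches the target
loss_bad = kl_divergence(target, [0.7, 0.1, 0.1, 0.1])   # mostly wrong bin
```

In an actual training loop the predicted distribution would come from the Q-network and the loss would be minimized by gradient descent; here the two losses only show that a prediction matching the target scores lower.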

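With this in hand, we effectively have the edge weights of the graph, so the graph can be built from the states and the distances between them.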

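The algorithm has a MAXDIST parameter: if the distance between two states exceeds this value, no edge is placed between them. Once the graph is built, planning can be used to find the shortest path to the goal state. Given that path, the goal-conditioned policy walks to each node in turn and finally reaches the goal.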
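The graph construction and search described above can be sketched as follows. This is a minimal illustration, not the authors' code: `build_graph` and `shortest_path` are hypothetical names, Dijkstra's algorithm is used as one possible graph search, and an absolute difference between numbers stands in for the learned distance.

```python
import heapq


def build_graph(states, distance_fn, max_dist):
    """Directed graph over state indices; edges longer than max_dist are dropped."""
    graph = {i: {} for i in range(len(states))}
    for i, s in enumerate(states):
        for j, t in enumerate(states):
            if i != j:
                d = distance_fn(s, t)
                if d <= max_dist:
                    graph[i][j] = d
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns the node indices on the path, or None."""
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            path = [node]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(queue, (new_cost, nxt))
    return None


# Stand-in data: states on a line, with |a - b| playing the learned distance.
states = [0, 2, 4, 6, 8]  # index 0 is the start, index 4 is the goal
graph = build_graph(states, lambda a, b: abs(a - b), max_dist=3)
waypoints = shortest_path(graph, start=0, goal=4)

# With a MAXDIST that is too small, the graph has no edges and no path exists.
sparse = build_graph(states, lambda a, b: abs(a - b), max_dist=1)
no_path = shortest_path(sparse, start=0, goal=4)
```

Here `waypoints` visits every intermediate state because only adjacent states are within MAXDIST, while `no_path` is `None`. The second case hints at why the choice of MAXDIST matters so much.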

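Finally, the authors point out that the distance estimates are crucial, so on top of distributional Q-learning they also train several models and use them as an ensemble.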
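The post does not say how the ensemble's predictions are combined. The sketch below assumes a pessimistic rule that takes the largest predicted distance, so that an edge is only trusted when every model agrees it is short. Both the rule and the toy models are assumptions for illustration.

```python
def ensemble_distance(models, state, goal, aggregate=max):
    """Combine distance estimates from several models.

    `models` is a list of callables (state, goal) -> predicted distance.
    `aggregate` defaults to max, i.e. the most pessimistic estimate
    (an assumption; min or a mean could be plugged in instead).
    """
    return aggregate(model(state, goal) for model in models)


# Toy stand-ins for independently trained distance models.
models = [
    lambda s, g: abs(g - s),        # accurate model
    lambda s, g: 1.5 * abs(g - s),  # overestimates
    lambda s, g: 0.5 * abs(g - s),  # underestimates
]

pessimistic = ensemble_distance(models, 0, 10)
optimistic = ensemble_distance(models, 0, 10, aggregate=min)
```

A pessimistic estimate guards against a single model that wrongly believes two far-apart states are close, which would otherwise create a spurious shortcut edge in the graph.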
Summary:

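I find this quite interesting. Planning (graph search) acts as a fixed high-level policy that decides which subgoals to visit, while the Q-learning policy acts as the low-level policy that reaches each subgoal. Some of the details are not clear to me, though. Also, this approach seems to work only for path-planning problems.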
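Questions: How are the subgoals chosen? The nodes in the graph must be abstracted in some way, since surely not every state can be put into the graph?
Is the main role of RL here just to explore and collect data for the replay buffer, with graph construction being the key step, so that the Q-learning policy only needs to be good enough to reach each subgoal?
How should the inverse model be understood?
When building the graph, the MAXDIST parameter seems to have a large effect on the results, so the method does not look very stable?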
