MOPO: Model-based Offline Policy Optimization

阿新 • Published: 2021-10-21

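Publication: 2020 (NeurIPS 2020)
Key points: Most mainstream offline RL methods are model-free. They typically have to constrain the policy to the support of the data, so they cannot generalize to states that were never seen. The authors propose the Model-based Offline Policy Optimization (MOPO) algorithm, which tackles offline RL with a model-based method and adds a soft reward penalty that captures the uncertainty of the environment dynamics (applying them with rewards artificially penalized by the uncertainty of the dynamics). This amounts to a trade-off between generalization and risk. As the authors put it, the algorithm is allowed to take on some risk in exchange for better generalization (policy is allowed to take a few risky actions and then return to the confident area near the behavioral distribution without being terminated). Concretely, an ensemble of transition functions is first learned from the data; each one is a Gaussian distribution over the next state and the reward, parameterized by a neural network:

$$\hat{T}^{i}_{\theta,\phi}(s_{t+1}, r \mid s_t, a_t) = \mathcal{N}\big(\mu^{i}_{\theta}(s_t, a_t),\ \Sigma^{i}_{\phi}(s_t, a_t)\big), \quad i = 1, \dots, N$$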
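As a rough illustration of what "learning a Gaussian transition function" could involve, the sketch below computes the negative log-likelihood that such a model would typically be trained to minimize. It assumes a diagonal covariance parameterized by a log-variance vector; the function name `gaussian_nll` and the toy shapes are illustrative and not taken from the paper.

```python
import numpy as np

def gaussian_nll(mean, log_var, target):
    """Per-sample negative log-likelihood of `target` under
    N(mean, diag(exp(log_var))). Diagonal covariance is an assumption."""
    var = np.exp(log_var)
    # Sum over output dimensions (next-state features plus reward).
    per_dim = log_var + (target - mean) ** 2 / var + np.log(2.0 * np.pi)
    return 0.5 * np.sum(per_dim, axis=-1)

# Toy check: 2 samples, 3 output dims (e.g. 2 state dims + 1 reward).
mean = np.zeros((2, 3))
log_var = np.zeros((2, 3))   # unit variance
target = np.zeros((2, 3))    # target equals the predicted mean
nll = gaussian_nll(mean, log_var, target)
```

In a real setup the `mean` and `log_var` arrays would be the outputs of each network in the ensemble, and the loss would be averaged over a minibatch of dataset transitions.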

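With these models in hand, a penalty is added to the original reward: take the largest norm of the predicted covariance across the learned dynamics models, and the reward becomes

$$\tilde{r}(s, a) = \hat{r}(s, a) - \lambda \max_{i = 1, \dots, N} \big\lVert \Sigma^{i}_{\phi}(s, a) \big\rVert_F$$

where $\hat{r}$ is the reward predicted by the model and $\lambda$ is the penalty coefficient.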
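The penalty step can be sketched in a few lines: compute a norm of each model's predicted covariance, take the maximum over the ensemble, and subtract a scaled version of it from the predicted reward. The sketch assumes each covariance is diagonal, so its Frobenius norm equals the L2 norm of the diagonal; `penalized_reward`, `sigma_diag`, and `lam` are illustrative names.

```python
import numpy as np

def penalized_reward(reward_hat, sigma_diag, lam):
    """reward_hat: (batch,) predicted rewards.
    sigma_diag: (N, batch, dim) diagonal of each ensemble member's Sigma.
    lam: penalty coefficient."""
    # For a diagonal matrix, the Frobenius norm is the L2 norm of its diagonal.
    frob = np.linalg.norm(sigma_diag, axis=-1)   # shape (N, batch)
    penalty = frob.max(axis=0)                   # worst case over the ensemble
    return reward_hat - lam * penalty

# Two ensemble members, one state-action pair, two output dimensions.
sigma_diag = np.array([[[3.0, 4.0]],    # norm 5
                       [[1.0, 0.0]]])   # norm 1
r_tilde = penalized_reward(np.array([10.0]), sigma_diag, lam=0.5)
```

Here the larger norm (5) wins, so the reward drops from 10 to 7.5. A larger `lam` makes the policy more conservative in regions where the models are uncertain.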

With both the model and the penalized reward in place, a standard RL algorithm can be run directly on top of them; the paper uses SAC.
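To show how the model and the penalized reward might feed a policy learner, here is a minimal sketch that generates short rollouts from the ensemble and stores penalized transitions in a buffer that an algorithm such as SAC could sample from. Everything here is a placeholder under stated assumptions: `toy_member` stands in for a learned network, the policy is random, the covariance is diagonal, and picking a random ensemble member per step is an assumption, not something stated in this note.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_member(scale):
    """Hypothetical stand-in for one learned dynamics model."""
    def predict(s, a):
        mean_next = s + 0.1 * a                   # predicted next-state mean
        reward_hat = -np.sum(s ** 2, axis=-1)     # predicted reward
        sigma_diag = scale * np.ones_like(s)      # diagonal of predicted Sigma
        return mean_next, reward_hat, sigma_diag
    return predict

ensemble = [toy_member(0.1), toy_member(0.3)]

def random_policy(s):
    """Placeholder policy; a real run would use the SAC actor."""
    return rng.uniform(-1.0, 1.0, size=s.shape)

def rollout(start_states, policy, horizon, lam):
    buffer, s = [], start_states
    for _ in range(horizon):
        a = policy(s)
        preds = [member(s, a) for member in ensemble]
        # Penalty: largest covariance norm across the ensemble.
        penalty = np.max([np.linalg.norm(p[2], axis=-1) for p in preds], axis=0)
        # Sample the next state from one randomly chosen member (assumption).
        mean_next, reward_hat, sigma_diag = preds[rng.integers(len(ensemble))]
        s_next = mean_next + np.sqrt(sigma_diag) * rng.standard_normal(s.shape)
        buffer.append((s, a, reward_hat - lam * penalty, s_next))
        s = s_next
    return buffer

buffer = rollout(np.zeros((4, 2)), random_policy, horizon=3, lam=1.0)
```

The policy learner never sees the raw predicted reward, only the penalized one, which is what discourages it from exploiting regions where the models disagree with the data.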

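Summary:

Although the paper derives several formulas along the way and discusses a bound, what it comes down to in practice is adding an estimated uncertainty penalty to the reward, and the authors themselves say that "this estimator lacks theoretical guarantee". So it mainly comes down to the empirical results.
Question: the F in the reward penalty should be the Frobenius norm of the matrix, right?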
