Decoupling Value and Policy for Generalization in Reinforcement Learning

阿新 • Published: 2021-10-11

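Published: 2021 (ICML 2021)
Key points: The paper argues that when training policy-gradient (PG) algorithms, especially on tasks with image inputs, the mainstream practice is to represent the policy and the value function with one shared network instead of separate ones. This leads to policy overfitting: learning the value needs more information than learning the policy, so with a shared representation the policy ends up exploiting irrelevant information, overfits, and generalizes worse. The authors call this the policy-value representation asymmetry. They illustrate that the problem is real with the following example (a figure in the paper).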

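There are two levels. Their first states are identical except for the background color, so in principle the policy should act exactly the same in both and should not use the background color at all. The later states of the two levels differ, however, which means the value of the first state differs between them, so the value function should use the background color to fit the two different values. The authors take this example to show that the policy and the value function need different amounts of information.
The authors' proposed solution:
1) Split the policy and the value function into two separate networks. Earlier papers showed that doing only this can actually make training worse, so the policy network fits the advantage in addition to the policy. The authors' explanation is that the advantage is a relative quantity tied to actions, not an absolute quantity tied to states, so it should not overfit by exploiting that irrelevant information. (Presumably the authors mean that the advantage and the policy do not suffer from the policy-value representation asymmetry.) (Because the advantage is a relative measure of an action’s value while the value is an absolute measure of a state’s value, the advantage can be expected to vary less with the number of remaining steps in the episode. Thus, the advantage is less likely to overfit to such instance specific features.) Another benefit is that the policy and the value networks can now be updated at different frequencies.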
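To make the asymmetry concrete, here is a minimal toy sketch. It is not the paper's environment: the level names, step counts, the two actions, and the reward of -1 per step without discounting are all assumptions chosen for illustration. It shows that the value of the first state depends on the level while the best action does not.

```python
# Toy illustration (hypothetical numbers, not taken from the paper).
# Assumption: the agent gets reward -1 per step until it reaches the goal,
# and returns are undiscounted.

# Steps from the first state to the goal in each (hypothetical) level.
LEVELS = {
    "blue_background": 3,
    "red_background": 10,
}

# Extra steps each (hypothetical) action wastes before the goal is reached.
ACTIONS = {"forward": 0, "wait": 1}


def q_value(steps_to_goal: int, action: str) -> int:
    """Return of taking `action` once and then acting optimally."""
    return -(steps_to_goal + ACTIONS[action])


def state_value(steps_to_goal: int) -> int:
    """Optimal value of the first state: the best Q-value over actions."""
    return max(q_value(steps_to_goal, a) for a in ACTIONS)


def best_action(steps_to_goal: int) -> str:
    """Action with the highest Q-value in the first state."""
    return max(ACTIONS, key=lambda a: q_value(steps_to_goal, a))


for name, steps in LEVELS.items():
    # The value changes with the level; the best action stays the same.
    print(name, "value:", state_value(steps), "best action:", best_action(steps))
```

In this toy setting a value function has to tell the levels apart (here only the background color could do that), whereas a policy that ignores the color is already optimal.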
The overall objective of the policy network has three terms: the first is PPO's clipped objective, the second is the entropy, and the third is the advantage loss.

The value network's loss is an ordinary MSE.
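The two losses in 1) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the batch layout, the function names, and the coefficient values `alpha_s`, `alpha_a`, and `eps` are assumptions, and the objective is assumed to be maximized, so the advantage loss enters with a minus sign.

```python
import math


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped term for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def entropy(probs):
    """Entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def policy_objective(batch, alpha_s=0.01, alpha_a=0.25, eps=0.2):
    """Policy-network objective (to maximize): clip term + entropy - advantage loss.

    Each sample is a dict with the hypothetical keys
    'ratio', 'adv_target', 'adv_pred', and 'probs'.
    """
    clip_term = mean(clipped_surrogate(b["ratio"], b["adv_target"], eps) for b in batch)
    entropy_term = mean(entropy(b["probs"]) for b in batch)
    adv_loss = mean((b["adv_pred"] - b["adv_target"]) ** 2 for b in batch)
    return clip_term + alpha_s * entropy_term - alpha_a * adv_loss


def value_loss(batch):
    """Value-network loss (to minimize): MSE between predicted value and return."""
    return mean((b["value_pred"] - b["return_target"]) ** 2 for b in batch)


batch = [
    {"ratio": 1.5, "adv_target": 2.0, "adv_pred": 1.0, "probs": [0.5, 0.5],
     "value_pred": 1.0, "return_target": 3.0},
]
print("policy objective:", policy_objective(batch))
print("value loss:", value_loss(batch))
```

Because the two functions share no parameters in this setup, nothing ties their update schedules together, which matches the point that the two networks can be updated at different frequencies.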

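2) On top of 1), add an auxiliary task to keep the policy from overfitting. A discriminator is trained adversarially so that, given two states encoded by the policy network, it cannot tell which one came earlier and which one came later. The discriminator's own loss is to tell which state came first and which came later.

The encoder's loss is to keep the discriminator from telling them apart.

The two losses look alike, but they differ: the discriminator parameters ψ are trained with a cross-entropy loss against exact 0/1 labels, while the encoder parameters θ are trained to maximize the entropy of the discriminator's output, that is, to push the probability of each label toward 0.5. The overall RL objective then becomes the policy-network objective from 1) with this encoder loss added as an extra term.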
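The difference between the two adversarial losses can be sketched as follows. This is a minimal sketch under assumptions: `d` stands for the discriminator's predicted probability that the first state of a pair came earlier, and the encoder loss is written as a cross-entropy against a uniform (0.5, 0.5) target, which is one common way to express "push the output toward 0.5". The function names are hypothetical.

```python
import math


def discriminator_loss(d, first_is_earlier):
    """Cross-entropy against the exact 0/1 ordering label (trains psi)."""
    y = 1.0 if first_is_earlier else 0.0
    return -(y * math.log(d) + (1.0 - y) * math.log(1.0 - d))


def encoder_loss(d):
    """Cross-entropy against a uniform target (trains theta).

    It is smallest when d = 0.5, i.e. when the discriminator cannot tell
    which of the two encoded states came first.
    """
    return -0.5 * math.log(d) - 0.5 * math.log(1.0 - d)


for d in (0.1, 0.5, 0.9):
    print(
        f"d={d}: discriminator loss (label: first is earlier) = "
        f"{discriminator_loss(d, True):.3f}, encoder loss = {encoder_loss(d):.3f}"
    )
```

The discriminator loss keeps falling as `d` approaches the true label, while the encoder loss is symmetric around 0.5 and grows in both directions, so the two objectives pull against each other.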

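Summary:

The problem the paper raises feels somewhat forced. Both the example from procedurally generated environments and the later explanation for why an advantage head has to be added after splitting the networks seem logically a bit odd. The authors do admit at the end: While our experiments show that predicting the advantage function improves generalization, we currently lack a firm theoretical argument for this. Perhaps it simply works well enough. Still, in the experimental results, especially those in the appendix, the method fails to beat a baseline called PPG on quite a few environments. One can also imagine how many experiments the authors ran and how many hyperparameters they tuned.
Questions:

None.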
