Evaluating the Performance of Reinforcement Learning Algorithms

阿新 • Published: 2021-09-20

Publication date: 2020 (ICML 2020)
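Key points: The paper argues that RL results are hard to reproduce because evaluation metrics are inconsistent. The authors propose that an evaluation metric should satisfy four criteria:
1. Scientific: the information the metric provides should tell the reader what conclusion was reached for a specific question or hypothesis, and whether that conclusion accounts for the problems that various sources of uncertainty may cause.
2. Usability: how broadly applicable the algorithm is, whether it works in general, whether it still needs hyperparameter tuning, and a clear statement of how much time that tuning takes.
3. Nonexploitative: do not claim an algorithm works well because it scores highly on over-represented environments, and do not exploit loopholes in how scores are computed (abusing a particular score normalization method).
4. Tractable: the experiments must be repeatable.
The authors then sum up the standard in one question: which algorithm(s) perform well across a wide variety of environments with little or no environment-specific tuning?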
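Based on this, the authors propose a new measure, performance percentiles, for comparing algorithms, and then propose performance bound propagation to quantify the uncertainty of the evaluation process. The starting point for performance percentiles is that, after testing an algorithm on many environments, the reward magnitudes differ from one environment to the next, so normalization is needed to remove the scale problem. Performance percentiles treat the score on each environment as a random variable and pass it through its cumulative distribution function (CDF), which maps all reward samples onto the interval [0,1] (projecting a random variable through its CDF transforms the variable to be uniform on [0,1]).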

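In the paper's figure, the points on the left are the samples after being mapped onto [0,1]. The authors' point is that every environment can be mapped this way, which puts the scores from all environments on a common scale. To evaluate an algorithm, take a weighted sum of its per-environment CDF values.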
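The normalization described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the environment names, the sample returns, and the uniform weights are all assumptions made up for the example, and the empirical CDF stands in for the true one.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return F(x) = fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def normalized_score(returns, reference_returns):
    """Average the reference CDF over the candidate's returns; result is in [0, 1]."""
    cdf = empirical_cdf(reference_returns)
    return sum(cdf(r) for r in returns) / len(returns)


def aggregate(scores_by_env, weights_by_env):
    """Weighted sum of per-environment normalized scores."""
    return sum(weights_by_env[env] * score for env, score in scores_by_env.items())


# Hypothetical returns on two environments with very different reward scales.
reference = {
    "env_small": [0.1, 0.2, 0.3, 0.4],
    "env_large": [100.0, 200.0, 300.0, 400.0],
}
candidate = {
    "env_small": [0.25, 0.35],
    "env_large": [150.0, 450.0],
}

scores = {env: normalized_score(candidate[env], reference[env]) for env in reference}
weights = {"env_small": 0.5, "env_large": 0.5}  # uniform weights: an assumption
total = aggregate(scores, weights)
print(scores, total)
```

Both environments end up on the same [0,1] scale even though their raw rewards differ by a factor of a thousand, which is the point of the normalization.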

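The remaining question is how to set the weights. The authors set up a two-player game and look for its equilibrium. In this game, the first player picks an algorithm to maximize performance across all environments, and the second player then picks an algorithm together with an environment to minimize the first player's score. In other words, the first player can only choose an algorithm, while the second player decides both the test environment and the other algorithm whose CDF is used for normalization. The game is zero-sum. The formal definition is given in the paper.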
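To make the structure of this game concrete, the sketch below builds the payoff for every pairing of a first-player algorithm with a second-player choice of (environment, normalizing algorithm), then reports each algorithm's worst case. The returns, names, and helper functions are hypothetical, and taking a pure-strategy worst case is a simplification chosen here for illustration; it is not the equilibrium computation the paper uses.

```python
from bisect import bisect_right


def mean_cdf_score(returns, reference_returns):
    """Average empirical CDF of the reference samples, evaluated at each return."""
    ordered = sorted(reference_returns)
    hits = sum(bisect_right(ordered, r) for r in returns)
    return hits / (len(returns) * len(ordered))


# Hypothetical data: returns[algorithm][environment] is a list of sampled returns.
returns = {
    "algo_a": {"env_1": [1.0, 2.0, 3.0], "env_2": [10.0, 20.0, 30.0]},
    "algo_b": {"env_1": [2.0, 3.0, 4.0], "env_2": [5.0, 15.0, 25.0]},
}


def payoff_table(returns):
    """Payoff to player 1 for choosing algorithm i against player 2's (env j, algo k)."""
    algos = list(returns)
    envs = list(next(iter(returns.values())))
    table = {}
    for i in algos:
        for j in envs:
            for k in algos:
                table[(i, (j, k))] = mean_cdf_score(returns[i][j], returns[k][j])
    return table


def worst_case(table):
    """For each algorithm, the lowest payoff over all second-player choices."""
    worst = {}
    for (i, _adversary_choice), value in table.items():
        worst[i] = min(worst.get(i, 1.0), value)
    return worst


table = payoff_table(returns)
worst = worst_case(table)
print(worst)
```

Because the game is zero-sum, the second player's payoff for each entry is simply the negative of the value stored in `table`.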

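Once the game is defined, the next step is to find an equilibrium solution, which the authors do with α-Rank. Suppose it yields the probabilities \((p^*,q^*)\) with which each algorithm (strategy) is selected; the weighted sum is then computed with these probabilities as the weights (the formula is given in the paper).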
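One way to write such a weighted sum is shown below. The notation is an assumption introduced here to match the description of the game, not copied from the paper: \(X_{i,j}\) is the performance of algorithm \(i\) on environment \(j\), \(F_{k,j}\) is the CDF of the normalizing algorithm \(k\)'s performance on environment \(j\), and \(q^*_{j,k}\) is the equilibrium probability that the second player picks the pair \((j,k)\).

```latex
\[
  y_i \;=\; \sum_{j}\sum_{k} q^*_{j,k}\,
  \mathbb{E}\!\left[ F_{k,j}\!\left(X_{i,j}\right) \right]
\]
```

Under this reading, \(y_i\) is the aggregate score of algorithm \(i\), and \(p^*\) is the first player's equilibrium distribution over algorithms.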

Honestly, I don't quite see the advantage of doing it this way.
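Next comes the confidence interval computation, namely performance bound propagation (PBP), which quantifies the uncertainty of the evaluation process. The rough idea is to compute a confidence interval for each environment first, find the \((p^*,q^*)\) that are consistent with those intervals, compute the performance percentiles from them, and then take their lower and upper bounds. The exact procedure is in the paper's appendix.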
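The first step, a per-environment confidence interval, can be illustrated with a confidence band on an empirical CDF. The sketch below uses the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality as the band; that choice, the function name `dkw_band`, and the sample data are assumptions for illustration, and the paper's appendix should be consulted for the intervals it actually uses. The sketch only covers uncertainty in the reference CDF and does not propagate anything through the game.

```python
import math
from bisect import bisect_right


def dkw_band(samples, delta=0.05):
    """Return (lower, upper, eps): CDF bounds that hold jointly with prob. >= 1 - delta."""
    ordered = sorted(samples)
    n = len(ordered)
    # DKW: sup_x |F_n(x) - F(x)| <= eps with probability at least 1 - delta.
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    def lower(x):
        return max(0.0, bisect_right(ordered, x) / n - eps)

    def upper(x):
        return min(1.0, bisect_right(ordered, x) / n + eps)

    return lower, upper, eps


# Hypothetical reference returns on one environment, and a few candidate returns.
reference = [float(v) for v in range(100)]
candidate = [24.0, 49.0, 74.0]

lower, upper, eps = dkw_band(reference, delta=0.05)
score_lo = sum(lower(r) for r in candidate) / len(candidate)
score_hi = sum(upper(r) for r in candidate) / len(candidate)
print(round(score_lo, 4), round(score_hi, 4))
```

With only 100 reference samples the band is already more than 0.13 wide on each side, which gives a sense of how many episodes are needed before such bounds become tight.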

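Summary: Honestly, this is far too complicated, and nobody is going to use it. Given the computation involved, never mind the time to train the models: evaluation alone takes a long time, since it requires running a large number of episodes in the environments and then computing α-Rank on top of that. As the number of algorithms grows, the time increases exponentially. On top of that, although the method appears to make sense, it is not easy to see where its advantage lies.
Questions:

I don't know how α-Rank actually works. Also, what kind of equilibrium is this? A Nash equilibrium?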
