lec-4-Introduction to Reinforcement Learning

阿新 • Published: 2022-05-09

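How imitation learning differs from RL

Imitation learning requires guidance from an expert (e.g., expert demonstrations)
RL does not need access to an expert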

RL Definitions

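Reward function
Markov decision process
- Markov property: the next state depends only on the current state and action, not on the earlier history
Objective
Horizon
- Finite
  - the optimal parameters θ* can be found by maximizing the sum of expected rewards over the T time steps
- Infinite
  - show that the state-action distribution p(s, a) converges to a stationary distribution (under ergodicity), so the objective becomes the expected reward under that distribution
Expectations
- The reward function itself may not be smooth
  - Workaround: a reward that looks non-smooth or even sparse (non-smooth or non-differentiable) can still be optimized through its expectation, because the expectation under a smooth, differentiable probability distribution is a smooth function of the policy parameters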
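
A minimal sketch of two of the ideas above, in plain Python: power iteration that finds a stationary distribution mu = T mu for the infinite-horizon case, and the point that a step-shaped reward becomes smooth once we take its expectation under a parameterized policy. The 2x2 operator `T`, the +1/-1 road/off-road reward, and `p_fall = sigmoid(theta)` are hypothetical toy choices, not taken from the notes.

```python
import math


def stationary_distribution(T, iters=10_000, tol=1e-12):
    """Find mu with mu = T mu by power iteration.

    T[i][j] is the probability of moving to state-action pair i from pair j,
    so every column of T sums to 1.
    """
    n = len(T)
    mu = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        nxt = [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, mu)) < tol:
            return nxt  # mu no longer changes: (approximately) stationary
        mu = nxt
    return mu


def expected_reward(theta):
    """E[r] under a policy that falls off the road with prob sigmoid(theta).

    r = +1 on the road and -1 off it (a step, not differentiable in the outcome),
    but the expectation 1 - 2 * sigmoid(theta) is smooth in theta.
    """
    p_fall = 1.0 / (1.0 + math.exp(-theta))
    return (1.0 - p_fall) * 1.0 + p_fall * (-1.0)


if __name__ == "__main__":
    # Hypothetical 2x2 state-action transition operator (columns sum to 1).
    T = [[0.9, 0.5],
         [0.1, 0.5]]
    print(stationary_distribution(T))
    # Central finite difference of the expected reward at theta = 0.
    h = 1e-6
    print((expected_reward(h) - expected_reward(-h)) / (2 * h))
```

Running it prints roughly [0.8333, 0.1667] (the solution of mu = T mu for this T) and a finite-difference slope of about -0.5 at theta = 0, even though r itself only jumps between -1 and +1.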

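Algorithms

Basic process:
- Generate samples → fit a model / estimate the return (evaluate the policy) → improve the policy → generate samples again
- Cost of each part
  - Generating samples
    expensive: even a single run in the real world can be very costly (robots, cars, power grids, etc.)
    cheap: running in a simulator
  - Evaluating the policy
    expensive: learning a neural network with a large number of parameters
    cheap: Monte Carlo (MC) estimates, e.g., averaging returns
  - Improving the policy
    expensive: backpropagating gradients through a large number of parameters
    cheap: a gradient update computed from averaged returns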
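
The loop above, using only the "cheap" option at each stage, can be sketched on a toy problem. This is a hypothetical example, not from the lecture: a 3-armed bandit simulator with Gaussian rewards (`arm_means`), a softmax policy over per-arm preferences `theta`, a Monte Carlo average of the batch rewards as the return estimate, and one REINFORCE-style gradient step per iteration as the policy improvement.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def rl_loop(arm_means, iters=500, batch=32, lr=0.1, seed=0):
    """Generate samples -> estimate return -> improve policy, repeated.

    Hypothetical toy bandit: arm k pays arm_means[k] plus N(0, 1) noise.
    Returns the final action probabilities of the softmax policy.
    """
    rng = random.Random(seed)
    n = len(arm_means)
    theta = [0.0] * n
    for _ in range(iters):
        probs = softmax(theta)
        # 1) Generate samples: run the current policy in the (cheap) simulator.
        actions = rng.choices(range(n), weights=probs, k=batch)
        rewards = [rng.gauss(arm_means[a], 1.0) for a in actions]
        # 2) Estimate the return: a Monte Carlo average, used as a baseline.
        baseline = sum(rewards) / batch
        # 3) Improve the policy: one gradient-ascent step on the expected reward.
        grad = [0.0] * n
        for a, r in zip(actions, rewards):
            for k in range(n):
                # d/dtheta_k log softmax(theta)[a] = 1[k == a] - probs[k]
                grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * (r - baseline)
        theta = [t + lr * g / batch for t, g in zip(theta, grad)]
    return softmax(theta)


if __name__ == "__main__":
    print(rl_loop([0.0, 0.5, 1.0]))
```

With these means the returned policy should put most of its probability on the last arm. A single-state bandit keeps each of the three stages to a few lines; a real task would replace the sampling lines with environment rollouts.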
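Value Functions (value-based)
- Core idea: in the second step (evaluating the policy), use a Q-function or a value function
- Definitions
  - Expectation: the objective can be written as nested conditional expectations, $E_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t,a_t)\right] = E_{s_1\sim p(s_1)}\left[E_{a_1\sim\pi(a_1\mid s_1)}\left[r(s_1,a_1) + E_{s_2\sim p(s_2\mid s_1,a_1)}\left[\dots\right]\right]\right]$
  - Q-function: $Q^\pi(s_t,a_t)=\sum_{t'=t}^{T}E_{\pi_\theta}\left[r(s_{t'},a_{t'})\mid s_t,a_t\right]$, the total expected reward from taking $a_t$ in $s_t$ and following $\pi$ afterwards
  - Value function: $V^\pi(s_t)=\sum_{t'=t}^{T}E_{\pi_\theta}\left[r(s_{t'},a_{t'})\mid s_t\right]$, the total expected reward from $s_t$ under $\pi$
  - Relation: $V^\pi(s_t)=E_{a_t\sim\pi(a_t\mid s_t)}\left[Q^\pi(s_t,a_t)\right]$, and $E_{s_1\sim p(s_1)}\left[V^\pi(s_1)\right]$ is the RL objective
  - Idea:
    Policy iteration: policy + Q-function → improve the policy, e.g., set $\pi'(a\mid s)=1$ for $a=\arg\max_a Q^\pi(s,a)$
    Compare Q and V: if $Q^\pi(s,a)>V^\pi(s)$, action a is better than average, so compute a gradient that increases its probability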
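
The definitions above can be checked on a small tabular example. This sketch assumes a finite horizon, a known toy MDP (`P`, `R`) and a time-independent policy `pi`, all hypothetical. It computes Q and V for that policy by backward recursion, uses V(s) = E_{a ~ pi}[Q(s, a)], and then shows both improvement ideas: the advantage Q - V, and the greedy argmax policy.

```python
def evaluate_policy(P, R, pi, horizon):
    """Finite-horizon policy evaluation by backward recursion.

    P[s][a][s2] = p(s2 | s, a), R[s][a] = r(s, a), pi[s][a] = pi(a | s).
    Returns lists Q_all[t][s][a] and V_all[t][s] for t = 0 .. horizon - 1.
    """
    n_s, n_a = len(R), len(R[0])
    V_next = [0.0] * n_s  # no reward after the last step
    Q_all, V_all = [], []
    for _ in range(horizon):
        # Q(s, a) = r(s, a) + E_{s' ~ p(.|s, a)}[V_{t+1}(s')]
        Q = [[R[s][a] + sum(P[s][a][s2] * V_next[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        # V(s) = E_{a ~ pi(.|s)}[Q(s, a)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
        Q_all.insert(0, Q)
        V_all.insert(0, V)
        V_next = V
    return Q_all, V_all


def advantage(Q_t, V_t):
    """A(s, a) = Q(s, a) - V(s); A > 0 means a beats the policy's average action."""
    return [[q - V_t[s] for q in row] for s, row in enumerate(Q_t)]


def greedy_policy(Q_t):
    """Policy improvement: pi'(a|s) = 1 if a = argmax_a Q(s, a), else 0."""
    n_a = len(Q_t[0])
    policy = []
    for row in Q_t:
        best = max(range(n_a), key=lambda a: row[a])
        policy.append([1.0 if a == best else 0.0 for a in range(n_a)])
    return policy


if __name__ == "__main__":
    # Hypothetical 2-state, 2-action MDP: in state 0, action 1 moves to state 1;
    # in state 1, action 0 stays there and pays reward 1.
    P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
    R = [[0.0, 0.0], [1.0, 0.0]]
    pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy
    Q_all, V_all = evaluate_policy(P, R, pi, horizon=3)
    print(advantage(Q_all[0], V_all[0]))  # [[-0.25, 0.25], [0.75, -0.75]]
    print(greedy_policy(Q_all[0]))        # [[0.0, 1.0], [1.0, 0.0]]
```

Both ideas agree here: the actions with positive advantage are exactly the ones the greedy policy selects.
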
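Types of algorithms
- Policy gradients: directly differentiate the objective with respect to the policy parameters
- Value-based: fit/estimate Q or V (of the optimal policy), with no explicit policy
- Actor-critic: estimate the value function or Q-function of the current policy and use it to improve the policy
- Model-based RL: learn a transition model and use it mainly in the policy-improvement step (planning, or improving an explicit policy)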
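Algorithm tradeoffs (which is why there are so many algorithms)
- Sample efficiency: how many samples are needed to get a good policy; off-policy methods can reuse samples collected by earlier policies, while on-policy methods need new samples whenever the policy changes
- Stability and ease of use
  - Value function fitting: fixed-point iteration
    - with deep networks, convergence is not guaranteed
  - Model-based
    - model fitting converges, but a better model does not guarantee a better policy
  - Policy gradient
    - the only one that actually performs gradient descent (ascent) on the true objective
- Algorithm families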
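
For the "fixed-point iteration" point, here is a minimal tabular sketch. The two-state MDP and the discount `gamma` are hypothetical, and since the notes do not say which backup is meant, this uses the Bellman optimality backup. In the tabular case with gamma < 1 the backup is a contraction in the max norm, so the loop is guaranteed to converge. Once V is replaced by a deep network fitted by regression to these targets, that guarantee no longer holds in general, which is the caveat in the list above.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Fixed-point iteration V <- T V for the Bellman optimality operator T.

    (T V)(s) = max_a [ r(s, a) + gamma * sum_s' p(s' | s, a) V(s') ].
    For a tabular V and gamma < 1, T is a gamma-contraction in the max norm,
    so the iterates converge to the unique fixed point V*.
    Returns (V, number_of_iterations).
    """
    n_s, n_a = len(R), len(R[0])
    V = [0.0] * n_s
    for k in range(1, max_iters + 1):
        V_new = [max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
                     for a in range(n_a))
                 for s in range(n_s)]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new, k  # successive iterates agree: (near) fixed point
        V = V_new
    return V, max_iters


if __name__ == "__main__":
    # Hypothetical MDP: state 1 pays 1 per step if you stay (action 0);
    # state 0 pays nothing but action 1 moves to state 1.
    P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
    R = [[0.0, 0.0], [1.0, 0.0]]
    print(value_iteration(P, R))
```

On this toy MDP the loop converges to V ≈ [9, 10]: staying in state 1 earns 1 / (1 - 0.9) = 10, and state 0 is one discounted step away from it.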
