
Reinforcement Learning by Hung-yi Lee (李宏毅): personal notes

P2: Proximal Policy Optimization (PPO)

Importance Sampling

On-policy -> Off-policy
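
The underlying identity (standard importance sampling, not spelled out in the original notes): sampling from a different distribution $q$ still lets us estimate an expectation under $p$, as long as each sample is reweighted:

$$\mathbb{E}_{x\sim p}[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$

This is what allows data collected by the old policy to be reused for updating the current one.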
Gradient for update:
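The slide image is missing here; as a hedged reconstruction, the importance-weighted gradient from the lecture is usually written as (with $A^{\theta'}$ the advantage estimated from data collected by $\pi_{\theta'}$):

$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\right]$$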

Objective function

*KL divergence is commonly used to measure the distance between two probability distributions.

$\mathbb{E}_{(s_t,a_t)\sim\pi_\theta}$ (expectation over state-action pairs sampled from $\pi_\theta$)
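
A hedged reconstruction of the objective the missing slide most likely showed (the KL-penalized form; PPO2 replaces the KL penalty with ratio clipping):

$$J_{\text{PPO}}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\,\mathrm{KL}(\theta,\theta'),\qquad J^{\theta'}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\right]$$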

Q-learning

critic: evaluates how good an action is
Monte-Carlo(MC) based approach
Temporal-difference(TD) approach

$$V^\pi(s_t) = V^\pi(s_{t+1}) + r_t$$

*What do MC and TD correspond to? MC fits $V^\pi(s_t)$ to the full observed return of the episode; TD bootstraps from the estimate of the next state's value.
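
A minimal sketch (my own variable names, with an explicit discount $\gamma$ added) contrasting how MC and TD build the regression target for $V^\pi$:

```python
import numpy as np

def mc_target(rewards, t, gamma=0.99):
    """Monte-Carlo: the target for V(s_t) is the full discounted return G_t,
    which requires waiting until the episode ends."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

def td_target(rewards, values, t, gamma=0.99):
    """Temporal-difference: bootstrap from the current estimate of V(s_{t+1}),
    so the target is available after a single step."""
    return rewards[t] + gamma * values[t + 1]

# toy episode: rewards r_0..r_3 and current value estimates V(s_0)..V(s_4)
rewards = np.array([1.0, 0.0, 0.0, 2.0])
values  = np.array([1.5, 1.0, 1.2, 1.8, 0.0])
print(mc_target(rewards, t=0), td_target(rewards, values, t=0))
```

MC must wait for the episode to end and has higher variance; TD can update every step but inherits any bias in the current value estimate.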

Target Network

Exploration: Epsilon Greedy
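
A minimal sketch of the idea, assuming `q_values` is the array of action values for the current state; epsilon is usually decayed over training:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """With probability epsilon explore uniformly, otherwise exploit argmax Q."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```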

Replay Buffer
Reduces the time spent interacting with the environment.
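
A minimal replay-buffer sketch (class and method names are my own); storing transitions and sampling random mini-batches lets each interaction be reused many times and also decorrelates consecutive samples:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(map(list, zip(*batch)))   # lists of s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```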

Typical Q-learning Algorithm
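
A compressed sketch of the loop, under stated assumptions: `env`, `q_net`, `target_net`, and `train_one_batch` are hypothetical placeholders (not code from the course), while `epsilon_greedy` and `ReplayBuffer` are the helpers sketched above:

```python
def q_learning_loop(env, q_net, target_net, buffer, optimizer,
                    episodes=1000, batch_size=32, gamma=0.99,
                    epsilon=0.1, target_update_every=1000):
    """Interact with epsilon-greedy, store transitions, train on sampled
    mini-batches, and periodically sync the target network."""
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(q_net(s), epsilon)   # q_net(s): action-value vector
            s_next, r, done = env.step(a)           # assumed environment interface
            buffer.push(s, a, r, s_next, done)
            s = s_next
            if len(buffer) >= batch_size:
                # regression target: r + gamma * max_a' Q_target(s', a')
                train_one_batch(q_net, target_net, optimizer,
                                buffer.sample(batch_size), gamma)
            step += 1
            if step % target_update_every == 0:
                target_net.load_state_dict(q_net.state_dict())  # sync target net
    return q_net
```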

Tips of Q-Learning

Double DQN
Q-values are consistently over-estimated, because the target is always set too high (the max over noisy Q estimates tends to pick whichever action happens to be over-estimated).
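
The fix, sketched with my own variable names: the online network chooses the action, the (frozen) target network evaluates it, so one network's over-estimation is not blindly trusted by the other:

```python
import numpy as np

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    """q_online_next / q_target_next: action-value vectors for s_{t+1}
    from the online and target networks respectively."""
    if done:
        return r
    a_star = int(np.argmax(q_online_next))     # action selected by the online net
    return r + gamma * q_target_next[a_star]   # ... but evaluated by the target net
```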

Dueling DQN
Modifies the network architecture: Q is decomposed as Q = V + A (state value plus action advantage).
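
A minimal dueling head in PyTorch (layer sizes are arbitrary assumptions); the advantage stream is mean-centred so the V/A decomposition is identifiable:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, obs):
        h = self.feature(obs)
        v = self.value(h)                       # shape (batch, 1)
        a = self.advantage(h)                   # shape (batch, n_actions)
        # Q = V + (A - mean(A)): the constraint keeps the decomposition unique
        return v + a - a.mean(dim=-1, keepdim=True)
```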

Prioritized Replay

Multi-step
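
Multi-step targets trade off between MC and TD; a hedged reconstruction of the N-step target (indexing conventions vary between sources):

$$\hat{Q}(s_t,a_t) \leftarrow \sum_{t'=t}^{t+N-1}\gamma^{t'-t} r_{t'} + \gamma^{N}\,\hat{Q}(s_{t+N}, a_{t+N})$$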

Q-Learning for Continuous Actions
Q-learning does not handle continuous action spaces well, e.g., autonomous driving or robot control.
Using gradient ascent to solve the optimization problem
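One possible sketch of that solution in PyTorch (`q_net` is an assumed module mapping a state-action pair to a scalar): treat the action as the optimization variable and run an inner gradient-ascent loop.

```python
import torch

def best_action_by_gradient_ascent(q_net, state, action_dim, steps=50, lr=0.1):
    """Inner optimization: approximate argmax_a Q(s, a) for a continuous action.
    Expensive, and only finds a local optimum."""
    a = torch.zeros(action_dim, requires_grad=True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -q_net(state, a)   # maximize Q  <=>  minimize -Q
        loss.backward()
        opt.step()
    return a.detach()
```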

  • policy-based (PPO)
  • value-based (Q-learning)

Asynchronous Advantage Actor-Critic

Review Policy Gradient

$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\left(\sum_{t'=t}^{T_n}\gamma^{t'-t} r_{t'}^{n} - b\right)\nabla \log p_\theta(a_t^n \mid s_t^n)$$

$$\sum_{t'=t}^{T_n}\gamma^{t'-t} r_{t'}^{n} - b \;\approx\; r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n)$$

(the critic's value estimates replace the sampled return and the baseline, giving the advantage)

Actor-Critic
Advantage Actor-Critic

Pathwise Derivative Policy Gradient

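The slide that belongs here did not survive extraction; as a hedged sketch of the idea (essentially the DDPG-style update), the actor is trained so that its output action maximizes the critic's Q, with gradients flowing through Q into the actor, much like a generator trained against a GAN discriminator:

```python
import torch

def pathwise_actor_loss(q_net, actor, states):
    """Pathwise derivative: differentiate Q(s, actor(s)) w.r.t. the actor's parameters."""
    actions = actor(states)                # actions stay in the computation graph
    return -q_net(states, actions).mean()  # ascend Q by descending -Q
```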

Sparse Reward

In most situations, the agent never receives any reward.

Reward Shaping:

ICM = intrinsic curiosity module, which encourages exploration (taking risks)
Based on $a_t$ and $s_t$, Network 1 predicts $s_{t+1}$; the gap between the prediction and the true $s_{t+1}$ is the intrinsic reward. In other words, if the outcome of an action cannot be predicted, that action is encouraged.
A feature extractor keeps only the needed features: if Network 2 can still recover the correct $a_t$ from those features, then whatever was filtered out was indeed useless information.
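
A sketch of the module under those assumptions (all class and layer names are my own; Network 1 is the forward model, Network 2 the inverse model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Intrinsic Curiosity Module sketch: reward = forward-model prediction error
    in a feature space shaped by an inverse model."""
    def __init__(self, obs_dim, n_actions, feat_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())   # feature extractor
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)      # "Network 1"
        self.inverse_model = nn.Linear(2 * feat_dim, n_actions)             # "Network 2"
        self.n_actions = n_actions

    def intrinsic_reward(self, s, a, s_next):
        # a: LongTensor of action indices
        a_onehot = F.one_hot(a, self.n_actions).float()
        f, f_next = self.phi(s), self.phi(s_next)
        f_next_pred = self.forward_model(torch.cat([f, a_onehot], dim=-1))
        # large prediction error -> surprising transition -> larger exploration bonus
        return F.mse_loss(f_next_pred, f_next.detach(), reduction="none").mean(-1)

    def inverse_loss(self, s, a, s_next):
        # predicting a_t from (phi(s_t), phi(s_{t+1})) forces phi to keep only
        # the features the agent's actions can actually influence
        f, f_next = self.phi(s), self.phi(s_next)
        logits = self.inverse_model(torch.cat([f, f_next], dim=-1))
        return F.cross_entropy(logits, a)
```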

Curriculum Learning

Reverse Curriculum Generation: work backwards from the goal, guided by the scores/rewards obtained
Hierarchical RL: hierarchical reinforcement learning

Imitation Learning

Behavior Cloning

  • Limited: data from extreme or rare situations cannot be collected
  • The agent may also clone irrelevant behaviors

Inverse RL
IRL uses expert demonstrations to learn the correct reward function; RL then uses that reward function to obtain the optimal actor. The scheme is similar in spirit to a GAN.

Application: self-driving cars