[Paper Reading] End-to-End Model-Free Reinforcement Learning for Urban Driving Using Implicit Affordances

阿新 • Published: 2022-01-02

Paper title: CVPR2020: End-to-End Model-Free Reinforcement Learning for Urban Driving Using Implicit Affordances
Column: December 14, 2021 11:15 AM
Last edited time: December 31, 2021 6:46 PM
Sensor: 1 RGB
Status: Finished
Summary: RL; carla leaderboard
Type: CVPR
Year: 2020
Citations: 44

References and preface (resources)

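Code: https://github.com/valeoai/LearningByCheating. The repository carries the LBC name, but it is actually the authors' fork of LBC, with their own additions and deletions on top of it (it turns out a forked repository has no issue tab, hmm).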

Paper: CVPR 2020 open access

Video:

https://www.youtube.com/watch?v=hfEos9HpgBA

1. Motivation

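IL vs. DRL

The paper first points out the weakness of imitation learning: all expert data is good, i.e. you assume the expert drives correctly and never makes mistakes. This causes a distribution mismatch, because the data only contains good driving. Some papers [1, 32] address this by adding data with driving errors, but such data usually only covers keeping the car in its own lane, i.e. lateral control.

[ ] I don't understand this part: does "lateral control" here refer to lateral control errors?

DRL, by contrast, does not have the distribution mismatch problem, because its data is gathered while learning through RL from a reward signal. In that setting the question is not whether the car drove right or wrong, but how good the action taken was.

But DRL also faces some problems of its own: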

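It relies on a replay buffer, and the buffer in turn limits the input size to be fairly small
It is a black box → so some methods first extract intermediate information, e.g. compute a segmentation image and then pass it to the controller, to check whether the network attends to the relevant information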
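
To make the replay-buffer constraint concrete, here is a small back-of-the-envelope sketch of how much memory raw-image states would take. The 4 x 3 x 288 x 288 state size follows the input shape assumed later in these notes, one byte per value (uint8) is an assumption, and the buffer capacity of 100,000 states is a purely hypothetical number chosen for illustration.

```python
def replay_buffer_bytes(capacity, frames=4, channels=3, height=288, width=288,
                        bytes_per_value=1):
    """Bytes needed to store `capacity` raw-image states.

    One state is a stack of `frames` RGB images; uint8 storage
    (bytes_per_value=1) is an assumption made for this estimate.
    """
    per_state = frames * channels * height * width * bytes_per_value
    return per_state * capacity


per_state = replay_buffer_bytes(1)
total = replay_buffer_bytes(100_000)  # hypothetical capacity

print(f"one state: {per_state} bytes")
print(f"100,000 states: {total / 1024**3:.1f} GiB")
```

Even at this modest hypothetical capacity the raw states alone run to tens of GiB, which is why storing something smaller than the image is attractive.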

Contribution

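From the comparison above, the authors derive their contributions:

They are the first to successfully apply RL in a complex driving environment, including meeting other traffic at intersections and traffic-light recognition
They introduce a new technique, implicit affordances, which makes RL training with replay memory feasible with a large network and a large input size

The approach has two main stages, so the RL side does not receive the raw image and the replay memory shrinks by roughly 20 times. → This looks like the dimensionality reduction I had in mind earlier
1. Use a ResNet-18 backbone to output features
2. The RL agent then receives these reduced features
They run ablation studies on implicit affordances and reward shaping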

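Overall, apart from the second point, the contributions are mostly things other papers also have;

the second point deserves a closer look in the Method section

That said, I think the introduction is well written: it is fairly complete and clear in presenting and motivating contribution two, and is worth learning from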

2. Method

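The input is one camera (four consecutive frames); with 3 RGB channels each that makes 12 channels in total, so the tensor should be batch_size x 12 x 288 x 288
The output is the steering angle and the throttle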
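
The channel stacking described above can be sketched in plain Python as follows. This is only an illustration of the 4 frames x 3 channels = 12 channels bookkeeping, not the authors' code: `FrameStack`, `make_frame` and `shape` are hypothetical helpers, frames are nested lists shaped [channels][height][width], repeating the first frame at episode start is an assumption, and the demo uses 2 x 2 frames instead of 288 x 288 to keep it small.

```python
from collections import deque


class FrameStack:
    """Keep the last `num_frames` frames and concatenate them along the channel axis."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        if not self.frames:
            # Assumption: at episode start, fill the buffer by repeating the first frame.
            self.frames.extend([frame] * self.frames.maxlen)
        else:
            # The deque drops the oldest frame automatically.
            self.frames.append(frame)

    def stacked(self):
        # Channel-wise concatenation: 4 frames x 3 channels -> 12 channels.
        return [channel for frame in self.frames for channel in frame]


def make_frame(value, channels=3, height=2, width=2):
    """Build a dummy frame filled with `value` (stand-in for a camera image)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def shape(x):
    return (len(x), len(x[0]), len(x[0][0]))


stack = FrameStack()
for t in range(5):
    stack.push(make_frame(t))

obs = stack.stacked()
print(shape(obs))  # channels, height, width of one stacked observation
```

With real 288 x 288 frames and a batch dimension in front, the same concatenation yields the batch_size x 12 x 288 x 288 shape mentioned above.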

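Overall framework

RL setup

The authors chose a value-based method, which means the actions are discrete. The paper does not compare against policy-based methods (left as future work). They build on the open-source Rainbow-IQN Ape-X but remove the dueling network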

Related link: the difference between policy-based and value-based RL methods
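
To illustrate why a value-based method implies discrete actions, here is a minimal sketch of a discretized (steering, throttle) action set with greedy selection over Q-values. The bin counts and ranges (9 steering values in [-1, 1], 3 throttle values) are hypothetical and are not taken from the paper or the repository; the Q-values are dummy numbers standing in for a network output.

```python
from itertools import product


def linspace(lo, hi, n):
    """Evenly spaced values from lo to hi inclusive (pure-Python helper)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


# Hypothetical discretization, chosen only for illustration.
STEERING_BINS = linspace(-1.0, 1.0, 9)
THROTTLE_BINS = [0.0, 0.5, 1.0]

# Every discrete action is one (steering, throttle) pair.
ACTIONS = list(product(STEERING_BINS, THROTTLE_BINS))


def greedy_action(q_values):
    """Pick the action whose estimated Q-value is highest."""
    best_index = max(range(len(q_values)), key=q_values.__getitem__)
    return ACTIONS[best_index]


# Dummy Q-values: pretend the network prefers action index 13.
q_values = [0.0] * len(ACTIONS)
q_values[13] = 1.0

print(len(ACTIONS), greedy_action(q_values))
```

The network outputs one Q-value per entry of `ACTIONS`, so a finer steering resolution directly increases the size of the output layer.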

Reward Shaping

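So "shaping" here simply means... the reward setting

The reward is computed mainly with the waypoint API that Carla provides. At an intersection, a direction (left, right, straight) is chosen at random. The reward is determined by the following three components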

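desired speed: reward range [0,1]

The maximum of 1 is given at the desired speed, and the score falls linearly when the speed is higher or lower; in this paper the desired speed is 40 km/h
- [Figure: illustration of the desired-speed reward]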
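desired position: reward range [-1,0]

A reward of 0 is given when the car is exactly on the waypoint at the lane center; otherwise the reward is negative and inversely related to the distance from it (the farther away, the lower the reward). In the following two cases the episode stops immediately and the reward is set to -1. The paper sets \(D_{max}=2\text{m}\), which is exactly the distance from the center line to the road edge
- the distance to the center waypoint exceeds \(D_{max}\)
- colliding with something, running a red light, or stopping when there is no obstacle or red light ahead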
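desired rotation

This term was added because the authors found that with only the first two, the car simply stops when there is an obstacle instead of driving around it: going around lowers the second reward, while going straight makes it crash → an ablation study confirms this

The reward is inversely related to the angle difference from the optimal trajectory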
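- [ ] When I got to this point I wanted to look up the range of the rotation reward, and then found that the release is... open source in name only: the authors did not release the RL expert code, nor the dataset from the RL runs → surely releasing the dataset without the code should not be hard? The code only loads their model~~, there is not even a Dataloader, hmm~~
- [ ] Another point: during a lane change the reward would still penalize deviating from the angle of the original waypoints. And if it is the angle of the optimal trajectory, the paper never clearly defines who provides that optimal trajectory
  
  Oh, could it be the Carla expert that they dismissed at the beginning? → But no, that expert cannot change lanes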
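
The speed and position terms described in this section can be sketched as below. Only the 40 km/h desired speed, the 2 m \(D_{max}\), the reward ranges and the termination cases come from the notes above. The exact slopes (speed reward reaching 0 at standstill and at twice the desired speed, position reward linear in the distance), summing the two terms, and the flag names are assumptions made for illustration. The rotation term is left out because its range is not given.

```python
DESIRED_SPEED_KMH = 40.0  # from the notes above
D_MAX_M = 2.0             # from the notes above


def speed_reward(speed_kmh, desired=DESIRED_SPEED_KMH):
    """1 at the desired speed, falling linearly to 0.

    Assumption: the reward hits 0 at standstill and at twice the desired speed.
    """
    return max(0.0, 1.0 - abs(speed_kmh - desired) / desired)


def position_reward(dist_m, d_max=D_MAX_M):
    """0 on the lane center, falling to -1 at d_max (linear shape is an assumption)."""
    return 0.0 - min(dist_m, d_max) / d_max


def step_reward(speed_kmh, dist_m, collided=False, ran_red_light=False,
                stopped_without_reason=False):
    """Return (reward, done) for one step.

    The episode ends with reward -1 in the termination cases listed above.
    Summing the two terms is an assumption; the flag names are hypothetical.
    """
    if dist_m > D_MAX_M or collided or ran_red_light or stopped_without_reason:
        return -1.0, True
    return speed_reward(speed_kmh) + position_reward(dist_m), False


print(step_reward(40.0, 0.0))   # on the center line at the desired speed
print(step_reward(20.0, 1.0))   # too slow and off-center
print(step_reward(40.0, 2.5))   # farther than D_max: episode ends
```

With only these two terms, steering around an obstacle lowers the position term while stopping avoids a crash, which matches the behavior that motivated the rotation term.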

Network

Overall diagram

① RGB → the ResNet stage, output at classif_state_net, which is flattened into a one-dimensional vector of length 8192

classif_state_net = encoding.view(-1, self.size_state_RL)

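② This stage is a bit odd, mainly because the block diagram is odd. Checking against the only model in the code, the input is not some uniform sampling but directly the encoder output from the image branch above, except that it goes through several sampled_block layers instead of view(-1) → so ② and ③ do not appear in the code → Mystery solved: ② is indeed not there, but what follows ③ is in the DQN model, where two Linear+ReLU layers under each NoisyLayer lead straight to the output

Looking only at ①, which corresponds to the decoder part of the figure below:

Lower half of the figure: flatten to a one-dimensional vector of length 8192, then one Linear & ReLU layer down to length 1024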

return classif_output, state_output, dist_to_tl_output, delta_position_yaw_output
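Upper half of the figure: before flattening, the features go into a separate stack of sampled_block layers (five in the code quoted below, up_sampled_block_0 to up_sampled_block_4), each of the form [Upsample, Conv2d, BatchNorm2d, ReLU, Conv2d, BatchNorm2d], which directly returns out_seg

But there is a problem here: the size actually output by the code should be 24x73x128, and the code feeds exactly that into rl_state_net on the DQN side

Oh, so this step is the output of the decoder; see this figure: (what on earth is the first figure drawing, hmm)

①: after ResNet-18 there is one more [Conv2d, BatchNorm2d] layer, and then the result is flattened to one dimension: 8192

decoder: before the flatten in ①, the features go into a separate stack of sampled_block layers (up_sampled_block_0 to up_sampled_block_4 in the code below), each of the form [Upsample, Conv2d, BatchNorm2d, ReLU, Conv2d, BatchNorm2d], and the output is out_seg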

# Segmentation branch
upsample0 = self.up_sampled_block_0(encoding)  # 512*8*8 or 512*6*8 (crop sky)
upsample1 = self.up_sampled_block_1(upsample0)  # 256*16*16 or 256*12*16 (crop sky)
upsample2 = self.up_sampled_block_2(upsample1)  # 128*32*32 or 128*24*32 (crop sky)
upsample3 = self.up_sampled_block_3(upsample2)  # 64*64*64 or 64*48*64 (crop sky)
upsample4 = self.up_sampled_block_4(upsample3)  # 32*128*128 or 32*74*128 (crop sky)

out_seg = self.last_bn(self.last_conv_segmentation(upsample4))  # nb_class_segmentation*128*128

# ===================================================
# We will upsample image with nearest neighbor interpolation between each upsample block
# https://distill.pub/2016/deconv-checkerboard/
self.up_sampled_block_0 = create_resnet_basic_block(6, 8, 512, 512)
self.up_sampled_block_1 = create_resnet_basic_block(12, 16, 512, 256)
self.up_sampled_block_2 = create_resnet_basic_block(24, 32, 256, 128)
self.up_sampled_block_3 = create_resnet_basic_block(48, 64, 128, 64)
self.up_sampled_block_4 = create_resnet_basic_block(74, 128, 64, 32)

# ===================================================
def create_resnet_basic_block(
    width_output_feature_map, height_output_feature_map, nb_channel_in, nb_channel_out
):
    basic_block = nn.Sequential(
        nn.Upsample(size=(width_output_feature_map, height_output_feature_map), mode="nearest"),
        nn.Conv2d(
            nb_channel_in,
            nb_channel_out,
            kernel_size=(3, 3),
            stride=(1, 1),
            padding=(1, 1),
            bias=False,
        ),
        nn.BatchNorm2d(
            nb_channel_out, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
        ),
        nn.ReLU(inplace=True),
        nn.Conv2d(
            nb_channel_out,
            nb_channel_out,
            kernel_size=(3, 3),
            stride=(1, 1),
            padding=(1, 1),
            bias=False,
        ),
        nn.BatchNorm2d(
            nb_channel_out, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
        ),
    )
    return basic_block
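
To double-check the shape comments in the segmentation branch, the sketch below traces the output shape of each block without needing PyTorch. It relies on two facts: nn.Upsample(size=...) takes (height, width) and fixes the spatial size regardless of the input, so the first argument of create_resnet_basic_block is effectively the output height despite its name; and a 3x3 convolution with stride 1 and padding 1 preserves the spatial size. The `conv2d_out` and `trace_decoder` helpers are hypothetical, and the block list is copied from the crop-sky configuration quoted above.

```python
def conv2d_out(size, kernel=3, stride=1, padding=1):
    """Standard output-size formula for a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (upsample_height, upsample_width, out_channels) for up_sampled_block_0..4,
# copied from the create_resnet_basic_block calls quoted above.
BLOCKS = [
    (6, 8, 512),
    (12, 16, 256),
    (24, 32, 128),
    (48, 64, 64),
    (74, 128, 32),
]


def trace_decoder(blocks=BLOCKS):
    """Return the (channels, height, width) produced by each upsample block."""
    shapes = []
    for height, width, channels_out in blocks:
        # Upsample sets the spatial size; the two 3x3 convs then keep it unchanged.
        height = conv2d_out(conv2d_out(height))
        width = conv2d_out(conv2d_out(width))
        shapes.append((channels_out, height, width))
    return shapes


for index, block_shape in enumerate(trace_decoder()):
    print(f"up_sampled_block_{index}: {block_shape}")
```

Under these assumptions the last block yields 32 x 74 x 128, which agrees with the crop-sky comment in the quoted code.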

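Image input framework

To summarize:

The authors did not release their dataset, nor did they explain how they obtained the expert data used to train such a model; they simply provide a weights file for a partially trained model
Once you have the model, the whole pipeline is: receive the image, compute the ResNet-18 output, and split into two branches:
1. to the semantic decoder, which outputs out_seg
2. to the flatten into one dimension, followed by different Linear configurations that output state_output, dist_to_tl_output and delta_position_yaw_output respectively
Finally, the a+b of the second point are flattened together and passed to the DQN side through one Linear+ReLU layer, then go together with the received speed and steering through three Linear layers that finally output the action

[ ] Because of the first point, you really cannot tell from the code alone that this is used for RL; even the DQN merely concatenates the data and passes it through a few NoisyLayer modules, which are nn.Linear. So, hmm, it is hard to sum up → surprisingly, the reviewers did not question this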

3. Conclusion

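After reading the whole paper plus the code and then coming back to the conclusion, it feels a bit like the code does not back up the claims → because the second point, "using a value-based Rainbow-IQN-Apex training with an adapted reward", is not reflected in the code!

large conditional network: presumably this means flattening the ResNet-18 output first and then feeding it to the DQN
implicit affordance: out_seg and the lower part shown in figure two (traffic light, speed, yaw offset) are output separately, but in practice they mostly go to the DQN together as one flat vector anyway, so is the point just to have an intermediate result that is nice to show? Yet the paper does not show what this small intermediate output looks like either
future work: compare value-based, policy-based and actor-critic methods. So is that the reason for not releasing the value-based code? Perhaps the worry was that in such a competitive field, people would outdo them as soon as they saw it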

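Ramblings

Hmm, but I have to admit the authors' leaderboard submission is quite impressive: they only used one camera! The entries ranked ahead of them all max out the sensor count at four.

But I suspect it took a very long time to train, otherwise a year later they would surely have added their hard-task results to the repo by now, haha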
