
Model-based Reinforcement Learning: A Survey


Published: 2021
Key points: This is a survey organized around three themes: dynamics model learning, planning-learning integration, and implicit model-based RL. Dynamics model learning covers issues such as stochasticity, uncertainty, partial observability, non-stationarity, state abstraction, and temporal abstraction. The integration of planning and learning is mainly about how to combine the learned model with the task and use planning to solve it. The implicit approach to model-based RL is mainly about learning to plan, i.e., planning is not a fixed, hand-designed procedure but is itself learned through optimization. Finally, the paper also discusses the advantages of model-based RL, such as data efficiency, targeted exploration, stability, transfer, safety, and explainability.
The paper groups concrete methods into three categories: model-based RL with a learned model, model-based RL with a known model, and planning over a learned model.

Here, "planning over a learned model" means that once the model is learned there is only planning and no RL component, so some people do not count this category as model-based RL at all, since it is really just model-based planning.
As for the model itself, there are three types: the forward model, the backward/reverse model, and the inverse model.

The reverse and the inverse model are easy to confuse.
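To keep the three apart, here is a minimal sketch (my own illustration, not code from the survey) of the three mappings as plain Python signatures; `State` and `Action` are placeholder types invented for the example.

```python
from typing import Tuple

State = int   # placeholder types, purely for illustration
Action = int

def forward_model(s: State, a: Action) -> State:
    """Forward model: (s, a) -> s'. Predicts the next state; used for forward rollouts/planning."""
    raise NotImplementedError

def backward_model(s_next: State) -> Tuple[State, Action]:
    """Backward/reverse model: s' -> (s, a). Predicts which state-action pair leads into s',
    which is useful for propagating updates backwards from interesting states."""
    raise NotImplementedError

def inverse_model(s: State, s_next: State) -> Action:
    """Inverse model: (s, s') -> a. Predicts the action that takes s to s'."""
    raise NotImplementedError
```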
For model estimation, the authors distinguish parametric vs. non-parametric methods, and exact vs. approximate methods. Methods that assume a fixed functional form with a fixed number of parameters (e.g., linear regression) count as parametric, while methods whose capacity grows with the amount of data (e.g., Gaussian processes) count as non-parametric. "Exact" means the values are represented exactly, e.g., tabular/lookup methods or simply storing the whole replay buffer, whereas approximate methods, as the name suggests, are things like linear regression and neural networks.
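As a concrete contrast, here is a minimal sketch of the two extremes (my own illustration, not from the survey): an exact tabular model that just stores empirical transition counts, versus an approximate parametric model that fits a linear regression to the same transitions.

```python
import numpy as np
from collections import defaultdict

# --- Exact (tabular) model: store empirical transition counts per (s, a) ---
counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next] = N(s, a, s')

def update_tabular(s, a, s_next):
    counts[(s, a)][s_next] += 1

def tabular_prob(s, a, s_next):
    n = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / n if n > 0 else 0.0

# --- Approximate (parametric) model: least-squares fit s' ~ W [s; a; 1] ---
def fit_linear_model(S, A, S_next):
    """S: (N, ds), A: (N, da), S_next: (N, ds); returns W of shape (ds, ds + da + 1)."""
    X = np.hstack([S, A, np.ones((len(S), 1))])      # append a bias column
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)   # solve X @ W ~ S_next
    return W.T

def predict_linear(W, s, a):
    x = np.concatenate([s, a, [1.0]])
    return W @ x
```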
This then leads to a series of challenges in model learning, such as the region in which the model is valid, stochasticity, uncertainty, partial observability, non-stationarity, multi-step prediction, and state abstraction.
The next part covers how to actually use planning on top of the learned model.

This part answers four questions.

These are fairly common questions, and the approaches discussed for them are also quite standard.

I think the second question, how much planning budget to allocate versus real data collection, is well worth working on; the trade-off is interesting. It can further be split into two sub-questions: when to start planning, and how much time to spend on planning?
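To make the budget question concrete: in a Dyna-Q-style loop (a standard method for model-based RL with a learned model, used here purely as an illustration; the toy chain environment and all names below are my own assumptions), the split between real environment steps and simulated planning updates is literally a single hyperparameter.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 10, 9            # toy deterministic chain: move left/right, reward 1 at the goal
GAMMA, ALPHA, EPS = 0.95, 0.5, 0.1

def step(s, a):                   # a = 0 (left) or 1 (right)
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, float(s2 == GOAL)

Q = defaultdict(float)
model = {}                        # learned (deterministic, tabular) model: (s, a) -> (s', r)

def dyna_q(n_real_steps=2000, n_planning_steps=10):
    """n_planning_steps is the planning budget: simulated updates per real environment step."""
    s = 0
    for _ in range(n_real_steps):
        a = random.randrange(2) if random.random() < EPS else max((0, 1), key=lambda b: Q[(s, b)])
        s2, r = step(s, a)                                    # one real environment step
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        model[(s, a)] = (s2, r)                               # update the learned model
        for _ in range(n_planning_steps):                     # planning: replay imagined transitions
            ps, pa = random.choice(list(model.keys()))
            ps2, pr = model[(ps, pa)]
            Q[(ps, pa)] += ALPHA * (pr + GAMMA * max(Q[(ps2, 0)], Q[(ps2, 1)]) - Q[(ps, pa)])
        s = 0 if s2 == GOAL else s2

dyna_q()  # raising n_planning_steps trades extra computation for fewer real environment steps
```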
Finally, the Implicit Model-based Reinforcement Learning part puts forward an implicit-learning viewpoint: the whole problem can be viewed as a model-free method, and the individual modules are just implicit means of solving it that we do not need to distinguish ("In other words, the entire model based RL procedure (model learning, planning, and possibly integration in value/policy approximation) can from the outside be seen as a model-free RL problem").
This leads to implicit model-based RL. For example, value equivalent models treat the model as implicit/abstract: it does not matter what the model does internally, as long as the predicted values match ("forward dynamics might be complicated to learn, but the aspects of the dynamics that are relevant for value prediction might be much smoother and easier to learn"). The examples given in the paper are Value Iteration Networks (VIN) and Universal Planning Networks (UPN).
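A rough sketch of the value-equivalence idea as I read it (my own minimal illustration, not the VIN/UPN architectures; all sizes and names are assumptions): the encoder, latent dynamics, and value head are trained only through a value-prediction loss, so the latent dynamics never has to reconstruct the next observation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim = 8, 4, 32                  # assumed sizes, purely illustrative

encoder = nn.Linear(obs_dim, latent_dim)                 # h: observation -> abstract state
dynamics = nn.Linear(latent_dim + act_dim, latent_dim)   # g: (abstract state, action) -> next abstract state
value_head = nn.Linear(latent_dim, 1)                    # v: abstract state -> value

def value_equivalence_loss(obs, action_onehot, value_target):
    """Train h, g, v purely through value prediction; no observation-reconstruction loss."""
    z = encoder(obs)
    z_next = dynamics(torch.cat([z, action_onehot], dim=-1))
    v_pred = value_head(z_next).squeeze(-1)
    return ((v_pred - value_target) ** 2).mean()

# usage sketch: one gradient step on a dummy batch
params = [*encoder.parameters(), *dynamics.parameters(), *value_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
obs = torch.randn(16, obs_dim)
act = torch.eye(act_dim)[torch.randint(act_dim, (16,))]  # one-hot actions
target = torch.randn(16)                                 # stand-in for an n-step return / bootstrapped value
loss = value_equivalence_loss(obs, act, target)
opt.zero_grad(); loss.backward(); opt.step()
```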
Another example is learning to plan: the planning procedure itself is not a fixed, pre-specified algorithm such as MCTS, but is learned the same way a policy is ("The idea is to optimize our planner over a sequence of tasks to eventually obtain a better planning algorithm, which is a form of meta-learning"). The examples given are MCTSnets, Imagination-augmented agents (I2A), and the Imagination-based planner (IBP).
Finally, the two can be combined, learning the model and the planner jointly ("If we specify a parameterized differentiable model and a parameterized differentiable planning procedure, then we can optimize the resulting computational graph jointly for the model and the planning operations."). The examples given are TreeQN and the Deep Repeated ConvLSTM (DRC).
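A heavily reduced sketch of the joint idea, loosely in the spirit of TreeQN but simplified by me (one-step lookahead only; sizes and names are made up): a latent transition model is unrolled for every action, reward and value heads score each imagined successor, and because the lookahead is one differentiable graph, a single backward pass updates the "model" and the "planning" parameters together.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, latent_dim = 8, 3, 32    # assumed sizes, purely illustrative
gamma = 0.99

encode = nn.Linear(obs_dim, latent_dim)      # observation -> latent state
transition = nn.ModuleList([nn.Linear(latent_dim, latent_dim) for _ in range(n_actions)])
reward_head = nn.Linear(latent_dim, 1)
value_head = nn.Linear(latent_dim, 1)

def q_via_one_step_lookahead(obs):
    """Q(s, a) = r_a + gamma * V(z'_a), computed by unrolling the latent model for
    every action; the whole lookahead is differentiable end to end."""
    z = encode(obs)                                          # (B, latent_dim)
    qs = []
    for a in range(n_actions):
        z_next = torch.relu(transition[a](z))                # imagined next latent state
        qs.append(reward_head(z_next) + gamma * value_head(z_next))
    return torch.cat(qs, dim=-1)                             # (B, n_actions)

# one joint gradient step on a dummy Q-learning-style target
obs = torch.randn(16, obs_dim)
target_q = torch.randn(16, n_actions)                        # stand-in for bootstrapped targets
params = [*encode.parameters(), *transition.parameters(),
          *reward_head.parameters(), *value_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
loss = ((q_via_one_step_lookahead(obs) - target_q) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()                 # model and lookahead trained jointly
```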
At the very end, the paper goes over the benefits of model-based RL,

as well as some drawbacks, such as the additional computation required and potential instability due to uncertainty and approximation errors in the model.
Summary:

A very recent survey that covers essentially all the main directions. One inspiring point is using the model for exploration, for discovering options or subgoals; this seems like a good idea, since exploring inside the model is safe. Moreover, planning can be viewed as a form of deep exploration, whereas something like ϵ-greedy is only local exploration, and deep exploration clearly has the advantage in some situations ("Planning may identify temporally correlated action sequences that perform deep exploration towards new reward regions, which local exploration methods would fail to identify due to jittering behaviour").
The question "How much planning budget do we allocate for planning and real data collection?" is also worth working on.
Questions:
The survey counts Gaussian processes as a non-parametric method; I need to look into exactly how parametric and non-parametric methods are distinguished.