Reinforcement Learning, Spike-Time-Dependent Plasticity, and the BCM Rule
Neural Computation, (2007): 2245-2279
Abstract
Learning agents, whether natural or artificial, must update their internal parameters in order to improve their behavior over time. In reinforcement learning, this plasticity is influenced by an environmental signal, termed a reward, that directs the changes in appropriate directions. We apply a recently introduced policy-learning algorithm from machine learning to networks of spiking neurons and derive a spike-time-dependent plasticity rule that ensures convergence to a local optimum of the expected average reward. The approach is applicable to a broad class of neuronal models, including the Hodgkin-Huxley model. We demonstrate the effectiveness of the derived rule on several toy problems. Finally, through statistical analysis, we show that the synaptic plasticity rule established is closely related to the widely used BCM rule, for which there is considerable biological evidence.
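The kind of rule the abstract describes can be illustrated in miniature: each synapse accumulates an STDP-like eligibility trace from pre/post spike timing, and a global reward signal gates whether that trace is actually written into the weight. The sketch below is a generic reward-modulated STDP update, not the paper's exact derivation; all function and parameter names (`stdp_eligibility`, `tau`, `a_plus`, `a_minus`, `baseline`) are illustrative assumptions.

```python
import numpy as np

def stdp_eligibility(pre_spikes, post_spikes, tau=20.0,
                     a_plus=0.01, a_minus=0.012):
    """Pair-based STDP eligibility trace (illustrative, not the paper's rule).

    pre_spikes, post_spikes: arrays of spike times (ms).
    Pre-before-post pairs contribute positively (potentiation),
    post-before-pre pairs negatively (depression), with exponential
    decay of the contribution in the time difference.
    """
    e = 0.0
    for t_post in post_spikes:
        for t_pre in pre_spikes:
            dt = t_post - t_pre
            if dt > 0:
                e += a_plus * np.exp(-dt / tau)
            elif dt < 0:
                e -= a_minus * np.exp(dt / tau)
    return e

def reward_modulated_update(w, eligibility, reward, baseline=0.0, lr=0.1):
    """Policy-gradient-style step: the reward (minus a baseline)
    multiplies the eligibility trace to give the weight change."""
    return w + lr * (reward - baseline) * eligibility

# A presynaptic spike 5 ms before a postsynaptic spike yields a
# positive trace; a positive reward then potentiates the synapse.
e = stdp_eligibility(np.array([10.0]), np.array([15.0]))
w_new = reward_modulated_update(0.5, e, reward=1.0)
```

With a negative reward the same trace would depress the synapse, which is the sense in which the reward "directs the changes in appropriate directions."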
1 Policy Learning and Neuronal Dynamics
2 Derivation of the Weight Update
2.1 Two Explicit Choices for α.
3 Extensions to General Neuronal Models
Algorithm 1: Synaptic Update Rule for a Generalized Neuronal Model
3.1 Explicit Calculation of the Update Rules for Different α Functions.
3.1.1 Demonstration for α(s) = qδ(s).
3.1.2 Demonstration for a Decaying Exponential α(s).
3.2 Depressing Synapses.
4 Simulation Results
5 Relation to the BCM Rule
6 Discussion
Appendix A: Computing Expectations
A.1 Expectation with Respect to the Postsynaptic Spike Train.
A.2 Expectation with Respect to the Presynaptic Spike Train.
Appendix B: Simulation Details
Appendix C: Technical Derivations
C.1 Decaying Exponential α Function.
C.2 Depressing Synapses.
Case 1: No Presynaptic Spike Occurred Since the Last Postsynaptic Spike.
Case 2: At Least One Presynaptic Spike Occurred Since the Last Postsynaptic Spike.
C.3 MDPs and POMDPs
C.3.1 MDPs.