
A3C — What It Is & What I Built

The 3 A’s of A3C

Actor-Critic

The basic actor-critic model stems from deep convolutional Q-learning, where the agent still performs Q-learning, but instead of taking in a matrix of state values as input, it takes in raw images and feeds them through a deep convolutional neural network.

Credits to SDS

Don’t worry about the rectangles on the right side; they represent a deep neural network with all of its nodes and connections. It’s just easier to explain and understand A3C this way.

In a regular deep convolutional Q-learning network, there would only be one output: the Q-values of the different actions. In A3C, however, there are two outputs: one for the Q-values of the different actions (the actor) and one for the value of the state the agent is currently in (the critic).
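To make that concrete, here’s a minimal sketch of a network with two output heads in PyTorch. The layer sizes and names are placeholders just for illustration; the actual model I used is shown further down.

```python
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super().__init__()
        self.body = nn.Linear(num_inputs, 128)    # shared features
        self.actor = nn.Linear(128, num_actions)  # one score per action
        self.critic = nn.Linear(128, 1)           # value of the current state

    def forward(self, state):
        features = torch.relu(self.body(state))
        return self.actor(features), self.critic(features)
```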

Asynchronous

Think about the saying, “Two heads are better than one.” The whole idea behind it is that two people working together are more likely to solve a problem faster and better than one person working alone.

Credits to SDS

The “Asynchronous” in A3C represents this exact idea, applied to agents. Instead of a single agent trying to solve the problem, there are multiple agents working on it in parallel and sharing information with each other about what they’ve learned.

There’s also the added benefit that if one agent gets stuck repeating a suboptimal action over and over because it thinks that’s the best way to get a reward, the other agents can share their experience and show it that there is a better way to solve the problem!

Credits to SDS

So now there are multiple agents training side by side to solve the problem they were given.

But what happened to the sharing-experience part? They share experience by pooling all of their individual critic values into one big shared critic. By sharing this experience, they can see which states have high rewards and what the other agents have already explored in the environment.

That’s the bare bones of A3C, but we’re going to go a step further with a version tweaked by the creator of PyTorch, one of the best deep learning libraries out there. Instead of each agent having its own separate network, there is only one neural network that all the agents share. This means all the agents now work with common weights, which makes training easier.

Advantage

Advantage = Q(s, a) - V(s)

This is the equation the last A is based on. It says that we want to know the difference between the Q-value of the action we take and the known value of the state we’re in.

How much better is the q-value than the known value?

The whole point is to keep finding actions with better Q-values so the agent collects more and more reward. When the advantage is high, the Q-value is a lot better than the known value.

The neural network then wants to reinforce this behaviour and updates its weights so the agent keeps repeating these actions. If the advantage is low, the network instead updates its weights to make these actions less likely to occur again.
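As a rough sketch of how this plays out in code, here is the advantage for a single step with made-up numbers, and the kind of actor and critic loss terms an A3C-style update builds from it. The values and names are purely illustrative, not taken from my training loop.

```python
import torch

q_value = torch.tensor(7.0)  # estimated return for the action the agent took
state_value = torch.tensor(5.0, requires_grad=True)  # critic's value of the state

advantage = q_value - state_value  # Advantage = Q(s, a) - V(s) -> 2.0 here

# Log-probability of the chosen action, as output by the actor.
log_prob = torch.tensor(-0.5, requires_grad=True)

# A positive advantage pushes the actor to repeat the action (loss goes down as
# the action becomes more likely); a negative advantage pushes it away.
actor_loss = -(log_prob * advantage.detach())

# The critic is trained to predict the return, so its error is the advantage.
critic_loss = advantage.pow(2)
```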

One More Step Needed — Memory!

This is a screenshot taken directly from one of my agent’s training videos. I want to ask you: can you tell which direction the ball is going?

Is it moving left, right, up, or down? From a single frame, you literally cannot tell. There are ways the regular A3C algorithm tackles this problem (such as stacking several recent frames together), but for my implementation I used a Long Short-Term Memory network (LSTM). The LSTM is what helps the neural network remember where the ball was in past frames, so the agent can decide where to move the paddle for the ball to bounce off of it.
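Here is a tiny, standalone sketch of that idea. The sizes and random inputs are placeholders, not my actual layer sizes; the point is just that the hidden state (hx, cx) carries information from one frame to the next.

```python
import torch
import torch.nn as nn

# An LSTM cell keeps a hidden state (hx, cx) that acts as the network's memory.
lstm = nn.LSTMCell(input_size=32, hidden_size=64)

hx = torch.zeros(1, 64)  # hidden state, reset to zeros at the start of an episode
cx = torch.zeros(1, 64)  # cell state

for _ in range(4):                       # pretend these are 4 consecutive frames
    frame_features = torch.randn(1, 32)  # stand-in for the conv net's output
    hx, cx = lstm(frame_features, (hx, cx))

# hx now encodes not just the latest frame but the frames before it,
# which is what lets the agent tell which way the ball is moving.
```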

My Demo

This is the best result I got training on my laptop with this algorithm! Even though it’s nothing crazy like Boston Dynamics training robots to do parkour, it blows my mind every day that this agent went from knowing absolutely nothing about the game to actually playing it! Imagine what we could do in the future with reinforcement learning algorithms!

Even now, I’m most excited about reinforcement learning for drug design and nanotech design. What applications could arise in the future?

Here’s a bit of the code to show what some of the more interesting parts of the algorithm are and how they’re implemented!

This is the initialization function of the ActorCritic model. It’s where the convolutional neural network, the LSTM cell, and the connections to the actor and critic outputs are created!
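Here’s a sketch of what that looks like, assuming small preprocessed 42x42 frames; the layer sizes are typical of A3C implementations for Atari-style games rather than anything exotic.

```python
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super().__init__()
        # Convolutional layers that turn raw game frames into features.
        self.conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        # The LSTM cell that gives the agent its memory of past frames.
        self.lstm = nn.LSTMCell(32 * 3 * 3, 256)
        # The two heads: the critic scores the state, the actor scores the actions.
        self.critic_linear = nn.Linear(256, 1)
        self.actor_linear = nn.Linear(256, num_actions)
```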

This next function is also part of the ActorCritic class, and it’s where we forward propagate through the neural network and get the outputs of the actor and the critic!
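Continuing the ActorCritic sketch above, the forward pass looks roughly like this: the frame goes through the conv layers, the flattened features go through the LSTM cell, and the two heads read off the LSTM’s hidden state.

```python
    def forward(self, inputs):
        x, (hx, cx) = inputs              # current frame plus the LSTM's memory
        x = F.elu(self.conv1(x))
        x = F.elu(self.conv2(x))
        x = F.elu(self.conv3(x))
        x = F.elu(self.conv4(x))
        x = x.view(x.size(0), -1)         # flatten the conv features
        hx, cx = self.lstm(x, (hx, cx))   # update the memory with this frame
        # Critic output, actor output, and the updated memory for the next step.
        return self.critic_linear(hx), self.actor_linear(hx), (hx, cx)
```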

The share_memory function is what puts the shared model’s weights into memory that every agent process can access, so the agents can all learn from each other’s experiences!
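Roughly, here’s how that shared model gets set up and handed to each agent process, using the ActorCritic class sketched above. The train function and the numbers (4 agents, 6 actions) are simplified stand-ins, not my exact worker code.

```python
import torch.multiprocessing as mp

def train(rank, shared_model):
    # Simplified stand-in for a worker: each process would build its own local
    # ActorCritic, sync its weights from shared_model, play the game, and push
    # its gradients back into shared_model.
    pass

if __name__ == '__main__':
    shared_model = ActorCritic(num_inputs=1, num_actions=6)
    # share_memory() moves the model's parameters into shared memory, so every
    # agent process reads from and updates the same set of weights.
    shared_model.share_memory()

    processes = []
    for rank in range(4):  # four agents training side by side
        p = mp.Process(target=train, args=(rank, shared_model))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```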