Balancing CartPole with Machine Learning
Learn how to balance a CartPole using machine learning in this article by Sean Saito, the youngest ever Machine Learning Developer at SAP and the first bachelor hire for the position. He currently researches and develops machine learning algorithms that automate financial processes.
This article will show you how to solve the CartPole balancing problem. The CartPole is an inverted pendulum, where the pole is balanced against gravity. Traditionally, this problem is solved by control theory, using analytical equations. However, in this article, you'll learn to solve the problem with machine learning.
OpenAI Gym
OpenAI is a non-profit organization dedicated to researching artificial intelligence, and the technologies developed by OpenAI are free for anyone to use.
Gym
Gym provides a toolkit to benchmark AI-based tasks through an easy-to-use interface, with the goal of enabling reproducible research. An agent can be trained inside Gym and can learn activities such as playing games or walking. Each problem is packaged as an environment, and Gym ships with a library of them.
The standard set of problems presented in the gym is as follows:
- CartPole
- Pendulum
- Space Invaders
- Lunar Lander
- Ant
- Mountain Car
- Acrobot
- Car Racing
- Bipedal Walker
Any algorithm can be benchmarked in Gym by training on these tasks. All of the problems expose the same interface, so any general reinforcement learning algorithm can be applied through it.
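As a preview of that shared interface (the next sections walk through it step by step), every environment is driven by the same reset and step calls; the following sketch uses CartPole-v0 purely as an example:
import gym

environment = gym.make('CartPole-v0')    # any registered environment id works here
observation = environment.reset()        # start a new episode
for _ in range(50):
    action = environment.action_space.sample()                   # pick a random action
    observation, reward, done, info = environment.step(action)   # advance one time step
    if done:                                                      # the episode ended; start again
        observation = environment.reset()
environment.close()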
Installing Gym
Gym's primary interface is Python. Once you have Python 3 in an environment with the pip installer, Gym can be installed as follows:
sudo pip install gym
Advanced users who want to modify the source can compile from the source using the following commands:
git clone https://github.com/openai/gym
cd gym
pip install -e .
Installing from source also lets you add new environments to Gym. Several environments need additional dependencies. For macOS, install the dependencies using the following command:
brew install cmake boost boost-python sdl2 swig wget
For Ubuntu, use the following commands:
apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig
Once the dependencies are present, install the complete gym as follows:
pip install 'gym[all]'
This will install most of the environments that are required.
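As a quick sanity check (an illustrative snippet, not part of the original walkthrough), you can print the installed version and instantiate an environment:
import gym

print(gym.__version__)                   # the installed Gym version
environment = gym.make('CartPole-v0')    # fails with an error if the installation is broken
print(environment.action_space)          # Discrete(2) for CartPole
print(environment.observation_space)     # a four-dimensional Box for CartPole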
Running an environment
Any Gym environment can be initialized and run using a simple interface:
- First, import the gym library:
import gym
- Next, create an environment by passing an environment id to gym.make. In the following code, CartPole is used as an example:
environment = gym.make('CartPole-v0')
- Next, reset the environment:
environment.reset()
- Then, start an iteration and render the environment:
for dummy in range(100):
    environment.render()
    environment.step(environment.action_space.sample())
A random action is sampled from the action space at every step, so you can see the CartPole moving. Running the preceding program should produce a visualization that starts as follows:
The preceding image shows the CartPole. The CartPole consists of a cart that can move horizontally and a pole that is pivoted to the center of the cart and can rotate around that pivot. After some time, you will notice the pole falling to one side, as shown in the following image:
After a few more iterations, the pole will swing back, as shown in the following image. All movements are constrained by the laws of physics. The steps are taken randomly:
Other environments can be run in a similar way, by replacing the argument passed to gym.make, for example with MsPacman-v0 or MountainCar-v0, as in the sketch below.
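For instance, the same snippet from before runs unchanged with a different environment id (MsPacman-v0 also needs the extra dependencies installed by gym[all]):
environment = gym.make('MountainCar-v0')   # or 'MsPacman-v0'
environment.reset()
for dummy in range(100):
    environment.render()
    environment.step(environment.action_space.sample())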
Markov models
The problem is set up as a reinforcement learning problem, using a trial and error method. The environment is described using state_values, and the state_values are changed by actions. The actions are determined by an algorithm, based on the current state_value, in order to reach a particular target state_value; such a model is termed a Markov model.
In the general case, past state_values do have an influence on future state_values, but here, you assume that the current state_value encodes all of the information from previous state_values. There are two types of state_values: observable and non-observable. When the model has to take non-observable state_values into account as well, it is called a Hidden Markov model.
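To make the Markov property concrete, here is a toy sketch (the states and transition probabilities are invented purely for illustration) in which the next state depends only on the current state:
import random

# transition probabilities depend only on the current state, not on the history
transitions = {
    'pole_left':  {'pole_left': 0.7, 'pole_right': 0.3},
    'pole_right': {'pole_left': 0.4, 'pole_right': 0.6},
}

def next_state(current_state):
    states = list(transitions[current_state].keys())
    probabilities = list(transitions[current_state].values())
    return random.choices(states, weights=probabilities)[0]

state = 'pole_left'
for _ in range(5):
    state = next_state(state)
    print(state)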
CartPole
At each step of the cart and pole, several variables can be observed, such as the position, velocity, angle, and angular velocity. The possible actions are to push the cart to the right or to the left:
- state_values: Four dimensions of continuous values.
- Actions: Two discrete values.
- The dimensions, or spaces, are referred to as the state_value space and the action space. Start by importing the required libraries, as follows:
import gym
import numpy as np
import random
import math
- Next, make the environment for playing CartPole, as follows:
environment = gym.make('CartPole-v0')
- Define the number of buckets and the number of actions, as follows:
no_buckets = (1, 1, 6, 3)                 # buckets per observation dimension; position and velocity get a single bucket
no_actions = environment.action_space.n   # two discrete actions for CartPole
- Define the state_value_bounds, as follows:
state_value_bounds = list(zip(environment.observation_space.low, environment.observation_space.high))
state_value_bounds[1] = [-0.5, 0.5]                             # clip the cart velocity range
state_value_bounds[3] = [-math.radians(50), math.radians(50)]   # clip the pole's angular velocity range
- Next, define the action_index, as follows:
action_index = len(no_buckets)
- Now, define the q_value_table, as follows:
q_value_table = np.zeros(no_buckets + (no_actions,))
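With these settings the table holds one value per discretized state and action; a quick check of its shape (added here for illustration) confirms this:
print(q_value_table.shape)   # (1, 1, 6, 3, 2): the four bucket counts plus the two actions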
- Define the minimum exploration rate and the minimum learning rate:
min_explore_rate = 0.01
min_learning_rate = 0.1
- Define the maximum episodes, the maximum time steps, the streak to the end, the solving time, the discount, and the number of streaks, as constants:
max_episodes = 1000
max_time_steps = 250
streak_to_end = 120
solved_time = 199
discount = 0.99
no_streaks = 0
- Define the select_action function, which decides the action, as follows:
def select_action(state_value, explore_rate):
    # epsilon-greedy: explore with probability explore_rate, otherwise exploit the Q-table
    if random.random() < explore_rate:
        action = environment.action_space.sample()
    else:
        action = np.argmax(q_value_table[state_value])
    return action
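For instance (a hypothetical call, assuming the Q-table defined above), passing explore_rate=1.0 always samples a random action, while explore_rate=0.0 always exploits the table:
sample_state = (0, 0, 3, 1)                 # a bucketized state, chosen just for illustration
print(select_action(sample_state, 1.0))     # random action: 0 or 1
print(select_action(sample_state, 0.0))     # greedy action from q_value_table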
- Now, define the select_explore_rate function, as follows:
def select_explore_rate(x):
    return max(min_explore_rate, min(1, 1.0 - math.log10((x+1)/25)))
- Similarly, define the select_learning_rate function, as follows:
def select_learning_rate(x):
    return max(min_learning_rate, min(0.5, 1.0 - math.log10((x+1)/25)))
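Both schedules start at their maximum and decay logarithmically with the episode number, never dropping below the minimums defined earlier. Printing a few sample values (illustrative only) shows the decay:
for episode in (0, 24, 100, 500):
    print(episode, select_explore_rate(episode), select_learning_rate(episode))
# the explore rate stays at 1.0 until around episode 24 and then decays towards 0.01;
# the learning rate is capped at 0.5 and decays towards 0.1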
- Next, bucketize the state_value, as follows:
def bucketize_state_value(state_value):
    # map a continuous observation to a tuple of discrete bucket indexes
    bucket_indexes = []
    for i in range(len(state_value)):
        if state_value[i] <= state_value_bounds[i][0]:
            bucket_index = 0
        elif state_value[i] >= state_value_bounds[i][1]:
            bucket_index = no_buckets[i] - 1
        else:
            bound_width = state_value_bounds[i][1] - state_value_bounds[i][0]
            offset = (no_buckets[i]-1)*state_value_bounds[i][0]/bound_width
            scaling = (no_buckets[i]-1)/bound_width
            bucket_index = int(round(scaling*state_value[i] - offset))
        bucket_indexes.append(bucket_index)
    return tuple(bucket_indexes)
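As an illustrative check, a raw observation from the environment maps to a tuple of bucket indexes that can be used to index q_value_table directly:
observation = environment.reset()
print(observation)                           # four continuous values
print(bucketize_state_value(observation))    # a tuple such as (0, 0, 3, 1)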
- Train the agent over the episodes, as follows:
for episode_no in range(max_episodes):
    # both rates decay as training progresses
    explore_rate = select_explore_rate(episode_no)
    learning_rate = select_learning_rate(episode_no)

    observation = environment.reset()
    start_state_value = bucketize_state_value(observation)
    previous_state_value = start_state_value

    for time_step in range(max_time_steps):
        environment.render()
        selected_action = select_action(previous_state_value, explore_rate)
        observation, reward_gain, completed, _ = environment.step(selected_action)
        state_value = bucketize_state_value(observation)
        best_q_value = np.amax(q_value_table[state_value])
        # Q-learning update for the previous state and the action taken there
        q_value_table[previous_state_value + (selected_action,)] += learning_rate * (
            reward_gain + discount * best_q_value -
            q_value_table[previous_state_value + (selected_action,)])
- Print all relevant metrics for the training process, as follows:
        print('Episode number : %d' % episode_no)
        print('Time step : %d' % time_step)
        print('Selected action : %d' % selected_action)
        print('Current state : %s' % str(state_value))
        print('Reward obtained : %f' % reward_gain)
        print('Best Q value : %f' % best_q_value)
        print('Learning rate : %f' % learning_rate)
        print('Explore rate : %f' % explore_rate)
        print('Streak number : %d' % no_streaks)

        if completed:
            print('Episode %d finished after %f time steps' % (episode_no, time_step))
            # count consecutive episodes that lasted long enough to count as solved
            if time_step >= solved_time:
                no_streaks += 1
            else:
                no_streaks = 0
            break

        previous_state_value = state_value

    # stop once the pole has been balanced for enough consecutive episodes
    if no_streaks > streak_to_end:
        break
- After training for a period of time, the CartPole will be able to balance itself, as shown in the following image:
You have successfully written a program that learns to stabilize the CartPole using a trial and error approach.
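If you want to watch the learned behaviour without any further exploration, a short evaluation loop (an illustrative addition, not part of the original listing) can reuse the trained q_value_table greedily:
observation = environment.reset()
state_value = bucketize_state_value(observation)
for time_step in range(max_time_steps):
    environment.render()
    action = np.argmax(q_value_table[state_value])    # always pick the best known action
    observation, reward_gain, completed, _ = environment.step(action)
    state_value = bucketize_state_value(observation)
    if completed:
        break
environment.close()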
If you found this article interesting, you can explore Python Reinforcement Learning Projects to implement state-of-the-art deep reinforcement learning algorithms using Python and its powerful libraries. Python Reinforcement Learning Projects will give you hands-on experience with eight reinforcement learning projects, each addressing a different topic and/or algorithm.