
Balancing CartPole with Machine Learning

Learn how to balance a CartPole using machine learning in this article by Sean Saito, the youngest ever Machine Learning Developer at SAP and the first bachelor hire for the position. He currently researches and develops machine learning algorithms that automate financial processes.

This article will show you how to solve the CartPole balancing problem. The CartPole is an inverted pendulum, where the pole is balanced against gravity. Traditionally, this problem is solved by control theory, using analytical equations. However, in this article, you'll learn to solve the problem with machine learning.

OpenAI Gym

OpenAI is a non-profit organization dedicated to researching artificial intelligence, and the technologies developed by OpenAI are free for anyone to use.

Gym

Gym provides a toolkit to benchmark AI-based tasks. The interface is easy to use, and the goal is to enable reproducible research. An agent can be taught inside the gym, and it can learn activities such as playing games or walking. The gym is a library of such problems, each packaged as an environment.

The standard set of problems presented in the gym is as follows:

  • CartPole
  • Pendulum
  • Space Invaders
  • Lunar Lander
  • Ant
  • Mountain Car
  • Acrobot
  • Car Racing
  • Bipedal Walker

Any algorithm can be tried out in the gym by training it on these tasks. All of the problems have the same interface, so any general reinforcement learning algorithm can be used through it.
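As an illustration, here is a minimal sketch of that common interface, using CartPole-v0 as a stand-in for any environment (it assumes the classic gym API used throughout this article, where step returns four values):

import gym

environment = gym.make('CartPole-v0')   # any other environment id works the same way
observation = environment.reset()       # returns the initial observation
completed = False
while not completed:
    action = environment.action_space.sample()                       # pick a random action
    observation, reward, completed, info = environment.step(action)  # advance the simulation by one step
environment.close()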

Installing Gym

The primary interface of the gym is used through Python. Once you have Python 3 in an environment with the pip installer, the gym can be installed as follows:

sudo pip install gym

Advanced users who want to modify the source can compile from the source using the following commands:

git clone https://github.com/openai/gym
cd gym
pip install -e .

With the source code, new environments can also be added to the gym. There are several environments that need more dependencies. For macOS, install the dependencies using the following command:

brew install cmake boost boost-python sdl2 swig wget

For Ubuntu, use the following commands:

apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig

Once the dependencies are present, install the complete gym as follows:

pip install 'gym[all]'

This will install most of the environments that are required.
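As a quick sanity check (a sketch, not part of the original walkthrough), you can confirm that the installation works by creating an environment and printing its spaces:

import gym

print(gym.__version__)                  # the installed gym version
environment = gym.make('CartPole-v0')   # should construct without errors
print(environment.observation_space)
print(environment.action_space)
environment.close()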

Running an environment

Any gym environment can be initialized and run using a simple interface:

  1. First, import the gym library:

import gym

  2. Next, create an environment by passing an argument to make. In the following code, CartPole is used as an example:

environment = gym.make('CartPole-v0')

  3. Next, reset the environment:

environment.reset()

  4. Then, start an iteration and render the environment:

for dummy in range(100):
    environment.render()
    environment.step(environment.action_space.sample())

A random action is sampled from the action space at every step, which is what makes the CartPole move. Running the preceding program should produce a visualization. The scene should start as follows:

 

The preceding image shows a CartPole. The CartPole is made up of a cart that can move horizontally and a pole that can rotate about the center of the cart. The pole is pivoted to the cart. After some time, you will notice that the pole is falling to one side, as shown in the following image:

 

After a few more iterations, the pole will swing back, as shown in the following image. All movements are constrained by the laws of physics, and the steps are taken randomly:

 

Other environments can be explored in a similar way, by replacing the argument of gym.make, such as MsPacman-v0 or MountainCar-v0.
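For example, the following sketch prints the observation and action spaces of two other built-in environments; only the id passed to gym.make changes (Atari environments such as MsPacman-v0 additionally require the extra dependencies installed earlier):

import gym

for environment_id in ['MountainCar-v0', 'Acrobot-v1']:
    environment = gym.make(environment_id)
    environment.reset()
    print(environment_id, environment.observation_space, environment.action_space)
    environment.close()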

Markov models

The problem is set up as a reinforcement learning problem, solved with a trial and error method. The environment is described using state_values, and the state_values are changed by actions. The actions are determined by an algorithm, based on the current state_value, in order to reach a particular target state_value; this setup is termed a Markov model.

In general, past state_values can influence future state_values, but here, you assume that the current state_value encodes all of the relevant information from the previous state_values. There are two types of state_values: observable and non-observable. A model that also has to take non-observable state_values into account is called a Hidden Markov model.
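As a toy illustration of the Markov property (the two-state chain below is an assumption made up for this example, not part of the CartPole code), the next state depends only on the current state, never on the full history:

import numpy as np

# Hypothetical 2-state Markov chain: row i holds P(next state | current state = i).
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])

rng = np.random.default_rng(0)
state = 0
for _ in range(5):
    state = rng.choice(2, p=transition[state])  # depends only on the current state
    print(state)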

CartPole

At each step, several variables of the cart and pole can be observed, such as the position, velocity, angle, and angular velocity. The possible actions are to push the cart to the left or to the right:

  • state_values: four dimensions of continuous values.
  • Actions: two discrete values.

The dimensions, or spaces, are referred to as the state_value space and the action space.

  1. Start by importing the required libraries, as follows:

import gym
import numpy as np
import random
import math

  2. Next, make the environment for playing CartPole, as follows:

environment = gym.make('CartPole-v0')
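Optionally, you can confirm the state_value and action spaces described above directly from the environment (the exact printed format depends on the gym version):

print(environment.observation_space)  # four continuous values: cart position, cart velocity, pole angle, pole angular velocity
print(environment.action_space)       # Discrete(2): push the cart to the left or to the right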

  3. Define the number of buckets and the number of actions, as follows:

no_buckets = (1, 1, 6, 3)
no_actions = environment.action_space.n

  4. Define the state_value_bounds, as follows:

state_value_bounds = list(zip(environment.observation_space.low, environment.observation_space.high))
state_value_bounds[1] = [-0.5, 0.5]
state_value_bounds[3] = [-math.radians(50), math.radians(50)]
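The second and fourth bounds are overridden because the raw limits reported by the environment for the cart velocity and the pole's angular velocity are effectively unbounded, which would make bucketing meaningless. You can inspect the raw limits as follows:

print(environment.observation_space.low)   # raw lower bounds reported by the environment
print(environment.observation_space.high)  # raw upper bounds reported by the environment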

  5. Next, define the action_index, as follows:

action_index = len(no_buckets)

  6. Now, define the q_value_table, as follows:

q_value_table = np.zeros(no_buckets + (no_actions,))
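The table holds one value per discretized state and action pair, so with the bucket sizes chosen above its shape works out to (1, 1, 6, 3, 2). A quick check:

print(q_value_table.shape)  # (1, 1, 6, 3, 2) with the settings above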

  7. Define the minimum exploration rate and the minimum learning rate:

min_explore_rate = 0.01
min_learning_rate = 0.1

  8. Define the maximum episodes, the maximum time steps, the streak to the end, the solving time, the discount, and the number of streaks, as constants:

max_episodes = 1000
max_time_steps = 250
streak_to_end = 120
solved_time = 199
discount = 0.99
no_streaks = 0

  9. Define the select_action function that decides the action, as follows:

def select_action(state_value, explore_rate):
    if random.random() < explore_rate:
        action = environment.action_space.sample()
    else:
        action = np.argmax(q_value_table[state_value])
    return action

  10. Now, select the explore rate, as follows:

def select_explore_rate(x):
    return max(min_explore_rate, min(1, 1.0 - math.log10((x+1)/25)))

  11. Select the learning rate, as follows:

def select_learning_rate(x):
    return max(min_learning_rate, min(0.5, 1.0 - math.log10((x+1)/25)))
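Both schedules decay logarithmically from their maximum toward their minimum value as the episode number grows, so early episodes explore and learn aggressively while later episodes mostly exploit. A small sketch to see the decay:

for episode_no in [0, 50, 200, 800]:
    print(episode_no, select_explore_rate(episode_no), select_learning_rate(episode_no))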

  12. Next, bucketize the state_value, as follows:

def bucketize_state_value(state_value):
    bucket_indexes = []
    for i in range(len(state_value)):
        if state_value[i] <= state_value_bounds[i][0]:
            bucket_index = 0
        elif state_value[i] >= state_value_bounds[i][1]:
            bucket_index = no_buckets[i] - 1
        else:
            bound_width = state_value_bounds[i][1] - state_value_bounds[i][0]
            offset = (no_buckets[i]-1)*state_value_bounds[i][0]/bound_width
            scaling = (no_buckets[i]-1)/bound_width
            bucket_index = int(round(scaling*state_value[i] - offset))
        bucket_indexes.append(bucket_index)
    return tuple(bucket_indexes)
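As a usage sketch, resetting the environment and bucketizing the returned observation should give a tuple of four bucket indexes (the exact values depend on the random initial state):

observation = environment.reset()
print(observation)                          # four continuous values
print(bucketize_state_value(observation))   # a tuple of bucket indexes, for example (0, 0, 2, 1)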

  13. Train the episodes, as follows:

for episode_no in range(max_episodes):
    explore_rate = select_explore_rate(episode_no)
    learning_rate = select_learning_rate(episode_no)

    observation = environment.reset()
    start_state_value = bucketize_state_value(observation)
    previous_state_value = start_state_value

    for time_step in range(max_time_steps):
        environment.render()
        selected_action = select_action(previous_state_value, explore_rate)
        observation, reward_gain, completed, _ = environment.step(selected_action)
        state_value = bucketize_state_value(observation)
        best_q_value = np.amax(q_value_table[state_value])

        # Q-learning update for the (previous state, selected action) entry
        q_value_table[previous_state_value + (selected_action,)] += learning_rate * (
            reward_gain + discount * best_q_value -
            q_value_table[previous_state_value + (selected_action,)])

  14. Print all relevant metrics for the training process; this code continues inside the inner time-step loop, as follows:

        print('Episode number : %d' % episode_no)
        print('Time step : %d' % time_step)
        print('Selected action : %d' % selected_action)
        print('Current state : %s' % str(state_value))
        print('Reward obtained : %f' % reward_gain)
        print('Best Q value : %f' % best_q_value)
        print('Learning rate : %f' % learning_rate)
        print('Explore rate : %f' % explore_rate)
        print('Streak number : %d' % no_streaks)

        if completed:
            print('Episode %d finished after %f time steps' % (episode_no, time_step))
            if time_step >= solved_time:
                no_streaks += 1
            else:
                no_streaks = 0
            break

        previous_state_value = state_value

    # Stop training once the pole has been balanced for enough consecutive episodes.
    if no_streaks > streak_to_end:
        break
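Once training stops, you can watch the learned policy act greedily (a sketch that assumes the variables defined above are still in scope); exploration is switched off and the best action from the q_value_table is taken at every step:

observation = environment.reset()
state_value = bucketize_state_value(observation)
for time_step in range(max_time_steps):
    environment.render()
    selected_action = np.argmax(q_value_table[state_value])   # greedy action, no exploration
    observation, reward_gain, completed, _ = environment.step(selected_action)
    state_value = bucketize_state_value(observation)
    if completed:
        break
environment.close()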

  15. After training for a period of time, the CartPole will be able to balance itself, as shown in the following image:

 

You have successfully written a program that stabilizes the CartPole using a trial and error approach.

If you found this article interesting, you can explore Python Reinforcement Learning Projects to implement state-of-the-art deep reinforcement learning algorithms using Python and its powerful libraries. Python Reinforcement Learning Projects will give you hands-on experience with eight reinforcement learning projects, each addressing different topics and/or algorithms.