Python Implementations of Some Reinforcement Learning Algorithms

Note: the Python code described in this section is less general and slower than the C++ code below. However, Python is a versatile and easy language to use, so I think having two versions can be useful.

For the Double Q-learning paper, I've uploaded some supplementary material, include a Python implementation of the reinforcement learning algorithms Q-learning, Double Q-learning, QV-learning, Sarsa and Expected Sarsa. You can get the code here. This code is part of a larger Python implementation that I have made and that I might upload sometime soon. The code is not really documented, but should be fairly self-explicatory.

When you unpack the supplements file, the algorithms and the used domains (a small grid and a roulette simulation) can be found in the file called RL.py. The file Run.py can be used to run an experiment and write various results to a log file. The files Roulette.py and Grid.py use the code to run such experiments. Also included is some code to plot the results. This plotting code requires matplotlib, all the other code only uses standard libraries. The code that can be downloaded now was specifically tailored for the Double Q-learning paper, but it is set up fairly general and can easily be reused for other purposes.

Feel free to use this code, or the more general C++ implementation below. If you find it useful, or you have remarks or questions, please let me know.

C++ Implementations of Some Reinforcement Learning Algorithms

The C++ code that can be downloaded is faster and more general than the Python code from the former section. A few differences include that the Python code as of yet does not have any function approximators implemented, whereas the C++ code features a neural network implementation. Other differences include the presence of Cacla (included in C++, not in Python) and Double Q-learning (included in Python, not in C++). However, these algorithms are fairly easy to include, using the description in the respective papers that can be found in my publications list.

Note: I also have made C++ implementations of the populair Mountain Car benchmark and of the Helicopter benchmark that was used for the reinforcement learning competition. The latter is a C++ adaptation of the original Java code that was used in the competition (see the helicopter domain on the rl-competition website and the RL-library for further documentation and source code). I do not guarantee that the C++ implementation of the helicopter is correct or bugfree, but the two implementations (Java and C++) seem to generate the same trajectories, which seems comforting. If you have interest in the code for any of these benchmarks, feel free to contact me. When I find the time, I might upload them here, but I do not know if or when this will happen.


You can download the code here!


For descriptions of the implemented algorithms, see the Short Introduction To Some Reinforcement Learning Algorithms.

At this point there is little documentation on the C++ implementation. It is easiest to view the interface and configuration (for instance, CartPoleInterface.cpp and CartPoleCfg) files to see an example of use. I must warn that there is minimal error handling at this point. Also, there are not many comments in the code. Therefore, if you want to use parts of this implementation, the best way is to view the example and the header files and use these for guidance.

As for the license of the code: I will not apply one of the general open source licenses to the code. Instead, anyone can use the code for scientific purposes as long as I am acknowledged as the author of the code. If anyone wishes to use the code commercially, please contact me. Also, if you use the code for any purpose at all, please send me an email and let me know.

Of course, for any questions, feel free to contact me. There may be unclarities and bugs in the code (though of course I hope not many), so I'm interested to hear about these when they are encountered.


For instance, to run the small maze test, one could do the following:

$ wget http://webdocs.cs.ualberta.ca/~vanhasse/rlcpp.tar.gz
$ tar -xzf rlcpp.tar.gz
$ make small
$ mkdir SmallMazeCfg_d
$ ./SmallMaze SmallMazeCfg

Say now we want to know how good Q-learning with Boltzmann exploration with tau = 1.0 and a learning rate of 0.1 performed on the small maze. We can then do:

$ cat SmallMazeCfg_d/Q_B1_L0.1_D0.99.onresults

which gives something like:

-0.964839 0.000383102
-0.953551 0.000201076
-0.955484 0.000172621
-0.942259 0.000353957
-0.931614 0.000324562
-0.92258 0.000294556
-0.911935 0.000367483
-0.909032 0.000395346
-0.898064 0.00059285
-0.889355 0.000519568

Here the first number is the number of experiments run to get these results; the first column contains the average rewards per step after every trainStorePer steps (specified in the configuration file); and the second column shows the variance of these average results. To see the average results of 100 steps after training, we do:

$ cat SmallMazeCfg_d/Q_B1_L0.1_D0.99.offresults
-0.75 0

At least 4 steps are needed in every episode to get from the start to the finish and only when the episode is finished a reward of 0 is received. All other steps yield -1, so -0.75 is the maximal obtainable average reward per step and Q-learning apparently solved this very simple 'maze' perfectly.

The Code

The code is structured as follows. For any problem we need the following aspects:


For the algorithms, there is a choice between state-action value based algorithms - including those that may also use state values, such as Acla and QV-learning - and the continuous action algorithm Cacla. All these algorithms are implemented and can directly be used. To implement another algorithm in this framework, inherit the Algorithm base class and make sure that all pure virtual functions in Algorithm.h get implementations.

World (or MDP)

For the MDP, inherit the World base class and make sure all pure virtual functions are implemented.

Continuous/Discrete spaces

In this implementation, there is support for both discrete and continuous state and action spaces. Any continuous space is assumed to have a fixed amount of components that range from -1 to 1 (so any MDP expecting values in a different range should scale the incoming continuous actions appropriately). Note however, that the output of Cacla is not actually bounded, so it is possible that it outputs values outside the range of -1 to 1. For instance, observe that the act function of CartPole.cpp explicitly bounds any incoming continuous action. Any instance of a state or action in a discrete space is assumed to be defined by a single integer. This integer will be in the range 0 to numberOfStates-1 or 0 to numberOfActions-1, for states and actions, respectively.

Important! Make sure that the proper global variables are set to relevant values in your MDP (and when you add a new algorithm). Set these in the constructor:

The values following the Booleans (numberOfStates, stateDimension,...) all only need to be set if the corresponding Boolean is true. For instance, an MDP with a locations in a 2D plane as its continuous state and the discrete actions turn_left, turn_right, accelerate and brake might have the following global variables:

  discreteStates    = false ;
  continuousStates  = true ;
  stateDimension    = 2 ;
  discreteActions   = true ;
  numberOfActions   = 4 ;
  continuousActions = false ;

Any MDP may be able to handle both discrete and continuous actions, but may only use discrete or continuous states. Similarly, any algorithm (and all algorithms that are implemented at this point) may handle both discrete and continuous states, but may only use either discrete or continuous actions.

If you want to use a continuous action algorithms - such as Cacla - on a problem that is more naturally defined using discrete actions, make sure that the MDP can handle continuous actions in its act( Action * action ) function.


To run experiments, you need to implement an interface class. This class should inherit the Experiment base class. For the rest, the interface classes may be very simple. They will also usually include the main function.

Interfaces inheriting from Experiment by default use a configuration file that specifies the different parameters used by the algorithms and the discount factor of the MDP. For the learning rates, the exploration parameters, discount factors and the algorithms, it is possible to give multiple values. (At this time multiple values for the number of hidden nodes - nHiddenQ and nHiddenV - are NOT supported. So for different number of hidden nodes, simply use different configuration files.) Then, experiments with all combinations of these values are performed.

The results are stored in text files in the directory cfgfile_d, where cfgfile is the configuration file. These directories should be made by hand before running the experiments. The on-line results (during training, with exploration) are stored in files with the extension .onresults, while the off-line results (after training, without exploration) are stored in files with .offresults.

Note that parameters that are irrelevant to an algorithm, such as the Boltzmann parameter tau for the algorithm Cacla, are simply ignored.


At the present time, two examples are implemented. These can be used as inspiration for your own implementations.

Example 1: Small Maze

The first example is a small maze, which really should not even be called a maze. It is a 3x3 grid, where you start in one corner and have to reach the opposite corner. Every step yields a -1 reward, except the step leading to the goal state, which yield a reward of 0 and ends the episode.

The MDP features 9 discrete states and 4 discrete actions. It also accepts continuous actions that are mapped to one of the discrete options. For instance, a two dimensional continuous action of (-0.2,0.4) would be mapped to right (because the second dimension - which is the left-right dimension - has the largest absolute value and its value is positive). Similarly, (-0.3,0.1) would be mapped to the action down.

Example 2: Cart Pole

The second example is the well known cart pole. In this task, the goal is to balance a pole on a cart by pushing the cart. In its present implementation, the state space is a 4 dimensional continuous space with the positions and velocities of the cart and the pole. The actions may be discrete, from 0 to 20, or continuous, from -1 to 1. Both these get scaled to -10 to 10 and the continuous actions get rounded to the nearest available option. For instance, a continuous action of 0.64 will get turned into an action of 6, which is a push to the right with 6 N. A discrete action of 3 will translate into an action of -7, which pushes the cart to the left with 7 N.

The supplied parameter file CartPoleCfg trains for 30000 steps, which amounts to 10 minutes of simulated time. Then the resulting policy is tested for 5 minutes of simulated time. An example run gave me:

$ make cp
$ mkdir CartPoleCfg_d
$ ./CartPole CartPoleCfg
$ cat CartPoleCfg_d/Cacla_G10_L0.0001_0.001_D0.99.offresults
0.985773 3.46774e-07
$ cat CartPoleCfg_d/Q_B1_L0.0001_D0.99.offresults
0.855626 0.00891752