What is reinforcement learning?

Reinforcement learning (RL) is a subset of machine learning in which an AI-driven system (usually referred to as an agent) learns through trial and error, using feedback from its own actions. This feedback is either positive or negative, signalled as reward or punishment, and the agent’s aim is to maximise its cumulative reward. Because it learns from its mistakes, RL is often described as one of the closest analogues to natural intelligence that artificial intelligence currently offers.

In terms of learning methods, RL is similar to supervised learning only in that it learns a mapping between input and output; that is the only thing they have in common. In supervised learning, the feedback contains the correct set of actions for the agent to follow, whereas in RL there is no such answer key: the agent decides for itself what to do to perform the task correctly. Compared with unsupervised learning, RL has different goals. The goal of unsupervised learning is to find similarities or differences between data points. RL’s goal is to find the most suitable action model to maximise total cumulative reward for the RL agent. With no training dataset, the RL problem is solved by the agent’s own actions with input from the environment.

RL methods like Monte Carlo, state–action–reward–state–action (SARSA), and Q-learning offer a more dynamic approach than traditional machine learning, and so are breaking new ground in the field.

There are three types of RL implementations: 

  • Policy-based RL learns a policy directly, that is, a deterministic or stochastic strategy that maximises cumulative reward
  • Value-based RL learns a value function that estimates expected cumulative reward, and chooses actions that maximise it
  • Model-based RL creates a virtual model for a certain environment and the agent learns to perform within those constraints

How does RL work?

Describing fully how reinforcement learning works in one article is no easy task. To get a good grounding in the subject, the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto is a good resource.

The best way to understand reinforcement learning is through video games, which follow a reward and punishment mechanism. Because of this, classic Atari games have been used as a test bed for reinforcement learning algorithms. In a game, you play a character who is the agent that exists within a particular environment. The scenarios they encounter are analogous to a state. Your character or agent reacts by performing an action, which takes them from one state to a new state. After this transition, they may receive a reward or punishment. The policy is the strategy which dictates the actions the agent takes as a function of the agent’s state as well as the environment.
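The loop described above (observe a state, act, move to a new state, receive a reward or punishment) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, `random_policy` and `run_episode` are hypothetical names invented for this example, and the toy "game" is simply a character walking along a five-cell corridor towards a goal.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..length-1, goal at the far end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        # Reward on reaching the goal, a small punishment for every other step.
        reward = 1.0 if done else -0.1
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state entirely.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode and return the total (cumulative) reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Calling `run_episode(CorridorEnv(), random_policy, random.Random(0))` plays one episode with a purely random character. A learning agent would replace `random_policy` with a policy that improves as rewards come in.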

To build an optimal policy, the RL agent faces a dilemma: whether to explore new states or to exploit what it already knows in order to maximise its reward. This is known as the exploration versus exploitation trade-off. The aim is not to look for immediate reward, but to optimise for maximum cumulative reward over the length of training. Time is also important – the agent’s cumulative reward doesn’t just depend on the current state, but on the whole sequence of states it passes through. Policy iteration is an algorithm that helps find the optimal policy for given states and actions.
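One common and simple way to handle the exploration versus exploitation trade-off is an epsilon-greedy rule. It is shown here as a sketch rather than as the method any particular system uses. The function name `epsilon_greedy` and the list-of-estimates representation are assumptions made for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index given one estimated value per action.

    With probability epsilon the agent explores (random action);
    otherwise it exploits (action with the highest current estimate).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

A larger `epsilon` means more exploration. In practice it is often reduced over the course of training so that the agent exploits more as its estimates improve.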

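The environment in a reinforcement learning problem is commonly expressed as a Markov decision process (MDP), and almost all RL problems are formalised using MDPs. SARSA is an algorithm for learning a policy in a Markov decision process. It’s a slight variation of the popular Q-learning algorithm: SARSA is on-policy, updating its estimates with the action the agent actually takes next, whereas Q-learning is off-policy, updating with the best available next action. SARSA and Q-learning are two of the most commonly used RL algorithms.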
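The two update rules are close enough to show side by side. The sketch below assumes a tabular setting in which `Q[s][a]` is a list of lists holding the current estimate for taking action `a` in state `s`. The function names and the default values of the step size `alpha` and discount factor `gamma` are illustrative assumptions.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # SARSA: the target uses the action a_next that the agent will actually take.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Q-learning: the target uses the best action available in the next state,
    # regardless of which action the agent goes on to take.
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```

The only difference is the term used for the next state: `Q[s_next][a_next]` in SARSA versus `max(Q[s_next])` in Q-learning.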

Some other frequently used methods include Actor-Critic, which is a Temporal Difference version of Policy Gradient methods. It’s similar to an algorithm called REINFORCE with baseline. The Bellman equation is one of the central elements of many reinforcement learning algorithms. It usually refers to the dynamic programming equation associated with discrete-time optimisation problems.
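For reference, the Bellman equation for the value of a state under a policy, and its optimality form, are usually written as follows. The notation is the conventional one and is an assumption here, since the article does not define symbols: \pi is the policy, P the transition probabilities, R the reward, and \gamma a discount factor between 0 and 1.

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{\pi}(s') \bigr]

V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{*}(s') \bigr]
```

Both express the value of a state recursively, in terms of the immediate reward plus the discounted value of the state that follows.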

The Asynchronous Advantage Actor-Critic (A3C) algorithm is a comparatively recent development in the field of deep reinforcement learning. Unlike other popular deep RL algorithms such as Deep Q-Networks (DQN), which use a single agent and a single environment, A3C uses multiple agents, each with its own network parameters and its own copy of the environment. The agents interact with their environments asynchronously, learning with every interaction and contributing to the total knowledge of a global network. The global network also allows agents to have more diversified training data. This mimics the real-life environment in which humans gain knowledge from the experiences of others, allowing the entire global network to benefit.

Does RL need data?

In RL, data is not supplied up front as an input, as it would be in supervised or unsupervised machine learning. Instead, the data is accumulated by the agent itself as it interacts with its environment through trial and error.

Temporal difference (TD) learning is a class of model-free RL methods that learn via bootstrapping from a current estimate of the value function. The name “temporal difference” comes from the fact that it uses changes – or differences – in predictions over successive time steps to push the learning process forward. At any given time step, the prediction is updated, bringing it closer to the prediction of the same quantity at the next time step. Often used to predict the total amount of future reward, TD learning is a combination of Monte Carlo ideas and dynamic programming. However, whereas Monte Carlo methods learn only at the end of each episode, TD methods learn after each interaction.
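The contrast between the two can be made concrete with a small sketch. It assumes state values are stored in a dictionary `V`, and that an episode is recorded as a list of `(state, reward)` pairs, where each reward is the one received on leaving that state. The function names, the `terminal` flag and the default parameters are all hypothetical choices for this example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """TD(0): update V[s] straight after one step, bootstrapping from V[s_next]."""
    bootstrap = 0.0 if terminal else V[s_next]
    V[s] += alpha * (r + gamma * bootstrap - V[s])


def monte_carlo_update(V, episode, alpha=0.1, gamma=1.0):
    """Monte Carlo: wait until the episode has finished, then update every
    visited state towards the actual return observed from that state onwards."""
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G  # return from state s to the end of the episode
        V[s] += alpha * (G - V[s])
```

`td0_update` can be called inside the interaction loop after every step, while `monte_carlo_update` needs the complete episode before it can do anything.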

TD-Gammon is a computer backgammon program that was developed in 1992 by Gerald Tesauro at IBM’s Thomas J. Watson Research Center. It used RL and, specifically, a non-linear form of the TD algorithm to train a computer to play backgammon at a level close to that of the best human players. It was an instrumental step in teaching machines how to play complex games.

Monte Carlo methods represent a broad class of algorithms that rely on repeated random sampling to obtain numerical results, typically estimates of probabilities. Monte Carlo methods can be used to estimate the probability of:

  • an opponent’s move in a game like chess
  • a weather event occurring in the future
  • the chances of a car crash under specific conditions
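The idea of estimating a probability by repeated random sampling can be shown with a much simpler event than those listed above. The sketch below is a toy illustration only: the function names are hypothetical, and the event (two dice summing to at least 10) was chosen because its exact probability, 6 in 36, is easy to check by hand.

```python
import random


def estimate_probability(event, n_samples, rng):
    """Estimate P(event) as the fraction of random trials in which it occurs."""
    hits = sum(1 for _ in range(n_samples) if event(rng))
    return hits / n_samples


def two_dice_sum_at_least_10(rng):
    # One random trial: roll two fair six-sided dice.
    return rng.randint(1, 6) + rng.randint(1, 6) >= 10
```

Running `estimate_probability(two_dice_sum_at_least_10, 100_000, random.Random(42))` should give a value close to 1/6, and the estimate generally tightens as the number of samples grows.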

Named after the casino in the city of the same name in Monaco, Monte Carlo methods first arose within the field of particle physics and contributed to the development of the first computers. Monte Carlo simulations allow people to account for risk in quantitative analysis and decision making. It’s a technique used in a wide variety of fields including finance, project management, manufacturing, engineering, research and development, insurance, transportation, and the environment.

In machine learning or robotics, Monte Carlo methods provide a basis for estimating the likelihood of outcomes in artificial intelligence problems using simulation. The bootstrap method is built upon Monte Carlo methods, and is a resampling technique for estimating a quantity, such as the accuracy of a model on a limited dataset.
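As a rough sketch of the bootstrap idea, the function below resamples a dataset with replacement many times and reports how much a chosen statistic varies across the resamples. The name `bootstrap_standard_error` and its parameters are assumptions made for this example, not a standard library API.

```python
import random
import statistics


def bootstrap_standard_error(data, statistic, n_resamples, rng):
    """Estimate the standard error of `statistic` by resampling `data`
    with replacement and measuring the spread of the recomputed values."""
    estimates = []
    for _ in range(n_resamples):
        # Draw a resample of the same size as the original dataset.
        resample = [rng.choice(data) for _ in data]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)
```

For instance, if `data` is a list of 1s and 0s recording whether a model classified each test example correctly, passing `statistics.mean` as the statistic gives an indication of how uncertain the measured accuracy is on that limited dataset.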

Applications of RL

RL is the method DeepMind used to teach artificial intelligence to play complex games like chess, Go, and shogi (Japanese chess). It was used in the building of AlphaGo, the first computer program to beat a professional human Go player. From this grew the deep neural network agent AlphaZero, which taught itself to play chess well enough to beat the chess engine Stockfish after just four hours of training.

AlphaZero has only two parts: a neural network, and an algorithm called Monte Carlo Tree Search. Compare this with the brute force computing power of Deep Blue, which, even in 1997 when it beat world chess champion Garry Kasparov, allowed the consideration of 200 million possible chess positions per second. The representations of deep neural networks like those used by AlphaZero, however, are opaque, so our understanding of their decisions is restricted. The paper Acquisition of Chess Knowledge in AlphaZero explores this conundrum.

Deep RL has been proposed as a way for unmanned spacecraft to navigate new environments, whether on Mars or the Moon. MarsExplorer is an OpenAI Gym-compatible environment that has been developed by a group of Greek scientists. The team has trained four deep reinforcement learning algorithms on the MarsExplorer environment: A3C, Rainbow, PPO, and SAC, with PPO performing best. MarsExplorer is the first OpenAI Gym-compatible reinforcement learning framework that is optimised for the exploration of unknown terrain.

Reinforcement learning is also used in self-driving cars, in trading and finance to predict stock prices, and in healthcare for diagnosing rare diseases.

Deepen your learning with a Masters

These complex learning systems created by reinforcement learning are just one facet of the fascinating and ever-expanding world of artificial intelligence. Studying a Masters degree can allow you to contribute to this field, which offers numerous possibilities and solutions to societal problems and the challenges of the future. 

The University of York offers a 100% online MSc Computer Science with Artificial Intelligence to expand your learning, and your career progression.