• Praveen Kaushik

Reinforcement Learning

Reinforcement Learning (RL) is a Machine Learning technique based on rewarding desirable behaviour and punishing undesirable behaviour.

According to Richard S. Sutton and Andrew G. Barto, “Reinforcement learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.”
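To make the idea of trial-and-error search concrete, here is a minimal sketch in Python. It is not taken from Sutton and Barto's text: the three-action toy problem, the function name run_bandit and the "explore 10% of the time" rule (usually called epsilon-greedy) are all assumptions chosen for illustration. It shows only the trial-and-error part; rewards here are immediate, so delayed reward is not covered.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn by trial and error which action of a toy problem pays off most often."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)        # how often each action was tried
    estimates = [0.0] * len(reward_probs)   # running average reward per action

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(len(reward_probs)), key=lambda a: estimates[a])

        # The learner only receives a number; it is never told the right action.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Update the running average for the action that was tried.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts


# Hypothetical problem: three actions that pay a reward of 1 with these probabilities.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("best action found:", best_action)
```

The learner never sees the reward probabilities. It discovers the most rewarding action only from the numbers it gets back after trying each one.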

Elements of Reinforcement Learning

Besides the agent and the environment, a reinforcement learning system has four main sub-elements: a policy, a reward signal, a value function and a model of the environment.

1. Policy

Defines the learning agent’s way of behaving at a given time. It is a mapping from the states the agent perceives to the actions to be taken in those states.

2. Reward signal

Defines the goal of a reinforcement learning problem. After every step, the environment sends feedback to the agent in the form of a single number called the reward. The reward signal defines which events are good and which are bad for the agent. It is also the primary basis for altering the policy.

3. Value function

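Specifies the total future reward that an agent can expect to accumulate, starting from a particular state or by taking a particular action in that state. While the reward signal indicates what is good immediately, the value function indicates what is good in the long run.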

4. Model of the environment

Mimics the behaviour of the environment and helps to infer how the environment will behave. Models are used for planning: given a state and an action, the model might predict the resulting next state and next reward. Solving reinforcement learning problems with models and planning is called the model-based method, as opposed to the model-free method, which relies on trial and error.
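The four elements can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the four-position corridor, the reward of -1 per step, and the names model, policy, value and plan are all hypothetical and do not come from any library.

```python
GOAL = 3  # states are positions 0..3 in a tiny corridor; position 3 is the goal


def model(state, action):
    """Model of the environment: predicts (next_state, reward) for a state and action."""
    move = 1 if action == "right" else -1
    next_state = min(max(state + move, 0), GOAL)
    reward = -1.0  # reward signal (assumed): every step costs 1, so short paths are better
    return next_state, reward


# Policy: a mapping from the perceived state to the action taken in that state.
policy = {0: "right", 1: "right", 2: "right"}


def value(state, policy, max_steps=10):
    """Value: total reward accumulated from `state` onwards when following `policy`."""
    total = 0.0
    for _ in range(max_steps):
        if state == GOAL:
            break
        state, reward = model(state, policy[state])
        total += reward
    return total


def plan(state):
    """Planning: compare actions using the model, without acting in the real environment."""
    def score(action):
        next_state, reward = model(state, action)
        return reward + value(next_state, policy)
    return max(["left", "right"], key=score)


print({s: value(s, policy) for s in policy})  # {0: -3.0, 1: -2.0, 2: -1.0}
print(plan(1))                                # right
```

Here the value of a state is simply the sum of the rewards that follow it, so states closer to the goal have higher value. The plan function is model-based in the sense described above: it asks the model what would happen for each action instead of trying the action for real.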

Figure: Action-reward feedback loop of a generic RL model

OpenAI Gym

OpenAI Gym is an open-source Python library that provides a simple setup and toolkit for developing and comparing reinforcement learning algorithms. It offers a large number of test environments with a shared interface, so that general algorithms can be written once and tested on many tasks. These simulated environments use episode-based settings, in which the agent’s experience is divided into a sequence of episodes. Gym is also compatible with numerical computation libraries such as TensorFlow or Theano.

Gym is an attempt to fix two main issues with RL:

  1. The need for better benchmarks.

  2. Lack of standardization of environments used in publications.

Observations of the OpenAI Gym

If we want an RL agent to perform better than it would by just taking random actions at every step, it is important to know what our actions are doing to the environment. The environment’s step function returns exactly what we need. It returns four values:

  1. Observation (Object): An environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

  2. Reward (Float): The feedback given to the agent, that is, the amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

  3. Done (Boolean): Indicates whether it is time to reset the environment. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates that the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

  4. Info (Dict): Diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. At each timestep, the agent chooses an action, and the environment returns an observation and a reward. The process is started by calling reset, which returns an initial observation.
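The loop can be sketched without installing anything. The CorridorEnv class below is a hypothetical toy environment written for this example; it only imitates the reset and step interface with the four return values described above and is not part of Gym. Newer releases of Gym, and its successor Gymnasium, use a somewhat different step and reset signature, so check the version you have installed before adapting this to a real environment.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment imitating the classic reset/step interface."""

    def __init__(self, length=5, max_steps=50):
        self.length = length        # positions 0 .. length-1, the last one is the goal
        self.max_steps = max_steps  # episode is cut off after this many steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position        # initial observation

    def step(self, action):
        # Action 1 moves right, any other action moves left.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        info = {"steps": self.steps}  # diagnostic information only
        return self.position, reward, done, info


def run_episode(env, choose_action):
    """Agent-environment loop: reset once, then step until done is True."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = choose_action(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward, info["steps"]


rng = random.Random(0)
env = CorridorEnv()
random_result = run_episode(env, lambda obs: rng.randrange(2))  # random actions
right_result = run_episode(env, lambda obs: 1)                  # always move right
print("random agent (reward, steps):", random_result)
print("always-right agent (reward, steps):", right_result)      # (1.0, 4)
```

The always-right agent reaches the goal in four steps. The random agent usually needs more steps, or runs out of steps before reaching the goal, which is the gap a learning agent is meant to close.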

Revolution of Reinforcement Learning in Machine Learning

Although the term Reinforcement Learning has been known since the 1950s, RL as a Machine Learning methodology began its revolution in 2013, when DeepMind built software that learned to play Atari games without any prior knowledge of the game rules. In 2016, a Go match took place between Lee Sedol, the world’s second-highest-ranking professional player, and AlphaGo, a computer program designed by Google’s DeepMind. The board game Go has been viewed as one of the most challenging tasks for artificial intelligence because it is “complex, pattern-based and hard to program”. AlphaGo’s victory over Lee Sedol became a significant moment in the history of Artificial Intelligence. AlphaGo’s playing rules are learned rather than designed: it combines machine learning with several neural networks to form a learning component that becomes better at Go.

A familiar analogy for this kind of learning is training a dog. The dog is given a command several times and is trained to respond to that command with a particular action. When the dog responds to the command and performs the appropriate action, it is rewarded with a treat.

These achievements demonstrate that RL can discover new strategies superior to those humans can devise, which is an exciting prospect. More recently, Google AI announced a breakthrough in chip design: according to a paper published in Nature, its AI system outperforms humans in designing floorplans for microchips. Artificial intelligence can thus help the electronics industry speed up chip design.

The recent achievements of reinforcement learning in solving problems at a superhuman level, such as playing board games, controlling robotic arms and playing real-time strategy games at a professional level, show that it has the potential to transform the world and to be the next step in AI development.


  1. https://illumin.usc.edu/ai-behind-alphago-machine-learning-and-neural-network/

  2. https://www.nature.com/articles/s41586-021-03544-w.epdf?sharing_token=8za_nMkuk42509LyAn-xY9RgN0jAjWel9jnR3ZoTv0PW0K0NmVrRsFPaMa9Y5We97spjdO-aPpvZYXPHhKbfpfPljZaIm3b-kyQ3gKElVBjZIxn_5lBKsnqIIUn2YkCI3IFe5puGE49yIrhVbJrW9eUbKmMo7FS9KDgM4hs9TFFEBv1CLtLi4EFaXPirF-G_lwtOzFcc-pVSzW5vcQBQt19OPe2Fx4nUQHU5ItFuNC8%3D&tracking_referrer=www.theverge.com

  3. http://incompleteideas.net/book/1/node2.html

  4. https://gym.openai.com/docs/
