Bitter Lessons in Reinforcement Learning – ICML 2024

Date: July 11, 2024 | Author: Konrad Budek

What if I told you we’ve already seen machine rebellions and creations outsmarting their creators? 

Reinforcement Learning is a powerful Machine Learning technique used in autonomous cars and the gaming industry. Yet the field struggles with many challenges, including the fact that machines that learn in a human way can begin to show surprisingly human traits, like laziness or a lack of motivation. 

The Tooploox team, together with researchers from other academic and business organizations, tested a set of techniques to limit RL systems’ misbehavior. The full findings of the research can be found in the published arXiv paper.

What is Reinforcement Learning?

Reinforcement Learning (RL) is a popular Machine Learning (ML) approach that enables machines to perform tasks once considered too complex for automated systems. In RL, the system learns in a more human way, that is, by interacting with its environment and collecting feedback on the actions it takes.

A good example is running multiple agents in a game of hide-and-seek. 

Reinforcement learning examples

In RL lingo, the entity controlled by the AI is called an agent. It may be an automated investing system, an autonomous car controller, or a system that oversees cooling infrastructure to find opportunities for savings – you name it. In one of Tooploox’s case studies, it was a system that controlled an oil refinery to optimize earnings and oil flow. 

Or, in some cases – it could be learning how to walk without an idea of what walking actually looks like. 

Google’s DeepMind AI Just Taught Itself To Walk

There is also a strong connection between reinforcement learning and generative AI. The GPT-4 model was fine-tuned using reinforcement learning from human feedback (RLHF), in which human evaluators rate the model’s outputs and those ratings steer further training.

How reinforcement learning works

As mentioned above, RL is based on a feedback loop between an agent interacting with its environment and an algorithm that grants points to the agent. The points may be either positive for performing a desired task or negative when the agent misbehaves. 

Training an autonomous car is a good example of deep reinforcement learning in action. Positive points are assigned when the agent follows the rules, such as stopping at a red light or sticking to the speed limit. On the reverse side of the same coin, negative points are assigned when the agent speeds, hits a pedestrian, or ends its journey at the nearest tree. 

But one needs to remember that driving a car (and parking!) is a bigger challenge than we think, especially if the agent starts off with just random actions. 
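
To make this feedback loop concrete, here is a minimal sketch of an agent interacting with an environment, written against the open-source Gymnasium API (used here purely for illustration; it is not the setup from the Tooploox study). The agent acts at random, and the only signal it receives about “good” or “bad” behavior is the scalar reward returned at each step.

```python
# A minimal sketch of the RL feedback loop, assuming the Gymnasium library.
# The toy CartPole environment stands in for a far more complex car simulator.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
for step in range(1_000):
    action = env.action_space.sample()   # random policy: the agent has no skill yet
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                # positive feedback for desired behavior
    if terminated or truncated:           # e.g. the pole fell over / the "car" crashed
        obs, info = env.reset()

env.close()
print(f"Reward collected by a purely random agent: {total_reward:.0f}")
```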

AI Learns to Park – Deep Reinforcement Learning

This approach lets the machine develop highly complicated skills and sometimes even achieve human-level control (through deep reinforcement learning). But this comes at a high cost. 

Challenges of reinforcement learning

Reinforcement Learning opens up great possibilities for machine learning applications. Yet, due to the nature of the process, it comes with limitations and challenges. These include:

  • Resource-consuming – gathering experiences requires delivering a training environment, as you can’t let an AI test-drive its new car down Main Street. This is easier said than done. The most common approach is to construct a simulator that the agent can interact with; depending on the task, this varies from complicated (the stock market, e-commerce analytics) to insanely complicated (a simulated city for autonomous car training).  
  • Time-consuming – the tremendous computing power used in machine learning is still no match for the human brain and its capability to learn. It is common for RL agents to spend a long time in a simulated environment. How long? In the Tooploox case it was 2045 years of constant, no-sleep, no-break learning. Of course, the environment is run at hyper speed, which makes the process even more costly.
  • Policy abuse – last but not least, reinforcement learning agents aim to maximize positive feedback, but they have no real grasp of the goal they are meant to achieve. In fact, the reward function is the only way the data scientist can communicate with the RL agent. As such, RL agents are prone to behaviors that abuse the reward function instead of performing the desired task. For example, if the agent controlling an autonomous vehicle gets points for every red light it stops at, the vehicle may just circle the nearest crossroads to stop at every light instead of driving from point A to point B (a toy illustration follows this list). 
  • Plasticity loss – in ML, plasticity is a system’s ability to acquire new skills and knowledge. Plasticity loss occurs when a system can no longer refine existing skills or learn new ones without forgetting those it already has. For example, an autonomous car, after learning how to stick to the right side of the road, might begin to ignore traffic lights completely; and once taught to obey traffic lights, it begins to drive in the middle of the road. Plasticity loss is one of the key blockers of multi-objective reinforcement learning development. 
  • Overfitting – this occurs when a model is trained so closely to a particular dataset that it gets confused when operating in a different (even if similar) environment. For example, a model may have memorized the whole map of New York and flawlessly perform at every crossroad or turn, yet malfunction when driving through Boston or Chicago. 
  • Overestimation – in this case, the model is prone to predicting higher reward values than the actual ones, and so it focuses on the wrong aspects of the task at hand. Overestimation is seen as one of the causes of the policy abuse mentioned above. The model may, for example, overestimate the reward for stopping at a red light compared to safely traveling from point A to point B – it may still drive safely, but its overall performance will be suboptimal.
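
To make the policy-abuse point concrete, here is a deliberately naive, hypothetical reward function for the driving example, together with one possible way to shape it so that obeying traffic rules only pays off while the agent actually makes progress. All names and values are illustrative, not taken from the paper.

```python
# A naive, hypothetical reward function: an agent maximizing it can "win" by
# looping around one intersection and stopping at the same red light forever,
# without ever reaching the destination.
def naive_reward(stopped_at_red_light: bool, reached_destination: bool) -> float:
    reward = 0.0
    if stopped_at_red_light:
        reward += 1.0          # easy, repeatable reward -> invites reward hacking
    if reached_destination:
        reward += 10.0         # the actual goal, but rare and therefore easy to ignore
    return reward

# One possible fix (still a toy): rule-following only pays off alongside progress
# toward the destination, so circling an intersection is no longer profitable.
def shaped_reward(stopped_at_red_light: bool, reached_destination: bool,
                  progress_made: float) -> float:
    reward = progress_made                      # dense signal for actually driving A -> B
    if stopped_at_red_light:
        reward += 0.1 * min(progress_made, 1.0)  # rules rewarded only while progressing
    if reached_destination:
        reward += 10.0
    return reward
```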

Overestimation, overfitting, and plasticity loss are among the challenges mentioned above that regularization can counter—or can it? That’s where the Tooploox research team comes into play!

Overcoming the challenges with regularization

The most obvious way to overcome the challenges of overestimation, overfitting, and plasticity loss is to apply some form of regularization – a technique used widely across the machine learning field, inside and outside the reinforcement learning world. At the same time, RL researchers have experimented with multiple approaches unseen in other branches of Machine Learning. 
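
As a rough illustration of what “general” regularization means here, the sketch below attaches layer normalization, spectral normalization, and weight decay to a toy critic network in PyTorch. The layer sizes and hyperparameters are placeholders, not the settings used in the study.

```python
# A rough PyTorch sketch of three general regularizers discussed in this post,
# attached to a toy critic network. Dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

obs_dim, act_dim, hidden = 17, 6, 256   # placeholder dimensions

critic = nn.Sequential(
    nn.Linear(obs_dim + act_dim, hidden),
    nn.LayerNorm(hidden),                      # layer norm: normalizes activations per layer
    nn.ReLU(),
    spectral_norm(nn.Linear(hidden, hidden)),  # spectral norm: bounds the layer's Lipschitz constant
    nn.ReLU(),
    nn.Linear(hidden, 1),                      # Q-value output
)

# Weight decay: an L2 penalty on the parameters, applied through the optimizer.
optimizer = torch.optim.AdamW(critic.parameters(), lr=3e-4, weight_decay=1e-2)
```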

Our work – testing regularization on network plasticity, overfitting, and overestimation

In a study conducted by Tooploox researchers, alongside their academic and business partners, the team set out to determine which regularization techniques lead to robust performance improvements. 

The team applied the techniques to multiple benchmarks, including seven DeepMind Control Suite tasks: acrobot-swingup, hopper-hop, humanoid-walk, humanoid-run, dog-trot, dog-run, and quadruped-run. The evaluation also included seven MetaWorld tasks: Hammer, Push, Sweep, Coffee-Push, Stick-Pull, Reach, and Hand-Insert. 

What some of these tests look like can be seen in the video below:

These tasks are well-established benchmarks commonly used to test and compare RL models. 

Experiments 

When running the tests, the team contrasted common regularization techniques with RL-specific approaches widely used in the field, running each benchmark listed above with a typical RL-specific method and then with a general regularization method. The technical details of each test can be found in the resulting arXiv research paper.

Conclusions

General Neural Network regularizers outperform most RL-specific algorithmic improvements in terms of agent performance. The research team considers these results surprising, as the RL-specific approaches were designed precisely to overcome these challenges.

The most significant takeaways include:

  • Critic regularization (CR) methods exhibit limited effectiveness in enhancing performance. When combined with network or plasticity regularization, CR leads to reduced performance.
  • Periodic network resetting is the most robust intervention across the two benchmarks in a high replay-ratio regime, and it clearly surpasses other plasticity regularization techniques in both robustness and performance (a minimal sketch follows this list).
  • Layer norm is essential for some environments.
  • When considering network regularization approaches, layer norm is generally recommended for DeepMind Control Suite tasks, while spectral norm is more effective for the MetaWorld benchmarks. Across a diverse range of tasks, spectral norm proves more robust than layer norm. Weight decay generally performs poorly when used alone with Soft Actor-Critic.
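
For illustration, here is a minimal sketch of periodic network resetting in PyTorch: every fixed number of gradient updates, the agent’s networks are re-initialized while the replay buffer (not shown) keeps the collected experience. The networks and the reset interval below are placeholders, not the configuration used in the paper.

```python
# A minimal sketch of periodic network resetting. Every `reset_every` gradient
# updates the actor and critic weights are re-initialized; the replay buffer
# (not shown) retains the experience gathered so far.
import torch.nn as nn

def reset_weights(module: nn.Module) -> None:
    """Re-initialize any layer that defines reset_parameters (e.g. nn.Linear)."""
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()

# Placeholder actor/critic networks standing in for a Soft Actor-Critic agent.
actor = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6))
critic = nn.Sequential(nn.Linear(17 + 6, 256), nn.ReLU(), nn.Linear(256, 1))

reset_every = 100_000   # illustrative interval, measured in gradient updates

for step in range(1, 300_000 + 1):
    # ... sample a batch from the replay buffer and take a gradient step here ...
    if step % reset_every == 0:
        actor.apply(reset_weights)    # forget the (possibly rigid) learned weights
        critic.apply(reset_weights)   # while the replay buffer keeps the experience
```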

Summary

The research paper described above will be presented at the upcoming ICML (International Conference on Machine Learning) 2024 in Vienna, Austria—one of the leading Machine Learning and Reinforcement Learning conferences in the world. 

The Tooploox-submitted paper is the work of a team of researchers consisting of Michał Nauman, Michał Bortkiewicz, Piotr Miłoś, Tomasz Trzciński, Mateusz Ostaszewski, and Marek Cygan, from institutions including Tooploox, the University of Warsaw, Warsaw University of Technology, and the Polish Academy of Sciences. The full findings of the research can be found in the published arXiv paper.