The Tooploox R&D team, together with researchers from several universities and research institutions, has delivered a new framework that limits the effects of catastrophic forgetting in continual learning.
“Everyone in the world knows how to seek out knowledge that they do not have, but do not know how to find what they already know.”
Zhuangzi, The Book of Chuang Tzu
Yeah, but what does a Taoist sage have in common with Machine Learning models? If we are talking about continual learning and catastrophic forgetting – surprisingly a lot.
What is continual learning?
Continual learning is a subfield of machine learning in which a system learns incrementally from a constant (or at least regular) inflow of data. Under this approach, a model can be improved endlessly from the moment it is deployed, without going through the time- and money-consuming process of retraining the whole neural network. The savings and benefits are obvious: the model can learn new tasks on the go and become more versatile without having to be built anew.
At least in theory. Real-life continual learning includes multiple challenges, with catastrophic forgetting being one of the biggest.
What is catastrophic forgetting?
Catastrophic forgetting is a phenomenon observed in machine learning in which a model (typically a neural network) forgets all (or nearly all) of the skills it had while acquiring new ones. As a result, the model’s performance is hindered and the system’s overall effectiveness is significantly reduced.
- Example: An Artificial Intelligence system powering autonomous cars is skilled in controlling the machine, recognizing signs, and reacting to the behaviors of other vehicles on the road. If the neural network gets enriched with a new skill – such as spotting the risky behaviors of pedestrians – the overall performance of the system drops. Thus, the upskilled system is less effective in driving and recognizing signs, even if it can now spot a pedestrian about to run across the road in an unpermitted place.
The system basically “forgot” the knowledge it had previously acquired, and now finds performing the task it “already knew” far more challenging than performing the new one, which makes Zhuangzi surprisingly accurate about the way Artificial Neural Networks operate.
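To see the phenomenon in code, below is a minimal, self-contained sketch (in PyTorch, with synthetic data of our own invention, unrelated to the paper’s experiments): a small network masters task A, is then naively fine-tuned on a conflicting task B, and its accuracy on task A collapses.

```python
# Catastrophic forgetting in ~30 lines: naive sequential training on two
# conflicting tasks erases what the network learned first.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(flip_labels=False):
    # Two Gaussian blobs; task B reuses the same inputs with flipped
    # labels - an intentionally extreme setup that makes the
    # interference between tasks obvious.
    x = torch.cat([torch.randn(500, 2),
                   torch.randn(500, 2) + torch.tensor([0.0, 4.0])])
    y = torch.cat([torch.zeros(500), torch.ones(500)]).long()
    return x, 1 - y if flip_labels else y

def train(model, x, y, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
xa, ya = make_task()                  # task A
xb, yb = make_task(flip_labels=True)  # task B: a conflicting mapping

train(model, xa, ya)
print("task A accuracy after learning A:", accuracy(model, xa, ya))
train(model, xb, yb)                  # naive fine-tuning, no safeguards
print("task A accuracy after learning B:", accuracy(model, xa, ya))
```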
Tackling this challenge is one of the key goals of modern continual learning research. The Tooploox R&D team is working towards a solution, and below we present one step closer to it!
Divide and not Forget – overcoming catastrophic forgetting in neural networks
To understand the way the Tooploox team tackled this challenge, it is necessary to get familiar with a model architecture called Mixture of Experts (MoE), as it serves as the inspiration for the proposed approach. In a Mixture of Experts, typically only one or a few expert models are run for each input, so that each expert obtains its own specialization through training.
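For readers who prefer code, here is a minimal sketch of that classical MoE routing idea in PyTorch. The class, its layer sizes, and the top-1 routing are illustrative choices of ours, not the architecture from the paper.

```python
# A tiny Mixture of Experts: a gating network scores the experts and
# each input is handled only by its top-scoring expert.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim, num_classes, num_experts=4):
        super().__init__()
        # several small expert networks plus a gate that scores them
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                          nn.Linear(64, num_classes))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):
        weights = self.gate(x).softmax(dim=-1)  # (batch, num_experts)
        top_w, top_idx = weights.max(dim=-1)    # route to the best expert
        out = torch.stack(
            [self.experts[int(i)](xi) for i, xi in zip(top_idx, x)]
        )
        return out * top_w.unsqueeze(-1)        # scale by gate confidence
```

In a setup like this, the gate gradually learns which expert to trust for which inputs, and that is where the specialization mentioned above comes from.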
In the approach proposed in the ICLR paper presented by the Tooploox R&D team this year, the experts are trained individually based on the input they see at a given training iteration, while at inference the model provides a weighted average of the answers provided by all the experts.
The gathering of sages
As such, it can be compared to a gathering of sages (hello again, Zhuangzi) who get a question (is that a cat in the picture?) and each one, using its own knowledge, provides an answer (yes, no, it’s a dog, a fox, a chicken, etc.). The larger the number of experts, the lower the probability of getting a wrong or unrelated answer.
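This voting can be sketched in a few lines of code. The three hand-made “sages” and the `ensemble_predict` helper below are invented for illustration; the paper’s actual inference weighting is more involved.

```python
# A toy "gathering of sages": every expert answers, and the final
# prediction averages their softmax votes.
import torch

def ensemble_predict(expert_logits):
    # expert_logits: a list of (batch, num_classes) tensors, one per expert
    probs = torch.stack([l.softmax(dim=-1) for l in expert_logits])
    return probs.mean(dim=0).argmax(dim=-1)  # average the votes, then decide

# three "sages" judging one input with three possible classes
votes = [torch.tensor([[2.0, 0.1, 0.1]]),  # says class 0
         torch.tensor([[1.5, 0.2, 0.3]]),  # also class 0
         torch.tensor([[0.1, 0.2, 1.9]])]  # dissents: class 2
print(ensemble_predict(votes))  # the majority of probability mass wins
```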
When experts are expected to learn a new skill using continual learning, the current knowledge in their “minds” fades: it is overtaken by, or blended with, the new information. One of the solutions seen in traditional machine learning is to retrain the full neural network from scratch, yet this comes at great cost in terms of the time, money, and computing power required.
Continual learning aims to retrain the system almost on the go, yet with catastrophic forgetting, all the “sages” inside the network gradually lose their cognitive abilities. Put simply, their performance on their previous tasks drops significantly.
“Forget the years, forget distinctions. Leap into the boundless and make it your home!”
Zhuangzi, Discussion on Making All Things Equal
But how about teaching only a select, most witty group of sages, cherry-picked from the whole gathering?
Our approach
By “most witty,” the Tooploox team means those experts in the neural network whose distribution of knowledge is organized enough to “fit” new facts into their “minds.” Ironically, the “mind” of an expert requires enough empty space to store new knowledge.
What makes this distinctive from the classical MoE approach is the fact that not all experts acquire new knowledge – only those whose “minds” are empty enough. As an outcome, only the selected experts gain the new information, yet with that knowledge collected, their voices become the strongest in their new areas of expertise. What changes is the data distribution within the neural network.
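One way to make “empty enough minds” concrete in code is sketched below, loosely following the paper’s distribution-based idea: fit a Gaussian to each new class in every expert’s feature space and fine-tune the expert that keeps the new classes most separated. The helper names and the symmetric KL divergence over diagonal Gaussians are our illustrative assumptions; the exact selection criterion is described in the paper.

```python
# Hypothetical expert selection: pick the expert whose feature space
# best separates the classes of the new task.
import torch

def sym_kl_diag(mu1, var1, mu2, var2):
    # symmetric KL divergence between two diagonal Gaussians
    kl12 = 0.5 * (var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1
                  + (var2 / var1).log()).sum()
    kl21 = 0.5 * (var2 / var1 + (mu1 - mu2) ** 2 / var1 - 1
                  + (var1 / var2).log()).sum()
    return kl12 + kl21

def select_expert(experts, x_new, y_new):
    # experts: feature extractors; assumes the new task has >= 2 classes
    best, best_score = 0, -float("inf")
    for i, extract in enumerate(experts):
        with torch.no_grad():
            feats = extract(x_new)
        # per-class Gaussian statistics in this expert's feature space
        stats = [(feats[y_new == c].mean(0), feats[y_new == c].var(0) + 1e-6)
                 for c in y_new.unique()]
        # how well does this expert already tell the new classes apart?
        score = min(sym_kl_diag(*stats[a], *stats[b])
                    for a in range(len(stats))
                    for b in range(a + 1, len(stats)))
        if score > best_score:
            best, best_score = i, score
    return best  # only this expert gets fine-tuned on the new task
```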
- Example: A theoretical hate speech detection model is capable of spotting thousands of hateful messages. Yet a new type of slur emerges, never seen before, explicitly devised to evade automated detection systems. To ensure the safety of its users, the company uses a continual learning approach to update the neural network.
In this scenario, the majority of the internal experts may see a particular word as non-harmful. Yet the selected experts now have the knowledge required to recognize the new slurs and consider them harmful. Despite being in the minority, the patterns they detect show that these particular experts really know their subject matter, so the system considers their voices “stronger” than those of the “untrained” ones.
The act of dividing the experts into two groups, trained and untrained, was the inspiration behind the paper’s title – Divide and not Forget.
The effect
Putting things simply – it works. The system manages to acquire new skills without catastrophic forgetting. The novel approach was named SEED, and benchmarks show that it achieves state-of-the-art performance in exemplar-free settings across various scenarios.
The research paper, providing details on the model, its performance, and reproduction notes, can be found on openreview.net. The research was conducted by the Tooploox Research and Development team alongside researchers from Warsaw University of Technology; Gdańsk University of Technology; Jagiellonian University; the Computer Vision Center, Barcelona; and the Department of Computer Science, Universitat Autònoma de Barcelona.
The paper will be presented at the upcoming International Conference on Learning Representations (ICLR, a CORE A* conference) in Vienna.