MagMax model merging for continual learning – Tooploox at ECCV 2024

Date: September 19, 2024 | Author: Konrad Budek

In “Computing Machinery and Intelligence,” a pioneering paper on Artificial Intelligence published in 1950, shortly after the end of World War II, Alan Turing proposed a test that could be used to determine whether a machine is truly intelligent. In this now-famous test, a judge has to determine whether the entity they are conversing with is an AI agent or a real human.

Modern AI solutions have come a long way very fast. With increasing access to computing power and vast amounts of data, Machine Learning solutions have grown increasingly sophisticated. And, as a result, systems like ChatGPT could easily pass this test.

The test itself was one of the first designed to determine whether a particular entity is intelligent, yet it was flawed – holding a conversation is only one of many aspects of intelligence. Parrots are talkative fellas, yet they are poor conversational partners. Even one of the most trained and intellectually sophisticated parrots in the world, named Alex (an acronym for Avian Language Experiment), could handle only approximately 100 words – nothing compared to ChatGPT’s erudition and politeness.

Yet what makes the biggest difference is the versatility of Alex – or of any rat, dolphin, or basically any other living creature.

What is narrow AI (and why does it matter)

The AI solutions used now are narrow. Systems like AlphaGo, AlphaZero, or AlphaStar show superhuman performance in one particular task. AlphaGo outperformed Lee Sedol, one of the world’s best players of Go, arguably one of the most challenging board games ever created. AlphaGo triumphed, but on the other hand, it can do nothing but play Go at a superhuman level. By contrast, Lee Sedol can walk home, maneuver between pedestrians without having a collision, distinguish between a rock and an apple, and (probably) operate a car well enough to safely go for a long drive.

The same goes for many other systems. An AI-powered solution may be able to spot early signs of cancer in an X-ray image, yet taking other factors into account (how does the patient feel? do they smoke? was there cancer in their family history?) is up to the physician – the system is incapable of operating on that information, not to mention making any meaningful decisions.

These examples show what “narrow AI” is. Many modern systems excel at performing one particular task, but at nothing else.

And the world is a complicated place.

What is continual learning (and why does it matter)

Continual learning is a technique that allows AI teams to either modify the existing skills of a neural network or to add new ones. In the standard approach, building a new, more flexible or skilled system would require training a new network from scratch on a new dataset. 

This is both time- and resource-consuming. Also – if the existing system performs well, scrapping it is a plain waste of resources. With continual learning, the existing system can be enriched either with new knowledge or with new skills. But, of course, it is not that simple.

Challenge – catastrophic forgetting

Catastrophic forgetting is a phenomenon that occurs when a neural network is retrained: in gaining new skills or knowledge, the system stops performing well on its previous tasks. Take a self-driving car as an example – the car may stop turning left or right after gaining the ability to read road signs. It can now go only straight forward – or backward.
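To see the effect in miniature, consider the toy sketch below – a hedged illustration with made-up data, not anything from the paper. A linear model is fitted to task A, then naively fine-tuned on a conflicting task B; its task-A error shoots right back up.

```python
# Toy demonstration of catastrophic forgetting (hypothetical data; plain
# gradient descent on a single shared weight – purely illustrative).
import numpy as np

rng = np.random.default_rng(0)

# Task A: y = 2x. Task B: y = -3x (deliberately conflicting targets).
x = rng.normal(size=(100, 1))
y_a, y_b = 2.0 * x, -3.0 * x

def mse(w, x, y):
    return float(np.mean((x @ w - y) ** 2))

def fit(w, x, y, lr=0.1, steps=200):
    # Plain gradient descent on the mean squared error.
    for _ in range(steps):
        grad = 2 * x.T @ (x @ w - y) / len(x)
        w = w - lr * grad
    return w

w = np.zeros((1, 1))  # a single shared parameter
w = fit(w, x, y_a)
print("task A loss after learning A:", mse(w, x, y_a))  # close to zero

w = fit(w, x, y_b)  # naive sequential fine-tuning on task B...
print("task A loss after learning B:", mse(w, x, y_a))  # large again
```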

Obviously, catastrophic forgetting renders the process of continual learning far more complicated and challenging. The Tooploox research team is renowned for its work on solving this challenge, and merging networks using MagMax is one of the techniques the team has recently developed, alongside researchers from IDEAS NCBR, Warsaw University of Technology, the Autonomous University of Barcelona, and the Computer Vision Center.

Merging Networks

Merging networks is the process of enriching an existing network with new skills by training a new, single-task system and merging it with the existing one. This can be extremely useful in two cases:

  • There is an existing, proven, tested, and well-performing system in place that needs to be updated – delivering a brand-new network is not only costly, but its performance is also uncertain.
  • There is a need for a highly sophisticated, multi-task network that would be extremely costly, and likely under-performing, if trained as a monolith.

Merging networks enables the team to create a patchwork of neural networks that is capable of performing a variety of tasks, each of which used to be the sole speciality of one of the networks in the completed patchwork. But this approach is also not free of problems. 
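The simplest merging strategy is plain parameter averaging. Below is a minimal sketch (ours, not from the paper), assuming two fine-tuned models with identical architectures whose weights are stored as name-to-array state dicts:

```python
# Naive merging baseline: parameter-wise averaging of two fine-tuned
# models that share one architecture (illustrative helper, not MagMax).
def average_merge(state_a, state_b):
    """Average corresponding parameters of two same-shaped models."""
    return {name: (state_a[name] + state_b[name]) / 2 for name in state_a}
```

Plain averaging, however, lets conflicting task updates dilute or cancel one another – one reason a more careful selection strategy, like the one described next, is needed.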

MagMax – Artificial Neural Surgery 

Neural networks are neither monolithic nor uniform – the “knowledge” inside is distributed across multiple parts of the network, in a way comparable to the human brain.

There are areas responsible for processing math- and logic-related problems, areas with more creative use cases, areas responsible for speech-related tasks, and so on. Comparably, a victim of an accident who suffers brain damage may forget only some part of their own knowledge.

Merging networks is about spotting the “areas” where a network holds the most valuable knowledge and using them in the final system.

In the approach designed by the research team, called MagMax (Maximum Magnitude), the model is sequentially fine-tuned on subsequent tasks, and a task vector is created for each task by subtracting the weights of the pre-trained model θ0 from the fine-tuned weights. The task vectors are then merged using the Maximum Magnitude Selection strategy, which, for each parameter, selects the task vector entry with the highest magnitude. Finally, the merged task vector is applied to the pre-trained model to obtain a multitask model.

In other words – the team spots the most “knowledge heavy” areas of the network, extracts them, and later uses them in the final network patchwork. 
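A minimal PyTorch sketch of that procedure (written for this post, not taken from the authors’ code; names are illustrative):

```python
import torch

def magmax_merge(theta_0, fine_tuned):
    """theta_0: pre-trained state dict (parameter name -> tensor).
    fine_tuned: list of state dicts, one per sequentially fine-tuned task."""
    merged = {}
    for name, base in theta_0.items():
        # Task vectors tau_t = theta_t - theta_0, stacked along a task axis.
        taus = torch.stack([ft[name] - base for ft in fine_tuned])
        # Maximum Magnitude Selection: per entry, keep the task vector
        # value with the largest absolute magnitude across all tasks.
        idx = taus.abs().argmax(dim=0, keepdim=True)
        merged_tau = taus.gather(0, idx).squeeze(0)
        # Apply the merged task vector back to the pre-trained weights.
        merged[name] = base + merged_tau
    return merged
```

Because each parameter keeps only the single strongest update across tasks, strong task-specific changes survive the merge instead of being averaged away.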

The effects:

Using the MagMax approach reduces the cost of training more sophisticated networks or of incorporating new skills into existing ones. By discarding the elements that have little to no impact on performance, the team can deliver agents that do their tasks more effectively, faster, and for less money.

Real-life applications

Combining connected yet diverse data without losing context is a skill that can be used in multiple environments and industries. Examples include:

  • Healthcare – a system designed to spot early signs of a disease on X-ray scans may be enriched with capabilities to analyze ultrasound or MRI scans of the same patient. The solution may thus be even better at aiding diagnosis by cross-comparing signs and markers.
  • E-commerce – the on-site selling process is extremely sophisticated, with tremendous amounts of data to process – from real-time behavioral data to historical and contextual information and beyond. A truly powerful AI system would be capable of processing the data without losing context and seeing each customer as a person with various motivations, not a constellation of unrelated data points. 
  • Autonomous cars – as mentioned above, autonomous vehicles need to operate on various types of data and perform tasks with high levels of accuracy. Merging networks is a way to produce a network capable of handling this need.

Summary

The paper was delivered by a team consisting of Daniel Marczak, Bartłomiej Twardowski, Tomasz Trzciński, and Sebastian Cygert, from institutions including IDEAS NCBR, Warsaw University of Technology, Tooploox, the Autonomous University of Barcelona, and the Computer Vision Center. The paper was accepted at the European Conference on Computer Vision (ECCV) 2024. The full version of the paper can be found on arXiv.