Where MagMax meets DINO – Tooploox at ECCV Conference 2024

Date: September 26, 2024 | Author: Konrad Budek | 5 min read

The European Conference on Computer Vision (ECCV) is a biennial research conference held in Europe. The event is considered one of the world’s top computer vision conferences and holds a CORE A* rating.

The conference usually runs over six days, with talks, tutorials, and workshops led by experts and academics. Alongside the learning and knowledge-sharing sessions, ECCV awards the Koenderink Prize to recognize crucial contributions to the development of computer vision.

This year, Tooploox-affiliated researchers have had five research papers accepted at the conference.

AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale

Active Visual Exploration (AVE) is a task that involves dynamically selecting observations (glimpses) that are critical for comprehending and navigating an environment. A typical example of this task can be seen when a rookie skier has to spot (and avoid!) the only tree on a ski track. One needs to ignore all the trees to either side, a hotel on a distant hill, or the snowboarder following behind, as avoiding the tree is vital. The same goes for autonomous vehicles, where attending to the nearest object takes priority over observing the horizon.

While modern AVE methods have demonstrated impressive performance, they are constrained to fixed-scale glimpses drawn from rigid grids. In contrast, mobile platforms equipped with optical zoom can capture glimpses of arbitrary position and scale; a swift zoom is no problem for a modern camera.

To address this gap between software and hardware capabilities, this paper introduces AdaGlimpse. It uses Soft Actor-Critic, a reinforcement learning algorithm tailored to exploration tasks, to select glimpses of arbitrary position and scale. This approach enables our model to rapidly establish a general awareness of the environment before zooming in for detailed analysis. 
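
To make the idea concrete, here is a minimal, illustrative sketch, our simplification rather than the paper’s code; the network sizes, the action-to-pixel mapping, and the minimum glimpse size are all assumptions. It shows a policy head that outputs a continuous (position, scale) glimpse action, in the spirit of a Soft Actor-Critic actor, and a helper that extracts the corresponding crop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlimpseActor(nn.Module):
    """Maps the agent's state to a continuous glimpse action.

    The tanh-squashed output is read as (center_x, center_y, scale), each
    in [-1, 1], mimicking how a Soft Actor-Critic policy emits continuous
    actions.
    """
    def __init__(self, state_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 3)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.net(state))

def take_glimpse(image: torch.Tensor, action: torch.Tensor, out: int = 32):
    """Crop a square patch at the requested position and scale, then resize."""
    _, h, w = image.shape
    cx = (action[0].item() + 1) / 2 * w                 # x in [-1,1] -> column
    cy = (action[1].item() + 1) / 2 * h                 # y in [-1,1] -> row
    size = min(int((action[2].item() + 1) / 2 * min(h, w)) + 8, h, w)
    x0 = max(0, min(w - size, int(cx - size / 2)))
    y0 = max(0, min(h - size, int(cy - size / 2)))
    patch = image[:, y0:y0 + size, x0:x0 + size]
    return F.interpolate(patch.unsqueeze(0), size=(out, out), mode="bilinear")

# Toy step: one glimpse from a random image, driven by a random agent state.
actor = GlimpseActor()
glimpse = take_glimpse(torch.rand(3, 224, 224), actor(torch.rand(128)))
print(glimpse.shape)  # torch.Size([1, 3, 32, 32])
```

In a full agent, the glimpse would update the state and the critic would score it, so the policy learns to survey the scene coarsely first and zoom in only where detail pays off.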

Experimental results demonstrate that AdaGlimpse surpasses previous methods across various visual tasks while maintaining greater applicability in realistic AVE scenarios. The research was delivered by a team consisting of Adam Pardyl, Michał Wronka, Maciej Wołczyk, Kamil Adamczewski, Tomasz Trzciński, and Bartosz Zieliński of Jagiellonian University, Warsaw University of Technology, and Tooploox. 

CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can potentially suppress its original open-vocabulary properties. In short, the model can tell that an image contains a car, but it cannot localize the “car” as a particular region within the image.

Meanwhile, self-supervised representation methods have demonstrated good localization properties without the need for human-made annotations or explicit supervision. 

In this work, we take the best of both worlds and propose an open-vocabulary semantic segmentation method which does not require any annotations. We propose a local improvement of dense MaskCLIP features, computed with a simple modification of CLIP’s last pooling layer, by integrating localization priors extracted from self-supervised features. In practice, the team probed the neural network on an unlabelled dataset to map the “distribution of knowledge” within the system, which is later exploited in the recognition process.

By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that the self-supervised feature properties we use can be learnt directly from CLIP features. Our method, CLIP-DINOiser, needs only a single forward pass of CLIP and two light convolutional layers at inference, with no extra supervision or extra memory, to reach state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes, and ADE20k.
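
Schematically, the correction behaves like feature-guided smoothing of the dense CLIP predictions. The sketch below is our simplification with random tensors, not the released code; in the actual method the guiding affinities are learned from CLIP features themselves with two light convolutional layers, and the temperature here is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

def dinoiser_style_refine(patch_feats, guide_feats, text_embeds):
    """Refine dense CLIP predictions with localization-aware pooling.

    patch_feats: (N, D)  dense MaskCLIP-style patch features
    guide_feats: (N, D') features with good localization properties
                 (in the paper, learned from CLIP itself; random here)
    text_embeds: (C, D)  class-name embeddings from the CLIP text encoder
    """
    # Per-patch class logits, as in MaskCLIP.
    logits = F.normalize(patch_feats, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    # Patch-to-patch affinity from the guiding features.
    sim = F.normalize(guide_feats, dim=-1) @ F.normalize(guide_feats, dim=-1).T
    affinity = torch.softmax(sim / 0.1, dim=-1)  # temperature is an assumption
    # Each patch's prediction becomes a weighted average over similar patches.
    return affinity @ logits

refined = dinoiser_style_refine(
    torch.randn(196, 512), torch.randn(196, 384), torch.randn(8, 512)
)
print(refined.argmax(dim=-1).shape)  # per-patch class assignment, (196,)
```

The effect is that patches belonging to the same object vote together, which removes the noisy, scattered predictions that raw MaskCLIP features produce.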

This research was delivered by a team consisting of Monika Wysoczanska, Oriane Simeoni, Michael Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, and Patrick Perez of Warsaw University of Technology, Valeo.ai, Meta AI, Tooploox, and IDEAS NCBR. The full paper may be found on arXiv.

Revisiting Supervision for Continual Representation Learning

In the field of continual learning, models are designed to learn tasks one after another. While most research has centered on supervised continual learning, there is a growing interest in unsupervised continual learning, which makes use of vast amounts of unlabeled data. 

Recent studies have highlighted the strengths of unsupervised methods, particularly self-supervised learning, in providing robust representations. The improved transferability of the representations built with self-supervised methods is often associated with the role played by a multi-layer perceptron projector. 

In this work, we start from this observation and reexamine the role of supervision in continual representation learning. We argue that additional information, such as human annotations, should not deteriorate the quality of representations. Since some research suggests that supervision may even reduce the effectiveness of the learning process, the team decided to test this hypothesis again with a slightly different approach.
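
Concretely, the projector mentioned above is a small MLP placed between the backbone and the training loss: the loss sees the projected features, while transfer and continual evaluation reuse the representation taken before the projector. Below is a minimal sketch of that standard setup, our illustration only; the stand-in backbone and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class BackboneWithProjector(nn.Module):
    """Encoder whose training loss is applied after an MLP projector.

    Downstream tasks reuse `h` (the pre-projector representation),
    which is what tends to transfer better.
    """
    def __init__(self, in_dim=512, proj_dim=128):
        super().__init__()
        self.backbone = nn.Linear(in_dim, in_dim)  # stand-in for a real encoder
        self.projector = nn.Sequential(
            nn.Linear(in_dim, in_dim), nn.ReLU(), nn.Linear(in_dim, proj_dim)
        )

    def forward(self, x):
        h = self.backbone(x)   # representation kept for transfer
        z = self.projector(h)  # representation fed to the loss
        return h, z

h, z = BackboneWithProjector()(torch.randn(4, 512))
print(h.shape, z.shape)  # torch.Size([4, 512]) torch.Size([4, 128])
```

The design choice under scrutiny is exactly this split: if the projector, not the absence of labels, is what protects the backbone representation, then supervised losses equipped with the same projector should fare well in continual settings too.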

This research was performed by Daniel Marczak, Sebastian Cygert, Tomasz Trzciński, and Bartłomiej Twardowski of IDEAS NCBR, Warsaw University of Technology, Gdańsk University of Technology, Tooploox, the Autonomous University of Barcelona, and the Computer Vision Center. The full paper can be found on arXiv.

CAMP: Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery

Generalized Continual Category Discovery (GCCD) tackles learning from sequentially arriving, partially labeled datasets while uncovering new categories. For example, in a dataset of animals, only cats and dogs are labeled. The machine still distinguishes the other animals, but it proposes the labels internally: the system can tell an elephant from a hamster, yet the labels it generates obviously have no connection to the names of the real entities.

Traditional methods depend on feature distillation to prevent the forgetting of old knowledge. However, this strategy restricts the model’s ability to adapt and to effectively distinguish new categories. Following the example above: when the category of “elephant” is introduced, the system loses some of its ability to recognize cats and dogs, because after training every animal now slightly resembles an elephant.

To address this, we introduce a novel technique integrating a learnable projector with feature distillation, thus enhancing model adaptability without sacrificing past knowledge. The resulting distribution shift of previously learned categories is mitigated with an auxiliary category adaptation network. We demonstrate that while each component offers modest benefits individually, their combination, dubbed CAMP (Category Adaptation Meets Projected distillation), significantly improves the balance between learning new information and retaining the old.
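
The core trick can be summarized in a few lines: instead of forcing the new features to match the old ones directly, a learnable projector absorbs the representation shift, leaving the backbone free to adapt. Below is a minimal sketch of such a projected distillation loss, our illustration rather than the authors’ code; the feature dimension and single linear projector are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 256
projector = nn.Linear(feat_dim, feat_dim)  # learnable, trained with the model

def projected_distillation_loss(new_feats, old_feats):
    """Match old features only through the projector.

    The backbone can move its representation to fit new categories; the
    projector learns to map the new features back onto the frozen old ones.
    """
    return F.mse_loss(projector(new_feats), old_feats.detach())

# Toy step: features from the current model vs. a frozen snapshot.
loss = projected_distillation_loss(torch.randn(16, feat_dim),
                                   torch.randn(16, feat_dim))
loss.backward()  # here gradients reach the projector; in training, the backbone too
print(float(loss))
```

Compare this with plain feature distillation, which would penalize any deviation of the new features from the old ones and thereby lock the representation in place.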

This research was performed by Grzegorz Rypeść, Daniel Marczak, Sebastian Cygert, Tomasz Trzciński, and Bartłomiej Twardowski of IDEAS NCBR, Warsaw University of Technology, Gdańsk University of Technology, Tooploox, the Autonomous University of Barcelona, and the Computer Vision Center. The full paper can be found on arXiv.

MagMax: Leveraging Model Merging for Seamless Continual Learning

This paper introduces a continual learning approach named MagMax, which utilizes model merging to enable large pre-trained models to continuously learn from new data without forgetting previously acquired knowledge. Distinct from traditional continual learning methods that aim to reduce forgetting during task training, MagMax combines sequential fine-tuning with a maximum magnitude weight selection for effective knowledge integration across tasks. 
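
The merging step itself is simple to illustrate. In the minimal sketch below, our simplification rather than the authors’ code, with toy state dicts standing in for real checkpoints, each parameter’s task vectors (fine-tuned weights minus pre-trained weights) are compared across tasks and the entry with the largest magnitude wins:

```python
import torch

def magmax_style_merge(pretrained, finetuned):
    """Merge fine-tuned checkpoints by max-magnitude task-vector selection.

    For every parameter, build task vectors (fine-tuned minus pre-trained),
    keep the element with the largest absolute value across tasks, and add
    it back onto the pre-trained weights.
    """
    merged = {}
    for name, theta_pre in pretrained.items():
        # Task vectors for this parameter, stacked along a new task axis.
        taus = torch.stack([ckpt[name] - theta_pre for ckpt in finetuned])
        # Index of the task with the largest-magnitude update per element.
        winner = taus.abs().argmax(dim=0, keepdim=True)
        tau_max = torch.gather(taus, 0, winner).squeeze(0)
        merged[name] = theta_pre + tau_max
    return merged

# Toy usage with random "checkpoints" of a single 2x2 weight matrix.
pre = {"w": torch.zeros(2, 2)}
tasks = [{"w": torch.randn(2, 2)} for _ in range(3)]
print(magmax_style_merge(pre, tasks)["w"])
```

Because the selection happens after training, the per-task fine-tuning itself needs no anti-forgetting machinery, which is what makes the approach seamless.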

A detailed description of the research can be found in our blog post about MagMax model merging for continual learning.

Summary

This year’s edition of ECCV will take place in Milan, Italy, with researchers attending from all around the world. The event will start on September 29th and will continue until October 4th.