Using Hypernetworks for Image Recontextualization

  • Scope: Artificial Intelligence

A photograph of General Ulysses S. Grant visiting his army, sitting with bold posture on a horse, was truly inspiring and must have required great skill from the photographer. Yet upon closer inspection, suspicions arise. 

Grant was renowned for his skill with horses, yet his posture is odd, not to mention the slightly unnatural angle of his head. And he is clearly wearing a uniform from a different time period. 

A team of researchers from the Library of Congress found that the image was, in fact, a composite of three other photos. The head was indeed General Ulysses S. Grant’s, but the body and horse originally belonged to Major General Alexander McDowell McCook. And in the background, there was no army… but Confederate prisoners of war. 

This photograph was arguably the first example of image recontextualization – Ulysses S. Grant’s head was “recontextualized” onto another body, and this mashup was later recontextualized into the image showing tents. 

Since then, the art of photo manipulation has come an unimaginably long way, with people now completely indifferent to images of cars in space, food floating in perfectly unnatural environments, and people who are simply too perfect to be real.

What connects transplanting the head of President Grant to preparing modern commercial or fashion photography, enriched with retouches, is the intense need for human labor. Yet this, too, will soon change thanks to AI tools. 

Image recontextualization basics

Taking part of an existing image and using it elsewhere (for example, against another background or with some elements replaced) is the core of image recontextualization. Abundant today, the practice first appeared at nearly the dawn of photography, and even earlier in portraits and paintings, with kings made taller and athletes made more muscular.

Classical recontextualization requires skill and hours of human work. And when dozens, if not hundreds, of images need to be recontextualized, for example after a fashion photo shoot, a small army of designers is needed. If a company wishes to recontextualize second-hand pictures, for example those submitted by external parties, the workload effectively scales without limit. 

That’s why AI was used to streamline the process. Yet this was far from easy. 

Challenges for AI recontextualization

The AI systems used for recontextualization are powerful and capable of deceiving even the most seasoned news consumers. The famous photo of Pope Francis wearing a puffer jacket is a great example of AI-delivered recontextualization.

Yet that was a single image that required little prompting and tinkering. It also avoided the challenges typical of the fashion industry, as the puffer jacket in question doesn’t even exist. Tooploox and eBay researchers have shown that performing this type of manipulation at scale comes with some challenges:

  • Fine-tuning – to keep all the features of a piece of clothing or accessory consistent and make the image look correct, the model requires a good deal of fine-tuning, sometimes down to fitting a single image. For the Pope-wearing-Balenciaga internet meme, losing the pattern or changing the proportions is acceptable, yet in commercial fashion photography, it is a deal breaker.
  • Prompt engineering – delivering an AI-generated recontextualized image also requires extensive prompt engineering, with a human specialist overseeing the process. Without that, the images may not reach the desired quality.

These challenges have made AI solutions extremely hard to implement for mass recontextualization as seen in the fashion industry or internet ads. Yet the Tooploox research team, together with eBay researchers, delivered a solution: HyperLoRA. 

Solution: HyperLoRA

HyperLoRA is, in essence, a combination of LoRA and hypernetworks. This may sound confusing, and understanding these two concepts is the first step toward understanding how the system works. 

What is LoRA (Low-Rank Adaptation)

LoRA, or Low-Rank Adaptation, is a fine-tuning technique commonly used with Large Language Models. It adapts a machine learning model to a new task without changing the entire model. Retraining an entire model for a new task is possible, yet when the model itself is big (think Stable Diffusion or GPT-4), the cost of the operation would be tremendous. 

LoRA can be compared to adding a new part to the full network which injects new knowledge and skills into the system. 
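The "new part" idea can be made concrete with a minimal PyTorch sketch, assuming the standard LoRA formulation (the frozen weight W gets a trainable low-rank update B·A); the class name, rank, and scaling here are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update.

    Instead of retraining the full weight matrix W (d_out x d_in),
    LoRA learns two small matrices A (rank x d_in) and B (d_out x rank),
    so the effective weight becomes W + scale * B @ A, with far fewer
    trainable parameters (rank << d_in, d_out).
    """

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# 4*512 + 512*4 = 4096 trainable values vs. 262,656 in the base layer
```

Only A and B receive gradients during fine-tuning, which is why the cost stays small even when the base model is enormous.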

What is a hypernetwork

A hypernetwork is a network that generates the weights of another network. The hypernetwork predicts the parameters of the target model so that it adapts to a particular context, which reduces the cost of modifying the network's output: with a hypernetwork in place, there is no need to fine-tune the target model itself.
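The idea can be sketched in a few lines of PyTorch, assuming a toy setup in which a conditioning embedding (for example, of a garment image) is mapped to the full weights of a small linear layer; the names and dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """A small network that *generates* the weights of a target linear
    layer from a conditioning vector.

    The target layer has no trainable parameters of its own; its weights
    are predicted on the fly, so adapting to a new context means feeding
    a new embedding rather than fine-tuning.
    """

    def __init__(self, embed_dim: int, d_in: int, d_out: int):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        self.generator = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, d_in * d_out + d_out),  # flattened weights + bias
        )

    def forward(self, x, context):
        params = self.generator(context)
        W = params[: self.d_in * self.d_out].view(self.d_out, self.d_in)
        b = params[self.d_in * self.d_out:]
        return x @ W.T + b

hyper = HyperNetwork(embed_dim=32, d_in=16, d_out=8)
x = torch.randn(4, 16)
context = torch.randn(32)  # e.g. an embedding of a product photo
y = hyper(x, context)      # shape (4, 8)
```

Only the generator is trained; at inference time, each new context vector instantly yields a new set of target-layer weights.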

Works as a plug-in model

The system works as a plug-in model, so it can be attached to another solution to support existing workflows. For example, if a marketplace runs an automated ad workflow, HyperLoRA can be attached to recontextualize the images for multiple ad creations without the support of a graphic designer.
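Putting the two ideas together, the combination can be sketched as follows: a hypernetwork predicts the LoRA matrices for a frozen base layer from a conditioning embedding. This is an illustrative toy version under that assumption, not the paper's implementation:

```python
import torch
import torch.nn as nn

class HyperLoRASketch(nn.Module):
    """Toy combination of the two ideas: a hypernetwork predicts the
    LoRA matrices A and B for a frozen base layer from a conditioning
    embedding, so each new input context yields its own low-rank
    adapter without any per-image fine-tuning."""

    def __init__(self, base: nn.Linear, embed_dim: int, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base model stays untouched
        d_out, d_in = base.weight.shape
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        # Predict both flattened LoRA matrices from the embedding
        self.hyper = nn.Linear(embed_dim, rank * (d_in + d_out))

    def forward(self, x, context):
        params = self.hyper(context)
        A = params[: self.rank * self.d_in].view(self.rank, self.d_in)
        B = params[self.rank * self.d_in:].view(self.d_out, self.rank)
        return self.base(x) + x @ A.T @ B.T

block = HyperLoRASketch(nn.Linear(64, 64), embed_dim=32)
x = torch.randn(2, 64)
emb = torch.randn(32)  # hypothetical garment embedding
out = block(x, emb)    # shape (2, 64)
```

Because the adapter weights are generated rather than trained, swapping in a new garment means swapping in a new embedding, which is what makes the plug-in usage above possible at scale.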

The effect: 

The model uses a hypernetwork to predict the LoRA parameters, significantly reducing the workload required. This method offers advantages over previous ones, including:

  • No fine-tuning – the model works nearly out of the box, with a significantly reduced amount of work required in production.
  • More accurate adjustments – the images generated with HyperLoRA look more natural and convincing compared to previous state-of-the-art methods, like MagicClothing or IP-Adapter. Experiments show the system’s effectiveness in garment-to-model recontextualization.
  • Applicability across multiple modalities – HyperLoRA operates as a plug-in network, so it can be used to enrich or support workflows that use other types of data. 

The work can be further developed by extending the types of data compatible with HyperLoRA, for example replacing images with encodings, texts, or videos. 

This work was presented during the NeurIPS 2024 conference in Vancouver. The team of researchers consisted of Maciej Zięba, Jakub Balicki, Tomasz Dróżdż, Konrad Karanowski, Paweł Lorek, Hong Lyu, Aleksander Skorupa, Tomasz Trzciński, Oriol Caudevilla Torras, and Jakub M. Tomczak, all affiliated with eBay and Tooploox, except Hong Lyu and Oriol Caudevilla Torras, who work exclusively for eBay. 

The full paper can be found in OpenAccess.
