Among those who attended Richard Feynman’s famous lectures, the majority are likely now dead – for obvious reasons. The brilliant scientist, born during the closing months of World War I (May 1918), delivered his most popular lectures at Caltech in the 1960s.
Luckily for us, his lectures were recorded by Caltech and the BBC, so new generations of physicists can enjoy Feynman’s sense of humour, his teaching mastery, and his unusual ability to explain the complexities of theoretical physics in words understandable to non-experts.
But what about the thousands of brilliant minds of the past who didn’t leave recordings? Benjamin Franklin, who not only experimented with electricity but also invented the to-do list, the foundation of modern project management? Maria Skłodowska-Curie, with a mind so infectiously brilliant that not only did she win the Nobel Prize (two, in fact), but so did her husband (Pierre Curie), one of her two daughters (Irène Joliot-Curie), and her son-in-law (Frédéric Joliot-Curie)? Hildegard of Bingen? Or maybe Homer himself, retelling the Iliad?
How wonderful would it be to see and hear them speaking?
With AI technology, these wonders are shifting from fiction to reality. But not without challenges.
Challenges in generating talking heads with AI
There are many challenges with generating a realistic image of a person speaking. These include, but are not limited to:
- Emotional expressions – facial expressions need to fit the emotions reflected in speech.
- Lip synchronization – the lip motions need to fit the words spoken. Even untrained people can easily spot oddities and mismatches in lip movements and words spoken. The challenge can easily be seen when dubbing movies: https://www.youtube.com/watch?v=qp83h_uUkOM
- Natural movement patterns – when a speaking person is unnaturally still, with only the lips moving, it produces a disturbing effect. Additional movements, like blinking eyes, dynamic head motion, taking a breath, or shifting hair, are a necessary part of producing a convincing video of a person speaking.
- Long term consistency – currently it is easy to deliver a 20-second video of a person speaking, but creating a longer piece, for example the entirety of one of Madame Curie’s lectures on radiochemistry, would be utterly impossible.
Or – would it?
Introducing KeyFace – a way to deliver talking head videos
A team composed of Tooploox researchers, alongside researchers from Imperial College London, the University of Wrocław, and Wrocław University of Science and Technology, delivered a new way to generate audio-driven facial animation from recorded speech and an image of a speaker. The model in action can be seen on the project page.
How does it work?
Instead of generating the full video “on the go,” as current models do, KeyFace first generates two (or more) distant keyframes that directly match the audio file containing the speech. These frames are produced in lower quality, with the audio file serving as “ground truth”: the priority is fitting the emotional state of the voice sample rather than visual fidelity.
In the second stage, an interpolation model animates the full sequence by filling the gaps between the reference frames, upscaling the quality and delivering a smoother, better video.
Both stages use Stable Video Diffusion as a foundation. More details, including technical information and related work, can be found in the paper published on arXiv. The paper was co-authored by Tooploox’s Head of R&D.
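The two-stage flow described above can be sketched in a few lines of code. This is purely an illustrative toy, not the paper’s implementation: the function names, the keyframe spacing, and the linear blending standing in for the diffusion-based interpolator are all assumptions made for the sketch.

```python
import numpy as np

def generate_keyframes(audio, identity_frame, keyframe_interval):
    """Stage 1 (stub): produce sparse keyframes conditioned on the audio.

    In KeyFace this is a generative model; here we merely tint the
    identity frame so consecutive keyframes differ.
    """
    n_keyframes = len(audio) // keyframe_interval + 1
    return [identity_frame * (0.9 + 0.1 * (k % 2)) for k in range(n_keyframes)]

def interpolate(keyframes, frames_between):
    """Stage 2 (stub): fill the gaps between consecutive keyframes.

    The real model is another generative network; here we linearly blend,
    which captures only the idea of in-betweening, not its quality.
    """
    video = []
    for a, b in zip(keyframes, keyframes[1:]):
        for t in range(frames_between):
            alpha = t / frames_between
            video.append((1 - alpha) * a + alpha * b)
    video.append(keyframes[-1])  # close the sequence on the last keyframe
    return video

# Toy inputs: 1 second of "audio" at 16 kHz and a 64x64 grayscale portrait.
audio = np.zeros(16_000)
portrait = np.ones((64, 64))

keyframes = generate_keyframes(audio, portrait, keyframe_interval=4_000)
video = interpolate(keyframes, frames_between=5)
print(len(keyframes), len(video))  # sparse anchors vs. full frame sequence
```

The point of the split is visible even in this stub: the expensive, audio-conditioned step only has to produce a handful of anchor frames, while the interpolator densifies them into a full video.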
The effect
Using this architecture enabled the research team to deliver a state-of-the-art solution, outperforming existing approaches in the research community and on the market alike.
The system not only delivers plausible-looking talking head animation, but also tackles challenges like:
- Continuous emotion modelling – the system keeps emotions consistent, ensuring that the facial expressions match the mood of the spoken text. With the reference frames in place, the risk of “losing” the mood in a longer video is minimized.
- Integration of non-speech vocalizations – the system incorporates additional vocalizations like sighs, coughs or breathing.
Practical applications
Talking heads may be one of the most boring formats imaginable, familiar from news reporting and talk shows. But with the ability to use AI to infuse still images with life, there are multiple real-life applications. Off-the-top-of-the-head examples include:
- Education – as mentioned above, building more entertaining and interesting educational content, where Albert Einstein himself explains the theory of relativity, Baldwin the Leper King guides students through the fate of the Kingdom of Jerusalem, or Hildegard of Bingen tells them about the history of music. Basically, any historical (or fictional – for example, Jason, leader of the Argonauts) celebrity can support the learning process in a more efficient and fun way.
- Media and entertainment – the same goes for less serious applications. Talking heads can be generated for a fictional entity to bring joy – one can think about asking the aforementioned historical celebrities for a short commentary on current affairs.
Summary
The paper will be presented at the upcoming CVPR 2025 conference, taking place in Nashville, TN, starting on the 11th of June. More details on the upcoming event can be found on the event’s webpage.