Artificial Intelligence (AI)

MultiTalk: Generating Multi-Person Conversations from Just Audio

Researchers from Meituan, HKUST, and Sun Yat-sen University have introduced MultiTalk, a framework for generating multi-person conversational videos driven by audio. Unlike earlier methods, which focused only on single-person talking heads, MultiTalk handles multi-stream audio, keeps each person's lip movements in sync with their own audio stream, and follows detailed scene instructions such as “a man and a woman were talking, and then they kissed.”

A key innovation is Label Rotary Position Embedding (L-RoPE), which helps bind each audio stream to the correct person. The model also preserves its instruction-following ability through two training strategies: partial parameter training and multi-task training.
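
To give a rough feel for how a label-based rotary embedding can tie an audio stream to a person, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the label values 2 and 22, and the toy feature vector are all assumptions made for illustration. It relies only on a general property of rotary embeddings, namely that the attention score between a rotated query and a rotated key depends on the difference between their labels.

```python
import math

def rotate_by_label(vec, label, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `label`.

    This is the usual rotary-embedding rotation, with a label standing in
    for the position index (hypothetical helper, for illustration only).
    """
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = label * base ** (-i / d)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, q_label, k, k_label):
    """Dot-product attention score after rotating q and k by their labels."""
    rq = rotate_by_label(q, q_label)
    rk = rotate_by_label(k, k_label)
    return sum(a * b for a, b in zip(rq, rk))

# Assumed labels: person A's video region and audio stream share label 2,
# person B's audio stream uses label 22.
v = [1.0, 0.0, 0.5, 0.5]          # toy feature, used as both query and key
same = score(v, 2, v, 2)          # video region A vs. audio stream A
other = score(v, 2, v, 22)        # video region A vs. audio stream B
shifted = score(v, 5, v, 25)      # same label gap as `other`

print(same > other)                    # matching labels score higher here
print(abs(other - shifted) < 1e-9)     # only the label gap matters
```

In a real model the queries and keys are learned projections rather than identical vectors, so the sketch only shows that the relative label offset shapes the score, which is the lever a label-based scheme can use to steer each video region toward its own audio stream.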

Potential use cases range from virtual actors to e-commerce livestreams.

https://arxiv.org/pdf/2505.22647v1