FoleyGenEx

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

Abstract

We introduce FoleyGenEx, a unified framework for video-to-audio (VTA) generation that integrates multi-modal control, frame-level temporal alignment, and fine-grained semantic expressivity, enabling synchronized, versatile, and expressive audio synthesis across diverse tasks. Existing VTA methods either offer multi-modal control with weak temporal alignment or achieve strong alignment while lacking reference audio conditioning and semantic precision. FoleyGenEx bridges this gap through three key innovations: a conditional injection mechanism enabling audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving synchronization during multi-modal training, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enrich audio representations and textual supervision with nuanced semantic cues. Experiments on AudioCaps, VGGSound, and Greatest Hits show that FoleyGenEx delivers competitive performance in controllable VTA generation, achieving strong temporal fidelity, versatile multi-modal control, and fine-grained semantic precision compared to existing methods.

Method

Figure 1: FoleyGenEx supports a range of multi-modal controlled audio generation tasks, including Text-to-Audio (TTA), basic Video-to-Audio (VTA), Text-Controlled VTA (TC-VTA), Audio-Controlled VTA (AC-VTA), and Foley extension (FE). It unifies these tasks while achieving strong synchronization, versatile control, and expressive audio generation.

Figure 2: FoleyGenEx training framework.

Figure 3: Multi-modal controlled audio generation tasks.

FoleyGenEx Demo Gallery

Text-Controlled Video-to-Audio

MultiFoley

Bird chirping.

Male speaking.

Rooster crowing.

Sheep bleating.

FoleyGenEx (Ours)

Bird chirping.

Male speaking.

Rooster crowing.

Sheep bleating.

MultiFoley

Cat meowing.

Horse neighing.

Lion roaring.

FoleyGenEx (Ours)

Cat meowing.

Horse neighing.

Lion roaring.

MultiFoley

Typewriter.

Playing piano.

Typing on computer keyboard.

FoleyGenEx (Ours)

Typewriter.

Playing piano.

Typing on computer keyboard.

MultiFoley

Playing cello.

Playing erhu.

Chainsawing trees.

FoleyGenEx (Ours)

Playing cello.

Playing erhu.

Chainsawing trees.

Cat meowing.

Tiger roaring.

Audio-Controlled Video-to-Audio (Material)

Inference conditions: a 2-second audio snippet from the reference video and a 2-second snippet copied from the target video

MMAudio

Reference Audio

Result 1

Result 2

FoleyGenEx (Ours)

Reference Audio

Result 1

Result 2

MMAudio

Reference Audio

Result 1

Result 2

FoleyGenEx (Ours)

Reference Audio

Result 1

Result 2

Reference Audio

Result

Reference Audio

Result

Audio-Controlled Video-to-Audio (Audio Event)

Inference conditions: a 2-second audio snippet from the reference video and a 2-second snippet copied from the target video

Reference Audio

Result 1

Result 2

Reference Audio

Result 1

Result 2

Reference Audio

Result 1

Result 2

Audio Continuation

Input: the first 5 seconds of the audio clip

Result

Input: the first 5 seconds of the audio clip

Result

Input: the first 5 seconds of the audio clip

Result

Input: the first 5 seconds of the audio clip

Result

Input: the first 5 seconds of the audio clip

Result

Input: the first 5 seconds of the audio clip

Result

Input: the first 5 seconds of the audio clip

Result

Input: the first 5 seconds of the audio clip

Result

Input: the first 5 seconds of the audio clip

Result

Editing

Edit the 0-3s segment using the text prompt "pouring milk"

Result

Regenerate the 7.5-10s segment based on the 0-7.5s segment

Result

Regenerate the 2-5s segment based on the rest of the segments

Result

Regenerate the 0-1.5s segment based on the rest of the segments

Result

Regenerate the 0-2s segment based on the rest of the segments

Result

Regenerate the 0-2s segment based on the rest of the segments

Result

Regenerate the 0-5s segment based on the rest of the segments

Result
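
The segment edits listed above can be pictured as a per-frame mask over the audio timeline: frames inside the chosen segment are regenerated, and the remaining frames are kept as context. The sketch below illustrates only that idea and is not the FoleyGenEx implementation. The function name `regeneration_mask`, the 10-second clip length, and the rate of 10 frames per second are assumptions chosen for illustration.

```python
def regeneration_mask(start_s, end_s, clip_s=10.0, frames_per_second=10):
    """Return a list of booleans, one per frame; True marks frames to regenerate.

    clip_s and frames_per_second are illustrative assumptions, not values
    taken from the FoleyGenEx page.
    """
    if not 0.0 <= start_s < end_s <= clip_s:
        raise ValueError("segment must satisfy 0 <= start < end <= clip length")
    n_frames = round(clip_s * frames_per_second)
    first = round(start_s * frames_per_second)  # first frame to regenerate
    last = round(end_s * frames_per_second)     # one past the last such frame
    return [first <= i < last for i in range(n_frames)]


# "Regenerate the 7.5-10s segment based on the 0-7.5s segment"
mask = regeneration_mask(7.5, 10.0)
kept = mask.count(False)         # frames used as context
regenerated = mask.count(True)   # frames to be synthesized again
```

Under these assumed settings, the same helper covers the other entries in this section by changing only the start and end times, for example `regeneration_mask(2.0, 5.0)` for the 2-5s case.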

Adverb-Augmented (MMAudio)

Test Caption	MMAudio (w/o AA)	MMAudio (w/ AA)
A dog runs excitedly from a distance to nearby.
A commercial airliner flies farther away gradually.
Guests are chatting and laughing in the distance.
Ambient sounds of a remote village.
Heavy rain, and terrifying thunder rings out in a distant place.
In the distance along the road, an artist is playing a lively Swiss folk song.
A black cat lets out a soft meow.
A beautiful woman sits in front of the piano and plays rapidly.
The soft tapping sound of metal.
Rapid footsteps echo in the corridor.

Prompts

Audio augmentation method	Prompt
Speed Augmentation	This sentence is a description of the sound content heard in the audio: {original caption}. The enhancement strategy applied to this audio is: {audio augmentation strategy}. The description of the original speed of this audio is: {adverb in the original caption}. Please rewrite the description of the sound considering this enhancement. Do not imagine non-existent sound content; avoid comparing with music characteristics if not relevant; exclude specific place names, people's names, or other proper nouns. The output caption should be within 30 words and be able to distinguish among the following three situations based on the enhanced sound: when the original speed was described as fast and enhanced by speeding up; when the original speed was described as fast but enhanced by slowing down or when the original speed was described as slow but enhanced by speeding up; when the original speed was described as slow and enhanced by slowing down. The rewritten caption should only describe the sound after enhancement without explicitly stating the enhancement strategy. Describe the speed in the rewritten caption based on the effect of applying the {audio augmentation strategy} to the {adverb in the original caption}. For example, when speeding up a slow-moving sound or slowing down a fast-moving sound, describe the speed as moderate; when speeding up a fast-moving sound, describe it as very fast; when slowing down a slow-moving sound, describe it as very slow.
Distance Augmentation	This sentence is a description of the sound content heard in the audio: {original caption}. The enhancement strategy applied to this audio is: {audio augmentation strategy}. When the enhancement strategy is marked as 'Far', it indicates that the enhanced audio has a significant amount of reverberation. In the rewritten caption, focus on expressing this characteristic of the sound. When the enhancement strategy is marked as 'Very Far', it means the enhanced audio gives the impression of being in the distance. The rewritten description should convey this sense of distance. Do not fabricate non-existent sound content; avoid making comparisons with music characteristics if they are not relevant; exclude specific place names, people's names, or other proper nouns. The output should be within 30 words. The rewritten caption should only describe the sound after enhancement without explicitly stating the enhancement strategy.
Volume Dynamic Augmentation	This sentence is a description of the sound content heard in the audio: {original caption}. The enhancement strategy applied to this audio is: {audio augmentation strategy}. If the enhancement strategy is 'Crescendo', rewrite the sound description to convey that the enhanced audio seems to approach from a distance with gradually increasing volume. If the strategy is 'Decrescendo', rewrite it to show that the enhanced audio appears to move away into the distance with gradually decreasing volume. Do not imagine non-existent sound content; avoid comparing with music characteristics if not relevant; exclude specific place names, people's names, or other proper nouns. The output should be within 30 words. The rewritten caption should only describe the sound after enhancement without explicitly stating the enhancement strategy.
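
The prompts above imply two small pieces of logic that a short sketch can make concrete: the speed-label rule from the Speed Augmentation prompt, and the Crescendo/Decrescendo volume change named in the Volume Dynamic Augmentation prompt. This is an illustration, not the authors' code. Only the label mapping (moderate, very fast, very slow) comes from the prompt text. The function names, the strategy identifiers `speed_up` and `slow_down`, the linear shape of the ramp, and the `floor` gain of 0.1 are assumptions.

```python
def describe_speed(original, strategy):
    """Map the original speed adverb and the augmentation to a speed label.

    original: "fast" or "slow"; strategy: "speed_up" or "slow_down".
    These identifiers are hypothetical; the mapping follows the prompt text.
    """
    if original not in ("fast", "slow"):
        raise ValueError("original must be 'fast' or 'slow'")
    if strategy not in ("speed_up", "slow_down"):
        raise ValueError("strategy must be 'speed_up' or 'slow_down'")
    if original == "fast" and strategy == "speed_up":
        return "very fast"
    if original == "slow" and strategy == "slow_down":
        return "very slow"
    # Fast sound slowed down, or slow sound sped up.
    return "moderate"


def apply_volume_ramp(samples, strategy, floor=0.1):
    """Scale samples with a linear gain ramp between `floor` and 1.0.

    "Crescendo" rises from floor to 1.0; "Decrescendo" falls from 1.0 to floor.
    The linear ramp and the floor value are assumptions for illustration.
    """
    if strategy not in ("Crescendo", "Decrescendo"):
        raise ValueError("strategy must be 'Crescendo' or 'Decrescendo'")
    n = len(samples)
    out = []
    for i, s in enumerate(samples):
        t = i / (n - 1) if n > 1 else 1.0  # position in the clip, 0.0 to 1.0
        if strategy == "Decrescendo":
            t = 1.0 - t
        out.append(s * (floor + (1.0 - floor) * t))
    return out


label = describe_speed("slow", "speed_up")               # "moderate"
louder = apply_volume_ramp([1.0, 1.0, 1.0], "Crescendo")
```

Keeping the label rule as a plain function makes it easy to check that a rewritten caption agrees with the augmentation actually applied to the audio.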

Acknowledgements

The videos in this demo page are sourced from the following:

Videos from the demo page of MultiFoley
Videos from Kling
Videos from Greatest Hits Dataset
Some videos crawled from the internet. These videos are used solely for demonstration purposes and we do not claim any copyright. If any content infringes upon your rights, please contact us and we will remove it immediately.