Video to Audio AI Generator

🎬 Video generation models are advancing at an incredible pace, but many current systems can only generate silent output. One of the next major steps toward bringing generated movies to life is creating soundtracks for these silent videos. ✨

Today, we’re excited to share our latest progress on video-to-audio generation, powered by A2E’s ThinkSound technology — a breakthrough that enables synchronized audiovisual creation from video and text prompts. 🚀

A2E seamlessly combines 🎞️ video pixels with 📝 natural language cues to generate immersive soundscapes that align perfectly with the on-screen action.
Whether it’s a 🎻 dramatic musical score, 💥 realistic sound effects, or 🗣️ character-specific dialogue, the system dynamically matches sound to the visual context and emotional tone of the video.

Our technology integrates smoothly with Image to Video, making it possible to create fully realized audiovisual scenes without manual sound design. ✨

Beyond AI-generated content, A2E can also enrich traditional footage — from 📼 archival material to 🕰️ silent films — by automatically generating compelling audio tracks.
This opens up new possibilities for storytelling, 🛠️ restoration, and 🎨 creative expression across industries. Come and try it!

Prompt for audio: Cinematic, horror film, music, tension, ambience, footsteps

Prompt for audio: Jellyfish pulsating under water, marine life, ocean

Prompt for audio: Steam locomotive with whistle blowing, clattering on railway tracks, and surrounding ambient sounds

Prompt for audio: A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd

Prompt for audio: Cars skidding, car engine throttling, electronic music

Prompt for audio: Rhythmic drip-drop rain for a peaceful mood

Prompt for audio: A lone wolf howls as the sun sets over the open prairie

Prompt for audio: Electrical noise paired with storytelling background music

Enhanced Creative Control

A2E’s Video to Audio technology, powered by ThinkSound, enables users to generate unlimited audio tracks from any video input. By setting custom cues, users can guide the sound generation to match the desired mood, timing, or action.

With ThinkSound’s advanced audio synthesis, users gain full control and flexibility—making it easy to experiment with multiple versions and choose the perfect audio match.

Prompt for audio: A city street at night with passing cars, echoing footsteps, and upbeat, dynamic music

Prompt for audio: Nighttime street with faint car sounds, echoing footsteps, and creepy ambient horror music

Prompt for audio: Quiet street at night with soft car sounds, gentle footsteps, and calm, storytelling background music
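
The three prompts above reuse one clip and change only the audio cue. A minimal sketch of how such a batch of variants might be prepared is shown below. It is an illustration only: the function `build_variant_requests`, the field names `video`, `prompt`, and `variant`, and the file name `street_night.mp4` are all hypothetical and are not part of any published A2E or ThinkSound API.

```python
from typing import Dict, List


def build_variant_requests(video_path: str, prompts: List[str]) -> List[Dict[str, str]]:
    """Pair one video with several audio prompts, one request per prompt.

    Hypothetical helper: the dictionary keys are illustrative, not a real API schema.
    """
    requests: List[Dict[str, str]] = []
    for index, prompt in enumerate(prompts, start=1):
        cleaned = prompt.strip()
        if not cleaned:
            continue  # skip blank prompts rather than sending an empty cue
        requests.append({"video": video_path, "prompt": cleaned, "variant": f"v{index}"})
    return requests


# The three street-at-night prompts from the examples above.
STREET_PROMPTS = [
    "A city street at night with passing cars, echoing footsteps, and upbeat, dynamic music",
    "Nighttime street with faint car sounds, echoing footsteps, and creepy ambient horror music",
    "Quiet street at night with soft car sounds, gentle footsteps, and calm, storytelling background music",
]

variants = build_variant_requests("street_night.mp4", STREET_PROMPTS)
for request in variants:
    print(request["variant"], "->", request["prompt"])
```

Keeping the video fixed and varying only the prompt makes it easy to compare the resulting tracks side by side and keep the one that fits best.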

How it works

ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning:

  1. Foley Generation: Generate foundational, semantically and temporally aligned soundscapes from video.
  2. Object-Centric Refinement: Refine or add sounds for user-specified objects via clicks or regions in the video.
  3. Targeted Audio Editing: Modify generated audio using high-level natural language instructions.

[Figure: Video to Audio AI Generator workflow]
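
The three stages can be read as a simple pipeline in which each stage takes the previous result and refines it. The sketch below models only that control flow, under stated assumptions: the class `AudioDraft` and the functions `foley_generation`, `object_centric_refinement`, `targeted_audio_editing`, and `run_pipeline` are hypothetical names chosen for illustration, and each stage records what it would do instead of synthesizing audio. It is not ThinkSound's implementation or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Region = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height) in pixels


@dataclass
class AudioDraft:
    """Hypothetical container: a log of the steps applied, not real audio."""
    video_path: str
    steps: List[str] = field(default_factory=list)


def foley_generation(video_path: str, prompt: Optional[str] = None) -> AudioDraft:
    # Stage 1: base soundscape aligned to the video; the text prompt is optional.
    draft = AudioDraft(video_path)
    draft.steps.append(f"foley(prompt={prompt!r})")
    return draft


def object_centric_refinement(draft: AudioDraft, region: Region, label: str) -> AudioDraft:
    # Stage 2: refine or add sound for a user-selected region of the video.
    draft.steps.append(f"refine(region={region}, label={label!r})")
    return draft


def targeted_audio_editing(draft: AudioDraft, instruction: str) -> AudioDraft:
    # Stage 3: modify the generated audio from a natural language instruction.
    draft.steps.append(f"edit(instruction={instruction!r})")
    return draft


def run_pipeline(
    video_path: str,
    prompt: Optional[str] = None,
    region: Optional[Region] = None,
    label: str = "",
    instruction: Optional[str] = None,
) -> AudioDraft:
    """Run stage 1 always; run stages 2 and 3 only when the user supplies input."""
    draft = foley_generation(video_path, prompt)
    if region is not None:
        draft = object_centric_refinement(draft, region, label)
    if instruction is not None:
        draft = targeted_audio_editing(draft, instruction)
    return draft


result = run_pipeline(
    "locomotive.mp4",
    prompt="Steam locomotive with whistle blowing",
    region=(10, 20, 100, 80),
    label="whistle",
    instruction="make the whistle louder",
)
for step in result.steps:
    print(step)
```

In this sketch, stages 2 and 3 are optional: a user who is happy with the first soundscape can stop after stage 1, while clicks, regions, or edit instructions trigger the later stages.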

Further research underway

Our approach stands out from existing video-to-audio solutions because it understands raw pixels directly, and a text prompt is optional.

The system also removes the need to manually align the generated sound with the video, a tedious process of adjusting sounds, visuals, and timings against one another.

Still, there are a number of limitations we’re working to address, and further research is underway.

Our commitment to safety and transparency

We’re committed to developing and deploying AI technologies responsibly. To make sure our A2E technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development.

We also watermark all AI-generated content to help safeguard against potential misuse of this technology.