How to Add Voice and Sound to a Midjourney Video

Midjourney’s jump into video generation is exciting: it turns still images into short, loopable animated clips that open the door to storytelling and motion-driven content. But until Midjourney ships built-in, polished audio tracks (if it ever does), creators must stitch audio onto the silent video output using a mix of AI audio tools and classic editors. This article explains the current landscape (tools, workflows, tips, and legal guardrails), and gives you a step-by-step, production-ready workflow for adding voice and sound to Midjourney video clips.
What exactly is a “Midjourney video” and why does it need external audio?
What Midjourney’s video feature currently produces
Midjourney’s video capability converts a generated or uploaded image into a short animated clip (initially 5 seconds, extendable in increments) that emphasizes motion and camera/subject movement rather than synchronized audio or lip-synced dialogue. The tool is intended to generate visually rich short loops, not finished audiovisual narratives. This means every Midjourney video you export will be silent and must be paired with audio in post-production to become anything more than a moving image.
What are the basic Midjourney video rules and limitations?
Midjourney’s video feature converts a starting image into a short animated clip (5s default), with options to extend the length up to 21 seconds total, choose “Low” or “High” motion, loop, and change batch size. Videos are downloadable as .mp4, and Midjourney exposes a --video parameter (plus --motion low|high, --loop, --end, --bs #, and --raw, all covered in Midjourney’s official docs) for Discord or API prompts. Resolution is SD (480p), with HD (720p) also available; batch sizes and motion settings affect GPU time and cost.
Practical takeaway: Midjourney clips are short (5–21s), so plan narration and audio to fit that envelope — or prepare to stitch multiple clips. Download the Raw Video (.mp4) from Midjourney’s Create page for the best quality to work with in post-production.
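If you do stitch clips, ffmpeg’s concat demuxer is a simple way to join several downloaded Midjourney MP4s into one longer video before adding audio. A minimal sketch, assuming the files are named clip1.mp4 and clip2.mp4 and share the same codec, resolution, and frame rate (stream copy only works when they match):
# clips.txt lists the segments in playback order, one per line:
#   file 'clip1.mp4'
#   file 'clip2.mp4'
ffmpeg -f concat -safe 0 -i clips.txt -c copy stitched.mp4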
Why you should add voice, music and SFX
Adding audio:
- Provides context and narrative (voiceover), making abstract visuals communicative.
- Sets emotional tone (music choice) and improves viewer retention.
- Grounds the AI visuals in realism (sound design, Foley, ambient beds).
- Makes content platform-ready for TikTok, YouTube, or reels where audio is essential.
What is the simplest workflow to add voice and sound to a MidJourney video?
Quick one-paragraph recipe
- Generate your visual video or animated frames in MidJourney (Gallery → Animate / Video features).
- Export/download the produced video (MP4/GIF).
- Produce voiceover with OpenAI’s TTS (e.g., gpt-4o-mini-tts or other TTS models) and export as WAV/MP3.
- Create background music and SFX using AI audio tools (tools such as MM Audio, Udio, or Runway can help).
- Align and mix in a DAW (Reaper, Audacity, Logic, or simply use ffmpeg for straight merges).
- Optionally run AI lip-sync if the video contains faces and you want the mouth to match speech (Wav2Lip, Sync.so, and commercial services).
Why this separation (visuals vs audio) matters
MidJourney focuses on visual creativity and motion design; audio design is a different technical stack (speech generation, audio design, synchronization). Separating responsibilities gives you much more control—voice character, pacing, sound design, and mastering—without fighting with the visual generator.
How should I craft the Midjourney prompt for video?
You can create videos from any image in your gallery or by pasting a publicly hosted image URL into the Imagine bar and adding the --video parameter (on Discord or API). After generation you can download the MP4 (Raw or Social versions) directly from the Midjourney Create page or from Discord.
A simple Discord-style example that uses an uploaded image as the start frame:
<your_image_url> cinematic slow pan across a neon city at dusk, vignette, shallow depth of field --video --motion high --bs 1 --raw
Notes:
- Put the image URL at the start to use it as the starting frame.
- Add --video and a motion flag (--motion low or --motion high).
- Use --bs 1 if you only need a single output (saves GPU time).
- Use --raw if you want less stylization and more deterministic motion.
If the video is shorter than your desired narration, you’ll either extend the video in Midjourney (you can extend up to +4s per extension, up to 21s total) or cut/loop audio to fit. Note the exact duration (seconds + milliseconds) so you can align narration and SFX. Midjourney provides a “Download Raw Video” option on the Create page and in Discord; use that as your starting file.
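To note that duration precisely, ffprobe (bundled with ffmpeg) can print it in seconds with millisecond precision. A quick sketch, assuming your downloaded file is named midjourney_raw.mp4:
# Print the container duration in seconds (e.g. 10.437000)
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 midjourney_raw.mp4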
Which OpenAI TTS models should I consider and why?
What are the TTS options available right now?
OpenAI offers multiple TTS options: historically tts-1 / tts-1-hd, and the newer steerable gpt-4o-mini-tts. The gpt-4o-mini-tts model emphasizes steerability (you can instruct tone, pacing, emotion) and is designed for flexible, expressive voice generation; tts-1 and tts-1-hd remain strong choices for high-quality, more traditional TTS. Use gpt-4o-mini-tts when you want to control how the text is spoken (style, vibe), and tts-1-hd for maximum fidelity when style control is less critical. OpenAI has continued to iterate on audio models (announcements in 2025 expanded speech and transcription capabilities), so pick the model that balances cost, quality, and controls for your project. These TTS model APIs are also integrated into CometAPI.
Any production caveats or current limitations?
gpt-4o-mini-tts can sometimes exhibit instability on longer audio files (pauses, volume fluctuation), especially beyond ~1.5–2 minutes. For short Midjourney clips (under ~20–30s) this is seldom a problem, but for longer narration or long-form voice-overs, test and validate. If you expect longer narration, prefer tts-1-hd or split text into shorter chunks and stitch them carefully.
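A minimal sketch of the chunk-and-stitch approach, assuming a file lines.txt with one short sentence per line and the CometAPI speech endpoint used later in this article (sentences containing double quotes would need escaping before being embedded in the JSON):
#!/usr/bin/env bash
# Synthesize each sentence as its own short TTS file, then join them with ffmpeg.
: > parts.txt
i=0
while IFS= read -r sentence; do
  i=$((i+1))
  curl -s -X POST "https://api.cometapi.com/v1/audio/speech" \
    -H "Authorization: Bearer $CometAPI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"gpt-4o-mini-tts\", \"voice\": \"alloy\", \"input\": \"$sentence\"}" \
    --output "part_$i.mp3"
  echo "file 'part_$i.mp3'" >> parts.txt
done < lines.txt
# Concatenate the segments into one narration file (re-encode for clean joins)
ffmpeg -f concat -safe 0 -i parts.txt -c:a libmp3lame -q:a 2 narration_full.mp3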
Other helpful tools
Background music & SFX: Tools such as MM Audio (community tools), Udio, MagicShot, or Runway can create matching background music and context-sensitive SFX quickly; community threads and tutorials show creators blending these into MidJourney videos. For production-grade control, generate stems (music + ambient) and export them for mixing.
Lip sync and face animation: If the video includes characters or closeups of faces and you want realistic mouth movement, consider Wav2Lip (open source) or commercial APIs like Sync.so, Synthesia, or other lip-sync services. These tools analyze audio to produce phoneme-aligned mouth shapes and apply them to a target face or frame sequence.
How do I generate a voice file with OpenAI’s TTS (practical code)?
Below are two practical examples, in CometAPI’s call format, that generate an MP3 (or WAV) using OpenAI’s TTS endpoint. You can adapt voice names and streaming flags per your CometAPI account and SDK updates.
⚠️ Replace YOUR_CometAPI_API_KEY with your API key. Test on a short phrase first. Refer to the Audio Models doc in CometAPI.
Example A: quick curl (command line)
curl -s -X POST "https://api.cometapi.com/v1/audio/speech" \
-H "Authorization: Bearer $YOUR_CometAPI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"voice": "alloy",
"input": "Welcome to our neon city demo. This clip demonstrates motion and narration synced for social media."
}' \
--output narration.mp3
If you prefer WAV:
- Change the output file name to narration.wav and (if available) specify an audio format parameter in the body (some SDKs allow format: "wav"); a hedged curl variant follows below.
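For example, a hedged variant of Example A that requests WAV output; OpenAI’s speech endpoint documents the field as response_format, so confirm which name your gateway or SDK expects:
curl -s -X POST "https://api.cometapi.com/v1/audio/speech" \
-H "Authorization: Bearer $YOUR_CometAPI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"voice": "alloy",
"response_format": "wav",
"input": "Welcome to our neon city demo."
}' \
--output narration.wav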
Why this works: The TTS endpoint accepts text and returns a binary audio file you can save and merge with your video later. Use voice and instructions (where available) to steer prosody and style.
Example B: Python using requests
import os, requests

API_KEY = os.environ["CometAPI_API_KEY"]
text = "This is a sample TTS output for your MidJourney video."

# Call the TTS endpoint (same endpoint as Example A) and stream the binary audio back.
# Note: some gateways expect "response_format" instead of "format" for the audio type.
resp = requests.post(
    "https://api.cometapi.com/v1/audio/speech",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-mini-tts",
        "voice": "alloy",
        "input": text,
        "format": "mp3"
    },
    stream=True,
)
resp.raise_for_status()

# Write the streamed MP3 bytes to disk
with open("voiceover.mp3", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)
print("Saved voiceover.mp3")
How do I combine the TTS audio with a MidJourney video file?
Export the video from MidJourney
MidJourney’s Video/Animate features let you create an MP4/GIF or export a video from your Gallery—use the “Animate” function or the gallery export options to get a local file.
Simple merge with ffmpeg
If you already have video.mp4 (with no audio, or placeholder audio) and voiceover.wav (or .mp3), use ffmpeg to merge:
# Replace or add audio, re-encode audio to AAC; keep video stream as-is
ffmpeg -i video.mp4 -i voiceover.wav -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 -shortest -b:a 192k final_video.mp4
Notes:
- -shortest stops at the shorter stream; omit it if you want the video to keep playing longer than the audio (or vice versa).
- -c:v copy keeps the video stream unchanged.
- -c:a aac encodes audio to AAC (compatible with MP4).
- Use -af "volume=..." filters for loudness matching.
- For professional finalization, open the audio stems in a DAW to adjust timing, EQ, and compression.
Trim or pad audio to exact video length
If the audio is longer than the video and you want a precise cut:
ffmpeg -i narration.mp3 -ss 0 -to 00:00:05 -c copy narration_trim.mp3
ffmpeg -i mid.mp4 -i narration_trim.mp3 -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 output.mp4
If the audio is shorter and you want background music to fill the remainder or to loop the voice, use adelay, apad, or mix with a background track. Example: loop narration to match a 20s clip (not usually recommended for voice):
ffmpeg -stream_loop -1 -i narration.mp3 -i mid.mp4 -t 00:00:20 -c:v copy -c:a aac -map 1:v:0 -map 0:a:0 output_looped.mp4
How to offset audio (if narration needs to start later)
If your narration should start after a short silence or you have multiple segments to place at offsets, use -itsoffset:
ffmpeg -i midjourney_raw.mp4 -itsoffset 0.5 -i speech.mp3 -map 0:v -map 1:a -c:v copy -c:a aac -shortest output_offset.mp4
-itsoffset 0.5 delays the second input by 0.5 seconds.
For multiple audio tracks or very precise placement, generate the TTS in small segments (one sentence per file), then use -filter_complex with adelay:
ffmpeg -i mid.mp4 \
-i line1.mp3 -i line2.mp3 -i sfx.wav \
-filter_complex \
"[1:a]adelay=0|0[a1]; \
[2:a]adelay=2500|2500[a2]; \
[3:a]adelay=1200|1200[a3]; \
[a1][a2][a3]amix=inputs=3[aout]" \
-map 0:v -map "[aout]" -c:v copy -c:a aac -shortest timed_output.mp4
Here adelay takes milliseconds (2500 ms = 2.5s), so you can align text to visual cues precisely.
Keep narration short and scene-aware: Because Midjourney’s clips are short and often stylized, aim for a concise hook (~5–15 seconds) that matches the video’s tempo. Break text into short sentences that breathe with the visual cuts or motion cues.
How to mix background music + narration + SFX
Use filter_complex to mix multiple audio inputs and control volumes. Example:
ffmpeg -i midjourney_raw.mp4 -i narration.mp3 -i music.mp3 \
-filter_complex "[1:a]volume=1[a1];[2:a]volume=0.18[a2];[a1][a2]amix=inputs=2:duration=shortest[aout]" \
-map 0:v -map "[aout]" -c:v copy -c:a aac final_with_music.mp4
This mixes narration (narration.mp3) and music (music.mp3) while setting the music level low so it sits under the voice. You can also run dynamic ducking (making music fade when narration plays) via sidechain filters or edit in a DAW for precise fades.
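For a hands-off approach to ducking, ffmpeg’s sidechaincompress filter can lower the music whenever the narration is loud. A sketch, assuming the same input files as above; the threshold, ratio, attack, and release values are starting points to tune by ear:
ffmpeg -i midjourney_raw.mp4 -i narration.mp3 -i music.mp3 \
-filter_complex \
"[1:a]asplit=2[voice][sc]; \
[2:a][sc]sidechaincompress=threshold=0.05:ratio=8:attack=20:release=300[ducked]; \
[voice][ducked]amix=inputs=2:duration=shortest[aout]" \
-map 0:v -map "[aout]" -c:v copy -c:a aac ducked_with_music.mp4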
Advanced Editing
Script and pacing
- Write a tight script and mark visual cues (timecode or frame numbers) so the TTS output aligns to scene changes.
- Use short sentences for better natural cadence; if you need long reads, insert intentional pauses or split into multiple TTS calls.
Match motion, intensity and texture
- Use transient SFX to accent visual cuts or camera moves.
- For slow, painterly Midjourney motion (--motion low), favor subtle ambience and long reverb tails.
- For high action (--motion high), use punchy SFX, tempo-matched musical hits, and short reverb.
Steering voice style
Use instructive prompts to steer gpt-4o-mini-tts, e.g., "instructions": "Calm, conversational, slight warmth, medium speed", or include that instruction as part of the text payload. For example:
{
"model":"gpt-4o-mini-tts",
"voice":"alloy",
"instructions":"Friendly, slightly breathy; emphasize words 'neon' and 'dawn'",
"input":"In the neon city, dawn felt electric..."
}
Be careful: exact parameter names differ across SDK versions — test the fields your SDK supports.
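As a sketch, here is the same payload sent with curl to the speech endpoint used in Example A; the instructions field matches OpenAI’s documented steering parameter for gpt-4o-mini-tts, but verify that your SDK or gateway passes it through:
curl -s -X POST "https://api.cometapi.com/v1/audio/speech" \
-H "Authorization: Bearer $YOUR_CometAPI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini-tts",
"voice": "alloy",
"instructions": "Friendly, slightly breathy; emphasize the words neon and dawn",
"input": "In the neon city, dawn felt electric..."
}' \
--output styled_voiceover.mp3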
Sound design tips
- Add a low-volume bed track (music) and sidechain or duck it during voice.
- Use short whooshes, risers, or impact SFX aligned to visual transitions. Keep SFX short and crisp.
- Normalize voice (-1 dBFS) and compress lightly (ratio 2:1) for consistent loudness across platforms; see the ffmpeg sketch after this list.
- For social platforms, encode final video with AAC-LC audio and H.264 video for compatibility.
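A minimal ffmpeg sketch of that mastering pass: light 2:1 compression, loudness normalization to roughly -16 LUFS with a -1 dBFS true-peak ceiling (the LUFS target is an assumption; adjust per platform), then a platform-friendly mux:
# 1) Master the voice: gentle 2:1 compression, then normalize with a -1 dBFS true-peak ceiling
ffmpeg -i voiceover.mp3 \
-af "acompressor=threshold=0.1:ratio=2:attack=20:release=250,loudnorm=I=-16:TP=-1.0:LRA=11" \
voiceover_mastered.wav

# 2) Mux with the clip: keep the H.264 video stream, encode audio as AAC-LC
ffmpeg -i video.mp4 -i voiceover_mastered.wav -c:v copy -c:a aac -b:a 192k -map 0:v:0 -map 1:a:0 -shortest final_social.mp4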
Can I make characters in a MidJourney video “speak” (lip-sync) to the generated voice?
Yes—use a lip-sync model to map phonemes from the TTS audio to mouth movement frames. The two common approaches are:
Use open tools like Wav2Lip (local or hosted)
Wav2Lip aligns spoken audio to mouth movement and can be run locally or via hosted GUIs. Typical workflow:
- Export video or a series of frames (image sequence) from MidJourney.
- Produce the voice file (OpenAI TTS).
- Run Wav2Lip to output a new video where mouth shapes match the audio.
Wav2Lip is excellent for 1:1 mouth alignment and is open source; you may need some postprocessing for visual polish.
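A typical invocation of the open-source Wav2Lip inference script looks like the sketch below; flag names and checkpoint files can differ between forks, so check the repository’s README:
# Drive the mouth movement in the Midjourney clip with the TTS narration
python inference.py \
--checkpoint_path checkpoints/wav2lip_gan.pth \
--face midjourney_clip.mp4 \
--audio voiceover.wav \
--outfile lipsynced_clip.mp4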
Use commercial APIs for one-step lip-sync
Services like Sync.so, Synthesia, and others offer API/GUI pipelines that handle both speech and lipsync/dubbing, sometimes including multilingual dubbing. They can be faster and less technical but are paid services and may limit fine control.
Practical notes on realism
- Perfect realism often requires microexpressions, eye blinks, and head movement—some lip-sync services add these automatically; others require manual tweaks.
- If characters are stylized (non-photoreal), small lip-sync errors are less noticeable; for closeups, invest time in a DAW + facial retouching pipeline.
Getting Started
CometAPI is a unified API platform that aggregates over 500 AI models from leading providers—such as OpenAI’s GPT series, Google’s Gemini, Anthropic’s Claude, Midjourney, Suno, and more—into a single, developer-friendly interface. By offering consistent authentication, request formatting, and response handling, CometAPI dramatically simplifies the integration of AI capabilities into your applications. Whether you’re building chatbots, image generators, music composers, or data‐driven analytics pipelines, CometAPI lets you iterate faster, control costs, and remain vendor-agnostic—all while tapping into the latest breakthroughs across the AI ecosystem.
Use MidJourney Video in CometAPI
CometAPI offers prices far lower than the official ones to help you integrate the Midjourney API and Midjourney Video API; you are welcome to register and experience CometAPI. To begin, explore the model’s capabilities in the Playground and consult the API guide for detailed instructions. Before accessing, please make sure you have logged in to CometAPI and obtained the API key. CometAPI supports SD 480p and HD 720p resolutions.
Calling Method: Use the parameter videoType=vid_1.1_i2v_720.
Midjourney V1 Video generation: Developers can integrate video generation via a RESTful API. A typical request structure (illustrative):
curl --location --request POST 'https://api.cometapi.com/mj/submit/video' \
--header 'Authorization: Bearer {{api-key}}' \
--header 'Content-Type: application/json' \
--data-raw '{
"prompt": "https://cdn.midjourney.com/f9e3db60-f76c-48ca-a4e1-ce6545d9355d/0_0.png add a dog",
"videoType": "vid_1.1_i2v_720",
"mode": "fast",
"animateMode": "manual"
}'
Audio Models
Developers can access GPT-4o audio and tts-1 through CometAPI; the latest model versions (endpoints: gpt-4o-mini-audio-preview-2024-12-17, tts-1-hd, tts-1) are kept in sync with the official releases. To begin, explore the model’s capabilities in the Playground and consult the audio API guide for detailed instructions. Before accessing, please make sure you have logged in to CometAPI and obtained the API key. CometAPI offers prices far lower than the official ones to help you integrate.
Conclusion
Adding voice and sound to Midjourney video is straightforward: generate a short Midjourney clip, synthesize short narration with OpenAI’s steerable TTS, then combine and polish using ffmpeg. The new gpt-4o-mini-tts model gives you strong stylistic control, while Midjourney’s --video workflow produces clean short animations — perfect for social, prototype, or concept work.