
Soundify Guide Using Flixly Tools

Step-by-step walkthrough showing how to match sound effects to video using Flixly models and tools such as lip sync and music generation.

June 15, 2026

TL;DR

Upload a silent clip to reference-to-video, generate matching audio stems with Seedance 2.0 and Gemini 3.1 Flash TTS, align layers inside video-to-video, then export the 45-second result at 48 kHz. Total cost is 39 credits.

A 45-second product demo video sits ready for upload. The clip needs matching footsteps, door clicks, and background hum within the next 90 minutes.

Start the Flixly project

Open the dashboard and select the reference-to-video tool at /dashboard/reference-to-video. Upload the silent clip and note its 1920x1080 resolution and 24 fps frame rate. Name the project Soundify-demo-0615.

Generate base audio layers

Move to the music-generation page at /dashboard/music-generation. Choose the Seedance 2.0 model. Enter a prompt that reads "quiet office footsteps on tile at 0-8 seconds then 22-30 seconds". Set length to 45 seconds and sample rate to 48 kHz. The first pass returns a 12 MB WAV file.

Verify timing markers

Play the track inside the preview window. Check that the first footstep hit lands at 0.8 seconds and the second sequence begins at 22.4 seconds. Adjust the prompt seed if any marker drifts more than 0.3 seconds.
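
The drift check above can be sketched as a small helper. This is a minimal illustration, not part of any Flixly tool: the function name and the measured values are hypothetical, and the timestamps are assumed to be read off the preview window by hand.

```python
# Tolerance from the walkthrough: adjust the seed if a marker drifts more than 0.3 s.
DRIFT_TOLERANCE_S = 0.3

def drifted_markers(expected, measured, tolerance=DRIFT_TOLERANCE_S):
    """Return (expected, measured, drift) tuples for markers outside tolerance."""
    if len(expected) != len(measured):
        raise ValueError("expected and measured must have the same length")
    flagged = []
    for want, got in zip(expected, measured):
        drift = got - want  # positive = late, negative = early
        if abs(drift) > tolerance:
            flagged.append((want, got, round(drift, 3)))
    return flagged

# Target times from the walkthrough; measured values are hypothetical readings.
expected = [0.8, 22.4]
measured = [0.9, 22.8]
print(drifted_markers(expected, measured))  # [(22.4, 22.8, 0.4)]
```

An empty list means every marker is inside the 0.3-second window and no regeneration is needed.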

Add voice and effects

Switch to the lip-sync tool at /dashboard/lip-sync. Upload the same video plus a short voice line recorded at 16 kHz. Pick the Gemini 3.1 Flash TTS model for the spoken overlay. The system aligns mouth shapes to the new audio in 14 seconds of processing.

Layer ambient tracks

Return to music-generation and run a second pass with prompt "low HVAC hum constant from 0-45 seconds". Export at 24-bit depth. Import both WAV files into the video-to-video page at /dashboard/video-to-video. Mix levels at -12 dB for footsteps and -18 dB for hum.
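
To get a feel for what those two mix levels mean, the standard decibel-to-amplitude conversion can be sketched as follows. The helper name is illustrative; the formula itself (gain = 10^(dB/20)) is the usual one for amplitude levels.

```python
def db_to_gain(level_db):
    """Convert a level in decibels to a linear amplitude multiplier."""
    return 10 ** (level_db / 20)

# Mix levels from the walkthrough.
MIX_LEVELS_DB = {"footsteps": -12.0, "hum": -18.0}

for name, level_db in MIX_LEVELS_DB.items():
    print(f"{name}: {level_db} dB -> gain {db_to_gain(level_db):.3f}")
```

The 6 dB gap between the two stems means the footsteps sit at roughly twice the amplitude of the hum.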

Match to specific frames

Use the first-to-last-frame tool at /dashboard/first-to-last-frame. Set keyframes at 8 s, 22 s, and 35 s. Feed the mixed audio stem as reference. The output file carries embedded markers that line up within two frames of each target.
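
A quick way to reason about the two-frame tolerance is to convert between seconds and frames at the clip's 24 fps rate. The sketch below uses hypothetical helper names and assumes a constant frame rate.

```python
FPS = 24  # frame rate noted when the clip was uploaded

def seconds_to_frame(t_seconds, fps=FPS):
    """Nearest frame index for a timestamp in seconds."""
    return round(t_seconds * fps)

def within_frames(target_s, actual_s, max_frames=2, fps=FPS):
    """True if actual_s is no more than max_frames away from target_s."""
    return abs(actual_s - target_s) * fps <= max_frames

keyframes_s = [8, 22, 35]
print([seconds_to_frame(t) for t in keyframes_s])  # [192, 528, 840]
print(within_frames(8, 8.05))  # True: 1.2 frames off
print(within_frames(8, 8.1))   # False: 2.4 frames off
```

At 24 fps, two frames is about 83 ms, which is a tighter window than the 0.3-second drift limit used when checking the raw audio.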

Final render and check

Export the finished clip at 1080p with 48 kHz audio. File size lands at 187 MB. Open the result in any media player, scrub to 0:07, 0:21, and 0:34 (one second ahead of the 8 s, 22 s, and 35 s keyframes), and play through each one. Each sound should land on the intended action without drift.

The finished video now carries synchronized audio that matches the original motion exactly. Repeat the same sequence on your next clip by starting at the reference-to-video page.

Model choices for each layer

Pick models based on duration and style needs. Seedance 2.0 handles short rhythmic sounds under 60 seconds. Kling 3.0 works better for longer ambient beds up to four minutes. Veo 3.1 offers cleaner separation when multiple overlapping effects are required.

Credit cost breakdown

A 45-second music pass costs 18 credits. Lip-sync alignment adds 12 credits. The video-to-video mix step uses 9 credits. Total spend for one full soundify run is 39 credits on the current pricing tier.
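
The arithmetic can be written out as a short sketch. The per-step costs are the ones quoted above; the `music_passes` parameter is an assumption added for illustration, since the walkthrough runs a second music pass for the hum and this guide does not say whether it is billed at the same rate.

```python
# Per-step credit costs quoted in this guide (current pricing tier).
CREDITS = {"music_pass": 18, "lip_sync": 12, "video_to_video_mix": 9}

def run_cost(music_passes=1):
    """Total credits for one soundify run with the given number of music passes."""
    return (CREDITS["music_pass"] * music_passes
            + CREDITS["lip_sync"]
            + CREDITS["video_to_video_mix"])

print(run_cost())                # 39, the total quoted above
print(run_cost(music_passes=2))  # 57, if the hum pass is billed like the first
```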

Common timing fixes

If footsteps land 400 ms late, move the prompt start time 0.4 seconds earlier and regenerate only that segment. When hum bleeds into voice frequencies, lower the hum track by 3 dB before the final mix.

Export settings table

Step         | Format | Bitrate    | Sample Rate | Duration
Music pass 1 | WAV    | 1411 kbps  | 48 kHz      | 45 s
Lip sync     | MP4    | 8000 kbps  | 48 kHz      | 45 s
Final mix    | MP4    | 12000 kbps | 48 kHz      | 45 s
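
For the WAV row, bitrate and file size follow directly from sample rate, bit depth, and channel count, so the figures can be sanity-checked. The sketch below assumes uncompressed stereo PCM; the channel count is an assumption, and the bit depth of the first pass is not stated in this guide.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels=2):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

def pcm_size_mb(sample_rate_hz, bit_depth, duration_s, channels=2):
    """Audio payload size in megabytes (10^6 bytes), ignoring the WAV header."""
    return sample_rate_hz * bit_depth * channels * duration_s / 8 / 1_000_000

print(pcm_bitrate_kbps(44_100, 16))  # 1411.2
print(pcm_bitrate_kbps(48_000, 16))  # 1536.0
print(pcm_bitrate_kbps(48_000, 24))  # 2304.0
print(pcm_size_mb(48_000, 24, 45))   # 12.96
```

Under these assumptions, 1411 kbps corresponds to 44.1 kHz 16-bit stereo rather than 48 kHz, while a 45-second 48 kHz 24-bit stereo file comes to about 13 MB, close to the 12 MB file from the first pass. If the exported WAV really is 48 kHz, expect the bitrate reported by your player to be 1536 or 2304 kbps.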

Repeat the workflow

Open the reference-to-video page again and load your next silent clip. Follow the same steps, from project setup through final render, to keep every new video in sync with its sound design.

Model Selection Criteria

Choosing the generation model depends on clip length, number of overlapping elements, and required timing precision. Seedance 2.0 remains the default for any rhythmic or percussive layer shorter than 60 seconds because its seed-based timing stays within 0.3 seconds of prompt markers on the first pass. For ambient beds that stretch past two minutes, switch to Kling 3.0; its longer context window reduces looping artifacts that appear when Seedance 2.0 is forced beyond its training range.

When three or more distinct sound classes must occupy the same 45-second window, Veo 3.1 separates frequency bands more cleanly during the final mix step. Test a 10-second excerpt first: generate one pass with each model using identical prompts, then compare the exported stems inside the video-to-video preview at /dashboard/video-to-video. The model whose stem shows the least masking in the 2–4 kHz voice range is the one to keep for the full clip.
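
The criteria above can be condensed into a small decision helper. This is only one reading of the guidance: the function, its parameters, and the order in which the rules are checked are assumptions, and the 10-second comparison test remains the real tiebreaker.

```python
def pick_model(duration_s, sound_classes, ambient_bed=False):
    """Suggest a model name from clip length and layer complexity.

    Rule order is an assumption: overlapping sound classes are checked
    first, then long ambient beds, with Seedance 2.0 as the default.
    """
    if sound_classes >= 3:
        return "Veo 3.1"        # cleaner separation for overlapping effects
    if ambient_bed and duration_s > 120:
        return "Kling 3.0"      # ambient beds that stretch past two minutes
    return "Seedance 2.0"       # default for short rhythmic layers

print(pick_model(45, 1))                      # Seedance 2.0
print(pick_model(180, 1, ambient_bed=True))   # Kling 3.0
print(pick_model(45, 3))                      # Veo 3.1
```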

Batch Workflow Example

A single 45-second clip rarely exists in isolation. After the first successful soundify run, open the batch queue at /dashboard/batch-audio and drop the next four silent clips into the list. The interface reuses the project settings from Soundify-demo-0615, including the 48 kHz sample rate and the same Seedance 2.0 prompt seed. Only the reference video and any new voice line need manual upload; all other parameters copy forward.

Run the queue overnight. Each clip receives its own credit deduction line item so the total spend stays visible before export. Once finished, the system places the rendered files in a single folder named after the batch job. Scrub the first 10 seconds of each output in sequence; any timing drift above 0.3 seconds triggers a targeted regenerate on only the affected segment rather than the entire batch.

Quality Assurance Checklist

Before final export, run the rendered file through a short verification list inside the media player:

  • Play at 0.25× speed from 0:00 to 0:10 and confirm every footstep lands inside the intended frame window.
  • Solo the voice track and check that lip-sync markers align within two frames on every plosive consonant.
  • Listen for hum bleed above -18 dB in the 200–500 Hz band; if present, reopen the mix step and lower the ambient stem by an additional 2 dB.
  • Export a 10-second test clip at the same settings as the final render to verify file size stays under 50 MB per minute of 1080p footage.

Document any manual dB adjustments in the project notes so the same offsets can be applied to future batches without re-testing.
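
The file-size item in the checklist reduces to a per-minute calculation on the 10-second test clip. A minimal sketch, with hypothetical helper names and a made-up 8 MB test-clip size:

```python
MAX_MB_PER_MINUTE = 50  # checklist limit for 1080p footage

def mb_per_minute(size_mb, duration_s):
    """Scale a clip's size to an equivalent size per minute of footage."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return size_mb * 60 / duration_s

def passes_size_check(size_mb, duration_s, limit=MAX_MB_PER_MINUTE):
    return mb_per_minute(size_mb, duration_s) <= limit

print(passes_size_check(8.0, 10))        # True: 48 MB per minute
print(round(mb_per_minute(187, 45), 1))  # 249.3
```

By this measure the 187 MB, 45-second render quoted earlier works out to roughly 249 MB per minute, so it is worth confirming which export setting the 50 MB limit applies to before relying on the check.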

Post-Production Audio Tweaks

Even after the automated mix, small frequency corrections often remain necessary. Import the final stem into the audio editor linked from /dashboard/audio-editor and apply a narrow notch filter at 3.2 kHz if the HVAC hum interferes with the recorded voice. Limit the notch width to 40 Hz so the surrounding ambience stays intact. For footsteps that still feel too present, automate a -3 dB dip only on the exact frames where the on-screen character is stationary; the editor accepts keyframe import directly from the video timeline.

After these targeted changes, re-render only the affected 8-second region rather than the full clip. The system merges the corrected segment back into the master file while preserving the original embedded markers. This approach keeps total processing time under five minutes even when three or four small fixes are required.
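
To see what a 40 Hz-wide notch at 3.2 kHz does to nearby frequencies, the sketch below builds a standard biquad notch using the widely published Audio EQ Cookbook formulas and evaluates its gain. It is not the Flixly editor's implementation, which this guide does not describe; the 48 kHz sample rate comes from the export settings, and the function names are illustrative.

```python
import cmath
import math

def notch_coefficients(center_hz, width_hz, sample_rate_hz):
    """Biquad notch (Audio EQ Cookbook); returns (b, a) with a[0] == 1."""
    q = center_hz / width_hz                      # 3200 / 40 = 80
    w0 = 2 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = [1 / a0, -2 * math.cos(w0) / a0, 1 / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0]
    return b, a

def gain_at(freq_hz, b, a, sample_rate_hz):
    """Linear magnitude response of the filter at freq_hz."""
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / sample_rate_hz)
    num = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
    den = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
    return abs(num / den)

b, a = notch_coefficients(3200, 40, 48_000)
for f in (1000, 3180, 3200, 3220, 5000):
    print(f, round(gain_at(f, b, a, 48_000), 3))
```

The gain drops to zero at 3.2 kHz, sits near 0.7 (about -3 dB) at the 3180 Hz and 3220 Hz edges, and stays close to 1 elsewhere, which is why the surrounding ambience is left largely untouched.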

Credit Optimization Strategies

Track usage inside the credit dashboard after every batch job. The interface lists per-clip deductions so you can spot segments that consumed more than the 18-credit baseline for a music pass. When a prompt returns an output with noticeable timing drift, regenerate only the flagged 8-second window instead of the full 45-second file; this keeps the second deduction at roughly 6 credits. Set a running total alert at 150 credits so the queue pauses automatically before an overnight run exceeds the monthly allocation.

Group similar clips into one project folder before launching the batch queue. The system reuses the same Seedance 2.0 seed across all items in that folder, eliminating repeated model-loading steps that add 3 credits each time a new session starts. Export the mixed stems at 24-bit depth only when the final verification checklist is complete; intermediate 16-bit previews cost 2 fewer credits and still allow accurate timing checks.
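
The 150-credit alert can be planned for ahead of time by walking the queue with estimated per-clip costs. The sketch below is a hypothetical model of that behavior, not the dashboard's actual logic; it assumes the queue pauses before starting any clip that would push the running total past the alert.

```python
def run_until_alert(clip_costs, alert_at=150):
    """Return (indices of clips that run, credits spent) before the alert pauses the queue."""
    total = 0
    completed = []
    for index, cost in enumerate(clip_costs):
        if total + cost > alert_at:
            break  # queue pauses before this clip starts
        total += cost
        completed.append(index)
    return completed, total

# Five clips at the 39-credit full-run cost quoted in this guide.
print(run_until_alert([39] * 5))  # ([0, 1, 2], 117)

# Credits saved by regenerating an 8-second window (about 6) instead of a full pass (18).
print(18 - 6)  # 12
```

With these numbers a five-clip overnight batch would pause after the third clip, so either raise the alert or split the batch across two nights.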

Cross-Platform Export Considerations

After the 1080p render finishes, open the export options panel to choose container settings that match downstream editing software. ProRes 422 inside an MOV wrapper preserves the embedded timing markers when the file moves to DaVinci Resolve or Premiere. If the next step is social-media upload only, switch to H.264 at 8000 kbps; the audio track remains at 48 kHz and the markers stay intact inside the MP4.

Test one clip from the batch at 720p before committing the full queue. The lower-resolution pass finishes in under three minutes and reveals any sample-rate mismatch that would otherwise appear only after the 187 MB file lands on disk. Once the test passes, queue the remaining clips at the target resolution with the same audio settings.

Platform      | Recommended Container | Video Codec | Audio Sample Rate | Notes
Social upload | MP4                   | H.264       | 48 kHz            | Keeps file under 200 MB
NLE import    | MOV                   | ProRes 422  | 48 kHz            | Retains frame-accurate markers
Archive       | MKV                   | HEVC        | 48 kHz            | Smaller size for long-term storage

Version History and Rollback Procedures

Every render step writes a new entry to the project timeline. Click any prior entry to load the exact prompt, model, and mix levels that produced it. This snapshot restores the 12 MB WAV from the first music pass without re-running the generation step, saving 18 credits if a later mix adjustment proves unsuitable.

When three team members work on the same clip, assign each person a separate branch inside the timeline view. Branches keep their own credit logs so one member’s experimental ambient layer does not affect the shared master stem. Merge branches only after the Quality Assurance Checklist confirms timing within 0.3 seconds on all markers.

Roll back to an earlier branch by selecting the entry and choosing “Restore to current project.” The system copies the embedded markers forward and updates the reference video link automatically. Keep at least four versions of any 45-second clip; the storage cost stays under 1 GB per project and allows quick comparison of hum levels or footstep placement without regenerating audio.

Integrating External Voice Lines

Record the voice line at 16 kHz on any device, then upload it directly to the lip-sync tool. The Gemini 3.1 Flash TTS model accepts the file without conversion, but verify that the peak level sits between -6 dB and -3 dB before upload. If the recording contains room tone above -30 dB, apply a light gate inside the audio editor first; this prevents the alignment algorithm from locking onto noise instead of speech.

After alignment completes, export the synced voice stem and import it as an additional layer in the video-to-video mix. Set its level at -9 dB relative to the footsteps so consonants remain intelligible while the ambient hum stays audible underneath. If the external line contains plosives that cause lip-sync drift, split the stem at each plosive and realign the segments individually; each split adds roughly four seconds of processing but keeps mouth shapes matched to within one frame.
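
The peak-level check on the voice recording can be done before upload with a few lines of code. This sketch assumes the samples have already been loaded as floats normalized to the range -1.0 to 1.0, so that levels are in dB relative to full scale; the helper names are illustrative.

```python
import math

def peak_dbfs(samples):
    """Peak level in dB relative to full scale for samples in [-1.0, 1.0]."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return float("-inf")  # silence or empty input
    return 20 * math.log10(peak)

def peak_in_range(samples, low_db=-6.0, high_db=-3.0):
    """True if the peak sits inside the recommended -6 dB to -3 dB window."""
    return low_db <= peak_dbfs(samples) <= high_db

voice = [0.5, -0.6, 0.2]             # hypothetical samples
print(round(peak_dbfs(voice), 2))    # -4.44
print(peak_in_range(voice))          # True
```

A recording that fails the check can be normalized in the audio editor before it goes to the lip-sync tool.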
