AI Podcast Intro Generator Guide

Step-by-step tutorial for building AI podcast intros with specific Flixly models, timing targets, and export settings that match real podcast workflows.

By Flixly Team, April 14, 2026

TL;DR

AI podcast intro generators work best when you combine Gemini 3.1 Flash TTS for narration with a separate music generation pass rather than relying on one tool. Follow the eight-step sequence below to produce a 25-30 second file at 44.1 kHz in which the music sits correctly under the voice.

Many creators assume an AI podcast intro generator is just a single button that spits out voiceover from text.

That assumption falls short because strong intros layer audio from more than one model. Flixly combines Gemini 3.1 Flash TTS for narration with separate music generation passes to reach the 15-30 second durations typical of podcast intros.

Picking models for audio layers

Start with the voice. Gemini 3.1 Flash TTS produces clean narration at 128 kbps. Pair it with stems from Music Generation when you need upbeat background tracks under 20 seconds.

Voice options

  • Gemini 3.1 Flash TTS for fast English delivery with under 10 seconds of latency.
  • Voice cloning when the host wants consistent branding across 50 episodes.

Music options

  • Music generation at 15-second loops for quick fades.
  • Reference tracks uploaded to match existing podcast tone.

Generate the narration in Text to Speech first, then create the bed in Music Generation and layer the two.

Building the intro in sequence

Follow these exact actions to produce one ready-to-export file.

  1. Create a free account at the sign-up page and add 500 credits to cover multiple test generations.
  2. Open the text-to-speech page and paste a script of 22 words or fewer, such as the 11-word line "This is the daily tech update with fresh stories every morning."
  3. Select Gemini 3.1 Flash TTS, set speed to 1.05x, and generate a 12-second clip.
  4. Switch to the music tool, enter prompt "short electronic bed 15 seconds no vocals 120 bpm," and generate three variations.
  5. Download both files and import them into any DAW or the Flixly video editor, then apply a 3-second crossfade with the music bed peaking around -18 dBFS.
  6. Run a final voice clone pass if you want the same timbre on future episodes.
  7. Export as 44.1 kHz WAV at 24-bit depth for platform upload.
  8. Test the file in your podcast host's player at 50% volume to confirm the music sits under the voice.
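
As a rough illustration of the level arithmetic behind step 5, the sketch below converts the -18 dB figure to a linear gain and builds the gain pairs for a 3-second crossfade at 44.1 kHz. The function names are hypothetical, the linear fade shape is an assumption (a DAW may default to an equal-power curve instead), and nothing here calls a Flixly API.

```python
def db_to_gain(db: float) -> float:
    """Convert a level in decibels to a linear amplitude multiplier."""
    return 10 ** (db / 20)


def linear_crossfade(n: int) -> list[tuple[float, float]]:
    """Return (fade_out, fade_in) gain pairs for an n-sample linear crossfade."""
    if n < 2:
        raise ValueError("a crossfade needs at least 2 samples")
    return [(1 - i / (n - 1), i / (n - 1)) for i in range(n)]


SAMPLE_RATE = 44_100   # Hz, the export rate named in step 7
FADE_SECONDS = 3       # crossfade length from step 5
MUSIC_BED_DB = -18.0   # music level from step 5

fade = linear_crossfade(SAMPLE_RATE * FADE_SECONDS)
music_gain = db_to_gain(MUSIC_BED_DB)
print(f"fade samples: {len(fade)}, music gain: {music_gain:.3f}")
```

Multiplying each music sample by `music_gain`, and by the matching fade value inside the crossfade window, is what a DAW's fade handles do internally.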

Checking output quality

Listen on three devices. Count words spoken in the first 8 seconds. If the count exceeds 18 words, shorten the script and regenerate. Compare peak levels: narration should sit at -10 dBFS while music peaks at -18 dBFS.
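
The peak comparison can also be checked outside a DAW. The following is a minimal sketch using only Python's standard library, assuming the clips are uncompressed 16-bit or 24-bit PCM WAV files; `peak_dbfs` is a hypothetical helper name and the file names in the usage note are placeholders.

```python
import math
import wave


def peak_dbfs(path: str) -> float:
    """Return the peak sample level of a PCM WAV file in dBFS (0 = full scale)."""
    with wave.open(path, "rb") as wav:
        width = wav.getsampwidth()  # bytes per sample: 2 = 16-bit, 3 = 24-bit
        frames = wav.readframes(wav.getnframes())
    if width not in (2, 3):
        raise ValueError("this sketch only handles 16-bit and 24-bit PCM")
    full_scale = 2 ** (8 * width - 1)
    peak = 0
    # Samples are little-endian signed integers, interleaved across channels.
    for i in range(0, len(frames) - width + 1, width):
        sample = int.from_bytes(frames[i:i + width], "little", signed=True)
        peak = max(peak, abs(sample))
    if peak == 0:
        return -math.inf  # digital silence
    return 20 * math.log10(peak / full_scale)
```

Calling `peak_dbfs("narration.wav")` and `peak_dbfs("music.wav")` gives two numbers to compare against the -10 dBFS and -18 dBFS targets.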

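| Element   | Target length | Sample rate | Bit depth | Model used           |
|-----------|---------------|-------------|-----------|----------------------|
| Narration | 8-12 seconds  | 44.1 kHz    | 24-bit    | Gemini 3.1 Flash TTS |
| Music bed | 15-20 seconds | 48 kHz      | 24-bit    | Music Generation     |
| Final mix | 25-30 seconds | 44.1 kHz    | 16-bit    | Manual crossfade     |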

The table above shows the settings used for a 28-second intro produced last week.
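
To confirm that an exported file matches a row of the table, the WAV header can be read directly. This is a sketch under the assumption that the export is a standard PCM WAV; `describe_wav` and `matches_target` are hypothetical helper names, and the file name in the usage note is a placeholder.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read sample rate, bit depth, channel count and duration from a WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def matches_target(info: dict, sample_rate_hz: int, bit_depth: int,
                   min_s: float, max_s: float) -> bool:
    """Check a described file against one row of the settings table."""
    return (
        info["sample_rate_hz"] == sample_rate_hz
        and info["bit_depth"] == bit_depth
        and min_s <= info["duration_s"] <= max_s
    )
```

For the final mix row, the check would be `matches_target(describe_wav("intro.wav"), 44_100, 16, 25, 30)`.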

Common timing mistakes

Scripts that run past 35 words usually push total length over 35 seconds. Trim to 22 words max. Music prompts without a BPM reference produce tracks that clash with voice cadence.
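
Both mistakes can be caught before spending credits. The sketch below counts script words, estimates narration length, and checks a music prompt for a BPM value. The speaking pace of 18 words per 8 seconds is an assumption borrowed from the quality check in this guide, not a measured property of any model, and all function names are hypothetical.

```python
import re

WORDS_PER_SECOND = 18 / 8  # assumed pace: 18 words in the first 8 seconds
MAX_WORDS = 22             # trim target from this guide


def count_words(script: str) -> int:
    """Count whitespace-separated words in a script."""
    return len(script.split())


def estimate_narration_seconds(script: str, wps: float = WORDS_PER_SECOND) -> float:
    """Rough narration length in seconds at the assumed speaking pace."""
    return count_words(script) / wps


def script_fits(script: str, max_words: int = MAX_WORDS) -> bool:
    """True when the script is within the word limit."""
    return count_words(script) <= max_words


def has_bpm(prompt: str) -> bool:
    """True when a music prompt names a tempo such as '120 bpm'."""
    return re.search(r"\b\d{2,3}\s*bpm\b", prompt, flags=re.IGNORECASE) is not None
```

Running `script_fits` and `has_bpm` on each draft before generating is a cheap way to avoid a regeneration pass.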

Try Voice Cloning when you need identical delivery week after week.

Testing across platforms

Upload the file to three podcast directories and play the first 10 seconds on mobile and desktop. Lower the gain if the intro clips on phone speakers, then re-export once with a 1-second silence tail.
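
If the editor has no option for a silence tail, one can be appended afterwards. This is a minimal sketch with Python's standard `wave` module, assuming a 16-bit or 24-bit PCM WAV (where zero bytes represent silence); `add_silence_tail` is a hypothetical helper name and the file names in the usage note are placeholders.

```python
import wave


def add_silence_tail(src: str, dst: str, seconds: float = 1.0) -> None:
    """Copy a PCM WAV file to dst with `seconds` of silence appended."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        audio = reader.readframes(reader.getnframes())
    # Signed 16-bit and 24-bit PCM encode silence as zero bytes.
    tail_frames = int(round(seconds * params.framerate))
    silence = bytes(tail_frames * params.nchannels * params.sampwidth)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(audio + silence)
```

For example, `add_silence_tail("intro.wav", "intro_tail.wav")` writes a copy that is one second longer and leaves the original untouched.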

Apply this model mix the next time you need fresh audio. Start at Text to Speech.

Frequently Asked Questions

How many credits does a typical 30-second podcast intro cost in Flixly?

One narration pass with Gemini 3.1 Flash TTS uses about 8 credits. Music generation adds another 6 credits. A full mix test run lands near 20 credits total.
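
For budgeting, the figures above can be turned into a quick estimate. The sketch below treats the approximate numbers in this answer as fixed values, which is an assumption since actual costs may vary, and uses the 500-credit balance from the setup step. The helper names are hypothetical.

```python
NARRATION_CREDITS = 8   # approximate cost of one narration pass
MUSIC_CREDITS = 6       # approximate cost of one music generation
TEST_RUN_CREDITS = 20   # approximate cost of a full mix test run
STARTING_CREDITS = 500  # balance suggested in the setup step


def single_pass_cost(narration: int = NARRATION_CREDITS,
                     music: int = MUSIC_CREDITS) -> int:
    """Credits for one narration pass plus one music generation."""
    return narration + music


def affordable_runs(balance: int, cost_per_run: int) -> int:
    """Whole number of runs a balance covers."""
    if cost_per_run <= 0:
        raise ValueError("cost_per_run must be positive")
    return balance // cost_per_run


print(single_pass_cost())
print(affordable_runs(STARTING_CREDITS, TEST_RUN_CREDITS))
```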

Can I keep the same voice across multiple podcast episodes?

Yes. Run a voice cloning pass on a 30-second clean sample of the host. Save the clone profile and reference it in later text-to-speech jobs.

What script length works best for a 25-second intro?

Aim for 20-24 words spoken at 1.05x speed. This timing leaves room for a 3-second music swell at the start and end.
