How to Integrate AI TTS via API

A practical walkthrough for calling the Flixly TTS endpoint, choosing Gemini 3.1 Flash TTS, handling responses, and verifying audio output with real parameter examples.

By Flixly Team, April 14, 2026

TL;DR

Sign up, buy credits, select Gemini 3.1 Flash TTS, build a JSON payload with text and format fields, post to the endpoint, poll for the file URL, then download and verify the 24 kHz wav output. The same flow supports cloned voices and parallel batches.

You need voiceovers for 12 short clips from a 4-minute script and the deadline is tonight. Start at the Text to Speech page to confirm available models before touching any code.

Set up your account and credits

Create an account at the sign-up page and buy a starter pack of 500 credits. A 30-second TTS generation costs about 8 credits with Gemini 3.1 Flash TTS. Check your balance in the dashboard before every batch run.
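To budget a batch before you run it, the arithmetic can be sketched in a few lines of Python. The rate of about 8 credits per 30 seconds comes from the paragraph above. The function name and the assumption that cost scales linearly with audio length are illustrative only, so treat the result as a rough estimate.

```python
import math

CREDITS_PER_30_SECONDS = 8  # approximate rate quoted above for Gemini 3.1 Flash TTS


def estimate_credits(total_seconds: float) -> int:
    """Estimate credits, assuming cost scales linearly with audio length."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    return math.ceil(total_seconds / 30 * CREDITS_PER_30_SECONDS)


# A 4-minute script is 240 seconds of audio.
needed = estimate_credits(4 * 60)
print(needed, "credits needed,", 500 - needed, "left from a 500-credit pack")
```

Under that linear assumption, the 4-minute script from the intro needs about 64 credits, well inside a 500-credit starter pack.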

Choose the right model

Flixly lists Gemini 3.1 Flash TTS, Seedance 2.0 audio tracks, and Kling 3.0 voice options. Gemini 3.1 Flash TTS handles English at 24 kHz with low latency. Test a single line first to compare output quality against your script tone.

Build the request payload

Prepare a JSON body with text, model, voice_id, speed, and format fields. Speed accepts values from 0.8 to 1.3. Format supports mp3 or wav at 16 kHz or 24 kHz. Keep the text within the 3000-character limit per request.
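A small helper can catch out-of-range values before they cost you a failed call. This is a minimal sketch: the field names and limits come from this post, while the function name and default values are hypothetical. The post does not say how the sample rate is selected, so the sketch leaves it out.

```python
import json

ALLOWED_FORMATS = {"mp3", "wav"}
MAX_TEXT_CHARS = 3000


def build_payload(text, model="gemini-3.1-flash-tts", speed=1.0, fmt="wav", voice_id=None):
    """Validate the fields described above and return a dict ready for json.dumps."""
    if not text:
        raise ValueError("text must not be empty")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text has {len(text)} characters, limit is {MAX_TEXT_CHARS}")
    if not 0.8 <= speed <= 1.3:
        raise ValueError("speed must be between 0.8 and 1.3")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError("format must be mp3 or wav")
    payload = {"text": text, "model": model, "speed": speed, "format": fmt}
    if voice_id is not None:
        payload["voice_id"] = voice_id  # only sent when a voice is chosen
    return payload


# Print the JSON body for a short test line.
print(json.dumps(build_payload("Hello world today", speed=1.1), indent=2))
```

Copy the model string from the dashboard rather than relying on the default shown here.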

Send the API call

Use your API key from the dashboard settings. Post to the endpoint with the payload. The response returns a job_id and estimated credits. Poll the status endpoint every 4 seconds until the file URL appears.
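The submit step might look like the sketch below, using only the Python standard library. The endpoint URL is a placeholder and the Bearer authorization header is an assumption, since this post does not state either. The job_id response field is the one described above.

```python
import json
import urllib.request

# Placeholder URL: replace it with the real endpoint from the Flixly documentation.
TTS_ENDPOINT = "https://api.example.com/v1/tts"


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the TTS endpoint."""
    return urllib.request.Request(
        TTS_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Bearer auth is an assumption; check how the API expects the key.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def submit_job(payload: dict, api_key: str, opener=urllib.request.urlopen) -> str:
    """Send the request and return the job_id from the JSON response body."""
    with opener(build_request(payload, api_key), timeout=30) as response:
        body = json.load(response)
    return body["job_id"]
```

The opener parameter exists so you can pass a stub in tests and exercise the code without spending credits. In real use, call submit_job({"text": "Hello world today", "format": "wav"}, api_key) with the key from your dashboard settings.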

Verify and download the file

Listen to the first 10 seconds for pronunciation errors. If speed feels off, adjust and rerun. Download the wav file at 24 kHz for editing in your video tool. The file URL expires after 48 hours, so download the file before then.
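Listening is still the real check for pronunciation, but the sample rate is easy to confirm in code. Here is a minimal sketch using Python's built-in wave module. The function name is hypothetical, and the check assumes the download is a standard PCM wav file.

```python
import wave


def check_wav(path_or_file, expected_rate=24000):
    """Return (sample_rate, duration_seconds); raise if the rate differs."""
    with wave.open(path_or_file, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if rate != expected_rate:
        raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz")
    return rate, duration
```

Call check_wav("clip_01.wav") on each download, where the file name is just an example. Comparing the returned duration with the clip length you expected is a quick way to spot truncated output.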

Step-by-step integration

  1. Log in and note your API key from account settings. The key stays valid for 90 days.
  2. Open the Text to Speech tool page and copy the exact model string for Gemini 3.1 Flash TTS.
  3. Write a test script of 45 words and count characters. This keeps the call small while checking latency.
  4. Build the JSON payload with text, model, and output format set to wav. Save it as a local file for reuse.
  5. Send the POST request using curl or your language of choice. Capture the job_id from the response body.
  6. Poll the status URL with the job_id until the state changes to complete. Expect 6 to 12 seconds for a 20-second clip.
  7. Download the audio file and play it back. Note the exact credit cost shown in the response header.
  8. Repeat the call with updated speed or voice settings until the delivery matches your script.
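The polling in step 6 can be sketched as a small loop. The 4-second interval and the complete state come from this post. The state and file_url key names, the failed state, and the 120-second timeout are assumptions, and get_status stands in for whatever function you write to call the status URL.

```python
import time


def poll_until_complete(get_status, job_id, interval=4.0, timeout=120.0, sleep=time.sleep):
    """Call get_status(job_id) every `interval` seconds until the state is 'complete'.

    get_status is a caller-supplied function returning a dict such as
    {"state": "complete", "file_url": "..."}; those key names are assumptions.
    """
    waited = 0.0
    while True:
        status = get_status(job_id)
        if status.get("state") == "complete":
            return status
        if status.get("state") == "failed":  # hypothetical failure state
            raise RuntimeError(f"job {job_id} failed: {status}")
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} not complete after {timeout} seconds")
        sleep(interval)
        waited += interval
```

A timeout keeps a stuck job from blocking the rest of a batch. With 6 to 12 seconds expected for a 20-second clip, most jobs should finish within two or three polls.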

Parameter reference table

Field    | Type   | Example value        | Notes
text     | string | "Hello world today"  | Max 3000 chars per request
model    | string | gemini-3.1-flash-tts | Use exact string from dashboard
speed    | float  | 1.1                  | Range 0.8 to 1.3
format   | string | wav                  | 16 kHz or 24 kHz options
voice_id | string | clone_482            | Optional when using cloned voice

Combine with other tools

After TTS generation, feed the audio into Lip Sync Video for character mouth movement. The same credit balance works across tools. Link the resulting video to Shorts Generator if you need vertical crops at 9:16. Each step logs its own credit use so you can track costs per clip.

Handle errors

If the response shows insufficient credits, buy more before the next batch. If pronunciation fails on a name, add phonetic spelling inside brackets in the text field. The API returns error code 422 for format mismatches. When that happens, fix the payload and retry the same job_id within 10 minutes.
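These rules can be turned into a small decision helper for batch scripts. This is only a sketch. The 422 code and the 10-minute retry window come from the paragraph above. The wording of the insufficient-credits message, the action names, and the fallback of submitting a new job once the window has passed are assumptions.

```python
RETRY_WINDOW_SECONDS = 10 * 60  # retry window for the same job_id, per the text above


def next_action(status_code, seconds_since_failure, message=""):
    """Suggest a next step for a failed call, based on the rules described above."""
    if "insufficient credits" in message.lower():  # message wording is an assumption
        return "buy_credits"
    if status_code == 422:
        if seconds_since_failure <= RETRY_WINDOW_SECONDS:
            return "fix_payload_and_retry_same_job"
        return "fix_payload_and_submit_new_job"  # assumed fallback after the window
    return "inspect_response"


print(next_action(422, seconds_since_failure=90))
```

Logging the returned action next to each job_id makes it easier to see which clips in a batch still need attention.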

Scale to production

Store the API key in an environment variable. Split long scripts into 250-word chunks in a loop. Chunks can generate in parallel as long as you space the calls 2 seconds apart. Monitor total credits used against your monthly budget in the dashboard.
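The chunk-and-stagger pattern can be sketched as follows. The 250-word chunk size and the 2-second spacing come from this post. The function names and the worker count are hypothetical, and generate_chunk stands in for your own function that submits one chunk and returns its result.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(script, words_per_chunk=250):
    """Split a script into chunks of at most words_per_chunk words."""
    words = script.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def generate_all(script, generate_chunk, spacing=2.0, sleep=time.sleep, max_workers=4):
    """Start one generate_chunk(text) call per chunk, spaced apart, and return results in order."""
    chunks = split_into_chunks(script)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for index, chunk in enumerate(chunks):
            if index > 0:
                sleep(spacing)  # wait before starting the next call
            futures.append(pool.submit(generate_chunk, chunk))
        # Collect results in submission order so the audio stays in script order.
        return [future.result() for future in futures]
```

Splitting on whitespace can cut a chunk mid-sentence and drops line breaks, so for narration you may prefer to split at sentence boundaries near the 250-word mark instead.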

Related audio options

Test Voice Cloning to match an existing narrator. The clone file uploads once and then appears as a selectable voice_id in TTS calls. Pair the output with Music Generation at low volume for background tracks under 10 seconds.

You now hold a working script that turns text into timed audio files on demand. Run the same flow again at Text to Speech whenever new copy arrives.
