AI Chat Personalization Tutorial Guide

Follow this exact workflow to clone voices, generate speech, and sync video replies inside Flixly. Eight concrete steps and a settings table are included.

By Flixly Team, April 14, 2026

TL;DR

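Clone a voice from a 30-second reference sample, generate speech with Gemini 3.1 Flash TTS, then lip-sync the audio to a headshot clip. Reuse the same clone ID across the conversation to keep the character's identity consistent. A reply built from one TTS line, one lip-sync render, and one music bed costs 26 credits (3 + 15 + 8), not counting the 25 credits for creating the clone.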

The real question behind AI chat personalization is not which prompt to tweak first. It is how to make every voice and visual match one consistent character across sessions.

Flixly lets you do that with voice cloning and text-to-speech models that run on the same credit system as image and video tools.

Start with a reference voice sample

Record 30 seconds of clean speech from the target speaker. Upload it directly to the voice cloning page. The system returns a cloned model in about 45 seconds on average.
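
Before uploading, it can help to confirm that the recording really is at least 30 seconds long. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files. The function names and the `MIN_SECONDS` constant are illustrative and not part of Flixly; any upload requirements beyond the 30-second length stated in this guide are not covered here.

```python
import wave

MIN_SECONDS = 30.0  # reference length recommended in this guide


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, min_seconds: float = MIN_SECONDS) -> bool:
    """True when the file meets the minimum reference length."""
    return wav_duration_seconds(path) >= min_seconds
```

Calling `is_long_enough("reference.wav")` on your own recording (the file name is a placeholder) gives a quick yes or no before you upload.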

Use the cloned model inside Text to Speech with Gemini 3.1 Flash TTS. Type the exact lines your chat will speak. Export at 24 kHz mono for lowest latency in chat widgets.

Add lip sync for video replies

When your chat needs to answer with video, send the generated audio to Lip Sync Video. Choose a 5-second headshot clip as the base. The tool outputs a 1080p 30 fps file with mouth movement timed to phonemes.

Keep music and captions consistent

Generate background tracks in Music Generation at 120 BPM for a neutral tone. Set Auto Captions to 0.8 seconds per line to match the cloned voice speed.
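
To see what 0.8 seconds per line means in practice, here is a minimal sketch that lays out caption timings in SubRip (SRT) form. SRT is used only because it is a common subtitle format; this guide does not say which formats Auto Captions reads or writes, and the function names are illustrative. The sketch also assumes every line is shown for the same fixed duration.

```python
def srt_timestamp(total_ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(lines: list[str], seconds_per_line: float = 0.8) -> str:
    """Give each caption line a fixed, back-to-back time slot."""
    ms_per_line = round(seconds_per_line * 1000)  # integer math avoids float drift
    blocks = []
    for index, text in enumerate(lines):
        start = index * ms_per_line
        end = start + ms_per_line
        blocks.append(
            f"{index + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

With the default setting, the second caption line starts at 00:00:00,800 and ends at 00:00:01,600.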

Tradeoffs to expect

Cloned voices drop 8-12% in naturalness on the first generation when the reference has background noise. Rerun with a cleaner sample to recover quality. Video lip sync adds 15 credits per 10-second clip.
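
The settings reference in this guide suggests rerunning the clone when the sample's signal-to-noise ratio (SNR) falls below 20 dB. A rough way to estimate SNR yourself is to compare the RMS level of a speech segment with the RMS level of a noise-only segment, such as the silence before the speaker starts. The sketch below assumes you have already split the recording into those two lists of samples. The function names are illustrative, and because the speech segment still contains some noise, the result is an approximation.

```python
import math


def rms(samples: list[float]) -> float:
    """Root-mean-square level of a list of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech: list[float], noise: list[float]) -> float:
    """Approximate SNR in dB from a speech segment and a noise-only segment."""
    noise_level = rms(noise)
    if noise_level == 0:
        return math.inf  # perfectly silent noise floor
    return 20 * math.log10(rms(speech) / noise_level)


def needs_rerecord(speech: list[float], noise: list[float],
                   threshold_db: float = 20.0) -> bool:
    """True when the estimated SNR is below the guide's 20 dB threshold."""
    return snr_db(speech, noise) < threshold_db
```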

Step-by-step setup

  1. Sign up at /auth/register and purchase 500 credits.
  2. Open the voice cloning tool and upload a 30-second WAV file.
  3. Name the clone "SupportAgent-01" and save the model ID.
  4. Switch to text-to-speech, select Gemini 3.1 Flash TTS, paste the clone ID, and generate the first line.
  5. Download the MP3 and test latency in your chat widget.
  6. For video replies, import the audio into lip sync, pick a reference face, and render at 1080p.
  7. Generate a matching music bed at 120 BPM and layer it 12 dB below the voice.
  8. Apply auto captions, export the final MP4, and embed the link in the chat response.
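
Step 7 places the music 12 dB below the voice. If you ever mix the two tracks yourself, a decibel offset converts to a linear gain of 10^(dB/20), so -12 dB scales the music to roughly a quarter of its amplitude. The sketch below assumes both tracks are lists of float samples at the same sample rate and start at a comparable loudness. The function names are illustrative and this is not how Flixly mixes audio internally.

```python
from itertools import zip_longest


def db_to_gain(db: float) -> float:
    """Convert a decibel offset to a linear amplitude multiplier."""
    return 10 ** (db / 20)


def mix(voice: list[float], music: list[float],
        music_offset_db: float = -12.0) -> list[float]:
    """Sum voice and attenuated music; the shorter track is padded with silence."""
    gain = db_to_gain(music_offset_db)
    return [v + gain * m for v, m in zip_longest(voice, music, fillvalue=0.0)]
```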

Settings reference

| Element | Recommended value | Credit cost | Output format | Notes |
| --- | --- | --- | --- | --- |
| Voice sample | 30 s clean WAV | 25 | Model file | Rerun if SNR < 20 dB |
| TTS length | 120 words | 3 | MP3 24 kHz | Use Gemini 3.1 Flash TTS |
| Lip sync clip | 5 s headshot | 15 | MP4 1080p 30 fps | Sync offset under 40 ms |
| Music bed | 120 BPM | 8 | MP3 stereo | Keep under voice by 12 dB |
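
The credit costs above can be combined into a rough budget estimate. The sketch below takes its numbers from this guide: 25 credits for the clone, 3 per TTS line, 15 per 10-second lip-sync clip, and 8 per music bed. It assumes lip sync is billed per started 10-second clip, rounding up, which this guide does not confirm. The function names are illustrative.

```python
import math

CLONE_CREDITS = 25
TTS_CREDITS_PER_LINE = 3
LIP_SYNC_CREDITS_PER_CLIP = 15
LIP_SYNC_CLIP_SECONDS = 10
MUSIC_BED_CREDITS = 8


def reply_credits(tts_lines: int = 1, video_seconds: float = 0.0,
                  music_beds: int = 0) -> int:
    """Estimate credits for one reply (assumes per-clip rounding up for lip sync)."""
    clips = math.ceil(video_seconds / LIP_SYNC_CLIP_SECONDS)
    return (tts_lines * TTS_CREDITS_PER_LINE
            + clips * LIP_SYNC_CREDITS_PER_CLIP
            + music_beds * MUSIC_BED_CREDITS)


def replies_within_budget(budget: int, credits_per_reply: int) -> int:
    """How many replies fit in a budget after paying once for the clone."""
    return max(0, (budget - CLONE_CREDITS) // credits_per_reply)
```

Under that assumption, `reply_credits(video_seconds=10, music_beds=1)` gives 26, matching the figure in the FAQ, while 60 seconds of lip-synced video would come to 101 credits. A 500-credit purchase covers the clone plus 18 replies at 26 credits each.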

The one decision rule worth remembering is to lock the clone ID and reuse it for every new line in the same conversation thread. This single choice keeps character identity stable without extra prompt engineering.
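
One way to enforce that rule in your own chat backend is to bind the clone ID to the conversation thread the first time it is used and ignore later attempts to change it. The sketch below is a minimal in-memory version. `CloneRegistry`, the thread IDs, and the clone IDs are all hypothetical, and no Flixly API call is involved.

```python
class CloneRegistry:
    """Remembers which clone ID belongs to each conversation thread."""

    def __init__(self) -> None:
        self._by_thread: dict[str, str] = {}

    def lock(self, thread_id: str, clone_id: str) -> str:
        """Bind clone_id to the thread on first use; later calls keep the original."""
        return self._by_thread.setdefault(thread_id, clone_id)

    def clone_for(self, thread_id: str) -> str:
        """Return the locked clone ID, raising KeyError for unknown threads."""
        return self._by_thread[thread_id]
```

Locking on first use is a deliberate design choice here. If you do want to switch voices mid-chat, replace the binding explicitly rather than passing a different ID on each generation.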

FAQ

What reference length gives the best clone quality? Thirty seconds of studio-grade speech at 48 kHz works best. Shorter clips lose timbre on vowels.

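How many credits does a full 60-second personalized reply use? One TTS line costs 3 credits, one lip-sync render costs 15 credits, and one music bed costs 8 credits, for a total of 26 credits. Lip sync is priced at 15 credits per 10-second clip, so budget additional renders when the video portion runs longer than 10 seconds.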

Can I change the clone voice mid-chat? Yes. Save multiple models and switch the model ID parameter in the API call before each generation.

Does background music affect lip-sync timing? No. Music sits on a separate track and does not shift phoneme alignment.

What file format works best for embedding in chat widgets? Export TTS as 24 kHz mono MP3 and lip-sync video as 1080p H.264 MP4 with AAC audio.
