Outcome
- One vertical 9:16 video in which your avatar (or any photo of a person) says exactly what you recorded, with lips in sync.
Time and credits
- Total time: ~5-10 minutes (depending on audio length).
- Credits: 4 cr per real second of speech. A 15 s voiceover = 60 cr.
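The credit arithmetic above can be sketched in a few lines of Python. The function name and the rounding up of fractional seconds are assumptions made for illustration; this recipe only states the rate of 4 cr per second, so treat the result as an estimate.

```python
import math

CREDITS_PER_SECOND = 4  # rate stated in this recipe


def estimate_credits(speech_seconds: float) -> int:
    """Estimate the credit cost of a voiceover.

    Assumption: fractional seconds are rounded up to the next whole
    second before billing. The recipe does not specify this.
    """
    if speech_seconds < 0:
        raise ValueError("speech_seconds must be non-negative")
    return math.ceil(speech_seconds) * CREDITS_PER_SECOND


print(estimate_credits(15))  # 60, matching the 15 s example above
```

Running the estimate before you upload makes it easier to decide whether to trim the script.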
Steps
Record the audio
On your phone or with a clean mic. What matters: clarity, no echo,
no noise. MP3 or WAV. Natural delivery, short sentences. No audio? Skip this step and use TTS in the Talking photo step.
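If you recorded a WAV, you can check its length locally before uploading, since cost scales with seconds of speech. This is a minimal sketch using Python's standard `wave` module; the helper name and the file name `voiceover.wav` are hypothetical, and it only handles uncompressed PCM WAV files, not MP3.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()   # total audio frames in the file
        rate = wav_file.getframerate()   # frames per second (sample rate)
    return frames / rate


# Usage (hypothetical file name):
# seconds = wav_duration_seconds("voiceover.wav")
# print(f"{seconds:.1f} s of audio")
```

Multiplying the result by the 4 cr per second rate gives a rough cost for the recording.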
Prepare the photo
A vertical image of the person or avatar. Front or 3/4, face well
lit. It can be:
- One of the variations from your AI influencer.
- A real photo (yours or a client’s, with permission).
- An image generated with Image.
Take everything to Talking photo
At zevor.ai/talking-photo:
- Upload the photo.
- Audio tab: upload your MP3/WAV (or paste a link if it’s online).
- Alternative: in the Text tab, type what you want it to say and pick a TTS voice.
Best practices
- Clean audio: noisy recordings can cause lip sync to drift at specific spots. Recording somewhere quiet or with a decent mic noticeably improves the result.
- Natural 5-10 s phrases. 30 s blocks work but feel more artificial.
- Same photo = same identity. If you make several voiceovers with the same face, the audience starts associating them. Same principle as an AI influencer (see that recipe).
- Profile loses detail: if the face in your photo is turned far to the side, lip sync loses fidelity.
Common mistakes
- Very low-quality audio: sync depends on what the model hears. Re-record before spending credits.
- Photo with multiple faces: the AI picks one. Crop so only the person who will speak is in frame.
More on limits and formats at Talking photo.
