
# Talking photo with your own audio

> Turn a still photo into a video where the face says exactly what you recorded. Recipe for voiceovers, ads and testimonials with an AI avatar.

This recipe combines a photo (real or AI avatar) with **audio you
record**, and returns an MP4 with the face talking and lips synced.
Ideal for ads, brand voiceovers or testimonials.

## Outcome

* 1 vertical 9:16 video where your avatar (or any photo of a person)
  says exactly what you recorded, with synced lips.

## Time and credits

* **Total time**: \~5-10 minutes (depending on audio length).
* **Credits**: 4 cr per real second of speech. A 15 s voiceover = 60 cr.
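
The credit arithmetic above can be sketched as a small helper for budgeting a voiceover before you generate it. This is an illustrative sketch only: the function name is hypothetical, and rounding fractional totals up is an assumption, since this recipe does not say how partial seconds are billed.

```python
import math

CREDITS_PER_SECOND = 4  # rate stated in this recipe


def estimate_credits(speech_seconds: float) -> int:
    """Estimate the credit cost of a voiceover of the given length.

    Assumption: fractional totals are rounded up to a whole credit.
    Actual billing may round differently.
    """
    if speech_seconds < 0:
        raise ValueError("speech_seconds must be non-negative")
    return math.ceil(speech_seconds * CREDITS_PER_SECOND)


print(estimate_credits(15))  # 60, matching the 15 s example above
```

Treat the result as an estimate for planning, not as the amount that will be charged.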

## Steps

<Steps>
  <Step title="Record the audio">
    On your phone or with a clean mic. What matters: clarity, no echo,
    no noise. MP3 or WAV. Natural delivery, short sentences.

    No audio? Skip this step and use TTS in the Talking photo step
    below.
  </Step>

  <Step title="Prepare the photo">
    A vertical image of the person or avatar. Front or 3/4, face well
    lit. It can be:

    * One of the variations from your [AI influencer](/en/recipes/ai-influencer).
    * A real photo (yours or a client's, with permission).
    * An image generated with [Image](/en/modes/image).
  </Step>

  <Step title="Take everything to Talking photo">
    At [zevor.ai/talking-photo](https://zevor.ai/talking-photo):

    * Upload the photo.
    * **Audio** tab: upload your MP3/WAV (or paste a link if it's
      online).
    * Alternatively, open the **Text** tab, type what you want it to
      say, and pick a TTS voice.
  </Step>

  <Step title="Generate and download">
    Generation takes 1-3 minutes. Download the MP4: the lips are
    already synced to the voice, so the result is ready to publish.
  </Step>
</Steps>
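
Before uploading, a quick local check can confirm that the recording is in one of the formats named in the steps above and give a rough idea of its length. The sketch below is a local helper, not part of Zevor: all names are hypothetical, it uses only Python's standard library, and it measures duration for uncompressed PCM WAV files only (the standard library has no MP3 decoder). The credit estimate reuses the 4 cr per second rate from this recipe and assumes rounding up.

```python
import math
import wave
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav"}  # formats named in this recipe
CREDITS_PER_SECOND = 4                 # rate stated in this recipe


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def preflight(path: str) -> dict:
    """Check the file extension and, for WAV files, estimate length and cost.

    Assumption: fractional credit totals are rounded up.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {suffix or '(none)'}")

    report = {
        "format": suffix.lstrip("."),
        "seconds": None,            # stays None for MP3 (not measured here)
        "estimated_credits": None,
    }
    if suffix == ".wav":
        seconds = wav_duration_seconds(path)
        report["seconds"] = seconds
        report["estimated_credits"] = math.ceil(seconds * CREDITS_PER_SECOND)
    return report
```

For example, `preflight("voiceover.wav")` (a hypothetical file name) returns a dictionary with the format, the measured duration, and the estimated credits. The check only looks at the file extension and length; it says nothing about noise or echo, so listening to the recording remains the real quality test.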

## Best practices

* **Clean audio**: noisy recordings can make lip sync drift at specific
  spots. Recording somewhere quiet or with a decent mic improves the
  result considerably.
* **Natural 5-10 s phrases**. 30 s blocks work but feel more
  artificial.
* **Same photo = same identity**. If you make several voiceovers with
  the same face, the audience starts associating them. Same principle
  as an AI influencer (see [that recipe](/en/recipes/ai-influencer)).
* **Profile loses detail**: if the face is turned heavily to the side,
  lip sync loses fidelity. Prefer a front or 3/4 view.

## Common mistakes

* **Very low-quality audio**: sync depends on what the model hears.
  Re-record before spending credits.
* **Photo with multiple faces**: the AI picks one. Crop so only the
  person who will speak is in frame.

<Note>
  More on limits and formats at
  [Talking photo](/en/video-tools/talking-photo).
</Note>
