Tutorial: Getting Started with A2E
Thank you for choosing the A2E Avatar Video Synthesis Platform. A2E’s avatar lip-sync and voice-clone technology has been in development since 2021, and the platform aims to be the simplest and most powerful avatar generation tool available. Here, you can create your own AI avatar and voice, and then generate videos of your digital counterpart by simply typing text. Alternatively, you can use pre-made avatars and voices to create videos by configuring scripts.
Login
- Use your Gmail account or email address to log in at https://video.a2e.ai. Recommended browsers: Microsoft Edge or Chrome on desktop for the best experience.
- If you do not have an account yet, click “Register” to create a new account. If you have an invitation code, refer to the Q&A section for instructions on how to enter it.
- For mobile users, access https://m.a2e.ai. Note: The mobile version has fewer features. If possible, use the desktop version.
Creating Your First Video
After logging in, try creating an avatar video:
- Go to the “Create” module.
- Name your video.
- Input the script for the avatar to narrate.
- Select the appropriate language and voice tone, then click “Preview Audio”.
- Upon successful audio preview, the button will change to “Create Video.” You’ll also hear the synthesized avatar’s voice on your computer.
- The system will display the number of coins required to generate the video from the audio. If you are satisfied with the preview and the coin cost, click “Create Video” to generate the video.
- The platform will redirect you to the results page, where you can download the video after a short wait.
- To track progress and find completed videos, go to the “My Results” module.
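For readers who plan to script this flow via the A2E API rather than the web UI, the steps above can be sketched as assembling a request payload. The field names below (`title`, `script`, `language`, `voice`) are illustrative assumptions only, not A2E’s actual API schema; consult the API documentation for the real parameter names.

```python
# Sketch of the "Create" module flow as a payload builder.
# All field names here are hypothetical placeholders.

def build_video_request(title: str, script: str, language: str, voice: str) -> dict:
    """Assemble the parameters gathered in the Create module."""
    if not title:
        raise ValueError("Name your video before submitting.")
    if not script:
        raise ValueError("The avatar needs a script to narrate.")
    return {
        "title": title,
        "script": script,       # text the avatar will narrate
        "language": language,   # should match the script's language
        "voice": voice,         # chosen voice tone
    }

req = build_video_request("Demo", "Hello, world!", "en-US", "warm_female")
```

The returned dictionary mirrors the four inputs the Create module asks for: a video name, a script, a language, and a voice tone.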
Advanced Features for Video Synthesis
Choose Different Avatars
You can select:
- My Avatars: Learn how to create your own avatar here.
- Public Avatars: These are provided as examples for your reference. Note: these avatars may not have copyright or commercial use guarantees in your country.
Avatar Types: If you click “My Avatars”, you will find each avatar is labeled with 💠 or ⚡.
• 💠 Diamond: Results of “Continue Training”. These avatars are trained for higher quality lip-sync and realism. Learn how to do “continue training” here.
• ⚡ Lightning: Results of “Quick Preview”. These avatars offer basic lip-sync quality. If you clicked “Continue Training” but the training has not completed yet, the avatar will still show ⚡; only avatars whose “Continue Training” has completed are labeled with 💠. ⚡ Lightning is the suitable option for single-image generated avatars, or for training videos without synchronized audio and lip motion.
Choose a Voice
• Select a voice that matches your script. A2E supports dozens of languages and hundreds of voice tones.
• You can also use your cloned voice for dubbing. Learn how to create your own cloned voice here.
Automatic Caption
• If you would like to add captions to your video, enable “Caption” below the avatar video and configure the font, color, and background.
• Note: automatic captions are only available for text-driven videos, not audio-driven ones.
Change Background
- To change the background of your avatar video, click “Background”. Note that not all avatars support background changes; check the “Background” option for availability. If it is unavailable, select another avatar.
- To enable background change for your customized “My Avatars”, make sure you configure the background during the avatar creation process.
If the selected avatar supports background change, you will be able to:
- Use platform-provided images.
- Upload your own image or video as the background.
Create Your Customized Avatars Using Videos
A2E allows you to create an avatar using either a short video clip (called a base video) or an image. In this section, you will learn how to upload a short base video of a real person to serve as the training data for generating your personalized AI avatar videos. The base video determines the resolution, lip movement, actions, clothing, and props for future videos. For high-quality avatars, the base video must meet specific requirements. If you would like to use the API, please read this section to learn the meaning of each parameter. We offer two types of training processes to obtain “My Avatars”:
- “Quick Preview ⚡”: The training process usually finishes within 1 minute. This process is free (no diamonds consumed). This mode is the only suitable option for single-image generated avatars, or for training videos without synchronized audio and lip motion. The result of this training process will be labeled with ⚡.
- “Continue Training 💠”: The training process usually finishes within 60 minutes and costs 1 💠. The result of “Continue Training” usually has much higher lip-sync quality than “Quick Preview” and will be labeled with 💠. To apply “Continue Training”, first upload a base video and finish “Quick Preview”, then click the 🔧 icon on the avatar. Click “Continue Training 💠” in the pop-up window and wait about 60 minutes. Afterwards, the avatar icon will be labeled with 💠 in the “Create” module.
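The labeling rules above can be summarized in a few lines. This is a minimal sketch grounded only in what the tutorial states: video avatars start at ⚡ after “Quick Preview” and earn 💠 only once “Continue Training” completes, while image-generated avatars can never be continue-trained and stay at ⚡.

```python
# Sketch of the 💠/⚡ labeling rules described in this tutorial.

def avatar_label(source: str, continue_training_done: bool) -> str:
    """Return the badge shown in "My Avatars" for a given training state."""
    if source not in ("video", "image"):
        raise ValueError("source must be 'video' or 'image'")
    if source == "image":
        return "⚡"  # image avatars cannot be continue-trained
    return "💠" if continue_training_done else "⚡"
```

For example, a video avatar still mid-training shows ⚡, and flips to 💠 only when “Continue Training” finishes.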
Steps to Create a Customized Avatar from Videos
1. In the “Add Avatars” module, upload a video that meets the requirements.
• Name: Provide a name for your avatar.
• Gender: Select based on the video. This slightly affects the lip-sync results.
• Original Material: Choose Video for cloning.
• Original Background: If you want the background removed, upload a background image. Learn how to capture the background image.
2. Click “Quick Preview” to begin instant-mode cloning.
3. After training, view the avatar in “My Avatars” and use it for video creation.
4. For improved quality, click the 🔧 icon on the avatar, then click “Continue Training” in the pop-up window.
Create Your Customized Avatars Using Images
A2E allows you to create an avatar using either a short video clip (called a base video) or an image. In this section, you will learn how to use an image of a real person as the training data for generating your personalized AI avatar videos. If you choose this image mode, you will not be able to perform “continue training”. After you click “Quick Preview ⚡”, the training process usually finishes within 1 minute. This process costs 1 💠. “Continue training” is neither needed nor available for image-generated avatars afterwards.
Steps to Create a Customized Avatar from Images
1. In the “Add Avatars” module, upload an image that meets the requirements.
• Name: Provide a name for your avatar.
• Gender: Select based on the image. This slightly affects the lip-sync results.
• Original Material: Choose Image mode for cloning.
• Original Background: If you want the background removed, upload a background image.
2. Click “Quick Preview” to begin instant-mode cloning.
3. After training, view the avatar in “My Avatars” and use it for video creation.
Use Your Own Voice
A2E enables you to create a personalized voice using a short audio clip. Please note that our system automatically detects the voices of public figures and celebrities. If your submission violates our terms of use, the voice will be disabled, and your account may be suspended.
1. First, click Voice Clone, then set the voice name and select the gender.
2. (a) Upload an audio clip and click Start Training. Learn how to get a high quality audio recording.
(b) If you don’t have an audio clip ready, you can record one directly on our webpage. Click Start Recording, choose a script that suits your application, and then click the 🎙 icon to begin. The recording should be 15-20 seconds long. Once the duration is reached, click Stop to finalize the recording.
3. The results are typically ready within 1 minute and will appear on the right panel. The cloned voice can be selected in the Voices section of the Create module. In the dropdown menu, you will find your cloned voice under the Voice Clone section.
FAQ
Q: How to Get a High-Quality “Base Video”?
Lighting Recommendations:
- Use a portrait lighting setup in a studio, with at least a foreground light source and a backlight (to create an outline effect).
- If you use a green screen setting, keep the person (your model) at least 2 meters away from the green screen to achieve optimal green screen keying results.
- Avoid clothing or props that reflect green light.
Ambient Noise Recommendations:
- Ensure background noise is less than 45dB, with no echoes or reverberation.
- Avoid noise from air conditioning and computer fans. We recommend turning off the AC in the room during footage capture.
- Always check your microphone before recording.
Camera Recommendations:
- An SLR with at least 30 FPS video recording capability is recommended. Most of our public avatars are filmed with Sony A7S3 or A7M4.
- Resolution: 2K at 30fps (or higher if high-definition is required). A2E supports 4K resolution at max.
- Format: Non-HDR. A2E supports basic HDR video decoding, but HDR footage is not recommended.
- Lens: For portrait shooting, use a 40mm or 50mm lens (e.g., 50mm F1.8 prime lens). Aperture F1.8–F2.5, with a shutter speed of no more than 1/100 and ISO below 800. Adjust settings based on actual lighting conditions.
- Depending on your hardware choice, use other devices such as computers, video capture cards, or teleprompters to meet recording needs.
Microphone Recommendations:
- Lavalier microphones (e.g., clip-on mics). Avoid placing the microphone too far from the person.
- The microphone must be connected to the camera for best synchronization.
- The recorded video must include the recorded audio, with precise synchronization between audio and visuals.
Action of the Model:
- The camera should be clearly capturing the subject’s face.
- The model should keep speaking in the footage. While speaking, the model can look at a teleprompter. The teleprompter must align with the camera.
- The subject can prepare their own script or use simple material such as familiar poems, numbers, or letters. Ensure clear, loud speech. Occasional mistakes are fine; do not dwell on them. Focus on the emotion and facial expression during the speech.
- Keep movements within the camera frame and avoid blocking the face. Use general gestures (e.g., slight hand movements or head nods) but avoid meaningful actions such as approving/disapproving gestures.
- Movements should not be too exaggerated or frequent and should not obstruct the face.
- Lip movements can be slightly exaggerated compared to natural speech.
- Any actions captured in the video will appear in the generated output in the same order.
- Framed glasses are acceptable.
To Enable Background Change:
- If background replacement is required, record 1–2 seconds of “blank footage” without the subject in the frame. Then, upload a screenshot of this “blank footage” as the Original Background during the “Add Avatars” process.
- The background image does not necessarily need to be a green screen. Our AI algorithm supports any type of background, provided it perfectly matches the background of the avatar video.
- If you are filming the model in a green screen setting, ensure the subject avoids wearing green or blue clothing to prevent keying issues.
Output Requirements:
- The footage video should be 30 seconds to 5 minutes long.
- The video must include speech, with perfect synchronization between audio and lip movements.
- The video should feature only one face.
- No environmental noise or other sounds (apart from the subject’s speech) are allowed.
- Maintain a moderate speaking pace.
- A pace that is too slow may reduce lip-sync accuracy.
- A pace that is too fast may cause lip-sync distortion.
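The output requirements above lend themselves to a pre-flight check before uploading. This sketch uses only the numbers stated in this FAQ (30 seconds to 5 minutes of footage, up to 4K resolution, at least 30 FPS, exactly one face, synchronized speech); A2E’s actual server-side validation may check more or differ in detail.

```python
# Hedged pre-upload check for base-video footage, mirroring the FAQ's
# stated requirements. Thresholds come from this document only.

from dataclasses import dataclass

@dataclass
class Footage:
    duration_s: float
    fps: float
    width: int
    height: int
    face_count: int
    has_synced_speech: bool

def check_base_video(f: Footage) -> list[str]:
    """Return a list of requirement violations (empty list means OK)."""
    problems = []
    if not 30 <= f.duration_s <= 300:
        problems.append("footage should be 30 seconds to 5 minutes long")
    if f.fps < 30:
        problems.append("record at 30 FPS or higher")
    if f.width > 3840 or f.height > 2160:
        problems.append("A2E supports 4K resolution at most")
    if f.face_count != 1:
        problems.append("the video should feature only one face")
    if not f.has_synced_speech:
        problems.append("speech must be synchronized with lip movements")
    return problems
```

A 60-second 2K clip at 30 FPS with one speaking face passes every check; a 10-second, 25 FPS, 8K clip with two faces and no synced speech fails all five.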
Q: How to Record a High-Quality “Base Audio”?
Audio Duration: 15–60 seconds.
Environment Requirements:
• Background noise must be less than 45dB, with no echoes or reverberation.
• Perform sound checks before recording.
Equipment Requirements:
• Use a directional condenser microphone (e.g., a cardioid microphone).
Recording Requirements:
• Use a microphone to record audio that matches the tone and mood required for the intended application (this tone will be reflected in the synthesized audio output).
• Avoid placing the microphone too close to your mouth to prevent plosive sounds.
• If you make a mistake, skip to the next sentence rather than starting over.
• Ensure reverb is turned off, and do not perform any post-processing on the original audio file.
Recording Parameters:
• Sample rate: 48kHz
• Bit depth: 16-bit
• Channel: Mono
• After normalization, the gain should be above 0.4.
Recommended Software: Adobe Audition.
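The recording parameters above can likewise be checked before uploading. This is a minimal sketch using only the values listed in this FAQ (48 kHz, 16-bit, mono, 15–60 seconds, normalized gain above 0.4); reading these values out of an actual audio file header is left to the reader.

```python
# Hedged pre-upload check for a voice-clone "base audio" clip,
# mirroring the recording parameters listed in this FAQ.

def check_base_audio(sample_rate: int, bit_depth: int, channels: int,
                     duration_s: float, peak_gain: float) -> list[str]:
    """Return a list of parameter violations (empty list means OK)."""
    problems = []
    if sample_rate != 48000:
        problems.append("sample rate should be 48kHz")
    if bit_depth != 16:
        problems.append("bit depth should be 16-bit")
    if channels != 1:
        problems.append("channel should be mono")
    if not 15 <= duration_s <= 60:
        problems.append("audio duration should be 15-60 seconds")
    if peak_gain <= 0.4:
        problems.append("after normalization, gain should be above 0.4")
    return problems
```

A 30-second mono clip at 48 kHz / 16-bit with a normalized peak of 0.5 passes cleanly; a 10-second stereo 44.1 kHz / 24-bit clip with low gain fails on every point.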