Digital Human Project ER-NeRF: A Detailed Guide for Use and Deployment

1. Introduction to ER-NeRF

ER-NeRF (Official Github) is a Talking Portrait Synthesis project. In brief, given a video of someone speaking and an audio clip, this model processes the video to make the speaker’s lip movements match the audio.

Advantages of this model include:

  • Real-time response: The model is lightweight and processes quickly.

Disadvantages include:

  • Every video used for lip-syncing requires training. In other words, each video corresponds to a unique model.
  • The generated head movements are not very stable.

2. Deploying ER-NeRF

Environment requirements for ER-NeRF:

  • PyTorch 1.12
  • CUDA 11.x (mandatory, or there will be errors with the PyTorch3D-related code)
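
Before going further, it can help to confirm that the installed PyTorch build matches these requirements. The following is a minimal sketch, not part of ER-NeRF: the function names `check_versions` and `check_installed_torch` are made up for illustration. It inspects the CUDA version PyTorch was built against (`torch.version.cuda`), which is not necessarily the same as the locally installed CUDA toolkit.

```python
import re


def check_versions(torch_version, cuda_version):
    """Return a list of problems; an empty list means the versions look fine."""
    problems = []
    # torch.__version__ looks like "1.12.1+cu116"; only major.minor matters here.
    match = re.match(r"(\d+)\.(\d+)", torch_version or "")
    if not match or (int(match.group(1)), int(match.group(2))) != (1, 12):
        problems.append(f"PyTorch 1.12 expected, found {torch_version!r}")
    # torch.version.cuda is None on CPU-only builds, otherwise e.g. "11.6".
    if not cuda_version or cuda_version.split(".")[0] != "11":
        problems.append(f"CUDA 11.x expected, found {cuda_version!r}")
    return problems


def check_installed_torch():
    """Hypothetical helper: check the PyTorch build in the current environment."""
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed"]
    return check_versions(torch.__version__, torch.version.cuda)
```

Calling `check_installed_torch()` in the conda environment returns an empty list when both requirements are met, and a list of messages otherwise.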

Deployment steps:

(1). Run the following commands one at a time, in order:

# The local CUDA version doesn't have to be exactly 11.6, but it must be 11.x.
conda install cudatoolkit=11.6 -c pytorch

# Install PyTorch
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --index-url

# Install PyTorch3D. This step must succeed, or errors will occur in the later data processing stages
pip install "git+"

# Install TensorFlow
pip install tensorflow-gpu==2.8.0

# Install necessary libraries
apt-get update
apt install portaudio19-dev
apt-get install ffmpeg

# Clone the project repository
git clone

# Install the required libraries
cd ER-NeRF
pip install -r requirements.txt

# Reinstall protobuf and use version 3.20.3
pip uninstall protobuf
pip install protobuf==3.20.3

(2). Download the model files from the link given by the project. Once downloaded, arrange them as follows:

-- checkpoints  # Place this in the `~/.cache/torch/hub/checkpoints` directory (optional, as the source code can download it automatically)
-- data_utils  # Place this in the `ER-NeRF/data_utils` directory
    -- face_parsing
    -- face_tracking

(3). Initialize 3DMM model-related files:

cd data_utils/face_tracking

3. Training Your Own Digital Human

If you just want to use an existing model (currently only Obama), you can skip to Section 4.

The project provides a sample video (Obama). If you want to train your own digital human model, follow these steps (using the provided Obama video as an example):

(1). Download the video you want to train on and place it in the data directory. Name the video as data/<ID>/<ID>.mp4. For example, kunkun.mp4 would be placed in ER-NeRF/data/kunkun/kunkun.mp4.

wget -O data/obama/obama.mp4

Video requirements (must be met): ① Frame rate: 25FPS; ② Every frame should show the person speaking; ③ Resolution: 512x512; ④ Duration: 1-5 minutes; ⑤ The background should remain stable. Tip: You can use "Leawo Video Converter Ultimate" to process the video.
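
The measurable requirements (frame rate, resolution, duration) can be checked automatically before spending time on processing. Below is a rough sketch that reads the metadata with ffprobe, which ships with ffmpeg. The helper names `probe_video` and `check_video` are hypothetical, and the sketch assumes that the `r_frame_rate` value reported by ffprobe reflects the real frame rate of the video.

```python
import json
import subprocess
from fractions import Fraction


def probe_video(path):
    """Read width, height, frame rate and duration of the first video stream."""
    cmd = [
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-show_entries", "stream=width,height,r_frame_rate:format=duration",
        "-of", "json", path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)


def check_video(meta):
    """Return a list of violated requirements (empty list means OK)."""
    stream = meta["streams"][0]
    problems = []
    fps = Fraction(stream["r_frame_rate"])  # e.g. "25/1"
    if fps != 25:
        problems.append(f"frame rate is {float(fps):.2f} FPS, expected 25")
    if (stream["width"], stream["height"]) != (512, 512):
        problems.append(
            f"resolution is {stream['width']}x{stream['height']}, expected 512x512"
        )
    duration = float(meta["format"]["duration"])
    if not 60 <= duration <= 300:
        problems.append(f"duration is {duration:.0f}s, expected 60-300s")
    return problems
```

For example, `check_video(probe_video("data/obama/obama.mp4"))` returns an empty list for a compliant video. Requirements ② and ⑤ (a speaking person in every frame, stable background) still have to be checked by eye.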

(2). Process the video using the data_utils/ script.

python data_utils/ data/<ID>/<ID>.mp4

This step will take some time and is error-prone (often due to incorrect environment setup). The script contains multiple tasks, each generating several files saved in the data/<ID>/* folder. You can check if each task completed successfully by looking for the generated files or checking the logs:

  • Task 1: Splitting the video. This task generates the aud.wav file. Failures here are usually caused by ffmpeg; reinstalling it often resolves them.
  • Task 2: Extracting audio features (aud.npy). Errors here are typically due to protobuf version problems.
  • Task 3: Extracting each frame from the video. This task will generate many image files like ori_imgs/XXX.jpg.
  • Task 4: Semantic segmentation. This task generates many parsing/XX.png files.
  • Task 5: Extracting the background image, generating a bc.jpg file (background image of the person).
  • Task 6: Segmenting body parts and creating Ground Truth images, generating gt_imgs/XXX.jpg and torso_imgs/XXX.png (torso-only images without the face).
  • Task 7: Extracting facial landmark coordinates, generating ori_imgs/XXX.lms files.
  • Task 8: Generating face tracking data by training a tracking model (this task will take a long time). This task generates the file. Errors here are often related to pytorch3d or CUDA version.
  • Task 9: Generating transforms_train.json and transforms_val.json.

If a task fails, you can retry it using the command: python data_utils/ data/<ID>/<ID>.mp4 --task <taskId>. For example, to retry task 2: python data_utils/ data/obama/obama.mp4 --task 2
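
To find out which task needs a retry, you can look for the expected output files instead of reading through the logs. The sketch below is only an illustration built from the task list above. `TASK_OUTPUTS` and `missing_tasks` are hypothetical names, Task 8 is left out because its output file name is not given here, and the Task 9 file is assumed to be transforms_train.json (matching transforms_val.json).

```python
from pathlib import Path

# Expected outputs per task, relative to data/<ID>/ (taken from the list above).
# Task 8 is omitted because its output file name is not stated.
TASK_OUTPUTS = {
    1: ["aud.wav"],
    2: ["aud.npy"],
    3: ["ori_imgs/*.jpg"],
    4: ["parsing/*.png"],
    5: ["bc.jpg"],
    6: ["gt_imgs/*.jpg", "torso_imgs/*.png"],
    7: ["ori_imgs/*.lms"],
    9: ["transforms_train.json", "transforms_val.json"],
}


def missing_tasks(data_dir):
    """Return the task ids whose expected output files are not all present."""
    root = Path(data_dir)
    missing = []
    for task, patterns in TASK_OUTPUTS.items():
        # A task counts as done only if every pattern matches at least one file.
        if not all(any(root.glob(pattern)) for pattern in patterns):
            missing.append(task)
    return missing
```

For instance, `missing_tasks("data/obama")` returning `[7, 9]` would suggest retrying from task 7. The check only looks at file presence, so a task that crashed halfway through may still appear complete.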

(3). Copy the generated aud.npy file and name the copy aud_ds.npy. This step works around a bug in the source code, so do not skip it.
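
If you prefer to script this step, a small helper like the following would do. `copy_audio_features` is a made-up name, and the `suffix` parameter is an assumption added so that the same helper can produce other file names if needed.

```python
import shutil
from pathlib import Path


def copy_audio_features(data_dir, suffix="ds"):
    """Copy data/<ID>/aud.npy to data/<ID>/aud_<suffix>.npy and return the new path."""
    src = Path(data_dir) / "aud.npy"
    dst = Path(data_dir) / f"aud_{suffix}.npy"
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found; did the audio extraction task succeed?")
    shutil.copyfile(src, dst)  # the original aud.npy is kept
    return dst
```

For example, `copy_audio_features("data/obama")` creates data/obama/aud_ds.npy.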

(4). Generate the <ID>.csv file using OpenFace. The specific steps are as follows: ① Download OpenFace (link for Windows version); ② Extract the files and run OpenFaceOffline.exe; ③ In the Record tab, check only Record AUs; ④ Select File->Open Video, and processing will start; ⑤ After it finishes, a <ID>.csv file will be generated in the ./processed folder. Rename it to au.csv and place it in the data/<ID>/ folder.
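
The last part of this step (renaming and placing the file) can be scripted, with a quick look at the header to confirm that action units were actually recorded. This is a sketch under assumptions: `install_au_csv` is a hypothetical helper, and it assumes that OpenFace names its action-unit columns with an "AU" prefix, possibly padded with spaces.

```python
import csv
import shutil
from pathlib import Path


def install_au_csv(openface_csv, data_dir):
    """Copy the OpenFace output to data/<ID>/au.csv and list its AU columns."""
    dst = Path(data_dir) / "au.csv"
    shutil.copyfile(openface_csv, dst)
    with open(dst, newline="") as f:
        header = next(csv.reader(f), [])
    # Assumption: action-unit columns start with "AU" once whitespace is stripped.
    au_columns = [name.strip() for name in header if name.strip().startswith("AU")]
    return dst, au_columns
```

If the returned column list is empty, Record AUs was probably not checked in the Record tab.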

(5). Train the model by running the following commands:

# Command 1: Train the model
python data/obama/ --workspace trial_obama/ -O --iters 100000
# Command 2: Fine-tune
python data/obama/ --workspace trial_obama/ -O --iters 125000 --finetune_lips --patch_size 32

trial_obama is the workspace where the generated model will be saved. After running the commands, a trial_obama folder will be created with the following structure:

-- checkpoints/  # Model files
    ├── ngp_ep0013.pth  # Model for the 13th epoch
    ├── ngp_ep0014.pth
    └── ngp.pth   # Final model file
-- log_ngp.txt   # Log file
-- opt.txt  # Training options (args).
-- results   # Result files
    ├── ngp_ep0014_depth.mp4
    └── ngp_ep0014.mp4  # You can download and check the result.
-- run/ngp/events.out.xxxxx
-- validation

After completing the above commands, run the following:

python data/obama/ --workspace trial_obama_torso/ -O --torso --head_ckpt trial_obama/checkpoints/ngp.pth --iters 200000

trial_obama/checkpoints/ngp.pth is the final model file generated earlier.
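
Since the torso training depends on the head checkpoint, it may be worth confirming that the head training really finished before starting it. The sketch below relies only on the workspace layout shown above (checkpoints/ngp_epXXXX.pth and checkpoints/ngp.pth); `head_training_status` is a hypothetical helper name.

```python
import re
from pathlib import Path


def head_training_status(workspace):
    """Report whether ngp.pth exists and which epoch checkpoint is the newest."""
    ckpt_dir = Path(workspace) / "checkpoints"
    epochs = []
    for path in ckpt_dir.glob("ngp_ep*.pth"):
        match = re.fullmatch(r"ngp_ep(\d+)\.pth", path.name)
        if match:
            epochs.append(int(match.group(1)))
    return {
        "final_checkpoint": (ckpt_dir / "ngp.pth").is_file(),
        "latest_epoch": max(epochs, default=None),
    }
```

For example, `head_training_status("trial_obama")` should report `final_checkpoint` as `True` before you pass trial_obama/checkpoints/ngp.pth to --head_ckpt.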

4. Generating Digital Human Videos

Once the model is generated, you can use your own voice to create videos. Follow these three steps:

(1). Upload the audio and extract audio data (generate the corresponding npy file).

For example:

python data_utils/deepspeech_features/ --input /root/demo2.wav

Change demo2.wav to your audio file. After execution, a demo2.npy file will be generated in the same directory.

(2). Run the model inference to generate the lip-synced video file. Note that the generated video will have no audio.

python data/obama/ --workspace trial_obama_torso/ -O --torso --test --test_train --aud /root/demo2.npy

(3). Merge the audio and video together.

ffmpeg -i /root/ER-NeRF/trial_obama_torso/results/ngp_ep0028.mp4 -i /root/demo2.wav -c:v copy -c:a aac -strict experimental /root/output.mp4

ngp_ep0028.mp4 is the video generated in the second step (you can find its location in the log file), demo2.wav is the uploaded audio, and /root/output.mp4 is the desired output path.
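
If you run inference often, looking up the video name in the log each time gets tedious. The following sketch picks the newest result video and builds the same ffmpeg command as above. It assumes the videos sit in a results folder inside the workspace (as in the ffmpeg command above) and that the epoch number in the file name is zero-padded, so sorting by name is enough. `latest_result_video` and `merge_command` are hypothetical helper names.

```python
from pathlib import Path


def latest_result_video(workspace):
    """Return the newest ngp_epXXXX.mp4 in <workspace>/results, ignoring depth videos."""
    videos = [
        path for path in (Path(workspace) / "results").glob("ngp_ep*.mp4")
        if not path.stem.endswith("_depth")
    ]
    # Zero-padded epoch numbers sort correctly as plain strings.
    return max(videos, key=lambda path: path.name, default=None)


def merge_command(video, audio, output):
    """Build the ffmpeg argument list that muxes the silent video with the audio."""
    return [
        "ffmpeg", "-i", str(video), "-i", str(audio),
        "-c:v", "copy", "-c:a", "aac", "-strict", "experimental", str(output),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` to perform the merge.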

5. Comparison of Other Digital Human Models

| Model Name | Inference Speed | Requires Training | Advantages | Disadvantages |
|---|---|---|---|---|
| video-retalking | Slow | No | 1. Simple deployment. 2. No training needed; can be used directly on any video. 3. Mature project with strong compatibility. 4. Includes video processing, so you do not need to process the video yourself. | 1. Slow inference speed; not capable of real-time processing. 2. Unstable results; some videos have poor performance. |
| ER-NeRF | Fast | Yes | 1. Small model and fast inference speed; meets real-time requirements. 2. Good lip-syncing effect. | 1. Immature project; it is research code released with a paper. 2. Poor compatibility; strict requirements for the deployment environment. 3. Data processing and training are time-consuming; a 5-minute video takes about one day. |
| Wav2Lip | Fast | No | 1. Mature project. | 1. Project is quite old (about 4 years). 2. Low resolution. |


Common Errors

ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required by NearestNeighbors.

Traceback (most recent call last):
  File "data_utils/", line 417, in <module>
    extract_background(base_dir, ori_imgs_dir)
  File "data_utils/", line 112, in extract_background
    nbrs = NearestNeighbors(n_neighbors=1, algorithm='kd_tree').fit(fg_xys)
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/neighbors/", line 175, in fit
    return self._fit(X)
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/neighbors/", line 498, in _fit
    X = self._validate_data(X, accept_sparse="csr", order="C")
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/", line 605, in _validate_data
    out = check_array(X, input_name="X", **check_params)
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/utils/", line 967, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required by NearestNeighbors.

Cause: Some frames in the video do not contain a face. This typically occurs at the beginning or end of the video. You can confirm it by checking the images in the generated parsing folder. See details in the issue.
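
One possible workaround, which is my own suggestion rather than something taken from the linked issue, is to trim the video down to the longest stretch in which every frame contains a face and then reprocess it. The sketch below assumes you have already noted the indices of the face-less frames by looking through the parsing folder, and that the video runs at the required 25 FPS. `longest_face_run` and `trim_command` are hypothetical helpers.

```python
def longest_face_run(num_frames, faceless_frames):
    """Return (start, end) of the longest run of frames with a face; end is exclusive."""
    bad = set(faceless_frames)
    best = (0, 0)
    start = None
    # Iterate one step past the last frame so a run that reaches the end is closed.
    for i in range(num_frames + 1):
        has_face = i < num_frames and i not in bad
        if has_face and start is None:
            start = i
        elif not has_face and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def trim_command(src, dst, start_frame, end_frame, fps=25):
    """Build an ffmpeg argument list that keeps only the given frame range."""
    return [
        "ffmpeg", "-i", src,
        "-ss", f"{start_frame / fps:.2f}",
        "-to", f"{end_frame / fps:.2f}",
        dst,
    ]
```

For a 100-frame clip whose frames 0-2 and 98-99 have no face, `longest_face_run(100, [0, 1, 2, 98, 99])` gives `(3, 98)`, and the resulting command trims the video to 0.12s-3.92s. After trimming, check again that the video still meets the duration requirement.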

RuntimeError: Given groups=1, weight of size [32, 44, 3], expected input[8, 29, 16] to have 44 channels, but got 29 channels instead

==> Start Training Epoch 1, lr=0.001000 ...
  0% 0/7355 [00:00<?, ?it/s]Traceback (most recent call last):
  File "", line 248, in <module>
    trainer.train(train_loader, valid_loader, max_epochs)
  File "/root/ER-NeRF/nerf_triplane/", line 983, in train
  File "/root/ER-NeRF/nerf_triplane/", line 1241, in train_one_epoch
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/ER-NeRF/nerf_triplane/", line 432, in update_extra_state
    enc_a = self.encode_audio(auds)
  File "/root/ER-NeRF/nerf_triplane/", line 232, in encode_audio
    enc_a = self.audio_net(a) # [1/8, 64]
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/ER-NeRF/nerf_triplane/", line 64, in forward
    x = self.encoder_conv(x).squeeze(-1)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/", line 139, in forward
    input = module(input)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/", line 307, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/", line 303, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [32, 44, 3], expected input[8, 29, 16] to have 44 channels, but got 29 channels instead

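Cause: The audio network was built for 44-channel input, but the audio features it received have 29 channels. This usually happens when you intend to use the wav2vec audio extractor but forgot the --asr wav2vec parameter when processing the video, so the features were extracted with the default DeepSpeech instead.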
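
You can catch this mismatch before training by looking at the shape of the feature file. The sketch below is based only on the two channel counts in the error message above (29 for the default DeepSpeech features, 44 for wav2vec), and it assumes that the channel count shows up as one of the non-batch dimensions of the array. `guess_extractor` is a hypothetical helper.

```python
# Channel counts taken from the error message above.
CHANNELS = {
    29: "DeepSpeech (default)",
    44: "wav2vec (--asr_model esperanto)",
}


def guess_extractor(shape):
    """Guess which extractor produced a feature array from its shape."""
    # Skip the first dimension, which is the number of audio frames.
    for size in shape[1:]:
        if size in CHANNELS:
            return CHANNELS[size]
    return None
```

With NumPy installed, `guess_extractor(numpy.load("data/obama/aud.npy").shape)` tells you which extractor the file most likely came from, so you can compare it with the --asr_model value you plan to train with.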

python: can't open file 'nerf/': [Errno 2] No such file or directory

[INFO] ===== extract audio labels for data/zhf/aud.wav =====
python: can't open file 'nerf/': [Errno 2] No such file or directory
[INFO] ===== extracted audio labels =====

Cause: This is a bug in the source code, and the affected code needs to be rewritten. See issues#91.

How to Process a Video?

Note: My knowledge of audio and video processing is lacking, so my approach may not be optimal. Here are my steps:

  1. Download the “Leawo Video Converter Ultimate” software and open the video you need to process.
  2. Use the "Cut" option to select "Start Time" and "End Time."
  3. Use the "Crop" option to set the size to "Square" and select "Full Screen." The video may look stretched or squashed at this point. This is normal, because the display aspect ratio is not yet correct; step 6 fixes it.
  4. Click Convert to generate the video.
  5. Use ffmpeg to process the video to 512x512, using the command: ffmpeg -i input.mp4 -vf scale=512:512 output.mp4
  6. Use ffmpeg to set the aspect ratio of the video to 512:512 (1:1) with the command: ffmpeg -i input.mp4 -aspect 512:512 output.mp4

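If you need to rotate your video, you can use the command: ffmpeg -i input.mp4 -vf "transpose=2" output.mp4 (transpose=2 rotates the video 90 degrees counterclockwise; transpose=1 rotates it 90 degrees clockwise).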
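
Steps 5 and 6 can probably be merged into a single ffmpeg call that also forces the 25 FPS frame rate required for training. The sketch below is an alternative I am suggesting, not the procedure described above: it chains the fps, scale and setsar filters, where setsar=1 makes the pixels square so that a 512x512 frame is displayed as 1:1. `normalize_command` is a hypothetical helper.

```python
def normalize_command(src, dst, size=512, fps=25):
    """Build an ffmpeg argument list that resamples, resizes and fixes the aspect ratio."""
    # fps: resample to the target frame rate
    # scale: resize to size x size pixels
    # setsar=1: square pixels, so the display aspect ratio becomes 1:1
    filters = f"fps={fps},scale={size}:{size},setsar=1"
    return ["ffmpeg", "-i", src, "-vf", filters, dst]
```

Running the result with `subprocess.run(normalize_command("input.mp4", "output.mp4"), check=True)` should give a video that meets the frame rate and resolution requirements. The video must already be cropped to a square, otherwise the scaling will distort it.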

Inference Acceleration (Using wav2vec)

Using the default DeepSpeech to extract audio features can be slow. You can choose to use wav2vec for audio feature extraction, which will significantly speed up inference.

To use wav2vec, you need to add some parameters in both the training and inference phases:

Training Phase:

  • In Step 2, add the --asr wav2vec parameter: python data_utils/ data/<ID>/<ID>.mp4 --asr wav2vec
  • Change aud_ds.npy in Step 3 to aud_eo.npy, meaning you should copy the generated aud.npy and rename it to aud_eo.npy.
  • Add the --asr_model esperanto parameter in Step 5:
python data/obama/ --workspace trial_obama/ -O --iters 100000 --asr_model esperanto
python data/obama/ --workspace trial_obama/ -O --iters 125000 --finetune_lips --patch_size 32 --asr_model esperanto
python data/obama/ --workspace trial_obama_torso/ -O --torso --head_ckpt trial_obama/checkpoints/ngp.pth --iters 200000 --asr_model esperanto

Inference Phase:

  • For audio processing in Step 1, use the wav2vec feature-extraction script instead of the DeepSpeech one, with the command: python data_utils/ --wav /root/demo2.wav --save_feats
  • In Step 2, add the --asr_model esperanto parameter with the command: python data/obama/ --workspace trial_obama_torso/ -O --torso --test --test_train --aud /root/demo2_eo.npy --asr_model esperanto
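
Because the extractor choice has to stay consistent across processing, training and inference, it can help to keep the differences in one place. The table below simply restates the flags and file names from this section; `EXTRACTOR_SETTINGS` and `extra_flags` are hypothetical names, and the sketch only produces the extra arguments, not the full commands.

```python
# What changes between the two audio feature extractors, as described above.
EXTRACTOR_SETTINGS = {
    "deepspeech": {
        "process": [],                        # data processing: no extra flag
        "feature_file": "aud_ds.npy",         # name for the copy of aud.npy
        "model": [],                          # training / inference: no extra flag
    },
    "wav2vec": {
        "process": ["--asr", "wav2vec"],
        "feature_file": "aud_eo.npy",
        "model": ["--asr_model", "esperanto"],
    },
}


def extra_flags(extractor, stage):
    """Return the extra arguments for a stage: 'process' or 'model'."""
    if stage not in ("process", "model"):
        raise ValueError(f"unknown stage: {stage!r}")
    return list(EXTRACTOR_SETTINGS[extractor][stage])
```

For example, appending `extra_flags("wav2vec", "model")` to the argument list of each training and inference command keeps all of them on the same extractor.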