[Image Generation AI] Innovative! Realistic Face Swapping Experience with CG Videos

Hello, I'm Ushiyama from the CG team.

In this blog, we have been continuously exploring how to make CG characters more realistic using image-generating AI.
Let's first recap what we've covered so far.

[Image Generation AI] CG Avatar Replacement using Stable Diffusion

Here, we explained how to make only the face more realistic using masks and text prompts.

[Image Generation AI] Ensuring Avatar Consistency from Various Angles!

We aimed for consistency in portraying the same person from various angles, using not just the seed value but also ControlNet.

[Image Generation AI] Utilizing Deepfake to Create the Same Individual!

We utilized an extension called Roop for face swapping, aiming for consistency in the person's appearance. Its strength was swapping faces consistently at both close and long range.

Our verification results so far suggest that face swapping offers higher versatility than making faces more realistic through text prompts or contour detection.

Today's Task

We will continue using deepfake, this time verifying with CG videos. Judging from various examples online, face swapping in live-action footage has already reached a high level.
We will focus on a scenario of swapping faces in CG videos for apparel, keeping in mind that for apparel-related CG, the clothing should be made in CG.
If we altered the clothing with AI, it might deviate from the actual product, so we impose this constraint.

Today's Goals

  • Understand the entire process and workload for creating a video
  • Verify if there are any random elements or flickering between frames
  • Identify the parts where AI struggles and devise countermeasures

Preparation for Verification

Deepfake Mechanism

Let's first briefly look at the mechanism of deepfake in image-generating AI, as explained by ChatGPT.

  1. Detecting faces in images:
    • The AI analyzes the image to find human faces.
    • It detects facial features (eyes, nose, mouth, etc.), the orientation of the face, and its expression.
  2. Replacing faces in specified images:
    • The face from another image replaces the original face.
    • The AI adjusts the size, shadows, and skin tone of the face so the replacement looks natural.
    • The replacement is fine-tuned so that eye and mouth movements stay in sync with the original footage.

Thus, if a face "can't be detected," the face swap fails.
To test this, we will create videos with various kinds of camera work.
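
For reference, here is a minimal sketch of this detect-then-swap flow using the insightface library, which face-swap extensions such as Roop and ReActor build on. The file paths and the buffalo_l/inswapper model names are illustrative assumptions rather than settings from our pipeline.

```python
# Sketch: detect faces, then swap one into a rendered CG frame.
import cv2
import insightface
from insightface.app import FaceAnalysis

# 1. Detect faces: locate faces, landmarks, and orientation in both images
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

source_img = cv2.imread("realistic_face.png")  # the face we want to transplant
target_img = cv2.imread("cg_frame_0001.png")   # one rendered CG frame

source_face = app.get(source_img)[0]  # assumes exactly one face is present
target_faces = app.get(target_img)

# 2. Replace faces: paste_back blends size, tone, and edges into the frame
swapper = insightface.model_zoo.get_model("inswapper_128.onnx")
result = target_img
for face in target_faces:
    result = swapper.get(result, face, source_face, paste_back=True)

cv2.imwrite("swapped_0001.png", result)
```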

Preparing Materials

Let's prepare materials that can be used in our upcoming accelerando.AI project.

First, we prepare the image for face swapping.
We'll use AI to give a real-life appearance to the Zepeto avatar that serves as an influencer for accelerando.AI.

Then, we prepare the CG materials.

For this verification, we'll use AI to change only the face of a CG-created character. We won't attempt complex changes like AI-generated hair or realistic clothing; the rest of the character uses the CG render as-is. This keeps the verification simple.

We use the silver outfit from the first launch of accelerando.AI. ("People and AI Collaboration" brought to life – the first item from the future fashion brand "accelerando.Ai" goes on sale)
We didn't have time to create detailed hair, so we'll use a similar preset.

Procedure and Settings

We'll proceed as follows:

1. Use a walking-cycle motion for the CG
2. Set up a camera for each test case and render the CG videos
3. Replace the faces in each rendered video using Stable Diffusion

The AI generation uses a denoising strength of 0.1, so nothing other than the face is meaningfully altered.

Face swapping is done with an extension called ReActor.
We previously used Roop, but ReActor now seems to be the mainstream choice thanks to its higher performance.
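
As a rough illustration of step 3, the sketch below pushes each rendered frame through the AUTOMATIC1111 web UI's img2img API with the low denoising strength mentioned above. The endpoint is the stock web UI API, but the URL and folder names are assumptions, and ReActor's own parameters (which we configure in the extension UI) are omitted here.

```python
# Sketch: batch-process rendered frames through the img2img API.
# Assumes the web UI is running locally with the API enabled (--api).
import base64
import glob
import os

import requests

API_URL = "http://127.0.0.1:7860/sdapi/v1/img2img"  # assumed local instance

os.makedirs("swapped", exist_ok=True)

for path in sorted(glob.glob("frames/*.png")):
    with open(path, "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "init_images": [frame_b64],
        "denoising_strength": 0.1,  # leave everything but the face intact
        "width": 1024,
        "height": 1024,
        "steps": 20,
    }
    resp = requests.post(API_URL, json=payload)
    resp.raise_for_status()

    out_b64 = resp.json()["images"][0]
    with open(os.path.join("swapped", os.path.basename(path)), "wb") as f:
        f.write(base64.b64decode(out_b64))
```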

Also, for rendering the CG we prioritize speed and use Blender's Eevee. We also compare it against the higher-quality Cycles toward the end.
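
For reference, switching render engines and rendering the frame sequence can also be scripted with Blender's Python API. A minimal sketch, with an assumed output path:

```python
# Sketch: render the walk cycle to an image sequence, e.g. via
# `blender -b scene.blend -P render.py`. Paths are illustrative.
import bpy

scene = bpy.context.scene
scene.render.engine = "BLENDER_EEVEE"  # or "CYCLES" for the comparison pass
scene.render.resolution_x = 1024
scene.render.resolution_y = 1024
scene.render.filepath = "//frames/"    # one image per frame
bpy.ops.render.render(animation=True)
```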

Verification Results

Fixed Camera

Here, we verify whether the same person appears in every frame.
Let's first look at the face-swapping results.

The swap worked well! We intended to swap "only the face," but the blending step also made the surrounding hair slightly more realistic.
Using higher-quality CG materials, such as Cycles renders, should yield even more realistic results.

Now, let's compare the videos.

They look good!
We were concerned about changes in shadows or brightness, or sudden switches to a different face, but everything appears natural.

However, some points of concern include:

• Unstable gaze
  • If you look closely, the gaze swings left and right. It's not too noticeable, but if the eye movement is slower or clearly intentional, extra care might be needed.
• Noise at the hair boundary
  • It might not be evident in the embedded video, but there's noise at the boundary every time the hair moves.
  • As mentioned earlier, this noise seems to be a side effect of the blending step that follows the face swap.
  • It's not too concerning, but there's currently no workaround.

Camera Movement

What happens when a character enters from the edge of the screen and the face is only partially visible?

It's too fast to see clearly, so let's look at it frame by frame.

As expected, the face can't be detected while less than half of it is visible.

Thinking of Countermeasures

For cut-ins, giving face detection some leeway seems wise, so rendering with extra horizontal resolution might be a good idea, as sketched below.
This does mean longer rendering times, though, and adjusting the ratio for each cut is cumbersome...
It seems like the only option for now.
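
As a minimal sketch of this idea: render each frame with extra horizontal margin so the face enters the frame earlier, run the face swap on the wide frame, then crop back to the delivery size. The sizes and file names here are assumptions.

```python
# Sketch: crop a wide, already-swapped frame back to the delivery size.
from PIL import Image

DELIVERY_W, DELIVERY_H = 1024, 1024  # final output size
# Frame was rendered 256 px wider on each side (1536 x 1024) before swapping.

swapped = Image.open("swapped_wide_0001.png")
left = (swapped.width - DELIVERY_W) // 2
cropped = swapped.crop((left, 0, left + DELIVERY_W, DELIVERY_H))
cropped.save("final_0001.png")
```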

Camera Panning

Here, we'll see up to what angle the face can still be detected.

Let's also examine this closely, frame by frame.

Unlike with the moving camera, the face is detected here.
However, there's an error-like behavior where the AI tries to paste a frontal face onto the back of the head.

Thinking of Countermeasures

This is very difficult to avoid... A simple solution might be to fall back to the original CG frame wherever the error occurs, as sketched below.
If the motion is fast, it might go unnoticed, but a slow turn could reveal the discrepancy.
It might be best to avoid such shots in AI-based replacements.
Let's hope for future improvements in accuracy.
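
A minimal sketch of that fallback, assuming we use insightface's detection score as the confidence signal (the 0.6 threshold is an arbitrary illustration):

```python
# Sketch: keep the original CG frame when no confident face is detected
# (e.g., the back of the head during a slow turn).
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

def choose_frame(cg_path: str, swapped_path: str, min_score: float = 0.6):
    """Return the swapped frame only if a confident face was detected."""
    cg_frame = cv2.imread(cg_path)
    faces = app.get(cg_frame)
    if not faces or max(f.det_score for f in faces) < min_score:
        return cg_frame  # fall back to the untouched CG render
    return cv2.imread(swapped_path)
```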

Camera Zoom

Actually, this was the test whose outcome was hardest to predict.
In prompt-based image generation, keeping a subject consistent as its size in the frame changes is difficult, even with a fixed seed value or ControlNet.

In the previous blog post, we successfully kept faces consistent using Roop.
How about with ReActor?

It's successful!

As with the fixed camera, the eyes tend to wander slightly, but it's at an acceptable level.

As mentioned at the beginning, this is exactly why face swapping is the most versatile approach.

If we could also fix hairstyles in the same way, it would greatly enhance practicality.

Other: Comparison of CG Rendering Engines

We've been looking at AI generation on top of Eevee renders. Finally, let's compare with the higher-quality but more time-consuming Cycles.

Nice!
With proper shadows, realism increases significantly!
Since the AI face swap inherits the shadows from the CG, it pays off to set up proper lighting and rendering.

Cost

Let's look at the cost aspect, including generation time.

| Item | Per frame | Total (90 frames) |
| --- | --- | --- |
| AI generation time (face only) | 10 s | 15 min |
| Rendering time (Eevee) | 1 s | 90 s |
| Rendering time (Cycles) | 10 s | 15 min |

The test machine has an RTX 2080 and 32 GB of memory: a high-end machine from a few years ago, though not so much by today's standards.
The rendering resolution is 1024 px.
Since it doesn't require an overnight run, it seems quite feasible to try.

For this test, we kept the rendering settings quite conservative. In a production environment, Cycles would likely take more than 60 seconds per frame.

Exploring how to reduce rendering costs, perhaps using Unreal Engine, could be interesting.

Conclusion

How was it?
Just making the face realistic while leaving the rest of the CG unchanged seems to lift the overall quality.

Our verification results show that current AI technology is sufficient for practical use, at least for tasks limited to face swapping.
However, as tested, there are camera angles that AI struggles with.
Where countermeasures are hard to devise for frames the AI cannot detect, it might be necessary to rethink the camera work or the cut itself.

Moreover, since everything else remains CG, we need to make sure there's no incongruity in the textures.
Preparing hairstyles and ethnic features close to the target is a bit of a hassle.
Replacing everything from clothes to hairstyles in videos with full consistency still seems a bit further down the road.

AI is finally entering the era of video.
I hope to create rich content by combining CG and AI!

Recommended Bookmark List

Lastly...

For those who haven't yet set up Stable Diffusion: it has become much easier than at the beginning of the year, though the deluge of information can be overwhelming.
I'll introduce a few useful tutorials.

AI is in wonderland offers clear, up-to-date explanations and is highly recommended!
