[Image Generation AI] Ensuring Avatar Consistency from Various Angles!

August 5, 2023

Hello, this is Ushiyama from the CG team. Last time, I wrote the following post:

[Image Generation AI] CG Avatar Replacement using Stable Diffusion

Last time, we looked at using the image-generation AI Stable Diffusion to generate realistic avatar images that resemble the original CG images.

What We'll Do Today

Today, I'd like to dive a bit deeper into more practical matters.

Beyond making a single still image realistic, I want the images generated from various angles to look like the same person.

Since the intended use is a digital sampling site, we will make the face realistic with AI but leave the CG clothes as they are. The CG clothes are designed with actual production in mind, so we can't let the AI modify them arbitrarily.

Finally, we will use a Stable Diffusion plugin to try making a video of the avatar.

Our Goals This Time

Make the CG avatar appear as the same person in images from various angles
Composite the CG clothes and AI face smoothly
Use these results to make a video with camera work

Let's get started.

Test 1: Try Fixing the Seed Value and Prompt

A method often mentioned online for keeping the person consistent is to fix the seed value. In this test, unlike last time, the mask is turned OFF for now.

We'll use the two images above. The left one is the same image as last time, and the right one is taken from a different angle.

Let's try using the same prompt and seed value as last time.

Oh? They don't look much alike... and the image on the right is blurry.

Let's add prompts to specify race and age. We'll also add "blur" to the negative prompt to reduce the blurriness of the image.

The facial features are starting to look alike! There seems to be a slight difference in the curl of the perm.

Next, add "wavy hair" to specify the hairstyle.

...Huh? It's starting to look a bit masculine. Oh, I forgot to specify the gender!

Let's add "Woman". Now they look somewhat alike. There are still slight differences in the size of the perm, and it seems difficult to specify more than this with text.

Summary of Test 1:

As we saw last time, a lucky draw obtained without any prompts describing the character is hard to reproduce with the seed value alone.

With fewer prompts, the AI may fill in the details nicely on its own. However, the result for a given seed value depends on the prompt, so if you want to fix the seed, you need to describe the person in as much detail as possible. Even with the same seed value, the impression changes as you add prompts. So be careful: a good result drawn with only a few prompts may not be reproducible once you add more prompts later.

As prompts to describe a person, I recommend specifying race, age, gender, and hairstyle from the beginning.
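
The recommendation above can be sketched as a small helper that assembles the person description up front and pairs it with a fixed seed. This is only a minimal sketch, not the settings used in this post: the attribute values, the seed number, and both helper functions are hypothetical, and the dictionary keys (`prompt`, `negative_prompt`, `seed`) assume the request format of the AUTOMATIC1111 web UI API, which this post does not confirm.

```python
def build_person_prompt(race, age, gender, hairstyle, extra=()):
    """Join the character-fixing attributes into one comma-separated prompt."""
    parts = [f"{race} {gender}", f"{age} years old", hairstyle, *extra]
    return ", ".join(p for p in parts if p)


def build_request(prompt, seed, negative=("blur",)):
    """Bundle prompt, negative prompt, and a fixed seed (key names are assumed)."""
    return {
        "prompt": prompt,
        "negative_prompt": ", ".join(negative),
        "seed": seed,  # reuse the same seed for every camera angle
    }


# Hypothetical values, for illustration only.
prompt = build_person_prompt(
    "Japanese", 30, "woman", "wavy hair", extra=("photorealistic",)
)
front = build_request(prompt, seed=1234)
side = build_request(prompt, seed=1234)
```

Because the prompt is built once and shared, every angle gets an identical text description and seed. Changing any attribute later changes the prompt for all angles at once, which is exactly when a previously good result may stop being reproducible.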

Test 2: Trying to Extract Outlines with ControlNet

Next, I'd like to look at ControlNet, an extension for Stable Diffusion. With ControlNet, you can extract contour lines, depth, and poses from an image and use them to guide the generation of another image.

It's a powerful feature that lets you specify things that can't be expressed with text prompts.

There are two reasons for using ControlNet this time.

One is to keep the person consistent. Even with the same prompt and seed value, the features or the direction of the face may be slightly off.

The other is for later compositing. Given the constraint that only the face is made with AI while the clothes stay CG, we need to separate the two strictly at the boundary of the clothes.

Here's a comparison image of ControlNet ON/OFF.

When ControlNet is not used, you can see that the hairstyle changes. The position of the eyes and nose and the direction of the neck also change slightly.

Also, please pay attention to the neckline of the overlay comparison image on the right. The boundary line of the neck changes, making it difficult to composite the face onto the CG clothes.

The ControlNet method for extracting outlines is called Soft edge. Within Soft edge, there are several extraction methods to choose from. HED picks up the fluffy boundary line of the bangs.

On the other hand, a method called PiDiNet does not detect the bangs. In this case, I think the cause is that the boundary line of the bangs was blurry to begin with because of the perm. That said, PiDiNet is a newer method and extracts cleaner contours.

Finally, there's a contour extraction method called Canny, which is different from Soft edge. It can reproduce fine lines, but it overlooks the large boundary line of the bangs.
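
To see why a blurry boundary such as permed bangs can go undetected, here is a toy illustration. It is not how HED, PiDiNet, or Canny actually work. It only shows, under simplified assumptions, that a detector looking for abrupt brightness changes can miss a boundary that fades gradually. The pixel values and thresholds are invented for the example.

```python
def edge_positions(row, threshold):
    """Return indices where the jump to the next pixel exceeds the threshold."""
    return [i for i in range(len(row) - 1) if abs(row[i + 1] - row[i]) > threshold]


# One row of grayscale pixels (0-255); values are invented for illustration.
sharp_boundary = [20, 20, 20, 200, 200, 200]      # crisp line, e.g. a collar
soft_boundary = [20, 50, 80, 110, 140, 170, 200]  # gradual fade, e.g. permed bangs

strict = 100  # only strong, abrupt changes count as an edge
lenient = 25  # gentle changes also count

sharp_strict = edge_positions(sharp_boundary, strict)  # finds the crisp line
soft_strict = edge_positions(soft_boundary, strict)    # finds nothing
soft_lenient = edge_positions(soft_boundary, lenient)  # marks the whole fade
```

Both rows go from dark to bright by the same total amount, but only the crisp one registers under the strict threshold. The lenient threshold does respond to the soft boundary, at the cost of marking a wide band instead of a single line.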

Summary of Test 2:

Before ControlNet arrived for Stable Diffusion, too much depended on luck, and image-generation AI served merely as a source of inspiration.

With ControlNet, you can literally control the image by giving various instructions that can't be expressed in text. Now that image consistency can be assured, the output has reached a quality that can be used as a final product rather than just as inspiration.

Test 3: Problems with Using ControlNet and Masks Together

There is something I noticed when using ControlNet.

When a mask is used together with ControlNet, the realism decreases.

In the image on the left, no mask is used. This deviates from our original purpose, since not only the face but also the clothes and background are all changed, but it shows that the realism differs depending on whether a mask is used.

I don't know the cause, but this problem does not occur when using the mask alone; it occurs only when the mask is combined with ControlNet.

When I turned on ControlNet's 'Pixel Perfect' option, which sets the preprocessor resolution automatically to match the output image, it improved slightly. Still, I think there are differences in realism, such as in the highlighted parts.
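
For reference, the choices discussed so far (HED, PiDiNet, or Canny, plus Pixel Perfect) can be written down as one ControlNet "unit". This is a hedged sketch that only builds a plain dictionary and sends nothing. The key names and preprocessor names assume the API of the ControlNet extension for the AUTOMATIC1111 web UI, which this post does not confirm, and the model name is a placeholder to be replaced with whatever is installed locally.

```python
# Preprocessor names as assumed for the ControlNet extension.
PREPROCESSORS = {
    "hed": "softedge_hed",
    "pidinet": "softedge_pidinet",
    "canny": "canny",
}


def controlnet_unit(method, model_name, pixel_perfect=True, weight=1.0):
    """Build one ControlNet unit as a plain dict; nothing is sent anywhere."""
    if method not in PREPROCESSORS:
        raise ValueError(f"unknown method: {method}")
    return {
        "enabled": True,
        "module": PREPROCESSORS[method],
        "model": model_name,  # placeholder; use the model installed locally
        "weight": weight,
        "pixel_perfect": pixel_perfect,
    }


unit = controlnet_unit("hed", model_name="<your-softedge-model>")
request_fragment = {"alwayson_scripts": {"controlnet": {"args": [unit]}}}
```

Keeping the unit in one place makes it easy to rerun the same angles with a different extraction method or with Pixel Perfect switched off, and to compare the results side by side.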

Summary of Test 3:

As such, I don't think ControlNet and masks work very well together.

Using a mask would be convenient, since the output could serve as the final product directly. But even with a mask, I think there are some jagged edges, such as around the neckline.

Since some finishing in Photoshop is needed at the end anyway, I chose not to use a mask during AI generation and instead to mask and composite in Photoshop after generation.
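
Conceptually, compositing with a mask after generation is a per-pixel blend between the AI image and the CG image. The sketch below shows that idea on a single row of grayscale pixels. It is not what Photoshop does internally, and the pixel values and mask are invented for illustration. A real image would apply the same formula to every pixel and color channel.

```python
def composite(ai_pixels, cg_pixels, mask):
    """Blend two same-sized rows: mask 1.0 keeps the AI pixel, 0.0 keeps CG."""
    if not (len(ai_pixels) == len(cg_pixels) == len(mask)):
        raise ValueError("inputs must have the same length")
    return [
        round(m * a + (1.0 - m) * c)
        for a, c, m in zip(ai_pixels, cg_pixels, mask)
    ]


# Hypothetical row crossing the neckline: face on the left, clothes on the right.
ai_row = [200, 200, 200, 200]   # AI-generated image
cg_row = [100, 100, 100, 100]   # original CG render
mask = [1.0, 0.75, 0.25, 0.0]   # feathered: fades from AI to CG

blended = composite(ai_row, cg_row, mask)
```

The intermediate mask values are what soften the seam. A hard 0-or-1 mask would switch abruptly between the two sources, which is one way jagged edges around the neckline can appear.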

Test 4: Video After Fixing the Character

Now let's use these techniques to replace CG with AI at various other angles.

The features match well, such as the cleft at the tip of the chin. However, the position of the moles and the degree of the perm still differ.

Still, for a still image, it may have reached a level that doesn't look unnatural.

Finally, I would like to make these into a video.

...The perm is flickering. As expected, video is harder: not only the angle but also the continuity between frames becomes important. We will examine this area again in the future.

By the way, I'm creating the videos based on the tutorial below. Depending on the settings (probably by lowering Denoise), it should be possible to achieve a result that looks almost natural.

Summary of Test 4:

How was it? By not only fixing the prompt and seed value but also using ControlNet to extract contours, you can align features such as the face's orientation, the nose, and the neckline.

For still images, this technique can match the character across various angles.

However, there is a part we didn't cover this time: when the distance between the camera and the subject changes, not just the angle, another technique is needed to match the character.

Let's look at that next time.
