Deep-fakes and Generative AI: How easy is it to clone a person?

This blog post is based on a presentation I gave for the Microsoft Security User Group in Norway earlier this week (Meetup October 2024: AVNM & Deepfakes how easy can you clone a person.., Tue, Oct 15, 2024, 4:30 PM | Meetup), where I talked about how easy it has become to clone a person (both speech and visuals) using today's technology.

Deep-fakes can be summarized as the use of artificial intelligence (AI) and machine learning to manipulate or generate video or audio content that looks or sounds real. While the technique has been around since 2017, its use has exploded with the rapid development of Generative AI.

For instance, we have services like thispersondoesnotexist.com (1024×1024), which has been around since 2019 and uses StyleGAN2: “(StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling“. The model has been trained on thousands of different pictures of faces, which gives it a good understanding of what a person's face should look like. Even though this service has been around for over five years already, it produces a pretty realistic picture of a non-existent person.

An example picture from thispersondoesnotexist.com
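The service simply returns a newly generated face on every request, so grabbing such an image only takes a couple of lines. Here is a minimal Python sketch, assuming the site still serves the generated JPEG directly at its root URL (the output file name is just an example):

import requests

# Each request returns a freshly generated face of a non-existent person
response = requests.get("https://thispersondoesnotexist.com", timeout=30)
response.raise_for_status()

# Save the generated image locally (file name is a placeholder)
with open("generated_face.jpg", "wb") as f:
    f.write(response.content)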

Of course, services like this have been misused by cybercriminals to create fake social media profiles and carry out social engineering attacks. However, this service only generates images, while much of the deepfake content we see today involves videos, which are already widely available online.

For example, consider the deepfake videos of Tom Cruise, where a YouTube creator superimposes Tom Cruise’s face onto their own. These deepfakes are typically generated using tools that combine facial recognition with face-swapping models. Such technology is widely used in the film industry today and facial recognition is frequently applied in law enforcement.

In today’s digital age, deepfake technology has introduced a new realm of cybersecurity threats. Here are five key areas where deepfakes pose significant risks:

  1. Fraud and Financial Gain: Cybercriminals create fake videos or audio recordings to deceive individuals into transferring money or providing sensitive information.
  2. Extortion: Deepfake content can be used to blackmail individuals by depicting them in false, compromising situations.
  3. Disinformation and Reputation Damage: Malicious actors can spread fake news or undermine the credibility of institutions by distributing manipulated media.
  4. Identity Theft and Fraud: Criminals can impersonate faces using deepfake technology to steal identities and commit fraud.
  5. Manipulation of Public Trust: Deepfakes are used to craft false narratives, fostering distrust in official sources and institutions.

And we have already seen many examples of the last one in connection with the presidential campaign in the US.

We also had a case earlier this year where a finance worker was tricked using a deep-fake video of someone they thought was the CFO, which you can read more about here –> Finance worker pays out $25 million after video call with deepfake ‘chief financial officer’ | CNN. As the technology evolves and becomes better and better, it will be even harder to distinguish what is fake and what is real.

When it comes to the process of creating deepfakes, there are two main ways to do it. One is using face recognition and face-swapping models; the other is using Generative AI. The first option requires a bit more technical expertise (although many of the tools are easy to use) and also requires access to GPUs in order to generate a video in a short timeframe (preferably NVIDIA with CUDA, or AMD with DirectML), but I'll get back to that a bit later.

Generative AI has become much more accessible because it doesn’t require specialized hardware or advanced technical skills—just prompt engineering. Many GenAI models, like Midjourney, also allow users to upload their own images as input to generate new pictures. However, a key limitation is that most GenAI models have been trained on specific datasets, which often feature public figures, celebrities, and well-known politicians. As a result, these models are generally much better at producing realistic images of celebrities than of everyday individuals.

For instance, earlier this year Microsoft showcased a research project they called VASA-1, where they demonstrated the ability to generate a photorealistic video using just a single image and an audio clip (VASA-1 – Microsoft Research).

You can see all the examples on the Microsoft website listed above. You can still tell that the video is AI generated, since the hair does not move at all and the eyes have some unnatural movement. However, bear in mind that this is generated from only a single picture and an audio clip. The other interesting part is that the people in Microsoft's examples are themselves created using Generative AI.

Microsoft also eventually decided not to publish the source code behind this project, since they saw too many risks with releasing it.

Now, this was only focusing on the face; we also have other technologies that can replace the entire body. One example is Alibaba's MIMO project, where you can replace an entire body with another picture, which could be a real person or just a fictional character. You can see more examples here –> MIMO. The source code for this project is also planned to be released sometime soon.

Another interesting thing is the development of the generative AI models themselves. An easy way to visualize this is to look at the different versions of the same picture from Midjourney, where the same prompt has been run against each version of their model. 6.1 is the latest version, and you can see how much the quality and realism of the picture improved from V1 to V6.

When it comes to creating an actual deep-fake video, if we use the face-swapping approach and provide a clip where the person is also talking, we need to replicate the person's voice as well. There are many online tools available for this. ElevenLabs has a pretty decent voice cloning feature (NOTE: use of this type of service requires that you have the person's permission to replicate his/her voice; if you try to use it against a celebrity you will be prompted for a Voice Captcha).

To replicate a voice you only need about 10 minutes of a decent audio clip where the person is speaking, which in many cases can be found on YouTube or social media. Services like ElevenLabs then offer features like text-to-speech or speech-to-speech.

ElevenLabs also has the ability to switch between different languages and even supports Norwegian!

One thing to note is that with the text-to-speech feature you get no natural pauses in the audio and a pretty flat, consistent tone of voice. The best approach for realistic results is the speech-to-speech feature, where you map your own voice underneath the cloned voice (of the person you want to clone). This allows you to add pauses and change the pitch, making it sound more realistic.
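To give an idea of how the text-to-speech part can be scripted, here is a minimal Python sketch against the ElevenLabs REST API. The API key, voice ID and model name below are placeholders and assumptions on my part, so check the ElevenLabs documentation for the current endpoint and parameters:

import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"   # assumption: you have an ElevenLabs account and API key
VOICE_ID = "YOUR_CLONED_VOICE_ID"     # placeholder: the ID of the voice you cloned in ElevenLabs

# Text-to-speech endpoint (verify against the current ElevenLabs API docs)
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

response = requests.post(
    url,
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "Hello, this is a cloned voice speaking.",
        "model_id": "eleven_multilingual_v2",  # assumption: the multilingual model, e.g. for Norwegian
    },
)
response.raise_for_status()

# The response body is the generated audio clip (MP3 by default)
with open("cloned_voice.mp3", "wb") as f:
    f.write(response.content)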

While ElevenLabs is one of the services available for this, you can also run local models using tools like Coqui TTS or XTTS.
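If you go the local route, a rough sketch of voice cloning with Coqui TTS and the XTTS v2 model could look like this (model name and file paths are examples; speaker_wav should be a clean reference clip of the voice you want to clone):

from TTS.api import TTS

# Load the multilingual XTTS v2 model (downloaded on first run)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice from the reference clip and speak the given text
tts.tts_to_file(
    text="This is a locally generated clone of the reference voice.",
    speaker_wav="reference_clip.wav",  # placeholder: clean audio of the target voice
    language="en",
    file_path="cloned_local.wav",
)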

My audio process looks something like this: find suitable audio of the person online, and preferably use speech-to-speech, since this is the easiest way to copy the mannerisms of the person's voice.

When you then need to make the video, there are different tools you can use to create it and make adjustments. The most stable application I found is called FaceFusion. FaceFusion can be used with different models and provides an age modifier and a face editor that can be used to adjust things like eye size and smiling.

If you want to run this software, I recommend that you use Anaconda for Python and set up a virtual environment. You should also have a GPU with at least 12 GB of video memory, preferably an NVIDIA card; if not, you can also use AMD with the DirectML library. NOTE: Rendering entirely on the CPU will take an extremely long time!

winget install -e --id Anaconda.Miniconda3 --override "/AddToPath=1"
conda create --name facefusion python=3.10
conda activate facefusion
winget install -e --id Gyan.FFmpeg (installs FFmpeg)
winget install -e --id CodecGuide.K-LiteCodecPack.Basic
git clone https://github.com/facefusion/facefusion.git (downloads FaceFusion itself)
cd facefusion
python install.py --onnxruntime default (or directml / cuda depending on your GPU card)
python facefusion.py run

When you install it with a specific runtime provider, that option will be available in the web UI. As seen in the picture above, I only had the CPU provider, since I hadn't installed the other providers yet.

Facefusion video as seen above, using the default settings and filmed within Teams to get a “fake” background without using a Green Screen.

Within FaceFusion we can also customize other settings such as eye gaze or eye open, which allow us to adjust the face-swapping model.

Once you are happy with the results of both the audio and the video, you can merge them together using video editing software (see the FFmpeg sketch after the list below). A full deep-fake video can combine several of the techniques I have written about here. In the video example below I am using:

  • Elevenlabs to generate a text-to-speech audio clip
  • Thispersondoesnotexist to generate a picture of a person
  • A green screen and OBS software to do the recording
  • Using the green screen to add an artificial background to the video
  • FaceFusion to map the GenAI picture on top of my own face
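For the final merge of the cloned audio and the face-swapped video, you don't strictly need a video editor; FFmpeg (installed earlier) can do it. A minimal Python sketch, with placeholder file names:

import subprocess

# Replace the video's audio track with the cloned voice (all file names are placeholders)
subprocess.run(
    [
        "ffmpeg",
        "-i", "faceswapped_video.mp4",  # video produced by FaceFusion
        "-i", "cloned_voice.mp3",       # audio produced by ElevenLabs / XTTS
        "-map", "0:v:0",                # take the video stream from the first input
        "-map", "1:a:0",                # take the audio stream from the second input
        "-c:v", "copy",                 # keep the video untouched
        "-c:a", "aac",                  # re-encode the audio to AAC
        "-shortest",                    # stop at the shorter of the two inputs
        "deepfake_final.mp4",
    ],
    check=True,
)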

Could I have made this more realistic? Sure! If I had used speech-to-speech instead, it would map better against my face and lip movement. Also, for a deep-fake video of another person to look realistic, you need to share many of the same characteristics (face structure, skin texture, hair color and hairstyle, and also somewhat similar eyes).

If you don’t have that, the result will be something that is not that realistic….

Now, when looking at deep-fakes or GenAI-created photos, it is not always easy to see whether they are fake or real. There is some software out there that can detect this, such as the SightEngine AI image detector (Detect AI-generated media at scale), which actually seems to work (however, at a certain cost…).

There is, however, work in progress from the largest tech companies in an initiative called C2PA. C2PA (Coalition for Content Provenance and Authenticity) is an industry initiative aimed at combating misleading information and verifying the authenticity of digital content.

Its main objectives are:

  • Providing publishers, creators, and consumers the ability to verify content provenance
  • Combating disinformation and deepfakes
  • Enabling transparency about AI-generated content

For instance, DALL-E is already adding this metadata to the pictures it generates (C2PA in DALL·E 3 | OpenAI Help Center). This will be an important initiative when combating deep-fakes in the future, provided that websites, including social media, make this metadata clearly visible.
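If you want to inspect this provenance metadata yourself, one option is the open-source c2patool from the C2PA project. Here is a small Python sketch that shells out to it, assuming c2patool is installed and on your PATH and that the image actually carries a C2PA manifest:

import subprocess

# Ask c2patool to print the C2PA manifest embedded in an image (path is a placeholder)
result = subprocess.run(
    ["c2patool", "generated_image.png"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    # The manifest describes which tool created the file and what edits were made
    print(result.stdout)
else:
    print("No C2PA manifest found (or c2patool is missing):", result.stderr)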

In the future it will be even more difficult, regardless of C2PA, to detect deepfakes or AI-generated content. Those who want to can easily remove the metadata or use other models that do not follow the C2PA standard. I highly recommend that you spend some time training yourself on detecting whether content is generated by AI or not. One simple way of doing this is the online quiz Real Or Not.
