The dawn of deeply faked emotions

Researchers have developed a new machine learning technique for arbitrarily imposing new emotions on faces in video, adapting existing technologies that have recently emerged as solutions for synchronizing lip movements to foreign-language dubbing.

The research is an equal collaboration between Northeastern University in Boston and MIT Media Lab, and bears the title Invertable Frowns: Video-to-Video Facial Emotion Translation. Although the researchers concede that the initial quality of the results will need to be developed through further research, they claim that the technique, called Wav2Lip-Emotion, is the first of its kind to directly address full-video modification of facial expression through neural network techniques.

The base code has been released on GitHub, although model checkpoints will be added to the open source repository later, the authors promise.

On the left, a “sad” frame from the source video. On the right, a “happy” frame. In the middle are two emerging approaches to synthesizing the alternate emotion – top row: a fully masked face in which the entire expression surface is replaced; bottom row: a more traditional Wav2Lip method that replaces only the lower part of the face. Source: https://raw.githubusercontent.com/jagnusson/Wav2Lip-Emotion/main/literature/ADGD_2021_Wav2Lip-emotion.pdf

Single video as source data

In theory, such manipulations are already available through extensive training with traditional deepfake packages such as DeepFaceLab or FaceSwap. However, the standard workflow would involve using an identity other than the “target” identity, such as an actor impersonating the target, whose own expressions would be transferred to the other person along with the rest of the performance. In addition, deepfake voice-cloning techniques would normally be required to complete the illusion.

Moreover, under these popular frameworks, actually changing the target’s expression within a single source video (effectively mapping the identity back onto itself) would mean altering the facial alignment vectors in a way that these architectures do not currently allow.

Wav2Lip-Emotion maintains the lip-sync of the original video-audio dialogue while transforming the associated expressions.

Instead, Wav2Lip-Emotion seeks to effectively “copy and paste” emotion-related expressions from one part of a video and substitute them elsewhere, with a self-imposed frugality of source data that is intended ultimately to yield a less laborious method of expression manipulation.

Offline models could later be developed that are trained on alternate videos of the speaker, removing the need for each individual video to contain a “palette” of expression states with which the video can be manipulated.

Potential applications

The authors suggest a number of uses for altering expression, including a live video filter to compensate for the effects of PTSD and facial palsy. The paper states:

“People with or without inhibited facial expressions can benefit from tuning their own expressions to better fit their social circumstances. They may wish to alter the expressions in videos shown to them. During a video conference, speakers may be yelling at each other but still want to take in the content of their exchange without the uncomfortable expressions. Or a film director might want to amplify or dampen an actor’s expressions.”

Since facial expression is a key indicator of intent, even when it runs counter to the words being spoken, the ability to alter expression also offers a way to change, to some degree, how communication is received.

Previous work

Interest in altering expressions through machine learning goes back at least to 2012, when a collaboration between Adobe, Facebook and Rutgers University proposed a method for changing expressions using a tensor-based 3D geometry reconstruction approach, which painstakingly overlaid a CGI mesh on each frame of a target video to effect the change.

2012 Adobe / Facebook research manipulated expressions by making traditional, CGI-driven changes to video footage. Expressions can be expanded or suppressed. Source: https://yfalan.github.io/files/papers/FeiYang_CVPR2012.pdf

While the results were promising, the technique was arduous and the resources required were considerable. At that time, CGI was far ahead of computer vision-based approaches to direct feature-space and pixel manipulation.

More closely related to the new paper is MEAD, a 2020 dataset and expression-generation model capable of generating talking-head videos, though without the level of sophistication that can potentially be achieved by directly modifying the actual source video.

Expression generation with MEAD from 2020, a collaboration between SenseTime Research, Carnegie Mellon and three Chinese universities. Source: https://wywu.github.io/projects/MEAD/MEAD.html

In 2018, another paper, titled GANimation: Anatomically-aware Facial Animation from a Single Image, emerged from a US/Spanish academic research collaboration and used Generative Adversarial Networks to amplify or alter expressions in still images only.

Changing expressions in still images with GANimation. Source: https://arxiv.org/pdf/1807.09251.pdf

Wav2Lip-Emotion

Instead, the new project builds on Wav2Lip, which generated considerable publicity in 2020 by offering a potential method of re-synchronizing lip movement to accommodate novel speech (or song) input that never appeared in the original video.

The original Wav2Lip architecture was trained on a corpus of spoken sentences from the BBC archives. In order to adapt Wav2Lip to the task of altering expression, the researchers fine-tuned the architecture on the MEAD dataset mentioned above.

MEAD consists of 40 hours of video in which 60 actors read the same sentences while exhibiting different facial expressions. The actors come from 15 different countries and offer a range of international characteristics intended to help the project (and derivative projects) produce applicable and well-generalized expression synthesis.

At the time of the research, MEAD had only released the first part of the dataset, in which 47 people exhibit expressions such as “angry”, “disgust”, “fear”, “contempt”, “happy”, “sad” and “surprised”. In this first foray into a new approach, the researchers limited the scope of the project to superimposing or otherwise altering the perceived emotions “happy” and “sad”, as these are the easiest to recognize.
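
As a loose illustration of this restriction, the Python sketch below walks a hypothetical MEAD-style directory tree and keeps only the clips filed under “happy” or “sad” emotion folders. The directory layout, file extension and function name are assumptions made for the example, not the authors’ actual preprocessing code.

    from pathlib import Path

    # Assumed MEAD-style layout: <root>/<actor>/video/front/<emotion>/<level>/<clip>.mp4
    # The exact structure is an assumption; adjust to the real dataset on disk.
    KEEP_EMOTIONS = {"happy", "sad"}

    def collect_clips(root: str) -> list:
        """Return the paths of clips stored under a 'happy' or 'sad' folder."""
        clips = []
        for clip in Path(root).rglob("*.mp4"):
            folder_names = {parent.name.lower() for parent in clip.parents}
            if folder_names & KEEP_EMOTIONS:
                clips.append(clip)
        return sorted(clips)

    if __name__ == "__main__":
        for clip in collect_clips("MEAD"):
            print(clip)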

Method and results

The original Wav2Lip architecture replaces only the lower face area, while Wav2Lip-Emotion also experiments with a full-face replacement mask and expression synthesis. The researchers therefore also had to modify the built-in evaluation methods, since these were not designed for a full-face configuration.
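
To make the distinction concrete, the following sketch shows one plausible way of producing the two masking regimes on a single frame, given a detected face bounding box. The box coordinates and the zero-fill convention are illustrative assumptions, not the repository’s actual masking code.

    import numpy as np

    def mask_region(frame, box, full_face=False):
        """Zero out the region the generator must in-paint.

        box is (x1, y1, x2, y2) for a detected face. With full_face=False only
        the lower half of the box is masked (the original Wav2Lip regime);
        with full_face=True the whole face box is masked, as in the
        full-expression variant described above.
        """
        x1, y1, x2, y2 = box
        masked = frame.copy()
        y_start = y1 if full_face else (y1 + y2) // 2
        masked[y_start:y2, x1:x2] = 0
        return masked

    # Toy usage on a synthetic 256x256 RGB frame with an assumed face box.
    frame = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    lower_half_masked = mask_region(frame, (64, 48, 192, 208), full_face=False)
    whole_face_masked = mask_region(frame, (64, 48, 192, 208), full_face=True)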

The authors improve the original code by keeping the original audio input and maintaining the consistency of the lip movement.

The generator element, following the earlier work, comprises an identity encoder, a speech encoder and a face decoder. The speech element is encoded as stacked 2D convolutions, which are subsequently concatenated with their associated frame(s).
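
A minimal toy version of such a generator, assuming a 96x96 masked face crop and an 80-bin mel-spectrogram window as inputs, might look like the following PyTorch sketch. The layer sizes and names are placeholders and do not reproduce the actual Wav2Lip-Emotion network.

    import torch
    import torch.nn as nn

    class ToyGenerator(nn.Module):
        """Illustrative (not the authors') generator: identity encoder +
        speech encoder + face decoder, with the speech features concatenated
        onto the identity features before decoding."""

        def __init__(self):
            super().__init__()
            # Identity encoder: masked face frame -> spatial feature map.
            self.identity_enc = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            )
            # Speech encoder: mel-spectrogram treated as a 1-channel image,
            # encoded by stacked 2D convolutions.
            self.speech_enc = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),            # -> (B, 64, 1, 1)
            )
            # Face decoder: upsample the fused features back to an image.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(64 + 64, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, face, mel):
            f = self.identity_enc(face)                   # (B, 64, H/4, W/4)
            s = self.speech_enc(mel)                      # (B, 64, 1, 1)
            s = s.expand(-1, -1, f.shape[2], f.shape[3])  # broadcast over space
            return self.decoder(torch.cat([f, s], dim=1))

    # Usage: two masked 96x96 face crops and two 80x16 mel windows.
    out = ToyGenerator()(torch.rand(2, 3, 96, 96), torch.rand(2, 1, 80, 16))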

In addition to the generative element, the modified architecture features three main discriminator components: one targeting lip-sync quality, an emotion objective element, and an adversarially trained visual quality objective.
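
One plausible way of combining those objectives into a single generator loss is sketched below. The weights, the added pixel reconstruction term and the argument names are illustrative assumptions rather than the paper’s actual formulation.

    import torch
    import torch.nn.functional as F

    def total_generator_loss(sync_prob, emotion_logits, target_emotion,
                             disc_fake_logits, recon, target,
                             w_sync=0.03, w_emo=1.0, w_adv=0.07, w_rec=1.0):
        """Toy weighted sum of the three objectives described above
        (lip-sync, emotion, adversarial visual quality), plus a pixel
        reconstruction term common in such pipelines."""
        # Lip-sync expert: push the predicted sync probability toward 1.
        l_sync = F.binary_cross_entropy(sync_prob, torch.ones_like(sync_prob))
        # Emotion objective: classify the generated face as the target emotion.
        l_emo = F.cross_entropy(emotion_logits, target_emotion)
        # Adversarial visual-quality term: fool the visual discriminator.
        l_adv = F.binary_cross_entropy_with_logits(
            disc_fake_logits, torch.ones_like(disc_fake_logits))
        # Pixel reconstruction against the reference frame.
        l_rec = F.l1_loss(recon, target)
        return w_sync * l_sync + w_emo * l_emo + w_adv * l_adv + w_rec * l_rec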

For reconstruction of the entire face, the original Wav2Lip work set no precedent, so the model was trained from scratch. For lower-face (half-mask) training, the researchers started from checkpoints included with the original Wav2Lip code.
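
In code terms, that choice might reduce to something like the following loading helper, where the checkpoint path and the "state_dict" key are assumptions about how Wav2Lip-style weights are commonly stored, not a description of the authors’ release.

    import torch

    def init_model(model, checkpoint_path=None):
        """Load pretrained weights when a checkpoint exists (the half-mask
        case); otherwise keep the random initialisation (the full-face case,
        which had no precedent and was trained from scratch)."""
        if checkpoint_path is None:
            return model
        state = torch.load(checkpoint_path, map_location="cpu")
        # Many Wav2Lip-style checkpoints wrap the weights in a "state_dict"
        # entry; fall back to the raw object if that key is absent.
        weights = state.get("state_dict", state) if isinstance(state, dict) else state
        model.load_state_dict(weights, strict=False)
        return model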

In addition to automatic evaluation, the researchers used crowd-sourced assessments obtained via a semi-automated crowdworking platform. In general, the workers rated the superimposed emotions as highly recognizable, while reporting only “moderate” ratings for image quality.

The authors suggest that, besides improving the quality of the generated video through further refinement, future iterations of the work could cover a wider range of emotions, and that the approach could equally be applied to labeled or automatically inferred source data and datasets. This could eventually lead to a genuine system in which emotions can be dialed up or down at the user’s discretion, or ultimately replaced with emotions that contrast with those in the original source video.

