23 research outputs found
Handbook of Digital Face Manipulation and Detection
This open access book provides the first comprehensive collection of studies on digital face manipulation, such as DeepFakes, face morphing, and reenactment, and its detection. It combines the research fields of biometrics and media forensics, including contributions from academia and industry. Appealing to a broad readership, the introductory chapters provide a comprehensive overview of the topic and address readers who wish to gain a brief overview of the state of the art. Subsequent chapters, which delve deeper into various research challenges, are oriented towards advanced readers. Moreover, the book provides a good starting point for young researchers as well as a reference guide pointing to further literature. Hence, the primary readership comprises academic institutions and industry currently involved in digital face manipulation and detection. The book could also serve as a recommended text for courses in image processing, machine learning, media forensics, biometrics, and security in general.
AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
In this work, we present a multimodal solution to the problem of 4D face
reconstruction from monocular videos. 3D face reconstruction from 2D images is
an under-constrained problem due to the ambiguity of depth. State-of-the-art
methods try to solve this problem by leveraging visual information from a
single image or video, whereas 3D mesh animation approaches rely more on audio.
However, in most cases (e.g. AR/VR applications), videos include both visual
and speech information. We propose AVFace that incorporates both modalities and
accurately reconstructs the 4D facial and lip motion of any speaker, without
requiring any 3D ground truth for training. A coarse stage estimates the
per-frame parameters of a 3D morphable model, followed by a lip refinement, and
then a fine stage recovers facial geometric details. Thanks to the temporal audio
and video information captured by transformer-based modules, our method is
robust in cases where either modality is insufficient (e.g. face occlusions).
Extensive qualitative and quantitative evaluation demonstrates the superiority
of our method over the current state of the art.
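As a rough illustration of the coarse-to-fine pipeline described in this abstract, the sketch below wires up placeholder modules for the three stages (coarse 3DMM parameter estimation, audio-driven lip refinement, fine geometric detail). All module names, layer choices, and dimensions are hypothetical stand-ins, not the authors' architecture.

```python
# Hypothetical sketch of the coarse -> lip-refinement -> fine pipeline; the
# real AVFace modules, losses, and dimensions are not reproduced here.
import torch
import torch.nn as nn

def temporal_encoder(dim):
    # Transformer over time, mirroring the abstract's "transformer-based
    # modules" that capture temporal audio/video information.
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

class CoarseStage(nn.Module):
    """Per-frame 3D morphable model (3DMM) parameters from fused A/V features."""
    def __init__(self, feat_dim=256, n_3dmm=156):
        super().__init__()
        self.encoder = temporal_encoder(feat_dim)
        self.head = nn.Linear(feat_dim, n_3dmm)

    def forward(self, av_feats):                    # (B, T, feat_dim)
        return self.head(self.encoder(av_feats))    # (B, T, n_3dmm)

class LipRefiner(nn.Module):
    """Audio-driven residual correction applied to the coarse parameters."""
    def __init__(self, audio_dim=128, n_3dmm=156):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(audio_dim + n_3dmm, 256),
                                 nn.ReLU(), nn.Linear(256, n_3dmm))

    def forward(self, audio_feats, coarse_params):
        return coarse_params + self.mlp(
            torch.cat([audio_feats, coarse_params], dim=-1))

class FineStage(nn.Module):
    """Per-vertex displacements adding geometric detail to the coarse mesh."""
    def __init__(self, n_3dmm=156, n_vertices=5023):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_3dmm, 512), nn.ReLU(),
                                 nn.Linear(512, n_vertices * 3))

    def forward(self, params):                      # (B, T, n_3dmm)
        B, T, _ = params.shape
        return self.mlp(params).view(B, T, -1, 3)   # (B, T, V, 3)

# Forward pass on random stand-in features for 2 clips of 50 frames.
av, audio = torch.randn(2, 50, 256), torch.randn(2, 50, 128)
coarse = CoarseStage()(av)
refined = LipRefiner()(audio, coarse)
print(FineStage()(refined).shape)   # torch.Size([2, 50, 5023, 3])
```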
Modelling talking human faces
This thesis investigates a number of new approaches for visual speech
synthesis using data-driven methods to implement a talking face.
The main contributions of this thesis are the following. The accuracy of a shared Gaussian process latent variable model (SGPLVM) built using active appearance model (AAM) and relative spectral transform-perceptual linear prediction (RASTA-PLP) features is improved by employing a more accurate AAM. This is the first study to report that using a more accurate AAM improves the accuracy of the SGPLVM. Objective evaluation via reconstruction error is performed to compare the proposed approach against previously existing methods. In addition, it is shown experimentally that the accuracy of the AAM can be improved by using a larger number of landmarks and/or a larger number of training samples.
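As a concrete illustration of this kind of objective evaluation, the snippet below computes a reconstruction error (RMSE) between predicted and ground-truth AAM feature trajectories on random stand-in data; it is a generic sketch, not the thesis's evaluation code.

```python
# Minimal sketch of reconstruction-error evaluation; variable names and data
# are illustrative, not taken from the thesis.
import numpy as np

def reconstruction_rmse(predicted, ground_truth):
    """Root-mean-square error over a (frames x features) trajectory."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return np.sqrt(np.mean((predicted - ground_truth) ** 2))

# Example: 100 frames of 20-D AAM parameters, prediction = truth + noise.
rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 20))
pred = gt + rng.normal(scale=0.1, size=gt.shape)
print(f"RMSE: {reconstruction_rmse(pred, gt):.4f}")
```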
The second research contribution is a new method for visual speech synthesis utilising a fully Bayesian method, namely manifold relevance determination (MRD), for modelling dynamical systems through probabilistic non-linear dimensionality reduction. This is the first time MRD has been used in the context of generating talking faces from an input speech signal. The expressive power of this model lies in its ability to consider non-linear mappings between audio and visual features within a Bayesian framework. An efficient latent space is learnt using a fully Bayesian latent representation relying on a conditional non-linear independence framework. In the SGPLVM, the structure of the latent space cannot be estimated automatically because of the maximum likelihood formulation; in contrast, the Bayesian approach allows the dimensionality of the latent space to be determined automatically. The proposed method compares favourably against several other state-of-the-art methods for visual speech generation, as shown in quantitative and qualitative evaluation on two different datasets.
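For readers wanting a starting point, the sketch below fits MRD to paired audio/visual feature matrices using GPy's implementation (GPy.models.MRD). The data is random stand-in data, and exact constructor arguments may differ across GPy versions; this is not the thesis's model configuration.

```python
# Minimal sketch of fitting MRD to two views (audio and visual features),
# assuming GPy's GPy.models.MRD; data here is synthetic stand-in data.
import numpy as np
import GPy

rng = np.random.default_rng(0)
T = 80                                    # number of frames
shared = rng.normal(size=(T, 3))          # latent signal common to both views
Y_audio = shared @ rng.normal(size=(3, 12)) + 0.05 * rng.normal(size=(T, 12))
Y_visual = shared @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(T, 20))

# One Bayesian GP-LVM per view over a single shared latent space; ARD weights
# let the model switch latent dimensions off per view, so the effective
# dimensionality is determined automatically rather than fixed by hand as in
# the maximum-likelihood SGPLVM.
m = GPy.models.MRD([Y_audio, Y_visual], input_dim=6, num_inducing=15)
m.optimize(max_iters=200)
print(m)
```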
Finally, the possibility of incremental learning of AAMs for inclusion in the proposed MRD approach for visual speech generation is investigated. The quantitative results demonstrate that using MRD in conjunction with incremental AAMs produces only slightly less accurate results than using batch methods. These results support a way of training such models on computers with limited resources, for example in mobile computing.
Overall, this thesis proposes several improvements to the current state of the art in generating talking faces from a speech signal, leading to perceptually more convincing results.
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks whose purpose
is to extract one target speech signal or several target speech signals,
respectively, from a mixture of sounds generated by several sources.
Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The steady stream of newly proposed techniques
for extracting features and fusing multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
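To make the surveyed ingredients concrete, here is a toy model combining them in their simplest form: spectrogram-domain acoustic features, time-aligned visual embeddings, concatenation-based fusion, a recurrent network, and a mask-based training target. It is an illustrative sketch, not a system from the survey, and all dimensions are assumptions.

```python
# Toy audio-visual mask estimator: concatenate per-frame audio and visual
# features, run a BLSTM, predict a time-frequency mask for the mixture.
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, vis_dim=128, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq + vis_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq),
                                       nn.Sigmoid())

    def forward(self, noisy_spec, vis_feats):
        # noisy_spec: (B, T, n_freq) magnitude spectrogram of the mixture
        # vis_feats:  (B, T, vis_dim) lip-region embeddings, time-aligned
        fused, _ = self.blstm(torch.cat([noisy_spec, vis_feats], dim=-1))
        mask = self.mask_head(fused)        # mask values in (0, 1)
        return mask * noisy_spec            # enhanced magnitude estimate

model = AVMaskEstimator()
spec = torch.rand(2, 100, 257)
vis = torch.rand(2, 100, 128)
print(model(spec, vis).shape)   # torch.Size([2, 100, 257])
```

Real systems covered by such surveys typically replace the simple concatenation with attention-based or factorised fusion and train against the objectives (e.g. ideal ratio masks or signal-level losses) discussed in the paper.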
Audiovisual speech perception in cochlear implant patients
Hearing with a cochlear implant (CI) is very different compared to a normal-hearing (NH) experience, as the CI can only provide limited auditory input. Nevertheless, the central auditory system is capable of learning how to interpret such limited auditory input such that it can extract meaningful information within a few months after
implant switch-on. The capacity of the auditory cortex to adapt to new auditory stimuli is an example of intra-modal plasticity — changes within a sensory cortical region as a result of altered statistics of the respective sensory input. However, hearing deprivation before implantation and restoration of hearing capacities after implantation can also induce cross-modal plasticity — changes within a sensory cortical region as a result of altered statistics of a different sensory input. Thereby, a preserved cortical region can, for example, support a deprived cortical region, as in the case of CI users which have been shown to exhibit cross-modal visual-cortex activation for purely auditory stimuli. Before implantation, during the period of hearing deprivation, CI users typically rely on additional visual cues like lip-movements for understanding speech. Therefore, it has been suggested that CI users show a pronounced binding of the auditory and visual systems, which may allow them to integrate auditory and visual speech information more efficiently. The projects included in this thesis investigate auditory, and particularly audiovisual speech processing in CI users. Four event-related potential (ERP) studies approach the matter from different perspectives, each with a distinct focus.
The first project investigates how audiovisually presented syllables are processed by CI users with bilateral hearing loss compared to NH controls. Previous ERP studies employing non-linguistic stimuli and studies using different neuroimaging techniques found distinct audiovisual interactions in CI users. However, the precise timecourse
of cross-modal visual-cortex recruitment and enhanced audiovisual interaction for speech related stimuli is unknown. With our ERP study we fill this gap, and we present differences in the timecourse of audiovisual interactions as well as in cortical source configurations between CI users and NH controls.
The second study focuses on auditory processing in single-sided deaf (SSD) CI users. SSD CI patients experience a maximally asymmetric hearing condition, as they have a CI on one ear and a contralateral NH ear. Despite the intact ear, several behavioural studies have demonstrated a variety of beneficial effects of restoring binaural hearing, but only a few ERP studies have investigated auditory processing in SSD CI users. Our study investigates whether the side of implantation affects auditory processing and whether auditory processing via the NH ear of SSD CI users works similarly to that in NH controls.
Given the distinct hearing conditions of SSD CI users, the question arises whether there are any quantifiable differences between CI users with unilateral hearing loss and those with bilateral hearing loss. In general, ERP studies on SSD CI users are rather scarce, and there is no study on audiovisual processing in particular. Furthermore, there are no reports on the lip-reading abilities of SSD CI users. To this end, the third project extends the first study by including SSD CI users as a third experimental group. The study discusses both differences and similarities between CI users with bilateral hearing loss, CI users with unilateral hearing loss, and NH controls, and provides first insights into audiovisual interactions in SSD CI users.
The fourth project investigates the influence of background noise on audiovisual interactions in CI users and whether a noise-reduction algorithm can modulate these interactions. It is known that in environments with competing background noise listeners generally rely more strongly on visual cues for understanding speech and that such situations are particularly difficult for CI users. As shown in previous auditory behavioural studies, the recently introduced noise-reduction algorithm "ForwardFocus" can be a useful aid in such cases. However, the questions whether employing the algorithm is beneficial in audiovisual conditions as well and whether using the algorithm has a measurable effect on cortical processing have not been investigated yet. In this ERP study, we address these questions with an auditory and audiovisual syllable discrimination task.
Taken together, the projects included in this thesis contribute to a better understanding of auditory and especially audiovisual speech processing in CI users, revealing distinct processing strategies employed to overcome the limited input provided by a CI. The results have clinical implications, as they suggest that clinical hearing assessments, which are currently purely auditory, should be extended to audiovisual assessments. Furthermore, they imply that rehabilitation programmes including audiovisual training methods may help all CI user groups quickly achieve the most effective implantation outcome.
Generation of realistic human behaviour
As the use of computers and robots in our everyday lives increases, so does the need for better interaction with these devices. Human-computer interaction relies on the ability to understand and generate human behavioural signals such as speech, facial expressions and motion. This thesis deals with the synthesis and evaluation of such signals, focusing not only on their intelligibility but also on their realism. Since these signals are often correlated, it is common for methods to drive the generation of one signal using another. The thesis begins by tackling the problem of speech-driven facial animation and proposing models capable of producing realistic animations from a single image and an audio clip. The goal of these models is to produce a video of a target person whose lips move in accordance with the driving audio. Particular focus is also placed on a) generating spontaneous expressions such as blinks, b) achieving audio-visual synchrony and c) transferring or producing natural head motion. The second problem addressed in this thesis is that of video-driven speech reconstruction, which aims at converting a silent video into waveforms containing speech. The method proposed for solving this problem is capable of generating intelligible and accurate speech for both seen and unseen speakers. The spoken content is correctly captured thanks to a perceptual loss, which uses features from pre-trained speech-driven animation models. The ability of the video-to-speech model to run in real time allows its use in hearing assistive devices and telecommunications. The final work proposed in this thesis is a generic domain translation system that can be used for any translation problem, including those mapping across different modalities. The framework is made up of two networks performing translations in opposite directions and can be successfully applied to solve diverse sets of translation problems, including speech-driven animation and video-driven speech reconstruction.
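As an illustration of the perceptual-loss idea mentioned above (matching features of a frozen pre-trained network rather than raw signals), here is a generic sketch; the feature extractor is a stand-in, not the thesis's pre-trained speech-driven animation model.

```python
# Generic perceptual (feature-matching) loss: distances are computed in the
# feature space of a frozen pre-trained network instead of on raw waveforms.
import torch
import torch.nn as nn

class PerceptualLoss(nn.Module):
    def __init__(self, feature_net: nn.Module):
        super().__init__()
        self.feature_net = feature_net.eval()
        for p in self.feature_net.parameters():
            p.requires_grad_(False)   # frozen: it only guides the generator

    def forward(self, generated, target):
        return nn.functional.l1_loss(self.feature_net(generated),
                                     self.feature_net(target))

# Stand-in extractor mapping waveforms (B, 1, T) to feature maps; in the
# thesis this role is played by a pre-trained speech-driven animation model.
extractor = nn.Sequential(nn.Conv1d(1, 32, 25, stride=4), nn.ReLU(),
                          nn.Conv1d(32, 64, 25, stride=4), nn.ReLU())
loss_fn = PerceptualLoss(extractor)
fake = torch.randn(2, 1, 16000, requires_grad=True)   # generator output
real = torch.randn(2, 1, 16000)                       # ground-truth speech
loss = loss_fn(fake, real)
loss.backward()
print(loss.item())
```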
Data and methods for a visual understanding of sign languages
Signed languages are complete and natural languages used as the first or preferred mode of communication by millions of people worldwide. However, they unfortunately continue to be marginalized languages. Designing, building, and evaluating models that work on sign languages presents compelling research challenges and requires interdisciplinary and collaborative efforts. Recent advances in Machine Learning (ML) and Artificial Intelligence (AI) have the power to enable better accessibility for sign language users and to narrow the existing communication barrier between the Deaf community and non-sign language users. However, recent AI-powered technologies still do not account for sign language in their pipelines. This is mainly because sign languages are visual languages that use manual and non-manual features to convey information and do not have a standard written form. Thus, the goal of this thesis is to contribute to the development of new technologies that account for sign language by creating large-scale multimodal resources suitable for training modern data-hungry machine learning models, and by developing automatic systems for computer vision tasks that aim at a better visual understanding of sign languages.
Thus, in Part I, we introduce the How2Sign dataset, a large-scale collection of multimodal and multiview sign language videos in American Sign Language. In Part II, we contribute to the development of technologies that account for sign languages by presenting, in Chapter 4, a framework called Spot-Align, based on sign spotting methods, to automatically annotate sign instances in continuous sign language. We further present the benefits of this framework and establish a baseline for the sign language recognition task on the How2Sign dataset. In addition, in Chapter 5 we benefit from the different annotations and modalities of How2Sign to explore sign language video retrieval by learning cross-modal embeddings. Later, in Chapter 6, we explore sign language video generation by applying Generative Adversarial Networks to the sign language domain, and assess whether and how well sign language users can understand automatically generated sign language videos by proposing an evaluation protocol based on How2Sign topics and English translations.
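As a concrete illustration of the cross-modal embedding learning used for retrieval in Chapter 5, the sketch below scores video embeddings against text embeddings with a symmetric contrastive (InfoNCE-style) objective. The embeddings are random stand-ins, and this loss formulation is a common generic choice, not necessarily the thesis's exact objective.

```python
# Symmetric contrastive loss for cross-modal (video <-> text) retrieval:
# matched pairs sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (B, B) cosine similarities
    labels = torch.arange(len(v))           # i-th video matches i-th text
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

video_emb = torch.randn(8, 512)   # e.g. pooled sign-video features
text_emb = torch.randn(8, 512)    # e.g. pooled sentence features
print(contrastive_loss(video_emb, text_emb))
```

At test time, retrieval simply ranks all candidate videos by their similarity to a query text embedding (or vice versa) in the shared space.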