HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods
Talking Face Generation (TFG) aims to reconstruct facial movements from audio
and facial features, exploiting the latent connections between them to achieve
natural lip motion. Existing TFG methods have made significant progress in
producing natural and realistic images; however, most give little consideration
to visual quality, and it is challenging to ensure lip synchronization while
avoiding visual quality degradation in cross-modal generation. To
address this issue, we propose a universal High-Definition Teeth Restoration
Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth
regions at an extremely fast speed while maintaining synchronization and
temporal consistency. In particular, we propose a Fine-Grained Feature Fusion
(FGFF) module that effectively captures fine texture information in and around
the teeth region, and uses these features to refine the feature map and enhance
the clarity of the teeth. Extensive experiments show that our method can be
applied to arbitrary TFG methods without degrading lip synchronization or frame
coherence. Another advantage of HDTR-Net is its
real-time generation ability: even when performing high-definition restoration
of synthesized talking-face video, its inference speed is faster than current
state-of-the-art super-resolution-based face restoration.
Comment: 15 pages, 6 figures, PRCV202
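The abstract does not give implementation details of the FGFF module, but the kind of fine-grained feature fusion it describes can be sketched as channel-wise concatenation of a coarse and a fine feature map followed by a 1x1 channel-mixing step. All shapes, names, and weights below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def fuse_features(coarse, fine, w):
    """Toy fine-grained feature fusion: concatenate two feature maps along
    the channel axis and mix channels with a 1x1-convolution-style matrix.
    Shapes: coarse/fine are (C, H, W); w is (C_out, 2C)."""
    stacked = np.concatenate([coarse, fine], axis=0)   # (2C, H, W)
    c2, h, wid = stacked.shape
    mixed = w @ stacked.reshape(c2, h * wid)           # (C_out, H*W)
    return mixed.reshape(-1, h, wid)

rng = np.random.default_rng(0)
coarse = rng.standard_normal((8, 16, 16))   # global facial features
fine = rng.standard_normal((8, 16, 16))     # fine texture around the teeth
w = rng.standard_normal((8, 16)) * 0.1      # hypothetical 1x1-conv weights
out = fuse_features(coarse, fine, w)
print(out.shape)  # (8, 16, 16)
```

The fused map keeps the spatial resolution of the inputs, which is what lets such a module sharpen the teeth region without disturbing the rest of the frame.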
Multimodal emotion recognition
Reading emotions from facial expression and speech is a milestone in Human-Computer
Interaction. Recent sensing technologies, namely the Microsoft Kinect Sensor, provide
basic input modalities data, such as RGB imaging, depth imaging and speech, that can
be used in Emotion Recognition. Moreover, the Kinect can track a face in real time
and extract its fiducial points, as well as 6 basic Action Units (AUs).
In this work we exploit this information by gathering a new and exclusive
dataset, which offers a new opportunity for the academic community as well as for
progress on the emotion recognition problem. The database includes RGB, depth, audio, fiducial
points and AUs for 18 volunteers for 7 emotions. We then present automatic emotion
classification results on this dataset by employing k-Nearest Neighbor, Support Vector
Machines and Neural Networks classifiers, with unimodal and multimodal approaches.
Our conclusions show that multimodal approaches can attain better results.
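The feature-level (early) fusion used in such multimodal experiments can be sketched with a toy k-nearest-neighbour classifier over concatenated face and speech features. All data, dimensions, and class structure below are invented for illustration:

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Plain k-nearest-neighbour classifier with Euclidean distance:
    majority vote among the k closest training samples."""
    d = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

# Hypothetical per-sample features: one block per modality, then early fusion.
rng = np.random.default_rng(1)
offsets = np.repeat([[0.0], [1.0]], 10, axis=0)       # two toy emotion classes
face = rng.standard_normal((20, 6)) * 0.1 + offsets   # e.g. fiducial-point features
speech = rng.standard_normal((20, 4)) * 0.1 + offsets # e.g. acoustic features
labels = np.repeat([0, 1], 10)
fused = np.concatenate([face, speech], axis=1)        # feature-level fusion

query = np.concatenate([np.ones(6), np.ones(4)])      # sample near class 1
pred = knn_predict(fused, labels, query)
print(pred)  # 1
```

Concatenating modalities before classification lets a single distance metric weigh all cues at once, which is one simple way multimodal approaches can outperform unimodal ones.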
A MODEL FOR PREDICTING THE PERFORMANCE OF IP VIDEOCONFERENCING
With the incorporation of free desktop videoconferencing (DVC) software on the
majority of the world's PCs in recent years, there has inevitably been considerable
interest in using DVC over the Internet. The growing popularity of DVC
increases the need for multimedia quality assessment. However, the task of predicting
the perceived multimedia quality over the Internet Protocol (IP) networks is
complicated by the fact that the audio and video streams are susceptible to unique
impairments due to the unpredictable nature of IP networks, different types of task
scenarios, different levels of complexity, and other related factors. To date, no
standard consensus on defining IP media Quality of Service (QoS) has been reached.
The thesis addresses this problem by investigating a new approach to assessing
the quality of audio, video, and the overall audiovisual experience as perceived
in low-cost DVC systems.
The main aim of the thesis is to investigate current methods used to assess the perceived
IP media quality, and then propose a model which will predict the quality of
audiovisual experience from prevailing network parameters.
This thesis investigates the effects of various traffic conditions, such as packet loss,
jitter, and delay, as well as other factors that may influence end-user acceptance, when low
cost DVC is used over the Internet. It also investigates the interaction effects between
the audio and video media, and the issues involving the lip synchronisation
error. The thesis provides the empirical evidence that the subjective mean opinion
score (MOS) of the perceived multimedia quality is unaffected by lip synchronisation
error in low cost DVC systems.
The data-gathering approach advocated in this thesis involves both field and
laboratory trials, enabling results from classroom-based experiments and
real-world environments to be compared and providing real-world confirmation
of the bench tests. The subjective test method was employed since it has been
proven to be more robust and suitable for these research studies than objective
testing techniques.
The MOS results, and the number of observations obtained, have enabled a set of
criteria to be established that can be used to determine the acceptable QoS for given
network conditions and task scenarios. Based upon these comprehensive findings,
the final contribution of the thesis is the proposal of a new adaptive architecture
intended to enable the performance of a particular IP-based DVC session to be
predicted for a given network condition.
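A model that predicts perceived quality from prevailing network parameters, as the thesis proposes, can be sketched as a penalty function over the measured impairments. The coefficients below are invented for illustration and are not the thesis's fitted values:

```python
def predict_mos(packet_loss_pct, jitter_ms, delay_ms):
    """Illustrative (hypothetical) MOS predictor: start from a near-ideal
    score and subtract a penalty that grows with each network impairment,
    clamped to the standard 1..5 MOS range. All coefficients are invented
    for illustration, not taken from the thesis."""
    mos = 4.5 - 0.35 * packet_loss_pct - 0.02 * jitter_ms - 0.004 * delay_ms
    return max(1.0, min(5.0, mos))

print(predict_mos(0, 5, 50))     # near-ideal network -> 4.2
print(predict_mos(10, 80, 400))  # heavily impaired network -> floor of 1.0
```

In practice such coefficients would be fitted against subjective MOS data gathered under controlled loss, jitter, and delay conditions, which is exactly the role of the field and laboratory trials described above.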
Towards Ultrasound Tongue Image prediction from EEG during speech production
Initial research has already been carried out to propose speech-based BCIs
using brain signals (e.g. non-invasive EEG and invasive sEEG / ECoG), but there
is a lack of combined methods that investigate non-invasive brain, articulation,
and speech signals together and analyze the cognitive processes in the brain,
the kinematics of the articulatory movement, and the resulting speech signal. In
this paper, we describe our multimodal
(electroencephalography, ultrasound tongue imaging, and speech) analysis and
synthesis experiments, as a feasibility study. We extend the analysis of brain
signals recorded during speech production with ultrasound-based articulation
data. From the brain signal measured with EEG, we predict ultrasound images of
the tongue with a fully connected deep neural network. The results show that
there is a weak but noticeable relationship between EEG and ultrasound tongue
images, i.e. the network can differentiate articulated speech and neutral
tongue position.
Comment: accepted at Interspeech 202
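The mapping described above, from an EEG feature vector to an ultrasound tongue image via a fully connected network, can be sketched as a plain forward pass. The layer sizes, image resolution, and activation choice below are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def fc_forward(x, weights):
    """Forward pass of a fully connected network mapping an EEG feature
    vector to a flattened ultrasound tongue image: ReLU hidden layers
    followed by a linear output layer."""
    for w, b in weights[:-1]:
        x = np.maximum(0.0, x @ w + b)   # hidden layer with ReLU
    w, b = weights[-1]
    return x @ w + b                     # linear output layer

rng = np.random.default_rng(2)
sizes = [620, 1000, 64 * 128]            # EEG features -> hidden -> 64x128 image
weights = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
           for m, n in zip(sizes, sizes[1:])]
eeg = rng.standard_normal(620)           # one hypothetical EEG feature vector
img = fc_forward(eeg, weights).reshape(64, 128)
print(img.shape)  # (64, 128)
```

Training such a network with a pixel-wise loss against the recorded ultrasound frames is what allows even a weak EEG-articulation relationship, like the one reported above, to show up as distinguishable predicted images.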
Geometrical-based lip-reading using template probabilistic multi-dimension dynamic time warping
By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. In this paper, we present a geometrical-based automatic lip reading system that extracts the lip region from images using conventional techniques, but the contour itself is extracted using a novel application of a combination of border following and convex hull approaches. Classification is carried out using an enhanced dynamic time warping technique that has the ability to operate in multiple dimensions and a template probability technique that is able to compensate for differences in the way words are uttered in the training set. The performance of the new system has been assessed in recognition of the English digits 0 to 9 as available in the CUAVE database. The experimental results obtained from the new approach compared favorably with those of existing lip reading approaches, achieving a word recognition accuracy of up to 71% with the visual information being obtained from estimates of lip height, width and their ratio
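The multi-dimensional dynamic time warping at the core of the classifier above can be sketched as follows; the template-probability compensation step is omitted, and the feature values (lip height, width, and their ratio per frame) are invented:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping between two multi-dimensional sequences.
    Rows are frames; columns are features (e.g. lip height, width, ratio).
    Returns the accumulated Euclidean alignment cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

# Hypothetical per-frame features: [lip height, lip width, height/width ratio]
ref = np.array([[10.0, 20.0, 0.5], [12.0, 21.0, 0.57], [14.0, 22.0, 0.64]])
same = dtw_distance(ref, ref)                        # identical utterances
slow = dtw_distance(ref, np.repeat(ref, 2, axis=0))  # time-stretched utterance
print(same, slow)  # 0.0 0.0
```

The zero cost for the time-stretched copy shows why DTW suits lip reading: the same word spoken at different speeds still aligns perfectly, so classification can focus on the shape of the lip trajectory rather than its duration.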
Audio-visual speech processing system for Polish applicable to human-computer interaction
This paper describes an audio-visual speech recognition system for the Polish language and a set of performance tests under various acoustic conditions. We first present the overall structure of AVASR systems with three main areas: audio feature extraction, visual feature extraction and, subsequently, audiovisual speech integration. We present MFCC features for the audio stream with a standard HMM modeling technique, then describe appearance- and shape-based visual features. Subsequently we present two feature integration techniques: feature concatenation and model fusion. We also discuss the results of a set of experiments conducted to select the best system setup for Polish under noisy audio conditions. The experiments simulate human-computer interaction in a computer-control scenario with voice commands in difficult audio environments. With an Active Appearance Model (AAM) and a multistream Hidden Markov Model (HMM), we can improve system accuracy, reducing the Word Error Rate by more than 30% compared to audio-only speech recognition when the Signal-to-Noise Ratio drops to 0 dB.
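One practical detail of the feature-concatenation integration described above is that the audio and visual streams run at different frame rates, so the slower stream must be aligned to the faster one before concatenation. The rates and dimensions below (100 fps MFCC, 25 fps visual features) are common assumptions for illustration, not necessarily the paper's settings:

```python
import numpy as np

def concat_av(audio_feats, visual_feats):
    """Feature-level audiovisual integration: upsample the slower visual
    stream to the audio frame rate by frame repetition, then concatenate
    the two feature vectors at every audio frame."""
    factor = len(audio_feats) // len(visual_feats)
    visual_up = np.repeat(visual_feats, factor, axis=0)[:len(audio_feats)]
    return np.concatenate([audio_feats, visual_up], axis=1)

mfcc = np.zeros((100, 13))     # 1 s of audio features at 100 fps (13 MFCCs)
visual = np.ones((25, 10))     # 1 s of visual features at 25 fps
fused = concat_av(mfcc, visual)
print(fused.shape)  # (100, 23)
```

Model fusion, the alternative mentioned in the abstract, instead keeps separate per-modality models (e.g. a multistream HMM) and combines their likelihoods, which avoids this resampling step but requires stream-weight tuning.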