King's speech: pronounce a foreign language with style
Computer-assisted pronunciation training requires strategies that capture learners' attention and guide them along the learning pathway. In this paper, we introduce an immersive storytelling scenario for creating appropriate learning conditions. The proposed learning interaction is orchestrated by a spoken karaoke. We motivate the concept of the spoken karaoke and describe our design. Driven by the requirements of the proposed scenario, we suggest a modular architecture designed for immersive learning applications. We present our prototype system and our approach to processing the spoken and visual interaction modalities. Finally, we discuss how the technological challenges can be addressed in order to enable the learner's self-evaluation.
V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization
Voice data generated on instant messaging or social media applications contains unique user voiceprints that may be abused by malicious adversaries for identity inference or identity theft. Existing voice anonymization techniques, e.g., signal processing and voice conversion/synthesis, suffer from degradation of perceptual quality. In this paper, we develop a voice anonymization system, named V-Cloak, which attains real-time voice anonymization while preserving the intelligibility, naturalness, and timbre of the audio. Our anonymizer features a one-shot generative model that modulates the features of the original audio at different frequency levels. We train the anonymizer with a carefully designed loss function: apart from the anonymity loss, we further incorporate an intelligibility loss and a psychoacoustics-based naturalness loss. The anonymizer can realize untargeted and targeted anonymization to achieve the anonymity goals of unidentifiability and unlinkability.
We have conducted extensive experiments on four datasets, i.e., LibriSpeech (English), AISHELL (Chinese), CommonVoice (French), and CommonVoice (Italian), five Automatic Speaker Verification (ASV) systems (two DNN-based, two statistical, and one commercial), and eleven Automatic Speech Recognition (ASR) systems for different languages. Experimental results confirm that V-Cloak outperforms five baselines in terms of anonymity performance. We also demonstrate that V-Cloak, trained only on the VoxCeleb1 dataset against the ECAPA-TDNN ASV and DeepSpeech2 ASR, has transferable anonymity against other ASVs and cross-language intelligibility for other ASRs. Furthermore, we verify the robustness of V-Cloak against various de-noising techniques and adaptive attacks. Hopefully, V-Cloak may provide a cloak for us in a prism world.
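To make the shape of such a training objective concrete, the following is a minimal sketch, assuming hypothetical helper callables (asv_embed, asr_features) and a placeholder perceptual weighting (bark_weights); it is not the actual V-Cloak implementation, only an illustration of how an anonymity term, an intelligibility term, and a psychoacoustics-based naturalness term could be combined.

    # Hypothetical sketch of a combined anonymization objective; the weights,
    # the asv_embed / asr_features callables and the bark_weights curve are
    # placeholders, not the actual V-Cloak implementation.
    import torch
    import torch.nn.functional as F

    def combined_loss(anon_wav, orig_wav, asv_embed, asr_features, bark_weights,
                      w_anon=1.0, w_intel=0.5, w_nat=0.1):
        """Push speaker embeddings apart while keeping ASR-facing features and a
        perceptually weighted spectrum close to the original audio."""
        # Anonymity: make the anonymized embedding dissimilar to the original speaker's.
        loss_anon = F.cosine_similarity(asv_embed(anon_wav), asv_embed(orig_wav), dim=-1).mean()

        # Intelligibility: keep ASR-facing features (e.g., log-mel) close to the original.
        loss_intel = F.l1_loss(asr_features(anon_wav), asr_features(orig_wav))

        # Naturalness: penalize spectral deviation, weighted by a perceptual curve.
        spec_anon = torch.stft(anon_wav, n_fft=512, return_complex=True).abs()
        spec_orig = torch.stft(orig_wav, n_fft=512, return_complex=True).abs()
        loss_nat = (bark_weights * (spec_anon - spec_orig).abs()).mean()

        return w_anon * loss_anon + w_intel * loss_intel + w_nat * loss_nat

In a real system the anonymity term would be computed against the specific ASV model being trained against, and the perceptual weighting would come from a psychoacoustic masking model rather than a fixed curve.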
Automatic Speech Recognition Using LP-DCTC/DCS Analysis Followed by Morphological Filtering
Front-end feature extraction techniques have long been a critical component in Automatic Speech Recognition (ASR). Nonlinear filtering techniques are becoming increasingly important in this application, and are often better than linear filters at removing noise without distorting speech features. However, design and analysis of nonlinear filters are more difficult than for linear filters. Mathematical morphology, which creates filters based on shape and size characteristics, is a design structure for nonlinear filters. These filters are limited to minimum and maximum operations that introduce a deterministic bias into filtered signals.
This work develops filtering structures based on mathematical morphology that exploit this bias while emphasizing spectral peaks. Combining peak emphasis via LP analysis with morphological filtering results in more noise-robust speech recognition rates.
To help understand the behavior of these pre-processing techniques, the deterministic and statistical properties of the morphological filters are compared to those of feature extraction techniques that do not employ such algorithms. The robust behavior of these algorithms for automatic speech recognition in the presence of rapidly fluctuating speech signals with additive and convolutional noise is illustrated. Examples of these nonlinear feature extraction techniques are given using the Aurora 2.0 and Aurora 3.0 databases. Features are computed using LP analysis alone to emphasize peaks, morphological filtering alone, or a combination of the two approaches. Although the absolute best results are normally obtained using a combination of the two methods, morphological filtering alone is nearly as effective and much more computationally efficient.
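As a generic illustration of the min/max-based nonlinear smoothing referred to above (not the paper's exact LP-DCTC/DCS pipeline), gray-scale morphological opening and closing can be applied along the frequency axis of a log-spectral frame; the structuring-element width is an assumed parameter.

    # Generic illustration of morphological smoothing of a spectral envelope.
    import numpy as np
    from scipy.ndimage import grey_opening, grey_closing

    def morph_smooth(log_spectrum, width=5):
        """Suppress peaks narrower than `width` bins (opening), then fill
        correspondingly narrow dips (closing)."""
        opened = grey_opening(log_spectrum, size=width)   # erosion followed by dilation
        return grey_closing(opened, size=width)           # dilation followed by erosion

    # Toy usage: a broad formant-like envelope corrupted by sparse narrow spikes.
    freq = np.linspace(0, np.pi, 257)
    envelope = 10.0 * np.exp(-3.0 * (freq - 1.0) ** 2)
    noisy = envelope + (np.random.rand(257) > 0.95) * 5.0
    smoothed = morph_smooth(noisy, width=5)

Opening removes upward spikes narrower than the structuring element and shifts the signal downward; that downward shift is the deterministic bias mentioned above, which the proposed filter structures utilize rather than avoid.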
Recent Advances in Signal Processing
Signal processing is a critical task in the majority of new technological inventions and challenges, with applications across both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian, and have always favored closed-form tractability over real-world accuracy; these constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five areas depending on the application at hand: image processing, speech processing, communication systems, time-series analysis, and educational packages, in that order. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.
Computational Multimedia for Video Self Modeling
Video self modeling (VSM) is a behavioral intervention technique in which a learner models a target behavior by watching a video of oneself. This is the idea behind the psychological theory of self-efficacy: you can learn to perform certain tasks because you see yourself doing them, which provides the most ideal form of behavior modeling. The effectiveness of VSM has been demonstrated for many different types of disabilities and behavioral problems, ranging from stuttering, inappropriate social behaviors, autism, and selective mutism to sports training. However, there is an inherent difficulty associated with the production of VSM material: prolonged and persistent video recording is required to capture the rare snippets, if they exist at all, that can be strung together to form novel video sequences of the target skill. To solve this problem, in this dissertation we use computational multimedia techniques to facilitate the creation of synthetic visual content for self-modeling that can be used by a learner and his/her therapist with a minimum amount of training data. There are three major technical contributions in my research. First, I developed an Adaptive Video Re-sampling algorithm to synthesize realistic lip-synchronized video with minimal motion jitter. Second, to denoise and complete the depth map captured by structured-light sensing systems, I introduced a layer-based probabilistic model to account for various types of uncertainties in the depth measurement. Third, I developed a simple and robust bundle-adjustment-based framework for calibrating a network of multiple wide-baseline RGB and depth cameras.
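For the third contribution, bundle-adjustment-style calibration minimizes reprojection error over 3-D points seen by several cameras. The sketch below is a generic illustration under a simple pinhole model with one focal-length parameter per camera (a hypothetical parameterization, not the dissertation's code).

    # Generic reprojection residual for bundle-adjustment-style multi-camera calibration.
    import numpy as np
    from scipy.spatial.transform import Rotation
    from scipy.optimize import least_squares

    def reprojection_residuals(params, points_3d, observations):
        """observations: iterable of (cam_index, point_index, u, v) pixel measurements.
        params packs 7 values per camera: 3 rotation (Rodrigues), 3 translation, focal length."""
        res = []
        for cam_idx, pt_idx, u, v in observations:
            rvec = params[cam_idx * 7: cam_idx * 7 + 3]
            tvec = params[cam_idx * 7 + 3: cam_idx * 7 + 6]
            f = params[cam_idx * 7 + 6]
            # Transform the world point into the camera frame and project it.
            p_cam = Rotation.from_rotvec(rvec).apply(points_3d[pt_idx]) + tvec
            u_hat = f * p_cam[0] / p_cam[2]
            v_hat = f * p_cam[1] / p_cam[2]
            res.extend([u_hat - u, v_hat - v])
        return np.asarray(res)

    # A calibration run might then look like (x0: initial guesses for all camera parameters):
    # solution = least_squares(reprojection_residuals, x0, args=(points_3d, observations))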
EMG-to-Speech: Direct Generation of Speech from Facial Electromyographic Signals
The general objective of this work is the design, implementation, improvement, and evaluation of a system that uses surface electromyographic (EMG) signals and directly synthesizes an audible speech output: EMG-to-speech.
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining two key principles of modality heterogeneity and interconnections that have driven subsequent innovations, and propose a taxonomy of 6 core technical challenges: representation, alignment, reasoning, generation, transference, and quantification, covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.
Human activity recognition for pervasive interaction
PhD Thesis.
This thesis addresses the challenge of computing food preparation context in the kitchen. The automatic recognition of fine-grained human activities and food ingredients is realized through pervasive sensing, which we achieve by instrumenting kitchen objects such as knives, spoons, and chopping boards with sensors. Context recognition in the kitchen lies at the heart of a broad range of real-world applications. In particular, activity and food ingredient recognition in the kitchen is an essential component of situated services such as automatic prompting for cognitively impaired kitchen users and digital situated support for healthier eating interventions. Previous work, however, has addressed the activity recognition problem by exploring high-level human activities using wearable sensing (i.e., sensors worn on the human body) or using technologies that raise privacy concerns (i.e., computer vision). Although such approaches have yielded significant results for a number of activity recognition problems, they are not applicable to our domain of investigation, for which we argue that the technology itself must be genuinely "invisible", thereby allowing users to perform their activities in a completely natural manner.
In this thesis we describe the development of pervasive sensing technologies and algorithms for fine-grained human activity and food ingredient recognition in the kitchen. After reviewing previous work on food and activity recognition, we present three systems that constitute increasingly sophisticated approaches to the challenge of kitchen context recognition. Two of these systems, Slice&Dice and Class-based Threshold Dynamic Time Warping (CBT-DTW), recognize fine-grained food preparation activities. Slice&Dice is a proof-of-concept application, whereas CBT-DTW is a real-time application that also addresses the problem of recognising unknown activities. The final system, KitchenSense, is a real-time context recognition framework that deals with the recognition of a more complex set of activities, and includes the recognition of food ingredients and events in the kitchen. For each system, we describe the prototyping of pervasive sensing technologies and algorithms, as well as real-world experiments and empirical evaluations that validate the proposed solutions.
Funding: Vietnamese government's 322 project, executed by the Vietnamese Ministry of Education and Training.
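As a rough illustration of the idea behind the CBT-DTW system described above (not the thesis implementation; the feature representation and threshold values are assumed), a nearest-neighbour DTW classifier can reject activities whose best match exceeds a per-class distance threshold, which is what allows unfamiliar activities to be labelled as unknown.

    # Minimal sketch: nearest-neighbour DTW classification with per-class
    # rejection thresholds, so sequences unlike any training template are
    # returned as "unknown".
    import numpy as np

    def dtw_distance(a, b):
        """Classic dynamic-programming DTW between two 1-D sensor sequences."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def classify(query, templates, thresholds):
        """templates: {label: [sequence, ...]}, thresholds: {label: max accepted distance}."""
        best_label, best_dist = None, np.inf
        for label, seqs in templates.items():
            for seq in seqs:
                d = dtw_distance(query, seq)
                if d < best_dist:
                    best_label, best_dist = label, d
        if best_label is None or best_dist > thresholds[best_label]:
            return "unknown"
        return best_label

Real kitchen sensor data would be multivariate and streamed in real time; the per-class thresholds are the element that distinguishes this scheme from plain nearest-neighbour DTW matching.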
Handbook of Digital Face Manipulation and Detection
This open access book provides the first comprehensive collection of studies dealing with the hot topic of digital face manipulation, such as DeepFakes, face morphing, or reenactment. It combines the research fields of biometrics and media forensics, including contributions from academia and industry. Appealing to a broad readership, introductory chapters provide a comprehensive overview of the topic, addressing readers who wish to gain a brief overview of the state of the art. Subsequent chapters, which delve deeper into various research challenges, are oriented towards advanced readers. Moreover, the book provides a good starting point for young researchers as well as a reference guide pointing to further literature. Hence, the primary readership is academic institutions and industry currently involved in digital face manipulation and detection. The book could easily be used as a recommended text for courses in image processing, machine learning, media forensics, biometrics, and the general security area.