1,540 research outputs found
Towards Improving Low-Resource Speech Recognition Using Articulatory and Language Features
In an increasingly globalized world, there is a rising demand for speech recognition systems. Systems for languages like English, German or French do achieve a decent performance, but there exists a long tail of languages for which such systems do not yet exist. State-of-the-art speech recognition systems feature Deep Neural Networks (DNNs). Being a data driven method and therefore highly dependent on sufficient training data, the lack of resources directly affects the recognition performance. There exist multiple techniques to deal with such resource constraint conditions, one approach is the use of additional data from other languages. In the past, is was demonstrated that multilingually trained systems benefit from adding language feature vectors (LFVs) to the input features, similar to i-Vectors. In this work, we extend this approach by the addition of articulatory features (AFs). We show that AFs also benefit from LFVs and that multilingual system setups benefit from adding both AFs and LFVs. Pretending English to be a low-resource language, we restricted ourselves to use only 10h of English acoustic training data. For system training, we use additional data from French, German and Turkish. By using a combination of AFs and LFVs, we were able to decrease the WER from 18.1% to 17.3% after system combination in our setup using a multilingual phone set
Location Reference Recognition from Texts: A Survey and Comparison
A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, there is a lack of a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching–based, statistical learning-–based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references worldwide. Results from this thorough evaluation can help inform future methodological developments and can help guide the selection of proper approaches based on application needs
AI: Limits and Prospects of Artificial Intelligence
The emergence of artificial intelligence has triggered enthusiasm and promise of boundless opportunities as much as uncertainty about its limits. The contributions to this volume explore the limits of AI, describe the necessary conditions for its functionality, reveal its attendant technical and social problems, and present some existing and potential solutions. At the same time, the contributors highlight the societal and attending economic hopes and fears, utopias and dystopias that are associated with the current and future development of artificial intelligence
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
This paper proposes a novel lip reading framework, especially for
low-resource languages, which has not been well addressed in the previous
literature. Since low-resource languages do not have enough video-text paired
data to train the model to have sufficient power to model lip movements and
language, it is regarded as challenging to develop lip reading models for
low-resource languages. In order to mitigate the challenge, we try to learn
general speech knowledge, the ability to model lip movements, from a
high-resource language through the prediction of speech units. It is known that
different languages partially share common phonemes, thus general speech
knowledge learned from one language can be extended to other languages. Then,
we try to learn language-specific knowledge, the ability to model language, by
proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder
saves language-specific audio features into memory banks and can be trained on
audio-text paired data which is more easily accessible than video-text paired
data. Therefore, with LMDecoder, we can transform the input speech units into
language-specific audio features and translate them into texts by utilizing the
learned rich language knowledge. Finally, by combining general speech knowledge
and language-specific knowledge, we can efficiently develop lip reading models
even for low-resource languages. Through extensive experiments using five
languages, English, Spanish, French, Italian, and Portuguese, the effectiveness
of the proposed method is evaluated.Comment: Accepted at ICCV 202
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Self-supervised learning (SSL) is at the origin of unprecedented improvements
in many different domains including computer vision and natural language
processing. Speech processing drastically benefitted from SSL as most of the
current domain-related tasks are now being approached with pre-trained models.
This work introduces LeBenchmark 2.0 an open-source framework for assessing and
building SSL-equipped French speech technologies. It includes documented,
large-scale and heterogeneous corpora with up to 14,000 hours of heterogeneous
speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to
one billion learnable parameters shared with the community, and an evaluation
protocol made of six downstream tasks to complement existing benchmarks.
LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for
speech with the investigation of frozen versus fine-tuned downstream models,
task-agnostic versus task-specific pre-trained models as well as a discussion
on the carbon footprint of large-scale model training.Comment: Under submission at Computer Science and Language. Preprint allowe
Leveraging audio-visual speech effectively via deep learning
The rising popularity of neural networks, combined with the recent proliferation of online audio-visual media, has led to a revolution in the way machines encode, recognize, and generate acoustic and visual speech. Despite the ubiquity of naturally paired audio-visual data, only a limited number of works have applied recent advances in deep learning to leverage the duality between audio and video within this domain. This thesis considers the use of neural networks to learn from large unlabelled datasets of audio-visual speech to enable new practical applications. We begin by training a visual speech encoder that predicts latent features extracted from the corresponding audio on a large unlabelled audio-visual corpus. We apply the trained visual encoder to improve performance on lip reading in real-world scenarios. Following this, we extend the idea of video learning from audio by training a model to synthesize raw speech directly from raw video, without the need for text transcriptions. Remarkably, we find that this framework is capable of reconstructing intelligible audio from videos of new, previously unseen speakers. We also experiment with a separate speech reconstruction framework, which leverages recent advances in sequence modeling and spectrogram inversion to improve the realism of the generated speech. We then apply our research in video-to-speech synthesis to advance the state-of-the-art in audio-visual speech enhancement, by proposing a new vocoder-based model that performs particularly well under extremely noisy scenarios. Lastly, we aim to fully realize the potential of paired audio-visual data by proposing two novel frameworks that leverage acoustic and visual speech to train two encoders that learn from each other simultaneously. We leverage these pre-trained encoders for deepfake detection, speech recognition, and lip reading, and find that they consistently yield improvements over training from scratch.Open Acces
WearPut : Designing Dexterous Wearable Input based on the Characteristics of Human Finger Motions
Department of Biomedical Engineering (Human Factors Engineering)Powerful microchips for computing and networking allow a wide range of wearable devices to be miniaturized with high fidelity and availability. In particular, the commercially successful smartwatches placed on the wrist drive market growth by sharing the role of smartphones and health management. The emerging Head Mounted Displays (HMDs) for Augmented Reality (AR) and Virtual Reality (VR) also impact various application areas in video games, education, simulation, and productivity tools. However, these powerful wearables have challenges in interaction with the inevitably limited space for input and output due to the specialized form factors for fitting the body parts. To complement the constrained interaction experience, many wearable devices still rely on other large form factor devices (e.g., smartphones or hand-held controllers). Despite their usefulness, the additional devices for interaction can constrain the viability of wearable devices in many usage scenarios by tethering users' hands to the physical devices. This thesis argues that developing novel Human-Computer interaction techniques for the specialized wearable form factors is vital for wearables to be reliable standalone products.
This thesis seeks to address the issue of constrained interaction experience with novel interaction techniques by exploring finger motions during input for the specialized form factors of wearable devices. The several characteristics of the finger input motions are promising to enable increases in the expressiveness of input on the physically limited input space of wearable devices. First, the input techniques with fingers are prevalent on many large form factor devices (e.g., touchscreen or physical keyboard) due to fast and accurate performance and high familiarity. Second, many commercial wearable products provide built-in sensors (e.g., touchscreen or hand tracking system) to detect finger motions. This enables the implementation of novel interaction systems without any additional sensors or devices. Third, the specialized form factors of wearable devices can create unique input contexts while the fingers approach their locations, shapes, and components. Finally, the dexterity of fingers with a distinctive appearance, high degrees of freedom, and high sensitivity of joint angle perception have the potential to widen the range of input available with various movement features on the surface and in the air. Accordingly, the general claim of this thesis is that understanding how users move their fingers during input will enable increases in the expressiveness of the interaction techniques we can create for resource-limited wearable devices.
This thesis demonstrates the general claim by providing evidence in various wearable scenarios with smartwatches and HMDs. First, this thesis explored the comfort range of static and dynamic touch input with angles on the touchscreen of smartwatches. The results showed the specific comfort ranges on variations in fingers, finger regions, and poses due to the unique input context that the touching hand approaches a small and fixed touchscreen with a limited range of angles. Then, finger region-aware systems that recognize the flat and side of the finger were constructed based on the contact areas on the touchscreen to enhance the expressiveness of angle-based touch input. In the second scenario, this thesis revealed distinctive touch profiles of different fingers caused by the unique input context for the touchscreen of smartwatches. The results led to the implementation of finger identification systems for distinguishing two or three fingers. Two virtual keyboards with 12 and 16 keys showed the feasibility of touch-based finger identification that enables increases in the expressiveness of touch input techniques. In addition, this thesis supports the general claim with a range of wearable scenarios by exploring the finger input motions in the air. In the third scenario, this thesis investigated the motions of in-air finger stroking during unconstrained in-air typing for HMDs. The results of the observation study revealed details of in-air finger motions during fast sequential input, such as strategies, kinematics, correlated movements, inter-fingerstroke relationship, and individual in-air keys. The in-depth analysis led to a practical guideline for developing robust in-air typing systems with finger stroking. Lastly, this thesis examined the viable locations of in-air thumb touch input to the virtual targets above the palm. It was confirmed that fast and accurate sequential thumb touch can be achieved at a total of 8 key locations with the built-in hand tracking system in a commercial HMD. Final typing studies with a novel in-air thumb typing system verified increases in the expressiveness of virtual target selection on HMDs.
This thesis argues that the objective and subjective results and novel interaction techniques in various wearable scenarios support the general claim that understanding how users move their fingers during input will enable increases in the expressiveness of the interaction techniques we can create for resource-limited wearable devices. Finally, this thesis concludes with thesis contributions, design considerations, and the scope of future research works, for future researchers and developers to implement robust finger-based interaction systems on various types of wearable devices.ope
Privacy-preserving and Privacy-attacking Approaches for Speech and Audio -- A Survey
In contemporary society, voice-controlled devices, such as smartphones and
home assistants, have become pervasive due to their advanced capabilities and
functionality. The always-on nature of their microphones offers users the
convenience of readily accessing these devices. However, recent research and
events have revealed that such voice-controlled devices are prone to various
forms of malicious attacks, hence making it a growing concern for both users
and researchers to safeguard against such attacks. Despite the numerous studies
that have investigated adversarial attacks and privacy preservation for images,
a conclusive study of this nature has not been conducted for the audio domain.
Therefore, this paper aims to examine existing approaches for
privacy-preserving and privacy-attacking strategies for audio and speech. To
achieve this goal, we classify the attack and defense scenarios into several
categories and provide detailed analysis of each approach. We also interpret
the dissimilarities between the various approaches, highlight their
contributions, and examine their limitations. Our investigation reveals that
voice-controlled devices based on neural networks are inherently susceptible to
specific types of attacks. Although it is possible to enhance the robustness of
such models to certain forms of attack, more sophisticated approaches are
required to comprehensively safeguard user privacy
Language variation, automatic speech recognition and algorithmic bias
In this thesis, I situate the impacts of automatic speech recognition systems in relation to sociolinguistic theory (in particular drawing on concepts of language variation, language ideology
and language policy) and contemporary debates in AI ethics (especially regarding algorithmic
bias and fairness). In recent years, automatic speech recognition systems, alongside other
language technologies, have been adopted by a growing number of users and have been embedded in an increasing number of algorithmic systems. This expansion into new application
domains and language varieties can be understood as an expansion into new sociolinguistic
contexts. In this thesis, I am interested in how automatic speech recognition tools interact
with this sociolinguistic context, and how they affect speakers, speech communities and their
language varieties.
Focussing on commercial automatic speech recognition systems for British Englishes, I first
explore the extent and consequences of performance differences of these systems for different user groups depending on their linguistic background. When situating this predictive bias
within the wider sociolinguistic context, it becomes apparent that these systems reproduce and
potentially entrench existing linguistic discrimination and could therefore cause direct and indirect harms to already marginalised speaker groups. To understand the benefits and potentials
of automatic transcription tools, I highlight two case studies: transcribing sociolinguistic data
in English and transcribing personal voice messages in isiXhosa. The central role of the sociolinguistic context in developing these tools is emphasised in this comparison. Design choices,
such as the choice of training data, are particularly consequential because they interact with existing processes of language standardisation. To understand the impacts of these choices, and
the role of the developers making them better, I draw on theory from language policy research
and critical data studies. These conceptual frameworks are intended to help practitioners and
researchers in anticipating and mitigating predictive bias and other potential harms of speech
technologies. Beyond looking at individual choices, I also investigate the discourses about language variation and linguistic diversity deployed in the context of language technologies. These
discourses put forward by researchers, developers and commercial providers not only have a
direct effect on the wider sociolinguistic context, but they also highlight how this context (e.g.,
existing beliefs about language(s)) affects technology development. Finally, I explore ways of
building better automatic speech recognition tools, focussing in particular on well-documented,
naturalistic and diverse benchmark datasets. However, inclusive datasets are not necessarily
a panacea, as they still raise important questions about the nature of linguistic data and language variation (especially in relation to identity), and may not mitigate or prevent all potential
harms of automatic speech recognition systems as embedded in larger algorithmic systems
and sociolinguistic contexts
- …