Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition
Accents, as variations from standard pronunciation, pose significant
challenges for speech recognition systems. Although joint automatic speech
recognition (ASR) and accent recognition (AR) training has been proven
effective in handling multi-accent scenarios, current multi-task ASR-AR
approaches overlook the granularity differences between tasks. Fine-grained
units capture pronunciation-related accent characteristics, while
coarse-grained units are better for learning linguistic information. Moreover,
an explicit interaction between the two tasks can provide complementary
information that improves the performance of each, but such interaction is rarely exploited by existing
approaches. In this paper, we propose a novel Decoupling and Interacting
Multi-task Network (DIMNet) for joint speech and accent recognition, which is
comprised of a connectionist temporal classification (CTC) branch, an AR
branch, an ASR branch, and a bottom feature encoder. Specifically, AR and ASR
are first decoupled by separated branches and two-granular modeling units to
learn task-specific representations. The AR branch is from our previously
proposed linguistic-acoustic bimodal AR model and the ASR branch is an
encoder-decoder based Conformer model. Then, for the task interaction, the CTC
branch provides aligned text for the AR task, while accent embeddings extracted
from our AR model are incorporated into the ASR branch's encoder and decoder.
Finally, during ASR inference, a cross-granular rescoring method is introduced
to fuse the complementary information from the CTC and attention decoder after
the decoupling. Our experiments on English and Chinese datasets demonstrate the
effectiveness of the proposed model, which achieves 21.45%/28.53% AR accuracy
relative improvement and 32.33%/14.55% ASR error rate relative reduction over a
published standard baseline, respectively. Comment: Accepted by IEEE Transactions on Audio, Speech and Language Processing (TASLP).
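The cross-granular rescoring step can be illustrated with a minimal sketch: interpolate the CTC and attention-decoder scores of each candidate transcript and pick the best. The function name, score layout, and interpolation weight `lam` below are illustrative assumptions, not details from the paper.

```python
import math

def rescore(hypotheses, lam=0.3):
    """Fuse CTC and attention-decoder log-probabilities per hypothesis.

    `hypotheses` maps a candidate transcript to (ctc_logp, attn_logp);
    the layout and interpolation weight `lam` are illustrative only.
    """
    best, best_score = None, -math.inf
    for text, (ctc_logp, attn_logp) in hypotheses.items():
        score = lam * ctc_logp + (1.0 - lam) * attn_logp
        if score > best_score:
            best, best_score = text, score
    return best, best_score

hyps = {
    "hello world": (-4.2, -3.1),  # favored by the attention decoder
    "hello word": (-3.9, -5.0),   # favored by CTC
}
best, _ = rescore(hyps)
```

Because the two branches are decoupled during training, their scores err in different ways, which is what makes this simple interpolation informative.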
VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition
Enhancing automatic speech recognition (ASR) performance by leveraging
additional multimodal information has shown promising results in previous
studies. However, most of these works have primarily focused on utilizing
visual cues derived from human lip motions. In fact, context-dependent visual
and linguistic cues can also be beneficial in many scenarios. In this paper, we first
propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel
multimodal ASR model based on the continuous integrate-and-fire (CIF)
mechanism, which can integrate visual and textual context simultaneously or
separately, to facilitate speech recognition. Next, we introduce an effective
training strategy that improves performance in modal-incomplete test scenarios.
Then, to explore the effects of integrating vision and language, we create
VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese
and English versions. Finally, empirical results are reported on the public
Flickr8K and self-constructed VSDial datasets. We explore various cross-modal
fusion schemes, analyze fine-grained cross-modal alignment on VSDial, and
provide insights into the effects of integrating multimodal information on
speech recognition. Comment: Accepted to ICASSP 202
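The continuous integrate-and-fire (CIF) mechanism that ViLaS builds on can be sketched in a few lines: per-frame weights are accumulated and, each time the accumulator crosses a threshold, the frames integrated so far are emitted as one token-level representation. The scalar frames and fixed weights below are illustrative; real CIF operates on vector encoder states with learned weights.

```python
def cif(frames, alphas, threshold=1.0):
    """Continuous integrate-and-fire, scalar sketch.

    Accumulate per-frame weights `alphas`; when the accumulator reaches
    `threshold`, emit the weighted sum integrated so far, carrying the
    spill-over weight into the next unit.
    """
    out, acc, integrated = [], 0.0, 0.0
    for h, a in zip(frames, alphas):
        if acc + a < threshold:
            acc += a
            integrated += a * h
        else:
            spill = acc + a - threshold  # weight carried to the next unit
            used = a - spill             # portion that closes this unit
            out.append(integrated + used * h)
            acc, integrated = spill, spill * h
    return out

# Four frames with weight 0.6 each fire twice (1.2 and 2.4 crossings):
units = cif([1.0, 2.0, 3.0, 4.0], [0.6, 0.6, 0.6, 0.6])
```

The firing rate is controlled entirely by the weights, which is what lets CIF produce a token-synchronous sequence from frame-synchronous input.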
Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.
Automatic Pronunciation Assessment -- A Review
Pronunciation assessment and its application in computer-aided pronunciation
training (CAPT) have seen impressive progress in recent years. With the rapid
growth in language processing and deep learning over the past few years, there
is a need for an updated review. In this paper, we review methods employed in
pronunciation assessment at both the phonemic and prosodic levels. We
categorize the main challenges observed in prominent research trends and
highlight existing limitations and available resources. This is followed by a discussion of the
remaining challenges and possible directions for future work. Comment: 9 pages, accepted to EMNLP Findings.
Getting Past the Language Gap: Innovations in Machine Translation
In this chapter, we will be reviewing state-of-the-art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods were introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, the advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge, though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable speech-to-speech machine translation on hand-held devices, i.e., speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT.
Cross-modal Alignment with Optimal Transport for CTC-based ASR
Connectionist temporal classification (CTC)-based automatic speech
recognition (ASR) is one of the most successful end-to-end (E2E) ASR
frameworks. However, due to the token independence assumption in decoding, an
external language model (LM) is required, which destroys its fast parallel
decoding property. Several studies have proposed transferring linguistic
knowledge from a pretrained LM (PLM) to CTC-based ASR. Since the PLM is
built from text while the acoustic model is trained with speech, a cross-modal
alignment is required in order to transfer the context dependent linguistic
knowledge from the PLM to acoustic encoding. In this study, we propose a novel
cross-modal alignment algorithm based on optimal transport (OT). In the
alignment process, a transport coupling matrix is obtained using OT, which is
then utilized to transform a latent acoustic representation for matching the
context-dependent linguistic features encoded by the PLM. Based on the
alignment, the latent acoustic feature is forced to encode context dependent
linguistic information. We integrate this latent acoustic feature to build a
Conformer encoder-based CTC ASR system. On the AISHELL-1 corpus, our
system achieved 3.96% and 4.27% character error rate (CER) for dev and test
sets, respectively, which corresponds to relative improvements of 28.39% and
29.42% compared to the baseline conformer CTC ASR system without cross-modal
knowledge transfer. Comment: Accepted to IEEE ASRU 202
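The OT-based alignment can be illustrated with a generic entropy-regularized Sinkhorn sketch: compute a coupling matrix from a pairwise cost between acoustic frames and linguistic token features, then use it to transform the acoustic sequence toward the linguistic one. The Euclidean cost, uniform marginals, and hyperparameters below are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def sinkhorn_coupling(cost, eps=0.5, n_iters=300):
    """Entropy-regularized optimal transport between uniform marginals.

    Returns a coupling matrix P whose rows/columns approximately sum to
    the source/target marginals; `eps` and `n_iters` are illustrative.
    """
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                         # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy alignment of 4 "acoustic" frames to 3 "linguistic" token features:
rng = np.random.default_rng(0)
acoustic, linguistic = rng.random((4, 8)), rng.random((3, 8))
cost = np.linalg.norm(acoustic[:, None] - linguistic[None, :], axis=-1)
P = sinkhorn_coupling(cost)
# Row-normalizing the coupling maps each acoustic frame onto a convex
# combination of linguistic features, i.e. a soft cross-modal alignment:
aligned = (P / P.sum(axis=1, keepdims=True)) @ linguistic
```

The coupling matrix plays the role of the transport plan described in the abstract: it tells each acoustic position which linguistic features it should be matched against.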
A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation
Body language (BL) refers to the non-verbal communication expressed through
physical movements, gestures, facial expressions, and postures. It is a form of
communication that conveys information, emotions, attitudes, and intentions
without the use of spoken or written words. It plays a crucial role in
interpersonal interactions and can complement or even override verbal
communication. Deep multi-modal learning techniques have shown promise in
understanding and analyzing these diverse aspects of BL. The survey emphasizes
their applications to BL generation and recognition. Several common BLs are
considered, i.e., Sign Language (SL), Cued Speech (CS), Co-speech (CoS), and
Talking Head (TH), and we have conducted an analysis and established the
connections among these four BLs for the first time. Their generation and
recognition often involve multi-modal approaches. Benchmark datasets for BL
research are collected and organized, along with an evaluation of SOTA
methods on these datasets. The survey highlights challenges such as limited
labeled data, multi-modal learning, and the need for domain adaptation to
generalize models to unseen speakers or languages. Future research directions
are presented, including exploring self-supervised learning techniques,
integrating contextual information from other modalities, and exploiting
large-scale pre-trained multi-modal models. In summary, this survey paper
provides a comprehensive understanding of deep multi-modal learning for various
BL generation and recognition for the first time. By analyzing advancements,
challenges, and future directions, it serves as a valuable resource for
researchers and practitioners in advancing this field. In addition, we maintain
a continuously updated paper list for deep multi-modal learning for BL
recognition and generation: https://github.com/wentaoL86/awesome-body-language
Towards a unified framework for sub-lexical and supra-lexical linguistic modeling
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 171-178). Conversational interfaces have received much attention as a promising natural communication channel between humans and computers. A typical conversational interface consists of three major systems: speech understanding, dialog management and spoken language generation. In such a conversational interface, speech recognition as the front-end of speech understanding remains one of the fundamental challenges for establishing robust and effective human/computer communications. On the one hand, the speech recognition component in a conversational interface lives in a rich system environment. Diverse sources of knowledge are available and can potentially be beneficial to its robustness and accuracy. For example, the natural language understanding component can provide linguistic knowledge in syntax and semantics that helps constrain the recognition search space. On the other hand, the speech recognition component also faces the challenge of spontaneous speech, and it is important to address the casualness of speech using the knowledge sources available. For example, sub-lexical linguistic information would be very useful in providing linguistic support for previously unseen words, and dynamic reliability modeling may help improve recognition robustness for poorly articulated speech. In this thesis, we mainly focused on the integration of knowledge sources within the speech understanding system of a conversational interface. More specifically, we studied the formalization and integration of hierarchical linguistic knowledge at both the sub-lexical level and the supra-lexical level, and proposed a unified framework for integrating hierarchical linguistic knowledge in speech recognition using layered finite-state transducers (FSTs).
Within the proposed framework, we developed context-dependent hierarchical linguistic models at both sub-lexical and supra-lexical levels. FSTs were designed and constructed to encode both structure and probability constraints provided by the hierarchical linguistic models. We also studied empirically the feasibility and effectiveness of integrating hierarchical linguistic knowledge into speech recognition using the proposed framework. We found that, at the sub-lexical level, hierarchical linguistic modeling is effective in providing generic sub-word structure and probability constraints. Since such constraints are not restricted to a fixed system vocabulary, they can help the recognizer correctly identify previously unseen words. Together with the unknown word support from natural language understanding, a conversational interface would be able to deal with unknown words better, and can possibly incorporate them into the active recognition vocabulary on-the-fly. At the supra-lexical level, experimental results showed that the shallow parsing model built within the proposed layered FST framework with top-level n-gram probabilities and phrase-level context-dependent probabilities was able to reduce recognition errors, compared to a class n-gram model of the same order. However, we also found that its application can be limited by the complexity of the composed FSTs. This suggests that, with a much more complex grammar at the supra-lexical level, a proper tradeoff between tight knowledge integration and system complexity becomes more important ... by Xiaolong Mou. Ph.D.
Tracing the Algorithm of Bilingual Language Learning
Learning a new language is an arduous but highly rewarding task. Learners must acquire an extensive vocabulary, as well as a set of rules about how to vary and combine this vocabulary to produce meaningful sentences. However, learning new languages may become easier once we know at least two. Building on this idea, in this thesis I explore whether there are differences between people who know only one language (monolinguals) and those who speak two languages (bilinguals) when learning a new language. To this end, I conducted six behavioral experiments with participants of different linguistic profiles: a group of monolingual Spanish speakers, a Spanish-English bilingual group, and a Spanish-Basque bilingual group. Together, these experiments covered implicit and explicit learning of new languages using artificial linguistic stimuli. Overall, the results of all experiments indicated that both bilingual groups outperformed the monolingual group when learning vocabulary implicitly and explicitly, but not in other domains (phonology, orthography, morphology). To explain how these differences in vocabulary learning arise, I developed a computational model capable of learning written words using the orthographic patterns of words in one or two languages. This model showed that, when learning words in two languages, it is easier to recognize and produce new words than when learning vocabulary from a single language. Taken together, these results led me to conclude that monolinguals and bilinguals differ fundamentally in vocabulary learning, because exposure to different within-word patterns across two languages makes bilinguals more flexible when integrating the orthographic (and possibly phonological) information of new words.