875 research outputs found
Spoken Language Understanding in a Latent Topic-based Subspace
International audiencePerformance of spoken language understanding applications declines when spoken documents are automatically transcribed in noisy conditions due to high Word Error Rates (WER). To improve the robustness to transcription errors, recent solutions propose to map these automatic transcriptions in a latent space. These studies have proposed to compare classical topic-based representations such as Latent Dirichlet Allocation (LDA), supervised LDA and author-topic (AT) models. An original compact representation, called c-vector, has recently been introduced to walk around the tricky choice of the number of latent topics in these topic-based representations. Moreover, c-vectors allow to increase the robustness of document classification with respect to transcription errors by compacting different LDA representations of a same speech document in a reduced space and then compensate most of the noise of the document representation. The main drawback of this method is the number of sub-tasks needed to build the c-vector space. This paper proposes to both improve this compact representation (c-vector) of spoken documents and to reduce the number of needed sub-tasks, using an original framework in a robust low dimensional space of features from a set of AT models called "Latent Topic-based Sub-space" (LTS). In comparison to LDA, the AT model considers not only the dialogue content (words), but also the class related to the document. Experiments are conducted on the DECODA corpus containing speech conversations from the call-center of the RATP Paris transportation company. Results show that the original LTS representation outperforms the best previous compact representation (c-vector), with a substantial gain of more than 2.5% in terms of correctly labeled conversations
Learning Latent Representations for Speech Generation and Transformation
An ability to model a generative process and learn a latent representation
for speech in an unsupervised fashion will be crucial to process vast
quantities of unlabelled speech data. Recently, deep probabilistic generative
models such as Variational Autoencoders (VAEs) have achieved tremendous success
in modeling natural images. In this paper, we apply a convolutional VAE to
model the generative process of natural speech. We derive latent space
arithmetic operations to disentangle learned latent representations. We
demonstrate the capability of our model to modify the phonetic content or the
speaker identity for speech segments using the derived operations, without the
need for parallel supervisory data.Comment: Accepted to Interspeech 201
A survey on mouth modeling and analysis for Sign Language recognition
© 2015 IEEE.Around 70 million Deaf worldwide use Sign Languages (SLs) as their native languages. At the same time, they have limited reading/writing skills in the spoken language. This puts them at a severe disadvantage in many contexts, including education, work, usage of computers and the Internet. Automatic Sign Language Recognition (ASLR) can support the Deaf in many ways, e.g. by enabling the development of systems for Human-Computer Interaction in SL and translation between sign and spoken language. Research in ASLR usually revolves around automatic understanding of manual signs. Recently, ASLR research community has started to appreciate the importance of non-manuals, since they are related to the lexical meaning of a sign, the syntax and the prosody. Nonmanuals include body and head pose, movement of the eyebrows and the eyes, as well as blinks and squints. Arguably, the mouth is one of the most involved parts of the face in non-manuals. Mouth actions related to ASLR can be either mouthings, i.e. visual syllables with the mouth while signing, or non-verbal mouth gestures. Both are very important in ASLR. In this paper, we present the first survey on mouth non-manuals in ASLR. We start by showing why mouth motion is important in SL and the relevant techniques that exist within ASLR. Since limited research has been conducted regarding automatic analysis of mouth motion in the context of ALSR, we proceed by surveying relevant techniques from the areas of automatic mouth expression and visual speech recognition which can be applied to the task. Finally, we conclude by presenting the challenges and potentials of automatic analysis of mouth motion in the context of ASLR
Deep quaternion neural networks for spoken language understanding
International audienceThe availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawn tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these software, but it embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug-in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly-released along with a rich documentation and is designed to properly work locally or on HPC clusters. Experiments, that are conducted on several datasets and tasks, show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers
Recommended from our members
Decoding the Real-Time Neurobiological Properties of Incremental Semantic Interpretation.
Communication through spoken language is a central human capacity, involving a wide range of complex computations that incrementally interpret each word into meaningful sentences. However, surprisingly little is known about the spatiotemporal properties of the complex neurobiological systems that support these dynamic predictive and integrative computations. Here, we focus on prediction, a core incremental processing operation guiding the interpretation of each upcoming word with respect to its preceding context. To investigate the neurobiological basis of how semantic constraints change and evolve as each word in a sentence accumulates over time, in a spoken sentence comprehension study, we analyzed the multivariate patterns of neural activity recorded by source-localized electro/magnetoencephalography (EMEG), using computational models capturing semantic constraints derived from the prior context on each upcoming word. Our results provide insights into predictive operations subserved by different regions within a bi-hemispheric system, which over time generate, refine, and evaluate constraints on each word as it is heard.ERC Horizon202
Quaternion Denoising Encoder-Decoder for Theme Identification of Telephone Conversations
International audienceIn the last decades, encoder-decoders or autoencoders (AE) have received a great interest from researchers due to their capability to construct robust representations of documents in a low dimensional subspace. Nonetheless, autoencoders reveal little in way of spoken document internal structure by only considering words or topics contained in the document as an isolate basic element, and tend to overfit with small corpus of documents. Therefore, Quaternion Multi-layer Perceptrons (QMLP) have been introduced to capture such internal latent dependencies , whereas denoising autoencoders (DAE) are composed with different stochastic noises to better process small set of documents. This paper presents a novel autoencoder based on both hitherto-proposed DAE (to manage small corpus) and the QMLP (to consider internal latent structures) called "Quater-nion denoising encoder-decoder" (QDAE). Moreover, the paper defines an original angular Gaussian noise adapted to the speci-ficity of hyper-complex algebra. The experiments, conduced on a theme identification task of spoken dialogues from the DE-CODA framework, show that the QDAE obtains the promising gains of 3% and 1.5% compared to the standard real valued de-noising autoencoder and the QMLP respectively
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field
- …