Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop

Author: Arthur Philip
Besacier Laurent
Black Alan
Ciannella Francesco
Du Mingxing
Dupoux Emmanuel
Godard Pierre
Hasegawa-Johnson Mark
Larsen Elin
Merkx Danny
Metze Florian
Mueller Markus
Neubig Graham
Ondel Lucas
Palaskar Shruti
Riad Rachid
Scharenborg Odette
Stueker Sebastian
Wang Liming
Publication date: 14/02/2018

We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech. Comment: Accepted to ICASSP 201

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

English Conversational Telephone Speech Recognition by Humans and Machines

Author: Audhkhasi Kartik
Cui Xiaodong
Dimitriadis Dimitrios
Hall Phil
Kurata Gakuto
Lim Lynn-Li
Picheny Michael
Ramabhadran Bhuvana
Roomi Bergul
Saon George
Sercu Tom
Thomas Samuel
Publication date: 06/03/2017

One of the most difficult speech recognition tasks is accurate recognition of human-to-human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues: what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to the level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which, at least at the writing of this paper, is a new performance milestone (albeit not at what we measure to be human performance!). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning, and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.

arXiv.org e-Print Archive

Crossref

Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Author: Dong Ren
Hong Du
Libing Song
Qing Yang
Wei Guo
Xinsheng Peng
Yuhu Dai
Publication date: 01/07/2017

Neural models have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices. Comment: NIPS 201

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare