Vocabulary size influences spontaneous speech in native language users: Validating the use of automatic speech recognition in individual differences research
Previous research has shown that vocabulary size affects performance on laboratory word production tasks. Individuals who know many words show faster lexical access and retrieve more words belonging to pre-specified categories than individuals who know fewer words. The present study examined the relationship between receptive vocabulary size and speaking skills as assessed in a natural sentence production task. We asked whether measures derived from spontaneous responses to everyday questions correlate with the size of participants' vocabulary. Moreover, we assessed the suitability of automatic speech recognition for the analysis of participants' responses in complex language production data. We found that vocabulary size predicted indices of spontaneous speech: Individuals with a larger vocabulary produced more words and had a higher speech-silence ratio than individuals with a smaller vocabulary. Importantly, these relationships were reliably identified using both manual and automated transcription methods. Taken together, our results suggest that spontaneous speech elicitation is a useful method to investigate natural language production, and that automatic speech recognition can alleviate the burden of labor-intensive speech transcription.
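The two spontaneous-speech indices named in this abstract (word count and speech-silence ratio) can be computed directly from timestamped ASR output. The sketch below is illustrative only: the segment format and the `speech_indices` helper are assumptions, not the authors' actual pipeline.

```python
# Sketch: deriving spontaneous-speech indices from timestamped ASR segments.
# Segment format (start_sec, end_sec, text) is an assumed, simplified output shape.

def speech_indices(segments, total_duration):
    """Return (word count, speech-silence ratio) from (start, end, text) segments."""
    words = sum(len(text.split()) for _, _, text in segments)
    speech_time = sum(end - start for start, end, _ in segments)
    silence_time = total_duration - speech_time
    ratio = speech_time / silence_time if silence_time > 0 else float("inf")
    return words, ratio

# Invented example response to an everyday question, 8 seconds long in total.
segments = [
    (0.0, 2.5, "well I usually cook dinner at home"),
    (4.0, 6.0, "mostly pasta or rice"),
]
words, ratio = speech_indices(segments, total_duration=8.0)
print(words, round(ratio, 2))  # 11 1.29
```

With manual and automatic transcripts in this shape, the same function yields the measures whose correlation with vocabulary size the study tests.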
Transcription of child sign language: A focus on narrative
This paper describes some general difficulties in analysing child sign language data, with an emphasis on the process of transcription. The particular issue of capturing how signers encode simultaneity in narrative is discussed.
Unity in diversity: integrating differing linguistic data in TUSNELDA
This paper describes the creation and preparation of TUSNELDA, a collection of corpus data built for linguistic research. This collection contains a number of linguistically annotated corpora which differ in various respects, such as language, text sorts/data types, encoded annotation levels, and the linguistic theories underlying the annotation. The paper focuses on this variation on the one hand, and on how these heterogeneous data are integrated into one resource on the other.
Adapting End-to-End Speech Recognition for Readable Subtitles
Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases, such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. This work therefore focuses on ASR with output compression, a task that is challenging for supervised approaches due to the scarcity of training data. We first investigate a cascaded system, in which an unsupervised compression model post-edits the transcribed speech. We then compare several methods of end-to-end speech recognition under output length constraints. The experiments show that, with far less data than needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system.

Comment: IWSLT 202
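The word error rate (WER) used to score these systems is a standard edit-distance metric. A minimal sketch, with invented example strings, of how WER is typically computed over word sequences:

```python
# Sketch: word error rate via Levenshtein distance over word sequences.
# WER = (substitutions + insertions + deletions) / reference length.

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(r)

print(wer("the cat sat on the mat", "the cat sat mat"))  # ≈ 0.333 (2 deletions / 6 words)
```

For compressed subtitles, the verbatim reference penalizes intentional deletions, which is why the abstract pairs WER with ROUGE, a metric more tolerant of summarization.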
Relative Positional Encoding for Speech Recognition and Direct Translation
Transformer models are powerful sequence-to-sequence architectures that can directly map speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and is thus less suited to acoustic inputs. In this work, we adapt the relative positional encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that the resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.

Comment: Submitted to Interspeech 202
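The "relative distance between input states" idea can be sketched as a distance-dependent bias added to the self-attention scores. The scalar bias table and tiny dimensions below are illustrative assumptions, not the paper's actual parameterization:

```python
import numpy as np

# Sketch: self-attention with a clipped relative-distance bias, in the spirit of
# relative positional encoding. All sizes and the bias table are invented.

def rel_attention(q, k, v, rel_bias, max_dist):
    """q, k, v: (T, d) arrays; rel_bias: (2*max_dist+1,) one scalar per clipped distance."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                               # content-content term
    idx = np.arange(T)
    dist = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist)
    scores = scores + rel_bias[dist + max_dist]                 # relative-position term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
T, d, max_dist = 5, 8, 2
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
out = rel_attention(q, k, v, rel_bias=rng.normal(size=2 * max_dist + 1), max_dist=max_dist)
print(out.shape)  # (5, 8)
```

Because the bias depends only on the offset between positions, not their absolute indices, it generalizes naturally to the variable segment lengths typical of speech.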
Lexical and sub-lexical knowledge influences the encoding, storage, and articulation of nonwords
Nonword repetition (NWR) has been used extensively in the study of child language. Although lexical and sub-lexical knowledge is known to influence NWR performance, there has been little examination of the NWR processes (e.g., encoding, storage, articulation) that may be affected by lexical and sub-lexical knowledge. We administered 2- and 3-syllable spoken nonword recognition and nonword repetition tests to two independent groups of 31 children (M = 5;07). Spoken nonword recognition primarily involves encoding and storage, whereas NWR involves an additional articulation process. The influence of lexical and sub-lexical knowledge was determined by examining the number of lexical errors produced. There was clear involvement of long-term lexical and sub-lexical knowledge in both spoken nonword recognition and NWR. In spoken nonword recognition, twice as many errors involved selecting a foil that contained a lexical item (e.g., yashukup) over a foil that contained only nonsense syllables (e.g., yashunup). In repetition, over 30% of errors changed a nonsense syllable to a lexical item. Our results show that long-term lexical and sub-lexical knowledge is pervasive in NWR – any explanation of NWR performance must therefore consider the influence of lexical and sub-lexical knowledge throughout the whole repetition process, from the encoding of nonwords to their articulation.
Encoding of phonology in a recurrent neural model of grounded speech
We study the representation and encoding of phonemes in a recurrent neural network model of grounded speech. We use a model that processes images and their spoken descriptions, and projects the visual and auditory representations into the same semantic space. We perform a number of analyses of how information about individual phonemes is encoded in the MFCC features extracted from the speech signal and in the activations of the layers of the model. Via experiments with phoneme decoding and phoneme discrimination, we show that phoneme representations are most salient in the lower layers of the model, where low-level signals are processed at a fine-grained level, although a large amount of phonological information is retained at the top recurrent layer. We further find that the attention mechanism following the top recurrent layer significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy. Moreover, a hierarchical clustering of the phoneme representations learned by the network shows an organizational structure of phonemes similar to that proposed in linguistics.

Comment: Accepted at CoNLL 201
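A "phoneme decoding" analysis of this kind fits a simple classifier that predicts phoneme labels from feature vectors (MFCCs or layer activations); decoding accuracy then indicates how much phonological information a representation carries. The sketch below uses a nearest-centroid classifier on synthetic clusters as a stand-in; the data, dimensions, and classifier choice are all illustrative assumptions:

```python
import numpy as np

# Sketch: a diagnostic phoneme-decoding probe on feature vectors.
# Two well-separated synthetic "phoneme" clusters stand in for real activations.

def fit_centroids(X, y):
    """Compute one mean vector per label."""
    labels = np.unique(y)
    return labels, np.stack([X[y == c].mean(axis=0) for c in labels])

def predict(X, labels, centroids):
    """Assign each frame to the nearest centroid (squared Euclidean distance)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return labels[d.argmin(axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 13)),   # 13-dim, MFCC-like, label 0
               rng.normal(3, 1, (50, 13))])  # label 1
y = np.array([0] * 50 + [1] * 50)
labels, centroids = fit_centroids(X, y)
acc = (predict(X, labels, centroids) == y).mean()
print(acc > 0.9)  # well-separated clusters decode near-perfectly
```

Running such a probe layer by layer, with held-out data and a stronger classifier, is what lets the analysis compare how salient phoneme information is at each depth of the model.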
Transcribing nonsense words: The effect of numbers of voices and repetitions
Transcription skills are crucially important to all phoneticians, and particularly to speech and language therapists, who may use transcriptions to make decisions about diagnosis and intervention. Whilst interest in factors affecting transcription accuracy is increasing, there are still a number of issues that are yet to be investigated. The present paper considers how the number and type of voices, and the number of repetitions, affect the transcription of nonsense words. Thirty-two students in their second year of study for a BSc in Speech and Language Therapy were participants in an experiment. They heard two nonsense words presented ten times in either one or two voices. Results show that the number and gender of voices did not affect accuracy, but that accuracy increased between six and ten repetitions. Implications for teaching and learning, clinical practice, and further research are discussed.