Vocabulary size influences spontaneous speech in native language users: Validating the use of automatic speech recognition in individual differences research
Previous research has shown that vocabulary size affects performance on laboratory word production tasks. Individuals who know many words show faster lexical access and retrieve more words belonging to pre-specified categories than individuals who know fewer words. The present study examined the relationship between receptive vocabulary size and speaking skills as assessed in a natural sentence production task. We asked whether measures derived from spontaneous responses to everyday questions correlate with the size of participants' vocabulary. Moreover, we assessed the suitability of automatic speech recognition for the analysis of participants' responses in complex language production data. We found that vocabulary size predicted indices of spontaneous speech: Individuals with a larger vocabulary produced more words and had a higher speech-silence ratio than individuals with a smaller vocabulary. Importantly, these relationships were reliably identified using both manual and automated transcription methods. Taken together, our results suggest that spontaneous speech elicitation is a useful method to investigate natural language production, and that automatic speech recognition can alleviate the burden of labor-intensive speech transcription.
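The two spontaneous-speech indices named in this abstract (word count and speech-silence ratio) can be computed directly from timestamped ASR output. The sketch below is illustrative only: the segment format and the `speech_indices` helper are assumptions, not the authors' actual pipeline.

```python
# Sketch: deriving spontaneous-speech indices from timestamped ASR segments.
# Segment format (start_sec, end_sec, text) is an assumed, simplified output shape.

def speech_indices(segments, total_duration):
    """Return (word count, speech-silence ratio) from (start, end, text) segments."""
    words = sum(len(text.split()) for _, _, text in segments)
    speech_time = sum(end - start for start, end, _ in segments)
    silence_time = total_duration - speech_time
    ratio = speech_time / silence_time if silence_time > 0 else float("inf")
    return words, ratio

# Invented example response to an everyday question, 8 seconds long in total.
segments = [
    (0.0, 2.5, "well I usually cook dinner at home"),
    (4.0, 6.0, "mostly pasta or rice"),
]
words, ratio = speech_indices(segments, total_duration=8.0)
print(words, round(ratio, 2))  # 11 1.29
```

With manual and automatic transcripts in this shape, the same function yields the measures whose correlation with vocabulary size the study tests.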
Transcription of child sign language: A focus on narrative
This paper describes some general difficulties in analysing child sign language data, with an emphasis on the process of transcription. The particular issue of capturing how signers encode simultaneity in narrative is discussed.
Unity in diversity: integrating differing linguistic data in TUSNELDA
This paper describes the creation and preparation of TUSNELDA, a collection of corpus data built for linguistic research. This collection contains a number of linguistically annotated corpora which differ in various respects, such as language, text sorts/data types, encoded annotation levels, and the linguistic theories underlying the annotation. The paper focuses on this variation on the one hand, and on how these heterogeneous data are integrated into one resource on the other.
Adapting End-to-End Speech Recognition for Readable Subtitles
Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases, such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. This work therefore focuses on ASR with output compression, a task that is challenging for supervised approaches due to the scarcity of training data. We first investigate a cascaded system, in which an unsupervised compression model post-edits the transcribed speech. We then compare several methods of end-to-end speech recognition under output length constraints. The experiments show that, with far less data than needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system.

Comment: IWSLT 202
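The word error rate (WER) used to score these systems is a standard edit-distance metric. A minimal sketch, with invented example strings, of how WER is typically computed over word sequences:

```python
# Sketch: word error rate via Levenshtein distance over word sequences.
# WER = (substitutions + insertions + deletions) / reference length.

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(r)

print(wer("the cat sat on the mat", "the cat sat mat"))  # ≈ 0.333 (2 deletions / 6 words)
```

For compressed subtitles, the verbatim reference penalizes intentional deletions, which is why the abstract pairs WER with ROUGE, a metric more tolerant of summarization.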
Relative Positional Encoding for Speech Recognition and Direct Translation
Transformer models are powerful sequence-to-sequence architectures that can directly map speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and is thus less suited to acoustic inputs. In this work, we adapt the relative positional encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that the resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.

Comment: Submitted to Interspeech 202
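The "relative distance between input states" idea can be sketched as a distance-dependent bias added to the self-attention scores. The scalar bias table and tiny dimensions below are illustrative assumptions, not the paper's actual parameterization:

```python
import numpy as np

# Sketch: self-attention with a clipped relative-distance bias, in the spirit of
# relative positional encoding. All sizes and the bias table are invented.

def rel_attention(q, k, v, rel_bias, max_dist):
    """q, k, v: (T, d) arrays; rel_bias: (2*max_dist+1,) one scalar per clipped distance."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                               # content-content term
    idx = np.arange(T)
    dist = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist)
    scores = scores + rel_bias[dist + max_dist]                 # relative-position term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
T, d, max_dist = 5, 8, 2
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
out = rel_attention(q, k, v, rel_bias=rng.normal(size=2 * max_dist + 1), max_dist=max_dist)
print(out.shape)  # (5, 8)
```

Because the bias depends only on the offset between positions, not their absolute indices, it generalizes naturally to the variable segment lengths typical of speech.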
Lexical and sub-lexical knowledge influences the encoding, storage, and articulation of nonwords
Nonword repetition (NWR) has been used extensively in the study of child language. Although lexical and sub-lexical knowledge is known to influence NWR performance, there has been little examination of the NWR processes (e.g., encoding, storage, articulation) that may be affected by lexical and sub-lexical knowledge. We administered 2- and 3-syllable spoken nonword recognition and nonword repetition tests to two independent groups of 31 children (M = 5;07). Spoken nonword recognition primarily involves encoding and storage, whereas NWR involves an additional articulation process. The influence of lexical and sub-lexical knowledge was determined by examining the number of lexical errors produced. There was clear involvement of long-term lexical and sub-lexical knowledge in both spoken nonword recognition and NWR. In spoken nonword recognition, twice as many errors involved selecting a foil that contained a lexical item (e.g., yashukup) over a foil that contained only nonsense syllables (e.g., yashunup). In repetition, over 30% of errors changed a nonsense syllable to a lexical item. Our results show that long-term lexical and sub-lexical knowledge is pervasive in NWR – any explanation of NWR performance must therefore consider the influence of lexical and sub-lexical knowledge throughout the whole repetition process, from the encoding of nonwords to their articulation.
Encoding of phonology in a recurrent neural model of grounded speech
We study the representation and encoding of phonemes in a recurrent neural network model of grounded speech. We use a model that processes images and their spoken descriptions, and projects the visual and auditory representations into the same semantic space. We perform a number of analyses of how information about individual phonemes is encoded in the MFCC features extracted from the speech signal and in the activations of the layers of the model. Via experiments with phoneme decoding and phoneme discrimination, we show that phoneme representations are most salient in the lower layers of the model, where low-level signals are processed at a fine-grained level, although a large amount of phonological information is retained at the top recurrent layer. We further find that the attention mechanism following the top recurrent layer significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy. Moreover, a hierarchical clustering of the phoneme representations learned by the network shows an organizational structure of phonemes similar to that proposed in linguistics.

Comment: Accepted at CoNLL 201
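A "phoneme decoding" analysis of this kind fits a simple classifier that predicts phoneme labels from feature vectors (MFCCs or layer activations); decoding accuracy then indicates how much phonological information a representation carries. The sketch below uses a nearest-centroid classifier on synthetic clusters as a stand-in; the data, dimensions, and classifier choice are all illustrative assumptions:

```python
import numpy as np

# Sketch: a diagnostic phoneme-decoding probe on feature vectors.
# Two well-separated synthetic "phoneme" clusters stand in for real activations.

def fit_centroids(X, y):
    """Compute one mean vector per label."""
    labels = np.unique(y)
    return labels, np.stack([X[y == c].mean(axis=0) for c in labels])

def predict(X, labels, centroids):
    """Assign each frame to the nearest centroid (squared Euclidean distance)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return labels[d.argmin(axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 13)),   # 13-dim, MFCC-like, label 0
               rng.normal(3, 1, (50, 13))])  # label 1
y = np.array([0] * 50 + [1] * 50)
labels, centroids = fit_centroids(X, y)
acc = (predict(X, labels, centroids) == y).mean()
print(acc > 0.9)  # well-separated clusters decode near-perfectly
```

Running such a probe layer by layer, with held-out data and a stronger classifier, is what lets the analysis compare how salient phoneme information is at each depth of the model.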
Transcribing nonsense words: The effect of numbers of voices and repetitions
Transcription skills are crucially important to all phoneticians, and particularly to speech and language therapists, who may use transcriptions to make decisions about diagnosis and intervention. Whilst interest in factors affecting transcription accuracy is increasing, there are still a number of issues that are yet to be investigated. The present paper considers how the number and type of voices, and the number of repetitions, affect the transcription of nonsense words. Thirty-two students in their second year of study for a BSc in Speech and Language Therapy were participants in an experiment. They heard two nonsense words presented ten times in either one or two voices. Results show that the number and gender of voices did not affect accuracy, but that accuracy increased between six and ten repetitions. Implications for teaching and learning, clinical practice, and further research are discussed.