    I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient.
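The leave-one-speaker-out protocol and the average-recall metric mentioned in this abstract can be illustrated with a minimal sketch. It assumes that "average recall" means the unweighted mean of per-class recalls; the function names and the toy speaker IDs are hypothetical, and no classifier is included (the abstract's SVM, or any other model, would be trained on the training indices of each fold).

```python
from collections import defaultdict


def leave_one_speaker_out(speakers):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally regardless of size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data: one speaker ID per utterance (hypothetical values).
speakers = ["s01", "s01", "s02", "s03"]
for held_out, train_idx, test_idx in leave_one_speaker_out(speakers):
    # No utterance of the held-out speaker appears in the training indices.
    print(held_out, train_idx, test_idx)
```

Splitting by speaker rather than by utterance keeps the test speaker's voice out of training, which is why this kind of protocol is often preferred for speaker-independent evaluation.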

    Spoken query processing for interactive information retrieval

    It has long been recognised that interactivity improves the effectiveness of information retrieval systems. Speech is the most natural and interactive medium of communication and recent progress in speech recognition is making it possible to build systems that interact with the user via speech. However, given the typical length of queries submitted to information retrieval systems, it is easy to imagine that the effects of word recognition errors in spoken queries must be severely destructive to the system's effectiveness. The experimental work reported in this paper shows that the use of classical information retrieval techniques for spoken query processing is robust to considerably high levels of word recognition errors, in particular for long queries. Moreover, in the case of short queries, both standard relevance feedback and pseudo relevance feedback can be effectively employed to improve the effectiveness of spoken query processing.
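Pseudo relevance feedback, as named in this abstract, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: the scoring function is a plain term-match count, and the names `score`, `rank`, `pseudo_relevance_feedback`, `top_docs`, and `new_terms` are all hypothetical.

```python
from collections import Counter


def score(query_terms, doc_terms):
    """Count the document tokens that match a query term (toy term-frequency score)."""
    query = set(query_terms)
    return sum(1 for term in doc_terms if term in query)


def rank(query_terms, docs):
    """Return document indices by descending score; ties keep their original order."""
    return sorted(range(len(docs)), key=lambda i: -score(query_terms, docs[i]))


def pseudo_relevance_feedback(query_terms, docs, top_docs=2, new_terms=2):
    """Assume the top-ranked documents are relevant and add their most frequent new terms."""
    top = rank(query_terms, docs)[:top_docs]
    counts = Counter(t for i in top for t in docs[i] if t not in query_terms)
    return list(query_terms) + [term for term, _ in counts.most_common(new_terms)]


# Toy collection of tokenised documents (hypothetical content).
docs = [
    ["speech", "recognition", "errors", "speech"],
    ["speech", "query", "retrieval", "query"],
    ["cooking", "recipes"],
]
expanded = pseudo_relevance_feedback(["speech"], docs, top_docs=2, new_terms=1)
print(expanded)
```

The expanded query is then run again; a short spoken query that lost a word to a recognition error can regain related terms this way, which is one plausible reading of why feedback helps short queries in particular.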

    Are words easier to learn from infant- than adult-directed speech? A quantitative corpus-based investigation

    We investigate whether infant-directed speech (IDS) could facilitate word form learning when compared to adult-directed speech (ADS). To study this, we examine the distribution of word forms at two levels, acoustic and phonological, using a large database of spontaneous speech in Japanese. At the acoustic level we show that, as has been documented before for phonemes, the realizations of words are more variable and less discriminable in IDS than in ADS. At the phonological level, we find an effect in the opposite direction: the IDS lexicon contains more distinctive words (such as onomatopoeias) than the ADS counterpart. Combining the acoustic and phonological metrics together in a global discriminability score reveals that the bigger separation of lexical categories in the phonological space does not compensate for the opposite effect observed at the acoustic level. As a result, IDS word forms are still globally less discriminable than ADS word forms, even though the effect is numerically small. We discuss the implication of these findings for the view that the functional role of IDS is to improve language learnability.

    Pauses and the temporal structure of speech

    Natural-sounding speech synthesis requires close control over the temporal structure of the speech flow. This includes a full predictive scheme for the durational structure, in particular the prolongation of final syllables of lexemes, as well as for the pausal structure in the utterance. In this chapter, a description of the temporal structure and a summary of the numerous factors that modify it are presented. In the second part, predictive schemes for the temporal structure of speech ("performance structures") are introduced, and their potential for characterising the overall prosodic structure of speech is demonstrated.

    Access to recorded interviews: A research agenda

    Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state of the art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed.

    Modelling the effects of speech rate variation for automatic speech recognition

    Wrede B. Modelling the effects of speech rate variation for automatic speech recognition. Bielefeld (Germany): Bielefeld University; 2002. In automatic speech recognition it is a widely observed phenomenon that variations in speech rate cause severe degradations of the speech recognition performance. This is due to the fact that standard stochastic based speech recognition systems specialise on average speech rate. Although many approaches to modelling speech rate variation have been made, an integrated approach in a substantial system still has to be developed. General approaches to rate modelling are based on rate dependent models which are trained with rate specific subsets of the training data. During decoding a signal based rate estimation is performed according to which the set of rate dependent models is selected. While such approaches are able to reduce the word error rate significantly, they suffer from shortcomings such as the reduction of training data and the expensive training and decoding procedure. However, phonetic investigations show that there is a systematic relationship between speech rate and the acoustic characteristics of speech. In fast speech a tendency of reduction can be observed which can be described in more detail as a centralisation effect and an increase in coarticulation. Centralisation means that the formant frequencies of vowels tend to shift towards the vowel space center while increased coarticulation denotes the tendency of the spectral features of a vowel to shift towards those of its phonemic neighbour. The goal of this work is to investigate the possibility of incorporating the knowledge of the systematic nature of the influence of speech rate variation on the acoustic features in speech rate modelling. In an acoustic-phonetic analysis of a large corpus of spontaneous speech it was shown that an increased degree of the two effects of centralisation and coarticulation can be found in fast speech. Several measures for these effects were developed and used in speech recognition experiments with rate dependent models. A thorough investigation of rate dependent models showed that with duration and coarticulation based measures significant increases of the performance could be achieved. It was shown that by the use of different measures the models were adapted either to centralisation or coarticulation. Further experiments showed that by a more detailed modelling with more rate classes a further improvement can be achieved. It was also observed that a general basis for the models is needed before rate adaptation can be performed. In a comparison to other sources of acoustic variation it was shown that the effects of speech rate are as severe as those of speaker variation and environmental noise. All these results show that for a more substantial system that models rate variations accurately it is necessary to focus on both durational and spectral effects. The systematic nature of the effects indicates that a continuous modelling is possible.

    Lexical stress and lexical access: effects in read and spontaneous speech

    This thesis examines three issues which are of importance in the study of auditory word recognition: the phonological unit which is used to access representations in the mental lexicon; the extent to which hearers can rely on words being identified before their acoustic offsets; and the role of context in auditory word recognition. Three hypotheses which are based on the predictions of the Cohort Model (Marslen-Wilson and Tyler 1980) are tested experimentally using the gating paradigm. First, the phonological access hypothesis claims that word onsets, rather than any other part of the word, are used to access representations in the mental lexicon. An alternative candidate which has been proposed as the initiator of lexical access is the stressed syllable. Second, the early recognition hypothesis states that polysyllabic words, and the majority of words heard in context, will be recognised before their acoustic offsets. Finally, the context-free hypothesis predicts that during the initial stages of the processing of words, no effects of context will be discernible. Experiment 1 tests all three predictions by manipulating aspects of carefully articulated, read speech. First, examination of the gating responses from three context conditions offers no support for the context-free hypothesis. Second, the high number of words which are identified before their acoustic offsets is consistent with the early recognition hypothesis. Finally, the phonological access hypothesis is tested by manipulation of the stress patterns of stimuli. The dependent variables which are examined relate to the processes of lexical access and lexical retrieval; stress differences are found on access measures but not on those relating to retrieval. When the experiment is replicated with a group of subjects whose level of literacy is lower than that of the undergraduates who took part in the original experiment, differences are found in measures relating to contextual processing. Experiment 2 continues to examine the phonological access hypothesis, by manipulating speech style (read versus conversational) as well as stress pattern. Gated words, excised from the speech of six speakers, are presented in isolation. Words excised from read speech and words stressed on the first syllable elicit a greater number of responses which match the stimuli than conversational tokens and words with unstressed initial syllables. Intelligibility differences among the four conditions are also reported. Experiment 3 aims to investigate the processing of read and spontaneous tokens heard in context, while maintaining the manipulation of stress pattern. A subset of the words from Experiment 2 are presented in their original sentence contexts: the test words themselves, plus up to three subsequent words, are gated. Although the presence of preceding context generally enhances intelligibility, some words remain unrecognised by the end of the third subsequent word. An interaction between stress and speech style may be explained in terms of the unintelligibility of the preceding context. Several issues arising from Experiments 1, 2 and 3 are considered further. The characteristics of words which fail to be recognised before their offsets are examined using the statistical technique of regression; the contributions of phonetic and phonological aspects of stressed syllables are assessed; and a further experiment is reported which explores top-down processing in spontaneous speech, and which offers support for the interpretation of the results of Experiment 3 offered earlier.

    Impaired generalization of speaker identity in the perception of familiar and unfamiliar voices

    In 2 behavioral experiments, we explored how the extraction of identity-related information from familiar and unfamiliar voices is affected by naturally occurring vocal flexibility and variability, introduced by different types of vocalizations and levels of volitional control during production. In a first experiment, participants performed a speaker discrimination task on vowels, volitional (acted) laughter, and spontaneous (authentic) laughter from 5 unfamiliar speakers. We found that performance was significantly impaired for spontaneous laughter, a vocalization produced under reduced volitional control. We additionally found that the detection of identity-related information fails to generalize across different types of nonverbal vocalizations (e.g., laughter vs. vowels) and across mismatches in volitional control within vocalization pairs (e.g., volitional laughter vs. spontaneous laughter), with performance levels indicating an inability to discriminate between speakers. In a second experiment, we explored whether personal familiarity with the speakers would afford greater accuracy and better generalization of identity perception. Using new stimuli, we largely replicated our previous findings: whereas familiarity afforded a consistent performance advantage for speaker discriminations, the experimental manipulations impaired performance to similar extents for familiar and unfamiliar listener groups. We discuss our findings with reference to prototype-based models of voice processing and suggest potential underlying mechanisms and representations of familiar and unfamiliar voice perception. (PsycINFO Database Record (c) 2016 APA, all rights reserved)