217 research outputs found

    Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates

    Get PDF
    International audienceDespite years of speech recognition research, little is known about which words tend to be misrecognized and why. Previous work has shown that errors increase for infrequent words, short words, and very loud or fast speech, but many other presumed causes of error (e.g., nearby disfluencies, turn-initial words, phonetic neighborhood density) have never been carefully tested. The reasons for the huge differences found in error rates between speakers also remain largely mysterious. Using a mixed-effects regression model, we investigate these and other factors by analyzing the errors of two state-of-the-art recognizers on conversational speech. Words with higher error rates include those with extreme prosodic characteristics, those occurring turn-initially or as discourse markers, and : acoustically similar words that also have similar language model probabilities. Words preceding disfluent interruption points (first repetition tokens and words before fragments) also have higher error rates. Finally, even after accounting for other factors, speaker differences cause enormous variance in error rates, suggesting that speaker error rate variance is not fully explained by differences in word choice, fluency, or prosodic characteristics. We also propose that doubly confusable pairs, rather than high neighborhood density, may better explain phonetic neighborhood errors in human speech processing

    Prosodic Classification of Discourse Markers

    Get PDF
    The first contribution of this study is the description of the prosodic behavior of discourse markers present in two speech corpora of European Portuguese (EP) in different domains (university lectures, and map-task dialogues). The second contribution is a multiclass classification to verify, given their prosodic features, which words in both corpora are classified as discourse markers, which are disfluencies, and which correspond to words that are neither markers nor disfluencies (chunks). Our goal is to automatically predict discourse markers and include them in rich transcripts, along with other structural metadata events (e.g., disfluencies and punctuation marks) that are already encompassed in the language models of our in-house speech recognizer. Results show that the automatic classification of discourse markers is better for the lectures corpus (87%) than for the dialogue corpus (84%). Nonetheless, in both corpora, discourse markers are more easily confused with chunks than with disfluencies.info:eu-repo/semantics/publishedVersio

    Acoustic-prosodic entrainment in structural metadata events

    Get PDF
    This paper presents an acoustic-prosodic analysis of entrain- ment in a Portuguese map-task corpus. Our aim is to ana- lyze how turn-by-turn entrainment varies with distinct structural metadata events: types of sentence-like units (SU) in consecu- tive turns (e.g. interrogatives followed by declaratives, or both declaratives), and with the presence of discourse markers, affir- mative cue words, and disfluencies in the beginning of turns. Entrainment at turn-exchanges may be observed in terms of pitch, energy, duration, and voice quality. Regarding SU types, question-answer turns are the ones with stronger similarity, and declarative-interrogative pairs are the ones where less entrain- ment occurs, as expected. Moreover, in question-answer pairs, there is also stronger evidence of entrainment with Yes/No and Tag questions than with Wh- questions. In fact, these subtypes are coded in distinctive prosodic ways (moreover, the first sub- type has no associated lexical-syntactic cues in Portuguese, only prosodic). As for turn-initial structures, entrainment is stronger when the second turn begins with an affirmative cue word; less strong with ambiguous structures (such as ‘OK’), emphatic af- firmative answers, and negative answers; and scarce with dis- fluencies and discourse markers. The different degrees of local entrainment may be related with the informative structure of distinct structural metadata events.info:eu-repo/semantics/publishedVersio

    Automatic detection of disfluencies in a corpus of university lectures

    Get PDF
    This dissertation focuses on the identification of disfluent sequences and their distinct structural regions. Reported experiments are based on audio segmentation and prosodic features, calculated from a corpus of university lectures in European Portuguese, containing about 32 hours of speech and about 7.7% of disfluencies. The set of features automatically extracted from the forced alignment corpus proved to be discriminant of the regions contained in the production of a disfluency. The best results concern the detection of the interregnum, followed by the detection of the interruption point. Several machine learning methods have been applied, but experiments show that Classification and Regression Trees usually outperform the other methods. The set of most informative features for cross-region identification encompasses word duration ratios, word confidence score, silent ratios, and pitch and energy slopes. Features such as the number of phones and syllables per word proved to be more useful for the identification of the interregnum, whereas energy slopes were most suited for identifying the interruption point. We have also conducted initial experiments on automatic detecting filled pauses, the most frequent disfluency type. For now, only force aligned transcripts were used, since the ASR system is not well adapted to this domain. This study is a step towards automatic detection of filled pauses for European Portuguese using prosodic features. Future work will extend this study for fully automatic transcripts, and will also tackle other domains, also exploring extended sets of linguistic features.Esta tese aborda a identificação de sequências disfluentes e respetivas regiões estruturais. As experiências aqui descritas baseiam-se em segmentação e informação relativa a prosódia, calculadas a partir de um corpus de aulas universitárias em Português Europeu, contendo cerca de 32 horas de fala e de cerca de 7,7% de disfluências. O conjunto de características utilizadas provou ser discriminatório na identificação das regiões contidas na produção de disfluências. Os melhores resultados dizem respeito à deteção do interregnum, seguida da deteção do ponto de interrupção. Foram testados vários métodos de aprendizagem automática, sendo as Árvores de Decisão e Regressão as que geralmente obtiveram os melhores resultados. O conjunto de características mais informativas para a identificação e distinção de regiões disfluentes abrange rácios de duração de palavras, nível de confiança da palavra atual, rácios envolvendo silêncios e declives de pitch e de energia. Características tais como o número de fones e sílabas por palavra provaram ser mais úteis para a identificação do interregnum, enquanto pitch e energia foram os mais adequados para identificar o ponto de interrupção. Foram também realizadas experiências focando a deteção de pausas preenchidas. Por enquanto, para estas experiências foi utilizado apenas material proveniente de alinhamento forçado, já que o sistema de reconhecimento automático não está bem adaptado a este domínio. Este estudo representa um novo passo no sentido da deteção automática de pausas preenchidas para Português Europeu, utilizando recursos prosódicos. Em trabalho futuro pretende-se estender esse estudo para transcrições automáticas e também abordar outros domínios, explorando conjuntos mais extensos de características linguísticas

    Cross-domain analysis of discourse markers in European Portuguese

    Get PDF
    This paper presents an analysis of discourse markers in two spontaneous speech corpora for European Portuguese - university lectures and map-task dialogues - and also in a collection of tweets, aiming at contributing to their categorization, scarcely existent for European Portuguese. Our results show that the selection of discourse markers is domain and speaker dependent. We also found that the most frequent discourse markers are similar in all three corpora, despite tweets containing discourse markers not found in the other two corpora. In this multidisciplinary study, comprising both a linguistic perspective and a computational approach, discourse markers are also automatically discriminated from other structural metadata events, namely sentence-like units and disfluencies. Our results show that discourse markers and disfluencies tend to co-occur in the dialogue corpus, but have a complementary distribution in the university lectures. We used three acoustic-prosodic feature sets and machine learning to automatically distinguish between discourse markers, disfluencies and sentence-like units. Our in-domain experiments achieved an accuracy of about 87% in university lectures and 84% in dialogues, in line with our previous results. The eGeMAPS features, commonly used for other paralinguistic tasks, achieved a considerable performance on our data, especially considering the small size of the feature set. Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result also previously reported in the literature. We conducted a cross-domain evaluation in order to evaluate the robustness of the models across domains. The results achieved are about 11%-12% lower, but we conclude that data from one domain can still be used to classify the same events in the other. Overall, despite the complexity of this task, these are very encouraging state-of-the-art results. Ultimately, using exclusively acoustic-prosodic cues, discourse markers can be fairly discriminated from disfluencies and SUs. In order to better understand the contribution of each feature, we have also reported the impact of the features in both the dialogues and the university lectures. Pitch features are the most relevant ones for the distinction between discourse markers and disfluencies, namely pitch slopes. These features are in line with the wide pitch range of discourse markers, in a continuum from a very compressed pitch range to a very wide one, expressed by total deaccented material or H+L* L* contours, with upstep H tones.info:eu-repo/semantics/publishedVersio

    Cross-domain analysis of discourse markers in European Portuguese

    Get PDF
    This paper presents an analysis of discourse markers in two spontaneous speech corpora for European Portuguese - university lectures and map-task dialogues - and also in a collection of tweets, aiming at contributing to their categorization, scarcely existent for European Portuguese. Our results show that the selection of discourse markers is domain and speaker dependent. We also found that the most frequent discourse markers are similar in all three corpora, despite tweets containing discourse markers not found in the other two corpora. In this multidisciplinary study, comprising both a linguistic perspective and a computational approach, discourse markers are also automatically discriminated from other structural metadata events, namely sentence-like units and disfluencies. Our results show that discourse markers and disfluencies tend to co-occur in the dialogue corpus, but have a complementary distribution in the university lectures. We used three acoustic-prosodic feature sets and machine learning to automatically distinguish between discourse markers, disfluencies and sentence-like units. Our in-domain experiments achieved an accuracy of about 87% in university lectures and 84% in dialogues, in line with our previous results. The eGeMAPS features, commonly used for other paralinguistic tasks, achieved a considerable performance on our data, especially considering the small size of the feature set. Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result also previously reported in the literature. We conducted a cross-domain evaluation in order to evaluate the robustness of the models across domains. The results achieved are about 11%-12% lower, but we conclude that data from one domain can still be used to classify the same events in the other. Overall, despite the complexity of this task, these are very encouraging state-of-the-art results. Ultimately, using exclusively acoustic-prosodic cues, discourse markers can be fairly discriminated from disfluencies and SUs. In order to better understand the contribution of each feature, we have also reported the impact of the features in both the dialogues and the university lectures. Pitch features are the most relevant ones for the distinction between discourse markers and disfluencies, namely pitch slopes. These features are in line with the wide pitch range of discourse markers, in a continuum from a very compressed pitch range to a very wide one, expressed by total deaccented material or H+L* L* contours, with upstep H tones

    The Distribution Of Disfluencies In Spontaneous Speech: Empirical Observations And Theoretical Implications

    Get PDF
    This dissertation provides an empirical description of the forms and their distribution of disfluencies in spontaneous speech. Although research in this area has received much attention in past four decades, large scale analyses of speech corpora from multiple communication settings, languages, and speaker\u27s cognitive states are still lacking. Understandings of regularities of different kinds of disfluencies based on large speech samples across multiple domains are essential for both theoretical and applied purposes. As an attempt to fill this gap, this dissertation takes the approach of quantitative analysis of large corpora of spontaneous speech. The selected corpora reflect a diverse range of tasks and languages. The dissertation re-examines speech disfluency phenomena, including silent pauses, filled pauses (``um and ``uh ) and repetitions, and provides the empirical basis for future work in both theoretical and applied settings. Results from the study of silent and filled pauses indicate that a potential sociolinguistic variation can in fact be explained from the perspective of the speech planning process. The descriptive analysis of repetitions has identified a new form of repetitive phenomenon: repetitive interpolation. Both the acoustic and textual properties of repetitive interpolation have been documented through rigorous quantitative analysis. The defining features of this phenomenon can be further used in designing speech based applications such as speaker state detection. Although the goal of this descriptive analysis is not to formulate and test specific hypothesis about speech production, potential directions for future research in speech production models are proposed and evaluated. The quantitative methods employed throughout this dissertation can also be further developed into interpretable features in machine learning systems that require automatic processing of spontaneous speech

    The Distribution Of Disfluencies In Spontaneous Speech: Empirical Observations And Theoretical Implications

    Get PDF
    This dissertation provides an empirical description of the forms and their distribution of disfluencies in spontaneous speech. Although research in this area has received much attention in past four decades, large scale analyses of speech corpora from multiple communication settings, languages, and speaker\u27s cognitive states are still lacking. Understandings of regularities of different kinds of disfluencies based on large speech samples across multiple domains are essential for both theoretical and applied purposes. As an attempt to fill this gap, this dissertation takes the approach of quantitative analysis of large corpora of spontaneous speech. The selected corpora reflect a diverse range of tasks and languages. The dissertation re-examines speech disfluency phenomena, including silent pauses, filled pauses (``um and ``uh ) and repetitions, and provides the empirical basis for future work in both theoretical and applied settings. Results from the study of silent and filled pauses indicate that a potential sociolinguistic variation can in fact be explained from the perspective of the speech planning process. The descriptive analysis of repetitions has identified a new form of repetitive phenomenon: repetitive interpolation. Both the acoustic and textual properties of repetitive interpolation have been documented through rigorous quantitative analysis. The defining features of this phenomenon can be further used in designing speech based applications such as speaker state detection. Although the goal of this descriptive analysis is not to formulate and test specific hypothesis about speech production, potential directions for future research in speech production models are proposed and evaluated. The quantitative methods employed throughout this dissertation can also be further developed into interpretable features in machine learning systems that require automatic processing of spontaneous speech
    corecore