9 research outputs found

    Prosodic Event Recognition using Convolutional Neural Networks with Context Information

    This paper demonstrates the potential of convolutional neural networks (CNN) for detecting and classifying prosodic events on words, specifically pitch accents and phrase boundary tones, from frame-based acoustic features. Typical approaches use not only feature representations of the word in question but also its surrounding context. We show that adding position features indicating the current word benefits the CNN. In addition, this paper discusses the generalization from a speaker-dependent modelling approach to a speaker-independent setup. The proposed method is simple and efficient and yields strong results not only in speaker-dependent but also speaker-independent cases. Comment: Interspeech 2017, 4 pages, 1 figure.

    Cross-linguistic Influences on Sentence Accent Detection in Background Noise.

    This paper investigates whether sentence accent detection in a non-native language depends on the (relative) similarity of the prosodic cues to accent in the non-native and the native language, and whether cross-linguistic differences in the use of local and more widely distributed (i.e., non-local) cues to sentence accent lead to differential effects of background noise on sentence accent detection in a non-native language. We compared Dutch, Finnish, and French non-native listeners of English, whose cueing and use of prosodic prominence is gradually further removed from English, and compared their results on a phoneme monitoring task in different levels of noise and a quiet condition to those of native listeners. Overall phoneme detection performance was high for the native and the non-native listeners, but deteriorated to the same extent in the presence of background noise. Crucially, relative similarity between the prosodic cues to sentence accent of one's native language and those of a non-native language does not determine the ability to perceive and use sentence accent for speech perception in that non-native language. Moreover, proficiency in the non-native language is not a straightforward predictor of sentence accent perception performance, although high proficiency in a non-native language can seemingly overcome certain differences at the prosodic level between the native and non-native language. Instead, performance is determined by the extent to which listeners rely on local cues (English and Dutch) versus cues that are more distributed (Finnish and French), as more distributed cues survive the presence of background noise better.

    Detecting Pitch Accents at the Word, Syllable and Vowel Level

    The automatic identification of prosodic events such as pitch accent in English has long been a topic of interest to speech researchers, with applications to a variety of spoken language processing tasks. However, much remains to be understood about the best methods for obtaining high accuracy detection. We describe experiments examining the optimal domain for accent analysis. Specifically, we compare pitch accent identification at the syllable, vowel or word level as domains for analysis of acoustic indicators of accent. Our results indicate that a word-based approach is superior to syllable- or vowel-based detection, achieving an accuracy of 84.2%.

    Effects of prosody on natural language processing

    Prosody -- or the systematic variation in the energy, pitch, timing, and voice quality of speech -- plays an important role in speech communication. For example, pitch is the primary way an English speaker can distinguish between certain kinds of questions and statements (e.g., 'That's today?' vs. 'That's today.'). Despite the fact that prosody can convey a range of linguistic features, it is uncommon for NLP systems that deal with speech inputs to give consideration to prosodic features. Many systems such as dialog agents start with an automatic speech recognition (ASR) step, which converts the audio signal into text, after which all prosodic information is discarded. Previous research has established that prosody can be helpful -- it has been shown to aid in tasks such as syntactic parsing (Tran et al. 2018) -- but the amount of benefit shown for many tasks is modest enough that including prosodic inputs still remains a niche approach in NLP. The goal of this thesis is to revisit the question of how prosodic features can benefit a range of NLP tasks. First, Chapter 3 considers the question of what modeling choices are best for incorporating prosodic inputs to NLP tasks. These experiments show that a wide input context is helpful in detecting prosodic information, but even so, text features alone are able to predict a relatively large portion of prosodic activity. Second, Chapter 4 showcases an example where prosody has no observed effect. Even though there is good linguistic justification for expecting that prosody should help in better conveying information status in speech translation, this effect is not seen because the biases of the speech translation model itself make any effect unmeasurable, underscoring the importance of task and model selection. Third, Chapter 5 shows that prosody does help with syntactic parsing in the more realistic setting where the input is not pre-segmented into sentences. In fact, prosody helps more with segmenting the speech into sentences than with parsing itself, but both tasks benefit. These experiments show that the realistic task of parsing plus segmentation benefits in more ways from including prosody than does parsing alone. Finally, Chapter 6 considers what happens in the sentence segmentation task when an ASR transcript is used as the lexical input, and acoustic noise is introduced to the audio signal. As more sources of noise are added, prosody becomes progressively more important for the model's performance. This suggests that the information in the prosodic and lexical channels is somewhat redundant, with the prosodic channel acting more as a 'back-up' for the lexical channel than as a channel for novel information. Together, these results suggest that prosody has the potential to be helpful in many NLP tasks, but that these benefits are more marked in cases that better approximate real-world language usage, where there are obstacles to clear communication. Because the information in the prosodic and lexical channels overlaps so much, adding prosodic information does not boost performance as much when both channels are clear and unobstructed. However, when obstacles to clear perception (such as lacking sentence boundaries, using an ASR transcript, or acoustic noise) are present, prosody becomes more important. This suggests that in future work, it will be important to move towards modelling assumptions that better approximate the non-idealized conditions of real-world language use in order to fully understand the value of prosody for NLP tasks.

    Sociophonetics and class differentiation: A study of working- and middle-class English in Cape Town's coloured community

    Includes bibliographical references. This thesis provides a detailed acoustic description of the phonetic variation and changes evident in the monophthongal vowel system of Coloured South African English in Cape Town. The changes are largely a result of South Africa's post-apartheid socio-educational reform. A detailed acoustic description highlights the most salient changes (compared with earlier reports of the variety), indicating the extent of the change amongst working-class and middle-class speakers. The fieldwork conducted for this study consists of sociolinguistic interviews, conducted with a total of 40 Coloured speakers (half male, half female) from both working-class and middle-class backgrounds. All speakers were young adults, born between 1983 and 1993, thus raised and schooled in a period of transition from apartheid to democracy. Each of the middle-class speakers had some experience of attending formerly exclusively White schools, giving them significant contact with White peers and teachers, while the educational careers of the working-class speakers exposed them almost solely to Coloured peers and educators. The acoustic data were processed using methods of Forced Alignment and automatic formant extraction – methods applied for the first time to any variety of South African English. The results of the analysis were found generally to support the findings of scholars who have documented this variety previously, with some notable exceptions amongst middle-class speakers. The changes are attributable to socio-educational change in the post-apartheid setting, and the directionality of the changes approximates trends amongst White South African English speakers. The TRAP, GOOSE and FOOT lexical sets show most change: TRAP is lowering, while GOOSE and FOOT are fronting. Although the changes approximate the vowel quality used by White speakers, middle-class Coloured speakers use an intermediate value between White speakers and working-class Coloured speakers, i.e., they have not fully adopted White norms for any of the vowel classes. Working-class speakers were found to have maintained the monophthongal vowel system traditionally used by Coloured speakers.

    Dynamic aspects of speech and intonation in Brazilian Portuguese

    Advisor: Plinio Almeida Barbosa. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Estudos da Linguagem. Abstract: This thesis explores the relationship between intonational patterns, speech rhythm, and discourse, according to the dynamic systems research program. The study of these relationships is based on Barbosa's (2006) Dynamic Model of Speech Rhythm, on the DaTo intonational annotation system proposed by Lucente (2008), and on the Computational Model of the Structure of Discourse proposed by Grosz & Sidner (1986). The Dynamic Model of Rhythm suggests that speech rhythm results from the action of two oscillators, one accentual and one syllabic, which receive information from higher linguistic levels and a gestural score as input and produce gestural duration as output. The hypothesis of this thesis is that, in parallel with these oscillators, a glottal oscillator may act to control the intonation patterns of speech. These patterns, or intonational cycles, which organize Brazilian Portuguese (BP) intonation, emerge when related to the segmentation of spontaneous discourse. For each stretch of speech classified as spontaneous, according to a criterion proposed in this thesis, the discourse is segmented in the DaTo system into linguistically structured units, which carry the purposes of communicating and attracting attention. Each of these discourse segments is aligned to an intonation pattern that begins with a rising contour (LH or >LH) and ends with a falling contour (LHL) or a low boundary level (L). Speech rhythm is also aligned to this pattern formed between intonation and discourse. Adding a new layer to the DaTo system for segmenting utterances into stress groups made it possible to observe the alignment between stress group segmentation and the annotation of intonational contours, coinciding with the boundaries of discourse units. The alignment between intonation, rhythm and discourse, with the stress groups as attractors, allowed us to propose the insertion of a glottal oscillator into the Dynamic Model of Rhythm. Degree: Doctorate (Doutora em Linguística).

    Computational Approaches to the Syntax–Prosody Interface: Using Prosody to Improve Parsing

    Prosody has strong ties with syntax, since prosody can be used to resolve some syntactic ambiguities. Syntactic ambiguities have been shown to negatively impact automatic syntactic parsing, hence there is reason to believe that prosodic information can help improve parsing. This dissertation considers a number of approaches that aim to computationally examine the relationship between prosody and syntax of natural languages, while also addressing the role of syntactic phrase length, with the ultimate goal of using prosody to improve parsing. Chapter 2 examines the effect of syntactic phrase length on prosody in double center embedded sentences in French. Data collected in a previous study were reanalyzed using native speaker judgment and automatic methods (forced alignment). Results demonstrate prosodic splitting behavior similar to that in English, in contradiction to the original study's findings. Chapter 3 presents a number of studies examining whether syntactic ambiguity can yield different prosodic patterns, allowing humans and/or computers to resolve the ambiguity. In an experimental study, humans disambiguated sentences with prepositional phrase (PP) attachment ambiguity with 49% accuracy when presented as text, and 63% when presented as audio. Machine learning on the same data yielded an accuracy of 63-73%. A corpus study on the Switchboard corpus used both prosodic breaks and phrase lengths to predict the attachment, with an accuracy of 63.5% for PP-attachment sentences, and 71.2% for relative clause attachment. Chapter 4 aims to identify aspects of syntax that relate to prosody and use these in combination with prosodic cues to improve parsing. The aspects identified (dependency configurations) are based on dependency structure, reflecting the relative head location of two consecutive words, and are used as syntactic features in an ensemble system based on Recurrent Neural Networks, to score parse hypotheses and select the most likely parse for a given sentence. Using syntactic features alone, the system achieved an improvement of 1.1% absolute in Unlabelled Attachment Score (UAS) on the test set, above the best parser in the ensemble, while using syntactic features combined with prosodic features (pauses and normalized duration) led to a further improvement of 0.4% absolute. The results achieved demonstrate the relationship between syntax, syntactic phrase length, and prosody, and indicate the ability and future potential of prosody to resolve ambiguity and improve parsing.