688 research outputs found
Prosodic Event Recognition using Convolutional Neural Networks with Context Information
This paper demonstrates the potential of convolutional neural networks (CNN)
for detecting and classifying prosodic events on words, specifically pitch
accents and phrase boundary tones, from frame-based acoustic features. Typical
approaches use not only feature representations of the word in question but
also its surrounding context. We show that adding position features indicating
the current word benefits the CNN. In addition, this paper discusses the
generalization from a speaker-dependent modelling approach to a
speaker-independent setup. The proposed method is simple and efficient and
yields strong results not only in speaker-dependent but also
speaker-independent cases.Comment: Interspeech 2017 4 pages, 1 figur
Exploiting Contextual Information for Prosodic Event Detection Using Auto-Context
Prosody and prosodic boundaries carry significant information regarding linguistics and paralinguistics and are important aspects of speech. In the field of prosodic event detection, many local acoustic features have been investigated; however, contextual information has not yet been thoroughly exploited. The most difficult aspect of this lies in learning the long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce the use of an algorithm called auto-context. In this algorithm, a classifier is first trained based on a set of local acoustic features, after which the generated probabilities are used along with the local features as contextual information to train new classifiers. By iteratively using updated probabilities as the contextual information, the algorithm can accurately model contextual dependencies and improve classification ability. The advantages of this method include its flexible structure and the ability of capturing contextual relationships. When using the auto-context algorithm based on support vector machine, we can improve the detection accuracy by about 3% and F-score by more than 7% on both two-way and four-way pitch accent detections in combination with the acoustic context. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary detection in terms of F-score. It also outperforms an n-gram language model on the task of pitch accent detection
Analyzing Prosody with Legendre Polynomial Coefficients
This investigation demonstrates the effectiveness of Legendre polynomial coefficients representing prosodic contours within the context of two different tasks: nativeness classification and sarcasm detection. By making use of accurate representations of prosodic contours to answer fundamental linguistic questions, we contribute significantly to the body of research focused on analyzing prosody in linguistics as well as modeling prosody for machine learning tasks. Using Legendre polynomial coefficient representations of prosodic contours, we answer prosodic questions about differences in prosody between native English speakers and non-native English speakers whose first language is Mandarin. We also learn more about prosodic qualities of sarcastic speech. We additionally perform machine learning classification for both tasks, (achieving an accuracy of 72.3% for nativeness classification, and achieving 81.57% for sarcasm detection). We recommend that linguists looking to analyze prosodic contours make use of Legendre polynomial coefficients modeling; the accuracy and quality of the resulting prosodic contour representations makes them highly interpretable for linguistic analysis
Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations
In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge this will be the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail and train a number of different models ranging from feature-based classifiers to neural network systems for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models even with less than 10% of the training data. Finally we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and methods of predicting prosodic prominence from text. The dataset and the code for the models are publicly available.Peer reviewe
Recommended from our members
Production of English Prominence by Native Mandarin Chinese Speakers
Native-like production of intonational prominence is important for spoken language competency. Non-native speakers may have trouble producing prosodic variation in a second language (L2) and thus, problems in being understood. By identifying common sources of production error, we will be able to aid in the instruction of L2 speakers. In this paper we present results of a production study designed to test the ability of Mandarin L1 speakers to produce prominence in English. Our results show that there are some consistent differences between the L1 and L2 speakers in the use of pitch to indicate prominence, as well as in the accenting of phrase-initial tokens. We also find that we can automatically detect prominence on Mandarin L1 English with 87.23% and an f-measure of 0.866 if we train a classifier with annotated Mandarin L1 English data. Models trained on native English speech can detect prominence in Mandarin L1 English with an accuracy of 74.77% and f-measure of 0.824
Detecting Prominence in Conversational Speech: Pitch Accent, Givenness and Focus
The variability and reduction that are characteristic of talking in natural interaction make it very difficult to detect prominence in conversational speech. In this paper, we present analytic studies and automatic detection results for pitch accent, as well as on the realization of information structure phenomena like givenness and focus. For pitch accent, our conditional random field model combining acoustic and textual features has an accuracy of 78%, substantially better than chance performance of 58%. For givenness and focus, our analysis demonstrates that even in conversational speech there are measurable differences in acoustic properties and that an automatic detector for these categories can perform significantly above chance
Automatic prosodic analysis for computer aided pronunciation teaching
Correct pronunciation of spoken language requires the appropriate modulation of acoustic characteristics of speech to convey linguistic information at a suprasegmental level. Such prosodic modulation is a key aspect of spoken language and is an important component of foreign language learning, for purposes of both comprehension and intelligibility. Computer aided pronunciation teaching involves automatic analysis of the speech of a non-native talker in order to provide a diagnosis of the learner's performance in comparison with the speech of a native talker. This thesis describes research undertaken to automatically analyse the prosodic aspects of speech for computer aided pronunciation teaching. It is necessary to describe the suprasegmental composition of a learner's speech in order to characterise significant deviations from a native-like prosody, and to offer some kind of corrective diagnosis. Phonological theories of prosody aim to describe the suprasegmental composition of speech..
- …