5,138 research outputs found

    On Representation of Fundamental Frequency of Speech for Prosody Analysis Using Reliability Function.

    Get PDF
    This paper presents a method that provides a new prosodic feature called the 'F0 reliability field', based on a reliability function of the fundamental frequency (F0). The proposed method does not employ any correction process for F0 estimation errors that occur during automatic F0 extraction. By applying this feature as a score function for prosodic analyses such as prosodic structure estimation or superpositional modeling of prosodic commands, this prosodic information can be acquired with higher accuracy. The feature has been applied to an 'F0 template matching method', which detects accent phrase boundaries in Japanese continuous speech. The experimental results show that, compared to the conventional F0 contour, the proposed feature overcomes the harmful influence of F0 errors.
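As a concrete illustration of the general idea (a minimal sketch, not the paper's exact reliability function), the Python fragment below derives a per-frame reliability value from the normalized autocorrelation at the F0 candidate's lag and uses it to down-weight unreliable frames when scoring a candidate template against an F0 contour; `frame_reliability` and `weighted_template_score` are hypothetical names.

```python
import numpy as np

def frame_reliability(frame, f0_hz, sr=16000):
    """Normalized autocorrelation at the lag of the F0 candidate,
    used here as a crude per-frame reliability value in [0, 1]."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    lag = int(round(sr / f0_hz))
    if lag <= 0 or lag >= len(frame):
        return 0.0
    a, b = frame[:-lag], frame[lag:]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
    return float(max(0.0, np.dot(a, b) / denom))

def weighted_template_score(f0_track, reliability, template):
    """Score a candidate F0 template against an extracted F0 contour,
    down-weighting frames whose F0 estimate is unreliable."""
    f0 = np.asarray(f0_track, dtype=float)
    w = np.asarray(reliability, dtype=float)
    err = (f0 - np.asarray(template, dtype=float)) ** 2
    return -np.sum(w * err) / (np.sum(w) + 1e-12)
```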

    Fundamental frequency height as a resource for the management of overlap in talk-in-interaction.

    Get PDF
    Overlapping talk is common in talk-in-interaction. Much of the previous research on this topic agrees that speaker overlaps can be either turn competitive or noncompetitive. An investigation of the differences in prosodic design between these two classes of overlaps can offer insight into how speakers use and orient to prosody as a resource for turn competition. In this paper, we investigate the role of fundamental frequency (F0) as a resource for turn competition in overlapping speech. Our methodological approach combines detailed conversation analysis of overlap instances with acoustic measurements of F0 in the overlapping sequence and in its local context. The analyses are based on a collection of overlap instances drawn from the ICSI Meeting corpus. We found that overlappers mark an overlapping incoming as competitive by raising F0 above their norm for turn beginnings, and retaining this higher F0 until the point of overlap resolution. Overlappees may respond to these competitive incomings by returning competition, in which case they raise their F0 too. Our results thus provide instrumental support for earlier claims made on impressionistic evidence, namely that participants in talk-in-interaction systematically manipulate F0 height when competing for the turn.
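A minimal sketch of the kind of measurement involved, assuming F0 is expressed in semitones relative to a speaker-specific median and that a fixed 2-semitone margin (an illustrative threshold, not a value taken from the paper) marks an overlap onset as raised above the speaker's usual turn-beginning height:

```python
import numpy as np

def semitones_re_median(f0_hz, speaker_median_hz):
    """Convert F0 values in Hz to semitones relative to a speaker's median F0."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / speaker_median_hz)

def raised_onset(overlap_f0_hz, turn_start_f0_hz, speaker_median_hz, margin_st=2.0):
    """Return True if the overlapping onset's mean F0 exceeds the speaker's
    usual turn-beginning F0 by more than `margin_st` semitones."""
    onset = semitones_re_median(overlap_f0_hz, speaker_median_hz).mean()
    norm = semitones_re_median(turn_start_f0_hz, speaker_median_hz).mean()
    return bool(onset - norm > margin_st)

# Toy example: an overlap starting around 260 Hz against turn beginnings near 200 Hz.
print(raised_onset([255, 262, 268], [195, 200, 205], speaker_median_hz=190.0))
```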

    Cue Phrase Classification Using Machine Learning

    Full text link
    Cue phrases may be used in a discourse sense to explicitly signal discourse structure, but also in a sentential sense to convey semantic rather than structural information. Correctly classifying cue phrases as discourse or sentential is critical in natural language processing systems that exploit discourse structure, e.g., for performing tasks such as anaphora resolution and plan recognition. This paper explores the use of machine learning for classifying cue phrases as discourse or sentential. Two machine learning programs (Cgrendel and C4.5) are used to induce classification models from sets of pre-classified cue phrases and their features in text and speech. Machine learning is shown to be an effective technique not only for automating the generation of classification models, but also for improving upon previous results. When compared to manually derived classification models already in the literature, the learned models often perform with higher accuracy and contain new linguistic insights into the data. In addition, the ability to automatically construct classification models makes it easier to comparatively analyze the utility of alternative feature representations of the data. Finally, the ease of retraining makes the learning approach more scalable and flexible than manual methods.
    Comment: 42 pages, uses jair.sty, theapa.bst, theapa.st
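For illustration only, the toy example below induces a decision tree from pre-classified cue phrases with scikit-learn; CART is used as a rough stand-in for C4.5, and the three binary features are invented placeholders for the textual and prosodic features used in the paper.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy feature vectors per cue-phrase token:
# [preceded_by_pause, utterance_initial, tagged_as_conjunction]
X = [
    [1, 1, 1],  # e.g. "now" after a pause, utterance-initial
    [0, 0, 1],  # e.g. "and" mid-sentence
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
]
y = ["discourse", "sentential", "discourse", "sentential", "discourse", "sentential"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["pause", "initial", "conjunction"]))
print(clf.predict([[1, 1, 1]]))  # -> ['discourse'] on this toy data
```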

    Responses to intensity-shifted auditory feedback during running speech

    Full text link
    PURPOSE: Responses to intensity perturbation during running speech were measured to understand whether prosodic features are controlled in an independent or integrated manner. METHOD: Nineteen English-speaking healthy adults (age range = 21-41 years) produced 480 sentences in which emphatic stress was placed on either the 1st or 2nd word. One participant group received an upward intensity perturbation during stressed word production, and the other group received a downward intensity perturbation. Compensations for perturbation were evaluated by comparing differences in participants' stressed and unstressed peak fundamental frequency (F0), peak intensity, and word duration during perturbed versus baseline trials. RESULTS: Significant increases in stressed-unstressed peak intensities were observed during the ramp and perturbation phases of the experiment in the downward group only. Compensations for F0 and duration did not reach significance for either group. CONCLUSIONS: Consistent with previous work, speakers appear sensitive to auditory perturbations that affect a desired linguistic goal. In contrast to previous work on F0 perturbation that supported an integrated-channel model of prosodic control, the current work only found evidence for intensity-specific compensation. This discrepancy may suggest different F0 and intensity control mechanisms, threshold-dependent prosodic modulation, or a combined control scheme.
    Funding: R01 DC002852 - NIDCD NIH HHS; R03 DC011159 - NIDCD NIH HHS
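A minimal sketch of the comparison described above, using hypothetical peak-intensity values: the stressed-minus-unstressed contrast is computed per trial and then averaged within the baseline and perturbation phases.

```python
import numpy as np

def stress_contrast(stressed_peaks_db, unstressed_peaks_db):
    """Per-trial difference between stressed and unstressed peak intensity (dB)."""
    return np.asarray(stressed_peaks_db, float) - np.asarray(unstressed_peaks_db, float)

# Hypothetical peak intensities (dB) for a few baseline and perturbation-phase
# trials of one participant in the downward-perturbation group.
baseline = stress_contrast([72.0, 71.5, 73.1], [66.0, 66.4, 67.0])
perturbed = stress_contrast([74.2, 73.8, 75.0], [66.1, 66.5, 66.8])

# A positive shift means the speaker enlarged the stressed-unstressed contrast,
# i.e. compensated for the downward intensity perturbation on the stressed word.
print(f"Mean change in stress contrast: {perturbed.mean() - baseline.mean():+.2f} dB")
```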

    Speech-driven Animation with Meaningful Behaviors

    Full text link
    Conversational agents (CAs) play an important role in human computer interaction. Creating believable movements for CAs is challenging, since the movements have to be meaningful and natural, reflecting the coupling between gestures and speech. Studies in the past have mainly relied on rule-based or data-driven approaches. Rule-based methods focus on creating meaningful behaviors conveying the underlying message, but the gestures cannot be easily synchronized with speech. Data-driven approaches, especially speech-driven models, can capture the relationship between speech and gestures. However, they create behaviors disregarding the meaning of the message. This study proposes to bridge the gap between these two approaches, overcoming their limitations. The approach builds a dynamic Bayesian network (DBN), where a discrete variable is added to constrain the behaviors on the underlying constraint. The study implements and evaluates the approach with two constraints: discourse functions and prototypical behaviors. By constraining on the discourse functions (e.g., questions), the model learns the characteristic behaviors associated with a given discourse class, learning the rules from the data. By constraining on prototypical behaviors (e.g., head nods), the approach can be embedded in a rule-based system as a behavior realizer, creating trajectories that are timely synchronized with speech. The study proposes a DBN structure and a training approach that (1) models the cause-effect relationship between the constraint and the gestures, (2) initializes the state configuration models, increasing the range of the generated behaviors, and (3) captures the differences in the behaviors across constraints by enforcing sparse transitions between shared and exclusive states per constraint. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained model.
    Comment: 13 pages, 12 figures, 5 tables
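As a heavily simplified, discrete-only surrogate for the proposed DBN (a sketch with invented states and probabilities, not the paper's model), the fragment below selects a transition matrix over gesture states according to the discrete constraint variable, with one state exclusive to each constraint and zero-probability transitions into the other constraint's exclusive state to mimic the enforced sparsity:

```python
import numpy as np

rng = np.random.default_rng(0)

# States 0-1 are shared across constraints; state 2 is exclusive to "question"
# and state 3 to "statement" (all names and values are illustrative).
STATES = ["rest", "beat", "head_tilt_q", "head_nod_s"]

# One row-stochastic transition matrix per constraint value; zeros keep each
# constraint out of the other constraint's exclusive state.
TRANS = {
    "question": np.array([[0.6, 0.2, 0.2, 0.0],
                          [0.3, 0.4, 0.3, 0.0],
                          [0.5, 0.3, 0.2, 0.0],
                          [1.0, 0.0, 0.0, 0.0]]),
    "statement": np.array([[0.6, 0.2, 0.0, 0.2],
                           [0.3, 0.4, 0.0, 0.3],
                           [1.0, 0.0, 0.0, 0.0],
                           [0.5, 0.3, 0.0, 0.2]]),
}

def sample_gestures(constraint, length=10, start=0):
    """Sample a gesture-state sequence conditioned on the discrete constraint."""
    seq, state = [], start
    for _ in range(length):
        state = rng.choice(len(STATES), p=TRANS[constraint][state])
        seq.append(STATES[state])
    return seq

print(sample_gestures("question"))
```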

    운율 정보λ₯Ό μ΄μš©ν•œ λ§ˆλΉ„λ§μž₯μ•  μŒμ„± μžλ™ κ²€μΆœ 및 평가

    Get PDF
    ν•™μœ„λ…Όλ¬Έ (석사) -- μ„œμšΈλŒ€ν•™κ΅ λŒ€ν•™μ› : μΈλ¬ΈλŒ€ν•™ μ–Έμ–΄ν•™κ³Ό, 2020. 8. Minhwa Chung.말μž₯μ• λŠ” 신경계 λ˜λŠ” 퇴행성 μ§ˆν™˜μ—μ„œ κ°€μž₯ 빨리 λ‚˜νƒ€λ‚˜λŠ” 증 상 쀑 ν•˜λ‚˜μ΄λ‹€. λ§ˆλΉ„λ§μž₯μ• λŠ” νŒŒν‚¨μŠ¨λ³‘, λ‡Œμ„± λ§ˆλΉ„, κ·Όμœ„μΆ•μ„± μΈ‘μ‚­ 경화증, λ‹€λ°œμ„± 경화증 ν™˜μž λ“± λ‹€μ–‘ν•œ ν™˜μžκ΅°μ—μ„œ λ‚˜νƒ€λ‚œλ‹€. λ§ˆλΉ„λ§μž₯μ• λŠ” μ‘°μŒκΈ°κ΄€ μ‹ κ²½μ˜ μ†μƒμœΌλ‘œ λΆ€μ •ν™•ν•œ μ‘°μŒμ„ μ£Όμš” νŠΉμ§•μœΌλ‘œ 가지고, μš΄μœ¨μ—λ„ 영ν–₯을 λ―ΈμΉ˜λŠ” κ²ƒμœΌλ‘œ λ³΄κ³ λœλ‹€. μ„ ν–‰ μ—°κ΅¬μ—μ„œλŠ” 운율 기반 μΈ‘μ •μΉ˜λ₯Ό λΉ„μž₯μ•  λ°œν™”μ™€ λ§ˆλΉ„λ§μž₯μ•  λ°œν™”λ₯Ό κ΅¬λ³„ν•˜λŠ” 것에 μ‚¬μš©ν–ˆλ‹€. μž„μƒ ν˜„μž₯μ—μ„œλŠ” λ§ˆλΉ„λ§μž₯애에 λŒ€ν•œ 운율 기반 뢄석이 λ§ˆλΉ„λ§μž₯μ• λ₯Ό μ§„λ‹¨ν•˜κ±°λ‚˜ μž₯μ•  양상에 λ”°λ₯Έ μ•Œλ§žμ€ μΉ˜λ£Œλ²•μ„ μ€€λΉ„ν•˜λŠ” 것에 도움이 될 것이닀. λ”°λΌμ„œ λ§ˆλΉ„λ§μž₯μ• κ°€ μš΄μœ¨μ— 영ν–₯을 λ―ΈμΉ˜λŠ” μ–‘μƒλΏλ§Œ μ•„λ‹ˆλΌ λ§ˆλΉ„λ§μž₯μ• μ˜ 운율 νŠΉμ§•μ„ κΈ΄λ°€ν•˜κ²Œ μ‚΄νŽ΄λ³΄λŠ” 것이 ν•„μš”ν•˜λ‹€. ꡬ체 적으둜, 운율이 μ–΄λ–€ μΈ‘λ©΄μ—μ„œ λ§ˆλΉ„λ§μž₯애에 영ν–₯을 λ°›λŠ”μ§€, 그리고 운율 μ• κ°€ μž₯μ•  정도에 따라 μ–΄λ–»κ²Œ λ‹€λ₯΄κ²Œ λ‚˜νƒ€λ‚˜λŠ”μ§€μ— λŒ€ν•œ 뢄석이 ν•„μš”ν•˜λ‹€. λ³Έ 논문은 μŒλ†’μ΄, 음질, 말속도, 리듬 λ“± μš΄μœ¨μ„ λ‹€μ–‘ν•œ 츑면에 μ„œ μ‚΄νŽ΄λ³΄κ³ , λ§ˆλΉ„λ§μž₯μ•  κ²€μΆœ 및 평가에 μ‚¬μš©ν•˜μ˜€λ‹€. μΆ”μΆœλœ 운율 νŠΉμ§•λ“€μ€ λͺ‡ 가지 νŠΉμ§• 선택 μ•Œκ³ λ¦¬μ¦˜μ„ 톡해 μ΅œμ ν™”λ˜μ–΄ λ¨Έμ‹ λŸ¬λ‹ 기반 λΆ„λ₯˜κΈ°μ˜ μž…λ ₯κ°’μœΌλ‘œ μ‚¬μš©λ˜μ—ˆλ‹€. λΆ„λ₯˜κΈ°μ˜ μ„±λŠ₯은 정확도, 정밀도, μž¬ν˜„μœ¨, F1-점수둜 ν‰κ°€λ˜μ—ˆλ‹€. λ˜ν•œ, λ³Έ 논문은 μž₯μ•  쀑증도(경도, 쀑등도, 심도)에 따라 운율 정보 μ‚¬μš©μ˜ μœ μš©μ„±μ„ λΆ„μ„ν•˜μ˜€λ‹€. λ§ˆμ§€λ§‰μœΌλ‘œ, μž₯μ•  λ°œν™” μˆ˜μ§‘μ΄ μ–΄λ €μš΄ 만큼, λ³Έ μ—°κ΅¬λŠ” ꡐ차 μ–Έμ–΄ λΆ„λ₯˜κΈ°λ₯Ό μ‚¬μš©ν•˜μ˜€λ‹€. ν•œκ΅­μ–΄μ™€ μ˜μ–΄ μž₯μ•  λ°œν™”κ°€ ν›ˆλ ¨ μ…‹μœΌλ‘œ μ‚¬μš©λ˜μ—ˆμœΌλ©°, ν…ŒμŠ€νŠΈμ…‹μœΌλ‘œλŠ” 각 λͺ©ν‘œ μ–Έμ–΄λ§Œμ΄ μ‚¬μš©λ˜μ—ˆλ‹€. μ‹€ν—˜ κ²°κ³ΌλŠ” λ‹€μŒκ³Ό 같이 μ„Έ 가지λ₯Ό μ‹œμ‚¬ν•œλ‹€. 첫째, 운율 정보 λ₯Ό μ‚¬μš©ν•˜λŠ” 것은 λ§ˆλΉ„λ§μž₯μ•  κ²€μΆœ 및 평가에 도움이 λœλ‹€. MFCC λ§Œμ„ μ‚¬μš©ν–ˆμ„ λ•Œμ™€ λΉ„κ΅ν–ˆμ„ λ•Œ, 운율 정보λ₯Ό ν•¨κ»˜ μ‚¬μš©ν•˜λŠ” 것이 ν•œκ΅­μ–΄μ™€ μ˜μ–΄ 데이터셋 λͺ¨λ‘μ—μ„œ 도움이 λ˜μ—ˆλ‹€. λ‘˜μ§Έ, 운율 μ •λ³΄λŠ” 평가에 특히 μœ μš©ν•˜λ‹€. μ˜μ–΄μ˜ 경우 κ²€μΆœκ³Ό ν‰κ°€μ—μ„œ 각각 1.82%와 20.6%의 μƒλŒ€μ  정확도 ν–₯상을 λ³΄μ˜€λ‹€. ν•œκ΅­μ–΄μ˜ 경우 κ²€μΆœμ—μ„œλŠ” ν–₯상을 보이지 μ•Šμ•˜μ§€λ§Œ, ν‰κ°€μ—μ„œλŠ” 13.6%의 μƒλŒ€μ  ν–₯상이 λ‚˜νƒ€λ‚¬λ‹€. μ…‹μ§Έ, ꡐ차 μ–Έμ–΄ λΆ„λ₯˜κΈ°λŠ” 단일 μ–Έμ–΄ λΆ„λ₯˜κΈ°λ³΄λ‹€ ν–₯μƒλœ κ²°κ³Όλ₯Ό 보인닀. μ‹€ν—˜ κ²°κ³Ό ꡐ차언어 λΆ„λ₯˜κΈ°λŠ” 단일 μ–Έμ–΄ λΆ„λ₯˜κΈ°μ™€ λΉ„κ΅ν–ˆμ„ λ•Œ μƒλŒ€μ μœΌλ‘œ 4.12% 높은 정확도λ₯Ό λ³΄μ˜€λ‹€. 이것은 νŠΉμ • 운율 μž₯μ• λŠ” 범언어적 νŠΉμ§•μ„ 가지며, λ‹€λ₯Έ μ–Έμ–΄ 데이터λ₯Ό ν¬ν•¨μ‹œμΌœ 데이터가 λΆ€μ‘±ν•œ ν›ˆλ ¨ 셋을 보완할 수 있 μŒμ„ μ‹œμ‚¬ν•œλ‹€.One of the earliest cues for neurological or degenerative disorders are speech impairments. Individuals with Parkinsons Disease, Cerebral Palsy, Amyotrophic lateral Sclerosis, Multiple Sclerosis among others are often diagnosed with dysarthria. Dysarthria is a group of speech disorders mainly affecting the articulatory muscles which eventually leads to severe misarticulation. However, impairments in the suprasegmental domain are also present and previous studies have shown that the prosodic patterns of speakers with dysarthria differ from the prosody of healthy speakers. In a clinical setting, a prosodic-based analysis of dysarthric speech can be helpful for diagnosing the presence of dysarthria. Therefore, there is a need to not only determine how the prosody of speech is affected by dysarthria, but also what aspects of prosody are more affected and how prosodic impairments change by the severity of dysarthria. 
In the current study, several prosodic features related to pitch, voice quality, rhythm and speech rate are used as features for detecting dysarthria in a given speech signal. A variety of feature selection methods are utilized to determine which set of features are optimal for accurate detection. After selecting an optimal set of prosodic features we use them as input to machine learning-based classifiers and assess the performance using the evaluation metrics: accuracy, precision, recall and F1-score. Furthermore, we examine the usefulness of prosodic measures for assessing different levels of severity (e.g. mild, moderate, severe). Finally, as collecting impaired speech data can be difficult, we also implement cross-language classifiers where both Korean and English data are used for training but only one language used for testing. Results suggest that in comparison to solely using Mel-frequency cepstral coefficients, including prosodic measurements can improve the accuracy of classifiers for both Korean and English datasets. In particular, large improvements were seen when assessing different severity levels. For English a relative accuracy improvement of 1.82% for detection and 20.6% for assessment was seen. The Korean dataset saw no improvements for detection but a relative improvement of 13.6% for assessment. The results from cross-language experiments showed a relative improvement of up to 4.12% in comparison to only using a single language during training. It was found that certain prosodic impairments such as pitch and duration may be language independent. Therefore, when training sets of individual languages are limited, they may be supplemented by including data from other languages.1. Introduction 1 1.1. Dysarthria 1 1.2. Impaired Speech Detection 3 1.3. Research Goals & Outline 6 2. Background Research 8 2.1. Prosodic Impairments 8 2.1.1. English 8 2.1.2. Korean 10 2.2. Machine Learning Approaches 12 3. Database 18 3.1. English-TORGO 20 3.2. Korean-QoLT 21 4. Methods 23 4.1. Prosodic Features 23 4.1.1. Pitch 23 4.1.2. Voice Quality 26 4.1.3. Speech Rate 29 4.1.3. Rhythm 30 4.2. Feature Selection 34 4.3. Classification Models 38 4.3.1. Random Forest 38 4.3.1. Support Vector Machine 40 4.3.1 Feed-Forward Neural Network 42 4.4. Mel-Frequency Cepstral Coefficients 43 5. Experiment 46 5.1. Model Parameters 47 5.2. Training Procedure 48 5.2.1. Dysarthria Detection 48 5.2.2. Severity Assessment 50 5.2.3. Cross-Language 51 6. Results 52 6.1. TORGO 52 6.1.1. Dysarthria Detection 52 6.1.2. Severity Assessment 56 6.2. QoLT 57 6.2.1. Dysarthria Detection 57 6.2.2. Severity Assessment 58 6.1. Cross-Language 59 7. Discussion 62 7.1. Linguistic Implications 62 7.2. Clinical Applications 65 8. Conclusion 67 References 69 Appendix 76 Abstract in Korean 79Maste
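A minimal sketch of the detection pipeline described above, with a synthetic feature matrix standing in for the prosodic and MFCC measurements; univariate feature selection and a random forest are one plausible instantiation of the feature-selection and classifier choices, and all data and parameters here are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical feature matrix: each row is one utterance, columns are prosodic
# measures (F0 statistics, voice quality, speech rate, rhythm metrics) possibly
# concatenated with MFCC statistics.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = rng.integers(0, 2, size=200)  # 0 = control, 1 = dysarthric (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Univariate feature selection followed by a random forest, one of the three
# classifier types (random forest, SVM, feed-forward network) the study compares.
model = make_pipeline(
    SelectKBest(f_classif, k=10),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te), digits=3))
```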

    Predicting continuous conflict perception with Bayesian Gaussian processes

    Get PDF
    Conflict is one of the most important phenomena of social life, but it is still largely neglected by the computing community. This work proposes an approach that detects common conversational social signals (loudness, overlapping speech, etc.) and predicts the conflict level perceived by human observers in continuous, non-categorical terms. The proposed regression approach is fully Bayesian and adopts Automatic Relevance Determination to identify the social signals that most influence the outcome of the prediction. The experiments are performed over the SSPNet Conflict Corpus, a publicly available collection of 1430 clips extracted from televised political debates (roughly 12 hours of material for 138 subjects in total). The results show that it is possible to achieve a correlation close to 0.8 between actual and predicted conflict perception.
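For illustration, the sketch below fits a Gaussian process with an anisotropic RBF kernel, which yields ARD-style per-feature length scales (large learned length scales mark less relevant inputs); note that scikit-learn optimizes the kernel's marginal likelihood rather than performing the fully Bayesian inference used in the paper, and the features and ratings here are synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical per-clip social-signal features (e.g., loudness, overlap ratio,
# speaking-rate statistics) and a toy continuous conflict rating.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# One RBF length scale per feature gives ARD-like behaviour: features that do
# not influence the target end up with large learned length scales.
kernel = RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

print("Learned length scales:", np.round(gp.kernel_.k1.length_scale, 2))
print("Correlation with targets:", np.round(np.corrcoef(gp.predict(X), y)[0, 1], 3))
```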
    • …
    corecore