105 research outputs found

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS with a dysarthria severity level coefficient and a pause insertion model, so that dysarthric speech can be synthesized at varying severity levels. In addition, we extend this work with a label propagation technique that creates more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity labels. This approach increases the controllability of the system, allowing us to generate dysarthric speech covering a broader range of severities. To evaluate the effectiveness of the synthesized training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a 12.2% WER improvement over the baseline, and that adding the severity level and pause insertion controls decreases WER by 6.5%, demonstrating the effectiveness of these parameters. Overall, results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training data has a significant impact on dysarthric ASR systems.
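
    The abstract gives no implementation details, but the severity-conditioning idea can be sketched as follows: a minimal, hypothetical example (not the dissertation's code) in which a scalar severity coefficient and a continuous RLT-style control vector are concatenated with a speaker embedding to form the conditioning input of a multi-speaker TTS decoder. All module names, dimensions, and value ranges here are assumptions.

```python
# Hypothetical sketch: conditioning a multi-speaker TTS decoder on a
# dysarthria severity coefficient and an RLT-style control vector.
# Names and dimensions are illustrative, not the dissertation's code.
import torch
import torch.nn as nn

class ConditionedTTSDecoderInput(nn.Module):
    def __init__(self, n_speakers=20, spk_dim=64, ctrl_dim=3, hidden=256):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)
        # Projects [speaker embedding | severity | RLT controls] to the
        # conditioning vector consumed by the acoustic decoder.
        self.proj = nn.Linear(spk_dim + 1 + ctrl_dim, hidden)

    def forward(self, speaker_id, severity, rlt):
        # speaker_id: (B,) long; severity: (B, 1) in [0, 1]; rlt: (B, 3)
        spk = self.speaker_emb(speaker_id)
        cond = torch.cat([spk, severity, rlt], dim=-1)
        return torch.tanh(self.proj(cond))

cond_module = ConditionedTTSDecoderInput()
cond = cond_module(torch.tensor([3]), torch.tensor([[0.7]]),
                   torch.tensor([[0.6, 0.4, 0.8]]))
print(cond.shape)  # torch.Size([1, 256])
```

    The pause insertion model described in the abstract would act on the input text/phoneme sequence before synthesis; it is omitted from this sketch for brevity.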

    Deep Learning-Based Speech Emotion Recognition Using Librosa

    Speech Emotion Recognition is a challenge in computational paralinguistics and speech processing that tries to identify and classify the emotions expressed in spoken language. The objective is to infer a speaker's emotional state, such as happiness, rage, sadness, or frustration, from speech patterns such as prosody, pitch, and rhythm. In the modern world, emotion detection is one of the most important marketing tactics, since offerings can be tailored to best fit a person's interests. Because of this, we decided to work on a project that identifies a person's emotions based solely on their speech, enabling a variety of AI-related applications. Examples include call centers playing calming music during tense exchanges, or a smart automobile that slows down when the driver is scared or furious. In Python, we processed and extracted features from the audio files using the Librosa module. Librosa is a Python library for audio and music analysis that offers the fundamental components required to develop music information retrieval systems. Because of this, there is a lot of potential in the market for this kind of application, which would help businesses and ensure customer safety.
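
    As a minimal sketch of the Librosa-based feature extraction step described above (the file name, sampling rate, and choice of MFCC features are illustrative assumptions, not the authors' exact pipeline):

```python
# Minimal sketch: extracting MFCC-based features with Librosa for a
# speech emotion classifier. Path and parameters are illustrative.
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Average over time to obtain a fixed-length vector per utterance.
    return np.mean(mfcc, axis=1)

features = extract_features("angry_sample.wav")  # hypothetical file
print(features.shape)  # (40,)
```

    The resulting fixed-length vectors could then be fed to any standard classifier (e.g., an MLP) trained on emotion labels.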

    Improving Automatic Speech Recognition on Endangered Languages

    As the world moves towards a more globalized scenario, it has brought along with it the extinction of several languages. It has been estimated that over the next century, over half of the world's languages will be extinct, and an alarming 43% of the world's languages are already at some level of endangerment or extinction. The survival of many of these languages depends on the pressure imposed on their dwindling speaker populations. There is often a strong correlation between endangered languages and the number and quality of recordings and documentation of each. But why do we care about preserving these less prevalent languages? The behavior of cultures is often expressed in the form of speech via one's native language. The memories, ideas, major events, practices, cultures, and lessons learnt, both of individuals and of the community, are all communicated to the outside world via language. So, language preservation is crucial to understanding the behavior of these communities. Deep learning models have been shown to dramatically improve speech recognition accuracy but require large amounts of labelled data. Unfortunately, resource-constrained languages typically fall short of the data necessary for successful training. To help alleviate the problem, data augmentation techniques fabricate many new samples from each existing sample. The aim of this master's thesis is to examine the effect of different augmentation techniques on speech recognition for resource-constrained languages. The augmentation methods experimented with are noise augmentation, pitch augmentation, speed augmentation, and voice transformation augmentation using Generative Adversarial Networks (GANs). This thesis also examines the effectiveness of GANs in voice transformation and its limitations. The information gained from this study will further guide data collection, specifically in understanding the conditions under which data should be collected so that GANs can effectively perform voice transformation. Training on the original data with the Deep Speech model resulted in a 95.03% WER. Training the Seneca data on a Deep Speech model pretrained on an English dataset reduced the WER to 70.43%. Adding 15 augmented samples per sample reduced the WER to 68.33%, and adding 25 augmented samples per sample reduced it to 48.23%. Experiments to find the best augmentation method among noise addition, pitch variation, speed variation, and GAN augmentation revealed that GAN augmentation performed best, with a WER reduction to 60.03%.
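
    The waveform-level augmentations mentioned above (noise, pitch, and speed) could look roughly like the following sketch; the parameter values and file name are assumptions rather than the thesis's actual settings, and the GAN-based voice transformation is omitted.

```python
# Illustrative sketch of three waveform-level augmentations: additive
# noise, pitch shifting, and speed (time-stretch) perturbation.
import numpy as np
import librosa

def add_noise(y, noise_scale=0.005):
    return y + noise_scale * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

def change_speed(y, rate=1.1):
    return librosa.effects.time_stretch(y=y, rate=rate)

y, sr = librosa.load("seneca_utterance.wav", sr=16000)  # hypothetical file
augmented = [add_noise(y), shift_pitch(y, sr), change_speed(y)]
```

    Each function returns a new waveform, so many augmented samples per original sample can be generated by varying the parameters.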

    μŒμ„±μ–Έμ–΄ μ΄ν•΄μ—μ„œμ˜ μ€‘μ˜μ„± ν•΄μ†Œ

    Thesis (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2022. Advisor: Kim Nam-soo.
    Ambiguity in language is inevitable: although language is a means of communication, a particular concept that everyone thinks of cannot be conveyed in a perfectly identical manner. While this is an unavoidable factor, ambiguity in language understanding often leads to the breakdown or failure of communication. There are various hierarchies of language ambiguity. However, not all ambiguity needs to be resolved. Different aspects of ambiguity exist for each domain and task, and it is crucial to recognize the ambiguity that can be well defined and resolved and to draw the boundary around it. In this dissertation, we investigate the types of ambiguity that appear in spoken language processing, especially in intention understanding, and conduct research to define and resolve them. Although this phenomenon occurs in various languages, its degree and aspect depend on the language investigated. We focus on cases where the ambiguity comes from the gap between the amount of information carried by spoken language and by text. Specifically, we study Korean, which often expresses different sentence structures and intentions depending on prosody. In Korean, a text is often read with multiple intentions due to multi-functional sentence enders, frequent pro-drop, wh-intervention, and similar phenomena. We first define this type of ambiguity and construct a corpus that helps detect ambiguous sentences, given that such utterances can be problematic for intention understanding. In constructing the corpus, we consider the directivity and rhetoricalness of a sentence; these make up a criterion for classifying the intention of spoken language into statement, question, command, rhetorical question, and rhetorical command. Using the corpus, annotated on recorded spoken language with sufficiently high inter-annotator agreement (kappa = 0.85), we show that colloquial corpus-based language models are effective in classifying ambiguous text given only textual data, and we qualitatively analyze the characteristics of the task. We do not handle ambiguity only at the text level. To find out whether actual disambiguation is possible given a speech input, we design an artificial spoken language corpus composed only of ambiguous sentences and resolve the ambiguity with various attention-based neural network architectures. In this process, we observe that ambiguity resolution is most effective when the textual and acoustic inputs co-attend to each other's features, especially when the audio processing module conveys attention information to the text module in a multi-hop manner. Finally, assuming the ambiguity of intention understanding is resolved by the proposed strategies, we present a brief roadmap of how the results can be utilized at the industry or research level.
    By integrating a text-based ambiguity detection module and a speech-based intention understanding module, we can build a system that handles ambiguity efficiently while reducing error propagation. Such a system can be integrated with a dialogue manager to build a task-oriented dialogue system capable of chit-chat, or it can be used to reduce errors in multilingual settings such as speech translation, beyond merely monolingual conditions. Throughout the dissertation, we aim to show that ambiguity resolution for intention understanding in a prosody-sensitive language is achievable and can be utilized at the industry or research level. We hope that this study helps tackle chronic ambiguity issues in other languages and domains, linking linguistic science and engineering approaches, and we share the resources, results, and code used in this research to contribute to the progress of the field.
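
    As a hedged sketch of the attention-based text-audio fusion described above (not the dissertation's architecture; the class name, dimensions, and single-hop design are assumptions), one cross-attention step in which an audio summary attends over text token states before five-way intention classification might look like this:

```python
# Hypothetical sketch: one cross-attention hop in which an audio summary
# attends over text token states before intention classification.
import torch
import torch.nn as nn

class CrossModalIntentClassifier(nn.Module):
    def __init__(self, text_dim=256, audio_dim=256, n_classes=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=text_dim, num_heads=4,
                                          batch_first=True)
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        # Five intention classes: statement, question, command,
        # rhetorical question, rhetorical command.
        self.classifier = nn.Linear(text_dim, n_classes)

    def forward(self, text_states, audio_summary):
        # text_states: (B, T, text_dim); audio_summary: (B, 1, audio_dim)
        query = self.audio_proj(audio_summary)
        fused, _ = self.attn(query, text_states, text_states)
        return self.classifier(fused.squeeze(1))

model = CrossModalIntentClassifier()
logits = model(torch.randn(2, 12, 256), torch.randn(2, 1, 256))
print(logits.shape)  # torch.Size([2, 5])
```

    A multi-hop variant, as suggested by the abstract's findings, would repeat this cross-attention step so that audio-derived attention information reaches the text module more than once.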

    Voice Conversion


    IberSPEECH 2020: XI Jornadas en TecnologΓ­a del Habla and VII Iberian SLTech

    IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies for Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentations of projects, laboratory activities, recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers of the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine, and Slovenia). Furthermore, extended versions of selected papers will be published as a special issue of the journal Applied Sciences, "IberSPEECH 2020: Speech and Language Technologies for Iberian Languages", published by MDPI with full open access. In addition to the regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session. Red EspaΓ±ola de TecnologΓ­as del Habla. Universidad de Valladolid.

    Producing Acoustic-Prosodic Entrainment in a Robotic Learning Companion to Build Learner Rapport

    With advances in automatic speech recognition, spoken dialogue systems are assuming increasingly social roles. There is a growing need for these systems to be socially responsive, capable of building rapport with users. In human-human interactions, rapport is critical to patient-doctor communication, conflict resolution, educational interactions, and social engagement. Rapport between people promotes successful collaboration, motivation, and task success. Dialogue systems that can build rapport with their users may produce similar effects, personalizing interactions to create better outcomes. This dissertation focuses on how dialogue systems can build rapport utilizing acoustic-prosodic entrainment. Acoustic-prosodic entrainment occurs when individuals adapt acoustic-prosodic features of their speech, such as tone of voice or loudness, to one another over the course of a conversation. Because entrainment is correlated with liking and task success, a dialogue system that entrains may enhance rapport. Entrainment, however, is very challenging to model. People entrain on different features in many ways, and how to design entrainment to build rapport is unclear. The first goal of this dissertation is to explore how acoustic-prosodic entrainment can be modeled to build rapport. Towards this goal, this work presents a series of studies comparing, evaluating, and iterating on the design of entrainment, motivated and informed by human-human dialogue. These models of entrainment are implemented in the dialogue system of a robotic learning companion. Learning companions are educational agents that engage students socially to increase motivation and facilitate learning. As a learning companion's ability to be socially responsive increases, so do vital learning outcomes. A second goal of this dissertation is to explore the effects of entrainment on concrete outcomes such as learning in interactions with robotic learning companions. This dissertation results in contributions both technical and theoretical. Technical contributions include a robust and modular dialogue system capable of producing prosodic entrainment and other socially responsive behavior. As one of the first systems of its kind, the results demonstrate that an entraining, social learning companion can build rapport and increase learning. This dissertation provides support for exploring phenomena like entrainment to enhance factors such as rapport and learning, and provides a platform with which to explore these phenomena in future work.
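
    As a toy illustration of what producing entrainment can mean in practice (a simplified proximity rule, not the dissertation's models), the agent below moves its synthesis pitch and loudness a fixed fraction of the way toward the user's measured values each turn; all names and values are hypothetical.

```python
# Toy sketch of proximity-style entrainment: move the agent's prosody
# settings a fixed fraction of the way toward the user's last turn.
from dataclasses import dataclass

@dataclass
class Prosody:
    pitch_hz: float
    intensity_db: float

def entrain(agent: Prosody, user: Prosody, rate: float = 0.3) -> Prosody:
    """Return updated agent prosody after one turn of entrainment."""
    return Prosody(
        pitch_hz=agent.pitch_hz + rate * (user.pitch_hz - agent.pitch_hz),
        intensity_db=agent.intensity_db + rate * (user.intensity_db - agent.intensity_db),
    )

agent = Prosody(pitch_hz=180.0, intensity_db=60.0)
user = Prosody(pitch_hz=210.0, intensity_db=66.0)
# The agent's pitch moves to 189.0 Hz and its intensity toward 61.8 dB.
print(entrain(agent, user))
```

    Real designs, as the abstract notes, must choose which features to entrain on and how strongly; this rule only illustrates the basic direction of adaptation.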

    Dysarthric Speech Recognition and Offline Handwriting Recognition using Deep Neural Networks

    Millions of people around the world are diagnosed with neurological disorders like Parkinson's, Cerebral Palsy, or Amyotrophic Lateral Sclerosis. As the disease progresses, the resulting neurological damage causes the person to lose control of their muscles, accompanied by speech deterioration. Speech deterioration is due to a neuromotor condition that limits manipulation of the articulators of the vocal tract, a condition collectively known as dysarthria. Even though dysarthric speech is grammatically and syntactically correct, it is difficult for humans to understand and for Automatic Speech Recognition (ASR) systems to decipher. With the emergence of deep learning, speech recognition systems have improved greatly compared to traditional systems, which rely on sophisticated preprocessing techniques to extract speech features. In this digital era, there are still many handwritten documents, many of which need to be digitized. Offline handwriting recognition involves recognizing handwritten characters from images of handwritten text (i.e., scanned documents). This is an interesting task as it combines sequence learning with computer vision, and it is more difficult than Optical Character Recognition (OCR) because handwritten letters can be written in virtually infinite different styles. This thesis proposes exploiting deep learning techniques, namely Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), for offline handwriting recognition. For speech recognition, we compare traditional methods with recent deep learning methods. We also apply speaker adaptation methods, both at the feature level and at the parameter level, to improve recognition of dysarthric speech.
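
    As an illustrative sketch of the CNN-plus-RNN approach to offline handwriting recognition mentioned above (layer sizes, image height, and character inventory are assumptions, not the thesis configuration), a small convolutional-recurrent model with a CTC-style output layer could look like this:

```python
# Illustrative CNN + bidirectional RNN sketch for offline handwriting
# recognition (text-line images -> per-timestep character logits).
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_chars=80, height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 64 * (height // 4)
        self.rnn = nn.LSTM(feat_dim, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, n_chars + 1)  # +1 for the CTC blank label

    def forward(self, images):
        # images: (B, 1, H, W) grayscale text-line images
        f = self.cnn(images)                  # (B, C, H/4, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (B, W/4, C*H/4)
        out, _ = self.rnn(f)
        return self.fc(out)                   # (B, W/4, n_chars + 1)

model = CRNN()
logits = model(torch.randn(2, 1, 32, 128))
print(logits.shape)  # torch.Size([2, 32, 81])
```

    Training such a model would pair these per-timestep logits with a sequence loss such as nn.CTCLoss, so that no character-level alignment of the images is required.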

    Dealing with linguistic mismatches for automatic speech recognition

    Recent breakthroughs in automatic speech recognition (ASR) have resulted in a word error rate (WER) on par with human transcribers on the English Switchboard benchmark. However, dealing with linguistic mismatches between the training and testing data remains a significant, unsolved challenge. In the monolingual setting, it is well known that the performance of ASR systems degrades significantly when presented with speech from speakers with accents, dialects, and speaking styles different from those encountered during system training. In the multilingual setting, ASR systems trained on a source language perform even worse when tested on another target language because of mismatches in the number of phonemes, lexical ambiguity, and the power of the phonotactic constraints provided by phone-level n-grams. In order to address these linguistic mismatches for current ASR systems, my dissertation investigates both knowledge-gnostic and knowledge-agnostic solutions. In the first part, classic theories from acoustics and articulatory phonetics that can be transferred across a dialect continuum, from local dialects to a standardized language, are revisited. Experiments demonstrate the potential of acoustic correlates in the vicinity of landmarks to help build a bridge across different local or global varieties in a dialect continuum. In the second part, we design an end-to-end acoustic modeling approach based on the connectionist temporal classification (CTC) loss and propose to link the training of acoustics and accent together, in a manner similar to the learning process in human speech perception. This joint model not only performed well on ASR with multiple accents but also boosted the accuracy of the accent identification task in comparison to separately trained models.
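
    A hedged sketch of the joint acoustics-and-accent idea described in the second part (all shapes, sizes, and names are assumptions, not the dissertation's model): a shared encoder feeds both a CTC head over output tokens and an utterance-level accent classification head.

```python
# Illustrative multi-task sketch: a shared speech encoder with a CTC
# head for transcription and a classification head for accent ID.
import torch
import torch.nn as nn

class JointCTCAccentModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_tokens=40, n_accents=8):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.ctc_head = nn.Linear(2 * hidden, n_tokens + 1)  # +1 = CTC blank
        self.accent_head = nn.Linear(2 * hidden, n_accents)

    def forward(self, feats):
        # feats: (B, T, feat_dim), e.g. log-mel filterbank frames
        enc, _ = self.encoder(feats)
        ctc_logits = self.ctc_head(enc)                     # per-frame tokens
        accent_logits = self.accent_head(enc.mean(dim=1))   # utterance-level
        return ctc_logits, accent_logits

model = JointCTCAccentModel()
ctc_logits, accent_logits = model(torch.randn(4, 200, 80))
# Training would combine nn.CTCLoss on the (log-softmaxed, time-major)
# ctc_logits with a cross-entropy loss on accent_logits.
```

    Sharing the encoder is what lets accent supervision shape the acoustic representation, which is the kind of joint training the abstract attributes to its gains on multi-accent ASR and accent identification.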