113 research outputs found
Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition
Pretrained contextual word representations in NLP have greatly improved
performance on various downstream tasks. For speech, we propose contextual
frame representations that capture phonetic information at the acoustic frame
level and can be used for utterance-level language, speaker, and speech
recognition. These representations come from the frame-wise intermediate
representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken
utterances. We first train the model on the Fisher English corpus with
context-independent phoneme labels, then use its representations at inference
time as features for task-specific models on the NIST LRE07 closed-set language
recognition task and a Fisher speaker recognition task, giving significant
improvements over the state-of-the-art on both (e.g., language EER of 4.68% on
3sec utterances, 23% relative reduction in speaker EER). Results remain
competitive when using a novel dilated convolutional model for language
recognition, or when ASR pretraining is done with character labels only.
Comment: submitted to INTERSPEECH 201
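A rough sketch of this feature-extraction recipe is given below; `encoder` stands in for the pretrained SAN-CTC model, and mean pooling plus a linear head are illustrative assumptions, not the paper's exact task-specific models:

    # Sketch: frozen frame-level ASR representations as utterance features.
    # `encoder` is a hypothetical stand-in for the pretrained SAN-CTC model.
    import torch
    import torch.nn as nn

    class UtteranceClassifier(nn.Module):
        def __init__(self, encoder: nn.Module, feat_dim: int, n_classes: int):
            super().__init__()
            self.encoder = encoder  # pretrained acoustic model, kept frozen
            for p in self.encoder.parameters():
                p.requires_grad = False
            self.head = nn.Linear(feat_dim, n_classes)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, time, n_feats) acoustic frames
            with torch.no_grad():
                h = self.encoder(frames)  # (batch, time, feat_dim) frame-wise reps
            return self.head(h.mean(dim=1))  # pool over time, then classify

Under this setup the same frozen encoder can back both the language and the speaker recognition heads.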
Neural approaches to spoken content embedding
Comparing spoken segments is a central operation in speech processing.
Traditional approaches in this area have favored frame-level dynamic
programming algorithms, such as dynamic time warping, because they require no
supervision, but they are limited in performance and efficiency. As an
alternative, acoustic word embeddings -- fixed-dimensional vector
representations of variable-length spoken word segments -- have begun to be
considered for such tasks as well. However, the current space of such
discriminative embedding models, training approaches, and their application to
real-world downstream tasks is limited. We start by considering "single-view"
training losses where the goal is to learn an acoustic word embedding model
that separates same-word and different-word spoken segment pairs. Then, we
consider "multi-view" contrastive losses. In this setting, acoustic word
embeddings are learned jointly with embeddings of character sequences to
generate acoustically grounded embeddings of written words, or acoustically
grounded word embeddings.
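A minimal sketch of one such multi-view contrastive objective follows, assuming paired acoustic and character-sequence embeddings; the hardest in-batch negative mining shown is an illustrative choice, not the thesis's exact loss:

    # Multi-view contrastive loss sketch: pull the acoustic and written views
    # of the same word together, push apart the hardest cross-view negative.
    import torch
    import torch.nn.functional as F

    def multi_view_triplet_loss(acoustic_emb: torch.Tensor,
                                char_emb: torch.Tensor,
                                margin: float = 0.4) -> torch.Tensor:
        """Row i of each (B, D) tensor embeds the same word in both views."""
        a = F.normalize(acoustic_emb, dim=1)   # acoustic view
        c = F.normalize(char_emb, dim=1)       # written (character) view
        sim = a @ c.t()                        # (B, B) cosine similarities
        pos = sim.diag()                       # matched cross-view pairs
        eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        hardest_neg = sim.masked_fill(eye, float('-inf')).max(dim=1).values
        return F.relu(margin + hardest_neg - pos).mean()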
In this thesis, we contribute new discriminative acoustic word embedding
(AWE) and acoustically grounded word embedding (AGWE) approaches based on
recurrent neural networks (RNNs). We improve model training in terms of both
efficiency and performance. We take these developments beyond English to
several low-resource languages and show that multilingual training improves
performance when labeled data is limited. We apply our embedding models, both
monolingual and multilingual, to the downstream tasks of query-by-example
speech search and automatic speech recognition. Finally, we show how our
embedding approaches compare with and complement more recent self-supervised
speech models.
Comment: PhD thesis
Ambiguity Resolution in Spoken Language Understanding
Ph.D. dissertation -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2022. Advisor: Nam Soo Kim.
Ambiguity in language is inevitable. This is because, although language is a means of communication, a particular concept cannot be conveyed to everyone in a perfectly identical manner. Since this factor is unavoidable, ambiguity in language understanding often leads to breakdowns or failures of communication.
There are various hierarchies of language ambiguity. However, not all ambiguity needs to be resolved. Different aspects of ambiguity exist for each domain and task, and it is crucial to define the boundary after recognizing the ambiguity that can be well-defined and resolved.
In this dissertation, we investigate the types of ambiguity that appear in spoken language processing, especially in intention understanding, and conduct research to define and resolve it. Although this phenomenon occurs in various languages, its degree and aspect depend on the language investigated. The factor we focus on is cases where the ambiguity comes from the gap between the amount of information in the spoken language and the text.
Here, we study the Korean language, which often shows different sentence structures and intentions depending on the prosody. In the Korean language, a text is often read with multiple intentions due to multi-functional sentence enders, frequent pro-drop, wh-intervention, etc. We first define this type of ambiguity and construct a corpus that helps detect ambiguous sentences, given that such utterances can be problematic for intention understanding.
In constructing a corpus for intention understanding, we consider the directivity and rhetoricalness of a sentence. These make up a criterion for classifying the intention of spoken language into statement, question, command, rhetorical question, and rhetorical command. Using a spoken-language corpus annotated with sufficiently high inter-annotator agreement (kappa = 0.85), we show that language models trained on colloquial corpora are effective in classifying ambiguous text given only textual data, and we qualitatively analyze the characteristics of the task.
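As an illustrative text-only setup (the checkpoint name klue/bert-base is an assumed stand-in for the colloquial corpus-based language models mentioned above, not the thesis configuration), a pretrained encoder can be fine-tuned over the five intention labels:

    # Illustrative five-way intention classifier over text only.
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer)

    LABELS = ["statement", "question", "command",
              "rhetorical question", "rhetorical command"]

    tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "klue/bert-base", num_labels=len(LABELS))

    batch = tokenizer("집에 가", return_tensors="pt")  # "go home": ambiguous without prosody
    pred = LABELS[model(**batch).logits.argmax(-1).item()]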
We do not handle ambiguity only at the text level. To find out whether actual disambiguation is possible given speech input, we design an artificial spoken-language corpus composed only of ambiguous sentences and resolve the ambiguity with various attention-based neural network architectures. In this process, we observe that disambiguation is most effective when the textual and acoustic inputs co-attend to each other's features, especially when the audio module conveys its attention information to the text module in a multi-hop manner.
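A hedged sketch of this kind of co-attentive audio-text fusion follows; hop count, dimensions, and the direction of attention are chosen for illustration rather than taken from the thesis:

    # Sketch of multi-hop co-attentive fusion for intention classification.
    import torch
    import torch.nn as nn

    class MultiHopCoAttention(nn.Module):
        def __init__(self, dim: int = 256, n_heads: int = 4, n_hops: int = 2):
            super().__init__()
            self.hops = nn.ModuleList(
                nn.MultiheadAttention(dim, n_heads, batch_first=True)
                for _ in range(n_hops))
            self.classifier = nn.Linear(dim, 5)  # five intention classes

        def forward(self, audio: torch.Tensor, text: torch.Tensor):
            # audio: (B, T_a, dim) acoustic states; text: (B, T_t, dim) token states
            query = audio
            for hop in self.hops:
                # each hop: the audio-derived query re-reads the text states,
                # carrying its attention output forward to the next hop
                query, _ = hop(query, text, text)
            return self.classifier(query.mean(dim=1))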
Finally, assuming that the ambiguity in intention understanding has been resolved by the proposed strategies, we present a brief roadmap of how the results can be utilized at the industry or research level. By integrating a text-based ambiguity detection module with a speech-based intention understanding module, we can build a system that handles ambiguity efficiently while reducing error propagation. Such a system can be integrated with dialogue managers to build a task-oriented dialogue system capable of chit-chat, or it can be used to reduce errors in multilingual settings such as speech translation, beyond merely monolingual conditions.
Throughout the dissertation, we want to show that ambiguity resolution for intention understanding in a prosody-sensitive language can be achieved and can be utilized at the industry or research level. We hope that this study helps tackle chronic ambiguity issues in other languages or other domains, linking linguistic science and engineering approaches.
1 Introduction
1.1 Motivation
1.2 Research Goal
1.3 Outline of the Dissertation
2 Related Work
2.1 Spoken Language Understanding
2.2 Speech Act and Intention
2.2.1 Performatives and statements
2.2.2 Illocutionary act and speech act
2.2.3 Formal semantic approaches
2.3 Ambiguity of Intention Understanding in Korean
2.3.1 Ambiguities in language
2.3.2 Speech act and intention understanding in Korean
3 Ambiguity in Intention Understanding of Spoken Language
3.1 Intention Understanding and Ambiguity
3.2 Annotation Protocol
3.2.1 Fragments
3.2.2 Clear-cut cases
3.2.3 Intonation-dependent utterances
3.3 Data Construction
3.3.1 Source scripts
3.3.2 Agreement
3.3.3 Augmentation
3.3.4 Train split
3.4 Experiments and Results
3.4.1 Models
3.4.2 Implementation
3.4.3 Results
3.5 Findings and Summary
3.5.1 Findings
3.5.2 Summary
4 Disambiguation of Speech Intention
4.1 Ambiguity Resolution
4.1.1 Prosody and syntax
4.1.2 Disambiguation with prosody
4.1.3 Approaches in SLU
4.2 Dataset Construction
4.2.1 Script generation
4.2.2 Label tagging
4.2.3 Recording
4.3 Experiments and Results
4.3.1 Models
4.3.2 Results
4.4 Summary
5 System Integration and Application
5.1 System Integration for Intention Identification
5.1.1 Proof of concept
5.1.2 Preliminary study
5.2 Application to Spoken Dialogue System
5.2.1 What is 'Free-running'
5.2.2 Omakase chatbot
5.3 Beyond Monolingual Approaches
5.3.1 Spoken language translation
5.3.2 Dataset
5.3.3 Analysis
5.3.4 Discussion
5.4 Summary
6 Conclusion and Future Work
Bibliography
Abstract (In Korean)
Acknowledgment
Transferring speech-generic and depression-specific knowledge for Alzheimer's disease detection
The detection of Alzheimer's disease (AD) from spontaneous speech has
attracted increasing attention while the sparsity of training data remains an
important issue. This paper handles the issue by knowledge transfer,
specifically from both speech-generic and depression-specific knowledge. The
paper first studies sequential knowledge transfer from generic foundation
models pretrained on large amounts of speech and text data. A block-wise
analysis is performed for AD diagnosis based on the representations extracted
from different intermediate blocks of different foundation models. Apart from
the knowledge from speech-generic representations, this paper also proposes to
simultaneously transfer the knowledge from a speech depression detection task
based on the high comorbidity rates of depression and AD. A parallel knowledge
transfer framework is studied that jointly learns the information shared
between these two tasks. Experimental results show that the proposed method
improves AD and depression detection, and produces a state-of-the-art F1 score
of 0.928 for AD diagnosis on the commonly used ADReSSo dataset.
Comment: 8 pages, 4 figures. Accepted by ASRU 202
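A minimal sketch of the block-wise analysis, assuming a generic HuggingFace wav2vec 2.0 checkpoint in place of the paper's specific foundation models: each block's hidden states are pooled into utterance features on which a simple diagnostic probe could be fit.

    # Probe each intermediate block of a speech foundation model.
    # Assumes the transformers library and a wav2vec 2.0 checkpoint.
    import torch
    from transformers import Wav2Vec2Model

    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    model.eval()

    waveform = torch.randn(1, 16000)  # dummy 1 s of 16 kHz audio
    with torch.no_grad():
        out = model(waveform, output_hidden_states=True)

    # out.hidden_states holds one (batch, frames, dim) tensor per block;
    # mean-pool each into an utterance feature, then fit a separate simple
    # classifier (e.g., logistic regression) per block and compare F1.
    for block_idx, h in enumerate(out.hidden_states):
        utt_feature = h.mean(dim=1)
        print(block_idx, tuple(utt_feature.shape))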
- …