Disambiguation of Korean Utterances Using Automatic Intonation Recognition
The paper describes research on the use of intonation for disambiguating the utterance types of spoken Korean sentences. Based on the tilt intonation theory (Taylor and Black 1994), two related but separate experiments were performed at the speaker-independent level, both using the Hidden Markov Model training technique. In the first experiment, a system is established to detect the rough boundary positions of major intonation events. The significant parameters are then extracted from the output of the first experiment and used directly to train the final models for utterance-type disambiguation. Results show that the intonation contour can serve as a significant meaning distinguisher in an automatic speech recognition system for Korean, as well as in natural human communication.
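The tilt representation referenced above parameterizes each intonation event by the relative amplitude and duration of its rise and fall parts. A minimal sketch of the standard tilt formulas, assuming the rise/fall excursions and durations have already been extracted from the F0 contour (the function name and interface are illustrative, not from the paper):

```python
def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
    """Compute tilt parameters for one intonation event.

    a_rise, a_fall: F0 excursions (Hz) of the rise and fall parts.
    d_rise, d_fall: durations (s) of the rise and fall parts.
    Returns (tilt_amp, tilt_dur, tilt), each in [-1, 1]:
    +1 = pure rise, -1 = pure fall, 0 = symmetric rise-fall.
    """
    a_rise, a_fall = abs(a_rise), abs(a_fall)
    tilt_amp = (a_rise - a_fall) / (a_rise + a_fall)
    tilt_dur = (d_rise - d_fall) / (d_rise + d_fall)
    tilt = 0.5 * (tilt_amp + tilt_dur)
    return tilt_amp, tilt_dur, tilt

# A symmetric rise-fall accent is maximally "tilted" toward neither side;
# a pure rise has tilt +1.
print(tilt_parameters(20.0, 20.0, 0.1, 0.1))  # -> (0.0, 0.0, 0.0)
print(tilt_parameters(30.0, 0.0, 0.2, 0.0))   # -> (1.0, 1.0, 1.0)
```

Such continuous event parameters, rather than discrete tone labels, are what the second-stage HMMs would consume.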
Ambiguity Resolution in Spoken Language Understanding
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Information Engineering, 2022. 8. Nam Soo Kim.
Ambiguity in language is inevitable: although language is a means of communication, a concept cannot be conveyed to everyone in a perfectly identical manner. Because this factor is unavoidable, ambiguity in language understanding often leads to the breakdown or failure of communication.
There are various hierarchies of language ambiguity. However, not all ambiguity needs to be resolved. Different aspects of ambiguity exist for each domain and task, and it is crucial to define the boundary after recognizing the ambiguity that can be well-defined and resolved.
In this dissertation, we investigate the types of ambiguity that appear in spoken language processing, especially in intention understanding, and conduct research to define and resolve them. Although this phenomenon occurs in various languages, its degree and aspect depend on the language investigated. We focus on cases where the ambiguity comes from the gap between the amount of information carried by the spoken language and by the text.
Here, we study the Korean language, which often shows different sentence structures and intentions depending on the prosody. In the Korean language, a text is often read with multiple intentions due to multi-functional sentence enders, frequent pro-drop, wh-intervention, etc. We first define this type of ambiguity and construct a corpus that helps detect ambiguous sentences, given that such utterances can be problematic for intention understanding.
In constructing a corpus for intention understanding, we consider the directivity and rhetoricalness of a sentence. Together they form a criterion for classifying the intention of spoken language into statements, questions, commands, rhetorical questions, and rhetorical commands. Using a spoken-language corpus annotated with sufficiently high inter-annotator agreement (kappa = 0.85), we show that colloquial corpus-based language models are effective in classifying ambiguous text given only textual data, and we qualitatively analyze the characteristics of the task.
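The two criteria above jointly determine the five intention classes. A sketch of the mapping (the predicate names and the three-way directivity encoding are illustrative assumptions, not the corpus guidelines themselves):

```python
def intention_label(directivity: str, rhetorical: bool) -> str:
    """Map the two annotation criteria to one of five intention classes.

    directivity: 'none' (no answer or action required), 'question'
    (requires an answer), or 'command' (requires an action).
    rhetorical: True if the utterance does not genuinely expect the
    answer or action it superficially solicits.
    """
    if directivity == "none":
        return "statement"
    if directivity == "question":
        return "rhetorical question" if rhetorical else "question"
    if directivity == "command":
        return "rhetorical command" if rhetorical else "command"
    raise ValueError(f"unknown directivity: {directivity!r}")

print(intention_label("question", rhetorical=True))  # -> rhetorical question
```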
We do not handle ambiguity only at the text level. To find out whether actual disambiguation is possible given a speech input, we design an artificial spoken-language corpus composed only of ambiguous sentences and resolve the ambiguity with various attention-based neural network architectures. In this process, we observe that ambiguity resolution is most effective when the textual and acoustic inputs co-attend to each other's features, especially when the audio processing module conveys attention information to the text module in a multi-hop manner.
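The multi-hop cross-modal attention described above can be sketched with plain NumPy: audio frames first attend over text positions, and the text then re-attends over the audio-informed summary, so prosodic information flows back into the text representation. The dimensions and the two-hop structure are illustrative assumptions, not the dissertation's exact architecture:

```python
import numpy as np

def cross_attend(queries, keys, values):
    """Scaled dot-product attention: queries from one modality,
    keys/values from the other."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (Tq, Tk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over Tk
    return weights @ values                         # (Tq, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(12, 64))    # 12 subword states, dim 64
audio = rng.normal(size=(40, 64))   # 40 acoustic frames, dim 64

# Hop 1: each audio frame attends over the text positions.
audio_ctx = cross_attend(audio, text, text)          # (40, 64)
# Hop 2: each text position attends over the audio-informed frames,
# pulling acoustic evidence into the text-side representation.
text_ctx = cross_attend(text, audio_ctx, audio_ctx)  # (12, 64)
print(text_ctx.shape)  # (12, 64)
```

A classifier head over `text_ctx` would then predict the intention label.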
Finally, assuming that the ambiguity of intention understanding has been resolved by the proposed strategies, we present a brief roadmap of how the results can be utilized at the industry or research level. By integrating a text-based ambiguity detection module with a speech-based intention understanding module, we can build a system that handles ambiguity efficiently while reducing error propagation. Such a system can be integrated with dialogue managers to make up a task-oriented dialogue system capable of chit-chat, or it can be used for error reduction in multilingual circumstances such as speech translation, beyond merely monolingual conditions.
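The proposed integration amounts to a cascade: run the cheap text-only classifier first, and invoke the audio-dependent model only when the text is flagged ambiguous. A sketch under assumed interfaces (all three injected callables are hypothetical stand-ins for the trained modules):

```python
def resolve_intention(text, audio, is_ambiguous, text_intent, speech_intent):
    """Cascade: text-only classification unless the utterance is flagged
    ambiguous, in which case the audio-based model decides.

    is_ambiguous, text_intent, speech_intent are injected model callables
    (hypothetical stand-ins for the trained modules).
    """
    if is_ambiguous(text):
        # Prosody is required to disambiguate; use the speech model.
        return speech_intent(text, audio)
    # Cheap path: skip audio processing on clear-cut inputs, reducing
    # compute and avoiding acoustic-side error propagation.
    return text_intent(text)

# Toy stand-ins, for demonstration only.
ambiguous_set = {"utt-01"}
label = resolve_intention(
    "utt-01", audio=[0.1, 0.2],
    is_ambiguous=lambda t: t in ambiguous_set,
    text_intent=lambda t: "statement",
    speech_intent=lambda t, a: "command",
)
print(label)  # -> command
```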
Throughout the dissertation, we want to show that ambiguity resolution for intention understanding in a prosody-sensitive language can be achieved and can be utilized at the industry or research level. We hope that this study helps tackle chronic ambiguity issues in other languages or domains, linking linguistic science and engineering approaches.
1 Introduction
1.1 Motivation
1.2 Research Goal
1.3 Outline of the Dissertation
2 Related Work
2.1 Spoken Language Understanding
2.2 Speech Act and Intention
2.2.1 Performatives and statements
2.2.2 Illocutionary act and speech act
2.2.3 Formal semantic approaches
2.3 Ambiguity of Intention Understanding in Korean
2.3.1 Ambiguities in language
2.3.2 Speech act and intention understanding in Korean
3 Ambiguity in Intention Understanding of Spoken Language
3.1 Intention Understanding and Ambiguity
3.2 Annotation Protocol
3.2.1 Fragments
3.2.2 Clear-cut cases
3.2.3 Intonation-dependent utterances
3.3 Data Construction
3.3.1 Source scripts
3.3.2 Agreement
3.3.3 Augmentation
3.3.4 Train split
3.4 Experiments and Results
3.4.1 Models
3.4.2 Implementation
3.4.3 Results
3.5 Findings and Summary
3.5.1 Findings
3.5.2 Summary
4 Disambiguation of Speech Intention
4.1 Ambiguity Resolution
4.1.1 Prosody and syntax
4.1.2 Disambiguation with prosody
4.1.3 Approaches in SLU
4.2 Dataset Construction
4.2.1 Script generation
4.2.2 Label tagging
4.2.3 Recording
4.3 Experiments and Results
4.3.1 Models
4.3.2 Results
4.4 Summary
5 System Integration and Application
5.1 System Integration for Intention Identification
5.1.1 Proof of concept
5.1.2 Preliminary study
5.2 Application to Spoken Dialogue System
5.2.1 What is 'Free-running'
5.2.2 Omakase chatbot
5.3 Beyond Monolingual Approaches
5.3.1 Spoken language translation
5.3.2 Dataset
5.3.3 Analysis
5.3.4 Discussion
5.4 Summary
6 Conclusion and Future Work
Bibliography
Abstract (In Korean)
Acknowledgment
A Survey on Awesome Korean NLP Datasets
English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests on English datasets are sufficient to demonstrate the performance of new models and methods, researchers still need to train and validate their models on Korean-based datasets to produce a technology or product suitable for Korean processing. This paper introduces 15 popular Korean-based NLP datasets with summarized details such as volume, license, and repositories, together with other research results inspired by the datasets. I also provide detailed instructions with samples or statistics of the datasets. The main characteristics of the datasets are presented in a single table to give researchers a rapid summary.
Comment: 11 pages, 1 horizontal page for large table
CLiFF Notes: Research In Natural Language Processing at the University of Pennsylvania
The Computational Linguistics Feedback Forum (CLiFF) is a group of students and faculty who gather once a week to discuss the members' current research. As the word feedback suggests, the group's purpose is the sharing of ideas. The group also promotes interdisciplinary contacts between researchers who share an interest in Cognitive Science.
There is no single theme describing the research in Natural Language Processing at Penn. There is work done in CCG, Tree adjoining grammars, intonation, statistical methods, plan inference, instruction understanding, incremental interpretation, language acquisition, syntactic parsing, causal reasoning, free word order languages, ... and many other areas. With this in mind, rather than trying to summarize the varied work currently underway here at Penn, we suggest reading the following abstracts to see how the students and faculty themselves describe their work. Their abstracts illustrate the diversity of interests among the researchers, explain the areas of common interest, and describe some very interesting work in Cognitive Science.
This report is a collection of abstracts from both faculty and graduate students in Computer Science, Psychology and Linguistics. We pride ourselves on the close working relations between these groups, as we believe that the communication among the different departments and the ongoing inter-departmental research not only improves the quality of our work, but makes much of that work possible.
Universal and language-specific processing : the case of prosody
A key question in the science of language is how speech processing can be influenced by both language-universal and language-specific mechanisms (Cutler, Klein, & Levinson, 2005). My graduate research aimed to address this question by adopting a cross-language approach to compare languages with different phonological systems. Of all components of linguistic structure, prosody is often considered to be one of the most language-specific dimensions of speech. This can have significant implications for our understanding of language use, because much of speech processing is specifically tailored to the structure and requirements of the native language. However, it is still unclear whether prosody may also play a universal role across languages, and very few comparative attempts have been made to explore this possibility. In this thesis, I examined both the production and perception of prosodic cues to prominence and phrasing in native speakers of English and Mandarin Chinese. In focus production, our research revealed that English and Mandarin speakers were alike in how they used prosody to encode prominence, but there were also systematic language-specific differences in the exact degree to which they enhanced the different prosodic cues (Chapter 2). This, however, was not the case in focus perception, where English and Mandarin listeners were alike in the degree to which they used prosody to predict upcoming prominence, even though the precise cues in the preceding prosody could differ (Chapter 3). Further experiments examining prosodic focus prediction in the speech of different talkers demonstrated functional cue equivalence in prosodic focus detection (Chapter 4). Likewise, our experiments also revealed both cross-language similarities and differences in the production and perception of juncture cues (Chapter 5). Overall, prosodic processing is the result of a complex but subtle interplay of universal and language-specific structure.
Research in the Language, Information and Computation Laboratory of the University of Pennsylvania
This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania.
It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students and postdocs in the Computer Science and Linguistics Departments, and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as: Combinatorial Categorial Grammars, Tree Adjoining Grammars, syntactic parsing and the syntax-semantics interface; but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition.
Naturally, this introduction cannot spell out all the connections between these abstracts; we invite you to explore them on your own. In fact, with this issue it's easier than ever to do so: this document is accessible on the "information superhighway". Just call up http://www.cis.upenn.edu/~cliff-group/94/cliffnotes.html
In addition, you can find many of the papers referenced in the CLiFF Notes on the net. Most can be obtained by following links from the authors' abstracts in the web version of this report.
The abstracts describe the researchers' many areas of investigation, explain their shared concerns, and present some interesting work in Cognitive Science. We hope its new online format makes the CLiFF Notes a more useful and interesting guide to Computational Linguistics activity at Penn.
Marked initial pitch in questions signals marked communicative function
In conversation, the initial pitch of an utterance can provide an early phonetic cue of the communicative function, the speech act, or the social action being implemented. We conducted quantitative acoustic measurements and statistical analyses of pitch in over 10,000 utterances, including 2512 questions, their responses, and about 5000 other utterances by 180 speakers from a corpus of 70 natural conversations in 10 languages. We measured pitch at the first prominence in a speaker's utterance and discriminated utterances by language, speaker, gender, question form, and the social action achieved by the speaker's turn. Applying multivariate logistic regression, we found that initial pitch deviating significantly from the speaker's median pitch level was predictive of the social action of the question. In questions designed to solicit agreement with an evaluation rather than information, pitch predictably diverged from the speaker's median into the top 10% of the speaker's range. This latter finding reveals a kind of iconicity in the relationship between prosody and social action, in which a marked pitch correlates with a marked social action. Thus, we argue that speakers rely on pitch to provide an early signal for recipients that the question is not to be interpreted through its literal semantics but rather through an inference.
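The speaker-normalized pitch measure described above can be made concrete as two features: the deviation of the first prominence from the speaker's median (in semitones) and whether it falls in the top decile of that speaker's observed range. This is a sketch of the idea only; the semitone conversion and decile cut are assumptions about the analysis, not the authors' code:

```python
import math

def semitones(f0, ref):
    """Distance of f0 from a reference frequency, in semitones."""
    return 12.0 * math.log2(f0 / ref)

def initial_pitch_features(first_peak_hz, speaker_f0_values):
    """Speaker-normalized features for the first pitch prominence:
    deviation from the speaker's median (semitones) and whether the
    peak falls in the top decile of the speaker's observed values."""
    ordered = sorted(speaker_f0_values)
    median = ordered[len(ordered) // 2]
    top_decile = ordered[int(0.9 * len(ordered))]
    return semitones(first_peak_hz, median), first_peak_hz >= top_decile

# A 260 Hz peak against a speaker whose F0 values cluster near 215 Hz.
dev, marked = initial_pitch_features(
    260.0, [180, 190, 200, 205, 210, 215, 220, 230, 240, 250])
print(round(dev, 2), marked)
```

Features of this kind would then enter the logistic regression as predictors of social action.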
CLiFF Notes: Research in the Language Information and Computation Laboratory of The University of Pennsylvania
This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania. It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students and postdocs in the Computer Science, Psychology, and Linguistics Departments, and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as: Combinatorial Categorial Grammars, Tree Adjoining Grammars, syntactic parsing and the syntax-semantics interface; but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition. With 48 individual contributors and six projects represented, this is the largest LINC Lab collection to date, and the most diverse.
Chapter 2: The Original ToBI System and the Evolution of the ToBI Framework
In this chapter, the authors try to identify the essential properties of a ToBI framework annotation system by describing the development and design of the original ToBI conventions. In this description, they overview the general phonological theory, and the specific theory of Mainstream American English intonation and prosody, that they decided to incorporate in the original ToBI tags. They also state the practical principles that led them to the decisions they made. The chapter is organised as follows. Section 2.2 briefly chronicles how the MAE_ToBI system came into being. Section 2.3 briefly describes the consensus account of English intonation and prosody on which the MAE_ToBI system is based. Section 2.4 catalogues the different components of a MAE_ToBI transcription and lists the salient rules which constrain the relationships between the components. This section also expands upon the theoretical foundations and practical consequences of adopting the general structure of multiple labelling tiers, particularly the separation of the labels for tones from the labels indexing prosodic boundary strength. Section 2.5 then describes some of the extensions of the basic ToBI tiers that have been adopted by some sites. This section also compares the decisions about the number of tiers and about inter-tier constraints with the analogous decisions for some of the other ToBI systems described in this book. Section 2.6 discusses the status of the symbolic labels relative to the continuous phonetic records that are also an obligatory component of a MAE_ToBI transcription. Section 2.7 closes by listing several open research questions that the authors would like to see addressed by MAE_ToBI users and the larger ToBI community.