111 research outputs found

    Ambiguity Resolution in Spoken Language Understanding

    Doctoral dissertation -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, August 2022. Advisor: Nam Soo Kim.
    Ambiguity in language is inevitable: although language is a means of communication, a concept can never be conveyed in a perfectly identical manner to every listener. Inevitable as it is, ambiguity in language understanding often leads to a breakdown or failure of communication. Ambiguity exists at various levels of language, but not all of it needs to be resolved. Different aspects of ambiguity arise in each domain and task, and it is crucial to identify the ambiguity that can be well defined and resolved, and then to draw the boundary between the ambiguous readings. In this dissertation, we investigate the types of ambiguity that appear in spoken language processing, especially in intention understanding, and conduct research to define and resolve them. Although this phenomenon occurs in various languages, its degree and character depend on the language investigated. We focus on cases where the ambiguity comes from the gap between the amount of information carried by spoken language and by text. Specifically, we study Korean, in which sentence form and intention often depend on prosody. In Korean, a text is often read with multiple intentions due to multi-functional sentence enders, frequent pro-drop, wh-intervention, and similar phenomena. Given that such utterances can be problematic for intention understanding, we first define this type of ambiguity and construct a corpus that helps detect ambiguous sentences.
    In constructing a corpus for intention understanding, we consider the directivity and rhetoricalness of each sentence. These two properties form the criterion for classifying the intention of spoken language into statement, question, command, rhetorical question, and rhetorical command. Using a spoken-language corpus annotated with sufficiently high inter-annotator agreement (kappa = 0.85), we show that language models trained on colloquial corpora are effective in classifying ambiguous text when only textual data is given, and we qualitatively analyze the characteristics of the task. We do not handle ambiguity at the text level alone. To find out whether actual disambiguation is possible given speech input, we design an artificial spoken-language corpus composed only of ambiguous sentences and resolve the ambiguity with various attention-based neural network architectures. In this process, we observe that ambiguity resolution is most effective when the textual and acoustic inputs co-attend to each other's features, especially when the audio processing module conveys attention information to the text module in a multi-hop manner. Finally, assuming that the ambiguity in intention understanding is resolved by the proposed strategies, we present a brief roadmap of how the results can be utilized in industry and research. By integrating the text-based ambiguity detection and the speech-based intention understanding modules, we can build a system that handles ambiguity efficiently while reducing error propagation. Such a system can be integrated with a dialogue manager to form a task-oriented dialogue system capable of chit-chat, or it can be used to reduce errors in multilingual settings such as speech translation, beyond merely monolingual conditions. Throughout the dissertation, we aim to show that ambiguity resolution for intention understanding in a prosody-sensitive language is achievable and usable in industry and research. We hope that this study helps tackle chronic ambiguity issues in other languages and domains, linking linguistic science and engineering approaches.
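    The abstract reports that disambiguation works best when the textual and acoustic branches co-attend to each other's features, with the audio module passing attention information to the text module over multiple hops. The sketch below illustrates that general idea with a two-hop cross-attention block over the five intention classes; the module names, dimensions, hop count, and pooling are illustrative assumptions, not the dissertation's exact architecture.

```python
import torch
import torch.nn as nn


class MultiHopCoAttention(nn.Module):
    """Illustrative co-attention between a text and an audio branch, applied
    for a fixed number of hops. Hypothetical sketch, not the thesis model."""

    def __init__(self, dim: int = 256, n_heads: int = 4, n_hops: int = 2):
        super().__init__()
        self.n_hops = n_hops
        self.audio_to_text = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # the five intention classes from the corpus: statement, question,
        # command, rhetorical question, rhetorical command
        self.classifier = nn.Linear(2 * dim, 5)

    def forward(self, text_feats, audio_feats):
        # text_feats: (batch, T_text, dim); audio_feats: (batch, T_audio, dim)
        for _ in range(self.n_hops):
            # the audio branch conveys attention information to the text branch ...
            text_feats, _ = self.audio_to_text(text_feats, audio_feats, audio_feats)
            # ... and the text branch attends back over the acoustic frames
            audio_feats, _ = self.text_to_audio(audio_feats, text_feats, text_feats)
        pooled = torch.cat([text_feats.mean(dim=1), audio_feats.mean(dim=1)], dim=-1)
        return self.classifier(pooled)


# toy usage: 8 utterances, 20 subword tokens, 120 acoustic frames, 256-dim features
model = MultiHopCoAttention()
logits = model(torch.randn(8, 20, 256), torch.randn(8, 120, 256))
print(logits.shape)  # torch.Size([8, 5])
```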
    Table of contents:
    1 Introduction
      1.1 Motivation
      1.2 Research Goal
      1.3 Outline of the Dissertation
    2 Related Work
      2.1 Spoken Language Understanding
      2.2 Speech Act and Intention
        2.2.1 Performatives and statements
        2.2.2 Illocutionary act and speech act
        2.2.3 Formal semantic approaches
      2.3 Ambiguity of Intention Understanding in Korean
        2.3.1 Ambiguities in language
        2.3.2 Speech act and intention understanding in Korean
    3 Ambiguity in Intention Understanding of Spoken Language
      3.1 Intention Understanding and Ambiguity
      3.2 Annotation Protocol
        3.2.1 Fragments
        3.2.2 Clear-cut cases
        3.2.3 Intonation-dependent utterances
      3.3 Data Construction
        3.3.1 Source scripts
        3.3.2 Agreement
        3.3.3 Augmentation
        3.3.4 Train split
      3.4 Experiments and Results
        3.4.1 Models
        3.4.2 Implementation
        3.4.3 Results
      3.5 Findings and Summary
        3.5.1 Findings
        3.5.2 Summary
    4 Disambiguation of Speech Intention
      4.1 Ambiguity Resolution
        4.1.1 Prosody and syntax
        4.1.2 Disambiguation with prosody
        4.1.3 Approaches in SLU
      4.2 Dataset Construction
        4.2.1 Script generation
        4.2.2 Label tagging
        4.2.3 Recording
      4.3 Experiments and Results
        4.3.1 Models
        4.3.2 Results
      4.4 Summary
    5 System Integration and Application
      5.1 System Integration for Intention Identification
        5.1.1 Proof of concept
        5.1.2 Preliminary study
      5.2 Application to Spoken Dialogue System
        5.2.1 What is 'Free-running'
        5.2.2 Omakase chatbot
      5.3 Beyond Monolingual Approaches
        5.3.1 Spoken language translation
        5.3.2 Dataset
        5.3.3 Analysis
        5.3.4 Discussion
      5.4 Summary
    6 Conclusion and Future Work
    Bibliography
    Abstract (In Korean)
    Acknowledgment

    Examining Linguistic Behavior of a Virtual Museum Guide

    As artificial intelligence that uses natural language processing becomes more prevalent, analyzing such software so that it understands us as well as possible becomes all the more significant. Despite the high accuracy with which recurrent neural networks can predict, say, the next word in a sentence, they function differently from a human brain. Given that the sources of difficulty encountered by natural language processing software may not be intuitive to a human, searching for patterns among its errors provides insight into where the software can be improved. Additionally, because language is productive, humans interacting with natural language processing technology can produce an unlimited variety of stimuli. When a stimulus is as unpredictable as a human prompted to ask whatever they want, the response it yields reveals how the software handles variability. At the Center of Science and Industry (COSI), a science museum in Columbus, Ohio, researchers from the Ohio State University have developed an interactive avatar that uses natural language processing. In this context, an avatar is a human-like bot created to interact with users. Visitors can ask the avatar questions related to linguistics, computer science, and exhibits at the museum. The avatar consists of an animated visual component and artificial intelligence software that processes speech as input and produces a response accordingly. The language-processing component is a recurrent neural network, pretrained on a large corpus of general English text and subsequently trained on a smaller corpus of text pertaining to computer science, linguistics, and COSI exhibits. My research focuses on the effectiveness of the avatar's responses. No embargo. Academic Major: Linguistics.
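    The avatar's language model is described as a recurrent neural network pretrained on general English text and then trained further on a smaller domain corpus. A minimal sketch of that two-stage recipe follows; the toy corpora, model size, and training loop are placeholder assumptions, not the avatar's actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two training stages: a "general English" corpus for
# pretraining and a small domain corpus for fine-tuning (placeholders only).
general_text = "the cat sat on the mat and the dog ran in the park".split()
domain_text = "the avatar answers questions about linguistics and museum exhibits".split()

vocab = sorted(set(general_text + domain_text))
stoi = {word: i for i, word in enumerate(vocab)}


class RNNLanguageModel(nn.Module):
    """Word-level LSTM language model: embed, recur, project to the vocabulary."""

    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        hidden, _ = self.rnn(self.embed(ids))
        return self.out(hidden)


def train_next_word(model, words, epochs, lr=1e-2):
    """Train the model to predict each word from the words preceding it."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    ids = torch.tensor([stoi[w] for w in words]).unsqueeze(0)  # (1, T)
    inputs, targets = ids[:, :-1], ids[:, 1:]
    for _ in range(epochs):
        logits = model(inputs)                                 # (1, T-1, vocab)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()


model = RNNLanguageModel(len(vocab))
print("pretraining loss:", train_next_word(model, general_text, epochs=50))
print("fine-tuning loss:", train_next_word(model, domain_text, epochs=20))
```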

    Bag-of-words representations for computer audition

    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms, as they require a static, fixed-length input. Moreover, even for dynamic classifiers, it can be beneficial to compress the information in the LLDs by summarising them over a temporal block. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired by the bag-of-words method in natural language processing, in which a document is described as a histogram of the terms it contains. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits while preserving the advantages of functionals, such as data independence. Furthermore, it is shown that the two representations are complementary and that their fusion improves the performance of a machine listening system.
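    As described above, BoAW quantises frame-level acoustic descriptors against a learned codebook and represents each recording as a histogram of codeword counts, analogous to a bag-of-words document histogram. The sketch below illustrates that idea using MFCC frames, a k-means codebook, and term-frequency normalisation; it relies on librosa and scikit-learn as stand-ins and is not the openXBOW implementation.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans


def boaw_histograms(wav_paths, codebook_size=64, sr=16000):
    """Bag-of-audio-words over MFCC frames: learn a codebook on all frames,
    then describe each file as a normalised histogram of codeword counts."""
    # 1) frame-level low-level descriptors (here: 13 MFCCs per frame)
    frames_per_file = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (n_frames, 13)
        frames_per_file.append(mfcc)

    # 2) unsupervised codebook learned on the pooled frames
    codebook = KMeans(n_clusters=codebook_size, n_init=10, random_state=0)
    codebook.fit(np.vstack(frames_per_file))

    # 3) one fixed-length histogram per file (the "bag of audio words")
    histograms = []
    for mfcc in frames_per_file:
        words = codebook.predict(mfcc)
        counts = np.bincount(words, minlength=codebook_size).astype(float)
        histograms.append(counts / counts.sum())                # term-frequency normalisation
    return np.array(histograms)


# usage with a hypothetical file list:
# features = boaw_histograms(["utt1.wav", "utt2.wav"])
```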

    Computer audition for emotional wellbeing

    This thesis is focused on the application of computer audition (i.e., machine listening) methodologies for monitoring states of emotional wellbeing. Computer audition is a growing field and has been successfully applied to an array of use cases in recent years. There are several advantages to audio-based computational analysis; for example, audio can be recorded non-invasively and stored economically, and it can capture rich information on happenings in a given environment, e.g., human behaviour. With this in mind, maintaining emotional wellbeing is a challenge for humans, and emotion-altering conditions, including stress and anxiety, have become increasingly common in recent years. Such conditions manifest in the body, inherently changing how we express ourselves. Research shows these alterations are perceivable within vocalisation, suggesting that speech-based audio monitoring may be valuable for developing artificially intelligent systems that target improved wellbeing. Furthermore, computer audition applies machine learning and other computational techniques to audio understanding, and so by combining computer audition with applications in the domain of computational paralinguistics and emotional wellbeing, this research concerns the broader field of empathy for Artificial Intelligence (AI). To this end, speech-based audio modelling that incorporates and understands paralinguistic wellbeing-related states may be a vital cornerstone for improving the degree of empathy that an artificial intelligence can display. To summarise, this thesis investigates the extent to which speech-based computer audition methodologies can be utilised to understand human emotional wellbeing. A fundamental background on the fields in question as they pertain to emotional wellbeing is first presented, followed by an outline of the applied audio-based methodologies. Next, detail is provided for several machine learning experiments focused on emotional wellbeing applications, including analysis and recognition of under-researched phenomena in speech, e.g., anxiety and markers of stress. Core contributions from this thesis include the collection of several related datasets, hybrid fusion strategies for an emotional gold standard, novel machine learning strategies for data interpretation, and an in-depth acoustic-based computational evaluation of several human states. All of these contributions focus on ascertaining the advantage of audio in the context of modelling emotional wellbeing. Given the sensitive nature of human wellbeing, the ethical implications involved with developing and applying such systems are discussed throughout.
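    One of the contributions listed above is a hybrid fusion strategy for building an emotional gold standard from multiple raters. A common way to fuse continuous emotion annotations is to weight each rater by how well their trace agrees with the others, in the spirit of an evaluator-weighted estimator; the sketch below shows that idea and is an illustrative assumption, not necessarily the fusion strategy used in the thesis.

```python
import numpy as np


def evaluator_weighted_gold_standard(ratings):
    """Fuse per-rater emotion annotations into a single gold-standard trace.

    ratings: array of shape (n_raters, n_frames) with continuous labels
    (e.g., arousal or valence over time). Each rater is weighted by the
    Pearson correlation of their trace with the plain mean of all raters."""
    ratings = np.asarray(ratings, dtype=float)
    mean_trace = ratings.mean(axis=0)
    # correlation-based weights, clipped at zero so noisy raters get no weight
    weights = np.array([np.corrcoef(r, mean_trace)[0, 1] for r in ratings])
    weights = np.clip(weights, 0.0, None)
    weights = weights / weights.sum()
    return weights @ ratings   # weighted average over raters, shape (n_frames,)


# toy usage: three raters annotating arousal over five frames
gold = evaluator_weighted_gold_standard([
    [0.1, 0.2, 0.4, 0.5, 0.6],
    [0.0, 0.3, 0.5, 0.5, 0.7],
    [0.2, 0.1, 0.3, 0.6, 0.6],
])
print(gold.shape)  # (5,)
```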

    An Analytical Review of Audiovisual Systems for Detecting Personal Protective Equipment on the Human Face

    Since 2019, all countries of the world have faced the rapid spread of the pandemic caused by the COVID-19 coronavirus infection, which the world community continues to fight to the present day. Despite the obvious effectiveness of personal respiratory protection equipment against coronavirus infection, many people neglect the use of protective face masks in public places. Therefore, to control and promptly identify violators of public health regulations, it is necessary to apply modern information technologies that detect protective masks on people's faces using video and audio information. The article presents an analytical review of existing and developing intelligent information technologies for bimodal analysis of the voice and facial characteristics of a masked person. There are many studies on detecting masks from video images, and a significant number of publicly available corpora contain images of faces both with and without masks, collected in various ways. Research and development aimed at detecting personal respiratory protection equipment from the acoustic characteristics of human speech is still scarce, since this direction only began to develop during the pandemic caused by the COVID-19 coronavirus infection. Existing systems help prevent the spread of coronavirus infection by recognizing the presence or absence of masks on the face, and they also assist in the remote diagnosis of COVID-19 by detecting the first symptoms of a viral infection from acoustic characteristics. However, to date, a number of problems in the automatic diagnosis of COVID-19 and the detection of masks on people's faces remain unresolved. The first is the low accuracy of detecting masks and coronavirus infection, which prevents automatic diagnosis without the involvement of experts (medical personnel). Many systems cannot operate in real time, which makes it impossible to control and monitor the wearing of protective masks in public places. Moreover, most existing systems cannot be embedded in a smartphone so that users could check for coronavirus infection anywhere. Another major problem is the collection of data from patients infected with COVID-19, as many people are unwilling to share confidential information.

    Adaptation of Speaker and Speech Recognition Methods for the Automatic Screening of Speech Disorders using Machine Learning

    This PhD thesis presents methods for exploiting the non-verbal communication of individuals suffering from specific diseases or health conditions, with the aim of automatically screening for those conditions. More specifically, we employ one of the pillars of non-verbal communication, paralanguage, to explore techniques that can be used to model the speech of subjects. Paralanguage is a non-lexical component of communication that relies on intonation, pitch, speed of talking, and other cues, and it can be processed and analyzed in an automatic manner. This is the domain of Computational Paralinguistics, which can be defined as the study of modeling non-verbal latent patterns within the speech of a speaker by means of computational algorithms; these patterns go beyond the purely linguistic approach. By means of machine learning, we present models, drawn from distinct paralinguistic and pathological-speech scenarios, that are capable of automatically estimating the status of conditions such as Alzheimer's disease, Parkinson's disease, and clinical depression, among others.
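    As the abstract explains, computational paralinguistics models non-lexical cues such as intonation, pitch, and speaking rate and feeds them to machine learning algorithms for screening. The sketch below shows a hypothetical minimal pipeline of that kind using librosa and scikit-learn; the feature set and the binary condition-versus-control labels are illustrative assumptions, not the thesis's actual models.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def paralinguistic_features(path, sr=16000):
    """A few utterance-level paralinguistic descriptors: pitch statistics,
    energy statistics, and a rough speaking-rate proxy."""
    y, _ = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    f0 = f0[~np.isnan(f0)]                           # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]
    onsets = librosa.onset.onset_detect(y=y, sr=sr)  # proxy for syllable-like events
    duration = len(y) / sr
    return np.array([
        f0.mean() if f0.size else 0.0,               # mean pitch
        f0.std() if f0.size else 0.0,                # pitch variability (intonation)
        rms.mean(), rms.std(),                       # loudness level and dynamics
        len(onsets) / duration,                      # speaking-rate proxy (onsets/second)
    ])


# usage with a hypothetical screening corpus: 0 = control, 1 = target condition
# X = np.stack([paralinguistic_features(p) for p in wav_paths])
# clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, labels)
# predictions = clf.predict(X)
```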