
    Automatic assessment of spoken language proficiency of non-native children

    This paper describes technology developed to automatically grade Italian students (ages 9-16) on their English and German spoken language proficiency. The students' spoken answers are first transcribed by an automatic speech recognition (ASR) system and then scored using a feedforward neural network (NN) that processes features extracted from the automatic transcriptions. In-domain acoustic models, employing deep neural networks (DNNs), are derived by adapting the parameters of an original out-of-domain DNN.
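    As a rough illustration of the scoring stage described above (not the paper's exact architecture), the sketch below shows a feedforward network regressing a proficiency score from a fixed-size feature vector; the feature set and all dimensions are assumptions.

    import torch
    import torch.nn as nn

    class ProficiencyScorer(nn.Module):
        # A hypothetical scorer over a 32-dimensional feature vector per answer
        # (e.g. fluency, lexical, and accuracy features from the ASR transcription).
        def __init__(self, n_features=32, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),   # regress a single proficiency score
            )

        def forward(self, x):
            return self.net(x)

    scorer = ProficiencyScorer()
    scores = scorer(torch.randn(8, 32))   # batch of 8 answers -> (8, 1) scores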

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the depth and breadth of computational semantic processing research can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks: word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.
    Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing in the published version due to the publication policies; please contact Prof. Erik Cambria for details.
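    To make one of the surveyed tasks concrete, here is a minimal word sense disambiguation example using NLTK's classic Lesk implementation; this illustrates the task itself, not a method from the survey, and assumes the NLTK tokenizer and WordNet data have been downloaded.

    # Requires: pip install nltk, then nltk.download('punkt') and
    # nltk.download('wordnet') before first use.
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    sentence = "I went to the bank to deposit my money"
    sense = lesk(word_tokenize(sentence), "bank")   # pick a WordNet sense by gloss overlap
    if sense is not None:
        print(sense.name(), "-", sense.definition())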

    Visual and linguistic processes in deep neural networks: A cognitive perspective

    When people describe an image, there are complex visual and linguistic processes at work. For instance, speakers tend to look at an object right before mentioning it, but not every time. Similarly, during a conversation, speakers can refer to an entity multiple times, using expressions that evolve in the common ground. In this thesis, I develop computational models of such visual and linguistic processes, drawing inspiration from theories and findings from cognitive science and psycholinguistics. This work, in which I aim to capture the intricate relationship between non-linguistic modalities and language within deep artificial neural networks, contributes to the line of research into multimodal Natural Language Processing. The thesis consists of two parts: (1) modeling human gaze in language use (production and comprehension), and (2) modeling communication strategies in referential tasks in visually grounded dialogue. In the first part, I delve into enhancing image description generation models using eye-tracking data; evaluating the variation in human signals while describing images; and predicting human reading behavior in the form of eye movements. In the second part, I build models that quantify, generate, resolve, and adapt utterances in referential tasks situated within visual and conversational contexts. The outcomes advance our understanding of human visuo-linguistic processes by revealing the intricate strategies at play in such processes, and point to the importance of accounting for them when developing and utilizing multimodal models. The findings shed light on how advancements in artificial intelligence could contribute to research on crossmodal processes in humans, and vice versa.
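    A minimal sketch of the kind of model the first part of the thesis concerns, assuming details the summary does not give (feature dimensions, vocabulary size, and fusion by concatenation): a caption decoder whose initial state is conditioned on both image and gaze features.

    import torch
    import torch.nn as nn

    class GazeConditionedDecoder(nn.Module):
        def __init__(self, img_dim=2048, gaze_dim=64, vocab=10000, hid=512):
            super().__init__()
            self.fuse = nn.Linear(img_dim + gaze_dim, hid)  # joint visual-gaze context
            self.embed = nn.Embedding(vocab, hid)
            self.lstm = nn.LSTM(hid, hid, batch_first=True)
            self.out = nn.Linear(hid, vocab)

        def forward(self, img_feats, gaze_feats, tokens):
            ctx = torch.tanh(self.fuse(torch.cat([img_feats, gaze_feats], dim=-1)))
            h0 = ctx.unsqueeze(0)               # fused context as the initial hidden state
            states, _ = self.lstm(self.embed(tokens), (h0, torch.zeros_like(h0)))
            return self.out(states)             # per-step vocabulary logits

    model = GazeConditionedDecoder()
    logits = model(torch.randn(2, 2048), torch.randn(2, 64),
                   torch.randint(0, 10000, (2, 12)))   # -> (2, 12, 10000)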

    Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

    Peer reviewed.

    Deep Learning Methods for Dialogue Act Recognition using Visual Information

    Dialogue act (DA) recognition is an important step in dialogue management and understanding. The task is to automatically assign a label to an utterance (or part of one) based on its function in a dialogue (e.g. statement, question, backchannel). Such utterance-level classification helps to model and identify the structure of spontaneous dialogues. Even though DA recognition is usually performed on audio data using an automatic speech recognition engine, dialogues also exist in the form of images (e.g. comic books). This thesis deals with automatic dialogue act recognition from image documents. To the best of our knowledge, this is the first attempt to propose DA recognition approaches that use images as input. For this task, it is necessary to extract the text from the images; we therefore employ algorithms from the fields of computer vision and image processing, such as image thresholding, text segmentation, and optical character recognition (OCR).
The main contribution in this field is the design and implementation of a custom OCR model based on convolutional and recurrent neural networks. We also explore different strategies for training such a model, including synthetic data generation and data augmentation techniques. We achieve new state-of-the-art OCR results in settings where only a small amount of training data is available. Summing up, our contribution thus also includes an overview of how to create an efficient OCR system with minimal manual annotation costs. We further deal with multilinguality in the DA recognition field. We successfully employ one general model trained on data from all available languages, as well as several models trained on a single language each, where cross-linguality is achieved through semantic space transformations. Moreover, we explore transfer learning for DA recognition where only a small number of annotated examples is available. We use word-level and utterance-level features, and our models employ deep neural network architectures, including Transformers. We obtain new state-of-the-art results in the multi- and cross-lingual DA recognition field. For DA recognition from image documents, we propose and implement a novel multimodal model based on convolutional and recurrent neural networks. This model combines text and image inputs: the text part is fed with tokens from OCR, while the visual part extracts image features that serve as an auxiliary input. Text extracted from dialogues is often erroneous and contains typos or other lexical errors. We show that the multimodal model copes with the erroneous text, and that the visual information partially compensates for this loss of information.
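    A hedged sketch of the multimodal classifier described above; layer sizes, the GRU choice, and the number of DA classes are assumptions, not the thesis's exact configuration. A recurrent text branch reads the noisy OCR tokens and an auxiliary visual branch contributes image features before classification.

    import torch
    import torch.nn as nn

    class MultimodalDAClassifier(nn.Module):
        def __init__(self, vocab=20000, emb=128, hid=128, img_dim=512, n_acts=10):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.text_rnn = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
            self.img_proj = nn.Linear(img_dim, hid)          # auxiliary visual input
            self.classify = nn.Linear(2 * hid + hid, n_acts)

        def forward(self, ocr_tokens, img_feats):
            _, h = self.text_rnn(self.embed(ocr_tokens))     # h: (2, batch, hid)
            text_vec = torch.cat([h[0], h[1]], dim=-1)       # concat both directions
            img_vec = torch.relu(self.img_proj(img_feats))
            return self.classify(torch.cat([text_vec, img_vec], dim=-1))

    model = MultimodalDAClassifier()
    logits = model(torch.randint(0, 20000, (4, 40)),   # noisy OCR token ids
                   torch.randn(4, 512))                # panel image features -> (4, 10)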

    Articulatory features for conversational speech recognition
