180 research outputs found
Representation Learning for Natural Language Processing
This open access book provides an overview of the recent advances in representation learning theory, algorithms and applications for natural language processing (NLP). It is divided into three parts. Part I presents the representation learning techniques for multiple language entries, including words, phrases, sentences and documents. Part II then introduces the representation techniques for those objects that are closely related to NLP, including entity-based world knowledge, sememe-based linguistic knowledge, networks, and cross-modal entries. Lastly, Part III provides open resource tools for representation learning techniques, and discusses the remaining challenges and future research directions. The theories and algorithms of representation learning presented can also benefit other related domains such as machine learning, social network analysis, the Semantic Web, information retrieval, data mining and computational biology. This book is intended for advanced undergraduate and graduate students, post-doctoral fellows, researchers, lecturers, and industrial engineers, as well as anyone interested in representation learning and natural language processing.
Deep Learning Methods for Dialogue Act Recognition using Visual Information
Dialogue act (DA) recognition is an important step of dialogue management and understanding. The task is to automatically assign a label to an utterance (or a part of it) based on its function in a dialogue (e.g. statement, question, backchannel, etc.). Such utterance-level classification thus helps to model and identify the structure of spontaneous dialogues. Even though DA recognition is usually realized on audio data using an automatic speech recognition engine, dialogues also exist in the form of images (e.g. comic books).
This thesis deals with automatic dialogue act recognition from image documents.
To the best of our knowledge, this is the first attempt to propose DA recognition approaches using images as input.
For this task, it is necessary to extract the text from the images.
Therefore, we employ algorithms from the field of computer vision and image processing, such as image thresholding, text segmentation, and optical character recognition (OCR). The main contribution in this field is the design and implementation of a custom OCR model based on convolutional and recurrent neural networks. We also explore different strategies for training such a model, including synthetic data generation and data augmentation techniques. We achieve new state-of-the-art OCR results in settings where only a small amount of training data is available. Summing up, our contributions thus also include an overview of how to create an efficient OCR system with minimal annotation costs.
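As a rough illustration of this kind of OCR architecture, a minimal CRNN-style sketch in PyTorch follows; all layer sizes and names, and the CTC output choice, are illustrative assumptions rather than the thesis's actual model.

```python
# A minimal CRNN sketch: CNN feature extractor over a text-line image,
# bidirectional LSTM over the column features, CTC-style output layer.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        # Convolutional stack: shrinks the image while keeping width
        # as the "time" axis of the text line.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_height = img_height // 4
        # Bidirectional LSTM reads the column features left to right.
        self.rnn = nn.LSTM(128 * feat_height, 256,
                           bidirectional=True, batch_first=True)
        # +1 output class for the CTC "blank" symbol.
        self.fc = nn.Linear(2 * 256, num_classes + 1)

    def forward(self, x):                        # x: (B, 1, H, W)
        f = self.cnn(x)                          # (B, C, H', W')
        f = f.permute(0, 3, 1, 2).flatten(2)     # (B, W', C*H')
        out, _ = self.rnn(f)                     # (B, W', 512)
        return self.fc(out).log_softmax(-1)      # per-column character scores

model = CRNN(num_classes=80)
logits = model(torch.randn(4, 1, 32, 128))       # 4 synthetic text-line crops
# Training would pair these outputs with target strings via nn.CTCLoss.
```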
We further deal with multilinguality in the DA recognition field. We successfully employ one general model trained on data from all available languages, as well as several models trained on a single language each, where cross-linguality is achieved by semantic space transformations. Moreover, we explore transfer learning for DA recognition in settings where only a small number of annotated examples is available. We use word-level and utterance-level features, and our models build on deep neural network architectures, including the Transformer. We obtain new state-of-the-art results in multi- and cross-lingual DA recognition.
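The abstract does not fix the exact transformation; one standard instance of a semantic space transformation is an orthogonal (Procrustes) mapping learned from a bilingual dictionary. A minimal sketch with random stand-in vectors, assuming NumPy and 300-dimensional embeddings (all sizes are illustrative):

```python
# Orthogonal Procrustes mapping between two embedding spaces:
#   W = argmin ||XW - Y||_F  subject to  W^T W = I
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))   # source-language vectors for dictionary pairs
Y = rng.normal(size=(500, 300))   # target-language vectors, same word pairs

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt                        # closed-form orthogonal solution

# A source-language utterance embedding can now be mapped into the
# target space and fed to a DA classifier trained on the target language.
mapped = X @ W
# The learned map fits the dictionary no worse than the identity map.
print(np.linalg.norm(mapped - Y) <= np.linalg.norm(X - Y))
```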
For DA recognition from image documents, we propose and implement a novel multimodal model based on convolutional and recurrent neural networks. This model combines text and image inputs: the text branch is fed with tokens from the OCR output, while the visual branch extracts image features that serve as an auxiliary input. Text extracted from dialogues is often erroneous and contains typos or other lexical errors. We show that the multimodal model copes with the erroneous text, and the visual information partially compensates for this loss of information.
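As a rough sketch of such a two-branch design (a late-fusion variant is assumed here, not necessarily the thesis's exact architecture):

```python
# Two-branch multimodal DA classifier: a text branch over (possibly noisy)
# OCR tokens and a visual branch over pooled image features, fused by
# concatenation before the final classification layer.
import torch
import torch.nn as nn

class MultimodalDA(nn.Module):
    def __init__(self, vocab_size, num_acts, img_feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        self.text_rnn = nn.LSTM(128, 128, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, 128)
        self.classifier = nn.Linear(128 + 128, num_acts)

    def forward(self, tokens, img_feats):
        _, (h, _) = self.text_rnn(self.embed(tokens))   # summary of OCR text
        img = torch.relu(self.img_proj(img_feats))      # visual evidence
        return self.classifier(torch.cat([h[-1], img], dim=-1))

model = MultimodalDA(vocab_size=5000, num_acts=10)
scores = model(torch.randint(0, 5000, (4, 20)),  # OCR token ids
               torch.randn(4, 512))              # e.g. pooled CNN features
```

Because the visual branch does not depend on the OCR text, it can still carry signal when the recognized tokens are corrupted, which is the intuition behind the compensation effect described above.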
Advances in Image Processing, Analysis and Recognition Technology
For many decades, researchers have been trying to make computers’ analysis of images as effective as human vision. For this purpose, many algorithms and systems have been created. The whole process covers various stages, including image processing, representation, and recognition. The results of this work can be applied to many computer-assisted areas of everyday life. They improve particular activities and provide handy tools, which are sometimes only for entertainment but quite often significantly increase our safety. Indeed, the range of practical applications of image processing algorithms is particularly wide. Moreover, the rapid growth in computational power and efficiency has allowed for the development of more sophisticated and effective algorithms and tools. Although significant progress has been made so far, many issues remain, resulting in the need for the development of novel approaches.
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal, and other everyday knowledge aspects questions the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing the missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance the explainability, fairness, and validity of decision making, issues of utmost importance for such complex implementations. This survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
Automatic Image Captioning with Style
This thesis connects two core topics in machine learning: vision and language. The problem of choice is image caption generation: automatically constructing natural language descriptions of image content. Previous research into image caption generation has focused on generating purely descriptive captions; I focus on generating visually relevant captions with a distinct linguistic style. Captions with style have the potential to ease communication and add a new layer of personalisation.
First, I consider naming variations in image captions and propose a method for predicting context-dependent names that takes into account visual and linguistic information. This method makes use of a large-scale image caption dataset, which I also use to explore naming conventions, reporting them for hundreds of animal classes. Next, I propose the SentiCap model, which relies on recent advances in artificial neural networks to generate visually relevant image captions with positive or negative sentiment. To balance descriptiveness and sentiment, the SentiCap model dynamically switches between two recurrent neural networks, one tuned for descriptive words and one for sentiment words, as sketched below. As the first published model for generating captions with sentiment, SentiCap has influenced a number of subsequent works.
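A minimal sketch of this word-level switching idea: two decoder streams blended per time step by a learned gate. The soft gate and all sizes here are assumptions for illustration; the published SentiCap model differs in detail.

```python
# Two RNN decoders (descriptive and sentiment) mixed per word by a gate.
import torch
import torch.nn as nn

class SwitchingDecoder(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.desc = nn.GRU(hidden, hidden, batch_first=True)  # descriptive stream
        self.sent = nn.GRU(hidden, hidden, batch_first=True)  # sentiment stream
        self.gate = nn.Linear(2 * hidden, 1)                  # per-step switch
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        e = self.embed(tokens)
        hd, _ = self.desc(e)
        hs, _ = self.sent(e)
        g = torch.sigmoid(self.gate(torch.cat([hd, hs], -1)))  # (B, T, 1)
        h = g * hd + (1 - g) * hs       # blend the two streams word by word
        return self.out(h)              # next-word scores

dec = SwitchingDecoder(vocab_size=8000)
logits = dec(torch.randint(0, 8000, (2, 12)))   # (2, 12, 8000)
```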
I then investigate the sub-task of modelling styled sentences without images. The specific task chosen is sentence simplification: rewriting news article sentences to make them easier to understand.
For this task, I design a neural sequence-to-sequence model that can work with limited training data, using novel adaptations for word copying and sharing word embeddings.
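A minimal sketch of these two adaptations under stated assumptions: one embedding matrix shared by the encoder, decoder, and output projection, and a deliberately simplified copy gate that spreads copy mass uniformly over source positions (a real copy mechanism would use attention weights; the thesis's variant may differ).

```python
# Seq2seq with tied embeddings and a simplified word-copying gate.
import torch
import torch.nn as nn

class TiedSeq2Seq(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)  # shared by both sides
        self.enc = nn.GRU(hidden, hidden, batch_first=True)
        self.dec = nn.GRU(hidden, hidden, batch_first=True)
        self.copy_gate = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, vocab_size, bias=False)
        self.out.weight = self.embed.weight            # tie output projection

    def forward(self, src, tgt):
        _, h = self.enc(self.embed(src))               # encode source sentence
        d, _ = self.dec(self.embed(tgt), h)            # decode target prefix
        gen = self.out(d).softmax(-1)                  # generate from vocabulary
        # Simplified copy distribution: uniform mass on the source token ids.
        copy = torch.zeros_like(gen).scatter_add_(
            -1,
            src.unsqueeze(1).expand(-1, tgt.size(1), -1),
            torch.full((src.size(0), tgt.size(1), src.size(1)),
                       1.0 / src.size(1)))
        p = torch.sigmoid(self.copy_gate(d))           # copy vs. generate per step
        return (1 - p) * gen + p * copy                # mixed word distribution

m = TiedSeq2Seq(vocab_size=6000)
probs = m(torch.randint(0, 6000, (2, 15)),   # source sentence ids
          torch.randint(0, 6000, (2, 10)))   # target prefix ids
```

Sharing one embedding matrix cuts the parameter count roughly by a factor of three on the embedding side, which is exactly what helps when training data is limited.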
Finally, I present SemStyle, a system for generating visually relevant image captions in the style of an arbitrary text corpus. A shared term space allows a neural network for vision and content planning to communicate with a network for styled language generation. SemStyle achieves competitive results in human and automatic evaluations of descriptiveness and style.
As a whole, this thesis presents two complete systems for styled caption generation that are the first of their kind and demonstrate, for the first time, that automatic style transfer for image captions is achievable. Contributions also include novel ideas for object naming and sentence simplification. This thesis opens up inquiries into highly personalised image captions; large-scale visually grounded concept naming; and, more generally, styled text generation with content control.
- …