714 research outputs found
Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning
Developing artificial learning systems that can understand and generate natural language has been one of the long-standing goals of artificial intelligence. Recent decades have witnessed impressive progress on both of these problems, giving rise to a new family of approaches. In particular, advances in deep learning over the past few years have led to neural approaches to natural language generation (NLG). These methods combine generative language learning techniques with neural network-based frameworks. With a wide range of applications in natural language processing, neural NLG (NNLG) is a new and fast-growing field of research. In this state-of-the-art report, we investigate the recent developments and applications of NNLG to their full extent from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability, and learning strategies. We summarize the fundamental building blocks of NNLG approaches from these aspects and provide detailed reviews of commonly used preprocessing steps and basic neural architectures. This report also focuses on the seminal applications of these NNLG models, such as machine translation, description generation, automatic speech recognition, abstractive summarization, text simplification, question answering and generation, and dialogue generation. Finally, we conclude with a thorough discussion of the described frameworks by pointing out some open research directions.
This work has been partially supported by the European Commission ICT COST Action “Multi-task, Multilingual, Multi-modal Language Generation” (CA18231). AE was supported by the BAGEP 2021 Award of the Science Academy. EE was supported in part by the TUBA GEBIP 2018 Award. BP is in part funded by Independent Research Fund Denmark (DFF) grant 9063-00077B. IC has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 838188. EL is partly funded by the Generalitat Valenciana and the Spanish Government through projects PROMETEU/2018/089 and RTI2018-094649-B-I00, respectively. SMI is partly funded by UNIRI project uniri-drustv-18-20. GB is partly supported by the Ministry of Innovation and the National Research, Development and Innovation Office within the framework of the Hungarian Artificial Intelligence National Laboratory Programme. COT is partially funded by the Romanian Ministry of European Investments and Projects through the Competitiveness Operational Program (POC) project “HOLOTRAIN” (grant no. 29/221 ap2/07.04.2020, SMIS code: 129077) and by the German Academic Exchange Service (DAAD) through the project “AWAKEN: content-Aware and netWork-Aware faKE News mitigation” (grant no. 91809005). ESA is partially funded by the German Academic Exchange Service (DAAD) through the project “Deep-Learning Anomaly Detection for Human and Automated Users Behavior” (grant no. 91809358).
Multi-view Representation Learning for Unifying Languages, Knowledge and Vision
The growth of content on the web has raised various challenges, yet it has also provided numerous opportunities. Content exists in varied forms: text appearing in different languages, entity-relationship graphs represented as structured knowledge, and visual embodiments such as images and videos. These forms are often referred to as modalities. In many instances, different combinations of modalities co-exist to complement each other or to provide consensus, making the content either heterogeneous or homogeneous. Having an additional point of view for each instance in the content is beneficial for data-driven learning and intelligent content processing. However, despite the availability of such content, most advancements in data-driven learning (i.e., machine learning) have been made by solving tasks separately for a single modality. A similar endeavor has not been made for challenges that require input from either all modalities or a subset of them.
In this dissertation, we develop models and techniques that can leverage multiple views of heterogeneous or homogeneous content and build a shared representation for aiding several applications that require a combination of the modalities mentioned above. In particular, we aim to address applications such as content-based search, categorization, and generation by providing several novel contributions.
First, we develop models for heterogeneous content by jointly modeling diverse representations emerging from two views, text and image, and learning their correlation. Specifically, modeling such correlation is helpful for retrieving cross-modal content. Second, we replace the heterogeneous content with homogeneous content to learn a common-space representation for content categorization across languages. Furthermore, we develop models that take input from both homogeneous and heterogeneous content to facilitate the construction of a common-space representation from more than two views. Specifically, this representation is used to generate one view from another. Lastly, we describe a model that can handle missing views, and demonstrate that the model can generate missing views by utilizing external knowledge. We argue that the techniques the models leverage internally provide many practical benefits and immediate value in many applications.
From the modeling perspective, the model designs contributed in this thesis can be summarized under the phrase Multi-view Representation Learning (MVRL). These models are variations and extensions of shallow statistical and deep neural network approaches that can jointly optimize and exploit all views of the input content arising from different independent representations. We show that our models advance the state of the art on tasks including, but not limited to, cross-modal retrieval, cross-language text classification, image-caption generation in multiple languages, and caption generation for images containing unseen visual object categories.
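As an illustration of the correlation-based multi-view learning described in the abstract above, the following is a minimal sketch of classical canonical correlation analysis (CCA) between two views, the shallow statistical baseline that such MVRL models extend. This is not the thesis's actual models; the function name, the regularization term, and the toy data are illustrative assumptions.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-3):
    """Classical CCA: find linear projections of two views (e.g. text and
    image features) whose k-dimensional embeddings are maximally correlated."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view and cross-view covariances
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root via the symmetric eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    # SVD of the whitened cross-covariance gives the canonical directions
    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(M)
    Wx = inv_sqrt(Cxx) @ U[:, :k]   # projection for view 1
    Wy = inv_sqrt(Cyy) @ Vt[:k].T   # projection for view 2
    return Wx, Wy, s[:k]            # s holds the canonical correlations

# Toy check: two noisy views generated from the same latent signal
# should yield a high leading canonical correlation.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))                         # shared latent factors
X = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(500, 5))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
Wx, Wy, corrs = cca(X, Y, k=2)
```

Projecting both views through `Wx` and `Wy` places them in a shared space where nearest-neighbor search supports cross-modal retrieval, the first application the abstract mentions.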
Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering
ChatGPT explores a strategic blueprint of question answering (QA) in
delivering medical diagnosis, treatment recommendations, and other healthcare
support. This is achieved through the increasing incorporation of medical
domain data via natural language processing (NLP) and multimodal paradigms. By
transitioning the distribution of text, images, videos, and other modalities
from the general domain to the medical domain, these techniques have expedited
the progress of medical domain question answering (MDQA). They bridge the gap
between human natural language and sophisticated medical domain knowledge or
expert manual annotations, handling large-scale, diverse, unbalanced, or even
unlabeled data analysis scenarios in medical contexts. Central to our focus is
the use of language models and multimodal paradigms for medical question
answering, aiming to guide the research community in selecting appropriate
mechanisms for their specific medical research requirements. Specialized tasks
such as unimodal-related question answering, reading comprehension, reasoning,
diagnosis, relation extraction, probability modeling, and others, as well as
multimodal-related tasks like visual question answering, image captioning,
cross-modal retrieval, report summarization, and generation, are discussed in
detail. Each section delves into the intricate specifics of the respective
method under consideration. This paper highlights the structures and
advancements of medical domain explorations against general domain methods,
emphasizing their applications across different tasks and datasets. It also
outlines current challenges and opportunities for future medical domain
research, paving the way for continued innovation and application in this
rapidly evolving field. Comment: 50 pages, 3 figures, 3 tables
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
With the urgent demand for generalized deep models, many pre-trained big
models have been proposed, such as BERT, ViT, GPT, etc. Inspired by the success of
these models in single domains (like computer vision and natural language
processing), multi-modal pre-trained big models have also drawn more and
more attention in recent years. In this work, we give a comprehensive survey of
these models and hope this paper provides new insights and helps new
researchers track the most cutting-edge works. Specifically, we first
introduce the background of multi-modal pre-training by reviewing
conventional deep learning and pre-training works in natural language processing,
computer vision, and speech. Then, we introduce the task definition, key
challenges, and advantages of multi-modal pre-training models (MM-PTMs), and
discuss the MM-PTMs with a focus on data, objectives, network architectures,
and knowledge enhanced pre-training. After that, we introduce the downstream
tasks used for the validation of large-scale MM-PTMs, including generative,
classification, and regression tasks. We also give visualization and analysis
of the model parameters and results on representative downstream tasks.
Finally, we point out possible research directions for this topic that may
benefit future work. In addition, we maintain a continuously updated paper
list of large-scale pre-trained multi-modal big models at
https://github.com/wangxiao5791509/MultiModal_BigModels_Survey. Comment: Accepted by Machine Intelligence Research
BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
We present BIG-C (Bemba Image Grounded Conversations), a large multimodal
dataset for Bemba. While Bemba is the most widely spoken language of Zambia, it
suffers from a dearth of resources, which renders the development of language
technologies and language processing research almost impossible. The dataset
comprises multi-turn dialogues between Bemba speakers based on images,
transcribed and translated into English. There are more than 92,000
utterances/sentences, amounting to more than 180 hours of audio data with
corresponding transcriptions and English translations. We also provide
baselines on automatic speech recognition (ASR), machine translation (MT), and speech
translation (ST) tasks, and sketch out other potential future multimodal uses
of our dataset. We hope that by making the dataset available to the research
community, this work will foster research and encourage collaboration across
the language, speech, and vision communities especially for languages outside
the "traditionally" used high-resourced ones. All data and code are publicly
available: https://github.com/csikasote/bigc. Comment: accepted to ACL 2023