Unified Language Model Pre-training for Natural Language Understanding and Generation
This paper presents a new Unified pre-trained Language Model (UniLM) that can
be fine-tuned for both natural language understanding and generation tasks. The
model is pre-trained using three types of language modeling tasks:
unidirectional, bidirectional, and sequence-to-sequence prediction. The unified
modeling is achieved by employing a shared Transformer network and utilizing
specific self-attention masks to control what context the prediction conditions
on. UniLM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0
and CoQA question answering tasks. Moreover, UniLM achieves new
state-of-the-art results on five natural language generation datasets,
including improving the CNN/DailyMail abstractive summarization ROUGE-L to
40.51 (2.04 absolute improvement), the Gigaword abstractive summarization
ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question
answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question
generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7
document-grounded dialog response generation NIST-4 to 2.67 (human performance
is 2.65). The code and pre-trained models are available at
https://github.com/microsoft/unilm. Comment: Accepted by NeurIPS-19.
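The masking idea at the core of UniLM is easy to illustrate. Below is a minimal
Python sketch, assuming a boolean convention where True means "may attend to";
the mode names and the source/target split are illustrative, not the released
UniLM code.

```python
import numpy as np

def unilm_attention_mask(seq_len: int, mode: str, src_len: int = 0) -> np.ndarray:
    """Self-attention mask for one of UniLM's three objectives (True = visible)."""
    if mode == "bidirectional":
        # Cloze-style objective: every token conditions on the full context.
        return np.ones((seq_len, seq_len), dtype=bool)
    if mode == "unidirectional":
        # Left-to-right LM: each token sees only itself and the tokens before it.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if mode == "seq2seq":
        # Source tokens (first src_len positions) attend bidirectionally within
        # the source; target tokens see the source plus the left target context.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        mask[:src_len, :src_len] = True
        return mask
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    # 3 source tokens followed by 2 target tokens.
    print(unilm_attention_mask(5, "seq2seq", src_len=3).astype(int))
```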
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training
We propose to pre-train a unified language model for both autoencoding and
partially autoregressive language modeling tasks using a novel training
procedure, referred to as a pseudo-masked language model (PMLM). Given an input
text with masked tokens, we rely on conventional masks to learn inter-relations
between corrupted tokens and context via autoencoding, and pseudo masks to
learn intra-relations between masked spans via partially autoregressive
modeling. With well-designed position embeddings and self-attention masks, the
context encodings are reused to avoid redundant computation. Moreover,
conventional masks used for autoencoding provide global masking information, so
that all the position embeddings are accessible in partially autoregressive
language modeling. In addition, the two tasks pre-train a unified language
model as a bidirectional encoder and a sequence-to-sequence decoder,
respectively. Our experiments show that the unified language models pre-trained
using PMLM achieve new state-of-the-art results on a wide range of natural
language understanding and generation tasks across several widely used
benchmarks. Comment: 11 pages.
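A rough picture of how a pseudo-masked input could be assembled is sketched
below; the mask tokens ([M], [P]), the span layout, and the position-id reuse
are assumptions based on the description above, not the released UniLMv2
preprocessing code.

```python
def build_pmlm_input(tokens, masked_spans):
    """Assemble one pseudo-masked LM input: (token list, position-id list).

    tokens       : original token strings
    masked_spans : (start, end) index pairs to corrupt, end exclusive
    """
    masked = {i for s, e in masked_spans for i in range(s, e)}

    # Autoencoding view: conventional [M] masks replace the corrupted tokens,
    # giving every position global information about what is masked.
    ae_tokens = [("[M]" if i in masked else t) for i, t in enumerate(tokens)]
    ae_positions = list(range(len(tokens)))

    # Partially autoregressive view: per span, append pseudo [P] masks followed
    # by the original tokens, all reusing the span's original position ids so
    # the context encodings computed once can be shared by both views.
    pa_tokens, pa_positions = [], []
    for start, end in masked_spans:
        pa_tokens += ["[P]"] * (end - start) + list(tokens[start:end])
        pa_positions += list(range(start, end)) * 2

    return ae_tokens + pa_tokens, ae_positions + pa_positions

if __name__ == "__main__":
    print(build_pmlm_input(["the", "cat", "sat", "on", "the", "mat"], [(1, 3)]))
```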
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Pre-trained models have achieved state-of-the-art results in various Natural
Language Processing (NLP) tasks. Recent works such as T5 and GPT-3 have shown
that scaling up pre-trained language models can improve their generalization
abilities. In particular, the GPT-3 model with 175 billion parameters shows
strong task-agnostic zero-shot/few-shot learning capabilities. Despite their
success, these large-scale models are trained on plain texts without
incorporating knowledge such as linguistic knowledge and world knowledge. In
addition, most large-scale models are trained in an auto-regressive way; as a
result, they show relatively weak performance when fine-tuned for downstream
language understanding tasks. In order
to solve the above problems, we propose a unified framework named ERNIE 3.0 for
pre-training large-scale knowledge enhanced models. It fuses auto-regressive
network and auto-encoding network, so that the trained model can be easily
tailored for both natural language understanding and generation tasks with
zero-shot learning, few-shot learning or fine-tuning. We trained the model with
10 billion parameters on a 4TB corpus consisting of plain texts and a
large-scale knowledge graph. Empirical results show that the model outperforms
the state-of-the-art models on 54 Chinese NLP tasks, and its English version
achieves first place on the SuperGLUE benchmark (July 3, 2021), surpassing
human performance by +0.8% (90.6% vs. 89.8%).
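The "fused auto-regressive and auto-encoding" setup can be pictured as a shared
backbone with task-specific towers on top. The PyTorch sketch below is a toy
illustration of that sharing pattern under assumed module names and sizes; it
is not the actual ERNIE 3.0 architecture or code.

```python
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    """Toy shared-backbone model with an auto-encoding (NLU) tower and an
    auto-regressive (NLG) tower, in the spirit of the abstract above."""

    def __init__(self, vocab_size=1000, d_model=64, nhead=4,
                 shared_layers=2, task_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.shared = nn.TransformerEncoder(make(), num_layers=shared_layers)
        self.nlu_tower = nn.TransformerEncoder(make(), num_layers=task_layers)
        self.nlg_tower = nn.TransformerEncoder(make(), num_layers=task_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, task="nlu"):
        seq_len = token_ids.size(1)
        # No mask for the bidirectional NLU objective; a causal mask (-inf above
        # the diagonal) hides future tokens for the auto-regressive NLG objective.
        mask = (None if task == "nlu" else
                torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1))
        h = self.shared(self.embed(token_ids), mask=mask)
        tower = self.nlu_tower if task == "nlu" else self.nlg_tower
        return self.lm_head(tower(h, mask=mask))

if __name__ == "__main__":
    model = UnifiedBackbone()
    ids = torch.randint(0, 1000, (2, 8))
    print(model(ids, task="nlu").shape, model(ids, task="nlg").shape)
```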
Generalizing Natural Language Analysis through Span-relation Representations
Natural language processing covers a wide variety of tasks predicting syntax,
semantics, and information content, and usually each type of output is
generated with specially designed architectures. In this paper, we provide the
simple insight that a great variety of tasks can be represented in a single
unified format consisting of labeled spans and relations between spans, so that
a single task-independent model can be used across different tasks. We perform
extensive experiments to test this insight on 10 disparate tasks spanning
dependency parsing (syntax), semantic role labeling (semantics), relation
extraction (information content), aspect based sentiment analysis (sentiment),
and many others, achieving performance comparable to state-of-the-art
specialized models. We further demonstrate benefits of multi-task learning, and
also show that the proposed method makes it easy to analyze differences and
similarities in how the model handles different tasks. Finally, we convert
these datasets into a unified format to build a benchmark, which provides a
holistic testbed for evaluating future models for generalized natural language
analysis. Comment: ACL 2020.
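The unified span-relation format lends itself to a simple data structure. The
sketch below shows one plausible encoding with a dependency-parsing example;
the field names are illustrative and do not correspond to the paper's released
benchmark schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    start: int   # index of the first token in the span
    end: int     # index one past the last token
    label: str   # task-specific span label (POS tag, predicate, aspect term, ...)

@dataclass
class Relation:
    head: int    # index into `spans` of the head span
    tail: int    # index into `spans` of the dependent span
    label: str   # task-specific relation label (dependency relation, SRL role, ...)

@dataclass
class SpanRelationExample:
    tokens: List[str]
    spans: List[Span] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# Dependency parsing expressed in the same format that also covers SRL,
# relation extraction, aspect-based sentiment analysis, and so on.
example = SpanRelationExample(
    tokens=["She", "reads", "books"],
    spans=[Span(0, 1, "PRON"), Span(1, 2, "VERB"), Span(2, 3, "NOUN")],
    relations=[Relation(1, 0, "nsubj"), Relation(1, 2, "obj")],
)
print(example)
```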
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Visual dialog is a challenging vision-language task, where a dialog agent
needs to answer a series of questions through reasoning on the image content
and dialog history. Prior work has mostly focused on various attention
mechanisms to model such intricate interactions. By contrast, in this work, we
propose VD-BERT, a simple yet effective framework of unified vision-dialog
Transformer that leverages the pretrained BERT language models for Visual
Dialog tasks. The model is unified in that (1) it captures all the interactions
between the image and the multi-turn dialog using a single-stream Transformer
encoder, and (2) it supports both answer ranking and answer generation
seamlessly through the same architecture. More crucially, we adapt BERT for the
effective fusion of vision and dialog contents via visually grounded training.
Without the need for pretraining on external vision-language data, our model
yields new state-of-the-art results, achieving the top position in both single-model
and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog
leaderboard. Our code and pretrained models are released at
https://github.com/salesforce/VD-BERT. Comment: EMNLP 2020 (14 pages).
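The single-stream layout can be sketched as one flat sequence of region
placeholders and dialog tokens with segment ids. The special tokens and segment
convention below are assumptions for illustration, not the released VD-BERT
preprocessing code.

```python
def build_vd_bert_input(num_regions, dialog_history, answer_tokens):
    """Pack image regions and the multi-turn dialog into one token sequence.

    Returns parallel lists of tokens and segment ids (0 = vision, 1 = text).
    """
    tokens = ["[CLS]"] + [f"<region_{i}>" for i in range(num_regions)] + ["[SEP]"]
    segments = [0] * len(tokens)

    for question, answer in dialog_history:          # earlier turns as context
        turn = question + answer + ["[SEP]"]
        tokens += turn
        segments += [1] * len(turn)

    # Current answer candidate: scored as a whole for answer ranking, or decoded
    # token by token (with a causal mask over this segment) for answer
    # generation -- the same encoder serves both settings.
    tokens += answer_tokens + ["[SEP]"]
    segments += [1] * (len(answer_tokens) + 1)
    return tokens, segments

if __name__ == "__main__":
    toks, segs = build_vd_bert_input(
        num_regions=3,
        dialog_history=[(["is", "it", "sunny", "?"], ["yes"])],
        answer_tokens=["a", "dog", "on", "the", "grass"],
    )
    print(list(zip(toks, segs)))
```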
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on
image-text pairs are becoming popular for vision-language tasks. While existing
methods simply concatenate image region features and text features as input to
the model to be pre-trained and use self-attention to learn image-text semantic
alignments in a brute force manner, in this paper, we propose a new learning
method Oscar (Object-Semantics Aligned Pre-training), which uses object tags
detected in images as anchor points to significantly ease the learning of
alignments. Our method is motivated by the observation that the salient objects
in an image can be accurately detected, and are often mentioned in the paired
text. We pre-train an Oscar model on the public corpus of 6.5 million
text-image pairs, and fine-tune it on downstream tasks, creating new
state-of-the-art results on six well-established vision-language understanding
and generation tasks. Comment: ECCV 2020. Code and pre-trained models are
released at https://github.com/microsoft/Oscar.
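The anchor-point idea can be shown with the input triple of caption words,
detected object tags, and region features. The packing below, including the
feature shape, is a hedged illustration rather than the released Oscar code.

```python
import numpy as np

def build_oscar_input(caption_tokens, object_tags, region_features):
    """Pack (word tokens, object tags, region features) into one training input.

    caption_tokens  : list[str], the text side
    object_tags     : list[str], object classes detected in the image
    region_features : np.ndarray of shape (num_regions, feature_dim)
    """
    assert len(object_tags) == region_features.shape[0]
    # Caption and tags share the word embedding space, so tags like "dog" act as
    # anchor points that often also occur in the paired caption.
    text_sequence = ["[CLS]"] + caption_tokens + ["[SEP]"] + object_tags + ["[SEP]"]
    return text_sequence, region_features

if __name__ == "__main__":
    feats = np.random.rand(2, 2048)  # e.g. detector region features
    seq, regions = build_oscar_input(
        ["a", "dog", "chasing", "a", "ball"], ["dog", "ball"], feats)
    print(seq, regions.shape)
```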
Neural Approaches to Conversational AI
The present paper surveys neural approaches to conversational AI that have
been developed in the last few years. We group conversational systems into
three categories: (1) question answering agents, (2) task-oriented dialogue
agents, and (3) chatbots. For each category, we present a review of
state-of-the-art neural approaches, draw the connection between them and
traditional approaches, and discuss the progress that has been made and
challenges still being faced, using specific systems and models as case
studies. Comment: Foundations and Trends in Information Retrieval (95 pages).
Unified vector space mapping for knowledge representation systems
One of the most significant problems inhibiting further developments in
Knowledge Representation and Artificial Intelligence is the problem of semantic
alignment, or knowledge mapping. Progress on this problem would greatly benefit
information retrieval, ontology alignment, relevance calculation, text mining,
natural language processing, and related areas. This paper proposes the concept
of a multidimensional global knowledge map, built through unsupervised
extraction of dependencies from a large document corpus. In addition, it
addresses the problem of a direct interface between humans and knowledge
representation systems and proposes an adaptive decoder for interacting with
the unified mapping model described above. In combination, these two approaches
are suggested as the basis for a new generation of knowledge representation
systems.
Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Image captioning aims to automatically generate a natural language
description of a given image, and most state-of-the-art models have adopted an
encoder-decoder framework. The framework consists of a convolutional neural
network (CNN)-based image encoder that extracts region-based visual features
from the input image, and a recurrent neural network (RNN)-based caption
decoder that generates the output caption words based on the visual features
with the attention mechanism. Despite the success of existing studies, current
methods only model the co-attention that characterizes the inter-modal
interactions while neglecting the self-attention that characterizes the
intra-modal interactions. Inspired by the success of the Transformer model in
machine translation, here we extend it to a Multimodal Transformer (MT) model
for image captioning. Compared to existing image captioning approaches, the MT
model simultaneously captures intra- and inter-modal interactions in a unified
attention block. Due to the in-depth modular composition of such attention
blocks, the MT model can perform complex multimodal reasoning and output
accurate captions. Moreover, to further improve the image captioning
performance, multi-view visual features are seamlessly introduced into the MT
model. We quantitatively and qualitatively evaluate our approach using the
benchmark MSCOCO image captioning dataset and conduct extensive ablation
studies to investigate the reasons behind its effectiveness. The experimental
results show that our method significantly outperforms the previous
state-of-the-art methods. With an ensemble of seven models, our solution ranks
first on the real-time leaderboard of the MSCOCO image captioning challenge at
the time of writing. Comment: submitted to a journal.
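The unified attention block can be illustrated with plain scaled dot-product
self-attention over the concatenation of region and word features, so that
intra-modal and inter-modal interactions fall out of a single operation. The
numpy sketch below is a simplified stand-in for the paper's MT block (no
projections, heads, or residual connections).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unified_attention(region_feats, word_feats):
    """One self-attention pass over [regions; words]; returns both modalities."""
    x = np.concatenate([region_feats, word_feats], axis=0)   # (R + W, d)
    scores = x @ x.T / np.sqrt(x.shape[-1])
    attn = softmax(scores, axis=-1)   # region-region and word-word (intra-modal)
    out = attn @ x                    # plus region-word (inter-modal), one block
    r = region_feats.shape[0]
    return out[:r], out[r:]

if __name__ == "__main__":
    regions = np.random.rand(4, 16)   # 4 detected region features
    words = np.random.rand(6, 16)     # 6 caption word features
    new_regions, new_words = unified_attention(regions, words)
    print(new_regions.shape, new_words.shape)
```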
Data Augmentation for Spoken Language Understanding via Pretrained Models
The training of spoken language understanding (SLU) models often faces the
problem of data scarcity. In this paper, we put forward a data augmentation
method with pretrained language models to boost the variability and accuracy of
generated utterances. Furthermore, we investigate and propose solutions to two
previously overlooked scenarios of data scarcity in SLU: i) Rich-in-Ontology:
ontology information with numerous valid dialogue acts is given; ii)
Rich-in-Utterance: a large number of unlabelled utterances are available.
Empirical results show that our method can produce synthetic training data that
boosts the performance of language understanding models in various scenarios. Comment: 6 pages, 1 figure.
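One way to realize this kind of augmentation is to prompt a pretrained language
model with a dialogue act and keep the generated utterances as labelled
synthetic data. The sketch below uses GPT-2 through the Hugging Face pipeline
API with an assumed prompt format; it illustrates the general recipe rather
than the paper's exact models or prompts.

```python
from transformers import pipeline

# Any pretrained generator works here; GPT-2 keeps the example small.
generator = pipeline("text-generation", model="gpt2")

def augment(dialogue_act: str, num_samples: int = 3):
    """Generate synthetic utterances for one dialogue act from the ontology."""
    prompt = f"Dialogue act: {dialogue_act}\nUser says:"
    outputs = generator(prompt, max_new_tokens=20, do_sample=True,
                        num_return_sequences=num_samples)
    # Each synthetic utterance inherits the dialogue act as its SLU label.
    return [(o["generated_text"][len(prompt):].strip(), dialogue_act)
            for o in outputs]

if __name__ == "__main__":
    for utterance, act in augment("inform(food=italian, area=centre)"):
        print(act, "->", utterance)
```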