Unified Language Model Pre-training for Natural Language Understanding and Generation
This paper presents a new Unified pre-trained Language Model (UniLM) that can
be fine-tuned for both natural language understanding and generation tasks. The
model is pre-trained using three types of language modeling tasks:
unidirectional, bidirectional, and sequence-to-sequence prediction. The unified
modeling is achieved by employing a shared Transformer network and utilizing
specific self-attention masks to control what context the prediction conditions
on. UniLM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0
and CoQA question answering tasks. Moreover, UniLM achieves new
state-of-the-art results on five natural language generation datasets,
including improving the CNN/DailyMail abstractive summarization ROUGE-L to
40.51 (2.04 absolute improvement), the Gigaword abstractive summarization
ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question
answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question
generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7
document-grounded dialog response generation NIST-4 to 2.67 (human performance
is 2.65). The code and pre-trained models are available at
https://github.com/microsoft/unilm. Comment: Accepted by NeurIPS-19.
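The masking idea at the core of UniLM is easy to illustrate. Below is a minimal
Python sketch, assuming a boolean convention where True means "may attend to";
the mode names and the source/target split are illustrative, not the released
UniLM code.

```python
import numpy as np

def unilm_attention_mask(seq_len: int, mode: str, src_len: int = 0) -> np.ndarray:
    """Self-attention mask for one of UniLM's three objectives (True = visible)."""
    if mode == "bidirectional":
        # Cloze-style objective: every token conditions on the full context.
        return np.ones((seq_len, seq_len), dtype=bool)
    if mode == "unidirectional":
        # Left-to-right LM: each token sees only itself and the tokens before it.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if mode == "seq2seq":
        # Source tokens (first src_len positions) attend bidirectionally within
        # the source; target tokens see the source plus the left target context.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        mask[:src_len, :src_len] = True
        return mask
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    # 3 source tokens followed by 2 target tokens.
    print(unilm_attention_mask(5, "seq2seq", src_len=3).astype(int))
```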
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training
We propose to pre-train a unified language model for both autoencoding and
partially autoregressive language modeling tasks using a novel training
procedure, referred to as a pseudo-masked language model (PMLM). Given an input
text with masked tokens, we rely on conventional masks to learn inter-relations
between corrupted tokens and context via autoencoding, and pseudo masks to
learn intra-relations between masked spans via partially autoregressive
modeling. With well-designed position embeddings and self-attention masks, the
context encodings are reused to avoid redundant computation. Moreover,
conventional masks used for autoencoding provide global masking information, so
that all the position embeddings are accessible in partially autoregressive
language modeling. In addition, the two tasks pre-train a unified language
model as a bidirectional encoder and a sequence-to-sequence decoder,
respectively. Our experiments show that the unified language models pre-trained
using PMLM achieve new state-of-the-art results on a wide range of natural
language understanding and generation tasks across several widely used
benchmarks. Comment: 11 pages.
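A rough picture of how a pseudo-masked input could be assembled is sketched
below; the mask tokens ([M], [P]), the span layout, and the position-id reuse
are assumptions based on the description above, not the released UniLMv2
preprocessing code.

```python
def build_pmlm_input(tokens, masked_spans):
    """Assemble one pseudo-masked LM input: (token list, position-id list).

    tokens       : original token strings
    masked_spans : (start, end) index pairs to corrupt, end exclusive
    """
    masked = {i for s, e in masked_spans for i in range(s, e)}

    # Autoencoding view: conventional [M] masks replace the corrupted tokens,
    # giving every position global information about what is masked.
    ae_tokens = [("[M]" if i in masked else t) for i, t in enumerate(tokens)]
    ae_positions = list(range(len(tokens)))

    # Partially autoregressive view: per span, append pseudo [P] masks followed
    # by the original tokens, all reusing the span's original position ids so
    # the context encodings computed once can be shared by both views.
    pa_tokens, pa_positions = [], []
    for start, end in masked_spans:
        pa_tokens += ["[P]"] * (end - start) + list(tokens[start:end])
        pa_positions += list(range(start, end)) * 2

    return ae_tokens + pa_tokens, ae_positions + pa_positions

if __name__ == "__main__":
    print(build_pmlm_input(["the", "cat", "sat", "on", "the", "mat"], [(1, 3)]))
```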
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Pre-trained models have achieved state-of-the-art results in various Natural
Language Processing (NLP) tasks. Recent works such as T5 and GPT-3 have shown
that scaling up pre-trained language models can improve their generalization
abilities. In particular, the GPT-3 model with 175 billion parameters shows
strong task-agnostic zero-shot/few-shot learning capabilities. Despite their
success, these large-scale models are trained on plain texts without
incorporating knowledge such as linguistic knowledge and world knowledge. In
addition, most large-scale models are trained in an auto-regressive way; as a
result, they show relatively weak performance when fine-tuned for downstream
language understanding tasks. In order
to solve the above problems, we propose a unified framework named ERNIE 3.0 for
pre-training large-scale knowledge enhanced models. It fuses auto-regressive
network and auto-encoding network, so that the trained model can be easily
tailored for both natural language understanding and generation tasks with
zero-shot learning, few-shot learning or fine-tuning. We trained the model with
10 billion parameters on a 4TB corpus consisting of plain texts and a
large-scale knowledge graph. Empirical results show that the model outperforms
the state-of-the-art models on 54 Chinese NLP tasks, and its English version
achieves first place on the SuperGLUE benchmark (July 3, 2021), surpassing
human performance by +0.8% (90.6% vs. 89.8%).
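The "fused auto-regressive and auto-encoding" setup can be pictured as a shared
backbone with task-specific towers on top. The PyTorch sketch below is a toy
illustration of that sharing pattern under assumed module names and sizes; it
is not the actual ERNIE 3.0 architecture or code.

```python
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    """Toy shared-backbone model with an auto-encoding (NLU) tower and an
    auto-regressive (NLG) tower, in the spirit of the abstract above."""

    def __init__(self, vocab_size=1000, d_model=64, nhead=4,
                 shared_layers=2, task_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.shared = nn.TransformerEncoder(make(), num_layers=shared_layers)
        self.nlu_tower = nn.TransformerEncoder(make(), num_layers=task_layers)
        self.nlg_tower = nn.TransformerEncoder(make(), num_layers=task_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, task="nlu"):
        seq_len = token_ids.size(1)
        # No mask for the bidirectional NLU objective; a causal mask (-inf above
        # the diagonal) hides future tokens for the auto-regressive NLG objective.
        mask = (None if task == "nlu" else
                torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1))
        h = self.shared(self.embed(token_ids), mask=mask)
        tower = self.nlu_tower if task == "nlu" else self.nlg_tower
        return self.lm_head(tower(h, mask=mask))

if __name__ == "__main__":
    model = UnifiedBackbone()
    ids = torch.randint(0, 1000, (2, 8))
    print(model(ids, task="nlu").shape, model(ids, task="nlg").shape)
```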
Generalizing Natural Language Analysis through Span-relation Representations
Natural language processing covers a wide variety of tasks predicting syntax,
semantics, and information content, and usually each type of output is
generated with specially designed architectures. In this paper, we provide the
simple insight that a great variety of tasks can be represented in a single
unified format consisting of labeled spans and relations between spans, so that
a single task-independent model can be used across different tasks. We perform
extensive experiments to test this insight on 10 disparate tasks spanning
dependency parsing (syntax), semantic role labeling (semantics), relation
extraction (information content), aspect based sentiment analysis (sentiment),
and many others, achieving performance comparable to state-of-the-art
specialized models. We further demonstrate benefits of multi-task learning, and
also show that the proposed method makes it easy to analyze differences and
similarities in how the model handles different tasks. Finally, we convert
these datasets into a unified format to build a benchmark, which provides a
holistic testbed for evaluating future models for generalized natural language
analysis. Comment: ACL 2020.
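The unified span-relation format lends itself to a simple data structure. The
sketch below shows one plausible encoding with a dependency-parsing example;
the field names are illustrative and do not correspond to the paper's released
benchmark schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    start: int   # index of the first token in the span
    end: int     # index one past the last token
    label: str   # task-specific span label (POS tag, predicate, aspect term, ...)

@dataclass
class Relation:
    head: int    # index into `spans` of the head span
    tail: int    # index into `spans` of the dependent span
    label: str   # task-specific relation label (dependency relation, SRL role, ...)

@dataclass
class SpanRelationExample:
    tokens: List[str]
    spans: List[Span] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# Dependency parsing expressed in the same format that also covers SRL,
# relation extraction, aspect-based sentiment analysis, and so on.
example = SpanRelationExample(
    tokens=["She", "reads", "books"],
    spans=[Span(0, 1, "PRON"), Span(1, 2, "VERB"), Span(2, 3, "NOUN")],
    relations=[Relation(1, 0, "nsubj"), Relation(1, 2, "obj")],
)
print(example)
```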
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Visual dialog is a challenging vision-language task, where a dialog agent
needs to answer a series of questions through reasoning on the image content
and dialog history. Prior work has mostly focused on various attention
mechanisms to model such intricate interactions. By contrast, in this work, we
propose VD-BERT, a simple yet effective framework of unified vision-dialog
Transformer that leverages the pretrained BERT language models for Visual
Dialog tasks. The model is unified in that (1) it captures all the interactions
between the image and the multi-turn dialog using a single-stream Transformer
encoder, and (2) it supports both answer ranking and answer generation
seamlessly through the same architecture. More crucially, we adapt BERT for the
effective fusion of vision and dialog contents via visually grounded training.
Without the need for pretraining on external vision-language data, our model
yields new state-of-the-art results, achieving the top position in both single-model
and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog
leaderboard. Our code and pretrained models are released at
https://github.com/salesforce/VD-BERT. Comment: EMNLP 2020 (14 pages).
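The single-stream layout can be sketched as one flat sequence of region
placeholders and dialog tokens with segment ids. The special tokens and segment
convention below are assumptions for illustration, not the released VD-BERT
preprocessing code.

```python
def build_vd_bert_input(num_regions, dialog_history, answer_tokens):
    """Pack image regions and the multi-turn dialog into one token sequence.

    Returns parallel lists of tokens and segment ids (0 = vision, 1 = text).
    """
    tokens = ["[CLS]"] + [f"<region_{i}>" for i in range(num_regions)] + ["[SEP]"]
    segments = [0] * len(tokens)

    for question, answer in dialog_history:          # earlier turns as context
        turn = question + answer + ["[SEP]"]
        tokens += turn
        segments += [1] * len(turn)

    # Current answer candidate: scored as a whole for answer ranking, or decoded
    # token by token (with a causal mask over this segment) for answer
    # generation -- the same encoder serves both settings.
    tokens += answer_tokens + ["[SEP]"]
    segments += [1] * (len(answer_tokens) + 1)
    return tokens, segments

if __name__ == "__main__":
    toks, segs = build_vd_bert_input(
        num_regions=3,
        dialog_history=[(["is", "it", "sunny", "?"], ["yes"])],
        answer_tokens=["a", "dog", "on", "the", "grass"],
    )
    print(list(zip(toks, segs)))
```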
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on
image-text pairs are becoming popular for vision-language tasks. While existing
methods simply concatenate image region features and text features as input to
the model to be pre-trained and use self-attention to learn image-text semantic
alignments in a brute force manner, in this paper, we propose a new learning
method Oscar (Object-Semantics Aligned Pre-training), which uses object tags
detected in images as anchor points to significantly ease the learning of
alignments. Our method is motivated by the observation that the salient objects
in an image can be accurately detected, and are often mentioned in the paired
text. We pre-train an Oscar model on the public corpus of 6.5 million
text-image pairs, and fine-tune it on downstream tasks, creating new
state-of-the-art results on six well-established vision-language understanding
and generation tasks. Comment: ECCV 2020. Code and pre-trained models are
released at https://github.com/microsoft/Oscar.
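The anchor-point idea can be shown with the input triple of caption words,
detected object tags, and region features. The packing below, including the
feature shape, is a hedged illustration rather than the released Oscar code.

```python
import numpy as np

def build_oscar_input(caption_tokens, object_tags, region_features):
    """Pack (word tokens, object tags, region features) into one training input.

    caption_tokens  : list[str], the text side
    object_tags     : list[str], object classes detected in the image
    region_features : np.ndarray of shape (num_regions, feature_dim)
    """
    assert len(object_tags) == region_features.shape[0]
    # Caption and tags share the word embedding space, so tags like "dog" act as
    # anchor points that often also occur in the paired caption.
    text_sequence = ["[CLS]"] + caption_tokens + ["[SEP]"] + object_tags + ["[SEP]"]
    return text_sequence, region_features

if __name__ == "__main__":
    feats = np.random.rand(2, 2048)  # e.g. detector region features
    seq, regions = build_oscar_input(
        ["a", "dog", "chasing", "a", "ball"], ["dog", "ball"], feats)
    print(seq, regions.shape)
```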
Neural Approaches to Conversational AI
The present paper surveys neural approaches to conversational AI that have
been developed in the last few years. We group conversational systems into
three categories: (1) question answering agents, (2) task-oriented dialogue
agents, and (3) chatbots. For each category, we present a review of
state-of-the-art neural approaches, draw the connection between them and
traditional approaches, and discuss the progress that has been made and
challenges still being faced, using specific systems and models as case
studies. Comment: Foundations and Trends in Information Retrieval (95 pages).
Unified vector space mapping for knowledge representation systems
One of the most significant problems inhibiting further developments in
Knowledge Representation and Artificial Intelligence is the problem of semantic
alignment, or knowledge mapping. Progress on this problem would greatly benefit
information retrieval, ontology alignment, relevance calculation, text mining,
natural language processing, and related areas. This paper proposes the concept
of a multidimensional global knowledge map, built through unsupervised
extraction of dependencies from a large document corpus. In addition, it
addresses the problem of a direct interface between humans and knowledge
representation systems and proposes an adaptive decoder for interacting with
the unified mapping model described above. In combination, these two approaches
are suggested as the basis for a new generation of knowledge representation
systems.
Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Image captioning aims to automatically generate a natural language
description of a given image, and most state-of-the-art models have adopted an
encoder-decoder framework. The framework consists of a convolutional neural
network (CNN)-based image encoder that extracts region-based visual features
from the input image, and a recurrent neural network (RNN)-based caption
decoder that generates the output caption words based on the visual features
with the attention mechanism. Despite the success of existing studies, current
methods only model the co-attention that characterizes the inter-modal
interactions while neglecting the self-attention that characterizes the
intra-modal interactions. Inspired by the success of the Transformer model in
machine translation, here we extend it to a Multimodal Transformer (MT) model
for image captioning. Compared to existing image captioning approaches, the MT
model simultaneously captures intra- and inter-modal interactions in a unified
attention block. Due to the in-depth modular composition of such attention
blocks, the MT model can perform complex multimodal reasoning and output
accurate captions. Moreover, to further improve the image captioning
performance, multi-view visual features are seamlessly introduced into the MT
model. We quantitatively and qualitatively evaluate our approach using the
benchmark MSCOCO image captioning dataset and conduct extensive ablation
studies to investigate the reasons behind its effectiveness. The experimental
results show that our method significantly outperforms the previous
state-of-the-art methods. With an ensemble of seven models, our solution ranks
first on the real-time leaderboard of the MSCOCO image captioning challenge at
the time of writing. Comment: submitted to a journal.
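The unified attention block can be illustrated with plain scaled dot-product
self-attention over the concatenation of region and word features, so that
intra-modal and inter-modal interactions fall out of a single operation. The
numpy sketch below is a simplified stand-in for the paper's MT block (no
projections, heads, or residual connections).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unified_attention(region_feats, word_feats):
    """One self-attention pass over [regions; words]; returns both modalities."""
    x = np.concatenate([region_feats, word_feats], axis=0)   # (R + W, d)
    scores = x @ x.T / np.sqrt(x.shape[-1])
    attn = softmax(scores, axis=-1)   # region-region and word-word (intra-modal)
    out = attn @ x                    # plus region-word (inter-modal), one block
    r = region_feats.shape[0]
    return out[:r], out[r:]

if __name__ == "__main__":
    regions = np.random.rand(4, 16)   # 4 detected region features
    words = np.random.rand(6, 16)     # 6 caption word features
    new_regions, new_words = unified_attention(regions, words)
    print(new_regions.shape, new_words.shape)
```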
Data Augmentation for Spoken Language Understanding via Pretrained Models
The training of spoken language understanding (SLU) models often faces the
problem of data scarcity. In this paper, we put forward a data augmentation
method with pretrained language models to boost the variability and accuracy of
generated utterances. Furthermore, we investigate and propose solutions to two
previously overlooked scenarios of data scarcity in SLU: i) Rich-in-Ontology:
ontology information with numerous valid dialogue acts is given; ii)
Rich-in-Utterance: a large number of unlabelled utterances are available.
Empirical results show that our method can produce synthetic training data that
boosts the performance of language understanding models in various scenarios. Comment: 6 pages, 1 figure.
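One way to realize this kind of augmentation is to prompt a pretrained language
model with a dialogue act and keep the generated utterances as labelled
synthetic data. The sketch below uses GPT-2 through the Hugging Face pipeline
API with an assumed prompt format; it illustrates the general recipe rather
than the paper's exact models or prompts.

```python
from transformers import pipeline

# Any pretrained generator works here; GPT-2 keeps the example small.
generator = pipeline("text-generation", model="gpt2")

def augment(dialogue_act: str, num_samples: int = 3):
    """Generate synthetic utterances for one dialogue act from the ontology."""
    prompt = f"Dialogue act: {dialogue_act}\nUser says:"
    outputs = generator(prompt, max_new_tokens=20, do_sample=True,
                        num_return_sequences=num_samples)
    # Each synthetic utterance inherits the dialogue act as its SLU label.
    return [(o["generated_text"][len(prompt):].strip(), dialogue_act)
            for o in outputs]

if __name__ == "__main__":
    for utterance, act in augment("inform(food=italian, area=centre)"):
        print(act, "->", utterance)
```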