Small and Practical BERT Models for Sequence Labeling
We propose a practical scheme to train a single multilingual sequence labeling model that yields state-of-the-art results and is small and fast enough to run on a single CPU. Starting from a public multilingual BERT checkpoint, our final model is 6x smaller and 27x faster, and has higher accuracy than a state-of-the-art multilingual baseline. We show that our model performs especially well on low-resource languages and works on codemixed input text without being explicitly trained on codemixed examples. We showcase the effectiveness of our method by reporting on part-of-speech tagging and morphological prediction on 70 treebanks and 48 languages.
Comment: 11 pages including appendices; accepted to appear at EMNLP-IJCNLP 2019.
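The abstract does not spell out the compression recipe, but the end task is easy to picture. Below is a minimal sketch of multilingual token classification starting from a public multilingual BERT checkpoint, using the Hugging Face transformers library; the tag set and example sentence are illustrative, and the classification head is untrained here, so this shows only the plumbing, not the paper's distilled model.

```python
# Minimal multilingual sequence-labeling sketch; not the paper's method.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["NOUN", "VERB", "ADJ", "ADP", "PUNCT"]  # toy POS tag set (assumption)
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

enc = tokenizer("Chats love sleeping", return_tensors="pt")  # codemixed toy input
with torch.no_grad():
    logits = model(**enc).logits           # (1, seq_len, num_labels)
pred = logits.argmax(-1)                   # one tag id per wordpiece
# The head is randomly initialized, so these tags are meaningless until fine-tuned.
print([labels[i] for i in pred[0].tolist()])
```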
Structure-Level Knowledge Distillation For Multilingual Sequence Labeling
Multilingual sequence labeling is the task of predicting label sequences for multiple languages with a single unified model. Compared with relying on multiple monolingual models, a multilingual model has the benefits of a smaller model size, easier online serving, and generalizability to low-resource languages. However, current multilingual models still significantly underperform individual monolingual models due to model capacity limitations. In this paper, we propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models (teachers) into the unified multilingual model (student). We propose two novel knowledge distillation (KD) methods based on structure-level information: (1) approximately minimizing the distance between the student's and the teachers' structure-level probability distributions, and (2) aggregating the structure-level knowledge into local distributions and minimizing the distance between the two local probability distributions. Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and the teacher models.
Comment: Accepted to ACL 2020, camera-ready. 14 pages.
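A minimal sketch of the second idea, assuming standard token-level distillation: once structure-level knowledge is aggregated into local (per-token) label distributions, the student can match the teacher with a KL-divergence loss. The shapes and temperature T below are illustrative assumptions; the paper's aggregation over structured (e.g. CRF) distributions is more involved.

```python
import torch
import torch.nn.functional as F

def local_kd_loss(student_logits, teacher_logits, T=1.0):
    # student_logits, teacher_logits: (batch, seq_len, num_labels)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student), summed over tokens and normalized by batch size
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student = torch.randn(2, 8, 5)   # stand-in for multilingual student outputs
teacher = torch.randn(2, 8, 5)   # stand-in for a monolingual teacher's outputs
loss = local_kd_loss(student, teacher)
```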
Exploring and Predicting Transferability across NLP Tasks
Recent advances in NLP demonstrate the effectiveness of training large-scale
language models and transferring them to downstream tasks. Can fine-tuning
these models on tasks other than language modeling further improve performance?
In this paper, we conduct an extensive study of the transferability between 33
NLP tasks across three broad classes of problems (text classification, question
answering, and sequence labeling). Our results show that transfer learning is
more beneficial than previously thought, especially when target task data is
scarce, and can improve performance even when the source task is small or
differs substantially from the target task (e.g., part-of-speech tagging
transfers well to the DROP QA dataset). We also develop task embeddings that
can be used to predict the most transferable source tasks for a given target
task, and we validate their effectiveness in experiments controlled for source
and target data size. Overall, our experiments reveal that factors such as
source data size, task and domain similarity, and task complexity all play a
role in determining transferability.
Comment: Accepted as a conference paper at EMNLP 2020, 45 pages, 3 figures, 34 tables.
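A minimal sketch of how task embeddings could rank candidate source tasks, assuming cosine similarity as the matching score; the embeddings below are random placeholders rather than the paper's model-derived ones.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical task names with placeholder 64-d embeddings.
task_emb = {t: rng.normal(size=64) for t in ["pos", "ner", "squad", "sst2"]}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = task_emb["ner"]
ranking = sorted(
    (t for t in task_emb if t != "ner"),
    key=lambda t: cosine(task_emb[t], target),
    reverse=True,
)
print(ranking)  # most- to least-promising source tasks under this proxy
```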
Glyce: Glyph-vectors for Chinese Character Representations
It is intuitive that NLP tasks for logographic languages like Chinese should
benefit from the use of the glyph information in those languages. However, due
to the lack of rich pictographic evidence in glyphs and the weak generalization
ability of standard computer vision models on character data, an effective way
to utilize the glyph information remains to be found. In this paper, we address
this gap by presenting Glyce, the glyph-vectors for Chinese character
representations. We make three major innovations: (1) We use historical Chinese
scripts (e.g., bronzeware script, seal script, and traditional Chinese) to
enrich the pictographic evidence in characters; (2) We design CNN structures
(called tianzege-CNN) tailored to Chinese character image processing; and (3)
We use image-classification as an auxiliary task in a multi-task learning setup
to increase the model's ability to generalize. We show that glyph-based models
are able to consistently outperform word/char ID-based models in a wide range
of Chinese NLP tasks. We are able to set new state-of-the-art results for a
variety of Chinese NLP tasks, including tagging (NER, CWS, POS), sentence pair
classification, single sentence classification tasks, dependency parsing, and
semantic role labeling. For example, the proposed model achieves an F1 score of 80.6 on the OntoNotes NER dataset, +1.5 over BERT, and an almost perfect accuracy of 99.8% on the Fudan corpus for text classification. Code is available at https://github.com/ShannonAI/glyce.
Comment: Accepted by NeurIPS 2019.
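A minimal sketch of the glyph-vector idea: a small CNN maps a rendered character image to a vector, with an auxiliary image-classification head (predicting the character identity) for the multi-task objective. The architecture below is a generic stand-in, not the paper's tianzege-CNN; image size, channel counts, and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class GlyphEncoder(nn.Module):
    def __init__(self, emb_dim=128, vocab_size=6000):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, emb_dim)          # glyph vector for NLP tasks
        self.aux_head = nn.Linear(64, vocab_size)   # auxiliary char classifier

    def forward(self, glyph_images):                # (batch, 1, 24, 24)
        h = self.cnn(glyph_images)
        return self.proj(h), self.aux_head(h)

enc = GlyphEncoder()
vecs, aux_logits = enc(torch.randn(4, 1, 24, 24))  # toy glyph images
```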
QuASE: Question-Answer Driven Sentence Encoding
Question-answering (QA) data often encodes essential information in many facets. This paper studies a natural question: can we get supervision from QA data for other tasks (typically, non-QA ones)? For example, can we use QAMR (Michael et al., 2017) to improve named entity recognition? We suggest that simply further pre-training BERT is often not the best option, and propose the question-answer driven sentence encoding (QuASE) framework. QuASE learns representations from QA data, using BERT or other state-of-the-art contextual language models. In particular, we observe the need to distinguish between two types of sentence encodings, depending on whether the target task takes single- or multi-sentence input; in both cases, the resulting encoding is shown to be an easy-to-use plugin for many downstream tasks. This work may point to an alternative way to supervise NLP tasks.
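A minimal sketch of the plugin usage, assuming the QA-derived encoding is simply concatenated with a downstream model's token features before classification; both encoders below are random placeholders rather than the actual QuASE or BERT models.

```python
import torch
import torch.nn as nn

dim_task, dim_quase, num_labels, seq_len = 256, 128, 9, 12
task_encoder = nn.Linear(300, dim_task)     # placeholder downstream encoder
quase_encoder = nn.Linear(300, dim_quase)   # placeholder QA-pretrained encoder
classifier = nn.Linear(dim_task + dim_quase, num_labels)

tokens = torch.randn(1, seq_len, 300)       # toy input token features
features = torch.cat([task_encoder(tokens), quase_encoder(tokens)], dim=-1)
logits = classifier(features)               # per-token label scores, e.g. for NER
```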
A Practical Framework for Relation Extraction with Noisy Labels Based on Doubly Transitional Loss
Either human annotation or rule-based automatic labeling is an effective way to augment data for relation extraction. However, the inevitable wrong-labeling problem, for example in distant supervision, may deteriorate the performance of many existing methods. To address this issue, we introduce a practical end-to-end deep learning framework, including a standard feature extractor and a novel noisy classifier with our proposed doubly transitional mechanism. One transition is parameterized by a non-linear transformation between hidden layers that implicitly represents the conversion between the true and noisy labels, and it can be readily optimized together with the other model parameters. The other is an explicit probability transition matrix that captures the direct conversion between labels but needs to be derived with an EM algorithm. We conduct experiments on the NYT dataset and SemEval 2018 Task 7. The empirical results show comparable or better performance than state-of-the-art methods.
Comment: 10 pages.
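A minimal sketch of the explicit transition idea: the clean-label distribution is pushed through a row-stochastic matrix T to model the observed noisy labels, so training on noisy data still supervises the clean classifier. Here T is a learnable parameter for illustration; the paper derives it with an EM algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 5
T_logits = nn.Parameter(torch.zeros(num_classes, num_classes))

def noisy_nll(clean_logits, noisy_labels):
    p_clean = F.softmax(clean_logits, dim=-1)   # (batch, C) clean-label probs
    T = F.softmax(T_logits, dim=-1)             # rows sum to 1 (row-stochastic)
    p_noisy = p_clean @ T                       # probability of observed label
    return F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)

loss = noisy_nll(torch.randn(4, num_classes), torch.tensor([0, 2, 1, 4]))
```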
Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition
Interpretable rationales for model predictions play a critical role in practical applications. In this study, we develop models with an interpretable inference process for structured prediction. Specifically, we present an instance-based learning method that learns similarities between spans. At inference time, each span is assigned a class label based on its similar spans in the training set, making it easy to understand how much each training instance contributes to a prediction. Through empirical analysis on named entity recognition, we demonstrate that our method enables building models that have high interpretability without sacrificing performance.
Comment: Accepted by ACL 2020.
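A minimal sketch of the inference step, assuming cosine similarity over span representations: a query span takes the majority label of its nearest training spans, and those neighbors double as the prediction's rationale. The vectors below are random stand-ins for whatever encoder produces the span representations.

```python
import torch

train_spans = torch.randn(100, 64)             # training span representations
train_labels = torch.randint(0, 4, (100,))     # e.g. PER/ORG/LOC/MISC ids
query = torch.randn(64)                        # span to classify

sims = torch.cosine_similarity(train_spans, query.unsqueeze(0), dim=-1)
topk = sims.topk(5).indices                    # the 5 most similar spans
pred = train_labels[topk].mode().values        # majority label among them
print(pred.item(), topk.tolist())              # label plus its supporting instances
```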
What do you learn from context? Probing for sentence structure in contextualized word representations
Contextualized representation models such as ELMo (Peters et al., 2018a) and
BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a
diverse array of downstream NLP tasks. Building on recent token-level probing
work, we introduce a novel edge probing task design and construct a broad suite
of sub-sentence tasks derived from the traditional structured NLP pipeline. We
probe word-level contextual representations from four recent models and
investigate how they encode sentence structure across a range of syntactic,
semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but offer only comparatively small improvements on semantic tasks over a non-contextual baseline.
Comment: ICLR 2019 camera-ready version, 17 pages including appendices.
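A minimal sketch of an edge-probing-style classifier, assuming mean-pooled span representations over frozen contextual vectors and a small MLP; the random tensors stand in for ELMo/BERT outputs, and the span indices are illustrative.

```python
import torch
import torch.nn as nn

hidden, num_labels = 768, 10
probe = nn.Sequential(nn.Linear(2 * hidden, 256), nn.ReLU(),
                      nn.Linear(256, num_labels))

reps = torch.randn(1, 20, hidden)              # frozen contextual word vectors
span1 = reps[0, 3:6].mean(0)                   # mean-pool tokens 3..5
span2 = reps[0, 10:12].mean(0)                 # mean-pool tokens 10..11
logits = probe(torch.cat([span1, span2]))      # e.g. dependency-label scores
```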
Star-Transformer
Although the Transformer has achieved great success on many NLP tasks, its heavy structure with fully connected attention leads to a dependence on large amounts of training data. In this paper, we present the Star-Transformer, a lightweight alternative obtained by careful sparsification. To reduce model complexity, we replace the fully connected structure with a star-shaped topology, in which every pair of non-adjacent nodes is connected through a shared relay node. Complexity is thus reduced from quadratic to linear, while preserving the capacity to capture both local composition and long-range dependencies. Experiments on four tasks (22 datasets) show that the Star-Transformer achieves significant improvements over the standard Transformer on modestly sized datasets.
Comment: Accepted by NAACL 2019.
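A minimal sketch of the star-shaped connectivity as an attention mask: each satellite token may attend to its ring neighbors, itself, and a shared relay node, and the relay attends to everything, so the number of connections grows linearly with length. Only the boolean mask is built here; the actual update rules are omitted.

```python
import torch

n = 6                                  # satellite tokens; index n is the relay
allowed = torch.zeros(n + 1, n + 1, dtype=torch.bool)
for i in range(n):
    for j in (i - 1, i, i + 1):        # local ring neighbors plus self
        if 0 <= j < n:
            allowed[i, j] = True
    allowed[i, n] = True               # every satellite sees the relay
allowed[n, :] = True                   # the relay sees all nodes
print(allowed.int())                   # O(n) True entries, not O(n^2)
```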
Multipurpose Intelligent Process Automation via Conversational Assistant
Intelligent Process Automation (IPA) is an emerging technology whose primary goal is to assist the knowledge worker by taking care of repetitive, routine, and low-cognitive tasks. Conversational agents that can interact with users in natural language are a potential application for IPA systems. Such intelligent agents can assist the user by answering specific questions and executing routine tasks that are ordinarily performed in natural language (e.g., customer support). In this work, we tackle the challenge of implementing an IPA conversational assistant in a real-world industrial setting with a lack of structured training data. Our proposed system brings two significant benefits: first, it reduces repetitive and time-consuming activities and therefore allows workers to focus on more intelligent processes; second, by interacting with users, it augments its resources with structured and, to some extent, labeled training data. We showcase the use of the latter by re-implementing several components of our system with Transfer Learning (TL) methods.
Comment: Presented at the AAAI-20 Workshop on Intelligent Process Automation.