Structured Training for Neural Network Transition-Based Parsing
We present structured perceptron training for neural network transition-based
dependency parsing. We learn the neural network representation using a gold
corpus augmented by a large number of automatically parsed sentences. Given
this fixed network representation, we learn a final layer using the structured
perceptron with beam-search decoding. On the Penn Treebank, our parser reaches
94.26% unlabeled and 92.41% labeled attachment accuracy, which to our knowledge
is the best accuracy on Stanford Dependencies to date. We also provide in-depth
ablative analysis to determine which aspects of our model provide the largest
gains in accuracy.
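The recipe described above (a frozen neural representation with a final layer trained as a structured perceptron under beam-search decoding) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the transition-system hooks (initial_state, is_final, legal_actions, apply), the oracle, and the featurize function (the frozen network's activations for a parser state) are hypothetical placeholders.

```python
def beam_decode(sentence, weights, ts, featurize, beam_size=8):
    """Beam search over transition sequences; weights maps each action to a vector."""
    # Each hypothesis: (score, parser state, actions taken, features of visited states).
    beam = [(0.0, ts.initial_state(sentence), [], [])]
    while not all(ts.is_final(st) for _, st, _, _ in beam):
        cands = []
        for score, st, acts, feats in beam:
            if ts.is_final(st):
                cands.append((score, st, acts, feats))
                continue
            h = featurize(st)  # activations of the *frozen* network for this state
            for a in ts.legal_actions(st):
                cands.append((score + float(weights[a] @ h),
                              ts.apply(st, a), acts + [a], feats + [h]))
        beam = sorted(cands, key=lambda c: -c[0])[:beam_size]
    score, _, acts, feats = beam[0]
    return score, acts, feats


def perceptron_epoch(corpus, weights, ts, featurize, oracle, lr=1.0):
    """One structured-perceptron pass: reward the gold sequence, penalize the prediction."""
    for sentence, gold_tree in corpus:
        _, pred_acts, pred_feats = beam_decode(sentence, weights, ts, featurize)
        gold_acts, gold_feats = oracle(sentence, gold_tree, featurize)
        if pred_acts != gold_acts:  # simplified: full-sequence update, no early update
            for a, h in zip(gold_acts, gold_feats):
                weights[a] = weights[a] + lr * h
            for a, h in zip(pred_acts, pred_feats):
                weights[a] = weights[a] - lr * h
    return weights
```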
Coreference Resolution through a seq2seq Transition-Based System
Most recent coreference resolution systems use search algorithms over
possible spans to identify mentions and resolve coreference. We instead present
a coreference resolution system that uses a text-to-text (seq2seq) paradigm to
predict mentions and links jointly. We implement the coreference system as a
transition system and use multilingual T5 as an underlying language model. We
obtain state-of-the-art accuracy on the CoNLL-2012 datasets with an 83.3 F1-score
for English (2.3 F1 higher than previous work (Dobrovolskii, 2021)) using only
CoNLL data for training, a 68.5 F1-score for Arabic (+4.1 over previous work),
and a 74.3 F1-score for Chinese (+5.3). In addition, we use the SemEval-2010
datasets for experiments in zero-shot, few-shot, and fully supervised settings
using all available training data. We achieve substantially higher zero-shot
F1-scores than previous approaches for 3 out of 4 languages and significantly
exceed previous supervised state-of-the-art results for all five tested
languages.
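A hedged sketch of a transition-based, text-to-text view of coreference is given below. The action format (MENTION start end, LINK m j) and the state layout are illustrative assumptions rather than the paper's exact formalism; in the actual system a multilingual T5 model would generate such actions as text, conditioned on the document and the partial clustering.

```python
from dataclasses import dataclass, field

@dataclass
class CorefState:
    tokens: list                                    # the input document, tokenized
    mentions: list = field(default_factory=list)    # list of (start, end) token spans
    clusters: list = field(default_factory=list)    # list of lists of mention indices

def apply_action(state: CorefState, action: str) -> CorefState:
    """Apply one predicted text action to the coreference state."""
    parts = action.split()
    if parts[0] == "MENTION":        # introduce a new mention as a singleton cluster
        start, end = int(parts[1]), int(parts[2])
        state.mentions.append((start, end))
        state.clusters.append([len(state.mentions) - 1])
    elif parts[0] == "LINK":         # merge the cluster of mention m into that of mention j
        m, j = int(parts[1]), int(parts[2])
        src = next(c for c in state.clusters if m in c)
        dst = next(c for c in state.clusters if j in c)
        if src is not dst:
            dst.extend(src)
            state.clusters.remove(src)
    return state

# Example with hypothetical model outputs for "Alice said she left".
state = CorefState(tokens="Alice said she left".split())
for act in ["MENTION 0 0", "MENTION 2 2", "LINK 1 0"]:
    state = apply_action(state, act)
print(state.clusters)   # [[0, 1]] -> "Alice" and "she" corefer
```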
Conciseness: An Overlooked Language Task
We report on novel investigations into training models that make sentences
concise. We define the task and show that it is different from related tasks
such as summarization and simplification. For evaluation, we release two test
sets, consisting of 2000 sentences each, that were annotated by two and five
human annotators, respectively. We demonstrate that conciseness is a difficult
task for which zero-shot setups with large neural language models often do not
perform well. Given the limitations of these approaches, we propose a synthetic
data generation method based on round-trip translations. Using this data to
either train Transformers from scratch or fine-tune T5 models yields our
strongest baselines that can be further improved by fine-tuning on an
artificial conciseness dataset that we derived from multi-annotator machine
translation test sets.
Comment: EMNLP 2022 Workshop on Text Simplification, Accessibility, and Readability (TSAR)
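The round-trip-translation idea for generating synthetic conciseness data can be sketched as follows. The translate and similarity hooks are hypothetical (any MT system and any sentence-similarity scorer could be plugged in), and the length and similarity thresholds are illustrative assumptions, not values from the paper.

```python
def round_trip(sentence: str, translate, pivot: str = "de") -> str:
    """English -> pivot -> English; the back-translation is often a tighter paraphrase."""
    return translate(translate(sentence, src="en", tgt=pivot), src=pivot, tgt="en")

def make_conciseness_pairs(sentences, translate, similarity, min_sim=0.85, max_ratio=0.9):
    """Keep (original, round_trip) pairs that preserve meaning but shorten the text."""
    pairs = []
    for s in sentences:
        rt = round_trip(s, translate)
        shorter = len(rt.split()) <= max_ratio * len(s.split())
        faithful = similarity(s, rt) >= min_sim     # e.g. a sentence-embedding cosine
        if shorter and faithful:
            pairs.append((s, rt))                   # train: input = verbose s, target = concise rt
    return pairs
```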
μPLAN: Summarizing using a Content Plan as Cross-Lingual Bridge
Cross-lingual summarization consists of generating a summary in one language
given an input document in a different language, allowing for the dissemination
of relevant content across speakers of other languages. The task is challenging
mainly due to the paucity of cross-lingual datasets and the compounded
difficulty of summarizing and translating. This work presents μPLAN, an
approach to cross-lingual summarization that uses an intermediate planning step
as a cross-lingual bridge. We formulate the plan as a sequence of entities
capturing the summary's content and the order in which it should be
communicated. Importantly, our plans abstract from surface form: using a
multilingual knowledge base, we align entities to their canonical designation
across languages and generate the summary conditioned on this cross-lingual
bridge and the input. Automatic and human evaluation on the XWikis dataset
(across four language pairs) demonstrates that our planning objective achieves
state-of-the-art performance in terms of informativeness and faithfulness.
Moreover, μPLAN models improve the zero-shot transfer to new cross-lingual
language pairs compared to baselines without a planning component.
Comment: EACL 202
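The planning objective described above can be sketched as follows: the target side of each training example is an ordered entity plan followed by the summary, with entities normalized to a language-independent canonical form. The kb_canonical lookup and the entity list are hypothetical hooks (e.g. an entity linker plus a Wikidata-style knowledge base), and the PLAN:/SUMMARY: separators are illustrative, not the paper's exact format.

```python
def build_plan(summary_entities, kb_canonical, target_lang="en"):
    """Order-preserving entity chain, each mention normalized across languages."""
    seen, plan = set(), []
    for mention in summary_entities:
        canonical = kb_canonical(mention, target_lang)   # e.g. "München" -> "Munich"
        if canonical not in seen:
            seen.add(canonical)
            plan.append(canonical)
    return " | ".join(plan)

def make_training_target(summary, summary_entities, kb_canonical):
    """Plan-then-summarize target: the model first generates the plan, then the summary."""
    plan = build_plan(summary_entities, kb_canonical)
    return f"PLAN: {plan} SUMMARY: {summary}"
```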