Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule
Progress in neural grammatical error correction (GEC) is hindered by the lack
of annotated training data. Sufficient amounts of high-quality manually
annotated data are not available, so recent research has relied on generating
synthetic data, pretraining on it, and then fine-tuning on real datasets;
performance gains have been achieved either by ensembling or by using huge
pretrained models such as XXL-T5 as the backbone. In this work, we explore an
orthogonal direction: how to use available data more efficiently. First, we
propose auxiliary tasks that exploit the alignment between the original and
corrected sentences, such as predicting a sequence of corrections. We formulate
each task as a sequence-to-sequence problem and perform multi-task training.
Second, we discover that the order of datasets used for training and even
individual instances within a dataset may have important effects on the final
performance, so we set out to find the best training schedule. Together, these
two ideas lead to significant improvements, producing results that improve the
state of the art with much smaller models; in particular, we outperform the
best models based on T5-XXL (11B parameters) with a BART-based model (400M
parameters).
Comment: EMNLP 202
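Below is a minimal sketch, not the authors' implementation, of how an auxiliary "predict the sequence of corrections" task can be cast as a sequence-to-sequence problem and mixed with the main correction task. The "correct:"/"tag:" task prefixes and the edit-tag scheme are illustrative assumptions derived with a generic aligner.

```python
# A minimal sketch (not the authors' implementation): the auxiliary task of
# predicting a sequence of corrections is derived from the source/target
# alignment and formatted as a seq2seq instance alongside the main task.
# The "correct:"/"tag:" prefixes and the edit-tag scheme are assumptions.
from difflib import SequenceMatcher

def correction_tags(source_tokens, target_tokens):
    """Derive a coarse edit-tag sequence from the source/target alignment."""
    tags = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=source_tokens, b=target_tokens).get_opcodes():
        if op == "equal":
            tags += ["KEEP"] * (i2 - i1)
        elif op == "replace":
            tags += ["REPLACE_" + tok for tok in target_tokens[j1:j2]]
        elif op == "delete":
            tags += ["DELETE"] * (i2 - i1)
        else:  # insert
            tags += ["INSERT_" + tok for tok in target_tokens[j1:j2]]
    return tags

def build_multitask_instances(pairs):
    """Every (source, target) pair yields one instance per task for a shared seq2seq model."""
    instances = []
    for src, tgt in pairs:
        instances.append(("correct: " + src, tgt))                # main GEC task
        tags = correction_tags(src.split(), tgt.split())
        instances.append(("tag: " + src, " ".join(tags)))         # auxiliary task
    return instances

pairs = [("She go to school yesterday .", "She went to school yesterday .")]
for inp, out in build_multitask_instances(pairs):
    print(inp, "->", out)
```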
GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding
Grammatical error correction (GEC) is an important NLP task that is currently
usually solved with autoregressive sequence-to-sequence models. However,
approaches of this class are inherently slow due to one-by-one token
generation, so non-autoregressive alternatives are needed. In this work, we
propose a novel non-autoregressive approach to GEC that decouples the
architecture into two parts: a permutation network that outputs a
self-attention weight matrix, used in beam search to find the best permutation
of input tokens (with auxiliary {ins} tokens), and a decoder network based on a
step-unrolled denoising autoencoder that fills in specific tokens. This allows
us to find the token permutation after only one forward pass of the permutation
network, avoiding autoregressive constructions. We show that the resulting
network improves over previously known non-autoregressive methods for GEC and
reaches the level of autoregressive methods that do not use language-specific
synthetic data generation methods. Our results are supported by a comprehensive
experimental validation on the CoNLL-2014 and Write&Improve+LOCNESS datasets
and an extensive ablation study that supports our architectural and algorithmic
choices.
Comment: ACL 202
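A minimal sketch of the decoding idea, not the paper's permutation network: a matrix of pairwise "token j follows token i" log-scores, produced in a single forward pass, is searched with beam search for a high-scoring permutation of the input tokens. The random score matrix and the beam size are toy assumptions.

```python
# A minimal sketch, not the paper's implementation: beam search over a matrix
# of pairwise log-scores (index 0 plays the role of a start token) to find a
# high-scoring permutation of the input tokens.
# The random score matrix and the beam size are toy assumptions.
import numpy as np

def beam_search_permutation(log_scores, beam_size=4):
    """log_scores[i, j]: score of placing token j right after token i (0 = start)."""
    n = log_scores.shape[0] - 1                      # number of real tokens
    beams = [((0,), 0.0)]                            # (partial permutation, total score)
    for _ in range(n):
        candidates = []
        for seq, score in beams:
            for j in range(1, n + 1):
                if j not in seq:
                    candidates.append((seq + (j,), score + log_scores[seq[-1], j]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    best_seq, best_score = beams[0]
    return [i - 1 for i in best_seq[1:]], best_score  # drop start token, 0-index tokens

rng = np.random.default_rng(0)
scores = np.log(rng.random((5, 5)))                  # 4 input tokens + start row/column
order, score = beam_search_permutation(scores)
print(order, score)
```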
Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection
Real-life applications that rely heavily on machine learning, such as dialog
systems, demand out-of-domain detection methods. Intent classification models
should be equipped with a mechanism to distinguish seen intents from unseen
ones so that the dialog agent is capable of rejecting the latter and avoiding
undesired behavior. However, despite increasing attention paid to the task, the
best practices for out-of-domain intent detection have not yet been fully
established.
This paper conducts a thorough comparison of out-of-domain intent detection
methods. We prioritize methods that do not require access to out-of-domain data
during training, since gathering such data is extremely time- and labor-consuming
due to the lexical and stylistic variation of user utterances. We evaluate
multiple contextual encoders and methods that have proven efficient on three
standard datasets for intent classification, expanded with out-of-domain
utterances. Our
main findings show that fine-tuning Transformer-based encoders on in-domain
data leads to superior results. The Mahalanobis distance, applied to utterance
representations derived from Transformer-based encoders, outperforms other
methods by a wide margin and establishes new state-of-the-art results on all
datasets.
A broader analysis shows that the reason for this success is that the
fine-tuned Transformer constructs homogeneous representations of in-domain
utterances, which are geometrically separated from out-of-domain utterances;
the Mahalanobis distance, in turn, captures this disparity easily.
Comment: to appear in AAAI 202
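A minimal sketch of the scoring rule described above, not the authors' code: per-intent centroids and a shared covariance are fit on in-domain utterance embeddings from a fine-tuned Transformer encoder, and the out-of-domain score of a new utterance is its minimal Mahalanobis distance to any centroid. The random embeddings and labels in the usage example are placeholders.

```python
# A minimal sketch of the scoring rule, not the authors' code: per-intent
# centroids and a shared covariance are fit on in-domain utterance embeddings;
# the out-of-domain score of a new utterance is its minimal Mahalanobis
# distance to any centroid.  The random embeddings below are placeholders.
import numpy as np

def fit_mahalanobis(embeddings, labels):
    classes = np.unique(labels)
    centroids = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([embeddings[labels == c] - centroids[c] for c in classes])
    precision = np.linalg.pinv(centered.T @ centered / len(centered))   # tied covariance
    return centroids, precision

def ood_score(x, centroids, precision):
    """Higher = farther from every in-domain intent; threshold is tuned on held-out data."""
    return min(float((x - mu) @ precision @ (x - mu)) for mu in centroids.values())

rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 16))                 # stand-in for encoder embeddings
lab = rng.integers(0, 5, size=300)               # stand-in intent labels
centroids, precision = fit_mahalanobis(emb, lab)
print(ood_score(rng.normal(size=16), centroids, precision))
```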
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
A recent trend in multimodal retrieval is related to postprocessing test set
results via the dual-softmax loss (DSL). While this approach can bring
significant improvements, it usually presumes that an entire matrix of test
samples is available as DSL input. This work introduces a new postprocessing
approach based on Sinkhorn transformations that outperforms DSL. Further, we
propose a new postprocessing setting that does not require access to multiple
test queries. We show that our approach can significantly improve the results
of state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus
achieving a new state of the art on several standard text-video retrieval
datasets, both with access to the entire test set and in the single-query
setting.
Comment: SIGIR 202
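A minimal sketch of Sinkhorn-style postprocessing in the full-matrix setting, as an illustration rather than the paper's exact procedure: rows and columns of the exponentiated query-video similarity matrix are normalized alternately, and videos are re-ranked by the transformed scores. The temperature and iteration count are assumptions.

```python
# A minimal sketch of Sinkhorn-style postprocessing (an illustration, not the
# paper's exact procedure): alternately normalize rows and columns of the
# exponentiated query-video similarity matrix, then re-rank videos per query.
# The temperature and iteration count are assumptions.
import numpy as np

def sinkhorn_transform(sim, temperature=0.05, n_iters=10, eps=1e-9):
    """sim: [n_queries, n_videos] similarity matrix from a retrieval model."""
    mat = np.exp(sim / temperature)
    for _ in range(n_iters):
        mat = mat / (mat.sum(axis=1, keepdims=True) + eps)   # row normalization
        mat = mat / (mat.sum(axis=0, keepdims=True) + eps)   # column normalization
    return mat

rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 6))                                # toy similarities
ranking = np.argsort(-sinkhorn_transform(sim), axis=1)
print(ranking[:, 0])                                         # top-ranked video per query
```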
Artificial Text Boundary Detection with Topological Data Analysis and Sliding Window Techniques
Due to the rapid development of text generation models, people increasingly
often encounter texts that start out as written by a human but then continue as
the machine-generated output of a large language model. Detecting the boundary
between human-written and machine-generated parts of such texts is a very
challenging problem that has not received much attention in the literature. In
this work, we consider a number of different approaches to this artificial text
boundary detection problem, comparing several predictors built on features of
different nature. We show that supervised fine-tuning of the
RoBERTa model works well for this task in general but fails to generalize in
important cross-domain and cross-generator settings, demonstrating a tendency
to overfit to spurious properties of the data. Then, we propose novel
approaches based on features extracted from a frozen language model's
embeddings that are able to outperform both the human accuracy level and
previously considered baselines on the Real or Fake Text benchmark. Moreover,
we adapt perplexity-based approaches for the boundary detection task and
analyze their behaviour. We analyze the robustness of all proposed classifiers
in cross-domain and cross-model settings, discovering important properties of
the data that can negatively influence the performance of artificial text
boundary detection algorithms.
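As one concrete illustration of the setup (a toy baseline, not a method from the paper), a boundary can be placed where splitting a sequence of per-token scores, e.g. token-level perplexities or features derived from a frozen language model, best separates the two sides. The scores below are stand-ins.

```python
# A toy illustration of the boundary detection setup, not a method from the
# paper: given per-token scores (e.g., token-level perplexities or features
# from a frozen language model), place the boundary where splitting the
# sequence best separates the two sides.  The scores below are stand-ins.
import numpy as np

def detect_boundary(token_scores, min_len=5):
    """Return the split index that maximizes the gap between the two sides' means."""
    scores = np.asarray(token_scores, dtype=float)
    best_idx, best_gap = None, -np.inf
    for i in range(min_len, len(scores) - min_len):
        gap = abs(scores[i:].mean() - scores[:i].mean())
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx

# a human-written prefix (lower scores) followed by a machine-generated suffix
toy_scores = [1.0] * 40 + [2.2] * 30
print(detect_boundary(toy_scores))    # 40
```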
Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts
The rapidly increasing quality of AI-generated content makes it difficult to
distinguish between human and AI-generated texts, which may lead to undesirable
consequences for society. Therefore, it becomes increasingly important to study
the properties of human texts that are invariant across text domains and varying
proficiency levels of human writers, can be easily calculated for any language, and
can robustly separate natural and AI-generated texts regardless of the
generation model and sampling method. In this work, we propose such an
invariant of human texts, namely the intrinsic dimensionality of the manifold
underlying the set of embeddings of a given text sample. We show that the
average intrinsic dimensionality of fluent texts in natural language hovers
around a characteristic value for several alphabet-based languages (and a
different one for Chinese), while the average intrinsic dimensionality of
AI-generated texts in each language is consistently lower, with a clear
statistical separation between the human-generated and AI-generated
distributions. This property allows us to build a score-based artificial text
detector. The proposed detector's accuracy is stable over text domains,
generator models, and human writer proficiency levels, outperforming SOTA
detectors in model-agnostic and cross-domain scenarios by a significant margin.
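A minimal sketch of a score-based detector built on intrinsic dimension: estimate the dimensionality of the point cloud of token embeddings for a text and flag texts whose estimate falls below a threshold tuned on human-written data. The generic TwoNN-style estimator and the random embeddings below are stand-ins, not necessarily the estimator or features used in the paper.

```python
# A minimal sketch of a score-based detector built on intrinsic dimension.
# A generic TwoNN-style estimator over the point cloud of token embeddings is
# used as a stand-in; the estimator, the threshold, and the random embeddings
# are assumptions, not necessarily those used in the paper.
import numpy as np

def two_nn_dimension(points):
    """MLE of intrinsic dimension from ratios of 2nd to 1st nearest-neighbor distances."""
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    sorted_d = np.sort(dists, axis=1)
    mu = sorted_d[:, 1] / np.maximum(sorted_d[:, 0], 1e-12)
    return len(mu) / np.sum(np.log(np.maximum(mu, 1.0 + 1e-12)))

# texts whose estimated dimension falls below a threshold tuned on
# human-written data would be flagged as AI-generated
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(200, 16))      # stand-in for a text's embeddings
print(two_nn_dimension(token_embeddings))
```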
Topological Data Analysis for Speech Processing
We apply topological data analysis (TDA) to speech classification problems
and to the introspection of a pretrained speech model, HuBERT. To this end, we
introduce a number of topological and algebraic features derived from
Transformer attention maps and embeddings. We show that a simple linear
classifier built on top of such features outperforms a fine-tuned
classification head. In particular, we achieve notable improvements in accuracy
and equal error rate (EER) on four common datasets; on CREMA-D, the proposed
feature set reaches a new state-of-the-art accuracy. We also show that
topological features are able to reveal the functional roles of speech
Transformer heads; e.g., we find heads capable of distinguishing between pairs
of sample sources (natural/synthetic) or voices without any downstream
fine-tuning. Our results demonstrate that TDA is a promising new approach for
speech analysis, especially for tasks that require structural prediction.
Appendices, an introduction to TDA, and other additional materials are
available at https://topohubert.github.io/speech-topology-webpages/
Comment: Accepted to the INTERSPEECH 2023 conference
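A minimal sketch of the kind of features involved, not the paper's feature set: threshold a symmetrized attention map at several levels and record, for each threshold, the number of edges and connected components (the 0th Betti number) of the resulting graph; such vectors can then be fed to a simple linear classifier. The thresholds and the toy attention map are assumptions.

```python
# A minimal sketch of the kind of features involved, not the paper's feature
# set: threshold a symmetrized attention map at several levels and record the
# number of edges and connected components (Betti-0) of each resulting graph.
# The thresholds and the toy attention map are assumptions.
import numpy as np

def connected_components(adj):
    """Count connected components of an undirected graph given a boolean adjacency matrix."""
    n = len(adj)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j]:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

def attention_topology_features(attn, thresholds=(0.01, 0.05, 0.1, 0.2)):
    """attn: [seq_len, seq_len] attention weights from one head -> feature vector."""
    sym = np.maximum(attn, attn.T)                   # treat the map as an undirected graph
    feats = []
    for t in thresholds:
        adj = sym >= t
        np.fill_diagonal(adj, False)
        feats += [adj.sum() / 2, connected_components(adj)]   # edges, Betti-0
    return np.array(feats, dtype=float)

rng = np.random.default_rng(0)
toy_attention = rng.dirichlet(np.ones(12), size=12)  # rows sum to 1, like attention weights
print(attention_topology_features(toy_attention))    # input to a simple linear classifier
```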
Acceptability Judgements via Examining the Topology of Attention Maps
The role of the attention mechanism in encoding linguistic knowledge has
received special interest in NLP. However, the ability of the attention heads
to judge the grammatical acceptability of a sentence has been underexplored.
This paper approaches the paradigm of acceptability judgments with topological
data analysis (TDA), showing that the geometric properties of the attention
graph can be efficiently exploited for two standard practices in linguistics:
binary judgments and linguistic minimal pairs. Topological features improve the
scores of BERT-based acceptability classifiers on CoLA in three languages
(English, Italian, and Swedish). By revealing the topological discrepancy
between the attention maps of minimal pairs, we achieve human-level performance
on the BLiMP benchmark, outperforming nine statistical and Transformer LM
baselines. At the same time, TDA provides the foundation for analyzing the
linguistic functions of attention heads and interpreting the correspondence
between graph features and grammatical phenomena.
Comment: Accepted to EMNLP 2022 Findings
Betti numbers of attention graphs is all you really need
We apply methods of topological analysis to attention graphs calculated from
the attention heads of the BERT model (arXiv:1810.04805v2). Our research shows
that a classifier built upon basic persistent topological features (namely,
Betti numbers) of the trained neural network can achieve classification results
on par with conventional classification methods. We show the relevance of such
a topological text representation on three text classification benchmarks. To
the best of our knowledge, this is the first attempt to analyze the topology of
an attention-based neural network widely used for Natural Language Processing.
Comment: This short paper was submitted to the "Topological Data Analysis and
Beyond" Workshop at NeurIPS 2020 in July 2020 but was not accepted. The ideas
from this short paper were later developed further in arXiv:2109.04825 and
arXiv:2205.0963
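A minimal sketch of the overall recipe under toy assumptions (random attention maps and labels, a single threshold, Betti-0 only): collect one Betti-number feature per attention head for each text and fit a conventional linear classifier on top of the resulting feature vectors.

```python
# A minimal sketch of the overall recipe under toy assumptions (random
# attention maps and labels, a single threshold, Betti-0 only): one
# Betti-number feature per attention head for each text, with a conventional
# linear classifier fit on top of the resulting feature vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def betti0_per_head(attention_maps, threshold=0.1):
    """attention_maps: [n_heads, seq_len, seq_len] -> connected-component count per head."""
    feats = []
    for attn in attention_maps:
        adj = np.maximum(attn, attn.T) >= threshold
        np.fill_diagonal(adj, False)
        seen, components = set(), 0
        for start in range(len(adj)):                # count components via DFS
            if start in seen:
                continue
            components += 1
            stack = [start]
            while stack:
                v = stack.pop()
                if v not in seen:
                    seen.add(v)
                    stack.extend(np.flatnonzero(adj[v]).tolist())
        feats.append(components)
    return np.array(feats, dtype=float)

rng = np.random.default_rng(0)
n_texts, n_heads, seq_len = 40, 12, 16
X = np.stack([betti0_per_head(rng.dirichlet(np.ones(seq_len), size=(n_heads, seq_len)))
              for _ in range(n_texts)])              # toy stand-in for real attention maps
y = rng.integers(0, 2, size=n_texts)                 # toy text-classification labels
print(LogisticRegression(max_iter=1000).fit(X, y).score(X, y))
```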