Introduction to Transformers: an NLP Perspective
Transformers have dominated empirical machine learning models of natural
language processing. In this paper, we introduce basic concepts of Transformers
and present key techniques that form the recent advances of these models. This
includes a description of the standard Transformer architecture, a series of
model refinements, and common applications. Given that Transformers and related
deep learning techniques might be evolving in ways we have never seen, we
cannot dive into all the model details or cover all the technical areas.
Instead, we focus on just those concepts that are helpful for gaining a good
understanding of Transformers and their variants. We also summarize the key
ideas that impact this field, thereby yielding some insights into the strengths
and limitations of these models.
Comment: 119 pages and 21 figures
Towards Bidirectional Hierarchical Representations for Attention-Based Neural Machine Translation
This paper proposes a hierarchical attentional neural translation model which
focuses on enhancing source-side hierarchical representations by covering both
local and global semantic information using a bidirectional tree-based encoder.
To maximize the predictive likelihood of target words, a weighted variant of an
attention mechanism is used to balance the attentive information between
lexical and phrase vectors. Using a tree-based rare word encoding, the proposed
model is extended to sub-word level to alleviate the out-of-vocabulary (OOV)
problem. Empirical results reveal that the proposed model significantly
outperforms sequence-to-sequence attention-based and tree-based neural
translation models in English-Chinese translation tasks.
Comment: Accepted for publication at EMNLP 201
Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation
Pre-training and fine-tuning is a paradigm for alleviating the data scarcity
problem in end-to-end speech translation (E2E ST). The commonplace "modality
gap" between speech and text data often leads to inconsistent inputs between
pre-training and fine-tuning. However, we observe that this gap occurs in the
early stages of fine-tuning, but does not have a major impact on the final
performance. On the other hand, we find that there is another gap, which we
call the "capacity gap": high-resource tasks (such as ASR and MT) always
require a large model to fit, and when that model is reused for a low-resource
task (E2E ST), it performs sub-optimally due to over-fitting. In a
case study, we find that the regularization plays a more important role than
the well-designed modality adaption method, which achieves 29.0 for en-de and
40.3 for en-fr on the MuST-C dataset. Code and models are available at
https://github.com/hannlp/TAB.
Comment: ACL 2023 Main Conference
Quantum Phase Recognition via Quantum Kernel Methods
The application of quantum computation to accelerate machine learning
algorithms is one of the most promising areas of research in quantum
algorithms. In this paper, we explore the power of quantum learning algorithms
in solving an important class of Quantum Phase Recognition (QPR) problems,
which are crucially important in understanding many-particle quantum systems.
We prove that, under widely believed complexity theory assumptions, there
exists a wide range of QPR problems that cannot be efficiently solved by
classical learning algorithms with classical resources. Using a quantum
computer, by contrast, we prove the efficiency and robustness of quantum kernel methods in
solving QPR problems through Linear order parameter Observables. We numerically
benchmark our algorithm for a variety of problems, including recognizing
symmetry-protected topological phases and symmetry-broken phases. Our results
highlight the capability of quantum machine learning in predicting such quantum
phase transitions in many-particle systems.
An Efficient Transformer Decoder with Compressed Sub-layers
The large attention-based encoder-decoder network (Transformer) has recently
become prevalent due to its effectiveness. However, the high computational
complexity of its decoder makes it inefficient. By examining the
mathematical formulation of the decoder, we show that under some mild conditions,
the architecture can be simplified by compressing its sub-layers, the basic
building blocks of the Transformer, which yields higher parallelism. We thereby
propose Compressed Attention Network, whose decoder layer consists of only one
sub-layer instead of three. Extensive experiments on 14 WMT machine translation
tasks show that our model is 1.42x faster with performance on par with a strong
baseline. This strong baseline is already 2x faster than the widely used
standard baseline without loss in performance.
Comment: accepted by AAAI202
Towards Robust Aspect-based Sentiment Analysis through Non-counterfactual Augmentations
While state-of-the-art NLP models have demonstrated excellent performance for
aspect-based sentiment analysis (ABSA), substantial evidence has been presented
on their lack of robustness. This is especially manifested as significant
degradation in performance when faced with out-of-distribution data. Recent
solutions that rely on counterfactually augmented datasets show promising
results, but they are inherently limited because of the lack of access to
explicit causal structure. In this paper, we present an alternative approach
that relies on non-counterfactual data augmentation. Our proposal instead
relies on using noisy, cost-efficient data augmentations that preserve
semantics associated with the target aspect. Our approach then relies on
modelling invariances between different versions of the data to improve
robustness. A comprehensive suite of experiments shows that our proposal
significantly improves upon strong pre-trained baselines on both standard and
robustness-specific datasets. Our approach further establishes a new
state-of-the-art on the ABSA robustness benchmark and transfers well across
domains.
Comment: 10 pages, 1 figure, 10 tables