5 research outputs found
Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning
In sequence-to-sequence learning, the decoder relies on the attention
mechanism to efficiently extract information from the encoder. While it is
common practice to draw information from only the last encoder layer, recent
work has proposed to use representations from different encoder layers for
diversified levels of information. Nonetheless, the decoder still obtains only
a single view of the source sequences, which might lead to insufficient
training of the encoder layer stack due to the hierarchy bypassing problem. In
this work, we propose layer-wise cross-view decoding, where for each decoder
layer, together with the representations from the last encoder layer, which
serve as a global view, those from other encoder layers are supplemented for a
stereoscopic view of the source sequences. Systematic experiments show that we
successfully address the hierarchy bypassing problem and substantially improve
the performance of sequence-to-sequence learning with deep representations on
diverse tasks.
Comment: 9 pages, 6 figures
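To make the mechanism concrete, here is a minimal NumPy sketch of the cross-view idea as described in the abstract, not the authors' actual implementation: each decoder layer attends over the last encoder layer (the global view) concatenated with one other encoder layer (a layer-specific view). The pairing of decoder layer i with encoder layer i and the use of unprojected dot-product attention are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory):
    # Single-head scaled dot-product cross-attention, no learned projections.
    scores = queries @ memory.T / np.sqrt(memory.shape[-1])
    return softmax(scores) @ memory

def cross_view_decode(decoder_states, encoder_layers):
    # Hypothetical sketch: decoder layer i attends over the last encoder
    # layer (global view) concatenated with encoder layer i (local view).
    global_view = encoder_layers[-1]
    outputs = []
    for i, queries in enumerate(decoder_states):
        memory = np.concatenate([global_view, encoder_layers[i]], axis=0)
        outputs.append(cross_attention(queries, memory))
    return outputs
```

Because every decoder layer can still see the last encoder layer, the global view is preserved; the extra per-layer memory is what gives each decoder layer a different "angle" on the source.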
Layer-wise Representation Fusion for Compositional Generalization
Despite successes across a broad range of applications, sequence-to-sequence
models are argued to construct solutions less compositionally than human-like
generalization. There is mounting evidence that one of the reasons hindering
compositional generalization is that the representations in the uppermost
layers of the encoder and decoder are entangled. In other words, the syntactic
and semantic representations of sequences are twisted together
inappropriately. However, most previous studies mainly concentrate on
enhancing token-level semantic information to alleviate the representation
entanglement problem, rather than on composing and using the syntactic and
semantic representations of sequences appropriately, as humans do. In
addition, we explain why the entanglement problem exists from the perspective
of recent studies on training deeper Transformers: it stems mainly from the
``shallow'' residual connections and their simple, one-step operations, which
fail to fuse the information of previous layers effectively. Starting from
this finding and inspired by humans' strategies, we propose \textsc{FuSion}
(\textbf{Fu}sing \textbf{S}yntactic and Semant\textbf{i}c
Representati\textbf{on}s), an extension to sequence-to-sequence models that
learns to fuse previous layers' information back into the encoding and
decoding process by introducing a \emph{fuse-attention module} at each encoder
and decoder layer. \textsc{FuSion}
achieves competitive and even \textbf{state-of-the-art} results on two
realistic benchmarks, which empirically demonstrates the effectiveness of our
proposal.
Comment: work in progress. arXiv admin note: substantial text overlap with
arXiv:2305.1216
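A fuse-attention module of the kind the abstract describes could be sketched as follows. This is an illustrative guess at the mechanism, not the paper's implementation: at each position, the current layer's state attends over that position's representations from all previous layers, and the result is combined residually.

```python
import numpy as np

def fuse_attention(current, previous_layers):
    # Hypothetical fuse-attention sketch: at each position t, the current
    # layer's state attends over that position's representations from all
    # previous layers, then the attended summary is added back residually.
    mem = np.stack(previous_layers, axis=1)        # (seq_len, n_prev, d)
    d = current.shape[-1]
    fused = np.empty_like(current)
    for t in range(current.shape[0]):
        scores = mem[t] @ current[t] / np.sqrt(d)  # (n_prev,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        fused[t] = weights @ mem[t]                # (d,)
    return current + fused
```

Unlike a plain residual connection, which adds only the immediately preceding layer's output in one step, the attention weights here let each position choose which depth of representation to draw on.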
Introduction to Transformers: an NLP Perspective
Transformers have come to dominate empirical machine learning models of
natural language processing. In this paper, we introduce the basic concepts of
Transformers and present the key techniques that underlie the recent advances
of these models. This
includes a description of the standard Transformer architecture, a series of
model refinements, and common applications. Given that Transformers and related
deep learning techniques might be evolving in ways we have never seen, we
cannot dive into all the model details or cover all the technical areas.
Instead, we focus on just those concepts that are helpful for gaining a good
understanding of Transformers and their variants. We also summarize the key
ideas that impact this field, thereby yielding some insights into the strengths
and limitations of these models.
Comment: 119 pages and 21 figures
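The core operation of the standard Transformer architecture that such an introduction covers is scaled dot-product attention. A self-contained NumPy rendering of the textbook formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, looks like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        # Masked positions get a large negative score, so their
        # softmax weight is effectively zero (e.g. causal decoding).
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Multi-head attention, the encoder-decoder stack, and the refinements the survey discusses are all built on repeated applications of this primitive with learned projections of Q, K, and V.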
Exploring Syntactic Representations in Pre-trained Transformers to Improve Neural Machine Translation by a Fusion of Neural Network Architectures
Neural networks in Machine Translation (MT) engines may not consider deep linguistic knowledge, often resulting in low-quality translations. In order to improve translation quality, this study examines the feasibility of fusing two data augmentation strategies: the explicit incorporation of syntactic knowledge and the pre-trained language model BERT.
The study first investigates what BERT knows about the syntactic knowledge of source language sentences before and after MT fine-tuning through syntactic probing experiments, and uses a Quality Estimation (QE) model and the chi-square test to clarify the correlation between the syntactic knowledge of source language sentences and the quality of translations in the target language. The experimental results show that BERT can explicitly predict different types of dependency relations in source language sentences and exhibits different learning trends, which probes can reveal. Moreover, the experiments confirm that dependency relations in source language sentences correlate with, and can somewhat influence, translation quality in MT scenarios. The dependency relations of source language sentences that frequently appear in low-quality translations are identified. Probes can be linked to those dependency relations: their prediction scores tend to be higher in the middle layers of BERT than in the top layer.
The study then presents dependency relation prediction experiments to examine whether a Graph Attention Network (GAT) learns syntactic dependencies and investigates how it learns such knowledge under different combinations of attention heads and model layers. Additionally, the study examines the potential of incorporating GAT-based syntactic predictions in MT scenarios by comparing GAT with fine-tuned BERT in dependency relation prediction. Based on the paired t-test and prediction scores, GAT outperforms MT-B, a version of BERT specifically fine-tuned for MT, exhibiting higher prediction scores for the majority of dependency relations. For some dependency relations, it even outperforms UD-B, a version of BERT specifically fine-tuned for syntactic dependencies. However, GAT faces difficulties in predicting accurately depending on the quantity and subtype of dependency relations, which can lead to lower prediction scores.
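For reference, a single-head graph attention layer of the kind a GAT applies to a dependency graph can be sketched as follows; the shapes and the LeakyReLU slope are standard GAT conventions, and the dependency adjacency matrix here is a hypothetical input, not data from the study.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, adj, W, a):
    # Minimal single-head graph attention layer in the GAT style:
    # e_ij = LeakyReLU(a^T [W h_i ; W h_j]) for each edge (i, j),
    # softmax over each node's neighbourhood, then weighted aggregation.
    Z = H @ W                                   # (n_nodes, d_out)
    n = Z.shape[0]
    e = np.full((n, n), -1e9)                   # non-edges get ~zero weight
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                e[i, j] = leaky_relu(np.concatenate([Z[i], Z[j]]) @ a)
    w = np.exp(e - e.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ Z
```

When `adj` encodes the dependency arcs of a sentence (plus self-loops), each token's representation is updated from its syntactic neighbours, which is what lets the network specialise in predicting dependency relations.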
Finally, the study proposes a novel MT architecture, Syntactic knowledge via Graph attention with BERT (SGB), and examines how translation quality changes from various perspectives. The experimental results indicate that the SGB engines can improve low-quality translations across different source sentence lengths and, based on the QE scores, better recognize the syntactic structure defined by the dependency relations of source language sentences. However, improving translation quality relies on BERT correctly modeling the source language sentences; otherwise, the syntactic knowledge on the graphs has limited impact. The prediction scores of GAT for dependency relations can also be linked to improved translation quality, and GAT allows some layers of BERT to reconsider the syntactic structures of the source language sentences. Using XLM-R instead of BERT still yields improved translation quality, indicating the effectiveness of the syntactic knowledge encoded on graphs. These experiments not only show the effectiveness of the proposed strategies but also provide explanations, offering inspiration for future work that fuses graph neural networks modeling linguistic knowledge with pre-trained language models in MT scenarios.