
    Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning

    In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed using representations from different encoder layers for diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which might lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise cross-view decoding, where each decoder layer combines the representations from the last encoder layer, which serve as a global view, with those from other encoder layers, which provide a stereoscopic view of the source sequences. Systematic experiments show that we successfully address the hierarchy bypassing problem and substantially improve the performance of sequence-to-sequence learning with deep representations on diverse tasks. Comment: 9 pages, 6 figures
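
    The core idea above is architectural: each decoder layer attends to more than one encoder layer. The following is a minimal, hypothetical PyTorch sketch of such a cross-view decoder layer, not the authors' implementation; the module names and the simple averaging of the two views are assumptions.

        # Hypothetical sketch of layer-wise cross-view decoding: each decoder layer
        # attends both to the last encoder layer (the global view) and to one other
        # encoder layer (an auxiliary view). Module names and the simple averaging of
        # the two views are assumptions, not the paper's exact formulation.
        import torch
        import torch.nn as nn

        class CrossViewDecoderLayer(nn.Module):
            def __init__(self, d_model=512, n_heads=8, dim_ff=2048, dropout=0.1):
                super().__init__()
                self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
                # Two cross-attention blocks, one per source view.
                self.global_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
                self.aux_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
                self.ff = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
                self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d_model), nn.LayerNorm(d_model), nn.LayerNorm(d_model)
                self.dropout = nn.Dropout(dropout)

            def forward(self, x, enc_last, enc_aux, tgt_mask=None):
                # Masked self-attention over the target prefix.
                h, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
                x = self.norm1(x + self.dropout(h))
                # Global view: attend to the last encoder layer.
                g, _ = self.global_attn(x, enc_last, enc_last)
                # Auxiliary view: attend to a lower encoder layer assigned to this decoder layer.
                a, _ = self.aux_attn(x, enc_aux, enc_aux)
                x = self.norm2(x + self.dropout(0.5 * (g + a)))  # average the two views (an assumption)
                x = self.norm3(x + self.dropout(self.ff(x)))
                return x

    Here enc_last is the output of the top encoder layer and enc_aux might be, for example, the output of the encoder layer with the same index as this decoder layer; which lower layer feeds which decoder layer is a routing choice the sketch does not pin down.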

    Layer-wise Representation Fusion for Compositional Generalization

    Despite successes across a broad range of applications, the solutions constructed by sequence-to-sequence models are argued to be less compositional than human-like generalization. There is mounting evidence that one of the reasons hindering compositional generalization is that the representations of the uppermost encoder and decoder layers are entangled; in other words, the syntactic and semantic representations of sequences are twisted inappropriately. However, most previous studies mainly concentrate on enhancing token-level semantic information to alleviate this representation entanglement, rather than composing and using the syntactic and semantic representations of sequences appropriately, as humans do. In addition, we explain why the entanglement problem exists from the perspective of recent studies on training deeper Transformers: it stems mainly from the "shallow" residual connections and their simple, one-step operations, which fail to fuse previous layers' information effectively. Starting from this finding and inspired by humans' strategies, we propose FuSion (Fusing Syntactic and Semantic Representations), an extension to sequence-to-sequence models that learns to fuse previous layers' information back into the encoding and decoding process by introducing a fuse-attention module at each encoder and decoder layer. FuSion achieves competitive and even state-of-the-art results on two realistic benchmarks, which empirically demonstrates the effectiveness of our proposal. Comment: work in progress. arXiv admin note: substantial text overlap with arXiv:2305.1216
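
    As a rough illustration of the fuse-attention idea described above (again a hypothetical sketch rather than the paper's code), each position's current state can query the stack of that position's representations from all previous layers; the module name, shapes, and residual combination below are assumptions.

        # Hypothetical sketch of a fuse-attention step: each position's current state
        # queries that position's representations from all previous layers, so earlier
        # (more syntactic) and later (more semantic) information can be recombined
        # rather than collapsed by the residual stream alone.
        import torch
        import torch.nn as nn

        class FuseAttention(nn.Module):
            def __init__(self, d_model=512, n_heads=8, dropout=0.1):
                super().__init__()
                self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
                self.norm = nn.LayerNorm(d_model)
                self.dropout = nn.Dropout(dropout)

            def forward(self, x, prev_layer_outputs):
                # x: (batch, seq, d_model); prev_layer_outputs: list of tensors of the
                # same shape produced by layers 0..l-1.
                batch, seq_len, d_model = x.shape
                memory = torch.stack(prev_layer_outputs, dim=2)        # (batch, seq, n_prev, d)
                memory = memory.reshape(batch * seq_len, -1, d_model)  # one memory per position
                query = x.reshape(batch * seq_len, 1, d_model)
                fused, _ = self.attn(query, memory, memory)            # attend over the layer history
                fused = fused.reshape(batch, seq_len, d_model)
                return self.norm(x + self.dropout(fused))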

    Introduction to Transformers: an NLP Perspective

    Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present the key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on just those concepts that are helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that impact this field, thereby yielding some insights into the strengths and limitations of these models. Comment: 119 pages and 21 figures
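
    For readers new to the topic, the single computation at the heart of the standard Transformer architecture that the survey describes is scaled dot-product attention; the snippet below is the textbook formulation in PyTorch, not code taken from the paper.

        # Textbook scaled dot-product attention, the core operation of the standard
        # Transformer architecture surveyed in the paper; not code from the paper.
        import math
        import torch

        def scaled_dot_product_attention(q, k, v, mask=None):
            # q, k, v: (batch, heads, seq_len, d_head)
            scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # query-key similarities
            if mask is not None:
                scores = scores.masked_fill(mask == 0, float("-inf"))  # block disallowed positions
            weights = torch.softmax(scores, dim=-1)                    # attention distribution
            return weights @ v                                         # weighted sum of values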

    Exploring Syntactic Representations in Pre-trained Transformers to Improve Neural Machine Translation by a Fusion of Neural Network Architectures

    Neural networks in Machine Translation (MT) engines may not consider deep linguistic knowledge, often resulting in low-quality translations. To improve translation quality, this study examines the feasibility of fusing two data augmentation strategies: explicit incorporation of syntactic knowledge and the pre-trained language model BERT.

    The study first investigates what BERT knows about the syntax of source language sentences before and after MT fine-tuning through syntactic probing experiments, and uses a Quality Estimation (QE) model and the chi-square test to clarify the correlation between syntactic knowledge of source language sentences and the quality of translations in the target language. The experimental results show that BERT can explicitly predict different types of dependency relations in source language sentences and exhibits different learning trends, which the probes can reveal. Moreover, the experiments confirm a correlation between dependency relations in source language sentences and translation quality in MT scenarios, which can influence translation quality to some extent. The dependency relations of source language sentences that frequently appear in low-quality translations are detected. Probes can be linked to those dependency relations, whose prediction scores tend to be higher in the middle layers of BERT than in the top layer.

    The study then presents dependency relation prediction experiments to examine whether a Graph Attention Network (GAT) learns syntactic dependencies and investigates how it learns such knowledge under different combinations of attention heads and model layers. Additionally, the study examines the potential of incorporating GAT-based syntactic predictions in MT scenarios by comparing GAT with fine-tuned BERT on dependency relation prediction. Based on the paired t-test and prediction scores, GAT outperforms MT-B, a version of BERT specifically fine-tuned for MT, exhibiting higher prediction scores for the majority of dependency relations. For some dependency relations, it even outperforms UD-B, a version of BERT specifically fine-tuned for syntactic dependencies. However, GAT has difficulty predicting accurately depending on the quantity and subtype of dependency relations, which can lead to lower prediction scores.

    Finally, the study proposes a novel MT architecture, Syntactic knowledge via Graph attention with BERT (SGB), and examines how translation quality changes from various perspectives. The experimental results indicate that the SGB engines can improve low-quality translations across different source language sentence lengths and, based on the QE scores, better recognize the syntactic structure defined by the dependency relations of source language sentences. However, improving translation quality relies on BERT correctly modeling the source language sentences; otherwise, the syntactic knowledge on the graphs has limited impact. The prediction scores of GAT for dependency relations can also be linked to improved translation quality, and GAT allows some layers of BERT to reconsider the syntactic structures of the source language sentences. Using XLM-R instead of BERT still results in improved translation quality, indicating the efficiency of syntactic knowledge on graphs. These experiments not only show the effectiveness of the proposed strategies but also provide explanations, offering inspiration for future work that fuses graph neural networks modeling linguistic knowledge with pre-trained language models in MT scenarios.
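
    To make the GAT-plus-BERT fusion concrete, the sketch below shows one plausible (assumed, not the thesis' exact) arrangement: contextual token embeddings from BERT are refined by a graph attention layer over the sentence's dependency edges and combined with the originals before entering the MT model. It uses torch_geometric's GATConv; the residual combination and all names are illustrative assumptions.

        # Hypothetical sketch: BERT token embeddings for one sentence are refined by a
        # graph attention layer over the sentence's dependency edges, then fused back
        # with the originals before being passed to the MT model.
        import torch
        import torch.nn as nn
        from torch_geometric.nn import GATConv

        class SyntacticFusion(nn.Module):
            def __init__(self, hidden=768, heads=4):
                super().__init__()
                # concat=False keeps the GAT output dimension equal to `hidden`.
                self.gat = GATConv(hidden, hidden, heads=heads, concat=False)
                self.norm = nn.LayerNorm(hidden)

            def forward(self, token_embeddings, edge_index):
                # token_embeddings: (num_tokens, hidden) contextual embeddings from BERT (or XLM-R).
                # edge_index: (2, num_edges) head/dependent index pairs from a dependency parse.
                syntactic = self.gat(token_embeddings, edge_index)
                return self.norm(token_embeddings + syntactic)  # residual fusion (assumed)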