Lightweight Cross-Lingual Sentence Representation Learning
Large-scale models for learning fixed-dimensional cross-lingual sentence
representations, such as LASER (Artetxe and Schwenk, 2019b), yield significant
performance improvements on downstream tasks. However, further scaling or
modifying such large-scale models is usually impractical due to memory
limitations. In this work, we introduce a lightweight dual-transformer
architecture with just 2 layers for generating memory-efficient cross-lingual
sentence representations. We explore different training tasks and observe that
existing cross-lingual training objectives fall short for this shallow
architecture. To address this, we propose a novel cross-lingual language
model, which combines the existing single-word masked language model with the
newly proposed cross-lingual token-level reconstruction task. We further
augment training with two computationally lightweight sentence-level
contrastive learning tasks that enhance the alignment of the cross-lingual
sentence representation space, compensating for the learning bottleneck of the
lightweight transformer on generative tasks. Our comparisons
with competing models on cross-lingual sentence retrieval and multilingual
document classification confirm the effectiveness of the newly proposed
training tasks for a shallow model.
Comment: ACL 2021
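As a rough sketch (not the paper's implementation), the sentence-level contrastive alignment described above can be illustrated with an in-batch InfoNCE objective over parallel sentence pairs; the function name, embedding dimensions, and temperature below are illustrative assumptions:

```python
# Hedged sketch of a sentence-level contrastive alignment loss for paired
# bilingual sentence embeddings. Translation pairs act as positives and all
# other in-batch pairs as negatives.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_emb, tgt_emb, temperature=0.05):
    """src_emb, tgt_emb: (batch, dim) embeddings of parallel sentences."""
    src = F.normalize(src_emb, dim=-1)  # unit norm -> dot product = cosine similarity
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature                    # (batch, batch) similarities
    labels = torch.arange(src.size(0), device=src.device)   # diagonal pairs are positives
    # Symmetric InfoNCE: retrieve the translation given the source and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Usage with random tensors standing in for dual-encoder outputs:
src, tgt = torch.randn(8, 256), torch.randn(8, 256)
loss = contrastive_alignment_loss(src, tgt)
```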
Disentangled Code Representation Learning for Multiple Programming Languages
Developing effective distributed representations of source code is fundamental yet challenging for many software engineering tasks such as code clone detection, code search, and code translation and transformation. However, current code embedding approaches, which represent the semantics and syntax of code in a mixed way, are less interpretable, and the resulting embeddings cannot be easily generalized across programming languages. In this paper, we propose a disentangled code representation learning approach that separates the semantics from the syntax of source code under a multi-programming-language setting, obtaining better interpretability and generalizability. Specifically, we design three losses dedicated to the characteristics of source code to enforce the disentanglement effectively. We conduct comprehensive experiments on a real-world dataset composed of programming exercises, each implemented with multiple solutions that are semantically identical but syntactically distinct. The experimental results validate the superiority of our proposed disentangled code representation over several baselines across three types of downstream tasks, i.e., code clone detection, code translation, and code-to-code search.
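To make the disentanglement idea concrete, here is a minimal sketch under assumed design choices; the half-and-half embedding split, the MSE semantic loss, and the language-classifier loss are illustrative stand-ins, not the paper's three losses. The intuition: semantically equivalent programs should agree on a semantic subspace, while a syntax subspace should remain predictive of the programming language.

```python
# Hedged sketch of a disentanglement objective: split each code embedding into
# a "semantic" half and a "syntax" half, pull semantic halves of equivalent
# programs together, and keep syntax halves language-predictive.
import torch
import torch.nn.functional as F

def disentanglement_losses(emb_a, emb_b, lang_a, lang_b, lang_classifier):
    """emb_a, emb_b: (batch, dim) embeddings of semantically equivalent
    programs, possibly in different languages; lang_a, lang_b: (batch,)
    language labels; lang_classifier: maps the syntax half to language logits."""
    half = emb_a.size(-1) // 2
    sem_a, syn_a = emb_a[:, :half], emb_a[:, half:]
    sem_b, syn_b = emb_b[:, :half], emb_b[:, half:]
    # Semantic loss: equivalent solutions should share the semantic subspace.
    sem_loss = F.mse_loss(sem_a, sem_b)
    # Syntax loss: the syntax subspace should still predict the language.
    syn_loss = (F.cross_entropy(lang_classifier(syn_a), lang_a) +
                F.cross_entropy(lang_classifier(syn_b), lang_b))
    return sem_loss, syn_loss

# Usage with stand-in encoder outputs for two languages:
clf = torch.nn.Linear(128, 2)  # syntax half -> language logits
a, b = torch.randn(4, 256), torch.randn(4, 256)
la = torch.zeros(4, dtype=torch.long)
lb = torch.ones(4, dtype=torch.long)
sem_l, syn_l = disentanglement_losses(a, b, la, lb, clf)
```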