InfeRE: Step-by-Step Regex Generation via Chain of Inference
Automatically generating regular expressions (regexes) from natural language
descriptions (NL2RE) has been an emerging research area. Prior studies treat a
regex as a linear sequence of tokens and generate the final expression
autoregressively in a single pass, without taking into account the step-by-step
internal text-matching process behind the final result. This significantly
hinders the efficacy and interpretability of regex generation by neural
language models. In this paper, we propose a new paradigm called InfeRE,
which decomposes the generation of regexes into chains of step-by-step
inference. To enhance robustness, we introduce a self-consistency decoding
mechanism that ensembles multiple outputs sampled from different models. We
evaluate InfeRE on two publicly available datasets, NL-RX-Turk and KB13, and
compare the results with state-of-the-art approaches and the popular tree-based
generation approach TRANX. Experimental results show that InfeRE substantially
outperforms previous baselines, yielding 16.3% and 14.7% improvements in DFA@5
accuracy on the two datasets, respectively. In particular, InfeRE outperforms
the tree-based generation approach by 18.1% and 11.3% in DFA@5 accuracy on the
two datasets, respectively.
Comment: This paper has been accepted by ASE'2
Diet Code Is Healthy: Simplifying Programs for Pre-trained Models of Code
Pre-trained code representation models such as CodeBERT have demonstrated
superior performance in a variety of software engineering tasks, yet they are
often computationally heavy: their complexity grows quadratically with the
length of the input sequence.
Our empirical analysis of CodeBERT's attention reveals that CodeBERT pays more
attention to certain types of tokens and statements such as keywords and
data-relevant statements. Based on these findings, we propose DietCode, which
aims to leverage large pre-trained models for source code in a lightweight way.
DietCode simplifies the input program of CodeBERT with three strategies,
namely, word dropout, frequency filtering, and an attention-based strategy
that selects the statements and tokens receiving the highest attention weights
during pre-training. This yields a substantial reduction in computational
cost without hampering model performance. Experimental
results on two downstream tasks show that DietCodeBERT provides comparable
results to CodeBERT with 40% less computational cost in fine-tuning and
testing.
Comment: Accepted to be published in ESEC/FSE 202
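The attention-based simplification strategy can be sketched as keeping only the highest-attention tokens under a budget. The function, token list, and attention weights below are hypothetical (not taken from CodeBERT), purely to illustrate the pruning step.

```python
def prune_by_attention(tokens, attn_weights, keep_ratio=0.6):
    """Keep the highest-attention tokens, preserving original order.

    A sketch of attention-based input simplification: rank tokens by the
    attention weight they receive and drop the rest. Weights here are
    assumed to be given; a real system would extract them from the model.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the top-k tokens by attention weight
    top = sorted(range(len(tokens)),
                 key=lambda i: attn_weights[i], reverse=True)[:k]
    keep = set(top)
    return [t for i, t in enumerate(tokens) if i in keep]

# Hypothetical tokenization and attention weights for a tiny function
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
weights = [0.9, 0.8, 0.1, 0.5, 0.05, 0.5, 0.1, 0.2, 0.9, 0.6, 0.7, 0.6]
print(prune_by_attention(tokens, weights, keep_ratio=0.5))
# -> ['def', 'add', 'return', 'a', '+', 'b']
```

Keywords and data-relevant tokens survive the cut while punctuation is dropped, mirroring the attention pattern the abstract describes.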
On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
In recent years, neural code translation has gained increasing attention.
While most of the research focuses on improving model architectures and
training processes, we notice that the evaluation process and benchmark for
code translation models are severely limited: they primarily treat source code
as natural language and provide a holistic accuracy score, disregarding the
full spectrum of model capabilities across different translation types and
complexity levels. In this paper, we present a comprehensive investigation of four
state-of-the-art models and analyze in-depth the advantages and limitations of
three existing benchmarks. Based on the empirical results, we develop a
taxonomy that categorizes code translation tasks into four primary types
according to their complexity and knowledge dependence: token level (type 1),
syntactic level (type 2), library level (type 3), and algorithm level (type 4).
We then conduct a thorough analysis of how existing approaches perform across
these four categories. Our findings indicate that while state-of-the-art code
translation models excel in type-1 and type-2 translations, they struggle with
knowledge-dependent ones such as type-3 and type-4. Existing benchmarks are
biased towards trivial translations, such as keyword mapping. To overcome these
limitations, we construct G-TransEval, a new benchmark by manually curating
type-3 and type-4 translation pairs and unit test cases. Results on our new
benchmark suggest that G-TransEval exposes the capabilities of code translation
models more comprehensively and at a finer granularity, thus providing a more
rigorous evaluation. Our studies also provide more insightful findings and
suggestions for future research, such as building type-3 and type-4 training
data and ensembling multiple pretraining approaches.
Comment: accepted by ASE202
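The four-type taxonomy can be sketched as a rule-based labeler over translation-pair properties. The predicates below are assumptions chosen for illustration, not the paper's actual curation criteria.

```python
def translation_type(syntax_differs, uses_library, uses_algorithm):
    """Assign a translation pair to one of the four taxonomy types.

    A hypothetical labeler illustrating the taxonomy from the abstract:
    type 4 (algorithm level) and type 3 (library level) are the
    knowledge-dependent categories; types 1 and 2 are not.
    """
    if uses_algorithm:
        return 4  # algorithm level: requires algorithmic knowledge
    if uses_library:
        return 3  # library level: depends on library/API knowledge
    if syntax_differs:
        return 2  # syntactic level: structural rewriting across languages
    return 1      # token level: near one-to-one token/keyword mapping

# A keyword-mapping pair is type 1; a pair calling a third-party API is type 3
print(translation_type(False, False, False))  # -> 1
print(translation_type(True, True, False))    # -> 3
```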
Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries
Ministry of Education, Singapore under its Academic Research Funding Tier