
    Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks

    Towards the vision of translating code that implements an algorithm from one programming language into another, this paper proposes an approach for automated program classification using bilateral tree-based convolutional neural networks (BiTBCNNs). It is layered on top of two tree-based convolutional neural networks (TBCNNs), each of which recognizes the algorithm implemented by code written in one programming language. A combination layer on top of the two networks recognizes the similarities and differences among code in different programming languages. The BiTBCNNs are trained on source code in different languages that is known to implement the same algorithms and/or functionalities. For a preliminary evaluation, we used 3591 Java and 3534 C++ code snippets, covering 6 algorithms, crawled systematically from GitHub. We obtained over 90% accuracy on the cross-language binary classification task of telling whether two given code snippets implement the same algorithm. For the algorithm classification task, i.e., predicting which of the six algorithm labels is implemented by an arbitrary C++ code snippet, we achieved over 80% precision.
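    The bilateral design can be pictured with a short sketch: two language-specific TBCNN encoders feed a combination layer that scores whether a Java snippet and a C++ snippet implement the same algorithm. The PyTorch sketch below is a minimal illustration under assumed names and sizes (TBCNNEncoder, hidden_dim, the pooling scheme); it is not the authors' implementation, and the tree convolution is simplified to a per-node projection with max-pooling.

    ```python
    import torch
    import torch.nn as nn

    # Illustrative BiTBCNN sketch; module names, sizes, and the pooling scheme
    # are assumptions, and the tree convolution is simplified to a per-node
    # projection followed by max-pooling over the tree's nodes.
    class TBCNNEncoder(nn.Module):
        """Encodes one AST (as pre-embedded node features) into a fixed vector."""
        def __init__(self, node_dim=128, hidden_dim=256):
            super().__init__()
            self.conv = nn.Sequential(nn.Linear(node_dim, hidden_dim), nn.Tanh())

        def forward(self, node_feats):  # node_feats: (num_nodes, node_dim)
            return self.conv(node_feats).max(dim=0).values  # (hidden_dim,)

    class BiTBCNN(nn.Module):
        """Two language-specific encoders plus a combination layer that scores
        whether a Java snippet and a C++ snippet implement the same algorithm."""
        def __init__(self, node_dim=128, hidden_dim=256):
            super().__init__()
            self.java_encoder = TBCNNEncoder(node_dim, hidden_dim)
            self.cpp_encoder = TBCNNEncoder(node_dim, hidden_dim)
            self.combine = nn.Sequential(
                nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1))  # logit: same algorithm or not

        def forward(self, java_nodes, cpp_nodes):
            h = torch.cat([self.java_encoder(java_nodes),
                           self.cpp_encoder(cpp_nodes)])
            return self.combine(h)

    # Toy usage with random stand-in AST node features:
    model = BiTBCNN()
    prob_same = torch.sigmoid(model(torch.randn(40, 128), torch.randn(55, 128)))
    ```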

    CodeTF: One-stop Transformer Library for State-of-the-art Code LLM

    Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling code intelligence tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier to model adoption. In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Following the principles of modular design and an extensible framework, we design CodeTF with a unified interface to enable rapid access and development across different types of models, datasets, and tasks. Our library supports a collection of pretrained Code LLM models and popular code benchmarks, including a standardized interface to train and serve code LLMs efficiently, and data features such as language-specific parsers and utility functions for extracting code attributes. In this paper, we describe the design principles, the architecture, and the key modules and components, and compare CodeTF with other related library tools. Finally, we hope CodeTF can bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive open-source solution for developers, researchers, and practitioners.
    Comment: Ongoing work - Draft Preview
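    Given the library's stated goal of a unified interface, a minimal usage sketch might look like the following. The load_model_pipeline entry point and its arguments are taken from the project's README as an assumption and may have changed; verify against the repository before relying on them.

    ```python
    # Hedged sketch of CodeTF's unified interface; the entry point and argument
    # names follow the project's README at the time of writing and should be
    # treated as assumptions -- verify against https://github.com/salesforce/CodeTF.
    from codetf.models import load_model_pipeline

    model = load_model_pipeline(
        model_name="codet5",     # model family
        task="pretrained",       # or a fine-tuned task name
        model_type="plus-220M",  # checkpoint size
        is_eval=True,
        load_in_8bit=False,
    )

    print(model.predict(["def print_hello_world():"]))
    ```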

    HierarchyNet: Learning to Summarize Source Code with Heterogeneous Representations

    Code representation is important to machine learning models in code-related applications. Existing code summarization approaches primarily leverage Abstract Syntax Trees (ASTs) and sequential information from source code to generate code summaries, while often overlooking the interplay of dependencies among code elements and the code hierarchy. However, effective summarization necessitates a holistic analysis of code snippets from three distinct aspects: lexical, syntactic, and semantic information. In this paper, we propose a novel code summarization approach utilizing Heterogeneous Code Representations (HCRs) and our specially designed HierarchyNet. HCRs capture essential code features at the lexical, syntactic, and semantic levels within a hierarchical structure. HierarchyNet processes each layer of the HCR separately, employing a Heterogeneous Graph Transformer, a Tree-based CNN, and a Transformer Encoder. HierarchyNet demonstrates superior performance compared to fine-tuned pre-trained models, including CodeT5 and CodeBERT, as well as large language models in zero/few-shot settings, such as CodeLlama, StarCoder, and CodeGen. Implementation details can be found at https://github.com/FSoft-AI4Code/HierarchyNet
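    A minimal sketch of the per-level idea: each HCR layer gets its own encoder, and the results are fused. The three encoders below are generic stand-ins (linear layers and a vanilla Transformer encoder) for the paper's Heterogeneous Graph Transformer, Tree-based CNN, and Transformer Encoder; all names and dimensions are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    # Illustrative sketch of per-level processing; the three encoders are generic
    # stand-ins for the paper's Heterogeneous Graph Transformer, Tree-based CNN,
    # and Transformer Encoder. All names and dimensions are assumptions.
    class HierarchyNetSketch(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.semantic_encoder = nn.Linear(dim, dim)   # stand-in for an HGT
            self.syntactic_encoder = nn.Linear(dim, dim)  # stand-in for a tree CNN
            self.lexical_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
                num_layers=2)
            self.fuse = nn.Linear(3 * dim, dim)

        def forward(self, graph_feats, tree_feats, token_feats):
            # Each HCR level is processed separately, then the results are fused.
            g = self.semantic_encoder(graph_feats).mean(dim=1)
            t = self.syntactic_encoder(tree_feats).mean(dim=1)
            x = self.lexical_encoder(token_feats).mean(dim=1)
            return self.fuse(torch.cat([g, t, x], dim=-1))

    # One batch of stand-in features per level (batch=2, length=10, dim=256):
    net = HierarchyNetSketch()
    state = net(torch.randn(2, 10, 256), torch.randn(2, 10, 256),
                torch.randn(2, 10, 256))
    ```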

    Class based Influence Functions for Error Detection

    Influence functions (IFs) are a powerful tool for detecting anomalous examples in large-scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points being compared belong to different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost.
    Comment: Thang Nguyen-Duc, Hoang Thanh-Tung, and Quan Hung Tran are co-first authors of this paper. 12 pages, 12 figures. Accepted to ACL 202
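    The class-based fix can be illustrated with a simple gradient-dot-product approximation of influence that scores a training point only against test points of the same class. This is a hedged sketch of the general idea; the function names and the first-order, Hessian-free influence approximation are assumptions, not the paper's exact estimator.

    ```python
    import torch

    # Hedged sketch: a first-order (gradient dot product) influence approximation
    # that uses class information by scoring a training point only against test
    # points of the same class. Names and the estimator are assumptions.
    def grad_vector(model, loss_fn, x, y):
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])

    def class_based_influence(model, loss_fn, train_point, test_points):
        x_tr, y_tr = train_point
        g_tr = grad_vector(model, loss_fn, x_tr, y_tr)
        scores = []
        for x_te, y_te in test_points:
            if y_te.item() != y_tr.item():  # skip cross-class pairs: unstable
                continue
            scores.append(torch.dot(g_tr, grad_vector(model, loss_fn, x_te, y_te)))
        return torch.stack(scores) if scores else torch.empty(0)

    # Toy usage with a linear classifier over 4 features and 3 classes:
    model = torch.nn.Linear(4, 3)
    loss_fn = torch.nn.CrossEntropyLoss()
    train_pt = (torch.randn(1, 4), torch.tensor([1]))
    test_pts = [(torch.randn(1, 4), torch.tensor([1])),
                (torch.randn(1, 4), torch.tensor([2]))]
    scores = class_based_influence(model, loss_fn, train_pt, test_pts)
    ```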

    Energy-bounded Learning for Robust Models of Code

    In programming, learning code representations has a variety of applications, including code classification, code search, comment generation, bug prediction, and so on. Various representations of code in terms of tokens, syntax trees, dependency graphs, code navigation paths, or combinations of their variants have been proposed; however, existing vanilla learning techniques have a major limitation in robustness, i.e., it is easy for the models to make incorrect predictions when the inputs are altered in a subtle way. To enhance robustness, existing approaches focus on recognizing adversarial samples rather than on valid samples that fall outside a given distribution, which we refer to as out-of-distribution (OOD) samples. Recognizing such OOD samples is the novel problem investigated in this paper. To this end, we propose to first augment in-distribution datasets with out-of-distribution samples such that, when trained together, they enhance the model's robustness. We propose an energy-bounded learning objective function that assigns a higher score to in-distribution samples and a lower score to out-of-distribution samples, in order to incorporate such out-of-distribution samples into the training process of source code models. In terms of OOD detection and adversarial-sample detection, our evaluation results demonstrate that existing source code models become more accurate at recognizing OOD data while being more resistant to adversarial attacks at the same time. Furthermore, the proposed energy-bounded score outperforms all existing OOD detection scores by a large margin, including the softmax confidence score, the Mahalanobis score, and ODIN.
    Comment: There are some flaws in our experiments; we would like to fix them and publish a corrected version in the very near future
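    A minimal sketch of an energy-bounded objective in this spirit is shown below. The energy score E(x) = -T * logsumexp(f(x)/T) and the two-sided margin regularizer follow the common energy-based OOD recipe (Liu et al., 2020); the margins, weighting, and hinge form here are assumptions for illustration, not the paper's verified loss.

    ```python
    import torch
    import torch.nn.functional as F

    # Hedged sketch of an energy-bounded training objective in the spirit of the
    # abstract; margins, weighting, and the hinge form follow the common
    # energy-based OOD recipe and are assumptions, not the paper's verified loss.
    def energy_score(logits, temperature=1.0):
        # E(x) = -T * logsumexp(f(x) / T); lower energy => more in-distribution.
        return -temperature * torch.logsumexp(logits / temperature, dim=-1)

    def energy_bounded_loss(logits_in, labels_in, logits_out,
                            m_in=-25.0, m_out=-7.0, weight=0.1):
        ce = F.cross_entropy(logits_in, labels_in)
        e_in, e_out = energy_score(logits_in), energy_score(logits_out)
        # Push in-distribution energies below m_in and OOD energies above m_out,
        # so the score -E(x) is higher for in-distribution samples.
        reg = (F.relu(e_in - m_in) ** 2).mean() + (F.relu(m_out - e_out) ** 2).mean()
        return ce + weight * reg

    # Toy usage with random logits for a 6-way code classifier:
    logits_in, logits_out = torch.randn(8, 6), torch.randn(8, 6)
    labels_in = torch.randint(0, 6, (8,))
    loss = energy_bounded_loss(logits_in, labels_in, logits_out)
    ```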

    Learning to Represent Programs with Code Hierarchies

    When used to process source code, graph neural networks have been shown to produce impressive results for a wide range of software engineering tasks. Existing techniques, however, still have two issues: (1) they struggle with long-term dependencies, and (2) they treat different code components as equals when they should not be. To address these issues, we propose a method for representing code as a hierarchy (Code Hierarchy), in which different code components are represented separately at various levels of granularity. Then, to process each level of representation, we design a novel network architecture, HIRGAST, which combines the strengths of Heterogeneous Graph Transformer Networks and Tree-based Convolutional Neural Networks to learn Abstract Syntax Trees enriched with code dependency information. We also propose a novel pretraining objective, Missing Subtree Prediction, to complement our Code Hierarchy. The evaluation results show that our method significantly outperforms other baselines in three downstream tasks: any-code completion, code classification, and code clone detection.
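    The Missing Subtree Prediction objective can be sketched as a simple example builder: pick a random subtree of an AST, replace it with a placeholder node, and keep the removed subtree as the prediction target. Node and field names below are illustrative assumptions; the paper's exact masking and prediction scheme may differ.

    ```python
    import random

    # Hedged sketch of a Missing Subtree Prediction example builder: pick a random
    # subtree of an AST, replace it with a placeholder node, and keep the removed
    # subtree as the prediction target. Node/field names are assumptions.
    MASK = {"type": "<MISSING_SUBTREE>", "children": []}

    def mask_random_subtree(root):
        """Return (masked_tree, target_subtree) for one pretraining example."""
        slots = []  # (parent, child_index) for every non-root node
        def collect(node):
            for i, child in enumerate(node.get("children", [])):
                slots.append((node, i))
                collect(child)
        collect(root)
        parent, i = random.choice(slots)
        target = parent["children"][i]
        parent["children"][i] = MASK  # mutates the tree; copy first in real use
        return root, target

    # Toy AST for a function with an empty parameter list and a return statement:
    ast = {"type": "FunctionDecl", "children": [
        {"type": "Params", "children": []},
        {"type": "Block", "children": [{"type": "Return", "children": []}]}]}
    masked, target = mask_random_subtree(ast)
    ```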