CCBERT: Self-Supervised Code Change Representation Learning
Numerous code changes are made by developers in their daily work, and a superior representation of code changes is desired for effective code change analysis. Recently, Hoang et al. proposed CC2Vec, a neural network-based approach that learns a distributed representation of code changes to capture the semantic intent of the changes. Despite demonstrated effectiveness in multiple tasks, CC2Vec has several limitations: 1) it considers only coarse-grained information about code changes, and 2) it relies on log messages rather than the self-contained content of the code changes. In this work, we propose CCBERT (Code Change BERT), a new Transformer-based pre-trained model that learns a generic representation of code changes from a large-scale dataset of unlabeled code changes. CCBERT is pre-trained on four proposed self-supervised objectives that are specialized for learning code change representations based on the contents of code changes. CCBERT perceives fine-grained code changes at the token level by learning from the old and new versions of the content, along with the edit actions. Our experiments demonstrate that CCBERT significantly outperforms CC2Vec and the state-of-the-art approaches on the downstream tasks by 7.7%--14.0% in terms of different metrics and tasks. CCBERT consistently outperforms large pre-trained code models, such as CodeBERT, while requiring 6--10× less training time, 5--30× less inference time, and 7.9× less GPU memory.
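The token-level view described in the abstract above can be approximated with standard sequence alignment: label each token of the old and new versions with an edit action. The sketch below uses Python's difflib as a stand-in for CCBERT's actual pre-processing; the function name and the pairwise handling of replace spans are illustrative assumptions, not the paper's code.

```python
import difflib

def token_edit_actions(old_tokens, new_tokens):
    # Align the two token sequences and label each aligned position with an
    # edit action. Unequal-length "replace" spans are zipped pairwise here
    # for brevity; a production version would pad the shorter side.
    actions = []
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            actions.extend(("equal", t, t) for t in old_tokens[i1:i2])
        elif tag == "replace":
            actions.extend(("replace", o, n)
                           for o, n in zip(old_tokens[i1:i2],
                                           new_tokens[j1:j2]))
        elif tag == "delete":
            actions.extend(("delete", t, None) for t in old_tokens[i1:i2])
        else:  # insert
            actions.extend(("insert", None, t) for t in new_tokens[j1:j2])
    return actions
```

A one-token change such as `x = 1` to `x = 2` yields two "equal" actions and one "replace" action, which is exactly the fine-grained signal a change-representation model can consume.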
Smaller, Faster, Greener: Compressing Pre-trained Code Models via Surrogate-Assisted Optimization
Large pre-trained models of code have been adopted to tackle many software
engineering tasks and achieved excellent results. However, their large model
size and expensive energy consumption prevent them from being widely deployed
on developers' computers to provide real-time assistance.
A recent study by Shi et al. compresses pre-trained models to a small
size. However, other important considerations in deploying models have not
been addressed: the model should have fast inference speed and minimal energy
consumption. This requirement motivates us to propose Avatar, a novel approach
that can reduce the model size as well as inference latency and energy
consumption without compromising effectiveness (i.e., prediction accuracy).
Avatar trains a surrogate model to predict the performance of a tiny model
given only its hyperparameter settings. Moreover, Avatar designs a new fitness
function embedding multiple key objectives, maximizing the predicted model
accuracy and minimizing the model size, inference latency, and energy
consumption. After finding the best model hyperparameters using a tailored
genetic algorithm (GA), Avatar employs the knowledge distillation technique to
train the tiny model. We evaluate Avatar and the baseline approach from Shi et
al. on three datasets for two popular software engineering tasks: vulnerability
prediction and clone detection. We use Avatar to compress models to a small
size (3 MB), which is 160× smaller than the original pre-trained models.
Compared with the original models, the inference latency of compressed models
is significantly reduced on all three datasets. On average, our approach
reduces the inference latency by 62×, 53×, and 186×. In terms of energy
consumption, compressed models only require 0.8 GFLOPs, which is 173× smaller
than the original pre-trained models. Comment: 12 pages, a work-in-progress version
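Avatar's multi-objective fitness can be pictured as a weighted scalarization: reward the surrogate-predicted accuracy while penalizing size, latency, and an energy proxy. The function and weights below are illustrative assumptions for this sketch, not the paper's actual objective.

```python
def fitness(pred_acc, size_mb, latency_ms, gflops,
            weights=(1.0, 0.01, 0.001, 0.05)):
    # Combine the four objectives into one scalar: higher predicted accuracy
    # is rewarded; size, latency and the GFLOPs energy proxy are penalized.
    # The weights are made up for illustration.
    wa, ws, wl, we = weights
    return wa * pred_acc - ws * size_mb - wl * latency_ms - we * gflops
```

A genetic algorithm would then evolve hyperparameter settings to maximize this scalar, e.g. preferring a 3 MB model over a 480 MB model at equal predicted accuracy.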
CoRAL: Collaborative Retrieval-Augmented Large Language Models Improve Long-tail Recommendation
Long-tail recommendation is a challenging task for traditional
recommender systems due to data sparsity and data imbalance issues. The recent
development of large language models (LLMs) has shown their abilities in
complex reasoning, which can help to deduce users' preferences based on very
few previous interactions. However, since most LLM-based systems rely on items'
semantic meaning as the sole evidence for reasoning, the collaborative
information of user-item interactions is neglected, which can cause the LLM's
reasoning to be misaligned with task-specific collaborative information of the
dataset. To further align LLMs' reasoning to task-specific user-item
interaction knowledge, we introduce collaborative retrieval-augmented LLMs,
CoRAL, which directly incorporate collaborative evidence into the prompts.
Based on the retrieved user-item interactions, the LLM can analyze shared and
distinct preferences among users, and summarize the patterns indicating which
types of users would be attracted by certain items. The retrieved collaborative
evidence prompts the LLM to align its reasoning with the user-item interaction
patterns in the dataset. However, since the capacity of the input prompt is
limited, finding the minimally-sufficient collaborative information for
recommendation tasks can be challenging. We propose to find the optimal
interaction set through a sequential decision-making process and develop a
retrieval policy learned through a reinforcement learning (RL) framework,
CoRAL. Our experimental results show that CoRAL can significantly improve LLMs'
reasoning abilities on specific recommendation tasks. Our analysis also reveals
that CoRAL can more efficiently explore collaborative information through
reinforcement learning. Comment: 11 pages
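CoRAL's sequential retrieval can be caricatured as repeatedly adding the most valuable interaction to the prompt until the budget is full. The greedy loop below is a simplified stand-in: CoRAL learns the selection policy with RL, whereas `value_fn` here is a caller-supplied scoring function and an assumption of this sketch.

```python
def retrieve_interactions(candidates, value_fn, budget):
    # Sequentially pick the interaction with the highest estimated value
    # until the prompt budget is exhausted. CoRAL frames each pick as a
    # step of a sequential decision process; here the learned policy is
    # replaced by a fixed greedy rule over value_fn.
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < budget:
        best = max(pool, key=value_fn)
        chosen.append(best)
        pool.remove(best)
    return chosen
```

The retained interactions would then be serialized into the LLM prompt as collaborative evidence; the RL reward in the paper measures how much that evidence improves the downstream recommendation.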
Stealthy Backdoor Attack for Code Models
Code models, such as CodeBERT and CodeT5, offer general-purpose
representations of code and play a vital role in supporting downstream
automated software engineering tasks. Most recently, code models were revealed
to be vulnerable to backdoor attacks. A code model that is backdoor-attacked
can behave normally on clean examples but will produce pre-defined malicious
outputs on examples injected with triggers that activate the backdoors.
Existing backdoor attacks on code models use unstealthy and easy-to-detect
triggers. This paper aims to investigate the vulnerability of code models with
stealthy backdoor attacks. To this end, we propose AFRAIDOOR (Adversarial
Feature as Adaptive Backdoor). AFRAIDOOR achieves stealthiness by leveraging
adversarial perturbations to inject adaptive triggers into different inputs. We
evaluate AFRAIDOOR on three widely adopted code models (CodeBERT, PLBART and
CodeT5) and two downstream tasks (code summarization and method name
prediction). We find that around 85% of adaptive triggers in AFRAIDOOR bypass
the detection in the defense process. By contrast, less than 12% of the
triggers from previous work bypass the defense. When the defense method is not
applied, both AFRAIDOOR and baselines have almost perfect attack success rates.
However, once a defense is applied, the success rates of baselines decrease
dramatically to 10.47% and 12.06%, while the success rates of AFRAIDOOR remain
77.05% and 92.98% on the two tasks. Our findings expose security weaknesses in
code models under stealthy backdoor attacks and shows that the state-of-the-art
defense method cannot provide sufficient protection. We call for more research
efforts in understanding security threats to code models and developing more
effective countermeasures. Comment: 18 pages, under review at IEEE Transactions on Software Engineering
Drug Target Prediction Based on the Herbs Components: The Study on the Multitargets Pharmacological Mechanism of Qishenkeli Acting on the Coronary Heart Disease
In this paper, we present a case study of Qishenkeli (QSKL) to investigate the underlying molecular mechanism of traditional Chinese medicine (TCM), based on drug target prediction, analyses of TCM chemical components, and subsequent experimental validation. First, after determining the compositive compounds of QSKL, we use drugCIPHER-CS to predict their potential drug targets. These potential targets are significantly enriched with known cardiovascular disease-related drug targets. We then find that these potential drug targets are significantly enriched in biological processes such as neuroactive ligand-receptor interaction, aminoacyl-tRNA biosynthesis, the calcium signaling pathway, glycine, serine and threonine metabolism, and the renin-angiotensin system (RAAS). Next, an animal model of coronary heart disease (CHD) induced by left anterior descending coronary artery ligation is applied to validate a predicted pathway. The RAAS pathway is selected as an example, and the results show that QSKL affects both renin and the angiotensin II receptor (AT1R), which eventually downregulates angiotensin II (AngII). Bioinformatics combined with experimental verification provides a credible and objective method for understanding the complicated multitarget mechanism of a Chinese herbal formula.
Mind Your Data! Hiding Backdoors in Offline Reinforcement Learning Datasets
A growing body of research has focused on the Offline Reinforcement
Learning (RL) paradigm. Data providers share large pre-collected datasets on
which others can train high-quality agents without interacting with the
environments. Such an offline RL paradigm has demonstrated effectiveness in
many critical tasks, including robot control, autonomous driving, etc. A
well-trained agent can be regarded as a software system. However, less
attention is paid to investigating the security threats to the offline RL
system. In this paper, we focus on a critical security threat: backdoor
attacks. Given normal observations, an agent implanted with backdoors takes
actions leading to high rewards. However, the same agent takes actions that
lead to low rewards if the observations are injected with triggers that can
activate the backdoor. In this paper, we propose Baffle (Backdoor Attack for
Offline Reinforcement Learning) and evaluate how different Offline RL
algorithms react to this attack. Our experiments conducted on four tasks and
four offline RL algorithms expose a disquieting fact: none of the existing
offline RL algorithms is immune to such a backdoor attack. More specifically,
Baffle modifies a portion of the datasets for four tasks (three robot control
tasks and one autonomous driving task). Agents trained on the poisoned
datasets perform well in normal settings. However, when triggers are
presented, the agents' performance decreases drastically in the four
tasks on average. The backdoor still persists after fine-tuning poisoned agents
on clean datasets. We further show that the inserted backdoor is also hard to
be detected by a popular defensive method. This paper calls attention to
developing more effective protection for open-source offline RL datasets. Comment: 13 pages, 6 figures
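A Baffle-style poisoning step can be sketched as stamping a trigger onto a fraction of observations and pairing it with an attacker-chosen action and an inflated reward, so an agent trained offline learns to prefer that action whenever the trigger appears. The transition format, `target_action`, and reward handling below are assumptions of this sketch, not the paper's exact procedure.

```python
import random

def poison_dataset(transitions, rate, trigger, target_action,
                   high_reward=1.0, seed=0):
    # transitions: iterable of (observation, action, reward) tuples, with
    # observations as tuples of floats. For a `rate` fraction of them,
    # append the trigger pattern to the observation and relabel the
    # transition with the attacker's action and a high reward.
    rng = random.Random(seed)
    out = []
    for obs, action, reward in transitions:
        if rng.random() < rate:
            out.append((obs + trigger, target_action, high_reward))
        else:
            out.append((obs, action, reward))
    return out
```

Because only a fraction of transitions are touched, the poisoned dataset still trains an agent that looks normal on clean observations, which is what makes the attack hard to notice.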
Answer Summarization for Technical Queries: Benchmark and New Approach
Prior studies have demonstrated that approaches to generate an answer summary
for a given technical query in Software Question and Answer (SQA) sites are
desired. We find that existing approaches are assessed solely through user
studies. There is a need for a benchmark with ground truth summaries to
complement assessment through user studies. Unfortunately, such a benchmark is
non-existent for answer summarization for technical queries from SQA sites. To
fill the gap, we manually construct a high-quality benchmark to enable
automatic evaluation of answer summarization for technical queries for SQA
sites. Using the benchmark, we comprehensively evaluate the performance of
existing approaches and find that there is still large room for improvement.
Motivated by the results, we propose a new approach TechSumBot with three key
modules: 1) Usefulness Ranking module, 2) Centrality Estimation module, and 3)
Redundancy Removal module. We evaluate TechSumBot in both automatic (i.e.,
using our benchmark) and manual (i.e., via a user study) manners. The results
from both evaluations consistently demonstrate that TechSumBot outperforms the
best performing baseline approaches from both SE and NLP domains by a large
margin, i.e., 10.83%-14.90%, 32.75%-36.59%, and 12.61%-17.54%, in terms of
ROUGE-1, ROUGE-2, and ROUGE-L on automatic evaluation, and 5.79%-9.23% and
17.03%-17.68%, in terms of average usefulness and diversity score on human
evaluation. This highlights that the automatic evaluation of our benchmark can
uncover findings similar to the ones found through user studies. More
importantly, automatic evaluation has a much lower cost, especially when it is
used to assess a new approach. Additionally, we also conducted an ablation
study, which demonstrates that each module in TechSumBot contributes to
boosting the overall performance of TechSumBot. Comment: Accepted by ASE 202
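TechSumBot's three-stage design can be sketched as a small pipeline: score answer sentences for usefulness, fold in centrality, then greedily drop near-duplicates. The scoring and similarity functions below are caller-supplied stand-ins; in the paper each module is a learned model, and the names here are assumptions of this sketch.

```python
def summarize(sentences, usefulness, centrality, similarity,
              k=3, threshold=0.8):
    # Stages 1+2: rank candidate sentences by usefulness plus centrality.
    ranked = sorted(sentences,
                    key=lambda s: usefulness(s) + centrality(s),
                    reverse=True)
    # Stage 3: redundancy removal -- keep a sentence only if it is not too
    # similar to anything already in the summary, stopping at k sentences.
    summary = []
    for s in ranked:
        if all(similarity(s, t) < threshold for t in summary):
            summary.append(s)
        if len(summary) == k:
            break
    return summary
```

The ablation finding in the abstract (each module contributes) maps directly onto this structure: removing any stage either admits useless, peripheral, or duplicated sentences.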
Being a morning man has causal effects on the cerebral cortex: a Mendelian randomization study
Introduction: Numerous studies have suggested a connection between circadian rhythm and neurological disorders with cognitive and consciousness impairments in humans, yet little evidence stands for a causal relationship between circadian rhythm and the brain cortex. Methods: The top 10,000 morningness-related single-nucleotide polymorphisms from Genome-wide association study (GWAS) summary statistics were used to filter the instrumental variables. GWAS summary statistics from the ENIGMA Consortium were used to assess the causal relationship between morningness and variates such as cortical thickness (TH) or surface area (SA) of the brain cortex. The inverse-variance weighted (IVW) and weighted median (WM) methods were used as the major estimates, whereas MR-Egger, MR Pleiotropy RESidual Sum and Outlier, leave-one-out analysis, and funnel plots were used for heterogeneity and pleiotropy detection. Results: Regionally, morningness decreased the SA of the rostral middle frontal gyrus with genomic control (IVW: β = −24.916 mm², 95% CI: −47.342 mm² to −2.490 mm², p = 0.029; WM: β = −33.208 mm², 95% CI: −61.933 mm² to −4.483 mm², p = 0.023; MR-Egger: β < 0) and without genomic control (IVW: β = −24.581 mm², 95% CI: −47.552 mm² to −1.609 mm², p = 0.036; WM: β = −32.310 mm², 95% CI: −60.717 mm² to −3.902 mm², p = 0.026; MR-Egger: β < 0) at nominal significance, with no heterogeneity and no outliers. Conclusions and implications: Circadian rhythm causally affects the rostral middle frontal gyrus; this sheds new light on the potential use of MRI in disease diagnosis, reveals the significance of circadian rhythm in disease progression, and might also suggest a fresh therapeutic approach for disorders related to the rostral middle frontal gyrus.
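The IVW estimate reported above combines per-SNP Wald ratios (outcome effect divided by exposure effect) weighted by their precision. The sketch below implements the textbook fixed-effect IVW formula, not the study's exact pipeline; the argument names are this sketch's own.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    # For each SNP j: Wald ratio = beta_outcome_j / beta_exposure_j,
    # weighted by beta_exposure_j**2 / se_outcome_j**2 (its approximate
    # inverse variance). The IVW estimate is the weighted mean of ratios.
    num = den = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        w = (bx / se) ** 2
        num += w * (by / bx)
        den += w
    return num / den
```

With consistent instruments (every Wald ratio equal), the weighted mean reproduces that common causal effect, which is the intuition behind using many SNPs as instruments.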