10 research outputs found

Gitor: Scalable Code Clone Detection by Building Global Sample Graph

Author: Dou Shihan
Liu Yang
Shan Junjie
Wu Hairu
Wu Yueming
Publication date: 18/11/2023

Code clone detection is about finding similar code fragments, which has drawn much attention in software engineering since it is important for software maintenance and evolution. Researchers have proposed many techniques and tools for source code clone detection, but current detection methods concentrate on analyzing or processing code samples individually without exploring the underlying connections among code samples. In this paper, we propose Gitor to capture the underlying connections among different code samples. Specifically, given a source code database, we first tokenize all code samples to extract the pre-defined individual information. After obtaining all samples' individual information, we leverage it to build a large global sample graph where each node is a code sample or a type of individual information. Then we apply a node embedding technique on the global sample graph to extract all the samples' vector representations. After collecting all code samples' vectors, we can simply compare the similarity between any two samples to detect possible clone pairs. More importantly, since the obtained vector of a sample is from a global sample graph, we can combine it with its own code features to improve the code clone detection performance. To demonstrate the effectiveness of Gitor, we evaluate it on a widely used dataset, namely BigCloneBench. Our experimental results show that Gitor has higher accuracy in terms of code clone detection and excellent execution time for inputs of various sizes compared to existing state-of-the-art tools. Moreover, we also evaluate the combination of Gitor with other traditional vector-based clone detection methods; the results show that the use of Gitor enables them to detect more code clones with higher F1. Comment: 12 pages, 5 figures

arXiv.org e-Print Archive
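
The final comparison step described in the Gitor abstract above (comparing the similarity of any two sample vectors to find possible clone pairs) can be illustrated with a minimal sketch. This is not the Gitor implementation: the cosine measure, the 0.8 threshold, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_clone_pairs(embeddings, threshold=0.8):
    """Return (id_a, id_b) pairs whose vectors are at least `threshold` similar."""
    pairs = []
    # Sorting the items makes the pair order deterministic.
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


# Toy vectors standing in for the embeddings of three code samples (hypothetical values).
toy_embeddings = {
    "sample_a": [1.0, 0.0, 1.0],
    "sample_b": [0.9, 0.1, 1.1],
    "sample_c": [0.0, 1.0, 0.0],
}
clone_pairs = candidate_clone_pairs(toy_embeddings)
```

The all-pairs loop is quadratic in the number of samples, so it only suits small collections; how Gitor scales this step is not described in the abstract.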

Obfuscation-resilient Android Malware Analysis Based on Contrastive Learning

Author: Dou Shihan
Jin Hai
Qiang Weizhong
Wu Yueming
Yang Wei
Zou Deqing
Publication date: 08/07/2021

Due to its open-source nature, the Android operating system has been the main target for attackers to exploit. Malware creators always perform different code obfuscations on their apps to hide malicious activities. Features extracted from these obfuscated samples through program analysis contain many useless and disguised features, which leads to many false negatives. To address the issue, in this paper, we demonstrate that obfuscation-resilient malware analysis can be achieved through contrastive learning. We take Android malware classification as an example to demonstrate our analysis. The key insight behind our analysis is that contrastive learning can be used to reduce the difference introduced by obfuscation while amplifying the difference between malware and benign apps (or other types of malware). Based on the proposed analysis, we design a system that can achieve robust and interpretable classification of Android malware. To achieve robust classification, we perform contrastive learning on malware samples to learn an encoder that can automatically extract robust features from malware samples. To achieve interpretable classification, we transform the function call graph of a sample into an image by centrality analysis. Then the corresponding heatmaps are obtained by visualization techniques. These heatmaps can help users understand why the malware is classified into a given family. We implement IFDroid and perform extensive evaluations on two widely used datasets. Experimental results show that IFDroid is superior to state-of-the-art Android malware familial classification systems. Moreover, IFDroid is capable of maintaining a 98.2% true positive rate on classifying 8,112 obfuscated malware samples.

arXiv.org e-Print Archive

On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection

Author: Dou Shihan
Gao Songyang
Huang Xuanjing
Ma Jin
Shan Ying
Zhang Qi
Publication date: 26/06/2023

Detecting adversarial samples that are carefully crafted to fool the model is a critical step toward socially secure applications. However, existing adversarial detection methods require access to sufficient training data, which raises noteworthy concerns regarding privacy leakage and generalizability. In this work, we validate that the adversarial sample generated by attack algorithms is strongly related to a specific vector in the high-dimensional inputs. Such vectors, namely UAPs (Universal Adversarial Perturbations), can be calculated without original training data. Based on this discovery, we propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs. Experimental results show that our method achieves competitive detection performance on various text classification tasks, and maintains an equivalent time consumption to normal inference. Comment: Accepted by ACL2023 (Short Paper)

arXiv.org e-Print Archive

DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization

Author: Dou Shihan
Gao Songyang
Liu Yan
Ma Jin
Shan Ying
Wang Xiao
Wei Zhongyu
Zhang Qi
Publication date: 26/06/2023

Adversarial training is one of the best-performing methods in improving the robustness of deep language models. However, robust models come at the cost of high time consumption, as they require multi-step gradient ascents or word substitutions to obtain adversarial samples. In addition, these generated samples are deficient in grammatical quality and semantic consistency, which impairs the effectiveness of adversarial training. To address these problems, we introduce a novel, effective procedure that instead performs adversarial training with only clean data. Our procedure, distribution shift risk minimization (DSRM), estimates the adversarial loss by perturbing the input data's probability distribution rather than their embeddings. This formulation results in a robust model that minimizes the expected global loss under adversarial attacks. Our approach requires zero adversarial samples for training and reduces time consumption by up to 70% compared to current best-performing adversarial training methods. Experiments demonstrate that DSRM considerably improves BERT's resistance to textual adversarial attacks and achieves state-of-the-art robust accuracy on various benchmarks. Comment: Accepted by ACL2023

arXiv.org e-Print Archive

Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

Author: Dou Shihan
Gui Tao
Huang Xuanjing
Shen Wei
Zhan Wenyu
Zhang Qi
Zhao Jun
Zheng Rui
Publication date: 29/11/2023

Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. The emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length. Comment: EMNLP 2023 findings, Length Bias in RLHF, Mitigate bias in reward modeling

arXiv.org e-Print Archive
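
The Product-of-Experts idea mentioned in the abstract above has a simple closed form for binary preference outcomes: the renormalised product of two Bernoulli experts with logits a and b is a Bernoulli with logit a + b. The sketch below checks this numerically. The abstract does not say how the paper parameterises its main and biased experts, so the function names and example logits are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def product_of_experts(p_main, p_bias):
    """Renormalised product of two Bernoulli experts.

    Both inputs are probabilities of the same outcome, here
    'the first response is preferred'.
    """
    agree = p_main * p_bias
    disagree = (1.0 - p_main) * (1.0 - p_bias)
    return agree / (agree + disagree)


# Hypothetical preference logits for one response pair.
main_logit = 1.2   # main expert: preference driven by content
bias_logit = -0.4  # biased expert: preference explained by length alone

combined = product_of_experts(sigmoid(main_logit), sigmoid(bias_logit))
# The same value, computed directly by adding the two logits.
combined_from_logits = sigmoid(main_logit + bias_logit)
```

Because the combination is additive in logit space, a biased expert that already explains a preference leaves less of it for the main expert to account for, which is the usual motivation for this kind of debiasing.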

MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective

Author: Cheng Zhanzhan
Dou Shihan
Gui Tao
Huang Xuanjing
Qiao Liang
Wang Xiao
Xiong Limao
Zhang Qi
Zou Yicheng
Publication date: 09/04/2022

NER models have achieved promising performance on standard NER benchmarks. However, recent studies show that previous approaches may over-rely on entity mention information, resulting in poor performance on out-of-vocabulary (OOV) entity recognition. In this work, we propose MINER, a novel NER learning framework, to remedy this issue from an information-theoretic perspective. The proposed approach contains two mutual information-based training objectives: i) generalizing information maximization, which enhances representation via deep understanding of context and entity surface forms; ii) superfluous information minimization, which discourages representation from rote memorizing entity names or exploiting biased cues in data. Experiments on various settings and datasets demonstrate that it achieves better performance in predicting OOV entities.

arXiv.org e-Print Archive

Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey

Author: Deng Wenhao
Dou Shihan
Gui Tao
He Wei
Huang Xuanjing
Jia Haoxiang
Liu Yang
Shan Junjie
Wu Yueming
Xi Zhiheng
Publication date: 03/08/2023

Code cloning, the duplication of code fragments, is common in software development. While some reuse aids productivity, excessive cloning hurts maintainability and introduces bugs. Hence, automatic code clone detection is vital. Meanwhile, large language models (LLMs) possess diverse code-related knowledge, making them versatile for various software engineering challenges. However, LLMs' performance in code clone detection is unclear and needs more study for accurate assessment. In this paper, we provide the first comprehensive evaluation of LLMs for clone detection, covering different clone types, languages, and prompts. We find advanced LLMs excel in detecting complex semantic clones, surpassing existing methods. Adding intermediate reasoning steps via chain-of-thought prompts noticeably enhances performance. Additionally, representing code as vector embeddings, especially with text encoders, effectively aids clone detection. Lastly, the ability of LLMs to detect code clones differs among various programming languages. Our study suggests that LLMs have potential for clone detection due to their language capabilities, offering insights for developing robust LLM-based methods to enhance software engineering. Comment: 13 pages, 3 figures

arXiv.org e-Print Archive

Secrets of RLHF in Large Language Models Part I: PPO

Author: Chang Cheng
Chen Lu
Cheng Wensen
Dou Shihan
Gao Songyang
Gui Tao
Hua Yuan
Huang Haoran
Huang Xuanjing
Jin Senjie
Lai Wenbin
Liu Qin
Liu Yan
Qiu Xipeng
Shen Wei
Sun Tianxiang
Wang Binghai
Weng Rongxiang
Xi Zhiheng
Xiong Limao
Xu Nuo
Yan Hang
Yin Zhangyue
Zhang Qi
Zheng Rui
Zhou Yuhao
Zhu Minghao
Publication date: 10/07/2023

Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as a human-centric (helpful, honest, and harmless) assistant. Alignment with humans assumes paramount significance, and reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include reward models to measure human preferences, Proximal Policy Optimization (PPO) to optimize policy model outputs, and process supervision to improve step-by-step reasoning capabilities. However, due to the challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, there is a significant barrier for AI researchers to motivate the development of technical alignment and safe landing of LLMs. The stable training of RLHF remains a puzzle. In the first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. Therefore, we explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model. Based on our main results, we perform a comprehensive analysis of RLHF abilities compared with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. Therefore, we are eager to release technical reports, reward models and PPO code.

arXiv.org e-Print Archive
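
One familiar form of policy constraint in PPO is the clipped probability ratio of the original PPO algorithm. The sketch below shows that standard per-sample objective as an illustration only. It is not PPO-max, whose details the abstract does not give, and the function name and the clip_eps=0.2 default are assumptions.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample.

    ratio:     new-policy probability divided by old-policy probability
    advantage: estimated advantage of the sampled action
    Returns min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from pushing the ratio above 1 + eps is capped.
capped = clipped_surrogate(ratio=1.5, advantage=2.0)

# Negative advantage: the unclipped, more pessimistic term is kept.
pessimistic = clipped_surrogate(ratio=1.5, advantage=-2.0)
```

Taking the minimum of the two terms removes any incentive to move the policy far from the one that collected the data, which is the sense in which clipping acts as a constraint.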

Recent Progress of Fluorescence Sensors for Histamine in Foods

Author: Dapeng Li
Gan Wu
Jicheng Zhang
Jing Xie
Shihan Xu
Xilin Dou
Zhaoyang Ding
Publication venue: MDPI AG
Publication date: 01/03/2022

Biological amines are organic nitrogen compounds that can be produced by the decomposition of spoiled food. As an important biological amine, histamine plays an important role in food safety. Many methods have been used to detect histamine in foods. Compared with traditional analysis methods, fluorescence sensors as an adaptable detection tool for histamine in foods have the advantages of low cost, convenience, simpler operation, high sensitivity, and good visibility. In terms of food safety, fluorescence sensors have shown great utilization potential. In this review, we introduce the applications and development of fluorescence sensors in food safety based on various types of materials. The performance and effectiveness of the fluorescence sensors are discussed in detail regarding their structure, luminescence mechanism, and recognition mechanism. This review may contribute to the exploration of the application of fluorescence sensors in food-related work.

Directory of Open Access Journals

PubMed Central

Enhanced cycling performance and rate capacity of SiO anode material by compositing with monoclinic TiO2 (B)

Author: Hongbo Zhang
Mao Xia
Nan Zhou
Qing Zhou
Shihan Liu
Yiran Li
Yufan Wu
Zhi Zhou
Publication venue: Elsevier BV

Crossref