Gitor: Scalable Code Clone Detection by Building Global Sample Graph
Code clone detection, the task of finding similar code fragments, has
drawn much attention in software engineering because it is important for software
maintenance and evolution. Researchers have proposed many techniques and tools
for source code clone detection, but current detection methods concentrate on
analyzing or processing code samples individually without exploring the
underlying connections among code samples. In this paper, we propose Gitor to
capture the underlying connections among different code samples. Specifically,
given a source code database, we first tokenize all code samples to extract the
pre-defined individual information. After obtaining all samples' individual
information, we leverage it to build a large global sample graph where each
node is a code sample or a type of individual information. Then we apply a node
embedding technique on the global sample graph to extract all the samples'
vector representations. After collecting all code samples' vectors, we can
simply compare the similarity between any two samples to detect possible clone
pairs. More importantly, since the obtained vector of a sample is from a global
sample graph, we can combine it with its own code features to improve the code
clone detection performance. To demonstrate the effectiveness of Gitor, we
evaluate it on a widely used dataset, BigCloneBench. Our experimental
results show that Gitor has higher accuracy in terms of code clone detection
and excellent execution time for inputs of various sizes compared to existing
state-of-the-art tools. Moreover, we also evaluate the combination of Gitor
with other traditional vector-based clone detection methods; the results show
that the use of Gitor enables them to detect more code clones with higher F1.
Comment: 12 pages, 5 figures
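The pipeline in the abstract above (tokenize, build a global sample graph, derive one vector per sample, compare vectors) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: identifiers stand in for the "pre-defined individual information", an inverse-frequency weighting of each sample's graph neighbours stands in for the learned node embedding, and the function names are hypothetical rather than Gitor's API.

```python
import math
import re
from collections import defaultdict


def tokenize(code):
    """Return the set of identifier-like tokens in a code fragment."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def build_graph(samples):
    """Build a bipartite graph linking sample nodes to token nodes.

    Returns (sample -> tokens, token -> samples) adjacency maps.
    """
    sample_to_info = {sid: tokenize(code) for sid, code in samples.items()}
    info_to_samples = defaultdict(set)
    for sid, tokens in sample_to_info.items():
        for token in tokens:
            info_to_samples[token].add(sid)
    return sample_to_info, info_to_samples


def sample_vector(sid, sample_to_info, info_to_samples):
    """Sparse vector for one sample: each neighbouring token node is
    weighted by how rare it is across the whole graph (stand-in for a
    learned node embedding)."""
    n = len(sample_to_info)
    return {
        token: math.log((1 + n) / (1 + len(info_to_samples[token]))) + 1.0
        for token in sample_to_info[sid]
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In this sketch, two fragments that share token nodes in the graph score higher than fragments with no shared neighbours; a pair would be reported as a candidate clone when its similarity exceeds a chosen threshold.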
Obfuscation-resilient Android Malware Analysis Based on Contrastive Learning
Due to its open-source nature, the Android operating system has been a main
target for attackers. Malware creators often apply different code
obfuscations on their apps to hide malicious activities. Features extracted
from these obfuscated samples through program analysis contain many useless and
disguised features, which leads to many false negatives. To address the issue,
in this paper, we demonstrate that obfuscation-resilient malware analysis can
be achieved through contrastive learning. We take the Android malware
classification as an example to demonstrate our analysis. The key insight
behind our analysis is that contrastive learning can be used to reduce the
difference introduced by obfuscation while amplifying the difference between
malware and benign apps (or other types of malware).
Based on the proposed analysis, we design a system that can achieve robust
and interpretable classification of Android malware. To achieve robust
classification, we perform contrastive learning on malware samples to learn an
encoder that can automatically extract robust features from malware samples. To
achieve interpretable classification, we transform the function call graph of a
sample into an image by centrality analysis. Then the corresponding heatmaps
are obtained by visualization techniques. These heatmaps can help users
understand why a malware sample is assigned to a given family. We implement the
system, IFDroid,
and perform extensive evaluations on two widely used datasets. Experimental
results show that IFDroid is superior to state-of-the-art Android malware
familial classification systems. Moreover, IFDroid is capable of maintaining
98.2% true positive rate on classifying 8,112 obfuscated malware samples.
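The key insight above, pulling an app and its obfuscated variant together while pushing samples of other classes apart, has the general shape of a contrastive objective. The following is a minimal sketch of an InfoNCE-style loss in plain Python. It is not IFDroid's actual loss: the function names, the temperature value, and the idea of treating an obfuscated variant as the positive are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor.

    anchor:    features of a sample
    positive:  features of a sample that should stay close (assumed here
               to be an obfuscated variant or same-family sample)
    negatives: features of samples that should be pushed away
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss is small when the anchor is closer to its positive than to every negative, and grows when a negative is the closer one, which is the behaviour an encoder would be trained to exploit.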
On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection
Detecting adversarial samples that are carefully crafted to fool the model is
a critical step to socially-secure applications. However, existing adversarial
detection methods require access to sufficient training data, which brings
noteworthy concerns regarding privacy leakage and generalizability. In this
work, we validate that the adversarial sample generated by attack algorithms is
strongly related to a specific vector in the high-dimensional inputs. Such
vectors, namely UAPs (Universal Adversarial Perturbations), can be calculated
without original training data. Based on this discovery, we propose a
data-agnostic adversarial detection framework, which induces different
responses between normal and adversarial samples to UAPs. Experimental results
show that our method achieves competitive detection performance on various text
classification tasks, and maintains an equivalent time consumption to normal
inference.
Comment: Accepted by ACL2023 (Short Paper)
DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization
Adversarial training is one of the best-performing methods in improving the
robustness of deep language models. However, robust models come at the cost of
high time consumption, as they require multi-step gradient ascents or word
substitutions to obtain adversarial samples. In addition, these generated
samples are deficient in grammatical quality and semantic consistency, which
impairs the effectiveness of adversarial training. To address these problems,
we introduce a novel, effective procedure that instead performs adversarial
training with only clean data. Our procedure, distribution shift risk
minimization (DSRM),
estimates the adversarial loss by perturbing the input data's probability
distribution rather than their embeddings. This formulation results in a robust
model that minimizes the expected global loss under adversarial attacks. Our
approach requires zero adversarial samples for training and reduces time
consumption by up to 70% compared to current best-performing adversarial
training methods. Experiments demonstrate that DSRM considerably improves
BERT's resistance to textual adversarial attacks and achieves state-of-the-art
robust accuracy on various benchmarks.
Comment: Accepted by ACL202
Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback serves as a crucial bridge,
aligning large language models with human and societal values. This alignment
requires a vast corpus of human feedback to learn a reward model, which is
subsequently used to finetune language models. However, we have identified that
the reward model often finds shortcuts to bypass its intended objectives,
mistakenly assuming that humans prefer longer responses. The emergence of
length bias often induces the model to favor longer outputs, yet this does not
equate to an increase in helpful information within these outputs. In this
paper, we propose an innovative solution, applying the Product-of-Experts (PoE)
technique to separate reward modeling from the influence of sequence length. In
our framework, the main expert concentrates on understanding human intents,
while the biased expert targets the identification and capture of length bias.
To further enhance the learning of bias, we introduce perturbations into the
bias-focused expert, disrupting the flow of semantic information. Experimental
results validate the effectiveness of our approach, indicating that language
model performance is improved, irrespective of sequence length.
Comment: EMNLP 2023 findings, Length Bias in RLHF, Mitigate bias in reward
modeling
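The Product-of-Experts idea in the abstract above can be sketched for a pairwise preference model. This is a generic PoE formulation under a Bradley-Terry assumption, not the paper's exact architecture: each expert outputs a reward margin (reward of the preferred response minus reward of the rejected one), the two experts are multiplied during training, and only the main expert is used afterwards. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def poe_preference_prob(main_margin, bias_margin):
    """Probability of the human-preferred response under the product of
    the main expert and the length-biased expert. Multiplying the two
    experts' probabilities and renormalising is equivalent to adding
    their margins (logits)."""
    return sigmoid(main_margin + bias_margin)


def poe_loss(main_margin, bias_margin):
    """Training loss: negative log-likelihood of the combined prediction."""
    return -math.log(poe_preference_prob(main_margin, bias_margin))


def main_only_prob(main_margin):
    """After training, the biased expert is dropped and only the main
    expert scores the pair."""
    return sigmoid(main_margin)
```

When the biased expert already explains a preference (for example, through length alone), the combined loss is small and the main expert receives little pressure to encode that shortcut itself.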
MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective
NER models have achieved promising performance on standard NER benchmarks.
However, recent studies show that previous approaches may over-rely on entity
mention information, resulting in poor performance on out-of-vocabulary (OOV)
entity recognition. In this work, we propose MINER, a novel NER learning
framework, to remedy this issue from an information-theoretic perspective. The
proposed approach contains two mutual information-based training objectives: i)
generalizing information maximization, which enhances representation via deep
understanding of context and entity surface forms; ii) superfluous information
minimization, which discourages the representation from rote memorization of
entity names or exploiting biased cues in data. Experiments on various settings
and datasets demonstrate that MINER achieves better performance in predicting
OOV entities.
Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey
Code cloning, the duplication of code fragments, is common in software
development. While some reuse aids productivity, excessive cloning hurts
maintainability and introduces bugs. Hence, automatic code clone detection is
vital. Meanwhile, large language models (LLMs) possess diverse code-related
knowledge, making them versatile for various software engineering challenges.
However, LLMs' performance in code clone detection is unclear and needs more
study for accurate assessment. In this paper, we provide the first
comprehensive evaluation of LLMs for clone detection, covering different clone
types, languages, and prompts. We find advanced LLMs excel in detecting complex
semantic clones, surpassing existing methods. Adding intermediate reasoning
steps via chain-of-thought prompts noticeably enhances performance.
Additionally, representing code as vector embeddings, especially with text
encoders, effectively aids clone detection. Lastly, the ability of LLMs to
detect code clones differs among various programming languages. Our study
suggests that LLMs have potential for clone detection due to their language
capabilities, offering insights for developing robust LLM-based methods to
enhance software engineering.
Comment: 13 pages, 3 figures
Secrets of RLHF in Large Language Models Part I: PPO
Large language models (LLMs) have formulated a blueprint for the advancement
of artificial general intelligence. Their primary objective is to function as
human-centric (helpful, honest, and harmless) assistants. Alignment with humans
assumes paramount significance, and reinforcement learning with human feedback
(RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
Current technical routes usually include reward models to measure
human preferences, Proximal Policy Optimization (PPO) to optimize
policy model outputs, and process supervision to improve step-by-step
reasoning capabilities. However, due to the challenges of reward design,
environment interaction, and agent training, coupled with huge trial and error
cost of large language models, there is a significant barrier for AI
researchers to motivate the development of technical alignment and safe landing
of LLMs. The stable training of RLHF has still been a puzzle. In the first
report, we dissect the framework of RLHF, re-evaluate the inner workings of
PPO, and explore how the parts comprising PPO algorithms impact policy agent
training. We identify policy constraints as the key factor for the effective
implementation of the PPO algorithm. Therefore, we explore PPO-max, an
advanced version of the PPO algorithm, to efficiently improve the training
stability of the policy model. Based on our main results, we perform a
comprehensive analysis of RLHF abilities compared with SFT models and ChatGPT.
The absence of open-source implementations has posed significant challenges to
the investigation of LLM alignment. Therefore, we are eager to release
technical reports, reward models and PPO code.
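A common form of policy constraint in PPO is the clipped surrogate objective, which limits how far a single update can move the policy from the one that generated the data. The sketch below computes that standard term in plain Python. It illustrates ordinary PPO clipping only and is not the PPO-max variant from the report; the function names and the default epsilon of 0.2 are assumptions.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Clipped surrogate term for one sample.

    ratio:     new-policy probability divided by old-policy probability
    advantage: estimated advantage of the sampled action
    epsilon:   clipping range around a ratio of 1
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any gain from moving the ratio
    # outside the clipping range.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate over a batch (to be maximised)."""
    terms = [ppo_clip_term(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

With a positive advantage, the term stops growing once the ratio exceeds 1 + epsilon; with a negative advantage, it stops improving once the ratio falls below 1 - epsilon.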
Recent Progress of Fluorescence Sensors for Histamine in Foods
Biological amines are organic nitrogen compounds that can be produced by the decomposition of spoiled food. As an important biological amine, histamine plays an important role in food safety. Many methods have been used to detect histamine in foods. Compared with traditional analysis methods, fluorescence sensors, as an adaptable detection tool for histamine in foods, have the advantages of low cost, convenience, simple operation, high sensitivity, and good visibility. In terms of food safety, fluorescence sensors have shown great potential. In this review, we introduce the applications and development of fluorescence sensors in food safety based on various types of materials. The performance and effectiveness of the fluorescence sensors are discussed in detail with regard to their structure, luminescence mechanism, and recognition mechanism. This review may contribute to the exploration of the application of fluorescence sensors in food-related work.