25 research outputs found
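Development of an Open-Air Ozone Concentration Control System and Estimation of the Regional-Scale Impact of Ozone on Wheat Production
Degree type: Doctorate by dissertation (論文博士), University of Tokyo (東京大学)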
Just-in-Time Security Patch Detection -- LLM At the Rescue for Data Augmentation
In the face of growing vulnerabilities found in open-source software, the
need to identify discreet security patches has become paramount. The lack of
consistency in how software providers handle maintenance often leads to the
release of security patches without comprehensive advisories, leaving users
vulnerable to unaddressed security risks. To address this pressing issue, we
introduce a novel security patch detection system, LLMDA, which capitalizes on
Large Language Models (LLMs) and code-text alignment methodologies for patch
review, data enhancement, and feature combination. Within LLMDA, we initially
utilize LLMs for examining patches and expanding data of PatchDB and SPI-DB,
two security patch datasets from recent literature. We then use labeled
instructions to direct our LLMDA, differentiating patches based on security
relevance. Following this, we apply a PTFormer to merge patches with code,
formulating hybrid attributes that encompass both the innate details and the
interconnections between the patches and the code. This distinctive combination
method allows our system to capture more insights from the combined context of
patches and code, hence improving detection precision. Finally, we devise a
probabilistic batch contrastive learning mechanism to augment the capability
of LLMDA in discerning security patches. The results reveal that LLMDA
significantly surpasses state-of-the-art techniques in detecting security
patches, underscoring its promise in fortifying software maintenance.
Patch-CLIP: A Patch-Text Pre-Trained Model
In recent years, patch representation learning has emerged as a necessary
research direction for exploiting the capabilities of machine learning in
software generation. These representations have driven significant performance
enhancements across a variety of tasks involving code changes. While the
progress is undeniable, a common limitation among existing models is their
specialization: they predominantly excel in either predictive tasks, such as
security patch classification, or in generative tasks such as patch description
generation. This dichotomy is further exacerbated by a prevalent dependency on
potentially noisy data sources. Specifically, many models utilize patches
integrated with Abstract Syntax Trees (AST) that, unfortunately, may contain
parsing inaccuracies, thus acting as a suboptimal source of supervision. In
response to these challenges, we introduce PATCH-CLIP, a novel pre-training
framework for patches and natural language text. PATCH-CLIP deploys a
triple-loss training strategy for 1) patch-description contrastive learning,
which makes it possible to separate patches and descriptions in the embedding
space, 2) patch-description matching, which ensures that each patch is
associated with its description in the embedding space, and 3)
patch-description generation, which ensures that the patch embedding is
effective for generation. These losses are implemented for joint learning to
achieve good performance in both predictive and generative tasks involving
patches. Empirical evaluations focusing on patch description generation
demonstrate that PATCH-CLIP sets a new state of the art, consistently
outperforming prior approaches on metrics such as BLEU, ROUGE-L, METEOR, and
Recall.
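The contrastive component of such a triple-loss strategy can be illustrated with a minimal sketch. It assumes a symmetric, CLIP-style InfoNCE loss over a patch-description similarity matrix; the abstract does not state the exact formulation, so the function names, the temperature value, and the toy similarity values below are hypothetical and not taken from PATCH-CLIP.

```python
import math

def info_nce(sim, temperature=0.07):
    """Mean cross-entropy over rows, where row i's matching column is i.

    sim[i][j] is an assumed similarity score between patch i and
    description j (for example, a cosine similarity of their embeddings).
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(sim)

def symmetric_contrastive_loss(sim, temperature=0.07):
    """Average of patch-to-description and description-to-patch losses."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, temperature) + info_nce(transposed, temperature))

# Toy batches of two patch-description pairs (hypothetical similarity values).
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each patch is closest to its own description
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # each patch is closest to the wrong description
print(symmetric_contrastive_loss(aligned), symmetric_contrastive_loss(shuffled))
```

In this sketch the loss is close to zero when every patch scores highest against its own description, and it grows when the pairing is wrong, which is the behavior a contrastive objective is meant to reward.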
App Review Driven Collaborative Bug Finding
Software development teams generally welcome any effort to expose bugs in
their code base. In this work, we build on the hypothesis that mobile apps from
the same category (e.g., two web browser apps) may be affected by similar bugs
in their evolution process. It is therefore possible to transfer the experience
of one historical app to quickly find bugs in its new counterparts. This has
been referred to as collaborative bug finding in the literature. Our novelty is
that we guide the bug finding process by considering that existing bugs have
been hinted at in app reviews. Concretely, we design the BugRMSys approach to
recommend bug reports for a target app by matching historical bug reports from
apps in the same category with user app reviews of the target app. We
experimentally show that this approach enables us to quickly expose and report
dozens of bugs for targeted apps such as Brave (web browser app). BugRMSys's
implementation relies on DistilBERT to produce natural language text
embeddings. Our pipeline considers similarities between bug reports and app
reviews to identify relevant bugs. We then focus on the app review as well as
potential reproduction steps in the historical bug report (from a same-category
app) to reproduce the bugs.
Overall, after applying BugRMSys to six popular apps, we were able to
identify, reproduce and report 20 new bugs: among these, 9 reports have
already been triaged, 6 were confirmed, and 4 have been fixed by the official
development teams.
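The matching step described above can be sketched as follows. This is a minimal illustration, assuming that text embeddings are already available as plain vectors and that similarity is measured with cosine similarity; the abstract only says "similarities", so the metric, the threshold, the function names, and the toy vectors are all hypothetical rather than BugRMSys's actual implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend_bug_reports(review_vecs, report_vecs, threshold=0.8):
    """Pair historical bug reports with target-app reviews above a threshold.

    Returns (report_id, review_id, score) tuples, highest score first.
    """
    matches = []
    for report_id, report_vec in report_vecs.items():
        for review_id, review_vec in review_vecs.items():
            score = cosine_similarity(report_vec, review_vec)
            if score >= threshold:
                matches.append((report_id, review_id, score))
    matches.sort(key=lambda m: m[2], reverse=True)
    return matches

# Toy 3-dimensional embeddings (hypothetical; real embeddings would come
# from a sentence encoder such as DistilBERT).
reviews = {"review-1": [0.9, 0.1, 0.0], "review-2": [0.0, 0.2, 0.9]}
reports = {"report-A": [1.0, 0.0, 0.0], "report-B": [0.0, 1.0, 0.0]}
matches = recommend_bug_reports(reviews, reports)
print(matches)
```

Only pairs above the threshold are kept, so a reviewer would inspect a short ranked list of candidate bug reports instead of every report-review combination.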
Learning to Represent Patches
Patch representation is crucial in automating various software engineering
tasks, like determining patch accuracy or summarizing code changes. While
recent research has employed deep learning for patch representation, focusing
on token sequences or Abstract Syntax Trees (ASTs), these approaches often miss the
change's semantic intent and the context of modified lines. To bridge this gap,
we introduce a novel method, Patcherizer. It delves into the intentions of
context and structure, merging the surrounding code context with two innovative
representations. These capture the intention in code changes and the intention
in AST structural modifications pre- and post-patch. This holistic
representation aptly captures a patch's underlying intentions. Patcherizer
employs graph convolutional neural networks for structural intention graph
representation and transformers for intention sequence representation. We
evaluated the versatility of Patcherizer's embeddings in three areas: (1) patch
description generation, (2) patch accuracy prediction, and (3) patch intention
identification. Our experiments demonstrate the representation's efficacy
across all tasks, outperforming state-of-the-art methods. For example, in patch
description generation, Patcherizer excels, showing an average boost of 19.39%
in BLEU, 8.71% in ROUGE-L, and 34.03% in METEOR scores.
Innovative Nitrogen Management and Innovative Fertilization Technologies in the Intensively Used Rice-Wheat Cropping Systems of Southeastern China
As part of an interdisciplinary German-Chinese research consortium, demonstration field trials were set up in two counties of Jiangsu Province in southeastern China, starting with the 2008/09 winter wheat crop. Across three treatments, "Standard", "Reduced", and a zero plot (Nullparzelle), only the amount of mineral nitrogen (N) fertilizer was varied. The results after the winter wheat harvest show that the "Reduced" treatment suffered no yield decline. In addition, residual soil Nmin content after harvest was almost 40% lower than in the "Standard" variant.
Multilevel Semantic Embedding of Software Patches: A Fine-to-Coarse Grained Approach Towards Security Patch Detection
The growth of open-source software has increased the risk of hidden vulnerabilities that can affect downstream software applications. This concern is further exacerbated by software vendors' practice of silently releasing security patches without explicit warnings or Common Vulnerabilities and Exposures (CVE) notifications. This lack of transparency leaves users unaware of potential security threats, giving attackers an opportunity to take advantage of these vulnerabilities. In the complex landscape of software patches, grasping the nuanced semantics of a patch is vital for ensuring secure software maintenance. To address this challenge, we introduce a multilevel Semantic Embedder for security patch detection, termed MultiSEM. This model harnesses word-centric vectors at a fine-grained level, emphasizing the significance of individual words, while the coarse-grained layer adopts entire code lines for vector representation, capturing the essence and interrelation of added or removed lines. We further enrich this representation by assimilating patch descriptions to obtain a holistic semantic portrait. This combination of multi-layered embeddings offers a robust representation, balancing word complexity, code-line insights, and patch descriptions. Evaluating MultiSEM for detecting security patches, our results demonstrate its superiority, outperforming state-of-the-art models with promising margins: a 22.46% improvement on PatchDB and 9.21% on SPI-DB in terms of the F1 metric.
Hyperbolic Code Retrieval: A Novel Approach for Efficient Code Search Using Hyperbolic Space Embeddings
Within the realm of advanced code retrieval, existing methods have primarily relied on intricate matching and attention-based mechanisms. However, these methods often lead to computational and memory inefficiencies, posing a significant challenge to their real-world applicability. To tackle this challenge, we propose a novel approach, Hyperbolic Code QA Matching (HyCoQA). This approach leverages the unique properties of hyperbolic space to express connections between code fragments and their corresponding queries, thereby obviating the necessity for intricate interaction layers. The process commences with a reimagining of the code retrieval challenge, framed within a question-answering (QA) matching framework, constructing a dataset with triple matches characterized as \texttt{}. These matches are subsequently processed via a static BERT embedding layer, yielding initial embeddings. Thereafter, a hyperbolic embedder transforms these representations into hyperbolic space, calculating distances between the codes and descriptions. The process concludes by implementing a scoring layer on these distances and leveraging hinge loss for model training. Notably, the design of HyCoQA inherently facilitates self-organization, allowing for the automatic detection of embedded hierarchical patterns during the learning phase. Experimentally, HyCoQA showcases remarkable effectiveness in our evaluations: an average performance improvement of 3.5% to 4% compared to state-of-the-art code retrieval techniques.
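The distance-plus-hinge-loss step can be made concrete with a small sketch. It assumes the Poincaré ball model of hyperbolic space and a standard margin-based triplet hinge loss; the abstract names neither the model nor the margin, so these choices, the function names, and the toy points are assumptions and not HyCoQA's actual implementation.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def hinge_loss(query, positive, negative, margin=1.0):
    """Zero once the matching code is closer than the non-matching code by `margin`."""
    d_pos = poincare_distance(query, positive)
    d_neg = poincare_distance(query, negative)
    return max(0.0, margin + d_pos - d_neg)

# Toy 2-dimensional points (hypothetical embeddings of a query and two code snippets).
query = [0.0, 0.0]
matching_code = [0.1, 0.0]
other_code = [0.5, 0.0]
print(poincare_distance(query, other_code))
print(hinge_loss(query, matching_code, other_code, margin=0.5))
```

Because the score depends only on a closed-form distance between two embeddings, no pairwise interaction layer is needed at query time, which is the efficiency argument the abstract makes.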