Aspect-Controlled Neural Argument Generation
We rely on arguments in our daily lives to express our opinions and ground
them in evidence, making them more convincing. However, finding and
formulating arguments can be challenging. In this work, we train a language
model for argument generation that can be controlled on a fine-grained level to
generate sentence-level arguments for a given topic, stance, and aspect. We
define argument aspect detection as a necessary method to enable this
fine-grained control and crowdsource a dataset of 5,032 arguments annotated
with aspects. Our evaluation shows that our generation model is able to
generate high-quality, aspect-specific arguments. Moreover, these arguments can
be used to improve the performance of stance detection models via data
augmentation and to generate counter-arguments. We publish all datasets and
code to fine-tune the language model.
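The fine-grained control described above (topic, stance, aspect) can be sketched as a control-code prompt prepended to the generation input. The [TOPIC]/[STANCE]/[ASPECT] marker format below is an illustrative assumption, not the paper's actual token scheme:

```python
# Hypothetical sketch of a control-code prompt for aspect-controlled
# generation. The marker tokens are assumptions for illustration only;
# the paper's real control format may differ.

def build_control_prompt(topic: str, stance: str, aspect: str) -> str:
    """Compose a fine-grained control prefix for a conditional language model."""
    return f"[TOPIC] {topic} [STANCE] {stance} [ASPECT] {aspect} [ARGUMENT]"

prompt = build_control_prompt("nuclear energy", "con", "waste disposal")
# The language model would then be fine-tuned to continue this prefix with
# a sentence-level argument matching all three control signals.
```

A fine-tuned model conditioned this way can be steered per-aspect at inference time simply by swapping the control values.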
GraphBinMatch: Graph-based Similarity Learning for Cross-Language Binary and Source Code Matching
Matching binary to source code and vice versa has various applications in
different fields, such as computer security, software engineering, and reverse
engineering. Even though there exist methods that try to match source code with
binary code to accelerate the reverse engineering process, most of them are
designed to focus on one programming language. However, in real life, programs
are developed using different programming languages depending on their
requirements. Thus, cross-language binary-to-source code matching has recently
gained more attention. Nonetheless, existing approaches still struggle to
make precise predictions due to the inherent difficulty of matching binary
code and source code across programming languages. In this paper, we address
the problem of cross-language binary-to-source code matching. We propose
GraphBinMatch, an approach based on a graph
neural network that learns the similarity between binary and source codes. We
evaluate GraphBinMatch on several tasks, such as cross-language
binary-to-source code matching and cross-language source-to-source matching. We
also evaluate our approach's performance on single-language binary-to-source
code matching. Experimental results show that GraphBinMatch significantly
outperforms the state of the art, with improvements of up to 15% in F1
score.
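The core idea of scoring similarity between a binary's graph and a source graph can be sketched as pooling per-node embeddings into a graph-level vector and comparing the two vectors. The mean pooling and cosine scoring below are illustrative assumptions, not GraphBinMatch's actual GNN architecture:

```python
# Toy sketch of graph-level similarity scoring: node embeddings (as a GNN
# might produce) are mean-pooled into one vector per graph, and two graphs
# are compared by cosine similarity. Purely illustrative; the paper's model
# learns these embeddings end to end.
import math

def mean_pool(node_embeddings):
    """Average node vectors into a single graph-level vector."""
    dim = len(node_embeddings[0])
    n = len(node_embeddings)
    return [sum(v[i] for v in node_embeddings) / n for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

binary_graph = [[1.0, 0.0], [0.8, 0.2]]   # toy node embeddings for a binary's graph
source_graph = [[0.9, 0.1], [0.9, 0.1]]   # toy node embeddings for a source graph
score = cosine(mean_pool(binary_graph), mean_pool(source_graph))
```

A matching pair would be trained to score near 1.0, a non-matching pair near 0.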
Implant Global and Local Hierarchy Information to Sequence based Code Representation Models
Source code representation with deep learning techniques is an important
research field. There have been many studies that learn sequential or
structural information for code representation. However, sequence-based and
non-sequence-based models both have limitations. Researchers have attempted to
incorporate structural information into sequence-based models, but they mine
only part of the token-level hierarchical structure. In this paper, we
analyze how the complete hierarchical structure influences the tokens in code
sequences and abstract this influence as a property of code tokens called
hierarchical embedding. The hierarchical embedding is further divided into
statement-level global hierarchy and token-level local hierarchy. Furthermore,
we propose the Hierarchy Transformer (HiT), a simple but effective sequence
model to incorporate the complete hierarchical embeddings of source code into a
Transformer model. We demonstrate the effectiveness of hierarchical embedding
on learning code structure with an experiment on variable scope detection task.
Further evaluation shows that HiT outperforms SOTA baseline models and shows
stable training efficiency on three source-code-related tasks involving
classification and generation across 8 different datasets. Comment: Accepted by ICPC 202
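The split into statement-level (global) and token-level (local) hierarchy can be sketched as two extra components summed into each token's input embedding. The depth-indicator encoding below is an assumption for illustration, not HiT's actual formulation:

```python
# Illustrative sketch of hierarchical embeddings: each code token gets,
# besides its content embedding, a statement-level (global) and a
# token-level (local) hierarchy component, summed into the sequence model's
# input. The one-hot depth encoding here is a simplifying assumption.

def hierarchy_embedding(content_vec, stmt_depth, token_depth, dim=4):
    """Sum content, global-hierarchy, and local-hierarchy components."""
    global_vec = [1.0 if i == stmt_depth % dim else 0.0 for i in range(dim)]
    local_vec = [1.0 if i == token_depth % dim else 0.0 for i in range(dim)]
    return [c + g + l for c, g, l in zip(content_vec, global_vec, local_vec)]

# A token inside an `if` block (statement depth 1) at expression depth 2:
vec = hierarchy_embedding([0.5, 0.5, 0.5, 0.5], stmt_depth=1, token_depth=2)
```

The point is that two occurrences of the same token at different structural positions receive different inputs, letting a plain Transformer see the hierarchy.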
Normalizer: Augmenting Code Clone Detectors using Source Code Normalization
Code clones are duplicate fragments of code that perform the same task. As software code bases increase in size, the number of code clones also tends to increase. These code clones, possibly created through copy-and-paste methods or unintentional duplication of effort, increase maintenance cost over the lifespan of the software. Code clone detection tools exist to identify clones where a human search would prove infeasible; however, the quality of the clones found may vary. I demonstrate that the performance of such tools can be improved by normalizing the source code before usage. I developed Normalizer, a tool to transform C source code into normalized source code written as consistently as possible. By maintaining the code's function while enforcing a strict format, the variability of the programmer's style is taken out. Thus, code clones may be easier for tools to detect regardless of how the code was written.
Reordering statements, removing useless code, and renaming identifiers are used to achieve normalized code. Normalizer was used to show that, with a small variety of code clone detection tools, more clones can be found in Introduction to Computer Networks assignments in normalized source code than in the original source code.
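The identifier-renaming step described above can be sketched briefly. Normalizer itself targets C and also reorders statements and removes dead code; this Python-based sketch shows only the renaming idea, using Python's standard tokenizer:

```python
# Toy illustration of identifier normalization: rename identifiers to
# canonical names (id0, id1, ...) so that clones written with different
# variable names become textually identical. This is a sketch of the idea,
# not Normalizer's actual C implementation.
import io
import keyword
import tokenize

def normalize_identifiers(code: str) -> str:
    mapping, out = {}, []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # First-seen order gives each distinct identifier a stable name.
            name = mapping.setdefault(tok.string, f"id{len(mapping)}")
            out.append((tok.type, name))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

a = normalize_identifiers("total = price + tax")
b = normalize_identifiers("s = x + y")
# Both fragments normalize to the same string, exposing the clone.
```

After normalization, even a purely textual clone detector can match the two fragments.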
The Impact of Programming Language’s Type on Probabilistic Machine Learning Models
Software development is an expensive and difficult process. Mistakes can be easily made, and without an extensive review process, those mistakes can make it into production code and may have unintended disastrous consequences.
This is why various automated code review services have arisen in recent years, from AWS's CodeGuru and Microsoft's Code Analysis to more integrated code assistants like IntelliCode and auto-completion tools, all of which are designed to help and assist developers with their work and to catch overlooked bugs.
Thanks to recent advances in machine learning, these services have grown tremendously in sophistication, to a point where they can catch bugs that often go unnoticed even with traditional code reviews.
This project investigates the use of code2vec [1], a probabilistic machine learning model on source code, in correctly labeling methods from different programming language families. We extend this model to work with more languages, train the resulting models, and compare the performance of static and dynamic languages.
As a by-product, we create new datasets from the top-starred open-source GitHub projects in various languages. Different approaches for static and dynamic languages are applied, as well as some improvement techniques, like transfer learning. Finally, different parsers were used to see their effect on the model's performance.
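Dataset construction for a method-labeling model like code2vec boils down to turning each function in a source file into one (label, body-features) example. Using Python's own `ast` module below is an illustrative choice, not the project's actual parser:

```python
# Sketch of extracting (method_name, body_identifiers) training examples
# from source code, the kind of dataset a method-name-prediction model
# consumes. The feature choice (bag of body identifiers) is a simplifying
# assumption; code2vec itself uses AST path contexts.
import ast

def extract_examples(source: str):
    """Yield (method_name, sorted body identifiers) pairs from Python source."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            idents = sorted({n.id for n in ast.walk(node)
                             if isinstance(n, ast.Name)})
            yield node.name, idents

src = "def add(a, b):\n    return a + b\n"
examples = list(extract_examples(src))
```

Swapping in parsers for other languages at this step is what makes the per-parser comparison mentioned above possible.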
Large Language Models for Software Engineering: A Systematic Literature Review
Large Language Models (LLMs) have significantly impacted numerous domains,
notably including Software Engineering (SE). Nevertheless, a well-rounded
understanding of the application, effects, and possible limitations of LLMs
within SE is still in its early stages. To bridge this gap, our systematic
literature review takes a deep dive into the intersection of LLMs and SE, with
a particular focus on understanding how LLMs can be exploited in SE to optimize
processes and outcomes. Through a comprehensive review approach, we collect and
analyze a total of 229 research papers from 2017 to 2023 to answer four key
research questions (RQs). In RQ1, we categorize and provide a comparative
analysis of different LLMs that have been employed in SE tasks, laying out
their distinctive features and uses. For RQ2, we detail the methods involved in
data collection, preprocessing, and application in this realm, shedding light
on the critical role of robust, well-curated datasets for successful LLM
implementation. RQ3 allows us to examine the specific SE tasks where LLMs have
shown remarkable success, illuminating their practical contributions to the
field. Finally, RQ4 investigates the strategies employed to optimize and
evaluate the performance of LLMs in SE, as well as the common techniques
related to prompt optimization. Armed with insights drawn from addressing the
aforementioned RQs, we sketch a picture of the current state-of-the-art,
pinpointing trends, identifying gaps in existing research, and flagging
promising areas for future study.