18 research outputs found

    TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree transformation

    Full text link
    Large-scale language models have made great progress in the field of software engineering in recent years. They can be used for many code-related tasks such as code clone detection, code-to-code search, and method name prediction. However, these large-scale language models based on each code token have several drawbacks: They are usually large in scale, heavily dependent on labels, and require a lot of computing power and time to fine-tune new datasets.Furthermore, code embedding should be performed on the entire code snippet rather than encoding each code token. The main reason for this is that encoding each code token would cause model parameter inflation, resulting in a lot of parameters storing information that we are not very concerned about. In this paper, we propose a novel framework, called TransformCode, that learns about code embeddings in a contrastive learning manner. The framework uses the Transformer encoder as an integral part of the model. We also introduce a novel data augmentation technique called abstract syntax tree transformation: This technique applies syntactic and semantic transformations to the original code snippets to generate more diverse and robust anchor samples. Our proposed framework is both flexible and adaptable: It can be easily extended to other downstream tasks that require code representation such as code clone detection and classification. The framework is also very efficient and scalable: It does not require a large model or a large amount of training data, and can support any programming language.Finally, our framework is not limited to unsupervised learning, but can also be applied to some supervised learning tasks by incorporating task-specific labels or objectives. To explore the effectiveness of our framework, we conducted extensive experiments on different software engineering tasks using different programming languages and multiple datasets

    AdaCCD: Adaptive Semantic Contrasts Discovery based Cross Lingual Adaptation for Code Clone Detection

    Full text link
    Code Clone Detection, which aims to retrieve functionally similar programs from large code bases, has been attracting increasing attention. Modern software often involves a diverse range of programming languages. However, current code clone detection methods are generally limited to only a few popular programming languages due to insufficient annotated data as well as their own model design constraints. To address these issues, we present AdaCCD, a novel cross-lingual adaptation method that can detect cloned codes in a new language without any annotations in that language. AdaCCD leverages language-agnostic code representations from pre-trained programming language models and propose an Adaptively Refined Contrastive Learning framework to transfer knowledge from resource-rich languages to resource-poor languages. We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages. AdaCCD achieves significant improvements over other baselines, and it is even comparable to supervised fine-tuning.Comment: 10 page

    A code search engine for software ecosystems

    Get PDF
    Searching and reusing source code play an increasingly significant role in the daily tasks of software developers. While code repositories, such as GitHub and Stackoverflow, may provide some results, a code search engine is generally considered most helpful when searching for code snippets as they typically crawl data from a wide range of code repositories. Code search engines enable software developers to search for code snippets using search terms. The accuracy of the search results can be increased if the searchers' intent can be modeled and predicted correctly. This study proposes a novel code search engine to model user intents through a dialogue system and then suggests a ranked list of code snippets that can meet user requirements

    Aspect of Code Cloning Towards Software Bug and Imminent Maintenance: A Perspective on Open-source and Industrial Mobile Applications

    Get PDF
    As a part of the digital era of microtechnology, mobile application (app) development is evolving with lightning speed to enrich our lives and bring new challenges and risks. In particular, software bugs and failures cost trillions of dollars every year, including fatalities such as a software bug in a self-driving car that resulted in a pedestrian fatality in March 2018 and the recent Boeing-737 Max tragedies that resulted in hundreds of deaths. Software clones (duplicated fragments of code) are also found to be one of the crucial factors for having bugs or failures in software systems. There have been many significant studies on software clones and their relationships to software bugs for desktop-based applications. Unfortunately, while mobile apps have become an integral part of today’s era, there is a marked lack of such studies for mobile apps. In order to explore this important aspect, in this thesis, first, we studied the characteristics of software bugs in the context of mobile apps, which might not be prevalent for desktop-based apps such as energy-related (battery drain while using apps) and compatibility-related (different behaviors of same app in different devices) bugs/issues. Using Support Vector Machine (SVM), we classified about 3K mobile app bug reports of different open-source development sites into four categories: crash, energy, functionality and security bug. We then manually examined a subset of those bugs and found that over 50% of the bug-fixing code-changes occurred in clone code. There have been a number of studies with desktop-based software systems that clearly show the harmful impacts of code clones and their relationships to software bugs. Given that there is a marked lack of such studies for mobile apps, in our second study, we examined 11 open-source and industrial mobile apps written in two different languages (Java and Swift) and noticed that clone code is more bug-prone than non-clone code and that industrial mobile apps have a higher code clone ratio than open-source mobile apps. Furthermore, we correlated our study outcomes with those of existing desktop based studies and surveyed 23 mobile app developers to validate our findings. Along with validating our findings from the survey, we noticed that around 95% of the developers usually copy/paste (code cloning) code fragments from the popular Crowd-sourcing platform, Stack Overflow (SO) to their projects and that over 75% of such developers experience bugs after such activities (the code cloning from SO). Existing studies with desktop-based systems also showed that while SO is one of the most popular online platforms for code reuse (and code cloning), SO code fragments are usually toxic in terms of software maintenance perspective. Thus, in the third study of this thesis, we studied the consequences of code cloning from SO in different open source and industrial mobile apps. We observed that closed-source industrial apps even reused more SO code fragments than open-source mobile apps and that SO code fragments were more change-prone (such as bug) than non-SO code fragments. We also experienced that SO code fragments were related to more bugs in industrial projects than open-source ones. Our studies show how we could efficiently and effectively manage clone related software bugs for mobile apps by utilizing the positive sides of code cloning while overcoming (or at least minimizing) the negative consequences of clone fragments

    Codeklonerkennung mit Dominatorinformationen

    Get PDF
    If an existing function in a software project is copied and reused (in a slightly modiïŹed version), the result is a code clone. If there was an error or vulnerability in the original function, this error or vulnerability is now contained in several places in the software project. This is one of the reasons why research is being done to develop powerful and scalable clone detection techniques. In this thesis, a new clone detection method is presented that uses paths and path sets derived from the dominator trees of the functions to detect the code clones. A dominator tree is a special form of the control ïŹ‚ow graph, which does not contain cycles. The dominator tree based method has been implemented in the StoneDetector tool and can detect code clones in Java source code as well as in Java bytecode. It has equally good or better recall and precision results than previously published code clone detection methods. The evaluation was performed using the BigCloneBench. Scalability measurements showed that even source code with several 100 million lines of code can be searched in a reasonable time. In order to evaluate the bytecode based StoneDetector variant, the BigCloneBench ïŹles had to be compiled. For this purpose, the Stubber tool was developed, which can compile Java source code ïŹles without the required libraries. Finally, it could be shown that using the register code generated from the Java bytecode, similar recall and precision values could be achieved compared to the source code based variant. Since some machine learning studies specify that very good recall and precision values can be achieved for all clone types, a machine learning method was trained with dominator trees. It could be shown that the results published by the studies are not reproducible on unseen data.Wird eine bestehende Funktion in einem Softwareprojekt kopiert und (in leicht angepasster Form) erneut genutzt, entsteht ein Codeklon. War in der ursprĂŒnglichen Funktion jedoch ein Fehler oder eine Schwachstelle, so ist dieser Fehler beziehungsweise diese Schwachstelle jetzt an mehreren Stellen im Softwareprojekt enthalten. Dies ist einer der GrĂŒnde, weshalb an der Entwicklung von leistungsstarken und skalierbaren Klonerkennungsverfahren geforscht wird. In der hier vorliegenden Arbeit wird ein neues Klonerkennungsverfahren vorgestellt, das zum Detektieren der Codeklone Pfade und Pfadmengen nutzt, die aus den DominatorbĂ€umen der Funktionen abgeleitet werden. Ein Dominatorbaum wird aus dem Kontrollflussgraphen abgeleitet und enthĂ€lt keine Zyklen. Das Dominatorbaum-basierte Verfahren wurde in dem Werkzeug StoneDetector umgesetzt und kann Codeklone sowohl im Java-Quelltext als auch im Java-Bytecode detektieren. Dabei hat es gleich gute oder bessere Recall- und Precision-Werte als bisher veröffentlichte Codeklonerkennungsverfahren. Die Wert-Evaluierungen wurden dabei unter Verwendung des BigClone-Benchs durchgefĂŒhrt. Skalierbarkeitsmessungen zeigten, dass sogar Quellcodedateien mit mehreren 100-Millionen Codezeilen in angemessener Zeit durchsucht werden können. Damit die Bytecode-basierte StoneDetector-Variante auch evaluiert werden konnte, mussten die Dateien des BigCloneBench kompiliert werden. Dazu wurde das Stubber-Tool entwickelt, welches Java-Quelltextdateien ohne die benötigten AbhĂ€ngigkeiten kompilieren kann. Schlussendlich konnte somit gezeigt werden, dass mithilfe des aus dem Java-Bytecode generierten Registercodes Ă€hnliche Recall- und Precision-Werte im Vergleich zu der Quelltext-basierten Variante erreicht werden können. Da einige Arbeiten mit maschinellen Lernverfahren angeben, bei allen Klontypen sehr gute Recall- und Precision-Werte zu erreichen, wurde ein maschinelles Lernverfahren mit DominatorĂ€umen trainiert. Es konnte gezeigt werden, dass die von den Arbeiten veröffentlichten Ergebnisse nicht auf ungesehenen Daten reproduzierbar sind

    Modélisation formelle des systÚmes de détection d'intrusions

    Get PDF
    L’écosystĂšme de la cybersĂ©curitĂ© Ă©volue en permanence en termes du nombre, de la diversitĂ©, et de la complexitĂ© des attaques. De ce fait, les outils de dĂ©tection deviennent inefficaces face Ă  certaines attaques. On distingue gĂ©nĂ©ralement trois types de systĂšmes de dĂ©tection d’intrusions : dĂ©tection par anomalies, dĂ©tection par signatures et dĂ©tection hybride. La dĂ©tection par anomalies est fondĂ©e sur la caractĂ©risation du comportement habituel du systĂšme, typiquement de maniĂšre statistique. Elle permet de dĂ©tecter des attaques connues ou inconnues, mais gĂ©nĂšre aussi un trĂšs grand nombre de faux positifs. La dĂ©tection par signatures permet de dĂ©tecter des attaques connues en dĂ©finissant des rĂšgles qui dĂ©crivent le comportement connu d’un attaquant. Cela demande une bonne connaissance du comportement de l’attaquant. La dĂ©tection hybride repose sur plusieurs mĂ©thodes de dĂ©tection incluant celles sus-citĂ©es. Elle prĂ©sente l’avantage d’ĂȘtre plus prĂ©cise pendant la dĂ©tection. Des outils tels que Snort et Zeek offrent des langages de bas niveau pour l’expression de rĂšgles de reconnaissance d’attaques. Le nombre d’attaques potentielles Ă©tant trĂšs grand, ces bases de rĂšgles deviennent rapidement difficiles Ă  gĂ©rer et Ă  maintenir. De plus, l’expression de rĂšgles avec Ă©tat dit stateful est particuliĂšrement ardue pour reconnaĂźtre une sĂ©quence d’évĂ©nements. Dans cette thĂšse, nous proposons une approche stateful basĂ©e sur les diagrammes d’état-transition algĂ©briques (ASTDs) afin d’identifier des attaques complexes. Les ASTDs permettent de reprĂ©senter de façon graphique et modulaire une spĂ©cification, ce qui facilite la maintenance et la comprĂ©hension des rĂšgles. Nous Ă©tendons la notation ASTD avec de nouvelles fonctionnalitĂ©s pour reprĂ©senter des attaques complexes. Ensuite, nous spĂ©cifions plusieurs attaques avec la notation Ă©tendue et exĂ©cutons les spĂ©cifications obtenues sur des flots d’évĂ©nements Ă  l’aide d’un interprĂ©teur pour identifier des attaques. Nous Ă©valuons aussi les performances de l’interprĂ©teur avec des outils industriels tels que Snort et Zeek. Puis, nous rĂ©alisons un compilateur afin de gĂ©nĂ©rer du code exĂ©cutable Ă  partir d’une spĂ©cification ASTD, capable d’identifier de façon efficiente les sĂ©quences d’évĂ©nements.Abstract : The cybersecurity ecosystem continuously evolves with the number, the diversity, and the complexity of cyber attacks. Generally, we have three types of Intrusion Detection System (IDS) : anomaly-based detection, signature-based detection, and hybrid detection. Anomaly detection is based on the usual behavior description of the system, typically in a static manner. It enables detecting known or unknown attacks but also generating a large number of false positives. Signature based detection enables detecting known attacks by defining rules that describe known attacker’s behavior. It needs a good knowledge of attacker behavior. Hybrid detection relies on several detection methods including the previous ones. It has the advantage of being more precise during detection. Tools like Snort and Zeek offer low level languages to represent rules for detecting attacks. The number of potential attacks being large, these rule bases become quickly hard to manage and maintain. Moreover, the representation of stateful rules to recognize a sequence of events is particularly arduous. In this thesis, we propose a stateful approach based on algebraic state-transition diagrams (ASTDs) to identify complex attacks. ASTDs allow a graphical and modular representation of a specification, that facilitates maintenance and understanding of rules. We extend the ASTD notation with new features to represent complex attacks. Next, we specify several attacks with the extended notation and run the resulting specifications on event streams using an interpreter to identify attacks. We also evaluate the performance of the interpreter with industrial tools such as Snort and Zeek. Then, we build a compiler in order to generate executable code from an ASTD specification, able to efficiently identify sequences of events

    Automating Software Development for Mobile Computing Platforms

    Get PDF
    Mobile devices such as smartphones and tablets have become ubiquitous in today\u27s computing landscape. These devices have ushered in entirely new populations of users, and mobile operating systems are now outpacing more traditional desktop systems in terms of market share. The applications that run on these mobile devices (often referred to as apps ) have become a primary means of computing for millions of users and, as such, have garnered immense developer interest. These apps allow for unique, personal software experiences through touch-based UIs and a complex assortment of sensors. However, designing and implementing high quality mobile apps can be a difficult process. This is primarily due to challenges unique to mobile development including change-prone APIs and platform fragmentation, just to name a few. in this dissertation we develop techniques that aid developers in overcoming these challenges by automating and improving current software design and testing practices for mobile apps. More specifically, we first introduce a technique, called Gvt, that improves the quality of graphical user interfaces (GUIs) for mobile apps by automatically detecting instances where a GUI was not implemented to its intended specifications. Gvt does this by constructing hierarchal models of mobile GUIs from metadata associated with both graphical mock-ups (i.e., created by designers using photo-editing software) and running instances of the GUI from the corresponding implementation. Second, we develop an approach that completely automates prototyping of GUIs for mobile apps. This approach, called ReDraw, is able to transform an image of a mobile app GUI into runnable code by detecting discrete GUI-components using computer vision techniques, classifying these components into proper functional categories (e.g., button, dropdown menu) using a Convolutional Neural Network (CNN), and assembling these components into realistic code. Finally, we design a novel approach for automated testing of mobile apps, called CrashScope, that explores a given android app using systematic input generation with the intrinsic goal of triggering crashes. The GUI-based input generation engine is driven by a combination of static and dynamic analyses that create a model of an app\u27s GUI and targets common, empirically derived root causes of crashes in android apps. We illustrate that the techniques presented in this dissertation represent significant advancements in mobile development processes through a series of empirical investigations, user studies, and industrial case studies that demonstrate the effectiveness of these approaches and the benefit they provide developers
    corecore