18 research outputs found
TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree transformation
Large-scale language models have made great progress in the field of software
engineering in recent years. They can be used for many code-related tasks such
as code clone detection, code-to-code search, and method name prediction.
However, these large-scale language models based on each code token have
several drawbacks: They are usually large in scale, heavily dependent on
labels, and require a lot of computing power and time to fine-tune new
datasets.Furthermore, code embedding should be performed on the entire code
snippet rather than encoding each code token. The main reason for this is that
encoding each code token would cause model parameter inflation, resulting in a
lot of parameters storing information that we are not very concerned about. In
this paper, we propose a novel framework, called TransformCode, that learns
about code embeddings in a contrastive learning manner. The framework uses the
Transformer encoder as an integral part of the model. We also introduce a novel
data augmentation technique called abstract syntax tree transformation: This
technique applies syntactic and semantic transformations to the original code
snippets to generate more diverse and robust anchor samples. Our proposed
framework is both flexible and adaptable: It can be easily extended to other
downstream tasks that require code representation such as code clone detection
and classification. The framework is also very efficient and scalable: It does
not require a large model or a large amount of training data, and can support
any programming language.Finally, our framework is not limited to unsupervised
learning, but can also be applied to some supervised learning tasks by
incorporating task-specific labels or objectives. To explore the effectiveness
of our framework, we conducted extensive experiments on different software
engineering tasks using different programming languages and multiple datasets
AdaCCD: Adaptive Semantic Contrasts Discovery based Cross Lingual Adaptation for Code Clone Detection
Code Clone Detection, which aims to retrieve functionally similar programs
from large code bases, has been attracting increasing attention. Modern
software often involves a diverse range of programming languages. However,
current code clone detection methods are generally limited to only a few
popular programming languages due to insufficient annotated data as well as
their own model design constraints. To address these issues, we present AdaCCD,
a novel cross-lingual adaptation method that can detect cloned codes in a new
language without any annotations in that language. AdaCCD leverages
language-agnostic code representations from pre-trained programming language
models and propose an Adaptively Refined Contrastive Learning framework to
transfer knowledge from resource-rich languages to resource-poor languages. We
evaluate the cross-lingual adaptation results of AdaCCD by constructing a
multilingual code clone detection benchmark consisting of 5 programming
languages. AdaCCD achieves significant improvements over other baselines, and
it is even comparable to supervised fine-tuning.Comment: 10 page
A code search engine for software ecosystems
Searching and reusing source code play an increasingly significant role in the daily tasks of software developers. While code repositories, such as GitHub and Stackoverflow, may provide some results, a code search engine is generally considered most helpful when searching for code snippets as they typically crawl data from a wide range of code repositories. Code search engines enable software developers to search for code snippets using search terms. The accuracy of the search results can be increased if the searchers' intent can be modeled and predicted correctly. This study proposes a novel code search engine to model user intents through a dialogue system and then suggests a ranked list of code snippets that can meet user requirements
Aspect of Code Cloning Towards Software Bug and Imminent Maintenance: A Perspective on Open-source and Industrial Mobile Applications
As a part of the digital era of microtechnology, mobile application (app) development is evolving with
lightning speed to enrich our lives and bring new challenges and risks. In particular, software bugs and
failures cost trillions of dollars every year, including fatalities such as a software bug in a self-driving car
that resulted in a pedestrian fatality in March 2018 and the recent Boeing-737 Max tragedies that resulted
in hundreds of deaths. Software clones (duplicated fragments of code) are also found to be one of the
crucial factors for having bugs or failures in software systems. There have been many significant studies
on software clones and their relationships to software bugs for desktop-based applications. Unfortunately,
while mobile apps have become an integral part of todayâs era, there is a marked lack of such studies for
mobile apps. In order to explore this important aspect, in this thesis, first, we studied the characteristics of
software bugs in the context of mobile apps, which might not be prevalent for desktop-based apps such as
energy-related (battery drain while using apps) and compatibility-related (different behaviors of same app
in different devices) bugs/issues. Using Support Vector Machine (SVM), we classified about 3K mobile app
bug reports of different open-source development sites into four categories: crash, energy, functionality and
security bug. We then manually examined a subset of those bugs and found that over 50% of the bug-fixing
code-changes occurred in clone code. There have been a number of studies with desktop-based software
systems that clearly show the harmful impacts of code clones and their relationships to software bugs. Given
that there is a marked lack of such studies for mobile apps, in our second study, we examined 11 open-source
and industrial mobile apps written in two different languages (Java and Swift) and noticed that clone code
is more bug-prone than non-clone code and that industrial mobile apps have a higher code clone ratio than
open-source mobile apps. Furthermore, we correlated our study outcomes with those of existing desktop based studies and surveyed 23 mobile app developers to validate our findings. Along with validating our
findings from the survey, we noticed that around 95% of the developers usually copy/paste (code cloning)
code fragments from the popular Crowd-sourcing platform, Stack Overflow (SO) to their projects and that
over 75% of such developers experience bugs after such activities (the code cloning from SO). Existing studies
with desktop-based systems also showed that while SO is one of the most popular online platforms for code
reuse (and code cloning), SO code fragments are usually toxic in terms of software maintenance perspective.
Thus, in the third study of this thesis, we studied the consequences of code cloning from SO in different open source and industrial mobile apps. We observed that closed-source industrial apps even reused more SO code
fragments than open-source mobile apps and that SO code fragments were more change-prone (such as bug)
than non-SO code fragments. We also experienced that SO code fragments were related to more bugs in
industrial projects than open-source ones. Our studies show how we could efficiently and effectively manage
clone related software bugs for mobile apps by utilizing the positive sides of code cloning while overcoming
(or at least minimizing) the negative consequences of clone fragments
Codeklonerkennung mit Dominatorinformationen
If an existing function in a software project is copied and reused (in a slightly modiïŹed version), the result is a code clone. If there was an error or vulnerability in the original function, this error or vulnerability is now contained in several places in the software project. This is one of the reasons why research is being done to develop powerful and scalable clone detection techniques. In this thesis, a new clone detection method is presented that uses paths and path sets derived from the dominator trees of the functions to detect the code clones. A dominator tree is a special form of the control ïŹow graph, which does not contain cycles. The dominator tree based method has been implemented in the StoneDetector tool and can detect code clones in Java source code as well as in Java bytecode. It has equally good or better recall and precision results than previously published code clone detection methods. The evaluation was performed using the BigCloneBench. Scalability measurements showed that even source code with several 100 million lines of code can be searched in a reasonable time. In order to evaluate the bytecode based StoneDetector variant, the BigCloneBench ïŹles had to be compiled. For this purpose, the Stubber tool was developed, which can compile Java source code ïŹles without the required libraries. Finally, it could be shown that using the register code generated from the Java bytecode, similar recall and precision values could be achieved compared to the source code based variant. Since some machine learning studies specify that very good recall and precision values can be achieved for all clone types, a machine learning method was trained with dominator trees. It could be shown that the results published by the studies are not reproducible on unseen data.Wird eine bestehende Funktion in einem Softwareprojekt kopiert und (in leicht angepasster Form) erneut genutzt, entsteht ein Codeklon. War in der ursprĂŒnglichen Funktion jedoch ein Fehler oder eine Schwachstelle, so ist dieser Fehler beziehungsweise diese Schwachstelle jetzt an mehreren Stellen im Softwareprojekt enthalten. Dies ist einer der GrĂŒnde, weshalb an der Entwicklung von leistungsstarken und skalierbaren Klonerkennungsverfahren geforscht wird. In der hier vorliegenden Arbeit wird ein neues Klonerkennungsverfahren vorgestellt, das zum Detektieren der Codeklone Pfade und Pfadmengen nutzt, die aus den DominatorbĂ€umen der Funktionen abgeleitet werden. Ein Dominatorbaum wird aus dem Kontrollflussgraphen abgeleitet und enthĂ€lt keine Zyklen. Das Dominatorbaum-basierte Verfahren wurde in dem Werkzeug StoneDetector umgesetzt und kann Codeklone sowohl im Java-Quelltext als auch im Java-Bytecode detektieren. Dabei hat es gleich gute oder bessere Recall- und Precision-Werte als bisher veröffentlichte Codeklonerkennungsverfahren. Die Wert-Evaluierungen wurden dabei unter Verwendung des BigClone-Benchs durchgefĂŒhrt. Skalierbarkeitsmessungen zeigten, dass sogar Quellcodedateien mit mehreren 100-Millionen Codezeilen in angemessener Zeit durchsucht werden können. Damit die Bytecode-basierte StoneDetector-Variante auch evaluiert werden konnte, mussten die Dateien des BigCloneBench kompiliert werden. Dazu wurde das Stubber-Tool entwickelt, welches Java-Quelltextdateien ohne die benötigten AbhĂ€ngigkeiten kompilieren kann. Schlussendlich konnte somit gezeigt werden, dass mithilfe des aus dem Java-Bytecode generierten Registercodes Ă€hnliche Recall- und Precision-Werte im Vergleich zu der Quelltext-basierten Variante erreicht werden können. Da einige Arbeiten mit maschinellen Lernverfahren angeben, bei allen Klontypen sehr gute Recall- und Precision-Werte zu erreichen, wurde ein maschinelles Lernverfahren mit DominatorĂ€umen trainiert. Es konnte gezeigt werden, dass die von den Arbeiten veröffentlichten Ergebnisse nicht auf ungesehenen Daten reproduzierbar sind
Modélisation formelle des systÚmes de détection d'intrusions
LâĂ©cosystĂšme de la cybersĂ©curitĂ© Ă©volue en permanence en termes du nombre, de la diversitĂ©, et de la complexitĂ© des attaques. De ce fait, les outils de dĂ©tection deviennent inefficaces face Ă certaines attaques. On distingue gĂ©nĂ©ralement trois types de systĂšmes de dĂ©tection dâintrusions : dĂ©tection par anomalies, dĂ©tection par signatures et dĂ©tection hybride. La dĂ©tection par anomalies est fondĂ©e sur la caractĂ©risation du comportement habituel du systĂšme, typiquement de maniĂšre statistique. Elle permet de dĂ©tecter des attaques connues ou inconnues, mais gĂ©nĂšre aussi un trĂšs grand nombre de faux positifs. La dĂ©tection par signatures permet de dĂ©tecter des attaques connues en dĂ©finissant des rĂšgles qui dĂ©crivent le comportement connu dâun attaquant. Cela demande une bonne connaissance du comportement de lâattaquant. La dĂ©tection hybride repose sur plusieurs mĂ©thodes de dĂ©tection incluant celles sus-citĂ©es. Elle prĂ©sente lâavantage dâĂȘtre plus prĂ©cise pendant la dĂ©tection. Des outils tels que Snort et Zeek offrent des langages de bas niveau pour lâexpression de rĂšgles de reconnaissance dâattaques. Le nombre dâattaques potentielles Ă©tant trĂšs grand, ces bases de rĂšgles deviennent rapidement difficiles Ă gĂ©rer et Ă maintenir. De plus, lâexpression de rĂšgles avec Ă©tat dit stateful est particuliĂšrement ardue pour reconnaĂźtre une sĂ©quence dâĂ©vĂ©nements. Dans cette thĂšse, nous proposons une approche stateful basĂ©e sur les diagrammes dâĂ©tat-transition algĂ©briques (ASTDs) afin dâidentifier des attaques complexes. Les ASTDs permettent de reprĂ©senter de façon graphique et modulaire une spĂ©cification, ce qui facilite la maintenance et la comprĂ©hension des rĂšgles. Nous Ă©tendons la notation ASTD avec de nouvelles fonctionnalitĂ©s pour reprĂ©senter des attaques complexes. Ensuite, nous spĂ©cifions plusieurs attaques avec la notation Ă©tendue et exĂ©cutons les spĂ©cifications obtenues sur des flots dâĂ©vĂ©nements Ă lâaide dâun interprĂ©teur pour identifier des attaques. Nous Ă©valuons aussi les performances de lâinterprĂ©teur avec des outils industriels tels que Snort et Zeek. Puis, nous rĂ©alisons un compilateur afin de gĂ©nĂ©rer du code exĂ©cutable Ă partir dâune spĂ©cification ASTD, capable dâidentifier de façon efficiente les sĂ©quences dâĂ©vĂ©nements.Abstract : The cybersecurity ecosystem continuously evolves with the number, the diversity,
and the complexity of cyber attacks. Generally, we have three types of Intrusion
Detection System (IDS) : anomaly-based detection, signature-based detection, and
hybrid detection. Anomaly detection is based on the usual behavior description of
the system, typically in a static manner. It enables detecting known or unknown attacks
but also generating a large number of false positives. Signature based detection
enables detecting known attacks by defining rules that describe known attackerâs behavior.
It needs a good knowledge of attacker behavior. Hybrid detection relies on
several detection methods including the previous ones. It has the advantage of being
more precise during detection. Tools like Snort and Zeek offer low level languages to
represent rules for detecting attacks. The number of potential attacks being large,
these rule bases become quickly hard to manage and maintain. Moreover, the representation
of stateful rules to recognize a sequence of events is particularly arduous. In this thesis, we propose a stateful approach based on algebraic state-transition
diagrams (ASTDs) to identify complex attacks. ASTDs allow a graphical and modular
representation of a specification, that facilitates maintenance and understanding of
rules. We extend the ASTD notation with new features to represent complex attacks.
Next, we specify several attacks with the extended notation and run the resulting specifications
on event streams using an interpreter to identify attacks. We also evaluate
the performance of the interpreter with industrial tools such as Snort and Zeek. Then,
we build a compiler in order to generate executable code from an ASTD specification,
able to efficiently identify sequences of events
Automating Software Development for Mobile Computing Platforms
Mobile devices such as smartphones and tablets have become ubiquitous in today\u27s computing landscape. These devices have ushered in entirely new populations of users, and mobile operating systems are now outpacing more traditional desktop systems in terms of market share. The applications that run on these mobile devices (often referred to as apps ) have become a primary means of computing for millions of users and, as such, have garnered immense developer interest. These apps allow for unique, personal software experiences through touch-based UIs and a complex assortment of sensors. However, designing and implementing high quality mobile apps can be a difficult process. This is primarily due to challenges unique to mobile development including change-prone APIs and platform fragmentation, just to name a few. in this dissertation we develop techniques that aid developers in overcoming these challenges by automating and improving current software design and testing practices for mobile apps. More specifically, we first introduce a technique, called Gvt, that improves the quality of graphical user interfaces (GUIs) for mobile apps by automatically detecting instances where a GUI was not implemented to its intended specifications. Gvt does this by constructing hierarchal models of mobile GUIs from metadata associated with both graphical mock-ups (i.e., created by designers using photo-editing software) and running instances of the GUI from the corresponding implementation. Second, we develop an approach that completely automates prototyping of GUIs for mobile apps. This approach, called ReDraw, is able to transform an image of a mobile app GUI into runnable code by detecting discrete GUI-components using computer vision techniques, classifying these components into proper functional categories (e.g., button, dropdown menu) using a Convolutional Neural Network (CNN), and assembling these components into realistic code. Finally, we design a novel approach for automated testing of mobile apps, called CrashScope, that explores a given android app using systematic input generation with the intrinsic goal of triggering crashes. The GUI-based input generation engine is driven by a combination of static and dynamic analyses that create a model of an app\u27s GUI and targets common, empirically derived root causes of crashes in android apps. We illustrate that the techniques presented in this dissertation represent significant advancements in mobile development processes through a series of empirical investigations, user studies, and industrial case studies that demonstrate the effectiveness of these approaches and the benefit they provide developers
Recommended from our members
The Roles of Language Models and Hierarchical Models in Neural Sequence-to-Sequence Prediction
With the advent of deep learning, research in many areas of machine learning is converging towards the same set of methods and models. For example, long short-term memory networks are not only popular for various tasks in natural language processing (NLP) such as speech recognition, machine translation, handwriting recognition, syntactic parsing, etc., but they are also applicable to seemingly unrelated fields such as robot control, time series prediction, and bioinformatics. Recent advances in contextual word embeddings like BERT boast with achieving state-of-the-art results on 11 NLP tasks with the same model. Before deep learning, a speech recognizer and a syntactic parser used to have little in common as systems were much more tailored towards the task at hand.
At the core of this development is the tendency to view each task as yet another data mapping problem, neglecting the particular characteristics and (soft) requirements tasks often have in practice. This often goes along with a sharp break of deep learning methods with previous research in the specific area. This work can be understood as an antithesis to this paradigm. We show how traditional symbolic statistical machine translation models can still improve neural machine translation (NMT) while reducing the risk for common pathologies of NMT such as hallucinations and neologisms. Other external symbolic models such as spell checkers and morphology databases help neural grammatical error correction. We also focus on language models that often do not play a role in vanilla end-to-end approaches and apply them in different ways to word reordering, grammatical error correction, low-resource NMT, and document-level NMT. Finally, we demonstrate the benefit of hierarchical models in sequence-to-sequence prediction. Hand-engineered covering grammars are effective in preventing catastrophic errors in neural text normalization systems. Our operation sequence model for interpretable NMT represents translation as a series of actions that modify the translation state, and can also be seen as derivation in a formal grammar.EPSRC grant EP/L027623/1
EPSRC Tier-2 capital grant EP/P020259/