60 research outputs found
PROGRAM INSPECTION AND TESTING TECHNIQUES FOR CODE CLONES AND REFACTORINGS IN EVOLVING SOFTWARE
Developers often copy and paste code. This practice scatters similar code fragments (also known as code clones) throughout a code base. Refactoring to remove clones is beneficial because it prevents clones from harming software quality through, for example, hidden bug propagation and unintentional inconsistent changes. However, recent research has provided evidence that factoring out clones does not always reduce the risk of introducing defects, and it is often difficult or impossible to remove clones using standard refactoring techniques. To investigate which clones can be refactored and how, developers typically spend a significant amount of their time managing individual clone instances or clone groups scattered across a large code base.
To address this problem, this research proposes two techniques to inspect and validate refactoring changes. First, we propose Pattern-based clone Refactoring Inspection (PRI), a technique for managing clone refactorings using refactoring pattern templates. By matching the templates against a code base, PRI summarizes refactoring changes of clones and flags clone instances that were not consistently factored out as potential anomalies. Second, we propose a Refactoring Investigation and Testing technique, called RIT, which improves testing efficiency for validating refactoring changes. RIT uses PRI to identify refactorings by analyzing the original and edited versions of a program. It then uses the semantic impact of the set of identified refactoring changes to detect tests whose behavior may have been affected and modified by refactoring edits. For each failed assertion, RIT helps developers focus their attention on logically related program statements by applying program slicing to minimize each test. For debugging purposes, RIT determines the specific failure-inducing refactoring edits, separating them from other changes that only affect other assertions or tests.
apk2vec: Semi-supervised multi-view representation learning for profiling Android applications
Building behavior profiles of Android applications (apps) with holistic, rich
and multi-view information (e.g., incorporating several semantic views of an
app such as API sequences, system calls, etc.) would help serve downstream
analytics tasks such as app categorization, recommendation and malware analysis
significantly better. Towards this goal, we design a semi-supervised
Representation Learning (RL) framework named apk2vec to automatically generate
a compact representation (aka profile/embedding) for a given app. More
specifically, apk2vec has the following three unique characteristics which make
it an excellent choice for large-scale app profiling: (1) it encompasses
information from multiple semantic views such as API sequences, permissions,
etc., (2) being a semi-supervised embedding technique, it can make use of
labels associated with apps (e.g., malware family or app category labels) to
build high quality app profiles, and (3) it combines RL and feature hashing
which allows it to efficiently build profiles of apps that stream over time
(i.e., online learning). The resulting semi-supervised multi-view hash
embeddings of apps could then be used for a wide variety of downstream tasks
such as the ones mentioned above. Our extensive evaluations with more than
42,000 apps demonstrate that apk2vec's app profiles could significantly
outperform state-of-the-art techniques in four app analytics tasks namely,
malware detection, familial clustering, app clone detection and app
recommendation.
Comment: International Conference on Data Mining, 201
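The feature-hashing ingredient mentioned in the apk2vec abstract can be illustrated with a minimal sketch. This is not the apk2vec implementation: it shows only the hashing trick applied to multi-view string features, with no learned embedding and no use of labels. The function name `hashed_profile`, the dimension of 16, and the sample features are assumptions made for illustration.

```python
import hashlib


def hashed_profile(views, dim=16):
    """Map multi-view string features to a fixed-size vector (hashing trick).

    `views` is a dict from a view name (e.g. "api") to a list of feature strings.
    Hypothetical helper, not part of apk2vec.
    """
    vec = [0.0] * dim
    for view_name, features in views.items():
        for feat in features:
            # Prefix with the view name so the same string in two views
            # is treated as two different features.
            digest = hashlib.sha256(f"{view_name}:{feat}".encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % dim
            # A second hash-derived bit picks the sign, which reduces
            # the bias introduced by collisions.
            sign = 1.0 if digest[4] % 2 == 0 else -1.0
            vec[index] += sign
    return vec


# Assumed sample app with two views.
app = {"api": ["getDeviceId", "sendTextMessage"], "permission": ["SEND_SMS"]}
profile = hashed_profile(app)
print(profile)
```

Because the vector size is fixed in advance, new apps with previously unseen features can be profiled without rebuilding a vocabulary, which is what makes hashing attractive for apps that arrive as a stream.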
Slicing unconditional jumps with unnecessary control dependencies
Program slicing is an analysis technique that has a wide range of applications, ranging from compilers to clone detection software, and that has been applied to practically all programming languages. Most program slicing techniques are based on a widely used program representation, the System Dependence Graph (SDG). However, in the presence of unconditional jumps, there exist some situations where most SDG-based slicing techniques are not as accurate as possible, including more code than strictly necessary. In this paper, we identify one of these scenarios, pointing out the cause of the inaccuracy, and describing the initial solution to the problem proposed in the literature, together with an extension, which solves the problem completely. These solutions modify both the SDG generation and the slicing algorithm. Additionally, we propose an alternative solution that solves the problem by modifying only the SDG generation, leaving the slicing algorithm untouched.
This work has been partially supported by the EU (FEDER) and the Spanish MCI/AEI under grants TIN2016-76843-C4-1-R and PID2019-104735RB-C41, by the Generalitat Valenciana under grant Prometeo/2019/098 (DeepTrust), and by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.
Galindo-Jiménez, CS.; Pérez-Rubio, S.; Silva, J. (2021). Slicing unconditional jumps with unnecessary control dependencies. Lecture Notes in Computer Science. 12561:293-308. https://doi.org/10.1007/978-3-030-68446-4_15
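As a rough illustration of what slicing on a dependence graph means, the sketch below computes a backward slice as plain reachability over dependence edges. It is a simplified, intraprocedural sketch, not the SDG-based algorithm discussed in the paper, and it does not address unconditional jumps. The function `backward_slice`, the statement numbering, and the toy program are assumptions made for illustration.

```python
from collections import deque


def backward_slice(dependences, criterion):
    """Return every statement the criterion transitively depends on.

    `dependences` maps a statement id to the ids it depends on
    (data or control dependence). The criterion itself is included.
    """
    seen = {criterion}
    queue = deque([criterion])
    while queue:
        node = queue.popleft()
        for pred in dependences.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return seen


# Hypothetical program:
#   1: x = 1
#   2: y = 2
#   3: if x > 0:
#   4:     z = x + 1
#   5: print(y)
deps = {3: [1], 4: [1, 3], 5: [2]}

print(sorted(backward_slice(deps, 4)))  # statements needed to compute z
print(sorted(backward_slice(deps, 5)))  # statements needed for print(y)
```

Statement 2 is absent from the slice for statement 4 because neither a data nor a control dependence connects them. Spurious control-dependence edges would add statements to the set, which is the kind of imprecision the abstract describes.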
Scalable detection of semantic clones
Several techniques have been developed for identifying similar code fragments in programs. These similar fragments, referred to as code clones, can be used to identify redundant code, locate bugs, or gain insight into program design. Existing scalable approaches to clone detection are limited to finding program fragments that are similar only in their contiguous syntax. Other, semantics-based approaches are more resilient to differences in syntax, such as reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures. However, none of these techniques have scaled to real-world code bases. These approaches capture semantic information from Program Dependence Graphs (PDGs), program representations that encode data and control dependencies between statements and predicates. Our definition of a code clone is also based on this representation: we consider program fragments with isomorphic PDGs to be clones. In this paper, we present the first scalable clone detection algorithm based on this definition of semantic clones. Our insight is the reduction of the difficult graph similarity problem to a simpler tree similarity problem by mapping carefully selected PDG subgraphs to their related structured syntax. We efficiently solve the tree similarity problem to create a scalable analysis. We have implemented this algorithm in a practical tool and performed evaluations on several million-line open source projects, including the Linux kernel. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.
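The tree-comparison side of this idea can be sketched with Python's standard `ast` module. The sketch is an assumption-laden simplification, not the paper's algorithm: it builds no PDGs and only groups functions whose syntax trees have exactly the same shape, ignoring identifier names and constant values. It would therefore miss the reordered or interleaved statements that PDG-based detection targets. The helper names `shape` and `clone_groups` and the sample source are hypothetical.

```python
import ast
from collections import defaultdict


def shape(node):
    """Return a nested tuple of AST node type names (identifiers and constants ignored)."""
    return (type(node).__name__,) + tuple(shape(c) for c in ast.iter_child_nodes(node))


def clone_groups(source):
    """Group top-level functions whose bodies have identical tree shape."""
    groups = defaultdict(list)
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            # Fingerprint only the body, so function and parameter names do not matter.
            key = tuple(shape(stmt) for stmt in node.body)
            groups[key].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


SOURCE = """
def total(xs):
    acc = 0
    for x in xs:
        acc = acc + x
    return acc

def summed(items):
    result = 0
    for item in items:
        result = result + item
    return result

def first(xs):
    return xs[0]
"""

print(clone_groups(SOURCE))
```

Bucketing by a hashable fingerprint avoids comparing every pair of fragments, which is one reason tree-based comparison is easier to scale than general graph matching.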
Information Flow Control with System Dependence Graphs - Improving Modularity, Scalability and Precision for Object Oriented Languages
This thesis deals with the field of static program analysis. In particular, we consider analyses whose goal is to guarantee certain security properties, such as integrity and confidentiality, for programs. For this purpose we use so-called dependence graphs, which capture the potential behavior of the program as well as the information flow between individual program points. With this technique we can ensure, for example, that a program does not reveal any information about a secret password. Specifically, the focus of this work is on techniques that improve the construction of the dependence graph, since it forms the basis for many further security analyses. The presented algorithms and improvements have been integrated into our analysis tool Joana and made publicly available as open source. Numerous collaborations and publications show that the improvements to Joana are also relevant in research practice.
This thesis essentially consists of three parts. Part 1 deals with improvements to the computation of the dependence graph, Part 2 presents a new approach to the analysis of incomplete programs, and Part 3 shows current applications of Joana using concrete examples.
In the first part we describe in detail the algorithms for constructing a dependence graph, paying particular attention to the problems and challenges of analyzing object-oriented languages such as Java. For example, we present an analysis that can precisely handle the control flow triggered by exceptions. We mainly deal with the modeling of side effects that can arise from communication across method boundaries. In dependence graphs, side effects, that is, memory locations read or modified by a method, are represented as additional nodes. We show that the manner of representation, the so-called parameter model, has an enormous influence on both the precision and the runtime of the entire analysis. We explain the weaknesses of the old parameter model, which is based on object trees, and present our improvements in the form of a new model with object graphs. By selectively merging redundant information, we can significantly reduce the number of computed parameter nodes and also speed up the computation, without degrading the precision of the resulting dependence graph. Even for smaller programs in the range of a few thousand lines of code, we achieve a runtime that is on average 8 times better, while the precision of the result is usually improved. For larger programs the difference is even more pronounced, with the result that some of our test cases, and all programs we tested with 20,000 or more lines of code, can only be computed with object graphs. Thanks to these improvements, Joana can be used with increased precision and on substantially larger programs.
In the second part we address the problem that previous security analyses based on dependence graphs could only analyze complete programs. For example, it was impossible to examine or preprocess library code without knowing all of its usage sites. We discovered a monotonicity property in the existing analysis that allows us to transfer analysis results for program parts to arbitrary usage sites. This makes it possible both to preprocess program parts and to make general statements about the security properties of program parts without knowing their concrete usage sites. We define the monotonicity property in detail and sketch a proof of its correctness. Building on this, we develop a method for preprocessing program parts that enables us to create modular dependence graphs. These graphs can be adapted to the respective usage site at a later time. Since the precise construction of a modular dependence graph can become very expensive, we develop an algorithm based on so-called access paths that improves scalability. Finally, we sketch a proof showing that this algorithm always computes a conservative approximation of the modular graph, so that the results of security analyses built on it remain valid.
In the third part we present some successful applications of Joana that arose from a cooperation with Ralf Küsters of the University of Trier. Here we explain, on the one hand, how our security tool Joana can be used in general. On the other hand, we show how, in combination with further tools and techniques, cryptographic security can be guaranteed for a program, a task that was previously not possible for analyses based on information flow. These applications make it especially clear how the usability improvements made in the course of this work ease the use of Joana, and how our improvements to the precision of the result make the successful analysis possible in the first place.
Identifying and exploiting concurrency in object-based real-time systems
The use of object-based mechanisms, i.e., abstract data types (ADTs), for constructing software systems can help to decrease development costs, increase understandability and increase maintainability. However, execution efficiency may be sacrificed due to the large number of procedure calls, and due to contention for shared ADTs in concurrent systems. Such inefficiencies are a concern in real-time applications that have stringent timing requirements. To address these issues, the potentially inefficient procedure calls are turned into a source of concurrency via asynchronous procedure calls (ARPCs), and contention for shared ADTs is reduced via ADT cloning. A framework for concurrency analysis in object-based systems is developed, and compiler techniques for identifying potential concurrency via ARPCs and cloning are introduced. Exploitation of the parallelizing compiler techniques is illustrated in the context of an incremental schedule construction algorithm that enhances concurrency incrementally so that feasible real-time schedules can be constructed. Experimental results show large speedup gains with these techniques. Additionally, experiments show that the concurrency enhancement techniques are often useful in constructing feasible schedules for hard real-time systems.
DeepWukong: Statically Detecting Software Vulnerabilities Using Deep Graph Neural Network
Static bug detection has shown its effectiveness in detecting well-defined memory errors, e.g., memory leaks, buffer overflows, and null dereferences. However, modern software systems have a wide variety of vulnerabilities. These vulnerabilities are extremely complicated, with sophisticated programming logic, and are often caused by different bad programming practices, challenging existing bug detection solutions. It is hard and labor-intensive to develop precise and efficient static analysis solutions for different types of vulnerabilities, particularly for those that may not have as clear a specification as the traditional well-defined vulnerabilities. This article presents DeepWukong, a new deep-learning-based embedding approach to static detection of software vulnerabilities for C/C++ programs. Our approach makes a new attempt by leveraging recent advanced graph neural networks to embed code fragments in a compact and low-dimensional representation, producing a new code representation that preserves high-level programming logic (in the form of control- and data-flows) together with the natural language information of a program. Our evaluation studies the top 10 most common C/C++ vulnerabilities during the past 3 years. We have conducted our experiments using 105,428 real-world programs by comparing our approach with four well-known traditional static vulnerability detectors and three state-of-the-art deep-learning-based approaches. The experimental results demonstrate the effectiveness of our research and have shed light on the promising direction of combining program analysis with deep learning techniques to address the general static code analysis challenges.