
    Formal framework for reasoning about the precision of dynamic analysis

    Dynamic program analysis is extremely successful both in code debugging and in malicious code attacks. Fuzzing, concolic execution, and monkey testing are instances of the more general problem of analysing programs by dynamically executing their code on selected inputs. While static program analysis has a beautiful and well-established theoretical foundation in abstract interpretation, dynamic analysis still lacks such a foundation. In this paper, we introduce a formal model for understanding the notion of precision in dynamic program analysis. It is known that in sound-by-construction static program analysis, precision amounts to completeness. In dynamic analysis, which is inherently unsound, precision boils down to a notion of coverage of execution traces with respect to what the observer (attacker or debugger) can effectively observe about the computation. We introduce a topological characterisation of the notion of coverage relative to a given (fixed) observation for dynamic program analysis, and we show how this coverage can be changed by semantics-preserving code transformations. As in the case of static program analysis and abstract interpretation, for dynamic analysis too we can morph the precision of the analysis by transforming the code. In this context, we validate our model on well-established code obfuscation and watermarking techniques. We confirm the efficiency of existing methods for preventing control-flow-graph extraction and data exploitation by dynamic analysis, including a validation of the potency of fully homomorphic data encodings in code obfuscation.
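    The coverage notion described here lends itself to a small illustration. The sketch below is a loose Python rendering, not the paper's topological formalism: it treats an observation as an abstraction function on traces and measures what fraction of the observationally distinct behaviours a finite set of executed traces reaches. All names and the toy trace universe are invented.

```python
# Illustrative sketch (not the paper's formal model): coverage of a set of
# executed traces relative to an observer's abstraction function.

def coverage(executed_traces, all_traces, observe):
    """Fraction of observationally distinct behaviours hit by execution.

    observe : maps a concrete trace to what the analyst can see of it.
    """
    observable = {observe(t) for t in all_traces}   # all distinct observations
    hit = {observe(t) for t in executed_traces}     # observations we exercised
    return len(hit) / len(observable) if observable else 1.0

# Toy example: the observer only sees the sign of each value in the trace.
observe_signs = lambda trace: tuple(v > 0 for v in trace)

universe = [(1, 2), (3, 4), (-1, 2), (-3, -4)]
executed = [(1, 2), (-1, 2)]
print(coverage(executed, universe, observe_signs))  # 0.666... (2 of 3 behaviours)
```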

    Mapping features to source code in dynamically configured avionics software

    Mapping software features to the code that implements them is an important activity for program comprehension and software reengineering. In this paper, we present a novel automated approach to locating features in source code based on static analysis and model checking. The approach focuses on dynamically configured software, in which the activation of specific features is controlled by configuration variables. The main advantages of a static approach to feature location are its affordability and its applicability to large systems containing hundreds of features. Our methodology is applied to an industrial Flight Management System from the avionics industry. Results show that a static approach to feature mapping is feasible and can locate complex features whose implementation is spread across multiple files and functions.
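    To make the setting concrete, here is a minimal, hypothetical sketch of the core static idea: scanning for code regions whose execution is guarded by configuration variables. The paper's actual pipeline works on avionics code via control-flow-graph extraction and model checking; the Python `ast`-based scan, the snippet, and the `CFG_` naming convention below are illustrative assumptions only.

```python
# Hypothetical sketch: find code regions whose execution is guarded by a
# configuration variable. The matching rule (a CFG_ prefix) is invented.
import ast

SOURCE = """
if CFG_WAYPOINT_MODE:
    plan_route()
else:
    hold_position()
if CFG_FUEL_PREDICTION and sensors_ok:
    predict_fuel()
"""

def guarded_regions(source, is_config_var):
    """Map each configuration variable to the line spans it guards."""
    regions = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.If):
            for name in ast.walk(node.test):
                if isinstance(name, ast.Name) and is_config_var(name.id):
                    span = (node.body[0].lineno, node.body[-1].end_lineno)
                    regions.setdefault(name.id, []).append(span)
    return regions

print(guarded_regions(SOURCE, lambda v: v.startswith("CFG_")))
# {'CFG_WAYPOINT_MODE': [(3, 3)], 'CFG_FUEL_PREDICTION': [(7, 7)]}
```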

    Real-Time Reflexion Modelling in architecture reconciliation: A multi case study

    Context: Reflexion Modelling is considered one of the more successful approaches to architecture reconciliation. Empirical studies strongly suggest that professional developers involved in real-life industrial projects find the information provided by variants of this approach useful and insightful, but the degree to which it resolves architecture conformance issues is still unclear.

    Objective: This paper aims to assess the level of architecture conformance achieved by professional architects using Reflexion Modelling, and to determine how the approach could be extended to improve its suitability for this task.

    Method: An in vivo, multi-case-study protocol was adopted across five software systems from four different financial services organizations. Think-aloud, videotape, and interview data from professional architects involved in Reflexion Modelling sessions were analysed qualitatively.

    Results: This study showed that, at least four months after the Reflexion Modelling sessions, less than 50% of the architectural violations identified had been removed. The majority of participants who did remove violations favoured changes to the architectural model rather than to the code. Participants seemed to work off two specific architectural templates, and interactively explored their architectural model to focus in on the causes of violations and to assess the ramifications of potential code changes. They expressed a desire for dependency analysis beyond static source-code analysis and for scalable visualizations.

    Conclusion: The findings support several interesting usage-in-practice traits previously hinted at in the literature. These include (1) the iterative analysis of systems through Reflexion models, as a precursor to possible code change or as a focusing mechanism to identify the location of architecture conformance issues; (2) the extension of the approach with respect to dependency analysis of software systems and architectural modelling templates; (3) improved visualization support; and (4) the insight that identification of architectural violations in itself does not, in the majority of instances, lead to their removal.

    This work was supported, in part, by Science Foundation Ireland Grants 12/IP/1351 and 10/CE/I1855 to Lero, the Irish Software Engineering Research Centre (www.lero.ie), and by the University of Brighton under the Rising Star Scheme awarded to Nour Ali.
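    For readers unfamiliar with the underlying computation, the following is a minimal sketch of a reflexion model in the classic Murphy-Notkin style that this line of work builds on: extracted code dependencies are lifted through a code-to-architecture mapping and compared against the architect's intended dependencies. The example system, mapping, and dependencies are invented.

```python
# Minimal reflexion-model sketch: classify architectural edges as
# convergence, divergence, or absence. All data below is invented.

def reflexion(intended, mapping, code_deps):
    lifted = {(mapping[a], mapping[b]) for a, b in code_deps
              if mapping[a] != mapping[b]}
    return {
        "convergence": lifted & intended,   # intended and present in code
        "divergence":  lifted - intended,   # present in code, not intended
        "absence":     intended - lifted,   # intended, missing from code
    }

intended = {("UI", "Logic"), ("Logic", "Storage")}
mapping = {"LoginView": "UI", "AuthService": "Logic",
           "UserDao": "Storage", "AuditLog": "Storage"}
code_deps = [("LoginView", "AuthService"),   # UI -> Logic (conforms)
             ("LoginView", "UserDao")]       # UI -> Storage (violation)

for kind, edges in reflexion(intended, mapping, code_deps).items():
    print(kind, sorted(edges))
# convergence [('UI', 'Logic')]
# divergence [('UI', 'Storage')]
# absence [('Logic', 'Storage')]
```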

    Searching for Software Source Code on the StackOverflow Social Media Site Using Concept Location

    Searching for source code on the internet is a popular research area, and various approaches have been developed to retrieve code from it. StackOverflow lets programmers ask and answer questions about code. Previous work on code search over StackOverflow used the tf-idf method, which recommends code based on the number of occurrences of query words. That approach has a weakness: variable and function names are treated as ordinary words, even though they are often a combination of two or more words. For example, a function named randomString is indexed as the single term randomString, so a search for random will probably not recommend the code inside randomString, because random and randomString are different terms. Concept location can address this problem; it has long been used to correlate code with specific features or concepts, with earlier work focusing on code, comments, and the relations among the objects within the code. This research proposes a mechanism for finding source code on StackOverflow using concept location. All questions, answers, and code snippets tagged Java are downloaded from StackOverflow and extracted into a corpus, and concept locations in the corpus are inferred with the LDA (Latent Dirichlet Allocation) algorithm. Users simply enter the concept they are looking for, and the system recommends code relevant to that concept, so programmers obtain code recommendations easily. The study achieves a precision of 0.52 and a recall of 0.72.
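    The identifier-splitting problem the abstract describes is easy to demonstrate. In the sketch below (illustrative only; the study itself indexes StackOverflow posts and applies LDA on the resulting corpus), camelCase and snake_case identifiers are split into their word parts, so that a query for random can match code containing randomString.

```python
# Sketch of the term-mismatch problem: without camel-case splitting, a query
# for "random" never matches the identifier "randomString". The snippet and
# tokenization rules are illustrative.
import re

def split_identifiers(text):
    """Lowercase tokens, splitting camelCase and snake_case identifiers."""
    words = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", token)
        words.extend(p.lower() for p in parts)
    return words

snippet = "public String randomString(int len) { return buildRandom(len); }"
print(split_identifiers(snippet))
# ['public', 'string', 'random', 'string', 'int', 'len', 'return',
#  'build', 'random', 'len']
print("random" in split_identifiers(snippet))  # True -- the query now matches
```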

    Tool-supported identification of functional concerns in object-oriented code

    Concern identification aims to find the implementation of a functional concern in existing source code. In this work, concerns are described, using the Hierarchic Concern Model, as gray boxes containing subconcerns, inputs, and outputs. The inputs and outputs are used as concern seeds to identify data-oriented abstractions of concern implementations, called concern skeletons. The identification approach is based on context-free language reachability and is supported by a tool called CoDEx.
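    As a rough illustration of seed-driven identification (a deliberate simplification: CoDEx uses context-free language reachability, whereas the sketch below uses plain graph reachability), a concern skeleton can be grown from input/output seeds along data-flow edges. The toy data-flow graph is invented.

```python
# Simplified sketch: grow a concern skeleton from seed variables by breadth-
# first search over data-flow edges. Plain reachability stands in for the
# paper's context-free language reachability.
from collections import deque

def concern_skeleton(seeds, dataflow):
    """Collect all statements reachable along data-flow edges from the seeds."""
    seen, queue = set(seeds), deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in dataflow.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

dataflow = {"order": ["validate(order)"],
            "validate(order)": ["total = price(order)"],
            "total = price(order)": ["invoice"]}
print(concern_skeleton({"order"}, dataflow))
```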

    Feature Location by Static Analysis in Dynamically Configured Avionics Code

    Locating where software features are implemented in source code can be useful for program comprehension and software reengineering. In the avionics industry, reengineering is a timely topic, since many software systems need to be modernized while preserving the algorithmic richness of the existing code so that it can be reused. This thesis aims to support avionics software reengineering with a feature location methodology based on static analysis of the source code. The main objective is to define such a methodology for dynamically configured software, a type of software found in the avionics industry among others. The methodology is based on extracting a control flow graph representing the source code and using model checking to verify properties related to each feature of the software. Every step of the methodology is automated, which gives it a significant advantage over other existing feature location techniques. A second objective of the research presented in this thesis is to apply the methodology to a flight management system from the avionics industry. The results are then interpreted to determine how the different features are distributed over the source code, expressed as the number of files, code blocks, and lines of code related to each feature. The technique identifies the code related to a set of software features, information that could be used to identify software products; the results can therefore support software reengineering and facilitate the extraction of a software product line model. To the best of our knowledge, the methodology presented here is the first automated feature location technique based solely on static analysis. The results of analysing the flight management system show that feature location by static analysis of the source code is possible under certain conditions. Various improvements, such as handling function pointers and analysing the propagation of variables, could eventually be applied to the methodology to improve its precision in some contexts.
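    To illustrate the kind of question the model checking step answers, the toy check below asks whether a feature's code block is reachable in a control-flow graph once a configuration is fixed. A real model checker verifies much richer properties; the CFG, guard encoding, and configuration names here are invented for illustration.

```python
# Toy reachability check: is a feature's block reachable in a CFG whose edges
# are guarded by configuration variables, under a fixed configuration?

def reachable(cfg, entry, target, config):
    """DFS over CFG edges whose guard holds under the configuration."""
    stack, seen = [entry], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        for succ, guard in cfg.get(node, []):
            if guard is None or config.get(guard, False):
                stack.append(succ)
    return False

cfg = {"entry": [("init", None)],
       "init": [("waypoint_block", "CFG_WAYPOINT"), ("exit", None)],
       "waypoint_block": [("exit", None)]}
print(reachable(cfg, "entry", "waypoint_block", {"CFG_WAYPOINT": True}))   # True
print(reachable(cfg, "entry", "waypoint_block", {"CFG_WAYPOINT": False}))  # False
```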

    Program Similarity Analysis for Malware Classification and its Pitfalls

    Malware classification, specifically the task of grouping malware samples into families according to their behaviour, is vital in order to understand the threat they pose and how to protect against them. Recognizing whether one program shares behaviours with another requires semantic reasoning: it needs to consider what a program actually does. This is a famously uncomputable problem, due to Rice's theorem. As there is no one-size-fits-all solution, determining program similarity in the context of malware classification requires different tools and methods depending on what is available to the malware defender.

    When the malware source code is readily available (or at least easy to retrieve), most approaches employ semantic "abstractions", which are computable approximations of the semantics of the program. This is the first scenario of this thesis: malware classification using semantic abstractions extracted from the source code in an open system. Structural features, such as the control flow graphs of programs, can be used to classify malware reasonably well. To demonstrate this, we build a malware analysis tool, R.E.H.A., which targets the Android system and leverages its openness to extract a structural feature from the source code of malware samples. The tool is first evaluated successfully against a state-of-the-art malware dataset and then on a newly collected dataset; we show that R.E.H.A. is able to classify the new samples into their respective families, often outperforming commercial antivirus software. However, abstractions have limitations by virtue of being approximations. We show that by increasing the granularity of the abstractions to produce more fine-grained features, we can improve the accuracy of the results, as in our second tool, StranDroid, which generates fewer false positives on the same datasets.

    The source code of malware samples is often unavailable or hard to retrieve. For this reason, we introduce a second scenario in which classification must be carried out with only the compiled binaries of the malware samples on hand. Program similarity in this context cannot rely on semantic abstractions as before, since it is difficult to create meaningful abstractions from zeros and ones. Instead, treating the compiled programs as raw data, we transform them into images and build upon common image classification algorithms using machine learning. This led us to develop novel deep learning models, a convolutional neural network and a long short-term memory network, to classify the samples into their respective families. To overcome the usual deep learning obstacle of lacking sufficiently large and balanced datasets, we use obfuscations as a data augmentation tool, generating semantically equivalent variants of existing samples to expand the dataset as needed. Finally, to lower the computational cost of training, we use transfer learning and show that a model trained on one dataset can successfully classify samples in different malware datasets.

    The third scenario explored in this thesis assumes that even the binary itself cannot be accessed for analysis, but that it can be executed, and that the execution traces can then be used to extract semantic properties. However, dynamic analysis lacks the formal tools and frameworks that exist in static analysis for proving the effectiveness of obfuscations. For this reason, the focus shifts to building a novel formal framework able to assess the potency of obfuscations against dynamic analysis. We validate the new framework by using it to encode known analyses and obfuscations, and we show how these obfuscations actually hinder the dynamic analysis process.
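    The binaries-as-images idea from the second scenario is commonly realised in the literature (for example, Nataraj et al.'s "malimg" representation) by packing a sample's raw bytes into a fixed-width grayscale matrix. The sketch below shows that packing step only, with an illustrative width and zero padding; it is not the thesis's exact pipeline.

```python
# Minimal sketch: pack the raw bytes of a compiled sample into a fixed-width
# grayscale matrix that image classifiers can consume. Width is illustrative.
import numpy as np

def binary_to_image(raw_bytes, width=256):
    """Reshape raw bytes into a (height, width) uint8 grayscale image."""
    buf = np.frombuffer(raw_bytes, dtype=np.uint8)
    pad = (-len(buf)) % width                      # zero-pad the last row
    buf = np.pad(buf, (0, pad))
    return buf.reshape(-1, width)

sample = bytes(range(256)) * 8                     # stand-in for a real binary
img = binary_to_image(sample)
print(img.shape, img.dtype)                        # (8, 256) uint8
```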

    Supporting feature-level software maintenance

    Software maintenance is the process of modifying a software system to fix defects, improve performance, add new functionality, or adapt the system to a new environment. A maintenance task is often initiated by a bug report or a request for new functionality. Bug reports typically describe problems with incorrect behaviors or functionalities; these behaviors and functionalities are known as features. Even in very well-designed systems, the source code that implements features is often not completely modularized, and the delocalized nature of features makes maintaining them challenging. Since maintenance tasks are expressed in terms of features, the goal of this dissertation is to support software maintenance at the feature level. We focus on two tasks in particular: feature location and impact analysis via feature coupling.

    Feature location is the process of identifying the source code that implements a feature, and it is an essential first step in any maintenance task. Many existing feature location techniques incorporate various types of analyses, such as static, dynamic, and textual. In this dissertation, we recognize the advantages of leveraging several types of analyses and introduce a new approach to feature location based on combining dynamic analysis, textual analysis, and web mining algorithms applied to software. The use of web mining for feature location is a novel contribution, and we show that our new techniques based on web mining are significantly more effective than the current state of the art.

    After feature location has identified a feature's source code, maintenance can be performed on that feature. Impact analysis should then be carried out to revalidate the system and determine which other features may have been affected by the modifications. We define three feature coupling metrics that capture the relationship between features based on structural information, textual information, and their combination. Our novel feature coupling metrics can be used for impact analysis to quantify the strength of coupling between pairs of features. We performed three empirical studies on open-source software systems to assess the metrics and established three major results. First, there is a moderate to strong, statistically significant correlation between feature coupling and faults. Second, feature coupling can be used to correctly determine about half of the other features that would be affected by a change to a given feature. Finally, the metrics align with developers' opinions about pairs of features that are actually coupled.
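    As a hedged illustration of "web mining algorithms applied to software": treating the call graph like a web graph makes link-analysis scores such as PageRank available for ranking methods during feature location. The graph, damping factor, and iteration count below are invented, and the dissertation's exact combination with dynamic and textual analyses is not reproduced here.

```python
# Illustrative PageRank over a call graph: structurally central methods get
# higher scores, which can be combined with textual relevance for ranking.

def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:                                   # dangling node: spread evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

calls = {"main": ["parse", "save"], "parse": ["tokenize"],
         "tokenize": [], "save": []}
for method, score in sorted(pagerank(calls).items(), key=lambda kv: -kv[1]):
    print(f"{method:10s} {score:.3f}")
```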