
    Supporting software processes analysis and decision-making using provenance data

    Data provenance can be defined as the description of the origins of a piece of data and the process by which it arrived in a database. Provenance has been used successfully in the health sciences, the chemical industry, and scientific computing, since these areas require comprehensive traceability mechanisms. Moreover, companies have been increasing the amount of data they collect from their systems and processes as the cost of memory and storage technologies has dropped in recent years. This thesis therefore investigates whether provenance models and techniques can support the analysis of software process execution and data-driven decision-making, given the growing availability of process data in companies. A provenance model for software processes was developed and evaluated by experts in the process and provenance areas, together with an approach, and supporting tooling, for capturing, storing, inferring implicit information from, and visualizing software process provenance data. In addition, a case study using data from industry processes was conducted to evaluate the approach, with a discussion of several specific analysis and data-driven decision-making possibilities.
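
    The abstract does not spell out the model itself; as a minimal illustration of the kind of inference such an approach enables, the sketch below (in Python) records software process executions as PROV-style derivation edges and transitively infers the implicit lineage of an artifact. All names and data are hypothetical.

        from collections import defaultdict

        # wasDerivedFrom edges: artifact -> set of artifacts it was derived from
        derived_from = defaultdict(set)

        def record_activity(used, generated):
            """Record one process activity that used and generated artifacts."""
            for produced in generated:
                derived_from[produced].update(used)

        def lineage(artifact):
            """Infer the full (implicit) ancestor set of an artifact."""
            seen, stack = set(), [artifact]
            while stack:
                for parent in derived_from[stack.pop()]:
                    if parent not in seen:
                        seen.add(parent)
                        stack.append(parent)
            return seen

        record_activity(used={"patch-42"}, generated={"review-7"})
        record_activity(used={"review-7"}, generated={"release-1.0"})
        print(lineage("release-1.0"))  # {'review-7', 'patch-42'}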

    Source File Set Search for Clone-and-Own Reuse Analysis

    The clone-and-own approach is a natural way for software developers to reuse source code. To assess how known bugs and security vulnerabilities of a cloned component affect an application, developers and security analysts need to identify the original version of the component and understand how the cloned component differs from it. Although developers may record the original version information in a version control system and/or in directory names, such information is often unavailable or incomplete. In this research, we propose a code search method that takes as input a set of source files and extracts all the components that include similar files from a software ecosystem (i.e., a collection of existing versions of software packages). Our method employs an efficient file similarity computation using the b-bit minwise hashing technique. We use an aggregated file similarity for ranking components. To evaluate the effectiveness of this tool, we analyzed 75 cloned components in the Firefox and Android source code. The tool took about two hours to report the original components from 10 million files in Debian GNU/Linux packages. Recall of the top five components in the extracted lists is 0.907, while recall of a baseline using SHA-1 file hashes is 0.773, according to the ground truth recorded in the source code repositories. (Comment: 14th International Conference on Mining Software Repositories)
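
    As an aside on the core technique, the following sketch shows the idea of b-bit minwise hashing for estimating file similarity: compute K independent min-hashes over a file's shingles and keep only the lowest b bits of each. The shingling, the parameter choices, and the omission of the method's bias correction are simplifying assumptions, not the paper's exact implementation.

        import hashlib

        K, B = 128, 1            # K min-hashes, keep the lowest B bits of each
        MASK = (1 << B) - 1

        def shingles(text, n=3):
            words = text.split()
            return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

        def signature(text):
            sig = []
            for k in range(K):
                # Simulate K hash functions by salting a single hash with k.
                h = min(int.from_bytes(hashlib.sha1(f"{k}:{s}".encode()).digest()[:8], "big")
                        for s in shingles(text))
                sig.append(h & MASK)
            return sig

        def estimated_similarity(sig_a, sig_b):
            # The fraction of matching b-bit values approximates Jaccard
            # similarity, up to the collision bias the full method corrects.
            return sum(a == b for a, b in zip(sig_a, sig_b)) / K

        a = signature("int main ( void ) { return EXIT_SUCCESS ; }")
        b = signature("int main ( void ) { return EXIT_FAILURE ; }")
        print(estimated_similarity(a, b))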

    LibDetector: Version Identification of Libraries in Android Applications

    In the Android ecosystem today, code is often reused by developers in the form of software libraries. This practice not only saves time but also reduces the complexity of software development. However, like all other software, software libraries are prone to bugs, design flaws, and security vulnerabilities. They too undergo incremental updates, not only to add or change features but also to address their flaws. Unfortunately, the knowledge gap between consumers and maintainers of software libraries presents a barrier to the timely adoption of important library updates. We therefore present LibDetector, a tool for identifying the specific version of Java libraries used in Android applications. Using LibDetector, we perform a large empirical analysis of current trends in library use in the Android ecosystem. We find that a large proportion of applications currently available on the Google Play Store use outdated libraries. We also explore the potential effects of this lax updating practice. For 2 of the 17 libraries we studied, apps containing outdated versions of the library had a significantly different average rating than apps containing more recent versions. Finally, we find in a case study that a vulnerable version of a library is a realistic threat to the security of apps consuming that version of the library.
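
    One plausible reduction of the version identification problem, sketched below under our own assumptions (the paper's actual matching pipeline is richer), is to fingerprint each known library version as a set of method-signature hashes and report the version whose fingerprint is best contained in the signatures extracted from an app.

        def best_version_match(app_signatures, version_fingerprints):
            """version_fingerprints: version string -> set of signature hashes."""
            # Containment rather than Jaccard: shrinking and obfuscation remove
            # library methods from an app but rarely add new ones.
            def score(item):
                _, fingerprint = item
                return len(app_signatures & fingerprint) / len(fingerprint)
            version, fingerprint = max(version_fingerprints.items(), key=score)
            return version, score((version, fingerprint))

        fingerprints = {                      # hypothetical fingerprint database
            "okhttp-3.9.0": {"h1", "h2", "h3", "h4"},
            "okhttp-3.10.0": {"h1", "h2", "h3", "h5", "h6"},
        }
        extracted = {"h1", "h2", "h3", "h5"}  # signatures found in the app
        print(best_version_match(extracted, fingerprints))  # ('okhttp-3.10.0', 0.8)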

    Empirical Studies of Android API Usage: Suggesting Related API Calls and Detecting License Violations.

    We mine the API method calls used by Android app developers to (1) suggest related API calls based on the version history of apps, (2) suggest related API calls based on StackOverflow posts, and (3) find potential app copyright and license violations based on the similarity of the API calls they make. Zimmermann et al. suggested that "programmers who changed these functions also changed" functions that could be mined from previous groupings of functions found in the version history of a system. Our first contribution is to expand this approach to a community of apps. Android developers use a set of API calls when creating apps, and these API methods are used in similar ways across multiple applications. Clustering co-changing API methods used by 230 Android apps, we are able to predict the changes to API methods that individual app developers will make to their application with an average precision of 73% and recall of 25%. Our second contribution can be characterized as "programmers who discussed these functions were also interested in these functions." Informal discussion on StackOverflow provides a rich source of related API methods as developers provide solutions to common problems. Clustering salient API methods in the same highly ranked posts, we are able to create rules that predict the changes app developers will make with an average precision of 64% and recall of 15%. Our last contribution is to find out whether proprietary apps copy code from open source apps, thereby violating the open source license. We provide a set of techniques that determine how similar two apps are based on the API calls they make. These techniques include Android API call matching, API call coverage, app categories, method/class clusters, and the released size of apps. To validate this approach we conduct a case study of 150 open source projects and 950 proprietary projects.
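
    The first contribution follows the classic co-change recipe; a minimal sketch, assuming a toy change history, is shown below. From historical change sets it suggests API methods that frequently co-occurred with the ones a developer just changed, with a simple confidence threshold standing in for the clustering used in the work.

        from collections import Counter

        history = [                      # hypothetical per-commit API change sets
            {"Cursor.close", "Cursor.moveToNext", "Cursor.getString"},
            {"Cursor.close", "Cursor.moveToNext"},
            {"Camera.open", "Camera.release"},
            {"Cursor.close", "Cursor.moveToNext", "Cursor.getString"},
        ]

        def suggest(changed, min_confidence=0.6):
            """Suggest API methods that historically co-changed with `changed`."""
            co_counts, base = Counter(), 0
            for change_set in history:
                if changed <= change_set:     # antecedent present in this commit
                    base += 1
                    co_counts.update(change_set - changed)
            return {m: n / base for m, n in co_counts.items()
                    if n / base >= min_confidence}

        print(suggest({"Cursor.moveToNext"}))
        # {'Cursor.close': 1.0, 'Cursor.getString': 0.666...}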

    Software Provenance Tracking at the Scale of Public Source Code

    We study the possibilities of tracking the provenance of software source code artifacts within the largest publicly accessible corpus of publicly available source code, the Software Heritage archive, with over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how often the same artifacts (e.g., SLOC, files, or commits) appear in different contexts (e.g., files, commits, or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental, and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original (i.e., never-seen-before) source code files and commits, and find it to be exponential over a period of more than 40 years.
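
    To make the replication-factor estimate concrete, here is a toy sketch under obvious simplifications: given a mapping from commits to the content-addressed file hashes they contain, it measures how many commits each unique file appears in. The actual study performs this over the Software Heritage Merkle DAG rather than an in-memory dict.

        from collections import Counter
        from statistics import mean

        commits = {                           # hypothetical commit -> file hashes
            "c1": {"f_readme", "f_main", "f_util"},
            "c2": {"f_readme", "f_main", "f_util", "f_test"},
            "c3": {"f_readme", "f_main_v2", "f_util"},
        }

        occurrences = Counter(h for files in commits.values() for h in files)
        replication = mean(occurrences.values())  # avg commits per unique file
        print(occurrences.most_common(2), replication)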

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of a software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software is obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done on analyzing obfuscated code for attribution. In part, the reason for this gap is that deobfuscation of an unknown program is a challenging task. Further, the transformations of the executable file introduced by the obfuscator modify or remove features of the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis becomes difficult and time consuming, and in some cases may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input to a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and to predict who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods, to compare prediction accuracies with those obtained from this new dynamic-analysis-based method. Experiments indicate that the new method can provide better author attribution accuracy for files of unknown provenance, especially when the specimen file has been obfuscated.
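
    A condensed sketch of the attribution step is given below: featurize execution traces as counts of event n-grams and train an off-the-shelf supervised classifier to predict the author. The feature scheme, classifier choice, and data are illustrative assumptions, not the paper's exact pipeline.

        from collections import Counter
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_extraction import DictVectorizer

        def trace_features(trace, n=2):
            """Count n-grams of trace events as stylometric features."""
            return Counter(" ".join(trace[i:i + n])
                           for i in range(len(trace) - n + 1))

        traces = [                                  # hypothetical labeled traces
            (["open", "read", "read", "close"], "author_a"),
            (["open", "read", "close", "open"], "author_a"),
            (["alloc", "write", "free", "free"], "author_b"),
            (["alloc", "write", "write", "free"], "author_b"),
        ]
        vec = DictVectorizer()
        X = vec.fit_transform([trace_features(t) for t, _ in traces])
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X, [label for _, label in traces])

        unknown = trace_features(["open", "read", "read", "close"])
        print(clf.predict(vec.transform([unknown])))  # ['author_a']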

    Assisting Software Developers With License Compliance

    Open source licensing determines how open source systems are reused, distributed, and modified from a legal perspective. While it facilitates rapid development, the legal language of these licenses can be difficult for developers to understand. Because of misunderstandings, systems can incorporate licensed code in a way that violates the terms of the license. Such incompatibilities between licenses can make it impossible to reuse a particular library without either relicensing the system or redesigning its architecture. Prior efforts have predominantly focused on license identification or on understanding the underlying phenomena without reasoning about compatibility at a broad scale. The work in this dissertation first investigates the rationale of developers and identifies the areas where developers struggle with free/open source software licensing. We investigate the diffusion of licenses and the prevalence of license changes in a large-scale empirical study of 16,221 Java systems. We observed a clear lack of traceability and of standardized licensing, which led to difficulty and confusion for developers trying to reuse source code. We investigated this difficulty further by surveying the developers of the systems with license changes to understand why they first adopted a license and why they then changed it. Additionally, we performed an analysis of issue trackers and legal mailing lists to extract licensing bugs. From these works, we identified key areas in which developers struggled and needed support. While developers need support to identify license incompatibilities and to understand both their causes and implications, we observed that state-of-the-art license identification tools did not identify license exceptions. Since these exceptions directly modify the license terms (either the permissions granted by the license or the restrictions imposed by it), we propose an approach that complements current license identification techniques by classifying license exceptions. The approach relies on supervised machine learners to classify licensing text and identify particular license exceptions or the lack of one. Subsequently, we built an infrastructure to assist developers in evaluating license compliance warnings for their system. The infrastructure evaluates compliance across the dependency tree of a system to ensure it is compliant with the licenses of all of its dependencies. When an incompatibility is present, it reports the specific library or libraries and the conflicting license(s) so that developers can investigate these compliance warnings, which would otherwise prevent distribution of their software. We conduct a study of 121,094 open source projects spanning 6 programming languages, and we demonstrate that the infrastructure is able to identify license incompatibilities between these projects and their dependencies.
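
    As a flavor of the compliance check, a minimal sketch follows: walk a project's dependency tree and flag any dependency whose license is not in a compatibility table for the project's own license. The table here is a tiny, illustrative subset of our own devising (real compatibility reasoning is considerably more nuanced, and this is not legal advice).

        COMPATIBLE_WITH = {          # project license -> licenses it may depend on
            "MIT": {"MIT", "BSD-3-Clause", "Apache-2.0"},
            "GPL-3.0": {"MIT", "BSD-3-Clause", "Apache-2.0", "LGPL-3.0", "GPL-3.0"},
        }

        def compliance_warnings(project_license, dependencies):
            """dependencies: list of (name, license, sub_dependencies) triples."""
            warnings = []
            allowed = COMPATIBLE_WITH.get(project_license, set())
            for name, license_id, subdeps in dependencies:
                if license_id not in allowed:
                    warnings.append((name, license_id))
                warnings.extend(compliance_warnings(project_license, subdeps))
            return warnings

        deps = [("libfoo", "Apache-2.0", [("libbar", "GPL-3.0", [])])]
        print(compliance_warnings("MIT", deps))   # [('libbar', 'GPL-3.0')]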

    Automatic multi-label categorization of Java applications using Dependency graphs

    Automatic approaches for the categorization of software repositories are increasingly gaining acceptance because they reduce manual effort and can produce high-quality results. Most existing approaches rely strongly on supervised machine learning (which requires a set of predefined categories to be used as training data) and use source code, comments, API calls, and other sources to obtain information about the projects to be categorized. We consider that existing approaches have weaknesses that can have major implications on categorization results and that have not been addressed simultaneously, namely the assumption of unrestricted access to source code and the use of predefined sets of categories. We therefore present Sally: a novel, unsupervised, multi-label automatic categorization model that obtains meaningful categories without depending on access to source code or on predefined categories, by leveraging information obtained from the projects in the categorization corpus and the dependency relations between them. We performed two experiments in which we compared Sally to the categorization strategies of two widely used websites and to MUDABlue, a categorization model proposed by Kawaguchi et al. that we consider a good baseline. Additionally, we assessed the proposed model by conducting a survey with 14 developers with a wide range of programming experience, and we developed a web application to make the proposed model available to potential users.
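
    The dependency-graph intuition can be reduced to a toy example, sketched below under our own assumptions rather than Sally's actual algorithm: derive candidate multi-label categories for a project from the known tags of the libraries it depends on, keeping labels supported by enough of its dependencies.

        from collections import Counter

        dependency_tags = {                  # hypothetical tags of known libraries
            "junit": {"testing"},
            "mockito": {"testing", "mocking"},
            "log4j": {"logging"},
        }

        def categorize(dependencies, min_support=0.5):
            """Multi-label categories for a project from its dependencies' tags."""
            tag_counts = Counter(tag for dep in dependencies
                                 for tag in dependency_tags.get(dep, ()))
            threshold = min_support * len(dependencies)
            return {tag for tag, n in tag_counts.items() if n >= threshold}

        print(categorize(["junit", "mockito", "log4j"]))   # {'testing'}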