694 research outputs found

    Source File Set Search for Clone-and-Own Reuse Analysis

    The clone-and-own approach is a natural way for software developers to reuse source code. To assess how known bugs and security vulnerabilities of a cloned component affect an application, developers and security analysts need to identify the original version of the component and understand how the cloned component differs from it. Although developers may record the original version information in a version control system and/or in directory names, such information is often unavailable or incomplete. In this research, we propose a code search method that takes as input a set of source files and extracts all the components that include similar files from a software ecosystem (i.e., a collection of existing versions of software packages). Our method employs an efficient file similarity computation using the b-bit minwise hashing technique. We use an aggregated file similarity for ranking components. To evaluate the effectiveness of this tool, we analyzed 75 cloned components in the Firefox and Android source code. The tool took about two hours to report the original components from 10 million files in Debian GNU/Linux packages. Recall of the top-five components in the extracted lists is 0.907, while recall of a baseline using SHA-1 file hashes is 0.773, according to the ground truth recorded in the source code repositories. Comment: 14th International Conference on Mining Software Repositories
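The file similarity step mentioned in this abstract can be illustrated with a minimal sketch of b-bit minwise hashing. This is not the authors' implementation: the tokenisation into 3-token shingles, the salted BLAKE2b hash family, the parameters `k` and `b`, and all function names are assumptions made for the example. The estimator uses the simplified collision correction that holds when the sets are small relative to the hash universe.

```python
import hashlib
import re

def shingles(text, n=3):
    """Split text into tokens and return the set of n-token shingles (an assumed tokenisation)."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def _hash(seed, item):
    """64-bit hash of an item; the seed selects one member of the hash family."""
    digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def bbit_signature(items, k=128, b=2):
    """For each of k hash functions, keep only the lowest b bits of the minimum hash value."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    mask = (1 << b) - 1
    return [min(_hash(seed, item) for item in items) & mask for seed in range(k)]

def estimate_jaccard(sig_a, sig_b, b=2):
    """Estimate Jaccard similarity from the fraction of matching b-bit values.

    Unrelated minima still agree with probability about 2**-b, so that
    chance level is subtracted and the result rescaled (simplified correction).
    """
    matches = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    chance = 1.0 / (1 << b)
    return max(0.0, (matches - chance) / (1.0 - chance))
```

With `b=2` and `k=128`, each file is summarised in 256 bits, which is what makes comparison across millions of files affordable; larger `k` lowers the variance of the estimate at the cost of more storage.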

    Using build identifiers to fingerprint ELF binaries and link to build information without having access to source code

    Finding out where a software program or library comes from and how it was built, without having direct access to the source code, is not a trivial problem. While the version of a program can be guessed fairly accurately, this is much more difficult for its build configuration. By comparing build identifiers from binaries about which nothing is known with build identifiers extracted from binaries for which source code and build information are available, it is in certain cases possible to find out what source code and build information was used for a binary.
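The comparison described in this abstract can be sketched as a lookup keyed on a build identifier. The sketch below assumes the identifier is the GNU build ID stored in an ELF note (note name `GNU`, note type 3), which the abstract does not state explicitly. It finds the note with a heuristic byte scan that only handles little-endian binaries, rather than parsing the ELF section headers properly. The function names and the `known_builds` dictionary are hypothetical.

```python
import re
import struct

# Note header: namesz = 4, descsz (4 bytes), type = 3 (NT_GNU_BUILD_ID), then the name "GNU\0".
_NOTE = re.compile(rb"\x04\x00\x00\x00(.{4})\x03\x00\x00\x00GNU\x00", re.DOTALL)

def find_build_id(blob):
    """Return the build ID as a hex string, or None if no plausible note is found."""
    for match in _NOTE.finditer(blob):
        (descsz,) = struct.unpack("<I", match.group(1))
        desc = blob[match.end():match.end() + descsz]
        # Reject implausible sizes and truncated notes (the 64-byte cap is an assumption).
        if 0 < descsz <= 64 and len(desc) == descsz:
            return desc.hex()
    return None

def lookup_build(blob, known_builds):
    """Map an unknown binary to recorded build information via its build ID."""
    build_id = find_build_id(blob)
    return known_builds.get(build_id) if build_id is not None else None
```

Here `known_builds` stands for a database built in advance from binaries whose source code and build information are available; a miss simply returns `None`.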

    Better unpacking binary files using contextual information

    Various tools are available to unpack firmware files, disk images, raw flash dumps, file systems, or other archives. These tools examine the contents of a file, find the offsets of archives, compressed files, media files, and so on, carve these from the larger file, decompress the carved files, and make the unpacked data available for recursive unpacking. Currently available tools treat all found files of a certain type the same way (all PNG files are treated the same, all ZIP files are treated the same, and so on), without taking into account the context in which they were found, even though that context can matter. This document describes possible approaches to this problem, in which contextual information from unpacking is made available to allow more accurate unpacking and labeling of files.

    Finding (partial) code clones at method level in binary Java programs without access to source code to detect copyright infringements or security issues

    Many Java programs are distributed in binary form without source code being made available. This makes it much harder to audit these programs, for example to detect copyright infringement or security issues. By examining individual class files inside a Java program and comparing them to a database of class files from known programs, it is possible to make an educated guess as to which programs or program fragments are used in the program, and possibly to detect copyright infringements or trojaned versions of programs.
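The database comparison described in this abstract can be sketched in its simplest form, assuming exact matching on a SHA-256 digest of each whole class file. The abstract does not say how class files are compared, and the title indicates matching at method level, so whole-file hashing is a deliberate simplification. The function names and the `known` dictionary (digest to program name) are hypothetical.

```python
import hashlib
import io
import zipfile
from collections import Counter

def class_digests(jar_bytes):
    """Map each .class entry in a JAR (a ZIP archive) to the SHA-256 digest of its bytes."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return {name: hashlib.sha256(jar.read(name)).hexdigest()
                for name in jar.namelist() if name.endswith(".class")}

def guess_origins(jar_bytes, known):
    """Count, per known program, how many class files in the JAR match the database."""
    hits = Counter()
    for digest in class_digests(jar_bytes).values():
        program = known.get(digest)
        if program is not None:
            hits[program] += 1
    return hits
```

Exact digests only catch unmodified copies. A class recompiled with a different compiler or altered by a single instruction would be missed, which is one motivation for comparing finer-grained units such as individual methods.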

    ASSESSING THE QUALITY OF SOFTWARE DEVELOPMENT TUTORIALS AVAILABLE ON THE WEB

    Both expert and novice software developers frequently consult software development resources on the Web to look up or learn new APIs, tools, and techniques. Software quality suffers when developers fail to find high-quality information relevant to their problem. While a substantial amount of freely available material can be accessed online, some of it is error-prone, infringes copyright, raises security concerns, or targets incompatible versions. Using such toxic information can have a strong negative effect on developers' efficacy. This dissertation focuses specifically on software tutorials, aiming to automatically evaluate the quality of such documents available on the Web. To achieve this goal, we present two contributions: 1) scalable detection of duplicated code snippets; 2) automatic identification of valid version ranges. Software tutorials combine source code snippets with natural language text. The code snippets in a tutorial can originate from different sources and may carry stringent licensing requirements or known security vulnerabilities. Developers, typically unaware of this, can reuse these code snippets in their projects. First, we present our work on a Web-scale code clone search technique that detects duplicate code snippets between large-scale document and source code corpora in order to trace toxic code snippets. Second, as software libraries and APIs evolve over time, existing software development tutorials can become outdated. It is difficult for software developers, and especially novices, to determine the software version implicitly expected by a specific tutorial and thus to decide whether the tutorial applies to their development environment. To overcome this challenge, we present a novel technique for automatically identifying the valid version range of software development tutorials on the Web.

    On the Detection of Licenses Violations in the Android Ecosystem

    Developers of mobile applications (apps) often reuse code from existing libraries and frameworks in order to reduce development costs. However, these libraries and frameworks are governed by licenses with which developers must comply. A license governs the way in which a library or piece of code can be reused, modified, or redistributed. It can be seen as a list of rules that developers must respect before using the component. Failure to comply with a license is likely to result in penalties and fines. In this thesis, we propose an approach for license identification in open source applications. Applying this approach, we conduct a case study to identify the licenses of 857 mobile apps from the F-Droid market, with the aim of understanding which types of licenses developers use most and how these licenses evolve over time. We conduct our study at both the project level and the file level. We also investigate license violations and how they evolve over time: we compare the licenses declared at the project level, those declared at the file level, and those of the libraries used by a project, in order to find incompatible licenses used in the same project. Results show that the most used licenses are the GPL and Apache licenses, at both the project level and the file level. In many cases, developers did not pay enough attention to licensing their source code: 3,250 of the 8,938 app releases analyzed were distributed without license information. Regarding license evolution, the probability that a project stays under the same license is very high (95% on average), and when a license does change, the change is generally toward a more permissive license. At the file level, developers tend to delay their choice of license, and in 15% of license changes they removed the license information. We identified 15 projects out of 857 with a license violation; 7 of them still had violations in their final release. To resolve license violations, developers either changed the license of some of the app's files or removed the contentious files from the app. It took on average 19 releases to resolve a license violation. These findings suggest that developers of mobile apps may have difficulty understanding the legal constraints of license terms, or that the lack of consistency and standardization in license declarations fosters confusion among developers. Our license detection approach can be used by developers to track license violations in their projects before release.

    Oswaldo: A Semantic Web Enabled Approach for Identifying Open Source License Violations

    Open source license violations are numerous and multifaceted, and they pose significant risk to developers and companies in the form of litigation, sometimes resulting in millions of dollars in damages or settlements. Free/Libre and Open Source licenses rely on copyright law and are written in legalese, which is often outside a developer's expertise. Software engineers violate the terms and conditions of these licenses easily and often unknowingly. Consequently, increased knowledge, better tools, and sound processes to detect and prevent license violations are extremely important. This work investigates the types of potential license violations that are committed through direct and transitive dependency hierarchies in hundreds of thousands of real-world software projects. This thesis contributes a novel approach, entitled Oswaldo, that defines and detects three types of license conflicts: Type 1 Simple Violation, Type 2 Embedded Violations, and Type 3 Compound Violations. Unidirectional compatibility/incompatibility relationships between major licenses are modelled. Ontologies and Linked Data are used to detect the transitive violation Types 2 and 3, as well as the direct violation Type 1. This thesis also reports initial evaluations of these three types of license violations found in the Maven repository.
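The idea of checking a one-way compatibility relation across direct and transitive dependencies can be sketched with a plain graph walk. This is not the Oswaldo approach, which relies on ontologies and Linked Data, and it does not reproduce the thesis's Type 1 to 3 definitions. The `CAN_FLOW_INTO` table is a toy relation for illustration only, and the project names in the usage note are hypothetical.

```python
from collections import deque

# Toy one-way relation: (dependency license, project license) pairs assumed acceptable.
CAN_FLOW_INTO = {
    ("MIT", "Apache-2.0"),
    ("MIT", "GPL-3.0"),
    ("Apache-2.0", "GPL-3.0"),
}

def compatible(dep_license, project_license):
    """True if code under dep_license may be used in a project under project_license."""
    return dep_license == project_license or (dep_license, project_license) in CAN_FLOW_INTO

def find_conflicts(root, licenses, deps):
    """Breadth-first walk of the dependency graph starting at root.

    Returns (dependency, depth) for each reachable dependency whose license is
    not compatible with root's; depth 1 is a direct dependency, depth > 1 a
    transitive one.
    """
    conflicts = []
    seen = {root}
    queue = deque((dep, 1) for dep in deps.get(root, ()))
    while queue:
        node, depth = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if not compatible(licenses[node], licenses[root]):
            conflicts.append((node, depth))
        queue.extend((dep, depth + 1) for dep in deps.get(node, ()))
    return conflicts
```

For example, with `licenses = {"app": "Apache-2.0", "util": "MIT", "core": "GPL-3.0"}` and `deps = {"app": ["util"], "util": ["core"]}`, `find_conflicts("app", licenses, deps)` reports `core` at depth 2: the direct dependency is fine, but it pulls in an incompatible license transitively.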

    Finding (partial) code clones at method level in Android programs without access to source code to detect copyright infringements or security issues

    Nearly all programs for Android devices are distributed without source code being made available. This makes it much harder to audit these programs, for example to detect copyright infringement or security issues. By examining individual methods inside an Android program and comparing them to a database of methods from known programs, it is possible to make an educated guess as to which programs or program fragments are used in the program, and possibly to detect copyright infringements or trojaned versions of programs.