Assisting Software Developers With License Compliance
Open source licensing determines, from a legal perspective, how open source systems may be reused, distributed, and modified. While it facilitates rapid development, the legal language of these licenses can make them difficult for developers to understand. Because of misunderstandings, systems can incorporate licensed code in a way that violates the terms of the license. Such incompatibilities between licenses can make it impossible to reuse a particular library without either relicensing the system or redesigning its architecture. Prior efforts have predominantly focused on license identification or on understanding the underlying phenomena, without reasoning about compatibility at a broad scale. The work in this dissertation first investigates the rationale of developers and identifies the areas in which developers struggle with free/open source software licensing. We begin by investigating the diffusion of licenses and the prevalence of license changes in a large-scale empirical study of 16,221 Java systems. We observed a clear lack of traceability and a lack of standardized licensing that led to difficulties and confusion for developers trying to reuse source code. We further investigated the difficulty by surveying the developers of the systems with license changes to understand why they first adopted a license and then changed licenses. Additionally, we performed an analysis on issue trackers and legal mailing lists to extract licensing bugs. From these works, we identified key areas in which developers struggled and needed support. While developers need support to identify license incompatibilities and understand both the cause and implications of the incompatibilities, we observed that state-of-the-art license identification tools did not identify license exceptions.
Since these exceptions directly modify the license terms (either the permissions granted by the license or the restrictions imposed by the license), we proposed an approach to complement current license identification techniques in order to classify license exceptions. The approach relies on supervised machine learners to classify the licensing text and identify the particular license exceptions or the lack of a license exception. Subsequently, we built an infrastructure to assist developers with evaluating license compliance warnings for their system. The infrastructure evaluates compliance across the dependency tree of a system to ensure it is compliant with all of the licenses of the dependencies. When an incompatibility is present, it notes the specific library/libraries and the conflicting license(s) so that the developers can investigate these compliance warnings in their system, since the underlying incompatibilities would prevent distribution of their software. We conduct a study on 121,094 open source projects spanning 6 programming languages, and we demonstrate that the infrastructure is able to identify license incompatibilities between these projects and their dependencies.
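The dependency-tree check described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the dissertation's infrastructure: the `Package` type, the `find_conflicts` function, and the contents of the `INCOMPATIBLE` table are all assumptions made for the example, and the table is a placeholder rather than legal guidance.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# ASSUMPTION: a tiny, illustrative table of (project license, dependency license)
# pairs treated as conflicting. It is a placeholder, not legal guidance.
INCOMPATIBLE: Set[Tuple[str, str]] = {
    ("MIT", "GPL-3.0-only"),
    ("Apache-2.0", "GPL-2.0-only"),
}


@dataclass
class Package:
    """Hypothetical node in a dependency tree."""
    name: str
    license: str
    dependencies: List["Package"] = field(default_factory=list)


def find_conflicts(root: Package) -> List[Tuple[str, str]]:
    """Return (library, license) pairs that conflict with the root's license."""
    conflicts = []
    seen = set()
    stack = list(root.dependencies)
    while stack:  # iterative depth-first walk over direct and transitive deps
        dep = stack.pop()
        if dep.name in seen:
            continue
        seen.add(dep.name)
        if (root.license, dep.license) in INCOMPATIBLE:
            conflicts.append((dep.name, dep.license))
        stack.extend(dep.dependencies)
    return sorted(conflicts)


# Example tree: app (MIT) -> libA (Apache-2.0) -> libB (GPL-3.0-only)
lib_b = Package("libB", "GPL-3.0-only")
lib_a = Package("libA", "Apache-2.0", [lib_b])
app = Package("app", "MIT", [lib_a])
print(find_conflicts(app))
```

The walk reports the transitive dependency `libB` even though `app` never declares it directly, which is the reason for checking the whole tree rather than only direct dependencies.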
On the Detection of Licenses Violations in the Android Ecosystem
SUMMARY
Mobile application developers very often reuse existing libraries and components in order to reduce development costs. However, these libraries and components are governed by licenses with which developers must comply. A license controls how a library or a piece of code may be reused, modified, or redistributed. A license can be seen as a list of rules that developers must respect before using the component. Failure to comply with the terms of a license could lead to penalties and fines.
In this master's thesis, we propose a method for identifying the licenses used in an open source application. Using this method, we conduct a study to identify the licenses used in 857 mobile applications from the "F-Droid" market, in order to understand which types of licenses developers use most often and how these licenses evolve over time. We conduct our study at two levels: the project level and the file level.
We also investigate license violations and their evolution over time. We compare the licenses declared at the project level with those of its files, the licenses of files with one another, and the licenses of projects and files with those of the libraries used by the project, in order to identify incompatible licenses used within the same project.
The results show that the most used licenses are the "GPL" and "Apache" licenses, at both the project level and the file level. We observe that, in many cases, developers do not pay enough attention to the licenses of their source code. Of the 8,938 application versions analyzed, 3,250 versions are not accompanied by any license information. Regarding license evolution, we observe that the probability of a project remaining under the same license is very high (95% on average), and when a license does change, the change is generally toward more permissive licenses. At the file level, we observed that developers tend to delay their choice of license. In 15% of license changes, developers remove the license information. Among the 857 projects analyzed, we identified 15 projects containing license violations. Seven of these projects still contained violations in their final version. In the other cases, to resolve the violations, developers changed the licenses of some of the application's files or removed the problematic files from the applications. On average, 19 versions of the application were needed to resolve the license violations.
These results indicate that developers have difficulty understanding the legal constraints of license terms. Another explanation is that the lack of consistency and standardization in license declarations creates confusion among developers.
Our license detection method could be applied by developers to track license violations in their projects before release to market.
ABSTRACT
Mobile application (app) developers often reuse code from existing libraries and frameworks in order to reduce development costs. However, these libraries and frameworks are governed by licenses with which developers must comply. A license governs the way in which a library or chunk of code can be reused, modified, or redistributed. It can be seen as a list of rules that developers must respect before using the component. A failure to comply with a license is likely to result in penalties and fines.
In this thesis, we propose an approach for license identification in open source applications. By applying this approach, we conduct a case study to identify licenses in 857 mobile apps from the F-Droid market, with the aim of understanding which types of licenses are most used by developers and how these licenses evolve over time. We conduct our study at both the project level and the file level. We also investigate license violations and the evolution of these violations over time; we compare the licenses declared at the project level, at the file level, and by the libraries used by a project to find incompatible licenses used in the same project.
Results show that the most used licenses are the GPL and Apache licenses, at both the project level and the file level. In many cases, we noticed that developers did not pay much attention to licensing their source code. For 3,250 app releases out of 8,938 releases, the apps were distributed without license information. Regarding license evolution, we noticed that the probability of a project staying under the same license is very high (95% on average) and that, when the license does change, the change is generally toward more permissive licenses. At the file level, we noticed that developers tend to delay their decision about license selection; in addition, in 15% of license changes, developers removed license information. We identified 15 projects out of 857 with a license violation; 7 projects had violations in their final release. To solve license violations, developers either changed the license of some of the apps' files or removed the contentious files from the apps. It took on average 19 releases to solve a license violation.
These findings suggest that developers of mobile apps may have difficulty understanding the legal constraints of license terms, or that the lack of consistency and standardization in license declarations fosters confusion among developers. Our license detection approach can be used by developers to track license violations in their projects.
Putting the Semantics into Semantic Versioning
The long-standing aspiration for software reuse has made astonishing strides
in the past few years. Many modern software development ecosystems now come
with rich sets of publicly-available components contributed by the community.
Downstream developers can leverage these upstream components, boosting their
productivity.
However, components evolve at their own pace. This imposes obligations on and
yields benefits for downstream developers, especially since changes can be
breaking, requiring additional downstream work to adapt to. Upgrading too late
leaves downstream vulnerable to security issues and missing out on useful
improvements; upgrading too early results in excess work. Semantic versioning
has been proposed as an elegant mechanism to communicate levels of
compatibility, enabling downstream developers to automate dependency upgrades.
While it is questionable whether a version number can adequately characterize
version compatibility in general, we argue that developers would greatly
benefit from tools such as semantic version calculators to help them upgrade
safely. The time is now for the research community to develop such tools: large
component ecosystems exist and are accessible, component interactions have
become observable through automated builds, and recent advances in program
analysis make the development of relevant tools feasible. In particular,
contracts (both traditional and lightweight) are a promising input to semantic
versioning calculators, which can suggest whether an upgrade is likely to be
safe.
Comment: to be published as Onward! Essays 202
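As a rough illustration of what a semantic version calculator might do, the sketch below compares two snapshots of a component's public API and suggests a version bump. It is an assumption-laden simplification: the functions `suggest_bump` and `next_version` and the name-to-signature dictionaries are hypothetical, and the comparison is purely syntactic, whereas the essay argues for richer inputs such as contracts.

```python
from typing import Dict


def suggest_bump(old_api: Dict[str, str], new_api: Dict[str, str]) -> str:
    """Suggest a bump level from two {name: signature} API snapshots (hypothetical format)."""
    for name, signature in old_api.items():
        # A removed name or a changed signature may break downstream callers.
        if new_api.get(name) != signature:
            return "major"
    # Only additions: existing callers keep working, new functionality appears.
    if any(name not in old_api for name in new_api):
        return "minor"
    return "patch"


def next_version(current: str, bump: str) -> str:
    """Apply a bump level to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


old = {"parse": "(str) -> Doc"}
new = {"parse": "(str) -> Doc", "dump": "(Doc) -> str"}
print(next_version("1.4.2", suggest_bump(old, new)))
```

A signature-only diff cannot see behavioral changes behind an unchanged signature, which is one reason a version number derived this way may still mischaracterize compatibility.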
Coverage-Based Debloating for Java Bytecode
Software bloat is code that is packaged in an application but is actually not
necessary to run the application. The presence of software bloat is an issue
for security, for performance, and for maintenance. In this paper, we introduce
a novel technique for debloating Java bytecode, which we call coverage-based
debloating. We leverage a combination of state-of-the-art Java bytecode
coverage tools to precisely capture what parts of a project and its
dependencies are used at runtime. Then, we automatically remove the parts that
are not covered to generate a debloated version of the compiled project. We
successfully generate debloated versions of 220 open-source Java libraries,
which are syntactically correct and preserve their original behavior according
to the workload. Our results indicate that 68.3% of the libraries' bytecode and
20.5% of their total dependencies can be removed through coverage-based
debloating. Meanwhile, we present the first experiment that assesses the
utility of debloated libraries with respect to client applications that reuse
them. We show that 80.9% of the clients with at least one test that uses the
library successfully compile and pass their test suite when the original
library is replaced by its debloated version.
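The core idea of removing whatever the workload does not cover can be sketched as below. This is only a toy model under stated assumptions: a compiled artifact is represented as a hypothetical dictionary from class name to bytes, and the set of covered classes is given directly, whereas the paper obtains coverage from Java bytecode coverage tools and operates on real bytecode.

```python
from typing import Dict, Set, Tuple


def debloat(classes: Dict[str, bytes], covered: Set[str]) -> Tuple[Dict[str, bytes], float]:
    """Keep only covered classes and report the fraction of bytes removed."""
    kept = {name: code for name, code in classes.items() if name in covered}
    total = sum(len(code) for code in classes.values())
    removed = total - sum(len(code) for code in kept.values())
    ratio = removed / total if total else 0.0
    return kept, ratio


# Hypothetical artifact: two classes, only one of which the workload exercises.
artifact = {"com.example.Used": b"x" * 60, "com.example.Unused": b"x" * 40}
kept, ratio = debloat(artifact, covered={"com.example.Used"})
print(sorted(kept), ratio)
```

Because the result is only as complete as the workload that produced the coverage data, behavior outside that workload may be lost, which is why the paper evaluates debloated libraries against client test suites.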
Demystifying Dependency Bugs in Deep Learning Stack
Deep learning (DL) applications, built upon a heterogeneous and complex DL
stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow),
are subject to software and hardware dependencies across the DL stack. One
challenge in dependency management across the entire engineering lifecycle is
posed by the asynchronous and radical evolution and the complex version
constraints among dependencies. Developers may introduce dependency bugs (DBs)
in selecting, using and maintaining dependencies. However, the characteristics
of DBs in the DL stack are still under-investigated, hindering practical solutions
to dependency management in the DL stack. To bridge this gap, this paper presents
the first comprehensive study to characterize symptoms, root causes and fix
patterns of DBs across the whole DL stack with 446 DBs collected from
StackOverflow posts and GitHub issues. For each DB, we first investigate the
symptom as well as the lifecycle stage and dependency where the symptom is
exposed. Then, we analyze the root cause as well as the lifecycle stage and
dependency where the root cause is introduced. Finally, we explore the fix
pattern and the knowledge sources that are used to fix it. Our findings from
this study shed light on practical implications for dependency management.
Insights from Population Genomics to Enhance and Sustain Biological Control of Insect Pests
Biological control, the use of organisms (e.g., nematodes, arthropods, bacteria, fungi, viruses) for the suppression of insect pest species, is a well-established, ecologically sound and economically profitable tactic for crop protection. This approach has served as a sustainable solution for many insect pest problems for over a century in North America. However, all pest management tactics have associated risks. Specifically, the ecological non-target effects of biological control have been examined in numerous systems. In contrast, the need to understand the short- and long-term evolutionary consequences of human-mediated manipulation of biological control organisms for importation, augmentation and conservation biological control has only recently been acknowledged. In particular, population genomics presents exceptional opportunities to study adaptive evolution and invasiveness of pests and biological control organisms. Population genomics also provides insights into (1) long-term biological consequences of releases, (2) the ecological success and sustainability of this pest management tactic and (3) non-target effects on native species, populations and ecosystems. Recent advances in genomic sequencing technology and model-based statistical methods to analyze population-scale genomic data provide a much-needed impetus for biological control programs to benefit by incorporating a consideration of evolutionary consequences. Here, we review current technology and methods in population genomics and their applications to biological control and include basic guidelines for biological control researchers for implementing genomic technology and statistical modeling.
Understanding and assessing security on Android via static code analysis
Smart devices have become a rich source of sensitive information, including personal data (contacts and account data) and context information such as GPS data that is continuously aggregated by onboard sensors. As a consequence, mobile platforms have become a prime target for malicious and over-curious applications. The growing complexity and the quickly rising number of mobile apps have further reinforced the demand for comprehensive application security vetting. This dissertation presents a line of work that advances security testing on Android via static code analysis. In the first part of this dissertation, we build an analysis framework that statically models the complex runtime behavior of apps and Android's application framework (on which apps are built) to extract privacy and security-relevant data flows. We provide the first classification of Android's protected resources within the framework and generate precise API-to-permission mappings that improve on prior work. We then propose a third-party library detector for apps that is resilient against common code obfuscations, in order to measure the outdatedness of libraries in apps and to attribute vulnerabilities to the correct software component. Based on these results, we identify root causes of app developers not updating their dependencies and propose actionable items to remedy the status quo. Finally, we measure to what extent libraries can be updated automatically without modifying the application code.
Smart devices have become sources of personal data (e.g., contact data) and context information (e.g., GPS data) that are continuously collected via sensors. As a result, mobile platforms have become an attractive target for malware. The steadily increasing complexity of apps and the growing number of available apps have additionally created a need for thorough security vetting of applications.
This dissertation presents a series of research works that enable security assessments on Android through static code analysis. First, an analysis framework was built that statically models the complex runtime behavior of apps and of Android's application framework (whose functionality apps use) in order to extract security-relevant data flows. In addition, this work enables a classification of protected framework functionality and the generation of precise API-to-permission mappings. A follow-up work presents an obfuscation-resilient technique for detecting software components within an app, in order to determine how up to date the components are and, in the case of security vulnerabilities, to identify the responsible party. Building on this, the root causes were investigated to find out why app developers do not update components and how this situation could be improved. Finally, it was examined to what extent outdated components within an app can be updated automatically.
Understanding Software-2.0: A Study of Machine Learning library usage and evolution
Enabled by a rich ecosystem of Machine Learning (ML) libraries, programming using learned models, i.e., Software-2.0, has gained substantial adoption. However, we do not know what challenges developers encounter when they use ML libraries. With this knowledge gap, researchers miss opportunities to contribute to new research directions, tool builders do not invest resources where automation is most needed, library designers cannot make informed decisions when releasing ML library versions, and developers fail to use common practices when using ML libraries.
We present the first large-scale quantitative and qualitative empirical study to shed light on how developers in Software-2.0 use ML libraries, and how the evolution of these libraries affects their code. In particular, using static analysis we perform a longitudinal study of 3,394 top-rated open-source projects with 46,125 contributors. To further
understand the challenges of ML library evolution, we survey 109 developers who introduce and evolve ML libraries. Using this rich dataset we reveal several novel findings.
Among others, we found an increasing trend of using ML libraries: the ratio of new Python projects that use ML libraries increased from 2% in 2013 to 50% in 2018. We identify several usage patterns including: (i) 36% of the projects use multiple ML libraries to implement various stages of the ML workflows, (ii) developers
update ML libraries more often than traditional libraries, (iii) strict upgrades are the most popular kind of update for ML libraries, (iv) ML library updates often result in cascading library updates, and (v) ML libraries are often downgraded (22.04% of cases). We also observed unique challenges when evolving and
maintaining Software-2.0 such as (i) binary incompatibility of trained ML models, and (ii) benchmarking ML models. Finally, we present actionable implications of our findings for researchers, tool builders, developers, educators, library vendors, and hardware vendors.
Dependency Management 2.0 – A Semantic Web Enabled Approach
Software development and evolution are highly distributed processes that involve a multitude of supporting tools and resources. Application programming interfaces are commonly used by software developers to reduce development cost and complexity by reusing code developed by third parties or published by the open source community. However, these application programming interfaces have also introduced new challenges to the Software Engineering community (e.g., software vulnerabilities, API incompatibilities, and software license violations) that not only extend beyond the traditional boundaries of individual projects but also involve different software artifacts. As a result, there is a need for a technology-independent representation of software dependency semantics and the ability to seamlessly integrate this representation with knowledge from other software artifacts.
The Semantic Web and its supporting technology stack have been widely promoted to model, integrate, and support interoperability among heterogeneous data sources. This dissertation takes advantage of the Semantic Web and its enabling technology stack for knowledge modeling and integration. The thesis introduces five major contributions: (1) We present a formal Software Build System Ontology (SBSON), which captures concepts and properties for software build and dependency management systems. This formal knowledge representation allows us to take advantage of Semantic Web inference services, forming the basis for a more flexible API dependency analysis compared to traditional proprietary analysis approaches. (2) We conducted a user survey involving 53 open source developers to gain insights into how actual developers manage API breaking changes. (3) We introduced a novel approach which integrates our SBSON model with knowledge about source code usage and changes within the Maven ecosystem to support API consumers and producers in managing (assessing and minimizing) the impacts of breaking changes. (4) A Security Vulnerability Analysis Framework (SV-AF) is introduced, which integrates build system, source code, versioning system, and vulnerability ontologies to trace and assess the impact of security vulnerabilities across project boundaries. (5) Finally, we introduce an Ontological Trustworthiness Assessment Model (OntTAM). OntTAM is an integration of our build, source code, vulnerability and license ontologies which supports a holistic analysis and assessment of quality attributes related to the trustworthiness of libraries and APIs in open source systems.
Several case studies are presented to illustrate the applicability and flexibility of our modelling approach, demonstrating that our knowledge modeling approach can seamlessly integrate and reuse knowledge extracted from existing build and dependency management systems with other existing heterogeneous data sources found in the software engineering domain. As part of our case studies, we also demonstrate how this unified knowledge model can enable new types of project dependency analysis.