A Comprehensive Study of Bloated Dependencies in the Maven Ecosystem
Build automation tools and package managers have a profound influence on
software development. They facilitate the reuse of third-party libraries,
support a clear separation between the application's code and its external
dependencies, and automate several software development tasks. However, the
wide adoption of these tools introduces new challenges related to dependency
management. In this paper, we propose an original study of one such challenge:
the emergence of bloated dependencies.
Bloated dependencies are libraries that the build tool packages with the
application's compiled code but that are actually not necessary to build and
run the application. This phenomenon artificially grows the size of the built
binary and increases maintenance effort. We propose a tool, called DepClean, to
analyze the presence of bloated dependencies in Maven artifacts. We analyze
9,639 Java artifacts hosted on Maven Central, which include a total of 723,444
dependency relationships. Our key result is that 75.1% of the analyzed
dependency relationships are bloated. In other words, the number of dependencies
of Maven artifacts could be reduced to about a quarter of its current count.
We also perform a qualitative study with 30 notable open-source projects. Our
results indicate that developers pay attention to their dependencies and are
willing to remove bloated dependencies: 18/21 answered pull requests were
accepted and merged by developers, removing 131 dependencies in total.
Comment: Manuscript submitted to Empirical Software Engineering (EMSE).
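The core notion of a bloated dependency can be made concrete in a few lines: a declared dependency is bloated when none of the types it provides is ever referenced by the compiled code. The sketch below is an illustrative Python analogue; the function, coordinates, and data are hypothetical, not DepClean's API:

```python
# Hedged sketch of the idea behind DepClean: compare the dependencies declared
# in a project's dependency tree against the types actually referenced by the
# compiled code, and flag the rest as bloated. All names are illustrative.

def find_bloated(declared_deps, used_types, dep_types):
    """Return the declared dependencies none of whose types are referenced.

    declared_deps: set of dependency coordinates, e.g. "org.example:lib-a"
    used_types:    set of fully qualified class names the project references
    dep_types:     mapping from dependency coordinate -> classes it provides
    """
    bloated = set()
    for dep in declared_deps:
        provided = dep_types.get(dep, set())
        if not (provided & used_types):  # no provided class is ever referenced
            bloated.add(dep)
    return bloated

deps = {"lib-a", "lib-b", "lib-c"}
types = {"lib-a": {"a.Foo", "a.Bar"}, "lib-b": {"b.Baz"}, "lib-c": {"c.Qux"}}
used = {"a.Foo"}  # only lib-a is actually exercised
print(sorted(find_bloated(deps, used, types)))  # ['lib-b', 'lib-c']
```

A real analysis would derive `used_types` from bytecode (including reflection and dynamic loading, which the paper discusses as a challenge), not from a hand-written set.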
Coverage-Based Debloating for Java Bytecode
Software bloat is code that is packaged in an application but is actually not
necessary to run the application. The presence of software bloat is an issue
for security, for performance, and for maintenance. In this paper, we introduce
a novel technique for debloating Java bytecode, which we call coverage-based
debloating. We leverage a combination of state-of-the-art Java bytecode
coverage tools to precisely capture what parts of a project and its
dependencies are used at runtime. Then, we automatically remove the parts that
are not covered to generate a debloated version of the compiled project. We
successfully generate debloated versions of 220 open-source Java libraries,
which are syntactically correct and preserve their original behavior according
to the workload. Our results indicate that 68.3% of the libraries' bytecode and
20.5% of their total dependencies can be removed through coverage-based
debloating. Meanwhile, we present the first experiment that assesses the
utility of debloated libraries with respect to client applications that reuse
them. We show that 80.9% of the clients with at least one test that uses the
library successfully compile and pass their test suite when the original
library is replaced by its debloated version.
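The capture-then-prune idea behind coverage-based debloating can be mimicked in miniature: trace which functions a workload executes, then treat everything untouched as removable. The paper operates on Java bytecode with bytecode-coverage tools; the Python sketch below is only an analogue, with hypothetical function names:

```python
import sys

# Hedged sketch: record which functions a "workload" calls via sys.settrace,
# then compute the set of candidates the workload never exercised. This only
# illustrates the capture-then-prune idea; it is not the paper's tooling.

executed = set()

def tracer(frame, event, arg):
    if event == "call":
        executed.add(frame.f_code.co_name)  # record every function entered
    return tracer

def used():
    return "needed"

def unused():
    return "bloat"

sys.settrace(tracer)
used()                 # the workload only exercises used()
sys.settrace(None)

candidates = {"used", "unused"}
removable = candidates - executed
print(removable)  # {'unused'}
```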
Automatic Specialization of Third-Party Java Dependencies
Modern software systems rely on a multitude of third-party dependencies. This
large-scale code reuse reduces development costs and time, and it poses new
challenges with respect to maintenance and security. Techniques such as tree
shaking or shading can remove dependencies that are completely unused by a
project, which partly address these challenges. Yet, the remaining dependencies
are likely to be used only partially, leaving room for further reduction of
third-party code. In this paper, we propose a novel technique to specialize
dependencies of Java projects, based on their actual usage. For each
dependency, we systematically identify the subset of its functionalities that
is necessary to build the project, and remove the rest. Each specialized
dependency is repackaged. Then, we generate specialized dependency trees where
the original dependencies are replaced by the specialized versions and we
rebuild the project. We implement our technique in a tool called DepTrim, which
we evaluate with 30 notable open-source Java projects. DepTrim specializes a
total of 343 (86.6%) dependencies across these projects, and successfully
rebuilds each project with a specialized dependency tree. Moreover, through
this specialization, DepTrim removes a total of 60,962 (47.0%) classes from the
dependencies, reducing the ratio of dependency classes to project classes from
8.7x in the original projects to 4.4x after specialization. These results
indicate the relevance of dependency specialization to significantly reduce the
share of third-party code in Java projects.
Comment: 17 pages, 2 figures, 4 tables, 1 algorithm, 2 code listings, 3 equations.
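Dependency specialization can be approximated as a reachability computation: keep only the dependency classes transitively reachable from the project's own classes, and drop the rest. This is a hedged sketch of that idea, not DepTrim's implementation; the class names and the `refs` call graph are made up:

```python
# Illustrative sketch of dependency specialization: compute the classes
# transitively reachable from the project, then intersect each dependency's
# class set with that reachable set. Data and names are hypothetical.

def specialize(project_classes, dep_classes, refs):
    """refs maps a class to the set of classes it references."""
    reachable, stack = set(), list(project_classes)
    while stack:
        c = stack.pop()
        for r in refs.get(c, ()):
            if r not in reachable:
                reachable.add(r)
                stack.append(r)
    # keep, per dependency, only the classes the project actually reaches
    return {dep: classes & reachable for dep, classes in dep_classes.items()}

project = {"app.Main"}
deps = {"libx": {"x.A", "x.B", "x.C"}}
refs = {"app.Main": {"x.A"}, "x.A": {"x.B"}}  # x.C is never reached
print(sorted(specialize(project, deps, refs)["libx"]))  # ['x.A', 'x.B']
```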
Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement
Deep learning (DL) package supply chains (SCs) are critical for DL frameworks
to remain competitive. However, vital knowledge on the nature of DL package SCs
is still lacking. In this paper, we explore the domains, clusters, and
disengagement of packages in two representative PyPI DL package SCs to bridge
this knowledge gap. We analyze the metadata of nearly six million PyPI package
distributions and construct version-sensitive SCs for two popular DL
frameworks: TensorFlow and PyTorch. We find that popular packages (measured by
the number of monthly downloads) in the two SCs cover 34 domains belonging to
eight categories. Applications, Infrastructure, and Sciences categories account
for over 85% of popular packages in either SC; the TensorFlow and PyTorch SCs
have developed specializations in Infrastructure and Applications packages,
respectively. We employ the Leiden community detection algorithm and detect 131
and 100 clusters in the two SCs. The clusters mainly exhibit four shapes:
Arrow, Star, Tree, and Forest, with increasing dependency complexity. Most
clusters are Arrow or Star, but Tree and Forest clusters account for most
packages (TensorFlow SC: 70%, PyTorch SC: 90%). We identify three groups of
reasons why packages disengage from the SC (i.e., remove the DL framework and
its dependents from their installation dependencies): dependency issues,
functional improvements, and ease of installation. The most common
disengagement reason differs between the two SCs. Our study provides rich
implications on the maintenance and dependency management practices of PyPI DL
SCs.
Comment: Manuscript submitted to ACM Transactions on Software Engineering and Methodology.
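The four cluster shapes can be illustrated with a toy classifier over an undirected cluster graph. The degree and connectivity heuristics below are our own illustration of the shape names, not the paper's definitions:

```python
# Hedged sketch: classify a dependency cluster as Forest (disconnected),
# Arrow (a simple chain), Star (one hub), or Tree (branching), using naive
# degree/connectivity heuristics. Illustrative only.

def shape(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # count connected components with a DFS
    seen, components = set(), 0
    for n in nodes:
        if n in seen:
            continue
        components += 1
        stack = [n]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(adj[c] - seen)
    if components > 1:
        return "Forest"
    degrees = sorted(len(v) for v in adj.values())
    if all(d <= 2 for d in degrees):
        return "Arrow"   # a simple chain of dependencies
    if degrees[-1] == len(nodes) - 1:
        return "Star"    # one hub that everything depends on
    return "Tree"        # branching, but no dominant hub

print(shape({"a", "b", "c", "d"}, [("a", "b"), ("b", "c"), ("c", "d")]))  # Arrow
print(shape({"h", "x", "y", "z"}, [("h", "x"), ("h", "y"), ("h", "z")]))  # Star
```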
Challenges of Producing Software Bill Of Materials for Java
Software bills of materials (SBOM) promise to become the backbone of software
supply chain hardening. We deep-dive into 6 tools and the accuracy of the SBOMs
they produce for complex open-source Java projects. Our novel insights reveal
some hard challenges for the accurate production and usage of SBOMs.
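One way to make "accuracy of an SBOM" concrete is to compare a tool's reported component list against a ground-truth dependency resolution. The precision/recall framing and data below are our illustration, not the paper's evaluation protocol:

```python
# Hedged sketch: score an SBOM against ground truth. A missing component
# lowers recall; a phantom component lowers precision. Data is made up.

def sbom_accuracy(reported, ground_truth):
    tp = len(reported & ground_truth)
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

truth = {"log4j-core@2.17.1", "guava@31.1", "jackson-databind@2.13.2"}
sbom = {"log4j-core@2.17.1", "guava@31.1", "phantom-lib@1.0"}  # one miss, one extra
print(sbom_accuracy(sbom, truth))  # (0.6666666666666666, 0.6666666666666666)
```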
Taxonomy of Attacks on Open-Source Software Supply Chains
The widespread dependency on open-source software makes it a fruitful target
for malicious actors, as demonstrated by recurring attacks. The complexity of
today's open-source supply chains results in a significant attack surface,
giving attackers numerous opportunities to reach the goal of injecting
malicious code into open-source artifacts that is then downloaded and executed
by victims.
This work proposes a general taxonomy for attacks on open-source supply
chains, independent of specific programming languages or ecosystems, and
covering all supply chain stages from code contributions to package
distribution. Taking the form of an attack tree, it covers 107 unique vectors,
linked to 94 real-world incidents, and mapped to 33 mitigating safeguards.
User surveys conducted with 17 domain experts and 134 software developers
positively validated the correctness, comprehensiveness and comprehensibility
of the taxonomy, as well as its suitability for various use-cases. Survey
participants also assessed the utility and costs of the identified safeguards,
and whether they are used.
Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code
We consider the problem of identifying the provenance of free/open source
software (FOSS) and specifically the need of identifying where reused source
code has been copied from. We propose a lightweight approach to solve the
problem based on software identifiers, such as the names of variables, classes,
and functions chosen by programmers. The proposed approach is able to
efficiently narrow down to a small set of candidate origin products, to be
further analyzed with more expensive techniques to make a final provenance
determination. By analyzing the PyPI (Python Packaging Index) open source
ecosystem we find that globally defined identifiers are very distinct. Across
PyPI's 244 K packages we found 11.2 M different global identifiers (classes and
method/function names, with only 0.6% of identifiers shared among the two types
of entities); 76% of identifiers were used only in one package, and 93% in at
most 3. Randomly selecting 3 non-frequent global identifiers from an input
product is enough to narrow down its origins to a maximum of 3 products within
89% of the cases. We validate the proposed approach by mapping Debian source
packages implemented in Python to the corresponding PyPI packages; this
approach uses at most five trials, where each trial takes three randomly chosen
global identifiers from a randomly chosen Python file of the subject software
package, then ranks results using a popularity index and requires inspecting
only the top result. In our experiments, this method is effective at finding
the true origin of a project with a recall of 0.9 and a precision of 0.77.
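The approach can be sketched with the standard library alone: extract globally defined class and function names with `ast`, then intersect the candidate package sets of those identifiers in an inverted index. The index below is a made-up toy, not PyPI data:

```python
import ast

# Hedged sketch of identifier-based provenance: parse a source file, collect
# its top-level class/function names, and intersect the sets of packages that
# define each name. Index contents and package names are hypothetical.

def global_identifiers(source):
    tree = ast.parse(source)
    return {node.name for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                                 ast.ClassDef))}

# toy inverted index: identifier -> packages that define it
index = {"FrobnicateMixin": {"pkg-a"},
         "parse_manifest": {"pkg-a", "pkg-b"},
         "main": {"pkg-a", "pkg-b", "pkg-c"}}

snippet = "class FrobnicateMixin: pass\n\ndef parse_manifest(p):\n    return p\n"
idents = global_identifiers(snippet)
candidates = set.intersection(*(index[i] for i in idents if i in index))
print(candidates)  # {'pkg-a'}
```

In the paper's setting, rare identifiers are the discriminating ones; intersecting a few non-frequent names shrinks the candidate set quickly, after which a popularity ranking picks the likely origin.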
An SDN QoE Monitoring Framework for VoIP and video applications
Lately, there has been a rapid rise of the mobile communications industry, since the use of mobile devices is spreading at a fast pace and is expected to continue its penetration into the daily routine of consumers. This fact, combined with the limitations of the current communications networks’ structure, necessitates the development of new networks with increased capabilities, so that users can be served with the best possible quality of service and at the same time with the optimal network resources utilization.
A new networking approach is Software Defined Networking (SDN) which decouples the control from the data plane, transforming the network elements to simple forwarding devices and making decisions centrally. The quality of service perceived by the user, or quality of experience (QoE), is considered to be a matter of great importance in software defined networks.
This diploma thesis aims at presenting SDN technology, reviewing existing research in the field of QoE on SDN networks and then developing an SDN application that monitors and preserves the QoE for VoIP and video applications. More specifically, the developed SDN QoE Monitoring Framework (SQMF) periodically monitors various network parameters on the VoIP/video packets transmission path, based on which it calculates the QoE. If it is found that the result is less than a predefined threshold, the framework changes the transmission path, and thus the QoE recovers.
The structure of this diploma thesis is the following: Chapter 1 presents the current state of communications networks and predictions for the future state, as well as the challenges that current networks will not be able to cope with. Chapter 2 then describes in detail the SDN technology in terms of architecture, main control-data plane communication protocol, use cases, standardization, advantages, and disadvantages. Chapter 3 introduces the concept of QoE and lists well-known QoE estimation models for various application types, some of which were used in this thesis. Relevant existing studies in the field of QoE on SDN networks, as well as a comparative table, can be found in Chapter 4. The following chapters concern the framework implemented in the context of this diploma thesis: Chapter 5 describes in detail all the required tools and instructions for the development of SQMF, while Chapter 6 presents examples where the QoE in a network can face degradation. Finally, Chapter 7 analyzes in depth SQMF's design principles, logic, and code files, provides a demonstration of its operation, and evaluates it, whereas Chapter 8 briefly summarizes the conclusions of this thesis and points to future work.
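SQMF's monitor-and-switch loop can be sketched as follows. The MOS estimate uses a commonly cited simplified form of the ITU-T E-model (an R-factor degraded by one-way delay and packet loss, mapped to MOS); the path-switching logic, threshold, and measurements are illustrative assumptions, not SQMF's actual code:

```python
# Hedged sketch of a QoE-threshold path switch in the spirit of SQMF.
# mos() is a reduced E-model approximation; coefficients are the widely
# used simplified ones, not tuned values from the thesis.

def mos(delay_ms, loss_pct):
    r = 93.2 - delay_ms / 40.0 - 2.5 * loss_pct   # crude R-factor estimate
    r = max(0.0, min(100.0, r))
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

def monitor(paths, measure, threshold=3.5):
    """Return the first path whose estimated MOS clears the threshold."""
    for path in paths:
        delay, loss = measure(path)
        if mos(delay, loss) >= threshold:
            return path
    return paths[0]  # no path is good enough; keep the primary

# hypothetical per-path measurements: (one-way delay in ms, loss in %)
stats = {"primary": (400, 8.0), "backup": (60, 0.5)}
chosen = monitor(["primary", "backup"], lambda p: stats[p])
print(chosen)  # backup
```

A periodic version would re-run `monitor` on a timer and reinstall forwarding rules through the SDN controller whenever the chosen path changes.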
Gathering solutions and providing APIs for their orchestration to implement continuous software delivery
In traditional IT environments, it is common for software updates and new releases to take up to several weeks or even months to be eventually available to end users. Therefore, many IT vendors and providers of software products and services face the challenge of delivering updates considerably more frequently. This is because users, customers, and other stakeholders expect accelerated feedback loops and significantly faster responses to changing demands and issues that arise. Thus, taking this challenge seriously is of utmost economic importance for IT organizations if they wish to remain competitive. Continuous software delivery is an emerging paradigm adopted by an increasing number of organizations in order to address this challenge. It aims to drastically shorten release cycles while ensuring the delivery of high-quality software. Adopting continuous delivery essentially means making it economical to constantly deliver changes in small batches. Infrequent high-risk releases with lots of accumulated changes are thereby replaced by a continuous stream of small and low-risk updates.
To gain from the benefits of continuous delivery, a high degree of automation is required. This is technically achieved by implementing continuous delivery pipelines consisting of different application-specific stages (build, test, production, etc.) to automate most parts of the application delivery process. Each stage relies on a corresponding application environment such as a build environment or production environment. This work presents concepts and approaches to implement continuous delivery pipelines based on systematically gathered solutions to be used and orchestrated as building blocks of application environments. Initially, the presented Gather'n'Deliver method is centered around a shared knowledge base to provide the foundation for gathering, utilizing, and orchestrating diverse solutions such as deployment scripts, configuration definitions, and Cloud services. Several classification dimensions and taxonomies are discussed in order to facilitate a systematic categorization of solutions, in addition to expressing application environment requirements that are satisfied by those solutions. The presented GatherBase framework enables the collaborative and automated gathering of solutions through solution repositories. These repositories are the foundation for building diverse knowledge base variants that provide fine-grained query mechanisms to find and retrieve solutions, for example, to be used as building blocks of specific application environments. Combining and integrating diverse solutions at runtime is achieved by orchestrating their APIs. Since some solutions such as lower-level executable artifacts (deployment scripts, configuration definitions, etc.) do not immediately provide their functionality through APIs, additional APIs need to be supplied. This issue is addressed by different approaches, such as the presented Any2API framework that is intended to generate individual APIs for such artifacts. 
An integrated architecture in conjunction with corresponding prototype implementations aims to demonstrate the technical feasibility of the presented approaches. Finally, various validation scenarios evaluate the approaches within the scope of continuous delivery and application environments and even beyond.
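The pipeline-of-stages idea described above can be sketched minimally as an ordered chain with stop-on-failure semantics; the stage names and semantics below are illustrative, not the thesis' framework:

```python
# Hedged sketch of a continuous delivery pipeline: run application-specific
# stages (build, test, production) in order, stopping at the first failure.
# Stage implementations here are trivial stand-ins.

def run_pipeline(stages, change):
    """Run each (name, stage) pair in order; stop at the first failure."""
    log = []
    for name, stage in stages:
        ok = stage(change)
        log.append((name, ok))
        if not ok:
            break  # a failing stage halts promotion of this change
    return log

stages = [
    ("build", lambda c: True),
    ("test", lambda c: c != "broken-change"),
    ("production", lambda c: True),
]
print(run_pipeline(stages, "small-change"))
print(run_pipeline(stages, "broken-change"))  # stops after the failing test stage
```

Each stage would, in the thesis' terms, run inside its own application environment (build environment, production environment, and so on), assembled from the gathered solution building blocks.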