A Comprehensive Study of Bloated Dependencies in the Maven Ecosystem
Build automation tools and package managers have a profound influence on
software development. They facilitate the reuse of third-party libraries,
support a clear separation between the application's code and its external
dependencies, and automate several software development tasks. However, the
wide adoption of these tools introduces new challenges related to dependency
management. In this paper, we propose an original study of one such challenge:
the emergence of bloated dependencies.
Bloated dependencies are libraries that the build tool packages with the
application's compiled code but that are actually not necessary to build and
run the application. This phenomenon artificially grows the size of the built
binary and increases maintenance effort. We propose a tool, called DepClean, to
analyze the presence of bloated dependencies in Maven artifacts. We analyze
9,639 Java artifacts hosted on Maven Central, which include a total of 723,444
dependency relationships. Our key result is that 75.1% of the analyzed
dependency relationships are bloated. In other words, the number of dependencies
of Maven artifacts could be reduced to about a quarter of its current count.
We also perform a qualitative study with 30 notable open-source projects. Our
results indicate that developers pay attention to their dependencies and are
willing to remove bloated dependencies: 18/21 answered pull requests were
accepted and merged by developers, removing 131 dependencies in total.
Comment: Manuscript submitted to Empirical Software Engineering (EMSE).
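The core notion of a bloated dependency can be made concrete in a few lines: a declared dependency is bloated when none of the types it provides is ever referenced by the compiled code. The sketch below is an illustrative Python analogue; the function, coordinates, and data are hypothetical, not DepClean's API:

```python
# Hedged sketch of the idea behind DepClean: compare the dependencies declared
# in a project's dependency tree against the types actually referenced by the
# compiled code, and flag the rest as bloated. All names are illustrative.

def find_bloated(declared_deps, used_types, dep_types):
    """Return the declared dependencies none of whose types are referenced.

    declared_deps: set of dependency coordinates, e.g. "org.example:lib-a"
    used_types:    set of fully qualified class names the project references
    dep_types:     mapping from dependency coordinate -> classes it provides
    """
    bloated = set()
    for dep in declared_deps:
        provided = dep_types.get(dep, set())
        if not (provided & used_types):  # no provided class is ever referenced
            bloated.add(dep)
    return bloated

deps = {"lib-a", "lib-b", "lib-c"}
types = {"lib-a": {"a.Foo", "a.Bar"}, "lib-b": {"b.Baz"}, "lib-c": {"c.Qux"}}
used = {"a.Foo"}  # only lib-a is actually exercised
print(sorted(find_bloated(deps, used, types)))  # ['lib-b', 'lib-c']
```

A real analysis would derive `used_types` from bytecode (including reflection and dynamic loading, which the paper discusses as a challenge), not from a hand-written set.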
Coverage-Based Debloating for Java Bytecode
Software bloat is code that is packaged in an application but is actually not
necessary to run the application. The presence of software bloat is an issue
for security, for performance, and for maintenance. In this paper, we introduce
a novel technique for debloating Java bytecode, which we call coverage-based
debloating. We leverage a combination of state-of-the-art Java bytecode
coverage tools to precisely capture what parts of a project and its
dependencies are used at runtime. Then, we automatically remove the parts that
are not covered to generate a debloated version of the compiled project. We
successfully generate debloated versions of 220 open-source Java libraries,
which are syntactically correct and preserve their original behavior according
to the workload. Our results indicate that 68.3% of the libraries' bytecode and
20.5% of their total dependencies can be removed through coverage-based
debloating. Meanwhile, we present the first experiment that assesses the
utility of debloated libraries with respect to client applications that reuse
them. We show that 80.9% of the clients with at least one test that uses the
library successfully compile and pass their test suite when the original
library is replaced by its debloated version.
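The capture-then-prune idea behind coverage-based debloating can be mimicked in miniature: trace which functions a workload executes, then treat everything untouched as removable. The paper operates on Java bytecode with bytecode-coverage tools; the Python sketch below is only an analogue, with hypothetical function names:

```python
import sys

# Hedged sketch: record which functions a "workload" calls via sys.settrace,
# then compute the set of candidates the workload never exercised. This only
# illustrates the capture-then-prune idea; it is not the paper's tooling.

executed = set()

def tracer(frame, event, arg):
    if event == "call":
        executed.add(frame.f_code.co_name)  # record every function entered
    return tracer

def used():
    return "needed"

def unused():
    return "bloat"

sys.settrace(tracer)
used()                 # the workload only exercises used()
sys.settrace(None)

candidates = {"used", "unused"}
removable = candidates - executed
print(removable)  # {'unused'}
```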
Automatic Specialization of Third-Party Java Dependencies
Modern software systems rely on a multitude of third-party dependencies. This
large-scale code reuse reduces development costs and time, and it poses new
challenges with respect to maintenance and security. Techniques such as tree
shaking or shading can remove dependencies that are completely unused by a
project, which partly address these challenges. Yet, the remaining dependencies
are likely to be used only partially, leaving room for further reduction of
third-party code. In this paper, we propose a novel technique to specialize
dependencies of Java projects, based on their actual usage. For each
dependency, we systematically identify the subset of its functionalities that
is necessary to build the project, and remove the rest. Each specialized
dependency is repackaged. Then, we generate specialized dependency trees where
the original dependencies are replaced by the specialized versions and we
rebuild the project. We implement our technique in a tool called DepTrim, which
we evaluate with 30 notable open-source Java projects. DepTrim specializes a
total of 343 (86.6%) dependencies across these projects, and successfully
rebuilds each project with a specialized dependency tree. Moreover, through
this specialization, DepTrim removes a total of 60,962 (47.0%) classes from the
dependencies, reducing the ratio of dependency classes to project classes from
8.7x in the original projects to 4.4x after specialization. These results
indicate the relevance of dependency specialization to significantly reduce the
share of third-party code in Java projects.
Comment: 17 pages, 2 figures, 4 tables, 1 algorithm, 2 code listings, 3 equations.
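Dependency specialization can be approximated as a reachability computation: keep only the dependency classes transitively reachable from the project's own classes, and drop the rest. This is a hedged sketch of that idea, not DepTrim's implementation; the class names and the `refs` call graph are made up:

```python
# Illustrative sketch of dependency specialization: compute the classes
# transitively reachable from the project, then intersect each dependency's
# class set with that reachable set. Data and names are hypothetical.

def specialize(project_classes, dep_classes, refs):
    """refs maps a class to the set of classes it references."""
    reachable, stack = set(), list(project_classes)
    while stack:
        c = stack.pop()
        for r in refs.get(c, ()):
            if r not in reachable:
                reachable.add(r)
                stack.append(r)
    # keep, per dependency, only the classes the project actually reaches
    return {dep: classes & reachable for dep, classes in dep_classes.items()}

project = {"app.Main"}
deps = {"libx": {"x.A", "x.B", "x.C"}}
refs = {"app.Main": {"x.A"}, "x.A": {"x.B"}}  # x.C is never reached
print(sorted(specialize(project, deps, refs)["libx"]))  # ['x.A', 'x.B']
```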
Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement
Deep learning (DL) package supply chains (SCs) are critical for DL frameworks
to remain competitive. However, vital knowledge on the nature of DL package SCs
is still lacking. In this paper, we explore the domains, clusters, and
disengagement of packages in two representative PyPI DL package SCs to bridge
this knowledge gap. We analyze the metadata of nearly six million PyPI package
distributions and construct version-sensitive SCs for two popular DL
frameworks: TensorFlow and PyTorch. We find that popular packages (measured by
the number of monthly downloads) in the two SCs cover 34 domains belonging to
eight categories. Applications, Infrastructure, and Sciences categories account
for over 85% of popular packages in either SC; the TensorFlow and PyTorch SCs
have developed specializations in Infrastructure and Applications packages,
respectively. We employ the Leiden community detection algorithm and detect 131
and 100 clusters in the two SCs. The clusters mainly exhibit four shapes:
Arrow, Star, Tree, and Forest, with increasing dependency complexity. Most
clusters are Arrow or Star, but Tree and Forest clusters account for most
packages (TensorFlow SC: 70%, PyTorch SC: 90%). We identify three groups of
reasons why packages disengage from the SC (i.e., remove the DL framework and
its dependents from their installation dependencies): dependency issues,
functional improvements, and ease of installation. The most common
disengagement reason differs between the two SCs. Our study provides rich
implications on the maintenance and dependency management practices of PyPI DL
SCs.
Comment: Manuscript submitted to ACM Transactions on Software Engineering and Methodology.
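The four cluster shapes can be illustrated with a toy classifier over an undirected cluster graph. The degree and connectivity heuristics below are our own illustration of the shape names, not the paper's definitions:

```python
# Hedged sketch: classify a dependency cluster as Forest (disconnected),
# Arrow (a simple chain), Star (one hub), or Tree (branching), using naive
# degree/connectivity heuristics. Illustrative only.

def shape(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # count connected components with a DFS
    seen, components = set(), 0
    for n in nodes:
        if n in seen:
            continue
        components += 1
        stack = [n]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(adj[c] - seen)
    if components > 1:
        return "Forest"
    degrees = sorted(len(v) for v in adj.values())
    if all(d <= 2 for d in degrees):
        return "Arrow"   # a simple chain of dependencies
    if degrees[-1] == len(nodes) - 1:
        return "Star"    # one hub that everything depends on
    return "Tree"        # branching, but no dominant hub

print(shape({"a", "b", "c", "d"}, [("a", "b"), ("b", "c"), ("c", "d")]))  # Arrow
print(shape({"h", "x", "y", "z"}, [("h", "x"), ("h", "y"), ("h", "z")]))  # Star
```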
Challenges of Producing Software Bill Of Materials for Java
Software bills of materials (SBOM) promise to become the backbone of software
supply chain hardening. We deep-dive into 6 tools and the accuracy of the SBOMs
they produce for complex open-source Java projects. Our novel insights reveal
some hard challenges for the accurate production and usage of SBOMs.
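One way to make "accuracy of an SBOM" concrete is to compare a tool's reported component list against a ground-truth dependency resolution. The precision/recall framing and data below are our illustration, not the paper's evaluation protocol:

```python
# Hedged sketch: score an SBOM against ground truth. A missing component
# lowers recall; a phantom component lowers precision. Data is made up.

def sbom_accuracy(reported, ground_truth):
    tp = len(reported & ground_truth)
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

truth = {"log4j-core@2.17.1", "guava@31.1", "jackson-databind@2.13.2"}
sbom = {"log4j-core@2.17.1", "guava@31.1", "phantom-lib@1.0"}  # one miss, one extra
print(sbom_accuracy(sbom, truth))  # (0.6666666666666666, 0.6666666666666666)
```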
Taxonomy of Attacks on Open-Source Software Supply Chains
The widespread dependency on open-source software makes it a fruitful target
for malicious actors, as demonstrated by recurring attacks. The complexity of
today's open-source supply chains results in a significant attack surface,
giving attackers numerous opportunities to reach the goal of injecting
malicious code into open-source artifacts that is then downloaded and executed
by victims.
This work proposes a general taxonomy for attacks on open-source supply
chains, independent of specific programming languages or ecosystems, and
covering all supply chain stages from code contributions to package
distribution. Taking the form of an attack tree, it covers 107 unique vectors,
linked to 94 real-world incidents, and mapped to 33 mitigating safeguards.
User surveys conducted with 17 domain experts and 134 software developers
positively validated the correctness, comprehensiveness and comprehensibility
of the taxonomy, as well as its suitability for various use-cases. Survey
participants also assessed the utility and costs of the identified safeguards,
and whether they are used.
Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code
We consider the problem of identifying the provenance of free/open source
software (FOSS) and specifically the need of identifying where reused source
code has been copied from. We propose a lightweight approach to solve the
problem based on software identifiers, such as the names of variables, classes,
and functions chosen by programmers. The proposed approach is able to
efficiently narrow down to a small set of candidate origin products, to be
further analyzed with more expensive techniques to make a final provenance
determination. By analyzing the PyPI (Python Packaging Index) open source
ecosystem we find that globally defined identifiers are very distinct. Across
PyPI's 244 K packages we found 11.2 M different global identifiers (classes and
method/function names, with only 0.6% of identifiers shared among the two types
of entities); 76% of identifiers were used only in one package, and 93% in at
most 3. Randomly selecting 3 non-frequent global identifiers from an input
product is enough to narrow down its origins to a maximum of 3 products within
89% of the cases. We validate the proposed approach by mapping Debian source
packages implemented in Python to the corresponding PyPI packages; this
approach uses at most five trials, where each trial takes three randomly chosen
global identifiers from a randomly chosen Python file of the subject software
package, then ranks results using a popularity index and requires inspecting
only the top result. In our experiments, this method is effective at finding
the true origin of a project with a recall of 0.9 and a precision of 0.77.
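The approach can be sketched with the standard library alone: extract globally defined class and function names with `ast`, then intersect the candidate package sets of those identifiers in an inverted index. The index below is a made-up toy, not PyPI data:

```python
import ast

# Hedged sketch of identifier-based provenance: parse a source file, collect
# its top-level class/function names, and intersect the sets of packages that
# define each name. Index contents and package names are hypothetical.

def global_identifiers(source):
    tree = ast.parse(source)
    return {node.name for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                                 ast.ClassDef))}

# toy inverted index: identifier -> packages that define it
index = {"FrobnicateMixin": {"pkg-a"},
         "parse_manifest": {"pkg-a", "pkg-b"},
         "main": {"pkg-a", "pkg-b", "pkg-c"}}

snippet = "class FrobnicateMixin: pass\n\ndef parse_manifest(p):\n    return p\n"
idents = global_identifiers(snippet)
candidates = set.intersection(*(index[i] for i in idents if i in index))
print(candidates)  # {'pkg-a'}
```

In the paper's setting, rare identifiers are the discriminating ones; intersecting a few non-frequent names shrinks the candidate set quickly, after which a popularity ranking picks the likely origin.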
An SDN QoE Monitoring Framework for VoIP and video applications
Lately, there has been a rapid rise of the mobile communications industry, since the use of mobile devices is spreading at a fast pace and is expected to continue its penetration into the daily routine of consumers. This fact, combined with the limitations of the current communications networks’ structure, necessitates the development of new networks with increased capabilities, so that users can be served with the best possible quality of service and at the same time with the optimal network resources utilization.
A new networking approach is Software Defined Networking (SDN) which decouples the control from the data plane, transforming the network elements to simple forwarding devices and making decisions centrally. The quality of service perceived by the user, or quality of experience (QoE), is considered to be a matter of great importance in software defined networks.
This diploma thesis aims at presenting SDN technology, reviewing existing research in the field of QoE on SDN networks and then developing an SDN application that monitors and preserves the QoE for VoIP and video applications. More specifically, the developed SDN QoE Monitoring Framework (SQMF) periodically monitors various network parameters on the VoIP/video packets transmission path, based on which it calculates the QoE. If it is found that the result is less than a predefined threshold, the framework changes the transmission path, and thus the QoE recovers.
The structure of this diploma thesis is the following: Chapter 1 presents the current state of communications networks and predictions for the future state, as well as the challenges that current networks will not be able to cope with. Chapter 2 then describes in detail the SDN technology in terms of architecture, main control-data plane communication protocol, use cases, standardization, advantages, and disadvantages. Chapter 3 introduces the concept of QoE and lists well-known QoE estimation models for various application types, some of which were used in this thesis. Relevant existing studies in the field of QoE on SDN networks, as well as a comparative table, can be found in Chapter 4. The following chapters concern the framework implemented in the context of this diploma thesis: Chapter 5 describes in detail all the required tools and instructions for the development of SQMF, while Chapter 6 presents examples where the QoE in a network can face degradation. Finally, Chapter 7 analyzes in depth SQMF's design principles, logic, and code files, provides a demonstration of its operation, and evaluates it, whereas Chapter 8 briefly summarizes the conclusions of this thesis and points to future work.
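SQMF's monitor-and-switch loop can be sketched as follows. The MOS estimate uses a commonly cited simplified form of the ITU-T E-model (an R-factor degraded by one-way delay and packet loss, mapped to MOS); the path-switching logic, threshold, and measurements are illustrative assumptions, not SQMF's actual code:

```python
# Hedged sketch of a QoE-threshold path switch in the spirit of SQMF.
# mos() is a reduced E-model approximation; coefficients are the widely
# used simplified ones, not tuned values from the thesis.

def mos(delay_ms, loss_pct):
    r = 93.2 - delay_ms / 40.0 - 2.5 * loss_pct   # crude R-factor estimate
    r = max(0.0, min(100.0, r))
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

def monitor(paths, measure, threshold=3.5):
    """Return the first path whose estimated MOS clears the threshold."""
    for path in paths:
        delay, loss = measure(path)
        if mos(delay, loss) >= threshold:
            return path
    return paths[0]  # no path is good enough; keep the primary

# hypothetical per-path measurements: (one-way delay in ms, loss in %)
stats = {"primary": (400, 8.0), "backup": (60, 0.5)}
chosen = monitor(["primary", "backup"], lambda p: stats[p])
print(chosen)  # backup
```

A periodic version would re-run `monitor` on a timer and reinstall forwarding rules through the SDN controller whenever the chosen path changes.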
Gathering solutions and providing APIs for their orchestration to implement continuous software delivery
In traditional IT environments, it is common for software updates and new releases to take up to several weeks or even months to be eventually available to end users. Therefore, many IT vendors and providers of software products and services face the challenge of delivering updates considerably more frequently. This is because users, customers, and other stakeholders expect accelerated feedback loops and significantly faster responses to changing demands and issues that arise. Thus, taking this challenge seriously is of utmost economic importance for IT organizations if they wish to remain competitive. Continuous software delivery is an emerging paradigm adopted by an increasing number of organizations in order to address this challenge. It aims to drastically shorten release cycles while ensuring the delivery of high-quality software. Adopting continuous delivery essentially means making it economical to constantly deliver changes in small batches. Infrequent high-risk releases with lots of accumulated changes are thereby replaced by a continuous stream of small and low-risk updates.
To gain from the benefits of continuous delivery, a high degree of automation is required. This is technically achieved by implementing continuous delivery pipelines consisting of different application-specific stages (build, test, production, etc.) to automate most parts of the application delivery process. Each stage relies on a corresponding application environment such as a build environment or production environment. This work presents concepts and approaches to implement continuous delivery pipelines based on systematically gathered solutions to be used and orchestrated as building blocks of application environments. Initially, the presented Gather'n'Deliver method is centered around a shared knowledge base to provide the foundation for gathering, utilizing, and orchestrating diverse solutions such as deployment scripts, configuration definitions, and Cloud services. Several classification dimensions and taxonomies are discussed in order to facilitate a systematic categorization of solutions, in addition to expressing application environment requirements that are satisfied by those solutions. The presented GatherBase framework enables the collaborative and automated gathering of solutions through solution repositories. These repositories are the foundation for building diverse knowledge base variants that provide fine-grained query mechanisms to find and retrieve solutions, for example, to be used as building blocks of specific application environments. Combining and integrating diverse solutions at runtime is achieved by orchestrating their APIs. Since some solutions such as lower-level executable artifacts (deployment scripts, configuration definitions, etc.) do not immediately provide their functionality through APIs, additional APIs need to be supplied. This issue is addressed by different approaches, such as the presented Any2API framework that is intended to generate individual APIs for such artifacts. 
An integrated architecture in conjunction with corresponding prototype implementations aims to demonstrate the technical feasibility of the presented approaches. Finally, various validation scenarios evaluate the approaches within the scope of continuous delivery and application environments and even beyond.
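The pipeline-of-stages idea described above can be sketched minimally as an ordered chain with stop-on-failure semantics; the stage names and semantics below are illustrative, not the thesis' framework:

```python
# Hedged sketch of a continuous delivery pipeline: run application-specific
# stages (build, test, production) in order, stopping at the first failure.
# Stage implementations here are trivial stand-ins.

def run_pipeline(stages, change):
    """Run each (name, stage) pair in order; stop at the first failure."""
    log = []
    for name, stage in stages:
        ok = stage(change)
        log.append((name, ok))
        if not ok:
            break  # a failing stage halts promotion of this change
    return log

stages = [
    ("build", lambda c: True),
    ("test", lambda c: c != "broken-change"),
    ("production", lambda c: True),
]
print(run_pipeline(stages, "small-change"))
print(run_pipeline(stages, "broken-change"))  # stops after the failing test stage
```

Each stage would, in the thesis' terms, run inside its own application environment (build environment, production environment, and so on), assembled from the gathered solution building blocks.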