988 research outputs found
Promises and Perils of Mining Software Package Ecosystem Data
The use of third-party packages is becoming increasingly popular and has led
to the emergence of large software package ecosystems with a maze of
inter-dependencies. Since the reliance on these ecosystems enables developers
to reduce development effort and increase productivity, it has attracted the
interest of researchers: understanding the infrastructure and dynamics of
package ecosystems has given rise to approaches for better code reuse,
automated updates, and the avoidance of vulnerabilities, to name a few
examples. But the reality of these ecosystems also poses challenges to software
engineering researchers, such as: How do we obtain the complete network of
dependencies along with the corresponding versioning information? What are the
boundaries of these package ecosystems? How do we consistently detect
dependencies that are declared but not used? How do we consistently identify
developers within a package ecosystem? How much of the ecosystem do we need to
understand to analyse a single component? How well do our approaches generalise
across different programming languages and package ecosystems? In this chapter,
we review promises and perils of mining the rich data related to software
package ecosystems available to software engineering researchers.
Comment: Submitted as a book chapter.
Demystifying Compiler Unstable Feature Usage and Impacts in the Rust Ecosystem
The Rust programming language is rapidly gaining popularity for building reliable
and secure systems due to its security guarantees and outstanding performance.
To provide extra functionalities, the Rust compiler introduces Rust unstable
features (RUF) to extend compiler functionality, syntax, and standard library
support. However, these features are unstable and may get removed, introducing
compilation failures to dependent packages. Even worse, their impacts propagate
through transitive dependencies, causing large-scale failures in the whole
ecosystem. Although RUF are widely used in Rust, previous research has primarily
concentrated on Rust code safety, with the usage and impacts of RUF from the
Rust compiler remaining unexplored. Therefore, we aim to bridge this gap by
systematically analyzing the RUF usage and impacts in the Rust ecosystem. We
propose novel techniques for precisely extracting RUF and, to quantitatively
assess their impact on the entire ecosystem, we accurately resolve package
dependencies. We have analyzed the whole Rust ecosystem with 590K package
versions and 140M transitive dependencies. Our study shows that the Rust
ecosystem uses 1,000 different RUF, and at most 44% of package versions are
affected by RUF, causing compilation failures for at most 12% of them. To
mitigate the wide impact of RUF, we further design and implement a
RUF-compilation-failure recovery tool that can recover up to 90% of the
failures. We believe our techniques, findings, and tools can help to stabilize
the Rust compiler, ultimately enhancing the security and reliability of the
Rust ecosystem.
Comment: Published at ICSE 2024:
https://conf.researchr.org/details/icse-2024/icse-2024-research-track/6/Demystifying-Compiler-Unstable-Feature-Usage-and-Impacts-in-the-Rust-Ecosystem.
Project website: https://sites.google.com/view/ruf-study/home. Released
source code on Zenodo: https://zenodo.org/records/828937
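The paper's precise extraction technique is not reproduced here, but the basic idea can be illustrated: RUF are enabled through crate-level `#![feature(...)]` attributes in Rust source. A minimal, hypothetical sketch in Python (the study's actual extractor is more precise, e.g. it handles conditional compilation):

```python
import re

# Rust unstable features (RUF) are declared with crate-level attributes of the
# form `#![feature(name1, name2)]`. This toy extractor scans source text for
# such attributes and collects the feature names.
FEATURE_ATTR = re.compile(r"#!\[feature\(([^)]*)\)\]")

def extract_ruf(source: str) -> set:
    """Return the set of unstable feature names declared in a Rust file."""
    features = set()
    for match in FEATURE_ATTR.finditer(source):
        for name in match.group(1).split(","):
            name = name.strip()
            if name:
                features.add(name)
    return features

example = """
#![feature(box_patterns, never_type)]
#![feature(specialization)]
fn main() {}
"""
print(sorted(extract_ruf(example)))  # ['box_patterns', 'never_type', 'specialization']
```

Aggregating such per-package feature sets over resolved dependency graphs is what allows impact to be measured ecosystem-wide.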
Analyzing Issue Report and Technical Dependency Management in Open-Source Software Projects
Modern software development relies on open-source software to facilitate reuse and reduce redundant work. Software developers use open-source packages in their projects without having insight into how these components are being developed and maintained. The aim of this thesis is to develop approaches for analyzing issue and dependency management in software projects. Software projects organize their work with issue trackers, tools for tracking issues such as development tasks, bug reports, and feature requests. By analyzing issue handling in more than 4,000 open-source projects, we found that many issues are left open for long periods of time, which can result in bugs and vulnerabilities not being fixed in a timely manner. This thesis proposes a method for predicting the amount of time it takes to resolve an issue by using the historical data available in issue trackers. Methods for predicting issue lifetime can help software project managers to prioritize issues and allocate resources accordingly. Another problem studied in this thesis is how software dependencies are used. Software developers often include third-party open-source software packages in their project code as dependencies. The included dependencies can in turn have their own dependencies, so a complex network of dependency relationships exists among open-source software packages. This thesis analyzes the structure and evolution of the dependency networks of three popular programming languages. We propose an approach to measure the growth and evolution of dependency networks. This thesis demonstrates that dependency network analysis can quantify the likelihood of acquiring vulnerabilities through software packages and how that likelihood changes over time.
The approaches and findings developed here could help to bring transparency into open-source projects with respect to how issues are handled and how dependencies are updated.
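The thesis's actual prediction models are not detailed in this abstract; a deliberately simple sketch of the underlying idea, predicting a new issue's resolution time from the historical record in an issue tracker (issue fields and values below are hypothetical):

```python
from statistics import median

# Toy illustration of issue-lifetime prediction: predict the median
# resolution time of past issues that share a label with the new issue,
# falling back to the global median when no labelled history matches.
def predict_resolution_days(history, labels):
    """history: list of (labels, resolution_days) tuples for closed issues."""
    relevant = [days for issue_labels, days in history
                if set(issue_labels) & set(labels)]
    if not relevant:  # no overlapping labels: fall back to all closed issues
        relevant = [days for _, days in history]
    return median(relevant)

closed_issues = [
    (["bug"], 3), (["bug", "ui"], 7), (["feature"], 30),
    (["docs"], 1), (["feature"], 40),
]
print(predict_resolution_days(closed_issues, ["bug"]))       # 5
print(predict_resolution_days(closed_issues, ["security"]))  # 7
```

Even a baseline like this gives project managers a rough number to prioritize against; the thesis's learned models refine it with richer tracker features.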
Untriviality of Trivial Packages
Nowadays, developing software would be unthinkable without the use of third-party packages. Although such code reuse helps to achieve rapid continuous delivery of software to end-users, blindly reusing code has its pitfalls. Prior work investigated the rationale for using packages of small size, known as trivial packages, that implement simple functionality. This prior work showed that, although these trivial packages are simple, they are popular and prevalent in the npm ecosystem. This popularity and prevalence of trivial packages piqued our interest in questioning, first, the 'triviality' of these packages and, second, the impact of using these packages on the quality of the client software applications.
To better understand the 'triviality' of trivial packages and their impact, in this thesis we report on two large-scale empirical studies. In both studies, we mine a large set of JavaScript applications that use trivial npm packages. In the first study, we evaluate the 'triviality' of these packages from two complementary points of view: application usage and ecosystem usage. Our results show that trivial packages are used in important JavaScript files, as measured by their 'centrality' in software applications. Additionally, by analyzing all external package API calls in these JavaScript files, we find that a high percentage of these API calls are attributed to trivial packages. Therefore, these packages play a significant role in JavaScript files. Furthermore, in the package dependency network, we observe that 16.8% of packages are trivial and that, in some cases, removing a trivial package can break approximately 30% of the packages in the ecosystem. In the second study, we start by examining the circumstances under which developers incorporate trivial packages into software applications. We analyze and classify commits that introduce trivial packages into software applications. We notice that developers resort to trivial packages while performing a wide range of development tasks, mostly related to 'Building' and 'Refactoring'. We empirically evaluate the bugginess of the files and applications that use trivial packages. Our results show that JavaScript files and applications that use trivial packages tend to have a higher percentage of bug-fixing commits than files and applications that do not depend on trivial packages.
Overall, the findings of our thesis indicate that, although smaller in size and complexity, trivial packages are highly depended-upon packages. While these packages may be trivial in terms of size, their utility in software applications suggests that their role is not so trivial.
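The finding that removing one trivial package can break roughly 30% of the ecosystem rests on counting transitive dependents in the dependency network. A minimal sketch of that computation, over a small hypothetical network (package names are illustrative only):

```python
from collections import deque

# Estimate the "blast radius" of removing a package: every package that
# directly or transitively depends on it would break.
def transitive_dependents(depends_on, target):
    """depends_on maps each package to the list of its direct dependencies."""
    # Invert the edges: dependency -> set of packages that depend on it.
    dependents = {}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(pkg)
    # Breadth-first search over the reversed edges from the target package.
    seen, queue = set(), deque([target])
    while queue:
        for pkg in dependents.get(queue.popleft(), ()):
            if pkg not in seen:
                seen.add(pkg)
                queue.append(pkg)
    return seen

network = {
    "app": ["left-pad", "framework"],
    "framework": ["left-pad"],
    "cli": ["framework"],
    "left-pad": [],
}
broken = transitive_dependents(network, "left-pad")
print(sorted(broken))  # ['app', 'cli', 'framework']
```

Here removing the trivial `left-pad` breaks three of the four packages; the thesis applies the same idea at npm scale.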
Software tools for conducting real-time information processing and visualization in industry: an up-to-date review
The processing of information in real time (through the processing of complex events) has become an essential task for the optimal functioning of manufacturing plants. Only in this way can artificial intelligence, data extraction, and even business intelligence techniques be applied, and the data produced daily be used in a beneficial way, enhancing automation processes and improving service delivery. Therefore, professionals and researchers need a wide range of tools to efficiently extract, transform, and load data in real time. Ideally, the same tool should also support, or at least facilitate, visualizing these data intuitively and interactively. The review presented in this document aims to provide an up-to-date overview of the various tools available to perform these tasks. For each selected tool, a brief description of how it works, as well as the advantages and disadvantages of its use, is presented. Furthermore, a critical analysis of overall operation and performance is presented. Finally, a hybrid architecture that aims to synergize all tools and technologies is presented and discussed.
This work is funded by “FCT—Fundação para a Ciência e Tecnologia” within the R&D
Units Project Scope: UIDB/00319/2020. The grants of R.S., R.M., A.M., and N.L. are supported by
the European Structural and Investment Funds in the FEDER component, through the Operational
Competitiveness and Internalization Programme (COMPETE 2020). [Project n. 039479. Funding
Reference: POCI-01-0247-FEDER-039479]
Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement
Deep learning (DL) package supply chains (SCs) are critical for DL frameworks
to remain competitive. However, vital knowledge on the nature of DL package SCs
is still lacking. In this paper, we explore the domains, clusters, and
disengagement of packages in two representative PyPI DL package SCs to bridge
this knowledge gap. We analyze the metadata of nearly six million PyPI package
distributions and construct version-sensitive SCs for two popular DL
frameworks: TensorFlow and PyTorch. We find that popular packages (measured by
the number of monthly downloads) in the two SCs cover 34 domains belonging to
eight categories. Applications, Infrastructure, and Sciences categories account
for over 85% of popular packages in either SC, and the TensorFlow and PyTorch
SCs have developed specializations in Infrastructure and Applications packages,
respectively. We employ the Leiden community detection algorithm and detect 131
and 100 clusters in the two SCs. The clusters mainly exhibit four shapes:
Arrow, Star, Tree, and Forest with increasing dependency complexity. Most
clusters are Arrow or Star, but Tree and Forest clusters account for most
packages (Tensorflow SC: 70%, PyTorch SC: 90%). We identify three groups of
reasons why packages disengage from the SC (i.e., remove the DL framework and
its dependents from their installation dependencies): dependency issues,
functional improvements, and ease of installation. The most common
disengagement reasons in the two SCs differ. Our study provides rich
implications for the maintenance and dependency management practices of PyPI
DL SCs.
Comment: Manuscript submitted to ACM Transactions on Software Engineering and
Methodology.
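The paper's notion of disengagement, a release that removes the DL framework (and thus its dependents) from a package's installation dependencies, can be detected by comparing consecutive releases' dependency sets. A sketch with hypothetical release data (the study mines real PyPI metadata):

```python
# Flag "disengagement" releases: versions whose install dependencies drop a
# previously declared DL framework compared to the immediately prior release.
def disengagement_releases(releases, framework):
    """releases: list of (version, set_of_install_deps), oldest first."""
    flagged = []
    for (v_prev, deps_prev), (v_next, deps_next) in zip(releases, releases[1:]):
        if framework in deps_prev and framework not in deps_next:
            flagged.append(v_next)
    return flagged

history = [
    ("1.0", {"numpy", "tensorflow"}),
    ("1.1", {"numpy", "tensorflow"}),
    ("2.0", {"numpy"}),            # TensorFlow dropped in this release
    ("2.1", {"numpy", "requests"}),
]
print(disengagement_releases(history, "tensorflow"))  # ['2.0']
```

Classifying the commit messages and changelogs around such flagged releases is what yields the three groups of disengagement reasons reported above.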