22,002 research outputs found
A Quantitative Study of Java Software Buildability
Researchers, students and practitioners often encounter a situation when the
build process of a third-party software system fails. In this paper, we aim to
confirm this observation present mainly as anecdotal evidence so far. Using a
virtual environment simulating a programmer's one, we try to fully
automatically build target archives from the source code of over 7,200 open
source Java projects. We found that more than 38% of builds ended in failure.
Build log analysis reveals the largest portion of errors are
dependency-related. We also conduct an association study of factors affecting
build success
An Introduction to Programming for Bioscientists: A Python-based Primer
Computing has revolutionized the biological sciences over the past several
decades, such that virtually all contemporary research in the biosciences
utilizes computer programs. The computational advances have come on many
fronts, spurred by fundamental developments in hardware, software, and
algorithms. These advances have influenced, and even engendered, a phenomenal
array of bioscience fields, including molecular evolution and bioinformatics;
genome-, proteome-, transcriptome- and metabolome-wide experimental studies;
structural genomics; and atomistic simulations of cellular-scale molecular
assemblies as large as ribosomes and intact viruses. In short, much of
post-genomic biology is increasingly becoming a form of computational biology.
The ability to design and write computer programs is among the most
indispensable skills that a modern researcher can cultivate. Python has become
a popular programming language in the biosciences, largely because (i) its
straightforward semantics and clean syntax make it a readily accessible first
language; (ii) it is expressive and well-suited to object-oriented programming,
as well as other modern paradigms; and (iii) the many available libraries and
third-party toolkits extend the functionality of the core language into
virtually every biological domain (sequence and structure analyses,
phylogenomics, workflow management systems, etc.). This primer offers a basic
introduction to coding, via Python, and it includes concrete examples and
exercises to illustrate the language's usage and capabilities; the main text
culminates with a final project in structural bioinformatics. A suite of
Supplemental Chapters is also provided. Starting with basic concepts, such as
that of a 'variable', the Chapters methodically advance the reader to the point
of writing a graphical user interface to compute the Hamming distance between
two DNA sequences.Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables,
numerous exercises, and 19 pages of Supporting Information; currently in
press at PLOS Computational Biolog
Identifying Bugs in Make and JVM-Oriented Builds
Incremental and parallel builds are crucial features of modern build systems.
Parallelism enables fast builds by running independent tasks simultaneously,
while incrementality saves time and computing resources by processing the build
operations that were affected by a particular code change. Writing build
definitions that lead to error-free incremental and parallel builds is a
challenging task. This is mainly because developers are often unable to predict
the effects of build operations on the file system and how different build
operations interact with each other. Faulty build scripts may seriously degrade
the reliability of automated builds, as they cause build failures, and
non-deterministic and incorrect build results.
To reason about arbitrary build executions, we present buildfs, a
generally-applicable model that takes into account the specification (as
declared in build scripts) and the actual behavior (low-level file system
operation) of build operations. We then formally define different types of
faults related to incremental and parallel builds in terms of the conditions
under which a file system operation violates the specification of a build
operation. Our testing approach, which relies on the proposed model, analyzes
the execution of single full build, translates it into buildfs, and uncovers
faults by checking for corresponding violations.
We evaluate the effectiveness, efficiency, and applicability of our approach
by examining hundreds of Make and Gradle projects. Notably, our method is the
first to handle Java-oriented build systems. The results indicate that our
approach is (1) able to uncover several important issues (245 issues found in
45 open-source projects have been confirmed and fixed by the upstream
developers), and (2) orders of magnitude faster than a state-of-the-art tool
for Make builds
Exploring the Duality between Product and Organizational Architectures: A Test of the Mirroring Hypothesis
A variety of academic studies argue that a relationship exists between the structure of an organization and the design of the products that this organization produces. Specifically, products tend to "mirror" the architectures of the organizations in which they are developed. This dynamic occurs because the organization's governance structures, problem solving routines and communication patterns constrain the space in which it searches for new solutions. Such a relationship is important, given that product architecture has been shown to be an important predictor of product performance, product variety, process flexibility and even the path of industry evolution. We explore this relationship in the software industry. Our research takes advantage of a natural experiment, in that we observe products that fulfill the same function being developed by very different organizational forms. At one extreme are commercial software firms, in which the organizational participants are tightly-coupled, with respect to their goals, structure and behavior. At the other, are open source software communities, in which the participants are much more loosely-coupled by comparison. The mirroring hypothesis predicts that these different organizational forms will produce products with distinctly different architectures. Specifically, loosely-coupled organizations will develop more modular designs than tightly-coupled organizations. We test this hypothesis, using a sample of matched-pair products. We find strong evidence to support the mirroring hypothesis. In all of the pairs we examine, the product developed by the loosely-coupled organization is significantly more modular than the product from the tightly-coupled organization. We measure modularity by capturing the level of coupling between a product's components. The magnitude of the differences is substantial - up to a factor of eight, in terms of the potential for a design change in one component to propagate to others. Our results have significant managerial implications, in highlighting the impact of organizational design decisions on the technical structure of the artifacts that these organizations subsequently develop.Organizational Design, Product Design, Architecture, Modularity, Open-Source Software.
GenomeVIP: A cloud platform for genomic variant discovery and interpretation
Identifying genomic variants is a fundamental first step toward the understanding of the role of inherited and acquired variation in disease. The accelerating growth in the corpus of sequencing data that underpins such analysis is making the data-download bottleneck more evident, placing substantial burdens on the research community to keep pace. As a result, the search for alternative approaches to the traditional âdownload and analyzeâ paradigm on local computing resources has led to a rapidly growing demand for cloud-computing solutions for genomics analysis. Here, we introduce the Genome Variant Investigation Platform (GenomeVIP), an open-source framework for performing genomics variant discovery and annotation using cloud- or local high-performance computing infrastructure. GenomeVIP orchestrates the analysis of whole-genome and exome sequence data using a set of robust and popular task-specific tools, including VarScan, GATK, Pindel, BreakDancer, Strelka, and Genome STRiP, through a web interface. GenomeVIP has been used for genomic analysis in large-data projects such as the TCGA PanCanAtlas and in other projects, such as the ICGC Pilots, CPTAC, ICGC-TCGA DREAM Challenges, and the 1000 Genomes SV Project. Here, we demonstrate GenomeVIP's ability to provide high-confidence annotated somatic, germline, and de novo variants of potential biological significance using publicly available data sets.</jats:p
- âŠ