A Historical Perspective on Runtime Assertion Checking in Software Development
This report presents initial results in the area of software testing and analysis produced as part of the Software Engineering Impact Project. The report describes the historical development of runtime assertion checking, including the origins of and significant features associated with assertion checking mechanisms, and initial findings about current industrial use. A future report will provide a more comprehensive assessment of development practice, for which we invite readers of this report to contribute information.
Achieving Cost-Effective Software Reliability Through Self-Healing
Heterogeneity, mobility, complexity and new application domains raise new software reliability issues that cannot be met cost-effectively only with classic software engineering approaches. Self-healing systems can successfully address these problems, thus increasing software reliability while reducing maintenance costs. Self-healing systems must be able to automatically identify runtime failures, locate faults, and find a way to bring the system back to an acceptable behavior. This paper discusses the challenges underlying the construction of self-healing systems with particular focus on functional failures, and presents a set of techniques to build software systems that can automatically heal such failures. It introduces techniques to automatically derive assertions to effectively detect functional failures, locate the faults underlying the failures, and identify sequences of actions alternative to the failing sequence to bring the system back to an acceptable behavior.
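The detect-locate-heal loop the abstract describes can be made concrete with a minimal sketch, assuming a hypothetical bounded buffer whose automatically derived invariant is violated by a faulty push; all names here (Buffer, healed_push, rotate_then_push) are invented for illustration and do not come from the paper:

```python
# Minimal sketch of a self-healing loop: detect a functional failure via a
# derived assertion, roll back, and retry an alternative action sequence.

class Buffer:
    def __init__(self, capacity):
        self.items, self.capacity = [], capacity

    def push(self, x):
        self.items.append(x)

def invariant_holds(buf):
    # Hypothetical automatically derived assertion: size never exceeds capacity.
    return len(buf.items) <= buf.capacity

def push_item(buf, x):
    buf.push(x)  # faulty sequence: no capacity check

def rotate_then_push(buf, x):
    # Alternative action sequence: drop the oldest item before pushing.
    if len(buf.items) >= buf.capacity:
        buf.items.pop(0)
    buf.push(x)

def healed_push(buf, x):
    push_item(buf, x)
    if not invariant_holds(buf):      # detect the functional failure
        buf.items.pop()               # roll back the failing action
        rotate_then_push(buf, x)      # retry with an equivalent sequence
        assert invariant_holds(buf)

buf = Buffer(capacity=2)
for x in (1, 2, 3):
    healed_push(buf, x)
print(buf.items)   # [2, 3]: the third push was detected as failing and healed
```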
Automatic Repair of Buggy If Conditions and Missing Preconditions with SMT
We present Nopol, an approach for automatically repairing buggy if conditions and missing preconditions. As input, it takes a program and a test suite which contains passing test cases modeling the expected behavior of the program and at least one failing test case embodying the bug to be repaired. It consists of collecting data from multiple instrumented test-suite executions, transforming this data into a Satisfiability Modulo Theories (SMT) problem, and translating the SMT result, if one exists, into a source code patch. Nopol repairs object-oriented code and allows the patches to contain nullness checks as well as specific method calls. (CSTVA 2014, India)
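The condition-synthesis step can be illustrated with a hedged sketch using the Z3 SMT solver's Python bindings. The observed values and expected branch outcomes below are invented stand-ins for what Nopol gathers from instrumented test-suite executions, and the single-constant template `x <= c` is a simplification of its actual patch templates:

```python
# Hedged sketch of SMT-based condition synthesis (pip install z3-solver).
from z3 import Int, Solver, sat

# (variable value at the buggy if, expected branch outcome) per test execution.
observations = [(3, True), (7, True), (12, False), (15, False)]

c = Int("c")          # unknown constant of the candidate condition "x <= c"
solver = Solver()
for x, should_take_branch in observations:
    # Constrain the candidate condition to agree with every test execution.
    solver.add((x <= c) == should_take_branch)

if solver.check() == sat:
    # Translate the SMT model back into a source-level patch.
    print("patch: if (x <=", solver.model()[c], ")")   # e.g. x <= 7
```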
Automatically generated acceptance test: A software reliability experiment
This study presents results of a software reliability experiment investigating the feasibility of a new error detection method. The method can be used as an acceptance test and is solely based on empirical data about the behavior of internal states of a program. The experimental design uses the existing environment of a multi-version experiment previously conducted at the NASA Langley Research Center, in which the launch interceptor problem is used as a model. This allows the controlled experimental investigation of versions with well-known single and multiple faults, and the availability of an oracle permits the determination of the error detection performance of the test. Fault interaction phenomena are observed that have an amplifying effect on the number of error occurrences. Preliminary results indicate that all faults examined so far are detected by the acceptance test. This shows promise for further investigations, and for the employment of this test method on other applications.
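A minimal sketch of the underlying idea, assuming the method profiles internal-state values from reference runs and rejects executions that leave the learned envelope; the variables and bounds here are hypothetical, not taken from the experiment:

```python
# Acceptance test based purely on empirical internal-state data.

def learn_profile(reference_traces):
    """Record the min/max observed for each internal variable across good runs."""
    profile = {}
    for trace in reference_traces:           # trace: dict of variable -> value
        for var, value in trace.items():
            lo, hi = profile.get(var, (value, value))
            profile[var] = (min(lo, value), max(hi, value))
    return profile

def acceptance_test(profile, trace):
    """Reject the run if any internal state leaves its empirical envelope."""
    return all(lo <= trace[v] <= hi
               for v, (lo, hi) in profile.items() if v in trace)

good_runs = [{"angle": 0.3, "count": 5}, {"angle": 1.2, "count": 9}]
profile = learn_profile(good_runs)
print(acceptance_test(profile, {"angle": 0.9, "count": 7}))   # True: accepted
print(acceptance_test(profile, {"angle": 4.8, "count": 7}))   # False: error detected
```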
Search based software engineering: Trends, techniques and applications
In the past five years there has been a dramatic increase in work on Search-Based Software Engineering (SBSE), an approach to Software Engineering (SE) in which Search-Based Optimization (SBO) algorithms are used to address problems in SE. SBSE has been applied to problems throughout the SE lifecycle, from requirements and project planning to maintenance and reengineering. The approach is attractive because it offers a suite of adaptive automated and semi-automated solutions in situations typified by large, complex problem spaces with multiple competing and conflicting objectives.
This article provides a review and classification of literature on SBSE. The work identifies research trends and relationships between the techniques applied and the applications to which they have been applied, and highlights gaps in the literature and avenues for further research.
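As a hedged illustration of the SBSE recipe, a fitness function over SE artifacts plus a search algorithm, the sketch below hill-climbs toward a small test subset that preserves coverage. The suite and coverage data are invented, and real SBSE work typically uses richer algorithms such as genetic algorithms:

```python
# Plain hill climb searching for a minimal test subset with full coverage.
import random

coverage = {"t1": {1, 2}, "t2": {2, 3}, "t3": {1, 2, 3}, "t4": {4}}
required = {1, 2, 3, 4}

def fitness(subset):
    covered = set().union(*(coverage[t] for t in subset)) if subset else set()
    # Lexicographic objectives: full coverage first, then fewer tests.
    return (covered == required, -len(subset))

current = set(coverage)                        # start from the whole suite
for _ in range(200):                           # search-based optimization loop
    neighbor = set(current)
    t = random.choice(sorted(coverage))
    neighbor.symmetric_difference_update({t})  # flip one test in or out
    if fitness(neighbor) >= fitness(current):
        current = neighbor

print(sorted(current))                         # e.g. ['t3', 't4']
```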
Identifying Bugs in Make and JVM-Oriented Builds
Incremental and parallel builds are crucial features of modern build systems. Parallelism enables fast builds by running independent tasks simultaneously, while incrementality saves time and computing resources by processing only the build operations affected by a particular code change. Writing build definitions that lead to error-free incremental and parallel builds is a challenging task, mainly because developers are often unable to predict the effects of build operations on the file system and how different build operations interact with each other. Faulty build scripts may seriously degrade the reliability of automated builds, as they cause build failures and non-deterministic, incorrect build results.

To reason about arbitrary build executions, we present buildfs, a generally applicable model that takes into account both the specification (as declared in build scripts) and the actual behavior (low-level file system operations) of build operations. We then formally define different types of faults related to incremental and parallel builds in terms of the conditions under which a file system operation violates the specification of a build operation. Our testing approach, which relies on the proposed model, analyzes the execution of a single full build, translates it into buildfs, and uncovers faults by checking for the corresponding violations.

We evaluate the effectiveness, efficiency, and applicability of our approach by examining hundreds of Make and Gradle projects. Notably, our method is the first to handle Java-oriented build systems. The results indicate that our approach is (1) able to uncover several important issues (245 issues found in 45 open-source projects have been confirmed and fixed by the upstream developers), and (2) orders of magnitude faster than a state-of-the-art tool for Make builds.
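A minimal sketch of the buildfs idea, assuming each task's declared inputs/outputs and a trace of its actual file-system operations are available; the tasks and paths below are invented, and the real tool works from system-call traces of full builds:

```python
# Compare declared inputs/outputs of each build task against its traced
# file-system operations and report specification violations.

tasks = {
    "compile": {"declared_in": {"main.c"}, "declared_out": {"main.o"},
                "trace": [("read", "main.c"), ("read", "config.h"),
                          ("write", "main.o")]},
    "docs":    {"declared_in": set(), "declared_out": {"api.html"},
                "trace": [("write", "main.o"), ("write", "api.html")]},
}

for name, t in tasks.items():
    for op, path in t["trace"]:
        # Missing input: a read the script never declared, so editing that
        # file will not retrigger the task in an incremental build.
        if op == "read" and path not in t["declared_in"]:
            print(f"{name}: undeclared input {path}")
        # Undeclared output: a write outside the declared set, which can
        # race with other tasks in a parallel build.
        if op == "write" and path not in t["declared_out"]:
            print(f"{name}: undeclared output {path}")
```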
TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
Large language models (LLMs) with hundreds of billions or trillions of parameters, represented by ChatGPT, have achieved a profound impact on various fields. However, training LLMs with super-large-scale parameters requires large high-performance GPU clusters and long training periods lasting for months. Due to the inevitable hardware and software failures in large-scale clusters, maintaining uninterrupted, long-duration training is extremely challenging. As a result, a substantial amount of training time is devoted to checkpoint saving and loading, task rescheduling and restarts, and manual anomaly checks, which greatly harms overall training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we design three key subsystems: the training-pipeline automatic fault tolerance and recovery mechanism named Transom Operator and Launcher (TOL), the multi-dimensional-metric automatic anomaly detection system named Transom Eagle Eye (TEE), and the asynchronous checkpoint access, fault tolerance, and recovery technology named Transom Checkpoint Engine (TCE). Here, TOL manages the lifecycle of training tasks, while TEE is responsible for task monitoring and anomaly reporting: TEE detects training anomalies and reports them to TOL, which automatically applies the fault-tolerance strategy to eliminate abnormal nodes and restart the training task. The asynchronous checkpoint saving and loading functionality provided by TCE greatly shortens the fault-tolerance overhead. Experimental results indicate that TRANSOM significantly enhances the efficiency of large-scale LLM training on clusters. Specifically, the pre-training time for GPT3-175B has been reduced by 28%, while checkpoint saving and loading performance have improved by a factor of 20.
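The asynchronous-checkpointing idea behind TCE can be sketched in a hedged, single-process form: the training loop snapshots its state in memory and delegates the slow disk write to a background thread. The names (save_async, model_state) are hypothetical, and the real system targets distributed GPU clusters rather than this toy:

```python
# Asynchronous checkpoint saving: snapshot on the hot path, write off it.
import copy, pickle, threading

def save_async(state, path):
    snapshot = copy.deepcopy(state)          # cheap in-memory copy on the hot path
    def _write():
        with open(path, "wb") as f:          # slow I/O happens off the hot path
            pickle.dump(snapshot, f)
    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t

model_state = {"step": 0, "weights": [0.0] * 1000}
threads = []
for step in range(1, 101):
    model_state["step"] = step               # stand-in for a real training step
    if step % 25 == 0:
        threads.append(save_async(model_state, f"ckpt_{step}.pkl"))
for t in threads:
    t.join()                                 # drain pending writes at shutdown
print("training continued while checkpoints were written")
```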