Where Are My Intelligent Assistant's Mistakes? A Systematic Testing Approach
Intelligent assistants are handling increasingly critical tasks, but until now, end users have had no way to systematically assess where their assistants make mistakes. For some intelligent assistants, this is a serious problem: if the assistant is doing work that is important, such as assisting with qualitative research or monitoring an elderly parent's safety, the user may pay a high cost for unnoticed mistakes. This paper addresses the problem with WYSIWYT/ML (What You See Is What You Test for Machine Learning), a human/computer partnership that enables end users to systematically test intelligent assistants. Our empirical evaluation shows that WYSIWYT/ML helped end users find assistants' mistakes significantly more effectively than ad hoc testing. Not only did it allow users to assess an assistant's work on an average of 117 predictions in only 10 minutes, it also scaled to a much larger data set, assessing an assistant's work on 623 out of 1,448 predictions using only the users' original 10 minutes' testing effort.
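The paper itself includes no code, but the core idea of a human/computer testing partnership can be illustrated in miniature: spend the user's limited testing budget on the predictions the assistant is least sure about. Below is a minimal sketch, assuming a scikit-learn-style classifier and synthetic data; the budget of 117 mirrors the abstract's figure, but everything else is illustrative and not the WYSIWYT/ML implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for an assistant's classification task.
X = rng.normal(size=(1500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.5, size=1500) > 0).astype(int)

assistant = LogisticRegression().fit(X[:500], y[:500])

# Score the remaining predictions by confidence (max class probability);
# the least confident ones are routed to the user first.
probs = assistant.predict_proba(X[500:])
confidence = probs.max(axis=1)
review_order = np.argsort(confidence)   # lowest confidence first

budget = 117                            # roughly what users assessed in 10 minutes
to_review = review_order[:budget]
print(f"least confident prediction: {confidence[to_review[0]]:.2f}")
```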
Towards Automated Performance Bug Identification in Python
Context: Software performance is a critical non-functional requirement, appearing in many fields such as mission-critical applications, financial systems, and real-time systems. In this work we focused on early detection of performance bugs; our software under study was a real-time system used in the advertisement/marketing domain.
Goal: Find a simple, easy-to-implement solution for predicting performance bugs.
Method: We built several models using four machine learning methods commonly used for defect prediction: C4.5 decision trees, Naïve Bayes, Bayesian networks, and logistic regression.
Results: Our empirical results show that a C4.5 model, using lines of code changed, file age, and file size as explanatory variables, can be used to predict performance bugs (recall = 0.73, accuracy = 0.85, precision = 0.96). We show that reducing the number of changes delivered in a commit can decrease the chance of performance bug injection.
Conclusions: We believe that our approach can help practitioners eliminate performance bugs early in the development cycle. Our results are also of interest to theoreticians, establishing a link between functional bugs and (non-functional) performance bugs, and explicitly showing that attributes used for prediction of functional bugs can also be used for prediction of performance bugs.
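The modeling recipe above is easy to prototype. A minimal sketch, assuming scikit-learn and fabricated commit data: scikit-learn ships no C4.5 implementation, so a CART decision tree with the entropy criterion serves as a common stand-in, trained on the paper's three explanatory variables. The data, risk function, and resulting scores here are invented, not the paper's.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score, accuracy_score, precision_score

rng = np.random.default_rng(1)
n = 2000

# Fabricated commit-level features mirroring the paper's explanatory
# variables: lines of code changed, file age (days), file size (LOC).
loc_changed = rng.exponential(scale=80, size=n)
file_age = rng.exponential(scale=400, size=n)
file_size = rng.exponential(scale=1500, size=n)
X = np.column_stack([loc_changed, file_age, file_size])

# Fabricated label: bigger changes to bigger, older files are buggier.
risk = 0.004 * loc_changed + 0.0005 * file_age + 0.0004 * file_size
y = (rng.random(n) < 1 / (1 + np.exp(3 - risk))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# CART with entropy criterion as a stand-in for C4.5.
model = DecisionTreeClassifier(criterion="entropy", max_depth=5).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"recall={recall_score(y_te, pred):.2f} "
      f"accuracy={accuracy_score(y_te, pred):.2f} "
      f"precision={precision_score(y_te, pred):.2f}")
```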
Untangling Fine-Grained Code Changes
After working for some time, developers commit their code changes to a version control system. When doing so, they often bundle unrelated changes (e.g., a bug fix and a refactoring) in a single commit, thus creating a so-called tangled commit. Sharing tangled commits is problematic because it makes review, reversion, and integration of these commits harder and historical analyses of the project less reliable. Researchers have worked on untangling existing commits, i.e., finding which part of a commit relates to which task. In this paper, we contribute to this line of work in two ways: (1) a publicly available dataset of untangled code changes, created with the help of two developers who accurately split their code changes into self-contained tasks over a period of four months; (2) a novel approach, EpiceaUntangler, to help developers share untangled commits (also known as atomic commits) by using fine-grained code change information. EpiceaUntangler is based on and tested with the publicly available dataset, and further evaluated by deploying it to 7 developers, who used it for 2 weeks. We recorded a median success rate of 91% and an average of 75% in automatically creating clusters of untangled fine-grained code changes.
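The abstract does not spell out EpiceaUntangler's algorithm, so the following is only a rough illustration of the general task: grouping fine-grained change events into task clusters. It is a sketch using agglomerative clustering over hand-picked, hypothetical features (timestamp, file, touched entity); the real approach works on Epicea's fine-grained change model.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each row is one fine-grained change event with illustrative features:
# (minutes since session start, file id, id of the entity touched).
changes = np.array([
    [1,  0, 10], [2,  0, 10], [3,  0, 11],   # early edits in file 0
    [40, 1, 20], [42, 1, 21], [44, 1, 20],   # later edits in file 1
    [90, 0, 10],                             # a late fix back in file 0
], dtype=float)

# Hypothetical weighting: favor temporal proximity and file locality.
scaled = changes / np.array([10.0, 1.0, 5.0])
labels = AgglomerativeClustering(n_clusters=3).fit_predict(scaled)
for event, label in zip(changes.astype(int).tolist(), labels):
    print(event, "-> task", label)
```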
Combining hardware and software instrumentation to classify program executions
Several research efforts have studied ways to infer properties of software systems from program spectra gathered from the running systems, usually with software-level instrumentation. While these efforts appear to produce accurate classifications, a detailed understanding of their costs and potential cost-benefit tradeoffs is lacking. In this work we present a hybrid instrumentation approach which uses hardware performance counters to gather program spectra at very low cost. This underlying data is further augmented with data captured by minimal amounts of software-level instrumentation. We also evaluate this hybrid approach by comparing it to other existing approaches. We conclude that these hybrid spectra can reliably distinguish failed executions from successful executions at a fraction of the runtime overhead cost of using software-based execution data.
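The classification half of this pipeline (as opposed to the counter collection itself) can be sketched in a few lines. Assuming per-execution vectors of hardware counter readings such as one might obtain from `perf stat`, plus one cheap software-level signal, a standard classifier separates failing from passing runs. All counters, data, and the failure rule below are fabricated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 600

# Fabricated per-execution spectra: hardware counter readings
# (instructions retired, branch misses, cache misses) plus one
# software-level probe count from minimal instrumentation.
instructions = rng.normal(1e9, 1e8, size=n)
branch_miss = rng.normal(5e6, 1e6, size=n)
cache_miss = rng.normal(2e6, 5e5, size=n)
sw_probe_hits = rng.poisson(20, size=n).astype(float)

# Fabricated ground truth: failures correlate with anomalous miss rates.
y = (branch_miss + 2 * cache_miss + rng.normal(0, 2e6, size=n) > 1.05e7).astype(int)

X = np.column_stack([instructions, branch_miss, cache_miss, sw_probe_hits])
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```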
Predictive Analytics and Software Defect Severity: A Systematic Review and Future Directions
Software testing identifies defects in software products; these defects have varying multiplying effects depending on their severity levels and call for prompt rectification, which explains the volume of research on the topic in the software engineering domain. In this paper, a systematic literature review (SLR) of machine learning-based software defect severity prediction over the last decade was conducted. The SLR aimed to detect germane areas central to efficient predictive analytics that are seldom captured in existing software defect severity prediction reviews. These areas include the analysis of techniques or approaches that significantly influence the threats to validity of proposed models, and bias-variance tradeoff considerations in data science-based approaches. A population, intervention, and outcome model was adopted to formulate search terms during the literature selection process, and subsequent quality-assurance scrutiny yielded fifty-two primary studies. A thorough systematic review of the selected studies was then conducted to answer eleven main research questions, uncovering approaches that address the aforementioned areas of interest. The results indicate that while the machine learning approach is ubiquitous for predicting software defect severity, germane techniques central to better predictive analytics are infrequent in the literature. The study concludes by summarizing prominent trends in a mind map to stimulate future research in the software engineering industry.
Is One Hyperparameter Optimizer Enough?
Hyperparameter tuning is the black art of automatically finding a good combination of control parameters for a data miner. While widely applied in empirical software engineering, there has not been much discussion on which hyperparameter tuner is best for software analytics. To address this gap in the literature, this paper applied a range of hyperparameter optimizers (grid search, random search, differential evolution, and Bayesian optimization) to the defect prediction problem. Surprisingly, no hyperparameter optimizer was observed to be "best" and, for one of the two evaluation measures studied here (F-measure), hyperparameter optimization was no better than using default configurations in 50% of cases. We conclude that hyperparameter optimization is more nuanced than previously believed. While such optimization can certainly lead to large improvements in the performance of classifiers used in software analytics, it remains to be seen which specific optimizers should be applied to a new dataset.
Software Verification and Graph Similarity for Automated Evaluation of Students' Assignments
In this paper we promote introducing software verification and control flow graph similarity measurement into the automated evaluation of students' programs. We present a new grading framework that merges results obtained by a combination of these two approaches with results obtained by automated testing, leading to improved quality and precision of automated grading. These two approaches are also useful in providing comprehensible feedback that can help students improve the quality of their programs. We also present our corresponding tools, which are publicly available and open source. The tools are based on the LLVM low-level intermediate code representation, so they can be applied to a number of programming languages. Experimental evaluation of the proposed grading framework was performed on a corpus of university students' programs written in the programming language C. Results of the experiments show that automatically generated grades are highly correlated with manually determined grades, suggesting that the presented tools can find real-world applications in studying and grading.
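The abstract leaves the similarity measure unspecified, so the following is only an illustrative sketch of what CFG similarity scoring can look like, using graph edit distance from networkx on two toy graphs; the paper's tools operate on LLVM IR and presumably define their own measure.

```python
import networkx as nx

# Two toy control flow graphs for small student submissions:
# nodes are basic blocks, edges are possible transfers of control.
g1 = nx.DiGraph([("entry", "cond"), ("cond", "then"), ("cond", "else"),
                 ("then", "exit"), ("else", "exit")])   # if/else shape
g2 = nx.DiGraph([("entry", "cond"), ("cond", "body"), ("body", "cond"),
                 ("cond", "exit")])                     # loop shape

# Graph edit distance, normalized into a [0, 1] similarity score.
ged = nx.graph_edit_distance(g1, g2)
max_size = max(g1.number_of_nodes() + g1.number_of_edges(),
               g2.number_of_nodes() + g2.number_of_edges())
print(f"similarity: {1 - ged / max_size:.2f}")
```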