Evaluating defect prediction approaches: a benchmark and an extensive comparison
Reliably predicting software defects is one of the holy grails of software engineering. Researchers have devised and implemented a plethora of defect/bug prediction approaches varying in terms of accuracy, complexity, and the input data they require. However, the absence of an established benchmark makes it hard, if not impossible, to compare approaches. We present a benchmark for defect prediction, in the form of a publicly available dataset consisting of several software systems, and provide an extensive comparison of well-known bug prediction approaches, together with novel approaches we devised. We evaluate the performance of the approaches using different performance indicators: the classification of entities as defect-prone or not, and the ranking of entities, with and without taking into account the effort needed to review an entity. We performed three sets of experiments aimed at (1) comparing the approaches across different systems, (2) testing whether the differences in performance are statistically significant, and (3) investigating the stability of approaches across different learners. Our results indicate that, while some approaches perform better than others in a statistically significant manner, external validity in defect prediction is still an open problem, as generalizing results to different contexts/learners proved to be a partially unsuccessful endeavor.
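The effort-aware ranking indicator mentioned in this abstract can be illustrated with a short sketch. This is a hypothetical illustration, not the benchmark's implementation; entity sizes and defect counts are invented, with lines of code standing in for review effort:

```python
def defects_found_at_effort(entities, effort_budget=0.2):
    """Effort-aware ranking evaluation: review entities in order of
    predicted defect density and report the fraction of actual defects
    found within a fixed fraction of the total review effort.

    Each entity is a tuple (size_in_loc, actual_defects, predicted_defects);
    size serves as the effort proxy."""
    total_effort = sum(size for size, _, _ in entities)
    total_defects = sum(actual for _, actual, _ in entities)
    # rank by predicted defect density, highest first
    ranked = sorted(entities, key=lambda e: e[2] / e[0], reverse=True)
    spent, found = 0, 0
    for size, actual, _ in ranked:
        if spent + size > effort_budget * total_effort:
            break
        spent += size
        found += actual
    return found / total_defects

# four hypothetical entities: (size, actual defects, predicted defects)
entities = [(100, 5, 4), (500, 2, 1), (50, 3, 3), (350, 0, 0)]
print(defects_found_at_effort(entities))  # 80% of defects within 20% of effort
```

A classification-based indicator would instead threshold the predictions; the ranking view rewards approaches that concentrate defects in small, cheap-to-review entities.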
On the feasibility of automated prediction of bug and non-bug issues
Context
Issue tracking systems are used to track and describe tasks in the development process, e.g., requested feature improvements or reported bugs. However, past research has shown that the reported issue types often do not match the description of the issue.
Objective
We want to understand the overall maturity of the state of the art of issue type prediction, with the goal of predicting whether issues are bugs, and to evaluate whether we can improve existing models by incorporating manually specified knowledge about issues.
Method
We train different models for the title and description of the issue to account for the difference in structure between these fields, e.g., the length. Moreover, we manually detect issues whose description contains a null pointer exception, as these are strong indicators that issues are bugs.
Results
Our approach performs best overall, but not significantly different from an approach from the literature based on the fastText classifier from Facebook AI Research. The small improvements in prediction performance are due to the structural information about the issues that we used. We found that using information about the content of issues in the form of null pointer exceptions is not useful. We demonstrate the usefulness of issue type prediction through the example of labelling bugfixing commits.
Conclusions
Issue type prediction can be a useful tool if the use case tolerates either a certain amount of missed bug reports or a certain number of issues wrongly predicted as bugs.
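The method described in this abstract, separate models per field combined with a manual null-pointer rule, might be sketched roughly as follows. The combination weights and function names are hypothetical, and note that the paper itself found the null-pointer signal added little:

```python
import re

# heuristic from the Method section: an NPE stack trace in the
# description is treated as a strong indicator of a bug
NPE_PATTERN = re.compile(r"NullPointerException")

def predict_bug_probability(p_title, p_desc, description,
                            w_title=0.4, w_desc=0.6):
    """Combine per-field bug probabilities from two separately trained
    models (one for the title, one for the longer description); the
    weights here are illustrative, not taken from the paper."""
    if NPE_PATTERN.search(description):
        return 1.0  # manual override for null pointer exceptions
    return w_title * p_title + w_desc * p_desc

print(predict_bug_probability(0.9, 0.8, "app is slow to start"))
print(predict_bug_probability(0.1, 0.2,
                              "java.lang.NullPointerException at Foo.bar"))
```

Training distinct models per field accounts for the structural difference between the short title and the longer, often log-laden description.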
Active Learning of Discriminative Subgraph Patterns for API Misuse Detection
A common cause of bugs and vulnerabilities is the violation of usage
constraints associated with Application Programming Interfaces (APIs). API
misuses are common in software projects, and while there have been techniques
proposed to detect such misuses, studies have shown that they fail to reliably
detect misuses while reporting many false positives. One limitation of prior
work is the inability to reliably identify correct patterns of usage. Many
approaches conflate a usage pattern's frequency with its correctness. Because
alternative usage patterns may be uncommon but correct, anomaly
detection-based techniques have had limited success in identifying misuses. We
address these challenges and propose ALP (Actively Learned Patterns),
reformulating API misuse detection as a classification problem. After
representing programs as graphs, ALP mines discriminative subgraphs. While
still incorporating frequency information, ALP uses limited human supervision
to reduce reliance on the assumption that frequent patterns are correct.
The principles of active learning are incorporated to shift human attention
away from the most frequent patterns. Instead, ALP samples informative and
representative examples while minimizing labeling effort. In our empirical
evaluation, ALP substantially outperforms prior approaches on both MUBench, an
API Misuse benchmark, and a new dataset that we constructed from real-world
software projects.
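The active-learning step that shifts human attention away from merely frequent patterns can be illustrated with a minimal uncertainty-sampling sketch. Pattern names and probabilities here are hypothetical, and the real ALP also balances representativeness against labeling effort:

```python
def select_query(patterns, prob):
    """Uncertainty sampling: query the human label for the subgraph
    pattern whose predicted misuse probability is closest to 0.5,
    i.e., the one the classifier is least confident about."""
    return min(patterns, key=lambda p: abs(prob[p] - 0.5))

# hypothetical mined patterns with current classifier probabilities
prob = {"iter_without_hasNext": 0.95,
        "lock_without_unlock": 0.52,
        "close_in_finally": 0.10}
print(select_query(list(prob), prob))  # -> lock_without_unlock
```

Selecting the least confident pattern, rather than the most frequent one, is what lets limited supervision correct the frequency-implies-correctness assumption.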
Extended Rate, more GFUN
We present a software package that guesses formulae for sequences of, for
example, rational numbers or rational functions, given the first few terms. We
implement an algorithm due to Bernhard Beckermann and George Labahn, together
with some enhancements to render our package efficient. Thus we extend and
complement Christian Krattenthaler's program Rate, the parts concerned with
guessing of Bruno Salvy and Paul Zimmermann's GFUN, the univariate case of
Manuel Kauers' Guess.m and Manuel Kauers' and Christoph Koutschan's
qGeneratingFunctions.m. Comment: 26 pages
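The kind of guessing such packages perform can be illustrated, in a much simplified form, by fitting and then verifying a linear recurrence with constant coefficients over exact rationals. This is a toy sketch; Rate, GFUN, and the Beckermann-Labahn algorithm handle far richer classes of formulae:

```python
from fractions import Fraction

def _solve(A, b):
    """Gaussian elimination over exact rationals; None if singular."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        inv = M[col][col]
        M[col] = [x / inv for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def guess_recurrence(terms, max_order=3):
    """Guess coefficients [c1, ..., ck] such that
    a(n) = c1*a(n-1) + ... + ck*a(n-k), solving on the first windows
    of the sequence and verifying on all remaining terms."""
    terms = [Fraction(t) for t in terms]
    for k in range(1, max_order + 1):
        if len(terms) < 2 * k + 1:
            break  # not enough terms to both solve and verify
        rows = [[terms[i + k - 1 - j] for j in range(k)]
                for i in range(len(terms) - k)]
        coeffs = _solve(rows[:k], [terms[i + k] for i in range(k)])
        if coeffs is None:
            continue
        if all(sum(c * r for c, r in zip(coeffs, rows[i])) == terms[i + k]
               for i in range(k, len(terms) - k)):
            return coeffs
    return None

# Fibonacci: guessed recurrence is a(n) = a(n-1) + a(n-2)
print(guess_recurrence([1, 1, 2, 3, 5, 8]))
```

Verifying on terms not used for solving is what separates a plausible guess from an overfitted one; real guessing packages apply the same solve-then-check discipline to much larger ansatz spaces.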
FairMask: Better Fairness via Model-based Rebalancing of Protected Attributes
Context: Machine learning software can generate models that inappropriately
discriminate against specific protected social groups (e.g., groups based on
gender, ethnicity, etc). Motivated by those results, software engineering
researchers have proposed many methods for mitigating those discriminatory
effects. While those methods are effective in mitigating bias, few of them can
explain what the root cause of the bias is.
Objective: We aim at better detection and mitigation of algorithmic
discrimination in machine learning software.
Method: Here we propose xFAIR, a model-based extrapolation method, that is
capable of both mitigating bias and explaining the cause. In our xFAIR
approach, protected attributes are represented by models learned from the other
independent variables (and these models offer extrapolations over the space
between existing examples). We then use the extrapolation models to relabel
protected attributes later seen in testing data or deployment time. Our
approach aims to offset the biased predictions of the classification model via
rebalancing the distribution of protected attributes.
Results: The experiments of this paper show that, without compromising
(original) model performance, xFAIR can achieve significantly better group and
individual fairness (as measured in different metrics) than benchmark methods.
Moreover, when compared to another instance-based rebalancing method, our
model-based approach shows faster runtime and thus better scalability.
Conclusion: Algorithmic decision bias can be removed via extrapolation that
smooths away outlier points. As evidence for this, our proposed xFAIR performs
better (as measured by fairness and performance metrics) than two
state-of-the-art fairness algorithms. Comment: 14 pages, 6 figures, 7 tables, accepted by TS
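The model-based relabeling idea can be sketched as follows, using a 1-nearest-neighbour stand-in for the paper's extrapolation model; all data, names, and the choice of 1-NN are hypothetical:

```python
def relabel_protected(train_features, train_protected, test_rows, protected_idx):
    """xFAIR-style relabeling sketch: learn the protected attribute from
    the other features, then overwrite it in the test rows so the
    downstream classifier sees the rebalanced distribution."""
    def predict(features):
        # 1-NN stand-in for the learned extrapolation model
        _, label = min(
            zip(train_features, train_protected),
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], features)),
        )
        return label

    relabeled = []
    for row in test_rows:
        features = [v for i, v in enumerate(row) if i != protected_idx]
        new_row = list(row)
        new_row[protected_idx] = predict(features)  # relabel at test time
        relabeled.append(new_row)
    return relabeled

# toy data: training features exclude the protected attribute,
# which sits at index 2 of each test row
train_features = [[0.0, 0.0], [1.0, 1.0]]
train_protected = [0, 1]
print(relabel_protected(train_features, train_protected,
                        [[0.1, 0.2, 999]], protected_idx=2))
# -> [[0.1, 0.2, 0]]
```

Because the protected attribute is reconstructed from the remaining features, its distribution follows the model's extrapolation rather than the raw, possibly biased labels.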
Weighted KNN Using Grey Relational Analysis to Handle Missing Values in Cross-Domain Software Defect Prediction
Software defect prediction plays an important role in detecting the components that are most prone to defects. Several studies have tried to improve the accuracy of software defect prediction so that resources (people, cost, and time) can be managed better. However, previous work built defect prediction models for a specific domain only; merged, cross-domain datasets had not yet been handled.
This research improves the software defect prediction model so that it can handle merged (cross-domain) datasets whose numbers of features differ. To balance the number of features across datasets, the missing values that arise from merging cross-domain datasets are filled in; for this, a weighted KNN method is developed. The completed datasets are then classified using naive Bayes and random forest. This research also investigates which features are relevant for detecting defects through a comparative analysis of feature selection methods.
For the experiments, seven NASA public MDP (Metrics Data Program) datasets were used. The results show that, on imbalanced data, naive Bayes combined with information gain (IG) or symmetric uncertainty (SU) feature selection produces the best balance value, namely 0.4975. On balanced data, random forest combined with gain ratio (GR) feature selection produces the best balance value, namely 0.7795. In general, the classification results on each of the seven NASA MDP datasets individually differ relatively little from the cross-domain result of 0.4975, which even exceeds the result on the PC2 dataset, 0.4033.
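The weighted-KNN imputation based on grey relational analysis can be sketched as follows. This is a simplified illustration with invented data; the distinguishing coefficient zeta = 0.5 is the conventional default, and `None` marks a missing value:

```python
def impute(rows, row_idx, col_idx, zeta=0.5, k=2):
    """Fill rows[row_idx][col_idx] with an average of that feature over
    the k most related donor rows, weighted by grey relational grade."""
    ref = rows[row_idx]
    donors = [r for i, r in enumerate(rows)
              if i != row_idx and r[col_idx] is not None]
    # absolute differences over features observed in both rows
    diff_rows = [[abs(a - b) for j, (a, b) in enumerate(zip(ref, r))
                  if j != col_idx and a is not None and b is not None]
                 for r in donors]
    flat = [d for dr in diff_rows for d in dr]
    dmin, dmax = min(flat), max(flat)

    def grade(dr):
        # grey relational coefficient per feature, averaged into a grade
        coeffs = [(dmin + zeta * dmax) / (d + zeta * dmax) if dmax else 1.0
                  for d in dr]
        return sum(coeffs) / len(coeffs)

    best = sorted(((grade(dr), r[col_idx]) for dr, r in zip(diff_rows, donors)),
                  reverse=True)[:k]
    total = sum(g for g, _ in best)
    return sum(g * v for g, v in best) / total

rows = [[1.0, 2.0, None],   # row with the missing value
        [1.0, 2.0, 10.0],   # very similar donor row
        [5.0, 6.0, 20.0]]   # distant donor row
print(impute(rows, 0, 2))   # similar donor dominates: close to 12.5
```

Weighting donors by grey relational grade rather than raw Euclidean distance is what distinguishes this scheme from plain KNN imputation: near-identical rows receive much higher influence than distant ones.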