
    Investigating Automatic Static Analysis Results to Identify Quality Problems: an Inductive Study

    Background: Automatic static analysis (ASA) tools examine source code to discover "issues", i.e. code patterns that are symptoms of bad programming practices and that can lead to defective behavior. Studies in the literature have shown that these tools find defects earlier than other verification activities, but they produce a substantial number of false positive warnings. For this reason, an alternative approach is to use the set of ASA issues to identify defect-prone files and components rather than focusing on the individual issues. Aim: We conducted an exploratory study to investigate whether ASA issues can be used as early indicators of faulty files and components and, for the first time, whether they point to a decay of specific software quality attributes, such as maintainability or functionality. Our aim is to understand the critical parameters and feasibility of such an approach to feed into future research on more specific quality and defect prediction models. Method: We analyzed an industrial C# web application using the Resharper ASA tool and explored whether significant correlations exist in such a data set. Results: We found promising results when predicting defect-prone files. A set of specific Resharper categories are better indicators of faulty files than common software metrics or the collection of issues of all issue categories, and these categories correlate to different software quality attributes. Conclusions: Our advice for future research is to perform analysis at file rather than component level and to evaluate the generalizability of categories. We also recommend using larger datasets, as we learned that data sparseness can lead to challenges in the proposed analysis process.

    A Review of Metrics and Modeling Techniques in Software Fault Prediction Model Development

    This paper surveys software fault prediction approaches built on the different data analytic techniques reported in the software engineering literature. The study is split into three broad areas: (a) a description of the software metrics suites reported and validated in the literature; (b) a brief outline of previous research on the development of software fault prediction models based on various analytic techniques, summarized using a taxonomy of those techniques; and (c) a review of the advantages of using a combination of metrics. This area, though, is comparatively new and needs more research effort.

    Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection

    Software quality assurance aims to ensure that developed applications are failure free. Some modern systems are intricate, due to the complexity of their information processes. Software fault prediction is an important quality assurance activity, since it is a mechanism that predicts the defect proneness of modules and classifies them accordingly, which saves resources, time and developers’ effort. In this study, a model that selects relevant features for use in defect prediction was proposed. A review of the literature revealed that process metrics, which are based on the history of the source code over time, are better predictors of defects in versioned systems. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) of software product line(s) (SPL), hence process metrics were chosen. Data sets that are used in defect prediction may contain non-significant and redundant attributes that may affect the accuracy of machine-learning algorithms. In order to improve the prediction accuracy of classification models, only features that are significant in the defect prediction process are used. In machine learning, feature selection techniques are applied to identify the relevant data. Feature selection is a pre-processing step that helps to reduce the dimensionality of data. Feature selection techniques include information theoretic methods that are based on the concept of entropy. This study experimentally evaluated the efficiency of the feature selection techniques and found that software defect prediction using significant attributes improves the prediction accuracy.
A novel MICFastCR model was developed, which uses the Maximal Information Coefficient (MIC) to select significant attributes and the Fast Correlation-Based Filter (FCBF) to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. The MICFastCR model achieved the highest prediction accuracy as reported by various performance measures. School of Computing, Ph.D. (Computer Science)
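
The redundancy-elimination idea behind FCBF can be sketched in a few lines. This is a minimal illustration, not the thesis's MICFastCR implementation: it uses symmetric uncertainty (an entropy-based measure) for both relevance and redundancy and does not compute MIC at all. The function names, the `threshold` parameter, and the toy process-metric columns are assumptions made for the example, and all features are assumed to be already discretized.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(xs, ys)))
    return 2.0 * mutual_info / (hx + hy)


def fcbf(features, labels, threshold=0.0):
    """FCBF-style filter over a dict of {feature name: discrete column}.

    1. Keep features whose SU with the class label exceeds `threshold`.
    2. Visit them from most to least relevant and drop any feature that is
       at least as correlated with an already kept feature as with the label.
    """
    relevance = {name: symmetric_uncertainty(col, labels)
                 for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] > threshold),
                    key=lambda n: -relevance[n])
    selected = []
    for name in ranked:
        redundant = any(
            symmetric_uncertainty(features[kept], features[name]) >= relevance[name]
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Hypothetical, already discretized process metrics for eight modules.
labels = [0, 0, 1, 1, 0, 0, 1, 1]            # 1 = defective
features = {
    "churn": [0, 0, 1, 1, 0, 0, 1, 1],       # informative
    "churn_copy": [0, 0, 1, 1, 0, 0, 1, 1],  # duplicate of churn, so redundant
    "committers": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
}
print(fcbf(features, labels))
```

On this toy data only `churn` survives: `committers` carries no information about the label, and `churn_copy` is removed as redundant once `churn` has been kept.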

    Software Fault Prediction and Test Data Generation Using Artificial Intelligent Techniques

    The complexity of the requirements of present-day software, which is often very large, has led to an increase in the number of lines of code, resulting in a larger number of modules. There is every possibility that some of the modules may give rise to a variety of defects if testing is not done meticulously. In practice, it is not possible to carry out white box testing of every module of any software. Thus, software testing needs to be done selectively for the modules that are prone to faults. Identifying the probable fault-prone modules is a critical task, carried out for any software. This dissertation emphasizes the design of prediction and classification models to detect fault-prone classes in object-oriented programs. Then, test data are generated for a particular task to check the functionality of the software product. In the field of object-oriented software engineering, it is observed that the Chidamber and Kemerer (CK) software metrics suite is more frequently used for fault prediction analysis, as it covers the unique aspects of object-oriented programming such as complexity, data abstraction, and inheritance. It is observed that one of the most important goals of fault prediction is to detect fault-prone modules as early as possible in the software development life cycle (SDLC). Numerous authors have used design and code metrics for predicting fault-prone modules. In this work, design metrics are used for fault prediction. In order to carry out fault prediction analysis, prediction models are designed using machine learning methods. Machine learning methods such as statistical methods, artificial neural network, radial basis function network, functional link artificial neural network, and probabilistic neural network are deployed for fault prediction analysis. In the first phase, fault prediction is performed using the CK metrics suite.
In the next phase, the reduced feature sets of the CK metrics suite obtained by applying principal component analysis and rough set theory are used to perform fault prediction. A comparative approach is drawn to find a suitable prediction model among the set of designed models for fault prediction. Prediction models designed for fault proneness need to be validated for their efficiency. To achieve this, a cost-based evaluation framework is designed to evaluate the effectiveness of the designed fault prediction models. This framework is based on the classification of classes as faulty or not-faulty. In this cost-based analysis, it is observed that fault prediction is found to be suitable where the normalized estimated fault removal cost (NEcost) is less than a certain threshold value. This also indicates that any prediction model having NEcost greater than the threshold value is not suitable for fault prediction, and these classes are then unit tested. All the prediction and classifier models used in the fault prediction analysis are applied on a case study, viz. Apache Integration Framework (AIF). The metric data values are obtained from the PROMISE repository and are mined using the Chidamber and Kemerer Java Metrics (CKJM) tool. Test data are generated for an object-oriented program for the withdrawal task in a bank ATM using three meta-heuristic search algorithms: clonal selection algorithm, binary particle swarm optimization, and artificial bee colony algorithm. It is observed that the artificial bee colony algorithm is able to obtain near-optimal test data when compared to the other two algorithms. The test data are generated for the withdrawal task based on the fitness function derived by using the branch distance proposed by Bogdan Korel. The generated test data ensure the proper functionality or the correctness of the programmed module in a software
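
To make the branch-distance idea concrete, here is a small sketch of a fitness function for a withdrawal check. It follows one common textbook formulation of branch distance (zero when a predicate holds, otherwise how far the operands are from satisfying it plus a small constant); the dissertation's exact fitness function is not given in the abstract. The predicates, the constant `K`, and the names `withdrawal_fitness`, `amount`, and `balance` are all assumptions made for illustration.

```python
K = 1  # assumed penalty constant added when a relational predicate is violated


def dist_gt(a, b):
    """Branch distance for the predicate a > b."""
    return 0 if a > b else (b - a) + K


def dist_le(a, b):
    """Branch distance for the predicate a <= b."""
    return 0 if a <= b else (a - b) + K


def dist_eq(a, b):
    """Branch distance for the predicate a == b."""
    return abs(a - b)


def withdrawal_fitness(amount, balance):
    """Sum of branch distances along a hypothetical 'successful withdrawal' path.

    The path requires: amount > 0, amount is a multiple of 100,
    and amount <= balance. A fitness of 0 means the input drives
    execution down the target path; larger values are further away.
    """
    return (dist_gt(amount, 0)
            + dist_eq(amount % 100, 0)
            + dist_le(amount, balance))


print(withdrawal_fitness(200, 500))   # valid withdrawal
print(withdrawal_fitness(650, 500))   # not a multiple of 100 and over balance
```

A search algorithm such as the ones named above would then minimize this value over candidate inputs; the search itself is omitted here.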

    Finding Faulty Functions From the Traces of Field Failures

    Corrective maintenance, which rectifies field faults, consumes 30-60% of software maintenance time. Literature indicates that 50% to 90% of field failures are rediscoveries of previous faults, and that 20% of the code is responsible for 80% of the faults. Despite this, identifying the location of field failures in system code remains challenging and consumes a substantial share (30-40%) of corrective maintenance time. Prior fault discovery techniques for field traces require many pass-fail traces, discover only crashing failures, or identify coarse-grained faulty code such as files as the source of faults. This thesis (which is in the integrated article format) first describes a novel technique (F007) that focuses on identifying finer-grained faulty code (faulty functions) from only the failing traces of deployed software. F007 works by training decision trees on the function-call level failed traces of previous faults of a program. When a new failed trace arrives, F007 predicts a ranked list of faulty functions based on the probability of fault proneness obtained via the decision trees. Second, this thesis describes a novel strategy, F007-plus, that trains F007 on the failed traces of mutants (artificial faults) and previous faults. F007-plus helps F007 discover new faulty functions that could not otherwise be discovered because they were not faulty in the traces of previously known actual faults. F007 (including F007-plus) was evaluated on the Siemens suite, the Space program, four UNIX utilities, and a large commercial application of approximately 20 million LOC. F007 (including the use of F007-plus) was able to identify faulty functions in approximately 90% of the failed traces by reviewing less than approximately 10% of the code (i.e., by reviewing only the first few functions in the ranked list).
These results, in fact, lead to an emerging theory that a faulty function can be identified by using prior traces of at least one fault in that function. Thus, F007 and F007-plus can correctly identify faulty functions in the failed traces of the majority (80%-90%) of the field failures by using the knowledge of faults in a small percentage (20%) of functions.

    A Machine-learning Based Ensemble Method For Anti-patterns Detection

    Anti-patterns are poor solutions to recurring design problems. Several empirical studies have highlighted their negative impact on program comprehension, maintainability, as well as fault-proneness. A variety of detection approaches have been proposed to identify their occurrences in source code. However, these approaches can identify only a subset of the occurrences and report large numbers of false positives and misses. Furthermore, low agreement is generally observed among different approaches. Recent studies have shown the potential of machine-learning models to improve this situation. However, such algorithms require large sets of manually produced training data, which often limits their application in practice. In this paper, we present SMAD (SMart Aggregation of Anti-patterns Detectors), a machine-learning based ensemble method to aggregate various anti-pattern detection approaches on the basis of their internal detection rules. Thus, our method uses several detection tools to produce an improved prediction from a reasonable number of training examples. We implemented SMAD for the detection of two well-known anti-patterns: God Class and Feature Envy. With the results of our experiments conducted on eight Java projects, we show that: (1) our method clearly improves on the tools it aggregates; (2) SMAD significantly outperforms other ensemble methods. Comment: Preprint submitted to Journal of Systems and Software, Elsevier

    Fault Driven Supervised Tie Breaking for Test Case Prioritization

    Regression test suites are an excellent tool to validate the existing functionality of an application during the development process. However, they can be large and time-consuming to execute, which makes them inefficient at finding faults. Test Case Prioritization is an area of study that looks to improve the fault detection rates of these test suites by re-ordering the execution sequence of the test cases. It attempts to execute first the test cases that have the highest probability of detecting faults. Most prioritization techniques base their decisions on the coverage information gathered from running the test cases. These coverage-based techniques, however, have a high probability of encountering ties in coverage between two or more test cases. Most studies employ random selection to break these ties, despite it being considered a lower-bound method. This thesis designs and develops a framework to supervise the tie breaking in coverage-based Test Case Prioritization using fault predictor models. Fault predictor models can assist in identifying the modules in the application that are most prone to containing faults. By selecting test cases that cover the modules most prone to faults, the fault detection rate of the test cases can be improved. A fault prediction framework is also introduced in this thesis that supervises the tie breaking for coverage-based techniques. The framework employs an ensemble learner that aggregates results from multiple predictors. To date, no single predictor has been found that can perform consistently on all datasets. Numerous predictors have also required expert knowledge to make them performant. An ensemble learner is a reliable technique to mitigate the problems and bias faced by single predictors and to disregard results from poorly performing predictors. In order to evaluate the supervised tie breaking, empirical studies were conducted on two large-scale applications, Cassandra and Tomcat.
As part of the evaluation, real faults that existed in the application during development were used instead of the hand-seeded faults or mutation faults used by many other studies. The data used for fault prediction were also not groomed or marked by experts, unlike other studies. Results from the studies showed significant improvements in the fault detection rates for both case studies when using the fault-driven supervision for tie breaking.
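
The tie-breaking idea can be illustrated with a greedy "additional coverage" prioritizer in which ties are resolved by predicted fault-proneness instead of random choice. This is a simplified sketch, not the thesis's framework: the per-module `fault_score` values stand in for the output of an ensemble of predictors, and the function name, the data layout, and the final alphabetical tie-break are assumptions made for the example.

```python
def prioritize(coverage, fault_score):
    """Order tests by additional coverage, breaking ties by fault-proneness.

    coverage:    dict mapping test name -> set of modules the test covers
    fault_score: dict mapping module -> predicted fault-proneness
                 (modules without a score count as 0.0)
    """
    remaining = dict(coverage)
    covered = set()
    order = []
    while remaining:
        def key(test):
            newly_covered = len(remaining[test] - covered)
            # Tie-breaker: total predicted risk of the modules the test covers.
            risk = sum(fault_score.get(m, 0.0) for m in remaining[test])
            return (newly_covered, risk)

        # Iterating in sorted order makes any remaining ties deterministic:
        # max() returns the first maximal element it sees.
        best = max(sorted(remaining), key=key)
        order.append(best)
        covered |= remaining.pop(best)
    return order


# Hypothetical coverage data: t1 and t2 tie on coverage (two modules each).
coverage = {"t1": {"A", "B"}, "t2": {"C", "D"}, "t3": {"A"}}
fault_score = {"A": 0.1, "B": 0.2, "C": 0.9, "D": 0.3}

print(prioritize(coverage, fault_score))  # fault-driven tie breaking
print(prioritize(coverage, {}))           # no predictions available
```

With the scores above, `t2` is scheduled before `t1` because it covers the riskier modules; without any scores, the tie falls back to name order. Resetting the covered set once no test adds new coverage, which many additional-coverage implementations do, is left out for brevity.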

    Quality-Aware Learning to Prioritize Test Cases

    Software applications evolve at a rapid rate because of continuous functionality extensions, changes in requirements, optimization of code, and fixes of faults. Moreover, modern software is often composed of components engineered with different programming languages by different internal or external teams. During this evolution, it is crucial to continuously detect unintentionally injected faults and continuously release new features. Software testing aims at reducing this risk by running a certain suite of test cases regularly or at each change of the source code. However, the large number of test cases makes it infeasible to run all of them. Automated test case prioritization and selection techniques have been studied in order to reduce the cost and improve the efficiency of testing tasks. However, the current state-of-the-art techniques remain limited in some aspects. First, the existing test prioritization and selection techniques often assume that faults are equally distributed across the software components, which can lead to spending most of the testing budget on components less likely to fail rather than on the ones highly likely to contain faults. Second, the existing techniques share a scalability problem, not only in terms of the size of the selected test suite but also in terms of the round-trip time between code commits and engineer feedback on test case failures in the context of Continuous Integration (CI) development environments. Finally, it is hard to algorithmically capture the domain knowledge of the human testers, which is crucial in testing and release cycles.
This thesis is a new take on the old problem of reducing the cost of software testing in these regards. It presents a data-driven, lightweight approach for test case prioritization and execution scheduling that is used (i) during CI cycles for quick and resource-optimal feedback to engineers, and (ii) during release planning by capturing the testers' domain knowledge and release requirements. Our approach combines software quality metrics with code churn metrics to build a regression model that predicts the fault density of each component and a classification model to discriminate faulty from non-faulty components. Both models are used to guide the testing effort to the components likely to contain the largest number of faults. The predictive models have been validated on eight industrial automotive software applications at Daimler, showing a classification accuracy of 89% and an accuracy of 85.7% for the regression model. The thesis develops a test case prioritization model based on features of the code change, the test execution history and the component development history. The model reduces the cost of CI by predicting whether a particular code change should trigger the individual test suites and their corresponding test cases. In order to algorithmically capture the domain knowledge and the preferences of the tester, our approach includes a test case execution scheduling model that consumes the testers' preferences in the form of a probabilistic graph and solves the optimal test budget allocation problem both online in the context of CI cycles and offline when planning a release. Finally, the thesis presents a theoretical cost model that describes when our prioritization and scheduling approach is worthwhile.
The overall approach is validated on two industrial analytical applications in the area of energy management and predictive maintenance, showing that over 95% of the test failures are still reported back to the engineers while only 43% of the total available test cases are executed.