
4 research outputs found

Towards an Intelligent System for Software Traceability Datasets Generation

Author: Zogaan Waleed Abdu
Publication venue: RIT Scholar Works
Publication date: 17/12/2019

Software datasets and artifacts play a crucial role in advancing automated software traceability research. They can be used by researchers in different ways to develop or validate new automated approaches. Software artifacts, other than source code and issue tracking entities, can also provide a great deal of insight into a software system and facilitate knowledge sharing and information reuse. The diversity and quality of the datasets and artifacts within a research community have a significant impact on the accuracy, generalizability, and reproducibility of the results and consequently on the usefulness and practicality of the techniques under study. Collecting and assessing the quality of such datasets are not trivial tasks and have been reported as an obstacle by many researchers in the domain of software engineering. In this dissertation, we report our empirical work that aims to automatically generate and assess the quality of such datasets. Our goal is to introduce an intelligent system that can help researchers in the domain of software traceability in obtaining high-quality “training sets”, “testing sets” or appropriate “case studies” from open source repositories based on their needs. In the first project, we present a first-of-its-kind study to review and assess the datasets that have been used in software traceability research over the last fifteen years. It presents and articulates the current status of these datasets, their characteristics, and their threats to validity. Second, this dissertation introduces a Traceability-Dataset Quality Assessment (T-DQA) framework to categorize software traceability datasets and assist researchers to select appropriate datasets for their research based on different characteristics of the datasets and the context in which those datasets will be used. Third, we present the results of an empirical study with limited scope to generate datasets using three baseline approaches for the creation of training data. 
These approaches are (i) Expert-Based, (ii) Automated Web-Mining, which generates training sets by automatically mining tactic's APIs from technical programming websites, and (iii) Automated Big-Data Analysis, which mines ultra-large-scale code repositories to generate training sets. We compare the trace-link creation accuracy achieved using each of these three baseline approaches and discuss the costs and benefits associated with them. Additionally, in a separate study, we investigate the impact of training set size on the accuracy of recovering trace links. We also conduct a large-scale study to identify which types of software artifacts are produced by a wide variety of open-source projects at different levels of granularity. Then we propose an automated approach based on Machine Learning techniques to identify various types of software artifacts. Through a set of experiments, we report and compare the performance of these algorithms when applied to software artifacts. Finally, we conduct a study to understand how software traceability experts and practitioners evaluate the quality of their datasets. In addition, we aim to gather experts’ opinions on all quality attributes and metrics proposed by T-DQA.

RIT Scholar Works

Efficient Information Retrieval for Software Bug Localization

Author: Khatiwada Saket
Publication venue: LSU Digital Commons
Publication date: 12/03/2022

Software systems are often shipped with defects. When a bug is reported, developers use the information available in the associated report to locate source code fragments that need to be modified to fix the bug. However, as software systems evolve in size and complexity, bug localization can become a tedious and time-consuming process. Contemporary bug localization tools utilize Information Retrieval (IR) methods for automated support to minimize the manual effort. IR methods exploit the textual content of bug reports to capture and rank relevant buggy source files. However, for an IR-based bug localization tool to be useful, it must achieve adequate retrieval accuracy. Lower precision and recall can leave developers with large amounts of incorrect information to wade through. Motivated by these observations, in this dissertation, we propose a new paradigm of information-theoretic IR methods to support bug localization tasks in software systems. These methods exploit the co-occurrence patterns of code terms in software systems to reveal latent semantic information that other methods often fail to capture. We further investigate the impact of combining various IR methods on the retrieval accuracy of bug localization engines. The main assumption is that different IR methods, targeting different dimensions of similarity between software artifacts, can enhance the confidence in each other's results. Furthermore, we propose a novel approach for enhancing the performance of IR-enabled bug localization methods in the context of Open-Source Software (OSS). The proposed approach exploits knowledge from previously resolved bugs to help localize new bugs. Our analysis uses multiple datasets generated for multiple open-source and closed-source projects.
Our results show that a) information-theoretic IR methods can significantly outperform classical IR methods in bug localization tasks, b) optimized IR-hybrids can significantly outperform individual IR methods, and near-optimal global configurations can be determined for different combinations of IR methods, and c) information extracted from previously resolved bug reports can significantly enhance the accuracy of IR-enabled bug localization methods in OSS.
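
The abstract above describes IR methods that rank source files against the text of a bug report. As a rough illustration of that general idea only, the sketch below uses a classical TF-IDF and cosine-similarity baseline; it is not the information-theoretic method proposed in the dissertation. The function names `tokenize` and `rank_files`, the smoothed IDF formula, and the sample file contents are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split camelCase identifiers, lowercase, and split on non-alphanumerics."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


def rank_files(bug_report, files):
    """Return (file name, cosine similarity) pairs, best match first.

    `files` maps a file name to its source text. Hypothetical helper,
    written for illustration only.
    """
    docs = {name: Counter(tokenize(src)) for name, src in files.items()}
    n = len(docs)
    # Document frequency: in how many files each term occurs.
    df = Counter(term for counts in docs.values() for term in counts)
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

    def vec(counts):
        # Terms unseen in the corpus carry no weight and are dropped.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query = vec(Counter(tokenize(bug_report)))
    scores = [(name, cosine(query, vec(c))) for name, c in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up example corpus and bug report.
files = {
    "Parser.java": "class Parser { void parseDate(String input) { } }",
    "Logger.java": "class Logger { void writeLog(String message) { } }",
}
ranking = rank_files("Crash when parsing date input", files)
```

In this toy corpus the report shares the terms `date` and `input` with `Parser.java` only, so that file is ranked first. Real tools typically add stemming, stop-word removal, and tuning that this sketch omits.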

Louisiana State University


Effective bug detection and localization using information retrieval

Author: Saha Ripon Kumar
Publication date: 06/09/2016

Software bugs pose a fundamental threat to the reliability of software systems, even in systems designed with the best software engineering (SE) teams using the best SE practices. Detecting bugs early and fixing them quickly are extremely important. However, they are very expensive and challenging, especially at scale. While the sciences of bug detection (e.g., software testing) and localization via static and dynamic program analyses have been explored considerably, text-based Information Retrieval (IR) techniques for bug detection and localization are interesting and promising new approaches for these problems. One advantage of text-based approaches is that they can utilize a lot of (implicit) semantic information about a program’s functionality from the program text, which is almost impossible to extract using program-analysis-based techniques. This dissertation builds a deeper understanding of current bug triaging and fixing processes via mining software repositories, and introduces new techniques for effective bug detection and localization. The dissertation has three main parts. First, we perform a number of empirical studies to investigate the extent of and reasons for long-lived bugs, their severities, and the time spent in different phases of the bug-fixing process. We demonstrate that many bugs remain unfixed for an inordinate period of time due to numerous reasons, including difficulties in detecting, localizing, and fixing them. Second, we demonstrate that developers use very similar program text in source code and their corresponding test cases, which could be utilized to implement powerful test prioritization techniques. We introduce a novel IR-based regression test prioritization technique called REPiR that embodies our insight, and show that REPiR is more efficient than program-analysis-based or dynamic-coverage-based techniques.
Third, we demonstrate that fine-grained program text such as class names, method names, variable names, and comments carries different levels of information, which can be utilized to improve IR-based bug localization. We introduce a structured retrieval technique called BLUiR that embodies our insights and show that BLUiR outperforms the existing state-of-the-art IR-based bug localization approaches. Finally, we further improve BLUiR using natural language processing. We make four contributions in this dissertation. One, we provide empirical evidence that there are considerable numbers of non-trivial bugs in software projects that survive for a long time. We describe the reasons for delay in fixing, the nature of fixes, and the overall fixing process of these long-lived bugs in great detail. Two, we introduce the notion of IR-based regression test prioritization based on program changes. Three, we introduce the notion of structured retrieval for bug localization. Four, we provide an in-depth analysis of the extent to which natural language processing can play an important role in improving IR-based bug localization further. The central ideas are embodied in a suite of prototype tools. Rigorous empirical evaluation is performed to validate the efficacy of the proposed techniques using datasets containing a variety of real-world Java and C programs.

Electrical and Computer Engineering

Texas ScholarWorks

Atmospheric profiles of CO₂ as integrators of regional scale exchange

Author: Smallman Thomas Luke
Publication venue: The University of Edinburgh
Publication date: 30/06/2014

The global climate is changing due to the accumulation of greenhouse gases (GHGs) in the atmosphere, primarily due to anthropogenic activity. The dominant GHG is CO₂, which originates from combustion of fossil fuels, land use change and management. The terrestrial biosphere is a key driver of climate and biogeochemical cycles at regional and global scales. Furthermore, the response of the Earth system to future drivers of climate change will depend on feedbacks between biogeochemistry and climate. Therefore, understanding these processes requires a mechanistic approach in any model simulation framework. However, ecosystem processes are complex and nonlinear, and consequently models need to be validated against observations at multiple spatial scales. In this thesis the Weather Research and Forecasting model (WRF) has been coupled to the mechanistic terrestrial ecosystem model soil-plant-atmosphere (SPA), creating WRF-SPA. The thesis is split into three main chapters: i. WRF-SPA model development and validation at multiple spatial scales, scaling from surface fluxes of CO₂ and energy to aircraft profiles and tall tower observations of atmospheric CO₂ concentrations. ii. Investigation of ecosystem contributions to observations of atmospheric CO₂ concentrations made at tall tower Angus, Dundee, Scotland using ecosystem-specific CO₂ tracers at seasonal and interannual time scales. iii. An assessment of detectability of a policy-relevant national-scale afforestation by observations made at a tall tower. Detectability of changes in atmospheric CO₂ concentrations was assessed through a comparison of a control simulation, using current day forest extent, and an experimentally afforested simulation using WRF-SPA. WRF-SPA performs well at both site and regional scales, accurately simulating aircraft profiles of CO₂ concentration magnitudes (error within ±4 ppm), indicating appropriate source-sink distribution and realistic atmospheric transport.
Hourly observations made at tall tower Angus were also well simulated by WRF-SPA (R² = 0.67, RMSE = 3.5 ppm, bias = 0.58 ppm). Analysis of CO₂ tracers at tall tower Angus shows an increase in the seasonal error between WRF-SPA simulated atmospheric CO₂ and observations, which coincides with simulated cropland harvest. WRF-SPA does not simulate uncultivated land associated with agriculture, which in Scotland represents 36 % of agricultural holdings. Therefore, uncultivated land components may provide an explanation for the increase in model-data error. Interannual variation in weather is indicated to have a greater impact on ecosystem-specific contributions to atmospheric CO₂ concentrations at Angus than variation in surface activity. In a model experiment, afforestation of Scotland was simulated to test the impact on Scotland’s carbon balance. The changes were shown to be potentially detectable by observations made at tall tower Angus. Afforestation results in a reduction in atmospheric CO₂ concentrations by up to 0.6 ppm at seasonal time scales at tall tower Angus. Detection of changes in forest surface net CO₂ uptake flux due to afforestation was improved through the use of a network of tall towers (R² = 0.83) compared to tall tower Angus alone (R² = 0.75).
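
The abstract above reports model-data agreement as R², RMSE, and bias. The sketch below shows one common way such statistics are computed from paired observed and simulated concentrations. It rests on assumptions: the abstract does not say how R² was defined, so the squared Pearson correlation is used here, and bias is taken as the mean of simulated minus observed. The function name `fit_statistics` and the sample values are hypothetical and are not data from the thesis.

```python
import math


def fit_statistics(observed, simulated):
    """Return R^2, RMSE, and bias for paired series of equal length.

    Assumed definitions (not confirmed by the source):
      bias = mean(simulated - observed)
      RMSE = sqrt(mean((simulated - observed)^2))
      R^2  = squared Pearson correlation between the two series
    """
    n = len(observed)
    if n != len(simulated) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")

    errors = [s - o for o, s in zip(observed, simulated)]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_o = sum(observed) / n
    mean_s = sum(simulated) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_o == 0 or var_s == 0:
        raise ValueError("R^2 is undefined for a constant series")

    return {"r2": cov * cov / (var_o * var_s), "rmse": rmse, "bias": bias}


# Made-up CO2 concentrations in ppm; the simulation is offset by +1 ppm.
stats = fit_statistics([400.0, 402.0, 404.0, 406.0],
                       [401.0, 403.0, 405.0, 407.0])
```

With a constant +1 ppm offset the two series are perfectly correlated, so the squared correlation is 1 while both RMSE and bias equal 1 ppm. This is why bias and RMSE are usually reported alongside R².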

Edinburgh Research Archive