11 research outputs found

    Reconstruction of time series data with missing values

    Time series data are used to represent many real-world phenomena. For various reasons, a time series database may have some missing data. Traditional interpolation or estimation methods usually become invalid when the observation interval of the missing data is not small (Hong and Chen, 2003).

    RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

    Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), and area under the curve for receiver operating characteristic plots (all p < 10^{-6}). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.

    Fuzzy C-mean missing data imputation for analogy-based effort estimation

    The accuracy of effort estimation is one of the major factors in the success or failure of software projects. Analogy-Based Estimation (ABE) is a widely accepted estimation model since it follows human nature in selecting analogies similar to the target project. The accuracy of prediction in the ABE model is strongly associated with the quality of the dataset, since it depends on previously completed projects for estimation. Missing Data (MD) is one of the major challenges in software engineering datasets. Several missing data imputation techniques have been investigated by researchers for the ABE model. Identifying the most similar donor values from the completed software projects dataset for imputation is a challenging issue in existing missing data techniques adopted for the ABE model. In this study, Fuzzy C-Mean Imputation (FCMI), Mean Imputation (MI) and K-Nearest Neighbor Imputation (KNNI) are investigated to impute missing values in the Desharnais dataset under different missing data percentages (Desh-Miss1, Desh-Miss2) for the ABE model. The FCMI-ABE technique is proposed in this study. An evaluation comparison among MI, KNNI, and ABE-FCMI is conducted for the ABE model to identify the suitable MD imputation method. The results suggest that the use of ABE-FCMI, rather than MI and KNNI, imputes more reliable values to incomplete software projects in the missing datasets. It was also found that the proposed imputation method significantly improves the software development effort prediction of the ABE model.

    Potential and limitations of the ISBSG dataset in enhancing software engineering research: A mapping review

    Context: The International Software Benchmarking Standards Group (ISBSG) maintains a software development repository with over 6000 software projects. This dataset makes it possible to estimate a project's size, effort, duration, and cost. Objective: The aim of this study was to determine how and to what extent ISBSG has been used by researchers from 2000, when the first papers were published, until June of 2012. Method: A systematic mapping review was used as the research method, which was applied to over 129 papers obtained after the filtering process. Results: The papers were published in 19 journals and 40 conferences. Thirty-five percent of the papers published between 2000 and 2011 have received at least one citation in journals, and only five papers have received six or more citations. The effort variable is the focus of 70.5% of the papers, 22.5% center their research on a variable other than effort, and 7% do not consider any target variable. Additionally, in as many as 70.5% of papers, effort estimation is the research topic, followed by dataset properties (36.4%). The most frequent methods are Regression (61.2%), Machine Learning (35.7%), and Estimation by Analogy (22.5%). ISBSG is used as the only support in 55% of the papers, while the remaining papers use complementary datasets. ISBSG release 10 is used most frequently, with 32 references. Finally, some benefits and drawbacks of the usage of ISBSG have been highlighted. Conclusion: This work presents a snapshot of the existing usage of ISBSG in software development research. ISBSG offers a wealth of information regarding practices from a wide range of organizations, applications, and development types, which constitutes its main potential. However, a data preparation process is required before any analysis. Lastly, the potential of ISBSG to develop new research is also outlined.
    Fernández Diego, M.; González-Ladrón-De-Guevara, F. (2014). Potential and limitations of the ISBSG dataset in enhancing software engineering research: A mapping review. Information and Software Technology. 56(6):527-544. doi:10.1016/j.infsof.2014.01.003

    Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study

    Being able to predict software quality is essential, but it also poses significant challenges in software engineering. Historical software project datasets are often used together with various machine learning algorithms for fault-proneness classification. Unfortunately, missing values in datasets have a negative impact on estimation accuracy and could therefore lead to inconsistent results. As a method for handling missing data, K nearest neighbor (KNN) imputation is gradually gaining acceptance in empirical studies because of its exemplary performance and simplicity. To date, researchers still call for optimized parameter settings for KNN imputation to further improve its performance. In this work, we develop a novel incomplete-instance based KNN imputation technique, which utilizes a cross-validation scheme to optimize the parameters for each missing value. An experimental assessment is conducted on eight quality datasets under various missingness scenarios. The study also compares the proposed imputation approach with mean imputation and three other KNN imputation approaches. The results show that our proposed approach is superior to the others in general. The relatively optimal fixed parameter settings for KNN imputation for software quality data are also determined. It is observed that classification accuracy is improved, or at least maintained, by using our approach for missing data imputation.

    The usage of ISBSG data fields in software effort estimation: A systematic mapping study

    The International Software Benchmarking Standards Group (ISBSG) maintains a repository of data about completed software projects. A common use of the ISBSG dataset is to investigate models to estimate a software project's size, effort, duration, and cost. The aim of this paper is to determine which variables in the ISBSG dataset have been used in software engineering to build effort estimation models, and to what extent. For that purpose a systematic mapping study was applied to 107 research papers, obtained after a filtering process, that were published from 2000 until the end of 2013, and which listed the independent variables used in the effort estimation models. The usage of ISBSG variables for filtering, as dependent variables, and as independent variables is described. The 20 variables (out of 71) most often used as independent variables for effort estimation are identified and analysed in detail, with reference to the papers and types of estimation methods that used them. We propose guidelines that can help researchers make informed decisions about which ISBSG variables to select for their effort estimation models.
    González-Ladrón-De-Guevara, F.; Fernández-Diego, M.; Lokan, C. (2016). The usage of ISBSG data fields in software effort estimation: A systematic mapping study. Journal of Systems and Software. 113:188-215. doi:10.1016/j.jss.2015.11.040

    Predictability of Missing Data Theory to Improve U.S. Estimator’s Unreliable Data Problem

    Since the topic of improving data quality has not been addressed for the U.S. defense cost estimating discipline beyond changes in public policy, the goal of the study was to close this gap and provide empirical evidence that supports expanding options to improve software cost estimation data matrices for U.S. defense cost estimators. The purpose of this quantitative study was to test and measure the level of predictive accuracy of missing data theory techniques that were referenced as traditional approaches in the literature, compare each theory's results to a complete data matrix used in support of the U.S. defense cost estimation discipline, and determine which theories rendered incomplete and missing data sets in a single data matrix most reliable and complete under eight missing value percentages. A quantitative pre-experimental research design, a one-group pretest-posttest design with no control group, empirically tested and measured the predictive accuracy of traditional missing data theory techniques typically used in non-cost estimating disciplines. The results from the pre-experiments on a representative U.S. defense software cost estimation data matrix, a nonproprietary set of historical software effort, size, and schedule numerical data used at Defense Acquisition University, revealed that single and multiple imputation techniques were two viable options to improve data quality since calculations fell within 20% of the original data value 16.4% and 18.6%, respectively. This study supports positive social change by investigating how cost estimators, engineering economists, and engineering managers could improve the reliability of their estimate forecasts, provide better estimate predictions, and ultimately reduce taxpayer funds that are spent to fund defense acquisition cost overruns.

    Improvement and implementation of analog based method for software project cost estimation
