
    An empirical study of imputation techniques for software data sets

    Software project effort/cost/time estimation has been one of the most active research topics in the software engineering industry, and solutions for effort/cost/time estimation are in great demand. Accurate effort/cost/time estimates early in the software project life cycle enable project managers to manage and exploit resources efficiently and to meet cost and time constraints. To this day, most companies rely on historical databases of past project data to predict estimates for future projects. Like other data sets, software project data sets suffer from numerous problems, the most important of which is that they contain missing or incomplete data. Significant amounts of missing or incomplete data are frequently found in the data sets used to build effort/cost/time prediction models in the current software industry; the reasons are numerous and the missingness is inevitable. The traditional approaches used by companies ignore all the missing data and produce estimates from the remaining complete information, so the resulting estimates are prone to bias. In this thesis, we investigate the application of several well-known data imputation techniques (Listwise Deletion, Mean Imputation, 10 variants of Hot-Deck Imputation, and the Full Information Maximum Likelihood approach) to six real-time software project data sets. Using the imputed data sets, we build effort prediction models and evaluate their performance. We study the inherent characteristics of software project data sets, such as data set size, missingness mechanism, and pattern of missingness, and provide a generic classification schema for software project data sets based on these characteristics. We further implement a hybrid methodology for addressing the same problem. We perform experimental analyses and compare the impact of these methods on prediction accuracy. We also highlight the conditions to be considered and the measures to be taken when using an imputation technique, and we note the ideal and worst conditions for each method. Finally, we discuss the findings and the appropriateness of each method for imputing software project data sets.
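
    The abstract above does not include code; as a rough sketch of what two of the simpler techniques look like in practice, the fragment below applies listwise deletion and mean imputation to an entirely hypothetical project data set (column names such as size_kloc and effort_pm are illustrative, not from the thesis) and fits a simple effort prediction model with scikit-learn.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    # Hypothetical software project data set; values are illustrative only.
    data = pd.DataFrame({
        "size_kloc": [10.0, 25.0, np.nan, 40.0, 12.0, 55.0, np.nan, 30.0],
        "team_size": [3.0, 5.0, 4.0, np.nan, 2.0, 8.0, 6.0, 5.0],
        "effort_pm": [12.0, 30.0, 20.0, 48.0, 14.0, 70.0, 35.0, 36.0],
    })
    features, target = ["size_kloc", "team_size"], "effort_pm"

    # Listwise deletion: drop every project that has any missing feature value.
    listwise = data.dropna(subset=features)

    # Mean imputation: replace each missing value with its column mean.
    mean_imputed = data.copy()
    mean_imputed[features] = mean_imputed[features].fillna(mean_imputed[features].mean())

    # Fit an effort prediction model on each version of the data and compare fit error.
    for name, frame in [("listwise deletion", listwise), ("mean imputation", mean_imputed)]:
        model = LinearRegression().fit(frame[features], frame[target])
        mae = mean_absolute_error(frame[target], model.predict(frame[features]))
        print(f"{name}: in-sample MAE = {mae:.2f} person-months")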

    Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset

    Data mining techniques have been used to analyse patterns in data sets in order to derive useful information. Grouping data into clusters is one of the essential processes in data manipulation, and one of the most popular and efficient clustering methods is the K-means method. However, K-means clustering has difficulties analysing high-dimensional data sets in the presence of missing values. Moreover, previous studies showed that the high dimensionality of data set features poses additional problems for K-means clustering. For the missing-value problem, an imputation method is needed to minimise the effect of incomplete high-dimensional data sets on the K-means clustering process. This research studies the effect of imputation algorithms and dimensionality reduction techniques on the performance of K-means clustering. Three imputation methods are implemented for missing-value estimation: K-Nearest Neighbours (KNN), Local Least Squares (LLS), and Bayesian Principal Component Analysis (BPCA). Principal Component Analysis (PCA) is a dimensionality reduction method that removes unnecessary attributes from high-dimensional data sets. Hence, a hybrid of PCA and K-means (PCA K-means) is proposed to give better clustering results. The experiments were performed using the Wisconsin Breast Cancer data set. With the LLS imputation method, the proposed hybrid PCA K-means outperformed standard K-means clustering on the breast cancer data set in terms of clustering accuracy (0.29%) and computing time (95.76%).
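
    As a minimal sketch of the pipeline described above (not the authors' implementation), the fragment below imputes missing values with KNN, reduces dimensionality with PCA, and clusters the scikit-learn copy of the Wisconsin Breast Cancer data set with K-means; the injected missingness rate, number of principal components, and neighbour count are assumptions made purely for illustration.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import accuracy_score

    # Load the Wisconsin Breast Cancer data and inject synthetic missing values,
    # since the scikit-learn copy of the data set is complete.
    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(0)
    X_missing = X.copy()
    X_missing[rng.random(X.shape) < 0.05] = np.nan   # ~5% missing, chosen arbitrarily

    # Step 1: KNN imputation of the missing entries.
    X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)

    # Step 2: standardise, then reduce dimensionality with PCA.
    X_reduced = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X_imputed))

    # Step 3: K-means with two clusters (benign vs. malignant).
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)

    # Cluster labels are arbitrary, so score accuracy against both possible alignments.
    accuracy = max(accuracy_score(y, labels), accuracy_score(y, 1 - labels))
    print(f"PCA K-means clustering accuracy: {accuracy:.3f}")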