4 research outputs found

    Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset

    Data mining techniques are used to analyse patterns in data sets in order to derive useful information. Grouping data into clusters is one of the essential processes in data manipulation, and K-means is one of the most popular and efficient clustering methods. However, K-means has difficulty analysing high-dimensional data sets that contain missing values, and previous studies have shown that high feature dimensionality poses further problems for K-means clustering. An imputation method is therefore needed to minimise the effect of incomplete high-dimensional data on the K-means clustering process. This research studies the effect of imputation algorithms and dimensionality reduction techniques on the performance of K-means clustering. Three imputation methods are implemented for missing value estimation: K-nearest neighbours (KNN), Local Least Squares (LLS), and Bayesian Principal Component Analysis (BPCA). Principal Component Analysis (PCA) is a dimension reduction method that removes unnecessary attributes from high-dimensional data sets. Hence, a hybrid of PCA and K-means (PCA K-means) is proposed to give a better clustering result. The experiments were performed on the Wisconsin Breast Cancer data set. With the LLS imputation method, the proposed hybrid PCA K-means outperformed standard K-means clustering on the breast cancer data set in terms of clustering accuracy (0.29%) and computing time (95.76%).
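The pipeline this abstract describes (impute, reduce with PCA, then cluster with K-means) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: column-mean imputation stands in for KNN, LLS, or BPCA, only the first principal component is kept, the toy matrix is invented, and the helper names `mean_impute`, `first_component`, and `kmeans_1d` are hypothetical.

```python
import math

def mean_impute(rows):
    """Replace None with the column mean (a simple stand-in, not KNN/LLS/BPCA)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

def first_component(rows, iters=200):
    """Score each row on the leading principal axis, found by power iteration."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mu[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centred]

def kmeans_1d(values, k=2, iters=50):
    """Lloyd's algorithm on scalar scores (k >= 2); centroids start at
    evenly spaced positions in the sorted scores."""
    s = sorted(values)
    centroids = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centroids[c]))
                  for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels

# Invented toy data: two well-separated groups, two missing entries.
data = [
    [1.0, 2.0, 1.5],
    [1.2, None, 1.4],
    [0.9, 2.1, 1.6],
    [8.0, 9.0, 8.5],
    [8.2, 9.1, None],
    [7.9, 8.8, 8.4],
]
labels = kmeans_1d(first_component(mean_impute(data)))
print(labels)
```

On data this well separated, the two groups should land in different clusters even with the crude imputation. The abstract's accuracy and timing figures concern the real data set and the LLS imputer, which this sketch does not reproduce.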

    Ameliorative missing value imputation for robust biological knowledge inference

    Gene expression data is widely used in various post-genomic analyses. The data is often probed using microarrays because they can simultaneously measure the expression of thousands of genes. The expression data, however, contains significant numbers of missing values, which can affect subsequent biological analysis. To minimize the impact of these missing values, several imputation algorithms have been proposed, including Collateral Missing Value Estimation (CMVE), Bayesian Principal Component Analysis (BPCA), Least Square Impute (LSImpute), Local Least Square Impute (LLSImpute), and K-Nearest Neighbour (KNN). These algorithms, however, exploit only the global or only the local correlation structure of the data, which can lead to higher estimation errors. This paper presents an Ameliorative Missing Value Imputation (AMVI) technique that can exploit global/local and positive/negative correlations in a given dataset by automatically selecting the optimal number of predictor genes k, using a wrapper non-parametric method based on Monte Carlo simulations. AMVI has the CMVE strategy at its core because CMVE, which exploits positive/negative correlations, has demonstrated improved performance compared with both low-variance methods such as BPCA and LLSImpute and high-variance methods such as KNN and ZeroImpute. The performance of AMVI is compared with CMVE, BPCA, LLSImpute, and KNN by randomly removing between 1% and 15% of the values in eight different ovarian cancer, breast cancer, and yeast datasets. Together with the standard NRMS error metric, the True Positive (TP) rate of significant gene selection, the biological significance of the selected genes, and statistical significance test results are presented to investigate the impact of missing values on subsequent biological analysis. The enhanced performance of AMVI was demonstrated by its lower NRMS error, improved TP rate, biological significance of the selected genes, and statistical significance test results when compared with the aforementioned imputation methods across all the datasets. The results show that AMVI adapted to the latent correlation structure of the data and proved to be an effective and robust approach compared with the trial-and-error methodology for selecting k. The results confirmed that AMVI can be successfully applied to accurately impute missing values prior to any microarray data analysis. © 2007 Elsevier Inc. All rights reserved.
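The evaluation protocol in this abstract (hide a known fraction of entries, impute them, and score the result with the NRMS error) can be sketched as follows. This is a hedged illustration only. It assumes NRMS error is the root of the summed squared differences divided by the root of the summed squared true values, which is one common definition. It uses a plain KNN imputer with a fixed k rather than AMVI or CMVE. The matrix and the helper names `mask_random`, `knn_impute`, and `nrms_error` are invented.

```python
import math
import random

def mask_random(rows, fraction, seed=0):
    """Hide a random fraction of entries (set to None), reproducibly."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = rng.sample(cells, max(1, round(fraction * len(cells))))
    masked = [list(r) for r in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked

def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest rows,
    using RMS distance over the columns both rows have observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            if not nearest:  # no comparable row: fall back to the column mean
                nearest = [(0.0, r[j]) for r in rows if r[j] is not None]
            filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

def nrms_error(true_rows, est_rows):
    """Assumed definition: sqrt(sum of squared errors / sum of squared truth)."""
    num = sum((t - e) ** 2
              for tr, er in zip(true_rows, est_rows) for t, e in zip(tr, er))
    den = sum(t ** 2 for tr in true_rows for t in tr)
    return math.sqrt(num / den)

# Invented "expression" matrix: rows play the role of genes, columns samples.
truth = [
    [6.0, 6.5, 7.0, 6.2],
    [6.1, 6.4, 7.1, 6.3],
    [6.2, 6.6, 6.9, 6.1],
    [8.8, 8.5, 9.0, 8.7],
    [8.9, 8.4, 8.9, 8.6],
    [8.7, 8.6, 9.0, 8.8],
]
masked = mask_random(truth, fraction=0.15)
knn_filled = knn_impute(masked, k=2)
zero_filled = [[0.0 if v is None else v for v in r] for r in masked]

knn_err = nrms_error(truth, knn_filled)
zero_err = nrms_error(truth, zero_filled)
print(f"KNN NRMS error:  {knn_err:.4f}")
print(f"Zero NRMS error: {zero_err:.4f}")
```

Here k is fixed by hand, which is the trial-and-error choice the abstract contrasts with AMVI's automatic selection. Sweeping `fraction` from 0.01 to 0.15 would mirror the range of missing rates the abstract reports.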

    Assessing Affordability of Fruits and Vegetables in the Brazos Valley-Texas

    The burden of obesity-related illness, which disproportionately affects low-income households and historically disadvantaged racial and ethnic groups, is a leading public health issue in the United States. In addition, previous research has documented differences in eating behavior and dietary intake between racial and ethnic groups, as well as between urban and rural residents. The coexistence of diet-related disparities and diet-related health conditions has therefore become a major focus of research and policy. Researchers have hypothesized that differences in eating behavior originate from differing levels of access to and affordability of healthy food options, such as fresh fruits and vegetables. Therefore, this dissertation examines the affordability of fresh produce in the Brazos Valley of Texas. This study uses information on produce prices collected by taking a census of food stores in a large regional area using the ground-truthing method. These data are combined with responses to a contemporaneous health assessment survey. Key innovations include the construction of price indices based on economic theory, testing the robustness of results to different methods of price imputation, and employing spatial econometric techniques. In the first part of the analysis, I evaluate the socioeconomic and geographical factors associated with the affordability of fresh fruits and vegetables. The results based on Ordinary Least Squares (OLS) regression show that, except for housing values (measured as the median value of owner-occupied units) and store type, most factors do not have significant effects on the prices of these food items. In addition, the sizes and signs of the coefficients vary greatly across items. We found that consumers who pay higher premiums for fresh produce reside in rural areas and in neighborhoods with a high proportion of minority residents. We then assess how our results are influenced by different imputation methods used to account for missing prices. The results reveal that the impacts of the factors are similar regardless of the imputation method. Finally, we investigate the presence of spatial relationships between prices at particular stores and at competing stores in their neighborhoods. The spatial estimation results based on Maximum Likelihood (ML) indicate a weak spatial correlation between the prices at stores located near each other. Stores selling vegetables display a certain level of spatial autocorrelation between the prices at a particular store and those of its neighboring competitors. Stores selling fruits do not show such a relationship in prices.