An efficient k-means-type algorithm for clustering datasets with incomplete records
The k-means algorithm is arguably the most popular nonparametric clustering
method but cannot generally be applied to datasets with incomplete records. The
usual practice then is to either impute missing values under an assumed
missing-completely-at-random mechanism or to ignore the incomplete records, and
apply the algorithm on the resulting dataset. We develop an efficient version
of the k-means algorithm that allows for clustering in the presence of
incomplete records. Our extension is called k_m-means and reduces to the
k-means algorithm when all records are complete. We also provide
initialization strategies for our algorithm and methods to estimate the number
of groups in the dataset. Illustrations and simulations demonstrate the
efficacy of our approach in a variety of settings and patterns of missing data.
Our methods are also applied to the analysis of activation images obtained from
a functional Magnetic Resonance Imaging experiment.
Comment: 21 pages, 12 figures, 3 tables, in press, Statistical Analysis and
Data Mining -- The ASA Data Science Journal, 201
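The abstract above does not spell out the algorithm, so the following is only a minimal sketch of one common way to run a k-means-style iteration on incomplete records: distances and centroid means are computed over observed coordinates only. It is not the authors' implementation. The function names, the use of `None` to mark a missing value, and the caller-supplied starting centers are all assumptions made for illustration. When no values are missing, the loop reduces to ordinary Lloyd-style k-means.

```python
def partial_sq_dist(record, center):
    """Squared distance using only the coordinates observed in `record` (None = missing)."""
    return sum((x - c) ** 2 for x, c in zip(record, center) if x is not None)


def assign(records, centers):
    """Label each record with the index of its nearest center under the partial distance."""
    return [
        min(range(len(centers)), key=lambda k: partial_sq_dist(r, centers[k]))
        for r in records
    ]


def update(records, labels, centers):
    """Recompute each center coordinate as the mean of the observed member values."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [r for r, lab in zip(records, labels) if lab == k]
        center = []
        for j, old_value in enumerate(old):
            observed = [r[j] for r in members if r[j] is not None]
            # Keep the previous coordinate if no member observed this feature.
            center.append(sum(observed) / len(observed) if observed else old_value)
        new_centers.append(center)
    return new_centers


def cluster_incomplete(records, centers, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(records, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update(records, labels, centers)
    return labels, centers


# Hypothetical toy data: two groups, some coordinates missing.
records = [[0.0, 0.1], [0.2, None], [None, 0.0],
           [5.0, 5.1], [None, 4.9], [5.2, None]]
labels, centers = cluster_incomplete(records, [[0.0, 0.0], [5.0, 5.0]])
```

Initialization and the choice of the number of groups, both of which the abstract mentions, are left to the caller in this sketch.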
Multiple Imputation Ensembles (MIE) for dealing with missing data
Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random. First, we use a number of single and multiple imputation methods to recover the missing values, and then we ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation using dissimilarity measures. We also evaluate the MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach, which combines multiple imputation with ensemble techniques, outperforms the others, particularly as the amount of missing data increases.
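The pipeline described in this abstract (inject values Missing Completely at Random, impute several times, train a classifier per imputed copy, combine the predictions) can be sketched roughly as below. This is an illustrative toy, not the MIE method itself: the random hot-deck imputer, the nearest-centroid classifier, the majority vote, and every function name are assumptions chosen to keep the example within the Python standard library.

```python
import random
from collections import Counter


def make_mcar(rows, rate, rng):
    """Blank each feature value independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def hot_deck_impute(rows, rng):
    """Fill each missing value with a random observed value from the same column.

    Assumes every column has at least one observed value.
    """
    pools = [[v for v in col if v is not None] for col in zip(*rows)]
    return [
        [rng.choice(pools[j]) if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


def fit_centroids(rows, labels):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest to `row`."""
    return min(
        centroids,
        key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centroids[c])),
    )


def ensemble_predict(train_rows, train_labels, test_row, n_imputations=5, seed=0):
    """Impute the training data several times and majority-vote the predictions."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_imputations):
        completed = hot_deck_impute(train_rows, rng)
        votes[predict(fit_centroids(completed, train_labels), test_row)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes with two missing values.
train_rows = [[0.0, 0.2], [0.1, None], [0.3, 0.1], [0.2, 0.0],
              [10.0, 9.8], [None, 10.1], [9.9, 10.0], [10.2, 9.9]]
train_labels = ["a"] * 4 + ["b"] * 4
prediction = ensemble_predict(train_rows, train_labels, [0.5, 0.5])
```

Voting over imputed copies is closer in spirit to the bagging variant mentioned in the abstract; a stacking variant would instead train a second-level model on the base classifiers' outputs.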
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Clustering Patients with Tensor Decomposition
In this paper we present a method for the unsupervised clustering of
high-dimensional binary data, with a special focus on electronic healthcare
records. We present a robust and efficient heuristic to face this problem using
tensor decomposition. We explain why this approach is preferable to more
commonly used distance-based methods for tasks such as clustering patient
records. We run the algorithm on two datasets of healthcare
records, obtaining clinically meaningful results.
Comment: Presented at 2017 Machine Learning for Healthcare Conference (MLHC
2017). Boston, M