Bandwidth selection for kernel estimation in mixed multi-dimensional spaces
Kernel estimation techniques, such as mean shift, suffer from one major
drawback: selecting the kernel bandwidth. The bandwidth can be fixed for the
whole data set or can vary at each point. Automatic bandwidth selection becomes
a real challenge in the case of multidimensional heterogeneous features. This paper
presents a solution to this problem. It is an extension of \cite{Comaniciu03a},
which was based on the fundamental property of normal distributions regarding
the bias of the normalized density gradient. The selection is done iteratively
for each type of feature, by looking for the stability of local bandwidth
estimates across a predefined range of bandwidths. A pseudo balloon mean shift
filtering and partitioning are introduced. The validity of the method is
demonstrated in the context of color image segmentation based on a
5-dimensional space.
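To illustrate why the bandwidth matters, the following is a minimal sketch of plain mean shift with a single fixed Gaussian bandwidth. It is not the bandwidth selection method described in the abstract; the function name `mean_shift_mode`, the synthetic two-cluster data, and the two bandwidth values are assumptions chosen for illustration only.

```python
import numpy as np

def mean_shift_mode(points, start, bandwidth, max_iter=200, tol=1e-6):
    """Follow mean shift iterations from `start` towards a density mode.

    Hypothetical helper: uses one fixed, isotropic Gaussian bandwidth.
    """
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Gaussian kernel weights of every sample relative to the current x.
        sq_dist = np.sum((points - x) ** 2, axis=1)
        weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
        # The mean shift update is the kernel-weighted mean of the samples.
        new_x = weights @ points / weights.sum()
        if np.linalg.norm(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Synthetic data (assumption): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(200, 2)),
])

# Same starting point, two different bandwidths.
mode_small = mean_shift_mode(data, data[0], bandwidth=0.5)
mode_large = mean_shift_mode(data, data[0], bandwidth=10.0)
print(mode_small, mode_large)
```

With the small bandwidth the iteration stays near the cluster it started in, while the very large bandwidth smooths both clusters into a single mode between them. This sensitivity is the reason bandwidth selection is treated as the central difficulty.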
Sparse Probit Linear Mixed Model
Linear Mixed Models (LMMs) are important tools in statistical genetics. When
used for feature selection, they make it possible to find a sparse set of genetic traits
that best predict a continuous phenotype of interest, while simultaneously
correcting for various confounding factors such as age, ethnicity and
population structure. Formulated as models for linear regression, LMMs have
been restricted to continuous phenotypes. We introduce the Sparse Probit Linear
Mixed Model (Probit-LMM), where we generalize the LMM modeling paradigm to
binary phenotypes. As a technical challenge, the model no longer possesses a
closed-form likelihood function. In this paper, we present a scalable
approximate inference algorithm that lets us fit the model to high-dimensional
data sets. We show on three real-world examples from different domains that in
the setup of binary labels, our algorithm leads to better prediction accuracies
and also selects features which show less correlation with the confounding
factors. Comment: Published version, 21 pages, 6 figures
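As a rough sketch of why the likelihood loses its closed form, one generic way to write a probit model with correlated noise is shown below. This is not necessarily the paper's exact formulation; the symbols $\mathbf{w}$ (feature weights), $K$ (a similarity matrix capturing confounding structure), and the variance parameters $\sigma_g^2$, $\sigma_e^2$ are assumptions introduced for illustration.

```latex
\begin{align}
  y_i &= \operatorname{sign}\!\left(\mathbf{x}_i^\top \mathbf{w} + \epsilon_i\right),
      \qquad i = 1, \dots, n, \\
  \boldsymbol{\epsilon} &\sim \mathcal{N}(\mathbf{0}, \Sigma),
      \qquad \Sigma = \sigma_g^2 K + \sigma_e^2 I, \\
  P(\mathbf{y} \mid X, \mathbf{w}) &=
      \int_{\{\mathbf{z} \,:\, y_i z_i > 0 \ \forall i\}}
      \mathcal{N}(\mathbf{z};\, X\mathbf{w},\, \Sigma)\, d\mathbf{z}.
\end{align}
```

Because $\Sigma$ is not diagonal, the integral over the orthant does not factorize across samples, so it has no closed form in general and approximate inference is needed. Under this sketch, sparsity in $\mathbf{w}$ would typically be encouraged with an $\ell_1$-type penalty or prior.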
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.