29,673 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Learning with Clustering Structure
We study supervised learning problems using clustering constraints to impose
structure on either features or samples, seeking to help both prediction and
interpretation. The problem of clustering features arises naturally in text
classification for instance, to reduce dimensionality by grouping words
together and identify synonyms. The sample clustering problem on the other
hand, applies to multiclass problems where we are allowed to make multiple
predictions and the performance of the best answer is recorded. We derive a
unified optimization formulation highlighting the common structure of these
problems and produce algorithms whose core iteration complexity amounts to a
k-means clustering step, which can be approximated efficiently. We extend these
results to combine sparsity and clustering constraints, and develop a new
projection algorithm on the set of clustered sparse vectors. We prove
convergence of our algorithms on random instances, based on a union of
subspaces interpretation of the clustering structure. Finally, we test the
robustness of our methods on artificial data sets as well as real data
extracted from movie reviews.Comment: Completely rewritten. New convergence proofs in the clustered and
sparse clustered case. New projection algorithm on sparse clustered vector
A new kernel method for hyperspectral image feature extraction
Hyperspectral image provides abundant spectral information for remote discrimination of subtle differences in ground covers. However, the increasing spectral dimensions, as well as the information redundancy, make the analysis and interpretation of hyperspectral images a challenge. Feature extraction is a very important step for hyperspectral image processing. Feature extraction methods aim at reducing the dimension of data, while preserving as much information as possible. Particularly, nonlinear feature extraction methods (e.g. kernel minimum noise fraction (KMNF) transformation) have been reported to benefit many applications of hyperspectral remote sensing, due to their good preservation of high-order structures of the original data. However, conventional KMNF or its extensions have some limitations on noise fraction estimation during the feature extraction, and this leads to poor performances for post-applications. This paper proposes a novel nonlinear feature extraction method for hyperspectral images. Instead of estimating noise fraction by the nearest neighborhood information (within a sliding window), the proposed method explores the use of image segmentation. The approach benefits both noise fraction estimation and information preservation, and enables a significant improvement for classification. Experimental results on two real hyperspectral images demonstrate the efficiency of the proposed method. Compared to conventional KMNF, the improvements of the method on two hyperspectral image classification are 8 and 11%. This nonlinear feature extraction method can be also applied to other disciplines where high-dimensional data analysis is required
- …