Search CORE

5,378 research outputs found

A CLUE for CLUster Ensembles

Author: Kurt Hornik
Publication venue
Publication date
Field of study

Cluster ensembles are collections of individual solutions to a given clustering problem which are useful or necessary to consider in a wide range of applications. The R package clue provides an extensible computational environment for creating and analyzing cluster ensembles, with basic data structures for representing partitions and hierarchies, and facilities for computing on these, including methods for measuring proximity and obtaining consensus and "secondary" clusterings.

Research Papers in Economics

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California

Ensemble Clustering for Biological Datasets

Author: Harun Pirim
Şadi Evren Şeker
Publication venue: 'IntechOpen'
Publication date: 28/11/2012
Field of study

IntechOpen

Clustering ensemble method

Author: Alqurashi Tahani
Wang Wenjia
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 16/01/2018
Field of study

A clustering ensemble aims to combine multiple clustering models to produce a better result than that of the individual clustering algorithms in terms of consistency and quality. In this paper, we propose a clustering ensemble algorithm with a novel consensus function named Adaptive Clustering Ensemble. It employs two similarity measures, cluster similarity and a newly defined membership similarity, and works adaptively through three stages. The first stage is to transform the initial clusters into a binary representation, and the second is to aggregate the initial clusters that are most similar based on the cluster similarity measure between clusters. This iterates itself adaptively until the intended candidate clusters are produced. The third stage is to further refine the clusters by dealing with uncertain objects to produce an improved final clustering result with the desired number of clusters. Our proposed method is tested on various real-world benchmark datasets and its performance is compared with other state-of-the-art clustering ensemble methods, including the Co-association method and the Meta-Clustering Algorithm. The experimental results indicate that on average our method is more accurate and more efficient

University of East Anglia digital repository

Recommended from our members

Transcript-indexed ATAC-seq for precision immune profiling.

Author: Buenrostro Jason D
Chang Howard Y
Corces M Ryan
Davis Mark M
Gennert David G
Granja Jeffrey M
Greenleaf William J
Khavari Paul A
Khodadoust Michael S
Kim Youn H
Lareau Caleb A
Li Rui
Mumbach Maxwell R
Parker Kevin R
Qi Yanyan
Rubin Adam J
Saligrama Naresha
Satpathy Ansuman T
Schep Alicia N
Serratelli William S
Wei Yuning
Wu Beijing
Publication venue: eScholarship, University of California
Publication date: 01/05/2018
Field of study

T cells create vast amounts of diversity in the genes that encode their T cell receptors (TCRs), which enables individual clones to recognize specific peptide-major histocompatibility complex (MHC) ligands. Here we combined sequencing of the TCR-encoding genes with assay for transposase-accessible chromatin with sequencing (ATAC-seq) analysis at the single-cell level to provide information on the TCR specificity and epigenomic state of individual T cells. By using this approach, termed transcript-indexed ATAC-seq (T-ATAC-seq), we identified epigenomic signatures in immortalized leukemic T cells, primary human T cells from healthy volunteers and primary leukemic T cells from patient samples. In peripheral blood CD4+ T cells from healthy individuals, we identified cis and trans regulators of naive and memory T cell states and found substantial heterogeneity in surface-marker-defined T cell populations. In patients with a leukemic form of cutaneous T cell lymphoma, T-ATAC-seq enabled identification of leukemic and nonleukemic regulatory pathways in T cells from the same individual by allowing separation of the signals that arose from the malignant clone from the background T cell noise. Thus, T-ATAC-seq is a new tool that enables analysis of epigenomic landscapes in clonal T cells and should be valuable for studies of T cell malignancy, immunity and immunotherapy

eScholarship - University of California

Data-Driven Representation Learning in Multimodal Feature Fusion

Author
Publication venue
Publication date: 01/01/2018
Field of study

abstract: Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is at the core to achieve improved model robustness and inferencing performance. This dissertation focuses on the representation learning approaches as the fusion strategy. Specifically, the objective is to learn the shared latent representation which jointly exploit the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction. We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described to support both multiple sensors and descriptors for activity recognition. Targeted to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision etc. Utilizing the MKL formulation, next we describe an auto-context algorithm for learning image context via the fusion with low-level descriptors. Furthermore, a principled fusion algorithm using deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems. In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently, special design of the learning architecture is needed. In order to improve the temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare. Another model coupled with triplet ranking loss as metric learning framework is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having a lower computational complexity. Finally, in order to perform community detection on multilayer graphs, a fusion algorithm is described to derive node embedding from word embedding techniques and also exploit the complementary relational information contained in each layer of the graph.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201

ASU Digital Repository

Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

Author: Korenblum Daniel
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2018
Field of study

Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provable optimal recovery using the algorithm is analytically shown for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed.Comment: 13 figures, 35 reference

arXiv.org e-Print Archive

Directory of Open Access Journals

Gene prioritization and clustering by multi-view text mining

Author: De Moor Bart
Moreau Yves
Tranchevent Leon-Charles
Yu Shi
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Text mining has become a useful tool for biologists trying to understand the genetics of diseases. In particular, it can help identify the most interesting candidate genes for a disease for further experimental analysis. Many text mining approaches have been introduced, but the effect of disease-gene identification varies in different text mining models. Thus, the idea of incorporating more text mining models may be beneficial to obtain more refined and accurate knowledge. However, how to effectively combine these models still remains a challenging question in machine learning. In particular, it is a non-trivial issue to guarantee that the integrated model performs better than the best individual model. Results We present a multi-view approach to retrieve biomedical knowledge using different controlled vocabularies. These controlled vocabularies are selected on the basis of nine well-known bio-ontologies and are applied to index the vast amounts of gene-based free-text information available in the MEDLINE repository. The text mining result specified by a vocabulary is considered as a view and the obtained multiple views are integrated by multi-source learning algorithms. We investigate the effect of integration in two fundamental computational disease gene identification tasks: gene prioritization and gene clustering. The performance of the proposed approach is systematically evaluated and compared on real benchmark data sets. In both tasks, the multi-view approach demonstrates significantly better performance than other comparing methods. Conclusions In practical research, the relevance of specific vocabulary pertaining to the task is usually unknown. In such case, multi-view text mining is a superior and promising strategy for text-based disease gene identification.</p

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central