
    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
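    To make two of these challenges concrete, here is a minimal sketch (not from the review, and on synthetic data) of how missing values and class imbalance are commonly handled in an early-integration multi-omics pipeline with scikit-learn; all names and sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Toy early-integration matrix: 100 samples x (50 expression + 30 methylation)
# features concatenated, with roughly 10% of the values missing.
X = rng.normal(size=(100, 80))
X[rng.random(X.shape) < 0.10] = np.nan

# Imbalanced binary phenotype: roughly 10% positives.
y = (rng.random(100) < 0.10).astype(int)

# Mean imputation addresses the missing data; class_weight="balanced"
# reweights the minority class instead of resampling it.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X, y)
print(model.predict_proba(X[:5]))
```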

    Factors affecting the effectiveness of biomedical document indexing and retrieval based on terminologies

    The aim of this work is to evaluate a set of indexing and retrieval strategies based on the integration of several biomedical terminologies, using the available TREC Genomics collections for an ad hoc information retrieval (IR) task.
    Materials and methods: We propose a multi-terminology concept extraction approach that selects the best concepts from free text by means of voting techniques. We instantiate this general approach on four terminologies (MeSH, SNOMED, ICD-10, and GO). We particularly focus on the effect of integrating terminologies into a biomedical IR process, and on the utility of voting techniques for combining the concepts extracted from each document into a single list of unique concepts.
    Results: Experimental studies conducted on the TREC Genomics collections show that our multi-terminology IR approach based on voting techniques yields statistically significant improvements over the baseline. For example, tested on the 2005 TREC Genomics collection, our multi-terminology IR approach provides an improvement rate of +6.98% in terms of MAP (mean average precision) (p < 0.05) compared to the baseline. In addition, our experimental results show that document expansion using preferred terms, combined with query expansion using terms from top-ranked expanded documents, improves biomedical IR effectiveness.
    Conclusion: We have evaluated several voting models for combining concepts issued from multiple terminologies. Through this study, we presented several factors affecting the effectiveness of a biomedical IR system, including term weighting, query expansion, and document expansion models. An appropriate combination of these factors can improve IR performance.
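    As an illustration of the voting idea, here is a hedged sketch of a simple count-based voting scheme over concepts extracted by several terminologies; the paper evaluates more elaborate voting models, and the extraction results below are hypothetical.

```python
from collections import Counter

def vote_concepts(extractions, min_votes=2):
    """extractions: dict mapping terminology name -> set of extracted concepts.
    Returns concepts with at least `min_votes` votes, most-voted first."""
    votes = Counter()
    for concepts in extractions.values():
        votes.update(set(concepts))
    return [concept for concept, n in votes.most_common() if n >= min_votes]

# Hypothetical extraction results for a single document.
doc_extractions = {
    "MeSH":   {"Neoplasms", "Apoptosis", "p53"},
    "SNOMED": {"Neoplasms", "Apoptosis"},
    "ICD-10": {"Neoplasms"},
    "GO":     {"Apoptosis", "p53"},
}
print(vote_concepts(doc_extractions))  # concepts backed by two or more terminologies
```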

    Kernel methods for large-scale graph-based heterogeneous biological data integration

    The last decade has seen rapid growth in the volume and diversity of biological data, thanks to the development of high-throughput technologies, web services, and embedded systems. It is common for information related to a given biological phenomenon to be encoded in multiple data sources. On the one hand, this provides a great opportunity for biologists and data scientists to form more unified views of the phenomena of interest. On the other hand, it challenges scientists to find optimal ways to extract knowledge from such huge amounts of data, which normally cannot be done without the help of automated learning systems. There is therefore a strong need for smart learning systems that take multiple sources as input and support experts in forming and assessing hypotheses in biology and medicine. In such systems, the problem of combining multiple data sources, or data integration, must be solved efficiently to achieve high performance. Biological data can naturally be represented as graphs. By adopting graphs for data representation, we gain access to a solid and principled mathematical framework, and the problem of data integration becomes graph-based integration. In recent years, the machine learning community has witnessed tremendous growth in the development of kernel-based learning algorithms. Kernel methods use kernel functions to separate the representation of the data from the general learning algorithm, and kernel representations can be applied to any type of data, including trees, graphs, and vectors. For this reason, kernel methods are a reasonable and logical choice for graph-based inference systems. However, a number of challenges for graph-based systems using kernel methods need to be solved effectively, including the definition of node similarity measures, graph sparsity, scalability, efficiency, exploitation of complementary properties, and integration methods. The contributions of this thesis propose solutions to the challenges faced when constructing graph-based data integration learning systems.

    The first contribution is the definition of a decompositional graph node kernel, named the Conjunctive Disjunctive Node Kernel (CDNK), which measures the similarity between nodes of graphs. Unlike existing graph node kernels, which exploit only the topology of the graph, the proposed kernel also utilizes the information available on the graph nodes. In CDNK, the graph is first transformed into a set of linked connected components, in which we distinguish between "conjunctive" links, whose endpoints are in the same connected component, and "disjunctive" links, which connect nodes located in different connected components. The similarity between any pair of nodes is then measured by employing a particular graph kernel on the two neighborhood subgraphs rooted at each node. Finally, side information is integrated by convolving the discrete information with the real-valued vectors associated with the graph nodes. Empirical evaluation shows that the kernel performs better than state-of-the-art graph node kernels.

    The second contribution deals with the graph sparsity problem. When working with sparse graphs, i.e., graphs with a high number of missing links, the available information is not sufficient for effective learning. One idea to overcome this problem is link enrichment, which enriches the information in the graph. However, the performance of link enrichment strongly depends on the adopted link prediction method. We therefore propose an effective link prediction method (JNSL), in which each link is first represented as a joint neighborhood subgraph and link prediction is then treated as binary classification. We empirically show that the proposed link prediction method outperforms various others. We also present a method to boost the performance of diffusion-based kernels, which are the most widely used, by coupling kernel methods with link enrichment. Experimental results show that the performance of diffusion-based graph node kernels is considerably improved by link enrichment.

    The last contribution proposes a general kernel-based framework for graph integration that we name Graph-one. Graph-one is designed to overcome the challenges of graph integration: in particular, it is scalable and efficient, and it can handle unbalanced settings where the numbers of positive and negative instances differ greatly. Numerous variations of Graph-one are evaluated in the context of disease gene prioritization. The experimental results illustrate the power of the proposed framework: Graph-one performs better than various other methods, and Graph-one with data integration achieves higher results than with any single data source, demonstrating its effectiveness in exploiting the complementary properties of integrated graphs.
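    To illustrate the "link prediction as binary classification" framing behind JNSL, here is a rough sketch in which simple neighbourhood-overlap features stand in for the joint neighborhood subgraph representation used in the thesis; the graph, features, and classifier choice are assumptions made for the example.

```python
import itertools
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(G, u, v):
    """Neighbourhood-overlap features for a candidate link (u, v):
    common neighbours, Jaccard coefficient, preferential attachment."""
    nu, nv = set(G[u]), set(G[v])
    common = len(nu & nv)
    union = len(nu | nv) or 1
    return [common, common / union, len(nu) * len(nv)]

G = nx.karate_club_graph()
pos = list(G.edges())                           # existing links are the positives
non_edges = [p for p in itertools.combinations(G, 2) if not G.has_edge(*p)]
rng = np.random.default_rng(0)
neg = [non_edges[i] for i in rng.choice(len(non_edges), size=len(pos), replace=False)]

X = np.array([pair_features(G, u, v) for u, v in pos + neg])
y = np.array([1] * len(pos) + [0] * len(neg))
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([pair_features(G, 0, 33)]))  # score one candidate link
```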

    Graphlet-adjacencies provide complementary views on the functional organisation of the cell and cancer mechanisms

    Recent biotechnological advances have led to a wealth of biological network data. Topological analysis of these networks (i.e., the analysis of their structure) has led to breakthroughs in biology and medicine. The state-of-the-art topological node and network descriptors are based on graphlets, induced connected subgraphs of different shapes (e.g., paths, triangles). However, current graphlet-based methods ignore neighbourhood information (i.e., what nodes are connected). Therefore, to capture topology and connectivity information simultaneously, I introduce graphlet adjacency, which considers two nodes adjacent based on their frequency of co-occurrence on a given graphlet. I use graphlet adjacency to generalise spectral methods and apply these to molecular networks. I show that, depending on the chosen graphlet, graphlet spectral clustering uncovers clusters enriched in different biological functions, and graphlet diffusion of gene mutation scores predicts different sets of cancer driver genes. This demonstrates that graphlet adjacency captures topology-function and topology-disease relationships in molecular networks. To further detail these relationships, I take a pathway-focused approach. To enable this investigation, I introduce graphlet eigencentrality to compute the importance of a gene in a pathway, either from the local pathway perspective or from the global network perspective. I show that pathways are best described by the graphlet adjacencies that capture the importance of their functionally critical genes. I also show that cancer driver genes characteristically perform hub roles between pathways. Given the latter finding, I hypothesise that cancer pathways should be identified by changes in their pathway-pathway relationships. Within this context, I propose pathway-driven non-negative matrix tri-factorisation (PNMTF), which fuses molecular network data and pathway annotations to learn an embedding space that captures the organisation of a network as a composition of subnetworks. In this space, I measure the functional importance of a pathway or gene in the cell and its functional disruption in cancer. I apply this method to predict genes and pathways involved in four major cancers. By using graphlet adjacency, I can exploit the tendency of cancer-related genes to perform hub roles to improve prediction accuracy.
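    As a concrete illustration of graphlet adjacency for the triangle graphlet, the sketch below weights each node pair by the number of triangles on which the two nodes co-occur and feeds the resulting matrix to off-the-shelf spectral clustering; the thesis covers many more graphlet shapes and orbit distinctions, so this is only an assumed minimal variant.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import SpectralClustering

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

# For a pair (u, v), (A @ A)[u, v] counts the common neighbours of u and v;
# masking by A keeps only connected pairs, so the entry equals the number of
# triangles containing the edge (u, v), i.e. triangle co-occurrence.
A_triangle = (A @ A) * A

# A tiny uniform floor keeps nodes that lie on no triangle connected, so the
# normalised Laplacian used by spectral clustering stays well defined.
affinity = A_triangle + 1e-6

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)
```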

    Compact Integration of Multi-Network Topology for Functional Analysis of Genes

    The topological landscape of molecular or functional interaction networks provides a rich source of information for inferring functional patterns of genes or proteins. However, a pressing yet-unsolved challenge is how to combine multiple heterogeneous networks, each having different connectivity patterns, to achieve more accurate inference. Here, we describe the Mashup framework for scalable and robust network integration. In Mashup, the diffusion in each network is first analyzed to characterize the topological context of each node. Next, the high-dimensional topological patterns in individual networks are canonically represented using low-dimensional vectors, one per gene or protein. These vectors can then be plugged into off-the-shelf machine learning methods to derive functional insights about genes or proteins. We present tools based on Mashup that achieve state-of-the-art performance in three diverse functional inference tasks: protein function prediction, gene ontology reconstruction, and genetic interaction prediction. Mashup enables deeper insights into the structure of rapidly accumulating and diverse biological network data and can be broadly applied to other network science domains. Keywords: interactome analysis; network integration; heterogeneous networks; dimensionality reduction; network diffusion; gene function prediction; genetic interaction prediction; gene ontology reconstruction; drug response prediction. Funding: National Institutes of Health (U.S.), Grant R01GM081871.
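    A rough sketch of the Mashup pipeline on toy data follows: a random-walk-with-restart diffusion per network, then one shared low-dimensional vector per node. Note that Mashup itself fits the embedding with a multinomial logistic model, so the truncated SVD here is a stand-in, and all sizes are illustrative.

```python
import numpy as np

def rwr(A, restart=0.5):
    """Random walk with restart on adjacency A; column i holds the
    stationary visit probabilities for walks restarting at node i."""
    P = A / A.sum(axis=0, keepdims=True)        # column-stochastic transitions
    n = A.shape[0]
    return restart * np.linalg.inv(np.eye(n) - (1 - restart) * P)

rng = np.random.default_rng(0)
networks = []
for _ in range(3):                              # three toy networks over the same 50 genes
    M = (rng.random((50, 50)) < 0.1).astype(float)
    networks.append(np.maximum(M, M.T))         # symmetrise

# Stack the per-network diffusion states, then derive one shared
# low-dimensional representation per gene via a truncated SVD.
Q = np.vstack([rwr(A + np.eye(50)) for A in networks])  # self-loops avoid empty columns
U, S, Vt = np.linalg.svd(Q, full_matrices=False)
embedding = (Vt[:10] * S[:10, None]).T          # 50 genes x 10 dimensions
print(embedding.shape)
```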

    Learning by Fusing Heterogeneous Data

    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in entirely different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases there exists some correspondence among certain input dimensions of the different input spaces. In our work we found that a key bottleneck preventing us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities, and tasks. We develop accurate data fusion methods for predictive modeling that reduce or entirely eliminate some of the basic feature engineering steps previously needed when inferring prediction models from disparate data. Our work has a wide range of applications, of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further study, mine disease associations, detect drug toxicity, and regress cancer patient survival data. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good data fusion performance in certain applications. Our recent work includes the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. We also model epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of this thesis is also the analysis of large data compendia, as predictions about certain phenomena, such as associations between diseases or the involvement of genes in a certain phenotype, are only possible with large amounts of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model to date. We also make observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets.
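    To make the collective latent factor idea concrete, here is a minimal sketch in which two toy relation matrices share a single gene factor, so signal flows between data sources through the shared parameters; the actual models in this work are richer (e.g., matrix tri-factorization with relational structure), and all data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_diseases, n_drugs, k = 40, 15, 10, 5
R1 = rng.random((n_genes, n_diseases))          # toy gene-disease associations
R2 = rng.random((n_genes, n_drugs))             # toy gene-drug associations

# One latent factor per entity type; G is shared by both relations.
G = rng.random((n_genes, k)) * 0.1
D = rng.random((n_diseases, k)) * 0.1
H = rng.random((n_drugs, k)) * 0.1

lr = 0.01
for _ in range(500):                            # joint gradient descent on both fits
    E1 = R1 - G @ D.T                           # residuals of each relation
    E2 = R2 - G @ H.T
    G += lr * (E1 @ D + E2 @ H)                 # the shared factor sees both sources
    D += lr * (E1.T @ G)
    H += lr * (E2.T @ G)

print(np.linalg.norm(R1 - G @ D.T), np.linalg.norm(R2 - G @ H.T))
```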