825 research outputs found

    Kernel methods for large-scale graph-based heterogeneous biological data integration

    Get PDF
    The last decade has seen rapid growth in the volume and diversity of biological data, thanks to the development of high-throughput technologies, web services, and embedded systems. Information related to a given biological phenomenon is commonly encoded in multiple data sources. On the one hand, this offers biologists and data scientists a more unified view of the phenomenon of interest. On the other hand, it challenges scientists to find ways to extract knowledge from such huge amounts of data, which normally cannot be done without the help of automated learning systems. There is therefore a strong need for smart learning systems that take multiple sources as input and support experts in forming and assessing hypotheses in biology and medicine. In such systems, the problem of combining multiple data sources, or data integration, must be solved efficiently to achieve high performance. Biological data can naturally be represented as graphs. By adopting graphs for data representation, we gain access to a solid and principled mathematical framework, and the problem of data integration becomes one of graph-based integration. In recent years, the machine learning community has witnessed tremendous growth in the development of kernel-based learning algorithms. Kernel methods use kernel functions to separate the representation of the data from the general learning algorithm, and kernel representations can be applied to any type of data, including trees, graphs, and vectors. For this reason, kernel methods are a reasonable and logical choice for graph-based inference systems. However, a number of challenges for graph-based systems using kernel methods need to be effectively solved, including the definition of node similarity measures, graph sparsity, scalability, efficiency, the exploitation of complementary properties, and integration methods. The contributions of this thesis propose solutions to the challenges faced when constructing graph-based data integration learning systems. The first contribution is the definition of a decompositional graph node kernel, named the Conjunctive Disjunctive Node Kernel (CDNK), which measures the similarities between nodes of graphs. Unlike existing graph node kernels, which exploit only the topology of the graph, the proposed kernel also utilizes the information available on the graph nodes. In CDNK, the graph is first transformed into a set of linked connected components, in which we distinguish between "conjunctive" links, whose endpoints are in the same connected component, and "disjunctive" links, which connect nodes located in different connected components. The similarity between any pair of nodes is then measured by employing a particular graph kernel on the two neighborhood subgraphs rooted at each node. Finally, side information is integrated by applying a convolution of the discrete information with the real-valued vectors associated with the graph nodes. Empirical evaluation shows that the kernel performs better than state-of-the-art graph node kernels. The second contribution addresses the graph sparsity problem. When working with sparse graphs, i.e. graphs with a high number of missing links, the available information is not sufficient for effective learning.
    One way to overcome this problem is link enrichment, which enriches the information in the graph. However, the performance of link enrichment strongly depends on the adopted link prediction method. We therefore propose an effective link prediction method, JNSL, in which each link is first represented as a joint neighborhood subgraph and link prediction is then treated as a binary classification task. We empirically show that the proposed link prediction method outperforms various other methods. We also present a method to boost the performance of diffusion-based kernels, which are the most popularly used, by coupling kernel methods with link enrichment. Experimental results show that the performance of diffusion-based graph node kernels is considerably improved by link enrichment. The last contribution proposes a general kernel-based framework for graph integration that we name Graph-one. Graph-one is designed to overcome the challenges of graph integration: it is scalable and efficient, and it is able to deal with unbalanced settings, where the numbers of positive and negative instances differ greatly. Numerous variations of Graph-one are evaluated in the context of disease gene prioritization. The experimental results illustrate the power of the proposed framework: Graph-one performs better than various existing methods, and Graph-one with data integration achieves higher results than with any single data source, demonstrating its effectiveness in exploiting the complementary properties of integrated graphs.
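
    As a concrete illustration of the diffusion-based graph node kernels the abstract refers to, the following minimal Python sketch computes the Laplacian exponential diffusion kernel, one of the most common members of this family; it is not the thesis's CDNK or JNSL, and the toy graph and parameter value are illustrative assumptions.

        import numpy as np
        from scipy.linalg import expm

        def exponential_diffusion_kernel(A, beta=0.5):
            # Laplacian exponential diffusion kernel K = exp(-beta * L).
            # A: symmetric adjacency matrix; beta controls diffusion locality.
            D = np.diag(A.sum(axis=1))   # degree matrix
            L = D - A                    # combinatorial graph Laplacian
            return expm(-beta * L)       # node-node similarity matrix

        # Toy example: a 4-node path graph.
        A = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
        K = exponential_diffusion_kernel(A)
        print(K[0, 1] > K[0, 3])  # True: closer nodes are more similar

    In this picture, link enrichment amounts to adding predicted edges to A before computing K, which densifies the diffusion paths on sparse graphs.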

    On the Effect of Semantically Enriched Context Models on Software Modularization

    Full text link
    Many existing approaches to program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for system modularization; it relies on the informal semantics of the program, encoded in the vocabulary used in the source code. However, treating the source code as a mere collection of tokens loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used both for deriving the topics that run through the system and for clustering. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct a contextual vector representation of the source code. The second notion of context is defined based on the flow of data between identifiers: a module is represented as a dependency graph whose nodes correspond to identifiers and whose edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open source Java projects and show that introducing contexts for identifiers improves the quality of the modularization of the software systems. Both context models give results superior to the plain vector representation of documents; in some cases, the authoritativeness of decompositions is improved by 67%. Furthermore, a more detailed evaluation of our approach on JEdit, an open source editor, demonstrates that topics inferred by performing topic analysis on the contextual representations are more meaningful than those from the plain representation of the documents. The proposed approach of introducing a context model for source code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization, and topic analysis.
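
    The data-flow notion of context lends itself to a simple sketch: identifiers become nodes of a dependency graph, and candidate modules can be recovered by clustering that graph. The sketch below, with hypothetical identifier names and networkx community detection standing in for the paper's actual pipeline, only illustrates the idea.

        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        # Hypothetical data-flow edges: (a, b) means data flows from a to b.
        edges = [("request", "parser"), ("parser", "token"),
                 ("token", "response"), ("user", "session"),
                 ("session", "account"), ("account", "user")]

        G = nx.Graph(edges)  # identifier dependency graph

        # Group identifiers into candidate modules from structure alone.
        for k, module in enumerate(greedy_modularity_communities(G)):
            print("module", k, sorted(module))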

    A multi-species functional embedding integrating sequence and network structure

    Full text link
    A key challenge in transferring knowledge between species is that different species have fundamentally different genetic architectures. Initial computational approaches to transferring knowledge across species relied on measures of heredity such as genetic homology, but these approaches suffer from limitations: first, only a small subset of genes have homologs, limiting the amount of knowledge that can be transferred, and second, genes change or repurpose functions, complicating the transfer of knowledge. Many approaches address this problem by expanding the notion of homology, leveraging high-throughput genomic and proteomic measurements, for example through network alignment. In this work, we take a new approach to transferring knowledge across species by expanding the notion of homology through explicit measures of functional similarity between proteins in different species. Specifically, our kernel-based method, HANDL (Homology Assessment across Networks using Diffusion and Landmarks), integrates sequence and network structure to create a functional embedding in which proteins from different species are embedded in the same vector space. We show that inner products in this space, and the vectors themselves, capture functional similarity across species and are useful for a variety of functional tasks. We present the first whole-genome method for predicting phenologs, recovering many that were previously identified and also predicting new phenologs supported by the biological literature. We also demonstrate that the HANDL embedding captures pairwise gene function, in that gene pairs with synthetic lethal interactions are significantly separated in HANDL space, and the direction of separation is conserved across species. Software for the HANDL algorithm is available at http://bit.ly/lrgr-handl.
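
    The following sketch illustrates the general diffusion-plus-landmarks idea behind a HANDL-style embedding, under simplifying assumptions (a regularized-Laplacian diffusion kernel and toy networks); the actual HANDL construction should be taken from the published software.

        import numpy as np

        def regularized_laplacian(A, lam=0.05):
            # Diffusion-style kernel K = (I + lam * L)^(-1) on one network.
            L = np.diag(A.sum(axis=1)) - A
            return np.linalg.inv(np.eye(len(A)) + lam * L)

        def landmark_embedding(K, landmark_idx):
            # Embed each protein as its diffusion scores to landmark proteins.
            # Using the same ordered list of homologous landmark pairs in both
            # species places both species' vectors in one shared space.
            return K[:, landmark_idx]

        # Toy networks for two species; landmarks are assumed homolog pairs.
        A1 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)
        A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
        X1 = landmark_embedding(regularized_laplacian(A1), [0, 2])
        X2 = landmark_embedding(regularized_laplacian(A2), [1, 2])
        sim = X1 @ X2.T   # cross-species similarity via inner products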

    Information flow in interaction networks II: channels, path lengths and potentials

    Full text link
    In our previous publication, a framework for information flow in interaction networks based on random walks with damping was formulated, with two fundamental modes: emitting and absorbing. While many other network analysis methods based on random walks or equivalent notions have been developed before and after our earlier work, one can show that they can all be mapped to one of the two modes. In addition to these two fundamental modes, a major strength of our earlier formalism was its accommodation of context-specific directed information flow, which yielded plausible and meaningful biological interpretations of protein functions and pathways. However, the directed flow from origins to destinations was induced via a potential function that was heuristic. Here, with a theoretically sound approach called the channel mode, we extend our earlier work on directed information flow. This is achieved by constructing a potential function that facilitates a purely probabilistic interpretation of the channel mode. For each network node, the channel mode combines the solutions of the emitting and absorbing modes in the same context, producing what we call a channel tensor. The entries of the channel tensor at each node can be interpreted as the amount of flow passing through that node from an origin to a destination. As in our earlier model, the channel mode includes damping as a free parameter that controls the locality of information flow. Through examples involving the yeast pheromone response pathway, we illustrate the versatility and stability of our new framework.
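
    A minimal sketch of the damped random-walk machinery underlying these modes: with a row-stochastic transition matrix P and damping d < 1, the fundamental matrix F = (I - d*P)^(-1) gives expected visit counts, from which emitting- and absorbing-mode quantities (and hence channel-mode combinations) can be assembled. The network and damping value below are illustrative assumptions.

        import numpy as np

        def fundamental_matrix(A, damping=0.5):
            # F[s, t] = expected number of visits to node t by a damped walk
            # emitted at s; the walk dies with probability 1 - damping per step.
            P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transitions
            return np.linalg.inv(np.eye(len(A)) - damping * P)

        A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
        F = fundamental_matrix(A)
        # Larger damping -> longer walks -> less local information flow.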

    Nonparametric Feature Extraction from Dendrograms

    Full text link
    We propose feature extraction from dendrograms in a nonparametric way. Minimax distance measures correspond to building a dendrogram with the single-linkage criterion and defining specific forms of a level function and a distance function over it. We extend this method to arbitrary dendrograms and develop a generalized framework wherein different distance measures can be inferred from different types of dendrograms, level functions, and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, in order to enable many numerical machine learning algorithms to employ such distances. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances, respectively in solution space and in representation space, in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among the different solutions, in the context of ensemble methods, and then use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the sequential combination of different distances and features, in the spirit of multi-layered architectures, to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
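
    For intuition, the minimax distance the abstract starts from (the smallest possible value, over all paths between two nodes, of the largest edge weight on the path) is realized on a minimum spanning tree, which is what ties it to single-linkage dendrograms. A sketch under that assumption, using scipy and a recent networkx:

        import numpy as np
        import networkx as nx
        from scipy.sparse import csr_matrix
        from scipy.sparse.csgraph import minimum_spanning_tree

        def minimax_distances(W):
            # Minimax distance = min over paths of the max edge weight.
            # Every minimax path lies on a minimum spanning tree, so take,
            # for each pair, the largest edge on their unique tree path.
            T = nx.from_scipy_sparse_array(minimum_spanning_tree(csr_matrix(W)))
            n = W.shape[0]
            D = np.zeros((n, n))
            for i in range(n):
                for j in range(i + 1, n):
                    path = nx.shortest_path(T, i, j)
                    dmax = max(T[u][v]["weight"]
                               for u, v in zip(path, path[1:]))
                    D[i, j] = D[j, i] = dmax
            return D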

    Network diffusion methods for omics big bio data analytics and interpretation with application to cancer datasets

    Get PDF
    In current biomedical research, a fundamental step toward understanding the mechanisms at the root of a disease is the identification of disease modules, i.e. the subnetworks of the interactome (the network of protein-protein interactions) with a high number of genetic alterations. However, the incompleteness of the network and the high variability of the altered genes make this problem non-trivial. Physics-based methods that exploit the properties of diffusion processes on networks, which are the focus of this thesis, are the ones that achieve the best performance. In the first part of my work, I investigated the theory of diffusion and random walks on networks, finding interesting relations with clustering techniques and with other physical models whose dynamics are described by the Laplacian matrix. I then implemented a network-diffusion technique and applied it to gene expression and somatic mutation data from three different cancer types. The method is organized in two parts. After selecting a subset of the interactome nodes, we associate with each of them an initial quantity of information that reflects the degree of alteration of the gene. The diffusion algorithm propagates the initial information through the network, reaching the steady state after a transient. At that point, the amount of fluid at each node is used to build a ranking of the genes. In the second part, disease modules are identified by means of a network resampling procedure. The analysis allowed us to identify a considerable number of genes already known in the literature on the studied cancer types, as well as a set of other related genes that could be interesting candidates for further investigation. Finally, through a Gene Set Enrichment procedure, we tested the correlation of the identified modules with known biological pathways.
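
    A minimal sketch of the propagation step described above, assuming the common degree-normalized update f <- alpha * W f + (1 - alpha) * f0; the specific normalization and parameters used in the thesis may differ.

        import numpy as np

        def network_propagation(A, f0, alpha=0.7, tol=1e-8):
            # Propagate initial alteration scores f0 over the interactome
            # until the steady state; the stationary scores rank the genes.
            d = A.sum(axis=1)
            W = A / np.sqrt(np.outer(d, d))   # symmetric degree normalization
            f = f0.copy()
            while True:
                f_new = alpha * W @ f + (1 - alpha) * f0
                if np.abs(f_new - f).sum() < tol:
                    return f_new
                f = f_new

        A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
        f0 = np.array([1.0, 0.0, 0.0])        # gene 0 carries an alteration
        ranking = np.argsort(-network_propagation(A, f0))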

    Network-guided data integration and gene prioritization

    Get PDF

    Physics based supervised and unsupervised learning of graph structure

    Get PDF
    Graphs are central tools to aid our understanding of biological, physical, and social systems. Graphs also play a key role in representing and understanding the visual world around us, 3D shapes and 2D images alike. In this dissertation, I propose the use of physical and natural phenomena to understand graph structure. I investigate four phenomena or laws of nature: (1) Brownian motion, (2) Gauss's law, (3) feedback loops, and (4) neural synapses, to discover patterns in graphs.
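
    As a small illustration of the Brownian-motion view of graph structure (a sketch, not the dissertation's method), the stationary distribution of a simple random walk, the discrete analogue of Brownian motion on a graph, already exposes structural importance:

        import numpy as np

        def random_walk_stationary(A):
            # Long-run visit frequency of a simple random walk is
            # proportional to node degree on an undirected graph.
            d = A.sum(axis=1)
            return d / d.sum()

        A = np.array([[0, 1, 1, 1],
                      [1, 0, 0, 0],
                      [1, 0, 0, 0],
                      [1, 0, 0, 0]], float)   # star graph
        print(random_walk_stationary(A))       # hub dominates: [0.5, 1/6, ...]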