Kernel methods for large-scale graph-based heterogeneous biological data integration
The last decade has seen rapid growth in the volume and diversity of biological data, thanks to the development of high-throughput technologies, web services, and embedded systems. Information related to a given biological phenomenon is commonly encoded in multiple data sources. On the one hand, this gives biologists and data scientists a great opportunity to form more unified views of the phenomena of interest. On the other hand, it challenges scientists to find optimal ways to extract knowledge from such huge amounts of data, which normally cannot be done without the help of automated learning systems. There is therefore a pressing need to develop smart learning systems that take multiple sources as input, to support experts in forming and assessing hypotheses in biology and medicine. In these systems, the problem of combining multiple data sources, i.e. data integration, needs to be solved efficiently to achieve high performance. Biological data can naturally be represented as graphs. By adopting graphs for data representation, we gain access to a solid and principled mathematical framework, and the problem of data integration becomes one of graph-based integration. In recent years, the machine learning community has witnessed tremendous growth in the development of kernel-based learning algorithms. Kernel methods use kernel functions to separate the representation of the data from the general learning algorithm, and kernel representations can be applied to any type of data, including trees, graphs, and vectors.
For this reason, kernel methods are a reasonable and logical choice for graph-based inference systems. However, a number of challenges for graph-based systems using kernel methods must be solved effectively, including the definition of node similarity measures, graph sparsity, scalability, efficiency, the exploitation of complementary properties, and the choice of integration method.
The contributions of this thesis propose solutions that overcome the challenges faced when constructing graph-based data integration learning systems.
The first contribution is the definition of a decompositional graph node kernel, named the Conjunctive Disjunctive Node Kernel (CDNK), which measures the similarity between nodes of a graph. Unlike existing graph node kernels, which exploit only the topology of the graph, the proposed kernel also utilizes the information available on the graph nodes. In CDNK, the graph is first transformed into a set of linked connected components, in which we distinguish between “conjunctive” links, whose endpoints are in the same connected component, and “disjunctive” links, which connect nodes located in different connected components. The similarity between any pair of nodes is then measured by employing a particular graph kernel on the two neighborhood subgraphs rooted at each node. Finally, the kernel integrates side information by convolving the discrete information with the real-valued vectors associated with the graph nodes. Empirical evaluation shows that the kernel performs better than state-of-the-art graph node kernels.
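As a hedged illustration of the general idea, measuring node similarity by comparing neighborhood subgraphs rooted at each node, the following Python sketch extracts a fixed-radius neighborhood around each node and compares crude degree-based feature maps. This is not the actual CDNK (which uses a decompositional graph kernel and distinguishes conjunctive from disjunctive links); the adjacency format and the feature map are illustrative assumptions.

```python
from collections import Counter, deque

def neighborhood(adj, root, radius):
    """Collect all nodes within `radius` hops of `root` (plain BFS)."""
    seen = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        if seen[u] == radius:
            continue
        for v in adj.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                q.append(v)
    return set(seen)

def subgraph_features(adj, nodes):
    """Crude feature map: multiset of node degrees inside the subgraph."""
    return Counter(sum(1 for v in adj.get(u, ()) if v in nodes) for u in nodes)

def node_similarity(adj, a, b, radius=2):
    """Kernel-style similarity: dot product of the two feature maps."""
    fa = subgraph_features(adj, neighborhood(adj, a, radius))
    fb = subgraph_features(adj, neighborhood(adj, b, radius))
    return sum(fa[k] * fb[k] for k in fa)

# toy undirected graph given as an adjacency dict
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
sim = node_similarity(adj, 0, 1)
```

Because the comparison is a dot product of feature maps, the resulting similarity is symmetric, as a kernel must be.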
The second contribution addresses the graph sparsity problem. When working with sparse graphs, i.e. graphs with a high number of missing links, the available information is not sufficient for effective learning. One idea to overcome this problem is to use link enrichment to enrich the information in the graph. However, the performance of link enrichment depends strongly on the adopted link prediction method. We therefore propose an effective link prediction method (JNSL). In this method, each link is first represented as a joint neighborhood subgraph; link prediction is then cast as a binary classification problem. We show empirically that the proposed link prediction method outperforms various others. We also present a method to boost the performance of diffusion-based kernels, the most widely used class, by coupling kernel methods with link enrichment. Experimental results show that the performance of diffusion-based graph node kernels improves considerably with link enrichment.
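A minimal sketch of the "link prediction as binary classification" idea: each candidate link is mapped to a feature vector built from the joint neighborhood of its endpoints, and a classifier decides whether the link should exist. The features (common-neighbor count and Jaccard overlap) and the fixed threshold below are illustrative stand-ins for the joint-neighborhood-subgraph features and the learned classifier used by JNSL.

```python
def pair_features(adj, u, v):
    """Hypothetical feature map for a candidate link (u, v):
    common-neighbor count and Jaccard overlap of the two neighborhoods."""
    nu, nv = set(adj.get(u, ())), set(adj.get(v, ()))
    common, union = nu & nv, nu | nv
    return [len(common), len(common) / len(union) if union else 0.0]

def predict_link(adj, u, v, threshold=0.3):
    """Binary decision on a candidate edge. In the thesis a trained
    classifier plays this role; a fixed Jaccard threshold is only a
    stand-in for illustration."""
    return pair_features(adj, u, v)[1] >= threshold

# toy graph: nodes 2 and 3 share all their neighbors, 0 and 4 share none
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1], 3: [0, 1], 4: [5], 5: [4]}
```

In a full link-enrichment pipeline, edges predicted positive would be added to the sparse graph before computing the diffusion-based node kernel.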
The last contribution proposes a general kernel-based framework for graph integration, which we name Graph-one. Graph-one is designed to overcome the challenges of graph integration. In particular, it is a scalable and efficient framework, and it is able to deal with unbalanced settings, where the numbers of positive and negative instances differ greatly. Numerous variants of Graph-one are evaluated in a disease gene prioritization context. The experimental results illustrate the power of the proposed framework: Graph-one outperforms various other methods, and Graph-one with data integration achieves better results than with any single data source, demonstrating its effectiveness in exploiting the complementary properties of the integrated graphs.
On the Effect of Semantically Enriched Context Models on Software Modularization
Many of the existing approaches for program comprehension rely on the
linguistic information found in source code, such as identifier names and
comments. Semantic clustering is one such technique for modularization of a
system; it relies on the informal semantics of the program, encoded in the
vocabulary used in the source code. Treating the source code as a collection of
tokens loses the semantic information embedded within the identifiers. We try
to overcome this problem by introducing context models for source code
identifiers to obtain a semantic kernel, which can be used for both deriving
the topics that run through the system as well as their clustering. In the
first model, we abstract an identifier to its type representation and build on
this notion of context to construct contextual vector representation of the
source code. The second notion of context is defined based on the flow of data
between identifiers to represent a module as a dependency graph where the nodes
correspond to identifiers and the edges represent the data dependencies between
pairs of identifiers. We have applied our approach to 10 medium-sized open
source Java projects, and show that by introducing contexts for identifiers,
the quality of the modularization of the software systems is improved. Both of
the context models give results that are superior to the plain vector
representation of documents. In some cases, the authoritativeness of
decompositions is improved by 67%. Furthermore, a more detailed evaluation of
our approach on JEdit, an open source editor, demonstrates that inferred topics
through performing topic analysis on the contextual representations are more
meaningful compared to the plain representation of the documents. The proposed
approach in introducing a context model for source code identifiers paves the
way for building tools that support developers in program comprehension tasks
such as application and domain concept location, software modularization, and
topic analysis.
A multi-species functional embedding integrating sequence and network structure
A key challenge in transferring knowledge between species is that different species have fundamentally different genetic architectures. Initial computational approaches to transferring knowledge across species relied on measures of heredity such as genetic homology, but these approaches suffer from limitations. First, only a small subset of genes have homologs, limiting the amount of knowledge that can be transferred; second, genes change or repurpose functions, complicating the transfer of knowledge. Many approaches address this problem by expanding the notion of homology through high-throughput genomic and proteomic measurements, for example via network alignment. In this work, we take a new approach to transferring knowledge across species, expanding the notion of homology through explicit measures of functional similarity between proteins in different species. Specifically, our kernel-based method, HANDL (Homology Assessment across Networks using Diffusion and Landmarks), integrates sequence and network structure to create a functional embedding in which proteins from different species are embedded in the same vector space. We show that inner products in this space, and the vectors themselves, capture functional similarity across species and are useful for a variety of functional tasks. We present the first whole-genome method for predicting phenologs, recovering many that were previously identified and predicting new phenologs supported by the biological literature. We also demonstrate that the HANDL embedding captures pairwise gene function, in that gene pairs with synthetic lethal interactions are significantly separated in HANDL space, and that the direction of separation is conserved across species. Software for the HANDL algorithm is available at http://bit.ly/lrgr-handl.
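The core construction, a diffusion-based kernel whose landmark columns give every node coordinates in a shared space, can be sketched as follows. The regularized-Laplacian diffusion and the choice of landmarks here are assumptions for illustration; HANDL's actual diffusion operator and its cross-species selection of homolog landmarks differ in detail.

```python
import numpy as np

def diffusion_kernel(A, lam=0.1):
    """One common graph diffusion: K = (L + lam*I)^{-1}, the regularized
    Laplacian kernel, where L is the combinatorial Laplacian of A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(L + lam * np.eye(len(A)))

def landmark_embedding(A, landmarks, lam=0.1):
    """Embed each node by its diffusion scores to landmark nodes; two
    networks embedded against matched landmarks share one vector space."""
    return diffusion_kernel(A, lam)[:, landmarks]

# toy 4-node path network; nodes 0 and 3 act as hypothetical landmarks
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = landmark_embedding(A, [0, 3])
K = diffusion_kernel(A)
```

Because L is symmetric positive semidefinite, K is a valid (symmetric, positive-definite) kernel, so inner products of the embedding rows behave like the cross-species similarities the abstract describes.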
Information flow in interaction networks II: channels, path lengths and potentials
In our previous publication, a framework for information flow in interaction
networks based on random walks with damping was formulated with two fundamental
modes: emitting and absorbing. While many other network analysis methods based
on random walks or equivalent notions have been developed before and after our
earlier work, one can show that they can all be mapped to one of the two modes.
In addition to these two fundamental modes, a major strength of our earlier
formalism was its accommodation of context-specific directed information flow
that yielded plausible and meaningful biological interpretation of protein
functions and pathways. However, the directed flow from origins to destinations
was induced via a potential function that was heuristic. Here, with a
theoretically sound approach called the channel mode, we extend our earlier
work for directed information flow. This is achieved by constructing a
potential function facilitating a purely probabilistic interpretation of the
channel mode. For each network node, the channel mode combines the solutions of
emitting and absorbing modes in the same context, producing what we call a
channel tensor. The entries of the channel tensor at each node can be
interpreted as the amount of flow passing through that node from an origin to a
destination. Similarly to our earlier model, the channel mode encompasses
damping as a free parameter that controls the locality of information flow.
Through examples involving the yeast pheromone response pathway, we illustrate
the versatility and stability of our new framework.
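The emitting mode, and a crude channel-style combination of two endpoint solutions, can be sketched as a damped random walk in a few lines of Python. The fixed-point iteration and the elementwise product used below are illustrative assumptions, not the paper's exact channel-tensor construction.

```python
import numpy as np

def emitting_flow(A, source, damping=0.85, iters=200):
    """Damped random walk from a source node (the 'emitting' mode): at each
    step a fraction `damping` of the flow continues along edges and the rest
    dissipates, so the iteration converges to a steady-state distribution."""
    n = len(A)
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    s = np.zeros(n)
    s[source] = 1.0
    f = s.copy()
    for _ in range(iters):
        f = s + damping * (P.T @ f)
    return f

def channel_flow(A, origin, destination, **kw):
    """Crude stand-in for the channel combination: per-node flow from an
    origin-to-destination walk, approximated as the elementwise product of
    the two endpoint solutions."""
    return emitting_flow(A, origin, **kw) * emitting_flow(A, destination, **kw)

# toy 4-node path network
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
f = emitting_flow(A, 0)
```

As in the paper, the damping parameter controls locality: values near 0 keep the flow close to the source, values near 1 let it spread across the whole network.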
Nonparametric Feature Extraction from Dendrograms
We propose a nonparametric approach to feature extraction from dendrograms.
Minimax distance measures correspond to building a dendrogram with the
single-linkage criterion and defining specific forms of a level function and a
distance function over it. We extend this method to arbitrary dendrograms and
develop a generalized framework wherein different distance measures can be
inferred from different types of dendrograms, level functions and distance
functions. Via an appropriate embedding, we compute a vector-based
representation of the inferred distances, in order to enable many numerical
machine learning algorithms to employ such distances. Then, to address the
model selection problem, we study the aggregation of different dendrogram-based
distances respectively in solution space and in representation space in the
spirit of deep representations. In the first approach, for example for the
clustering problem, we build a graph with positive and negative edge weights
according to the consistency of the clustering labels of different objects
among different solutions, in the context of ensemble methods. Then, we use an
efficient variant of correlation clustering to produce the final clusters. In
the second approach, we investigate the sequential combination of different
distances and features in the spirit of multi-layered architectures to obtain
the final features. Finally, we demonstrate the effectiveness of our approach
via several numerical studies.
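The underlying minimax distance, the smallest achievable value of the largest edge weight along any path between two points, which equals the merge height in a single-linkage dendrogram, can be computed with a small Prim/Dijkstra-style sketch (the dictionary graph format is an assumption for illustration):

```python
import heapq

def minimax_distances(weights, source):
    """Minimax distances from `source`: minimize, over all paths, the
    maximum edge weight on the path. These equal the single-linkage
    dendrogram merge heights. `weights` maps node -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in weights[u].items():
            nd = max(d, w)               # path cost = worst edge on the path
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# toy weighted graph: the direct 0-2 edge (weight 9) is beaten by the
# detour 0-1-2, whose worst edge weighs only 3
W = {0: {1: 2, 2: 9}, 1: {0: 2, 2: 3}, 2: {0: 9, 1: 3}}
d = minimax_distances(W, 0)
```

Embedding these pairwise minimax distances (e.g. via classical multidimensional scaling) then yields the vector representation that numerical learning algorithms can consume, as the abstract describes.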
Network diffusion methods for omics big bio data analytics and interpretation with application to cancer datasets
In current biomedical research, a fundamental step toward understanding the mechanisms at the root of a disease is the identification of disease modules, i.e. those subnetworks of the interactome, the network of protein-protein interactions, with a high number of genetic alterations.
However, the incompleteness of the network and the high variability of the altered genes make this problem non-trivial.
Physical methods that exploit the properties of diffusion processes on networks, which I have studied in this thesis, are those that achieve the best performance.
In the first part of my work, I investigated the theory of diffusion and random walks on networks, finding interesting relations with clustering techniques and with other physical models whose dynamics are described by the Laplacian matrix.
I then implemented a network-diffusion technique and applied it to gene expression and somatic mutation data for three different types of cancer.
The method is organized in two parts. After selecting a subset of the interactome's nodes, we associate with each of them an initial quantity of information reflecting the "degree" of alteration of the gene. The diffusion algorithm propagates the initial information through the network, reaching the steady state after a transient. At that point, the amount of fluid at each node is used to build a ranking of the genes. In the second part, the disease modules are identified by a network resampling procedure.
This analysis allowed us to identify a substantial number of genes already known in the literature on the cancer types studied, as well as a set of related genes that could be interesting candidates for further investigation. Finally, through a Gene Set Enrichment procedure, we tested the correlation of the identified modules with known biological pathways.
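The propagation step, initial alteration scores diffused over the network to a steady state and then used to rank genes, can be sketched as a random walk with restart. The normalization, the restart parameter, and the toy network below are illustrative assumptions, not the thesis's exact configuration.

```python
import numpy as np

def diffuse_scores(A, x0, alpha=0.5, iters=100):
    """Propagate initial alteration scores x0 over the network by a random
    walk with restart: x <- alpha * W x + (1 - alpha) * x0, where W is the
    degree-normalized adjacency. The steady-state value at each node is
    then used to rank genes."""
    W = A / A.sum(axis=0, keepdims=True)   # column-normalized adjacency
    x = x0.copy()
    for _ in range(iters):
        x = alpha * (W @ x) + (1 - alpha) * x0
    return x

# toy interactome: gene 0 carries the initial alteration signal
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x0 = np.array([1.0, 0.0, 0.0, 0.0])
scores = diffuse_scores(A, x0)
ranking = np.argsort(-scores)
```

Because the iteration contracts (alpha < 1), 100 steps are ample for convergence on a small network; the disease-module step would then resample the network and test the stability of the top-ranked genes.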
Physics based supervised and unsupervised learning of graph structure
Graphs are central tools to aid our understanding of biological, physical, and social systems. Graphs also play a key role in representing and understanding the visual world around us, 3D shapes and 2D images alike. In this dissertation, I propose the use of physical and natural phenomena to understand graph structure. I investigate four phenomena or laws of nature: (1) Brownian motion, (2) Gauss's law, (3) feedback loops, and (4) neural synapses, to discover patterns in graphs.