46 research outputs found
S.cerevisiae complex function prediction with modular multi-relational framework
Proceeding of: 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2010, Córdoba, Spain, June 1-4, 2010.
Determining the functions of genes is essential for understanding how metabolisms work and for trying to resolve their malfunctions. Genes usually work in groups rather than in isolation, so functions should be assigned to gene groups and not to individual genes. Moreover, genetic knowledge involves many relations and changes very frequently. Thus, a propositional ad hoc approach is not appropriate for the gene group function prediction domain. We propose the Modular Multi-Relational Framework (MMRF), which approaches the problem from a relational and flexible point of view. The MMRF consists of several modules covering all the domain tasks involved (grouping, representing and learning using computational prediction techniques). A specific application is described, including a relational representation language, where each module of MMRF is individually instantiated and refined to obtain a prediction under specific given conditions.
This research work has been supported by the CICYT project TRA 2007-67374-C02-02 and by the expert biological knowledge of the Structural Computational Biology Group at the Spanish National Cancer Research Centre (CNIO). The authors would like to thank the members of the Tilde tool developer group at K.U.Leuven for their help and many useful suggestions.
MMRF for proteome annotation applied to human protein disease prediction
Proceedings of: 20th International Conference, ILP 2010, Florence, Italy, June 27-30, 2010.
Knowing the biological processes in which each gene and protein participates is essential for designing disease treatments. At present, these annotations are still unknown for many genes and proteins. Since deriving annotations from in vivo experiments is costly, computational predictors are needed for different kinds of annotation, such as metabolic pathway, interaction network, protein family, tissue and disease. Biological data has an intrinsic relational structure, including genes and proteins, which can be grouped by many criteria. This makes it hard to find good hypotheses when an attribute-value representation is used. Hence, we propose the generic Modular Multi-Relational Framework (MMRF) to predict different kinds of gene and protein annotation using Relational Data Mining (RDM). The specific MMRF application to annotating human proteins with diseases verifies that group knowledge (mainly protein-protein interaction pairs) improves the prediction, in particular doubling the area under the precision-recall curve.
Modular multi-relational framework for gene group function prediction
Poster of: 19th International Conference on Inductive Logic Programming (ILP 2009), Leuven, Belgium, 2 - 4 Jul, 2009.
Determining the functions of genes is essential for understanding how metabolisms work and for trying to resolve their malfunctions. Genes usually work in groups rather than in isolation, so functions should be assigned to gene groups and not to individual genes. Moreover, genetic knowledge involves many relations and changes very frequently. Thus, a propositional ad hoc approach is not appropriate for the gene group function prediction domain. We propose the Modular Multi-Relational Framework (MMRF), which approaches the problem from a relational and flexible point of view. The MMRF consists of several modules covering all the domain tasks involved (grouping, representing and learning using computational prediction techniques). A specific application is described, including a relational representation language, where each module of MMRF is individually instantiated and refined to obtain a prediction under specific given conditions.
The research reported here has been supported by the CICYT project TRA2007-67374-C02-02.
Consensus Network Inference of Microarray Gene Expression Data
Genetic and protein interactions are essential to regulate cellular machinery. Their
identification has become an important aim of systems biology research. In recent years, a
variety of computational network inference algorithms have been employed to reconstruct
gene regulatory networks from post-genomic data. However, precisely predicting these
regulatory networks remains a challenge.
We began our study by assessing the ability of various network inference algorithms
to accurately predict gene regulatory interactions using benchmark simulated datasets. It was
observed from our analysis that different algorithms have strengths and weaknesses when
identifying regulatory networks: a gene-pair interaction (edge) predicted by one
algorithm is not necessarily predicted by the others. An edge not predicted by most
inference algorithms may be an important one, and should not be missed. The naïve
consensus (intersection) method is perhaps the most conservative approach and can be used
to address this concern by extracting the edges consistently predicted across all inference
algorithms; however, it lacks credibility as it does not provide a quantifiable measure for
edge weights. Existing quantitative consensus approaches, such as the inverse-variance
weighted method (IVWM) and the Borda count election method (BCEM), have been
previously implemented to derive consensus networks from diverse datasets. However, the
former method was biased towards finding local solutions in the whole network, and the
latter considered species diversity to build the consensus network.
In this thesis we proposed a novel consensus approach, in which we used Fisher's
Combined Probability Test (FCPT) to combine the statistical significance values assigned to
each network edge by a number of different networking algorithms to produce a consensus
network. We tested our method by applying it to a variety of in silico benchmark expression datasets of different dimensions and evaluated its performance against individual inference
methods, Bayesian models and also existing qualitative and quantitative consensus
techniques. We also applied our approach to real experimental data from the yeast (S.
cerevisiae) network as this network has been comprehensively elucidated previously. Our
results demonstrated that the FCPT-based consensus method outperforms single algorithms in
terms of robustness and accuracy. In developing the consensus approach, we also proposed a
scoring technique that quantifies biologically meaningful hierarchical modular networks.
University of Exeter studentshi
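The combination step described in the abstract above can be illustrated with a minimal sketch of Fisher's combined probability test. For k independent p-values, the statistic X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for an even number of degrees of freedom its tail probability has a closed form. The function names, the edge-to-p-values dictionary layout, the gene names and the 0.05 cut-off are assumptions made for illustration; this is not the thesis implementation, and it assumes the per-algorithm p-values are independent.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method.

    Uses the closed-form chi-square survival function for 2k degrees of
    freedom: P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0   # (X/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def consensus_network(edge_pvalues, alpha=0.05):
    """Keep the edges whose combined p-value does not exceed alpha.

    edge_pvalues maps an edge (gene pair) to the list of p-values that
    the individual inference algorithms assigned to it (hypothetical layout).
    """
    combined = {edge: fisher_combined_pvalue(ps) for edge, ps in edge_pvalues.items()}
    return {edge: p for edge, p in combined.items() if p <= alpha}


if __name__ == "__main__":
    # Hypothetical p-values from three inference algorithms.
    example = {
        ("geneA", "geneB"): [0.01, 0.04, 0.03],
        ("geneA", "geneC"): [0.60, 0.50, 0.90],
    }
    for edge, p in consensus_network(example).items():
        print(edge, p)
```

Unlike a plain intersection of edge lists, this gives every edge a quantitative score, so an edge with moderate support from several algorithms can still be retained.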
Biochemical complex data generation and integration in genome-scale metabolic models
Master's dissertation in Bioinformatics
The (re-)construction of Genome-Scale Metabolic (GSM) models is highly dependent on
biochemical databases. However, the biochemical data within these databases is limited and
often lacks structurally defined representations of compounds. To circumvent
this limitation, compounds are frequently represented by their generic version. Lipids are
paradigmatic cases: given that a multitude of lipid species can occur in nature, not only is
their storage in databases hampered, but also their integration into GSM models. Accordingly,
converting one lipid version, in GSM models, into another can be tricky, as these compounds
possess side chains that are likely to be transferred all across their biosynthetic network.
Hence, converting a lipid implies that all its precursors have to be converted as well, requiring
information on lipid specificity and biosynthetic context.
The present work represents a strategy to tackle this issue. Biochemical cOmplex data
Integration in Metabolic Models at Genome scale (BOIMMG)’s pipeline encompasses the
integration and processing of biochemical data from different sources, aiming at expanding the
current knowledge in lipid biosynthesis, and its integration in GSM models.
Generic reactions retrieved from MetaCyc were handled and transformed into reactions with
structurally defined lipid species. More than 30 generic reactions were fully (and 27 partially)
characterized, allowing the prediction of over 30000 new lipid structures and their biosynthetic context.
The integration of BOIMMG’s data into GSM models was conducted for electron-transfer
quinone, glycerolipid, and phospholipid metabolism. The validation relied on the
comparison of models with different versions of these metabolites. BOIMMG’s conversion
modules were applied to Escherichia coli’s iJR904 model [1], generating 53 more matching lipids
and 38 more matching reactions with iJR904 model’s iteration iAF1260b [2, 3], in which the
conversion was performed and curated manually.
To the best of our knowledge, BOIMMG’s database is the only one with biosynthetic information
regarding structurally defined lipids. Moreover, there is no other state-of-the-art tool capable
of automatically generating complex lipid-specific networks.
Classifying distinct data types: textual streams, protein sequences and genomic variants
Artificial Intelligence (AI) is an interdisciplinary field combining different research areas with the end goal of automating processes in everyday life and industry. The fundamental components of AI models are an “intelligent” model and a functional component defined by the end application. That is, an intelligent model can be a statistical model that recognizes patterns in data instances in order to distinguish between these instances.
For example, if AI is applied in car manufacturing, the model can categorize, based on an image of a car part, whether the part belongs to the front, middle or rear compartment of the car, as a human brain would do. In the same example, the statistical model informs a mechanical arm, the functional component, of the current car compartment, and the arm in turn assembles this compartment based on predefined instructions, much as a human hand would follow neural signals from the human brain. A crucial step of AI applications is the classification of input instances by the intelligent model.
The classification step in the intelligent model pipeline allows the subsequent steps to act in a similar fashion for instances belonging to the same category. We define classification as the module of the intelligent model that categorizes the input instances based on patterns that are either predefined by human experts or produced in a data-driven way. Irrespective of the method used to find patterns in data, classification is composed of four distinct steps: (i) input representation, (ii) model building, (iii) model prediction and (iv) model assessment. Based on these classification steps, we argue that applying classification to distinct data types poses different challenges.
In this thesis, I focus on challenges for three distinct classification scenarios: (i) Textual Streams: how to advance the model building step, commonly used for a static data distribution, to classify textual posts with a transient data distribution? (ii) Protein Prediction: which biologically meaningful information can be used in the input representation step to overcome the challenge of limited training data? (iii) Human Variant Pathogenicity Prediction: how to develop a classification system for the functional impact of human variants that provides standardized and well-accepted evidence for the classification outcome and thus enables the model assessment step?
To answer these research questions, I present my contributions in classifying these different types of data. temporalMNB: I adapt the sequential prediction with expert advice paradigm to optimally aggregate complementary distributions, enhancing a Naive Bayes model so that it adapts to the drifting distribution of the characteristics of the textual posts. dom2vec:
our proposal to learn embedding vectors for protein domains using self-supervision. Based on the high performance achieved by the dom2vec embeddings in a quantitative intrinsic assessment of the captured biological information, I provide example evidence for an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Last, I describe GenOtoScope, a bioinformatics software tool that automates standardized evidence-based criteria for the pathogenicity impact of variants associated with hearing loss. Finally, to increase the practical use of this last contribution, I develop easy-to-use software interfaces to be used, in research settings, by clinical diagnostics personnel.
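The "prediction with expert advice" paradigm named in the abstract above can be illustrated with a generic exponentially weighted forecaster. This is a textbook sketch, not temporalMNB itself: the function name, the learning rate `eta`, the use of log loss and the binary-outcome setting are all assumptions made for illustration. Each expert could, for instance, be a classifier trained on a different time window of the stream.

```python
import math


def exponential_weights(expert_probs, outcomes, eta=1.0):
    """Aggregate expert probability forecasts round by round.

    expert_probs[t][k] is the probability expert k assigns to the positive
    class at round t; outcomes[t] is the true label (0 or 1), revealed after
    the aggregated prediction for round t has been made.
    Returns the aggregated predictions and the final expert weights.
    """
    n_experts = len(expert_probs[0])
    weights = [1.0 / n_experts] * n_experts
    predictions = []
    for probs, y in zip(expert_probs, outcomes):
        # Predict with the current weighted mixture of the experts.
        predictions.append(sum(w * p for w, p in zip(weights, probs)))
        # Log loss of each expert on the revealed outcome (clipped for safety).
        losses = [-math.log(max(p if y == 1 else 1.0 - p, 1e-12)) for p in probs]
        # Down-weight experts in proportion to exp(-eta * loss), then renormalize.
        weights = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return predictions, weights


if __name__ == "__main__":
    # Two hypothetical experts; the first one matches the stream better.
    probs = [[0.9, 0.1]] * 5
    labels = [1] * 5
    preds, final_weights = exponential_weights(probs, labels)
    print(preds)
    print(final_weights)
```

When the data distribution drifts, the weights shift toward whichever expert has recently incurred the lowest loss, which is the general idea behind using this paradigm on non-stationary text streams.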
Vertical integration, global and modular analysis of molecular interaction networks of Escherichia coli
Phenotypical characteristics of cells often arise from interactions between genes, proteins and metabolites. For a complete understanding of cellular processes and their regulation, it is necessary to vertically integrate the molecular networks into an interactome and understand its global structure. In this thesis, an integrated molecular network (IMN) of Escherichia coli was reconstructed, comprising metabolic reactions, metabolite-protein interactions (MPI) and transcriptional regulation data. Three fundamental aspects of cellular processes were studied: (i) feedback regulation of gene expression, (ii) network motifs and (iii) global organization. Intriguingly, this work found that feedback regulation of gene expression in E. coli is mediated by MPIs, and 69 such feedback loops (FBLs) were identified. Motif studies identified the FBL as a significant pattern and detected 12 other three-node motifs comprising five composite motifs. Connectivity analysis revealed a bow-tie architecture, and motif analysis in the bow-tie components revealed that 77% of them interconnect to form the giant strong component, which is the backbone of the bow-tie. Further in this work, cluster and modular analyses were performed on the integrated molecular network of E. coli constructed from a diverse collection of datasets involving metabolic reactions, metabolite-protein interactions and transcriptional regulation. Modularity was used as the parameter of an appropriate, fast and robust method for clustering such a heterogeneous molecular circuitry of interactions. This work revealed that clustering this complex network significantly grouped together genes of known similar function in well-defined physiologically related modules. Identification of network motifs and correlating them with the modules of highly connected nodes may define their potential functional role.
To this end, twelve highly significant three-node network motifs, among which four are composite network motifs comprising multiple types of interactions, were detected and analyzed. Distribution analysis of these motifs within and between the various functional modules supported the view that these motifs represent basic patterns of regulation and organization of genes into modules. This thesis illustrates the potential of data integration of molecular networks to detect the feedback interactions in regulatory networks, and of its global analysis for better understanding cellular processes and their regulation. Moreover, this work presents a basic framework for detecting functional modules and their interaction with various motifs in an integrated E. coli system.
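The abstract above uses modularity as the clustering criterion. As a rough illustration, the sketch below computes the standard Newman-Girvan modularity of a given partition, Q = sum over communities c of [L_c / m - (d_c / 2m)^2], where m is the number of edges, L_c the number of edges inside community c and d_c the total degree of its nodes. It assumes an undirected, unweighted graph without self-loops, which simplifies the heterogeneous network studied in the thesis; the function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("the graph needs at least one edge")

    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1

    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


if __name__ == "__main__":
    # Toy graph: two triangles joined by a single bridge edge.
    toy_edges = [
        ("a", "b"), ("b", "c"), ("a", "c"),
        ("d", "e"), ("e", "f"), ("d", "f"),
        ("c", "d"),
    ]
    two_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
    print(modularity(toy_edges, two_modules))
```

A clustering method driven by modularity searches for the partition that maximizes this score; putting every node in a single community always yields zero.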
GRAPH-BASED APPROACHES FOR IMBALANCED DATA IN FUNCTIONAL GENOMICS
The Gene Function Prediction (GFP) problem consists in inferring biological properties for the genes whose function is unknown or only partially
known, and raises challenging issues from both a machine learning and a
computational biology standpoint.
The GFP problem can be formalized as a semi-supervised learning problem in an undirected graph. Indeed, given a graph with a partial graph labeling, where nodes represent genes, edges functional relationships between
genes, and labels their membership to functional classes, GFP consists in
inferring the unknown functional classes of genes, by exploiting the topological relationships of the networks and the available a priori knowledge about
the functional properties of genes.
Several network-based machine learning algorithms have been proposed
for solving this problem, including Hopfield networks and label propagation
methods; however, some issues have been only partially considered, e.g. the
preservation of the prior knowledge and the unbalance between positive and
negative labels.
A first contribution of the thesis is the design of a Hopfield-based cost-sensitive
neural network algorithm (COSNet) to address these learning issues. The method splits the solution of the problem into two parts: 1) the
subnetwork composed of the labelled vertices is considered, and the network
parameters are estimated through a supervised algorithm; 2) the estimated
parameters are extended to the subnetwork composed of the unlabeled vertices, and the attractor reached by the dynamics of this subnetwork is used
to predict the labeling of the unlabeled vertices.
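The second step described above, running the network dynamics on the unlabeled vertices until an attractor is reached, can be sketched with a plain discrete Hopfield-style update in which the labeled vertices are clamped. This is a simplified illustration, not COSNet: the threshold is passed in rather than estimated from the labeled subnetwork, the states are fixed at +1 and -1 instead of being cost-sensitive, and the function name and toy graph are hypothetical.

```python
def hopfield_predict(weights, labels, threshold=0.0, max_sweeps=100):
    """Predict labels of unlabeled vertices by asynchronous Hopfield updates.

    weights: symmetric matrix (list of lists) of edge weights, zero diagonal.
    labels: list with +1, -1 or None (None marks an unlabeled vertex).
    Labeled vertices stay clamped; unlabeled ones start at 0 (no influence).
    """
    n = len(labels)
    state = [lab if lab is not None else 0 for lab in labels]
    unlabeled = [i for i, lab in enumerate(labels) if lab is None]

    for _ in range(max_sweeps):
        changed = False
        for i in unlabeled:
            # Local field: weighted sum of neighbor states minus the threshold.
            field = sum(weights[i][j] * state[j] for j in range(n) if j != i) - threshold
            new_state = 1 if field > 0 else -1
            if new_state != state[i]:
                state[i] = new_state
                changed = True
        if not changed:
            break  # a fixed point (attractor) has been reached
    return {i: state[i] for i in unlabeled}


if __name__ == "__main__":
    # Toy graph: vertex 2 is linked to a positive gene, vertex 3 to a negative one.
    toy_weights = [
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 0],
        [0, 1, 0, 0],
    ]
    toy_labels = [1, -1, None, None]
    print(hopfield_predict(toy_weights, toy_labels))
```

With symmetric weights and one-vertex-at-a-time updates, each change cannot increase the network energy, so the loop settles on a fixed point; `max_sweeps` is only a safety bound.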
The proposed method embeds in the neural algorithm the “a priori”
knowledge coded in the labeled part of the graph, and separates node labels
and neuron states, which makes it possible to weight positive and negative
node labels differently, and to perform a learning approach that takes into account the
“unbalance problem” that affects GFP.
A second contribution of this thesis is the development of a new algorithm (LSI) which exploits some ideas of COSNet for evaluating the predictive capability of each input network. By this algorithm we can estimate the
effectiveness of each source of data for predicting a specific class, and then
we can use this information to appropriately integrate multiple networks by
weighting them according to an appropriate integration scheme.
Both COSNet and LSI are computationally efficient and scale well with
the dimension of the data.
COSNet and LSI have been applied to the genome-wide prediction of gene functions in the yeast and mouse model organisms, achieving results
comparable with those obtained with state-of-the-art semi-supervised and
supervised machine learning methods.