11 research outputs found

Identification, organisation and visualisation of complete proteomes in UniProt throughout all taxonomic ranks: archaea, bacteria, eukaryote and virus

Author: Stanley Eleanor Juliet
Publication venue: Cranfield University
Publication date: 01/04/2012

Users of uniprot.org want to be able to query, retrieve and download proteome sets for an organism of their choice. They expect the data to be easily accessed, complete and up to date based on currently available knowledge. UniProt release 2012_01 (25th Jan 2012) contains the proteomes of 2,923 organisms, of which 50% are bacteria, 38% viruses, 8% eukaryota and 4% archaea. Note that the term 'organism' is used in a broad sense to include subspecies, strains and isolates. Each completely sequenced organism is processed as an independent organism, hence the availability of 38 strain-specific proteomes of Escherichia coli that are accessible for download. There is a project within UniProt dedicated to the mammoth task of maintaining the “Proteomes database”. This active resource is essential for UniProt to continually provide high quality proteome sets to the users. Accurate identification and incorporation of new, publicly available proteomes, as well as the maintenance of existing proteomes, permits sustained growth of the proteomes project. This is a huge, complicated and vital task accomplished by the activities of both curators and programmers. This thesis explains the data input and output of the proteomes database: the flow of genome project data from the nucleotide database into the proteomes database, and then how, from each genome, a proteome is identified, augmented and made visible to uniprot.org users. Along this journey of discovery many issues arose: puzzles concerning data gathering, data integrity and data visualisation. All were resolved and the outcome is a well-documented, actively maintained database that strives to provide optimal proteome information to its users.

Cranfield CERES

Global analysis of SNPs, proteins and protein-protein interactions: approaches for the prioritisation of candidate disease genes.

Author: Dobson Richard James Butler
Publication date: 01/01/2009

Understanding the etiology of complex disease remains a challenge in biology. In recent years there has been an explosion in biological data; this study investigates machine learning and network analysis methods as tools to aid candidate disease gene prioritisation, specifically relating to hypertension and cardiovascular disease. This thesis comprises four sets of analyses. Firstly, non-synonymous single nucleotide polymorphisms (nsSNPs) were analysed in terms of sequence and structure based properties using a classifier to provide a model for predicting deleterious nsSNPs. The degree of sequence conservation at the nsSNP position was found to be the single best attribute, but other sequence and structural attributes in combination were also useful. Predictions for nsSNPs within Ensembl have been made publicly available. Secondly, predicting protein function for proteins with an absence of experimental data or lack of clear similarity to a sequence of known function was addressed. Protein domain attributes based on physicochemical and predicted structural characteristics of the sequence were used as input to classifiers for predicting membership of large and diverse protein superfamilies from the SCOP database. An enrichment method was investigated that involved adding domains to the training dataset that are currently absent from SCOP. This analysis resulted in improved classifier accuracy; optimised classifiers achieved 66.3% for single domain proteins and 55.6% when including domains from multi domain proteins. The domains from superfamilies with low sequence similarity share global sequence properties, enabling applications to be developed which complement profile methods for detecting distant sequence relationships. Thirdly, a topological analysis of the human protein interactome was performed. The results were combined with functional annotation and sequence based properties to build models for predicting hypertension associated proteins. 
The study found that predicted hypertension related proteins are not generally associated with network hubs and do not exhibit high clustering coefficients. Despite this, they tend to be closer and better connected to other hypertension proteins on the interaction network than would be expected by chance. Classifiers that combined PPI network, amino acid sequence and functional properties produced a range of precision and recall scores according to the applied 3 weights. Finally, interactome properties of proteins implicated in cardiovascular disease and cancer were studied. The analysis quantified the influential (central) nature of each protein and defined characteristics of functional modules and pathways in which the disease proteins reside. Such proteins were found to be enriched 2-fold within proteins that are influential (p<0.05) in the interactome. Additionally, they cluster in large, complex, highly connected communities, acting as interfaces between multiple processes more often than expected. An approach to prioritising disease candidates based on this analysis was proposed. Each analysis can provide some new insights into the effort to identify novel disease related proteins for cardiovascular disease.
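
As an illustration of the two topological measures named in this abstract (network hubs and clustering coefficients), the following is a minimal sketch and not the thesis's own code. The toy interaction network, the protein names and the function names are assumptions made for the example, and node degree is used as a simple stand-in for hub status.

```python
# Hypothetical toy network: an undirected protein-protein interaction graph
# stored as a dict that maps each protein to the set of its neighbours.
toy_network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}


def degree(graph, node):
    """Number of interaction partners; high-degree nodes are candidate hubs."""
    return len(graph[node])


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that interact with each other."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbours; reported as 0 here
    # Count each neighbour pair once (u < v) and check whether it is linked.
    links = sum(1 for u in neighbours for v in neighbours
                if u < v and v in graph[u])
    return 2.0 * links / (k * (k - 1))


for protein in sorted(toy_network):
    print(protein, degree(toy_network, protein),
          round(clustering_coefficient(toy_network, protein), 3))
```

In this toy graph, protein A has the highest degree but a low clustering coefficient, because only one of its three neighbour pairs interacts.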

Queen Mary Research Online

OpenGrey Repository

In silico analysis of mitochondrial proteins

Author: Shen Yaoqing
Publication date: 01/10/2009

The important role of mitochondria in the eukaryotic cell has long been appreciated, but their exact composition and the biological processes taking place in mitochondria are not yet fully understood. The two main factors that slow down the progress in this field are inefficient recognition and imprecise annotation of mitochondrial proteins. Therefore, we developed a new computational tool, YimLoc, which effectively predicts mitochondrial proteins from genomic sequences. This tool integrates the strengths of existing predictors and yields higher performance than any individual predictor. We applied YimLoc to ~60 fungal genomes in order to address the controversy about the localization of beta oxidation in these organisms. Our results show that, in contrast to previous studies, most fungal groups do possess mitochondrial beta oxidation. This work also revealed the diversity of beta oxidation in fungi, which correlates with their utilization of fatty acids as energy and carbon sources. 
Further, we conducted an investigation of the key component of the mitochondrial beta oxidation pathway, the acyl-CoA dehydrogenase (ACAD). We combined subcellular localization prediction with subfamily classification and phylogenetic inference of ACAD enzymes from 250 species covering all three domains of life. Our study suggests that ACAD genes are an ancient family with innovative evolutionary strategies to generate a large enzyme toolset for utilizing most diverse fatty acids and amino acids. Finally, to enable the prediction of mitochondrial proteins from data beyond genome sequences, we designed the tool TESTLoc that uses expressed sequence tags (ESTs) as input. TESTLoc performs significantly better than known tools. In addition to providing two new tools for subcellular localization designed for different data, our studies demonstrate the power of combining subcellular localization prediction with other in silico analyses to gain insights into the function of mitochondrial proteins. Most importantly, this work proposes clear hypotheses that are easily testable, with great potential for advancing our knowledge of mitochondrial metabolism.

Dépôt Institutionnel Numérique

Improving the hierarchical classification of protein functions with swarm intelligence

Author: Holden Nicholas
Publication date: 25/11/2022

This thesis investigates methods to improve the performance of hierarchical classification. In this thesis, hierarchical classification is a form of supervised learning in which the classes in a data set are arranged in a tree structure. As a base for our new methods we use the TDDC (top-down divide-and-conquer) approach for hierarchical classification, where each classifier is built only to discriminate between sibling classes. Firstly, we propose a swarm intelligence technique which varies the types of classifiers used at each divide within the TDDC tree. Our technique, PSO/ACO-CS (Particle Swarm Optimisation/Ant Colony Optimisation Classifier Selection), finds combinations of classifiers to be used in the TDDC tree using the global search ability of PSO/ACO. Secondly, we propose a technique that attempts to mitigate a major drawback of the TDDC approach. The drawback is that if an example is misclassified at any point in the TDDC tree, it can never be correctly classified further down the tree. Our approach, PSO/ACO-RO (PSO/ACO-Recovery Optimisation), decides whether to redirect examples at a given classifier node using, again, the global search ability of PSO/ACO. Thirdly, we propose an ensemble based technique, HEHRS (Hierarchical Ensembles of Hierarchical Rule Sets), which attempts to boost the accuracy at each classifier node in the TDDC tree by using information from classifiers (rule sets) in the rest of that tree. We use Particle Swarm Optimisation to weight the individual rules within each ensemble. We evaluate these three new methods on hierarchical bioinformatics data sets that we have created for this research. These data sets represent the real world problem of protein function prediction. We find through extensive experimentation that the three proposed methods improve upon the baseline TDDC method to varying degrees. 
Overall the HEHRS and PSO/ACO-CS-RO approaches are most effective, although they are associated with a higher computational cost.
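
The baseline TDDC scheme described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the class tree, the toy feature vectors and the nearest-centroid learner used at each node are hypothetical choices, and none of the swarm-based extensions (PSO/ACO-CS, PSO/ACO-RO, HEHRS) is shown.

```python
# Hypothetical sketch of top-down divide-and-conquer (TDDC) classification.
# Each internal node of the class tree gets a local model that only
# discriminates between that node's children (sibling classes).

def centroid(rows):
    """Mean feature vector of a non-empty list of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def descends_from(label, ancestor, parent):
    """True if `label` equals `ancestor` or lies below it in the class tree."""
    while label is not None:
        if label == ancestor:
            return True
        label = parent.get(label)
    return False


def train_tddc(X, y, children, parent):
    """Fit one nearest-centroid model per internal node (assumed learner)."""
    models = {}
    for node, kids in children.items():
        models[node] = {}
        for kid in kids:
            rows = [x for x, label in zip(X, y)
                    if descends_from(label, kid, parent)]
            if rows:
                models[node][kid] = centroid(rows)
    return models


def predict_tddc(x, models, root="root"):
    """Walk down the tree; a wrong choice at one level is never revisited."""
    node = root
    while node in models and models[node]:
        node = min(models[node], key=lambda kid: sq_dist(x, models[node][kid]))
    return node


# Toy class tree and data (all assumed for illustration).
children = {"root": ["1", "2"], "1": ["1.1", "1.2"], "2": ["2.1", "2.2"]}
parent = {kid: node for node, kids in children.items() for kid in kids}
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
y = ["1.1", "1.2", "2.1", "2.2"]

models = train_tddc(X, y, children, parent)
print(predict_tddc([0.1, 0.9], models))
print(predict_tddc([4.8, 0.2], models))
```

Because `predict_tddc` commits to one child at every level, the sketch also exhibits the drawback named in the abstract: an example sent down the wrong branch near the root cannot reach its correct leaf.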

Kent Academic Repository

Novel bioinformatics tools assisting targeted peptide-centric proteomics and global proteomics data dissemination

Author: Martens Lennart
Publication date: 01/01/2006

Ghent University Academic Bibliography

Bioinformatics

Publication venue: IntechOpen
Publication date: 20/04/2021

This book is divided into different research areas relevant to bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.

Directory of Open Access Books (DOAB)

Αλγόριθμοι για την υπολογιστική ανάλυση της λειτουργίας των μη κωδικών μεταγραφών [Algorithms for the computational analysis of the function of non-coding transcripts]

Author: Παρασκευοπούλου Μαρία Δ.
Publication date: 01/01/2016

University of Thessaly Institutional Repository

Contextual Analysis of Gene Expression Data

Author: Sohler Florian
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 20/07/2006

Although the measurement of gene expression using microarrays has become a standard high-throughput method in molecular biology, the analysis of gene expression data is still a very active area of research in bioinformatics and statistics. Despite some issues in quality and reproducibility of microarray and derived data, they are still considered one of the most promising experimental techniques for the understanding of complex molecular mechanisms. This work approaches the problem of expression data analysis using contextual information. While all analyses must be based on sound statistical data processing, it is also important to include biological knowledge to arrive at biologically interpretable results. After an introduction and some biological background, chapter 2 reviews some standard methods for the analysis of microarray data, including normalization, computation of differentially expressed genes, and clustering. The first source of context information that is used to aid in the interpretation of the data is the functional annotation of genes. Such information is often represented using ontologies such as the Gene Ontology (GO). GO annotations are provided by many gene and protein databases and have been used to find functional groups that are significantly enriched in differentially expressed, or otherwise conspicuous, genes. In gene clustering approaches, functional annotations have been used to find enriched functional classes within each cluster. In chapter 3, a clustering method for the samples of an expression data set is described that uses GO annotations during the clustering process in order to find functional classes that imply a particularly strong separation of the samples. The resulting clusters can be interpreted more easily in terms of GO classes. The clustering method was developed in joint work with Henning Redestig. More complex biological information that covers interactions between biological objects is contained in networks. 
Such networks can be obtained from public databases of metabolic pathways, signaling cascades, transcription factor binding sites, or high-throughput measurements for the detection of protein-protein interactions such as yeast two-hybrid experiments. Furthermore, networks can be inferred using literature mining approaches or network inference from expression data. The information contained in such networks is very heterogeneous with respect to the type, the quality and the completeness of the contained data. ToPNet, a software tool for the interactive analysis of networks and gene expression data, has been developed in cooperation with Daniel Hanisch. The basic analysis and visualization methods as well as some important concepts of this tool are described in chapter 4. In order to access the heterogeneous data represented as networks with annotated experimental data and functions, it is important to provide advanced querying functionality. Pathway queries allow the formulation of network templates that can include functional annotations as well as expression data. The pathway search algorithm finds all instances of the template in a given network. In order to do so, a special case of the well-known subgraph isomorphism problem has to be solved. Although the algorithm has exponential running time in the worst case, some implementation tricks make it run fast enough for practical purposes. Often, a pathway query has many matching instances, and it is important to assess the statistical significance of the individual instances with respect to expression data or other criteria. In chapter 5 the pathway query language and the pathway search algorithm are described in detail and some theoretical properties are derived. Furthermore, some scoring methods that have been implemented are described. The possibility of combining different scoring schemes for different parts of the query results in very flexible scoring capabilities. 
In chapter 6, some applications of the methods are described, using public data sets as well as data sets from research projects. On the basis of the well studied public data sets, it is demonstrated that the methods yield biologically meaningful results. The other analyses show how new hypotheses can be generated in more complex biological systems, but the validation of these hypotheses can only be provided by new experiments. Finally, an outlook is given on how the presented methods can contribute to ongoing research efforts in the area of expression data analysis, their applicability to other types of data (such as proteomics data) and their possible extensions.
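
The pathway search idea in this abstract, finding every instance of an annotated network template in a larger network, can be illustrated with a plain backtracking matcher. This is a hedged sketch only: it is not the ToPNet algorithm or its query language, it ignores expression data and scoring, and the graph encoding, the labels and the function name `find_instances` are assumptions made for the example. It matches template edges without requiring that non-edges be preserved, and its worst-case running time is exponential in the template size.

```python
# Hypothetical sketch: find all instances of a small labelled template in a
# directed network by backtracking over candidate node assignments.

def find_instances(template_edges, template_labels, graph, labels):
    """Return every injective mapping template node -> graph node that keeps
    the node labels and all template edges (edges are directed: a -> b)."""
    t_nodes = list(template_labels)
    results = []

    def consistent(t_node, g_node, mapping):
        if labels.get(g_node) != template_labels[t_node]:
            return False  # functional annotation must match
        if g_node in mapping.values():
            return False  # each graph node is used at most once
        for a, b in template_edges:
            if a == t_node and b in mapping and mapping[b] not in graph[g_node]:
                return False
            if b == t_node and a in mapping and g_node not in graph[mapping[a]]:
                return False
        return True

    def extend(i, mapping):
        if i == len(t_nodes):
            results.append(dict(mapping))
            return
        t_node = t_nodes[i]
        for g_node in graph:
            if consistent(t_node, g_node, mapping):
                mapping[t_node] = g_node
                extend(i + 1, mapping)
                del mapping[t_node]  # backtrack

    extend(0, {})
    return results


# Toy network (assumed): node -> set of successor nodes, plus one label each.
graph = {
    "R1": {"K1"}, "R2": {"K1"},
    "K1": {"TF1", "TF2"},
    "TF1": set(), "TF2": set(),
}
labels = {"R1": "receptor", "R2": "receptor", "K1": "kinase",
          "TF1": "tf", "TF2": "tf"}

# Template: receptor -> kinase -> transcription factor.
template_labels = {"r": "receptor", "k": "kinase", "t": "tf"}
template_edges = [("r", "k"), ("k", "t")]

instances = find_instances(template_edges, template_labels, graph, labels)
for instance in instances:
    print(instance)
```

Checking labels before extending the mapping prunes most candidates early, which is one simple reason such a search can be fast in practice despite the exponential worst case.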

Digitale Hochschulschriften der LMU

Exploiting gene expression and protein data for predicting remote homology and tissue specificity

Author: Wieser Daniela
Publication date: 01/01/2010

In this thesis I describe my investigations of applying machine learning methods to high throughput experimental and predicted biological data. The importance of such analysis as a means of making inferences about biological functions is widely acknowledged in the bioinformatics community. Specifically, this work makes three novel contributions based on the systematic analysis of publicly archived data of protein sequences, three dimensional structures, gene expression and functional annotations: (a) remote homology detection based on amino acid sequences and secondary structures; (b) the analysis of tissue-specific gene expression for predictive signals in the sequence and secondary structure of the resulting protein product; and (c) a study of ageing in the fruit fly, a commonly used model organism, in which tissue specific and whole-organism gene expression changes are contrasted. In the problem of remote homology detection, a kernel-based method that combines pairwise alignment scores of amino acid sequences and secondary structures is shown to improve the prediction accuracies in a benchmark task defined using the Structural Classification of Proteins (SCOP) database. While the task of predicting SCOP superfamilies should be regarded as an easy one, with not much room for performance improvement, it is still widely accepted as the gold standard due to careful manual annotation by experts in the subject of protein evolution. A similar method is introduced to investigate whether tissue specificity of gene expression is correlated with the sequence and secondary structure of the resulting protein product. An information theoretic approach is adopted for sorting fruit fly and mouse genes according to their tissue specificity based on gene expression data. A classifier is then trained to predict the degree of specificity for these genes. 
The study concludes that the tissue specificity of gene expression is correlated with the sequence, and to a certain extent, with the secondary structure of the gene’s protein product. The sorted list of genes introduced in the previous chapter is used to investigate the tissue specificity of transcript profiles obtained from a study of ageing in the fruit fly. The same list is utilised to investigate how filtering tissue-restricted genes affects gene set enrichment analysis in the ageing study, and to examine the specificity of age-associated genes identified in the literature. The conclusion drawn in this chapter is that categorisation of genes according to their tissue specificity using Shannon’s information theory is useful for the interpretation of whole-fly gene expression data.
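
The abstract does not state the exact measure used to sort genes by tissue specificity. One common information theoretic choice, shown here purely as an assumption, is the Shannon entropy of a gene's expression profile across tissues: a gene expressed in a single tissue has entropy 0, and a gene expressed evenly in n tissues has entropy log2(n). The gene names and expression values below are invented for the example.

```python
import math


def expression_entropy(levels):
    """Shannon entropy (in bits) of an expression profile across tissues.

    `levels` holds non-negative expression values, one per tissue. Lower
    entropy means expression is concentrated in fewer tissues.
    """
    total = sum(levels)
    if total <= 0:
        raise ValueError("expression profile must have a positive total")
    probs = [v / total for v in levels if v > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical profiles over four tissues (values are made up).
profiles = {
    "geneA": [100.0, 0.0, 0.0, 0.0],    # restricted to one tissue
    "geneB": [25.0, 25.0, 25.0, 25.0],  # uniform across tissues
    "geneC": [50.0, 50.0, 0.0, 0.0],    # shared by two tissues
}

# Sort from most tissue-specific (lowest entropy) to least specific.
ranked = sorted(profiles, key=lambda gene: expression_entropy(profiles[gene]))
print(ranked)
```

A sorted list of this kind could then be thresholded to filter tissue-restricted genes before an enrichment analysis, which is the sort of use the abstract describes.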

Southampton (e-Prints Soton)

OpenGrey Repository

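Anotación funcional de proteínas basada en representación relacional en el entorno de la biología de sistemas [Functional annotation of proteins based on relational representation in the context of systems biology]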

Author: García Jiménez Beatriz
Publication date: 01/01/2012

Functional annotation is an open and interesting research topic in Molecular Biology. Defining function at the level of terminology is a hard task, because there is no unified criterion and because function spans many levels for the same protein. Given these difficulties, the way to determine a protein's function is to annotate it with several terms from different vocabularies. Proteins carry out their function together with other proteins, as part of protein complexes. These interactions are represented in a network of experimentally verified protein-protein interactions. Analyzing and using the interaction network is a task of interest because of the great number of associations and the multiple ways in which a protein can influence the function of others. Therefore, this thesis focuses on the prediction of functional annotation based on networks. It is apparent that this complex scenario cannot be tackled without computational techniques. In fact, there is considerable activity in Computational Biology specifically devoted to this topic. This thesis is part of this effort to apply computational methods to biological problems in the Systems Biology area. The approach belongs to the Systems Biology context because it does not analyze function in isolation for each molecule, but at the system level, taking into account all the relations among genes and proteins linked at different levels. To take advantage of all these biological relations, and to preserve their structured semantics, this thesis proposes the use of Relational Representation, since it is particularly suitable for this domain. On top of this representation, multiple transformations and Artificial Intelligence techniques are applied to retrieve implicit knowledge from the related proteins, and to propose new functions through the prediction of functional associations between proteins. The main proposal of this thesis is to characterize the function of proteins and genes based on networks, through Relational Representation and Machine Learning. Specifically, starting from a relational representation specific to functional annotation, we look for the computational design needed to solve two specific, different and biologically interesting problems. The first consists of predicting functional associations between pairs of proteins in E. coli, and the second comprises expanding pathways in humans. Both are assessed in terms of computational performance and biological interpretation. We also propose new putative protein functional annotations to be experimentally verified. In addition, the thesis investigates diverse approaches to knowledge representation and learning techniques, suggesting specific strategies to tackle other biological problems, especially where relational data or multi-class and multi-label targets are present.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo