919 research outputs found

    Ranking relations using analogies in biological and information networks

    Get PDF
    Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S={A(1):B(1),A(2):B(2),,A(N):B(N)}\mathbf{S}=\{A^{(1)}:B^{(1)},A^{(2)}:B^{(2)},\ldots,A^{(N)}:B ^{(N)}\}, measures how well other pairs A:B fit in with the set S\mathbf{S}. Our work addresses the following question: is the relation between objects A and B analogous to those relations found in S\mathbf{S}? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.Comment: Published in at http://dx.doi.org/10.1214/09-AOAS321 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Identifying Essential Hub Genes and Protein Complexes in Malaria GO Data using Semantic Similarity Measures

    Full text link
    Hub genes play an essential role in biological systems because of their interaction with other genes. A vocabulary used in bioinformatics called Gene Ontology (GO) describes how genes and proteins operate. This flexible ontology illustrates the operation of molecular, biological, and cellular processes (Pmol, Pbio, Pcel). There are various methodologies that can be analyzed to determine semantic similarity. Research in this study, we employ the jack-knife method by taking into account 4 well-liked Semantic similarity measures namely Jaccard similarity, Cosine similarity, Pairsewise document similarity, and Levenshtein distance. Based on these similarity values, the protein-protein interaction network (PPI) of Malaria GO (Gene Ontology) data is built, which causes clusters of identical or related protein complexes (Px) to form. The hub nodes of the network are these necessary proteins. We use a variety of centrality measures to establish clusters of these networks in order to determine which node is the most important. The clusters' unique formation makes it simple to determine which class of Px they are allied to.Comment: 23 pages, 15 figure

    Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data

    Get PDF
    Inference of protein functions is one of the most important aims of modern biology. To fully exploit the large volumes of genomic data typically produced in modern-day genomic experiments, automated computational methods for protein function prediction are urgently needed. Established methods use sequence or structure similarity to infer functions but those types of data do not suffice to determine the biological context in which proteins act. Current high-throughput biological experiments produce large amounts of data on the interactions between proteins. Such data can be used to infer interaction networks and to predict the biological process that the protein is involved in. Here, we develop a probabilistic approach for protein function prediction using network data, such as protein-protein interaction measurements. We take a Bayesian approach to an existing Markov Random Field method by performing simultaneous estimation of the model parameters and prediction of protein functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to more accurate parameter estimates and consequently to improved prediction performance compared to the standard Markov Random Fields method. We tested our method using a high quality S.cereviciae validation network with 1622 proteins against 90 Gene Ontology terms of different levels of abstraction. Compared to three other protein function prediction methods, our approach shows very good prediction performance. Our method can be directly applied to protein-protein interaction or coexpression networks, but also can be extended to use multiple data sources. We apply our method to physical protein interaction data from S. cerevisiae and provide novel predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins and we evaluate the predictions using the available literature

    Subontology Extraction Using Hyponym and Hypernym Closure on is-a Directed Acyclic Graphs

    Get PDF
    International audienceOntologies are successfully used as semantic guides when navigating through the huge and ever increasing quantity of digital documents. Nevertheless, the size of numerous domain ontologies tends to grow beyond the human capacity to grasp information. This growth is problematic for a lot of key applications that require user interactions such as document annotation or ontology modification/evolution. The problem could be partially overcome by providing users with a sub-ontology focused on their current concepts of interest. A sub-ontology restricted to this sole set of concepts is of limited interest since their relationships can generally not be explicit without adding some of their hyponyms and hypernyms. This paper proposes efficient algorithms to identify these additional key concepts based on the closure of two common graph operators: the least common-ancestor and greatest common descendant. The resulting method produces ontology excerpts focused on a set of concepts of interest and is fast enough to be used in interactive environments. As an example, we use the resulting program, called OntoFocus (http://www.ontotoolkit.mines-ales.fr/), to restrict, in few seconds, the large Gene Ontology (~30,000 concepts) to a sub-ontology focused on concepts annotating a gene related to breast cancer

    Retrieval, alignment, and clustering of computational models based on semantic annotations

    Get PDF
    As the number of computational systems biology models increases, new methods are needed to explore their content and build connections with experimental data. In this Perspective article, the authors propose a flexible semantic framework that can help achieve these aims

    Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Functions) as defined in ENTREZ-Gene based on a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. In the suggested approach we merge two independent sentence extraction strategies. The first proposed strategy (LASt) uses argumentative features, inspired by discourse-analysis models. The second extraction scheme (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence; thus providing a full ranking of all possible candidate GeneRiFs. A combination of the two approaches is proposed, which also aims at reducing the size of the selected segment by filtering out non-content bearing rhetorical phrases.</p> <p>Results</p> <p>Based on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy is already competitive (52.78%). When used in a combined approach, the extraction task clearly shows improvement, achieving a Dice score of over 57% (+10%).</p> <p>Conclusions</p> <p>Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.</p

    Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data

    Get PDF
    Dramatic improvements in high throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function, compared to proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross validation study we show that the integrated model increases recall by 18%, compared to using PPI data alone at the 50% precision. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement vs. PPI depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function
    corecore