Harvard University - DASH

UCL Discovery

CUED - Cambridge University Engineering Department

Identifying Essential Hub Genes and Protein Complexes in Malaria GO Data using Semantic Similarity Measures

Author: Alphonse P. J. A.
Das Mamata
K. Selvakumar
Publication date: 09/08/2023

Hub genes play an essential role in biological systems because of their interactions with other genes. Gene Ontology (GO) is a vocabulary used in bioinformatics to describe how genes and proteins operate. This flexible ontology covers molecular, biological, and cellular processes (Pmol, Pbio, Pcel). Various methods can be used to determine semantic similarity. In this study, we employ the jack-knife method with four popular semantic similarity measures: Jaccard similarity, cosine similarity, pairwise document similarity, and Levenshtein distance. Based on these similarity values, the protein-protein interaction (PPI) network of malaria GO data is built, in which identical or related protein complexes (Px) form clusters. These essential proteins are the hub nodes of the network. We use a variety of centrality measures on the clusters of these networks to determine which node is the most important. The distinct formation of the clusters makes it simple to determine which class of Px they belong to.
Comment: 23 pages, 15 figures
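Three of the similarity measures named in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names and the choice of inputs (sets and term counts for Jaccard and cosine, raw strings for Levenshtein) are assumptions made for the example.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(s: str, t: str) -> int:
    """Edit distance (insert, delete, substitute) using a rolling DP row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical annotation term sets for two proteins.
terms_a = {"binding", "transport", "membrane"}
terms_b = {"transport", "membrane", "catalysis"}
print(jaccard(terms_a, terms_b))
print(cosine(Counter(terms_a), Counter(terms_b)))
print(levenshtein("kitten", "sitting"))
```

Note that Levenshtein is a distance (lower means more similar), so it would need to be normalized or inverted before being combined with the two similarity scores.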

arXiv.org e-Print Archive

Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data

Author: A Kuzniar
A Vazquez
Aalt D. J. van Dijk
AJ Enright
C Moler
Cajo J. F. ter Braak
CJF Ter Braak
CM Federovitch
DJC MacKay
GD Bader
GR Lanckriet
H Lee
I Kosmidis
I Ulitsky
Iddo Friedberg
IM Cheeseman
J Besag
JA Hanley
L Milligan
L Peña Castillo
M Ashburner
M Deng
M Punta
Marco C. A. M. Bink
N Nariai
NJ Mulder
P McCullagh
R Sharan
RI Kondor
Roeland C. H. J. van Ham
S Ferré
S Geman
S Letovsky
S Mostafavi
SF Altschul
SR Collins
SZ Li
T Gabaldon
U Karaoz
V Vethantham
XL Chen
Y Chen
Y Guan
Yiannis A. I. Kourmpetis
Z Barutcuoglu
Z Wei
Publication venue: Public Library of Science
Publication date: 01/01/2010

Inference of protein functions is one of the most important aims of modern biology. To fully exploit the large volumes of genomic data typically produced in modern-day genomic experiments, automated computational methods for protein function prediction are urgently needed. Established methods use sequence or structure similarity to infer functions, but those types of data do not suffice to determine the biological context in which proteins act. Current high-throughput biological experiments produce large amounts of data on the interactions between proteins. Such data can be used to infer interaction networks and to predict the biological process that the protein is involved in. Here, we develop a probabilistic approach for protein function prediction using network data, such as protein-protein interaction measurements. We take a Bayesian approach to an existing Markov Random Field method by performing simultaneous estimation of the model parameters and prediction of protein functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to more accurate parameter estimates and consequently to improved prediction performance compared to the standard Markov Random Fields method. We tested our method using a high-quality S. cerevisiae validation network with 1622 proteins against 90 Gene Ontology terms of different levels of abstraction. Compared to three other protein function prediction methods, our approach shows very good prediction performance. Our method can be directly applied to protein-protein interaction or coexpression networks, but can also be extended to use multiple data sources. We apply our method to physical protein interaction data from S. cerevisiae and provide novel predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins, and we evaluate the predictions using the available literature.

Public Library of Science (PLOS)

Wageningen University & Research Publications

Fast approximate hierarchical clustering using similarity heuristics

Author: A Saeed
AJ Saldanha
AK Jain
C Böhm
D Eppstein
J Herrero
J Vilo
Jaak Vilo
JC Gower
L Kaufmann
M Ashburner
M Lukk
MB Eisen
Meelis Kull
MJL de Hoon
P Erdös
P Legendre
P Zezula
Q Zhang
R Shyamsundar
S Datta
T Cormen
Z Du
Publication venue: BioMed Central
Publication date: 01/01/2008

CiteSeerX

Springer - Publisher Connector

Subontology Extraction Using Hyponym and Hypernym Closure on is-a Directed Acyclic Graphs

Author: Janaqi Stefan
Ranwez Sylvie
Ranwez Vincent
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 20/12/2012

Ontologies are successfully used as semantic guides when navigating through the huge and ever-increasing quantity of digital documents. Nevertheless, the size of numerous domain ontologies tends to grow beyond the human capacity to grasp information. This growth is problematic for many key applications that require user interaction, such as document annotation or ontology modification/evolution. The problem could be partially overcome by providing users with a sub-ontology focused on their current concepts of interest. A sub-ontology restricted to this sole set of concepts is of limited interest, since their relationships generally cannot be made explicit without adding some of their hyponyms and hypernyms. This paper proposes efficient algorithms to identify these additional key concepts based on the closure of two common graph operators: the least common ancestor and the greatest common descendant. The resulting method produces ontology excerpts focused on a set of concepts of interest and is fast enough to be used in interactive environments. As an example, we use the resulting program, called OntoFocus (http://www.ontotoolkit.mines-ales.fr/), to restrict, in a few seconds, the large Gene Ontology (~30,000 concepts) to a sub-ontology focused on concepts annotating a gene related to breast cancer.
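The hypernym side of the closure idea in this abstract can be illustrated with a naive sketch on a toy is-a graph. This is not the OntoFocus algorithm or the efficient method the paper proposes: it only repeatedly adds least common ancestors until nothing new appears, it ignores the greatest-common-descendant operator, and the concept names and the `parents` mapping are hypothetical.

```python
from itertools import combinations


def ancestors(parents: dict, node: str) -> set:
    """All proper hypernyms of `node`, found by an iterative traversal."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen


def least_common_ancestors(parents: dict, a: str, b: str) -> set:
    """Common ancestors (each node counts as its own ancestor) that have
    no other common ancestor below them."""
    common = (ancestors(parents, a) | {a}) & (ancestors(parents, b) | {b})
    return {c for c in common
            if not any(c in ancestors(parents, d) for d in common if d != c)}


def lca_closure(parents: dict, concepts: set) -> set:
    """Naive fixpoint: keep adding pairwise LCAs until the set is stable."""
    closed = set(concepts)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(sorted(closed), 2):
            new = least_common_ancestors(parents, a, b) - closed
            if new:
                closed |= new
                changed = True
                break  # restart with the enlarged set
    return closed


# Toy is-a DAG: child -> list of direct parents (hypothetical concept names).
parents = {
    "root": [],
    "process": ["root"],
    "metabolism": ["process"],
    "signaling": ["process"],
    "lipid_metabolism": ["metabolism"],
    "lipid_signaling": ["metabolism", "signaling"],
}
print(lca_closure(parents, {"lipid_metabolism", "lipid_signaling"}))
```

On this toy graph the closure adds only `metabolism`, the single concept needed to make the relationship between the two selected concepts explicit. The quadratic pair scan is fine for illustration but would be too slow for an ontology of tens of thousands of concepts.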

HAL-CIRAD

Retrieval, alignment, and clustering of computational models based on semantic annotations

Author: Becker J
Budanitsky A
Edda Klipp
Falko Krause
Fielding R
Henkel R
Jiang J
Liebermeister W
Lin D
Marvin Schulz
Nicolas Le Novère
Resnik P
Salton G
Tohsato Y
van Rijsbergen C
Wolfram Liebermeister
Publication venue: Nature Publishing Group

As the number of computational systems biology models increases, new methods are needed to explore their content and build connections with experimental data. In this Perspective article, the authors propose a flexible semantic framework that can help achieve these aims.

Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction

Author: Ehrler Frédéric
Gobeill Julien
Mottaz Anaïs
Ruch Patrick
Tbahriti Imad
Veuthey Anne-Lise
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Functions) as defined in ENTREZ-Gene based on a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. In the suggested approach we merge two independent sentence extraction strategies. The first proposed strategy (LASt) uses argumentative features, inspired by discourse-analysis models. The second extraction scheme (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence, thus providing a full ranking of all possible candidate GeneRiFs. A combination of the two approaches is proposed, which also aims at reducing the size of the selected segment by filtering out non-content-bearing rhetorical phrases.
Results: Based on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy is already competitive (52.78%). When used in a combined approach, the extraction task clearly shows improvement, achieving a Dice score of over 57% (+10%).
Conclusions: Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.
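The Dice score reported above measures overlap between an extracted sentence and a reference GeneRiF. A minimal sketch of the plain set-based Dice coefficient follows, assuming whitespace tokenization and lowercase matching; the exact variant used in the TREC evaluation may differ, and the example sentences are hypothetical.

```python
def dice(candidate: str, reference: str) -> float:
    """Set-based Dice coefficient 2|A & B| / (|A| + |B|) over lowercase tokens."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical candidate sentence and reference annotation.
score = dice("gene regulates apoptosis", "gene regulates cell death")
print(round(score, 4))
```

Here two of the tokens are shared, giving 2 * 2 / (3 + 4) = 4/7.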

Springer - Publisher Connector

Public Library of Science (PLOS)

Archive ouverte UNIGE

Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data

Author: Kasif Simon
Kolaczyk Eric D.
Nariai Naoki
Publication venue: Public Library of Science
Publication date: 01/03/2007

Dramatic improvements in high-throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function than proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype, and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross-validation study we show that the integrated model increases recall by 18% at 50% precision, compared to using PPI data alone. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement vs. PPI depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function.
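The neighbor assumption stated in this abstract can be illustrated with a simple neighbor-voting baseline on a toy functional linkage graph. This is deliberately not the paper's probabilistic model: it only scores each term by the fraction of a protein's neighbors that carry it, and the protein identifiers, term labels, and function name are hypothetical.

```python
from collections import Counter


def neighbor_vote(graph: dict, annotations: dict, protein: str) -> dict:
    """Score each term by the fraction of the protein's graph neighbors
    annotated with it (a baseline, not a probabilistic integration)."""
    neighbors = graph.get(protein, ())
    if not neighbors:
        return {}
    votes = Counter()
    for nb in neighbors:
        for term in annotations.get(nb, ()):
            votes[term] += 1
    return {term: count / len(neighbors) for term, count in votes.items()}


# Toy functional linkage graph: protein -> neighbors (hypothetical IDs).
graph = {"P1": ["P2", "P3", "P4"]}
# Known annotations for the neighbors (hypothetical term labels).
annotations = {"P2": {"termA"}, "P3": {"termA", "termB"}, "P4": set()}

scores = neighbor_vote(graph, annotations, "P1")
for term, score in sorted(scores.items()):
    print(term, round(score, 3))
```

A threshold on these scores would trade precision against recall, which is the kind of operating point the abstract refers to when it reports recall at 50% precision.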

Boston University Institutional Repository (OpenBU)