25 research outputs found

    Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks

    <p>Abstract</p> <p>Background</p> <p>Uncovering the cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work and, often, sophisticated data mining and processing tools. A protein's functions, often referred to as its annotations, are believed to manifest themselves through the topology of inter-protein interaction networks. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having general biological significance, such a demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interaction datasets.</p> <p>Results</p> <p>We developed a method for automatic extraction of protein functional annotation from scientific text based on Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates, and compared the performance of the automatic extraction technology to that of manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we reported a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly more densely linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. 
We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease of the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We revealed that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller.</p> <p>Conclusion</p> <p>Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting an essential role for the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence for their functional similarity.</p>

    Automatic pathway building in biological association networks

    BACKGROUND: Scientific literature is a source of the most reliable and comprehensive knowledge about molecular interaction networks. Formalization of this knowledge is necessary for computational analysis and is achieved by automatic fact extraction using various text-mining algorithms. Most of these techniques suffer from high false positive rates and redundancy of the extracted information. The extracted facts form a large network with no pathways defined. RESULTS: We describe the methodology for automatic curation of Biological Association Networks (BANs) derived by a natural language processing technology called MedScan. The curated data is used for automatic pathway reconstruction. The algorithm for the reconstruction of signaling pathways is also described and validated by comparison with manually curated pathways and tissue-specific gene expression profiles. CONCLUSION: Biological Association Networks extracted by MedScan technology contain sufficient information for constructing thousands of mammalian signaling pathways for multiple tissues. The automatically curated MedScan data is adequate for automatic generation of good-quality signaling networks. The automatically generated Regulome pathways and manually curated pathways used for their validation are available free in the ResNetCore database from Ariadne Genomics, Inc. [1]. The pathways can be viewed and analyzed through the use of a free demo version of PathwayStudio software. The MedScan technology is also available for evaluation using the free demo version of PathwayStudio software.

    Atlas of Signaling for Interpretation of Microarray Experiments

    Microarray-based expression profiling of living systems is a quick and inexpensive method to obtain insights into the nature of various diseases and phenotypes. A typical microarray profile can yield hundreds or even thousands of differentially expressed genes, and finding biologically plausible themes or regulatory mechanisms underlying these changes is a daunting task. We describe a novel approach for systems-level interpretation of microarray expression data using a manually constructed “overview” pathway depicting the main cellular signaling channels (Atlas of Signaling). Currently, the developed pathway focuses on signal transduction from surface receptors to transcription factors and further transcriptional regulation of cellular “workhorse” proteins. We show how the constructed Atlas of Signaling, in combination with an enrichment analysis algorithm, allows quick identification and visualization of the main signaling cascades and cellular processes affected in a gene expression profiling experiment. We validate our approach using several publicly available gene expression datasets.
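The abstract above does not say which enrichment analysis algorithm is used. As an illustration only, one common choice is a one-sided hypergeometric test, which asks how likely it is to see at least the observed number of pathway genes among the differentially expressed genes by chance. The function name and the toy numbers below are hypothetical and are not taken from the paper.

```python
from math import comb


def hypergeom_enrichment_p(population: int, pathway_size: int,
                           selected: int, overlap: int) -> float:
    """Return P(X >= overlap) for a hypergeometric draw.

    population   -- total number of genes measured on the array
    pathway_size -- number of those genes that belong to the pathway
    selected     -- number of differentially expressed genes
    overlap      -- differentially expressed genes that are in the pathway
    """
    upper = min(pathway_size, selected)
    # Sum the probability mass over every overlap at least as large as observed.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(population, selected)


if __name__ == "__main__":
    # Toy numbers chosen for illustration, not data from the paper.
    p = hypergeom_enrichment_p(population=10000, pathway_size=50,
                               selected=200, overlap=8)
    print(f"enrichment p-value: {p:.3g}")
```

A small p-value suggests that the pathway is over-represented among the changed genes. In practice the p-values for many pathways would also need a multiple-testing correction, which this sketch omits.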

    Extracting Protein Function Information from MEDLINE Using a Full-Sentence Parser (Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics)

    The living cell is a complex machine that depends on the proper functioning of its numerous parts, including proteins. Understanding protein functions and how they modify and regulate each other is the next great challenge for life science researchers. The collective knowledge about protein functions and pathways is scattered throughout numerous publications in scientific journals. Bringing the relevant information together creates a bottleneck in the research and discovery process. The volume of such information grows exponentially, which, in turn, renders manual curation impractical. As a viable alternative, automated literature processing tools could be employed to extract and organize biological data into a knowledge base, making it amenable to computational analysis and data mining. We present MedScan, a completely automated NLP-based information extraction system. We have used MedScan to extract about 280,000 functional links between mammalian proteins from the entire 2003 release of MEDLINE in only 21 hours. The precision of the extracted information was found to be 91%. We have compared the extracted data with protein co-occurrence data and with nine well-studied cellular signaling pathways, and estimated the recovery rate of MedScan for the entirety of MEDLINE to be between 30% and 50%. Further improvement of the MedScan technology is discussed.

    An overlap between a network cluster obtained by the Potts algorithm [31] and the best-matching GO groups from the biological function GOA combined from MedScan annotation and public annotation

    <p><b>Copyright information:</b></p><p>Taken from "Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks"</p><p>http://www.biomedcentral.com/1471-2105/8/243</p><p>BMC Bioinformatics 2007;8:243.</p><p>Published online 10 Jul 2007</p><p>PMCID:PMC1940026.</p><p></p> The cluster contains nine proteins involved in DNA repair and telomere capping: ATM – ataxia telangiectasia mutated homolog (human) (mapped); PRKDC – catalytic polypeptide of DNA-activated protein kinase; NBS1 – nibrin; CHEK2 – protein kinase Chk2; XRCC5 – X-ray repair complementing defective repair in Chinese hamster cells 5; H2AFX – H2A histone family, member X; G22P1 – thyroid autoantigen; NFBD1 – mediator of DNA damage checkpoint 1; TREX1 – three prime repair exonuclease 1. The ataxia-telangiectasia mutated (ATM) kinase signals the presence of DNA double-strand breaks in mammalian cells by phosphorylating proteins that initiate cell-cycle arrest, apoptosis, and DNA repair. The Mre11-Rad50-Nbs1 (MRN) complex acts as a double-strand break sensor for ATM and recruits ATM to broken DNA molecules [42]. Activated ATM phosphorylates its downstream cellular targets H2AFX and Chk2 as well as proteins directly involved in DNA repair: XRCC5, TREX1 and NFBD1. G22P1 and PRKDC are subunits of DNA-activated protein kinase, which can be induced by DNA damage to promote DNA end joining [43]. It can also attenuate CHK2 control of the damage checkpoint [44]. – The portion of GO classification overlapping with network cluster. The GO classification tree depiction is the same as in Figure 5A. – The network cluster overlapping with GO classification from Figure A. Highlighted proteins belong to the best overlapping GO group from molecular function classification – telomere capping (GO:0016233). 
The proteins selected by the blue line belong to the second best overlapping GO group from combined biological processes classification – double-strand break repair (GO:0006302). Gray links indicate relation, violet links indicate relation, and green arrows represent relations

    A scatter plot of the number of links of a randomized versus real binding network in the public cellular component GOA

    All GO groups below the diagonal line have fewer links in the randomized network than in the real one. The error bars correspond to the p-value 10for normal distribution; that is, if the top of an error bar lies below the diagonal line, the probability that the corresponding GO group has this number of links by pure chance is equal to or less than 10. It appears that only a few small GO groups are not linked densely enough to satisfy the 10threshold
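The comparison in this caption can be sketched in code. The example below counts the links inside a GO group in the real network and compares that count with a chance expectation, expressed as a z-score under a normal approximation. As a stated simplification, it draws random node sets of the same size from the same network instead of rewiring the network; the paper's actual randomization procedure is not given in this excerpt. All names and the toy network are hypothetical.

```python
import random
from statistics import mean, pstdev


def links_within(edges, group):
    """Count the edges whose two endpoints both belong to `group`."""
    members = set(group)
    return sum(1 for a, b in edges if a in members and b in members)


def randomized_link_stats(edges, nodes, group_size, trials=1000, seed=0):
    """Mean and standard deviation of within-group link counts for random
    node sets of `group_size` (a simplified null model, see lead-in)."""
    rng = random.Random(seed)
    counts = [
        links_within(edges, rng.sample(nodes, group_size))
        for _ in range(trials)
    ]
    return mean(counts), pstdev(counts)


def z_score(real, mu, sigma):
    """Standard deviations by which the real count exceeds the random mean."""
    if sigma == 0:
        if real == mu:
            return 0.0
        return float("inf") if real > mu else float("-inf")
    return (real - mu) / sigma


if __name__ == "__main__":
    # Toy network: one tight triangle plus a sparse remainder.
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f")]
    nodes = ["a", "b", "c", "d", "e", "f", "g", "h"]
    group = ["a", "b", "c"]

    real = links_within(edges, group)
    mu, sigma = randomized_link_stats(edges, nodes, len(group))
    print(real, round(mu, 2), round(z_score(real, mu, sigma), 2))
```

A group that would fall below the diagonal in the scatter plot corresponds to a positive z-score here. Whether it passes a given p-value threshold depends on the normal tail probability for that z-score.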

    Distribution of the number of protein-GO associations for three GO annotations

    Horizontal axis, GO degree – number of GO associations; Vertical axis, Probability of GO degree – fraction of proteins with a given GO degree; Red line – public GOA, green – MedScan GOA, black – combined GOA
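The quantity plotted in this figure, the fraction of proteins with a given GO degree, can be computed directly from a protein-to-GO mapping. The sketch below assumes the annotation is available as a dictionary from protein name to a list of GO identifiers; that input format, the function name, and the toy data are illustrative assumptions, not the paper's data structures.

```python
from collections import Counter


def go_degree_distribution(annotations):
    """Map each GO degree to the fraction of proteins having that degree.

    `annotations` maps a protein name to an iterable of GO identifiers;
    duplicate identifiers for one protein are counted once.
    """
    degrees = [len(set(terms)) for terms in annotations.values()]
    if not degrees:
        return {}
    counts = Counter(degrees)
    return {deg: n / len(degrees) for deg, n in sorted(counts.items())}


if __name__ == "__main__":
    # Toy annotation set with placeholder GO identifiers.
    toy = {
        "P1": ["GO:A", "GO:B"],
        "P2": ["GO:A"],
        "P3": ["GO:C"],
        "P4": ["GO:A", "GO:B", "GO:B"],
    }
    print(go_degree_distribution(toy))
```

Running the same function on the public, MedScan, and combined annotation sets would give the three curves that the caption describes.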