138,309 research outputs found

    GO for gene documents

    Semantic Clustering of Genomic Documents using GO Terms as Feature Set

    Biological databases generate huge volumes of genomics and proteomics data. Researchers use the sequence information to find similarities between genes and proteins and to retrieve other related information. A genomic sequence database contains a large number of attributes, stored as annotations that describe the sequences in XML format. A proper mechanism is needed to group the documents for information retrieval. Data mining techniques such as clustering and classification can be used to group the documents. The objective of the paper is to analyze the set of keywords that can serve as features for grouping the documents semantically. This paper focuses on clustering genomic documents based on both structural and content similarity. The structural similarity is found using the structural path between the documents. The semantic similarity is then found for the structurally similar documents. We propose a methodology to cluster the genomic documents using sequence attributes, without using the sequence data. The sequence attributes of the genomic documents are analyzed using filter-based feature selection methods to find the relevant feature set for grouping similar documents. Based on the attribute ranking, we clustered similar documents using an all-keywords-based approach (KBA) and a GO-terms-based approach (GOTA). The clusters produced by the two approaches are validated by inferring biological meaning using the Gene Ontology. The results indicate that the all-keywords-based approach grouped documents according to the semantic meaning of Gene Ontology terms. The GO-terms-based approach grouped a larger number of semantically relevant documents without considering any other keywords, which reduces the complexity of the attributes considered. We claim that GO terms alone can be used as the feature set to group genomic documents with high similarity.
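
The idea of grouping documents by their GO terms alone can be illustrated with a minimal sketch. The abstract does not state which similarity measure or clustering algorithm the authors used, so the Jaccard similarity, the single-link grouping, the 0.5 threshold, and the function names below are all assumptions made for illustration. The document ids and GO term assignments are invented sample data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of GO term identifiers."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_go_terms(docs, threshold=0.5):
    """Group documents whose GO-term sets are transitively similar.

    docs: dict mapping a document id to a set of GO term ids.
    threshold: assumed similarity cut-off, not taken from the paper.
    """
    parent = {d: d for d in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of documents that is similar enough (single-link).
    for a, b in combinations(docs, 2):
        if jaccard(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for d in docs:
        clusters.setdefault(find(d), []).append(d)
    return sorted(sorted(c) for c in clusters.values())


# Invented sample data: document ids mapped to GO term ids.
docs = {
    "doc1": {"GO:0006915", "GO:0005515"},
    "doc2": {"GO:0006915", "GO:0005515", "GO:0005634"},
    "doc3": {"GO:0008152"},
}
print(cluster_by_go_terms(docs))
```

Here `doc1` and `doc2` share two of three distinct terms and fall into one group, while `doc3` shares none and stays alone.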

    Mining protein function from text using term-based support vector machines

    Background: Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results: The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion: A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2.

    Automatically linking MEDLINE abstracts to the Gene Ontology

    Much has been written recently about the need for effective tools and methods for mining the wealth of information present in biomedical literature (Mack and Hehenberger, 2002; Blagosklonny and Pardee, 2001; Rindflesch et al., 2002): the activity of conceptual biology. Keyword search engines operating over large electronic document stores (such as PubMed and the PNAS) offer some help, but there are fundamental obstacles that limit their effectiveness. First, there is no general consensus among scientists about the vernacular to be used when describing research about genes, proteins, drugs, diseases, tissues and therapies, making it very difficult to formulate a search query that retrieves the right documents. Second, finding relevant articles is just one aspect of the investigative process. A more fundamental goal is to establish links and relationships between facts existing in published literature in order to “validate current hypotheses or to generate new ones” (Barnes and Robertson, 2002), something keyword search engines do little to support.

    Ontology-Based MEDLINE Document Classification

    An increasing and overwhelming amount of biomedical information is available in the research literature, mainly in the form of free text. Biologists need tools that automate their information search and deal with the high volume and ambiguity of free text. Ontologies can help automatic information processing by providing standard concepts and information about the relationships between concepts. The Medical Subject Headings (MeSH) ontology is already available and used by MEDLINE indexers to annotate the conceptual content of biomedical articles. This paper presents a domain-independent method that uses the MeSH ontology inter-concept relationships to extend the existing MeSH-based representation of MEDLINE documents. The extension method is evaluated within a document triage task organized by the Genomics track of the 2005 Text REtrieval Conference (TREC). Our method for extending the representation of documents leads to an improvement of 17% over a non-extended baseline in terms of normalized utility, the metric defined for the task. The SVMlight software is used to classify documents.

    The TREC 2004 genomics track categorization task: classifying full text biomedical documents

    BACKGROUND: The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask of the categorization task required the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks were concerned with the assignment of the three top-level GO categories to each paper containing evidence for these categories. RESULTS: The track had 33 participating groups. The mean utility measure for the triage subtask was 0.3303, with a top score of 0.6512. No system was able to substantially improve results over simply using the MeSH term Mice. Analysis found that significant feature overlap between the training and test sets was less than expected. Sample coverage of GO terms assigned to papers in the collection was very sparse. Determining papers containing GO term evidence will likely need to be treated as separate tasks for each concept represented in GO, and therefore require much denser sampling than was available in the data sets. The annotation subtask had a mean F-measure of 0.3824, with a top score of 0.5611. The mean F-measure for the annotation plus evidence codes subtask was 0.3676, with a top score of 0.4224. Gene name recognition was found to be of benefit for this task. CONCLUSION: Automated classification of documents for GO annotation is a challenging task, as was the automated extraction of GO code hierarchies and evidence codes. However, automating these tasks would provide substantial benefit to biomedical curation, and therefore work in this area must continue. Additional experience will allow comparison and further analysis about which algorithmic features are most useful in biomedical document classification, and better understanding of the task characteristics that make automated classification feasible and useful for biomedical document curation. The TREC Genomics Track will be continuing in 2005, focusing on a wider range of triage tasks and improving results from 2004.
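
The two kinds of score quoted in this abstract can be sketched as follows. The F-measure is the standard harmonic mean of precision and recall. The abstract does not define the utility measure, so the form used here (a weighted count of true positives minus false positives, divided by the best achievable score) and the default weight `u_r=20` are assumptions, and the counts in the usage lines are invented.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def normalized_utility(tp, fp, fn, u_r=20.0):
    """Assumed utility form: reward u_r per true positive, penalty 1 per
    false positive, scaled by the score of a perfect system."""
    raw = u_r * tp - fp
    best = u_r * (tp + fn)  # every relevant document found, no false alarms
    return raw / best if best else 0.0


# Invented counts for illustration only.
print(f_measure(tp=8, fp=2, fn=2))
print(normalized_utility(tp=8, fp=20, fn=2))
```

A large `u_r` makes a missed relevant paper far more costly than a false alarm, which suits a triage setting where curators would rather read an extra paper than lose one.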

    Literature-based discovery of diabetes- and ROS-related targets

    Background: Reactive oxygen species (ROS) are known mediators of cellular damage in multiple diseases including diabetic complications. Despite its importance, no comprehensive database is currently available for the genes associated with ROS. Methods: We present ROS- and diabetes-related targets (genes/proteins) collected from the biomedical literature through a text mining technology. A web-based literature mining tool, SciMiner, was applied to 1,154 biomedical papers indexed with diabetes and ROS by PubMed to identify relevant targets. Over-represented targets in the ROS-diabetes literature were obtained through comparisons against randomly selected literature. The expression levels of nine genes, selected from the top ranked ROS-diabetes set, were measured in the dorsal root ganglia (DRG) of diabetic and non-diabetic DBA/2J mice in order to evaluate the biological relevance of literature-derived targets in the pathogenesis of diabetic neuropathy. Results: SciMiner identified 1,026 ROS- and diabetes-related targets from the 1,154 biomedical papers (http://jdrf.neurology.med.umich.edu/ROSDiabetes/). Fifty-three targets were significantly over-represented in the ROS-diabetes literature compared to randomly selected literature. These over-represented targets included well-known members of the oxidative stress response including catalase, the NADPH oxidase family, and the superoxide dismutase family of proteins. Eight of the nine selected genes exhibited significant differential expression between diabetic and non-diabetic mice. For six genes, the direction of expression change in diabetes paralleled enhanced oxidative stress in the DRG. Conclusions: Literature mining compiled ROS-diabetes related targets from the biomedical literature and led us to evaluate the biological relevance of selected targets in the pathogenesis of diabetic neuropathy.
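
One common way to test whether a target is over-represented in a topic-specific set of papers compared with a background set is a one-sided hypergeometric (Fisher-style) test. The abstract does not say which statistical test the authors applied, so the sketch below is an assumption, and the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits_in_set, set_size, hits_in_background, background_size):
    """One-sided hypergeometric tail probability.

    Returns the chance of seeing at least `hits_in_set` papers mentioning
    a target in the topic set if mentions were spread at random across
    the topic set and the background set combined.
    """
    total = set_size + background_size
    total_hits = hits_in_set + hits_in_background
    upper = min(set_size, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, set_size - k)
        for k in range(hits_in_set, upper + 1)
    )
    return tail / comb(total, set_size)


# Hypothetical counts: a target mentioned in 40 of 100 topic papers
# but in only 5 of 100 randomly selected background papers.
print(enrichment_p_value(40, 100, 5, 100))
```

A small p-value suggests the target appears in the topic literature more often than chance would explain. In practice a correction for testing many targets at once would also be needed.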

    StemNet: An Evolving Service for Knowledge Networking in the Life Sciences

    Until now, crucial life science information resources, whether bibliographic or factual databases, have been isolated from each other. Moreover, the semantic metadata intended to structure their contents is supplied manually only. In the StemNet project we aim to develop a framework for semantic interoperability among these resources. This will facilitate the extraction of relevant information from textual sources and the generation of semantic metadata in a fully automatic manner. In this way, life science documents that are unstructured from a computational perspective are linked to structured biological fact databases, in particular to the identifiers of genes, proteins, etc. Thus, life scientists will be able to seamlessly access information from a homogeneous platform, even though the original information was unlinked and scattered over a wide variety of heterogeneous life science information resources and, therefore, almost inaccessible for integrated systematic search by academic, clinical, or industrial users.