60 research outputs found

    From Text to Knowledge

    Get PDF
    The global information space provided by the World Wide Web has changed dramatically the way knowledge is shared all over the world. To make this unbelievable huge information space accessible, search engines index the uploaded contents and provide efficient algorithmic machinery for ranking the importance of documents with respect to an input query. All major search engines such as Google, Yahoo or Bing are keyword-based, which is indisputable a very powerful tool for accessing information needs centered around documents. However, this unstructured, document-oriented paradigm of the World Wide Web has serious drawbacks, when searching for specific knowledge about real-world entities. When asking for advanced facts about entities, today's search engines are not very good in providing accurate answers. Hand-built knowledge bases such as Wikipedia or its structured counterpart DBpedia are excellent sources that provide common facts. However, these knowledge bases are far from being complete and most of the knowledge lies still buried in unstructured documents. Statistical machine learning methods have the great potential to help to bridge the gap between text and knowledge by (semi-)automatically transforming the unstructured representation of the today's World Wide Web to a more structured representation. This thesis is devoted to reduce this gap with Probabilistic Graphical Models. Probabilistic Graphical Models play a crucial role in modern pattern recognition as they merge two important fields of applied mathematics: Graph Theory and Probability Theory. The first part of the thesis will present a novel system called Text2SemRel that is able to (semi-)automatically construct knowledge bases from textual document collections. The resulting knowledge base consists of facts centered around entities and their relations. Essential part of the system is a novel algorithm for extracting relations between entity mentions that is based on Conditional Random Fields, which are Undirected Probabilistic Graphical Models. In the second part of the thesis, we will use the power of Directed Probabilistic Graphical Models to solve important knowledge discovery tasks in semantically annotated large document collections. In particular, we present extensions of the Latent Dirichlet Allocation framework that are able to learn in an unsupervised way the statistical semantic dependencies between unstructured representations such as documents and their semantic annotations. Semantic annotations of documents might refer to concepts originating from a thesaurus or ontology but also to user-generated informal tags in social tagging systems. These forms of annotations represent a first step towards the conversion to a more structured form of the World Wide Web. In the last part of the thesis, we prove the large-scale applicability of the proposed fact extraction system Text2SemRel. In particular, we extract semantic relations between genes and diseases from a large biomedical textual repository. The resulting knowledge base contains far more potential disease genes exceeding the number of disease genes that are currently stored in curated databases. Thus, the proposed system is able to unlock knowledge currently buried in the literature. The literature-derived human gene-disease network is subject of further analysis with respect to existing curated state of the art databases. We analyze the derived knowledge base quantitatively by comparing it with several curated databases with regard to size of the databases and properties of known disease genes among other things. Our experimental analysis shows that the facts extracted from the literature are of high quality

    Extraction of semantic biomedical relations from text using conditional random fields

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The increasing amount of published literature in biomedicine represents an immense source of knowledge, which can only efficiently be accessed by a new generation of automated information extraction tools. Named entity recognition of well-defined objects, such as genes or proteins, has achieved a sufficient level of maturity such that it can form the basis for the next step: the extraction of relations that exist between the recognized entities. Whereas most early work focused on the mere detection of relations, the classification of the type of relation is also of great importance and this is the focus of this work. In this paper we describe an approach that extracts both the existence of a relation and its type. Our work is based on Conditional Random Fields, which have been applied with much success to the task of named entity recognition.</p> <p>Results</p> <p>We benchmark our approach on two different tasks. The first task is the identification of semantic relations between diseases and treatments. The available data set consists of manually annotated PubMed abstracts. The second task is the identification of relations between genes and diseases from a set of concise phrases, so-called GeneRIF (Gene Reference Into Function) phrases. In our experimental setting, we do not assume that the entities are given, as is often the case in previous relation extraction work. Rather the extraction of the entities is solved as a subproblem. Compared with other state-of-the-art approaches, we achieve very competitive results on both data sets. To demonstrate the scalability of our solution, we apply our approach to the complete human GeneRIF database. The resulting gene-disease network contains 34758 semantic associations between 4939 genes and 1745 diseases. The gene-disease network is publicly available as a machine-readable RDF graph.</p> <p>Conclusion</p> <p>We extend the framework of Conditional Random Fields towards the annotation of semantic relations from text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive to leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of detection of entities as well as entity boundaries, which will also greatly improve the relation extraction performance.</p

    Identifying and classifying biomedical perturbations in text

    Get PDF
    Molecular perturbations provide a powerful toolset for biomedical researchers to scrutinize the contributions of individual molecules in biological systems. Perturbations qualify the context of experimental results and, despite their diversity, share properties in different dimensions in ways that can be formalized. We propose a formal framework to describe and classify perturbations that allows accumulation of knowledge in order to inform the process of biomedical scientific experimentation and target analysis. We apply this framework to develop a novel algorithm for automatic detection and characterization of perturbations in text and show its relevance in the study of gene–phenotype associations and protein–protein interactions in diabetes and cancer. Analyzing perturbations introduces a novel view of the multivariate landscape of biological systems

    Anticipating annotations and emerging trends in biomedical literature

    Full text link
    The BioJournalMonitor is a decision support system for the analysis of trends and topics in the biomedical literature. Its main goal is to identify potential diagnostic and therapeu-tic biomarkers for specific diseases. Several data sources are continuously integrated to provide the user with up-to-date information on current research in this field. State-of-the-art text mining technologies are deployed to provide added value on top of the original content, including named en-tity detection, relation extraction, classification, clustering, ranking, summarization, and visualization. We present two novel technologies that are related to the analysis of tem-poral dynamics of text archives and associated ontologies. Currently, the MeSH ontology is used to annotate the sci-entific articles entering the PubMed database with medical terms. Both the maintenance of the ontology as well as the annotation of new articles is performed largely manually. We describe how probabilistic topic models can be used to anno-tate recent articles with the most likely MeSH terms. This provides our users with a competitive advantage because, when searching for MeSH terms, articles are found long be-fore they are manually annotated. We further present a study on how to predict the inclusion of new terms in the MeSH ontology. The results suggest that early prediction of emerging trends is possible. The trend ranking functions are deployed in our system to enable interactive searches for the hottest new trends relating to a disease

    GenCLiP: a software program for clustering gene lists by literature profiling and constructing gene co-occurrence networks related to custom keywords

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Biomedical researchers often want to explore pathogenesis and pathways regulated by abnormally expressed genes, such as those identified by microarray analyses. Literature mining is an important way to assist in this task. Many literature mining tools are now available. However, few of them allows the user to make manual adjustments to zero in on what he/she wants to know in particular.</p> <p>Results</p> <p>We present our software program, GenCLiP (Gene Cluster with Literature Profiles), which is based on the methods presented by Chaussabel and Sher (<it>Genome Biol </it>2002, 3(10):RESEARCH0055) that search gene lists to identify functional clusters of genes based on up-to-date literature profiling. Four features were added to this previously described method: the ability to 1) manually curate keywords extracted from the literature, 2) search genes and gene co-occurrence networks related to custom keywords, 3) compare analyzed gene results with negative and positive controls generated by GenCLiP, and 4) calculate probabilities that the resulting genes and gene networks are randomly related. In this paper, we show with a set of differentially expressed genes between keloids and normal control, how implementation of functions in GenCLiP successfully identified keywords related to the pathogenesis of keloids and unknown gene pathways involved in the pathogenesis of keloids.</p> <p>Conclusion</p> <p>With regard to the identification of disease-susceptibility genes, GenCLiP allows one to quickly acquire a primary pathogenesis profile and identify pathways involving abnormally expressed genes not previously associated with the disease.</p

    Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases

    Get PDF
    Scientists have been trying to understand the molecular mechanisms of diseases to design preventive and therapeutic strategies for a long time. For some diseases, it has become evident that it is not enough to obtain a catalogue of the disease-related genes but to uncover how disruptions of molecular networks in the cell give rise to disease phenotypes. Moreover, with the unprecedented wealth of information available, even obtaining such catalogue is extremely difficult. We developed a comprehensive gene-disease association database by integrating associations from several sources that cover different biomedical aspects of diseases. In particular, we focus on the current knowledge of human genetic diseases including mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, we performed a systematic study of the emergent properties of human gene-disease networks by means of network topology and functional annotation analysis. The results indicate a highly shared genetic origin of human diseases and show that for most diseases, including mendelian, complex and environmental diseases, functional modules exist. Moreover, a core set of biological pathways is found to be associated with most human diseases. We obtained similar results when studying clusters of diseases, suggesting that related diseases might arise due to dysfunction of common biological processes in the cell. For the first time, we include mendelian, complex and environmental diseases in an integrated gene-disease association database and show that the concept of modularity applies for all of them. We furthermore provide a functional analysis of disease-related modules providing important new biological insights, which might not be discovered when considering each of the gene-disease association repositories independently. Hence, we present a suitable framework for the study of how genetic and environmental factors, such as drugs, contribute to diseases. The gene-disease networks used in this study and part of the analysis are available at http://ibi.imim.es/DisGeNET/DisGeNETweb.html#Download

    HypertenGene: extracting key hypertension genes from biomedical literature with position and automatically-generated template features

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The genetic factors leading to hypertension have been extensively studied, and large numbers of research papers have been published on the subject. One of hypertension researchers' primary research tasks is to locate key hypertension-related genes in abstracts. However, gathering such information with existing tools is not easy: (1) Searching for articles often returns far too many hits to browse through. (2) The search results do not highlight the hypertension-related genes discovered in the abstract. (3) Even though some text mining services mark up gene names in the abstract, the key genes investigated in a paper are still not distinguished from other genes. To facilitate the information gathering process for hypertension researchers, one solution would be to extract the key hypertension-related genes in each abstract. Three major tasks are involved in the construction of this system: (1) gene and hypertension named entity recognition, (2) section categorization, and (3) gene-hypertension relation extraction.</p> <p>Results</p> <p>We first compare the retrieval performance achieved by individually adding template features and position features to the baseline system. Then, the combination of both is examined. We found that using position features can almost double the original AUC score (0.8140vs.0.4936) of the baseline system. However, adding template features only results in marginal improvement (0.0197). Including both improves AUC to 0.8184, indicating that these two sets of features are complementary, and do not have overlapping effects. We then examine the performance in a different domain--diabetes, and the result shows a satisfactory AUC of 0.83.</p> <p>Conclusion</p> <p>Our approach successfully exploits template features to recognize true hypertension-related gene mentions and position features to distinguish key genes from other related genes. Templates are automatically generated and checked by biologists to minimize labor costs. Our approach integrates the advantages of machine learning models and pattern matching. To the best of our knowledge, this the first systematic study of extracting hypertension-related genes and the first attempt to create a hypertension-gene relation corpus based on the GAD database. Furthermore, our paper proposes and tests novel features for extracting key hypertension genes, such as relative position, section, and template features, which could also be applied to key-gene extraction for other diseases.</p
    • …
    corecore