    Systematising and scaling literature curation for genetically determined developmental disorders

    The widespread availability of genomic sequencing has transformed the diagnosis of genetically determined developmental disorders (GDD). However, this type of test often generates many candidate genetic variants, which must be reviewed and related back to the clinical features (phenotype) of the individual being tested. This frequently entails a time-consuming review of the peer-reviewed literature for case reports describing variants in the gene(s) of interest, particularly for newly described and/or very rare disorders not covered in phenotype databases. There is therefore a need for scalable, automated literature curation to increase the efficiency of this process, which should improve the speed with which diagnoses are made and increase the number of individuals diagnosed through genomic testing.

Phenotypic data in case reports and case series are not usually recorded in a standardised, computationally tractable format. Plain-text descriptions of similar clinical features may be recorded in several different ways; for example, a technical term such as ‘hypertelorism’ may be recorded as its synonym ‘widely spaced eyes’. In addition, case reports are spread across a wide range of journals, each with its own structure and file formats. The Human Phenotype Ontology (HPO) was developed to store phenotypic data in a computationally accessible format, and several initiatives link diseases to phenotype data in the form of HPO terms. However, these rely on manual expert curation, so they are not inherently scalable and cannot be updated automatically. Methods developed to date for extracting phenotype data from text at scale have relied on abstracts or open-access papers. At the time of writing, Europe PubMed Central (EPMC, https://europepmc.org/) contained approximately 39.5 million articles, of which only 3.8 million were open access. There is therefore likely to be a significant volume of phenotypic data that has not previously been used at scale, because of the difficulty of accessing non-open-access manuscripts.

In this thesis, I present a method for literature curation that can utilise all relevant published full text, through a newly developed package that can download almost all manuscripts licensed by a university or other institution. This is scalable to the full spectrum of GDD. Using manuscripts identified through manual literature review, I use a full-text download pipeline and natural language processing (NLP) methods to generate disease models comprising HPO terms weighted according to their frequency in the literature. I demonstrate iterative refinement of these models, and use a custom annotated corpus of 50 papers to show that the text-mining process has high precision and recall. I demonstrate that these models reflect true clinical disease expressivity, as defined by manual comparison with expert literature reviews, for three well-characterised GDD. Compared with the disease models in the most commonly used genetic disease phenotype databases, the automated models have increased depth of phenotyping, i.e. they contain more terms than the manually generated models. I show that, against ‘real-life’ prospectively gathered phenotypic data, automated disease models outperform existing phenotype databases in predicting diagnosis, with an increased area under the ROC curve (by 0.05 and 0.08, depending on the similarity measure).
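To make the disease-model construction concrete, here is a minimal sketch of the idea: HPO terms are weighted by the fraction of case reports in which they occur, and a patient's phenotype is scored against the model with a simple weighted-overlap similarity. This is illustrative only; the HPO IDs, counts, and the similarity function are assumptions for demonstration, not the thesis's actual pipeline or similarity measures.

```python
# Minimal illustrative sketch (not the thesis pipeline): a disease model
# is a set of HPO terms weighted by the fraction of case reports in
# which each term occurs; a patient's phenotype is then scored against
# the model with a simple weighted-overlap similarity.
from collections import Counter

def build_disease_model(papers):
    """papers: list of HPO-term lists, one per case report."""
    counts = Counter(term for paper in papers for term in set(paper))
    n = len(papers)
    return {term: c / n for term, c in counts.items()}

def similarity(model, patient_terms):
    """Fraction of the model's total weight matched by the patient."""
    total = sum(model.values())
    matched = sum(w for term, w in model.items() if term in patient_terms)
    return matched / total if total else 0.0

# Invented example: three case reports and one patient.
papers = [
    ["HP:0000316", "HP:0001250"],  # hypertelorism, seizure
    ["HP:0000316", "HP:0001263"],  # hypertelorism, developmental delay
    ["HP:0001250", "HP:0001263"],
]
model = build_disease_model(papers)
print(similarity(model, {"HP:0000316", "HP:0001263"}))  # ~0.67
```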
I present a method for automated PubMed search at scale, to use as input for disease-model generation. I annotated a corpus of 6,500 abstracts and use it to show high precision (up to 0.80) and recall (up to 1.00) for machine-learning classifiers that identify manuscripts relevant to GDD. These classifiers use hand-picked, domain-specific features, for example specific MeSH terms, and allow automated literature curation to be scaled to the full spectrum of GDD. I also present an analysis of the phenotypic terms used in one year of GDD-relevant papers in a prominent journal, which shows that using supplemental data and parsing the clinical-report sections of manuscripts is likely to yield more patient-specific phenotype extraction in future. In summary, I present a method for automated curation of full text from the peer-reviewed literature in the context of GDD. I demonstrate that this method is robust, reflects clinical disease expressivity, outperforms existing manual literature curation, and is scalable. Applying this process to clinical testing should, in future, improve the efficiency and accuracy of diagnosis.
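As a sketch of what such a classifier might look like, the example below trains scikit-learn's logistic regression on binary indicators for hand-picked MeSH terms. The feature list, training examples, and choice of classifier are invented for illustration; they are not the thesis's actual feature set or model.

```python
# Sketch: classify abstracts as GDD-relevant from hand-picked MeSH
# indicator features. Feature list and data are invented placeholders.
from sklearn.linear_model import LogisticRegression

MESH_FEATURES = ["Developmental Disabilities", "Phenotype",
                 "Mutation", "Exome Sequencing"]  # hypothetical picks

def featurise(mesh_terms):
    # Binary vector: is each hand-picked MeSH term present on the record?
    return [int(m in mesh_terms) for m in MESH_FEATURES]

X = [featurise(s) for s in (
    {"Developmental Disabilities", "Mutation"},   # relevant
    {"Phenotype", "Exome Sequencing"},            # relevant
    {"Neoplasms"},                                # not relevant
    {"Diet"},                                     # not relevant
)]
y = [1, 1, 0, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([featurise({"Mutation", "Phenotype"})]))  # expected: [1]
```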

    Text Mining for Drug Discovery

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. To access and integrate these data effectively, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods to support ontology construction have been proposed in the past, but well-validated systems are largely missing.

Motivation: The biocuration community plays a central role in the development of ontologies, and any method that can support its efforts has the potential for huge impact in the life sciences. Recently, a number of semantic search engines have been created that use biomedical ontologies for document retrieval. To transfer this technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, where comprehensive search is of special interest for determining whether an alternative method is available.

Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system that supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent–child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun phrases in text; for definitions and parent–child relations it employs pattern-based web searches. Each generation step has been systematically evaluated against manually validated benchmarks. Term generation yields high-quality terms that are also found in manually created ontologies; definitions can be retrieved for up to 78% of terms, and parent–child relations for up to 54%. No other validated system achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology was developed containing 17,151 terms, of which 10% were newly created and 90% re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search with Go3R, the engine expands the query using the structure and terminology of the ontology. The machine classification employed in Go3R distinguishes documents related to alternative methods from those that are not, with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because they contained a specific term or through automatic classification. The Go3R search engine is available online at www.Go3R.org.
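The sketch below illustrates the flavour of the term generation step: candidate phrases are ranked by over-representation in a domain corpus relative to a background corpus. The naive bigram extraction and frequency-ratio score are stand-ins for DOG4DAG's actual noun-phrase detection and significance statistic.

```python
# Sketch: rank candidate "phrases" (naive bigrams) by how over-
# represented they are in domain text versus background text. A real
# system would use noun-phrase chunking and a proper significance test.
from collections import Counter

def bigram_counts(texts):
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def candidate_terms(domain_texts, background_texts, top_n=5):
    fg = bigram_counts(domain_texts)
    bg = bigram_counts(background_texts)
    # Add-one smoothing in the denominator to avoid division by zero.
    scores = {p: c / (bg.get(p, 0) + 1) for p, c in fg.items() if c > 1}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

domain = ["in vitro methods can replace animal testing",
          "alternatives to animal testing reduce animal use"]
background = ["the committee met on tuesday", "animal shelters need help"]
print(candidate_terms(domain, background))  # [('animal', 'testing')]
```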

    Taxonomic models of individual differences: A guide to transdisciplinary approaches

    Models and constructs of individual differences are numerous and diverse, but detecting commonalities, differences and interrelations is hindered by common abstract terms (e.g. ‘personality’, ‘temperament’, ‘traits’) that do not reveal the particular phenomena denoted. This article applies a transdisciplinary paradigm for research on individuals that builds on complexity theory and epistemological complementarity. Its philosophical, metatheoretical and methodological frameworks provide concepts to differentiate various kinds of phenomena (e.g. physiology, behaviour, psyche, language). These are used to scrutinize the field's basic concepts and to elaborate methodological foundations for taxonomizing individual variations in humans and other species. This guide to developing comprehensive and representative models explores the decisions taxonomists must make about which individual variations to include, which to retain and how to model them. Selection and reduction approaches from various disciplines are classified by their underlying rationales, pinpointing possibilities and limitations. The analyses highlight that individuals' complexity cannot be captured by any one universal model. Instead, multiple models, phenotypically taxonomizing different kinds of variability in different kinds of phenomena, are needed to explore their causal and functional interrelations and ontogenetic development, which are then modelled in integrative and explanatory taxonomies. This research agenda requires the expertise of many disciplines and is inherently transdisciplinary.

    Information Extraction from Text for Improving Research on Small Molecules and Histone Modifications

    The cumulative number of publications, particularly in the life sciences, requires efficient methods for the automated extraction of information and for semantic information retrieval. The recognition and identification of information-carrying units in text – concept denominations and named entities – relevant to a given domain is a fundamental step. The focus of this thesis lies on the recognition of chemical entities and of a new biological named entity type, histone modifications, both of which are important in the field of drug discovery. Because the emergence of new research fields and the discovery of novel entities go along with the coinage of new terms, named entity recognition approaches must be continually adapted to new domains. Two methodologies were investigated for this purpose: a state-of-the-art machine learning method, Conditional Random Fields (CRF), and an approximate string search method based on dictionaries.

Recognition methods that rely on dictionaries depend strongly on the availability of entity terminology collections and on their quality. For chemical entities, the terminology is distributed over more than seven publicly available data sources. Joining the entries and accompanying terminology from selected resources enabled the generation of a new dictionary of chemical named entities. Combined with automatic processing of the terminology – dictionary curation – recognition performance reached an F1 measure of 0.54, an improvement of 29% over the raw dictionary. The highest recall, 0.79, was achieved for the class of TRIVIAL names. The recognition and identification of chemical named entities is a prerequisite for extracting related, pharmacologically relevant information from the literature. To this end, lexico-syntactic patterns were defined that support the automated extraction of hypernymic phrases containing pharmacological function terminology related to chemical compounds. It was shown that 29–50% of the automatically extracted terms can be proposed as novel functional annotations for chemical entities in the reference database DrugBank; they also provide a basis for building concept hierarchies and ontologies, or for extending existing ones. Subsequently, the pharmacological function and biological activity concepts obtained from text were included in a novel descriptor for chemical compounds, and its successful application to predicting the pharmacological function of molecules and to extending chemical classification schemes, such as the Anatomical Therapeutic Chemical (ATC) classification, is demonstrated.

In contrast to chemical entities, no comprehensive terminology resource was available for histone modifications. Histone modification terminology was therefore first recognized in text via CRFs, with an F1 measure of 0.86. Linguistic variants of the extracted histone modification terms were then mapped to standard representations, which were organized into a newly assembled histone modification hierarchy. The mapping was accomplished by a newly developed term mapping approach described in the thesis. The combination of term recognition and term variant resolution constitutes a new procedure for assembling terminology collections, supporting the generation of term lists applicable in dictionary-based methods. For the recognition of histone modifications in text, the dictionary-based named entity recognition method proved superior to the machine learning approach used. In conclusion, this thesis provides techniques that enable enhanced utilization of textual data, supporting research in epigenomics and drug discovery.
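As a sketch of the general shape of dictionary-based approximate entity recognition, the example below slides token windows over text and matches them against a small curated dictionary with a string-similarity threshold. difflib's SequenceMatcher stands in for the thesis's approximate string search method, and the dictionary entries and threshold are toy assumptions.

```python
# Sketch: dictionary-based approximate matching of chemical names.
# difflib is a stand-in for the thesis's approximate search method.
from difflib import SequenceMatcher

DICTIONARY = {  # toy curated dictionary: surface form -> identifier
    "acetylsalicylic acid": "CHEBI:15365",
    "paracetamol": "CHEBI:46195",
}

def find_entities(text, threshold=0.85):
    tokens = text.lower().replace(".", "").split()
    hits = []
    for n in (1, 2):  # 1- and 2-token windows
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            for name, ident in DICTIONARY.items():
                if SequenceMatcher(None, span, name).ratio() >= threshold:
                    hits.append((span, name, ident))
    return hits

# The misspelling still matches thanks to the approximate comparison.
print(find_entities("The patient took acetylsalicilic acid daily."))
```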

    Automatic Population of Structured Reports from Narrative Pathology Reports

    Structured pathology reports have a number of advantages: they can ensure the accuracy and completeness of pathology reporting, and they make it easier for referring doctors to glean pertinent information. The goal of this thesis is to extract pertinent information from free-text pathology reports, automatically populate structured reports for cancers, and identify the commonalities and differences in processing principles needed to obtain maximum accuracy. Three pathology corpora were annotated with entities and the relationships between them: a melanoma corpus, a colorectal cancer corpus and a lymphoma corpus. A supervised machine-learning approach using conditional random fields learners was developed to recognise medical entities in the corpora. Feature engineering identified the best feature configurations, which boosted the F-scores significantly, by 4.2% to 6.8%, on the training sets. Because the quality of structured reports is diminished without proper negation and uncertainty detection, dedicated negation and uncertainty detection modules were built; these obtained overall F-scores ranging from 76.6% to 91.0% on the test sets. A relation extraction system was developed to extract four relations from the lymphoma corpus. It achieved very good performance on the training set, with a 100% F-score for the rule-based module and a 97.2% F-score for the support vector machine classifier. Rule-based approaches were used to generate the structured outputs and populate predefined templates with them, attaining F-scores of over 97% on the training sets. Finally, a pipeline system assembling all of the components above achieved promising results in end-to-end evaluations, with F-scores of 86.5%, 84.2% and 78.9% on the melanoma, colorectal cancer and lymphoma test sets respectively.
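The following is a minimal NegEx-style sketch of what such a negation module can look like: an entity is flagged as negated if a trigger token appears within a fixed window before it. The trigger list and window size are illustrative assumptions, not the thesis's tuned configuration.

```python
# Sketch: window-based negation detection over tokenised report text.
# Trigger list and window size are illustrative, not tuned values.
NEGATION_TRIGGERS = {"no", "not", "without", "absent"}

def is_negated(tokens, entity_index, window=4):
    """True if a negation trigger occurs within `window` tokens
    before the entity at `entity_index`."""
    context = tokens[max(0, entity_index - window):entity_index]
    return any(t.lower() in NEGATION_TRIGGERS for t in context)

tokens = "Sections show no evidence of lymphovascular invasion".split()
print(is_negated(tokens, tokens.index("lymphovascular")))  # True
```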

    Applications of Natural Language Processing in Biodiversity Science

    Centuries of biological knowledge are contained in the massive body of scientific literature, written for human readability but too large for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to become a data-driven science. A computer can handle the volume but cannot, by itself, make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from the systematic literature. NLP algorithms have been used for decades, but they require special development for the biological realm because of the specialised nature of its language. Many tools exist for biological information extraction (of cellular processes, taxonomic names, and morphological characters), but none has been applied life-wide, and most still require testing and development. Progress has been made in developing algorithms for the automated annotation of taxonomic text, the identification of taxonomic names in text, and the extraction of morphological character information from taxonomic descriptions. This manuscript briefly discusses the key steps in applying information extraction tools to enhance biodiversity science.
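As a toy illustration of one of those key steps, identifying taxonomic names in text, the sketch below exploits the typographic convention of binomial nomenclature (a capitalised genus followed by a lowercase epithet). Real name-finding tools combine dictionaries and machine learning and are far more robust than this regex.

```python
# Sketch: find candidate binomial names by their typographic pattern.
# A stop list crudely filters sentence-initial false positives.
import re

BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")
STOP = {"The", "These", "This"}  # crude filter for ordinary sentences

def candidate_names(text):
    return [m for m in BINOMIAL.findall(text)
            if m.split()[0] not in STOP]

text = ("Specimens of Quercus robur and Apis mellifera were collected. "
        "The results suggest a wide distribution.")
print(candidate_names(text))  # ['Quercus robur', 'Apis mellifera']
```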