2,006 research outputs found

    Investigating Genotype-Phenotype relationship extraction from biomedical text

    During the last decade, biomedicine has developed at a tremendous pace. Every day, many biomedical papers are published and a large amount of new information is produced. To enable automated and human interaction with the multitude of applications of this biomedical data, the need for natural language processing systems to process the vast amount of new information is increasing. Our main purpose in this research project is to extract the relationships between genotypes and phenotypes mentioned in biomedical publications. Such a system provides important and up-to-date data for database construction and updating, and even text summarization. To achieve this goal, we had to solve three main problems: finding genotype names, finding phenotype names, and finally extracting phenotype-genotype interactions. We consider all these required modules in a comprehensive system and propose a promising solution for each of them, taking into account available tools and resources. BANNER, an open-source biomedical named entity recognition system that has achieved good results in detecting genotypes, was used for the genotype name recognition task. We were the first group to work on phenotype name recognition. We developed two different systems (rule-based and machine-learning based) for extracting phenotype names from text. These systems incorporate the available knowledge from the Unified Medical Language System Metathesaurus and the Human Phenotype Ontology (HPO). As no annotated corpus for phenotype names was available, we created a valuable corpus of annotated phenotype names using information available in HPO and a self-training method, which can be used for future research. To solve the final problem of this project, i.e., phenotype-genotype relationship extraction, a machine learning method has been proposed.
As no corpus was available for this task and manually annotating a sufficiently large corpus was not feasible, a semi-automatic approach was used to annotate a small corpus, and a self-training method was proposed to annotate more sentences and enlarge it. A test set was manually annotated by an expert. In addition to annotated phenotype-genotype relationships, the test set contains important comments about the nature of these relationships. The evaluation results for each system demonstrate the strong performance of all the proposed methods.
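The self-training idea described above can be sketched as follows. This is a minimal illustration with a toy word-count classifier, not the thesis's actual model; the seed sentences, labels, and confidence threshold are invented for demonstration:

```python
# Self-training sketch: a small labelled seed set is grown by repeatedly
# training a simple model and accepting its high-confidence predictions
# on unlabelled sentences as new (pseudo-)annotations.
from collections import Counter

def train(labelled):
    """Count word frequencies per label (a toy Naive-Bayes-style model)."""
    model = {"pos": Counter(), "neg": Counter()}
    for sentence, label in labelled:
        model[label].update(sentence.lower().split())
    return model

def score(model, sentence):
    """Return (label, confidence) for a sentence under the toy model."""
    words = sentence.lower().split()
    pos = sum(model["pos"][w] for w in words)
    neg = sum(model["neg"][w] for w in words)
    total = pos + neg
    if total == 0:
        return "neg", 0.0
    label = "pos" if pos >= neg else "neg"
    return label, max(pos, neg) / total

def self_train(labelled, unlabelled, threshold=0.8, rounds=3):
    """Fold high-confidence pseudo-labels back into the training set."""
    labelled = list(labelled)
    for _ in range(rounds):
        model = train(labelled)
        remaining = []
        for sentence in unlabelled:
            label, conf = score(model, sentence)
            if conf >= threshold:
                labelled.append((sentence, label))  # pseudo-label accepted
            else:
                remaining.append(sentence)
        unlabelled = remaining
    return labelled

seed = [("BRCA1 mutation causes breast cancer", "pos"),
        ("the weather was sunny today", "neg")]
pool = ["BRCA1 mutation linked to cancer risk",
        "sunny weather all day today"]
grown = self_train(seed, pool)
```

In a real pipeline the toy scorer would be replaced by the trained relation classifier, and the threshold tuned to keep pseudo-label noise low.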

    Term-BLAST-like alignment tool for concept recognition in noisy clinical texts.

    MOTIVATION: Methods for concept recognition (CR) in clinical texts have largely been tested on abstracts or articles from the medical literature. However, texts from electronic health records (EHRs) frequently contain spelling errors, abbreviations, and other nonstandard ways of representing clinical concepts. RESULTS: Here, we present a method inspired by the BLAST algorithm for biosequence alignment that screens texts for potential matches on the basis of matching k-mer counts and scores candidates based on conformance to typical patterns of spelling errors derived from 2.9 million clinical notes. Our method, the Term-BLAST-like alignment tool (TBLAT), leverages a gold-standard corpus of typographical errors to implement a sequence-alignment-inspired method for efficient entity linkage. We present a comprehensive experimental comparison of TBLAT with five widely used tools. Experimental results show a 10% increase in recall on scientific publications and a 20% increase in recall on EHR records (compared against the next best method), hence supporting a significant enhancement of the entity linking task. The method can be used stand-alone or as a complement to existing approaches. AVAILABILITY AND IMPLEMENTATION: Fenominal is a Java library that implements TBLAT for named concept recognition of Human Phenotype Ontology terms and is available at https://github.com/monarch-initiative/fenominal under the GNU General Public License v3.0.
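The k-mer screening step can be illustrated with a short sketch. This is not the actual Fenominal/TBLAT implementation (which additionally scores candidates against empirical spelling-error patterns); the dictionary terms and the misspelling are invented for demonstration:

```python
# BLAST-inspired candidate screening: rank dictionary terms by the number
# of character k-mers they share with a (possibly misspelled) input span.
def kmers(text, k=3):
    """Set of character k-mers of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def screen(query, dictionary, k=3):
    """Return dictionary terms sharing at least one k-mer, best first."""
    q = kmers(query, k)
    scored = [(len(q & kmers(term, k)), term) for term in dictionary]
    scored.sort(reverse=True)
    return [term for count, term in scored if count > 0]

terms = ["hepatomegaly", "splenomegaly", "microcephaly"]
# A misspelled clinical mention still surfaces the right concept first:
candidates = screen("hepatomegly", terms)
```

Because shared k-mer counts degrade gracefully under typos, this screen cheaply narrows the dictionary before any expensive alignment-style scoring.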

    Text mining processing pipeline for semi-structured data D3.3

    Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g., free text describing disease diagnoses, drugs, and medication reasons, which is often not available in structured formats. One of the challenges posed by medical free text is that there can be several ways of mentioning a concept. Therefore, encoding free text into unambiguous descriptors allows us to leverage the value of the cohort data, in particular by facilitating its findability and interoperability across cohorts in the project. Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.
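A minimal sketch of dictionary-based entity normalization of the kind described above (this is not the CINECA tooling itself; the synonym map and concept identifiers are illustrative assumptions):

```python
# Entity normalization sketch: different surface forms of the same concept
# are mapped to one standard identifier via a synonym dictionary.
import re

# Illustrative synonym-to-concept map (identifiers shown for demonstration).
SYNONYMS = {
    "heart attack": "CONCEPT:0001",
    "myocardial infarction": "CONCEPT:0001",
    "high blood pressure": "CONCEPT:0002",
    "hypertension": "CONCEPT:0002",
}

def normalize(mention):
    """Lowercase, strip punctuation, collapse whitespace, then look up."""
    cleaned = re.sub(r"[^\w\s]", "", mention.lower())
    cleaned = " ".join(cleaned.split())
    return SYNONYMS.get(cleaned)
```

Real pipelines add fuzzy matching and context disambiguation on top of this exact-lookup core, but the normalize-then-map structure is the same.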

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes at the molecular level. The interpretation of microarray gene expression experiments profits from knowledge of the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is the large repository of free-text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results. Part I deals with biomedical text mining: Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the underlying databases (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources.
Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied to the largest available collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to further specify relation types, e.g., as activating, direct physical, or gene-regulatory relations. Part II deals with gene expression data analysis: Gene expression data need to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variances between measurements, and quantification of differential expression by p-value and fold-change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed, which is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b).
Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models. Part III deals with integrated approaches and thus provides the connection between Parts I and II: Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment, together with literature information that supports interpretation. Finally, Chapter 11 presents ideas on how the described methods can contribute to current research, along with possible future directions.
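The quantification step mentioned above (fold change plus a test statistic per gene) can be sketched on toy numbers. Real analyses use dedicated statistics packages, normalization, and multiple-testing correction; the data below are invented for illustration:

```python
# Differential-expression sketch: per-gene log2 fold change and a Welch
# t statistic comparing two sample groups.
import math
from statistics import mean, stdev

def log2_fold_change(case, control):
    """log2 ratio of group means; >0 means higher expression in 'case'."""
    return math.log2(mean(case) / mean(control))

def welch_t(case, control):
    """Welch's t statistic (p-value lookup via a t-distribution omitted)."""
    va = stdev(case) ** 2 / len(case)
    vb = stdev(control) ** 2 / len(control)
    return (mean(case) - mean(control)) / math.sqrt(va + vb)

case = [8.0, 9.0, 8.5, 9.5]     # toy expression values, disease samples
control = [4.0, 4.5, 4.2, 4.3]  # toy expression values, healthy samples
fc = log2_fold_change(case, control)   # ~1.04: roughly 2-fold up
t = welch_t(case, control)
```

Genes are then typically ranked or filtered by combining the fold change with the p-value derived from the test statistic.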

    Molecular genetics of chicken egg quality

    Faultless quality in eggs is important in all production steps, from the chicken to packaging, transportation, storage, and finally to the consumer. The egg industry (specifically transportation and packing) is interested in robustness, the consumer in safety and taste, and the chicken itself in the reproductive performance of the egg. High quality is commercially profitable, and egg quality is currently one of the key traits in breeding goals. In conventional breeding schemes, the more traits that are included in a selection index, the slower the rate of genetic progress for all the traits will be. Unveiling the genes underlying the traits, and subsequently utilizing this genomic information in practical breeding, would enhance selection progress, especially for traits of low heritability, gender-confined traits, or traits which are difficult to assess. In this study, two experimental mapping populations were used to identify quantitative trait loci (QTL) for egg quality traits. A whole-genome scan was conducted in both populations with different sets of microsatellite markers. Phenotypic observations of albumen quality, internal inclusions, egg taint, eggshell quality traits, and production traits during the entire production period were collected. To study the presence of QTL, multiple-marker linear regression was used. Polymorphisms found in candidate genes were used as single nucleotide polymorphism (SNP) markers to refine the map position of QTL by linkage and association. Furthermore, independent commercial egg-layer lines were utilized to confirm some of the associations. Albumen quality, the incidence of internal inclusions, and egg taint were first mapped with the whole-genome scan and then fine-mapped with subsequent analyses. For albumen quality, two distinct QTL areas were found on chromosome 2. Vimentin, a gene maintaining the mechanical integrity of cells, was studied as a candidate gene.
Neither sequencing nor subsequent analysis using SNPs within the gene in the QTL analysis suggested that variation in this gene could explain the effect on albumen thinning. The same mapping approach was used to study the incidence of internal inclusions, specifically blood and meat spots. Linkage analysis revealed one genome-wide significant region on chromosome Z. Fine-mapping revealed that the QTL overlapped with the tight junction protein gene ZO-2, with a microsatellite marker inside the gene. Sequencing of a fragment of the gene revealed several SNPs. Two novel SNPs were found to be located in a miRNA (gga-mir-1556) within ZO-2. The microRNA SNP and an exonic synonymous SNP were genotyped in the populations and showed significant association with blood and meat spots. Good congruence between the experimental population and commercial breeds was achieved both in QTL locations and in association results. Thus, ZO-2 and gga-mir-1556 remained candidates for a role in susceptibility to blood and meat spot defects across populations. This is the first report of QTL affecting blood and meat spot frequency in chicken eggs, albeit the effect explained only 2% of the phenotypic variance. Fishy taint is a disorder characteristic of brown layer lines. Marker-trait association analyses of pooled samples indicated that egg taint and the FMO3 gene map to chicken chromosome 8 and that variation found by sequencing in the chicken FMO3 gene was associated with the trimethylamine (TMA) content of the egg. The missense mutation in FMO3 changes an evolutionarily highly conserved amino acid within the FMO-characteristic motif (FATGY). In conclusion, several QTL regions affecting egg quality traits were successfully detected. Some of the QTL findings, such as albumen quality, remained at the level of wide chromosomal regions.
For some QTL, a putative causative gene was indicated: the miRNA gga-mir-1556 and/or its host gene ZO-2 might have a role in susceptibility to blood and meat spot defects across populations. Fishy taint in chicken eggs, in turn, was found to be caused by a substitution within a conserved motif of the FMO3 gene. This variation has been used in a breeding program to eliminate the fishy-taint defect from commercial egg-layer lines. Objective: The objective of this thesis was to map loci affecting economically important egg quality traits in chickens and to increase knowledge of the molecular genetics of these complex traits. The aim was to find markers linked to egg quality traits and, ultimately, to unravel the variation in the genes underlying the phenotypic variation of internal egg quality. QTL mapping methodology was used to identify chromosomal regions affecting various production and egg quality traits (I, III, IV). Three internal egg quality traits were selected for fine-mapping (II, III, IV). Some of the results were verified in independent mapping populations and present-day commercial lines (III, IV). The ultimate objective was to find markers to be applied in commercial selection programs.
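Single-marker regression, the building block of the multiple-marker linear regression used in the genome scans above, can be sketched as follows. The genotype dosages and phenotype values are toy numbers, not data from the study:

```python
# QTL association sketch: regress a quantitative phenotype on allele
# dosage (0, 1, or 2 copies) at one marker; a nonzero slope suggests
# the marker is linked to a locus affecting the trait.
from statistics import mean

def marker_effect(genotypes, phenotypes):
    """Least-squares slope of phenotype on allele dosage."""
    gbar, pbar = mean(genotypes), mean(phenotypes)
    num = sum((g - gbar) * (p - pbar) for g, p in zip(genotypes, phenotypes))
    den = sum((g - gbar) ** 2 for g in genotypes)
    return num / den

# Toy cross: each extra allele copy adds ~0.5 units of albumen height.
dosage = [0, 0, 1, 1, 2, 2]
albumen = [6.0, 6.1, 6.5, 6.6, 7.0, 7.1]
slope = marker_effect(dosage, albumen)  # ~0.5
```

A genome scan repeats this test at every marker position and flags regions where the effect is significant after accounting for multiple testing.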

    Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review

    Novel approaches that complement and go beyond evidence-based medicine are required in the domain of chronic diseases, given the growing incidence of such conditions in the worldwide population. A promising avenue is the secondary use of electronic health records (EHRs), where patient data are analyzed to conduct clinical and translational research. Methods based on machine learning for processing EHRs are resulting in improved understanding of patient clinical trajectories and chronic disease risk prediction, creating a unique opportunity to derive previously unknown clinical insights. However, a wealth of clinical histories remains locked behind clinical narratives in free-form text. Consequently, unlocking the full potential of EHR data is contingent on the development of natural language processing (NLP) methods to automatically transform clinical text into structured clinical data that can guide clinical decisions and potentially delay or prevent disease onset.

    Discovery of novel biomarkers and phenotypes by semantic technologies.

    Biomarkers and target-specific phenotypes are important to targeted drug design and individualized medicine, and thus constitute an important aspect of modern pharmaceutical research and development. Increasingly, the discovery of relevant biomarkers is aided by in silico techniques based on applying data mining and computational chemistry to large molecular databases. However, there is an even larger source of valuable information that can potentially be tapped for such discoveries: the repositories constituted by research documents.