Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining
Background. Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. Results. We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only one-third to one-quarter the size of Chemlist at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. Conclusions. We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation are necessary to achieve high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist
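The F-score comparison in conclusion (1) can be checked against the reported precision and recall. The sketch below assumes the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall after filtering and disambiguation, as reported above.
reported = {
    "ChemSpider": (0.87, 0.19),
    "Chemlist": (0.67, 0.40),
}

scores = {name: f_score(p, r) for name, (p, r) in reported.items()}
for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

Under this assumption the values come out at roughly 0.31 for ChemSpider and 0.50 for Chemlist, which is consistent with Chemlist having the best F-score despite its lower precision.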
Automated annotation of chemical names in the literature with tunable accuracy
Background. A significant portion of the biomedical and chemical literature refers to small molecules. The accurate identification and annotation of compound names that are relevant to the topic of the given literature can establish links between scientific publications and various chemical and life science databases. Manual annotation is the preferred method for these works because well-trained indexers can understand the paper topics as well as recognize key terms. However, considering the hundreds of thousands of new papers published annually, an automatic annotation system with high precision and relevance can be a useful complement to manual annotation. Results. An automated chemical name annotation system, MeSH Automated Annotations (MAA), was developed to annotate small molecule names in scientific abstracts with tunable accuracy. This system aims to reproduce the MeSH term annotations on biomedical and chemical literature that would be created by indexers. When automated free-text matching was compared with the manual indexing of 26 thousand MEDLINE abstracts, more than 40% of the annotations were false-positive (FP) cases. To reduce the FP rate, MAA incorporated several filters to remove "incorrect" annotations caused by nonspecific, partial, and low-relevance chemical names. In part, relevance was measured by the position of the chemical name in the text. Tunable accuracy was obtained by adding or restricting the sections of the text scanned for chemical names. The best precision obtained was 96% with a 28% recall rate. The best performance of MAA, as measured with the F statistic, was 66%, which compares favorably to other chemical name annotation systems. Conclusions. Accurate chemical name annotation can help researchers not only identify important chemical names in abstracts, but also match unindexed and unstructured abstracts to chemical records. The current work is tested against MEDLINE, but the algorithm is not specific to this corpus and it is possible that the algorithm can be applied to papers from chemical physics, materials, polymer and environmental science, as well as patents, biological assay descriptions and other textual data
Mining metabolites: extracting the yeast metabolome from the literature
Text mining methods have added considerably to our capacity to extract biological knowledge from the literature. Recently the field of systems biology has begun to model and simulate metabolic networks, requiring knowledge of the set of molecules involved. While genomics and proteomics technologies are able to supply the macromolecular parts list, the metabolites are less easily assembled. Most metabolites are known and reported through the scientific literature, rather than through large-scale experimental surveys. Thus it is important to recover them from the literature. Here we present a novel tool to automatically identify metabolite names in the literature, and associate structures where possible, to define the reported yeast metabolome. With ten-fold cross validation on a manually annotated corpus, our recognition tool generates an f-score of 78.49 (precision of 83.02) and demonstrates greater suitability in identifying metabolite names than other existing recognition tools for general chemical molecules. The metabolite recognition tool has been applied to the literature covering an important model organism, the yeast Saccharomyces cerevisiae, to define its reported metabolome. By coupling to ChemSpider, a major chemical database, we have identified structures for much of the reported metabolome and, where structure identification fails, been able to suggest extensions to ChemSpider. Our manually annotated gold-standard data on 296 abstracts are available as supplementary materials. Metabolite names and, where appropriate, structures are also available as supplementary materials
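The abstract reports an f-score and a precision but no recall. Assuming the f-score is the balanced F1 measure (an assumption; the abstract does not say), recall follows from inverting F = 2PR / (P + R), which gives R = FP / (2P - F). The helper name below is illustrative.

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P on the same scale."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall: F1 must be less than 2 * precision")
    return f1 * precision / denominator


# Figures reported in the abstract, in percent.
implied_recall = recall_from_f1(f1=78.49, precision=83.02)
print(f"implied recall: {implied_recall:.2f}")
```

Under this assumption the implied recall is about 74.4, slightly below the reported precision.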
Identification of a Shared Genetic Susceptibility Locus for Coronary Heart Disease and Periodontitis
Recent studies indicate a mutual epidemiological relationship between coronary heart disease (CHD) and periodontitis. Both diseases are associated with similar risk factors and are characterized by a chronic inflammatory process. In a candidate-gene association study, we identify an association of a genetic susceptibility locus shared by both diseases. We confirm the known association of two neighboring linkage disequilibrium regions on human chromosome 9p21.3 with CHD and show the additional strong association of these loci with the risk of aggressive periodontitis. For the lead SNP of the main associated linkage disequilibrium region, rs1333048, the odds ratio of the autosomal-recessive mode of inheritance is 1.99 (95% confidence interval 1.33–2.94; P = 6.9×10⁻⁴) for generalized aggressive periodontitis, and 1.72 (1.06–2.76; P = 2.6×10⁻²) for localized aggressive periodontitis. The two associated linkage disequilibrium regions map to the sequence of the large antisense noncoding RNA ANRIL, which partly overlaps regulatory and coding sequences of CDKN2A/CDKN2B. A closely located diabetes-associated variant was independent of the CHD and periodontitis risk haplotypes. Our study demonstrates that CHD and periodontitis are genetically related by at least one susceptibility locus, which is possibly involved in ANRIL activity and independent of diabetes associated risk variants within this region. Elucidation of the interplay of ANRIL transcript variants and their involvement in increased susceptibility to the interactive diseases CHD and periodontitis promises new insight into the underlying shared pathogenic mechanisms of these complex common diseases
Gut microbiota and diabetes: from pathogenesis to therapeutic perspective
Several hundred million people will be diabetic and obese over the coming decades, yet current therapeutic approaches aim at treating the consequences rather than the causes of impaired metabolism. This strategy is not efficient, and new paradigms are needed. Genome-wide analysis cannot predict or explain more than 10–20% of the disease, whereas changes in feeding and social behavior certainly have a major impact. However, the molecular mechanisms linking environmental factors and genetic susceptibility had not been envisioned until the recent discovery of a hidden source of genomic diversity, i.e., the metagenome. More than 3 million genes from several hundred species constitute our intestinal microbiome. The first key experiments demonstrated that this biome can by itself transfer metabolic disease. The mechanisms are unknown but could involve modulation of the host's energy-harvesting capacity, as well as low-grade inflammation and the effect of the corresponding immune response on adipose tissue plasticity, hepatic steatosis, insulin resistance and even secondary cardiovascular events. Secreted bacterial factors reach the circulating blood, and even whole bacteria from the intestinal microbiota can reach tissues where inflammation is triggered. The last 5 years have demonstrated that the intestinal microbiota, at the molecular level, is a causal factor early in the development of these diseases. Nonetheless, much more needs to be uncovered in order to identify, first, new predictive biomarkers on which preventive strategies based on pre- and probiotics can be built and, second, new therapeutic strategies against the cause rather than the consequence of hyperglycemia and body weight gain
Brain MRI data sharing guide
We present a guide on sharing Magnetic Resonance Imaging (MRI) data, with a focus on The Netherlands. The guide is meant to help researchers determine what they can share and where to share it, and where they can find information or support
Why workflows break - Understanding and combating decay in Taverna workflows.
Workflows provide a popular means for preserving scientific methods by explicitly encoding their process. However, some workflows decay over time in their ability to be re-executed or to reproduce the same results, largely due to the volatility of the resources required for workflow executions. This paper provides an analysis of the root causes of workflow decay based on an empirical study of a collection of Taverna workflows from the myExperiment repository. Although our analysis was based on a specific type of workflow, the outcomes and methodology should be applicable to workflows from other systems, at least those whose executions also rely largely on accessing third-party resources. Based on our understanding of decay, we recommend a minimal set of auxiliary resources to be preserved together with the workflows as an aggregation object and provide a software tool for end-users to create such aggregations and to assess their completeness. ©2012 IEEE
Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data
BACKGROUND: Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. METHODS: We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate-corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared to results from the Connectivity Map. We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. RESULTS: Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 21 of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. CONCLUSIONS: Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect
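The methods mention false discovery rate-corrected p-values below 0.05 without naming the correction procedure. As an illustration only, the sketch below applies the Benjamini-Hochberg step-up adjustment, a common choice that is assumed here rather than taken from the paper. The gene set names and p-values are made up for the example.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four gene sets (not data from the study).
raw = {"set_A": 0.001, "set_B": 0.012, "set_C": 0.030, "set_D": 0.200}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

# Gene sets called significantly differentially expressed at FDR < 0.05.
significant = [name for name, q in adjusted.items() if q < 0.05]
print(significant)
```

The adjustment scales each p-value by the number of tests divided by its rank, so a threshold of 0.05 on the adjusted values controls the expected share of false discoveries among the gene sets reported as altered.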