1,032 research outputs found

    Communication and re-use of chemical information in bioscience.

    The current methods of publishing chemical information in bioscience articles are analysed. Using three papers as use-cases, it is shown that conventional methods relying on human procedures, including cut-and-paste, are time-consuming and introduce errors. The meaning of chemical terms and the identity of compounds are often ambiguous, and valuable experimental data such as spectra and computational results are almost always omitted. We describe a proof-of-concept Open XML architecture which addresses these concerns. Compounds are identified through explicit connection tables or links to persistent Open resources such as PubChem. It is argued that if publishers adopt these tools and protocols, the quality and quantity of chemical information available to bioscientists will increase, and authors, publishers and readers will find the process cost-effective. This is an article submitted to BMC Bioinformatics, created on request with their Publicon system. The transformed manuscript is archived as PDF. Although it has been through the publisher's system, this processing is purely automatic and the contents are those of a pre-refereed preprint. The formatting is provided by the system, and tables and figures appear at the end. An accompanying submission, http://www.dspace.cam.ac.uk/handle/1810/34580, describes the rationale and cultural aspects of publishing, abstracting and aggregating chemical information. BMC is an Open Access publisher and we emphasize that all content is re-usable under Creative Commons licences.
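    To make the architecture's central idea concrete, here is a minimal Python sketch of what machine-readable compound identification could look like: an explicit structure description plus a link to a persistent Open resource. The element names are illustrative only, not the actual schema used in the paper.

```python
# A minimal sketch of machine-readable compound identification: each
# compound carries an explicit, computable identifier and a link to a
# persistent Open resource such as PubChem. Element names are illustrative,
# not the schema from the paper.
import xml.etree.ElementTree as ET

compound = ET.Element("compound", id="c1")
ET.SubElement(compound, "name").text = "caffeine"
# Link to a persistent open resource instead of relying on the name alone.
ET.SubElement(compound, "link",
              resource="PubChem",
              url="https://pubchem.ncbi.nlm.nih.gov/compound/2519")
# An InChI gives an explicit, computable structure description.
ET.SubElement(compound, "inchi").text = (
    "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3")

print(ET.tostring(compound, encoding="unicode"))
```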

    Text Mining for Chemical Compounds

    Exploring the chemical and biological space covered by patent and journal publications is crucial in early-stage medicinal chemistry activities. The analysis provides understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents and journals through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. In this book, we addressed the lack of quality measurements for assessing the correctness of structural representation within and across chemical databases; the lack of resources to build text-mining systems; the lack of high-performance systems to extract chemical compounds from journals and patents; and the lack of automated systems to identify relevant compounds in patents. The consistency and ambiguity of chemical identifiers were analyzed within and between small-molecule databases in Chapters 2 and 3. In Chapters 4 and 7 we developed resources to enable the construction of chemical text-mining systems. In Chapters 5 and 6, we used community challenges (BioCreative V and BioCreative VI) and their corresponding resources to identify mentions of chemical compounds in journal abstracts and patents. In Chapter 7 we used our findings from previous chapters to extract chemical named entities from patent full text and to classify the relevance of chemical compounds.
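    As a point of reference for the simplest family of approaches discussed above, the sketch below implements purely dictionary-based chemical named-entity recognition. The dictionary, identifiers and example text are illustrative; the systems developed in the thesis are considerably more sophisticated.

```python
# A minimal sketch of dictionary-based chemical named-entity recognition.
# The dictionary is a toy; real systems use large curated dictionaries
# and statistical models.
import re

chem_dict = {
    "aspirin": "CHEBI:15365",
    "acetylsalicylic acid": "CHEBI:15365",  # synonyms share one identifier
    "imatinib": "CHEBI:45783",
}

def find_chemicals(text):
    """Return (mention, database id, span) for every dictionary hit."""
    hits = []
    for name, db_id in chem_dict.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text, re.I):
            hits.append((m.group(0), db_id, m.span()))
    return hits

print(find_chemicals("Imatinib outperformed aspirin in this assay."))
```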

    Characterisation of data resources for in silico modelling: benchmark datasets for ADME properties.

    Introduction: The cost of in vivo and in vitro screening of ADME properties of compounds has motivated efforts to develop a range of in silico models. At the heart of the development of any computational model are the data; high-quality data are essential for developing robust and accurate models. The characteristics of a dataset, such as its availability, size, format and the type of chemical identifiers used, influence the modelability of the data. Areas covered: This review explores the usefulness of publicly available ADME datasets for researchers to use in the development of predictive models. More than 140 ADME datasets were collated from publicly available resources, and the modelability of 31 selected datasets was assessed using specific criteria derived in this study. Expert opinion: Publicly available datasets differ significantly in information content and presentation. From a modelling perspective, datasets should be of adequate size and available in a user-friendly format, with all chemical structures associated with one or more chemical identifiers suitable for automated processing (e.g. CAS number, SMILES string or InChIKey). Recommendations for assessing dataset suitability for modelling and for publishing data in an appropriate format are discussed.
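    To illustrate why identifiers suitable for automated processing matter, the sketch below standardises SMILES strings to InChIKeys so that datasets can be merged on a common key. RDKit is used here as one convenient open-source toolkit; it is our choice for illustration, not one prescribed by the review.

```python
# A sketch of automated identifier standardisation: SMILES in, InChIKey out,
# so that records from different datasets can be joined on a shared key.
# Requires RDKit (assumed dependency, not named in the review).
from rdkit import Chem

records = {
    "caffeine":    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "paracetamol": "CC(=O)Nc1ccc(O)cc1",
}

for name, smiles in records.items():
    mol = Chem.MolFromSmiles(smiles)   # None if the SMILES cannot be parsed
    if mol is not None:
        print(name, Chem.MolToInchiKey(mol))
```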

    On InChI and evaluating the quality of cross-reference links

    BACKGROUND: There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require the integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in the databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones. RESULTS: We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by indistinctness in the interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (28.85% in the worst case). Even using a weaker definition of consistency, the measured inconsistency values were generally very high. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links. CONCLUSIONS: We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links when they are measured using InChI identifiers. However, inconsistency can be caused both by errors in the manually curated links and by the inherent limitations of the InChI method.
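    The consistency measure can be pictured with a small sketch: a manually curated cross-reference link counts as consistent when both linked entries yield the same InChI. All identifiers and links below are invented for illustration.

```python
# A minimal sketch of InChI-based link consistency checking. Entries and
# curated links are illustrative, not taken from the study.
inchi_db_a = {"A1": "InChI=1S/CH4/h1H4",          # methane
              "A2": "InChI=1S/H2O/h1H2"}          # water
inchi_db_b = {"B1": "InChI=1S/CH4/h1H4",          # methane
              "B2": "InChI=1S/CH4O/c1-2/h2H,1H3"} # methanol

curated_links = [("A1", "B1"), ("A2", "B2")]  # manually curated A -> B links

# A link is consistent when both ends resolve to the same InChI.
consistent = sum(inchi_db_a[a] == inchi_db_b[b] for a, b in curated_links)
print(f"inconsistent links: {1 - consistent / len(curated_links):.1%}")
```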

    Consequences of refining biological networks through detailed pathway information : From genes to proteoforms

    Biological networks can be used to model molecular processes, understand disease progression, and find new treatment strategies. This thesis investigated how refining the design of biological networks influences their structure, and how this can be used to improve the specificity of pathway analyses. First, we investigated the potential to use more detailed molecular data in current human biological pathways. We verified that there are enough proteoform annotations, i.e. information about proteins in specific post-translational states, for systematic analyses, and characterized the structure of gene-centric versus proteoform-centric network representations of pathways. Next, we enabled the programmatic search and mining of pathways using different models for biomolecules, including proteoforms. We notably designed a generic proteoform matching algorithm enabling the flexible mapping of experimental data to the theoretical representation in reference databases. Finally, we constructed pathway-based networks using different degrees of detail in the representation of biochemical reactions. We included information overlooked in most standard network representations: small molecules, isoforms, and post-translational modifications. Structural properties such as network size, degree distribution, and connectivity in both global and local subnetworks were analysed to quantify the impact of the added molecular entities.
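    As a hypothetical illustration of flexible proteoform matching, the sketch below treats an observed proteoform as matching a reference when the accessions agree and the observed modifications are a subset of the annotated ones. This matching rule is a simplification assumed for the example, not the published algorithm.

```python
# An illustrative (not published) proteoform matching rule: same accession,
# and observed PTMs form a subset of the reference annotation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Proteoform:
    accession: str    # e.g. a UniProt accession
    ptms: frozenset   # (modification, site) pairs

def matches(observed: Proteoform, reference: Proteoform) -> bool:
    return (observed.accession == reference.accession
            and observed.ptms <= reference.ptms)

ref = Proteoform("P04637", frozenset({("phospho", 15), ("acetyl", 120)}))
obs = Proteoform("P04637", frozenset({("phospho", 15)}))
print(matches(obs, ref))  # True: observed PTMs are a subset of annotated ones
```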

    Consistency, inconsistency, and ambiguity of metabolite names in biochemical databases used for genome-scale metabolic modelling

    Genome-scale metabolic models (GEMs) are manually curated repositories describing the metabolic capabilities of an organism. GEMs have been successfully used in different research areas, ranging from systems medicine to biotechnology. However, the different naming conventions (namespaces) of the databases used to build GEMs limit model reusability and prevent the integration of existing models. This problem is known in the GEM community, but its extent has not been analyzed in depth. In this study, we investigate the ambiguity of metabolite names and the multiplicity of non-systematic identifiers, and we highlight the (in)consistency of their use across 11 databases of biochemical reactions, as well as the problems that arise when mapping between different namespaces and databases. We found that such inconsistencies can be as high as 83.1%, emphasizing the need for strategies to deal with these issues. Currently, manual verification of the mappings appears to be the only solution to remove inconsistencies when combining models. Finally, we discuss several possible approaches to facilitate (future) unambiguous mapping.
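    One way to picture the ambiguity measurement is the sketch below: a metabolite name counts as ambiguous when it maps to more than one systematic identifier across the databases examined. The entries are illustrative, not taken from the study.

```python
# A minimal sketch of quantifying name ambiguity across namespaces.
# Entries are invented for illustration.
from collections import defaultdict

# (database, name) -> systematic identifier, e.g. an InChIKey
entries = [
    ("db1", "glucose", "WQZGKKKJIJFFOK-GASJEMHNSA-N"),   # D-glucose
    ("db2", "glucose", "WQZGKKKJIJFFOK-DVKNGEFBSA-N"),   # alpha-D-glucose
    ("db1", "pyruvate", "LCTONWCANYUPML-UHFFFAOYSA-M"),
    ("db2", "pyruvate", "LCTONWCANYUPML-UHFFFAOYSA-M"),
]

ids_per_name = defaultdict(set)
for _, name, ident in entries:
    ids_per_name[name].add(ident)

# A name is ambiguous when it resolves to more than one identifier.
ambiguous = [n for n, ids in ids_per_name.items() if len(ids) > 1]
print(f"ambiguous names: {len(ambiguous) / len(ids_per_name):.0%}")
```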

    DEVELOPMENT OF TOOLS FOR ATOM-LEVEL INTERPRETATION OF STABLE ISOTOPE-RESOLVED METABOLOMICS DATASETS

    Metabolomics is the global study of small molecules in living systems under a given state, emerging as a new 'omics' field in systems biology. It has shown great promise in elucidating biological mechanisms in various areas. Many diseases, especially cancers, are closely linked to reprogrammed metabolism. As the end point of biological processes, metabolic profiles are more representative of the biological phenotype than genomic or proteomic profiles. Therefore, characterizing the metabolic phenotypes of various diseases will help clarify the underlying metabolic mechanisms and promote the development of novel and effective treatment strategies. Advances in analytical technologies such as nuclear magnetic resonance and mass spectrometry greatly contribute to the detection and characterization of global metabolites in a biological system. Furthermore, application of these analytical tools to stable isotope-resolved metabolomics (SIRM) experiments can generate large-scale, high-quality metabolomics data capturing isotopic flow through cellular metabolism. However, the lack of corresponding computational analysis tools hinders the characterization of metabolic phenotypes and the downstream applications. Both detailed metabolic modeling and quantitative analysis are required for proper interpretation of these complex metabolomics data. For metabolic modeling, there is currently no comprehensive metabolic network at an atom-resolved level that can be used for deriving context-specific metabolic models for SIRM metabolomics datasets. For quantitative analysis, most available tools conduct metabolic flux analysis based on a well-defined metabolic model, which is hard to achieve for complex biological systems due to the limitations in our knowledge. Here, we developed a set of methods to address these problems. First, we developed a neighborhood-specific coloring method that creates an identifier for each atom in a specific compound. With the atom identifiers, we successfully harmonized compounds and reactions across the KEGG and MetaCyc databases at various levels. We also evaluated the atom mappings of the harmonized metabolic reactions. These results will contribute to the construction of a comprehensive atom-resolved metabolic network. Moreover, this method can be easily applied to any metabolic database that provides a molfile representation of compounds, which will greatly facilitate future expansion. Second, we developed a moiety modeling framework to deconvolute metabolite isotopologue profiles using moiety models, along with the analysis and selection of the best moiety model(s) based on the experimental data. To our knowledge, this is the first method that can analyze datasets involving multiple isotope tracers. Furthermore, instead of a single predefined metabolic model, this method allows the comparison of multiple metabolic models derived from a given metabolic profile, and we have demonstrated the robust performance of the moiety modeling framework in model selection with a 13C-labeled UDP-GlcNAc isotopologue dataset. We further explored the data quality requirements and the factors that affect model selection. Collectively, these methods and tools help interpret SIRM metabolomics datasets from metabolic modeling to quantitative analysis.
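    The neighborhood-specific coloring idea is closely related to the classic Morgan algorithm: each atom's identifier is repeatedly rehashed together with the sorted identifiers of its neighbours, so the final identifier encodes the atom's local environment. The sketch below shows this on a toy molecule encoding; it is a simplification for illustration, not the published implementation.

```python
# A sketch of iterative neighbourhood colouring (Morgan-like): an atom's
# colour is rehashed with its neighbours' colours each round, so it ends
# up encoding the local environment. Molecule encoding is illustrative.
import hashlib

def color_atoms(elements, bonds, rounds=3):
    """elements: list of element symbols; bonds: list of (i, j) pairs."""
    neighbours = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    colors = list(elements)  # round 0: colour = element symbol
    for _ in range(rounds):
        colors = [
            hashlib.sha1(
                (colors[i] + "|" +
                 ",".join(sorted(colors[j] for j in neighbours[i]))).encode()
            ).hexdigest()[:8]
            for i in range(len(elements))
        ]
    return colors

# Ethanol heavy atoms: C-C-O
print(color_atoms(["C", "C", "O"], [(0, 1), (1, 2)]))
```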

    Is systems pharmacology ready to impact upon therapy development? A study of the cholesterol biosynthesis pathway.

    Background and Purpose: An ever-growing wealth of information on current drugs and their pharmacological effects is available from online databases. As our understanding of systems biology increases, we have the opportunity to predict, model and quantify how drug combinations can be introduced that outperform conventional single-drug therapies. Here, we explore the feasibility of such systems pharmacology approaches with an analysis of the mevalonate branch of the cholesterol biosynthesis pathway. Experimental Approach: Using open online resources, we assembled a computational model of the mevalonate pathway and compiled a set of inhibitors directed against targets in this pathway. We used computational optimization to identify combination and dose options that show not only maximal efficacy of inhibition on the cholesterol-producing branch but also minimal impact on the geranylation branch, known to mediate the side effects of pharmaceutical treatment. Key Results: We describe serious impediments to systems pharmacology studies arising from limitations in the data, incomplete coverage and inconsistent reporting. By curating a more complete dataset, we demonstrate the utility of computational optimization for identifying multi-drug treatments with high efficacy and minimal off-target effects. Conclusion and Implications: We suggest solutions that facilitate systems pharmacology studies, based on the introduction of standards for data capture that increase the power of experimental data. We propose a systems pharmacology workflow for the refinement of data and the generation of future therapeutic hypotheses.
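    The optimization step can be sketched with a deliberately toy model: choose inhibitor doses that minimise flux through the cholesterol branch while penalising impact on the geranylation branch. The response functions and all parameters below are invented for illustration; the study used a detailed model of the mevalonate pathway.

```python
# A toy two-branch dose-optimization sketch. All response functions and
# parameters are invented; this is not the study's kinetic model.
import numpy as np
from scipy.optimize import minimize

def cholesterol_flux(doses):
    # Simple multiplicative inhibition of the cholesterol-producing branch.
    return np.prod(1.0 / (1.0 + doses * np.array([2.0, 1.5])))

def geranylation_impact(doses):
    # Off-target effect on the geranylation branch (source of side effects).
    return np.sum(doses * np.array([0.1, 0.4]))

def objective(doses):
    # Trade off efficacy against off-target impact with an assumed weight.
    return cholesterol_flux(doses) + 5.0 * geranylation_impact(doses)

result = minimize(objective, x0=[0.5, 0.5], bounds=[(0, 10), (0, 10)])
print("optimal doses:", result.x.round(2))
```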

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes at the molecular level. The interpretation of microarray gene expression experiments profits from knowledge of the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free-text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results.
    Part I deals with biomedical text mining. Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the used databases (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied to the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to further specify relation types, e.g. as activating, direct physical, or gene regulatory relations.
    Part II deals with gene expression data analysis. Gene expression data need to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variances between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed that is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b). Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models.
    Part III deals with integrated approaches and thus provides the connection between Parts I and II. Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment, together with literature information that supports interpretation. Finally, Chapter 11 presents ideas on how the described methods can contribute to current research, and possible future directions.
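    As a small illustration of the quantification step mentioned above (p-value and fold change determination), the sketch below computes a log2 fold change and a per-gene t-test on a synthetic expression matrix. Real analyses add normalization and multiple-testing correction; all data here are simulated.

```python
# A minimal sketch of differential expression quantification: log2 fold
# change plus a per-gene t-test. The expression matrix is simulated.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
control = rng.lognormal(mean=5.0, sigma=0.3, size=(100, 6))  # 100 genes, 6 arrays
disease = control * rng.lognormal(mean=0.1, sigma=0.3, size=(100, 6))

log2_fc = np.log2(disease.mean(axis=1) / control.mean(axis=1))
pvals = ttest_ind(np.log2(disease), np.log2(control), axis=1).pvalue

# Genes passing simple thresholds on both criteria:
hits = np.flatnonzero((np.abs(log2_fc) > 0.5) & (pvals < 0.05))
print(f"{hits.size} candidate differentially expressed genes")
```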