    Tautomerism in large databases

    We have used the Chemical Structure DataBase (CSDB) of the NCI CADD Group, an aggregated collection of over 150 small-molecule databases totaling 103.5 million structure records, to conduct tautomerism analyses on one of the largest currently existing sets of real (i.e. not computer-generated) compounds. This analysis was carried out using calculable chemical structure identifiers developed by the NCI CADD Group, based on hash codes available in the chemoinformatics toolkit CACTVS and a newly developed scoring scheme to define a canonical tautomer for any encountered structure. We used CACTVS’s tautomerism definition, a set of 21 transform rules expressed in SMIRKS line notation, which takes a comprehensive stance on the possible types of tautomeric interconversion included. Tautomerism was found to be possible for more than 2/3 of the unique structures in the CSDB. A total of 680 million tautomers were calculated from, and including, the original structure records. Tautomerism overlap within the same individual database (i.e. at least one other entry was present that was really only a different tautomeric representation of the same compound) was found at an average rate of 0.3% of the original structure records, with values as high as nearly 2% for some of the databases in CSDB. Projected onto the set of unique structures (by FICuS identifier), this still occurred in about 1.5% of the cases. Tautomeric overlap across all constituent databases in CSDB was found for nearly 10% of the records in the collection.

    In silico assessment of potential druggable pockets on the surface of α1-Antitrypsin conformers

    The search for druggable pockets on the surface of a protein is often performed on a single conformer, treated as a rigid body. Transient druggable pockets may be missed in this approach. Here, we describe a methodology for systematic in silico analysis of surface clefts across multiple conformers of the metastable protein α1-antitrypsin (A1AT). Pathological mutations disturb the conformational landscape of A1AT, triggering polymerisation that leads to emphysema and hepatic cirrhosis. Computational screens for small molecule inhibitors of polymerisation have generally focused on one major druggable site visible in all crystal structures of native A1AT. In an alternative approach, we scan all surface clefts observed in crystal structures of A1AT and in 100 computationally produced conformers, mimicking the native solution ensemble. We assess the persistence, variability and druggability of these pockets. Finally, we employ molecular docking using publicly available libraries of small molecules to explore scaffold preferences for each site. Our approach identifies a number of novel target sites for drug design. In particular, one transient site shows favourable characteristics for druggability due to high enclosure and hydrophobicity. Hits against this and other druggable sites achieve docking scores corresponding to a Kd in the µM–nM range, comparing favourably with a recently identified promising lead. Preliminary ThermoFluor studies support the docking predictions. In conclusion, our strategy shows considerable promise compared with the conventional single pocket/single conformer approach to in silico screening. Our best-scoring ligands warrant further experimental investigation.

    Functional Group and Substructure Searching as a Tool in Metabolomics

    BACKGROUND: A direct link between the names and structures of compounds and the functional groups contained within them is important, not only because biochemists frequently rely on literature that uses a free-text format to describe functional groups, but also because metabolic models depend upon the connections between enzymes and substrates being known and appropriately stored in databases. METHODOLOGY: We have developed a database named "Biochemical Substructure Search Catalogue" (BiSSCat), which contains 489 functional groups, >200,000 compounds and >1,000,000 different computationally constructed substructures, to allow identification of chemical compounds of biological interest. CONCLUSIONS: This database and its associated web-based search program (http://bisscat.org/) can be used to find compounds containing selected combinations of substructures and functional groups. It can be used to determine possible additional substrates for known enzymes and for putative enzymes found in genome projects. Its applications to enzyme inhibitor design are also discussed.

    11th German Conference on Chemoinformatics (GCC 2015) : Fulda, Germany. 8-10 November 2015.

    Quantitative assessment of the expanding complementarity between public and commercial databases of bioactive compounds

    BACKGROUND: Since 2004, public cheminformatic databases and their collective functionality for exploring relationships between compounds, protein sequences, literature and assay data have advanced dramatically. In parallel, commercial sources that extract and curate such relationships from journals and patents have also been expanding. This work updates a previous comparative study of databases chosen because of their bioactive content, availability of downloads and facility to select informative subsets. RESULTS: Where they could be calculated, extracted compounds-per-journal-article counts were in the range of 12 to 19, but compounds-per-protein counts increased with document numbers. Chemical structure filtration to facilitate standardised comparisons typically reduced source counts by between 5% and 30%. The pair-wise overlaps between 23 databases and subsets were determined, as well as changes between 2006 and 2008. While all compound sets have increased, PubChem has doubled to 14.2 million. The 2008 comparison matrix shows not only overlap but also unique content across all sources. Many of the detailed differences could be attributed to individual strategies for data selection and extraction. While there was a big increase in patent-derived structures entering PubChem since 2006, GVKBIO contains over 0.8 million unique structures from this source. Venn diagrams showed extensive overlap between compounds extracted by independent expert curation from journals by GVKBIO, WOMBAT (both commercial) and BindingDB (public), but each included unique content. In contrast, the approved drug collections from GVKBIO, MDDR (commercial) and DrugBank (public) showed surprisingly low overlap. Aggregating all commercial sources established that while 1 million compounds overlapped with PubChem, 1.2 million did not. CONCLUSION: On the basis of chemical structure content per se, public sources have covered an increasing proportion of commercial databases over the last two years. However, commercial products included in this study provide links between compounds and information from patents and journals at a larger scale than current public efforts. They also continue to capture a significant proportion of unique content. Our results thus demonstrate not only an encouraging overall expansion of data-supported bioactive chemical space but also that both commercial and public sources are complementary for its exploration.

    Fast 3D shape screening of large chemical databases through alignment-recycling

    BACKGROUND: Large chemical databases require fast, efficient, and simple ways of looking for similar structures. Although such tasks are now fairly well resolved for graph-based similarity queries, they remain an issue for 3D approaches, particularly for those based on 3D shape overlays. Inspired by a recent technique developed to compare molecular shapes, we designed a hybrid methodology, alignment-recycling, that enables efficient retrieval and alignment of structures with similar 3D shapes. RESULTS: Using a dataset of more than one million PubChem compounds of limited size (< 28 heavy atoms) and flexibility (< 6 rotatable bonds), we obtained a set of a few thousand diverse structures entirely covering the 3D shape space of the conformers of the dataset. Transformation matrices gathered from the overlays between these diverse structures and the 3D conformer dataset allowed us to drastically (100-fold) reduce the CPU time required for shape overlay. The alignment-recycling heuristic produces results consistent with de novo alignment calculation, with better than 80% hit list overlap on average. CONCLUSION: Overlay-based 3D methods are computationally demanding when searching large databases. Alignment-recycling reduces the CPU time to perform shape similarity searches by breaking the alignment problem into three steps: selection of diverse shapes to describe the database shape-space; overlay of the database conformers to the diverse shapes; and non-optimized overlay of query and database conformers using common reference shapes. The precomputation, required by the first two steps, is a significant cost of the method; however, once performed, querying is two orders of magnitude faster. Extensions and variations of this methodology, for example to handle larger and more flexible small molecules, are discussed.

    Identification of Anti-Malarial Compounds as Novel Antagonists to Chemokine Receptor CXCR4 in Pancreatic Cancer Cells

    Despite recent advances in targeted therapies, patients with pancreatic adenocarcinoma continue to have poor survival, highlighting the urgency to identify novel therapeutic targets. Our previous investigations have implicated chemokine receptor CXCR4 and its selective ligand CXCL12 in the pathogenesis and progression of pancreatic intraepithelial neoplasia and invasive pancreatic cancer; hence, CXCR4 is a promising target for suppression of pancreatic cancer growth. Here, we combined in silico structural modeling of CXCR4 to screen for candidate anti-CXCR4 compounds with in vitro cell line assays and identified NSC56612 from the National Cancer Institute's (NCI) Open Chemical Repository Collection as an inhibitor of activated CXCR4. Next, we found that NSC56612 is structurally similar to the established anti-malarial drugs chloroquine and hydroxychloroquine. We evaluated these compounds in pancreatic cancer cells in vitro and observed specific antagonism of CXCR4-mediated signaling and cell proliferation. Recent in vivo therapeutic applications of chloroquine in pancreatic cancer mouse models have demonstrated decreased tumor growth and improved survival. Our results thus provide a molecular target and basis for further evaluation of chloroquine and hydroxychloroquine in pancreatic cancer. Historically safe in humans, chloroquine and hydroxychloroquine appear to be promising agents to safely and effectively target CXCR4 in patients with pancreatic cancer.

    Structure-based classification and ontology in chemistry

    BACKGROUND: Recent years have seen an explosion in the availability of data in the chemistry domain. With this information explosion, however, retrieving relevant results from the available information, and organising those results, become even harder problems. Computational processing is essential to filter and organise the available resources so as to better facilitate the work of scientists. Ontologies encode expert domain knowledge in a hierarchically organised machine-processable format. One such ontology for the chemical domain is ChEBI. ChEBI provides a classification of chemicals based on their structural features and a role or activity-based classification. An example of a structure-based class is 'pentacyclic compound' (compounds containing five-ring structures), while an example of a role-based class is 'analgesic', since many different chemicals can act as analgesics without sharing structural features. Structure-based classification in chemistry exploits elegant regularities and symmetries in the underlying chemical domain. As yet, there has been neither a systematic analysis of the types of structural classification in use in chemistry nor a comparison to the capabilities of available technologies. RESULTS: We analyze the different categories of structural classes in chemistry, presenting a list of patterns for features found in class definitions. We compare these patterns of class definition to tools which allow for automation of hierarchy construction within cheminformatics and within logic-based ontology technology, going into detail in the latter case with respect to the expressive capabilities of the Web Ontology Language and recent extensions for modelling structured objects. Finally, we discuss the relationships and interactions between cheminformatics approaches and logic-based approaches. CONCLUSION: Systems that perform intelligent reasoning tasks on chemistry data require a diverse set of underlying computational utilities including algorithmic, statistical and logic-based tools. For the task of automatic structure-based classification of chemical entities, essential to managing the vast swathes of chemical data being brought online, systems which are capable of hybrid reasoning combining several different approaches are crucial. We provide a thorough review of the available tools and methodologies, and identify areas of open research.

    Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry

    BACKGROUND: Structure elucidation of unknown small molecules by mass spectrometry is a challenge despite advances in instrumentation. The first crucial step is to obtain correct elemental compositions. In order to automatically constrain the thousands of possible candidate structures, rules need to be developed to select the most likely and chemically correct molecular formulas. RESULTS: An algorithm for filtering molecular formulas is derived from seven heuristic rules: (1) restrictions for the number of elements, (2) LEWIS and SENIOR chemical rules, (3) isotopic patterns, (4) hydrogen/carbon ratios, (5) element ratio of nitrogen, oxygen, phosphorus, and sulphur versus carbon, (6) element ratio probabilities and (7) presence of trimethylsilylated compounds. Formulas are ranked according to their isotopic patterns and subsequently constrained by presence in public chemical databases. The seven rules were developed on 68,237 existing molecular formulas and were validated in four experiments. First, 432,968 formulas covering five million PubChem database entries were checked for consistency. Only 0.6% of these compounds did not pass all rules. Next, the rules were shown to effectively reduce the complement of all eight billion theoretically possible C, H, N, S, O, P formulas up to 2000 Da to only 623 million most probable elemental compositions. Third, 6,000 pharmaceutical, toxic and natural compounds were selected from DrugBank, TSCA and DNP databases. The correct formulas were retrieved as top hit at 80–99% probability when assuming data acquisition with complete resolution of unique compounds and 5% absolute isotope ratio deviation and 3 ppm mass accuracy. Last, some exemplary compounds were analyzed by Fourier transform ion cyclotron resonance mass spectrometry and by gas chromatography-time of flight mass spectrometry. In each case, the correct formula was ranked as top hit when combining the seven rules with database queries. CONCLUSION: The seven rules enable an automatic exclusion of molecular formulas which are either wrong or which contain unlikely high or low numbers of elements. The correct molecular formula is assigned with a probability of 98% if the formula exists in a compound database. For truly novel compounds that are not present in databases, the correct formula is found in the first three hits with a probability of 65–81%. Corresponding software and supplemental data are available for download from the authors' website.

    A taxonomic backbone for the global synthesis of species diversity in the angiosperm order Caryophyllales

    The Caryophyllales constitute a major lineage of flowering plants with approximately 12500 species in 39 families. A taxonomic backbone at the genus level is provided that reflects the current state of knowledge and accepts 749 genera for the order. A detailed review of the literature of the past two decades shows that enormous progress has been made in understanding overall phylogenetic relationships in Caryophyllales. The process of re-circumscribing families in order to be monophyletic appears to be largely complete and has led to the recognition of eight new families (Anacampserotaceae, Kewaceae, Limeaceae, Lophiocarpaceae, Macarthuriaceae, Microteaceae, Montiaceae and Talinaceae), while the phylogenetic evaluation of generic concepts is still well underway. As a result of this, the number of genera has increased by more than ten percent in comparison to the last complete treatments in the “Families and genera of vascular plants” series. A checklist with all currently accepted genus names in Caryophyllales, as well as nomenclatural references, type names and synonymy is presented. Notes indicate how extensively the respective genera have been studied in a phylogenetic context. The most diverse families at the generic level are Cactaceae and Aizoaceae, but 28 families comprise only one to six genera. This synopsis represents a first step towards the aim of creating a global synthesis of the species diversity in the angiosperm order Caryophyllales integrating the work of numerous specialists around the world.