
    Tautomerism in large databases

    We have used the Chemical Structure DataBase (CSDB) of the NCI CADD Group, an aggregated collection of over 150 small-molecule databases totaling 103.5 million structure records, to conduct tautomerism analyses on one of the largest currently existing sets of real (i.e. not computer-generated) compounds. This analysis was carried out using calculable chemical structure identifiers developed by the NCI CADD Group, based on hash codes available in the chemoinformatics toolkit CACTVS and a newly developed scoring scheme to define a canonical tautomer for any encountered structure. We used CACTVS’s tautomerism definition, a set of 21 transform rules expressed in SMIRKS line notation, which takes a comprehensive stance on the possible types of tautomeric interconversion. Tautomerism was found to be possible for more than two-thirds of the unique structures in the CSDB. A total of 680 million tautomers were calculated from, and including, the original structure records. Tautomeric overlap within the same individual database (i.e. at least one other entry was present that was really only a different tautomeric representation of the same compound) was found at an average rate of 0.3% of the original structure records, with values as high as nearly 2% for some of the databases in CSDB. Projected onto the set of unique structures (by FICuS identifier), this still occurred in about 1.5% of the cases. Tautomeric overlap across all constituent databases in CSDB was found for nearly 10% of the records in the collection.
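
The overlap rates quoted in this abstract can be illustrated with a minimal sketch. It assumes every record already carries a tautomer-invariant key (a hypothetical stand-in for an identifier computed from the canonical tautomer) alongside a key for the exact drawn structure. The toy records, field layout and function name are invented for illustration and are not the CSDB pipeline or the CACTVS API.

```python
from collections import defaultdict

# Toy records: (database, record_id, structure_key, tautomer_key).
# structure_key names the exact drawn structure; tautomer_key is a
# hypothetical identifier shared by all tautomers of one compound.
RECORDS = [
    ("db_a", "a1", "2-hydroxypyridine", "T1"),
    ("db_a", "a2", "2-pyridone", "T1"),    # tautomer of a1, same database
    ("db_a", "a3", "benzene", "T2"),
    ("db_b", "b1", "2-pyridone", "T1"),    # same drawn structure as a2
    ("db_b", "b2", "acetone", "T3"),
    ("db_c", "c1", "propen-2-ol", "T3"),   # tautomer of b2, other database
]


def tautomeric_overlap(records, across_databases=False):
    """Fraction of records that share a tautomer key with at least one
    differently drawn structure, within one database or across all."""
    groups = defaultdict(set)
    for db, _rid, structure, tautomer in records:
        scope = "all" if across_databases else db
        groups[(scope, tautomer)].add(structure)

    hits = 0
    for db, _rid, _structure, tautomer in records:
        scope = "all" if across_databases else db
        # More than one distinct structure means a tautomeric duplicate.
        if len(groups[(scope, tautomer)]) > 1:
            hits += 1
    return hits / len(records)


if __name__ == "__main__":
    print(tautomeric_overlap(RECORDS))
    print(tautomeric_overlap(RECORDS, across_databases=True))
```

Grouping on the pair (scope, tautomer key) keeps the within-database and cross-database counts in one function; exact duplicates of the same drawn structure are deliberately not counted as tautomeric overlap.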

    On InChI and evaluating the quality of cross-reference links

    BACKGROUND: There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones. RESULTS: We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by ambiguity in how the underlying structural data were interpreted. InChI identifiers were used successfully to find duplicate entries in databases. We found that the rate of InChI inconsistencies in the manually curated links is very high (28.85% in the worst case). Even using a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links. CONCLUSIONS: We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links if they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and by the inherent limitations of the InChI method.
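
A minimal sketch of how consistency and completeness of curated links might be measured against InChI-derived links follows. The abstract does not give formulas, so the two definitions used here are assumptions: consistency is the share of curated links whose two entries have identical InChI strings, and completeness is the share of InChI-derived links that also appear among the curated ones. The entry IDs and the function name are hypothetical.

```python
# Entry ID -> InChI string for two hypothetical databases A and B.
INCHI_A = {
    "A1": "InChI=1S/H2O/h1H2",                   # water
    "A2": "InChI=1S/CH4/h1H4",                   # methane
    "A3": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
}
INCHI_B = {
    "B1": "InChI=1S/H2O/h1H2",                   # water
    "B2": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
    "B3": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",  # benzene
}
# Manually curated cross-references (A entry, B entry); A2-B3 is wrong
# and the ethanol pair A3-B2 is missing.
CURATED = [("A1", "B1"), ("A2", "B3")]


def link_quality(inchi_a, inchi_b, curated_links):
    """Return (consistency, completeness) of curated links, using links
    derived from identical InChI strings as the reference (assumed
    definitions, see the lead-in)."""
    auto_links = {
        (a, b)
        for a, ia in inchi_a.items()
        for b, ib in inchi_b.items()
        if ia == ib
    }
    curated = set(curated_links)
    agreed = curated & auto_links
    consistency = len(agreed) / len(curated) if curated else 1.0
    completeness = len(agreed) / len(auto_links) if auto_links else 1.0
    return consistency, completeness


if __name__ == "__main__":
    print(link_quality(INCHI_A, INCHI_B, CURATED))
```

The all-pairs comparison is quadratic and only suitable for toy data; for real databases one would index entries by InChI first.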

    Chemoinformatics approaches for new drugs discovery

    Chemoinformatics uses computational methods and technologies to solve chemical problems. It works on molecular structures, their representations, properties and related data. The first and most important phase in this field is the translation of interconnected atomic systems into in-silico models, ensuring complete and correct transfer of chemical information. In the last 20 years, chemical databases have evolved from molecular repositories into research tools for identifying new drugs, while modern high-throughput technologies allow chemical libraries to keep growing, as highlighted by publicly available repositories such as PubChem [http://pubchem.ncbi.nlm.nih.gov/], ZINC [http://zinc.docking.org/] and ChemSpider [http://www.chemspider.com/]. The fundamental requirements of chemical libraries are molecular uniqueness, absence of ambiguity, chemical correctness (related to atoms, bonds, chemical orthography), and standardized storage and registration formats. The aim of this work is the development of chemoinformatics tools and data for the drug discovery process. The first part of the research project focused on the analysis of accessible commercial chemical space, looking for molecular redundancy and checking the correctness of in-silico models, in order to identify a unique and univocal molecular descriptor for indexing chemical libraries. This made it possible to achieve 0% redundancy on a library of 42 million compounds. The protocol was implemented as MMsDusty, a web-based tool for cleaning molecular databases. The major protocol developed is MMsINC, a chemoinformatics platform based on an initial set of 4 million non-redundant, high-quality, annotated and biomedically relevant chemical structures; the library is now being expanded to 460 million compounds. MMsINC can perform various types of queries, such as substructure or similarity searches and descriptor filtering. MMsINC is interfaced with the PDB (Protein Data Bank) [http://www.rcsb.org/pdb/home/home.do] and related to approved drugs. The second protocol developed is pepMMsMIMIC, a peptidomimetic screening tool based on multiconformational chemical libraries; the screening process uses pharmacophoric fingerprint similarity to identify small molecules able to geometrically and chemically mimic endogenous peptides or proteins. The last part of the project led to the implementation of an optimized and exhaustive conformational space analysis protocol for small-molecule libraries; this is crucial for predicting the high-quality 3D molecular models required in chemoinformatics applications. The torsional exploration was optimized in the range of the most frequent dihedral angles seen in X-ray-solved small-molecule structures from the CSD (Cambridge Structural Database); applying this to a library of 89 million structures generated a library of 2.6 × 10^7 high-quality conformers. The tools, protocols and platforms developed in this work allow chemoinformatics analysis and screening of large chemical libraries, yielding high-quality, correct and unique chemical data and in-silico models.

    The Polytope Formalism: isomerism and associated unimolecular isomerisation

    This thesis concerns the ontology of isomerism, encompassing the conceptual frameworks and relationships that comprise the subject matter; the necessary formal definitions, nomenclature, and representations that have impacts reaching into unexpected areas such as drug registration and patent specifications; the requisite controlled and precise vocabulary that facilitates nuanced communication; and the digital/computational formalisms that underpin the chemistry software and database tools that empower chemists to perform much of their work. Using conceptual tools taken from combinatorics and graph theory, means are presented to provide a unified description of isomerism and associated unimolecular isomerisation, spanning both constitutional isomerism and stereoisomerism, called the Polytope Formalism. This includes unification of the varying approaches historically taken to describe and understand stereoisomerism in organic and inorganic compounds. Work for this thesis began with the synthesis, isolation, and characterisation of compounds not adequately describable using existing IUPAC recommendations. Generalisation of the polytopal-rearrangements model of stereoisomerisation used in inorganic chemistry led to prescriptions that could deal with the synthesised compounds, revealing an unrecognised fundamental form of isomerism called akamptisomerism. Following on, this thesis describes how attempting to place akamptisomerism within the context of existing stereoisomerism reveals significant systematic deficiencies in the IUPAC recommendations. These shortcomings have limited the conceptualisation of broad classes of compounds and hindered development of molecules for medicinal and technological applications. It is shown how the Polytope Formalism can be applied to the description of constitutional isomerism in a practical manner. Finally, a radically different medicinal chemistry design strategy with broad application, based upon these principles, is described.

    Characterisation of data resources for in silico modelling: benchmark datasets for ADME properties.

    Introduction: The cost of in vivo and in vitro screening of ADME properties of compounds has motivated efforts to develop a range of in silico models. At the heart of the development of any computational model are the data; high-quality data are essential for developing robust and accurate models. The characteristics of a dataset, such as its availability, size, format and type of chemical identifiers used, influence the modelability of the data. Areas covered: This review explores the usefulness of publicly available ADME datasets for researchers to use in the development of predictive models. More than 140 ADME datasets were collated from publicly available resources, and the modelability of 31 selected datasets was assessed using specific criteria derived in this study. Expert opinion: Publicly available datasets differ significantly in information content and presentation. From a modelling perspective, datasets should be of adequate size and available in a user-friendly format, with all chemical structures associated with one or more chemical identifiers suitable for automated processing (e.g. CAS number, SMILES string or InChIKey). Recommendations for assessing dataset suitability for modelling and publishing data in an appropriate format are discussed.
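
One practical reading of "identifiers suitable for automated processing" is that a dataset's identifier column can be sanity-checked by a script. The sketch below is an illustration only, not a procedure from the review: it checks the shape of a standard 27-character InChIKey and the check digit of a CAS Registry Number. Neither test confirms that the identifier belongs to the intended compound, and the function names are hypothetical.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def looks_like_inchikey(text):
    """True if the string has the shape of a standard InChIKey."""
    return bool(INCHIKEY_RE.match(text))


def is_valid_cas(text):
    """True if the string is shaped like a CAS number and its check digit
    matches: digits are weighted 1, 2, 3, ... from the right (excluding
    the check digit) and the weighted sum is taken modulo 10."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


if __name__ == "__main__":
    print(is_valid_cas("7732-18-5"))                           # water
    print(looks_like_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))  # water
```

A check like this catches truncated or mistyped identifiers before modelling starts, which is cheaper than discovering them after structures fail to resolve.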

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field. A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    "MS-Ready" structures for non-targeted high-resolution mass spectrometry screening studies.

    Chemical database searching has become a fixture in many non-targeted identification workflows based on high-resolution mass spectrometry (HRMS). However, the form of a chemical structure observed in HRMS does not always match the form stored in a database (e.g., the neutral form versus a salt; one component of a mixture rather than the mixture form used in a consumer product). Linking the form of a structure observed via HRMS to its related form(s) within a database will enable the return of all relevant variants of a structure, as well as the related metadata, in a single query. A Konstanz Information Miner (KNIME) workflow has been developed to produce the structural representations observed using HRMS ("MS-Ready structures") and link them to those stored in a database. These MS-Ready structures, and associated mappings to the full chemical representations, are surfaced via the US EPA's Chemistry Dashboard (https://comptox.epa.gov/dashboard/). This article describes the workflow for the generation and linking of ~700,000 MS-Ready structures (derived from ~760,000 original structures) as well as download, search and export capabilities to serve structure identification using HRMS. The importance of this form of structural representation for HRMS is demonstrated with several examples, including integration with the in silico fragmentation software application MetFrag. The structures, search, download and export functionality are all available through the CompTox Chemistry Dashboard, while the MetFrag implementation can be viewed at https://msbi.ipb-halle.de/MetFragBeta/
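
The idea of linking an observed form back to every stored form that contains it can be sketched with plain string handling. This is a deliberately crude illustration and not the KNIME workflow described above: it assumes the stored SMILES are already canonicalised consistently, splits dot-separated components, and drops a small, hypothetical list of counterions. Real desalting, neutralisation and stereochemistry handling need a cheminformatics toolkit.

```python
from collections import defaultdict

# Illustrative, non-exhaustive set of components treated as counterions.
COUNTERIONS = {"Cl", "Br", "[Na+]", "[K+]", "[Cl-]"}

# Hypothetical stored records: record ID -> SMILES as registered.
RECORDS = {
    "rec1": "CN.Cl",        # methylamine hydrochloride
    "rec2": "CN",           # methylamine
    "rec3": "CCO.CC(C)O",   # mixture of ethanol and isopropanol
}


def ms_ready_components(smiles):
    """Split a dot-separated SMILES into components and drop counterions."""
    return [part for part in smiles.split(".")
            if part and part not in COUNTERIONS]


def build_ms_ready_index(records):
    """Map each MS-ready component to every record that contains it."""
    index = defaultdict(list)
    for record_id, smiles in records.items():
        for component in ms_ready_components(smiles):
            index[component].append(record_id)
    return dict(index)


if __name__ == "__main__":
    for component, ids in build_ms_ready_index(RECORDS).items():
        print(component, ids)
```

With such an index, a single lookup on the observed form returns the salt, the neutral parent and any mixture that contains it, which is the behaviour the abstract describes for a single query.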

    The problem of the manifestation of dynamic processes when confirming the identity of organic compounds by NMR spectroscopy

    The number, shape and position of NMR spectral lines depend on dynamic processes, which creates difficulties in identifying pharmaceutical substances by NMR spectroscopy. The aim of the paper was to study instances of intramolecular dynamic processes that interfere with the identification of organic compounds by NMR, and to illustrate the potential and the limitations of the methods used to reduce their effects. Materials and methods: 1H and 13C spectra of the following pharmaceutical substances were used to demonstrate the negative effects of dynamic processes: buserelin acetate, valsartan, goserelin acetate, iopromide, clopidogrel hydrogensulfate, omeprazole, proroxan, risperidone, triptorelin acetate, and enalapril maleate. The spatial structures of conformers were established by 1H-1H ROESY experiments. Quantum-chemical calculation of the geometric and thermodynamic characteristics of the conformers was carried out by the PM3 method, and of the electronic characteristics by the AM1 method, using the HyperChem software. Results: the authors analysed the intramolecular dynamic processes most commonly encountered in expert work: pyramidal inversion of nitrogen in a heterocyclic compound (risperidone, proroxan, clopidogrel), rotation of molecular fragments around the amide bond (valsartan, iopromide, enalapril), and prototropic rearrangements (buserelin, goserelin, omeprazole, triptorelin). The change in exchange rates was explained in terms of changes in the system of intra- and intermolecular non-valence interactions. Conclusions: the traditional methods for increasing the rate of dynamic processes (raising the temperature and changing the solvent) do not always eliminate the negative effects of intramolecular transformations. Methods of smoothing the spectral manifestations of dynamic processes have limited application because strong intramolecular non-valence interactions prevent the dynamic process from being shifted into the fast-exchange regime. Experts and manufacturers should take the manifestation of dynamic processes into account when identifying pharmaceutical substances by NMR spectroscopy.