2,937 research outputs found

    Chemoinformatics Research at the University of Sheffield: A History and Citation Analysis

    Get PDF
    This paper reviews the work of the Chemoinformatics Research Group in the Department of Information Studies at the University of Sheffield, focusing particularly on the work carried out in the period 1985-2002. Four major research areas are discussed, these involving the development of methods for: substructure searching in databases of three-dimensional structures, including both rigid and flexible molecules; the representation and searching of the Markush structures that occur in chemical patents; similarity searching in databases of both two-dimensional and three-dimensional structures; and compound selection and the design of combinatorial libraries. An analysis of citations to 321 publications from the Group shows that it attracted a total of 3725 residual citations during the period 1980-2002. These citations appeared in 411 different journals, and involved 910 different citing organizations from 54 different countries, thus demonstrating the widespread impact of the Group's work

    A survey of chemical information systems

    Get PDF
    A survey of the features, functions, and characteristics of a fairly wide variety of chemical information storage and retrieval systems currently in operation is given. The types of systems (together with an identification of the specific systems) addressed within this survey are as follows: patents and bibliographies (Derwent's Patent System; IFI Comprehensive Database; PULSAR); pharmacology and toxicology (Chemfile; PAGODE; CBF; HEEDA; NAPRALERT; MAACS); the chemical information system (CAS Chemical Registry System; SANSS; MSSS; CSEARCH; GINA; NMRLIT; CRYST; XTAL; PDSM; CAISF; RTECS Search System; AQUATOX; WDROP; OHMTADS; MLAB; Chemlab); spectra (OCETH; ASTM); crystals (CRYSRC); and physical properties (DETHERM). Summary characteristics and current trends in chemical information systems development are also examined

    The computer storage, retrieval and searching of generic structures in chemical patents : the machine-readable representation of generic structures.

    Get PDF
    The nature of the generic chemical structures found in patents is described, with a discussion of the types of statement commonly found in them. The available representations for such structures are reviewed, with particular note being given to the suitability of the representation for searching files of such structures. Requirements for the unambiguous representation of generic structures in an "ideal" storage and retrieval system are discussed. The basic principles of the theory of formal languages are reviewed, with particular consideration being given to parsing methods for context-free languages. The Grammar and parsing of computer programming languages, as an example of artificial formal languages, is discussed. Applications of formal language theory to chemistry and information work are briefly reviewed. GENSAL, a formal language for the unambiguous description of generic structures from patents, is presented. It is designed to be intelligible to a chemist or patent agent, yet sufficiently ABSTRACT formaLised to be amenabLe to computer anaLysis. DetaiLed description is given of the facilities it provides for generic structure representation, and there is discussion of its Limitations and the principLes behind its design. A connection-tabLe-based internaL representation for generic structures, caLLed an ECTR <Extended Connection TabLe Representation) is presented. It is designed to represent generic structures unambiguousLy, and to be generated automatically from structures encoded in GENSAL. It is compared to other proposed representations, and its implementation using data types of the programming Language PascaL described. An interpreter program which generates an ECTR from structures encoded in a subset of the GENSAL Language is presented. The principles of its operation are described. Possible applications of GENSAL outside the area of patent documentation are discussed, and suggestions made for further work on the development of a generic structure storage and retrieval system based on GENSAL and ECTRs

    Information retrieval and text mining technologies for chemistry

    Get PDF
    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo Garciá -Yoldi for useful feedback and discussions during the preparation of the manuscript.info:eu-repo/semantics/publishedVersio

    Similarity Methods in Chemoinformatics

    Get PDF
    promoting access to White Rose research paper

    AiiDA: Automated Interactive Infrastructure and Database for Computational Science

    Full text link
    Computational science has seen in the last decades a spectacular rise in the scope, breadth, and depth of its efforts. Notwithstanding this prevalence and impact, it is often still performed using the renaissance model of individual artisans gathered in a workshop, under the guidance of an established practitioner. Great benefits could follow instead from adopting concepts and tools coming from computer science to manage, preserve, and share these computational efforts. We illustrate here our paradigm sustaining such vision, based around the four pillars of Automation, Data, Environment, and Sharing. We then discuss its implementation in the open-source AiiDA platform (http://www.aiida.net), that has been tuned first to the demands of computational materials science. AiiDA's design is based on directed acyclic graphs to track the provenance of data and calculations, and ensure preservation and searchability. Remote computational resources are managed transparently, and automation is coupled with data storage to ensure reproducibility. Last, complex sequences of calculations can be encoded into scientific workflows. We believe that AiiDA's design and its sharing capabilities will encourage the creation of social ecosystems to disseminate codes, data, and scientific workflows.Comment: 30 pages, 7 figure

    Structural Pattern Recognition for Chemical-Compound Virtual Screening

    Get PDF
    Les molècules es configuren de manera natural com a xarxes, de manera que són ideals per estudiar utilitzant les seves representacions gràfiques, on els nodes representen àtoms i les vores representen els enllaços químics. Una alternativa per a aquesta representació directa és el gràfic reduït ampliat, que resumeix les estructures químiques mitjançant descripcions de nodes de tipus farmacòfor per codificar les propietats moleculars rellevants. Un cop tenim una manera adequada de representar les molècules com a gràfics, hem de triar l’eina adequada per comparar-les i analitzar-les. La distància d'edició de gràfics s'utilitza per resoldre la concordança de gràfics tolerant als errors; aquesta metodologia calcula la distància entre dos gràfics determinant el nombre mínim de modificacions necessàries per transformar un gràfic en l’altre. Aquestes modificacions (conegudes com a operacions d’edició) tenen associat un cost d’edició (també conegut com a cost de transformació), que s’ha de determinar en funció del problema. Aquest estudi investiga l’eficàcia d’una comparació molecular basada només en gràfics que utilitza gràfics reduïts ampliats i distància d’edició de gràfics com a eina per a aplicacions de cribratge virtual basades en lligands. Aquestes aplicacions estimen la bioactivitat d'una substància química que utilitza la bioactivitat de compostos similars. Una part essencial d’aquest estudi es centra en l’ús d’aprenentatge automàtic i tècniques de processament del llenguatge natural per optimitzar els costos de transformació utilitzats en les comparacions moleculars amb la distància d’edició de gràfics.Las moléculas tienen la forma natural de redes, lo que las hace ideales para estudiar mediante el empleo de sus representaciones gráficas, donde los nodos representan los átomos y los bordes representan los enlaces químicos. Una alternativa para esta representación sencilla es el gráfico reducido extendido, que resume las estructuras químicas utilizando descripciones de nodos de tipo farmacóforo para codificar las propiedades moleculares relevantes. Una vez que tenemos una forma adecuada de representar moléculas como gráficos, debemos elegir la herramienta adecuada para compararlas y analizarlas. La distancia de edición de gráficos se utiliza para resolver la coincidencia de gráficos tolerante a errores; esta metodología estima una distancia entre dos gráficos determinando el número mínimo de modificaciones necesarias para transformar un gráfico en el otro. Estas modificaciones (conocidas como operaciones de edición) tienen un costo de edición (también conocido como costo de transformación) asociado, que debe determinarse en función del problema. Este estudio investiga la efectividad de una comparación molecular basada solo en gráficos que emplea gráficos reducidos extendidos y distancia de edición de gráficos como una herramienta para aplicaciones de detección virtual basadas en ligandos. Estas aplicaciones estiman la bioactividad de una sustancia química empleando la bioactividad de compuestos similares. Una parte esencial de este estudio se centra en el uso de técnicas de procesamiento de lenguaje natural y aprendizaje automático para optimizar los costos de transformación utilizados en las comparaciones moleculares con la distancia de edición de gráficos.Molecules are naturally shaped as networks, making them ideal for studying by employing their graph representations, where nodes represent atoms and edges represent the chemical bonds. An alternative for this straightforward representation is the extended reduced graph, which summarizes the chemical structures using pharmacophore-type node descriptions to encode the relevant molecular properties. Once we have a suitable way to represent molecules as graphs, we need to choose the right tool to compare and analyze them. Graph edit distance is used to solve the error-tolerant graph matching; this methodology estimates a distance between two graphs by determining the minimum number of modifications required to transform one graph into the other. These modifications (known as edit operations) have an edit cost (also known as transformation cost) associated, which must be determined depending on the problem. This study investigates the effectiveness of a graph-only driven molecular comparison employing extended reduced graphs and graph edit distance as a tool for ligand-based virtual screening applications. Those applications estimate the bioactivity of a chemical employing the bioactivity of similar compounds. An essential part of this study focuses on using machine learning and natural language processing techniques to optimize the transformation costs used in the molecular comparisons with the graph edit distance. Overall, this work shows a framework that combines graph reduction and comparison with optimization tools and natural language processing to identify bioactivity similarities in a structurally diverse group of molecules. We confirm the efficiency of this framework with several chemoinformatic tests applied to regression and classification problems over different publicly available datasets
    corecore