SPECTRa-T Final Report July 2008

Abstract

Much of the experimental data generated by postgraduate researchers in chemistry and related departments is conventionally reported in theses. Although such theses might describe up to 50 novel chemical syntheses, with full characterisation of the synthesised compounds, much of this is not communicated to the scientific community in peer-reviewed publications in an appropriate form (numbers are reduced to points on diagrams, tables are converted to graphs in pixel form), and a significant proportion of preparative procedures (anecdotally estimated at 80%) are never formally submitted at all. Although the essential outline of a synthesis is published, the detailed experimental recipes (as found in the thesis) are often omitted.

The SPECTRa-T (Submission, Preservation and Exposure of Chemistry Teaching and Research Data from Theses) project was funded as a proof-of-concept project to develop software that automatically extracts the chemical terms and objects contained within electronic theses (e-theses) [2]. We have shown that it is possible to reliably identify organic chemical terms in theses in both Portable Document Format (PDF) [3] and Office Open XML (DOCX) [4] format, and to extract these terms and deposit them within a Resource Description Framework (RDF) triplestore. Semantic Web standards for searching data have been developed by the W3C [5], and we have explored the viability of RDF-based semantic querying to enable re-use of the data contained within chemistry e-theses. Although the internal structure of PDF did not permit the identification of chemical objects (e.g. spectral assignments and physical properties), their capture from DOCX format e-theses as Chemical Markup Language (CML) [6] data files was achieved. These files were deposited in APP-enabled [7] data repositories, each being URI-linked to a searchable named chemical entity in the RDF triplestore.

The main outcomes were:

• routine and automatic extraction of chemical objects (e.g. molecules, spectra) and named chemical entities in high volumes, their transformation into metadata, and their capture into data repositories and triplestores;
• exploration of the viability of RDF-based semantic querying;
• a review of current document format practice in the deposition of chemistry theses and of how this influences the ease of data extraction.

This machine-based identification of chemical terms was achieved using modified OSCAR3 processing software [8] which, in part through its use of the ChEBI chemistry ontology [9], is specific to the 'small molecule' organic structures typically found in synthetic organic chemistry theses. The need to develop other chemistry-domain ontologies is indicated.

SPECTRa-T was funded by JISC's Digital Repositories Programme as a joint project between Cambridge University Library and the chemistry departments of the University of Cambridge and Imperial College London.
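
The linking described above, in which each deposited data file is URI-linked to a searchable named chemical entity in the triplestore, can be illustrated with a minimal sketch. This is not the project's software: the in-memory store, the `example.org` URIs, and the predicate names `mentionsEntity` and `hasDataFile` are all hypothetical and chosen only for illustration. Only the `rdfs:label` URI is a real W3C vocabulary term.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]

# Hypothetical namespace and predicates, for illustration only.
EX = "http://example.org/spectra-t/"
MENTIONS = EX + "mentionsEntity"
HAS_DATA = EX + "hasDataFile"
# Real W3C vocabulary term for a human-readable name.
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


class TripleStore:
    """A toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for t in self._triples:
            if (s is None or t[0] == s) and \
               (p is None or t[1] == p) and \
               (o is None or t[2] == o):
                yield t


def data_files_for(store: TripleStore, name: str) -> list[str]:
    """Return the URIs of data files linked to entities labelled `name`."""
    files = []
    for entity, _, _ in store.match(p=LABEL, o=name):
        for _, _, file_uri in store.match(s=entity, p=HAS_DATA):
            files.append(file_uri)
    return sorted(files)


store = TripleStore()
entity = EX + "entity/benzaldehyde"
store.add(entity, LABEL, "benzaldehyde")
store.add(EX + "thesis/1", MENTIONS, entity)
# Hypothetical repository URI for a CML file extracted from the thesis.
store.add(entity, HAS_DATA, "http://example.org/repo/thesis1/compound3.cml")

print(data_files_for(store, "benzaldehyde"))
```

A production triplestore would expose the same kind of lookup through a standard query language such as SPARQL rather than a hand-written pattern matcher; the sketch only shows how a name search can resolve to the URI of a deposited data file.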
