Development of deep learning applications for the automated extraction of chemical information from scientific literature

Abstract

This dissertation focuses on developing deep learning applications for extracting chemical information from scientific literature, particularly targeting the automated recognition of molecular structures in images. DECIMER Segmentation, a novel application, employs a Mask Region-based Convolutional Neural Network (MRCNN) model to segment chemical structures in documents, aided by a mask expansion algorithm, marking a significant advancement in processing chemical literature. The Optical Chemical Structure Recognition (OCSR) tool DECIMER Image Transformer uses an encoder-decoder architecture to convert chemical structure depictions into the machine-readable SMILES format. The model has been trained on over 450 million pairs of images and SMILES representations. Its ability to interpret various depiction styles, including hand-drawn structures, sets a new standard in OCSR. To artificially generate large and diverse OCSR training datasets using multiple cheminformatics toolkits, RanDepict was developed. The diversification of training data ensures robust model generalisation across different chemical structure depictions. A unique dataset of hand-drawn molecule images was created to evaluate the model's performance in interpreting these challenging depictions. This dataset further contributes to the understanding of automated structure recognition from diverse styles. The integration of these technologies led to the creation of DECIMER.ai, an open-source web application that combines segmentation and interpretation tools, allowing users to extract and process chemical information from literature efficiently. The work concludes with a discussion on the significance of open data in advancing molecular informatics, highlighting the potential to broader chemical research domains. By adhering to FAIR data standards and open-source principles, the tools developed for this dissertation are designed for adaptability and future development within the community

    Similar works