26 research outputs found

    A document classifier for medicinal chemistry publications trained on the ChEMBL corpus

    Background: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches to assist in the triage process, both for individual scientists and for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier that distinguishes between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining. Results: The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9). Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches. Conclusions: Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data. FWN – Publicaties zonder aanstelling Universiteit Leide

    R-BERT-CNN : Drug-target interactions extraction from biomedical literature

    In this research, we present our participation in the DrugProt task of the BioCreative VII challenge. Drug-target interactions (DTIs) are critical for drug discovery and repurposing, and they are often extracted manually from experimental articles. There are >32M biomedical articles on PubMed, and manually extracting DTIs from such a huge knowledge base is challenging. To address this issue, we provide a solution for Track 1, which aims to extract 10 types of interactions between drug and protein entities. We applied an ensemble classifier model that combines BioMed-RoBERTa, a state-of-the-art language model, with Convolutional Neural Networks (CNN) to extract these relations. Despite the class imbalances in the BioCreative VII DrugProt test corpus, our model performs well compared to the average of the other submissions in the challenge, with a micro F1 score of 55.67% (and 63% on the BioCreative VI ChemProt test corpus). The results show the potential of deep learning in extracting various types of DTIs. Peer reviewe
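
    The micro F1 score quoted in the abstract above pools true positives, false positives and false negatives across all relation types before computing precision and recall, so frequent classes weigh more than rare ones. The following is a minimal sketch of that calculation, not the challenge's official scorer: the function name `micro_f1`, the tuple layout and the relation labels are assumptions made for illustration only.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over relation tuples.

    Each relation is a hashable tuple such as
    (document_id, drug_entity, protein_entity, relation_type).
    This layout is a hypothetical choice for the sketch.
    """
    gold_set, pred_set = set(gold), set(predicted)
    # Counts are pooled over all relation types (micro-averaging).
    tp = len(gold_set & pred_set)   # predicted and correct
    fp = len(pred_set - gold_set)   # predicted but not in gold
    fn = len(gold_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data with made-up identifiers and labels.
gold = [
    ("doc1", "T1", "T2", "INHIBITOR"),
    ("doc1", "T3", "T4", "ACTIVATOR"),
    ("doc2", "T1", "T5", "SUBSTRATE"),
    ("doc2", "T6", "T7", "INHIBITOR"),
]
predicted = [
    ("doc1", "T1", "T2", "INHIBITOR"),   # correct
    ("doc1", "T3", "T4", "INHIBITOR"),   # wrong relation type
    ("doc2", "T1", "T5", "SUBSTRATE"),   # correct
]
score = micro_f1(gold, predicted)
print(f"micro F1 = {score:.4f}")
```

    In this toy case there are 2 true positives, 1 false positive and 2 false negatives, so precision is 2/3, recall is 1/2 and the micro F1 is 4/7.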

    Implementation of virtual workflows in KNIME for medicinal chemistry

    This project demonstrates how two applications created in KNIME, an open-source data analytics, reporting and integration platform, are used to support research scientists in medicinal chemistry. The first application flags pan-assay interference compounds (PAINS), such as “promiscuous” compounds present in chemical libraries that recurrently behave as false positive hits in screening campaigns. The second application, the PubMed alert workflow, adapts a previously published workflow that automatically scans the recently published scientific literature on a weekly basis and identifies articles considered relevant to medicinal chemists focused on epigenetic mechanisms, a novel and promising field in drug discovery. These workflows are important because they allow a user with relatively little training to extract data that would typically require a trained chemist. The PAINS workflow performed adequately, but the data was problematic. This workflow and an online tool used to compare results flagged different but overlapping sets of compounds. The PubMed alert workflow performed very well, consistently identifying new papers. These workflows have been implemented at the Structural Genomics Consortium (SGC) in Toronto. Both workflows are available at http://sgc.utoronto.ca/ditommaso.zip. The implementation of these workflows demonstrates that the process is viable and paves the way for the implementation of more complex workflows.

    Computer Aided Synthesis Prediction to Enable Augmented Chemical Discovery and Chemical Space Exploration

    The drug-like chemical space is estimated to contain 10 to the power of 60 molecules, and the largest generated database (GDB) obtained by the Reymond group contains 165 billion molecules with up to 17 heavy atoms. Furthermore, deep learning techniques to explore regions of chemical space are becoming more popular. However, the key to realizing the generated structures experimentally lies in chemical synthesis, the application of which was previously limited to manual planning or slow computer assisted synthesis planning (CASP) models. Despite the 60-year history of CASP, few synthesis planning tools have been open-sourced to the community. In this thesis I co-led the development of, and investigated, one of the only fully open-source synthesis planning tools, called AiZynthFinder, trained on both public and proprietary datasets consisting of up to 17.5 million reactions. This enables synthesis guided exploration of the chemical space in a high throughput manner, to bridge the gap between compound generation and experimental realisation. I first investigate both public and proprietary reaction data, and their influence on route finding capability. Furthermore, I develop metrics for assessment of retrosynthetic prediction, single-step retrosynthesis models, and automated template extraction workflows. This is supplemented by a comparison of the underlying datasets and their corresponding models. Given the prevalence of ring systems in the GDB and wider medicinal chemistry domain, I developed ‘Ring Breaker’, a data-driven approach to enable the prediction of ring-forming reactions. I demonstrate its utility on frequently found and unprecedented ring systems, in agreement with literature syntheses. Additionally, I highlight its potential for incorporation into CASP tools, and outline methodological improvements that improve route-finding capability.
To tackle the challenge of model throughput, I report a machine learning (ML) based classifier called the retrosynthetic accessibility score (RAscore), which assesses the likelihood of finding a synthetic route using AiZynthFinder. The RAscore computes at least 4,500 times faster than AiZynthFinder. This opens the possibility of pre-screening millions of virtual molecules from enumerated databases or generative models for synthesis informed compound prioritization. Finally, I combine chemical library visualization with synthetic route prediction to facilitate experimental engagement with synthetic chemists. I enable the navigation of chemical property space by using interactive visualization to deliver associated synthetic data as endpoints. This aids in the prioritization of compounds. The ability to view synthetic route information alongside structural descriptors facilitates a feedback mechanism for the improvement of CASP tools and enables rapid hypothesis testing. I demonstrate the workflow as applied to the GDB databases to augment compound prioritization and synthetic route design.

    Text Mining for Chemical Compounds

    Exploring the chemical and biological space covered by patent and journal publications is crucial in early-stage medicinal chemistry activities. The analysis provides understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents and journals through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. In this book, we addressed the lack of quality measurements for assessing the correctness of structural representation within and across chemical databases; the lack of resources to build text-mining systems; the lack of high performance systems to extract chemical compounds from journals and patents; and the lack of automated systems to identify relevant compounds in patents. The consistency and ambiguity of chemical identifiers were analyzed within and between small-molecule databases in Chapter 2 and Chapter 3. In Chapter 4 and Chapter 7 we developed resources to enable the construction of chemical text-mining systems. In Chapter 5 and Chapter 6, we used community challenges (BioCreative V and BioCreative VI) and their corresponding resources to identify mentions of chemical compounds in journal abstracts and patents. In Chapter 7 we used our findings in previous chapters to extract chemical named entities from patent full text and to classify the relevancy of chemical compounds.

    Automatic identification of relevant chemical compounds from patents

    In commercial research and development projects, public disclosure of new chemical compounds often takes place in patents. Only a small proportion of these compounds are published in journals, usually a few years after the patent. Patent authorities make the patents available but do not provide systematic continuous chemical annotations. Content databases such as Elsevier’s Reaxys provide such services mostly based on manual excerptions, which are time-consuming and costly. Automatic text-mining approaches help overcome some of the limitations of the manual process. Different text-mining approaches exist to extract chemical entities from patents. The majority of them have been developed using sub-sections of patent documents and focus on mentions of compounds. Less attention has been given to the relevancy of a compound in a patent. Relevancy of a compound to a patent is based on the patent’s context. A relevant compound plays a major role within a patent. Identification of relevant compounds reduces the size of the extracted data and improves the usefulness of patent resources (e.g. supports identifying the main compounds). Annotators of databases like Reaxys only annotate relevant compounds. In this study, we design an automated system that extracts chemical entities from patents and classifies their relevance. The gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were relevant compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition system was based on proprietary tools. The performance (F-score) of the system on compound recognition was 84% on the development set and 86% on the test set. The relevancy classification system had an F-score of 86% on the development set and 82% on the test set. Our system can extract chemical compounds from patents and classify their relevance with high performance. This enables the extension of the Reaxys database by means of automation.

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field. A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain).
This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript. info:eu-repo/semantics/publishedVersio

    Structuring the Unstructured: Unlocking pharmacokinetic data from journals with Natural Language Processing

    The development of a new drug is an increasingly expensive and inefficient process. Many drug candidates are discarded due to pharmacokinetic (PK) complications detected at clinical phases. It is critical to accurately estimate the PK parameters of new drugs before being tested in humans since they will determine their efficacy and safety outcomes. Preclinical predictions of PK parameters are largely based on prior knowledge from other compounds, but much of this potentially valuable data is currently locked in the format of scientific papers. With an ever-increasing amount of scientific literature, automated systems are essential to exploit this resource efficiently. Developing text mining systems that can structure PK literature is critical to improving the drug development pipeline. This thesis studied the development and application of text mining resources to accelerate the curation of PK databases. Specifically, the development of novel corpora and suitable natural language processing architectures in the PK domain were addressed. The work presented focused on machine learning approaches that can model the high diversity of PK studies, parameter mentions, numerical measurements, units, and contextual information reported across the literature. Additionally, architectures and training approaches that could efficiently deal with the scarcity of annotated examples were explored. The chapters of this thesis tackle the development of suitable models and corpora to (1) retrieve PK documents, (2) recognise PK parameter mentions, (3) link PK entities to a knowledge base and (4) extract relations between parameter mentions, estimated measurements, units and other contextual information. Finally, the last chapter of this thesis studied the feasibility of the whole extraction pipeline to accelerate tasks in drug development research. 
The results from this thesis demonstrated the potential of text mining approaches to automatically generate PK databases that can aid researchers in the field and ultimately accelerate the drug development pipeline. Additionally, the thesis presented contributions to biomedical natural language processing by developing suitable architectures and corpora for multiple tasks, tackling novel entities and relations within the PK domain.