46 research outputs found

    A metadata-driven approach to data repository design

    The design and use of a metadata-driven data repository for research data management is described. Metadata is collected automatically during the submission process whenever possible and is registered with DataCite in accordance with its current metadata schema, in exchange for a persistent digital object identifier. Two examples of data preview are illustrated, including the demonstration of a method for integration with commercial software that confers rich domain-specific data analytics without introducing customisation into the repository itself.
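
The idea of collecting metadata automatically at submission time can be illustrated with a minimal Python sketch. It is not the repository's actual implementation: the function names, the record layout, and the list of required fields are assumptions made for illustration, only loosely modelled on the kind of properties a DataCite record carries.

```python
import hashlib
import mimetypes
import os
from datetime import datetime, timezone

# Hypothetical list of fields the record must carry before registration.
# These names are assumptions; consult the current DataCite schema before use.
REQUIRED_FIELDS = ("creators", "titles", "publisher",
                   "publicationYear", "resourceTypeGeneral")


def collect_metadata(path, creator, title, publisher):
    """Build a metadata record, deriving what it can from the file itself."""
    with open(path, "rb") as fh:
        checksum = hashlib.sha256(fh.read()).hexdigest()
    size = os.path.getsize(path)
    mime, _ = mimetypes.guess_type(path)
    return {
        # Supplied by the submitter.
        "creators": [{"name": creator}],
        "titles": [{"title": title}],
        "publisher": publisher,
        # Derived automatically during submission.
        "publicationYear": datetime.now(timezone.utc).year,
        "resourceTypeGeneral": "Dataset",
        "formats": [mime or "application/octet-stream"],
        "sizes": [f"{size} bytes"],
        "sha256": checksum,
    }


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]
```

A submission workflow along these lines would call `collect_metadata` once per uploaded file and refuse to request an identifier while `missing_fields` returns a non-empty list.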

    Development and implementation of in silico molecule fragmentation algorithms for the cheminformatics analysis of natural product spaces

    Computational methodologies extracting specific substructures like functional groups or molecular scaffolds from input molecules can be grouped under the term “in silico molecule fragmentation”. They can be used to investigate what specifically characterises a heterogeneous compound class, like pharmaceuticals or Natural Products (NP), and in which aspects they are similar or dissimilar. The aim is to determine what specifically characterises NP structures to transfer patterns favourable for bioactivity to drug development. As part of this thesis, the first algorithmic approach to in silico deglycosylation, the removal of glycosidic moieties for the study of aglycones, was developed with the Sugar Removal Utility (SRU) (Publication A). The SRU has also proven useful for investigating NP glycoside space. It was applied to one of the largest open NP databases, COCONUT (COlleCtion of Open Natural prodUcTs), for this purpose (Publication B). A contribution was made to the Chemistry Development Kit (CDK) by developing the open Scaffold Generator Java library (Publication C). Scaffold Generator can extract different scaffold types and dissect them into smaller parent scaffolds following the scaffold tree or scaffold network approach. Publication D describes the OngLai algorithm, the first automated method to identify homologous series in input datasets, group the member structures of each series, and extract their common core. To support the development of new fragmentation algorithms, the open Java rich client graphical user interface application MORTAR (MOlecule fRagmenTAtion fRamework) was developed as part of this thesis (Publication E). MORTAR allows users to quickly execute the steps of importing a structural dataset, applying a fragmentation algorithm, and visually inspecting the results in different ways. All software developed as part of this thesis is freely and openly available (see https://github.com/JonasSchaub).

    Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation

    The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally, matter engineering, generally denoted as inverse design, relied heavily on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard string-based molecular representation, SMILES, shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without the adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal working of the generative models. Comment: 6+3 pages, 6+1 figure
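
The fragility of SMILES described in this abstract can be made concrete with a small Python sketch. It does not implement SELFIES or use any SELFIES library, and it is not a full SMILES parser: the function names are hypothetical, and the check covers only balanced branches, matched square brackets, and paired single-digit ring closures. Valence, aromaticity, and two-digit ring closures are ignored.

```python
def passes_basic_smiles_syntax(s):
    """Partial syntactic check for a SMILES string (an illustrative toy).

    Checks balanced parentheses, matched square brackets, and that every
    single-digit ring-closure label outside brackets occurs in pairs.
    """
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in s:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue  # digits inside brackets are isotopes, H counts, charges
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens a ring, second closes it
    return depth == 0 and not open_rings and not in_bracket


def surviving_deletions(smiles):
    """Count single-character deletions that still pass the partial check."""
    variants = [smiles[:i] + smiles[i + 1:] for i in range(len(smiles))]
    passed = sum(passes_basic_smiles_syntax(v) for v in variants)
    return passed, len(variants)


if __name__ == "__main__":
    for smiles in ("c1ccccc1", "CC(C)O"):
        passed, total = surviving_deletions(smiles)
        print(f"{smiles}: {passed} of {total} single deletions still pass")
```

Even this lenient check rejects some one-character edits of a valid string, which hints at why a generative model emitting SMILES token by token can produce many invalid outputs. A representation in which every string decodes to a molecule avoids this failure mode by construction.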

    ORCID: Issues and concerns about its use for academic purposes and research integrity

    ORCID (Open Researcher and Contributor ID) was launched in 2012 as an initiative to fortify the validity and integrity of academic publishing through author name disambiguation. Less than a decade later, this portal is being actively promoted in an attempt to ensure that academics adhere to this permanent identifier. Without a doubt, a complete, up-to-date and authentic ORCID has value, not only to a researcher, but to the academic community because it facilitates online submissions and links to funding agencies and other profiles. The mandatory requirement of an ORCID account for the submitting or corresponding author, sometimes for all authors, is becoming more common during the submission of manuscripts to ORCID member journals. Beyond issues pertaining to academic freedom or unfair treatment of those without an ORCID, there are other highly pertinent, unpalatable, and contentious issues related to ORCID that need greater attention and debate. These include the inconsistent implementation of ORCID among co-authors, the existence of empty or “ghost” ORCID accounts that are uninformative and thus of limited use, and the plausible abuse of ORCIDs to register potentially fake elements. These issues would not only reduce trust in ORCID, which is actively promoted as a tool for maintaining science’s integrity, but may also end up weakening a publishing system that was meant to be fortified by this initiative. They may also hurt the reputation of valid ORCID users who share a platform with “ghost” ORCID accounts or with fake authors, or authors whose identities are unverifiable.

    Graph networks for molecular design

    Deep learning methods applied to chemistry can be used to accelerate the discovery of new molecules. This work introduces GraphINVENT, a platform developed for graph-based molecular design using graph neural networks (GNNs). GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time. All models implemented in GraphINVENT can quickly learn to build molecules resembling the training set molecules without any explicit programming of chemical rules. The models have been benchmarked using the MOSES distribution-based metrics, showing that GraphINVENT models compare well with state-of-the-art generative models. This work compares six different GNN-based generative models in GraphINVENT, and shows that ultimately the gated-graph neural network performs best against the metrics considered here.

    Development of efficient open-source chemical graph generators

    In chemistry, one of the crucial problems has been the structure identification of molecules whose chemical composition is unknown. This research topic has an impact on various fields such as natural product and drug discovery studies. For an efficient and fast identification process, computer-assisted structure elucidation (CASE) toolkits have been developed. These tools utilise spectral data of unknown molecules as the input to determine their structure. The effectiveness of this software primarily depends on how well the structure generators perform. The basic input for these generators is the molecular formula of the unknown molecule, from which they generate its unique list of isomers. In cheminformatics, there have been several software packages for structure generation; in particular, MOLGEN was considered the de facto gold standard in the field due to its speed and efficiency. However, it is a commercial tool, and there was a need for efficient open-source structure generators, in other words, chemical graph generators. To fulfil this need, this PhD study aimed to develop efficient open-source chemical graph generators, and this aim was achieved with the development of two software packages, namely MAYGEN and surge. First, MAYGEN was developed as an alternative to MOLGEN. It was benchmarked against MOLGEN and was only around 3 times slower than MOLGEN. Following MAYGEN, another software package, surge, was developed as an open-source chemical graph generator. It was benchmarked against MOLGEN for randomly chosen natural products' molecular formulae. Based on the results, surge is approximately 100 times faster than MOLGEN, which made it the state of the art in the field.

    Drug Discovery Maps, a Machine Learning Model That Visualizes and Predicts Kinome-Inhibitor Interaction Landscapes

    The interpretation of high-dimensional structure-activity data sets in drug discovery to predict ligand-protein interaction landscapes is a challenging task. Here we present Drug Discovery Maps (DDM), a machine learning model that maps the activity profile of compounds across an entire protein family, as illustrated here for the kinase family. DDM is based on the t-distributed stochastic neighbor embedding (t-SNE) algorithm to generate a visualization of molecular and biological similarity. DDM maps chemical and target space and predicts the activities of novel kinase inhibitors across the kinome. The model was validated using independent data sets and in a prospective experimental setting, where DDM predicted new inhibitors for FMS-like tyrosine kinase 3 (FLT3), a therapeutic target for the treatment of acute myeloid leukemia. Compounds were resynthesized, yielding highly potent, cellularly active FLT3 inhibitors. Biochemical assays confirmed most of the predicted off-targets. DDM is further unique in that it is completely open-source and available as a ready-to-use executable to facilitate broad and easy adoption.