3 research outputs found

    SELFormer: Molecular Representation Learning via SELFIES Language Models

    Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far use SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that uses a 100% valid, compact and expressive notation, SELFIES, as input in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features. Comment: 22 pages, 4 figures, 8 tables

    Evaluation of the Functional Properties of TMCO1 Gene Sequence Variants by In Silico Analyses

    Transmembrane and Coiled-Coil Domains 1 (TMCO1) protein is encoded by the TMCO1 gene, which consists of 7 exons. Previous studies have identified multiple TMCO1 variants in patients with cerebro-facio-thoracic dysplasia (CFTD), and the TMCO1 locus has also been shown to be associated with primary open angle glaucoma (POAG). However, only a limited number of studies report associations of TMCO1 gene sequence variants, and the majority of the findings affirm the pathogenicity of nonsense and frameshift TMCO1 variants and their associations with clinical phenotypes. Thus, the functional properties of single nucleotide variants causing amino acid changes in TMCO1 are yet to be comprehensively elucidated. In this study, we evaluated the effects of amino acid substitutions on protein structure and identified their putative roles in post-translational modifications (PTMs) and in regulatory mechanisms for the TMCO1 protein. We classified 41 missense variants as pathogenic based on the combined scores of common in silico tools (SIFT, MutationTaster2, Polyphen2). Of these 41 variants, four (p.K211Q, p.K105E, p.S235F, p.K237R) were located in PTM sites and regulatory protein binding sites and were therefore proposed as putative functional variants. Moreover, rs1387528611 (p.Lys128Gln) also had strong evidence (RegulomeDB score = 2b) for a possible regulatory function. The results of our in silico analyses highlight the functional importance of missense TMCO1 variants that may contribute to TMCO1-associated disease phenotypes; further in vivo evaluation is needed to uncover their role in human diseases.

    Democratizing knowledge representation with BioCypher

    Standardising the representation of biomedical knowledge among all researchers is an insurmountable task, hindering the effectiveness of many computational methods. To facilitate harmonisation and interoperability despite this fundamental challenge, we propose to standardise the framework of knowledge graph creation instead. We implement this standardisation in BioCypher, a FAIR (findable, accessible, interoperable, reusable) framework to transparently build biomedical knowledge graphs while preserving provenances of the source data. Mapping the knowledge onto biomedical ontologies helps to balance the needs for harmonisation, human and machine readability, and ease of use and accessibility to non-specialist researchers. We demonstrate the usefulness of the framework on a variety of use cases, from maintenance of task-specific knowledge stores, to interoperability between biomedical domains, to on-demand building of task-specific knowledge graphs for federated learning. BioCypher (https://biocypher.org) thus facilitates automating knowledge-based biomedical research, and we encourage the community to further develop and use it.