Deep Generative Model for Sparse Graphs using Text-Based Learning with Augmentation in Generative Examination Networks
Graphs and networks are a key research tool for a variety of science fields,
most notably chemistry, biology, engineering and social sciences. Modeling and
generating graphs with efficient sampling remains a key challenge. In
particular, the non-uniqueness, the high dimensionality of the vertices and the local
dependencies of the edges make the task difficult. We apply our
recently introduced method, Generative Examination Networks (GENs) to create
the first text-based generative graph models using one-line text formats as
graph representation. In our GEN, an RNN-based generative model for a one-line text
format learns autonomously to predict the next available character.
Training is stopped by an examination mechanism that checks the
percentage of valid graphs generated. We achieved moderate to high validity
using dense g6 strings (random 67.8 +/- 0.6, canonical 99.1 +/- 0.2). Based on
these results we have adapted the widely used SMILES representation for
molecules to a new input format, which we call linear graph input (LGI). Apart
from the benefits of a short, compressible text format, a major advantage
is the possibility to randomize and augment the format. The generative
models are evaluated for overall performance and for reconstruction of the
property space. The results show that LGI strings are very well suited for
machine-learning and that augmentation is essential for the performance of the
model in terms of validity, uniqueness and novelty. Lastly, the format can
handle both smaller and larger datasets of graphs, can easily be
adapted by redefining the meaning of the characters used in the LGI string, and
can address sparse graph problems in other fields of science
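The validity, uniqueness and novelty figures mentioned above can be illustrated with a minimal sketch. It assumes the standard graph6 (g6) encoding restricted to graphs with at most 62 vertices, and it uses one common set of metric definitions; the abstract does not state the exact definitions or the parser used by the authors, so the function names `parse_graph6` and `generation_metrics` are hypothetical.

```python
from math import ceil


def parse_graph6(s):
    """Decode a graph6 string for a graph with at most 62 vertices.

    Returns (n, edges); raises ValueError if the string is malformed.
    """
    if not s or any(not 63 <= ord(c) <= 126 for c in s):
        raise ValueError("empty string or character outside 63..126")
    n = ord(s[0]) - 63
    if n > 62:
        raise ValueError("multi-byte vertex counts are not handled in this sketch")
    n_bits = n * (n - 1) // 2  # size of the upper triangle
    body = s[1:]
    if len(body) != ceil(n_bits / 6):
        raise ValueError("wrong length for the declared vertex count")
    bits = []
    for c in body:
        value = ord(c) - 63
        bits.extend((value >> shift) & 1 for shift in range(5, -1, -1))
    if any(bits[n_bits:]):
        raise ValueError("non-zero padding bits")
    # Bits are stored column by column: (0,1), (0,2), (1,2), (0,3), ...
    edges = []
    k = 0
    for j in range(1, n):
        for i in range(j):
            if bits[k]:
                edges.append((i, j))
            k += 1
    return n, edges


def generation_metrics(generated, training):
    """Assumed definitions: validity over all samples, uniqueness over valid
    samples, novelty over unique valid samples absent from the training set."""
    valid = []
    for s in generated:
        try:
            parse_graph6(s)
        except ValueError:
            continue
        valid.append(s)
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


metrics = generation_metrics(["A_", "A_", "Bw", "B", "A!"], training=["A_"])
```

Note that this sketch compares strings, not isomorphism classes, so two different strings encoding the same graph count as distinct unless the strings are canonicalized first.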
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Large Language Models (LLMs), with their remarkable task-handling
capabilities and innovative outputs, have catalyzed significant advancements
across a spectrum of fields. However, their proficiency within specialized
domains such as biomolecular studies remains limited. To address this
challenge, we introduce Mol-Instructions, a meticulously curated, comprehensive
instruction dataset expressly designed for the biomolecular realm.
Mol-Instructions is composed of three pivotal components: molecule-oriented
instructions, protein-oriented instructions, and biomolecular text
instructions, each curated to enhance the understanding and prediction
capabilities of LLMs concerning biomolecular features and behaviors. Through
extensive instruction tuning experiments on the representative LLM, we
underscore the potency of Mol-Instructions to enhance the adaptability and
cognitive acuity of large models within the complex sphere of biomolecular
studies, thereby promoting advancements in the biomolecular research community.
Mol-Instructions is made publicly accessible for future research endeavors and
will be subjected to continual updates for enhanced applicability.
Comment: Project homepage: https://github.com/zjunlp/Mol-Instructions. Add quantitative evaluation
In silico generation of novel ligands for the inhibition of SARS-CoV-2 main protease (3CLpro) using deep learning
The recent emergence of novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causing the coronavirus disease (COVID-19) has become a global public health crisis, and a crucial need exists for rapid identification and development of novel therapeutic interventions. In this study, a recurrent neural network (RNN) is trained and optimized to produce novel ligands that could serve as potential inhibitors to the SARS-CoV-2 viral protease: 3 chymotrypsin-like protease (3CLpro). Structure-based virtual screening was performed through molecular docking, ADMET profiling, and predictions of various molecular properties were done to evaluate the toxicity and drug-likeness of the generated novel ligands. The properties of the generated ligands were also compared with current drugs under various phases of clinical trials to assess the efficacy of the novel ligands. Twenty novel ligands were selected that exhibited good drug-likeness properties, with most ligands conforming to Lipinski’s rule of 5, high binding affinity (highest binding affinity: −9.4 kcal/mol), and promising ADMET profile. Additionally, the generated ligands complexed with 3CLpro were found to be stable based on the results of molecular dynamics simulation studies conducted over a 100 ns period. Overall, the findings offer a promising avenue for the rapid identification and development of effective therapeutic interventions to treat COVID-19
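The drug-likeness filter mentioned above, Lipinski's rule of 5, can be sketched as follows. The thresholds are the commonly cited ones (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). The abstract does not say how descriptors were computed or how many violations were tolerated, so the `Descriptors` type, the `max_violations=1` default and the example values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one ligand (hypothetical container)."""
    mol_weight: float  # in daltons
    logp: float        # octanol-water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors


def lipinski_violations(d):
    """Count how many of the four rule-of-5 criteria are violated."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d, max_violations=1):
    """A common convention tolerates one violation; this is an assumption here."""
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values, not taken from the study.
ligand_a = Descriptors(mol_weight=350.4, logp=2.1, h_donors=2, h_acceptors=5)
ligand_b = Descriptors(mol_weight=620.0, logp=6.3, h_donors=2, h_acceptors=5)
```

In this sketch `ligand_a` has no violations, while `ligand_b` violates both the weight and the logP criteria and is rejected under the one-violation convention.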
MolFM: A Multimodal Molecular Foundation Model
Molecular knowledge resides within three different modalities of information
sources: molecular structures, biomedical documents, and knowledge bases.
Effective incorporation of molecular knowledge from these modalities holds
paramount significance in facilitating biomedical research. However, existing
multimodal molecular foundation models exhibit limitations in capturing
intricate connections between molecular structures and texts, and more
importantly, none of them attempt to leverage a wealth of molecular expertise
derived from knowledge graphs. In this study, we introduce MolFM, a multimodal
molecular foundation model designed to facilitate joint representation learning
from molecular structures, biomedical texts, and knowledge graphs. We propose
cross-modal attention between atoms of molecular structures, neighbors of
molecule entities and semantically related texts to facilitate cross-modal
comprehension. We provide theoretical analysis that our cross-modal
pre-training captures local and global molecular knowledge by minimizing the
distance in the feature space between different modalities of the same
molecule, as well as molecules sharing similar structures or functions. MolFM
achieves state-of-the-art performance on various downstream tasks. On
cross-modal retrieval, MolFM outperforms existing models with 12.13% and 5.04%
absolute gains under the zero-shot and fine-tuning settings, respectively.
Furthermore, qualitative analysis showcases MolFM's implicit ability to provide
grounding from molecular substructures and knowledge graphs. Code and models
are available on https://github.com/BioFM/OpenBioMed.
Comment: 31 pages, 15 figures, and 15 tables
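Cross-modal retrieval of the kind evaluated above is often scored by ranking candidates from one modality against a query from another. The sketch below assumes cosine similarity over embedding vectors and a recall-at-k score in which query i matches candidate i. It is not the MolFM code, the abstract does not name the metric used, and the toy vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_at_k(query_embs, candidate_embs, k=1):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        ranked = sorted(
            range(len(candidate_embs)),
            key=lambda j: cosine(query, candidate_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy 2-D embeddings standing in for structure and text encoders.
structure_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.05]]
```

An absolute gain such as the ones quoted above would then be the difference between two such scores, expressed in percentage points.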
Get Your Atoms in Order: An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm
Finding a canonical
ordering of the atoms in a molecule is a prerequisite
for generating a unique representation of the molecule. The canonicalization
of a molecule is usually accomplished by applying some sort of graph
relaxation algorithm, the most common of which is the Morgan algorithm.
There are known issues with that algorithm that lead to noncanonical
atom orderings as well as problems when it is applied to large molecules
like proteins. Furthermore, each cheminformatics toolkit or software
provides its own version of a canonical ordering, most based on unpublished
algorithms, which also complicates the generation of a universal unique
identifier for molecules. We present an alternative canonicalization
approach that uses a standard stable-sorting algorithm instead of
a Morgan-like index. Two new invariants that allow canonical ordering
of molecules with dependent chirality as well as those with highly
symmetrical cyclic graphs have been developed. The new approach proved
to be robust and fast when tested on the 1.45 million compounds of
the ChEMBL 20 data set in different scenarios like random renumbering
of input atoms or SMILES round tripping. Our new algorithm is able
to generate a canonical order of the atoms of protein molecules within
a few milliseconds. The novel algorithm is implemented in the open-source
cheminformatics toolkit RDKit. With this paper, we provide a reference
Python implementation of the algorithm that could easily be integrated
in any cheminformatics toolkit. This provides a first step toward
a common standard for canonical atom ordering to generate a universal
unique identifier for molecules other than InChI
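The property tested by "random renumbering of input atoms" can be illustrated with a deliberately naive sketch: a canonical form must come out identical however the input atoms are numbered. The code below is a brute-force search over all atom orderings, which is only feasible for tiny graphs. It is not the algorithm described in the paper, which relies on stable sorting of invariants. The function names and the atoms-plus-bond-list representation are assumptions.

```python
import random
from itertools import permutations


def canonical_form(atoms, bonds):
    """Smallest (atom labels, bond list) pair over all renumberings.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Cost grows factorially with the number of atoms.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):  # perm[old_index] = new_index
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in bonds)
        candidate = (tuple(new_atoms), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


def renumber(atoms, bonds, rng):
    """Return the same molecule with randomly shuffled atom indices."""
    order = list(range(len(atoms)))
    rng.shuffle(order)  # order[old_index] = new_index
    new_atoms = [None] * len(atoms)
    for old, new in enumerate(order):
        new_atoms[new] = atoms[old]
    new_bonds = [(order[i], order[j]) for i, j in bonds]
    return new_atoms, new_bonds


# Heavy-atom skeletons of ethanol (C-C-O) and dimethyl ether (C-O-C).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
ether = (["C", "O", "C"], [(0, 1), (1, 2)])

rng = random.Random(0)
shuffled = renumber(*ethanol, rng)
same = canonical_form(*shuffled) == canonical_form(*ethanol)
```

Here `same` holds for any shuffle, while the two isomers yield different canonical forms even though they share the same atoms.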
DEVELOPMENT OF TOOLS FOR ATOM-LEVEL INTERPRETATION OF STABLE ISOTOPE-RESOLVED METABOLOMICS DATASETS
Metabolomics is the global study of small molecules in living systems under a given state, emerging as a new ‘omics’ field in systems biology. It has shown great promise in elucidating biological mechanisms in various areas. Many diseases, especially cancers, are closely linked to reprogrammed metabolism. As the end point of biological processes, metabolic profiles are more representative of the biological phenotype than genomic or proteomic profiles. Therefore, characterizing the metabolic phenotypes of various diseases will help clarify the metabolic mechanisms and promote the development of novel and effective treatment strategies.
Advances in analytical technologies such as nuclear magnetic resonance and mass spectrometry have greatly contributed to the detection and characterization of global metabolites in a biological system. Furthermore, applying these analytical tools to stable isotope-resolved metabolomics (SIRM) experiments can generate large-scale, high-quality metabolomics data that capture isotopic flow through cellular metabolism. However, the lack of corresponding computational analysis tools hinders the characterization of metabolic phenotypes and downstream applications.
Both detailed metabolic modeling and quantitative analysis are required for proper interpretation of these complex metabolomics data. For metabolic modeling, there is currently no comprehensive metabolic network at an atom-resolved level that can be used for deriving context-specific metabolic models for SIRM metabolomics datasets. For quantitative analysis, most available tools conduct metabolic flux analysis based on a well-defined metabolic model, which is hard to achieve for complex biological systems due to the limitations in our knowledge.
Here, we developed a set of methods to address these problems. First, we developed a neighborhood-specific coloring method that can create an identifier for each atom in a specific compound. With the atom identifiers, we successfully harmonized compounds and reactions across the KEGG and MetaCyc databases at various levels. We also evaluated the atom mappings of the harmonized metabolic reactions. These results will contribute to the construction of a comprehensive atom-resolved metabolic network. Moreover, this method can be easily applied to any metabolic database that provides a molfile representation of compounds, which will greatly facilitate future expansion. Second, we developed a moiety modeling framework to deconvolute metabolite isotopologue profiles using moiety models, along with the analysis and selection of the best moiety model(s) based on the experimental data. To our knowledge, this is the first method that can analyze datasets involving multiple isotope tracers. Furthermore, instead of a single predefined metabolic model, this method allows the comparison of multiple metabolic models derived from a given metabolic profile, and we have demonstrated the robust performance of the moiety modeling framework in model selection with a 13C-labeled UDP-GlcNAc isotopologue dataset. We further explored the data quality requirements and the factors that affect model selection. Collectively, these methods and tools help interpret SIRM metabolomics datasets from metabolic modeling to quantitative analysis
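The general idea of deriving atom identifiers from neighborhoods can be sketched as an iterative coloring: each atom starts from its element symbol and repeatedly absorbs the sorted colors of its bonded neighbors. This is a generic refinement scheme offered as an assumption, not the neighborhood-specific coloring method developed in the thesis. The function name, the fixed number of rounds and the truncated SHA-256 digests are all illustrative choices.

```python
import hashlib


def _digest(text):
    """Short, deterministic label for a neighborhood description."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def atom_identifiers(atoms, bonds, rounds=3):
    """Return one identifier per atom, independent of atom numbering.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    n = len(atoms)
    neighbors = [[] for _ in range(n)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    colors = [_digest(symbol) for symbol in atoms]
    for _ in range(rounds):
        # Sorting neighbor colors removes any dependence on input order.
        colors = [
            _digest(colors[i] + "|" + ",".join(sorted(colors[j] for j in neighbors[i])))
            for i in range(n)
        ]
    return colors


# Carbon skeleton of propane, numbered two different ways.
ids_a = atom_identifiers(["C", "C", "C"], [(0, 1), (1, 2)])
ids_b = atom_identifiers(["C", "C", "C"], [(0, 2), (2, 1)])
```

In `ids_a` the two terminal carbons share an identifier and the central carbon differs, and `ids_b` contains the same identifiers in a different order. A scheme this simple cannot separate every pair of non-equivalent atoms, and it ignores bond orders and stereochemistry.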