Pre-training of Molecular GNNs via Conditional Boltzmann Generator
Learning representations of molecular structures using deep learning is a
fundamental problem in molecular property prediction tasks. Molecules
inherently exist in the real world as three-dimensional structures;
furthermore, they are not static but in continuous motion in the 3D Euclidean
space, forming a potential energy surface. Therefore, it is desirable to
generate multiple conformations in advance and extract molecular
representations using a 4D-QSAR model that incorporates multiple conformations.
However, this approach is impractical for drug and material discovery tasks
because of the computational cost of obtaining multiple conformations. To
address this issue, we propose a pre-training method for molecular GNNs using
an existing dataset of molecular conformations to generate a latent vector
universal to multiple conformations from a 2D molecular graph. Our method,
called Boltzmann GNN, is formulated by maximizing the conditional marginal
likelihood of a conditional generative model for conformation generation. We
show that our model achieves better molecular property prediction performance
than existing pre-training methods that use molecular graphs and
three-dimensional molecular structures. Comment: 4 page
Variational Autoencoding Molecular Graphs with Denoising Diffusion Probabilistic Model
In data-driven drug discovery, designing molecular descriptors is a very
important task. Deep generative models such as variational autoencoders (VAEs)
offer a potential solution by designing descriptors as probabilistic latent
vectors derived from molecular structures. These models can be trained on large
datasets, which have only molecular structures, and applied to transfer
learning. Nevertheless, the approximate posterior distribution of the latent
vectors of the usual VAE assumes a simple multivariate Gaussian distribution
with zero covariance, which may limit the performance of representing the
latent features. To overcome this limitation, we propose a novel molecular deep
generative model that incorporates a hierarchical structure into the
probabilistic latent vectors. We achieve this by a denoising diffusion
probabilistic model (DDPM). In experiments on small datasets of physical
properties and activity, we demonstrate that our model can design effective
molecular latent vectors for molecular property prediction. The results
highlight the superior prediction performance and robustness of our model
compared to existing approaches. Comment: 2 pages. Short paper submitted to IEEE CIBCB 202
AMDORAP: Non-targeted metabolic profiling based on high-resolution LC-MS
BACKGROUND: Liquid chromatography-mass spectrometry (LC-MS) utilizing the high-resolution power of an orbitrap is an important analytical technique for both metabolomics and proteomics. The most important feature of the orbitrap is its excellent mass accuracy. Thus, it is necessary to convert raw data to accurate and reliable m/z values for metabolic fingerprinting by high-resolution LC-MS. RESULTS: In the present study, we developed a novel, easy-to-use and straightforward m/z detection method, AMDORAP. To assess its performance, we used real biological samples, Bacillus subtilis strains 168 and MGB874, in the positive mode by LC-orbitrap. For 14 compounds identified by measuring the authentic compounds, we compared the obtained m/z values with those from other LC-MS processing tools. The errors by AMDORAP were distributed within ±3 ppm and showed the best performance in m/z value accuracy. CONCLUSIONS: Our method can detect m/z values of biological samples much more accurately than other LC-MS analysis tools. AMDORAP allows us to address the relationships between biological effects and cellular metabolites based on accurate m/z values. Obtaining accurate m/z values from raw data should be indispensable as a starting point for comparative LC-orbitrap analysis. AMDORAP is freely available under an open-source license at http://amdorap.sourceforge.net/.
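The ±3 ppm figure above refers to relative mass error. As a minimal sketch of how such an error is usually computed (this is the standard ppm definition, not AMDORAP's own code; the function names and the example m/z values are hypothetical):

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error in parts per million (standard definition)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol_ppm: float = 3.0) -> bool:
    """True if the absolute ppm error does not exceed tol_ppm.

    The 3.0 default mirrors the +/-3 ppm range quoted in the abstract;
    it is used here only as an illustrative threshold.
    """
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


# Hypothetical values: an ion expected at m/z 200.0000 observed at 200.0004.
example_error = ppm_error(200.0004, 200.0000)
```

At m/z 200, a deviation of 0.0004 corresponds to roughly 2 ppm, so the example would fall inside a 3 ppm window.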
A novel bioinformatics tool for phylogenetic classification of genomic sequence fragments derived from mixed genomes of uncultured environmental microbes
A Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional complex data on a two-dimensional map. We modified the conventional SOM for genome informatics, making the learning process and resulting map independent of the order of data input, and developed a novel bioinformatics tool for phylogenetic classification of sequence fragments obtained from pooled genome samples of microorganisms in environmental samples. The tool allows visualization of microbial diversity and the relative abundance of microorganisms on a map. First, we constructed SOMs of tri- and tetranucleotide frequencies from a total of 3.3 Gb of sequences derived from 113 prokaryotic and 13 eukaryotic genomes for which complete genome sequences are available. SOMs classified the 330,000 10-kb sequences from these genomes mainly according to species, without any information on the species. Importantly, classification was possible without orthologous sequence sets and thus was useful for studies of novel sequences from poorly characterized species, such as those living only under extreme conditions, which have attracted wide scientific and industrial attention. Using the SOM method, sequences that were derived from a single genome but cloned independently in a metagenome library could be reassociated in silico. The usefulness of SOMs in metagenome studies is also discussed.
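The SOM input described above is a vector of oligonucleotide frequencies per sequence fragment. A minimal sketch of building such a vector follows. It assumes plain overlapping-window counting on one strand and skips windows containing characters other than A, C, G, T; the original tool's exact counting and normalization may differ, and `kmer_frequencies` is a hypothetical name.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 4) -> list[float]:
    """Relative frequencies of all 4**k oligonucleotides in seq.

    The vector is ordered lexicographically over the alphabet ACGT.
    Windows containing other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy fragment; a real input would be a 10-kb sequence.
vector = kmer_frequencies("ACGTACGT", k=4)
```

With k=4 each fragment becomes a 256-dimensional vector, and with k=3 a 64-dimensional one, which is the kind of high-dimensional input a SOM then projects onto a two-dimensional map.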
Prediction of Biological Activities of Volatile Metabolites Using Molecular Fingerprints and Machine Learning Methods
Volatile metabolites are small molecules that form a chemically diverse group with various biological activities and have high vapor pressures under ambient conditions. It is crucial to determine the biological activities of volatile metabolites, as they play important roles in chemical ecology and human healthcare. In this study, we accumulated 341 volatiles emitted by biological species and associated with 11 types of biological activities, and deposited the data into our database, the KNApSAcK Metabolite Ecology Database. Using this dataset, we developed 72 classification models to predict the biological activities of volatile metabolites using various machine learning methods. Eight types of molecular fingerprints were used to represent the molecules: PubChem (881 bits), CDK (1024 bits), Extended CDK (1024 bits), MACCS (166 bits), Klekota-Roth (4860 bits), Substructure (307 bits), Estate (79 bits), and atom pairs (780 bits). A new type of fingerprint was also proposed by combining all features of these eight fingerprints (Combine, 9121 bits). The best classification model was obtained with our proposed fingerprint (Combine, 9121 bits) trained with the gradient boosting method (GBM), with a predictive accuracy of 94.43%. The results indicate that molecular fingerprints and machine learning methods could be useful for predicting the biological activities of volatile metabolites.
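The bit counts listed above sum to 9121, which is consistent with the combined fingerprint being a simple concatenation of the eight bit vectors. The sketch below illustrates that reading. Concatenation, and the order in which the fingerprints are joined, are assumptions rather than details given in the abstract, and `combine_fingerprints` is a hypothetical helper.

```python
# Bit lengths as listed in the abstract.
FINGERPRINT_BITS = {
    "PubChem": 881,
    "CDK": 1024,
    "Extended CDK": 1024,
    "MACCS": 166,
    "Klekota-Roth": 4860,
    "Substructure": 307,
    "Estate": 79,
    "Atom pairs": 780,
}


def combine_fingerprints(fingerprints: dict[str, list[int]]) -> list[int]:
    """Concatenate the eight bit vectors into one combined vector.

    Assumes each fingerprint has already been computed elsewhere and is
    supplied as a list of 0/1 values of the expected length.
    """
    combined: list[int] = []
    for name, n_bits in FINGERPRINT_BITS.items():
        bits = fingerprints[name]
        if len(bits) != n_bits:
            raise ValueError(f"{name}: expected {n_bits} bits, got {len(bits)}")
        combined.extend(bits)
    return combined


# Placeholder all-zero fingerprints, used only to show the resulting length.
dummy = {name: [0] * n for name, n in FINGERPRINT_BITS.items()}
combined_length = len(combine_fingerprints(dummy))
```

The combined vector can then be passed to any classifier in the same way as a single fingerprint.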
Characterization of Genetic Signal Sequences with Batch-Learning SOM
Kohonen's self-organizing map (SOM), an unsupervised clustering algorithm, is an effective tool for clustering and visualizing high-dimensional complex data on a single map. We previously modified the conventional SOM for genome informatics on the basis of Batch-Learning SOM (BL-SOM), making the learning process and resulting map independent of the order of data input. We generated BL-SOMs for tetra- and pentanucleotide frequencies in 300,000 10-kb sequences from 13 eukaryotes for which almost complete genomic sequences are available. BL-SOM recognized species-specific characteristics of oligonucleotide frequencies in most 10-kb sequences, permitting species-specific classification of sequences without any information regarding the species. We next constructed BL-SOMs with tetra- and pentanucleotide frequencies in 37,086 full-length mouse cDNA sequences. With BL-SOM, we also analyzed occurrence patterns of the oligonucleotides that are thought to be involved in transcriptional regulation in the human genome.
Development and implementation of an algorithm for detection of protein complexes in large interaction networks
BACKGROUND: After the complete sequencing of a number of genomes, the focus has now turned to proteomics. Advanced proteomics technologies such as two-hybrid assays and mass spectrometry are producing huge data sets of protein-protein interactions, which can be portrayed as networks, and one of the pressing issues is to find protein complexes in such networks. The enormous size of protein-protein interaction (PPI) networks warrants the development of efficient computational methods for extracting significant complexes. RESULTS: This paper presents an algorithm for detection of protein complexes in large interaction networks. In a PPI network, a node represents a protein and an edge represents an interaction. The input to the algorithm is the associated matrix of an interaction network, and the outputs are protein complexes. The complexes are determined by finding clusters, i.e., the densely connected regions in the network. We also show and analyze some protein complexes generated by the proposed algorithm from typical PPI networks of Escherichia coli and Saccharomyces cerevisiae. A comparison between a PPI network and a random network is also performed in the context of the proposed algorithm. CONCLUSION: The proposed algorithm makes it possible to detect clusters of proteins in PPI networks, which mostly represent molecular biological functional units. Therefore, protein complexes determined solely from interaction data can help us predict the functions of proteins, and they are also useful for understanding and explaining certain biological processes.
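The abstract characterizes clusters as densely connected regions of the network but does not give the algorithm itself. As a rough illustration of what "densely connected" can mean, the sketch below computes the standard graph density of a node subset from a 0/1 matrix. This is a generic measure, not the paper's procedure, and the toy matrix and function name are hypothetical.

```python
def subgraph_density(adj: list[list[int]], nodes: list[int]) -> float:
    """Density of the subgraph induced by nodes: 2|E| / (n(n-1)).

    adj is a symmetric 0/1 matrix for an undirected network without
    self-loops. Returns 0.0 for fewer than two nodes.
    """
    n = len(nodes)
    if n < 2:
        return 0.0
    # Count each undirected edge once by only looking at later nodes.
    edges = sum(
        1
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if adj[u][v]
    )
    return 2 * edges / (n * (n - 1))


# Toy network: proteins 0, 1, 2 interact pairwise; protein 3 binds only 2.
toy_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
triangle_density = subgraph_density(toy_adj, [0, 1, 2])
whole_density = subgraph_density(toy_adj, [0, 1, 2, 3])
```

In the toy network the triangle is fully connected, while adding the loosely attached fourth protein lowers the density, which is the intuition behind treating dense regions as candidate complexes.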
Predicting state transitions in the transcriptome and metabolome using a linear dynamical system model
BACKGROUND: Modelling of time series data should not be an approximation of input data profiles, but rather be able to detect and evaluate dynamical changes in the time series data. Objective criteria that can be used to evaluate dynamical changes in data are therefore important to filter experimental noise and to enable extraction of unexpected, biologically important information. RESULTS: Here we demonstrate the effectiveness of a Markov model, named the Linear Dynamical System, to simulate the dynamics of a transcript or metabolite time series, and propose a probabilistic index that enables detection of time-sensitive changes. This method was applied to time series datasets from Bacillus subtilis and Arabidopsis thaliana grown under stress conditions; in the former, only gene expression was studied, whereas in the latter, both gene expression and metabolite accumulation. Our method not only identified well-known changes in gene expression and metabolite accumulation, but also detected novel changes that are likely to be responsible for each stress response condition. CONCLUSION: This general approach can be applied to any time-series data profile from which one wishes to identify elements responsible for state transitions, such as rapid environmental adaptation by an organism.
Mass Spectra-Based Framework for Automated Structural Elucidation of Metabolome Data to Explore Phytochemical Diversity
A novel framework for automated elucidation of metabolite structures in liquid chromatography–mass spectrometry metabolome data was constructed by integrating databases. High-resolution tandem mass spectra automatically acquired from each metabolite signal were used for database searches. Three distinct databases, KNApSAcK, ReSpect, and the PRIMe standard compound database, were employed for the structural elucidation. The outputs were retrieved using the CAS metabolite identifier for identification and putative annotation. A simple metabolite ontology system was also introduced to attain putative characterization of the metabolite signals. The automated method was applied to the metabolome data sets obtained from the rosette leaves of 20 Arabidopsis accessions. Phenotypic variations in novel Arabidopsis metabolites among these accessions could be investigated using this method.
MODELLING INGREDIENT OF JAMU TO PREDICT ITS EFFICACY
Jamu is an Indonesian herbal medicine made from a mixture of several plants. Nowadays, many jamu are produced commercially by many industries in Indonesia, and each producer may have its own jamu formula. However, one thing is certain: the efficacy of jamu is determined by the composition of the plants used. Thus, it is of interest to model the ingredients of jamu, which consist of plants, and to use the model to predict the efficacy of jamu. In this analysis, Partial Least Squares Discriminant Analysis (PLSDA) is used to model jamu ingredients and predict efficacy. We find that using the PLSDA prediction y_ij directly, rather than first using it to calculate the probability that jamu i belongs to efficacy j and then predicting efficacy from that probability, yields a lower False Positive Rate (FPR) in predicting the efficacy group. Keywords: Jamu, PLSD
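The comparison above turns on two things: how an efficacy group is chosen from the predicted values y_ij, and how the false positive rate is measured. The sketch below shows one plausible reading, in which jamu i is assigned to the group j with the largest predicted y_ij and FPR is computed per group as FP / (FP + TN). The argmax rule is an assumption, since the abstract does not spell out the decision rule, and the numbers are made up for illustration.

```python
def predict_group(y_hat_row: list[float]) -> int:
    """Index j of the largest predicted value y_ij (assumed decision rule)."""
    return max(range(len(y_hat_row)), key=lambda j: y_hat_row[j])


def false_positive_rate(true_groups: list[int],
                        predicted_groups: list[int],
                        group: int) -> float:
    """One-vs-rest false positive rate for a group: FP / (FP + TN)."""
    pairs = list(zip(true_groups, predicted_groups))
    fp = sum(1 for t, p in pairs if t != group and p == group)
    tn = sum(1 for t, p in pairs if t != group and p != group)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical predictions for four jamu and three efficacy groups.
y_hat = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
true_groups = [0, 1, 1, 2]
predicted = [predict_group(row) for row in y_hat]
fpr_group0 = false_positive_rate(true_groups, predicted, group=0)
```

Here the third jamu is wrongly assigned to group 0, so one of the three jamu that do not belong to group 0 is a false positive.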