Search CORE

192 research outputs found

Pre-training of Molecular GNNs via Conditional Boltzmann Generator

Authors: Kanaya Shigehiko; Koge Daiki; Ono Naoaki
Publication date: 18/01/2024

Learning representations of molecular structures using deep learning is a fundamental problem in molecular property prediction tasks. Molecules inherently exist in the real world as three-dimensional structures; furthermore, they are not static but in continuous motion in 3D Euclidean space, forming a potential energy surface. Therefore, it is desirable to generate multiple conformations in advance and extract molecular representations using a 4D-QSAR model that incorporates multiple conformations. However, this approach is impractical for drug and material discovery tasks because of the computational cost of obtaining multiple conformations. To address this issue, we propose a pre-training method for molecular GNNs that uses an existing dataset of molecular conformations to generate, from a 2D molecular graph, a latent vector universal to multiple conformations. Our method, called Boltzmann GNN, is formulated by maximizing the conditional marginal likelihood of a conditional generative model for conformation generation. We show that our model achieves better prediction performance for molecular properties than existing pre-training methods using molecular graphs and three-dimensional molecular structures.

Comment: 4 pages

arXiv.org e-Print Archive

Variational Autoencoding Molecular Graphs with Denoising Diffusion Probabilistic Model

Authors: Kanaya Shigehiko; Koge Daiki; Ono Naoaki
Publication date: 22/08/2023

In data-driven drug discovery, designing molecular descriptors is a very important task. Deep generative models such as variational autoencoders (VAEs) offer a potential solution by designing descriptors as probabilistic latent vectors derived from molecular structures. These models can be trained on large datasets that contain only molecular structures and then applied to transfer learning. Nevertheless, the approximate posterior distribution of the latent vectors in the usual VAE is assumed to be a simple multivariate Gaussian distribution with zero covariance, which may limit how well the latent features are represented. To overcome this limitation, we propose a novel molecular deep generative model that incorporates a hierarchical structure into the probabilistic latent vectors. We achieve this with a denoising diffusion probabilistic model (DDPM). In experiments on small datasets of physical properties and activity, we demonstrate that our model can design effective molecular latent vectors for molecular property prediction. The results highlight the superior prediction performance and robustness of our model compared to existing approaches.

Comment: 2 pages. Short paper submitted to IEEE CIBCB 202

arXiv.org e-Print Archive

AMDORAP: Non-targeted metabolic profiling based on high-resolution LC-MS

Authors: Kanaya Shigehiko; Morimoto Takuya; Ogasawara Naotake; Takahashi Hiroki
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Liquid chromatography-mass spectrometry (LC-MS) utilizing the high-resolution power of an orbitrap is an important analytical technique for both metabolomics and proteomics. The most important feature of the orbitrap is its excellent mass accuracy. Thus, it is necessary to convert raw data to accurate and reliable m/z values for metabolic fingerprinting by high-resolution LC-MS. Results: In the present study, we developed a novel, easy-to-use and straightforward m/z detection method, AMDORAP. To assess its performance, we used real biological samples, Bacillus subtilis strains 168 and MGB874, measured in the positive mode by LC-orbitrap. For 14 compounds identified by measuring the authentic compounds, we compared the m/z values obtained with those from other LC-MS processing tools. The errors by AMDORAP were distributed within ±3 ppm and showed the best performance in m/z value accuracy. Conclusions: Our method can detect m/z values of biological samples much more accurately than other LC-MS analysis tools. AMDORAP allows us to address the relationships between biological effects and cellular metabolites based on accurate m/z values. Obtaining accurate m/z values from raw data should be an indispensable starting point for comparative LC-orbitrap analysis. AMDORAP is freely available under an open-source license at http://amdorap.sourceforge.net/.
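The ±3 ppm figure in this abstract is a relative mass error. As a minimal sketch, assuming the standard definition (observed minus theoretical m/z, divided by theoretical m/z, times 10^6), it could be computed as follows. This is not AMDORAP's code; the function names and the 3 ppm default are illustrative assumptions.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Relative mass error in parts per million (standard definition)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol_ppm: float = 3.0) -> bool:
    """True if the absolute ppm error does not exceed tol_ppm.

    The 3 ppm default mirrors the range quoted in the abstract; it is
    an illustrative choice, not a parameter of the published tool.
    """
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm
```

For a theoretical m/z of 500, an offset of 0.0005 corresponds to about 1 ppm, so a ±3 ppm window at that mass is roughly ±0.0015 m/z units.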

Crossref

Directory of Open Access Journals

PubMed Central

A novel bioinformatics tool for phylogenetic classification of genomic sequence fragments derived from mixed genomes of uncultured environmental microbes

Authors: Abe Takashi; Ikemura Toshimichi; Kanaya Shigehiko; Sugawara Hideaki
Publication venue: Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, and The Graduate University for Advanced Studies (Sokendai)/Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, and The Graduate University for Advanced Studies (Sokendai)/Department of Bioinformatics and Genomes, Graduate School of Information Science, Nara Institute of Science and Technology/The Graduate University for Advanced Studies (Sokendai), Hayama Center for Advanced Research
Publication date: 01/12/2006

A Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional complex data on a two-dimensional map. We modified the conventional SOM for genome informatics, making the learning process and resulting map independent of the order of data input, and developed a novel bioinformatics tool for phylogenetic classification of sequence fragments obtained from pooled genome samples of microorganisms in environmental samples, allowing visualization of microbial diversity and the relative abundance of microorganisms on a map. First, we constructed SOMs of tri- and tetranucleotide frequencies from a total of 3.3 Gb of sequences derived from 113 prokaryotic and 13 eukaryotic genomes for which complete genome sequences are available. SOMs classified the 330,000 10-kb sequences from these genomes mainly according to species, without any information on the species. Importantly, classification was possible without orthologous sequence sets and thus was useful for studies of novel sequences from poorly characterized species, such as those that live only under extreme conditions and have attracted wide scientific and industrial attention. Using the SOM method, sequences that were derived from a single genome but cloned independently in a metagenome library could be reassociated in silico. The usefulness of SOMs in metagenome studies was also discussed.
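The SOM input described in this abstract is a vector of oligonucleotide frequencies per 10-kb fragment. The following is a minimal sketch of how such vectors might be built. It assumes overlapping k-mers on a single strand and skips any k-mer containing a character other than A, C, G or T; the abstract does not specify these details, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Relative frequency of every A/C/G/T k-mer in seq (overlapping windows)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only the 4**k canonical k-mers are counted, so windows with N are ignored.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def windows(seq: str, size: int = 10_000) -> list:
    """Split seq into non-overlapping fragments of exactly `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]
```

With k = 4 each fragment becomes a 256-dimensional vector (64-dimensional for k = 3), which is the kind of high-dimensional input a SOM then maps onto two dimensions.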

National Institute of Polar Research Repository

Prediction of Biological Activities of Volatile Metabolites Using Molecular Fingerprints and Machine Learning Methods

Authors: Abdullah Azian Azamimi; Kanaya Shigehiko
Publication venue: Journal of Telecommunication, Electronic and Computer Engineering (JTEC)
Publication date: 19/03/2018

Volatile metabolites are small molecules that comprise a diverse chemical group with various biological activities and have high vapor pressures under ambient conditions. It is crucial to determine the biological activities of volatile metabolites, as they play important roles in chemical ecology and human healthcare. In this study, we accumulated 341 volatiles emitted by biological species and associated with 11 types of biological activities, and deposited the data in our database, the KNApSAcK Metabolite Ecology Database. Using this dataset, we developed 72 classification models to predict the biological activities of volatile metabolites with various machine learning methods. Eight types of molecular fingerprints were used to represent the molecules: PubChem (881 bits), CDK (1024 bits), Extended CDK (1024 bits), MACCS (166 bits), Klekota-Roth (4860 bits), Substructure (307 bits), Estate (79 bits), and atom pairs (780 bits). A new type of fingerprint was also proposed by combining all features of these eight fingerprints (Combine, 9121 bits). The best classification model used our proposed fingerprint (Combine, 9121 bits) trained with the gradient boosting method (GBM), with a predictive accuracy of 94.43%. The results indicate that molecular fingerprints and machine learning methods could be useful for predicting the biological activities of volatile metabolites.
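The 9121-bit length of the Combine fingerprint equals the sum of the eight individual lengths listed in this abstract. A minimal sketch of that arithmetic follows, assuming the combination is a plain concatenation in the listed order; the abstract says only that all features were combined, so the ordering and the `combine` helper are assumptions.

```python
# Bit lengths as stated in the abstract.
FINGERPRINT_BITS = {
    "PubChem": 881,
    "CDK": 1024,
    "Extended CDK": 1024,
    "MACCS": 166,
    "Klekota-Roth": 4860,
    "Substructure": 307,
    "Estate": 79,
    "Atom pairs": 780,
}


def combine(fingerprints: dict) -> list:
    """Concatenate per-type bit vectors in the fixed order of FINGERPRINT_BITS."""
    combined = []
    for name, n_bits in FINGERPRINT_BITS.items():
        bits = fingerprints[name]
        if len(bits) != n_bits:
            raise ValueError(f"{name}: expected {n_bits} bits, got {len(bits)}")
        combined.extend(bits)
    return combined


# The individual lengths add up to the combined length quoted in the abstract.
TOTAL_BITS = sum(FINGERPRINT_BITS.values())
```

A fixed ordering matters in practice because a classifier trained on the combined vector expects each bit position to mean the same thing for every molecule.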

Universiti Teknikal Malaysia Melaka: UTeM Open Journal System

Characterization of Genetic Signal Sequences with Batch-Learning SOM

Authors: Abe Takashi; Ikeda Shun; Ikemura Toshimichi; Kanaya Shigehiko; Wada Kennosuke
Publication venue: Technische Fakultät, Arbeitsgruppen der Informatik
Publication date: 31/12/2007

Kohonen's SOM, an unsupervised clustering algorithm, is an effective tool for clustering and visualizing high-dimensional complex data on a single map. We previously modified the conventional SOM for genome informatics, making the learning process and resulting map independent of the order of data input, on the basis of Batch Learning SOM (BL-SOM). We generated BL-SOMs for tetra- and pentanucleotide frequencies in 300,000 10-kb sequences from 13 eukaryotes for which almost complete genomic sequences are available. BL-SOM recognized species-specific characteristics of oligonucleotide frequencies in most 10-kb sequences, permitting species-specific classification of sequences without any information regarding the species. We next constructed BL-SOMs with tetra- and pentanucleotide frequencies in 37,086 full-length mouse cDNA sequences. With BL-SOM, we also analyzed occurrence patterns of the oligonucleotides that are thought to be involved in transcriptional regulation in the human genome.

BieColl - Bielefeld Electronic Collections

BieColl - Bielefeld eCollections

Development and implementation of an algorithm for detection of protein complexes in large interaction networks

Authors: Altaf-Ul-Amin Md; Kanaya Shigehiko; Kurokawa Ken; Mihara Kenji; Shinbo Yoko
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: After complete sequencing of a number of genomes, the focus has now turned to proteomics. Advanced proteomics technologies such as two-hybrid assays and mass spectrometry are producing huge data sets of protein-protein interactions, which can be portrayed as networks, and one of the burning issues is to find protein complexes in such networks. The enormous size of protein-protein interaction (PPI) networks warrants the development of efficient computational methods for extracting significant complexes. RESULTS: This paper presents an algorithm for detection of protein complexes in large interaction networks. In a PPI network, a node represents a protein and an edge represents an interaction. The input to the algorithm is the associated matrix of an interaction network and the outputs are protein complexes. The complexes are determined by finding clusters, i.e., the densely connected regions in the network. We also show and analyze some protein complexes generated by the proposed algorithm from typical PPI networks of Escherichia coli and Saccharomyces cerevisiae. A comparison between a PPI network and a random network is also performed in the context of the proposed algorithm. CONCLUSION: The proposed algorithm makes it possible to detect clusters of proteins in PPI networks, which mostly represent molecular biological functional units. Therefore, protein complexes determined solely from interaction data can help us predict the functions of proteins, and they are also useful for understanding and explaining certain biological processes.
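This abstract characterizes complexes as densely connected regions of the network. A minimal sketch of one common way to score such a region is given below: edge density computed from a symmetric 0/1 adjacency matrix. It illustrates the notion of density only and is not the paper's detection algorithm; reading the "associated matrix" as an adjacency matrix is an assumption, and the function name is hypothetical.

```python
def cluster_density(adj, nodes) -> float:
    """Fraction of possible edges present among `nodes`.

    adj   : symmetric 0/1 adjacency matrix (list of lists), no self-loops
    nodes : iterable of node indices forming the candidate cluster
    """
    nodes = list(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Count each undirected edge once by visiting unordered pairs.
    edges = sum(
        1
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if adj[u][v]
    )
    return 2 * edges / (n * (n - 1))
```

A fully connected group scores 1.0, and adding a loosely attached protein lowers the score, which is why density is a natural criterion for deciding where a cluster ends.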

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Predicting state transitions in the transcriptome and metabolome using a linear dynamical system model

Authors: Hirai Masami Y; Kanaya Shigehiko; Morioka Ryoko; Ogasawara Naotake; Saito Kazuki; Yano Mitsuru
Publication venue: BioMed Central
Publication date: 01/09/2007

Background: Modelling of time series data should not merely approximate input data profiles, but should be able to detect and evaluate dynamical changes in the time series data. Objective criteria that can be used to evaluate dynamical changes in data are therefore important for filtering experimental noise and enabling extraction of unexpected, biologically important information. Results: Here we demonstrate the effectiveness of a Markov model, named the Linear Dynamical System, in simulating the dynamics of a transcript or metabolite time series, and propose a probabilistic index that enables detection of time-sensitive changes. This method was applied to time series datasets from Bacillus subtilis and Arabidopsis thaliana grown under stress conditions; in the former, only gene expression was studied, whereas in the latter, both gene expression and metabolite accumulation were studied. Our method not only identified well-known changes in gene expression and metabolite accumulation, but also detected novel changes that are likely to be responsible for each stress response condition. Conclusion: This general approach can be applied to any time-series data profile from which one wishes to identify elements responsible for state transitions, such as rapid environmental adaptation by an organism.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Mass Spectra-Based Framework for Automated Structural Elucidation of Metabolome Data to Explore Phytochemical Diversity

Authors: Hirai Masami Y.; Kanaya Shigehiko; Matsuda Fumio; Nakabayashi Ryo; Saito Kazuki; Sawada Yuji; Suzuki Makoto
Publication venue: Frontiers Research Foundation
Publication date: 01/01/2011

A novel framework for automated elucidation of metabolite structures in liquid chromatography–mass spectrometry metabolome data was constructed by integrating databases. High-resolution tandem mass spectra automatically acquired from each metabolite signal were used for database searches. Three distinct databases, KNApSAcK, ReSpect, and the PRIMe standard compound database, were employed for the structural elucidation. The outputs were retrieved using the CAS metabolite identifier for identification and putative annotation. A simple metabolite ontology system was also introduced to attain putative characterization of the metabolite signals. The automated method was applied to the metabolome data sets obtained from the rosette leaves of 20 Arabidopsis accessions. Phenotypic variations in novel Arabidopsis metabolites among these accessions could be investigated using this method.

Crossref

PubMed Central

Frontiers - Publisher Connector

MODELLING INGREDIENT OF JAMU TO PREDICT ITS EFFICACY

Authors: Md. Altaf-Ul-Amin; Sulistiyani; Afendi Farit Mochamad; Hirai Aki; Kanaya Shigehiko; Nakamura Kensuke; Takahashi Hiroki
Publication venue: FORUM STATISTIKA DAN KOMPUTASI
Publication date: 01/10/2010

Jamu is an Indonesian herbal medicine made from a mixture of several plants. Nowadays, many jamu are produced commercially by many industries in Indonesia, and each producer may have its own jamu formula. However, one thing is certain: the efficacy of jamu is determined by the composition of the plants used. Thus, it is interesting to model the ingredients of jamu, which consist of plants, and use the model to predict the efficacy of jamu. In this analysis, Partial Least Squares Discriminant Analysis (PLSDA) is used to model jamu ingredients and predict efficacy. We find that using the prediction y_ij obtained from PLSDA directly, rather than using it to calculate the probability that jamu i belongs to efficacy j and then predicting efficacy from that probability, produces a lower False Positive Rate (FPR) in predicting the efficacy group. Keywords: Jamu, PLSD

FORUM STATISTIKA DAN KOMPUTASI

Scientific Journals of Bogor Agricultural University