3 research outputs found

De novo Molecular Design using Deep Learning

Author: Hoang Nguyen (8514915)
Publication date: 23/01/2024

The growth of data science, computer science, and artificial intelligence has transformed traditional drug discovery, and the information era has opened opportunities across many research fields. The introduction of computer-aided stages (e.g., molecule generation, property prediction, and virtual screening) into the drug discovery pipeline has substantially improved the success rate of finding promising molecules. Despite these initial accomplishments, computer-aided drug discovery still needs significant improvement. Among its well-known topics, de novo molecular design attracts a large number of researchers. De novo molecular design aims to uncover novel molecules in the vast chemical space, which has not been fully explored. Although various deep learning architectures have been proposed for molecule generation, each approach has limitations that need to be addressed. Additionally, since molecule generation is a random and undirected process, finding drug candidates with desired properties among billions of molecules is almost infeasible. To tackle this problem, several optimization techniques have been used to direct the generative model to produce 'molecules of interest'. However, the property-optimized process restricts the 'creativity' of the generative model. Furthermore, not every desired property can be optimized because of insufficient data, and optimization-driven generation is computationally expensive. In such cases, Quantitative Structure-Activity Relationship (QSAR) models are an alternative for identifying molecules with desired properties. The overall goal of this thesis is to develop a generative model and a series of QSAR models for drug discovery. The generative model is used to produce novel molecules, while the QSAR models are used to virtually filter for molecules with desired properties. 
To achieve this goal, a range of computational techniques and interdisciplinary knowledge are employed in this thesis. First, we conducted a critical review of existing molecular representations, generative models, and property prediction models. The review provides readers with a fundamental understanding of de novo molecular design. It analyzes the pros and cons of each molecular representation and summarizes the current progress and challenges of molecular generation and property prediction tasks. Second, we investigated a novel deep learning architecture for de novo molecular design. The architecture is designed to process graph-structured data. The generative model developed using the proposed architecture can produce hypothetical molecules with high novelty and diversity. Experimental results indicated that our generative model can create drug-like molecules varying in size, scaffold, and properties. Third, we proposed two novel deep learning architectures for molecular property prediction. These two architectures, the Residual Graph Attention (ResGAT) Network and the Graph Convolution-Attention Network (GCoAtNet), are designed to process graph-structured data. Our findings demonstrated that ResGAT achieved competitive performance while GCoAtNet achieved higher performance compared to state-of-the-art architectures. Our models were benchmarked against these state-of-the-art models on nine molecular datasets. Finally, we used these proposed architectures to construct a generative model and two QSAR models. The generative model was driven to produce a large number of hypothetical molecules. Subsequently, these molecules were virtually screened to eliminate those with drug-induced liver injury (property 1) and Cytochrome-P450-inhibitory (property 2) activities. For each property, we developed two QSAR models that can independently identify molecules with desired properties. 
The intersection set of molecules suggested by these two models was considered a short list of potential drug candidates. These shortlisted molecules can be sent to the chemistry lab for further investigation, i.e., structural optimization and modification, synthesis, and evaluation. The results demonstrated that these computer-designed molecules are synthesizable and suitable for further research.
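
The two-model screening step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Predictor` type, the 0.5 cutoff, the function names, and the toy scores are all assumptions made for illustration, and real QSAR models would replace the lookup tables.

```python
from typing import Callable, Iterable, Set

# Hypothetical predictor type (an assumption, not from the thesis): maps a
# SMILES string to the predicted probability of an undesired activity.
Predictor = Callable[[str], float]


def passed_by(model: Predictor, smiles: Iterable[str], cutoff: float = 0.5) -> Set[str]:
    """Return the molecules that one model predicts to be free of the activity."""
    return {s for s in smiles if model(s) < cutoff}


def shortlist(smiles: Iterable[str], model_a: Predictor, model_b: Predictor,
              cutoff: float = 0.5) -> Set[str]:
    """Keep only the molecules that both models independently accept."""
    pool = list(smiles)  # materialize once so both models see the same molecules
    return passed_by(model_a, pool, cutoff) & passed_by(model_b, pool, cutoff)


# Toy stand-ins for two trained QSAR models (made-up scores).
scores_a = {"CCO": 0.1, "c1ccccc1": 0.7, "CC(=O)O": 0.2}
scores_b = {"CCO": 0.3, "c1ccccc1": 0.2, "CC(=O)O": 0.8}

candidates = shortlist(scores_a, lambda s: scores_a[s], lambda s: scores_b[s])
print(sorted(candidates))
```

Taking the intersection rather than the union trades recall for precision: a molecule reaches the short list only when neither model flags it.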

Victoria University of Wellington

FigShare

i4mC-GRU: Identifying DNA N4-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features

Author: Binh Nguyen (8511558)
Hoang Nguyen (8514915)
PU Nguyen-Hoang (16973088)
QH Trinh (16973085)
S Rahardja (10455851)
TH Nguyen-Vo (10455830)
Publication date: 01/01/2023

N4-methylcytosine (4mC) is one of the most common DNA methylation modifications found in both prokaryotic and eukaryotic genomes. Since 4mC has various essential biological roles, determining its location helps reveal unexplored physiological and pathological pathways. In this study, we propose an effective computational method called i4mC-GRU, which uses a gated recurrent unit and duplet sequence-embedded features to predict potential 4mC sites in mouse (Mus musculus) genomes. To fairly assess the performance of the model, we compared our method with several state-of-the-art methods using two different benchmark datasets. Our results showed that i4mC-GRU achieved area under the receiver operating characteristic curve values of 0.97 and 0.89 and area under the precision-recall curve values of 0.98 and 0.90 on the first and second benchmark datasets, respectively. In short, our method outperformed existing methods in predicting 4mC sites in mouse genomes. We also deployed i4mC-GRU as an online web server to support users in genomics studies.

Victoria University of Wellington

iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features

Author: Binh Nguyen (8511558)
Hoang Nguyen (8514915)
PU Nguyen-Hoang (16973088)
QH Trinh (16973085)
S Rahardja (10455851)
TH Nguyen-Vo (10455830)
Publication date: 03/10/2022

Background: Promoters, non-coding DNA sequences located upstream of the transcription start site of genes/gene clusters, are essential regulatory elements for the initiation and regulation of transcriptional processes. Furthermore, identifying promoters in DNA sequences and genomes significantly contributes to discovering entire structures of genes of interest. Therefore, exploration of promoter regions is one of the most important topics in molecular genetics and biology. Besides experimental techniques, computational methods have been developed to predict promoters. In this study, we propose iPromoter-Seqvec, an efficient computational model to predict TATA and non-TATA promoters in human and mouse genomes using bidirectional long short-term memory neural networks in combination with sequence-embedded features extracted from input sequences. The promoter and non-promoter sequences were retrieved from the Eukaryotic Promoter database and were then refined to create four benchmark datasets. Results: The area under the receiver operating characteristic curve (AUCROC) and the area under the precision-recall curve (AUCPR) were used as two key metrics to evaluate model performance. Results on independent test sets showed that iPromoter-Seqvec outperformed other state-of-the-art methods with AUCROC values ranging from 0.85 to 0.99 and AUCPR values ranging from 0.86 to 0.99. Models predicting TATA promoters in both species had slightly higher predictive power compared to those predicting non-TATA promoters. With a novel idea of constructing artificial non-promoter sequences based on promoter sequences, our models were able to learn highly specific characteristics discriminating promoters from non-promoters to improve predictive efficiency. Conclusions: iPromoter-Seqvec is a stable and robust model for predicting both TATA and non-TATA promoters in human and mouse genomes. 
Our proposed method was also deployed as an online web server with a user-friendly interface to support research communities. Links to our source code and web server are available at https://github.com/mldlproject/2022-iPromoter-Seqvec
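
For readers unfamiliar with the two metrics named in the abstract, the following Python sketch shows one common way to compute them from binary labels and model scores. It is an illustration only and is not taken from the authors' code. AUCROC is computed as the fraction of positive-negative pairs ranked correctly, and AUCPR is approximated by average precision, which is one of several estimators and may differ from the one the authors used. The function names and the toy data are assumptions.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUCROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a step-wise estimate of the area under the PR curve.

    Assumes scores are distinct; tied scores are ordered arbitrarily here.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


# Toy data (made up): 1 = promoter, 0 = non-promoter.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.2]
print(auc_roc(y_true, y_score), average_precision(y_true, y_score))
```

The pairwise AUCROC loop is quadratic in the number of examples, which is fine for a small illustration; a rank-based formulation would be preferable on large test sets.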

Victoria University of Wellington

PubMed Central