    Understanding RNA

    Ribonucleic acids (RNAs) are a group of biologically active nucleic acids in the cell. RNAs are intermediaries between DNA and protein; they exist in all living organisms and are essential for life. Structurally, they are most similar to DNA. The biological roles of RNAs are studied mainly in biology, computational biology, bioinformatics, medical science, and drug discovery. The fundamental problem is that the number of RNA chains in cells is very large and the chains come in different shapes and sizes, so comparing RNA molecules to determine their functions is a challenging problem in machine learning and computational biology. The function of most RNA molecules found in the real world is difficult to understand. In this work, we investigate possible solutions for determining RNA functions that could lead to a massive step forward in understanding biological systems. Graphs are a type of structured data, and graph-based methods have shown promise for other biological compounds such as proteins. RNAs can be represented in graph-structured form, and the resulting RNA graph data can then be used in learning applications. We investigate how to extract useful information from 1D, 2D, and 3D RNA shapes, encode these pieces of information into structured data, and then apply classification methods to determine their functions. The thesis has four main contributions. The first contribution is a new large RNA dataset that consists of graph-based representations and 3D point cloud representations. The dataset includes 3178 RNA chains, labelled in 8 classes according to their reported biological functions. It aims to provide a platform for investigating RNA functions with classification methods. The second contribution of the thesis is a new graph representation of RNA molecules based on the minimum free energy (MFE) of the secondary structure elements of the molecules.
This representation encodes each structural component of the 2D RNA shape as an edge and the total MFE of that component as the edge weight. The weights are determined by a labelling process that considers the MFE of the structure and its particular setting within the RNA. The motivation for this encoding is to reduce the size of the graph representation while giving the secondary structure elements an explicit encoding in the graph. The third contribution of the thesis is to treat 3D RNA strands as 3D curves using the three geometric coordinates (x, y, z) of the C3 atom of each nucleobase. These coordinates are used to represent each RNA curve with the square root velocity function (SRVF), arc length, curvature, and minimum distance, which in turn describe a number of possible graph representations and 3D point cloud representations. Armed with these RNA graph representations, state-of-the-art graph kernel methods are applied to determine the relative importance of each RNA graph representation. The applied methods are the Weisfeiler-Lehman optimal assignment kernel (WL-OA), the shortest paths kernel (SP), and the all paths and cycles method (APC). The last contribution is to use geometric deep learning (GDL) methods to determine the type of RNA molecules. Broadly, two different GDL approaches are applied to report the classification results. The first approach analyses classification performance using GDL methods based on graph neural networks (GNNs). The applied methods are Deep Graph Convolutional Neural Network (DGCNN), Graph Isomorphism Network (GIN), Structure2vec, Graph U-Nets, and LCGNNGIN. For the GNN-based GDL methods, novel graph representations are also introduced in which the node features are multi-dimensional and continuous. The second approach analyses GDL based on 3D point clouds.
PointNet, PointNet++, and PointConv are applied to the RNA 3D point cloud representations to provide classification results.
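
As an illustration of the curve descriptors named in the third contribution, the following is a minimal sketch of how arc length, a discrete curvature, and a discretised SRVF could be computed from an ordered list of 3D atom coordinates. It is not the thesis implementation: the function name `curve_descriptors`, the unit parameter spacing between consecutive atoms, and the turning-angle curvature estimate are all assumptions made for this example, and the minimum-distance descriptor is left out because the abstract does not define it.

```python
import numpy as np

def curve_descriptors(points):
    """Hypothetical helper: per-point descriptors of a 3D polyline.

    points: (n, 3) array of ordered coordinates, one row per nucleotide.
    Returns (arc, srvf, curvature):
      arc       (n,)     cumulative arc length at each point
      srvf      (n-1, 3) discretised square root velocity function
      curvature (n-2,)   turning-angle curvature at interior points
    """
    p = np.asarray(points, dtype=float)
    eps = 1e-12  # guards against division by zero for repeated points

    # Finite-difference velocity, assuming unit parameter spacing (an assumption).
    seg = np.diff(p, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)

    # Cumulative arc length along the polyline, starting at 0.
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])

    # SRVF: q = v / sqrt(|v|), so that sum(|q|^2) equals the curve length.
    srvf = seg / np.sqrt(np.maximum(seg_len, eps))[:, None]

    # Discrete curvature: angle between consecutive unit tangents,
    # divided by the mean length of the two adjacent segments.
    tangent = seg / np.maximum(seg_len, eps)[:, None]
    cos_angle = np.clip(np.sum(tangent[:-1] * tangent[1:], axis=1), -1.0, 1.0)
    curvature = np.arccos(cos_angle) / (0.5 * (seg_len[:-1] + seg_len[1:]))

    return arc, srvf, curvature

# Usage on a synthetic half circle of radius 2 (curvature should be close to 1/2).
theta = np.linspace(0.0, np.pi, 50)
half_circle = np.stack(
    [2.0 * np.cos(theta), 2.0 * np.sin(theta), np.zeros_like(theta)], axis=1
)
arc, srvf, curvature = curve_descriptors(half_circle)
print(arc[-1], curvature.mean())
```

Descriptors like these could serve as continuous node or point features, but how they are combined into graphs or point clouds in the thesis is not specified in the abstract.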