3 research outputs found

    Métodos de Kernels en secuencias para la clasificación de residuos catalíticos en sitios activos de enzimas

    This work presents a methodology for classifying catalytic residues in enzyme active sites. The methodology is based on machine learning, specifically support vector machines (SVMs), which together with kernel functions allow residues in enzymes to be classified from the enzyme sequence. The dataset used was the Catalytic Site Atlas (CSA). In the proposed methodology, the biological information of each residue is first integrated with the sequence representation of the enzyme that contains it, by means of Gaussian and string kernel functions, respectively. Next, the hierarchical clustering algorithm AGNES (Agglomerative Nesting) is applied to obtain an initial number of groups for the k-means clustering algorithm, which yields five groups of enzymes. Finally, an SVM-based classification system was developed for each cluster. The generalization error estimated by cross-validation is used as the model performance criterion.
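    The abstract does not say which string kernel was used or how the two kernels were combined. The following is only a minimal sketch under stated assumptions: a k-spectrum kernel stands in for the string kernel, and the two kernels are combined by a product. The function names, the parameters `k` and `gamma`, and the toy sequences and feature vectors are all hypothetical.

```python
import math
from collections import Counter


def spectrum_kernel(s, t, k=3):
    """k-spectrum string kernel: dot product of the k-mer count vectors."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return float(sum(cs[m] * ct[m] for m in cs if m in ct))


def normalized_spectrum_kernel(s, t, k=3):
    """Cosine-normalized spectrum kernel, so that K(s, s) = 1."""
    denom = math.sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom > 0 else 0.0


def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel on numeric residue descriptors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def combined_kernel(a, b, k=3, gamma=0.5):
    """Product of the sequence kernel and the residue-feature kernel.

    Each sample is a (sequence, feature_vector) pair. The product form is
    an assumption made for this sketch, not the method stated in the source.
    """
    (seq_a, feat_a), (seq_b, feat_b) = a, b
    return (normalized_spectrum_kernel(seq_a, seq_b, k)
            * gaussian_kernel(feat_a, feat_b, gamma))


# Hypothetical toy samples: a short sequence window plus two residue features.
sample_a = ("MKVLAAGIV", [0.1, 0.9])
sample_b = ("MKVLSAGIV", [0.2, 0.7])
similarity = combined_kernel(sample_a, sample_b)
```

    A Gram matrix built from such a function could be passed to any SVM implementation that accepts a precomputed kernel. The product of two valid kernels is itself a valid kernel, which is why it was chosen for this illustration.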

    Understanding RNA

    Ribonucleic acids (RNAs) are a group of biologically active nucleic acids in the cell. RNAs are intermediaries between DNA and protein; they exist in all living organisms and are essential for life. Structurally, they are most similar to DNA. The biological roles of RNAs are investigated mainly in biology, computational biology, bioinformatics, medical science, and drug discovery. The fundamental problem is that the number of RNA chains in cells is very large and the chains vary in shape and size, so comparing RNA molecules to determine their functions is a challenging problem in machine learning and computational biology. The function of most RNA molecules found in nature remains difficult to understand. In this work, we investigate possible solutions for determining RNA functions, which could be a major step forward in understanding biological systems. Graphs are a type of structured data, and graph-based methods have shown promise for other biological compounds such as proteins. RNAs can be represented in graph-structured form, and the resulting RNA graph data can then be used in learning applications. We investigate how to extract useful information from 1D, 2D, and 3D RNA shapes, encode these pieces of information as structured data, and then apply classification methods to determine RNA functions. The thesis has four main contributions. The first contribution is a new large RNA dataset that consists of graph-based representations and 3D point cloud representations. The dataset includes 3178 RNA chains, labelled in 8 classes according to their reported biological functions. It aims to provide a platform for investigating RNA functions with classification methods. The second contribution is a new graph representation of RNA molecules based on the minimum free energy (MFE) of their secondary structure elements. Each structural component of the 2D RNA shape is encoded as an edge, and the total MFE of that component is used as the edge weight. The weights are determined by a labelling process that considers the MFE of the structure and the particular setting within the RNA. The motivation for this encoding is to reduce the size of the graph representation while giving the secondary structure elements an explicit encoding in the graph. The third contribution is to treat 3D RNA strands as 3D curves, using the geometric coordinates (x, y, z) of the C3 atom of each nucleobase. The coordinates are used to represent each RNA curve with the square root velocity function (SRVF), arc length, curvature, and minimum distance, yielding a number of possible graph representations and 3D point cloud representations. With these RNA graph representations, state-of-the-art graph kernel methods are applied to determine the relative importance of each representation. The applied methods are the Weisfeiler-Lehman optimal assignment kernel (WL-OA), the shortest paths kernel (SP), and the all paths and cycle method (APC). The last contribution is to use geometric deep learning (GDL) methods to determine the type of RNA molecules. Broadly, two GDL approaches are applied to report classification results. The first approach analyses classification performance using GDL methods based on graph neural networks (GNNs). The applied methods are Deep Graph Convolutional Neural Network (DGCNN), Graph Isomorphism Network (GIN), Structure2vec, Graph U-Nets, and LCGNNGIN. For the GNN-based methods, novel graph representations are also introduced in which the nodes carry multi-dimensional continuous features. The second approach analyses GDL based on 3D point clouds: PointNet, PointNet++, and PointConv are applied to the RNA 3D point cloud representations to provide classification results.
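    To illustrate the curve-based representation mentioned above, here is a minimal sketch of a discrete square root velocity function (SRVF), q = f' / sqrt(|f'|), for a 3D polyline. It assumes forward differences with a unit parameter step between consecutive atoms; the thesis's actual discretization is not stated in the abstract, and the function names and coordinates below are hypothetical.

```python
import math


def srvf(points):
    """Discrete SRVF of a polyline: q_i = v_i / sqrt(|v_i|).

    v_i is the forward difference between consecutive points, taken with
    a unit parameter step (an assumption made for this sketch).
    """
    q = []
    for p0, p1 in zip(points, points[1:]):
        v = [b - a for a, b in zip(p0, p1)]
        speed = math.sqrt(sum(c * c for c in v))
        if speed > 0:
            q.append([c / math.sqrt(speed) for c in v])
        else:
            # Repeated point: the velocity is zero, and so is the SRVF.
            q.append([0.0, 0.0, 0.0])
    return q


def arc_length(points):
    """Total length of the polyline through the given points."""
    return sum(math.dist(p0, p1) for p0, p1 in zip(points, points[1:]))


# Hypothetical backbone coordinates (one point per residue).
curve = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
q = srvf(curve)

# With a unit step, the squared norm of the SRVF sums to the arc length.
srvf_energy = sum(c * c for row in q for c in row)
```

    One reason the SRVF is convenient for shape comparison is that the squared norm of q recovers the curve length, so rescaling q to unit norm removes the effect of overall size.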

    A New Kernel Method for RNA Classification

    No full text
    Support vector machines (SVMs) are a state-of-the-art machine learning tool widely used in speech recognition, image processing, and biological sequence analysis. An essential step in using SVMs is to devise a kernel function that computes the similarity between two data points in Euclidean space. In this paper we present a new kernel that uses both global and local structural information in RNAs to classify them with support vector machines. Experimental results demonstrate the good performance of the new kernel and show that it outperforms existing kernels when applied to classifying non-coding RNA sequences.