8,332 research outputs found

    Protein sequences classification by means of feature extraction with substitution matrices

    Background: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors, the format required by well-known machine-learning classifiers. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. Results: To demonstrate the efficiency of this approach, we compare several encoding methods using several machine-learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on classification quality. Our method yields good classification accuracy with all the substitution matrices, and the variance in accuracy across substitution matrices is slight. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works. Conclusions: The outcomes of our comparative experiments confirm the efficiency of our encoding method for representing protein sequences in classification tasks.
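The general idea in the abstract above (motifs that a substitution matrix marks as similar share one feature) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy scoring function, the greedy merging rule, the threshold, and all function names are assumptions made for illustration, and the scores are not taken from a real BLOSUM or PAM matrix.

```python
# Toy substitution scores (hypothetical values, NOT a real BLOSUM/PAM matrix):
# identical residues score 4, a few chemically similar pairs score 2, others -1.
SIMILAR_PAIRS = {frozenset("IL"), frozenset("IV"), frozenset("LV"),
                 frozenset("DE"), frozenset("KR"), frozenset("ST")}


def score(a, b):
    """Substitution score between two residues (toy values)."""
    if a == b:
        return 4
    return 2 if frozenset((a, b)) in SIMILAR_PAIRS else -1


def motif_score(m1, m2):
    """Sum of position-wise substitution scores of two equal-length motifs."""
    return sum(score(a, b) for a, b in zip(m1, m2))


def kmers(seq, k):
    """All overlapping length-k motifs of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_features(sequences, k=3, threshold=8):
    """Greedily keep a motif as a new feature unless it is similar
    (motif_score >= threshold) to a motif that was already kept."""
    representatives = []
    for seq in sequences:
        for m in kmers(seq, k):
            if not any(motif_score(m, r) >= threshold for r in representatives):
                representatives.append(m)
    return representatives


def encode(seq, representatives, k=3, threshold=8):
    """Binary vector: 1 where the sequence contains a motif similar to the feature."""
    ms = kmers(seq, k)
    return [int(any(motif_score(m, r) >= threshold for m in ms))
            for r in representatives]


if __name__ == "__main__":
    train = ["MKLV", "MRIV"]
    feats = build_features(train)          # similar motifs are merged
    print(feats, [encode(s, feats) for s in train])
```

With exact matching (a threshold of 12 for 3-mers under these toy scores) the two example sequences yield four distinct motifs; allowing substitutions merges them into fewer features, which is the kind of reduction in attribute count the abstract refers to.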

    Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix

    One of the most powerful techniques for studying protein structures is to look for recurrent fragments (also called substructures or spatial motifs) and then use them as patterns to characterize the proteins under study. An emerging trend consists in parsing proteins' three-dimensional (3D) structures into graphs of amino acids. The search for recurrent spatial motifs is then formulated as a process of frequent subgraph discovery, where each subgraph represents a spatial motif. Several efficient approaches for frequent subgraph discovery have been proposed in the literature. However, the set of discovered frequent subgraphs is too large to be efficiently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent subgraphs by selecting the representative ones. Existing pattern selection approaches do not exploit domain knowledge. In our approach, by contrast, we incorporate the evolutionary information of amino acids defined in substitution matrices in order to select the representative subgraphs. We show the effectiveness of our approach on a number of real datasets. The results of our experiments show that our approach considerably decreases the number of motifs while enhancing their interestingness.

    Development of a deep learning-based computational framework for the classification of protein sequences

    Master's dissertation in Bioinformatics. Proteins are among the most important biological structures in living organisms, since they perform multiple biological functions. Each protein has different characteristics and properties, which can be employed in many industries, such as industrial biotechnology and clinical applications, with a positive impact. Modern high-throughput methods allow protein sequencing, which provides protein sequence data. Machine learning methodologies are applied to characterize proteins using information from the protein sequence. However, a major problem with this approach is how to properly encode the protein sequences without losing the biological relationship between the amino acid residues. The transformation of the protein sequence into a numeric representation is done by encoder methods. The main objective of this project is therefore to study different encoders and identify the methods that yield the best biological representation of the protein sequences when used in machine learning (ML) models to predict different labels related to their function. The methods were analyzed in two case studies. The first concerns enzymes, a well-established case in the literature. The second used transporter sequences, a less studied case in the literature. In both cases, the data was collected from the curated database Swiss-Prot. The encoders that were tested include: calculated protein descriptors; substitution matrix methods; position-specific scoring matrices; and encoding by pre-trained transformer methods. Encoding protein sequences with state-of-the-art pre-trained transformers proved to be a good biological representation for subsequent use in state-of-the-art ML methods. Namely, the ESM-1b transformer achieved a Matthews correlation coefficient above 0.9 for any multiclassification task of the transporter classification system.

    Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

    Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics are to store and manage biological data, and to develop and analyze computational tools that enhance their understanding. The amount of data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To reduce the gap between newly sequenced proteins and proteins with known functions, many computational techniques involving classification and clustering algorithms have been proposed. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of the large number of newly discovered proteins. Existing classification results are unsatisfactory because of the very large number of features produced by the various feature encoding methods. In this work, a statistical metric-based feature selection technique is proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth.

    Word correlation matrices for protein sequence analysis and remote homology detection

    Background: Classification of protein sequences is a central problem in computational biology. Currently, among computational methods, discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences are usually computationally expensive. Results: In this work we present a novel kernel for protein sequences based on average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely used benchmark setup for protein remote homology detection. Conclusion: Our word correlation approach provides highly competitive performance compared with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking of potential homologs in large databases.
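As a rough illustration of a kernel "based on average word similarity between two sequences", the following Python sketch averages a word-level similarity over all pairs of words (k-mers) drawn from two sequences. The word similarity used here (fraction of identical positions), the word length, and the function names are assumptions chosen for simplicity; the abstract does not specify them, and the paper's actual kernel may differ.

```python
def words(seq, k):
    """All overlapping length-k words of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_similarity(u, v):
    """Hypothetical word similarity: fraction of positions with identical residues."""
    return sum(a == b for a, b in zip(u, v)) / len(u)


def avg_word_kernel(x, y, k=2):
    """Average of word_similarity over all word pairs (one word from x, one from y)."""
    wx, wy = words(x, k), words(y, k)
    if not wx or not wy:
        return 0.0  # a sequence shorter than k has no words
    total = sum(word_similarity(u, v) for u in wx for v in wy)
    return total / (len(wx) * len(wy))


if __name__ == "__main__":
    print(avg_word_kernel("ABC", "BCD"))
```

With this particular toy similarity, the kernel value equals a dot product between the averaged position-wise one-hot encodings of each sequence's words, so each sequence can be mapped to an explicit feature vector once. That property is one way to read the abstract's remark about a feature space that permits fast classification of new sequences.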