60 research outputs found

A combination of feature extraction methods with an ensemble of different classifiers for protein structural class prediction problem

Author: Dehzangi A.
Dehzangi O.
Paliwal K.K.
Sattar A.
Sharma Alokanand
Publication date: 01/01/2013

A better understanding of the structural class of a given protein reveals important information about its overall folding type and its domain. It can also be used directly to provide critical information on the general tertiary structure of a protein, which has a profound impact on protein function determination and drug design. Despite the substantial improvements made by pattern recognition-based approaches, this remains an unsolved problem in bioinformatics that demands more attention and exploration. In this study, we propose a novel feature extraction model which incorporates physicochemical and evolutionary-based information simultaneously. We also propose overlapped segmented distribution and autocorrelation-based feature extraction methods to provide more local and global discriminatory information. The proposed feature extraction methods are explored for the 15 most promising attributes, selected from a wide range of physicochemical-based attributes. Finally, by applying an ensemble of different classifiers, namely AdaBoost.M1, LogitBoost, Naive Bayes, Multi-Layer Perceptron (MLP), and Support Vector Machine (SVM), we show an enhancement of the protein structural class prediction accuracy on four popular benchmarks.
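As an illustration of the autocorrelation-based features mentioned in the abstract above, the following is a minimal sketch rather than the authors' method. It assumes a single physicochemical attribute (the Kyte-Doolittle hydropathy index, chosen here only as an example; the 15 attributes used in the paper are not listed in this record) and a plain lag-d product average. The function name `autocorrelation_features` and the `max_lag` parameter are hypothetical.

```python
# Kyte-Doolittle hydropathy index, used here only as an example attribute.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation_features(seq, scale=KD, max_lag=3):
    """Return [AC(1), ..., AC(max_lag)] for a protein sequence.

    AC(d) = (1 / (L - d)) * sum_i x[i] * x[i + d], where x is the
    sequence mapped to attribute values and L is the sequence length.
    This definition is an assumption made for illustration.
    """
    values = [scale[aa] for aa in seq]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for d in range(1, max_lag + 1):
        total = sum(values[i] * values[i + d] for i in range(n - d))
        features.append(total / (n - d))
    return features


if __name__ == "__main__":
    # A made-up peptide, not taken from any benchmark.
    print(autocorrelation_features("MKVLAAGIV", max_lag=3))
```

Each lag contributes one number per attribute, so a sequence of any length is mapped to a fixed-size vector that a classifier can consume.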

University of the South Pacific Electronic Research Repository

Identification of RNA Binding Proteins and RNA Binding Residues Using Effective Machine Learning Techniques

Author: Khanal Reecha
Publication venue: ScholarWorks@UNO
Publication date: 01/04/2019

Identification and annotation of RNA Binding Proteins (RBPs) and RNA-binding residues from sequence information alone is one of the most challenging problems in computational biology. RBPs play crucial roles in several fundamental biological functions, including transcriptional regulation of RNAs, RNA metabolism and splicing. Existing experimental techniques are time-consuming and costly. Thus, efficient computational identification of RBPs directly from the sequence can help annotate RBPs and assist experimental design. Here, we introduce AIRBP, a computational sequence-based method which uses features extracted from evolutionary information, physicochemical properties, and disordered properties to train a stacking-based machine learning model for effective prediction of RBPs. It makes use of machine learning algorithms such as Support Vector Machine, Logistic Regression, K-Nearest Neighbor and XGBoost (Extreme Gradient Boosting). In this research work, we also propose a second predictor for efficient annotation of RNA-binding residues. This residue predictor also uses stacking and evolutionary algorithms, and it likewise draws on various evolutionary, physicochemical and disordered properties to train a robust model. This thesis presents a possible solution to the RBP and RNA-binding residue prediction problem through two independent predictors, both of which outperform existing state-of-the-art approaches.

University of New Orleans

Development of a deep learning-based computational framework for the classification of protein sequences

Author: Barros Miguel Ângelo Pereira
Publication date: 16/12/2022

Master's dissertation in Bioinformatics. Proteins are among the most important biological structures in living organisms, since they perform multiple biological functions. Each protein has different characteristics and properties, which can be exploited in many areas, such as industrial biotechnology and clinical applications, with a positive impact. Modern high-throughput methods allow protein sequencing, which provides the protein sequence data. Machine learning methodologies are applied to characterize proteins using information from the protein sequence. However, a major problem with this approach is how to encode the protein sequences properly without losing the biological relationship between the amino acid residues. The transformation of the protein sequence into a numeric representation is done by encoder methods. The main objective of this project is therefore to study different encoders and identify the methods which yield the best biological representation of the protein sequences when used in machine learning (ML) models to predict different labels related to their function. The methods were analyzed in two case studies. The first concerns enzymes, a well-established case in the literature. The second uses transporter sequences, a less studied case in the literature. In both cases, the data were collected from the curated database Swiss-Prot. The encoders tested include calculated protein descriptors, substitution matrix methods, position-specific scoring matrices, and encoding by pre-trained transformer models. Encoding protein sequences with state-of-the-art pre-trained transformers proved to be a good biological representation for subsequent use in state-of-the-art ML methods. Namely, the ESM-1b transformer achieved a Matthews correlation coefficient above 0.9 for any multiclass classification task of the transporter classification system.
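The Matthews correlation coefficient reported in the abstract above can be computed for multiclass labels with the generalised formula MCC = (c·s − Σ p_k·t_k) / sqrt((s² − Σ p_k²)(s² − Σ t_k²)), where s is the number of samples, c the number of correct predictions, and t_k and p_k the number of times class k occurs in the true and predicted labels. The sketch below is only an illustration of that formula; the function name `multiclass_mcc` and the example labels are hypothetical and are not taken from the dissertation.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    Returns 0.0 when the denominator is zero (for example, when every
    sample is predicted as the same class).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                       # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))       # correct predictions
    t_counts = Counter(y_true)                            # true occurrences per class
    p_counts = Counter(y_pred)                            # predicted occurrences per class
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    denominator = sqrt(
        (s * s - sum(v * v for v in p_counts.values()))
        * (s * s - sum(v * v for v in t_counts.values()))
    )
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Made-up labels standing in for transporter classes.
    truth = ["channel", "carrier", "pump", "carrier", "channel"]
    guess = ["channel", "carrier", "pump", "channel", "channel"]
    print(round(multiclass_mcc(truth, guess), 3))
```

Unlike plain accuracy, this score takes class frequencies into account, which matters when some classes are much rarer than others.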

Universidade do Minho: RepositoriUM

A novel coiled-coil repeat variant in a class of bacterial cytoskeletal proteins

Author: Altschul
Ausmees
Bagchi
Bentley
Bi
Bisaglia
Borhani
Bosgraaf
Brown
Cabeen
Cabeen
Charbon
Crick
Cuff
Del Rizzo
Del Rizzo
Delorenzi
Edgar
Edwards
Edwards
Finn
Fischetti
Flardh
Fraser
Gabriella H. Kelemen
Gonzalez
Gonzalez
Gonzalez
Gruber
Gruber
Harbury
Harbury
Herrmann
Hicks
Hoiczyk
Holberton
Hornung
Insall
John Walshaw
Karimova
Kyte
Letek
Linding
Liu
Lupas
Lupas
Lupas
Löwe
Marshall
Marshall
Mazouni
McDonnell
McLachlan
Michael D. Gillespie
Mihajlovic
Monera
Norrander
Olia
Omura
O’Shea
Parry
Pauling
Peters
Puupponen-Pimiä
Rice
Roberts
Ross
Seo
Shin
Smith
Sokolova
Squire
Stetefeld
Strelkov
Sumby
Sweet
The UniProt Consortium
Wade
Walter
Weber
Wilson
Wise
You
Yu
Publication venue: Elsevier BV
Publication date: 22/02/2010

Crossref

University of East Anglia digital repository

Improved general regression network for protein domain boundary prediction

Author: A Ceroni
A Vieira
Abdur R Sikder
AK Jain
Albert Y Zomaya
AR Sikder
AR Sikder
Bing Bing Zhou
C Chothia
C Civera
CC Lee
CR Robinson
DB Wetlaufer
FMG Pearl
G Pollastri
G Pollastri
HC Van Leeuwen
HM Berman
J Chen
J Cheng
J Liu
J Sim
JCB Melo
JE Gewehr
JS Richardson
JSR Jang
M Dumontier
M Dumontier
M Suyama
MJ Lehtinen
N Nagarajan
OV Galzitskaya
P Baldi
P Bork
Paul D Yoo
RA George
RE Schapire
RL Marsden
RR Copley
RR Joshi
RS Gokhale
S Prompramote
S Veretnik
SF Altschul
TA Holland
Y Freund
Publication venue: BioMed Central
Publication date: 13/02/2008

Background: Protein domains present some of the most useful information that can be used to understand protein structure and function. Recent research on protein domain boundary prediction has been based mainly on widely known machine learning techniques, such as Artificial Neural Networks and Support Vector Machines. In this study, we propose a new machine learning model (IGRN) that can achieve accurate and reliable classification with significantly reduced computation. The IGRN was trained using a PSSM (Position Specific Scoring Matrix), secondary structure, solvent accessibility information and an inter-domain linker index to detect possible domain boundaries for a target sequence. Results: The proposed model achieved an average prediction accuracy of 67% on the Benchmark_2 dataset for domain boundary identification in multi-domain proteins and showed superior predictive performance and generalisation ability among the most widely used neural network models. On the CASP7 benchmark dataset, with 70.10% prediction accuracy, it also demonstrated performance comparable to existing domain boundary predictors such as DOMpro, DomPred, DomSSEA, DomCut and DomainDiscovery. Conclusion: The performance of the proposed model compares favourably with that of other existing machine learning based methods, as well as widely known domain boundary predictors, on two benchmark datasets, and it excels in the identification of domain boundaries in terms of model bias, generalisation and computational requirements. © 2008 Yoo et al; licensee BioMed Central Ltd.

Crossref

Michigan Technological University

PubMed Central

A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models

Author: A Andreeva
A Ben-Hur
A Karwath
A Karwath
A Shah
Alessandra Carbone
B Liu
B Qian
B Webb-Robertson
C Ferreira
C Leslie
D Higgins
F Wilcoxon
G Yona
Gerson Zaverucha
H Rangwala
H Saigo
J Bernardes
J Davis
J Gough
J Quinlan
J Soeding
J Weston
Juliana S Bernardes
L De Raedt
L Dehaspe
L Liao
N Shan-Hwei
Q Dong
Q Su
R Agrawal
R Hughey
R King
R King
R Kuang
R Sadreyev
S Altschul
S Altschul
S Brenner
S Eddy
S Eddy
S Kawashima
S Lee
T Handstad
T Jaakkola
T Lingner
U Syed
V Alexandrov
V Atalay
Y Hou
Y Hou
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Remote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, when we deal with proteins in the "twilight zone" we can observe that only some segments of sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVM). Results: We use the SCOP database to perform our experiments by evaluating protein recognition within the same superfamily. Our results show that our methodology, when using SVM, performs significantly better than some of the state-of-the-art methods and is comparable to others. In addition, our method provides a comprehensible set of logical rules that can help to understand what determines a protein function. Conclusions: The strategy of selecting only the most frequent patterns is effective for remote homology detection. This is possible through a suitable first-order logical representation of homologous properties, and through a set of frequent patterns, found by an ILP system, that summarizes essential features of protein functions.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

HAL-Inserm

PubMed Central

An Empirical Study of Different Approaches for Protein Classification

Author: Alessandra Lumini
Loris Nanni
Sheryl Brahnam
Publication venue: Hindawi Limited
Publication date: 01/01/2014

Many domains would benefit from reliable and efficient systems for automatic protein classification. An area of particular interest in recent studies on automatic protein classification is the exploration of new methods for extracting features from a protein that work well for specific problems. These methods, however, are not generalizable and have proven useful in only a few domains. Our goal is to evaluate several feature extraction approaches for representing proteins by testing them across multiple datasets. Different types of protein representations are evaluated: those starting from the position specific scoring matrix (PSSM) of the protein, those derived from the amino-acid sequence, two matrix representations, and features taken from the 3D tertiary structure of the protein. We also test new variants of protein descriptors. We develop our system experimentally by comparing and combining different descriptors taken from the protein representations. Each descriptor is used to train a separate support vector machine (SVM), and the results are combined by sum rule. Some stand-alone descriptors work well on some datasets but not on others. Through fusion, the different descriptors perform well across all tested datasets, in some cases better than the state of the art.
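The sum-rule fusion described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each descriptor-specific classifier has already produced one score per class on a comparable scale (for example, probabilities), and it omits SVM training entirely. The function name `sum_rule` and the scores are hypothetical.

```python
def sum_rule(score_lists):
    """Fuse per-class scores from several classifiers by summing them.

    score_lists: one list of per-class scores per classifier; all lists
    must have the same length. Returns (winning_class_index, fused_scores).
    """
    if not score_lists:
        raise ValueError("at least one classifier is required")
    n_classes = len(score_lists[0])
    if any(len(scores) != n_classes for scores in score_lists):
        raise ValueError("all classifiers must score the same classes")
    # Column-wise sum: total support for each class across classifiers.
    fused = [sum(column) for column in zip(*score_lists)]
    best = max(range(n_classes), key=fused.__getitem__)
    return best, fused


if __name__ == "__main__":
    # Made-up scores from three descriptor-specific classifiers, two classes.
    scores = [
        [0.60, 0.40],  # e.g. a PSSM-based descriptor
        [0.20, 0.80],  # e.g. a sequence-based descriptor
        [0.45, 0.55],  # e.g. a structure-based descriptor
    ]
    print(sum_rule(scores))
```

Summing lets a descriptor that is confident on a given sample outweigh descriptors that are close to undecided, which is one plausible reason fusion can be more stable across datasets than any single descriptor.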

Crossref

Directory of Open Access Journals

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

From Sequence to Structure And Back Again: An Alignment Tale

Author: Simossis V.A.
Publication venue: Amsterdam: Vrije Universiteit
Publication date: 01/01/2005

Heringa, J. [Promotor]

VU Research Portal

Computational Reverse-Engineering of a Spider-Venom Derived Peptide Active Against Plasmodium falciparum SUB1

merozoites and invasion into erythrocytes. As PfSUB1 has emerged as an interesting drug target, we explored the hypothesis that PcFK1 targeted PfSUB1 enzymatic activity. culture in a range compatible with our bioinformatics analysis. Using contact analysis and free energy decomposition, we propose that residues A14 and Q15 are important in the interaction with PfSUB1. Our computational reverse engineering supported the hypothesis that PcFK1 targeted PfSUB1, and this was confirmed by experimental evidence showing that PcFK1 inhibits PfSUB1 enzymatic activity. This outlines the usefulness of advanced bioinformatics tools to predict the function of a protein structure. The structural features of PcFK1 represent an interesting protein scaffold for future protein engineering.

Public Library of Science (PLOS)

Crossref

Hal - Université Grenoble Alpes

Directory of Open Access Journals

PubMed Central

HAL-Pasteur