Prediction of Protein Domain with mRMR Feature Selection and Analysis

Author: AA Schaffer
AG Murzin
AK Dunker
AM Moses
AP Elhammer
B Saffari
Bi-Qing Li
Bin Xue
BQ Li
CA Orengo
D Chivian
D Li
DE Kim
E Angov
EC Mbamala
G Pugalenthi
GP Zhou
H Ingolfsson
H Mohabatkar
H Peng
HB Shen
I Walsh
ID Campbell
IH Witten
J Chen
J Cheng
J Eickholt
J Lin
J Liu
J Wang
JD Qiu
JE Gewehr
JJ Chou
JR Schnell
K Peng
K Shameer
K Wang
Kai-Yan Feng
KC Chou
KK Kandaswamy
Kuo-Chen Chou
L Breiman
L Chen
L Holm
Le-Le Hu
Lei Chen
M Esmaeili
M Hayat
M Suyama
MJ Berardi
MK Yoon
N Nagarajan
N von Ohsen
NM Goldenberg
P Mundra
P Tompa
P Wang
PE Wright
PK Nielsen
Q Gu
R Apweiler
R Bondugula
R Guerois
R Linding
RA George
RA Poorman
S Gong
S Kawashima
S Roy
SC Jia
SF Altschul
SM Reynolds
T Ebina
T Huang
TA Holland
W Li
W Zhao
WR Atchley
WZ Lin
X Xiao
Y Zhang
YD Cai
YD Li
Yu-Dong Cai
YX Li
Z He
Z Qiu
ZC Wu
Publication venue: Public Library of Science
Publication date: 01/01/2012

Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate protein structure prediction and speed up functional annotation. However, although many efforts have been made in this regard, predicting protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than the rates achieved by existing methods on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
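The abstract above names mRMR ranking followed by incremental feature selection (IFS) but gives no implementation details. The following is a minimal sketch of that two-step idea under stated assumptions: features are discrete, relevance and redundancy are plain mutual information combined by subtraction, and the classifier is replaced by a caller-supplied `evaluate` callback (the paper uses a random forest). The function names and the toy feature names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels):
    """Greedy mRMR ranking (difference form).

    features: dict mapping a feature name to a list of discrete values.
    Returns the feature names in the order they were selected.
    """
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected, remaining = [], set(features)

    def score(name):
        # Relevance to the labels minus mean redundancy with already selected features.
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining:
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


def incremental_feature_selection(ranking, evaluate):
    """Score the nested subsets ranking[:1], ranking[:2], ... and keep the best one.

    evaluate: hypothetical callback returning a score for a list of feature names,
    for example the cross-validated accuracy of a classifier trained on them.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranking) + 1):
        subset = ranking[:k]
        current = evaluate(subset)
        if current > best_score:
            best_subset, best_score = subset, current
    return best_subset, best_score


# Toy data (hypothetical): the label is the sum of two independent binary features,
# and a third feature is an exact duplicate of the first.
toy_features = {
    "conservation": [0, 0, 1, 1],
    "conservation_dup": [0, 0, 1, 1],
    "disorder": [0, 1, 0, 1],
}
toy_labels = [0, 1, 1, 2]
toy_ranking = mrmr_rank(toy_features, toy_labels)
```

In this toy case the duplicated feature is ranked last, because its redundancy with the already selected copy outweighs its relevance.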

CiteSeerX

Directory of Open Access Journals

FigShare

A new pairwise kernel for biological network inference with support vector machines

Author: A Ben-Hur
A Ramani
B Schölkopf
C Harbison
C von Mering
E Sprinzak
E Xing
EM Marcotte
F Pazos
GD Bader
GRG Lanckriet
GS Kimeldorf
HW Mewes
IW Tsang
Jean-Philippe Vert
Jian Qiu
JP Vert
KQ Weinberger
N Aronszajn
N Friedman
P Pavlidis
R Jansen
RI Kondor
S Boyd
S Martin
SF Altschul
SM Gomez
VN Vapnik
William S Noble
WK Huh
Y Qi
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). RESULTS: Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel. This new kernel for pairs can easily be used by most SVM implementations to solve problems of supervised classification and inference of pairwise relationships from heterogeneous data. We demonstrate, using several real biological networks and genomic datasets, that this approach often improves upon the state-of-the-art SVM for indirect inference with another pairwise kernel, and that the combination of both kernels always improves upon each individual kernel. CONCLUSION: The metric learning pairwise kernel is a new formulation to infer pairwise relationships with SVM, which provides state-of-the-art results for the inference of several biological networks from heterogeneous genomic data.
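The abstract does not spell out the kernel formulas. As a hedged illustration, the sketch below uses the commonly cited definitions: a metric learning pairwise kernel of the form (K(a,c) - K(a,d) - K(b,c) + K(b,d))^2 for the direct case, and a tensor product pairwise kernel K(a,c)K(b,d) + K(a,d)K(b,c) for the indirect case. Both forms, the choice of a linear base kernel, and all function names are assumptions made for illustration, not details confirmed by the text above.

```python
def linear_kernel(x, y):
    """Plain dot product, used here as a stand-in base kernel between two nodes."""
    return sum(a * b for a, b in zip(x, y))


def mlpk(pair1, pair2, base_kernel=linear_kernel):
    """Metric learning pairwise kernel between node pairs (a, b) and (c, d).

    With a linear base kernel this equals <a - b, c - d> squared, so it compares
    the within-pair differences, which matches the 'direct' mode of inference.
    """
    (a, b), (c, d) = pair1, pair2
    value = base_kernel(a, c) - base_kernel(a, d) - base_kernel(b, c) + base_kernel(b, d)
    return value ** 2


def tppk(pair1, pair2, base_kernel=linear_kernel):
    """Tensor product pairwise kernel: compares one pair of nodes with another pair."""
    (a, b), (c, d) = pair1, pair2
    return base_kernel(a, c) * base_kernel(b, d) + base_kernel(a, d) * base_kernel(b, c)


# Toy node feature vectors (hypothetical values).
a, b, c, d = (1, 0), (0, 1), (2, 0), (1, 1)
k_direct = mlpk((a, b), (c, d))
k_indirect = tppk((a, b), (c, d))
```

Both functions are symmetric in the order of the nodes within a pair, so a Gram matrix built from either one (or from their sum) could in principle be passed to an SVM implementation that accepts precomputed kernels.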

Springer - Publisher Connector

Building an automated platform for the classification of peptides/proteins using machine learning

Author: Sequeira Ana Marta Fernandes Tavares
Publication date: 18/11/2019

Master's dissertation in Bioinformatics. One of the challenging problems in bioinformatics is to computationally characterize the sequences, structures and functions of proteins. Sequence-derived structural and physicochemical properties of proteins have been used in the development of machine learning models for protein-related problems. However, tools and platforms to calculate features and perform machine learning (ML) with proteins are scarce and limited in effectiveness, user-friendliness and capacity. Here, a generic, modular, automated platform for the classification of proteins based on their physicochemical properties using different ML algorithms is proposed. The tool, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate protein features, preprocess datasets, execute feature reduction and selection, perform clustering, train and optimize ML models and make predictions. As it is modular, the user retains the power to alter the code to fit specific needs. This platform was tested on the prediction of membrane-active anticancer and antimicrobial peptides and further used to explore viral fusion peptides. Membrane-interacting peptides play a crucial role in several biological processes. Fusion peptides are a subclass found in enveloped viruses that are particularly relevant for membrane fusion. Determining which properties characterize fusion peptides and distinguish them from other proteins is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, different feature extraction techniques and feature selection methods (resulting in a total of over 20 datasets), seven ML models were trained and tested, using cross-validation for error estimation and grid search for model selection. The different models, feature sets and feature selection techniques were compared. 
The best models obtained for distinct metrics were then used to predict the location of a known fusion peptide in a protein sequence from the Dengue virus. Feature importances were also analysed. The models obtained will be useful in future research and also provide biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool to perform ML-based protein classification and the first global analysis and prediction of viral fusion peptides using ML, reinforcing the usability and importance of ML in protein classification problems.

One of the most challenging problems in bioinformatics is the characterization of protein sequences, structures and functions. Physicochemical and structural properties derived from the protein sequence have been used in the development of machine learning (ML) models. However, tools to calculate these features are scarce and have limitations in terms of efficiency, ease of use and adaptability to different problems. Here, a generic, automated, modular platform for classifying proteins based on their physicochemical properties, using different ML algorithms, is described. The tool developed facilitates the main ML tasks and includes modules to read and alter sequences, calculate protein features, preprocess data, perform feature reduction and selection, run clustering, build ML models and make predictions. Because it is built in a modular way, the user retains the power to alter the code to meet specific needs. This platform was tested with anticancer and antimicrobial peptides and was further used to explore viral fusion peptides. 
Fusion peptides are a class of membrane-interacting peptides found in enveloped viruses that are particularly relevant for the fusion of the viral membrane with the host membrane. Determining which properties characterize them is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, four different feature extraction techniques and five different feature selection methods (a total of 24 datasets tested), seven ML models were trained and tested with 10-fold cross-validation and a grid search approach. The best models obtained, with MCC scores between 0.7 and 0.8 and accuracy between 0.85 and 0.9, were used to predict the location of a known fusion peptide in a sequence of the Dengue virus fusion protein. The models obtained for predicting the location of the fusion peptide are useful for future research and also provide biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool to perform protein classification with ML and the first global analysis of viral fusion peptides using ML-based methods, reinforcing the usability and importance of ML in protein classification problems.
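The abstract reports model quality as MCC (Matthews correlation coefficient) alongside accuracy. For readers unfamiliar with the metric, here is a minimal sketch of the standard binary MCC formula computed from confusion-matrix counts. The function names and the counts in the example are hypothetical and are not results from the dissertation.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention for the
    otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def accuracy_from_counts(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical balanced test set of 100 sequences with 10 errors in total.
example_mcc = matthews_corrcoef(tp=45, tn=45, fp=5, fn=5)
example_accuracy = accuracy_from_counts(tp=45, tn=45, fp=5, fn=5)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes differ in size.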

Exploring the Hidden Challenges Associated with the Evaluation of Multi-class Datasets using Multiple Classifiers

Author: Al-Jumeily D
Fergus P
Hussain A
Iram S
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

The optimization and evaluation of a pattern recognition system requires that problems such as multi-class and imbalanced datasets be addressed. This paper presents the classification of multi-class datasets, which present more challenges than binary-class datasets in machine learning. Furthermore, it argues that evaluating the performance of a classification model on multi-class imbalanced datasets in terms of a simple “accuracy rate” can provide misleading results. Other parameters, such as failure avoidance, true identification of positive and negative instances of a class, and class discrimination, are also very important. In this paper, we hypothesize that “misclassification of true positive patterns should not necessarily be categorized as false negative while evaluating a classifier for multi-class datasets”, a common practice observed in the existing literature. In order to address these hidden challenges for the generalization of a particular classifier, several evaluation metrics are compared for a multi-class dataset with four classes; three of them belong to different neurodegenerative diseases and one to control subjects. Three classifiers (linear discriminant, quadratic discriminant and Parzen) are selected to demonstrate the results with examples.
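To illustrate the point about a simple accuracy rate being misleading on imbalanced multi-class data, here is a minimal sketch that computes overall accuracy next to one-vs-rest sensitivity and specificity for each class. The class names, the class counts and the degenerate always-predict-control classifier are hypothetical and do not come from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(y_true, y_pred):
    """One-vs-rest sensitivity and specificity for every class label."""
    n = len(y_true)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        tn = n - tp - fn - fp
        report[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        }
    return report


# Hypothetical imbalanced four-class set: seven controls, one subject per disease.
y_true = ["control"] * 7 + ["disease_1", "disease_2", "disease_3"]
# A degenerate classifier that always answers "control".
y_pred = ["control"] * 10

overall = accuracy(y_true, y_pred)
report = per_class_report(y_true, y_pred)
```

Here the overall accuracy is 0.7 even though no disease subject is ever detected: sensitivity is 0.0 for every disease class, and specificity for the control class is 0.0.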

LJMU Research Online (Liverpool John Moores University)

Statistical Relational Learning for Proteomics: Function, Interactions and Evolution

Author: Teso Stefano
Publication venue: University of Trento
Publication date: 19/12/2013

In recent years, the field of Statistical Relational Learning (SRL) [1, 2] has produced new, powerful learning methods that are explicitly designed to solve complex problems, such as collective classification, multi-task learning and structured output prediction, and that natively handle relational data, noise, and partial information. Statistical-relational methods rely on some form of First-Order Logic as a general, expressive formal language to encode both the data instances and the relations or constraints between them. The latter encode background knowledge about the problem domain, and are used to restrict or bias the model search space according to the instructions of domain experts. The new tools developed within SRL make it possible to revisit old computational biology problems in a less ad hoc fashion, and to tackle novel, more complex ones. Motivated by these developments, in this thesis we describe and discuss the application of SRL to three important biological problems, highlighting the advantages, discussing the trade-offs, and pointing out the open problems. In particular, in Chapter 3 we show how to jointly improve the outputs of multiple correlated predictors of protein features by means of a very general probabilistic-logical consistency layer. The logical layer, based on grounding-specific Markov Logic networks [3], enforces a set of weighted first-order rules encoding biologically motivated constraints between the predictions. The refiner then improves the raw predictions so that they least violate the constraints. Contrary to canonical methods for the prediction of protein features, which typically take predicted correlated features as inputs to improve the output post facto, our method can jointly refine all predictions together, with potential gains in overall consistency. 
In order to showcase our method, we integrate three stand-alone predictors of correlated features, namely subcellular localization (Loctree [4]), disulfide bonding state (Disulfind [5]), and metal bonding state (MetalDetector [6]), in a way that takes into account their respective strengths and weaknesses. The experimental results show that the refiner can improve the performance of the underlying predictors by removing rule violations. In addition, the proposed method is fully general, and could in principle be applied to an array of heterogeneous predictions without requiring any change to the underlying software. In Chapter 4 we consider the multi-level protein–protein interaction (PPI) prediction problem. In general, PPIs can be seen as a hierarchical process occurring at three related levels: proteins bind by means of specific domains, which in turn form interfaces through patches of residues. Detailed knowledge about which domains and residues are involved in a given interaction has extensive applications in biology, including a better understanding of the binding process and more efficient drug/enzyme design. We cast the prediction problem in terms of multi-task learning, with one task per level (proteins, domains and residues), and propose a machine learning method that collectively infers the binding state of all object pairs, at all levels, concurrently. Our method is based on Semantic Based Regularization (SBR) [7], a flexible and theoretically sound SRL framework that employs First-Order Logic constraints to tie the learning tasks together. Contrary to most current PPI prediction methods, which neither identify which regions of a protein actually instantiate an interaction nor leverage the hierarchy of predictions, our method resolves the prediction problem down to the residue level, enforcing consistent predictions between the hierarchy levels, and fruitfully exploits the hierarchical nature of the problem. 
We present numerical results showing that our method substantially outperforms the baseline in several experimental settings, indicating that our multi-level formulation can indeed lead to better predictions. Finally, in Chapter 5 we consider the problem of predicting drug-resistant protein mutations through a combination of Inductive Logic Programming [8, 9] and Statistical Relational Learning. In particular, we focus on viral proteins: viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful to understand the virus adaptation mechanism and to design drugs that effectively counter potentially resistant mutants. We propose a simple approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug resistance, and uses them to generate a set of potentially resistant mutants. Learning a weighted combination of rules makes it possible to assign each generated mutant a resistance score, as predicted by the statistical relational model, and to select only the highest-scoring ones. Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.

Nucleotide Complementarity Features in the Design of Effective Artificial miRNAs

Author: Yan Yifei
Publication date: 01/04/2018

The importance of miRNA in gene regulation has been well established; however, the precise mechanism of its target recognition process is still not completely understood. Among the known factors, nucleotide complementarity, accessibility of the target sites, the concentration of the RNA species, and site cooperativity were deemed important. Using these known rules, we previously designed artificial miRNAs that inhibit cancer cell growth by repressing the expression of multiple genes. Such guide sequences were delivered into the cells in the form of shRNAs. HIV is an RNA virus. In my first project, we designed and tested guide RNAs that inhibit its replication by directly targeting the viral genome and the cellular factors that the virus requires. Using an updated version of the design program, mirBooking, we were able to predict the concentration effect of RNA species more accurately. The designed guide sequences provided cells with effective resistance against viral infection. The protection was equal to or better than that of sequences that target the viral genome directly via near-perfect complementarity. However, the repression levels of the viral and cellular factors could not be precisely predicted. In order to gain further insight into the rules of miRNA target recognition, the rules of base pairing beyond the seed were further investigated in my second project. 
By designing guide sequences that partially match the target and analysing the repression pattern, we established a unifying model of miRNA target recognition via the Ago2 protein. It shows that once the seed is base-paired with the target RNA, the formation of an RNA duplex is interrupted at the central portion of the guide strand but resumes further downstream of the central portion, following a distinct order. The implementation of the discovered rules in a computer program, MicroAlign, enhanced the design of efficient artificial miRNAs. In this study, we not only confirmed the contribution of non-seed nucleotides to the efficiency of miRNAs, but also quantitatively defined the way in which they work. The currently popular view that miRNAs can effectively target all genes equally with only seed matches may require careful re-examination.