605 research outputs found
Prediction of Protein Domain with mRMR Feature Selection and Analysis
Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate protein structure prediction and speed up functional annotation. However, although many efforts have been made in this regard, predicting protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than that of the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
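The mRMR step named in this abstract can be illustrated with a minimal sketch. It assumes discrete-valued features, estimates mutual information from raw counts, and uses the common "relevance minus mean redundancy" greedy criterion; the feature names and toy data are hypothetical, and this is not the authors' implementation. In an IFS procedure, a classifier such as a random forest would then be evaluated on the top 1, 2, 3, ... features of the resulting ranking.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels):
    """Greedy mRMR ordering of feature names.

    features: dict mapping a feature name to its list of discrete values.
    Each step picks the feature with the highest relevance to the labels
    minus its mean redundancy with the features already selected.
    """
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: one informative feature, an exact copy of it,
# and a weaker but non-redundant feature.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "conservation": [0, 0, 0, 1, 1, 1, 1, 1],
    "conservation_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "disorder": [0, 1, 0, 0, 1, 1, 0, 1],
}
ranking = mrmr_rank(toy_features, labels)
```

In this toy case the duplicated feature is pushed to the end of the ranking even though its relevance equals that of the top feature, which is the redundancy penalty at work.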
A new pairwise kernel for biological network inference with support vector machines
BACKGROUND: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). RESULTS: Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel. This new kernel for pairs can easily be used by most SVM implementations to solve problems of supervised classification and inference of pairwise relationships from heterogeneous data. We demonstrate, using several real biological networks and genomic datasets, that this approach often improves upon the state-of-the-art SVM for indirect inference with another pairwise kernel, and that the combination of both kernels always improves upon each individual kernel. CONCLUSION: The metric learning pairwise kernel is a new formulation to infer pairwise relationships with SVM, which provides state-of-the-art results for the inference of several biological networks from heterogeneous genomic data.
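The two kernels for pairs mentioned in this abstract can be sketched as follows. The abstract does not state the formulas; the forms below are the ones usually given for the metric learning pairwise kernel (MLPK) and the tensor product pairwise kernel (TPPK), and should be checked against the paper. The linear base kernel and the example vectors are assumptions made for illustration only.

```python
def linear_kernel(u, v):
    """Base kernel between two nodes, here a plain dot product (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def mlpk(pair1, pair2, base=linear_kernel):
    """Metric learning pairwise kernel (direct inference).

    K((a,b),(c,d)) = (K(a,c) - K(a,d) - K(b,c) + K(b,d))^2
    With a linear base kernel this equals ((a - b) . (c - d))^2.
    """
    (a, b), (c, d) = pair1, pair2
    return (base(a, c) - base(a, d) - base(b, c) + base(b, d)) ** 2


def tppk(pair1, pair2, base=linear_kernel):
    """Tensor product pairwise kernel (indirect inference).

    K((a,b),(c,d)) = K(a,c) K(b,d) + K(a,d) K(b,c)
    """
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)


# Hypothetical node feature vectors.
a, b, c, d = (1, 0), (0, 1), (2, 0), (0, 3)
k_direct = mlpk((a, b), (c, d))
k_indirect = tppk((a, b), (c, d))
```

Both functions are symmetric in the order of the nodes within a pair, so a Gram matrix built from either one can be passed to any SVM implementation that accepts a precomputed kernel.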
Building an automated platform for the classification of peptides/proteins using machine learning
Master's dissertation in Bioinformatics. One of the challenging problems in bioinformatics is to computationally characterize sequences, structures and functions of proteins. Sequence-derived structural and physicochemical properties of proteins have been used in the development of machine learning models for protein-related problems. However, tools and platforms to calculate features and perform machine learning (ML) with proteins are scarce and have limitations in terms of effectiveness, user-friendliness and capacity. Here, a generic modular automated platform for the classification of proteins based on their physicochemical properties using different ML algorithms is proposed. The tool, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate protein features, preprocess datasets, execute feature reduction and selection, perform clustering, train and optimize ML models and make predictions. As it is modular, the user retains the power to alter the code to fit specific needs. This platform was tested to predict membrane-active anticancer and antimicrobial peptides and further used to explore viral fusion peptides. Membrane-interacting peptides play a crucial role in several biological processes. Fusion peptides are a subclass found in enveloped viruses that are particularly relevant for membrane fusion. Determining which properties characterize fusion peptides and distinguish them from other proteins is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, different feature extraction techniques and feature selection methods (resulting in a total of over 20 datasets), seven ML models were trained and tested, using cross-validation for error estimation and grid search for model selection. The different models, feature sets and feature selection techniques were compared.
The best models obtained for distinct metrics were then used to predict the location of a known fusion peptide in a protein sequence from the Dengue virus. Feature importances were also analysed. The models obtained will be useful in future research, also providing biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool to perform ML-based protein classification and the first global analysis and prediction of viral fusion peptides using ML, reinforcing the usability and importance of ML in protein classification problems.
One of the most challenging problems in bioinformatics is the characterization of protein sequences, structures and functions. Physicochemical and structural properties derived from the protein sequence have been used in the development of machine learning (ML) models. However, tools to calculate these features are scarce and have limitations in terms of efficiency, ease of use and adaptability to different problems. Here, a generic and automated modular platform is described for the classification of proteins based on their physicochemical properties, making use of different ML algorithms. The tool developed facilitates the main ML tasks and includes modules to read and alter sequences, calculate protein features, preprocess data, perform feature reduction and selection, run clustering, build ML models and make predictions. As it is built in a modular way, the user retains the power to alter the code to meet specific needs. This platform was tested with anticancer and antimicrobial peptides and was further used to explore viral fusion peptides. Fusion peptides are a class of membrane-interacting peptides found in enveloped viruses that are particularly relevant for the fusion of the viral membrane with the host membrane. Determining which properties characterize them is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, four different feature extraction techniques and five different feature selection methods (24 datasets tested in total), seven ML models were trained and tested with 10-fold cross-validation and a grid search approach. The best models obtained, with MCC scores between 0.7 and 0.8 and accuracy between 0.85 and 0.9, were used to predict the location of a known fusion peptide in a sequence of the Dengue virus fusion protein. The models obtained to predict the location of the fusion peptide are useful for future research and also provide biological insight into its distinctive physicochemical characteristics. This work presents a freely available tool for ML-based protein classification and the first global analysis of viral fusion peptides using ML-based methods, reinforcing the usability and importance of ML in protein classification problems.
Exploring the Hidden Challenges Associated with the Evaluation of Multi-class Datasets using Multiple Classifiers
The optimization and evaluation of a pattern recognition system requires that different problems, such as multi-class and imbalanced datasets, be addressed. This paper presents the classification of multi-class datasets, which present more challenges than binary-class datasets in machine learning. Furthermore, it argues that evaluating the performance of a classification model for multi-class imbalanced datasets in terms of a simple "accuracy rate" can provide misleading results. Other parameters, such as failure avoidance, true identification of positive and negative instances of a class, and class discrimination, are also very important. In this paper, we hypothesize that "misclassification of true positive patterns should not necessarily be categorized as false negative while evaluating a classifier for multi-class datasets", a common practice that has been observed in the existing literature. In order to address these hidden challenges for the generalization of a particular classifier, several evaluation metrics are compared for a multi-class dataset with four classes; three of them belong to different neurodegenerative diseases and one to control subjects. Three classifiers (linear discriminant, quadratic discriminant and Parzen) are selected to demonstrate the results with examples.
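The point about a simple accuracy rate being misleading on imbalanced multi-class data can be made concrete with a small sketch. It uses the conventional one-vs-rest counting, which is the practice the abstract questions, not the alternative the paper proposes. The confusion matrix is invented for illustration (one large control class and three small disease classes) and is not taken from the paper.

```python
def one_vs_rest_counts(confusion, k):
    """Return (tp, fn, fp, tn) for class k; rows are true classes, columns predicted."""
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return tp, fn, fp, tn


def accuracy(confusion):
    """Overall fraction of correctly classified samples."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


def sensitivity_specificity(confusion, k):
    """Per-class sensitivity and specificity under one-vs-rest counting."""
    tp, fn, fp, tn = one_vs_rest_counts(confusion, k)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


# Invented example: class 0 = controls (90 samples), classes 1-3 = diseases (5 each).
confusion = [
    [88, 1, 1, 0],
    [4, 1, 0, 0],
    [3, 0, 2, 0],
    [5, 0, 0, 0],
]
overall = accuracy(confusion)
per_class = [sensitivity_specificity(confusion, k) for k in range(4)]
```

Here the overall accuracy is about 0.87, yet the sensitivity for the last disease class is zero, because every one of its samples was assigned to the control class.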
Statistical Relational Learning for Proteomics: Function, Interactions and Evolution
In recent years, the field of Statistical Relational Learning (SRL) [1, 2] has produced new, powerful learning methods that are explicitly designed to solve complex problems, such as collective classification, multi-task learning and structured output prediction, which natively handle relational data, noise, and partial information. Statistical-relational methods rely on some First-Order Logic as a general, expressive formal language to encode both the data instances and the relations or constraints between them. The latter encode background knowledge on the problem domain, and are used to restrict or bias the model search space according to the instructions of domain experts. The new tools developed within SRL make it possible to revisit old computational biology problems in a less ad hoc fashion, and to tackle novel, more complex ones. Motivated by these developments, in this thesis we describe and discuss the application of SRL to three important biological problems, highlighting the advantages, discussing the trade-offs, and pointing out the open problems.
In particular, in Chapter 3 we show how to jointly improve the outputs of multiple correlated predictors of protein features by means of a very general probabilistic-logical consistency layer. The logical layer, based on grounding-specific Markov Logic networks [3], enforces a set of weighted first-order rules encoding biologically motivated constraints between the predictions. The refiner then improves the raw predictions so that they least violate the constraints. Contrary to canonical methods for the prediction of protein features, which typically take predicted correlated features as inputs to improve the output post facto, our method can jointly refine all predictions together, with potential gains in overall consistency. In order to showcase our method, we integrate three stand-alone predictors of correlated features, namely subcellular localization (Loctree [4]), disulfide bonding state (Disulfind [5]), and metal bonding state (MetalDetector [6]), in a way that takes into account the respective strengths and weaknesses. The experimental results show that the refiner can improve the performance of the underlying predictors by removing rule violations. In addition, the proposed method is fully general, and could in principle be applied to an array of heterogeneous predictions without requiring any change to the underlying software.
In Chapter 4 we consider the multi-level protein–protein interaction (PPI) prediction problem. In general, PPIs can be seen as a hierarchical process occurring at three related levels: proteins bind by means of specific domains, which in turn form interfaces through patches of residues. Detailed knowledge about which domains and residues are involved in a given interaction has extensive applications to biology, including better understanding of the binding process and more efficient drug/enzyme design. We cast the prediction problem in terms of multi-task learning, with one task per level (proteins, domains and residues), and propose a machine learning method that collectively infers the binding state of all object pairs, at all levels, concurrently. Our method is based on Semantic Based Regularization (SBR) [7], a flexible and theoretically sound SRL framework that employs First-Order Logic constraints to tie the learning tasks together. In contrast to most current PPI prediction methods, which neither identify which regions of a protein actually instantiate an interaction nor leverage the hierarchy of predictions, our method resolves the prediction problem up to residue level, enforcing consistent predictions between the hierarchy levels, and fruitfully exploits the hierarchical nature of the problem. We present numerical results showing that our method substantially outperforms the baseline in several experimental settings, indicating that our multi-level formulation can indeed lead to better predictions.
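The idea of consistent predictions between hierarchy levels can be illustrated with a minimal sketch. It assumes one plausible reading of the hierarchy, namely that two proteins are predicted to bind exactly when at least one pair of their domains is predicted to bind, and it checks that rule as a hard Boolean condition. The SBR framework described above uses First-Order Logic constraints during learning rather than a post hoc check, so the function, identifiers and data below are hypothetical and purely illustrative.

```python
def hierarchy_violations(protein_bound, domain_bound, domains_of):
    """List protein pairs whose prediction disagrees with their domain-level predictions.

    protein_bound: dict mapping (protein, protein) to a predicted bool.
    domain_bound:  dict mapping (domain, domain) to a predicted bool;
                   missing pairs are treated as not bound.
    domains_of:    dict mapping a protein to the list of its domains.
    """
    violations = []
    for (p, q), bound in protein_bound.items():
        any_domain_pair = any(
            domain_bound.get((d, e), False)
            for d in domains_of[p]
            for e in domains_of[q]
        )
        if bound != any_domain_pair:
            violations.append((p, q))
    return violations


# Hypothetical predictions at the protein and domain levels.
domains_of = {"P1": ["d1", "d2"], "P2": ["d3"], "P3": ["d4"]}
protein_bound = {("P1", "P2"): True, ("P1", "P3"): False, ("P2", "P3"): False}
domain_bound = {("d1", "d3"): False, ("d2", "d3"): False, ("d2", "d4"): True}
inconsistent = hierarchy_violations(protein_bound, domain_bound, domains_of)
```

The same pattern would extend one level down, with a domain pair considered bound when at least one of its residue pairs is predicted to be in contact.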
Finally, in Chapter 5 we consider the problem of predicting drug-resistant protein mutations through a combination of Inductive Logic Programming [8, 9] and Statistical Relational Learning. In particular, we focus on viral proteins: viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful to understand the virus adaptation mechanism and to design drugs that effectively counter potentially resistant mutants. We propose a simple approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug resistance, and uses them to generate a set of potentially resistant mutants. Learning a weighted combination of rules makes it possible to attach to each generated mutant a resistance score, as predicted by the statistical relational model, and to select only the highest-scoring ones. Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.
Nucleotide Complementarity Features in the Design of Effective Artificial miRNAs
The importance of miRNA in gene regulation has been well established; however, the precise mechanism of its target recognition process is still not completely understood. Among the known factors, nucleotide complementarity, accessibility of the target sites, the concentration of the RNA species, and site cooperativity were deemed important. Using these known rules, we previously designed artificial miRNAs that inhibit cancer cell growth by repressing the expression of multiple genes. Such guide sequences were delivered into the cells in the form of shRNAs.
HIV is an RNA virus. In my first project, we designed and tested guide RNAs that inhibit its replication by directly targeting the viral genome and the cellular factors that the virus requires. Using an updated version of the design program, mirBooking, we became able to predict the concentration effect of RNA species more accurately. The designed guide sequences provided cells with effective resistance against viral infection, equal to or better than that of guides that target the viral genome directly via near-perfect complementarity. However, the repression levels of the viral and cellular factors could not be precisely predicted. In order to gain further insight into the rules of miRNA target recognition, the rules of base pairing beyond the seed were further investigated in my second project. By designing guide sequences that partially match the target and analysing the repression pattern, we established a unifying model of miRNA target recognition via the Ago2 protein. It shows that once the seed is base-paired with the target RNA, the formation of an RNA duplex is interrupted at the central portion of the guide strand but resumes further downstream of the central portion following a distinct order. The implementation of the discovered rules in a computer program, MicroAlign, enhanced the design of efficient artificial miRNAs.
In this study, we not only confirmed the contribution of non-seed nucleotides to the efficiency of miRNAs, but also quantitatively defined the way in which they work. The currently popular view that miRNAs can effectively target all genes equally with only seed matches may require careful re-examination.