
    Prediction of Protein Domain with mRMR Feature Selection and Analysis

    Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, effective methods for predicting protein domains from sequence information alone are highly desirable, because they would facilitate protein structure prediction and speed up functional annotation. Although many efforts have been made in this regard, predicting protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features describing physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, about 28–40% higher than that achieved by the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, which is consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may at the very least complement existing domain prediction methods, and that the findings about the features with the greatest impact on domain prediction might provide useful insights or clues for further experimental investigation in this area. Finally, the current approach could also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
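    The mRMR step named in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes absolute Pearson correlation as a stand-in for the mutual information normally used by mRMR, and the function names (`pearson`, `mrmr_rank`) and the toy data are hypothetical. An IFS step would then evaluate a classifier on growing prefixes of the returned ranking.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def mrmr_rank(columns, target):
    """Greedy mRMR ordering of feature columns.

    Assumption: relevance and redundancy are measured with |Pearson r|
    instead of mutual information. At each step the feature maximizing
    relevance minus mean redundancy with the selected set is added.
    """
    relevance = [abs(pearson(c, target)) for c in columns]
    remaining = list(range(len(columns)))
    selected = []
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = mean(
                [abs(pearson(columns[j], columns[k])) for k in selected]
            )
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: f1 is more relevant than f2 on its own,
# but it largely repeats what f0 already says.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # f0: most relevant
    [0, 0, 1, 1, 1, 1, 1, 1],  # f1: relevant but redundant with f0
    [1, 0, 0, 0, 1, 1, 1, 0],  # f2: less relevant, less redundant
]
ranking = mrmr_rank(columns, target)
print(ranking)  # prints [0, 2, 1]
```

    In this toy case a ranking by relevance alone would order the features f0, f1, f2; the redundancy penalty moves f2 ahead of f1.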

    A new pairwise kernel for biological network inference with support vector machines

    BACKGROUND: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). RESULTS: Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel. This new kernel for pairs can easily be used by most SVM implementations to solve problems of supervised classification and inference of pairwise relationships from heterogeneous data. We demonstrate, using several real biological networks and genomic datasets, that this approach often improves upon the state-of-the-art SVM for indirect inference with another pairwise kernel, and that the combination of both kernels always improves upon each individual kernel. CONCLUSION: The metric learning pairwise kernel is a new formulation for inferring pairwise relationships with SVMs, which provides state-of-the-art results for the inference of several biological networks from heterogeneous genomic data.
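    The abstract does not give the kernel formulas, so the following is only a sketch under stated assumptions: the metric learning pairwise kernel is taken in its usual published form, the square of K(a,c) - K(a,d) - K(b,c) + K(b,d) for a base kernel K on single nodes, and the "other pairwise kernel" is assumed to be the tensor product pairwise kernel. The linear base kernel and all function names are illustrative choices.

```python
def linear_kernel(u, v):
    """Base kernel on single nodes; a plain dot product is assumed here."""
    return sum(a * b for a, b in zip(u, v))


def mlpk(pair1, pair2, k=linear_kernel):
    """Metric learning pairwise kernel between node pairs (a, b) and (c, d)."""
    (a, b), (c, d) = pair1, pair2
    return (k(a, c) - k(a, d) - k(b, c) + k(b, d)) ** 2


def tppk(pair1, pair2, k=linear_kernel):
    """Tensor product pairwise kernel (assumed to be the indirect-inference kernel)."""
    (a, b), (c, d) = pair1, pair2
    return k(a, c) * k(b, d) + k(a, d) * k(b, c)


def combined(pair1, pair2, k=linear_kernel):
    """Sum of both kernels, one simple way to combine them."""
    return mlpk(pair1, pair2, k) + tppk(pair1, pair2, k)


# Hypothetical node feature vectors.
a, b = (1.0, 0.0), (0.0, 1.0)
c, d = (2.0, 0.0), (0.0, 2.0)

print(mlpk((a, b), (c, d)))      # 16.0
print(tppk((a, b), (c, d)))      # 4.0
print(combined((a, b), (c, d)))  # 20.0
```

    Both kernels are symmetric in the order of the nodes within a pair, which suits undirected edges; with a linear base kernel, `mlpk` reduces to the squared dot product of the difference vectors a - b and c - d.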

    Building an automated platform for the classification of peptides/proteins using machine learning

    Master's dissertation in Bioinformatics. One of the challenging problems in bioinformatics is to computationally characterize the sequences, structures and functions of proteins. Sequence-derived structural and physicochemical properties of proteins have been used to develop machine learning models for protein-related problems. However, tools and platforms to calculate features and perform machine learning (ML) with proteins are scarce and limited in effectiveness, user-friendliness and capacity. Here, a generic, modular, automated platform for the classification of proteins based on their physicochemical properties using different ML algorithms is proposed. The tool, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate protein features, preprocess datasets, perform feature reduction and selection, perform clustering, train and optimize ML models, and make predictions. As it is modular, the user retains the power to alter the code to fit specific needs. The platform was tested on the prediction of membrane-active anticancer and antimicrobial peptides and further used to explore viral fusion peptides. Membrane-interacting peptides play a crucial role in several biological processes. Fusion peptides are a subclass found in enveloped viruses that are particularly relevant for the fusion of the viral membrane with the host membrane. Determining which properties characterize fusion peptides and distinguish them from other proteins is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, four feature extraction techniques and five feature selection methods (24 datasets in total), seven ML models were trained and tested, using 10-fold cross-validation for error estimation and grid search for model selection. The different models, feature sets and feature selection techniques were compared. The best models obtained, with MCC scores between 0.7 and 0.8 and accuracy between 0.85 and 0.9, were then used to predict the location of a known fusion peptide in the fusion protein sequence of the Dengue virus. Feature importances were also analysed. The models obtained will be useful in future research and also provide biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool for ML-based protein classification and the first global analysis and prediction of viral fusion peptides using ML, reinforcing the usability and importance of ML in protein classification problems.
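    Since the models in this abstract are scored with MCC, the Matthews correlation coefficient, a minimal sketch of that metric for a binary classifier may help. The function name and the confusion-matrix counts are made up for illustration and are not taken from the dissertation.

```python
from math import sqrt


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / sqrt(denom)


# Hypothetical counts for 100 test peptides.
tp, tn, fp, fn = 40, 45, 5, 10
accuracy = (tp + tn) / (tp + tn + fp + fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy={accuracy:.2f} mcc={mcc:.3f}")
```

    Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes differ in size.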

    Exploring the Hidden Challenges Associated with the Evaluation of Multi-class Datasets using Multiple Classifiers

    Optimizing and evaluating a pattern recognition system requires addressing problems such as multi-class and imbalanced datasets. This paper presents the classification of multi-class datasets, which pose more challenges than binary-class datasets in machine learning. Furthermore, it argues that evaluating a classification model on multi-class imbalanced datasets in terms of a simple "accuracy rate" can give misleading results. Other parameters, such as failure avoidance, true identification of the positive and negative instances of a class, and class discrimination, are also very important. We hypothesize that "misclassification of true positive patterns should not necessarily be categorized as false negative while evaluating a classifier for multi-class datasets", a common practice observed in the existing literature. To address these hidden challenges for the generalization of a particular classifier, several evaluation metrics are compared on a multi-class dataset with four classes: three belong to different neurodegenerative diseases and one to control subjects. Three classifiers (linear discriminant, quadratic discriminant and Parzen) are selected to demonstrate the results with examples.
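    The point about a misleading accuracy rate can be made concrete with a small sketch. The confusion matrix below is invented for illustration (one large control class and three small disease classes, mirroring the four-class setup described) and is not data from the paper; the one-vs-rest counting shown is the common practice the abstract questions, not the authors' proposed alternative.

```python
def per_class_counts(confusion, k):
    """One-vs-rest TP, FP, FN, TN for class k (rows = true, columns = predicted)."""
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[r][k] for r in range(n)) - tp
    tn = sum(sum(row) for row in confusion) - tp - fn - fp
    return tp, fp, fn, tn


def accuracy(confusion):
    """Fraction of all samples on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


def recalls(confusion):
    """Per-class recall TP / (TP + FN)."""
    out = []
    for k in range(len(confusion)):
        tp, _, fn, _ = per_class_counts(confusion, k)
        out.append(tp / (tp + fn))
    return out


# Hypothetical imbalanced data: class 0 = control, classes 1-3 = diseases.
confusion = [
    [90, 0, 0, 0],
    [8, 2, 0, 0],
    [7, 0, 3, 0],
    [9, 0, 0, 1],
]
acc = accuracy(confusion)
rec = recalls(confusion)
balanced = sum(rec) / len(rec)
print(f"accuracy={acc:.2f} recalls={rec} balanced accuracy={balanced:.2f}")
```

    Here overall accuracy is 0.80 even though at most 30% of the samples in any disease class are recognized; the mean per-class recall is only 0.40.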

    Statistical Relational Learning for Proteomics: Function, Interactions and Evolution

    In recent years, the field of Statistical Relational Learning (SRL) [1, 2] has produced new, powerful learning methods that are explicitly designed to solve complex problems, such as collective classification, multi-task learning and structured output prediction, and that natively handle relational data, noise, and partial information. Statistical-relational methods rely on some form of First-Order Logic as a general, expressive formal language to encode both the data instances and the relations or constraints between them. The latter encode background knowledge about the problem domain and are used to restrict or bias the model search space according to the instructions of domain experts. The new tools developed within SRL make it possible to revisit old computational biology problems in a less ad hoc fashion, and to tackle novel, more complex ones. Motivated by these developments, in this thesis we describe and discuss the application of SRL to three important biological problems, highlighting the advantages, discussing the trade-offs, and pointing out the open problems. In particular, in Chapter 3 we show how to jointly improve the outputs of multiple correlated predictors of protein features by means of a very general probabilistic-logical consistency layer. The logical layer, based on grounding-specific Markov Logic networks [3], enforces a set of weighted first-order rules encoding biologically motivated constraints between the predictions. The refiner then improves the raw predictions so that they violate the constraints as little as possible. In contrast to canonical methods for the prediction of protein features, which typically take predicted correlated features as inputs to improve the output post facto, our method can refine all predictions jointly, with potential gains in overall consistency. To showcase our method, we integrate three stand-alone predictors of correlated features, namely subcellular localization (Loctree [4]), disulfide bonding state (Disulfind [5]), and metal bonding state (MetalDetector [6]), in a way that takes into account their respective strengths and weaknesses. The experimental results show that the refiner can improve the performance of the underlying predictors by removing rule violations. In addition, the proposed method is fully general, and could in principle be applied to an array of heterogeneous predictions without requiring any change to the underlying software. In Chapter 4 we consider the multi-level protein–protein interaction (PPI) prediction problem. In general, PPIs can be seen as a hierarchical process occurring at three related levels: proteins bind by means of specific domains, which in turn form interfaces through patches of residues. Detailed knowledge about which domains and residues are involved in a given interaction has extensive applications in biology, including a better understanding of the binding process and more efficient drug and enzyme design. We cast the prediction problem in terms of multi-task learning, with one task per level (proteins, domains and residues), and propose a machine learning method that collectively infers the binding state of all object pairs, at all levels, concurrently. Our method is based on Semantic Based Regularization (SBR) [7], a flexible and theoretically sound SRL framework that employs First-Order Logic constraints to tie the learning tasks together. In contrast to most current PPI prediction methods, which neither identify which regions of a protein actually instantiate an interaction nor leverage the hierarchy of predictions, our method resolves the prediction problem down to the residue level, enforces consistent predictions between the hierarchy levels, and fruitfully exploits the hierarchical nature of the problem. We present numerical results showing that our method substantially outperforms the baseline in several experimental settings, indicating that our multi-level formulation can indeed lead to better predictions. Finally, in Chapter 5 we consider the problem of predicting drug-resistant protein mutations through a combination of Inductive Logic Programming [8, 9] and Statistical Relational Learning. In particular, we focus on viral proteins: viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful for understanding the virus adaptation mechanism and for designing drugs that effectively counter potentially resistant mutants. We propose a simple approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug resistance, and uses them to generate a set of potentially resistant mutants. Learning a weighted combination of rules makes it possible to attach to each generated mutant a resistance score predicted by the statistical relational model and to select only the highest-scoring ones. Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.

    Nucleotide Complementarity Features in the Design of Effective Artificial miRNAs

    The importance of miRNA in gene regulation has been well established; however, the precise mechanism of its target recognition process is still not completely understood. Among the known factors, nucleotide complementarity, accessibility of the target sites, the concentration of the RNA species, and site cooperativity have been deemed important. Using these known rules, we previously designed artificial miRNAs that inhibit cancer cell growth by repressing the expression of multiple genes. Such guide sequences were delivered into the cells in the form of shRNAs. Because HIV is an RNA virus, in my first project we designed and tested guide RNAs that inhibit its replication by directly targeting the viral genome and the cellular factors that the virus requires. Using an updated version of the design program, mirBooking, we were able to predict the concentration effect of RNA species more accurately. The designed guide sequences provided cells with effective resistance against viral infection, equal to or better than that provided by guides targeting the viral genome directly via near-perfect complementarity. However, the repression levels of the viral and cellular factors could not be precisely predicted. To gain further insight into the rules of miRNA target recognition, the rules of base pairing beyond the seed were investigated further in my second project. By designing guide sequences that partially match the target and analysing the repression pattern, we established a unifying model of miRNA target recognition via the Ago2 protein. It shows that once the seed is base-paired with the target RNA, the formation of an RNA duplex is interrupted at the central portion of the guide strand but resumes further downstream of the central portion, following a distinct order. Implementing the discovered rules in a computer program, MicroAlign, improved the design of efficient artificial miRNAs. In this study, we not only confirmed the contribution of non-seed nucleotides to the efficiency of miRNAs, but also defined quantitatively how they work. The currently popular view that miRNAs can effectively target all genes equally with only seed matches may require careful re-examination.