321 research outputs found

    Hierarchical Classification Using Evolutionary Strategy

    Hierarchical classification is a problem with applications in many areas, such as protein function prediction, where the data are hierarchically structured. It is therefore necessary to develop algorithms able to induce hierarchical classification models. This paper presents experiments using a hierarchical classification algorithm called Hierarchical Classification using Evolutionary Strategy (HC-ES). It was tested on eight G-Protein-Coupled Receptor (GPCR) and Enzyme Commission Codes (EC) datasets. The results are compared with other hierarchical classifiers using the distance and hF-Measure.
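The abstract names the hF-Measure but does not define it. A minimal sketch follows, assuming the commonly used set-based definition in which each true and predicted class is extended with its ancestors before precision and recall are computed. The function names, the `parent` mapping, and the EC-style labels are illustrative assumptions, not taken from the paper.

```python
def with_ancestors(label, parent):
    """Return the label together with all of its ancestors.

    `parent` maps a class to its parent; top-level classes are absent
    from the mapping (assumed convention for this sketch).
    """
    classes = set()
    while label is not None:
        classes.add(label)
        label = parent.get(label)
    return classes


def hf_measure(true_labels, pred_labels, parent):
    """Hierarchical F-measure over paired true and predicted classes."""
    overlap = pred_total = true_total = 0
    for t, p in zip(true_labels, pred_labels):
        t_set = with_ancestors(t, parent)
        p_set = with_ancestors(p, parent)
        overlap += len(t_set & p_set)
        pred_total += len(p_set)
        true_total += len(t_set)
    hp = overlap / pred_total if pred_total else 0.0  # hierarchical precision
    hr = overlap / true_total if true_total else 0.0  # hierarchical recall
    return 2 * hp * hr / (hp + hr) if (hp + hr) else 0.0


# Hypothetical EC-style hierarchy: 1 -> {1.1 -> {1.1.1}, 1.2}
parent = {"1.1": "1", "1.2": "1", "1.1.1": "1.1"}
score = hf_measure(["1.1.1"], ["1.2"], parent)
```

Under this definition a prediction in the correct top-level branch earns partial credit, which a flat accuracy score would not give.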

    Improving the hierarchical classification of protein functions with swarm intelligence

    This thesis investigates methods to improve the performance of hierarchical classification. For the purposes of this thesis, hierarchical classification is a form of supervised learning where the classes in a data set are arranged in a tree structure. As a base for our new methods we use the TDDC (top-down divide-and-conquer) approach for hierarchical classification, where each classifier is built only to discriminate between sibling classes. Firstly, we propose a swarm intelligence technique which varies the types of classifiers used at each divide within the TDDC tree. Our technique, PSO/ACO-CS (Particle Swarm Optimisation/Ant Colony Optimisation Classifier Selection), finds combinations of classifiers to be used in the TDDC tree using the global search ability of PSO/ACO. Secondly, we propose a technique that attempts to mitigate a major drawback of the TDDC approach: if an example is misclassified at any point in the TDDC tree, it can never be correctly classified further down the tree. Our approach, PSO/ACO-RO (PSO/ACO-Recovery Optimisation), decides whether to redirect examples at a given classifier node using, again, the global search ability of PSO/ACO. Thirdly, we propose an ensemble-based technique, HEHRS (Hierarchical Ensembles of Hierarchical Rule Sets), which attempts to boost the accuracy at each classifier node in the TDDC tree by using information from classifiers (rule sets) in the rest of that tree. We use Particle Swarm Optimisation to weight the individual rules within each ensemble. We evaluate these three new methods on hierarchical bioinformatics datasets that we have created for this research. These data sets represent the real-world problem of protein function prediction. We find through extensive experimentation that the three proposed methods improve upon the baseline TDDC method to varying degrees. Overall, the HEHRS and PSO/ACO-CS-RO approaches are the most effective, although they are associated with a higher computational cost.
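The baseline TDDC idea described in this abstract can be sketched as follows. This is an illustrative sketch only: the nearest-centroid classifier is a stand-in for whatever flat classifier is placed at each node, and the `ROOT` sentinel, function names, and toy data are assumptions rather than details from the thesis.

```python
from collections import defaultdict


class NearestCentroid:
    """Tiny stand-in for any flat classifier (hypothetical, for illustration)."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[label] += 1
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


def path_to(label, parent):
    """Classes from the top level down to `label`."""
    path = []
    while label is not None:
        path.append(label)
        label = parent.get(label)
    return path[::-1]


def train_top_down(X, y, parent, root="ROOT"):
    """Train one classifier per non-leaf node, on sibling classes only."""
    data = defaultdict(lambda: ([], []))
    for x, label in zip(X, y):
        node = root
        for child in path_to(label, parent):
            data[node][0].append(x)      # example reaches this node
            data[node][1].append(child)  # target is the child it descends to
            node = child
    return {node: NearestCentroid().fit(xs, ys) for node, (xs, ys) in data.items()}


def predict_top_down(x, models, root="ROOT"):
    """Descend the tree; an early mistake cannot be undone further down."""
    node, path = root, []
    while node in models:
        node = models[node].predict_one(x)
        path.append(node)
    return path


# Toy two-level hierarchy: A -> {A.1, A.2}, B -> {B.1, B.2}
parent = {"A.1": "A", "A.2": "A", "B.1": "B", "B.2": "B"}
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
y = ["A.1", "A.2", "B.1", "B.2"]
models = train_top_down(X, y, parent)
predicted_path = predict_top_down((0.1, 0.9), models)
```

Because `predict_top_down` only ever consults the classifier at the node it has already chosen, a wrong decision at the root confines the example to the wrong subtree. That is the drawback the PSO/ACO-RO technique is said to address.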

    Novel approaches for hierarchical classification with case studies in protein function prediction

    A very large amount of research in the data mining, machine learning, statistical pattern recognition and related research communities has focused on flat classification problems. However, many problems in the real world, such as hierarchical protein function prediction, have their classes naturally organised into hierarchies. The task of hierarchical classification, however, needs to be better defined, as researchers in one application domain are often unaware of similar efforts developed in other research areas. The first contribution of this thesis is to survey the task of hierarchical classification across different application domains and present a unifying framework for the task. After clearly defining the problem, we explore novel approaches to the task. Based on the understanding gained by surveying the task of hierarchical classification, there are three major approaches to deal with hierarchical classification problems. The first approach is to use one of the many existing flat classification algorithms to predict only the leaf classes in the hierarchy. Note that, in the training phase, this approach completely ignores the hierarchical class relationships, i.e. the parent-child and sibling class relationships, but in the testing phase the ancestral classes of an instance can be inferred from its predicted leaf classes. The second approach is to build a set of local models, by training one flat classification algorithm for each local view of the hierarchy. The two main variations of this approach are: (a) training a local flat multi-class classifier at each non-leaf class node, where each classifier discriminates among the child classes of its associated class; or (b) training a local flat binary classifier at each node of the class hierarchy, where each classifier predicts whether or not a new instance has the classifier’s associated class.
    In both these variations, in the testing phase a procedure is used to combine the predictions of the set of local classifiers in a coherent way, avoiding inconsistent predictions. The third approach is to use a global-model hierarchical classification algorithm, which builds one single classification model by taking into account all the hierarchical class relationships in the training phase. In the context of this categorization of hierarchical classification approaches, the other contributions of this thesis are as follows. The second contribution of this thesis is a novel algorithm which is based on the local classifier per parent node approach. The novel algorithm is the selective representation approach, which automatically selects the best protein representation to use at each non-leaf class node. The third contribution is a global-model hierarchical classification extension of the well-known Naive Bayes algorithm. Given the good predictive performance of the global-model hierarchical classification Naive Bayes algorithm, we relax the Naive Bayes assumption that attributes are independent from each other given the class by using the concept of k dependencies. Hence, we extend the flat classification k-Dependence Bayesian network classifier to the task of hierarchical classification, which is the fourth contribution of this thesis. Both the proposed global-model hierarchical classification Naive Bayes and the proposed global-model hierarchical k-Dependence Bayesian network classifier have achieved predictive accuracies that were, overall, significantly higher than the predictive accuracies obtained by their corresponding local hierarchical classification versions, across a number of datasets for the task of hierarchical protein function prediction.
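The abstract does not say which procedure combines the local binary classifiers' outputs. One simple possibility, shown here purely as an assumed illustration, is to accept a class only when it and all of its ancestors are accepted. The scores, threshold, and class names below are hypothetical.

```python
def consistent_predictions(scores, parent, threshold=0.5):
    """Keep a class only if it and every ancestor score at or above threshold.

    `scores` holds one score per class node, as if produced by one local
    binary classifier per node; `parent` maps a class to its parent, with
    top-level classes absent from the mapping (assumed convention).
    """
    def accepted(cls):
        while cls is not None:
            if scores.get(cls, 0.0) < threshold:
                return False
            cls = parent.get(cls)
        return True

    return {cls for cls in scores if accepted(cls)}


# Hypothetical hierarchy: 1 -> {1.1, 1.2}, 2 -> {2.1}
parent = {"1.1": "1", "1.2": "1", "2.1": "2"}
scores = {"1": 0.9, "1.1": 0.7, "1.2": 0.2, "2": 0.3, "2.1": 0.8}
predicted = consistent_predictions(scores, parent)
```

In this example class `2.1` is dropped despite its high score, because its parent `2` was rejected. Without such a rule the output would assert a child class while denying its parent.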

    Computational Methods for the Analysis of Genomic Data and Biological Processes

    In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality

    Expert Knowledge-Guided Length-Variant Hierarchical Label Generation for Proposal Classification

    To advance the development of science and technology, research proposals are submitted to open-court competitive programs developed by government agencies (e.g., NSF). Proposal classification is one of the most important tasks in achieving effective and fair review assignments. Proposal classification aims to classify a proposal into a length-variant sequence of labels. In this paper, we formulate the proposal classification problem as a hierarchical multi-label classification task. Although there are certain prior studies, proposal classification exhibits unique features: 1) the classification result of a proposal is in a hierarchical discipline structure with different levels of granularity; 2) proposals contain multiple types of documents; 3) domain experts can empirically provide partial labels that can be leveraged to improve task performance. In this paper, we focus on developing a new deep proposal classification framework to jointly model the three features. In particular, to sequentially generate labels, we leverage previously generated labels to predict the label of the next level; to integrate partial labels from experts, we use the embedding of these empirical partial labels to initialize the state of neural networks. Our model can automatically identify the best length of label sequence at which to stop next-label prediction. Finally, we present extensive results to demonstrate that our method can jointly model partial labels, textual information, and semantic dependencies in label sequences, and thus achieve advanced performance. Comment: 10 pages, accepted as a regular paper by ICDM 202

    Predicting Transporter Proteins and Their Substrate Specificity

    The publication of numerous genome projects has resulted in an abundance of protein sequences, a significant number of which are still unannotated. Membrane proteins such as transporters, receptors, and enzymes are among the least characterized proteins due to their hydrophobic surfaces and lack of conformational stability. This research aims to build a proteome-wide system to determine transporter substrate specificity, which involves three phases: 1) distinguishing membrane proteins, 2) differentiating transporters from other functional types of membrane proteins, and 3) detecting the substrate specificity of the transporters. To distinguish membrane from non-membrane proteins, we propose a novel tool, TooT-M, that combines the predictions from transmembrane topology prediction tools and a selective set of classifiers where protein samples are represented by pseudo position-specific scoring matrix (Pse-PSSM) vectors. The results suggest that the proposed tool outperforms all state-of-the-art methods in terms of the overall accuracy and Matthews correlation coefficient (MCC). To distinguish transporters from other proteins, we propose an ensemble classifier, TooT-T, that is trained to optimally combine the predictions from homology annotation transfer and machine learning methods. The homology annotation transfer components detect transporters by searching against the transporter classification database (TCDB) using different thresholds. The machine learning methods include three models in which the protein sequences are encoded using a novel encoding, psi-composition. The results show that TooT-T outperforms all state-of-the-art de novo transporter predictors in terms of the overall accuracy and MCC. To detect the substrate specificity of a transporter, we propose a novel tool, TooT-SC, that combines compositional, evolutionary, and positional information to represent protein samples.
    TooT-SC can efficiently classify transport proteins into eleven classes according to their transported substrate, which is the highest number of predicted substrates offered by any de novo prediction tool. Our results indicate that TooT-SC significantly outperforms all of the state-of-the-art methods. Further analysis of the locations of the informative positions reveals that there are more statistically significant informative positions in the transmembrane segments (TMSs) than in the non-TMSs, and that more statistically significant informative positions occur close to the TMSs than in regions far from them.
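For readers unfamiliar with the Matthews correlation coefficient used to evaluate TooT-M and TooT-T, here is a minimal sketch of the standard binary formula, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The confusion-matrix counts are made up for illustration and are not results from the thesis.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention for
    the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: membrane (positive) vs non-membrane (negative)
perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
inverted = mcc(tp=0, tn=0, fp=50, fn=50)
```

Unlike overall accuracy, MCC stays near zero for a classifier that merely predicts the majority class, which is presumably why the abstract reports both measures.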

    Machine learning-based predictions of dietary restriction associations across ageing-related genes

    BACKGROUND: Dietary restriction (DR) is the most studied pro-longevity intervention; however, a complete understanding of its underlying mechanisms remains elusive, and new research directions may emerge from the identification of novel DR-related genes and DR-related genetic features. RESULTS: This work used a Machine Learning (ML) approach to classify ageing-related genes as DR-related or NotDR-related using 9 different types of predictive features: PathDIP pathways, two types of features based on KEGG pathways, two types of Protein–Protein Interactions (PPI) features, Gene Ontology (GO) terms, Genotype Tissue Expression (GTEx) expression features, GeneFriends co-expression features and protein sequence descriptors. Our findings suggested that features biased towards curated knowledge (i.e. GO terms and biological pathways) had the greatest predictive power, while unbiased features (mainly gene expression and co-expression data) had the least predictive power. Moreover, a combination of all the feature types diminished the predictive power compared to predictions based on curated knowledge. Feature importance analysis on the two most predictive classifiers mostly corroborated existing knowledge and supported recent findings linking DR to the Nuclear Factor Erythroid 2-Related Factor 2 (NRF2) signalling pathway and G protein-coupled receptors (GPCR). We then used the two strongest combinations of feature type and ML algorithm to predict DR-relatedness among ageing-related genes currently lacking DR-related annotations in the data, resulting in a set of promising candidate DR-related genes (GOT2, GOT1, TSC1, CTH, GCLM, IRS2 and SESN2) whose predicted DR-relatedness remains to be validated in future wet-lab experiments. CONCLUSIONS: This work demonstrated the strong potential of ML-based techniques to identify DR-associated features, as our findings are consistent with the literature and recent discoveries.
    Although the inference of new DR-related mechanistic findings based solely on GO terms and biological pathways was limited due to their knowledge-driven nature, the predictive power of these two feature types remained useful, as it allowed inferring new promising candidate DR-related genes. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04523-8

    Identifying the molecular components that matter: a statistical modelling approach to linking functional genomics data to cell physiology

    Functional genomics technologies, in which thousands of mRNAs, proteins, or metabolites can be measured in single experiments, have contributed to reshaping biological investigations. One of the most important issues in the analysis of the large datasets generated is the selection of relatively small subsets of variables that are predictive of the physiological state of a cell or tissue. In this thesis, a truly multivariate variable selection framework using diverse functional genomics data has been developed, characterized, and tested. This framework has also been used to prove that it is possible to predict the physiological state of the tumour from the molecular state of adjacent normal cells. This allows us to identify novel genes involved in cell-to-cell communication. Then, using a network inference technique, networks representing cell-cell communication in prostate cancer have been inferred. The analysis of these networks has revealed interesting properties that suggest a crucial role of directional signals in controlling the interplay between normal and tumour cell-to-cell communication. Experimental verification performed in our laboratory has provided evidence that one of the identified genes could be a novel tumour suppressor gene. In conclusion, the findings and methods reported in this thesis have contributed to further understanding of cell-to-cell interaction and multivariate variable selection, not only by applying and extending previous work, but also by proposing novel approaches that can be applied to any functional genomics data.

    Combining Ion Mobility Mass Spectrometry and Computational Methods to Study Structures of Biomolecules in the Gas Phase

    Characterizing the complex, dynamically regulated networks in cells is critical for the understanding of disease mechanisms and the development of therapeutics. Over the last two decades, mass spectrometry (MS) has emerged as a key structural biology tool enabling rapid analysis of complex samples. Native MS has had tremendous success in the structural elucidation of proteins, protein complexes, and protein-ligand interactions. Ion mobility MS (IM-MS), under the native MS category, has gained popularity as a structural biology technique capable of reporting the collision cross section (CCS) area of biomolecular ions, which can be used as an attribute for identification in bioinformatics workflows and as a restraint for generating three-dimensional models of proteins. Traveling wave IM (TWIM) is the most used IM platform across research and industry laboratories. However, the amount of information and the accuracy obtained from TWIM measurements have been compromised by the lack of fundamental understanding of the technology itself. Therefore, in this thesis, novel developments in IM-MS techniques, especially with TWIM, are described that are capable of providing accurate biophysical measurements of proteins and protein complexes in a high-throughput manner. In chapter 2, we devise a semi-empirical relationship that can model TWIM arrival time distributions (ATDs) across a range of TWIM conditions. A conformational broadening parameter can be extracted from the semi-empirical formalism that describes the size of the structural heterogeneity of biomolecules in the gas phase. We validated our method by investigating the origins of structural heterogeneity arising in a set of model peptides. The conformational broadening parameter also properly reflected the reduction in structural flexibility when we introduced cross-links in a protein complex. In chapter 3, we describe a novel pseudo-trapping phenomenon in TW ion guides that produces aberrant ATDs.
    This is described using a theoretical model and ion trajectory simulations, highlighting that imperfect TW leads to a repetitive pattern of ion motion, causing ions with even small mobility differences to travel with the same mean velocity. Consequently, the ions' transit times through the device were altered, detrimentally affecting the calibrated CCS values. In chapter 4, we show new calibration functions capable of generating precise and accurate CCS values from TWIM measurements. Velocity relaxation and travelling wave edge effects are incorporated into the new function, termed blend + radial, which outperforms the current calibration function in terms of accuracy, precision, and robustness. We benchmarked the new function using a large set of analyte ions comprising small molecules and metabolites, peptides, denatured proteins, and native proteins. The last chapter showcases the utility of the IM-MS platform for high-throughput characterization of protein structure and protein-ligand interactions using collision induced unfolding (CIU) experiments. A classification algorithm was built for single-state and multi-state classification of CIU fingerprints, where a state can be defined as charge states of the ions, protein incubation properties, etc. Using our classification workflow, we were able to identify the class of an unknown endogenous lipid in a membrane protein complex. The multi-state classifier boosted the accuracy of the classification model, which was demonstrated using Src-kinase ligand binding experiments and biotherapeutic innovator and biosimilar comparisons. Overall, the developments in the IM-MS methods, especially the theoretical contributions to TWIM technology, described in this thesis will allow the widespread TWIM community to properly utilize the platform in the areas of chemical analysis and structural biology.
    PhD, Chemistry, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/153475/1/sugyan_1.pd