10,111 research outputs found

    A study of hierarchical and flat classification of proteins

    Get PDF
    Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article we investigate empirically whether this is the case for two such hierarchies. We compare multi-class classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multi-class settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data, but not in the case of the protein classification problems. Based on this we recommend that strong flat multi-class methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area

    On Machine-Learned Classification of Variable Stars with Sparse and Noisy Time-Series Data

    Full text link
    With the coming data deluge from synoptic surveys, there is a growing need for frameworks that can quickly and automatically produce calibrated classification probabilities for newly-observed variables based on a small number of time-series measurements. In this paper, we introduce a methodology for variable-star classification, drawing from modern machine-learning techniques. We describe how to homogenize the information gleaned from light curves by selection and computation of real-numbered metrics ("feature"), detail methods to robustly estimate periodic light-curve features, introduce tree-ensemble methods for accurate variable star classification, and show how to rigorously evaluate the classification results using cross validation. On a 25-class data set of 1542 well-studied variable stars, we achieve a 22.8% overall classification error using the random forest classifier; this represents a 24% improvement over the best previous classifier on these data. This methodology is effective for identifying samples of specific science classes: for pulsational variables used in Milky Way tomography we obtain a discovery efficiency of 98.2% and for eclipsing systems we find an efficiency of 99.1%, both at 95% purity. We show that the random forest (RF) classifier is superior to other machine-learned methods in terms of accuracy, speed, and relative immunity to features with no useful class information; the RF classifier can also be used to estimate the importance of each feature in classification. Additionally, we present the first astronomical use of hierarchical classification methods to incorporate a known class taxonomy in the classifier, which further reduces the catastrophic error rate to 7.8%. Excluding low-amplitude sources, our overall error rate improves to 14%, with a catastrophic error rate of 3.5%.Comment: 23 pages, 9 figure

    Online Robot Introspection via Wrench-based Action Grammars

    Full text link
    Robotic failure is all too common in unstructured robot tasks. Despite well-designed controllers, robots often fail due to unexpected events. How do robots measure unexpected events? Many do not. Most robots are driven by the sense-plan act paradigm, however more recently robots are undergoing a sense-plan-act-verify paradigm. In this work, we present a principled methodology to bootstrap online robot introspection for contact tasks. In effect, we are trying to enable the robot to answer the question: what did I do? Is my behavior as expected or not? To this end, we analyze noisy wrench data and postulate that the latter inherently contains patterns that can be effectively represented by a vocabulary. The vocabulary is generated by segmenting and encoding the data. When the wrench information represents a sequence of sub-tasks, we can think of the vocabulary forming a sentence (set of words with grammar rules) for a given sub-task; allowing the latter to be uniquely represented. The grammar, which can also include unexpected events, was classified in offline and online scenarios as well as for simulated and real robot experiments. Multiclass Support Vector Machines (SVMs) were used offline, while online probabilistic SVMs were are used to give temporal confidence to the introspection result. The contribution of our work is the presentation of a generalizable online semantic scheme that enables a robot to understand its high-level state whether nominal or abnormal. It is shown to work in offline and online scenarios for a particularly challenging contact task: snap assemblies. We perform the snap assembly in one-arm simulated and real one-arm experiments and a simulated two-arm experiment. This verification mechanism can be used by high-level planners or reasoning systems to enable intelligent failure recovery or determine the next most optima manipulation skill to be used.Comment: arXiv admin note: substantial text overlap with arXiv:1609.0494

    Multi-TGDR: a regularization method for multi-class classification in microarray experiments

    Get PDF
    Background With microarray technology becoming mature and popular, the selection and use of a small number of relevant genes for accurate classification of samples is a hot topic in the circles of biostatistics and bioinformatics. However, most of the developed algorithms lack the ability to handle multiple classes, which arguably a common application. Here, we propose an extension to an existing regularization algorithm called Threshold Gradient Descent Regularization (TGDR) to specifically tackle multi-class classification of microarray data. When there are several microarray experiments addressing the same/similar objectives, one option is to use meta-analysis version of TGDR (Meta-TGDR), which considers the classification task as combination of classifiers with the same structure/model while allowing the parameters to vary across studies. However, the original Meta-TGDR extension did not offer a solution to the prediction on independent samples. Here, we propose an explicit method to estimate the overall coefficients of the biomarkers selected by Meta-TGDR. This extension permits broader applicability and allows a comparison between the predictive performance of Meta-TGDR and TGDR using an independent testing set. Results Using real-world applications, we demonstrated the proposed multi-TGDR framework works well and the number of selected genes is less than the sum of all individualized binary TGDRs. Additionally, Meta-TGDR and TGDR on the batch-effect adjusted pooled data approximately provided same results. By adding Bagging procedure in each application, the stability and good predictive performance are warranted. Conclusions Compared with Meta-TGDR, TGDR is less computing time intensive, and requires no samples of all classes in each study. On the adjusted data, it has approximate same predictive performance with Meta-TGDR. Thus, it is highly recommended
    corecore