2,654 research outputs found

    Reducing the Effects of Detrimental Instances

    Not all instances in a data set are equally beneficial for inducing a model of the data. Some instances (such as outliers or noise) can be detrimental. However, at least initially, the instances in a data set are generally treated equally by machine learning algorithms. Many current approaches for handling noisy and detrimental instances make a binary decision about whether an instance is detrimental or not. In this paper, we 1) extend this paradigm by weighting the instances on a continuous scale and 2) present a methodology for measuring how detrimental an instance may be for inducing a model of the data. We call our method of identifying and weighting detrimental instances reduced detrimental instance learning (RDIL). We examine RDIL on a set of 54 data sets and 5 learning algorithms and compare RDIL with other weighting and filtering approaches. RDIL is especially useful for learning algorithms where every instance can affect the classification boundary and the training instances are considered individually, such as multilayer perceptrons trained with backpropagation (MLPs). Our results also suggest that a more accurate estimate of which instances are detrimental can have a significant positive impact on handling them. Comment: 6 pages, 5 tables, 2 figures. arXiv admin note: substantial text overlap with arXiv:1403.189
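The contrast between a binary filtering decision and a continuous instance weight can be illustrated with a minimal Python sketch. It is not the RDIL procedure from the paper: the detriment scores, the function names, and the 0.5 threshold are assumptions made purely for illustration.

```python
from typing import List, Sequence


def continuous_weights(detriment: Sequence[float]) -> List[float]:
    """Map hypothetical detriment scores in [0, 1] to instance weights.

    A score of 0 keeps the instance at full weight; a score of 1 removes
    its influence entirely. Scores outside [0, 1] are clipped.
    """
    return [1.0 - min(1.0, max(0.0, d)) for d in detriment]


def binary_filter(detriment: Sequence[float], threshold: float = 0.5) -> List[float]:
    """Binary alternative: drop (weight 0) any instance scored above threshold."""
    return [0.0 if d > threshold else 1.0 for d in detriment]


def weighted_error(errors: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-instance errors, as a weighted training loss would use."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all instance weights are zero")
    return sum(w * e for w, e in zip(weights, errors)) / total


# Hypothetical detriment scores for three training instances.
scores = [0.0, 0.25, 1.0]
soft = continuous_weights(scores)   # graded influence per instance
hard = binary_filter(scores)        # keep-or-drop decision per instance
```

With continuous weights, a mildly suspicious instance still contributes to the loss at reduced strength, whereas the binary filter either keeps it fully or discards it.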

    Over-optimism in bioinformatics: an illustration

    In statistical bioinformatics research, different optimization mechanisms potentially lead to "over-optimism" in published papers. The present empirical study illustrates these mechanisms through a concrete example from an active research field. The investigated sources of over-optimism include the optimization of the data sets, of the settings, of the competing methods and, most importantly, of the method’s characteristics. We consider a "promising" new classification algorithm that turns out to yield disappointing results in terms of error rate, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. We quantitatively demonstrate that this disappointing method can artificially seem superior to existing approaches if we "fish for significance". We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should be validated using "fresh" validation data sets.

    Support vector machine for functional data classification

    In many applications, input data are sampled functions taking their values in infinite dimensional spaces rather than standard vectors. This fact has complex consequences for data analysis algorithms and motivates modifying them. In fact, most of the traditional data analysis tools for regression, classification and clustering have been adapted to functional inputs under the general name of Functional Data Analysis (FDA). In this paper, we investigate the use of Support Vector Machines (SVMs) for functional data analysis and we focus on the problem of curve discrimination. SVMs are large margin classifier tools based on implicit nonlinear mappings of the considered data into high dimensional spaces thanks to kernels. We show how to define simple kernels that take into account the functional nature of the data and lead to consistent classification. Experiments conducted on real world data emphasize the benefit of taking into account some functional aspects of the problems. Comment: 13 page
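One way a kernel can respect the functional nature of sampled curves is sketched below in Python. This is only an illustration and not necessarily one of the kernels defined in the paper: it assumes both curves are sampled on a common grid, approximates the squared L2 distance with the trapezoidal rule, and plugs it into a Gaussian kernel. The function names and the `gamma` parameter are hypothetical.

```python
import math
from typing import Sequence


def l2_distance_sq(f: Sequence[float], g: Sequence[float], grid: Sequence[float]) -> float:
    """Trapezoidal approximation of the integral of (f(t) - g(t))^2 over the grid."""
    if not (len(f) == len(g) == len(grid)) or len(grid) < 2:
        raise ValueError("curves and grid must share a length of at least 2")
    total = 0.0
    for i in range(len(grid) - 1):
        left = (f[i] - g[i]) ** 2
        right = (f[i + 1] - g[i + 1]) ** 2
        # Area of one trapezoid; unequal grid spacing is handled by the step width.
        total += 0.5 * (left + right) * (grid[i + 1] - grid[i])
    return total


def functional_rbf(f: Sequence[float], g: Sequence[float],
                   grid: Sequence[float], gamma: float = 1.0) -> float:
    """Gaussian kernel on the approximate L2 distance between two sampled curves."""
    return math.exp(-gamma * l2_distance_sq(f, g, grid))


# Two constant curves sampled at three points on [0, 1].
grid = [0.0, 0.5, 1.0]
zero_curve = [0.0, 0.0, 0.0]
one_curve = [1.0, 1.0, 1.0]
similarity = functional_rbf(zero_curve, one_curve, grid)
```

Because the distance is weighted by the grid spacing, the kernel value depends on the underlying curves rather than on how densely they happen to be sampled, unlike a Gaussian kernel applied to the raw sample vectors.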

    Network measures for protein folding state discrimination

    Proteins fold via two-state or multi-state kinetic mechanisms, but so far there is no first-principles model to explain this difference in behavior. We exploit the network properties of protein structures by introducing novel observables to address the problem of classifying the different types of folding kinetics. These observables have a plain physical meaning in terms of vibrational modes, possible configurations compatible with the native protein structure, and folding cooperativity. The relevance of these observables is supported by a classification performance of up to 90%, even with simple classifiers such as discriminant analysis.

    Differences in intention to use educational RSS feeds between Lebanese and British students: A multi‑group analysis based on the technology acceptance model

    Really Simple Syndication (RSS) offers a means for university students to receive timely updates from virtual learning environments. However, despite its utility, only 21% of home students surveyed at a university in Lebanon claim to have ever used the technology. To investigate whether national culture could be an influence on intention to use RSS, the survey was extended to British students in the UK. Using the Technology Acceptance Model (TAM) as a research framework, 437 students responded to a questionnaire containing four constructs: behavioral intention to use; attitude towards benefit; perceived usefulness; and perceived ease of use. Principal components analysis and structural equation modelling were used to explore the psychometric qualities and utility of TAM in both contexts. The results show that adoption was significantly higher, but also modest, in the British context at 36%. Configural and metric invariance were fully supported, while scalar and factorial invariance were partially supported. Further analysis shows significant differences between perceived usefulness and perceived ease of use across the two contexts studied. Therefore, it is recommended that faculty demonstrate to students how educational RSS feeds can be used effectively to increase awareness and emphasize usefulness in both contexts.