Reducing the Effects of Detrimental Instances
Not all instances in a data set are equally beneficial for inducing a model
of the data. Some instances (such as outliers or noise) can be detrimental.
However, at least initially, the instances in a data set are generally
considered equally in machine learning algorithms. Many current approaches for
handling noisy and detrimental instances make a binary decision about whether
an instance is detrimental or not. In this paper, we 1) extend this paradigm by
weighting the instances on a continuous scale and 2) present a methodology for
measuring how detrimental an instance may be for inducing a model of the data.
We call our method of identifying and weighting detrimental instances reduced
detrimental instance learning (RDIL). We examine RDIL on a set of 54 data sets
and 5 learning algorithms and compare RDIL with other weighting and filtering
approaches. RDIL is especially useful for learning algorithms where every
instance can affect the classification boundary and the training instances are
considered individually, such as multilayer perceptrons trained with
backpropagation (MLPs). Our results also suggest that a more accurate estimate
of which instances are detrimental can have a significant positive impact for
handling them.
Comment: 6 pages, 5 tables, 2 figures. arXiv admin note: substantial text
overlap with arXiv:1403.189
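The idea of weighting instances on a continuous scale, rather than keeping or discarding them, can be sketched as follows. This is a minimal illustration and not the RDIL procedure itself: the leave-one-out nearest-neighbour estimate, the function name `instance_weights`, the parameter `k`, and the toy data are all assumptions chosen for brevity.

```python
import math

def instance_weights(points, labels, k=3):
    """Weight each instance by the fraction of its k nearest neighbours
    (excluding itself) that share its label.

    Hypothetical stand-in for a "how detrimental is this instance" score:
    weights near 0 flag likely noise or outliers, weights near 1 flag
    instances that agree with their neighbourhood.
    """
    weights = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Distances to every other instance (leave-one-out).
        others = [
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points)
            if j != i
        ]
        others.sort(key=lambda pair: pair[0])
        neighbours = others[:k]
        agree = sum(1 for _, label in neighbours if label == y)
        weights.append(agree / len(neighbours))
    return weights

# Two well-separated clusters plus one mislabelled point inside cluster "a".
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (0.05, 0.05)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
weights = instance_weights(points, labels, k=3)
```

In this toy data the mislabelled point receives the lowest weight, and its clean neighbours are only partly down-weighted instead of being removed. Such weights could then scale each instance's contribution to a learner's training loss.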
Over-optimism in bioinformatics: an illustration
In statistical bioinformatics research, different optimization mechanisms potentially lead to "over-optimism" in published papers. The present empirical study illustrates these mechanisms through a concrete example from an active research field. The investigated sources of over-optimism include the optimization of the data sets, of the settings, of the competing methods and, most importantly, of the method’s characteristics. We consider a "promising" new classification algorithm that turns out to yield disappointing results in terms of error rate, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. We quantitatively demonstrate that this disappointing method can artificially seem superior to existing approaches if we "fish for significance". We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should be validated using "fresh" validation data sets.
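The selection effect behind "fishing for significance" can be illustrated with a small simulation. This is a sketch under stated assumptions, not the study's experiment: the data are pure noise, each "variant" is a hypothetical one-feature threshold rule, and the sample sizes and seed are arbitrary.

```python
import random

def error_rate(feature_index, rows, labels):
    """Error of the rule 'predict 1 when the chosen feature is positive'."""
    wrong = sum(
        1 for row, y in zip(rows, labels)
        if (1 if row[feature_index] > 0 else 0) != y
    )
    return wrong / len(rows)

def make_noise_data(rng, n_samples, n_features):
    """Pure noise: features and labels are independent by construction."""
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_samples)]
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    return rows, labels

rng = random.Random(0)
n_variants = 50  # hypothetical number of method variants tried

# Small data set on which all variants are compared and the best is picked.
rows, labels = make_noise_data(rng, n_samples=40, n_features=n_variants)
errors = [error_rate(j, rows, labels) for j in range(n_variants)]
best = min(range(n_variants), key=errors.__getitem__)
selected_error = errors[best]

# Fresh validation data drawn from the same uninformative distribution.
fresh_rows, fresh_labels = make_noise_data(rng, 2000, n_variants)
fresh_error = error_rate(best, fresh_rows, fresh_labels)

print(f"error of selected variant on tuning data: {selected_error:.3f}")
print(f"error of the same variant on fresh data:  {fresh_error:.3f}")
```

No variant can truly beat chance here, so any apparent advantage of the selected variant on the tuning data comes from the selection itself. On fresh data its error is expected to return to roughly 50%.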
Support vector machine for functional data classification
In many applications, input data are sampled functions taking their values in
infinite dimensional spaces rather than standard vectors. This fact has complex
consequences for data analysis algorithms and motivates modifying them.
In fact, most of the traditional data analysis tools for regression,
classification and clustering have been adapted to functional inputs under the
general name of Functional Data Analysis (FDA). In this paper, we investigate
the use of Support Vector Machines (SVMs) for functional data analysis and we
focus on the problem of curve discrimination. SVMs are large margin classifier
tools based on implicit nonlinear mappings of the considered data into high
dimensional spaces thanks to kernels. We show how to define simple kernels that
take into account the functional nature of the data and lead to consistent
classification. Experiments conducted on real world data emphasize the benefit
of taking into account some functional aspects of the problems.
Comment: 13 page
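One way a kernel can take the functional nature of sampled curves into account is to compare their derivatives instead of their raw sample vectors. The sketch below is an assumption for illustration and is not necessarily one of the kernels defined in the paper: the names `finite_difference` and `derivative_rbf_kernel`, the regular sampling grid, and the value of `gamma` are all hypothetical.

```python
import math

def finite_difference(samples, dt):
    """Approximate the derivative of a curve sampled on a regular grid."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]

def derivative_rbf_kernel(f, g, dt, gamma=1.0):
    """Gaussian kernel on the approximate L2 distance between derivatives.

    Assumption: both curves are sampled on the same regular grid with
    spacing dt. The squared L2 norm is approximated by a Riemann sum.
    """
    df, dg = finite_difference(f, dt), finite_difference(g, dt)
    sq_dist = sum((a - b) ** 2 for a, b in zip(df, dg)) * dt
    return math.exp(-gamma * sq_dist)

dt = 0.01
grid = [i * dt for i in range(101)]
sine = [math.sin(2 * math.pi * t) for t in grid]
shifted_sine = [v + 3.0 for v in sine]  # same shape, vertical offset
cosine = [math.cos(2 * math.pi * t) for t in grid]

k_same_shape = derivative_rbf_kernel(sine, shifted_sine, dt)
k_diff_shape = derivative_rbf_kernel(sine, cosine, dt)
```

Because the kernel only looks at derivatives, two curves that differ by a vertical offset are treated as nearly identical, while curves with different shapes get a similarity close to zero. A Gaussian kernel on the raw samples would instead penalise the offset.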
Network measures for protein folding state discrimination
Proteins fold via two-state or multi-state kinetic mechanisms, but so far there is no first-principles model that explains this difference in behavior. We exploit the network properties of protein structures by introducing novel observables to address the problem of classifying the different types of folding kinetics. These observables have a plain physical meaning in terms of vibrational modes, possible configurations compatible with the native protein structure, and folding cooperativity. The relevance of these observables is supported by a classification performance of up to 90%, even with simple classifiers such as discriminant analysis.
Differences in intention to use educational RSS feeds between Lebanese and British students: A multi‑group analysis based on the technology acceptance model
Really Simple Syndication (RSS) offers a means for university students to receive timely updates from virtual learning environments. However, despite its utility, only 21% of home students surveyed at a university in Lebanon claim to have ever used the technology. To investigate whether national culture could be an influence on intention to use RSS, the survey was extended to British students in the UK. Using the Technology Acceptance Model (TAM) as a research framework, 437 students responded to a questionnaire containing four constructs: behavioral intention to use; attitude towards benefit; perceived usefulness; and perceived ease of use. Principal components analysis and structural equation modelling were used to explore the psychometric qualities and utility of TAM in both contexts. The results show that adoption was significantly higher, though still modest, in the British context at 36%. Configural and metric invariance were fully supported, while scalar and factorial invariance were partially supported. Further analysis shows significant differences between perceived usefulness and perceived ease of use across the two contexts studied. Therefore, it is recommended that faculty demonstrate to students how educational RSS feeds can be used effectively to increase awareness and emphasize usefulness in both contexts.