15,337 research outputs found

    Advances in Feature Selection with Mutual Information

    Selecting the features that are relevant for a prediction or classification problem is an important task in many domains involving high-dimensional data. Selecting features helps fight the curse of dimensionality, improve the performance of prediction or classification methods, and interpret the application. In a nonlinear context, the mutual information is widely used as a relevance criterion for features and sets of features. Nevertheless, it suffers from at least three major limitations: mutual information estimators depend on smoothing parameters, there is no theoretically justified stopping criterion in the greedy feature selection procedure, and the estimation itself suffers from the curse of dimensionality. This chapter shows how to deal with these problems. The first two are addressed by using resampling techniques that provide a statistical basis to select the estimator parameters and to stop the search procedure. The third is addressed by modifying the mutual information criterion into a measure of how complementary (and not only informative) the features are for the problem at hand.
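
The greedy procedure with a resampling-based stopping rule described in this abstract can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions, not the chapter's method: it assumes discrete features and a plug-in (count-based) mutual information estimate, and it uses a permutation test as one possible resampling scheme for deciding when to stop. The function names (`mutual_information`, `joint_column`, `greedy_mi_selection`) and the parameters `n_perm` and `alpha` are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def joint_column(columns):
    """Merge several discrete columns into a single column of tuples."""
    return list(zip(*columns))


def greedy_mi_selection(features, target, n_perm=200, alpha=0.05, seed=0):
    """Forward selection; stop when the best MI gain is not significant.

    Assumption: significance is judged by a permutation test that shuffles
    the candidate column, which breaks its link to the target while
    keeping its marginal distribution.
    """
    rng = random.Random(seed)
    selected, current_mi = [], 0.0
    remaining = list(range(len(features)))
    while remaining:
        chosen = [features[i] for i in selected]
        # Joint MI between (already selected + candidate) and the target.
        scores = {
            j: mutual_information(joint_column(chosen + [features[j]]), target)
            for j in remaining
        }
        best = max(scores, key=scores.get)
        gain = scores[best] - current_mi
        # Null distribution of the gain under a shuffled candidate.
        exceed = 0
        for _ in range(n_perm):
            shuffled = list(features[best])
            rng.shuffle(shuffled)
            null_mi = mutual_information(joint_column(chosen + [shuffled]), target)
            # Small tolerance so floating-point noise does not count as a gain.
            if null_mi - current_mi >= gain - 1e-12:
                exceed += 1
        p_value = (exceed + 1) / (n_perm + 1)
        if p_value > alpha:
            break  # the gain is indistinguishable from chance: stop searching
        selected.append(best)
        remaining.remove(best)
        current_mi = scores[best]
    return selected


# Toy data: the target copies feature 0; features 1 and 2 carry no information.
n = 200
x0 = [i % 2 for i in range(n)]
x1 = [(i // 2) % 2 for i in range(n)]
x2 = [(i // 4) % 2 for i in range(n)]
print(greedy_mi_selection([x0, x1, x2], target=x0))
```

Scoring the joint column of all selected features, rather than each feature alone, is what makes the estimate degrade as the subset grows, which is the curse-of-dimensionality limitation the abstract mentions.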

    Stability and aggregation of ranked gene lists

    Ranked gene lists are highly unstable in the sense that similar measures of differential gene expression may yield very different rankings, and that a small change in the data set usually affects the resulting gene list considerably. Stability issues have long been under-considered in the literature, but they have become a hot topic in the last few years, perhaps as a consequence of increasing skepticism about the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for assessing the stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.

    Learning Hybrid Neuro-Fuzzy Classifier Models From Data: To Combine or Not to Combine?

    To combine or not to combine? Though not a question of the same gravity as Shakespeare's "to be or not to be", it is examined in this paper in the context of a hybrid neuro-fuzzy pattern classifier design process. A general fuzzy min-max neural network with its basic learning procedure is used within six different algorithm-independent learning schemes. Various versions of cross-validation, resampling techniques and data editing approaches, leading to the generation of a single classifier or a multiple classifier system, are scrutinised and compared. The classification performance on unseen data, commonly used as a criterion for comparing competing designs, is augmented by four further criteria that attempt to capture additional characteristics of classifier generation schemes. These are: the ability to estimate the true classification error rate, the classifier transparency, the computational complexity of the learning scheme, and the potential for adaptation to changing environments and new classes of data. One of the main questions examined is whether and when to use a single classifier or a combination of several component classifiers within a multiple classifier system.

    Combining Neuro-Fuzzy Classifiers for Improved Generalisation and Reliability

    In this paper, a combination of neuro-fuzzy classifiers for improved classification performance and reliability is considered. A general fuzzy min-max (GFMM) classifier with an agglomerative learning algorithm is used as the main building block. An alternative to combining individual classifier decisions is proposed, in which the combination takes place at the classifier model level. The complexity and transparency of the resulting classifier are comparable to those of classifiers generated during a single cross-validation procedure, while the improved classification performance and reduced variance are comparable to those of an ensemble of classifiers with combined (averaged/voted) decisions. We also illustrate how combining at the model level can be used to speed up the training of GFMM classifiers for large data sets.

    A Functional Wavelet-Kernel Approach for Continuous-time Prediction

    We consider the problem of predicting a continuous-time stochastic process on an entire time interval in terms of its recent past. The approach we adopt is based on functional kernel nonparametric regression estimation techniques where observations are segments of the observed process considered as curves. These curves are assumed to lie within a space of possibly inhomogeneous functions, and the discretized time series dataset consists of a number of measurements, made at regular times, that is small relative to the number of segments. We thus consider only the case where an asymptotically non-increasing number of measurements is available for each portion of the time series. We estimate conditional expectations using appropriate wavelet decompositions of the segmented sample paths. A notion of similarity, based on wavelet decompositions, is used to calibrate the prediction. Asymptotic properties when the number of segments grows to infinity are investigated under mild conditions, and a nonparametric resampling procedure is used to generate, in a flexible way, valid asymptotic pointwise confidence intervals for the predicted trajectories. We illustrate the usefulness of the proposed functional wavelet-kernel methodology in finite sample situations by means of three real-life datasets that were collected from different arenas.

    Unbiased and Consistent Nested Sampling via Sequential Monte Carlo

    We introduce a new class of sequential Monte Carlo methods called Nested Sampling via Sequential Monte Carlo (NS-SMC), which reframes the Nested Sampling method of Skilling (2006) in terms of sequential Monte Carlo techniques. This new framework allows convergence results to be obtained in the setting where Markov chain Monte Carlo (MCMC) is used to produce new samples. An additional benefit is that marginal likelihood estimates are unbiased. In contrast to NS, the analysis of NS-SMC does not require the (unrealistic) assumption that the simulated samples be independent. As the original NS algorithm is a special case of NS-SMC, this provides insights as to why NS seems to produce accurate estimates despite a typical violation of its assumptions. For applications of NS-SMC, we give advice on tuning MCMC kernels in an automated manner via a preliminary pilot run, and present a new method for appropriately choosing the number of MCMC repeats at each iteration. Finally, a numerical study is conducted in which the performance of NS-SMC and temperature-annealed SMC is compared on several challenging and realistic problems. MATLAB code for our experiments is made available at https://github.com/LeahPrice/SMC-NS. Comment: 45 pages, some minor typographical errors fixed since last version.

    Cluster validity in clustering methods
