67,239 research outputs found

    Assessing similarity of feature selection techniques in high-dimensional domains

    Get PDF
    Recent research efforts attempt to combine multiple feature selection techniques instead of using a single one. However, this combination is often made on an ā€œad hocā€ basis, depending on the specific problem at hand, without considering the degree of diversity/similarity of the involved methods. Moreover, though it is recognized that different techniques may return quite dissimilar outputs, especially in high dimensional/small sample size domains, few direct comparisons exist that quantify these differences and their implications on classification performance. This paper aims to provide a contribution in this direction by proposing a general methodology for assessing the similarity between the outputs of different feature selection methods in high dimensional classification problems. Using as benchmark the genomics domain, an empirical study has been conducted to compare some of the most popular feature selection methods, and useful insight has been obtained about their pattern of agreement

    Committee-Based Sample Selection for Probabilistic Classifiers

    Full text link
    In many real-world learning tasks, it is expensive to acquire a sufficient number of labeled examples for training. This paper investigates methods for reducing annotation cost by `sample selection'. In this approach, during training the learning program examines many unlabeled examples and selects for labeling only those that are most informative at each stage. This avoids redundantly labeling examples that contribute little new information. Our work follows on previous research on Query By Committee, extending the committee-based paradigm to the context of probabilistic classification. We describe a family of empirical methods for committee-based sample selection in probabilistic classification models, which evaluate the informativeness of an example by measuring the degree of disagreement between several model variants. These variants (the committee) are drawn randomly from a probability distribution conditioned by the training set labeled so far. The method was applied to the real-world natural language processing task of stochastic part-of-speech tagging. We find that all variants of the method achieve a significant reduction in annotation cost, although their computational efficiency differs. In particular, the simplest variant, a two member committee with no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger

    Context for Ubiquitous Data Management

    Get PDF
    In response to the advance of ubiquitous computing technologies, we believe that for computer systems to be ubiquitous, they must be context-aware. In this paper, we address the impact of context-awareness on ubiquitous data management. To do this, we overview different characteristics of context in order to develop a clear understanding of context, as well as its implications and requirements for context-aware data management. References to recent research activities and applicable techniques are also provided
    • ā€¦
    corecore