
    A CLUE for CLUster Ensembles

    Cluster ensembles are collections of individual solutions to a given clustering problem which are useful or necessary to consider in a wide range of applications. The R package clue provides an extensible computational environment for creating and analyzing cluster ensembles, with basic data structures for representing partitions and hierarchies, and facilities for computing on these, including methods for measuring proximity and obtaining consensus and "secondary" clusterings.
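One way to picture "measuring proximity" between two partitions of the same objects is the Rand index, the fraction of object pairs on which both partitions agree. The sketch below is a minimal Python illustration of that one measure; it is not the clue package's API, and the function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs that both partitions treat the same way.

    A pair counts as an agreement when it is placed together in both
    partitions or placed apart in both partitions.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two objects: nothing to disagree on
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Two members of a small, made-up cluster ensemble over four objects.
same_up_to_renaming = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index only compares co-membership, relabelling the clusters of either partition leaves the value unchanged.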

    How Many Topics? Stability Analysis for Topic Models

    Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process. Comment: Improve readability of plots. Add minor clarification.
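The stability idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's exact procedure: each topic is represented by its list of top terms, agreement between two topics is taken to be the Jaccard similarity of their term sets, and each reference topic is matched greedily to its best counterpart. The names `jaccard`, `agreement`, and `stability` are hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity between two lists of top terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def agreement(reference_topics, sample_topics):
    """Mean best-match similarity of each reference topic in a sample model.

    Assumption: greedy best match, so two reference topics may pick the
    same sample topic. An optimal one-to-one matching would be stricter.
    """
    scores = [
        max(jaccard(ref, cand) for cand in sample_topics)
        for ref in reference_topics
    ]
    return sum(scores) / len(scores)


def stability(reference_topics, sample_models):
    """Average agreement of the reference model with models fitted on
    perturbed (for example, subsampled) versions of the corpus."""
    return sum(agreement(reference_topics, m) for m in sample_models) / len(sample_models)


# Made-up top-term lists for a model with two topics and two perturbed runs.
reference = [["a", "b"], ["c", "d"]]
runs = [
    [["c", "d"], ["a", "b"]],  # same topics, different order
    [["a", "x"], ["c", "d"]],  # one topic partly changed
]
score = stability(reference, runs)
```

Under this sketch, one would compute `score` for each candidate number of topics and favour the values where it is highest.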

    A p-adic RanSaC algorithm for stereo vision using Hensel lifting

    A p-adic variation of the Ran(dom) Sa(mple) C(onsensus) method for solving the relative pose problem in stereo vision is developed. From two 2-adically encoded images a random sample of five pairs of corresponding points is taken, and the equations for the essential matrix are solved by lifting solutions modulo 2 to the 2-adic integers. A recently devised p-adic hierarchical classification algorithm imitating the known LBG quantisation method classifies the solutions for all the samples after having determined the number of clusters using the known intra-inter validity of clusterings. In the successful case, a cluster ranking will determine the cluster containing a 2-adic approximation to the "true" solution of the problem. Comment: 15 pages; typos removed, abstract changed, computation error removed.
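To illustrate what "lifting solutions modulo 2 to the 2-adic integers" means, here is a minimal Python sketch of Hensel lifting for a single polynomial in one variable. The paper works with the multivariate essential-matrix equations, so this is only the simplest instance of the technique; the function `hensel_lift` and the example polynomial are assumptions chosen for illustration. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
def hensel_lift(f, df, root, p, k):
    """Lift a simple root of f modulo p to a root modulo p**k.

    f and df are callables for the polynomial and its derivative.
    The root must satisfy f(root) = 0 (mod p) and df(root) != 0 (mod p).
    """
    if f(root) % p != 0:
        raise ValueError("root is not a root of f modulo p")
    if df(root) % p == 0:
        raise ValueError("root is not simple modulo p; basic lifting does not apply")
    inv = pow(df(root), -1, p)  # inverse of f'(root) modulo p
    modulus = p
    for _ in range(k - 1):
        modulus *= p
        # Newton-style correction: each step gains one more p-adic digit.
        root = (root - f(root) * inv) % modulus
    return root


# Example polynomial (an assumption for illustration): x^2 + x + 2.
# It has the simple root 0 modulo 2, since f(0) = 2 and f'(0) = 1.
f = lambda x: x * x + x + 2
df = lambda x: 2 * x + 1

approx = hensel_lift(f, df, root=0, p=2, k=10)  # root modulo 2**10
```

The returned value is a 2-adic approximation in the sense that it agrees with an exact 2-adic root in its first ten binary digits.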

    A survey of outlier detection methodologies

    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, and as such purify the data for processing. The original outlier detection methods were arbitrary, but now principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.