5,899 research outputs found

    Semantic distillation: a method for clustering objects by their contextual specificity

    Full text link
    Techniques for data-mining, latent semantic analysis, contextual search of databases, etc. have long ago been developed by computer scientists working on information retrieval (IR). Experimental scientists, from all disciplines, having to analyse large collections of raw experimental data (astronomical, physical, biological, etc.) have developed powerful methods for their statistical analysis and for clustering, categorising, and classifying objects. Finally, physicists have developed a theory of quantum measurement, unifying the logical, algebraic, and probabilistic aspects of queries into a single formalism. The purpose of this paper is twofold: first to show that when formulated at an abstract level, problems from IR, from statistical data analysis, and from physical measurement theories are very similar and hence can profitably be cross-fertilised, and, secondly, to propose a novel method of fuzzy hierarchical clustering, termed \textit{semantic distillation} -- strongly inspired from the theory of quantum measurement --, we developed to analyse raw data coming from various types of experiments on DNA arrays. We illustrate the method by analysing DNA arrays experiments and clustering the genes of the array according to their specificity.Comment: Accepted for publication in Studies in Computational Intelligence, Springer-Verla

    TopSig: Topology Preserving Document Signatures

    Get PDF
    Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these were not so far linked to general purpose, signature file based, search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature file based indexing and retrieval, performance that is comparable to that of state of the art inverted file based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and from the theoretical perspective it positions the file signatures model in the class of Vector Space retrieval models.Comment: 12 pages, 8 figures, CIKM 201

    View subspaces for indexing and retrieval of 3D models

    Full text link
    View-based indexing schemes for 3D object retrieval are gaining popularity since they provide good retrieval results. These schemes are coherent with the theory that humans recognize objects based on their 2D appearances. The viewbased techniques also allow users to search with various queries such as binary images, range images and even 2D sketches. The previous view-based techniques use classical 2D shape descriptors such as Fourier invariants, Zernike moments, Scale Invariant Feature Transform-based local features and 2D Digital Fourier Transform coefficients. These methods describe each object independent of others. In this work, we explore data driven subspace models, such as Principal Component Analysis, Independent Component Analysis and Nonnegative Matrix Factorization to describe the shape information of the views. We treat the depth images obtained from various points of the view sphere as 2D intensity images and train a subspace to extract the inherent structure of the views within a database. We also show the benefit of categorizing shapes according to their eigenvalue spread. Both the shape categorization and data-driven feature set conjectures are tested on the PSB database and compared with the competitor view-based 3D shape retrieval algorithmsComment: Three-Dimensional Image Processing (3DIP) and Applications (Proceedings Volume) Proceedings of SPIE Volume: 7526 Editor(s): Atilla M. Baskurt ISBN: 9780819479198 Date: 2 February 201

    Multi-mode partitioning for text clustering to reduce dimensionality and noises

    Get PDF
    Co-clustering in text mining has been proposed to partition words and documents simultaneously. Although the main advantage of this approach may improve interpretation of clusters on the data, there are still few proposals on these methods; while one-way partition is even now widely utilized for information retrieval. In contrast to structured information, textual data suffer of high dimensionality and sparse matrices, so it is strictly necessary to pre-process texts for applying clustering techniques. In this paper, we propose a new procedure to reduce high dimensionality of corpora and to remove the noises from the unstructured data. We test two different processes to treat data applying two co-clustering algorithms; based on the results we present the procedure that provides the best interpretation of the data
    corecore