
    Reducing the loss of information through annealing text distortion

    Granados, A.; Cebrian, M.; Camacho, D.; de Borja Rodriguez, F. "Reducing the Loss of Information through Annealing Text Distortion". IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011.
    Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words, so that the complexity of a document is slowly reduced, helps compression-based text clustering and improves its accuracy. In fact, we show how nondistorted text clustering can be improved by means of annealing text distortion. The experimental results in this paper are consistent across different data sets and different compression algorithms belonging to the most important compression families: Lempel-Ziv, statistical, and block-sorting. This work was supported by the Spanish Ministry of Education and Science under projects TIN2010-19872 and TIN2010-19607.
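    The abstract does not spell out the exact distance used, but work in this line typically builds on the normalized compression distance (NCD). A minimal sketch using Python's zlib, a Lempel-Ziv compressor (one of the families evaluated); `ncd` is an illustrative helper, not code from the paper:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for similar inputs,
    close to 1 for unrelated ones."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

    Annealing text distortion then amounts to recomputing such distances after words have been progressively removed from the documents.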

    Using Concept Relationships to Improve Document Categorization

    In the information age we depend greatly on our ability to find information hidden in mostly unstructured, textual documents. This article proposes a simple method by which (as an addition to existing systems) categorization accuracy can be improved over traditional techniques, without requiring any time-consuming or language-dependent computation. This result is achieved by exploiting properties observed across the entire document collection rather than in individual documents, which may also be regarded as constructing an approximate concept network (measuring semantic distances). These properties are simple enough to avoid massive computation, yet they try to capture the most fundamental relationships between words or concepts. Experiments performed on the Reuters-21578 news article collection were evaluated using a set of simple measurements estimating clustering efficiency, and in addition by Cluto, a widely used document clustering software package. Results show a 5-10% improvement in clustering quality over traditional tf (term frequency) or tf x idf (term frequency-inverse document frequency) based clustering.
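    As background for the tf and tf x idf baselines mentioned above, a minimal sketch of the standard weighting (the concept-network extension itself is not specified in the abstract, so it is not shown):

```python
import math
from collections import Counter

def tfidf(docs):
    """tf x idf weights for a corpus given as a list of token lists.
    idf here is the plain log(N / df) variant; libraries often smooth it."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency in this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

    A term appearing in every document gets weight 0, which is exactly why corpus-wide properties can change the ranking relative to raw tf.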

    Estimating the number of clusters using diversity

    It is an important and challenging problem in unsupervised learning to estimate the number of clusters in a dataset. Knowing the number of clusters is a prerequisite for many commonly used clustering algorithms such as k-means. In this paper, we propose a novel diversity-based approach to this problem. Specifically, we show that the difference between the global diversity of clusters and the sum of each cluster's local diversity among its members can be used as an effective indicator of the optimality of the number of clusters, where diversity is measured by Rao's quadratic entropy. A notable advantage of our proposed method is that it encourages balanced clustering by taking into account both the sizes of clusters and the distances between clusters. In other words, it is less prone to very small "outlier" clusters than existing methods. Our extensive experiments on both synthetic and real-world datasets (with known ground-truth clustering) have demonstrated that our proposed method is robust for clusters of different sizes, variances, and shapes, and that it is more accurate than existing methods (including elbow, Calinski-Harabasz, silhouette, and gap-statistic) in terms of finding the optimal number of clusters.
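    The abstract states the indicator only in words. A hedged sketch, assuming uniform weights in Rao's quadratic entropy (under which it reduces to the mean pairwise distance) and an unweighted sum of local diversities; the paper's exact weighting by cluster size may differ:

```python
import numpy as np

def rao_q(points):
    """Rao's quadratic entropy with uniform weights:
    the mean pairwise Euclidean distance (self-distances included)."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return d.sum() / (n * n)

def diversity_gap(points, labels):
    """Global diversity minus the summed local (per-cluster) diversities.
    A larger gap indicates tighter, better-separated clusters."""
    local = sum(rao_q(points[labels == c]) for c in np.unique(labels))
    return rao_q(points) - local
```

    With everything in one cluster the gap is zero; a partition that matches the true structure makes the local terms small and the gap large.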

    Looking for atypical groups of distributions in the context of genomic data

    This work addresses the problem of detecting groups of observations (distributions) and flagging those that differ abnormally from the majority of the groups, termed atypical groups. The proposed method combines a hierarchical classification technique, to identify groups of similar distributions, with a functional outlier detection method, to identify those groups that contain outliers. Groups with outlying observations are forwarded for sub-clustering. Once the final partition is obtained, each cluster is represented by a class prototype, whose outlyingness is evaluated according to a functional approach. Clusters with atypical prototypes are flagged as atypical groups. The method is applied to the detection of groups of atypical genomic words, based on their distance distributions.

    Measuring Global Similarity between Texts

    We propose a new similarity measure between texts which, contrary to current state-of-the-art approaches, takes a global view of the texts to be compared. We have implemented a tool to compute our textual distance and conducted experiments on several corpora of texts. The experiments show that our method can reliably identify different global types of texts.

    Similarity of fMRI activity patterns in left perirhinal cortex reflects semantic similarity between words

    How verbal and nonverbal visuoperceptual input connects to semantic knowledge is a core question in visual and cognitive neuroscience, with significant clinical ramifications. In an event-related functional magnetic resonance imaging (fMRI) experiment we determined how cosine similarity between fMRI response patterns to concrete words and pictures reflects semantic clustering and semantic distances between the represented entities within a single category. Semantic clustering and semantic distances between 24 animate entities were derived from a concept-feature matrix based on feature generation by >1000 subjects. In the main fMRI study, 19 human subjects performed a property verification task with written words and pictures and a low-level control task. The univariate contrast between the semantic and the control task yielded extensive bilateral occipitotemporal activation from posterior cingulate to anteromedial temporal cortex. Entities belonging to the same semantic cluster elicited more similar fMRI activity patterns in left occipitotemporal cortex. When words and pictures were analyzed separately, the effect reached significance only for words. The semantic similarity effect for words was localized to left perirhinal cortex. According to a representational similarity analysis of left perirhinal responses, semantic distances between entities correlated inversely with cosine similarities between fMRI response patterns to written words. An independent replication study in 16 new subjects confirmed these novel findings. Semantic similarity is reflected by similarity of functional topography at a fine-grained level in left perirhinal cortex. The word specificity excludes perceptually driven confounds as an explanation and is likely to be task dependent.
    Rose Bruffaerts, Patrick Dupont, Ronald Peeters, Simon De Deyne, Gerrit Storms and Rik Vandenbergh
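    The representational similarity analysis described above can be sketched as follows. `rsa_correlation` is an illustrative helper; the study's exact correlation choice (e.g. Pearson vs. Spearman) is not stated in the abstract, and Pearson is used here for simplicity:

```python
import numpy as np

def rsa_correlation(patterns, semantic_dist):
    """Correlate pairwise cosine similarities of response patterns
    (rows = entities, columns = voxels) with a semantic distance matrix.
    The reported effect corresponds to a negative correlation."""
    unit = patterns / np.linalg.norm(patterns, axis=1, keepdims=True)
    cos = unit @ unit.T                      # pairwise cosine similarities
    iu = np.triu_indices(len(patterns), k=1) # each pair counted once
    return np.corrcoef(cos[iu], semantic_dist[iu])[0, 1]
```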

    Developments in the theory of randomized shortest paths with a comparison of graph node distances

    There have lately been several suggestions for parametrized distances on a graph that generalize the shortest path distance and the commute time or resistance distance. The need for such distances arises from the observation that the common distances mentioned above often fail to take into account the global structure of the graph. In this article, we develop the theory of one family of graph node distances, known as the randomized shortest path dissimilarity, which has its foundation in statistical physics. We show that the randomized shortest path dissimilarity can be computed in closed form for all pairs of nodes of a graph. Moreover, we propose a new distance measure that we call the free energy distance. The free energy distance can be seen as an upgrade of the randomized shortest path dissimilarity: it defines a metric and, in addition, satisfies the graph-geodetic property. The derivation and computation of the free energy distance are also straightforward. We then compare a set of generalized distances that interpolate between the shortest path distance and the commute time, or resistance distance. This comparison focuses on the applicability of the distances to graph node clustering and classification. The comparison shows that, in general, the parametrized distances perform well in these tasks. In particular, the results obtained with the free energy distance are among the best in all the experiments.
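    One endpoint of the interpolating family, the resistance (commute-time) distance, has a well-known closed form via the Laplacian pseudoinverse; a sketch is below. The randomized shortest path and free energy distances themselves require the temperature-parametrized machinery developed in the paper and are not reproduced here:

```python
import numpy as np

def resistance_distance(adj):
    """All-pairs resistance distance from a symmetric adjacency matrix:
    R_ij = L+_ii + L+_jj - 2 L+_ij, with L+ the Laplacian pseudoinverse."""
    lap = np.diag(adj.sum(axis=1)) - adj
    lp = np.linalg.pinv(lap)
    d = np.diag(lp)
    return d[:, None] + d[None, :] - lp - lp.T
```

    On a path graph the resistance distance coincides with the shortest path distance, which is why a one-parameter family can interpolate between the two on general graphs.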

    Parameterized Complexity of Feature Selection for Categorical Data Clustering

    We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to reducing dimensionality in practice, most known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features such that there is a small-cost clustering on the selected features. More precisely, for given integers l (the number of irrelevant features) and k (the number of clusters), a budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m-l relevant features such that the cost of any optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (l0-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on the parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k,B,|Σ|)⋅m^{g(k,|Σ|)}⋅n² for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points can be clustered around centers satisfying specific constraints.
    One interesting fact about Constrained Clustering with Outliers is that, besides Feature Selection, it encompasses many other fundamental problems on categorical data, such as Robust Clustering, Binary and Boolean Low-rank Matrix Approximation with Outliers, and Binary Robust Projective Clustering. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.
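    The clustering cost defined above (the sum of Hamming distances over the selected features) is straightforward to state in code. `clustering_cost` is an illustrative helper, not code from the paper:

```python
def clustering_cost(points, centers, assignment, selected):
    """Categorical clustering cost restricted to selected features:
    for each point, count the selected positions where it disagrees
    with its assigned cluster center (Hamming / l0 distance), and sum."""
    return sum(
        sum(p[f] != centers[assignment[i]][f] for f in selected)
        for i, p in enumerate(points)
    )
```

    Dropping an irrelevant feature from `selected` can only lower the cost, which is the sense in which selecting m-l relevant features enables a small-cost clustering.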