16 research outputs found

    Intrinsic Dimensionality

    This entry for the SIGSPATIAL Special July 2010 issue on Similarity Searching in Metric Spaces discusses the notion of intrinsic dimensionality of data in the context of similarity search. Comment: 4 pages, 4 figures, LaTeX; diagram (c) has been corrected.
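    The abstract does not fix a single formal definition; one standard formalization from the similarity-search literature (Chavez et al., "Searching in Metric Spaces", 2001) reads the intrinsic dimensionality off the pairwise-distance histogram as rho = mu^2 / (2 sigma^2). A minimal sketch of that estimator, assuming a point dataset and SciPy:

    import numpy as np
    from scipy.spatial.distance import pdist

    def distance_based_id(X, metric="euclidean"):
        """Intrinsic dimensionality of a point set from its pairwise-distance
        histogram: rho = mu^2 / (2 * sigma^2), following Chavez et al. (2001).
        `X` is an (n_points, n_features) array."""
        d = pdist(X, metric=metric)   # condensed vector of all pairwise distances
        mu, var = d.mean(), d.var()
        return mu ** 2 / (2 * var)

    A larger rho corresponds to a more concentrated distance histogram, which is precisely the regime in which similarity search becomes hard.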

    Formal concept analysis for evaluating intrinsic dimension of a natural language

    Some results of a computational experiment for determining the intrinsic dimension of linguistic varieties for the Bengali and Russian languages are presented; sets of words and sets of bigrams in these languages were considered separately. The method used to solve this problem was based on formal concept analysis algorithms. It was found that the intrinsic dimensions of these languages are significantly smaller than the dimensions used in popular neural network models for natural language processing. Comment: Preprint, 10th International Conference on Pattern Recognition and Machine Intelligence (PReMI 2023).
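    The abstract does not detail how the formal-concept machinery yields a dimension estimate; as a hedged illustration of the underlying FCA primitives only, the sketch below implements the two derivation operators on a toy binary context (the context and the word-by-bigram feature choice are invented for illustration):

    import numpy as np

    def derive_objects(I, attrs):
        """All objects possessing every attribute in `attrs` (FCA derivation)."""
        if not attrs:
            return set(range(I.shape[0]))
        return set(np.where(I[:, sorted(attrs)].all(axis=1))[0])

    def derive_attrs(I, objs):
        """All attributes shared by every object in `objs` (FCA derivation)."""
        if not objs:
            return set(range(I.shape[1]))
        return set(np.where(I[sorted(objs), :].all(axis=0))[0])

    # Toy context (invented for illustration): rows are words, columns are
    # binary features such as "contains a given bigram".
    I = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [1, 1, 1]], dtype=bool)
    extent = derive_objects(I, {0, 1})
    intent = derive_attrs(I, extent)
    print(extent, intent)  # a formal concept: derive_objects(I, intent) == extent

    A pair (extent, intent) that is closed under both operators is a formal concept; FCA-based analyses work with the lattice of all such concepts.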

    Dimension Estimation Using Weighted Correlation Dimension Method

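    No abstract is available for this entry, and the paper's weighting scheme is not described here. As a hedged sketch of the unweighted baseline the title's method presumably refines, the classical Grassberger-Procaccia correlation dimension fits the slope of log C(r) against log r, where C(r) is the fraction of point pairs within distance r:

    import numpy as np
    from scipy.spatial.distance import pdist

    def correlation_dimension(X, radii):
        """Classical (unweighted) Grassberger-Procaccia estimate: the slope of
        log C(r) versus log r over the scaling regime covered by `radii`."""
        d = pdist(X)                # all pairwise distances
        log_r, log_C = [], []
        for r in radii:
            C = (d < r).mean()      # fraction of pairs closer than r
            if C > 0:               # skip radii below the smallest distance
                log_r.append(np.log(r))
                log_C.append(np.log(C))
        slope, _ = np.polyfit(log_r, log_C, 1)
        return slope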

    Study of manifold geometry using non-negative kernel graphs

    Given the increasing amounts of data being measured and recorded, effective dimensionality-reduction systems have become necessary for a wide variety of tasks. A dataset can be characterized by its geometrical properties, including its point density, curvature, and dimensionality. In this context, the intrinsic dimension (ID) refers to the minimum number of parameters required to characterize a dataset. Many tools have been proposed for ID estimation, and the ones that achieve the best results are narrowly focused on this goal. These highly specialized estimators do not allow the local geometry of the data to be interpreted in aspects other than ID; conversely, methods that do make this possible cannot estimate ID reliably. We propose the use of non-negative kernel (NNK) graphs, an approach to graph construction that characterizes the local geometry of the data, to study the dimension and shape of data manifolds at multiple scales. We examine a series of properties of NNK graphs to gain insight into manifold datasets: the number of neighbors in an NNK graph, the dimension of the low-rank approximations of both K-nearest-neighbor (KNN) and NNK graphs, the diameter of the polytopes defined by NNK graphs, and the principal angles between the low-rank approximations of NNK graphs. Moreover, we study these properties at multiple scales using an algorithm that sparsifies the data by merging points according to a chosen similarity. Using a similarity based on local NNK neighborhoods, we can subsample datasets while preserving the geometrical properties of the initial dataset.
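    The abstract leaves the NNK construction itself implicit; in Shekkizhar and Ortega's formulation, each point's KNN candidate set is pruned by a non-negative quadratic program in kernel space, and only neighbors receiving strictly positive weight are kept. A minimal sketch under that formulation (the Gaussian kernel and all parameter choices are illustrative assumptions):

    import numpy as np
    from scipy.optimize import nnls

    def gaussian_kernel(X, Y, sigma=1.0):
        """Gaussian kernel matrix; the kernel choice is an illustrative assumption."""
        d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def nnk_neighbors(X, i, k=10, sigma=1.0, tol=1e-8):
        """Prune the k nearest neighbors of X[i] to an NNK neighborhood by solving
        min_{theta >= 0} theta^T K_SS theta - 2 theta^T K_Si, keeping only
        neighbors with strictly positive weight."""
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        S = np.argsort(d2)[1:k + 1]          # KNN candidates, excluding x_i itself
        K_SS = gaussian_kernel(X[S], X[S], sigma)
        K_Si = gaussian_kernel(X[S], X[i:i + 1], sigma).ravel()
        # Reduce the quadratic program to non-negative least squares via Cholesky:
        # with K_SS = L L^T, the objective is ||L^T theta - L^{-1} K_Si||^2 + const.
        L = np.linalg.cholesky(K_SS + tol * np.eye(k))
        theta, _ = nnls(L.T, np.linalg.solve(L, K_Si))
        keep = theta > tol
        return S[keep], theta[keep]

    X = np.random.default_rng(0).normal(size=(200, 3))
    neighbors, weights = nnk_neighbors(X, i=0, k=15)  # typically fewer than 15 survive

    The count of surviving neighbors is the first property the abstract examines; the remaining properties (low-rank dimension, polytope diameter, principal angles) are computed from these same neighborhoods.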

    MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

    Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops and noise starts, i.e., what the proper size of the factor matrices is. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits: it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general, making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
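    The abstract mentions starting from "a simple and intuitive approach" before arriving at the data-to-model encoding; the sketch below is one such naive two-part code, an assumption rather than the paper's refined encoding. Each Boolean matrix is charged the bits needed to transmit its number of 1s plus their positions, and a factorization is scored by the cost of its factors plus the error matrix:

    import numpy as np
    from math import lgamma, log

    def log2_binom(n, k):
        """log2 of the binomial coefficient C(n, k), via log-gamma."""
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

    def dl_matrix(M):
        """Naive code length in bits for a Boolean matrix of known shape:
        transmit the number of 1s, then which cells hold them."""
        n, ones = M.size, int(M.sum())
        return np.log2(n + 1) + log2_binom(n, ones)

    def mdl_bmf_score(A, B, C):
        """Two-part MDL score for the Boolean factorization A ~= B o C
        (Boolean matrix product); lower is better."""
        approx = (B.astype(int) @ C.astype(int)) > 0   # Boolean product
        E = np.logical_xor(A.astype(bool), approx)     # cells the model gets wrong
        return dl_matrix(B) + dl_matrix(C) + dl_matrix(E)

    Model order selection then reduces to evaluating this score for factorizations of increasing rank and keeping the minimizer: extra factor columns must pay for themselves by shrinking the error term.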