    On distance measures for well-distributed sets

    In this paper we investigate the Erdős/Falconer distance conjecture for a natural class of sets statistically, though not necessarily arithmetically, similar to a lattice. We prove a good upper bound for spherical means that have been classically used to study this problem. We conjecture that a majorant for the spherical means suffices to prove the distance conjecture(s) in this setting. For a class of non-Euclidean distances, we show that this generally cannot be achieved, at least in dimension two, by considering integer point distributions on convex curves and surfaces. In higher dimensions, we link this problem to the question about the existence of smooth well-curved hypersurfaces that support many integer points.

    Spectral comparison of large urban graphs

    The spectrum of an axial graph is proposed as a means for comparison between spaces, particularly for measuring differences between very large and complex graphs. A number of methods have been used in recent years for comparative analysis within large sets of urban areas, either to investigate properties of specific known types of street network or to propose a taxonomy of urban morphology based on an analytical technique. In many cases, a single measure or a small range of predefined scalar measures, such as metric distance, integration, control or clustering coefficient, has been used to compare the graphs. While these measures are well understood theoretically, their low dimensionality limits the range of observations that can ultimately be drawn from the data. Spectral analysis consists of a high dimensional vector representing each space, between which metric distance may be measured to indicate the overall difference between two spaces, or from which subspaces may be extracted to correspond to certain features. It is used for comparison of entire urban graphs, to determine similarities (and differences) in their overall structure. Results are shown of a comparison of 152 cities distributed around the world. The clustering of cities of similar properties in a high dimensional space is discussed. Principal and nonlinear components of the data set indicate significant correlations between the graph similarities of cities and their proximity to one another, suggesting that cultural features based on location are evident in the city form and that these can be quantified by the proposed method. Results of classification tests show that a city’s location can be estimated based purely on its form. The high dimensionality of the spectra is beneficial for its utility in data-mining applications that can draw correlations with other data sets such as land use information. It is shown how further processing by supervised learning allows the extraction of relevant features.
A methodological comparison is also drawn with statistical studies that use a strong correlation between human genetic markers and geographical location of populations to derive detailed reconstructions of prehistoric migration. Thus, it is suggested that the method may be utilised for mapping the transfer of cultural memes by measuring similarity between cities.
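
The idea of representing each graph by a spectral vector and measuring distance between vectors can be sketched as follows. This is only an illustration under stated assumptions: it uses the eigenvalues of the adjacency matrix, keeps the `k` largest (zero-padded for small graphs), and compares them with Euclidean distance. The abstract does not specify which matrix, truncation, or metric the authors use, and the names `spectrum`, `spectral_distance`, and `path_graph` are hypothetical.

```python
import numpy as np

def path_graph(n):
    """Adjacency matrix of an undirected path on n nodes (toy input)."""
    a = np.zeros((n, n))
    idx = np.arange(n - 1)
    a[idx, idx + 1] = 1.0
    a[idx + 1, idx] = 1.0
    return a

def spectrum(adjacency, k):
    """Return the k largest adjacency eigenvalues, zero-padded to length k.

    Assumption: the graph is undirected, so the matrix is symmetric and
    eigvalsh applies. The choice of the adjacency matrix is illustrative.
    """
    a = np.asarray(adjacency, dtype=float)
    eig = np.linalg.eigvalsh(a)      # ascending order
    top = eig[::-1][:k]              # k largest, descending
    out = np.zeros(k)
    out[:top.size] = top
    return out

def spectral_distance(a, b, k=8):
    """Euclidean distance between the truncated spectra of two graphs."""
    return float(np.linalg.norm(spectrum(a, k) - spectrum(b, k)))

if __name__ == "__main__":
    p = path_graph(5)
    c = p.copy()
    c[0, 4] = c[4, 0] = 1.0          # closing the path gives a 5-cycle
    print(spectral_distance(p, p), spectral_distance(p, c))
```

Because the spectrum does not depend on how nodes are labelled, two relabelled copies of the same graph have distance zero (up to rounding), which is what makes the vector usable as a whole-graph descriptor.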

    Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics

    Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as in information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel. The tool has been designed with efficiency in mind and allows for multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and thus can interface with analysis environments, such as Matlab, Pylab and Weka.

    The intrinsic dimension of biological data landscapes

    Analyzing large volumes of high-dimensional data is an issue of fundamental importance in science and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number of coordinates. That manifold, however, is generally twisted and curved; in addition, points on it will be non-uniformly distributed: two factors that make the identification of the ID and its exploitation really hard. Here we propose a new ID estimator using only the distances to the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature, of density variation, and the resulting computational cost. The ID estimator is theoretically exact in uniformly distributed data sets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by high-dimensional noise, a situation often encountered in real-world data sets. Upon defining a notion of distance between protein sequences, this tool is used to estimate the ID of protein families and to assess the consistency of generative models. Moreover, if coupled with a density estimator, our ID allows one to measure the density of points by taking into account the space in which they actually lie, thus allowing for a cleaner estimation. Here we move a step further towards an automatic classification of protein sequences by using three new tools: our ID estimator, a density estimator and a clustering algorithm. We present the analysis performed on a Pfam PUA clan, showing that these combined tools allow the successful separation of protein domains into architectures.
Finally, we present a generalized model for the estimation of the ID that is able to work in data sets with multiple dimensionalities: taking advantage of Bayesian inference techniques, the method allows discriminating manifolds with different dimensions as well as assigning all the points to the respective manifolds. We test the method on a molecular dynamics trajectory, showing that the folded state has a higher dimension than the unfolded one.
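
A minimal sketch of an estimator built only from first and second nearest-neighbor distances, in the spirit of the abstract above. It assumes that, for locally uniform density, the ratio mu = r2/r1 follows a Pareto law whose exponent equals the ID, and it fits that exponent by maximum likelihood. The fitting procedure in the work itself may differ, and the function name `two_nn_id` and the brute-force neighbor search are illustrative choices.

```python
import numpy as np

def two_nn_id(points):
    """Estimate intrinsic dimension from nearest-neighbor distance ratios.

    Assumption (not taken from the abstract): mu = r2 / r1 is Pareto
    distributed with exponent d, so the maximum-likelihood estimate is
    d = N / sum(log(mu)).
    """
    x = np.asarray(points, dtype=float)
    # Brute-force pairwise distances; fine for a few thousand points.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbor
    # After partitioning, columns 0 and 1 hold the two smallest distances.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                    # drop exact duplicate points
    mu = r2[keep] / r1[keep]
    return float(keep.sum() / np.log(mu).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cube = rng.random((1000, 3))     # uniform sample in a 3-d cube
    plane = np.zeros((800, 5))
    plane[:, :2] = rng.random((800, 2))  # 2-d sheet embedded in 5-d
    print(two_nn_id(cube), two_nn_id(plane))
```

On the synthetic samples the estimates should land near 3 and 2 respectively, regardless of the five embedding coordinates in the second case. Only two distances per point enter the estimate, which is why density variation and curvature on larger scales have limited influence.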

    Random Metric Spaces and Universality

    We define the notion of a random metric space and prove that with probability one such a space is isometric to the Urysohn universal metric space. The main technique is the study of universal and random distance matrices; we relate the properties of metric (in particular, universal) spaces to the properties of distance matrices. We show the link between those questions and the classification of Polish spaces with measure (Gromov or metric triples), and with the problem of S_{\infty}-invariant measures in the space of symmetric matrices. One of the new effects is the existence in the Urysohn space of so-called anarchical uniformly distributed sequences. We give examples of other categories in which randomness and universality coincide (graphs, etc.). Comment: 38 pages