On distance measures for well-distributed sets
In this paper we investigate the Erdős/Falconer distance conjecture for a
natural class of sets statistically, though not necessarily arithmetically,
similar to a lattice. We prove a good upper bound for spherical means that have
been classically used to study this problem. We conjecture that a majorant for
the spherical means suffices to prove the distance conjecture(s) in this
setting. For a class of non-Euclidean distances, we show that this generally
cannot be achieved, at least in dimension two, by considering integer point
distributions on convex curves and surfaces. In higher dimensions, we link this
problem to the question about the existence of smooth well-curved hypersurfaces
that support many integer points
Spectral comparison of large urban graphs
The spectrum of an axial graph is proposed as a means for comparison between spaces,
particularly for measuring differences between very large and complex graphs. A number of methods have
been used in recent years for comparative analysis within large sets of urban areas, both to
investigate properties of specific known types of street network or to propose a taxonomy of urban
morphology based on an analytical technique. In many cases, a single or small range of predefined,
scalar measures such as metric distance, integration, control or clustering coefficient have
been used to compare the graphs. While these measures are well understood theoretically, their
low dimensionality determines the range of observations that can ultimately be drawn from the data.
Spectral analysis consists of a high dimensional vector representing each space, between which
metric distance may be measured to indicate the overall difference between two spaces, or
subspaces may be extracted to correspond to certain features. It is used for comparison of entire
urban graphs, to determine similarities (and differences) in their overall structure.
Results are shown of a comparison of 152 cities distributed around the world. The clustering of
cities of similar properties in a high dimensional space is discussed. Principal and nonlinear
components of the data set indicate significant correlations in the graph similarities between cities
and their proximity to one another, suggesting that cultural features based on location are evident in
the city form and that these can be quantified by the proposed method. Results of classification
tests show that a city’s location can be estimated based purely on its form.
The high dimensionality of the spectra is beneficial for its utility in data-mining applications that can
draw correlations with other data sets such as land use information. It is shown how further
processing by supervised learning allows the extraction of relevant features. A methodological
comparison is also drawn with statistical studies that use a strong correlation between human
genetic markers and geographical location of populations to derive detailed reconstructions of
prehistoric migration. Thus, it is suggested that the method may be utilised for mapping the transfer
of cultural memes by measuring comparison between cities
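
As a rough illustration of the idea of comparing graphs through their spectra, the sketch below represents each graph by a fixed-length vector of eigenvalues and measures the Euclidean distance between the vectors. The abstract does not say which matrix is used or how spectra of different sizes are aligned, so the choices here (adjacency matrix, the `k` largest eigenvalues, zero-padding for small graphs) and the names `spectrum` and `spectral_distance` are assumptions made for illustration only.

```python
import numpy as np


def spectrum(adjacency, k):
    """Return the k largest adjacency eigenvalues as a fixed-length vector.

    Assumption: graphs with fewer than k nodes are zero-padded at the end,
    so the padded tail is not guaranteed to stay in sorted order.
    """
    a = np.asarray(adjacency, dtype=float)
    eigenvalues = np.linalg.eigvalsh(a)   # symmetric matrix, ascending order
    top = eigenvalues[::-1][:k]           # largest first, at most k values
    padded = np.zeros(k)
    padded[: top.size] = top
    return padded


def spectral_distance(adj_a, adj_b, k=4):
    """Euclidean distance between the two spectral vectors."""
    return float(np.linalg.norm(spectrum(adj_a, k) - spectrum(adj_b, k)))


# Two toy undirected graphs on three nodes: a path and a triangle.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

d_same = spectral_distance(path3, path3)
d_diff = spectral_distance(path3, triangle)
```

A graph compared with itself gives distance zero, while structurally different graphs give a positive distance; clustering or classification would then operate on the spectral vectors themselves.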
Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics
Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as in information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel. The tool has been designed with efficiency in mind and allows for multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and thus can interface with analysis environments, such as Matlab, Pylab and Weka
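
For readers unfamiliar with the Levenshtein distance named in the abstract above, a minimal dynamic-programming sketch follows. It is not taken from Harry and makes no claim about how that tool implements the measure; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a                        # keep the working row short
    previous = list(range(len(b) + 1))     # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                      # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")  # 3
```

Only two rows of the table are kept at a time, so memory grows with the shorter string rather than with the product of both lengths.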
The intrinsic dimension of biological data landscapes
Analyzing large volumes of high-dimensional data is an issue of fundamental importance in science
and beyond. Several approaches work on the assumption that the important content of a dataset
belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number
of coordinates. That manifold however is generally twisted and curved; in addition points on it will
be non-uniformly distributed: two factors that make the identification of the ID and its exploitation
really hard. Here we propose a new ID estimator using only the distance of the first and the second
nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the
effects of curvature, of density variation, and the resulting computational cost. The ID estimator is
theoretically exact in uniformly distributed data sets, and provides consistent measures in general.
When used in combination with block analysis, it allows discriminating the relevant dimensions as
a function of the block size. This allows estimating the ID even when the data lie on a manifold
perturbed by a high-dimensional noise, a situation often encountered in real world data sets. Upon defining a notion of distance between protein sequences, this tool is used to estimate the ID of protein families and to assess the consistency of generative models. Moreover, if coupled with a density estimator, our ID estimator makes it possible to measure the density of points by taking into account the space in which they actually lie, thus allowing for a cleaner estimation. Here we move a step further towards an automatic classification of protein sequences by using three new tools: our ID estimator, a density estimator and a clustering algorithm. We present the analysis performed on a Pfam PUA clan, showing that these combined tools successfully separate protein domains into architectures. Finally, we present a generalized model for the estimation of the ID that is able to work in data sets with multiple dimensionalities: taking advantage of Bayesian inference techniques, the method allows discriminating manifolds with different dimensions as well as assigning all the points to the respective manifolds. We test the method on a molecular dynamics trajectory, showing that the folded state has a higher dimension than the unfolded one.
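
The abstract above describes an estimator that uses only the distances to each point's first and second nearest neighbours, but it does not give a formula. The sketch below shows one way such an estimate can be formed, under the assumption that the data are locally uniform: in that case the ratio of the second to the first neighbour distance follows a Pareto law whose exponent is the dimension, and a maximum-likelihood fit gives the estimate. The function name `two_nn_id`, the brute-force distance computation and the maximum-likelihood fit are illustrative choices, not necessarily the procedure used by the authors.

```python
import numpy as np


def two_nn_id(points):
    """Estimate intrinsic dimension from first/second neighbour distances.

    Assumption: the ratio mu = r2 / r1 is Pareto distributed with exponent
    equal to the dimension, so the maximum-likelihood estimate is
    N / sum(log(mu)). Duplicate points (r1 == 0) are skipped.
    """
    x = np.asarray(points, dtype=float)
    diff = x[:, None, :] - x[None, :, :]          # brute force, O(N^2) memory
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)                # ignore self-distances
    nearest = np.sort(dist, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0
    mu = r2[keep] / r1[keep]
    return mu.size / np.log(mu).sum()


# Synthetic check: a flat 2-D sheet embedded in a 5-D coordinate space.
rng = np.random.default_rng(0)
sheet = np.zeros((500, 5))
sheet[:, :2] = rng.uniform(size=(500, 2))
estimate = two_nn_id(sheet)
```

On this synthetic sheet the estimate should come out close to 2 rather than 5, which is the point of an intrinsic-dimension estimator: it reports the dimension of the manifold, not the number of coordinates.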
Random Metric Spaces and Universality
We define the notion of a random metric space and prove that with
probability one such a space is isometric to the Urysohn universal metric space.
The main technique is the study of universal and random distance matrices; we
relate the properties of metric (in particular universal) spaces to the
properties of distance matrices. We show the link between those questions and
classification of the Polish spaces with measure (Gromov or metric triples) and
with the problem about S_{\infty}-invariant measures in the space of symmetric
matrices. One of the new effects is the existence in the Urysohn space of so-called
anarchical uniformly distributed sequences. We give examples of other
categories in which randomness and universality coincide (graphs, etc.).
Comment: 38 pages