121,929 research outputs found
Data complexity measured by principal graphs
How to measure the complexity of a finite set of vectors embedded in a
multidimensional space? This is a non-trivial question which can be approached
in many different ways. Here we suggest a set of data complexity measures using
universal approximators, principal cubic complexes. Principal cubic complexes
generalise the notion of principal manifolds for datasets with non-trivial
topologies. The type of the principal cubic complex is determined by its
dimension and a grammar of elementary graph transformations. The simplest
grammar produces principal trees.
We introduce three natural types of data complexity: 1) geometric (deviation
of the data's approximator from some "idealized" configuration, such as
deviation from harmonicity); 2) structural (how many elements of a principal
graph are needed to approximate the data), and 3) construction complexity (how
many applications of elementary graph transformations are needed to construct
the principal object starting from the simplest one).
We compute these measures for several simulated and real-life data
distributions and show them in the "accuracy-complexity" plots, helping to
optimize the accuracy/complexity ratio. We discuss various issues connected
with measuring data complexity. Software for computing data complexity measures
from principal cubic complexes is provided as well.Comment: Computers and Mathematics with Applications, in pres
Geometrical complexity of data approximators
There are many methods developed to approximate a cloud of vectors embedded
in high-dimensional space by simpler objects: starting from principal points
and linear manifolds to self-organizing maps, neural gas, elastic maps, various
types of principal curves and principal trees, and so on. For each type of
approximators the measure of the approximator complexity was developed too.
These measures are necessary to find the balance between accuracy and
complexity and to define the optimal approximations of a given type. We propose
a measure of complexity (geometrical complexity) which is applicable to
approximators of several types and which allows comparing data approximations
of different types.Comment: 10 pages, 3 figures, minor correction and extensio
Principal manifolds and graphs in practice: from molecular biology to dynamical systems
We present several applications of non-linear data modeling, using principal
manifolds and principal graphs constructed using the metaphor of elasticity
(elastic principal graph approach). These approaches are generalizations of the
Kohonen's self-organizing maps, a class of artificial neural networks. On
several examples we show advantages of using non-linear objects for data
approximation in comparison to the linear ones. We propose four numerical
criteria for comparing linear and non-linear mappings of datasets into the
spaces of lower dimension. The examples are taken from comparative political
science, from analysis of high-throughput data in molecular biology, from
analysis of dynamical systems.Comment: 12 pages, 9 figure
Toward a multilevel representation of protein molecules: comparative approaches to the aggregation/folding propensity problem
This paper builds upon the fundamental work of Niwa et al. [34], which
provides the unique possibility to analyze the relative aggregation/folding
propensity of the elements of the entire Escherichia coli (E. coli) proteome in
a cell-free standardized microenvironment. The hardness of the problem comes
from the superposition between the driving forces of intra- and inter-molecule
interactions and it is mirrored by the evidences of shift from folding to
aggregation phenotypes by single-point mutations [10]. Here we apply several
state-of-the-art classification methods coming from the field of structural
pattern recognition, with the aim to compare different representations of the
same proteins gathered from the Niwa et al. data base; such representations
include sequences and labeled (contact) graphs enriched with chemico-physical
attributes. By this comparison, we are able to identify also some interesting
general properties of proteins. Notably, (i) we suggest a threshold around 250
residues discriminating "easily foldable" from "hardly foldable" molecules
consistent with other independent experiments, and (ii) we highlight the
relevance of contact graph spectra for folding behavior discrimination and
characterization of the E. coli solubility data. The soundness of the
experimental results presented in this paper is proved by the statistically
relevant relationships discovered among the chemico-physical description of
proteins and the developed cost matrix of substitution used in the various
discrimination systems.Comment: 17 pages, 3 figures, 46 reference
Urban identity through quantifiable spatial attributes: coherence and dispersion of local identity through the automated comparative analysis of building block plans
This analysis investigates whether and to what degree quantifiable spatial attrib-utes, as expressed in plan representations, can capture elements related to the ex-perience of spatial identity. By combining different methods of shape and spatial analysis it attempts to quantify spatial attributes, predominantly derived from plans, in order to illustrate patterns of interrelations between spaces through an ob-jective automated process. The study focuses on the scale of the urban block as the basic modular unit for the formation of urban configurations and the issue of spa-tial identity is perceived through consistency and differentiation within and amongst urban neighbourhoods
- …