K-nearest Neighbor Search by Random Projection Forests
K-nearest neighbor (kNN) search has wide applications in many areas,
including data mining, machine learning, statistics and many applied domains.
Inspired by the success of ensemble methods and the flexibility of tree-based
methodology, we propose random projection forests (rpForests), for kNN search.
rpForests finds kNNs by aggregating results from an ensemble of random
projection trees with each constructed recursively through a series of
carefully chosen random projections. rpForests achieves remarkable accuracy,
with fast decay in both the missing rate of kNNs and the discrepancy in the
kNN distances, and it has very low computational complexity. The
ensemble nature of rpForests makes it easily run in parallel on multicore or
clustered computers; the running time is expected to be nearly inversely
proportional to the number of cores or machines. We give theoretical insights
by showing the exponential decay of the probability that neighboring points
would be separated by ensemble random projection trees when the ensemble size
increases. Our theory can be used to refine the choice of random projections in
the growth of trees, and experiments show that the effect is remarkable.
Comment: 15 pages, 4 figures, 2018 IEEE Big Data Conference
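The tree-building and candidate-aggregation scheme described above can be sketched as follows. The function names, the median-split rule, and the parameter defaults are illustrative assumptions, not the authors' implementation:

```python
# A minimal sketch of kNN search with an ensemble of random projection
# trees, in the spirit of rpForests. Each tree recursively splits the
# data at the median of a random 1-D projection; queries aggregate
# candidates from the leaf reached in every tree.
import numpy as np

def build_rptree(X, idx, leaf_size, rng):
    if len(idx) <= leaf_size:
        return ('leaf', idx)                 # leaf stores point indices
    d = rng.normal(size=X.shape[1])          # random projection direction
    proj = X[idx] @ d
    thr = np.median(proj)
    left, right = idx[proj <= thr], idx[proj > thr]
    if len(left) == 0 or len(right) == 0:    # degenerate split -> leaf
        return ('leaf', idx)
    return ('node', d, thr,
            build_rptree(X, left, leaf_size, rng),
            build_rptree(X, right, leaf_size, rng))

def query_tree(tree, q):
    while tree[0] == 'node':
        _, d, thr, lo, hi = tree
        tree = lo if q @ d <= thr else hi
    return tree[1]

def rp_forest_knn(X, q, k, n_trees=10, leaf_size=20, seed=0):
    rng = np.random.default_rng(seed)
    all_idx = np.arange(len(X))
    cand = set()
    for _ in range(n_trees):                 # aggregate leaves over trees
        cand.update(query_tree(build_rptree(X, all_idx, leaf_size, rng), q))
    cand = np.fromiter(cand, dtype=int)
    dist = np.linalg.norm(X[cand] - q, axis=1)
    return cand[np.argsort(dist)[:k]]        # exact kNN among candidates
```

Because each tree is built and queried independently, the loop over trees is what parallelizes trivially across cores or machines.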
A GDP-driven model for the binary and weighted structure of the International Trade Network
Recent events such as the global financial crisis have renewed the interest
in the topic of economic networks. One of the main channels of shock
propagation among countries is the International Trade Network (ITN). Two
important models for the ITN structure, the classical gravity model of trade
(more popular among economists) and the fitness model (more popular among
networks scientists), are both limited to the characterization of only one
representation of the ITN. The gravity model satisfactorily predicts the volume
of trade between connected countries, but cannot reproduce the observed missing
links (i.e. the topology). On the other hand, the fitness model can
successfully replicate the topology of the ITN, but cannot predict the volumes.
This paper tries to make an important step forward in the unification of those
two frameworks, by proposing a new GDP-driven model which can simultaneously
reproduce the binary and the weighted properties of the ITN. Specifically, we
adopt a maximum-entropy approach where both the degree and the strength of each
node are preserved. We then identify strong nonlinear relationships between the
GDP and the parameters of the model. This ultimately results in a weighted
generalization of the fitness model of trade, where the GDP plays the role of a
`macroeconomic fitness' shaping the binary and the weighted structure of the
ITN simultaneously. Our model mathematically highlights an important asymmetry
in the role of binary and weighted network properties, namely the fact that
binary properties can be inferred without the knowledge of weighted ones, while
the opposite is not true.
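The fitness-model backbone that this GDP-driven generalization builds on can be illustrated with the standard connection probability p_ij = z*x_i*x_j / (1 + z*x_i*x_j), where GDP plays the role of the fitness x_i. The toy fitness values and the parameter z below are invented for illustration, not fitted to ITN data:

```python
# A minimal sketch of the fitness-model link probability underlying the
# GDP-driven framework: p_ij = z*x_i*x_j / (1 + z*x_i*x_j).
import numpy as np

def link_probabilities(x, z):
    """Pairwise connection probabilities under the fitness model."""
    xx = z * np.outer(x, x)
    p = xx / (1.0 + xx)
    np.fill_diagonal(p, 0.0)                 # no self-loops
    return p

def expected_degrees(x, z):
    """Expected binary degree of each node given its fitness."""
    return link_probabilities(x, z).sum(axis=1)

x = np.array([1.0, 2.5, 0.3, 5.0])           # toy "GDP" fitnesses
p = link_probabilities(x, z=0.4)
```

Note that p_ij saturates at 1 for large fitness products, which is what lets the model reproduce the observed density of links rather than a complete graph.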
Randomizing world trade. II. A weighted network analysis
Based on the misleading expectation that weighted network properties always
offer a more complete description than purely topological ones, current
economic models of the International Trade Network (ITN) generally aim at
explaining local weighted properties, not local binary ones. Here we complement
our analysis of the binary projections of the ITN by considering its weighted
representations. We show that, unlike the binary case, all possible weighted
representations of the ITN (directed/undirected, aggregated/disaggregated)
cannot be traced back to local country-specific properties, which are therefore
of limited informativeness. Our two papers show that traditional macroeconomic
approaches systematically fail to capture the key properties of the ITN. In the
binary case, they do not focus on the degree sequence and hence cannot
characterize or replicate higher-order properties. In the weighted case, they
generally focus on the strength sequence, but knowledge of the latter is
not sufficient to understand or reproduce indirect effects.
Comment: See also the companion paper (Part I): arXiv:1103.1243 [physics.soc-ph], published as Phys. Rev. E 84, 046117 (2011).
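The asymmetry between binary and weighted local properties can be seen in a toy example: two weight matrices with identical strength sequences but different degree sequences, showing that strengths alone do not determine the topology. The matrices are invented for illustration:

```python
# Two toy weighted networks with the same strength sequence but
# different binary topologies: a weighted ring versus two disjoint
# heavy edges.
import numpy as np

def degrees(W):
    return (W > 0).sum(axis=1)               # binary degree sequence

def strengths(W):
    return W.sum(axis=1)                     # weighted strength sequence

W1 = np.array([[0, 1, 0, 1],                 # ring 0-1-2-3-0, unit weights
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)
W2 = np.array([[0, 2, 0, 0],                 # two disjoint edges, weight 2
               [2, 0, 0, 0],
               [0, 0, 0, 2],
               [0, 0, 2, 0]], dtype=float)
```

Both networks have strength 2 at every node, yet one has degree 2 everywhere and the other degree 1, so a model constrained only on strengths cannot distinguish them.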
A systematic review of data quality issues in knowledge discovery tasks
Large volumes of data are accumulating as organizations continuously capture collective amounts of data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks, together with a case study applied to the agricultural disease known as coffee rust.
Nonparametric Feature Extraction from Dendrograms
We propose feature extraction from dendrograms in a nonparametric way. The
Minimax distance measures correspond to building a dendrogram with the
single-linkage criterion, together with specific forms of a level function and
a distance function defined over it. We therefore extend this method to arbitrary
dendrograms. We develop a generalized framework wherein different distance
measures can be inferred from different types of dendrograms, level functions
and distance functions. Via an appropriate embedding, we compute a vector-based
representation of the inferred distances, in order to enable many numerical
machine learning algorithms to employ such distances. Then, to address the
model selection problem, we study the aggregation of different dendrogram-based
distances respectively in solution space and in representation space in the
spirit of deep representations. In the first approach, for example for the
clustering problem, we build a graph with positive and negative edge weights
according to the consistency of the clustering labels of different objects
among different solutions, in the context of ensemble methods. Then, we use an
efficient variant of correlation clustering to produce the final clusters. In
the second approach, we investigate combining different distances and
features sequentially, in the spirit of multi-layered
architectures to obtain the final features. Finally, we demonstrate the
effectiveness of our approach via several numerical studies.
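The Minimax distance mentioned above is the smallest, over all paths between two points, of the largest pairwise distance along the path; it equals the level at which the two points merge in a single-linkage dendrogram. A Floyd-Warshall-style recursion is one straightforward O(n^3) way to compute it, used here for illustration rather than as the paper's algorithm:

```python
# Minimax (path-based) distances from raw points: replace each pairwise
# distance by the min over paths of the max edge along the path.
import numpy as np

def minimax_distances(X):
    n = len(X)
    # Start from the full pairwise Euclidean distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    M = D.copy()
    for k in range(n):                       # relax through intermediate k
        # M[i, j] <- min(M[i, j], max(M[i, k], M[k, j]))
        M = np.minimum(M, np.maximum(M[:, k:k+1], M[k:k+1, :]))
    return M
```

For three collinear points at 0, 1, and 10, the direct distance between the endpoints is 10, but the minimax distance is 9, since the path through the middle point never requires a jump larger than 9.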
A New Method to Correct for Fiber Collisions in Galaxy Two-Point Statistics
In fiber-fed galaxy redshift surveys, the finite size of the fiber plugs
prevents two fibers from being placed too close to one another, limiting the
ability of studying galaxy clustering on all scales. We present a new method
for correcting such fiber collision effects in galaxy clustering statistics
based on spectroscopic observations. Our method makes use of observations in
tile overlap regions to measure the contributions from the collided population,
and to therefore recover the full clustering statistics. The method is rooted
in solid theoretical ground and is tested extensively on mock galaxy catalogs.
We demonstrate that our method can well recover the projected and the full
three-dimensional redshift-space two-point correlation functions on scales both
below and above the fiber collision scale, superior to the commonly used
nearest neighbor and angular correction methods. We discuss potential
systematic effects in our method. The statistical correction accuracy of our
method is only limited by sample variance, which scales down with (the square
root of) the volume probed. For a sample similar to the final SDSS-III BOSS
galaxy sample, the statistical correction error is expected to be at the level
of 1% on scales 0.1--30Mpc/h for the two-point correlation functions. The
systematic error only occurs on small scales, caused by non-perfect correction
of collision multiplets, and its magnitude is expected to be smaller than 5%.
Our correction method, which can be generalized to other clustering statistics
as well, enables more accurate measurements of full three-dimensional galaxy
clustering on all scales with galaxy redshift surveys. (Abridged)
Comment: ApJ accepted; matched to accepted version (improvements on systematics).
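As a point of reference, the nearest-neighbor correction that the abstract compares against can be sketched as follows: each galaxy that loses its redshift to a fiber collision transfers its weight to the nearest observed neighbor on the sky. The coordinates and the collided/observed split below are invented for illustration:

```python
# A toy sketch of the nearest-neighbor fiber-collision correction used
# as a baseline: collided galaxies upweight their nearest observed
# neighbor so that the total weight is conserved.
import numpy as np

def nn_upweight(observed_xy, collided_xy):
    """Per-observed-galaxy weights after the nearest-neighbor correction."""
    w = np.ones(len(observed_xy))
    if len(collided_xy):
        d = np.linalg.norm(collided_xy[:, None, :] - observed_xy[None, :, :],
                           axis=-1)          # collided-to-observed distances
        np.add.at(w, d.argmin(axis=1), 1.0)  # transfer each lost weight
    return w
```

This baseline conserves the total number of galaxies but, as the abstract notes, it misassigns small-scale pairs, which is what the tile-overlap method is designed to fix.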
Statistical properties of determinantal point processes in high-dimensional Euclidean spaces
The goal of this paper is to quantitatively describe some statistical
properties of higher-dimensional determinantal point processes with a primary
focus on the nearest-neighbor distribution functions. Toward this end, we
express these functions as determinants of $N \times N$ matrices and then
extrapolate to $N \to \infty$. This formulation allows for a quick and accurate
numerical evaluation of these quantities for point processes in Euclidean
spaces of dimension $d$. We also implement an algorithm due to Hough \emph{et
al.} \cite{hough2006dpa} for generating configurations of determinantal point
processes in arbitrary Euclidean spaces, and we utilize this algorithm in
conjunction with the aforementioned numerical results to characterize the
statistical properties of what we call the Fermi-sphere point process for
$d = 1$ to 4. This homogeneous, isotropic determinantal point process, discussed
also in a companion paper \cite{ToScZa08}, is the high-dimensional
generalization of the distribution of eigenvalues on the unit circle of a
random matrix from the circular unitary ensemble (CUE). In addition to the
nearest-neighbor probability distribution, we are able to calculate Voronoi
cells and nearest-neighbor extrema statistics for the Fermi-sphere point
process and discuss these as the dimension is varied. The results in this
paper accompany and complement analytical properties of higher-dimensional
determinantal point processes developed in \cite{ToScZa08}.
Comment: 42 pages, 17 figures
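The $d = 1$ case referenced above can be sampled directly: the eigenvalue angles of a CUE random matrix form a determinantal point process on the unit circle, and their nearest-neighbor spacings follow from sorting the angles. The QR-with-phase-fix construction below is a standard way to draw Haar-random unitaries; it is illustrative and is not the Hough et al. algorithm:

```python
# Sample eigenvalue angles of a CUE (Haar-random unitary) matrix and
# compute nearest-neighbor spacings on the circle.
import numpy as np

def cue_angles(n, rng):
    """Sorted eigenvalue angles of an n x n Haar-random unitary matrix."""
    G = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    Q, R = np.linalg.qr(G)
    Q = Q * (np.diag(R) / np.abs(np.diag(R)))  # fix column phases -> Haar
    return np.sort(np.angle(np.linalg.eigvals(Q)))

def nn_spacings(theta):
    """Nearest-neighbor gap for each angle (includes the wrap-around)."""
    gaps = np.diff(np.append(theta, theta[0] + 2 * np.pi))
    return np.minimum(gaps, np.roll(gaps, 1))
```

The phase correction by the diagonal of R is what turns the raw QR factor into a draw from Haar measure; without it the sampled spacing distribution is biased.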