
    K-nearest Neighbor Search by Random Projection Forests

    K-nearest neighbor (kNN) search has wide applications in many areas, including data mining, machine learning, statistics, and many applied domains. Inspired by the success of ensemble methods and the flexibility of tree-based methodology, we propose random projection forests (rpForests) for kNN search. rpForests finds kNNs by aggregating results from an ensemble of random projection trees, each constructed recursively through a series of carefully chosen random projections. rpForests achieves remarkable accuracy, in the sense of fast decay in both the missing rate of kNNs and the discrepancy in the kNN distances, at very low computational complexity. The ensemble nature of rpForests makes it easy to run in parallel on multicore or clustered computers; the running time is expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insights by showing that the probability that neighboring points are separated by the ensemble of random projection trees decays exponentially as the ensemble size increases. Our theory can be used to refine the choice of random projections in the growth of trees, and experiments show that the effect is remarkable.
    Comment: 15 pages, 4 figures, 2018 IEEE Big Data Conference
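    To make the tree-ensemble idea concrete, here is a minimal sketch of kNN search with a forest of random projection trees. It is a simplification under assumptions, not the authors' implementation: splits use plain Gaussian directions and a median threshold, whereas the paper carefully refines its choice of random projections; all names (build_rp_tree, rp_forest_knn, leaf_size, n_trees) are hypothetical.

```python
import numpy as np

def build_rp_tree(X, idx, leaf_size=20, rng=None):
    """Recursively split point indices by a random projection at the median."""
    if rng is None:
        rng = np.random.default_rng()
    if len(idx) <= leaf_size:
        return idx                                  # leaf: candidate indices
    direction = rng.standard_normal(X.shape[1])     # random projection axis
    proj = X[idx] @ direction
    median = np.median(proj)
    left, right = idx[proj <= median], idx[proj > median]
    if len(left) == 0 or len(right) == 0:
        return idx                                  # degenerate split: stop
    return (direction, median,
            build_rp_tree(X, left, leaf_size, rng),
            build_rp_tree(X, right, leaf_size, rng))

def query_leaf(node, x):
    """Route a query down one tree and return the indices in its leaf."""
    while isinstance(node, tuple):
        direction, median, left, right = node
        node = left if x @ direction <= median else right
    return node

def rp_forest_knn(X, x, k=5, n_trees=10, seed=0):
    """Union the leaf candidates over the ensemble, then rank them exactly."""
    rng = np.random.default_rng(seed)
    forest = [build_rp_tree(X, np.arange(len(X)), rng=rng)
              for _ in range(n_trees)]
    candidates = np.unique(np.concatenate([query_leaf(t, x) for t in forest]))
    dists = np.linalg.norm(X[candidates] - x, axis=1)
    return candidates[np.argsort(dists)[:k]]
```

    Because each tree is built and queried independently, the loop over the forest is trivially parallelizable, which is the source of the near-linear speedup claimed above.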

    A GDP-driven model for the binary and weighted structure of the International Trade Network

    Recent events such as the global financial crisis have renewed interest in the topic of economic networks. One of the main channels of shock propagation among countries is the International Trade Network (ITN). Two important models for the ITN structure, the classical gravity model of trade (more popular among economists) and the fitness model (more popular among network scientists), are both limited to the characterization of only one representation of the ITN. The gravity model satisfactorily predicts the volume of trade between connected countries, but cannot reproduce the observed missing links (i.e. the topology). On the other hand, the fitness model can successfully replicate the topology of the ITN, but cannot predict the volumes. This paper takes an important step toward unifying those two frameworks by proposing a new GDP-driven model which can simultaneously reproduce the binary and the weighted properties of the ITN. Specifically, we adopt a maximum-entropy approach where both the degree and the strength of each node are preserved. We then identify strong nonlinear relationships between the GDP and the parameters of the model. This ultimately results in a weighted generalization of the fitness model of trade, where the GDP plays the role of a 'macroeconomic fitness' shaping the binary and the weighted structure of the ITN simultaneously. Our model mathematically highlights an important asymmetry in the role of binary and weighted network properties, namely the fact that binary properties can be inferred without the knowledge of weighted ones, while the opposite is not true.
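    As a concrete illustration of the binary side of such a model, here is a minimal sketch of the fitness model's connection probabilities with GDP playing the role of fitness, p_ij = z g_i g_j / (1 + z g_i g_j). The parameter value and GDP figures are illustrative assumptions, not fitted values from the paper.

```python
import numpy as np

def connection_probabilities(gdp, z):
    """Fitness-model link probabilities, with GDP g_i as country i's fitness."""
    g = np.asarray(gdp, dtype=float)
    gg = z * np.outer(g, g)            # z * g_i * g_j for every country pair
    p = gg / (1.0 + gg)                # p_ij = z g_i g_j / (1 + z g_i g_j)
    np.fill_diagonal(p, 0.0)           # no self-trade links
    return p

def expected_degrees(p):
    """Expected number of trade partners: k_i = sum_j p_ij."""
    return p.sum(axis=1)

# Three stylized economies with GDPs in trillions of dollars (made-up numbers).
p = connection_probabilities([20.0, 5.0, 0.5], z=0.01)
print(expected_degrees(p))
```

    The weighted generalization described in the abstract additionally constrains node strengths, so it carries more structure than this binary-only sketch.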

    Randomizing world trade. II. A weighted network analysis

    Based on the misleading expectation that weighted network properties always offer a more complete description than purely topological ones, current economic models of the International Trade Network (ITN) generally aim at explaining local weighted properties, not local binary ones. Here we complement our analysis of the binary projections of the ITN by considering its weighted representations. We show that, unlike the binary case, all possible weighted representations of the ITN (directed/undirected, aggregated/disaggregated) cannot be traced back to local country-specific properties, which are therefore of limited informativeness. Our two papers show that traditional macroeconomic approaches systematically fail to capture the key properties of the ITN. In the binary case, they do not focus on the degree sequence and hence cannot characterize or replicate higher-order properties. In the weighted case, they generally focus on the strength sequence, but knowledge of the latter is not enough to understand or reproduce indirect effects.
    Comment: See also the companion paper (Part I): arXiv:1103.1243 [physics.soc-ph], published as Phys. Rev. E 84, 046117 (2011).
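    To pin down the two local quantities contrasted here, this short sketch computes the binary degree sequence and the weighted strength sequence from the same undirected trade matrix; the matrix entries are illustrative assumptions.

```python
import numpy as np

# Symmetric matrix of aggregated trade volumes between three countries
# (illustrative values, e.g. in billions of dollars).
W = np.array([[0.0, 3.2, 0.0],
              [3.2, 0.0, 1.1],
              [0.0, 1.1, 0.0]])

A = (W > 0).astype(int)       # binary projection: who trades with whom
degrees = A.sum(axis=1)       # k_i: number of trade partners of country i
strengths = W.sum(axis=1)     # s_i: total trade volume of country i

print(degrees)                # [1 2 1]
print(strengths)              # [3.2 4.3 1.1]
```

    The paper's point is that, in the weighted case, reproducing the strength sequence s_i alone is not enough to recover higher-order structure, whereas in the binary case the degree sequence k_i is far more informative.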

    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data keeps growing because organizations continuously capture data in pursuit of better decision-making. The most fundamental challenge is to explore these large volumes of data and to extract useful knowledge for future actions through knowledge discovery tasks; nevertheless, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    Nonparametric Feature Extraction from Dendrograms

    We propose a nonparametric way to extract features from dendrograms. Minimax distance measures correspond to building a dendrogram with the single-linkage criterion and defining specific forms of a level function and a distance function over it. We extend this method to arbitrary dendrograms, developing a generalized framework wherein different distance measures can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, enabling many numerical machine learning algorithms to employ such distances. To address the model selection problem, we then study the aggregation of different dendrogram-based distances, respectively in solution space and in representation space, in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods, and then use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the sequential combination of different distances and features, in the spirit of multi-layered architectures, to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
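    One concrete instance of the correspondence stated above: on a single-linkage dendrogram, the cophenetic distance between two points (the height at which they first join the same cluster) coincides with the Minimax distance. This minimal SciPy sketch computes those distances on toy data; the data and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).standard_normal((30, 2))  # toy data
D = pdist(X)                        # condensed pairwise Euclidean distances
Z = linkage(D, method='single')     # single-linkage dendrogram
minimax = cophenet(Z)               # condensed matrix of Minimax distances
print(minimax.shape)                # one entry per pair: 30*29/2 = 435
```

    The embedding step mentioned in the abstract would then map these distances to a vector representation (e.g. via classical multidimensional scaling) so that standard numerical learning algorithms can consume them.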

    A New Method to Correct for Fiber Collisions in Galaxy Two-Point Statistics

    In fiber-fed galaxy redshift surveys, the finite size of the fiber plugs prevents two fibers from being placed too close to one another, limiting the ability to study galaxy clustering on all scales. We present a new method for correcting such fiber collision effects in galaxy clustering statistics based on spectroscopic observations. Our method makes use of observations in tile overlap regions to measure the contributions from the collided population, and thereby to recover the full clustering statistics. The method is rooted in solid theoretical ground and is tested extensively on mock galaxy catalogs. We demonstrate that our method can accurately recover the projected and the full three-dimensional redshift-space two-point correlation functions on scales both below and above the fiber collision scale, outperforming the commonly used nearest-neighbor and angular correction methods. We discuss potential systematic effects in our method. The statistical correction accuracy of our method is limited only by sample variance, which scales down with (the square root of) the volume probed. For a sample similar to the final SDSS-III BOSS galaxy sample, the statistical correction error is expected to be at the level of 1% on scales 0.1--30 Mpc/h for the two-point correlation functions. The systematic error occurs only on small scales, is caused by imperfect correction of collision multiplets, and its magnitude is expected to be smaller than 5%. Our correction method, which can be generalized to other clustering statistics as well, enables more accurate measurements of full three-dimensional galaxy clustering on all scales with galaxy redshift surveys. (abridged)
    Comment: ApJ accepted. Matched to accepted version (improvements on systematics).
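    The core bookkeeping can be illustrated schematically: pairs involving collided galaxies are observable only where tiles overlap, so their pair counts are measured there and upweighted before being combined with pair counts from the fully observed population. This is a rough sketch of that idea under stated assumptions (a uniform overlap coverage fraction f_overlap, made-up separations), not the paper's actual estimator.

```python
import numpy as np

def pair_counts(separations, bins):
    """Histogram pair separations into radial bins."""
    return np.histogram(separations, bins=bins)[0].astype(float)

def corrected_dd(sep_resolved, sep_collided_overlap, f_overlap, bins):
    """Total DD counts: full-footprint pairs plus upweighted overlap pairs."""
    dd_res = pair_counts(sep_resolved, bins)           # observed everywhere
    dd_col = pair_counts(sep_collided_overlap, bins)   # overlap regions only
    return dd_res + dd_col / f_overlap                 # rescale overlap counts

bins = np.logspace(-1, 1.5, 11)           # ~0.1 to ~30 Mpc/h, as in the abstract
rng = np.random.default_rng(1)
sep_res = rng.uniform(0.1, 30.0, 10000)   # placeholder pair separations
sep_col = rng.uniform(0.1, 30.0, 500)     # placeholder overlap-only pairs
print(corrected_dd(sep_res, sep_col, f_overlap=0.3, bins=bins))
```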

    Statistical properties of determinantal point processes in high-dimensional Euclidean spaces

    The goal of this paper is to quantitatively describe some statistical properties of higher-dimensional determinantal point processes with a primary focus on the nearest-neighbor distribution functions. Toward this end, we express these functions as determinants of $N\times N$ matrices and then extrapolate to $N\to\infty$. This formulation allows for a quick and accurate numerical evaluation of these quantities for point processes in Euclidean spaces of dimension $d$. We also implement an algorithm due to Hough \emph{et al.} \cite{hough2006dpa} for generating configurations of determinantal point processes in arbitrary Euclidean spaces, and we utilize this algorithm in conjunction with the aforementioned numerical results to characterize the statistical properties of what we call the Fermi-sphere point process for $d = 1$ to $4$. This homogeneous, isotropic determinantal point process, discussed also in a companion paper \cite{ToScZa08}, is the high-dimensional generalization of the distribution of eigenvalues on the unit circle of a random matrix from the circular unitary ensemble (CUE). In addition to the nearest-neighbor probability distribution, we are able to calculate Voronoi cells and nearest-neighbor extrema statistics for the Fermi-sphere point process and discuss these as the dimension $d$ is varied. The results in this paper accompany and complement analytical properties of higher-dimensional determinantal point processes developed in \cite{ToScZa08}.
    Comment: 42 pages, 17 figures
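    For the $d = 1$ case mentioned above, the Fermi-sphere process reduces to the CUE eigenvalue distribution, which can be sampled directly. This minimal sketch (matrix size, trial count and binning are arbitrary choices, not the paper's settings) draws Haar-distributed unitary matrices and estimates the nearest-neighbor spacing distribution empirically.

```python
import numpy as np

def sample_cue_angles(n, rng):
    """Eigenvalue angles of an n x n CUE (Haar-unitary) matrix, sorted."""
    z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    q, r = np.linalg.qr(z)
    q *= np.diag(r) / np.abs(np.diag(r))   # phase fix to get Haar measure
    return np.sort(np.angle(np.linalg.eigvals(q)))

rng = np.random.default_rng(0)
n, trials, spacings = 50, 200, []
for _ in range(trials):
    theta = sample_cue_angles(n, rng)
    gaps = np.diff(np.append(theta, theta[0] + 2 * np.pi))  # wrap the circle
    spacings.append(gaps * n / (2 * np.pi))   # rescale to unit mean density
hist, edges = np.histogram(np.concatenate(spacings), bins=40, density=True)
print(hist)
```

    The determinant-based formulation in the paper evaluates the same nearest-neighbor statistics analytically; direct sampling like this serves as a cross-check in low dimensions.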