384 research outputs found
When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors
Finding similar user pairs is a fundamental task in social networks, with
numerous applications in ranking and personalization tasks such as link
prediction and tie strength detection. A common manifestation of user
similarity is based upon network structure: each user is represented by a
vector that represents the user's network connections, where pairwise cosine
similarity among these vectors defines user similarity. The predominant task
for user similarity applications is to discover all similar pairs that have a
pairwise cosine similarity value larger than a given threshold $\tau$. In
contrast to previous work where $\tau$ is assumed to be quite close to 1, we
focus on recommendation applications where $\tau$ is small, but still
meaningful. The all-pairs cosine similarity problem is computationally
challenging on networks with billions of edges, and especially so for settings
with small $\tau$. To the best of our knowledge, there is no practical solution
for computing all user pairs at such small values of $\tau$ on large social networks,
even using the power of distributed algorithms.
Our work directly addresses this challenge by introducing a new algorithm,
WHIMP, that solves this problem efficiently in the MapReduce model. The key
insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for
approximate matrix multiplication with the SimHash random projection techniques
of Charikar. We provide a theoretical analysis of WHIMP, proving that it has
near optimal communication costs while maintaining computation cost comparable
with the state of the art. We also empirically demonstrate WHIMP's scalability
by computing all highly similar pairs on four massive data sets, and show that
it accurately finds high similarity pairs. In particular, we note that WHIMP
successfully processes the entire Twitter network, which has tens of billions
of edges.
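The SimHash ingredient mentioned in this abstract can be illustrated with a small sketch. This is not WHIMP itself (there is no wedge sampling and no MapReduce here); it only shows how sign bits from random hyperplanes give an estimate of cosine similarity. The vectors, the dimension, and the number of bits are made-up values chosen for illustration.

```python
import math
import random


def simhash_signature(vec, planes):
    """One bit per random hyperplane: 1 if the projection is non-negative."""
    return [1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
            for plane in planes]


def estimate_cosine(sig_a, sig_b):
    """A bit differs with probability theta / pi, so invert that relation."""
    disagreements = sum(a != b for a, b in zip(sig_a, sig_b))
    theta = math.pi * disagreements / len(sig_a)
    return math.cos(theta)


def exact_cosine(u, v):
    """Reference value computed directly from the definition."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical setup: 8-dimensional adjacency-style vectors, 4096 sign bits.
rng = random.Random(0)
dim, num_bits = 8, 4096
planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

u = [1, 1, 0, 1, 0, 0, 1, 0]
v = [1, 0, 0, 1, 1, 0, 1, 0]

sig_u = simhash_signature(u, planes)
sig_v = simhash_signature(v, planes)
estimate = estimate_cosine(sig_u, sig_v)

print("exact cosine:    ", exact_cosine(u, v))
print("SimHash estimate:", estimate)
```

More bits tighten the estimate at the cost of longer signatures, which is the trade-off any sketch-based similarity method has to balance.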
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with
HPC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging an LSH-style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset by brute force would require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.
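One of the techniques named in this abstract is reservoir sampling. A minimal sketch of how it can cap the size of a hash bucket is given below. The class name `ReservoirBucket` and its fields are hypothetical and are not taken from the FLASH code; the sampling rule is the classical Algorithm R.

```python
import random


class ReservoirBucket:
    """Fixed-capacity bucket holding a uniform sample of all items added."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.items = []
        self.seen = 0  # total number of items offered so far

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Still room: keep every item.
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen,
            # evicting a uniformly chosen resident.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


bucket = ReservoirBucket(capacity=4, rng=random.Random(1))
for point_id in range(1000):
    bucket.add(point_id)

print(bucket.items, bucket.seen)
```

A fixed capacity keeps the memory layout predictable, which is convenient on parallel hardware, while the uniform sample keeps the bucket representative of everything that hashed into it.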
Optimal Data-Dependent Hashing for Approximate Near Neighbors
We show an optimal data-dependent hashing scheme for the approximate near
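neighbor problem. For an $n$-point data set in a $d$-dimensional space our data
structure achieves query time $O(d n^{\rho + o(1)})$ and space $O(n^{1 + \rho + o(1)} + dn)$, where $\rho = \frac{1}{2c^2 - 1}$ for the Euclidean space and
approximation $c > 1$. For the Hamming space, we obtain an exponent of
$\rho = \frac{1}{2c - 1}$.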
Our result completes the direction set forth in [AINR14] who gave a
proof-of-concept that data-dependent hashing can outperform classical Locality
Sensitive Hashing (LSH). In contrast to [AINR14], the new bound is not only
optimal, but in fact improves over the best (optimal) LSH data structures
[IM98,AI06] for all approximation factors $c > 1$.
From the technical perspective, we proceed by decomposing an arbitrary
dataset into several subsets that are, in a certain sense, pseudo-random.
Comment: 36 pages, 5 figures; an extended abstract appeared in the proceedings
of the 47th ACM Symposium on Theory of Computing (STOC 2015)
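To give a feel for the size of the improvement, the sketch below tabulates query-time exponents for a few approximation factors. It assumes the data-dependent exponents $1/(2c^2 - 1)$ (Euclidean) and $1/(2c - 1)$ (Hamming) and the classical LSH exponents $1/c^2$ and $1/c$, with lower-order $o(1)$ terms ignored. The function names are made up for this illustration.

```python
def rho_lsh_euclidean(c):
    """Classical LSH exponent for Euclidean space (lower-order terms ignored)."""
    return 1.0 / (c * c)


def rho_lsh_hamming(c):
    """Classical LSH exponent for Hamming space."""
    return 1.0 / c


def rho_data_dependent_euclidean(c):
    """Data-dependent exponent for Euclidean space."""
    return 1.0 / (2.0 * c * c - 1.0)


def rho_data_dependent_hamming(c):
    """Data-dependent exponent for Hamming space."""
    return 1.0 / (2.0 * c - 1.0)


for c in (1.5, 2.0, 3.0):
    print(
        f"c={c}: Euclidean {rho_lsh_euclidean(c):.3f} -> "
        f"{rho_data_dependent_euclidean(c):.3f}, "
        f"Hamming {rho_lsh_hamming(c):.3f} -> "
        f"{rho_data_dependent_hamming(c):.3f}"
    )
```

Because query time scales roughly as $n^{\rho}$, even a modest drop in the exponent translates into a large saving on big data sets.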
Approximate Near Neighbors for General Symmetric Norms
We show that every symmetric normed space admits an efficient nearest
neighbor search data structure with doubly-logarithmic approximation.
Specifically, for every $n$, $d = n^{o(1)}$, and every $d$-dimensional
symmetric norm $\|\cdot\|$, there exists a data structure for
$\mathrm{poly}(\log \log n)$-approximate nearest neighbor search over
$\|\cdot\|$ for $n$-point datasets achieving $n^{o(1)}$ query time and
$n^{1+o(1)}$ space. The main technical ingredient of the algorithm is a
low-distortion embedding of a symmetric norm into a low-dimensional iterated
product of top-$k$ norms.
We also show that our techniques cannot be extended to general norms.
Comment: 27 pages, 1 figure
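For readers unfamiliar with the building block named above, a top-$k$ norm is the sum of the $k$ largest absolute values of a vector's coordinates. The short sketch below computes it and checks the two extreme cases; the function name is chosen for this illustration, and the embedding itself is not shown.

```python
def top_k_norm(x, k):
    """Sum of the k largest absolute coordinate values of x."""
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    return sum(magnitudes[:k])


x = [3, -4, 1]
print(top_k_norm(x, 1))  # k = 1 gives the max norm
print(top_k_norm(x, 2))
print(top_k_norm(x, 3))  # k = d gives the l1 norm
```

The value depends only on the multiset of absolute values, so it is unchanged by permuting coordinates or flipping signs, which is exactly what makes it a symmetric norm.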
Mineral prices persistence and the development of a new energy vehicle industry in China: A fractional integration approach
In this paper we examine price persistence in a set of minerals critical for the production of new energy vehicles. We implement techniques based on fractional integration, also allowing for non-linearities and structural breaks at unknown periods of time. The results show that the series are generally very persistent, with orders of integration equal to or higher than 1 in practically all cases. The only exceptions are cobalt, tin and zinc, if breaks are permitted and only for a given subsample. These findings are highly relevant for opening a discussion about the challenges that the new energy vehicle industry faces in China. China's government has already enforced some relevant initiatives to stabilise prices, but we conclude that additional measures will be necessary considering the high degree of uncertainty of certain supply-demand factors.
Prof. Luis A. Gil-Alana gratefully acknowledges financial support from the MINEIC-AEI-FEDER PID2020-113691RB-I00 project from ‘Ministerio de Economía, Industria y Competitividad’ (MINEIC), ‘Agencia Estatal de Investigación’ (AEI) Spain and ‘Fondo Europeo de Desarrollo Regional’ (FEDER). He also acknowledges support from an internal Project of the Universidad Francisco de Vitoria, Madrid, Spain.
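The notion of an order of integration $d$ can be made concrete with the fractional differencing filter $(1 - L)^d$, whose weights follow the recursion $w_0 = 1$, $w_k = w_{k-1}(k - 1 - d)/k$. The sketch below only illustrates that filter; it is not the estimation procedure used in the paper, and the function names and the toy series are invented for the example.

```python
def frac_diff_weights(d, n):
    """First n coefficients of the expansion of (1 - L)^d."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply (1 - L)^d to a series, truncating the filter at the sample start."""
    w = frac_diff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]


# With d = 1 the filter reduces to ordinary first differences.
print(frac_diff_weights(1, 4))
print(frac_diff([1, 3, 6, 10], 1))

# With 0 < d < 1 the weights decay slowly, which is the signature of long memory.
print(frac_diff_weights(0.5, 3))
```

A series with $d$ at or above 1 does not revert to its mean after a shock, which is why estimates in that range are read as high persistence.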
Persistence and trends in CO2 emissions in Africa: is Chinese FDI behind these features?
In this article, we investigate the statistical features of CO2 emissions and CO2 emissions per capita in a group of 45 African countries by looking at their degree of persistence and testing for the existence of trends in the data. We also investigate whether this level of emissions is related to Chinese FDI in Africa. The results are very heterogeneous across countries: orders of integration are statistically below 1 in one group of countries; in the majority, the values are around 1; and in some others, the degree of integration is statistically significantly above 1. Linear time trends are observed in approximately half of the countries. These results imply that, in the long term, public measures to reduce CO2 emissions may be required in the majority of the countries, since in the event of shocks the series will not return by themselves to their original levels. Turning to Chinese FDI in these countries, we observe that there seems to be no relationship between Chinese investment in Africa and the persistence of CO2 emissions, though this result needs to be verified in future research.
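El rey o el papa. La crisis de lealtades del alto clero español a través de la controversia de 1799 en la Rota de la Nunciatura [The King or the Pope: The crisis of loyalties of the Spanish high clergy seen through the 1799 controversy in the Rota of the Nunciature]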
This text examines the crisis of the framework of dual loyalty that, during the ancien régime, had bound the high clergy simultaneously to the Crown and to the Holy See. The imbalance in this shared allegiance is approached at the micro level, taking as an example the controversy that took place in the Rota of the Nunciature of Spain as a consequence of the royal decree of 5 September 1799, by virtue of which the bishops and some royal courts would exercise, during the vacancy of the papal see caused by the death of Pius VI, certain faculties reserved to the Holy See. A detailed tracing of the earlier careers of the participants in the controversy largely explains the ecclesiological position they held.
This paper examines the crisis in the long-term relationship model between the Spanish upper clergy, the Crown and the Papacy. Throughout the ancien régime, the high-ranking secular clergy divided its loyalty between two sovereign powers without any major problem. But this double loyalty underwent a crisis in the second half of the eighteenth century, aggravated by the French Revolution and the international political context. The controversy that arose between the members of the Spanish Rota tribunal concerning the royal decree of 1799, which ordered bishops and some royal courts to assume functions reserved to the Holy See, shows on a micro level the factors that led people to choose one loyalty over another.
Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search
The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a
general technique for constructing a data structure to answer approximate near
neighbor queries by using a distribution $\mathcal{H}$ over locality-sensitive
hash functions that partition space. For a collection of $n$ points, after
preprocessing, the query time is dominated by $O(n^{\rho} \log n)$ evaluations
of hash functions from $\mathcal{H}$ and $O(n^{\rho})$ hash table lookups and
distance computations, where $\rho \in (0,1)$ is determined by the
locality-sensitivity properties of $\mathcal{H}$. It follows from a recent
result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive
hash functions can be reduced to $O(\log^2 n)$, leaving the query time to be
dominated by $O(n^{\rho})$ distance computations and $O(n^{\rho} \log n)$
additional word-RAM operations. We state this result as a general framework and
provide a simpler analysis showing that the number of lookups and distance
computations closely match the Indyk-Motwani framework, making it a viable
replacement in practice. Using ideas from another locality-sensitive hashing
framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of
additional word-RAM operations to $O(n^{\rho})$.
Comment: 15 pages, 3 figures
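The classical Indyk-Motwani construction that this abstract takes as its starting point can be sketched for Hamming space with bit-sampling hash functions: each table concatenates a few sampled coordinates into a key, and a query inspects the colliding points in every table. This is an illustrative sketch only, not the faster frameworks the paper proposes; the class name, the parameter values, and the data are made up.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of coordinates in which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


class HammingLSHIndex:
    """Several hash tables, each keyed by a few sampled bit positions."""

    def __init__(self, dim, num_tables, bits_per_table, seed=0):
        rng = random.Random(seed)
        self.positions = [
            [rng.randrange(dim) for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    @staticmethod
    def _key(point, positions):
        return tuple(point[i] for i in positions)

    def insert(self, point):
        idx = len(self.points)
        self.points.append(point)
        for positions, table in zip(self.positions, self.tables):
            table[self._key(point, positions)].append(idx)

    def query(self, q, radius):
        """Return the index of some stored point within `radius` of q, or None."""
        checked = set()
        for positions, table in zip(self.positions, self.tables):
            for idx in table.get(self._key(q, positions), ()):
                if idx in checked:
                    continue
                checked.add(idx)  # one distance computation per candidate
                if hamming(self.points[idx], q) <= radius:
                    return idx
        return None


dim = 64
data_rng = random.Random(42)
index = HammingLSHIndex(dim=dim, num_tables=10, bits_per_table=8, seed=7)
for _ in range(200):
    index.insert(tuple(data_rng.randrange(2) for _ in range(dim)))

query_point = index.points[17]
print(index.query(query_point, radius=0))
```

In this layout the per-query work splits into hash evaluations (building one key per table), table lookups, and distance computations on the candidates, which are the three cost terms the abstract discusses.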
Deep Discrete Hashing with Self-supervised Pairwise Labels
Hashing methods have been widely used for applications of large-scale image
retrieval and classification. Non-deep hashing methods using handcrafted
features have been significantly outperformed by deep hashing methods due to
their better feature representation and end-to-end learning framework. However,
the most striking successes in deep hashing have mostly involved discriminative
models, which require labels. In this paper, we propose a novel unsupervised
deep hashing method, named Deep Discrete Hashing (DDH), for large-scale image
retrieval and classification. In the proposed framework, we address two main
problems: 1) how to directly learn discrete binary codes? 2) how to equip the
binary representation with the ability of accurate image retrieval and
classification in an unsupervised way? We resolve these problems by introducing
an intermediate variable and a loss function steering the learning process,
which is based on the neighborhood structure in the original space.
Experimental results on standard datasets (CIFAR-10, NUS-WIDE, and Oxford-17)
demonstrate that our DDH significantly outperforms existing hashing methods by
a large margin in terms of mAP for image retrieval and object recognition. Code
is available at https://github.com/htconquer/ddh
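The mAP figure quoted above is the mean, over queries, of the average precision of each ranked result list. The sketch below shows one common way to compute it when database items are ranked by Hamming distance between binary codes. It is not the DDH evaluation code; the codes, labels, and function names are invented for illustration.

```python
def hamming_distance(a, b):
    """Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code, database_codes):
    """Database indices sorted by increasing Hamming distance to the query."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )


def average_precision(ranked_relevance):
    """Mean of precision@rank taken at the ranks of the relevant items."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Average of the per-query average precision values."""
    scores = [average_precision(r) for r in per_query_relevance]
    return sum(scores) / len(scores)


# Toy example: 4-bit codes with class labels.
database_codes = [0b0000, 0b0111, 0b0011, 0b1111]
database_labels = ["cat", "dog", "cat", "dog"]
query_code, query_label = 0b0001, "cat"

ranking = rank_by_hamming(query_code, database_codes)
relevance = [database_labels[i] == query_label for i in ranking]
print(ranking, relevance, average_precision(relevance))
```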