Why we need an independent index of the Web

Author: Lewandowski Dirk
Publication date: 01/01/2014

The path to greater diversity, as we have seen, cannot be achieved by merely hoping for a new search engine, nor will government support for a single alternative achieve this goal. What is required instead is to create the conditions that will make establishing such a search engine possible in the first place. I describe how building and maintaining a proprietary index is the greatest deterrent to such an undertaking. We must first overcome this obstacle. Doing so will still not solve the problem of the lack of diversity in the search engine marketplace, but it may establish the conditions necessary to achieve that desired end.

arXiv.org e-Print Archive

REPOSIT

FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

Author: Andoni A.
Broder A. Z.
Li P.
Lv Q.
Shrivastava A.
Weber R.
Publication date: 03/07/2018

We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force ($n^2D$) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
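
The abstract names minwise hashing as one ingredient of FLASH. As a rough illustration of that idea only, the following sketch estimates Jaccard similarity between two sparse items from short signatures instead of comparing them directly. It is a generic textbook MinHash, not FLASH's one-pass variant or its CPU/GPU implementation; the function names, the use of salted BLAKE2b as the hash family, and the signature length of 256 are all assumptions made for this example.

```python
import hashlib


def _hash(item: int, seed: int) -> int:
    """Hash a non-negative integer feature index with a per-seed salt (assumed hash family)."""
    digest = hashlib.blake2b(
        item.to_bytes(8, "little"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """Return one minimum hash value per seed for a non-empty set of feature indices."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, shown only for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two hypothetical sparse items, represented by the indices of their nonzeros.
    a = set(range(0, 100))
    b = set(range(50, 150))
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The point of the sketch is that the comparison cost depends on the signature length, not on the number of dimensions, which is why hashing-based methods are attractive when datasets have millions of dimensions.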

arXiv.org e-Print Archive

Crossref

Data Mining in Electronic Commerce

Author: Banks David L.
Said Yasmin H.
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 07/09/2006

Modern business is rushing toward e-commerce. If the transition is done properly, it enables better management, new services, lower transaction costs and better customer relations. Success depends on skilled information technologists, among whom are statisticians. This paper focuses on some of the contributions that statisticians are making to help change the business world, especially through the development and application of data mining methods. This is a very large area, and the topics we cover are chosen to avoid overlap with other papers in this special issue, as well as to respect the limitations of our expertise. Inevitably, electronic commerce has raised and is raising fresh research problems in a very wide range of statistical areas, and we try to emphasize those challenges. Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

Crossref

Possible Implication of a Single Nonextensive $p_T$ Distribution for Hadron Production in High-Energy $pp$ Collisions

Author: Cirto Leonardo J. L.
Tsallis Constantino
Wilk Grzegorz
Wong Cheuk-Yin
Publication venue: 'EDP Sciences'
Publication date: 01/12/2014

Multiparticle production processes in $pp$ collisions at the central rapidity region are usually considered to be divided into independent "soft" and "hard" components. The first is described by exponential (thermal-like) transverse momentum spectra in the low-$p_T$ region with a scale parameter $T$ associated with the temperature of the hadronizing system. The second is governed by power-like distributions of transverse momenta with power index $n$ at high $p_T$, associated with the hard scattering between partons. We show that the hard-scattering integral can be approximated as a nonextensive distribution of a quasi-power-law containing a scale parameter $T$ and a power index $n=1/(q-1)$, where $q$ is the nonextensivity parameter. We demonstrate that the whole region of transverse momenta presently measurable at LHC experiments at central rapidity (in which the observed cross section varies by 14 orders of magnitude down to the low-$p_T$ region) can be adequately described by a single nonextensive distribution. These results suggest the dominance of the hard-scattering hadron-production process and the approximate validity of a "no-hair" statistical-mechanical description of the $p_T$ spectra for the whole $p_T$ region at central rapidity for $pp$ collisions at high energies. Comment: 10 pages, 3 figures; presented by G.Wilk at the XLIV International Symposium on Multiparticle Dynamics; 8 - 12 September 2014 - Bologna, ITAL
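
The abstract gives only the relation $n=1/(q-1)$ and the scale $T$, not the distribution itself. As a hedged illustration, a commonly used quasi-power-law form consistent with those two parameters is written below; the choice of $p_T$ as the argument (rather than, say, transverse mass) and the omitted normalization are assumptions here and may differ from the paper.

```latex
% Assumed illustrative form; f denotes an unnormalized p_T spectrum.
f(p_T) \;\propto\; \left[\,1 + (q-1)\,\frac{p_T}{T}\,\right]^{-\frac{1}{q-1}}
       \;=\; \left[\,1 + \frac{p_T}{nT}\,\right]^{-n},
\qquad n = \frac{1}{q-1}.

% Limiting behaviour of this form:
%   low p_T  (p_T \ll nT):  f(p_T) \approx \exp(-p_T/T)   (thermal-like, scale T)
%   high p_T (p_T \gg nT):  f(p_T) \propto p_T^{-n}        (power law, index n)
```

Under this assumed form, a single function interpolates between the exponential behaviour attributed to the "soft" component and the power law attributed to the "hard" component, which is the sense in which one nonextensive distribution could cover the whole $p_T$ range.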

arXiv.org e-Print Archive

EDP Sciences OAI-PMH repository (1.2.0)

Effective and Efficient Similarity Index for Link Prediction of Complex Networks

Author: A. Popescul
B. Gallagher
C. D. Manning
Ci-Hang Jin
D. Lin
F. Lorrain
G. H. Golub
G. Jeh
G. Salton
J. A. Hanely
J. Zhu
K. Yu
L. Getoor
L. Lü
Linyuan Lü
M. Bilgic
P. Jaccard
S. Geisser
T. Murata
T. Sørensen
T. Zhou
Tao Zhou
Z. Huang
Publication venue: 'American Physical Society (APS)'
Publication date: 26/08/2009

Predictions of missing links in incomplete networks, such as protein-protein interaction networks, or of very likely but not yet existent links in evolving networks, such as friendship networks in web society, can serve as a guideline for further experiments or as valuable information for web users. In this paper, we introduce a local path index to estimate the likelihood of the existence of a link between two nodes. We propose a network model with controllable density and noise strength in generating links, and collect data from six real networks. Extensive numerical simulations on both modeled networks and real networks demonstrate the high effectiveness and efficiency of the local path index compared with two well-known and widely used indices, the common neighbors and the Katz index. Indeed, the local path index provides predictions competitive in accuracy with those of the Katz index while requiring much less CPU time and memory space, and is therefore a strong candidate for practical applications in data mining of huge-size networks. Comment: 8 pages, 5 figures, 3 table
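
The abstract does not state the formula for the local path index. The sketch below assumes the form in which it is usually given, $S = A^2 + \epsilon A^3$ for adjacency matrix $A$ and a small free parameter $\epsilon$, and compares it with the common-neighbors score $A^2$ on a toy graph. The function names, the value $\epsilon = 0.01$, and the four-node path graph are assumptions made for this example.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def common_neighbors(adj):
    """Common-neighbors score: entry (i, j) of A^2 counts paths of length 2."""
    return matmul(adj, adj)


def local_path_index(adj, epsilon=0.01):
    """Assumed local path index: A^2 + epsilon * A^3 (length-2 plus weighted length-3 paths)."""
    a2 = matmul(adj, adj)
    a3 = matmul(a2, adj)
    n = len(adj)
    return [[a2[i][j] + epsilon * a3[i][j] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Hypothetical undirected path graph 0 - 1 - 2 - 3.
    adj = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    cn = common_neighbors(adj)
    lp = local_path_index(adj)
    # Nodes 0 and 3 share no neighbor, so common neighbors cannot rank the pair,
    # while the length-3 path 0-1-2-3 gives it a small nonzero local path score.
    print("CN(0,3) =", cn[0][3], " LP(0,3) =", lp[0][3])
    print("CN(0,2) =", cn[0][2], " LP(0,2) =", lp[0][2])
```

Because only two matrix products are needed, this kind of score stays local, in contrast to the Katz index, which sums over paths of all lengths.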

arXiv.org e-Print Archive

Crossref

RERO DOC Digital Library

Information Outlook, October 2006

Author: Special Libraries Association
Publication venue: SJSU ScholarWorks
Publication date: 01/10/2006

Volume 10, Issue 10
https://scholarworks.sjsu.edu/sla_io_2006/1009/thumbnail.jp

SJSU ScholarWorks