Why we need an independent index of the Web
The path to greater diversity, as we have seen, cannot be achieved by merely
hoping for a new search engine nor will government support for a single
alternative achieve this goal. What is instead required is to create the
conditions that will make establishing such a search engine possible in the
first place. I describe how building and maintaining a proprietary index is the
greatest deterrent to such an undertaking. We must first overcome this
obstacle. Doing so will still not solve the problem of the lack of diversity in
the search engine marketplace. But it may establish the conditions necessary to
achieve that desired end.
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for
Similarity search accelerated with HPC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging an LSH-style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count-based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset using brute force would require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.
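The abstract's central idea, ranking neighbors by hash collisions instead of computing similarities, can be illustrated with a minimal sketch. This is not the FLASH implementation. It assumes classic MinHash with independent universal hash functions (not one-pass minwise hashing) and unbounded buckets (not reservoir-sampled ones), and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

PRIME = (1 << 61) - 1  # large prime modulus for the universal hash family


def make_hashes(k, seed=0):
    """Draw k hash functions h(x) = (a*x + b) mod PRIME (an assumed family)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash(nonzeros, hashes):
    """MinHash signature of a sparse vector given as a non-empty set of
    nonzero indices: one minimum per hash function."""
    return tuple(min((a * x + b) % PRIME for x in nonzeros) for a, b in hashes)


def build_tables(data, hashes, bands, rows):
    """Index items into `bands` hash tables keyed by `rows` signature values.
    Requires len(hashes) == bands * rows. Buckets are unbounded lists here."""
    tables = [defaultdict(list) for _ in range(bands)]
    for item_id, nonzeros in data.items():
        sig = minhash(nonzeros, hashes)
        for t in range(bands):
            tables[t][sig[t * rows:(t + 1) * rows]].append(item_id)
    return tables


def query(nonzeros, tables, hashes, rows, top_k):
    """Rank candidates by the number of tables in which they collide with
    the query. No similarity between raw vectors is ever computed."""
    sig = minhash(nonzeros, hashes)
    counts = Counter()
    for t, table in enumerate(tables):
        counts.update(table.get(sig[t * rows:(t + 1) * rows], ()))
    return [item for item, _ in counts.most_common(top_k)]
```

Items that share more nonzero indices with the query collide in more tables, so the collision count acts as a proxy for similarity. Capping each bucket with reservoir sampling, as the abstract mentions, would bound memory per bucket; that step is omitted here.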
Data Mining in Electronic Commerce
Modern business is rushing toward e-commerce. If the transition is done
properly, it enables better management, new services, lower transaction costs
and better customer relations. Success depends on skilled information
technologists, among whom are statisticians. This paper focuses on some of the
contributions that statisticians are making to help change the business world,
especially through the development and application of data mining methods. This
is a very large area, and the topics we cover are chosen to avoid overlap with
other papers in this special issue, as well as to respect the limitations of
our expertise. Inevitably, electronic commerce has raised and is raising fresh
research problems in a very wide range of statistical areas, and we try to
emphasize those challenges.
Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in the
Statistical Science (http://www.imstat.org/sts/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Possible Implication of a Single Nonextensive Distribution for Hadron Production in High-Energy Collisions
Multiparticle production processes in collisions at the central rapidity
region are usually considered to be divided into independent "soft" and "hard"
components. The first is described by exponential (thermal-like) transverse
momentum spectra in the low-$p_T$ region with a scale parameter $T$ associated
with the temperature of the hadronizing system. The second is governed by
power-like distributions of transverse momenta with power index $n$ at
high $p_T$, associated with the hard scattering between partons. We show that
the hard-scattering integral can be approximated as a nonextensive distribution
of a quasi-power-law containing a scale parameter $T$ and a power index
$n = 1/(q-1)$, where $q$ is the nonextensivity parameter. We demonstrate that the whole
region of transverse momenta presently measurable at LHC experiments at central
rapidity (in which the observed cross sections vary by orders of
magnitude down to the low-$p_T$ region) can be adequately described by a single
nonextensive distribution. These results suggest the dominance of the
hard-scattering hadron-production process and the approximate validity of a
"no-hair" statistical-mechanical description of the $p_T$ spectra for the whole
$p_T$ region at central rapidity for collisions at high energies.
Comment: 10 pages, 3 figures; presented by G. Wilk at the XLIV International
Symposium on Multiparticle Dynamics; 8 - 12 September 2014 - Bologna, ITALY
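For orientation, a commonly used form of a nonextensive (Tsallis-type) transverse momentum distribution is sketched below. The abstract does not give the formula, so the exact functional form and the symbols $p_T$ (transverse momentum), $T$ (scale parameter), $n$ (power index) and $q$ (nonextensivity parameter) are assumptions, and the paper itself may use a different variable such as the transverse mass.

```latex
% Assumed generic form, with n = 1/(q-1):
f(p_T) \;\propto\; \left[ 1 + (q-1)\,\frac{p_T}{T} \right]^{-\frac{1}{q-1}}
       \;=\; \left( 1 + \frac{p_T}{nT} \right)^{-n}

% Low-p_T limit (p_T \ll nT): exponential, thermal-like "soft" behavior
f(p_T) \;\approx\; \exp\!\left( -\frac{p_T}{T} \right)

% High-p_T limit (p_T \gg nT): power-law "hard" behavior
f(p_T) \;\approx\; \left( \frac{p_T}{nT} \right)^{-n}
```

Under this assumed form a single expression interpolates between the exponential and power-law regimes, which is the sense in which one distribution could cover both the "soft" and "hard" regions.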
Effective and Efficient Similarity Index for Link Prediction of Complex Networks
Predictions of missing links of incomplete networks like protein-protein
interaction networks or very likely but not yet existent links in evolutionary
networks like friendship networks in web society can be considered as a
guideline for further experiments or valuable information for web users. In
this paper, we introduce a local path index to estimate the likelihood of the
existence of a link between two nodes. We propose a network model with
controllable density and noise strength in generating links, as well as collect
data of six real networks. Extensive numerical simulations on both modeled
networks and real networks demonstrated the high effectiveness and efficiency
of the local path index compared with two well-known and widely used indices,
the common neighbors and the Katz index. Indeed, the local path index provides
predictions competitive in accuracy with those of the Katz index while requiring
much less CPU time and memory space, which makes it a strong candidate for
practical applications in data mining of huge-size networks.
Comment: 8 pages, 5 figures, 3 tables
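A minimal sketch of how such an index can be computed follows. The abstract does not state the formula; the code assumes the commonly cited definition of the local path index, S = A^2 + eps * A^3, where A is the adjacency matrix and eps is a small free parameter. The function names and the default eps are hypothetical.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def common_neighbors(A):
    """Common-neighbors score: (A^2)[i][j] counts shared neighbors of i and j."""
    return matmul(A, A)


def local_path_scores(A, eps=0.01):
    """Assumed local path index: S = A^2 + eps * A^3.
    (A^3)[i][j] counts walks of length 3 between i and j."""
    A2 = matmul(A, A)
    A3 = matmul(A2, A)
    n = len(A)
    return [[A2[i][j] + eps * A3[i][j] for j in range(n)] for i in range(n)]


# Path graph 0 - 1 - 2 - 3 as a small example.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
CN = common_neighbors(A)
LP = local_path_scores(A)
# Nodes 0 and 3 share no neighbor, so CN[0][3] is 0, but one walk of
# length 3 connects them, so LP[0][3] is eps rather than 0.
```

Under this assumed definition only two matrix products are needed, whereas the Katz index sums over walks of all lengths, which is consistent with the abstract's claim of lower CPU and memory cost.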
Information Outlook, October 2006
Volume 10, Issue 10