    Why we need an independent index of the Web

    Greater diversity, as we have seen, cannot be achieved by merely hoping for a new search engine, nor will government support for a single alternative achieve this goal. What is required instead is to create the conditions that make establishing such a search engine possible in the first place. I describe how building and maintaining a proprietary index is the greatest deterrent to such an undertaking. We must first overcome this obstacle. Doing so will not by itself solve the problem of the lack of diversity in the search engine marketplace, but it may establish the conditions necessary to achieve that end.

    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph on the webspam dataset in less than 10 seconds by brute force ($n^2D$) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
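    The ingredients named in this abstract (minwise hashing, reservoir-sampled buckets, and count-based ranking) can be illustrated with a small single-threaded sketch. This is not the FLASH implementation: the class names `ReservoirBucket` and `TinyLSHIndex`, the parameter defaults, and the use of Python's built-in `hash` as a stand-in for a minwise hash family are all assumptions made for illustration, and items are assumed to be sets of integers.

```python
import random
from collections import Counter


def minhash_signature(items, seeds):
    """One min-hash per seed. hash((seed, x)) is an illustrative stand-in
    for a proper minwise hash family; items are assumed to be integers."""
    return tuple(min(hash((seed, x)) for x in items) for seed in seeds)


class ReservoirBucket:
    """Fixed-capacity bucket that keeps a uniform random sample of
    everything ever added to it (classic reservoir sampling)."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.items = []
        self.seen = 0

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a stored item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


class TinyLSHIndex:
    """Hypothetical toy index: several hash tables, each keyed by a
    short min-hash signature, with bounded buckets."""

    def __init__(self, num_tables=8, hashes_per_table=2,
                 bucket_capacity=4, seed=0):
        self.rng = random.Random(seed)
        self.seeds = [
            [self.rng.getrandbits(32) for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [dict() for _ in range(num_tables)]
        self.capacity = bucket_capacity

    def _keys(self, items):
        return [minhash_signature(items, s) for s in self.seeds]

    def add(self, doc_id, items):
        for table, key in zip(self.tables, self._keys(items)):
            bucket = table.get(key)
            if bucket is None:
                bucket = table[key] = ReservoirBucket(self.capacity, self.rng)
            bucket.add(doc_id)

    def query(self, items, k):
        """Rank candidates by how many tables they collide in.
        No pairwise similarity is ever computed."""
        counts = Counter()
        for table, key in zip(self.tables, self._keys(items)):
            bucket = table.get(key)
            if bucket is not None:
                counts.update(bucket.items)
        return [doc for doc, _ in counts.most_common(k)]


if __name__ == "__main__":
    index = TinyLSHIndex()
    index.add("a", {1, 2, 3, 4})
    index.add("b", {1, 2, 3, 5})
    index.add("c", {100, 200, 300})
    print(index.query({1, 2, 3, 4}, k=2))
```

    Bounding each bucket with a reservoir keeps memory per table fixed regardless of how skewed the data is, and ranking by collision counts is what allows candidates to be ordered without computing any similarities.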

    Data Mining in Electronic Commerce

    Modern business is rushing toward e-commerce. If the transition is done properly, it enables better management, new services, lower transaction costs and better customer relations. Success depends on skilled information technologists, among whom are statisticians. This paper focuses on some of the contributions that statisticians are making to help change the business world, especially through the development and application of data mining methods. This is a very large area, and the topics we cover are chosen to avoid overlap with other papers in this special issue, as well as to respect the limitations of our expertise. Inevitably, electronic commerce has raised and is raising fresh research problems in a very wide range of statistical areas, and we try to emphasize those challenges.
    Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Possible Implication of a Single Nonextensive $p_T$ Distribution for Hadron Production in High-Energy $pp$ Collisions

    Multiparticle production processes in $pp$ collisions in the central rapidity region are usually considered to be divided into independent "soft" and "hard" components. The first is described by exponential (thermal-like) transverse momentum spectra in the low-$p_T$ region, with a scale parameter $T$ associated with the temperature of the hadronizing system. The second is governed by power-like distributions of transverse momenta with power index $n$ at high $p_T$, associated with hard scattering between partons. We show that the hard-scattering integral can be approximated as a nonextensive distribution of quasi-power-law form containing a scale parameter $T$ and a power index $n = 1/(q-1)$, where $q$ is the nonextensivity parameter. We demonstrate that the whole region of transverse momenta presently measurable at LHC experiments at central rapidity (in which the observed cross section varies by $14$ orders of magnitude down to the low-$p_T$ region) can be adequately described by a single nonextensive distribution. These results suggest the dominance of the hard-scattering hadron-production process and the approximate validity of a "no-hair" statistical-mechanical description of the $p_T$ spectra for the whole $p_T$ region at central rapidity for $pp$ collisions at high energies.
    Comment: 10 pages, 3 figures; presented by G.Wilk at the XLIV International Symposium on Multiparticle Dynamics; 8 - 12 September 2014 - Bologna, ITAL
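    To see how a single quasi-power-law can cover both regimes described above, consider a commonly used Tsallis-type form. The abstract does not give the parametrization, so the expression below is an assumed illustration and the paper's own (for example, one written in terms of transverse mass) may differ. With $n = 1/(q-1)$ it reduces to an exponential at low $p_T$ and to a power law at high $p_T$:

```latex
% Assumed illustrative form; not quoted from the paper.
\[
  \frac{dN}{dp_T} \;\propto\; \left(1 + \frac{p_T}{nT}\right)^{-n},
  \qquad n = \frac{1}{q-1}.
\]
% Low p_T (p_T \ll nT): thermal-like, "soft" behaviour with scale T.
\[
  \left(1 + \frac{p_T}{nT}\right)^{-n} \;\approx\; \exp\!\left(-\frac{p_T}{T}\right).
\]
% High p_T (p_T \gg nT): power-like, "hard" behaviour with index n.
\[
  \left(1 + \frac{p_T}{nT}\right)^{-n} \;\approx\; \left(\frac{p_T}{nT}\right)^{-n}
  \;\propto\; p_T^{-n}.
\]
```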

    Effective and Efficient Similarity Index for Link Prediction of Complex Networks

    Predictions of missing links in incomplete networks, such as protein-protein interaction networks, or of very likely but not yet existent links in evolving networks, such as friendship networks in web society, can serve as a guideline for further experiments or as valuable information for web users. In this paper, we introduce a local path index to estimate the likelihood of the existence of a link between two nodes. We propose a network model with controllable density and noise strength in generating links, and we collect data on six real networks. Extensive numerical simulations on both modeled networks and real networks demonstrate the high effectiveness and efficiency of the local path index compared with two well-known and widely used indices, the common neighbors index and the Katz index. Indeed, the local path index provides predictions competitive in accuracy with the Katz index while requiring much less CPU time and memory space, which makes it a strong candidate for practical applications in data mining of very large networks.
    Comment: 8 pages, 5 figures, 3 table
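    A small sketch can make the idea of a local path index concrete. The abstract does not state the formula; the code below assumes the form $S = A^2 + \epsilon A^3$ for adjacency matrix $A$, which extends the common neighbors count ($A^2$) with a small weight on paths of length three. The function names and the default value of `epsilon` are illustrative choices, not taken from the paper.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def local_path_index(adj, epsilon=0.01):
    """Assumed form S = A^2 + epsilon * A^3.
    A^2 counts common neighbors; A^3 counts paths of length three."""
    a2 = matmul(adj, adj)
    a3 = matmul(a2, adj)
    n = len(adj)
    return [
        [a2[i][j] + epsilon * a3[i][j] for j in range(n)]
        for i in range(n)
    ]


def rank_missing_links(adj, epsilon=0.01):
    """Score every unconnected pair (i < j) and sort by descending score."""
    scores = local_path_index(adj, epsilon)
    n = len(adj)
    candidates = [
        (scores[i][j], i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if adj[i][j] == 0
    ]
    return sorted(candidates, key=lambda t: -t[0])


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3.
    adj = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    for score, i, j in rank_missing_links(adj):
        print(f"({i}, {j}): {score}")
```

    On the path graph, common neighbors alone would assign the pair (0, 3) a score of zero, while the length-three term gives it a small positive score. Under the assumed form, the index needs only two matrix products rather than the full series or matrix inverse used by the Katz index, which is consistent with the lower CPU and memory cost the abstract reports.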

    Information Outlook, October 2006

    Volume 10, Issue 10