
    Queries mining for efficient routing in P2P communities

    Peer-to-peer (P2P) computing is currently attracting enormous attention. In P2P systems a very large number of autonomous computing nodes (the peers) pool together their resources and rely on each other for data and services. P2P data-sharing systems now generate a significant portion of Internet traffic. Examples include P2P systems for network storage, web caching, searching and indexing of relevant documents, and distributed network-threat analysis. Requirements for widely distributed information systems supporting virtual organizations have given rise to a new category of P2P systems called schema-based. In such systems each peer exposes its own schema, and the main objective is efficient search across the P2P network by processing each incoming query without consuming excessive bandwidth. The usability of these systems depends on effective techniques to find and retrieve data; however, efficient and effective routing of content-based queries is a challenging problem in P2P networks. This work is intended to motivate the use of mining algorithms and a hypergraph-based context to develop two different methods that significantly improve the efficiency of P2P communications. The proposed query routing methods direct each query to a set of relevant peers in such a way as to reduce network traffic and bandwidth consumption. We compare the performance of the two proposed methods with a baseline, and our experimental results show that they deliver strong performance and scalability. Comment: 20 pages, 9 figures. arXiv admin note: substantial text overlap with arXiv:1108.137
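    As an illustration of the general idea only (not the paper's specific mining or hypergraph algorithms), the minimal Python sketch below routes a query to the top-k neighbours whose previously answered queries share the most terms with it, instead of flooding all peers; the class, scoring, and peer names are hypothetical.

```python
from collections import defaultdict

class RoutingIndex:
    """Toy content-based routing table: for each neighbour, remember the
    terms of queries it has answered before (a crude form of mining the
    query history)."""

    def __init__(self):
        self.term_hits = defaultdict(lambda: defaultdict(int))  # peer -> term -> count

    def record_answer(self, peer, query_terms):
        # Update the neighbour's profile whenever it successfully answers a query.
        for term in query_terms:
            self.term_hits[peer][term] += 1

    def route(self, query_terms, k=2):
        # Score each neighbour by how often it answered queries sharing terms
        # with the current query, and forward only to the top-k peers
        # instead of flooding the whole network.
        scores = {
            peer: sum(hits.get(t, 0) for t in query_terms)
            for peer, hits in self.term_hits.items()
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        return [p for p in ranked if scores[p] > 0][:k]

if __name__ == "__main__":
    idx = RoutingIndex()
    idx.record_answer("peerA", {"music", "jazz"})
    idx.record_answer("peerB", {"video", "sports"})
    idx.record_answer("peerA", {"music", "rock"})
    print(idx.route({"music", "blues"}))   # -> ['peerA']
```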

    Benchmarking Big Data Systems: State-of-the-Art and Future Directions

    The great prosperity of big data systems such as Hadoop in recent years has made benchmarking these systems crucial for both research and industry communities. The complexity, diversity, and rapid evolution of big data systems give rise to new challenges in how we design generators to produce data with the 4V properties (i.e. volume, velocity, variety and veracity), as well as how we implement application-specific but still comprehensive workloads. However, most of the existing big data benchmarks can be described as attempts to solve specific problems in benchmarking systems. This article investigates the state of the art in benchmarking big data systems along with the future challenges to be addressed to realize a successful and efficient benchmark. Comment: 9 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:1402.519
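    To make the 4V requirements concrete, here is a hedged Python sketch of a toy data generator with explicit volume, velocity, variety, and veracity knobs; it is illustrative only and not drawn from any existing benchmark suite.

```python
import random
import string
import time

def generate_records(volume, velocity, error_rate=0.05, seed=0):
    """Toy generator illustrating the 4V knobs:
    volume   - total number of records,
    velocity - records emitted per second (throttled with sleep),
    variety  - structured and semi-structured records are mixed,
    veracity - a fraction of records is deliberately corrupted."""
    rng = random.Random(seed)
    for i in range(volume):
        if rng.random() < 0.5:
            record = {"id": i, "value": rng.gauss(0, 1)}  # structured record
        else:
            # semi-structured record with free text
            record = {"id": i, "text": "".join(rng.choices(string.ascii_lowercase, k=12))}
        if rng.random() < error_rate:
            record["value"] = None  # corrupted field, modelling poor veracity
        yield record
        time.sleep(1.0 / velocity)

if __name__ == "__main__":
    for rec in generate_records(volume=5, velocity=100):
        print(rec)
```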

    Collusion-resistant and privacy-preserving P2P multimedia distribution based on recombined fingerprinting

    Recombined fingerprints have been suggested as a convenient approach to improve the efficiency of anonymous fingerprinting for the legal distribution of copyrighted multimedia contents in P2P systems. The recombination idea is inspired by the principles of mating, recombination and heredity of the DNA sequences of living beings, but applied to binary sequences, as in genetic algorithms. However, the existing recombination-based fingerprinting systems do not provide a convenient solution for collusion resistance, since they require double-layer fingerprinting codes, making the practical implementation of such systems a challenging task. In fact, collusion resistance is regarded as the most relevant requirement of a fingerprinting scheme, and the lack of any acceptable solution to this problem would possibly deter content merchants from deploying any practical implementation of the recombination approach. In this paper, this drawback is overcome by introducing two non-trivial improvements, paving the way for a future real-life application of recombination-based systems. First, Nuida et al.'s collusion-resistant codes are used in a segment-wise fashion for the first time. Second, a novel version of the traitor-tracing algorithm is proposed in the encrypted domain, also for the first time, making it possible to provide the buyers with security against framing. In addition, the proposed method avoids the use of public-key cryptography for the multimedia content and expensive cryptographic protocols, leading to excellent performance in terms of both computational and communication burdens. The paper also analyzes the security and privacy properties of the proposed system both formally and informally, whereas the collusion resistance and the performance of the method are shown by means of experiments and simulations. Comment: 58 pages, 6 figures, journa
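    The recombination step itself can be illustrated with a short Python sketch: a child fingerprint is assembled segment by segment from two parent fingerprints, in the spirit of genetic crossover. The plain bit strings below are placeholders only; a real scheme would use collusion-resistant (e.g. Nuida et al.) codewords per segment and operate in the encrypted domain.

```python
import random

def recombine(parent_a, parent_b, segment_len, rng=random):
    """Segment-wise recombination of two binary fingerprints: each segment
    of the child is copied from one of the two parents, as in genetic
    crossover.  Placeholder bits stand in for per-segment codewords."""
    assert len(parent_a) == len(parent_b) and len(parent_a) % segment_len == 0
    child = []
    for start in range(0, len(parent_a), segment_len):
        donor = parent_a if rng.random() < 0.5 else parent_b
        child.extend(donor[start:start + segment_len])
    return child

if __name__ == "__main__":
    rng = random.Random(42)
    a = [rng.randint(0, 1) for _ in range(16)]
    b = [rng.randint(0, 1) for _ in range(16)]
    print(recombine(a, b, segment_len=4, rng=rng))
```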

    On Big Data Benchmarking

    Big data systems address the challenges of capturing, storing, managing, analyzing, and visualizing big data. Within this context, developing benchmarks to evaluate and compare big data systems has become an active topic for both research and industry communities. To date, most of the state-of-the-art big data benchmarks are designed for specific types of systems. Based on our experience, however, we argue that considering the complexity, diversity, and rapid evolution of big data systems, and for the sake of fairness, big data benchmarks must include a diversity of data and workloads. Given this motivation, in this paper we first propose the key requirements and challenges in developing big data benchmarks, from the perspectives of generating data with the 4V properties of big data (i.e. volume, velocity, variety and veracity) and of generating tests with comprehensive workloads for big data systems. We then present a methodology for big data benchmarking designed to address these challenges. Next, the state of the art is summarized and compared, followed by our vision for future research directions. Comment: 7 pages, 4 figures, 2 tables, accepted in BPOE-04 (http://prof.ict.ac.cn/bpoe_4_asplos/
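    As a minimal illustration of the test-generation side, the sketch below times a toy workload against a system-under-test callable and reports average throughput; the harness and names are hypothetical and omit the data generation, cluster deployment, and metric collection that a real big data benchmark would control.

```python
import time

def run_benchmark(workload, system_under_test, repetitions=3):
    """Minimal harness: run a workload (an iterable of operations) against a
    system-under-test callable several times and report average throughput."""
    results = []
    for _ in range(repetitions):
        ops = list(workload())
        start = time.perf_counter()
        for op in ops:
            system_under_test(op)
        elapsed = time.perf_counter() - start
        results.append(len(ops) / elapsed)
    return sum(results) / len(results)

if __name__ == "__main__":
    # Toy word-count style workload against an in-memory dictionary.
    counts = {}

    def workload():
        return (f"key{i % 100}" for i in range(10_000))

    def apply_op(key):
        counts[key] = counts.get(key, 0) + 1

    print(f"avg throughput: {run_benchmark(workload, apply_op):.0f} ops/s")
```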

    Scientific Workflows and Provenance: Introduction and Research Opportunities

    Scientific workflows are becoming increasingly popular for compute-intensive and data-intensive scientific applications. The vision and promise of scientific workflows include rapid, easy workflow design, reuse, scalable execution, and other advantages, e.g., to facilitate "reproducible science" through provenance (e.g., data lineage) support. However, as described in the paper, important research challenges remain. While the database community has studied (business) workflow technologies extensively in the past, most current work in scientific workflows seems to be done outside of the database community, e.g., by practitioners and researchers in the computational sciences and eScience. We provide a brief introduction to scientific workflows and provenance, and identify areas and problems that suggest new opportunities for database research. Comment: 12 pages, 2 figures
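    A minimal Python sketch of provenance (data lineage) recording, assuming a deliberately simple model in which each output remembers only the step and inputs that produced it; real workflow systems capture far richer provenance, and the class and file names here are illustrative.

```python
class ProvenanceStore:
    """Toy data-lineage recorder: whenever a workflow step produces an output,
    remember which step ran and which inputs it read, so the derivation
    history of any artifact can be reconstructed afterwards."""

    def __init__(self):
        self.derivations = {}  # output id -> (step name, list of input ids)

    def record(self, output_id, step, input_ids):
        self.derivations[output_id] = (step, list(input_ids))

    def lineage(self, output_id):
        # Walk backwards through recorded derivations (depth-first).
        entry = self.derivations.get(output_id)
        if entry is None:
            return [output_id]  # raw input with no recorded derivation
        step, inputs = entry
        trace = [f"{output_id} <- {step}({', '.join(inputs)})"]
        for inp in inputs:
            trace.extend(self.lineage(inp))
        return trace

if __name__ == "__main__":
    prov = ProvenanceStore()
    prov.record("aligned.bam", "align", ["reads.fastq", "ref.fa"])
    prov.record("variants.vcf", "call_variants", ["aligned.bam"])
    print("\n".join(prov.lineage("variants.vcf")))
```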

    High-throughput Binding Affinity Calculations at Extreme Scales

    Resistance to chemotherapy and molecularly targeted therapies is a major factor in limiting the effectiveness of cancer treatment. In many cases, resistance can be linked to genetic changes in target proteins, either pre-existing or evolutionarily selected during treatment. Key to overcoming this challenge is an understanding of the molecular determinants of drug binding. Using multi-stage pipelines of molecular simulations, we can gain insights into the binding free energy and the residence time of a ligand, which can inform both stratified and personal treatment regimes and drug development. To support the scalable, adaptive and automated calculation of the binding free energy on high-performance computing resources, we introduce the High-throughput Binding Affinity Calculator (HTBAC). HTBAC uses a building-block approach in order to attain both workflow flexibility and performance. We demonstrate close to perfect weak scaling to hundreds of concurrent multi-stage binding affinity calculation pipelines. This permits a rapid time-to-solution that is essentially invariant of the calculation protocol, size of candidate ligands and number of ensemble simulations. As such, HTBAC advances the state of the art of binding affinity calculations and protocols.
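    HTBAC itself builds on dedicated workflow middleware; purely as an illustration of running many multi-stage pipelines concurrently, the hypothetical Python sketch below chains three stand-in stages per ligand and executes the pipelines in parallel. The stage functions and ligand names are placeholders, not HTBAC's API or any real simulation protocol.

```python
from concurrent.futures import ProcessPoolExecutor
import math

def minimize(ligand):        # stage 1: stand-in for energy minimisation
    return ligand, sum(ord(c) for c in ligand)

def equilibrate(state):      # stage 2: stand-in for equilibration
    ligand, e = state
    return ligand, e * 0.9

def score(state):            # stage 3: stand-in for free-energy estimation
    ligand, e = state
    return ligand, round(math.log1p(e), 3)

def pipeline(ligand):
    # One multi-stage binding-affinity pipeline; stages run sequentially,
    # while many pipelines execute concurrently across workers.
    return score(equilibrate(minimize(ligand)))

if __name__ == "__main__":
    ligands = [f"ligand_{i}" for i in range(8)]
    with ProcessPoolExecutor() as pool:
        for ligand, affinity in pool.map(pipeline, ligands):
            print(ligand, affinity)
```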

    Moving Processing to Data: On the Influence of Processing in Memory on Data Management

    Near-Data Processing refers to an architectural hardware and software paradigm based on the co-location of storage and compute units. Ideally, it allows application-defined data- or compute-intensive operations to be executed in situ, i.e. within (or close to) the physical data storage. Thus, Near-Data Processing seeks to minimize expensive data movement, improving performance, scalability, and resource efficiency. Processing-in-Memory is a sub-class of Near-Data Processing that targets data processing directly within memory (DRAM) chips. The effective use of Near-Data Processing mandates new architectures, algorithms, interfaces, and development toolchains.
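    The data-movement argument can be illustrated with a small, purely conceptual Python sketch: the same aggregate is computed either by pulling every record from a simulated storage node or by shipping the operation to the node and returning only the result. The class and method names are invented for illustration.

```python
# Two ways to answer "how many readings exceed a threshold?" over data that
# lives on a (simulated) storage node.  The near-data variant ships the
# operation to the data; the conventional variant ships the data to the query.

class StorageNode:
    def __init__(self, readings):
        self._readings = readings            # the data physically lives here

    def read_all(self):
        # Conventional path: move every record across the "interconnect".
        return list(self._readings)

    def execute(self, operation):
        # Near-data path: run the operation in place, return only the result.
        return operation(self._readings)

if __name__ == "__main__":
    node = StorageNode(range(1_000_000))
    threshold = 999_990

    moved = sum(1 for r in node.read_all() if r > threshold)                   # moves ~1M values
    in_situ = node.execute(lambda rs: sum(1 for r in rs if r > threshold))     # moves 1 value
    print(moved, in_situ)
```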

    A Survey of Parallel Sequential Pattern Mining

    With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms generally face problems and challenges including huge memory cost, low processing speed, and inadequate hard disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, such as frequent itemset mining and association rule mining, and also suffers from the above challenges when handling large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. In this paper, we provide an in-depth survey of the current status of parallel sequential pattern mining (PSPM), including a detailed categorization of traditional serial SPM approaches and of state-of-the-art parallel SPM. We review the related work on parallel sequential pattern mining in detail, including partition-based algorithms for PSPM, Apriori-based PSPM, pattern-growth-based PSPM, and hybrid algorithms for PSPM, and provide a detailed description (characteristics, advantages, disadvantages, and a summary) of these parallel PSPM approaches. Some advanced topics for PSPM, including parallel quantitative / weighted / utility sequential pattern mining, PSPM from uncertain data and stream data, and hardware acceleration for PSPM, are further reviewed in detail. In addition, we review some well-known open-source software for PSPM. Finally, we summarize some challenges and opportunities of PSPM in the big data era. Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 pages
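    As a toy illustration of the partition-based family (not any specific surveyed algorithm), the Python sketch below splits a tiny sequence database into partitions, counts candidate-pattern support locally in parallel, and merges the partial counts; the database, candidates, and support threshold are made up for the example.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import permutations

def is_subsequence(pattern, sequence):
    # True if the pattern's items appear in the sequence in the same order.
    it = iter(sequence)
    return all(item in it for item in pattern)

def local_support(partition_and_candidates):
    # Count, within one database partition, how many sequences contain each candidate.
    partition, candidates = partition_and_candidates
    counts = Counter()
    for seq in partition:
        for cand in candidates:
            if is_subsequence(cand, seq):
                counts[cand] += 1
    return counts

if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "c"), ("b", "a", "c"), ("a", "b")]
    items = sorted({x for seq in db for x in seq})
    candidates = [(i,) for i in items] + list(permutations(items, 2))

    # Partition-based PSPM in miniature: split the database, count locally in
    # parallel, then merge the partial supports and apply the threshold.
    partitions = [db[:2], db[2:]]
    with ProcessPoolExecutor() as pool:
        partials = pool.map(local_support, [(p, candidates) for p in partitions])
    total = Counter()
    for part in partials:
        total.update(part)
    min_support = 2
    print({p: c for p, c in total.items() if c >= min_support})
```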

    LightChain: A DHT-based Blockchain for Resource Constrained Environments

    As an append-only distributed database, blockchain is utilized in a vast variety of applications, including cryptocurrencies and the Internet of Things (IoT). Existing blockchain solutions have downsides in communication and storage efficiency, a tendency toward centralization, and consistency problems. In this paper, we propose LightChain, which is the first blockchain architecture that operates over a Distributed Hash Table (DHT) of participating peers. LightChain is a permissionless blockchain that provides addressable blocks and transactions within the network, which makes them efficiently accessible by all the peers. Each block and transaction is replicated within the DHT of peers and is retrieved in an on-demand manner. Hence, peers in LightChain are not required to retrieve or keep the entire blockchain. LightChain is fair, as all of the participating peers have a uniform chance of being involved in the consensus regardless of their influence, such as hashing power or stake. LightChain provides a deterministic fork-resolving strategy as well as a blacklisting mechanism, and it is secure against colluding adversarial peers attacking the availability and integrity of the system. We provide mathematical analysis and experimental results on scenarios involving 10K nodes to demonstrate the security and fairness of LightChain. As we show experimentally in this paper, compared to mainstream blockchains like Bitcoin and Ethereum, LightChain requires around 66 times less per-node storage and is around 380 times faster at bootstrapping a new node to the system, while each LightChain node has an equal chance of being rewarded for participating in the protocol.
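    As an illustration of the on-demand, content-addressed storage idea only (with an in-process dictionary standing in for the DHT, and nothing of LightChain's actual consensus or routing), consider the following Python sketch.

```python
import hashlib
import json

class ToyDHT:
    """Stand-in for a distributed hash table: maps a content address (hash)
    to the stored value.  In a real DHT the key space is partitioned across
    peers and lookups are routed between them."""

    def __init__(self):
        self._store = {}

    def put(self, value):
        key = hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()
        self._store[key] = value
        return key

    def get(self, key):
        return self._store[key]

def append_block(dht, prev_key, transactions):
    # Blocks are addressable by hash and fetched on demand, so a peer only
    # needs the key of the latest block, not a local copy of the whole chain.
    block = {"prev": prev_key, "txs": transactions}
    return dht.put(block)

if __name__ == "__main__":
    dht = ToyDHT()
    genesis = append_block(dht, prev_key=None, transactions=[])
    head = append_block(dht, genesis, ["alice->bob:5"])
    head = append_block(dht, head, ["bob->carol:2"])

    # Walk the chain on demand from the head pointer.
    key = head
    while key is not None:
        block = dht.get(key)
        print(key[:12], block["txs"])
        key = block["prev"]
```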

    GraphCombEx: A Software Tool for Exploration of Combinatorial Optimisation Properties of Large Graphs

    We present a prototype of a software tool for exploration of multiple combinatorial optimisation problems in large real-world and synthetic complex networks. Our tool, called GraphCombEx (an acronym of Graph Combinatorial Explorer), provides a unified framework for scalable computation and presentation of high-quality suboptimal solutions and bounds for a number of widely studied combinatorial optimisation problems. Efficient representation and applicability to large-scale graphs and complex networks are particularly considered in its design. The problems currently supported include maximum clique, graph colouring, maximum independent set, minimum vertex clique covering, minimum dominating set, as well as the longest simple cycle problem. Suboptimal solutions and intervals for optimal objective values are estimated using scalable heuristics. The tool is designed with extensibility in mind, with a view to adding further problems and both new fast and high-performance heuristics in the future. GraphCombEx has already been successfully used as a support tool in a number of recent research studies using combinatorial optimisation to analyse complex networks, indicating its promise as a research software tool.
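    The kind of suboptimal solutions and bounds such a tool reports can be illustrated with two classic greedy heuristics (not GraphCombEx's own algorithms): a greedy colouring yields an upper bound on the chromatic number, while a greedily built clique yields a lower bound, giving an interval for the optimal objective value. The adjacency-list example graph below is made up for the sketch.

```python
def greedy_colouring(adj):
    """Greedy colouring in descending degree order: the number of colours
    used is an upper bound on the chromatic number."""
    colour = {}
    for v in sorted(adj, key=lambda u: len(adj[u]), reverse=True):
        taken = {colour[u] for u in adj[v] if u in colour}
        c = 0
        while c in taken:
            c += 1
        colour[v] = c
    return colour

def greedy_clique(adj):
    """Greedy clique construction: its size is a lower bound on the maximum
    clique and hence on the chromatic number."""
    clique = []
    for v in sorted(adj, key=lambda u: len(adj[u]), reverse=True):
        if all(v in adj[u] for u in clique):
            clique.append(v)
    return clique

if __name__ == "__main__":
    # A 5-cycle with one chord (undirected, adjacency sets).
    adj = {
        0: {1, 2, 4},
        1: {0, 2},
        2: {0, 1, 3},
        3: {2, 4},
        4: {0, 3},
    }
    colours = greedy_colouring(adj)
    clique = greedy_clique(adj)
    print(f"chromatic number between {len(clique)} and {len(set(colours.values()))}")
```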