Queries mining for efficient routing in P2P communities
Peer-to-peer (P2P) computing is currently attracting enormous attention. In
P2P systems a very large number of autonomous computing nodes (the peers) pool
together their resources and rely on each other for data and services.
Peer-to-peer (P2P) data-sharing systems now generate a significant portion of
Internet traffic. Examples include P2P systems for network storage, web
caching, searching and indexing of relevant documents and distributed
network-threat analysis. Requirements for widely distributed information
systems supporting virtual organizations have given rise to a new category of
P2P systems called schema-based. In such systems each peer exposes its own
schema and the main objective is the efficient search across the P2P network by
processing each incoming query without overly consuming bandwidth. The
usability of these systems depends on effective techniques to find and retrieve
data; however, efficient and effective routing of content-based queries is a
challenging problem in P2P networks. This work is intended to motivate the use
of mining algorithms and hypergraphs to develop two different methods that
significantly improve the efficiency of P2P
communications. The proposed query routing methods direct the query to a set of
relevant peers in such a way as to reduce network traffic and bandwidth
consumption. We compare the performance of the two proposed methods with the
baseline one, and our experimental results show that the proposed methods
achieve substantial gains in performance and scalability.
Comment: 20 pages, 9 figures. arXiv admin note: substantial text overlap with arXiv:1108.137
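To make the idea of content-based routing concrete, the following minimal
Python sketch forwards a keyword query to the peers that most often answered
similar queries in the past. The RoutingIndex class and its statistics are
hypothetical illustrations, not the mining or hypergraph methods proposed in
the paper.

    from collections import defaultdict

    class RoutingIndex:
        """Toy keyword-to-peer index built from observed query traffic."""

        def __init__(self):
            # keyword -> {peer_id: number of relevant answers seen so far}
            self.hits = defaultdict(lambda: defaultdict(int))

        def record(self, keywords, peer_id):
            """Update statistics after peer_id returned relevant results."""
            for kw in keywords:
                self.hits[kw][peer_id] += 1

        def route(self, keywords, k=3):
            """Forward the query only to the top-k most promising peers
            instead of flooding it to every neighbour."""
            score = defaultdict(int)
            for kw in keywords:
                for peer, n in self.hits[kw].items():
                    score[peer] += n
            return sorted(score, key=score.get, reverse=True)[:k]

    index = RoutingIndex()
    index.record({"genomics", "alignment"}, peer_id="peer-7")
    index.record({"genomics"}, peer_id="peer-2")
    print(index.route({"genomics"}))   # peers ranked by past relevance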
Benchmarking Big Data Systems: State-of-the-Art and Future Directions
The great prosperity of big data systems such as Hadoop in recent years makes
the benchmarking of these systems become crucial for both research and industry
communities. The complexity, diversity, and rapid evolution of big data systems
give rise to various new challenges about how we design generators to produce
data with the 4V properties (i.e. volume, velocity, variety and veracity), as
well as implement application-specific but still comprehensive workloads.
However, most of the existing big data benchmarks can be described as attempts
to solve specific problems in benchmarking systems. This article investigates
the state-of-the-art in benchmarking big data systems along with the future
challenges to be addressed to realize a successful and efficient benchmark.
Comment: 9 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:1402.519
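As a rough illustration of what a generator parameterised along the 4V axes
might look like, the sketch below yields synthetic records in batches; the
function name and its parameters are hypothetical and not taken from any
existing benchmark.

    import random
    import string

    def generate_records(volume, velocity, variety, veracity, seed=0):
        """Hypothetical 4V-parameterised generator: yields `volume` records in
        batches of `velocity`, drawn from `variety` distinct schemas, with a
        `veracity` fraction left clean and the rest made noisy."""
        rng = random.Random(seed)
        schemas = [[f"field_{s}_{i}" for i in range(3)] for s in range(variety)]
        batch = []
        for _ in range(volume):
            fields = rng.choice(schemas)
            record = {f: "".join(rng.choices(string.ascii_lowercase, k=8)) for f in fields}
            if rng.random() > veracity:        # inject a missing value
                record[rng.choice(fields)] = None
            batch.append(record)
            if len(batch) == velocity:
                yield batch                    # one arrival interval of data
                batch = []
        if batch:
            yield batch

    for batch in generate_records(volume=10, velocity=4, variety=2, veracity=0.8):
        print(len(batch), batch[0])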
Collusion-resistant and privacy-preserving P2P multimedia distribution based on recombined fingerprinting
Recombined fingerprints have been suggested as a convenient approach to
improve the efficiency of anonymous fingerprinting for the legal distribution
of copyrighted multimedia contents in P2P systems. The recombination idea is
inspired by the principles of mating, recombination and heredity of the DNA
sequences of living beings, but applied to binary sequences, like in genetic
algorithms. However, the existing recombination-based fingerprinting systems do
not provide a convenient solution for collusion resistance, since they require
double-layer fingerprinting codes, making the practical implementation of such
systems a challenging task. In fact, collusion resistance is regarded as the
most relevant requirement of a fingerprinting scheme, and the lack of any
acceptable solution to this problem would possibly deter content merchants from
deploying any practical implementation of the recombination approach. In this
paper, this drawback is overcome by introducing two non-trivial improvements,
paving the way for a future real-life application of recombination-based
systems. First, Nuida et al.'s collusion-resistant codes are used in
segment-wise fashion for the first time. Second, a novel version of the
traitor-tracing algorithm is proposed in the encrypted domain, also for the
first time, making it possible to provide the buyers with security against
framing. In addition, the proposed method avoids the use of public-key
cryptography for the multimedia content and expensive cryptographic protocols,
leading to excellent performance in terms of both computational and
communication burdens. The paper also analyzes the security and privacy
properties of the proposed system both formally and informally, whereas the
collusion resistance and the performance of the method are shown by means of
experiments and simulations.
Comment: 58 pages, 6 figures, journal
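The recombination idea itself is easy to illustrate. The toy sketch below
builds a child fingerprint by copying fixed-length segments from two parent
binary fingerprints, in the spirit of genetic crossover; it is a deliberate
simplification and does not implement the segment-wise collusion-resistant
codes or the encrypted-domain tracing described in the paper.

    import random

    def recombine(parent_a, parent_b, segment_len=8, seed=None):
        """Toy recombination of two equal-length binary fingerprints: each
        fixed-length segment of the child is copied from one parent or the
        other, mimicking crossover in genetic algorithms."""
        assert len(parent_a) == len(parent_b)
        rng = random.Random(seed)
        child = []
        for start in range(0, len(parent_a), segment_len):
            source = parent_a if rng.random() < 0.5 else parent_b
            child.extend(source[start:start + segment_len])
        return child

    a = [random.randint(0, 1) for _ in range(32)]
    b = [random.randint(0, 1) for _ in range(32)]
    print(recombine(a, b, seed=42))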
On Big Data Benchmarking
Big data systems address the challenges of capturing, storing, managing,
analyzing, and visualizing big data. Within this context, developing benchmarks
to evaluate and compare big data systems has become an active topic for both
research and industry communities. To date, most of the state-of-the-art big
data benchmarks are designed for specific types of systems. Based on our
experience, however, we argue that considering the complexity, diversity, and
rapid evolution of big data systems, for the sake of fairness, big data
benchmarks must include diversity of data and workloads. Given this motivation,
in this paper, we first propose the key requirements and challenges in
developing big data benchmarks from the perspectives of generating data with 4V
properties (i.e. volume, velocity, variety and veracity) of big data, as well
as generating tests with comprehensive workloads for big data systems. We then
present the methodology on big data benchmarking designed to address these
challenges. Next, the state of the art is summarized and compared, followed
by our vision for future research directions.
Comment: 7 pages, 4 figures, 2 tables, accepted in BPOE-04 (http://prof.ict.ac.cn/bpoe_4_asplos/)
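One way to read the "comprehensive workloads" requirement is as a declarative
suite specification that mixes workload classes before a run, as in the
hypothetical sketch below; the workload names, engines, and scales are
illustrative only.

    # Hypothetical declarative description of a benchmark suite that mixes
    # workload classes so a single run exercises more than one kind of system.
    WORKLOADS = [
        {"name": "wordcount",     "kind": "batch",       "engine": "hadoop", "scale_gb": 100},
        {"name": "pagerank",      "kind": "graph",       "engine": "spark",  "scale_gb": 50},
        {"name": "stream-join",   "kind": "streaming",   "engine": "flink",  "events_per_s": 200_000},
        {"name": "point-lookups", "kind": "interactive", "engine": "hbase",  "ops_per_s": 50_000},
    ]

    def coverage(workloads):
        """Group the suite by workload class to check coverage before a run."""
        by_kind = {}
        for w in workloads:
            by_kind.setdefault(w["kind"], []).append(w["name"])
        return by_kind

    print(coverage(WORKLOADS))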
Scientific Workflows and Provenance: Introduction and Research Opportunities
Scientific workflows are becoming increasingly popular for compute-intensive
and data-intensive scientific applications. The vision and promise of
scientific workflows includes rapid, easy workflow design, reuse, scalable
execution, and other advantages, e.g., to facilitate "reproducible science"
through provenance (e.g., data lineage) support. However, as described in the
paper, important research challenges remain. While the database community has
studied (business) workflow technologies extensively in the past, most current
work in scientific workflows seems to be done outside of the database
community, e.g., by practitioners and researchers in the computational sciences
and eScience. We provide a brief introduction to scientific workflows and
provenance, and identify areas and problems that suggest new opportunities for
database research.
Comment: 12 pages, 2 figures
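A one-screen sketch of the provenance idea: each workflow step records which
labelled inputs produced which output, so the lineage of a result can be
replayed later. The run_step helper and the labels are hypothetical and not
tied to any particular workflow system.

    # Minimal provenance recording: every step logs its inputs and output.
    provenance = []

    def run_step(name, func, inputs):
        """Execute one workflow step and append a lineage record for it."""
        output = func(*(value for _, value in inputs))
        provenance.append({"step": name,
                           "inputs": [label for label, _ in inputs],
                           "output": f"{name}/out"})
        return output

    raw = [3, 1, 2]
    cleaned = run_step("clean", sorted, [("raw_data", raw)])
    total = run_step("aggregate", sum, [("clean/out", cleaned)])
    for record in provenance:
        print(record)   # replaying these records reconstructs the data lineage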
High-throughput Binding Affinity Calculations at Extreme Scales
Resistance to chemotherapy and molecularly targeted therapies is a major
factor in limiting the effectiveness of cancer treatment. In many cases,
resistance can be linked to genetic changes in target proteins, either
pre-existing or evolutionarily selected during treatment. Key to overcoming
this challenge is an understanding of the molecular determinants of drug
binding. Using multi-stage pipelines of molecular simulations we can gain
insights into the binding free energy and the residence time of a ligand, which
can inform both stratified and personal treatment regimes and drug development.
To support the scalable, adaptive and automated calculation of the binding free
energy on high-performance computing resources, we introduce the High-throughput
Binding Affinity Calculator (HTBAC). HTBAC uses a building-block
approach in order to attain both workflow flexibility and performance. We
demonstrate close to perfect weak scaling to hundreds of concurrent multi-stage
binding affinity calculation pipelines. This permits a rapid time-to-solution
that is essentially invariant of the calculation protocol, size of candidate
ligands and number of ensemble simulations. As such, HTBAC advances the state
of the art of binding affinity calculations and protocols.
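To convey the flavour of running many multi-stage pipelines concurrently, here
is a minimal sketch using a thread pool; the stage functions are placeholders,
and the sketch does not reflect HTBAC's actual building blocks or its HPC
execution backend.

    from concurrent.futures import ThreadPoolExecutor

    # Placeholder stage functions standing in for simulation steps.
    def prepare(ligand):              return f"{ligand}: system prepared"
    def equilibrate(ligand):          return f"{ligand}: equilibrated"
    def estimate_free_energy(ligand): return f"{ligand}: dG estimated"

    STAGES = [prepare, equilibrate, estimate_free_energy]

    def pipeline(ligand):
        """Run the stages of one binding-affinity calculation in order."""
        result = None
        for stage in STAGES:
            result = stage(ligand)
        return result

    ligands = [f"ligand-{i}" for i in range(8)]
    # Independent pipelines run concurrently; weak scaling means growing the
    # number of ligands and workers together keeps time-to-solution roughly flat.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for line in pool.map(pipeline, ligands):
            print(line)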
Moving Processing to Data: On the Influence of Processing in Memory on Data Management
Near-Data Processing refers to an architectural hardware and software
paradigm, based on the co-location of storage and compute units. Ideally, it
allows application-defined data- or compute-intensive operations to be executed
in situ, i.e., within (or close to) the physical data storage. Thus, Near-Data
Processing seeks to minimize expensive data movement, improving performance,
scalability, and resource-efficiency. Processing-in-Memory is a sub-class of
Near-Data processing that targets data processing directly within memory (DRAM)
chips. The effective use of Near-Data Processing mandates new architectures,
algorithms, interfaces, and development toolchains.
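The contrast at the heart of Near-Data Processing can be shown in a few lines:
either ship every row to the host, or evaluate the predicate where the data
lives and ship only the matches. The StorageNode class below is a hypothetical
stand-in, not a real processing-in-memory interface.

    class StorageNode:
        """Toy storage unit that can either ship raw rows or filter in situ."""

        def __init__(self, rows):
            self.rows = rows

        def scan(self):
            """Classic path: move every row to the host for processing."""
            return list(self.rows)

        def scan_filtered(self, predicate):
            """Near-data path: apply the predicate in situ and move only the
            qualifying rows, minimising data movement."""
            return [row for row in self.rows if predicate(row)]

    node = StorageNode(rows=[{"id": i, "value": i * 10} for i in range(1_000)])
    host_side = [r for r in node.scan() if r["value"] > 9_900]    # moves 1,000 rows
    near_data = node.scan_filtered(lambda r: r["value"] > 9_900)  # moves only matches
    assert host_side == near_data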
A Survey of Parallel Sequential Pattern Mining
With the growing popularity of shared resources, large volumes of complex
data of different types are collected automatically. Traditional data mining
algorithms generally have problems and challenges including huge memory cost,
low processing speed, and inadequate hard disk space. As a fundamental task of
data mining, sequential pattern mining (SPM) is used in a wide variety of
real-life applications. However, it is more complex and challenging than other
pattern mining tasks, e.g., frequent itemset mining and association rule
mining, and it also suffers from the above challenges when handling
large-scale data. To solve these problems, mining sequential patterns in a
parallel or distributed computing environment has emerged as an important issue
with many applications. In this paper, we provide an in-depth survey of the
current status of parallel sequential pattern mining (PSPM), including a
detailed categorization of traditional serial SPM approaches and of
state-of-the-art parallel SPM. We review the related work on parallel
sequential pattern mining in detail, including partition-based algorithms for
PSPM, Apriori-based PSPM, pattern-growth-based PSPM, and hybrid algorithms for
PSPM, and provide an in-depth description (characteristics, advantages,
disadvantages, and a summary) of these parallel approaches to PSPM. Some
advanced topics for PSPM, including parallel quantitative / weighted / utility
sequential pattern mining, PSPM from uncertain data and stream data, hardware
acceleration for PSPM, are further reviewed in detail. In addition, we review
some well-known open-source software for PSPM. Finally, we summarize some
challenges and opportunities of PSPM in the big data era.
Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 pages
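As a highly simplified illustration of the partition-based style of PSPM, the
sketch below counts the support of length-1 sequential patterns within each
partition independently and then merges the local counts against a global
threshold; real algorithms grow longer patterns, but the split-mine-merge
structure is similar. The function names and the toy database are illustrative.

    from collections import Counter

    def local_counts(partition):
        """Per-partition support counting for single items (length-1 patterns):
        each item is counted at most once per sequence."""
        counts = Counter()
        for sequence in partition:
            for item in set(i for itemset in sequence for i in itemset):
                counts[item] += 1
        return counts

    def frequent_items(partitions, min_support):
        """Partition-based step: mine each partition independently (this map
        could run in parallel), merge the local counts, and filter globally."""
        total = Counter()
        for counts in map(local_counts, partitions):
            total += counts
        return {item: n for item, n in total.items() if n >= min_support}

    db = [[["a", "b"], ["c"]], [["a"], ["c"]], [["c"]]]   # sequences of itemsets
    partitions = [db[:2], db[2:]]
    print(frequent_items(partitions, min_support=2))      # {'a': 2, 'c': 3}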
LightChain: A DHT-based Blockchain for Resource Constrained Environments
As an append-only distributed database, blockchain is utilized in a vast
variety of applications including the cryptocurrency and Internet-of-Things
(IoT). The existing blockchain solutions have downsides in communication and
storage efficiency, convergence to centralization, and consistency problems. In
this paper, we propose LightChain, which is the first blockchain architecture
that operates over a Distributed Hash Table (DHT) of participating peers.
LightChain is a permissionless blockchain that provides addressable blocks and
transactions within the network, which makes them efficiently accessible by all
the peers. Each block and transaction is replicated within the DHT of peers and
is retrieved in an on-demand manner. Hence, peers in LightChain are not
required to retrieve or keep the entire blockchain. LightChain is fair as all
of the participating peers have a uniform chance of being involved in the
consensus regardless of their influence such as hashing power or stake.
LightChain provides a deterministic fork-resolving strategy as well as a
blacklisting mechanism, and it is secure against colluding adversarial peers
attacking the availability and integrity of the system. We provide mathematical
analysis and experimental results on scenarios involving 10K nodes to
demonstrate the security and fairness of LightChain. As we experimentally show
in this paper, compared to the mainstream blockchains like Bitcoin and
Ethereum, LightChain requires around 66 times less per-node storage and is
around 380 times faster at bootstrapping a new node into the system, while each
LightChain node is equally likely to be rewarded for participating in the protocol.
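The core storage idea, content-addressable blocks that any peer can fetch on
demand instead of holding the full chain, can be sketched with a dictionary
standing in for the DHT. The helpers below are illustrative only and say
nothing about LightChain's consensus, fork resolution, or blacklisting.

    import hashlib
    import json

    # A plain dict stands in for the distributed hash table of peers.
    dht = {}

    def put_block(block):
        """Store a block under its content hash so any peer can address it."""
        key = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
        dht[key] = block
        return key

    def get_block(key):
        """Fetch a block on demand; peers need not keep the entire chain."""
        return dht[key]

    genesis = {"height": 0, "prev": None, "txs": []}
    prev_key = put_block(genesis)
    block_1 = {"height": 1, "prev": prev_key, "txs": ["tx-42"]}
    key_1 = put_block(block_1)
    print(get_block(key_1)["prev"] == prev_key)   # True: blocks linked by hash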
GraphCombEx: A Software Tool for Exploration of Combinatorial Optimisation Properties of Large Graphs
We present a prototype of a software tool for exploration of multiple
combinatorial optimisation problems in large real-world and synthetic complex
networks. Our tool, called GraphCombEx (an acronym of Graph Combinatorial
Explorer), provides a unified framework for scalable computation and
presentation of high-quality suboptimal solutions and bounds for a number of
widely studied combinatorial optimisation problems. Efficient representation
and applicability to large-scale graphs and complex networks are particularly
considered in its design. The problems currently supported include maximum
clique, graph colouring, maximum independent set, minimum vertex clique
covering, minimum dominating set, as well as the longest simple cycle problem.
Suboptimal solutions and intervals for optimal objective values are estimated
using scalable heuristics. The tool is designed with extensibility in mind,
with a view to adding further problems and new fast and high-performance
heuristics in the future. GraphCombEx has already been successfully
used as a support tool in a number of recent research studies using
combinatorial optimisation to analyse complex networks, indicating its promise
as a research software tool.
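To give a flavour of the kind of scalable heuristic bound such a tool can
report, the sketch below runs a greedy colouring over an adjacency dictionary;
the number of colours used is an upper bound on the chromatic number. This is
a generic textbook heuristic, not GraphCombEx's own implementation.

    def greedy_colouring_bound(adj):
        """Greedy colouring in highest-degree-first order; the number of
        colours used is an upper bound on the chromatic number and is cheap
        to compute even on large graphs."""
        order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
        colour = {}
        for v in order:
            used = {colour[u] for u in adj[v] if u in colour}
            colour[v] = next(c for c in range(len(adj)) if c not in used)
        return max(colour.values()) + 1

    # A 5-cycle has chromatic number 3; the greedy bound reports 3 colours here.
    adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
    print("chromatic number <=", greedy_colouring_bound(adj))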