DEDISbench: a benchmark for deduplicated storage systems
Lecture Notes in Computer Science, vol. 7566
Deduplication is widely accepted as an effective technique for eliminating duplicated data in backup and archival systems. Nowadays, deduplication is also becoming appealing in cloud computing, where large-scale virtualized storage infrastructures hold huge data volumes with a significant share of duplicated content. There have thus been several proposals for embedding deduplication in storage appliances and file systems, providing different performance trade-offs while targeting both user and application data, as well as virtual machine images.
It is, however, hard to determine to what extent deduplication is useful in a particular setting and which technique will provide the best results. In fact, existing disk I/O micro-benchmarks are not designed for evaluating deduplication systems: they follow simplistic approaches for generating written data that lead to unrealistic amounts of duplicates.
We address this with DEDISbench, a novel micro-benchmark for evaluating the disk I/O performance of block-based deduplication systems. As the main contribution, we introduce the generation of a realistic duplicate distribution based on real datasets. Moreover, DEDISbench also allows simulating access hotspots and different load intensities for I/O operations. The usefulness of DEDISbench is shown by comparing it with the Bonnie++ and IOzone open-source disk I/O micro-benchmarks in assessing two open-source deduplication systems, Opendedup and Lessfs, using Ext4 as a baseline. As a secondary contribution, our results provide novel insight into the performance of these file systems.
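The hotspot simulation mentioned in the abstract can be pictured with a small sketch. The snippet below is not DEDISbench's actual generator; it is a minimal illustration, assuming a simple two-region skew (the parameters `hot_fraction` and `hot_weight` are invented for the example), of how a benchmark can direct most I/O operations at a small "hot" portion of the address space:

```python
import random

def hotspot_offsets(n_blocks: int, n_ops: int, hot_fraction: float = 0.1,
                    hot_weight: float = 0.9, seed: int = 42):
    """Yield block offsets where a small 'hot' region receives most accesses.

    Hypothetical sketch: `hot_fraction` of the address space absorbs
    `hot_weight` of the I/O operations; the remainder is uniform.
    """
    rng = random.Random(seed)
    hot_limit = int(n_blocks * hot_fraction)
    for _ in range(n_ops):
        if rng.random() < hot_weight:
            yield rng.randrange(hot_limit)            # access in the hot region
        else:
            yield rng.randrange(hot_limit, n_blocks)  # access in the cold region

# Example: 90% of 1,000 operations hit the first 10% of a 10,000-block device.
ops = list(hotspot_offsets(10_000, 1_000))
```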
Towards an accurate evaluation of deduplicated storage systems
Deduplication has proven to be a valuable technique for eliminating duplicate data in backup and archival systems and is now being applied to new storage environments with distinct requirements and performance trade-offs. Namely, deduplication systems are now targeting large-scale cloud computing storage infrastructures holding unprecedented data volumes with a significant share of duplicate content.
It is, however, hard to assess the usefulness of deduplication in a particular setting and which techniques provide the best results. In fact, existing disk I/O benchmarks follow simplistic approaches for generating data content, leading to unrealistic amounts of duplicates that prevent an accurate evaluation of deduplication systems. Moreover, deduplication systems now target heterogeneous storage environments with specific duplication ratios, which benchmarks must also simulate.
We address these issues with DEDISbench, a novel micro-benchmark for evaluating the disk I/O performance of block-based deduplication systems. As the main contribution, DEDISbench generates content by following realistic duplicate content distributions extracted from real datasets. Then, as a second contribution, we analyze and extract the duplicate distributions found in three real storage systems, showing that DEDISbench can easily simulate several distinct workloads.
The usefulness of DEDISbench is shown by comparing it with the Bonnie++ and IOzone open-source disk I/O micro-benchmarks in assessing two open-source deduplication systems, Opendedup and Lessfs, using Ext4 as a baseline. Our results provide novel insight into the performance of these file systems.
This work is funded by ERDF - European Regional Development Fund through the COMPETE Programme (Operational Programme for Competitiveness) and by national funds through FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156, and by FCT Ph.D. scholarship SFRH-BD-71372-2010.
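To make the central idea concrete, here is a minimal sketch of content generation that follows an empirical duplicate distribution. The distribution `DUP_DISTRIBUTION` and all helper names are invented for illustration and are not DEDISbench's actual implementation; the point is only that equal content IDs yield byte-identical blocks, so the written stream exhibits a controlled duplicate ratio:

```python
import hashlib
import random

# Hypothetical empirical distribution extracted from a real dataset:
# maps "number of extra duplicates per block" -> fraction of blocks.
DUP_DISTRIBUTION = {0: 0.70, 1: 0.15, 4: 0.10, 32: 0.05}

def make_block(content_id: int, block_size: int = 4096) -> bytes:
    """Derive deterministic block content from an ID: equal IDs yield
    byte-identical blocks, so the deduplication engine sees real duplicates."""
    seed = hashlib.sha256(str(content_id).encode()).digest()
    return (seed * (block_size // len(seed) + 1))[:block_size]

def generate_writes(n_blocks: int, seed: int = 1) -> list:
    """Build a shuffled stream of blocks whose duplicate counts follow
    DUP_DISTRIBUTION."""
    rng = random.Random(seed)
    next_id, ids = 0, []
    dups, weights = zip(*DUP_DISTRIBUTION.items())
    while len(ids) < n_blocks:
        d = rng.choices(dups, weights)[0]
        ids.extend([next_id] * (d + 1))   # one unique block plus d duplicates
        next_id += 1
    rng.shuffle(ids)
    return [make_block(i) for i in ids[:n_blocks]]

blocks = generate_writes(1_000)
```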
A unified framework for de-duplication and population size estimation (with Discussion)
Data de-duplication is the process of finding records in one or more datasets that belong to the same entity. In this paper we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of different entities. The main novelty of our approach is to consider the population size as an unknown model parameter. As a result, one salient feature of the proposed method is the capability of the model to account for the de-duplication uncertainty in the population size estimation. As by-products of our approach, we illustrate the relationships between de-duplication problems and capture-recapture models, and we obtain a more adequate prior distribution on the linkage structure. Moreover, we propose a novel simulation algorithm for the posterior distribution of the matching configuration, based on the marginalization of the key variables at the population level. We apply our approach to two synthetic datasets comprising German names. In addition, we illustrate a real-data application, matching records from two lists reporting victims killed in the recent Syrian conflict.
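Schematically, the latent entity model described here has the following generative structure; the notation below is chosen for illustration and is not taken from the paper:

```latex
% Schematic latent entity model (notation invented for illustration).
% N: unknown population size; y_j: latent key variables of the N entities;
% z_i: latent entity label of record i; x_i: observed, possibly perturbed record.
\begin{align*}
  N &\sim p(N) \\
  y_j \mid N &\overset{\text{iid}}{\sim} G, \qquad j = 1, \dots, N \\
  z_i \mid N &\sim \mathrm{Uniform}\{1, \dots, N\}, \qquad i = 1, \dots, n \\
  x_i \mid z_i, y &\sim F(\,\cdot \mid y_{z_i})
\end{align*}
% Records i and i' are duplicates exactly when z_i = z_{i'}; treating N as a
% model parameter lets the posterior propagate de-duplication uncertainty
% into the population size estimate.
```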
EviPlant: An efficient digital forensic challenge creation, manipulation and distribution solution
Education and training in digital forensics requires a variety of suitable challenge corpora containing realistic features, including regular wear-and-tear, background noise, and the actual digital traces to be discovered during investigation. Typically, the creation of these challenges requires overly arduous effort on the part of the educator to ensure their viability. Once created, the challenge image needs to be stored and distributed to a class for practical training. This storage and distribution step requires significant time and resources, and may not even be possible in an online/distance learning scenario due to the data sizes involved. As part of this paper, we introduce a more capable methodology and system as an alternative to current approaches. EviPlant is a system designed for the efficient creation, manipulation, storage and distribution of challenges for digital forensics education and training. The system relies on the initial distribution of base disk images, i.e., images containing solely base operating systems. In order to create challenges for students, educators can boot the base system, emulate the desired activity, and perform a "diffing" of the resultant image against the base image. This diffing process extracts the modified artefacts and associated metadata and stores them in an "evidence package". Evidence packages can be created for different personae, different wear-and-tear, different emulated crimes, etc., and multiple evidence packages can be distributed to students and integrated into the base images. A number of additional applications of EviPlant in digital forensic challenge creation for tool testing and validation, proficiency testing, and malware analysis are also discussed.
Comment: Digital Forensic Research Workshop Europe 201
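The "diffing" step can be pictured with a much-simplified, file-level sketch. EviPlant itself operates on disk images and richer artefact metadata; the snippet below only illustrates the idea of comparing a modified tree against a base tree and bundling the new or changed artefacts, with all paths and the package layout invented for the example:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Hash a file in chunks so large artefacts do not exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def diff_images(base_root: str, modified_root: str) -> dict:
    """Collect artefacts present or changed in the modified tree but not in
    the base tree, together with minimal metadata (an 'evidence package')."""
    base = {p.relative_to(base_root): file_digest(p)
            for p in Path(base_root).rglob("*") if p.is_file()}
    package = {}
    for p in Path(modified_root).rglob("*"):
        if not p.is_file():
            continue
        rel = p.relative_to(modified_root)
        digest = file_digest(p)
        if base.get(rel) != digest:   # artefact is new or was modified
            package[str(rel)] = {"sha256": digest,
                                 "size": p.stat().st_size,
                                 "mtime": p.stat().st_mtime}
    return package

# Example (hypothetical mount points for the two images):
# package = diff_images("/mnt/base", "/mnt/challenge")
```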
A Comparison of Blocking Methods for Record Linkage
Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.
Comment: 22 pages, 2 tables, 7 figures
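Traditional blocking and the reduction-ratio metric mentioned in the abstract are easy to sketch. The following toy illustration, with invented record fields rather than code from the paper, shows that only records sharing the same blocking key become candidate pairs, and that the reduction ratio measures how much of the all-pairs comparison space this prunes:

```python
from collections import defaultdict
from itertools import combinations

def block_by_keys(records, key_fields):
    """Traditional blocking: only records agreeing on `key_fields`
    are compared, instead of all O(n^2) pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[tuple(rec[f] for f in key_fields)].append(rec["id"])
    return blocks

def candidate_pairs(blocks):
    """All within-block record pairs produced by the blocking."""
    return {pair for ids in blocks.values()
            for pair in combinations(sorted(ids), 2)}

def reduction_ratio(n_records, n_candidates):
    """Fraction of the all-pairs comparison space removed by blocking."""
    total = n_records * (n_records - 1) // 2
    return 1 - n_candidates / total

# Toy records with hypothetical fields:
records = [
    {"id": 1, "zip": "02138", "yob": "1980", "name": "Ann Smith"},
    {"id": 2, "zip": "02138", "yob": "1980", "name": "Anne Smith"},
    {"id": 3, "zip": "10001", "yob": "1975", "name": "Bob Jones"},
]
pairs = candidate_pairs(block_by_keys(records, ["zip", "yob"]))
print(pairs, reduction_ratio(len(records), len(pairs)))  # {(1, 2)} 0.666...
```

Recall would additionally require ground-truth match labels: it is the fraction of true matching pairs that survive as candidate pairs.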
Re-evaluating Retrosynthesis Algorithms with Syntheseus
The planning of how to synthesize molecules, also known as retrosynthesis, has been a growing focus of the machine learning and chemistry communities in recent years. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques. To remedy this, we present a benchmarking library called syntheseus which promotes best practice by default, enabling consistent, meaningful evaluation of single-step and multi-step retrosynthesis algorithms. We use syntheseus to re-evaluate a number of previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes when evaluated carefully. We end with guidance for future works in this area - …
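To illustrate what a consistent evaluation must pin down, here is a hypothetical harness, deliberately not the syntheseus API: a single-step model interface, a fixed expansion budget, and a success criterion (reaching purchasable building blocks). All names and the toy molecule strings are invented for the example:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical single-step model interface (NOT syntheseus's actual API):
# molecule -> candidate (reactant list, score) proposals.
SingleStepModel = Callable[[str], Iterable[tuple[list[str], float]]]

@dataclass
class EvalResult:
    solved: int
    total: int

def evaluate(model: SingleStepModel, targets: list[str],
             purchasable: set[str], max_expansions: int = 100) -> EvalResult:
    """Count targets for which a depth-first search over `model` proposals
    reaches purchasable building blocks within a fixed expansion budget."""
    def solve(mol: str, budget: list) -> bool:
        if mol in purchasable:
            return True
        if budget[0] <= 0:
            return False
        budget[0] -= 1
        # try each proposed single-step disconnection in model order
        for reactants, _score in model(mol):
            if reactants and all(solve(r, budget) for r in reactants):
                return True
        return False

    solved = sum(solve(t, [max_expansions]) for t in targets)
    return EvalResult(solved, len(targets))

# Toy single-step model over made-up molecule names:
toy_rules = {"target": [(["a", "b"], 1.0)]}
model: SingleStepModel = lambda mol: toy_rules.get(mol, [])
print(evaluate(model, ["target"], purchasable={"a", "b"}))
# EvalResult(solved=1, total=1)
```

Fixing the budget and the success criterion in the harness, rather than in each model's own evaluation script, is what makes cross-model rankings comparable.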