22 research outputs found

    A survey and classification of storage deduplication systems

    Get PDF
    The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique for reducing storage costs. It has therefore been applied to different storage types, including archives and backups, primary storage, solid-state disks, and even random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, causing the relevance of new research and development to be underestimated. The first contribution of this paper is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. The classification identifies and describes the different approaches used for each of these decisions. As a second contribution, we describe which combinations of these design decisions have been proposed and found most useful for the challenges of each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed. This work is funded by the European Regional Development Fund (ERDF) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the Fundacao para a Ciencia e a Tecnologia (FCT; Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156, and by FCT PhD scholarship SFRH-BD-71372-2010.
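    To make the design space concrete, the sketch below (not from the paper) models three of the six criteria as Python enums, so that a particular deduplication system can be written down as one point in that space; the member names and the example system are illustrative assumptions.

```python
# Illustrative sketch: three of the survey's six design axes as enums.
# Member names and the example system are assumptions, not the paper's taxonomy verbatim.
from dataclasses import dataclass
from enum import Enum

class Granularity(Enum):
    WHOLE_FILE = "whole file"
    FIXED_BLOCK = "fixed-size block"
    VARIABLE_BLOCK = "content-defined block"

class Timing(Enum):
    INLINE = "inline"    # deduplicate on the write path
    OFFLINE = "offline"  # deduplicate in a background pass

class Scope(Enum):
    LOCAL = "single node"
    GLOBAL = "cluster-wide"

@dataclass
class DedupDesign:
    granularity: Granularity
    timing: Timing
    scope: Scope
    # locality, indexing and technique would be modelled the same way

# e.g. a typical inline backup deduplicator using content-defined chunks
backup_system = DedupDesign(Granularity.VARIABLE_BLOCK, Timing.INLINE, Scope.GLOBAL)
print(backup_system)
```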

    Meta-language in logic programming

    No full text
    Imperial Users only

    CROification: Accurate Kernel Classification with the Efficiency of Sparse Linear SVM

    No full text

    The case for generating URIs by hashing RDF content

    No full text
    In this paper we argue for using hashed URIs to represent RDF content.
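    The general idea can be illustrated with a short sketch, assuming a naive canonicalisation (sorted N-Triples-style lines) and an example base URI; the paper's actual scheme would also need a proper canonical serialisation and treatment of blank nodes.

```python
# Minimal sketch, not the paper's algorithm: serialise the triples as
# N-Triples-style lines, sort them, hash the result, and mint a URI from
# the digest. The base URI and sort-based canonicalisation are assumptions.
import hashlib

def uri_for_rdf_content(triples, base="http://example.org/id/"):
    lines = sorted(f"{s} {p} {o} ." for s, p, o in triples)
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return base + digest

triples = [
    ("<http://example.org/book/1>", "<http://purl.org/dc/terms/title>", '"RDF Basics"'),
    ("<http://example.org/book/1>", "<http://purl.org/dc/terms/creator>", '"A. Author"'),
]
print(uri_for_rdf_content(triples))  # identical triples always yield the same URI
```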

    Abstract

    No full text
    We describe WebMon, a tool for correlated, transaction-oriented performance monitoring of web services. Data collected with WebMon can be analyzed from a variety of perspectives: business, client, transaction, or systems. Maintainers of web services can use such analysis to better understand and manage the performance of their services. Moreover, WebMon’s data will enable the construction of more accurate performance prediction models for web services. Current web logging techniques create a log file per server, making it difficult to correlate data from log files with respect to a given transaction. Additionally, data about the quality of service perceived by the client is missing entirely. WebMon overcomes these limitations by providing heterogeneous instrumentation sensors and HTTP cookie-based correlators. In this paper, we present the design and implementation of WebMon and our experience in applying WebMon to an HP Library web service.
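    The cookie-based correlation idea can be sketched as follows, with illustrative field names and log formats rather than WebMon's actual instrumentation: each tier writes records tagged with a correlation identifier issued once per transaction and propagated in an HTTP cookie, and the analysis step simply groups the per-server logs by that identifier.

```python
# Minimal sketch of cookie-based correlation (field names are assumptions).
from collections import defaultdict

web_log = [  # one log file per server, as in ordinary web logging
    {"cid": "tx-42", "host": "www1", "event": "request",  "t": 0.000},
    {"cid": "tx-42", "host": "www1", "event": "response", "t": 0.180},
]
app_log = [
    {"cid": "tx-42", "host": "app3", "event": "db-query", "t": 0.050},
    {"cid": "tx-42", "host": "app3", "event": "db-done",  "t": 0.140},
]

def correlate(*logs):
    """Group records from all per-server logs by the correlation cookie value."""
    transactions = defaultdict(list)
    for log in logs:
        for record in log:
            transactions[record["cid"]].append(record)
    for records in transactions.values():
        records.sort(key=lambda r: r["t"])
    return transactions

for cid, records in correlate(web_log, app_log).items():
    duration = records[-1]["t"] - records[0]["t"]
    print(cid, f"{duration:.3f}s", [r["event"] for r in records])
```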

    Finding Similar Files in Large Document Repositories

    No full text
    Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction. The technical challenge is that, through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing. We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared by multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
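    A minimal sketch of the chunk-signature idea, under simplifying assumptions: a toy rolling-hash boundary condition, SHA-1 chunk digests, and Jaccard similarity over chunk-signature sets in place of the paper's file-chunk graph analysis.

```python
# Illustrative sketch only; window size, modulus and boundary rule are assumptions.
import hashlib
import random

WINDOW, MODULUS, MAX_CHUNK = 16, 256, 4096  # illustrative parameters

def chunk_signatures(data: bytes) -> set:
    """Split data at content-defined boundaries and return the chunks' SHA-1 digests."""
    sigs, start = set(), 0
    for i in range(WINDOW, len(data)):
        rolling = sum(data[i - WINDOW:i]) % MODULUS   # toy rolling condition
        if rolling == 0 or i - start >= MAX_CHUNK:    # boundary hit or chunk too large
            sigs.add(hashlib.sha1(data[start:i]).hexdigest())
            start = i
    sigs.add(hashlib.sha1(data[start:]).hexdigest())  # final chunk
    return sigs

def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the two files' chunk-signature sets."""
    sa, sb = chunk_signatures(a), chunk_signatures(b)
    return len(sa & sb) / len(sa | sb)

random.seed(1)
doc_v1 = bytes(random.randrange(256) for _ in range(20000))
doc_v2 = doc_v1[:16000] + bytes(random.randrange(256) for _ in range(4000))  # edited copy
doc_v3 = bytes(random.randrange(256) for _ in range(20000))                  # unrelated file
print(f"v1 vs v2: {similarity(doc_v1, doc_v2):.2f}")  # much higher than ...
print(f"v1 vs v3: {similarity(doc_v1, doc_v3):.2f}")  # ... the unrelated pair (about 0)
```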

    Content-based Document Routing and Index Partitioning

    No full text
    We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case where similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. Likewise, to perform a similarity-based search, only a small number of partitions are queried. Our approach is stateless and incremental: the decision as to which partitions the features of a document should be routed to (for storage at ingestion time and for similarity-based search at query time) is based solely on the features of the document. Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
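    A minimal sketch of stateless, content-based routing in the spirit of the abstract, not the paper's exact scheme: a document's features are stored in, and later queried from, the few partitions addressed by its smallest feature hashes, so the routing decision depends only on the document's content. The 3-word-shingle features, hash function, and parameter values are illustrative assumptions.

```python
# Illustrative sketch; N, K, MD5 and the shingle features are assumptions.
import hashlib
from collections import Counter, defaultdict

N_PARTITIONS, K = 100, 3

def features(text: str) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}  # 3-word shingles

def home_partitions(feats: set) -> set:
    """Pick the K partitions addressed by the K smallest feature hashes (content only)."""
    hashes = sorted(int(hashlib.md5(f.encode()).hexdigest(), 16) for f in feats)
    return {h % N_PARTITIONS for h in hashes[:K]}

partitions = defaultdict(lambda: defaultdict(set))  # partition -> feature -> doc ids

def ingest(doc_id: str, text: str) -> None:
    feats = features(text)
    for p in home_partitions(feats):
        for f in feats:
            partitions[p][f].add(doc_id)

def search(text: str) -> Counter:
    feats = features(text)
    homes = home_partitions(feats)                  # consults only K of N partitions
    hits = Counter()                                # doc id -> number of shared features
    for f in feats:
        docs = set().union(*(partitions[p].get(f, set()) for p in homes))
        hits.update(docs)
    return hits

ingest("doc-1", "how to reset the printer firmware on model x")
ingest("doc-2", "resetting the firmware of printer model y")
# The duplicate doc-1 is found while consulting only 3 of the 100 partitions.
print(search("how to reset the printer firmware on model x"))
```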