22 research outputs found

    A survey and classification of storage deduplication systems

    Get PDF
    The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique for reducing storage costs. It has therefore been applied to different storage types, including archives and backups, primary storage, solid-state disks, and even random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, causing the relevance of new research and development to be underestimated. The first contribution of this paper is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. The classification identifies and describes the different approaches used for each of these decisions. As a second contribution, we describe which combinations of these design decisions have been proposed and found most useful for the challenges of each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed. This work is funded by the European Regional Development Fund (ERDF) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the Fundacao para a Ciencia e a Tecnologia (FCT; Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156, and by FCT PhD scholarship SFRH-BD-71372-2010.
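    To make the design space concrete, the sketch below (not from the paper) models three of the six criteria as Python enums, so that a particular deduplication system can be written down as one point in that space; the member names and the example system are illustrative assumptions.

```python
# Illustrative sketch: three of the survey's six design axes as enums.
# Member names and the example system are assumptions, not the paper's taxonomy verbatim.
from dataclasses import dataclass
from enum import Enum

class Granularity(Enum):
    WHOLE_FILE = "whole file"
    FIXED_BLOCK = "fixed-size block"
    VARIABLE_BLOCK = "content-defined block"

class Timing(Enum):
    INLINE = "inline"    # deduplicate on the write path
    OFFLINE = "offline"  # deduplicate in a background pass

class Scope(Enum):
    LOCAL = "single node"
    GLOBAL = "cluster-wide"

@dataclass
class DedupDesign:
    granularity: Granularity
    timing: Timing
    scope: Scope
    # locality, indexing and technique would be modelled the same way

# e.g. a typical inline backup deduplicator using content-defined chunks
backup_system = DedupDesign(Granularity.VARIABLE_BLOCK, Timing.INLINE, Scope.GLOBAL)
print(backup_system)
```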

    Meta-language in logic programming

    No full text
    Imperial Users only

    CROification: Accurate Kernel Classification with the Efficiency of Sparse Linear SVM

    No full text

    The case for generating URIs by hashing RDF content

    No full text
    In this paper we argue for using hashed URIs to represent RDF content.
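    The general idea can be illustrated with a short sketch, assuming a naive canonicalisation (sorted N-Triples-style lines) and an example base URI; the paper's actual scheme would also need a proper canonical serialisation and treatment of blank nodes.

```python
# Minimal sketch, not the paper's algorithm: serialise the triples as
# N-Triples-style lines, sort them, hash the result, and mint a URI from
# the digest. The base URI and sort-based canonicalisation are assumptions.
import hashlib

def uri_for_rdf_content(triples, base="http://example.org/id/"):
    lines = sorted(f"{s} {p} {o} ." for s, p, o in triples)
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return base + digest

triples = [
    ("<http://example.org/book/1>", "<http://purl.org/dc/terms/title>", '"RDF Basics"'),
    ("<http://example.org/book/1>", "<http://purl.org/dc/terms/creator>", '"A. Author"'),
]
print(uri_for_rdf_content(triples))  # identical triples always yield the same URI
```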

    Abstract

    No full text
    We describe WebMon, a tool for correlated, transaction-oriented performance monitoring of web services. Data collected with WebMon can be analyzed from a variety of perspectives: business, client, transaction, or systems. Maintainers of web services can use such analysis to better understand and manage the performance of their services. Moreover, WebMon’s data will enable the construction of more accurate performance prediction models for web services. Current web logging techniques create a log file per server, making it difficult to correlate data from log files with respect to a given transaction. Additionally, data about the quality of service perceived by the client is missing entirely. WebMon overcomes these limitations by providing heterogeneous instrumentation sensors and HTTP cookie-based correlators. In this paper, we present the design and implementation of WebMon and our experience in applying WebMon to an HP Library web service.
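    The cookie-based correlation idea can be sketched as follows, with illustrative field names and log formats rather than WebMon's actual instrumentation: each tier writes records tagged with a correlation identifier issued once per transaction and propagated in an HTTP cookie, and the analysis step simply groups the per-server logs by that identifier.

```python
# Minimal sketch of cookie-based correlation (field names are assumptions).
from collections import defaultdict

web_log = [  # one log file per server, as in ordinary web logging
    {"cid": "tx-42", "host": "www1", "event": "request",  "t": 0.000},
    {"cid": "tx-42", "host": "www1", "event": "response", "t": 0.180},
]
app_log = [
    {"cid": "tx-42", "host": "app3", "event": "db-query", "t": 0.050},
    {"cid": "tx-42", "host": "app3", "event": "db-done",  "t": 0.140},
]

def correlate(*logs):
    """Group records from all per-server logs by the correlation cookie value."""
    transactions = defaultdict(list)
    for log in logs:
        for record in log:
            transactions[record["cid"]].append(record)
    for records in transactions.values():
        records.sort(key=lambda r: r["t"])
    return transactions

for cid, records in correlate(web_log, app_log).items():
    duration = records[-1]["t"] - records[0]["t"]
    print(cid, f"{duration:.3f}s", [r["event"] for r in records])
```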

    Finding Similar Files in Large Document Repositories

    No full text
    Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction. The technical challenge is that, through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing. We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared by multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
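    A minimal sketch of the chunk-signature idea, under simplifying assumptions: a toy rolling-hash boundary condition, SHA-1 chunk digests, and Jaccard similarity over chunk-signature sets in place of the paper's file-chunk graph analysis.

```python
# Illustrative sketch only; window size, modulus and boundary rule are assumptions.
import hashlib
import random

WINDOW, MODULUS, MAX_CHUNK = 16, 256, 4096  # illustrative parameters

def chunk_signatures(data: bytes) -> set:
    """Split data at content-defined boundaries and return the chunks' SHA-1 digests."""
    sigs, start = set(), 0
    for i in range(WINDOW, len(data)):
        rolling = sum(data[i - WINDOW:i]) % MODULUS   # toy rolling condition
        if rolling == 0 or i - start >= MAX_CHUNK:    # boundary hit or chunk too large
            sigs.add(hashlib.sha1(data[start:i]).hexdigest())
            start = i
    sigs.add(hashlib.sha1(data[start:]).hexdigest())  # final chunk
    return sigs

def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the two files' chunk-signature sets."""
    sa, sb = chunk_signatures(a), chunk_signatures(b)
    return len(sa & sb) / len(sa | sb)

random.seed(1)
doc_v1 = bytes(random.randrange(256) for _ in range(20000))
doc_v2 = doc_v1[:16000] + bytes(random.randrange(256) for _ in range(4000))  # edited copy
doc_v3 = bytes(random.randrange(256) for _ in range(20000))                  # unrelated file
print(f"v1 vs v2: {similarity(doc_v1, doc_v2):.2f}")  # much higher than ...
print(f"v1 vs v3: {similarity(doc_v1, doc_v3):.2f}")  # ... the unrelated pair (about 0)
```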

    Content-based Document Routing and Index Partitioning

    No full text
    We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case where similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. Likewise, to perform a similarity-based search, only a small number of partitions are queried. Our approach is stateless and incremental: the decision as to which partitions the features of a document should be routed to (for storage at ingestion time and for similarity-based search at query time) is based solely on the features of the document. Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
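    A minimal sketch of stateless, content-based routing in the spirit of the abstract, not the paper's exact scheme: a document's features are stored in, and later queried from, the few partitions addressed by its smallest feature hashes, so the routing decision depends only on the document's content. The 3-word-shingle features, hash function, and parameter values are illustrative assumptions.

```python
# Illustrative sketch; N, K, MD5 and the shingle features are assumptions.
import hashlib
from collections import Counter, defaultdict

N_PARTITIONS, K = 100, 3

def features(text: str) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}  # 3-word shingles

def home_partitions(feats: set) -> set:
    """Pick the K partitions addressed by the K smallest feature hashes (content only)."""
    hashes = sorted(int(hashlib.md5(f.encode()).hexdigest(), 16) for f in feats)
    return {h % N_PARTITIONS for h in hashes[:K]}

partitions = defaultdict(lambda: defaultdict(set))  # partition -> feature -> doc ids

def ingest(doc_id: str, text: str) -> None:
    feats = features(text)
    for p in home_partitions(feats):
        for f in feats:
            partitions[p][f].add(doc_id)

def search(text: str) -> Counter:
    feats = features(text)
    homes = home_partitions(feats)                  # consults only K of N partitions
    hits = Counter()                                # doc id -> number of shared features
    for f in feats:
        docs = set().union(*(partitions[p].get(f, set()) for p in homes))
        hits.update(docs)
    return hits

ingest("doc-1", "how to reset the printer firmware on model x")
ingest("doc-2", "resetting the firmware of printer model y")
# The duplicate doc-1 is found while consulting only 3 of the 100 partitions.
print(search("how to reset the printer firmware on model x"))
```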