    A Survey on Data Deduplication

    The demand for data storage capacity is increasing drastically. Because of this growing demand, the computing community is increasingly turning to cloud storage. Data security and cost are important challenges in cloud storage. A duplicate file not only wastes storage but also increases access time, so detecting and removing duplicate data is an essential task. Data deduplication, an efficient approach to data reduction, has gained increasing attention and popularity in large-scale storage systems. It eliminates redundant data at the file or subfile level and identifies duplicate content by its cryptographically secure hash signature. The task is tricky because duplicate files neither share a common key nor contain errors. There are several approaches to identifying and removing redundant data at the file and chunk levels. In this paper, we cover the background and key features of data deduplication, then summarize and classify the data deduplication process according to its key workflow.
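
    The hash-based detection described in this abstract can be sketched as follows. This is a minimal illustration, not the survey's own method: it assumes fixed-size chunking and SHA-256 fingerprints, and the names DedupStore, put, get, and stored_bytes are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy chunk-level deduplicating store (illustrative names, in-memory only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes, each unique chunk kept once
        self.files = {}   # file name -> ordered list of chunk fingerprints

    def put(self, name, data):
        """Split data into fixed-size chunks and store only unseen chunks."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # setdefault keeps the existing copy when the fingerprint is known
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.files[name] = recipe

    def get(self, name):
        """Rebuild a file from its list of fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        """Physical bytes kept after deduplication."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore(chunk_size=4)
store.put("a.bin", b"AAAABBBBCCCC")
store.put("b.bin", b"AAAABBBBDDDD")  # shares its first two chunks with a.bin
```

    With a chunk size equal to the file size, the same structure degenerates to file-level deduplication; real systems often use content-defined chunking instead of fixed offsets so that an insertion does not shift every later chunk boundary.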

    Block-level De-duplication with Encrypted Data

    Deduplication is a storage-saving technique that has been adopted by many cloud storage providers, such as Dropbox. The simple principle of deduplication is that duplicate data uploaded by different users are stored only once. Unfortunately, deduplication is not compatible with encryption. As a scheme that allows deduplication of encrypted data segments, we propose ClouDedup, a secure and efficient storage service that guarantees block-level deduplication and data confidentiality at the same time. ClouDedup strengthens convergent encryption by employing a component that implements an additional encryption operation and an access control mechanism. We also propose to introduce an additional component that is in charge of providing a key management system for data blocks together with the actual deduplication operation. We show that the overhead introduced by these new components is minimal and does not impact the overall storage and computational costs.
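
    Convergent encryption, which this abstract builds on, derives the key from the block itself so that identical plaintext blocks yield identical ciphertexts and can still be deduplicated. The sketch below illustrates only that base idea; it does not model ClouDedup's additional encryption layer, access control, or key management. The function names are hypothetical, and the SHAKE-256 keystream is a stand-in chosen to keep the example within the standard library, not a vetted cipher.

```python
import hashlib


def _keystream_xor(key, data):
    # Stand-in stream cipher: XOR with a SHAKE-256 keystream derived from the key.
    # Assumption for illustration only; a real system would use a vetted cipher.
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def convergent_encrypt(block):
    """Encrypt a block under a key derived from its own content."""
    key = hashlib.sha256(block).digest()          # same block -> same key
    ciphertext = _keystream_xor(key, block)       # deterministic encryption
    tag = hashlib.sha256(ciphertext).hexdigest()  # server-side dedup index
    return key, ciphertext, tag


def convergent_decrypt(key, ciphertext):
    """XOR with the same keystream restores the plaintext."""
    return _keystream_xor(key, ciphertext)


# Two users uploading the same block produce the same ciphertext and tag,
# so the server can keep a single copy without seeing the plaintext.
key_a, ct_a, tag_a = convergent_encrypt(b"shared report, page 1")
key_b, ct_b, tag_b = convergent_encrypt(b"shared report, page 1")
key_c, ct_c, tag_c = convergent_encrypt(b"private notes, page 1")
```

    Because the key is a function of the content, anyone who can guess a block can confirm the guess by encrypting it and comparing tags, which is the weakness that motivates adding a further encryption step on top of plain convergent encryption.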

    Efficient, Dependable Storage of Human Genome Sequencing Data

    Data from human genomes is massive because there are millions of samples to be sequenced, and each sequenced human genome may occupy hundreds of gigabytes. Human genomes are critical because they are extremely valuable to research and may provide hints on individuals’ health status, identify their donors, or reveal information about donors’ relatives. Their size and criticality, plus the amount of data being produced by medical and life-sciences institutions, require systems to scale while being secure, dependable, auditable, and affordable. Current storage infrastructures are too expensive to ignore cost efficiency in storing human genomes, and they lack the proper knowledge and mechanisms to protect the privacy of sample donors. This thesis proposes an efficient storage system for human genomes that medical and life-sciences institutions may trust and afford. It enhances traditional storage ecosystems with privacy-aware, data-reduction, and auditability techniques to enable the efficient, dependable use of multi-tenant infrastructures to store human genomes. Contributions from this thesis include (1) a study on the privacy-sensitivity of human genomes; (2) a method to detect genomes’ privacy-sensitive portions systematically; (3) specialised data reduction algorithms for sequencing data; (4) an independent auditability scheme for secure dispersed storage; and (5) a complete storage pipeline that obtains reasonable privacy protection, security, and dependability guarantees at modest costs (e.g., less than 1/Genome/Year) by integrating the proposed mechanisms with appropriate storage configurations.

    A New Secure Protected De-duplication Structure With Upgraded Reliability

    This paper makes the first attempt to formalize the notion of a distributed, reliable deduplication system. We propose new distributed deduplication systems with higher reliability, in which the data chunks are distributed across multiple cloud servers. The security requirements of data confidentiality and tag consistency are also achieved by introducing a deterministic secret sharing scheme in distributed storage systems, instead of using convergent encryption as in previous deduplication systems. Security analysis demonstrates that our deduplication systems are secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement the proposed systems and demonstrate that the incurred overhead is very limited in realistic environments.
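
    The idea of splitting data deterministically across servers, so that each server can deduplicate the piece it holds, can be sketched as below. The abstract does not specify the scheme, so this example substitutes a Shamir-style (k, n) threshold sharing over the prime field of size 257 in which the polynomial coefficients are derived from a hash of the data rather than from fresh randomness. All names and parameters are assumptions made for illustration.

```python
import hashlib

P = 257  # smallest prime above 255, so every byte value is a field element


def _coefficients(data, k):
    """Derive k-1 coefficients per byte from the data's own hash (no fresh randomness)."""
    seed = hashlib.sha256(data).digest()
    stream = hashlib.shake_256(seed).digest(2 * len(data) * (k - 1))
    # Two bytes per coefficient, reduced mod P (slight bias, acceptable in a sketch)
    values = [int.from_bytes(stream[i:i + 2], "big") % P
              for i in range(0, len(stream), 2)]
    return [values[j * (k - 1):(j + 1) * (k - 1)] for j in range(len(data))]


def split(data, k, n):
    """Return {server index x: list of share values}; any k shares rebuild data."""
    coeffs = _coefficients(data, k)
    shares = {}
    for x in range(1, n + 1):
        share = []
        for byte, cs in zip(data, coeffs):
            y = byte  # constant term of the polynomial is the secret byte
            for power, c in enumerate(cs, start=1):
                y = (y + c * pow(x, power, P)) % P
            share.append(y)
        shares[x] = share
    return shares


def combine(shares):
    """Lagrange interpolation at x = 0 over at least k shares."""
    xs = list(shares)
    out = bytearray()
    for i in range(len(shares[xs[0]])):
        secret = 0
        for xj in xs:
            num, den = 1, 1
            for xm in xs:
                if xm != xj:
                    num = num * (-xm) % P
                    den = den * (xj - xm) % P
            secret = (secret + shares[xj][i] * num * pow(den, -1, P)) % P
        out.append(secret)
    return bytes(out)


block = b"genome block"
first = split(block, k=3, n=5)
second = split(block, k=3, n=5)  # a second upload of the same block
subset = {x: first[x] for x in (1, 3, 5)}
```

    Because the shares depend only on the data, two uploads of the same block give every server an identical share, which is what makes deduplication possible, while any k of the n servers suffice for recovery, which gives the tolerance to server failures. The sketch requires n below 257 and Python 3.8 or later for the modular inverse via pow.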