73 research outputs found

    A Robust Fault-Tolerant and Scalable Cluster-wide Deduplication for Shared-Nothing Storage Systems

    Full text link
    Deduplication has been largely employed in distributed storage systems to improve space efficiency. Traditional deduplication research ignores the design specifications of shared-nothing distributed storage systems such as no central metadata bottleneck, scalability, and storage rebalancing. Further, deduplication introduces transactional changes, which are prone to errors in the event of a system failure, resulting in inconsistencies in data and deduplication metadata. In this paper, we propose a robust, fault-tolerant and scalable cluster-wide deduplication that can eliminate duplicate copies across the cluster. We design a distributed deduplication metadata shard which guarantees performance scalability while preserving the design constraints of shared- nothing storage systems. The placement of chunks and deduplication metadata is made cluster-wide based on the content fingerprint of chunks. To ensure transactional consistency and garbage identification, we employ a flag-based asynchronous consistency mechanism. We implement the proposed deduplication on Ceph. The evaluation shows high disk-space savings with minimal performance degradation as well as high robustness in the event of sudden server failure.Comment: 6 Pages including reference

    Data Deduplication Technology for Cloud Storage

    Get PDF
    With the explosive growth of information data, the data storage system has stepped into the cloud storage era. Although the core of the cloud storage system is distributed file system in solving the problem of mass data storage, a large number of duplicate data exist in all storage system. File systems are designed to control how files are stored and retrieved. Fewer studies focus on the cloud file system deduplication technologies at the application level, especially for the Hadoop distributed file system. In this paper, we design a file deduplication framework on Hadoop distributed file system for cloud application developer. Proposed RFD-HDFS and FD-HDFS two data deduplication solutions process data deduplication online, which improves storage space utilisation and reduces the redundancy. In the end of the paper, we test the disk utilisation and the file upload performance on RFD-HDFS and FD-HDFS, and compare HDFS with the disk utilisation of two system frameworks. The results show that the two-system framework not only implements data deduplication function but also effectively reduces the disk utilisation of duplicate files. So, the proposed framework can indeed reduce the storage space by eliminating redundant HDFS file

    PAS: A Sampling Based Similarity Identification Algorithm for compression of Unicode data content

    Get PDF
    Generally, Users perform searches to satisfy their information needs. Now a day’s lots of people are using search engine to satisfy information need. Server search is one of the techniques of searching the information. the Growth of data brings new changes in Server. The data usually proposed in timely fashion in server. If there is increase in latency then it may cause a massive loss to the enterprises. The similarity detection plays very important role in data. while there are many algorithms are used for similarity detection such as Shingle, Simhas TSA and Position Aware sampling algorithm. The Shingle Simhash and Traits read entire files to calculate similar values. It requires the long delay in growth of data set value. instead of reading entire Files PAS sample some data in the form of Unicode to calculate similarity characteristic value.PAS is the advance technique of TSA. However slight modification of file will trigger the position of file content .Therefore the failure of similarity identification is there due to some modifications.. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the Server. EPAS concurrently samples data blocks from the modulated file to avoid the position shift by the modifications. While there is an metric is proposed to measure the similarity between different files and make the possible detection probability close to the actual probability. In this paper describes a PAS algorithm to reduce the time overhead of similarity detection. Using PAS algorithm we can reduce the complication and time for identifying the similarity. Our result demonstrate that the EPAS significantly outperforms the existing well known algorithms in terms of time. Therefore, it is an effective approach of similarity identification for the Server

    Cross-Layer Fragment Indexing based File Deduplication using Hyper Spectral Hash Duplicate Filter (HSHDF) for Optimized Cloud Storage

    Get PDF
    Cloud computing and storage processing is a big service for maintaining a large number of data in a centralized server to store and retrieve data depending on the use to pay as a service model. Due to increasing storage depending on duplicate copy presence during different sceneries, the increased size leads to increased cost. To resolve this problem, we propose a Cross-Layer Fragment Indexing (CLFI) based file deduplication using Hyper Spectral Hash Duplicate Filter (HSHDF) for optimized cloud storage. Initially, the file storage indexing easy carried out with Lexical Syntactic Parser (LSP) to split the files into blocks. Then comparativesector was created based on Chunk staking. Based on the file frequency weight, the relative Indexing was verified through Cross-Layer Fragment Indexing (CLFI). Then the fragmented index gets grouped by maximum relative threshold margin usingIntra Subset Near-Duplicate Clusters (ISNDC). The hashing is applied to get comparative index points based on hyper correlation comparer using Hyper Spectral Hash Duplicate Filter (HSHDF). This filter the near duplicate contentdepending on file content difference to identify the duplicates. This proposed system produces high performance compared to the other system. This optimizes cloudstorage and has a higher precision rate than other methods

    Data auditing and security in cloud computing: issues, challenges and future directions

    Get PDF
    Cloud computing is one of the significant development that utilizes progressive computational power and upgrades data distribution and data storing facilities. With cloud information services, it is essential for information to be saved in the cloud and also distributed across numerous customers. Cloud information repository is involved with issues of information integrity, data security and information access by unapproved users. Hence, an autonomous reviewing and auditing facility is necessary to guarantee that the information is effectively accommodated and used in the cloud. In this paper, a comprehensive survey on the state-of-art techniques in data auditing and security are discussed. Challenging problems in information repository auditing and security are presented. Finally, directions for future research in data auditing and security have been discusse

    Data Auditing and Security in Cloud Computing: Issues, Challenges and Future Directions

    Get PDF
    Cloud computing is one of the significant development that utilizes progressive computational power and upgrades data distribution and data storing facilities. With cloud information services, it is essential for information to be saved in the cloud and also distributed across numerous customers. Cloud information repository is involved with issues of information integrity, data security and information access by unapproved users. Hence, an autonomous reviewing and auditing facility is necessary to guarantee that the information is effectively accommodated and used in the cloud. In this paper, a comprehensive survey on the state-of-art techniques in data auditing and security are discussed. Challenging problems in information repository auditing and security are presented. Finally, directions for future research in data auditing and security have been discussed

    Efficient, Dependable Storage of Human Genome Sequencing Data

    Get PDF
    A compreensão do genoma humano impacta várias áreas da vida. Os dados oriundos do genoma humano são enormes pois existem milhões de amostras a espera de serem sequenciadas e cada genoma humano sequenciado pode ocupar centenas de gigabytes de espaço de armazenamento. Os genomas humanos são críticos porque são extremamente valiosos para a investigação e porque podem fornecer informações delicadas sobre o estado de saúde dos indivíduos, identificar os seus dadores ou até mesmo revelar informações sobre os parentes destes. O tamanho e a criticidade destes genomas, para além da quantidade de dados produzidos por instituições médicas e de ciências da vida, exigem que os sistemas informáticos sejam escaláveis, ao mesmo tempo que sejam seguros, confiáveis, auditáveis e com custos acessíveis. As infraestruturas de armazenamento existentes são tão caras que não nos permitem ignorar a eficiência de custos no armazenamento de genomas humanos, assim como em geral estas não possuem o conhecimento e os mecanismos adequados para proteger a privacidade dos dadores de amostras biológicas. Esta tese propõe um sistema de armazenamento de genomas humanos eficiente, seguro e auditável para instituições médicas e de ciências da vida. Ele aprimora os ecossistemas de armazenamento tradicionais com técnicas de privacidade, redução do tamanho dos dados e auditabilidade a fim de permitir o uso eficiente e confiável de infraestruturas públicas de computação em nuvem para armazenar genomas humanos. As contribuições desta tese incluem (1) um estudo sobre a sensibilidade à privacidade dos genomas humanos; (2) um método para detetar sistematicamente as porções dos genomas que são sensíveis à privacidade; (3) algoritmos de redução do tamanho de dados, especializados para dados de genomas sequenciados; (4) um esquema de auditoria independente para armazenamento disperso e seguro de dados; e (5) um fluxo de armazenamento completo que obtém garantias razoáveis de proteção, segurança e confiabilidade a custos modestos (por exemplo, menos de 1/Genoma/Ano),integrandoosmecanismospropostosaconfigurac\co~esdearmazenamentoapropriadasTheunderstandingofhumangenomeimpactsseveralareasofhumanlife.Datafromhumangenomesismassivebecausetherearemillionsofsamplestobesequenced,andeachsequencedhumangenomemaysizehundredsofgigabytes.Humangenomesarecriticalbecausetheyareextremelyvaluabletoresearchandmayprovidehintsonindividualshealthstatus,identifytheirdonors,orrevealinformationaboutdonorsrelatives.Theirsizeandcriticality,plustheamountofdatabeingproducedbymedicalandlifesciencesinstitutions,requiresystemstoscalewhilebeingsecure,dependable,auditable,andaffordable.Currentstorageinfrastructuresaretooexpensivetoignorecostefficiencyinstoringhumangenomes,andtheylacktheproperknowledgeandmechanismstoprotecttheprivacyofsampledonors.Thisthesisproposesanefficientstoragesystemforhumangenomesthatmedicalandlifesciencesinstitutionsmaytrustandafford.Itenhancestraditionalstorageecosystemswithprivacyaware,datareduction,andauditabilitytechniquestoenabletheefficient,dependableuseofmultitenantinfrastructurestostorehumangenomes.Contributionsfromthisthesisinclude(1)astudyontheprivacysensitivityofhumangenomes;(2)todetectgenomesprivacysensitiveportionssystematically;(3)specialiseddatareductionalgorithmsforsequencingdata;(4)anindependentauditabilityschemeforsecuredispersedstorage;and(5)acompletestoragepipelinethatobtainsreasonableprivacyprotection,security,anddependabilityguaranteesatmodestcosts(e.g.,lessthan1/Genoma/Ano), integrando os mecanismos propostos a configurações de armazenamento apropriadasThe understanding of human genome impacts several areas of human life. Data from human genomes is massive because there are millions of samples to be sequenced, and each sequenced human genome may size hundreds of gigabytes. Human genomes are critical because they are extremely valuable to research and may provide hints on individuals’ health status, identify their donors, or reveal information about donors’ relatives. Their size and criticality, plus the amount of data being produced by medical and life-sciences institutions, require systems to scale while being secure, dependable, auditable, and affordable. Current storage infrastructures are too expensive to ignore cost efficiency in storing human genomes, and they lack the proper knowledge and mechanisms to protect the privacy of sample donors. This thesis proposes an efficient storage system for human genomes that medical and lifesciences institutions may trust and afford. It enhances traditional storage ecosystems with privacy-aware, data-reduction, and auditability techniques to enable the efficient, dependable use of multi-tenant infrastructures to store human genomes. Contributions from this thesis include (1) a study on the privacy-sensitivity of human genomes; (2) to detect genomes’ privacy-sensitive portions systematically; (3) specialised data reduction algorithms for sequencing data; (4) an independent auditability scheme for secure dispersed storage; and (5) a complete storage pipeline that obtains reasonable privacy protection, security, and dependability guarantees at modest costs (e.g., less than 1/Genome/Year) by integrating the proposed mechanisms with appropriate storage configurations

    Big data reduction framework for value creation in sustainable enterprises

    No full text
    Value creation is a major sustainability factor for enterprises, in addition to profit maximization and revenue generation. Modern enterprises collect big data from various inbound and outbound data sources. The inbound data sources handle data generated from the results of business operations, such as manufacturing, supply chain management, marketing, and human resource management, among others. Outbound data sources handle customer-generated data which are acquired directly or indirectly from customers, market analysis, surveys, product reviews, and transactional histories. However, cloud service utilization costs increase because of big data analytics and value creation activities for enterprises and customers. This article presents a novel concept of big data reduction at the customer end in which early data reduction operations are performed to achieve multiple objectives, such as a) lowering the service utilization cost, b) enhancing the trust between customers and enterprises, c) preserving privacy of customers, d) enabling secure data sharing, and e) delegating data sharing control to customers. We also propose a framework for early data reduction at customer end and present a business model for end-to-end data reduction in enterprise applications. The article further presents a business model canvas and maps the future application areas with its nine components. Finally, the article discusses the technology adoption challenges for value creation through big data reduction in enterprise applications
    corecore