82 research outputs found

    Statistical characterization of storage system workloads for data deduplication and load placement in heterogeneous storage environments

    University of Minnesota Ph.D. dissertation. November 2013. Major: Electrical Engineering. Advisor: David J. Lilja. 1 computer file (PDF); xi, 110 pages. The underlying technologies for storing digital bits have become more diverse in the last decade. There are no fundamental differences in their functionality, yet their behaviors can be quite different, and no single management technique seems to fit them all. The differences can be categorized by the metric of interest, such as the performance profile, the reliability profile, and the power profile. These profiles are a function of the system and the workload, assuming that the systems are exposed only to a pre-specified environment. The near-infinite workload space makes it infeasible to obtain complete profiles for any storage system unless the system enforces a discrete and finite profile internally. The thesis of this work is that an acceptable approximation of the profiles may be achieved by proper characterization of the workloads. A set of statistical tools, together with an understanding of system behavior, was used to evaluate and design such characterizations. The correctness of a characterization cannot be fully proved except by showing that the resulting profile correctly predicts any workload and storage system interaction. While this is not possible, we show that we can provide reasonable confidence in our characterization through statistical evaluation of the results. The characterizations of this work were applied to compression-ratio prediction for backup data deduplication and to load balancing of heterogeneous storage systems in a virtualized environment. Our characterization is validated through hundreds of real-world test cases as well as reasonable deductions based on our understanding of the storage systems. In both cases, the goodness of the characterizations was rigorously evaluated using statistical techniques. The findings from these validations both confirmed and contradicted many previous beliefs.
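    For context, a characterization like this ultimately rests on simple block-fingerprint measurements of the backup data. The sketch below estimates a data set's deduplication ratio by fingerprinting fixed-size blocks; the block size, the sample directory name, and the use of fixed-size (rather than content-defined) chunking are illustrative assumptions, not the dissertation's method.

```python
# Minimal sketch: estimate a backup data set's deduplication ratio by
# fingerprinting fixed-size blocks. It only illustrates the kind of
# measurement a statistical workload characterization builds on.
import hashlib
import os

BLOCK_SIZE = 4096  # 4 KiB blocks; real systems often use content-defined chunking

def dedup_ratio(paths):
    total_blocks = 0
    unique_fingerprints = set()
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                total_blocks += 1
                unique_fingerprints.add(hashlib.sha256(block).digest())
    if not unique_fingerprints:
        return 1.0
    # Ratio of logical data to physically stored (unique) data.
    return total_blocks / len(unique_fingerprints)

if __name__ == "__main__":
    # "backup" is a hypothetical directory of backup files.
    sample = [os.path.join("backup", name) for name in os.listdir("backup")]
    print(f"estimated deduplication ratio: {dedup_ratio(sample):.2f}")
```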

    Secure Cloud Storage with Client-Side Encryption Using a Trusted Execution Environment

    With the evolution of computer systems, the amount of sensitive data to be stored, as well as the number of threats against these data, keeps growing, making data confidentiality increasingly important to computer users. Currently, with devices always connected to the Internet, the use of cloud data storage services has become practical and common, allowing quick access to such data wherever the user is. Such practicality brings with it a concern, namely the confidentiality of the data that is delivered to third parties for storage. In the home environment, disk encryption tools have gained special attention from users, being used on personal computers and also offered natively by some smartphone operating systems. The present work uses data sealing, a feature provided by Intel Software Guard Extensions (Intel SGX) technology, for file encryption. A virtual file system is created in which applications can store their data, keeping the security guarantees provided by Intel SGX, before the data is sent to a storage provider. This way, even if the storage provider is compromised, the data remain safe. To validate the proposal, the Cryptomator software, a free client-side encryption tool for cloud files, was integrated with an Intel SGX application (enclave) for data sealing. The results demonstrate that the solution is feasible, in terms of both performance and security, and can be expanded and refined for practical use and integration with cloud synchronization services.
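    The core idea, encrypting data locally so the storage provider only ever sees ciphertext, can be sketched without the enclave machinery. The example below uses AES-GCM from the third-party `cryptography` package purely as an illustration of that client-side flow; in the paper, the key material would be protected by SGX sealing inside the enclave rather than held as a plain Python variable.

```python
# Sketch of the general client-side encryption flow: data is encrypted
# locally before it ever reaches the storage provider. A plain AES-GCM key
# stands in for the SGX-sealed key material used in the actual work.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_for_upload(plaintext: bytes, key: bytes) -> bytes:
    nonce = os.urandom(12)                     # 96-bit nonce, never reused per key
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return nonce + ciphertext                  # store the nonce alongside the ciphertext

def decrypt_after_download(blob: bytes, key: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

if __name__ == "__main__":
    key = AESGCM.generate_key(bit_length=256)  # in the paper, protected by enclave sealing
    blob = encrypt_for_upload(b"confidential file contents", key)
    assert decrypt_after_download(blob, key) == b"confidential file contents"
```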

    HEC: Collaborative Research: SAM^2 Toolkit: Scalable and Adaptive Metadata Management for High-End Computing

    The increasing demand for exabyte-scale storage capacity by high-end computing applications requires a higher level of scalability and dependability than that provided by current file and storage systems. The proposal deals with file systems research for metadata management of scalable cluster-based parallel and distributed file storage systems in the HEC environment. It aims to develop a scalable and adaptive metadata management (SAM2) toolkit to extend the features of, and fully leverage the peak performance promised by, state-of-the-art cluster-based parallel and distributed file storage systems used by the high performance computing community. There is a large body of research on scaling data movement and management; however, the need to scale up the attributes of cluster-based file systems and I/O, that is, metadata, has been underestimated. An understanding of the characteristics of metadata traffic, and the application of proper load-balancing, caching, prefetching, and grouping mechanisms to metadata management, will lead to high scalability. It is anticipated that by appropriately plugging the scalable and adaptive metadata management components into state-of-the-art cluster-based parallel and distributed file storage systems, one could potentially increase the performance of applications and file systems, and help translate the promise of high peak performance of such systems into real application performance improvements. The project involves the following components: 1. Develop multi-variable forecasting models to analyze and predict file metadata access patterns. 2. Develop scalable and adaptive file name mapping schemes using the duplicative Bloom filter array technique to enforce load balance and increase scalability. 3. Develop decentralized, locality-aware metadata grouping schemes to facilitate bulk metadata operations such as prefetching. 4. Develop an adaptive cache coherence protocol using a distributed shared object model for client-side and server-side metadata caching. 5. Prototype the SAM2 components in the state-of-the-art parallel virtual file system PVFS2 and a distributed storage data caching system, set up an experimental framework for a DOE CMS Tier 2 site at the University of Nebraska-Lincoln, and conduct benchmark, evaluation, and validation studies.
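    To make component 2 more concrete, the toy sketch below routes metadata lookups with one Bloom filter per metadata server: a client probes the filters and contacts only the servers that report a possible hit. The filter parameters, hash construction, and server names are illustrative assumptions, not the SAM2 design.

```python
# Toy illustration of per-server Bloom filters for metadata lookup routing,
# in the spirit of the proposal's Bloom-filter-array name mapping.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# One filter per metadata server; a client probes all filters and only
# contacts servers whose filter reports a possible hit (false positives
# are resolved by the contacted server itself).
servers = {srv: BloomFilter() for srv in ("mds0", "mds1", "mds2")}
servers["mds1"].add("/home/alice/results.csv")

candidates = [srv for srv, bf in servers.items() if bf.might_contain("/home/alice/results.csv")]
print("probe servers:", candidates)
```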

    Survey on Deduplication Techniques in Flash-Based Storage

    The importance of data deduplication is growing with the growth of data volumes, and the field is in active development. It has recently been influenced by the appearance of the Solid State Drive (SSD). This new type of disk differs significantly from random access memory and hard disk drives and is now widely used. In this paper we propose a novel taxonomy that reflects the main issues related to deduplication on Solid State Drives. We present a survey of deduplication techniques focusing on flash-based storage. We also describe several open-source tools implementing data deduplication and briefly outline open research problems related to data deduplication in flash-based storage systems.
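    As a concrete point of reference for the surveyed techniques, the sketch below shows an inline deduplication write path of the kind a flash translation layer might use: incoming pages are fingerprinted, and duplicates are remapped to an existing physical page instead of triggering another flash write. The class and field names are illustrative, not drawn from any specific system in the survey.

```python
# Simplified sketch of an inline, flash-oriented deduplication write path.
import hashlib

class DedupFTL:
    def __init__(self):
        self.fingerprint_to_ppn = {}   # page fingerprint -> physical page number
        self.logical_to_physical = {}  # logical page number -> physical page number
        self.flash = []                # stand-in for physical flash pages

    def write(self, lpn: int, page: bytes):
        fp = hashlib.sha256(page).digest()
        ppn = self.fingerprint_to_ppn.get(fp)
        if ppn is None:
            ppn = len(self.flash)
            self.flash.append(page)          # only unique pages reach the flash
            self.fingerprint_to_ppn[fp] = ppn
        self.logical_to_physical[lpn] = ppn   # duplicates just update the mapping

    def read(self, lpn: int) -> bytes:
        return self.flash[self.logical_to_physical[lpn]]

ftl = DedupFTL()
ftl.write(0, b"A" * 4096)
ftl.write(1, b"A" * 4096)                    # duplicate: no extra flash write
print("physical pages used:", len(ftl.flash))  # -> 1
```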

    Deduplicação segura em ambientes móveis (Secure Deduplication in Mobile Environments)

    The growing use of mobile devices such as smartphones, PDAs, and tablets, with rapidly increasing storage and processing capacities, allows users to carry an ever larger volume of data with them. The value of this information must be protected from loss, theft, or any other kind of accident that may happen to the devices, making the backup process a key factor. On mobile devices, data transmission has a high cost, both in price and in energy, and these devices depend on batteries, so optimizing bandwidth use is crucial. Deduplication allows a substantial reduction in the volume of data to be transmitted over the network during backup sessions by identifying similar blocks of data and transferring and storing them only once. Moreover, mobile devices usually rely on mobile networks, and cloud storage is a viable solution for storing backup data. Consequently, the use of cryptographic processes to protect the information is essential, not only during transmission but also in storage. This work evaluates the viability of using mobile devices for backup processing, employing deduplication and encryption simultaneously.
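    One well-known way to reconcile deduplication with client-side encryption, though not necessarily the scheme evaluated in this thesis, is convergent encryption: each block is encrypted under a key derived from its own content, so identical blocks yield identical ciphertext and remain deduplicable by the server. A minimal sketch, using the third-party `cryptography` package:

```python
# Convergent encryption sketch: the key is derived from the block's content,
# so identical plaintext blocks produce identical ciphertext and can still be
# deduplicated by an untrusted storage server.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(block: bytes) -> tuple[bytes, bytes]:
    key = hashlib.sha256(block).digest()   # content-derived 256-bit key
    nonce = b"\x00" * 12                   # a fixed nonce is acceptable here only because
                                           # each key encrypts exactly one block
    ciphertext = AESGCM(key).encrypt(nonce, block, None)
    return key, ciphertext

k1, c1 = convergent_encrypt(b"same backup chunk")
k2, c2 = convergent_encrypt(b"same backup chunk")
assert c1 == c2   # identical chunks stay deduplicable even though they are encrypted
```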

    Computing at massive scale: Scalability and dependability challenges

    Large-scale Cloud systems and big data analytics frameworks are now widely used for practical services and applications. However, with the increase in data volume, the heterogeneity of workloads and resources, and the dynamic nature of massive user requests, the uncertainty and complexity of resource management and service provisioning increase dramatically, often resulting in poor resource utilization, vulnerable system dependability, and user-perceived performance degradation. In this paper we report our latest understanding of the current and future challenges in this area and discuss both existing and potential solutions, especially those concerning system efficiency, scalability, and dependability. We first introduce a data-driven analysis methodology for characterizing resource and workload patterns and tracing performance bottlenecks in a massive-scale distributed computing environment. We then examine and analyze several fundamental challenges and the solutions we are developing to tackle them, including, for example, incremental but decentralized resource scheduling, incremental messaging communication, rapid system failover, and request handling parallelism. We integrate these solutions with our data analysis methodology in order to establish an engineering approach that facilitates the optimization, tuning, and verification of massive-scale distributed systems. We aim to develop and offer innovative methods and mechanisms for future computing platforms that will provide strong support for new big data and IoE (Internet of Everything) applications.
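    As a rough illustration of the kind of data-driven trace analysis described above (not the authors' actual pipeline), the sketch below summarizes per-machine CPU utilization from a hypothetical cluster trace and flags machines whose tail utilization suggests a bottleneck; the CSV column names and the 95th-percentile threshold are assumptions made for the example.

```python
# Hypothetical cluster-trace summary: flag machines with high tail CPU utilization.
import csv
from collections import defaultdict
from statistics import quantiles

def flag_hotspots(trace_csv: str, p95_threshold: float = 0.9):
    samples = defaultdict(list)
    with open(trace_csv, newline="") as f:
        for row in csv.DictReader(f):          # expects machine_id,cpu_util columns
            samples[row["machine_id"]].append(float(row["cpu_util"]))
    hotspots = []
    for machine, utils in samples.items():
        p95 = quantiles(utils, n=20)[18]       # 95th percentile of the samples
        if p95 >= p95_threshold:
            hotspots.append((machine, p95))
    return sorted(hotspots, key=lambda x: -x[1])

if __name__ == "__main__":
    for machine, p95 in flag_hotspots("cluster_trace.csv"):
        print(f"{machine}: p95 CPU utilisation {p95:.2f}")
```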

    Challenges and requirements of heterogeneous research data management in environmental sciences: a qualitative study

    The research focuses on the challenges and requirements of heterogeneous research data management in environmental sciences. Environmental research involves diverse data types, and the effective management and integration of these data sets are crucial. The issue at hand is the lack of specific guidance on how to select and plan an appropriate data management practice to address the challenges of handling and integrating diverse data types in environmental research. The objective of the research is to identify the issues associated with the current data storage approach in research data management and to determine the requirements for an appropriate system to address these challenges. The research adopts a qualitative approach, utilizing semi-structured interviews to collect data. Content analysis is employed to analyze the gathered data and identify relevant issues and requirements. The study reveals various issues in the current data management process, including inconsistencies in data treatment, the risk of unintentional data deletion, loss of knowledge due to staff turnover, lack of guidelines, and data scattered across multiple locations. The requirements identified through the interviews emphasize the need for a data management system that integrates automation, open access, centralized storage, online electronic lab notes, systematic data management, secure repositories, reduced hardware storage, and version control with metadata support. The research identifies the current challenges faced by researchers in heterogeneous data management and compiles a list of requirements for an effective solution. The findings contribute to existing knowledge on research-related problems and provide a foundation for developing tailored solutions to meet the specific needs of researchers in environmental sciences.

    de.NBI Cloud Storage Tübingen. A federated and georedundant solution for large scientific data

    The »German Network for Bioinformatics Infrastructure«, or »de.NBI« for short, is a national research infrastructure providing bioinformatics services to users in life sciences research, biomedicine, and related fields. Cloud sites were established at five locations across Germany to host the bioinformatics services. In Tübingen, an extension of the storage capabilities of the cloud was planned, implemented, and brought into production. We report here on the motivation, requirements, design decisions, and experiences, which may serve as inspiration for other large-scale storage endeavours in the academic domain.

    Network Traffic Measurements, Applications to Internet Services and Security

    The Internet has become over the years a pervasive network interconnecting billions of users and now plays the role of collector for a multitude of tasks, ranging from professional activities to personal interactions. From a technical standpoint, novel architectures, e.g., cloud-based services and content delivery networks, innovative devices, e.g., smartphones and connected wearables, and security threats, e.g., DDoS attacks, are posing new challenges to understanding network dynamics. In such a complex scenario, network measurements play a central role in guiding traffic management, improving network design, and evaluating application requirements. In addition, increasing importance is devoted to the quality of experience provided to final users, which requires thorough investigation of both the transport network and the design of Internet services. In this thesis, we stress the importance of users' centrality by focusing on the traffic they exchange with the network. To do so, we design methodologies complementing passive and active measurements, as well as post-processing techniques belonging to the machine learning and statistics domains. Traffic exchanged by Internet users can be classified into three macro-groups: (i) outbound, produced by users' devices and pushed to the network; (ii) unsolicited, part of malicious attacks threatening users' security; and (iii) inbound, directed to users' devices and retrieved from remote servers. For each of these categories, we address a specific research topic, consisting of the benchmarking of personal cloud storage services, the automatic identification of Internet threats, and the assessment of quality of experience in the Web domain, respectively. The results comprise several contributions within each research topic. In short, they shed light on (i) the interplay among design choices of cloud storage services, which severely impacts the performance provided to end users; (ii) the feasibility of designing a general-purpose classifier to detect malicious attacks without chasing threat specificities; and (iii) the relevance of appropriate means to evaluate the perceived quality of Web page delivery, strengthening the need for user feedback in a factual assessment.
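    The "general-purpose classifier" mentioned above can be pictured as a standard supervised model over per-flow features. The sketch below, using scikit-learn and pandas, is a hedged illustration of that idea; the feature names, the labeled input file, and the random-forest choice are assumptions, not the thesis's actual pipeline.

```python
# Hedged sketch of a supervised per-flow threat classifier: flow features in,
# benign/malicious label out. Requires scikit-learn and pandas.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

FEATURES = ["duration", "bytes_up", "bytes_down", "pkts_up", "pkts_down", "dst_port"]

flows = pd.read_csv("labeled_flows.csv")            # hypothetical labeled flow records
X_train, X_test, y_train, y_test = train_test_split(
    flows[FEATURES], flows["is_malicious"], test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```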