Shingled Magnetic Recording disks for Mass Storage Systems
Disk drives have seen a dramatic increase in storage density over the last five decades, but to continue the growth seems difficult if not impossible because of physical limitations. One way to increase storage density is using a shingled magnetic recording (SMR) disk. Shingled writing is a promising technique that trades off the inability to update in-place for narrower tracks and thus a much higher data density. It is particularly appealing as it can be adopted while utilizing essentially the same physical recording mechanisms currently in use. Because of its manner of writing, an SMR disk would be unable to update a written track without overwriting neighboring tracks, potentially requiring the rewrite of all the tracks to the end of a band where the end of a band is an area left unwritten to allow for a non-overlapped final track. Random reads are still possible on such devices, but the handling of writes becomes particularly critical.
In this manuscript, we first look at a variety of potential workloads, drawn from real-world traces, and evaluate their impact on SMR disk models. Later, we evaluate the behavior of SMR disks when used in an array configuration or when faced with heavily interleaved workloads. Specifically, we demonstrate the dramatically different effects that different workloads can have upon the opposing approaches of remapping and restoring blocks, and how write-heavy workloads can (under the right conditions, and contrary to intuition) result in a performance advantage for an SMR disk.
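The rewrite cost described above (updating one track may force a rewrite of every later track in the band) can be illustrated with a minimal sketch. The band size, the track indexing, and the function names are assumptions made for illustration; they are not taken from the disk models evaluated in the manuscript.

```python
from typing import Sequence


def tracks_to_rewrite(band_size: int, track: int) -> int:
    """Tracks physically written when `track` is updated in place.

    Assumes tracks are numbered 0..band_size-1 in shingling order, so
    writing track t clobbers track t+1, which must then be rewritten too,
    and so on until the end of the band.
    """
    if not 0 <= track < band_size:
        raise ValueError("track index outside the band")
    return band_size - track


def write_amplification(band_size: int, updates: Sequence[int]) -> float:
    """Average number of tracks written per single-track update."""
    total = sum(tracks_to_rewrite(band_size, t) for t in updates)
    return total / len(updates)


if __name__ == "__main__":
    # Hypothetical band of 8 tracks: updating the first track rewrites
    # the whole band, updating the last track rewrites only itself.
    print(tracks_to_rewrite(8, 0), tracks_to_rewrite(8, 7))
    print(write_amplification(8, [0, 7]))
```

Under this simplified model, updates near the start of a band are the expensive ones, which is one reason remapping writes elsewhere can be attractive.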
RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)
The RAID proposal advocated replacing large disks with arrays of PC disks, but as
the capacity of small disks increased 100-fold in the 1990s, the production of large
disks was discontinued. Storage dependability is increased via replication or
erasure coding. Cloud storage providers store multiple copies of data, obviating
the need for further redundancy. Variations of RAID based on local recovery
codes and partial MDS codes reduce recovery cost. NAND flash Solid State Disks (SSDs)
have low latency and high bandwidth, are more reliable, consume less power and
have a lower TCO than Hard Disk Drives, which are more viable for hyperscalers.
Comment: Submitted to ACM Computing Surveys. arXiv admin note: substantial
text overlap with arXiv:2306.0876
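As a hedged illustration of the erasure coding mentioned above, the sketch below uses single XOR parity, the simplest such code, to rebuild one lost block from the survivors. The block contents and the helper name `xor_blocks` are illustrative assumptions, not material from the tutorial.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per disk, plus one parity block.
data = [b"disk0 data", b"disk1 data", b"disk2 data"]
parity = xor_blocks(data)

# Suppose disk 1 fails: XOR of the surviving blocks and the parity
# block reproduces the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])

if __name__ == "__main__":
    print(rebuilt == data[1])
```

Single parity tolerates only one failure per stripe; codes that survive more failures need additional, differently computed redundancy.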
Scalability of RAID systems
RAID systems (Redundant Arrays of Inexpensive Disks) have dominated backend
storage systems for more than two decades and have grown continuously in size
and complexity. Currently they face unprecedented challenges from data intensive
applications such as image processing, transaction processing and data warehousing.
As the size of RAID systems increases, designers are faced with both performance and
reliability challenges. These challenges include limited back-end network bandwidth,
physical interconnect failures, correlated disk failures and long disk reconstruction
time.
This thesis studies the scalability of RAID systems in terms of both performance
and reliability through simulation, using a discrete event driven simulator for RAID
systems (SIMRAID) developed as part of this project. SIMRAID incorporates two
benchmark workload generators, based on the SPC-1 and Iometer benchmark specifications.
Each component of SIMRAID is highly parameterised, enabling it to explore
a large design space. To improve the simulation speed, SIMRAID develops a set of
abstraction techniques to extract the behaviour of the interconnection protocol without
losing accuracy. Finally, to meet the technology trend toward heterogeneous storage
architectures, SIMRAID develops a framework that allows easy modelling of different
types of device and interconnection technique.
Simulation experiments were first carried out on performance aspects of scalability.
They were designed to answer two questions: (1) given a number of disks, which
factors affect back-end network bandwidth requirements; (2) given an interconnection
network, how many disks can be connected to the system. The results show that
the bandwidth requirement per disk is primarily determined by workload features and
stripe unit size (a smaller stripe unit size has better scalability than a larger one), with
cache size and RAID algorithm having very little effect on this value. The maximum
number of disks is limited, as would be expected, by the back-end network bandwidth.
Studies of reliability have led to three proposals to improve the reliability and scalability
of RAID systems. Firstly, a novel data layout called PCDSDF is proposed.
PCDSDF combines the advantages of orthogonal data layouts and parity declustering
data layouts, so that it can not only survive multiple disk failures caused by physical interconnect
failures or correlated disk failures, but also has a good degraded and rebuild
performance. The generating process of PCDSDF is deterministic and time-efficient.
The number of stripes per rotation (namely the number of stripes to achieve rebuild workload balance) is small. Analysis shows that the PCDSDF data layout can significantly
improve the system reliability. Simulations performed on SIMRAID confirm
the good performance of PCDSDF, which is comparable to other parity declustering
data layouts, such as RELPR.
Secondly, a system architecture and rebuilding mechanism have been designed,
aimed at fast disk reconstruction. This architecture is based on parity declustering data
layouts and a disk-oriented reconstruction algorithm. It uses stripe groups instead of
stripes as the basic distribution unit so that it can make use of the sequential nature of
the rebuilding workload. The design space of system factors such as parity declustering
ratio, chunk size, private buffer size of surviving disks and free buffer size is explored
to provide guidelines for storage system design.
Thirdly, an efficient distributed hot spare allocation and assignment algorithm for
general parity declustering data layouts has been developed. This algorithm avoids
conflict problems in the process of assigning distributed spare space for the units on
the failed disk. Simulation results show that it effectively solves the write bottleneck
problem and, at the same time, there is only a small increase in the average response
time to user requests.
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
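The programming model described above can be sketched in a few lines. This is a single-process toy, assuming word counting as the example job; the function names are hypothetical, and a real framework would add the data distribution, scheduling and fault tolerance that the abstract mentions.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["a b a", "B c"]))
```

The application code is confined to `map_phase` and `reduce_phase`; everything else is the part a framework would own.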
Scalable Storage for Digital Libraries
I propose a storage system optimised for digital libraries. Its key features are its heterogeneous scalability; its integration and exploitation of rich semantic metadata associated with digital objects; its use of a name space; and its aggressive performance optimisation in the digital library domain.
High Performance Computing for DNA Sequence Alignment and Assembly
Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing.
Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation to large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and have the potential to make otherwise infeasible computations practical.
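Read alignment as described above can be illustrated with a deliberately naive sketch: slide the read along the reference and report positions with few mismatches. This brute-force scan is an assumption chosen for clarity, not the GPU or MapReduce method of the dissertation, and the function name and sequences are made up.

```python
def align_read(reference, read, max_mismatches=0):
    """Return (position, mismatches) for every placement of `read` on
    `reference` that has at most `max_mismatches` differing bases."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    print(align_read("ACGTACGT", "GTA"))      # exact placement
    print(align_read("ACGTACGT", "GTT", 1))   # one mismatch allowed
```

Each read is aligned independently of every other read, which is what makes the problem a natural fit for the parallel hardware and frameworks named in the abstract.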
Electronic Components Subsystems and Equipment: a Compilation
Developments in electronic components, subsystems, and equipment are summarized. Topics discussed include integrated circuit components and techniques, circuit components and techniques, and cables and connectors.
A shared-disk parallel cluster file system
Dissertation submitted for the degree of Doctor in Informatics at the Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia.
Today, clusters are the de facto cost effective platform for both high performance
computing (HPC) and IT environments. HPC and IT are quite different environments
and differences include, among others, their choices of file systems and storage: HPC favours parallel file systems geared towards maximum I/O bandwidth, but which are not fully POSIX-compliant and were devised to run on top of (fault prone) partitioned storage; conversely, IT data centres favour both external disk arrays (to provide highly available storage) and POSIX compliant file systems (either general purpose or shared-disk cluster file systems, CFSs).
These specialised file systems do perform very well in their target environments provided that applications do not require some lateral features, e.g., no file locking on parallel file systems, and no high performance writes over cluster-wide shared files on CFSs. In brief, we can say
that none of the above approaches solves the problem of providing high levels of reliability and performance to both worlds.
Our pCFS proposal makes a contribution to change this situation: the rationale is to take advantage of the best of both – the reliability of cluster file systems and the high performance of parallel file systems. We don’t claim to provide the absolute best of each, but we aim at full POSIX compliance, a rich feature set, and levels of reliability and performance good enough
for broad usage – e.g., traditional as well as HPC applications, support of clustered DBMS engines that may run over regular files, and video streaming. pCFS’ main ideas include:
· Cooperative caching, a technique that has been used in file systems for distributed disks but, as far as we know, was never used either in SAN based cluster file systems or in parallel file systems. As a result, pCFS may use all infrastructures (LAN and SAN) to move data.
· Fine-grain locking, whereby processes running across distinct nodes may define nonoverlapping byte-range regions in a file (instead of the whole file) and access them in parallel, reading and writing over those regions at the infrastructure’s full speed (provided that no major metadata changes are required).
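The fine-grain locking idea in the list above can be sketched as a small in-memory lock table that grants a byte range only when it does not overlap a range held by another owner. This is an illustrative assumption about the bookkeeping involved, not pCFS code; the class name, method names and node labels are hypothetical.

```python
class ByteRangeLockTable:
    """Toy lock table for one file: ranges are [start, start + length)."""

    def __init__(self):
        self._held = []  # list of (start, end, owner), end exclusive

    def try_lock(self, owner, start, length):
        """Grant the range unless it overlaps a range held by someone else."""
        end = start + length
        for held_start, held_end, held_owner in self._held:
            overlaps = start < held_end and held_start < end
            if overlaps and held_owner != owner:
                return False
        self._held.append((start, end, owner))
        return True

    def unlock(self, owner, start, length):
        """Release a range previously granted with the same arguments."""
        self._held.remove((start, start + length, owner))


if __name__ == "__main__":
    table = ByteRangeLockTable()
    print(table.try_lock("node-a", 0, 4096))     # first 4 KB
    print(table.try_lock("node-b", 4096, 4096))  # next 4 KB, disjoint
    print(table.try_lock("node-b", 2048, 4096))  # overlaps node-a's range
```

Because the two granted ranges are disjoint, the two nodes could write their regions in parallel without locking the whole file, which is the effect the bullet describes.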
A prototype was built on top of GFS (a Red Hat shared disk CFS): GFS’ kernel code was
slightly modified, and two kernel modules and a user-level daemon were added. In the
prototype, fine grain locking is fully implemented and a cluster-wide coherent cache is maintained through data (page fragments) movement over the LAN.
Our benchmarks for non-overlapping writers over a single file shared among processes
running on different nodes show that pCFS’ bandwidth is 2 times greater than NFS’ while
being comparable to that of the Parallel Virtual File System (PVFS), both requiring about 10 times more CPU. And pCFS’ bandwidth also surpasses GFS’ (600 times for small record sizes, e.g., 4 KB, decreasing down to 2 times for large record sizes, e.g., 4 MB), at about the same CPU usage.
Lusitania, Companhia de Seguros S.A; Programa IBM Shared University Research (SUR)
Resiliency Mechanisms for In-Memory Column Stores
The key objective of database systems is to reliably manage data, while high query throughput and low query latency are core requirements. To date, database research activities have mostly concentrated on the second part. However, due to the constant shrinking of transistor feature sizes, integrated circuits are becoming more and more unreliable and transient hardware errors in the form of multi-bit flips are becoming more and more prominent. A 2013 study of a large high-performance cluster with around 8500 nodes measured a failure rate of 40 FIT per DRAM device. For that system, this means that a single- or multi-bit flip occurs every 10 hours, which is unacceptably high for enterprise and HPC scenarios. Causes can be cosmic rays, heat, or electrical crosstalk, with the latter being exploited actively through the RowHammer attack. It was shown that memory cells are more prone to bit flips than logic gates and several surveys found multi-bit flip events in main memory modules of today's data centers. Due to the shift towards in-memory data management systems, where all business related data and query intermediate results are kept solely in fast main memory, such systems are in great danger of delivering corrupt results to their users. Hardware techniques cannot be scaled to compensate for the exponentially increasing error rates. In other domains, there is an increasing interest in software-based solutions to this problem, but these proposed methods come with huge runtime and/or storage overheads. These are unacceptable for in-memory data management systems.
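The step from "40 FIT per DRAM device" to "a flip every 10 hours" can be checked with a short calculation, using the standard definition of FIT as failures per 10^9 device-hours. The device count below is derived from the two quoted figures and is not stated in the abstract, so treat it as a consistency check under that assumption rather than a reported number.

```python
FIT_HOURS = 1e9  # one FIT = one failure per 10**9 device-hours


def mean_hours_between_failures(fit_per_device, devices):
    """Expected hours between failures across a population of devices."""
    return FIT_HOURS / (fit_per_device * devices)


def devices_for_interval(fit_per_device, hours):
    """Device count at which one failure is expected every `hours` hours."""
    return FIT_HOURS / (fit_per_device * hours)


if __name__ == "__main__":
    implied = devices_for_interval(40, 10)
    print(implied)                 # implied number of DRAM devices
    print(implied / 8500)          # implied devices per node
    print(mean_hours_between_failures(40, implied))
```

Under this reading the two figures are consistent if the cluster holds roughly 2.5 million DRAM devices, or about 294 per node.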
In this thesis, we investigate how to integrate bit flip detection mechanisms into in-memory data management systems. To achieve this goal, we first build an understanding of bit flip detection techniques and select two error codes, AN codes and XOR checksums, suitable to the requirements of in-memory data management systems. The most important requirement is effectiveness of the codes to detect bit flips. We meet this goal through AN codes, which exhibit better and adaptable error detection capabilities than those found in today's hardware. The second most important goal is efficiency in terms of coding latency. We meet this by introducing fundamental performance improvements to AN codes, and by vectorizing both chosen codes' operations. We integrate bit flip detection mechanisms into the lowest storage layer and the query processing layer in such a way that the remaining data management system and the user can stay oblivious of any error detection. This includes both base columns and pointer-heavy index structures such as the ubiquitous B-Tree. Additionally, our approach allows adaptable, on-the-fly bit flip detection during query processing, with only very little impact on query latency. AN coding allows intermediate results to be recoded with virtually no performance penalty. We support our claims by providing exhaustive runtime and throughput measurements throughout the whole thesis and with an end-to-end evaluation using the Star Schema Benchmark. To the best of our knowledge, we are the first to present such holistic and fast bit flip detection in a large software infrastructure such as in-memory data management systems. Finally, most of the source code fragments used to obtain the results in this thesis are open source and freely available.
1 INTRODUCTION
1.1 Contributions of this Thesis
1.2 Outline
2 PROBLEM DESCRIPTION AND RELATED WORK
2.1 Reliable Data Management on Reliable Hardware
2.2 The Shift Towards Unreliable Hardware
2.3 Hardware-Based Mitigation of Bit Flips
2.4 Data Management System Requirements
2.5 Software-Based Techniques For Handling Bit Flips
2.5.1 Operating System-Level Techniques
2.5.2 Compiler-Level Techniques
2.5.3 Application-Level Techniques
2.6 Summary and Conclusions
3 ANALYSIS OF CODING TECHNIQUES
3.1 Selection of Error Codes
3.1.1 Hamming Coding
3.1.2 XOR Checksums
3.1.3 AN Coding
3.1.4 Summary and Conclusions
3.2 Probabilities of Silent Data Corruption
3.2.1 Probabilities of Hamming Codes
3.2.2 Probabilities of XOR Checksums
3.2.3 Probabilities of AN Codes
3.2.4 Concrete Error Models
3.2.5 Summary and Conclusions
3.3 Throughput Considerations
3.3.1 Test Systems Descriptions
3.3.2 Vectorizing Hamming Coding
3.3.3 Vectorizing XOR Checksums
3.3.4 Vectorizing AN Coding
3.3.5 Summary and Conclusions
3.4 Comparison of Error Codes
3.4.1 Effectiveness
3.4.2 Efficiency
3.4.3 Runtime Adaptability
3.5 Performance Optimizations for AN Coding
3.5.1 The Modular Multiplicative Inverse
3.5.2 Faster Softening
3.5.3 Faster Error Detection
3.5.4 Comparison to Original AN Coding
3.5.5 The Multiplicative Inverse Anomaly
3.6 Summary
4 BIT FLIP DETECTING STORAGE
4.1 Column Store Architecture
4.1.1 Logical Data Types
4.1.2 Storage Model
4.1.3 Data Representation
4.1.4 Data Layout
4.1.5 Tree Index Structures
4.1.6 Summary
4.2 Hardened Data Storage
4.2.1 Hardened Physical Data Types
4.2.2 Hardened Lightweight Compression
4.2.3 Hardened Data Layout
4.2.4 UDI Operations
4.2.5 Summary and Conclusions
4.3 Hardened Tree Index Structures
4.3.1 B-Tree Verification Techniques
4.3.2 Justification For Further Techniques
4.3.3 The Error Detecting B-Tree
4.4 Summary
5 BIT FLIP DETECTING QUERY PROCESSING
5.1 Column Store Query Processing
5.2 Bit Flip Detection Opportunities
5.2.1 Early Onetime Detection
5.2.2 Late Onetime Detection
5.2.3 Continuous Detection
5.2.4 Miscellaneous Processing Aspects
5.2.5 Summary and Conclusions
5.3 Hardened Intermediate Results
5.3.1 Materialization of Hardened Intermediates
5.3.2 Hardened Bitmaps
5.4 Summary
6 END-TO-END EVALUATION
6.1 Prototype Implementation
6.1.1 AHEAD Architecture
6.1.2 Diversity of Physical Operators
6.1.3 One Concrete Operator Realization
6.1.4 Summary and Conclusions
6.2 Performance of Individual Operators
6.2.1 Selection on One Predicate
6.2.2 Selection on Two Predicates
6.2.3 Join Operators
6.2.4 Grouping and Aggregation
6.2.5 Delta Operator
6.2.6 Summary and Conclusions
6.3 Star Schema Benchmark Queries
6.3.1 Query Runtimes
6.3.2 Improvements Through Vectorization
6.3.3 Storage Overhead
6.3.4 Summary and Conclusions
6.4 Error Detecting B-Tree
6.4.1 Single Key Lookup
6.4.2 Key Value-Pair Insertion
6.5 Summary
7 SUMMARY AND CONCLUSIONS
7.1 Future Work
A APPENDIX
A.1 List of Golden As
A.2 More on Hamming Coding
A.2.1 Code examples
A.2.2 Vectorization
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
LIST OF LISTINGS
LIST OF ACRONYMS
LIST OF SYMBOLS
LIST OF DEFINITIONS
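The two error codes named in the thesis abstract, AN codes and XOR checksums, can be sketched as follows. An AN code stores a value multiplied by a constant A; the sketch decodes by multiplying with the modular multiplicative inverse of A (a topic the outline lists in section 3.5.1) and flags any result outside the data range. The constant 641, the 8-bit data width and the 18-bit code width are hypothetical choices for illustration, not the thesis's parameters, and all function names are made up.

```python
A = 641                      # hypothetical odd constant, not from the thesis
DATA_BITS = 8                # assumed width of unencoded values
CODE_BITS = 18               # assumed code width; 255 * 641 < 2**18
MAX_DATA = (1 << DATA_BITS) - 1
MASK = (1 << CODE_BITS) - 1
A_INV = pow(A, -1, 1 << CODE_BITS)  # modular inverse, needs Python 3.8+


def an_encode(value):
    """Encode a data value as a multiple of A."""
    if not 0 <= value <= MAX_DATA:
        raise ValueError("value outside the data range")
    return value * A


def an_decode(code):
    """Decode by multiplying with A's inverse modulo 2**CODE_BITS.

    A valid code word yields a value within the data range; a code word
    that is not a multiple of A yields a value above MAX_DATA.
    """
    value = (code * A_INV) & MASK
    if value > MAX_DATA:
        raise ValueError("bit flip detected")
    return value


def xor_checksum(words):
    """XOR of all words in a block; stored next to the block and recomputed
    on read to detect changes."""
    checksum = 0
    for word in words:
        checksum ^= word
    return checksum


if __name__ == "__main__":
    code = an_encode(42)
    print(an_decode(code))
    try:
        an_decode(code ^ (1 << 3))   # flip one bit of the code word
    except ValueError as err:
        print(err)
    print(xor_checksum([5, 9, 7]))
```

With these parameters every single-bit flip is detected, because a flipped code word differs from a multiple of A by a power of two, which an odd A greater than 1 cannot divide. Multi-bit detection depends on the choice of A and is not demonstrated here.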
- …