European Wireless 2019; 25th European Wireless Conference. Aarhus, Denmark
This paper describes a new design of Reed-Solomon (RS) codes that uses composite extension fields. Our ultimate goal is to provide codes that remain Maximum Distance Separable (MDS) but can be processed at higher speeds in the encoder and decoder. This is possible by using coefficients in the generator matrix that belong to the smaller (and faster) finite fields of the composite extension and limiting the use of the larger (and slower) finite fields to a minimum. We provide formulae and an algorithm to generate such constructions starting from a Vandermonde RS generator matrix, and show that even the simplest constructions, e.g., using processing in only two finite fields, can speed up processing by as much as two-fold compared to Vandermonde RS and Cauchy RS codes using the same decoding algorithm, and by more than two-fold compared to other Cauchy RS and FFT-based RS codes.
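The core trick, keeping generator-matrix coefficients in a small subfield, can be illustrated with a minimal Python sketch. Here GF(2^8) stands in for the larger field and its GF(2^4) subfield for the small, fast one; choosing Vandermonde evaluation points inside the subfield keeps every coefficient small. This is the extreme case (the paper's constructions mix fields to retain the code length of the large field, and its field choices may differ):

```python
# Sketch: Vandermonde RS generator matrix over GF(2^8) whose coefficients
# all lie in the GF(2^4) subfield. Illustrative only; the paper's actual
# composite-extension constructions are more general.

# GF(2^8) arithmetic via log/exp tables (primitive polynomial 0x11D).
EXP = [0] * 255
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D

def gf_pow(a, n):
    """a**n in GF(2^8) for nonzero a."""
    return 1 if n == 0 else EXP[(LOG[a] * n) % 255]

# Elements of the GF(2^4) subfield inside GF(2^8): 0 and alpha^(17*j),
# since (2^8 - 1) / (2^4 - 1) = 17.
subfield = [0] + [EXP[17 * j] for j in range(15)]

def vandermonde(k, points):
    """k x len(points) Vandermonde matrix with entries points[j]**i."""
    return [[gf_pow(p, i) for p in points] for i in range(k)]

# Evaluation points drawn from the subfield => every coefficient is a
# subfield element, so encoding can run entirely on the fast field.
G = vandermonde(4, subfield[1:7])
assert all(e in subfield for row in G for e in row)
```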
CORE: Augmenting Regenerating-Coding-Based Recovery for Single and Concurrent Failures in Distributed Storage Systems
Data availability is critical in distributed storage systems, especially when node failures are prevalent in real life. A key requirement is to minimize the amount of data transferred among nodes when recovering the lost or unavailable data of failed nodes. This paper explores recovery solutions based on regenerating codes, which are shown to provide fault-tolerant storage and minimum recovery bandwidth. Existing optimal regenerating codes are designed for single node failures. We build a system called CORE, which augments existing optimal regenerating codes to support a general number of failures, including single and concurrent failures. We theoretically show that CORE achieves the minimum possible recovery bandwidth in most cases. We implement CORE and evaluate our prototype atop a Hadoop HDFS cluster testbed with up to 20 storage nodes. We demonstrate that our CORE prototype conforms to our theoretical findings and achieves recovery bandwidth savings compared to the conventional recovery approach based on erasure codes.
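To see why regenerating codes help, a quick back-of-the-envelope comparison in Python (with illustrative parameters, not numbers from the paper) contrasts conventional erasure-code repair, which downloads k full blocks, with the standard minimum-storage regenerating (MSR) repair bandwidth gamma = d*M/(k*(d-k+1)) when d helper nodes participate:

```python
# Repair-bandwidth comparison for a file of size M striped over an
# (n, k) MDS code. Conventional repair rebuilds the whole stripe;
# an MSR regenerating code contacting d helpers downloads much less.

def conventional_repair(M, k):
    return k * (M / k)          # k blocks of size M/k each = M total

def msr_repair(M, k, d):
    return d * M / (k * (d - k + 1))

M, n, k = 1.0, 12, 6            # assumed example parameters
for d in range(k, n):           # helpers reachable after one failure
    print(f"d={d:2d}  MSR repair = {msr_repair(M, k, d):.3f} M "
          f"(conventional = {conventional_repair(M, k):.1f} M)")
```

With d = 11 helpers this comes to roughly 0.31 M versus 1.0 M for conventional repair, which is the kind of bandwidth saving regenerating codes are designed for.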
Evaluating Erasure Codes in Dicoogle PACS
DICOM (Digital Imaging and Communications in Medicine) is a standard for image and data transmission in medical hardware, commonly used for viewing, storing, printing and transmitting images. Dicoogle, a PACS (Picture Archiving and Communication System) platform built around DICOM file transmission, has become one of the most in-demand image processing and viewing platforms. However, the Dicoogle PACS architecture does not guarantee recovery of image information in the case of data loss. This paper therefore proposes a file recovery solution for the Dicoogle architecture. The proposal maximizes the encoding and decoding performance of medical images through computational parallelism. To validate the proposal, a Java implementation based on the Reed-Solomon algorithm is evaluated in different performance tests. The experimental results show that the proposal achieves good image processing times for the Dicoogle PACS storage system.
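The parallelization idea, encoding many files' shards concurrently, can be sketched in a few lines of Python. This is not Dicoogle's Java implementation; a single XOR parity shard stands in for full Reed-Solomon parity so the example stays self-contained:

```python
# Sketch: split each file into K data shards plus one parity shard, and
# encode many files concurrently, one file per worker process.

from concurrent.futures import ProcessPoolExecutor

K = 4  # data shards per file (assumed parameter)

def encode_file(data: bytes):
    """Split data into K equal shards and append one XOR parity shard."""
    size = -(-len(data) // K)   # ceiling division
    shards = [data[i * size:(i + 1) * size].ljust(size, b"\0")
              for i in range(K)]
    parity = bytearray(size)
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return shards + [bytes(parity)]

if __name__ == "__main__":
    files = [bytes([i]) * 1_000_000 for i in range(8)]  # stand-in images
    with ProcessPoolExecutor() as pool:
        encoded = list(pool.map(encode_file, files))
    print(len(encoded), "files encoded,", len(encoded[0]), "shards each")
```

A real deployment would replace the XOR step with k-of-n Reed-Solomon coding; the concurrency structure, which is what the paper's speedups come from, stays the same.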
What broke where for distributed and parallel applications — a whodunit story
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed and parallel systems is a difficult task. These systems are composed of various complex software and hardware components. When the system experiences a performance or correctness problem, developers struggle to understand its root cause and fix it in a timely manner. In my thesis, I address these aspects of performance problems in computer systems.

First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers. We developed techniques to localize a performance problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations running on supercomputers, can create up to millions of parallel tasks that run on different machines and communicate using the message-passing paradigm. We developed a highly scalable and accurate automated debugging tool called PRODOMETER, which first creates a logical progress dependency graph of the tasks to highlight how the problem spread through the system and manifested as a system-wide performance issue, then uses this graph to identify the task where the problem originated, and finally pinpoints the code region corresponding to the origin of the bug.

Second, we developed a tool-chain that detects performance anomalies using machine-learning techniques while achieving a very low false-positive rate. Our input-aware performance anomaly detection system consists of a scalable data collection framework that gathers performance-related metrics from code regions at different granularities, an offline model creation and prediction-error characterization technique, and a threshold-based anomaly-detection engine for production runs. Our system requires few training runs and can handle unknown inputs and parameter combinations by dynamically calibrating the anomaly detection threshold according to the characteristics of the input data and of the models' prediction error.

Third, we developed a performance problem mitigation scheme for erasure-coded distributed storage systems. Repairing failed blocks in an erasure-coded distributed storage system takes a very long time in network-constrained data centers, because the repair operation gathers a large amount of data from multiple nodes into a single node, which then performs a mathematical operation to reconstruct the missing part. This process severely congests the links toward the destination node that will host the newly recreated data. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR), that performs this reconstruction in parallel on multiple nodes and eliminates the network bottleneck, greatly speeding up the repair process (a rough sketch of the idea follows below).

Fourth, we study how, for a class of applications, performance can be improved (or performance problems mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, which we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation in later phases has very little impact on the final quality of the result. Based on this observation, we developed an optimization framework that, for a given quality-loss budget, finds the best approximation settings for each phase of the execution.
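As a rough illustration of the PPR idea from the third part, the following Python sketch combines helper contributions pairwise in a reduction tree instead of funneling all of them into the destination node. XOR stands in for the actual finite-field decoding arithmetic, and the real system's network scheduling is omitted:

```python
# Sketch of Partial-Parallel-Repair (PPR): rather than d helpers each
# sending a full block to one new node (serializing d transfers on its
# link), helpers combine partial results pairwise in rounds, so the
# repair completes in about log2(d) rounds.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def ppr_repair(helper_blocks):
    """Pairwise-combine helper contributions in a reduction tree."""
    level, rounds = list(helper_blocks), 0
    while len(level) > 1:
        nxt = [xor(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd node out waits a round
        level, rounds = nxt, rounds + 1
    return level[0], rounds

blocks = [bytes([i]) * 16 for i in range(6)]   # 6 helpers' partial results
repaired, rounds = ppr_repair(blocks)
print(rounds)  # 3 rounds instead of 6 transfers into a single link
```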
High-Performance Asynchronous Byzantine Fault Tolerance Consensus Protocol
In response to new and innovative blockchain-based systems for the Internet of Things (IoT), there is a need for consensus mechanisms that can provide high transaction throughput and security despite varying network quality. HoneyBadger was the first practical asynchronous Byzantine Fault Tolerance (BFT) consensus protocol, achieving high scalability and robustness without making any timing assumptions about the network. To improve on current asynchronous consensus protocols, we designed the Asynchronous Byzantine Fault Tolerance (ABFT) consensus protocol by integrating threshold Elliptic Curve Digital Signature Algorithm (ECDSA) signatures and optimizing the erasure coding parameters, along with additional implementation-level optimizations. We implement a prototype of ABFT and evaluate its performance at scale in a global WAN network and in a network affected by asymmetric network degradation. Our results show that ABFT provides considerably higher performance, significantly lower computational overhead, and greater scalability than its predecessors, reaching up to 38,700 transactions per second in throughput. Furthermore, we empirically show that ABFT is unaffected by asymmetric network degradation within the fault threshold.
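To give a feel for why erasure coding parameters matter in such protocols, here is a rough cost sketch in Python (assumed numbers and a simplified accounting, not ABFT's exact design): in an N = 3f + 1 system, erasure-coding a broadcast value into N shards with reconstruction threshold k = f + 1 cuts each node's relay traffic from N full copies of the value to N shards of size v/k:

```python
# Rough per-node communication cost for reliable broadcast of a value
# of size v among N = 3f + 1 nodes (illustrative only).

def naive_cost(N, v):
    return N * v                 # every node echoes the full value

def coded_cost(N, f, v):
    k = f + 1                    # shards needed to reconstruct (assumed)
    return N * (v / k)           # every node echoes one shard to all

f, v = 33, 1_000_000             # 100 nodes, 1 MB value (assumed numbers)
N = 3 * f + 1
print(f"naive : {naive_cost(N, v) / 1e6:.0f} MB per node")
print(f"coded : {coded_cost(N, f, v) / 1e6:.1f} MB per node")
```

With these numbers the coded variant relays about 2.9 MB per node instead of 100 MB, which is why tuning k against the fault threshold is a meaningful optimization lever.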