
    PPS-ADS: A Framework for Privacy-Preserved and Secured Distributed System Architecture for Handling Big Data

    The exponential expansion of Big Data along its 7 V’s (volume, velocity, variety, veracity, value, variability and visualization) brings forth new challenges to the security, reliability, availability and privacy of these data sets. Traditional security techniques and algorithms fail to cope with data at this scale. This paper aims to improve the recently proposed Atrain Distributed System (ADS) by incorporating new features that cater to the end-to-end availability and security of big data in the distributed system. The paper also integrates the concept of Software Defined Networking (SDN) into the ADS to effectively control and manage the routing of data items. Data items are stored in the ADS on the basis of the type of data (structured or unstructured), the capacity of the distributed system (or coach) and the distance of the coach from the pilot computer (PC). In order to maintain data consistency and to eradicate possible data loss, the concepts of “forward positive” and “backward positive” acknowledgment are proposed. Furthermore, we incorporate the “Twofish” cryptographic technique to encrypt the big data in the ADS. Issues like “data ownership”, “data security”, “data privacy” and “data reliability” are pivotal while handling big data. The current paper presents a framework for a privacy-preserved architecture for handling big data in an effective manner.
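
    As an illustration of the acknowledgment idea described above, the following Python sketch shows one way a pilot computer could refuse to consider a data item safely placed until every coach along the chain has returned a positive acknowledgment. The `Pilot`, `Coach`, `store` and `forward` names are hypothetical; the paper's actual ADS protocol is not reproduced here.

```python
# Minimal sketch (not the paper's implementation) of how "forward positive" /
# "backward positive" acknowledgments could prevent silent data loss when a
# pilot computer (PC) hands a data item down a chain of coaches.
# Names such as Coach, Pilot, store() and forward() are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Coach:
    name: str
    storage: dict = field(default_factory=dict)

    def store(self, item_id: str, payload: bytes) -> bool:
        self.storage[item_id] = payload
        return True                      # "backward positive": confirm receipt to the sender

    def forward(self, item_id: str, next_coach: "Coach") -> bool:
        ok = next_coach.store(item_id, self.storage[item_id])
        return ok                        # "forward positive": confirm the next hop persisted it

class Pilot:
    def place(self, item_id: str, payload: bytes, chain: list) -> bool:
        # Keep the item until every hop in the chain has acknowledged it.
        if not chain[0].store(item_id, payload):
            return False
        for sender, receiver in zip(chain, chain[1:]):
            if not sender.forward(item_id, receiver):
                return False             # missing ack -> retry or re-route instead of dropping
        return True

if __name__ == "__main__":
    pilot = Pilot()
    coaches = [Coach("coach-1"), Coach("coach-2")]
    assert pilot.place("item-42", b"big-data-block", coaches)
```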

    DualTable: A Hybrid Storage Model for Update Optimization in Hive

    Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems, as in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations like updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently, and Hive on HBase suffers from poor query performance even though it can support faster data manipulation. There is a project based on Hive issue Hive-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID-compliant extension adopts the same data storage format on HDFS, the update performance problem is not solved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS and the random write capability of HBase. Hive on DualTable provides better data manipulation support and preserves query performance at the same time. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations. Comment: accepted by the industry session of ICDE201
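
    The following Python sketch illustrates the general idea of such a hybrid storage model under simplifying assumptions: a scan-friendly base store stands in for the HDFS-resident data, a random-write delta store stands in for HBase, and reads merge the two. It is a conceptual toy, not the DualTable implementation; the `HybridTable` class and its methods are invented for illustration.

```python
# Conceptual sketch (assumptions, not the DualTable code) of a hybrid storage
# model: scans stream from a read-optimized base store (standing in for HDFS
# files) while updates and deletes go to a random-write delta store (standing
# in for HBase), and queries merge the two.

class HybridTable:
    def __init__(self, base_rows):
        self.base = dict(base_rows)   # immutable bulk data, cheap to scan
        self.delta = {}               # row_key -> new row, or None for a delete

    def update(self, key, row):
        self.delta[key] = row         # random write, no base-file rewrite

    def delete(self, key):
        self.delta[key] = None        # tombstone

    def scan(self):
        # Stream the base store and patch each row with the latest delta.
        for key, row in self.base.items():
            if key in self.delta:
                patched = self.delta[key]
                if patched is not None:
                    yield key, patched
            else:
                yield key, row
        # Rows that exist only in the delta store (new inserts).
        for key, row in self.delta.items():
            if key not in self.base and row is not None:
                yield key, row

t = HybridTable({1: "meter-a", 2: "meter-b"})
t.update(2, "meter-b-rev2")
t.delete(1)
t.update(3, "meter-c")
print(sorted(t.scan()))   # [(2, 'meter-b-rev2'), (3, 'meter-c')]
```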

    Trends in the Solution of Distributed Data Placement Problem

    Data placement for optimal performance is an old problem; for example, early work dealt with the placement of relational data in distributed databases so as to achieve optimal query processing time. Heterogeneous distributed systems built from commodity processors evolved in response to requirements for storage and processing capacity of enormous scale. Reliability and availability are accomplished by an appropriate level of data replication, and efficiency is achieved by suitable placement and processing techniques. Where to place which data, how many copies to keep, and how to propagate updates so as to maximize reliability, availability and performance are the issues addressed. In addition to processing costs, the network parameters of bandwidth limitation, speed and reliability have to be considered. This paper surveys the state of the art of the published literature on these topics. We are confident that the placement problem will continue to be a research problem in the future, with the parameters changing; such situations will arise, for example, with the advance of mobile smart phones, both in terms of their capabilities and their applications.
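
    To make the placement trade-off concrete, the toy Python example below (our own simplification, not an algorithm from the surveyed papers) picks k replica sites for one data item by brute force, minimizing the read-frequency-weighted network cost; the cost matrix and access frequencies are hypothetical.

```python
# Toy illustration of the data placement problem: choose k replica sites for a
# data item so that the read-frequency-weighted network cost is minimized.
# The numbers below are made up for illustration.

from itertools import combinations

# cost[i][j]: network cost for site i to read from site j (hypothetical)
cost = [
    [0, 2, 5, 9],
    [2, 0, 4, 7],
    [5, 4, 0, 3],
    [9, 7, 3, 0],
]
read_freq = [10, 3, 6, 1]    # how often each site reads the item
k = 2                        # replication factor

def total_cost(replica_sites):
    # Each site reads from its cheapest replica.
    return sum(f * min(cost[i][r] for r in replica_sites)
               for i, f in enumerate(read_freq))

best = min(combinations(range(len(cost)), k), key=total_cost)
print(best, total_cost(best))
```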

    Erasure Coding Optimization for Data Storage: Acceleration Techniques and Delayed Parities Generation

    Various techniques have been proposed in the literature to improve erasure code computation efficiency, including optimizing bitmatrix design and computation schedules, common XOR operation reduction, cache management techniques, and vectorization techniques. These techniques were largely proposed individually, and in this work we seek to use them jointly. To accomplish this task, these techniques need to be thoroughly evaluated individually and their relations better understood. Building on extensive testing, we develop methods to systematically optimize the computation chain together with the underlying bitmatrix. This leads to a simple design approach of optimizing the bitmatrix by minimizing a weighted computation cost function, and a straightforward coding procedure: follow a computation schedule produced from the optimized bitmatrix and apply XOR-level vectorization. This procedure provides better performance than most existing techniques (e.g., those used in the ISA-L and Jerasure libraries), and can sometimes even compete against well-known but less general codes such as EVENODD, RDP, and STAR codes. One particularly important observation is that vectorizing the XOR operations is a better choice than directly vectorizing finite field operations, not only because of the flexibility in choosing the finite field size and the better encoding throughput, but also because of the minimal effort needed to migrate onto newer CPUs.

    A delayed parity generation technique for maximum distance separable (MDS) storage codes is proposed as well, for two possible applications: the first is to improve write speed during data intake, where only a subset of the parities are initially produced and stored in the system, and the rest can be produced from the stored data at a later time of lower system load; the second is to provide better adaptivity, where a smaller number of parities can be chosen initially in a storage system, and more parities can be produced when the existing ones are not sufficient to guarantee the needed reliability or performance. In both applications, it is important to reduce data access as much as possible during the delayed parity generation procedure. For this purpose, we first identify the fundamental limit of delayed parity generation through a connection to the well-known multicast network coding problem, and then provide an explicit, low-complexity code transformation that is applicable to any MDS code to obtain optimal codes. The problem we consider is closely related to the regenerating code problem; however, the proposed codes are much simpler and have a much smaller subpacketization factor than regenerating codes, and thus our result in fact shows that blindly adopting regenerating codes in these two settings is unnecessary and wasteful. Moreover, two aspects of this approach are addressed. The first is to optimize the underlying coding matrix, and the second is to understand its behavior in a system setting. For the former, we generalize the existing approach by allowing more flexibility in the code design, and then optimize the underlying coding matrix in the familiar bitmatrix-based coding framework. For the latter, we construct a prototype system and conduct tests on a local storage network and on two virtual machine-based setups. In both cases, the results confirm the benefit of delayed parity generation when the system bottleneck is the communication bandwidth rather than the computation.
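
    The sketch below illustrates, with an assumed toy bitmatrix and NumPy, what XOR-level vectorized parity generation and delayed parity generation look like in principle: each parity row says which data packets to XOR, whole packets are XORed at once, and only a subset of the parities is produced at intake while the rest are generated later from the stored data. The bitmatrix, packet size and code parameters are illustrative and are not taken from the paper.

```python
# Minimal sketch (assumed bitmatrix and parameters, not the paper's optimized
# schedule) of XOR-level vectorized parity generation: each parity row of a
# binary coding matrix tells us which data packets to XOR together, and NumPy
# applies the XOR across whole packets at once.

import numpy as np

k, packet_bytes = 4, 1 << 20                      # 4 data packets of 1 MiB each
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(k, packet_bytes), dtype=np.uint8)

# Hypothetical binary bitmatrix: one row per parity, one column per data packet.
bitmatrix = np.array([[1, 1, 1, 1],                     # P0 = D0 ^ D1 ^ D2 ^ D3
                      [1, 0, 1, 0]], dtype=np.uint8)    # P1 = D0 ^ D2

def make_parities(rows):
    out = []
    for row in rows:
        acc = np.zeros(packet_bytes, dtype=np.uint8)
        for j, bit in enumerate(row):
            if bit:
                np.bitwise_xor(acc, data[j], out=acc)   # wide XOR, vector-unit friendly
        out.append(acc)
    return out

# "Delayed parity generation": store only the first parity at data intake,
# then produce the remaining ones later from the already-stored data.
initial = make_parities(bitmatrix[:1])
later = make_parities(bitmatrix[1:])
assert np.array_equal(later[0], data[0] ^ data[2])
```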

    What broke where for distributed and parallel applications — a whodunit story

    Detection, diagnosis and mitigation of performance problems in today's large-scale distributed and parallel systems is a difficult task. These systems are composed of various complex software and hardware components, and when a performance or correctness problem occurs, developers struggle to understand its root cause and fix it in a timely manner. In my thesis, I address these three components of performance problems in computer systems.

    First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers, and we developed techniques to localize the performance problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations running on supercomputers, can create up to millions of parallel tasks that run on different machines and communicate using the message passing paradigm. We developed a highly scalable and accurate automated debugging tool called PRODOMETER, which uses sophisticated algorithms to first create a logical progress dependency graph of the tasks to highlight how the problem spread through the system and manifested as a system-wide performance issue, then uses this graph to identify the task where the problem originated, and finally pinpoints the code region corresponding to the origin of the bug.

    Second, we developed a tool-chain that can detect performance anomalies using machine-learning techniques while achieving a very low false positive rate. Our input-aware performance anomaly detection system consists of a scalable data collection framework that collects performance-related metrics from code regions at different granularities, an offline model creation and prediction-error characterization technique, and a threshold-based anomaly detection engine for production runs. Our system requires few training runs and can handle unknown inputs and parameter combinations by dynamically calibrating the anomaly detection threshold according to the characteristics of the input data and of the models' prediction error.

    Third, we developed a performance problem mitigation scheme for erasure-coded distributed storage systems. Repair operations for failed blocks in such systems take a very long time in network-constrained data centers, because during a repair a large amount of data from multiple nodes is gathered at a single node, where a mathematical operation reconstructs the missing part; this process severely congests the links toward the destination that will host the newly recreated data. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR), that performs this reconstruction in parallel on multiple nodes and eliminates network bottlenecks, and as a result greatly speeds up the repair process.

    Fourth, we study how, for a class of applications, performance can be improved (or performance problems can be mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, which we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation of the later phases has very small impact on the final quality of the result. Based on this observation, we developed an optimization framework that, for a given budget of quality loss, finds the best approximation settings for each phase in the execution.
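
    As a rough illustration of the threshold-calibration idea in the second contribution, the Python sketch below fits a toy runtime-versus-input-size model, derives an anomaly threshold from the training prediction error, and flags production runs whose observed runtime deviates beyond it. The linear model, the numbers and the three-sigma rule are assumptions made for illustration, not the thesis tool-chain.

```python
# Hedged sketch (our own toy model, not the thesis tool-chain) of threshold-based
# anomaly detection calibrated from prediction error: fit runtime vs. input size
# on a few training runs, derive an error threshold from the residuals, and flag
# production runs whose observed runtime deviates more than that threshold.

import numpy as np

# Hypothetical training runs: (input_size, measured_runtime_seconds)
train_sizes = np.array([1e6, 2e6, 4e6, 8e6])
train_runtime = np.array([1.1, 2.0, 4.2, 8.1])

# Simple linear model: runtime ~ a * size + b.
a, b = np.polyfit(train_sizes, train_runtime, deg=1)
residual = np.abs(train_runtime - (a * train_sizes + b)) / train_runtime

# Calibrate the threshold from the training prediction error (mean + 3 sigma).
threshold = residual.mean() + 3 * residual.std()

def is_anomalous(input_size, observed_runtime):
    predicted = a * input_size + b
    rel_error = abs(observed_runtime - predicted) / observed_runtime
    return rel_error > threshold

print(is_anomalous(3e6, 3.1))    # close to the model -> likely normal (False)
print(is_anomalous(3e6, 9.0))    # far off the model  -> flagged as anomaly (True)
```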