1,526 research outputs found

    Predicting Intermediate Storage Performance for Workflow Applications

    Configuring a storage system to better serve an application is a challenging task, complicated by a multidimensional, discrete configuration space and the high cost of exploring that space (e.g., by running the application under different storage configurations). To enable selecting the best configuration in reasonable time, we design an end-to-end performance prediction mechanism that estimates the turn-around time of an application using a storage system under a given configuration. This approach focuses on a generic object-based storage system design; supports exploring the impact of optimizations targeting workflow applications (e.g., various data placement schemes) in addition to other, more traditional, configuration knobs (e.g., stripe size or replication level); and models the system operation at the data-chunk and control-message level. This paper presents our experience to date with designing and using this prediction mechanism. We evaluate it using micro-benchmarks, synthetic benchmarks mimicking real workflow applications, and a real application. A preliminary evaluation shows that we are on a good track to meet our objectives: the mechanism can scale to model a workflow application run on an entire cluster while offering an over 200x speedup factor (normalized by resource) compared to running the actual application, and can achieve, in the limited number of scenarios we study, a prediction accuracy that enables identifying the best storage system configuration.
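The selection step the abstract describes, scoring every point of the discrete configuration space with a cheap predictor instead of real runs, can be sketched as follows. The cost function here is a toy stand-in, not the paper's model; the configuration knobs (stripe size, replication level) are the ones the abstract names.

```python
from itertools import product

# Hypothetical predictor: the paper's mechanism estimates application
# turn-around time for a given storage configuration. This toy cost
# function is illustrative only, not the paper's model.
def predict_turnaround(config):
    stripe_size, replication = config
    # Toy model: larger stripes amortize per-chunk overhead, while
    # extra replicas add write cost (illustrative numbers).
    return 100.0 / stripe_size + 5.0 * replication

def best_configuration(stripe_sizes, replication_levels):
    """Exhaustively score every point of the discrete configuration
    space with the cheap predictor instead of running the application."""
    space = product(stripe_sizes, replication_levels)
    return min(space, key=predict_turnaround)

print(best_configuration([1, 4, 16], [1, 2, 3]))  # → (16, 1)
```

Exhaustive enumeration is feasible precisely because each prediction is orders of magnitude cheaper than an actual run; with the 200x per-resource speedup the abstract reports, scanning a modest knob grid costs less than a single real execution.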

    ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์‹œ์Šคํ…œ์—์„œ์˜ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์…”ํ”Œ์— ๋Œ€ํ•œ ๋™์  ์ตœ์ ํ™”

    Master's thesis, Graduate School of Seoul National University, College of Engineering, Department of Computer Science and Engineering, February 2019. Advisor: ์ „๋ณ‘๊ณค. The scale of data used for data analytics is growing rapidly, and the ability to process large volumes of data is critical to data processing systems. A scaling bottleneck when processing large amounts of data in these systems is the random disk read overhead that occurs during shuffle communication between tasks. To reduce this overhead, an external shuffle process can batch disk reads by aggregating the intermediate data through an additional computation. However, this additional computation cannot take advantage of the distributed execution capabilities provided by data processing systems, such as scheduling, parallelization, or fault recovery. In addition, the systems cannot dynamically optimize the external shuffle process the way they optimize plain jobs that run without an external process. Instead of launching an external shuffle process, we propose to insert the disk-read batching into the job itself. Because the computation for intermediate data aggregation is then fully revealed to the system, the tasks can exploit all of its features, including dynamic optimization. Moreover, we propose a dynamic data-skew handling mechanism that can be applied together with the disk-read batching optimization. Evaluations show that our implemented technique mitigates random disk read overhead and data skew, and can reduce job completion time by up to 54%.
    Table of Contents:
    Chapter 1 Introduction
    Chapter 2 Background
      2.1 Distributed Data Processing Concepts
      2.2 Random Disk Read Overhead in the Data Shuffle
      2.3 Existing Solutions
      2.4 Skew Handling with Disk Read Batching
    Chapter 3 Disk Read Batching as a Task
      3.1 Intermediate Data Aggregation Stage
      3.2 Composing with Skew Handling Optimization
    Chapter 4 Implementation
      4.1 Optimization Pass for Disk Read Batching
      4.2 Optimization Pass for Skew Handling
    Chapter 5 Evaluation
      5.1 Cluster Setup
      5.2 Disk Read Batching Optimization
      5.3 Skew Handling Optimization with Disk Batching
    Chapter 6 Conclusion
    Bibliography
    Korean Abstract (๊ตญ๋ฌธ์ดˆ๋ก)
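The aggregation stage the thesis inserts into the job can be sketched as an ordinary task that merges the small per-mapper shuffle blocks into one consolidated block per reducer, turning many random reads into one sequential read. The function names and the toy partitioner below are illustrative assumptions, not the actual implementation.

```python
from collections import defaultdict

def partition(key, num_reducers):
    # Real systems use a hash partitioner; ord() keeps this sketch
    # deterministic for illustration.
    return ord(key[0]) % num_reducers

def aggregate_intermediate(mapper_outputs, num_reducers):
    """Merge per-mapper shuffle blocks into one consolidated block per
    reducer, so each reducer issues one large sequential read instead
    of many small random reads scattered across mapper outputs."""
    consolidated = [defaultdict(list) for _ in range(num_reducers)]
    for blocks in mapper_outputs:            # one dict per map task
        for key, values in blocks.items():
            consolidated[partition(key, num_reducers)][key].extend(values)
    return consolidated
```

Expressed as a regular task in the job graph, this computation is visible to the scheduler, so it is parallelized, fault-recovered, and dynamically re-optimized like any other stage, which is the thesis's central point.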

    Effective scheduling algorithm for on-demand XML data broadcasts in wireless environments

    The organization of data on wireless channels, which aims to reduce the access time of mobile clients, is a key problem in data broadcasts. Many scheduling algorithms have been designed to organize flat data on air. However, how to effectively schedule semi-structured information such as XML data on wireless channels is still a challenge. In this paper, we first propose a novel method that greatly reduces the tuning time by splitting query results into XML snippets and achieves better access efficiency by combining similar ones. We then analyze the data broadcast scheduling problem of on-demand XML data broadcasts and define the efficiency of a data item. Based on this definition, we devise a Least Efficient Last (LEL) scheduling algorithm to effectively organize XML data on wireless channels. Finally, we study the performance of our algorithms through extensive experiments. The results show that our scheduling algorithms reduce both access time and tuning time significantly when compared with existing work.
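A minimal sketch of the Least Efficient Last idea, assuming a simple efficiency metric (pending requests per unit of broadcast size): the most efficient items go on air first and the least efficient last. The paper defines its own efficiency measure for XML snippets, so treat this metric as a placeholder.

```python
def lel_schedule(items):
    """Least Efficient Last: order items on the broadcast channel so
    the least efficient one is sent last. Each item is a tuple of
    (name, size, pending_requests); the efficiency metric below is an
    assumption for illustration, not the paper's definition."""
    def efficiency(item):
        name, size, pending_requests = item
        return pending_requests / size   # popularity per unit size
    return sorted(items, key=efficiency, reverse=True)

# Small, highly requested snippets are broadcast first.
schedule = lel_schedule([("x", 10, 1), ("y", 2, 4), ("z", 5, 5)])
```

Under this placeholder metric, item "y" (efficiency 2.0) airs first and "x" (0.1) last, so the many clients waiting on popular small items are served early, lowering average access time.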

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
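The programming model the survey builds on is easiest to see in the canonical word-count example: the user writes only a map and a reduce function, and the framework handles distribution, shuffling, and fault tolerance. This single-process simulation illustrates just the model, not a distributed runtime.

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    # Map phase: emit (key, value) pairs for one input record.
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    # Reduce phase: fold all values that share a key.
    return word, sum(counts)

def mapreduce(documents):
    # The "shuffle": group all emitted values by key before reducing.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map(map_fn, documents)):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["big data", "big clusters"]))
# → {'big': 2, 'data': 1, 'clusters': 1}
```

Everything outside `map_fn` and `reduce_fn` is what a real framework supplies; the limitations the survey catalogs (iteration, joins, declarative querying) are all about what this narrow two-function contract leaves out.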

    Scientific Computing Meets Big Data Technology: An Astronomy Use Case

    Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity, and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves performance competitive with the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging big data platforms such as Apache Spark are a performant alternative for many-task scientific applications.
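The many-task pattern described above, independent single-process programs applied to many inputs, is essentially a parallel map. This stdlib sketch shows the pattern with a stand-in for the per-image tool; in Kira the driver is Apache Spark (an RDD `map` over image files) rather than a local thread pool, which adds the data locality and scheduling the paper evaluates.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_sources(image):
    # Stand-in for invoking a per-image tool such as Source Extractor;
    # the returned fields are illustrative, not Kira SE's output.
    return {"image": image, "sources": len(image)}

def run_many_task(images, workers=4):
    """Apply an independent single-process step to every input in
    parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_sources, images))
```

Because every task is independent, the driver's only jobs are placement and fault recovery, which is why a big data scheduler can substitute for the HPC stack here.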

    HIL: designing an exokernel for the data center

    We propose a new exokernel-like layer to allow mutually untrusting, physically deployed services to efficiently share the resources of a data center. We believe that such a layer offers not only efficiency gains but may also enable new economic models, new applications, and new security-sensitive uses. A prototype (currently in active use) demonstrates that the proposed layer is viable and can support a variety of existing provisioning tools and use cases. Partial support for this work was provided by the MassTech Collaborative Research Matching Grant Program, National Science Foundation awards 1347525 and 1149232, as well as the several commercial partners of the Massachusetts Open Cloud, who may be found at http://www.massopencloud.or

    Galley: A New Parallel File System for Parallel Applications

    Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Most multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access those multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated application and library programmers to use knowledge about their I/O to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. In this work we examine current multiprocessor file systems, as well as how those file systems are used by scientific applications. Contrary to the expectations of the designers of current parallel file systems, the workloads on those systems are dominated by requests to read and write small pieces of data. Furthermore, rather than being accessed sequentially and contiguously, as in uniprocessor and supercomputer workloads, files in multiprocessor file systems are accessed in regular, structured, but non-contiguous patterns. Based on our observations of multiprocessor workloads, we have designed Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. In this work, we introduce Galley and discuss its design and implementation. We describe Galley's new three-dimensional file structure and discuss how that structure can be used by parallel applications to achieve higher performance. We introduce several new data-access interfaces, which allow applications to explicitly describe the regular access patterns we found to be common in parallel file system workloads.
We show how these new interfaces allow parallel applications to achieve tremendous increases in I/O performance. Finally, we discuss how Galley's new file structure and data-access interfaces can be useful in practice.
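One way to picture the structured, non-contiguous access patterns Galley targets is a strided-read call that fetches many fixed-size records in a single request, rather than issuing one tiny read per record. The signature below is illustrative, not Galley's actual interface.

```python
def read_strided(data, offset, record_size, stride, count):
    """Sketch of a strided-read interface in the spirit of Galley's
    data-access calls: fetch `count` fixed-size records separated by a
    regular stride in one request. With the pattern declared up front,
    a file system can plan one efficient transfer instead of `count`
    small random reads. Names and signature are hypothetical."""
    return [data[offset + i * stride : offset + i * stride + record_size]
            for i in range(count)]

# Every fourth 2-byte record of a flat buffer, in one call.
records = read_strided("abcdefghij", 0, 2, 4, 2)
```

Declaring the whole pattern in one call is exactly the information a Unix-like byte-stream interface hides, which is the gap the abstract says Galley's new interfaces close.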