1,526 research outputs found
Predicting Intermediate Storage Performance for Workflow Applications
Configuring a storage system to better serve an application is a challenging
task complicated by a multidimensional, discrete configuration space and the
high cost of exploring that space (e.g., by running the application with different
storage configurations). To enable selecting the best configuration in a
reasonable time, we design an end-to-end performance prediction mechanism that
estimates the turn-around time of an application using the storage system under a
given configuration. This approach focuses on a generic object-based storage
system design, supports exploring the impact of optimizations targeting
workflow applications (e.g., various data placement schemes) in addition to
other, more traditional, configuration knobs (e.g., stripe size or replication
level), and models the system operation at the data-chunk and control-message
level.
This paper presents our experience to date with designing and using this
prediction mechanism. We evaluate the mechanism using micro-benchmarks,
synthetic benchmarks that mimic real workflow applications, and a real
application. A preliminary evaluation shows that we are on track to
meet our objectives: the mechanism can scale to model a workflow application
run on an entire cluster while offering an over 200x speedup factor (normalized
by resource) compared to running the actual application, and can achieve, in the
limited number of scenarios we study, a prediction accuracy that enables
identifying the best storage system configuration.
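The selection loop such a predictor enables can be sketched as a brute-force search over the discrete knob space. The knob values and the toy cost model below are illustrative stand-ins, not the paper's chunk-level model:

```python
from itertools import product

# Hypothetical configuration knobs; the paper's real predictor models the
# system at the data-chunk and control-message level.
STRIPE_SIZES = [256, 1024, 4096]      # KB
REPLICATION_LEVELS = [1, 2, 3]
PLACEMENT_SCHEMES = ["local", "striped"]

def predict_turnaround(stripe_kb, replication, placement):
    """Toy stand-in cost model: favors larger stripes, penalizes replication."""
    base = 100.0 / stripe_kb ** 0.5
    repl_penalty = 1.0 + 0.2 * (replication - 1)
    placement_factor = 0.8 if placement == "local" else 1.0
    return base * repl_penalty * placement_factor

def best_configuration():
    """Enumerate the discrete space and keep the lowest predicted time."""
    space = product(STRIPE_SIZES, REPLICATION_LEVELS, PLACEMENT_SCHEMES)
    return min(space, key=lambda cfg: predict_turnaround(*cfg))
```

Because each prediction is far cheaper than an actual run (the paper reports an over 200x per-resource speedup), exhaustively scoring every configuration this way is feasible even for a multidimensional space.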
Dynamic Optimization for Large-Scale Data Shuffle in Data Processing Systems
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2019. Advisor: Byung-Gon Chun.
The scale of data used for data analytics is growing rapidly, and the ability to process large volumes of data is critical to data processing systems. A scaling bottleneck when processing large amounts of data in these systems is the random-disk-read overhead incurred during shuffle data communication between tasks. To reduce this overhead, an external shuffle process can batch the disk reads by aggregating the intermediate data through an additional computation. However, this additional computation cannot take advantage of the distributed execution capabilities provided by data processing systems, such as scheduling, parallelization, or fault recovery. In addition, the systems cannot dynamically optimize the external shuffle process the way they optimize plain jobs that run without an external process. Instead of launching an external shuffle process, we propose to insert the disk-read batching into the job itself. The tasks can then fully exploit the systems' features, including dynamic optimization, because the computation that aggregates intermediate data is fully revealed to the system. Moreover, we propose a dynamic data-skew handling mechanism that can be applied together with the disk-read batching optimization. Evaluations show that our implemented technique mitigates random-disk-read overhead and data skew and reduces job completion time by up to 54%.
Chapter 1 Introduction 2
Chapter 2 Background 4
2.1 Distributed Data Processing Concepts . . . . . . . . . . . . . . . . . . 4
2.2 Random Disk Read Overhead in the Data Shuffle . . . . . . . . . . . . 5
2.3 Existing Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Skew Handling with Disk Read Batching . . . . . . . . . . . . . . . . 8
Chapter 3 Disk Read Batching as a Task 10
3.1 Intermediate Data Aggregation Stage . . . . . . . . . . . . . . . . . . . 10
3.2 Composing with Skew Handling Optimization . . . . . . . . . . . . . . 12
Chapter 4 Implementation 15
4.1 Optimization Pass for Disk Read Batching . . . . . . . . . . . . . . . . 15
4.2 Optimization Pass for Skew Handling . . . . . . . . . . . . . . . . . . 17
Chapter 5 Evaluation 21
5.1 Cluster Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Disk Read Batching Optimization . . . . . . . . . . . . . . . . . . . . 22
5.3 Skew Handling Optimization with Disk Batching . . . . . . . . . . . . 24
Chapter 6 Conclusion 28
Bibliography 29
Abstract in Korean 31
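The batching-as-a-task idea above can be sketched in a few lines: instead of an external process, an aggregation stage merges the per-partition fragments left by each mapper, so every downstream task issues one contiguous read. The names and the in-memory representation below are illustrative, not the thesis's implementation inside a real data processing system:

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for shuffle output: each mapper leaves
# many small fragments, one per reducer partition.
def map_outputs(num_mappers, num_partitions):
    return [{p: f"m{m}p{p}".encode() for p in range(num_partitions)}
            for m in range(num_mappers)]

def aggregate_stage(fragments):
    """The batching task: merge per-partition fragments so each reducer
    performs one contiguous read instead of one small read per mapper."""
    merged = defaultdict(bytearray)
    for mapper_out in fragments:
        for partition, chunk in mapper_out.items():
            merged[partition] += chunk
    return dict(merged)
```

Because this merge runs as an ordinary stage of the job rather than as an external process, the system can schedule, parallelize, recover, and dynamically re-optimize it like any other task, which is the thesis's central point.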
Effective scheduling algorithm for on-demand XML data broadcasts in wireless environments
The organization of data on wireless channels, which aims to reduce the access time of mobile clients, is a key problem in data broadcasts. Many scheduling algorithms have been designed to organize flat data on air. However, how to effectively schedule semi-structured information such as XML data on wireless channels is still a challenge. In this paper, we first propose a novel method to greatly reduce the tuning time by splitting query results into XML snippets and to achieve better access efficiency by combining similar ones. Then we analyze the data broadcast scheduling problem of on-demand XML data broadcasts and
define the efficiency of a data item. Based on the definition, a Least Efficient Last (LEL) scheduling algorithm is also devised to effectively organize XML
data on wireless channels. Finally, we study the performance of our algorithms through extensive experiments. The results show that our scheduling algorithms can reduce both access time and tuning time
significantly compared with existing work.
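The core of Least Efficient Last scheduling can be sketched once an efficiency measure is fixed. The abstract does not reproduce the paper's exact definition, so the sketch below assumes a common one, popularity per unit broadcast length; the item tuples are illustrative:

```python
def lel_schedule(items):
    """Least Efficient Last: order the broadcast so that the most efficient
    items go on air first and the least efficient item is scheduled last.

    `items` is a list of (name, popularity, length) tuples. Efficiency is
    assumed here to be popularity per unit broadcast length; the paper
    defines its own efficiency measure for XML data items.
    """
    def efficiency(item):
        _, popularity, length = item
        return popularity / length

    return sorted(items, key=efficiency, reverse=True)
```

Placing hot, compact items early in the cycle is what drives down the average access time for clients listening on the channel.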
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling, and fault tolerance. However, the original implementation of the
MapReduce framework had limitations that have been tackled by many follow-up
research efforts since its introduction. This article provides a comprehensive
survey of the family of approaches and mechanisms for large-scale data
processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both the research and industrial communities. We also cover a set
of systems that provide declarative programming interfaces on top of the
MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
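The programming model the survey builds on can be illustrated with a minimal, single-machine sketch of the map, shuffle, and reduce phases. The function names are ours; real frameworks distribute each phase across a cluster and handle data distribution, scheduling, and fault tolerance transparently:

```python
from collections import defaultdict

def map_phase(documents, map_fn):
    # Apply the user's map function to every input record.
    for doc in documents:
        yield from map_fn(doc)

def shuffle(pairs):
    # Group all intermediate values by key, as the framework's shuffle does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped, reduce_fn):
    # Apply the user's reduce function to each key group.
    return {key: reduce_fn(key, values) for key, values in grouped}

# Word count, the canonical MapReduce example.
def word_map(doc):
    for word in doc.split():
        yield word, 1

def word_reduce(word, counts):
    return sum(counts)

def word_count(documents):
    return reduce_phase(shuffle(map_phase(documents, word_map)), word_reduce)
```

The isolation the abstract describes is visible here: the user supplies only `word_map` and `word_reduce`, while the phases around them belong to the framework.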
Scientific Computing Meets Big Data Technology: An Astronomy Use Case
Scientific analyses commonly compose multiple single-process programs into a
dataflow. An end-to-end dataflow of single-process programs is known as a
many-task application. Typically, tools from the HPC software stack are used to
parallelize these analyses. In this work, we investigate an alternate approach
that uses Apache Spark -- a modern big data platform -- to parallelize
many-task applications. We present Kira, a flexible and distributed astronomy
image processing toolkit using Apache Spark. We then use the Kira toolkit to
implement a Source Extractor application for astronomy images, called Kira SE.
With Kira SE as the use case, we study the programming flexibility, dataflow
richness, scheduling capacity and performance of Apache Spark running on the
EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an
equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon
EC2 cloud. Furthermore, we show that by leveraging software originally designed
for big data infrastructure, Kira SE achieves competitive performance to the C
implementation running on the NERSC Edison supercomputer. Our experience with
Kira indicates that emerging Big Data platforms such as Apache Spark are a
performant alternative for many-task scientific applications.
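The many-task pattern Kira parallelizes, independent single-process programs applied to many inputs, can be sketched without Spark itself. Here a thread pool stands in for Spark's executors, and `extract_sources` is a placeholder for a tool like Source Extractor; none of these names come from Kira's API:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_sources(image_name):
    """Placeholder for one single-process analysis (e.g., source extraction
    on one astronomy image); returns a toy per-image result."""
    return image_name, len(image_name)

def run_many_task(images, workers=4):
    """Embarrassingly parallel dataflow: one independent task per input,
    the structure Kira maps onto Spark transformations so the platform
    can add scheduling, data locality, and fault tolerance."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(extract_sources, images))
```

What Spark adds over this sketch, and what the paper evaluates, is cluster-scale scheduling and data locality, which is where the reported 2.5x speedup over the equivalent C program comes from.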
HIL: designing an exokernel for the data center
We propose a new exokernel-like layer that allows mutually untrusting, physically deployed services to efficiently share the resources of a data center. We believe that such a layer offers not only efficiency gains but may also enable new economic models, new applications, and new security-sensitive uses. A prototype (currently in active use) demonstrates that the proposed layer is viable and can support a variety of existing provisioning tools and use cases.
Partial support for this work was provided by the MassTech Collaborative Research Matching Grant Program, National Science Foundation awards 1347525 and 1149232, as well as the several commercial partners of the Massachusetts Open Cloud, who may be found at http://www.massopencloud.or
Galley: A New Parallel File System for Parallel Applications
Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Most multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access those multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated application and library programmers to use knowledge about their I/O to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. In this work we examine current multiprocessor file systems, as well as how those file systems are used by scientific applications. Contrary to the expectations of the designers of current parallel file systems, the workloads on those systems are dominated by requests to read and write small pieces of data. Furthermore, rather than being accessed sequentially and contiguously, as in uniprocessor and supercomputer workloads, files in multiprocessor file systems are accessed in regular, structured, but non-contiguous patterns. Based on our observations of multiprocessor workloads, we have designed Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. In this work, we introduce Galley and discuss its design and implementation. We describe Galley's new three-dimensional file structure and discuss how that structure can be used by parallel applications to achieve higher performance. We introduce several new data-access interfaces, which allow applications to explicitly describe the regular access patterns we found to be common in parallel file system workloads.
We show how these new interfaces allow parallel applications to achieve tremendous increases in I/O performance. Finally, we discuss how Galley's new file structure and data-access interfaces can be useful in practice.
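The regular, non-contiguous pattern described above can be made concrete with a toy strided-read helper. This sketch only illustrates the access pattern that such interfaces let an application describe in a single request; it is not Galley's actual API, whose interfaces are richer:

```python
import io

def read_strided(f, offset, record_size, stride, count):
    """Read `count` fixed-size records spaced `stride` bytes apart,
    starting at `offset`: a regular, structured, non-contiguous pattern.
    Describing it as one request lets a file system batch and reorder
    the underlying disk accesses instead of seeing many tiny reads."""
    records = []
    for i in range(count):
        f.seek(offset + i * stride)
        records.append(f.read(record_size))
    return records

# Usage on an in-memory file: 2-byte records every 4 bytes.
buf = io.BytesIO(b"AAxxBBxxCCxx")
records = read_strided(buf, 0, record_size=2, stride=4, count=3)
```

With a Unix-like byte-stream interface, the same access would be issued as many small independent reads, which is exactly the workload mismatch the abstract identifies.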