Affinity Scheduling and the Applications on Data Center Scheduling with Data Locality
The MapReduce framework is the de facto standard in Hadoop. When data
locality in data centers is taken into account, load balancing of map tasks
becomes a special case of the affinity scheduling problem. There is a large
body of work on affinity scheduling proposing heuristic algorithms, such as
Delay Scheduling and Quincy, that try to increase data locality in data
centers. However, little attention has been paid to theoretical guarantees
on the throughput and delay optimality of such algorithms. In this work, we
present and compare different algorithms and discuss their shortcomings and
strengths. To the best of our knowledge, most data centers use static load
balancing algorithms, which are inefficient, waste resources, and cause
unnecessary delays for users.
GB-PANDAS: Throughput and heavy-traffic optimality analysis for affinity scheduling
Dynamic affinity scheduling has been an open problem for nearly three
decades. The problem is to dynamically schedule multi-type tasks to
multi-skilled servers such that the resulting queueing system is both stable in
the capacity region (throughput optimality) and the mean delay of tasks is
minimized at high loads near the boundary of the capacity region (heavy-traffic
optimality). As for applications, data-intensive analytics like MapReduce,
Hadoop, and Dryad fit into this setting, where the set of servers is
heterogeneous for different task types, so the pair of task type and server
determines the processing rate of the task. The load balancing algorithm used
in such frameworks is an example of affinity scheduling which is desired to be
both robust and delay optimal at high loads when hot-spots occur. Fluid model
planning, the MaxWeight algorithm, and the generalized $c\mu$-rule are among
the first algorithms proposed for affinity scheduling with theoretical
guarantees of optimality in different senses, which are discussed in the
related work section. None of these algorithms is practical for data center
applications because of their unrealistic assumptions. The
join-the-shortest-queue-MaxWeight (JSQ-MaxWeight), JSQ-Priority, and
weighted-workload algorithms are examples of load balancing policies for
systems with two and three levels of data locality with a rack structure. In
this work, we propose the Generalized-Balanced-Pandas algorithm (GB-PANDAS) for
a system with multiple levels of data locality and prove its throughput
optimality. We prove this result under an arbitrary distribution for service
times, whereas most previous theoretical work assumes geometric distribution
for service times. Extensive simulation results show that the GB-PANDAS
algorithm reduces the mean delay and outperforms the JSQ-MaxWeight algorithm
by a factor of two.
Comment: IFIP WG 7.3 Performance 2017 - The 35th International Symposium on
Computer Performance, Modeling, Measurements and Evaluation 2017
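The weighted-workload routing rule referenced above, which the Balanced-Pandas
family generalizes to multiple levels of data locality, can be illustrated
with a short sketch: an incoming task goes to the server where its expected
wait, i.e., the server's outstanding work divided by the rate at which that
server can process the task, is smallest. The sketch below is a simplified
illustration under assumed locality levels, rates, and workload bookkeeping,
not the GB-PANDAS algorithm itself.

    # Minimal sketch of weighted-workload routing with data locality.
    # Assumptions: one aggregate workload counter per server, and a service
    # rate that depends only on the task's locality level at that server
    # (local > rack-local > remote); the real algorithm keeps per-level queues.

    def weighted_workload_route(task_locality, workloads, rates):
        """Pick the server minimizing workload / service rate for this task."""
        def expected_wait(server):
            return workloads[server] / rates[task_locality[server]]
        return min(workloads, key=expected_wait)

    # Hypothetical 3-server example with three locality levels.
    rates = {"local": 1.0, "rack": 0.8, "remote": 0.5}
    workloads = {"s1": 4.0, "s2": 2.0, "s3": 1.0}
    locality = {"s1": "local", "s2": "rack", "s3": "remote"}
    print(weighted_workload_route(locality, workloads, rates))  # -> "s3" here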
The Power of d Choices in Scheduling for Data Centers with Heterogeneous Servers
The MapReduce framework is the de facto standard in big data and its applications, where a
big data-set is split into small data chunks that are replicated on different
servers among thousands of servers. The heterogeneous server structure of the
system makes the scheduling much harder than scheduling for systems with
homogeneous servers. Throughput optimality of the system on one hand and delay
optimality on the other hand creates a dilemma for assigning tasks to servers.
The JSQ-MaxWeight and Balanced-Pandas algorithms are the state-of-the-art
algorithms with theoretical guarantees on throughput and delay optimality for
systems with two and three levels of data locality. However, the scheduling
complexity of these two algorithms is prohibitively high. Hence, we combine
the power-of-d-choices approach with the Balanced-Pandas and JSQ-MaxWeight
algorithms, and compare the complexity of the original algorithms with their
power-of-d-choices versions. We further show that the
Balanced-Pandas algorithm combined with the power of d choices,
Balanced-Pandas-Pod, not only performs better than plain Balanced-Pandas, but
is also less sensitive to the parameter $d$ than the combination of the
JSQ-MaxWeight algorithm and the power of d choices, JSQ-MaxWeight-Pod. In
fact, in our extensive simulations, the Balanced-Pandas-Pod algorithm performs
better than plain Balanced-Pandas at the low and medium loads at which data
centers usually operate, and performs almost the same as Balanced-Pandas at
high loads. Note that the load
balancing complexity of the Balanced-Pandas and JSQ-MaxWeight algorithms is
$O(n)$, where $n$ is the number of servers in the system and is on the order
of thousands, whereas the complexity of Balanced-Pandas-Pod and
JSQ-MaxWeight-Pod is $O(d)$, which makes the central scheduler faster and
saves energy.
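The power-of-d-choices idea above can be made concrete with a few lines of
code: instead of inspecting all n server queues on every task arrival, the
scheduler samples d servers uniformly at random and applies its routing rule
only to that sample, so the per-task cost depends on d rather than n. The
sketch below is a generic shortest-queue-among-the-sample illustration, not
the Balanced-Pandas-Pod or JSQ-MaxWeight-Pod algorithms themselves.

    import random

    # Generic power-of-d-choices sampling: inspect only d randomly chosen
    # servers per arrival instead of all n, then apply a routing rule
    # (here: shortest queue) to the sampled subset.

    def pod_route(queue_lengths, d=2, rng=random):
        """Return the index of the shortest queue among d sampled servers."""
        n = len(queue_lengths)
        sample = rng.sample(range(n), min(d, n))   # O(d) work per task
        return min(sample, key=lambda i: queue_lengths[i])

    queues = [5, 0, 7, 3, 2, 9, 1, 4]              # hypothetical queue lengths
    print("route task to server", pod_route(queues, d=3))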
Blind GB-PANDAS: A Blind Throughput-Optimal Load Balancing Algorithm for Affinity Scheduling
Dynamic affinity load balancing of multi-type tasks on multi-skilled servers,
when the service rate of each task type on each of the servers is known and can
possibly be different from each other, has been an open problem for over three
decades. The goal is to assign tasks to servers in real time so that the
system is stable, meaning that the queue lengths do not
diverge to infinity in steady state (throughput optimality), and the mean task
completion time is minimized (delay optimality). The fluid model planning,
Max-Weight, and $c\mu$-rule algorithms have theoretical guarantees on
optimality in some aspects for the affinity problem, but they consider a
complicated queueing structure and either require the task arrival rates, the
service rates of tasks on servers, or both. In many cases that are discussed in
the introduction section, both task arrival rates and service rates of
different task types on different servers are unknown. In this work, the Blind
GB-PANDAS algorithm is proposed which is completely blind to task arrival rates
and service rates. Blind GB-PANDAS uses an exploration-exploitation approach
for load balancing. We prove that Blind GB-PANDAS is throughput optimal under
arbitrary and unknown distributions for service times of different task types
on different servers and unknown task arrival rates. Blind GB-PANDAS aims to
route an incoming task to the server with the minimum weighted workload, but
since the service rates are unknown, such routing of incoming tasks is not
guaranteed which makes the throughput optimality analysis more complicated than
the case where service rates are known. Our extensive experimental results
reveal that Blind GB-PANDAS significantly outperforms existing methods in terms
of mean task completion time at high loads.
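The exploration-exploitation approach described above can be sketched as
follows: with a small probability the scheduler routes a task to a randomly
chosen server to gather fresh service-time samples, and otherwise it applies
the weighted-workload rule using its current empirical rate estimates. The
snippet below is an epsilon-greedy style illustration under simplified
assumptions (a single rate estimate per task-type/server pair and one
aggregate workload counter per server), not the Blind GB-PANDAS algorithm or
its analysis.

    import random
    from collections import defaultdict

    # Epsilon-greedy sketch of blind (rate-oblivious) weighted-workload routing.
    class BlindRouter:
        def __init__(self, servers, explore_prob=0.05, rng=random):
            self.servers = list(servers)
            self.explore_prob = explore_prob
            self.rng = rng
            self.workload = defaultdict(float)            # outstanding work per server
            self.samples = defaultdict(lambda: [0.0, 0])  # (total service time, count)

        def rate_estimate(self, task_type, server):
            total, count = self.samples[(task_type, server)]
            return count / total if total > 0 else 1.0    # optimistic default rate

        def route(self, task_type):
            if self.rng.random() < self.explore_prob:     # explore: learn rates
                return self.rng.choice(self.servers)
            # exploit: server with minimum estimated workload / estimated rate
            return min(self.servers, key=lambda s:
                       self.workload[s] / self.rate_estimate(task_type, s))

        def record_completion(self, task_type, server, service_time):
            total, count = self.samples[(task_type, server)]
            self.samples[(task_type, server)] = [total + service_time, count + 1]

    router = BlindRouter(["s1", "s2", "s3"])
    print(router.route("map"))   # picks a server, exploring occasionally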
Power-aware applications for scientific cluster and distributed computing
The aggregate power use of computing hardware is an important cost factor in
scientific cluster and distributed computing systems. The Worldwide LHC
Computing Grid (WLCG) is a major example of such a distributed computing
system, used primarily for high throughput computing (HTC) applications. It has
a computing capacity and power consumption rivaling that of the largest
supercomputers. The computing capacity required from this system is also
expected to grow over the next decade. Optimizing the power utilization and
cost of such systems is thus of great interest.
A number of trends currently underway will provide new opportunities for
power-aware optimizations. We discuss how power-aware software applications and
scheduling might be used to reduce power consumption, both as autonomous
entities and as part of a (globally) distributed system. As concrete examples
of computing centers we provide information on the large HEP-focused Tier-1 at
FNAL, and the Tigress High Performance Computing Center at Princeton
University, which provides HPC resources in a university context.
Comment: Submitted to proceedings of the International Symposium on Grids and
Clouds (ISGC) 2014, 23-28 March 2014, Academia Sinica, Taipei, Taiwan
Analyzing Self-Driving Cars on Twitter
This paper studies users' perception regarding a controversial product,
namely self-driving (autonomous) cars. To find people's opinion regarding this
new technology, we used an annotated Twitter dataset, and extracted the topics
in positive and negative tweets using an unsupervised, probabilistic model
known as topic modeling. We later used the topics, as well as linguistic and
Twitter-specific features, to classify the sentiment of the tweets. Regarding
the opinions, the result of our analysis shows that people are optimistic and
excited about the future technology, but at the same time they find it
dangerous and unreliable. For the classification task, we found
Twitter-specific features such as hashtags, as well as linguistic features
such as emphatic words, to be among the top attributes for classifying the
sentiment of the tweets.
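The pipeline described above, topic extraction followed by feature-based
sentiment classification, can be sketched generically with scikit-learn. The
toy tweets, the two-topic LDA model, and the hashtag-count feature below are
illustrative assumptions and do not reproduce the paper's annotated dataset
or exact feature set.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    # Hypothetical labeled tweets: 1 = positive, 0 = negative.
    tweets = ["self driving cars are the future #exciting",
              "autonomous cars are dangerous and not reliable",
              "love the idea of self driving cars #tech",
              "would never trust an autonomous car #scary"]
    labels = [1, 0, 1, 0]

    # Topic features from an unsupervised probabilistic topic model (LDA).
    counts = CountVectorizer().fit_transform(tweets)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    topics = lda.fit_transform(counts)

    # A simple Twitter-specific feature: number of hashtags per tweet.
    hashtags = np.array([[t.count("#")] for t in tweets])

    # Classify sentiment from topic mixtures plus Twitter-specific features.
    features = np.hstack([topics, hashtags])
    clf = LogisticRegression().fit(features, labels)
    print(clf.predict(features))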
Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips
This article summarizes key results of our work on experimental
characterization and analysis of latency variation and latency-reliability
trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016, and
examines the work's significance and future potential.
The goal of this work is to (i) experimentally characterize and understand
the latency variation across cells within a DRAM chip for three fundamental
DRAM operations (activation, precharge, and restoration), and (ii) develop
new mechanisms that exploit our
understanding of the latency variation to reliably improve performance. To this
end, we comprehensively characterize 240 DRAM chips from three major vendors,
and make six major new observations about latency variation within DRAM.
Notably, we find that (i) there is large latency variation across the cells for
each of the three operations; (ii) variation characteristics exhibit
significant spatial locality: slower cells are clustered in certain regions of
a DRAM chip; and (iii) the three fundamental operations exhibit different
reliability characteristics when the latency of each operation is reduced.
Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a
mechanism that exploits latency variation across DRAM cells within a DRAM chip
to improve system performance. The key idea of FLY-DRAM is to exploit the
spatial locality of slower cells within DRAM, and access the faster DRAM
regions with reduced latencies for the fundamental operations. Our evaluations
show that FLY-DRAM improves the performance of a wide range of applications by
13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors'
real DRAM chips, in a simulated 8-core system.
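FLY-DRAM's key mechanism boils down to a memory-controller-side lookup:
accesses that fall in DRAM regions profiled as fast are issued with reduced
timing parameters, while accesses to regions containing slower cells keep the
standard timings. The sketch below illustrates that lookup; the region
granularity, timing values, and profile contents are assumptions for
illustration, not measured data.

    # Sketch of a FLY-DRAM-style region-based timing lookup.
    REGION_SIZE = 1 << 20                               # assumed profiling granularity
    DEFAULT_TIMING = {"tRCD": 13.75, "tRP": 13.75}      # standard timings (ns)
    REDUCED_TIMING = {"tRCD": 10.0, "tRP": 10.0}        # assumed safe reduced timings

    # Profiled set of regions that contain slow cells (hypothetical).
    slow_regions = {3, 17, 42}

    def timings_for(address):
        """Return the timing parameters to use for a DRAM access at address."""
        region = address // REGION_SIZE
        return DEFAULT_TIMING if region in slow_regions else REDUCED_TIMING

    print(timings_for(5 * REGION_SIZE))   # region 5 is fast -> reduced timings
    print(timings_for(3 * REGION_SIZE))   # region 3 is slow -> default timings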
Voltron: Understanding and Exploiting the Voltage-Latency-Reliability Trade-Offs in Modern DRAM Chips to Improve Energy Efficiency
This paper summarizes our work on experimental characterization and analysis
of reduced-voltage operation in modern DRAM chips, which was published in
SIGMETRICS 2017, and examines the work's significance and future potential.
We take a comprehensive approach to understanding and exploiting the latency
and reliability characteristics of modern DRAM when the DRAM supply voltage is
lowered below the nominal voltage level specified by DRAM standards. We perform
an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured
recently by three major DRAM vendors. We find that reducing the supply voltage
below a certain point introduces bit errors in the data, and we comprehensively
characterize the behavior of these errors. We discover that these errors can be
avoided by increasing the latency of three major DRAM operations (activation,
restoration, and precharge). We perform detailed DRAM circuit simulations to
validate and explain our experimental findings. We also characterize the
various relationships between reduced supply voltage and error locations,
stored data patterns, DRAM temperature, and data retention.
Based on our observations, we propose a new DRAM energy reduction mechanism,
called Voltron. The key idea of Voltron is to use a performance model to
determine by how much we can reduce the supply voltage without introducing
errors and without exceeding a user-specified threshold for performance loss.
Our evaluations show that Voltron reduces the average DRAM and system energy
consumption by 10.5% and 7.3%, respectively, while limiting the average system
performance loss to only 1.8%, for a variety of memory-intensive quad-core
workloads. We also show that Voltron significantly outperforms prior dynamic
voltage and frequency scaling mechanisms for DRAM.
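Voltron's selection step can be sketched as a small search: among candidate
supply voltages that characterization showed to be error-free once the
latencies are adjusted, pick the lowest one whose model-predicted performance
loss stays within the user-specified budget. The voltages, latencies, and the
linear loss model below are illustrative assumptions, not measured values or
the paper's actual performance model.

    def select_voltage(candidates, predict_perf_loss, max_loss):
        """candidates: (voltage, adjusted_latency_ns) pairs known to be error-free.
        predict_perf_loss: callable mapping adjusted latency to fractional slowdown.
        max_loss: user-specified performance-loss budget (e.g., 0.02 for 2%)."""
        for voltage, latency in sorted(candidates):     # try lowest voltage first
            if predict_perf_loss(latency) <= max_loss:
                return voltage
        return None                                     # fall back to nominal voltage

    # Hypothetical characterization results and a toy linear performance model.
    candidates = [(1.35, 13.75), (1.25, 15.0), (1.15, 17.5), (1.05, 20.0)]
    loss_model = lambda latency: 0.004 * (latency - 13.75)
    print(select_voltage(candidates, loss_model, max_loss=0.02))  # -> 1.15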
High-throughput Execution of Hierarchical Analysis Pipelines on Hybrid Cluster Platforms
We propose, implement, and experimentally evaluate a runtime middleware to
support high-throughput execution on hybrid cluster machines of large-scale
analysis applications. A hybrid cluster machine consists of computation nodes
which have multiple CPUs and general purpose graphics processing units (GPUs).
Our work targets scientific analysis applications in which datasets are
processed in application-specific data chunks, and the processing of a data
chunk is expressed as a hierarchical pipeline of operations. The proposed
middleware system combines a bag-of-tasks style execution with coarse-grain
dataflow execution. Data chunks and associated data processing pipelines are
scheduled across cluster nodes using a demand driven approach, while within a
node operations in a given pipeline instance are scheduled across CPUs and
GPUs. The runtime system implements several optimizations, including
performance aware task scheduling, architecture aware process placement, data
locality conscious task assignment, and data prefetching and asynchronous data
copy, to maximize utilization of the aggregate computing power of CPUs and GPUs
and minimize data copy overheads. The application and performance benefits of
the runtime middleware are demonstrated using an image analysis application,
which is employed in a brain cancer study, on a state-of-the-art hybrid cluster
in which each node has two 6-core CPUs and three GPUs. Our results show that
implementing and scheduling application data processing as a set of fine-grain
operations provides more opportunities for runtime optimizations and attains
better performance than a coarser-grain, monolithic implementation. The
proposed runtime system can achieve high-throughput processing of large
datasets - we were able to process an image dataset consisting of 36,848
4Kx4K-pixel image tiles at a rate of about 150 tiles/second on 100 nodes.
Comment: 12 pages, 14 figures
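The two-level, demand-driven scheduling described above can be illustrated
with a small sketch: worker nodes pull data chunks from a shared queue
whenever they become idle, and within a node each pipeline operation is
dispatched to a GPU if one is free and estimated faster, otherwise to a CPU.
The pipeline, cost estimates, and device counts below are illustrative
assumptions, not the middleware's implementation.

    import queue
    import threading

    chunk_queue = queue.Queue()
    for chunk_id in range(8):                     # hypothetical data chunks
        chunk_queue.put(chunk_id)

    # Estimated per-operation cost (seconds) on CPU vs. GPU for a toy pipeline.
    PIPELINE = [("segment", {"cpu": 2.0, "gpu": 0.4}),
                ("classify", {"cpu": 1.0, "gpu": 0.9})]

    def node_worker(name, gpu_slots):
        while True:
            try:
                chunk = chunk_queue.get_nowait()  # demand-driven: pull when idle
            except queue.Empty:
                return
            for op, cost in PIPELINE:
                use_gpu = (cost["gpu"] < cost["cpu"]
                           and gpu_slots.acquire(blocking=False))
                device = "gpu" if use_gpu else "cpu"
                print(f"{name}: chunk {chunk}, op {op} on {device}")
                if use_gpu:
                    gpu_slots.release()

    threads = []
    for node in ("node0", "node1"):
        gpus = threading.Semaphore(3)             # assumed three GPUs per node
        t = threading.Thread(target=node_worker, args=(node, gpus))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()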
Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency
This paper summarizes the idea of ChargeCache, which was published in HPCA
2016 [51], and examines the work's significance and future potential. DRAM
latency continues to be a critical bottleneck for system performance. In this
work, we develop a low-cost mechanism, called ChargeCache, that enables faster
access to recently-accessed rows in DRAM, with no modifications to DRAM chips.
Our mechanism is based on the key observation that a recently-accessed row has
more charge and thus the following access to the same row can be performed
faster. To exploit this observation, we propose to track the addresses of
recently-accessed rows in a table in the memory controller. If a later DRAM
request hits in that table, the memory controller uses lower timing parameters,
leading to reduced DRAM latency. Row addresses are removed from the table after
a specified duration to ensure rows that have leaked too much charge are not
accessed with lower latency. We evaluate ChargeCache on a wide variety of
workloads and show that it provides significant performance and energy benefits
for both single-core and multi-core systems.
Comment: arXiv admin note: substantial text overlap with arXiv:1609.0723
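ChargeCache's bookkeeping maps naturally onto a small table with time-based
eviction: the memory controller records recently-accessed row addresses,
serves later hits to those rows with reduced timing parameters, and drops
entries after a fixed caching duration so rows that may have leaked too much
charge are never accessed with lowered latency. The capacity, caching
duration, and timing values in the sketch below are illustrative assumptions,
not the published design points.

    from collections import OrderedDict

    class ChargeCacheSketch:
        def __init__(self, capacity=128, caching_duration=1_000_000):
            self.capacity = capacity
            self.caching_duration = caching_duration   # in cycles
            self.table = OrderedDict()                 # row address -> insertion cycle

        def _expire(self, now):
            # Drop rows cached longer than the duration; their charge may have leaked.
            while self.table:
                row, inserted = next(iter(self.table.items()))
                if now - inserted < self.caching_duration:
                    break
                self.table.popitem(last=False)

        def access(self, row, now):
            """Return the timing parameters to use when activating this row."""
            self._expire(now)
            hit = row in self.table
            if hit:
                self.table.pop(row)             # re-activation recharges the row
            elif len(self.table) >= self.capacity:
                self.table.popitem(last=False)  # evict the oldest entry
            self.table[row] = now
            return ({"tRCD": 8.0, "tRAS": 26.0} if hit
                    else {"tRCD": 13.75, "tRAS": 35.0})

    cc = ChargeCacheSketch()
    print(cc.access(row=0x2A, now=0))           # miss -> default timings
    print(cc.access(row=0x2A, now=500))         # hit  -> reduced timings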