Affinity Scheduling and the Applications on Data Center Scheduling with Data Locality
The MapReduce framework is the de facto standard in Hadoop. When data
locality in data centers is taken into account, load balancing of map tasks
becomes a special case of the affinity scheduling problem. There is a large
body of work on affinity scheduling proposing heuristic algorithms, such as
Delay Scheduling and Quincy, that try to increase data locality in data
centers. However, little attention has been paid to theoretical guarantees
on the throughput and delay optimality of such algorithms. In this work, we
present and compare different algorithms and discuss their shortcomings and
strengths. To the best of our knowledge, most data centers use static load
balancing algorithms, which are inefficient, waste resources, and cause
unnecessary delays for users.
GB-PANDAS: Throughput and heavy-traffic optimality analysis for affinity scheduling
Dynamic affinity scheduling has been an open problem for nearly three
decades. The problem is to dynamically schedule multi-type tasks to
multi-skilled servers such that the resulting queueing system is both stable in
the capacity region (throughput optimality) and the mean delay of tasks is
minimized at high loads near the boundary of the capacity region (heavy-traffic
optimality). As for applications, data-intensive analytics like MapReduce,
Hadoop, and Dryad fit into this setting, where the set of servers is
heterogeneous for different task types, so the pair of task type and server
determines the processing rate of the task. The load balancing algorithm used
in such frameworks is an example of affinity scheduling which is desired to be
both robust and delay optimal at high loads when hot-spots occur. Fluid model
planning, the MaxWeight algorithm, and the generalized $c\mu$-rule are among
the first algorithms proposed for affinity scheduling with theoretical
guarantees of optimality in different senses, which are discussed in the
related work section. None of these algorithms is practical for data center
applications because of their unrealistic assumptions. The
join-the-shortest-queue-MaxWeight (JSQ-MaxWeight), JSQ-Priority, and
weighted-workload algorithms are examples of load balancing policies for
systems with two and three levels of data locality with a rack structure. In
this work, we propose the Generalized-Balanced-Pandas algorithm (GB-PANDAS) for
a system with multiple levels of data locality and prove its throughput
optimality. We prove this result under an arbitrary distribution for service
times, whereas most previous theoretical work assumes geometric distribution
for service times. Extensive simulation results show that the GB-PANDAS
algorithm reduces the mean delay and outperforms the JSQ-MaxWeight algorithm
by a factor of two.
Comment: IFIP WG 7.3 Performance 2017 - The 35th International Symposium on
Computer Performance, Modeling, Measurements and Evaluation 2017
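The weighted-workload routing rule referenced above, which the Balanced-Pandas
family generalizes to multiple levels of data locality, can be illustrated
with a short sketch: an incoming task goes to the server where its expected
wait, i.e., the server's outstanding work divided by the rate at which that
server can process the task, is smallest. The sketch below is a simplified
illustration under assumed locality levels, rates, and workload bookkeeping,
not the GB-PANDAS algorithm itself.

    # Minimal sketch of weighted-workload routing with data locality.
    # Assumptions: one aggregate workload counter per server, and a service
    # rate that depends only on the task's locality level at that server
    # (local > rack-local > remote); the real algorithm keeps per-level queues.

    def weighted_workload_route(task_locality, workloads, rates):
        """Pick the server minimizing workload / service rate for this task."""
        def expected_wait(server):
            return workloads[server] / rates[task_locality[server]]
        return min(workloads, key=expected_wait)

    # Hypothetical 3-server example with three locality levels.
    rates = {"local": 1.0, "rack": 0.8, "remote": 0.5}
    workloads = {"s1": 4.0, "s2": 2.0, "s3": 1.0}
    locality = {"s1": "local", "s2": "rack", "s3": "remote"}
    print(weighted_workload_route(locality, workloads, rates))  # -> "s3" here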
The Power of d Choices in Scheduling for Data Centers with Heterogeneous Servers
The MapReduce framework is the de facto standard in big data and its applications, where a
big data-set is split into small data chunks that are replicated on different
servers among thousands of servers. The heterogeneous server structure of the
system makes the scheduling much harder than scheduling for systems with
homogeneous servers. Throughput optimality of the system on one hand and delay
optimality on the other hand creates a dilemma for assigning tasks to servers.
The JSQ-MaxWeight and Balanced-Pandas algorithms are the state-of-the-art
algorithms with theoretical guarantees on throughput and delay optimality for
systems with two and three levels of data locality. However, the scheduling
complexity of these two algorithms is prohibitively high. Hence, we combine
the power-of-d-choices approach with the Balanced-Pandas and JSQ-MaxWeight
algorithms, and compare the complexity of the original algorithms with their
power-of-d-choices versions. We further show that the
Balanced-Pandas algorithm combined with the power of d choices,
Balanced-Pandas-Pod, not only performs better than plain Balanced-Pandas, but
is also less sensitive to the parameter $d$ than the combination of the
JSQ-MaxWeight algorithm and the power of d choices, JSQ-MaxWeight-Pod. In
fact, in our extensive simulations, the Balanced-Pandas-Pod algorithm performs
better than plain Balanced-Pandas at the low and medium loads at which data
centers usually operate, and performs almost the same as Balanced-Pandas at
high loads. Note that the load
balancing complexity of the Balanced-Pandas and JSQ-MaxWeight algorithms is
$O(n)$, where $n$ is the number of servers in the system and is on the order
of thousands, whereas the complexity of Balanced-Pandas-Pod and
JSQ-MaxWeight-Pod is $O(d)$, which makes the central scheduler faster and
saves energy.
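The power-of-d-choices idea above can be made concrete with a few lines of
code: instead of inspecting all n server queues on every task arrival, the
scheduler samples d servers uniformly at random and applies its routing rule
only to that sample, so the per-task cost depends on d rather than n. The
sketch below is a generic shortest-queue-among-the-sample illustration, not
the Balanced-Pandas-Pod or JSQ-MaxWeight-Pod algorithms themselves.

    import random

    # Generic power-of-d-choices sampling: inspect only d randomly chosen
    # servers per arrival instead of all n, then apply a routing rule
    # (here: shortest queue) to the sampled subset.

    def pod_route(queue_lengths, d=2, rng=random):
        """Return the index of the shortest queue among d sampled servers."""
        n = len(queue_lengths)
        sample = rng.sample(range(n), min(d, n))   # O(d) work per task
        return min(sample, key=lambda i: queue_lengths[i])

    queues = [5, 0, 7, 3, 2, 9, 1, 4]              # hypothetical queue lengths
    print("route task to server", pod_route(queues, d=3))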
Blind GB-PANDAS: A Blind Throughput-Optimal Load Balancing Algorithm for Affinity Scheduling
Dynamic affinity load balancing of multi-type tasks on multi-skilled servers,
when the service rate of each task type on each of the servers is known and can
possibly be different from each other, has been an open problem for over three
decades. The goal is to assign tasks to servers in real time so that the
system is stable, meaning that the queue lengths do not
diverge to infinity in steady state (throughput optimality), and the mean task
completion time is minimized (delay optimality). The fluid model planning,
Max-Weight, and $c\mu$-rule algorithms have theoretical guarantees on
optimality in some aspects for the affinity problem, but they consider a
complicated queueing structure and either require the task arrival rates, the
service rates of tasks on servers, or both. In many cases that are discussed in
the introduction section, both task arrival rates and service rates of
different task types on different servers are unknown. In this work, the Blind
GB-PANDAS algorithm is proposed which is completely blind to task arrival rates
and service rates. Blind GB-PANDAS uses an exploration-exploitation approach
for load balancing. We prove that Blind GB-PANDAS is throughput optimal under
arbitrary and unknown distributions for service times of different task types
on different servers and unknown task arrival rates. Blind GB-PANDAS aims to
route an incoming task to the server with the minimum weighted workload, but
since the service rates are unknown, such routing of incoming tasks is not
guaranteed which makes the throughput optimality analysis more complicated than
the case where service rates are known. Our extensive experimental results
reveal that Blind GB-PANDAS significantly outperforms existing methods in terms
of mean task completion time at high loads.
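The exploration-exploitation approach described above can be sketched as
follows: with a small probability the scheduler routes a task to a randomly
chosen server to gather fresh service-time samples, and otherwise it applies
the weighted-workload rule using its current empirical rate estimates. The
snippet below is an epsilon-greedy style illustration under simplified
assumptions (a single rate estimate per task-type/server pair and one
aggregate workload counter per server), not the Blind GB-PANDAS algorithm or
its analysis.

    import random
    from collections import defaultdict

    # Epsilon-greedy sketch of blind (rate-oblivious) weighted-workload routing.
    class BlindRouter:
        def __init__(self, servers, explore_prob=0.05, rng=random):
            self.servers = list(servers)
            self.explore_prob = explore_prob
            self.rng = rng
            self.workload = defaultdict(float)            # outstanding work per server
            self.samples = defaultdict(lambda: [0.0, 0])  # (total service time, count)

        def rate_estimate(self, task_type, server):
            total, count = self.samples[(task_type, server)]
            return count / total if total > 0 else 1.0    # optimistic default rate

        def route(self, task_type):
            if self.rng.random() < self.explore_prob:     # explore: learn rates
                return self.rng.choice(self.servers)
            # exploit: server with minimum estimated workload / estimated rate
            return min(self.servers, key=lambda s:
                       self.workload[s] / self.rate_estimate(task_type, s))

        def record_completion(self, task_type, server, service_time):
            total, count = self.samples[(task_type, server)]
            self.samples[(task_type, server)] = [total + service_time, count + 1]

    router = BlindRouter(["s1", "s2", "s3"])
    print(router.route("map"))   # picks a server, exploring occasionally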
Power-aware applications for scientific cluster and distributed computing
The aggregate power use of computing hardware is an important cost factor in
scientific cluster and distributed computing systems. The Worldwide LHC
Computing Grid (WLCG) is a major example of such a distributed computing
system, used primarily for high throughput computing (HTC) applications. It has
a computing capacity and power consumption rivaling that of the largest
supercomputers. The computing capacity required from this system is also
expected to grow over the next decade. Optimizing the power utilization and
cost of such systems is thus of great interest.
A number of trends currently underway will provide new opportunities for
power-aware optimizations. We discuss how power-aware software applications and
scheduling might be used to reduce power consumption, both as autonomous
entities and as part of a (globally) distributed system. As concrete examples
of computing centers we provide information on the large HEP-focused Tier-1 at
FNAL, and the Tigress High Performance Computing Center at Princeton
University, which provides HPC resources in a university context.
Comment: Submitted to proceedings of the International Symposium on Grids and
Clouds (ISGC) 2014, 23-28 March 2014, Academia Sinica, Taipei, Taiwan
Analyzing Self-Driving Cars on Twitter
This paper studies users' perception regarding a controversial product,
namely self-driving (autonomous) cars. To find people's opinion regarding this
new technology, we used an annotated Twitter dataset, and extracted the topics
in positive and negative tweets using an unsupervised, probabilistic model
known as topic modeling. We later used the topics, as well as linguistic and
Twitter-specific features, to classify the sentiment of the tweets. Regarding
the opinions, the result of our analysis shows that people are optimistic and
excited about the future technology, but at the same time they find it
dangerous and unreliable. For the classification task, we found
Twitter-specific features such as hashtags, as well as linguistic features
such as emphatic words, to be among the top attributes for classifying the
sentiment of the tweets.
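The pipeline described above, topic extraction followed by feature-based
sentiment classification, can be sketched generically with scikit-learn. The
toy tweets, the two-topic LDA model, and the hashtag-count feature below are
illustrative assumptions and do not reproduce the paper's annotated dataset
or exact feature set.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    # Hypothetical labeled tweets: 1 = positive, 0 = negative.
    tweets = ["self driving cars are the future #exciting",
              "autonomous cars are dangerous and not reliable",
              "love the idea of self driving cars #tech",
              "would never trust an autonomous car #scary"]
    labels = [1, 0, 1, 0]

    # Topic features from an unsupervised probabilistic topic model (LDA).
    counts = CountVectorizer().fit_transform(tweets)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    topics = lda.fit_transform(counts)

    # A simple Twitter-specific feature: number of hashtags per tweet.
    hashtags = np.array([[t.count("#")] for t in tweets])

    # Classify sentiment from topic mixtures plus Twitter-specific features.
    features = np.hstack([topics, hashtags])
    clf = LogisticRegression().fit(features, labels)
    print(clf.predict(features))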
Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips
This article summarizes key results of our work on experimental
characterization and analysis of latency variation and latency-reliability
trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016, and
examines the work's significance and future potential.
The goal of this work is to (i) experimentally characterize and understand
the latency variation across cells within a DRAM chip for three fundamental
DRAM operations (activation, precharge, and restoration), and (ii) develop
new mechanisms that exploit our
understanding of the latency variation to reliably improve performance. To this
end, we comprehensively characterize 240 DRAM chips from three major vendors,
and make six major new observations about latency variation within DRAM.
Notably, we find that (i) there is large latency variation across the cells for
each of the three operations; (ii) variation characteristics exhibit
significant spatial locality: slower cells are clustered in certain regions of
a DRAM chip; and (iii) the three fundamental operations exhibit different
reliability characteristics when the latency of each operation is reduced.
Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a
mechanism that exploits latency variation across DRAM cells within a DRAM chip
to improve system performance. The key idea of FLY-DRAM is to exploit the
spatial locality of slower cells within DRAM, and access the faster DRAM
regions with reduced latencies for the fundamental operations. Our evaluations
show that FLY-DRAM improves the performance of a wide range of applications by
13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors'
real DRAM chips, in a simulated 8-core system.
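FLY-DRAM's key mechanism boils down to a memory-controller-side lookup:
accesses that fall in DRAM regions profiled as fast are issued with reduced
timing parameters, while accesses to regions containing slower cells keep the
standard timings. The sketch below illustrates that lookup; the region
granularity, timing values, and profile contents are assumptions for
illustration, not measured data.

    # Sketch of a FLY-DRAM-style region-based timing lookup.
    REGION_SIZE = 1 << 20                               # assumed profiling granularity
    DEFAULT_TIMING = {"tRCD": 13.75, "tRP": 13.75}      # standard timings (ns)
    REDUCED_TIMING = {"tRCD": 10.0, "tRP": 10.0}        # assumed safe reduced timings

    # Profiled set of regions that contain slow cells (hypothetical).
    slow_regions = {3, 17, 42}

    def timings_for(address):
        """Return the timing parameters to use for a DRAM access at address."""
        region = address // REGION_SIZE
        return DEFAULT_TIMING if region in slow_regions else REDUCED_TIMING

    print(timings_for(5 * REGION_SIZE))   # region 5 is fast -> reduced timings
    print(timings_for(3 * REGION_SIZE))   # region 3 is slow -> default timings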
Voltron: Understanding and Exploiting the Voltage-Latency-Reliability Trade-Offs in Modern DRAM Chips to Improve Energy Efficiency
This paper summarizes our work on experimental characterization and analysis
of reduced-voltage operation in modern DRAM chips, which was published in
SIGMETRICS 2017, and examines the work's significance and future potential.
We take a comprehensive approach to understanding and exploiting the latency
and reliability characteristics of modern DRAM when the DRAM supply voltage is
lowered below the nominal voltage level specified by DRAM standards. We perform
an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured
recently by three major DRAM vendors. We find that reducing the supply voltage
below a certain point introduces bit errors in the data, and we comprehensively
characterize the behavior of these errors. We discover that these errors can be
avoided by increasing the latency of three major DRAM operations (activation,
restoration, and precharge). We perform detailed DRAM circuit simulations to
validate and explain our experimental findings. We also characterize the
various relationships between reduced supply voltage and error locations,
stored data patterns, DRAM temperature, and data retention.
Based on our observations, we propose a new DRAM energy reduction mechanism,
called Voltron. The key idea of Voltron is to use a performance model to
determine by how much we can reduce the supply voltage without introducing
errors and without exceeding a user-specified threshold for performance loss.
Our evaluations show that Voltron reduces the average DRAM and system energy
consumption by 10.5% and 7.3%, respectively, while limiting the average system
performance loss to only 1.8%, for a variety of memory-intensive quad-core
workloads. We also show that Voltron significantly outperforms prior dynamic
voltage and frequency scaling mechanisms for DRAM.
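Voltron's selection step can be sketched as a small search: among candidate
supply voltages that characterization showed to be error-free once the
latencies are adjusted, pick the lowest one whose model-predicted performance
loss stays within the user-specified budget. The voltages, latencies, and the
linear loss model below are illustrative assumptions, not measured values or
the paper's actual performance model.

    def select_voltage(candidates, predict_perf_loss, max_loss):
        """candidates: (voltage, adjusted_latency_ns) pairs known to be error-free.
        predict_perf_loss: callable mapping adjusted latency to fractional slowdown.
        max_loss: user-specified performance-loss budget (e.g., 0.02 for 2%)."""
        for voltage, latency in sorted(candidates):     # try lowest voltage first
            if predict_perf_loss(latency) <= max_loss:
                return voltage
        return None                                     # fall back to nominal voltage

    # Hypothetical characterization results and a toy linear performance model.
    candidates = [(1.35, 13.75), (1.25, 15.0), (1.15, 17.5), (1.05, 20.0)]
    loss_model = lambda latency: 0.004 * (latency - 13.75)
    print(select_voltage(candidates, loss_model, max_loss=0.02))  # -> 1.15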
High-throughput Execution of Hierarchical Analysis Pipelines on Hybrid Cluster Platforms
We propose, implement, and experimentally evaluate a runtime middleware to
support high-throughput execution on hybrid cluster machines of large-scale
analysis applications. A hybrid cluster machine consists of computation nodes
which have multiple CPUs and general purpose graphics processing units (GPUs).
Our work targets scientific analysis applications in which datasets are
processed in application-specific data chunks, and the processing of a data
chunk is expressed as a hierarchical pipeline of operations. The proposed
middleware system combines a bag-of-tasks style execution with coarse-grain
dataflow execution. Data chunks and associated data processing pipelines are
scheduled across cluster nodes using a demand driven approach, while within a
node operations in a given pipeline instance are scheduled across CPUs and
GPUs. The runtime system implements several optimizations, including
performance aware task scheduling, architecture aware process placement, data
locality conscious task assignment, and data prefetching and asynchronous data
copy, to maximize utilization of the aggregate computing power of CPUs and GPUs
and minimize data copy overheads. The application and performance benefits of
the runtime middleware are demonstrated using an image analysis application,
which is employed in a brain cancer study, on a state-of-the-art hybrid cluster
in which each node has two 6-core CPUs and three GPUs. Our results show that
implementing and scheduling application data processing as a set of fine-grain
operations provides more opportunities for runtime optimizations and attains
better performance than a coarser-grain, monolithic implementation. The
proposed runtime system can achieve high-throughput processing of large
datasets - we were able to process an image dataset consisting of 36,848
4Kx4K-pixel image tiles at a rate of about 150 tiles/second on 100 nodes.
Comment: 12 pages, 14 figures
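The two-level, demand-driven scheduling described above can be illustrated
with a small sketch: worker nodes pull data chunks from a shared queue
whenever they become idle, and within a node each pipeline operation is
dispatched to a GPU if one is free and estimated faster, otherwise to a CPU.
The pipeline, cost estimates, and device counts below are illustrative
assumptions, not the middleware's implementation.

    import queue
    import threading

    chunk_queue = queue.Queue()
    for chunk_id in range(8):                     # hypothetical data chunks
        chunk_queue.put(chunk_id)

    # Estimated per-operation cost (seconds) on CPU vs. GPU for a toy pipeline.
    PIPELINE = [("segment", {"cpu": 2.0, "gpu": 0.4}),
                ("classify", {"cpu": 1.0, "gpu": 0.9})]

    def node_worker(name, gpu_slots):
        while True:
            try:
                chunk = chunk_queue.get_nowait()  # demand-driven: pull when idle
            except queue.Empty:
                return
            for op, cost in PIPELINE:
                use_gpu = (cost["gpu"] < cost["cpu"]
                           and gpu_slots.acquire(blocking=False))
                device = "gpu" if use_gpu else "cpu"
                print(f"{name}: chunk {chunk}, op {op} on {device}")
                if use_gpu:
                    gpu_slots.release()

    threads = []
    for node in ("node0", "node1"):
        gpus = threading.Semaphore(3)             # assumed three GPUs per node
        t = threading.Thread(target=node_worker, args=(node, gpus))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()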
Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency
This paper summarizes the idea of ChargeCache, which was published in HPCA
2016 [51], and examines the work's significance and future potential. DRAM
latency continues to be a critical bottleneck for system performance. In this
work, we develop a low-cost mechanism, called ChargeCache, that enables faster
access to recently-accessed rows in DRAM, with no modifications to DRAM chips.
Our mechanism is based on the key observation that a recently-accessed row has
more charge and thus the following access to the same row can be performed
faster. To exploit this observation, we propose to track the addresses of
recently-accessed rows in a table in the memory controller. If a later DRAM
request hits in that table, the memory controller uses lower timing parameters,
leading to reduced DRAM latency. Row addresses are removed from the table after
a specified duration to ensure rows that have leaked too much charge are not
accessed with lower latency. We evaluate ChargeCache on a wide variety of
workloads and show that it provides significant performance and energy benefits
for both single-core and multi-core systems.
Comment: arXiv admin note: substantial text overlap with arXiv:1609.0723
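ChargeCache's bookkeeping maps naturally onto a small table with time-based
eviction: the memory controller records recently-accessed row addresses,
serves later hits to those rows with reduced timing parameters, and drops
entries after a fixed caching duration so rows that may have leaked too much
charge are never accessed with lowered latency. The capacity, caching
duration, and timing values in the sketch below are illustrative assumptions,
not the published design points.

    from collections import OrderedDict

    class ChargeCacheSketch:
        def __init__(self, capacity=128, caching_duration=1_000_000):
            self.capacity = capacity
            self.caching_duration = caching_duration   # in cycles
            self.table = OrderedDict()                 # row address -> insertion cycle

        def _expire(self, now):
            # Drop rows cached longer than the duration; their charge may have leaked.
            while self.table:
                row, inserted = next(iter(self.table.items()))
                if now - inserted < self.caching_duration:
                    break
                self.table.popitem(last=False)

        def access(self, row, now):
            """Return the timing parameters to use when activating this row."""
            self._expire(now)
            hit = row in self.table
            if hit:
                self.table.pop(row)             # re-activation recharges the row
            elif len(self.table) >= self.capacity:
                self.table.popitem(last=False)  # evict the oldest entry
            self.table[row] = now
            return ({"tRCD": 8.0, "tRAS": 26.0} if hit
                    else {"tRCD": 13.75, "tRAS": 35.0})

    cc = ChargeCacheSketch()
    print(cc.access(row=0x2A, now=0))           # miss -> default timings
    print(cc.access(row=0x2A, now=500))         # hit  -> reduced timings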