28,261 research outputs found
A software-hardware hybrid steering mechanism for clustered microarchitectures
Clustered microarchitectures provide a promising paradigm to solve or alleviate the problems of increasing microprocessor complexity and wire delays. High-performance out-of-order processors rely on hardware-only steering mechanisms to achieve balanced workload distribution among clusters. However, the additional steering logic results in a significant increase in complexity, which reduces the benefits of the clustered design. In this paper, we address this complexity issue and present a novel software-hardware hybrid steering mechanism for out-of-order processors. The proposed software-hardware cooperative scheme makes use of the concept of virtual clusters. Instructions are distributed to virtual clusters at compile time using static properties of the program such as data dependences. Then, at runtime, virtual clusters are mapped onto physical clusters by considering workload information. Experiments using SPEC CPU2000 benchmarks show that our hybrid approach can achieve almost the same performance as a state-of-the-art hardware-only steering scheme, while requiring low hardware complexity. In addition, the proposed mechanism outperforms state-of-the-art software-only steering mechanisms by 5% and 10% on average for 2-cluster and 4-cluster machines, respectively. Peer Reviewed. Postprint (published version)
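The runtime half of the scheme above (mapping virtual clusters onto physical clusters using workload information) can be illustrated with a minimal sketch. The abstract does not state the actual mapping policy, so the greedy least-loaded rule, the function name `map_virtual_clusters`, and the use of instruction counts as the workload measure are all assumptions made for illustration only.

```python
def map_virtual_clusters(virtual_sizes, num_physical):
    """Greedily map each virtual cluster to the least-loaded physical cluster.

    virtual_sizes: dict of virtual cluster name -> instruction count
                   (hypothetical workload measure, not from the paper).
    Returns (mapping, load), where mapping is name -> physical cluster index
    and load is the resulting instruction count per physical cluster.
    """
    load = [0] * num_physical
    mapping = {}
    for name, size in virtual_sizes.items():
        # Assumed policy: pick the physical cluster with the lowest load so far;
        # ties go to the lowest index.
        target = min(range(num_physical), key=lambda p: load[p])
        mapping[name] = target
        load[target] += size
    return mapping, load


mapping, load = map_virtual_clusters({"v0": 4, "v1": 3, "v2": 2, "v3": 1}, 2)
```

The compile-time half (grouping dependent instructions into the same virtual cluster) is not shown; here the virtual clusters are simply taken as given.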
Workload characterization of the shared/buy-in computing cluster at Boston University
Computing clusters provide a complete environment
for computational research, including bio-informatics, machine
learning, and image processing. The Shared Computing Cluster
(SCC) at Boston University is based on a shared/buy-in architecture
that combines shared computers, which are free to be
used by all users, and buy-in computers, which are computers
purchased by users for semi-exclusive use. Although there exists
significant work on characterizing the performance of computing
clusters, little is known about shared/buy-in architectures. Using
data traces, we statistically analyze the performance of the SCC.
Our results show that the average waiting time of a buy-in job
is 16.1% shorter than that of a shared job. Furthermore, we
identify parameters that have a major impact on the performance
experienced by shared and buy-in jobs. These parameters include
the type of parallel environment and the run time limit (i.e., the
maximum time during which a job can use a resource). Finally,
we show that the semi-exclusive paradigm, which allows any SCC
user to use idle buy-in resources for a limited time, increases
the utilization of buy-in resources by 17.4%, thus significantly
improving the performance of the system as a whole. http://people.bu.edu/staro/MIT_Conference_Yoni.pdf Accepted manuscript
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly-efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load.
Policy-based techniques for self-managing parallel applications
This paper presents an empirical investigation of policy-based self-management techniques for parallel applications executing in loosely-coupled environments. The dynamic and heterogeneous nature of these environments is discussed and the special considerations for parallel applications are identified. An adaptive strategy for the run-time deployment of tasks of parallel applications is presented. The strategy is based on embedding numerous policies which are informed by contextual and environmental inputs. The policies govern various aspects of behaviour, enhancing flexibility so that the goals of efficiency and performance are achieved despite high levels of environmental variability. A prototype self-managing parallel application is used as a vehicle to explore the feasibility and benefits of the strategy. In particular, several aspects of stability are investigated. The implementation and behaviour of three policies are discussed and sample results are examined.
Empowering a helper cluster through data-width aware instruction selection policies
Narrow values, which can be represented with fewer bits than the full machine width, occur very frequently in programs. Clustering mechanisms, in turn, enable cost- and performance-effective scaling of processor back-end features. These attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. helper clusters), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit helper cluster. Then, as our main focus, we propose various ideas for selecting suitable instructions to execute in the data-width based clusters. We add data-width information as another instruction steering decision metric and introduce new data-width based selection algorithms which also consider dependency, inter-cluster communication and load imbalance. Using those techniques, the performance of a wide range of workloads is substantially increased; the helper cluster achieves an average speedup of 11% across 412 apps. For integer applications, the average speedup can be as high as 22%. Peer Reviewed. Postprint (published version)
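As a rough illustration of the data-width criterion described above, the following sketch checks whether values fit in the 8-bit helper cluster and steers an instruction accordingly. The signed two's-complement range test, the function names `is_narrow` and `steer`, and the rule "all operands and the result must be narrow" are assumptions for illustration; the abstract's selection algorithms also weigh dependency, inter-cluster communication and load imbalance, which are omitted here, and real hardware would have to predict widths before execution.

```python
def is_narrow(value, width=8):
    """Return True if value fits in a signed two's-complement field of `width` bits."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return lo <= value <= hi


def steer(operands, result, width=8):
    """Pick a cluster using data width only (assumed, simplified policy).

    An instruction goes to the helper cluster only if every source operand
    and the result are narrow; otherwise it goes to the main cluster.
    """
    if all(is_narrow(v, width) for v in (*operands, result)):
        return "helper"
    return "main"


print(steer([3, 4], 7))        # all values fit in 8 bits
print(steer([100, 100], 200))  # result overflows the signed 8-bit range
```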
H-WorD: Supporting job scheduling in Hadoop with workload-driven data redistribution
The final publication is available at http://link.springer.com/chapter/10.1007/978-3-319-44039-2_21 Today’s distributed data processing systems typically follow a query shipping approach and exploit data locality to reduce network traffic. In such systems the distribution of data over the cluster resources plays a significant role, and when skewed, it can harm the performance of executing applications. In this paper, we address the challenges of automatically adapting the distribution of data in a cluster to the workload imposed by the input applications. We propose a generic algorithm, named H-WorD, which, based on the estimated workload over resources, suggests alternative execution scenarios of tasks, and hence identifies the required transfers of input data a priori, so that data is brought close to the execution in time. We exemplify our algorithm in the context of MapReduce jobs in a Hadoop ecosystem. Finally, we evaluate our approach and demonstrate the performance gains of automatic data redistribution. Peer Reviewed. Postprint (author's final draft)
The Blacklisting Memory Scheduler: Balancing Performance, Fairness and Complexity
In a multicore system, applications running on different cores interfere at
main memory. This inter-application interference degrades overall system
performance and unfairly slows down applications. Prior works have developed
application-aware memory schedulers to tackle this problem. State-of-the-art
application-aware memory schedulers prioritize requests of applications that
are vulnerable to interference, by ranking individual applications based on
their memory access characteristics and enforcing a total rank order.
In this paper, we observe that state-of-the-art application-aware memory
schedulers have two major shortcomings. First, such schedulers trade off
hardware complexity in order to achieve high performance or fairness, since
ranking applications with a total order leads to high hardware complexity.
Second, ranking can unfairly slow down applications that are at the bottom of
the ranking stack. To overcome these shortcomings, we propose the Blacklisting
Memory Scheduler (BLISS), which achieves high system performance and fairness
while incurring low hardware complexity, based on two observations. First, we
find that, to mitigate interference, it is sufficient to separate applications
into only two groups. Second, we show that this grouping can be efficiently
performed by simply counting the number of consecutive requests served from
each application.
We evaluate BLISS across a wide variety of workloads/system configurations
and compare its performance and hardware complexity with five state-of-the-art
memory schedulers. Our evaluations show that BLISS achieves 5% better system
performance and 25% better fairness than the best-performing previous scheduler
while greatly reducing critical path latency and hardware area cost of the
memory scheduler (by 79% and 43%, respectively), thereby achieving a good
trade-off between performance, fairness and hardware complexity.
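The grouping idea in this abstract (split applications into two groups by counting consecutive requests served from each application) can be sketched as follows. This is a minimal illustration, not the scheduler's actual design: the class name `BlacklistTracker`, the default threshold of 4, the explicit `clear` method, and the oldest-first pick rule are assumptions not stated in the abstract.

```python
class BlacklistTracker:
    """Track consecutive served requests per application (illustrative sketch)."""

    def __init__(self, threshold=4):
        self.threshold = threshold  # assumed streak limit before blacklisting
        self.last_app = None        # application whose request was served last
        self.streak = 0             # consecutive requests served from last_app
        self.blacklist = set()      # applications currently deprioritized

    def on_request_served(self, app_id):
        # Extend the streak if the same application was served again,
        # otherwise start a new streak for the new application.
        if app_id == self.last_app:
            self.streak += 1
        else:
            self.last_app = app_id
            self.streak = 1
        if self.streak > self.threshold:
            self.blacklist.add(app_id)

    def clear(self):
        # Assumed to be called periodically so applications are not
        # deprioritized forever.
        self.blacklist.clear()

    def pick(self, pending):
        """pending: app ids with a waiting request, oldest first."""
        for app in pending:
            if app not in self.blacklist:
                return app
        # Every pending application is blacklisted: fall back to the oldest.
        return pending[0] if pending else None


tracker = BlacklistTracker(threshold=2)
for app in ["A", "A", "A", "B"]:
    tracker.on_request_served(app)
print(tracker.pick(["A", "B"]))
```

Only two groups (blacklisted and not blacklisted) are maintained, which is what lets this kind of scheme avoid computing a total rank order over all applications.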