9,750 research outputs found
Architectural support for task dependence management with flexible software scheduling
The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU and to still perform task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x less area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and
Innovation (contracts TIN2015-65316-P, TIN2016-76635-C2-2-R and TIN2016-81840-REDT), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Unionâs Horizon 2020 research and innovation programme under grant agreement No 671697 and No. 671610. M. MoretĂł has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047.Peer ReviewedPostprint (author's final draft
Adaptive planning for distributed systems using goal accomplishment tracking
Goal accomplishment tracking is the process of monitoring the progress of a task or series of tasks towards completing a goal. Goal accomplishment tracking is used to monitor goal progress in a variety of domains, including workflow processing, teleoperation and industrial manufacturing. Practically, it involves the constant monitoring of task execution, analysis of this data to determine the task progress and notification of interested parties. This information is usually used in a passive way to observe goal progress. However, responding to this information may prevent goal failures. In addition, responding proactively in an opportunistic way can also lead to goals being completed faster. This paper proposes an architecture to support the adaptive planning of tasks for fault tolerance or opportunistic task execution based on goal accomplishment tracking. It argues that dramatically increased performance can be gained by monitoring task execution and altering plans dynamically
Scaling Monte Carlo Tree Search on Intel Xeon Phi
Many algorithms have been parallelized successfully on the Intel Xeon Phi
coprocessor, especially those with regular, balanced, and predictable data
access patterns and instruction flows. Irregular and unbalanced algorithms are
harder to parallelize efficiently. They are, for instance, present in
artificial intelligence search algorithms such as Monte Carlo Tree Search
(MCTS). In this paper we study the scaling behavior of MCTS, on a highly
optimized real-world application, on real hardware. The Intel Xeon Phi allows
shared memory scaling studies up to 61 cores and 244 hardware threads. We
compare work-stealing (Cilk Plus and TBB) and work-sharing (FIFO scheduling)
approaches. Interestingly, we find that a straightforward thread pool with a
work-sharing FIFO queue shows the best performance. A crucial element for this
high performance is the controlling of the grain size, an approach that we call
Grain Size Controlled Parallel MCTS. Our subsequent comparing with the Xeon
CPUs shows an even more comprehensible distinction in performance between
different threading libraries. We achieve, to the best of our knowledge, the
fastest implementation of a parallel MCTS on the 61 core Intel Xeon Phi using a
real application (47 relative to a sequential run).Comment: 8 pages, 9 figure
Improving the scalability of parallel N-body applications with an event driven constraint based execution model
The scalability and efficiency of graph applications are significantly
constrained by conventional systems and their supporting programming models.
Technology trends like multicore, manycore, and heterogeneous system
architectures are introducing further challenges and possibilities for emerging
application domains such as graph applications. This paper explores the space
of effective parallel execution of ephemeral graphs that are dynamically
generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The
workloads are expressed using the semantics of an Exascale computing execution
model called ParalleX. For comparison, results using conventional execution
model semantics are also presented. We find improved load balancing during
runtime and automatic parallelism discovery improving efficiency using the
advanced semantics for Exascale computing.Comment: 11 figure
- âŠ