1,699 research outputs found
Estimating the Potential Speedup of Computer Vision Applications on Embedded Multiprocessors
Computer vision applications constitute one of the key drivers for embedded
multicore architectures. Although the number of available cores is increasing
in new architectures, designing an application to maximize the utilization of
the platform is still a challenge. In this sense, parallel performance
prediction tools can aid developers in understanding the characteristics of an
application and finding the most adequate parallelization strategy. In this
work, we present a method for early parallel performance estimation on embedded
multiprocessors from sequential application traces. We describe its
implementation in Parana, a fast trace-driven simulator targeting OpenMP
applications on the STMicroelectronics' STxP70 Application-Specific
Multiprocessor (ASMP). Results for the FAST key point detector application show
an error margin of less than 10% compared to the reference cycle-approximate
simulator, with lower modeling effort and up to 20x faster execution time.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241
TaskPoint: sampled simulation of task-based programs
Sampled simulation is a mature technique for reducing simulation time of single-threaded programs, but it is not directly applicable to simulation of multi-threaded architectures. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to system noise and variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals we employ a novel fast-forward mechanism for dynamically scheduled programs. We evaluate the proposed technique on a set of 19 task-based parallel benchmarks and two different architectures. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 64 simulated threads by an average factor of 19.1 at an average error of 1.8% and a maximum error of 15.0%.This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), the Spanish Ministry of Science and Innovation
(contract TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Mont-Blanc project (EU-FP7-610402 and EU-H2020-671697). M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas is supported by the Ministry of Economy
and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the EUFP7 (contract 2013BP B 00243). T.Grass has been partially
supported by the AGAUR of the Generalitat de Catalunya (grant 2013FI B 0058).Peer ReviewedPostprint (author's final draft
A trace-driven approach for fast and accurate simulation of manycore architectures
International audienceThe evolution of manycore sytems, forecasted to feature hundreds of cores by the end of the decade calls for efficient solutions for design space exploration and debugging. Among the relevant existing solutions the well-known gem5 simu-lator provides a rich architecture description framework. However , these features come at the price of prohibitive simulation time that limits the scope of possible explorations to configurations made of tens of cores. To address this limitation, this paper proposes a novel trace-driven simulation approach for efficient exploration of manycore architectures
ElasticSimMATE: a Fast and Accurate gem5 Trace-Driven Simulator for Multicore Systems
International audienceMulticore system analysis requires efficient solutions for architectural parameter and scalability exploration. Long simulation time is the main drawback of current simulation approaches. In order to reduce the simulation time while keeping the accuracy levels, trace-driven simulation approaches have been developed. However, existing approaches do not allow multicore exploration or do not capture the behavior of multi-threaded programs. Based on the gem5 simulator, we developed a novel synchronization mechanism for multicore analysis based on the trace collection of synchronization events, instruction and dependencies. It allows efficient architectural parameter and scalability exploration with acceptable simulation speed and accuracy
Model-Based Performance Prediction for Concurrent Software on Multicore Architectures
Model-based performance prediction is a well-known concept to ensure the quality of software.Current approaches are based on a single-metric model, which leads to inaccurate predictions for modern architectures.
This thesis presents a multi-strategies approach to extend performance prediction models to support multicore architectures.We implemented the strategies into Palladio and significantly increased the performance prediction power
Sampled simulation of task-based programs
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other worksSampled simulation is a mature technique for reducing simulation time of single-threaded programs. Nevertheless, current sampling techniques do not take advantage of other execution models, like task-based execution, to provide both more accurate and faster simulation. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals, we employ a novel fast-forwarding mechanism for dynamically scheduled programs. We evaluate different automatic techniques for clustering task instances and show that DBSCAN clustering combined with analytical performance modeling provides the best trade-off of simulation speed and accuracy. TaskPoint is the first technique combining sampled simulation and analytical modeling and provides a new way to trade off simulation speed and accuracy. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 8 simulated threads by an average factor of 220x at an average error of 0.5 percent and a maximum error of 7.9 percent.Peer ReviewedPostprint (author's final draft
Topology-aware GPU scheduling for learning workloads in cloud environments
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments.
This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.This project is supported by the IBM/BSC Technology Center for Supercomputing
collaboration agreement. It has also received funding from the European Research Council (ERC) under the European Union’s Horizon
2020 research and innovation programme (grant agreement No 639595). It is
also partially supported by the Ministry of Economy of Spain under contract
TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051,
by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program
(SEV-2015-0493). We thank our IBM Research colleagues Alaa Youssef
and Asser Tantawi for the valuable discussions. We also thank SC17 committee
member Blair Bethwaite of Monash University for his constructive feedback on the earlier drafts of this paper.Peer ReviewedPostprint (published version
Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems faces virtual memory as the
first big showstopper: without efficient hardware support for address
translation MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that while limiting the associativity of the
virtual-to-physical mapping incurs no penalty, it can break the
translate-then-fetch serialization if combined with careful data placement in
the MPU's memory, allowing for translation and data fetch to proceed
independently and in parallel. We propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure
- …