Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling
We propose a model for scheduling jobs in a parallel machine setting that takes into account the cost of migrations by assuming that the processing time of a job may depend on the specific set of machines among which the job is migrated. For the makespan minimization objective, the model generalizes classical scheduling problems such as unrelated parallel machine scheduling, as well as novel ones such as semi-partitioned and clustered scheduling. In the case of a hierarchical family of machines, we derive a compact integer linear programming formulation of the problem and leverage its fractional relaxation to obtain a polynomial-time 2-approximation algorithm. Extensions that incorporate memory capacity constraints are also discussed.
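As a rough illustration of the model described in this abstract, the sketch below checks whether a family of machine sets is hierarchical (laminar: any two sets are nested or disjoint) and computes a simple lower bound on the makespan of a given job-to-set assignment. It is not the paper's integer program or its 2-approximation algorithm. The names `is_laminar`, `makespan_lower_bound`, `assignment`, and `proc_time` are hypothetical, and the bound rests only on two necessary conditions: a job cannot run in parallel with itself, and the work confined to a machine set must fit on that set's machines.

```python
from itertools import combinations


def is_laminar(family):
    """Return True if every two sets in the family are nested or disjoint."""
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(family, 2))


def makespan_lower_bound(family, assignment, proc_time):
    """Lower bound on the makespan of a fixed assignment (illustrative only).

    family:     iterable of frozensets of machine ids (assumed laminar)
    assignment: dict job -> frozenset from the family the job may migrate within
    proc_time:  dict (job, frozenset) -> processing time on that machine set
    """
    # A single job cannot be processed on two machines at the same time.
    bound = max(proc_time[(job, s)] for job, s in assignment.items())
    for alpha in family:
        # Work of all jobs confined to alpha must fit on alpha's machines.
        work = sum(proc_time[(job, s)]
                   for job, s in assignment.items() if s <= alpha)
        bound = max(bound, work / len(alpha))
    return bound


m0, m1, both = frozenset({0}), frozenset({1}), frozenset({0, 1})
family = [m0, m1, both]
assignment = {"a": m0, "b": m1, "c": both}
proc_time = {("a", m0): 3, ("b", m1): 2, ("c", both): 3}
print(is_laminar(family), makespan_lower_bound(family, assignment, proc_time))
```

In this toy instance the pair of machines carries 8 units of work in total, so no schedule of this assignment can finish before time 4, even though no single job is longer than 3.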
Efficient Machine-Independent Programming of High-Performance Multiprocessors
Parallel computing is regarded by most computer scientists as the most
likely approach for significantly improving computing power for scientists
and engineers. Advances in programming languages and parallelizing
compilers are making parallel computers easier to use by providing
a high-level portable programming model that protects software
investment. However, experience has shown that simply finding
parallelism is not always sufficient for obtaining good performance
from today's multiprocessors. The goal of this project is to develop
advanced compiler analysis of data and computation decompositions,
thread placement, communication, synchronization, and memory system
effects needed in order to take advantage of performance-critical
elements in modern parallel architectures.
High Performance Depthwise and Pointwise Convolutions on Mobile Devices
Lightweight convolutional neural networks (e.g., MobileNets) are specifically
designed to carry out inference directly on mobile devices. Across these
lightweight models, depthwise convolution (DWConv) and pointwise convolution
(PWConv) are the key operations. In this paper, we observe that existing
implementations of DWConv and PWConv do not fully utilize the ARM processors
in mobile devices: they incur many cache misses in multi-core execution and
reuse data poorly at the register level. We propose techniques to re-optimize
the implementations of DWConv and PWConv for the ARM architecture. Experimental
results show that our implementation achieves speedups of up to 5.5x on DWConv
and 2.1x on PWConv over TVM (Chen et al. 2018).
Comment: 8 pages, Thirty-Fourth AAAI Conference on Artificial Intelligence
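For readers unfamiliar with the two operations named in this abstract, the following is a naive reference sketch in plain Python, not the optimized ARM implementation the paper proposes. It assumes stride 1, no padding, and nested lists in channel-first layout. The function names `depthwise_conv` and `pointwise_conv` are illustrative. As is usual in neural networks, the kernel is applied without flipping.

```python
def depthwise_conv(x, w):
    """Depthwise convolution: each channel is filtered by its own K x K kernel.

    x: input as [C][H][W]; w: kernels as [C][K][K]. Stride 1, no padding.
    """
    height, width, k = len(x[0]), len(x[0][0]), len(w[0])
    return [
        [[sum(x[c][i + di][j + dj] * w[c][di][dj]
              for di in range(k) for dj in range(k))
          for j in range(width - k + 1)]
         for i in range(height - k + 1)]
        for c in range(len(x))
    ]


def pointwise_conv(x, w):
    """Pointwise (1 x 1) convolution: mixes channels at each pixel.

    x: input as [C_in][H][W]; w: weights as [C_out][C_in].
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(w[o][c] * x[c][i][j] for c in range(c_in))
          for j in range(width)]
         for i in range(height)]
        for o in range(len(w))
    ]


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
dw_kernels = [[[1, 1], [1, 1]],
              [[1, 0], [0, 1]]]
pw_weights = [[1, 1], [1, -1]]

dw_out = depthwise_conv(x, dw_kernels)
pw_out = pointwise_conv(dw_out, pw_weights)
print(dw_out)
print(pw_out)
```

Chaining the two, as done above, gives the depthwise separable pattern used by MobileNet-style networks: spatial filtering happens per channel, and channel mixing happens per pixel.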
A Flexible Scheduler for Hierarchical Multiprocessor Machines (Un ordonnanceur flexible pour machines multiprocesseurs hiérarchiques)
The evolution of multiprocessor machines toward increasingly hierarchical architectures means that, to get the most out of them, execution flows and data must be distributed with extreme care in order to minimize non-local memory accesses. Current multithreading libraries provide very few features for expressing distribution directives at the application level, which forces programmers to perform this distribution explicitly according to the underlying architecture, and therefore in a non-portable way. In this article we present: (1) a model that lets the program dynamically express the structure of the computation; (2) a scheduler able to interpret this model in order to make judicious hierarchical placement decisions; (3) an implementation within the Marcel user-level thread library. An experiment was conducted on a scientific application running on a Bull NovaScale ccNUMA machine with 16 Intel Itanium II processors; the results show a 50% gain over a conventional scheduler and are comparable to those obtained with manual placement, which is not portable.
A QoS Monitoring System for Dataflow Programs
With the generalization of multi-core processors, dataflow programming is regaining strong interest, especially in the context of compute-intensive multimedia applications such as video decoding. However, most studies focus on static approaches to the compilation and placement problems. We advocate dynamic adaptation of dataflow applications. In this paper, we take the first step towards this goal, namely a monitoring mechanism for observing quality-of-service properties of programs at run-time. We propose a language extension for expressing simple QoS properties over dataflow programs, together with a run-time mechanism for observing the events relevant to establishing QoS. We show that such mechanisms have a limited impact on overall application performance.