15 research outputs found
Many-Core Scheduling of Data Parallel Applications Using SMT Solvers
Abstract—To program recently developed many-core systems-on-chip, two traditionally separate performance optimization problems must be solved together. The first is parallel scheduling on a shared-memory multi-core system; the second is the co-scheduling of network communication and processor computation, which arises because many-core systems are networks of multi-core clusters. In this paper, we demonstrate the applicability of modern constraint solvers to efficiently schedule parallel applications on many-cores and validate the results by running benchmarks on a real many-core platform. Index Terms—task graph, scheduling, multiprocessor, DMA
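The constraint formulation behind such an encoding can be illustrated without a solver. The sketch below is a hypothetical toy instance (task names, durations, and the two-core platform are invented, and the exhaustive search stands in for the solver): it enumerates mappings and start times for a three-task graph and keeps only assignments satisfying the precedence and non-overlap constraints an SMT solver would receive.

```python
from itertools import product

# Toy task graph: durations and precedence edges (hypothetical values).
dur = {"A": 3, "B": 4, "C": 2}
deps = [("A", "C"), ("B", "C")]  # C may start only after A and B finish
CORES, HORIZON = 2, 10

def feasible(proc, start):
    # Precedence: a task starts only after each predecessor completes.
    for u, v in deps:
        if start[v] < start[u] + dur[u]:
            return False
    # Mutual exclusion: tasks mapped to the same core must not overlap in time.
    for u, v in product(dur, dur):
        if u < v and proc[u] == proc[v]:
            if not (start[u] + dur[u] <= start[v] or start[v] + dur[v] <= start[u]):
                return False
    return True

best = None
for procs in product(range(CORES), repeat=len(dur)):
    proc = dict(zip(dur, procs))
    for starts in product(range(HORIZON), repeat=len(dur)):
        start = dict(zip(dur, starts))
        if feasible(proc, start):
            makespan = max(start[t] + dur[t] for t in dur)
            if best is None or makespan < best[0]:
                best = (makespan, proc, start)

print("optimal makespan:", best[0])  # A and B run in parallel, then C
```

An SMT solver explores the same space symbolically rather than by enumeration, which is what makes realistic problem sizes tractable.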
Localizing FRBs through VLBI with the Algonquin Radio Observatory 10 m Telescope
The Canadian Hydrogen Intensity Mapping Experiment (CHIME)/FRB experiment has detected thousands of fast radio bursts (FRBs) owing to its sensitivity and wide field of view; however, its low angular resolution prevents it from localizing events to their host galaxies. Very long baseline interferometry (VLBI), triggered by FRB detections from CHIME/FRB, will solve the challenge of localization for non-repeating events. Using a refurbished 10 m radio dish at the Algonquin Radio Observatory in Ontario, Canada, we developed a testbed for a VLBI experiment with a theoretical λ/D ≲ 30 mas. We provide an overview of the 10 m system and describe its refurbishment, the data acquisition, and a procedure for fringe fitting that simultaneously estimates the geometric delay used for localization and the dispersive delay from the ionosphere. Using single pulses from the Crab pulsar, we validate the system and localization procedure, and analyze the clock stability between sites, which is critical for coherently delay-referencing an FRB event. We find that a localization of ~200 mas is possible with the performance of the current (single-baseline) system. Furthermore, for sources with insufficient signal or restricted bandwidth to simultaneously measure both geometric and ionospheric delays, we show that the differential ionospheric contribution between the two sites must be measured to a precision of 1 × 10⁻⁸ pc cm⁻³ to provide a reasonable localization from a detection in the 400-800 MHz band. Finally, we show the detection of an FRB observed simultaneously by CHIME and the Algonquin 10 m telescope, the first non-repeating FRB detected on this long baseline. This project serves as a testbed for the forthcoming CHIME/FRB Outriggers project.
A fast radio burst localized at detection to a galactic disk using very long baseline interferometry
Fast radio bursts (FRBs) are millisecond-duration, luminous radio transients
of extragalactic origin. These events have been used to trace the baryonic
structure of the Universe using their dispersion measure (DM) assuming that the
contribution from host galaxies can be reliably estimated. However,
contributions from the immediate environment of an FRB may dominate the
observed DM, thus making redshift estimates challenging without a robust host
galaxy association. Furthermore, while at least one Galactic burst has been
associated with a magnetar, other localized FRBs argue against magnetars as the
sole progenitor model. Precise localization within the host galaxy can
discriminate between progenitor models, a major goal of the field. Until now,
localizations on this spatial scale have only been carried out in follow-up
observations of repeating sources. Here we demonstrate the localization of FRB
20210603A with very long baseline interferometry (VLBI) on two baselines, using
data collected only at the time of detection. We localize the burst to SDSS
J004105.82+211331.9, an edge-on galaxy at , and detect recent
star formation in the kiloparsec-scale vicinity of the burst. The edge-on
inclination of the host galaxy allows for a unique comparison between the line
of sight towards the FRB and lines of sight towards known Galactic pulsars. The
DM, Faraday rotation measure (RM), and scattering suggest a progenitor
coincident with the host galactic plane, strengthening the link between the
environment of FRB 20210603A and the disk of its host galaxy. Single-pulse VLBI
localizations of FRBs to within their host galaxies, following the one
presented here, will further constrain the origins and host environments of
one-off FRBs. Comment: 40 pages, 13 figures, submitted. Fixed typo in abstract.
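The dispersion measure (DM) central to this abstract has a simple observable consequence: a burst arrives later at lower frequencies by an amount proportional to DM. The sketch below applies the standard cold-plasma delay formula with an illustrative DM value (not from this paper):

```python
# Cold-plasma dispersion delay between two observing frequencies.
# Standard dispersion constant: ~4.149 ms GHz^2 / (pc cm^-3).
K_DM_MS = 4.149

def dispersion_delay_ms(dm_pc_cm3: float, f_lo_ghz: float, f_hi_ghz: float) -> float:
    """Extra arrival delay at f_lo relative to f_hi for a given DM."""
    return K_DM_MS * dm_pc_cm3 * (f_lo_ghz**-2 - f_hi_ghz**-2)

# Example: a hypothetical DM of 500 pc cm^-3 swept across the 400-800 MHz band.
delay = dispersion_delay_ms(500.0, 0.4, 0.8)
print(f"sweep across the band: {delay / 1000:.1f} s")
```

This is why the observed DM, if dominated by the burst's local environment, biases any redshift estimate derived from it.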
CHIME/FRB Discovery of 25 Repeating Fast Radio Burst Sources
We present the discovery of 25 new repeating fast radio burst (FRB) sources
found among CHIME/FRB events detected between 2019 September 30 and 2021 May 1.
The sources were found using a new clustering algorithm that looks for multiple
events co-located on the sky having similar dispersion measures (DMs). The new
repeaters have DMs ranging from 220 pc cm⁻³ to 1700 pc
cm⁻³, and include sources having exhibited as few as two bursts to as many
as twelve. We report a statistically significant difference in both the DM and
extragalactic DM (eDM) distributions between repeating and apparently
nonrepeating sources, with repeaters having lower mean DM and eDM, and we
discuss the implications. We find no clear bimodality between the repetition
rates of repeaters and upper limits on repetition from apparently nonrepeating
sources after correcting for sensitivity and exposure effects, although some
active repeating sources stand out as anomalous. We measure the repeater
fraction and find that it tends to an equilibrium of % over
our exposure thus far. We also report on 14 more sources which are promising
repeating FRB candidates and which merit follow-up observations for
confirmation. Comment: Submitted to ApJ. Comments are welcome and follow-up observations are
encouraged
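The clustering idea described above — group events that are close on the sky and similar in DM — can be sketched with a toy single-linkage pass. All event values and thresholds below are invented for illustration; the survey's actual algorithm is more involved:

```python
import math

# Toy events: (ra_deg, dec_deg, dm_pc_cm3) -- invented values.
events = [
    (10.00, 41.20, 350.0),
    (10.01, 41.21, 352.0),   # near event 0 in both sky position and DM
    (150.50, -5.00, 870.0),
    (150.49, -5.01, 869.0),  # near event 2
    (200.00, 30.00, 351.0),  # similar DM to event 0 but far away on the sky
]
MAX_SEP_DEG, MAX_DDM = 0.1, 5.0

def angular_sep_deg(e1, e2):
    # Small-angle approximation, with the cos(dec) factor on the RA offset.
    dra = (e1[0] - e2[0]) * math.cos(math.radians(e1[1]))
    ddec = e1[1] - e2[1]
    return math.hypot(dra, ddec)

# Single-linkage clustering via union-find: link pairs passing both cuts.
parent = list(range(len(events)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i in range(len(events)):
    for j in range(i + 1, len(events)):
        if (angular_sep_deg(events[i], events[j]) < MAX_SEP_DEG
                and abs(events[i][2] - events[j][2]) < MAX_DDM):
            parent[find(i)] = find(j)

clusters = {}
for i in range(len(events)):
    clusters.setdefault(find(i), []).append(i)
print(sorted(sorted(c) for c in clusters.values()))
```

Requiring agreement in both coordinates and DM is what separates a true repeater from chance coincidences of unrelated bursts along nearby sightlines.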
Sub-second periodicity in a fast radio burst
Fast radio bursts (FRBs) are millisecond-duration flashes of radio waves that
are visible at distances of billions of light-years. The nature of their
progenitors and their emission mechanism remain open astrophysical questions.
Here we report the detection of the multi-component FRB 20191221A and the
identification of a periodic separation of 216.8(1) ms between its components
with a significance of 6.5σ. The long (~3 s) duration and nine or more
components forming the pulse profile make this source an outlier in the FRB
population. Such short periodicity provides strong evidence for a neutron-star
origin of the event. Moreover, our detection favours emission arising from the
neutron-star magnetosphere, as opposed to emission regions located further away
from the star, as predicted by some models.Comment: Updated to conform to the accepted versio
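A periodic component separation like the one reported can be found by scanning candidate periods and asking how tightly the component arrival times cluster in phase. The sketch below uses synthetic arrival times (not the paper's data) and a simple phase-concentration statistic:

```python
import cmath

# Synthetic component arrival times (ms), spaced by a true period of 216.8 ms.
TRUE_P = 216.8
times = [n * TRUE_P for n in range(9)]

def phase_concentration(period_ms):
    """Resultant length of the arrival phases; 1.0 means perfectly periodic."""
    z = sum(cmath.exp(2j * cmath.pi * (t / period_ms)) for t in times)
    return abs(z) / len(times)

# Grid search over candidate periods around the expected range.
candidates = [200.0 + 0.1 * i for i in range(301)]  # 200.0 .. 230.0 ms
best_p = max(candidates, key=phase_concentration)
print(f"best period: {best_p:.1f} ms")
```

In practice the significance of the recovered period must be assessed against trials over many candidate periods and the chance alignment of noisy component times.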
Mapping and scheduling on multi-core processors using an SMT solver
In order to achieve performance gains in software, computers have evolved to multi-core and many-core platforms abounding with multiple processor cores. However, the problem of finding efficient ways to execute parallel software on these platforms is hard. With a large number of processor cores available, the software must orchestrate the communication and synchronization along with the execution of the code. Communication corresponds to the transport of data between different processors, which can either be handled transparently by the hardware or explicitly managed by the software. Synchronization is the requirement of properly selecting the start times of computations, e.g., the condition that software tasks begin execution only after all their dependencies are satisfied.

Models which represent the algorithms in a structured and formal way expose the available parallelism. Deployment of the software algorithms represented by such models needs a specification of which processor to execute the tasks on (mapping) and when to execute them (scheduling). Mapping and scheduling is a hard combinatorial problem with a huge design space containing an exponential number of solutions. In addition, the solutions are evaluated according to different costs that need to be optimized, such as memory consumption, execution time, static power consumption, resources used, etc. Such a problem with multiple costs is called a multi-criteria optimization problem. The solution to this problem is not a unique single solution, but a set of incomparable solutions called Pareto solutions. To tackle multi-criteria problems, special algorithms are needed which can approximate the Pareto solutions in the design space.

In this thesis we target a class of applications called streaming applications, which process a continuous stream of data. These applications typically apply similar computation to different data items. A common class of models called dataflow models conveniently expresses such applications. In this thesis, we deal with mapping and scheduling of dataflow applications on many-core platforms. We encode this problem in the form of logical constraints and present it to satisfiability modulo theories (SMT) solvers. SMT solvers solve the encoded problem by using a combination of search techniques and constraint propagation to find an assignment to the problem variables satisfying the given cost constraints.

In dataflow applications, the design space explodes with an increased number of tasks and processors. In this thesis, we tackle this problem by introducing symmetry reduction techniques and demonstrate that symmetry breaking accelerates search in SMT solvers, increasing the size of the problem that can be solved. Our design-space exploration algorithm approximates the Pareto front of the problem and produces solutions with different cost trade-offs. We validate these solutions by executing them on a real multi-core platform. Further, we extend the scheduling problem to many-core platforms, which are assembled from multi-core clusters connected by a network-on-chip. We provide a design flow which performs mapping of the applications on such platforms and automatic insertion of additional elements to model the communication. We demonstrate how communication with bounded memory can be performed by correctly modeling the flow control. We provide experimental results obtained on the 256-processor Kalray MPPA-256 platform.

Multi-core processors typically have a small amount of memory close to the processor. Generally, application data does not fit in the local memory. We study a class of parallel applications having a regular data access pattern, with a large amount of data to be processed by a uniform computation. Such applications are commonly found in image processing. The data must be brought from main memory to local memory, processed, and then the results written back to main memory, all in batches.
Selecting the proper granularity of the data that is brought into local memory is an optimization problem. We formalize this problem and provide a way to determine the optimal transfer granularity depending on the characteristics of the application and the hardware platform. Further, we provide a technique to analyze different data exchange mechanisms for the case where some data is shared between different computations.

Applications in modern embedded systems can start and stop dynamically. In order to execute all these applications efficiently and to optimize global costs such as power consumption, execution time, etc., the applications must be reconfigured at runtime. We present a predictable and composable (executing independently without affecting others) way of migrating tasks according to the reconfiguration decision.
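The transfer-granularity problem described in this abstract can be illustrated with a toy double-buffering cost model. All constants below are invented, and the model is a simplification of what a thesis-level formulation would capture (DMA setup cost per block, per-byte transfer and compute rates, overlap of transfer and compute):

```python
# Toy cost model for batched DMA + compute with double buffering.
#   transfer(g) = SETUP + T_BYTE * g   (per-block DMA time)
#   compute(g)  = C_BYTE * g           (per-block processing time)
# With double buffering, the steady-state cost per block is the slower of the
# two; the first transfer and the last computation are not overlapped.
SETUP, T_BYTE, C_BYTE = 100.0, 1.0, 2.0  # hypothetical platform constants
TOTAL = 4800                             # total bytes to process

def total_time(g):
    blocks = TOTAL // g
    transfer = SETUP + T_BYTE * g
    compute = C_BYTE * g
    return transfer + blocks * max(transfer, compute) + compute

# Search over block sizes that evenly divide the data.
sizes = [g for g in range(1, TOTAL + 1) if TOTAL % g == 0]
best_g = min(sizes, key=total_time)
print("optimal block size:", best_g)
```

The optimum sits near the crossover where per-block transfer and compute times balance: too-small blocks pay the DMA setup cost too often, while too-large blocks leave the DMA idle behind the computation.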
Mapping and scheduling on multi-core processors using SMT solvers
In order to achieve performance gains, computers have evolved to multi-core and many-core platforms abounding with multiple processor cores. However, the problem of finding efficient ways to execute parallel software on them is hard. With a large number of processor cores available, the software must orchestrate the communication and synchronization along with the code execution. Communication corresponds to the transport of data between different processors, handled transparently by the hardware or explicitly by the software.

Models which represent the algorithms in a structured and formal way expose the available parallelism. Deployment of the software algorithms represented by such models needs a specification of which processor to execute the tasks on (mapping) and when to execute them (scheduling). Mapping and scheduling is a hard combinatorial problem with an exponential number of solutions. In addition, the solutions have multiple costs that need to be optimized, such as memory consumption, execution time, resources used, etc. Such a problem with multiple costs is called a multi-criteria optimization problem. The solution to this problem is a set of incomparable solutions called Pareto solutions, which need special algorithms to approximate them.

We target a class of applications called streaming applications, which process a continuous stream of data. These applications apply similar computation to different data items and can be conveniently expressed by a class of models called dataflow models. We encode the mapping and scheduling problem in the form of logical constraints and present it to satisfiability modulo theories (SMT) solvers. SMT solvers solve the encoded problem by using a combination of search techniques and constraint propagation to find an assignment to the problem variables satisfying the given cost constraints.

In dataflow applications, the design space explodes with an increased number of tasks and processors. In this thesis, we tackle this problem by introducing symmetry reduction techniques and demonstrate that symmetry breaking accelerates search in SMT solvers, increasing the size of the problem that can be solved. Our design-space exploration algorithm approximates the Pareto front of the problem and produces solutions with different cost trade-offs. Further, we extend the scheduling problem to many-core platforms, which are a group of multi-core platforms connected by a network-on-chip. We provide a design flow which performs mapping of the applications on such platforms and automatic insertion of additional elements to model the communication using bounded memory. We provide experimental results obtained on the 256-processor Kalray and the Tilera TILE-64 platforms.

Multi-core processors typically have a small amount of memory close to the processor, generally insufficient for all application data to fit. We study a class of parallel applications having a regular data access pattern and a large amount of data to be processed by a uniform computation. The data must be brought from main memory to local memory, processed, and then the results written back to main memory, all in batches. Selecting the proper granularity of the data that is brought into local memory is an optimization problem. We formalize this problem and provide a way to determine the optimal transfer granularity depending on the characteristics of the application and the hardware platform.

In addition to the scheduling problems and local memory management, we study a part of the problem of runtime management of the applications. Applications in modern embedded systems can start and stop dynamically. In order to execute all the applications efficiently and to optimize global costs such as power consumption, execution time, etc., the applications must be reconfigured dynamically at runtime.
We present a predictable and composable (executing independently without affecting others) way of migrating tasks according to the reconfiguration decision.
Allocation and scheduling on multi-core processors with SMT solvers
In order to achieve performance gains, computers have evolved to multi-core and many-core platforms abounding with multiple processor cores. However, the problem of finding efficient ways to execute parallel software on them is hard. With a large number of processor cores available, the software must orchestrate the communication and synchronization along with the code execution.
Communication corresponds to the transport of data between different processors, handled transparently by the hardware or explicitly by the software.

Models which represent the algorithms in a structured and formal way expose the available parallelism. Deployment of the software algorithms represented by such models needs a specification of which processor to execute the tasks on (mapping) and when to execute them (scheduling). Mapping and scheduling is a hard combinatorial problem with an exponential number of solutions. In addition, the solutions have multiple costs that need to be optimized, such as memory consumption, execution time, resources used, etc. Such a problem with multiple costs is called a multi-criteria optimization problem. The solution to this problem is a set of incomparable solutions called Pareto solutions, which need special algorithms to approximate them.

We target a class of applications called streaming applications, which process a continuous stream of data. These applications apply similar computation to different data items and can be conveniently expressed by a class of models called dataflow models. We encode the mapping and scheduling problem in the form of logical constraints and present it to satisfiability modulo theories (SMT) solvers. SMT solvers solve the encoded problem by using a combination of search techniques and constraint propagation to find an assignment to the problem variables satisfying the given cost constraints. In dataflow applications, the design space explodes with an increased number of tasks and processors. In this thesis, we tackle this problem by introducing symmetry reduction techniques and demonstrate that symmetry breaking accelerates search in SMT solvers, increasing the size of the problem that can be solved. Our design-space exploration algorithm approximates the Pareto front of the problem and produces solutions with different cost trade-offs.

Further, we extend the scheduling problem to many-core platforms, which are a group of multi-core platforms connected by a network-on-chip. We provide a design flow which performs mapping of the applications on such platforms and automatic insertion of additional elements to model the communication using bounded memory. We provide experimental results obtained on the 256-processor Kalray and the Tilera TILE-64 platforms.

Multi-core processors typically have a small amount of memory close to the processor, generally insufficient for all application data to fit. We study a class of parallel applications having a regular data access pattern and a large amount of data to be processed by a uniform computation. The data must be brought from main memory to local memory, processed, and then the results written back to main memory, all in batches. Selecting the proper granularity of the data that is brought into local memory is an optimization problem. We formalize this problem and provide a way to determine the optimal transfer granularity depending on the characteristics of the application and the hardware platform.

In addition to the scheduling problems and local memory management, we study a part of the problem of runtime management of the applications. Applications in modern embedded systems can start and stop dynamically. In order to execute all the applications efficiently and to optimize global costs such as power consumption, execution time, etc., the applications must be reconfigured dynamically at runtime. We present a predictable and composable (executing independently without affecting others) way of migrating tasks according to the reconfiguration decision.
A Case Study into Predictable and Composable MPSoC Reconfiguration
Abstract—The number of applications running concurrently on an MPSoC is ever increasing. Moreover, the set of running applications is often unknown at design time. Part of the resource allocation decisions must therefore be deferred to run-time. This requires a run-time manager to optimize the resource usage of the system to preserve energy and allow as many applications as possible to use the resources simultaneously. An effective resource manager should therefore be able to reconfigure the resource assignment of running applications. To this end, a run-time task migration mechanism is needed. A user should, however, not notice the reconfiguration, as this would impact the perceived quality of the system. Hence, the reconfiguration mechanism should provide timing guarantees on its operation and it should not interfere with other applications running on the same system (i.e., it should be predictable and composable). In this paper, we present a practical implementation of such a predictable and composable MPSoC reconfiguration mechanism. We demonstrate the use of this mechanism on a JPEG decoder whose tasks are migrated at run-time while running on a state-of-the-art MPSoC platform. Index Terms—Task migration, real time systems, timing guarantees
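One key idea behind such a migration mechanism — moving a task only at a well-defined point where its state is small and explicit — can be sketched schematically. The toy task below is invented for illustration and is not the paper's implementation; it shows that a checkpoint/resume at an iteration boundary leaves the task's results unchanged:

```python
# Schematic task migration at iteration boundaries: the task's whole state is
# an explicit, serializable record, so it can be checkpointed on one "tile"
# and resumed on another with identical results.

def run_task(state, steps):
    """Run `steps` iterations of a toy streaming task; all state is explicit."""
    for _ in range(steps):
        state["acc"] = (state["acc"] + state["next_input"]) % 997
        state["next_input"] += 1
    return state

# Uninterrupted run on a single tile.
ref = run_task({"acc": 0, "next_input": 1}, 10)

# Same workload, but migrated after 4 iterations: checkpoint, move, resume.
state = run_task({"acc": 0, "next_input": 1}, 4)
checkpoint = dict(state)            # serialize at the iteration boundary
migrated = run_task(checkpoint, 6)  # resume on the destination tile

print(migrated == ref)  # migration is transparent to the task's result
```

The timing guarantees the paper targets come from bounding how long the checkpoint-and-move step can take and from ensuring it uses no resources shared with other applications.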