Search CORE

62 research outputs found

AI at the Edge, 2021 EPoSS White Paper

Author: Bierzynski Kay
Calvo Alonso Daniel
Gandhi Kaustubh
Lehment Nicolas
Mayer Dirk
Nackaerts Axel
Neul Reinhard
Peischl Bernhard
Rix Nigel
Rzepka Sven
Röhm Horst
Seifert Inessa
Steimetz Elisabeth
Stree Bernard
Tedesco Salvatore
Veledar Omar
Wilsch Benjamin
Publication venue: EPoSS: European Technology Platform on Smart Systems Integration
Publication date: 25/06/2021
Field of study

In this paper members of the European Platform on Smart Systems Integration (EPoSS) have collected their views on the benefits of incorporating Artificial Intelligence in future Smart devices and defined the actions required to achieve this to implement "AI at the Edge"

Revamping Timing Error Resilience to Tackle Choke Points at NTC

Author: Bal Aatreyi
Publication venue: DigitalCommons@USU
Publication date: 01/03/2010
Field of study

The growing market of portable devices and smart wearables has contributed to innovation and development of systems with longer battery-life. While Near Threshold Computing (NTC) systems address the need for longer battery-life, they have certain limitations. NTC systems are prone to be significantly affected by variations in the fabrication process, commonly called process variation (PV). This dissertation explores an intriguing effect of PV, called choke points. Choke points are especially important due to their multifarious influence on the functional correctness of an NTC system. This work shows why novel research is required in this direction and proposes two techniques to resolve the problems created by choke points, while maintaining the reduced power needs

Online Modeling and Tuning of Parallel Stream Processing Systems

Author: Beard Jonathan Curtis
Publication venue: Washington University Open Scholarship
Publication date: 15/08/2015
Field of study

Writing performant computer programs is hard. Code for high performance applications is profiled, tweaked, and re-factored for months specifically for the hardware for which it is to run. Consumer application code doesn\u27t get the benefit of endless massaging that benefits high performance code, even though heterogeneous processor environments are beginning to resemble those in more performance oriented arenas. This thesis offers a path to performant, parallel code (through stream processing) which is tuned online and automatically adapts to the environment it is given. This approach has the potential to reduce the tuning costs associated with high performance code and brings the benefit of performance tuning to consumer applications where otherwise it would be cost prohibitive. This thesis introduces a stream processing library and multiple techniques to enable its online modeling and tuning. Stream processing (also termed data-flow programming) is a compute paradigm that views an application as a set of logical kernels connected via communications links or streams. Stream processing is increasingly used by computational-x and x-informatics fields (e.g., biology, astrophysics) where the focus is on safe and fast parallelization of specific big-data applications. A major advantage of stream processing is that it enables parallelization without necessitating manual end-user management of non-deterministic behavior often characteristic of more traditional parallel processing methods. Many big-data and high performance applications involve high throughput processing, necessitating usage of many parallel compute kernels on several compute cores. Optimizing the orchestration of kernels has been the focus of much theoretical and empirical modeling work. Purely theoretical parallel programming models can fail when the assumptions implicit within the model are mis-matched with reality (i.e., the model is incorrectly applied). Often it is unclear if the assumptions are actually being met, even when verified under controlled conditions. Full empirical optimization solves this problem by extensively searching the range of likely configurations under native operating conditions. This, however, is expensive in both time and energy. For large, massively parallel systems, even deciding which modeling paradigm to use is often prohibitively expensive and unfortunately transient (with workload and hardware). In an ideal world, a parallel run-time will re-optimize an application continuously to match its environment, with little additional overhead. This work presents methods aimed at doing just that through low overhead instrumentation, modeling, and optimization. Online optimization provides a good trade-off between static optimization and online heuristics. To enable online optimization, modeling decisions must be fast and relatively accurate. Online modeling and optimization of a stream processing system first requires the existence of a stream processing framework that is amenable to the intended type of dynamic manipulation. To fill this void, we developed the RaftLib C++ template library, which enables usage of the stream processing paradigm for C++ applications (it is the run-time which is the basis of almost all the work within this dissertation). An application topology is specified by the user, however almost everything else is optimizable by the run-time. RaftLib takes advantage of the knowledge gained during the design of several prior streaming languages (notably Auto-Pipe). The resultant framework enables online migration of tasks, auto-parallelization, online buffer-reallocation, and other useful dynamic behaviors that were not available in many previous stream processing systems. Several benchmark applications have been designed to assess the performance gains through our approaches and compare performance to other leading stream processing frameworks. Information is essential to any modeling task, to that end a low-overhead instrumentation framework has been developed which is both dynamic and adaptive. Discovering a fast and relatively optimal configuration for a stream processing application often necessitates solving for buffer sizes within a finite capacity queueing network. We show that a generalized gain/loss network flow model can bootstrap the process under certain conditions. Any modeling effort, requires that a model be selected; often a highly manual task, involving many expensive operations. This dissertation demonstrates that machine learning methods (such as a support vector machine) can successfully select models at run-time for a streaming application. The full set of approaches are incorporated into the open source RaftLib framework

Washington University St. Louis: Open Scholarship

Efficient Software Implementation of Stream Programs

Author: Cedersjö Gustav
Publication venue: Department of Computer Science, Lund University
Publication date: 01/01/2017
Field of study

The way we use computers and mobile phones today requires large amounts of processing of data streams. Examples include digital signal processing for wireless transmission, audio and video coding for recording and watching videos, and noise reduction for the phone calls. These tasks can be performed by stream programs—computer programs that process streams of data. Stream programs can be composed of other stream programs. Components of a composition are connected in a network, i.e. the output streams of one component are sent as input streams to other components. The components, that perform the actual computation, are called kernels. They can be described in different styles and programming languages. There are also formal models for describing the kernels and the networks. One such model is the actor machine.This dissertation evaluates the actor machine, how it facilitates creating efficient software implementation of stream programs. The evaluation is divided into four aspects: (1) analyzability of its structure, (2) generality in what languages and styles it can express, (3) efficient implementation of kernels, and (4) efficient implementation of networks. This dissertation demonstrates all four aspects through implementation and evaluation of a stream program compiler based on actor machines

Lund University Publications

Placement de graphes de tâches de grande taille sur architectures massivement multicoeurs

Author: Berger Karl-Eduard
Publication venue: HAL CCSD
Publication date: 08/12/2015
Field of study

This Ph.D thesis is devoted to the study of the mapping problem related to massively parallel embedded architectures. This problem arises from industrial needs like energy savings, performance demands for synchronous dataflow applications. This problem has to be solved considering three criteria: heuristics should be able to deal with applications with various sizes, they must meet the constraints of capacities of processors and they have to take into account the target architecture topologies. In this thesis, tasks are organized in communication networks, modeled as graphs. In order to determine a way of evaluating the efficiency of the developed heuristics, mappings, obtained by the heuristics, are compared to a random mapping. This comparison is used as an evaluation metric throughout this thesis. The existence of this metric is motivated by the fact that no comparative heuristics can be found in the literature at the time of writing of this thesis. In order to address this problem, two heuristics are proposed. They are able to solve a dataflow process network mapping problem, where a network of communicating tasks is placed into a set of processors with limited resource capacities, while minimizing the overall communication bandwidth between processors. They are applied on task graphs where weights of tasks and edges are unitary set. The first heuristic, denoted as Task-wise Placement, places tasks one after another using a notion of task affinities. The second algorithm, named Subgraph-wise Placement, gathers tasks in small groups then place the different groups on processors using a notion of affinities between groups and processors. These algorithms are tested on tasks graphs with grid or logic gates network topologies. Obtained results are then compared to an algorithm present in the literature. This algorithm maps task graphs with moderated size on massively parallel architectures. In addition, the random based mapping metric is used in order to evaluate results of both heuristics. Then, in a will to address problems that can be found in industrial cases, application cases are widen to tasks graphs with tasks and edges weights values similar to those that can be found in the industry. A progressive construction heuristic named Regret Based Approach, based on game theory, is proposed. This heuristic maps tasks one after another. The costs of mapping tasks according to already mapped tasks are computed. The process of task selection is based on a notion of regret, present in game theory. The task with the highest value of regret for not placing it, is pointed out and is placed in priority. In order to check the strength of the algorithm, many types of task graphs (grids, logic gates networks, series-parallel, random, sparse matrices) with various size are generated. Tasks and edges weights are randomly chosen using a bimodal law parameterized in order to have similar values than industrial applications. Obtained results are compared to the Task Wise placement, especially adapted for non-unitary values. Moreover, results are evaluated using the metric defined above.Ce travail de thèse de doctorat est dédié à l'étude d'un problème de placement de tâches dans le domaine de la compilation d'applications pour des architectures massivement parallèles. Ce problème vient en réponse à certains besoins industriels tels que l'économie d'énergie, la demande de performances pour les applications de type flots de données synchrones. Ce problème de placement doit être résolu dans le respect de trois critères: les algorithmes doivent être capable de traiter des applications de tailles variables, ils doivent répondre aux contraintes de capacités des processeurs et prendre en compte la topologie des architectures cibles. Dans cette thèse, les tâches sont organisées en réseaux de communication, modélisés sous forme de graphes. Pour évaluer la qualité des solutions produites par les algorithmes, les placements obtenus sont comparés avec un placement aléatoire. Cette comparaison sert de métrique d'évaluation des placements des différentes méthodes proposées. Afin de résoudre à ce problème, deux algorithmes de placement de réseaux de tâches de grande taille sur des architectures clusterisées de processeurs de type many-coeurs ont été développés. Ils s'appliquent dans des cas où les poids des tâches et des arêtes sont unitaires. Le premier algorithme, nommé Task-wise Placement, place les tâches une par une en se servant d'une notion d'affinité entre les tâches. Le second, intitulé Subgraph-wise Placement, rassemble les tâches en groupes puis place les groupes de tâches sur les processeurs en se servant d'une relation d'affinité entre les groupes et les tâches déjà affectées. Ces algorithmes ont été testés sur des graphes, représentants des applications, possédant des topologies de types grilles ou de réseaux de portes logiques. Les résultats des placements sont comparés avec un algorithme de placement, présent dans la littérature qui place des graphes de tailles modérée et ce à l'aide de la métrique définie précédemment. Les cas d'application des algorithmes de placement sont ensuite orientés vers des graphes dans lesquels les poids des tâches et des arêtes sont variables similairement aux valeurs qu'on peut retrouver dans des cas industriels. Une heuristique de construction progressive basée sur la théorie des jeux a été développée. Cet algorithme, nommé Regret Based Approach, place les tâches une par une. Le coût de placement de chaque tâche en fonction des autres tâches déjà placées est calculée. La phase de sélection de la tâche se base sur une notion de regret présente dans la théorie des jeux. La tâche qu'on regrettera le plus de ne pas avoir placée est déterminée et placée en priorité. Afin de vérifier la robustesse de l'algorithme, différents types de graphes de tâches (grilles, logic gate networks, series-parallèles, aléatoires, matrices creuses) de tailles variables ont été générés. Les poids des tâches et des arêtes ont été générés aléatoirement en utilisant une loi bimodale paramétrée de manière à obtenir des valeurs similaires à celles des applications industrielles. Les résultats de l'algorithme ont également été comparés avec l'algorithme Task-Wise Placement, qui a été spécialement adapté pour les valeurs non unitaires. Les résultats sont également évalués en utilisant la métrique de placement aléatoire