1,386 research outputs found

    Run-time Spatial Mapping of Streaming Applications to Heterogeneous Multi-Processor Systems

    Get PDF
    In this paper, we define the problem of spatial mapping. We present reasons why performing spatial mappings at run-time is both necessary and desirable. We propose what is—to our knowledge—the first attempt at a formal description of spatial mappings for the embedded real-time streaming application domain. Thereby, we introduce criteria for a qualitative comparison of these spatial mappings. As an illustration of how our formalization relates to practice, we relate our own spatial mapping algorithm to the formal model

    Running stream-like programs on heterogeneous multi-core systems

    Get PDF
    All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming and error prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole. Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program's structure and data layout can be safely and reliably modified by high-level compiler transformations. One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion. This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused, in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. The algorithm also takes account of downstream compiler passes---specifically software pipelining and buffer allocation---and it models the compiler's ability to fuse kernels. The latter is important because the compiler may not be able to fuse arbitrary collections of kernels. This thesis also introduces a static queue sizing algorithm. This algorithm is important when memory is distributed, especially when local stores are small. The algorithm takes account of latencies and variations in computation time, and is constrained by the sizes of the local memories. The second part of this thesis is concerned with dynamic scheduling of stream programs. First, it investigates the performance of known online, non-preemptive, non-clairvoyant dynamic schedulers. Second, it proposes two dynamic schedulers for stream programs. The first is specifically for one-dimensional stream programs. The second is more general: it does not need to be told the stream graph, but it has slightly larger overhead. This thesis also introduces some support tools related to stream programming. StarssCheck is a debugging tool, based on Valgrind, for the StarSs task-parallel programming language. It generates a warning whenever the program's behaviour contradicts a pragma annotation. Such behaviour could otherwise lead to exceptions or race conditions. StreamIt to OmpSs is a tool to convert a streaming program in the StreamIt language into a dynamically scheduled task based program using StarSs.Totes les empreses de semiconductors produeixen actualment multi-cores. Mòbils,PCs, portàtils, i dispositius mòbils d’Internet necessitaran programari quefaci servir eficientment aquests cores. Escriure programari paral·lel d’altrendiment és difícil, laboriós i propens a errors, incrementant tant el tempsde llançament al mercat com el cost. El programari té una vida més llarga queel maquinari; típicament pren més temps desenvolupar nou programi que noumaquinari, i el programari ja existent pot perdurar molt temps, durant el qualel nombre de cores dels sistemes incrementarà. La productivitat dedesenvolupament i manteniment millorarà si el paral·lelisme i els detallstècnics són gestionats per la màquina, mentre el programador raona sobre elconjunt de l’aplicació.El programari paral·lel hauria de ser escrit en llenguatges específics deldomini. Aquests llenguatges extrauen paral·lelisme implícit, el qual és ocultatper un llenguatge seqüencial com C. Quan l’assignació de memòria i lesestructures de control són gestionades pel compilador, l’estructura iorganització de dades del programi poden ser modificades de manera segura ifiable per les transformacions d’alt nivell del compilador.Un dels dominis de l’aplicació importants és el que consta dels programes destream; aquest programes són estructurats com a nuclis independents queinteractuen només a través de canals d’un sol sentit, anomenats streams. Laprogramació de streams no és aplicable a tots els programes, però sorgeix deforma natural en la codificació i descodificació d’àudio i vídeo, gràfics 3D, iprocessament de senyals digitals. Aquesta representació permet transformacionsd’alt nivell, fins i tot descomposició i fusió de nucli.Aquesta tesi desenvolupa noves tècniques de compilació i sistemes en tempsd’execució per a programació de streams. La primera part d’aquesta tesi esfocalitza amb un compilador de streams de planificació estàtica. Presenta unnou algorisme de partició estàtica, que determina quins nuclis han de serfusionats, per tal d’equilibrar la càrrega en els processadors i en lesinterconnexions. Un bon algorisme de particionat és fonamental per tal de queel compilador produeixi codi eficient. L’algorisme també té en compte elspassos de compilació subseqüents---específicament software pipelining il’arranjament de buffers---i modela la capacitat del compilador per fusionarnuclis. Aquesta tesi també presenta un algorisme estàtic de redimensionament de cues.Aquest algorisme és important quan la memòria és distribuïda, especialment quanles memòries locals són petites. L’algorisme té en compte latències ivariacions en els temps de càlcul, i considera el límit imposat per la mida deles memòries locals.La segona part d’aquesta tesi es centralitza en la planificació dinàmica deprogrames de streams. En primer lloc, investiga el rendiment dels planificadorsdinàmics online, non-preemptive i non-clairvoyant. En segon lloc, proposa dosplanificadors dinàmics per programes de stream. El primer és específicament pera programes de streams unidimensionals. El segon és més general: no necessitael graf de streams, però els overheads són una mica més grans.Aquesta tesi també presenta un conjunt d’eines de suport relacionades amb laprogramació de streams. StarssCheck és una eina de depuració, que és basa enValgrind, per StarSs, un llenguatge de programació paral·lela basat en tasques.Aquesta eina genera un avís cada vegada que el comportament del programa estàen contradicció amb una anotació pragma. Aquest comportament d’una altra manerapodria causar excepcions o situacions de competició. StreamIt to OmpSs és unaeina per convertir un programa de streams codificat en el llenguatge StreamIt aun programa de tasques en StarSs planificat de forma dinàmica.Postprint (published version

    Porting of DSMC to multi-GPUs using OpenACC

    Get PDF
    The Direct Simulation Monte Carlo has become the method of choice for studying gas flows characterized by variable rarefaction and non-equilibrium effects, rising interest in industry for simulating flows in micro-, and nano-electromechanical systems. However, rarefied gas dynamics represents an open research challenge from the computer science perspective, due to its computational expense compared to continuum computational fluid dynamics methods. Fortunately, over the last decade, high-performance computing has seen an exponential growth of performance. Actually, with the breakthrough of General-Purpose GPU computing, heterogeneous systems have become widely used for scientific computing, especially in large-scale clusters and supercomputers. Nonetheless, developing efficient, maintainable and portable applications for hybrid systems is, in general, a non-trivial task. Among the possible approaches, directive-based programming models, such as OpenACC, are considered the most promising for porting scientific codes to hybrid CPU/GPU systems, both for their simplicity and portability. This work is an attempt to port a simplified version of the fm dsmc code developed at FLOW Matters Consultancy B.V., a start-up company supporting this project, on a multi-GPU distributed hybrid system, such as Marconi100 hosted at CINECA, using OpenACC. Finally, we perform a detailed performance analysis of our DSMC application on Volta (NVIDIA V100 GPU) architecture based computing platform as well as a comparison with previous results obtained with x64 86 (Intel Xeon CPU) and ppc64le (IBM Power9 CPU) architectures

    Low power architectures for streaming applications

    Get PDF

    Predictable embedded multiprocessor architecture for streaming applications

    Get PDF
    The focus of this thesis is on embedded media systems that execute applications from the application domain car infotainment. These applications, which we refer to as jobs, typically fall in the class of streaming, i.e. they process on a stream of data. The jobs are executed on heterogeneous multiprocessor platforms, for performance and power efficiency reasons. Most of these jobs have firm real-time requirements, like throughput and end-to-end latency. Car-infotainment systems become increasingly more complex, due to an increase in the supported number of jobs and an increase of resource sharing. Therefore, it is hard to verify, for each job, that the realtime requirements are satisfied. To reduce the verification effort, we elaborate on an architecture for a predictable system from which we can verify, at design time, that the job’s throughput and end-to-end latency requirements are satisfied. This thesis introduces a network-based multiprocessor system that is predictable. This is achieved by starting with an architecture where processors have private local memories and execute tasks in a static order, so that the uncertainty in the temporal behaviour is minimised. As an interconnect, we use a network that supports guaranteed communication services so that it is guaranteed that data is delivered in time. The architecture is extended with shared local memories, run-time scheduling of tasks, and a memory hierarchy. Dataflow modelling and analysis techniques are used for verification, because they allow cyclic data dependencies that influence the job’s performance. Shown is how to construct a dataflow model from a job that is mapped onto our predictable multiprocessor platforms. This dataflow model takes into account: computation of tasks, communication between tasks, buffer capacities, and scheduling of shared resources. The job’s throughput and end-to-end latency bounds are derived from a self-timed execution of the dataflow graph, by making use of existing dataflow-analysis techniques. It is shown that the derived bounds are tight, e.g. for our channel equaliser job, the accuracy of the derived throughput bound is within 10.1%. Furthermore, it is shown that the dataflow modelling and analysis techniques can be used despite the use of shared memories, run-time scheduling of tasks, and caches

    Online Modeling and Tuning of Parallel Stream Processing Systems

    Get PDF
    Writing performant computer programs is hard. Code for high performance applications is profiled, tweaked, and re-factored for months specifically for the hardware for which it is to run. Consumer application code doesn\u27t get the benefit of endless massaging that benefits high performance code, even though heterogeneous processor environments are beginning to resemble those in more performance oriented arenas. This thesis offers a path to performant, parallel code (through stream processing) which is tuned online and automatically adapts to the environment it is given. This approach has the potential to reduce the tuning costs associated with high performance code and brings the benefit of performance tuning to consumer applications where otherwise it would be cost prohibitive. This thesis introduces a stream processing library and multiple techniques to enable its online modeling and tuning. Stream processing (also termed data-flow programming) is a compute paradigm that views an application as a set of logical kernels connected via communications links or streams. Stream processing is increasingly used by computational-x and x-informatics fields (e.g., biology, astrophysics) where the focus is on safe and fast parallelization of specific big-data applications. A major advantage of stream processing is that it enables parallelization without necessitating manual end-user management of non-deterministic behavior often characteristic of more traditional parallel processing methods. Many big-data and high performance applications involve high throughput processing, necessitating usage of many parallel compute kernels on several compute cores. Optimizing the orchestration of kernels has been the focus of much theoretical and empirical modeling work. Purely theoretical parallel programming models can fail when the assumptions implicit within the model are mis-matched with reality (i.e., the model is incorrectly applied). Often it is unclear if the assumptions are actually being met, even when verified under controlled conditions. Full empirical optimization solves this problem by extensively searching the range of likely configurations under native operating conditions. This, however, is expensive in both time and energy. For large, massively parallel systems, even deciding which modeling paradigm to use is often prohibitively expensive and unfortunately transient (with workload and hardware). In an ideal world, a parallel run-time will re-optimize an application continuously to match its environment, with little additional overhead. This work presents methods aimed at doing just that through low overhead instrumentation, modeling, and optimization. Online optimization provides a good trade-off between static optimization and online heuristics. To enable online optimization, modeling decisions must be fast and relatively accurate. Online modeling and optimization of a stream processing system first requires the existence of a stream processing framework that is amenable to the intended type of dynamic manipulation. To fill this void, we developed the RaftLib C++ template library, which enables usage of the stream processing paradigm for C++ applications (it is the run-time which is the basis of almost all the work within this dissertation). An application topology is specified by the user, however almost everything else is optimizable by the run-time. RaftLib takes advantage of the knowledge gained during the design of several prior streaming languages (notably Auto-Pipe). The resultant framework enables online migration of tasks, auto-parallelization, online buffer-reallocation, and other useful dynamic behaviors that were not available in many previous stream processing systems. Several benchmark applications have been designed to assess the performance gains through our approaches and compare performance to other leading stream processing frameworks. Information is essential to any modeling task, to that end a low-overhead instrumentation framework has been developed which is both dynamic and adaptive. Discovering a fast and relatively optimal configuration for a stream processing application often necessitates solving for buffer sizes within a finite capacity queueing network. We show that a generalized gain/loss network flow model can bootstrap the process under certain conditions. Any modeling effort, requires that a model be selected; often a highly manual task, involving many expensive operations. This dissertation demonstrates that machine learning methods (such as a support vector machine) can successfully select models at run-time for a streaming application. The full set of approaches are incorporated into the open source RaftLib framework

    3rd Many-core Applications Research Community (MARC) Symposium. (KIT Scientific Reports ; 7598)

    Get PDF
    This manuscript includes recent scientific work regarding the Intel Single Chip Cloud computer and describes approaches for novel approaches for programming and run-time organization

    Proceedings of the first international workshop on Investigating dataflow in embedded computing architectures (IDEA 2015), January 21, 2015, Amsterdam, The Netherlands

    Get PDF
    IDEA '15 held at HiPEAC 2015, Amsterdam, The Netherlands on January 21st, 2015 is the rst workshop on Investigating Data ow in Embedded computing Architectures. This technical report comprises of the proceedings of IDEA '15. Over the years, data ow has been gaining popularity among Embedded Systems researchers around Europe and the world. However, research on data ow is limited to small pockets in dierent communities without a common forum for discussion. The goal of the workshop was to provide a platform to researchers and practitioners to present work on modelling and analysis of present and future high performance embedded computing architectures using data ow. Despite being the rst edition of the workshop, it was very pleasant to see a total of 14 submissions, out of which 6 papers were selected following a thorough reviewing process. All the papers were reviewed by at least 5 reviewers. This workshop could not have become a reality without the help of a Technical Program Committee (TPC). The TPC members not only did the hard work to give helpful reviews in time, but also participated in extensive discussion following the reviewing process, leading to an excellent workshop program and very valuable feedback to authors. Likewise, the Organisation Committee also deserves acknowledgment to make this workshop a successful event. We take this opportunity to thank everyone who contributed in making this workshop a success

    Proceedings of the first international workshop on Investigating dataflow in embedded computing architectures (IDEA 2015), January 21, 2015, Amsterdam, The Netherlands

    Get PDF
    IDEA '15 held at HiPEAC 2015, Amsterdam, The Netherlands on January 21st, 2015 is the rst workshop on Investigating Data ow in Embedded computing Architectures. This technical report comprises of the proceedings of IDEA '15. Over the years, data ow has been gaining popularity among Embedded Systems researchers around Europe and the world. However, research on data ow is limited to small pockets in dierent communities without a common forum for discussion. The goal of the workshop was to provide a platform to researchers and practitioners to present work on modelling and analysis of present and future high performance embedded computing architectures using data ow. Despite being the rst edition of the workshop, it was very pleasant to see a total of 14 submissions, out of which 6 papers were selected following a thorough reviewing process. All the papers were reviewed by at least 5 reviewers. This workshop could not have become a reality without the help of a Technical Program Committee (TPC). The TPC members not only did the hard work to give helpful reviews in time, but also participated in extensive discussion following the reviewing process, leading to an excellent workshop program and very valuable feedback to authors. Likewise, the Organisation Committee also deserves acknowledgment to make this workshop a successful event. We take this opportunity to thank everyone who contributed in making this workshop a success
    corecore