60 research outputs found
High performance computing with FPGAs
Field-programmable gate arrays (FPGAs) provide an army of logic units that can be organized in a highly parallel or pipelined fashion to implement an algorithm in hardware. The flexibility of this new medium creates the challenge of finding the right processing paradigm, one that takes into account the natural constraints of FPGAs: clock frequency, memory footprint and communication bandwidth. This paper first compares the use of an FPGA as a multiprocessor on a chip with its use as a highly functional coprocessor, and discusses the programming tools for hardware/software codesign. Next, a number of techniques are presented to maximize parallelism and optimize data locality in nested loops. These include unimodular transformations, data-locality-improving loop transformations and the use of smart buffers. Finally, the use of these techniques is demonstrated on a number of examples.
The results in the paper and in the literature show that, with the proper programming tool set, FPGAs can speed up computation kernels significantly with respect to traditional processors.
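The unimodular transformations mentioned in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the helper names, the restriction to 2-deep loop nests, and the example dependence vectors are all assumptions made for illustration. A transformation matrix T is unimodular when its determinant is ±1, and it is legal when every transformed dependence distance vector stays lexicographically positive.

```python
# Illustrative sketch only: legality check for 2x2 unimodular loop transformations.
# All names and the example dependences are hypothetical, not from the paper.

def mat_vec(T, d):
    """Apply transformation matrix T to a dependence distance vector d."""
    return tuple(sum(T[i][k] * d[k] for k in range(len(d))) for i in range(len(T)))

def det2(T):
    """Determinant of a 2x2 integer matrix."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]

def lex_positive(d):
    """True if the first nonzero component of d is positive."""
    for x in d:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the zero vector is not lexicographically positive

def is_legal(T, deps):
    """T must be unimodular and must keep every dependence lexicographically positive."""
    return abs(det2(T)) == 1 and all(lex_positive(mat_vec(T, d)) for d in deps)

INTERCHANGE = ((0, 1), (1, 0))  # swap the two loops
SKEW = ((1, 0), (1, 1))         # skew the inner loop by the outer index

# Assumed dependence distance vectors of some 2-deep loop nest.
deps = [(1, -1), (0, 1)]

print(is_legal(INTERCHANGE, deps))  # interchange reverses (1, -1) into (-1, 1)
print(is_legal(SKEW, deps))         # skewing maps (1, -1) to (1, 0)
```

With these assumed dependences, plain interchange is rejected while skewing is accepted, which is the kind of reasoning a compiler applies before reordering a nest for parallelism or locality.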
Chain-based scheduling: Part I - loop transformations and code generation
Chain-based scheduling [1] is an efficient partitioning and scheduling scheme for nested loops on distributed-memory multicomputers. The idea is to take advantage of the regular data dependence structure of a nested loop to overlap and pipeline communication and computation. Most partitioning and scheduling algorithms proposed for nested loops on multicomputers [1,2,3] are graph algorithms on the iteration space of the nested loop. These graph algorithms are too expensive (at least O(N), where N is the total number of iterations) to be implemented in parallelizing compilers. They also need large data structures to store the result of the partitioning and scheduling. In this paper, we propose compiler loop transformations and code generation that produce chain-based parallel code for nested loops on multicomputers. The cost of the loop transformations is O(nd), where n is the number of nested loops and d is the number of data dependences. Both n and d are very small in real programs. The loop transformations and code generation for chain-based partitioning and scheduling enable parallelizing compilers to generate parallel code that contains all the partitioning and scheduling information the parallel processors need at run time.
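To make the notion of a chain concrete, the following Python sketch groups the iterations of a small 2-D iteration space into chains. It assumes that a chain is a sequence of iterations linked by one chosen dependence vector, and every name in it is hypothetical. Note that it enumerates all iterations, which is exactly the O(N) iteration-space approach the abstract describes as too expensive for compilers; it is meant only to show what is being partitioned, not how the paper's O(nd) loop transformations work.

```python
# Illustrative sketch only: enumerate chains in an n x m iteration space.
# Assumes a chain links iteration p to p + d for one dependence vector d.
from collections import defaultdict

def in_bounds(point, n, m):
    i, j = point
    return 0 <= i < n and 0 <= j < m

def chain_head(point, d, n, m):
    """Walk backwards along d until the next step would leave the iteration space."""
    i, j = point
    while in_bounds((i - d[0], j - d[1]), n, m):
        i, j = i - d[0], j - d[1]
    return (i, j)

def build_chains(n, m, d):
    """Map each chain head to the iterations on its chain, in execution order."""
    if d == (0, 0):
        raise ValueError("dependence vector must be nonzero")
    chains = defaultdict(list)
    for i in range(n):
        for j in range(m):
            chains[chain_head((i, j), d, n, m)].append((i, j))
    return dict(chains)

chains = build_chains(3, 3, (1, 1))
for head, members in sorted(chains.items()):
    print(head, members)
```

Each chain could then be assigned to one processor, so that the dependence carried by the chosen vector stays local and only the remaining dependences require communication.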
Communion: a new strategy for memory management in high-performance computer systems
Modern computers show a large gap between peak performance and sustained performance. There are many reasons for this, most of which involve inefficient use of computational resources. The memory system is now the most critical component because of its growing inability to keep up with processor requests. Technological trends have produced a large and growing gap between CPU speeds and DRAM speeds.
Much research has focused on this memory system problem, including program optimization techniques, data locality enhancement, hardware and software prefetching, decoupled architectures, multithreading, and speculative loads and execution. These techniques have had relative success, but each focuses on only one component of the hardware or software system.
We present here a new strategy for memory management in high-performance computer systems, named COMMUNION. The basic idea behind this strategy is cooperation. We introduce possible interactions among the system programs that are responsible for generating and executing application programs. We investigate two specific interactions: between the compiler and the operating system, and among the components of the compiling system.
The experimental results show that it is possible to obtain improvements of about 10 times in execution time and about 5 times in memory demand. In the interaction between compiler and operating system, named Compiler-Aided Page Replacement (CAPR), we achieved a reduction of about 10% in space-time product, with an increase of only 0.5% in total execution time. All these results show that it is possible to manage main memory more efficiently than current systems do.
Track: Distributed and parallel processing. Signal processing.
Red de Universidades con Carreras en Informática (RedUNCI)
Communion: a new strategy for memory management in high-performance computer systems
Modern computers show a large gap between peak performance and sustained performance. There are many reasons for this, most of which involve inefficient use of computational resources. The memory system is now the most critical component because of its growing inability to keep up with processor requests. Technological trends have produced a large and growing gap between CPU speeds and DRAM speeds. Much research has focused on this memory system problem, including program optimization techniques, data locality enhancement, hardware and software prefetching, decoupled architectures, multithreading, and speculative loads and execution. These techniques have had relative success, but each focuses on only one component of the hardware or software system. We present here a new strategy for memory management in high-performance computer systems, named COMMUNION. The basic idea behind this strategy is "cooperation". We introduce possible interactions among the system programs that are responsible for generating and executing application programs. We investigate two specific interactions: between the compiler and the operating system, and among the components of the compiling system. The experimental results show that enhancing the interaction among the compiling system components makes it possible to obtain improvements of about 10 times in execution time and about 5 times in memory demand. In the interaction between compiler and operating system, named Compiler-Aided Page Replacement (CAPR), we achieved a reduction of about 10% in space-time product, with an increase of only 0.5% in total execution time. All these results show that it is possible to manage main memory more efficiently than current systems do.
Facultad de Informática
Compiler Optimization Techniques for Scheduling and Reducing Overhead
Exploiting parallelism in program loops is an important factor in realizing the potential performance of today's processors. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops. An important feature of a class of scientific computing problems is the regularity of their access patterns. Chapter 2 presents an approach to optimizing address generation for these problems that results in the following: (i) elimination of redundant arithmetic computation by recognizing and exploiting common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references as possible to scalar accesses, which reduces execution time and address arithmetic overhead and allows data to be accessed in registers rather than caches. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, but little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops that formulates the problem of finding the minimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization in nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space.
Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach that uses the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves data locality but also significantly reduces communication overhead.
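The conversion of array references to scalar accesses described for Chapter 2 can be sketched on a 3-point stencil. This is a generic illustration of scalar replacement written in Python for readability, not the dissertation's algorithm; the function names are hypothetical, and in practice the transformation is applied by a compiler to low-level code where the scalars live in registers.

```python
# Illustrative sketch only: scalar replacement on a 1-D 3-point stencil.

def stencil_naive(a):
    """Each iteration reads three array elements, two of which were read before."""
    out = [0] * len(a)
    for i in range(1, len(a) - 1):
        out[i] = a[i - 1] + a[i] + a[i + 1]
    return out

def stencil_scalar_replaced(a):
    """Values reused across iterations are carried in scalars, so each
    iteration performs only one new array read."""
    n = len(a)
    out = [0] * n
    if n < 3:
        return out
    left, mid = a[0], a[1]
    for i in range(1, n - 1):
        right = a[i + 1]           # the only array load in this iteration
        out[i] = left + mid + right
        left, mid = mid, right     # rotate the scalars for the next iteration
    return out

data = [1, 2, 3, 4, 5]
print(stencil_naive(data))
print(stencil_scalar_replaced(data))
```

Both functions compute the same result; the second reduces the array reads per iteration from three to one, which is the source of the savings in address arithmetic that the abstract refers to.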
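Memory optimization in the design flow of multiprocessor systems-on-chip for multimedia applications (Optimisation des mémoires dans le flot de conception des systèmes multiprocesseurs sur puces pour des applications de type multimédia)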
SUMMARY
Multiprocessor systems-on-chip (MPSoCs) are one of the main drivers of the semiconductor
industry revolution. MPSoCs enjoy growing popularity in the field of embedded systems. Their
great capacity for parallelization at a very high level of integration makes them good
candidates for systems and applications such as multimedia applications. Power consumption,
computing capacity and design space are the elements on which the performance of this type of
application depends. Memory is the key factor for substantially improving their performance.
With the arrival of embedded multimedia applications in industry, the problem of performance
gains is vital. The mass of data processed by these applications requires large computing and
memory capacity. Recently, new programming models have appeared. These models offer
higher-level programming to meet the growing needs of MPSoCs, hence the need for new
optimization and mapping approaches for embedded systems and their programming models.
System-level design of MPSoC architectures for multimedia applications is a real technical
challenge. The general objective of this thesis is to meet this challenge by finding solutions.
More specifically, this thesis proposes to introduce the concept of memory optimization into
the system-level design flow and to observe its impact on the different programming models
used when designing MPSoCs. In other words, the aim is to unify the field of compilation with
that of system-level design for a better overall design.
The contribution of this thesis is to propose new approaches to memory optimization
techniques for MPSoC design with different programming models. Our research concerns the
integration of memory optimization techniques into the MPSoC design flow for different types
of programming model. This work was carried out in collaboration with STMicroelectronics.
ABSTRACT
Multiprocessor systems-on-chip (MPSoCs) are one of the main drivers of the semiconductor
industry revolution. MPSoCs are gaining popularity in the field of embedded
systems. Owing to their great ability to parallelize at a very high integration level, they are
good candidates for systems and applications such as multimedia. Memory is becoming a key
factor in significantly improving these applications (i.e. power, performance and area).
With the emergence of more embedded multimedia applications in the industry, this issue
becomes increasingly vital. The large amount of data manipulated by these applications requires
high computing and memory capacity. Lately, new programming models have been introduced.
These programming models offer a higher programming level to answer the increasing needs of
MPSoCs. This leads to the need for new optimization and mapping approaches suitable for
embedded systems and their programming models.
The overall objective of this research is to find solutions to the challenges of system-level
design for applications such as multimedia. This entails the development of new approaches and
new optimization techniques. The specific objective of this research is to introduce the concept
of memory optimization into the system-level design flow and to study its impact on the different
programming models used for MPSoC design. In other words, it is the unification of the
compilation and system-level design domains.
The contribution of this research is to propose new approaches to memory optimization
techniques for MPSoC design in different programming models. This thesis concerns the
integration of memory optimization into the MPSoC design flow for various types of
programming model. Our research was done in collaboration with STMicroelectronics.
- …