    Learning Instruction Scheduling Heuristics from Optimal Data

    The development of modern pipelined processors with multiple functional units has increased the available instruction-level parallelism. To fully utilize these resources, compiler writers spend large amounts of time developing complex scheduling heuristics for each new architecture. To reduce the time spent on this process, automated machine learning techniques have been proposed to generate scheduling heuristics. We present two case studies using these techniques to generate instruction scheduling heuristics for basic blocks and super blocks. A basic block is a block of code with a single flow of control, and a super block is a collection of basic blocks with a single entry point but multiple exit points. We improve previous techniques for automated generation of basic block scheduling heuristics by increasing the quality of the training data and increasing the number of features considered, including several novel features that have useful effects on scheduling instructions. Our case study into super block scheduling heuristics is a novel contribution, as previous approaches were only applied to basic blocks. We show through experimentation that we can produce efficient heuristics that perform better than current heuristic methods for basic block and super block scheduling. We reduce the number of non-optimally scheduled blocks by up to 55% for basic blocks and 38% for super blocks. We also produce better schedules 7.8 times more often than the next best heuristic for basic blocks and 4.4 times more often for super blocks.

    On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling

    Modern architectures allow multiple instructions to be issued at once and have other complex features. To account for this, compilers perform instruction scheduling after generating the output code. The instruction scheduling problem is to find an optimal schedule given the limitations and capabilities of the architecture. While this can be done optimally, a greedy algorithm known as list scheduling is used in practice in most production compilers. List scheduling is generally regarded as being near-optimal in practice, provided a good choice of heuristic is used. However, previous work comparing a list scheduler against an optimal scheduler either assumes an idealized architectural model or uses too few test cases to strongly prove or disprove the assumed near-optimality of list scheduling. It remains an open question whether list scheduling performs well when scheduling for a realistic architectural model. Using constraint programming, we developed an efficient optimal scheduler capable of scheduling even very large blocks within a popular benchmark suite in a reasonable amount of time. I improved the architectural model and optimal scheduler by allowing for an issue width not equal to the number of functional units, instructions that monopolize the processor for one cycle, and non-fully pipelined instructions. I then evaluated the performance of list scheduling for this more realistic architectural model. I found that, when scheduling basic blocks under a realistic architectural model, 6% or less of the schedules produced by a list scheduler are non-optimal, but when scheduling superblocks, at least 40% of the schedules produced by a list scheduler are non-optimal. Furthermore, when the list scheduler and optimal scheduler differed, the optimal scheduler improved schedule cost by at least 5% on average, with maximum improvements of 82%. This suggests that list scheduling is only a viable solution in practice when scheduling basic blocks. When scheduling superblocks, the advantage of using a list scheduler is its speed, not the quality of the schedules produced, and alternatives to list scheduling should be considered.
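
To make the greedy algorithm mentioned in this abstract concrete, the following is a minimal sketch of list scheduling for a single basic block. It is not the scheduler evaluated in the thesis: it assumes an idealized model (fully pipelined instructions, interchangeable functional units, an acyclic dependency graph), and the function name `list_schedule`, its parameters, and the critical-path priority are illustrative choices rather than anything taken from the source.

```python
from collections import defaultdict


def list_schedule(latency, edges, issue_width=1):
    """Greedy list scheduling of one basic block (illustrative sketch).

    latency:     dict mapping instruction name -> latency in cycles (>= 1)
    edges:       list of (pred, succ) dependency pairs; must form a DAG
    issue_width: maximum number of instructions issued per cycle
    Returns a dict mapping instruction name -> issue cycle.
    """
    succs = defaultdict(list)
    preds = defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Assumed heuristic: longest latency-weighted path to the end of the block.
    priority = {}

    def critical_path(n):
        if n not in priority:
            tail = max((critical_path(s) for s in succs[n]), default=0)
            priority[n] = latency[n] + tail
        return priority[n]

    for n in latency:
        critical_path(n)

    start = {}
    cycle = 0
    while len(start) < len(latency):
        # An instruction is ready once every predecessor's result is available.
        ready = [
            n for n in latency
            if n not in start
            and all(p in start and start[p] + latency[p] <= cycle
                    for p in preds[n])
        ]
        # Greedy step: issue the highest-priority ready instructions first.
        ready.sort(key=lambda n: (-priority[n], n))
        for n in ready[:issue_width]:
            start[n] = cycle
        cycle += 1
    return start


latency = {"a": 3, "b": 1, "c": 1, "d": 1}
edges = [("a", "d"), ("b", "c"), ("c", "d")]
schedule = list_schedule(latency, edges, issue_width=1)
```

The scheduler never revisits a decision once an instruction has been issued, which is what makes it fast and also why its result depends on the priority heuristic. The realistic features named in the abstract (issue width different from the number of functional units, serializing instructions, non-fully pipelined instructions) are not modeled in this sketch.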

    Multi-level optimisation of an image processing application on parallel machines

    This thesis aims to define a design methodology for high-performance applications on future embedded processors. These architectures require efficient use of their different levels of parallelism (fine-grain and coarse-grain) and careful handling of inter-processor communication and memory accesses. To study this methodology, we used a target processor representative of these emerging architectures, the Cell BE processor. We chose a low-level image processing application, the Harris interest point detector, as an example of a regular, highly parallel workload that demands intensive computation. By studying several parallelisation schemes for this application, we identified optimisation techniques that adapt the software to the specific SIMD units of the Cell processor. Efficient use of memory further requires both good exploitation of data transfers and an optimal arrangement of data in memory. We developed a library named CELL MPI that simplifies and automates communication and synchronisation between the processing elements through a simplified, implicit programming interface. This work allowed us to develop a methodology that simplifies the optimised parallel implementation of such algorithms on the Cell processor. We designed a parallel programming tool based on algorithmic skeletons, named SKELL BE. This programming model provides an original solution for generating applications through meta-programming. With SKELL BE, applications that achieve very good performance and use the Cell architecture efficiently can be generated automatically, as shown by a comparison on a set of test programs against several other tools dedicated to this processor.

    Instruction scheduling in the TOBEY compiler

