
    Synthesis and Description of Digital Circuits at the Level of Data-Synchronized Transfers (Synthèse et description de circuits numériques au niveau des transferts synchronisés par les données)

    ABSTRACT Beyond modern multi/many-core instruction processors, the world of high-performance computing is also characterized by the use of domain-specific circuits implemented on Field-Programmable Gate Arrays (FPGAs). Modern FPGAs have become interesting targets for high-performance computing for several reasons. On one hand, the considerable number of hard IP blocks integrated on these devices (processors, memories, DSP units) has narrowed the resource and performance gap that separates them from Application-Specific Integrated Circuits (ASICs). This gap is largely explained by the high level of configurability these devices provide, a feature to which a considerable amount of resources (transistors) must be dedicated without being used by the programmed circuit. Nevertheless, in a context where more transistors are often available than can be used, the cost of this configurability becomes less significant.
The ability of modern FPGAs to be reconfigured completely or partially further offers, much like instruction processors, the flexibility required to support many different applications over time. However, while instruction processors can be programmed with various high-level software programming languages (Java, C#, C/C++, MPI, OpenMP, OpenCL), programming an FPGA typically requires the specification of a hardware design, which is a major obstacle to wider adoption. A hardware design is generally described at the register-transfer level (RTL) using hardware description languages (HDLs) such as VHDL and Verilog. For a given application, the design and verification of a dedicated circuit typically requires significantly more effort than a software implementation. Numerous commercial and academic tools now allow the high-level synthesis of hardware designs starting from software descriptions in languages such as C/C++/SystemC and, more recently, OpenCL. Nevertheless, depending on the application and at the current state of the art, these tools do not always achieve performance matching that of hand-written RTL designs. In this work, we consider an intermediate-level synthesis methodology offering a compromise between the performance achievable with RTL design and the design times afforded by high-level synthesis. We consider an input hardware description language that describes algorithmic state machines (ASMs) handling connections between sources and sinks through predefined streaming interfaces. These interfaces are similar to the AXI4-Stream and Avalon Streaming interfaces, featuring ready-to-send/ready-to-receive synchronisation signals.
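As a rough illustration of the ready-to-send/ready-to-receive handshake behind such streaming interfaces, the following Python sketch models a point-to-point channel in software. The class and function names are purely illustrative assumptions and do not come from the thesis, which targets a hardware description language rather than Python.

    # Minimal software model of a ready/valid streaming handshake, in the spirit
    # of the AXI4-Stream / Avalon streaming interfaces mentioned above.
    # All names here are illustrative placeholders.

    from collections import deque

    class Stream:
        """A point-to-point channel with ready-to-send / ready-to-receive signalling."""
        def __init__(self, depth=1):
            self.fifo = deque()
            self.depth = depth

        # Sink side: "ready to receive" when the buffer has room.
        def ready(self):
            return len(self.fifo) < self.depth

        # Source side: a transfer only happens when valid data meets a ready sink.
        def push(self, data):
            if not self.ready():
                return False          # back-pressure: source must retry later
            self.fifo.append(data)
            return True

        # Sink side: data is "valid" when something is waiting in the channel.
        def valid(self):
            return bool(self.fifo)

        def pop(self):
            return self.fifo.popleft() if self.fifo else None


    def run_cycle(source_data, stream, sink):
        """One simulated clock cycle: transfers occur only when valid and ready agree."""
        if source_data and stream.ready():
            stream.push(source_data.pop(0))   # source asserts valid, sink asserted ready
        if stream.valid():
            sink.append(stream.pop())         # sink consumes one word per cycle


    if __name__ == "__main__":
        src = [1, 2, 3, 4]
        snk = []
        ch = Stream(depth=2)
        for _ in range(8):
            run_cycle(src, ch, snk)
        print(snk)   # [1, 2, 3, 4] -- order preserved, transfers paced by back-pressure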

    On the simulation and design of manycore CMPs

    The progression of Moore's Law has resulted in both embedded and performance computing systems which use an ever-increasing number of processing cores integrated in a single chip. Commercial systems are now available which provide hundreds of cores, and academics have proposed architectures for up to 1024 cores. Embedded multicores are increasingly popular as it is easier to guarantee hard real-time constraints using individual cores dedicated to tasks than with traditional time-multiplexed processing. However, finding the optimal hardware configuration to meet these requirements at minimum cost requires extensive trial-and-error exploration of the design space. This thesis tackles the problems encountered in the design of these large-scale multicore systems by first addressing the problem of fast, detailed micro-architectural simulation. Initially addressing embedded systems, this work exploits the lack of hardware cache-coherence support in many deeply embedded systems to increase the available parallelism in the simulation. Then, partitioning the NoC and using packet counting and cycle skipping reduces the amount of computation required to accurately model the NoC interconnect. In combination, this enables simulation speeds significantly higher than the state of the art, while maintaining lower error, compared to real hardware, than any similar simulator. Simulation speeds reach up to 370 MIPS (million target instructions per second), or 110 MHz, which is better than typical FPGA prototypes and approaches final ASIC production speeds. This is achieved while maintaining an error of only 2.1%, significantly lower than other similar simulators. The thesis continues by scaling the simulator past large embedded systems up to 64-1024-core processors, adding support for coherent architectures using the same packet-counting techniques along with low-overhead context switching to enable the simulation of such large systems with stricter synchronisation requirements. The new interconnect model was partitioned to enable parallel simulation, further improving simulation speed without sacrificing accuracy. These innovations were leveraged to investigate significant novel energy-saving optimisations to the coherency protocol, processor ISA, and processor micro-architecture. By introducing a new instruction, named wait-on-address, the energy spent during spin-wait-style synchronisation events can be significantly reduced. It works by putting the core into a low-power idle state while the cache line of the indicated address is monitored for coherency activity. Upon an update or invalidation (or a traditional timer or external interrupt), the core resumes execution, but the active energy of running the core pipeline and repeatedly accessing the data and instruction caches is effectively reduced to static idle power. The thesis also shows that existing combined software-hardware schemes to track data regions which do not require coherency can adequately address the directory-associativity problem, and introduces a new coherency sharer encoding which reduces the energy consumed by sharer invalidations when sharers are grouped closely together, as would be the case in a system running many tasks with a small degree of parallelism in each.
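The wait-on-address behaviour can be pictured with a small software model. The Python sketch below uses invented names and is not the thesis's simulator code; it only illustrates the described semantics of parking a core on a cache line until a coherency event or interrupt wakes it, which is where the claimed saving over a spin-wait comes from.

    # Illustrative sketch (not the thesis's simulator) of wait-on-address semantics:
    # the core idles until a coherency update/invalidation (or interrupt) touches
    # the watched cache line, instead of burning energy in a spin-wait loop.

    class Core:
        def __init__(self, cid):
            self.cid = cid
            self.state = "RUNNING"      # RUNNING or IDLE
            self.watched_line = None

        def wait_on_address(self, addr, line_size=64):
            """Executed instead of a spin-wait: park the core on this cache line."""
            self.watched_line = addr // line_size
            self.state = "IDLE"         # pipeline and cache accesses stop -> static power only

        def on_coherency_event(self, line, kind):
            """Called by the coherency protocol on updates/invalidations."""
            if self.state == "IDLE" and line == self.watched_line:
                self.state = "RUNNING"  # resume; software re-checks the flag it waited on
                self.watched_line = None

        def on_interrupt(self):
            """Timer or external interrupts also wake the core, as in the abstract."""
            if self.state == "IDLE":
                self.state = "RUNNING"
                self.watched_line = None


    # From the waiting thread's point of view:
    #   while not flag: pass                 # classic spin: pipeline + cache energy every cycle
    #   core.wait_on_address(addr_of(flag))  # proposed: sleep until the line is written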
The research concludes by using the extremely fast simulation speeds developed to produce a large set of training data, collecting various runtime and energy statistics for a wide range of embedded applications across a large and diverse range of potential MPSoC designs. This data was used to train a series of machine-learning-based models which were then evaluated on their capacity to predict performance characteristics of unseen workload combinations across the explored MPSoC design space, using only two sample simulations, with promising results from some of the machine learning techniques. The models were then used to produce a ranking of predicted performance across the design space: on average, Random Forest was able to predict a best design within 89% of the runtime performance of the actual best tested design, and better than 93% of the alternative design space. When predicting for a weighted metric of energy, delay and area, Random Forest on average produced results within 93% of the optimum. In summary, this thesis improves upon the state of the art for cycle-accurate multicore simulation, introduces novel energy-saving changes to the ISA and micro-architecture of future multicore processors, and demonstrates the viability of machine learning techniques to significantly accelerate the design space exploration required to bring a new manycore design to market.
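The design-space ranking step can be illustrated with a brief sketch. The following Python example uses scikit-learn; the features, data generator and figures are placeholders invented for illustration rather than the thesis's actual training set or model configuration. It trains a Random Forest regressor on a small sample of "simulated" configurations and ranks the remaining design points by predicted runtime.

    # Rough sketch of the design-space ranking idea: train a regressor on a few
    # sampled (design, runtime) pairs, then rank unseen design points by
    # predicted runtime. All features, data and names here are placeholders.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    # Each row is a candidate MPSoC configuration: [cores, l1_kb, l2_kb, noc_width]
    designs = rng.integers(low=[4, 16, 256, 8], high=[64, 128, 4096, 128], size=(200, 4))

    # Stand-in "measured" runtimes for the training subset (real values would come
    # from simulations such as those described in the abstract above).
    def fake_runtime(d):
        cores, l1, l2, noc = d
        return 1e6 / cores + 5e3 / np.log2(l1) + 2e4 / np.log2(l2) + 1e4 / noc

    train_idx = rng.choice(len(designs), size=40, replace=False)
    X_train = designs[train_idx]
    y_train = np.array([fake_runtime(d) for d in X_train])

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    # Rank the whole design space by predicted runtime and report the best candidate.
    predicted = model.predict(designs)
    ranking = np.argsort(predicted)
    best = designs[ranking[0]]
    print("predicted-best design (cores, L1 KB, L2 KB, NoC width):", best)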