    Lexicographic path searches for FPGA routing

    This dissertation reports on studies of the application of lexicographic graph searches to solve problems in FPGA detailed routing. Our contributions include the derivation of iteration limits for scalar implementations of negotiated congestion for standard floating-point types and the identification of pathological cases for path choice. In the study of the routability-driven detailed FPGA routing problem, we show universal detailed routability is NP-complete based on a related proof by Lee and Wong. We describe the design of a lexicographic composition operator of totally-ordered monoids as path cost metrics and show its optimality under an adapted A* search. Our new router, CornNC, based on lexicographic composition of congestion and wirelength, established a new minimum track count for the FPGA Place and Route Challenge. For the problem of long-path timing-driven FPGA detailed routing, we show that long-path budgeted detailed routability is NP-complete by reduction to universal detailed routability. We generalise the lexicographic composition to any finite length and verify its optimality under A* search. The timing budget solution of Ghiasi et al. is applied to solve the long-path timing budget problem for FPGA connections. Our delay-clamped spiral lexicographic composition design, SpiralRoute, ensures connection-based budgets are always met, and thus achieves timing closure whenever it successfully routes. For 113 test routing instances derived from standard benchmarks, SpiralRoute found 13 routable instances with timing closure that were unroutable by a scalar negotiated congestion router and achieved timing closure in another 27 cases when the scalar router did not, at the expense of increased runtime. We also study techniques to improve SpiralRoute runtimes, including a data structure of a trie augmented by data stacks for minimum element retrieval, and the technique of step tomonoid elimination in reducing the retrieval depth in a trie of stacks structure.
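As a rough illustration of what a lexicographic path cost looks like in practice, the sketch below runs a plain Dijkstra search (A* with a zero heuristic) over costs that are `(congestion, wirelength)` pairs. Pairs are added component-wise and compared lexicographically, so any reduction in congestion outranks any amount of wirelength. The function name `lex_dijkstra`, the graph encoding, and the example graph are assumptions made for illustration; this is not the CornNC or SpiralRoute implementation.

```python
import heapq

def lex_dijkstra(graph, source, target):
    """Shortest path cost under lexicographically ordered tuple costs.

    graph maps node -> list of (neighbour, (congestion, wirelength)).
    Edge costs are assumed non-negative in every component.
    """
    zero = (0, 0)
    best = {source: zero}
    heap = [(zero, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return cost
        if cost > best[node]:
            continue  # stale heap entry, a cheaper cost was found already
        for nxt, edge in graph.get(node, []):
            # Monoid operation: component-wise addition of the cost tuples.
            new = (cost[0] + edge[0], cost[1] + edge[1])
            # Python compares tuples lexicographically.
            if nxt not in best or new < best[nxt]:
                best[nxt] = new
                heapq.heappush(heap, (new, nxt))
    return None  # target unreachable

# Hypothetical routing graph: the short path via "a" is congested,
# the long path via "b" is congestion-free.
example_graph = {
    "s": [("a", (1, 1)), ("b", (0, 5))],
    "a": [("t", (0, 1))],
    "b": [("t", (0, 5))],
}
```

On `example_graph`, a scalar cost that simply added the two components would prefer the route through `a` (total 3 against 10), whereas the lexicographic order selects the congestion-free route through `b`.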

    Evolutionary design of digital VLSI hardware

    On the Use of Directed Moves for Placement in VLSI CAD

    Search-based placement methods have long been used for placing integrated circuits targeting the field programmable gate array (FPGA) and standard cell design styles. Such methods offer the potential for high-quality solutions but often come at the cost of long run-times compared to alternative methods. This dissertation examines strategies for enhancing local search heuristics, and in particular simulated annealing, through the application of directed moves. These moves help to guide a search-based optimizer by focusing efforts on states which are most likely to yield productive improvement, effectively pruning the size of the search space. The engineering theory and implementation details of directed moves are discussed in the context of both field programmable gate array and standard cell designs. This work explores the ways in which such moves can be used to improve the quality of FPGA placements, improve the robustness of floorplan repair and legalization methods for mixed-size standard cell designs, and enhance the quality of detailed placement for standard cell circuits. The analysis presented herein confirms the validity and efficacy of directed moves, and supports the use of such heuristics within various optimization frameworks.
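To make the idea of a directed move concrete, here is a minimal sketch of simulated annealing on a toy one-dimensional placement. With probability `p_directed` the move sends a cell towards the median position of the cells it shares nets with; otherwise it swaps with a random slot. The cost model (net span on a single row), the cooling schedule, and every name in the code are assumptions chosen for brevity, not the dissertation's actual moves or parameters.

```python
import math
import random
from statistics import median

def wirelength(pos, nets):
    """Sum over nets of the span (max - min) of member cell positions."""
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net) for net in nets)

def directed_target(cell, pos, nets):
    """Slot nearest the median position of cells sharing a net with `cell`."""
    others = [pos[c] for net in nets if cell in net for c in net if c != cell]
    return round(median(others)) if others else pos[cell]

def anneal(n_cells, nets, steps=2000, p_directed=0.5, seed=0):
    rng = random.Random(seed)
    pos = list(range(n_cells))            # pos[cell] = slot index (a permutation)
    rng.shuffle(pos)
    cost = start_cost = best = wirelength(pos, nets)
    temp = 2.0
    for _ in range(steps):
        a = rng.randrange(n_cells)
        if rng.random() < p_directed:
            slot = directed_target(a, pos, nets)   # directed move
        else:
            slot = rng.randrange(n_cells)          # undirected random move
        b = pos.index(slot)                        # cell currently in that slot
        pos[a], pos[b] = pos[b], pos[a]
        new_cost = wirelength(pos, nets)
        delta = new_cost - cost
        # Metropolis criterion: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            best = min(best, cost)
        else:
            pos[a], pos[b] = pos[b], pos[a]        # rejected: undo the swap
        temp = max(temp * 0.998, 1e-3)             # geometric cooling
    return start_cost, best, pos

# Hypothetical netlist over 8 cells.
example_nets = [(0, 1, 2), (2, 3), (3, 4, 5), (5, 6), (6, 7), (0, 7)]
```

Setting `p_directed=0.0` gives a purely random-move annealer, which makes it easy to compare the two move mixes on the same netlist and seed.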

    Replicable parallel branch and bound search

    Combinatorial branch and bound searches are a common technique for solving global optimisation and decision problems. Their performance often depends on good search order heuristics, refined over decades of algorithms research. Parallel search necessarily deviates from the sequential search order, sometimes dramatically and unpredictably, e.g. by distributing work at random. This can disrupt effective search order heuristics and lead to unexpected and highly variable parallel performance. The variability makes it hard to reason about the parallel performance of combinatorial searches. This paper presents a generic parallel branch and bound skeleton, implemented in Haskell, with replicable parallel performance. The skeleton aims to preserve the search order heuristic by distributing work in an ordered fashion, closely following the sequential search order. We demonstrate the generality of the approach by applying the skeleton to 40 instances of three combinatorial problems: Maximum Clique, 0/1 Knapsack and Travelling Salesperson. The overheads of our Haskell skeleton are reasonable, giving slowdown factors of between 1.9 and 6.2 compared with a class-leading, dedicated, and highly optimised C++ Maximum Clique solver. We demonstrate scaling up to 200 cores of a Beowulf cluster, achieving speedups of 100x for several Maximum Clique instances. We demonstrate low variance of parallel performance across all instances of the three combinatorial problems and at all scales up to 200 cores, with median Relative Standard Deviation (RSD) below 2%. Parallel solvers that do not follow the sequential search order exhibit far higher variance, with median RSD exceeding 85% for Knapsack.
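For readers unfamiliar with why search order matters, the following is a small sequential branch and bound for 0/1 Knapsack. Items are visited in decreasing value density (the search order heuristic), and a subtree is pruned when an optimistic fractional bound cannot beat the incumbent. It is a sketch in Python with hypothetical names, not the paper's Haskell skeleton, and it is not parallel.

```python
def knapsack_bb(items, capacity):
    """Best total value for 0/1 Knapsack.

    items: list of (value, weight) pairs with weight > 0.
    """
    # Search order heuristic: most valuable per unit weight first.
    items = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)
    best = 0

    def bound(i, value, room):
        # Optimistic bound: fill the remaining room greedily and allow
        # a fraction of the first item that does not fit.
        for v, w in items[i:]:
            if w <= room:
                value, room = value + v, room - w
            else:
                return value + v * room / w
        return value

    def search(i, value, room):
        nonlocal best
        best = max(best, value)           # update the incumbent
        if i == len(items) or bound(i, value, room) <= best:
            return                        # leaf reached, or subtree pruned
        v, w = items[i]
        if w <= room:
            search(i + 1, value + v, room - w)   # branch 1: take item i
        search(i + 1, value, room)               # branch 2: skip item i

    search(0, 0, capacity)
    return best
```

Because good incumbents are found early along the heuristic order, later subtrees are pruned sooner; a parallel scheme that hands out subtrees at random loses this property, which is the source of variability the abstract describes.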

    Separation logic for high-level synthesis

    High-level synthesis (HLS) promises a significant shortening of the digital hardware design cycle by raising the abstraction level of the design entry to high-level languages such as C/C++. However, applications using dynamic, pointer-based data structures remain difficult to implement well, yet such constructs are widely used in software. Automated optimisations that leverage the memory bandwidth of dedicated hardware implementations by distributing the application data over separate on-chip memories and parallelise the implementation are often ineffective in the presence of dynamic data structures, due to the lack of an automated analysis that disambiguates pointer-based memory accesses. This thesis takes a step towards closing this gap. We explore recent advances in separation logic, a rigorous mathematical framework that enables formal reasoning about the memory access of heap-manipulating programs. We develop a static analysis that automatically splits heap-allocated data structures into provably disjoint regions. Our algorithm focuses on dynamic data structures accessed in loops and is accompanied by automated source-to-source transformations which enable loop parallelisation and physical memory partitioning by off-the-shelf HLS tools. We then extend the scope of our technique to pointer-based memory-intensive implementations that require access to an off-chip memory. The extended HLS design aid generates parallel on-chip multi-cache architectures. It uses the disjointness property of memory accesses to support non-overlapping memory regions by private caches. It also identifies regions which are shared after parallelisation and which are supported by parallel caches with a coherency mechanism and synchronisation, resulting in automatically specialised memory systems. We show up to 15x acceleration from heap partitioning, parallelisation and the insertion of the custom cache system in demonstrably practical applications.

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources, and as a result dedicated hardware accelerators are being developed for all use cases. Different accelerators provide solutions ranging from hyperscale cloud environments for the training of DNNs to inference devices in embedded systems. They implement intrinsics for complex operations directly in hardware; a common example is an intrinsic for matrix multiplication. However, there exists a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. How DNNs can efficiently utilize the specialized hardware intrinsics is still mainly defined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research. Existing literature often works with transformation-driven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity of this task has not yet been explored, especially for less frequently used operators like Capsule Routing. Not only is the implementation of DNN operators with intrinsics challenging; their optimization on the target device is also difficult. Hardware-in-the-loop tools are often used for this problem. They use latency measurements of implementation candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so that not every arithmetically correct implementation is a valid program for the accelerator. These invalid implementations can lead to unnecessarily long optimization times. This work investigates the complexity of transformation-driven processes to automatically embed hardware intrinsics into DNN operators. It is explored with a custom, graph-based intermediate representation (IR). While operators like Fully Connected Layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformations can lead to scaling issues. Building on these insights, this work proposes a novel method to embed hardware intrinsics into DNN operators. It is based on a dataflow analysis. The dataflow embedding method allows the exploration of how intrinsics and operators match without explicit transformations. From the results it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the-art performance for a wide range of convolutions, while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of 2.813× while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median time-to-solution is reduced by a factor of 2.40. At the same time, the possibility of prolonged searches due to a bad initial set of implementations is reduced, improving the optimization's robustness by a factor of 2.35.
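To illustrate what embedding a hardware intrinsic into an operator means, the sketch below computes a Fully Connected Layer (a matrix product) using only a fixed-size 4×4 multiply-accumulate routine. `intrinsic_matmul` is a software stand-in for a hypothetical accelerator intrinsic, and the tiling and zero-padding are hand-written transformations of the kind a transformation-driven approach would have to discover. It does not reproduce the dataflow-based method proposed in the work; all names and the tile size are assumptions.

```python
T = 4  # tile size of the hypothetical intrinsic

def intrinsic_matmul(a, b, acc):
    """Stand-in for a fixed-size hardware intrinsic: acc += a @ b on TxT tiles."""
    for i in range(T):
        for j in range(T):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(T))

def tile(m, r0, c0):
    """Extract the TxT tile at (r0, c0), zero-padding beyond the matrix edges."""
    rows, cols = len(m), len(m[0])
    return [[m[r][c] if r < rows and c < cols else 0
             for c in range(c0, c0 + T)]
            for r in range(r0, r0 + T)]

def fully_connected(x, w):
    """Compute x @ w using only intrinsic_matmul.

    x: batch x in_features, w: in_features x out_features (lists of lists).
    """
    n, k, m = len(x), len(w), len(w[0])
    y = [[0] * m for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            acc = [[0] * T for _ in range(T)]
            for k0 in range(0, k, T):          # accumulate along the reduction axis
                intrinsic_matmul(tile(x, i0, k0), tile(w, k0, j0), acc)
            for i in range(i0, min(i0 + T, n)):  # write back, dropping the padding
                for j in range(j0, min(j0 + T, m)):
                    y[i][j] = acc[i - i0][j - j0]
    return y

def naive_matmul(x, w):
    """Reference implementation used to check the tiled version."""
    return [[sum(x[i][p] * w[p][j] for p in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]
```

Even in this simple case the mapping requires choices about padding, loop order, and data layout; operators with more complex access patterns multiply those choices, which is the scaling problem the abstract refers to.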

    Faster elliptic-curve discrete logarithms on FPGAs

    This paper accelerates FPGA computations of discrete logarithms on elliptic curves over binary fields. As a toy example, this paper successfully attacks the SECG standard curve sect113r2, a binary elliptic curve that was not removed from the SECG standard until 2010 and was not disabled in OpenSSL until June 2015. This is a new size record for completed ECDL computations, using a prime order very slightly larger than the previous record holder. More importantly, this paper uses FPGAs much more efficiently, saving a factor close to 3/2 in the size of each high-speed ECDL core. This paper squeezes 3 cores into a low-cost Spartan-6 FPGA and many more cores into larger FPGAs. The paper also benchmarks many smaller-size attacks to demonstrate reliability of the estimates, and covers a much larger curve over a 127-bit field to demonstrate scalability.

    Optimization for Decision Making II

    In the current context of the electronic governance of society, both administrations and citizens are demanding greater participation of all the actors involved in the decision-making process relative to the governance of society. This book presents collective works published in the recent Special Issue (SI) entitled “Optimization for Decision Making II”. These works respond to the new challenges raised: the decision-making process can be carried out by applying different methods and tools, as well as by pursuing different objectives. In real-life problems, the formulation of decision-making problems and the application of optimization techniques to support decisions are particularly complex, and a wide range of optimization techniques and methodologies are used to minimize risks, improve the quality of decisions or, more generally, solve problems. In addition, a sensitivity or robustness analysis should be done to validate and analyze the influence of uncertainty on decision-making. This book brings together, in a coherent manner, a collection of inter- and multi-disciplinary works applied to the optimization of decision making.

    Contributions à l'optimisation de programmes et à la synthèse de circuits haut-niveau

    Since the end of Dennard scaling, power efficiency has been the limiting factor for large-scale computing. Hardware accelerators such as reconfigurable circuits (FPGA, CGRA) or Graphics Processing Units (GPUs) were introduced to improve performance under a limited energy budget, resulting in complex heterogeneous platforms. This document presents a synthetic description of my research activities over the last decade on compilers for high-performance computing and high-level synthesis of circuits (HLS) for FPGA accelerators. Specifically, my contributions cover both theoretical and practical aspects of automatic parallelization and HLS in a general theoretical framework called the polyhedral model. A first chapter describes our contributions to loop tiling, a key program transformation for automatic parallelization which splits the computation into atomic blocks called tiles. We rephrase loop tiling in the polyhedral model to enable any polyhedral tile shape whose size depends on a single parameter (monoparametric tiling), and we present a tiling transformation for programs with reductions, that is, accumulations with respect to an associative and commutative operator. Our results open the way for semantic program transformations: program transformations which do not preserve the computation but still lead to an equivalent program. A second chapter describes our contributions to algorithm recognition. A compiler optimization will never replace a good algorithm, hence the idea of recognizing algorithm instances in a program and substituting them with a call to a performance library. In our PhD thesis, we addressed the recognition of templates (functions with first-order variables) in programs and its application to program optimization. We propose a complementary algorithm recognition framework which leverages our monoparametric tiling and our reduction tiling transformations. This automates semantic tiling, a new semantic program transformation which increases the grain of operators (scalar → matrix). A third chapter presents our contributions to the synthesis of communications with an off-chip memory in the context of high-level circuit synthesis (HLS). We propose an execution model based on loop tiling, a pipelined architecture and a source-level compilation algorithm which, connected to the C2H HLS tool from Altera, produces an FPGA configuration with minimized data transfers. Our compilation algorithm is optimal: the data are loaded as late as possible and stored as soon as possible, with maximal reuse. A fourth chapter presents our contributions to the design of a unified polyhedral compilation model for high-level circuit synthesis. We present Data-aware Process Networks (DPN), a dataflow intermediate representation which leverages the ideas developed in chapter 3 to make explicit the data transfers with an off-chip memory. We propose an algorithm to compile a DPN from a sequential program, and we present our contribution to the synthesis of a DPN into a circuit. In particular, we present our algorithms to compile the control, the channels and the synchronizations of a DPN. These results are used in the production compiler of the XtremLogic start-up.
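The notion of tiling a reduction can be shown with a short sketch. The loop over the data is split into tiles, each tile is reduced to a partial result, and the partials are then combined. The sequence of operations differs from the untiled loop, yet the result is the same whenever the operator is associative, which is the sense in which such a transformation changes the computation while keeping the program equivalent. The function name `tiled_reduce` and its signature are assumptions for illustration; the polyhedral machinery of the work itself is not modelled here.

```python
from functools import reduce

def tiled_reduce(op, identity, data, tile_size):
    """Reduce `data` with `op` tile by tile, then combine the per-tile partials.

    `op` is assumed associative with `identity` as its neutral element.
    """
    partials = []
    for start in range(0, len(data), tile_size):
        acc = identity
        for x in data[start:start + tile_size]:   # intra-tile accumulation
            acc = op(acc, x)
        partials.append(acc)
    return reduce(op, partials, identity)         # inter-tile accumulation

def untiled_reduce(op, identity, data):
    """Reference: the original single accumulation loop."""
    return reduce(op, data, identity)
```

The per-tile partials are independent of each other, so they can be computed in parallel. With floating-point addition, which is not exactly associative, the tiled and untiled results may differ in the last bits.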
