69 research outputs found

    Reconfigurable computing for large-scale graph traversal algorithms

    Get PDF
    This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem. The proposed methodology to accelerate graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customisations can benefit performance. A summary of our four contributions is as follows. First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms. We propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems. Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm, and a graphlet counting algorithm from bioinformatics. Both case studies involve graph traversal, but each of them adopts a different graph data representation. Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses for accelerating graph traversal algorithms, through a case-study of the All-Pairs Shortest-Paths algorithm. This case study has been applied to process human brain network data. Fourth, an evaluation of an approach based on instruction-set extension for FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems.Open Acces

    Überblick zur Softwareentwicklung in Wissenschaftlichen Anwendungen

    Get PDF
    Viele wissenschaftliche Disziplinen müssen heute immer komplexer werdende numerische Probleme lösen. Die Komplexität der benutzten wissenschaftlichen Software steigt dabei kontinuierlich an. Diese Komplexitätssteigerung wird durch eine ganze Reihe sich ändernder Anforderungen verursacht: Die Betrachtung gekoppelter Phänomene gewinnt Aufmerksamkeit und gleichzeitig müssen neue Technologien wie das Grid-Computing oder neue Multiprozessorarchitekturen genutzt werden, um weiterhin in angemessener Zeit zu Berechnungsergebnissen zu kommen. Diese Fülle an neuen Anforderungen kann nicht mehr von kleinen spezialisierten Wissenschaftlergruppen in Isolation bewältigt werden. Die Entwicklung wissenschaftlicher Software muss vielmehr in interdisziplinären Gruppen geschehen, was neue Herausforderungen in der Softwareentwicklung induziert. Ein Paradigmenwechsel zu einer stärkeren Separation von Verantwortlichkeiten innerhalb interdisziplinärer Entwicklergruppen ist bis jetzt in vielen Fällen nur in Ansätzen erkennbar. Die Kopplung partitioniert durchgeführter Simulationen physikalischer Phänomene ist ein wichtiges Beispiel für softwaretechnisch herausfordernde Aufgaben im Gebiet des wissenschaftlichen Rechnens. In diesem Kontext modellieren verschiedene Simulationsprogramme unterschiedliche Teile eines komplexeren gekoppelten Systems. Die vorliegende Arbeit gibt einen Überblick über Paradigmen, die darauf abzielen Softwareentwicklung für Berechnungsprogramme verlässlicher und weniger abhängig voneinander zu machen. Ein spezielles Augenmerk liegt auf der Entwicklung gekoppelter Simulationen.Fields of modern science and engineering are in need of solving more and more complex numerical problems. The complexity of scientific software thereby rises continuously. This growth is caused by a number of changing requirements. Coupled phenomena gain importance and new technologies like the computational Grid, graphical and heterogeneous multi-core processors have to be used to achieve high-performance. The amount of additional complexity can not be handled by small groups of specialised scientists. The interdiciplinary nature of scientific software thereby presents new challanges for software engineering. A paradigm shift towards a stronger separation of concerns becomes necessary in the development of future scientific software. The coupling of independently simulated physical phenomena is an important example for a software-engineering concern in the domain of computational science. In this context, different simulation-programs model only a part of a more complex coupled system. The present work gives overview on paradigms which aim at making software-development in computational sciences more reliable and less interdependent. A special focus is put on the development of coupled simulations

    Late-bound code generation

    Get PDF
    Each time a function or method is invoked during the execution of a program, a stream of instructions is issued to some underlying hardware platform. But exactly what underlying hardware, and which instructions, is usually left implicit. However in certain situations it becomes important to control these decisions. For example, particular problems can only be solved in real-time when scheduled on specialised accelerators, such as graphics coprocessors or computing clusters. We introduce a novel operator for hygienically reifying the behaviour of a runtime function instance as a syntactic fragment, in a language which may in general differ from the source function definition. Translation and optimisation are performed by recursively invoked, dynamically dispatched code generators. Side-effecting operations are permitted, and their ordering is preserved. We compare our operator with other techniques for pragmatic control, observing that: the use of our operator supports lifting arbitrary mutable objects, and neither requires rewriting sections of the source program in a multi-level language, nor interferes with the interface to individual software components. Due to its lack of interference at the abstraction level at which software is composed, we believe that our approach poses a significantly lower barrier to practical adoption than current methods. The practical efficacy of our operator is demonstrated by using it to offload the user interface rendering of a smartphone application to an FPGA coprocessor, including both statically and procedurally defined user interface components. The generated pipeline is an application-specific, statically scheduled processor-per-primitive rendering pipeline, suitable for place-and-route style optimisation. To demonstrate the compatibility of our operator with existing languages, we show how it may be defined within the Python programming language. We introduce a transformation for weakening mutable to immutable named bindings, termed let-weakening, to solve the problem of propagating information pertaining to named variables between modular code generating units.Open Acces

    Accelerating orchestration with in-network offloading

    Get PDF
    The demand for low-latency Internet applications has pushed functionality that was originally placed in commodity hardware into the network. Either in the form of binaries for the programmable data plane or virtualised network functions, services are implemented within the network fabric with the aim of improving their performance and placing them close to the end user. Training of machine learning algorithms, aggregation of networking traffic, virtualised radio access components, are just some of the functions that have been deployed within the network. Therefore, as the network fabric becomes the accelerator for various applications, it is imperative that the orchestration of their components is also adapted to the constraints and capabilities of the deployment environment. This work identifies performance limitations of in-network compute use cases for both cloud and edge environments and makes suitable adaptations. Within cloud infrastructure, this thesis proposes a platform that relies on programmable switches to accelerate the performance of data replication. It then proceeds to discuss design adaptations of an orchestrator that will allow in-network data offloading and enable accelerated service deployment. At the edge, the topic of inefficient orchestration of virtualised network functions is explored, mainly with respect to energy usage and resource contention. An orchestrator is adapted to schedule requests by taking into account edge constraints in order to minimise resource contention and accelerate service processing times. With data transfers consuming valuable resources at the edge, an efficient data representation mechanism is implemented to provide statistical insight on the provenance of data at the edge and enable smart query allocation to nodes with relevant data. Taking into account the previous state of the art, the proposed data plane replication method appears to be the most computationally efficient and scalable in-network data replication platform available, with significant improvements in throughput and up to an order of magnitude decrease in latency. The orchestrator of virtual network functions at the edge was shown to reduce event rejections, total processing time, and energy consumption imbalances over the default orchestrator, thus proving more efficient use of the infrastructure. Lastly, computational cost at the edge was further reduced with the use of the proposed query allocation mechanism which minimised redundant engagement of nodes

    Productive and efficient computational science through domain-specific abstractions

    Get PDF
    In an ideal world, scientific applications are computationally efficient, maintainable and composable and allow scientists to work very productively. We argue that these goals are achievable for a specific application field by choosing suitable domain-specific abstractions that encapsulate domain knowledge with a high degree of expressiveness. This thesis demonstrates the design and composition of domain-specific abstractions by abstracting the stages a scientist goes through in formulating a problem of numerically solving a partial differential equation. Domain knowledge is used to transform this problem into a different, lower level representation and decompose it into parts which can be solved using existing tools. A system for the portable solution of partial differential equations using the finite element method on unstructured meshes is formulated, in which contributions from different scientific communities are composed to solve sophisticated problems. The concrete implementations of these domain-specific abstractions are Firedrake and PyOP2. Firedrake allows scientists to describe variational forms and discretisations for linear and non-linear finite element problems symbolically, in a notation very close to their mathematical models. PyOP2 abstracts the performance-portable parallel execution of local computations over the mesh on a range of hardware architectures, targeting multi-core CPUs, GPUs and accelerators. Thereby, a separation of concerns is achieved, in which Firedrake encapsulates domain knowledge about the finite element method separately from its efficient parallel execution in PyOP2, which in turn is completely agnostic to the higher abstraction layer. As a consequence of the composability of those abstractions, optimised implementations for different hardware architectures can be automatically generated without any changes to a single high-level source. Performance matches or exceeds what is realistically attainable by hand-written code. Firedrake and PyOP2 are combined to form a tool chain that is demonstrated to be competitive with or faster than available alternatives on a wide range of different finite element problems.Open Acces

    Rethinking FPGA Architectures for Deep Neural Network applications

    Get PDF
    The prominence of machine learning-powered solutions instituted an unprecedented trend of integration into virtually all applications with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the edges of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where the latency and energy consumption are strictly limited, their pre-machine learning optimised architectures remain a barrier to the overall efficiency and performance. Realizing this shortcoming, this thesis demonstrates an architectural study aiming at solutions that enable hidden potentials in the FPGA technology, primarily for machine learning algorithms. Particularly, it shows how slight alterations to the state-of-the-art architectures could significantly enhance the FPGAs toward becoming more machine learning-friendly while maintaining the near-promised performance for the rest of the applications. Eventually, it presents a novel systematic approach to deriving new block architectures guided by designing limitations and machine learning algorithm characteristics through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Eventually, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested together with a family of new embedded blocks, called MLBlocks

    Embedded electronic systems driven by run-time reconfigurable hardware

    Get PDF
    Abstract This doctoral thesis addresses the design of embedded electronic systems based on run-time reconfigurable hardware technology –available through SRAM-based FPGA/SoC devices– aimed at contributing to enhance the life quality of the human beings. This work does research on the conception of the system architecture and the reconfiguration engine that provides to the FPGA the capability of dynamic partial reconfiguration in order to synthesize, by means of hardware/software co-design, a given application partitioned in processing tasks which are multiplexed in time and space, optimizing thus its physical implementation –silicon area, processing time, complexity, flexibility, functional density, cost and power consumption– in comparison with other alternatives based on static hardware (MCU, DSP, GPU, ASSP, ASIC, etc.). The design flow of such technology is evaluated through the prototyping of several engineering applications (control systems, mathematical coprocessors, complex image processors, etc.), showing a high enough level of maturity for its exploitation in the industry.Resumen Esta tesis doctoral abarca el diseño de sistemas electrónicos embebidos basados en tecnología hardware dinámicamente reconfigurable –disponible a través de dispositivos lógicos programables SRAM FPGA/SoC– que contribuyan a la mejora de la calidad de vida de la sociedad. Se investiga la arquitectura del sistema y del motor de reconfiguración que proporcione a la FPGA la capacidad de reconfiguración dinámica parcial de sus recursos programables, con objeto de sintetizar, mediante codiseño hardware/software, una determinada aplicación particionada en tareas multiplexadas en tiempo y en espacio, optimizando así su implementación física –área de silicio, tiempo de procesado, complejidad, flexibilidad, densidad funcional, coste y potencia disipada– comparada con otras alternativas basadas en hardware estático (MCU, DSP, GPU, ASSP, ASIC, etc.). Se evalúa el flujo de diseño de dicha tecnología a través del prototipado de varias aplicaciones de ingeniería (sistemas de control, coprocesadores aritméticos, procesadores de imagen, etc.), evidenciando un nivel de madurez viable ya para su explotación en la industria.Resum Aquesta tesi doctoral està orientada al disseny de sistemes electrònics empotrats basats en tecnologia hardware dinàmicament reconfigurable –disponible mitjançant dispositius lògics programables SRAM FPGA/SoC– que contribueixin a la millora de la qualitat de vida de la societat. S’investiga l’arquitectura del sistema i del motor de reconfiguració que proporcioni a la FPGA la capacitat de reconfiguració dinàmica parcial dels seus recursos programables, amb l’objectiu de sintetitzar, mitjançant codisseny hardware/software, una determinada aplicació particionada en tasques multiplexades en temps i en espai, optimizant així la seva implementació física –àrea de silici, temps de processat, complexitat, flexibilitat, densitat funcional, cost i potència dissipada– comparada amb altres alternatives basades en hardware estàtic (MCU, DSP, GPU, ASSP, ASIC, etc.). S’evalúa el fluxe de disseny d’aquesta tecnologia a través del prototipat de varies aplicacions d’enginyeria (sistemes de control, coprocessadors aritmètics, processadors d’imatge, etc.), demostrant un nivell de maduresa viable ja per a la seva explotació a la indústria

    An Efficient NoC-based Framework To Improve Dataflow Thread Management At Runtime

    Get PDF
    This doctoral thesis focuses on how the application threads that are based on dataflow execution model can be managed at Network-on-Chip (NoC) level. The roots of the dataflow execution model date back to the early 1970’s. Applications adhering to such program execution model follow a simple producer-consumer communication scheme for synchronising parallel thread related activities. In dataflow execution environment, a thread can run if and only if all its required inputs are available. Applications running on a large and complex computing environment can significantly benefit from the adoption of dataflow model. In the first part of the thesis, the work is focused on the thread distribution mechanism. It has been shown that how a scalable hash-based thread distribution mechanism can be implemented at the router level with low overheads. To enhance the support further, a tool to monitor the dataflow threads’ status and a simple, functional model is also incorporated into the design. Next, a software defined NoC has been proposed to manage the distribution of dataflow threads by exploiting its reconfigurability. The second part of this work is focused more on NoC microarchitecture level. Traditional 2D-mesh topology is combined with a standard ring, to understand how such hybrid network topology can outperform the traditional topology (such as 2D-mesh). Finally, a mixed-integer linear programming based analytical model has been proposed to verify if the application threads mapped on to the free cores is optimal or not. The proposed mathematical model can be used as a yardstick to verify the solution quality of the newly developed mapping policy. It is not trivial to provide a complete low-level framework for dataflow thread execution for better resource and power management. However, this work could be considered as a primary framework to which improvements could be carried out

    Hardware / Software System for Portable and Low-Cost Genome Assembly

    Full text link
    “The enjoyment of the highest attainable standard of health is one of the fundamental rights of every human being without distinction of race, religion, political belief, economic or social condition” [56]. Genomics (the study of the entire DNA) provides such a standard of health for people with rare diseases and helps control the spread of pandemics. Still, millions of human beings are unable to access genomics due to its cost, and portability. In genomics, DNA sequencers digitise DNA information, and computers analyse the digitised information. We have desktop and thumb-sized DNA sequencers, that digitise the DNA data rapidly. But computations necessary for the analysis of this data are inevitably performed on high-performance computers (HPCs) and cloud computers. These computations not only require powerful computers but also necessitate high-speed networks since the data generated are in the hundreds of gigabytes. Relying on HPCs and high-speed networks, deny the benefits that can be reaped by genomics for the masses who live in remote areas and in poorer nations. Developing a low-cost and portable genomics computation platform would provide personalised treatment based on an individual’s DNA and identify the source of the fast-spreading epidemics in remote areas and areas without HPC or network infrastructure. But developing a low-cost and portable genome analysing computing platform is a challenging task. This thesis develops novel computer architecture solutions to assemble the whole human DNA and COVID-19 virus RNA on a low-cost and portable platform. The first phase of the solution describes a ring-pipelined processor architecture for a key genome assembly algorithm. The human genome is partitioned to fit into the small memory footprint of embedded processors. These techniques allow an entire human genome to be assembled using highly portable and low-cost embedded processor cores. These processor cores can be housed within a single chip. Each processor was only 0.08 mm 2 and consumed just 37.5 mW. It has only 2 GB memory, 32-bit instruction width, and a clock with a 1 GHz frequency. The second phase of the solution describes how application-specific instruction-set processors can be sped up to execute a key genome assembly algorithm. A fully automated design system is presented, which improves the performance of large applications (such as genome assembly algorithm) and generates application-specific instructions for a commercial processor design tool (Xtensa). The tool enhances the base processor, which was used in the ring pipeline processor architecture. Thus, the alignment algorithms execute 2.1 times faster with only 11% additional hardware. The energy-delay product was reduced by 7.3× compared to the base processor. This tool is the only one of its type which can handle applications which are large. The third phase of the solution designs a portable low-cost genome assembly computer (PGA). PGA enhances the ring pipeline architecture with the customised processor found in phase two and with improved inter-processor communication. The results show that the COVID-19 virus RNA can be assembled in under 10 minutes and the whole human genome can be assembled in 11 days on a portable platform (HPC take around two days) for 30× coverage. PGA has an area footprint of just 5.68 mm 2 in a 28 nm technology node and is far smaller than a high-performance computer processor chip. The PGA consumes only 4W of power, which is lower than the power requirement of a high-performance processor chip. The manufacturing cost of the PGA also would be much cheaper than the high-performance system cost, when produced in volume. The developed solution can be powered by a USB port of a laptop. This thesis is the first of its type to show the design of a single-chip solution to be able to process a complex genomic problem. This thesis contributes to attaining one of the fundamental rights of every human being wherever they may live

    Efficient Algorithms for Artificial Neural Networks and Explainable AI

    Get PDF
    Artificial neural networks have allowed some remarkable progress in fields such as pattern recognition and computer vision. However, the increasing complexity of artificial neural networks presents a challenge for efficient computation. In this thesis, we first introduce a novel matrix multiplication method to reduce the complexity of artificial neural networks, where we demonstrate its suitability to compress fully connected layers of artificial neural networks. Our method outperforms other state-of-the-art methods when tested on standard publicly available datasets. This thesis then focuses on Explainable AI, which can be critical in fields like finance and medicine, as it can provide explanations for some decisions taken by sub-symbolic AI models behaving like a black box such as Artificial neural networks and transformation based learning approaches. We have also developed a new framework that facilitates the use of Explainable AI with tabular datasets. Our new framework Exmed, enables nonexpert users to prepare data, train models, and apply Explainable AI techniques effectively.Additionally, we propose a new algorithm that identifies the overall influence of input features and minimises the perturbations that alter the decision taken by a given model. Overall, this thesis introduces innovative and comprehensive techniques to enhance the efficiency of fully connected layers in artificial neural networks and provide a new approach to explain their decisions. These methods have significant practical applications in various fields, including portable medical devices
    corecore