210 research outputs found

    Empowering parallel computing with field programmable gate arrays

    Get PDF
    After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements

    A data-flow execution engine for scalable embedded computing

    Get PDF
    Nowadays embedded systems are increasingly used in the world of distributed computing to provide more computational power without having to change the whole system and the programming model. We propose a DataFlow Execution Engine (DEE) to spawn asynchronous, data-driven threads, among embedded cores to achieve a seamless distribution of threads without the need of using a distributed programming model. Our idea relies on the creation of a hardware scheduler that can handle creation, thread-dependency, and locality of many fine-grained tasks. We present an initial evaluation of our DEE that is suited for FPGA implementation. Our initial results show the importance of a hardware based support for such thread execution model

    Improving Model-Based Software Synthesis: A Focus on Mathematical Structures

    Get PDF
    Computer hardware keeps increasing in complexity. Software design needs to keep up with this. The right models and abstractions empower developers to leverage the novelties of modern hardware. This thesis deals primarily with Models of Computation, as a basis for software design, in a family of methods called software synthesis. We focus on Kahn Process Networks and dataflow applications as abstractions, both for programming and for deriving an efficient execution on heterogeneous multicores. The latter we accomplish by exploring the design space of possible mappings of computation and data to hardware resources. Mapping algorithms are not at the center of this thesis, however. Instead, we examine the mathematical structure of the mapping space, leveraging its inherent symmetries or geometric properties to improve mapping methods in general. This thesis thoroughly explores the process of model-based design, aiming to go beyond the more established software synthesis on dataflow applications. We starting with the problem of assessing these methods through benchmarking, and go on to formally examine the general goals of benchmarks. In this context, we also consider the role modern machine learning methods play in benchmarking. We explore different established semantics, stretching the limits of Kahn Process Networks. We also discuss novel models, like Reactors, which are designed to be a deterministic, adaptive model with time as a first-class citizen. By investigating abstractions and transformations in the Ohua language for implicit dataflow programming, we also focus on programmability. The focus of the thesis is in the models and methods, but we evaluate them in diverse use-cases, generally centered around Cyber-Physical Systems. These include the 5G telecommunication standard, automotive and signal processing domains. We even go beyond embedded systems and discuss use-cases in GPU programming and microservice-based architectures

    High performance graph analysis on parallel architectures

    Get PDF
    PhD ThesisOver the last decade pharmacology has been developing computational methods to enhance drug development and testing. A computational method called network pharmacology uses graph analysis tools to determine protein target sets that can lead on better targeted drugs for diseases as Cancer. One promising area of network-based pharmacology is the detection of protein groups that can produce better e ects if they are targeted together by drugs. However, the e cient prediction of such protein combinations is still a bottleneck in the area of computational biology. The computational burden of the algorithms used by such protein prediction strategies to characterise the importance of such proteins consists an additional challenge for the eld of network pharmacology. Such computationally expensive graph algorithms as the all pairs shortest path (APSP) computation can a ect the overall drug discovery process as needed network analysis results cannot be given on time. An ideal solution for these highly intensive computations could be the use of super-computing. However, graph algorithms have datadriven computation dictated by the structure of the graph and this can lead to low compute capacity utilisation with execution times dominated by memory latency. Therefore, this thesis seeks optimised solutions for the real-world graph problems of critical node detection and e ectiveness characterisation emerged from the collaboration with a pioneer company in the eld of network pharmacology as part of a Knowledge Transfer Partnership (KTP) / Secondment (KTS). In particular, we examine how genetic algorithms could bene t the prediction of protein complexes where their removal could produce a more e ective 'druggable' impact. Furthermore, we investigate how the problem of all pairs shortest path (APSP) computation can be bene ted by the use of emerging parallel hardware architectures as GPU- and FPGA- desktop-based accelerators. In particular, we address the problem of critical node detection with the development of a heuristic search method. It is based on a genetic algorithm that computes optimised node combinations where their removal causes greater impact than common impact analysis strategies. Furthermore, we design a general pattern for parallel network analysis on multi-core architectures that considers graph's embedded properties. It is a divide and conquer approach that decomposes a graph into smaller subgraphs based on its strongly connected components and computes the all pairs shortest paths concurrently on GPU. Furthermore, we use linear algebra to design an APSP approach based on the BFS algorithm. We use algebraic expressions to transform the problem of path computation to multiple independent matrix-vector multiplications that are executed concurrently on FPGA. Finally, we analyse how the optimised solutions of perturbation analysis and parallel graph processing provided in this thesis will impact the drug discovery process.This research was part of a Knowledge Transfer Partnership (KTP) and Knowledge Transfer Secondment (KTS) between e-therapeutics PLC and Newcastle University. It was supported as a collaborative project by e-therapeutics PLC and Technology Strategy boar

    Extension of a task-based model to functional programming

    Get PDF
    Recently, efforts have been made to bring together the areas of high-performance computing (HPC) and massive data processing (Big Data). Traditional HPC frameworks, like COMPSs, are mostly task-based, while popular big-data environments, like Spark, are based on functional programming principles. The earlier are know for their good performance for regular, matrix-based computations; on the other hand, for fine-grained, data-parallel workloads, the later has often been considered more successful. In this paper we present our experience with the integration of some dataflow techniques into COMPSs, a task-based framework, in an effort to bring together the best aspects of both worlds. We present our API, called DDF, which provides a new data abstraction that addresses the challenges of integrating Big Data application scenarios into COMPSs. DDF has a functional-based interface, similar to many Data Science tools, that allows us to use dynamic evaluation to adapt the task execution in runtime. Besides the performance optimization it provides, the API facilitates the development of applications by experts in the application domain. In this paper we evaluate DDF's effectiveness by comparing the resulting programs to their original versions in COMPSs and Spark. The results show that DDF can improve COMPSs execution time and even outperform Spark in many use cases.This work was partially supported by CAPES, CNPq, Fapemig and NIC.BR, and by projects Atmosphere (H2020-EU.2.1.1 777154) and INCT-Cyber.Peer ReviewedPostprint (author's final draft

    Towards Implicit Parallel Programming for Systems

    Get PDF
    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution. We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates

    Towards Implicit Parallel Programming for Systems

    Get PDF
    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution. We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates

    Towards Implicit Parallel Programming for Systems

    Get PDF
    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution. We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates
    corecore