2,499 research outputs found

    A Purely Functional Computer Algebra System Embedded in Haskell

    Full text link
    We demonstrate how methods in Functional Programming can be used to implement a computer algebra system. As a proof-of-concept, we present the computational-algebra package. It is a computer algebra system implemented as an embedded domain-specific language in Haskell, a purely functional programming language. Utilising methods in functional programming and prominent features of Haskell, this library achieves safety, composability, and correctness at the same time. To demonstrate the advantages of our approach, we have implemented advanced Gr\"{o}bner basis algorithms, such as Faug\`{e}re's F4F_4 and F5F_5, in a composable way.Comment: 16 pages, Accepted to CASC 201

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    Full text link
    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs whose programming is conventionally a RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve the FPGA programmability, it still leaves programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels, and more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations based on the result of the analytical-model-based design space exploration to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels

    Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures

    Get PDF
    A new solver featuring time-space adaptation and error control has been recently introduced to tackle the numerical solution of stiff reaction-diffusion systems. Based on operator splitting, finite volume adaptive multiresolution and high order time integrators with specific stability properties for each operator, this strategy yields high computational efficiency for large multidimensional computations on standard architectures such as powerful workstations. However, the data structure of the original implementation, based on trees of pointers, provides limited opportunities for efficiency enhancements, while posing serious challenges in terms of parallel programming and load balancing. The present contribution proposes a new implementation of the whole set of numerical methods including Radau5 and ROCK4, relying on a fully different data structure together with the use of a specific library, TBB, for shared-memory, task-based parallelism with work-stealing. The performance of our implementation is assessed in a series of test-cases of increasing difficulty in two and three dimensions on multi-core and many-core architectures, demonstrating high scalability

    Composable abstractions for synchronization in dynamic threading platforms

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.Cataloged from PDF version of thesis.Includes bibliographical references (p. 259-269).High-level abstractions for parallel programming simplify the development of efficient parallel applications. In particular, composable abstractions allow programmers to construct a complex parallel application out of multiple components, where each component itself may be designed to exploit parallelism. This dissertation presents the design of three composable abstractions for synchronization in dynamic-threading platforms, based on ideas of task-graph execution, helper locks, and transactional memory. These designs demonstrate provably efficient runtime scheduling for programs with synchronization. For applications that use task-graph synchronization, I demonstrate provably efficient execution of task graphs with arbitrary dependencies as a library in a fork-join platform. Conventional wisdom suggests that a fork-join platform can execute an arbitrary task graph only with special runtime support or by converting the graph into a series-parallel computation which has less parallelism. By implementing Nabbit, a Cilk++ library for arbitrary task-graph execution, I show that one can in fact avoid introducing runtime modifications or additional constraints on parallelism. Nabbit achieves an asymptotically optimal completion-time bound for task graphs with constant degree. For applications that use lock-based synchronization, I introduce helper locks, a new synchronization abstraction that enables programmers to exploit asynchronous task parallelism inside locked critical regions. When a processor fails to acquire a helper lock, it can help complete the parallel critical region protected by the lock instead of simply waiting for the lock to be released. I also present HELPER, a runtime for supporting helper locks, and prove theoretical performance bounds which imply that HELPER achieves linear speedup on programs with a small number of highly parallel critical regions. For applications that use transaction-based synchronization, I present CWSTM, the first design of a transactional memory (TM) system that supports transactions with nested parallelism and nested parallel transactions of unbounded nesting depth. CWSTM demonstrates that one can provide theoretical bounds on the overhead of transaction conflict detection which are independent of nesting depth. I also introduce the concept of ownership-aware TM, the idea of using information about which memory locations a software module owns to provide provable guarantees of safety and correctness for open-nested transactions.by Jim Sukha.Ph.D

    Actors that Unify Threads and Events

    Get PDF
    There is an impedance mismatch between message-passing concurrency and virtual machines, such as the JVM. VMs usually map their threads to heavyweight OS processes. Without a lightweight process abstraction, users are often forced to write parts of concurrent applications in an event-driven style which obscures control flow, and increases the burden on the programmer. In this paper we show how thread-based and event-based programming can be unified under a single actor abstraction. Using advanced abstraction mechanisms of the Scala programming language, we implemented our approach on unmodified JVMs. Our programming model integrates well with the threading model of the underlying VM

    Advances in parallel programming for electronic design automation

    Get PDF
    The continued miniaturization of the technology node increases not only the chip capacity but also the circuit design complexity. How does one efficiently design a chip with millions or billions transistors? This has become a challenging problem in the integrated circuit (IC) design industry, especially for the developers of electronic design automation (EDA) tools. To boost the performance of EDA tools, one promising direction is via parallel computing. In this dissertation, we explore different parallel computing approaches, from CPU to GPU to distributed computing, for EDA applications. Nowadays multi-core processors are prevalent from mobile devices to laptops to desktop, and it is natural for software developers to utilize the available cores to maximize the performance of their applications. Therefore, in this dissertation we first focus on multi-threaded programming. We begin by reviewing a C++ parallel programming library called Cpp-Taskflow. Cpp-Taskflow is designed to facilitate programming parallel applications, and has been successfully applied to an EDA timing analysis tool. We will demonstrate Cpp-Taskflow’s programming model and interface, software architecture and execution flow. Then, we improve Cpp-Taskflow in several aspects. First, we enhance Cpp-Taskflow’s usability through restructuring the software architecture. Second, we introduce task graph composition to support composability and modularity, which makes it easier for users to construct large and complex parallel patterns. Third, we add a new task type in Cpp-Taskflow to let users control the graph execution flow. This feature empowers the graph model with the ability to describe complex control flow. Aside from the above enhancements, we have designed a new scheduler to adaptively manage the threads based on available parallelism. The new scheduler uses a simple and effective strategy which can not only prevent resource from being underutilized, but also mitigate resource over-subscription. We have evaluated the new scheduler on both micro-benchmarks and a very-large-scale integration (VLSI) application, and the results show that the new scheduler can achieve good performance and is very energy-efficient. Next we study the applicability of heterogeneous computing, specifically the graphics processing unit (GPU), to EDA. We demonstrate how to use GPU to accelerate VLSI placement, and we show that GPU can bring substantial performance gain to VLSI placement. Finally, as the design size keeps increasing, a more scalable solution will be distributed computing. We introduce a distributed power grid analysis framework built on top of DtCraft. This framework allows users to flexibly partition the design and automatically deploy the computations across several machines. In addition, we propose a job scheduler that can efficiently utilize cluster resource to improve the framework’s performance
    • …
    corecore