329 research outputs found
Symbolic-numeric interface: A review
A survey of the use of a combination of symbolic and numerical calculations is presented. Symbolic calculations primarily refer to the computer processing of procedures from classical algebra, analysis, and calculus. Numerical calculations refer to both numerical mathematics research and scientific computation. This survey is intended to point out a large number of problem areas where a cooperation of symbolic and numerical methods is likely to bear many fruits. These areas include such classical operations as differentiation and integration, such diverse activities as function approximations and qualitative analysis, and such contemporary topics as finite element calculations and computation complexity. It is contended that other less obvious topics such as the fast Fourier transform, linear algebra, nonlinear analysis and error analysis would also benefit from a synergistic approach
Interfacing Mathemagix with C++
8 pagesIn this paper, we give a detailed description of the interface between the Mathemagix language and C++. In particular, we describe the mechanism which allows us to import a C++ template library (which only permits static instantiation) as a fully generic Mathemagix template library
Classical computing, quantum computing, and Shor's factoring algorithm
This is an expository talk written for the Bourbaki Seminar. After a brief
introduction, Section 1 discusses in the categorical language the structure of
the classical deterministic computations. Basic notions of complexity icluding
the P/NP problem are reviewed. Section 2 introduces the notion of quantum
parallelism and explains the main issues of quantum computing. Section 3 is
devoted to four quantum subroutines: initialization, quantum computing of
classical Boolean functions, quantum Fourier transform, and Grover's search
algorithm. The central Section 4 explains Shor's factoring algorithm. Section 5
relates Kolmogorov's complexity to the spectral properties of computable
function. Appendix contributes to the prehistory of quantum computing.Comment: 27 pp., no figures, amste
Overview of the Mathemagix type system
The goal of the Mathemagix project is to develop a new and free software for computer algebra and computer analysis, based on a strongly typed and compiled language. In this paper, we focus on the underlying type system of this language, which allows for heavy overloading, including parameterized overloading with parameters in so called "categories". The exposition is informal and aims at giving the reader an overview of the main concepts, ideas and differences with existing languages. In a forthcoming paper, we intend to describe the formal semantics of the type system in more details
Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology
Processing-in-memory (PIM) has been explored for decades by computer
architects, yet it has never seen the light of day in real-world products due
to their high design overheads and lack of a killer application. With the
advent of critical memory-intensive workloads, several commercial PIM
technologies have been introduced to the market ranging from domain-specific
PIM architectures to more general-purpose PIM architectures. In this work, we
deepdive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled
parallel architecture that is highly programmable. Our first key contribution
is the development of a flexible simulation framework for PIM. The simulator we
developed (aka PIMulator) enables the compilation of UPMEM-PIM source codes
into its compiled machine-level instructions, which are subsequently consumed
by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's
PIM design through a detailed characterization study. Building on top of our
characterization, we conduct a series of case studies to pathfind important
architectural features that we deem will be critical for future PIM
architectures to suppor
Accelerating Linear Algebra and Machine Learning Kernels on a Massively Parallel Reconfigurable Architecture
abstract: This thesis presents efficient implementations of several linear algebra kernels, machine learning kernels and a neural network based recommender systems engine onto a massively parallel reconfigurable architecture, Transformer. The linear algebra kernels include Triangular Matrix Solver (TRSM), LU Decomposition (LUD), QR Decomposition (QRD), and Matrix Inversion. The machine learning kernels include an LSTM (Long Short Term Memory) cell, and a GRU (gated Recurrent Unit) cell used in recurrent neural networks. The neural network based recommender systems engine consists of multiple kernels including fully connected layers, embedding layer, 1-D batchnorm, Adam optimizer, etc.
Transformer is a massively parallel reconfigurable multicore architecture designed at the University of Michigan. The Transformer configuration considered here is 4 tiles and 16 General Processing Elements (GPEs) per tile. It supports a two level cache hierarchy where the L1 and L2 caches can operate in shared (S) or private (P) modes. The architecture was modeled using Gem5 and cycle accurate simulations were done to evaluate the performance in terms of execution times, giga-operations per second per Watt (GOPS/W), and giga-floating-point-operations per second per Watt (GFLOPS/W).
This thesis shows that for linear algebra kernels, each kernel achieves high performance for a certain cache mode and that this cache mode can change when the matrix size changes. For instance, for smaller matrix sizes, L1P, L2P cache mode is best for TRSM, while L1S, L2S is the best cache mode for LUD, and L1P, L2S is the best for QRD. For each kernel, the optimal cache mode changes when the matrix size is increased. For instance, for TRSM, the L1P, L2P cache mode is best for smaller matrix sizes () and it changes to L1S, L2P for larger matrix sizes (). For machine learning kernels, L1P, L2P is the best cache mode for all network parameter sizes.
Gem5 simulations show that the peak performance for TRSM, LUD, QRD and Matrix Inverse in the 14nm node is 97.5, 59.4, 133.0 and 83.05 GFLOPS/W, respectively. For LSTM and GRU, the peak performance is 44.06 and 69.3 GFLOPS/W.
The neural network based recommender system was implemented in L1S, L2S cache mode. It includes a forward pass and a backward pass and is significantly more complex in terms of both computational complexity and data movement. The most computationally intensive block is the fully connected layer followed by Adam optimizer. The overall performance of the recommender systems engine is 54.55 GFLOPS/W and 169.12 GOPS/W.Dissertation/ThesisMasters Thesis Electrical Engineering 201
Neural network computing using on-chip accelerators
The use of neural networks, machine learning, or artificial intelligence, in its broadest and most controversial sense, has been a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach towards the deployment of machine learning computation that spans not only hardware design of accelerator architectures, but also user and supervisor software to enable the safe, simultaneous use of machine learning accelerator resources.
In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a decoupled backend accelerator for inference and learning from hardware and software for managing neural network transactions can be achieved with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model to increase energy efficiency improvements and improve overall accelerator throughput for machine learning applications
- …