Search CORE

329 research outputs found

Symbolic-numeric interface: A review

Author: Ng E. W.
Publication venue
Publication date
Field of study

A survey of the use of a combination of symbolic and numerical calculations is presented. Symbolic calculations primarily refer to the computer processing of procedures from classical algebra, analysis, and calculus. Numerical calculations refer to both numerical mathematics research and scientific computation. This survey is intended to point out a large number of problem areas where a cooperation of symbolic and numerical methods is likely to bear many fruits. These areas include such classical operations as differentiation and integration, such diverse activities as function approximations and qualitative analysis, and such contemporary topics as finite element calculations and computation complexity. It is contended that other less obvious topics such as the fast Fourier transform, linear algebra, nonlinear analysis and error analysis would also benefit from a synergistic approach

NASA Technical Reports Server

Interfacing Mathemagix with C++

Author: Lecerf Grégoire
van Der Hoeven Joris
Publication venue: HAL CCSD
Publication date: 01/01/2013
Field of study

8 pagesIn this paper, we give a detailed description of the interface between the Mathemagix language and C++. In particular, we describe the mechanism which allows us to import a C++ template library (which only permits static instantiation) as a fully generic Mathemagix template library

Crossref

HAL-Polytechnique

Classical computing, quantum computing, and Shor's factoring algorithm

Author: Manin Yuri I.
Publication venue
Publication date: 01/01/1999
Field of study

This is an expository talk written for the Bourbaki Seminar. After a brief introduction, Section 1 discusses in the categorical language the structure of the classical deterministic computations. Basic notions of complexity icluding the P/NP problem are reviewed. Section 2 introduces the notion of quantum parallelism and explains the main issues of quantum computing. Section 3 is devoted to four quantum subroutines: initialization, quantum computing of classical Boolean functions, quantum Fourier transform, and Grover's search algorithm. The central Section 4 explains Shor's factoring algorithm. Section 5 relates Kolmogorov's complexity to the spectral properties of computable function. Appendix contributes to the prehistory of quantum computing.Comment: 27 pp., no figures, amste

arXiv.org e-Print Archive

CiteSeerX

Numérisation de Documents Anciens Mathématiques

CERN Document Server

MPG.PuRe

Overview of the Mathemagix type system

Author: van Der Hoeven Joris
Publication venue: HAL CCSD
Publication date: 01/01/2012
Field of study

The goal of the Mathemagix project is to develop a new and free software for computer algebra and computer analysis, based on a strongly typed and compiled language. In this paper, we focus on the underlying type system of this language, which allows for heavy overloading, including parameterized overloading with parameters in so called "categories". The exposition is informal and aims at giving the reader an overview of the main concepts, ideas and differences with existing languages. In a forthcoming paper, we intend to describe the formal semantics of the type system in more details

CiteSeerX

HAL-Polytechnique

Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology

Author: Hyun Bongjoon
Kim Taehun
Lee Dongjae
Rhu Minsoo
Publication venue
Publication date: 01/08/2023
Field of study

Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to their high design overheads and lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we deepdive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled parallel architecture that is highly programmable. Our first key contribution is the development of a flexible simulation framework for PIM. The simulator we developed (aka PIMulator) enables the compilation of UPMEM-PIM source codes into its compiled machine-level instructions, which are subsequently consumed by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's PIM design through a detailed characterization study. Building on top of our characterization, we conduct a series of case studies to pathfind important architectural features that we deem will be critical for future PIM architectures to suppor

arXiv.org e-Print Archive

Accelerating Linear Algebra and Machine Learning Kernels on a Massively Parallel Reconfigurable Architecture

Author
Publication venue
Publication date: 01/01/2019
Field of study

abstract: This thesis presents efficient implementations of several linear algebra kernels, machine learning kernels and a neural network based recommender systems engine onto a massively parallel reconfigurable architecture, Transformer. The linear algebra kernels include Triangular Matrix Solver (TRSM), LU Decomposition (LUD), QR Decomposition (QRD), and Matrix Inversion. The machine learning kernels include an LSTM (Long Short Term Memory) cell, and a GRU (gated Recurrent Unit) cell used in recurrent neural networks. The neural network based recommender systems engine consists of multiple kernels including fully connected layers, embedding layer, 1-D batchnorm, Adam optimizer, etc. Transformer is a massively parallel reconfigurable multicore architecture designed at the University of Michigan. The Transformer configuration considered here is 4 tiles and 16 General Processing Elements (GPEs) per tile. It supports a two level cache hierarchy where the L1 and L2 caches can operate in shared (S) or private (P) modes. The architecture was modeled using Gem5 and cycle accurate simulations were done to evaluate the performance in terms of execution times, giga-operations per second per Watt (GOPS/W), and giga-floating-point-operations per second per Watt (GFLOPS/W). This thesis shows that for linear algebra kernels, each kernel achieves high performance for a certain cache mode and that this cache mode can change when the matrix size changes. For instance, for smaller matrix sizes, L1P, L2P cache mode is best for TRSM, while L1S, L2S is the best cache mode for LUD, and L1P, L2S is the best for QRD. For each kernel, the optimal cache mode changes when the matrix size is increased. For instance, for TRSM, the L1P, L2P cache mode is best for smaller matrix sizes (

N=64, 128, 256, 512

) and it changes to L1S, L2P for larger matrix sizes (

N=1024

). For machine learning kernels, L1P, L2P is the best cache mode for all network parameter sizes. Gem5 simulations show that the peak performance for TRSM, LUD, QRD and Matrix Inverse in the 14nm node is 97.5, 59.4, 133.0 and 83.05 GFLOPS/W, respectively. For LSTM and GRU, the peak performance is 44.06 and 69.3 GFLOPS/W. The neural network based recommender system was implemented in L1S, L2S cache mode. It includes a forward pass and a backward pass and is significantly more complex in terms of both computational complexity and data movement. The most computationally intensive block is the fully connected layer followed by Adam optimizer. The overall performance of the recommender systems engine is 54.55 GFLOPS/W and 169.12 GOPS/W.Dissertation/ThesisMasters Thesis Electrical Engineering 201

ASU Digital Repository

Transmuter: Bridging the Efficiency Gap using Memory and Dataflow Reconfiguration

Author: Amarnath Aporva
Beaumont Jonathan
Blaauw David
Chakrabarti Chaitali
Cole Murray
Dreslinski Ronald
Feng Siying
He Xin
Kaszyk Kuba
Kim Hun-Seok
Kim Sung
May Kyle
Morton John Magnus
Mudge Trevor
O'Boyle Michael
Pal Subhankar
Park Dong-hyeon
Sun Jiawen
Xiong Yan
Yang Chi-Sheng
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 30/09/2020
Field of study

Crossref

Edinburgh Research Explorer

Neural network computing using on-chip accelerators

Author: Eldridge Schuyler
Publication venue
Publication date: 05/11/2016
Field of study

The use of neural networks, machine learning, or artificial intelligence, in its broadest and most controversial sense, has been a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach towards the deployment of machine learning computation that spans not only hardware design of accelerator architectures, but also user and supervisor software to enable the safe, simultaneous use of machine learning accelerator resources. In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a decoupled backend accelerator for inference and learning from hardware and software for managing neural network transactions can be achieved with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model to increase energy efficiency improvements and improve overall accelerator throughput for machine learning applications

Boston University Institutional Repository (OpenBU)