1,377 research outputs found

    Parallel Integer Polynomial Multiplication

    Get PDF
    We propose a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion, targeting multi-core processor architectures. Complexity estimates and experimental comparisons demonstrate the advantages of this new approach

    Fast, Dense Feature SDM on an iPhone

    Full text link
    In this paper, we present our method for enabling dense SDM to run at over 90 FPS on a mobile device. Our contributions are two-fold. Drawing inspiration from the FFT, we propose a Sparse Compositional Regression (SCR) framework, which enables a significant speed up over classical dense regressors. Second, we propose a binary approximation to SIFT features. Binary Approximated SIFT (BASIFT) features, which are a computationally efficient approximation to SIFT, a commonly used feature with SDM. We demonstrate the performance of our algorithm on an iPhone 7, and show that we achieve similar accuracy to SDM

    06271 Abstracts Collection -- Challenges in Symbolic Computation Software

    Get PDF
    From 02.07.06 to 07.07.06, the Dagstuhl Seminar 06271 ``Challenges in Symbolic Computation Software\u27\u27 was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available

    Correct and Compositional Hardware Generators

    Full text link
    Hardware generators help designers explore families of concrete designs and their efficiency trade-offs. Both parameterized hardware description languages (HDLs) and higher-level programming models, however, can obstruct composability. Different concrete designs in a family can have dramatically different timing behavior, and high-level hardware generators rarely expose a consistent HDL-level interface. Composition, therefore, is typically only feasible at the level of individual instances: the user generates concrete designs and then composes them, sacrificing the ability to parameterize the combined design. We design Parafil, a system for correctly composing hardware generators. Parafil builds on Filament, an HDL with strong compile-time guarantees, and lifts those guarantees to generators to prove that all possible instantiations are free of timing bugs. Parafil can integrate with external hardware generators via a novel system of output parameters and a framework for invoking generator tools. We conduct experiments with two other generators, FloPoCo and Google's XLS, and we implement a parameterized FFT generator to show that Parafil ensures correct design space exploration.Comment: 13 page

    TREBUCHET: Fully Homomorphic Encryption Accelerator for Deep Computation

    Full text link
    Secure computation is of critical importance to not only the DoD, but across financial institutions, healthcare, and anywhere personally identifiable information (PII) is accessed. Traditional security techniques require data to be decrypted before performing any computation. When processed on untrusted systems the decrypted data is vulnerable to attacks to extract the sensitive information. To address these vulnerabilities Fully Homomorphic Encryption (FHE) keeps the data encrypted during computation and secures the results, even in these untrusted environments. However, FHE requires a significant amount of computation to perform equivalent unencrypted operations. To be useful, FHE must significantly close the computation gap (within 10x) to make encrypted processing practical. To accomplish this ambitious goal the TREBUCHET project is leading research and development in FHE processing hardware to accelerate deep computations on encrypted data, as part of the DARPA MTO Data Privacy for Virtual Environments (DPRIVE) program. We accelerate the major secure standardized FHE schemes (BGV, BFV, CKKS, FHEW, etc.) at >=128-bit security while integrating with the open-source PALISADE and OpenFHE libraries currently used in the DoD and in industry. We utilize a novel tile-based chip design with highly parallel ALUs optimized for vectorized 128b modulo arithmetic. The TREBUCHET coprocessor design provides a highly modular, flexible, and extensible FHE accelerator for easy reconfiguration, deployment, integration and application on other hardware form factors, such as System-on-Chip or alternate chip areas.Comment: 6 pages, 5figures, 2 table

    A multi-point 2D interface: Audio-rate signals for controlling complex multi-parametric sound synthesis

    Get PDF
    This paper documents a method of controlling complex sound synthesis processes such as granular synthesis, additive synthesis, timbre morphology, swarm-based spatialisation, spectral spatialisation, and timbre spatialisation via a multi-parametric 2D interface. This paper evaluates the use of audio-rate control signals for sound synthesis, and discussing approaches to de-interleaving, synchronization, and mapping. The paper also outlines a number of ways of extending the expressivity of such a control interface by coupling this with another 2D multi-parametric nodes interface and audio-rate 2D table lookup. The paper proceeds to review methods of navigating multi-parameter sets via interpolation and transformation. Some case studies are finally discussed in the paper. The author has used this method to control complex sound synthesis processes that require control data for more that a thousand parameters

    SAR processing on the MPP

    Get PDF
    The processing of synthetic aperture radar (SAR) signals using the massively parallel processor (MPP) is discussed. The fast Fourier transform convolution procedures employed in the algorithms are described. The MPP architecture comprises an array unit (ARU) which processes arrays of data; an array control unit which controls the operation of the ARU and performs scalar arithmetic; a program and data management unit which controls the flow of data; and a unique staging memory (SM) which buffers and permutes data. The ARU contains a 128 by 128 array of bit-serial processing elements (PE). Two-by-four surarrays of PE's are packaged in a custom VLSI HCMOS chip. The staging memory is a large multidimensional-access memory which buffers and permutes data flowing with the system. Efficient SAR processing is achieved via ARU communication paths and SM data manipulation. Real time processing capability can be realized via a multiple ARU, multiple SM configuration

    RPU: The Ring Processing Unit

    Get PDF
    Ring-Learning-with-Errors (RLWE) has emerged as the foundation of many important techniques for improving security and privacy, including homomorphic encryption and post-quantum cryptography. While promising, these techniques have received limited use due to their extreme overheads of running on general-purpose machines. In this paper, we present a novel vector Instruction Set Architecture (ISA) and microarchitecture for accelerating the ring-based computations of RLWE. The ISA, named B512, is developed to meet the needs of ring processing workloads while balancing high-performance and general-purpose programming support. Having an ISA rather than fixed hardware facilitates continued software improvement post-fabrication and the ability to support the evolving workloads. We then propose the ring processing unit (RPU), a high-performance, modular implementation of B512. The RPU has native large word modular arithmetic support, capabilities for very wide parallel processing, and a large capacity high-bandwidth scratchpad to meet the needs of ring processing. We address the challenges of programming the RPU using a newly developed SPIRAL backend. A configurable simulator is built to characterize design tradeoffs and quantify performance. The best performing design was implemented in RTL and used to validate simulator performance. In addition to our characterization, we show that a RPU using 20.5mm2 of GF 12nm can provide a speedup of 1485x over a CPU running a 64k, 128-bit NTT, a core RLWE workloa
    • …
    corecore