111 research outputs found

    PyHGL: A Python-based Hardware Generation Language Framework

    Full text link
    Hardware generation languages (HGLs) increase hardware design productivity by creating parameterized modules and test benches. Unfortunately, existing tools are not widely adopted due to several drawbacks, including limited support for asynchronous circuits and unknown states, a lack of concise and efficient language features, and poor integration of simulation and verification functions. This paper introduces PyHGL, an open-source Python framework that aims to provide a simple and unified environment for hardware generation, simulation, and verification. The PyHGL language is a syntactic superset of Python, which greatly reduces the lines of code (LOC) and improves productivity by providing unique features such as dynamic typing, vectorized operations, and automatic port deduction. In addition, PyHGL integrates an event-driven simulator that simulates the asynchronous behaviors of digital circuits using three-state logic. We also propose an algorithm that eliminates the calculation and transmission overhead of unknown-state propagation for binary stimuli. The results suggest that PyHGL code is up to 6.1x denser than traditional RTL and generates high-quality synthesizable RTL code. Moreover, the optimized simulator achieves a 2.9x speedup and matches the performance of a commonly used open-source logic simulator.
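
    The distinction between three-state simulation and a binary fast path can be illustrated with a minimal sketch (illustrative only, not PyHGL's actual API): each signal takes the value 0, 1, or X (unknown), and when all stimuli are known to be binary, the X-handling branch can never fire and can be bypassed entirely.

```python
# Illustrative three-state (0/1/X) logic; this is not PyHGL's implementation.
X = "X"  # unknown state

def and3(a, b):
    """Three-state AND: a known 0 dominates, otherwise an unknown propagates."""
    if a == 0 or b == 0:
        return 0
    if a == X or b == X:
        return X
    return 1

def and2(a, b):
    """Binary fast path: valid only when both inputs are known 0/1."""
    return a & b

# With purely binary stimuli, the X branch in and3() is dead code,
# so a simulator can dispatch to the cheaper and2() instead.
assert and3(0, X) == 0 and and3(1, X) == X
assert all(and3(a, b) == and2(a, b) for a in (0, 1) for b in (0, 1))
```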

    Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia

    Full text link
    Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is the most popular stochastic optimizer for accelerating the training of deep neural networks. However, Adam often empirically generalizes worse than Stochastic Gradient Descent (SGD). We unveil the mystery of this behavior within the diffusion theoretical framework. Specifically, we disentangle the effects of the Adaptive Learning Rate and the Momentum of the Adam dynamics on saddle-point escaping and minima selection. We prove that the Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect that helps the training process pass through saddle points, and has almost no effect on flat minima selection. This theoretically explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Furthermore, motivated by this analysis, we design a novel adaptive optimization framework named Adaptive Inertia, which uses parameter-wise adaptive inertia to accelerate training and provably favors flat minima as well as SGD does. Our extensive experiments demonstrate that the proposed adaptive inertia method generalizes significantly better than SGD and conventional adaptive gradient methods. Comment: 28 pages, 11 figures, Adam, Adaptive Inertia
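
    A minimal NumPy sketch of the general idea follows, assuming a parameter-wise momentum coefficient derived from a running second-moment estimate; this is not the paper's exact Adai update rule, and the names beta0 and beta2 below are illustrative.

```python
import numpy as np

def adaptive_inertia_step(theta, grad, m, v, lr=0.01, beta0=0.1, beta2=0.99, eps=1e-3):
    """One illustrative update with parameter-wise momentum (inertia).

    The momentum coefficient is adapted per parameter from a running
    second-moment estimate v: coordinates with relatively small gradient
    magnitude get more inertia. A sketch of the idea, not the official algorithm.
    """
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Larger relative second moment -> less inertia for that coordinate.
    beta1 = np.clip(1.0 - beta0 * v / (v.mean() + 1e-12), 0.0, 1.0 - eps)
    m = beta1 * m + (1 - beta1) * grad
    theta = theta - lr * m
    return theta, m, v
```

    In this sketch, unlike Adam, the step size itself is not rescaled per coordinate; only the inertia is adapted.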

    Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning (Technical Report)

    Full text link
    Learning from data stored in a database is an important function increasingly available in relational engines. Methods using lower-precision input data are of special interest given their overall higher efficiency but, in databases, these methods have a hidden cost: the quantization of real values into a smaller number of bits is an expensive step. To address this issue, in this paper we present MLWeaving, a data structure and hardware acceleration technique intended to speed up the learning of generalized linear models in databases. MLWeaving provides a compact, in-memory representation enabling the retrieval of data at any level of precision. MLWeaving also takes advantage of the increasing availability of FPGA-based accelerators to provide a highly efficient implementation of stochastic gradient descent. The solution adopted in MLWeaving is more efficient than existing designs in terms of space (since it can process any resolution on the same design) and resources (via the use of bit-serial multipliers). MLWeaving also enables the runtime tuning of precision, instead of fixing the precision level throughout training. We illustrate this using a simple, dynamic precision schedule. Experimental results show that MLWeaving achieves up to a 16x performance improvement over low-precision CPU implementations of first-order methods. Comment: 18 pages
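
    The core of the any-precision layout can be sketched in a few lines of NumPy (a simplification of MLWeaving's actual memory layout): quantized values are stored bit plane by bit plane, most significant bit first, so reading only the first k planes yields a k-bit approximation of every feature without touching the rest of the data.

```python
import numpy as np

def weave(x_q, bits=8):
    """Store quantized values as bit planes, MSB first (illustrative layout)."""
    planes = [(x_q >> (bits - 1 - b)) & 1 for b in range(bits)]
    return np.stack(planes)                 # shape: (bits, n_values)

def read(planes, k):
    """Reconstruct a k-bit approximation from the first k bit planes only."""
    bits = planes.shape[0]
    approx = np.zeros(planes.shape[1], dtype=np.uint32)
    for b in range(k):
        approx |= planes[b].astype(np.uint32) << (bits - 1 - b)
    return approx

x_q = np.array([200, 17, 255, 3], dtype=np.uint32)   # 8-bit quantized features
planes = weave(x_q)
assert np.array_equal(read(planes, 8), x_q)           # full precision
low_precision = read(planes, 3)                       # cheap 3-bit approximation
```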

    Block Processor: A Resource-distributed Architecture

    Get PDF
    We present the architecture of the Block Processor, a task-level coprocessor that executes vectorizable computing tasks migrated from the main processor via a command bus. The Block Processor is designed around 32 high-MVL block registers, which can serve as direct operands of vector instructions and as the local cache of the Block Processor. The corresponding unique conflict-solving mechanism scales across the various implementations and easily supports chaining by adding extra execution states. The architecture distributes the block registers, ALUs, and control logic. We implement the Block Processor on an FPGA, onto which it maps efficiently since the FPGA likewise distributes its internal resources. Each block register requires two FPGA Block RAMs to provide two read ports and one write port at a depth of 1024 and a width of 32 bits. With the enhanced chaining and decoupling, the design can hide the latency of vector memory instructions and thus sustain its computing throughput. With little resource occupied, a 1024-point radix-2 DIF FFT takes 11348 cycles on one Block Processor.
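
    The benefit of chaining can be shown with a back-of-the-envelope Python model (illustrative numbers, not the paper's cycle-accurate figures): a dependent vector instruction starts consuming results as soon as the first elements are forwarded instead of waiting for the whole vector to complete.

```python
def cycles_without_chaining(vl, latency, n_ops):
    """Each dependent vector op waits for the previous one to finish completely."""
    return n_ops * (latency + vl)

def cycles_with_chaining(vl, latency, n_ops):
    """Dependent ops start as soon as the first element is forwarded."""
    return n_ops * latency + vl

# e.g. three dependent ops over a 1024-element vector with 4-cycle unit latency
vl, latency, n_ops = 1024, 4, 3
print(cycles_without_chaining(vl, latency, n_ops))  # 3084
print(cycles_with_chaining(vl, latency, n_ops))     # 1036
```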

    S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields

    Full text link
    Recently, the Neural Radiance Field (NeRF) has shown great success in rendering novel-view images of a given scene by learning an implicit representation from only posed RGB images. NeRF and related neural field methods (e.g., neural surface representations) typically optimize a point-wise loss and make point-wise predictions, where one data point corresponds to one pixel. Unfortunately, this line of research has failed to use the collective supervision of distant pixels, although it is known that pixels in an image or scene can provide rich structural information. To the best of our knowledge, we are the first to design a nonlocal multiplex training paradigm for NeRF and related neural field methods via a novel Stochastic Structural SIMilarity (S3IM) loss that processes multiple data points as a whole set instead of processing multiple inputs independently. Our extensive experiments demonstrate the unreasonable effectiveness of S3IM in improving NeRF and neural surface representations nearly for free. The improvements in quality metrics can be particularly significant for relatively difficult tasks: e.g., the test MSE loss unexpectedly drops by more than 90% for TensoRF and DVGO over eight novel view synthesis tasks; a 198% F-score gain and a 64% Chamfer L1 distance reduction for NeuS over eight surface reconstruction tasks. Moreover, S3IM is consistently robust even with sparse inputs, corrupted images, and dynamic scenes. Comment: ICCV 2023 main conference. Code: https://github.com/Madaoer/S3IM. 14 pages, 5 figures, 17 tables
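
    The nonlocal multiplex idea can be sketched as follows (a simplified NumPy version, not the authors' released implementation): randomly permute the rendered and ground-truth pixels of a batch in the same way, reshape them into synthetic patches, and average a structural similarity term over those patches, so distant pixels supervise each other collectively.

```python
import numpy as np

def ssim_patch(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Plain (unwindowed) SSIM between two equally shaped patches in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def s3im_loss(pred, target, patch=64, repeats=4, seed=0):
    """Stochastic structural similarity over randomly formed patches (sketch).

    pred, target: flat arrays of per-ray pixel intensities from the same batch.
    """
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(repeats):
        perm = rng.permutation(pred.size)          # identical shuffle for both
        n = (pred.size // patch) * patch
        p = pred[perm][:n].reshape(-1, patch)
        t = target[perm][:n].reshape(-1, patch)
        losses += [1.0 - ssim_patch(pi, ti) for pi, ti in zip(p, t)]
    return float(np.mean(losses))
```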

    HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models

    Full text link
    The goal of Arbitrary Style Transfer (AST) is to inject the artistic features of a style reference into a given image/video. Existing methods usually focus on pursuing the balance between style and content while ignoring the significant demand for flexible and customized stylization results, thereby limiting their practical application. To address this critical issue, a novel AST approach named HiCAST is proposed, which is capable of explicitly customizing the stylization results according to various sources of semantic cues. Specifically, our model is built on the Latent Diffusion Model (LDM) and elaborately designed to absorb content and style instances as conditions of the LDM. It is characterized by the introduction of a Style Adapter, which allows users to flexibly manipulate the output by aligning multi-level style information with intrinsic knowledge in the LDM. Lastly, we further extend our model to perform video AST. A novel learning objective is leveraged for video diffusion model training, which significantly improves cross-frame temporal consistency while maintaining stylization strength. Qualitative and quantitative comparisons as well as comprehensive user studies demonstrate that our HiCAST outperforms existing SoTA methods in generating visually plausible stylization results.
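
    Although the abstract stays high level, the role of an adapter can be sketched generically (this is not HiCAST's actual architecture; the dimensions and scale parameters below are hypothetical): small learned projections inject a style code into intermediate activations of a frozen diffusion backbone at several levels, with per-level strengths that a user can adjust at inference time.

```python
import numpy as np

class StyleAdapterSketch:
    """Generic adapter sketch: inject a style code at several backbone levels."""

    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        # One linear projection per level, mapping style_dim -> feature_dim.
        self.proj = [rng.standard_normal((d_out, d_in)) * 0.02 for d_in, d_out in dims]

    def __call__(self, features, style, scales):
        # features: per-level activations; scales: user-chosen injection strengths.
        return [f + s * (w @ style) for f, w, s in zip(features, self.proj, scales)]

# Hypothetical shapes: a 512-d style code injected at two levels of the backbone.
adapter = StyleAdapterSketch(dims=[(512, 320), (512, 640)])
features = [np.zeros(320), np.zeros(640)]
styled = adapter(features, style=np.ones(512), scales=[1.0, 0.5])
```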

    MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems

    Full text link
    Along with the fast evolution of deep neural networks, hardware systems are also developing rapidly. As a promising solution offering high scalability and low manufacturing cost, multi-accelerator systems are widely used in data centers, cloud platforms, and SoCs. Thus, a challenging problem arises in multi-accelerator systems: selecting a proper combination of accelerators from the available designs and searching for efficient DNN mapping strategies. To this end, we propose MARS, a novel mapping framework that can perform computation-aware accelerator selection and apply communication-aware sharding strategies to maximize parallelism. Experimental results show that MARS achieves a 32.2% latency reduction on average for typical DNN workloads compared to the baseline, and a 59.4% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method. Comment: Accepted by the 60th DAC
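
    As a rough illustration of the underlying search problem (a toy cost model with made-up numbers, not MARS's actual selection or sharding algorithm), one can enumerate accelerator combinations and keep the one with the lowest estimated latency, where compute time shrinks with aggregate throughput while communication time grows with the number of shards.

```python
from itertools import combinations

# Toy catalog: accelerator name -> relative throughput (illustrative numbers).
ACCELERATORS = {"A": 4.0, "B": 2.5, "C": 1.0}

def estimate_latency(combo, work=100.0, comm_cost=3.0):
    """Toy cost model: compute shrinks with aggregate throughput,
    communication grows with the number of shards (one per accelerator)."""
    compute = work / sum(ACCELERATORS[a] for a in combo)
    comm = comm_cost * (len(combo) - 1)
    return compute + comm

candidates = [c for r in range(1, len(ACCELERATORS) + 1)
              for c in combinations(ACCELERATORS, r)]
best = min(candidates, key=estimate_latency)
print(best, round(estimate_latency(best), 2))   # ('A', 'B') wins in this toy setup
```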

    cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs

    Get PDF
    Zero-knowledge proof is a critical cryptographic primitive. Its most practical type, called zero-knowledge Succinct Non-interactive ARgument of Knowledge (zkSNARK), has been deployed in various privacy-preserving applications such as cryptocurrencies and verifiable machine learning. Unfortunately, zkSNARKs like Groth16 have a high overhead in the proof generation step, which consists of several time-consuming operations, including large-scale matrix-vector multiplication (MUL), number-theoretic transform (NTT), and multi-scalar multiplication (MSM). Therefore, this paper presents cuZK, an efficient GPU implementation of zkSNARK with the following three techniques to achieve high performance. First, we propose a new parallel MSM algorithm. This MSM algorithm achieves nearly perfect linear speedup over the Pippenger algorithm, a well-known serial MSM algorithm. Second, we parallelize the MUL operation. Along with our self-designed MSM scheme and the well-studied NTT scheme, cuZK achieves the parallelization of all operations in the proof generation step. Third, cuZK reduces the latency overhead caused by CPU-GPU data transfer by 1) reducing redundant data transfer and 2) overlapping data transfer with device computation. The evaluation results show that our MSM module provides over 2.08× (up to 2.94×) speedup versus the state-of-the-art GPU implementation. cuZK achieves over 2.65× (up to 4.86×) speedup on standard benchmarks and a 2.18× speedup on a GPU-accelerated cryptocurrency application, Filecoin.
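
    The bucket (Pippenger) method that the parallel MSM builds on can be sketched in plain Python over a toy additive group of integers (illustrative only; real implementations operate on elliptic-curve points and distribute the per-window buckets across GPU threads).

```python
def msm_pippenger(scalars, points, c=4, add=lambda a, b: a + b, zero=0):
    """Multi-scalar multiplication sum_i scalars[i] * points[i] via the bucket method.

    Toy version over the additive group of integers so the result is checkable;
    a GPU version assigns each window's bucket accumulation to its own threads.
    """
    windows = (max(scalars).bit_length() + c - 1) // c
    result = zero
    for w in reversed(range(windows)):              # highest window first
        for _ in range(c):                          # "double" the running result c times
            result = add(result, result)
        buckets = [zero] * (1 << c)
        for k, p in zip(scalars, points):
            digit = (k >> (w * c)) & ((1 << c) - 1)
            if digit:
                buckets[digit] = add(buckets[digit], p)
        # Compute sum_d d * buckets[d] using only additions (running-sum trick).
        running, window_sum = zero, zero
        for d in reversed(range(1, 1 << c)):
            running = add(running, buckets[d])
            window_sum = add(window_sum, running)
        result = add(result, window_sum)
    return result

scalars, points = [5, 11, 250], [7, 3, 2]
assert msm_pippenger(scalars, points) == sum(k * p for k, p in zip(scalars, points))
```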

    Structural Mechanism for the Specific Assembly and Activation of the Extracellular Signal Regulated Kinase 5 (ERK5) Module

    Get PDF
    Mitogen-activated protein kinase (MAPK) activation depends on a linear binding motif found in all MAPK kinases (MKKs). In addition, the PB1 (Phox and Bem1) domain of MKK5 is required for extracellular signal-regulated kinase 5 (ERK5) activation. We present the crystal structure of ERK5 in complex with an MKK5 construct comprising the PB1 domain and the linear binding motif. We show that ERK5 has distinct protein-protein interaction surfaces compared with ERK2, which is the closest ERK5 paralog. The two MAPKs have characteristically different physiological functions, and their distinct protein-protein interaction surface topography enables them to bind different sets of activators and substrates. Structural and biochemical characterization revealed that the MKK5 PB1 domain cooperates with the MAPK-binding linear motif to achieve substrate-specific binding, and that it also enables co-recruitment of the upstream activating enzyme and the downstream substrate into one signaling-competent complex. Studies on present-day MAPKs and MKKs hint at the way protein kinase networks may evolve. In particular, they suggest how paralogous enzymes with similar catalytic properties could acquire novel signaling roles merely by changing the way they make physical links to other proteins.