PyHGL: A Python-based Hardware Generation Language Framework
Hardware generation languages (HGLs) increase hardware design productivity by
creating parameterized modules and test benches. Unfortunately, existing tools
are not widely adopted due to several drawbacks, including limited support for
asynchronous circuits and unknown states, a lack of concise and efficient
language features, and poor integration of simulation and verification
functions. This paper introduces PyHGL, an open-source Python framework that
aims to provide a simple and unified environment for hardware generation,
simulation, and verification. The PyHGL language is a syntactic superset of
Python, which greatly reduces the lines of code (LOC) and improves productivity
by providing unique features such as dynamic typing, vectorized operations, and
automatic port deduction. In addition, PyHGL integrates an event-driven
simulator that simulates the asynchronous behaviors of digital circuits using
three-state logic. We also propose an algorithm that eliminates the calculation
and transmission overhead of unknown-state propagation for binary stimuli. The
results suggest that PyHGL code is up to 6.1x denser than traditional RTL and
generates high-quality synthesizable RTL code. Moreover, the optimized
simulator achieves a 2.9x speedup and matches the performance of a commonly
used open-source logic simulator.
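The three-state (0/1/X) logic mentioned above can be sketched in a few lines of Python; the encoding and helper names below are illustrative assumptions, not PyHGL's actual internals:

```python
# Minimal three-state (0/1/X) logic sketch. The "X" encoding and the
# function names are illustrative assumptions, not PyHGL internals.
X = "X"  # unknown state

def t_and(a, b):
    # 0 dominates AND even when the other input is unknown.
    if a == 0 or b == 0:
        return 0
    if a == X or b == X:
        return X
    return 1

def t_or(a, b):
    # 1 dominates OR even when the other input is unknown.
    if a == 1 or b == 1:
        return 1
    if a == X or b == X:
        return X
    return 0

def t_not(a):
    # NOT of an unknown stays unknown.
    return X if a == X else 1 - a
```

Note that a controlling input (0 for AND, 1 for OR) resolves the output even with an unknown on the other pin, which is what makes pure-binary stimuli a useful fast path.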
Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia
Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Rate
and Momentum, is the most popular stochastic optimizer for accelerating the
training of deep neural networks. However, empirically Adam often generalizes
worse than Stochastic Gradient Descent (SGD). We unveil the mystery of this
behavior based on the diffusion theoretical framework. Specifically, we
disentangle the effects of Adaptive Learning Rate and Momentum of the Adam
dynamics on saddle-point escaping and minima selection. We prove that Adaptive
Learning Rate can escape saddle points efficiently, but cannot select flat
minima as SGD does. In contrast, Momentum provides a drift effect that helps
the training process pass through saddle points, and barely affects flat-minima
selection. This theoretically explains why SGD (with Momentum) generalizes
better, while Adam generalizes worse but converges faster. Furthermore,
motivated by this analysis, we design a novel adaptive optimization framework
named Adaptive Inertia, which uses parameter-wise adaptive inertia to
accelerate training and provably favors flat minima as SGD does. Our extensive
experiments demonstrate that the proposed adaptive inertia method generalizes
significantly better than SGD and conventional adaptive gradient methods.
Comment: 28 pages, 11 figures
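The core idea of parameter-wise adaptive inertia can be illustrated with a toy update rule; this is a simplified sketch (the function name, hyperparameters, and clipping scheme are assumptions, not the paper's exact Adai algorithm):

```python
import numpy as np

def adai_like_step(w, g, m, v, lr=0.1, beta0=0.1, beta2=0.99, eps=1e-3):
    """One step of a simplified parameter-wise-inertia update.

    Illustrative sketch of the "adaptive inertia" idea: the momentum
    coefficient is chosen per coordinate from a second-moment estimate,
    rather than adapting the learning rate as Adam does. Not the exact
    Adai algorithm.
    """
    # Per-coordinate second-moment estimate of the gradient.
    v = beta2 * v + (1 - beta2) * g * g
    # Coordinates with relatively large gradient variance get LESS inertia.
    beta1 = np.clip(1.0 - beta0 * v / (v.mean() + eps), 0.0, 1.0 - eps)
    # Parameter-wise momentum; the step itself uses a plain learning rate.
    m = beta1 * m + (1 - beta1) * g
    w = w - lr * m
    return w, m, v
```

On a simple quadratic loss the iterates converge toward the minimum while each coordinate carries its own inertia coefficient.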
Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning (Technical Report)
Learning from the data stored in a database is an important function
increasingly available in relational engines. Methods using lower precision
input data are of special interest given their overall higher efficiency but,
in databases, these methods have a hidden cost: the quantization of the real
value into a smaller number is an expensive step. To address the issue, in this
paper we present MLWeaving, a data structure and hardware acceleration
technique intended to speed up learning of generalized linear models in
databases. MLWeaving provides a compact, in-memory representation enabling the
retrieval of data at any level of precision. MLWeaving also takes advantage of
the increasing availability of FPGA-based accelerators to provide a highly
efficient implementation of stochastic gradient descent. The solution adopted
in MLWeaving is more efficient than existing designs in terms of space (since
it can process any resolution on the same design) and resources (via the use of
bit-serial multipliers). MLWeaving also enables runtime tuning of precision,
instead of a fixed precision level throughout training. We illustrate this
using a simple, dynamic precision schedule. Experimental results show that
MLWeaving achieves up to 16x performance improvement over low-precision CPU
implementations of first-order methods.
Comment: 18 pages
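The any-precision retrieval idea can be sketched with a bit-plane (MSB-first) layout; this is an illustrative toy, not MLWeaving's actual memory layout or FPGA design:

```python
import numpy as np

BITS = 8  # assumed full precision of the stored values

def weave(values):
    """Split 8-bit values into bit-planes, most significant bit first.

    Illustrative sketch of an MSB-first bit-level layout: reading only
    the first p planes yields a p-bit approximation of every value.
    """
    v = np.asarray(values, dtype=np.uint8)
    return [(v >> (BITS - 1 - b)) & 1 for b in range(BITS)]

def read_at_precision(planes, p):
    """Reconstruct values from only the first p bit-planes."""
    out = np.zeros_like(planes[0], dtype=np.uint32)
    for b in range(p):
        out |= planes[b].astype(np.uint32) << (BITS - 1 - b)
    return out
```

Reading all planes recovers the exact values; reading fewer planes returns the same data truncated to that precision, which is what allows one layout to serve every precision level.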
Block Processor: A Resource-distributed Architecture
Abstract: We present the architecture of the Block Processor, a task-level coprocessor that executes vectorizable computing tasks migrated from the main processor via a command bus. The Block Processor is designed around 32 high-MVL block registers, which can be direct operands of vector instructions and serve as the local cache of the Block Processor. The corresponding unique conflict-solving mechanism scales with the various implementations and easily supports chaining by adding extra execution states. The architecture distributes the block registers, ALUs, and control logic. We implement the Block Processor, which maps efficiently onto the FPGA since the FPGA also distributes its internal resources. Each block register requires two FPGA Block RAMs to be 2-read-1-write-port, 1024 entries deep, and 32 bits wide. With the enhanced chaining and decoupling, it can hide the latency of vector memory instructions and thus sustain the computing capability. With little resource occupied, a 1024-point radix-2 DIF FFT costs 11348 cycles on one Block Processor.
S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields
Recently, Neural Radiance Field (NeRF) has shown great success in rendering
novel-view images of a given scene by learning an implicit representation with
only posed RGB images. NeRF and relevant neural field methods (e.g., neural
surface representation) typically optimize a point-wise loss and make
point-wise predictions, where one data point corresponds to one pixel.
Unfortunately, this line of research failed to use the collective supervision
of distant pixels, although it is known that pixels in an image or scene can
provide rich structural information. To the best of our knowledge, we are the
first to design a nonlocal multiplex training paradigm for NeRF and relevant
neural field methods via a novel Stochastic Structural SIMilarity (S3IM) loss
that processes multiple data points as a whole set instead of processing
multiple inputs independently. Our extensive experiments demonstrate the
unreasonable effectiveness of S3IM in improving NeRF and neural surface
representation nearly for free. The improvements in quality metrics can be
significant for those relatively difficult tasks: e.g., the test MSE loss
unexpectedly drops by more than 90% for TensoRF and DVGO over eight novel view
synthesis tasks; a 198% F-score gain and a 64% Chamfer distance
reduction for NeuS over eight surface reconstruction tasks. Moreover, S3IM is
consistently robust even with sparse inputs, corrupted images, and dynamic
scenes.
Comment: ICCV 2023 main conference. Code: https://github.com/Madaoer/S3IM. 14
pages, 5 figures, 17 tables
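The stochastic-patch idea behind S3IM can be sketched as follows; the kernel (a global, non-windowed SSIM), patch size, and repeat count here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def ssim(x, y, c1=0.01**2, c2=0.03**2):
    # Global SSIM over one patch (no sliding window), inputs in [0, 1].
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def s3im(pred, gt, patch=(8, 8), repeats=4, rng=None):
    """Stochastic structural similarity: sample the SAME random pixel
    indices from both images, fold them into small pseudo-patches, and
    average SSIM over several repeats.

    Illustrative sketch of the S3IM idea of supervising nonlocal groups
    of pixels jointly; the paper's exact kernel and sizes may differ.
    """
    rng = np.random.default_rng(rng)
    p, g = pred.ravel(), gt.ravel()
    n = patch[0] * patch[1]
    scores = []
    for _ in range(repeats):
        idx = rng.choice(p.size, size=n, replace=False)
        scores.append(ssim(p[idx].reshape(patch), g[idx].reshape(patch)))
    return float(np.mean(scores))
```

Because the same random indices are drawn from prediction and ground truth, the patch statistics compare corresponding pixels even though they are scattered across the image.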
HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models
The goal of Arbitrary Style Transfer (AST) is to inject the artistic features
of a style reference into a given image/video. Existing methods usually focus
on pursuing a balance between style and content, while ignoring the
significant demand for flexible and customized stylization results, thereby
limiting their practical application. To address this critical issue, we
propose a novel AST approach named HiCAST, which is capable of explicitly
customizing the stylization results according to various sources of semantic
cues. Specifically, our model is built on the Latent Diffusion Model (LDM) and
elaborately designed to absorb content and style instances as conditions of
the LDM. It is characterized by the introduction of the \textit{Style
Adapter}, which allows users to flexibly manipulate the output by aligning
multi-level style information with intrinsic knowledge in the LDM. Lastly, we
further extend our model to perform video AST. A novel learning objective is
leveraged for video diffusion model training, which significantly improves
cross-frame temporal consistency while maintaining stylization strength.
Qualitative and quantitative comparisons, as well as comprehensive user
studies, demonstrate that our HiCAST outperforms existing SoTA methods in
generating visually plausible stylization results.
MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems
Along with the fast evolution of deep neural networks, the hardware system is
also developing rapidly. As a promising solution achieving high scalability and
low manufacturing cost, multi-accelerator systems widely exist in data centers,
cloud platforms, and SoCs. Thus, a challenging problem arises in
multi-accelerator systems: selecting a proper combination of accelerators from
available designs and searching for efficient DNN mapping strategies. To this
end, we propose MARS, a novel mapping framework that can perform
computation-aware accelerator selection, and apply communication-aware sharding
strategies to maximize parallelism. Experimental results show that MARS can
achieve 32.2% latency reduction on average for typical DNN workloads compared
to the baseline, and 59.4% latency reduction on heterogeneous models compared
to the corresponding state-of-the-art method.
Comment: Accepted by the 60th DAC
cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs
Zero-knowledge proof is a critical cryptographic primitive. Its most practical type, called zero-knowledge Succinct Non-interactive ARgument of Knowledge (zkSNARK), has been deployed in various privacy-preserving applications such as cryptocurrencies and verifiable machine learning. Unfortunately, zkSNARKs like Groth16 have high overhead in the proof generation step, which consists of several time-consuming operations, including large-scale matrix-vector multiplication (MUL), number-theoretic transform (NTT), and multi-scalar multiplication (MSM). Therefore, this paper presents cuZK, an efficient GPU implementation of zkSNARK with the following three techniques to achieve high performance. First, we propose a new parallel MSM algorithm. This MSM algorithm achieves nearly perfect linear speedup over the Pippenger algorithm, a well-known serial MSM algorithm. Second, we parallelize the MUL operation. Along with our self-designed MSM scheme and the well-studied NTT scheme, cuZK achieves the parallelization of all operations in the proof generation step. Third, cuZK reduces the latency overhead caused by CPU-GPU data transfer by 1) reducing redundant data transfer and 2) overlapping data transfer and device computation. The evaluation results show that our MSM module provides over 2.08× (up to 2.94×) speedup versus the state-of-the-art GPU implementation. cuZK achieves over 2.65× (up to 4.86×) speedup on standard benchmarks and a 2.18× speedup on a GPU-accelerated cryptocurrency application, Filecoin.
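The serial Pippenger bucket method that cuZK's parallel MSM builds on can be sketched over a toy additive group (integers mod q standing in for elliptic-curve points); the window width and the group are illustrative assumptions:

```python
def msm_pippenger(scalars, points, q=2**31 - 1, c=4):
    """Multi-scalar multiplication sum(k_i * P_i) via Pippenger's bucket
    method, using integers mod q as a stand-in for elliptic-curve points
    (a real zkSNARK library would use curve point addition instead).
    """
    nbits = max(s.bit_length() for s in scalars)
    windows = (nbits + c - 1) // c
    total = 0
    for w in reversed(range(windows)):
        # Shift the accumulator up to this window's weight (c doublings).
        total = (total << c) % q
        # Sort points into buckets by their c-bit digit in this window.
        buckets = [0] * (1 << c)
        for s, p in zip(scalars, points):
            digit = (s >> (w * c)) & ((1 << c) - 1)
            if digit:
                buckets[digit] = (buckets[digit] + p) % q
        # Running-sum trick: computes sum(j * buckets[j]) with only
        # about 2 * 2^c group additions instead of scalar multiplies.
        running, acc = 0, 0
        for j in range((1 << c) - 1, 0, -1):
            running = (running + buckets[j]) % q
            acc = (acc + running) % q
        total = (total + acc) % q
    return total
```

The bucket accumulation is the part that parallelizes well on GPUs, since each window and each bucket can be processed independently before the final reduction.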
Structural Mechanism for the Specific Assembly and Activation of the Extracellular Signal Regulated Kinase 5 (ERK5) Module
Mitogen-activated protein kinase (MAPK) activation depends on a linear binding motif found in all MAPK kinases (MKKs). In addition, the PB1 (Phox and Bem1) domain of MKK5 is required for extracellular signal-regulated kinase 5 (ERK5) activation. We present the crystal structure of ERK5 in complex with an MKK5 construct comprising the PB1 domain and the linear binding motif. We show that ERK5 has distinct protein-protein interaction surfaces compared with ERK2, its closest paralog. The two MAPKs have characteristically different physiological functions, and their distinct protein-protein interaction surface topography enables them to bind different sets of activators and substrates. Structural and biochemical characterization revealed that the MKK5 PB1 domain cooperates with the MAPK-binding linear motif to achieve substrate-specific binding, and it also enables co-recruitment of the upstream activating enzyme and the downstream substrate into one signaling-competent complex. Studies on present-day MAPKs and MKKs hint at the way protein kinase networks may evolve. In particular, they suggest how paralogous enzymes with similar catalytic properties could acquire novel signaling roles by merely changing the way they make physical links to other proteins.