Softmax Acceleration with Adaptive Numeric Format for both Training and Inference
The attention mechanism is a pivotal element within the Transformer
architecture, making a substantial contribution to its exceptional performance.
Within this attention mechanism, Softmax is an imperative component that
enables the model to assess the degree of correlation between various segments
of the input. Yet, prior research has shown that Softmax operations can
significantly increase processing latency and energy consumption in the
Transformer network due to their internal nonlinear operations and data
dependencies. In this work, we propose~\textit{Hyft}, a hardware-efficient
floating-point Softmax accelerator for both training and inference. Hyft aims
to reduce the implementation cost of the various nonlinear arithmetic operations
by adaptively converting intermediate results into the numeric format best
suited to each specific operation, yielding a reconfigurable accelerator with a
hybrid numeric format. The evaluation results show that Hyft achieves a
remarkable reduction in hardware resource utilization and processing latency,
all while maintaining a negligible impact on Transformer accuracy.
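As a rough software analogue of the idea (not Hyft's actual hardware datapath), the sketch below computes a numerically stable softmax while evaluating the exponentials in a narrow numeric format and accumulating the normalization sum in a wider one; the float16/float32 dtypes here are illustrative placeholders.

import numpy as np

def hybrid_softmax(x, exp_dtype=np.float16, acc_dtype=np.float32):
    """Softmax with exponentials in a narrow format and accumulation in a wide one.

    Illustrative software stand-in for a hybrid-numeric-format softmax; the
    dtypes are placeholders, not the formats used by the Hyft hardware.
    """
    x = np.asarray(x, dtype=acc_dtype)
    # Subtract the row maximum before exponentiation for numerical stability.
    shifted = x - x.max(axis=-1, keepdims=True)
    # Evaluate the exponentials in the cheaper, narrower format.
    exps = np.exp(shifted.astype(exp_dtype)).astype(acc_dtype)
    # Accumulate the denominator in the wider format to limit rounding error.
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.random.randn(2, 8)              # toy attention logits
print(hybrid_softmax(scores).sum(axis=-1))  # each row sums to ~1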
BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling
Like masked language modeling (MLM) in natural language processing, masked
image modeling (MIM) aims to extract valuable insights from image patches to
enhance the feature extraction capabilities of the underlying deep neural
network (DNN). In contrast to other training paradigms such as supervised
learning and unsupervised contrastive learning, MIM pretraining typically
demands significant computational resources to handle large training batches
(e.g., 4096). The resulting memory and computation requirements pose a
considerable challenge to its broad adoption.
To mitigate this, we introduce a novel learning framework,
termed~\textit{Block-Wise Masked Image Modeling} (BIM). This framework involves
decomposing the MIM task into several sub-tasks with independent computation
patterns, enabling block-wise back-propagation instead of the
traditional end-to-end approach. Our proposed BIM maintains superior
performance compared to conventional MIM while greatly reducing peak memory
consumption. Moreover, BIM naturally enables the concurrent training of
numerous DNN backbones of varying depths. This leads to the creation of
multiple trained DNN backbones, each tailored to different hardware platforms
with distinct computing capabilities. This approach significantly reduces
computational costs in comparison with training each DNN backbone individually.
Our framework offers a promising solution for resource-constrained MIM
training.
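A minimal PyTorch sketch of the block-wise back-propagation idea, under assumed shapes and toy per-block reconstruction heads that are not the paper's architecture: the backbone is split into blocks, the input to each block is detached, and every block is updated from its own local loss, so gradients never cross block boundaries and only one block's activations are held for back-propagation at a time.

import torch
import torch.nn as nn

# Toy backbone split into blocks, each with its own lightweight decoder head
# and optimizer. Shapes and modules are made up for illustration.
blocks = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(4)])
heads = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])
opts = [torch.optim.AdamW(list(b.parameters()) + list(h.parameters()), lr=1e-3)
        for b, h in zip(blocks, heads)]

def block_wise_step(masked_tokens, target_tokens):
    x = masked_tokens
    for block, head, opt in zip(blocks, heads, opts):
        x = block(x.detach())     # detach: no gradient flows back to earlier blocks
        loss = nn.functional.mse_loss(head(x), target_tokens)
        opt.zero_grad()
        loss.backward()           # back-propagation stays inside this block
        opt.step()
    return loss.item()

tokens = torch.randn(8, 64)       # stand-in for masked patch embeddings
targets = torch.randn(8, 64)      # stand-in for reconstruction targets
print(block_wise_step(tokens, targets))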
4,6-Dimethylpyrimidin-2(1H)-one–urea–water (1/1/1)
In the crystal structure of the title compound, C6H8N2O·CH4N2O·H2O, molecules are linked via N—H⋯O, O—H⋯N and O—H⋯O hydrogen bonds, forming a three-dimensional framework.
SphereFed: Hyperspherical Federated Learning
Federated Learning aims at training a global model from multiple
decentralized devices (i.e. clients) without exchanging their private local
data. A key challenge is handling non-i.i.d. (independent and identically
distributed) data across multiple clients, which may induce disparities in their
local features. We introduce the Hyperspherical Federated Learning (SphereFed)
framework to address the non-i.i.d. issue by constraining learned
representations of data points to be on a unit hypersphere shared by clients.
Specifically, all clients learn their local representations by minimizing the
loss with respect to a fixed classifier whose weights span the unit
hypersphere. After federated training has improved the global model, this
classifier is further calibrated with a closed-form solution by minimizing a
mean squared loss. We show that the calibration solution can be computed
efficiently and in a distributed manner, without direct access to local data. Extensive
experiments indicate that our SphereFed approach is able to improve the
accuracy of multiple existing federated learning algorithms by a considerable
margin (up to 6% on challenging datasets) with enhanced computation and
communication efficiency across datasets and model architectures.
Comment: European Conference on Computer Vision 202
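A minimal NumPy sketch of the two ingredients, with made-up dimensions and synthetic client data: a fixed classifier whose class weights are unit vectors on the hypersphere, and a closed-form least-squares calibration assembled from per-client sufficient statistics (F^T F and F^T Y), so the raw local features never have to leave the clients.

import numpy as np

rng = np.random.default_rng(0)
d, c = 32, 10                                  # feature dimension, number of classes

# Fixed classifier shared by all clients: each class weight is a unit vector.
W_fixed = rng.standard_normal((c, d))
W_fixed /= np.linalg.norm(W_fixed, axis=1, keepdims=True)

def client_statistics(features, labels):
    """Sufficient statistics a client would share for the closed-form calibration."""
    Y = np.eye(c)[labels]                      # one-hot targets
    return features.T @ features, features.T @ Y

# Synthetic "clients": (features, labels) pairs standing in for local representations.
clients = [(rng.standard_normal((50, d)), rng.integers(0, c, size=50)) for _ in range(3)]

# Aggregate F^T F and F^T Y across clients, then solve min_W ||F W^T - Y||^2.
A = sum(client_statistics(f, y)[0] for f, y in clients)
B = sum(client_statistics(f, y)[1] for f, y in clients)
W_calibrated = np.linalg.solve(A + 1e-6 * np.eye(d), B).T   # small ridge term for stability

print(W_calibrated.shape)                      # (c, d): calibrated classifier weights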