I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference
Vision Transformers (ViTs) have achieved state-of-the-art performance on
various computer vision applications. These models, however, have considerable
storage and computational overheads, making their deployment and efficient
inference on edge devices challenging. Quantization is a promising approach to
reducing model complexity; unfortunately, existing efforts to quantize ViTs rely on
simulated quantization (aka fake quantization), which retains floating-point
arithmetic during inference and thus contributes little to model acceleration.
In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs,
to enable ViTs to perform the entire computational graph of inference with
integer operations and bit-shifting, without any floating-point operations. In I-ViT,
linear operations (e.g., MatMul and Dense) follow the integer-only pipeline
with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and
LayerNorm) are approximated by the proposed light-weight integer-only
arithmetic methods. In particular, I-ViT applies the proposed Shiftmax and
ShiftGELU, which are designed to use integer bit-shifting to approximate the
corresponding floating-point operations. We evaluate I-ViT on various benchmark
models and the results show that integer-only INT8 quantization achieves
comparable (or even higher) accuracy to the full-precision (FP) baseline.
Furthermore, we utilize TVM for practical hardware deployment on the GPU's
integer arithmetic units, achieving a 3.72–4.11× inference speedup
compared to the FP model.
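The dyadic arithmetic mentioned above replaces every floating-point rescaling step with an integer multiplication followed by a bit shift. Below is a minimal sketch of that idea, not I-ViT's actual kernels; the function names (dyadic_scale, requantize), the 16-bit shift, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def dyadic_scale(fp_scale: float, shift: int = 16):
    """Approximate a floating-point rescaling factor as b / 2**shift
    with an integer multiplier b (a dyadic number)."""
    return int(round(fp_scale * (1 << shift))), shift

def requantize(int_acc: np.ndarray, fp_scale: float) -> np.ndarray:
    """Rescale an INT32 accumulator to INT8 using only an integer
    multiply and a bit shift (no floating-point work at inference)."""
    b, c = dyadic_scale(fp_scale)
    shifted = (int_acc.astype(np.int64) * b) >> c  # multiply, then shift (floors)
    return np.clip(shifted, -128, 127).astype(np.int8)

# Toy accumulator rescaled with a combined scale s_x * s_w / s_y = 0.0123
acc = np.array([1200, -4500, 300], dtype=np.int32)
print(requantize(acc, 0.0123))
```

Shiftmax and ShiftGELU build on the same shift-based integer arithmetic to approximate the exponential and GELU nonlinearities, per the description above.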
PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers
Data-free quantization can potentially address data privacy and security
concerns in model compression, and thus has been widely investigated. Recently,
PSAQ-ViT designs a relative value metric, patch similarity, to generate data
from pre-trained vision transformers (ViTs), achieving the first attempt at
data-free quantization for ViTs. In this paper, we propose PSAQ-ViT V2, a more
accurate and general data-free quantization framework for ViTs, built on top of
PSAQ-ViT. More specifically, following the patch similarity metric in PSAQ-ViT,
we introduce an adaptive teacher-student strategy, which facilitates the
constant cyclic evolution of the generated samples and the quantized model
(student) in a competitive and interactive fashion under the supervision of the
full-precision model (teacher), thus significantly improving the accuracy of
the quantized model. Moreover, without auxiliary category guidance, we
employ task- and model-independent prior information, making the
general-purpose scheme compatible with a broad range of vision tasks and
models. Extensive experiments are conducted on various models on image
classification, object detection, and semantic segmentation tasks, and PSAQ-ViT
V2, with the naive quantization strategy and without access to real-world data,
consistently achieves competitive results, showing potential as a powerful
baseline on data-free quantization for ViTs. For instance, with Swin-S as the
(backbone) model, 8-bit quantization reaches 82.13% top-1 accuracy on ImageNet,
50.9 box AP and 44.1 mask AP on COCO, and 47.2 mIoU on ADE20K. We hope that
the accurate and general PSAQ-ViT V2 can serve as a potential and practical solution
in real-world applications involving sensitive data. Code is released and
merged at: https://github.com/zkkli/PSAQ-ViT.
Comment: Accepted by TNNLS 202
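As a rough illustration of the adaptive teacher-student strategy described above, the sketch below alternates between (1) updating synthetic samples to maximize teacher-student disagreement and (2) distilling the quantized student on those samples. It is a simplified toy: tiny linear models stand in for the full-precision teacher and the quantized student, the patch-similarity prior is omitted, and the optimizers, learning rates, and KL objective are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins; in the paper these would be a FP ViT and its quantized copy.
teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10)).eval()
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))

# Synthetic samples start from Gaussian noise and are optimized directly.
samples = torch.randn(8, 3, 32, 32, requires_grad=True)
opt_samples = torch.optim.Adam([samples], lr=0.05)
opt_student = torch.optim.Adam(student.parameters(), lr=1e-4)

def disagreement(t_logits, s_logits):
    # KL divergence between teacher and student predictions
    return F.kl_div(F.log_softmax(s_logits, -1), F.softmax(t_logits, -1),
                    reduction="batchmean")

for step in range(100):
    # 1) Evolve the samples: make them hard for the current student.
    opt_samples.zero_grad()
    (-disagreement(teacher(samples), student(samples))).backward()
    opt_samples.step()

    # 2) Evolve the student: distill from the teacher on the refreshed samples.
    opt_student.zero_grad()
    disagreement(teacher(samples.detach()), student(samples.detach())).backward()
    opt_student.step()
```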
Patch Similarity Aware Data-Free Quantization for Vision Transformers
Vision transformers have recently gained great success on various computer
vision tasks; nevertheless, their high model complexity makes it challenging to
deploy on resource-constrained devices. Quantization is an effective approach
to reduce model complexity, and data-free quantization, which can address data
privacy and security concerns during model deployment, has received widespread
interest. Unfortunately, all existing methods, such as BN regularization, were
designed for convolutional neural networks and cannot be applied to vision
transformers, whose model architectures differ significantly. In this paper,
we propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework
for Vision Transformers, to enable the generation of "realistic" samples based
on the vision transformer's unique properties for calibrating the quantization
parameters. Specifically, we analyze the self-attention module's properties and
reveal a general difference (patch similarity) in its processing of Gaussian
noise and real images. These insights guide us to design a relative value
metric to optimize the Gaussian noise to approximate real images; the optimized samples are
then utilized to calibrate the quantization parameters. Extensive experiments
and ablation studies are conducted on various benchmarks to validate the
effectiveness of PSAQ-ViT, which can even outperform the real-data-driven
methods.
Comment: Accepted to ECCV 202
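To make the patch-similarity idea concrete, the sketch below computes pairwise cosine similarities between patch features and scores how spread out they are: Gaussian noise tends to produce nearly uniform similarities, whereas real images do not. The standard-deviation score is only a simple, differentiable stand-in for the paper's relative value metric, and the function names, tensor shapes, and the random features used in place of a real ViT forward pass are assumptions.

```python
import torch
import torch.nn.functional as F

def patch_similarity(patch_features: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between patch features.
    patch_features: (batch, num_patches, dim), e.g. self-attention outputs
    of the pre-trained ViT. Returns (batch, num_patches, num_patches)."""
    f = F.normalize(patch_features, dim=-1)
    return f @ f.transpose(-1, -2)

def diversity_score(patch_features: torch.Tensor) -> torch.Tensor:
    """How spread out the similarity values are; maximizing this pushes
    optimized noise toward image-like, inhomogeneous patch similarities."""
    return patch_similarity(patch_features).std()

# Stand-in for features of images being optimized (a real pipeline would run
# the pre-trained ViT on the images and backpropagate into the pixels).
features = torch.randn(4, 197, 384, requires_grad=True)
loss = -diversity_score(features)   # minimize the negative score
loss.backward()
print(loss.item())
```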
Patch-wise Mixed-Precision Quantization of Vision Transformer
As emerging hardware begins to support mixed bit-width arithmetic
computation, mixed-precision quantization is widely used to reduce the
complexity of neural networks. However, Vision Transformers (ViTs) require
complex self-attention computation to guarantee the learning of powerful
feature representations, which makes mixed-precision quantization of ViTs still
challenging. In this paper, we propose a novel patch-wise mixed-precision
quantization (PMQ) for efficient inference of ViTs. Specifically, we design a
lightweight global metric, which is faster than existing methods, to measure
the sensitivity of each component in ViTs to quantization errors. Moreover, we
also introduce a Pareto frontier approach to automatically allocate the optimal
bit-precision according to the sensitivity. To further reduce the computational
complexity of self-attention in the inference stage, we propose a patch-wise module
to reallocate the bit-widths of patches in each layer. Extensive experiments on the
ImageNet dataset show that our method greatly reduces the search cost and
facilitates the application of mixed-precision quantization to ViTs.
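The Pareto-frontier allocation step can be pictured with a toy example: given each component's sensitivity at each candidate bit-width and a model-size budget, keep the bit assignments that are not dominated in (total sensitivity, model size) and select the least sensitive one that fits the budget. The three components, sensitivity numbers, bit options, and byte budget below are invented for illustration and are not taken from the paper.

```python
from itertools import product

# Hypothetical per-component sensitivities and parameter counts.
sensitivity = {            # component -> {bit-width: sensitivity to quantization}
    "attn": {4: 0.90, 6: 0.30, 8: 0.05},
    "mlp":  {4: 0.40, 6: 0.10, 8: 0.02},
    "head": {4: 0.20, 6: 0.05, 8: 0.01},
}
params = {"attn": 1.8e6, "mlp": 3.5e6, "head": 0.4e6}   # parameter counts

# Enumerate bit assignments (feasible here only because the example is tiny).
configs = []
for bits in product([4, 6, 8], repeat=len(sensitivity)):
    assign = dict(zip(sensitivity, bits))
    total_sens = sum(sensitivity[c][b] for c, b in assign.items())
    size_bytes = sum(params[c] * b / 8 for c, b in assign.items())
    configs.append((assign, total_sens, size_bytes))

def pareto_frontier(cands):
    """Keep candidates not dominated in (sensitivity, size)."""
    return [(a, s, z) for a, s, z in cands
            if not any(s2 <= s and z2 <= z and (s2, z2) != (s, z)
                       for _, s2, z2 in cands)]

budget = 3.0e6  # bytes
best = min((c for c in pareto_frontier(configs) if c[2] <= budget),
           key=lambda c: c[1])
print(best)  # lowest-sensitivity Pareto-optimal assignment within the budget
```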
BinaryViT: Towards Efficient and Accurate Binary Vision Transformers
Vision Transformers (ViTs) have emerged as the fundamental architecture for
most computer vision fields, but their considerable memory and computation costs
hinder their application on resource-limited devices. As one of the most
powerful compression methods, binarization reduces the computation of the
neural network by quantizing the weights and activation values to ±1.
Although existing binarization methods have demonstrated excellent performance
on Convolutional Neural Networks (CNNs), the full binarization of ViTs is still
under-studied and suffers from a significant performance drop. In this paper, we
first argue empirically that the severe performance degradation is mainly
caused by weight oscillation in binarization training and
information distortion in the activations of ViTs. Based on these analyses, we
propose BinaryViT, an accurate full binarization scheme for ViTs,
which pushes the quantization of ViTs to the limit. Specifically, we propose a
novel gradient regularization scheme (GRS) for driving a bimodal distribution
of the weights to reduce oscillation in binarization training. Moreover, we
design an activation shift module (ASM) to adaptively tune the activation
distribution to reduce the information distortion caused by binarization.
Extensive experiments on the ImageNet dataset show that our BinaryViT consistently
surpasses the strong baseline by 2.05% and improves the accuracy of fully
binarized ViTs to a usable level. Furthermore, our method achieves impressive
savings of 16.2× and 17.7× in model size and OPs compared to the
full-precision DeiT-S.
Comment: We will be making some significant changes to the paper, including
the title and methodology. We therefore wish to withdraw the paper for now.
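For intuition about what full binarization involves, the sketch below shows two of the basic ingredients such a scheme rests on: a sign binarizer trained with a straight-through estimator, and a learnable pre-binarization shift in the spirit of the activation shift module (ASM). The gradient regularization scheme (GRS) is not shown, and the parameterization here is an assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """Binarize to ±1 in the forward pass; pass gradients straight through
    (clipped to |x| <= 1) in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

class ShiftedBinaryActivation(nn.Module):
    """Learnable per-channel shift applied before binarization, so the
    binarizer's threshold can adapt to the activation distribution."""
    def __init__(self, channels: int):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(channels))

    def forward(self, x):              # x: (batch, tokens, channels)
        return BinarizeSTE.apply(x - self.shift)

x = torch.randn(2, 197, 384, requires_grad=True)
y = ShiftedBinaryActivation(384)(x)
y.sum().backward()                     # gradients flow via the STE
print(y.unique())                      # ±1 (0 only where x exactly equals the shift)
```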
Spectral purification improves monitoring accuracy of the comprehensive growth evaluation index for film-mulched winter wheat
In order to further improve the utility of unmanned aerial vehicle (UAV) remote-sensing for quickly and accurately monitoring the growth of winter wheat under film mulching, this study examined the treatments of ridge mulching, ridge–furrow full mulching, and flat cropping full mulching in winter wheat. Based on the fuzzy comprehensive evaluation (FCE) method, four agronomic parameters (leaf area index, above-ground biomass, plant height, and leaf chlorophyll content) were used to calculate the comprehensive growth evaluation index (CGEI) of the winter wheat, and 14 visible and near-infrared spectral indices were calculated using spectral purification technology to process the remote-sensing image data of winter wheat obtained by multispectral UAV. Four machine learning algorithms, partial least squares, support vector machines, random forests, and artificial neural networks (ANN), were used to build the winter wheat growth monitoring model under film mulching, and accuracy evaluation and mapping of the spatial and temporal distribution of winter wheat growth status were carried out. The results showed that the CGEI of winter wheat under film mulching constructed using the FCE method could objectively and comprehensively evaluate the crop growth status. The accuracy of remote-sensing inversion of the CGEI based on the ANN model was higher than for the individual agronomic parameters, with a coefficient of determination of 0.75, a root mean square error of 8.40, and a mean absolute error of 6.53. Spectral purification could eliminate the interference of background effects caused by mulching and soil, effectively improving the accuracy of the remote-sensing inversion of winter wheat under film mulching, with the best inversion effect achieved on the ridge–furrow full mulching area after spectral purification. The results of this study provide a theoretical reference for the use of UAV remote-sensing to monitor the growth status of winter wheat with film mulching.
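As a minimal sketch of the inversion step alone, the code below fits an ANN regressor mapping spectral indices to a growth index and reports the determination coefficient and error metrics, mirroring the kind of accuracy evaluation described above. The data are synthetic stand-ins (random features with a linear target), and the scikit-learn model, its layer sizes, and the train/test split are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Synthetic stand-in: rows are "purified" pixels (mulch/soil masked out),
# columns are 14 spectral indices; the target plays the role of the CGEI.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 14))
y = X @ rng.normal(size=14) + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                     random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
print("R2:  ", r2_score(y_te, pred))
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
print("MAE: ", mean_absolute_error(y_te, pred))
```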
Shackling Effect Induced Property Differences in Metallo-Supramolecular Polymers
We demonstrate here
the synthesis of a novel class of metallo-supramolecular
polymers with shackled structure, via the coordination of cyclic di(bis-terpyridine-triphenyl
ether ester) ligands with ruthenium(II) ions. The constraint from
the ring topology via the shackling of ligands provides novel properties
to these metallo-supramolecular polymers, including the formation
of dendritic crystals, red-shift of absorption bands in the UV–vis
spectra from interchain charge-transfer transitions, and a typical
flash-type memory behavior.
InternLM2 Technical Report
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has
sparked discussions on the advent of Artificial General Intelligence (AGI).
However, replicating such advancements in open-source models has been
challenging. This paper introduces InternLM2, an open-source LLM that
outperforms its predecessors in comprehensive evaluations across 6 dimensions
and 30 benchmarks, long-context modeling, and open-ended subjective evaluations
through innovative pre-training and optimization techniques. The pre-training
process of InternLM2 is meticulously detailed, highlighting the preparation of
diverse data types including text, code, and long-context data. InternLM2
efficiently captures long-term dependencies, initially trained on 4k tokens
before advancing to 32k tokens in pre-training and fine-tuning stages,
exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test.
InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel
Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF)
strategy that addresses conflicting human preferences and reward hacking. By
releasing InternLM2 models in different training stages and model sizes, we
provide the community with insights into the model's evolution.