IOPS: An Unified SpMM Accelerator Based on Inner-Outer-Hybrid Product
Sparse matrix multiplication (SpMM) is widely applied to numerous domains,
such as graph processing, machine learning, and data analytics. However, inner
product based SpMM induces redundant zero-element computing for mismatched
nonzero operands, while the outer product based approach lacks input reuse across
Processing Elements (PEs) and has poor output locality for accumulating partial sum
(psum) matrices. Besides, current works only focus on sparse-sparse matrix
multiplication (SSMM) or sparse-dense matrix multiplication (SDMM), rarely
performing efficiently for both. To address these problems, this paper proposes
a unified SpMM accelerator, called IOPS, hybridizing inner with outer
products. It reuses the input matrix among PEs with inner product dataflow, and
removes zero-element calculations with outer product approach in each PE, which
can efficiently process SSMM and SDMM. Moreover, an address mapping method is
designed to accumulate the irregular sparse psum matrices, reducing the latency
and DRAM access of psum accumulating. Furthermore, an adaptive partition
strategy is proposed to tile the input matrices based on their sparsity ratios,
effectively utilizing the storage of architecture and reducing DRAM access.
Compared with the SSMM accelerator, SpArch, we achieve 1.7x~6.3x energy
efficiency and 1.2x~4.4x resource efficiency, with 1.4x~2.1x DRAM access
saving.
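The contrast between the two dataflows described in this abstract can be sketched in a few lines of Python. This is a minimal illustration of plain inner-product and outer-product sparse multiplication, not the IOPS hybrid design; the function names and the dictionary-based sparse layout are assumptions made for the example.

```python
from collections import defaultdict

def to_rows(dense):
    """Row-wise sparse form {row: {col: value}} keeping only nonzeros."""
    return {i: {j: v for j, v in enumerate(row) if v != 0}
            for i, row in enumerate(dense)}

def to_cols(dense):
    """Column-wise sparse form {col: {row: value}} keeping only nonzeros."""
    cols = defaultdict(dict)
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                cols[j][i] = v
    return dict(cols)

def spmm_inner(a, b):
    """Inner product: one output element per (row of A, column of B) pair.
    Every nonzero of the row must be matched against the column, and
    mismatched indices are checked without producing useful work."""
    a_rows, b_cols = to_rows(a), to_cols(b)
    out, index_checks = defaultdict(int), 0
    for i, a_row in a_rows.items():
        for j, b_col in b_cols.items():
            for k, a_val in a_row.items():
                index_checks += 1
                if k in b_col:
                    out[(i, j)] += a_val * b_col[k]
    return dict(out), index_checks

def spmm_outer(a, b):
    """Outer product: column k of A times row k of B gives a partial-sum
    matrix. Only nonzero pairs are multiplied, but the partial sums land
    at scattered output positions and must be accumulated."""
    a_cols, b_rows = to_cols(a), to_rows(b)
    out, multiplies = defaultdict(int), 0
    for k, a_col in a_cols.items():
        for i, a_val in a_col.items():
            for j, b_val in b_rows.get(k, {}).items():
                out[(i, j)] += a_val * b_val
                multiplies += 1
    return dict(out), multiplies

A = [[1, 0, 2], [0, 0, 3]]
B = [[0, 4], [5, 0], [6, 0]]
inner_result, inner_checks = spmm_inner(A, B)
outer_result, outer_multiplies = spmm_outer(A, B)
```

Both functions return the same nonzero outputs for the toy matrices; the counters only indicate where each dataflow spends its effort.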
Observer Based Traction/Braking Control Design for High Speed Trains Considering Adhesion Nonlinearity
Train traction/braking control, one of the key enabling technologies for automatic train operation, acts through the adhesion force. However, the adhesion coefficient of a high speed train (HST) is generally uncertain because it varies with wheel-rail surface condition and running speed, and it is extremely difficult to measure, which makes traction/braking control design and implementation for HSTs very challenging. In this work, force observers are applied to estimate the adhesion force and/or the resistance, based on which simple traction/braking control schemes are established that account for the actual wheel-rail adhesion condition. It is shown that the proposed controllers have a simple structure and can be easily implemented in real applications. Numerical simulation also validates the effectiveness of the proposed control scheme.
A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision
Deep learning has the potential to revolutionize sports performance, with
applications ranging from perception and comprehension to decision. This paper
presents a comprehensive survey of deep learning in sports performance,
focusing on three main aspects: algorithms, datasets and virtual environments,
and challenges. Firstly, we discuss the hierarchical structure of deep learning
algorithms in sports performance which includes perception, comprehension and
decision while comparing their strengths and weaknesses. Secondly, we list
widely used existing datasets in sports and highlight their characteristics and
limitations. Finally, we summarize current challenges and point out future
trends of deep learning in sports. Our survey provides valuable reference
material for researchers interested in deep learning in sports applications.
What Can Simple Arithmetic Operations Do for Temporal Modeling?
Temporal modeling plays a crucial role in understanding video content. To
tackle this problem, previous studies built complicated temporal relations
through time sequence thanks to the development of computationally powerful
devices. In this work, we explore the potential of four simple arithmetic
operations for temporal modeling. Specifically, we first capture auxiliary
temporal cues by computing addition, subtraction, multiplication, and division
between pairs of extracted frame features. Then, we extract corresponding
features from these cues to benefit the original temporal-irrespective domain.
We term such a simple pipeline as an Arithmetic Temporal Module (ATM), which
operates on the stem of a visual backbone with a plug-and-play style. We conduct
comprehensive ablation studies on the instantiation of ATMs and demonstrate
that this module provides powerful temporal modeling capability at a low
computational cost. Moreover, the ATM is compatible with both CNN- and
ViT-based architectures. Our results show that ATM achieves superior
performance over several popular video benchmarks. Specifically, on
Something-Something V1, V2 and Kinetics-400, we reach top-1 accuracy of 65.6%,
74.6%, and 89.4% respectively. The code is available at
https://github.com/whwu95/ATM.
Comment: Accepted by ICCV 202
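The first step of the pipeline described in this abstract, computing the four arithmetic operations between pairs of frame features, might look roughly like the sketch below. It is not taken from the linked repository. The function name `arithmetic_cues`, the choice of adjacent frames as pairs, the `eps` guard on division, and the assumption of non-negative feature values are all illustrative.

```python
def arithmetic_cues(frames, eps=1e-6):
    """Compute element-wise add/sub/mul/div between adjacent frame features.

    `frames` is a list of equal-length feature vectors, one per frame.
    Pairing adjacent frames and adding `eps` to the divisor are assumptions
    of this sketch; features are assumed non-negative so the divisor
    cannot reach zero.
    """
    cues = {"add": [], "sub": [], "mul": [], "div": []}
    for earlier, later in zip(frames, frames[1:]):
        pairs = list(zip(earlier, later))
        cues["add"].append([b + a for a, b in pairs])
        cues["sub"].append([b - a for a, b in pairs])
        cues["mul"].append([b * a for a, b in pairs])
        cues["div"].append([b / (a + eps) for a, b in pairs])
    return cues

# Three toy frames with two feature channels each.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 8.0]]
cues = arithmetic_cues(frames)
```

Each operation yields one cue vector per frame pair, so T frames produce T-1 cues per operation; the abstract's second step would then extract features from these cues.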
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Current captioning approaches tend to generate correct but "generic"
descriptions that lack real-world knowledge, e.g., named entities and
contextual information. Considering that Vision-Language Pre-Training (VLP)
models master massive such knowledge from large-scale web-harvested data, it is
promising to utilize the generalizability of VLP models to incorporate
knowledge into image descriptions. However, using VLP models faces challenges:
zero-shot inference suffers from knowledge hallucination that leads to
low-quality descriptions, but the generic bias in downstream task fine-tuning
hinders the VLP model from expressing knowledge. To address these concerns, we
propose a simple yet effective method called Knowledge-guided Replay
(K-Replay), which enables the retention of pre-training knowledge during
fine-tuning. Our approach consists of two parts: (1) a knowledge prediction
task on automatically collected replay exemplars to continuously awaken the VLP
model's memory about knowledge, thus preventing the model from collapsing into
the generic pattern; (2) a knowledge distillation constraint to improve the
faithfulness of generated descriptions, hence alleviating the knowledge
hallucination. To evaluate knowledge-enhanced descriptions, we construct a
novel captioning benchmark KnowCap, containing knowledge of landmarks, famous
brands, special foods and movie characters. Experimental results show that our
approach effectively incorporates knowledge into descriptions, outperforming
a strong VLP baseline by 20.9 points (78.7->99.6) in CIDEr score and 20.5
percentage points (34.0%->54.5%) in knowledge recognition accuracy. Our code
and data are available at https://github.com/njucckevin/KnowCap.
Comment: Accepted at ACM Multimedia (ACMMM) 202
Sense: Model Hardware Co-design for Accelerating Sparse CNN on Systolic Array
Sparsity is an intrinsic property of convolutional neural networks (CNNs) and is
worth exploiting in CNN accelerators, but the extra processing comes with hardware
overhead, leaving many architectures with only minor gains.
Meanwhile, systolic arrays have become increasingly competitive for CNN
acceleration thanks to their high spatiotemporal locality and low hardware overhead.
However, the irregularity of sparsity induces imbalanced workload under the
rigid systolic dataflow, causing performance degradation. Thus, this paper
proposes a systolic-array-based architecture, called Sense, for sparse CNN
acceleration by model-hardware co-design, achieving large performance
improvement. To balance input feature map (IFM) and weight loads across the
Processing Element (PE) array, we apply channel clustering to gather IFMs with
similar sparsity for array computation, and co-design a load-balancing
weight pruning method to keep the sparsity ratio of each kernel at a certain
value with little accuracy loss, improving PE utilization and overall
performance. Additionally, Adaptive Dataflow Configuration is applied to
determine the computing strategy based on the storage ratio of IFMs and
weights, lowering DRAM access by 1.17x-1.8x compared with Swallow and further
reducing system energy consumption. The whole design is implemented on a
Zynq ZCU102 at 200 MHz and achieves 471, 34, 53 and 191 images/s for
AlexNet, VGG-16, ResNet-50 and GoogleNet respectively. Compared against the sparse
systolic-array-based accelerators Swallow, FESA and SPOTS, Sense achieves
1x-2.25x, 1.95x-2.5x and 1.17x-2.37x performance improvement on these CNNs
respectively with reasonable overhead.
Comment: 14 pages, 29 figures, 6 tables, IEEE TRANSACTIONS ON VERY LARGE SCALE
INTEGRATION (VLSI) SYSTEM
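The idea of keeping every kernel at the same sparsity ratio, so that each PE receives a similar number of nonzero weights, can be illustrated with a short sketch. The abstract does not state the pruning criterion; ranking weights by magnitude is an assumption here, and `prune_kernel` and `prune_layer` are hypothetical names.

```python
def prune_kernel(weights, sparsity):
    """Zero out the smallest-magnitude weights so that a fixed fraction
    (`sparsity`) of the kernel is zero. Magnitude ranking is an assumed
    criterion, not one stated in the abstract."""
    keep = len(weights) - int(round(len(weights) * sparsity))
    ranked = sorted(range(len(weights)),
                    key=lambda i: abs(weights[i]), reverse=True)
    kept = set(ranked[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]

def prune_layer(kernels, sparsity):
    """Apply the same sparsity ratio to every (flattened) kernel, so all
    kernels end up with the same number of nonzero weights."""
    return [prune_kernel(k, sparsity) for k in kernels]

# Two toy flattened kernels pruned to 50% sparsity.
kernels = [[0.9, -0.1, 0.3, 0.05], [0.2, 0.2, -0.8, 0.01]]
pruned = prune_layer(kernels, sparsity=0.5)
nonzeros_per_kernel = [sum(1 for w in k if w != 0.0) for k in pruned]
```

With equal nonzero counts per kernel, no single PE is left waiting on a denser neighbour, which is the load-balancing effect the abstract describes.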
AS-FIBA: Adaptive Selective Frequency-Injection for Backdoor Attack on Deep Face Restoration
Deep learning-based face restoration models, increasingly prevalent in smart
devices, have become targets for sophisticated backdoor attacks. These attacks,
through subtle trigger injection into input face images, can lead to unexpected
restoration outcomes. Unlike conventional methods focused on classification
tasks, our approach introduces a unique degradation objective tailored for
attacking restoration models. Moreover, we propose the Adaptive Selective
Frequency Injection Backdoor Attack (AS-FIBA) framework, employing a neural
network for input-specific trigger generation in the frequency domain,
seamlessly blending triggers with benign images. This results in imperceptible
yet effective attacks, guiding restoration predictions towards subtly degraded
outputs rather than conspicuous targets. Extensive experiments demonstrate the
efficacy of the degradation objective on state-of-the-art face restoration
models. Additionally, it is notable that AS-FIBA can insert effective backdoors
that are more imperceptible than existing backdoor attack methods, including
WaNet, ISSBA, and FIBA.
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
This paper does not present a novel method. Instead, it delves into an
essential, must-know baseline in light of the latest advancements in
Generative Artificial Intelligence (GenAI): the utilization of GPT-4 for visual
understanding. Our study centers on the evaluation of GPT-4's linguistic and
visual capabilities in zero-shot visual recognition tasks: Firstly, we explore
the potential of its generated rich textual descriptions across various
categories to enhance recognition performance without any training. Secondly,
we evaluate GPT-4's visual proficiency in directly recognizing diverse visual
content. We conducted extensive experiments to systematically evaluate GPT-4's
performance across images, videos, and point clouds, using 16 benchmark
datasets to measure top-1 and top-5 accuracy. Our findings show that GPT-4,
enhanced with rich linguistic descriptions, significantly improves zero-shot
recognition, offering an average top-1 accuracy increase of 7% across all
datasets. GPT-4 excels in visual recognition, outshining OpenAI-CLIP's ViT-L
and rivaling EVA-CLIP's ViT-E, particularly in video datasets HMDB-51 and
UCF-101, where it leads by 22% and 9%, respectively. We hope this research
contributes valuable data points and experience for future studies. We release
our code at https://github.com/whwu95/GPT4Vis.
Comment: Technical report. Retest GPT-4V and update result
Devil in the Number: Towards Robust Multi-modality Data Filter
To filter web-scale multi-modality datasets appropriately, it is
crucial to employ suitable filtering methods that boost performance and
reduce training costs. For instance, the LAION papers employ the CLIP score filter
to select data with CLIP scores surpassing a certain threshold. On the other
hand, T-MARS achieves high-quality data filtering by detecting and masking text
within images and then filtering by CLIP score. Through analyzing the dataset,
we observe a significant proportion of redundant information, such as numbers,
present in the textual content. Our experiments on a subset of the data unveil
the profound impact of these redundant elements on the CLIP scores. A logical
approach would involve reevaluating the CLIP scores after eliminating these
influences. Experimentally, our text-based CLIP filter outperforms the
top-ranked method on the "small scale" of DataComp (a data filtering
benchmark) on ImageNet distribution shifts, achieving a 3.6% performance
improvement. The results also demonstrate that our proposed text-masked filter
outperforms the original CLIP score filter when selecting the top 40% of the
data. The impact of numbers on CLIP and their handling provide valuable
insights for improving the effectiveness of CLIP training, including language
rewrite techniques.
Comment: ICCV 2023 Workshop: TNGCV-DataCom
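One way to read the procedure in this abstract is: strip numbers from each caption, rescore the image-caption pair, and keep the top-scoring fraction. The sketch below follows that reading under stated assumptions. The scoring function is passed in as a callable because no real CLIP call is made here; `mask_numbers`, `filter_top_fraction`, and the number-matching pattern are hypothetical.

```python
import re

# Assumed pattern: runs of digits, optionally joined by '.' or ','.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(caption):
    """Remove numeric tokens from a caption and collapse extra spaces."""
    return " ".join(_NUMBER.sub(" ", caption).split())

def filter_top_fraction(samples, score_fn, fraction=0.4):
    """Rescore (image, caption) pairs on number-free captions and keep the
    highest-scoring `fraction`. `score_fn(image, text)` stands in for an
    image-text similarity such as a CLIP score."""
    scored = [(score_fn(image, mask_numbers(caption)), idx)
              for idx, (image, caption) in enumerate(samples)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(samples) * fraction))
    return [samples[idx] for _, idx in scored[:keep]]

# Toy data: the "image" is a placeholder string and the score is just the
# length of the masked caption, purely to make the example runnable.
samples = [("a", "dog 123"), ("b", "cat"), ("c", "2020 2021"),
           ("d", "red car 7"), ("e", "blue sky")]
kept = filter_top_fraction(samples, lambda image, text: len(text))
```

In practice `score_fn` would wrap a real image-text model; the toy scorer only demonstrates that captions consisting largely of numbers drop in rank once the numbers are masked.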