Liminality and festivals - Insights from the East
This research extends our knowledge of liminality by investigating how the liminal experiences of festival-goers are constructed in a Chinese music festival context. The research employs a multi-site data collection approach, undertaking field observations and 68 in-depth semi-structured interviews at seven music festivals across three years. The study contributes to the theoretical development of a liminality framework by providing empirical evidence of the nature of liminality. It extends our understanding of event tourist experiences by highlighting the development and role of three types of communitas and identifying six stages within a rite of passage. The resulting multifaceted coexistence of liminal behaviours and identity with everyday routine life provides a new approach to the critical understanding of the role of liminality.
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model
Video anomaly detection (VAD) has received increasing attention due to its potential applications. Its current dominant tasks focus on detecting anomalies online at the frame level, which can be roughly interpreted as binary or multi-class event classification. However, such a setup, which maps complicated anomalous events to single labels, e.g., "vandalism", is superficial, since single labels are insufficient to characterize anomalous events. In reality, users tend to search for a specific video rather than a series of approximate videos. Therefore, retrieving anomalous events using detailed descriptions is practical and valuable, but little research has focused on it. In this context, we propose a novel task called Video Anomaly Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos via cross-modal queries, e.g., language descriptions and synchronous audio. Unlike current video retrieval, where videos are assumed to be temporally well-trimmed and of short duration, VAR is devised to retrieve long untrimmed videos which may be only partially relevant to the given query. To achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we design a model called the Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we propose anomaly-led sampling to focus on key segments in long untrimmed videos. Then, we introduce an efficient pretext task to enhance semantic associations between fine-grained video and text representations. Besides, we leverage two complementary alignments to further match cross-modal content. Experimental results on the two benchmarks reveal the challenges of the VAR task and demonstrate the advantages of our tailored method.
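As a rough, hypothetical sketch of the idea behind anomaly-led sampling (not the paper's implementation; the feature dimension, segment budget, and scores below are placeholders), the key segments of a long untrimmed video can be kept by ranking per-segment anomaly scores:

import numpy as np

def anomaly_led_sampling(segment_feats, anomaly_scores, budget=32):
    # Keep the `budget` segments with the highest anomaly scores,
    # then restore temporal order so the sampled clip stays coherent.
    top = np.argsort(anomaly_scores)[-budget:]
    return segment_feats[np.sort(top)]

# Toy usage: a 600-segment untrimmed video reduced to 32 key segments.
feats = np.random.randn(600, 512)
scores = np.random.rand(600)          # stand-in anomaly confidences
print(anomaly_led_sampling(feats, scores).shape)  # (32, 512)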
MixCycle: Mixup Assisted Semi-Supervised 3D Single Object Tracking with Cycle Consistency
3D single object tracking (SOT) is an indispensable part of automated driving. Existing approaches rely heavily on large, densely labeled datasets. However, annotating point clouds is both costly and time-consuming. Inspired by the great success of cycle tracking in unsupervised 2D SOT, we introduce the first semi-supervised approach to 3D SOT. Specifically, we introduce two cycle-consistency strategies for supervision: 1) self-tracking cycles, which leverage labels to help the model converge better in the early stages of training; 2) forward-backward cycles, which strengthen the tracker's robustness to motion variations and to the template noise caused by the template update strategy. Furthermore, we propose a data augmentation strategy named SOTMixup to improve the tracker's robustness to point cloud diversity. SOTMixup generates training samples by sampling points from two point clouds with a mixing rate and assigns a loss weight for training according to the mixing rate. The resulting MixCycle approach generalizes to appearance matching-based trackers. On the KITTI benchmark, based on the P2B tracker, MixCycle trained with labels outperforms P2B trained with labels, and achieves a precision improvement when using labels. Our code will be released at https://github.com/Mumuqiao/MixCycle.
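A minimal sketch of what a SOTMixup-style augmentation could look like, assuming a Beta-distributed mixing rate and using that rate as the loss weight (the paper's exact sampling and weighting scheme may differ):

import numpy as np

def sot_mixup(pc_a, pc_b, num_points=1024, alpha=1.0):
    # Draw a mixing rate and sample points from each cloud accordingly.
    lam = np.random.beta(alpha, alpha)
    n_a = int(round(lam * num_points))
    idx_a = np.random.choice(len(pc_a), n_a, replace=len(pc_a) < n_a)
    idx_b = np.random.choice(len(pc_b), num_points - n_a,
                             replace=len(pc_b) < num_points - n_a)
    mixed = np.concatenate([pc_a[idx_a], pc_b[idx_b]], axis=0)
    return mixed, lam  # lam weights the loss toward the dominant sample

mixed, lam = sot_mixup(np.random.randn(2048, 3), np.random.randn(1500, 3))
print(mixed.shape, f"loss weight ~ {lam:.2f}")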
Laser-generated surface acoustic wave-based study and detection of surface cracks
Monitoring cracks to check the integrity of engineering materials by Non-Destructive Testing (NDT) is significant in industry. Among NDT techniques, the Laser-Generated Surface Acoustic Wave (LSAW) technique has shown itself to be promising. To further develop the non-contact and accurate testing strengths of this method, models for analyzing the generation and propagation of surface acoustic waves (SAWs), and for tracking changes in SAWs, in S45C steel samples with distributed cracks are developed using the Finite Element Method (FEM). Time- and frequency-domain analyses are used to process the acoustic wave signals after their interaction with cracks. The simulation results and preliminary analyses reveal the good potential of LSAWs for monitoring cracks. First results towards an experimental setup for crack detection are also provided.
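For a concrete feel of the signal processing involved, here is a minimal, self-contained sketch on a synthetic SAW wave packet; the sampling rate, center frequency, and damping below are assumptions for illustration, not values from the study:

import numpy as np

fs = 100e6                                # assumed 100 MHz sampling rate
t = np.arange(0, 20e-6, 1 / fs)           # 20 microseconds of signal
# Toy stand-in for a recorded SAW: a damped 5 MHz wave packet plus noise.
saw = np.exp(-((t - 5e-6) / 0.5e-6) ** 2) * np.sin(2 * np.pi * 5e6 * t)
saw += 0.05 * np.random.randn(t.size)

# Time domain: arrival time of the packet (the envelope peak shifts when
# the wave is reflected or delayed by a crack).
arrival = t[np.argmax(np.abs(saw))]

# Frequency domain: a surface crack attenuates high-frequency content, so
# the spectrum relative to a crack-free reference carries depth information.
spectrum = np.abs(np.fft.rfft(saw))
freqs = np.fft.rfftfreq(t.size, 1 / fs)
print(f"arrival ~ {arrival * 1e6:.2f} us, "
      f"peak ~ {freqs[np.argmax(spectrum)] / 1e6:.1f} MHz")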
Open-Vocabulary Video Anomaly Detection
Video anomaly detection (VAD) with weak supervision has achieved remarkable
performance in utilizing video-level labels to discriminate whether a video
frame is normal or abnormal. However, current approaches are inherently limited
to a closed-set setting and may struggle in open-world applications where there
can be anomaly categories in the test data unseen during training. A few recent
studies attempt to tackle a more realistic setting, open-set VAD, which aims to
detect unseen anomalies given seen anomalies and normal videos. However, such a
setting focuses on predicting frame anomaly scores, having no ability to
recognize the specific categories of anomalies, despite the fact that this
ability is essential for building more informed video surveillance systems.
This paper takes a step further and explores open-vocabulary video anomaly
detection (OVVAD), in which we aim to leverage pre-trained large models to
detect and categorize seen and unseen anomalies. To this end, we propose a
model that decouples OVVAD into two mutually complementary tasks --
class-agnostic detection and class-specific classification -- and jointly
optimizes both tasks. Particularly, we devise a semantic knowledge injection
module to introduce semantic knowledge from large language models for the
detection task, and design a novel anomaly synthesis module to generate pseudo
unseen anomaly videos with the help of large vision generation models for the
classification task. This semantic knowledge and these synthesized anomalies substantially extend our model's capability to detect and categorize a variety of seen and unseen anomalies. Extensive experiments on three widely-used benchmarks demonstrate that our model achieves state-of-the-art performance on the OVVAD task.
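As a hedged illustration of the decoupling (a toy head, not the proposed architecture), class-agnostic detection can score frames while class-specific classification reduces to matching frame features against text embeddings of category names:

import torch
import torch.nn as nn
import torch.nn.functional as F

class OVVADHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Class-agnostic branch: per-frame anomaly score.
        self.detector = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, frame_feats, class_text_embeds):
        # frame_feats: (T, D) visual features; class_text_embeds: (C, D)
        scores = torch.sigmoid(self.detector(frame_feats)).squeeze(-1)
        sims = F.normalize(frame_feats, dim=-1) \
               @ F.normalize(class_text_embeds, dim=-1).T
        return scores, sims.softmax(dim=-1)  # (T,), (T, C)

scores, probs = OVVADHead()(torch.randn(64, 512), torch.randn(7, 512))
print(scores.shape, probs.shape)  # torch.Size([64]) torch.Size([64, 7])

Because the classification branch only needs text embeddings, unseen categories can be added at test time simply by embedding their names.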
VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection
The recent contrastive language-image pre-training (CLIP) model has shown
great success in a wide range of image-level tasks, revealing remarkable
ability for learning powerful visual representations with rich semantics. An
open and worthwhile problem is efficiently adapting such a strong model to the
video domain and designing a robust video anomaly detector. In this work, we
propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) that leverages the frozen CLIP model directly, without any additional pre-training or fine-tuning. Unlike current works that directly feed extracted features into a weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of the fine-grained associations between vision and language on the strength of CLIP and involves a dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of the dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to the WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features will be released to facilitate future VAD research.
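A toy sketch of a dual-branch head in this spirit, assuming frozen CLIP-style features of dimension 512 (VadCLIP's actual branches are more elaborate):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchHead(nn.Module):
    def __init__(self, dim=512, tau=0.07):
        super().__init__()
        self.binary = nn.Linear(dim, 1)   # coarse-grained branch
        self.tau = tau                    # similarity temperature

    def forward(self, vis_feats, class_text_embeds):
        # vis_feats: (T, D) frozen CLIP visual features.
        coarse = torch.sigmoid(self.binary(vis_feats)).squeeze(-1)  # (T,)
        # Fine-grained branch: language-image alignment per class.
        fine = F.normalize(vis_feats, dim=-1) \
               @ F.normalize(class_text_embeds, dim=-1).T / self.tau
        return coarse, fine  # frame anomaly scores, (T, C) class logits

coarse, fine = DualBranchHead()(torch.randn(64, 512), torch.randn(14, 512))
print(coarse.shape, fine.shape)  # torch.Size([64]) torch.Size([64, 14])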
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
The VQA Natural Language Explanation (VQA-NLE) task aims to explain the decision-making process of VQA models in natural language. Unlike traditional attention or gradient analysis, free-text rationales can be easier to understand and can gain users' trust. Existing methods mostly use post-hoc or self-rationalization models to obtain a plausible explanation. However, these frameworks are bottlenecked by two challenges: 1) the reasoning process cannot be faithfully reflected and suffers from logical inconsistency; 2) human-annotated explanations are expensive and time-consuming to collect. In this paper, we propose a new Semi-Supervised VQA-NLE method via Self-Critical Learning (S3C), which evaluates candidate explanations with answering rewards to improve the logical consistency between answers and rationales. With a semi-supervised learning framework, S3C can benefit from a large number of samples without human-annotated explanations. Extensive automatic measures and human evaluations show the effectiveness of our method. Meanwhile, the framework achieves new state-of-the-art performance on the two VQA-NLE datasets.
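A minimal sketch of a self-critical objective of this kind, where hypothetical answering rewards weight the log-likelihood of sampled explanations (the reward definition and baseline here are assumptions, not the S3C specifics):

import torch

def self_critical_loss(logprobs, rewards):
    # logprobs: (N,) summed token log-probs of N sampled explanations.
    # rewards:  (N,) answering rewards, e.g. answer-correctness scores.
    advantage = rewards - rewards.mean()   # mean reward as a simple baseline
    return -(advantage.detach() * logprobs).mean()

logprobs = torch.randn(4, requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])   # hypothetical rewards
self_critical_loss(logprobs, rewards).backward()

Explanations whose reward beats the baseline get their likelihood pushed up, which is what ties the rationales to answer consistency.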
Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
With the emergence of large pre-trained vision-language models like CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning tries to probe the beneficial information for downstream tasks from the general knowledge stored in the pre-trained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as a text prompt on the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier", while the computed visual features of the image encoder cannot be affected, thus leading to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is further proposed in our DPT, where the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, encoding both downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available at https://github.com/fanrena/DPT.
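A minimal sketch of the cross-attention step described above, with text prompt features as queries and image patch tokens as keys and values; the dimensions and the downstream use of the generated prompt are assumptions:

import torch
import torch.nn as nn

class CAVPTSketch(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_prompts, patch_tokens):
        # text_prompts: (B, C, D) per-class text prompt features.
        # patch_tokens: (B, P, D) image patch token embeddings.
        prompt, _ = self.attn(text_prompts, patch_tokens, patch_tokens)
        return prompt  # (B, C, D) class-aware visual prompts

out = CAVPTSketch()(torch.randn(2, 10, 512), torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 10, 512])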
Rethinking annotation granularity for overcoming deep shortcut learning: A retrospective study on chest radiographs
Deep learning has demonstrated radiograph screening performance comparable or superior to that of radiologists. However, recent studies show that deep models for thoracic disease classification usually show degraded performance when applied to external data. Such phenomena can be categorized as shortcut learning, where deep models learn unintended decision rules that fit the identically distributed training and test sets but fail to generalize to other distributions. A natural way to alleviate this defect is to explicitly indicate the lesions and focus the model on learning the intended features. In this paper, we conduct extensive retrospective experiments to compare a popular thoracic disease classification model, CheXNet, and a thoracic lesion detection model, CheXDet. We first show that the two models achieve similar image-level classification performance on the internal test set, with no significant differences under many scenarios. Meanwhile, we find that incorporating external training data even leads to performance degradation for CheXNet. We then compare the models' internal performance on the lesion localization task and show that CheXDet achieves significantly better performance than CheXNet even when given 80% less training data. By further visualizing the models' decision-making regions, we reveal that CheXNet learned patterns other than the target lesions, demonstrating its shortcut learning defect. Moreover, CheXDet achieves significantly better external performance than CheXNet on both the image-level classification task and the lesion localization task. Our findings suggest that improving annotation granularity when training deep learning systems is a promising way to elevate future deep learning-based diagnosis systems for clinical usage.