9 research outputs found
CrimeNet: Neural Structured Learning using Vision Transformer for violence detection
The state of the art in violence detection in videos has improved in recent years thanks to deep learning models, but average precision remains below 90% on the most complex datasets. This may lead to frequent false alarms in video-surveillance environments and may cause security guards to disable the artificial intelligence system.
In this study, we propose a new neural network based on Vision Transformer (ViT) and Neural Structured Learning (NSL) with adversarial training. This network, called CrimeNet, outperforms previous work by a large margin and reduces false positives to practically zero. Our tests on the four most challenging violence-related datasets (binary and multi-class) show the effectiveness of CrimeNet, which improves the state of the art by between 9.4 and 22.17 percentage points in ROC AUC depending on the dataset. In addition, we present a generalisation study in which the model is trained and tested on different datasets. The results show that CrimeNet improves over competing methods by between 12.39 and 25.22 percentage points, demonstrating remarkable robustness.
Funding: MCIN/AEI/10.13039/501100011033 and the "European Union NextGenerationEU/PRTR" HORUS project - Grant n. PID2021-126359OB-I0
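Neural Structured Learning with adversarial training typically augments the task loss with a consistency term computed on gradient-based perturbations of the input. The sketch below illustrates that general idea on a toy linear classifier in NumPy; the model, function names, and hyperparameters are illustrative assumptions, not CrimeNet's actual architecture or loss.

```python
# Illustrative sketch of adversarial-neighbor regularization in the spirit of
# Neural Structured Learning: perturb inputs along the loss gradient (FGSM-style)
# and penalize divergence between predictions on clean and perturbed inputs.
# The linear model here is a stand-in, not the ViT backbone from the paper.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_regularized_loss(w, x, y, eps=0.1, alpha=0.5):
    """Binary cross-entropy plus a neighbor-consistency term."""
    p = sigmoid(x @ w)
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
    # Gradient of the per-example loss w.r.t. the input of this linear model.
    grad_x = (p - y)[:, None] * w[None, :]
    x_adv = x + eps * np.sign(grad_x)        # adversarial neighbor
    p_adv = sigmoid(x_adv @ w)
    neighbor = ((p - p_adv) ** 2).mean()     # keep predictions consistent
    return ce + alpha * neighbor

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = (x[:, 0] > 0).astype(float)
w = rng.normal(size=4)
loss = adversarial_regularized_loss(w, x, y)
```

Setting `alpha=0` recovers the plain task loss; the neighbor term is non-negative, so the regularized loss can only add a consistency penalty on top of it.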
In-Place Gestures Classification via Long-term Memory Augmented Network
In-place gesture-based virtual locomotion techniques enable users to control
their viewpoint and intuitively move in the 3D virtual environment. A key
research problem is to accurately and quickly recognize in-place gestures,
since they can trigger specific movements of virtual viewpoints and enhance
user experience. However, to achieve a real-time experience, only short-term
sensor sequence data (up to about 300 ms, 6 to 10 frames) can be taken as
input, which limits classification performance due to the restricted
spatio-temporal information. In this paper, we propose a novel long-term memory
augmented network for in-place gesture classification. It takes as input both
short-term gesture sequence samples and their corresponding long-term sequence
samples that provide extra relevant spatio-temporal information in the training
phase. We store long-term sequence features with an external memory queue. In
addition, we design a memory augmented loss to help cluster features of the
same class and push apart features from different classes, thus enabling our
memory queue to memorize more relevant long-term sequence features. In the
inference phase, we input only short-term sequence samples to recall the stored
features accordingly, and fuse them together to predict the gesture class. We
create a large-scale in-place gestures dataset from 25 participants with 11
gestures. Our method achieves a promising accuracy of 95.1% with a latency of
192 ms, and an accuracy of 97.3% with a latency of 312 ms, outperforming
recent in-place gesture classification techniques. A user study
also validates our approach. Our source code and dataset will be made available
to the community.
Comment: This paper is accepted to IEEE ISMAR202
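The memory augmented loss described above clusters features of the same class and pushes apart features from different classes relative to an external memory queue. A minimal contrastive-style version of that idea is sketched below; the margin formulation and all names are illustrative assumptions, not the authors' exact loss.

```python
# Hedged sketch of a memory augmented loss: pull a short-term feature toward
# same-class entries in an external memory queue, and push it away from
# other-class entries using a hinge margin. Illustrative only.
import numpy as np

def memory_augmented_loss(feat, memory, mem_labels, label, margin=1.0):
    """Contrastive-style loss of one feature against a memory queue."""
    d = np.linalg.norm(memory - feat, axis=1)   # distances to memory slots
    pos = d[mem_labels == label]                # same-class slots: pull closer
    neg = d[mem_labels != label]                # other-class slots: push apart
    pull = (pos ** 2).mean() if pos.size else 0.0
    push = (np.maximum(0.0, margin - neg) ** 2).mean() if neg.size else 0.0
    return pull + push

memory = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
mem_labels = np.array([0, 0, 1])
loss = memory_augmented_loss(np.array([0.5, 0.5]), memory, mem_labels, label=0)
```

Here the class-1 slot is already beyond the margin, so only the pull term contributes; minimizing this loss draws the query feature toward its same-class memory entries.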
Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level prototype memory
Semantic segmentation is a key technique involved in automatic interpretation
of high-resolution remote sensing (HRS) imagery and has drawn much attention in
the remote sensing community. Deep convolutional neural networks (DCNNs) have
been successfully applied to the HRS imagery semantic segmentation task due to
their hierarchical representation ability. However, the heavy dependence on
large amounts of densely annotated training data and the sensitivity to
variations in data distribution severely restrict the potential application
of DCNNs for the semantic segmentation of HRS imagery. This study proposes a
novel unsupervised domain adaptation semantic segmentation network
(MemoryAdaptNet) for the semantic segmentation of HRS imagery. MemoryAdaptNet
constructs an output-space adversarial learning scheme to bridge the
distribution discrepancy between the source and target domains and to reduce
the influence of domain shift. Specifically, we embed an invariant-feature
memory module to store invariant domain-level context information, because the
features obtained from adversarial learning tend to represent only the variant
features of the current, limited inputs. A category attention-driven
aggregation module then combines this stored invariant domain-level context
with the current pseudo-invariant features to further augment the pixel
representations. An entropy-based pseudo-label filtering strategy is used to
update the memory module with high-confidence pseudo-invariant features from
the current target images.
Extensive experiments under three cross-domain tasks indicate that our proposed
MemoryAdaptNet is remarkably superior to state-of-the-art methods.
Comment: 17 pages, 12 figures and 8 table
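The entropy-based pseudo-label filtering step can be pictured as keeping only target-domain predictions whose softmax output has low entropy (high confidence) before they update the memory module. The sketch below shows that filtering step in NumPy; the threshold value and function names are illustrative assumptions, not the MemoryAdaptNet implementation.

```python
# Hedged sketch of entropy-based pseudo-label filtering: compute the entropy
# of each softmax prediction and keep only confident (low-entropy) ones for
# updating a memory of invariant features. Illustrative threshold.
import numpy as np

def filter_confident_pseudo_labels(probs, threshold=0.5):
    """probs: (N, C) softmax outputs. Returns pseudo-labels and a keep mask."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    keep = entropy < threshold
    return probs.argmax(axis=1), keep

probs = np.array([[0.95, 0.03, 0.02],   # confident -> low entropy, kept
                  [0.40, 0.35, 0.25]])  # ambiguous -> high entropy, dropped
labels, keep = filter_confident_pseudo_labels(probs)
```

Only the pixels passing the mask would then write their features into the memory queue, which keeps noisy pseudo-labels from polluting the stored domain-level context.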
TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers
Detection Transformer (DETR) and Deformable DETR have been proposed to
eliminate the need for many hand-designed components in object detection while
matching the performance of previous complex hand-crafted detectors.
However, their performance on Video Object Detection (VOD) has not been well
explored. In this paper, we present TransVOD, the first end-to-end video object
detection system based on spatial-temporal Transformer architectures. The first
goal of this paper is to streamline the pipeline of VOD, effectively removing
the need for many hand-crafted components for feature aggregation, e.g.,
optical flow models and relation networks. Moreover, benefiting from the
object query design in DETR, our method does not need complicated
post-processing such as Seq-NMS. In particular, we present a temporal
Transformer to aggregate
both the spatial object queries and the feature memories of each frame. Our
temporal transformer consists of two components: Temporal Query Encoder (TQE)
to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to
obtain current frame detection results. These designs boost the strong baseline
deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID
dataset. Then, we present two improved versions of TransVOD including
TransVOD++ and TransVOD Lite. The former fuses object-level information into
the object queries via dynamic convolution, while the latter models entire
video clips in a single pass to speed up inference. We give a detailed analysis
of all three models in the experiment part. In particular, our proposed
TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet
VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and
accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single
V100 GPU device.
Comment: Accepted to IEEE Transactions on Pattern Analysis and Machine
Intelligence (IEEE TPAMI); extended version of arXiv:2105.1092
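The core temporal step above is aggregating the current frame's object queries with query memories from previous frames. A minimal scaled dot-product attention version of that fusion is sketched below as an illustrative stand-in for the Temporal Query Encoder; the residual formulation and all names are assumptions, not the paper's code.

```python
# Hedged sketch of temporal query fusion: current-frame object queries attend
# to queries stored from previous frames, so each query aggregates
# spatio-temporal evidence. Illustrative stand-in, not TransVOD's TQE.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_temporal_queries(cur_q, past_q):
    """cur_q: (Q, D) current-frame queries; past_q: (T*Q, D) memory queries."""
    d = cur_q.shape[-1]
    attn = softmax(cur_q @ past_q.T / np.sqrt(d), axis=-1)  # (Q, T*Q)
    return cur_q + attn @ past_q                            # residual fusion

cur_q = np.zeros((2, 4))
past_q = np.ones((6, 4))
fused = fuse_temporal_queries(cur_q, past_q)
```

In a full model the fused queries would then be decoded into the current frame's detections; here the toy inputs simply show each query receiving an attention-weighted average of the memory queries.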
General and Fine-Grained Video Understanding using Machine Learning & Standardised Neural Network Architectures
Recently, the broad adoption of the internet coupled with connected smart devices has led to a significant increase in the production, collection, and sharing of data. One of the biggest technical challenges in this new information age is how to effectively use machines to process and extract useful information from this data. Interpreting video data is of particular importance for many applications, including surveillance, cataloguing, and robotics; however, it is also particularly difficult due to video's natural sparseness: large amounts of data contain only small amounts of useful information. This thesis examines and extends a number of Machine Learning models across several video understanding problem domains, including captioning, detection, and classification.
Captioning videos with human-like sentences can be considered a good indication of how well a machine can interpret and distill the contents of a video. Captioning generally requires knowledge of the scene, objects, actions, relationships, and temporal dynamics. Current approaches break this problem into three stages, with most works focusing on visual feature filtering techniques to support a caption generation module. Current approaches, however, still struggle to associate ideas described in captions with their visual components in the video. We find that captioning models tend to generate shorter, more succinct captions, with overfitted training models performing significantly better than human annotators on the current evaluation metrics. After taking a closer look at the model- and human-generated captions, we highlight that the main challenge for captioning models is to correctly identify and generate specific nouns and verbs, particularly rare concepts. With this in mind, we experimentally analyse a handful of different concept grounding techniques, showing some to be promising for increasing captioning performance, particularly when concepts are identified correctly by the grounding mechanism.
To strengthen visual interpretations, recent captioning approaches utilise object detections to attain more salient and detailed visual information. Currently, these detections are generated by an image-based detector processing only a single video frame; however, it is desirable to capture the temporal dynamics of objects across an entire video. We take an efficient image object detection framework and carry out an extensive exploration into the effects of a number of network modifications aimed at improving the model's ability to perform on video data. We find a number of promising directions which improve upon the single-frame baseline. Furthermore, to increase concept coverage for object detection in video, we combine datasets from both the image and video domains. We then perform an in-depth analysis of the coverage of the combined detection dataset with respect to the concepts found in captions from video captioning datasets.
While the bulk of this thesis centres around general video understanding - random videos from the internet - it is also useful to determine the performance of these Machine Learning techniques on a more fine-grained problem. We therefore introduce a new Tennis dataset, which includes broadcast video for five tennis matches with detailed annotations for match events and commentary-style captions. We evaluate a number of modern Machine Learning techniques for performing shot classification, both stand-alone and as a precursor to commentary generation, finding that current models are similarly effective for this fine-grained problem.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202