Siamese-DETR for Generic Multi-Object Tracking
The ability to detect and track the dynamic objects in different scenes is
fundamental to real-world applications, e.g., autonomous driving and robot
navigation. However, traditional Multi-Object Tracking (MOT) is limited to
tracking objects belonging to pre-defined closed-set categories. Recently,
Open-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) have been proposed to track
objects of interest beyond pre-defined categories, given a text prompt or a
template image. However, expensive, well pre-trained (vision-)language models
and fine-grained category annotations are required to train OVMOT models. In
this paper, we focus on GMOT and propose a simple but effective method,
Siamese-DETR. Only commonly used detection datasets (e.g., COCO) are required
for training. Different from existing GMOT methods, which train a Single Object
Tracking (SOT) based detector to detect objects of interest and then apply a
data-association-based MOT tracker to obtain the trajectories, we
leverage the inherent object queries in DETR variants. Specifically: 1) The
multi-scale object queries are designed based on the given template image,
which are effective for detecting different scales of objects with the same
category as the template image; 2) A dynamic matching training strategy is
introduced to train Siamese-DETR on commonly used detection datasets, which
takes full advantage of the provided annotations; 3) The online tracking
pipeline is simplified into a tracking-by-query manner by incorporating the
tracked boxes from the previous frame as additional query boxes. The complex
data association step is replaced with the much simpler Non-Maximum Suppression
(NMS). Extensive experimental results show that Siamese-DETR surpasses existing
MOT methods on the GMOT-40 dataset by a large margin.
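The tracking-by-query idea the abstract describes can be sketched as follows. Boxes tracked in frame t-1 are re-issued as queries in frame t, and plain IoU-based matching plus NMS, rather than a learned data-association module, decides which detections continue a track. All function names and thresholds here are illustrative assumptions, not the paper's actual implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def track_by_query(prev_tracks, detections, iou_thr=0.5):
    """prev_tracks: {track_id: box}; detections: [(box, score), ...].
    Returns an updated {track_id: box}; unmatched detections start new tracks."""
    new_tracks, used = {}, set()
    # Each previous-frame box acts as a query: it claims the detection that
    # overlaps it most, continuing that track's identity.
    for tid, qbox in prev_tracks.items():
        best, best_iou = None, iou_thr
        for i, (box, _) in enumerate(detections):
            if i not in used and iou(qbox, box) >= best_iou:
                best, best_iou = i, iou(qbox, box)
        if best is not None:
            used.add(best)
            new_tracks[tid] = detections[best][0]
    # Remaining detections are suppressed against kept boxes (the NMS step)
    # or spawn new track IDs.
    next_id = max(prev_tracks, default=-1) + 1
    for i, (box, _) in enumerate(detections):
        if i in used:
            continue
        if all(iou(box, kept) < iou_thr for kept in new_tracks.values()):
            new_tracks[next_id] = box
            next_id += 1
    return new_tracks
```

This is only the association logic; in the paper the queries are DETR object queries conditioned on the template image, not raw boxes.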
STV-based Video Feature Processing for Action Recognition
In comparison to still-image-based processing, video features can provide rich and intuitive information about dynamic events occurring over a period of time, such as human actions, crowd behaviours, and other subject pattern changes. Although substantial progress has been made in the last decade on image processing, with successful applications in face matching and object recognition, video-based event detection remains one of the most difficult challenges in computer vision research due to its complex continuous or discrete input signals, arbitrary dynamic feature definitions, and often ambiguous analytical methods. In this paper, a Spatio-Temporal Volume (STV) and Region Intersection (RI) based 3D shape-matching method is proposed to facilitate the definition and recognition of human actions recorded in videos. The distinctive characteristics and performance gain of the devised approach stem from a coefficient-factor-boosted 3D region intersection and matching mechanism developed in this research. This paper also reports an investigation into techniques for efficient STV data filtering to reduce the number of voxels (volumetric pixels) that need to be processed in each operational cycle of the implemented system. The encouraging features and improvements in operational performance registered in the experiments are discussed at the end.
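The region-intersection matching described above can be sketched in minimal form: an action is a set of occupied voxels in (x, y, t), and similarity between two actions is the overlap of their voxel sets. The optional weighting stands in for the paper's coefficient-factor boosting; the exact weighting scheme is an assumption, not the published mechanism:

```python
def ri_similarity(voxels_a, voxels_b, boost=None):
    """Region-intersection similarity of two spatio-temporal volumes.
    voxels_a / voxels_b: sets of (x, y, t) tuples marking occupied voxels.
    boost: optional {voxel: weight} map emphasising discriminative regions
    (a stand-in for the coefficient-factor boosting; illustrative only)."""
    inter = voxels_a & voxels_b
    union = voxels_a | voxels_b
    if not union:
        return 0.0
    if boost is None:
        # Unweighted case: plain volumetric overlap (Jaccard-style ratio).
        return len(inter) / len(union)
    weight = lambda vs: sum(boost.get(v, 1.0) for v in vs)
    return weight(inter) / weight(union)
```

STV filtering, as the abstract notes, matters because these voxel sets grow with both spatial resolution and clip length, so pruning before matching dominates the per-cycle cost.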
Comparator Networks
The objective of this work is set-based verification, e.g. to decide if two
sets of images of a face are of the same person or not. The traditional
approach to this problem is to learn to generate a feature vector per image,
aggregate them into one vector to represent the set, and then compute the
cosine similarity between sets. Instead, we design a neural network
architecture that can directly learn set-wise verification. Our contributions
are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of
sets (each may contain a variable number of images) as inputs, and compute a
similarity between the pair; this involves attending to multiple discriminative
local regions (landmarks), and comparing local descriptors between pairs of
faces; (ii) To encourage high-quality representations for each set, internal
competition is introduced for recalibration based on the landmark score; (iii)
Inspired by image retrieval, a novel hard sample mining regime is proposed to
control the sampling process, such that the DCN is complementary to the
standard image classification models. Evaluations on the IARPA Janus face
recognition benchmarks show that the comparator networks outperform the
previous state-of-the-art results by a large margin.
Comment: To appear in ECCV 201
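The traditional baseline the abstract contrasts against can be sketched directly: one feature vector per image, mean-pooled into a set descriptor, and sets compared by cosine similarity. The embedding is assumed given, and the threshold is an illustrative assumption:

```python
import math

def aggregate(feature_vectors):
    """Mean-pool per-image feature vectors into a single set descriptor."""
    n = len(feature_vectors)
    dim = len(feature_vectors[0])
    return [sum(v[i] for v in feature_vectors) / n for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two descriptors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def same_identity(set_a, set_b, threshold=0.7):
    """Set-based verification: do the two image sets show the same person?
    The 0.7 threshold is a hypothetical operating point, not a tuned value."""
    return cosine(aggregate(set_a), aggregate(set_b)) >= threshold
```

The DCN replaces exactly this pipeline: instead of collapsing each set to one vector before comparison, it attends to local landmark regions and compares descriptors across the pair end-to-end.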
Detecting Beyond-Einstein Polarizations of Continuous Gravitational Waves
The direct detection of gravitational waves with next-generation detectors,
such as Advanced LIGO, provides the opportunity to measure deviations from the
predictions of General Relativity. One such departure would be the existence of
alternative polarizations. To measure these, we study a single-detector
measurement of a continuous gravitational wave from a triaxial pulsar source.
We develop methods to detect signals of any polarization content and to
distinguish between them in a model-independent way. We present LIGO S5
sensitivity estimates for 115 pulsars.
Comment: submitted to PR
Robust Dialog State Tracking for Large Ontologies
The Dialog State Tracking Challenge 4 (DSTC 4) differentiates itself from the
previous three editions as follows: the number of slot-value pairs present in
the ontology is much larger, no spoken language understanding output is given,
and utterances are labeled at the subdialog level. This paper describes a novel
dialog state tracking method designed to work robustly under these conditions,
using elaborate string matching, coreference resolution tailored for dialogs,
and a few other improvements. The method can correctly identify many values
that are not explicitly present in the utterance. On the final evaluation, our
method came in first among 7 competing teams and 24 entries. The F1-score
achieved by our method was 9 and 7 percentage points higher than that of the
runner-up for the utterance-level evaluation and for the subdialog-level
evaluation, respectively.
Comment: Paper accepted at IWSDS 201
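The ontology-driven string matching at the core of such a tracker can be sketched as follows: each slot's candidate values from the ontology are matched against the (normalised) utterance, and earlier slot values persist until a new mention overrides them. The ontology contents, normalisation, and update rule here are illustrative assumptions, not the actual DSTC 4 system:

```python
import re

def normalise(text):
    """Lowercase and strip punctuation so matching survives surface noise."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

def track_state(utterance, ontology, state=None):
    """ontology: {slot: [candidate values]}.  Returns an updated
    {slot: value} dialog state: previous values are kept unless the
    current utterance explicitly mentions a new candidate value."""
    state = dict(state or {})
    text = normalise(utterance)
    for slot, values in ontology.items():
        for value in values:
            if normalise(value) in text:
                state[slot] = value
    return state
```

A real large-ontology tracker adds fuzzy and partial matching plus dialog-tailored coreference resolution on top of this, which is what lets it recover values never uttered verbatim.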