Feature Tracking Cardiac Magnetic Resonance via Deep Learning and Spline Optimization
Feature tracking Cardiac Magnetic Resonance (CMR) has recently emerged as an
area of interest for quantification of regional cardiac function from balanced
steady-state free precession (SSFP) cine sequences. However, currently
available techniques lack full automation, limiting reproducibility. We propose
a fully automated technique whereby a CMR image sequence is first segmented
with a deep, fully convolutional neural network (CNN) architecture, and
quadratic basis splines are fitted simultaneously across all cardiac frames
using least squares optimization. Experiments are performed using data from 42
patients with hypertrophic cardiomyopathy (HCM) and 21 healthy control
subjects. In terms of segmentation, we compared state-of-the-art CNN
frameworks, U-Net and dilated convolution architectures, with and without
temporal context, using cross validation with three folds. Performance relative
to expert manual segmentation was similar across all networks: pixel accuracy
was ~97%, intersection-over-union (IoU) across all classes was ~87%, and IoU
across foreground classes only was ~85%. Endocardial left ventricular
circumferential strain calculated from the proposed pipeline was significantly
different between control and disease subjects (-25.3% vs -29.1%, p = 0.006), in
agreement with the current clinical literature.
Comment: Accepted to Functional Imaging and Modeling of the Heart (FIMH) 201
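The joint spline fit described above can be illustrated with an ordinary least-squares fit of a quadratic basis. The following toy (pure Python, hypothetical names, a single curve rather than the paper's simultaneous fit across all cardiac frames) sketches the normal-equation solve:

```python
def fit_quadratic(ts, ys):
    """Least-squares fit of y = a*t**2 + b*t + c via the normal equations."""
    S = [sum(t**k for t in ts) for k in range(5)]                  # S[k] = sum t^k
    T = [sum(y * t**k for t, y in zip(ts, ys)) for k in range(3)]  # moment sums
    # Normal equations A x = rhs, with x = (a, b, c).
    A = [[S[4], S[3], S[2]],
         [S[3], S[2], S[1]],
         [S[2], S[1], S[0]]]
    rhs = [T[2], T[1], T[0]]
    # Tiny Gaussian elimination (adequate for a 3x3 system).
    for i in range(3):
        p = A[i][i]
        for j in range(i + 1, 3):
            f = A[j][i] / p
            A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
            rhs[j] -= f * rhs[i]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (rhs[i] - sum(A[i][j] * x[j] for j in range(i + 1, 3))) / A[i][i]
    return x

a, b_, c = fit_quadratic([0, 1, 2, 3], [1, 2, 5, 10])  # data lies exactly on t**2 + 1
```

The paper fits quadratic basis splines (piecewise, with continuity constraints) rather than a single polynomial; the normal-equation machinery is the same, only the design matrix changes.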
Grounded Question-Answering in Long Egocentric Videos
Existing approaches to video understanding, mainly designed for short videos
from a third-person perspective, are limited in their applicability in certain
fields, such as robotics. In this paper, we delve into open-ended
question-answering (QA) in long, egocentric videos, which allows individuals or
robots to inquire about their own past visual experiences. This task presents
unique challenges, including the complexity of temporally grounding queries
within extensive video content, the high resource demands for precise data
annotation, and the inherent difficulty of evaluating open-ended answers due to
their ambiguous nature. Our proposed approach tackles these challenges by (i)
integrating query grounding and answering within a unified model to reduce
error propagation; (ii) employing large language models for efficient and
scalable data synthesis; and (iii) introducing a close-ended QA task for
evaluation, to manage answer ambiguity. Extensive experiments demonstrate the
effectiveness of our method, which also achieves state-of-the-art performance
on the QaEgo4D and Ego4D-NLQ benchmarks. Code, data, and models are available
at https://github.com/Becomebright/GroundVQA.
Comment: Accepted to CVPR 2024. Project website at https://dszdsz.cn/GroundVQ
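Temporal grounding on benchmarks of this kind is commonly scored by the overlap between a predicted time window and the ground-truth window. A minimal sketch of that metric (not code from the paper):

```python
def temporal_iou(a, b):
    """Temporal IoU between two (start, end) windows, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

temporal_iou((0.0, 2.0), (1.0, 3.0))  # overlapping windows share 1s of 3s total
```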
Multicolumn networks for face recognition
The objective of this work is set-based face recognition, i.e. to decide whether two sets of face images are of the same person or not. Conventionally, the set-wise feature descriptor is computed as an average of the descriptors from the individual face images within the set. In this paper, we design a neural network architecture that learns to aggregate based on both “visual” quality (resolution, illumination) and “content” quality (relative importance for discriminative classification).
To this end, we propose a Multicolumn Network (MN) that takes a set of images (the number in the set can vary) as input, and learns to compute a fixed-size feature descriptor for the entire set. To encourage high-quality representations, each individual input image is first weighted by its “visual” quality, determined by a self-quality assessment module, and then dynamically recalibrated based on its “content” quality relative to the other images within the set. Both of these qualities are learnt implicitly during training for set-wise classification. Compared with previous state-of-the-art architectures trained on the same dataset (VGGFace2), our Multicolumn Networks show an improvement of 2–6% on the IARPA IJB face recognition benchmarks, and exceed the state of the art for all methods on these benchmarks.
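The quality-weighted aggregation can be sketched as follows. This is a hypothetical stand-in, not the paper's architecture: the learnt "content" score is replaced here by a simple vector norm, and the recalibration by a softmax over the set:

```python
import math

def aggregate(descs, vis_q):
    """Weight each face descriptor by its visual quality, then recalibrate
    by a softmax over per-image "content" scores relative to the set.
    The norm-based content score is an illustrative assumption only."""
    weighted = [[q * x for x in d] for d, q in zip(descs, vis_q)]
    scores = [math.sqrt(sum(x * x for x in d)) for d in weighted]
    m = max(scores)                         # stabilise the softmax
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [wi / z for wi in w]
    dim = len(descs[0])
    # Convex combination of the weighted descriptors -> fixed-size output.
    return [sum(w[i] * weighted[i][j] for i in range(len(descs)))
            for j in range(dim)]
```

Whatever the set size, the output has the dimensionality of a single descriptor, which is the property the MN needs for set-wise matching.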
Multi-Modal Classifiers for Open-Vocabulary Object Detection
The goal of this paper is open-vocabulary object detection (OVOD)
– building a model that can detect objects beyond the set of
categories seen at training, thus enabling the user to specify categories of
interest at inference without the need for model retraining. We adopt a
standard two-stage object detector architecture, and explore three ways for
specifying novel categories: via language descriptions, via image exemplars, or
via a combination of the two. We make three contributions: first, we prompt a
large language model (LLM) to generate informative language descriptions for
object classes, and construct powerful text-based classifiers; second, we
employ a visual aggregator on image exemplars that can ingest any number of
images as input, forming vision-based classifiers; and third, we provide a
simple method to fuse information from language descriptions and image
exemplars, yielding a multi-modal classifier. When evaluating on the
challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our
text-based classifiers outperform all previous OVOD works; (ii) our
vision-based classifiers perform as well as text-based classifiers in prior
work; (iii) our multi-modal classifiers perform better than either modality
alone; and finally, (iv) our text-based and multi-modal classifiers yield
better performance than a fully-supervised detector.
Comment: ICML 2023, project page:
https://www.robots.ox.ac.uk/vgg/research/mm-ovod
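A classifier built from class embeddings, and a simple fusion of the text- and vision-based embeddings, can be sketched as below. The convex-combination fusion and all names are illustrative assumptions; the paper's exact fusion mechanism may differ:

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def fuse_classifier(text_emb, vision_emb, alpha=0.5):
    """Fuse a text-based and a vision-based class embedding into one
    multi-modal classifier vector (convex combination, re-normalised)."""
    t, v = l2_normalize(text_emb), l2_normalize(vision_emb)
    return l2_normalize([alpha * ti + (1 - alpha) * vi for ti, vi in zip(t, v)])

def classify(region_feat, class_embs):
    """Assign a region feature to the class embedding with the highest
    cosine similarity -- the open-vocabulary classification step."""
    r = l2_normalize(region_feat)
    sims = [sum(ri * ci for ri, ci in zip(r, c)) for c in class_embs]
    return max(range(len(sims)), key=lambda k: sims[k])
```

Because the classifier is just a set of embedding vectors, novel categories can be added at inference time by appending new fused embeddings, with no retraining of the detector.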
Inducing Predictive Uncertainty Estimation for Face Recognition
Knowing when an output can be trusted is critical for reliably using face
recognition systems. While there has been enormous effort in recent research on
improving face verification performance, understanding when a model's
predictions should or should not be trusted has received far less attention.
Our goal is to assign a confidence score for a face image that reflects its
quality in terms of recognizable information. To this end, we propose a method
for generating image quality training data automatically from 'mated-pairs' of
face images, and use the generated data to train a lightweight Predictive
Confidence Network, termed PCNet, for estimating the confidence score of a
face image. We systematically evaluate the usefulness of PCNet with its error
versus reject performance, and demonstrate that it can be universally paired
with and improve the robustness of any verification model. We describe three
use cases on the public IJB-C face verification benchmark: (i) to improve 1:1
image-based verification error rates by rejecting low-quality face images; (ii)
to improve quality score based fusion performance on the 1:1 set-based
verification benchmark; and (iii) its use as a quality measure for selecting
high quality (unblurred, good lighting, more frontal) faces from a collection,
e.g. for automatic enrolment or display.
Comment: To appear at the British Machine Vision Conference (BMVC), 202
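The error-versus-reject evaluation mentioned above can be sketched in a few lines: sort samples by confidence, reject the lowest-confidence fraction, and report the error rate on what remains (names and data here are illustrative, not from the paper):

```python
def error_vs_reject(confidences, correct, reject_frac):
    """Error rate after rejecting the lowest-confidence fraction of samples.
    A good confidence measure makes this curve fall as reject_frac grows.
    Assumes reject_frac < 1 so at least one sample is kept."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    k = int(len(order) * reject_frac)
    kept = order[k:]                      # drop the k least-confident samples
    errs = sum(1 for i in kept if not correct[i])
    return errs / len(kept)
```

Sweeping `reject_frac` from 0 to, say, 0.5 traces the curve used to compare quality measures: the faster the error drops, the more useful the confidence score.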
A tri-layer plugin to improve occluded detection
Detecting occluded objects remains a challenge for state-of-the-art object detectors. The objective of this work is to improve the detection of such objects, and thereby improve the overall performance of a modern object detector. To this end we make the following four contributions: (1) We propose a simple 'plugin' module for the detection head of two-stage object detectors to improve the recall of partially occluded objects. The module predicts a tri-layer of segmentation masks for the target object, the occluder and the occludee, and by doing so is able to better predict the mask of the target object. (2) We propose a scalable pipeline for generating training data for the module, using amodal completion of existing object detection and instance segmentation training datasets to establish occlusion relationships. (3) We also establish a COCO evaluation dataset to measure the recall performance of partially occluded and separated objects. (4) We show that the plugin module inserted into a two-stage detector can boost performance significantly by fine-tuning only the detection head, with additional improvements if the entire architecture is fine-tuned. COCO results are reported for Mask R-CNN with Swin-T or Swin-S backbones, and Cascade Mask R-CNN with a Swin-B backbone.
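Given amodal (completed) masks and a depth ordering, the occlusion relationships behind the tri-layer targets can be derived by set arithmetic. The sketch below treats masks as pixel sets; the representation and names are assumptions for illustration, not the paper's pipeline:

```python
def occlusion_layers(amodal, depth_order, target):
    """Derive (visible, occluder, occludee) masks for `target` from amodal
    masks (dict of pixel sets) and a front-to-back depth order (list of ids)."""
    nearer = set()
    for obj in depth_order:
        if obj == target:
            break
        nearer |= amodal[obj]
    occluder = amodal[target] & nearer     # pixels where something hides the target
    visible = amodal[target] - nearer      # what a detector actually sees
    farther = set()
    for obj in depth_order[depth_order.index(target) + 1:]:
        farther |= amodal[obj]
    occludee = amodal[target] & farther    # pixels where the target hides others
    return visible, occluder, occludee
```

This is the kind of relation the training-data pipeline needs to label: for each target, which mask regions belong to occluders in front of it and occludees behind it.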
Self-supervised Co-training for Video Representation Learning
The objective of this paper is visual-only self-supervised video
representation learning. We make the following contributions: (i) we
investigate the benefit of adding semantic-class positives to instance-based
Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of
supervised contrastive learning leads to a clear improvement in performance;
(ii) we propose a novel self-supervised co-training scheme to improve the
popular InfoNCE loss, exploiting the complementary information from different
views, RGB streams and optical flow, of the same data source by using one view
to obtain positive class samples for the other; (iii) we thoroughly evaluate
the quality of the learnt representation on two different downstream tasks:
action recognition and video retrieval. In both cases, the proposed approach
demonstrates state-of-the-art or comparable performance with other
self-supervised approaches, whilst being significantly more efficient to train,
i.e. requiring far less training data to achieve similar performance.
Comment: NeurIPS202
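The InfoNCE objective at the core of this training can be written down directly. The sketch below is the standard single-positive form; in the co-training scheme, nearest neighbours found in the other view (e.g. optical flow) would contribute additional positive terms, a multi-positive variant:

```python
import math

def info_nce(pos_sim, all_sims, tau=0.07):
    """InfoNCE: -log( exp(s_pos/tau) / sum_j exp(s_j/tau) ), where all_sims
    contains the positive's similarity together with the negatives'."""
    m = max(all_sims)  # subtract the max for numerical stability
    den = sum(math.exp((s - m) / tau) for s in all_sims)
    return -((pos_sim - m) / tau - math.log(den))
```

The loss is near zero when the positive dominates all negatives and grows as negatives score higher than the positive, which is what drives instances of the same class together in embedding space.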
Diagnosing Human-object Interaction Detectors
Although we have witnessed significant progress in human-object interaction
(HOI) detection with increasingly high mAP (mean Average Precision), a single
mAP score is too concise to obtain an informative summary of a model's
performance and to understand why one approach is better than another. In this
paper, we introduce a diagnosis toolbox for analyzing the error sources of the
existing HOI detection models. We first conduct holistic investigations in the
pipeline of HOI detection, consisting of human-object pair detection and then
interaction classification. We define a set of errors and the oracles to fix
each of them. By measuring the mAP improvement obtained from fixing an error
using its oracle, we obtain a detailed analysis of the significance of
different errors. We then examine human-object pair detection and interaction
classification separately, and analyze the model's behavior. For the first
detection task, we investigate both recall and precision, measuring the
coverage of ground-truth human-object pairs as well as the noisiness level in
the detections. For the second classification task, we compute mAP for
interaction classification only, without considering the detection scores. We
also measure the performance of the models in differentiating human-object
pairs with and without actual interactions using the AP (Average Precision)
score. Our toolbox is applicable for different methods across different
datasets and available at https://github.com/neu-vi/Diag-HOI
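The oracle-based diagnosis can be sketched with a minimal AP computation: score a ranked detection list, apply an "oracle" that removes a chosen set of errors, and report the AP gain. Names and the single-class setting are illustrative simplifications of the toolbox's per-error oracles:

```python
def average_precision(scores, labels):
    """AP over one ranked detection list (labels: 1 = correct, 0 = error)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    npos = sum(labels)
    for rank, i in enumerate(order, 1):
        if labels[i]:
            tp += 1
            ap += tp / rank          # precision at each recall point
    return ap / npos if npos else 0.0

def oracle_gain(scores, labels, fixable):
    """AP improvement from an oracle that removes a given set of errors."""
    kept = [i for i in range(len(scores)) if labels[i] or i not in fixable]
    fixed = average_precision([scores[i] for i in kept],
                              [labels[i] for i in kept])
    return fixed - average_precision(scores, labels)
```

Comparing `oracle_gain` across error types (e.g. wrong human-object pairing vs. wrong interaction label) is what turns a single mAP number into an attribution of where a model loses performance.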
