When Does Contrastive Visual Representation Learning Work?
Recent self-supervised representation learning techniques have largely closed
the gap between supervised and unsupervised learning on ImageNet
classification. While the particulars of pretraining on ImageNet are now
relatively well understood, the field still lacks widely accepted best
practices for replicating this success on other datasets. As a first step in
this direction, we study contrastive self-supervised learning on four diverse
large-scale datasets. By looking through the lenses of data quantity, data
domain, data quality, and task granularity, we provide new insights into the
necessary conditions for successful self-supervised learning. Our key findings
include observations such as: (i) the benefit of additional pretraining data
beyond 500k images is modest, (ii) adding pretraining images from another
domain does not lead to more general representations, (iii) corrupted
pretraining images have a disparate impact on supervised and self-supervised
pretraining, and (iv) contrastive learning lags far behind supervised learning
on fine-grained visual classification tasks. Comment: CVPR 202
Benchmarking Representation Learning for Natural World Image Collections
Recent progress in self-supervised learning has resulted in models that are
capable of extracting rich representations from image collections without
requiring any explicit label supervision. However, to date the vast majority of
these approaches have restricted themselves to training on standard benchmark
datasets such as ImageNet. We argue that fine-grained visual categorization
problems, such as plant and animal species classification, provide an
informative testbed for self-supervised learning. In order to facilitate
progress in this area we present two new natural world visual classification
datasets, iNat2021 and NeWT. The former consists of 2.7M images from 10k
different species uploaded by users of the citizen science application
iNaturalist. We designed the latter, NeWT, in collaboration with domain experts
with the aim of benchmarking the performance of representation learning
algorithms on a suite of challenging natural world binary classification tasks
that go beyond standard species classification. These two new datasets allow us
to explore questions related to large-scale representation and transfer
learning in the context of fine-grained categories. We provide a comprehensive
analysis of feature extractors trained with and without supervision on ImageNet
and iNat2021, shedding light on the strengths and weaknesses of different
learned features across a diverse set of tasks. We find that features produced
by standard supervised methods still outperform those produced by
self-supervised approaches such as SimCLR. However, improved self-supervised
learning methods are constantly being released and the iNat2021 and NeWT
datasets are a valuable resource for tracking their progress. Comment: CVPR 202
Bridging the Gap Between Object Detection and User Intent via Query-Modulation
When interacting with objects through cameras, or pictures, users often have
a specific intent. For example, they may want to perform a visual search.
However, most object detection models ignore the user intent, relying on image
pixels as their only input. This often leads to incorrect results, such as lack
of a high-confidence detection on the object of interest, or detection with a
wrong class label. In this paper we investigate techniques to modulate standard
object detectors to explicitly account for the user intent, expressed as an
embedding of a simple query. Compared to standard object detectors,
query-modulated detectors show superior performance at detecting objects for a
given label of interest. Thanks to large-scale training data synthesized from
standard object detection annotations, query-modulated detectors can also
outperform specialized referring expression recognition systems. Furthermore,
they can be simultaneously trained to solve for both query-modulated detection
and standard object detection.
PolyMaX: General Dense Prediction with Mask Transformer
Dense prediction tasks, such as semantic segmentation, depth estimation, and
surface normal prediction, can be easily formulated as per-pixel classification
(discrete outputs) or regression (continuous outputs). This per-pixel
prediction paradigm has remained popular due to the prevalence of fully
convolutional networks. However, on the recent frontier of segmentation tasks,
the community has been witnessing a paradigm shift from per-pixel prediction
to cluster-prediction with the emergence of transformer architectures,
particularly mask transformers, which directly predict a label for a mask
instead of a pixel. Despite this shift, methods based on the per-pixel
prediction paradigm still dominate the benchmarks on the other dense prediction
tasks that require continuous outputs, such as depth estimation and surface
normal prediction. Motivated by the success of DORN and AdaBins in depth
estimation, achieved by discretizing the continuous output space, we propose to
generalize the cluster-prediction based method to general dense prediction
tasks. This allows us to unify dense prediction tasks with the mask transformer
framework. Remarkably, the resulting model, PolyMaX, demonstrates
state-of-the-art performance on three benchmarks of the NYUD-v2 dataset. We hope
our simple yet effective design can inspire more research on exploiting mask
transformers for more dense prediction tasks. Code and model will be made
available. Comment: WACV 202
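The abstract above mentions discretizing a continuous output space, as done for depth estimation. A minimal sketch of that general idea follows, assuming log-spaced depth bins and a probability-weighted readout; the function names, bin layout, and readout are illustrative assumptions, not the PolyMaX, DORN, or AdaBins implementation.

```python
import math

def log_spaced_edges(d_min, d_max, num_bins):
    """Return num_bins + 1 bin edges, evenly spaced in log-depth (assumed layout)."""
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / num_bins
    return [math.exp(log_min + i * step) for i in range(num_bins + 1)]

def depth_to_bin(depth, edges):
    """Index of the bin containing depth, clamped to the valid range."""
    num_bins = len(edges) - 1
    for i in range(num_bins):
        if depth < edges[i + 1]:
            return i
    return num_bins - 1

def expected_depth(probs, edges):
    """Continuous depth recovered as the probability-weighted mean of bin centers."""
    centers = [(edges[i] + edges[i + 1]) / 2 for i in range(len(edges) - 1)]
    return sum(p * c for p, c in zip(probs, centers))

# Hypothetical depth range of 1 m to 16 m split into 4 bins.
edges = log_spaced_edges(1.0, 16.0, 4)
target_class = depth_to_bin(3.0, edges)
```

Turning the regression target into a class index is what lets a classification-style model handle a continuous task; the weighted readout converts predicted bin probabilities back into a continuous value.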
The Predictive Capacity of the Buffalo Concussion Treadmill Test After Sport-Related Concussion in Adolescents
The Buffalo Concussion Treadmill Test (BCTT) identifies the heart rate threshold (HRt) of exercise tolerance in concussed patients. A previous study found that an absolute HRt of < 135 bpm was associated with prolonged recovery (>30 days) from sport-related concussion (SRC). In this study, we assessed the relationship of ΔHR (difference between resting HR and HRt) and recovery from SRC. Using a retrospective cohort design, we compared acutely (<10 days since injury) concussed adolescents who were prescribed either (1) relative rest (RG, n = 27, 15.2 ± 1 years, 33% female, median 17 days to recovery, ΔHR = 69.6 ± 28 bpm), (2) a placebo-stretching program (PG, n = 51, 15.4 ± 2 years, 49% female, median 17 days to recovery, ΔHR = 60.9 ± 22 bpm), or (3) sub-threshold aerobic exercise (AG, n = 52, 15.3 ± 2 years, 46% female, median 13 days to recovery, ΔHR = 62.4 ± 26 bpm). Linear regression showed that ΔHR significantly correlated with duration of clinical recovery for RG (p = 0.012, R² = 0.228) and PG (p = 0.011, R² = 0.126) but not for AG (p = 0.084, R² = 0.059). ΔHR values were significantly lower in participants with prolonged recovery (>30 days) in RG (p = 0.01) and PG (p = 0.04). A ΔHR of ≤50 bpm on the BCTT is 73% sensitive and 78% specific for predicting prolonged recovery in concussed adolescents who were prescribed the current standard of care (i.e., cognitive and physical rest).
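The ΔHR rule reported in the abstract above can be illustrated with a short sketch. Only the definition of ΔHR (HRt minus resting HR) and the ≤50 bpm cutoff come from the abstract, which reports that cutoff for patients prescribed rest; the function names and example heart rates are hypothetical, and this is not a clinical tool.

```python
def delta_hr(resting_hr_bpm, hrt_bpm):
    """ΔHR: heart rate threshold on the BCTT minus resting heart rate, in bpm."""
    return hrt_bpm - resting_hr_bpm

def flags_prolonged_recovery(resting_hr_bpm, hrt_bpm, cutoff_bpm=50.0):
    """True when ΔHR is at or below the cutoff (≤50 bpm in the abstract)."""
    return delta_hr(resting_hr_bpm, hrt_bpm) <= cutoff_bpm

# Hypothetical patient: resting HR 70 bpm, symptoms worsen at 115 bpm.
example_delta = delta_hr(70, 115)
example_flag = flags_prolonged_recovery(70, 115)
```

Because the abstract gives 73% sensitivity and 78% specificity, a flag from such a rule indicates elevated risk and is not a definitive prediction.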
The iNaturalist Localization 500 (iNatLoc500) Dataset
The iNaturalist Localization 500 (iNatLoc500) dataset. See the GitHub page (https://github.com/visipedia/inat_loc) for details.
Related Publication:
On Label Granularity and Object Localization
Cole, Elijah Caltech
Wilber, Kimberly Google
Van Horn, Grant Cornell University
Yang, Xuan Google
Fornoni, Marco Google
Perona, Pietro Caltech
Belongie, Serge University of Copenhagen
Howard, Andrew Google
Mac Aodha, Oisin University of Edinburgh
European Conference on Computer Vision (ECCV)
2022