23 research outputs found

When Does Contrastive Visual Representation Learning Work?

Author: Mac Aodha Oisin
Belongie Serge
Cole Elijah
Wilber Kimberly
Yang Xuan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 04/04/2022

Recent self-supervised representation learning techniques have largely closed the gap between supervised and unsupervised learning on ImageNet classification. While the particulars of pretraining on ImageNet are now relatively well understood, the field still lacks widely accepted best practices for replicating this success on other datasets. As a first step in this direction, we study contrastive self-supervised learning on four diverse large-scale datasets. By looking through the lenses of data quantity, data domain, data quality, and task granularity, we provide new insights into the necessary conditions for successful self-supervised learning. Our key findings include observations such as: (i) the benefit of additional pretraining data beyond 500k images is modest, (ii) adding pretraining images from another domain does not lead to more general representations, (iii) corrupted pretraining images have a disparate impact on supervised and self-supervised pretraining, and (iv) contrastive learning lags far behind supervised learning on fine-grained visual classification tasks.

Comment: CVPR 202

arXiv.org e-Print Archive

Edinburgh Research Explorer

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

Author: Adam Hartwig
Belongie Serge
Mac Aodha Oisin
Qian Rui
Van Horn Grant
Wilber Kimberly
Publication date: 23/10/2022

Edinburgh Research Explorer

Benchmarking Representation Learning for Natural World Image Collections

Author: Beery Sara
Belongie Serge
Cole Elijah
Mac Aodha Oisin
Van Horn Grant
Wilber Kimberly
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/06/2021

Recent progress in self-supervised learning has resulted in models that are capable of extracting rich representations from image collections without requiring any explicit label supervision. However, to date the vast majority of these approaches have restricted themselves to training on standard benchmark datasets such as ImageNet. We argue that fine-grained visual categorization problems, such as plant and animal species classification, provide an informative testbed for self-supervised learning. In order to facilitate progress in this area we present two new natural world visual classification datasets, iNat2021 and NeWT. The former consists of 2.7M images from 10k different species uploaded by users of the citizen science application iNaturalist. We designed the latter, NeWT, in collaboration with domain experts with the aim of benchmarking the performance of representation learning algorithms on a suite of challenging natural world binary classification tasks that go beyond standard species classification. These two new datasets allow us to explore questions related to large-scale representation and transfer learning in the context of fine-grained categories. We provide a comprehensive analysis of feature extractors trained with and without supervision on ImageNet and iNat2021, shedding light on the strengths and weaknesses of different learned features across a diverse set of tasks. We find that features produced by standard supervised methods still outperform those produced by self-supervised approaches such as SimCLR. However, improved self-supervised learning methods are constantly being released and the iNat2021 and NeWT datasets are a valuable resource for tracking their progress.

Comment: CVPR 202

arXiv.org e-Print Archive

Edinburgh Research Explorer

Caltech Authors

Bridging the Gap Between Object Detection and User Intent via Query-Modulation

Author: Cui Yin
Fornoni Marco
Gong Boqing
Howard Andrew
Luo Liangchen
Stark Alex
Wilber Kimberly
Yan Chaochao
Publication date: 18/06/2021

When interacting with objects through cameras, or pictures, users often have a specific intent. For example, they may want to perform a visual search. However, most object detection models ignore the user intent, relying on image pixels as their only input. This often leads to incorrect results, such as lack of a high-confidence detection on the object of interest, or detection with a wrong class label. In this paper we investigate techniques to modulate standard object detectors to explicitly account for the user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors show superior performance at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be simultaneously trained to solve for both query-modulated detection and standard object detection.

arXiv.org e-Print Archive

On Label Granularity and Object Localization

Author: Belongie Serge
Cole Elijah
Fornoni Marco
Howard Andrew
Mac Aodha Oisin
Perona Pietro
Van Horn Grant
Wilber Kimberly
Yang Xuan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022

Copenhagen University Research Information System

Edinburgh Research Explorer

PolyMaX: General Dense Prediction with Mask Transformer

Author: Adam Hartwig
Chen Liang-Chieh
Debats Stephanie
Gu Xiuye
Qiao Siyuan
Sharma Astuti
Sirotenko Mikhail
Wang Huisheng
Wilber Kimberly
Yang Xuan
Yuan Liangzhe
Publication date: 09/11/2023

Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of segmentation task, the community has been witnessing a shift of paradigm from per-pixel prediction to cluster-prediction with the emergence of transformer architectures, particularly the mask transformers, which directly predicts a label for a mask instead of a pixel. Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. This allows us to unify dense prediction tasks with the mask transformer framework. Remarkably, the resulting model PolyMaX demonstrates state-of-the-art performance on three benchmarks of NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for more dense prediction tasks. Code and model will be made available.

Comment: WACV 202

arXiv.org e-Print Archive

The Predictive Capacity of the Buffalo Concussion Treadmill Test After Sport-Related Concussion in Adolescents

Author: Barry S. Willer
Charles G. Wilber
Itai Bezherano
Jeffrey C. Miecznikowski
John J. Leddy
Kaitlin B. Viera
Kimberly J. Wilkins
Mohammad N. Haider
Publication venue: 'Frontiers Media SA'
Publication date: 01/04/2019

The Buffalo Concussion Treadmill Test (BCTT) identifies the heart rate threshold (HRt) of exercise tolerance in concussed patients. A previous study found that an absolute HRt of < 135 bpm was associated with prolonged recovery (>30 days) from sport-related concussion (SRC). In this study, we assessed the relationship of ΔHR (difference between resting HR and HRt) and recovery from SRC. Using a retrospective cohort design, we compared acutely (<10 days since injury) concussed adolescents who were prescribed either (1) relative rest (RG, n = 27, 15.2 ± 1 years, 33% female, median 17 days to recovery, ΔHR = 69.6 ± 28 bpm), (2) a placebo-stretching program (PG, n = 51, 15.4 ± 2 years, 49% female, median 17 days to recovery, ΔHR = 60.9 ± 22 bpm), or (3) sub-threshold aerobic exercise (AG, n = 52, 15.3 ± 2 years, 46% female, median 13 days to recovery, ΔHR = 62.4 ± 26 bpm). Linear regression showed that ΔHR significantly correlated with duration of clinical recovery for RG (p = 0.012, R² = 0.228) and PG (p = 0.011, R² = 0.126) but not for AG (p = 0.084, R² = 0.059). ΔHR values were significantly lower in participants with prolonged recovery (>30 days) in RG (p = 0.01) and PG (p = 0.04). A ΔHR of ≤50 bpm on the BCTT is 73% sensitive and 78% specific for predicting prolonged recovery in concussed adolescents who were prescribed the current standard of care (i.e., cognitive and physical rest).
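
As a rough illustration of the quantities named in the abstract above, the following sketch computes ΔHR and applies the reported ≤50 bpm cutoff, then shows how sensitivity and specificity are derived from predicted and observed outcomes. The function names and the example numbers are hypothetical and are not taken from the study. The sign convention (HRt minus resting HR) is an assumption inferred from the positive ΔHR values reported.

```python
from typing import Sequence, Tuple

# Cutoff reported in the abstract: a delta HR of <= 50 bpm predicted prolonged recovery.
DELTA_HR_CUTOFF_BPM = 50.0


def delta_hr(resting_hr_bpm: float, hr_threshold_bpm: float) -> float:
    """Delta HR, assumed here to be HRt minus resting HR (both in bpm)."""
    return hr_threshold_bpm - resting_hr_bpm


def flags_prolonged_recovery(
    resting_hr_bpm: float,
    hr_threshold_bpm: float,
    cutoff_bpm: float = DELTA_HR_CUTOFF_BPM,
) -> bool:
    """True when delta HR is at or below the cutoff."""
    return delta_hr(resting_hr_bpm, hr_threshold_bpm) <= cutoff_bpm


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(delta_hr(70, 115))
    print(flags_prolonged_recovery(70, 115))
    print(flags_prolonged_recovery(70, 135))
```

The cutoff is kept as a parameter so that other thresholds can be compared on the same data.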

Directory of Open Access Journals

The iNaturalist Localization 500 (iNatLoc500) Dataset

Author: Belongie Serge
Cole Elijah
Fornoni Marco
Howard Andrew
Mac Aodha Oisin
Perona Pietro
Van Horn Grant
Wilber Kimberly
Yang Xuan
Publication venue: CaltechDATA
Publication date: 16/07/2022

The iNaturalist Localization 500 (iNatLoc500) dataset. See the GitHub page (https://github.com/visipedia/inat_loc) for details.

Related Publication: On Label Granularity and Object Localization. Elijah Cole (Caltech), Kimberly Wilber (Google), Grant Van Horn (Cornell University), Xuan Yang (Google), Marco Fornoni (Google), Pietro Perona (Caltech), Serge Belongie (University of Copenhagen), Andrew Howard (Google), Oisin Mac Aodha (University of Edinburgh). European Conference on Computer Vision (ECCV) 2022.

CaltechDATA (California Institute of Technology Research Data Repository)