
    When Does Contrastive Visual Representation Learning Work?

    Recent self-supervised representation learning techniques have largely closed the gap between supervised and unsupervised learning on ImageNet classification. While the particulars of pretraining on ImageNet are now relatively well understood, the field still lacks widely accepted best practices for replicating this success on other datasets. As a first step in this direction, we study contrastive self-supervised learning on four diverse large-scale datasets. By looking through the lenses of data quantity, data domain, data quality, and task granularity, we provide new insights into the necessary conditions for successful self-supervised learning. Our key findings include observations such as: (i) the benefit of additional pretraining data beyond 500k images is modest, (ii) adding pretraining images from another domain does not lead to more general representations, (iii) corrupted pretraining images have a disparate impact on supervised and self-supervised pretraining, and (iv) contrastive learning lags far behind supervised learning on fine-grained visual classification tasks. Comment: CVPR 202

    Benchmarking Representation Learning for Natural World Image Collections

    Recent progress in self-supervised learning has resulted in models that are capable of extracting rich representations from image collections without requiring any explicit label supervision. However, to date the vast majority of these approaches have restricted themselves to training on standard benchmark datasets such as ImageNet. We argue that fine-grained visual categorization problems, such as plant and animal species classification, provide an informative testbed for self-supervised learning. To facilitate progress in this area, we present two new natural world visual classification datasets, iNat2021 and NeWT. The former consists of 2.7M images from 10k different species uploaded by users of the citizen science application iNaturalist. We designed the latter, NeWT, in collaboration with domain experts with the aim of benchmarking the performance of representation learning algorithms on a suite of challenging natural world binary classification tasks that go beyond standard species classification. These two new datasets allow us to explore questions related to large-scale representation and transfer learning in the context of fine-grained categories. We provide a comprehensive analysis of feature extractors trained with and without supervision on ImageNet and iNat2021, shedding light on the strengths and weaknesses of different learned features across a diverse set of tasks. We find that features produced by standard supervised methods still outperform those produced by self-supervised approaches such as SimCLR. However, improved self-supervised learning methods are constantly being released, and the iNat2021 and NeWT datasets are a valuable resource for tracking their progress. Comment: CVPR 202

    Bridging the Gap Between Object Detection and User Intent via Query-Modulation

    When interacting with objects through cameras or pictures, users often have a specific intent. For example, they may want to perform a visual search. However, most object detection models ignore the user intent, relying on image pixels as their only input. This often leads to incorrect results, such as the lack of a high-confidence detection on the object of interest, or a detection with the wrong class label. In this paper we investigate techniques to modulate standard object detectors to explicitly account for the user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors show superior performance at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be simultaneously trained to solve for both query-modulated detection and standard object detection.

    PolyMaX: General Dense Prediction with Mask Transformer

    Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, at the recent frontier of segmentation tasks, the community has been witnessing a paradigm shift from per-pixel prediction to cluster-prediction with the emergence of transformer architectures, particularly mask transformers, which directly predict a label for a mask instead of a pixel. Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. This allows us to unify dense prediction tasks with the mask transformer framework. Remarkably, the resulting model, PolyMaX, demonstrates state-of-the-art performance on three benchmarks of the NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for more dense prediction tasks. Code and model will be made available. Comment: WACV 202
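The idea of discretizing a continuous output space, mentioned in the abstract above, can be illustrated with a minimal sketch. It is not the PolyMaX, DORN, or AdaBins implementation: the uniform bins, the depth range, and the function names (`bin_centers`, `expected_depth`) are assumptions chosen for brevity. A per-pixel classifier outputs one logit per depth bin, and a continuous depth is recovered as the probability-weighted average of the bin centers.

```python
import math


def bin_centers(d_min, d_max, n_bins):
    """Centers of n_bins uniform-width bins over [d_min, d_max].

    Uniform spacing is an assumption of this sketch; the methods named in
    the abstract choose their bins differently.
    """
    width = (d_max - d_min) / n_bins
    return [d_min + width * (i + 0.5) for i in range(n_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(logits, centers):
    """Turn per-bin logits for one pixel into a single continuous depth."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))


# Hypothetical example: 5 bins over a 0 to 10 m range, one pixel's logits.
centers = bin_centers(0.0, 10.0, 5)
depth = expected_depth([0.1, 2.0, 0.3, -1.0, -1.0], centers)
print(f"bin centers: {centers}, predicted depth: {depth:.2f} m")
```

Because the output is a weighted average rather than the single most likely bin, the prediction can fall between bin centers, which is what lets a classification-style head produce continuous values.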

    The Predictive Capacity of the Buffalo Concussion Treadmill Test After Sport-Related Concussion in Adolescents

    The Buffalo Concussion Treadmill Test (BCTT) identifies the heart rate threshold (HRt) of exercise tolerance in concussed patients. A previous study found that an absolute HRt of <135 bpm was associated with prolonged recovery (>30 days) from sport-related concussion (SRC). In this study, we assessed the relationship between ΔHR (the difference between resting HR and HRt) and recovery from SRC. Using a retrospective cohort design, we compared acutely (<10 days since injury) concussed adolescents who were prescribed either (1) relative rest (RG, n = 27, 15.2 ± 1 years, 33% female, median 17 days to recovery, ΔHR = 69.6 ± 28 bpm), (2) a placebo-stretching program (PG, n = 51, 15.4 ± 2 years, 49% female, median 17 days to recovery, ΔHR = 60.9 ± 22 bpm), or (3) sub-threshold aerobic exercise (AG, n = 52, 15.3 ± 2 years, 46% female, median 13 days to recovery, ΔHR = 62.4 ± 26 bpm). Linear regression showed that ΔHR significantly correlated with duration of clinical recovery for RG (p = 0.012, R² = 0.228) and PG (p = 0.011, R² = 0.126) but not for AG (p = 0.084, R² = 0.059). ΔHR values were significantly lower in participants with prolonged recovery (>30 days) in RG (p = 0.01) and PG (p = 0.04). A ΔHR of ≤50 bpm on the BCTT is 73% sensitive and 78% specific for predicting prolonged recovery in concussed adolescents who were prescribed the current standard of care (i.e., cognitive and physical rest).
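The ΔHR arithmetic and the ≤50 bpm cutoff described above can be sketched in a few lines. This is an illustration only, not the study's analysis code: the function names and the example heart rates are hypothetical, and the cutoff is the one value taken from the abstract.

```python
def delta_hr(resting_hr, hr_threshold):
    """ΔHR in bpm: heart rate threshold (HRt) on the BCTT minus resting HR."""
    return hr_threshold - resting_hr


def flags_prolonged_recovery(resting_hr, hr_threshold, cutoff=50.0):
    """True when ΔHR is at or below the cutoff (50 bpm in the abstract)."""
    return delta_hr(resting_hr, hr_threshold) <= cutoff


def sensitivity_specificity(predicted, actual):
    """Sensitivity and specificity of boolean predictions against outcomes.

    Assumes at least one positive and one negative outcome in `actual`.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical patient: resting HR 80 bpm, symptoms worsen at 125 bpm.
print(delta_hr(80, 125), flags_prolonged_recovery(80, 125))
```

Sensitivity is the fraction of patients with prolonged recovery that the cutoff flags, and specificity is the fraction of patients without prolonged recovery that it correctly leaves unflagged.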

    The iNaturalist Localization 500 (iNatLoc500) Dataset

    The iNaturalist Localization 500 (iNatLoc500) dataset. See the GitHub page (https://github.com/visipedia/inat_loc) for details. Related publication: "On Label Granularity and Object Localization" by Elijah Cole (Caltech), Kimberly Wilber (Google), Grant Van Horn (Cornell University), Xuan Yang (Google), Marco Fornoni (Google), Pietro Perona (Caltech), Serge Belongie (University of Copenhagen), Andrew Howard (Google), and Oisin Mac Aodha (University of Edinburgh). European Conference on Computer Vision (ECCV) 2022.