Variance-Covariance Regularization Improves Representation Learning
Transfer learning has emerged as a key approach in the machine learning
domain, enabling the application of knowledge derived from one domain to
improve performance on subsequent tasks. Given the often limited information
about these subsequent tasks, a strong transfer learning approach calls for the
model to capture a diverse range of features during the initial pretraining
stage. However, recent research suggests that, without sufficient
regularization, the network tends to concentrate on features that primarily
reduce the pretraining loss function. This tendency can result in inadequate
feature learning and impaired generalization capability for target tasks. To
address this issue, we propose Variance-Covariance Regularization (VCR), a
regularization technique aimed at fostering diversity in the learned network
features. Drawing inspiration from recent advancements in self-supervised
learning, our approach promotes learned representations that exhibit high
variance and minimal covariance, thus preventing the network from focusing
solely on loss-reducing features.
We empirically validate the efficacy of our method through comprehensive
experiments coupled with in-depth analytical studies on the learned
representations. In addition, we develop an efficient implementation strategy
that assures minimal computational overhead associated with our method. Our
results indicate that VCR is a powerful and efficient method for enhancing
transfer learning performance for both supervised learning and self-supervised
learning, opening new possibilities for future research in this domain.
Comment: 16 pages, 2 figures
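Below is a minimal sketch, in PyTorch, of the kind of variance-covariance penalty the abstract describes: keep each feature dimension's standard deviation above a target and penalize off-diagonal covariance. The hinge target `gamma`, the epsilon, and the weighting against the task loss are illustrative assumptions, not the paper's reported setup.

```python
# A minimal sketch of a variance-covariance penalty (PyTorch). The hinge
# target `gamma`, the epsilon, and the loss weighting are illustrative
# assumptions, not the paper's reported hyperparameters.
import torch


def variance_covariance_penalty(z: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """z: (batch, dim) representations from the layer being regularized."""
    z = z - z.mean(dim=0)                          # center each dimension
    std = torch.sqrt(z.var(dim=0) + 1e-4)
    var_loss = torch.relu(gamma - std).mean()      # keep per-dim std above gamma
    cov = (z.T @ z) / (z.shape[0] - 1)             # (dim, dim) sample covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / z.shape[1]  # decorrelate feature dimensions
    return var_loss + cov_loss


# Hypothetical usage alongside a task loss:
# loss = task_loss + lambda_vcr * variance_covariance_penalty(features)
```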
X-ray reflection spectroscopy with Kaluza-Klein black holes
Kaluza-Klein theory is a popular alternative theory of gravity, with both
non-rotating and rotating black hole solutions known. This allows for the
possibility that the theory could be observationally tested. We present a model
which calculates the reflection spectrum of a black hole accretion disk system,
where the black hole is described by a rotating solution of the Kaluza-Klein
theory. We also use this model to analyze X-ray data from the stellar-mass
black hole in GRS 1915+105 and provide constraints on the free parameters of
Kaluza-Klein black holes.
Comment: 10 pages, 4 figures. v2: refereed version
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Vision-language pre-training (VLP) has recently proven highly effective for
various uni- and multi-modal downstream applications. However, most existing
end-to-end VLP methods use high-resolution image-text box data to perform well
on fine-grained region-level tasks, such as object detection, segmentation, and
referring expression comprehension. Unfortunately, such high-resolution images
with accurate bounding box annotations are expensive to collect and use for
supervision at scale. In this work, we propose VoLTA (Vision-Language
Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm
that only utilizes image-caption data but achieves fine-grained region-level
image understanding, eliminating the use of expensive box annotations. VoLTA
adopts graph optimal transport-based weakly-supervised alignment on local image
patches and text tokens to germinate an explicit, self-normalized, and
interpretable low-level matching criterion. In addition, VoLTA pushes
multi-modal fusion deep into the uni-modal backbones during pre-training and
removes fusion-specific transformer layers, further reducing memory
requirements. Extensive experiments on a wide range of vision- and
vision-language downstream tasks demonstrate the effectiveness of VoLTA on
fine-grained applications without compromising the coarse-grained downstream
performance, often outperforming methods using significantly more caption and
box annotations.
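As a rough illustration of the alignment idea (not VoLTA's full graph optimal transport objective, which also carries a structural Gromov-Wasserstein term omitted here), the sketch below computes an entropic-regularized transport plan between patch and token features via Sinkhorn iterations and uses the resulting transport cost as a matching loss. All names, marginals, and hyperparameters are assumptions.

```python
# Entropic-regularized optimal transport (Sinkhorn) between patch and token
# features, used here as a stand-in matching loss. Uniform marginals, eps,
# and the iteration count are assumptions.
import torch
import torch.nn.functional as F


def sinkhorn_alignment(patches: torch.Tensor, tokens: torch.Tensor,
                       eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """patches: (m, d) image-patch features; tokens: (n, d) text-token features."""
    cost = 1.0 - F.normalize(patches, dim=-1) @ F.normalize(tokens, dim=-1).T
    K = torch.exp(-cost / eps)                        # Gibbs kernel
    a = torch.full((cost.shape[0],), 1.0 / cost.shape[0])
    b = torch.full((cost.shape[1],), 1.0 / cost.shape[1])
    u = torch.ones_like(a)
    for _ in range(iters):                            # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]                # soft patch-token alignment
    return (plan * cost).sum()                        # transport cost as loss
```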
Generalized Neural Collapse for a Large Number of Classes
Neural collapse provides an elegant mathematical characterization of learned
last layer representations (a.k.a. features) and classifier weights in deep
classification models. Such results not only provide insights but also motivate
new techniques for improving practical deep models. However, most of the
existing empirical and theoretical studies of neural collapse focus on the case
where the number of classes is small relative to the dimension of the feature
space. This paper extends neural collapse to cases where the number of classes
is much larger than the dimension of the feature space, as broadly occurs in
language models, retrieval systems, and face recognition applications. We show
that the features and classifier exhibit a generalized neural collapse
phenomenon in which the minimum one-vs-rest margin is maximized. We provide an
empirical study to verify the occurrence of generalized neural collapse in
practical deep neural networks. Moreover, we provide a theoretical study to show
that generalized neural collapse provably occurs under the unconstrained
feature model with a spherical constraint, under certain technical conditions on
the feature dimension and number of classes.
Comment: 32 pages, 12 figures
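The quantity at the heart of generalized neural collapse can be made concrete in a few lines of PyTorch: the sketch below computes each sample's one-vs-rest margin (true-class score minus the strongest rival-class score) and its minimum over the dataset. Variable names are illustrative.

```python
# One-vs-rest margins and their minimum over a dataset (PyTorch).
# Variable names are illustrative.
import torch


def min_one_vs_rest_margin(features: torch.Tensor, weights: torch.Tensor,
                           labels: torch.Tensor) -> torch.Tensor:
    """features: (n, d); weights: (K, d) classifier rows; labels: (n,) int64."""
    logits = features @ weights.T                        # (n, K) class scores
    true = logits.gather(1, labels[:, None]).squeeze(1)  # true-class score
    logits.scatter_(1, labels[:, None], float("-inf"))   # mask out the true class
    best_rival = logits.max(dim=1).values                # strongest other class
    margins = true - best_rival                          # one-vs-rest margin
    return margins.min()    # generalized neural collapse maximizes this minimum
```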
Stereoscopic video quality assessment based on 3D convolutional neural networks
Research on stereoscopic video quality assessment (SVQA) plays an important role in promoting the development of stereoscopic video systems. Existing SVQA metrics rely on hand-crafted features, which are inaccurate and time-consuming because of the diversity and complexity of stereoscopic video distortions. This paper introduces a 3D convolutional neural network (CNN) based SVQA framework that can model not only local spatio-temporal information but also global temporal information, taking cubic difference-video patches as input. First, instead of using hand-crafted features, we design a 3D CNN architecture to automatically and effectively capture local spatio-temporal features. Then we employ a quality-score fusion strategy that considers global temporal cues to obtain the final video-level predicted score. Extensive experiments on two public stereoscopic video quality datasets show that the proposed method correlates highly with human perception and outperforms state-of-the-art methods by a large margin. We also show that our 3D CNN features have more desirable properties for SVQA than the hand-crafted features used in previous methods, and that combining our 3D CNN features with support vector regression (SVR) can further boost performance. In addition, requiring neither complex preprocessing nor GPU acceleration, the proposed method is demonstrated to be computationally efficient and easy to use.
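A minimal sketch of the central ingredient, a small 3D CNN regressing a quality score from cubic difference-video patches, is given below. The layer widths, patch shape, and pooling are assumptions rather than the paper's architecture.

```python
# Illustrative 3D-CNN quality regressor over cubic difference-video patches.
# Layer widths, patch shape, and pooling are assumptions, not the paper's
# architecture.
import torch
import torch.nn as nn


class PatchQuality3DCNN(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                     # pool over time and space
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),             # global spatio-temporal pooling
        )
        self.regressor = nn.Linear(32, 1)        # patch-level quality score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width) difference-video patches
        return self.regressor(self.features(x).flatten(1))


scores = PatchQuality3DCNN()(torch.randn(8, 1, 16, 32, 32))  # (8, 1)
```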
Understanding the Robustness of 3D Object Detection with Bird's-Eye-View Representations in Autonomous Driving
3D object detection is an essential perception task in autonomous driving to
understand the environments. The Bird's-Eye-View (BEV) representations have
significantly improved the performance of 3D detectors with camera inputs on
popular benchmarks. However, a systematic understanding of the robustness of
these vision-dependent BEV models, which is closely related to the safety of
autonomous driving systems, is still lacking. In this paper, we evaluate the
natural and adversarial robustness of various representative models under
extensive settings, to fully understand their behaviors influenced by explicit
BEV features compared with those without BEV. In addition to the classic
settings, we propose a 3D consistent patch attack by applying adversarial
patches in the 3D space to guarantee the spatiotemporal consistency, which is
more realistic for the scenario of autonomous driving. With substantial
experiments, we draw several findings: 1) BEV models tend to be more stable
than previous methods under different natural conditions and common corruptions
due to the expressive spatial representations; 2) BEV models are more
vulnerable to adversarial noises, mainly caused by the redundant BEV features;
3) Camera-LiDAR fusion models have superior performance under different
settings with multi-modal inputs, but the BEV fusion model is still vulnerable
to adversarial noise in both the point cloud and the image. These findings
highlight the safety issues in applying BEV detectors and could facilitate the
development of more robust models.
Comment: 8 pages, CVPR202
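The geometric step behind a 3D-consistent patch can be sketched simply: because the patch lies on a plane in world space, its mapping into each camera frame is a homography, so the same texture stays spatially consistent across views and timestamps. The code below is an assumed, simplified rendering step (numpy/OpenCV), not the paper's attack pipeline.

```python
# Simplified rendering step for a 3D-consistent patch: the patch lies on a
# world-space plane, so each camera view sees it through a homography.
# Camera matrix P and corner layout are illustrative assumptions; corners are
# assumed to lie in front of the camera.
import cv2
import numpy as np


def paste_3d_patch(image: np.ndarray, patch: np.ndarray,
                   corners_3d: np.ndarray, P: np.ndarray) -> np.ndarray:
    """image: HxWx3; patch: hxwx3 texture; corners_3d: (4, 3) world-space
    corners ordered TL, TR, BR, BL; P: 3x4 camera projection matrix."""
    pts = np.hstack([corners_3d, np.ones((4, 1))]) @ P.T  # project to pixels
    pts = (pts[:, :2] / pts[:, 2:3]).astype(np.float32)
    h, w = patch.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])    # texture corners
    H = cv2.getPerspectiveTransform(src, pts)             # texture -> image
    size = (image.shape[1], image.shape[0])
    warped = cv2.warpPerspective(patch, H, size)
    mask = cv2.warpPerspective(np.ones((h, w), np.uint8), H, size)
    out = image.copy()
    out[mask > 0] = warped[mask > 0]                      # composite the patch
    return out
```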
Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective
Visual representation learning is key to solving various vision problems.
Relying on the seminal grid structure priors, convolutional neural networks
(CNNs) have been the de facto standard architectures of most deep vision
models. For instance, classical semantic segmentation methods often adopt a
fully-convolutional network (FCN) with an encoder-decoder architecture. The
encoder progressively reduces the spatial resolution and learns more abstract
visual concepts with larger receptive fields. Since context modeling is
critical for segmentation, the latest efforts have been focused on increasing
the receptive field, through either dilated (i.e., atrous) convolutions or
inserting attention modules. However, the FCN-based architecture remains
unchanged. In this paper, we aim to provide an alternative perspective by
treating visual representation learning generally as a sequence-to-sequence
prediction task. Specifically, we deploy a pure Transformer to encode an image
as a sequence of patches, without local convolution and resolution reduction.
With the global context modeled in every layer of the Transformer, stronger
visual representation can be learned for better tackling vision tasks. In
particular, our segmentation model, termed SEgmentation TRansformer (SETR),
excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on
the day of submission), Pascal Context (55.83% mIoU) and reaches competitive
results on Cityscapes. Further, we formulate a family of Hierarchical
Local-Global (HLG) Transformers characterized by local attention within windows
and global attention across windows in a hierarchical and pyramidal
architecture. Extensive experiments show that our method achieves appealing
performance on a variety of visual recognition tasks (e.g., image
classification, object detection and instance segmentation and semantic
segmentation).
Comment: Extended version of CVPR 2021 paper arXiv:2012.1584
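The sequence-to-sequence view described above can be sketched compactly: embed non-overlapping patches (the strided convolution below merely implements the linear patch projection), add positional embeddings, and run a plain Transformer encoder with no resolution reduction. Dimensions are illustrative, not SETR's configuration.

```python
# Pure-Transformer patch-sequence encoder: no local convolutional stages and
# no downsampling after patch embedding (the strided convolution only
# implements the linear patch projection). Dimensions are illustrative.
import torch
import torch.nn as nn


class PatchTransformerEncoder(nn.Module):
    def __init__(self, img: int = 224, patch: int = 16, dim: int = 256,
                 depth: int = 4, heads: int = 8) -> None:
        super().__init__()
        n_tokens = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) -> (B, N, dim) tokens, global attention in every layer
        z = self.embed(x).flatten(2).transpose(1, 2) + self.pos
        return self.encoder(z)


tokens = PatchTransformerEncoder()(torch.randn(2, 3, 224, 224))  # (2, 196, 256)
```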
Evaluation of Chinese Quad-polarization Gaofen-3 SAR Wave Mode Data for Significant Wave Height Retrieval
Our work assesses the accuracy of Chinese quad-polarization Gaofen-3 (GF-3) synthetic aperture radar (SAR) wave mode data for wave retrieval and provides guidance for the operational application of GF-3 SAR. In this study, we evaluated the accuracy of the SAR-derived significant wave height (SWH) from 10,514 GF-3 wave mode images with visible wave streaks, using existing wave retrieval algorithms: the theory-based parameterized first-guess spectrum method (PFSM), the empirical VV-polarization algorithm CSAR_WAVE2, and the quad-polarization algorithm (Q-P). The retrieved SWHs were compared with the European Centre for Medium-Range Weather Forecasts (ECMWF) reanalysis field on a 0.125° grid. CSAR_WAVE2 achieved the lowest root mean square error (RMSE) in SWH, 0.57 m, smaller than the RMSEs obtained with the PFSM and Q-P algorithms. The statistical analysis also indicated that the bias varied little with increasing wind speed. However, the retrieval tended to overestimate the SWH below 2.5 m and to underestimate it as the SWH increased. This behavior points to the improvements needed in SWH retrieval algorithms for GF-3 SAR data acquired in wave mode.
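The evaluation statistics described above reduce to a few lines of numpy: overall RMSE against the ECMWF reference and mean bias binned by wind speed. Variable names and the 2 m/s bin width are assumptions.

```python
# Overall RMSE against the reference SWH and mean bias binned by wind speed
# (numpy). Variable names and the bin width are assumptions.
import numpy as np


def swh_stats(swh_sar: np.ndarray, swh_ref: np.ndarray,
              wind_speed: np.ndarray, bin_width: float = 2.0):
    err = swh_sar - swh_ref                         # retrieval minus reference
    rmse = float(np.sqrt(np.mean(err ** 2)))
    bins = (wind_speed // bin_width).astype(int)
    bias_by_wind = {b * bin_width: float(err[bins == b].mean())
                    for b in np.unique(bins)}       # mean bias per wind bin
    return rmse, bias_by_wind
```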
Blind assessment for stereo images considering binocular characteristics and deep perception map based on deep belief network
In recent years, blind image quality assessment for 2D images/video has gained popularity, but its application to 3D images/video has yet to be generalized. In this paper, we propose an effective blind metric for evaluating stereo images via a deep belief network (DBN). The method is based on the wavelet transform, combining 2D features from the two monocular images as image content descriptors with 3D features from a novel depth perception map (DPM) as depth perception descriptors. In particular, the DPM is introduced to quantify longitudinal depth information so as to align with human stereo visual perception. More specifically, the 2D features are local histogram of oriented gradients (HoG) features from high-frequency wavelet coefficients together with global statistical features including magnitude, variance, and entropy; the global statistical features of the DPM serve as the 3D features. Subsequently, to account for binocular characteristics, an effective binocular weight model based on multiscale energy estimation of the left and right images is adopted to obtain the content quality. In the training and testing stages, three DBN models, one per feature type, are used to produce the final score. Experimental results demonstrate that the proposed stereo image quality evaluation model outperforms existing methods and achieves higher consistency with subjective quality assessments.
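As a rough sketch of the global statistical features mentioned above, the code below computes mean magnitude, variance, and entropy over the high-frequency subbands of a single-level 2D wavelet decomposition (via PyWavelets); the wavelet choice and histogram binning are assumptions.

```python
# Global statistics (mean magnitude, variance, entropy) of the high-frequency
# subbands from a single-level 2D wavelet decomposition (PyWavelets). The
# wavelet choice and histogram binning are assumptions.
import numpy as np
import pywt


def global_wavelet_stats(img: np.ndarray, wavelet: str = "db2") -> np.ndarray:
    _, (cH, cV, cD) = pywt.dwt2(img.astype(float), wavelet)
    feats = []
    for band in (cH, cV, cD):                     # horizontal/vertical/diagonal
        mag = np.abs(band)
        hist, _ = np.histogram(mag, bins=64)
        p = hist[hist > 0] / hist.sum()
        entropy = -np.sum(p * np.log2(p))         # Shannon entropy of magnitudes
        feats += [mag.mean(), band.var(), entropy]
    return np.array(feats)                        # 3 stats x 3 subbands = 9-dim
```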