Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
It is widely believed that video captioning is a fundamental but challenging
task in both computer vision and artificial intelligence fields. The prevalent
approach is to map an input video to a variable-length output sentence in a
sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless,
RNN training still suffers to some degree from the vanishing/exploding
gradient problem, making optimization difficult. Moreover, the inherently
recurrent dependency in an RNN prevents parallelization within a sequence during
training and therefore limits computational efficiency. In this paper, we present a
novel design --- Temporal Deformable Convolutional Encoder-Decoder Networks
(dubbed as TDConvED) that fully employ convolutions in both encoder and decoder
networks for video captioning. Technically, we exploit convolutional block
structures that compute intermediate states of a fixed number of inputs and
stack several blocks to capture long-term relationships. The structure in
the encoder is further equipped with temporal deformable convolution to enable
free-form deformation of temporal sampling. Our model also capitalizes on
a temporal attention mechanism for sentence generation. Extensive experiments are
conducted on both MSVD and MSR-VTT video captioning datasets, and superior
results are reported when compared to conventional RNN-based encoder-decoder
techniques. More remarkably, TDConvED increases CIDEr-D performance from 58.8%
to 67.2% on MSVD. Comment: AAAI 201
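
The idea of temporal deformable sampling mentioned in the abstract above can be illustrated with a small sketch. This is not the TDConvED implementation: it assumes scalar per-frame features, a 1D kernel, offsets supplied by the caller (in a real model they would be predicted by a learned layer), linear interpolation at fractional positions, and zero padding outside the sequence. The names `sample_linear` and `temporal_deformable_conv` are hypothetical.

```python
import math


def sample_linear(seq, pos):
    """Linearly interpolate seq at a fractional position; zero outside the sequence."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return seq[i] if 0 <= i < len(seq) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def temporal_deformable_conv(seq, weights, offsets):
    """1D convolution whose sampling positions are shifted by per-tap offsets.

    seq:     list of scalar frame features (assumption: one value per frame)
    weights: kernel weights, odd length
    offsets: offsets[t][k] is the fractional shift for tap k at output step t
    """
    half = len(weights) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(weights):
            # Regular grid position (t + k - half) plus a free-form offset.
            pos = t + (k - half) + offsets[t][k]
            acc += w * sample_linear(seq, pos)
        out.append(acc)
    return out
```

With all offsets set to zero the function reduces to an ordinary temporal convolution, which is one way to sanity-check a sketch like this.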
Micro Fourier Transform Profilometry (FTP): 3D shape measurement at 10,000 frames per second
Recent advances in imaging sensors and digital light projection technology
have facilitated rapid progress in 3D optical sensing, enabling 3D surfaces
of complex-shaped objects to be captured with improved resolution and accuracy.
However, due to the large number of projection patterns required for phase
recovery and disambiguation, the maximum frame rates of current 3D shape
measurement techniques are still limited to the range of hundreds of frames per
second (fps). Here, we demonstrate a new 3D dynamic imaging technique, Micro
Fourier Transform Profilometry (FTP), which can capture 3D surfaces of
transient events at up to 10,000 fps based on our newly developed high-speed
fringe projection system. Compared with existing techniques, FTP has the
prominent advantage of recovering an accurate, unambiguous, and dense 3D point
cloud with only two projected patterns. Furthermore, the phase information is
encoded within a single high-frequency fringe image, thereby allowing
motion-artifact-free reconstruction of transient events with a temporal
resolution of 50 microseconds. To show FTP's broad utility, we use it to
reconstruct 3D videos of four transient scenes: vibrating cantilevers, rotating
fan blades, a bullet fired from a toy gun, and a balloon explosion triggered by a
flying dart, which were previously difficult or even impossible to capture with
conventional approaches. Comment: This manuscript was originally submitted on 30th January 1
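
As a rough illustration of how phase can be recovered from a single fringe image, the sketch below applies classic Fourier fringe analysis to one scan line. It is not the method of the abstract above: it assumes a 1D fringe of the form a + b*cos(2*pi*f0*x/N + phi(x)), a carrier that falls exactly on a DFT bin, and a hand-picked band-pass window, and it uses a naive DFT to stay dependency-free. The names `dft` and `ftp_phase` are hypothetical.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive O(N^2) discrete Fourier transform (forward or inverse)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            signal[x] * cmath.exp(sign * 2j * math.pi * k * x / n)
            for x in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def ftp_phase(fringe, carrier_bin, half_width):
    """Recover the wrapped phase phi(x) from one fringe scan line.

    Keeps only the spectral lobe around +carrier_bin, transforms back,
    and removes the carrier so that the remaining angle is phi(x).
    """
    n = len(fringe)
    spectrum = dft(fringe)
    filtered = [
        c if abs(k - carrier_bin) <= half_width else 0j
        for k, c in enumerate(spectrum)
    ]
    analytic = dft(filtered, inverse=True)
    return [
        cmath.phase(analytic[x] * cmath.exp(-2j * math.pi * carrier_bin * x / n))
        for x in range(n)
    ]
```

The returned phase is wrapped to (-pi, pi], so resolving larger phase excursions would need a separate unwrapping or disambiguation step that this sketch does not cover.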
Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder
Deep neural networks are vulnerable to backdoor attacks, where an adversary
maliciously manipulates the model's behavior by overlaying special triggers on
images. Existing backdoor defense methods often require access to a small
amount of validation data and to model parameters, which is impractical in many
real-world applications, e.g., when the model is provided as a cloud service.
In this paper, we address the practical task of blind backdoor defense at test
time, in particular for black-box models. The true label of every test image
needs to be recovered on the fly from the hard label predictions of a
suspicious model. Heuristic trigger search in image space, however, does not
scale to complex triggers or high image resolutions. We circumvent this
barrier by leveraging generic image generation models and propose a framework
of Blind Defense with Masked AutoEncoder (BDMAE). It uses the structural
similarity and label consistency between the test image and MAE restorations to
detect possible triggers. The detection result is refined by considering the
topology of triggers. We obtain a purified test image from restorations for
making predictions. Our approach is blind to model architectures, trigger
patterns, and image benignity. Extensive experiments on multiple datasets with
different backdoor attacks validate its effectiveness and generalizability.
Code is available at https://github.com/tsun/BDMAE
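
The label-consistency signal described in the abstract above can be sketched in a few lines. This is a toy illustration, not the BDMAE code from the linked repository: `restore_patch` and `predict_label` are hypothetical callables standing in for a masked-autoencoder restoration and a black-box hard-label classifier, and the structural-similarity score and topology-based refinement are omitted.

```python
def label_flip_scores(image, patch_ids, restore_patch, predict_label):
    """Score each patch by whether restoring it changes the hard label.

    image:         any image representation accepted by the two callables
    patch_ids:     iterable of patch identifiers to test
    restore_patch: hypothetical callable (image, patch_id) -> restored image
    predict_label: hypothetical callable image -> hard label (black-box model)
    """
    base_label = predict_label(image)
    scores = []
    for pid in patch_ids:
        restored = restore_patch(image, pid)
        # A label flip after restoring a patch suggests the patch held a trigger.
        scores.append(0.0 if predict_label(restored) == base_label else 1.0)
    return scores
```

Only hard labels are queried, which matches the black-box setting of the abstract; how many restorations to draw per patch and how to combine the scores are design choices left open here.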
Backdoor Cleansing with Unlabeled Data
Due to the increasing computational demand of Deep Neural Networks (DNNs),
companies and organizations have begun to outsource the training process.
However, externally trained DNNs can potentially be subject to backdoor attacks. It
is crucial to defend against such attacks, i.e., to postprocess a suspicious
model so that its backdoor behavior is mitigated while its normal prediction
power on clean inputs remains uncompromised. To remove the abnormal backdoor
behavior, existing methods mostly rely on additional labeled clean samples.
However, such a requirement may be unrealistic as the training data are often
unavailable to end users. In this paper, we investigate the possibility of
circumventing this barrier. We propose a novel defense method that does not
require training labels. Through carefully designed layer-wise weight
re-initialization and knowledge distillation, our method can effectively
cleanse backdoor behaviors of a suspicious network with negligible compromise
in its normal behavior. In experiments, we show that our method, trained
without labels, is on par with state-of-the-art defense methods trained using
labels. We also observe promising defense results even on out-of-distribution
data. This makes our method very practical
- …
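
The knowledge distillation mentioned in the last abstract above needs no labels because the loss compares two sets of model outputs. The sketch below shows a standard temperature-scaled distillation loss (KL divergence between softened teacher and student distributions). It is an assumption that a loss of this form is used; the abstract does not give the exact objective, the choice of teacher, or the temperature, and the names `softmax` and `distillation_loss` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs for one unlabeled input."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The loss is zero when the student reproduces the teacher's output distribution and positive otherwise, so it can be minimized on unlabeled inputs alone.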