17 research outputs found
Efficient Privacy Preserving Viola-Jones Type Object Detection via Random Base Image Representation
A cloud server spent a lot of time, energy and money to train a Viola-Jones
type object detector with high accuracy. Clients can upload their photos to the
cloud server to find objects. However, the client does not want the leakage of
the content of his/her photos. In the meanwhile, the cloud server is also
reluctant to leak any parameters of the trained object detectors. 10 years ago,
Avidan & Butman introduced Blind Vision, which is a method for securely
evaluating a Viola-Jones type object detector. Blind Vision uses standard
cryptographic tools and is painfully slow to compute, taking a couple of hours
to scan a single image. The purpose of this work is to explore an efficient
method that can speed up the process. We propose the Random Base Image (RBI)
Representation. The original image is divided into random base images. Only the
base images are submitted randomly to the cloud server. Thus, the content of
the image can not be leaked. In the meanwhile, a random vector and the secure
Millionaire protocol are leveraged to protect the parameters of the trained
object detector. The RBI makes the integral-image enable again for the great
acceleration. The experimental results reveal that our method can retain the
detection accuracy of that of the plain vision algorithm and is significantly
faster than the traditional blind vision, with only a very low probability of
the information leakage theoretically.Comment: 6 pages, 3 figures, To appear in the proceedings of the IEEE
International Conference on Multimedia and Expo (ICME), Jul 10, 2017 - Jul
14, 2017, Hong Kong, Hong Kon
DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing
Text-guided image editing faces significant challenges to training and
inference flexibility. Much literature collects large amounts of annotated
image-text pairs to train text-conditioned generative models from scratch,
which is expensive and not efficient. After that, some approaches that leverage
pre-trained vision-language models are put forward to avoid data collection,
but they are also limited by either per text-prompt optimization or
inference-time hyper-parameters tuning. To address these issues, we investigate
and identify a specific space, referred to as CLIP DeltaSpace, where the CLIP
visual feature difference of two images is semantically aligned with the CLIP
textual feature difference of their corresponding text descriptions. Based on
DeltaSpace, we propose a novel framework called DeltaEdit, which maps the CLIP
visual feature differences to the latent space directions of a generative model
during the training phase, and predicts the latent space directions from the
CLIP textual feature differences during the inference phase. And this design
endows DeltaEdit with two advantages: (1) text-free training; (2)
generalization to various text prompts for zero-shot inference. Extensive
experiments validate the effectiveness and versatility of DeltaEdit with
different generative models, including both the GAN model and the diffusion
model, in achieving flexible text-guided image editing. Code is available at
https://github.com/Yueming6568/DeltaEdit.Comment: 17 pages. arXiv admin note: text overlap with arXiv:2303.0628
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
Recently, large-scale pre-trained language-image models like CLIP have shown
extraordinary capabilities for understanding spatial contents, but naively
transferring such models to video recognition still suffers from unsatisfactory
temporal modeling capabilities. Existing methods insert tunable structures into
or in parallel with the pre-trained model, which either requires
back-propagation through the whole pre-trained model and is thus
resource-demanding, or is limited by the temporal reasoning capability of the
pre-trained structure. In this work, we present DiST, which disentangles the
learning of spatial and temporal aspects of videos. Specifically, DiST uses a
dual-encoder structure, where a pre-trained foundation model acts as the
spatial encoder, and a lightweight network is introduced as the temporal
encoder. An integration branch is inserted between the encoders to fuse
spatio-temporal information. The disentangled spatial and temporal learning in
DiST is highly efficient because it avoids the back-propagation of massive
pre-trained parameters. Meanwhile, we empirically show that disentangled
learning with an extra network for integration benefits both spatial and
temporal understanding. Extensive experiments on five benchmarks show that DiST
delivers better performance than existing state-of-the-art methods by
convincing gaps. When pre-training on the large-scale Kinetics-710, we achieve
89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability
of DiST. Codes and models can be found in
https://github.com/alibaba-mmai-research/DiST.Comment: ICCV2023. Code: https://github.com/alibaba-mmai-research/DiS
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
A diffusion probabilistic model (DPM), which constructs a forward diffusion
process by gradually adding noise to data points and learns the reverse
denoising process to generate new samples, has been shown to handle complex
data distribution. Despite its recent success in image synthesis, applying DPMs
to video generation is still challenging due to high-dimensional data spaces.
Previous methods usually adopt a standard diffusion process, where frames in
the same video clip are destroyed with independent noises, ignoring the content
redundancy and temporal correlation. This work presents a decomposed diffusion
process via resolving the per-frame noise into a base noise that is shared
among all frames and a residual noise that varies along the time axis. The
denoising pipeline employs two jointly-learned networks to match the noise
decomposition accordingly. Experiments on various datasets confirm that our
approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based
alternatives in high-quality video generation. We further show that our
decomposed formulation can benefit from pre-trained image diffusion models and
well-support text-conditioned video creation.Comment: Accepted to CVPR202
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Relational Language-Image Pre-training (RLIP) aims to align vision
representations with relational texts, thereby advancing the capability of
relational reasoning in computer vision tasks. However, hindered by the slow
convergence of RLIPv1 architecture and the limited availability of existing
scene graph data, scaling RLIPv1 is challenging. In this paper, we propose
RLIPv2, a fast converging model that enables the scaling of relational
pre-training to large-scale pseudo-labelled scene graph data. To enable fast
scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism
that facilitates earlier and deeper gated cross-modal fusion with sparsified
language encoding layers. ALIF leads to comparable or better performance than
RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain
scene graph data at scale, we extend object detection datasets with free-form
relation labels by introducing a captioner (e.g., BLIP) and a designed Relation
Tagger. The Relation Tagger assigns BLIP-generated relation texts to region
pairs, thus enabling larger-scale relational pre-training. Through extensive
experiments conducted on Human-Object Interaction Detection and Scene Graph
Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under
fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2
achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with
just 1% data and yields 45.09mAP with 100% data. Code and models are publicly
available at https://github.com/JacobYuan7/RLIPv2.Comment: Accepted to ICCV 2023. Code and models:
https://github.com/JacobYuan7/RLIPv
Abrupt motion tracking using a visual saliency embedded particle filter
Abrupt motion is a significant challenge that commonly causes traditional tracking methods to fail. This paper presents an improved visual saliency model and integrates it to a particle filter tracker to solve this problem. Once the target is lost, our algorithm recovers tracking by detecting the target region from salient regions, which are obtained in the saliency map of current frame. In addition, to strengthen the saliency of target region, the target model is used as a prior knowledge to calculate a weight set which is utilized to construct our improved saliency map adaptively. Furthermore, we adopt the covariance descriptor as the appearance model to describe the object more accurately. Compared with several other tracking algorithms, the experimental results demonstrate that our method is more robust in dealing with various types of abrupt motion scenarios. © 2013 Elsevier Ltd. All rights reserved
Controllable synthesis of zeolitic imidazolate frameworks with rod-like or delta-shaped morphologies at oil-water interface
Zeolitic imidazolate framework (ZIF-8) has attracted extensive application in gas adsorption, membrane separation and catalysis. It has been demonstrated that the morphology is crucial for the physical and chemical properties. In this work, ZIF-8 crystals with rod or delta morphology are regulated by interface synthesis. The preferred growth of specific crystalline face is controlled at oil-water interface and results in uniform ZIF-8 crystals with various morphologies. Solvents types and feeding concentrations play crucial roles in crystalline growth. This method can be extended to other materials synthesis, such as ZIF-1 and ZIF-67 crystals. The interface synthesis approach may provide insight to promote advanced materials fabrication and achieve better materials function
Distribution Adaptive INT8 Quantization for Training CNNs
Researches have demonstrated that low bit-width (e.g., INT8) quantization can be employed to accelerate the inference process. It makes the gradient quantization very promising since the backward propagation requires approximately twice more computation than forward one. Due to the variability and uncertainty of gradient distribution, a lot of methods have been proposed to attain training stability. However, most of them ignore the channel-wise gradient distributions and the impact of gradients with different magnitudes, resulting in the degradation of final accuracy. In this paper, we propose a novel INT8 quantization training framework for convolutional neural network to address the above issues. Specifically, we adopt Gradient Vectorized Quantization to quantize the gradient, based on the observation that layer-wise gradients contain multiple distributions along the channel dimension. Then, Magnitude-aware Clipping Strategy is introduced by taking the magnitudes of gradients into consideration when minimizing the quantization error, and we present a theoretical derivation to solve the quantization parameters of different distributions. Experimental results on broad range of computer vision tasks, such as image classification, object detection and video classification, demonstrate that the proposed Distribution Adaptive INT8 Quantization training method has achieved almost lossless training accuracy for different backbones, including ResNet, MobileNetV2, InceptionV3, VGG and AlexNet, which is superior to the state-of-the-art techniques. Moreover, we further implement the INT8 kernel that can accelerate the training iteration more than 200% under the latest Turing architecture, i.e., our method excels on both training accuracy and speed
Application of Association Rules Analysis in Mining Adverse Drug Reaction Signals
Adverse drug reactions (ADRs) are increasingly becoming a serious public health problem. Spontaneous reporting systems (SRSs) are an important way for many countries to monitor ADRs produced in the clinical use of drugs, and they are the main data source for ADR signal detection. The traditional signal detection methods are based on disproportionality analysis (DPA) and lack the application of data mining technology. In this paper, we selected the spontaneous reports from 2011 to 2018 in Jiangsu Province of China as the research data and used association rules analysis (ARA) to mine signals. We defined some important metrics of the ARA according to the two-dimensional contingency table of ADRs, such as Confidence and Lift, and constructed performance evaluation indicators such as Precision, Recall, and F1 as objective standards. We used experimental methods based on data to objectively determine the optimal thresholds of the corresponding metrics, which, in the best case, are Confidence = 0.007 and Lift = 1. We obtained the average performance of the method through 10-fold cross-validation. The experimental results showed that F1 increased from 31.43% in the MHRA method to 40.38% in the ARA method; this was a significant improvement. To reduce drug risk and provide decision making for drug safety, more data mining methods need to be introduced and applied to ADR signal detection