
    Efficient Privacy Preserving Viola-Jones Type Object Detection via Random Base Image Representation

    Full text link
    A cloud server has spent considerable time, energy and money to train a Viola-Jones type object detector with high accuracy. Clients can upload their photos to the cloud server to find objects, but a client does not want the content of his or her photos to be leaked. At the same time, the cloud server is reluctant to leak any parameters of the trained object detector. Ten years ago, Avidan & Butman introduced Blind Vision, a method for securely evaluating a Viola-Jones type object detector. Blind Vision uses standard cryptographic tools and is painfully slow to compute, taking a couple of hours to scan a single image. The purpose of this work is to explore an efficient method that speeds up the process. We propose the Random Base Image (RBI) representation: the original image is divided into random base images, and only the base images are submitted, in random order, to the cloud server, so the content of the image cannot be leaked. Meanwhile, a random vector and the secure Millionaire protocol are leveraged to protect the parameters of the trained object detector. The RBI also re-enables the integral image, which yields a large acceleration. The experimental results reveal that our method retains the detection accuracy of the plain vision algorithm and is significantly faster than traditional Blind Vision, with only a very low theoretical probability of information leakage. Comment: 6 pages, 3 figures, to appear in the proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Jul 10-14, 2017, Hong Kong, Hong Kong.
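
    The decomposition described above can be illustrated with a short sketch. The minimal NumPy example below was written for this summary and is not taken from the paper; it shows that an image split into random base images is only recovered by summing all of them, and that the integral image (a linear operation) of the original equals the sum of the per-base integral images, which is what makes the integral-image acceleration possible again.

        import numpy as np

        def split_into_random_base_images(image, n_bases=3, rng=None):
            """Split an image into n_bases random images that sum back to the original."""
            rng = np.random.default_rng() if rng is None else rng
            bases = [rng.uniform(-255.0, 255.0, size=image.shape) for _ in range(n_bases - 1)]
            bases.append(image.astype(np.float64) - sum(bases))  # final base restores the sum
            return bases

        image = np.arange(16.0).reshape(4, 4)
        bases = split_into_random_base_images(image)

        # Each base image looks like noise on its own, yet the sums are preserved,
        # and the integral image commutes with the sum because it is linear.
        integral = lambda x: x.cumsum(axis=0).cumsum(axis=1)
        assert np.allclose(sum(bases), image)
        assert np.allclose(sum(integral(b) for b in bases), integral(image))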

    DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing

    Full text link
    Text-guided image editing faces significant challenges in both training and inference flexibility. Much of the literature collects large amounts of annotated image-text pairs to train text-conditioned generative models from scratch, which is expensive and inefficient. More recent approaches leverage pre-trained vision-language models to avoid data collection, but they are limited by either per-text-prompt optimization or inference-time hyper-parameter tuning. To address these issues, we investigate and identify a specific space, referred to as CLIP DeltaSpace, in which the CLIP visual feature difference of two images is semantically aligned with the CLIP textual feature difference of their corresponding text descriptions. Based on DeltaSpace, we propose a novel framework called DeltaEdit, which maps CLIP visual feature differences to the latent-space directions of a generative model during the training phase, and predicts latent-space directions from CLIP textual feature differences during the inference phase. This design endows DeltaEdit with two advantages: (1) text-free training and (2) generalization to various text prompts for zero-shot inference. Extensive experiments validate the effectiveness and versatility of DeltaEdit with different generative models, including both GAN and diffusion models, in achieving flexible text-guided image editing. Code is available at https://github.com/Yueming6568/DeltaEdit. Comment: 17 pages. arXiv admin note: text overlap with arXiv:2303.0628
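
    As a rough illustration of how such a mapping could look, the hypothetical PyTorch sketch below regresses latent-space directions from CLIP feature deltas; the dimensions, MLP architecture, and loss are placeholder assumptions, not the DeltaEdit implementation.

        import torch
        import torch.nn as nn

        class DeltaMapper(nn.Module):
            """Maps a CLIP feature difference to a latent-space direction (sketch only)."""
            def __init__(self, clip_dim=512, latent_dim=512):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(clip_dim, 1024), nn.ReLU(), nn.Linear(1024, latent_dim))

            def forward(self, delta_clip):
                return self.net(delta_clip)

        mapper = DeltaMapper()

        # Text-free training step: regress the latent-code difference of an image pair
        # from the CLIP visual-feature difference (random tensors stand in for real data).
        delta_visual = torch.randn(8, 512)   # clip(img_b) - clip(img_a)
        delta_latent = torch.randn(8, 512)   # latent(img_b) - latent(img_a)
        nn.functional.mse_loss(mapper(delta_visual), delta_latent).backward()

        # Zero-shot inference: feed the CLIP textual-feature difference instead,
        # relying on the visual/textual alignment that defines DeltaSpace.
        delta_text = torch.randn(1, 512)     # clip_text(target) - clip_text(source)
        edit_direction = mapper(delta_text)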

    Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

    Full text link
    Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilities for understanding spatial content, but naively transferring such models to video recognition still suffers from unsatisfactory temporal modeling. Existing methods insert tunable structures into, or in parallel with, the pre-trained model, which either requires back-propagation through the whole pre-trained model and is thus resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of the spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. The disentangled spatial and temporal learning in DiST is highly efficient because it avoids back-propagation through the massive pre-trained parameters. Meanwhile, we empirically show that disentangled learning with an extra integration network benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins. When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST. Code and models can be found at https://github.com/alibaba-mmai-research/DiST. Comment: ICCV 2023. Code: https://github.com/alibaba-mmai-research/DiST
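
    The dual-encoder layout can be summarised with a simplified PyTorch sketch; the module sizes, the GRU temporal encoder, and the fusion layer below are illustrative stand-ins rather than the actual DiST components.

        import torch
        import torch.nn as nn

        class DiSTSketch(nn.Module):
            def __init__(self, spatial_encoder, dim=768, num_classes=400):
                super().__init__()
                self.spatial = spatial_encoder.eval()              # pre-trained, kept frozen
                for p in self.spatial.parameters():
                    p.requires_grad_(False)
                self.temporal = nn.GRU(dim, dim, batch_first=True) # lightweight temporal encoder
                self.integration = nn.Linear(2 * dim, dim)         # fuses the two streams
                self.head = nn.Linear(dim, num_classes)

            def forward(self, frames):                             # frames: (B, T, C, H, W)
                B, T = frames.shape[:2]
                with torch.no_grad():                              # no backprop through the big model
                    spatial_feats = self.spatial(frames.flatten(0, 1)).view(B, T, -1)
                temporal_feats, _ = self.temporal(spatial_feats)
                fused = self.integration(torch.cat([spatial_feats, temporal_feats], dim=-1))
                return self.head(fused.mean(dim=1))

        # Toy usage with a dummy per-frame encoder standing in for the frozen ViT.
        model = DiSTSketch(nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 768)))
        logits = model(torch.randn(2, 8, 3, 32, 32))               # -> shape (2, 400)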

    VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

    Full text link
    A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions. Despite its recent success in image synthesis, applying DPMs to video generation is still challenging due to the high-dimensional data space. Previous methods usually adopt a standard diffusion process in which frames of the same video clip are destroyed with independent noises, ignoring content redundancy and temporal correlation. This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly learned networks to match this noise decomposition. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and supports text-conditioned video creation well. Comment: Accepted to CVPR 2023
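
    The noise decomposition itself is easy to sketch. The snippet below, written for illustration with an assumed mixing weight alpha (the paper's exact parameterisation may differ), draws one base noise per clip and one residual noise per frame so that each frame's noise still has unit variance.

        import torch

        def decomposed_video_noise(batch, frames, channels, height, width, alpha=0.5):
            base = torch.randn(batch, 1, channels, height, width)           # shared by all frames
            residual = torch.randn(batch, frames, channels, height, width)  # varies along time
            return alpha ** 0.5 * base + (1 - alpha) ** 0.5 * residual      # unit variance per frame

        eps = decomposed_video_noise(batch=2, frames=16, channels=3, height=32, width=32)
        print(eps.shape)  # torch.Size([2, 16, 3, 32, 32])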

    RLIPv2: Fast Scaling of Relational Language-Image Pre-training

    Full text link
    Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast-converging model that enables relational pre-training to scale to large pseudo-labelled scene graph datasets. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the pre-training and fine-tuning time. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2. Comment: Accepted to ICCV 2023. Code and models: https://github.com/JacobYuan7/RLIPv2
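
    The pseudo-labelling step can be pictured with the schematic Python below; captioner, relation_tagger, and the 0.5 threshold are hypothetical placeholders standing in for the BLIP captioner and the Relation Tagger, not the released RLIPv2 code.

        def pseudo_label_image(image, boxes, captioner, relation_tagger, threshold=0.5):
            """Turn detection boxes into scene-graph-style pseudo labels (sketch only)."""
            relation_texts = captioner(image)                    # free-form relation phrases
            triplets = []
            for i, subject_box in enumerate(boxes):
                for j, object_box in enumerate(boxes):
                    if i == j:
                        continue
                    for text in relation_texts:
                        score = relation_tagger(image, subject_box, object_box, text)
                        if score >= threshold:                   # keep confident assignments
                            triplets.append((subject_box, text, object_box))
            return triplets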

    Abrupt motion tracking using a visual saliency embedded particle filter

    No full text
    Abrupt motion is a significant challenge that commonly causes traditional tracking methods to fail. This paper presents an improved visual saliency model and integrates it into a particle filter tracker to address this problem. Once the target is lost, our algorithm recovers tracking by detecting the target region among the salient regions obtained from the saliency map of the current frame. In addition, to strengthen the saliency of the target region, the target model is used as prior knowledge to calculate a weight set, which is utilized to construct the improved saliency map adaptively. Furthermore, we adopt the covariance descriptor as the appearance model to describe the object more accurately. Compared with several other tracking algorithms, the experimental results demonstrate that our method is more robust in dealing with various types of abrupt motion. © 2013 Elsevier Ltd. All rights reserved.
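
    The covariance appearance model mentioned above is straightforward to compute. The NumPy sketch below assumes a grayscale patch and a simple five-dimensional feature set (pixel coordinates, intensity, and first derivatives); the paper's exact feature choice may differ.

        import numpy as np

        def covariance_descriptor(region):
            """Describe an image region by the covariance of simple per-pixel features."""
            h, w = region.shape
            ys, xs = np.mgrid[0:h, 0:w]
            dy, dx = np.gradient(region.astype(np.float64))
            features = np.stack([xs.ravel(), ys.ravel(), region.ravel(), dx.ravel(), dy.ravel()])
            return np.cov(features)                    # 5x5 matrix characterises the region

        patch = np.random.rand(24, 24)
        print(covariance_descriptor(patch).shape)      # (5, 5)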

    Controllable synthesis of zeolitic imidazolate frameworks with rod-like or delta-shaped morphologies at oil-water interface

    No full text
    Zeolitic imidazolate framework-8 (ZIF-8) has found extensive application in gas adsorption, membrane separation and catalysis, and it has been demonstrated that morphology is crucial to its physical and chemical properties. In this work, ZIF-8 crystals with rod-like or delta-shaped morphologies are obtained by controlled synthesis at an oil-water interface. The preferential growth of specific crystal faces is controlled at the interface and results in uniform ZIF-8 crystals with various morphologies. Solvent types and feeding concentrations play crucial roles in crystal growth. The method can be extended to the synthesis of other materials, such as ZIF-1 and ZIF-67 crystals, and the interfacial synthesis approach may provide insight for fabricating advanced materials and achieving better material function.

    Distribution Adaptive INT8 Quantization for Training CNNs

    No full text
    Research has demonstrated that low bit-width (e.g., INT8) quantization can be employed to accelerate inference. This makes gradient quantization very promising, since backward propagation requires approximately twice as much computation as the forward pass. Due to the variability and uncertainty of gradient distributions, many methods have been proposed to attain training stability. However, most of them ignore the channel-wise gradient distributions and the impact of gradients of different magnitudes, resulting in degraded final accuracy. In this paper, we propose a novel INT8 quantization training framework for convolutional neural networks to address these issues. Specifically, we adopt Gradient Vectorized Quantization to quantize the gradient, based on the observation that layer-wise gradients contain multiple distributions along the channel dimension. A Magnitude-aware Clipping Strategy is then introduced to take the magnitudes of gradients into consideration when minimizing the quantization error, and we present a theoretical derivation to solve for the quantization parameters of the different distributions. Experimental results on a broad range of computer vision tasks, such as image classification, object detection and video classification, demonstrate that the proposed Distribution Adaptive INT8 Quantization training method achieves almost lossless training accuracy for different backbones, including ResNet, MobileNetV2, InceptionV3, VGG and AlexNet, and is superior to state-of-the-art techniques. Moreover, we implement an INT8 kernel that accelerates a training iteration by more than 200% on the latest Turing architecture; that is, our method excels in both training accuracy and speed.
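
    The channel-wise idea can be sketched in a few lines of PyTorch. The example below applies a per-output-channel scale derived from a simple max-abs clip; the paper instead solves for the clipping value with its magnitude-aware criterion, so this is only an assumed simplification.

        import torch

        def quantize_grad_int8_per_channel(grad):                      # grad: (C_out, ...) tensor
            flat = grad.flatten(1)
            clip = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
            scale = clip / 127.0                                       # one scale per channel
            q = torch.clamp(torch.round(flat / scale), -127, 127).to(torch.int8)
            dequant = q.float() * scale                                # values the update would use
            return q.view_as(grad), dequant.view_as(grad)

        g = torch.randn(64, 3, 3, 3)                                   # e.g. a conv weight gradient
        q, dq = quantize_grad_int8_per_channel(g)
        print(q.dtype, (g - dq).abs().max())                           # int8 plus a small error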

    Application of Association Rules Analysis in Mining Adverse Drug Reaction Signals

    No full text
    Adverse drug reactions (ADRs) are an increasingly serious public health problem. Spontaneous reporting systems (SRSs) are an important way for many countries to monitor ADRs arising from the clinical use of drugs, and they are the main data source for ADR signal detection. Traditional signal detection methods are based on disproportionality analysis (DPA) and make little use of data mining techniques. In this paper, we selected the spontaneous reports from 2011 to 2018 in Jiangsu Province of China as the research data and used association rules analysis (ARA) to mine signals. We defined the key metrics of the ARA, such as Confidence and Lift, from the two-dimensional contingency table of ADRs, and constructed performance indicators such as Precision, Recall, and F1 as objective evaluation standards. We then used data-driven experiments to determine the optimal thresholds of the corresponding metrics, which in the best case are Confidence = 0.007 and Lift = 1, and obtained the average performance of the method through 10-fold cross-validation. The experimental results showed that F1 increased from 31.43% with the MHRA method to 40.38% with the ARA method, a significant improvement. To reduce drug risk and support decision making for drug safety, more data mining methods need to be introduced and applied to ADR signal detection.
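
    For reference, the metrics named above follow directly from the standard 2x2 drug/reaction contingency table; the short Python below uses toy counts (a = reports with both drug and reaction, b = drug only, c = reaction only, d = neither), not data from the study.

        def confidence_and_lift(a, b, c, d):
            n = a + b + c + d
            confidence = a / (a + b)                 # P(reaction | drug)
            lift = confidence / ((a + c) / n)        # confidence relative to P(reaction)
            return confidence, lift

        def precision_recall_f1(tp, fp, fn):
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            f1 = 2 * precision * recall / (precision + recall)
            return precision, recall, f1

        print(confidence_and_lift(a=12, b=988, c=240, d=48760))        # toy counts only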