10 research outputs found
Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition
A major emerging challenge is how to protect people's privacy as cameras and
computer vision are increasingly integrated into our daily lives, including in
smart devices inside homes. A potential solution is to capture and record just
the minimum amount of information needed to perform a task of interest. In this
paper, we propose a fully-coupled two-stream spatiotemporal architecture for
reliable human action recognition on extremely low resolution (e.g., 12x16
pixel) videos. We provide an efficient method to extract spatial and temporal
features and to aggregate them into a robust feature representation for an
entire action video sequence. We also consider how to incorporate high
resolution videos during training in order to build better low resolution
action recognition models. We evaluate on two publicly-available datasets,
showing significant improvements over the state-of-the-art. Comment: 9 pages, 5 figures, published in WACV 2018
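To make the extremely-low-resolution setting above concrete, a frame can be average-pooled down to 12x16 pixels before any recognition model sees it. This is only an illustrative sketch of the input degradation, not the paper's two-stream architecture; the helper name `downsample_frame` and the 120x160 input size are hypothetical:

```python
import numpy as np

def downsample_frame(frame: np.ndarray, out_h: int = 12, out_w: int = 16) -> np.ndarray:
    """Average-pool a grayscale frame down to an extremely low resolution
    (e.g., 12x16 pixels), discarding identity-revealing detail."""
    h, w = frame.shape
    bh, bw = h // out_h, w // out_w
    # Crop so the frame tiles evenly into (bh x bw) blocks, then average each block.
    cropped = frame[: bh * out_h, : bw * out_w]
    return cropped.reshape(out_h, bh, out_w, bw).mean(axis=(1, 3))

# A hypothetical 120x160 grayscale frame becomes 12x16.
frame = np.random.rand(120, 160)
low_res = downsample_frame(frame)
```

Because averaging is a low-pass operation, the per-frame mean intensity is preserved while fine spatial detail (faces, text) is destroyed, which is exactly the privacy/utility trade-off the paper studies.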
Scale Invariant Privacy Preserving Video via Wavelet Decomposition
Video surveillance has become ubiquitous in the modern world. Mobile devices,
surveillance cameras, and IoT devices, all can record video that can violate
our privacy. One proposed solution for this is privacy-preserving video, which
removes identifying information from the video as it is produced. Several
algorithms for this have been proposed, but all of them suffer from scale
issues: in order to sufficiently anonymize near-camera objects, distant objects
become unidentifiable. In this paper, we propose a scale-invariant method based on wavelet decomposition.
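As a rough sketch of the general wavelet idea (not the paper's scale-invariant algorithm), anonymization can keep only the low-frequency Haar approximation at a chosen depth and zero all detail bands before reconstruction. The function name `haar_anonymize` and the level count are hypothetical:

```python
import numpy as np

def haar_anonymize(img: np.ndarray, levels: int = 3) -> np.ndarray:
    """Keep only the level-`levels` Haar approximation band and reconstruct
    with every detail band zeroed, yielding a coarse, identity-obscuring image."""
    approx = img.astype(float)
    for _ in range(levels):
        h, w = approx.shape
        # One Haar analysis step: the approximation band is the 2x2 block average.
        approx = approx[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    # Synthesis with zeroed details reduces to nearest-neighbour upsampling.
    return np.kron(approx, np.ones((2 ** levels, 2 ** levels)))

img = np.random.rand(64, 64)
anon = haar_anonymize(img, levels=3)
```

The scale issue the paper targets is visible here: a fixed `levels` blurs near-camera (large) objects less, relative to their size, than distant (small) ones, motivating a scale-adaptive choice of decomposition depth.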
Adversarial Learning of Privacy-Preserving and Task-Oriented Representations
Data privacy has emerged as an important issue as data-driven deep learning
has been an essential component of modern machine learning systems. For
instance, there could be a potential privacy risk of machine learning systems
via the model inversion attack, whose goal is to reconstruct the input data
from the latent representation of deep networks. Our work aims at learning a
privacy-preserving and task-oriented representation to defend against such
model inversion attacks. Specifically, we propose an adversarial reconstruction
learning framework that prevents the latent representations from being decoded into
the original input data. By simulating the expected behavior of the adversary, our
framework is realized by minimizing the negative pixel reconstruction loss or
the negative feature reconstruction (i.e., perceptual distance) loss. We
validate the proposed method on face attribute prediction, showing that our
method allows protecting visual privacy with a small decrease in utility
performance. In addition, we show the utility-privacy trade-off under different
choices of the hyperparameter for the negative perceptual distance loss during
training, allowing service providers to determine the right level of privacy
protection at a given utility performance. Moreover, we provide an extensive study
with different selections of features, tasks, and data to further analyze
their influence on privacy protection.
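A minimal sketch of the combined objective implied by this abstract: the task loss is minimized while the simulated adversary's reconstruction error is maximized, i.e. its negative is added. The scalar interface and the weight `lam` are assumptions for illustration; the actual framework trains an encoder and an adversarial decoder jointly:

```python
import numpy as np

def adversarial_objective(x: np.ndarray, x_rec: np.ndarray,
                          task_loss: float, lam: float = 0.5):
    """Return (combined objective, adversary's reconstruction loss).
    Minimizing the combined objective lowers task loss while PUSHING UP the
    adversary's pixel reconstruction error (negative reconstruction loss)."""
    rec_loss = float(np.mean((x - x_rec) ** 2))  # adversary's pixel MSE
    return task_loss - lam * rec_loss, rec_loss

# Toy usage: a perfect-inversion adversary (rec_loss high is what we want).
x = np.zeros((2, 2))
x_rec = np.ones((2, 2))
obj, rec = adversarial_objective(x, x_rec, task_loss=1.0, lam=0.5)
```

Sweeping `lam` traces out the utility-privacy trade-off curve the abstract describes: larger values privilege privacy (harder inversion) at some cost in task accuracy.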
Learning Human Action Recognition Representations Without Real Humans
Pre-training on massive video datasets has become essential to achieve high
action recognition performance on smaller downstream datasets. However, most
large-scale video datasets contain images of people and hence are accompanied
with issues related to privacy, ethics, and data protection, often preventing
them from being publicly shared for reproducible research. Existing work has
attempted to alleviate these problems by blurring faces, downsampling videos,
or training on synthetic data. On the other hand, analysis on the
transferability of privacy-preserving pre-trained models to downstream tasks
has been limited. In this work, we study this problem by first asking the
question: can we pre-train models for human action recognition with data that
does not include real humans? To this end, we present, for the first time, a
benchmark that leverages real-world videos with humans removed and synthetic
data containing virtual humans to pre-train a model. We then evaluate the
transferability of the representation learned on this data to a diverse set of
downstream action recognition benchmarks. Furthermore, we propose a novel
pre-training strategy, called Privacy-Preserving MAE-Align, to effectively
combine synthetic data and human-removed real data. Our approach outperforms
previous baselines by up to 5% and closes the performance gap between human and
no-human action recognition representations on downstream tasks, for both
linear probing and fine-tuning. Our benchmark, code, and models are available
at https://github.com/howardzh01/PPMA. Comment: 19 pages, 7 figures, 2023 NeurIPS Datasets and Benchmarks Track
Towards Generalizable Deep Image Matting: Decomposition, Interaction, and Merging
Image matting refers to extracting the precise alpha mattes from images, playing a critical role in many downstream applications. Despite extensive attention, key challenges persist and motivate the research presented in this thesis.
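The alpha mattes referred to above come from the standard compositing equation I = αF + (1 − α)B; matting is the (ill-posed) inversion of this forward model. The sketch below, with a hypothetical `composite` helper, shows the forward direction only:

```python
import numpy as np

def composite(fg: np.ndarray, bg: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Compositing equation I = alpha * F + (1 - alpha) * B, applied per pixel.
    fg, bg: HxWx3 images; alpha: HxW matte in [0, 1]. Recovering alpha (plus
    F and B) from I alone is the image matting problem."""
    a = alpha[..., None]  # broadcast the matte over the color channels
    return a * fg + (1.0 - a) * bg

# Toy usage: a uniform quarter-opacity white foreground over black.
fg = np.ones((2, 2, 3))
bg = np.zeros((2, 2, 3))
alpha = np.full((2, 2), 0.25)
img = composite(fg, bg, alpha)
```

This forward model is also how composite training data, discussed in the next paragraph, is generated: pairing ground-truth (F, α) with varied backgrounds B.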
One major challenge is the reliance on auxiliary inputs in previous methods, which hinders real-time practicality. To address this, we introduce fully automatic image matting by decomposing the task into high-level semantic segmentation and low-level detail matting. We then incorporate plug-in modules to enhance the interaction between the sub-tasks through feature integration. Furthermore, we propose an attention-based mechanism to guide the matting process through collaborative merging.
Another challenge lies in the limited matting datasets, resulting in reliance on composite images and inferior performance on images in the wild. In response, our research proposes a composition route to mitigate these discrepancies and achieve remarkable generalization ability. Additionally, we construct several large datasets of high-quality real-world images with manually labeled alpha mattes, providing a solid foundation for training and evaluation.
Moreover, our research uncovers new observations that warrant further investigation. Firstly, we systematically analyze and address privacy issues that have been neglected in previous portrait matting research. Secondly, we explore the adaptation of automatic matting methods to non-salient or transparent categories beyond salient ones. Furthermore, we incorporate the language modality to achieve a more controllable matting process, enabling specific target selection at a low cost. To validate our studies, we conduct extensive experiments and provide all code and datasets through the link (https://github.com/JizhiziLi/).
We believe that the analyses, methods, and datasets presented in this thesis will offer valuable insights for future research endeavors in the field of image matting.