FLIP: Cross-domain Face Anti-spoofing with Language Guidance
Face anti-spoofing (FAS) or presentation attack detection is an essential
component of face recognition systems deployed in security-critical
applications. Existing FAS methods have poor generalizability to unseen spoof
types, camera sensors, and environmental conditions. Recently, vision
transformer (ViT) models have been shown to be effective for the FAS task due
to their ability to capture long-range dependencies among image patches.
However, adaptive modules or auxiliary loss functions are often required to
adapt pre-trained ViT weights learned on large-scale datasets such as ImageNet.
In this work, we first show that initializing ViTs with multimodal (e.g., CLIP)
pre-trained weights improves generalizability for the FAS task, which is in
line with the zero-shot transfer capabilities of vision-language pre-trained
(VLP) models. We then propose a novel approach for robust cross-domain FAS by
grounding visual representations with the help of natural language.
Specifically, we show that aligning the image representation with an ensemble
of class descriptions (based on natural language semantics) improves FAS
generalizability in low-data regimes. Finally, we propose a multimodal
contrastive learning strategy to boost feature generalization further and
bridge the gap between source and target domains. Extensive experiments on
three standard protocols demonstrate that our method significantly outperforms
the state-of-the-art methods, achieving better zero-shot transfer performance
than five-shot transfer of adaptive ViTs. Code:
https://github.com/koushiksrivats/FLIP
Comment: Accepted to ICCV-2023. Project Page:
https://koushiksrivats.github.io/FLIP
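The prompt-ensembling idea behind the language guidance can be sketched in a few lines: score an image embedding against several natural-language descriptions per class and average the similarities. The toy 3-d embeddings, prompt sets, and class names below are illustrative stand-ins, not the FLIP implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ensemble_class_scores(img_emb, class_prompt_embs):
    """Score an image against an ensemble of text embeddings per class.

    class_prompt_embs: {class_name: [embedding, ...]} -- each class is
    described by several natural-language prompts; their similarities to
    the image embedding are averaged (prompt ensembling).
    """
    return {cls: sum(cosine(img_emb, e) for e in embs) / len(embs)
            for cls, embs in class_prompt_embs.items()}

# Toy 3-d embeddings standing in for CLIP image/text features.
img = [0.9, 0.1, 0.0]
prompts = {
    "real": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "spoof": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]],
}
scores = ensemble_class_scores(img, prompts)
pred = max(scores, key=scores.get)  # highest ensemble similarity wins
```

In a real pipeline the embeddings would come from a CLIP image encoder and text encoder, and the scores would feed a softmax over classes.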
On the Importance of Image Encoding in Automated Chest X-Ray Report Generation
Chest X-ray is one of the most popular medical imaging modalities due to its
accessibility and effectiveness. However, there is a chronic shortage of
well-trained radiologists who can interpret these images and diagnose the
patient's condition. Therefore, automated radiology report generation can be a
very helpful tool in clinical practice. A typical report generation workflow
consists of two main steps: (i) encoding the image into a latent space and (ii)
generating the text of the report based on the latent image embedding. Many
existing report generation techniques use a standard convolutional neural
network (CNN) architecture for image encoding followed by a Transformer-based
decoder for medical text generation. In most cases, CNN and the decoder are
trained jointly in an end-to-end fashion. In this work, we primarily focus on
understanding the relative importance of encoder and decoder components.
Towards this end, we analyze four different image encoding approaches: direct,
fine-grained, CLIP-based, and Cluster-CLIP-based encodings in conjunction with
three different decoders on the large-scale MIMIC-CXR dataset. Among these
encoders, the Cluster-CLIP visual encoder is a novel approach that aims to
generate more discriminative and explainable representations. CLIP-based
encoders produce comparable results to traditional CNN-based encoders in terms
of NLP metrics, while fine-grained encoding outperforms all other encoders both
in terms of NLP and clinical accuracy metrics, thereby validating the
importance of the image encoder in effectively extracting semantic
information. GitHub repository: https://github.com/mudabek/encoding-cxr-report-ge
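The Cluster-CLIP idea, compressing many patch-level CLIP embeddings into a few cluster centroids before decoding, can be illustrated with a tiny k-means. The 2-d "patch embeddings", the deterministic initialisation, and the choice of plain k-means are assumptions for illustration, not the paper's exact encoder.

```python
def kmeans(points, k, iters=10):
    """Tiny k-means; initialises centroids from the first k points."""
    centroids = [p[:] for p in points[:k]]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            buckets[j].append(p)
        for j, b in enumerate(buckets):
            if b:  # keep old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(b) for col in zip(*b)]
    return centroids

def cluster_clip_encoding(patch_embs, k=2):
    """Compress per-patch CLIP embeddings into k centroid vectors."""
    return kmeans(patch_embs, k)

# Four toy "patch embeddings" forming two well-separated groups.
patches = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0]]
codes = cluster_clip_encoding(patches, k=2)
```

The decoder would then attend over the k centroid vectors instead of all patch embeddings, which is where the compactness and explainability claims come from.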
Evading Forensic Classifiers with Attribute-Conditioned Adversarial Faces
The ability of generative models to produce highly realistic synthetic face
images has raised security and ethical concerns. As a first line of defense
against such fake faces, deep learning based forensic classifiers have been
developed. While these forensic models can detect whether a face image is
synthetic or real with high accuracy, they are also vulnerable to adversarial
attacks. Although such attacks can be highly successful in evading detection by
forensic classifiers, they introduce visible noise patterns that are detectable
through careful human scrutiny. Additionally, these attacks assume access to
the target model(s) which may not always be true. Attempts have been made to
directly perturb the latent space of GANs to produce adversarial fake faces
that can circumvent forensic classifiers. In this work, we go one step further
and show that it is possible to successfully generate adversarial fake faces
with a specified set of attributes (e.g., hair color, eye size, race, gender,
etc.). To achieve this goal, we leverage the state-of-the-art generative model
StyleGAN with disentangled representations, which enables a range of
modifications without leaving the manifold of natural images. We propose a
framework to search for adversarial latent codes within the feature space of
StyleGAN, where the search can be guided either by a text prompt or a reference
image. We also propose a meta-learning based optimization strategy to achieve
transferable performance on unknown target models. Extensive experiments
demonstrate that the proposed approach can produce semantically manipulated
adversarial fake faces, which are true to the specified attribute set and can
successfully fool forensic face classifiers, while remaining undetectable by
humans. Code: https://github.com/koushiksrivats/face_attribute_attack.Comment: Accepted in CVPR 2023. Project page:
https://koushiksrivats.github.io/face_attribute_attack
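One way to read the latent-space search is as black-box optimisation over generator codes: minimise the forensic classifier's "fake" score plus an attribute-consistency penalty (text- or image-guided). The sketch below uses a plain random search and hypothetical `generate` / `forensic_score` / `attr_loss` hooks; the paper's guided, meta-learning-based optimisation is considerably more sophisticated.

```python
import random

def adversarial_latent_search(z0, generate, forensic_score, attr_loss,
                              steps=300, sigma=0.05, lam=1.0, seed=0):
    """Random search for an adversarial latent code near z0.

    generate(z)        -> image (here a stand-in identity function)
    forensic_score(im) -> P(fake), to be minimised
    attr_loss(im)      -> attribute mismatch penalty, to be minimised
    All three are assumed hooks, not part of any released code.
    """
    rng = random.Random(seed)
    best_z = z0[:]
    img = generate(best_z)
    best = forensic_score(img) + lam * attr_loss(img)
    for _ in range(steps):
        cand = [z + rng.gauss(0.0, sigma) for z in best_z]
        img = generate(cand)
        loss = forensic_score(img) + lam * attr_loss(img)
        if loss < best:  # keep only improving perturbations
            best, best_z = loss, cand
    return best_z

# Toy demo: the "generator" is identity; the forensic score prefers
# z[0] near 0 and the attribute loss prefers z[1] near 1.
z0 = [1.0, 0.0]
gen = lambda z: z
fscore = lambda im: im[0] ** 2
aloss = lambda im: (im[1] - 1.0) ** 2
z_adv = adversarial_latent_search(z0, gen, fscore, aloss)
```

Staying close to the original latent code is what keeps the adversarial face on the manifold of natural images, in the spirit of the disentangled StyleGAN space the abstract describes.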
CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search
The success of deep learning based face recognition systems has given rise to
serious privacy concerns due to their ability to enable unauthorized tracking
of users in the digital world. Existing methods for enhancing privacy fail to
generate naturalistic images that can protect facial privacy without
compromising user experience. We propose a novel two-step approach for facial
privacy protection that relies on finding adversarial latent codes in the
low-dimensional manifold of a pretrained generative model. The first step
inverts the given face image into the latent space and finetunes the generative
model to achieve an accurate reconstruction of the given image from its latent
code. This step produces a good initialization, aiding the generation of
high-quality faces that resemble the given identity. Subsequently, user-defined
makeup text prompts and identity-preserving regularization are used to guide
the search for adversarial codes in the latent space. Extensive experiments
demonstrate that faces generated by our approach have stronger black-box
transferability with an absolute gain of 12.06% over the state-of-the-art
facial privacy protection approach under the face verification task. Finally,
we demonstrate the effectiveness of the proposed approach for commercial face
recognition systems. Our code is available at
https://github.com/fahadshamshad/Clip2Protect
Comment: Accepted in CVPR 2023. Project page:
https://fahadshamshad.github.io/Clip2Protect
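The second-stage search can be thought of as minimising a combined objective: follow the makeup text prompt, stay near the inverted latent code (identity-preserving regularization), and steer a surrogate face-recognition model toward a target embedding for black-box transfer. The hooks, signs, and weightings below are assumptions for illustration, not the released CLIP2Protect loss.

```python
def _cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def protection_loss(img, z, z_init, clip_text_sim, fr_embed, target_emb,
                    lam_id=10.0, lam_adv=1.0):
    """Hypothetical objective for the adversarial latent-search stage.

    clip_text_sim(img) : similarity of the image to the makeup text prompt
    fr_embed(img)      : embedding from a surrogate face-recognition model
    target_emb         : embedding the protected face should match
    All hooks and weights are illustrative assumptions.
    """
    id_reg = sum((a - b) ** 2 for a, b in zip(z, z_init))  # stay near init
    adv = -_cos(fr_embed(img), target_emb)  # pull FR toward the target
    return -clip_text_sim(img) + lam_id * id_reg + lam_adv * adv

# Toy demo with constant stand-in models.
z_init = [0.0, 0.0]
sim = lambda im: 0.5
emb = lambda im: [1.0, 0.0]
tgt = [1.0, 0.0]
l_near = protection_loss(None, [0.0, 0.0], z_init, sim, emb, tgt)
l_far = protection_loss(None, [1.0, 1.0], z_init, sim, emb, tgt)
```

The identity term dominating (`lam_id=10.0` here) reflects the abstract's emphasis on protecting privacy without degrading the user-facing image quality.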
Suppressing Poisoning Attacks on Federated Learning for Medical Imaging
Collaboration among multiple data-owning entities (e.g., hospitals) can
accelerate the training process and yield better machine learning models due to
the availability and diversity of data. However, privacy concerns make it
challenging to exchange data while preserving confidentiality. Federated
Learning (FL) is a promising solution that enables collaborative training
through exchange of model parameters instead of raw data. However, most
existing FL solutions work under the assumption that participating clients are
honest and thus can fail against poisoning attacks from malicious
parties, whose goal is to deteriorate the global model performance. In this
work, we propose a robust aggregation rule called Distance-based Outlier
Suppression (DOS) that is resilient to Byzantine failures. The proposed method
computes the distance between local parameter updates of different clients and
obtains an outlier score for each client using Copula-based Outlier Detection
(COPOD). The resulting outlier scores are converted into normalized weights
using a softmax function, and a weighted average of the local parameters is
used for updating the global model. DOS aggregation can effectively suppress
parameter updates from malicious clients without the need for any
hyperparameter selection, even when the data distributions are heterogeneous.
Evaluation on two medical imaging datasets (CheXpert and HAM10000) demonstrates
the higher robustness of the DOS method against a variety of poisoning attacks
in comparison to other state-of-the-art methods. The code can be found at
https://github.com/Naiftt/SPAFD
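The aggregation rule described above can be sketched almost directly: pairwise distances between client updates, an outlier score per client, softmax over the negative scores, weighted average. The one liberty taken below is replacing COPOD with a median-pairwise-distance score so the sketch needs no external library; the paper's actual scorer is COPOD.

```python
import math

def dos_aggregate(updates):
    """Distance-based Outlier Suppression (dependency-free sketch).

    updates: list of flattened parameter vectors, one per client.
    Median pairwise distance stands in for the COPOD outlier score;
    softmax over the *negative* scores down-weights outlying clients.
    """
    n = len(updates)

    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    scores = []
    for i in range(n):
        d = sorted(dist(updates[i], updates[j]) for j in range(n) if j != i)
        scores.append(d[len(d) // 2])  # median distance = outlier score
    w = [math.exp(-s) for s in scores]
    z = sum(w)
    w = [x / z for x in w]  # softmax(-scores), sums to 1
    dim = len(updates[0])
    return [sum(w[i] * updates[i][k] for i in range(n)) for k in range(dim)]

# Three honest clients near [1, 1] and one poisoned client far away.
clients = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [10.0, -10.0]]
agg = dos_aggregate(clients)
```

Because the poisoned update is far from every other client, its score is large and its softmax weight collapses toward zero, so the aggregate stays near the honest consensus without any hyperparameter tuning, which is the property the abstract highlights.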
A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework for Unsupervised Video Anomaly Detection
Detection of anomalous events in videos is an important problem in
applications such as surveillance. Video anomaly detection (VAD) is
well-studied in the one-class classification (OCC) and weakly supervised (WS)
settings. However, fully unsupervised (US) video anomaly detection methods,
which learn a complete system without any annotation or human supervision, have
not been explored in depth. This is because the lack of any ground truth
annotations significantly increases the magnitude of the VAD challenge. To
address this challenge, we propose a simple-but-effective two-stage
pseudo-label generation framework that produces segment-level (normal/anomaly)
pseudo-labels, which can be further used to train a segment-level anomaly
detector in a supervised manner. The proposed coarse-to-fine pseudo-label
(C2FPL) generator employs carefully-designed hierarchical divisive clustering
and statistical hypothesis testing to identify anomalous video segments from a
set of completely unlabeled videos. The trained anomaly detector can be
directly applied on segments of an unseen test video to obtain segment-level,
and subsequently, frame-level anomaly predictions. Extensive studies on two
large-scale public-domain datasets, UCF-Crime and XD-Violence, demonstrate that
the proposed unsupervised approach achieves superior performance compared to
all existing OCC and US methods, while yielding comparable performance to the
state-of-the-art WS methods.
Comment: Accepted in IEEE/CVF Winter Conference on Applications of Computer
Vision (WACV), 202
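A much-simplified version of the two-stage pseudo-labelling might look as follows. The coarse median split stands in for the paper's hierarchical divisive clustering, and the quantile cutoff stands in for its statistical hypothesis test; both substitutions, along with the toy features, are assumptions for illustration.

```python
import statistics

def coarse_to_fine_pseudolabels(video_feats, segment_scores, alpha=0.95):
    """Two-stage segment-level pseudo-label sketch (not the exact C2FPL).

    Coarse: videos whose mean feature magnitude exceeds the median are
    treated as anomalous candidates. Fine: within candidate videos, a
    segment is labelled anomalous if its score exceeds the alpha-quantile
    of segment scores from the 'normal' videos.
    """
    means = {v: statistics.fmean(f) for v, f in video_feats.items()}
    thr = statistics.median(means.values())
    candidates = {v for v, m in means.items() if m > thr}
    normal_scores = sorted(s for v in video_feats if v not in candidates
                           for s in segment_scores[v])
    cut = normal_scores[int(alpha * (len(normal_scores) - 1))]
    return {v: [1 if (v in candidates and s > cut) else 0 for s in scores]
            for v, scores in segment_scores.items()}

# Toy data: video "a" looks normal, video "b" has an anomalous segment.
feats = {"a": [0.1, 0.2], "b": [0.9, 1.0]}
scores = {"a": [0.1, 0.2, 0.1], "b": [0.1, 0.9, 0.2]}
labels = coarse_to_fine_pseudolabels(feats, scores)
```

The resulting segment-level 0/1 labels are exactly the kind of supervision signal the abstract says is then used to train the anomaly detector in a supervised manner.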
FedSIS: Federated Split Learning with Intermediate Representation Sampling for Privacy-preserving Generalized Face Presentation Attack Detection
Lack of generalization to unseen domains/attacks is the Achilles heel of most
face presentation attack detection (FacePAD) algorithms. Existing attempts to
enhance the generalizability of FacePAD solutions assume that data from
multiple source domains are available with a single entity to enable
centralized training. In practice, data from different source domains may be
collected by diverse entities, who are often unable to share their data due to
legal and privacy constraints. While collaborative learning paradigms such as
federated learning (FL) can overcome this problem, standard FL methods are
ill-suited for domain generalization because they struggle to surmount the twin
challenges of handling non-iid client data distributions during training and
generalizing to unseen domains during inference. In this work, a novel
framework called Federated Split learning with Intermediate representation
Sampling (FedSIS) is introduced for privacy-preserving domain generalization.
In FedSIS, a hybrid Vision Transformer (ViT) architecture is learned using a
combination of FL and split learning to achieve robustness against statistical
heterogeneity in the client data distributions without any sharing of raw data
(thereby preserving privacy). To further improve generalization to unseen
domains, a novel feature augmentation strategy called intermediate
representation sampling is employed, and discriminative information from
intermediate blocks of a ViT is distilled using a shared adapter network. The
FedSIS approach has been evaluated on two well-known benchmarks for
cross-domain FacePAD to demonstrate that it is possible to achieve
state-of-the-art generalization performance without data sharing. Code:
https://github.com/Naiftt/FedSIS
Comment: Accepted to the IEEE International Joint Conference on Biometrics
(IJCB), 202
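One way to read the intermediate representation sampling idea is: run the shared ViT blocks up to a randomly sampled depth and hand those intermediate features to the shared adapter head, so the adapter sees features from many depths during training. The function below is a toy sketch of that reading; the block internals, shapes, and the precise split between client and server are assumptions, with the real details in the released code.

```python
import random

def fedsis_forward(x, blocks, adapter, start, end, rng):
    """Forward pass with a randomly sampled intermediate depth.

    blocks  : shared ViT blocks (here stand-in callables)
    adapter : shared adapter head distilling intermediate features
    start/end : inclusive range of depths to sample from
    """
    depth = rng.randint(start, end)  # sampled per step, not fixed
    h = x
    for blk in blocks[:depth]:
        h = blk(h)
    return adapter(h), depth

# Toy demo: each "block" increments the feature, the "adapter" sums it,
# so the output equals the sampled depth.
blocks = [lambda h: [v + 1.0 for v in h] for _ in range(4)]
out, depth = fedsis_forward([0.0], blocks, sum, 2, 4, random.Random(0))
```

Averaging the adapter's predictions over several sampled depths at inference time would be the natural counterpart of this training-time feature augmentation.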