DeePhy: On Deepfake Phylogeny
Deepfakes are tailored, synthetically generated videos that are now prevalent
and spreading at scale, threatening the trustworthiness of the information
available online. While existing datasets contain different kinds of deepfakes
that vary in their generation technique, they do not consider the progression
of deepfakes in a "phylogenetic" manner: a face in an existing deepfake may
itself be swapped with another face. This face swapping can be performed
multiple times, and the resulting deepfake can evolve to confuse deepfake
detection algorithms. Further, many databases
do not provide the employed generative model as target labels. Model
attribution helps in enhancing the explainability of the detection results by
providing information on the generative model employed. In order to enable the
research community to address these questions, this paper proposes DeePhy, a
novel Deepfake Phylogeny dataset which consists of 5040 deepfake videos
generated using three different generation techniques. There are 840 videos of
once-swapped deepfakes, 2520 videos of twice-swapped deepfakes, and 1680
videos of thrice-swapped deepfakes. At over 30 GB in size, the database was
prepared over more than 1100 hours using 18 GPUs with 1,352 GB of cumulative
memory. We also present a benchmark on the DeePhy dataset using six deepfake
detection algorithms. The results highlight the need to advance research on
model attribution of deepfakes and to generalize it across a variety of
deepfake generation techniques. The database is available at:
http://iab-rubric.org/deephy-database
Comment: Accepted at the 2022 International Joint Conference on Biometrics (IJCB 2022)
Are Face Detection Models Biased?
The presence of bias in deep models leads to unfair outcomes for certain
demographic subgroups. Research on bias has focused primarily on facial
recognition and attribute prediction, with little emphasis on face detection.
Existing studies treat face detection as binary classification into 'face' and
'non-face' classes. In this work, we investigate possible bias in the domain of
face detection through facial region localization, which is currently
unexplored. Since facial region localization is an essential task in all face
recognition pipelines, it is imperative to analyze the presence of such bias in
popular deep models. Most existing face detection datasets lack suitable
annotation for such analysis. Therefore, we web-curate the Fair Face
Localization with Attributes (F2LA) dataset and manually annotate more than 10
attributes per face, including facial localization information. Utilizing the
extensive annotations from F2LA, an experimental setup is designed to study the
performance of four pre-trained face detectors. We observe (i) a high disparity
in detection accuracy across gender and skin tone, and (ii) an interplay of
confounding factors beyond demographics. The F2LA data and associated
annotations can be accessed at http://iab-rubric.org/index.php/F2LA.
Comment: Accepted in FG 202
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms
Artificial Intelligence (AI) has made its way into various scientific fields,
providing astonishing improvements over existing algorithms for a wide variety
of tasks. In recent years, there have been serious concerns over the
trustworthiness of AI technologies, and the scientific community has focused on
the development of trustworthy AI algorithms. However, the machine and deep
learning algorithms popular in the AI community today depend heavily on the
data used during their development: these algorithms identify patterns in the
data and learn the behavioral objective, so any flaws in the data can
translate directly into the resulting algorithms. In this study, we discuss the
importance of Responsible Machine Learning Datasets and propose a framework to
evaluate the datasets through a responsible rubric. While existing work focuses
on the post-hoc evaluation of algorithms for their trustworthiness, we provide
a framework that considers the data component separately to understand its role
in the algorithm. We discuss responsible datasets through the lens of fairness,
privacy, and regulatory compliance and provide recommendations for constructing
future datasets. After surveying over 100 datasets, we use 60 datasets for
analysis and demonstrate that none of these datasets is immune to issues of
fairness, privacy preservation, and regulatory compliance. We provide
modifications to the "datasheets for datasets" with important additions for
improved dataset documentation. With governments around the world enacting
data protection laws, the way datasets are created in the scientific
community requires revision. We believe this study is timely and relevant in
today's era of AI.