On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms
Artificial Intelligence (AI) has made its way into various scientific fields,
providing astonishing improvements over existing algorithms for a wide variety
of tasks. In recent years, there have been severe concerns over the
trustworthiness of AI technologies. The scientific community has focused on the
development of trustworthy AI algorithms. However, machine and deep learning
algorithms, popular in the AI community today, depend heavily on the data used
during their development. These learning algorithms identify patterns in the
data and thereby learn the behavioral objective. Any flaws in the data can
therefore carry over directly into the algorithms. In this study, we discuss the
importance of Responsible Machine Learning Datasets and propose a framework to
evaluate the datasets through a responsible rubric. While existing work focuses
on the post-hoc evaluation of algorithms for their trustworthiness, we provide
a framework that considers the data component separately to understand its role
in the algorithm. We discuss responsible datasets through the lens of fairness,
privacy, and regulatory compliance and provide recommendations for constructing
future datasets. After surveying over 100 datasets, we use 60 datasets for
analysis and demonstrate that none of these datasets is immune to issues of
fairness, privacy preservation, and regulatory compliance. We provide
modifications to the "datasheets for datasets" with important additions for
improved dataset documentation. With governments around the world formalizing
data protection laws, the method for the creation of datasets in the scientific
community requires revision. We believe this study is timely and relevant in
today's era of AI.