As machine learning increasingly affects people and society, it is important
that we strive for a comprehensive and unified understanding of potential
sources of unwanted consequences. For instance, downstream harms to particular
groups are often blamed on "biased data," but this concept encompasses too many
issues to be useful in developing solutions. In this paper, we provide a
framework that partitions sources of downstream harm in machine learning into
six distinct categories spanning the data generation and machine learning
pipeline. We describe how these issues arise, how they are relevant to
particular applications, and how they motivate different solutions. In doing
so, we aim to facilitate the development of solutions that stem from an
understanding of application-specific populations and data generation
processes, rather than relying on general statements about what may or may not
be "fair."