    The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning

    The nascent field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last several years, three formal definitions of fairness have gained prominence: (1) anti-classification, meaning that protected attributes (like race, gender, and their proxies) are not explicitly used to make decisions; (2) classification parity, meaning that common measures of predictive performance (e.g., false positive and false negative rates) are equal across groups defined by the protected attributes; and (3) calibration, meaning that, conditional on risk estimates, outcomes are independent of protected attributes. Here we show that all three of these fairness definitions suffer from significant statistical limitations. Requiring anti-classification or classification parity can, perversely, harm the very groups they were designed to protect; and calibration, though generally desirable, provides little guarantee that decisions are equitable. In contrast to these formal fairness criteria, we argue that it is often preferable to treat similarly risky people similarly, based on the most statistically accurate estimates of risk that one can produce. Such a strategy, while not universally applicable, often aligns well with policy objectives; notably, it will typically violate both anti-classification and classification parity. In practice, constructing suitable risk estimates requires significant effort: one must carefully define and measure the targets of prediction to avoid entrenching biases in the data. But, importantly, one cannot generally address these difficulties by requiring that algorithms satisfy popular mathematical formalizations of fairness. By highlighting these challenges in the foundation of fair machine learning, we hope to help researchers and practitioners productively advance the area.
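    To make the second and third criteria concrete, the minimal sketch below checks classification parity (group-wise false positive rates) and calibration (outcome rates within risk bins) on simulated data. It is illustrative only, not the authors' code: the column names, the 0.5 threshold, and the simulation are all assumptions.

```python
# Illustrative sketch (not the authors' code): checking classification
# parity and calibration on simulated data with hypothetical column names.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "group": rng.choice(["a", "b"], size=n),  # protected attribute
    "risk": rng.uniform(0, 1, size=n),        # model's risk estimate
})
df["outcome"] = rng.uniform(size=n) < df["risk"]  # simulated ground truth
df["decision"] = df["risk"] > 0.5                 # hypothetical threshold rule

# Classification parity: false positive rates should match across groups.
for g, sub in df.groupby("group"):
    negatives = sub[~sub["outcome"]]
    print(f"group {g}: FPR = {negatives['decision'].mean():.3f}")

# Calibration: within each risk bin, observed outcome rates should track the
# risk estimate for every group (true by construction in this toy simulation).
df["bin"] = pd.cut(df["risk"], bins=np.linspace(0, 1, 11))
print(df.groupby(["bin", "group"], observed=True)["outcome"].mean())
```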

    Identifying and Measuring Excessive and Discriminatory Policing

    We describe and apply three empirical approaches to identify superfluous police activity, unjustified racially disparate impacts, and limits to regulatory interventions. First, using cost-benefit analysis, we show that traffic and pedestrian stops in Nashville and New York City disproportionately impacted communities of color without achieving their stated public-safety goals. Second, we address a long-standing problem in discrimination research by presenting an empirical approach for identifying “similarly situated” individuals and, in so doing, quantify potentially unjustified disparities in stop policies in New York City and Chicago. Finally, taking a holistic view of police contact in Chicago and Philadelphia, we show that settlement agreements curbed pedestrian stops but that a concomitant rise in traffic stops maintained aggregate racial disparities, illustrating the challenges facing regulatory efforts. These case studies highlight the promise and value of viewing legal principles and policy goals through the lens of modern data analysis, both in police reform and in reform efforts more broadly.
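    One simple way to operationalize the “similarly situated” comparison is to stratify stops by a risk score fit from non-protected features and contrast hit rates across groups within strata. The sketch below is exactly that kind of risk-stratification approach, not the paper's actual method; the file name, columns, and features are all hypothetical.

```python
# Hypothetical data and column names; a risk-stratification sketch,
# not the paper's actual method.
import pandas as pd
from sklearn.linear_model import LogisticRegression

stops = pd.read_csv("stops.csv")  # hypothetical stop-level records
features = ["hour_of_day", "precinct_code", "offense_code"]  # assumed numeric

# Score each stop's ex ante likelihood of finding contraband using only
# non-protected features.
model = LogisticRegression(max_iter=1000)
model.fit(stops[features], stops["contraband_found"])
stops["risk"] = model.predict_proba(stops[features])[:, 1]

# Within a risk decile, stops are "similarly situated"; persistent hit-rate
# gaps by race within deciles point to potentially unjustified disparities.
stops["decile"] = pd.qcut(stops["risk"], 10, labels=False, duplicates="drop")
print(stops.groupby(["decile", "race"])["contraband_found"].mean().unstack())
```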

    Are Police Officers Bayesians? Police Updating in Investigative Stops

    Theories of rational behavior assume that actors make decisions when the expected benefits of their acts exceed their costs or losses. If those expected costs and benefits change over time, behavior will change accordingly as actors learn and internalize the parameters of success and failure. In the context of proactive policing, police stops that achieve any of several goals (constitutional compliance, stops that lead to “good” arrests or summonses, stops that lead to seizures of weapons, drugs, or other contraband, or stops that produce good will and citizen cooperation) should signal to officers the features of a stop that increase its rewards or benefits. Having formed a subjective estimate of success (i.e., prior beliefs), officers should observe the outcomes of subsequent encounters and form updated probability estimates, assigning positive weight to the specific features of events associated with success. Officers should likewise learn the features of unproductive stops and adjust accordingly. A rational actor would pursue “good” or “productive” stops and avoid unproductive ones by updating their knowledge of these features through experience. We analyze data on 4.9 million Terry stops in New York City from 2004 to 2016 to estimate the extent of updating by officers in the New York Police Department. We compare a frequentist analysis of officer behavior with a Bayesian analysis in which subsequent events are weighted by the signals from prior events. By comparing productive and unproductive stops, the analysis estimates the weights, or values (an experience effect), that officers assign to the signals of each type of stop outcome. We find evidence of updating under both analytic methods, although the “hit rates” (our measure of stop productivity, including recovery of firearms or arrests for criminal behavior) remain low. Updating is independent of an officer's total stop activity each month, suggesting that learning may be selective and specific to certain stop features. However, hit rates decline as officer stop activity increases. Both updating and hit rates improved as stop rates declined following a series of internal memoranda and trial orders beginning in May 2012. There is also evidence of differential updating by officers conditional on a variety of features of prior and current stops, including suspect race and stop legality. Though our analysis is limited to NYPD stops, the ubiquity of intensive stop-and-frisk regimes across the United States gives these findings relevance beyond New York City. Such regimes reveal tensions between the Terry jurisprudence of reasonable suspicion and evidence on contemporary police practices across the country.
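    The kind of Bayesian updating the paper tests for can be sketched with a conjugate Beta-Binomial model: an officer's belief about a stop type's hit rate is a Beta distribution whose pseudo-counts shift with each productive or unproductive stop. This is an illustrative toy under assumed priors, not the authors' estimation model.

```python
# Toy Beta-Binomial sketch of officer belief updating (illustrative only;
# not the paper's estimation model).
from dataclasses import dataclass

@dataclass
class HitRateBelief:
    alpha: float = 1.0  # prior pseudo-count of productive stops ("hits")
    beta: float = 1.0   # prior pseudo-count of unproductive stops

    def update(self, hit: bool) -> None:
        # Conjugate update: each observed stop shifts the posterior by one.
        if hit:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean, i.e., the expected hit rate for this stop type.
        return self.alpha / (self.alpha + self.beta)

# A Bayesian officer would keep one belief per stop feature (e.g., suspected
# offense) and shift activity toward features with higher posterior means.
belief = HitRateBelief()
for hit in [False, False, True, False]:  # hypothetical sequence of outcomes
    belief.update(hit)
print(f"expected hit rate: {belief.mean:.2f}")  # 2/6 = 0.33 with uniform prior
```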