6 research outputs found

    Improving Worker Performance with Human-Centered Data Science

    Advances in information technologies not only provide novel tools to support work in the traditional sectors; they also create additional employment opportunities in the modern workforce where work contexts have been largely changed. All these changes call for new efforts to study worker performance. Indeed, information technologies, especially data science techniques, render unprecedented large-scale rich data and sophisticated analytic tools to investigate worker performance. However, it remains unclear how we can combine the strengths of big data analytics in data science and our existing knowledge in social science to enhance worker performance. In this dissertation, we propose a human-centered data science framework that integrates machine learning, causal inference, field experiments, and social science theories: First, machine learning (with counterfactual reasoning) enables the prediction (and explanation) of human behavior in work practice via large-scale data analysis. Existing insights from social theories can further enhance its predictive power by informing feature construction, model architecture, and model explanation. Field experiments can help to evaluate the effectiveness of these models in real-world practices. Second, field experiments perform precise interventions and establish causality with randomized controlled trials. Yet, the experimental analysis mainly supports the understanding of treatment effects at aggregate levels, such as average treatment effect. Machine learning empowers more sophisticated analyses of experimental data by revealing heterogeneous effects at a finer granularity, such as individual treatment effects. Third, while these data-driven discoveries complement social science theories and provide rich insights for describing, explaining, and predicting human behavior, they require rigorous analytic tools, such as experiments and machine learning, to validate or disconfirm their applicability in specific contexts. 
In addition to testing theories, causal insights derived from field experiments and counterfactual machine learning models could support the development of new theories that better reflect reality. To exemplify the various applications of this framework in both traditional sectors and in the modern workforce, we present three empirical studies: developing machine learning models to improve the outreach performance for government specialists, leveraging a field experiment to enhance the performance of the gig economy workers, and using counterfactual machine learning to unpack individual treatment effects of field experiments on worker performance in the gig economy. These studies illustrate that the framework of human-centered data science is effective and flexible in increasing worker performance.
    PhD, Information, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/169835/1/tengye_1.pd
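
    The abstract's contrast between an aggregate average treatment effect and finer-grained heterogeneous effects can be illustrated with a small simulation. This is only a sketch on made-up data, not the dissertation's analysis: the covariate `experienced`, the effect sizes, and the sample size are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def simulate_experiment(n=4000, seed=0):
    """Simulate a randomized experiment with a hypothetical binary covariate."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        experienced = rng.random() < 0.5      # hypothetical worker covariate
        treated = rng.random() < 0.5          # randomized assignment
        effect = 0.5 if experienced else 2.0  # assumed true effects (illustrative)
        outcome = 10.0 + (effect if treated else 0.0) + rng.gauss(0, 1)
        rows.append((experienced, treated, outcome))
    return rows


def diff_in_means(rows):
    """Treated-minus-control mean outcome; unbiased under randomization."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


rows = simulate_experiment()
ate = diff_in_means(rows)                                      # aggregate effect
effect_new = diff_in_means([r for r in rows if not r[0]])      # subgroup effect
effect_experienced = diff_in_means([r for r in rows if r[0]])  # subgroup effect

print(f"ATE: {ate:.2f}")
print(f"Effect, new workers: {effect_new:.2f}")
print(f"Effect, experienced workers: {effect_experienced:.2f}")
```

    In this toy setup the single average hides two quite different subgroup effects. Estimating effects at the level of individuals, as the abstract describes, would require a model over many covariates rather than a two-way split.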

    A Distillation Approach to Data Efficient Individual Treatment Effect Estimation

    The potential for using machine learning algorithms as a tool for suggesting optimal interventions has fueled significant interest in developing methods for estimating heterogeneous or individual treatment effects (ITEs) from observational data. While several methods for estimating ITEs have been recently suggested, these methods assume no constraints on the availability of data at the time of deployment or test time. This assumption is unrealistic in settings where data acquisition is a significant part of the analysis pipeline, meaning data about a test case has to be collected in order to predict the ITE. In this work, we present Data Efficient Individual Treatment Effect Estimation (DEITEE), a method which exploits the idea that adjusting for confounding, and hence collecting information about confounders, is not necessary at test time. DEITEE allows the development of rich models that exploit all variables at train time but identifies a minimal set of variables required to estimate the ITE at test time. Using 77 semi-synthetic datasets with varying data generating processes, we show that DEITEE achieves significant reductions in the number of variables required at test time with little to no loss in accuracy. Using real data, we demonstrate the utility of our approach in helping soon-to-be mothers make planning and lifestyle decisions that will impact newborn health.
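
    The idea in this abstract, that confounders are needed when fitting the model but not when predicting for a new case, can be sketched on toy data. The code below is not the DEITEE algorithm, whose details the abstract does not give. It is a simplified stand-in that assumes two binary variables (an effect modifier `x` and a confounder `z`), a stratified "teacher" estimator, and a hypothetical tolerance of 0.05 on the distillation loss.

```python
import random
from itertools import combinations
from statistics import mean


def simulate(n=20000, seed=1):
    """Observational toy data: z confounds treatment, x modifies the effect."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = int(rng.random() < 0.5)                  # effect modifier
        z = int(rng.random() < 0.5)                  # confounder
        t = int(rng.random() < (0.8 if z else 0.2))  # treatment depends on z
        tau = 1.0 + 2.0 * x                          # true ITE depends only on x
        y = 3.0 * z + tau * t + rng.gauss(0, 1)
        data.append({"x": x, "z": z, "t": t, "y": y})
    return data


def fit_teacher(data, features):
    """Estimate the effect in each cell of `features` by stratified differences."""
    cells = {}
    for row in data:
        key = tuple(row[f] for f in features)
        cells.setdefault(key, {0: [], 1: []})[row["t"]].append(row["y"])
    return {k: mean(v[1]) - mean(v[0]) for k, v in cells.items()}


def distill(data, teacher, all_features, subset):
    """Fit a student on `subset` to the teacher's predictions; return it with its MSE."""
    targets = {}
    for row in data:
        t_key = tuple(row[f] for f in all_features)
        s_key = tuple(row[f] for f in subset)
        targets.setdefault(s_key, []).append(teacher[t_key])
    student = {k: mean(v) for k, v in targets.items()}
    loss = mean(
        (student[tuple(r[f] for f in subset)]
         - teacher[tuple(r[f] for f in all_features)]) ** 2
        for r in data
    )
    return student, loss


def smallest_subset(data, all_features, tolerance=0.05):
    """Return the smallest feature subset whose student stays within tolerance."""
    teacher = fit_teacher(data, all_features)
    for size in range(len(all_features) + 1):
        for subset in combinations(all_features, size):
            student, loss = distill(data, teacher, all_features, subset)
            if loss <= tolerance:
                return subset, student, loss


data = simulate()
naive = fit_teacher(data, ("x",))  # ignores z at train time: confounded
subset, student, loss = smallest_subset(data, ("x", "z"))
print("Variables needed at test time:", subset)
print("Student ITE estimates:", student)
print("Naive estimates without adjusting for z:", naive)
```

    In this sketch the teacher adjusts for `z` during fitting, while the distilled student needs only `x` to reproduce its predictions. Fitting on `x` alone from the start gives biased estimates, which is why the confounder still has to be collected for training.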