6 research outputs found

Interpretable Machine Learning for the Social Sciences: Applications in Political Science and Labor Economics

Author: Vafa Keyon
Publication date: 01/01/2023

Recent advances in machine learning offer social scientists a unique opportunity to use data-driven methods to uncover insights into human behavior. However, current machine learning methods are opaque, ineffective on small social science datasets, and tailored for predicting unseen values rather than estimating parameters from data. In this thesis, we develop interpretable machine learning techniques designed to uncover latent patterns and estimate critical quantities in the social sciences. We focus on two aspects of interpretability: explaining individual model predictions and discovering latent patterns from data. We describe a method for explaining the predictions of general, black-box sequence models. This method approximates a combinatorial objective to elucidate the decision-making processes of sequence models. Next, we narrow our focus to domain-specific applications. In political science, we develop the text-based ideal point model, a model that quantifies political positions from text. This model marries a classical idea from political science with a Bayesian matrix factorization technique to infer meaningful structure from text. In labor economics, we adapt a model from natural language processing to analyze career trajectories. We describe a transfer learning method that can overcome the constraints posed by small survey datasets. Finally, we adapt this predictive model to estimate an important quantity in labor economics: the history-adjusted gender wage gap.

Columbia University Academic Commons

Revisiting Topic-Guided Language Models

Authors: Blei David M.; Vafa Keyon; Zheng Carolina
Publication date: 04/12/2023

A recent line of work in natural language processing has aimed to combine language models and topic models. These topic-guided language models augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora. Surprisingly, we find that none of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics. Further, we train a probe of the neural language model that shows that the baseline's hidden states already encode topic information. We make public all code used for this study. Comment: Published in Transactions on Machine Learning Research (TMLR) (12/2023)

arXiv.org e-Print Archive

An Invariant Learning Characterization of Controlled Text Generation

Authors: Blei David M.; Feder Amir; Shi Claudia; Vafa Keyon; Zheng Carolina
Publication date: 31/05/2023

Controlled generation refers to the problem of creating text that contains stylistic or semantic attributes of interest. Many approaches reduce this problem to training a predictor of the desired attribute. For example, researchers hoping to deploy a large language model to produce non-toxic content may use a toxicity classifier to filter generated text. In practice, the generated text to classify, which is determined by user prompts, may come from a wide range of distributions. In this paper, we show that the performance of controlled generation may be poor if the distributions of text in response to user prompts differ from the distribution the predictor was trained on. To address this problem, we cast controlled generation under distribution shift as an invariant learning problem: the most effective predictor should be invariant across multiple text environments. We then discuss a natural solution that arises from this characterization and propose heuristics for selecting natural environments. We study this characterization and the proposed method empirically using both synthetic and real data. Experiments demonstrate both the challenge of distribution shift in controlled generation and the potential of invariance methods in this setting. Comment: To appear in the 2023 Conference of the Association for Computational Linguistics (ACL 2023)

arXiv.org e-Print Archive

Replication Data for: Price Discrimination in The Princeton Review’s Online SAT Tutoring Service

Authors: Haigh Christian; Leung Alvin; Vafa Keyon; Yonack Noah
Publication venue: Harvard Dataverse

This dataset was used for the paper published on 9/1/2015 in Technology Science: http://techscience.org/a/2015090102

Harvard Dataverse Network