Interpretable Machine Learning for the Social Sciences: Applications in Political Science and Labor Economics
Recent advances in machine learning offer social scientists a unique opportunity to use data-driven methods to uncover insights into human behavior. However, current machine learning methods are opaque, ineffective on small social science datasets, and tailored for predicting unseen values rather than estimating parameters from data. In this thesis, we develop interpretable machine learning techniques designed to uncover latent patterns and estimate critical quantities in the social sciences.
We focus on two aspects of interpretability: explaining individual model predictions and discovering latent patterns from data. We describe a method for explaining the predictions of general, black-box sequence models. This method approximates a combinatorial objective to elucidate the decision-making processes of sequence models. Next, we narrow our focus to domain-specific applications. In political science, we develop the text-based ideal point model, a model that quantifies political positions from text.
This model marries a classical idea from political science with a Bayesian matrix factorization technique to infer meaningful structure from text. In labor economics, we adapt a model from natural language processing to analyze career trajectories. We describe a transfer learning method that can overcome the constraints posed by small survey datasets. Finally, we adapt this predictive model to estimate an important quantity in labor economics: the history-adjusted gender wage gap.
Revisiting Topic-Guided Language Models
A recent line of work in natural language processing has aimed to combine
language models and topic models. These topic-guided language models augment
neural language models with topic models, unsupervised learning methods that
can discover document-level patterns of word use. This paper compares the
effectiveness of these methods in a standardized setting. We study four
topic-guided language models and two baselines, evaluating the held-out
predictive performance of each model on four corpora. Surprisingly, we find
that none of these methods outperform a standard LSTM language model baseline,
and most fail to learn good topics. Further, we train a probe of the neural
language model that shows that the baseline's hidden states already encode
topic information. We make public all code used for this study.
Comment: Published in Transactions on Machine Learning Research (TMLR)
(12/2023)
An Invariant Learning Characterization of Controlled Text Generation
Controlled generation refers to the problem of creating text that contains
stylistic or semantic attributes of interest. Many approaches reduce this
problem to training a predictor of the desired attribute. For example,
researchers hoping to deploy a large language model to produce non-toxic
content may use a toxicity classifier to filter generated text. In practice,
the generated text to classify, which is determined by user prompts, may come
from a wide range of distributions. In this paper, we show that the performance
of controlled generation may be poor if the distributions of text in response
to user prompts differ from the distribution the predictor was trained on. To
address this problem, we cast controlled generation under distribution shift as
an invariant learning problem: the most effective predictor should be invariant
across multiple text environments. We then discuss a natural solution that
arises from this characterization and propose heuristics for selecting natural
environments. We study this characterization and the proposed method
empirically using both synthetic and real data. Experiments demonstrate both
the challenge of distribution shift in controlled generation and the potential
of invariance methods in this setting.
Comment: To appear in the 2023 Conference of the Association for Computational
Linguistics (ACL 2023)
Replication Data for: Price Discrimination in The Princeton Review’s Online SAT Tutoring Service
This dataset was used for the paper published on 9/1/2015 in Technology Science: http://techscience.org/a/2015090102