Exploratory Mediation Analysis with Many Potential Mediators
Social and behavioral scientists are increasingly employing technologies such
as fMRI, smartphones, and gene sequencing, which yield 'high-dimensional'
datasets with more columns than rows. There is increasing interest, but little
substantive theory, in the role the variables in these data play in known
processes. This necessitates exploratory mediation analysis, for which
structural equation modeling is the benchmark method. However, this method
cannot perform mediation analysis with more variables than observations. One
option is to run a series of univariate mediation models, which incorrectly
assumes independence of the mediators. Another option is regularization, but
the available implementations may lead to high false positive rates. In this
paper, we develop a hybrid approach that combines components of both filter and
regularization methods: the 'Coordinate-wise Mediation Filter'. It performs filtering
conditional on the other selected mediators. We show through simulation that it
improves performance over existing methods. Finally, we provide an empirical
example, showing how our method may be used for epigenetic research.
Comment: R code and package are available online as supplementary material at https://github.com/vankesteren/cmfilter and https://github.com/vankesteren/ema_simulation
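The coordinate-wise idea, toggling each candidate mediator in or out conditional on the currently selected set, can be illustrated with a toy sketch. This is not the cmfilter package's actual algorithm; the data-generating step, the product-of-paths thresholding rule, and the function name are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50                      # fewer observations than in practice, many candidate mediators
x = rng.normal(size=n)
M = rng.normal(size=(n, p))
M[:, 0] += 0.8 * x                  # true mediators: x -> M[:, 0], M[:, 1] -> y
M[:, 1] += 0.8 * x
y = 0.8 * M[:, 0] + 0.8 * M[:, 1] + rng.normal(size=n)

def cmfilter_sketch(x, M, y, n_sweeps=5, threshold=0.3):
    """Toy coordinate-wise filter: sweep over mediators, deciding each
    one's inclusion conditional on the other currently selected mediators."""
    p = M.shape[1]
    selected = np.zeros(p, dtype=bool)
    for _ in range(n_sweeps):
        for j in range(p):
            others = selected.copy()
            others[j] = False
            # b-path: effect of M_j on y, controlling for x and selected others
            Z = np.column_stack([np.ones_like(x), x, M[:, others], M[:, j]])
            b = np.linalg.lstsq(Z, y, rcond=None)[0][-1]
            # a-path: effect of x on M_j
            A = np.column_stack([np.ones_like(x), x])
            a = np.linalg.lstsq(A, M[:, j], rcond=None)[0][1]
            selected[j] = abs(a * b) > threshold  # keep if mediated effect is large
    return np.flatnonzero(selected)

print(cmfilter_sketch(x, M, y))  # recovers the true mediators (indices 0 and 1 here)
```

Note that each inclusion decision is conditional on the other selected mediators, which is what distinguishes this scheme from running p independent univariate mediation models.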
Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions
Text embedding models from Natural Language Processing can map text data
(e.g. words, sentences, documents) to supposedly meaningful numerical
representations (a.k.a. text embeddings). While such models are increasingly
applied in social science research, one important issue is often not addressed:
the extent to which these embeddings are valid representations of constructs
relevant for social science research. We therefore propose the use of the
classic construct validity framework to evaluate the validity of text
embeddings. We show how this framework can be adapted to the opaque and
high-dimensional nature of text embeddings, with application to survey
questions. We include several popular text embedding methods (e.g. fastText,
GloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct
validity analyses. We find evidence of convergent and discriminant validity in
some cases. We also show that embeddings can be used to predict respondents'
answers to completely new survey questions. Furthermore, BERT-based embedding
techniques and the Universal Sentence Encoder provide more valid
representations of survey questions than do others. Our results thus highlight
the necessity to examine the construct validity of text embeddings before
deploying them in social science research.
Comment: Under review
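The convergent/discriminant comparison can be sketched numerically. Here simulated toy vectors stand in for real model embeddings (the number of constructs, dimensionality, and noise level are invented for illustration, not the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 50
# Two hypothetical constructs, three question "embeddings" each, simulated
# as a shared construct vector plus item-specific noise. In practice these
# vectors would come from a model such as Sentence-BERT.
centers = rng.normal(size=(2, dim))
emb = {c: centers[c] + 0.3 * rng.normal(size=(3, dim)) for c in (0, 1)}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def mean_sim(A, B, same):
    """Average pairwise cosine similarity; skip self/duplicate pairs within a set."""
    sims = [cos(a, b) for i, a in enumerate(A) for j, b in enumerate(B)
            if not (same and i >= j)]
    return float(np.mean(sims))

convergent = mean_sim(emb[0], emb[0], same=True)     # within-construct similarity
discriminant = mean_sim(emb[0], emb[1], same=False)  # between-construct similarity
# Convergent validity: questions measuring the same construct should be
# more similar to each other than to questions measuring a different one.
print(convergent > discriminant)
```

In the construct validity framework, high within-construct similarity is evidence of convergent validity, while low between-construct similarity is evidence of discriminant validity.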
Estimating stochastic survey response errors using the multitrait‐multierror model
From Wiley via Jisc Publications Router. History: received 2018-09-17, revision received 2021-01-26, accepted 2021-05-30, published electronically 2021-10-12. Article version: VoR. Publication status: Published. Funder: ESRC National Centre for Research Methods, University of Southampton; Id: http://dx.doi.org/10.13039/501100000613; Grant(s): R121711.
Abstract: Surveys are well known to contain response errors of several types simultaneously, including acquiescence, social desirability, common method variance and random error. Nevertheless, most methods developed to estimate and correct for such errors consider only a single error source at a time. Consequently, estimation of response errors is inefficient, their relative importance is unknown and the optimal question format may not be discoverable. To remedy this situation, we demonstrate how multiple types of errors can be estimated concurrently with the recently introduced 'multitrait-multierror' (MTME) approach. MTME combines the theory of design of experiments with latent variable modelling to estimate response error variances of different error types simultaneously. This allows researchers to evaluate which errors are most impactful, and aids in the discovery of optimal question formats. We apply this approach to six survey items measuring attitudes towards immigrants that are commonly used across public opinion studies, using representative data from the United Kingdom.
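The idea of separating multiple error variances through a designed measurement structure can be illustrated with a simplified multitrait-multimethod-style simulation. This is only a loose sketch of the spirit of the approach, not the MTME latent variable model itself; all variance values are invented:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
v_trait, v_method, v_error = 1.0, 0.4, 0.6   # illustrative variance components

# Each observed response = trait score + systematic method effect + random error.
trait = {t: rng.normal(scale=v_trait**0.5, size=n) for t in (0, 1)}
method = {m: rng.normal(scale=v_method**0.5, size=n) for m in (0, 1)}
obs = {(t, m): trait[t] + method[m] + rng.normal(scale=v_error**0.5, size=n)
       for t in (0, 1) for m in (0, 1)}

# The design makes the components separable: items sharing a trait but not a
# method covary through the trait; items sharing a method but not a trait
# covary through the method effect.
est_trait = np.cov(obs[(0, 0)], obs[(0, 1)])[0, 1]   # same trait, different methods
est_method = np.cov(obs[(0, 0)], obs[(1, 0)])[0, 1]  # different traits, same method
est_error = obs[(0, 0)].var() - est_trait - est_method

print(round(est_trait, 1), round(est_method, 1), round(est_error, 1))
```

Because each variance component leaves a distinct covariance signature across the designed item combinations, all of them can be estimated from the same data, which is the concurrent-estimation idea the abstract describes.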
Achieving Fair Inference Using Error-Prone Outcomes
Recently, an increasing amount of research has focused on methods to assess and account for fairness criteria when predicting ground-truth targets in supervised learning. However, recent literature has shown that prediction unfairness can arise due to measurement error when target labels are error-prone. In this study we demonstrate that existing methods to assess and calibrate fairness criteria do not extend to the true target variable of interest when an error-prone proxy target is used. As a solution to this problem, we suggest a framework that combines two existing fields of research: fair ML methods, such as those found in the counterfactual fairness literature, and measurement models from the statistical literature. First, we discuss these approaches and how they can be combined to form our framework. We then show that, in a healthcare decision problem, a latent variable model accounting for measurement error removes the unfairness detected previously.
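The core phenomenon, that a fairness metric computed against an error-prone proxy can diverge from the same metric under the true label, can be demonstrated with a small simulation. Group sizes, error rates, and classifier accuracy below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
group = rng.integers(0, 2, n)
y_true = rng.integers(0, 2, n)
# A classifier equally accurate for both groups with respect to the TRUE label.
y_pred = np.where(rng.random(n) < 0.9, y_true, 1 - y_true)
# Proxy label with differential measurement error: flips the true label
# more often in group 1 than in group 0.
flip = rng.random(n) < np.where(group == 0, 0.05, 0.25)
y_proxy = np.where(flip, 1 - y_true, y_true)

def fpr(y, yhat, mask):
    """False positive rate within the masked subpopulation."""
    neg = (y == 0) & mask
    return float(((yhat == 1) & neg).sum() / neg.sum())

for g in (0, 1):
    m = group == g
    # True-label FPR is similar across groups; proxy-label FPR is not.
    print(g, round(fpr(y_true, y_pred, m), 3), round(fpr(y_proxy, y_pred, m), 3))
```

The classifier satisfies equal false positive rates under the true label, yet an audit against the proxy label reports a substantial gap between the groups, which is exactly why fairness assessment needs a measurement model when labels are error-prone.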
Flexible Extensions to Structural Equation Models Using Computation Graphs
Structural equation modeling (SEM) is being applied to ever more complex data types and questions, often requiring extensions such as regularization or novel fitting functions. To extend SEM, researchers currently need to completely reformulate SEM and its optimization algorithm, a challenging and time-consuming task. In this paper, we introduce the computation graph for SEM, and show that this approach can extend SEM without the need for bespoke software development. We show that both existing and novel SEM improvements follow naturally. To demonstrate, we introduce three SEM extensions: least absolute deviation estimation, Bayesian LASSO optimization, and sparse high-dimensional mediation analysis. We provide an implementation of SEM in PyTorch, popular software in the machine learning community, to accelerate the development of structural equation models adequate for modern-day data and research questions.
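The first extension named above, least absolute deviation (LAD) estimation, can be sketched for a single regression path. A hand-derived subgradient stands in here for what a computation-graph framework such as PyTorch would provide via autograd; the data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
# Heavy-tailed noise is the setting where LAD outperforms least squares.
y = 2.0 * x + 0.1 * rng.standard_cauchy(size=n)

# Minimize the LAD fitting function mean(|y - b*x|) by (sub)gradient descent.
# In the computation-graph approach the gradient comes from automatic
# differentiation; here it is written out by hand to stay dependency-free.
b = 0.0
lr = 0.01
for _ in range(2000):
    r = y - b * x
    grad = -np.mean(np.sign(r) * x)   # subgradient of mean(|y - b*x|) w.r.t. b
    b -= lr * grad

print(round(b, 1))  # close to the true slope of 2.0
```

The point of the computation-graph formulation is that swapping in a different fitting function (here the absolute-error loss) requires no re-derivation of the optimizer: automatic differentiation supplies the gradient for whatever objective is specified.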
Evaluating the construct validity of text embeddings with application to survey questions
Text embedding models from Natural Language Processing can map text data (e.g. words, sentences, documents) to meaningful numerical representations (a.k.a. text embeddings). While such models are increasingly applied in social science research, one important issue is often not addressed: the extent to which these embeddings are high-quality representations of the information they are meant to encode. We view this quality evaluation problem from a measurement validity perspective, and propose the use of the classic construct validity framework to evaluate the quality of text embeddings. First, we describe how this framework can be adapted to the opaque and high-dimensional nature of text embeddings. Second, we apply our adapted framework to an example in which we compare the validity of survey question representations across text embedding models.