70,550 research outputs found
Distributed multinomial regression
This article introduces a model-based approach to distributed computing for
multinomial logistic (softmax) regression. We treat counts for each response
category as independent Poisson regressions via plug-in estimates for fixed
effects shared across categories. The work is driven by the
high-dimensional-response multinomial models that are used in analysis of a
large number of random counts. Our motivating applications are in text
analysis, where documents are tokenized and the token counts are modeled as
arising from a multinomial dependent upon document attributes. We estimate such
models for a publicly available data set of reviews from Yelp, with text
regressed onto a large set of explanatory variables (user, business, and rating
information). The fitted models serve as a basis for exploring the connection
between words and variables of interest, for reducing dimension into supervised
factor scores, and for prediction. We argue that the approach herein provides
an attractive option for social scientists and other text analysts who wish to
bring familiar regression tools to bear on text data.Comment: Published at http://dx.doi.org/10.1214/15-AOAS831 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Recommended from our members
Predicting the course of Alzheimer's progression.
Alzheimer's disease is the most common neurodegenerative disease and is characterized by the accumulation of amyloid-beta peptides leading to the formation of plaques and tau protein tangles in brain. These neuropathological features precede cognitive impairment and Alzheimer's dementia by many years. To better understand and predict the course of disease from early-stage asymptomatic to late-stage dementia, it is critical to study the patterns of progression of multiple markers. In particular, we aim to predict the likely future course of progression for individuals given only a single observation of their markers. Improved individual-level prediction may lead to improved clinical care and clinical trials. We propose a two-stage approach to modeling and predicting measures of cognition, function, brain imaging, fluid biomarkers, and diagnosis of individuals using multiple domains simultaneously. In the first stage, joint (or multivariate) mixed-effects models are used to simultaneously model multiple markers over time. In the second stage, random forests are used to predict categorical diagnoses (cognitively normal, mild cognitive impairment, or dementia) from predictions of continuous markers based on the first-stage model. The combination of the two models allows one to leverage their key strengths in order to obtain improved accuracy. We characterize the predictive accuracy of this two-stage approach using data from the Alzheimer's Disease Neuroimaging Initiative. The two-stage approach using a single joint mixed-effects model for all continuous outcomes yields better diagnostic classification accuracy compared to using separate univariate mixed-effects models for each of the continuous outcomes. Overall prediction accuracy above 80% was achieved over a period of 2.5Â years. The results further indicate that overall accuracy is improved when markers from multiple assessment domains, such as cognition, function, and brain imaging, are used in the prediction algorithm as compared to the use of markers from a single domain only
Sensor-AssistedWeighted Average Ensemble Model for Detecting Major Depressive Disorder
The present methods of diagnosing depression are entirely dependent on self-report
ratings or clinical interviews. Those traditional methods are subjective, where the individual may
or may not be answering genuinely to questions. In this paper, the data has been collected using
self-report ratings and also using electronic smartwatches. This study aims to develop a weighted
average ensemble machine learning model to predict major depressive disorder (MDD) with superior
accuracy. The data has been pre-processed and the essential features have been selected using a
correlation-based feature selection method. With the selected features, machine learning approaches
such as Logistic Regression, Random Forest, and the proposedWeighted Average Ensemble Model are
applied. Further, for assessing the performance of the proposed model, the Area under the Receiver
Optimization Characteristic Curves has been used. The results demonstrate that the proposed
Weighted Average Ensemble model performs with better accuracy than the Logistic Regression and
the Random Forest approaches
The supervised hierarchical Dirichlet process
We propose the supervised hierarchical Dirichlet process (sHDP), a
nonparametric generative model for the joint distribution of a group of
observations and a response variable directly associated with that whole group.
We compare the sHDP with another leading method for regression on grouped data,
the supervised latent Dirichlet allocation (sLDA) model. We evaluate our method
on two real-world classification problems and two real-world regression
problems. Bayesian nonparametric regression models based on the Dirichlet
process, such as the Dirichlet process-generalised linear models (DP-GLM) have
previously been explored; these models allow flexibility in modelling nonlinear
relationships. However, until now, Hierarchical Dirichlet Process (HDP)
mixtures have not seen significant use in supervised problems with grouped data
since a straightforward application of the HDP on the grouped data results in
learnt clusters that are not predictive of the responses. The sHDP solves this
problem by allowing for clusters to be learnt jointly from the group structure
and from the label assigned to each group.Comment: 14 page
- …