Distributed multinomial regression
This article introduces a model-based approach to distributed computing for
multinomial logistic (softmax) regression. We treat counts for each response
category as independent Poisson regressions via plug-in estimates for fixed
effects shared across categories. The work is driven by the
high-dimensional-response multinomial models that are used in analysis of a
large number of random counts. Our motivating applications are in text
analysis, where documents are tokenized and the token counts are modeled as
arising from a multinomial dependent upon document attributes. We estimate such
models for a publicly available data set of reviews from Yelp, with text
regressed onto a large set of explanatory variables (user, business, and rating
information). The fitted models serve as a basis for exploring the connection
between words and variables of interest, for reducing dimension into supervised
factor scores, and for prediction. We argue that the approach herein provides
an attractive option for social scientists and other text analysts who wish to
bring familiar regression tools to bear on text data.
Comment: Published at http://dx.doi.org/10.1214/15-AOAS831 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
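The Poisson treatment described in the abstract above rests on a standard identity, sketched here for orientation. The notation is an assumption made for illustration and is not taken from the paper: c_{ij} is the count for category j in observation i, m_i is the total count, \eta_{ij} is the linear predictor for category j, and \mu_i is the effect shared across categories. Using \hat{\mu}_i = \log m_i as the plug-in is likewise an assumption about one natural choice.

```latex
% Multinomial logistic (softmax) model for the counts given their total:
\[
  \mathbf{c}_i \mid m_i \sim \mathrm{MN}(\mathbf{q}_i, m_i),
  \qquad
  q_{ij} = \frac{e^{\eta_{ij}}}{\sum_{l=1}^{d} e^{\eta_{il}}} .
\]
% Independent Poisson counts with a shared effect \mu_i factorize as
% (multinomial given the total) x (Poisson for the total):
\[
  \prod_{j=1}^{d} \mathrm{Po}\!\left(c_{ij};\, e^{\mu_i + \eta_{ij}}\right)
  = \mathrm{MN}(\mathbf{c}_i;\, \mathbf{q}_i, m_i)\;
    \mathrm{Po}\!\Big(m_i;\, e^{\mu_i} \sum_{l=1}^{d} e^{\eta_{il}}\Big).
\]
% With a plug-in such as \hat{\mu}_i = \log m_i held fixed, the left-hand side
% separates into d independent Poisson regressions, one per category.
```

Because each factor on the left involves only the parameters of one category once \mu_i is fixed, the d regressions can in principle be fitted in parallel on separate machines.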
Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment
Automated data-driven decision-making systems are increasingly being used to
assist, or even replace, humans in many settings. These systems function by
learning from historical decisions, often taken by humans. In order to maximize
the utility of these systems (or, classifiers), their training involves
minimizing the errors (or, misclassifications) over the given historical data.
However, it is quite possible that the optimally trained classifier makes
decisions for people belonging to different social groups with different
misclassification rates (e.g., misclassification rates for females are higher
than for males), thereby placing these groups at an unfair disadvantage. To
account for and avoid such unfairness, in this paper, we introduce a new notion
of unfairness, disparate mistreatment, which is defined in terms of
misclassification rates. We then propose intuitive measures of disparate
mistreatment for decision boundary-based classifiers, which can be easily
incorporated into their formulation as convex-concave constraints. Experiments
on synthetic as well as real world datasets show that our methodology is
effective at avoiding disparate mistreatment, often at a small cost in terms of
accuracy.
Comment: To appear in Proceedings of the 26th International World Wide Web
Conference (WWW), 2017. Code available at:
https://github.com/mbilalzafar/fair-classificatio
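As a rough illustration of the notion described in the abstract above, disparate mistreatment can be quantified as the gap in false positive and false negative rates between two groups. The sketch below is not the authors' code; the function names, the binary 0/1 encoding of labels and groups, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def misclassification_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    negatives = y_true == 0
    positives = y_true == 1
    # Rates are undefined (NaN) when a group has no negatives or no positives.
    fpr = float(np.mean(y_pred[negatives] == 1)) if negatives.any() else float("nan")
    fnr = float(np.mean(y_pred[positives] == 0)) if positives.any() else float("nan")
    return fpr, fnr


def disparate_mistreatment(y_true, y_pred, group):
    """Gap in FPR and FNR between group 0 and group 1 (hypothetical encoding)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    fpr0, fnr0 = misclassification_rates(y_true[group == 0], y_pred[group == 0])
    fpr1, fnr1 = misclassification_rates(y_true[group == 1], y_pred[group == 1])
    return {"fpr_gap": fpr0 - fpr1, "fnr_gap": fnr0 - fnr1}


# Toy data: the classifier makes errors only on group 0.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
group = [0, 0, 0, 0, 1, 1, 1, 1]
gaps = disparate_mistreatment(y_true, y_pred, group)
```

A gap of zero on both rates would indicate no disparate mistreatment under this measure. The paper goes further by constraining a proxy of these gaps during training, which this sketch does not attempt.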
An Empirical Examination of Traditional Neighborhood Development
This study analyzes the impact of the new urbanism on single-family home prices. Specifically, we explore the price differential that homebuyers pay for houses in new urbanist developments relative to houses in conventional suburban developments. Using data on over 5,000 single-family home sales from 1994 to 1997 in three different neighborhoods, hedonic regression results reveal that consumers pay more for homes in new urbanist communities than those in conventional suburban developments. Further analyses indicate that the price premium is not attributable to differences in improvement age and other housing characteristics.
Residential Property Prices in a Sub-Market of South Africa: Separating Real Growth from Attribute Growth
This paper analyses the South African residential housing market using hedonic price theory. It builds and tests pooled OLS, fixed effects OLS, pseudo-panel and quantile regression models. The main findings are in agreement with most modern related literature. This paper highlights how house price growth rates have been calculated incorrectly because the aggregate house sold changes every year. It calculates more accurate growth rates for the property market, yielding surprisingly different growth patterns from those originally thought. It illustrates that much of the recent house price growth was caused by attribute inflation rather than pure price inflation. It also shows that most of the pure inflation occurred at the bottom end of the market while most of the attribute inflation occurred at the top end of the market. Furthermore, it shows that house price determinants change across the house price distribution. The data used were sourced from the Residential Property Price Ranger and cover 1930 house sales measured half yearly over three years, from 1 September 2004 to 31 August 2007. These sales were recorded in the towns of Stellenbosch, Somerset West, Strand and Gordon's Bay.
Keywords: Hedonic pricing, Housing market, Growth rates
Union Mediation and Adaptation to Reciprocal Loyalty Arrangements
This study assesses the industrial relations application of the "loyalty-exit-voice" proposition. The loyalty concept is linked to reciprocal employer-employee arrangements and examined as a job attribute in a vignette questionnaire distributed to low and medium-skilled employees. The responses provided by employees in three European countries indicate that reciprocal loyalty arrangements, which involve the exchange of higher effort for job security, are one of the most desirable job attributes. This attribute exerts a higher impact on the job evaluations provided by unionised workers, compared to their non-union counterparts. This pattern is robust to a number of methodological considerations. It appears to be an outcome of adaptation to union mediated cooperation. Overall the evidence suggests that the loyalty-job evaluation profiles of unionised workers are receptive to repeated interaction and negative shocks, such as unemployment experience. This is not the case for the non-union workers. Finally, unionised workers appear to "voice" a lower job satisfaction, but exhibit low "exit" intentions, compared to the non-unionised labour.
EPICURUS, a project supported by the European Commission through the 5th Framework Programme "Improving Human Potential" (contract number: HPSE-CT-2002-00143
Reassessing the Link between Voter Heterogeneity and Political Accountability: A Latent Class Regression Model of Economic Voting
While recent research has underscored the conditioning effect of individual characteristics on economic voting behavior, most empirical studies have failed to explicitly incorporate observed heterogeneity into statistical analyses linking citizens' economic evaluations to electoral choices. In order to overcome these drawbacks, we propose a latent
class regression model to jointly analyze the determinants and influence of economic
voting in Presidential and Congressional elections. Our modeling approach allows us to
better describe the effects of individual covariates on economic voting and to test hypotheses on the existence of heterogeneous types of voters, providing an empirical basis
for assessing the relative validity of alternative explanations proposed in the literature.
Using survey data from the 2004 U.S. Presidential, Senate and House elections, we
find that voters with college education and those more interested in political campaigns
based their vote on factors other than their economic perceptions. In contrast, less educated and interested respondents assigned considerable weight to economic assessments,
with sociotropic judgments strongly influencing their vote in the Presidential election
and personal financial considerations affecting their vote in House elections. We conclude that the main distinction in the 2004 election was not between 'sociotropic' and
'pocketbook' voters, but rather between 'economic' and 'non-economic' voters.
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there does not exist a rigorous study on how
exactly cleaning affects ML -- the ML community usually focuses on developing ML
algorithms that are robust to some particular noise types of certain
distributions, while the database (DB) community has been mostly studying the
problem of data cleaning alone without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.
Comment: published in ICDE 202
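The Benjamini-Yekutieli (BY) procedure mentioned in the abstract above controls the false discovery rate under arbitrary dependence between tests. A minimal sketch of the standard step-up rule follows. It is not taken from the CleanML code base; the function name and the example p-values are assumptions made for illustration.

```python
import numpy as np


def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a boolean array marking hypotheses rejected by the BY procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject
    # Harmonic correction c(m) = sum_{i=1..m} 1/i, which handles dependence.
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)
    # Step-up thresholds: the k-th smallest p-value is compared to k*alpha/(m*c(m)).
    thresholds = np.arange(1, m + 1) * alpha / (m * c_m)
    below = p[order] <= thresholds
    if below.any():
        # Reject every hypothesis up to the largest rank that meets its threshold.
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[: k + 1]] = True
    return reject


# Hypothetical p-values from four experiments.
rejected = benjamini_yekutieli([0.001, 0.5, 0.9, 0.004], alpha=0.05)
```

Compared with the Benjamini-Hochberg rule, the only change is the extra factor c(m) in the denominator, which makes the thresholds stricter.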
Direct and mediated impacts of product and process characteristics on consumers' choice of organic vs. conventional chicken
There is a lack of research into why consumers value process characteristics. In this study, we test the hypothesis that the impact of process characteristics such as organic and free-range on consumers' choices of food products is at least partly mediated through expected eating quality or taste expectations. In other words, the process characteristics partly function as cues to (eating) quality. Using a traditional metric conjoint approach based on an additive model, four product characteristics (production method, price, size and information about farmer and rearing conditions) were varied in a fractional factorial conjoint design, creating nine profiles of whole chickens. 384 respondents rated the nine different chickens in terms of taste expectations and willingness to buy.
Since the nine records for each respondent are not independent, we used linear mixed modelling for the mediation analysis. We find that, as expected, taste expectations are a strong predictor of willingness to buy. As hypothesized, the impact of both product and process characteristics on willingness to buy is at least partly mediated through taste expectations. Hence, the study shows that process characteristics are important for consumers, not only in and of themselves, but partly because consumers make inferences about eating quality from knowledge about such process characteristics.
Assessment of hydrological and seasonal controls over the nitrate flushing from a forested watershed using a data mining technique
The M5 regression tree algorithm, a data mining technique, was used to review the role of the mutual hydrological and seasonal settings that control streamwater nitrate flushing during hydrological events within a forested watershed in the southwestern part of Slovenia, characterized by a distinctive flushing, almost torrential hydrological regime. The basis for the research was an extensive dataset of continuous, high-frequency measurements of seasonal meteorological conditions, watershed hydrological responses and streamwater nitrate concentrations. The dataset contained 16 recorded hydrographs occurring in different seasonal and hydrological conditions. Based on predefined regression tree pruning criteria, a regression tree model was obtained that is comprehensible in terms of the domain knowledge and able to adequately describe most of the streamwater nitrate concentration variations (RMSE=1.02 mg/l-N; r=0.91). The attributes found to be most descriptive of streamwater nitrate concentrations were the antecedent precipitation index (API) and air temperatures in the preceding periods. The model was most successful in describing streamwater concentrations in the range 1-4 mg/l-N, covering a large proportion of the dataset. The model performance was slightly worse in the periods of high streamwater nitrate concentration peaks during the summer hydrographs (up to 7 mg/l-N) but poor during the autumn hydrograph (up to 14 mg/l-N) related to highly variable hydrological conditions, which would require a less robust regression tree model based on the extended dataset.