    Distributed multinomial regression

    This article introduces a model-based approach to distributed computing for multinomial logistic (softmax) regression. We treat counts for each response category as independent Poisson regressions via plug-in estimates for fixed effects shared across categories. The work is driven by the high-dimensional-response multinomial models that are used in analysis of a large number of random counts. Our motivating applications are in text analysis, where documents are tokenized and the token counts are modeled as arising from a multinomial dependent upon document attributes. We estimate such models for a publicly available data set of reviews from Yelp, with text regressed onto a large set of explanatory variables (user, business, and rating information). The fitted models serve as a basis for exploring the connection between words and variables of interest, for reducing dimension into supervised factor scores, and for prediction. We argue that the approach herein provides an attractive option for social scientists and other text analysts who wish to bring familiar regression tools to bear on text data. Comment: Published at http://dx.doi.org/10.1214/15-AOAS831 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment

    Automated data-driven decision-making systems are increasingly being used to assist, or even replace, humans in many settings. These systems function by learning from historical decisions, often taken by humans. In order to maximize the utility of these systems (or classifiers), their training involves minimizing the errors (or misclassifications) over the given historical data. However, it is quite possible that the optimally trained classifier makes decisions for people belonging to different social groups with different misclassification rates (e.g., misclassification rates for females are higher than for males), thereby placing these groups at an unfair disadvantage. To account for and avoid such unfairness, in this paper we introduce a new notion of unfairness, disparate mistreatment, which is defined in terms of misclassification rates. We then propose intuitive measures of disparate mistreatment for decision boundary-based classifiers, which can be easily incorporated into their formulation as convex-concave constraints. Experiments on synthetic as well as real-world datasets show that our methodology is effective at avoiding disparate mistreatment, often at a small cost in terms of accuracy. Comment: To appear in Proceedings of the 26th International World Wide Web Conference (WWW), 2017. Code available at: https://github.com/mbilalzafar/fair-classificatio
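    As a rough illustration of what "defined in terms of misclassification rates" can mean, the sketch below compares error rates between two groups. It assumes binary labels (0/1), exactly two groups, and false positive and false negative rates as the rates of interest; the function names are hypothetical and this is not the authors' code or their constrained training procedure.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for 0/1 labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    neg = sum(1 for t in y_true if t == 0)
    pos = sum(1 for t in y_true if t == 1)
    # Guard against groups that contain no negatives or no positives.
    fpr = fp / neg if neg else 0.0
    fnr = fn / pos if pos else 0.0
    return fpr, fnr


def mistreatment_gaps(y_true, y_pred, group):
    """Return (FPR gap, FNR gap) between the two groups, first minus second.

    Groups are ordered by sorting their labels; a gap of zero on both
    rates means neither group is misclassified more often than the other.
    """
    groups = sorted(set(group))
    if len(groups) != 2:
        raise ValueError("this sketch assumes exactly two groups")
    rates = []
    for g in groups:
        idx = [i for i, z in enumerate(group) if z == g]
        rates.append(error_rates([y_true[i] for i in idx],
                                 [y_pred[i] for i in idx]))
    (fpr_a, fnr_a), (fpr_b, fnr_b) = rates
    return fpr_a - fpr_b, fnr_a - fnr_b


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 0, 0, 1, 1]
    y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
    group = [0, 0, 0, 0, 1, 1, 1, 1]
    print(mistreatment_gaps(y_true, y_pred, group))
```

    In this toy data both groups have the same overall error count, yet one group receives only false positives and the other only false negatives, which is the kind of imbalance the gaps are meant to expose.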

    An Empirical Examination of Traditional Neighborhood Development

    This study analyzes the impact of the new urbanism on single-family home prices. Specifically, we explore the price differential that homebuyers pay for houses in new urbanist developments relative to houses in conventional suburban developments. Using data on over 5,000 single-family home sales from 1994 to 1997 in three different neighborhoods, hedonic regression results reveal that consumers pay more for homes in new urbanist communities than those in conventional suburban developments. Further analyses indicate that the price premium is not attributable to differences in improvement age and other housing characteristics.

    Residential Property Prices in a Sub-Market of South Africa: Separating Real Growth from Attribute Growth

    This paper analyses the South African residential housing market using hedonic price theory. It builds and tests pooled OLS, fixed effects OLS, pseudo-panel and quantile regression models. The main findings are in agreement with most modern related literature. This paper highlights how house price growth rates have been calculated incorrectly due to the changing aggregate house sold every year. It calculates more accurate growth rates for the property market, yielding surprisingly different growth patterns from those originally thought. It illustrates that much of the recent house price growth was caused by attribute inflation rather than pure price inflation. It also shows that most of the pure inflation occurred at the bottom end of the market while most of the attribute inflation occurred at the top end of the market. Furthermore, it shows that house price determinants change across the house price distribution. The data were sourced from the Residential Property Price Ranger and cover 1930 house sales measured half-yearly over three years, from 1 September 2004 to 31 August 2007. These sales were recorded in the towns of Stellenbosch, Somerset West, Strand and Gordon’s Bay. Keywords: hedonic pricing, housing market, growth rates

    Union Mediation and Adaptation to Reciprocal Loyalty Arrangements

    This study assesses the industrial relations application of the “loyalty-exit-voice” proposition. The loyalty concept is linked to reciprocal employer-employee arrangements and examined as a job attribute in a vignette questionnaire distributed to low and medium-skilled employees. The responses provided by employees in three European countries indicate that reciprocal loyalty arrangements, which involve the exchange of higher effort for job security, are one of the most desirable job attributes. This attribute exerts a higher impact on the job evaluations provided by unionised workers, compared to their non-union counterparts. This pattern is robust to a number of methodological considerations. It appears to be an outcome of adaptation to union-mediated cooperation. Overall, the evidence suggests that the loyalty-job evaluation profiles of unionised workers are receptive to repeated interaction and negative shocks, such as unemployment experience. This is not the case for the non-union workers. Finally, unionised workers appear to “voice” a lower job satisfaction, but exhibit low “exit” intentions, compared to the non-unionised labour. EPICURUS, a project supported by the European Commission through the 5th Framework Programme “Improving Human Potential” (contract number: HPSE-CT-2002-00143

    Reassessing the Link between Voter Heterogeneity and Political Accountability: A Latent Class Regression Model of Economic Voting

    While recent research has underscored the conditioning effect of individual characteristics on economic voting behavior, most empirical studies have failed to explicitly incorporate observed heterogeneity into statistical analyses linking citizens' economic evaluations to electoral choices. In order to overcome these drawbacks, we propose a latent class regression model to jointly analyze the determinants and influence of economic voting in Presidential and Congressional elections. Our modeling approach allows us to better describe the effects of individual covariates on economic voting and to test hypotheses on the existence of heterogeneous types of voters, providing an empirical basis for assessing the relative validity of alternative explanations proposed in the literature. Using survey data from the 2004 U.S. Presidential, Senate and House elections, we find that voters with college education and those more interested in political campaigns based their vote on factors other than their economic perceptions. In contrast, less educated and less interested respondents assigned considerable weight to economic assessments, with sociotropic judgments strongly influencing their vote in the Presidential election and personal financial considerations affecting their vote in House elections. We conclude that the main distinction in the 2004 election was not between 'sociotropic' and 'pocketbook' voters, but rather between 'economic' and 'non-economic' voters.

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there does not exist a rigorous study on how exactly cleaning affects ML: the ML community usually focuses on developing ML algorithms that are robust to some particular noise types of certain distributions, while the database (DB) community has mostly been studying the problem of data cleaning alone, without considering how data is consumed by downstream ML analytics. We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we also control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers. Comment: published in ICDE 202
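    For readers unfamiliar with the Benjamini-Yekutieli procedure mentioned above, a minimal sketch of the standard step-up rule follows. The function name and the example p-values are hypothetical, and nothing here is taken from the CleanML code base: sort the m p-values, find the largest rank k with p_(k) <= k * alpha / (m * c(m)), where c(m) = 1 + 1/2 + ... + 1/m, and reject the k smallest.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard BY step-up rule, which controls the false discovery rate
    under arbitrary dependence between the tests.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction factor c(m) = sum_{i=1}^{m} 1/i.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value falls under its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    print(benjamini_yekutieli([0.001, 0.04, 0.03, 0.5]))
```

    The extra factor c(m) makes BY more conservative than the Benjamini-Hochberg rule, which is the price of not assuming independence among the tests.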

    Direct and mediated impacts of product and process characteristics on consumers’ choice of organic vs. conventional chicken

    There is a lack of research into why consumers value process characteristics. In this study, we test the hypothesis that the impact of process characteristics such as organic and free-range on consumers’ choices of food products is at least partly mediated through expected eating quality or taste expectations. In other words, the process characteristics partly function as cues to (eating) quality. Using a traditional metric conjoint approach based on an additive model, four product characteristics (production method, price, size and information about farmer and rearing conditions) were varied in a fractional factorial conjoint design, creating nine profiles of whole chickens. A total of 384 respondents rated the nine different chickens in terms of taste expectations and willingness to buy. Since the nine records for each respondent are not independent, we used linear mixed modelling for the mediation analysis. We find that, as expected, taste expectations are a strong predictor of willingness to buy. As hypothesized, the impact of both product and process characteristics on willingness to buy is at least partly mediated through taste expectations. Hence, the study shows that process characteristics are important for consumers, not only in and of themselves, but partly because consumers make inferences about eating quality from knowledge about such process characteristics.

    Assessment of hydrological and seasonal controls over the nitrate flushing from a forested watershed using a data mining technique

    A data mining regression tree algorithm, M5, was used to review the role of mutual hydrological and seasonal settings which control the streamwater nitrate flushing during hydrological events within a forested watershed in the southwestern part of Slovenia, characterized by a distinctive flushing, almost torrential hydrological regime. The basis for the research was an extensive dataset of continuous, high-frequency measurements of seasonal meteorological conditions, watershed hydrological responses and streamwater nitrate concentrations. The dataset contained 16 recorded hydrographs occurring in different seasonal and hydrological conditions. Based on predefined regression tree pruning criteria, a regression tree model was obtained that is comprehensible in the sense of the domain knowledge and able to adequately describe most of the streamwater nitrate concentration variations (RMSE = 1.02 mg/l-N; r = 0.91). The attributes which were found to be the most descriptive in the sense of streamwater nitrate concentrations were the antecedent precipitation index (API) and air temperatures in the preceding periods. The model was most successful in describing streamwater concentrations in the range 1-4 mg/l-N, covering a large proportion of the dataset. The model performance was a little worse in the periods of high streamwater nitrate concentration peaks during the summer hydrographs (up to 7 mg/l-N) but poor during the autumn hydrograph (up to 14 mg/l-N) related to highly variable hydrological conditions, which would require a less robust regression tree model based on the extended dataset.