The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation
Opt-in surveys are the most widespread method used to study participation in online communities, but produce biased results in the absence of adjustments for non-response. A 2008 survey conducted by the Wikimedia Foundation and United Nations University at Maastricht is the source of a frequently cited statistic that less than 13% of Wikipedia contributors are female. However, the same study suggested that only 39.9% of Wikipedia readers in the US were female – a finding contradicted by a representative survey of American adults by the Pew Research Center conducted less than two months later. Combining these two datasets through an application and extension of a propensity score estimation technique used to model survey non-response bias, we construct revised estimates, contingent on explicit assumptions, for several of the Wikimedia Foundation and United Nations University at Maastricht claims about Wikipedia editors. We estimate that the proportion of female US adult editors was 27.5% higher than the original study reported (22.7%, versus 17.8%), and that the total proportion of female editors was 26.8% higher (16.1%, versus 12.7%). Sloan School of Management; Harvard University. Berkman Center for Internet & Society; Ford Visionary Leadership Fund; Northwestern University (Evanston, Ill.
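The "27.5% higher" and "26.8% higher" figures in this abstract are relative increases, not percentage-point differences. The short check below reproduces them from the four percentages quoted above; the function name `relative_increase` is a hypothetical helper introduced only for illustration.

```python
def relative_increase(revised: float, original: float) -> float:
    """Return how much larger `revised` is than `original`, as a percentage."""
    return (revised - original) / original * 100.0


# Percentages quoted in the abstract: revised estimate versus original survey.
us_adult_editors = relative_increase(22.7, 17.8)
all_editors = relative_increase(16.1, 12.7)

print(f"US adult female editors: {us_adult_editors:.1f}% higher")
print(f"All female editors: {all_editors:.1f}% higher")
```

In percentage points the two gaps are 4.9 and 3.4 respectively, which is why the relative figures look much larger than the raw differences.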
A Study of Machine Learning Techniques for Daily Solar Energy Forecasting using Numerical Weather Models
Proceedings of: 8th International Symposium on Intelligent Distributed Computing (IDC'2014). Madrid, September 3-5, 2014. Forecasting solar energy is becoming an important issue in the context of renewable energy sources, and Machine Learning Algorithms play an important role in this field. The prediction of solar energy can be addressed as a time series prediction problem using historical data. Also, solar energy forecasting can be derived from numerical weather prediction models (NWP). Our interest is focused on the latter approach. We focus on the problem of predicting solar energy from NWP computed from GEFS, the Global Ensemble Forecast System, which predicts meteorological variables for points in a grid. In this context, it can be useful to know how prediction accuracy improves depending on the number of grid nodes used as input for the machine learning techniques. However, using the variables from a large number of grid nodes can result in many attributes, which might degrade the generalization performance of the learning algorithms. In this paper both issues are studied using data supplied by Kaggle for the State of Oklahoma, comparing Support Vector Machines and Gradient Boosted Regression. Also, three different feature selection methods have been tested: Linear Correlation, the ReliefF algorithm, and a new method based on local information analysis.
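As a rough illustration of the simplest of the three feature selection methods named above, the sketch below ranks input attributes by the absolute Pearson correlation with the target and keeps the top k. This is a generic filter, not the authors' implementation; the column names, the toy values, and the helper `top_k_by_correlation` are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_k_by_correlation(columns, target, k):
    """Return the k column names most linearly correlated with the target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: one attribute per grid node, five days of forecasts.
columns = {
    "node_a_radiation": [1.0, 2.0, 3.0, 4.0, 5.0],
    "node_b_radiation": [5.0, 3.0, 4.0, 1.0, 2.0],
    "node_c_constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.9]  # hypothetical daily solar energy

selected = top_k_by_correlation(columns, target, k=2)
print(selected)
```

A filter of this kind scores each attribute independently, so it cannot detect attributes that are only useful in combination; that limitation is one reason to compare it against methods such as ReliefF.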
Constraint Handling in Efficient Global Optimization
This is the author accepted manuscript. The final version is available from ACM via the DOI in this record. Real-world optimization problems are often subject to several constraints which are expensive to evaluate in terms of cost or time. Although a lot of effort is devoted to making use of surrogate models for expensive optimization tasks, not many strong surrogate-assisted algorithms can address the challenging constrained problems. Efficient Global Optimization (EGO) is a Kriging-based surrogate-assisted algorithm. It was originally proposed to address unconstrained problems and was later modified to solve constrained problems. However, these types of algorithms still suffer from several issues, mainly: (1) early stagnation, (2) problems with multiple active constraints and (3) frequent crashes. In this work, we introduce a new EGO-based algorithm which tries to overcome these common issues with Kriging optimization algorithms. We apply the proposed algorithm to problems with dimension d ≤ 4 from the G-function suite [16] and to an airfoil shape example. This research was partly funded by Tekes, the Finnish Funding Agency for Innovation (the DeCoMo project), and by the Engineering and Physical Sciences Research Council [grant numbers EP/N017195/1, EP/N017846/1]
Cross-sectional survey of users of internet depression communities
Background: Internet-based depression communities provide a forum for individuals to
communicate and share information and ideas. There has been little research into the health status
and other characteristics of users of these communities.
Methods: Online cross-sectional survey of Internet depression communities to identify depressive
morbidity among users of Internet depression communities in six European countries; to
investigate whether users were in contact with health services and receiving treatment; and to
identify user-perceived effects of the communities.
Results: Major depression was highly prevalent among respondents (varying by country from 40%
to 64%). Forty-nine percent of users meeting criteria for major depression were not receiving
treatment, and 35% had no consultation with health services in the previous year. Thirty-six
percent of repeat community users who had consulted a health professional in the previous year
felt that the Internet community had been an important factor in deciding to seek professional help.
Conclusions: There are high levels of untreated and undiagnosed depression in users of Internet
depression communities. This group represents a target for intervention. Internet communities can
provide information and support for stigmatizing conditions that inhibit more traditional modes of
information seeking
Modeling User Search Behavior for Masquerade Detection
Masquerade attacks are a common security problem that is a consequence of identity theft. This paper extends prior work by modeling user search behavior to detect deviations indicating a masquerade attack. We hypothesize that each individual user knows their own file system well enough to search in a limited, targeted and unique fashion in order to find information germane to their current task. Masqueraders, on the other hand, will likely not know the file system and layout of another user's desktop, and would likely search more extensively and broadly in a manner that is different than the victim user being impersonated. We identify actions linked to search and information access activities, and use them to build user models. The experimental results show that modeling search behavior reliably detects all masqueraders with a very low false positive rate of 1.1%, far better than prior published results. The limited set of features used for search behavior modeling also results in large performance gains over the same modeling techniques that use larger sets of features
Feasibility, reliability, and validity of adolescent health status measurement by the Child Health Questionnaire Child Form (CHQ-CF): internet administration compared with the standard paper version
AIMS: In this study we evaluated indicators of the feasibility, reliability, and validity of the Child Health Questionnaire-Child Form (CHQ-CF). We compared the results in a subgroup of adolescents who completed the standard paper version of the CHQ-CF with the results in another subgroup of adolescents who completed an internet version, i.e., an online, web-based CHQ-CF questionnaire. METHODS: Under supervision at school, 1,071 adolescents were randomized to complete the CHQ-CF and items on chronic conditions by a paper questionnaire or by an internet-administered questionnaire. RESULTS: The participation rate was 87%; age range 13–17 years. The internet administration resulted in fewer missing answers. All but one multi-item scale showed internal consistency reliability (Cronbach’s α > 0.70). All scales clearly discriminated between adolescents with no, a few, or many self-reported chronic conditions. The paper administration resulted in statistically significantly higher scores on 4 of 10 CHQ-CF scales compared with the internet administration (P < 0.05), but Cohen’s effect sizes d were ≤0.21. Mode of administration interacted significantly with age (P < 0.05) on four CHQ-CF scales, but Cohen’s effect sizes for these differences were also ≤0.21. CONCLUSION: This study supports the feasibility, internal consistency reliability of the scales, and construct validity of the CHQ-CF administered by either a paper questionnaire or online questionnaire. Given Cohen’s suggested guidelines for the interpretation of effect sizes, i.e., 0.20–0.50 indicates a small effect, differences in CHQ-CF scale scores between paper and internet administration can be considered as negligible or small
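For readers unfamiliar with the effect size reported here, the sketch below computes Cohen's d as the difference between two group means divided by their pooled standard deviation. The abstract does not say which variant of d the authors used, so the pooled-SD form is an assumption, and the scale scores are invented purely for illustration.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical scale scores, not data from the study.
paper_scores = [80.0, 85.0, 90.0, 75.0, 70.0]
internet_scores = [79.0, 84.0, 88.0, 74.0, 70.0]

d = cohens_d(paper_scores, internet_scores)
print(f"Cohen's d = {d:.2f}")
```

With these invented scores, d falls below 0.20, which under the guideline quoted in the abstract would count as smaller than a "small" effect.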