Automated data pre-processing via meta-learning

Author: A Guazzelli
A Kalousis
D Pyle
F Serban
J Vanschoren
J-U Kietz
M Hall
MA Munson
SF Crone
T Dasu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

The final publication is available at link.springer.com. A data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. As a matter of fact, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives, and inexperienced users become overwhelmed. We show that this problem can be addressed by an automated approach, leveraging ideas from meta-learning. Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result of the algorithm on the respective dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results. Peer Reviewed. Postprint (published version).

Crossref

UPCommons. Portal del coneixement obert de la UPC

Scalable Privacy-Compliant Virality Prediction on Twitter

Author: Kowalczyk Damian Konrad
Larsen Jan
Publication date: 01/01/2019

The digital town hall of Twitter has become a preferred medium of communication for individuals and organizations across the globe. Some of them reach audiences of millions, while others struggle to get noticed. Given the impact of social media, the question remains more relevant than ever: how to model the dynamics of attention on Twitter. Researchers around the world turn to machine learning to predict the most influential tweets and authors, navigating the volume, velocity, and variety of social big data, with many compromises. In this paper, we revisit content popularity prediction on Twitter. We argue that strict alignment of data acquisition, storage and analysis algorithms is necessary to avoid the common trade-offs between scalability, accuracy and privacy compliance. We propose a new framework for the rapid acquisition of large-scale datasets, high accuracy supervisory signal and multilanguage sentiment prediction while respecting every privacy request applicable. We then apply a novel gradient boosting framework to achieve state-of-the-art results in virality ranking, even before including a tweet's visual or propagation features. Our Gradient Boosted Regression Tree is the first to offer explainable, strong ranking performance on benchmark datasets. Since the analysis focused on features available early, the model is immediately applicable to incoming tweets in 18 languages. Comment: AffCon@AAAI-19 Best Paper Award; Presented at AAAI-19 W1: Affective Content Analysis

arXiv.org e-Print Archive

Online Research Database In Technology

Consistent use of a combination product versus a single product in a safety trial of the diaphragm and microbicide in Harare, Zimbabwe.

Author: Clouse Kate
Hammond Nii
Mauck Christine
Moore Jie
Napierala Sue
Padian Nancy
van der Straten Ariane
Publication venue: 'Elsevier BV'
Publication date: 25/04/2008

BACKGROUND: We examined the use and acceptability of a combination product (diaphragm and gel) compared to a single product (gel) during a 6-month safety trial in Zimbabwe. STUDY DESIGN: Women were randomized to the use of a diaphragm with gel or the use of gel alone, in addition to male condoms. Ever use and use of study product on the last act of sexual intercourse were assessed monthly by Audio Computer-Assisted Self-Interviewing. Acceptability, correct use and consistent use (use at every sexual act during the previous 3 months) were measured on the last visit by face-to-face interview. Predictors of consistent use were examined using multivariate logistic regression analyses. RESULTS: In this sample of 117 sexually active, monogamous, contracepting women, rates of consistent use were similar in both groups (59.7% for combination method vs. 56.4% for gel alone). Product acceptability was high, but was not independently associated with consistent use. Independent predictors of consistent use included age [adjusted odds ratio (AOR)=1.08; 95% confidence interval (95% CI)=1.01-1.16], consistent condom use (AOR=3.85; 95% CI=1.54-9.63) and having a partner who approves of product use (AOR=2.66; 95% CI=1.10-6.39). CONCLUSIONS: Despite high reported acceptability and few problems with the products, the participants reported only moderate product adherence levels. Consistent use of condoms and consistent use of products were strongly associated. If observed in other studies, this may bias the estimation of product effectiveness in future trials of female-controlled methods

LSHTM Research Online

The epidemiology of HIV among young people in sub-Saharan Africa: know your local epidemic and its implications for prevention.

Author: Changalucha John
Doyle Aoife M
Napierala Mavedzenge Sue
Olson Rick
Ross David A
Publication venue: Elsevier
Publication date: 01/01/2011

BACKGROUND: Broad patterns of HIV epidemiology are frequently used to design generic HIV programs in sub-Saharan Africa. METHODS: We reviewed the epidemiology of HIV among young people in sub-Saharan Africa, and explored the unique dynamics of infection in its different regions. RESULTS: In 2009, HIV prevalence among youth in sub-Saharan Africa was an estimated 1.4% in males and 3.4% in females, but these values mask wide variation at regional and national levels. Within countries there are further major differences in HIV prevalence, such as by sex, urban/rural location, economic status, education, or ethnic group. Within this highly nuanced context, HIV prevention programs targeting youth must consider both where new infections are occurring and where they are coming from. CONCLUSIONS: Given the epidemiology, one-size-fits-all HIV prevention programs are usually inappropriate at regional and national levels. Consideration of local context and risk associated with life transitions, such as leaving school or getting married, is imperative to successful programming for young people

Crossref

LSHTM Research Online

A mathematical theory of semantic development in deep neural networks

Author: Ganguli Surya
McClelland James L.
Saxe Andrew M.
Publication date: 23/10/2018

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: what are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep learning dynamics to give rise to these regularities
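The abstract above does not state the model equations, so the following is only a toy sketch of the kind of setting it describes: gradient descent on a two-layer linear network reduced to two scalar weights. The function name, the target strength `s`, the initial weight and the learning rate are all illustrative assumptions, not values from the paper. Even though the network is linear in its input, the update is nonlinear in the weights, and the learned product `w2 * w1` stays near zero for a while before rising quickly to `s`.

```python
def train_two_layer_linear(s=2.0, w_init=0.01, lr=0.01, steps=2000):
    """Gradient descent on 0.5 * (s - w2 * w1)**2 for scalar weights w1, w2.

    Returns the product w2 * w1 after every step (hypothetical toy setup).
    """
    w1, w2 = w_init, w_init
    trajectory = []
    for _ in range(steps):
        error = s - w2 * w1
        # Simultaneous update: both right-hand sides use the pre-step weights.
        w1, w2 = w1 + lr * error * w2, w2 + lr * error * w1
        trajectory.append(w1 * w2)
    return trajectory


if __name__ == "__main__":
    traj = train_two_layer_linear()
    for step in (0, 50, 150, 250, 350, 1999):
        print(step, round(traj[step], 4))
```

Printing the trajectory at a few steps shows a plateau followed by a rapid transition, which is the qualitative behaviour the abstract refers to as rapid developmental transitions.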

arXiv.org e-Print Archive

Have Econometric Analyses of Happiness Data Been Futile? A Simple Truth About Happiness Scales

Author: Chen Le-Yu
Oparina Ekaterina
Powdthavee Nattavudh
Srisuma Sorawoot
Publication date: 20/02/2019

Econometric analyses in the happiness literature typically use subjective well-being (SWB) data to compare the mean of observed or latent happiness across samples. Recent critiques show that comparing the mean of ordinal data is only valid under strong assumptions that are usually rejected by SWB data. This raises the open question of whether many of the empirical studies in the economics of happiness literature have been futile. In order to salvage some of the prior results and avoid future issues, we suggest that regression analysis of SWB (and other ordinal data) should focus on the median rather than the mean. Median comparisons using parametric models such as the ordered probit and logit can be readily carried out using familiar statistical software such as Stata. We also show that estimating a semiparametric median ordered-response model, a task previously assumed to be impractical, is possible using a novel constrained mixed integer optimization technique. We use GSS data to show that the famous Easterlin Paradox from the happiness literature holds for the US independent of any parametric assumption.
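As a rough illustration of what a median comparison in an ordered probit model involves, the sketch below returns the median response category for a given linear index. It assumes the standard ordered probit form P(Y <= j) = Phi(c_j - x'beta); the function name, the cutpoints and the index values are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist


def ordered_probit_median(xb, cutpoints):
    """Return the 0-based median category of an ordered probit model.

    xb        -- linear index x'beta for one observation (hypothetical value)
    cutpoints -- increasing thresholds c_0 < c_1 < ... (hypothetical values)
    """
    std_normal = NormalDist()
    for category, cut in enumerate(cutpoints):
        # Cumulative probability of being in this category or below.
        if std_normal.cdf(cut - xb) >= 0.5:
            return category
    # No threshold reached one half, so the median is the top category.
    return len(cutpoints)


if __name__ == "__main__":
    cuts = [-1.0, 0.5, 2.0]
    for xb in (-2.0, 0.3, 3.0):
        print(xb, ordered_probit_median(xb, cuts))
```

Because the standard normal has median zero, the condition is equivalent to finding the first cutpoint at or above the index, so the median category depends only on where x'beta falls among the thresholds.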

arXiv.org e-Print Archive

LSE Research Online

Warwick Research Archives Portal Repository

DR-NTU (Digital Repository of NTU)

Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

Author: Albrecht
Austin
Baird
Batista
Boehm
Boehm
Breiman
Briand
Briand
Briand
Brockmeier
Cartwright
Cheung
Clark
Feelders
Finnie
Gama
Gray
Holte
Jain
Jeffery
Jun Liu
Jönsson
Kemerer
Khotanzad
Kibler
Kim
Kitchenham
Kohavi
Little
Little
Little
Little
Little
Martin Shepperd
Miranda
Myrtveit
Pickard
Putnam
Qinbao Song
Quinlan
Robins
Rubin
Rubin
Rubin
Rubin
Samson
Selby
Shao
Shepperd
Shepperd
Siedelecki
Song
Song
Srinivasan
Strike
Tabachnick
Tay
Walkerden
Walston
Xiangru Chen
Publication venue: 'Elsevier BV'
Publication date: 01/12/2008

Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see if it is better to tolerate missing data or to try to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that the k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
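For readers unfamiliar with k-NN imputation, a minimal sketch follows. The abstract does not specify the distance measure, the value of k, or how donor cases are chosen, so everything here is an assumption: the hypothetical `knn_impute` function uses only fully observed rows as donors, Euclidean distance over the attributes the incomplete row does have, and the mean of the k nearest donors as the imputed value.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows.

    rows -- list of equal-length lists of floats, with None marking a gap
    k    -- number of donor rows to average (illustrative default)
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one complete row is needed as a donor")

    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        def dist(other, row=row, observed=observed):
            # Euclidean distance over the attributes this row actually has.
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=dist)[:k]
        imputed.append([
            v if v is not None
            else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ])
    return imputed


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
    print(knn_impute(data, k=2))
```

The filled table could then be passed to any learner that expects complete data, which is the "impute, then classify" alternative the abstract compares against letting C4.5 handle missing values itself.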

Crossref

Brunel University Research Archive