Automated data pre-processing via meta-learning
The final publication is available at link.springer.com.
A data mining algorithm may perform differently on datasets with different characteristics; for example, it might perform better on a dataset with continuous attributes than on one with categorical attributes, or the other way around.
In practice, a dataset usually needs to be pre-processed. Given all the possible pre-processing operators, the number of alternatives is staggeringly large, and inexperienced users become overwhelmed.
We show that this problem can be addressed by an automated approach that leverages ideas from meta-learning.
Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result of the algorithm on the respective dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results.
Peer Reviewed. Postprint (published version)
Scalable Privacy-Compliant Virality Prediction on Twitter
The digital town hall of Twitter has become a preferred medium of communication
for individuals and organizations across the globe. Some of them reach
audiences of millions, while others struggle to get noticed. Given the impact
of social media, the question remains more relevant than ever: how to model the
dynamics of attention in Twitter. Researchers around the world turn to machine
learning to predict the most influential tweets and authors, navigating the
volume, velocity, and variety of social big data, with many compromises. In
this paper, we revisit content popularity prediction on Twitter. We argue that
strict alignment of data acquisition, storage and analysis algorithms is
necessary to avoid the common trade-offs between scalability, accuracy and
privacy compliance. We propose a new framework for the rapid acquisition of
large-scale datasets, high accuracy supervisory signal and multilanguage
sentiment prediction while respecting every privacy request applicable. We then
apply a novel gradient boosting framework to achieve state-of-the-art results
in virality ranking, even before including tweets' visual or propagation
features. Our Gradient Boosted Regression Tree is the first to offer
explainable, strong ranking performance on benchmark datasets. Since the
analysis focused on features available early, the model is immediately
applicable to incoming tweets in 18 languages.
Comment: AffCon@AAAI-19 Best Paper Award; Presented at AAAI-19 W1: Affective
Content Analysi
Consistent use of a combination product versus a single product in a safety trial of the diaphragm and microbicide in Harare, Zimbabwe.
BACKGROUND: We examined the use and acceptability of a combination product (diaphragm and gel) compared to a single product (gel) during a 6-month safety trial in Zimbabwe.
STUDY DESIGN: Women were randomized to the use of a diaphragm with gel or the use of gel alone, in addition to male condoms. Ever use and use of study product on the last act of sexual intercourse were assessed monthly by Audio Computer-Assisted Self-Interviewing. Acceptability, correct use and consistent use (use at every sexual act during the previous 3 months) were measured on the last visit by face-to-face interview. Predictors of consistent use were examined using multivariate logistic regression analyses.
RESULTS: In this sample of 117 sexually active, monogamous, contracepting women, rates of consistent use were similar in both groups (59.7% for combination method vs. 56.4% for gel alone). Product acceptability was high, but was not independently associated with consistent use. Independent predictors of consistent use included age [adjusted odds ratio (AOR)=1.08; 95% confidence interval (95% CI)=1.01-1.16], consistent condom use (AOR=3.85; 95% CI=1.54-9.63) and having a partner who approves of product use (AOR=2.66; 95% CI=1.10-6.39).
CONCLUSIONS: Despite high reported acceptability and few problems with the products, the participants reported only moderate product adherence levels. Consistent use of condoms and consistent use of products were strongly associated. If observed in other studies, this may bias the estimation of product effectiveness in future trials of female-controlled methods.
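The adjusted odds ratios and intervals reported above can be sanity-checked with a short sketch. It assumes the intervals are Wald-type (symmetric on the log-odds scale, with z = 1.96); the abstract does not say which interval method was used. Under that assumption, each point estimate should sit near the geometric midpoint of its bounds, and the standard error of the log odds ratio can be backed out from the interval width. The function names are illustrative only.

```python
import math

def geometric_midpoint(lower, upper):
    """Midpoint of a CI on the log scale, mapped back to the odds-ratio scale."""
    return math.sqrt(lower * upper)

def implied_log_se(lower, upper, z=1.96):
    """Standard error of log(OR) implied by a Wald-type CI (assumption: z = 1.96)."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# (label, reported AOR, lower bound, upper bound) as given in the abstract
reported = [
    ("age", 1.08, 1.01, 1.16),
    ("consistent condom use", 3.85, 1.54, 9.63),
    ("partner approves of product use", 2.66, 1.10, 6.39),
]

for label, aor, lower, upper in reported:
    mid = geometric_midpoint(lower, upper)
    se = implied_log_se(lower, upper)
    print(f"{label}: reported AOR={aor}, geometric midpoint={mid:.3f}, implied SE(log OR)={se:.3f}")
```

If the midpoints agree with the reported AORs to within rounding, the figures are at least internally consistent with log-scale symmetric intervals.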
The epidemiology of HIV among young people in sub-Saharan Africa: know your local epidemic and its implications for prevention.
BACKGROUND: Broad patterns of HIV epidemiology are frequently used to design generic HIV programs in sub-Saharan Africa.
METHODS: We reviewed the epidemiology of HIV among young people in sub-Saharan Africa, and explored the unique dynamics of infection in its different regions.
RESULTS: In 2009, HIV prevalence among youth in sub-Saharan Africa was an estimated 1.4% in males and 3.4% in females, but these values mask wide variation at regional and national levels. Within countries there are further major differences in HIV prevalence, such as by sex, urban/rural location, economic status, education, or ethnic group. Within this highly nuanced context, HIV prevention programs targeting youth must consider both where new infections are occurring and where they are coming from.
CONCLUSIONS: Given the epidemiology, one-size-fits-all HIV prevention programs are usually inappropriate at regional and national levels. Consideration of local context and risk associated with life transitions, such as leaving school or getting married, is imperative to successful programming for young people.
A mathematical theory of semantic development in deep neural networks
An extensive body of empirical research has revealed remarkable regularities
in the acquisition, organization, deployment, and neural representation of
human semantic knowledge, thereby raising a fundamental conceptual question:
what are the theoretical principles governing the ability of neural networks to
acquire, organize, and deploy abstract knowledge by integrating across many
individual experiences? We address this question by mathematically analyzing
the nonlinear dynamics of learning in deep linear networks. We find exact
solutions to this learning dynamics that yield a conceptual explanation for the
prevalence of many disparate phenomena in semantic cognition, including the
hierarchical differentiation of concepts through rapid developmental
transitions, the ubiquity of semantic illusions between such transitions, the
emergence of item typicality and category coherence as factors controlling the
speed of semantic processing, changing patterns of inductive projection over
development, and the conservation of semantic similarity in neural
representations across species. Thus, surprisingly, our simple neural model
qualitatively recapitulates many diverse regularities underlying semantic
development, while providing analytic insight into how the statistical
structure of an environment can interact with nonlinear deep learning dynamics
to give rise to these regularities.
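The idea that a linear network can still have nonlinear learning dynamics can be illustrated with a toy sketch. This is not the paper's model or its exact solutions: it is a hypothetical scalar two-layer linear network `y = w2 * w1 * x` trained by plain gradient descent to match a target strength `s`, with an assumed small initialization and learning rate. The product `w2 * w1` stays near zero for a while, rises quickly, then plateaus at `s`, and a larger `s` makes the transition happen earlier.

```python
def train_scalar_deep_linear(s, w_init=0.01, lr=0.01, steps=2000):
    """Gradient descent on 0.5 * (s - w2 * w1)**2 for a scalar two-layer linear net.

    Returns the trajectory of the end-to-end weight w2 * w1, one value per step.
    """
    w1 = w2 = w_init
    trajectory = []
    for _ in range(steps):
        err = s - w2 * w1
        grad_step_w1 = err * w2   # negative gradient with respect to w1
        grad_step_w2 = err * w1   # negative gradient with respect to w2
        w1 += lr * grad_step_w1
        w2 += lr * grad_step_w2
        trajectory.append(w2 * w1)
    return trajectory

def steps_to_half(trajectory, s):
    """First step at which the end-to-end weight reaches half of its target."""
    for step, value in enumerate(trajectory):
        if value >= s / 2:
            return step
    return None

strong = train_scalar_deep_linear(4.0)  # strong input-output mode
weak = train_scalar_deep_linear(1.0)    # weak input-output mode
print("steps to half-learned (strong):", steps_to_half(strong, 4.0))
print("steps to half-learned (weak):  ", steps_to_half(weak, 1.0))
```

The update for each weight is scaled by the other weight, which is what makes the trajectory sigmoidal even though the input-output map is linear.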
Have Econometric Analyses of Happiness Data Been Futile? A Simple Truth About Happiness Scales
Econometric analyses in the happiness literature typically use subjective
well-being (SWB) data to compare the mean of observed or latent happiness
across samples. Recent critiques show that comparing the mean of ordinal data
is only valid under strong assumptions that are usually rejected by SWB data.
This raises the open question of whether many of the empirical studies in the
economics of happiness literature have been futile. In order to salvage some of
the prior results and avoid future issues, we suggest regression analysis of
SWB (and other ordinal data) should focus on the median rather than the mean.
Median comparisons using parametric models such as the ordered probit and logit
can be readily carried out using familiar statistical software such as Stata. We
also show that estimating a semiparametric median ordered-response model, a task
previously assumed to be impractical, is possible using a novel constrained
mixed integer optimization technique. We use GSS data to show the famous
Easterlin Paradox from the happiness literature holds for the US independent of
any parametric assumption.
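The concern about comparing means of ordinal data can be made concrete with a small sketch. The two groups and the relabeling below are made-up illustrative values, not data from the paper. Because ordinal categories carry only order, any order-preserving relabeling of the scale is equally legitimate; the sketch shows such a relabeling reversing the ranking of the group means while leaving the ranking of the medians unchanged.

```python
from statistics import mean, median_low

# Hypothetical responses on a 3-point happiness scale (1 < 2 < 3).
group_a = [2, 2, 2]
group_b = [1, 3, 3]

# An order-preserving relabeling of the same three categories (assumed for illustration).
relabel = {1: 0, 2: 9, 3: 10}
group_a_relabeled = [relabel[x] for x in group_a]
group_b_relabeled = [relabel[x] for x in group_b]

def ranking(stat, a, b):
    """Return +1 if b ranks above a under `stat`, -1 if below, 0 if tied."""
    sa, sb = stat(a), stat(b)
    return (sb > sa) - (sb < sa)

mean_before = ranking(mean, group_a, group_b)
mean_after = ranking(mean, group_a_relabeled, group_b_relabeled)
median_before = ranking(median_low, group_a, group_b)
median_after = ranking(median_low, group_a_relabeled, group_b_relabeled)

print("mean ranking before/after relabeling:  ", mean_before, mean_after)
print("median ranking before/after relabeling:", median_before, median_after)
```

`median_low` is used so that the median is always one of the observed categories, which keeps the comparison meaningful for ordinal labels.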
Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation
Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see if it is better to tolerate missing data or to try to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
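The general idea of k-NN imputation can be sketched as follows. This is a minimal illustration, not the procedure evaluated in the paper: the distance (root mean squared difference over the columns both rows have), the choice of `k`, the use of `None` for missing values, and the toy rows are all assumptions made for the example.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is computed only over columns observed in both rows; donor rows
    must have an observed value in the column being imputed.
    """
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_idx, other in enumerate(rows):
                if other_idx == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if candidates:
                candidates.sort(key=lambda pair: pair[0])
                nearest = candidates[:k]
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Toy project records: [size, effort]; the second record is missing its effort.
projects = [[1.0, 10.0], [1.1, None], [5.0, 50.0], [1.2, 12.0]]
completed = knn_impute(projects, k=2)
print(completed[1])
```

The completed rows could then be passed to a decision tree learner, which is the "impute first, then learn" option the abstract compares against letting the learner tolerate the missing values directly.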