
    Automated data pre-processing via meta-learning

    The final publication is available at link.springer.com. A data mining algorithm may perform differently on datasets with different characteristics; e.g., it might perform better on a dataset with continuous attributes than on one with categorical attributes, or the other way around. As a matter of fact, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives, and inexperienced users become overwhelmed. We show that this problem can be addressed by an automated approach, leveraging ideas from meta-learning. Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result of the algorithm on the respective dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results. Peer Reviewed. Postprint (published version)

    Scalable Privacy-Compliant Virality Prediction on Twitter

    The digital town hall of Twitter has become a preferred medium of communication for individuals and organizations across the globe. Some of them reach audiences of millions, while others struggle to get noticed. Given the impact of social media, the question remains more relevant than ever: how to model the dynamics of attention on Twitter. Researchers around the world turn to machine learning to predict the most influential tweets and authors, navigating the volume, velocity, and variety of social big data, with many compromises. In this paper, we revisit content popularity prediction on Twitter. We argue that strict alignment of data acquisition, storage and analysis algorithms is necessary to avoid the common trade-offs between scalability, accuracy and privacy compliance. We propose a new framework for the rapid acquisition of large-scale datasets, a high-accuracy supervisory signal and multilanguage sentiment prediction while respecting every applicable privacy request. We then apply a novel gradient boosting framework to achieve state-of-the-art results in virality ranking, even before including a tweet's visual or propagation features. Our Gradient Boosted Regression Tree is the first to offer explainable, strong ranking performance on benchmark datasets. Since the analysis focused on features available early, the model is immediately applicable to incoming tweets in 18 languages. Comment: AffCon@AAAI-19 Best Paper Award; Presented at AAAI-19 W1: Affective Content Analysis

    Consistent use of a combination product versus a single product in a safety trial of the diaphragm and microbicide in Harare, Zimbabwe.

    BACKGROUND: We examined the use and acceptability of a combination product (diaphragm and gel) compared to a single product (gel) during a 6-month safety trial in Zimbabwe. STUDY DESIGN: Women were randomized to the use of a diaphragm with gel or the use of gel alone, in addition to male condoms. Ever use and use of study product on the last act of sexual intercourse were assessed monthly by Audio Computer-Assisted Self-Interviewing. Acceptability, correct use and consistent use (use at every sexual act during the previous 3 months) were measured on the last visit by face-to-face interview. Predictors of consistent use were examined using multivariate logistic regression analyses. RESULTS: In this sample of 117 sexually active, monogamous, contracepting women, rates of consistent use were similar in both groups (59.7% for combination method vs. 56.4% for gel alone). Product acceptability was high, but was not independently associated with consistent use. Independent predictors of consistent use included age [adjusted odds ratio (AOR)=1.08; 95% confidence interval (95% CI)=1.01-1.16], consistent condom use (AOR=3.85; 95% CI=1.54-9.63) and having a partner who approves of product use (AOR=2.66; 95% CI=1.10-6.39). CONCLUSIONS: Despite high reported acceptability and few problems with the products, the participants reported only moderate product adherence levels. Consistent use of condoms and consistent use of products were strongly associated. If observed in other studies, this may bias the estimation of product effectiveness in future trials of female-controlled methods

    The epidemiology of HIV among young people in sub-Saharan Africa: know your local epidemic and its implications for prevention.

    BACKGROUND: Broad patterns of HIV epidemiology are frequently used to design generic HIV programs in sub-Saharan Africa. METHODS: We reviewed the epidemiology of HIV among young people in sub-Saharan Africa, and explored the unique dynamics of infection in its different regions. RESULTS: In 2009, HIV prevalence among youth in sub-Saharan Africa was an estimated 1.4% in males and 3.4% in females, but these values mask wide variation at regional and national levels. Within countries there are further major differences in HIV prevalence, such as by sex, urban/rural location, economic status, education, or ethnic group. Within this highly nuanced context, HIV prevention programs targeting youth must consider both where new infections are occurring and where they are coming from. CONCLUSIONS: Given the epidemiology, one-size-fits-all HIV prevention programs are usually inappropriate at regional and national levels. Consideration of local context and risk associated with life transitions, such as leaving school or getting married, is imperative to successful programming for young people

    A mathematical theory of semantic development in deep neural networks

    An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: what are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep learning dynamics to give rise to these regularities
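The phrase "nonlinear dynamics of learning in deep linear networks" can look contradictory at first. The following is a minimal toy sketch, not the paper's analysis: a scalar two-layer linear network whose output is `w2 * w1 * x`, trained by gradient descent to match a target strength `s`. The function names, learning rate, and initial weights are all illustrative assumptions. Although the input-output map is linear, the learned product `w1 * w2` follows an S-shaped trajectory: a long plateau, a rapid transition, then saturation.

```python
def train_scalar_linear_net(s, w_init=0.01, lr=0.01, steps=2000):
    """Gradient descent on 0.5 * (s - w2 * w1)**2 for a scalar two-layer linear net.

    Returns the trajectory of the product w1 * w2 (hypothetical toy setup).
    """
    w1 = w2 = w_init
    trajectory = []
    for _ in range(steps):
        error = s - w1 * w2
        # Simultaneous update: each weight's gradient is scaled by the other weight,
        # which is what makes the dynamics nonlinear in the weights.
        w1, w2 = w1 + lr * w2 * error, w2 + lr * w1 * error
        trajectory.append(w1 * w2)
    return trajectory


def steps_to_half(trajectory, s):
    """Index of the first step at which the product reaches half the target strength."""
    for step, value in enumerate(trajectory):
        if value >= s / 2:
            return step
    return None


strong = train_scalar_linear_net(s=3.0)
weak = train_scalar_linear_net(s=1.0)
print("steps to half strength (s=3):", steps_to_half(strong, 3.0))
print("steps to half strength (s=1):", steps_to_half(weak, 1.0))
```

In this toy, the stronger target reaches half its final value in fewer steps than the weaker one, so structure of different strengths is picked up at different times even though the network itself computes only a linear map.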

    Have Econometric Analyses of Happiness Data Been Futile? A Simple Truth About Happiness Scales

    Econometric analyses in the happiness literature typically use subjective well-being (SWB) data to compare the mean of observed or latent happiness across samples. Recent critiques show that comparing the means of ordinal data is only valid under strong assumptions that are usually rejected by SWB data. This raises the open question of whether many of the empirical studies in the economics of happiness literature have been futile. In order to salvage some of the prior results and avoid future issues, we suggest that regression analysis of SWB (and other ordinal data) should focus on the median rather than the mean. Median comparisons using parametric models such as the ordered probit and logit can be readily carried out using familiar statistical software such as Stata. We also show that estimating a semiparametric median ordered-response model, a task previously assumed to be impractical, is possible using a novel constrained mixed integer optimization technique. We use GSS data to show that the famous Easterlin Paradox from the happiness literature holds for the US independent of any parametric assumption
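As a rough illustration of a median comparison in an ordered probit, the sketch below assumes the standard formulation in which a latent score equals a linear index plus standard normal noise, and the observed category is determined by which pair of cutpoints brackets the latent score. The cutpoints and index values are made up for illustration and do not come from the paper or from GSS data; the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probs(index, cutpoints):
    """Probabilities of each ordered category given a linear index and sorted cutpoints."""
    # P(Y <= j) = Phi(c_j - index); the last category has cumulative probability 1.
    cumulative = [normal_cdf(c - index) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


def median_category(index, cutpoints):
    """Smallest category whose cumulative probability reaches one half."""
    running = 0.0
    for category, p in enumerate(category_probs(index, cutpoints)):
        running += p
        if running >= 0.5:
            return category
    return len(cutpoints)  # Fallback for floating-point rounding in the running sum


# Illustrative three-point scale (0 = not too happy, 1 = pretty happy, 2 = very happy).
cutpoints = [-1.0, 1.0]
for index in (-2.0, 0.0, 1.5):
    print(index, median_category(index, cutpoints))
```

Because the noise is symmetric around zero, the median category is simply the one whose cutpoints bracket the linear index. Any order-preserving relabeling of the categories leaves that comparison unchanged, which is not true of a comparison of means.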

    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact upon prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%
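For readers unfamiliar with k-NN imputation, the following is a minimal sketch of the general idea, not the procedure evaluated in the paper. It assumes numeric features, marks missing values with `None`, measures distance only over features that both rows have observed, and fills each gap with the mean of the `k` nearest donors. The function name and the toy project data are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of rows with None entries replaced by the mean of k nearest donors."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor must be a different row with an observed value in column j.
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                # Root mean squared difference over the features both rows observed.
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data: [team size, effort]; the third project is missing its effort value.
projects = [[1.0, 10.0], [2.0, 20.0], [1.5, None], [10.0, 100.0]]
print(knn_impute(projects, k=2))
```

Distances are computed from the original rows, so an imputed value never influences another imputation. In practice the features would usually be scaled first, since an unscaled feature with a large range dominates the distance.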