2,274 research outputs found

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there is no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning in isolation, without considering how data is consumed by downstream ML analytics. We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers. Comment: published in ICDE 202
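The false discovery rate control mentioned in the abstract can be illustrated with a minimal sketch of the Benjamini-Yekutieli step-up procedure. This is not the CleanML code; the function name `benjamini_yekutieli`, the `alpha` default, and the example p-values are assumptions made for illustration.

```python
import numpy as np

def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BY procedure.

    BY is the Benjamini-Hochberg step-up rule with the threshold divided by
    the harmonic number c(m) = sum_{i=1..m} 1/i, which keeps the FDR bound
    valid under arbitrary dependence between the tests.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    ranks = np.arange(1, m + 1)
    c_m = np.sum(1.0 / ranks)                   # harmonic correction factor
    thresholds = ranks * alpha / (m * c_m)      # k * alpha / (m * c(m))
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()         # largest rank meeting its threshold
        reject[order[: k + 1]] = True           # reject everything up to that rank
    return reject

# Hypothetical p-values from five paired comparisons (illustrative only).
example_mask = benjamini_yekutieli([0.001, 0.008, 0.039, 0.041, 0.9])
```

The step-up rule rejects every hypothesis ranked at or below the largest rank whose p-value meets its threshold, so a p-value can be rejected even if it misses its own threshold, provided a larger one passes.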

    Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning

    This paper presents a hierarchical classification system that automatically categorizes a scholarly publication, using its abstract, into a three-tier hierarchical label set (discipline, field, subfield) in a multi-class setting. This system enables a holistic categorization of research activities in this hierarchy in terms of knowledge production through articles and impact through citations, permitting those activities to fall into multiple categories. The classification system distinguishes 44 disciplines, 718 fields and 1,485 subfields among 160 million abstract snippets in Microsoft Academic Graph (version 2018-05-17). We used batch training in a modularized and distributed fashion to allow for interdisciplinary and interfield classifications in single-label and multi-label settings. In total, we conducted 3,140 experiments across all considered models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers). The classification accuracy is > 90% in 77.13% and 78.19% of the single-label and multi-label classifications, respectively. We examine the advantages of our classification by its ability to better align research texts and output with disciplines, to adequately classify them in an automated way, and to capture the degree of interdisciplinarity. The proposed system (a set of pre-trained models) can serve as a backbone to an interactive system for indexing scientific publications in the future. Comment: Under review in QS

    xFraud: Explainable Fraud Transaction Detection

    On online retail platforms, it is crucial to actively detect the risks of transactions to improve customer experience and minimize financial loss. In this work, we propose xFraud, an explainable fraud transaction prediction framework composed mainly of a detector and an explainer. The xFraud detector can effectively and efficiently predict the legitimacy of incoming transactions. Specifically, it uses a heterogeneous graph neural network to learn expressive representations from the informative, heterogeneously typed entities in the transaction logs. The explainer in xFraud can generate meaningful and human-understandable explanations from graphs to facilitate further processes in the business unit. In our experiments with xFraud on real transaction networks with up to 1.1 billion nodes and 3.7 billion edges, xFraud outperforms various baseline models on many evaluation metrics while remaining scalable in distributed settings. In addition, we show through both quantitative and qualitative evaluations that the xFraud explainer can generate reasonable explanations that significantly assist business analysis. Comment: This is the extended version of a full paper to appear in PVLDB 15 (3) (VLDB 2022

    Turnip mosaic potyvirus probably first spread to Eurasian brassica crops from wild orchids about 1000 years ago

    Turnip mosaic potyvirus (TuMV) is probably the most widespread and damaging virus that infects cultivated brassicas worldwide. Previous work has indicated that the virus originated in western Eurasia, with all of its closest relatives being viruses of monocotyledonous plants. Here we report that we have identified a sister lineage of TuMV-like potyviruses (TuMV-OM) from European orchids. The isolates of TuMV-OM form a monophyletic sister lineage to the brassica-infecting TuMVs (TuMV-BIs), and are nested within a clade of monocotyledon-infecting viruses. Extensive host-range tests showed that all of the TuMV-OMs are biologically similar to, but distinct from, TuMV-BIs and do not readily infect brassicas. We conclude that it is more likely that TuMV evolved from a TuMV-OM-like ancestor than the reverse. We performed Bayesian coalescent analyses using a combination of novel and published sequence data from four TuMV genes [helper component-proteinase protein (HC-Pro), protein 3 (P3), nuclear inclusion b protein (NIb), and coat protein (CP)]. Three genes (HC-Pro, P3, and NIb), but not the CP gene, gave results indicating that the TuMV-BI viruses diverged from TuMV-OMs around 1000 years ago. Only 150 years later, the four lineages of the present global population of TuMV-BIs diverged from one another. These dates are congruent with historical records of the spread of agriculture in Western Europe. From about 1200 years ago, there was a warming of the climate, and agriculture and the human population of the region greatly increased. Farming replaced woodlands, fostering viruses and aphid vectors that could invade the crops, which included several brassica cultivars and weeds. Later, starting 500 years ago, inter-continental maritime trade probably spread the TuMV-BIs to the remainder of the world.

    ABCD Neurocognitive Prediction Challenge 2019: Predicting individual fluid intelligence scores from structural MRI using probabilistic segmentation and kernel ridge regression

    We applied several regression and deep learning methods to predict fluid intelligence scores from T1-weighted MRI scans as part of the ABCD Neurocognitive Prediction Challenge (ABCD-NP-Challenge) 2019. We used voxel intensities and probabilistic tissue-type labels derived from these as features to train the models. The best predictive performance (lowest mean-squared error) came from Kernel Ridge Regression (KRR; λ = 10), which produced a mean-squared error of 69.7204 on the validation set and 92.1298 on the test set. This placed our group in the fifth position on the validation leader board and first place on the final (test) leader board. Comment: Winning entry in the ABCD Neurocognitive Prediction Challenge at MICCAI 2019. 7 pages plus references, 3 figures, 1 tabl
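For readers unfamiliar with kernel ridge regression, the following is a minimal sketch of the technique, not the authors' pipeline. The abstract gives only the regularisation strength λ = 10; the RBF kernel, the `gamma` value, and the function names here are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # clip tiny negatives

def krr_fit(X, y, lam=10.0, gamma=0.1):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=0.1):
    """Predict as a kernel-weighted sum over the training samples."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

def mean_squared_error(y_true, y_pred):
    """The evaluation metric named in the abstract."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```

A larger `lam` shrinks the dual coefficients and smooths the fit. In practice both `lam` and the kernel parameters would be chosen on held-out data, such as the validation set the abstract refers to.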

    Using Whole Genome Sequences to Investigate Adenovirus Outbreaks in a Hematopoietic Stem Cell Transplant Unit

    A recent surge in human mastadenovirus (HAdV) cases, including five deaths, amongst a haematopoietic stem cell transplant population led us to investigate using whole genome sequencing (WGS). We compared sequences from 37 patients collected over a 20-month period with sequences from GenBank and our own database of HAdVs. Maximum likelihood trees and pairwise differences were used to evaluate genotypic relationships, paired with the epidemiological data from routine infection prevention and control (IPC) records and hospital activity data. During this time period, two formal outbreaks had been declared by IPC, while WGS detected nine monophyletic clusters, seven of which were corroborated by epidemiological evidence and by comparison of single-nucleotide polymorphisms. One of the formal outbreaks was confirmed, and the other was not. Of the five HAdV-associated deaths, three were unlinked and the remaining two were considered the source of transmission. Mixed infection was frequent (10%), providing a sentinel source of recombination and superinfection. Immunosuppressed patients harboring a high rate of HAdV positivity require comprehensive surveillance. As a consequence of these findings, HAdV WGS is being incorporated routinely into clinical practice to influence IPC policy contemporaneously.
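The pairwise-difference comparison described above can be sketched as a count of differing sites between aligned sequences. This is only an illustration: the function names, the choice to skip gap and `N` positions, and the toy sequences are assumptions, not details taken from the study.

```python
from itertools import combinations

def snp_distance(seq_a, seq_b, ignore="-N"):
    """Count sites where two aligned sequences differ.

    Positions where either sequence holds a gap or an ambiguous base
    (characters listed in `ignore`) are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )

def pairwise_snp_distances(aligned):
    """Map each pair of sample names to its SNP distance."""
    return {
        (i, j): snp_distance(aligned[i], aligned[j])
        for i, j in combinations(sorted(aligned), 2)
    }

# Hypothetical aligned fragments, far shorter than a real HAdV genome.
toy_alignment = {"p1": "ACGT", "p2": "ACGA", "p3": "TCGA"}
toy_distances = pairwise_snp_distances(toy_alignment)
```

Samples separated by few differences are candidates for a shared transmission cluster. The abstract pairs this kind of evidence with phylogenetic trees and epidemiological records rather than relying on distances alone.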

    Cross-sectional associations between sleep duration, sedentary time, physical activity, and adiposity indicators among Canadian preschool-aged children using compositional analyses

    Background: Sleep duration, sedentary behaviour, and physical activity are three co-dependent behaviours that fall on the movement/non-movement intensity continuum. Compositional data analyses provide an appropriate method for analyzing the association between co-dependent movement behaviour data and health indicators. The objectives of this study were to examine: (1) the combined associations of the composition of time spent in sleep, sedentary behaviour, light-intensity physical activity (LPA), and moderate- to vigorous-intensity physical activity (MVPA) with adiposity indicators; and (2) the association of the time spent in sleep, sedentary behaviour, LPA, or MVPA with adiposity indicators relative to the time spent in the other behaviours in a representative sample of Canadian preschool-aged children. Methods: Participants were 552 children aged 3 to 4 years from cycles 2 and 3 of the Canadian Health Measures Survey. Sedentary time, LPA, and MVPA were measured with Actical accelerometers (Philips Respironics, Bend, OR, USA), and sleep duration was parent-reported. Adiposity indicators included waist circumference (WC) and body mass index (BMI) z-scores based on World Health Organization growth standards. Compositional data analyses were used to examine the cross-sectional associations. Results: The composition of movement behaviours was significantly associated with BMI z-scores (p = 0.006) but not with WC (p = 0.718). Further, the time spent in sleep (BMI z-score: γ_sleep = −0.72; p = 0.138; WC: γ_sleep = −1.95; p = 0.285), sedentary behaviour (BMI z-score: γ_SB = 0.19; p = 0.624; WC: γ_SB = 0.87; p = 0.614), LPA (BMI z-score: γ_LPA = 0.62; p = 0.213; WC: γ_LPA = 0.23; p = 0.902), or MVPA (BMI z-score: γ_MVPA = −0.09; p = 0.733; WC: γ_MVPA = 0.08; p = 0.288) relative to the other behaviours was not significantly associated with the adiposity indicators. Conclusions: This study is the first to use compositional analyses when examining associations of co-dependent sleep duration, sedentary time, and physical activity behaviours with adiposity indicators in preschool-aged children. The overall composition of movement behaviours appears important for healthy BMI z-scores in preschool-aged children. Future research is needed to determine the optimal movement behaviour composition that should be promoted in this age group.
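Compositional analyses of time-use data typically map the parts of the day into log-ratio coordinates before regression. The sketch below shows one common choice, the isometric log-ratio (ilr) transform with pivot coordinates. The abstract does not state which transform was used, so this choice, the function name, and the example minutes are all assumptions.

```python
import numpy as np

def ilr_pivot(composition):
    """Isometric log-ratio (pivot) coordinates of a D-part composition.

    Coordinate i contrasts part i with the geometric mean of the parts that
    follow it, so the first coordinate carries all the relative information
    about the first part (for example, sleep versus everything else).
    """
    x = np.asarray(composition, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all parts must be strictly positive")
    x = x / x.sum()                      # closure: only relative sizes matter
    log_x = np.log(x)
    D = x.size
    z = np.empty(D - 1)
    for i in range(D - 1):
        remaining = D - i - 1            # number of parts after part i
        scale = np.sqrt(remaining / (remaining + 1))
        z[i] = scale * (log_x[i] - log_x[i + 1:].mean())
    return z

# Hypothetical day in minutes: sleep, sedentary, LPA, MVPA (sums to 1440).
day_coordinates = ilr_pivot([660.0, 420.0, 290.0, 70.0])
```

Reordering the parts so that a different behaviour comes first gives a first coordinate for that behaviour "relative to the others", which matches the kind of per-behaviour coefficient (γ) the Results report.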

    Calibration estimation in dual-frame surveys

    Survey statisticians make use of auxiliary information to improve estimates. One important example is calibration estimation, which constructs new weights that match benchmark constraints on auxiliary variables while remaining “close” to the design weights. Multiple-frame surveys are increasingly used by statistical agencies and private organizations to reduce sampling costs and/or avoid frame undercoverage errors. Several ways of combining estimates derived from such frames have been proposed elsewhere; in this paper, we extend the calibration paradigm, previously used for single-frame surveys, to estimate the total of a variable of interest in a dual-frame survey. Calibration is a general tool that allows auxiliary information from the two frames to be included. It also incorporates, as a special case, certain dual-frame estimators that have been proposed previously. The theoretical properties of our class of estimators are derived and discussed, and simulation studies are conducted to compare the efficiency of the procedure using different sets of auxiliary variables. Finally, the proposed methodology is applied to real data obtained from the Barometer of Culture of Andalusia survey. Ministerio de Educación y Ciencia; Consejería de Economía, Innovación, Ciencia y Empleo; PRIN-SURWE
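The basic calibration idea can be sketched for the single-frame case with a chi-square distance, which yields closed-form linear calibration weights. This is not the dual-frame estimator proposed in the paper; the function name, the distance choice, and the toy data are assumptions for illustration.

```python
import numpy as np

def linear_calibration_weights(design_weights, X, totals):
    """Calibrate design weights d so that sum_k w_k * x_k equals `totals`.

    Minimising the chi-square distance sum_k (w_k - d_k)^2 / d_k subject to
    the benchmark constraints gives w_k = d_k * (1 + x_k' lam), where
    lam = (sum_k d_k x_k x_k')^{-1} (totals - sum_k d_k x_k).
    """
    d = np.asarray(design_weights, dtype=float)
    X = np.asarray(X, dtype=float)
    totals = np.asarray(totals, dtype=float)
    M = X.T @ (d[:, None] * X)                  # sum_k d_k x_k x_k'
    lam = np.linalg.solve(M, totals - X.T @ d)  # Lagrange multipliers
    return d * (1.0 + X @ lam)

# Hypothetical sample of four units with an intercept and one auxiliary variable.
d = np.array([10.0, 10.0, 10.0, 10.0])
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 6.0]])
known_totals = np.array([45.0, 180.0])          # assumed population benchmarks
w = linear_calibration_weights(d, X, known_totals)
```

The calibrated weights reproduce the benchmarks exactly, and an estimated total of a survey variable `y` is then `w @ y`. Linear calibration can produce negative weights when the benchmarks are far from the design-weighted totals, which is one reason other distance functions are used in practice.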