2,274 research outputs found

CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

Author: Blase Jennifer
Chu Xu
Li Peng
Rao Xi
Zhang Ce
Zhang Yue
Publication date: 01/01/2020

Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied the problem of data cleaning alone, without considering how data is consumed by downstream ML analytics. We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers. Comment: published in ICDE 202
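The abstract above names the Benjamini-Yekutieli (BY) procedure for controlling the false discovery rate. As a rough illustration only, and not the CleanML authors' code, the standard step-up rule can be sketched as follows. The function name `benjamini_yekutieli`, the default `alpha`, and the example p-values are assumptions made for this sketch.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Standard BY step-up rule: with m tests and c(m) = 1 + 1/2 + ... + 1/m,
    find the largest rank k whose sorted p-value satisfies
    p_(k) <= k * alpha / (m * c(m)), then reject the k smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            largest_k = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, purely for illustration.
print(benjamini_yekutieli([0.001, 0.04, 0.03, 0.5]))  # [True, False, False, False]
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value passes at a higher rank; the harmonic term c(m) is what makes BY more conservative than Benjamini-Hochberg.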

arXiv.org e-Print Archive

Repository for Publications and Research Data

Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning

Author: Egger Peter H.
Rao Susie Xi
Zhang Ce
Publication date: 11/07/2023

This paper presents a hierarchical classification system that automatically categorizes a scholarly publication using its abstract into a three-tier hierarchical label set (discipline, field, subfield) in a multi-class setting. This system enables a holistic categorization of research activities in the mentioned hierarchy in terms of knowledge production through articles and impact through citations, permitting those activities to fall into multiple categories. The classification system distinguishes 44 disciplines, 718 fields and 1,485 subfields among 160 million abstract snippets in Microsoft Academic Graph (version 2018-05-17). We used batch training in a modularized and distributed fashion to address and allow for interdisciplinary and interfield classifications in single-label and multi-label settings. In total, we have conducted 3,140 experiments in all considered models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers). The classification accuracy is > 90% in 77.13% and 78.19% of the single-label and multi-label classifications, respectively. We examine the advantages of our classification by its ability to better align research texts and output with disciplines, to adequately classify them in an automated way, and to capture the degree of interdisciplinarity. The proposed system (a set of pre-trained models) can serve as a backbone to an interactive system for indexing scientific publications in the future. Comment: Under review in QS

arXiv.org e-Print Archive

xFraud: Explainable Fraud Transaction Detection

Author: Chen Zhiyao
Han Zhichao
Min Wei
Rao Susie Xi
Shan Yinan
Zhang Ce
Zhang Shuai
Zhang Zitao
Zhao Yang
Publication venue: 'VLDB Endowment'
Publication date: 07/12/2021

On online retail platforms, it is crucial to actively detect the risks of transactions to improve customer experience and minimize financial loss. In this work, we propose xFraud, an explainable fraud transaction prediction framework which is mainly composed of a detector and an explainer. The xFraud detector can effectively and efficiently predict the legitimacy of incoming transactions. Specifically, it utilizes a heterogeneous graph neural network to learn expressive representations from the informative heterogeneously typed entities in the transaction logs. The explainer in xFraud can generate meaningful and human-understandable explanations from graphs to facilitate further processes in the business unit. In our experiments with xFraud on real transaction networks with up to 1.1 billion nodes and 3.7 billion edges, xFraud is able to outperform various baseline models in many evaluation metrics while remaining scalable in distributed settings. In addition, we show via both quantitative and qualitative evaluations that the xFraud explainer can generate reasonable explanations that significantly assist business analysis. Comment: This is the extended version of a full paper to appear in PVLDB 15 (3) (VLDB 2022

arXiv.org e-Print Archive

Factor relationships of metabolic syndrome and echocardiographic phenotypes in the HyperGEN Study

Author: Arnett DK
DE SIMONE GIOVANNI
Devereux RB
Huang P
Hunt SC
Kraja AT
Lewis CE
North KE
Rao DC
Rice T
Tang W
Publication date: 01/01/2008

Archivio della ricerca - Università degli studi di Napoli Federico II

Turnip mosaic potyvirus probably first spread to Eurasian brassica crops from wild orchids about 1000 years ago

Author: A Gibbs
A Luo
Adrian J. Gibbs
AJ Drummond
AJ Gibbs
ALN. Rao
BY Chung
CC Chen
CE Jenner
D Martin
D Posada
DE Lesemann
DH Huson
Dietrich Lesemann
DP Martin
DW Pallett
E Kozubek
EC Holmes
GF Weiller
HE Simmons
Heinrich-Josef Vetten
Huy D. Nguyen
HY Wang
I Pagán
J Chen
John A. Walsh
K Ohshima
K Tomimura
Kazusato Ohshima
KP Schliep
KS Lole
MA Larkin
MJ Gibbs
MO Salminen
MW Gardner
N Suehiro
O Nicolas
R Pinhasi
R Sanjuán
S Farzadfar
S Fuji
S Guindon
S Korkmaz
Sebastián Duchêne
Simon Y. W. Ho
Smith Maynard
SYW Ho
TA Hall
Yasuhiro Tomitaka
Z Tan
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 06/02/2013

Turnip mosaic potyvirus (TuMV) is probably the most widespread and damaging virus that infects cultivated brassicas worldwide. Previous work has indicated that the virus originated in western Eurasia, with all of its closest relatives being viruses of monocotyledonous plants. Here we report that we have identified a sister lineage of TuMV-like potyviruses (TuMV-OM) from European orchids. The isolates of TuMV-OM form a monophyletic sister lineage to the brassica-infecting TuMVs (TuMV-BIs), and are nested within a clade of monocotyledon-infecting viruses. Extensive host-range tests showed that all of the TuMV-OMs are biologically similar to, but distinct from, TuMV-BIs and do not readily infect brassicas. We conclude that it is more likely that TuMV evolved from a TuMV-OM-like ancestor than the reverse. We did Bayesian coalescent analyses using a combination of novel and published sequence data from four TuMV genes [helper component-proteinase protein (HC-Pro), protein 3 (P3), nuclear inclusion b protein (NIb), and coat protein (CP)]. Three genes (HC-Pro, P3, and NIb), but not the CP gene, gave results indicating that the TuMV-BI viruses diverged from TuMV-OMs around 1000 years ago. Only 150 years later, the four lineages of the present global population of TuMV-BIs diverged from one another. These dates are congruent with historical records of the spread of agriculture in Western Europe. From about 1200 years ago, there was a warming of the climate, and agriculture and the human population of the region greatly increased. Farming replaced woodlands, fostering viruses and aphid vectors that could invade the crops, which included several brassica cultivars and weeds. Later, starting 500 years ago, inter-continental maritime trade probably spread the TuMV-BIs to the remainder of the world.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Warwick Research Archives Portal Repository

The Australian National University

University of Melbourne Institutional Repository

FigShare

ABCD Neurocognitive Prediction Challenge 2019: Predicting individual fluid intelligence scores from structural MRI using probabilistic segmentation and kernel ridge regression

Author: A Pfefferbaum
A Rakotomamonjy
A Rao
A Woolgar
AMJ MacLullich
BJ Casey
C Blaiotta
CE Rasmussen
G Varoquaux
GC Monté-Rubio
GD Batty
IJ Deary
J Ashburner
J Gläscher
J Schrouff
JP Rushton
KL Narr
LS Gottfredson
MA McDaniel
ME Tipping
NA Goriounova
Natacha Akshoomoff
NC Andreasen
Neil P. Oxtoby
RB McCall
Rex E. Jung
S Fors
S Karama
SB Blumberg
T Rohlfing
W Johnson
Publication date: 01/01/2019

We applied several regression and deep learning methods to predict fluid intelligence scores from T1-weighted MRI scans as part of the ABCD Neurocognitive Prediction Challenge (ABCD-NP-Challenge) 2019. We used voxel intensities and probabilistic tissue-type labels derived from these as features to train the models. The best predictive performance (lowest mean-squared error) came from Kernel Ridge Regression (KRR; $\lambda=10$), which produced a mean-squared error of 69.7204 on the validation set and 92.1298 on the test set. This placed our group in the fifth position on the validation leader board and first place on the final (test) leader board. Comment: Winning entry in the ABCD Neurocognitive Prediction Challenge at MICCAI 2019. 7 pages plus references, 3 figures, 1 tabl
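For readers unfamiliar with kernel ridge regression, the following is a minimal sketch of the textbook dual-form estimator, not the challenge entry's actual pipeline. The abstract gives only the regularization value (lambda = 10); the linear kernel, the helper names `solve`, `krr_fit` and `krr_predict`, and the toy data are all assumptions introduced here.

```python
def linear_kernel(a, b):
    """Dot product of two feature vectors (assumed kernel; the abstract names none)."""
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def krr_fit(X, y, lam=10.0, kernel=linear_kernel):
    """Dual coefficients alpha = (K + lam * I)^(-1) y."""
    n = len(X)
    K = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(K, y)


def krr_predict(X_train, alpha, x_new, kernel=linear_kernel):
    """Prediction f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X_train))


# Toy data, purely illustrative (not ABCD data).
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
alpha = krr_fit(X, y, lam=10.0)
print(krr_predict(X, alpha, [2.0]))
```

With a linear kernel this coincides with ordinary ridge regression, so on the toy data the fitted slope is 28/24 rather than the unregularized 2: a large lambda shrinks predictions toward zero.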

arXiv.org e-Print Archive

Crossref

UCL Discovery

MPG.PuRe

Using Whole Genome Sequences to Investigate Adenovirus Outbreaks in a Hematopoietic Stem Cell Transplant Unit

Author: Best T
Breuer J
Dunn H
Guerra-Assuncao JA
Hartley JC
Houldcroft CJ
Margetts BK
Myers CE
Rao K
Rolfe KJ
Roy S
Venturini C
Williams CA
Williams R
Publication venue: Frontiers Media SA
Publication date: 02/07/2021

A recent surge in human mastadenovirus (HAdV) cases, including five deaths, amongst a haematopoietic stem cell transplant population led us to use whole genome sequencing (WGS) to investigate. We compared sequences from 37 patients collected over a 20-month period with sequences from GenBank and our own database of HAdVs. Maximum likelihood trees and pairwise differences were used to evaluate genotypic relationships, paired with the epidemiological data from routine infection prevention and control (IPC) records and hospital activity data. During this time period, two formal outbreaks had been declared by IPC, while WGS detected nine monophyletic clusters, of which seven were corroborated by epidemiological evidence and by comparison of single-nucleotide polymorphisms. One of the formal outbreaks was confirmed, and the other was not. Of the five HAdV-associated deaths, three were unlinked and the remaining two were considered the source of transmission. Mixed infection was frequent (10%), providing a sentinel source of recombination and superinfection. Immunosuppressed patients harboring a high rate of HAdV positivity require comprehensive surveillance. As a consequence of these findings, HAdV WGS is being incorporated routinely into clinical practice to influence IPC policy contemporaneously.

UCL Discovery

Cross-sectional associations between sleep duration, sedentary time, physical activity, and adiposity indicators among Canadian preschool-aged children using compositional analyses

Author: AG LeBlanc
AL Adolph
B Day
BW Timmons
C Rich
Canadian Society for Exercise Physiology
CE Boeke
CJ Dobbelsteyn
DP Cliff
JNK Rao
KA Pfeiffer
KF Rust
L Matricciani
LL Moore
M Hirshkowitz
M Kang
Mark S. Tremblay
ME Rosenberger
MF Rolland-Cachera
ML Louzada
MP Buman
MS Tremblay
National Institutes of Health
RA Jones
RC Colley
RP Troiano
S Paruthi
Sebastien F. M. Chastin
SF Chastin
SL Wong
SM Williams
Statistics Canada
T Hinkley
V Carson
V Espana-Romero
Valerie Carson
WHO Multicentre Growth Reference Study Group
X Chen
Z Pedisic
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: Sleep duration, sedentary behaviour, and physical activity are three co-dependent behaviours that fall on the movement/non-movement intensity continuum. Compositional data analyses provide an appropriate method for analyzing the association between co-dependent movement behaviour data and health indicators. The objectives of this study were to examine: (1) the combined associations of the composition of time spent in sleep, sedentary behaviour, light-intensity physical activity (LPA), and moderate- to vigorous-intensity physical activity (MVPA) with adiposity indicators; and (2) the association of the time spent in sleep, sedentary behaviour, LPA, or MVPA with adiposity indicators relative to the time spent in the other behaviours in a representative sample of Canadian preschool-aged children.
Methods: Participants were 552 children aged 3 to 4 years from cycles 2 and 3 of the Canadian Health Measures Survey. Sedentary time, LPA, and MVPA were measured with Actical accelerometers (Philips Respironics, Bend, OR, USA), and sleep duration was parent-reported. Adiposity indicators included waist circumference (WC) and body mass index (BMI) z-scores based on World Health Organization growth standards. Compositional data analyses were used to examine the cross-sectional associations.
Results: The composition of movement behaviours was significantly associated with BMI z-scores (p = 0.006) but not with WC (p = 0.718). Further, the time spent in sleep (BMI z-score: γ sleep = −0.72; p = 0.138; WC: γ sleep = −1.95; p = 0.285), sedentary behaviour (BMI z-score: γ SB = 0.19; p = 0.624; WC: γ SB = 0.87; p = 0.614), LPA (BMI z-score: γ LPA = 0.62; p = 0.213; WC: γ LPA = 0.23; p = 0.902), or MVPA (BMI z-score: γ MVPA = −0.09; p = 0.733; WC: γ MVPA = 0.08; p = 0.288) relative to the other behaviours was not significantly associated with the adiposity indicators.
Conclusions: This study is the first to use compositional analyses when examining associations of co-dependent sleep duration, sedentary time, and physical activity behaviours with adiposity indicators in preschool-aged children. The overall composition of movement behaviours appears important for healthy BMI z-scores in preschool-aged children. Future research is needed to determine the optimal movement behaviour composition that should be promoted in this age group.
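The abstract does not say which compositional transform was used. One common choice in compositional data analysis is the isometric log-ratio (ilr) transform in pivot coordinates, sketched below as an assumption rather than as the study's method. The function names and the example day (minutes of sleep, sedentary time, LPA, MVPA) are hypothetical.

```python
import math


def closure(parts, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]


def ilr_pivot(parts):
    """Isometric log-ratio pivot coordinates of a D-part composition.

    Coordinate i contrasts part i with the geometric mean of the parts
    that follow it; the first coordinate therefore expresses the first
    part relative to all remaining parts. Returns D - 1 values.
    """
    D = len(parts)
    logs = [math.log(p) for p in parts]  # requires all parts > 0
    coords = []
    for i in range(D - 1):
        rest = logs[i + 1:]
        mean_rest = sum(rest) / len(rest)  # log of the geometric mean
        scale = math.sqrt((D - i - 1) / (D - i))
        coords.append(scale * (logs[i] - mean_rest))
    return coords


# Hypothetical 24-hour day in minutes: sleep, sedentary, LPA, MVPA.
day = [660.0, 420.0, 290.0, 70.0]
print(ilr_pivot(day))
```

The coordinates depend only on ratios between parts, so expressing the same day in hours or as proportions gives identical values. This is what allows time in one behaviour to be analysed relative to time in the others.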

Crossref

Directory of Open Access Journals

Ghent University Academic Bibliography

ResearchOnline@GCU

Calibration estimation in dual-frame surveys

Author: A Arcos
AC Singh
Annalisa Teodoro
Antonio Arcos
C Wu
CE Särndal
CJ Skinner
CT Isaki
F Mecatti
G Kalton
JC Deville
JNK Rao
K Wolter
M Hidiroglou
M. Giovanna Ranalli
María del Mar Rueda
MD Bankier
PS Kott
RH Renssen
S Chen
SL Lohr
YG Berger
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2015

Survey statisticians make use of auxiliary information to improve estimates. One important example is calibration estimation, which constructs new weights that match benchmark constraints on auxiliary variables while remaining “close” to the design weights. Multiple-frame surveys are increasingly used by statistical agencies and private organizations to reduce sampling costs and/or avoid frame undercoverage errors. Several ways of combining estimates derived from such frames have been proposed elsewhere; in this paper, we extend the calibration paradigm, previously used for single-frame surveys, to calculate the total value of a variable of interest in a dual-frame survey. Calibration is a general tool that allows auxiliary information from two frames to be included. It also incorporates, as a special case, certain dual-frame estimators that have been proposed previously. The theoretical properties of our class of estimators are derived and discussed, and simulation studies are conducted to compare the efficiency of the procedure using different sets of auxiliary variables. Finally, the proposed methodology is applied to real data obtained from the Barometer of Culture of Andalusia survey. Ministerio de Educación y Ciencia; Consejería de Economía, Innovación, Ciencia y Empleo; PRIN-SURWE
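To make the idea of calibration weights concrete, here is a minimal sketch of classical single-frame linear calibration (chi-square distance) with one auxiliary variable. It is not the dual-frame estimator proposed in the paper; the function name and the toy numbers are assumptions for illustration.

```python
def linear_calibration_weights(design_weights, x, x_total):
    """Single-frame linear calibration with one auxiliary variable.

    Minimizing the chi-square distance sum((w - d)^2 / d) subject to the
    benchmark constraint sum(w * x) == x_total gives w_i = d_i * (1 + lam * x_i)
    with lam = (x_total - sum(d * x)) / sum(d * x^2).
    """
    sum_dx = sum(d * xi for d, xi in zip(design_weights, x))
    sum_dxx = sum(d * xi * xi for d, xi in zip(design_weights, x))
    lam = (x_total - sum_dx) / sum_dxx
    return [d * (1.0 + lam * xi) for d, xi in zip(design_weights, x)]


# Toy sample (hypothetical): three units with design weight 10 each,
# and a known population total of 66 for the auxiliary variable.
d = [10.0, 10.0, 10.0]
x = [1.0, 2.0, 3.0]
w = linear_calibration_weights(d, x, x_total=66.0)

# The calibrated weights reproduce the benchmark total (up to rounding).
print(sum(wi * xi for wi, xi in zip(w, x)))

# Calibrated estimate of the total of a study variable y (made-up values).
y = [5.0, 7.0, 9.0]
print(sum(wi * yi for wi, yi in zip(w, y)))
```

The design weights alone give an auxiliary total of 60, so each weight is nudged upward in proportion to its x value until the benchmark of 66 is met. The same adjusted weights are then applied to the variable of interest.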

Crossref

Repositorio Institucional Universidad de Granada