
    A comparative analysis of multi-level computer-assisted decision making systems for traumatic injuries

    Background: This paper focuses on the creation of a predictive computer-assisted decision making system for traumatic injury using machine learning algorithms. Trauma experts must make several difficult decisions based on a large number of patient attributes, usually in a short period of time. The aim is to compare the existing machine learning methods available for medical informatics, and to develop reliable, rule-based computer-assisted decision-making systems that provide recommendations for the course of treatment for new patients, based on previously seen cases in trauma databases. Datasets of traumatic brain injury (TBI) patients are used to train and test the decision making algorithm. The work is also applicable to patients with traumatic pelvic injuries.
    Methods: Decision-making rules are created by processing patterns discovered in the datasets, using machine learning techniques. More specifically, CART and C4.5 are used, as they provide grammatical expressions of knowledge extracted by applying logical operations to the available features. The resulting rule sets are tested against other machine learning methods, including AdaBoost and SVM. The rule creation algorithm is applied to multiple datasets, both with and without prior filtering to discover significant variables. This filtering is performed via logistic regression prior to the rule discovery process.
    Results: For survival prediction using all variables, CART outperformed the other machine learning methods. When using only significant variables, neural networks performed best. A reliable rule-base was generated using combined C4.5/CART. The average predictive rule performance was 82% when using all variables, and approximately 84% when using significant variables only. The average performance of the combined C4.5 and CART system using significant variables was 89.7% in predicting the exact outcome (home or rehabilitation), and 93.1% in predicting the ICU length of stay for airlifted TBI patients.
    Conclusion: This study creates an efficient computer-aided rule-based system that can be employed in decision making in TBI cases. The rule-bases apply methods that combine CART and C4.5 with logistic regression to improve rule performance and quality. For final outcome prediction for TBI cases, the resulting rule-bases outperform systems that utilize all available variables.
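    As a rough illustration of the rule-extraction step described above, the sketch below fits a CART-style decision tree with scikit-learn and prints the resulting if-then rules. It uses synthetic data and is not the authors' implementation: the attribute names, the outcome label and the tree depth are assumptions, and the combined C4.5/CART and logistic regression filtering steps are not reproduced.

```python
# Minimal sketch: CART-style rule extraction with scikit-learn
# (synthetic data; not the authors' trauma datasets or pipeline).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                    # hypothetical patient attributes
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)    # hypothetical outcome label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# scikit-learn's DecisionTreeClassifier implements an optimized CART variant.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("held-out accuracy:", tree.score(X_te, y_te))
# Grammatical if-then rules, analogous to the rule-bases described above.
print(export_text(tree, feature_names=[f"attr_{i}" for i in range(5)]))
```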

    Multiple Imputation Ensembles (MIE) for dealing with missing data

    Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely At Random. Firstly, we use a number of single/multiple imputation methods to recover the missing values, and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation using dissimilarity measures. We also evaluate MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach, combining multiple imputation with ensemble techniques, outperforms the others, particularly as missing data increases.
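    A minimal sketch of the multiple-imputation-plus-ensemble idea follows. It uses scikit-learn's IterativeImputer and a majority vote over per-imputation random forests, which is a simplified stand-in for the bagging/stacking ensembles described above, not the authors' exact MIE pipeline; the classifier choice and number of imputations are assumptions.

```python
# Simplified sketch of multiple imputation + ensembling (not the exact MIE pipeline):
# impute several times, train one classifier per imputed copy, combine by majority vote.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier

def mie_predict(X_train, y_train, X_test, n_imputations=5):
    votes = []
    for seed in range(n_imputations):
        imputer = IterativeImputer(random_state=seed)        # one imputation per seed
        Xtr = imputer.fit_transform(X_train)
        Xte = imputer.transform(X_test)
        clf = RandomForestClassifier(random_state=seed).fit(Xtr, y_train)
        votes.append(clf.predict(Xte))
    votes = np.array(votes)
    # Majority vote across the imputation-specific classifiers (assumes binary 0/1 labels).
    return (votes.mean(axis=0) >= 0.5).astype(int)
```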

    Dealing with Missing Data and Uncertainty in the Context of Data Mining

    Missing data is an issue in many real-world datasets, yet robust methods for dealing with it appropriately still need development. In this paper we investigate how some methods for handling missing data perform when the uncertainty increases. Using benchmark datasets from the UCI Machine Learning repository, we generate datasets for our experimentation with increasing amounts of data Missing Completely At Random (MCAR), both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifier on the basis of complete case analysis and simple imputation, and then study the performance of the algorithms that can handle missing data directly. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible as missing data increases, particularly for high-dimensional data. We find that increasing missing data does have a negative effect on the performance of all the algorithms tested, but the algorithms do not differ significantly in performance whether the missing data is handled by preprocessing in the form of simple imputation or by the algorithm itself.
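    The MCAR corruption step can be illustrated with a small helper like the one below. This is a sketch under the assumption of cell-wise, attribute-level missingness; the exact generation procedure used in the experiments (including the record-level variant) may differ.

```python
# Hedged sketch: injecting Missing Completely At Random (MCAR) values into a
# complete numeric dataset before running the classifiers.
import numpy as np

def make_mcar(X, missing_rate, seed=0):
    """Return a copy of X with roughly `missing_rate` of entries set to NaN at random."""
    rng = np.random.default_rng(seed)
    X_miss = X.astype(float).copy()
    mask = rng.random(X_miss.shape) < missing_rate   # each cell independently missing
    X_miss[mask] = np.nan
    return X_miss

# Example: 20% of cells missing completely at random.
# X_mcar = make_mcar(X, missing_rate=0.2)
```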

    Foodways in transition: food plants, diet and local perceptions of change in a Costa Rican Ngäbe community

    Background: Indigenous populations are undergoing rapid ethnobiological, nutritional and socioeconomic transitions while being increasingly integrated into modernizing societies. To better understand the dynamics of these transitions, this article aims to characterize the cultural domain of food plants and analyze its relation to current diets and to local perceptions of change amongst the Ngäbe people of Southern Conte-Burica, Costa Rica, where local production of food plants is hypothesized to be in sharp decline as new conservation and development paradigms are implemented.
    Methods: Extensive freelisting, interviews and workshops were used to collect data from 72 participants on their knowledge of food plants, their current dietary practices and their perceptions of change in local foodways, while cultural domain analysis, descriptive statistical analyses and the development of fundamental explanatory themes were employed to analyze the data.
    Results: Results show a food plants domain composed of 140 species, of which 85% grow in the area, with a medium level of cultural consensus and some age-based variation. Although many plants still grow in the area, a decrease in local production (even abandonment) was found for many key species, with much reduced cultivation areas. Yet the domain appears to be largely theoretical, with little evidence of use, and the diet today depends predominantly on foods bought from the store (more than 50% of basic ingredients), many of which were not salient, or not even recognized as food plants, in the freelisting exercises. While changes in the importance of food plants were largely deemed a result of changing cultural preferences for store-bought processed foodstuffs and changing values associated with farming and being food self-sufficient, the Ngäbe were also aware that changing household livelihood activities, and the subsequent loss of knowledge and use of food plants, were in fact being driven by changes in social and political policies, despite increases in forest cover and biodiversity.
    Conclusions: Ngäbe foodways are changing in different and somewhat disconnected ways: knowledge of food plants is varied, reflecting the most relevant changes in dietary practices, such as smaller cultivation areas and greater dependence on food from stores by all families. We attribute dietary shifts to socioeconomic and political changes in recent decades, in particular to a reduction in local food production and to new economic structures and agents related to the State and globalization.

    Muscle and tendon adaptations to moderate load eccentric vs. concentric resistance exercise in young and older males.

    Resistance exercise training (RET) is well known to counteract negative age-related changes in both muscle and tendon tissue. Traditional RET consists of both concentric (CON) and eccentric (ECC) contractions; however, isolated ECC contractions are metabolically less demanding and thus may be more suitable for older populations. Whether submaximal (60% 1RM) CON or ECC contractions differ in their effectiveness is relatively unknown, as is the time course of the corresponding muscle and tendon adaptations. Therefore, this study aimed to establish the time course of muscle and tendon adaptations to submaximal CON and ECC RET. Twenty healthy young (24.5 ± 5.1 years) and 17 older males (68.1 ± 2.4 years) were randomly allocated to either isolated CON or ECC RET, performed three times per week for 8 weeks. Tendon biomechanical properties, muscle architecture and maximal voluntary contraction were assessed every 2 weeks, and quadriceps muscle volume every 4 weeks. Positive changes in tendon Young's modulus were observed after 4 weeks in all groups, after which adaptations plateaued in young males but continued to increase in older males, suggesting a dampened rate of adaptation with age. However, both CON and ECC resulted in similar overall changes in tendon Young's modulus in all groups. Muscle hypertrophy and strength increases were similar between CON and ECC in all groups. However, pennation angle increases were greater in CON, and fascicle length changes were greater in ECC. Notably, muscle and tendon adaptations appeared to occur in synergy, presumably to maintain the efficacy of the muscle-tendon unit.

    A particle swarm optimization approach using adaptive entropy-based fitness quantification of expert knowledge for high-level, real-time cognitive robotic control

    Abstract: High-level, real-time mission control of semi-autonomous robots deployed in remote and dynamic environments remains a challenge. Control models learnt from a knowledgebase quickly become obsolete when the environment or the knowledgebase changes. This research study introduces a cognitive reasoning process to select the optimal action, using the most relevant knowledge from the knowledgebase, subject to observed evidence. The approach introduces an adaptive entropy-based set-based particle swarm optimization (AE-SPSO) algorithm and a novel adaptive entropy-based fitness quantification (AEFQ) algorithm for evidence-based optimization of the knowledge. The performance of the AE-SPSO and AEFQ algorithms is experimentally evaluated with two unmanned aerial vehicle (UAV) benchmark missions: (1) relocating the UAV to a charging station and (2) collecting and delivering a package. Performance is measured by inspecting the success and completeness of the mission and the accuracy of autonomous flight control. The results show that the AE-SPSO/AEFQ approach successfully finds the optimal state transition for each mission task and that autonomous flight control is successfully achieved.
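    For readers unfamiliar with particle swarm optimization, the following is a generic global-best PSO sketch minimizing a simple continuous test function. The AE-SPSO of the paper is a set-based variant with an adaptive entropy-based fitness (AEFQ); neither is reproduced here, and all parameter values below are illustrative assumptions.

```python
# Generic global-best PSO sketch (illustrative only; not the paper's AE-SPSO/AEFQ).
import numpy as np

def pso(fitness, dim=2, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Standard velocity update: inertia + cognitive + social terms.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos += vel
        vals = np.array([fitness(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

best, best_val = pso(lambda p: np.sum(p ** 2))   # minimize a simple sphere function
```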

    A new scoring system in Cystic Fibrosis: statistical tools for database analysis – a preliminary report

    Background: Cystic fibrosis is the most common fatal genetic disorder in the Caucasian population. Scoring systems for assessing Cystic fibrosis disease severity have been used for almost 50 years, without being adapted to the milder phenotype of the disease in the 21st century. The aim of this project is to develop a new scoring system using a database and employing various statistical tools. This study protocol reports the development of the statistical tools needed to create such a scoring system.
    Methods: The evaluation is based on the Cystic Fibrosis database from the cohort at the Royal Children's Hospital in Melbourne. Initially, unsupervised clustering of all data records was performed using a range of clustering algorithms, in particular incremental clustering algorithms. The clusters obtained were characterised using rules from decision trees and the results were examined by clinicians. To obtain a clearer definition of classes, expert opinion of each individual's clinical severity was sought. After data preparation, including expert opinion of each individual's clinical severity on a 3-point scale (mild, moderate and severe disease), two multivariate techniques were used throughout the analysis to establish a method with better success in feature selection and model derivation: Canonical Analysis of Principal Coordinates (CAP) and Linear Discriminant Analysis (DA). A 3-step procedure was performed: (1) selection of features, (2) extraction of 5 severity classes from the 3 classes defined by expert opinion, and (3) establishment of calibration datasets.
    Results: (1) Feature selection: CAP has a more effective "modelling" focus than DA. (2) Extraction of 5 severity classes: after variables were identified as important in discriminating contiguous CF severity groups on the 3-point scale (mild/moderate and moderate/severe), Discriminant Functions (DF) were used to determine the new groups: mild, intermediate moderate, moderate, intermediate severe and severe disease. (3) The generated confusion tables showed a misclassification rate of 19.1% for males and 16.5% for females, with the majority of misallocations into adjacent severity classes, particularly for males.
    Conclusion: Our preliminary data show that using CAP for feature selection and Linear DA to derive the actual model in a CF database might be helpful in developing a scoring system. However, there are several limitations; in particular, more data entry points are needed to finalize a score, and the statistical tools have to be further refined and validated by re-running the statistical methods on the larger dataset.
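    A minimal sketch of the Linear Discriminant Analysis step is shown below, using scikit-learn on synthetic stand-in data (the CF database is not public), to illustrate how a confusion table over severity classes like the one reported above can be produced; the feature count and class labels are assumptions.

```python
# Illustrative sketch of the discriminant-analysis step on synthetic data,
# mirroring the confusion-table analysis of severity classes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))                    # hypothetical clinical features
y = rng.integers(0, 3, size=300)                 # 3 severity classes (mild/moderate/severe)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print(confusion_matrix(y_te, lda.predict(X_te)))  # rows: true class, cols: predicted class
```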

    Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing

    Background: The rapid evolution of 454 GS-FLX sequencing technology has not been accompanied by a reassessment of the quality and accuracy of the sequences obtained. Current strategies for decision-making and error-correction are based on an initial analysis by Huse et al. in 2007 for the older GS20 system, based on experimental sequences. We analyze here the quality of 454 sequencing data and identify factors playing a role in sequencing error, through the use of an extensive dataset for Roche control DNA fragments.
    Results: We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in certain positions, and its distribution was linked to several experimental variables. The main factors related to error are the presence of homopolymers, position in the sequence, size of the sequence, and, for insertion and deletion errors, spatial localization in PT plates. These factors can be described by seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables.
    Conclusions: The pattern identified here calls for the use of internal controls and error-correcting base callers to correct for errors, when available (e.g. when sequencing amplicons). For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.
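    As a toy illustration of positional error analysis, the sketch below computes a per-position mismatch rate for reads aligned against a known control sequence. It is a hypothetical helper, not the authors' pipeline, and it ignores insertions and deletions, which a real 454 analysis (where homopolymer indels dominate) would need to handle via proper alignment.

```python
# Hedged sketch: per-position mismatch rate for reads against a known control sequence.
import numpy as np

def positional_error_rate(reads, reference):
    """reads: list of read strings already aligned to (a prefix of) `reference`."""
    max_len = max(len(r) for r in reads)
    errors = np.zeros(max_len)
    coverage = np.zeros(max_len)
    for read in reads:
        for i, base in enumerate(read):
            coverage[i] += 1
            if base != reference[i]:
                errors[i] += 1
    return errors / np.maximum(coverage, 1)   # mismatch rate at each read position
```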

    Reconstructing cancer genomes from paired-end sequencing data

    Background: A cancer genome is derived from the germline genome through a series of somatic mutations. Somatic structural variants - including duplications, deletions, inversions, translocations, and other rearrangements - result in a cancer genome that is a scrambling of intervals, or "blocks" of the germline genome sequence. We present an efficient algorithm for reconstructing the block organization of a cancer genome from paired-end DNA sequencing data.
    Results: By aligning paired reads from a cancer genome - and a matched germline genome, if available - to the human reference genome, we derive: (i) a partition of the reference genome into intervals; (ii) adjacencies between these intervals in the cancer genome; (iii) an estimated copy number for each interval. We formulate the Copy Number and Adjacency Genome Reconstruction Problem of determining the cancer genome as a sequence of the derived intervals that is consistent with the measured adjacencies and copy numbers. We design an efficient algorithm, called Paired-end Reconstruction of Genome Organization (PREGO), to solve this problem by reducing it to an optimization problem on an interval-adjacency graph constructed from the data. The solution to the optimization problem results in an Eulerian graph, containing an alternating Eulerian tour that corresponds to a cancer genome consistent with the sequencing data. We apply our algorithm to five ovarian cancer genomes that were sequenced as part of The Cancer Genome Atlas. We identify numerous rearrangements, or structural variants, in these genomes, analyze reciprocal vs. non-reciprocal rearrangements, and identify rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge (B/F/B) cycles.
    Conclusions: We demonstrate that PREGO efficiently identifies complex and biologically relevant rearrangements in cancer genome sequencing data. An implementation of the PREGO algorithm is available at http://compbio.cs.brown.edu/software/.
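    A toy sketch of the interval-adjacency idea, using networkx, is shown below: intervals become nodes, measured adjacencies become (multi-)edges, and an Eulerian tour, when one exists, spells out a candidate ordering of the intervals. This is only an illustration with made-up intervals and adjacencies; PREGO's actual graph distinguishes interval edges from adjacency edges and optimizes against copy numbers, as described in the paper and available at the URL above.

```python
# Toy sketch of an interval-adjacency graph and an Eulerian traversal
# (not PREGO itself; intervals and adjacencies below are hypothetical).
import networkx as nx

G = nx.MultiGraph()
intervals = ["A", "B", "C"]                  # hypothetical reference intervals
G.add_nodes_from(intervals)
# Hypothetical adjacencies; edge multiplicity stands in for copy number.
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("A", "B"), ("B", "A")])

if nx.is_eulerian(G):
    tour = list(nx.eulerian_circuit(G, source="A"))
    print(tour)                               # one traversal consistent with the adjacencies
```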