
19 research outputs found

The out-of-sample $R^2$ : estimation and inference

Author: Hawinkel Stijn
Maere Steven
Waegeman Willem
Publication venue: 'Informa UK Limited'
Publication date: 10/02/2023

Out-of-sample prediction is the acid test of predictive models, yet an independent test dataset is often not available for assessment of the prediction error. For this reason, out-of-sample performance is commonly estimated using data splitting algorithms such as cross-validation or the bootstrap. For quantitative outcomes, the ratio of variance explained to total variance can be summarized by the coefficient of determination or in-sample $R^2$, which is easy to interpret and to compare across different outcome variables. As opposed to the in-sample $R^2$, the out-of-sample $R^2$ has not been well defined and the variability on the out-of-sample $\hat{R}^2$ has been largely ignored. Usually only its point estimate is reported, hampering formal comparison of predictability of different outcome variables. Here we explicitly define the out-of-sample $R^2$ as a comparison of two predictive models, provide an unbiased estimator and exploit recent theoretical advances on uncertainty of data splitting estimates to provide a standard error for the $\hat{R}^2$. The performance of the estimators for the $R^2$ and its standard error are investigated in a simulation study. We demonstrate our new method by constructing confidence intervals and comparing models for prediction of quantitative Brassica napus and Zea mays phenotypes based on gene expression data
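The idea of an out-of-sample R^2 as a comparison of two predictive models can be illustrated with a minimal sketch. It assumes the reference model simply predicts the training-set mean and that both models are scored by K-fold cross-validated squared error. This is not the authors' estimator and it provides no standard error. The function names `fit_line` and `cv_r2`, the fold scheme and the simulated data are all hypothetical choices made for illustration.

```python
import random
from statistics import fmean


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def cv_r2(x, y, n_folds=10, seed=0):
    """Cross-validated R^2: 1 - MSE(fitted line) / MSE(training mean)."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    se_model, se_null = [], []
    for k in range(n_folds):
        test = idx[k::n_folds]          # every n_folds-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        x_tr = [x[i] for i in train]
        y_tr = [y[i] for i in train]
        intercept, slope = fit_line(x_tr, y_tr)
        null_pred = fmean(y_tr)         # reference model: training mean only
        for i in test:
            se_model.append((y[i] - (intercept + slope * x[i])) ** 2)
            se_null.append((y[i] - null_pred) ** 2)
    return 1.0 - fmean(se_model) / fmean(se_null)


# Simulated data (assumption): one informative outcome, one pure-noise outcome.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]
noise = [rng.gauss(0, 1) for _ in range(200)]

r2_signal = cv_r2(x, y)
r2_noise = cv_r2(x, noise)
print(f"signal: {r2_signal:.3f}, noise: {r2_noise:.3f}")
```

Because both models are evaluated on held-out observations only, the ratio can exceed one and the resulting value can be negative when the predictor carries no information, which is not possible for the in-sample R^2 of a least-squares fit with an intercept.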

arXiv.org e-Print Archive

Ghent University Academic Bibliography

Sequence count data are poorly fit by the negative binomial distribution

Author: Bijnens Luc
Hawinkel Stijn
Rayner JCW
Thas Olivier
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2020

Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness of fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This gives an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that non-parametric tests should be preferred over parametric methods

Ghent University Academic Bibliography

Directory of Open Access Journals

A unified framework for unconstrained and constrained ordination of microbiome read count data

Author: Bijnens Luc
Hawinkel Stijn
Kerckhof Frederiek-Maarten
Thas Olivier
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2019

Explorative visualization techniques provide a first summary of microbiome read count datasets through dimension reduction. A plethora of dimension reduction methods exists, but many of them focus primarily on sample ordination, failing to elucidate the role of the bacterial species. Moreover, implicit but often unrealistic assumptions underlying these methods fail to account for overdispersion and differences in sequencing depth, which are two typical characteristics of sequencing data. We combine log-linear models with a dispersion estimation algorithm and flexible response function modelling into a framework for unconstrained and constrained ordination. The method is able to cope with differences in dispersion between taxa and varying sequencing depths, to yield meaningful biological patterns. Moreover, it can correct for observed technical confounders, whereas other methods are adversely affected by these artefacts. Unlike distance-based ordination methods, the assumptions underlying our method are stated explicitly and can be verified using simple diagnostics. The combination of unconstrained and constrained ordination in the same framework is unique in the field and facilitates microbiome data exploration. We illustrate the advantages of our method on simulated and real datasets, while pointing out flaws in existing methods. The algorithms for fitting and plotting are available in the R-package RCM

Ghent University Academic Bibliography

Directory of Open Access Journals

Research Online

FigShare

Model-based joint visualization of multiple compositional omics datasets

Author: Bijnens Luc
Cao Kim-Anh Lê
Hawinkel Stijn
Thas Olivier
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2020

The integration of multiple omics datasets measured on the same samples is a challenging task: data come from heterogeneous sources and vary in signal quality. In addition, some omics data are inherently compositional, e.g. sequence count data. Most integrative methods are limited in their ability to handle covariates, missing values, compositional structure and heteroscedasticity. In this article we introduce a flexible model-based approach to data integration to address these current limitations: COMBI. We combine concepts, such as compositional biplots and log-ratio link functions with latent variable models, and propose an attractive visualization through multiplots to improve interpretation. Using real data examples and simulations, we illustrate and compare our method with other data integration techniques. Our algorithm is available in the R-package combi

Ghent University Academic Bibliography

University of Melbourne Institutional Repository

Statistical analysis of microbiome sequence count data

Author: Hawinkel Stijn
Publication venue: Ghent University. Faculty of Sciences
Publication date: 29/01/2023

Ghent University Academic Bibliography

Spatial Regression Models for Field Trials: A Comparative Study and New Ideas

Author: Sam De Meyer
Steven Maere
Stijn Hawinkel
Publication venue: Frontiers Media S.A.
Publication date: 01/03/2022

Naturally occurring variability within a study region harbors valuable information on relationships between biological variables. Yet, spatial patterns within these study areas, e.g., in field trials, violate the assumption of independence of observations, setting particular challenges in terms of hypothesis testing, parameter estimation, feature selection, and model evaluation. We evaluate a number of spatial regression methods in a simulation study, including more realistic spatial effects than employed so far. Based on our results, we recommend generalized least squares (GLS) estimation for experimental as well as for observational setups and demonstrate how it can be incorporated into popular regression models for high-dimensional data such as regularized least squares. This new method is available in the BioConductor R-package pengls. Inclusion of a spatial error structure improves parameter estimation and predictive model performance in low-dimensional settings and also improves feature selection in high-dimensional settings by reducing “red-shift”: the preferential selection of features with spatial structure. In addition, we argue that the absence of spatial autocorrelation (SAC) in the model residuals should not be taken as a sign of a good fit, since it may result from overfitting the spatial trend. Finally, we confirm our findings in a case study on the prediction of winter wheat yield based on multispectral measurements

Directory of Open Access Journals

A unified framework for unconstrained and constrained ordination of microbiome read count data

Author: Bijnens Luc
Hawinkel Stijn
Kerckhof Frederiek-Maarten
Thas Olivier
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2019

Research Online

Spatial regression models for field trials : a comparative study and new ideas

Author: De Meyer Sam
Hawinkel Stijn
Maere Steven
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2022

Ghent University Academic Bibliography

PubMed Central

A broken promise : microbiome differential abundance methods do not control the false discovery rate

Author: Bijnens Luc
Hawinkel Stijn
Mattiello Federico
Thas Olivier
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2019

High-throughput sequencing technologies allow easy characterization of the human microbiome, but the statistical methods to analyze microbiome data are still in their infancy. Differential abundance methods aim at detecting associations between the abundances of bacterial species and subject grouping factors. The results of such methods are important to identify the microbiome as a prognostic or diagnostic biomarker or to demonstrate efficacy of prodrug or antibiotic drugs. Because of a lack of benchmarking studies in the microbiome field, no consensus exists on the performance of the statistical methods. We have compared a large number of popular methods through extensive parametric and nonparametric simulation as well as real data shuffling algorithms. The results are consistent over the different approaches and all point to an alarming excess of false discoveries. This raises great doubts about the reliability of discoveries in past studies and imperils reproducibility of microbiome experiments. To further improve method benchmarking, we introduce a new simulation tool that allows to generate correlated count data following any univariate count distribution; the correlation structure may be inferred from real data. Most simulation studies discard the correlation between species, but our results indicate that this correlation can negatively affect the performance of statistical methods

Ghent University Academic Bibliography

Data_Sheet_2_Spatial Regression Models for Field Trials: A Comparative Study and New Ideas.zip

Author: Sam De Meyer (12320087)
Steven Maere (23665)
Stijn Hawinkel (6331094)
Publication venue
Publication date: 04/03/2024

FigShare