110 research outputs found

The impact of pedigree structure on heritability estimates

Author: Ekstrøm Claus Thorn
Publication venue: 'S. Karger AG'
Publication date: 01/01/2009

Copenhagen University Research Information System

Inference for feature selection using the Lasso with high-dimensional data

Authors: Brink-Jensen Kasper, Ekstrøm Claus Thorn
Publication date: 17/03/2014

Penalized regression models such as the Lasso have proved useful for variable selection in many fields - especially for situations with high-dimensional data where the number of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference of the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and p-values for features chosen by the Lasso. This method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute p-values of the expected magnitude with simulated data using a multitude of scenarios that involve various effect strengths and correlation between predictors. The algorithm is also applied to a prostate cancer dataset that has been analyzed in recent papers on the subject. The proposed method is found to provide a powerful way to make inference for feature selection even for small samples and when the number of predictors is several orders of magnitude larger than the number of observations. The algorithm is implemented in the MESS package in R and is freely available

arXiv.org e-Print Archive

CiteSeerX

Copenhagen University Research Information System

Moment-based Estimation of Mixtures of Regression Models

Authors: Ekstrøm Claus Thorn, Pipper Christian Bressen
Publication date: 01/01/2019

Finite mixtures of regression models provide a flexible modeling framework for many phenomena. Using moment-based estimation of the regression parameters, we develop unbiased estimators with a minimum of assumptions on the mixture components. In particular, only the average regression model for one of the components in the mixture model is needed, with no requirements on the distributions. The consistency and asymptotic distribution of the estimators are derived, and the proposed method is validated through a series of simulation studies and is shown to be highly accurate. We illustrate the use of the moment-based mixture of regression models with an application to wine quality data. Comment: 17 pages, 3 figures

arXiv.org e-Print Archive

Copenhagen University Research Information System

Newborn dried blood spot samples in Denmark: the hidden figures of secondary use and research participation

Authors: Ekstrøm Claus Thorn, Nordfalk Francisca
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Copenhagen University Research Information System

dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R

Authors: Ekstrøm Claus Thorn, Petersen Anne Helby
Publication venue: 'Foundation for Open Access Statistics'
Publication date: 01/01/2019

Data cleaning and validation are important steps in any data analysis, as the validity of the conclusions from the analysis hinges on the quality of the input data. Mistakes in the data can arise for any number of reasons, including erroneous codings, malfunctioning measurement equipment, and inconsistent data generation manuals. Ideally, a human investigator should go through each variable in the dataset and look for potential errors - both in input values and codings - but that process can be very time-consuming, expensive and error-prone in itself. We describe an R package, dataMaid, which implements an extensive and customizable suite of quality assessment aids that can be applied to a dataset in order to identify potential problems in its variables. The results are presented in an auto-generated, nontechnical, stand-alone overview document intended to be perused by an investigator with an understanding of the variables in the data, but not necessarily knowledge of R. Thereby, dataMaid aids the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data quality screening. Moreover, the dataMaid solution changes the data screening process from the usual ad hoc approach to a systematic, well-documented endeavor. dataMaid also provides a suite of more typical R tools for interactive data quality assessment and screening, where the data inspections are executed directly in the R console

Copenhagen University Research Information System

Journal of Statistical Software

Having a Ball: evaluating scoring streaks and game excitement using in-match trend estimation

Authors: Ekstrøm Claus Thorn, Jensen Andreas Kryger
Publication date: 22/12/2020

Many popular sports involve matches between two teams or players where each team has the possibility of scoring points throughout the match. While the overall match winner and result are interesting, they convey little information about the underlying scoring trends throughout the match. Modeling approaches that accommodate a finer granularity of the score difference throughout the match are needed to evaluate in-game strategies, discuss scoring streaks, team strengths, and other aspects of the game. We propose a latent Gaussian process to model the score difference between two teams and introduce the Trend Direction Index as an easily interpretable probabilistic measure of the current trend in the match as well as a measure of post-game trend evaluation. In addition we propose the Excitement Trend Index - the expected number of monotonicity changes in the running score difference - as a measure of overall game excitement. Our proposed methodology is applied to all 1143 matches from the 2019-2020 National Basketball Association (NBA) season. We show how the trends can be interpreted in individual games and how the excitement score can be used to cluster teams according to how exciting they are to watch

arXiv.org e-Print Archive

Copenhagen University Research Information System
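
Illustrative note on the "Having a Ball" entry: the abstract defines the Excitement Trend Index as the expected number of monotonicity changes in the running score difference, computed under a latent Gaussian process. The sketch below is not the authors' estimator. It only counts the observed monotonicity changes in a single recorded score-difference series, which may help make the quantity concrete. The function name `count_monotonicity_changes` and the sample data are hypothetical.

```python
from typing import Sequence


def count_monotonicity_changes(score_diff: Sequence[float]) -> int:
    """Count switches between rising and falling in an observed score difference.

    Assumption: this is a plain empirical count on raw data, not the
    model-based expectation described in the abstract.
    """
    changes = 0
    last_sign = 0  # sign of the most recent non-zero step; 0 until one is seen
    for prev, curr in zip(score_diff, score_diff[1:]):
        step = curr - prev
        if step == 0:
            continue  # flat stretches do not change the direction
        sign = 1 if step > 0 else -1
        if last_sign != 0 and sign != last_sign:
            changes += 1
        last_sign = sign
    return changes


# Hypothetical running score difference (home minus away) after each score.
running_diff = [0, 2, 4, 3, 1, 4, 6, 5]
print(count_monotonicity_changes(running_diff))  # prints 3
```

A higher count indicates a match in which the lead trend reversed more often. The model-based index in the abstract instead averages over the uncertainty in the underlying trend.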