Inference for feature selection using the Lasso with high-dimensional data
Penalized regression models such as the Lasso have proved useful for variable
selection in many fields - especially for situations with high-dimensional data
where the number of predictors far exceeds the number of observations. These
methods identify and rank variables of importance but do not generally provide
any inference for the selected variables. Thus, the variables selected might be
the "most important" but need not be significant. We propose a significance
test for the selection found by the Lasso. We introduce a procedure that
computes inference and p-values for features chosen by the Lasso. This method
rephrases the null hypothesis and uses a randomization approach which ensures
that the error rate is controlled even for small samples. We demonstrate the
ability of the algorithm to compute p-values of the expected magnitude with
simulated data using a multitude of scenarios that involve various effect
strengths and correlation between predictors. The algorithm is also applied to
a prostate cancer dataset that has been analyzed in recent papers on the
subject. The proposed method is found to provide a powerful way to make
inference for feature selection even for small samples and when the number of
predictors is several orders of magnitude larger than the number of
observations. The algorithm is implemented in the MESS package in R and is
freely available.
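To make the randomization idea in the abstract above concrete, here is a minimal sketch of a permutation test in Python. It is not the procedure implemented in the MESS package, and the abstract does not spell out the exact null hypothesis or test statistic. The sketch assumes a simplified setting: for standardized predictors, the first variable to enter the Lasso path is the one with the largest absolute correlation with the response, so the maximum absolute correlation is used as the test statistic and the response is permuted to build its null distribution. The function names (`abs_corr`, `max_abs_corr`, `permutation_p_value`) and the simulated data are hypothetical and chosen only for illustration.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant sequence carries no correlation information
    return abs(sxy) / (sxx * syy) ** 0.5


def max_abs_corr(columns, y):
    """Largest absolute correlation between any predictor column and y."""
    return max(abs_corr(col, y) for col in columns)


def permutation_p_value(columns, y, n_perm=500, seed=0):
    """Permutation p-value for the strongest predictor (assumed statistic)."""
    rng = random.Random(seed)
    observed = max_abs_corr(columns, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between predictors and response
        if max_abs_corr(columns, y_perm) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Simulated example (assumption): n = 20 observations, p = 50 predictors,
# and only the first predictor has a true effect.
sim = random.Random(1)
n, p = 20, 50
columns = [[sim.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3.0 * columns[0][i] + sim.gauss(0, 1) for i in range(n)]
p_value = permutation_p_value(columns, y)
print(p_value)
```

Because the maximum is recomputed over all predictors in every permutation, the reference distribution accounts for the selection step itself, which is the reason a randomization approach can stay valid when the predictors far outnumber the observations.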
Moment-based Estimation of Mixtures of Regression Models
Finite mixtures of regression models provide a flexible modeling framework
for many phenomena. Using moment-based estimation of the regression parameters,
we develop unbiased estimators with a minimum of assumptions on the mixture
components. In particular, only the average regression model for one of the
components in the mixture model is needed, and no requirements are placed on the
distributions. The consistency and asymptotic distribution of the estimators are
derived and the proposed method is validated through a series of simulation
studies and is shown to be highly accurate. We illustrate the use of the
moment-based mixture of regression models with an application to wine quality
data. Comment: 17 pages, 3 figures
Editorial
This special issue of the Nordic Journal of Information Literacy in Higher Education summarizes the Conference Creating Knowledge IX held in Vejle, Denmark, June 6-8, 2018.
The research librarian of the future: data scientist and co-investigator
There remains something of a disconnect between how research librarians themselves see their role and its responsibilities and how these are viewed by their faculty colleagues. Jeannette Ekstrøm, Mikael Elbaek, Chris Erdmann and Ivo Grigorov imagine how the research librarian of the future might work, utilising new data science and digital skills to drive more collaborative and open scholarship. Arguably this future is already upon us, but institutions must implement a structured approach to developing librarians’ skills and services to fully realise the benefits.
Having a Ball: evaluating scoring streaks and game excitement using in-match trend estimation
Many popular sports involve matches between two teams or players where each
team has the possibility of scoring points throughout the match. While the
overall match winner and result are interesting, they convey little information
about the underlying scoring trends throughout the match. Modeling approaches
that accommodate a finer granularity of the score difference throughout the
match are needed to evaluate in-game strategies, discuss scoring streaks, team
strengths, and other aspects of the game.
We propose a latent Gaussian process to model the score difference between
two teams and introduce the Trend Direction Index as an easily interpretable
probabilistic measure of the current trend in the match as well as a measure of
post-game trend evaluation. In addition, we propose the Excitement Trend Index -
the expected number of monotonicity changes in the running score difference -
as a measure of overall game excitement.
Our proposed methodology is applied to all 1143 matches from the 2019-2020
National Basketball Association (NBA) season. We show how the trends can be
interpreted in individual games and how the excitement score can be used to
cluster teams according to how exciting they are to watch.
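The Excitement Trend Index described above is an expected number of monotonicity changes, taken under the latent Gaussian process model. As a rough illustration of the quantity being counted, the following Python sketch counts monotonicity changes directly in an observed running score difference. This is only an empirical analogue under the assumption that raw score differences are used. It does not fit a Gaussian process or compute an expectation, and the function name `monotonicity_changes` is hypothetical.

```python
def monotonicity_changes(score_diff):
    """Count how often a running score difference switches direction.

    A change is counted each time the sequence goes from increasing to
    decreasing or vice versa. Flat steps (no change in the difference)
    are skipped and do not reset the current direction.
    """
    changes = 0
    last_sign = 0  # 0 means no direction has been observed yet
    for prev, cur in zip(score_diff, score_diff[1:]):
        step = cur - prev
        if step == 0:
            continue
        sign = 1 if step > 0 else -1
        if last_sign != 0 and sign != last_sign:
            changes += 1
        last_sign = sign
    return changes


# Hypothetical running score difference (home minus away) during a match.
example = [0, 2, 4, 3, 1, 2, 5]
print(monotonicity_changes(example))
```

A match in which the lead swings back and forth yields a higher count than one in which a single team pulls steadily ahead, which matches the intuition behind using such a count as a measure of excitement.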