Journal of Statistical Software
BoXHED2.0: Scalable Boosting of Dynamic Survival Analysis
Modern applications of survival analysis increasingly involve time-dependent covariates. The Python package BoXHED2.0 (Boosted eXact Hazard Estimator with Dynamic covariates) is a tree-boosted hazard estimator that is fully nonparametric and is applicable to survival settings far more general than right-censoring, including recurrent events and competing risks. BoXHED2.0 is also scalable to the point of being on the same order of speed as parametric boosted survival models, in part because its core is written in C++ and it supports the use of GPUs and multicore CPUs. BoXHED2.0 is available from PyPI and also from https://github.com/BoXHED.
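As an illustration of the data layout such a dynamic-covariate hazard estimator works with, the sketch below builds a small longitudinal dataset in Python. The column names and the commented-out fit/predict calls are assumptions for illustration only, not the verified BoXHED2.0 API; consult the documentation at https://github.com/BoXHED for the actual preprocessing steps and interface.

```python
import pandas as pd

# Longitudinal layout typically used for hazard estimation with
# time-dependent covariates: one row per subject-interval, with the
# covariate value held on (t_start, t_end] and delta = 1 if the event
# occurred at t_end. (Column names here are illustrative only.)
data = pd.DataFrame({
    "ID":      [1,   1,   1,   2,   2],
    "t_start": [0.0, 1.0, 2.5, 0.0, 1.5],
    "t_end":   [1.0, 2.5, 3.0, 1.5, 4.0],
    "X_0":     [0.2, 0.4, 0.9, 0.1, 0.3],  # time-dependent covariate
    "delta":   [0,   0,   1,   0,   0],    # event indicator at t_end
})

# Hypothetical fit/predict calls (not the verified API):
# from boxhed import boxhed
# model = boxhed()
# model.fit(...)
# hazard = model.predict(...)
```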
Stability Selection and Consensus Clustering in R: The R Package sharp
The R package sharp (Stability-enHanced Approaches using Resampling Procedures) provides an integrated framework for stability-enhanced variable selection, graphical modeling, and clustering. In stability selection, a feature selection algorithm is combined with a resampling technique to estimate feature selection probabilities. Features with selection proportions above a threshold are considered stably selected. Similarly, in consensus clustering, a clustering algorithm is applied to multiple subsamples of items to compute co-membership proportions. The consensus clusters are obtained by clustering using the co-membership proportions as a measure of similarity. We calibrate the hyper-parameters of stability selection (or consensus clustering) jointly by maximizing a consensus score calculated under the null hypothesis of equiprobability of selection (or co-membership), which characterizes instability. The package offers flexibility in the modeling, includes diagnostic and visualization tools, and allows for parallelization.
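The resampling idea behind stability selection can be sketched in a few lines. The Python snippet below illustrates it generically with a LASSO base learner from scikit-learn; it is not the sharp API, which in addition calibrates the penalty and the selection threshold jointly via the consensus score described above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # only features 0 and 1 matter

K, lam, pi_thr = 100, 0.1, 0.9                   # subsamples, penalty, threshold
selected = np.zeros(p)
for _ in range(K):
    idx = rng.choice(n, size=n // 2, replace=False)   # 50% subsample
    fit = Lasso(alpha=lam).fit(X[idx], y[idx])
    selected += (fit.coef_ != 0)                      # count selections

selection_proportions = selected / K
stable_set = np.where(selection_proportions >= pi_thr)[0]
print(stable_set)   # features selected in at least 90% of subsamples
```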
RESI: An R Package for Robust Effect Sizes
Effect size indices are useful parameters that quantify the strength of association and are unaffected by sample size. There are many available effect size parameters and estimators, but it is difficult to compare effect sizes across studies because most are defined for a specific type of population parameter. We recently introduced a new robust effect size index (RESI) and confidence interval, which is advantageous because it is not model-specific. Here we present the RESI R package, which makes it easy to report the RESI and its confidence interval for many different model classes, with a consistent interpretation across parameters and model types. The package produces coefficient tables, ANOVA tables, and overall Wald tests for model inputs, appending the RESI estimate and confidence interval to each. The package also includes functions for visualization and for conversions to and from other effect size measures. For illustration, we analyze and interpret three datasets using different model types.
hmmTMB: Hidden Markov Models with Flexible Covariate Effects in R
Hidden Markov models (HMMs) are widely applied in studies where a discrete-valued process of interest is observed indirectly. They have, for example, been used to model behavior from human and animal tracking data, disease status from medical data, and financial market volatility from stock prices. The model has two main sets of parameters: transition probabilities, which drive the latent state process, and observation parameters, which characterize the state-dependent distributions of observed variables. One particularly useful extension of HMMs is the inclusion of covariates on those parameters, to investigate the drivers of state transitions or to implement Markov-switching regression models. We present the new R package hmmTMB for HMM analyses, with flexible covariate models in both the hidden state and observation parameters. In particular, non-linear effects are implemented using penalized splines, including multiple univariate and multivariate splines, with automatic smoothness selection. The package allows for various random effect formulations (including random intercepts and slopes) to capture between-group heterogeneity. hmmTMB can be applied to multivariate observations, and it accommodates various types of response data, including continuous (bounded or not), discrete, and binary variables. Parameter constraints can be used to implement non-standard dependence structures, such as semi-Markov, higher-order Markov, and autoregressive models. Here, we summarize the relevant statistical methodology, describe the structure of the package, and present an example analysis of animal tracking data to showcase the workflow of the package.
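To make the covariate extension concrete, the sketch below builds a covariate-dependent transition matrix through a multinomial-logit (softmax) link in plain Python. It illustrates the general construction only and is not hmmTMB code, which additionally supports penalized-spline and random-effect terms in the linear predictors.

```python
import numpy as np

def transition_matrix(z, beta):
    """Covariate-dependent transition probabilities: each off-diagonal
    entry gets a linear predictor beta0 + beta1 * z, the diagonal is the
    reference category (predictor fixed at 0), and each row is normalized
    to sum to one."""
    n_states = beta.shape[0]
    eta = np.zeros((n_states, n_states))
    off_diag = ~np.eye(n_states, dtype=bool)
    eta[off_diag] = beta[off_diag, 0] + beta[off_diag, 1] * z
    gamma = np.exp(eta)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Two states; beta[i, j] holds (intercept, slope) for the i -> j transition.
beta = np.array([[[0.0, 0.0], [-2.0, 0.5]],
                 [[-1.0, -0.3], [0.0, 0.0]]])
print(transition_matrix(z=1.2, beta=beta))
```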
StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables
StepMix is an open-source Python package for the pseudo-likelihood estimation (one-, two-, and three-step approaches) of generalized finite mixture models (latent profile and latent class analysis) with external variables (covariates and distal outcomes). In many applications in the social sciences, the main objective is not only to cluster individuals into latent classes, but also to use these classes to develop more complex statistical models. These models generally divide into a measurement model that relates the latent classes to observed indicators, and a structural model that relates covariates and outcome variables to the latent classes. The measurement and structural models can be estimated jointly using the so-called one-step approach or sequentially using stepwise methods, which present significant advantages for practitioners regarding the interpretability of the estimated latent classes. In addition to the one-step approach, StepMix implements the most important stepwise estimation methods from the literature, including the bias-adjusted three-step methods with Bolck-Croon-Hagenaars and maximum likelihood corrections and the more recent two-step approach. These pseudo-likelihood estimators are presented in this paper under a unified framework as specific expectation-maximization subroutines. To facilitate and promote their adoption among the data science community, StepMix follows the object-oriented design of the scikit-learn library and provides an additional R wrapper.
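A minimal usage sketch, assuming the scikit-learn-style interface described above; the import path and argument names (n_components, measurement, structural, n_steps) are assumptions and should be checked against the StepMix documentation.

```python
import numpy as np
# Assumed import path and argument names; verify against the StepMix docs.
from stepmix.stepmix import StepMix

X = np.random.randint(0, 2, size=(500, 6))   # binary indicators (measurement model)
Z = np.random.normal(size=(500, 1))          # external variable (structural model)

# Three-step estimation, scikit-learn style: instantiate, fit, predict.
model = StepMix(n_components=3, measurement="binary",
                structural="covariate", n_steps=3)
model.fit(X, Z)
classes = model.predict(X, Z)                # most likely latent class per row
```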
mdendro: An R Package for Extended Agglomerative Hierarchical Clustering
mdendro is an R package that provides a comprehensive collection of linkage methods for agglomerative hierarchical clustering on a matrix of proximity data (distances or similarities), returning a multifurcated dendrogram, or multidendrogram. Multidendrograms can group more than two clusters at the same time, solving the nonuniqueness problem that arises when there are ties in the data. Because of this problem, different binary dendrograms are possible depending both on the order of the input data and on the criterion used to break ties. Weighted and unweighted versions of the most common linkage methods are included in the package, which also implements two parametric linkage methods. In addition, package mdendro provides five descriptive measures to analyze the resulting dendrograms: cophenetic correlation coefficient, space distortion ratio, agglomerative coefficient, chaining coefficient, and tree balance.
Optimum Allocation for Adaptive Multi-Wave Sampling in R: The R Package optimall
The R package optimall offers a collection of functions that efficiently streamline the design process of sampling in surveys ranging from simple to complex. The package's main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum using Neyman or Wright allocation, and select specific units to sample based on a stratified sampling design. Using real-life epidemiological study examples, we demonstrate how optimall facilitates an efficient workflow for the design and implementation of surveys in R. Although tailored towards multi-wave sampling under two- or three-phase designs, the R package optimall may be useful for any sampling survey.
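The Neyman allocation rule mentioned above can be written down directly: with total sample size n, stratum population sizes N_h, and within-stratum standard deviations S_h, allocate n_h = n * N_h * S_h / sum_k(N_k * S_k). The snippet below computes it generically in Python; it is an illustration of the formula, not the optimall API.

```python
import numpy as np

# Neyman allocation: sample each stratum in proportion to N_h * S_h.
N = np.array([5000, 3000, 2000])   # stratum population sizes
S = np.array([1.2, 0.8, 2.5])      # within-stratum standard deviations
n = 400                            # total sample size

n_h = n * (N * S) / np.sum(N * S)
print(np.round(n_h))               # roughly [179, 72, 149], summing to 400
```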
scpi: Uncertainty Quantification for Synthetic Control Methods
The synthetic control method offers a way to quantify the effect of an intervention using weighted averages of untreated units to approximate the counterfactual outcome that the treated unit(s) would have experienced in the absence of the intervention. This method is useful for program evaluation and causal inference in observational studies. We introduce the software package scpi for prediction and inference using synthetic controls, implemented in Python, R, and Stata. For point estimation or prediction of treatment effects, the package offers an array of (possibly penalized) approaches leveraging the latest optimization methods. For uncertainty quantification, the package offers the prediction interval methods introduced by Cattaneo, Feng, and Titiunik (2021) and Cattaneo, Feng, Palomba, and Titiunik (2025b). The paper includes numerical illustrations and a comparison with other synthetic control software.
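The weighting idea can be sketched independently of any package: choose nonnegative donor weights summing to one so that the weighted average of untreated units best reproduces the treated unit's pre-treatment outcomes. The Python snippet below solves this classic constrained least-squares problem on simulated data; scpi itself adds the penalized variants and the prediction-interval machinery described above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
Y_donors_pre = rng.normal(size=(20, 5))                 # 20 pre-periods, 5 donors
w_true = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
Y_treated_pre = Y_donors_pre @ w_true + rng.normal(scale=0.05, size=20)

def loss(w):
    # Pre-treatment fit of the weighted donor average to the treated unit.
    return np.sum((Y_treated_pre - Y_donors_pre @ w) ** 2)

# Weights constrained to the simplex: w >= 0 and sum(w) = 1.
res = minimize(loss, x0=np.full(5, 0.2),
               bounds=[(0, 1)] * 5,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
print(np.round(res.x, 2))   # recovered weights, close to w_true
```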
pymle: A Python Package for Maximum Likelihood Estimation and Simulation of Stochastic Differential Equations
This paper introduces the object-oriented Python package pymle, which provides core functionality for maximum likelihood estimation and simulation of univariate stochastic differential equations. The package supports maximum likelihood estimation using Euler, Elerian, Ozaki, Shoji-Ozaki, Hermite polynomial, and Kessler density approximations, as well as a recently proposed continuous-time Markov chain approximation scheme. Exact maximum likelihood estimation is also provided when available. The framework supports estimation and simulation for 21 stochastic differential equation models at the time of writing, and its object-oriented design facilitates easy extensions to new models and approximation methods.
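The Euler density approximation named above has a simple closed form: the transition X_{t+dt} given X_t = x is treated as Gaussian with mean x + mu(x)*dt and variance sigma(x)^2*dt. The Python sketch below evaluates this pseudo-likelihood for an Ornstein-Uhlenbeck process; it illustrates the approximation itself and does not use the pymle API.

```python
import numpy as np
from scipy.stats import norm

def euler_loglik(theta, x, dt):
    """Euler pseudo-log-likelihood for dX = kappa*(alpha - X) dt + sigma dW:
    each increment is scored under a Gaussian with mean x + drift*dt and
    standard deviation sigma*sqrt(dt)."""
    kappa, alpha, sigma = theta
    drift = kappa * (alpha - x[:-1])            # drift at the left endpoints
    mean = x[:-1] + drift * dt
    sd = sigma * np.sqrt(dt)
    return np.sum(norm.logpdf(x[1:], loc=mean, scale=sd))

# Simulate an OU path with the Euler scheme and evaluate the log-likelihood.
rng = np.random.default_rng(2)
dt, n, kappa, alpha, sigma = 1 / 252, 2000, 3.0, 0.05, 0.2
x = np.empty(n); x[0] = 0.05
for i in range(n - 1):
    x[i + 1] = x[i] + kappa * (alpha - x[i]) * dt + sigma * np.sqrt(dt) * rng.normal()

print(euler_loglik((3.0, 0.05, 0.2), x, dt))
```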
NUBO: A Transparent Python Package for Bayesian Optimization
NUBO, short for Newcastle University Bayesian Optimisation, is a Bayesian optimization framework for the optimization of expensive-to-evaluate black-box functions, such as physical experiments and computer simulators. Bayesian optimization is a cost-efficient optimization strategy that uses surrogate modelling via Gaussian processes to represent an objective function and acquisition functions to guide the selection of candidate points to approximate the global optimum of the objective function. NUBO itself focuses on transparency and user experience to make Bayesian optimization easily accessible to researchers from all disciplines. Clean and understandable code, precise references, and thorough documentation ensure transparency, while user experience is ensured by a modular and flexible design, easy-to-write syntax, and careful selection of Bayesian optimization algorithms. NUBO allows users to tailor Bayesian optimization to their specific problem by writing the optimization loop themselves using the provided building blocks. It supports sequential single-point, parallel multi-point, and asynchronous optimization of bounded, constrained, and/or mixed (discrete and continuous) parameter input spaces. Only algorithms and methods that are extensively tested and validated to perform well are included in NUBO. This ensures that the package remains compact and does not overwhelm the user with an unnecessarily large number of options. The package is written in Python but does not require expert knowledge of Python to optimize your simulators and experiments. NUBO is distributed as open-source software under the BSD 3-Clause license.
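To show the kind of loop users would compose from such building blocks, the sketch below runs a generic expected-improvement Bayesian optimization loop in Python, using scikit-learn's Gaussian process as the surrogate and a random candidate set for acquisition maximization. It is a stand-in for illustration and does not use NUBO's own classes.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):                      # cheap stand-in for an expensive black box
    return -np.sum((x - 0.3) ** 2, axis=-1)

rng = np.random.default_rng(3)
X = rng.uniform(size=(5, 2))           # initial design in [0, 1]^2
y = objective(X)

for _ in range(20):
    # Surrogate: Gaussian process fitted to all evaluations so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    # Acquisition: expected improvement over a random candidate set.
    cand = rng.uniform(size=(2000, 2))
    mu, sd = gp.predict(cand, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sd, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)
    x_next = cand[np.argmax(ei)]
    # Evaluate the objective at the chosen point and update the data.
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

print(X[np.argmax(y)], y.max())        # best point found and its value
```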