
    Fast Cross-Validation via Sequential Testing

    With the increasing size of today's data sets, finding the right parameter configuration in model selection via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses nonparametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of full cross-validation. Theoretical considerations underline the statistical power of our procedure. The experimental evaluation shows that our method reduces the computation time by a factor of up to 120 compared to a full cross-validation, with a negligible impact on accuracy.
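
    The core idea — score all candidates on a small subset, drop clear losers, and grow the data for the survivors — can be sketched as follows. This is a rough illustration only, not the authors' procedure: the paper eliminates candidates via nonparametric sequential tests, whereas this sketch uses a simple fixed margin against the current leader, and `fast_cv`, `fit_score`, and all parameters are hypothetical names.

```python
import numpy as np

def fast_cv(candidates, fit_score, X, y, steps=5, margin=0.05):
    """Evaluate candidates on linearly increasing subsets of (X, y),
    dropping candidates whose mean score trails the leader by `margin`.
    fit_score(c, X_train, y_train, X_test, y_test) -> score in [0, 1]."""
    n = len(y)
    active = list(candidates)
    scores = {c: [] for c in candidates}
    for s in range(1, steps + 1):
        m = n * s // steps                      # linearly growing subset
        cut = int(0.8 * m)                      # simple train/test split
        for c in active:
            scores[c].append(
                fit_score(c, X[:cut], y[:cut], X[cut:m], y[cut:m]))
        best = max(np.mean(scores[c]) for c in active)
        active = [c for c in active if np.mean(scores[c]) >= best - margin]
    return max(active, key=lambda c: np.mean(scores[c]))
```

    Candidates eliminated in an early round are never refitted on the larger subsets, which is where the speed-up comes from.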

    Change-point Problem and Regression: An Annotated Bibliography

    The problems of identifying changes at unknown times and of estimating the location of changes in stochastic processes are referred to as the change-point problem or, in the Eastern literature, as "disorder". The change-point problem, first introduced in the quality-control context, has since developed into a fundamental problem in the areas of statistical control theory, stationarity of a stochastic process, estimation of the current position of a time series, testing and estimation of change in the patterns of a regression model, and, most recently, the comparison and matching of DNA sequences in microarray data analysis. Numerous methodological approaches have been implemented in examining change-point models. Maximum-likelihood estimation, Bayesian estimation, isotonic regression, piecewise regression, quasi-likelihood and non-parametric regression are among the methods which have been applied to resolve challenges in change-point problems. Grid-searching approaches have also been used to examine the change-point problem.

    Statistical analysis of change-point problems depends on the method of data collection. If data collection is ongoing until some random time, the appropriate statistical procedure is called sequential; if, however, a large finite set of data is collected with the purpose of determining whether at least one change-point occurred, the procedure may be referred to as non-sequential. Not surprisingly, both settings have rich literatures, with much of the earlier work focusing on sequential methods inspired by applications in quality control for industrial processes. In the regression literature, the change-point model is also referred to as two- or multiple-phase regression, switching regression, segmented regression, two-stage least squares (Shaban, 1980), or broken-line regression.

    The change-point problem has been the subject of intensive research over the past half-century. The subject has evolved considerably and found applications in many different areas. It seems rather impossible to summarize all of the research carried out over the past 50 years on the change-point problem; we have therefore confined ourselves to those articles on change-point problems which pertain to regression. The important branch of sequential procedures in change-point problems has been left out entirely; we refer readers to the review papers by Lai (1995, 2001). The so-called structural change models, which occupy a considerable portion of the research in the area, particularly among econometricians, have not been fully considered; we refer the reader to Perron (2005) for an updated review. Articles on change-point in time series are considered only if the methodologies presented pertain to regression analysis.
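
    As a concrete instance of the grid-searching approaches mentioned above, the following sketch locates a single change point in a two-phase (broken-line) regression by minimizing the total residual sum of squares over candidate split indices. The name `grid_change_point` and the `min_seg` guard are illustrative choices, not taken from any surveyed paper.

```python
import numpy as np

def grid_change_point(x, y, min_seg=3):
    """Grid search for a single change point in two-phase linear
    regression: return the split index with the smallest total RSS."""
    n = len(y)
    best_tau, best_rss = None, np.inf
    for tau in range(min_seg, n - min_seg):
        rss = 0.0
        for lo, hi in ((0, tau), (tau, n)):
            # ordinary least squares fit of y ~ 1 + x on this segment
            A = np.column_stack([np.ones(hi - lo), x[lo:hi]])
            beta, *_ = np.linalg.lstsq(A, y[lo:hi], rcond=None)
            resid = y[lo:hi] - A @ beta
            rss += resid @ resid
        if rss < best_rss:
            best_tau, best_rss = tau, rss
    return best_tau
```

    For multiple change points the same idea applies, but dynamic programming replaces the single-loop grid to keep the search tractable.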

    Methods for non-proportional hazards in clinical trials: A systematic review

    For the analysis of time-to-event data, frequently used methods such as the log-rank test or the Cox proportional hazards model are based on the proportional hazards assumption, which is often debatable. Although a wide range of parametric and non-parametric methods for non-proportional hazards (NPH) has been proposed, there is no consensus on the best approaches. To close this gap, we conducted a systematic literature search to identify statistical methods and software appropriate under NPH. Our literature search identified 907 abstracts, of which we included 211 articles, mostly methodological ones; review articles and applications were identified less frequently. The articles discuss effect measures, effect estimation and regression approaches, hypothesis tests, and sample size calculation approaches, which are often tailored to specific NPH situations. Using a unified notation, we provide an overview of the available methods and derive some guidance from the identified articles. We summarize the contents of the literature review concisely in the main text and provide more detailed explanations in the supplement (page 29).
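
    For context, the log-rank family referenced above is a weighted sum of observed-minus-expected event counts; making the weight depend on the pooled Kaplan-Meier survival yields Fleming-Harrington-type tests often considered under NPH. A minimal two-sample sketch, for illustration only (`weighted_logrank` is a hypothetical name; real analyses should use a vetted implementation):

```python
import numpy as np

def weighted_logrank(time, event, group, weight=lambda s: 1.0):
    """Two-sample weighted log-rank z-statistic. `weight` is applied to
    the pooled Kaplan-Meier survival just before each event time; the
    constant weight 1 recovers the standard log-rank test."""
    order = np.argsort(time)
    time, event, group = time[order], event[order], group[order]
    n = len(time)
    at_risk, at_risk1 = n, int(np.sum(group == 1))
    s_pooled, U, V = 1.0, 0.0, 0.0
    i = 0
    while i < n:
        j = i
        d = d1 = 0
        while j < n and time[j] == time[i]:   # gather ties at this time
            if event[j]:
                d += 1
                d1 += int(group[j])
            j += 1
        if d > 0 and at_risk > 1:
            w = weight(s_pooled)
            p1 = at_risk1 / at_risk
            U += w * (d1 - d * p1)            # observed minus expected
            V += w * w * d * p1 * (1 - p1) * (at_risk - d) / (at_risk - 1)
            s_pooled *= 1 - d / at_risk
        at_risk -= j - i                      # drop this time from risk set
        at_risk1 -= int(np.sum(group[i:j] == 1))
        i = j
    return U / np.sqrt(V)
```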

    Non-parametric Sequential and Adaptive Designs for Survival Trials

    This thesis deals with fixed sample size, sequential and adaptive survival trials and consists of two major parts. In the first part, fixed sample size, sequential and adaptive testing methods are derived that utilize data from a survival endpoint as well as a categorical surrogate endpoint in a fully non-parametric way, without the need to assume any type of proportional hazards. In the second part, extensions to quality-adjusted survival endpoints are discussed.

    In existing adaptive methods for confirmatory survival trials with flexible adaptation rules, strict type-I-error control is ensured only if the interim decisions are based solely on the primary endpoint. In trials with long-term follow-up it is often desirable to base interim decisions also on correlated short-term endpoints, such as a surrogate marker. Surrogate information available at the interim analysis may be used to predict future event times. If interim decisions, such as selection of a subgroup or changes to the recruitment process, depend on this information, control of the type-I-error is no longer formally guaranteed for methods assuming an independent-increments structure.

    In this thesis the weighted Kaplan-Meier estimator, a modification of the classical Kaplan-Meier estimator incorporating discrete surrogate information, is used to construct a non-parametric test statistic for the comparison of survival distributions, a generalization of the average hazard ratio. It is shown how this test statistic can be used in fixed, group-sequential and adaptive designs such that the type-I-error is controlled. Asymptotic normality of the multivariate average hazard ratio is first verified in the fixed sample size context and then applied to noninferiority testing in a three-arm trial with non-proportional hazards survival data. In the next step, the independent-increments property is shown to hold asymptotically for the weighted Kaplan-Meier estimator and, consequently, for all test statistics based on it, so that standard methods for the calculation of group-sequential rejection boundaries are applicable. For adaptive designs, the weighted Kaplan-Meier estimator is modified to support stage-wise left-truncated and right-censored data, ensuring independence of the stage-wise test statistics even when interim decisions are based on surrogate information. Standard combination test methodology can then be used to ensure strict type-I-error control.

    Quality-adjusted survival is an integrated measure of quality-of-life data which has gained interest in recent years. In this thesis a novel non-parametric two-sample test for quality-adjusted survival distributions is developed that allows adjustment for covariate-dependent censoring, where the censoring is assumed to follow a proportional hazards model. It is shown how this result can be used to design adaptive trials with a quality-adjusted survival endpoint.
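
    The classical Kaplan-Meier estimator that the weighted variant builds on is the product, over distinct event times, of one minus the ratio of deaths to subjects at risk. A minimal sketch of that base object (the thesis's surrogate-weighted version is not reproduced here):

```python
import numpy as np

def kaplan_meier(time, event):
    """Classical Kaplan-Meier estimator: returns the distinct event
    times and the survival estimate just after each of them."""
    event_times = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)                  # still under observation
        deaths = np.sum((time == t) & (event == 1))  # events exactly at t
        s *= 1 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)
```

    Censored observations (event = 0) contribute to the risk sets of earlier event times but never trigger a factor themselves, which is how censoring is handled without a parametric model.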

    Nonparametric Identification of Multivariate Mixtures

    This article analyzes the identifiability of k-variate, M-component finite mixture models in which each component distribution has independent marginals, including models in latent class analysis. Without making parametric assumptions on the component distributions, we investigate how one can identify the number of components and the component distributions from the distribution function of the observed data. We reveal an important link between the number of variables (k), the number of values each variable can take, and the number of identifiable components. A lower bound on the number of components (M) is nonparametrically identifiable if k >= 2, and the maximum identifiable number of components is determined by the number of different values each variable takes. When M is known, the mixing proportions and the component distributions are nonparametrically identified from matrices constructed from the distribution function of the data if (i) k >= 3, (ii) two of the k variables take at least M different values, and (iii) these matrices satisfy certain rank and eigenvalue conditions. For the case of unknown M, we propose an algorithm that possibly identifies M and the component distributions from the data. We discuss a condition for nonparametric identification and its observable implications. In case M cannot be identified, we use our identification condition to develop a procedure that consistently estimates a lower bound on the number of components by estimating the rank of a matrix constructed from the distribution function of the observed variables.

    Keywords: finite mixture, latent class analysis, latent class model, model selection, number of components, rank estimation.
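
    The rank argument can be illustrated numerically: for two discrete variables that are independent within each component, the joint distribution matrix factors as P = A diag(pi) B', so its rank lower-bounds the number of components. A small sketch with made-up component distributions (all numbers hypothetical):

```python
import numpy as np

# Hypothetical 2-component mixture of two discrete variables, each
# taking 3 values, independent within each component.
pi = np.array([0.4, 0.6])          # mixing proportions
A = np.array([[0.7, 0.1],          # column m: P(X1 = v | component m)
              [0.2, 0.3],
              [0.1, 0.6]])
B = np.array([[0.5, 0.2],          # column m: P(X2 = v | component m)
              [0.3, 0.3],
              [0.2, 0.5]])

# Joint distribution of (X1, X2): P = A diag(pi) B^T, so rank(P) <= M,
# with equality when the component marginals are linearly independent.
P = A @ np.diag(pi) @ B.T
print(np.linalg.matrix_rank(P))    # prints 2: a lower bound on M
```

    In practice P is replaced by an empirical joint frequency table, which is why the article's procedure rests on consistent rank estimation.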

    A class of two-sample nonparametric statistics for binary and time-to-event outcomes

    © The Author(s) 2021. We propose a class of two-sample statistics for testing the equality of proportions and the equality of survival functions. We build our proposal on a weighted combination of a score test for the difference in proportions and a Weighted Kaplan-Meier statistic-based test for the difference of survival functions. The proposed statistics are fully non-parametric and do not rely on the proportional hazards assumption for the survival outcome. We present the asymptotic distribution of these statistics, propose a variance estimator and show their asymptotic properties under fixed and local alternatives. We discuss different choices of weights, including those that control the relative relevance of each outcome and emphasize the type of difference to be detected in the survival outcome. We evaluate the performance of these statistics with a simulation study and illustrate their use with a randomized phase III cancer vaccine trial. We have implemented the proposed statistics in the R package SurvBin, available on GitHub (this https URL). This work was supported by the Ministerio de Ciencia e Innovación (Spain) under Grant PID2019-104830RB-I00; the Departament d’Empresa i Coneixement de la Generalitat de Catalunya (Spain) under Grant 2017 SGR 622 (GRBIO); and the Ministerio de Economía y Competitividad (Spain), through the María de Maeztu Programme for Units of Excellence in R&D, under Grant MDM-2014-0445 to M. Bofill Roig.
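
    The weighting idea — standardize each component test and combine them linearly — can be illustrated as below. This is a generic sketch, not the SurvBin statistic: `combined_stat` is a hypothetical name, and in practice the correlation `rho` between the binary and survival statistics must itself be estimated, which is what the paper's variance estimator addresses.

```python
import numpy as np

def combined_stat(z_binary, z_survival, w=0.5, rho=0.0):
    """Linear combination of two standardized test statistics,
    restandardized using their (estimated) correlation `rho`.
    `w` controls the relative relevance of the binary outcome."""
    num = w * z_binary + (1 - w) * z_survival
    var = w ** 2 + (1 - w) ** 2 + 2 * w * (1 - w) * rho
    return num / np.sqrt(var)
```

    With w = 1 or w = 0 the combination degenerates to the binary or survival test alone, so the weight interpolates between the two endpoints.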

    Inference and Estimation in Change Point Models for Censored Data

    In general, the change-point problem considers inference about a change in distribution for a set of time-ordered observations. This has applications in a large variety of fields and also applies to survival data. With improvements in medical diagnoses and treatments, incidence and mortality rates have changed; however, the most commonly used analysis methods do not account for such distributional changes. In survival analysis, change-point problems can concern a shift in distribution for a set of time-ordered observations, potentially under censoring or truncation. In this dissertation, we first propose a sequential testing approach for detecting multiple change points in the Weibull accelerated failure time model, since this model is sufficiently flexible to accommodate increasing, decreasing, or constant hazard rates and is also the only continuous distribution for which the accelerated failure time model can be reparametrized as a proportional hazards model. Our sequential testing procedure does not require the number of change points to be known; this information is instead inferred from the data. We conduct a simulation study to show that the method accurately detects change points and estimates the model, and the numerical results, along with a real data application, demonstrate that our proposed method can detect change points in the hazard rate.

    In survival analysis, most existing methods compare two treatment groups over the entirety of the study period. Some treatments may take time to show effects in subjects; this has been called the time-lag effect in the literature, and where the time-lag effect is considerable, such methods may not be appropriate for detecting significant differences between two groups. In the second part of this dissertation, we propose a novel non-parametric approach for estimating the point of the treatment time-lag effect by using an empirical divergence measure. Theoretical properties of the estimator are studied, and results from simulated data and a real data example support our proposed method.
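
    A simple setting in which a change point in the hazard rate can be estimated is the piecewise-constant (piecewise-exponential) hazard: profiling out the two rates reduces the problem to a one-dimensional search over the change point. This is a generic illustration for uncensored data, not the dissertation's Weibull AFT sequential test; `hazard_change_point` is a hypothetical name.

```python
import numpy as np

def hazard_change_point(times, grid):
    """Profile log-likelihood search for one change point in a
    piecewise-constant hazard (uncensored data)."""
    times = np.asarray(times, dtype=float)
    n = len(times)
    best_tau, best_ll = None, -np.inf
    for tau in grid:
        d1 = np.sum(times <= tau)                # events before the change
        d2 = n - d1
        e1 = np.sum(np.minimum(times, tau))      # exposure before the change
        e2 = np.sum(np.maximum(times - tau, 0))  # exposure after the change
        if d1 == 0 or d2 == 0 or e2 == 0:
            continue
        # plug in the MLEs d1/e1 and d2/e2 for the two hazard rates
        ll = d1 * np.log(d1 / e1) + d2 * np.log(d2 / e2) - d1 - d2
        if ll > best_ll:
            best_tau, best_ll = tau, ll
    return best_tau
```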