
    Multiarm, multistage randomized controlled trials with stopping boundaries for efficacy and lack of benefit: An update to nstage

    Royston et al.’s (2011, Trials 12: 81) multiarm, multistage (MAMS) framework for the design of randomized clinical trials uses intermediate outcomes to drop research arms early for lack of benefit at interim stages, increasing efficiency in multiarm designs. However, additionally permitting interim evaluation of efficacy on the primary outcome measure could increase adoption of the design and result in practical benefits, such as savings in patient numbers and cost, should any efficacious arm be identified early. The nstage command, which aids the design of MAMS trials, has been updated to support this methodological extension. Operating characteristics can now be calculated for a design with binding or nonbinding stopping rules for lack of benefit and with efficacy stopping boundaries. An additional option searches for a design that strongly controls the familywise error rate at the desired level. We illustrate how the new features can be used to design a trial with the drop-down menu, using the original comparisons from the MAMS trial STAMPEDE as an example. The new functionality of the command serves a broader range of trial objectives, increases the efficiency of the design, and should thus increase uptake of the MAMS design in practice.
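
    The interplay between lack-of-benefit rules and the type I error rate can be illustrated with a toy Monte Carlo calculation. The sketch below is a simplification under stated assumptions (a single two-stage pairwise comparison, standard normal test statistics, arbitrary boundary values), not the nstage implementation itself; it shows why a binding futility rule lowers the type I error rate relative to a nonbinding one.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n_sim = 200_000
    info_frac = 0.5        # fraction of information at the interim (assumption)
    futility_z = 0.0       # lack-of-benefit bound: continue only if z1 > 0
    final_crit = 1.96      # illustrative final critical value

    # Under H0, the final statistic combines the interim statistic with an
    # independent increment; corr(z1, z2) = sqrt(info_frac).
    z1 = rng.standard_normal(n_sim)
    z2 = np.sqrt(info_frac) * z1 + np.sqrt(1 - info_frac) * rng.standard_normal(n_sim)

    binding = np.mean((z1 > futility_z) & (z2 > final_crit))  # stop enforced
    nonbinding = np.mean(z2 > final_crit)                     # stop may be overridden
    print(f"type I error, binding rule:    {binding:.4f}")
    print(f"type I error, nonbinding rule: {nonbinding:.4f}")
    ```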

    Impact of lack-of-benefit stopping rules on treatment effect estimates of two-arm multi-stage (TAMS) trials with time to event outcome

    In 2011, Royston et al. described the technical details of a two-arm, multi-stage (TAMS) design. The design enables a trial to be stopped part-way through recruitment if the accumulating data suggest a lack of benefit of the experimental arm. Such interim decisions can be made using data on an available 'intermediate' outcome. At the conclusion of the trial, the definitive outcome is analyzed. Typical intermediate and definitive outcomes in cancer might be progression-free and overall survival, respectively. In TAMS designs, the stopping rule applied at the interim stage(s) affects the sampling distribution of the treatment effect estimator, potentially inducing bias that needs addressing.
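
    A minimal sketch of the phenomenon the paper addresses, under simple assumptions (a single interim look, normally distributed estimates, an arbitrary continue-if-positive rule) rather than the TAMS time-to-event setting: conditioning on passing a lack-of-benefit look shifts the distribution of the final estimate upward even when the true effect is zero.

    ```python
    import numpy as np

    rng = np.random.default_rng(7)
    n_sim, n_per_stage = 100_000, 100
    true_effect = 0.0                   # experimental arm has no benefit (assumption)

    # Stage-wise estimates of the treatment effect (means of unit-variance data).
    stage1 = rng.normal(true_effect, 1, (n_sim, n_per_stage)).mean(axis=1)
    stage2 = rng.normal(true_effect, 1, (n_sim, n_per_stage)).mean(axis=1)

    passed = stage1 > 0                 # lack-of-benefit rule: continue if favourable
    final_est = (stage1 + stage2) / 2   # naive pooled estimate at trial end

    print(f"mean estimate, all trials:        {final_est.mean():+.4f}")          # ~ 0
    print(f"mean estimate, continuing trials: {final_est[passed].mean():+.4f}")  # > 0
    ```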

    Type I error rates of multi-arm multi-stage clinical trials: strong control and impact of intermediate outcomes

    BACKGROUND: The multi-arm multi-stage (MAMS) design described by Royston et al. [Stat Med. 2003;22(14):2239-56 and Trials. 2011;12:81] can accelerate treatment evaluation by comparing multiple treatments with a control in a single trial and stopping recruitment to arms not showing sufficient promise during the course of the study. To increase efficiency further, interim assessments can be based on an intermediate outcome (I) that is observed earlier than the definitive outcome (D) of the study. Two measures of type I error rate are often of interest in a MAMS trial. The pairwise type I error rate (PWER) is the probability of recommending an ineffective treatment at the end of the study, regardless of the other experimental arms in the trial. The familywise type I error rate (FWER) is the probability of recommending at least one ineffective treatment and is often of greater interest in a study with more than one experimental arm. METHODS: We demonstrate how to calculate the PWER and FWER when the I and D outcomes in a MAMS design differ. We explore how each measure varies with respect to the underlying treatment effect on I and show how to control the type I error rate under any scenario. We conclude by applying the methods to estimate the maximum type I error rate of an ongoing MAMS study and show how the design might have looked had it controlled the FWER under any scenario. RESULTS: The PWER and FWER converge to their maximum values as the effectiveness of the experimental arms on I increases. We show that both measures can be controlled under any scenario by setting the pairwise significance level in the final stage of the study to the target level. In an example, controlling the FWER is shown to considerably increase the size of the trial, although the design remains substantially more efficient than evaluating each new treatment in separate trials. CONCLUSIONS: The proposed methods allow the PWER and FWER to be controlled in various MAMS designs, potentially increasing the uptake of the MAMS design in practice. The methods are also applicable in cases where the I and D outcomes are identical.
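
    A hedged sketch of the PWER/FWER distinction in the simplest case (single stage, equal allocation, identical I and D outcomes, normal statistics; all values are illustrative): pairwise z-statistics that share a control arm are correlated, and the FWER grows with the number of experimental arms.

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    K, n_sim, crit = 3, 200_000, 1.96   # K experimental arms, illustrative boundary

    # Arm-level statistics under H0; every pairwise comparison reuses the control,
    # giving correlation 0.5 between pairwise z-statistics (equal allocation).
    z_control = rng.standard_normal(n_sim)
    z_arms = rng.standard_normal((n_sim, K))
    z_pair = (z_arms - z_control[:, None]) / np.sqrt(2)

    pwer = np.mean(z_pair[:, 0] > crit)            # one fixed comparison
    fwer = np.mean((z_pair > crit).any(axis=1))    # at least one false recommendation
    print(f"PWER ~ {pwer:.4f}; FWER with K={K} arms ~ {fwer:.4f}")
    ```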

    Quantifying the uptake of user-written commands over time

    A major factor in the uptake of new statistical methods is the availability of user-friendly software implementations. One attractive feature of Stata is that users can write their own commands and release them to other users via the Statistical Software Components (SSC) archive at Boston College. Authors of statistical programs do not always get adequate credit, because programs are rarely cited properly. There is no obvious measure of a program’s impact, but researchers are under increasing pressure to demonstrate the impact of their work to funders. In addition to encouraging proper citation of software, the number of downloads of a user-written package can be regarded as a measure of impact over time. In this article, we explain how such information can be accessed for any month from July 2007 onward and summarized using the new ssccount command.
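
    As an illustration of the kind of summary ssccount produces, here is a hypothetical sketch in Python; the file downloads.csv, its columns, and the package name are assumptions for demonstration, since the real monthly download statistics are retrieved by the Stata command itself.

    ```python
    import pandas as pd

    # Hypothetical input: one row per package per month with a download count.
    df = pd.read_csv("downloads.csv", parse_dates=["month"])  # columns: month, package, hits
    pkg = df[df["package"] == "nstage"]

    print("total downloads:", pkg["hits"].sum())
    print(pkg.groupby(pkg["month"].dt.year)["hits"].sum())    # downloads per year
    ```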

    The extension of total gain (TG) statistic in survival models: Properties and applications

    Background: The results of multivariable regression models are usually summarized in the form of parameter estimates for the covariates, goodness-of-fit statistics, and the relevant p-values. These statistics do not inform us about whether covariate information will lead to any substantial improvement in prediction. Predictive ability measures can be used for this purpose since they provide important information about the practical significance of prognostic factors. R²-type indices are the most familiar forms of such measures in survival models, but they all have limitations and none is widely used. Methods: In this paper, we extend the total gain (TG) measure, proposed for a logistic regression model, to survival models and explore its properties using simulations and real data. TG is based on the binary regression quantile plot, otherwise known as the predictiveness curve. Standardised TG ranges from 0 (no explanatory power) to 1 ('perfect' explanatory power). Results: The results of our simulations show that, unlike many of the other R²-type predictive ability measures, TG is independent of random censoring. It increases as the effect of a covariate increases and can be applied to different types of survival models, including models with time-dependent covariate effects. We also apply TG to quantify the predictive ability of multivariable prognostic models developed in several disease areas. Conclusions: Overall, TG performs well in our simulation studies and can be recommended as a measure to quantify the predictive ability in survival models.
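
    For intuition, the binary-outcome total gain on which the survival extension builds can be computed directly from predicted probabilities. The sketch below is an illustrative reading of the measure (the scaling constant 2p(1-p), the maximum attainable TG for prevalence p, is stated as an assumption), not the authors' survival-model implementation.

    ```python
    import numpy as np

    def standardised_tg(pred_prob):
        """Standardised total gain from predicted event probabilities.

        TG is the mean absolute distance between the predictiveness curve and
        the horizontal line at the prevalence p; its maximum, attained by a
        perfect 0/1 predictor with mean p, is 2*p*(1-p) (assumed scaling).
        """
        p = pred_prob.mean()
        return np.abs(pred_prob - p).mean() / (2 * p * (1 - p))

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5_000)
    pred = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))   # simulated fitted probabilities
    print(f"standardised TG: {standardised_tg(pred):.3f}")   # between 0 and 1
    ```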

    Assessing the impact of efficacy stopping rules on the error rates under the multi-arm multi-stage framework

    BACKGROUND: The multi-arm multi-stage framework uses intermediate outcomes to assess lack-of-benefit of research arms at interim stages in randomised trials with time-to-event outcomes. However, the design lacks formal methods to evaluate early evidence of overwhelming efficacy on the definitive outcome measure. We explore the operating characteristics of this extension to the multi-arm multi-stage design and how to control the pairwise and familywise type I error rate. Using real examples and the updated nstage program, we demonstrate how such a design can be developed in practice. METHODS: We used the Dunnett approach for assessing treatment arms in comprehensive simulation studies evaluating the familywise error rate, with and without interim efficacy looks on the definitive outcome measure, at the same time as the planned lack-of-benefit interim analyses on the intermediate outcome measure. We studied the effect of the timing of interim analyses, allocation ratio, lack-of-benefit boundaries, efficacy rule, and number of stages and research arms on the operating characteristics of the design when efficacy stopping boundaries are incorporated. Methods for controlling the familywise error rate with efficacy looks were also addressed. RESULTS: Incorporating Haybittle–Peto stopping boundaries on the definitive outcome at the interim analyses will not inflate the familywise error rate in a multi-arm design with two stages. However, this rule is conservative; in general, more liberal stopping boundaries can be used with minimal impact on the familywise error rate. Efficacy bounds in trials with three or more stages using an intermediate outcome may inflate the familywise error rate, but we show how to maintain strong control. CONCLUSION: The multi-arm multi-stage design allows stopping for both lack-of-benefit on the intermediate outcome and efficacy on the definitive outcome at the interim stages. We provide guidelines on how to control the familywise error rate when efficacy boundaries are implemented in practice.
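
    The headline result for two-stage designs can be checked with a small simulation. The sketch below simplifies deliberately (normal statistics, one interim at half the information, efficacy and lack-of-benefit assessed on the same outcome, illustrative boundary values) and is not the nstage calculation:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    K, n_sim = 3, 300_000
    hp, final = 3.09, 1.96   # Haybittle-Peto interim bound (~p<0.001), final bound

    # Interim and final pairwise z-statistics under H0, sharing a control arm.
    zc1, zc2 = rng.standard_normal((2, n_sim))
    za1, za2 = rng.standard_normal((2, n_sim, K))
    z1 = (za1 - zc1[:, None]) / np.sqrt(2)          # interim (half information)
    z2 = ((za1 + za2) - (zc1 + zc2)[:, None]) / 2   # final analysis

    fwer_final_only = np.mean((z2 > final).any(axis=1))
    fwer_with_look = np.mean(((z1 > hp) | (z2 > final)).any(axis=1))
    print(f"FWER, final analysis only:      {fwer_final_only:.4f}")
    print(f"FWER, plus Haybittle-Peto look: {fwer_with_look:.4f}")
    ```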

    Adding new experimental arms to randomised clinical trials: Impact on error rates

    Background: Experimental treatments pass through various stages of development. If a treatment passes through early-phase experiments, the investigators may want to assess it in a late-phase randomised controlled trial. An efficient way to do this is adding it as a new research arm to an ongoing trial while the existing research arms continue, a so-called multi-arm platform trial. The familywise type I error rate is often a key quantity of interest in any multi-arm platform trial. We set out to clarify how it should be calculated when new arms are added to a trial some time after it has started. Methods: We show how the familywise type I error rate, any-pair and all-pairs powers can be calculated when a new arm is added to a platform trial. We extend the Dunnett probability and derive analytical formulae for the correlation between the test statistics of the existing pairwise comparison and that of the newly added arm. We also verify our analytical derivation via simulations. Results: Our results indicate that the familywise type I error rate depends on the amount of information shared through the common control arm (i.e. the number of control-arm individuals for continuous and binary outcomes, or the number of control-arm primary outcome events for time-to-event outcomes) and on the allocation ratio. The familywise type I error rate is driven more by the number of pairwise comparisons and the corresponding (pairwise) type I error rates than by the timing of the addition of the new arms. The familywise type I error rate can be estimated using Šidák’s correction if the correlation between the test statistics of pairwise comparisons is less than 0.30. Conclusions: The findings we present in this article can be used to design trials with pre-planned deferred arms or to add new pairwise comparisons within an ongoing platform trial where control of the pairwise error rate or familywise type I error rate (for a subset of pairwise comparisons) is required.
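
    The two quantities at the heart of the results, the correlation induced by a shared control arm and the Šidák approximation to the FWER, are easy to compute. The sketch below uses the standard Dunnett correlation form under equal variances as an assumption; it is illustrative, not the paper's derivation for staggered arm entry.

    ```python
    import numpy as np

    def pairwise_corr(n0, n1, n2):
        """Correlation of the z-statistics comparing arms 1 and 2 with a shared
        control of size n0 (standard Dunnett form under equal variances)."""
        return np.sqrt((n1 / (n1 + n0)) * (n2 / (n2 + n0)))

    def sidak_fwer(alpha, k):
        """Sidak familywise error rate for k comparisons at pairwise level alpha."""
        return 1 - (1 - alpha) ** k

    print(f"corr, 1:1:1 allocation:       {pairwise_corr(100, 100, 100):.2f}")  # 0.50
    print(f"corr, double-size control:    {pairwise_corr(200, 100, 100):.2f}")  # 0.33
    print(f"Sidak FWER, k=2, alpha=0.025: {sidak_fwer(0.025, 2):.4f}")
    ```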

    Proposals on Kaplan–Meier plots in medical research and a survey of stakeholder views: KMunicate

    OBJECTIVES: To examine reactions to proposed improvements to the Kaplan–Meier plot, the standard way to present time-to-event data, and to understand which (if any) facilitated better depiction of (1) the state of patients over time and (2) uncertainty over time in the estimates of survival. DESIGN: A survey of stakeholders’ opinions on the proposals. SETTING: A web-based survey, open to international participation, for those with an interest in the visualisation of time-to-event data. PARTICIPANTS: 1174 people participated in the survey over a 6-week period. Participation was global (although primarily from Europe and North America) and represented a wide range of researchers (primarily statisticians and clinicians). MAIN OUTCOME MEASURES: Two outcome measures were of principal importance: (1) participants’ opinions of each proposal compared with a ‘standard’ Kaplan–Meier plot; and (2) participants’ overall ranking of the proposals (including the standard). RESULTS: Most proposals were more popular than the standard Kaplan–Meier plot. The most popular proposals in the two categories were, respectively, an extended table beneath the plot depicting the numbers at risk, censored and having experienced an event at periodic timepoints, and confidence intervals around each Kaplan–Meier curve. CONCLUSION: This study produced a high number of responses, reflecting the importance of graphics for time-to-event data. Those producing and publishing Kaplan–Meier plots—both authors and journals—should, as a starting point, consider using the combination of the two favoured proposals.
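
    The two favoured proposals can be reproduced with standard software. A sketch using the Python lifelines library follows; the rows_to_show option of add_at_risk_counts is assumed to be available (recent lifelines versions), and the Waltons dataset shipped with the library stands in for real trial data.

    ```python
    import matplotlib.pyplot as plt
    from lifelines import KaplanMeierFitter
    from lifelines.datasets import load_waltons
    from lifelines.plotting import add_at_risk_counts

    df = load_waltons()                  # example data: columns T, E, group
    ax = plt.subplot(111)
    fitters = []
    for grp, sub in df.groupby("group"):
        kmf = KaplanMeierFitter().fit(sub["T"], sub["E"], label=str(grp))
        kmf.plot_survival_function(ax=ax, ci_show=True)   # CI bands (proposal 2)
        fitters.append(kmf)

    # Extended table beneath the plot: at risk, censored, events (proposal 1).
    add_at_risk_counts(*fitters, ax=ax, rows_to_show=["At risk", "Censored", "Events"])
    plt.tight_layout()
    plt.show()
    ```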

    Combined test versus logrank/Cox test in 50 randomised trials

    BACKGROUND: The logrank test and the Cox proportional hazards model are routinely applied in the design and analysis of randomised controlled trials (RCTs) with time-to-event outcomes. Usually, sample size and power calculations assume proportional hazards (PH) of the treatment effect, i.e. that the hazard ratio is constant over the entire follow-up period. If the PH assumption fails, the power of the logrank/Cox test may be reduced, sometimes severely. It is, therefore, important to understand how serious this can become in real trials, and for a proven alternative test to be available to increase the robustness of the primary test. METHODS: We performed a systematic search to identify relevant articles in four leading medical journals that publish results of phase 3 clinical trials. Altogether, 50 articles satisfied our inclusion criteria. We digitised published Kaplan-Meier curves and created approximations to the original times to event or censoring at the individual patient level. Using the reconstructed data, we tested for non-PH in all 50 trials. We compared the results from the logrank/Cox test with those from the combined test recently proposed by Royston and Parmar. RESULTS: The PH assumption was checked and reported in only 28% of the studies. Evidence of non-PH at the 0.10 level was detected in 31% of comparisons. The Cox test of the treatment effect was significant at the 0.05 level in 49% of comparisons, and the combined test in 55%. In four of the five trials with discordant results, the interpretation would have changed had the combined test been used. The degree of non-PH and the dominance of the p value for the combined test were strongly associated. Graphical investigation suggested that non-PH was mostly due to a treatment effect manifesting early in follow-up and disappearing later. CONCLUSIONS: The evidence for non-PH is checked (and, hence, identified) in only a small minority of RCTs, but non-PH may be present in a substantial fraction of such trials. In our reanalysis of the reconstructed data from 50 trials, the combined test outperformed the Cox test overall. The combined test is a promising approach to making trial design and analysis more robust.
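
    The combined test itself is not part of mainstream Python libraries, so the sketch below illustrates only the preliminary step the paper found is so often skipped: checking the PH assumption, here via the Schoenfeld-residual-based test in lifelines on one of its example datasets.

    ```python
    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi
    from lifelines.statistics import proportional_hazard_test

    df = load_rossi()   # example recidivism data shipped with lifelines
    cph = CoxPHFitter().fit(df, duration_col="week", event_col="arrest")

    # Schoenfeld-residual-based test of the PH assumption for each covariate;
    # small p-values flag non-proportional hazards.
    result = proportional_hazard_test(cph, df, time_transform="rank")
    result.print_summary()
    ```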

    Point estimation for adaptive trial designs II: practical considerations and guidance

    In adaptive clinical trials, the conventional end-of-trial point estimate of a treatment effect is prone to bias, that is, a systematic tendency to deviate from its true value. As stated in recent FDA guidance on adaptive designs, it is desirable to report estimates of treatment effects that reduce or remove this bias. However, it may be unclear which of the available estimators is preferable, and their use remains rare in practice. This article is the second in a two-part series that studies the issue of bias in point estimation for adaptive trials. Part I provided a methodological review of approaches to remove or reduce the potential bias in point estimation for adaptive designs. In part II, we discuss how bias can affect standard estimators and assess the negative impact this can have. We review current practice for reporting point estimates and illustrate the computation of different estimators using a real adaptive trial example (including code), which we use as the basis for a simulation study. We show that while on average the values of these estimators can be similar, for a particular trial realization they can give noticeably different values for the estimated treatment effect. Finally, we propose guidelines for researchers on the choice of estimators and the reporting of estimates following an adaptive design. The issue of bias should be considered throughout the whole lifecycle of an adaptive design, with the estimation strategy prespecified in the statistical analysis plan. When available, unbiased or bias-reduced estimates are to be preferred.
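
    The core message, that the naive estimate can be biased after an adaptive decision, is easy to demonstrate. The sketch below uses a deliberately simple two-stage select-the-best adaptation with normal outcomes (an assumption for illustration, not the trial re-analysed in the article): even when every arm is equally ineffective, the naive pooled estimate for the selected arm is biased upward.

    ```python
    import numpy as np

    rng = np.random.default_rng(11)
    n_sim, K, n = 100_000, 3, 50
    true_effect = 0.0                  # all K arms equally ineffective (assumption)

    # Stage-wise effect estimates per arm (means of unit-variance observations).
    stage1 = rng.normal(true_effect, 1, (n_sim, K, n)).mean(axis=2)
    stage2 = rng.normal(true_effect, 1, (n_sim, K, n)).mean(axis=2)

    best = stage1.argmax(axis=1)       # adaptive step: carry forward the best arm
    rows = np.arange(n_sim)
    naive = (stage1[rows, best] + stage2[rows, best]) / 2   # naive pooled estimate

    print(f"true effect:                  {true_effect:+.3f}")
    print(f"mean estimate, selected arm:  {naive.mean():+.3f}")   # biased upward
    ```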