Why do we use group-sequential designs?

Author

Kaspar Rufibach

Published

April 2, 2023

Purpose of this document

This R Markdown file accompanies this LinkedIn article and provides the code to reproduce all computations. It borrows heavily from code within the rpact project. Find the code on GitHub.

A much more comprehensive description is available in this rpact vignette.

Setup

# --------------------------------------------------------------
# packages
# --------------------------------------------------------------
packs <- c("rpact", "reporttools")
for (p in packs) library(p, character.only = TRUE)
Loading required package: xtable

Code

# --------------------------------------------------------------
# trial basics
# --------------------------------------------------------------
alpha <- 0.05
beta <- 0.2

# effect size and medians
m1 <- 60
hr <- 0.75
m2 <- m1 / hr

# required events for single-stage design, i.e. without interim
nevent <- getSampleSizeSurvival(hazardRatio = hr, sided = 2, alpha = alpha, beta = beta)
nevent <- ceiling(nevent$maxNumberOfEvents)

So for a single-stage design we need 380 events in any case, i.e. even if the true effect is much smaller or larger than what we assume for powering (hazard ratio = 0.75).
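This number can be sanity-checked against the Schoenfeld approximation (rpact's computation type for survival endpoints, as the output further below confirms); the sketch assumes 1:1 randomization:

```r
# Schoenfeld approximation to the required number of events:
# d = 4 * (z_{1 - alpha/2} + z_{1 - beta})^2 / log(hr)^2  (1:1 randomization)
alpha <- 0.05
beta  <- 0.2
hr    <- 0.75
d <- 4 * (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2 / log(hr)^2
ceiling(d)  # 380, matching the rpact result
```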

# information fractions after which to perform interim looks
info <- c(0.3, 0.66, 1)

Now generate a design that adds a futility interim after 30% of events and an efficacy interim after 66% of events. At the futility interim, the trial is stopped if the observed hazard ratio is above 1; for efficacy we use an O’Brien-Fleming type \(\alpha\)-spending function (or rather the Lan-DeMets approximation to it).

# OBF design without initial interim at 30% (to get cumulative alpha-spending function for these visits)
design1 <- getDesignGroupSequential(sided = 1, alpha = alpha / 2, beta = beta, 
                                   informationRates = info[-1],
                                   typeOfDesign = "asOF")

# now spend very little alpha at first interim --> interim stopping not an option
# -6 codifies "no futility" at the 2nd interim
design <- getDesignGroupSequential(futilityBounds = c(0, -6), bindingFutility = FALSE,
                                    informationRates = info, typeOfDesign = "asUser",
                                    userAlphaSpending = c(0.0000001, design1$alphaSpent))

samplesize <- getSampleSizeSurvival(design = design, lambda2 = log(2) / m1, hazardRatio = hr,
                                    dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
                                    accrualTime = 0, accrualIntensity = 42, maxNumberOfSubjects = 1200)
samplesize
Design plan parameters and output for survival data:

Design parameters:
  Information rates                            : 0.300, 0.660, 1.000 
  Critical values                              : 5.199, 2.524, 1.992 
  Futility bounds (non-binding)                : 0.000, -Inf 
  Cumulative alpha spending                    : 0.0000001, 0.0057983, 0.0250000 
  Local one-sided significance levels          : 0.0000001, 0.0057983, 0.0232097 
  Significance level                           : 0.0250 
  Type II error rate                           : 0.2000 
  Test                                         : one-sided 

User defined parameters:
  lambda(2)                                    : 0.0116 
  Hazard ratio                                 : 0.750 
  Maximum number of subjects                   : 1200 
  Accrual time                                 : 28.57 
  Accrual intensity                            : 42.0 
  Drop-out rate (1)                            : 0.025 
  Drop-out rate (2)                            : 0.025 

Default parameters:
  Theta H0                                     : 1 
  Type of computation                          : Schoenfeld 
  Planned allocation ratio                     : 1 
  kappa                                        : 1 
  Piecewise survival times                     : 0.00 
  Drop-out time                                : 12.00 

Sample size and output:
  Direction upper                              : FALSE 
  median(1)                                    : 80.0 
  median(2)                                    : 60.0 
  lambda(1)                                    : 0.00866 
  Maximum number of subjects (1)               : 600 
  Maximum number of subjects (2)               : 600 
  Maximum number of events                     : 407.7 
  Follow up time                               : 29.62 
  Reject per stage [1]                         : 0.000154 
  Reject per stage [2]                         : 0.433149 
  Reject per stage [3]                         : 0.366697 
  Overall futility stop                        : 0.05582 
  Futility stop per stage [1]                  : 0.05582 
  Futility stop per stage [2]                  : 0.00000 
  Early stop                                   : 0.4891 
  Analysis times [1]                           : 25.26 
  Analysis times [2]                           : 40.65 
  Analysis times [3]                           : 58.19 
  Expected study duration                      : 48.75 
  Maximal study duration                       : 58.19 
  Number of events per stage [1]               : 122.3 
  Number of events per stage [2]               : 269.1 
  Number of events per stage [3]               : 407.7 
  Expected number of events under H0           : 264.2 
  Expected number of events under H0/H1        : 334.6 
  Expected number of events under H1           : 331.7 
  Number of subjects [1]                       : 1060.9 
  Number of subjects [2]                       : 1200 
  Number of subjects [3]                       : 1200 
  Expected number of subjects under H1         : 1192.2 
  Critical values (treatment effect scale) [1] : 0.391 
  Critical values (treatment effect scale) [2] : 0.735 
  Critical values (treatment effect scale) [3] : 0.821 
  Futility bounds (treatment effect scale) [1] : 1.000 
  Futility bounds (treatment effect scale) [2] : NA 
  Futility bounds (one-sided p-value scale) [1] : 0.5000 
  Futility bounds (one-sided p-value scale) [2] : 1.0000 

Legend:
  (i): values of treatment arm i
  [k]: values at stage k

So adding these interims (and not accounting for the power loss induced through adding the futility!) increases the number of events from 380 to 408 - only a slight increase.

Note that at the second interim, stopping for futility is no longer an option (the -6 for futilityBounds enforces that).

# cumulative number of events at the two interims and the final analysis
nevents_i1 <- ceiling(info * samplesize$maxNumberOfEvents)
nevents_i1
[1] 123 270 408

The key point is that when running many trials with this group-sequential design, say 100 of them, some will actually stop at the first or second interim. We can compute the probabilities of that happening as follows:

# stopping probabilities at futility and efficacy interim under H0 and H1
designChar <- getDesignCharacteristics(design)
stopProbsH0 <- getPowerAndAverageSampleNumber(design, theta = 0, nMax = designChar$shift)
stopProbsH1 <- getPowerAndAverageSampleNumber(design, theta = 1, nMax = designChar$shift)

stopFutIA_H0 <- stopProbsH0$earlyStop["stage = 1", 1]
stopEffIA_H0 <- stopProbsH0$earlyStop["stage = 2", 1]

stopFutIA_H1 <- stopProbsH1$earlyStop["stage = 1", 1]
stopEffIA_H1 <- stopProbsH1$earlyStop["stage = 2", 1]

# probability to stop at the futility interim, under H0 and H1
c(stopFutIA_H0, stopFutIA_H1)
stage = 1 stage = 1 
0.5000001 0.0559722 
# probability to stop at the efficacy interim, under H0 and H1
c(stopEffIA_H0, stopEffIA_H1)
stage = 2 stage = 2 
0.0057649 0.4331487 
# Expected number of events under H0 and H1 
expH0 <- samplesize$expectedEventsH0
expH1 <- samplesize$expectedEventsH1

The point is that if we stop at an interim, we need fewer events, because the trial ends at that analysis. The expected number of events can be computed as follows:

  • Under \(H_0\): 0.500 \(\cdot\) 123 + 0.006 \(\cdot\) 270 + 0.494 \(\cdot\) 408 = 264.
  • Under \(H_1\): 0.056 \(\cdot\) 123 + 0.433 \(\cdot\) 270 + 0.511 \(\cdot\) 408 = 332.

So if we use the specified group-sequential design and in fact, our intervention does not work at all (\(H_0\) is true), then on average we only need 264 instead of the 380 events in a single-stage design.

Under \(H_1\), i.e. if the drug works as assumed for powering the trial, we need on average only 332 events to show that, i.e. to reject the null hypothesis of the hazard ratio being equal to 1. This is the point of group-sequential designs: the maximal number of events in a given trial is slightly increased, but the average number of events over a portfolio of trials is reduced.
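The two weighted sums above can be written out in base R. The stopping probabilities and event numbers are the rounded values quoted above, so the results differ by a fraction of an event from rpact's exact values of 264.2 and 331.7:

```r
# cumulative events at the two interims and the final analysis
events <- c(123, 270, 408)

# probabilities to stop at the futility interim, to stop at the efficacy
# interim, or to proceed to the final analysis (the last is the remainder)
pH0 <- c(0.500, 0.006); pH0 <- c(pH0, 1 - sum(pH0))
pH1 <- c(0.056, 0.433); pH1 <- c(pH1, 1 - sum(pH1))

expected_H0 <- sum(pH0 * events)
expected_H1 <- sum(pH1 * events)
round(c(expected_H0, expected_H1))
```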

Now, what effect size do we need to observe at the efficacy interim to stop the trial, based on an O’Brien-Fleming \(\alpha\)-spending function? We can easily extract that number from the above output as follows:

# critical values on effect, i.e. hazard ratio, scale
crit1 <- t(disp(samplesize$criticalValuesEffectScale, 3))
crit1
     [,1]    [,2]    [,3]   
[1,] "0.391" "0.735" "0.821"
# 2-sided significance levels on p-value scale
crit2 <- disp(samplesize$criticalValuesPValueScale * 2, 3)
crit2
[1] "0.0000002" "0.012"     "0.046"    

So this means that after 66% of events, in order to get a \(p\)-value below the pre-specified boundary of 0.012, the hazard ratio must be below 0.735. The latter is what is sometimes called the minimal detectable difference (MDD). Two observations can be made from this:

  • Often, people believe that in order to stop a trial early, the effect seen at the interim must be much larger than what we assumed for powering. Comparing 0.75 to 0.735, it is clear that this is not the case. That the MDD at the interim and the effect we power at are approximately the same is typical for an O’Brien-Fleming boundary and an interim after about 2/3 of the information.
  • Another common belief is that in order to be significant at the final analysis we need to observe a hazard ratio \(\le 0.75\). Again, this is not true: the MDD at the final analysis is 0.821, i.e. this is the hazard ratio we need to beat in order to get a \(p\)-value of 0.046 or lower.
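Both MDDs follow from the Schoenfeld relation between the \(Z\)-statistic and the hazard ratio, \(z = \log(\mathrm{HR})\sqrt{d}/2\) under 1:1 allocation. Inverting it at the boundary values and event numbers from the output above reproduces the critical values on the treatment effect scale (a back-of-the-envelope check):

```r
# efficacy boundaries on the Z-scale and cumulative events per stage (from the output)
crit   <- c(5.199, 2.524, 1.992)
events <- c(122.3, 269.1, 407.7)

# hazard ratio that puts the Z-statistic exactly on the boundary (1:1 allocation)
mdd <- exp(-2 * crit / sqrt(events))
round(mdd, 3)  # 0.391 0.735 0.821, matching the output above
```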

Inference after early stopping

See this rpact vignette for further details.

Conventional analyses at the first and second interim analysis

Assume the following results from standard inference at the futility interim analysis performed after 123 events have been observed:

  • Stratified HR 0.69 (95% CI 0.47 to 1.01). \(\Rightarrow\) this was \(\le 1\), so trial continues.
  • log(HR) = log(0.69) with standard error 0.20.
  • Corresponding Z-score: log(0.69)/0.20 = -1.86.

Results from standard inference at the efficacy interim analysis after 270 events:

  • Stratified HR 0.66 (95% CI 0.51 to 0.85).
  • log(HR) = log(0.66) with standard error 0.13.
  • Corresponding Z-score: log(0.66)/0.13 = -3.225.
  • Two-sided p-value is 0.0012, which is smaller than the boundary of 0.012 from the O’Brien-Fleming design. \(\Rightarrow\) Trial stopped early for efficacy.
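The \(Z\)-scores in both lists are simply \(\log(\mathrm{HR})\) divided by its standard error. Recomputing them from the rounded HRs and SEs reproduces them up to rounding (e.g. -3.20 instead of the reported -3.225, which was presumably computed from unrounded values):

```r
# Z-scores from reported hazard ratios and standard errors of log(HR)
z_futility <- log(0.69) / 0.20   # futility interim
z_efficacy <- log(0.66) / 0.13   # efficacy interim
round(c(z_futility, z_efficacy), 2)  # -1.86 -3.20
```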

Analysis accounting for the group-sequential design

The results after the first and second interim are specified using the function getDataset:

# overallLogRanks: One-sided logrank statistic or Z-score (= log(HR) / SE) from Cox regression
results <- getDataset(
    overallEvents = nevents_i1[1:2],
    overallLogRanks = c(-1.86, -3.225),
    overallAllocationRatio = c(1, 1))

Finally, this is used for creating the adjusted inference using the function getAnalysisResults (directionUpper = FALSE is specified because the power is directed towards negative values of the logrank statistics):

adj_result <- getAnalysisResults(design = design, 
    dataInput = results, stage = 2, directionUpper = FALSE)
[PROGRESS] Stage results calculated [0.0436 secs] 
[PROGRESS] Conditional power calculated [0.0442 secs] 
[PROGRESS] Conditional rejection probabilities (CRP) calculated [0.0014 secs] 
[PROGRESS] Repeated confidence interval of stage 1 calculated [0.9783 secs] 
[PROGRESS] Repeated confidence interval of stage 2 calculated [0.7699 secs] 
[PROGRESS] Repeated confidence interval calculated [1.75 secs] 
Warning: Repeated p-values not available for 'typeOfDesign' = 'asUser'
[PROGRESS] Repeated p-values calculated [0.0012 secs] 
[PROGRESS] Final p-value calculated [0.0021 secs] 
[PROGRESS] Final confidence interval calculated [0.0919 secs] 
adj_result
Analysis results (survival of 2 groups, group sequential design):

Design parameters:
  Information rates                            : 0.300, 0.660, 1.000 
  Critical values                              : 5.199, 2.524, 1.992 
  Futility bounds (non-binding)                : 0.000, -Inf 
  Cumulative alpha spending                    : 0.0000001, 0.0057983, 0.0250000 
  Local one-sided significance levels          : 0.0000001, 0.0057983, 0.0232097 
  Significance level                           : 0.0250 
  Test                                         : one-sided 

User defined parameters:
  Direction upper                              : FALSE 

Default parameters:
  Normal approximation                         : TRUE 
  Theta H0                                     : 1 

Stage results:
  Cumulative effect sizes                      : 0.7150, 0.6753, NA 
  Stage-wise test statistics                   : -1.860, -2.669, NA 
  Stage-wise p-values                          : 0.03144, 0.00380, NA 
  Overall test statistics                      : -1.860, -3.225, NA 
  Overall p-values                             : 0.0314428, 0.0006299, NA 

Analysis results:
  Actions                                      : continue, reject and stop, NA 
  Conditional rejection probability            : 0.1359, 0.8594, NA 
  Conditional power                            : NA, NA, NA 
  Repeated confidence intervals (lower)        : 0.2800, 0.4967, NA 
  Repeated confidence intervals (upper)        : 1.8261, 0.9182, NA 
  Repeated p-values                            : NA, NA, NA 
  Final stage                                  : 2 
  Final p-value                                : NA, 0.0006299, NA 
  Final CIs (lower)                            : NA, 0.532, NA 
  Final CIs (upper)                            : NA, 0.8573, NA 
  Median unbiased estimate                     : NA, 0.6753, NA 

The output is explained as follows:

  • Critical values are the group-sequential efficacy boundaries on the \(Z\)-scale; the local one-sided significance levels are the corresponding per-stage significance levels.
  • Effect sizes, Test statistics, and p-values are the per-stage results: the hazard ratio estimates, \(Z\)-scores, and \(p\)-values from the first interim analysis, and those that would have been obtained at the second interim analysis had only the new data accrued since the first interim (rather than all data up to the second interim) been included.
  • Overall test statistics are the given (overall, not per-stage) \(Z\)-scores from each interim and Overall p-value the corresponding one-sided \(p\)-values.
  • RCIs are repeated confidence intervals, which provide valid (but conservative) inference at any stage of an ongoing or stopped group-sequential trial. Repeated p-values are the corresponding \(p\)-values; as the warning above indicates, they are not available for user-defined \(\alpha\)-spending functions, hence the NA entries.
  • Final p-value is the final one-sided adjusted \(p\)-value based on the stagewise ordering of the sample space.
  • Median unbiased estimate and Final CIs are the corresponding adjusted treatment effect estimate and the confidence interval for the hazard ratio at the interim analysis where the trial was stopped.

Note that for this example, the adjusted final hazard ratio of 0.68 and the adjusted confidence interval of (0.53, 0.86) are very close to the conventional estimates of 0.66 and (0.51, 0.85).