This R Markdown file accompanies this LinkedIn article and provides the code to reproduce all computations. It borrows heavily from code within the rpact project. Find the code on GitHub.
A much more comprehensive description is available in this rpact vignette.
Setup
```r
# --------------------------------------------------------------
# packages
# --------------------------------------------------------------
packs <- c("rpact", "reporttools")
for (i in 1:length(packs)) {
  library(packs[i], character.only = TRUE)
}
```

Loading required package: xtable
Code
```r
# --------------------------------------------------------------
# trial basics
# --------------------------------------------------------------
alpha <- 0.05
beta <- 0.2

# effect size and medians
m1 <- 60
hr <- 0.75
m2 <- m1 / hr

# required events for single-stage design, i.e. without interim
nevent <- getSampleSizeSurvival(hazardRatio = hr, sided = 2, alpha = alpha, beta = beta)
nevent <- ceiling(nevent$maxNumberOfEvents)
```
So for a single stage design we need 380 events in any case, i.e. even if the effect is much smaller or larger than what we assume for powering (hazard ratio = 0.75).
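As a quick sanity check (not part of the original code), this event number can be reproduced with Schoenfeld's approximation for the required number of events under 1:1 allocation:

```r
# Schoenfeld approximation: required events for a two-sided level-0.05 test
# with 80% power to detect a hazard ratio of 0.75 under 1:1 allocation
alpha <- 0.05
beta  <- 0.2
hr    <- 0.75
d <- 4 * (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2 / log(hr)^2
ceiling(d)  # 380, matching the rpact result
```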
```r
# information fractions after which to perform interim looks
info <- c(0.3, 0.66, 1)
```
Now generate a design where you add a futility interim after 30% of events and an interim for efficacy after 66% of events. At the futility interim, the trial will be stopped if the estimated hazard ratio is above 1, and for efficacy we use an O'Brien-Fleming type \(\alpha\)-spending function (or rather the Lan-DeMets approximation to it).
```r
# OBF design without initial interim at 30%
# (to get the cumulative alpha-spending function for these visits)
design1 <- getDesignGroupSequential(sided = 1, alpha = alpha / 2, beta = beta,
                                    informationRates = info[-1], typeOfDesign = "asOF")

# now spend very little alpha at the first interim --> interim stopping not an option
# -6 codifies "no futility" at the 2nd interim
design <- getDesignGroupSequential(futilityBounds = c(0, -6), bindingFutility = FALSE,
                                   informationRates = info, typeOfDesign = "asUser",
                                   userAlphaSpending = c(0.0000001, design1$alphaSpent))

samplesize <- getSampleSizeSurvival(design = design, lambda2 = log(2) / m1, hazardRatio = hr,
                                    dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
                                    accrualTime = 0, accrualIntensity = 42,
                                    maxNumberOfSubjects = 1200)
samplesize
```
Design plan parameters and output for survival data:
Design parameters:
Information rates : 0.300, 0.660, 1.000
Critical values : 5.199, 2.524, 1.992
Futility bounds (non-binding) : 0.000, -Inf
Cumulative alpha spending : 0.0000001, 0.0057983, 0.0250000
Local one-sided significance levels : 0.0000001, 0.0057983, 0.0232097
Significance level : 0.0250
Type II error rate : 0.2000
Test : one-sided
User defined parameters:
lambda(2) : 0.0116
Hazard ratio : 0.750
Maximum number of subjects : 1200
Accrual time : 28.57
Accrual intensity : 42.0
Drop-out rate (1) : 0.025
Drop-out rate (2) : 0.025
Default parameters:
Theta H0 : 1
Type of computation : Schoenfeld
Planned allocation ratio : 1
kappa : 1
Piecewise survival times : 0.00
Drop-out time : 12.00
Sample size and output:
Direction upper : FALSE
median(1) : 80.0
median(2) : 60.0
lambda(1) : 0.00866
Maximum number of subjects (1) : 600
Maximum number of subjects (2) : 600
Maximum number of events : 407.7
Follow up time : 29.62
Reject per stage [1] : 0.000154
Reject per stage [2] : 0.433149
Reject per stage [3] : 0.366697
Overall futility stop : 0.05582
Futility stop per stage [1] : 0.05582
Futility stop per stage [2] : 0.00000
Early stop : 0.4891
Analysis times [1] : 25.26
Analysis times [2] : 40.65
Analysis times [3] : 58.19
Expected study duration : 48.75
Maximal study duration : 58.19
Number of events per stage [1] : 122.3
Number of events per stage [2] : 269.1
Number of events per stage [3] : 407.7
Expected number of events under H0 : 264.2
Expected number of events under H0/H1 : 334.6
Expected number of events under H1 : 331.7
Number of subjects [1] : 1060.9
Number of subjects [2] : 1200
Number of subjects [3] : 1200
Expected number of subjects under H1 : 1192.2
Critical values (treatment effect scale) [1] : 0.391
Critical values (treatment effect scale) [2] : 0.735
Critical values (treatment effect scale) [3] : 0.821
Futility bounds (treatment effect scale) [1] : 1.000
Futility bounds (treatment effect scale) [2] : NA
Futility bounds (one-sided p-value scale) [1] : 0.5000
Futility bounds (one-sided p-value scale) [2] : 1.0000
Legend:
(i): values of treatment arm i
[k]: values at stage k
So adding these interims (while not accounting for the power loss induced by adding the futility interim!) increases the maximal number of events from 380 to 408, only a slight increase.
Note that stopping for futility is no longer foreseen at the second interim (the -6 for futilityBounds enforces that).
```r
# number of events at each analysis
nevents_i1 <- ceiling(info * samplesize$maxNumberOfEvents)
nevents_i1
```
[1] 123 270 408
Now the key is that when using a group-sequential design, say over a portfolio of 100 such trials, some of them will actually stop at the first or second interim. We can compute the probabilities of that happening as follows:
```r
# stopping probabilities at futility and efficacy interim under H0 and H1
designChar  <- getDesignCharacteristics(design)
stopProbsH0 <- getPowerAndAverageSampleNumber(design, theta = 0, nMax = designChar$shift)
stopProbsH1 <- getPowerAndAverageSampleNumber(design, theta = 1, nMax = designChar$shift)

stopFutIA_H0 <- stopProbsH0$earlyStop["stage = 1", 1]
stopEffIA_H0 <- stopProbsH0$earlyStop["stage = 2", 1]
stopFutIA_H1 <- stopProbsH1$earlyStop["stage = 1", 1]
stopEffIA_H1 <- stopProbsH1$earlyStop["stage = 2", 1]

# probability to stop at the futility interim, under H0 and H1
c(stopFutIA_H0, stopFutIA_H1)
```
stage = 1 stage = 1
0.5000001 0.0559722
```r
# probability to stop at the efficacy interim, under H0 and H1
c(stopEffIA_H0, stopEffIA_H1)
```
stage = 2 stage = 2
0.0057649 0.4331487
```r
# expected number of events under H0 and H1
expH0 <- samplesize$expectedEventsH0
expH1 <- samplesize$expectedEventsH1
```
Now the point is that if we stop at an interim, we need fewer events, because the trial ends there. The expected numbers of events under \(H_0\) and \(H_1\) were extracted above and are reported in the design output.
So if we use the specified group-sequential design and in fact, our intervention does not work at all (\(H_0\) is true), then on average we only need 264 instead of the 380 events in a single-stage design.
Under \(H_1\), i.e. if the drug works as assumed when powering the trial, we need on average only 332 events to reject the null hypothesis that the hazard ratio equals 1. This is the point of group-sequential designs: the maximal number of events in a given trial is slightly increased, but the average number of events over a portfolio of trials is reduced.
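The expected event numbers can be checked by hand: weight the cumulative events at each stage by the probability of stopping at that stage. The numbers below are rounded values taken from the rpact output above, so this back-of-envelope check only agrees approximately:

```r
# back-of-envelope check of the expected number of events under H0:
# cumulative events per stage, weighted by the stopping probabilities
nevents <- c(123, 270, 408)                        # events at stages 1, 2, 3
pstopH0 <- c(0.5000, 0.0058, 1 - 0.5000 - 0.0058)  # stop probabilities under H0
sum(nevents * pstopH0)                             # close to the 264.2 reported above
```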
Now, what effect size do we need to observe at the efficacy interim to stop the trial, based on an O’Brien-Fleming \(\alpha\)-spending function? We can easily extract that number from the above output as follows:
```r
# critical values on effect, i.e. hazard ratio, scale
crit1 <- t(disp(samplesize$criticalValuesEffectScale, 3))
crit1
```
[,1] [,2] [,3]
[1,] "0.391" "0.735" "0.821"
```r
# 2-sided significance levels on p-value scale
crit2 <- disp(samplesize$criticalValuesPValueScale * 2, 3)
crit2
```
[1] "0.0000002" "0.012" "0.046"
So this means that at the interim after 66% of events, the hazard ratio must be below 0.735 in order to get a \(p\)-value below the pre-specified boundary of 0.012. This hazard ratio threshold is what is sometimes called the minimal detectable difference (MDD). Two observations can be made from this:
Often, people believe that in order to stop a trial early, the effect seen at the interim must be much larger than what was assumed for powering. Comparing 0.75 to 0.735 makes clear that this is not the case. That the MDD at the interim and the effect we power for are approximately the same is typical for an O'Brien-Fleming boundary with an interim after about 2/3 of the information.
Another common belief is that in order to be significant at the final analysis we need to observe a hazard ratio \(\le 0.75\). Again, this is not true: the MDD at the final analysis is actually 0.821, i.e. this is the hazard ratio we need to beat in order to get a \(p\)-value of 0.046 or lower.
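Both MDDs can be linked directly to the critical values on the \(Z\)-scale: with 1:1 allocation, the standard error of the log hazard ratio is approximately \(2/\sqrt{d}\) for \(d\) events (a standard approximation; the numbers below are taken from the output above):

```r
# convert critical values from the Z-scale to the hazard ratio scale,
# using se(log HR) ~ 2 / sqrt(d) with d events and 1:1 allocation
zcrit  <- c(2.524, 1.992)  # critical values at efficacy interim and final analysis
events <- c(270, 408)      # number of events at these analyses
round(exp(-zcrit * 2 / sqrt(events)), 3)  # 0.735 and 0.821, the MDDs from above
```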
Conventional analyses at the first and second interim
Assume the following results from standard inference at the futility interim analysis performed after 123 events have been observed:
Stratified HR 0.69 (95% CI 0.47 to 1.01). \(\Rightarrow\) this was \(\le 1\), so trial continues.
log(HR) = log(0.69) with standard error 0.20.
Corresponding Z-score: log(0.69)/0.20 = -1.86.
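This \(Z\)-score is straightforward to reproduce in base R:

```r
# Z-score at the futility interim from the reported estimate and standard error
z1 <- log(0.69) / 0.20
round(z1, 2)  # -1.86
```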
Results from standard inference at the efficacy interim analysis after 270 events:
Stratified HR 0.66 (95% CI 0.51 to 0.85).
log(HR) = log(0.66) with standard error 0.13.
Corresponding Z-score: log(0.66)/0.13 = -3.225.
Two-sided \(p\)-value is 0.0012, which is smaller than the pre-specified O'Brien-Fleming boundary of 0.012. \(\Rightarrow\) Trial stopped early for efficacy.
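The two-sided \(p\)-value follows from the overall \(Z\)-score with base R; comparing it to the boundary confirms the efficacy stop:

```r
# two-sided p-value from the overall log-rank Z-score at the efficacy interim
z2 <- -3.225
p2 <- 2 * pnorm(z2)
p2 < 0.012  # TRUE: below the O'Brien-Fleming boundary, so the trial stops
```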
Analysis accounting for the group-sequential design
The results after the first and second interim are specified using the function getDataset:
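The actual getDataset call is not shown here; a sketch reconstructing it from the summary statistics above might look like this (the cumulative events and overall log-rank \(Z\)-scores are as reported; the 1:1 allocation ratio is an assumption consistent with the design):

```r
library(rpact)

# sketch, reconstructed from the reported interim summaries (not the original call)
results <- getDataset(
  overallEvents           = c(123, 270),       # cumulative events at the two analyses
  overallLogRanks         = c(-1.860, -3.225), # overall log-rank Z-scores reported above
  overallAllocationRatios = c(1, 1)            # assumed 1:1 allocation
)
```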
Finally, this is used for creating the adjusted inference using the function getAnalysisResults (directionUpper = FALSE is specified because the power is directed towards negative values of the logrank statistics):
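The corresponding call is likewise not shown; assuming the interim results were entered via getDataset into an object (here called results), a sketch producing the adj_result object printed below is:

```r
# sketch of the analysis call (assumes `results` holds the getDataset output
# and `design` is the group-sequential design constructed above)
adj_result <- getAnalysisResults(
  design         = design,  # group-sequential design from the planning stage
  dataInput      = results, # stage-wise results entered via getDataset
  directionUpper = FALSE    # power directed towards hazard ratios below 1
)
```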
Warning: Repeated p-values not available for 'typeOfDesign' = 'asUser'
[PROGRESS] Repeated p-values calculated [0.0012 secs]
[PROGRESS] Final p-value calculated [0.0021 secs]
[PROGRESS] Final confidence interval calculated [0.0919 secs]
```r
adj_result
```
Analysis results (survival of 2 groups, group sequential design):
Design parameters:
Information rates : 0.300, 0.660, 1.000
Critical values : 5.199, 2.524, 1.992
Futility bounds (non-binding) : 0.000, -Inf
Cumulative alpha spending : 0.0000001, 0.0057983, 0.0250000
Local one-sided significance levels : 0.0000001, 0.0057983, 0.0232097
Significance level : 0.0250
Test : one-sided
User defined parameters:
Direction upper : FALSE
Default parameters:
Normal approximation : TRUE
Theta H0 : 1
Stage results:
Cumulative effect sizes : 0.7150, 0.6753, NA
Stage-wise test statistics : -1.860, -2.669, NA
Stage-wise p-values : 0.03144, 0.00380, NA
Overall test statistics : -1.860, -3.225, NA
Overall p-values : 0.0314428, 0.0006299, NA
Analysis results:
Actions : continue, reject and stop, NA
Conditional rejection probability : 0.1359, 0.8594, NA
Conditional power : NA, NA, NA
Repeated confidence intervals (lower) : 0.2800, 0.4967, NA
Repeated confidence intervals (upper) : 1.8261, 0.9182, NA
Repeated p-values : NA, NA, NA
Final stage : 2
Final p-value : NA, 0.0006299, NA
Final CIs (lower) : NA, 0.532, NA
Final CIs (upper) : NA, 0.8573, NA
Median unbiased estimate : NA, 0.6753, NA
The output is explained as follows:
Critical values are group-sequential efficacy boundary values on the \(Z\)-scale, stage levels are the corresponding one-sided local significance levels.
Cumulative effect sizes, stage-wise test statistics, and stage-wise p-values refer to the hazard ratio estimates, \(Z\)-scores, and \(p\)-values obtained at the first interim analysis and, for the second stage, to the results that would have been obtained had only the new data accrued since the first interim been included, i.e. per-stage results.
Overall test statistics are the given (overall, not per-stage) \(Z\)-scores from each interim and Overall p-values the corresponding one-sided \(p\)-values.
RCIs are repeated confidence intervals which provide valid (but conservative) inference at any stage of an ongoing or stopped group-sequential trial. Repeated p-values are the corresponding \(p\)-values.
Final p-value is the final one-sided adjusted \(p\)-value based on the stagewise ordering of the sample space.
Median unbiased estimate and Final CIs are the corresponding adjusted treatment effect estimate and the confidence interval for the hazard ratio at the interim analysis where the trial was stopped.
Note that for this example, the adjusted final hazard ratio of 0.68 with adjusted confidence interval (0.53, 0.86) is very close to the conventional estimate of 0.66 with confidence interval (0.51, 0.85).