Two-sided vs. One-sided Tests: This Should not be “Controversial”
The appropriateness of two-sided vs. one-sided hypothesis tests in clinical trials has long been debated. This is probably because of intra- and interdisciplinary disagreements among statisticians, clinicians and regulators, often expressed as conflicting recommendations in the literature.
First, let’s look at how the current ICH E9 guidance document reads:
It is important to clarify whether one- or two-sided tests of statistical significance will be used and, in particular, to justify prospectively the use of one-sided tests. If hypothesis tests are not considered appropriate, then the alternative process for arriving at statistical conclusions should be given. The issue of one-sided or two-sided approaches to inference is controversial, and a diversity of views can be found in the statistical literature. The approach of setting Type I errors for one-sided tests at half the conventional Type I error used in two-sided tests is preferable in regulatory settings. This promotes consistency with the two-sided confidence intervals that are generally appropriate for estimating the possible size of the difference between two treatments.
The most important word in the paragraph above is “controversial”, which tells me that regulators will accept either approach if the argument is made objectively for the specific situation and laid out prospectively in the protocol.
Here is how I see the issue; the core of it aligns with Lloyd Fisher’s opinion, published back in 1991:
For superiority trials, there are two situations to consider:
#1 The active arm is compared to placebo for superiority. In this situation it makes sense to use a one-sided test at a significance level of 2.5%.
#2 The active arm is compared to another active regimen. In this situation it makes sense to use a two-sided test at a significance level of 5%. This provides an opportunity to see whether the comparator is superior to the investigational treatment, and provides symmetry in the superiority question.
#3 If both comparisons are of interest, let’s say in a 3-arm trial…but wait, a superiority trial is never designed to compare a treatment with both a placebo and an active arm. In other words, this case doesn’t exist. Of course, one can envision a trial that tests non-inferiority against the active comparator and superiority over placebo.
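To make the 2.5% vs. 5% bookkeeping in cases #1 and #2 concrete, here is a minimal sketch using a plain z-test. The observed statistic `z_obs` is a made-up number, and the sign convention (positive z favours the active arm) is my assumption for illustration only.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def one_sided_p(z):
    """P-value for H1: active arm is better (positive z favours the active arm)."""
    return 1.0 - Z.cdf(z)

def two_sided_p(z):
    """P-value for H1: the two arms differ, in either direction."""
    return 2.0 * (1.0 - Z.cdf(abs(z)))

# Critical values: one-sided at 2.5% and two-sided at 5% share the same cutoff.
crit_one_sided = Z.inv_cdf(1 - 0.025)
crit_two_sided = Z.inv_cdf(1 - 0.05 / 2)

z_obs = 2.10  # hypothetical observed test statistic
print(f"critical value, one-sided 2.5%: {crit_one_sided:.4f}")
print(f"critical value, two-sided 5%:   {crit_two_sided:.4f}")
print(f"one-sided p: {one_sided_p(z_obs):.4f}, two-sided p: {two_sided_p(z_obs):.4f}")

# Only the two-sided test can declare the comparator superior (case #2).
print(f"two-sided p for z = -2.10: {two_sided_p(-z_obs):.4f}")
print(f"one-sided p for z = -2.10: {one_sided_p(-z_obs):.4f}")
```

The rejection threshold in favour of the active arm is identical in both formulations. What the two-sided test adds is the ability to reject in the other direction, which is only a meaningful question when the comparator is itself an active regimen.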
Now, that brings us to non-inferiority and therapeutic equivalence tests:
#4 For all non-inferiority tests, it only makes sense to conduct the test as one-sided (at 2.5% alpha) because the question is one-sided. However, much non-inferiority testing is done via confidence intervals, which are two-sided by construction. Why a confidence interval approach is preferred is a mystery to me personally (a topic for a future blog post). My guess is that this is a by-product of bioequivalence testing using two one-sided tests (TOST), with that solution being fitted onto the non-inferiority problem. It should be noted, however, that bioequivalence is a specific problem with a continuous (pharmacokinetic) endpoint, and the duality between confidence intervals and hypothesis tests that holds in that case does not transfer automatically to all non-inferiority or therapeutic equivalence problems. But again, this is a diversion from the current topic…
#5 Therapeutic equivalence problems with two active arms should always use a two one-sided tests (TOST) structure, with each test at the 2.5% significance level. This preserves the overall significance level at 2.5%, as shown by Roger Berger and Jason Hsu back in 1996.
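Cases #4 and #5 can be sketched as follows for the simplest setting only: a normally distributed difference estimate with a known standard error. The function names, the margin and the input numbers are hypothetical. In this setting the one-sided test and the confidence-limit check agree exactly; as noted above, that agreement should not be assumed for other endpoints or test statistics.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def noninferiority(diff, se, margin, alpha=0.025):
    """One-sided test of H0: true diff <= -margin vs. H1: true diff > -margin.

    Returns the p-value and the lower limit of the two-sided
    100 * (1 - 2 * alpha)% confidence interval for the difference.
    """
    p = 1.0 - Z.cdf((diff + margin) / se)
    lower = diff - Z.inv_cdf(1 - alpha) * se
    return p, lower

def tost(diff, se, margin, alpha=0.025):
    """Two one-sided tests for equivalence within (-margin, +margin).

    Equivalence is declared only if BOTH one-sided tests reject at alpha,
    i.e. if the larger of the two p-values is below alpha.
    """
    p_lower = 1.0 - Z.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = Z.cdf((diff - margin) / se)        # H0: diff >= +margin
    return max(p_lower, p_upper)

# Hypothetical inputs: estimated difference, its standard error, and the margin.
diff, se, margin = -0.5, 1.0, 3.0

p_ni, lower = noninferiority(diff, se, margin)
print(f"non-inferiority p = {p_ni:.4f}, lower 95% limit = {lower:.2f}")
print("reject by test:", p_ni < 0.025, "| reject by CI:", lower > -margin)
print(f"TOST p = {tost(diff, se, margin):.4f}")
```

Because each of the two one-sided tests is run at 2.5% and both must reject, no further alpha adjustment is applied in the TOST function; this is the intersection-union structure discussed by Berger and Hsu.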
Now, there is also a technical issue with two-sided tests that few people have talked about. Many multiplicity adjustments depend on the closure principle, which only holds under a directional alternative, i.e. a one-sided scenario. So trials with multiple endpoints face a technical difficulty as well, unless the two-sided test at the 5% significance level is “operationally translated” to a one-sided test at the 2.5% significance level.
In summary, this issue should not remain controversial at all. Several authors have brought sample size, power and related concerns into the choice between the two-sided and one-sided formulations of the problem. However, sample-size calculations should not determine how the hypotheses are set up to formulate a statistical framework that answers a scientific question. Rather, the framework, settled first, should dictate these calculations.
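To see where the sample-size concern comes from, here is a rough sketch using the textbook normal-approximation formula for comparing two means with equal allocation. The effect size, standard deviation and power below are invented for illustration, and the function name is my own.

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha_one_sided, power=0.90):
    """Approximate per-arm sample size for a two-arm comparison of means.

    Uses n = 2 * (sigma / delta)^2 * (z_alpha + z_beta)^2, where
    alpha_one_sided is the error rate in the tail of interest.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha_one_sided)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2)

delta, sigma = 5.0, 10.0  # hypothetical treatment effect and standard deviation

n_one_sided_025 = n_per_arm(delta, sigma, alpha_one_sided=0.025)
n_two_sided_05 = n_per_arm(delta, sigma, alpha_one_sided=0.05 / 2)
n_one_sided_05 = n_per_arm(delta, sigma, alpha_one_sided=0.05)

print("one-sided 2.5%:", n_one_sided_025)
print("two-sided 5%:  ", n_two_sided_05)
print("one-sided 5%:  ", n_one_sided_05)
```

Under this approximation, a one-sided test at 2.5% and a two-sided test at 5% call for the same sample size. A saving appears only if the one-sided test is run at the full 5%, which is exactly the kind of calculation that should follow from the framework rather than drive it.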
There are certain things that should not be “controversial” – like wearing a mask to avoid infections with coronavirus! The issue of one-sided vs. two-sided testing is no different.
Here are the two references used in the post:
The Use of One-Sided Tests in Drug Trials: An FDA Advisory Committee Member’s Perspective – L. Fisher, Journal of Biopharmaceutical Statistics, Vol. 1, Issue 1 (1991)
Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets – Roger L. Berger and Jason C. Hsu, Statistical Science, Vol. 11, No. 4 (1996)
July 7, 2020