## Statistical considerations for the design of randomized, controlled trials for probiotics and prebiotics

*By Prof. Daniel Tancredi, UC Davis, USA*

The best evidence for the efficacy of probiotics or prebiotics generally comes from randomized controlled trials. The proper design of such trials should strive to use the available resources to achieve the most informative results for stakeholders, while properly accounting for the consequences of correct and incorrect decisions. It is crucial to understand that even well-designed and -executed studies cannot entirely eliminate uncertainty from statistical inferences. Those inferences could be incorrect, even though they were made rigorously and without any procedural or technical errors. By “incorrect”, I mean that the decisions made may not correspond to the truth about those unknown population parameters. Those parameters involve the distribution of study variables in the entire population, but our inferences are inductive and based on just the fraction of the population that appeared in our sample, creating the possibility for discordance between those parameters and our inferences about them. Although rigorous statistical inference procedures can allow us to control the probabilities of certain kinds of incorrect decisions, they cannot eliminate them.

For example, consider a two-armed randomized controlled trial designed to address a typical null hypothesis, that the probability of successful treatment is the same for the experimental treatment as for the comparator. Depending on the analytical methods to be employed, that null hypothesis could also be phrased as saying that the difference in successful treatment probabilities between the two arms is zero or that the ratio of the successful treatment probabilities between the two groups is one. Suppose the study sponsor has two possible choices regarding the null hypothesis, either to reject it or fail to reject it. (The latter choice is colloquially called “accepting the null hypothesis”, but that is a bit of an overstatement, as the absence of evidence for an effect in a sample typically does not rise to the level of being convincing evidence for the absence of an effect in the population.)

With these two choices about the null hypothesis, there are two major types of incorrect decisions that can be made. The null hypothesis could be true for the population but the study data led to a decision to reject it, a result conventionally called a “type-1” error. Or the null hypothesis could be false for the population but the study data led to a decision not to reject it, conventionally called a “type-2” error. Conversely, there are also two potentially correct decisions: one could fail to reject the null hypothesis when it is true for the population, a so-called “true negative”, or one could reject the null hypothesis when it is not true, a so-called “true positive”.
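To make the type-1 error rate concrete, here is a minimal Python sketch (with made-up numbers: 100 participants per arm and a 30% success probability in both arms) that simulates many two-arm trials under a true null hypothesis and counts how often a two-proportion z-test rejects at the 5% level. Under the null, the rejection rate comes out close to the nominal 5%: rigor controls the error probability, but cannot eliminate the errors.

```python
import random
from statistics import NormalDist

def two_prop_z_pvalue(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a two-proportion z-test (pooled normal approximation)."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(12345)
n_per_arm, p_both_arms, alpha = 100, 0.30, 0.05  # hypothetical values
n_trials, rejections = 10_000, 0
for _ in range(n_trials):
    # both arms share the same true success probability: the null is true
    a = sum(random.random() < p_both_arms for _ in range(n_per_arm))
    b = sum(random.random() < p_both_arms for _ in range(n_per_arm))
    if two_prop_z_pvalue(a, n_per_arm, b, n_per_arm) < alpha:
        rejections += 1  # a type-1 error: the null is true, yet we rejected it
print(rejections / n_trials)  # close to the nominal 0.05
```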

The consequences of these four different decision classifications vary from one stakeholder to another, and thus **it is unwise to rely solely and simply on commonly used error probabilities when planning studies**. The wiser approach is to **set the error probabilities so that they properly account for the relative gains and losses to a stakeholder** that arise from correct and incorrect decisions, respectively. From long experience assessing the design of clinical trials for probiotics and prebiotics, I recommend that stakeholders in the design phase of studies give thought to the following three statistical considerations.

**Pay attention to power**

Power is the probability of avoiding a type-2 error. In other words, under the condition that an assumed true effect exists in the study population and that the type-1 error has been controlled at a given value, power is the probability of avoiding the incorrect decision to fail to reject the null hypothesis. Standard practice is to set the type-1 error at 5% and to determine a sample size that achieves 80% power under an assumed alternative hypothesis, one stating that the true effect is of a specific given magnitude, corresponding to a so-called meaningful effect size. That effect size is typically called a ‘minimum clinically significant difference’ (MCSD) or something similar, because ideally the assumed effect size would be the smallest of the values that would be clinically important. As a practical matter, though, the larger the magnitude of the assumed effect size, the lower the sample size requirement and thus the better the chance of the study being perceived as “affordable” by study sponsors, so the MCSDs used to power studies are often larger than some of the values that would also be clinically significant. Nevertheless, let’s consider what it means for the sponsor to accept that the study should be powered at merely the conventional 80% level. Under the assumptions that the true effect in the population is the MCSD and that the study achieves its target sample size, **a sponsor of a study that has only 80% power is taking a 1-in-5 chance that the sample results would not be statistically significant** (and that the null hypothesis would fail to be rejected). Such an incorrect decision could have major adverse implications for the sponsor (and for potential beneficiaries of the intervention), particularly given the investments made in the research program and the potential for the incorrect decision to misinform future decisions regarding the specific intervention and, indeed, related interventions.
A 20% risk may not be worth taking.
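As a sketch of how these conventions translate into a sample size, the snippet below applies the standard normal-approximation formula for comparing two proportions. The success probabilities (50% for the comparator, 65% for the experimental arm as the assumed MCSD) are purely hypothetical, chosen only to illustrate the calculation.

```python
import math
from statistics import NormalDist

def n_per_arm(p_comparator, p_experimental, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion
    comparison (unpooled normal approximation)."""
    z = NormalDist().inv_cdf
    variance = (p_comparator * (1 - p_comparator)
                + p_experimental * (1 - p_experimental))
    return math.ceil(
        (z(1 - alpha / 2) + z(power)) ** 2 * variance
        / (p_experimental - p_comparator) ** 2
    )

# hypothetical numbers: 50% success on comparator, 65% under the assumed MCSD
print(n_per_arm(0.50, 0.65))              # 167 per arm for 80% power
print(n_per_arm(0.50, 0.65, power=0.95))  # 276 per arm for 95% power
```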

All other considerations being equal, the risk of a type-2 error could be lowered by increasing the sample size. Under regular asymptotic assumptions that generally apply, increasing the target sample size by about one-third would cut a 20% type-2 error risk in half, to 10%. Increasing the target sample size by two-thirds reduces it all the way to 5%.
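Those multipliers follow from the normal-approximation sample-size formula, in which the required n is proportional to the square of the sum of the two relevant standard-normal quantiles. A quick sketch to verify the arithmetic:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard-normal quantile function

def sample_size_multiplier(power, alpha=0.05, reference_power=0.80):
    """How much larger n must be, relative to an 80%-power design,
    to reach the given power (normal approximation)."""
    return ((z(1 - alpha / 2) + z(power)) ** 2
            / (z(1 - alpha / 2) + z(reference_power)) ** 2)

print(round(sample_size_multiplier(0.90), 2))  # 1.34: about one-third more
print(round(sample_size_multiplier(0.95), 2))  # 1.66: about two-thirds more
```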

**Define the true minimum clinically significant effect size applicable to your study**

Another important question is where to set the minimum clinically significant effect. Often that effect is based on prior studies without any adjustment, but this can neglect key considerations. **Prior effect estimates for an intervention are typically biased in a direction that overstates its benefits, especially if the intervention emerged from smallish early-phase studies.** More fundamentally, from the perspective of decision theory, the estimated effects seen in prior studies do not specifically address what could truly be the minimum clinically meaningful effect when one considers the possible benefits, risks, and costs of the intervention. Probiotics and prebiotics are typically relatively benign interventions in terms of adverse events, so it could be that even more modest favorable impacts on health than were seen in prior studies are still worthwhile.

Powering your study based on what truly is a minimal clinically meaningful effect may lead to a better overall strategy for optimizing net gains, while giving the intervention an appropriately high chance of showing that it works. Although the smaller the assumed effect size, the larger the required sample size needed to detect it (all other factors being the same), a proper assessment of the relative risks and benefits of the intervention and, also, of correct and incorrect decisions about the intervention, may provide a strong basis for making that investment.

In addition, there is another important but often overlooked aspect of deciding on what is a worthwhile improvement. We frequently turn to clinicians to determine what would be a worthwhile improvement, and it is natural for a clinician to address that question by considering what would be a meaningful improvement for a patient who responds to the intervention. **Keep in mind, though, that an intervention could be worthwhile for a population if it achieves what would be a worthwhile improvement for a single patient (say, a mean improvement of 0.2 SD on a quality-of-life scale) in only a fraction of the patients in the overall population, say 50%.** There are many conditions for which having an intervention that works for only large subsets of the population could be valuable in improving the population’s overall health and wellness. Using this example, where the worthwhile improvement for an individual is 0.2 SD and the worthwhile responder percentage is 50%, the worthwhile improvement that should be used to power the study would be 0.1 SD, which is equal to (0.2 SD * 50%) + (0 SD * 50%), with the latter product quantifying an assumed absence of a benefit in the non-responders. **What should be gleaned from this example is that the minimum clinically important effect for a population is typically less than the minimum clinically important effect for an individual.** The effect used to power the study should be the one that applies to the relevant population. Again, that effect should be chosen so that it balances benefits relative to the costs and harms of the intervention while also accounting for variation in whether and how much individuals in the population may respond. When study planners fail to account for this variation, the result is a study that is underpowered for detecting meaningful population-level effects.
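The responder-dilution arithmetic in that example can be written out in a few lines; the 0.2 SD individual effect and 50% responder fraction are the hypothetical values from the text.

```python
def population_effect(individual_effect_sd, responder_fraction):
    """Population-level mean effect when only a fraction of patients respond;
    non-responders are assumed to receive zero benefit."""
    return (individual_effect_sd * responder_fraction
            + 0.0 * (1 - responder_fraction))

# 0.2 SD benefit in responders, 50% responders: power the study for 0.1 SD
print(population_effect(0.2, 0.5))  # 0.1
```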

**Improving the signal-to-noise ratio**

In general, effect sizes can be expressed analogously to a mean difference divided by a standard error, and thus can be thought of as a signal-to-noise ratio. Sample size requirements depend crucially on this signal-to-noise ratio. Typically, standard errors are proportional to outcome standard deviations and inversely proportional to the square root of the sample size. The latter is key: if the expected signal were cut in half, the noise would also need to be cut in half to maintain the signal-to-noise ratio, and if you cannot alter the outcome standard deviation, the only way to halve the standard error is to quadruple the sample size. Happily, this also applies in the opposite direction: if you can double the expected signal-to-noise ratio, you would need only one-fourth the sample size to achieve the desired power, all other things being equal.
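That inverse-square relationship can be checked with the normal-approximation sample-size formula for a standardized mean difference; the 0.2 SD and 0.1 SD effect sizes below echo the earlier responder example.

```python
import math
from statistics import NormalDist

def n_per_arm_for_mean_difference(effect_sd, alpha=0.05, power=0.80):
    """Per-arm n to detect a standardized mean difference (two-sided test,
    equal-variance two-sample comparison, normal approximation)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / effect_sd ** 2)

print(n_per_arm_for_mean_difference(0.2))  # 393 per arm
print(n_per_arm_for_mean_difference(0.1))  # 1570 per arm: halving the signal quadruples n
```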

Signal-to-noise ratios can be optimized by designing a trial for a judiciously restricted target population (of potential responders) and by using high-quality outcome measurements for the trial to reduce noise. Although research programs may eventually aim to culminate in large pragmatic trials that show meaningful improvements associated with an intervention even in populations of individuals with wide variations in their likelihood and amount of potential response, **until that stage of a research program it is generally wise to focus trials so that they give accurate information as to whether the intervention works in populations targeted for being more apt to respond to it**. To do that, for example, the trial methods should include accurate assessments of whether potential recruits are currently experiencing symptoms of whatever condition the intervention is intended to address and whether they would be able to achieve the desired dose of whatever the trial assigns to them. For a truly beneficial intervention, it is easier to keep a research program advancing if the intervention sustains an unbroken string of “true positive” results from its earliest trials onward, avoiding a potentially fatal type-2 error (a “false negative”).

Careful attention to the above considerations can lead to better trials, ones that combine rigor and transparency with a tailored consideration of the relative costs and benefits of potentially fallible statistical inferences, so that the resulting evidence is as informative as possible for stakeholder decision-making.