**This document is adapted from a similar paper, Statistical Uncertainty in Randomised Controlled Trials (August 2018), developed by the Education Endowment Foundation (EEF).**

4 September 2018

## Introduction

This note discusses the role of statistical significance and statistical uncertainty in randomised controlled trials (RCTs).

It explains why Evidence for Learning (E4L) does not include statements about statistical significance in its headline descriptions of RCT results in its Evaluation Report and E4L webpage but does include those statements in the Executive Summaries and Commentaries on Evaluation Reports.

It also explains where E4L follows the reporting format of the EEF and where and why we deviate from it.

## Why does statistical uncertainty arise?

E4L evaluations use a randomised controlled trial methodology to estimate the size of impact of an intervention.

In an RCT, a sample of subjects is selected from the population of interest, and each subject in the sample is randomly allocated to either the intervention group or the control group. In a cluster RCT, groups of subjects (such as a class or a school) are randomly allocated to either the intervention group or the control group.

The difference between the control group and the intervention group after the intervention has occurred is the basis for estimating the average effect[1] that the intervention would have had on the whole sample, or on the wider population (the effect size).
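As a minimal sketch of how such a difference can be turned into an effect size, the example below computes a standardised mean difference (the difference in group means divided by the pooled standard deviation). This is one common choice and is an assumption here: evaluators may use other methods, and the scores are invented purely for illustration.

```python
import math
import statistics


def standardised_mean_difference(intervention, control):
    """Difference in group means divided by the pooled standard deviation."""
    n_i, n_c = len(intervention), len(control)
    mean_diff = statistics.mean(intervention) - statistics.mean(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_i - 1) * statistics.variance(intervention)
        + (n_c - 1) * statistics.variance(control)
    ) / (n_i + n_c - 2)
    return mean_diff / math.sqrt(pooled_var)


# Hypothetical post-test scores (not from any real evaluation).
intervention_scores = [52, 55, 61, 58, 60, 57]
control_scores = [50, 53, 56, 54, 55, 52]

effect_size = standardised_mean_difference(intervention_scores, control_scores)
print(f"Observed effect size: {effect_size:.2f}")
```

A positive value indicates that the intervention group scored higher on average than the control group, expressed in standard deviation units.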

RCTs are one of the most rigorous ways to evaluate the impact of an intervention. However, there are two steps in the RCT process where uncertainty is introduced, regardless of how well the experiment is designed and implemented:

- When we sample subjects from a population, it is possible to get an unrepresentative sample, even if you take a random sample. It is not possible to know whether this has, in fact, happened unless you measure the whole population on every relevant characteristic, which isn’t feasible.
- When we randomly allocate subjects to the intervention group or the control group, it is possible that the groups differ in some way which means that their outcomes at the end point of the intervention would have been different, even without the intervention. Again, it is not possible to know whether this has happened unless you measure the whole sample on every relevant characteristic.
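The second point can be illustrated with a short simulation. The sketch below takes one hypothetical sample of 60 subjects, each with a pre-existing score, and randomly allocates them to two equal groups many times over. No intervention is applied, so any gap between the group means is due to random allocation alone. The sample size, score distribution and number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics


def allocation_gap(prior_scores, rng):
    """Randomly split subjects into two equal groups and return the
    difference in mean prior score (first group minus second group)."""
    shuffled = prior_scores[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return statistics.mean(shuffled[:half]) - statistics.mean(shuffled[half:])


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample: 60 subjects with a pre-existing score.
sample = [rng.gauss(50, 10) for _ in range(60)]

# Repeat the random allocation many times with no intervention at all.
gaps = [allocation_gap(sample, rng) for _ in range(2000)]

print("Average gap between groups:", round(statistics.mean(gaps), 2))
print("Spread of gaps (std dev):  ", round(statistics.pstdev(gaps), 2))
print("Largest gap seen:          ", round(max(abs(g) for g in gaps), 2))
```

On average the gap is close to zero, but individual allocations can produce groups that differ noticeably before any intervention has taken place.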

The uncertainty that is introduced during these steps is sometimes referred to as statistical uncertainty. This uncertainty means it is always possible that the effect size observed in the RCT will differ from the true average effect size[2] in the sample or the population[3].

In an individual RCT, we cannot know with complete certainty whether or to what extent this possibility has occurred. However, for any given RCT design, there are statistical techniques evaluators can use to assess the risk that the effect size estimate will not be a ‘sufficiently good’ estimate of the true average effect size in the population because of the uncertainty that sampling or random allocation introduces[4].

E4L requires its commissioned evaluations to use an appropriate technique to assess this risk. The results are included in the published report of that evaluation and the executive summary.

## Using statistical significance to assess the uncertainty

To determine the level of statistical uncertainty, evaluators will consider the hypothetical situation in which:

- The sample is a random sample from the population of interest;
- The experiment is conducted a large number of times on the same population; and
- The intervention has an average impact of zero on the population.

They then calculate how likely you would be, in this hypothetical situation, to observe a difference at least as big as the actual observed difference (the observed effect size), purely as a result of statistical uncertainty[5].

If the actual observed difference is very unlikely under this hypothetical situation, then this is judged to provide evidence against the hypothetical situation, that is, against the hypothesis that the intervention has no impact.

The figure that evaluators calculate during the process described above is called a p-value, and it shows the probability, under the hypothetical situation, that you would observe a difference as extreme as or more extreme than the one that was actually found. If the p-value is 0.05, there is a probability of 0.05 - equivalent to a 5% chance - that you would observe a difference as extreme as or more extreme than the difference that was actually found, assuming the hypothetical situation to be true.
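One way to approximate this calculation is a permutation (re-randomisation) test, sketched below. If the intervention truly had no impact, the group labels would be arbitrary, so we can reshuffle them many times and count how often the reshuffled difference is at least as extreme as the observed one. This is only one possible technique and is an assumption here; evaluators may use other methods, and the scores are hypothetical.

```python
import random
import statistics


def permutation_p_value(intervention, control, n_permutations=5000, seed=0):
    """Two-sided p-value for a difference in means, estimated by
    repeatedly reshuffling the group labels."""
    rng = random.Random(seed)
    observed = statistics.mean(intervention) - statistics.mean(control)
    pooled = list(intervention) + list(control)
    n_i = len(intervention)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_i]) - statistics.mean(pooled[n_i:])
        # Small tolerance guards against floating-point rounding on ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return extreme / n_permutations


# Hypothetical post-test scores (not from any real evaluation).
intervention_scores = [52, 55, 61, 58, 60, 57]
control_scores = [50, 53, 56, 54, 55, 52]

p_value = permutation_p_value(intervention_scores, control_scores)
print(f"Estimated p-value: {p_value:.3f}")
```

The returned figure is the proportion of reshuffles that produced a difference as extreme as or more extreme than the one actually observed, which mirrors the definition of the p-value given above.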

For RCTs, some evaluators set a threshold for how big this chance can be: an effect size estimate is only considered to be ‘significant’ if the chance is below 1%, 5% or 10%.

Most commonly, and for E4L commissioned evaluations, the threshold is set at 5% and the effect size estimate is considered to be ‘significant’ if an effect size as or more extreme would only be found 5% of the time (or less), under the hypothetical situation of no actual impact.

For E4L evaluations, where the evaluator calculates a p-value or carries out a test for significance, the results are included in the published report of that evaluation and the executive summary.

## E4L’s approach to statistical significance

All E4L commissioned evaluations include tests for statistical significance on effect sizes. E4L reports on these tests (and the p-values) in full in the evaluation report, and in the key conclusions of the Executive Summary and the Commentary.

This is a slight deviation from the EEF approach. The EEF includes the p-value in the full report but not in its summary documents or the headline measures of effect size (translated into months of additional progress).

In part, this is because it is very difficult to communicate precisely what a p-value (and hence a significance test) tells us. Experts and lay people alike find it very hard to distinguish between:

- the probability that you would observe a difference as extreme as (or more extreme than) the one that was actually found, given that the intervention had no impact; and
- the probability that the intervention had no impact, given that you observed a difference as extreme as the one that was actually found.
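The gap between these two probabilities can be made concrete with Bayes' rule. The sketch below fixes the first probability at 5% (the chance of a significant result when there is no impact) and shows how the second probability then depends on two further quantities: how likely it was beforehand that the intervention had no impact, and the chance of a significant result when there is a real impact. All of the input values are hypothetical and chosen only to illustrate the point.

```python
def prob_no_impact_given_significant(prior_no_impact, alpha, power):
    """Bayes' rule: P(no impact | significant result)."""
    false_positives = prior_no_impact * alpha
    true_positives = (1 - prior_no_impact) * power
    return false_positives / (false_positives + true_positives)


# Hypothetical inputs, not taken from any real set of trials.
alpha = 0.05  # P(significant result | no impact)
power = 0.80  # P(significant result | real impact)

for prior in (0.5, 0.9):
    posterior = prob_no_impact_given_significant(prior, alpha, power)
    print(f"P(no impact) beforehand = {prior:.1f} -> "
          f"P(no impact | significant) = {posterior:.3f}")
```

Under these assumptions the second probability is about 6% when half of interventions have no impact, but about 36% when nine in ten have no impact, even though the first probability is 5% in both cases.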

There is evidence that this confusion consistently leads researchers and research users to conclude that a p-value of 0.05 means that there is only a 5% chance that a positive finding is due to chance[6], which is not true[7].

In addition, statistical uncertainty is not the only factor that can affect whether the effect size observed in a given experiment is a ‘sufficiently good’ estimate of the true average effect size in the population. Other factors which can affect this include:

- Imbalance - large differences on known variables between the intervention group and control group before the intervention starts.
- Attrition - subjects from the intervention and control groups dropping out of the trial so that they cannot be included in the trial analysis.
- Other sources of bias - for example, people marking outcome tests being unconsciously influenced to give higher marks to pupils in the intervention group (this is a well-documented phenomenon).

There is an increasing trend across the social sciences to move away from statistical significance as a way of assessing RCT results. Some scientific journals no longer publish p-values, and some evaluators completely reject the validity of p-values, or the validity of testing whether a result is ‘significant’[8].

## E4L’s use of padlocks and p-values

For the reasons above, the EEF has created (and E4L has adopted) a broad measure of the overall security of results, which takes into account both statistical uncertainty and the other factors which can affect whether an effect size estimate is a ‘sufficiently good’ estimate of the true average effect size in the population. The padlock security rating takes into account:

- Design
- Imbalance
- Attrition
- Other sources of bias
- Statistical uncertainty

In the padlock rating, statistical uncertainty is accounted for using a measure of ‘statistical power’. The concept of statistical power is closely related to the concept of significance testing. All else being equal, the higher the statistical power of a trial, the more likely it is for the observed effect size to be significant using a given p-value threshold. However, statistical power is independent of the size of the impact estimate, whereas statistical significance is not; it is possible for a trial to have good statistical power even if the observed effect size is not significant, for example if the effect is close to zero.
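A rough sketch of how statistical power can be calculated is given below. It assumes a simple two-arm trial with equal group sizes and individual randomisation, and uses a normal approximation; cluster RCTs need further adjustments that are not shown. The function name and the input values are illustrative assumptions, not the method used in the padlock rating. Note that the inputs are the trial design and an assumed true effect size, not the effect size that was eventually observed.

```python
import math
from statistics import NormalDist


def approximate_power(true_effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test for a standardised mean
    difference, using the normal approximation."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic if the true effect is as assumed.
    shift = true_effect_size * math.sqrt(n_per_group / 2)
    return z.cdf(shift - critical) + z.cdf(-shift - critical)


# Hypothetical designs: an assumed true effect of 0.2 standard deviations.
for n in (100, 400, 1000):
    print(f"n per group = {n:4d} -> power = {approximate_power(0.2, n):.2f}")
```

Larger trials have higher power for the same assumed effect, which is why power can be assessed from the design alone.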

The padlock rating therefore provides a single accessible measure of the security of an effect size result, having taken account of a range of factors that can affect whether it is a ‘sufficiently good’ estimate of the true average effect size in the population. For this reason E4L presents the padlock rating in its headline summary of an evaluation.

E4L recognises, however, that many in the education academic community are interested in, and rely on, p-values to make assessments about the strength of a particular finding. We have therefore deviated from the EEF approach by also including p-values in the executive summary and commentary of our evaluations. We have adopted 0.05 as the threshold for statistical significance and state that a result with a p-value higher than 0.05 needs to be treated with caution.

E4L believes that having both the padlock rating of the overall security of the evaluation and the p-values for specific results achieves the right balance between accessibility for a lay reader and accuracy for critical review.

*In E4L evaluations a padlock security rating is allocated and presented along with the effect size on the relevant E4L evaluation page. More detail on how the E4L padlock security rating is calculated can be found here.*

## What if a trial has a high padlock rating but the effect size is not statistically significant?

The higher the statistical power of a trial, the more likely it is for the observed effect size to be significant using the 0.05 (or 5%) p-value threshold. When the EEF developed the padlock rating system, it was expected that it would be unusual for a trial to get a high padlock rating, and have an effect size of important magnitude, but not deliver a statistically significant result.

Now that the EEF have published a large number of trial results, they have found that there are a small number of projects which have a positive result of 1 month’s progress or more (which could be an important magnitude depending on context), and a padlock rating of 3 or more, but for which the effect size estimate is not statistically significant (using a selected p-value threshold of 5%).

This illustrates the difference - in terms of minimising the risk of statistical uncertainty - between the requirement relating to statistical power in the E4L padlock security rating, and the requirement for the effect size to be statistically significant.
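The difference can be sketched numerically. The example below assumes a simple two-arm trial with 400 subjects per group and individual randomisation, and uses a normal approximation to convert an observed effect size into a two-sided p-value. The numbers are hypothetical and are not taken from any EEF or E4L trial.

```python
import math
from statistics import NormalDist


def two_sided_p_value(observed_effect_size, n_per_group):
    """Approximate two-sided p-value for a standardised mean difference
    in a two-arm trial with equal group sizes."""
    standard_error = math.sqrt(2 / n_per_group)
    z_score = observed_effect_size / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z_score)))


# Same hypothetical design, two different observed effect sizes.
n_per_group = 400
for observed in (0.10, 0.20):
    p = two_sided_p_value(observed, n_per_group)
    verdict = "significant" if p <= 0.05 else "not significant"
    print(f"observed effect = {observed:.2f} -> p = {p:.3f} ({verdict})")
```

Under these assumptions the same design gives a p-value of roughly 0.16 for an observed effect of 0.10 and a p-value well below 0.05 for an observed effect of 0.20. The design, and therefore its power, is identical in both cases; only the observed effect size changes the significance result.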

This is one of the reasons that E4L has resolved to also publish the p-value of evaluations along with the padlock rating, so that readers can make their own assessment of the importance of the inconsistency.

At the time of writing, the EEF is assessing how best to present the results of evaluations which have a high padlock rating and a positive effect size that is not statistically significant.

## References

[1] Various methods can be used to analyse an RCT, but they all make use of the difference observed between the control group and the intervention group after the intervention has occurred.

[2] You can never actually know the ‘true average effect size’, as you would need to know the pre‐test and post-test outcomes for each member of the sample/population both with and without the intervention, which is not possible.

[3] Indeed, even for two identical experiments, the observed effect size is likely to differ a little, and will occasionally differ a lot, because of this statistical uncertainty.

[4] There are different recognised techniques and a range of thresholds used for the acceptable level of uncertainty.

[5] This calculation is only possible if you assume you have a random sample from the population of interest, which is why evaluators make the assumption in the first bullet point.

[6] See, for example, Understanding The New Statistics by Geoff Cumming, Box 1.1.

[7] Ronald L. Wasserstein & Nicole A. Lazar (2016) The ASA's Statement on p‐Values: Context, Process, and Purpose, The American Statistician, 70:2, 129‐133, DOI: 10.1080/00031305.2016.1154108

[8] https://www.nature.com/news/one-size-fits-all-threshold-for-p-values-under-fire-1.22625