Effect sizes in education: Bigger is better right?

Introduction

Evidence-based practice has become a key focus for education systems and schools in seeking to improve student learning outcomes. Evidence comes in a variety of forms and ranges in quality, and discussion regarding the quality of evidence should be included when investing in an educational initiative. We focus here on the effect size in educational research, examining some of the complexity that underpins this simple measure of impact and detailing things to look for when evaluating research. Effect sizes matter when it comes to decision making in education, but bigger may not always be better.

Suppose we are interested in determining whether an educational intervention has an impact on student outcomes. There are generally two questions asked when evaluating an intervention:

Is the observed effect real, or due to chance alone?
How large is the effect?

Question one focuses on statistical significance (the p-value), which tells us whether the difference found between groups is unlikely to be caused purely by chance (Bakker et al., 2019). Question two refers to the effect size, which tells us the magnitude of the difference between the results of a group receiving an intervention and those of a group not receiving it (the control or comparison group). The effect size is the difference between groups converted into standard deviation units, which means the result can be compared against other studies or against the normal/expected achievement growth of students.
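To make the idea of "standard deviation units" concrete, the sketch below computes one common effect size, Cohen's d with a pooled standard deviation. The article does not say which formula the studies it cites used, so the choice of Cohen's d, the function name `cohens_d` and the test scores are all assumptions made purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(intervention, control):
    """Difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(intervention), len(control)
    s1, s2 = stdev(intervention), stdev(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(intervention) - mean(control)) / pooled_sd


# Hypothetical test scores, invented for this example only.
intervention_scores = [54, 58, 61, 63, 67, 70]
control_scores = [50, 55, 57, 60, 62, 64]

effect_size = cohens_d(intervention_scores, control_scores)
print(f"Effect size: {effect_size:.2f} SD")
```

Because the result is expressed in standard deviations rather than raw test points, it can be set beside results from studies that used different tests, which is what makes the measure attractive and also easy to over-interpret.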

As an example, let us consider two studies, each designed to evaluate the effect of a different reading comprehension intervention against a control group, and each reporting a p‑value that gives us lots of confidence that the intervention makes a difference (say p = 0.001). Intervention X finds a much larger effect size of 0.68 standard deviations (SD) than intervention Y with an effect size of 0.18 SD. Should we conclude that using reading intervention X is more effective than Y? A large effect size is good news, right?
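One reason to be cautious is that the same p-value can arise from very different sample sizes. The sketch below uses a rough normal approximation for two equal-sized groups to estimate how many students per group would turn each effect size into p = 0.001. The article does not report sample sizes for interventions X and Y, so the function `approx_n_per_group` and the numbers it produces are illustrative assumptions, not properties of any real study.

```python
from statistics import NormalDist


def approx_n_per_group(effect_size, p_value):
    """Approximate students per group at which `effect_size` (in SD units)
    would give a two-sided `p_value`, assuming two equal-sized groups and
    a normal (z) approximation."""
    z = NormalDist().inv_cdf(1 - p_value / 2)
    # With n students per group, z is roughly effect_size * sqrt(n / 2).
    return 2 * (z / effect_size) ** 2


n_x = approx_n_per_group(0.68, 0.001)  # intervention X from the example
n_y = approx_n_per_group(0.18, 0.001)  # intervention Y from the example

print(f"X: about {n_x:.0f} students per group")
print(f"Y: about {n_y:.0f} students per group")
```

Under these assumptions the smaller effect needs a far larger sample to reach the same p-value, so a large effect paired with a convincing p-value may come from a much smaller study.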

Whilst making for intuitive comparison of education research results, an effect size gives no information about:

The type of research that has been undertaken
Whether the outcomes used are meaningful in the grand scheme of a child’s education
Whether the results are generalizable to the broader population of schools, and students, within a system

This information is left to the reader to hunt down, and it becomes harder to gather and interpret when studies are pooled into a meta-analysis, where effect size estimates range from 0.06 (Lortie-Forgues & Inglis, 2019) to 0.70 (Hattie, 2009). The information that an effect size does not provide is the reason for such disparity in reported effects.

This article presents three broad principles of research design and how they interact to determine the quality of evidence. An example of the teaching practice of feedback is used to demonstrate how effect size and research design interact and we provide considerations for making decisions about initiatives.

Principles of research design

What is the research type?

Correlational, quasi-experimental and experimental studies are the three broad study types that report effect sizes in education research. Correlational studies are considered lower quality evidence because they describe the strength of a relationship between two variables (e.g., teacher attentiveness and student achievement), not the ability of one variable to cause another (remember, correlation is not causation). Quasi-experimental studies compare effects between groups in different conditions (control vs intervention), or compare a single group against ‘normal’ growth. These studies lack the randomisation (random group assignment by a third party) needed to be labelled an experimental trial. As such, quasi-experimental trials cannot support a confident conclusion that a difference between the groups is due solely to the intervention rather than to the way participants were sampled. Providing that confidence is the advantage of the experimental trial and the reason it is considered the top of the evidence food chain.

What outcomes are used?

Study outcomes that are more proximal (e.g., a specific mathematics ability test or teacher behaviour in a pedagogical intervention) are easier to change than outcomes that are more distal (e.g., NAPLAN numeracy or Progressive Achievement Testing). More distal measures of student achievement are likely to give a better indication of the effect of an intervention on the development of broader skills in a subject area, and are more readily linked to the broader benefits of lifetime attainment through enhanced education outcomes.

What was the sample?

Small samples (< 500 students) drawn from only a few schools are less likely to represent the demographics of the schools and students that make up a system, so their findings are harder to generalise than findings from large samples (> 2000 students) drawn from many different schools.

What is the interaction between research design and research quality?

Table 1 demonstrates how the three elements of research design interact to determine the quality of research. In research designed to evaluate an intervention for the improvement of student outcomes, research considered to be high quality uses experimental designs, standardised measures that are distal in nature and large samples from many different schools. Findings from research of this nature have a high level of generalisability across school systems because the research is designed to remove factors that call into question whether the intervention caused the results. The outcomes from these evaluations are more connected to the overall academic achievement of students and the sample is representative of the demographics of schools that make up a system. As the quality of research falls, we are less confident in the generalisability of the results because less of the bias in the study has been controlled, the outcomes focus on smaller elements of student achievement and the samples are smaller and less representative.

Table 1. Interaction between elements of research design and research quality

Quality	Research type	Outcome selection	Sample selection
High	Experimental	Distal	Large (> 2000)
Moderate	Quasi-experimental	Proximal & Distal	Moderate (500–2000) & Large (> 2000)
Low	Correlational & Quasi-experimental	Proximal & Distal	Small (< 500)
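As a rough aid to reading Table 1, the sketch below turns its three rows into a lookup. It is one literal reading of the table rather than a scoring rule proposed by the authors: the function name `classify_quality`, the lower-case labels and the choice to return `None` for combinations the table does not list (for example, an experimental study with a small sample) are all assumptions.

```python
def classify_quality(research_type, outcome, sample_size):
    """Return 'high', 'moderate' or 'low' following the rows of Table 1,
    or None when the combination does not appear in the table."""
    # Sample-size bands taken from the table: < 500, 500-2000, > 2000.
    if sample_size < 500:
        size = "small"
    elif sample_size <= 2000:
        size = "moderate"
    else:
        size = "large"

    if research_type == "experimental" and outcome == "distal" and size == "large":
        return "high"
    if research_type == "quasi-experimental" and size in ("moderate", "large"):
        return "moderate"
    if research_type in ("correlational", "quasi-experimental") and size == "small":
        return "low"
    return None  # Not covered by Table 1 (assumption: leave unclassified).


print(classify_quality("experimental", "distal", 3000))
print(classify_quality("quasi-experimental", "proximal", 1000))
print(classify_quality("correlational", "proximal", 200))
```

Returning `None` for unlisted combinations is a deliberate choice here: it avoids guessing how the authors would rate designs that the table leaves out.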

Interaction between research quality and effect size: The example of feedback

Research since the 1970s on the use of feedback in classroom practice to promote student outcomes demonstrates the relationship between research quality and effect size. This example draws on meta-analytic studies and on large-scale, independently evaluated randomised controlled trials undertaken by the Institute of Education Sciences (USA) and the Education Endowment Foundation (UK), in which feedback was evaluated for the development of academic outcomes in school children. Studies that evaluated behavioural outcomes, teacher-based outcomes and university studies were excluded. Results of this research synthesis are presented in Table 2.

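As Table 2 shows, the reported effect size falls as research quality rises, from 0.69 in low quality studies to 0.08 in high quality studies. This does not imply that the quality of the intervention is lower in the studies applying high quality evaluation, but that the effect size derived is likely a more accurate representation of the actual impact of feedback interventions on student achievement.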

Table 2. Interaction between research quality and effect size for feedback

Quality	Effect size	References
High	0.08	Speckesser et al. (2018) [Wiliam, D. intervention]; Coe et al. (2011); Kozlow and Bellamy (2004)
Moderate	0.36	Wiliam et al. (2004); Kingston and Nash (2011); Wisniewski et al. (2020)
Low	0.69	Hattie (2009); Fuchs and Fuchs (1986); Graham et al. (2015)

What does this mean for schools?

In short, raising student achievement is hard work, and the effects of professional development interventions promised to schools have likely been overly optimistic (Kraft, 2020). Whilst the effect of 0.08 for feedback-based interventions seems small, it is equivalent to one month's growth in addition to that of the control group across a school year (Education Endowment Foundation, 2018).

There are several important points to note from this analysis about the effect size of feedback under high quality evaluation conditions. First, the result for the effect of feedback is still positive. A recent meta-analysis of high-quality evaluations reports an average intervention effect of 0.06, with 35% of interventions displaying negative effects on student achievement (Lortie-Forgues & Inglis, 2019). Second, the likelihood of schools achieving a positive result from a feedback-based initiative is high because the results were obtained using experimental designs on large representative samples measured with standardised outcomes.

To assist schools in making decisions about research informed initiatives we provide the following recommendations:

1. Invest in programs that have demonstrated effects under the conditions of rigorous evaluation

One third of interventions designed to improve student achievement had a negative effect when evaluated under rigorous research conditions (Lortie-Forgues & Inglis, 2019). Choosing interventions that have demonstrated effects under the conditions of high-quality evaluation means you are less likely to undertake an initiative that may have little to no impact on outcomes, even if this initiative has demonstrated positive effects using multiple low-to-moderate level research projects.

We are not suggesting that the vast amount of educational research to date is not informative to practice, but that the effect size expectations set by earlier research should be evaluated in relation to the quality of the research used to obtain the result. The Institute of Education Sciences (USA), the Education Endowment Foundation (UK) and Evidence for Learning (Aus) are all funding and brokering high quality evaluation and synthesis of education initiatives. It is hoped that these organisations and the soon to be established National Evidence Institute will be a starting point for examination of high-quality evidence.

2. Scrutinise research

The information provided here is a start to understanding research. Training in research methods would aid school leaders and middle leaders to go beyond the materials offered by evidence brokers and professional development providers.

When scrutinising any potential program, particularly from a price-for-effect standpoint, it’s important to remember two things:

Evidence of scaling is not evidence of impact – A program undertaken in multiple countries, by any number of schools is evidence of a good marketing team, not high-quality evidence of impact.

A collection of evidence is also not evidence of impact – The approach of throwing in every practice that demonstrates an effect above a set criterion is outdated and, in most cases, untested.

3. Implement well

Implementing an initiative with the highest possible fidelity is crucial to achieving its intended effects on the outcome/s at hand. It is natural for adaptation to occur when an initiative is implemented across different settings. However, careful consultation with program developers should take place to ensure that you are not undertaking any ‘terminal’ adaptations that may dramatically reduce the chance of a program achieving its intended goal.

Equally important in an implementation plan is the readiness of the school to take on an initiative. A school readiness evaluation should inform how a chosen initiative is undertaken at an individual school. In some cases, the school might be ready to go (e.g., the school has an existing culture of collaborative practice) and in others there may need to be some work done in developing readiness over time (e.g., starting on a smaller scale with early adopters and gathering momentum over a 3–4 year school planning period).

4. Evaluate

Regardless of the quality of evidence underpinning an intervention that piques the interest of your school, we suggest trialling the intervention and evaluating its impact, feasibility and acceptability to scrutinise its effect in your specific school context.

Summing up

In education, the effect size has traditionally been used to sell the promise of improved student outcomes in the lucrative professional development market. An effect size can be applied to any type and quality of intervention research. Herein lies the problem. Education research using weaker designs, smaller samples and simple measures often displays larger effect sizes. Critiquing the quality of research used to evaluate interventions is more important than reliance on a single measure of impact. High quality intervention studies often produce smaller effect sizes than low quality studies but are more likely to provide an accurate picture of the impact, if any, that an intervention may have on students.

References

Bakker, A., Cai, J., English, L., Kaiser, G., Mesa, V., & Van Dooren, W. (2019). Beyond small, medium, or large: points of consideration when interpreting effect sizes. Educational Studies in Mathematics, 102(1), 1–8. doi:10.1007/s10649-019-09908-4

Coe, M., Hanita, M., Nishioka, V., & Smiley, R. (2011). An investigation of the impact of the 6+1 Trait Writing model on grade 5 student writing achievement (NCEE 2012-4010). Retrieved from Washington, DC: https://ies.ed.gov/ncee/edlabs/projects/project.asp?ProjectID=52

Education Endowment Foundation. (2018). Teaching and Learning Toolkit & EEF Early Years Toolkit. Retrieved from https://educationendowmentfoundation.org.uk/modals/help/projects/the-eefs-months-progress-measure/?bwf_dp=t&bwf_entry_id=1767&bwf_token_id=817&bwf_token=lZphMRCcCNAXoVbLoZdAogCjb

Fuchs, L. S., & Fuchs, D. (1986). Effects of Systematic Formative Evaluation: A Meta-Analysis. Exceptional Children, 53(3), 199–208. doi:10.1177/001440298605300301

Graham, S., Hebert, M., & Harris, K. R. (2015). Formative assessment and writing: A meta-analysis. Elementary School Journal, 115(4), 523–547.

Hattie, J. (2009). Visible Learning. A synthesis of over 800 meta-analyses relating to achievement. New York: Routledge.

Kaplan, A., Cromley, J., Perez, T., Dai, T., Mara, K., & Balsai, M. (2020). The Role of Context in Educational RCT Findings: A Call to Redefine “Evidence-Based Practice”. Educational Researcher, 49(4), 285–288. doi:10.3102/0013189x20921862

Kingston, N., & Nash, B. (2011). Formative Assessment: A Meta-Analysis and a Call for Research. Educational Measurement: Issues and Practice, 30(4), 28–37. doi:10.1111/j.1745-3992.2011.00220.x

Kozlow, M., & Bellamy, P. (2004). Experimental study on the impact of the 6+1 Trait Writing model on student achievement in writing. Retrieved from Portland, OR: https://educationnorthwest.org/sites/default/files/resources/Student_Achievement_in_Writing.pdf

Kraft, M. A. (2020). Interpreting Effect Sizes of Education Interventions. Educational Researcher, 49(4), 241–253. doi:10.3102/0013189X20912798

Lortie-Forgues, H., & Inglis, M. (2019). Rigorous Large-Scale Educational RCTs Are Often Uninformative: Should We Be Concerned? Educational Researcher, 48(3), 158–166. doi:10.3102/0013189x19832850

Speckesser, S., Runge, J., Foliano, F., Bursnall, M., Hudson-Sharp, N., Rolfe, H., & Anders, J. (2018). Embedding Formative Assessment: Evaluation report and executive summary. Retrieved from

Wiliam, D., Lee, C., Harrison, C., & Black, P. (2004). Teachers developing assessment for learning: impact on student achievement. Assessment in Education: Principles, Policy & Practice, 11(1), 49–65. doi:10.1080/0969594042000208994

Wisniewski, B., Zierer, K., & Hattie, J. (2020). The Power of Feedback Revisited: A Meta-Analysis of Educational Feedback Research. Frontiers in Psychology, 10. doi:10.3389/fpsyg.2019.03087

Evidence for Learning: Effect sizes in education: Bigger is better right?