Between two stools: preclinical research, reproducibility, and statistical design of experiments
BMC Research Notes volume 15, Article number: 73 (2022)
Translation of animal-based preclinical research is hampered by poor validity and reproducibility issues. Unfortunately, preclinical research has ‘fallen between the stools’ of competing study design traditions. Preclinical studies are often characterised by small sample sizes, large variability, and ‘problem’ data. Although Fisher-type designs with randomisation and blocking are appropriate and have been vigorously promoted, structured statistically-based designs are almost unknown. Traditional analysis methods are commonly misapplied, and basic terminology and principles of inference testing misinterpreted. Problems are compounded by the lack of adequate statistical training for researchers, and failure of statistical educators to account for the unique demands of preclinical research. The solution is a return to the basics: statistical education tailored to non-statistician investigators, with clear communication of statistical concepts, and curricula that address design and data issues specific to preclinical research. Statistics curricula should focus on statistics as process: data sampling and study design before analysis and inference. Properly-designed and analysed experiments are a matter of ethics as much as procedure. Shifting the focus of statistical education from rote hypothesis testing to sound methodology will reduce the numbers of animals wasted in noninformative experiments and increase overall scientific quality and value of published research.
“…I think we’re falling between two stools at the moment.… I think we have to take a step backward and address the basics of our game.”
––Donal Lenihan 25 Nov 2020, RTÉ Rugby Podcast, on Ireland’s need to revise training strategy following a string of defeats to England.
Criticism of much animal-based preclinical research has centred on reproducibility issues and poor translation [1, 2]. Causes are systemic and multifactorial, and include poor model fidelity, clinical irrelevance of target biomarkers or molecular pathways, and between-lab disparities in models and procedures [3, 4]. Difficulties in verifying and replicating methodology, and methodological issues related to poor statistical design and analysis, are also major contributors [6, 7, 8, 9, 10]. Translational failure has massive economic repercussions. Advances in therapeutic agents or diagnostics development are more than offset by multimillion-dollar losses in investment, and ultimately unsustainable research and development costs [6, 11, 12]. There is also a significant ethical component to these failures. If questionable methodology produces biased or invalid results, evidence derived from animal-based research cannot be a reliable bridge to human clinical trials. It is difficult to justify the continued use of millions of animals each year if the majority are wasted in non-informative experiments that fail to produce tangible benefit.
In this commentary, I suggest that preclinical research has ‘fallen between two stools’, by not conforming to either clinical trial or agricultural research traditions or skillset camps, and with little of the rigour of either. The solution is a return to the basics for statistical educators and consultants: statistical training explicitly tailored to non-statistician investigators, and coverage of statistical issues and topics relevant to preclinical research. In particular, I urge a change in focus from statistics as ‘just maths’ to statistics as process. I argue that reform of introductory statistics curricula along these lines could go far to reverse statistical pathologies common to much of the preclinical research literature.
Two stools of competing traditions
The clinical trial and agricultural/industrial research traditions show considerable divergence in focus and methodology. Clinical trials are performed when there is uncertainty regarding relative efficacy of a specific clinical intervention. They are constrained by the necessity to minimize subject risk of mortality and severe adverse events. In general, clinical trials tend to be relatively large and simple, with only two or a few comparator interventions randomly assigned to many subjects, ideally representative of the target population. Although clinical trials have a history going back several hundred years, the randomized controlled trial (RCT) as the gold standard was a relatively recent development, with the first modern RCT performed in 1946 [15, 16], and formalisation only in the late 1970s. Lagging implementation was due in part to resistance to the so-called “numerical approach” by supporters of the non-randomised “let's-try-it-and-see” attitude to clinical research problems [17, 18]. Meanwhile, methodology for observational studies was being developed in parallel. Cohort studies in particular have had a key role in epidemiological investigations of carcinogenic and environmental hazards when RCTs are not feasible. Because factors are not randomly assigned to subjects, inferring causality requires stringent methodological safeguards for minimising confounding and bias [15, 20, 21].
In contrast, agricultural/industrial designs are characterised by small sample sizes and multiple factors studied simultaneously. In addition to randomisation, key design features include replication and blocking (‘local control’), coupled with formal statistically-structured arrangements of input variables, such as randomized complete block and factorial designs. Agricultural designs were developed primarily by Sir Ronald Fisher in the first half of the twentieth century. These principles were subsequently extended to industrial experimentation by George Box and collaborators. Industrial experiments are further distinguished by sequential implementation (data from a small or restricted group of runs in the original experiment can be used to inform the next experiment), with prompt feedback (immediacy), allowing iteration and relatively rapid convergence to target solutions. For these applications, variable screening and model building are both of interest, and ‘design’ is essentially the imposition of a statistical model as a useful approximation to the response of interest [23, 25].
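As a minimal sketch of the randomised complete block arrangement mentioned above, the following Python fragment assigns every treatment exactly once within each block, in an independently shuffled order. The block labels (litters), the treatment names, the seed, and the function name are hypothetical choices made for illustration only; they are not taken from any cited source.

```python
import random

def randomised_complete_block(blocks, treatments, seed=None):
    """Return {block: [treatments in random order]}.

    Every treatment appears exactly once in every block, and the
    order is randomised independently within each block.
    """
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    allocation = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)     # formal randomisation within the block
        allocation[block] = order
    return allocation

# Hypothetical example: 4 litters act as blocks, 3 treatments per litter.
blocks = ["litter_1", "litter_2", "litter_3", "litter_4"]
treatments = ["vehicle", "low_dose", "high_dose"]
design = randomised_complete_block(blocks, treatments, seed=2022)

for block, order in design.items():
    print(block, order)
```

Because each block contains the full set of treatments, block-to-block differences (here, between litters) can be separated from treatment differences at the analysis stage. A factorial arrangement could be handled the same way by passing every combination of factor levels as the list of treatments.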
Preclinical studies: between the stools
Animal-based research studies are unique for the explicit ethical obligation to minimise the numbers of animals used. Application of the Three Rs (Replacement, Reduction, Refinement) principles is based on the premise that maximum scientific value should be obtained with minimal harms. However, over-emphasis on numbers reduction has contributed to underpowered experiments generating unreliable, and ultimately noninformative, results [27, 28].
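To make the link between sample size and power concrete, the sketch below uses the standard normal-approximation formula for comparing two group means, n per group ≈ 2(z_{1−α/2} + z_{1−β})² / d², where d is the standardised effect size. The function names and the example values of d, α, and power are illustrative assumptions, and the approximation slightly understates the sample size an exact t-test calculation would give, so it is a rough planning aid rather than a substitute for a formal calculation.

```python
import math
from statistics import NormalDist

def approx_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided, two-group comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = z.inv_cdf(power)            # quantile corresponding to the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power for a given group size (ignores the negligible opposite tail)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect_size * math.sqrt(n_per_group / 2) - z_alpha)

# Hypothetical planning values: standardised effect sizes of 1.0 and 0.5.
print(approx_n_per_group(1.0))   # 16 per group
print(approx_n_per_group(0.5))   # 63 per group
print(round(approx_power(5, 1.0), 2))  # power with only 5 animals per group
```

Under these assumptions, five animals per group give well under 50% power even for a large effect of one standard deviation, which is the sense in which reducing numbers alone can render an experiment noninformative.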
Small sample sizes, large variability, multi-group comparisons, and the exploratory nature of much preclinical research suggest that study designs should be more aligned with the agricultural/industrial tradition. Fisher-type designs (such as randomised complete blocks and factorials) are suitable for purpose and have been vigorously promoted [12, 29, 30, 31, 32, 33], as have procedural methods for controlling variation without increasing sample size, and design features that increase validity [1, 35]. However, these methods seem to be virtually unknown in the preclinical literature [7, 8, 36, 37, 38]. Two-group comparisons more typical of clinical trials are common, although unsuited to assessing multiple factors with interactions. Informal examination of introductory textbooks and statistics course syllabi suggests that knowledge gaps are due in part to sparse formal training in experimental design, and neglect of analytical methods more suited to preclinical data. Compounding these problems is a lack of general statistical oversight. Unlike human-based studies, few animal research oversight committees in the USA have access to properly qualified biostatisticians, statistical analysis plans and study preregistration are not required, and protocol review criteria vary considerably between institutions.
Statistical pathologies in the preclinical literature
Bad statistical practices are very deeply entrenched in the preclinical literature. Many of the major errors observed in the research literature involve statistical basics [41, 42, 43]. Statistics service courses tend to emphasise mathematical aspects of probability and null hypothesis significance testing at the expense of non-mathematical components of statistical process [44, 45, 46]. Consequently, it is now part of the belief system of many investigators that ‘statistical significance (P < 0.05)’ is the major criterion for assessing biological importance of results, and that P-values are an intrinsic property of the biological event or group of animals being studied. As a result, there is over-reliance on rote hypothesis testing and P-values to interpret results. Related pathologies include reporting of orphan inexact P-values with no context, P-hacking, N-hacking, selective reporting, and spin [41, 48].
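A small simulation can illustrate why a P-value is a property of a particular sample rather than of the biology. The sketch below repeats the same hypothetical two-group experiment many times with an identical true effect and records the P-value from each run. For simplicity it assumes normally distributed outcomes with a known standard deviation of 1 and uses a two-sided z-test; the group size, effect size, number of replicates, and seed are arbitrary illustrative values.

```python
import math
import random
from statistics import NormalDist, mean

def simulate_p_values(n_per_group=10, true_effect=0.5, n_replicates=1000, seed=1):
    """Two-sided z-test P-values from repeated runs of the same two-group experiment.

    Assumes normal outcomes with known SD = 1 (a deliberate simplification).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = math.sqrt(2 / n_per_group)  # standard error of the difference in means
    p_values = []
    for _ in range(n_replicates):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

p_values = simulate_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"smallest P = {min(p_values):.4f}, largest P = {max(p_values):.4f}")
print(f"share of runs with P < 0.05: {share_significant:.2f}")
```

Although the underlying effect never changes, the P-values range from very small to very large across runs, and only a minority fall below 0.05 at this sample size. A single P-value from one small experiment therefore says little on its own about biological importance.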
A second problem area is poor understanding by investigators of basic statistical concepts and operational definitions. Statistical terms are frequently conflated with lay meanings, confused with other technical definitions, or ignored. Concepts that seem especially misunderstood include ‘study design’, ‘randomisation’, ‘cohort’, ‘unit of analysis’, and ‘replication’. To investigators, ‘study design’ refers primarily to descriptions of technical methodology and materials. To applied statisticians, ‘study design’ is the formal arrangement and structuring of independent or predictor variables hypothesized to affect the response or outcome of interest. A good study design maximizes the experimental signal by accounting for diverse sources of variability [31, 50, 51], and incorporates specific design features to ensure results are reliable and valid, such as correct specification of the unit of analysis, relevant outcome measures, inclusion and exclusion criteria, and bias minimization methods [8, 35, 52]. ‘Randomisation’ to statisticians is a formal probabilistic process that minimizes selection bias and the effect of latent confounders, and is the cornerstone for statistical inference. In contrast, randomisation in preclinical studies seems to be frequently misinterpreted in the lay sense of ‘unplanned’ or ‘haphazard’, or is likely not performed at all [8, 38, 54, 55]. The common habit of referring to a group of animals subjected to a given treatment or intervention as a ‘cohort’ likely reflects non-random allocation of subjects to a defined intervention group, an invalid and confounded assignment strategy. The term ‘cohort’ actually refers to groups of subjects in observational studies, where group membership is defined by some common characteristic. It does not refer to experimental treatment groups with group allocation determined by randomisation.
The meaning of ‘unit of analysis’ is virtually unknown, or confused with biological and observational units [56, 57, 58]. ‘Replication’ is frequently interpreted solely as duplication of the total sample size for ‘reproducibility’, rather than as an independent repeat run of each combination of treatment factors.
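The distinction between the unit of analysis and the observational unit can be sketched with a small example. It assumes a hypothetical study in which the treatment is applied to whole cages (for instance through diet), so the cage, not the individual animal, is the unit that was independently assigned to treatment. The cage labels, treatment names, and body-weight values are invented purely for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (cage, treatment, body weight in g); 3 animals per cage.
records = [
    ("cage_A", "control", 24.1), ("cage_A", "control", 25.0), ("cage_A", "control", 23.8),
    ("cage_B", "control", 26.2), ("cage_B", "control", 25.7), ("cage_B", "control", 26.9),
    ("cage_C", "treated", 22.4), ("cage_C", "treated", 21.9), ("cage_C", "treated", 23.0),
    ("cage_D", "treated", 24.0), ("cage_D", "treated", 23.1), ("cage_D", "treated", 23.6),
]

def cage_means(records):
    """Collapse animal-level observations to one summary value per cage."""
    values_by_cage = defaultdict(list)
    treatment_of = {}
    for cage, treatment, value in records:
        values_by_cage[cage].append(value)
        treatment_of[cage] = treatment
    return [(cage, treatment_of[cage], mean(values))
            for cage, values in values_by_cage.items()]

units = cage_means(records)

# The sample size per treatment is the number of cages, not the number of animals.
n_per_treatment = defaultdict(int)
for _cage, treatment, _mean_weight in units:
    n_per_treatment[treatment] += 1

print(f"observations: {len(records)}, units of analysis: {len(units)}")
print(dict(n_per_treatment))
```

Treating all twelve animals as independent would overstate the effective sample size, since under this assumed design there are only two independent units per treatment. Summarising to cage means is one simple option; a mixed model with cage as a random effect is another.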
A third area of concern is that the conventional statistical arsenal of t-tests, ANOVA, and χ2 tests [60, 61] is unsuited for analysing ‘problem’ data typical of many preclinical studies. ‘Problem’ data include non-gaussian, correlated (clustered, nested, time dependencies), or non-linear data; data that are missing at random or due to dropout or attrition; data characterised by over-representation of true zeros; and high-dimensional data. A major deficiency that must be addressed is the focus of introductory courses on methods virtually unchanged since the 1950s, with little coverage of modern methods more appropriate for such data [8, 35, 44].
Finally, there is little attention paid to methods for identifying diverse sources of variation during experiment planning. Research papers rarely report auxiliary variables and conditions related to animal signalment, environment, and procedures only indirectly related to the main experiments. Such variables contribute to unpredictable effects on animals and experimental results, resulting in uncontrolled variation that obscures true treatment effects. For example, systematic investigations of factors contributing to survival time in mouse models of amyotrophic lateral sclerosis suggested that claims for therapeutic efficacy were most likely due to the effects of uncontrolled variation rather than actual drug effects [12, 29, 33].
Lack of knowledge on the part of investigators is related to training deficiencies on the part of statistics educators. The solution is a return to the basics: statistical education that meets the needs of non-statistician investigators, and curricula addressing design and data issues specific to preclinical research. This is hardly new: in 1954, John Tukey identified as essential that “statistical methods should be tailored to the real needs of the user”, and this has been repeated in the decades since [9, 44, 46, 64, 65]. Investigators still identify better training in statistics and statistical methods as a high priority [9, 64]. The June 2021 report by the Advisory Committee to the Director of the National Institutes of Health (NIH-ACD) made five major recommendations to improve rigor and reproducibility of animal-based research, among which was recognition of the need for “modern and innovative statistics curricula relevant to animal researchers”.
What do researchers need? The poor internal validity characterising much preclinical research reflects poor understanding of the upstream basics of statistically-based study design and data sampling strategies. Unreliable downstream results cannot be rescued by fancy analyses after the fact, as Fisher himself warned. Therefore, the concept that good statistical principles must be built in during planning, before data are collected, must be introduced and reinforced. This can be accomplished first, by more appropriate training of entry-level researchers with courses and topic coverage more attuned to specific need, and second, by removal of longstanding barriers (such as cost and academic credit) to early consultation with appropriately trained statisticians. Early formal involvement of applied statisticians in the planning process must be encouraged and rewarded [9, 68].
Statistical educators and consultants must be re-educated to better address actual research needs. ‘Statistics’ is neither just maths nor an analytical frill tacked on to a study after data have been collected. Instead, statisticians must structure instructional materials to reflect the basic tenets of statistical process: design before inference, and data quality before analysis. Data curation skills are also part of good statistical practice, identified as such for nearly a century. These practices are not strongly mathematical, and unfortunately statisticians tend to be uninterested in non-mathematical procedures [46, 71]. Second, service courses must shift away from pedagogical approaches common to applied maths or algebra, where uncritical analysis of a data set leads to a fixed ‘correct’ solution [46, 71, 72]. Procedural change could be accelerated by statisticians becoming more aware of best-practice expectations through evidence-based planning and reporting guidelines. These tools can direct early-stage study planning to ensure that procedures strengthening study validity can be incorporated [4, 35, 74, 75].
Properly designed and analysed experiments are an ethical issue [28, 66, 69]. Shifting the focus of statistical education from rote hypothesis testing to sound methodology should ultimately reduce the numbers of animals wasted in noninformative experiments and increase overall scientific quality and value of published research.
Abbreviations
3Rs: Replacement, Reduction, Refinement
NIH-ACD: Advisory Committee to the Director of the National Institutes of Health
RCT: Randomised controlled trial
References
Bailoo JD, Reichlin TS, Würbel H. Refinement of experimental design and conduct in laboratory animal research. ILAR J. 2014;55(3):383–91.
Lowenstein PR, Castro MG. Uncertainty in the translation of preclinical experiments to clinical trials. Why do most phase III clinical trials fail? Curr Gene Ther. 2009;9(5):368–74.
McGonigle P, Ruggeri B. Animal models of human disease: challenges in enabling translation. Biochem Pharmacol. 2014;87:162–71.
van der Worp HB, Sandercock PAG. Improving the process of translational research. BMJ. 2012;345: e7837.
Errington TM, Denis A, Allison AB, Araiza R, Aza-Blanc P, Bower LR, Campos J, Chu H, Denson S, Dionham C, et al. Experiments from unfinished registered reports in the reproducibility project: cancer biology. Elife. 2021;10: e73430.
Freedman LP, Cockburn IM, Simcoe TS. The economics of reproducibility in preclinical research. PLoS Biol. 2015;13: e1002165.
Macleod MR. Why animal research needs to improve. Nature. 2011;477:511.
Macleod MR, Lawson McLean A, Kyriakopoulou A, Serghiou S, de Wilde A, Sherratt N, Hirst T, Hemblade R, Bahor Z, Nunes-Fonseca C, et al. Risk of bias in reports of in vivo research: a focus for improvement. PLOS Biol. 2015;13(11): e1002301.
Wold B, Tabak LA, Advisory Committee to the Director. ACD working group on enhancing rigor, transparency, and translatability in animal. Washington, DC: Department of Health and Human Services; 2021.
Van Calster B, Wynants L, Riley RD, van Smeden M, Collins GS. Methodology over metrics: current scientific standards are a disservice to patients and society. J Clinical Epidemiol. 2021;138:219–26.
Ledford H. 4 ways to fix the clinical trial. Nature. 2011;477:526–8.
Perrin S. Make mouse studies work. Nature. 2014;507:423–5.
Macleod M. Learning lessons from MVA85A, a failed booster vaccine for BCG. BMJ. 2018;360: k66.
Collier R. Legumes, lemons and streptomycin: a short history of the clinical trial. CMAJ. 2009;180:23–4.
Doll R. Sir Austin Bradford Hill and the progress of medical science. BMJ. 1992;305:1521–6.
Hart PD. A change in scientific approach: from alternation to randomised allocation in clinical trials in the 1940s. BMJ. 1999;319:572–3.
Peto R. Reflections on the design and analysis of clinical trials and meta-analyses in the 1970s and 1980s. J R Soc Med. 2019;112(2):78–80.
Silverman WA. Personal reflections on lessons learned from randomized trials involving newborn infants from 1951 to 1967. Clin Trials. 2004;1:179–84.
Breslow NE, Day NE. The role of cohort studies in cancer epidemiology. In: Breslow NE, Day NE, editors. Statistical methods in cancer research. Volume II—the design and analysis of cohort studies. Lyon: IARC Scientific Publications; 1987.
Armitage P. Before and after Bradford Hill: some trends in medical statistics. J R Stat Soc A Stat Soc. 1995;158(1):143–53.
Hill AB. The environment and disease: association or causation? Proc R Soc Med. 1965;58:295–300.
Street DJ. Fisher’s contributions to agricultural statistics. Biometrics. 1990;46(4):937–45.
Box GEP, Draper NR. Empirical model-building and response surfaces. New York: Wiley; 1987.
Box GEP. Statistics as a catalyst to learning by scientific method part II—a discussion. J Qual Technol. 1999;31(1):16–29.
Montgomery DC. Design and analysis of experiments. 8th ed. London: Wiley; 2013.
Russell WMS, Burch RL. The principles of humane experimental technique. London: Methuen; 1959.
Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14:365–76.
Parker RMA, Browne WJ. The place of experimental design and statistics in the 3Rs. ILAR J. 2014;55(3):477–85.
Editorial. The ‘3Is’ of animal experimentation. Nat Genetics. 2012;44(6):611.
Festing MFW. Randomized block experimental designs can increase the power and reproducibility of laboratory animal experiments. ILAR J. 2014;55:472–6.
Festing MFW, Altman DG. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J. 2002;43(4):244–58.
Karp NA, Fry D. What is the optimum design for my animal experiment? BMJ Open Sci. 2021;5: e100126.
Scott S, Kranz JE, Cole J, Lincecum JM, Thompson K, Kelly N, Bostrom A, Theodoss J, Al-Nakhala BM, Viera FG, et al. Design, power, and interpretation of studies in the standard murine model of ALS. Amyotroph Lateral Scler. 2008;9:4–15.
Lazic SE. Four simple ways to increase power without increasing the sample size. Lab Anim. 2018;52:621–9.
Muhlhausler BS, Bloomfield FH, Gillman MW. Whole animal experiments should be more like human randomized controlled trials. PLoS Biol. 2013;11(2): e1001481.
Errington TM, Denis A, Perfito N, Iorns E, Nosek BA. Challenges for assessing replicability in preclinical cancer biology. Elife. 2021;10: e67995.
Macleod MR, Mohan S. Reproducibility and rigor in animal-based research. ILAR J. 2020;60:17–23.
Kilkenny C, Parsons N, Kadyszewski E, Festing MF, Cuthill IC, Fry D, Hutton J, Altman DG. Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS ONE. 2009;4(11): e0007824.
Gaur A, Merz-Nideroest B, Zobel A. Clinical trials, good clinical practice, regulations, and compliance. Regul Focus Quart. 2021;1(1):15–31.
Silverman J, Macy J, Preisig P. The role of the IACUC in ensuring research reproducibility. Lab Anim (NY). 2017;46(4):129–35.
Diong J, Butler AA, Gandevia SC, Héroux ME. Poor statistical reporting, inadequate data presentation and spin persist despite editorial advice. PLoS ONE. 2018;13(8): e0202121.
Lang TA, Altman DG. Basic statistical reporting for articles published in clinical medical journals: the SAMPL Guidelines. In: Smart P, Maisonneuve H, Polderman AKS, editors. Science editors’ handbook. Paris: European Association of Science Editors; 2013.
Makin TR, Orban de Xivry J-J. Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. Elife. 2019;8: e48175.
Preece DA. The design and analysis of experiments: what has gone wrong? Util Mathematica. 1982;21:201–44.
Preece DA. Illustrative examples: illustrative of what? J Roy Stat Soc Ser D. 1986;35(1):33–44.
Preece DA. Good statistical practice. J Roy Stat Soc Ser D. 1987;36(4):397–408.
Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–50.
Nuzzo R. Statistical errors. Nature. 2014;506:150–2.
Marcus E. A STAR is born. Cell. 2016;166:1059–60.
Altman DG. Practical statistics for medical research. London: Chapman & Hall; 1991.
Karp NA. Reproducible preclinical research—is embracing variability the answer? PLoS Biol. 2018;16(3): e2005413.
Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG. Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biol. 2010;8(6): e1000412.
Altman DG, Bland JM. Treatment allocation in controlled trials: why randomise? BMJ. 1999;318:1209.
Hirst JA, Howick J, Aronson JK, Roberts N, Perera R, Koshiaris C, Heneghan C. The need for randomization in animal trials: an overview of systematic reviews. PLoS ONE. 2014;9: e98856.
Reynolds PS, Garvan CW. Gap analysis of animal-based hemorrhage control research. “Houses of brick or mansions of straw?” Military Med. 2020;185:85–95.
Festing MFW. The “completely randomised” and the “randomised block” are the only experimental designs suitable for widespread use in pre-clinical research. Sci Rep. 2020;10:17577.
Lazic SE, Clarke-Williams CJ, Munafò MR. What exactly is “N” in cell culture and animal experiments? PLoS Biol. 2018;16: e2005282.
Parsons NR, Teare MD, Sitch AJ. Unit of analysis issues in laboratory-based research. eLife. 2018;7: e32486.
Frommlet F, Heinze G. Experimental replications in animal trials. Lab Anim. 2021;55(1):65–75.
Bolt T, Nomi JS, Bzdok D, Uddin L. Educating the future generation of researchers: A cross-disciplinary survey of trends in analysis methods. PLoS Biol. 2021;19(7): e3001313.
Gosselin RD. Insufficient transparency of statistical reporting in preclinical research: a scoping review. Sci Rep. 2021;11:3335.
Nevalainen T. Animal husbandry and experimental design. ILAR J. 2014;55(3):392–8.
Tukey JW. Unsolved problems of experimental statistics. J Am Stat Assoc. 1954;49:706–31.
Baker M. Is there a reproducibility crisis? Nature. 2016;533:452–4.
Brown AW, Kaiser KA, Allison DB. Issues with data and analyses: errors, underlying themes, and potential solutions. Proc Natl Acad Sci. 2018;115(11):2563–70.
Sena ES, Currie GL. How our approaches to assessing benefits and harms can be improved. Anim Welf. 2019;28:107–15.
Fisher RA. Presidential address to the first Indian statistical congress. Sankhya. 1938;4:14–7.
Sprent P. Some problems of statistical consultancy. J Roy Stat Soc Ser A. 1970;133(2):139–65.
Altman DG. Statistics and ethics in medical research: misuse of statistics is unethical. BMJ. 1980;281:1182–4.
Dunn HL. Application of statistical methods in physiology. Physiol Rev. 1929;9(2):275–398.
Preece DA. Discussion on the papers on ‘statistics and mathematics’. J Roy Stat Soc Ser D. 1998;47(2):274.
Preece DA. Biometry in the third world: science not ritual. Biometrics. 1984;40(2):519–23.
Smith AJ, Clutton RE, Lilley E, Hansen KEA, Brattelid T. PREPARE: guidelines for planning animal research and testing. Lab Anim. 2017;52(2):135–41.
Percie du Sert N, Hurst V, Ahluwalia A, Alam S, Avey MT, Baker M, Browne W, Clark A, Cuthill IC, Dirnagl U, et al. The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biol. 2020;18(7): e3000410.
Altman DG, Simera I. Using reporting guidelines effectively to ensure good reporting of health research. In: Moher D, Altman DG, Schulz KF, Simera I, Wager E, editors. Guidelines for reporting health research: a user’s manual. Chichester: Wiley; 2014. p. 32–40.
Many thanks to Dr Tamara Hughes and three anonymous reviewers for useful suggestions that greatly improved the manuscript.
None to declare.
PSR was a member of the ARRIVE 2.0 International Working Group.
Reynolds, P.S. Between two stools: preclinical research, reproducibility, and statistical design of experiments. BMC Res Notes 15, 73 (2022). https://doi.org/10.1186/s13104-022-05965-w