Failure of a numerical quality assessment scale to identify potential risk of bias in a systematic review: a comparison study

Background
Assessing methodological quality of primary studies is an essential component of systematic reviews. Following a systematic review which used a domain-based system [United States Preventative Services Task Force (USPSTF)] to assess methodological quality, a commonly used numerical rating scale (Downs and Black) was also used to evaluate the included studies, and comparisons were made between quality ratings assigned using the two different methods. Both tools were used to assess the 20 randomized and quasi-randomized controlled trials examining an exercise intervention for chronic musculoskeletal pain which were included in the review. Inter-rater reliability and levels of agreement were determined using intraclass correlation coefficients (ICC). The influence of quality on pooled effect size was examined by calculating the between-group standardized mean difference (SMD).

Results
Inter-rater reliability indicated at least substantial levels of agreement for the USPSTF system (ICC 0.85; 95% CI 0.66, 0.94) and the Downs and Black scale (ICC 0.94; 95% CI 0.84, 0.97). The overall level of agreement between tools (ICC 0.80; 95% CI 0.57, 0.92) was also good. However, the USPSTF system identified a number of studies (n = 3/20) as "poor" due to potential risks of bias. Analysis revealed substantially greater pooled effect sizes in these studies (SMD −2.51; 95% CI −4.21, −0.82) compared to those rated as "fair" (SMD −0.45; 95% CI −0.65, −0.25) or "good" (SMD −0.38; 95% CI −0.69, −0.08).

Conclusions
In this example, use of a numerical rating scale failed to identify studies at increased risk of bias and could potentially have led to imprecise estimates of treatment effect. Although based on a small number of included studies within an existing systematic review, we found the domain-based system provided a more structured framework by which qualitative decisions concerning overall quality could be made, and was useful for detecting potential sources of bias in the available evidence.

Electronic supplementary material
The online version of this article (doi:10.1186/s13104-015-1181-1) contains supplementary material, which is available to authorized users.


Background
Systematic reviews are used to synthesize research evidence relating to the effectiveness of an intervention [1]. Conclusions of high quality reviews provide a basis on which clinicians and researchers can make evidence-based decisions and recommendations. Accurately assessing the methodological quality of included studies is therefore essential. Quality is a multidimensional concept representing the extent to which study design can minimise systematic and non-systematic bias, as well as inferential error [2,3].
There are numerous instruments available for assessing quality of evidence and there remains uncertainty over which are the most appropriate to use [4], and how they should be used to interpret results [5,6]. Use of different assessment methods can result in significant changes to the size and direction of pooled effect sizes [7-9] and it is therefore important to consider the properties of the assessment methods used.

Open Access
*Correspondence: s.oconnor@qub.ac.uk. 1 Centre for Public Health, Queen's University Belfast, Belfast, UK. Full list of author information is available at the end of the article.

Numerical summary scores may be of limited value in interpreting the results of meta-analyses [10]. However, these scales are widely used in the literature, possibly owing to their ease of use. Quality assessment based on non-numerical or domain-based rating systems [11-14] is increasingly used, particularly when also seeking to make treatment recommendations.
The primary aim of this study was to compare these two contrasting methods for assessing methodological quality of randomized and non-randomized studies included within a systematic review [15]. As part of the review the rating system proposed by the United States Preventative Services Task Force (USPSTF) [12, 13] was used to assess methodological quality and to allow for treatment recommendations to be made. We wished to compare this domain based rating system to a numerical scale to determine the potential influence of different approaches on treatment effect size within a review. We selected the rating scale proposed by Downs and Black [16] for comparison as it is one of the most commonly used and well validated numerical rating scales [17].
The study objectives were:

1. To determine the effect of quality ratings on pooled effect size for primary outcome data from the included studies.
2. To determine inter-rater reliability and level of agreement between tools when examining separate components of internal and external validity, as well as overall ratings assigned to each paper.

Details of each quality assessment tool

Downs and Black scale
The Downs and Black scale consists of 27 questions relating to quality of reporting (10 questions), external validity (3 questions), internal validity (bias and confounding; 13 questions), and statistical power (1 question) (Additional file 1: Table S1). It has been shown to have high internal consistency for the total score assigned (Kuder-Richardson 20 test: 0.89) as well as for all subscales except external validity (0.54), with reliability of the subscales varying from "good" (bias) to "poor" (external validity) [16].
The original scale provides a total score out of 32 points, with one question in the reporting section carrying a possible two points, and the statistical power question carrying a possible five points. Previous studies have frequently employed a modified version by simplifying the power question and awarding a single point if a study had sufficient power to detect a clinically important effect, where the probability value for a difference being due to chance is <5% [18-20]. The modified version which we employed in this study therefore has a maximum score of 28. Each paper was assigned a grade of "excellent" (24-28 points), "good" (19-23 points), "fair" (14-18 points) or "poor" (<14 points).
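The modified scoring scheme described above maps a 28-point total onto an ordinal grade. As an illustrative sketch (the function name and structure are our own; the scale items themselves are not reproduced here), the banding could be expressed as:

```python
def downs_black_grade(score: int, max_score: int = 28) -> str:
    """Map a modified Downs and Black total (out of 28) to a quality grade.

    Grade bands follow the scheme described in the text:
    "excellent" 24-28, "good" 19-23, "fair" 14-18, "poor" <14.
    """
    if not 0 <= score <= max_score:
        raise ValueError(f"score must be between 0 and {max_score}")
    if score >= 24:
        return "excellent"
    if score >= 19:
        return "good"
    if score >= 14:
        return "fair"
    return "poor"
```

For example, a paper scoring 23 falls in the "good" band, while 13 falls in "poor".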

United States Preventative Services Task Force
In rating quality, the USPSTF system assigns individual studies a grade of "good", "fair", or "poor" for both internal and external validity. Assessment criteria are not used as rigid rules, but as guidelines, with exceptions made if there is adequate justification. In general, a "good" study meets all criteria for that study design; a "fair" study does not meet all criteria but is judged to have no serious flaw that may compromise results; and a "poor" study contains a potentially serious methodological flaw. Criteria for determining a serious flaw are dependent on study design but include: lack of adequate randomization or allocation concealment in randomized controlled trials; failure to maintain comparable groups or account for loss to follow-up; or lack of similarity between the study population and patients seen in clinical practice [12].

Quality assessment conducted using both tools
Twenty studies were included as part of an updated systematic review, conducted following the "preferred reporting items for systematic reviews and meta-analyses" (PRISMA) guidelines [21], which examined the effects of an exercise intervention for chronic musculoskeletal pain [15] (references for included studies are shown in Additional file 2). All reviewers had experience of conducting systematic reviews in the area and specific experience of using both measures. Reviewers were not blinded with regard to study authorship, institution, or journal of publication. Prior to assessment, reviewers met to establish standardized methods of scoring. Both methods were piloted on a sample of papers examining exercise interventions for an unrelated musculoskeletal condition.

Analysis
Inter-rater reliability was examined for the separate domains of internal and external validity, as well as for overall quality ratings. Agreement between reviewers before consensus, and agreement between tools, were determined using the intraclass correlation coefficient (ICC), based on a two-way mixed-model analysis of variance [ICC (2,k)] for absolute agreement, with 95% confidence intervals (95% CI). For the purposes of the analysis, when rating quality using the USPSTF system, the number of relevant criteria which were met according to the design of the individual study was used to assign a score out of 11. The Downs and Black scale was scored out of 28.
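The reliability statistic used here was computed in SPSS, but the underlying ICC(2,k) formula (absolute agreement, average of k raters, from two-way ANOVA mean squares, as in Shrout and Fleiss) can be sketched in a few lines of pure Python. This is an illustrative implementation, not the authors' code:

```python
def icc_2k(ratings):
    """ICC(2,k): two-way random/mixed model, absolute agreement, average measures.

    `ratings` is a list of rows (one per target, e.g. study); each row holds
    the k raters' (or tools') scores. Computed from ANOVA mean squares as
        ICC(2,k) = (MSR - MSE) / (MSR + (MSC - MSE) / n)
    where MSR, MSC, MSE are the row (target), column (rater), and error
    mean squares, and n is the number of targets.
    """
    n = len(ratings)            # number of targets (studies)
    k = len(ratings[0])         # number of raters (or tools)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)
```

Because the model assesses absolute agreement, two raters whose scores differ by a constant offset yield an ICC below 1, whereas identical score columns yield exactly 1.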
Scores were converted to a percentage (score for paper/total possible score × 100) in order to allow statistical comparisons to be made between tools. Criteria used to determine levels of agreement for ICCs were: <0.00, poor; 0.00-0.20, fair; 0.21-0.45, moderate; 0.46-0.75, substantial; and 0.76-1.0, almost perfect agreement [22]. All analyses were performed using SPSS version 20.0 (SPSS Inc., Chicago, IL, USA). The grading system for the Downs and Black scale was modified to allow comparisons to be made with the USPSTF system by collapsing the "excellent" and "good" ratings together. This meant both tools were used to assign a grade of "good", "fair", or "poor" to each study. The influence of methodological quality ("poor", "fair", or "good") on pooled effect size for pain data was determined using a random effects, inverse variance model to calculate the standardized mean difference (SMD) and 95% CI [Review Manager (RevMan), Version 5.0] [23].
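The random-effects, inverse-variance pooling was performed in RevMan; the standard DerSimonian-Laird estimator behind this kind of analysis can be sketched as follows. This is an illustrative implementation that assumes each study's SMD and its standard error are already known, and RevMan's exact computations may differ in detail:

```python
import math

def pool_random_effects(smds, ses):
    """DerSimonian-Laird random-effects pooled estimate with a 95% CI.

    `smds`: per-study standardized mean differences.
    `ses`:  their standard errors.
    Returns (pooled SMD, CI lower bound, CI upper bound).
    """
    w = [1 / se ** 2 for se in ses]                 # fixed-effect (inverse variance) weights
    fixed = sum(wi * y for wi, y in zip(w, smds)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, smds))   # Cochran's Q
    df = len(smds) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                   # between-study variance estimate
    w_star = [1 / (se ** 2 + tau2) for se in ses]   # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, smds)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    return pooled, pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
```

When the study estimates are homogeneous (Q below its degrees of freedom), tau² is truncated to zero and the result coincides with a fixed-effect, inverse-variance pooled estimate.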
There was at least a substantial level of agreement between the total scores assigned to each paper using both tools (ICC = 0.80; 95% CI = 0.57, 0.92), and overall quality ratings were the same for 14/20 studies (Table 1). However, the USPSTF system identified a small number of studies (n = 3/20) as "poor" which the Downs and Black scale did not. Analysis revealed substantially greater pooled effect sizes in these "poor" quality studies (SMD −2.51; 95% CI −4.21, −0.82) compared to those rated as "fair" (SMD −0.45; 95% CI −0.65, −0.25) or "good" (SMD −0.38; 95% CI −0.69, −0.08).

Table 1 Comparison of quality ratings assigned to each paper using the Downs and Black scale and the USPSTF system

Comparison between tools
This study examined the inter-rater reliability and level of agreement between two different approaches used to assess the methodological quality of randomized and non-randomized studies within a systematic review [15]. Both tools demonstrated good inter-rater reliability across the separate domains of internal and external validity, as well as for the final grade assigned to each paper. Although the two tools assigned markedly different weighting to the internal and external validity sections, agreement was also good for the final grades assigned. While overall analysis indicated a high level of agreement, the domain-based USPSTF system identified a number of the studies (3/20) as "poor" due to potential sources of bias. These studies were found to have substantially greater and less precise pooled effect sizes compared to those rated as "fair" or "good" using the USPSTF system (Figure 1).
In general, the USPSTF system was also found to be more conservative, with six of the 20 studies assigned a lower overall quality rating (Table 1). One possible reason for this finding is that the USPSTF system considers a number of potentially invalidating methodological flaws in its assessment. The Downs and Black scale, on the other hand, assigns each question a single point (except in one case, where a single question may be awarded two points). As a result, a study can contain a potentially serious flaw and still be rated as "fair" or "good" quality.
Since the USPSTF system gives equal weighting to external validity, this might have accounted for the differences. However, the reasons for studies being rated as "poor" generally related to issues of internal validity, such as inadequate allocation concealment in randomized controlled trials, or possible selection bias occurring due to unequal distribution of primary outcomes at baseline. Schulz and co-authors [24] suggest that allocation concealment is the element of quality associated with the greatest risk of bias. While the finding of greater effect sizes compared to those rated as "fair" or "good" was based on only three "poor" quality studies, others have reported similar findings using larger numbers of included studies [24-27].
The influence of other quality factors on effect size is less certain [5], and various issues apart from methodological quality may contribute to imprecise treatment effect sizes, including heterogeneity of study interventions or sample populations [28,29]. Although we included studies which were generally homogenous in terms of intervention type and sample population, it is uncertain whether differences in methodological quality alone would account for the variations in treatment effect observed in those studies rated as "poor".

Strengths and limitations
These results should be considered with a degree of caution given the relatively small number of included studies; assessing a larger number of heterogeneous studies would be required to provide more certain evidence in support of these findings. Despite this, the study provides further support for the contention that numerical summary scores should not be used for the assessment of methodological quality, or for determining cut-off criteria for study inclusion. In practical terms, within the specific example of a single systematic review [15], a commonly used numerical summary scale failed to identify the small number of included studies which contained important sources of potential bias according to the domain-based system. While we found a good level of reliability between independent assessments for both tools, it is acknowledged that this could be due to the pilot phase used to standardize scoring methods, and to the relatively small number of studies [30,31]. The conversion of domain-based USPSTF ratings to a numerical value for reliability assessment is also a limitation; however, this was done to allow comparison with the Downs and Black scale, and was considered to provide a more robust and sensitive measure than comparing ratings of "poor", "fair", or "good". A further limitation is that there is no gold standard against which quality assessment tools can be compared. The study also did not include a qualitative assessment of utility.
We selected the Downs and Black scale as it is one of the most widely used and well validated tools for assessment of both randomized and non-randomized studies [18]. However, in comparison to the USPSTF system, a number of limitations associated with its use were identified. In particular, the ability of the Downs and Black scale to differentiate studies containing potential sources of bias was limited in comparison to the USPSTF system.

Recommendations
Summary quality scales combine information on several methodological features in a single numerical value, whereas component or domain-based approaches examine key dimensions or outcomes individually [6, 12-14]. The use of summary scores from numerical rating scales for assessment of methodological quality has been called into question [4,8,32]. One issue is that they frequently incorporate items such as quality of reporting, ethical issues or statistical analysis techniques which are not directly related to quality or to potential bias [4]. This is an important distinction, since the inclusion of such items may be misleading and a study containing methodological bias, but which is well reported, can potentially still be rated as high quality. In particular, the practice of using numerical scores to identify trials of apparent low or high quality in a systematic review is not recommended [32].
Analysis of individual components of quality may overcome many of the shortcomings of composite scores. The component approach takes into account the importance of individual quality domains, and recognizes that the direction of potential bias varies between the contexts in which studies are performed [33]. Decisions relating to assessment of methodological quality when using domain-based rating systems are therefore dependent upon the particular research area under consideration, since important components relating to bias are not universal. The use of a standard set of quality components across all clinical areas is not recommended [5], and more specific guidance may be required when using these types of assessment tool [33,34]. Review authors should therefore remain cautious when using a domain-based system to assess methodological quality and formulate guideline recommendations.

Conclusions
Here we evaluated a domain-based rating system and demonstrated its ability to differentiate studies associated with potentially exaggerated treatment effects. Domain-based rating systems provide a structured framework by which studies can be assessed in a qualitative manner, allowing for the identification of potential sources of bias, first within the individual studies and then in the context of the available body of evidence under review. This is important because quality of evidence can vary across outcomes reported in the same study, and some outcomes may be more prone to bias than others. For example, bias due to lack of allocation concealment may be more likely for subjective outcomes, such as quality of life [29]. How best to account for any potential bias in the analysis remains an open question, but the current Cochrane guidelines [11] recommend examining studies containing potential methodological bias as a separate sub-category in a sensitivity analysis.