Evaluation of clinical practice guidelines using the AGREE instrument: comparison between data obtained from AGREE I and AGREE II

Objective: The Appraisal of Guidelines for Research and Evaluation (AGREE) is a representative, quantitative evaluation tool for evidence-based clinical practice guidelines (CPGs). Recently, AGREE was revised (AGREE II). The continuity of evaluation data obtained from the original version (AGREE I) has not yet been demonstrated. The present study investigated the relationship between data obtained from AGREE I and AGREE II to evaluate the continuity between the two measurement tools.
Results: An evaluation team consisting of three trained librarians evaluated 68 CPGs issued in 2011–2012 in Japan using AGREE I and AGREE II. The correlation coefficients for the six domains were: (1) scope and purpose 0.758; (2) stakeholder involvement 0.708; (3) rigor of development 0.982; (4) clarity of presentation 0.702; (5) applicability 0.919; and (6) editorial independence 0.971. The item “Overall Guideline Assessment” was newly introduced in AGREE II. This global item had a correlation coefficient of 0.628 using the six AGREE I domains, and 0.685 using the 23 items. Our results suggest that data obtained from AGREE I can be transferred to AGREE II, and the “Overall Guideline Assessment” data can be determined with high reliability using a standardized score of the 23 items.
Electronic supplementary material: The online version of this article (10.1186/s13104-017-3041-7) contains supplementary material, which is available to authorized users.


Introduction
Clinical practice guidelines (CPGs) are "statements that include recommendations intended to optimize patient care that are informed by a systematic review of evidence and an assessment of the benefits and harms of alternative care options" [1]. CPGs are a representative tool for standardizing medical interventions and improving healthcare quality. In Japan, CPG development using evidence-based medicine (EBM) began in the late 1990s with government support. Currently, 30-40 CPGs are developed per year, mainly by academic societies.
With the spread of CPGs in Japan, infrastructure to promote their use is also being developed. This includes clearing houses and standard manuals for developing CPGs. The Toho University Medical Media Center and the Medical Information Network Distribution Service Guideline Center of the Japan Council for Quality Health Care both operate CPG clearing houses [2,3].
The Appraisal of Guidelines for Research and Evaluation (AGREE) instrument, developed by the AGREE Enterprise, is a quantitative method for evaluating CPGs. The AGREE instrument specifies items that CPGs must satisfy, and is expected to facilitate cost-effective CPG development and improve CPG quality [4]. In CPGs using the AGREE I or AGREE II [8][9][10]. However, the continuity of the data obtained from AGREE I and AGREE II has not yet been demonstrated. AGREE I was widely used, and a large amount of associated data exists; the continuity and conversion of data between AGREE I and AGREE II must be investigated to make full use of the AGREE I data.
We investigated the continuity of AGREE I and AGREE II data, and the conversion method from AGREE I data to AGREE II data.

Methods
A team of three experienced librarians evaluated 68 EBM-based CPGs issued in Japan in 2011-2012 using AGREE I [11] and AGREE II [12]. Expert librarians checked the contents of each CPG and judged whether it had been prepared using EBM methodology. The librarians who evaluated the CPGs have knowledge of CPG preparation and experience using the AGREE tool. They conducted independent evaluations and did not adjust their results; the results were aggregated into standardized scores. Correlation coefficients were calculated for the domains and items of the two instruments.
AGREE I comprised one overall assessment item and six domains totaling 23 items: (1) scope and purpose; (2) stakeholder involvement; (3) rigor of development; (4) clarity of presentation; (5) applicability; and (6) editorial independence. Each item is rated on a 4-point Likert scale (1 = "Strongly Disagree" to 4 = "Strongly Agree"). A standardized score for each domain was calculated according to the following formula:
$$\text{standardized score} = \frac{\text{obtained score} - \text{minimum possible score}}{\text{maximum possible score} - \text{minimum possible score}} \times 100\%$$
For example, the scope and purpose domain consists of three items, each rated by three evaluators; the sum of the maximum possible score is 4 × 3 × 3 = 36, and the sum of the minimum possible score is 1 × 3 × 3 = 9.
AGREE II is based on AGREE I, incorporating four distinct changes. First, the rating scale was changed from a 4-point to a 7-point Likert scale (1 = "Strongly Disagree" to 7 = "Strongly Agree"). Second, an item was added as a second overall guideline assessment item: "Rate the overall quality of this guideline". Third, the wording or expression of several items was changed, although the meaning of the items was preserved. Finally, Q7 (AGREE I) "The guideline has been piloted among end users" was removed and incorporated into Q19 (AGREE II) "The guideline describes facilitators and barriers to its application", and a new item, Q9 (AGREE II) "The strengths and limitations of the body of evidence are clearly described", was added. Therefore, Q7 (AGREE I) and Q9 (AGREE II) were excluded from analysis in the present study. A comparison of AGREE I and AGREE II items is shown in Table 1.
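The standardized domain score, (obtained score − minimum possible score) / (maximum possible score − minimum possible score) × 100%, can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; the function name `standardized_score`, its parameters, and the sample ratings are assumptions introduced here.

```python
def standardized_score(ratings, scale_min=1, scale_max=4):
    """Return the standardized score (0-100) for one domain.

    ratings: one list per appraiser, each holding that appraiser's
             item ratings for the domain (hypothetical input layout).
    scale_min / scale_max: ends of the Likert scale
             (1 to 4 for AGREE I, 1 to 7 for AGREE II).
    """
    n_ratings = sum(len(per_appraiser) for per_appraiser in ratings)
    if n_ratings == 0:
        raise ValueError("at least one rating is required")
    obtained = sum(sum(per_appraiser) for per_appraiser in ratings)
    min_possible = scale_min * n_ratings  # every rating at the lowest point
    max_possible = scale_max * n_ratings  # every rating at the highest point
    return (obtained - min_possible) / (max_possible - min_possible) * 100.0


# Made-up ratings: three appraisers, three items (scope and purpose domain).
example = [[4, 3, 3], [3, 3, 2], [4, 4, 3]]
agree1_score = standardized_score(example)               # 4-point scale
agree2_score = standardized_score(example, scale_max=7)  # same ratings read on a 7-point scale
```

Because the score is rescaled to 0-100%, domain scores from the 4-point and 7-point versions sit on a common range, which is what makes a direct correlation between AGREE I and AGREE II scores meaningful.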
As there was no item in AGREE I that corresponded with the new overall guideline assessment item in AGREE II, we attempted to calculate this value using two approaches. First, we calculated the average of the standardized score using results of the six AGREE I domains. Second, we calculated the standardized score using the results of the 23 AGREE I items. We examined the correlation between "Overall Guideline Assessment" in the AGREE II and the results of the two approaches described above.
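The two approaches can be contrasted with a short Python sketch. It is only an illustration under assumptions: the function names (`overall_from_domains`, `overall_from_items`), the input layout, and the toy ratings are hypothetical and do not come from the study.

```python
from statistics import mean


def standardized_score(ratings, scale_min=1, scale_max=4):
    """Standardized score (0-100) from per-appraiser lists of item ratings."""
    n_ratings = sum(len(per_appraiser) for per_appraiser in ratings)
    obtained = sum(sum(per_appraiser) for per_appraiser in ratings)
    min_possible = scale_min * n_ratings
    max_possible = scale_max * n_ratings
    return (obtained - min_possible) / (max_possible - min_possible) * 100.0


def overall_from_domains(domain_ratings):
    """Approach 1: unweighted average of the domain standardized scores."""
    return mean(standardized_score(r) for r in domain_ratings.values())


def overall_from_items(domain_ratings):
    """Approach 2: one standardized score over all item ratings pooled together."""
    pooled = [
        rating
        for ratings in domain_ratings.values()
        for per_appraiser in ratings
        for rating in per_appraiser
    ]
    return standardized_score([pooled])


# Toy data: one appraiser, two domains of unequal size (made-up values).
toy = {
    "domain_a": [[4, 4]],        # 2 items, all rated 4
    "domain_b": [[1, 1, 1, 1]],  # 4 items, all rated 1
}
by_domain = overall_from_domains(toy)
by_item = overall_from_items(toy)
```

The two values differ whenever domains contain different numbers of items: averaging domain scores weights every domain equally, whereas pooling items weights every item equally, so larger domains such as rigor of development contribute more.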
We used t-tests to compare standardized scores, and calculated correlation coefficients for each AGREE I and AGREE II item and domain. p values < 0.05 were considered to indicate statistical significance. All analyses were performed using IBM SPSS Statistics for Windows, version 20.0 (IBM Corp., Armonk, NY).
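For readers without SPSS, the comparison can be approximated with the Python standard library. The sketch below rests on assumptions the text does not confirm: it uses Pearson's correlation coefficient and a paired t statistic, whereas the type of correlation and the t-test variant used in the study are not stated. The function names and the sample scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def paired_t_statistic(x, y):
    """t statistic for paired samples (assumed variant; n - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Made-up standardized scores (%) for four guidelines under each instrument.
agree1_scores = [50.0, 60.0, 70.0, 80.0]
agree2_scores = [55.0, 62.0, 75.0, 84.0]

r = pearson_r(agree1_scores, agree2_scores)
t = paired_t_statistic(agree1_scores, agree2_scores)
```

The t statistic still has to be compared with the t distribution to obtain a p value; that step is left out because the standard library has no t distribution function.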

Results
The results of the AGREE I and AGREE II evaluations are shown in Fig. 1. Correlation coefficients are shown in Table 2. High correlations were observed in all domains: scope and purpose = 0.758; stakeholder involvement = 0.756; rigor of development = 0.992; clarity of presentation = 0.865; applicability = 0.938; and editorial independence = 0.938. The correlation coefficients of each item ranged from 0.708 to 0.982.
Correlation coefficients for the 22 items ranged from 0.694 to 0.995; 16 items had a correlation coefficient of 0.9 or more, three items were 0.8-0.9, and three items were 0.6-0.8. A high overall correlation was observed for all items (Additional file 1: Table S1).
The newly introduced overall assessment item "Overall Guideline Assessment" (AGREE II) had to be estimated from AGREE I data. The average of the six AGREE I domain scores had a correlation coefficient of 0.628 with this item, whereas the standardized score of the 23 items had a correlation coefficient of 0.685, suggesting that the latter approach gives the more closely related value (Table 2).

Discussion
Since its publication in 2003, the AGREE instrument has been widely adopted and has produced a large amount of evaluation data. With the revision of the instrument, clarifying the relationship between data obtained from AGREE I and AGREE II, and converting AGREE I data to AGREE II data, have become high research priorities for time-trend analyses of CPG quality.
For the 68 CPGs issued in 2011-2012, our results demonstrated that AGREE I and AGREE II were highly correlated at both the domain and item levels, and the newly introduced overall rating item "Overall Guideline Assessment" could be calculated more precisely using the 23 AGREE I items, rather than domain-level data.
Increasing attention is being directed to safety and quality issues, and CPGs based on EBM are a representative method for standardizing and improving the quality and safety of healthcare procedures. The AGREE instrument is widely used to measure CPG quality. Our results suggest that, although the AGREE instrument has now been revised (AGREE II), it remains a highly consistent measurement tool, which enables long-term, comprehensive CPG evaluation. The Japanese government has promoted CPG preparation since 1996. Our study may help evaluate the underlying policy guidelines.

Conclusion
Data obtained from AGREE I can be transferred to AGREE II, and the data for "Overall Guideline Assessment" can be calculated with high reliability using a standardized score of the 23 items.