Inter-method reliability of paper surveys and computer assisted telephone interviews in a randomized controlled trial of yoga for low back pain
© Cerrada et al.; licensee BioMed Central Ltd. 2014
Received: 22 October 2013
Accepted: 2 April 2014
Published: 9 April 2014
Little is known about the reliability of different methods of survey administration in low back pain trials. This analysis was designed to determine the reliability of responses to self-administered paper surveys compared to computer assisted telephone interviews (CATI) for the primary outcomes of pain intensity and back-related function, and secondary outcomes of patient satisfaction, SF-36, and global improvement among participants enrolled in a study of yoga for chronic low back pain.
Pain intensity, back-related function, and both physical and mental health components of the SF-36 showed excellent reliability at all three time points; ICC scores ranged from 0.82 to 0.98. Pain medication use showed good reliability; kappa statistics ranged from 0.68 to 0.78. Patient satisfaction had moderate to excellent reliability; ICC scores ranged from 0.40 to 0.86. Global improvement showed poor reliability at 6 weeks (ICC = 0.24) and 12 weeks (ICC = 0.10).
CATI shows excellent reliability for primary outcomes and at least some secondary outcomes when compared to self-administered paper surveys in a low back pain yoga trial. Having two reliable options for data collection may be helpful to increase response rates for core outcomes in back pain trials.
ClinicalTrials.gov: NCT01761617. Date of trial registration: December 4, 2012.
Keywords: Survey methods, Reliability, Back pain, CATI
The self-administered paper survey is a traditional mode of survey administration and data collection in clinical trials. Self-administered paper surveys allow participants to control the pace and order of the questions and provide a level of privacy, which may encourage responders to answer sensitive questions more truthfully [2, 3]. Computer assisted telephone interviews (CATI) and other electronic methods of data collection are also common in survey research. They allow for skip logic patterns, immediate data entry, and predefined ranges for responses, all of which may improve data quality [4, 5]. CATI and other electronic methods may also help reduce missing responses to questions and boost overall participant response rates by serving as an alternate mode of data collection for targeting non-responders.
As methods of survey administration evolve, especially through electronic means, it is increasingly important to consider the inter-method reliability and quality of data collected by each method. The availability of multiple reliable methods of data collection would allow researchers to tailor their survey administration strategy to reach the most participants. While a number of studies have already compared different methods of survey administration, few have focused specifically on low back pain intensity, back-related function, pain medication usage, and health-related quality of life. The purpose of this study is to determine the reliability of responses to traditional self-administered paper surveys and CATI among participants enrolled in a study of yoga for chronic low back pain (LBP).
Study design and setting
This study was part of a larger study of 95 participants enrolled in a randomized dosing trial comparing 12 weeks of once-weekly yoga classes with twice-weekly yoga classes for chronic LBP. Findings from the parent trial suggest that once-weekly yoga classes, supplemented by home practice, are similarly effective to twice-weekly yoga classes for chronic LBP in a predominantly low income minority population.
Detailed methods of the parent study are described elsewhere. Briefly, eligibility requirements included being between 18 and 64 years old, having non-specific LBP lasting longer than 12 weeks, and having English proficiency sufficient to complete both paper and CATI surveys. Recruitment was targeted at community health centers affiliated with a large urban safety-net hospital in order to yield a predominantly minority study population. A 2x2 factorial design was used to generate four treatment groups: once-weekly yoga classes with paper surveys, once-weekly yoga classes with paper surveys and CATI, twice-weekly yoga classes with paper surveys, and twice-weekly yoga classes with paper surveys and CATI.
The Boston University Institutional Review Board and the participating community health centers’ research committees approved the study. Informed consent for the RCT outlined both the 12 week yoga intervention component and the CATI versus paper survey comparison component. All participants consented to both parts of the study.
All study participants completed baseline paper surveys in person at Boston Medical Center and subsequent paper surveys at six and twelve weeks at the community health center where they attended yoga classes. For practical reasons, including staffing constraints and participant burden, only 45 of the 95 participants enrolled in the larger study were randomized to complete a CATI after each paper survey. At each time point, unblinded study staff notified the 45 participants randomized to the CATI group that they would also complete a CATI version of each of their surveys. Staff members blinded to treatment allocation attempted to administer CATI surveys within 48 hours after the in-person paper survey. Blinded research staff conducted the CATI via StudyTRAX (ScienceTRAX, Macon, GA), a web-based electronic data capture system. StudyTRAX displayed questionnaire scripts for the interviewers and used pre-programmed skip logic to navigate through survey questions. Access to StudyTRAX was granted through unique user logins and passwords, and blinded staff members could not access treatment condition information. The phrasing of each telephone survey question was kept as similar as possible to that of the corresponding paper survey question. Participants were asked to respond to each question as accurately as possible rather than attempt to reproduce the answers from their previous paper survey.
Survey elements included those commonly used in back pain trials. The parent study had two primary outcomes: average low back pain intensity in the previous week on an 11 point numerical scale (0 = ‘no pain’ and 10 = ‘worst possible pain’) [11, 12] and back-related function via the modified Roland Morris Disability Questionnaire (RMDQ), a 23 item scale where higher scores indicate worse functional status [13, 14]. Secondary outcomes included pain medication use in the last week (yes/no); health-related quality of life measured by the SF-36; global improvement of back pain on a 7 point numerical scale (0 = ‘extremely worsened’, 3 = ‘no change’, 6 = ‘extremely improved’); and patient satisfaction (5 point Likert scale, 1 = ‘very satisfied’, 2 = ‘somewhat satisfied’, 3 = ‘not satisfied or dissatisfied’, 4 = ‘somewhat dissatisfied’, 5 = ‘very dissatisfied’).
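The scoring and range rules described above can be sketched in a few lines. This is a minimal illustration, not the study's data-entry code: the function names `score_rmdq` and `validate_pain_intensity` are hypothetical, and the sketch assumes each RMDQ item is recorded as a yes/no answer and that the total is the count of 'yes' answers.

```python
RMDQ_ITEM_COUNT = 23  # the modified RMDQ has 23 items


def score_rmdq(answers):
    """Return the RMDQ total: the number of items answered 'yes' (True).

    Higher totals indicate worse back-related function.
    """
    if len(answers) != RMDQ_ITEM_COUNT:
        raise ValueError(f"expected {RMDQ_ITEM_COUNT} answers, got {len(answers)}")
    return sum(1 for answer in answers if answer)


def validate_pain_intensity(score):
    """Accept only whole numbers on the 0-10 scale (0 = no pain, 10 = worst possible pain)."""
    if not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError(f"pain intensity must be an integer from 0 to 10, got {score!r}")
    return score


if __name__ == "__main__":
    example_answers = [True] * 9 + [False] * 14  # made-up responses
    print(score_rmdq(example_answers))
    print(validate_pain_intensity(6))
```

A predefined range check of this kind is the sort of safeguard that electronic data capture can apply at entry time, whereas paper responses can only be checked afterwards.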
Participants’ responses from their paper surveys were entered twice by different blinded study staff and compared to verify accuracy of data entry. To measure reliability between paper and CATI data collection methods, we calculated intraclass correlation coefficients (ICC) for continuous measures and kappa statistics for categorical variables (i.e., pain medication use). Only complete pairs of paper and phone responses for each measure at each time point were included in the reliability analyses. Weighted averages were calculated to determine an average ICC or kappa score for each outcome across all three time points. Means and standard deviations for primary outcomes collected by paper-only and CATI-only were calculated for the 45 participants at each time point using all available data.
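For readers who want to see how such statistics are computed, the following is a minimal sketch under stated assumptions. The text does not specify which ICC form or which weights were used, so the sketch assumes a one-way random-effects ICC for two measurements per participant, unweighted Cohen's kappa, and weights equal to the number of complete pairs at each time point. All function names are hypothetical, and the example data are invented.

```python
from statistics import mean


def icc_oneway(pairs):
    """One-way random-effects ICC for (paper, cati) pairs of continuous scores."""
    n = len(pairs)
    k = 2  # two measurements per participant: paper and CATI
    grand_mean = mean(value for pair in pairs for value in pair)
    subject_means = [mean(pair) for pair in pairs]
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (value - m) ** 2 for pair, m in zip(pairs, subject_means) for value in pair
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def cohen_kappa(pairs):
    """Unweighted Cohen's kappa for (paper, cati) pairs of categorical answers."""
    n = len(pairs)
    categories = {value for pair in pairs for value in pair}
    observed = sum(a == b for a, b in pairs) / n
    # Chance agreement from the marginal proportions of each method.
    expected = sum(
        (sum(a == c for a, _ in pairs) / n) * (sum(b == c for _, b in pairs) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)


def weighted_average(estimates, weights):
    """Average estimates across time points, weighting by (assumed) pair counts."""
    return sum(e * w for e, w in zip(estimates, weights)) / sum(weights)


if __name__ == "__main__":
    pain_pairs = [(5, 5), (7, 6), (3, 3), (8, 8), (2, 3)]  # invented scores
    med_pairs = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "no")]
    print(icc_oneway(pain_pairs))
    print(cohen_kappa(med_pairs))
    print(weighted_average([0.80, 0.90], [10, 30]))
```

Other ICC forms (for example, two-way models) can give different values for the same data, so the choice of form should be reported alongside the estimate.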
Characteristics of 45 adults with chronic low back pain randomized to complete both paper surveys and computer assisted telephone interviews*
Paper and CATI (n = 45)
Mean age, years (SD)
High school or less
Beyond high school
Employed full or part time
Back Pain History
Duration of LBP
Mean days of LBP in the last 3 months (SD)
Mean hours/day of LBP (SD)
Mean days cut back activities due to LBP in the last 4 weeks (SD)
Response rates for different survey administration methods by time period
Method of survey administration
6 weeks n(%)
12 weeks n(%)
Paper (n = 95)
CATI (n = 45)
Both (n = 45)
Reliability of CATI vs. paper survey administration*
LBP intensity in the previous week
Pain medication use in last week1
Global improvement of back pain
SF-36 Mental Component Summary
SF-36 Physical Component Summary
Post-hoc analysis suggests that some participants with discordant responses to global improvement between data collection methods may have misinterpreted the values of the Likert scales. For example, two participants reported ‘extremely worsened’ for global improvement on paper and ‘extremely improved’ on the CATI at 12 weeks. When these two discordant responses were removed from the analysis, the ICC for global improvement increased dramatically from 0.10 to 0.61 at 12 weeks.
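The sensitivity of the ICC to a small number of reversed responses can be illustrated with synthetic data. The sketch below does not use the study's data: the response pairs are invented, the helper `icc_oneway` is hypothetical, and a one-way random-effects ICC is assumed because the text does not state which form was used.

```python
from statistics import mean


def icc_oneway(pairs):
    """One-way random-effects ICC for (paper, cati) pairs; assumed form."""
    n = len(pairs)
    k = 2
    grand_mean = mean(value for pair in pairs for value in pair)
    subject_means = [mean(pair) for pair in pairs]
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (value - m) ** 2 for pair, m in zip(pairs, subject_means) for value in pair
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Synthetic global improvement scores on the 0-6 scale (paper, CATI).
concordant = [(3, 3), (4, 4), (5, 5), (4, 5), (5, 4),
              (3, 4), (6, 6), (4, 4), (5, 5), (2, 3)]
# Two respondents who answer at opposite extremes, as if the scale were reversed.
reversed_scale = [(0, 6), (0, 6)]

icc_with = icc_oneway(concordant + reversed_scale)
icc_without = icc_oneway(concordant)

if __name__ == "__main__":
    print(f"ICC with reversed pairs:    {icc_with:.2f}")
    print(f"ICC without reversed pairs: {icc_without:.2f}")
```

In a small sample, each fully reversed pair contributes a large within-subject squared difference, which is why removing only two such pairs can change the estimate substantially.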
Discussion and conclusion
We compared the inter-method reliability of responses collected by CATI with those collected by a traditional paper method in a study of yoga for chronic LBP. For pain intensity, back-related function, pain medication use, and both physical and mental health components of the SF-36, reliability between paper survey and CATI data collection methods was very good to excellent. Satisfaction with treatment demonstrated moderate reliability at 6 weeks and improved at 12 weeks, whereas global improvement demonstrated poor reliability at both 6 and 12 weeks. While previous studies have compared the inter-method reliability of paper and telephone interviews for a number of health behavior questionnaires and the SF-36, our study is the first, to our knowledge, to focus on LBP-specific outcome measures such as LBP intensity, back-related function (RMDQ), and pain medication use.
The outcomes with greatest inter-method reliability, average low back pain intensity in the past week, back-related function, and pain medication use, are consistent with previous reliability studies [6, 18, 19]. Pain intensity was measured on a numerical scale from 0 to 10 and both the RMDQ and pain medication use questions were dichotomous, consisting of yes or no response choices. While the SF-36 contains multiple response choices for each question, each response choice also includes clear descriptions. In addition to being relatively straightforward, these measures are ubiquitous in clinical medicine and may be more familiar and intuitive to patients.
Some studies suggest that reliability between methods of survey administration may depend on the nature of the questions asked. For example, Lungenhausen et al. found within-subject differences in SF-12 mental health scores but not physical health scores, pain intensity, or pain-related disability between CATI and mailed questionnaires. Lower mental health scores were reported for the self-administered surveys when compared to CATI. Similarly, Feveile et al. randomized participants to either mailed questionnaires or telephone interviews and found that for self-assessed mental health items such as well-being, self-esteem, depression, and stress, participants reported more positively over the phone. There was no significant difference in responses between different survey modes for physical health and behavior items like smoking habits and medicine use. It appears that participants may respond differently to questions regarding sensitive topics, such as mental health, and report their health more positively when asked by an interviewer over the phone.
The wording of the question with poor reliability scores (global improvement) was more complex than that of the others. Dillman et al. suggest that participants may have more difficulty remembering and processing a continuum of response choices, which is the case for Likert scales, and are more likely to choose responses at either extreme of the scale. Without visual cues and set descriptors for each response choice, it is plausible that participants mistakenly reversed the numeric values for ‘extremely worsened’ and ‘extremely improved’ when asked via CATI, resulting in lower ICC scores.
Limitations of our study include the relatively small sample size. While the sample size was chosen for practical considerations, it still provided sufficient precision to estimate an ICC. For example, at baseline, the estimated low back pain ICC of 0.87 had an estimated standard error of 0.04, while the estimated baseline Roland ICC of 0.89 had an estimated standard error of 0.03. With a non-response rate of about 10% for paper and 29% for the phone surveys at 6 and 12 weeks, response bias is a possible limitation. However, comparisons between responders and non-responders at 12 weeks showed no differences in most sociodemographic and baseline low back pain characteristics. As demonstrated, small sample size and non-response may also magnify the effect of very discordant responses on the ICC. Given the relatively short time interval between administrations of each survey method, it is possible that participants may have reproduced responses from their paper surveys on the CATI. We are therefore unable to distinguish inter-method reliability from test-retest reliability. As paper surveys were administered before CATI at all time points, we were also unable to assess the potential effect of survey administration order. Finally, because we targeted a predominantly low income minority population, the results may not be generalizable to a population with higher socioeconomic status.
As researchers begin to utilize new methods of data collection, such as Short Message Service (SMS) and internet surveys, future studies are needed to assess their reliability. Additionally, studies should compare cost, staff burden, and response rates of different data collection methods given the target population. For example, studies report that administering CATI may cost two to three times more than self-administered paper surveys per person. Future work might include an analysis of the costs associated with administering CATI with electronic data capturing systems such as StudyTRAX.
In summary, we studied the reliability of traditional paper surveys and CATI for average low back pain intensity, RMDQ, and pain medication use. At all three time points, the two data collection methods yielded similar results. Having both options for data collection available may be helpful in targeting non-responders and improving overall response rates.
The authors wish to thank all the study participants and staff of the participating sites (Boston Medical Center, Codman Square Health Center, Dorchester House Multiservice Center, Greater Roslindale Medical and Dental Center, South Boston Community Health Center, Upham’s Corner Health Center), the study site champions (Katherine Gergen-Barnett, Aram Kaligian, David Mello, Ani Tahmassian, Stephen Tringale, Yen Loh), yoga instructors (Deidre Allesio, Lisa Cahill, Danielle Ciofani, Anna Dunwell, Carol Faulkner, Victoria Garcia Drago, Robert Montgomery), research staff (Sarah Baird, Ama Boah, Eric Dorman, Danielle Dresner, Zak Gersten, Margo Godersky, Naomi Goodman, Julia Keosaian, Lana Kwong, Chelsey Lemaster, Sarah Marchese, Dorothy Marshall, Georgiy Pitman, Martina Tam, Huong Tran), and the Data Safety Monitoring Board (Maya Breuer, Bei Chang, Deborah Cotton, and Steve Williams).
This publication was made possible by grant number 1R01AT005956 from the National Center for Complementary and Alternative Medicine (NCCAM) at the National Institutes of Health, Bethesda, MD. NCCAM had no role in the design, conduct, and reporting of the study.
- Cook C: Mode of administration bias. J Man Manip Ther. 2010, 18: 61-63. 10.1179/106698110X12640740712617.
- de Leeuw ED: To mix or not to mix data collection modes in surveys. J Off Stat. 2005, 21: 233-255.
- Bowling A: Mode of questionnaire administration can have serious effects on data quality. J Public Health. 2005, 27: 281-291. 10.1093/pubmed/fdi031.
- Bushnell DM, Martin ML, Parasuraman B: Electronic versus paper questionnaires: a further comparison in persons with asthma. J Asthma. 2003, 40: 751-762. 10.1081/JAS-120023501.
- Gwaltney CJ, Shields AL, Shiffman S: Equivalence of electronic and paper-and-pencil administration of patient-reported outcome measures: a meta-analytic review. Value Health. 2008, 11: 322-333. 10.1111/j.1524-4733.2007.00231.x.
- Lungenhausen M, Lange S, Maier C, Schaub C, Trampisch HJ, Endres HG: Randomised controlled comparison of the health survey short form (SF-12) and the graded chronic pain scale (GCPS) in telephone interviews versus self-administered questionnaires: are the results equivalent? BMC Med Res Methodol. 2007, 7: 50. 10.1186/1471-2288-7-50.
- Fowler FJ, Gallagher PM, Stringfellow VL, Zaslavsky AM, Thompson JW, Cleary PD: Using telephone interviews to reduce nonresponse bias to mail surveys of health plan members. Med Care. 2002, 40: 190-200. 10.1097/00005650-200203000-00003.
- Saper RB, Boah AR, Keosaian J, Cerrada C, Weinberg J, Sherman KJ: Comparing once-versus twice-weekly yoga classes for chronic low back pain in predominantly low income minorities: a randomized dosing trial. Evid Based Complement Alternat Med. 2013, 2013: 658030.
- StudyTRAX. [http://www.sciencetrax.com/studytrax/]
- Bombardier C: Outcome assessments in the evaluation of treatment of spinal disorders: summary and general recommendations. Spine. 2000, 25: 3100-3103. 10.1097/00007632-200012150-00003.
- Von Korff M, Jensen MP, Karoly P: Assessing global pain severity by self-report in clinical and health services research. Spine. 2000, 25: 3140-3151. 10.1097/00007632-200012150-00009.
- Ritter PL, González VM, Laurent DD, Lorig KR: Measurement of pain using the visual numeric scale. J Rheumatol. 2006, 33: 574-580.
- Patrick DL, Deyo RA, Atlas SJ, Singer DE, Chapin A, Keller RB: Assessing health-related quality of life in patients with sciatica. Spine. 1995, 20: 1899-1908. 10.1097/00007632-199509000-00011.
- Roland M, Fairbank J: The Roland-Morris disability questionnaire and the Oswestry disability questionnaire. Spine. 2000, 25: 3115-3124. 10.1097/00007632-200012150-00006.
- Ware JE: SF-36 health survey update. Spine. 2000, 25 (24): 3130-3139. 10.1097/00007632-200012150-00008.
- Hudak PL, Wright JG: The characteristics of patient satisfaction measures. Spine. 2000, 25: 3167-3177. 10.1097/00007632-200012150-00012.
- Koch GG: Intraclass correlation coefficient. Encyclopedia of statistical sciences. Edited by: Kotz S, Johnson NL. 1982, New York: John Wiley, 213-217.
- Klevens J, Trick WE, Kee R, Angulo F, Garcia D, Sadowski LS: Concordance in the measurement of quality of life and health indicators between two methods of computer-assisted interviews: self-administered and by telephone. Qual Life Res. 2011, 20: 1179-1186. 10.1007/s11136-011-9862-2.
- Feveile H, Olsen O, Hogh A: A randomized trial of mailed questionnaires versus telephone interviews: response patterns in a survey. BMC Med Res Methodol. 2007, 7: 27. 10.1186/1471-2288-7-27.
- Dillman DA, Sangster RL, Tarnai J, Rockwood TH: Understanding differences in people’s answers to telephone and mail surveys. New Dir Eval. 1996, 1996: 45-61. 10.1002/ev.1034.
- Duncan P, Reker D, Kwon S, Lai SM, Studenski S, Perera S, Alfrey C, Marquez J: Measuring stroke impact with the stroke impact scale: telephone versus mail administration in veterans with stroke. Med Care. 2005, 43: 507-515. 10.1097/01.mlr.0000160421.42858.de.
- Aitken JF, Youl PH, Janda M, Elwood M, Ring IT, Lowe JB: Comparability of skin screening histories obtained by telephone interviews and mailed questionnaires: a randomized crossover study. Am J Epidemiol. 2004, 160: 598-604. 10.1093/aje/kwh263.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.