We compared the inter-method reliability of responses collected by CATI with those collected by a traditional paper method in a study of yoga for chronic LBP. For pain intensity, back-related function, pain medication use, and both the physical and mental health components of the SF-36, reliability between the paper survey and CATI data collection methods was very good to excellent. Satisfaction with treatment demonstrated moderate reliability at 6 weeks and improved at 12 weeks, whereas global improvement demonstrated poor reliability at every time point. While previous studies have compared the inter-method reliability of paper and telephone interviews for a number of health behavior questionnaires and the SF-36, our study is, to our knowledge, the first to focus on LBP-specific outcome measures: pain intensity, back-related function (RMDQ), and pain medication use.
The outcomes with the greatest inter-method reliability (average low back pain intensity in the past week, back-related function, and pain medication use) are consistent with previous reliability studies [6, 18, 19]. Pain intensity was measured on a numerical scale from 0 to 10, and both the RMDQ and the pain medication use questions were dichotomous, with yes or no response choices. Although the SF-36 contains multiple response choices for each question, each choice includes a clear description. In addition to being relatively straightforward, these measures are ubiquitous in clinical medicine and may be more familiar and intuitive to patients.
Some studies suggest that reliability between methods of survey administration may depend on the nature of the questions asked. For example, Lungenhausen et al [6] found within-subject differences between CATI and mailed questionnaires in SF-12 mental health scores but not in physical health scores, pain intensity, or pain-related disability; lower mental health scores were reported on the self-administered surveys than by CATI. Similarly, Feveile et al [19] randomized participants to either mailed questionnaires or telephone interviews and found that for self-assessed mental health items such as well-being, self-esteem, depression, and stress, participants reported more positively over the phone. There was no significant difference between survey modes for physical health and behavior items such as smoking habits and medicine use. It appears that participants may respond differently to questions on sensitive topics, such as mental health, and report their health more positively when asked by an interviewer over the phone.
The question with poor reliability scores (global improvement) was worded more complexly than the others. Dillman [20] suggests that participants may have difficulty remembering and processing a continuum of response choices, as is the case for Likert scales, and are more likely to choose responses at either extreme of the scale. Without visual cues and set descriptors for each response choice, it is plausible that some participants mistakenly reversed the numeric values for ‘extremely worsened’ and ‘extremely improved’ when asked via CATI, resulting in lower ICC scores.
Limitations of our study include the relatively small sample size. Although the sample size was chosen for practical reasons, it still provided sufficient precision to estimate the ICCs. For example, at baseline the estimated low back pain ICC of 0.87 had an estimated standard error of 0.04, while the estimated baseline Roland ICC of 0.89 had an estimated standard error of 0.03. With a non-response rate of about 10% for the paper surveys and 29% for the phone surveys at 6 and 12 weeks, response bias is a possible limitation. However, comparisons between responders and non-responders at 12 weeks showed no differences in most sociodemographic and baseline low back pain characteristics. As demonstrated, a small sample size and non-responders may also magnify the effect of very discordant responses on the ICC. Given the relatively short time interval between administrations of the two survey methods, participants may have reproduced their paper survey responses on the CATI; we are therefore unable to distinguish inter-method reliability from test-retest reliability. Because paper surveys were administered before CATI at all time points, we were also unable to assess the potential effect of administration order. Finally, because we targeted a predominantly low income minority population, the results may not be generalizable to populations with higher socioeconomic status.
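For readers unfamiliar with how such reliability estimates are obtained, the following is a minimal sketch of an ICC computation on paired paper and CATI responses. It implements ICC(2,1), the two-way random-effects, absolute-agreement, single-measurement form of Shrout and Fleiss (1979); the data shown are hypothetical, and the specific ICC form and software used in our analysis may differ.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single
    measurement (Shrout & Fleiss, 1979).
    `scores` is an (n_subjects, k_methods) array of paired responses."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-method means

    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    ss_err = np.sum((scores
                     - row_means[:, None]
                     - col_means[None, :]
                     + grand) ** 2)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical paired 0-10 pain-intensity ratings:
# column 0 = paper survey, column 1 = CATI
pain = np.array([[7, 7], [4, 5], [8, 8], [2, 2],
                 [6, 5], [9, 9], [3, 3], [5, 6]])
print(f"ICC(2,1) = {icc_2_1(pain):.2f}")
```

With strongly agreeing pairs such as these, the between-subject variance dominates the residual variance and the ICC approaches 1; discordant pairs inflate the residual mean square and pull the estimate down, which is why a few very discordant responses can have a large effect in a small sample.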
As researchers begin to adopt new methods of data collection, such as Short Message Service (SMS) and internet surveys, future studies are needed to assess their reliability. Studies should also compare the cost, staff burden, and response rates of different data collection methods for a given target population. For example, studies report that administering CATI may cost two [21] to three [22] times more per person than self-administered paper surveys. Future work might include an analysis of the costs of administering CATI with electronic data capture systems such as StudyTRAX.
In summary, we studied the reliability of traditional paper surveys and CATI for average low back pain intensity, the RMDQ, and pain medication use. At all three time points, the two data collection methods yielded similar results. Having both data collection options available may help target non-responders and improve overall response rates.