The influence of applying insurance medicine guidelines for depression on disability assessments
© Schellart et al.; licensee BioMed Central Ltd. 2013
Received: 9 January 2013
Accepted: 28 May 2013
Published: 7 June 2013
In the current study we report on the effects of an implementation strategy, in the form of a training programme, on the work limitations of clients with depression as assessed by insurance physicians (IPs) participating in an RCT. The assessed work limitations were recorded as scores on the List of Functional Abilities (LFA).
We conducted a randomised controlled trial (RCT) for IPs in which we compared the intervention, a specially developed training programme, with the usual methods of implementation and training currently in use. The outcomes were the mean sum score and the inter-rater reliability (Intraclass Correlation Coefficient, ICC) of the LFA scores. The participating IPs recorded these LFA scores for the work limitations of the cases presented in four different videos: two shown before and two after the training of the intervention group.
At baseline, the intervention group (IG) consisted of 21 IPs and the control group (CG) of 19. For one participant in the IG and one in the CG, the LFAs of the two case reports after training were not available. Before training, the sum scores for the first case report did not differ significantly between the groups, while the mean sum score for the second case report was higher in the IG than in the CG. For both case reports after training, a higher score was found in the IG than in the CG. The inter-rater reliability measured for the two case reports before training was about the same in the IG and the CG: 0.64 and 0.65, respectively. For the two case reports after training, the ICC was higher in the IG than in the CG: 0.69 and 0.54, respectively. This difference was, however, not statistically significant.
It would appear that the implementation of a specially designed training programme on guidelines for depression may lead to greater inter-rater reliability in the assessments by insurance physicians of the work limitations of clients with depression. It is, however, important to note that insurance physicians who receive training may find more work limitations than those who do not.
Netherlands’ Trial Register NTR1863
Insurance medicine in the Netherlands
The Dutch National Institute for Employee Benefit Schemes (the Institute) administers the eligibility of sick employees for a benefit under the Work and Income (Capacity for Work) Act (WIA). The Institute employs 900 insurance physicians (IPs), approximately 450 of whom perform disability assessments under the WIA. On average, these insurance physicians are 50 years old and have approximately 16 years' experience as an insurance physician; 59% are men, approximately 86% are specialised in insurance medicine, 15% also have an additional medical speciality, and approximately 60% work full-time. They perform an average of 9 disability assessments per week, assessing employees with all types of diseases. Employees who have been on sick leave for two years can claim a disability benefit through the Institute. Such an employee becomes a client of the Institute. The client's claim is assessed by an IP at a front office of the Institute. In this assessment, which is called the work disability assessment, the client's work limitations and abilities are defined. The IP writes his or her findings down in a medical work disability report and fills in a List of Functional Abilities (LFA). On average, an IP needs approximately two hours for a complete work disability assessment: one hour for the assessment interview and one hour for writing the report. Subsequently, a labour expert matches the client's work abilities, as defined in the LFA, with the functional demands of (theoretically) available jobs, resulting in a selection of jobs that the client should be able to perform despite his or her work limitations. Finally, the client's benefit is determined by the loss of income, that is, the difference between the wages of the client's initial job and the wages of the selected jobs.
Guideline adherence and work limitations
We have previously investigated whether an implementation strategy that meets the needs of insurance physicians (IPs) leads to better adherence to guidelines than the usual implementation employed by the Dutch National Institute for Employee Benefit Schemes. To this end we developed a training programme using interventions that teach IPs how to apply the insurance medicine guidelines for depression when performing assessments of work limitations. The efficacy of this implementation strategy was investigated in a randomised controlled trial (RCT), in which a group of IPs trained in applying the guidelines for depression was compared with a control group. We demonstrated that IPs trained in applying the guidelines for depression scored significantly higher on guideline adherence and on knowledge of the guidelines for depression than IPs in the control group.
The current study addresses the following research questions:
What is the influence of the training programme on the assessed work limitations?
What is the influence of the training programme on the inter-rater reliability between the LFA scores of the participating IPs?
Our hypotheses were as follows:
Training in guidelines for depression will result in more work limitations being found, because adherence to the guidelines leads to a more complete overview of disorders and the resulting work limitations, based on the information available.
Training in guidelines for depression will result in higher inter-rater reliability between IPs: after following the training programme the IPs will assess work limitations in a more uniform manner.
To determine the efficacy of a specially developed strategy for implementation of the guidelines for depression, we conducted a randomised controlled trial (RCT) in which we compared an intervention group with a control group. In this RCT we compared the intervention of a specially developed training programme with the usual methods of implementation and training currently in use by the social security agency.
The intervention was a training programme designed for IPs, in which they learnt to apply the guidelines for depression. This programme, together with baseline and follow-up measurements, was integrated into a four-day postgraduate course located at the Netherlands School of Public and Occupational Health (NSPOH).
While the intervention group was trained in applying the guidelines for depression, the control group received an alternative programme of training in motivational interviewing that did not conflict with the intervention programme. The RCT took three days within a period of two weeks in March 2009. After the RCT ended, the control group received the same training as the intervention group, while the intervention group received the alternative programme. This was planned as the fourth day of the course, which was held three months later at the end of June 2009.
By using actors to simulate four different case reports on video, we created a laboratory setting in which we could measure each IP's work disability assessments of clients with depression. In these videos the role of the client was played by four different actors, while the role of the IP was played by two ‘real’ IPs, independently selected for this purpose. The training programme was designed so that it could also be applied in practice. The Ethics Committee of the VU University Medical Centre granted approval for the study design and the RCT was accepted by the Netherlands Trial Register under number NTR1863.
In January 2009, IPs employed by the Institute were invited to take part in a postgraduate course in applying the guidelines for depression, given in the period from March to July 2009. The inclusion criteria were that individuals should be registered as insurance physicians, or still be in training as such, and should be conducting disability assessments of clients as commissioned by the Institute. The NSPOH was responsible for enrolment of participants, who also provided written informed consent to take part in the study. Forty-three insurance physicians participated in the study.
The participants were allocated in order of registration to either the intervention group or the control group by using a random-sequence table. Participants who were not available on the planned dates were excluded from the trial. The participants were informed that the course was part of a research project, but they were not informed about the design of the entire project, i.e. the various measurements and the type of group in which they participated.
Data were collected at the NSPOH during the period of the training course. At baseline (pre-intervention) and at follow-up (post-intervention) each IP assessed the work limitations of two clients, played by actors, who were presented separately on video. The actors played clients with depression, reconstructed from real case reports. The actors played their roles on the basis of extensive scripts, with room for improvisation. The videos showed the disability assessment encounter between a client (actor) and an independent IP (not a participant in the RCT), who had been briefed to perform the assessment in complete accordance with the guidelines for depression. The decision phase of the assessment encounter was not shown on the video. The participating IPs completed their medical disability reports, including the LFA, immediately after watching each client on the video. All reports and completed LFAs were collected directly afterwards. The researchers were blinded during data collection, and an independent research assistant coded the data.
The primary outcome of the RCT was guideline adherence, measured using performance indicators. A detailed description of the development and reliability of these performance indicators has been published elsewhere, as has the effect of the intervention on guideline adherence.
The LFA items were grouped into the following scales:
Mental abilities: limitations in coping with various mental task demands
General physical abilities: limitations covering various aspects of the musculoskeletal system
Autonomy: limitations in being able to act autonomously in the working situation
Manual skills and grip strength limitations
Since the internal reliability of this last scale was very low (alpha 0.46), items on this scale were included in the scale for general physical ability, a possibility demonstrated by another study of LFA data from 84,000 disability assessments. The three scales in that study had an acceptable level of reliability (alphas were 0.69 for scale 1, 0.72 for scale 2, and 0.75 for scale 3 including manual skills and grip strength). Hence, in the current study we used these three scales, together with a separate scale for working hours, which had very good internal reliability (alpha 0.97).
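The alpha values quoted above are internal-consistency coefficients. As a minimal sketch, assuming the conventional Cronbach's alpha formula and using hypothetical item scores (the function name `cronbach_alpha` and the data are illustrative only and are not taken from the study), such a coefficient could be computed as follows:

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    item_scores: list of rows, one row per assessment,
    one column per item belonging to the scale.
    """
    k = len(item_scores[0])  # number of items in the scale
    if k < 2:
        raise ValueError("alpha needs at least two items")
    columns = list(zip(*item_scores))
    # Sample variance of each item across assessments
    item_variances = [variance(col) for col in columns]
    # Sample variance of the sum score across assessments
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five assessments scored on a three-item scale
scores = [
    [0, 1, 1],
    [1, 2, 2],
    [2, 2, 3],
    [3, 3, 3],
    [1, 1, 2],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

Alpha approaches 1 when the items of a scale vary together across assessments, and drops when the items are only weakly related, as was the case for the separate manual skills scale.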
To address the first hypothesis, we used an unpaired t-test to analyse differences in the mean sum scores of the four scales between the intervention group and the control group for each case report (four case reports: the first two pre-intervention, the other two post-intervention). To examine whether correction was necessary for the influence of any unequal distribution of background variables between the intervention group and the control group, we performed regression analysis using the relevant background variable as covariate.
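The group comparison described above can be illustrated with a short sketch. It assumes the classical pooled-variance form of the unpaired t-test (the text does not state which variant was used), and the sum scores are hypothetical; the function name `unpaired_t` is likewise illustrative.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(sample_a, sample_b):
    """Pooled-variance (Student) unpaired t-test.

    Returns the t statistic and the degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled estimate of the common variance (assumes equal variances)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    return t, df


# Hypothetical LFA sum scores for one case report
ig_scores = [14, 16, 15, 18, 17, 15]  # intervention group
cg_scores = [12, 13, 15, 11, 14, 13]  # control group

t_stat, dof = unpaired_t(ig_scores, cg_scores)
print(f"t = {t_stat:.2f}, df = {dof}")
```

A positive t statistic here means a higher mean sum score in the intervention group. Obtaining a p-value additionally requires the t distribution with the given degrees of freedom; statistical packages, for example `scipy.stats.ttest_ind` in SciPy, provide this directly.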
To address the second hypothesis regarding inter-rater reliability, we performed analyses using linear mixed models, which enable modelling of variances (and covariances) and provide the possibility of accounting for hierarchical data. We used the variances to calculate the intraclass correlation coefficient (ICC, with values ranging between 0 and 1). A higher ICC indicates a greater degree of inter-rater reliability. We also tested whether the difference between the ICCs of the intervention group and the control group was significantly different from zero. For a more detailed description of the statistical analysis we refer to Additional file 1. All analyses were performed using SPSS 15.0.
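To give an impression of what an ICC expresses, the sketch below uses a simpler estimator than the mixed-model approach described above: the one-way random-effects ICC based on analysis-of-variance mean squares. This is a swapped-in illustration, not the method of Additional file 1, and the ratings and the function name `icc_oneway` are hypothetical.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings.

    ratings: one row per rated target (for example a scale within
    a case report), one column per rater.
    """
    n = len(ratings)     # number of rated targets
    k = len(ratings[0])  # number of raters per target
    grand_mean = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    # Mean square between targets
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Mean square within targets (disagreement between raters)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(ratings, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical scores: four targets, each rated by three physicians
ratings = [
    [4, 5, 4],
    [2, 2, 3],
    [6, 5, 6],
    [1, 2, 1],
]
print(f"ICC = {icc_oneway(ratings):.2f}")
```

The coefficient is the share of the total variance that is attributable to real differences between the rated targets; the less the raters disagree about the same target, the closer it comes to 1.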
Between January and March 2009 a total of 43 insurance physicians applied to take part in the course. At the time of the RCT all participating IPs were actively conducting disability assessments. Twenty-one IPs were allocated to the control group and 22 to the intervention group. One of the IPs allocated to the intervention group withdrew from the course, and two IPs originally allocated to the control group were not available on the planned dates. All three were excluded from the RCT. At baseline, therefore, the control group (CG) consisted of 19 IPs and the intervention group (IG) of 21. For one CG participant and for one IG participant the LFAs of the two case reports after training were not available.
Baseline characteristics of insurance physicians in control group (CG) and intervention group (IG)
Columns: CG (n = 19); IG (n = 21). Values are mean (sd) or percentage.
Rows: Age in years; Weekly working hours; Years working as physician; Registered as insurance physician; Years working as insurance physician; Number of clients with depression assessed per month; Assessment time for depressed clients (minutes); Assessments under the new disability act; Employee of the Institute
Mean scale scores (sd) of LFA scales for two case reports before training
Columns: Case report 1: CG; Case report 1: IG; Case report 2: CG; Case report 2: IG
Mean scale scores and sum scores of LFA scales for two case reports after training
Columns: Case report 3: CG; Case report 3: IG; Case report 4: CG; Case report 4: IG
Results of the mixed models analysis and ICC calculation, with scores of four LFA scales
Columns: Case reports 1 and 2; Case reports 3 and 4
Rows: Scale (case report); Case report * respondent; (95% confidence interval)
Results of the ICC calculation, with scores of three and two LFA scales respectively
Columns: Case reports 1 and 2; Case reports 3 and 4
Rows: ICC (3 scales); ICC (2 scales)
The results of this study show that before training the sum scores for the first case report did not differ significantly between the groups, while for the second case report the mean sum score was significantly higher in the IG than in the CG. For the two case reports after training, we saw a significantly higher score in the IG than in the CG.
The inter-rater reliability measured for the two case reports before training, using four scales, was about the same in the CG and the IG. For the two other case reports, after training, the ICC was 0.69 for the IG and 0.54 for the CG. This difference was, however, not statistically significant.
Interpretation and comparison with other studies
The training programme on applying the guidelines for depression resulted in more work limitations. For the same case report, IPs who received training filled in more work limitations in the LFA than the IPs who did not receive training. This difference is most noticeable in case report 3.
Post-intervention data showed that the group of IPs who were given training in applying the guidelines had a higher degree of consistency when filling in the LFA than the IPs in the control group. Apparently the implementation strategy contributed to more uniformity in work limitations assessments by IPs. This ties in well with earlier research into variation in work disability assessments [16, 17]. In terms of financial and social consequences, such variation is unwanted for both the client and society and in our opinion might be reduced by the use of standardised methods of assessment, as occurs when guidelines are applied. The fact that applying guidelines results in a more uniform judgment ties in well with the idea that reducing medical ambiguity or uncertainty also reduces variation between doctors [18, 19].
It is striking that the differences between the two groups with regard to the scale for working hours are considerable (except for case report 1), both before and after training. Working hours limitation is a strong determinant of the end result of the assessment: the degree of work disability assigned to the client. Another study into variations in disability assessments also found little consistency between IPs regarding the work limitation scale for working hours. The scale for working hours even has its own guidelines, separate from those specific to diagnosis.
Our results confirm the trends proposed in the two hypotheses. We have shown that IPs trained in using the guidelines record more work limitations than untrained IPs. In another study of ability assessments of clients with depression, the use of a work ability checklist actually led to findings of higher levels of work ability, without a reduction in the variation of assessment results. One possible explanation for this is that the emphasis in the aforementioned study was on work ability rather than on work limitations, as in the depression guidelines. Incidentally, the ICC in that study was of similar magnitude to that found in the current study’s pre-intervention measurements, namely 0.64.
The training programme taught the IPs to conduct systematic and thoroughly justified disability assessments in accordance with the guidelines. Apparently this method of assessment leads to a higher number of work limitations than is usually the case. The reason for this might be that IPs who adhere more closely to guidelines interpret the information provided more strictly than usual. After all, the information concerning the client was provided by means of a case report on video, which was the same for all IPs. The IPs themselves were not able to ask the client any questions. Therefore, in the daily practice of IPs – where interviews form an influential part of a disability assessment – the difference between the groups may well be greater: the trained IP, actively applying the guidelines, will make further enquiries of the client regarding aspects such as sleep disorders. The existence of sleep disorders may then in turn influence how the IP fills in the LFA.
Strengths and weaknesses
This study has several strengths. Firstly, the active form of the four ‘real life’ case reports on video, which simulate the daily practice of an IP, is more effective than written case reports. Secondly, the fact that the two case reports presented before the training programme were different from the two presented after training prevents any confounding learning effect that occurs when a case report is presented for the second time. Thirdly, the suitability of the four scales drawn up on the basis of the LFA has already been established by statistical analysis in previous studies [10, 12, 13]: the difference in the means has been tested using the sum scores of the four scales, which are a valid measure of the number and severity of the limitations, since they are not influenced by the distribution over the four scales. Finally, to determine inter-rater reliability, an empirically tested method was used to calculate the ICCs (see Additional file 1): the differences between the ICCs of the CG and IG were tested for their significant difference from zero.
The study also has a number of weaknesses. To start with, it may be difficult for IPs to complete an LFA based purely on a video, a factor that was not looked at in this study. Another weakness is the question of what to do about items marked as ‘no limitations found’: should this be considered as missing data, or as an actual assessment of there being no limitations, or at least no severe limitations? We attempted to accommodate this weakness by also analysing inter-rater reliability while excluding the scales that had only a few observations. A further weakness is the fact that the pre-intervention data already showed a significant difference in the severity and number of limitations between the intervention group and the control group. Finally, since the case reports presented before and after the training programme were not necessarily comparable, the ICCs from before and after training were not comparable within each group (CG and IG). It was, therefore, not possible in the IG to test whether there was an increase in inter-rater reliability after the training programme.
The findings of this study provide a point of consideration for insurance medicine. IPs should be aware that collecting information about a client in a structured manner, as when following a guideline, can lead to the finding of more work limitations in that client. The IP should not lose sight of the importance of work participation and should focus on the work ability of the client. In addition, it would appear that IPs have difficulty reaching uniformity in applying the ‘reduced working hours’ standard. We recommend a separate training programme for IPs to teach them to apply this standard, preferably according to the existing disease-specific guidelines.
Policy makers should be aware that although it is possible to improve the inter-rater reliability between IPs for disability assessments, there is still room for professional autonomy and variation in assessments, even after guidelines have been implemented. IPs cannot be completely constrained by a guideline, and a guideline cannot be comprehensive enough to cover all possible situations. The maximum ICC found in this study was 0.69, not 1.00. Since disability assessments are, and will remain, human activities, a certain degree of variation within professional guidelines is acceptable.
There are indications that the implementation of a specially designed training programme on guidelines for depression may lead to greater inter-rater reliability in the assessments by insurance physicians of the work limitations of clients with depression. It is, however, important to note that insurance physicians who receive training may find more work limitations than those who do not. Whether this possible rise in work limitations found might also lead to a higher degree of work disability requires further investigation.
The authors wish to thank the IPs who participated in this research. The Research Center for Insurance Medicine AMC-UMCG-UWV-VU University Medical Center, in Amsterdam, is a joint initiative of the Academic Medical Center (AMC), the University Medical Center in Groningen (UMCG), the Dutch Institute for Employee Benefit Schemes (UWV), and the VU University Medical Center (VUMC). This trial was funded by the Dutch Institute for Employee Benefit Schemes and the Netherlands’ School for Public Health. FZ, JRA, and AJMS are (partially) funded by UWV. The study sponsor had no role in the study design, in the collection, analysis or interpretation of the data, in the writing of the case reports, or in the decision to submit the paper for publication. The design of this study was laboratory-based, and fictitious but realistic case reports were used for data collection. Consequently, the Medical Ethics Committee agreed with the design. The full trial protocol can be accessed at the web address of the Netherlands’ Trial Register (NTR): http://www.trialregister.nl/trialreg/admin/rctview.asp?TC=1863.
1. Steenbeek R, Schellart AJM, Mulders HPG, Anema JR, Kroneman H, Besseling JJM: The development of instruments to measure the work disability assessment behaviour of insurance physicians. BMC Publ Health. 2011, 11: 1. 10.1186/1471-2458-11-1.
2. Lisv: Claim Beoordelings- en BorgingsSysteem (CBBS) (List of Functional Abilities). 2002, Amsterdam: Lisv
3. Zwerver F, Schellart AJM, Anema JR, Rammeloo K, Van der Beek AJ: Intervention mapping for the development of a strategy to implement the insurance medicine guidelines for depression. BMC Publ Health. 2011, 11: 9. 10.1186/1471-2458-11-9.
4. Health Council: Insurance Medicine Guidelines for Depression. 2006, Den Haag: Gezondheidsraad
5. Zwerver F, Schellart AJ, Knol DL, Anema JR, Van der Beek AJ: An implementation strategy to improve the guideline adherence of insurance physicians: an experiment in a controlled setting. Implement Sci. 2011, 6: 131. 10.1186/1748-5908-6-131.
6. World Health Organization: International classification of functioning, disability and health. 2002, Geneva: WHO
7. Brage S, Donceel P, Falez F, Working Group of the European Union of Medicine in Assurance and Social Security: Development of ICF core set for disability evaluation in social security. Disabil Rehabil. 2008, 30 (18): 1392-1396. 10.1080/09638280701642950.
8. Østerås N, Brage S, Garratt A, Benth JS, Natvig B, Gulbrandsen P: Functional ability in a population: normative survey data and reliability for the ICF based Norwegian Function Assessment Scale. BMC Publ Health. 2007, 7: 278. 10.1186/1471-2458-7-278.
9. Spanjer J, Krol B, Popping R, Groothoff JW, Brouwer S: Disability assessment interview: the role of detailed information on functioning in addition to medical history-taking. J Rehabil Med. 2009, 41 (4): 267-272. 10.2340/16501977-0323.
10. Schellart AJ, Mulders H, Steenbeek R, Anema JR, Kroneman H, Besseling J: Inter-doctor variations in the assessment of functional incapacities by insurance physicians. BMC Publ Health. 2011, 11: 864. 10.1186/1471-2458-11-864.
11. Schellart AJ, Zwerver F, Knol DL, Anema JR, Van der Beek AJ: Development and reliability of performance indicators for measuring adherence to a guideline for depression by insurance physicians. Disabil Rehabil. 2011, 33 (25–26): 2535-2543.
12. Broersen JP, Mulders HP, Schellart AJ, Van der Beek AJ: The dimensional structure of the functional abilities in cases of long-term sickness absence. BMC Publ Health. 2011, 11: 99. 10.1186/1471-2458-11-99.
13. Broersen JP, Mulders HP, Schellart AJ, Van der Beek AJ: The identification of job opportunities for severely disabled sick-listed employees. BMC Publ Health. 2012, 12: 156. 10.1186/1471-2458-12-156.
14. SPSS: SPSS 15.0 Command Syntax Reference. 2006, Chicago, Ill: SPSS Inc
15. Molenberghs G, Laenen A, Vangeneugden T: Estimating reliability and generalizability from hierarchical biomedical data. J Biopharm Stat. 2007, 17 (4): 595-627. 10.1080/10543400701329448.
16. Spanjer J, Krol B, Brouwer S, Groothoff JW: Sources of variation in work disability assessment. Work. 2010, 37 (4): 405-411.
17. Spanjer J, Krol B, Brouwer S, Groothoff JW: Inter-rater reliability in disability assessment based on a semi-structured interview report. Disabil Rehabil. 2008, 30 (24): 1885-1890. 10.1080/09638280701688185.
18. Wennberg JE, Barnes BA, Zubkoff M: Professional uncertainty and the problem of supplier-induced demand. Soc Sci Med. 1982, 16 (7): 811-824. 10.1016/0277-9536(82)90234-9.
19. Eisenberg JM: Physician utilization: the state of research about physicians’ practice patterns. Med Care. 2002, 40 (11): 1016-1035. 10.1097/00005650-200211000-00004.
20. Lisv: Standaard verminderde arbeidsduur (Reduced working hours standard). 2007, Amsterdam: Lisv
21. Slebus FG, Kuijer PP, Willems JH, Frings-Dresen MH, Sluiter JK: Work ability assessment in prolonged depressive illness. Occup Med (Lond). 2010, 60 (4): 307-309. 10.1093/occmed/kqq079.
22. Berkhof M, Van Rijssen HJ, Schellart AJ, Anema JR, Van der Beek AJ: Effective training strategies for teaching communication skills to physicians: an overview of systematic reviews. Patient Educ Couns. 2011, 84 (2): 152-162. 10.1016/j.pec.2010.06.010.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.