Reliability and validity of motion analysis in children treated for congenital clubfoot according to the Clubfoot Assessment Protocol (CAP) using inexperienced assessors

Background The Clubfoot Assessment Protocol (CAP) was developed for follow-up of children treated for clubfoot. The objective of this study was to analyze the reliability and validity of the six items in the CAP Motion Quality domain when used by inexperienced assessors. Findings Four raters (two paediatric orthopaedic surgeons, two senior physiotherapists) used the CAP scores on two different occasions to analyze 11 videotapes containing standardized recordings of motion activity according to the CAP Motion Quality domain. These results were compared with a criterion (defined by two well-experienced CAP assessors) to assess validity and to check for a learning effect. Weighted kappa statistics, exact percentage observer agreement (Po), percentage observer agreement including one level of difference (Po-1) and the number of scoring levels determined how reliability was interpreted. Inter- and intra-rater differences were calculated using medians and interquartile ranges (IQR) at item level, and means and limits of agreement at domain level. Inter-rater reliability varied between fair and moderate (kappa) with a mean agreement of 48/88% (Po/Po-1). Intra-rater reliability varied between moderate and good with a mean agreement of 63/96%. The intra- and inter-rater differences in the present study were generally small at both item (0.00) and domain level (-1.10). There was exact agreement of 51% and a Po-1 of 91% for the six items compared with the criterion. No learning effect was found. Conclusion The CAP Motion Quality domain can be used by inexperienced assessors with sufficient reliability in daily clinical practice and showed acceptable accuracy compared with the criterion.


Background
The Clubfoot Assessment Protocol (CAP) [1,2] (Table 1) was developed for follow-up of children treated for congenital clubfoot. Twenty items divided over four domains (Mobility, Muscle function, Morphology and Motion Quality) form the CAP. Most previous instruments for evaluation of children with clubfoot, such as the International Clubfoot Study Group evaluation system (ISGC) [3] and the Laaveg-Ponseti score [4], do not include items concerning the child's quality of motion, such as walking or running.
The CAP has in previous studies shown good reliability, validity and sensitivity to change in the four domains Mobility, Muscle function, Morphology and Motion Quality with experienced assessors [1,2]. The objective of this study was to analyze the intra- and inter-rater reliability of the items in the CAP Motion Quality domain, and their validity, with inexperienced CAP assessors.

CAP Motion Quality
This domain contains six items: running, walking, toe walking, heel walking, one-leg hop and one-leg balance (see additional file 1). At the age of four years children are normally expected to be able to perform all six items. The scoring distribution and criteria are described in the appendix. The scoring has been divided systematically in proportion to what is regarded as normal variation and its supposed impact on the child's physical function. Assessment is done in relation to the child's age.

Patients
Video recordings of eleven children treated for clubfoot, with varying severity and outcome results, were selected from the archives of our clubfoot clinic. The tapes contained standardized recordings of motion activity according to the CAP Motion Quality domain. The median age was 5.5 years (range 4-7 years). Gender distribution was three girls and eight boys. Five children had unilateral clubfoot and six bilateral. All families gave their informed consent for the use of the video films.

Raters
Four raters were selected according to the criterion of having worked within pediatric orthopedics for at least seven years, including experience with children with clubfoot. Two raters were pediatric orthopedic surgeons and two were senior physiotherapists. None of the raters had previous experience with the CAP system. Two raters well experienced with the CAP, one physiotherapist and developer of the CAP (HA) and one pediatric orthopedic surgeon (GH), defined the most correct score for each child's item performance.

Video recording
The recording procedure was standardized and comparable with the situation in a daily clinical environment. The children were recorded from frontal and posterior views while moving along a 10 meter pathway. The camera was positioned at a height of one meter, two meters from the beginning of the pathway. The children wore t-shirts, shorts or underwear and were barefoot. The children were asked whether they wanted to start with walking or running. All children started with running, followed by walking, toe walking, heel walking, one-leg hop and one-leg stance. Each performance was recorded as many times as necessary to allow an assessment comparable with real life. Each video sequence lasted about 4 minutes.

Rating procedures
Three weeks before the first assessment session, all four raters received the CAP Motion Quality manual with the item criteria and a copy of the protocol form to be used during the rating session (see additional file 1). They were asked to study the manual and scoring system and to use this information during the assessment sessions.
Each rater individually assessed all 11 video-recorded children twice, with an interval of 4 to 6 weeks between the sessions.
An introduction was given prior to each assessment session explaining the testing procedure: 1) After each video recording a half-minute pause was given, and a short break was made after the fifth video. 2) No possibility was given to stop the video or to assess the recordings in slow motion. 3) Before each new video sequence the raters received information only about the child's age and gender. 4) Both the left and the right side were to be rated. As a training session, the raters viewed and simultaneously rated a videotaped recording of a child without a disability and a child treated for congenital clubfoot. Total testing time was approximately one hour and 15 minutes.
The two experienced assessors (HA and GH) analyzed and discussed the same videos at one meeting and defined the most correct rating for each side and each child. This was done before the first assessment of the four raters.

Data analyses
Both legs were rated and used as individual ratings in the statistical analyses.
Inter- and intra-rater reliability was calculated using weighted kappa (k) statistics [1,2] together with 95% confidence intervals. For the inter-rater testing, the assessments from the first sessions were used. According to Altman [5], kappa values are interpreted as follows: < 0.20 poor agreement, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 good and > 0.80 very good. Exact observed percentage agreement (Po) and percentage agreement including one level of difference (Po-1) were calculated, as kappa values can become unstable under certain conditions, e.g. with a limited distribution of cell frequencies [6-8].
As the CAP Motion Quality domain has five scoring levels, we regarded a Po ≥ 50% or a Po-1 ≥ 80% as good.
Item reliability was considered good when more than half of the assessment pairs had kappa values higher than 0.60 (good) and/or good percentage agreement. Item reliability was considered sufficient when the kappa values ranged between 0.41 and 0.60 (moderate) for more than half of the inter- and intra-rater comparisons and/or the percentage agreement was good.
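As a minimal sketch of how these statistics can be computed, assuming linearly weighted kappa and using invented ratings on a five-level scale coded 0-4 (not the study's data):

```python
# Illustrative sketch, not the authors' code: linearly weighted kappa
# and percentage-agreement statistics (Po, Po-1) for paired ordinal
# ratings on a five-level scale (coded 0-4). The ratings are invented.

def weighted_kappa(a, b, n_levels=5):
    """Linearly weighted kappa for paired ordinal ratings a and b."""
    n = len(a)
    max_d = n_levels - 1
    # observed disagreement, weighted by the distance between scores
    obs = sum(abs(x - y) for x, y in zip(a, b)) / (n * max_d)
    # chance-expected disagreement from the two raters' marginals
    exp = sum(abs(x - y) for x in a for y in b) / (n * n * max_d)
    return 1 - obs / exp

def percent_agreement(a, b, tolerance=0):
    """Po (tolerance=0) or Po-1 (tolerance=1): percentage of rating
    pairs that differ by at most `tolerance` scale levels."""
    hits = sum(abs(x - y) <= tolerance for x, y in zip(a, b))
    return 100.0 * hits / len(a)

# one rating per videotaped performance (invented, 11 children)
rater1 = [4, 3, 3, 2, 4, 1, 3, 2, 4, 3, 2]
rater2 = [4, 2, 3, 2, 3, 1, 2, 2, 4, 3, 3]

k = weighted_kappa(rater1, rater2)
po = percent_agreement(rater1, rater2)       # exact agreement
po1 = percent_agreement(rater1, rater2, 1)   # within one level
```

With these invented ratings, Po clears the 50% cut-off and Po-1 the 80% cut-off, illustrating how the two kinds of evidence are combined in the reliability criteria above.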
The median differences and interquartile ranges (IQR) for each item (ordinal data), and the mean difference with its limits of agreement (LOA) (interval data) for the Motion Quality domain, were calculated for the inter- and intra-rater comparisons [5]. To evaluate whether there was a learning effect between the first and second session, the Po and Po-1 against the criterion were used. A difference of more than 10% was set as the level for a real difference.
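A rough sketch of these difference summaries, median/IQR at item level and mean difference with 95% limits of agreement at domain level, using invented scores rather than the study's data:

```python
# Illustrative sketch, not the authors' code: summarizing paired score
# differences as median/IQR (item level) and as mean difference with
# 95% limits of agreement (domain level). All numbers are invented.
import statistics

def median_iqr(diffs):
    """Median and interquartile range of paired score differences."""
    q1, _, q3 = statistics.quantiles(diffs, n=4)
    return statistics.median(diffs), (q1, q3)

def mean_loa(diffs):
    """Mean difference and 95% limits of agreement (mean +/- 1.96 SD)."""
    m = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return m, (m - 1.96 * sd, m + 1.96 * sd)

# invented domain totals (max 24) for 11 children, rated twice
session1 = [20, 18, 22, 15, 24, 19, 21, 17, 23, 16, 20]
session2 = [21, 18, 21, 16, 24, 20, 22, 17, 22, 17, 21]
diffs = [a - b for a, b in zip(session1, session2)]

med, iqr = median_iqr(diffs)
mean_diff, (loa_low, loa_high) = mean_loa(diffs)
```

A mean difference near zero with narrow limits of agreement indicates small systematic and random error between sessions, which is the property examined here.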

Inter-rater reliability
The item inter-rater reliability between the four raters is presented in Table 2. According to the reliability criteria, four out of six items showed sufficient inter-rater reliability. The items toe walking and heel walking showed overall problems in reaching sufficient assessment agreement.

Intra-rater reliability
The median intra-rater reliability for the individual raters is presented in Table 4. In general, item intra-rater reliability varied between moderate and good, with a mean item agreement of 63/96% (Po/Po-1). Three items showed good and three showed sufficient intra-rater agreement.

Learning development and validity with the criterion
No general improvement was seen between the first and second session regarding the exact observed mean percentage agreement for all items (Figure 1). The item toe walking showed decreased agreement (from 53 to 40%). Also when including one category of difference for the observed percentage agreement, no improvement occurred between the first and second session, except for the item heel walking (83 to 96%). These results also showed an exact agreement of 51% and a Po-1 of 91% for the six items compared with the criterion.

Discussion
This is, to our knowledge, the first study focusing on the reliability of assessing different activity performances in children born with clubfoot in a situation comparable with a daily clinical setting. The inter-rater reliability for four out of six items from the CAP Motion Quality domain was sufficient. The items toe walking and heel walking showed fair reliability. The intra-rater reliability was between good and sufficient for all items. Inter- and intra-rater score differences at item and domain level were relatively small. No clear learning effects were found between the first and second session.
Three-dimensional gait analysis (3DGA) is commonly advocated as the gold standard within gait analysis. In our study such computerized motion analyses could not be used for validation of our items, as they were not (yet) obtainable. The exact agreement of 51% with our criterion, given the five scoring possibilities, and the Po-1 agreement of 91% provide evidence for a valid assessment system. It would be interesting to study whether more experience with the CAP system, or a CAP course, can increase the validity and reliability.

Methodological issues
It is impossible to actually calculate the true reliability of an instrument. Many internal factors, such as sample size, number of scoring levels and statistical method, and external factors, such as the assessment procedure and shifting performance of the subject under observation, can influence the outcome of reliability studies. In studies with young children these external factors can be very difficult to keep stable. We tried to control the external factors by using video recordings that resembled the daily clinical situation as much as possible. This made it possible for several raters to assess the same phenomenon.
Strictly methodologically, it is not correct to use both legs of the same child as individual ratings, as they can be dependent on each other. This can be a significant problem in treatment outcome studies. In the present study, however, we think this is of minor importance, as the aim was to study the reliability of assessors when they have to assess both legs, as in the normal clinical situation.
Defining the cut-off points for the percentage agreement is arbitrary. A concordance of 75-80% with two scoring possibilities is commonly used. We think that our cut-off point of 50% for the exact percentage agreement with a 5-point scale is acceptable. We also checked the score differences for information on the clinical implications of the reliability.
We tried to integrate different information on the instrument's behavior from the inter- and intra-rater testing, trying to create as truthful a picture as possible.
Figure 1. The exact mean percentage agreement (Po) and the agreement within one level (Po-1) between the four observers and the criterion for the six items at testing sessions I and II. Wren et al. [9], in their reliability study of visual gait assessments in children with pathologic gait, found no statistically significant differences in reliability between "live", full-speed and slow-speed video. In some cases, though, slow-motion video improved the agreement of observational assessments. Brunnekreef et al. [10] concluded that structured visual gait observation using a gait analysis form was moderately reliable in patients with orthopedic disorders. Clinical experience appeared to increase the reliability of visual gait analysis.
The observers in the present study reported difficulties with not having control over the assessment situation: the raters were not allowed to stop, rewind or view the recordings in slow motion. This might have had a negative influence on our reliability and validity results.
Knowledge about the score differences between observers is important, as this has to be incorporated in studies on responsiveness. The intra- and inter-rater measurement errors in the present study were small at both item and domain level. Celebi et al. [11] found mean difference scores of 0.17, 0.63 and 0.80 (LOA around -2.00 to 3.00) between three experienced observers for the functional domain of the International Clubfoot Study Group evaluation system (ISGC) [3]. This domain has a total score of 36 and uses 2- or 3-point scales. Our result of -1.10 (LOA -1.86 to 1.66) for the CAP Motion Quality domain, with a total score of 24 and a 5-point scale, is in comparison very promising.
Fewer scoring levels would probably increase the reliability of the CAP Motion Quality items, but decrease the sensitivity to differences. An instrument with higher sensitivity is clinically more informative. More scoring possibilities also require the assessor to judge an observation more critically and decide which score is the most correct. Such situations can have a learning effect and with time increase the quality and reliability of assessments.

Conclusion
We conclude that the CAP Motion Quality domain can be used with good reliability and validity in daily clinical practice. When different observers are used, and within research, it is recommended to check the inter-rater reliability and to calculate the score differences at item or domain level.