This domain contains six items: running, walking, toe walking, heel walking, one-leg hop and one-leg balance (see additional file 1). At the age of four years, children are normally expected to be able to perform all six items. The scoring distribution and criteria are described in the appendix. The scoring levels were defined systematically according to what is regarded as normal variation and its presumed impact on the child's physical function. Assessment is done in relation to the child's age.
Video recordings of eleven children treated for clubfoot, with varying severity and outcomes, were selected from the archives of our clubfoot clinic. The tapes contained standardized recordings of motion activity according to the CAPMotion Quality domain. The median age was 5.5 years (range 4–7 years). There were three girls and eight boys. Five children had unilateral and six had bilateral clubfoot. All families gave their informed consent for the use of the video films.
Four raters were selected on the criterion of having worked in pediatric orthopedics for at least seven years, including experience with children with clubfoot. Two raters were pediatric orthopedic surgeons and two were senior physiotherapists. None of the raters had previous experience with the CAP system.
Two assessors with extensive CAP experience, a physiotherapist and developer of the CAP (HA) and a pediatric orthopedic surgeon (GH), defined the most correct score for each child's performance on each item.
The recording procedure was standardized and comparable with the situation in a daily clinical environment. The children were recorded from the front and from behind while moving along a 10-meter pathway. The camera was positioned at a height of one meter, two meters from the beginning of the pathway. The children wore t-shirts and shorts or underwear and were barefoot. The children were asked whether they wanted to start with walking or running. All children started with running, followed by walking, toe walking, heel walking, one-leg hop and one-leg stance. Each performance was recorded as many times as necessary to allow an assessment comparable with real life. Each video sequence lasted about 4 minutes.
Three weeks before the first assessment session, all four raters received the CAPMotion Quality manual with the item criteria and a copy of the protocol form to be used during the rating session (see additional file 1). They were asked to study the manual and scoring system and to use this information during the assessment sessions.
Each rater individually assessed all 11 video-recorded children twice, with an interval of 4 to 6 weeks between the two sessions.
An introduction explaining the testing procedure was given prior to each assessment session: 1) a half-minute pause followed each video recording, with a short break after the fifth video; 2) the video could not be stopped or viewed in slow motion; 3) before each new video sequence, the raters received information only about the child's age and gender; and 4) both the left and right sides were to be rated. As a training session, the raters viewed and simultaneously rated video recordings of a child without a disability and a child treated for congenital clubfoot. Total testing time was approximately one hour and 15 minutes.
At a single meeting, the two experienced assessors (HA and GH) analyzed and discussed the same videos and defined the most correct rating for each side of each child. This was done before the four raters' first assessment session.
Both legs were rated, and the rating of each leg was treated as an individual observation in the statistical analyses.
Inter- and intra-rater reliability was calculated using weighted kappa (κ) statistics [1, 2] together with 95% confidence intervals. For the inter-rater analysis, the assessments from the first session were used. According to Altman, kappa values are interpreted as follows: < 0.20 poor agreement, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 good and > 0.80 very good. Because kappa values can become unstable under certain conditions, e.g. with a limited distribution of cell frequencies [6–8], the exact observed percentage agreement (Po) and the percentage agreement allowing a difference of one scoring level (Po-1) were also calculated. As the CAPMotion Quality domain has five scoring levels, we regarded Po ≥ 50% or Po-1 ≥ 80% as good.
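The agreement statistics described above can be sketched in plain Python. This is a minimal illustration, not the software used in the study: it assumes the five scoring levels are coded as the integers 0 to 4 (a hypothetical coding) and uses linear disagreement weights, because the text does not state which weighting scheme was applied (quadratic weights are another common choice). Confidence intervals are omitted, and all function names are hypothetical.

```python
LEVELS = [0, 1, 2, 3, 4]  # hypothetical integer coding of the five scoring levels


def weighted_kappa(a, b, levels=LEVELS):
    """Cohen's weighted kappa with linear disagreement weights (an assumption)."""
    n, k = len(a), len(levels)
    pos = {level: i for i, level in enumerate(levels)}
    # Observed joint proportions of (rater A score, rater B score)
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[pos[x]][pos[y]] += 1 / n
    # Marginal proportions for each rater
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Linear disagreement weight: 0 on the diagonal, 1 for maximal disagreement
    weight = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    d_obs = sum(weight[i][j] * observed[i][j] for i in range(k) for j in range(k))
    d_exp = sum(weight[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    if d_exp == 0:
        # Undefined when both raters use one and the same level only,
        # one of the situations in which kappa becomes unstable
        return float("nan")
    return 1 - d_obs / d_exp


def percent_agreement(a, b, tolerance=0):
    """Po with tolerance=0; Po-1 with tolerance=1 (within one scoring level)."""
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return 100 * hits / len(a)


def altman_label(kappa):
    """Altman scale as quoted in the text; values between bands go to the higher band."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


def good_percentage_agreement(po, po1):
    """Study threshold: Po >= 50% or Po-1 >= 80%."""
    return po >= 50 or po1 >= 80


rater_1 = [0, 0, 1, 1]  # toy scores, not study data
rater_2 = [0, 1, 1, 2]
kappa = weighted_kappa(rater_1, rater_2)
po, po1 = percent_agreement(rater_1, rater_2), percent_agreement(rater_1, rater_2, 1)
print(round(kappa, 3), altman_label(kappa), po, po1, good_percentage_agreement(po, po1))
# 0.333 fair 50.0 100.0 True
```

On this toy data the kappa is only "fair" although every pair of scores lies within one level, which illustrates why percentage agreement was reported alongside kappa.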
Item reliability was considered good when more than half of the assessment pairs had kappa values higher than 0.60 (good) and/or good percentage agreement. Item reliability was considered sufficient when, for more than half of the inter- and intra-rater comparisons, the kappa values ranged between 0.41 and 0.60 (moderate) and/or the percentage agreement was good.
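One possible reading of these decision rules is sketched below. It is an interpretation, not the authors' code: it assumes each comparison (a rater pair, or one rater's two sessions) counts toward a level if its kappa meets that level or its percentage agreement is good, that inter- and intra-rater comparisons are pooled, that "sufficient" means kappa ≥ 0.41 so that comparisons above 0.60 still count, and it uses the hypothetical label "insufficient" for items meeting neither rule. The `Comparison` type is likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """Agreement for one rater pair or one rater's two sessions (hypothetical type)."""
    kappa: float
    po: float   # exact agreement, %
    po1: float  # agreement within one scoring level, %

    def good_percentage(self):
        # Study threshold for good percentage agreement
        return self.po >= 50 or self.po1 >= 80


def classify_item(comparisons):
    """Return 'good', 'sufficient' or 'insufficient' for one item."""
    half = len(comparisons) / 2
    # A NaN (undefined) kappa compares False, so only percentage agreement counts
    n_good = sum(1 for c in comparisons if c.kappa > 0.60 or c.good_percentage())
    if n_good > half:
        return "good"
    n_sufficient = sum(1 for c in comparisons if c.kappa >= 0.41 or c.good_percentage())
    if n_sufficient > half:
        return "sufficient"
    return "insufficient"


# Toy values, not study results
item = [Comparison(0.72, 40, 70), Comparison(0.55, 45, 75), Comparison(0.30, 60, 90)]
print(classify_item(item))  # good: 2 of 3 comparisons meet the 'good' rule
```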
For both the inter- and intra-rater comparisons, the median difference and interquartile range (IQR) were calculated for each item (ordinal data), and the mean difference and its limits of agreement (LOA) were calculated for the Motion Quality domain score (interval data).
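A minimal sketch of these summaries using Python's standard `statistics` module. It assumes differences are taken as first rating minus second rating, that the domain score is an interval-scaled value per leg, and that the LOA are the conventional Bland-Altman 95% limits, mean difference ± 1.96 SD. The text does not specify the quartile method, so `method="inclusive"` is a choice here, and other software may report slightly different IQRs. Function names are hypothetical.

```python
import statistics


def item_median_iqr(scores_a, scores_b):
    """Median and interquartile range of paired differences for one ordinal item."""
    diffs = [x - y for x, y in zip(scores_a, scores_b)]
    q1, _, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    return statistics.median(diffs), (q1, q3)


def domain_limits_of_agreement(totals_a, totals_b):
    """Mean difference and Bland-Altman 95% limits of agreement for domain scores."""
    diffs = [x - y for x, y in zip(totals_a, totals_b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff, (mean_diff - 1.96 * sd, mean_diff + 1.96 * sd)


print(item_median_iqr([0, 1, 2, 3, 4], [2, 2, 2, 2, 2]))  # (0, (-1.0, 1.0))
```

The median and IQR are used for the items because their scores are ordinal, whereas the mean and LOA assume the domain score behaves as interval data.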
To evaluate whether there was a learning effect between the first and second sessions, the Po and Po-1 of each rater's scores against the criterion (the most correct scores defined by HA and GH) were used. A difference of more than 10% was set as the level for a real difference.
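The learning-effect check can be sketched as follows. This assumes integer-coded scores, reads the 10% threshold as 10 percentage points of agreement, and flags a change in either direction; these are interpretations, and the function names and toy data are hypothetical.

```python
def agreement_with_criterion(scores, criterion, tolerance=0):
    """Percentage of scores within `tolerance` levels of the criterion score."""
    hits = sum(1 for s, c in zip(scores, criterion) if abs(s - c) <= tolerance)
    return 100 * hits / len(scores)


def learning_effect(session_1, session_2, criterion, threshold=10.0):
    """Compare one rater's Po and Po-1 against the criterion between two sessions."""
    result = {}
    for name, tolerance in (("Po", 0), ("Po-1", 1)):
        first = agreement_with_criterion(session_1, criterion, tolerance)
        second = agreement_with_criterion(session_2, criterion, tolerance)
        change = second - first
        # 'More than 10%' read as more than 10 percentage points, either direction
        result[name] = {"session_1": first, "session_2": second,
                        "change": change, "real_difference": abs(change) > threshold}
    return result


criterion = [0, 1, 2, 3, 4, 2, 1, 0, 3, 4]  # toy data, not study scores
session_1 = [0, 1, 2, 3, 4, 0, 3, 2, 1, 2]
session_2 = [0, 1, 2, 3, 4, 2, 1, 1, 2, 3]
print(learning_effect(session_1, session_2, criterion)["Po"])
# {'session_1': 50.0, 'session_2': 70.0, 'change': 20.0, 'real_difference': True}
```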