The main aim of the Medical Science Olympiad in Iran is to test creative and critical thinking in medical students. The specific objectives of the Olympiad are: identifying scientifically talented individuals; motivating and encouraging scientifically talented individuals; orienting extracurricular scientific activities; generating scientific liveliness and morale; promoting interuniversity cultural exchanges; encouraging creative and critical thinking; reinforcing health system goals and objectives; encouraging teamwork; and encouraging interdisciplinary activities.
The first Olympiad, held in Isfahan in 2009, and the second in Shiraz in 2010, comprised a separate examination in each of three areas: basic science, clinical science and health system management. All currently enrolled medical students with a grade point average of 16/20 (equivalent to a GPA of about 3.2 in the USA or a UK Class of about 60) or higher were eligible to register for the test. Then they prepared for the test by completing an intensive training course in the area of their choice at their own university. After this course enrollees were tested for critical thinking and reasoning skills at their university, and only those with the highest grades were then allowed to participate in the national Olympiad. Iran has 46 medical universities and each university is allowed to send only 3 students in each of the three areas to the Olympiad.
In the second Olympiad, 45 medical universities sent examinees for the areas of basic science and clinical science, and 44 medical universities sent examinees for the area of health system management. A total of 135 students took the test in basic science, 135 students were tested in clinical science, and 131 students took the test for the management area. In this study we analyzed the results only for the examination in the clinical science area. Only undergraduate students were allowed to participate in the Olympiad because of the importance of clinical reasoning skills at an early stage of medical education and the need for efficient tools to assess them.
Development of the clinical reasoning tests
An expert committee with members from all Iranian medical schools was constituted and charged with developing a bank of test items in emergency medicine for all four clinical reasoning tests (i.e., KF, SCT, CRP and CIP). The committee used the methodology described in previous publications [6–18]. Some examples of these tests are provided in additional file 1.
Development of the Olympiad examination by the reference panel
To prepare the examination to be used in the Olympiad, a total of 15 experts from different medical universities in Iran were chosen to constitute the reference panel. These experts comprised a broad sample of internists, general surgeons and emergency medicine specialists with different levels of experience and training, and were therefore considered to represent a normative sample of the reference population. Each member of the reference panel took each of the four tests and identified test items that were confusing or not relevant to emergency medicine. As a result, a few minor changes were made in the wording of some items. Then 20 KF items, 20 SCT items, 10 CRP items and two 4 × 6 matrices from the CIP were chosen for inclusion in the full 2-day Olympiad examination. On the morning of the first day the 20 KF items were completed, and in the afternoon the 10 CRP items were completed. On the morning of the second day the 20 SCT items were completed, and in the afternoon the two CIP matrices were completed. Each of the four examination periods lasted 4 hours.
The examinees in the second Olympiad were 135 undergraduate medical students from 45 medical schools in Iran, with a grade point average of 16/20 or higher. The length of medical education in Iran is 7 years. Of the participants, 57.8 percent were female and 42.2 percent were male. The mean year of study was 6.1 years, the mean age was 24.3 years and the mean grade point average was 17.6 out of 20.
A group of 22 general practitioners and first-year residents were asked to complete all Olympiad examination items in their own time without using textbooks, web sites or personal consultations. General practitioners and first-year residents were recruited for this group because of their experience with a wide range of clinical problems encompassing all areas of emergency medicine practice. The scores obtained by these examinees were used as a standard reference.
To enhance the discriminating power of this score, we also calculated the efficiency score (partial credit score).
For high-stakes SCT examinations a reference group of more than 20 members is required; as noted above, our reference group consisted of 22 physicians. Because of issues with aggregated scoring such as greater random error, we used a scoring method based on the average expert response on a five-point Likert scale, with credit weighted by distance from the correct answer. The mean response was considered the correct answer, and the weight for other responses was determined based on their credit and distance from the correct answer. With this scoring system the credit for the best answer was 100%, and credit for other answers was calculated based on the percentage of reference panel examinees who chose that answer. We used the formula 1/(1 + x), where x is defined as the distance from the correct answer (for answers other than the correct one, x ranged from a minimum of 1 to a maximum of 4). This innovative scoring system was devised in the light of an analysis by Bland et al. and consultation with a mathematician familiar with that research.
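The distance-based part of this scoring rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sct_item_credit`, the 1 to 5 coding of the Likert scale, and the rounding of the mean reference response to the nearest scale point are all assumptions made for illustration, and the sketch does not model any additional weighting by the percentage of reference examinees who chose each answer.

```python
import math


def sct_item_credit(response: int, reference_responses: list[int]) -> float:
    """Credit for one SCT item, using the 1 / (1 + x) distance rule.

    Responses are coded 1..5 here (an assumption; only distances matter).
    """
    mean = sum(reference_responses) / len(reference_responses)
    # Assumption: the "correct answer" is the mean reference response
    # rounded half-up to the nearest point on the scale.
    correct = math.floor(mean + 0.5)
    x = abs(response - correct)  # distance from the correct answer, 0..4
    return 1.0 / (1.0 + x)


# Hypothetical reference responses for a single item (not data from the study).
reference = [4, 4, 5, 3, 4]
credits = {r: sct_item_credit(r, reference) for r in range(1, 6)}
```

Under these assumptions an answer that matches the correct answer earns full credit (x = 0), an answer one scale point away earns half credit, and an answer at the maximum distance of 4 earns one fifth.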
The first and second diagnoses and diagnostic features chosen for each item by reference group examinees were input into a table, and the diagnoses and nominated features that were chosen by at least two thirds of the reference group were considered the correct answers.
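The two-thirds consensus rule for building the CRP answer key could be sketched as follows. The function name `consensus_answers` and the example labels are hypothetical, and the sketch assumes each reference examinee's choices for an item are available as a set of labels.

```python
from collections import Counter


def consensus_answers(reference_choices: list[set[str]]) -> set[str]:
    """Return the choices selected by at least two thirds of the reference group."""
    n = len(reference_choices)
    counts = Counter(choice for choices in reference_choices for choice in choices)
    # Integer comparison (3 * k >= 2 * n) avoids floating-point issues at the boundary.
    return {choice for choice, k in counts.items() if 3 * k >= 2 * n}


# Hypothetical choices from six reference examinees for one item.
reference_choices = [
    {"diagnosis A", "diagnosis B"},
    {"diagnosis A", "diagnosis B"},
    {"diagnosis A", "diagnosis B", "diagnosis C"},
    {"diagnosis A", "diagnosis B"},
    {"diagnosis A", "diagnosis C"},
    {"diagnosis D"},
]
key = consensus_answers(reference_choices)
```

Here "diagnosis A" (5 of 6) and "diagnosis B" (4 of 6) reach the two-thirds threshold, whereas "diagnosis C" (2 of 6) and "diagnosis D" (1 of 6) do not.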
Examinees' scores were calculated from a matrix of answers given by the reference panel. For each of the 4 columns of cells in the matrix, 4 correct answers out of 4 (4/4) was scored as 100%, 3/4 as 75%, 2/4 as 50% and 1/4 as 0%. The grade for an entire matrix was the sum of the grades for all six rows, and the grade for the CIP examination was the sum of the two matrix grades.
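One way to read this CIP scoring rule is sketched below in Python. The sketch assumes that each of the six rows of a matrix contains four cells, that a row is graded by the number of cells matching the reference key, and that a row with no correct cells scores 0% (the text does not state this case). The names `ROW_CREDIT` and `cip_matrix_grade` and the cell labels are hypothetical.

```python
# Credit per row by number of correct cells; the value for 0 correct is an assumption.
ROW_CREDIT = {4: 1.00, 3: 0.75, 2: 0.50, 1: 0.00, 0: 0.00}


def cip_matrix_grade(answers: list[list[str]], key: list[list[str]]) -> float:
    """Sum the row grades of one CIP matrix against the reference key."""
    total = 0.0
    for row_answers, row_key in zip(answers, key):
        n_correct = sum(a == k for a, k in zip(row_answers, row_key))
        total += ROW_CREDIT[n_correct]
    return total


# Hypothetical key: six rows of four cells each.
key = [[f"r{r}c{c}" for c in range(4)] for r in range(6)]
answers = [row[:] for row in key]
answers[1][0] = "wrong"                          # row 1: 3 of 4 correct
answers[2][0] = answers[2][1] = "wrong"          # row 2: 2 of 4 correct
answers[3][0] = answers[3][1] = answers[3][2] = "wrong"  # row 3: 1 of 4 correct

matrix_grade = cip_matrix_grade(answers, key)
```

With this hypothetical answer sheet the six row grades are 1.00, 0.75, 0.50, 0.00, 1.00 and 1.00, giving a matrix grade of 4.25 out of 6; the CIP grade would then be the sum of two such matrix grades.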
Total exam scores
The total exam score was the sum of the grades on the four tests, so each test counted for 25 percent of the total grade. The expert committee considered 10 CRP items comparable to 20 KF items or 20 SCT items because in the CRP students must choose two diagnoses and list the features of the case based on these two diagnoses. Because of the complexity of the puzzles, the committee likewise considered two 4 × 6 CIP puzzles comparable to 20 KF or 20 SCT items. As mentioned earlier, the same examination time (four hours) was allotted to each of the four tests.
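The equal weighting of the four tests can be sketched as follows, assuming each test grade has first been expressed as a percentage of its maximum possible score. The function name `total_exam_score` and the example values are hypothetical.

```python
TESTS = ("KF", "SCT", "CRP", "CIP")


def total_exam_score(test_percentages: dict[str, float]) -> float:
    """Combine four test grades (each 0..100) with equal 25% weights."""
    if set(test_percentages) != set(TESTS):
        raise ValueError(f"expected grades for exactly these tests: {TESTS}")
    return sum(0.25 * test_percentages[test] for test in TESTS)


# Hypothetical grades for one examinee (not data from the study).
example_total = total_exam_score({"KF": 80.0, "SCT": 60.0, "CRP": 70.0, "CIP": 90.0})
```

For these example grades the total is 75.0.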
We measured item difficulty for each test, and determined the reliability of the scoring method for each test. The reliability of each test was calculated with Cronbach's alpha, considering each item individually, and the combined reliability for all four clinical reasoning tests was calculated using the variances of the scores in each test and the total exam variance. Item difficulty was determined with the method of Whitney and Sabers, and correlations between the total examination score and scores for each item were calculated with Pearson's correlation coefficient for each of the four clinical reasoning tests. The correlation between the total score and scores on each of the four tests was also calculated, along with the correlation between the total score on the Olympiad and the student's university course grade point average. We sought informed consent from participants and ethical approval for our study from the Olympiad clinical domain.
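For readers unfamiliar with Cronbach's alpha, the standard formula is alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. A minimal Python sketch is given below; the function name `cronbach_alpha` and the example scores are hypothetical, and this is not the software used in the study. Passing one column per test instead of one column per item is one possible reading of the combined reliability calculation described above.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha, where scores[i][j] is examinee i's score on item j."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two examinees")
    # Sample variance of each item across examinees.
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each examinee's total score.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores: three examinees, two perfectly consistent items.
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```

In this toy example the two items rank the examinees identically, so alpha equals 1; real examination data would yield lower values.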