Sampling
A representative sample of tuberculosis case registers was selected based on the list of all 140 management units in the public (governmental) sector in Cambodia and all 668 units in Viet Nam. A random selection of 30 units from each list was made by an independent collaborator. From each of these randomly selected management units, the Tuberculosis Case Register (henceforth the "case register") for two full calendar years was taken. The two-year period could begin no earlier than 1 January 2003 and end no later than 31 December 2005.
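The selection step can be illustrated with a minimal sketch. It assumes simple random sampling without replacement from a numbered list of units; the unit numbering, the function name `select_units`, and the seed are hypothetical and are not taken from the study.

```python
import random

def select_units(unit_ids, n=30, seed=None):
    """Draw a simple random sample of n management units without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(list(unit_ids), n))

# Hypothetical numbering: units are labelled 1..N in each country list.
cambodia_units = range(1, 141)   # 140 management units
viet_nam_units = range(1, 669)   # 668 management units

cambodia_sample = select_units(cambodia_units, seed=2003)
viet_nam_sample = select_units(viet_nam_units, seed=2003)

print(len(cambodia_sample), len(viet_nam_sample))
```

Recording the seed alongside the sampling frame would allow a third party to repeat the draw.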
Study approval
Because the study was retrospective and record-based and no patient names were captured, each country decided to require only administrative approval. Ethical approval for the study was obtained from The Union Ethics Advisory Group.
Data entry form and capture
The electronic data collection instrument was prepared with EpiData Entry (Version 3.1, freely available at http://www.epidata.dk). All variables recommended in the forms proposed by the International Union Against Tuberculosis and Lung Disease (The Union) [1] were captured, except for the name and address of the patient.
The data entry form was designed to be as efficient as possible: the length of each field was kept to the minimum required, so that in most instances the cursor advanced automatically to the next field without a hard carriage return. For instance, the case register has seven dates (the date registered, the treatment start date, the dates of bacteriologic examination at diagnosis and after 2(3), 5 and 7 months, and the treatment result date). Each of these was entered as three separate variables for day, month, and year, with the computer calculating an exact date if all date components were known or an approximate date otherwise.
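The date logic can be sketched as follows. The rule used to approximate an incomplete date is not stated here, so the mid-month and mid-year substitutions below are assumptions made for illustration only, as are the names `ComposedDate` and `compose_date`.

```python
from datetime import date
from typing import NamedTuple, Optional

class ComposedDate(NamedTuple):
    value: Optional[date]  # None when no date can be formed at all
    exact: bool            # True only when day, month and year were all entered

def compose_date(day, month, year):
    """Combine separately entered components into an exact or approximate date.

    Assumed approximation rule (hypothetical): a missing day becomes the 15th,
    a missing month becomes 1 July, and a missing year yields no date.
    """
    if year is None:
        return ComposedDate(None, False)
    if month is None:
        return ComposedDate(date(year, 7, 1), False)
    if day is None:
        return ComposedDate(date(year, month, 15), False)
    return ComposedDate(date(year, month, day), True)

print(compose_date(5, 3, 2004))     # all components known
print(compose_date(None, 3, 2004))  # day missing
```

Keeping the `exact` flag next to the value preserves the distinction between calculated exact and calculated approximate dates. An impossible combination such as 31 February raises `ValueError` in this sketch.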
This approach served a triple purpose. Firstly, the year remained unchanged for about half of the records of the 2-year period and could thus be set to automatic repetition in the next record, requiring only confirmation with a single keystroke rather than re-entry of the four digits. Secondly, automatically calculated exact dates (when all three date components were available) could be distinguished from automatically calculated approximate dates (when fewer date components were available). Thirdly, it circumvented errors arising from differences in date-writing style among collaborating countries. Efficiency was further improved by using numeric coding coupled with labels, so that many variables required a field length of just 1 (e.g., sex of patient, intensive and continuation phase definitions, disease category and site, and treatment outcome). Thus, while 37 variables had to be entered, the sum of field lengths was only 111, and many fields (the repeat fields) required only a single keystroke despite a length of 4.
To allow validation of duplicate files, a unique identifier was automatically composed for each record from the tuberculosis unit number (unique within one calendar year for a given unit), the registration year, the code of the treatment unit, and the country.
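A minimal sketch of such a composite identifier is given below. The four components are those named above; their order, the separator, the zero-padded widths, and the function name `make_record_id` are hypothetical choices.

```python
def make_record_id(country, unit_code, reg_year, tb_number):
    """Compose a record identifier from its four components.

    The layout (order, hyphens, padding widths) is an illustrative assumption.
    """
    return f"{country}-{unit_code:03d}-{reg_year:04d}-{tb_number:04d}"

record_id = make_record_id("VN", 12, 2004, 37)
print(record_id)  # VN-012-2004-0037
```

Because the tuberculosis unit number is unique only within one calendar year and one unit, all four components are needed for the identifier to be unique across the pooled data.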
In Viet Nam, data were double-entered by different, independent data entry persons. The two completed files from each tuberculosis treatment unit were sent to the country coordinator, who compared them to identify any discordance between values for any variable in every pair of records. The files were validated in a single step in EpiData Entry, which matches each record of the first set with the corresponding record of the second set on the unique identifier. The generated report lists every record with at least one discordance and shows every discordant field for that record. Because any discordance results from a data entry error in either of the two records of a pair, and it cannot be known in advance which set has fewer errors, it was arbitrarily decided that a copy of the first set should always be the one in which corrections were made. Every error, as ascertained by checking against the original physical record, was corrected in this copy, which was saved as the final dataset. As a result, three sets of files were available, allowing the data validation process to be reproduced. In Cambodia, the same system was used per protocol, but human resource constraints precluded separate data entry teams, and task switching within the team was done at the team's discretion. Validation of files remained the responsibility of the national coordinator.
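The comparison of the two entry sets can be illustrated with a short sketch. This is not EpiData Entry's implementation; it only mirrors the described behaviour (match on the unique identifier, report every discordant field) using plain dictionaries. The field names and the function `compare_entries` are hypothetical.

```python
def compare_entries(first, second, key="record_id"):
    """Return {identifier: {field: (value_in_first, value_in_second)}}
    for every record pair with at least one discordant field."""
    first_by_id = {rec[key]: rec for rec in first}
    second_by_id = {rec[key]: rec for rec in second}
    report = {}
    for rid in sorted(first_by_id.keys() | second_by_id.keys()):
        a, b = first_by_id.get(rid), second_by_id.get(rid)
        if a is None or b is None:
            # The record was entered in only one of the two sets.
            report[rid] = {"<record>": ("present" if a is not None else "missing",
                                        "present" if b is not None else "missing")}
            continue
        diffs = {f: (a.get(f), b.get(f))
                 for f in sorted(a.keys() | b.keys())
                 if a.get(f) != b.get(f)}
        if diffs:
            report[rid] = diffs
    return report

first = [{"record_id": "VN-012-2004-0037", "sex": 1, "outcome": 1},
         {"record_id": "VN-012-2004-0038", "sex": 2, "outcome": 3}]
second = [{"record_id": "VN-012-2004-0037", "sex": 1, "outcome": 1},
          {"record_id": "VN-012-2004-0038", "sex": 2, "outcome": 2}]

report = compare_entries(first, second)
print(report)  # {'VN-012-2004-0038': {'outcome': (3, 2)}}
```

The report says only that the two entries disagree; which value is correct still has to be settled against the original physical record.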
In addition to study variables, data entry time was computer-generated and written to an access-blocked field, recording the time elapsed between opening a new record and completing entry of the value for the last field, immediately before the record was saved to disk. This field provided the basis for estimating data entry cost, and it also offered a means of checking whether data were truly double-entered. In every file there were isolated records with artificially long entry times because the data entry person had been interrupted before reaching the end of the record. The design ensured that later corrections to a record did not change the originally recorded data entry time.
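Because a few interrupted records inflate the arithmetic mean, one possible way to summarise entry time is the median, which is insensitive to isolated extreme values. The sketch below is an assumption for illustration, not the summary used in the study, and the times shown are made-up values.

```python
from statistics import mean, median

# Hypothetical entry times in seconds; one record was interrupted.
entry_seconds = [62, 58, 71, 65, 1900, 60]

typical = median(entry_seconds)  # robust to the interrupted record
naive = mean(entry_seconds)      # pulled upward by the single long record

print(typical)  # 63.5
print(naive)
```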
Data analysis
All analyses were done using EpiData Analysis (Version 2.2.1.171, freely available at http://www.epidata.dk). The 120 original sets and the 60 final datasets were combined separately into two sets for analysis, with new variables and subsets of the dataset defined as required. Point estimates are shown with 95% confidence intervals for means or proportions where appropriate.
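As an illustration of a 95% confidence interval for a proportion, the sketch below uses the Wilson score interval. The interval method used by EpiData Analysis is not stated here, so the choice of Wilson is an assumption, and the function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, level=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

low, high = wilson_interval(50, 100)
print(round(low, 3), round(high, 3))
```

Unlike the simple normal approximation, the Wilson interval stays within 0 and 1 when the observed proportion is close to either bound.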