Below is a list of the files that we have uploaded to BMC Research Notes to make our data available. Following several prior recommendations [12], we have also uploaded statistical code to allow replication of our results. The code is saved as Stata "do" files, but these can be opened in a text editor or in a word processing package such as Microsoft Word. The code is well annotated, sufficiently so, we hope, to allow non-Stata users to follow our logic. We created over 100 do files for the numerous papers associated with our learning curve studies. Publishing all of these do files would more likely lead to confusion than insight. We therefore selected a subset of representative analyses that we believe would allow any competent analyst to replicate our results. For example, we provide code for a sensitivity analysis that includes only surgeons whose career experience was at least 100 cases; this code is easily adapted for a sensitivity analysis that includes only surgeons with career experience of 250 or more cases.
Moreover, some of the code was originally written in a highly modular fashion, with kernels of code referenced by numerous different do files and with extensive routines for printing results in a readable form (e.g. rounding p values). Both features can make our programming difficult to follow. Accordingly, we simplified the code for publication, removing the routines that format output and, in some cases, duplicating code across do files. We also wrote new code to deidentify the data set.
We estimate that the total time taken to prepare the data and code for publication was 8 hours. While far from trivial, this constitutes a small fraction of the effort spent on the data set over the past five years. Moreover, this estimate must be seen as higher than typical, given that the code involved covered so many different papers.
The data have been uploaded in both Stata format and a raw format that can be read by most software (it can be opened directly in Microsoft Excel, for example). These two files are named "master learning curve data set deidentified" with ".dta" and ".raw" extensions, respectively. "Variable labels.pdf" describes each variable in the data set [see Additional files 1, 2 and 3]. A description of each do file is as follows:
1. 01 deidentify data learning curve.do [Additional file 4]
This do file takes the data set with identifying information and saves a new data set without any identifying information. This includes removing patient and surgeon identifiers and replacing them with anonymous identifiers, removing dates, and ensuring that patient age cannot identify individuals. Before the deidentified data set is saved, a data set containing both the true and anonymous patient and surgeon identifiers is saved; this data set is not published but is retained by the primary investigators so that they can address any data enquiries about individual patients.
2. 02 primary analysis bcr learning curve.do [Additional file 5]
This do file performs the primary analysis of the learning curve for biochemical recurrence[6]. This is an example of the code to produce a learning curve for a survival-time outcome. A multivariable analysis is performed to obtain the adjusted p-value for the association between surgeon experience and outcome; the adjusted 5-year predicted probability of freedom from biochemical recurrence is plotted against surgeon experience; and the central estimates for 10 and 250 prior cases are displayed.
3. 03 bootstrap ci for difference in 10 vs 250 bcr learning curve.do [Additional file 6]
This do file uses bootstrap resampling to construct a 95% confidence interval for the difference in adjusted 5-year probability of biochemical recurrence for a patient treated by a surgeon with 10 vs 250 prior cases[6]. The output from the bootstrap resampling is saved as a Stata data set "output bootstrap ci for difference 10 vs 250 learning curve.dta". This is an example of code where bootstrap resampling is used to obtain confidence intervals for an estimate whose sampling distribution is unknown. The code could be modified easily for another estimate of interest, for example, the difference in adjusted probability of positive surgical margins for a patient treated by a surgeon with 10 vs 250 prior cases.
4. 04 sensitivity analysis patients treated after 1995 bcr learning curve.do [Additional file 7]
This do file performs the same analysis as done in "02 primary analysis bcr learning curve.do", except that the cohort is restricted to patients treated after 1995, when stage migration related to the advent of PSA screening appeared to be largely complete[6]. This is an example of the code where a specific group of patients is included, and another group excluded. This code could be modified easily to restrict the analysis to a different subgroup, for example, patients with low risk disease.
5. 05 sensitivity analysis surgeons with at least 100 total cases bcr learning curve.do [Additional file 8]
This do file performs the same analysis as done in "02 primary analysis bcr learning curve.do", except that the cohort is restricted to surgeons who completed at least 100 total cases. This sensitivity analysis was performed to confirm that the relationship between surgeon experience and outcome was not confounded by the ability of individual surgeons to attract patients (i.e., a less capable surgeon who was unable to establish a practice would therefore contribute to the beginning but not the end of the learning curve)[6]. This is an example of the code where only patients treated by a specific group of surgeons are included. This code could be modified easily to restrict the analysis to patients treated by a different group of surgeons, for example, surgeons who completed at least 250 total cases.
6. 06 separately by postoperative risk bcr learning curve.do [Additional file 9]
This do file performs the primary analysis of the learning curve for biochemical recurrence separately by pathologic stage[8]. This is an example of the code to produce a learning curve separately for different subgroups of patients. A multivariable analysis is performed to obtain the adjusted p-value for the association between surgeon experience and outcome separately for those with organ-confined and non-organ-confined disease; the adjusted 5-year predicted probability of freedom from biochemical recurrence is plotted against surgeon experience separately by pathologic stage; and the central estimates for 10 and 250 prior cases are displayed. This code could be modified to obtain separate learning curves for subgroups defined in other ways, for example, patients treated by fellowship vs. non-fellowship trained surgeons.
7. 07 surgical margins learning curve.do [Additional file 10]
This do file performs the primary analysis of the learning curve for surgical margins[10]. This is an example of the code to produce a learning curve for a binary outcome. A multivariable analysis is performed to obtain the adjusted p-value for the association between surgeon experience and outcome; the adjusted predicted probability of positive surgical margin is plotted against surgeon experience; and the central estimates for 10 and 250 prior cases are displayed. This code could be modified easily to restrict the analysis to a particular subgroup of patients.
8. 08 heterogeneity in bcr by surgeon.do [Additional file 11]
This do file fits a multivariable random-effects model to evaluate heterogeneity between surgeons in biochemical recurrence outcomes after adjustment for case-mix and surgeon experience. The random-effects variance, 95% confidence interval, and p-value are displayed[9]. This is an example of the code to determine whether heterogeneity exists between surgeons, and could be modified easily for different types of outcomes (for example, a binary outcome such as positive surgical margins) or different subgroups of patients.
9. 09 forest plot bcr by surgeon.do [Additional file 12]
This do file obtains the adjusted 5-year predicted probability of freedom from biochemical recurrence for each surgeon, obtains a combined estimate across all surgeons using meta-analytic methods, and shows the probabilities and 95% confidence intervals for each surgeon as a forest plot[9]. This code could be modified easily for different types of outcomes or different subgroups of patients.
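For readers who do not use Stata, the deidentification steps described for "01 deidentify data learning curve.do" can be sketched in Python. This is an illustration under stated assumptions, not the published do file: the field names (`patient_id`, `surgeon_id`, `surgery_date`, `age`), the random assignment of anonymous identifiers, and the use of an age cap of 80 to keep age from identifying individuals are all hypothetical choices made for the example.

```python
import random


def deidentify(records, id_fields=("patient_id", "surgeon_id"),
               date_fields=("surgery_date",), age_field="age",
               age_cap=80, seed=None):
    """Return (deidentified_records, crosswalk).

    Field names and the age cap are assumptions for illustration only.
    """
    rng = random.Random(seed)

    # Build one true-id -> anonymous-id map per identifier field.
    # Shuffling first means the anonymous number carries no ordering information.
    crosswalk = {}
    for field in id_fields:
        true_ids = sorted({rec[field] for rec in records})
        rng.shuffle(true_ids)
        crosswalk[field] = {true: anon
                            for anon, true in enumerate(true_ids, start=1)}

    deidentified = []
    for rec in records:
        clean = {}
        for key, value in rec.items():
            if key in date_fields:
                continue  # dates are removed entirely
            if key in id_fields:
                clean["anon_" + key] = crosswalk[key][value]
            elif key == age_field:
                clean[key] = min(value, age_cap)  # top-code extreme ages
            else:
                clean[key] = value
        deidentified.append(clean)
    return deidentified, crosswalk


# Made-up records, not taken from the study data set.
records = [
    {"patient_id": "P-1001", "surgeon_id": "S-A",
     "surgery_date": "1999-03-02", "age": 61, "bcr": 0},
    {"patient_id": "P-1002", "surgeon_id": "S-B",
     "surgery_date": "2001-07-15", "age": 84, "bcr": 1},
    {"patient_id": "P-1003", "surgeon_id": "S-A",
     "surgery_date": "2003-11-30", "age": 57, "bcr": 0},
]
deidentified, crosswalk = deidentify(records, seed=1)
```

In this sketch `crosswalk` plays the role of the unpublished data set that links true and anonymous identifiers; only `deidentified` would be shared.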
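The bootstrap approach described for "03 bootstrap ci for difference in 10 vs 250 bcr learning curve.do" can likewise be illustrated with a short Python sketch. It is not the published Stata code, and it simplifies in ways that should be read as assumptions: the statistic is a crude, unadjusted difference in recurrence proportions rather than a difference in model-adjusted 5-year probabilities, individual patient records are resampled, a percentile interval is used, and the data are invented for the example.

```python
import random


def bootstrap_ci(data, estimate, n_boot=2000, alpha=0.05, seed=2009):
    """Percentile bootstrap confidence interval for estimate(data)."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Resample n records with replacement and re-compute the statistic.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimate(resample))
    replicates.sort()
    tail = int(n_boot * alpha / 2)  # number of replicates cut from each tail
    return replicates[tail], replicates[n_boot - 1 - tail]


def risk_difference(sample):
    """Crude recurrence proportion: low-experience minus high-experience.

    Stand-in for the adjusted difference; the cut-off of 50 prior cases
    is hypothetical. Assumes both groups are present in the sample.
    """
    early = [event for prior_cases, event in sample if prior_cases < 50]
    late = [event for prior_cases, event in sample if prior_cases >= 50]
    return sum(early) / len(early) - sum(late) / len(late)


# Invented toy data: (surgeon's prior cases, recurrence indicator).
toy_data = ([(10, 1)] * 18 + [(10, 0)] * 42 +
            [(250, 1)] * 9 + [(250, 0)] * 51)

point = risk_difference(toy_data)
lower, upper = bootstrap_ci(toy_data, risk_difference)
print(f"difference = {point:.3f}, 95% CI {lower:.3f} to {upper:.3f}")
```

Because `estimate` is passed in as a function, the same routine could be reused for another quantity of interest, such as a difference in the probability of positive surgical margins, by supplying a different function.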
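The combined estimate described for "09 forest plot bcr by surgeon.do" is obtained "using meta-analytic methods", without the specific method being stated here. As one possible reading, the Python sketch below uses fixed-effect inverse-variance weighting on the probability scale; that choice, the function name `pooled_estimate`, and the two example surgeons are assumptions for illustration. A random-effects approach would weight surgeons differently and generally give a wider interval.

```python
import math


def pooled_estimate(estimates, std_errors, z=1.96):
    """Fixed-effect inverse-variance pooled estimate with a 95% interval.

    Each surgeon's estimate is weighted by 1 / SE^2, so more precisely
    estimated surgeons contribute more to the combined value.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se


# Invented per-surgeon probabilities of freedom from recurrence and
# their standard errors; not taken from the study.
surgeon_estimates = [0.80, 0.90]
surgeon_std_errors = [0.10, 0.05]

pooled, ci_low, ci_high = pooled_estimate(surgeon_estimates,
                                          surgeon_std_errors)
print(f"pooled = {pooled:.3f}, 95% CI {ci_low:.3f} to {ci_high:.3f}")
```

In this example the second surgeon has half the standard error and therefore four times the weight of the first, which pulls the combined estimate toward 0.90.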