Overview of steps for generating synthetic data
We have written the dsSynthetic R package and its partner dsSyntheticClient (hereafter dsSynthetic) to help users write code for data that are available in a DataSHIELD setting. A tutorial with executable code covering the examples in this text is available at:
https://tombisho.github.io/synthetic_bookdown
We describe the steps for generating synthetic data below:
1. The data custodian uploads the raw data to the server side and installs the server-side package dsSynthetic.
2. The user installs the packages dsBaseClient and dsSyntheticClient on the client side.
3. The user calls functions in the dsSyntheticClient package to make a synthetic but non-disclosive data set available on the client side. The synthetic data can be generated using methods from: a) synthpop, where the synthetic data are generated on the server side and returned to the client, or b) simstudy, where non-disclosive summaries of the variables and of the relationships between them are computed on the server side and returned to the client, and the synthetic data are then generated on the client side.
4. With the synthetic data on the client side, the user can view the data and write code, and can see the results of that code for the whole synthetic dataset.
5. When the code is complete, it can be run on the server using the real data.
The two variations in step 3 (generation via synthpop or via simstudy) are shown in Fig. 1.
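The simstudy-style variation, in which only summaries leave the server, can be illustrated with a minimal sketch. dsSynthetic is an R package and its functions are not reproduced here. The Python below is an illustration only: the function names (`summarise`, `synthesise`) and the toy records are hypothetical, and it assumes that two variables are adequately described by their means, standard deviations and correlation (a bivariate normal model).

```python
import math
import random
import statistics


def summarise(rows):
    """Hypothetical server-side step: return aggregates only, never the rows."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    # Pearson correlation computed by hand to avoid version-specific helpers.
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return {
        "n": len(rows),
        "mean": (mx, my),
        "sd": (statistics.stdev(xs), statistics.stdev(ys)),
        "corr": sxy / math.sqrt(sxx * syy),
    }


def synthesise(summary, n, seed=None):
    """Hypothetical client-side step: draw n records from the summary alone."""
    rng = random.Random(seed)
    (m1, m2), (s1, s2) = summary["mean"], summary["sd"]
    rho = summary["corr"]
    # Guard against rounding pushing 1 - rho^2 slightly below zero.
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Correlated normal draws rescaled to the reported means and sds.
        rows.append((m1 + s1 * z1, m2 + s2 * (rho * z1 + resid * z2)))
    return rows


# Invented (height in cm, weight in kg) records standing in for the real data.
real_rows = [(160.0, 55.0), (165.0, 62.0), (170.0, 68.0),
             (175.0, 72.0), (180.0, 80.0), (185.0, 85.0)]

summary = summarise(real_rows)                      # aggregate-only result
synthetic_rows = synthesise(summary, 5000, seed=1)  # generated on the client
```

In this sketch `synthesise` never sees the individual records, only the summary, which mirrors the reason given later in the text for why this route adds no disclosure risk beyond the summaries themselves.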
The computational steps are outlined below.
Using synthetic data to build a DataSHIELD analysis script
A key use-case for dsSynthetic is to aid analysts in writing DataSHIELD analysis scripts in the standard DataSHIELD scenario where the real data cannot be fully accessed. We assume that the data sets are ready for analysis at different sites, the dsSynthetic package is installed on the servers and the dsSyntheticClient package is installed on the analyst’s computer (client). This use-case is demonstrated in Fig. 2.
1. Generate the synthetic data so that they are available on the client (this could be data from multiple servers).
2. Load the data into DSLite [8]. To test our DataSHIELD script locally, we need to replicate on the client the server environment(s), which only accept DataSHIELD function input. This is provided by the DSLite package. A DSLite instance exists in the user’s local R session and behaves in the same way as a server, except that it allows full access to the data held on it. The script can therefore be developed against the DSLite instance, and at any time the user can view what is happening on DSLite. This makes it easier for the user to correct any problems.
3. Once finished, the script is run on the real data on the remote server(s). This step should now run smoothly, as problems with the code will have been identified when working with the synthetic data.
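The reasoning behind this workflow can be sketched independently of DSLite. The Python below is not the DSLite or DataSHIELD API: the class names, the `peek` method and the toy records are all hypothetical. It only illustrates the pattern of developing a script against a local instance that shares the server's restricted interface but also permits inspection, and then running the unchanged script against the restricted server.

```python
import statistics


class RestrictedServer:
    """Hypothetical stand-in for a remote server: aggregate results only."""

    def __init__(self, table):
        self._table = table  # list of dicts; not exposed to the analyst

    def mean(self, column):
        return statistics.fmean([row[column] for row in self._table])

    def n_rows(self):
        return len(self._table)


class LocalServer(RestrictedServer):
    """Hypothetical local instance: same interface, plus full data access."""

    def peek(self, n=5):
        return self._table[:n]


def analysis_script(server):
    """Script under development; uses only the shared aggregate interface."""
    return {
        "n": server.n_rows(),
        "mean_height_m": server.mean("height_m"),
    }


# Invented synthetic records held locally, and invented "real" records.
synthetic = [{"height_m": 1.6}, {"height_m": 1.7}, {"height_m": 1.8}]
real = [{"height_m": 1.55}, {"height_m": 1.75}]

local = LocalServer(synthetic)
first_rows = local.peek(2)             # inspecting rows is possible locally
local_result = analysis_script(local)  # develop and debug here

remote = RestrictedServer(real)
remote_result = analysis_script(remote)  # same script, unchanged
```

Because `analysis_script` relies only on methods that both classes share, anything that works against the local instance should also be accepted by the restricted one.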
Using synthetic data to write harmonisation code
Prior to analysis with DataSHIELD, data must be harmonised so that variables have the same definition across datasets from all sites [9]. Common solutions to achieving harmonisation are:
1. Have each data custodian harmonise their data and host it on the server. This requires a lot of coordination across groups and there can be a lack of visibility of how harmonisation was achieved.
2. Make a one-off transfer to a central group, which does the harmonising work before the data are returned to the custodian for the analysis phase by multiple parties. This suffers from the same challenges as a data transfer for analysis.
To avoid these challenges, a third way is for a central group to write harmonisation code in Opal [10], which can be used as the hosting data warehouse on each of the participating servers. This code acts on the raw data on each server to generate datasets that contain harmonised variables, that is, variables on common scales and measures (e.g. all measures of height in metres). To eliminate the need for a data transfer or a full data access agreement, the users log into Opal to write the harmonisation code. This means they may not have full access to the data, only to summary statistics.
Again, this makes it challenging to write and test the code. A further complication is that Opal requires the code to be written in MagmaScript, which is based on JavaScript. This language is generally unfamiliar to users who do not have a background in software development. dsSynthetic allows users to write and test MagmaScript on a local synthetic copy of the data before implementing it on the server running Opal.
As with the analysis use case, this testing phase is performed in the R environment in DataSHIELD, and the code is implemented on the server once it is working properly. Our package enables this by generating synthetic data. We then use the V8 package [11] to run MagmaScript and JavaScript within R. The tested MagmaScript code can then be pasted into the Opal server to run against the real data.
A schematic of this workflow is shown in Fig. 3.
In detail, the steps proposed are:
1. Generate synthetic data as described previously.
2. Start a JavaScript session on the client side.
3. Load the synthetic data into the session.
4. Write and test MagmaScript code in the session using synthetic data.
5. When testing is complete, copy the code into the Opal server to generate the harmonised version of the real data.
A worked example of this is detailed in the tutorial.
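The write-and-test part of these steps can be sketched as follows. Python stands in for MagmaScript here purely for illustration; the variable names (`height_cm`, `sex`), the sex coding, the plausibility bounds and the records are all assumptions made for the example, not taken from the tutorial or from any real dataset.

```python
# Assumed raw coding at one site, for illustration only.
SEX_LABELS = {1: "male", 2: "female"}


def harmonise(record):
    """Map one raw record to the harmonised variables (hypothetical schema)."""
    height_cm = record.get("height_cm")
    return {
        # Common scale: all heights in metres; missing stays missing.
        "height_m": None if height_cm is None else height_cm / 100.0,
        # Unknown codes become missing rather than raising an error.
        "sex": SEX_LABELS.get(record.get("sex")),
    }


def check(harmonised_rows):
    """Tests a user might run on the synthetic copy before deployment."""
    problems = []
    for i, row in enumerate(harmonised_rows):
        h = row["height_m"]
        if h is not None and not 0.3 <= h <= 2.8:
            problems.append((i, "implausible height_m"))
        if row["sex"] not in ("male", "female", None):
            problems.append((i, "unexpected sex label"))
    return problems


synthetic_records = [
    {"height_cm": 172.0, "sex": 1},
    {"height_cm": 158.0, "sex": 2},
    {"height_cm": None, "sex": 9},  # missing height, unrecognised sex code
]
harmonised = [harmonise(r) for r in synthetic_records]
problems = check(harmonised)
```

Row-level checks of this kind are possible only because the synthetic copy can be inspected in full; on the real data the user would see summary statistics only.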
We note that remote, centralised harmonisation as described here could also be prototyped using DSLite and writing DataSHIELD, rather than MagmaScript, code. An example of this workflow would be harmonising to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (as has been done in [12]).
Minimising risk of disclosure
A concern with generating a synthetic dataset is that it might reveal too much information about the real data. In the case of using synthpop, existing work has assessed this risk to be low [13, 14].
We have also disabled built-in features of synthpop where certain options allow the return of real data (for example, setting the min.bucket value to 3 to prevent sampling from leaves that have data for very few individuals). Generating synthetic data via simstudy relies only on non-disclosive results, such as means and standard deviations, that can already be obtained via DataSHIELD; there is therefore no additional risk of disclosure.
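The effect of a minimum leaf size can be illustrated with a simplified sketch. This is not synthpop's implementation: it considers a single split on one predictor, the function names and records are hypothetical, and the only element taken from the text is the threshold of 3. A split is allowed only if both resulting leaves hold at least that many individuals, so a synthetic value is never drawn from a leaf describing one or two people.

```python
import random

MIN_BUCKET = 3  # the value quoted in the text


def valid_splits(xs, min_bucket=MIN_BUCKET):
    """Thresholds on x that leave at least min_bucket records on each side."""
    ordered = sorted(set(xs))
    thresholds = []
    for lo, hi in zip(ordered, ordered[1:]):
        t = (lo + hi) / 2
        left = sum(1 for x in xs if x <= t)
        right = len(xs) - left
        if left >= min_bucket and right >= min_bucket:
            thresholds.append(t)
    return thresholds


def sample_from_leaf(pairs, threshold, x_new, rng, min_bucket=MIN_BUCKET):
    """Draw a y value from the leaf that x_new falls into."""
    leaf = [y for x, y in pairs if (x <= threshold) == (x_new <= threshold)]
    if len(leaf) < min_bucket:
        raise ValueError("leaf too small to sample from")
    return rng.choice(leaf)


# Invented (predictor, outcome) records.
pairs = [(1, 10), (2, 12), (3, 11), (4, 20), (5, 22), (6, 21), (7, 23)]
xs = [x for x, _ in pairs]

splits = valid_splits(xs)  # only the two central thresholds qualify
rng = random.Random(0)
draw = sample_from_leaf(pairs, splits[0], x_new=2, rng=rng)
```

With seven records, thresholds that would isolate one or two individuals are rejected, and the draw for `x_new=2` comes from a leaf of three records.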
Discussion
We introduce a package (dsSynthetic) which provides functionality for generating synthetic data in DataSHIELD.
Synthetic data generation is also available as part of other privacy-preserving frameworks such as OpenSAFELY [15] and in other packages [16,17,18]. A version of synthpop for DataSHIELD is offered in [16], but the simstudy method is not available there, nor are detailed workflows for generating synthetic data. Our package is simpler to use, offers different options for generating synthetic data (using packages such as simstudy and synthpop) and provides comprehensive workflows for writing DataSHIELD scripts and for data harmonisation.
Determining privacy risks in synthetic data derived from real world healthcare data [19, 20] is also a fruitful area of research for future work. Indicators of the resemblance, utility and privacy of synthetic data will also be helpful [21,22,23].