Platforms such as DataSHIELD allow users to analyse sensitive data remotely, without having full access to the detailed data items (federated analysis). While this feature helps to overcome difficulties with data sharing, it can make it challenging to write code without full visibility of the data. One solution is to generate realistic, non-disclosive synthetic data that can be transferred to the analyst so they can perfect their code without the access limitation. When this process is complete, they can run the code on the real data.
We have created a package in DataSHIELD (dsSynthetic) which allows generation of realistic synthetic data, building on existing packages. In our paper and accompanying tutorial we demonstrate how the use of synthetic data generated with our package can help DataSHIELD users with tasks such as writing analysis scripts and harmonising data to common scales and measures.
Research is increasingly dependent on the analysis and interpretation of sensitive, personal data (or data that cannot be easily shared) or on the co-analysis of such data from several sources simultaneously. Making data available so that it can be used, may be impeded by ethico-legal processes or by fear of loss of control of data. This results in a lot of work to get permission to gather several datasets together in a central location.
DataSHIELD provides a novel solution that can circumvent some of these challenges [1, 2]. The key feature of DataSHIELD is that data stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows a user to pass commands to each server and the analyst receives results that are designed to only disclose the summary data. For example, the user can fit a linear model to the data but not see the residuals. Furthermore, DataSHIELD enforces other disclosure protection requirements such as requiring a minimum cell count of data points in a summary . Thus DataSHIELD can be used to analyse data without physically sharing it with the users. We refer to this as federated analysis.
A challenge of this approach is that it is harder for data analysts to write and debug code that uses the data when the data are not tangibly in front of them. This is because DataSHIELD prevents access to the data at a detailed level. It is similar to trying to build a Lego model while blindfolded. While it is possible for the user to understand what is happening by asking for summary statistics, this process can still be difficult. For example, to confirm that a subset operation into male and female groups has been successful, the analyst could ask for a summary of the original gender column and check whether the counts of male and female participants match the length of the subset objects. With the datasets in front of them it is immediately obvious if a problem has occurred.
Hypothesis for using synthetic data
R packages like synthpop  have been developed to generate realistic synthetic data that is not disclosive: the synthetic data is representative of the real data, but is less likely to enable recovery of the real data. synthpop uses classification and regression trees (CART)  to generate synthetic data. synthpop fits a sequence of regression models. Once the models are fit, synthetic values are generated by drawing from the predictive distributions.
In DataSHIELD, the user (“client”) passes analytical commands to multiple datasets held on remote servers. If the user could have a client side synthetic copy of the real data that is held on each of the servers they are using, this could make code writing and debugging easier. They could have full access to the synthetic data, and confirm that their code is working correctly before applying it to the real data held on the servers.
The user therefore has the benefit of being able to see the data they are working with, but without the need to go through laborious data transfer processes.
Other packages that provide synthetic data generation are simstudy  and gcipdr . simstudy  requires the user to define the characteristics of variables and their relationships. Non-disclosive access via DataSHIELD can provide these summary statistics. Likewise, gcipdr  requires users to extract features such as mean, standard deviation and correlations via DataSHIELD, and uses these to provide automated generation of the synthetic data.
We introduce a package (dsSynthetic) which provides functionality for generating a local, non-disclosive synthetic copy of data held on remote servers via DataSHIELD. We build on the synthpop  and simstudy packages .
Our objective is to generate synthetic data to allow users to more easily design, implement and test code. This code can then be applied on the real data.
Overview of steps for generating synthetic data
We have written the dsSynthetic and partner dsSyntheticClient R packages (hereafter dsSynthetic) to help users to write code for data that is available in a DataSHIELD setting. A tutorial covering the examples in this text with executable code is available here:
We describe the steps for generating synthetic data below:
The data custodian uploads the raw data to the server side and installs the server side package dsSynthetic.
The user installs the packages dsBaseClient and dsSyntheticClient on the client side.
The user calls functions in the dsSyntheticClient package to generate a synthetic but non-disclosive data set which is transferred from the server to the client side. The generation of the synthetic data can use methods from: a) synthpop, where the synthetic data are generated on the server side and returned to the client, or b) simstudy where non-disclosive summary characteristics of and relationships between variables are generated on the server side, returned to the client, and the synthetic data are generated on the client side.
With the synthetic data on the client side, the user can view the data and write code. They will be able to see the results of the code for the whole synthetic dataset.
When the code is complete, it can be implemented on the server using the real data.
Using synthetic data to build a DataSHIELD analysis script
A key use-case for dsSynthetic is to aid analysts in writing DataSHIELD analysis scripts in the standard DataSHIELD scenario where the real data cannot be fully accessed. We assume that the data sets are ready for analysis at different sites, the dsSynthetic package is installed on the servers and the dsSyntheticClient package is installed on the analyst’s computer (client). This use-case is demonstrated in Fig. 2.
Generate the synthetic data so that it is available on the client (this could be data from multiple servers).
Load the data into DSLite . To test our DataSHIELD script locally on the client, we need to replicate the server environment(s) (which only accepts DataSHIELD function input) on the client. This is provided by the DSLite package. A DSLite instance exists in a user’s local R session and behaves the same as a server, except that it allows full access to the data held on it. Therefore the script can be developed against the DSLite instance and at any time the user can view what is happening on DSLite. This makes it easier for the user to correct any problems.
Once finished, the script is run on real data on the remote server(s). This step should now run smoothly as any problems with the code have been identified when working with the synthetic data.
Using synthetic data to write harmonisation code
Prior to analysis with DataSHIELD, data must be harmonised so that variables have the same definition across datasets from all sites . Common solutions to achieving harmonisation are:
Have each data custodian harmonise their data and host it on the server. This requires a lot of coordination across groups and there can be a lack of visibility of how harmonisation was achieved.
Make a one-off transfer to a central group which does the harmonising work before the data are returned to the custodian for the analysis phase by multiple parties. This suffers the same challenges of a data transfer for analysis.
To avoid these challenges a third way is for a central group to write harmonisation code in Opal  which can be used as the hosting datawarehouse on each of the participating servers. This code acts on the raw data on each server to generate datasets which contain harmonised variables. That is, all the variables are on common scales and measures (e.g. all measures of height in metres). To eliminate the need for a data transfer or full data access agreement, the users log into Opal to write the harmonisation code. This means they may not have full access to the data, only to summary statistics.
Write and test MagmaScript code in the session using synthetic data.
When testing is complete, copy the code into the Opal server to generate the harmonised version of the real data.
A worked example of this is detailed in the tutorial.
We note that remote, centralised harmonisation as described here could also be prototyped using DSLite and writing DataSHIELD, rather than MagmaScript, code. An example of this workflow would be harmonising to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (as has been done in ).
Minimising risk of disclosure
A concern with generating a synthetic dataset is that it might reveal too much information about the real data. In the case of using synthpop, existing work has assessed this risk to be low [13, 14].
We have also disabled built-in features of synthpop where certain options allow the return of real data (for example, setting the min.bucket value to 3 to prevent sampling from leaves that have data for very few individuals). When generating synthetic data via simstudy, this relies on non-disclosive results such as mean and standard deviation that can already be obtained via DataSHIELD, therefore there is no additional risk of disclosure.
We introduce a package (dsSynthetic) which provides functionality for generating synthetic data in DataSHIELD.
Synthetic data generation is also available as part of other privacy preserving frameworks like OpenSAFELY  and other packages [16,17,18]. A version of synthpop for DataSHIELD is offered in  but the simstudy method is not available, nor are the detailed workflows for generating synthetic data. Our package is simpler to use, offers different options to generate synthetic data (using packages like simstudy and synthpop) and comprehensive workflows for writing DataSHIELD scripts and data harmonisation
Determining privacy risks in synthetic data derived from real world healthcare data [19, 20] is also a fruitful area of research for future work. Indicators of the resemblance, utility and privacy of synthetic data will also be helpful [21,22,23].
We inherit the limitations of the underlying packages simstudy and synthpop. In particular, the user should exercise judgment about the number of variables requested in a synthetic dataset, so as to not overwhelm the server performing the generation.
Availability of data and materials
No data was generated in this study. All code and a tutorial are available here as an R bookdown format with executable code: https://tombisho.github.io/synthetic_bookdown All code is available from: https://github.com/tombisho/dsSynthetichttps://github.com/tombisho/dsSyntheticClient DataSHIELD is built on R and can be deployed using prebuilt docker images for easy installation. Our bookdown also documents a tutorial which can be followed with minimal client side package installations.
Gaye A, Marcon Y, Isaeva J, Laflamme P, Turner A, et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014;43:1929–44.
Bonofiglio F, Schumacher M, Binder H. Recovery of original individual person data (IPD) inferences from empirical IPD summaries only: applications to distributed computing under disclosure constraints. Stat Med. 2020;39:1183–98.
Fortier I, Doiron D, Little J, Ferretti V, L’Heureux F, et al. Is rigorous retrospective harmonization possible? Application of the dataSHaPER approach across 53 large studies. Int J Epidemiol. 2011;40:1314–28.
Doiron D, Marcon Y, Fortier I, Burton P, Ferretti V. Software application profile: opal and mica: open-source software solutions for epidemiological data management, harmonization and dissemination. Int J Epidemiol. 2017;46:1372–8.
Morales DR, Conover MM, You SC, Pratt N, Kostka K, et al. Renin-angiotensin system blockers and susceptibility to COVID-19: an international, open science, cohort analysis. Lancet Digit Heal. 2021;3:e98–114.
Mathur R, Rentsch CT, Morton CE, Hulme WJ, Schultze A, et al. Ethnic differences in SARS-CoV-2 infection and COVID-19-related hospitalisation, intensive care unit admission, and death in 17 million adults in England: an observational cohort study using the OpenSAFELY platform. Lancet. 2021;397:1711–24.
Hernandez M, Epelde G, Beristain A, Álvarez R, Molina C, et al. Incorporation of synthetic data generation techniques within a controlled data processing workflow in the health and wellbeing domain. Electronics. 2022;11:812.
Rankin D, Black M, Bond R, Wallace J, Mulvenna M, et al. Reliability of supervised machine learning using synthetic data in health care: Model to preserve privacy for data sharing. JMIR Med Informatics 2020; 8.
We acknowledge the help and support of the DataSHIELD technical team. We are especially grateful to Paul Burton, Stuart Wheater, Eleanor Hyde, Yannick Marcon and Demetris Avraam for fruitful discussions and feedback. Tom R.P. Bishop—Senior author. SB—corresponding author.
This work was funded by EUCAN-Connect under the European Union’s Horizon 2020 research and innovation programme (Grant agreement no. 824989). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The views expressed are those of the authors and not necessarily those of the funders.
Authors and Affiliations
Medical Research Council Epidemiology Unit, University of Cambridge School of Clinical Medicine, Cambridge, UK
No ethics approval and consent to participate was necessary.
Consent for publication
All authors declare they have no competing interest to disclose.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Soumya Banerjee is the corresponding author.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.