dsSurvival 2.0: privacy enhancing survival curves for survival models in the federated DataSHIELD analysis system

Banerjee, Soumya; Bishop, Tom R. P.

doi:10.1186/s13104-023-06372-5

Research Note
Open access
Published: 06 June 2023


BMC Research Notes volume 16, Article number: 98 (2023)


Abstract

Objective

Survival models are used extensively in biomedical sciences, where they allow the investigation of the effect of exposures on health outcomes. It is desirable to use diverse data sets in survival analyses, because this offers increased statistical power and generalisability of results. However, there are often challenges with bringing data together in one location or following an analysis plan and sharing results. DataSHIELD is an analysis platform that helps users to overcome these ethical, governance and process difficulties. It allows users to analyse data remotely, using functions that are built to restrict access to the detailed data items (federated analysis). Previous works have provided survival modelling functionality in DataSHIELD (dsSurvival package), but there is a requirement to provide functions that offer privacy enhancing survival curves that retain useful information.

Results

We introduce an enhanced version of the dsSurvival package which offers privacy enhancing survival curves for DataSHIELD. We evaluated different methods for their effectiveness in enhancing privacy while maintaining utility. We demonstrate how our selected method can enhance privacy in different scenarios using real survival data. The details of how DataSHIELD can be used to generate survival curves can be found in the associated tutorial.


Introduction

Survival models are an important component of data science. They are used extensively in biomedical sciences to investigate the effect of exposures on health outcomes. Using data from diverse data sets is beneficial for increasing statistical power and generalisability of results. However, individual-level biomedical data are sensitive, and ethical and legal considerations can make it challenging to move all the required data to one location or to give analysts complete access to the data.

DataSHIELD is a platform that enables the non-disclosive analysis of distributed sensitive data. It permits analysts to work on remote datasets using functions that are designed to protect privacy through built-in safeguards. The analysis can therefore be run without sharing or fully accessing sensitive individual-level data, which stay protected at their host institution [1].

We have previously developed a software package for DataSHIELD called dsSurvival 1.0 [2], which allows users to build survival models for each participating data set and meta-analyse hazard ratios while aiming to enhance the privacy of the data. In this work we have developed a new version of the package (dsSurvival 2.0) that adds significant new functionality. It now allows users to build and visualise survival curves for each data set that are designed to protect privacy.

Main text

Basics of survival models and survival curves

An important concept in survival models is a survival function:

$$S(t) = \Pr(T > t) \tag{1}$$

where S(t) is the survival function, t is the current time, and T is a random variable denoting the time of death. Pr(T > t) is the probability that the time of death is later than t, i.e. the probability of surviving beyond time t.

If we have observed survival times for a set of individuals, we can calculate the proportion surviving at each point during the study. These proportions can be plotted as a step function against time where the survival probabilities are constant between deaths. This is known as a survival curve and is a useful visualisation of survival data. For example, in biomedical research it might allow comparison of groups that had undergone different treatments.
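As a minimal sketch of the step function described above, the following Python snippet computes the proportion surviving after each distinct event time using the Kaplan-Meier product-limit estimator, which also accommodates censored observations. The function name `kaplan_meier` and the toy data are illustrative assumptions and are not part of dsSurvival.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : observed follow-up time for each individual
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)              # everyone leaving the risk set at time t
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            surv *= 1.0 - d / at_risk   # product-limit update at an event time
            curve.append((t, surv))
        at_risk -= exits[t]             # drop both deaths and censored individuals
    return curve

# Toy data (hypothetical): five individuals, two of them censored
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: proportion surviving = {s:.2f}")
```

Each tuple in `curve` marks one downward step; between consecutive event times the estimated survival probability stays constant.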

Architecture of DataSHIELD and dsSurvival

The DataSHIELD framework and dsSurvival have a client–server architecture. The communication between the client and a single server for dsSurvival 2.0 is shown in Fig. 1. A client-side general analysis function calls a server-side function. This function is executed on all the servers that store the data, and summary results (designed to enhance privacy) are returned to the client. The server-side function contains checks that ensure that privacy is enhanced in the results that are returned. The client-side function aggregates the results from all servers and finally returns them to the analyst.
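The client–server pattern can be illustrated with a toy Python sketch. It is not DataSHIELD code: the function names, the threshold `MIN_GROUP_SIZE` and the pooled mean are all hypothetical, chosen only to show a server-side check that returns aggregates (never individual records) and a client-side function that combines them.

```python
MIN_GROUP_SIZE = 3  # hypothetical disclosure threshold, set by the data custodian

def server_summary(survival_times):
    """Runs next to the data; returns aggregate values only."""
    n = len(survival_times)
    if n < MIN_GROUP_SIZE:
        # Disclosure check: refuse to summarise very small groups
        return {"ok": False, "reason": "too few records"}
    return {"ok": True, "n": n, "total": sum(survival_times)}

def client_pooled_mean(servers):
    """Runs on the analyst's machine; sees only the server summaries."""
    results = [server_summary(data) for data in servers.values()]
    valid = [r for r in results if r["ok"]]
    n = sum(r["n"] for r in valid)
    return sum(r["total"] for r in valid) / n if n else None

# Toy data (hypothetical): site_c is too small and is excluded by the check
servers = {"site_a": [5, 7, 9, 11], "site_b": [4, 6, 8], "site_c": [10]}
pooled = client_pooled_mean(servers)
print(pooled)
```

In a real deployment the call to `server_summary` would be a remote request to each host institution; here it is an ordinary function call so that the sketch stays self-contained.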

Privacy and survival curves

A survival curve can be disclosive by revealing the time during which an event occurs. Each step in the curve corresponds to events for a group of individuals. If the size of the group is small (e.g. 1) then it becomes more likely that an adversary can attribute an event to an individual.

For example, if an adversary had a previously released survival curve, and an individual joined a study, a new drop in the curve could be attributed to that individual. The adversary would then know that the individual underwent an event, when it occurred and which particular subgroup the individual belongs to.

The objective of this work is to provide survival curves that aim to protect individuals' privacy while still preserving useful information for the analyst. This provides novel functionality for the DataSHIELD platform.

Methods for enhancing privacy in survival curves

Previous graphical approaches in DataSHIELD have used both deterministic anonymisation and probabilistic anonymisation. We investigated whether these techniques would be suitable for survival curves. Currently dsSurvival is designed to fit survival models per site, allowing users to meta-analyse these results to give an overall result. Similarly, our approach provides a survival curve per site, rather than a global survival curve.

Probabilistic anonymisation adds random noise to the data points before they are visualized. For survival curves noise added to the X-axis adds uncertainty to the time when the event occurred, and when added to the Y-axis obscures how many individuals undergo an event at a particular time. The amount of noise is specified as a percentage of the value it is being added to. We also ensure that even after noise is added, time (plotted on the X-axis in a survival curve) increases monotonically and proportion surviving (Y-axis in a survival curve) decreases monotonically. An example survival curve that has undergone probabilistic anonymisation is shown in Additional file 1.
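A minimal Python sketch of this idea is given below. It assumes uniformly distributed noise and enforces monotonicity with a running maximum (time) and running minimum (proportion surviving); the noise distribution, the function name `add_noise_monotone` and the default `noise_pct` are our assumptions for illustration, not the DataSHIELD implementation.

```python
import random

def add_noise_monotone(times, surv, noise_pct=0.05, seed=0):
    """Perturb each value by up to +/- noise_pct of itself, keeping the curve monotone."""
    rng = random.Random(seed)
    noisy_t, noisy_s = [], []
    for t, s in zip(times, surv):
        t_n = t * (1 + rng.uniform(-noise_pct, noise_pct))
        s_n = s * (1 + rng.uniform(-noise_pct, noise_pct))
        s_n = min(max(s_n, 0.0), 1.0)       # keep a valid proportion
        if noisy_t:
            t_n = max(t_n, noisy_t[-1])     # time never goes backwards
            s_n = min(s_n, noisy_s[-1])     # survival never increases
        noisy_t.append(t_n)
        noisy_s.append(s_n)
    return noisy_t, noisy_s

# Toy curve (hypothetical values)
times = [1, 2, 3, 5, 8]
surv = [0.9, 0.8, 0.6, 0.5, 0.2]
noisy_times, noisy_surv = add_noise_monotone(times, surv, noise_pct=0.1, seed=1)
```

Larger values of `noise_pct` obscure the event times and step heights more strongly, at the cost of a curve that departs further from the original.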

In deterministic anonymisation, the k nearest neighbours algorithm [3] is used to find centroids of the data points. Once the nearest neighbours are computed, the original data points are then moved to their centroids. This modified data is then plotted.
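The following Python sketch illustrates the centroid replacement on (time, proportion surviving) pairs. It assumes that each point counts as its own nearest neighbour and that distances are Euclidean on the raw values; the function name `knn_centroids` and these choices are illustrative assumptions rather than the DataSHIELD implementation.

```python
import math

def knn_centroids(points, k=3):
    """Replace every point by the centroid of its k nearest points (itself included)."""
    if k > len(points):
        raise ValueError("k cannot exceed the number of points")
    moved = []
    for p in points:
        nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]
        cx = sum(q[0] for q in nearest) / k
        cy = sum(q[1] for q in nearest) / k
        moved.append((cx, cy))
    return moved

# Toy curve (hypothetical): three early events and one late, isolated event
points = [(1, 0.9), (2, 0.8), (3, 0.7), (30, 0.1)]
anonymised = knn_centroids(points, k=2)
print(anonymised[-1])  # the isolated late point is moved towards the earlier ones
```

In this toy example the isolated point at time 30 is moved roughly halfway towards the cluster of early events, which shows how sparse regions of a skewed data set are displaced the most.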

A review of existing literature suggests that two other options are available for generating privacy enhancing survival curves. The first of these is based around a smoothing technique (LOESS [Locally Estimated Scatterplot Smoothing]) [4]. Smoothing the steps in the survival curve makes it harder for an adversary to identify the timing of events and the number of individuals experiencing an event at a particular time. The smoothing process is achieved by fitting a low-degree polynomial to a subset of the data at each point. A smoothing parameter determines how much of the data is used to fit each polynomial - smaller values result in curves that track variations in the data closely and larger values give more smoothing.
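To make the role of the smoothing parameter concrete, here is a simplified Python sketch of a LOESS-style smoother. It fits a local straight line (degree 1) with tricube weights at each point, using the fraction `span` of the data nearest to that point. This is a didactic simplification under our own assumptions; it is not the algorithm of R's `loess()` in every detail, and the function name `loess` and the toy data are illustrative.

```python
import math

def loess(x, y, span=0.75):
    """Local linear smoother with tricube weights; returns fitted values at each x."""
    n = len(x)
    k = max(2, min(n, math.ceil(span * n)))      # number of points used in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12                # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        s0 = sum(w)
        s1 = sum(wi * (xi - x0) for wi, xi in zip(w, x))
        s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
        t0 = sum(wi * yi for wi, yi in zip(w, y))
        t1 = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
        denom = s0 * s2 - s1 * s1
        if abs(denom) < 1e-12:
            fitted.append(t0 / s0)               # degenerate case: weighted mean
        else:
            fitted.append((s2 * t0 - s1 * t1) / denom)  # intercept of the weighted line at x0
    return fitted

# Toy step curve (hypothetical values)
times = [1, 2, 4, 5, 7, 9, 12, 15]
surv = [0.95, 0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2]
smooth = loess(times, surv, span=0.75)
```

Calling `loess` with a smaller `span` uses fewer neighbours per fit and tracks the individual steps more closely, whereas a larger `span` averages over more of the curve.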

The second method is to implement differentially private survival curves [5], which involves adding calibrated noise to the counts of events. While this makes strong promises about the protection of privacy, it is complex to implement. In particular it requires the calling framework to manage a privacy budget on behalf of each user which is depleted with each request, as the user learns more about the data. This is required for the promises around privacy protection to be upheld but this mechanism is not currently available in DataSHIELD.
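The budget mechanism can be sketched in a few lines of Python. The example below uses the generic Laplace mechanism on a list of event counts and a simple per-user budget tracker. It is not the method of [5]; the class `PrivacyBudget`, the function `noisy_event_counts` and the assumption that each individual contributes to exactly one count (sensitivity 1) are ours.

```python
import random

class PrivacyBudget:
    """Tracks how much of a user's total epsilon is still available."""
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

def noisy_event_counts(counts, epsilon, budget, rng):
    """Release event counts with Laplace noise, charging epsilon to the budget."""
    budget.spend(epsilon)
    scale = 1.0 / epsilon   # sensitivity 1: one individual changes one count by 1
    # The difference of two exponential draws follows a Laplace(0, scale) distribution
    return [c + rng.expovariate(1 / scale) - rng.expovariate(1 / scale) for c in counts]

# Toy usage (hypothetical counts of events per time interval)
rng = random.Random(1)
budget = PrivacyBudget(total_epsilon=1.0)
released = noisy_event_counts([3, 1, 4], epsilon=0.4, budget=budget, rng=rng)
print(released, budget.remaining)
```

Every release reduces `budget.remaining`, and once it is exhausted further requests are refused. It is this bookkeeping, carried out per user across requests, that the calling framework would need to provide.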

We considered the different approaches available. Probabilistic anonymisation has the challenge that it is difficult to set the noise percentage to an appropriate amount to balance utility and privacy. Deterministic anonymisation has additional complexities when applied to skewed data such as survival data (more events happen at earlier times). The skewed data means that the nearest-neighbour algorithm “pulls” the data towards the centre of mass of the data.

While differential privacy is gaining traction in applications where a privacy budget can be managed, DataSHIELD currently lacks a mechanism for managing such a budget, so this approach had to be ruled out.

We therefore chose to base our approach on LOESS, which has already been used in practice for survival curves [4].

Choosing a value for the smoothing parameter that would provide a suitable balance between privacy and utility could be challenging for data custodians. For example, a small parameter provides a curve closer to the true curve but risks compromising privacy. A larger parameter may smooth the curve so much that it is less useful for research. Therefore we implemented LOESS using the loess.as() function in the fANCOVA package [6], which automates the selection of the smoothing parameter. This is achieved by minimising a criterion that incorporates the number of variables in the model and the error in the fit of the model. We selected the corrected Akaike information criterion (AICc) because this offers better performance on smaller datasets [7].
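For reference, the improved criterion proposed in [7] for a linear smoother with fitted values $\hat{y} = Hy$ on $n$ observations can be written as follows. The notation is ours, and the exact expression evaluated inside loess.as() may differ in detail.

$$\mathrm{AIC}_C = \log \hat{\sigma}^2 + 1 + \frac{2\,(\mathrm{tr}(H) + 1)}{n - \mathrm{tr}(H) - 2}, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here $\mathrm{tr}(H)$ acts as the effective number of parameters of the smoother. A small smoothing parameter reduces $\hat{\sigma}^2$ but increases $\mathrm{tr}(H)$, so the penalty term grows; the selected value is the one that minimises the sum.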

Ablation studies and sensitivity analysis

We assessed privacy risk in a publicly available dataset: the veteran dataset in the survival R package [8]. We show our privacy enhancing survival curves (generated using the automated LOESS smoothing procedure) on the full dataset (Fig. 2a) and the dataset with one patient removed (Fig. 2b). To demonstrate the effectiveness of the smoothing, we consider the individual that undergoes an event at time 61. In the unsmoothed curve this event is clearly visible and would identify that individual. In the smoothed curve, it is not possible to determine if an event occurred at time 61. While there is an additional inflection point at time $\sim$ 53 with the patient added, this does not match their actual event time of 61. This suggests it is very difficult to infer the characteristics of this single patient.

We also conducted ablation studies using this automated smoothing technique on a dataset where we randomly reduced the number of patients from 137 to 50. The resulting privacy enhancing survival curve is shown in Fig. 3a. The survival curve with one additional patient removed is shown in Fig. 3b. Additional survival curves generated from further ablation studies are available in Additional file 1.

We conducted similar ablation studies to reduce the number of patients in synthetic data and observe the effect on survival curves (see Additional file 1).

Based on our simulations and the fact that this technique was used in previous studies [4], we chose this automated smoothing technique as the primary implementation in dsSurvival 2.0 for generating survival curves.

Summary of steps taken to enhance privacy

We summarise the steps we take to enhance privacy, noting that no procedure can completely remove disclosure risk. The smoothing procedure for survival curves ensures that it is very difficult to infer the precise time that an event takes place (for any given patient).

We modify the fraction of patients that survive (the surv field) using the LOESS procedure described above. In standard survival curves, symbols show when censoring events occur. We remove these symbols from the plot, so that the timing of censoring for any given patient is also obscured.

Finally, we note that the DataSHIELD architecture also helps to minimize disclosure risk and protect privacy. The functions in DataSHIELD are designed to return summary aggregated statistics, enforcing requirements to enhance disclosure control.

Computational pipeline and use case

We outline the development and code for implementing survival models and plotting of survival curves.

All code is available here in bookdown format with synthetic data:

https://neelsoumya.github.io/dsSurvivalbookdown/

The computational steps are outlined below using synthetic data.

The initial steps outlined below create a Surv() object (available in the survival package) to generate the times and proportion surviving, which are then used as inputs to the survival curve.

Discussion

Our work adds to the existing dsSurvival package (which enables privacy enhancing meta-analysis of survival models in the DataSHIELD federated environment) by adding functionality to plot privacy enhancing survival curves. This will allow users to visually assess the differences in survival rates between groups, for example, while maintaining privacy.

We chose to build on the existing smoothing techniques [4] as this offered a pragmatic approach, which we have adapted to a federated setting. With other methods, such as probabilistic anonymisation, it is challenging to optimise the balance between privacy and utility.

Differential privacy has been used to ensure survival curves are privacy enhancing [5] and this is a promising area of future work [9]. Apart from the additional complexity in implementing the solution itself, there remain challenges around managing the privacy budget. This could be the subject of future work.

Limitations

A limitation is that dsSurvival 2.0 provides a curve per study/dataset, and not a global curve. This might be possible with DataSHIELD in the future and would require secure exchange of values so that they can be ordered by time to build the survival curve. We also did not determine a minimum number of points that are required for a survival plot to be produced and not seriously compromise privacy. This could be a future improvement.

Another area of future work could be generating synthetic survival data using the dsSynthetic package [10].

Availability of data and materials

All code is available from the following repositories: https://github.com/neelsoumya/dsSurvivalClient/ and https://github.com/neelsoumya/dsSurvival/. A tutorial in bookdown format with executable code to generate plots using synthetic data is available here: https://neelsoumya.github.io/dsSurvivalbookdown/

Abbreviations

LOESS: Locally estimated scatterplot smoothing

References

1. Wilson RC, Butters OW, Avraam D, Baker J, Tedds JA, Turner A, et al. DataSHIELD - new directions and dimensions. Data Sci J. 2017. https://doi.org/10.5334/dsj-2017-021.
2. Banerjee S, Sofack GN, Papakonstantinou T, Avraam D, Burton P, Zöller D, et al. dsSurvival: privacy preserving survival models for federated individual patient meta-analysis in DataSHIELD. BMC Res Notes. 2022;15:197. https://doi.org/10.1186/s13104-022-06085-1.
3. Gareth J, Daniela W, Trevor H, Robert T. Introduction to statistical learning with applications in R. Springer; 2017. http://www-bcf.usc.edu/~gareth/ISL/.
4. O’Keefe CM, Sparks RS, McAullay D, Loong B. Confidentialising survival analysis output in a remote data access system. J Privacy Confid. 2012. https://doi.org/10.29012/jpc.v4i1.614.
5. Bonomi L, Jiang X, Ohno-Machado L. Protecting patient privacy in survival analyses. J Am Med Inf Assoc. 2020;27:366–75. https://doi.org/10.1093/jamia/ocz195.
6. Wang X. fANCOVA: Nonparametric analysis of covariance; 2020. https://cran.r-project.org/package=fANCOVA.
7. Hurvich CM, Simonoff JS, Tsai CL. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J Royal Stat Soc Series B Stat Methodol. 1998;60:271–93. https://doi.org/10.1111/1467-9868.00125.
8. Therneau T, Grambsch P, Fleming T. A package for survival analysis in S; 1994. https://cran.r-project.org/package=survival.
9. Gondara L, Wang K. Differentially private survival function estimation; 2020. p. 1–20. arXiv:1910.05108.
10. Banerjee S, Bishop TRP. dsSynthetic: Synthetic data generation for the DataSHIELD federated analysis system. BMC Res Notes. 2022;15:230. https://doi.org/10.1186/s13104-022-06111-2.


Acknowledgements

We acknowledge the help and support of the DataSHIELD technical team. We are especially grateful to Demetris Avraam, Paul Burton, Stuart Wheater, Eleanor Hyde and Wolfgang Vichtbauer for fruitful discussions and feedback.

Funding

This work was funded by EUCAN-Connect under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement no. 824989). SB was also funded by the Accelerate Programme for Scientific Discovery. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The views expressed are those of the authors and not necessarily those of the funders.

Author information

Authors and Affiliations

Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
Soumya Banerjee
Medical Research Council Epidemiology Unit, University of Cambridge School of Clinical Medicine, Cambridge, UK
Tom R. P. Bishop


Contributions

SB carried out the analysis and implementation, participated in the design of the study and drafted the manuscript. TB drafted the manuscript, carried out the analysis and implementation, participated in the design of the study and directed the study. All authors read and approved the final manuscript and agree to be personally accountable for their contributions.

Authors’ information

SB is a Senior Research Fellow at the University of Cambridge. He worked in industry for many years before completing a PhD in applying computational techniques to interdisciplinary topics. He has worked closely with domain experts in finance, healthcare, immunology, virology, and cell biology. TB is a Senior Data Scientist at the University of Cambridge and has worked in industry and academia.

Corresponding author

Correspondence to Soumya Banerjee.

Ethics declarations

Ethics approval and consent to participate

No ethics approval and consent to participate was necessary.

Consent for publication

Not applicable.

Competing interests

All authors declare they have no competing interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1.

A survival curve generated using probabilistic anonymisation, which adds noise to the Y-axis. Figure S2. Survival curve generated using probabilistic anonymisation with a smaller total number of patients. Figure S3. Top panel: Survival curve with LOESS smoothing using the fANCOVA package with a reduced number of patients in the veteran dataset. The original unmodified survival curve is shown in black. Bottom panel: Survival curve with LOESS smoothing using the fANCOVA package with a reduced number of patients in the veteran dataset and one patient removed.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Banerjee, S., Bishop, T.R.P. dsSurvival 2.0: privacy enhancing survival curves for survival models in the federated DataSHIELD analysis system. BMC Res Notes 16, 98 (2023). https://doi.org/10.1186/s13104-023-06372-5


Received: 18 July 2022
Accepted: 25 May 2023
Published: 06 June 2023
DOI: https://doi.org/10.1186/s13104-023-06372-5
