Purpose
Breast cancer accounts for an estimated 2.22 million new cases and more than 684,000 deaths per year [1]. This emphasises the importance, necessity and promise of progress in deep-learning-based computer-aided detection and diagnosis (CAD) systems for improved cancer detection at earlier stages. Training such deep-learning systems requires vast amounts of patient imaging data to be ingested by the model, which may leak some of this private patient information after training [2, 3]. Hence, it can become necessary to actively protect patient information during...
Methods and materials
The study used the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) dataset [7], containing 891 mass cases balanced between malignant and benign. CBIS-DDSM's predefined split between the training (1296 mass images) and testing (402 mass images) portions of the dataset was adopted for our experiments. The training portion was further split randomly per patient into the final training set (1104 mass images) and the validation set (192 mass images). The mass images were extracted as region-of-interest patches using the lesion coordinates...
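A minimal sketch of the per-patient train/validation split described above is given below, using scikit-learn's GroupShuffleSplit so that all images of a given patient land in the same split. The metadata column name (patient_id), the CSV filename and the split fraction are illustrative assumptions, not the authors' exact pipeline.

# Sketch only: groups rows by patient so no patient appears in both splits.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(df: pd.DataFrame, val_fraction: float = 0.15, seed: int = 0):
    """Randomly split mass-image metadata into train/validation sets per patient."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(splitter.split(df, groups=df["patient_id"]))
    return df.iloc[train_idx], df.iloc[val_idx]

# Hypothetical usage with a CBIS-DDSM-style metadata CSV:
# meta = pd.read_csv("mass_case_description_train_set.csv")
# train_df, val_df = split_by_patient(meta)   # e.g. roughly 1104 / 192 images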
Results
For a relatively strict epsilon privacy parameter (ε=0.1), the mass malignancy classifier performed better when pretrained on synthetic data before fine-tuning on real data with a privacy guarantee (DP-SGD). Based on 3 random seeds, the classifier pretrained on synthetic data achieved an average AUROC of 0.634 (std: 0.036), compared to an average AUROC of 0.543 (std: 0.029) for the baseline trained directly on real data. When running these experiments with a more moderate epsilon of ε=19.63 (as chosen by [17]), the differences between the...
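For readers unfamiliar with DP-SGD, the sketch below shows one way a target privacy budget such as ε=0.1 or ε=19.63 can be enforced during training with the Opacus library. The backbone, learning rate, number of epochs and δ are placeholder assumptions for illustration, not the configuration used in our experiments.

# Illustrative DP-SGD setup with Opacus; model and hyperparameters are placeholders.
import torch
from torch import nn
from torchvision import models
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator

def make_dp_components(train_loader, target_epsilon=0.1, target_delta=1e-5,
                       epochs=20, max_grad_norm=1.0, lr=1e-4):
    model = models.resnet18(weights=None)            # placeholder backbone
    model.fc = nn.Linear(model.fc.in_features, 1)    # single malignancy logit
    model = ModuleValidator.fix(model)               # replace BatchNorm with DP-compatible layers
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    privacy_engine = PrivacyEngine()
    model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        target_epsilon=target_epsilon,   # e.g. 0.1 (strict) or 19.63 (moderate)
        target_delta=target_delta,
        epochs=epochs,
        max_grad_norm=max_grad_norm,     # per-sample gradient clipping bound
    )
    return model, optimizer, train_loader, privacy_engine

Opacus then calibrates the per-step gradient noise so that the stated (ε, δ) budget is not exceeded over the specified number of epochs.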
Conclusion
Our results suggest that pretraining models on synthetic mammography data can improve the performance of breast cancer classification models, particularly with respect to the utility of privacy-preserving deep learning models. When training clinical deep learning models with a patient privacy guarantee (i.e. under differential privacy), synthetic data can help to enhance the model's privacy-utility trade-off. Rather than training on both synthetic and real patient data simultaneously, our experiments indicate that it can be beneficial to first pretrain on synthetic data before fine-tuning on real data. We...
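As a rough illustration of the two-stage schedule suggested above, the following sketch first pretrains non-privately on synthetic patches and only then fine-tunes on real patient data under DP-SGD, so that the privacy budget is spent exclusively on the real data. The loaders, loss, learning rates and epoch counts are hypothetical, and the model is assumed to be DP-compatible (e.g. already passed through ModuleValidator.fix).

# Two-stage schedule: non-private pretraining on synthetic data,
# then DP-SGD fine-tuning on real data (only stage 2 consumes the privacy budget).
import torch
import torch.nn.functional as F
from opacus import PrivacyEngine

def pretrain_then_dp_finetune(model, synthetic_loader, real_loader, device="cpu",
                              pretrain_epochs=10, dp_epochs=20,
                              target_epsilon=0.1, target_delta=1e-5, max_grad_norm=1.0):
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    def run_epoch(loader):
        # Assumes the model outputs a single malignancy logit of shape [batch, 1].
        for x, y in loader:
            optimizer.zero_grad()
            logits = model(x.to(device)).squeeze(1)
            loss = F.binary_cross_entropy_with_logits(logits, y.float().to(device))
            loss.backward()
            optimizer.step()

    # Stage 1: ordinary (non-private) pretraining on synthetic mammography patches.
    for _ in range(pretrain_epochs):
        run_epoch(synthetic_loader)

    # Stage 2: wrap model/optimizer/loader with Opacus and fine-tune on real data.
    engine = PrivacyEngine()
    model, optimizer, real_loader = engine.make_private_with_epsilon(
        module=model, optimizer=optimizer, data_loader=real_loader,
        target_epsilon=target_epsilon, target_delta=target_delta,
        epochs=dp_epochs, max_grad_norm=max_grad_norm)
    for _ in range(dp_epochs):
        run_epoch(real_loader)
    return model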
Personal information and conflict of interest
R. Osuala:
Nothing to disclose
K. Kushibar:
Nothing to disclose
O. Diaz:
Nothing to disclose
K. Lekadir:
Nothing to disclose
References
[1] Global Cancer Observatory (GCO). Interactive web-based platform presenting global cancer statistics to inform cancer control and research. https://gco.iarc.fr/, 2023. Accessed: 2023-01-17.
[2] Osuala, R., Kushibar, K., Garrucho, L., Linardos, A., Szafranowska, Z., Klein, S., ... & Lekadir, K. (2022). Data synthesis and adversarial networks: A review and meta-analysis in cancer imaging. Medical Image Analysis, 102704.
[3] Balle, B., Cherubin, G., & Hayes, J. (2022, May). Reconstructing training data with informed adversaries. In 2022 IEEE Symposium on Security and Privacy (SP) (pp....