Keywords:
Breast, Mammography, Neural networks, Computer Applications-Detection, diagnosis, Technology assessment, Cancer
Authors:
R. Osuala, K. Kushibar, O. Diaz, K. Lekadir
DOI:
10.26044/ecr2023/C-24413
Methods and materials
The study used the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) dataset [7], containing 891 mass cases balanced between malignant and benign. CBIS-DDSM's predefined split into training (1296 mass images) and testing (402 mass images) portions was adopted for our experiments. The training portion was further split randomly, per patient, into the final training set (1104 mass images) and the validation set (192 mass images). The mass images were extracted as region-of-interest patches using the lesion coordinates available in the CBIS-DDSM metadata and resized to 224×224 pixels to train the mass malignancy classification model.

Given its previously reported high performance on mammography imaging data [8], a Swin Transformer (Swin-T) deep learning model [9] was chosen as the malignancy classifier and initialised from pretrained ImageNet [10] weights. Furthermore, a conditional generative adversarial network (cGAN) [11, 12] trained on the CBIS-DDSM final training set was retrieved from the medigan library [13]. This cGAN is class-conditioned on mass malignancy, which makes it possible to control whether it generates malignant or benign mass images. We generated a synthetic dataset of 1500 benign and 1500 malignant mass images, which we used to pretrain the malignancy classifier (all layers) for 50 epochs. Next, we fine-tuned the malignancy classifier for another 100 epochs (training only the last two fully-connected layers), this time using only real mass images from the final CBIS-DDSM training set. During this fine-tuning step, we applied a patient privacy guarantee, namely differentially-private stochastic gradient descent (DP-SGD) [5], with an epsilon ε of 1e-1, a delta δ of 1e-3, a maximum gradient norm of 1, a learning rate of 5e-3, and a batch size of 64. Hyperparameters were selected based on AUC performance after empirically testing different variations, guided by the experiments in [8, 9, 14].
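The patient-wise train/validation split described above can be sketched as follows. This is a minimal illustration, not the study's code: `split_by_patient`, its arguments, and the toy data are hypothetical, and the 0.25 fraction is chosen for the small example rather than the roughly 15% (192 of 1296 images) used in the study.

```python
import random

def split_by_patient(images, val_fraction=0.15, seed=42):
    """Split a list of (patient_id, image_id) pairs into train/val sets
    such that all images of a given patient land on the same side."""
    patients = sorted({pid for pid, _ in images})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])
    train = [img for img in images if img[0] not in val_patients]
    val = [img for img in images if img[0] in val_patients]
    return train, val

# Toy example: 6 images from 4 patients
images = [("P1", "a"), ("P1", "b"), ("P2", "c"),
          ("P3", "d"), ("P3", "e"), ("P4", "f")]
train, val = split_by_patient(images, val_fraction=0.25)
```

Splitting by patient rather than by image avoids leaking images of the same patient into both sets, which would inflate validation performance.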
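The ROI patch extraction and resizing step could look roughly like the pure-Python sketch below. In practice this would be done with an imaging library (e.g. Pillow or OpenCV) on the pixel arrays; `extract_patch` and `resize_nearest` are hypothetical helpers, and the nearest-neighbour resize stands in for whatever interpolation the actual pipeline used.

```python
def extract_patch(image, cx, cy, half_size):
    """Crop a square ROI centred on (cx, cy), clamped to image bounds.
    `image` is a 2D list of pixel values; (cx, cy) would come from the
    lesion coordinates in the CBIS-DDSM metadata."""
    h, w = len(image), len(image[0])
    y0, y1 = max(0, cy - half_size), min(h, cy + half_size)
    x0, x1 = max(0, cx - half_size), min(w, cx + half_size)
    return [row[x0:x1] for row in image[y0:y1]]

def resize_nearest(patch, out=224):
    """Nearest-neighbour resize of a 2D patch to out x out pixels."""
    h, w = len(patch), len(patch[0])
    return [[patch[i * h // out][j * w // out] for j in range(out)]
            for i in range(out)]

# Toy example: crop and downscale an 8x8 "image" to 4x4
image = [[i * 8 + j for j in range(8)] for i in range(8)]
patch = extract_patch(image, cx=4, cy=4, half_size=4)
resized = resize_nearest(patch, out=4)
```

The same two operations, applied with out=224, produce the 224×224 inputs the classifier was trained on.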
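Conceptually, each DP-SGD update clips every per-sample gradient to the maximum gradient norm, sums the clipped gradients, adds Gaussian noise, and then applies an ordinary SGD step. The standalone sketch below illustrates these mechanics only; the study used the Opacus implementation, and `dp_sgd_update` with its `noise_multiplier` argument is an illustrative function, not the library's API (Opacus derives the noise scale from the target ε and δ).

```python
import math
import random

def dp_sgd_update(params, per_sample_grads, lr=5e-3,
                  max_grad_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD step on a flat parameter vector:
    1) clip each per-sample gradient to L2 norm max_grad_norm,
    2) sum the clipped gradients and add Gaussian noise with
       std = noise_multiplier * max_grad_norm,
    3) average over the batch and take a plain SGD step."""
    rng = rng or random.Random(0)
    batch = len(per_sample_grads)
    summed = [0.0] * len(params)
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, max_grad_norm / (norm + 1e-12))
        for i, x in enumerate(g):
            summed[i] += x * scale
    sigma = noise_multiplier * max_grad_norm
    noisy_avg = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_avg)]

# Toy example with noise disabled (noise_multiplier=0) to show clipping:
# both per-sample gradients have norm 10 and are clipped to norm 1.
new_params = dp_sgd_update([1.0, 2.0],
                           [[10.0, 0.0], [0.0, 10.0]],
                           noise_multiplier=0.0)
```

Setting noise_multiplier to 0 removes the privacy guarantee and is done here only to make the clipping effect visible; a real run uses a positive noise multiplier chosen to meet the (ε, δ) budget.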
After each epoch, the resulting classifier model was stored. The final model for testing was selected as the one achieving the highest area under the precision-recall curve (AUPRC) on the validation set. All experiments were run on an NVIDIA RTX 2080 Super 8 GB GPU using the PyTorch [15] and Opacus [16] libraries.
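The checkpoint-selection rule (keep the epoch with the highest validation AUPRC) amounts to a simple argmax over the per-epoch history. In this sketch, `select_best_checkpoint` and the history values are hypothetical, for illustration only.

```python
def select_best_checkpoint(epoch_metrics):
    """Return (epoch, auprc) for the checkpoint with the highest
    validation AUPRC; ties resolve to the earliest such epoch."""
    best_epoch, best_auprc = None, float("-inf")
    for epoch, auprc in epoch_metrics:
        if auprc > best_auprc:
            best_epoch, best_auprc = epoch, auprc
    return best_epoch, best_auprc

# Hypothetical per-epoch validation AUPRC values
history = [(1, 0.71), (2, 0.78), (3, 0.75), (4, 0.78)]
best = select_best_checkpoint(history)
```

AUPRC is a reasonable selection metric here because it remains informative when the benign/malignant decision threshold matters more than raw accuracy.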