Study Population:
Keyword retrieval of relevant mammograms was performed on the PACS. Exclusion criteria were applied to remove mammograms with pacemakers, breast implants, non-standard views (e.g., compression and magnification views), and examinations with technical quality issues.
The dataset included 69,202 images from 4,851 patients, with an average of 3.69 images per study.
Of this dataset,
200 four-view studies,
totaling 800 images,
were reserved to evaluate the performance of the CNN and radiologist reviewers (Fig. 1).
The remaining 68,402 mammograms served as the network training images.
Image Pre-Processing & Training Labels:
Complete standard diagnostic DICOM studies were exported. Whole mammographic views were downsampled to 256 x 256-pixel input images. During this conversion, the manufacturer's standard window width/window level settings were maintained.
Individual view images were anonymized as per standard protocols.
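As an illustration, this pre-processing might be sketched as follows, assuming pydicom and OpenCV; the attribute handling and the preprocess_view() helper are assumptions for exposition, not the study's actual export code.

    import numpy as np
    import pydicom
    import cv2

    def preprocess_view(dicom_path):
        ds = pydicom.dcmread(dicom_path)
        img = ds.pixel_array.astype(np.float32)
        # Apply the manufacturer's window width/level from the DICOM header
        # so the exported image matches the clinical presentation.
        wc, ww = ds.WindowCenter, ds.WindowWidth
        center = float(wc[0]) if isinstance(wc, pydicom.multival.MultiValue) else float(wc)
        width = float(ww[0]) if isinstance(ww, pydicom.multival.MultiValue) else float(ww)
        lo = center - width / 2.0
        img = np.clip((img - lo) / width, 0.0, 1.0)
        # Downsample the whole view to the 256 x 256 network input size.
        return cv2.resize(img, (256, 256), interpolation=cv2.INTER_AREA)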
A label for each view image was derived from the official final clinical report using a controlled keyphrase vocabulary based on variations of the BI-RADS density lexicon.
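A hypothetical sketch of this label derivation follows; the actual keyphrase vocabulary is not published, so the phrases (drawn from the standard BI-RADS density descriptors) and the density_label() helper are illustrative only.

    import re

    DENSITY_PHRASES = {
        "a": r"almost entirely fat",            # matches "fat" and "fatty"
        "b": r"scattered (areas of )?fibroglandular",
        "c": r"heterogeneously dense",
        "d": r"extremely dense",
    }

    def density_label(report_text):
        text = report_text.lower()
        for category, pattern in DENSITY_PHRASES.items():
            if re.search(pattern, text):
                return category
        return None  # report did not match the controlled vocabulary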
Neural Network Configuration:
An ImageNet pre-trained deep convolutional neural network (DCNN) with the Inception-ResNet-v2 classification architecture was employed [10].
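A minimal sketch of this configuration is shown below, assuming the tf.keras implementation of Inception-ResNet-v2 and a new 4-class (BI-RADS a-d) softmax head; the head design is an assumption.

    import tensorflow as tf

    base = tf.keras.applications.InceptionResNetV2(
        include_top=False,           # drop the 1000-class ImageNet head
        weights="imagenet",
        input_shape=(256, 256, 3),   # grayscale views replicated to 3 channels
        pooling="avg",
    )
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(4, activation="softmax"),  # BI-RADS a-d
    ])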
The dataset was split 70% for training, 10% for validation, and 20% for testing.
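One way to realize this split is sketched below with scikit-learn, grouping by patient so that images from the same patient do not leak across partitions; patient-level grouping is an assumption here, as the study does not specify it.

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    def split_by_patient(labels, patient_ids, seed=0):
        idx = np.arange(len(labels))
        # 70% train vs. 30% held out.
        gss = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
        train_idx, rest_idx = next(gss.split(idx, labels, groups=patient_ids))
        # Split the held-out 30% into validation (10%) and test (20%).
        rest_groups = np.asarray(patient_ids)[rest_idx]
        gss2 = GroupShuffleSplit(n_splits=1, test_size=2.0 / 3.0, random_state=seed)
        val_rel, test_rel = next(gss2.split(rest_idx, groups=rest_groups))
        return train_idx, rest_idx[val_rel], rest_idx[test_rel]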
Networks were trained using stochastic gradient descent (SGD) with an initial learning rate of 0.003, weight decay, and five 20-epoch cosine annealing cycles, for a total of 100 epochs.
The network was implemented in TensorFlow under Python 3.6, and training was performed on an NVIDIA Titan X (Pascal) GPU workstation.
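The optimizer and schedule described above might look as follows in tf.keras; steps_per_epoch and the momentum value are illustrative assumptions.

    import tensorflow as tf

    steps_per_epoch = 1000  # illustrative; depends on dataset size and batch size
    schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
        initial_learning_rate=0.003,
        first_decay_steps=20 * steps_per_epoch,  # one 20-epoch annealing cycle
        t_mul=1.0,  # keep all 5 cycles 20 epochs long (100 epochs total)
        m_mul=1.0,
    )
    # Weight decay can be applied via layer kernel_regularizers or, in
    # newer Keras releases, the optimizer's weight_decay argument.
    optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)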
Data augmentation consisting of standard affine transformations (random cropping, rotation, shearing, and horizontal flipping) was performed.
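A sketch of this augmentation using the classic Keras ImageDataGenerator API (contemporary with the Python 3.6 setup above) follows; the parameter ranges are assumptions, and random cropping is handled separately since ImageDataGenerator does not provide it.

    import tensorflow as tf

    augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
        rotation_range=15,     # random rotation in degrees (assumed range)
        shear_range=0.1,       # random shear intensity (assumed range)
        horizontal_flip=True,  # random horizontal flipping
    )

    def random_crop(image, size=256, pad=16):
        # Pad slightly, then take a random size x size crop.
        padded = tf.image.resize_with_crop_or_pad(image, size + pad, size + pad)
        return tf.image.random_crop(padded, size=(size, size, image.shape[-1]))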
All mammographic views underwent histogram normalization prior to training.
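Histogram equalization via scikit-image is one plausible reading of this normalization step; the exact method used is not specified, so the following is an assumption.

    from skimage import exposure

    def normalize(image):
        # Remap intensities so the cumulative histogram is roughly uniform.
        return exposure.equalize_hist(image)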
Neural Network Objective and Subjective Validation:
The trained network was evaluated using areas under the receiver operating characteristic (ROC) curves.
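For the 4-class density task, per-class (one-vs-rest) AUCs can be computed as sketched below with scikit-learn; the function and argument names are illustrative.

    from sklearn.metrics import roc_auc_score
    from sklearn.preprocessing import label_binarize

    def per_class_auc(y_true, y_prob, classes=("a", "b", "c", "d")):
        # y_true: length-n class labels; y_prob: (n, 4) softmax outputs.
        y_bin = label_binarize(y_true, classes=list(classes))
        return roc_auc_score(y_bin, y_prob, average=None)  # one AUC per class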
Further validation of the trained network was performed using saliency maps and class activation maps to determine whether relevant breast image regions were utilized to make classification determinations.
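A minimal gradient-based saliency sketch with tf.GradientTape is shown below; it is one standard way to produce such maps, not necessarily the study's exact method.

    import tensorflow as tf

    def saliency_map(model, image):
        # image: float array of shape (H, W, C).
        x = tf.convert_to_tensor(image[None, ...])  # add a batch dimension
        with tf.GradientTape() as tape:
            tape.watch(x)
            probs = model(x, training=False)
            top_score = tf.reduce_max(probs, axis=-1)  # predicted-class score
        grads = tape.gradient(top_score, x)
        # Pixel-wise gradient magnitude, collapsed over channels.
        return tf.reduce_max(tf.abs(grads), axis=-1)[0]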
Inter-Rater Human Reviewer Consensus:
The held-out image dataset was reviewed by 3 human reviewers using the ACR BI-RADS 5th edition guidelines.
The consensus density reading was established by majority opinion among the 3 reviewers, and inter-rater variability was evaluated. In the event of a triple discrepancy (i.e., no two reviewers selected the same density), a separate consensus meeting was held to arbitrate among the 3 reviewers.
An inter-rater reliability (IRR) analysis using a linearly weighted Cohen's kappa assessed agreement of the AI system with the majority radiology consensus. Additional inter-rater assessment was performed with Kendall's coefficient of concordance.
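Both statistics are straightforward to compute; a sketch follows, using scikit-learn for the linearly weighted kappa and computing Kendall's W directly from its definition (mid-ranks for ties, with the tie-correction term omitted).

    import numpy as np
    from scipy.stats import rankdata
    from sklearn.metrics import cohen_kappa_score

    def weighted_kappa(ai_labels, consensus_labels):
        # Linearly weighted Cohen's kappa between the AI and the consensus.
        return cohen_kappa_score(ai_labels, consensus_labels, weights="linear")

    def kendalls_w(ratings):
        # ratings: (m raters, n subjects) array of ordinal density scores.
        ranks = rankdata(ratings, axis=1)  # rank each rater's scores
        m, n = ranks.shape
        rank_sums = ranks.sum(axis=0)
        s = ((rank_sums - rank_sums.mean()) ** 2).sum()
        return 12.0 * s / (m ** 2 * (n ** 3 - n))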