Discussion
In the present study, wide variability was observed in radiologists’ interpretations of the sample of mammograms. The group of radiologists not routinely interpreting mammograms showed no difference in average sensitivity compared with routine readers, but showed significantly lower specificity and accuracy. Of the various experience-related factors used to evaluate this variability, annual reader volume was important only when radiologists not routinely interpreting mammograms were compared with routine readers, the latter showing greater specificity and accuracy. In contrast, no significant differences in sensitivity, specificity or accuracy were found among routine readers between those reading fewer than 5000 mammograms per year and those reading more than 5000. When the remaining experience-related variables were incorporated into a multivariate model, obtaining feedback on cases for which further workup was recommended increased specificity.
To guarantee the quality of population screening programs, substantial efforts have been made over the last decade to understand the role played by radiologists’ experience in the variability of screening mammogram interpretation, and to identify the radiologist-associated factors that determine accuracy. One of the factors considered most important is annual reading volume. In 1998, Elmore et al.[11] observed that annual volume did not significantly influence the recommendation for workup, but concluded that radiologists interpreting relatively few mammograms each year, even over many years, may not be sufficiently experienced to achieve high levels of sensitivity and specificity. Kan et al.[10] demonstrated that a minimum of 2500 annual readings guaranteed a better cancer detection rate.
Since 1998, two distinct lines of argument can be discerned: Esserman et al.[15] and Smith-Bindman et al.[14] concluded that the quality of mammogram readings could be improved by increasing annual reading volume, while Beam et al.[16] and Barlow et al.[8] reported that reading volume was not an important variable and that radiologists’ interpretative performance is a multifactorial process in which a large number of factors play a role. A recent report by the Institute of Medicine containing an exhaustive review of the literature was unable to demonstrate a clear relationship between volume alone and accuracy.[17] In our opinion, in the attempt to guarantee the quality of population screening programs, the study of variability in mammogram reading has unfortunately been oversimplified to an evaluation of annual reading volume, with fairly arbitrary cut-off values beyond which greater accuracy is presumed.
Our results, like those of other studies, cast doubt on the major role assigned to annual reading volume as an indicator of radiologists’ experience. As with other variables, we observed a positive association between reader volume and accuracy when comparing the group of radiologists not routinely interpreting mammograms with the group of routine readers (defined on the basis of the 5000 annual readings recommended by the European guidelines (Rosselli del Turco, 2001), the National Health Service in the United Kingdom[18] and Esserman et al.[15]). However, we found no significant differences between the two levels of routine readers in either the univariate or the multivariate analyses. Therefore, in addition to questioning the importance of volume, we highlight the role played by other experience-related variables.
According to the results of the multivariate analysis in the present study, one of the most important factors determining experience is feedback (OR = 1.37; CI: 1.03–1.85), since it allows radiologists to perform a self-evaluation and become aware of the accuracy of their previous readings. We therefore believe that the design of screening programs should take this factor into account. The other two significant variables found in this study, focus on breast radiology and radiologists’ age, should be interpreted conjointly because they could show a certain degree of collinearity. Thus, a radiologist older than 45 years who spends less than 25% of the working day on breast radiology could correspond to the profile of a highly accurate reader.
Since we found no significant differences in sensitivity between the group of radiologists not routinely interpreting mammograms and the group of routine readers, we believe that sensitivity could present a certain ceiling effect inherent to the experience-related factors studied to date. This result has previously been discussed in an article explaining why mammographic sensitivity has not changed in decades.[19] We therefore believe that sensitivity is not an appropriate measure of accuracy, at least not in studies based on mammogram samples. We also used the area under the ROC curve at a specificity of 90%, which allowed us to rank the 28 radiologists according to performance (data not shown). However, for the multivariate analysis, because the data were correlated and our objective was to evaluate average accuracy (rather than the individual effect of each radiologist on accuracy), we considered the optimal approach to be marginal models fitted with generalized estimating equations.
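As a rough illustration of this type of analysis (a sketch only, not the code used in the study; the dataset and column names such as correct, feedback and radiologist are hypothetical), a marginal logistic model with the radiologist as the clustering unit could be fitted in Python with statsmodels as follows:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per reading; 'correct' = 1 if the interpretation agreed
# with the reference standard, 0 otherwise. File name is hypothetical.
df = pd.read_csv("readings.csv")

# Readings by the same radiologist are correlated, so the radiologist
# is the clustering unit; an exchangeable working correlation is a
# common default for this design.
model = smf.gee(
    "correct ~ feedback + breast_focus + age_over_45",
    groups="radiologist",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())  # exponentiated coefficients give odds ratios

The marginal (population-averaged) formulation matches the stated objective: its coefficients describe average accuracy across radiologists rather than the effect of any individual reader.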
We emphasize our study design because, in a sample of screening mammograms, selection of thousands of mammograms and hundreds of radiologists is not feasible and cancer cases are necessarily oversampled (bearing in mind that the incidence of breast cancer is approximately 3–8‰ in an incident screening round); the composition of the sample is therefore a key factor for understanding the results obtained. These results depend basically on the proportions of true positives, true negatives, false positives and false negatives chosen from the program to compose the sample. This composition was chosen according to criteria published by Kerlikowske et al.[20] In this sense, given that the percentage of mammograms with an uncertain diagnosis in our study was high, we found a large number of false positives and false negatives; consequently, average sensitivity and specificity were only 84% and 64%, respectively, substantially lower than would be expected in a screening program.
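For reference, these figures follow the standard definitions (a reminder of the formulas, not a new result):

\[
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}
\]

Because the sample deliberately over-represents difficult cases, both measures are lower here than in routine screening, which is the point made above.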
Precisely because we chose a sample that was not representative of the population, a possible limitation of our study is contextual bias. To evaluate the extent to which sensitivity and specificity were influenced by the sample, we performed an ad hoc analysis using only the 138 mammograms with a true-positive or true-negative result, and found that sensitivity did not vary but that specificity increased from 64% to 77%. We nonetheless justify a design based on a sample of mammograms by the difficulty of performing a prospective study, in which recruiting a number of radiologists large enough to guarantee adequate statistical power would be problematic.
In addition to experience-related factors, variability is also explained by differences in the organization of, and protocols for, reading mammograms, which are not homogeneous across countries.[21,22] In Europe, screening programs are population-based, publicly financed and adhere to guidelines that guarantee the quality of the process (European guidelines (Rosselli del Turco, 2001); International Agency for Research on Cancer[23]), whereas in the USA financing and organization are managed largely by private insurance. The characteristics of the reading protocol should also be taken into account: beyond mammography quality, there are differences in double-reading systems and tie-breaking methods, in the number of views, in the percentage of clinical investigations and in the adaptation of BI-RADS, all of which greatly hamper comparisons among studies.
Conclusions
In conclusion, the results obtained in the present study are in line with those of the most recent publications, in which radiologists’ experience depends on multiple factors; therefore, experience-related variables should not be interpreted in isolation. We stress the importance of feedback as a factor to be taken into account.