Artificial Intelligence, Lung, CT, Computer Applications-General, Pathology
M. M. Pieler, J. Hofmanninger, R. Donner, A. Sikka, E. Jiménez Arroyo, H. Prosch, R. Zhang, G. Langs, A. Makropoulos
Examples of the ground truth annotations compared to the model segmentations are visualized in Figure 2. The model predictions differ from the annotations mostly in the boundary regions.
Pattern volume was calculated from the annotations of each of the two raters and from the model predictions. For both patterns, we report the volume correlation between the model predictions and each rater. On the 47 test slices containing HC, the inter-rater volume correlation was 0.77, and the correlations between the model and the two raters were 0.69 and 0.90, respectively. The correlation between the model and the rater average was 0.85. On the 229 test slices containing GGO, the inter-rater correlation was 0.77, and the correlations between the model and the two raters were 0.85 and 0.83, respectively. The correlation between the model and the rater average was 0.89. The details are shown in Figure 3, and the pairwise comparisons are visualized in Figure 4.
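The volume comparison described above can be illustrated with a minimal sketch. It assumes Pearson correlation (the text does not state which coefficient was used), binary segmentation masks, and a known voxel spacing. The function names `mask_volume_ml`, `pearson_r`, and `volume_agreement` are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import numpy as np


def mask_volume_ml(mask, spacing_mm):
    """Volume of a binary mask in millilitres (voxel count times voxel volume).

    `spacing_mm` is the voxel size along each axis in millimetres (assumed known).
    """
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(np.count_nonzero(mask)) * voxel_mm3 / 1000.0


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def volume_agreement(rater1, rater2, model):
    """Pairwise volume correlations; inputs are per-slice volumes.

    Assumption: the "rater average" is the arithmetic mean of the two raters.
    """
    r1, r2, m = (np.asarray(v, dtype=float) for v in (rater1, rater2, model))
    return {
        "inter_rater": pearson_r(r1, r2),
        "model_vs_rater1": pearson_r(m, r1),
        "model_vs_rater2": pearson_r(m, r2),
        "model_vs_rater_mean": pearson_r(m, (r1 + r2) / 2.0),
    }
```

In use, one would compute `mask_volume_ml` for each test slice and each source (rater 1, rater 2, model), then pass the three resulting lists to `volume_agreement`. A rank-based coefficient such as Spearman's could be substituted in `pearson_r` if the volumes are strongly skewed.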
Based on the example segmentations (Figure 2) and the plots comparing the volumes measured by the two raters and by the model (Figure 4), we observe that for HC the raters tend to label more volume on average than the model predicts. At the same time, there is more scatter between the raters. For GGO, the volume annotated by the raters is closer to the volume predicted by the model, and there is less scatter between the raters.