The key terms “interobserver variability” or “interobserver agreement” or “observer variability” and “TI-RADS” were combined in a Medline search. Studies were excluded if they were published in a language other than English or did not pertain to ACR TI-RADS. Following exclusions, 9 studies were included.
All studies reported interobserver agreement (IA) using the Kappa (K) statistic. K values have previously been interpreted as:
- K=0.81-0.99 almost perfect agreement
- K=0.61-0.8 substantial agreement
- K=0.41-0.6 moderate agreement
- K=0.21-0.4 fair agreement
- K=0.1-0.2 poor agreement
- K<0 less than chance agreement[2]
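As an illustration of how these quantities relate, the following is a minimal sketch of percentage agreement, Cohen's Kappa for two readers, and a lookup against the bands listed above. It rests on several assumptions: the included studies may have used multi-reader variants of Kappa rather than the two-reader form shown here, the function names are hypothetical, and each band is treated as running up to the start of the next one. The listed bands do not cover values from 0 up to 0.1, so the sketch leaves those unlabelled rather than guessing.

```python
from collections import Counter


def percent_agreement(reader_a, reader_b):
    """Proportion of nodules on which two readers chose the same label."""
    if len(reader_a) != len(reader_b) or len(reader_a) == 0:
        raise ValueError("readers must rate the same non-empty set of nodules")
    matches = sum(a == b for a, b in zip(reader_a, reader_b))
    return matches / len(reader_a)


def cohen_kappa(reader_a, reader_b):
    """Cohen's Kappa for two readers: (p_o - p_e) / (1 - p_e)."""
    n = len(reader_a)
    p_o = percent_agreement(reader_a, reader_b)  # observed agreement
    counts_a = Counter(reader_a)
    counts_b = Counter(reader_b)
    # Agreement expected by chance, from each reader's label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Lower bounds of the bands listed above, highest first.
# Assumption: each band extends up to the start of the next band, and values
# above 0.99 are lumped into the top band.
KAPPA_BANDS = [
    (0.81, "almost perfect agreement"),
    (0.61, "substantial agreement"),
    (0.41, "moderate agreement"),
    (0.21, "fair agreement"),
    (0.10, "poor agreement"),
]


def interpret_kappa(k):
    """Map a Kappa value to the descriptive band listed above."""
    if k < 0:
        return "less than chance agreement"
    for lower_bound, label in KAPPA_BANDS:
        if k >= lower_bound:
            return label
    return "not covered by the listed bands"  # 0 <= k < 0.1


# Invented ratings for six nodules, purely for illustration.
reader_1 = ["solid", "solid", "mixed", "cystic", "solid", "mixed"]
reader_2 = ["solid", "mixed", "mixed", "cystic", "solid", "solid"]
kappa = cohen_kappa(reader_1, reader_2)
print(percent_agreement(reader_1, reader_2), kappa, interpret_kappa(kappa))
```

In this invented example the readers agree on four of six nodules, yet Kappa is noticeably lower than the raw percentage because some of that agreement would be expected by chance. This is why the two measures can move by different amounts in the studies discussed below.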
Percentage agreement was provided for most studies. Of note, one study was a meta-analysis[3], two studies compared more experienced observers to less experienced observers[4], [5], one study assessed the effect of a consensus meeting on IA[6] and one study assessed IA amongst sonographers, and IA between the sonographers and an expert panel[7].
Composition
Under the characteristic of composition, nodules may be classified as “cystic or almost completely cystic” or “spongiform”, both conferring 0 points. “Mixed cystic and solid” and “solid or almost completely solid” confer 1 and 2 points respectively[1].
Although composition had the lowest average comfort score amongst sonographers, it had the second highest IA amongst this group[7]. In general, it had moderate IA, with K values ranging from 0.39 to 0.68 across studies[7], [8]. The difference in IA between experienced and less experienced readers was not found to be statistically significant. However, in the study performed by Seifert et al., there was significant improvement in IA following a consensus meeting[6].
Echogenicity
Echogenicity may be classified as “anechoic”, “hyperechoic or isoechoic”, “hypoechoic” or “very hypoechoic”, conferring a range of points increasing from 0 to 3. The reference for comparison is the surrounding thyroid tissue, except for “very hypoechoic” nodules which are compared to the muscles of the neck[1].
Generally, the category had moderate agreement between observers, ranging from K=0.252 to K=0.62[2], [4]. In the study performed by Chung et al., this category had the highest IA[4]. There was little difference between the experienced and less experienced readers, with Hoang et al. reporting K=0.42 and K=0.47 for the experienced and less experienced readers respectively[5]. Chung et al. reported K=0.67 and K=0.62 for the experienced and less experienced readers respectively[4].
Shape
There are two options for shape: a nodule may be “taller than wide”, defined as a ratio of >1 between the anteroposterior (AP) and transverse dimensions, or “wider than tall”. Measurements must be taken in the transverse plane to make this observation[1].
K values ranged from 0.29 to 0.79[2], [8]. This category achieved the highest IA in two studies[6], [8]. In a further study[5], shape and macrocalcifications were the only characteristics to achieve substantial agreement. In the meta-analysis published by Liu et al., IA did not appear to be significantly affected by experience or training, which differed from the remainder of the characteristics[3].
Several explanations were offered for these results. In the studies that found shape to have higher agreement than other characteristics, the authors felt this may be because shape is the only characteristic that is objectively assessed in ACR TI-RADS. Authors also observed that rounded nodules could be a cause of confusion amongst assessors, potentially affecting interobserver variability. Hoang et al. reference a prior study which found that, when nodules were grouped by size, shape was the feature with the poorest interobserver agreement in nodules <5 mm, suggesting that this feature may be more difficult to interpret in smaller nodules[5].
Margin
The margin of the nodule being assessed may be categorised as one of four features. The first two, “smooth” and “ill-defined”, both confer 0 points. “Lobulated or irregular margins” confer 2 points, whereas “extra-thyroidal extension” confers 3 points[1].
Margin was found in several studies to have the poorest interobserver agreement of the TI-RADS categories[2]–[5], [8], with K values ranging from 0.18 to 0.796[6], [7]. The K value was 0.18 amongst the 15 sonographers assessing the nodules included in one study, the lowest of all characteristics except “large comet tail artefact”[7]. In the study performed by Hoang et al., the more experienced readers had greater IA (K=0.32) than the less experienced radiologists (K=0.23)[5]. This conflicted with the findings of the study performed by Chung et al., where IA was significantly higher amongst less experienced readers for this category[4]. Seifert et al. reported that a second session of assessing nodules following a consensus meeting showed significant improvement in K value from 0.431 to 0.796 in this category[6].
Echogenic Foci
“Echogenic foci” is unique amongst the TI-RADS categories in that more than one characteristic may be present and the points are additive. The features under echogenic foci are “none or large comet tail artefact” (0 points), “macrocalcifications” (1 point), “peripheral/rim calcifications” (2 points) and “punctate echogenic foci” (3 points)[1].
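The additive scoring described above can be sketched as follows. The point values are those quoted in the preceding paragraph; the dictionary and function names are hypothetical and chosen only for illustration.

```python
# Points per echogenic foci feature, as quoted in the text above.
ECHOGENIC_FOCI_POINTS = {
    "none or large comet tail artefact": 0,
    "macrocalcifications": 1,
    "peripheral/rim calcifications": 2,
    "punctate echogenic foci": 3,
}


def echogenic_foci_points(features_present):
    """Sum the points of every echogenic foci feature seen in a nodule."""
    features = set(features_present)  # count each feature once
    unknown = features - set(ECHOGENIC_FOCI_POINTS)
    if unknown:
        raise ValueError(f"unrecognised feature(s): {sorted(unknown)}")
    return sum(ECHOGENIC_FOCI_POINTS[f] for f in features)


# A nodule with both macrocalcifications and punctate echogenic foci.
print(echogenic_foci_points(["macrocalcifications", "punctate echogenic foci"]))
```

Because the points accumulate, a disagreement about any single feature changes the category score, which is one reason the studies below differ in whether they report IA per feature or for the category as a whole.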
Interobserver agreement in this category was affected by how the results were collected in the various studies, with some assessing IA for the entire category and some assessing each feature individually. When assessed individually, macrocalcifications demonstrated relatively higher IA amongst readers than other characteristics. In the study performed by Hoang et al., there was 80% agreement amongst all readers, with K=0.73, the highest of all categories. Peripheral calcifications demonstrated fair agreement in the same study, with K=0.32. Punctate echogenic foci had lower IA than the other echogenic foci, with K=0.31[5].
These findings were similar to those of other studies. In the study performed by Itani et al., macrocalcifications, peripheral calcifications and punctate echogenic foci had K values of 0.49, 0.39 and 0.27 respectively[2]. Wildman-Tobriner et al. found that the presence of macrocalcifications had the highest IA of all characteristics amongst sonographers, with K=0.41. The presence of large comet-tail foci had the lowest IA of all characteristics amongst the sonographers, with K=0.08. The presence of punctate echogenic foci and peripheral calcifications had fair agreement, with K values of 0.28 and 0.26 respectively[7].
It was felt that punctate echogenic foci may be difficult to differentiate from background nodule heterogeneity, or may be so subtle as to be difficult to appreciate at all, which may account for generally low IA. Also, the echogenic foci may only be seen in a portion of the nodule, so may be missed if only static images are reviewed[5]. Low agreement amongst sonographers regarding the presence of large comet tail artefact was felt to be affected by the small number of nodules with this characteristic, and also by the uncertainty in differentiating these from small comet tail artefacts, which are considered to be associated with punctate echogenic foci. It was also suggested that peripheral calcification could be confused with macrocalcifications, which may account for the only fair agreement in assessing this characteristic[7].
The study performed by Seifert et al. did not separate the individual features of calcifications within TI-RADS when analysing results. However, for the category as a whole, there was significant improvement in IA following a consensus meeting (K=0.405 to K=0.424), although the improvement in percentage agreement, at 5%, was much smaller than for other characteristics[6].
Overall TI-RADS Score and Management Recommendations
The final TI-RADS score is the sum of the points from the five features and confers a risk stratification level, TR1 to TR5, with higher levels associated with an increasing chance of malignancy. The management recommendation is then based on whether the nodule meets a size threshold for FNA or progress imaging[1].
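The summation step can be illustrated with a short sketch. The mapping from total points to TR level is not stated in this section; the cut-offs used below (0 points TR1, 2 points TR2, 3 points TR3, 4 to 6 points TR4, 7 or more points TR5) are an assumption based on the published ACR TI-RADS chart and should be checked against reference [1]. Treating a total of 1 point as TR1 is a further assumption of this sketch. Size thresholds for FNA or progress imaging are not reproduced here, and all names are hypothetical.

```python
CATEGORIES = ("composition", "echogenicity", "shape", "margin", "echogenic_foci")


def total_points(points_by_category):
    """Add the points assigned in each of the five categories."""
    missing = [c for c in CATEGORIES if c not in points_by_category]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    return sum(points_by_category[c] for c in CATEGORIES)


def tr_level(total):
    """Map a total score to a TR level (cut-offs are an assumption, see above)."""
    if total < 0:
        raise ValueError("total points cannot be negative")
    if total <= 1:
        return "TR1"  # assumption: a total of 1 is grouped with 0
    if total == 2:
        return "TR2"
    if total == 3:
        return "TR3"
    if total <= 6:
        return "TR4"
    return "TR5"


# Illustrative nodule: the per-category points are invented for the example.
nodule = {
    "composition": 2,
    "echogenicity": 2,
    "shape": 0,
    "margin": 0,
    "echogenic_foci": 3,
}
print(total_points(nodule), tr_level(total_points(nodule)))
```

Because several totals share a level, two readers can disagree on individual features and still arrive at the same TR level, which is consistent with management recommendations showing higher concordance than individual characteristics in some of the studies below.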
K values ranged from 0.313 to 0.7 in assessment of the final TI-RADS level[9], [10]. Seifert et al. demonstrated significant improvement following a consensus meeting, from K=0.321 to K=0.569[6]. Management recommendations had higher concordance than the overall TR level in several studies. In the study performed by Chung et al., this was K=0.58 compared to K=0.47, and in Hoang et al.’s study K=0.35 for risk stratification level, improving to K=0.51 for biopsy recommendation[4], [5].