Digital Health Innovation and Informatics
Purpose/Objective(s): Automatic segmentation methods aim to alleviate the labour-intensive contouring of organs at risk (OARs) and clinical target volumes (CTVs). Although deep learning-based contouring (DC) has shown improvement over manual and atlas-based auto-segmentation, most previous studies were limited to one expert observer per scan. We aimed to determine whether DC models trained on contours from a single Radiation Oncologist are comparable to manual contours from multiple expert Radiation Oncologists (EC).
Materials/Methods: Multiple Radiation Oncologists at a single center were asked to contour central nervous system (CNS), head and neck (H&N), and prostate RT OARs and CTVs on radiotherapy planning computed tomography (CT) scans. DCs were generated using deep learning auto-segmentation software based on a U-net architecture and trained using contours from a single Radiation Oncologist on publicly available datasets. DCs and ECs were compared using the Dice Similarity Coefficient (DSC) and 95% Hausdorff distance transform (DT). Radiation Oncologists recorded their manual contouring time for each scan.
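For readers unfamiliar with the two agreement metrics named above, the following is a minimal sketch of how DSC and a 95th-percentile Hausdorff distance might be computed for a pair of binary contour masks. It is not the study's software. The function names (`dice`, `surface`, `hd95`), the voxel `spacing` parameter, and the choice to take the larger of the two directed 95th-percentile surface distances are illustrative assumptions; other definitions of the 95% Hausdorff distance exist, and the abstract does not state which was used.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt


def dice(a, b):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / total


def surface(mask):
    """Voxels of the mask with at least one face-neighbour outside it."""
    return mask & ~binary_erosion(mask)


def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance (assumes non-empty masks).

    `spacing` is the voxel size along each axis, e.g. in mm (hypothetical
    parameter; the result is in the same units).
    """
    sa, sb = surface(a.astype(bool)), surface(b.astype(bool))
    # Distance from every voxel to the nearest surface voxel of each mask.
    dist_to_a = distance_transform_edt(~sa, sampling=spacing)
    dist_to_b = distance_transform_edt(~sb, sampling=spacing)
    # Directed distances: surface of A to B, and surface of B to A.
    d_ab = dist_to_b[sa]
    d_ba = dist_to_a[sb]
    # Assumption: report the larger of the two directed 95th percentiles.
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))


if __name__ == "__main__":
    # Toy example: two 10-voxel cubes offset by 2 voxels along one axis.
    ec = np.zeros((20, 20, 20), dtype=bool)
    dc = np.zeros((20, 20, 20), dtype=bool)
    ec[5:15, 5:15, 5:15] = True
    dc[7:17, 5:15, 5:15] = True
    print("DSC:", dice(dc, ec))
    print("95% HD:", hd95(dc, ec))
```

In practice the masks would be rasterised from the RT structure sets on the planning CT grid, with `spacing` taken from the scan's voxel dimensions so that distances are in millimetres.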
Results: We compared DCs to 129 expert-contoured structure sets on 43 CT scans. Each scan had 2-4 ECs, for a total of 60 CNS, 39 H&N, and 30 prostate EC structure sets. The mean DC and EC contouring times were 1.1 vs 8.0 minutes for CNS, 2.7 vs 27.8 minutes for H&N, and 1.4 vs 17.8 minutes for prostate structures. Differences in contouring duration were significant (p<0.005). Paired values below are reported as DC to EC vs EC to EC. For CNS structures, the DC to EC DSC and 95% DT were not significantly different from the EC to EC comparisons for brainstem and optic chiasm. The EC to EC comparisons were more similar for the optic globe DSC (0.88 vs 0.89; p=0.009) and optic chiasm 95% DT (6.2 vs 4.2 mm; p<0.005). For H&N structures, the DSC and 95% DT were not significantly different for the parotid gland and submandibular gland, and were significantly different for the neck CTV DSC (0.75 vs 0.80; p<0.005), neck CTV 95% DT (9.3 vs 6.4 mm; p<0.005), and spinal cord 95% DT (4.5 vs 2.6 mm; p<0.005). For prostate structures, there was no significant difference for seminal vesicles DSC and 95% DT. There was more similarity in the DC to EC comparisons for bladder DSC (0.97 vs 0.96; p=0.03), bladder 95% DT (2.9 vs 3.1 mm; p=0.02), femoral head DSC (0.92 vs 0.89; p<0.005), femoral head 95% DT (5.4 vs 8.4 mm; p=0.006), rectum DSC (0.84 vs 0.81; p=0.02), and rectum 95% DT (6.9 vs 10.0 mm; p=0.01). The EC to EC comparison was more similar for the prostate DSC (0.81 vs 0.84; p=0.01).
Conclusion: We observed minimal differences in DSC and 95% DT between EC to EC comparisons and DC to EC comparisons. These findings demonstrate that the accuracy of well-trained deep learning-based auto-segmentation models trained using a single Radiation Oncologist's contours is similar to expert inter-observer variability for CNS, H&N, and prostate RT structures.