Castro, LT;
Barata, C;
Martins, P;
Afonso, F;
Pascoal, M;
Santiago, C;
Mennillo, L;
... Soares, AS
(2026)
Benchmarking variability in semantic segmentation in minimally invasive abdominal surgery.
International Journal of Computer Assisted Radiology and Surgery
10.1007/s11548-025-03562-3.
(In press).
Text: R2_nottrackedchanges_Benchmarking.docx.pdf - Accepted Version
Abstract
PURPOSE: Anatomical identification during abdominal surgery is subjective because the boundaries of anatomical structures are often unclear. Semantic segmentation of these structures relies on accurate boundary identification, which carries an unknown degree of uncertainty. Given this inherent subjectivity, it is important to assess annotation adequacy. This study aims to evaluate variability in anatomical structure identification and segmentation performed by surgical residents using MedSAM.

METHODS: Images from the Dresden Surgical Anatomy Dataset and the Endoscapes2023 Dataset were semantically annotated by a group of surgical residents using MedSAM in the following classes: abdominal wall, colon, liver, small bowel, spleen, stomach and gallbladder. Each class had 3 to 4 sets of annotations. Inter-annotator variability was assessed using the Dice similarity coefficient (DSC), the intraclass correlation coefficient (ICC) and boundary intersection over union (BIoU), by applying the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm to obtain a consensus mask, and by calculating Fleiss' kappa agreement between all annotations and the reference.

RESULTS: The study showed strong inter-annotator agreement among surgical residents, with DSC values of 0.84-0.95 and Fleiss' kappa between 0.85 and 0.91. Surface area reliability was good to excellent (ICC = 0.62-0.91), while boundary delineation showed lower reproducibility (BIoU = 0.092-0.157). STAPLE consensus masks confirmed consistent overall shape annotations despite variability in boundary precision.

CONCLUSION: The study demonstrated low variability in the semantic segmentation of intraperitoneal organs in minimally invasive abdominal surgery, performed by surgical residents using MedSAM. While the DSC and Fleiss' kappa values confirm strong inter-annotator agreement, the relatively low BIoU values point to challenges in boundary precision, especially for anatomically complex or variable structures. These results establish a benchmark for expanding annotation efforts to larger datasets and more detailed anatomical features.
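
To make two of the agreement measures named in the abstract concrete, the following is a minimal sketch of the Dice similarity coefficient and Fleiss' kappa on binary masks. It is not the authors' pipeline: the function names, the toy masks, and the choice to treat every pixel as a rated item with two categories (structure or background) are assumptions made for illustration only.

```python
from typing import Sequence

# Assumption: a mask is a 2D grid of 0/1 values (1 = structure, 0 = background).
Mask = Sequence[Sequence[int]]


def dice(a: Mask, b: Mask) -> float:
    """Dice similarity coefficient between two binary masks of equal shape."""
    inter = sum(x & y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    total = sum(map(sum, a)) + sum(map(sum, b))
    # Two empty masks are treated as perfect agreement (a convention, not a rule).
    return 1.0 if total == 0 else 2.0 * inter / total


def fleiss_kappa_binary(masks: Sequence[Mask]) -> float:
    """Fleiss' kappa with each pixel as an item and each mask as one rater."""
    n = len(masks)  # number of raters (annotators)
    # For every pixel, count how many raters labelled it as foreground.
    counts = [sum(px) for rows in zip(*masks) for px in zip(*rows)]
    n_items = len(counts)
    # Mean observed agreement: P_i = (sum_j n_ij^2 - n) / (n * (n - 1)).
    p_bar = sum(
        (c * c + (n - c) * (n - c) - n) / (n * (n - 1)) for c in counts
    ) / n_items
    # Chance agreement from the overall foreground/background proportions.
    p_fg = sum(counts) / (n_items * n)
    p_e = p_fg ** 2 + (1 - p_fg) ** 2
    return 1.0 if p_e == 1.0 else (p_bar - p_e) / (1 - p_e)


# Hypothetical toy annotations from three raters on a 2x3 image.
m1 = [[1, 1, 0], [0, 0, 0]]
m2 = [[1, 1, 0], [0, 0, 0]]
m3 = [[1, 0, 0], [0, 0, 0]]

pairwise_dsc = dice(m1, m3)
kappa = fleiss_kappa_binary([m1, m2, m3])
```

On real data the masks would be image-sized arrays, and the kappa computation would likely be restricted or weighted by class, so values from this toy setup are not comparable with those reported in the abstract.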