Abstract
Colorectal cancer (CRC) is the second most common cause of cancer-related death worldwide, and early detection of adenomatous polyps by colonoscopy is important because their removal can prevent CRC. Nonetheless, conventional colonoscopy remains operator-dependent, with miss rates of up to 25% for small or flat lesions. Over the past decade, deep learning has substantially advanced automated polyp detection and segmentation. Convolutional neural networks (CNNs) such as U-Net, UNet++, and ResUNet++ set strong baselines, but their limited receptive field hindered generalization. More recently, vision transformers (ViTs) and hybrid CNN–transformer architectures have achieved state-of-the-art results by modeling long-range dependencies and incorporating global context. This systematic review examines more than 50 representative studies published between 2015 and 2025, with specific attention to results on the Kvasir-SEG, CVC-ClinicDB, and ETIS-Larib benchmarks. The reported results show that while CNN-based models reach Dice scores of 0.85–0.90 on easier datasets, ViTs and hybrids almost always outperform them, with the best models reaching Dice scores of 0.94 on Kvasir-SEG (NA-SegFormer) and 0.81 on ETIS-Larib. Our findings illustrate the disruptive impact of attention-based approaches, which bring performance closer to what clinicians need, and highlight persistent open challenges relating to dataset availability, practical computational cost, and high-quality clinical evidence. Future advances will likely center on lightweight, interpretable, and generalizable AI systems designed to provide real-time, clinically reliable polyp detection.