Abstract:
To address the common challenges of insufficient local detail representation and low global modeling efficiency in high-resolution remote sensing image semantic segmentation, this paper proposes a segmentation network that integrates convolutional neural networks and state-space models, named RMUNet. The proposed model employs a lightweight ResNet18 as the encoder and introduces a Visual State Space Block (VSSBlock) in the decoder to achieve efficient global context modeling. Meanwhile, a Local Feature Compensation Module (LFCM) is designed to enhance the perception of fine-grained semantic information. To mitigate the semantic bias that may arise during the fusion of shallow and deep features, a Cross-level Fusion Attention Module (CFAM) is further proposed to enable effective collaboration between spatial and semantic representations. Experimental results demonstrate that RMUNet achieves mean Intersection-over-Union (mIoU) scores of 83.78%, 87.09%, and 52.85% on the Vaihingen, Potsdam, and LoveDA datasets, respectively, outperforming existing mainstream methods. While maintaining low computational complexity, RMUNet significantly enhances feature representation and segmentation accuracy, providing an efficient and effective solution for high-resolution remote sensing image interpretation.