Abstract:
To address the limited discriminative capability of the features extracted by existing deep-learning-based scene classification methods and their unsatisfactory classification performance, this paper proposes SC-ETNet, a dual-stream architecture for remote sensing image scene classification that comprises a convolution stream and a transformation stream and is designed to improve the learning of scene-representative features. The convolution stream employs spatial and channel reconstruction convolutions to separate and reconstruct the features extracted by convolutional layers. The transformation stream uses LightViT to enable interaction between global tokens and image tokens, achieving local-global attention computation. On the UC-Merced, AID, and NWPU-RESISC45 datasets, SC-ETNet achieves mean classification accuracies of 99.61%, 97.81%, and 95.33%, respectively. These results indicate that SC-ETNet outperforms existing state-of-the-art scene classification methods.