SegFormer and SegFormer-UNet for anthropogenic geomorphic feature extraction from land surface parameters

Sarah Farhadpour; Aaron E Maxwell; Behnam Solouki

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Sarah Farhadpour, Aaron E Maxwell, Behnam Solouki

Abstract

Accurate, scalable mapping of anthropogenic geomorphic features from high spatial resolution terrain data remains challenging. While convolutional neural networks (CNNs) excel at characterizing local texture and patterns, their limited receptive fields may fail to capture broader spatial context. Transformer-based architectures, such as SegFormer, support stronger long-range dependency modelling, but their value for geomorphic mapping has not been systematically tested. We compare three architectures for pixel-level extraction of features from light detection and ranging (lidar)-derived land surface parameters (LSPs): (1) SegFormer, (2) SegUNet (a hybrid SegFormer-UNet model), and (3) a fully CNN-based UNet with a ResNet-34 encoder. Models are trained on 512×512 LSP chips for three tasks: agricultural terraces, historic mine benches, and valley fill faces. We evaluate random vs. ImageNet initialization and, when pretraining is used, frozen vs. unfrozen encoders. Training uses focal Tversky loss to address class imbalance; performance is assessed with F1-score, precision, recall, and overall accuracy, with robustness summarized via resampling of test chips. Results suggest that end-to-end training is the most consistent driver of gains across backbones and datasets, particularly for SegFormer. UNet-style architectures with CNN-based decoders generalize most reliably: SegUNet and the UNet typically achieve the best balance between precision and recall and the smoothest optimization. Contrary to common practice, random initialization often matches or exceeds ImageNet pretraining for LSP inputs, suggesting a domain gap between natural-image features and lidar-based terrain representations. Across tasks, difficulty tracks object morphology: agricultural terraces are hardest, mine benches intermediate, and valley fill faces easiest.
Practically, we recommend unfrozen encoders with a CNN-based decoder as a strong default, and we caution against assuming ImageNet pretraining will help when using LSP predictor variables. These findings provide clear guidance for selecting and optimizing deep learning models for anthropogenic geomorphic feature extraction from high spatial resolution digital terrain data.
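
For readers unfamiliar with the loss and metrics named in the abstract, the following is a minimal sketch of a binary focal Tversky loss and of precision, recall, and F1 computed from hard labels. It is not the authors' implementation: the function names, the flat-list inputs, and the default values of `alpha`, `beta`, `gamma`, and `eps` are illustrative assumptions. Formulations in the literature differ on which weight is attached to false negatives and on whether the exponent is `gamma` or `1/gamma`.

```python
def focal_tversky_loss(probs, targets, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """Binary focal Tversky loss over flat sequences (illustrative sketch).

    probs   : predicted foreground probabilities in [0, 1]
    targets : ground-truth labels, 0 (background) or 1 (feature)
    alpha   : weight on false negatives (assumed default, not from the paper)
    beta    : weight on false positives (assumed default, not from the paper)
    gamma   : focusing exponent (assumed default; some papers use 1/gamma)
    """
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")

    # Soft counts: probabilities stand in for hard predictions.
    tp = sum(p * t for p, t in zip(probs, targets))
    fn = sum((1.0 - p) * t for p, t in zip(probs, targets))
    fp = sum(p * (1.0 - t) for p, t in zip(probs, targets))

    # Tversky index; eps avoids division by zero on empty chips.
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - tversky) ** gamma


def precision_recall_f1(preds, targets):
    """Precision, recall, and F1 for the positive class from hard 0/1 labels."""
    tp = sum(1 for p, t in zip(preds, targets) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, targets) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, targets) if p == 0 and t == 1)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```

With `alpha` greater than `beta`, as assumed here, a missed feature pixel is penalized more than a false alarm, which is one common way to counter the scarcity of the positive class in chips dominated by background.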

DOI

https://doi.org/10.31223/X5VN47

Subjects

Environmental Monitoring, Geology, Geomorphology, Remote Sensing

Keywords

Lidar, Land Surface Parameters, Semantic Segmentation, CNNs, UNet, Geomorphic Feature Extraction, Transformers, SegFormer

Dates

Published: 2026-06-28 00:26

Last Updated: 2026-06-28 00:26

License

CC-BY Attribution-NonCommercial 4.0 International

Additional Metadata

Conflict of interest statement:
None
