Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion
Abstract
A semi-supervised remote sensing image segmentation framework that combines vision-language and self-supervised models to reduce pseudo-label drift through a dual-student architecture and semantic co-guidance mechanisms.
Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.
Community
We are excited to introduce our latest work on semi-supervised semantic segmentation:
📄 Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion
This paper tackles one of the most challenging issues in semi-supervised segmentation: pseudo-label drift. When labeled data are extremely scarce, self-training methods are prone to confirmation bias, where early incorrect pseudo-labels accumulate over time, leading to unstable training and degraded performance.
🧠 Motivation
Most existing consistency- or pseudo-label–based semi-supervised approaches rely heavily on self-generated supervision. Once early pseudo-labels become unreliable, error accumulation is inevitable. Our goal is to introduce stronger semantic priors to correct such drift and stabilize the training process.
✨ Key Contributions
1️⃣ Heterogeneous Dual-Student Framework
We leverage two complementary vision foundation models—CLIP for global semantic priors and DINOv3 for fine-grained local structures—to enable stable mutual learning and suppress error accumulation.
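To make this concrete, below is a minimal PyTorch sketch of one mutual-learning step between the two heterogeneous students, using plain cross pseudo supervision with confidence filtering. The threshold, the loss weighting, and the function names are illustrative assumptions, not the exact Co2S training recipe.

```python
# Minimal sketch of heterogeneous dual-student mutual learning:
# each student generates pseudo-labels for the other, with
# low-confidence pixels masked out to limit pseudo-label drift.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mutual_learning_step(student_a: nn.Module,  # e.g. CLIP-initialized ViT
                         student_b: nn.Module,  # e.g. DINOv3-initialized ViT
                         x_l, y_l,              # labeled batch: images, masks
                         x_u,                   # unlabeled batch
                         tau: float = 0.95,     # confidence threshold (assumed)
                         lam: float = 1.0):     # unsupervised weight (assumed)
    ce = nn.CrossEntropyLoss(ignore_index=255)

    # Supervised loss on the labeled batch for both students.
    loss_sup = ce(student_a(x_l), y_l) + ce(student_b(x_l), y_l)

    # Each student produces pseudo-labels without gradient flow.
    with torch.no_grad():
        conf_a, pl_a = F.softmax(student_a(x_u), dim=1).max(dim=1)
        conf_b, pl_b = F.softmax(student_b(x_u), dim=1).max(dim=1)
        # Ignore low-confidence pixels in the cross-supervision loss.
        pl_a[conf_a < tau] = 255
        pl_b[conf_b < tau] = 255

    # Cross supervision: A learns from B's pseudo-labels and vice versa.
    loss_unsup = ce(student_a(x_u), pl_b) + ce(student_b(x_u), pl_a)

    return loss_sup + lam * loss_unsup
```

In practice, semi-supervised pipelines of this kind typically also apply weak/strong augmentations to the unlabeled batch before each student's forward pass; that is omitted here for brevity.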
2️⃣ Explicit–Implicit Semantic Co-Guidance
By jointly utilizing text embeddings (explicit semantics) and learnable queries (implicit semantics), we provide class-level semantic anchors and enhance semantic consistency.
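A rough sketch of how such co-guidance could be wired up: frozen CLIP text embeddings serve as explicit class anchors, learnable queries serve as implicit ones, and both produce class-level logits for dense features. The additive fusion, the temperature value, and the module name are our assumptions for illustration.

```python
# Sketch of explicit-implicit class-level guidance. Explicit: frozen
# text embeddings (e.g. CLIP encodings of class-name prompts).
# Implicit: learnable class queries trained end-to-end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticCoGuidanceHead(nn.Module):
    def __init__(self, text_embeds: torch.Tensor, dim: int,
                 temperature: float = 0.07):
        super().__init__()
        num_classes, text_dim = text_embeds.shape
        # Explicit semantics: one frozen text anchor per class.
        self.register_buffer("text_embeds", F.normalize(text_embeds, dim=-1))
        # Implicit semantics: learnable per-class queries.
        self.queries = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.proj = nn.Linear(dim, text_dim)
        self.temperature = temperature

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: dense visual features [B, dim, H, W]
        B, C, H, W = feats.shape
        f = feats.flatten(2).transpose(1, 2)          # [B, HW, dim]

        # Explicit guidance: cosine similarity to text anchors.
        f_txt = F.normalize(self.proj(f), dim=-1)     # [B, HW, text_dim]
        logits_exp = f_txt @ self.text_embeds.t() / self.temperature

        # Implicit guidance: similarity to learnable queries.
        logits_imp = f @ self.queries.t()             # [B, HW, K]

        # Fuse both guidance signals into per-pixel class logits.
        logits = logits_exp + logits_imp
        return logits.transpose(1, 2).reshape(B, -1, H, W)
```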
3️⃣ Global–Local Feature Co-Fusion
We fuse CLIP’s global contextual understanding with DINOv3’s local structural details, yielding more accurate and stable segmentation results.
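Below is one minimal way such a fusion could look: DINOv3 patch tokens (local detail) query CLIP visual tokens (global context) through a single cross-attention block with residual connections. The block structure, dimensions, and names are illustrative assumptions; the fusion module in the paper may be organized differently.

```python
# Sketch of global-local feature co-fusion via cross-attention:
# local tokens attend to global tokens, and residuals preserve
# the fine-grained detail carried by the local branch.
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    def __init__(self, dim_local: int, dim_global: int, num_heads: int = 8):
        super().__init__()
        self.to_local_dim = nn.Linear(dim_global, dim_local)
        self.cross_attn = nn.MultiheadAttention(dim_local, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(dim_local)
        self.ffn = nn.Sequential(
            nn.Linear(dim_local, 4 * dim_local), nn.GELU(),
            nn.Linear(4 * dim_local, dim_local),
        )

    def forward(self, local_tokens, global_tokens):
        # local_tokens:  DINOv3 patch tokens  [B, N, dim_local]
        # global_tokens: CLIP visual tokens   [B, M, dim_global]
        g = self.to_local_dim(global_tokens)
        # Local tokens query global context; the residual keeps detail.
        fused, _ = self.cross_attn(query=local_tokens, key=g, value=g)
        x = self.norm(local_tokens + fused)
        return x + self.ffn(x)
```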
📊 Experimental Results
Extensive evaluations on six mainstream remote sensing benchmarks demonstrate that Co2S consistently achieves strong and stable performance across different data splits and scenarios, especially under extremely low annotation budgets.
📦 Open-Source Resources
- arXiv Paper: https://arxiv.org/abs/2512.23035
- Project Page: https://xavierjiezou.github.io/Co2S/
- GitHub Code: https://github.com/XavierJiezou/Co2S
- HuggingFace Models: https://huggingface.co/XavierJiezou/co2s-models
- HuggingFace Datasets: https://huggingface.co/datasets/XavierJiezou/co2s-datasets
#RemoteSensing #SemanticSegmentation #SemiSupervisedLearning #VisionFoundationModels #CLIP #DINOv3
The following papers were recommended by the Semantic Scholar API
- SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images (2025)
- BiCoR-Seg: Bidirectional Co-Refinement Framework for High-Resolution Remote Sensing Image Segmentation (2025)
- ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images (2025)
- Harmonizing Generalization and Specialization: Uncertainty-Informed Collaborative Learning for Semi-supervised Medical Image Segmentation (2025)
- CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation (2025)
- Multi-Text Guided Few-Shot Semantic Segmentation (2025)