Papers
arxiv:2603.19039

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Published on Mar 19 · Submitted by Yan Shu on Mar 23
#3 Paper of the day
Abstract

TerraScope is a unified vision-language model that enables pixel-grounded geospatial reasoning through modality-flexible and multi-temporal capabilities, evaluated on a new benchmark with detailed visual reasoning outputs.

AI-generated summary

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses both modalities into the reasoning process when they are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset of 1 million samples from multiple sources with pixel-level masks embedded in the reasoning chains. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, with six sub-tasks evaluating both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
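The abstract says TerraScope-Bench scores both answer accuracy and mask quality, but does not spell out the scoring rule. A minimal sketch of one plausible joint metric, assuming exact-match answers and an IoU threshold on the predicted mask (the threshold of 0.5 and the `joint_score` helper are hypothetical, not from the paper):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def joint_score(answers, gt_answers, pred_masks, gt_masks, iou_thresh=0.5):
    """A sample counts as correct only if the answer matches AND the
    accompanying mask overlaps the ground truth above `iou_thresh`."""
    hits = [
        a == g and mask_iou(pm, gm) >= iou_thresh
        for a, g, pm, gm in zip(answers, gt_answers, pred_masks, gt_masks)
    ]
    return sum(hits) / len(hits)

# toy example: 2 samples with identical 4x4 masks
m = np.zeros((4, 4), dtype=int)
m[:2, :2] = 1
score = joint_score(["flood", "road"], ["flood", "bridge"], [m, m], [m, m])
print(score)  # 0.5: the mask matches on both samples, but the second answer is wrong
```

Gating accuracy on mask quality in this way is what makes the grounding "authentic": a model cannot score by answering correctly while pointing at the wrong pixels.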

Community

Paper submitter

CVPR 2026: pixel-grounded reasoning for Earth observation.


TerraScope not only improves VLM training through the Terra-CoT dataset but also endows the model with pixel-grounded reasoning by aligning its reasoning chains with pixel-level segmentation masks, thereby enabling multi-temporal change analysis and multimodal fusion.

Suggestions:

1. Although TerraScope achieves pixel-level grounding, its boundary precision could be further improved by incorporating boundary-aware losses or high-resolution feature fusion to reduce mask ambiguity.

2. While TerraScope supports adaptive multimodal fusion, explicit cross-modal alignment (e.g., contrastive learning or a shared latent space) could reduce discrepancies between optical and SAR representations.

3. Incorporating uncertainty estimation (e.g., probabilistic masks or confidence scores) could improve reliability in complex geospatial scenarios.
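To make the first suggestion concrete: one common form of boundary-aware loss up-weights the cross-entropy on ground-truth boundary pixels. A minimal numpy sketch, assuming 4-connectivity and a hypothetical weight of 5 on boundary pixels (neither is from the paper):

```python
import numpy as np

def boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels of a binary mask: foreground pixels with at
    least one background 4-neighbour (manual erosion, no SciPy)."""
    m = mask.astype(bool)
    p = np.pad(m, 1, constant_values=False)
    eroded = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
              & p[1:-1, :-2] & p[1:-1, 2:])
    return m & ~eroded

def boundary_weighted_bce(probs, gt, w_boundary=5.0, eps=1e-7):
    """Binary cross-entropy with extra weight on ground-truth boundary
    pixels, so errors near object edges are penalised more heavily."""
    gt = gt.astype(float)
    weights = np.where(boundary(gt.astype(int)), w_boundary, 1.0)
    p = np.clip(probs, eps, 1 - eps)
    bce = -(gt * np.log(p) + (1 - gt) * np.log(1 - p))
    return float((weights * bce).sum() / weights.sum())

gt = np.zeros((6, 6), dtype=int)
gt[2:5, 2:5] = 1
print(boundary(gt).sum())  # 8: the ring around the single interior pixel of a 3x3 square
```

Under this loss, the same prediction error costs five times more when it falls on an edge pixel than in the object interior, which is exactly the pressure toward sharper mask boundaries the suggestion describes.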

TerraScope represents a significant step toward pixel-grounded geospatial reasoning, but there remains room for improvement in boundary precision, cross-modal alignment, temporal modeling, and fine-grained reasoning.
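The cross-modal alignment mentioned above is often implemented as a symmetric InfoNCE objective over paired embeddings. A minimal numpy sketch, assuming paired optical/SAR embedding batches and a hypothetical temperature of 0.07 (none of this is claimed to match TerraScope's training):

```python
import numpy as np

def info_nce(opt_emb: np.ndarray, sar_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired optical/SAR embeddings:
    the matching pair (i, i) is pulled together, and every other pairing
    in the batch acts as a negative."""
    # L2-normalise so dot products are cosine similarities
    o = opt_emb / np.linalg.norm(opt_emb, axis=1, keepdims=True)
    s = sar_emb / np.linalg.norm(sar_emb, axis=1, keepdims=True)
    sim = o @ s.T / temperature  # (B, B) similarity matrix

    def diag_cross_entropy(logits):
        # cross-entropy where the "correct class" for row i is column i
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_sm = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        idx = np.arange(len(logits))
        return -log_sm[idx, idx].mean()

    # average the optical->SAR and SAR->optical directions
    return float((diag_cross_entropy(sim) + diag_cross_entropy(sim.T)) / 2)

rng = np.random.default_rng(0)
opt = rng.normal(size=(8, 16))
# perfectly paired embeddings score a lower loss than shuffled ones
print(info_nce(opt, opt) < info_nce(opt, opt[::-1]))  # True
```

Minimising this pulls the optical and SAR representations of the same scene toward a shared latent space, directly addressing the optical/SAR discrepancy the second suggestion raises.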

