Jain P., Marcos D., Ienco D., Interdonato R., Berchoux T. (2026). TimeSenCLIP: A time series vision-language model for remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing, June 2026, vol. 236, p. 99-119.
https://doi.org/10.1016/j.isprsjprs.2026.03.043
| Title : | TimeSenCLIP: A time series vision-language model for remote sensing (2026) |
| Authors : | P. Jain ; D. Marcos ; D. Ienco ; R. Interdonato ; T. Berchoux |
| Document type : | Article |
| In : | ISPRS Journal of Photogrammetry and Remote Sensing (vol. 236, June 2026) |
| Pages : | p. 99-119 |
| General note : | GRANULAR project |
| Language : | English |
| Abstract language : | English |
| Categories : | Main categories: 06 - AGRICULTURE. FORESTS. FISHERIES ; 6.6 - Agricultural techniques (soils, fertilizers, mechanization). IAMM thesaurus: MODEL ; IMAGING TECHNIQUE ; REMOTE SENSING ; CARTOGRAPHY ; SOIL ; PHOTO-INTERPRETATION ; TIME SERIES ANALYSIS ; LEARNING |
| Abstract : | Vision-language models (VLMs) have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face several key challenges, such as dependence on caption-based supervision, which is often unavailable or very limited in the semantics it covers, and the fact that they are adapted from generic VLM architectures designed for very high resolution images. Consequently, these models tend to prioritize spatial context over spectral and temporal information, limiting their effectiveness for medium-resolution remote sensing imagery. In this work, we present TimeSenCLIP, a lightweight VLM for remote sensing time series, which uses a cross-view temporal contrastive framework to align multispectral Sentinel-2 time series with geo-tagged ground-level imagery, without requiring textual annotations. Unlike prior VLMs, TimeSenCLIP emphasizes temporal and spectral signals over spatial context, investigating whether single-pixel time series contain sufficient information for solving a variety of tasks. Our approach is trained on the LUCAS and Sen4Map datasets and evaluated across four main mapping tasks: land cover, land use, habitat mapping, and crop type classification. The CLIP text encoder can be used to probe the learned representations using semantically meaningful categories, enabling effective zero-shot generalization without task-specific text supervision. We further extend our evaluation to bioregion mapping and country-level image retrieval. Although coarse, these tasks are valuable for probing whether the model captures geographically meaningful representations, such as regional climate regimes, vegetation patterns, and land-use structures. TimeSenCLIP achieves consistently better performance than existing CLIP-based remote sensing models in both zero-shot classification and cross-modal retrieval.
Notably, single-pixel multispectral time series variants remain highly competitive, particularly with extended temporal coverage, demonstrating that temporal-spectral dynamics can compensate to a substantial degree for the reduced spatial footprint. While larger spatial patches still offer advantages for tasks where spatial patterns are inherently informative, such as ecosystem type classification, the results suggest that single-pixel multispectral time series can support effective remote sensing vision-language pipelines, enabling scalable and efficient modeling in scenarios where large spatial tiles or extensive textual annotations are impractical. Code is available at https://github.com/pallavijain-pj/TimeSenCLIP |
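The zero-shot probing described in the abstract, where a pixel's time-series embedding is matched against text embeddings of class names via a CLIP-style encoder, can be sketched as follows. This is a minimal illustration only: the embeddings, class names, and dimensions below are toy values, not TimeSenCLIP's actual encoders or data.

```python
import numpy as np

def zero_shot_classify(ts_embedding, text_embeddings, class_names):
    """Assign the class whose text embedding has the highest cosine
    similarity with the time-series embedding (CLIP-style probing).
    All inputs here are illustrative placeholders."""
    ts = ts_embedding / np.linalg.norm(ts_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = txt @ ts  # cosine similarity of the pixel embedding to each class
    return class_names[int(np.argmax(sims))], sims

# Toy 4-dim embeddings for three hypothetical LULC classes.
classes = ["cropland", "forest", "urban"]
text_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
# A pixel embedding lying close to the "forest" text embedding.
pixel_emb = np.array([0.1, 0.9, 0.05, 0.0])
label, sims = zero_shot_classify(pixel_emb, text_emb, classes)
print(label)  # → forest
```

In the actual model, the text embeddings would come from the CLIP text encoder applied to category prompts, and the pixel embedding from the Sentinel-2 time-series encoder; no task-specific text supervision is needed at inference.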
| Content note : | Giving Rural Actors Novel data and re-Useable tools to Lead public Action in Rural areas (Grant agreement ID: 101061068) |
| Call number : | Online |
| URL / DOI : | https://doi.org/10.1016/j.isprsjprs.2026.03.043 |
Digital documents (1)
PRO54661.pdf (Adobe Acrobat PDF)


