CITYREP: A Unified Benchmark for Urban Representati...

What the paper is about

Urban representation learning encodes complex urban environments into general-purpose embeddings for diverse downstream tasks and emerging urban foundation models. However, current evaluations are limited, typically focusing on one or two cities and tasks and relying on random splits that introduce spatial leakage, leading to inflated performance and weak support for cross-location generalization and fair comparison. To address this, we propose CityRep, a unified benchmark that evaluates urban representations across data modalities, cities, and tasks using spatially structured splits. CityRep consists of three key components: (1) a spatial unit-agnostic evaluation framework that supports heterogeneous urban representations through a standardized alignment module; (2) a unified evaluation protocol using block-based spatial splits to mitigate spatial leakage and enable rigorous model comparison; and (3) an extensible multi-city, multi-task benchmark suite spanning 8 cities and 8 tasks across regression, classification, and distribution prediction. We evaluate 11 representative urban representation models. Results show that performance is highly sensitive to the split protocol, with random splits inflating scores and altering model rankings. We also observe substantial variability across cities and tasks, underscoring the need for generalization-aware evaluation. CityRep is released as a reproducible benchmark with datasets, evaluation pipelines, and diagnostic tools to facilitate fair comparison and support future research in urban representation learning towards urban foundation models.

What it covers

CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities

Junyuan Liu 1, Xinglei Wang 1, Zichao Zeng 1,2, Jiazhuang Feng 1, Quan Qin 1,3, Ilya Ilyankou 1, Guangsheng Dong 1,4, Tao Cheng 1,†

1 SpaceTimeLab, University College London, UK
2 3DIMPact, University College London, UK
3 School of Resource and Environmental Sciences, Wuhan University, China
4 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, China
† Corresponding author: [email protected]

Code: https://github.com/inwind0212/CityRep

1 Introduction

Figure 1: Framework of CityRep Benchmark. CityRep standardizes the evaluation of heterogeneous urban representations by aligning different spatial supports to common downstream task units, evaluating them across eight cities and eight tasks, and using spatial block splits to mitigate leakage.

Urban representation learning seeks to turn heterogeneous observations of cities into reusable spatial embeddings. Recent models draw on remote sensing [16, 12, 7], street-view imagery [42, 47], and points of interest [46, 14, 41, 3, 20] to encode geographic entities, regions, or locations. This motivation parallels the broader shift toward foundation models: representations learned from broad urban data are expected to transfer across tasks, locations, and domains. Yet the evaluation of these representations remains much less unified. Reported results are often tied to a particular model interface or a particular format of downstream task, making it difficult to assess whether an embedding is broadly useful or only effective under a narrow evaluation setup.

Existing evaluations are also too narrow to support claims about general-purpose urban representations. Many studies evaluate on one or two cities, a small number of tasks, or a single label type. Such experiments are valuable for demonstrating a specific application, but they do not reveal whether a representation transfers across urban contexts or across qualitatively different prediction problems. This limitation is especially important for emerging urban foundation models, whose value depends on broad reuse.
A benchmark should therefore cover multiple cities, multiple urban domains, and multiple task types, while remaining extensible to incorporate new cities, tasks, and models.

Finally, evaluation must explicitly account for spatial dependence. In urban representation learning, we expect models to leverage information from observed regions to make predictions in unseen areas, rather than merely interpolate among nearby samples. While this challenge has been well recognized in spatial validation studies [34, 38, 26], most existing urban representation models are still evaluated using random splits that ignore spatial structure. We address this gap by establishing a unified spatially structured evaluation protocol. Our benchmark provides empirical evidence that random splits can substantially inflate performance and lead to over-optimistic conclusions about model generalization.

We introduce CityRep, a unified benchmark for urban representations across modalities, tasks, and cities, as shown in Figure 1. The central goal of CityRep is to move urban representation evaluation beyond narrow, single-setting comparisons. Rather than assessing an embedding on one city, one task, or one type of urban label, CityRep asks whether a representation remains useful across a sufficiently broad set of urban phenomena and geographic contexts. To operationalize this goal, we construct downstream evaluation data from four key dimensions of urban systems: morphology, demographics, economy, and environment. These dimensions are instantiated as eight tasks across eight cities, covering classification, regression, and distribution prediction.

CityRep pairs this broad task suite with a common evaluation protocol for heterogeneous representations. Models based on rasters, regions, entities, or coordinates can be evaluated under the same downstream interface, while spatially structured splits are used to reduce leakage between nearby training and test samples. CityRep therefore makes it possible to examine not only average performance, but also how representation quality changes across urban contexts, task domains, label types, and split protocols. This provides a basis for assessing whether urban representations are genuinely general-purpose, rather than effective only under narrow evaluation settings.

In summary, our contributions are:

• We introduce CityRep, a unified and extensible benchmark for urban representation learning that supports heterogeneous representation types across data modalities, cities, and downstream urban tasks.

• We design a spatially structured evaluation methodology, including spatial-unit alignment and block-based spatial splits, to enable fair comparison across heterogeneous urban representations while mitigating spatial leakage.

• We conduct a large-scale empirical study of eleven representative urban and geospatial representation models across eight cities and eight tasks, showing that benchmark conclusions are highly sensitive to the evaluation protocol, task domain, and urban context.

• We publicly release datasets, evaluation pipelines, processed benchmarks, model manifests, and diagnostic tools to support reproducible research on urban representation learning and urban foundation models.

2 Related Work

Urban Representation Learning

Existing urban representation learning methods are highly heterogeneous, differing in the data they utilise, the spatial units they operate upon, and the urban signals they encode. Following geographic information systems (GIS) taxonomy, this heterogeneity is largely shaped by whether models ingest vector data, such as points, polylines, and polygons, or raster data, such as satellite and street-view imagery. Vector-based methods often rely on POI data but produce different outputs: category embedding methods, including Place2Vec [46], POI2Vec [11], and SPPE [14], capture spatial co-occurrence patterns of POI types and require aggregation to represent urban spaces, whereas entity embedding models such as Urban2Vec [42], HGI [15], and CityFM [3] directly encode regions, buildings, or roads. Raster-based methods, by contrast, naturally produce grid-cell embeddings, with earth-observation foundation models such as AlphaEarth Foundation [7] and TESSERA [12] enabling dense large-scale representations, and AETHER further incorporating POI semantics into raster foundations [19]. A related line of coordinate-based encoders, including Space2Vec [22], SatCLIP [16], and CaLLiPer [41], learns representations for continuous locations from POIs, imagery, or language supervision. Consequently, the resulting embeddings operate over disparate spatial supports (e.g., regions, H3 cells [36], raster grids, and coordinates) and capture varying urban information.
This creates a central evaluation challenge: heterogeneous methods cannot be fairly compared without spatial alignment to standard task units, and evaluation on only a few downstream tasks is insufficient for representations capturing diverse urban signals.

Geospatial Benchmarks and Spatial Evaluation

Urban representation learning can be viewed as a fine-grained, city-focused branch of geospatial representation learning. Existing geospatial benchmarks have largely started from image- or raster-centered settings. TorchGeo provides reusable infrastructure for geospatial data loading, sampling, and model development [35], while GEO-Bench, SatlasPretrain, and PANGAEA standardize Earth-observation evaluation and pretraining across tasks, sensors, resolutions, regions, and temporal settings [18, 4, 24]. Recent benchmarks move closer to spatial representation learning. TorchSpatial evaluates general-purpose location encoders [45], OBSR evaluates geospatial embedders on regional and trajectory tasks [27], and MoRA introduces human-centric social and economic prediction tasks based on mobility-centered representations [43]. However, they still provide limited evidence on whether representations capture fine-grained intra-urban structure and functions.

Spatial evaluation is also critical. Prior work shows that random splits can overestimate performance under spatial dependence and recommends spatially structured validation for assessing transfer to unseen areas [34, 38, 26, 33]. These gaps motivate a unified multi-city benchmark for heterogeneous urban representations under spatially robust evaluation protocols.

3 CityRep Framework and Benchmark

CityRep aims to make urban representations comparable across spatial units, tasks, and cities. Figure 1 provides an overview of the CityRep benchmark.
Urban representations arise in diverse forms, including raster-based embeddings, region-level features, POI or entity representations, and coordinate-based encoders, while downstream labels are defined over heterogeneous spatial units. To enable comparison across such settings, CityRep first aligns each representation to common task units across multiple cities and domains through standardized spatial alignment strategies, constructing unified task datasets. These aligned features are then evaluated under spatially structured split protocols, ensuring that performance reflects generalization to unseen areas rather than interpolation among nearby samples. Finally, results are aggregated across cities using task-appropriate metrics, enabling consistent and robust comparison of different urban representation models.

3.1 Problem Definition

Urban representation learning aims to compress heterogeneous observations of a city into reusable spatial embeddings. Let $\mathcal{D}_{m,c}$ denote the input data used by representation model $m$ in city $c$, such as satellite imagery, street-view imagery, points of interest, road networks, or geographic coordinates. The model transforms these observations into a city representation

$$E_{m,c} = f_{m}(\mathcal{D}_{m,c}), \tag{1}$$

where $E_{m,c}$ may be defined on a raster grid, a set of regions or cells, a collection of POIs or spatial entities, or a continuous coordinate domain. CityRep does not prescribe how $f_{m}$ is trained. Instead, it evaluates whether the resulting representation captures transferable urban information that is useful for downstream prediction across multiple task domains.

For each city $c$ and downstream task $t$, CityRep defines a set of task units $\mathcal{U}_{c,t} = \{u_{i}\}_{i=1}^{n_{c,t}}$ and labels $\mathbf{y}_{c,t}$. Because the native unit of $E_{m,c}$ generally differs from the unit of $\mathcal{U}_{c,t}$, the central benchmark operation is spatial alignment:

$$\mathbf{X}_{m,c,t} = A(E_{m,c}, \mathcal{U}_{c,t}), \qquad \mathbf{X}_{m,c,t} \in \mathbb{R}^{n_{c,t} \times d_{m}}, \tag{2}$$

where $A(\cdot)$ maps the native representation to the task units and $d_{m}$ is the embedding dimension. Each row of $\mathbf{X}_{m,c,t}$ is the feature vector assigned to one task unit and is paired with the corresponding label in $\mathbf{y}_{c,t}$. Given the aligned features, CityRep evaluates each representation with a fixed downstream predictor

$$\hat{\mathbf{y}}_{c,t} = g_{\theta,t}(\mathbf{X}_{m,c,t}), \tag{3}$$

where the prediction head is chosen according to the task type: regression, classification, or distribution prediction. This formulation separates representation learning from benchmark evaluation. Models may differ in input data modality, pretraining objective, and native spatial support, but they are compared by the same question: after alignment to the downstream task units, how much task-relevant urban information does the representation provide?

3.2 Spatial Alignment

Spatial alignment is the mechanism that makes heterogeneous urban representations comparable. Representation models and downstream tasks are often defined on different spatial supports, such as raster cells, regions, entities, or coordinates. CityRep therefore treats alignment as a spatial matching problem: for each downstream task unit $u_{i}$, the benchmark assigns a representation vector that corresponds to the same location or spatial area. The goal is not to force all models and tasks onto a single universal grid, but to preserve each task's native evaluation unit while mapping every representation to that unit in a consistent way.
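To make the alignment operator $A$ of Eq. (2) concrete, the following is a minimal sketch covering two situations: representation cells that are finer than the task units, and representation cells that are coarser. It is not the CityRep implementation. Mean pooling, the function names, and the precomputed index arrays (`unit_of_cell`, `cell_of_unit`) are all assumptions made for illustration; in practice those indices would come from a spatial join.

```python
import numpy as np


def align_fine_to_coarse(emb, unit_of_cell, n_units):
    """Pool fine-grained embeddings into task units (assumed: mean pooling).

    emb          : (n_cells, d) embeddings on the representation's native cells
    unit_of_cell : (n_cells,) index of the task unit containing each cell, -1 if none
    n_units      : number of task units
    Returns X of shape (n_units, d); units that contain no cell are left as NaN.
    """
    emb = np.asarray(emb, dtype=float)
    unit_of_cell = np.asarray(unit_of_cell)
    sums = np.zeros((n_units, emb.shape[1]))
    counts = np.zeros(n_units)
    valid = unit_of_cell >= 0
    # Unbuffered accumulation so repeated unit indices are all counted.
    np.add.at(sums, unit_of_cell[valid], emb[valid])
    np.add.at(counts, unit_of_cell[valid], 1)
    X = np.full((n_units, emb.shape[1]), np.nan)
    covered = counts > 0
    X[covered] = sums[covered] / counts[covered, None]
    return X


def align_coarse_to_fine(emb, cell_of_unit):
    """Every task unit inherits the embedding of the coarser cell that covers it.

    emb          : (n_cells, d) embeddings on the representation's native cells
    cell_of_unit : (n_units,) index of the covering cell for each task unit
    """
    return np.asarray(emb, dtype=float)[np.asarray(cell_of_unit)]


# Hypothetical toy data: three fine cells, three task units (the last one empty).
emb = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_pooled = align_fine_to_coarse(emb, np.array([0, 0, 1]), n_units=3)
X_shared = align_coarse_to_fine(emb, np.array([1, 1, 0]))
```

In both cases the output has one row per task unit, so it can be paired row by row with the labels $\mathbf{y}_{c,t}$. Leaving uncovered units as NaN keeps missing features visible instead of silently filling them.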
CityRep implements this principle according to the spatial relationship between the representation unit and the downstream task unit. For raster or region-level embeddings, if the representation units are finer than the task unit, CityRep aggregates the embeddings within the task unit. If the representation unit is coarser than the task unit, all task units covered by the same representation unit share its embedding. When the spatial supports are directly compatible, alignment reduces to raster sampling, cell lookup, or spatial join. For entity-level embeddings, such as POI or map-entity representations, CityRep first aggregates entities to an intermediate support, such as H3 cells, and then applies the same region-matching rules. This avoids directly aggregating sparse and unevenly distributed entities to every downstream task unit, which can otherwise produce many missing features and degrade downstream evaluation; an ablation supporting this H3-first design is provided in Appendix C.3. For coordinate encoders, no stored spatial support is required: CityRep queries the encoder at a representative coordinate of each task unit, such as a point location, raster-cell center, or polygon representative point, and creates an embedding from these sample points.

3.3 Spatial Split

CityRep uses spatial splitting to define the generalization target of the benchmark. Let $\mathcal{U}_{c,t}$ be the task units for city $c$ and task $t$. Instead of drawing train and test samples independently from $\mathcal{U}_{c,t}$, we first partition the spatial extent of the task into a set of non-overlapping blocks $\mathcal{B}_{c,t} = \{B_{j}\}_{j=1}^{J_{c,t}}$. Each task unit is assigned to one block by spatial containment or by the location of its representative point:

$$b(u_{i}) \in \mathcal{B}_{c,t}. \tag{4}$$

The train, validation, and test sets are then formed by assigning blocks, not individual task units, to disjoint subsets:

$$\mathcal{B}_{c,t} = \mathcal{B}^{\mathrm{train}}_{c,t,k} \cup \mathcal{B}^{\mathrm{val}}_{c,t,k} \cup \mathcal{B}^{\mathrm{test}}_{c,t,k}, \tag{5}$$

where $k$ indexes the random seed and the three block sets are mutually disjoint. The corresponding task-unit split is induced by block membership:

$$\mathcal{U}^{s}_{c,t,k} = \{u_{i} \in \mathcal{U}_{c,t} : b(u_{i}) \in \mathcal{B}^{s}_{c,t,k}\}, \qquad s \in \{\mathrm{train}, \mathrm{val}, \mathrm{test}\}. \tag{6}$$

This formulation makes spatial separation part of the evaluation protocol. Test samples are held out together with spatially proximate samples within the same block, reducing the chance that performance is driven primarily by local interpolation from adjacent training points. The split is task-specific because different tasks may have different spatial extents, valid masks, and label supports, but it is model-invariant: all models evaluated on the same city–task pair use the same block partition and the same seed-specific block assignment. In the current benchmark instantiation, we use a $10 \times 10$ spatial block partition for the main results and report a block-granularity sensitivity analysis in Appendix D.3. Detailed split configurations, visualization examples, and cross-seed test-block statistics are provided in Appendix C.1.

3.4 Tasks and Dataset

CityRep is built around eight downstream tasks that reflect different dimensions of urban systems. The goal is to test whether an urban representation captures information that transfers beyond a single visual pattern or geographic prior. The tasks span regression, classification, and distribution prediction.

Downstream Tasks.
CityRep includes eight downstream tasks organized into four urban domains: Morphology (♠), Demographics (♡), Economy (♢), and Environment (♣). Details of the downstream task datasets and raw data sources are provided in Appendix B.1.

- Land-use classification (LUC) ♠. This task uses city-specific zoning or land-use datasets from official or public planning sources [29, 37, 30, 8, 32, 28, 44]. It evaluates whether representations capture semantic urban functions such as residential, commercial, industrial, transportation, green space, institutional, utilities, water bodies, and mixed-use areas.
- Road-density regression (RDE) ♠. This task uses OpenStreetMap road-network data [31]. It evaluates whether representations capture physical street structure and connectivity.
- Population regression (POP) ♡. This task uses WorldPop gridded population datasets [5]. It evaluates whether representations capture spatial variation in population intensity.
- Age-distribution prediction (AGE) ♡. This task uses WorldPop age–sex datasets [6]. It evaluates whether representations capture demographic composition across age groups.
- Gross Domestic Product regression (GDP) ♢. This task uses gridded GDP datasets from Kummu et al. [17]. It evaluates whether representations capture spatial variation in economic output.
- Nighttime lights regression (NTL) ♢. This task uses the VIIRS Nighttime Lights Annual V2.2 product [9, 10]. It evaluates whether representations capture spatial patterns of human activity, electrification, commercial intensity, and infrastructure use visible through nighttime illumination.
- PM2.5 regression ♣. This task uses SEDAC/CIESIN annual PM2.5 concentration datasets [39]. It evaluates whether representations capture fine particulate pollution exposure.
- Land-surface-temperature regression (LST) ♣. This task uses MODIS/Terra MOD11A2 daytime land-surface-temperature datasets [40]. It evaluates whether representations capture surface thermal conditions related to land cover, density, vegetation, and impervious surface.

Cities and extensibility. Most sources used in CityRep are global or near-global, including WorldPop, gridded GDP, nighttime lights, PM2.5, MODIS LST, and OpenStreetMap. As a result, adding a new city mainly requires defining the boundary, extracting the same source layers, and running the standard task construction and alignment pipeline. We instantiate the benchmark on London, New York, Singapore, Sydney, Mumbai, Nairobi, Jakarta, and Cape Town, covering developed and developing urban contexts across Europe, North America, Asia, Africa, and Australia. To further demonstrate extensibility, Appendix D.4 extends the evaluation of global remote-sensing representations to 26 cities using the same benchmark pipeline.

Table 1 reports the number of downstream task units for each city and task. The counts vary because cities differ in spatial extent, valid masks, source resolution, and task support. Dense raster-derived tasks such as population, road density, and land-surface temperature contain many units, while coarser grids such as GDP, NTL, and PM2.5 contain fewer.

Table 1: Number of prediction units for each downstream task across eight cities.
| Domain | Task | London | New York | Singapore | Sydney | Mumbai | Nairobi | Jakarta | Cape Town |
|---|---|---|---|---|---|---|---|---|---|
| ♠ Morphology | Land use | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 |
| ♠ Morphology | Road density | 297,314 | 187,028 | 83,393 | 612,027 | 58,778 | 81,389 | 75,724 | 343,593 |
| ♡ Demographics | Population | 266,183 | 114,276 | 64,172 | 416,254 | 43,522 | 61,677 | 74,885 | 167,771 |
| ♡ Demographics | Age distribution | 234,428 | 87,736 | 32,990 | 265,815 | 43,455 | 55,521 | 74,096 | 136,093 |
| ♢ Economy | Gross Domestic Product | 2,977 | 1,812 | 806 | 6,121 | 575 | 817 | 756 | 3,420 |
| ♢ Economy | Nighttime lights | 2,977 | 1,867 | 839 | 6,119 | 583 | 818 | 756 | 3,434 |
| ♣ Environment | PM2.5 | 2,072 | 1,038 | 551 | 4,231 | 407 | 569 | 527 | 2,380 |
| ♣ Environment | Land surface temperature | 295,364 | 112,433 | 70,009 | 592,033 | 55,394 | 81,389 | 75,117 | 339,759 |

3.5 Selected Baselines

We evaluate eleven representative urban and geospatial representation models. The reproduction pipeline uses Foursquare POIs [13], Mapillary street-view imagery [25], OpenStreetMap entities [31], and public remote-sensing embedding products [12, 7] as the main raw data sources.

- PE [21]. Position encoding (PE) functions encode multi-scale location signals, serving as simple urban representations based entirely on spatial information. We select SphereC [23] as a representative example to illustrate the performance of these methods.
- Place2Vec [46] learns place representations from POI context. We reproduce it with Foursquare POIs and aggregate the learned embeddings to H3 cells.
- Space2Vec [22] represents locations through a coordinate encoder. We train it with POI category supervision and query the encoder directly at raster-cell centers or land-use point coordinates.
- CaLLiPer [41] is a coordinate-based urban representation pretrained via language supervision from POI textual descriptions. Its embeddings are exported through the same interface as Space2Vec.
- CityFM [3] learns urban representations from map entity information. We reproduce it using OpenStreetMap entities and export embeddings to H3 cells.
- Urban2Vec [42] combines street-view imagery and POIs to learn region representations. We construct its inputs from Mapillary imagery and Foursquare POIs, and anchor the embeddings to H3 cells.
- MuseCL [47] is a multimodal urban representation model. Since consistent mobility data are unavailable across all benchmark cities, we implement a CityRep-compatible variant that replaces the mobility branch with Foursquare POI semantics while retaining the street-view, remote-sensing, and semantic fusion components.
- SatCLIP [16] is a pretrained geographic coordinate encoder. We use the model checkpoint and query it directly at downstream task locations. Although not intended for city-level representation learning, it is included as a representative imagery-based coordinate embedding method.
- TESSERA [12] provides pretrained remote-sensing embedding rasters. We use the released embeddings as fixed raster representations and align them to task units by raster sampling.
- AlphaEarth [7] is a large-scale pretrained geospatial embedding product. We crop or sample its released raster embeddings for each city and task.
- AETHER [19] is a POI-guided alignment framework for pretrained imagery embeddings. We reproduce it by aligning AlphaEarth embedding inputs with Foursquare POI semantics.

3.6 Evaluation

CityRep evaluates each representation after spatial alignment to the downstream task units. For a model $m$, city $c$, and task $t$, the aligned feature matrix $\mathbf{X}_{m,c,t}$ is paired with the task labels $\mathbf{y}_{c,t}$ and used to train a lightweight task head. To make model comparison depend primarily on the representation rather than on downstream model engineering, CityRep uses the same predictor family and training protocol for all representation models.
After alignment, each representation is evaluated by a task-specific prediction head:

$$\hat{\mathbf{y}}_{c,t} = g_{\theta,t}(\mathbf{X}_{m,c,t}), \tag{7}$$

where the output layer and loss are selected according to the task type.

Evaluation metrics. CityRep uses nine task-appropriate metrics across the three prediction types. For regression tasks, including road density, population, GDP, NTL, PM2.5, and land surface temperature, we report $R^{2}$, mean absolute error (MAE), and root mean squared error (RMSE), with $R^{2}$ used as the primary metric. For land-use classification, we report macro F1, macro recall, and macro precision, with macro F1 used as the primary metric because it gives equal weight to each class under imbalanced labels. For age-distribution prediction, we report KL divergence, Chebyshev distance, and L1 distance between the predicted and target distributions, with KL divergence used as the primary metric. Higher values are better for $R^{2}$, F1, recall, and precision, while lower values are better for MAE, RMSE, KL divergence, Chebyshev distance, and L1 distance. Formal metric definitions are provided in Appendix C.4.

Training and aggregation. All downstream predictors use the same MLP architecture and training protocol across models and tasks. For each model–task–city setting, we run five spatial split seeds, $\{42, 24, 7, 0, 100\}$, and average the primary test metric over seeds to obtain a city-level score. Table 2 reports, for each model and task, the mean of city-level scores across cities as Avg., together with the cross-city standard deviation as C Std. Since different tasks use different primary metrics and metric scales, CityRep reports raw task metrics in the main result columns and uses Mean City Rank as a rank-based diagnostic summary for comparing models across tasks and cities. Lower values indicate better overall rank.
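The seed-indexed block splits of Section 3.3 and the aggregation described above can be sketched as follows. This is a hedged illustration, not the released pipeline. The regular grid over the bounding box, the 70/10/20 block fractions, the population standard deviation, the arbitrary tie-breaking in ranks, and all function names are assumptions; the source states only the $10 \times 10$ partition, the five seeds, and the quantities reported.

```python
import numpy as np

SEEDS = (42, 24, 7, 0, 100)  # spatial split seeds listed in the text


def assign_blocks(xy, n_side=10):
    """Map each task unit's representative point to one cell of an
    n_side x n_side grid over the bounding box (assumed block layout)."""
    xy = np.asarray(xy, dtype=float)
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against zero extent
    ij = np.floor((xy - lo) / span * n_side).astype(int)
    ij = np.clip(ij, 0, n_side - 1)  # points on the upper edge join the last block
    return ij[:, 0] * n_side + ij[:, 1]


def block_split(block_ids, seed, frac=(0.7, 0.1, 0.2)):
    """Assign whole blocks, not individual units, to train/val/test.
    The fractions are assumed; the source does not state them here."""
    rng = np.random.default_rng(seed)
    blocks = rng.permutation(np.unique(block_ids))
    n_train = int(round(frac[0] * len(blocks)))
    n_val = int(round(frac[1] * len(blocks)))
    groups = {
        "train": blocks[:n_train],
        "val": blocks[n_train:n_train + n_val],
        "test": blocks[n_train + n_val:],
    }
    # The unit-level split is induced by block membership.
    return {name: np.isin(block_ids, b) for name, b in groups.items()}


def summarize(seed_scores):
    """seed_scores: (n_cities, n_seeds) primary test metric for one model and task.
    Returns (Avg., C Std.): mean and std of the seed-averaged city-level scores."""
    city_scores = np.asarray(seed_scores, dtype=float).mean(axis=1)
    return city_scores.mean(), city_scores.std()  # population std is an assumption


def mean_city_rank(city_scores, higher_is_better=True):
    """city_scores: (n_models, n_cities) for one task. Ranks models within
    each city (1 = best, ties broken arbitrarily) and averages over cities."""
    s = np.asarray(city_scores, dtype=float)
    s = -s if higher_is_better else s
    ranks = s.argsort(axis=0).argsort(axis=0) + 1
    return ranks.mean(axis=1)


# Hypothetical usage on synthetic points: one split per seed, shared by all models.
points = np.random.default_rng(0).random((500, 2))
block_ids = assign_blocks(points)
splits = {seed: block_split(block_ids, seed) for seed in SEEDS}
```

Because the split depends only on the task units and the seed, every model evaluated on the same city–task pair sees identical train, validation, and test masks. For metrics where lower is better, such as KL divergence, `mean_city_rank` would be called with `higher_is_better=False`.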
Additional details on training, aggregation, and rank computation are provided in Appendix C.5.

4 Experiments

We evaluate eleven representation models on eight tasks across eight cities. The experiments examine three questions: which representations perform best under a unified protocol, whether performance is stable across cities and tasks, and how much random splits overestimate generalization compared with spatial splits.

4.1 Main Performance

Table 2 shows that large-scale pretrained geospatial representations achieve the strongest overall transfer. AETHER, AlphaEarth, and TESSERA obtain the best mean city ranks, and they dominate most tasks. This suggests that broad spatial coverage and large-scale pretraining are highly valuable when a representation is expected to support heterogeneous urban prediction tasks. We further report linear-probe results in Appendix D.2, which show broadly comparable model rankings under a lower-capacity downstream evaluator.

However, the ranking is not determined by pretraining scale alone. Several specialized or simple representations remain competitive on specific tasks. CityFM achieves strong performance on LST, indicating that map entities can provide useful signals for built intensity and thermal conditions. CaLLiPer is also competitive on AGE. Even the PE baseline performs strongly on the spatially smoother environmental task PM2.5, suggesting that location-only signals can be informative when the target exhibits broad spatial gradients. These results show that smaller, simpler, or more targeted representations can still be effective when their encoded signals align well with the downstream phenomenon.

Another observation is that input modality alone is insufficient to explain performance across tasks. More modalities do not necessarily translate into stronger performance. At the same time, models with similar modalities exhibit distinct task-specific strengths.
Among raster or raster-enhanced embeddings, AlphaEarth performs best on LUC, POP, NTL, and LST, TESSERA leads RDE, and AETHER is strongest on GDP, PM2.5, and AGE. POI- and entity-based models also differ substantially, with CaLLiPer outperforming Place2Vec and Space2Vec overall and CityFM remaining competitive on LST. These results indicate that benchmark performance is shaped not only by input modality, but also by model architecture, pretraining objective, spatial support, and alignment strategy.

Table 2: Main benchmark results on CityRep. For eac
