AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

Xu, Shixiong; Zhang, Chenghao; Fan, Lubin; Meng, Gaofeng; Xiang, Shiming; Ye, Jieping

arXiv:2407.08156v1 (cs)

[Submitted on 11 Jul 2024]

Abstract: In this study, we introduce a new problem raised by social media and photojournalism, named Image Address Localization (IAL), which aims to predict the readable textual address where an image was taken. Existing two-stage approaches involve predicting geographical coordinates and converting them into human-readable addresses, which can lead to ambiguity and be resource-intensive. In contrast, we propose an end-to-end framework named AddressCLIP to solve the problem with more semantics, consisting of two key ingredients: i) image-text alignment to align images with addresses and scene captions by contrastive learning, and ii) image-geography matching to constrain image features with the spatial distance in terms of manifold learning. Additionally, we have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem. Experiments demonstrate that our approach achieves compelling performance on the proposed datasets and outperforms representative transfer learning methods for vision-language models. Furthermore, extensive ablations and visualizations exhibit the effectiveness of the proposed method. The datasets and source code are available at this https URL.
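
The two ingredients named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the temperature value, the min-max scaling of distances, the squared-error form of the geography term, and the `geo_weight` parameter are all assumptions made for illustration. The first loss is a generic CLIP-style symmetric contrastive objective over matched image/address embeddings; the second is one possible way to make pairwise feature distances follow pairwise geographic distances. Coordinates are assumed to be planar (x, y) values so that Euclidean distance is meaningful.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair.

    Simplification: images in the batch that share an address are still
    treated as negatives of each other's text.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))


def geography_matching_loss(image_feats, coords):
    """Assumed form: match scaled feature distances to scaled geo distances."""
    img = l2_normalize(image_feats)
    feat_dist = 1.0 - img @ img.T               # pairwise cosine distance
    geo_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Scale both matrices by their maximum so they are comparable (assumption).
    feat_dist = feat_dist / max(feat_dist.max(), 1e-12)
    geo_dist = geo_dist / max(geo_dist.max(), 1e-12)
    return float(np.mean((feat_dist - geo_dist) ** 2))


def total_loss(image_feats, text_feats, coords, geo_weight=1.0):
    """Weighted sum of the two terms; geo_weight is a hypothetical knob."""
    return contrastive_loss(image_feats, text_feats) + geo_weight * (
        geography_matching_loss(image_feats, coords)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feats = rng.normal(size=(8, 16))      # stand-in image embeddings
    text_feats = rng.normal(size=(8, 16))       # stand-in address embeddings
    coords = rng.uniform(0, 1000, size=(8, 2))  # stand-in planar coordinates
    print(total_loss(image_feats, text_feats, coords))
```

In a real training setup the embeddings would come from the image and text encoders of a vision-language model and the losses would be written in an autodiff framework; NumPy is used here only to keep the arithmetic visible.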

Comments:	Accepted at ECCV 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.08156 [cs.CV]
	(or arXiv:2407.08156v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.08156

Submission history

From: Chenghao Zhang
[v1] Thu, 11 Jul 2024 03:18:53 UTC (24,245 KB)
