DINO-Mix enhancing visual place recognition with foundational vision model and feature mixing

