Environmental awareness is a crucial skill for robotic systems intended to autonomously navigate and interact with their surroundings.
Robots access knowledge about their environment through maps. However, robotic mapping currently suffers from a significant “complexity gap”: while recent advances in computer vision, such as object detection and people tracking, allow machines to perceive their surroundings like never before, robots still rely on maps that contain just enough information for navigation but too little for many other tasks required by advanced autonomy. For example, most maps carry no semantic or dynamic information about the environment, which is needed for any application involving interaction with people or specific objects. Until this gap is bridged, mobile robots will not be able to operate autonomously in dynamic environments.
Hypermaps lays the groundwork for the next level of interaction between robots and their environment by closing the complexity gap. In this project, we propose to go beyond today’s multi-layer maps with a new formalism, called hypermaps, in which spatio-temporal knowledge (e.g., occupancy, semantics from deep object recognition, and people movement in the environment) is stored and processed through advanced artificial intelligence to provide the robot with task-specific maps for completing its missions. The core hypothesis of the project is that such a formalism will leverage the interplay between different maps to extract more information and enable deeper reasoning: anomalies in one map can be detected and corrected by examining that map’s correlation with the other maps, and information not visible in any single map becomes visible when the layers are combined.
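To make the idea concrete, one can picture a hypermap as a set of spatially aligned layers defined over a common grid, which can be combined to answer task-specific queries. The minimal Python sketch below uses illustrative class and layer names (Hypermap, occupancy, people_flow); it is only a conceptual sketch, not the project’s implementation.

```python
import numpy as np

class Hypermap:
    """Illustrative container for spatially aligned map layers (not the
    project's implementation). Each layer is an array over the same grid,
    e.g. occupancy probabilities, semantic labels, or expected people flow."""

    def __init__(self, shape):
        self.shape = shape
        self.layers = {}

    def add_layer(self, name, data):
        assert data.shape[:2] == self.shape, "layers must share the same grid"
        self.layers[name] = data

    def query(self, name, cell):
        """Read one layer at a grid cell (row, col)."""
        return self.layers[name][cell]

# Combining layers answers questions no single layer can, e.g.
# "which free cells are expected to see heavy people flow?"
hmap = Hypermap((100, 100))
hmap.add_layer("occupancy", np.random.rand(100, 100))    # P(cell occupied)
hmap.add_layer("people_flow", np.random.rand(100, 100))  # expected traffic
busy_and_free = (hmap.layers["occupancy"] < 0.2) & (hmap.layers["people_flow"] > 0.8)
```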
Closing the complexity gap constitutes a fundamental step towards the development of general, fully autonomous robots, able to execute high-level tasks and interact with us and their environment.
related publications
conference articles
Bayesian Floor Field: Transferring people flow predictions across environments
Francesco Verdoja, Tomasz Piotr Kucner, and Ville Kyrki
In 2024 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), Oct 2024
Mapping people dynamics is a crucial skill for robots, because it enables them to coexist in human-inhabited environments. However, learning a model of people dynamics is a time-consuming process which requires observing a large number of people moving in an environment. Moreover, approaches for mapping dynamics are unable to transfer the learned models across environments: each model can only describe the dynamics of the environment it was built in. However, the impact of architectural geometry on people’s movement can be used to anticipate their patterns of dynamics, and recent work has looked into learning maps of dynamics from occupancy. So far, however, approaches based on trajectories and those based on geometry have not been combined. In this work we propose a novel Bayesian approach to learning people dynamics that combines knowledge about the environment geometry with observations from human trajectories. An occupancy-based deep prior is used to build an initial transition model without requiring any observations of pedestrians; the model is then updated when observations become available using Bayesian inference. We demonstrate the ability of our model to increase data efficiency and to generalize across real large-scale environments, which is unprecedented for maps of dynamics.
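The following is a minimal sketch of the Bayesian idea described in the abstract, reduced to a Dirichlet-multinomial update of an 8-connected grid transition model: prior pseudo-counts (standing in for the occupancy-based deep prior) are added to observed transition counts, and the posterior mean gives the updated people-flow model. The function names, grid size, and uniform prior are illustrative assumptions, not the paper’s actual model.

```python
import numpy as np

# Toy Dirichlet-multinomial version of the idea: each grid cell holds a
# distribution over 8 movement directions. Prior pseudo-counts (standing in
# for the occupancy-based deep prior) are combined with observed pedestrian
# transition counts; the posterior mean is the updated flow model.

N_DIRS = 8  # 8-connected neighbours

def make_prior(prior_probs, strength=5.0):
    """Turn prior direction probabilities of shape (H, W, 8) into pseudo-counts."""
    return strength * prior_probs

def update_with_observations(alpha, observed_counts):
    """Bayesian update: posterior pseudo-counts = prior + observed counts."""
    return alpha + observed_counts

def posterior_mean(alpha):
    """Expected transition probabilities under the Dirichlet posterior."""
    return alpha / alpha.sum(axis=-1, keepdims=True)

# Usage on a toy 50x50 grid: a uniform prior refined by random "observations".
H, W = 50, 50
uniform_prior = np.full((H, W, N_DIRS), 1.0 / N_DIRS)
alpha = make_prior(uniform_prior)
counts = np.random.poisson(0.5, size=(H, W, N_DIRS)).astype(float)
flow = posterior_mean(update_with_observations(alpha, counts))
```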
workshop articles
Evaluating the quality of robotic visual-language maps
Matti Pekkanen, Tsvetomila Mihaylova, Francesco Verdoja, and Ville Kyrki
May 2024
Presented at the “Vision-Language Models for Navigation and Manipulation (VLMNM)” workshop at the IEEE Int. Conf. on Robotics and Automation (ICRA)
Visual-language models (VLMs) have recently been introduced in robotic mapping by using the latent representations, i.e., embeddings, of the VLMs to represent the natural language semantics in the map. The main benefit is moving beyond a small set of human-created labels toward open-vocabulary scene understanding. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, a rigorous analysis of the quality of the maps using these embeddings is lacking. In this paper, we propose a way to analyze the quality of maps created using VLMs by evaluating two critical properties: queryability and consistency. We demonstrate the proposed method by evaluating the maps created by two state-of-the-art methods, VLMaps and OpenScene, with two encoders, LSeg and OpenSeg, on real-world data from the Matterport3D dataset. We find that OpenScene outperforms VLMaps with both encoders, and LSeg outperforms OpenSeg with both methods.
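As a rough illustration of what “queryability” means in practice, the sketch below scores each cell of a hypothetical visual-language map against a text query by cosine similarity between embeddings. The embedding dimension and the random stand-in embeddings are assumptions; this is not the evaluation protocol of the paper, which would use real LSeg/OpenSeg features.

```python
import numpy as np

# Score every cell of a (hypothetical) visual-language map against a text
# query via cosine similarity. Random vectors stand in for real LSeg/OpenSeg
# embeddings; the dimension D is an assumption.

def cosine_similarity(cells, query):
    cells = cells / np.linalg.norm(cells, axis=-1, keepdims=True)
    query = query / np.linalg.norm(query)
    return cells @ query

D = 512                                         # assumed embedding dimension
cell_embeddings = np.random.randn(100, 100, D)  # one embedding per map cell
text_embedding = np.random.randn(D)             # e.g. the encoding of "chair"

scores = cosine_similarity(cell_embeddings, text_embedding)
best_cell = np.unravel_index(np.argmax(scores), scores.shape)
print("cell most similar to the query:", best_cell)
```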
Using occupancy priors to generalize people flow predictions
Francesco Verdoja, Tomasz Piotr Kucner, and Ville Kyrki
May 2024
Presented at the “Long-term Human Motion Prediction” workshop at the IEEE Int. Conf. on Robotics and Automation (ICRA)
Mapping people dynamics is a crucial skill for robots, because it enables them to coexist in human-inhabited environments. However, learning a model of people dynamics is a time-consuming process which requires observing a large number of people moving in an environment. Moreover, approaches for mapping dynamics are unable to transfer the learned models across environments: each model can only describe the dynamics of the environment it was built in. However, the impact of architectural geometry on people’s movement can be used to anticipate their patterns of dynamics, and recent work has looked into learning maps of dynamics from occupancy. So far, however, approaches based on trajectories and those based on geometry have not been combined. In this work we propose a novel Bayesian approach to learning people dynamics that combines knowledge about the environment geometry with observations from human trajectories. An occupancy-based deep prior is used to build an initial transition model without requiring any observations of pedestrians; the model is then updated when observations become available using Bayesian inference. We demonstrate the ability of our model to increase data efficiency and to generalize across real large-scale environments, which is unprecedented for maps of dynamics.
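As a toy illustration of building a transition prior from occupancy alone, the sketch below allows motion only between free 8-connected cells, with uniform probability over the free neighbours. This hand-crafted rule is a stand-in for the learned occupancy-based prior discussed in the abstract, not the actual model; such a prior could then be refined with observed trajectories as in the Bayesian update sketched above.

```python
import numpy as np

# Hand-crafted occupancy-only prior: from each free cell, allow transitions
# only to free 8-connected neighbours, with uniform probability. A stand-in
# for the learned occupancy-based prior, not the actual model.

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def occupancy_prior(occupancy, threshold=0.5):
    """occupancy: (H, W) array of P(occupied). Returns an (H, W, 8) prior."""
    H, W = occupancy.shape
    free = occupancy < threshold
    prior = np.zeros((H, W, len(OFFSETS)))
    for r in range(H):
        for c in range(W):
            if not free[r, c]:
                continue
            for k, (dr, dc) in enumerate(OFFSETS):
                nr, nc = r + dr, c + dc
                if 0 <= nr < H and 0 <= nc < W and free[nr, nc]:
                    prior[r, c, k] = 1.0
            total = prior[r, c].sum()
            if total > 0:
                prior[r, c] /= total
    return prior

# Usage: a prior built purely from geometry, ready for Bayesian refinement.
occupancy = (np.random.rand(50, 50) > 0.7).astype(float)  # toy occupancy grid
prior_probs = occupancy_prior(occupancy)
```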