
VSLAM (Visual Simultaneous Localization and Mapping) is a technique for building a map of an unknown environment while simultaneously keeping track of the robot’s location within it. This is achieved by using visual data from stereo cameras to identify and track features in the environment.
The VSLAM process involves the following steps:

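At a high level, these steps amount to detecting features in each camera frame, matching them across frames, and estimating the camera’s motion from those matches. The snippet below is a heavily simplified sketch of that front end using OpenCV, for illustration only; it is not the actual VSLAM implementation used here.

```python
# Simplified stereo-VSLAM front end: detect features, match them across
# frames, and estimate the relative camera motion between two frames.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def estimate_motion(prev_gray, curr_gray, K):
    """Estimate relative rotation/translation between two grayscale frames.

    K is the 3x3 camera intrinsics matrix (assumed known from calibration).
    """
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Essential matrix + recoverPose give rotation and unit-scale translation;
    # a stereo rig resolves the scale via the known camera baseline.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```

A full VSLAM pipeline adds keyframing, stereo depth, loop closure, and bundle adjustment on top of this loop, which is what makes the estimated trajectory usable for mapping.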
While VSLAM provides a sparse 3D map of features, a denser representation is needed for navigation. Nvblox, a package built specifically for 3D scene reconstruction, excels at this: it takes the depth images and pose estimates from VSLAM and fuses them into a 3D voxel grid. By combining VSLAM and Nvblox, we get a much richer representation of the environment.
The 3D reconstruction process involves these key steps:

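Conceptually, Nvblox integrates each incoming depth image into a truncated signed distance field (TSDF) stored in a voxel grid, using the VSLAM pose estimate to place the measurements in the world frame. The numpy-only sketch below is a heavy simplification of that fusion step (Nvblox itself does this on the GPU with projective updates), meant only to show the core idea:

```python
# Minimal TSDF-style fusion of observed surface points into a voxel grid.
import numpy as np

VOXEL_SIZE = 0.05        # metres per voxel
TRUNCATION = 0.20        # truncation distance of the signed distance field
GRID_SHAPE = (200, 200, 100)

tsdf = np.full(GRID_SHAPE, TRUNCATION, dtype=np.float32)   # signed distances
weights = np.zeros(GRID_SHAPE, dtype=np.float32)           # integration weights

def integrate_point(surface_point_world, camera_origin_world):
    """Update voxels along the ray from the camera to an observed surface point."""
    ray = surface_point_world - camera_origin_world
    depth = np.linalg.norm(ray)
    direction = ray / depth

    # Sample voxels along the ray near the surface and update their SDF values.
    for d in np.arange(max(depth - TRUNCATION, 0.0), depth + TRUNCATION, VOXEL_SIZE):
        point = camera_origin_world + direction * d
        idx = tuple((point / VOXEL_SIZE).astype(int))
        if not all(0 <= i < s for i, s in zip(idx, GRID_SHAPE)):
            continue
        sdf = np.clip(depth - d, -TRUNCATION, TRUNCATION)   # signed distance to surface
        # Weighted running average of the signed distance for this voxel.
        w = weights[idx]
        tsdf[idx] = (tsdf[idx] * w + sdf) / (w + 1.0)
        weights[idx] = w + 1.0
```

The zero-crossing of the fused signed distance field is the reconstructed surface, and distance values away from it are exactly what later gets turned into a cost grid for planning.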
Nav2, a navigation framework for ROS2, provides a suite of tools for real-time path planning and obstacle avoidance. The idea is to use the 3D reconstruction from Nvblox to build a costmap and feed it into Nav2 for navigation planning.
The navigation process involves these key components:
To test out Nav2, I used a simple click-and-drag interface to set waypoints for the robot to navigate to. In this interface, I manually define a goal position (x, y) and orientation (yaw) for the robot, and Nav2 computes a path to that goal while avoiding obstacles in real time.
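Under the hood, each of those waypoints corresponds to a NavigateToPose action goal sent to Nav2. A minimal rclpy sketch of sending such a goal (assuming a running Nav2 stack and a map frame; not the exact code used here) looks like this:

```python
# Send a single navigation goal to Nav2 via the NavigateToPose action.
# Sketch only: feedback callbacks and error handling are omitted.
import math
import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped
from nav2_msgs.action import NavigateToPose

class GoalSender(Node):
    def __init__(self):
        super().__init__('goal_sender')
        self._client = ActionClient(self, NavigateToPose, 'navigate_to_pose')

    def send_goal(self, x, y, yaw):
        goal = NavigateToPose.Goal()
        goal.pose = PoseStamped()
        goal.pose.header.frame_id = 'map'
        goal.pose.header.stamp = self.get_clock().now().to_msg()
        goal.pose.pose.position.x = x
        goal.pose.pose.position.y = y
        # Yaw -> quaternion (rotation about the z axis only).
        goal.pose.pose.orientation.z = math.sin(yaw / 2.0)
        goal.pose.pose.orientation.w = math.cos(yaw / 2.0)

        self._client.wait_for_server()
        return self._client.send_goal_async(goal)

def main():
    rclpy.init()
    node = GoalSender()
    future = node.send_goal(2.0, 1.5, 0.0)          # example goal: x=2.0 m, y=1.5 m, yaw=0
    rclpy.spin_until_future_complete(node, future)  # wait for the goal to be accepted
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```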


Currently, I am developing a system that allows the robot to decide for itself where to go based on a given task and environmental cues. This works by sending task information and semantic descriptions of the environment to an LLM and converting its outputs into actionable commands for Nav2.
NOTE: The LLM cannot reliably determine a facing rotation, so for now I am not requiring the model to output a target rotation (yaw) and am focusing only on the target position (x, y). The target rotation is instead calculated as the look-at rotation from the robot’s current position to its target position.
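Concretely, that look-at rotation is just the angle of the vector from the robot’s position to the target position:

```python
import math

def look_at_yaw(robot_x, robot_y, target_x, target_y):
    """Yaw that points the robot from its current position toward the target."""
    return math.atan2(target_y - robot_y, target_x - robot_x)

# e.g. a target one metre ahead and one metre to the left gives a 45 degree yaw:
# look_at_yaw(0.0, 0.0, 1.0, 1.0) == math.pi / 4
```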
The LLM receives the following inputs:

Task description (natural language):
    A description of what the robot is supposed to do, e.g. "Find an [object] and go to it."

ESDF (Euclidean Signed Distance Field) cost grid:
    [val00, val01, ..., val0n;
     val10, val11, ..., val1n;
     ..., ..., ..., ...;
     valm0, valm1, ..., valmn]

Semantic dictionary of detected objects and their grid coordinates:
    {"object1": "(x, y)", "object2": "(x, y)", ...}

Navigation feedback:
    Navigation towards (x, y) failed / canceled / succeeded.
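A rough sketch of how these inputs could be serialized into a single prompt string is shown below; the exact formatting and wording of the real prompt may differ, and the example values are placeholders.

```python
# Assemble the LLM prompt from the task, the ESDF cost grid, the semantic
# dictionary of detected objects, and the latest navigation feedback.
def build_prompt(task, esdf_grid, objects, feedback):
    grid_text = ";\n ".join(
        ", ".join(f"{v:.1f}" for v in row) for row in esdf_grid
    )
    objects_text = ", ".join(f'"{name}": "({x}, {y})"' for name, (x, y) in objects.items())
    return (
        f"Task: {task}\n"
        f"ESDF cost grid:\n[{grid_text}]\n"
        f"Detected objects: {{{objects_text}}}\n"
        f"Navigation feedback: {feedback}\n"
        "Respond with a JSON object containing REASONING, COORDINATES and TASK STATUS."
    )

prompt = build_prompt(
    task="Find a yellow basket and go to it.",
    esdf_grid=[[0.5, 0.7, 1.2, 1.4],
               [0.4, 0.2, 0.9, 1.1],
               [0.3, 0.1, 0.8, 1.0]],           # placeholder grid values
    objects={"basket": (4, 7), "chair": (2, 3)},
    feedback="Navigation towards (2, 3) succeeded.",
)
```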
The LLM is prompted to reply with a JSON object in the following format:

{
    "REASONING": "The task is to ... I should ...",
    "COORDINATES": [x, y],
    "TASK STATUS": "IN PROGRESS"
}
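The response can then be parsed and converted into a Nav2 goal, reusing the look-at yaw described above. The sketch below is illustrative only: the helper send_goal is assumed to wrap a NavigateToPose action client like the earlier example, and the handling of statuses other than "IN PROGRESS" is an assumption, not the actual protocol.

```python
# Convert an LLM response into a Nav2 goal. In practice the response should
# be validated and the request retried if the JSON fails to parse.
import json
import math

def handle_llm_response(response_text, robot_x, robot_y, send_goal):
    """Parse the LLM's JSON reply and dispatch a navigation goal.

    send_goal(x, y, yaw) is assumed to wrap a NavigateToPose action client.
    """
    reply = json.loads(response_text)

    # Assumption: any status other than "IN PROGRESS" means no further goal is needed.
    if reply.get("TASK STATUS") != "IN PROGRESS":
        return None

    target_x, target_y = reply["COORDINATES"]
    yaw = math.atan2(target_y - robot_y, target_x - robot_x)   # look-at rotation
    return send_goal(float(target_x), float(target_y), yaw)
```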
One challenge with using LLMs for navigation is that they may not always have enough information to make informed decisions. For example, if the robot is tasked with finding a yellow basket but the environment description only mentions a basket with no color, the LLM needs to be able to ask clarifying questions about the color in order to proceed effectively.
At a basic level, the LLM is currently able to perform the following steps to gather information:
Last few lines of logged output demonstrating this as the robot approaches a yellow basket:
