𝐕𝐢𝐬𝐮𝐚𝐥𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐨𝐟 𝐓𝐡𝐨𝐮𝐠𝐡𝐭 𝐄𝐥𝐢𝐜𝐢𝐭𝐬 𝐒𝐩𝐚𝐭𝐢𝐚𝐥 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠 𝐢𝐧 𝐋𝐚𝐫𝐠𝐞 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬 (Paper by Microsoft)

May 24, 2024

Imagine you’re listening to a story. As you follow along, your mind naturally starts to visualize the unseen objects, the actions, and the relationships between them. This mental process is often referred to as the “Mind’s Eye.” Now, consider whether language models can perform similar spatial reasoning.

Spatial reasoning is the ability to visualize and reason about how objects relate to one another in two or three dimensions. It’s the skill that allows us to comprehend and navigate our surroundings. Microsoft researchers recently published a paper that explores this ability in LLMs.

To evaluate an LLM’s spatial awareness and its ability to understand spatial relationships, the paper uses three tasks:

1)𝐍𝐚𝐭𝐮𝐫𝐚𝐥 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐍𝐚𝐯𝐢𝐠𝐚𝐭𝐢𝐨𝐧

In this task, the model is tested on its ability to keep track of its location as it follows a sequence of steps. Imagine a 4x4 square grid on which objects are placed. For instance, if you start in the bottom-left corner and move right, you encounter a car; moving up from there, you find a chair, and so on.

After describing the objects on the grid to the LLM, we ask it to follow a sequence of moves (e.g., “go up, then left, then right”). The model must then name the object at the location where it ends up. At each reasoning step, the LLM visualizes the grid and performs the requested move.
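
To make the setup concrete, here is a minimal Python sketch of the kind of bookkeeping the task asks the model to do in its head. It is not code from the paper: the object names, their positions, and the helper names (`step`, `render`, `navigate`) are all made up for illustration, and the assumption that a move off the grid is simply ignored is mine. The text grid printed after each move mirrors the idea of visualizing the state at every reasoning step.

```python
# Hypothetical 4x4 map: (column, row) -> object, origin at the bottom-left corner.
# Object names and positions are invented for this illustration.
GRID_SIZE = 4
OBJECTS = {(1, 0): "car", (1, 1): "chair", (0, 1): "lamp"}

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def step(position, move):
    """Apply one move; stay in place if it would leave the grid (an assumption)."""
    dx, dy = MOVES[move]
    x, y = position[0] + dx, position[1] + dy
    if 0 <= x < GRID_SIZE and 0 <= y < GRID_SIZE:
        return (x, y)
    return position


def render(position):
    """Draw the grid as text, top row first; '@' marks the current cell."""
    rows = []
    for y in range(GRID_SIZE - 1, -1, -1):
        cells = ("@" if (x, y) == position else "." for x in range(GRID_SIZE))
        rows.append(" ".join(cells))
    return "\n".join(rows)


def navigate(start, moves):
    """Follow the moves, print the grid after each one, return the object reached."""
    position = start
    for move in moves:
        position = step(position, move)
        print(f"after '{move}':\n{render(position)}\n")
    return OBJECTS.get(position, "nothing")


print(navigate((0, 0), ["right", "up", "left"]))  # prints three grids, then "lamp"
```

Starting in the bottom-left corner, the moves pass the car and the chair and end on the cell holding the lamp.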

2)𝐕𝐢𝐬𝐮𝐚𝐥 𝐍𝐚𝐯𝐢𝐠𝐚𝐭𝐢𝐨𝐧

The visual navigation task presents a synthetic 2D grid world to the LLM. The challenge is to navigate using visual cues. The LLM generates navigation instructions (left, right, up, down) to reach a destination from a starting point while avoiding obstacles.

This task involves two sub-tasks: route planning (producing the full sequence of moves that reaches the destination) and next-step prediction (choosing only the next move, which still requires multi-hop spatial reasoning).
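
For comparison, the two sub-tasks can be sketched with a classical algorithm. The Python below uses breadth-first search, which is a standard baseline and not the method the LLM uses or anything taken from the paper. The map encoding (`S`, `D`, `#`, `.`) and the function names are assumptions made for this example.

```python
from collections import deque

# Hypothetical map: S = start, D = destination, # = obstacle, . = free cell.
GRID = [
    "S.#.",
    ".##.",
    "...D",
]

# Rows are listed top to bottom, so "up" decreases the row index.
DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find(grid, symbol):
    """Return the (row, column) of the first cell containing symbol."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not in grid")


def plan_route(grid):
    """Route planning: breadth-first search for a shortest move list from S to D."""
    start, goal = find(grid, "S"), find(grid, "D")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in DIRECTIONS.items():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None  # no route avoids the obstacles


def next_step(grid):
    """Next-step prediction: the first move of a shortest route, if one exists."""
    route = plan_route(grid)
    return route[0] if route else None


print(plan_route(GRID))  # ['down', 'down', 'right', 'right', 'right']
print(next_step(GRID))   # down
```

On this map, moving right first leads into a dead end, so the correct first move depends on looking several cells ahead. That is the multi-hop aspect of the task.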

3)𝐕𝐢𝐬𝐮𝐚𝐥 𝐓𝐢𝐥𝐢𝐧𝐠

In this challenge, called polyomino tiling, a 6x6 grid is filled with colored shapes (e.g., green squares, yellow triangles). The LLM must fit a specific shape (e.g., a 3x1 red line) vertically within the grid. This tests the model’s ability to comprehend, organize, and reason with shapes in a confined area.
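
The fitting check described above can be sketched in a few lines of Python. This is an illustrative assumption about how such a puzzle might be encoded, not the paper's setup: the letters stand for arbitrary piece colors, `.` marks an empty cell, and `vertical_placements` and `place_vertical` are hypothetical helper names.

```python
# Hypothetical 6x6 board: each letter is a cell covered by a colored piece,
# '.' is an empty cell. The layout is invented for this illustration.
GRID = [
    "GGYY.R",
    "GGYY.R",
    "BB...R",
    "BB.PPP",
    "OO.PPP",
    "OOOO..",
]


def vertical_placements(grid, length):
    """List the (top_row, column) positions where a vertical line of `length` fits."""
    fits = []
    for top in range(len(grid) - length + 1):
        for col in range(len(grid[0])):
            if all(grid[top + i][col] == "." for i in range(length)):
                fits.append((top, col))
    return fits


def place_vertical(grid, top, col, length, label):
    """Return a new grid with the piece written into the chosen cells."""
    rows = [list(row) for row in grid]
    for i in range(length):
        rows[top + i][col] = label
    return ["".join(row) for row in rows]


print(vertical_placements(GRID, 3))  # [(0, 4), (2, 2)]
print("\n".join(place_vertical(GRID, 0, 4, 3, "R")))
```

Only two columns have three empty cells stacked vertically, so the 3x1 piece has exactly two valid positions on this board.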