Visualization of Thought Elicits Spatial Reasoning in Large Language Models (Paper by Microsoft)

Ali Issa
3 min read · May 24, 2024

[Image generated by DALL·E 3]

Imagine you’re listening to a story. As you follow along, your mind naturally starts to visualize the unseen objects, actions, and their relationships. This mental process is often referred to as the “Mind’s Eye.” Now, consider whether language models can perform similar spatial reasoning.

Spatial reasoning is the ability to visualize and reason about the relationships between objects in two- or three-dimensional space. It’s the skill that lets us comprehend and navigate our surroundings. Microsoft researchers recently published a paper that explores this ability in LLMs.

To evaluate an LLM’s spatial awareness and its ability to understand spatial relationships, the paper uses three tasks:

1) Natural Language Navigation

In this approach, the model is tested on its ability to determine its current location based on a set of steps. Imagine a 4x4 square grid where objects are placed. For instance, if you start in the bottom-left corner and move right, you encounter a car; moving up, you find a chair, and so on.

After telling the LLM which objects are on the grid, we give it a sequence of moves (e.g., “go up, then left, then right”). The model must then identify the object at its final location. At each reasoning step, the LLM visualizes the grid and performs the requested move.
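To make the task concrete, here is a minimal Python sketch of the bookkeeping involved. Apart from the car and the chair, the grid contents are made up, and the helper names `render` and `walk` are my own illustrative assumptions rather than anything from the paper. The printout after each move imitates the kind of per-step picture the model is asked to produce.

```python
# Hypothetical 4x4 map (only "car" and "chair" come from the example above).
# Row 0 is the top row, so the bottom-left start cell is (3, 0).
GRID = [
    ["lamp", "tree", "book", "door"],
    ["sofa", "desk", "bike", "vase"],
    ["bowl", "chair", "drum", "kite"],
    ["home", "car", "fork", "boat"],
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def render(pos):
    """Return a text picture of the grid with '*' at the current position."""
    rows = []
    for r in range(len(GRID)):
        cells = ["*" if (r, c) == pos else "." for c in range(len(GRID[0]))]
        rows.append(" ".join(cells))
    return "\n".join(rows)


def walk(start, steps, verbose=False):
    """Apply each move in turn and return the object at the final cell."""
    r, c = start
    for step in steps:
        dr, dc = MOVES[step]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])):
            raise ValueError(f"step {step!r} leaves the grid")
        r, c = nr, nc
        if verbose:
            # Show the state after every step, not only at the end.
            print(f"after {step}: {GRID[r][c]}\n{render((r, c))}\n")
    return GRID[r][c]


final_object = walk((3, 0), ["right", "up"], verbose=True)
```

In the actual task the LLM has to do this tracking itself from a text description; the code only shows what a correct answer looks like.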

2) Visual Navigation

The visual navigation task presents a synthetic 2D grid world to the LLM. The challenge is to navigate using visual cues. The LLM generates navigation instructions (left, right, up, down) to reach a destination from a starting point while avoiding obstacles.

This task involves two sub-tasks: route planning (finding a full path to the destination) and next-step prediction (choosing the next move). Both call for multi-hop spatial reasoning.
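As a rough illustration of the two sub-tasks, the sketch below solves a tiny grid world with breadth-first search. This is a conventional reference solver that I am swapping in to show what a correct route and a correct next step look like; it is not how the LLM reasons, and the map layout, the symbols, and the names `find` and `plan_route` are assumptions made for this example.

```python
from collections import deque

# Hypothetical map: 'S' start, 'D' destination, '#' obstacle, '.' free cell.
WORLD = [
    "S.#.",
    ".##.",
    "...D",
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find(symbol):
    """Return the (row, col) of the first cell holding the given symbol."""
    for r, row in enumerate(WORLD):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not on the map")


def plan_route(start, goal):
    """Breadth-first search; returns a shortest list of moves, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(WORLD) and 0 <= nc < len(WORLD[0])
            if inside and WORLD[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None


# Route planning: the whole sequence of moves from S to D.
route = plan_route(find("S"), find("D"))
# Next-step prediction: only the first move of that sequence.
next_step = route[0] if route else None
```

On this map the only way around the obstacles is down the left side and along the bottom row, so the planned route starts with "down".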

3) Visual Tiling

In this challenge, called polyomino tiling, a 6x6 grid is filled with colored shapes (e.g., green squares, yellow triangles). The LLM must fit a specific shape (e.g., a 3x1 red line) vertically within the grid. This tests the model’s ability to comprehend, organize, and reason with shapes in a confined area.
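The core check behind this task, whether a piece fits in the free cells, can be sketched in a few lines of Python. The board layout, the letter codes for colors, and the function names `fits`, `placements`, and `place` are all hypothetical and chosen just for illustration.

```python
# Hypothetical 6x6 board: '.' is empty, any letter is a cell already covered.
BOARD = [
    "GG.YYY",
    "GG.YBB",
    "RR.BB.",
    "R..PPP",
    "R.OOP.",
    "..OO..",
]

# The 3x1 line placed vertically, as (row, col) offsets from its top cell.
VERTICAL_LINE = [(0, 0), (1, 0), (2, 0)]


def fits(board, piece, top, left):
    """True if every cell of the piece lands on an empty cell inside the board."""
    for dr, dc in piece:
        r, c = top + dr, left + dc
        if not (0 <= r < len(board) and 0 <= c < len(board[0])):
            return False
        if board[r][c] != ".":
            return False
    return True


def placements(board, piece):
    """List every (top, left) anchor where the piece fits, in row-major order."""
    return [
        (r, c)
        for r in range(len(board))
        for c in range(len(board[0]))
        if fits(board, piece, r, c)
    ]


def place(board, piece, top, left, mark):
    """Return a new board with the piece drawn using the given mark."""
    rows = [list(row) for row in board]
    for dr, dc in piece:
        rows[top + dr][left + dc] = mark
    return ["".join(row) for row in rows]


spots = placements(BOARD, VERTICAL_LINE)
after = place(BOARD, VERTICAL_LINE, *spots[0], mark="L")
print("\n".join(after))
```

Printing the board after a placement is the tiling counterpart of visualizing the state after a reasoning step.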

Visualization of Thought

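The key idea behind Visualization of Thought (VoT) is to have the model visualize the state after each reasoning step. Instead of jumping straight to an answer, the LLM produces an intermediate picture (for example, a text rendering of the grid) at each stage and uses it to guide the next step. According to the paper, this improves accuracy on the three tasks above.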
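In practice this comes down to a small change in the prompt. The sketch below shows one way to add such an instruction to a task description. The instruction wording reflects my reading of the paper and should be treated as an assumption, and `build_prompt` is a hypothetical helper, not part of any library.

```python
# Assumed wording of the VoT instruction; check the paper for the exact prompt.
VOT_INSTRUCTION = "Visualize the state after each reasoning step."


def build_prompt(task_description, use_vot=True):
    """Return the task text, optionally followed by the VoT instruction."""
    parts = [task_description.strip()]
    if use_vot:
        parts.append(VOT_INSTRUCTION)
    return "\n\n".join(parts)


task = (
    "You start in the bottom-left corner of a 4x4 grid. "
    "Go right, then up. What object do you find?"
)
prompt = build_prompt(task)
baseline = build_prompt(task, use_vot=False)
print(prompt)
```

Keeping a `use_vot` switch makes it easy to send the same task with and without the instruction and compare the answers.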

For a clearer understanding, refer to the image below, which illustrates the three tasks mentioned earlier and how VOT is applied.

