An agent is placed in a simulated or real environment and given a command like "Walk past the kitchen, turn left at the couch, and stop by the wooden table."
Vision-and-Language Navigation (VLN) is a multimodal task that requires an agent to process both visual input (what it sees) and linguistic input (what it is told to do) to reach a destination.
Extract the file into a designated data/ or weights/ directory.
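The unpacking step can be sketched with Python's standard zipfile module. This is a minimal sketch, not a project-specific procedure: the function name extract_archive and the default target directory "data" are illustrative assumptions, and the archive path is whatever file you downloaded.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="data"):
    """Unpack a zip archive into target_dir and return the member names.

    extract_archive and the "data" default are hypothetical names chosen
    for illustration; adjust them to the layout your project expects.
    """
    target = Path(target_dir)
    # Create the destination (e.g. data/ or weights/) if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

Calling extract_archive("path/to/archive.zip", "weights") would unpack the same archive into a weights/ directory instead; listing the returned names is a quick way to check what the archive actually contained.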
The agent must understand spatial relationships and object semantics, such as distinguishing a "wooden table" from a "marble counter".
Archives often include .json or .txt files containing thousands of navigation paths paired with human-written instructions.
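To illustrate how such path and instruction pairs might be read, here is a minimal sketch that parses a small inline JSON sample. The field names (path_id, path, instructions) and the node identifiers are assumptions made for this example; a real archive may use a different schema, so inspect the files before relying on any key.

```python
import json

# Hypothetical sample standing in for an annotation file.
# The keys and node IDs below are assumptions, not a documented schema.
SAMPLE = """
[
  {
    "path_id": 1,
    "path": ["node_a", "node_b", "node_c"],
    "instructions": [
      "Walk past the kitchen, turn left at the couch, and stop by the wooden table."
    ]
  }
]
"""


def iter_instruction_pairs(episodes):
    """Yield one (instruction, path) pair per human-written instruction."""
    for episode in episodes:
        for instruction in episode["instructions"]:
            yield instruction, episode["path"]


episodes = json.loads(SAMPLE)
pairs = list(iter_instruction_pairs(episodes))
for instruction, path in pairs:
    print(f"{len(path)} viewpoints: {instruction}")
```

Flattening to one pair per instruction is a design choice for this sketch: a single path can carry several instructions written by different annotators, and most training loops consume them one at a time.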