Multimodal foundation world models
for generalist embodied agents

Anonymous authors

Multimodal foundation world models allow grounding language and video prompts into embodied domains, by turning them into sequences of latent world model states.
Latent state sequences can be decoded using the decoder of the model, allowing visualization of the expected behavior, before training the agent to execute it.

Task behaviors

Behavior retrieval

The agent is tasked to solve tasks that are in the agent's training set. Tasks are inferred by the agent from text prompts, without any access to the real reward functions.
See the article for the list of tasks, results and text prompts.

walker run

kitchen burner

quadruped run

stickman walk

cheetah run

walker stand

kitchen light

quadruped stand

stickman run

kitchen microwave

walker walk

quadruped walk

kitchen slide

stickman stand

Multitask generalization

The agent is tasked to solve new tasks that not contained in the training set. Tasks are inferred by the agent from text prompts, without any access to the real reward functions.
See the article for the list of tasks, results and text prompts.

walker flipping

stickman headstand

quadruped two legs

stickman sit knees

walker lunge pose

stickman boxing

walker sit knees

stickman high kick

cheetah standing

stickman flipping

quadruped jump

stickman legs up

walker high kick

stickman lunge pose

walker lying down

stickman one foot

quadruped lie down

stickman lying down

walker one foot

stickman hands up

cheetah lying down

Language prompts decoded

Multimodal foundation world models allow grounding language prompts into the embodied domain.
The world model allows to visualize how the language prompt is interpreted by the model, by decoding the latent states that correspond to the prompt.

boxing

doing a cartwheel

crawling

crunch abs

doing a backflip

doing the splits

downward facing dog (yoga pose)

sitting

walking on the knees

karate kick

laying down and kicking

lean backwards

doing the moonwalk

doing push ups

Video prompts decoded

Multimodal foundation world models allow grounding visual prompts into the embodied domain.
The world model allows to visualize how the video prompt is interpreted by the model, by decoding the latent states that correspond to the prompt.
For the first six tasks we also indicate the corresponding tasks as we use them in the paper to show GenRL's ability to learn behaviors from video prompts.