Multimodal foundation world models
for generalist embodied agents

Anonymous authors
Multimodal foundation world models allow grounding language and video prompts in embodied domains by turning them into sequences of latent world-model states.
These latent state sequences can be decoded with the model's decoder, making it possible to visualize the expected behavior before training the agent to execute it.
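As a rough illustration of this pipeline, here is a minimal PyTorch sketch. The Connector and Decoder modules, their dimensions, and the random prompt embedding are hypothetical placeholders, not the paper's actual architecture; it only shows the data flow from a prompt embedding to latent states to decoded frames.

```python
import torch
import torch.nn as nn

EMBED_DIM, LATENT_DIM, SEQ_LEN, OBS_DIM = 512, 32, 16, 64 * 64 * 3

class Connector(nn.Module):
    """Hypothetical stand-in: maps a frozen foundation-model prompt embedding
    to a sequence of latent world-model states."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, SEQ_LEN * LATENT_DIM)

    def forward(self, prompt_emb):                    # (B, EMBED_DIM)
        return self.proj(prompt_emb).view(-1, SEQ_LEN, LATENT_DIM)

class Decoder(nn.Module):
    """Hypothetical stand-in: decodes latent states back into observations."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM, OBS_DIM)

    def forward(self, latents):                       # (B, T, LATENT_DIM)
        return self.net(latents)

connector, decoder = Connector(), Decoder()

# Prompt embedding from a frozen multimodal foundation model (e.g. a
# CLIP-style text or video encoder) -- a random placeholder here.
prompt_emb = torch.randn(1, EMBED_DIM)

with torch.no_grad():
    target_latents = connector(prompt_emb)    # prompt grounded as latent states
    decoded_frames = decoder(target_latents)  # inspect these before training the agent
print(decoded_frames.shape)                   # torch.Size([1, 16, 12288])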

  Task behaviors

  Behavior retrieval

The agent is asked to solve tasks that are contained in its training set. Tasks are inferred by the agent from text prompts, without any access to the real reward functions (a sketch of this reward-free retrieval follows the task list below).
See the article for the list of tasks, results and text prompts.

walker run
kitchen burner
quadruped run
stickman walk
cheetah run
walker stand
kitchen light
quadruped stand
stickman run
kitchen microwave
walker walk
quadruped walk
kitchen slide
stickman stand
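
The sketch below illustrates, under stated assumptions, how a behavior can be retrieved without the real reward function: a proxy reward scores how closely the latent trajectory imagined by the agent aligns with the latent states the text prompt was grounded into. The function name and the cosine-similarity reward are illustrative placeholders, not the exact objective from the paper.

```python
import torch
import torch.nn.functional as F

LATENT_DIM, SEQ_LEN = 32, 16

def prompt_alignment_reward(agent_latents, target_latents):
    """Per-step proxy reward: cosine similarity between the latent states the
    agent visits and the latent states the prompt was grounded into."""
    return F.cosine_similarity(agent_latents, target_latents, dim=-1)

# Target latents obtained by grounding the text prompt (placeholder here),
# and a latent trajectory imagined by the agent inside the world model.
target_latents = torch.randn(1, SEQ_LEN, LATENT_DIM)
agent_latents = torch.randn(1, SEQ_LEN, LATENT_DIM)

rewards = prompt_alignment_reward(agent_latents, target_latents)  # (1, SEQ_LEN)

# The agent is then trained (e.g. with an actor-critic in imagination) to
# maximize this proxy reward -- it never sees the environment's real reward.
print(rewards.shape, rewards.mean().item())
```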

  Multitask generalization

The agent is asked to solve new tasks that are not contained in the training set. Tasks are inferred by the agent from text prompts, without any access to the real reward functions.
See the article for the list of tasks, results and text prompts.

walker flipping
stickman headstand
quadruped two legs
stickman sit knees
walker lunge pose
stickman boxing
walker sit knees
stickman high kick
cheetah standing
stickman flipping
quadruped jump
stickman legs up
walker high kick
stickman lunge pose
walker lying down
stickman one foot
quadruped lie down
stickman lying down
walker one foot
stickman hands up
cheetah lying down

  Language prompts decoded

Multimodal foundation world models allow grounding language prompts in the embodied domain.
The world model makes it possible to visualize how a language prompt is interpreted by the model, by decoding the latent states that correspond to the prompt.

boxing
doing a cartwheel
crawling
crunch abs
doing a backflip
doing the splits
downward facing dog (yoga pose)
sitting
walking on the knees
karate kick
laying down and kicking
lean backwards
doing the moonwalk
doing push ups

  Video prompts decoded

Multimodal foundation world models allow grounding visual prompts in the embodied domain.
The world model makes it possible to visualize how a video prompt is interpreted by the model, by decoding the latent states that correspond to the prompt (a sketch of this grounding follows the prompt list below).
For the first six prompts we also indicate the corresponding tasks, as used in the paper, to show GenRL's ability to learn behaviors from video prompts.

quadruped walk
cheetah run
cheetah standing
stickman high kick
stickman walk
kitchen microwave
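
For completeness, a minimal sketch of the video-prompt path, under the same placeholder assumptions as the earlier snippets: a frozen visual encoder embeds the prompt frames, the frame embeddings are pooled into one prompt embedding (an assumption; the actual aggregation may differ), grounded into latent states, and decoded for inspection.

```python
import torch
import torch.nn as nn

EMBED_DIM, LATENT_DIM, SEQ_LEN, FRAMES, OBS_DIM = 512, 32, 16, 8, 64 * 64 * 3

frozen_visual_encoder = nn.Linear(OBS_DIM, EMBED_DIM)   # placeholder frozen encoder
connector = nn.Linear(EMBED_DIM, SEQ_LEN * LATENT_DIM)  # placeholder connector
decoder = nn.Linear(LATENT_DIM, OBS_DIM)                # placeholder decoder

video_prompt = torch.randn(FRAMES, OBS_DIM)             # flattened prompt frames

with torch.no_grad():
    frame_emb = frozen_visual_encoder(video_prompt)     # (FRAMES, EMBED_DIM)
    prompt_emb = frame_emb.mean(dim=0, keepdim=True)    # pool frames (assumption)
    latents = connector(prompt_emb).view(1, SEQ_LEN, LATENT_DIM)
    decoded = decoder(latents)                          # frames to inspect
print(decoded.shape)                                    # torch.Size([1, 16, 12288])
```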