Google DeepMind's RT-2 AI Model Will Help Robots Serve Humans Like R2-D2

Image: a robot hand touching a human hand
A new study of Google DeepMind's Robotic Transformer 2 (RT-2), a vision-language-action (VLA) model, shows promising results toward building a general-purpose physical robot that can reason, problem-solve, and interpret information to carry out a wide array of tasks in real-world settings. RT-2 learns from both web and robotics data and translates that knowledge into generalized instructions for robotic control.
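The key trick behind that translation is that RT-2 outputs robot actions as plain text tokens, the same way it outputs words. Per the RT-2 paper, each action is a string of eight integers (each discretized into 256 bins) covering an episode-terminate flag, positional and rotational deltas, and gripper extension. Below is a minimal sketch of de-tokenizing such a string back into motor commands; the bin-to-range mapping is an assumed placeholder for illustration, not DeepMind's actual calibration.

```python
# Sketch of RT-2's core trick: a vision-language model "speaks" robot
# actions as text. Each action is 8 integers in [0, 255]: a terminate
# flag, 3 positional deltas, 3 rotational deltas, and gripper extension.
# The ranges below are illustrative placeholders, not real calibration.

ACTION_DIMS = [
    ("terminate", 0.0, 1.0),
    ("dx", -0.1, 0.1), ("dy", -0.1, 0.1), ("dz", -0.1, 0.1),
    ("droll", -0.5, 0.5), ("dpitch", -0.5, 0.5), ("dyaw", -0.5, 0.5),
    ("gripper", 0.0, 1.0),
]

def detokenize_action(text: str) -> dict[str, float]:
    """Map a model-emitted action string like '1 128 91 ...' back to
    continuous robot commands by de-binning each integer."""
    bins = [int(tok) for tok in text.split()]
    assert len(bins) == len(ACTION_DIMS), "expected 8 action tokens"
    action = {}
    for b, (name, lo, hi) in zip(bins, ACTION_DIMS):
        action[name] = lo + (b / 255.0) * (hi - lo)  # bin index -> value
    return action

# Example action string taken from DeepMind's RT-2 announcement.
print(detokenize_action("1 128 91 241 5 101 127 217"))
```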

Sci-fi fans have long adored futuristic loyal companion robots, such as Star Wars' R2-D2. Fans of the original trilogy became enamored with the vacuum cleaner-shaped droid as it beeped and booped its way through danger, and nearly every kid in the late '70s and early '80s dreamed of having their very own R2-D2 sidekick. Google is among the companies making advances in robotics, and its recent RT-2 results give promise for the day an R2-D2 will be available to all, sort of.

Image: Google DeepMind's RT-2 robot

RT-2 builds on RT-1, "a model trained on multi-task demonstrations, which can learn combinations of tasks and objects seen in the robotic data," according to a recent press release from Google DeepMind. That research incorporated demonstration data collected with 13 robots over a 17-month period in an office kitchen environment.

RT-2 essentially learns from RT-1's robotic data alongside web data, culminating in a VLA model that can control a robot directly. The results show that RT-2 has improved generalization capabilities and semantic and visual understanding that go beyond the robotic data it was initially exposed to. The new paper indicates that this includes RT-2 interpreting new commands and responding to them with rudimentary reasoning, such as reasoning "about object categories or high-level descriptions."

RT-2 also incorporates chain-of-thought reasoning, allowing it to perform multi-stage semantic reasoning. This includes deciding which object would be better suited for the job at hand, such as choosing a rock over a piece of paper to drive a nail, as the sketch below the image illustrates.

Image: RT-2 reasoning about which object is more appropriate for driving a nail.
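DeepMind describes augmenting the training data with a natural-language "Plan" step between the instruction and the action tokens, so the model reasons before it acts. Here is a small sketch of parsing such a completion, assuming a "Plan: ... Action: ..." layout; the exact formatting in the real training data may differ.

```python
# Sketch of RT-2's chain-of-thought variant: completions carry a
# natural-language plan before the action tokens, so the reasoning
# step can be separated from the motor command it justifies.
import re

def parse_cot_output(model_text: str) -> tuple[str, str]:
    """Split a completion like
    'Plan: use the rock. Action: 1 128 91 241 5 101 127 217'
    into its reasoning step and its action-token string."""
    match = re.search(r"Plan:\s*(.*?)\s*Action:\s*([\d ]+)", model_text)
    if match is None:
        raise ValueError("completion missing Plan/Action structure")
    return match.group(1), match.group(2).strip()

plan, action = parse_cot_output(
    "Plan: the rock is harder than the paper, use it as a hammer. "
    "Action: 1 128 91 241 5 101 127 217"
)
print(plan)    # the reasoning the model emitted before acting
print(action)  # tokens a de-binning step turns into motor commands
```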

Google DeepMind says, "VLMs can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotics data."
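In practice, that combination means casting both web vision-language examples and robot demonstrations into one shared (image, prompt, target-text) format, so a single next-token objective trains on both. The following is a runnable illustration of that unification; the example records are invented, and the "What action should the robot take to...?" prompt shape follows the phrasing used in the RT-2 paper.

```python
# Hedged sketch of unifying web data and robot data for co-fine-tuning:
# both sources become (image, prompt, target) records, so the same
# language-modeling objective covers answering questions and acting.

def web_example(image_path: str, question: str, answer: str) -> dict:
    # Ordinary web vision-language data: the target is natural language.
    return {"image": image_path, "prompt": question, "target": answer}

def robot_example(image_path: str, instruction: str,
                  action_bins: list[int]) -> dict:
    # Robot demonstration data: the target is the serialized action
    # string, the same trick that lets the model emit motor commands.
    return {
        "image": image_path,
        "prompt": f"What action should the robot take to {instruction}?",
        "target": " ".join(str(b) for b in action_bins),
    }

mixture = [
    web_example("kitchen.jpg", "What fruit is on the counter?", "an apple"),
    robot_example("kitchen.jpg", "pick up the apple",
                  [1, 128, 91, 241, 5, 101, 127, 217]),
]
for ex in mixture:
    print(ex["prompt"], "->", ex["target"])
```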

While having a robotic friend like R2-D2 may still be a ways off, Google DeepMind and other companies are striving to deliver a more competent and capable robotic helpmate in the near future with advancements like RT-2.