
Interview: The Future of Embodied AI with MIT CSAIL Researcher Dr. James Okonkwo

By Robotocist Team · 10 min read

Dr. James Okonkwo leads the Embodied Intelligence group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). His research on foundation models for robotic manipulation has been cited over 4,000 times. We spoke with him about where embodied AI is headed and what remains unsolved.


You have been working on embodied AI for over a decade. How would you describe the current moment?

We are in what I call the "GPT moment" for robotics. Around 2020-2022, language models crossed a threshold where they became genuinely useful for everyday tasks. Robotics is approaching a similar threshold now. The difference is that robotics is harder because you cannot just scale up data and compute the same way. A language model trains on the internet. A robot has to interact with the physical world, and the physical world does not scale.

That said, the progress in the last two years has been extraordinary. We have gone from robots that could barely grasp a cup to robots that can fold clothes, cook simple meals, and navigate cluttered apartments. The key was figuring out how to transfer the knowledge in large pre-trained models into physical systems.

Your lab developed the RT-X framework that has been widely adopted. Can you explain the core idea?

RT-X — Robotic Transformer Cross-embodiment — is built on a simple observation: robotic manipulation data collected on one robot is useful for training policies on a completely different robot. A grasping demonstration on a Franka Panda arm teaches something fundamental about how objects behave when you pick them up, and that knowledge transfers to a UR5 or a humanoid hand.

We aggregated data from 22 different robot platforms across 21 institutions. The combined dataset has over 500,000 manipulation episodes. We trained a vision-language-action model on this data, and the resulting policy transfers to robot platforms it has never seen, needing only light fine-tuning for embodiment-specific dynamics.

The architecture is a transformer that takes in camera images and a language instruction, encodes them with a pre-trained vision-language backbone, and outputs continuous motor commands. The innovation was in the tokenization of actions — we discretize the action space into 256 bins per dimension, which lets us use the same next-token prediction framework that works so well for language.
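The per-dimension binning he describes can be sketched in a few lines. This is a minimal illustration of uniform 256-bin discretization, not the RT-X implementation; the action ranges and the 7-DoF layout are assumptions for the example.

```python
import numpy as np

NUM_BINS = 256  # bins per action dimension, as described above

def tokenize_action(action, low, high, num_bins=NUM_BINS):
    """Map each continuous action dimension to a discrete bin index."""
    action = np.asarray(action, dtype=np.float64)
    # Normalize each dimension to [0, 1] using its known range.
    normalized = (action - low) / (high - low)
    # Scale to bin indices and clip into the valid range [0, num_bins - 1].
    return np.clip((normalized * num_bins).astype(int), 0, num_bins - 1)

def detokenize_action(tokens, low, high, num_bins=NUM_BINS):
    """Map bin indices back to continuous values at the bin centers."""
    centers = (np.asarray(tokens) + 0.5) / num_bins
    return low + centers * (high - low)

# Example: a hypothetical 7-DoF command (6 end-effector deltas + gripper).
low = np.array([-1.0] * 7)
high = np.array([1.0] * 7)
action = np.array([0.1, -0.5, 0.0, 0.25, 0.9, -1.0, 1.0])
tokens = tokenize_action(action, low, high)       # integers in [0, 255]
recovered = detokenize_action(tokens, low, high)  # within half a bin of the input
```

Once actions are integer tokens, a transformer can predict them autoregressively with the same cross-entropy objective used for language.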

How do you bridge the sim-to-real gap? That has been a persistent challenge.

The sim-to-real gap is narrower than people think, if you do it right. The key techniques are domain randomization and system identification.

Domain randomization means you vary everything in simulation — lighting, textures, object masses, friction coefficients, camera positions — so the policy learns to be robust to visual and physical variation. When it encounters the real world, it is just another variation.
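In code, domain randomization often amounts to resampling simulator parameters at the start of every episode. The parameter names and ranges below are purely illustrative; real values depend on the simulator and the task.

```python
import random

# Hypothetical per-episode randomization ranges (illustrative values only).
RANDOMIZATION_RANGES = {
    "light_intensity":   (0.3, 1.5),   # relative scene brightness
    "object_mass_kg":    (0.05, 0.8),
    "friction_coeff":    (0.4, 1.2),
    "camera_offset_cm":  (-2.0, 2.0),  # perturbation of the nominal camera pose
    "texture_hue_shift": (-0.1, 0.1),
}

def sample_episode_params(rng=random):
    """Draw one set of simulator parameters for a single training episode."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

# Each episode runs in a freshly randomized world, so the policy cannot
# overfit to any single appearance or physics configuration.
params = sample_episode_params()
```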

System identification means you carefully measure the physical parameters of your real robot — joint friction, gear backlash, cable stiffness — and model them accurately in simulation. The closer your simulator matches reality, the less randomization you need.
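As a toy example of system identification, one can estimate a joint's friction parameters by linear least squares from logged torque/velocity pairs. The friction model and numbers here are synthetic stand-ins; a real lab would fit richer models to data from the physical robot.

```python
import numpy as np

# Synthetic measurements for one joint: torque = b * velocity + c, where
# b is viscous friction and c is Coulomb friction (idealized, noise-free).
velocities = np.linspace(0.1, 2.0, 20)
true_viscous, true_coulomb = 0.35, 0.12
torques = true_viscous * velocities + true_coulomb

# Fit tau = b*v + c by linear least squares; the estimates b_est and c_est
# would then be plugged into the simulator's joint model.
A = np.stack([velocities, np.ones_like(velocities)], axis=1)
(b_est, c_est), *_ = np.linalg.lstsq(A, torques, rcond=None)
```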

We have also been working on what we call "real-to-sim-to-real" — using data collected from real robots to improve the simulator, then training new policies in the improved simulator. It creates a virtuous cycle.

What role do large language models play in your research?

LLMs are the planning layer. They are remarkably good at decomposing high-level instructions into sequences of subtasks. If you tell the robot "make me a cup of coffee," the LLM knows the steps: go to the kitchen, find the coffee maker, open the lid, insert a pod, close the lid, place a cup, press the button, wait, pick up the cup, bring it to the user.

But LLMs do not understand physics. They might generate a plan that involves picking up a cup by the rim while it is full of hot coffee, which any human would know is a terrible idea. So we have a physics-grounded verification layer that checks whether each planned action is physically feasible and safe before execution.
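The plan-then-verify loop he describes can be sketched as a filter over proposed steps. The rules and data structures below are hypothetical simplifications; his lab's verifier presumably queries a physics model rather than hand-written checks.

```python
# Hypothetical feasibility rules; a real verifier would consult a physics model.
def is_feasible(step, state):
    """Reject plan steps that violate simple physical or safety constraints."""
    if step["action"] == "grasp":
        obj = state["objects"][step["target"]]
        if obj.get("contains_hot_liquid") and step.get("grasp_point") == "rim":
            return False  # unstable grasp that risks spilling hot liquid
        if obj["mass_kg"] > state["robot"]["payload_kg"]:
            return False  # object exceeds the arm's payload
    return True

def verify_plan(plan, state):
    """Return indices of infeasible steps so the planner can revise them."""
    return [i for i, step in enumerate(plan) if not is_feasible(step, state)]

state = {
    "robot": {"payload_kg": 2.0},
    "objects": {"mug": {"mass_kg": 0.4, "contains_hot_liquid": True}},
}
plan = [
    {"action": "grasp", "target": "mug", "grasp_point": "rim"},
    {"action": "grasp", "target": "mug", "grasp_point": "handle"},
]
bad_steps = verify_plan(plan, state)  # flags the rim grasp, allows the handle grasp
```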

The most exciting recent development is multimodal models that can reason about visual scenes. You can show the robot a picture of a cluttered desk and ask, "How would you organize this?" and the model generates a sensible plan based on what it sees. That visual-linguistic reasoning is a game changer.

What about learning from human demonstrations? Is that still the dominant approach?

It is the most practical approach for deploying robots today. You teleoperate the robot through a task 50 to 100 times, train a policy on those demonstrations, and the policy generalizes to variations it has not seen.

The bottleneck is data collection. Teleoperation is slow and tedious. We are working on several approaches to reduce the data requirement. One is to use VR-based teleoperation, which is more intuitive and faster. Another is to extract demonstrations from human videos — watch a human fold a towel on YouTube and translate those motions to the robot's body. A third is to use generative models to synthesize new demonstrations from a handful of real ones, essentially data augmentation in the trajectory space.
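Trajectory-space augmentation of the kind he mentions can be as simple as smoothly perturbing recorded waypoints while pinning the endpoints, so the start and goal of the task are preserved. This is a minimal sketch, not the lab's generative approach; the trajectory shape and noise scale are illustrative.

```python
import numpy as np

def augment_trajectory(traj, noise_scale=0.01, rng=None):
    """Synthesize a new demonstration by smoothly perturbing a real one.

    traj: (T, D) array of end-effector waypoints. Endpoints stay fixed so
    the start pose and goal pose of the demonstrated task are preserved.
    """
    rng = rng or np.random.default_rng()
    T, D = traj.shape
    noise = rng.normal(0.0, noise_scale, size=(T, D))
    # Taper the noise with a sine envelope: zero at both endpoints,
    # largest in the middle of the motion.
    envelope = np.sin(np.linspace(0.0, np.pi, T))[:, None]
    return traj + noise * envelope

# One straight-line demo from a start pose to a goal pose, then ten variants.
demo = np.linspace([0.0, 0.0, 0.2], [0.3, 0.1, 0.0], num=50)
synthetic = [augment_trajectory(demo) for _ in range(10)]
```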

The end goal is one-shot or zero-shot task learning. Show the robot a single demonstration, or just describe the task in words, and it figures out how to do it. We are not there yet, but we are closer than I expected we would be at this point.

What is the biggest open problem in embodied AI right now?

Long-horizon reasoning and planning under uncertainty. Current robots are good at individual skills — pick this up, put it there, open that drawer. But chaining skills together over minutes or hours, while handling unexpected failures and adapting to changing conditions, is where they fall apart.

Imagine asking a robot to clean a house. That involves hundreds of sub-tasks, many of which depend on the outcomes of previous tasks, and the state of the house changes as the robot works. The combinatorial complexity is enormous. Current systems struggle with tasks that take more than two to three minutes.

I think the solution will come from hierarchical planning with learned world models. The robot needs an internal model of how the world works — what happens when you push that stack of books, when you open that sticky drawer, when you pour water into a cup. With a good world model, the robot can plan ahead, anticipate failures, and recover from mistakes.

There is a lot of talk about the "data moat" in robotics. Do you agree that data is the key bottleneck?

Data is a bottleneck, but I push back on the idea that whoever has the most data wins. The quality and diversity of data matters more than quantity. A thousand demonstrations of grasping the same mug teach less than ten demonstrations of grasping ten different objects in different orientations.

What is more interesting to me is the question of data efficiency. Humans learn to manipulate objects from a relatively small number of experiences because we have incredible priors — an intuitive understanding of physics, materials, and causality. If we can build similar priors into robot learning systems, perhaps through pre-training on physics simulations or human video, we can dramatically reduce the data requirements.

What advice do you have for graduate students entering this field?

Get your hands on real robots. It is tempting to stay in simulation where everything is clean and reproducible, but the messy reality of hardware is where the most important research problems live. Cables snag, sensors drift, grippers slip, and motors overheat. Dealing with those problems will make you a better researcher.

Also, study cognitive science. The best ideas in embodied AI often come from understanding how humans and animals learn to interact with the physical world. Object permanence, affordance detection, causal reasoning — these concepts from developmental psychology are directly relevant to building intelligent robots.

How do you evaluate whether an embodied AI system is actually "intelligent" versus just pattern matching?

This is a philosophical minefield, but I think there are practical tests we can use. One is compositional generalization — can the robot combine skills it has learned separately to handle a novel task? If it knows how to open a drawer and how to pick up a spoon, can it figure out how to get a spoon from a closed drawer without being explicitly trained on that combination? Current systems are improving at this, but they are still fragile.

Another test is failure recovery. When something goes wrong, does the robot have a model of what happened and why? Can it try a different approach? Or does it just repeat the same failed action? True intelligence involves causal reasoning about failure, and that is qualitatively different from pattern matching.

I also look at sample efficiency. A genuinely intelligent system should learn from very few examples. If a robot needs a thousand demonstrations to learn to pick up a mug, it is memorizing, not understanding. If it can learn from five demonstrations and generalize to mugs of different shapes, sizes, and materials, something deeper is happening.

What role does hardware innovation play? Can better AI compensate for limited hardware?

There is a common belief in the AI community that software will solve everything and hardware is just a commodity. I strongly disagree. The quality of your sensors, the precision of your actuators, the speed of your onboard compute — these fundamentally determine what your AI can do.

Take tactile sensing as an example. Humans have about 17,000 mechanoreceptors in each hand. The best robotic hands today have maybe a few hundred tactile sensing elements. That is a massive information deficit. No amount of clever AI can fully compensate for not being able to feel what you are touching.

Similarly, actuator bandwidth matters. If your motors cannot respond fast enough, your robot cannot catch a tossed object or maintain balance on a suddenly shifting surface. The AI can compute the right response, but if the hardware cannot execute it in time, the robot still falls.

The most exciting developments are happening at the intersection of hardware and AI — sensors designed specifically to produce data that neural networks can process efficiently, and actuators designed to be inherently safe rather than relying on software safety limits.

What will be possible in five years that is not possible today?

In five years, I expect robots that can reliably perform multi-step household tasks — cooking simple recipes, doing laundry, loading and unloading a dishwasher. These will not be general-purpose butlers, but they will handle a defined set of tasks well enough to be genuinely useful, especially for elderly or disabled individuals who need assistance with daily living.

In research, I expect we will have robots that can learn new manipulation skills from a single demonstration or a natural language description. The foundation model approach will mature to the point where a new robot embodiment can be brought online in hours rather than months.

The most transformative development, though, will be robots that can explain their reasoning and ask for help when they are uncertain. That trust and transparency layer is what will make robots acceptable in human environments. Nobody wants a robot that silently makes mistakes. We want one that says, "I am not sure how to fold this fitted sheet. Can you show me?"


Dr. Okonkwo's lab at MIT CSAIL is supported by funding from the NSF, DARPA, and Toyota Research Institute. His recent paper on cross-embodiment foundation models won the Best Paper Award at CoRL 2025.
