One of the harder problems in robotic embodiment is safety. How do you guarantee standard-compliant, effective guardrails for generalist robots that are mobile and unrestricted in the tools they can use?

For example, it is practical to install light curtains around industrial robots to keep anyone from entering their working area while they are active.

But mobile robots can be anywhere, and you can't build a safe operating space around them.

Even if your robot is weak in its joints and has no sharp corners, all bets are off once it grabs a power tool or sits in the driver's seat of a car.

This requires a paradigm shift in safety. You aren't trying to limit the robot's movement in the classical sense; you're trying to make it act in a way that prevents harm from happening. In many cases this involves movement rather than stopping. Sometimes it requires preventing something outside the robot from happening: if something heavy is about to fall in a dangerous fashion, the robot should try to stop it.

This of course goes against the strictly defined rules of classical robotic safety, but the reason is simple: those kinds of limited operating envelopes won't make generalist mobile robots safe.

There are many rationales for static safety-constraint envelopes; for example, a malfunctioning robot shouldn't crush anything to death. There is still a place for such constraints, but they aren't enough, and approaching the safety challenge with these methods as the only tools in the toolbox won't lead to success.

Robotic safety systems shouldn't only care about physical malfunctions of the robot itself, but also about malfunctions of other things. For example, if a humanoid robot preparing food faces a cooking-oil fire, it should put the fire out instead of just stopping.

In general, robots should be robust against both degradations and extensions of their embodiments to function reliably in open environments. This alone should be solid protection against physical malfunctions: if a robot can walk after losing one leg, it should also function within reason, without causing danger, if one of its servos gets stuck active.

While hierarchies and layers create robust safety, the highest embodied control layer should itself be made safe; it shouldn't lean on lower constraint envelopes to produce that safety.

The robot must not step on a cat, or allow a cat to be harmed through inaction. If your robotic safety framework ceases to apply when the robot picks up a power tool, or presses the button that activates a data center's halon extinguishers, it's not framed correctly.

#AI #robotics #UniversalEmbodiment #OpenToWork

One reason why LLMs are so powerful is that they are not only world models, they are multi-agentic world models.

What does that mean?

It means that they learn to imitate the behavior of all the agents and pseudo-agents (fire, water, ...) in the data. The ego (assistant) is just a special case for them. They learn from third-party experience, and they are able to roleplay, or let's say "embody", any kind of agent you set up as the ego.

Robotics foundation models, by contrast, are typically not trained this way. They usually learn only from ego experience, which makes them fragile and poor at learning skills.

To get to #UniversalEmbodiment in #robotics, you need to reframe the learning in robots in an analogous fashion to LLMs. You must make the ego a special case, not the only point of view.

If you need help in this, I am an #AI generalist with over 25 years of experience, currently #OpenToWork.

Let's chat!

Multi-agentic foundation models are important for #robotics and #automation in negotiated and adversarial domains such as #traffic and #warfare.

But how to implement them? I have previously drafted a data-centric architecture for decomposing agentic representations for #UniversalEmbodiment in a GitHub repository.

But LLMs have already internalized multi-agentic representations, so why can't we utilize them directly? In text, for example, you can easily ask an LLM to describe all the persons or agents present in a scene, along with their intents.

We can and certainly must utilize them! But these representations aren't grounded.

What we need to do is craft robotic foundation model training data around scenarios where multiple agents are present.

First, start acausally from what ultimately happened: how was the scenario negotiated between the participants, who drove first, what attack and evasion patterns were used?

Once we know what happened, we can go back in time, ask the foundation model to identify all the participants in the feed, and complete their intentions with information from the ultimate outcome.

The foundation model can then utilize all the language-space knowledge it has about multi-agent environments, while also anchoring it to the visual and control signals present in the training data.

This allows the model not only to answer what each participant intends to do, but also to anchor this to multi-modal sensory information and to project embodiment-related control intents onto all the participants in the scenario, not only the ego.

The ego becomes just a special case in robotic control: the model should learn to generalize, projecting control intents onto every agent present in the data.

Ultimately this allows the foundation model to learn from the perceived and projected experiences of others: to imitate, or pointedly not imitate, what it has seen other agents do.
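A minimal sketch of this hindsight-labeling step. Everything here is hypothetical: the per-frame `tracks` structure stands in for a perception tracker's output, and the `annotate` callable stands in for a multimodal annotator model.

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    track_id: str                 # hypothetical perception-track id
    observations: list = field(default_factory=list)
    inferred_intent: str = ""     # filled in acausally, once the outcome is known

def label_scenario_in_hindsight(frames, outcome, annotate):
    # Pass 1: collect every participant's observations across the episode.
    participants = {}
    for t, frame in enumerate(frames):
        for track_id, obs in frame["tracks"].items():
            p = participants.setdefault(track_id, Participant(track_id))
            p.observations.append((t, obs))
    # Pass 2: the annotator (e.g. a multimodal LLM) sees each participant's
    # full trajectory *plus* the known outcome, so it can complete intents
    # that weren't observable causally.
    for p in participants.values():
        p.inferred_intent = annotate(p.observations, outcome)
    return list(participants.values())

frames = [
    {"tracks": {"car_1": "pose_a", "cat_7": "pose_b"}},
    {"tracks": {"car_1": "pose_c"}},
]
labeled = label_scenario_in_hindsight(
    frames, outcome="cat crossed safely",
    annotate=lambda obs, outcome: f"{len(obs)} obs -> {outcome}",
)
```

The point of the two-pass structure is exactly the acausality: intents are filled in only after the whole episode and its outcome are available.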

It's all about crafting data, not really about sophisticated model architectures.

#RoboticFoundationModels #FoundationModels #PhysicalAI #AI #AGI

It's not really about structured versus unstructured environments for #robots anymore. It's static versus agentic.

Robots in the real world will encounter other agents. Autonomous cars will need to negotiate with all kinds of other road users, including cats, which are everywhere in Spain at least. There was a video from East Asia where an old lady was drying her vegetables on the road and an autonomous car insisted on driving over them while she did her best to defend them.

So, for any autonomous robot "in the real world", the true challenge is no longer that there are no standard grasping surfaces or that items aren't in predefined places. Those are solved problems.

The challenge is in agentic environments where the system needs to understand the other living or at least moving entities and their objectives to appropriately navigate the inherently social situations.

This isn't only about cats trying to trip humanoid robots on stairs. It also covers non-living things like fire. Humans model fire psychologically as an entity with intent; hence they are evolutionarily adapted to keeping a fire burning, or limiting its destruction by putting it out.

Human psychology is very Aristotelian in how it models heavy things "wanting" to go down. Robotic psychology will need similar understanding to negotiate with, guide and harness dynamic entities in the world effectively.

For these purposes we will need to replace static world models with agentic world models which properly accommodate non-ego agents and non-ego intents in the world. What's cool is that this also enables a model to learn from third-party experience, which is always more abundant than ego experience. Monkey see, monkey do, or in some cases, learn to absolutely not do.

Let's work together on this and surpass human level in agentic, living environments as well!

#UniversalEmbodiment #RoboticFoundationModels #AI

Why do we need universal embodiment with in-context learning of the embodiment? Because the embodiment isn't fixed. There are the common degradations and even partial mechanical failures, of course, but also imagine:

A humanoid robot sits in a car's driver's seat and drives. The motor planning and reasoning shouldn't happen at the level of turning the steering wheel so many degrees; it should happen at the level of the changed embodiment, which is now the car.

The same goes for using tools, adapting the embodiment ad hoc for a purpose, and letting the same model design, build, repair and customize embodiments. It is all synergistic, and while the current paradigm of embodied AI doesn't aim for this next step yet, we will need to take it at some point.

Conveniently, when we combine this with multi-agent and intent characterization, embodiment adaptation becomes much easier, and we also get truly social robots which are able to negotiate and communicate in real-world multi-agentic spaces.

#UniversalEmbodiment #RoboticFoundationModels

One thing to understand about physical foundation models or robotic foundation models is in-context learning.

You should frame the problem and the data so that the model can learn to control the embodiment in-context, rather than training it without any possibility to calibrate and discover what it inhabits at the start of a session.

Otherwise you won't get truly universal models, but models which constantly hedge their bets, forced to make their control signals not only generalist, but generalist across all training worlds and embodiments *at the same time*.

This means you'll be stuck in a frame where you need a control adapter layer separately trained per embodiment, because the foundation model is incapable of discovering in-context what it inhabits, so its outputs are by necessity the kind that work somewhat OK for all possible worlds.

The model also becomes unable to learn embodiment-specific control policies without hacks.
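One way to frame such in-context calibration, sketched here with toy names and a toy `env` API (none of this comes from any real library): prepend a short block of probe actions and their observed responses to each training episode, so the sequence model can identify which embodiment it inhabits before the task begins.

```python
import random

def make_calibration_prefix(env, n_probes=8, rng=None):
    # Issue small random probe actions and record responses, so a sequence
    # model can infer in-context *which* embodiment it currently inhabits
    # instead of hedging across all training embodiments at once.
    rng = rng or random.Random(0)
    prefix = []
    for _ in range(n_probes):
        action = [rng.uniform(-0.1, 0.1) for _ in range(env.action_dim)]
        prefix.append(("probe", action, env.step(action)))
    return prefix

def build_training_episode(env, task_tokens):
    # Calibration tokens first, task second: the policy is trained to
    # condition its control outputs on the observed probe/response pairs.
    return make_calibration_prefix(env) + [("task", t) for t in task_tokens]

class DummyEnv:
    # Stand-in for a real embodiment; both the API and values are toys.
    action_dim = 3
    def step(self, action):
        return sum(action)        # toy proprioceptive response

episode = build_training_episode(DummyEnv(), ["reach", "grasp"])
```

The design choice is that calibration is part of the episode the model is trained on, not a separate adapter layer bolted on per embodiment.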

I believe the failure to recognize that these foundation models need in-context learning for embodiment calibration is the root of many practical problems down the line.

#PhysicalFoundationModels #UniversalEmbodiment #robots #FoundationModels

Why are these kinds of robots suddenly possible? From the mechatronics perspective, better batteries and higher-compute embedded computers are the enablers.

Otherwise it's mostly about using deep learning models to let robotic control algorithms encroach on more difficult unstructured environments, cheaper embodiments, and less well-defined skills.

It's old tech: reinforcement learning and Sim2Real.

It looks like these robots can do anything, right? Yes and no. As they are, they are insanely useful across many domains of robotic autonomy. Their competence in a small number of skills is clearly above that of a human controller, and in general above human level, just from training with Sim2Real methods and reinforcement learning.

There are limitations to those methods though. Reinforcement learning is extremely data-inefficient, and simulations need to be hand-crafted accurately for each kind of challenge. That's why you typically see these robots in uniform embodiments, demonstrating only a small number of tasks, roughly ten skills.

These aren't generalist AI. The same control models cannot be deployed to different kinds of robots, and they need to be retrained for any small upgrade to the robot model. Retraining is extremely expensive, not only in compute but also in the human labor of defining the simulated training environments and the target tasks, including their reward functions.

It's possible to reach and exceed this level of performance with generalist AI though, as has happened across many domains before. In the end, generalist models beat specialist models for many reasons.

Generalist models also don't need to be retrained for every new embodiment and skill; they can handle these in zero-shot fashion, live, in-context.

#UniversalEmbodiment

https://youtube.com/watch?v=X2UxtKLZnNo&si=UFDvM9So7jgMyxdh

Unitree B2-W Talent Awakening! πŸ₯³

ARC-AGI doesn't measure intelligence. Intelligence is competence in ridiculously transferrable skills and knowledge, where the transfer is bi-directional between different tasks.

If ARC-AGI measured a skill that is ridiculously transferrable, applicable across many diverse topics, LLMs would have learned this skill by learning competence across other kinds of generalist tasks. They didn't.

o3 achieved high scores on these tasks now, probably mostly because it was trained on 75% of the public ARC-AGI benchmark set, allowing it to learn the special skills these tasks require.

Since ARC-AGI skills are clearly super special, in the sense of being irrelevant for anything else, and human-imitative, they do not relate to intelligence at all. It is easy to come up with special tasks invoking special skills which do not apply to any other task.

As a contrived example, take an arbitrary hash function with an arbitrary seed and produce a sequence of numbers with it. The task is to guess the next number from the previous one. The skill to do this applies only to this particular hash function and seed, and doesn't generalize or transfer to any actually useful task.
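Here's a toy version of that contrived task (my own illustrative construction, using SHA-256 as the arbitrary hash function): a model trained on these (previous, next) pairs can only become competent by internalizing this exact hash-and-seed pipeline, which transfers to nothing else.

```python
import hashlib

def hash_sequence(seed, length):
    # Each number is derived from the previous one via SHA-256:
    # an arbitrary, fixed "hash function with a seed".
    seq, x = [], seed
    for _ in range(length):
        x = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:4], "big")
        seq.append(x)
    return seq

seq = hash_sequence(seed=42, length=6)
pairs = list(zip(seq, seq[1:]))   # the "task": predict next from previous
```

No amount of competence on other tasks helps with these pairs, and competence on these pairs helps with nothing else.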

ARC-AGI is like that, except the hash function is a human. The skill has very limited transfer, and that is exactly the feature which makes it "difficult" for AIs. If it were a skill that actually signified intelligence, it would paradoxically have been learnable by becoming competent in other, unrelated tasks. If it were a truly important skill among those related to intelligence, it would have been among the first skills LLMs learned, because such important, core intelligence skills are present in almost all tasks.

#UniversalEmbodiment #AI #ARCAGI #AGI #LLMs

When and why do generalist systems outperform specialist systems? This isn't a law of nature; indeed, there are cases where tasks are so distinct that two specialist systems will always outperform a generalist system aiming to perform well in both.

Basically, generalist systems outperform specialist systems based on a couple of factors:
- Amount and quality of data, and
- Representational capacity.

Specifically, if we have two completely distinct tasks with no shared features between the two datasets, a generalist system would need at least the sum of the representational capacities of two separate models. In such cases the two tasks just confuse and confound each other's performance during training; at the very least, training the generalist model takes longer than training two specialist models, and it becomes heavier for inference than running either specialist model alone.

In such cases generalist models don't make much sense. Typically these are cases with very few clearly defined and distinct tasks, for example, "detect cats from an image with bounding boxes", and "predict the stock value time series".

On the other end of the spectrum we have nearly identical tasks, where a generalist model would match either one of the two specialist models in representational capacity, but since the data are completely interchangeable, the generalist model can be trained with double the volume of data compared to each specialist model, making it decisively better. For example, one task of "predict stock value time series for a random half of the stocks" and another of "predict the same for the other half".
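The interchangeable-data end of the spectrum can be simulated with a toy estimator (illustrative numbers only, nothing to do with real stock data): two "specialist" fits, each on half the data, versus one "generalist" fit on the pooled data, all drawn from the same underlying law.

```python
import random

def fit_slope(xs, ys):
    # Least-squares slope for the zero-intercept model y = w * x.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def trial(rng, n_per_task, true_w=2.0, noise=1.0):
    # Two "tasks" whose data are fully interchangeable: same underlying law.
    def draw(n):
        xs = [rng.uniform(1.0, 2.0) for _ in range(n)]
        ys = [true_w * x + rng.gauss(0.0, noise) for x in xs]
        return xs, ys
    xa, ya = draw(n_per_task)      # task A ("half of the stocks")
    xb, yb = draw(n_per_task)      # task B ("the other half")
    specialist_err = abs(fit_slope(xa, ya) - true_w)
    generalist_err = abs(fit_slope(xa + xb, ya + yb) - true_w)
    return specialist_err, generalist_err

rng = random.Random(0)
results = [trial(rng, 20) for _ in range(500)]
mean_spec = sum(s for s, _ in results) / len(results)
mean_gen = sum(g for _, g in results) / len(results)
# Pooling doubles the data, so the generalist's average error comes out
# smaller than the specialist's across the trials.
```

Same model capacity, twice the data: the pooled fit is closer to the truth on average, which is the whole argument in miniature.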

There is a spectrum of task sets in between, and with less well-defined tasks the domains tend to overlap a lot, for example in assistant bots for ISP troubleshooting and for buying computers, and of course in general multi-purpose assistants.

As we keep developing more and more multi-purpose, generalist AIs, the tasks they are trained on and ultimately used to perform form a continuum. You could pick two example tasks which are very much distinct, without any overlap, but taken against all the tasks the models are expected to perform, those are just single points in a continuous distribution of potential tasks which, in aggregate, overlap heavily.

This overlap is the power of generalist models over specialist ones, and it takes the form of task data which benefits the performance of other tasks as well. In other words, the data involves transferrable skills and transferrable knowledge.

And even if in special cases some skills and knowledge have limited transfer, in aggregate they form a synergistic continuity of skill transfer and a basis for training-data efficiency.

#DeepLearning #AGI #ML #UniversalEmbodiment

Why and when is synthetic data better than real data for ML training?

It's not only a question of available volume, although in the past that was an important consideration.

In training data we want to have:
1. Knowledge which transfers to the target task, or transfers generally, in high fidelity.
2. Skills which generalize to the target task, or generalize broadly, in high fidelity.
3. Both represented in a way that allows instructing or controlling the trained model, typically in instruction-following form.

Can we get synthetic data that is better than real-world data? That actually depends on our models. If our models do not yet understand the skills needed, they won't be able to practice those skills to get better at them. If they lack knowledge, they cannot acquire it by themselves without input from the real world, whether through literature or active experimentation.

For some relatively generalist skills we already have frontier models which have acquired a bootstrappable level of competence, and which understand what those skills are about well enough to improve above human level through autonomous practice.

The knowledge pool trained into our generalist large language models and large multi-modal models is already vast, impressively above human level in most topics.

Of course, in new modalities like medical imagery and robotic control, both the skill competence and the required knowledge are still lacking in vanilla frontier models, but these can be readily trained into them by imitation and self-supervised learning.

Once the models achieve a bootstrappable level of competence in a new domain, they become able to self-improve by exercising the related skills and evaluating their own performance. In practice this becomes a process of recursive self-improvement through training-data refinement and synthesis.

We already have a clear engineering roadmap to surpass human level in all domains, one by one, and the progress won't step backwards. Knowledge and skills from other domains transfer to new domains, making this process easier and faster for every novel domain.

Now, consider a world where this process has reached its conclusion.

#RecursiveSelfImprovement #UniversalEmbodiment #LLMs #AI #AGI