Robots Are Like Onions

One of the more surprising things about robotics is how often fixing a problem doesn’t seem to fix the problem.

At least from the outside.

A couple of seasons ago, we were seeing a large number of failures caused by dropped DDS (Data Distribution Service) messages. Reliability was poor, and the message passing issue was clearly one of the largest contributors.

If you looked at the rest of the data, it seemed obvious what would happen next. Fix the DDS issue and performance should improve dramatically. It looked like we might even be able to triple overall reliability.

Eventually the team found the root cause and fixed it. Then everyone anxiously watched the reliability metric.

The metric barely moved.

So what happened?

The DDS failures weren’t the only problems in the system. They were just happening so frequently that they obscured everything else.

Every time the robot encountered a DDS issue, the run ended before many of the other problems had a chance to surface. The message passing failures were effectively stomping on the rest of the intervention data.
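A toy model makes the masking effect easier to see. The sketch below is not our actual data or analysis: the stage names and failure probabilities are invented, and it assumes a run hits its failure modes in a fixed order and ends at the first one. Under those assumptions it compares the naive projection (every run that died on DDS would otherwise have succeeded) with what happens when those runs continue into the later stages.

```python
# Hypothetical per-stage failure probabilities, in the order a run meets them.
# These numbers are made up for illustration only.
STAGES = [("dds", 0.60), ("perception", 0.80), ("planning", 0.60)]


def observed_breakdown(stages):
    """Return (success_rate, {cause: fraction of all runs ended by that cause}).

    A run ends at its first failure, so later stages are only blamed for
    the runs that survived every earlier stage.
    """
    reach = 1.0  # probability that a run gets as far as the current stage
    blamed = {}
    for name, p_fail in stages:
        blamed[name] = reach * p_fail
        reach *= 1.0 - p_fail
    return reach, blamed


before_success, before_blame = observed_breakdown(STAGES)

# Naive projection: treat every DDS-ended run as a success once DDS is fixed.
naive_success = before_success + before_blame["dds"]

# What the model predicts: drop the DDS stage and let runs continue.
fixed_stages = [(name, p) for name, p in STAGES if name != "dds"]
after_success, after_blame = observed_breakdown(fixed_stages)

print(f"success before fix:   {before_success:.1%}")
print(f"naive projection:     {naive_success:.1%}")
print(f"success after fix:    {after_success:.1%}")
print(f"perception blame:     {before_blame['perception']:.1%} -> "
      f"{after_blame['perception']:.1%}")
```

In this toy setup the success rate after the fix lands far below the naive projection, and the share of runs ended by perception jumps, because those failures were always there and simply never got the chance to happen.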

This pattern shows up so often that I eventually started describing robots as onions.

Problems come in layers.

Fix one and you’ll often discover another one hiding underneath. The first issue wasn’t the problem. It was the first problem you could see.

I’ve seen this pattern repeatedly across robotics domains. Perception issues hide planning issues. Planning issues hide controls issues. Controls issues hide operational issues. Operational issues hide tooling issues.

The details change. The pattern doesn’t.

Stop Fighting Individual Fires

One consequence of this is that I’ve become increasingly skeptical of organizations that focus too heavily on individual failures.

Individual failures are interesting.

Trends are actionable.

When you’re operating a fleet, the goal isn’t understanding why one robot had a bad day. The goal is understanding why a hundred robots had a bad day.

The interesting question is rarely “What happened?”

The interesting question is “What keeps happening?”

That’s usually where the next bottleneck is hiding.
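One way to ask "What keeps happening?" of fleet data is to rank failure causes by how many distinct robots hit them, rather than by how dramatic any single incident was. The sketch below assumes intervention records are available as (robot_id, cause) pairs. The record format, the cause names, and the recurring_causes helper are all hypothetical.

```python
from collections import Counter

# Hypothetical intervention log: (robot_id, failure_cause).
interventions = [
    ("r1", "dds_drop"),
    ("r2", "dds_drop"),
    ("r3", "dds_drop"),
    ("r1", "dds_drop"),           # same robot again; counted once per robot
    ("r2", "localization_lost"),
    ("r4", "localization_lost"),
    ("r5", "gripper_jam"),        # a one-off: interesting, but not a trend
]


def recurring_causes(records, min_robots=2):
    """Rank causes by number of distinct robots affected, most widespread first.

    Causes seen on fewer than `min_robots` robots are dropped as one-offs.
    """
    robots_by_cause = {}
    for robot_id, cause in records:
        robots_by_cause.setdefault(cause, set()).add(robot_id)
    counts = Counter({cause: len(robots) for cause, robots in robots_by_cause.items()})
    return [(cause, n) for cause, n in counts.most_common() if n >= min_robots]


for cause, n_robots in recurring_causes(interventions):
    print(f"{cause}: seen on {n_robots} robots")
```

Counting distinct robots instead of raw events is a design choice in this sketch: it keeps one badly behaved robot from looking like a fleet-wide trend.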

Finding the Next Bottleneck Is Progress

The other lesson is that progress and metric movement aren’t always the same thing.

Sometimes a project removes a major bottleneck and the metric responds exactly the way everyone expected.

Sometimes the metric barely moves because removing one bottleneck exposed the next one.

I’ve seen teams become discouraged in that situation because it feels like all the work didn’t accomplish much.

I tend to view it differently.

Finding the next bottleneck is progress.

In complex systems, understanding what to work on next is often more valuable than squeezing a few more percentage points out of the current solution.

The organizations that improve the fastest aren’t necessarily the ones that fix problems the fastest.

They’re the ones that discover problems the fastest.

Build Systems That Help You Learn

Over time, this has shaped how I think about engineering investments.

Observability, simulation, testing infrastructure, developer tooling, and data systems don’t directly make robots better. What they do is shorten the time between discovering one bottleneck and discovering the next.

The best logging systems don’t eliminate failures. They make patterns easier to see.

The best simulation environments don’t magically improve performance. They make bottlenecks easier to discover.

The best engineering organizations don’t get rid of every layer of the onion.

They get better at peeling it.
