Agent Failures Don't Start Where They Appear
In long-running agent systems, the visible failure is usually downstream from the step where the run first became unrecoverable.
You joined the recent wave of agent experimentation.
You built an agent to automate a repetitive task no one on your team enjoys doing. It worked. Not only on your machine. Surprisingly, also in production.
For a while.
Then one run fails.
Forty-five minutes into the execution, the final result is wrong.
You are a careful developer, so the trace is complete. Every prompt, every tool call, every model response. The entire life of the run recorded.
So you start debugging.
Where do you begin?
Most people start at the end.
You read the final prompt.
You inspect the last tool call.
Then you scroll backward through the trace looking for the first line that feels suspicious.
Scrolling backward rarely helps.
The trace contains everything, but it gives you no signal about where the cause actually lives. Every step looks plausible. Every response looks reasonable. Nothing clearly marks the moment where the run became unrecoverable.
You are not isolating a failure.
You are reconstructing a story.
The debugging question is wrong
When engineers debug agent failures today, the question usually sounds like this:
What went wrong?
It feels natural. Something failed, so you look for the mistake.
But that question quietly assumes something about how failures happen.
It assumes the failure occurs at the moment you observe it.
For long-running agent systems, that assumption is almost always false.
The step where the failure becomes visible is rarely the step where the failure begins.
Consider a simple example.
An agent is screening transactions for sanctions risk.
At step 47 the agent approves a transfer that should have been halted.
The violation appears at step 47.
But the cause likely appeared earlier.
Maybe a database lookup returned a result that should have triggered escalation.
Maybe a metric quietly crossed a threshold.
Maybe a field in the agent's internal state flipped from safe to risky.
Whatever happened, it probably happened long before step 47.
By the time the final action occurs, the outcome may already be inevitable.
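The distinction above can be made concrete with a predicate over the agent's state rather than over its final action. A minimal sketch, assuming hypothetical state fields (`risk_category`, `escalated`) and illustrative snapshot data, none of which come from a real framework:

```python
# Hypothetical sketch: a violation predicate evaluated over recorded
# agent state, not over the final action. All field names are invented.

def sanctions_violation(state: dict) -> bool:
    """True once the run is in a state that should have halted the transfer."""
    return (
        state.get("risk_category") == "risky"
        and not state.get("escalated", False)
    )

# Snapshots of agent state at successive steps (illustrative data).
snapshots = [
    {"risk_category": "safe", "escalated": False},   # still recoverable
    {"risk_category": "risky", "escalated": False},  # onset: should escalate
    {"risk_category": "risky", "escalated": False},  # violation persists
]

# The first snapshot satisfying the predicate marks the onset,
# which sits well before the final approval step.
first_bad = next(i for i, s in enumerate(snapshots) if sanctions_violation(s))
print(first_bad)  # → 1
```

The point is that the predicate is true long before step 47; the final approval merely makes it visible.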
Failures spread forward through time
Agent systems do not fail like single function calls.
They fail across trajectories.
A small mistake at step 12 can propagate through dozens of later decisions.
Each step consumes the state produced by the previous one.
Once a bad state enters the system, every subsequent action builds on it.
The failure spreads forward through time.
By the time someone inspects the trace, the system may be dozens of steps past the moment where recovery was still possible.
That moment is the one investigators actually need.
The moment the future breaks
Imagine looking at a long execution trace.
Somewhere in the middle there is a step where the system crosses a boundary.
Before that step, the run could still succeed.
After that step, failure becomes inevitable unless something intervenes.
That boundary is subtle.
Nothing dramatic may happen there.
A flag flips.
A tool response changes.
A variable moves from one category to another.
But that transition determines everything that follows.
It is the point where the future breaks.
Why traces do not answer this question
Most modern agent tooling focuses on observability.
It records prompts.
Model outputs.
Tool calls.
Latency metrics.
These traces are extremely useful for reconstructing what happened.
But they do not help locate the moment where the system first entered a failing state.
The trace shows the entire history.
It does not highlight the causal boundary.
Engineers must read the sequence and infer the answer themselves.
Sometimes they get it right.
Sometimes they do not.
And often the process takes far longer than anyone would like.
The question investigators actually need
When systems run for minutes or hours, the debugging question changes.
It is no longer enough to ask:
What happened?
The real question becomes:
At what step did the system first enter the state from which failure became inevitable?
That is the moment where the trajectory diverged.
The moment where intervention could have changed the outcome.
The moment investigators actually care about.
Most agent tooling today cannot answer that question.
But if we could identify that moment reliably, debugging would look very different.
Instead of reading hundreds of log lines, engineers could jump directly to the transition where the run first went wrong.
The trace would stop being a story.
It would become evidence.
A different kind of execution history
To answer the onset question, the execution history itself has to change.
The system must record not just events, but the evolution of state over time.
Each step must capture what the agent knew at that moment.
Only then can we compare states across the run and determine when a violation first became true.
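What such a state-carrying execution history might look like, as a rough sketch. The class and field names here are invented for illustration, not taken from any real agent framework:

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical sketch: record the agent's state after every step,
# not just the event stream. Names are illustrative.

@dataclass
class StepRecord:
    step: int
    event: str             # e.g. "tool_call:lookup_sanctions"
    state: dict[str, Any]  # snapshot of what the agent knew at this step

@dataclass
class Run:
    records: list[StepRecord] = field(default_factory=list)

    def record(self, event: str, state: dict[str, Any]) -> None:
        # Copy the state so later mutation cannot rewrite history.
        self.records.append(StepRecord(len(self.records), event, dict(state)))

run = Run()
state = {"risk_category": "safe"}
run.record("tool_call:lookup_sanctions", state)
state["risk_category"] = "risky"
run.record("llm_response", state)
# run.records[0].state still says "safe"; run.records[1].state says "risky".
```

Because each record holds its own snapshot, a violation predicate can be evaluated at any step after the fact, without replaying the run.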
Once runs are recorded this way, a surprising capability becomes possible.
You can search the execution history for the exact step where failure begins.
And you can do it in logarithmic time.
That is the idea.