June 27, 2026 · 6 min read

Agent Continuity Starts with Degraded Modes and Queue Freshness

By Equipo Quantum Developers

[Image: two file queues, one amber and one green, separated by an illuminated gate, with a person reviewing them in the background.]

Operating thesis

Restoring servers does not necessarily restore an operation. During a failure, invoices, shipments, or cases may accumulate; some may already have caused an external action while others have not. A backup does not decide what may wait, what moves to people, or how uncertain states are reconciled. The thesis is concrete: agent continuity requires admission rules, minimum function, queue freshness, and reconciliation.

NIST SP 800-34 Rev. 1 organizes contingency planning around business impact, recovery strategies, plans, testing, and maintenance. For an agent, impact analysis must reach the business object. “Service available” does not answer whether an invoice is still valid or a shipment has already lost its intervention window.

Four different clocks

Continuity should separate:

time to detect: how long degradation remains invisible;
time to contain: how long before new risky actions stop;
case age: how long the oldest object still needing attention has waited;
time to reconcile: how long before the team confirms what occurred during failure.
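
As a minimal sketch, the four clocks can be kept as separate durations instead of one "downtime" figure. The `IncidentTimeline` type and its field names are hypothetical, and the choice of starting points (containment measured from detection, reconciliation measured from containment) is an assumption that a team may define differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    """Hypothetical timestamps recorded for one incident."""
    degradation_started: datetime  # when the dependency actually degraded
    detected: datetime             # when the degradation became visible
    contained: datetime            # when new risky actions stopped
    reconciled: datetime           # when uncertain outcomes were confirmed


def continuity_clocks(timeline: IncidentTimeline,
                      oldest_open_case_created: datetime,
                      now: datetime) -> dict:
    """Return the four clocks as independent durations."""
    return {
        "time_to_detect": timeline.detected - timeline.degradation_started,
        # Assumption: containment is measured from detection.
        "time_to_contain": timeline.contained - timeline.detected,
        # Case age depends on the oldest open object, not on the incident.
        "case_age": now - oldest_open_case_created,
        # Assumption: reconciliation is measured from containment.
        "time_to_reconcile": timeline.reconciled - timeline.contained,
    }
```

Note that case age takes its own inputs: it can keep growing after the dependency is back, which is why it is tracked apart from the incident timestamps.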

RTO and RPO remain useful, but they do not replace these clocks. A system may recover quickly after its queue has already breached obligations. It may preserve every message and still duplicate an action because it cannot tell whether the first attempt reached the ERP.

AWS documents SQS metrics in its CloudWatch metric list, including the approximate number of visible messages (ApproximateNumberOfMessagesVisible) and the approximate age of the oldest message (ApproximateAgeOfOldestMessage). Those measures are service-specific, but they illustrate the principle: depth and age tell different stories. Business continuity needs both, segmented by priority.
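
The difference between depth and age can be illustrated with a small, service-independent sketch. The function name and the `(priority, enqueued_at)` input shape are assumptions made for illustration; this does not call any AWS API.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def depth_and_oldest_age(cases, now):
    """Summarize a queue per priority.

    cases: iterable of (priority, enqueued_at) pairs (hypothetical shape).
    Returns {priority: (depth, age_of_oldest_case)}.
    """
    depth = defaultdict(int)
    oldest = {}
    for priority, enqueued_at in cases:
        depth[priority] += 1
        # Track the earliest enqueue time seen for this priority.
        if priority not in oldest or enqueued_at < oldest[priority]:
            oldest[priority] = enqueued_at
    return {p: (depth[p], now - oldest[p]) for p in depth}
```

A shallow high-priority queue with one five-hour-old case and a deep low-priority queue of three-minute-old cases look very different in this summary, although a single depth chart would draw attention only to the second.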

Artifact: the degradation matrix

Failure	Function retained	Work stopped	Alternate route	Recovery signal
model or rule	capture and queue	new automated decisions	prior rule or human review	representative case test
external tool	classification and evidence	remote action	idempotent queue or manual process	confirmation and reconciliation
data source	cases with current data	affected population	authorized secondary source	freshness and consistency restored
approver	low-risk proposal	approval-bound actions	prioritized queue	human capacity confirmed
control platform	cached low-risk policy	changes and sensitive actions	buffered events	state synchronization
system of record	local intake and validation	final write	intent journal	later read and reconciliation

Each cell needs an owner, permission, activation command, and test. “Manual process” is not a route when nobody has tested its capacity or knows how to return results to the workflow.
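
One way to keep that requirement honest is to store the matrix as data and check it for gaps. The sketch below assumes a hypothetical `DegradationRoute` record per matrix row; the field names and the `last_tested` date string are illustrative and do not come from any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DegradationRoute:
    """One row of the degradation matrix plus its operational metadata."""
    failure: str
    function_retained: str
    work_stopped: str
    alternate_route: str
    recovery_signal: str
    owner: Optional[str] = None
    permission: Optional[str] = None
    activation_command: Optional[str] = None
    last_tested: Optional[str] = None  # ISO date of the last capacity test


def missing_requirements(route: DegradationRoute) -> list:
    """List the operational fields that are still empty for this route."""
    required = ("owner", "permission", "activation_command", "last_tested")
    return [name for name in required if not getattr(route, name)]
```

A route for which `missing_requirements` returns a non-empty list is a plan on paper and should not be counted as an available alternate route.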

Admission comes before recovery

When capacity falls, accepting everything worsens the incident. The “Handling Overload” chapter of Google’s SRE book explains how early rejection and prioritization can avoid wasted work and cascading failure. An agent needs an admission policy based on consequence:

preserve cases with an irreversible window or human impact;
accept cases whose evidence remains valid when processed;
defer work that can be regenerated without loss;
reject explicitly when storage only creates a toxic backlog.

Classification is defined before the incident. If it depends on the failed model, it is not a safeguard. Use deterministic fields such as type, deadline, value, or domain severity.
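
A minimal sketch of such a policy follows. It uses only deterministic fields and makes no model call. The `Case` fields, the `Admission` names, and the order in which the rules are evaluated are assumptions for illustration; a real policy would derive them from the team's impact analysis.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Admission(Enum):
    PRESERVE = "preserve"
    ACCEPT = "accept"
    DEFER = "defer"
    REJECT = "reject"


@dataclass(frozen=True)
class Case:
    case_type: str
    deadline: Optional[datetime]              # end of an irreversible window
    evidence_valid_until: Optional[datetime]  # None means it does not expire
    regenerable: bool                         # can be recreated without loss
    human_impact: bool


def admit(case: Case, now: datetime, expected_processing_at: datetime) -> Admission:
    """Classify a case during degraded capacity using deterministic fields."""
    has_open_window = case.deadline is not None and case.deadline > now
    if case.human_impact or has_open_window:
        return Admission.PRESERVE
    if case.regenerable:
        return Admission.DEFER
    evidence_ok = (case.evidence_valid_until is None
                   or case.evidence_valid_until >= expected_processing_at)
    if evidence_ok:
        return Admission.ACCEPT
    # Storing this case would only add stale work to the backlog.
    return Admission.REJECT
```

The `expected_processing_at` estimate is the weakest input here. If it cannot be estimated during an incident, a conservative fixed horizon per object type is a simpler alternative.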

A freshness ledger

Assign each object type:

maximum wait before review;
source and timestamp establishing validity;
event that invalidates the case;
recovery priority;
destination after expiry;
evidence needed for reprocessing.

These are internal limits, not benchmarks. A quote may expire with its price list, while a logistics alert loses value after the intervention window. The dashboard should show age bands or percentiles, not only an average; a few old cases can hide behind many new arrivals.
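
The ledger fields and the age-band view can be sketched as follows. `FreshnessRule` mirrors the six attributes listed above; the band boundaries (half of the allowed wait, and the allowed wait itself) are an assumption chosen for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FreshnessRule:
    """One ledger entry per object type (field names are illustrative)."""
    object_type: str
    max_wait: timedelta
    validity_source: str
    invalidating_event: str
    recovery_priority: int
    destination_after_expiry: str
    reprocessing_evidence: str


def age_bands(ages, max_wait: timedelta) -> dict:
    """Count cases by how much of the allowed wait they have consumed."""
    bands = {"under_50pct": 0, "50_to_100pct": 0, "expired": 0}
    for age in ages:
        if age > max_wait:
            bands["expired"] += 1
        elif age >= max_wait / 2:
            bands["50_to_100pct"] += 1
        else:
            bands["under_50pct"] += 1
    return bands


def mean_age(ages) -> timedelta:
    """Average age, shown only to contrast it with the bands."""
    ages = list(ages)
    return sum(ages, timedelta()) / len(ages)
```

With 98 cases that are one minute old and 2 cases that are ten hours old against a four-hour limit, the mean age is about 13 minutes while the `expired` band still reports two cases.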

Uncertain actions and reconciliation

The most dangerous category is not “failed.” It is “outcome unknown.” The agent sent a command, lost its connection, and cannot tell whether the external system accepted it. Retrying without an idempotency key can duplicate the action. Marking the case as failed can conceal an action that did occur.

The runbook creates a reconciliation set with identifier, intent, last known state, destination, and query method. Ask the system of record first; only then choose completion, compensation, or escalation. An uncertain case never returns automatically to the general queue.
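
The flow can be sketched as a function that consults the system of record before deciding anything. The `Observed` states, the mapping from each state to a resolution, and the `query` callable are assumptions for illustration; the real query method depends on the destination system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Observed(Enum):
    APPLIED = "applied"          # the record shows the intended action
    NOT_APPLIED = "not_applied"  # the record shows no trace of it
    DIVERGED = "diverged"        # the record shows something else
    UNKNOWN = "unknown"          # the record could not answer


class Resolution(Enum):
    COMPLETE = "complete"
    COMPENSATE = "compensate"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ReconciliationEntry:
    identifier: str
    intent: str
    last_known_state: str
    destination: str
    idempotency_key: str
    query: Callable[[str], Observed]  # hypothetical read against the record


def resolve(entry: ReconciliationEntry):
    """Return (resolution, resend) after asking the system of record."""
    observed = entry.query(entry.identifier)
    if observed is Observed.APPLIED:
        return Resolution.COMPLETE, False  # already done, only close the case
    if observed is Observed.NOT_APPLIED:
        return Resolution.COMPLETE, True   # resend with the same idempotency key
    if observed is Observed.DIVERGED:
        return Resolution.COMPENSATE, False
    return Resolution.ESCALATE, False      # never requeue automatically
```

There is deliberately no branch that returns the case to the general queue: every path ends in completion, compensation, or a person.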

Entering and leaving degraded mode

Activation can be automatic when it stops an action, but expanding capability should require evidence. Use explicit states: normal, restricted, recommendation only, capture only, and stopped. Every transition records actor and reason.

Exit requires more than a green indicator:

dependency is reachable and consistent;
synthetic check succeeds;
real-case sample runs without material action;
prioritized queue and capacity are adequate;
uncertain-state reconciliation is underway;
owner authorizes expansion.
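
A minimal sketch of these rules as a small state machine follows. The ordering of the modes by capability, the check names, and the decision to require every check for any expansion are assumptions; the source only states that restriction may be automatic and that expansion needs evidence.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Mode(IntEnum):
    """Modes ordered from least to most capability (assumed ordering)."""
    STOPPED = 0
    CAPTURE_ONLY = 1
    RECOMMENDATION_ONLY = 2
    RESTRICTED = 3
    NORMAL = 4


EXIT_CHECKS = (
    "dependency_consistent",
    "synthetic_check_passed",
    "real_sample_passed",
    "queue_and_capacity_adequate",
    "reconciliation_underway",
    "owner_authorized",
)


@dataclass
class ModeController:
    mode: Mode = Mode.NORMAL
    history: list = field(default_factory=list)

    def transition(self, target: Mode, actor: str, reason: str, evidence=None):
        """Restrict freely; expand only when every exit check has evidence."""
        if target > self.mode:
            evidence = evidence or {}
            missing = [c for c in EXIT_CHECKS if not evidence.get(c)]
            if missing:
                raise PermissionError(f"expansion blocked, missing: {missing}")
        # Every transition records when, from, to, actor, and reason.
        self.history.append(
            (datetime.now(timezone.utc), self.mode, target, actor, reason)
        )
        self.mode = target
```

For example, a monitor may call `transition(Mode.CAPTURE_ONLY, "monitor", "tool timeouts")` without evidence, while a later move back toward `Mode.NORMAL` raises `PermissionError` until all six checks are supplied.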

In Quantum Automation Center, execution states, timelines, artifacts, logs, permissions, and approvals can expose the current mode and link each case to recovery. The system of record remains authoritative for final action.

Tabletop exercise and live test

A tabletop scenario makes the failure specific: a source is reachable but stale, a tool times out, or approval capacity disappears. The team walks through the matrix, permissions, and communications. It then performs a controlled test: stop admission, activate the mode, process a sample, restore, and reconcile.

Measure decisions, not only duration. Did the right population receive protection? Was evidence preserved? Were cases orphaned? Did backlog age remain inside the boundary? Update the plan from findings, consistent with NIST’s contingency-plan maintenance cycle.

The strongest counterargument

Several degraded modes increase complexity. A rarely used path may fail during the incident, and maintaining duplicate logic consumes budget. For some teams, a clean stop is more reliable than partial operation.

That criticism is valid. Do not build one mode for every dependency. Begin with two: safely paused and a minimum function that protects the object. Add another only when impact analysis shows a stop causes greater harm and the organization can test the mode regularly.

When not to use this approach

Do not design complex degradation for noncritical work that can stop safely and resume from an authoritative source. Do not accumulate a queue when objects will expire before capacity returns.

Use the matrix when interruption creates obligations, windows, or uncertain actions. Real continuity is not keeping the agent “up.” It is preserving safe decisions while context is down and later proving what happened to every object.
