June 12, 2026 · 6 min read

Availability Is Not Enough: SLOs for Agent Decisions, Escalation, and Evidence

By Equipo Quantum Developers

Operating thesis

An agent may remain reachable all day and still deliver poor service. It can misclassify an exception, route it too late, or act without leaving evidence that another operator can verify. The falsifiable thesis is this: a dashboard limited to uptime and technical latency cannot prove that an agent made the right decision or that a person could intervene in time. A production contract needs three additional outcomes: decision quality, escalation latency, and evidence completeness.

In its SLO guidance, Google SRE treats correctness as a relevant indicator of system health and advises teams to select a small set of indicators tied to what users actually care about, rather than everything that is convenient to measure. For an agent, returning a response is not the same as resolving a case. The service unit is the business case that reaches an acceptable state, not the model call.

From uptime to a business SLO set

A useful SLO identifies an indicator, population, window, objective, and consequence. This set keeps distinct dimensions from collapsing into one number:

Dimension	Indicator	Denominator	Response to a miss
Operational availability	cases accepted by the flow	eligible cases received	activate a degraded mode
Decision quality	decisions confirmed as correct	cases with an observable outcome	reduce autonomy
Escalation	time from risk signal to human assignment	exceptions requiring review	increase coverage or narrow scope
Evidence	cases with input, rule, decision, actor, and outcome linked	closed cases	prevent closure
Approval	irreversible actions with a valid approval	controlled actions	block execution
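
One way to keep those five parts together is to record each SLO as a small typed object. The sketch below is illustrative only: the class name BusinessSLO, the 28-day window, and the 0.95 objective are assumptions made for the example, not values from this article or from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessSLO:
    """One row of the SLO set: indicator, population, window, objective, consequence."""

    dimension: str
    indicator: str         # what counts as a good event (the numerator)
    denominator: str       # the population the indicator is measured over
    window_days: int       # hypothetical rolling window
    objective: float       # hypothetical target; derive the real one from your own baseline
    response_to_miss: str  # the consequence agreed before any incident

    def is_met(self, good: int, total: int) -> bool:
        """Return True when the observed ratio meets the objective."""
        if total == 0:
            # With no cases in the denominator the SLO cannot be evaluated,
            # so it is not reported as met.
            return False
        return good / total >= self.objective


# Example row taken from the table; window and objective are placeholders.
decision_quality = BusinessSLO(
    dimension="Decision quality",
    indicator="decisions confirmed as correct",
    denominator="cases with an observable outcome",
    window_days=28,
    objective=0.95,
    response_to_miss="reduce autonomy",
)
```

Keeping the response in the same record as the objective makes it harder to publish a target that has no agreed consequence.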

Targets cannot be borrowed from another company. They depend on risk tolerance, review capacity, and the organization’s own baseline. In the AI RMF Core, NIST calls for documented human roles, deployment conditions, metrics, and limitations, with continued measurement in production across the Govern, Map, Measure, and Manage functions. That makes an SLO a governance decision rather than dashboard decoration.

A minimum measurement contract

Every case needs a stable identity. Without one, an alert cannot be joined to the decision or its eventual outcome. A minimum event can carry:

case_id and business_object_id for the case and affected object;
decision_type, decision, and confidence_band, without treating confidence as correctness;
policy_version, model_or_rule_version, and input references;
risk_class and escalation_reason;
assigned_human, assigned_at, and resolved_at;
approval_id when the action is controlled;
evidence_artifact_ids and outcome_status;
trace_id linking technical activity to the operational result.
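
The field list above can be sketched as a typed record with a small completeness check. This is a minimal illustration, not a schema defined by the article: the names CaseEvent, input_refs, controlled_action, and missing_evidence are hypothetical, as is the choice of which fields are optional.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class CaseEvent:
    case_id: str
    business_object_id: str
    decision_type: str
    decision: str
    confidence_band: str  # descriptive band only; not a correctness signal
    policy_version: str
    model_or_rule_version: str
    input_refs: list[str]  # hypothetical name for the input references
    trace_id: str
    risk_class: Optional[str] = None
    escalation_reason: Optional[str] = None
    assigned_human: Optional[str] = None
    assigned_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    approval_id: Optional[str] = None
    evidence_artifact_ids: list[str] = field(default_factory=list)
    outcome_status: Optional[str] = None
    controlled_action: bool = False  # hypothetical flag: does the action need approval?


def missing_evidence(event: CaseEvent) -> list[str]:
    """List the fields that would block closure under this sketch's rules."""
    missing = []
    if not event.input_refs:
        missing.append("input_refs")
    if not event.evidence_artifact_ids:
        missing.append("evidence_artifact_ids")
    if event.outcome_status is None:
        missing.append("outcome_status")
    # An escalated case needs a named human owner.
    if event.escalation_reason is not None and event.assigned_human is None:
        missing.append("assigned_human")
    # A controlled action needs an approval reference.
    if event.controlled_action and event.approval_id is None:
        missing.append("approval_id")
    return missing
```

A case whose missing_evidence list is non-empty would stay open, which is the consequence the evidence row of the SLO set describes.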

OpenTelemetry publishes semantic conventions that provide common names for traces, metrics, and logs across libraries and platforms. It does not define Quantum’s business objects, but it illustrates the right discipline: agree on vocabulary before building a scorecard. Teams can extend that convention with case, policy, and approval attributes.

Worked illustration: a queue of one hundred cases

Consider an illustrative window of one hundred eligible cases, all of which reach closure. Eighty are completed automatically and twenty are escalated. Reviewers can determine the outcome for ninety cases because the outcomes of the other ten are still pending. Of those ninety, eighty-five decisions agree with the confirmed outcome. Eighteen of the twenty exceptions were assigned within the internal limit, and eighty-eight of the one hundred closures contain a complete evidence packet.

These numbers are not a benchmark. They demonstrate denominators. Confirmed quality is eighty-five divided by ninety; timely escalation is eighteen divided by twenty; evidence completeness is eighty-eight divided by one hundred. Reporting eighty-five correct decisions out of one hundred would hide the ten cases that are not yet evaluable. Verification coverage must be shown separately.
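
The same arithmetic can be written out so that each ratio carries its own denominator. The sketch below uses only the illustrative counts from this section; the function and key names are hypothetical.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Divide, refusing an empty population instead of reporting a number."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def window_report() -> dict[str, float]:
    eligible = 100            # eligible cases in the window
    evaluable = 90            # cases whose outcome reviewers can determine
    confirmed_correct = 85    # decisions that agree with the confirmed outcome
    escalated = 20            # exceptions requiring review
    escalated_on_time = 18    # assigned within the internal limit
    evidence_denominator = 100  # the illustration divides by all one hundred cases
    complete_evidence = 88    # closures with a complete evidence packet

    return {
        "confirmed_quality": ratio(confirmed_correct, evaluable),
        "timely_escalation": ratio(escalated_on_time, escalated),
        "evidence_completeness": ratio(complete_evidence, evidence_denominator),
        # Reported separately so unevaluated cases stay visible.
        "verification_coverage": ratio(evaluable, eligible),
        # The misleading figure: correct decisions over all eligible cases.
        "naive_quality": ratio(confirmed_correct, eligible),
    }


report = window_report()
for name, value in report.items():
    print(f"{name}: {value:.1%}")
```

Confirmed quality comes out at roughly 94.4 percent with 90 percent verification coverage, while the naive figure of 85 percent folds the ten unevaluated cases into the error count.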

The response to a miss should also be written before an incident. If the quality error budget is exhausted, the agent can move from acting to recommending. If evidence is incomplete, the case cannot close. If escalation latency rises, the eligible population can be narrowed or another queue owner assigned. An SLO without a predetermined response is merely a metric, not a control mechanism.
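
A predetermined response can be as small as a function that maps the remaining error budget to an autonomy mode. The sketch below assumes a ratio-style objective and two modes; the names AutonomyMode, error_budget_remaining, next_mode, and can_close are hypothetical, and the rule that autonomy is never restored automatically is a design assumption, not something this article prescribes.

```python
from enum import Enum


class AutonomyMode(Enum):
    ACT = "act"              # the agent executes the decision
    RECOMMEND = "recommend"  # the agent proposes, a person decides


def error_budget_remaining(objective: float, good: int, total: int) -> float:
    """Fraction of the error budget still unused; zero or negative means exhausted."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1.0 - objective) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100 percent objective leaves no budget: any miss exhausts it.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


def next_mode(current: AutonomyMode, budget_remaining: float) -> AutonomyMode:
    """Lower autonomy when the quality budget is exhausted; never raise it here."""
    if budget_remaining <= 0:
        return AutonomyMode.RECOMMEND
    return current


def can_close(missing_evidence_fields: list[str]) -> bool:
    """A case closes only when no evidence field is missing."""
    return not missing_evidence_fields
```

With the illustrative 85 correct decisions out of 90 evaluable cases, a hypothetical 0.95 objective would exhaust the budget and move the agent to recommending, while a 0.90 objective would leave part of the budget unused. Returning to ACT is deliberately left to a named person rather than to this function.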

Accountability without ambiguity

Each indicator needs both a signal owner and a response owner. Operations may own the queue; the business domain owns the definition of a correct result; risk or internal control owns irreversible gates; engineering owns telemetry and degraded modes. NIST says responsibilities and communication lines should be clear and evaluation should resemble the deployment environment. A generic RACI is insufficient if the on-call operator does not know who can lower autonomy or authorize its return.

Within Quantum Automation Center, the catalog, execution status, timelines, artifacts, logs, and human approvals can become evidence surfaces. The purpose is not to accumulate screens. It is to link each signal to the same case_id and preserve the policy that governed the decision.

The strongest counterargument

The strongest objection is operational: measuring correctness requires labels, human review, and sometimes a wait for the outcome. A team that is still discovering the problem could spend more on instrumentation than on learning. Too many SLOs can also create conflicting incentives. Faster response may be achieved by flooding a human queue; higher automation may reduce verification coverage.

The answer is not to measure everything. Start with one indicator for each material dimension, disclose missing coverage, and increase rigor with risk. An assistive stage may use a reviewed sample while retaining evidence for every action. Autonomy should expand when the evidence supports it, not when a demonstration feels smooth.
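
A reviewed sample is easier to audit when selection is deterministic, so the same case always lands in or out of the sample. The sketch below hashes the case identifier to choose cases for human review; the function name in_review_sample and the idea of keying on case_id are assumptions made for illustration, and evidence is still expected for every case regardless of whether it is sampled.

```python
import hashlib


def in_review_sample(case_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a case is sent to human review.

    The first 8 bytes of a SHA-256 digest are treated as an integer in
    [0, 2**64) and compared against the sample rate scaled to that range.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(sample_rate * 2**64)
    return bucket < threshold


# Hypothetical usage: review roughly one case in ten during an assistive stage.
to_review = [cid for cid in ("case-001", "case-002", "case-003")
             if in_review_sample(cid, 0.10)]
```

Raising the rate as risk increases keeps the earlier sample inside the new one, because a case selected at a lower rate is always selected at a higher rate.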

When not to use this approach

Do not apply the full contract to creative work with no objectively correct answer, a one-off exploration, or a process with no owner who can define consequences. Do not use precise-looking targets to conceal a sample too small to interpret. If outcomes cannot yet be observed, keep the agent in an assistive role and first measure the review process.

Use the approach when an agent handles recurring cases, actions have operational consequences, and late exceptions create risk or work. The final test is straightforward: when a decision is challenged, the team should be able to reconstruct what happened, who was expected to respond, and what changed afterward.
