AO Activity Events: How We Made Agent Failures Queryable

Layer	Decision	Why it matters
Storage	SQLite-backed event store with WAL mode	AO can write events from many lifecycle paths without needing an external service.
Retention	Automatic pruning	The log stays useful locally without growing forever.
Search	Full-text search (FTS) plus structured filters	Humans and agents can search by symptom, such as "review fetch failed", without knowing the exact event kind.
Safety	Sensitive key and credential URL sanitization	Events can include context without leaking tokens, passwords, or auth-bearing URLs.
CLI	ao events list, search, and stats	The RCA loop works from the terminal and can be consumed by subprocess agents.
Contract	Versioned JSON output	Automation can parse output without depending on human formatting.
Precedent	session.spawned, session.spawn_failed, session.killed, lifecycle.transition, activity.transition	Future contributors got copyable examples instead of a blank API.

Decision

Local SQLite WAL

We chose a local store because RCA evidence must exist even when external services are unavailable.

Why this call

This layer is for AO operators and diagnostic agents, not centralized product analytics. SQLite keeps setup near-zero for local dev, CI, and customer debugging. WAL mode lets reads proceed while a lifecycle path is writing (SQLite still serializes the writes themselves), so instrumentation does not need its own service dependency.

Tradeoff

The tradeoff is weaker cross-machine aggregation by default, in exchange for a much smaller blast radius and fewer privacy questions.

Decision

FTS plus structured filters

Humans and agents usually start from symptoms, not from exact event-kind names.

Why this call

During an incident, the query is often a symptom such as "review comments missing", "PR not detected", "terminal vanished", or "cleanup skipped". FTS supports that natural starting point, while filters on session, project, source, level, and time narrow the results into an RCA trail.
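
To make the combination of symptom search and structured filters concrete, here is a minimal sketch of a query builder. It is not AO's actual implementation: the table names `events` and `events_fts`, the `ts` time column, and the `EventFilters` shape are assumptions for illustration. The `project_id`, `session_id`, `source`, and `log_level` column names follow the field table later in this document.

```typescript
// Hypothetical filter shape; field names mirror the filters listed above.
interface EventFilters {
  text?: string; // symptom text, e.g. "review fetch failed"
  sessionId?: string;
  projectId?: string;
  source?: string;
  level?: string;
  sinceMs?: number; // lower time bound in epoch milliseconds (assumed unit)
}

interface BuiltQuery {
  sql: string;
  params: Array<string | number>;
}

// Quote each token so user text is matched literally instead of being parsed
// as FTS5 query syntax. FTS5 treats adjacent quoted terms as an implicit AND.
function toFtsQuery(text: string): string {
  return text
    .trim()
    .split(/\s+/)
    .map((token) => `"${token.replace(/"/g, '""')}"`)
    .join(" ");
}

// Table names (events, events_fts) and the ts column are assumptions.
function buildEventQuery(filters: EventFilters): BuiltQuery {
  const where: string[] = [];
  const params: Array<string | number> = [];
  let from = "events e";

  if (filters.text !== undefined && filters.text.trim() !== "") {
    from += " JOIN events_fts ON events_fts.rowid = e.id";
    where.push("events_fts MATCH ?");
    params.push(toFtsQuery(filters.text));
  }

  // Exact-match filters go through indexed columns, never through summary text.
  const exact: Array<[string, string | undefined]> = [
    ["e.session_id", filters.sessionId],
    ["e.project_id", filters.projectId],
    ["e.source", filters.source],
    ["e.log_level", filters.level],
  ];
  for (const [column, value] of exact) {
    if (value !== undefined) {
      where.push(`${column} = ?`);
      params.push(value);
    }
  }

  if (filters.sinceMs !== undefined) {
    where.push("e.ts >= ?");
    params.push(filters.sinceMs);
  }

  const whereSql = where.length > 0 ? ` WHERE ${where.join(" AND ")}` : "";
  return { sql: `SELECT e.* FROM ${from}${whereSql} ORDER BY e.ts ASC`, params };
}
```

Ordering by time ascending is a design choice for this sketch: an RCA trail usually reads best from the first symptom forward.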

Tradeoff

The tradeoff is that summaries become part of the searchable contract, so they must stay descriptive, stable, and sanitized.

Decision

Automatic retention

Local observability only works if it stays cheap enough to leave on.

Why this call

Without pruning, every agent run adds durable data forever. That creates disk growth, slower local queries, and higher risk that old context is retained longer than it should be. Retention keeps the default path operational instead of making event logging something people disable.
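
The source only says pruning is automatic; it does not state the policy. As a sketch under that caveat, the following assumes a policy of "drop events older than a maximum age, then cap the total row count", applied here to an in-memory list. `StoredEvent`, `pruneStoredEvents`, and both limits are hypothetical names.

```typescript
// Minimal stand-in for a stored event; only the fields pruning needs.
interface StoredEvent {
  id: number;
  ts: number; // event time in epoch milliseconds (assumed unit)
}

// Assumed policy: keep events newer than maxAgeMs, then keep at most
// maxRows of the newest survivors. Returns a new array, newest first.
function pruneStoredEvents(
  events: StoredEvent[],
  nowMs: number,
  maxAgeMs: number,
  maxRows: number,
): StoredEvent[] {
  const cutoff = nowMs - maxAgeMs;
  const fresh = events.filter((event) => event.ts >= cutoff);
  fresh.sort((a, b) => b.ts - a.ts); // newest first
  return fresh.slice(0, maxRows);
}
```

In a SQLite-backed store the same policy would more likely be expressed as a `DELETE` statement run on a schedule or on write; the in-memory form is only meant to show the two limits.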

Tradeoff

The tradeoff is that old incidents eventually need exported evidence or external archival if they must be preserved.

Decision

Versioned JSON contract

Machine consumers should parse events as data, not scrape human CLI formatting.

Why this call

The table output can change for readability. The JSON envelope should not. Versioning gives subprocess agents, scripts, and future dashboard integrations a stable shape to validate. If the schema changes later, AO can bump the version instead of silently breaking automation.
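
A consumer-side sketch of what "validate a stable shape" could look like. The envelope field names `version` and `events`, and the version number `1`, are assumptions; the source does not spell out the envelope.

```typescript
// Assumed envelope version this consumer understands.
const SUPPORTED_ENVELOPE_VERSION = 1;

// Hypothetical envelope shape: a version plus a list of events.
interface EventEnvelope {
  version: number;
  events: unknown[];
}

// Parses CLI JSON output and fails loudly on an unknown version,
// instead of guessing at a shape that may have changed.
function parseEventEnvelope(raw: string): EventEnvelope {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("envelope must be a JSON object");
  }
  const obj = parsed as Record<string, unknown>;
  if (obj.version !== SUPPORTED_ENVELOPE_VERSION) {
    throw new Error(`unsupported envelope version: ${String(obj.version)}`);
  }
  if (!Array.isArray(obj.events)) {
    throw new Error("envelope.events must be an array");
  }
  return { version: SUPPORTED_ENVELOPE_VERSION, events: obj.events };
}
```

Rejecting unknown versions is the conservative choice for automation: a script that stops with a clear error is easier to debug than one that silently misreads a changed schema.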

Tradeoff

The tradeoff is maintenance discipline: new fields should be additive where possible, and breaking changes need an explicit version bump.

Decision

CLI as the first reader

The fastest RCA path should work before the dashboard or a remote service is involved.

Why this call

AO sessions already run from terminals, and follow-up agents can call subprocesses. Exposing list, search, and stats in the CLI made the event layer immediately useful to both humans and automation without waiting for a UI workflow.

Tradeoff

The tradeoff is that the CLI contract matters more now: flags, JSON output, and filtering semantics need the same care as API routes.

Decision

Sanitize before persistence

We need useful context without turning the event store into a secret dump.

Why this call

Events often sit next to SCM URLs, config data, plugin errors, prompts, and runtime metadata. Sanitizing before storage and indexing means both CLI users and FTS queries operate on safe evidence, not raw credentials or auth-bearing URLs.
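
The idea of sanitizing before storage can be sketched as a recursive pass over the event payload. The key pattern, the URL pattern, and the `[REDACTED]` marker below are illustrative assumptions, not AO's actual rules.

```typescript
// Assumed list of sensitive key fragments; a real list would be longer.
const SENSITIVE_KEY_PATTERN = /token|secret|password|passwd|authorization|api[_-]?key/i;

// Matches the userinfo part of a URL, e.g. https://user:pass@host/...
const CREDENTIAL_URL_PATTERN = /([a-z][a-z0-9+.-]*:\/\/)[^\/\s@]+@/gi;

// Returns a sanitized copy: values under sensitive keys are replaced,
// and credentials embedded in URLs inside any string are masked.
function sanitizeEventValue(value: unknown): unknown {
  if (typeof value === "string") {
    return value.replace(CREDENTIAL_URL_PATTERN, "$1[REDACTED]@");
  }
  if (Array.isArray(value)) {
    return value.map((item) => sanitizeEventValue(item));
  }
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY_PATTERN.test(key) ? "[REDACTED]" : sanitizeEventValue(inner);
    }
    return out;
  }
  return value; // numbers, booleans, null, undefined pass through
}
```

Running this before both the insert and the FTS indexing step is what keeps search results as safe as the stored rows.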

Tradeoff

The tradeoff is that some raw failure detail is intentionally unavailable later. That is the intended behavior for this layer.

Decision

Best-effort writes

Observability cannot become a new lifecycle failure mode.

Why this call

PR #1620 relied on this. Events were added inside sensitive lifecycle-manager paths, but event persistence is not allowed to decide whether AO detects, reacts, cleans up, or polls. The product behavior remains owned by the original code path.
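
A minimal sketch of the best-effort pattern, assuming a synchronous writer function. `EventWriter`, `emitBestEffort`, and `detectStep` are hypothetical names; only the event kind `scm.detect_pr_succeeded` comes from this document.

```typescript
interface MinimalEvent {
  kind: string;
  summary: string;
}

// Hypothetical writer signature; a real one would persist to the event store.
type EventWriter = (event: MinimalEvent) => void;

// Never throws: a storage failure is swallowed on purpose so that the
// caller's lifecycle logic is not affected. Returns whether the write worked.
function emitBestEffort(write: EventWriter, event: MinimalEvent): boolean {
  try {
    write(event);
    return true;
  } catch {
    return false;
  }
}

// A lifecycle step whose result does not depend on the event store.
function detectStep(write: EventWriter): string {
  const outcome = "pr_detected"; // decided by the original code path
  emitBestEffort(write, {
    kind: "scm.detect_pr_succeeded",
    summary: "Detected PR for session",
  });
  return outcome;
}
```

The return value of `emitBestEffort` is informational only; call sites in this sketch deliberately ignore it.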

Tradeoff

The tradeoff is occasional missing evidence if storage is broken, which is better than breaking a user run while trying to log it.

Decision

Copyable event precedent

The first infra PR had to teach the next PR how to emit safely.

Why this call

Events like session.spawned, session.spawn_failed, lifecycle.transition, and activity.transition created working examples for source, kind, summary, level, and data shape. That made #1620 a consumer PR, not a design-from-scratch PR.

Tradeoff

The tradeoff is that precedent can spread mistakes too, so follow-up PRs should copy the pattern but still ask the RCA question first.

Bundle	File and methods	Events	Question answered
B1: plugin-call failures	packages/core/src/lifecycle-manager.ts in populatePREnrichmentCache, determineStatus, maybeDispatchReviewBacklog, runtime and agent probing paths	scm.batch_enrich_failed, scm.detect_pr_succeeded, scm.detect_pr_failed, scm.review_fetch_failed, scm.poll_pr_failed, runtime.probe_failed, agent.process_probe_failed, agent.activity_probe_failed	When did AO first see the PR, and which SCM/runtime/agent probe failed before the state machine moved on?
B2: reactions	executeReaction	reaction.escalated, reaction.send_to_agent_failed, reaction.action_succeeded	Did AO try to auto-fix, notify, or auto-merge? Did it stop because attempts or duration were exhausted?
B3: cleanup and poll	maybeAutoCleanupOnMerge, checkSession, pollAll	session.auto_cleanup_deferred, session.auto_cleanup_completed, session.auto_cleanup_failed, detecting.escalated, lifecycle.poll_failed	Was cleanup skipped, completed, or failed? Did the polling loop crash? Did detecting take too long?
B4: report watcher	auditAndReactToReports	report_watcher.triggered	Why did AO think a session needed attention, and did the watcher fire only once for the activation?
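
The B4 question, whether the watcher fired only once per activation, suggests a one-shot guard around the emit call. The sketch below is one possible shape; `OneShotEmitter` and the activation key format are hypothetical, and the source does not describe how PR #1620 implements this.

```typescript
// Remembers which activation keys have already produced an event.
class OneShotEmitter {
  private readonly seen = new Set<string>();

  // Calls emit only the first time a key is seen. Returns true if it fired.
  emitOnce(key: string, emit: () => void): boolean {
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    emit();
    return true;
  }

  // Clears a key so the next activation can fire again.
  reset(key: string): void {
    this.seen.delete(key);
  }
}
```

A caller might key on something like the session id plus the trigger name, so that repeated polls of the same condition yield a single `report_watcher.triggered` event until the condition clears.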

Field	Stored as	Why it exists	Why not something else
projectId	project_id indexed SQL column	Scopes events to a repo/project so team members and agents can ask: show me only this AO project.	Do not hide this inside data. Project filtering is a first-class query path.
sessionId	session_id indexed SQL column	Correlates events across one run: PR detection, review fetch, reaction, cleanup, terminal state.	Do not rely on summary text for this. Session correlation must be exact.
source	source indexed SQL column	Names the owner layer: lifecycle, scm, runtime, agent, reaction, api, ui.	Do not create one source per method. Source should be broad enough for ownership dashboards.
kind	type indexed SQL column	The stable event name automation should filter on, such as reaction.escalated.	Do not parse the English summary. The summary can change; the kind is the contract.
level	log_level SQL column	Separates expected breadcrumbs from warnings and errors.	Do not encode severity in kind names like reaction.escalated_error.
summary	summary SQL column plus FTS	One safe sentence for humans and symptom search.	Do not put raw exception blobs, prompts, tokens, or huge payloads here.
data	sanitized JSON string in data column plus FTS	Carries facts specific to this event kind without making the table sparse.	Do not make every possible field a SQL column. Promote only fields that become common query axes.
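
The field-to-column mapping in the table above can be written down as a small conversion function. This is a sketch: the column names come from the "Stored as" column, while the interface names, the required-versus-optional choices, and the function name `toEventRow` are assumptions.

```typescript
// Event as call sites see it (field names from the table above).
interface ActivityEventInput {
  projectId: string;
  sessionId: string;
  source: string;
  kind: string;
  level: string;
  summary: string;
  data: Record<string, unknown>; // assumed to be sanitized already
}

// Row as stored (column names from the "Stored as" column).
interface EventRow {
  project_id: string;
  session_id: string;
  source: string;
  type: string;
  log_level: string;
  summary: string;
  data: string;
}

function toEventRow(event: ActivityEventInput): EventRow {
  return {
    project_id: event.projectId,
    session_id: event.sessionId,
    source: event.source,
    type: event.kind, // kind is stored in the type column
    log_level: event.level,
    summary: event.summary,
    data: JSON.stringify(event.data), // kind-specific facts stay in JSON
  };
}
```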

data field	Meaning	Why this exact field
reaction	The reaction/action AO was trying to perform.	Without this, reaction.escalated tells us escalation happened but not which automation path caused it.
reason	A stable reason code such as max_attempts.	Agents can group by reason code. Free-form prose would create many accidental variants.
maxAttempts	The configured threshold.	Shows the policy AO was enforcing at the time of the event.
attempts	The observed count.	Shows the measured value that crossed the threshold.
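
Putting the four data fields together, a `reaction.escalated` event might be built as follows. The level name `warn`, the summary wording, and the helper name `buildEscalationEvent` are assumptions for illustration; the event kind, the `reaction` source, and the `max_attempts` reason code come from this document.

```typescript
interface EscalationData {
  reaction: string; // which automation path escalated
  reason: string; // stable reason code, not prose
  maxAttempts: number; // configured threshold
  attempts: number; // observed count
}

interface EscalationEvent {
  source: "reaction";
  kind: "reaction.escalated";
  level: string;
  summary: string;
  data: EscalationData;
}

function buildEscalationEvent(
  reaction: string,
  attempts: number,
  maxAttempts: number,
): EscalationEvent {
  return {
    source: "reaction",
    kind: "reaction.escalated",
    level: "warn", // assumed level name
    summary: `Reaction ${reaction} escalated after ${attempts} of ${maxAttempts} attempts`,
    data: { reaction, reason: "max_attempts", maxAttempts, attempts },
  };
}
```

The summary stays a single safe sentence for humans, while grouping and filtering rely on `kind` and `data.reason`.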

PR	Area	Ownership
#1698	CLI and supervisor lifecycle	Capture start/setup/stop/shutdown evidence near ao start, ao stop, supervisor setup, and graceful unregister paths.
#1695	API mutation routes	Capture route-level mutation failures and attribution, while avoiding duplicate spawn failure events already owned by core.
#1693	Webhooks and terminal WebSocket	Capture terminal PTY loss, webhook verification failures, and unsupported webhook paths with idempotent tests.
#1697	Session manager	Round out session-manager events around metadata source, report-send retry stages, and prompt delivery context.
#1696	Config, plugin registry, storage migration	Separate malformed config, invalid config, migration, and resolve failure signals so RCA can distinguish setup classes.
#1699	Plugin internals	Capture distinct plugin failure shapes such as workspace destroy fallback, unreachable network signals, checkout failures, and branch collisions.
#1692	Recovery, metadata corruption, agent-report	Capture recovery and metadata corruption paths without mixing them into normal steady-state probes.

Improvement	Why
Event catalog	A checked-in catalog of source, kind, level, summary pattern, and required data fields would reduce naming drift.
Small emit helpers	Helpers for common patterns like integration failure, escalation, retry exhausted, and one-shot trigger would keep call sites concise.
Reason code conventions	Stable reason strings are easier for agents to group than free-form error summaries.
Saved RCA queries	Common searches for PR detection, review fetch, terminal loss, config failure, and cleanup failure would make the CLI more teachable.
Noise budget	Every repeated condition should declare whether it is one-shot, milestone-based, or intentionally repeated.
Test template	Each PR should include the same minimum tests: emit shape, no throw-through, idempotency where relevant, and JSON queryability for at least one event.
