AO Activity Events: How We Made Agent Failures Queryable
Agent Orchestrator (AO) already knew a lot about what happened inside a session. The problem was that most of that knowledge died in memory, logs, or silent catch blocks. When an agent failed to notice a PR, missed a review comment, kept polling a stale runtime, or triggered a report watcher repeatedly, the root-cause analysis (RCA) path was usually human archaeology.
The activity event logging work changed that. Ashish built the durable event layer in PR #1528. I consumed it in PR #1620 by wiring the first high-value lifecycle-manager failure paths into that layer. This post is the team handoff: why the architecture exists, how to use it, what PR #1620 actually emits, and the rules I want every follow-up PR to preserve.
The Problem
AO runs autonomous coding sessions across trackers, SCM providers, runtimes, agents, webhooks, and the dashboard. That makes normal logs insufficient. A log line can tell us something failed; it does not reliably answer which project, which session, which PR, which lifecycle state, whether the event was repeated, whether sensitive data was stripped, or whether another agent can query it later.
The goal was not to build a full observability platform. The goal was smaller and more useful for AO: create a local, queryable, sanitized RCA trail that humans and diagnostic agents can read after the fact.
Design principle: activity events are evidence, not telemetry noise. Every event should answer a real debugging question.
What Ashish Built First
PR #1528 gave us the platform layer. It is intentionally boring infrastructure: write events locally, sanitize the payload, keep the query surface stable, and make the write path best-effort so AO never changes behavior just because event storage is unavailable.
| Layer | Decision | Why it matters |
|---|---|---|
| Storage | SQLite-backed event store with WAL mode | AO can write events from many lifecycle paths without needing an external service. |
| Retention | Automatic pruning | The log stays useful locally without growing forever. |
| Search | FTS search plus structured filters | Humans and agents can search symptoms like "review fetch failed", not only exact schemas. |
| Safety | Sensitive key and credential URL sanitization | Events can include context without leaking tokens, passwords, or auth-bearing URLs. |
| CLI | ao events list, search, and stats | The RCA loop works from the terminal and can be consumed by subprocess agents. |
| Contract | Versioned JSON output | Automation can parse output without depending on human formatting. |
| Precedent | session.spawned, session.spawn_failed, session.killed, lifecycle.transition, activity.transition | Future contributors got copyable examples instead of a blank API. |
Why these infra calls were deliberate
Local SQLite WAL
We chose a local store because RCA evidence must exist even when external services are unavailable.
This layer is for AO operators and diagnostic agents, not centralized product analytics. SQLite keeps setup near-zero for local dev, CI, and customer debugging. WAL (write-ahead logging) mode lets reads proceed while a write is in progress, so lifecycle paths can keep emitting while someone queries the store; SQLite still serializes the writes themselves. Instrumentation therefore does not need its own service dependency.
The tradeoff is weaker cross-machine aggregation by default, but a much smaller blast radius and fewer privacy questions.
FTS plus structured filters
Humans and agents usually start from symptoms, not from exact event-kind names.
During an incident, the query is often "review comments missing", "PR not detected", "terminal vanished", or "cleanup skipped". FTS supports that natural starting point, while filters on session, project, source, level, and time narrow the result into an RCA trail.
The tradeoff is that summaries become part of the searchable contract, so they must stay descriptive, stable, and sanitized.
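To make the "symptom first, filters second" flow concrete, here is a minimal in-memory sketch. It is not the SQLite FTS implementation: the `ActivityEvent` shape, the `searchEvents` helper, the crude tokenizer, and the `info`/`error` level values are all assumptions for illustration.

```typescript
// Hypothetical in-memory stand-in for the event store; the real store is SQLite with FTS.
interface ActivityEvent {
  projectId: string;
  sessionId: string;
  source: string;
  kind: string;
  level: "info" | "warn" | "error"; // only "warn" appears in the post; the others are assumed
  summary: string;
  timestamp: number; // epoch milliseconds
}

interface EventFilter {
  projectId?: string;
  sessionId?: string;
  source?: string;
  level?: ActivityEvent["level"];
  since?: number;
}

// Lowercase and split on non-alphanumerics: a crude stand-in for FTS tokenization.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// An event matches when every provided filter matches exactly
// and every query token appears in its summary or kind.
function searchEvents(events: ActivityEvent[], query: string, filter: EventFilter = {}): ActivityEvent[] {
  const wanted = tokenize(query);
  return events.filter((e) => {
    if (filter.projectId !== undefined && e.projectId !== filter.projectId) return false;
    if (filter.sessionId !== undefined && e.sessionId !== filter.sessionId) return false;
    if (filter.source !== undefined && e.source !== filter.source) return false;
    if (filter.level !== undefined && e.level !== filter.level) return false;
    if (filter.since !== undefined && e.timestamp < filter.since) return false;
    const haystack = new Set(tokenize(`${e.summary} ${e.kind}`));
    return wanted.every((token) => haystack.has(token));
  });
}

const sample: ActivityEvent[] = [
  { projectId: "agent-orchestrator", sessionId: "ta-7", source: "scm", kind: "scm.review_fetch_failed", level: "warn", summary: "Review fetch failed for PR", timestamp: 1000 },
  { projectId: "agent-orchestrator", sessionId: "ta-7", source: "reaction", kind: "reaction.escalated", level: "warn", summary: "Reaction escalated after reaching max attempts", timestamp: 2000 },
  { projectId: "agent-orchestrator", sessionId: "ta-9", source: "scm", kind: "scm.detect_pr_failed", level: "warn", summary: "PR detection failed", timestamp: 3000 },
];

// Symptom text first, then narrow to one session.
const reviewHits = searchEvents(sample, "review fetch failed", { sessionId: "ta-7" });
console.log(reviewHits.map((e) => e.kind)); // [ 'scm.review_fetch_failed' ]
```

The sketch also shows why summaries are part of the contract: if a summary stops containing the words people search for, the symptom query silently stops matching.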
Automatic retention
Local observability only works if it stays cheap enough to leave on.
Without pruning, every agent run adds durable data forever. That creates disk growth, slower local queries, and higher risk that old context is retained longer than it should be. Retention keeps the default path operational instead of making event logging something people disable.
The tradeoff is that old incidents eventually need exported evidence or external archival if they must be preserved.
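The post does not say whether pruning is driven by age, row count, or both, so the following is only a sketch of what a retention pass could look like. `RetentionPolicy`, `pruneEvents`, and both limits are hypothetical names, and the real store would prune with SQL rather than arrays.

```typescript
// Hypothetical minimal row shape; only the fields the pruning logic needs.
interface StoredEvent {
  id: number;
  timestamp: number; // epoch milliseconds
}

// Assumed policy: drop rows older than maxAgeMs, then cap the total at maxRows.
interface RetentionPolicy {
  maxAgeMs: number;
  maxRows: number;
}

function pruneEvents<T extends StoredEvent>(events: T[], policy: RetentionPolicy, now: number): T[] {
  const cutoff = now - policy.maxAgeMs;
  const fresh = events.filter((e) => e.timestamp >= cutoff);
  // Keep the newest maxRows rows, then restore chronological order.
  const newestFirst = [...fresh].sort((a, b) => b.timestamp - a.timestamp);
  return newestFirst.slice(0, policy.maxRows).sort((a, b) => a.timestamp - b.timestamp);
}

const retained = pruneEvents(
  [
    { id: 1, timestamp: 0 },
    { id: 2, timestamp: 100 },
    { id: 3, timestamp: 200 },
    { id: 4, timestamp: 300 },
  ],
  { maxAgeMs: 250, maxRows: 2 },
  300,
);
console.log(retained.map((e) => e.id)); // [ 3, 4 ]
```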
Versioned JSON contract
Machine consumers should parse events as data, not scrape human CLI formatting.
The table output can change for readability. The JSON envelope should not. Versioning gives subprocess agents, scripts, and future dashboard integrations a stable shape to validate. If the schema changes later, AO can bump the version instead of silently breaking automation.
The tradeoff is maintenance discipline: new fields should be additive where possible, and breaking changes need an explicit version bump.
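As an illustration of why the version matters to a consumer, here is a sketch of a subprocess agent validating the envelope before trusting it. The actual envelope fields are not shown in this post, so `version`, `events`, and the per-event fields below are assumptions, as are `parseEnvelope` and `countByKind`.

```typescript
// Hypothetical envelope shape; the real field names may differ.
interface EventEnvelopeV1 {
  version: 1;
  events: Array<{
    kind: string;
    sessionId: string;
    summary: string;
    data?: Record<string, unknown>;
  }>;
}

const SUPPORTED_VERSION = 1;

// Fail loudly on an unknown version instead of guessing at a changed schema.
function parseEnvelope(raw: string): EventEnvelopeV1 {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("envelope must be a JSON object");
  }
  const envelope = parsed as { version?: unknown; events?: unknown };
  if (envelope.version !== SUPPORTED_VERSION) {
    throw new Error(`unsupported envelope version: ${String(envelope.version)}`);
  }
  if (!Array.isArray(envelope.events)) {
    throw new Error("envelope.events must be an array");
  }
  return parsed as EventEnvelopeV1;
}

// Group by the stable kind, never by the English summary.
function countByKind(envelope: EventEnvelopeV1): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of envelope.events) {
    counts.set(e.kind, (counts.get(e.kind) ?? 0) + 1);
  }
  return counts;
}

const rawOutput = JSON.stringify({
  version: 1,
  events: [
    { kind: "scm.review_fetch_failed", sessionId: "ta-7", summary: "Review fetch failed" },
    { kind: "scm.review_fetch_failed", sessionId: "ta-7", summary: "Review fetch failed" },
    { kind: "reaction.escalated", sessionId: "ta-7", summary: "Reaction escalated" },
  ],
});
const kindCounts = countByKind(parseEnvelope(rawOutput));
console.log(kindCounts.get("scm.review_fetch_failed")); // 2
```

Additive fields pass through this kind of consumer untouched, which is why additive changes are the cheap path and breaking ones need a bump.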
CLI as the first reader
The fastest RCA path should work before the dashboard or a remote service is involved.
AO sessions already run from terminals, and follow-up agents can call subprocesses. Exposing list, search, and stats in the CLI made the event layer immediately useful to both humans and automation without waiting for a UI workflow.
The tradeoff is that the CLI contract matters more now: flags, JSON output, and filtering semantics need the same care as API routes.
Sanitize before persistence
We need useful context without turning the event store into a secret dump.
Events often sit next to SCM URLs, config data, plugin errors, prompts, and runtime metadata. Sanitizing before storage and indexing means both CLI users and FTS queries operate on safe evidence, not raw credentials or auth-bearing URLs.
The tradeoff is that some raw failure detail may be intentionally unavailable later. That is correct for this layer.
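A rough sketch of what sanitize-before-persistence can look like, assuming a key-name pattern and a credential-URL regex. Neither is AO's actual sanitizer: the `SENSITIVE_KEY` list, the `[REDACTED]` marker, and both function names are hypothetical.

```typescript
// Assumed list of key-name fragments that should never be stored with their values.
const SENSITIVE_KEY = /token|secret|password|authorization|api[_-]?key|cookie/i;

// Matches the userinfo part of URLs such as https://user:pass@host/path.
const URL_CREDENTIALS = /([a-z][a-z0-9+.-]*:\/\/)[^\s\/@]+@/gi;

// Strip embedded credentials from any URL found in a string.
function sanitizeString(value: string): string {
  return value.replace(URL_CREDENTIALS, "$1");
}

// Walk the payload: redact values under sensitive keys, clean strings everywhere else.
function sanitizeValue(value: unknown): unknown {
  if (typeof value === "string") return sanitizeString(value);
  if (Array.isArray(value)) return value.map((item) => sanitizeValue(item));
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : sanitizeValue(inner);
    }
    return out;
  }
  return value;
}

const safePayload = sanitizeValue({
  prUrl: "https://user:tok@github.com/o/r",
  authToken: "abc",
  attempts: 3,
  nested: { password: "x" },
});
console.log(JSON.stringify(safePayload));
// {"prUrl":"https://github.com/o/r","authToken":"[REDACTED]","attempts":3,"nested":{"password":"[REDACTED]"}}
```

Running this before both the row write and the FTS index update is what keeps search results as safe as the stored rows.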
Best-effort writes
Observability cannot become a new lifecycle failure mode.
PR #1620 relied on this. Events were added inside sensitive lifecycle-manager paths, but event persistence is not allowed to decide whether AO detects, reacts, cleans up, or polls. The product behavior remains owned by the original code path.
The tradeoff is occasional missing evidence if storage is broken, which is better than breaking a user run while trying to log it.
Copyable event precedent
The first infra PR had to teach the next PR how to emit safely.
Events like session.spawned, session.spawn_failed, lifecycle.transition, and activity.transition created working examples for source, kind, summary, level, and data shape. That made #1620 a consumer PR, not a design-from-scratch PR.
The tradeoff is that precedent can spread mistakes too, so follow-up PRs should copy the pattern but still ask the RCA question first.
The most important invariant from the infra PR is simple: recordActivityEvent must be safe to call from critical paths. If event storage fails, the lifecycle continues. That is why PR #1620 could add events in sensitive state-machine code without turning observability into a new failure mode.
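The invariant can be illustrated with a small wrapper. This is a sketch under assumptions, not the real `recordActivityEvent`: `makeBestEffortRecorder`, `EventWriter`, `pollOnce`, and the module-level counter are hypothetical names that only mirror the behavior described here (swallow the storage error, count the drop, let the lifecycle continue).

```typescript
// Hypothetical minimal event shape for the sketch.
interface EventInput {
  source: string;
  kind: string;
  summary: string;
}

type EventWriter = (input: EventInput) => void;

// Counts events that could not be persisted.
let droppedEvents = 0;

// Wrap a writer so that storage failures never propagate to the caller.
function makeBestEffortRecorder(write: EventWriter): (input: EventInput) => boolean {
  return (input) => {
    try {
      write(input);
      return true;
    } catch {
      droppedEvents += 1;
      return false;
    }
  };
}

// A stand-in lifecycle step: the state decision is made first, the emit comes after,
// and the return value does not depend on whether the emit succeeded.
function pollOnce(recordEvent: (input: EventInput) => boolean): string {
  const nextState = "detecting";
  recordEvent({ source: "lifecycle", kind: "lifecycle.transition", summary: "Session moved to detecting" });
  return nextState;
}

// Simulate unavailable storage.
const brokenStore: EventWriter = () => {
  throw new Error("database is locked");
};

const resultingState = pollOnce(makeBestEffortRecorder(brokenStore));
console.log(resultingState, droppedEvents); // detecting 1
```

The same shape doubles as a test template: inject a throwing writer and assert that the product path returns what it would have returned anyway.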
Why FTS And JSON Matter
During the planning around issue #1511, the key idea was that the consumer of these events is not always a person reading a table. Often it is another agent trying to diagnose a broken AO run. That agent does not start with the exact event kind. It starts with a symptom: "PR not detected", "review comments missing", "terminal vanished", "cleanup skipped", "report watcher fired too many times".
That is why FTS search and a stable JSON envelope are not nice-to-haves. FTS lets a diagnostic agent search like a human. JSON lets it reliably extract session IDs, project IDs, PR URLs, lifecycle states, attempt counts, and reasons. Together they turn local events into a usable RCA substrate.
The RCA Pipeline
- 01. AO observes an edge. Lifecycle, SCM, runtime, agent, cleanup, webhook, or config code reaches a meaningful success/failure boundary.
- 02. Emit safe evidence. recordActivityEvent writes a compact source, kind, summary, level, and sanitized data payload.
- 03. Store locally. SQLite WAL and retention keep the evidence durable without introducing an external service dependency.
- 04. Search by symptom. Humans or diagnostic agents use ao events list/search/stats with structured filters or FTS queries.
- 05. Explain the run. The RCA answer comes from correlated events instead of reconstructing the state machine from memory.
```shell
ao events list --session ta-7 --json
ao events search "report watcher triggered" --project agent-orchestrator
ao events stats --since 24h
```

What PR #1620 Added
PR #1620 made lifecycle-manager the first serious consumer of the activity-event layer. I intentionally chose lifecycle-manager because it sits on the highest-value RCA path: it decides whether sessions are detecting, implementing, reviewing, reacting, cleaning up, or stuck. When it loses context, the whole system becomes hard to explain.
The PR added 17 event kinds across 18 emit sites in packages/core/src/lifecycle-manager.ts. The events were grouped into four bundles so review could reason about each class of behavior separately.
| Bundle | File and methods | Events | Question answered |
|---|---|---|---|
| B1: plugin-call failures | packages/core/src/lifecycle-manager.ts in populatePREnrichmentCache, determineStatus, maybeDispatchReviewBacklog, runtime and agent probing paths | scm.batch_enrich_failed, scm.detect_pr_succeeded, scm.detect_pr_failed, scm.review_fetch_failed, scm.poll_pr_failed, runtime.probe_failed, agent.process_probe_failed, agent.activity_probe_failed | When did AO first see the PR, and which SCM/runtime/agent probe failed before the state machine moved on? |
| B2: reactions | executeReaction | reaction.escalated, reaction.send_to_agent_failed, reaction.action_succeeded | Did AO try to auto-fix, notify, or auto-merge? Did it stop because attempts or duration were exhausted? |
| B3: cleanup and poll | maybeAutoCleanupOnMerge, checkSession, pollAll | session.auto_cleanup_deferred, session.auto_cleanup_completed, session.auto_cleanup_failed, detecting.escalated, lifecycle.poll_failed | Was cleanup skipped, completed, or failed? Did the polling loop crash? Did detecting take too long? |
| B4: report watcher | auditAndReactToReports | report_watcher.triggered | Why did AO think a session needed attention, and did the watcher fire only once for the activation? |
The implementation was deliberately local to existing catch blocks, success observers, and state transitions. It did not move lifecycle mutations. It did not add new state-machine branches. It did not make event writes part of control flow.
A Good Event Looks Like This
A good AO activity event is compact, attributable, safe, and queryable. It names the source layer, uses a specific kind, includes enough identifiers for correlation, and summarizes the RCA value in one sentence.
```typescript
recordActivityEvent({
  projectId, // SQL column: project/repo boundary for filtering
  sessionId: session.id, // SQL column: reconstruct one agent run end to end
  source: "reaction", // SQL column: subsystem that observed the edge
  kind: "reaction.escalated", // SQL column 'type': stable machine contract
  level: "warn", // SQL column: severity for triage and filtering
  summary: "Reaction escalated after reaching max attempts", // FTS-readable headline
  data: { // JSON text column: event-specific RCA facts
    reaction: "send_to_agent", // which automated action was attempted
    reason: "max_attempts", // stable grouping key; avoid prose here
    maxAttempts, // configured threshold AO compared against
    attempts, // observed value that crossed the threshold
  },
});
```

The call is synchronous and best-effort. It does not decide lifecycle behavior. Internally it opens the local SQLite event DB if needed, sanitizes the summary and data, writes one row, updates the FTS index through SQLite triggers, and periodically prunes old rows. If the DB is unavailable, AO increments a dropped-event counter and continues.
| Field | Stored as | Why it exists | Why not something else |
|---|---|---|---|
| projectId | project_id indexed SQL column | Scopes events to a repo/project so team members and agents can ask: "show me only this AO project." | Do not hide this inside data. Project filtering is a first-class query path. |
| sessionId | session_id indexed SQL column | Correlates events across one run: PR detection, review fetch, reaction, cleanup, terminal state. | Do not rely on summary text for this. Session correlation must be exact. |
| source | source indexed SQL column | Names the owner layer: lifecycle, scm, runtime, agent, reaction, api, ui. | Do not create one source per method. Source should be broad enough for ownership dashboards. |
| kind | type indexed SQL column | The stable event name automation should filter on, such as reaction.escalated. | Do not parse the English summary. The summary can change; the kind is the contract. |
| level | log_level SQL column | Separates expected breadcrumbs from warnings and errors. | Do not encode severity in kind names like reaction.escalated_error. |
| summary | summary SQL column plus FTS | One safe sentence for humans and symptom search. | Do not put raw exception blobs, prompts, tokens, or huge payloads here. |
| data | sanitized JSON string in data column plus FTS | Carries facts specific to this event kind without making the table sparse. | Do not make every possible field a SQL column. Promote only fields that become common query axes. |
This is deliberately a hybrid schema. The dimensions every event shares are normal SQL columns with indexes. The details that only make sense for one event kind live in data as sanitized JSON text. That keeps writes simple while preserving exact filters for the things we always query by: time, project, session, source, kind, and level.
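The hybrid shape can be sketched as a row type plus a mapping function. The column names `project_id`, `session_id`, `source`, `type`, `log_level`, `summary`, and `data` come from the field table above; `created_at`, `toRow`, and `filterRows` are hypothetical, and sanitization is omitted here for brevity.

```typescript
// Input as passed at the call site.
interface ActivityEventInput {
  projectId: string;
  sessionId: string;
  source: string;
  kind: string;
  level: string;
  summary: string;
  data?: Record<string, unknown>;
}

// One stored row: shared dimensions are real columns, everything else is JSON text.
interface ActivityEventRow {
  created_at: number; // assumed name for the time column
  project_id: string;
  session_id: string;
  source: string;
  type: string;
  log_level: string;
  summary: string;
  data: string;
}

type FilterableColumn = "project_id" | "session_id" | "source" | "type" | "log_level";

function toRow(input: ActivityEventInput, now: number): ActivityEventRow {
  return {
    created_at: now,
    project_id: input.projectId,
    session_id: input.sessionId,
    source: input.source,
    type: input.kind, // "kind" at the call site is stored in the "type" column
    log_level: input.level,
    summary: input.summary,
    data: JSON.stringify(input.data ?? {}),
  };
}

// Exact-match filtering only touches the shared columns, never the JSON text.
function filterRows(
  rows: ActivityEventRow[],
  where: Partial<Pick<ActivityEventRow, FilterableColumn>>,
): ActivityEventRow[] {
  const conditions = (Object.entries(where) as Array<[FilterableColumn, string | undefined]>)
    .filter(([, expected]) => expected !== undefined);
  return rows.filter((row) => conditions.every(([column, expected]) => row[column] === expected));
}

const escalationRow = toRow(
  {
    projectId: "agent-orchestrator",
    sessionId: "ta-7",
    source: "reaction",
    kind: "reaction.escalated",
    level: "warn",
    summary: "Reaction escalated after reaching max attempts",
    data: { reaction: "send_to_agent", reason: "max_attempts", maxAttempts: 3, attempts: 3 },
  },
  1000,
);
console.log(filterRows([escalationRow], { type: "reaction.escalated", session_id: "ta-7" }).length); // 1
```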
| data field | Meaning | Why this exact field |
|---|---|---|
| reaction | The reaction/action AO was trying to perform. | Without this, reaction.escalated tells us escalation happened but not which automation path caused it. |
| reason | A stable reason code such as max_attempts. | Agents can group by reason code. Free-form prose would create many accidental variants. |
| maxAttempts | The configured threshold. | Shows the policy AO was enforcing at the time of the event. |
| attempts | The observed count. | Shows the measured value that crossed the threshold. |
The data object is not a dumping ground. It should contain the few fields that make the event useful later: what AO was trying to do, what rule or boundary it hit, the configured limit, the observed value, and enough IDs to join it with neighboring events. If a field is useful across many event kinds and people need to filter by it constantly, that is a signal to standardize it or promote it later.
Rules For Follow-Up PRs
- Start with the RCA question. Do not begin with "where can I emit an event?" Begin with "what would we wish we knew during an incident?"
- Prefer state edges and exceptional paths. Emit when AO learns something, changes state, fails an integration call, escalates, retries, gives up, or completes cleanup. Avoid per-poll heartbeat spam.
- Keep ownership clear. If core already owns a spawn failure, the API or CLI should not emit a duplicate event for the same failure unless it adds new context.
- Preserve behavior. Event writes must never decide lifecycle behavior. Treat them as best-effort evidence after the real action has happened or failed.
- Use one-shot guards for repeated conditions. Report watcher triggers, detecting escalation, and long-lived stuck states need activation or milestone events, not one event every loop.
- Sanitize by construction. Include PR URLs, branch names, project IDs, stage names, and reason codes. Avoid raw prompts, secrets, tokens, headers, and large blobs.
- Test both sides. Test that the event fires with the expected shape, and test that AO still behaves if recordActivityEvent throws.
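The one-shot guard rule is the easiest one to get subtly wrong, so here is a minimal sketch of the idea: emit on the inactive-to-active edge, stay quiet while the condition persists, and re-arm once it clears. `OneShotGuard` and the key format are hypothetical, not the mechanism PR #1620 uses.

```typescript
// Tracks which conditions are currently active so each activation emits once.
class OneShotGuard {
  private active = new Set<string>();

  // Returns true only when the condition goes from not holding to holding.
  observe(key: string, conditionHolds: boolean): boolean {
    if (!conditionHolds) {
      this.active.delete(key); // condition cleared: re-arm for the next activation
      return false;
    }
    if (this.active.has(key)) {
      return false; // still the same activation: stay quiet
    }
    this.active.add(key);
    return true;
  }
}

const emittedKinds: string[] = [];
const guard = new OneShotGuard();

// Five polls of one session: attention needed for three loops, cleared, then needed again.
const pollReadings = [true, true, true, false, true];
for (const needsAttention of pollReadings) {
  if (guard.observe("ta-7:report_watcher", needsAttention)) {
    emittedKinds.push("report_watcher.triggered");
  }
}
console.log(emittedKinds.length); // 2 (one event per activation, not one per poll)
```

Keying the guard by session plus condition keeps one session's stuck state from suppressing another session's first trigger.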
How The Seven Rollout PRs Fit Together
After PR #1620 proved the consumer pattern in lifecycle-manager, I split the remaining work into narrow PRs so different owners can finish implementation and tests without stepping on each other. The split follows ownership boundaries, not random event kinds.
| PR | Area | What it captures |
|---|---|---|
| #1698 | CLI and supervisor lifecycle | Capture start/setup/stop/shutdown evidence near ao start, ao stop, supervisor setup, and graceful unregister paths. |
| #1695 | API mutation routes | Capture route-level mutation failures and attribution, while avoiding duplicate spawn failure events already owned by core. |
| #1693 | Webhooks and terminal WebSocket | Capture terminal PTY loss, webhook verification failures, and unsupported webhook paths with idempotent tests. |
| #1697 | Session manager | Round out session-manager events around metadata source, report-send retry stages, and prompt delivery context. |
| #1696 | Config, plugin registry, storage migration | Separate malformed config, invalid config, migration, and resolve failure signals so RCA can distinguish setup classes. |
| #1699 | Plugin internals | Capture distinct plugin failure shapes such as workspace destroy fallback, unreachable network signals, checkout failures, and branch collisions. |
| #1692 | Recovery, metadata corruption, agent-report | Capture recovery and metadata corruption paths without mixing them into normal steady-state probes. |
The rule for these PRs is the same as #1620: add evidence at the boundary where AO learns something important, keep the emit local to that boundary, and prove with tests that instrumentation cannot break the product path.
What I Would Improve Next
| Improvement | Why |
|---|---|
| Event catalog | A checked-in catalog of source, kind, level, summary pattern, and required data fields would reduce naming drift. |
| Small emit helpers | Helpers for common patterns like integration failure, escalation, retry exhausted, and one-shot trigger would keep call sites concise. |
| Reason code conventions | Stable reason strings are easier for agents to group than free-form error summaries. |
| Saved RCA queries | Common searches for PR detection, review fetch, terminal loss, config failure, and cleanup failure would make the CLI more teachable. |
| Noise budget | Every repeated condition should declare whether it is one-shot, milestone-based, or intentionally repeated. |
| Test template | Each PR should include the same minimum tests: emit shape, no throw-through, idempotency where relevant, and JSON queryability for at least one event. |
How To Review An Event PR
- Ask what question it answers. If the answer is just "it logs that we were here", the event is probably too weak.
- Check source and kind ownership. Source should name the layer that owns the event, not necessarily the file that happens to emit it.
- Check payload size and safety. The payload should explain the event without carrying secrets, raw prompts, or full external responses.
- Check ordering. The emit should sit after the successful action or inside the exact catch that already handles the failure.
- Check duplicates. If another layer already emits the same failure, either remove one or make the second event add distinct context.
- Check tests. A test should fail if the event disappears or if instrumentation starts throwing into the product path.
The Mental Model I Want The Team To Share
AO is a state machine wrapped around unreliable external systems. SCM can fail, runtimes can disappear, agents can stop responding, webhooks can be malformed, config can be invalid, and cleanup can be deferred. Activity events are how we make those edges explicit.
The point is not to log more. The point is to make AO explain itself. When a user asks "why did my session get stuck?", "why did the PR not get detected?", "why did AO notify me?", or "why did cleanup not happen?", the answer should be recoverable from local events without asking someone to reproduce the whole run.
That is why this work is worth doing across the codebase. It turns AO from a system that performs actions into a system that can also narrate the evidence behind those actions.