Architecture¶

Architecture — Gaby¶

Status: Draft v0.1 · Owner: Guilliano · Last updated: 2026-04-11

Reading order: SPEC.md → FOUNDATION.md → ARCHITECTURE.md (this) → ROADMAP.md.

SPEC.md says what we're building. FOUNDATION.md locks the stack and the repo layout. This doc is the technical how: lifecycles, state machines, contracts, concurrency, failure modes, data flow.

This is a living document. Anything with a § icon is a design decision that can be revisited with evidence; anything with a 🔒 is locked for v1.0.

1. System map — one picture¶

                                 ┌─────────────────────────────────┐
                                 │  End user (customer / employee) │
                                 └──────────────┬──────────────────┘
                                                │
                        ┌───────────────────────┼───────────────────────┐
                        │                       │                       │
                    Help desk               Chat widget             Slack / Teams
                   (Zendesk, Halo,        (JS snippet,              (bot user)
                    Linear, Zoho…)         shadow DOM)
                        │                       │                       │
                        └───────────┬───────────┴───────────┬───────────┘
                                    │                       │
                                    ▼                       ▼
                         ┌────────────────────┐    ┌────────────────────┐
                         │ TicketSource       │    │ ChatSession        │
                         │  adapters          │    │  manager           │
                         └─────────┬──────────┘    └─────────┬──────────┘
                                   │                         │
                                   └──────────┬──────────────┘
                                              ▼
                                 ┌─────────────────────────┐
                                 │ Event bus (in-proc)     │
                                 │   topic: ticket.new     │
                                 └────────────┬────────────┘
                                              │
                                              ▼
                                 ┌─────────────────────────┐
                                 │ Worker runner           │
                                 │  (in-proc or arq+Redis) │
                                 └────────────┬────────────┘
                                              │
                                              ▼
            ┌────────────────────────────────────────────────────────────┐
            │                    Agent loop                              │
            │   ┌─────────┐  ┌──────────┐  ┌───────────┐  ┌──────────┐ │
            │   │ Plan    │→ │ Retrieve │→ │ Tool call │→ │ Observe  │ │
            │   └─────────┘  └──────────┘  └─────┬─────┘  └────┬─────┘ │
            │        ▲                            │              │      │
            │        └────────────────────────────┴──────────────┘      │
            │                    (loop until verdict)                   │
            └───────┬─────────────┬──────────────────────┬──────────────┘
                    │             │                      │
                    ▼             ▼                      ▼
             ┌──────────┐  ┌────────────┐         ┌─────────────┐
             │ LLM      │  │ Knowledge  │         │ MCP host    │
             │ gateway  │  │ retrieval  │         │ (spawns and │
             │ (litellm)│  │ (hybrid)   │         │  supervises │
             └──────────┘  └────────────┘         │  connectors)│
                                                  └──────┬──────┘
                                                         │
                                      ┌──────────────────┼─────────────────┐
                                      │                  │                 │
                               MCP server          MCP server        MCP server
                               postgres            keycloak         zoho-desk
                               (stdio/HTTP)        (stdio/HTTP)     (stdio/HTTP)
                                      │                  │                 │
                                      ▼                  ▼                 ▼
                               Real Postgres       Real Keycloak     Real Zoho
                                 (read-only)        (read-only)     (read + write)

            The agent loop, before every tool call, passes through:

                     Safety pipeline (§6)
             ┌─────────────────────────────────┐
             │ scope check → redact → dry-run  │
             │       → apply → audit           │
             └─────────────────────────────────┘

Everything else in this document elaborates one of the boxes or one of the arrows.

2. Core request lifecycle — from "ticket arrives" to "ticket closed"¶

This is the canonical path. Every v0.1 scenario collapses to it.

 1. [TicketSource]  poll() / webhook → raw_ticket
 2.                  .normalize()    → Ticket (canonical form)
 3.                  persist → tickets table, emit "ticket.new"
 4. [Worker]         consumes "ticket.new" → schedules Investigation
 5. [Agent loop]     new Investigation(id, ticket_id, budget)
 5a.                   (optional) classify(ticket) → triage verdict
 5b.                   if triage == "not_worth_investigating":
 5c.                     verdict = "skipped"; go to step 23
 6.                    while not verdict:
 7.                      plan_next_step(working_memory)       # LLM call: planner
 8.                      if needs_retrieval:
 9.                        retrieve(query)                    # knowledge subsystem
10.                        append to working_memory
11.                      if needs_tool_call:
12.                        action = propose_tool_call()       # LLM call: tool_selector
13.                        safety_check(action, scopes, autonomy)   ←── may raise
14.                        if dry_run:
15.                          result = simulate(action)
16.                        else:
17.                          result = mcp_host.call(action)
18.                        audit.write(action, result)
19.                        append to working_memory
20.                      maybe_emit_step_to_ui(step)          # live updates
21.                      if budget_exceeded or max_iterations:
22.                        verdict = "failed_budget"; break
23.                    verdict = verdict or classify(working_memory)   # LLM call: verdict; keeps "skipped" / "failed_budget"
24.                    summary  = summarize(working_memory)   # LLM call: summarizer
25. [TicketSink]     write_back(ticket, summary, verdict)     # via the source adapter
26.                  update tickets.status
27.                  emit "investigation.done"
28. [Escalator]      if verdict ∈ {needs_tech, needs_l2, needs_client}:
29.                    dispatch_to_channel(persona.escalation_target)
30. [KB learner]     if verdict == "auto_resolved" and quality_gate_passes:
31.                    stage the resolution as a candidate KB entry (human review)

Legend for the LLM calls in the loop¶

Call name	Purpose	Model tier	Streaming
`planner`	Given working memory, what should we do next?	big	no
`tool_selector`	Choose a specific MCP tool + its arguments	big	no
`summarizer`	Turn the working memory into a customer-facing message	big	yes
`verdict`	Classify final outcome (auto_resolved / needs_tech / …)	small	no
`classifier` (optional)	Cheap pre-filter at step 5a (is this even worth investigating?)	small	no

The model router (§8.4) decides "big" vs "small". Classifier-style calls go through a cheap model so we don't spend flagship-model tokens on yes/no questions.

2.1 Working memory vs investigation steps — two things, not one¶

These are separate and must not be confused.

Thing	Shape	Scope	Persistence	Consumer
Working memory	A typed object `WorkingMemory { ticket, messages, tool_calls, retrieved_chunks, budget_state }`. The `messages` array is the LLM conversation history for this investigation.	One in-flight investigation	Snapshotted to `investigations.working_memory_snapshot` (jsonb) at every state-machine transition	The agent loop
Investigation steps	Append-only rows in `investigation_steps` matching the UI timeline shape (`system`, `action`, `detail`, `type`, `timestamp`)	One investigation, historical	Permanent (soft-delete only)	The UI, the audit log, the operator

Every state transition in §3 does two writes: it updates the working memory snapshot AND appends one or more investigation step rows. The snapshot lets us resume after a crash; the steps let the UI animate in real time and the audit log reconstruct history.

3. Agent loop — state machine¶

                        ┌──────────────┐
                        │   CREATED    │
                        └──────┬───────┘
                               │ start()
                               ▼
                        ┌──────────────┐
         ┌─────────────▶│   PLANNING   │
         │              └──────┬───────┘
         │                     │ next_step == retrieve
         │                     ▼
         │              ┌──────────────┐
         │              │  RETRIEVING  │
         │              └──────┬───────┘
         │                     │
         │                     ▼  (back to planning with new evidence)
         │              ┌──────────────┐
         └──────────────┤   PLANNING   ├─┐
                        └──────┬───────┘ │
                               │          │ next_step == act
                               ▼          ▼
                        ┌──────────────┐
                        │  SAFETY_CHK  │
                        └──────┬───────┘
                               │
                 ┌─────────────┼────────────┐
                 │             │            │
                 │             │            │
            denied          approval       allowed
                 │           required        │
                 ▼             │             ▼
         ┌──────────┐          ▼      ┌─────────────┐
         │ HALTED   │   ┌──────────┐  │  ACTING     │
         └──────────┘   │ WAITING  │  └──────┬──────┘
                        │ APPROVAL │         │
                        └─────┬────┘         ▼
                              │       ┌─────────────┐
                              │       │  OBSERVING  │
                              │       └──────┬──────┘
                              │              │
                              ▼              ▼
                        ┌──────────────┐
                        │   PLANNING   │  (loop)
                        └──────┬───────┘
                               │ verdict_ready
                               ▼
                        ┌──────────────┐
                        │  VERDICT     │
                        └──────┬───────┘
                               │
                               ▼
                        ┌──────────────┐
                        │  WRITING_BACK│
                        └──────┬───────┘
                               │
                               ▼
                        ┌──────────────┐
                        │   DONE       │
                        └──────────────┘

Terminal states¶

State	Meaning
`DONE`	Verdict produced, written back, audit closed. Normal path.
`HALTED`	Safety denial or unrecoverable error. Escalated, audit closed.
`WAITING_APPROVAL`	Paused, waiting on a human. Resumable. Has a TTL (default 24h). On TTL expiry → auto-escalate. Not strictly terminal; an approval moves the investigation back into `ACTING` with the same pending action.
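
The diagram and the table above can be encoded as an explicit transition map so that illegal edges fail loudly. The following is a minimal sketch, not the loop's actual code: the names `State`, `ALLOWED`, and `advance` are hypothetical, the edge out of `WAITING_APPROVAL` follows the table (approval leads to `ACTING`), and the TTL-expiry edge into `HALTED` is an assumption.

```python
from enum import Enum


class State(str, Enum):
    CREATED = "CREATED"
    PLANNING = "PLANNING"
    RETRIEVING = "RETRIEVING"
    SAFETY_CHK = "SAFETY_CHK"
    WAITING_APPROVAL = "WAITING_APPROVAL"
    ACTING = "ACTING"
    OBSERVING = "OBSERVING"
    VERDICT = "VERDICT"
    WRITING_BACK = "WRITING_BACK"
    DONE = "DONE"
    HALTED = "HALTED"


# Edges read off the diagram. Unrecoverable-error edges into HALTED
# from other states are omitted for brevity.
ALLOWED = {
    State.CREATED: {State.PLANNING},
    State.PLANNING: {State.RETRIEVING, State.SAFETY_CHK, State.VERDICT},
    State.RETRIEVING: {State.PLANNING},
    State.SAFETY_CHK: {State.ACTING, State.WAITING_APPROVAL, State.HALTED},
    # Assumption: TTL expiry escalates and ends in HALTED.
    State.WAITING_APPROVAL: {State.ACTING, State.HALTED},
    State.ACTING: {State.OBSERVING},
    State.OBSERVING: {State.PLANNING},
    State.VERDICT: {State.WRITING_BACK},
    State.WRITING_BACK: {State.DONE},
    State.DONE: set(),
    State.HALTED: set(),
}


def advance(current: State, target: State) -> State:
    """Return `target` if the edge is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the map as data makes it easy to assert in tests that `DONE` and `HALTED` have no outgoing edges.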

Resume semantics (after a crash OR after an approval)¶

Because working memory is snapshotted at every transition, resuming is deterministic:

1. Load investigations.working_memory_snapshot for the target investigation
2. Load investigations.status → the last state
3. Re-enter the state machine at that state with the snapshot as input
4. For WAITING_APPROVAL: when the approval lands, the loop re-enters ACTING,
   calls the already-validated (tool_name, args), and proceeds normally
5. For a crash resume: the loop re-enters PLANNING with the last snapshot.
   We *never* replay a non-idempotent action — if the crash happened inside
   ACTING, the audit log tells us whether the action completed
   (`action.applied` event) or not. Completed actions are skipped on resume.

Idempotency requirement on MCP tool authors: every write tool must accept an idempotency_key argument (Gaby generates one per action, keyed to the action's UUID). The connector is responsible for de-duplicating on retry. The contract tests (§12) enforce this for every write tool.

Budget enforcement¶

At every transition, the loop checks:
- tokens_used < budget.tokens
- usd_spent < budget.usd
- wall_clock < budget.max_seconds
- iterations < budget.max_iterations (default 20)

Any breach → verdict failed_budget, escalation. No silent degradation.
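
The four checks can be expressed as one function that reports which limits were hit. This is a sketch under stated assumptions: the field names mirror the identifiers in the text, while the `Budget` and `BudgetState` classes and the `breached` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    tokens: int
    usd: float
    max_seconds: float
    max_iterations: int = 20  # default stated in the text


@dataclass
class BudgetState:
    tokens_used: int = 0
    usd_spent: float = 0.0
    wall_clock: float = 0.0
    iterations: int = 0


def breached(state: BudgetState, budget: Budget) -> list:
    """Return the name of every limit that is no longer strictly under budget."""
    checks = {
        "tokens": state.tokens_used < budget.tokens,
        "usd": state.usd_spent < budget.usd,
        "max_seconds": state.wall_clock < budget.max_seconds,
        "max_iterations": state.iterations < budget.max_iterations,
    }
    return [name for name, ok in checks.items() if not ok]
```

A non-empty result at any transition would map to the `failed_budget` verdict; returning the names (rather than a bare boolean) gives the audit event something specific to record.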

Why homegrown (a reminder)¶

We discussed this in FOUNDATION.md §1.1. The state machine above is ~400 lines of Python on top of the Anthropic/OpenAI SDKs. The reasons not to adopt LangGraph or pydantic-ai at v0.1 are:

Safety must come before every tool call, not as a decorator. Frameworks make this awkward; in a hand-rolled loop it's one function call on the critical path.
Every transition emits an audit event with the full working memory delta. Frameworks' internal state is opaque to us.
Budget enforcement is per-transition, not per-call. Our loop checks every edge; frameworks expose hooks but not guarantees.
We want streaming of summarizer output directly to the UI. Simple from our loop; non-obvious in a framework that wraps the LLM client.

The public interface of the loop is small enough (start, resume, step, state) that a future swap is a week of work, not a rewrite.

Preferred escape hatch: pydantic-ai¶

If the homegrown loop stalls — prompt debugging becomes painful, multi-step branching gets tangled, we reimplement checkpointing — the preferred migration target is pydantic-ai, not LangGraph. Published 2026 benchmarks put pydantic-ai at ~44% lower P95 latency, ~5× fewer errors under load, and ~2.7× lower token consumption versus LangGraph on equivalent agent tasks. It also ships pydantic-graph with durable execution across restarts and first-class human-in-the-loop, which maps cleanly onto our WAITING_APPROVAL state.

Re-evaluation gate: after v0.1 ships, run the eval harness (50+ fixture tickets) against both the homegrown loop and a pydantic-ai port. If the pydantic-ai port is within 10% of the homegrown loop on safety compliance AND is either meaningfully shorter in code or faster on latency, we swap for v0.2.

4. Concurrency model¶

Gaby is I/O-bound: LLM calls, DB queries, MCP round-trips, and HTTP requests to help desks dominate its runtime. asyncio everywhere is the default; threads are only for hard CPU work (embeddings inference if we run it locally, BM25 scoring on large corpora).

4.1 The runtime shape¶

                         ┌────────────────────────┐
                         │    FastAPI app         │
                         │  (uvicorn, 1 process)  │
                         └──────────┬─────────────┘
                                    │
                         same event loop
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
          ┌────────────┐  ┌──────────────┐  ┌─────────────┐
          │ HTTP routes│  │ Worker runner│  │ Chat gateway│
          └────────────┘  └──────┬───────┘  └─────────────┘
                                 │
                                 ▼
                     bounded semaphore (N=8 default)
                                 │
                     ┌───────────┼───────────┐
                     ▼           ▼           ▼
               Investigation  Investigation  Investigation
                 task           task          task
                 (asyncio.Task)  ...           ...

4.2 Key parameters¶

Parameter	SQLite default (v0.1)	Postgres default (v0.2+)	Notes
`uvicorn --workers`	1	2–4	Single process in v0.1 is enough; scale out horizontally in v0.5
Concurrent investigations	8	16	Bounded semaphore; excess tickets queue in the DB
Per-investigation LLM concurrency	1	1	LLM calls inside one investigation are serial — simpler reasoning
MCP subprocess pool size	unbounded	unbounded	One MCP server per connector; each serves many investigations concurrently
Async DB pool	`pool_size=5`, `max_overflow=5`	`pool_size=20`, `max_overflow=10`	SQLite is single-writer so the pool just serializes writes. Postgres default follows the 2026 production pattern of pool=20/overflow=10 for moderate API servers. Remember: total DB connections = workers × (pool_size + max_overflow).
PostgreSQL driver	n/a	`asyncpg`	asyncpg (not psycopg2) for true async; SQLAlchemy configured with `postgresql+asyncpg://`
HTTP client (httpx)	shared	shared	Single `AsyncClient` per process, `limits=httpx.Limits(max_keepalive_connections=40)`
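
The connection formula in the table is worth working through once, because the Postgres defaults multiply quickly. A small sketch (the helper name `total_db_connections` is hypothetical):

```python
def total_db_connections(workers: int, pool_size: int, max_overflow: int) -> int:
    """Worst-case connections opened against the database by all workers."""
    return workers * (pool_size + max_overflow)


# v0.1 SQLite profile: 1 worker, pool_size=5, max_overflow=5
sqlite_total = total_db_connections(1, 5, 5)       # 10

# v0.2+ Postgres profile at the top of the 2-4 worker range
postgres_total = total_db_connections(4, 20, 10)   # 120
```

With 4 workers the Postgres profile can reach 120 connections, which is above PostgreSQL's stock `max_connections` of 100, so either the server limit or the pool sizes would need adjusting at that worker count.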

4.3 In-process vs external worker¶

v0.1 default:           Investigations run inside the FastAPI process, on the same
                        event loop. No Redis. "docker compose up" = 1 container for
                        the app + 1 for the UI + 1 for Postgres (optional).

v0.2 default (scale):   arq worker in a separate container. Same code, different
                        entry point. Switches on `GABY_WORKER_MODE=arq`.

The worker interface is identical in both modes (runner.schedule(investigation)), so the upgrade path is a config flag, not a refactor.

4.4 Backpressure¶

If the investigation semaphore is full, new ticket.new events queue in the DB (tickets.status='queued').
The web UI dashboard shows queue depth and expected wait time (queue_length × avg_inv_duration / concurrent_investigations).
Above a configurable threshold (default 50 queued), Gaby enters degraded mode: it still intakes tickets, but skips the retrieve step for low-priority tickets to drain faster.
We never drop a ticket. The DB is the queue; losing a ticket requires a DB failure, not a process crash.
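
One way to compute the two dashboard signals, assuming queued tickets drain in parallel across the semaphore's slots. The function names and the `slots` parameter are hypothetical; the defaults of 8 slots and 50 queued tickets come from §4.2 and the list above.

```python
def expected_wait_seconds(queue_length: int, avg_inv_duration_s: float, slots: int = 8) -> float:
    """Rough wait for the last queued ticket: work ahead of it divided by parallelism."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return queue_length * avg_inv_duration_s / slots


def in_degraded_mode(queue_length: int, threshold: int = 50) -> bool:
    """Degraded mode starts once the queue grows above the threshold."""
    return queue_length > threshold
```

For example, 16 queued tickets at 30 seconds each across 8 slots gives an estimate of 60 seconds. The estimate ignores priority ordering, so high-priority tickets would in practice wait less.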

4.5 The DB is the queue — event bus clarified¶

To square "in-process event bus" with "never drop a ticket", the actual rule is:

Ticket adapters write tickets(status='queued') and then optionally fire an in-memory notification to wake the worker faster.
The worker runner's main loop is a DB poll: SELECT ... FROM tickets WHERE status='queued' ORDER BY priority DESC, received_at ASC LIMIT 1 FOR UPDATE SKIP LOCKED (the same ordering as the claim in §4.6). SQLite uses BEGIN IMMEDIATE + an app-level mutex in place of FOR UPDATE SKIP LOCKED.
The in-memory notification is a latency optimization, not the source of truth. If it gets dropped (process crash, Python GC pause), the next poll tick catches the ticket.
Poll interval defaults to 2 seconds; notification-driven wakeups cut the effective pickup latency to near zero when the system is loaded.

This makes the system at-least-once: after a crash, we may re-claim a ticket whose investigation was mid-flight. The resume rules in §3 handle that safely because:
- Working memory is snapshotted per transition → we pick up where we left off.
- Every write action carries an idempotency_key → no duplicate side-effects.
- The audit log tells us what was already applied before the crash.

4.6 Ticket claim transaction¶

The claim, written as a single atomic statement:

-- Postgres path ($1 = worker id)
-- The subquery locks one queued row; concurrent workers skip rows that are already locked.
UPDATE tickets
   SET status = 'investigating', claimed_by = $1, claimed_at = now()
 WHERE id = (
         SELECT id FROM tickets
          WHERE status = 'queued'
          ORDER BY priority DESC, received_at ASC
          LIMIT 1
          FOR UPDATE SKIP LOCKED
       )
RETURNING id, workspace_id, body;

For SQLite we fall back to a single-writer strategy: one claim task, protected by a process-wide asyncio.Lock, using BEGIN IMMEDIATE to acquire the DB writer lock before the SELECT + UPDATE. This is fine for v0.1 throughput targets.
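
The SQLite fallback can be sketched with the standard-library `sqlite3` driver. This is an illustration, not the shipped code: `claim_next_ticket` is a hypothetical name, the column set is taken from the Postgres statement above, and the process-wide `asyncio.Lock` that would wrap this call in the async app is left out.

```python
import sqlite3
from typing import Optional


def claim_next_ticket(conn: sqlite3.Connection, worker_id: str) -> Optional[int]:
    """Move one queued ticket to 'investigating' and return its id, or None.

    `conn` must be opened with isolation_level=None so that sqlite3 does not
    manage transactions itself and the explicit BEGIN IMMEDIATE takes effect.
    """
    conn.execute("BEGIN IMMEDIATE")  # acquire the writer lock before reading
    try:
        row = conn.execute(
            "SELECT id FROM tickets WHERE status = 'queued' "
            "ORDER BY priority DESC, received_at ASC LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE tickets SET status = 'investigating', claimed_by = ?, "
            "claimed_at = datetime('now') WHERE id = ?",
            (worker_id, row[0]),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because `BEGIN IMMEDIATE` takes the write lock up front, no other writer can claim the same row between the SELECT and the UPDATE.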

5. MCP host — connector lifecycle¶

Every connector is an MCP server. Gaby is an MCP host (in MCP parlance) and an MCP client (it calls their tools).

5.1 Spawn strategies¶

Strategy	When	Implementation
stdio subprocess (v0.1 default)	Default for first-party connectors bundled in the image	`asyncio.create_subprocess_exec(...)`; newline-delimited JSON-RPC over pipes as per the MCP stdio transport
Streamable HTTP	Remote / community MCP servers reachable over HTTP	MCP's Streamable HTTP transport — the current spec's HTTP option (superseded the older SSE transport)
in-process	Tiny built-ins that never block (e.g. local filesystem, time)	Direct function calls wearing an MCP-shaped mask

Why stdio default? First-party connectors ship in the Gaby Docker image and are spawned as subprocesses — no network, no auth, lowest latency, simplest failure modes. Streamable HTTP is for remote/community servers where subprocess isn't an option. The official Python MCP SDK supports both transports with the same client interface.

5.2 Lifecycle¶

CONFIGURED ──start──▶ LAUNCHING ──handshake──▶ READY ──▶ BUSY ──▶ READY
                          │                      │         │        │
                          └─fail─▶ CRASHED       │         ▼        │
                                     │           │     TIMEOUT      │
                                     ▼           │         │        ▼
                                 RESTARTING ◀────┴─────────┴───── SHUTDOWN
                                     │
                                     ▼
                                 READY (or DEGRADED after N failures)

Handshake = MCP initialize + tools/list. The tool list is cached per connector version.
Crash recovery: exponential backoff, max 5 restarts in 5 minutes. After that the connector is marked DEGRADED and the UI shows a persistent warning. Investigations that would have used this connector either skip it or halt, depending on connector criticality.
Health check: periodic ping every 30s (for HTTP) or "is the subprocess alive?" (for stdio). Surfaced at /health/connectors.
Graceful shutdown: SIGTERM → wait for in-flight tool calls to finish → SIGKILL after 10s.
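
The crash-recovery rule (exponential backoff, at most 5 restarts in 5 minutes, then DEGRADED) can be sketched as a sliding-window limiter. The class name `RestartPolicy`, the 1-second base delay, and the 60-second cap are assumptions for illustration; only the 5-in-5-minutes limit comes from the text.

```python
from collections import deque
from typing import Optional


class RestartPolicy:
    """Sliding-window restart limiter with exponential backoff."""

    def __init__(self, max_restarts: int = 5, window_s: float = 300.0,
                 base_delay_s: float = 1.0, max_delay_s: float = 60.0) -> None:
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.base_delay_s = base_delay_s    # assumed value
        self.max_delay_s = max_delay_s      # assumed value
        self._restarts = deque()            # timestamps of recent restarts

    def next_delay(self, now: float) -> Optional[float]:
        """Record a crash at `now`; return the backoff delay, or None for DEGRADED."""
        # Forget restarts that have fallen out of the window.
        while self._restarts and now - self._restarts[0] > self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return None
        delay = min(self.base_delay_s * 2 ** len(self._restarts), self.max_delay_s)
        self._restarts.append(now)
        return delay
```

The supervisor would call `next_delay(time.monotonic())` on each crash, sleep for the returned delay before relaunching, and mark the connector DEGRADED when it gets `None`.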

5.3 Tool scope declaration — the contract every connector must satisfy¶

Every connector declares its capabilities in a machine-readable form that Gaby trusts for authorization decisions:

# connector tool manifest (returned by tools/list + scope extension)
tools:
  - name: query_users
    scope: read
    description: "Look up user by email or ID"
    args: [{ name: email, type: string }]
  - name: reset_password
    scope: write
    dangerous: true
    requires_approval_above_autonomy: propose   # auto-approves only when autonomy=act
    description: "Trigger a password reset"
    args: [{ name: user_id, type: string }]
  - name: delete_user
    scope: write
    dangerous: true
    forbidden_in_autonomy: [investigate, propose]   # only allowed in autonomy=act
    description: "Permanently delete a user"
    args: [{ name: user_id, type: string }]

Contract tests (see §12) verify every first-party connector declares these fields. Community connectors that don't are flagged UNSAFE in the UI and cannot be moved out of investigate autonomy.

5.4 Manifest versioning and cache invalidation¶

Each connector declares a manifest_version (semver) in its initialize response. Gaby stores the last-seen version per connector. On restart, if the version changed, the tool list cache is invalidated and a new tools/list is performed. The manifest hash is also written to the audit log so historical investigations reference a specific immutable version of the tool set.

5.5 Idempotency keys for write tools¶

Every write tool must accept idempotency_key: string as an argument. Gaby generates one per action UUID and passes it automatically. Connectors use it to de-duplicate on retry after a crash or transient failure. This requirement is enforced by the contract tests.
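
On the connector side, de-duplication can be as simple as remembering the result for each key. The sketch below is in-memory only and uses hypothetical names (`IdempotentExecutor`, `run`); a real connector would need to persist keys durably so that they survive its own restarts.

```python
import uuid
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Connector-side de-duplication keyed on idempotency_key (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run(self, idempotency_key: str, apply: Callable[[], Any]) -> Any:
        if idempotency_key in self._results:
            # Retry of an action that already ran: replay the stored result.
            return self._results[idempotency_key]
        result = apply()
        self._results[idempotency_key] = result
        return result


# Usage: the same key applied twice triggers the side effect only once.
resets = []
executor = IdempotentExecutor()
key = str(uuid.uuid4())
first = executor.run(key, lambda: resets.append("user-42") or "reset-sent")
second = executor.run(key, lambda: resets.append("user-42") or "reset-sent")
```

After both calls, `resets` holds a single entry and `first == second`, which is the behaviour a retry after a crash relies on.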

6. Safety pipeline — the thing that cannot break¶

This is the single most important subsystem. Every non-read action passes through it, in order:

        action (tool_name, args, connector_id)
                       │
                       ▼
        ┌──────────────────────────────┐
        │ 1. SCOPE CHECK               │   evaluate(action, connector.scopes,
        │                              │             ticket.workspace_id,
        │                              │             persona.autonomy_level)
        └──────┬───────────────────────┘
               │ denied ──────────────────▶ AUDIT(denied) → raise PermissionError
               │ allowed
               ▼
        ┌──────────────────────────────┐
        │ 2. REDACTION                 │   strip PII from any string args per
        │                              │   workspace.compliance_profile (HIPAA, SOC2…)
        └──────┬───────────────────────┘
               │
               ▼
        ┌──────────────────────────────┐
        │ 3. DRY-RUN DECISION          │   dry_run = (not approved) AND
        │                              │            (autonomy ≠ act OR tool.dangerous)
        └──────┬───────────────────────┘
               │
        ┌──────┴──────┐
        │             │
     dry_run       real
        │             │
        ▼             ▼
  simulate()      mcp_host.call()
        │             │
        └──────┬──────┘
               ▼
        ┌──────────────────────────────┐
        │ 4. AUDIT                     │   append_hash_chained(
        │                              │     actor, action, result, ts)
        └──────┬───────────────────────┘
               │
               ▼
        return result to agent loop

6.1 Scope DSL (sketch)¶

Scopes are declarative, per-connector. There are exactly two lanes — read and write. Dry-run is not a scope lane; it is a runtime decision made at step 3 of the safety pipeline (see §6 diagram) and implemented by the connector when its tool manifest sets supports_dry_run=true. See docs/decisions/2026-04-15-dry-run-not-a-scope-lane.md for the rationale.

connector: m365
scopes:
  read:
    allow: ["users/*", "mailboxes/*"]
  write:
    allow: ["users/{id}/reset_password"]
    deny:  ["users/{id}/delete"]

The scope checker resolves action.tool against these globs plus the tool manifest's scope field. Denies beat allows. Everything not explicitly allowed is denied.
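
The deny-beats-allow and default-deny rules can be sketched with the standard library's glob matcher. Assumptions are labelled: how `action.tool` maps to a resource path such as `users/42/reset_password` is not specified here, `{id}`-style placeholders are treated as wildcards, and `fnmatch`'s `*` also matches `/`, which is looser than a production matcher would likely want.

```python
import re
from fnmatch import fnmatchcase


def _to_glob(pattern: str) -> str:
    """Turn `{id}`-style placeholders into `*` wildcards (assumed semantics)."""
    return re.sub(r"\{[^}/]+\}", "*", pattern)


def is_allowed(resource: str, lane: dict) -> bool:
    """Deny rules win; anything not explicitly allowed is denied."""
    deny = [_to_glob(p) for p in lane.get("deny", [])]
    allow = [_to_glob(p) for p in lane.get("allow", [])]
    if any(fnmatchcase(resource, p) for p in deny):
        return False
    return any(fnmatchcase(resource, p) for p in allow)


# The m365 example from above, as a Python dict.
scopes = {
    "read": {"allow": ["users/*", "mailboxes/*"]},
    "write": {
        "allow": ["users/{id}/reset_password"],
        "deny": ["users/{id}/delete"],
    },
}
```

With these scopes, `users/42/reset_password` passes the write lane, `users/42/delete` is rejected by the deny rule, and `users/42/disable` is rejected because nothing allows it.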

6.2 Audit log — hash-chained, append-only¶

Every entry:

{
  "id": <uuid7>,
  "workspace_id": ...,
  "ts": <monotonic_wall>,
  "actor_kind": "agent" | "user" | "system",
  "actor_id": ...,
  "event": "action.applied" | "action.denied" | "approval.granted" | ...,
  "payload": { ... action + result snapshot ... },
  "prev_hash": <sha256 of previous row>,
  "hash":      <sha256(prev_hash || canonical_json(this row without 'hash'))>
}

Verification: a background task re-walks the chain daily and alerts on any mismatch.
SIEM export (EE feature): tail the chain to Splunk / Sumo / Elastic via a pluggable exporter.
Why not a separate append-only database (e.g. QLDB, immudb)? Added operational complexity. A hash-chained table in the same Postgres, with row-level ACLs and no UPDATE / DELETE grants, gives 95% of the guarantee for 5% of the complexity. EE customers who need stronger guarantees can pipe to a dedicated store.
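
The hash rule from the entry format above, `sha256(prev_hash || canonical_json(row without 'hash'))`, can be sketched as follows. The all-zero genesis hash and the sorted-keys, compact-separator form of `canonical_json` are assumptions; the function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed prev_hash for the first row in a chain


def canonical_json(row: Dict) -> bytes:
    """Deterministic serialization: sorted keys, no insignificant whitespace."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _digest(prev_hash: str, body: Dict) -> str:
    return hashlib.sha256(prev_hash.encode("ascii") + canonical_json(body)).hexdigest()


def seal(row: Dict, prev_hash: str) -> Dict:
    """Return a copy of `row` with prev_hash and hash filled in."""
    body = {**row, "prev_hash": prev_hash}
    return {**body, "hash": _digest(prev_hash, body)}


def verify_chain(rows: List[Dict]) -> bool:
    """Re-walk the chain and check every link and every hash."""
    prev = GENESIS
    for row in rows:
        body = {k: v for k, v in row.items() if k != "hash"}
        if row["prev_hash"] != prev or row["hash"] != _digest(prev, body):
            return False
        prev = row["hash"]
    return True
```

Editing any field of an earlier row changes its hash, which breaks the `prev_hash` link of every later row; that is what the daily verification walk would detect.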

6.3 The four autonomy levels (one more than SPEC.md §6.5)¶

Level	What the agent does	When to use
`off`	Gaby does nothing. Tickets are ingested but not investigated.	Maintenance mode / legal hold.
`investigate`	Gaby reads, retrieves, queries. Never writes. Produces a summary for humans.	First week of deployment. Read-only SRE connectors.
`propose`	Gaby drafts the fix. Every write action goes to the approval queue.	Default for most non-trivial deployments.
`act`	Gaby executes writes itself, with dry-run + audit + rollback. Still respects `dangerous`/`forbidden` flags on tools.	Mature deployments with well-understood playbooks.

Autonomy is set per connector, per workspace. A single investigation can include act calls to Redis and propose calls to Stripe.

7. Knowledge subsystem — retrieval with citations¶

7.1 Pipeline¶

source           chunker             embedder            store             retrieval
-----            -------             --------            -----             ---------
git repo         token-aware         provider-agnostic   sqlite-vec        hybrid (BM25 + vector)
dir walker       Markdown/code-aware (pluggable)         or pgvector       + cross-encoder rerank
confluence       respects headings                       + FTS5 / tsvector + top-k=6 default
notion                                                                     + explicit citations
pdf
url crawler
past tickets

Vector store scale cliff — plan for the migration¶

sqlite-vec uses brute-force search (no ANN index). This is fine for v0.1 — a founder's runbook folder is hundreds, maybe low thousands, of chunks — but it does not scale to large corpora. The migration triggers:

Corpus size	Recommendation
< 5,000 chunks	sqlite-vec (brute-force is fast enough, <20 ms queries)
5,000 – 50,000	Evaluate vectorlite (sqlite ANN, ~3–30× faster than sqlite-vec on the same hardware)
> 50,000	Switch to the Postgres profile, use pgvector with HNSW indexes

All three present the same VectorStore protocol, so the migration is a config flag + a background reindex, not a rewrite. The documents table schema stores embedding as raw BLOB (float32 × dim) so the underlying index implementation is swappable.
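
Storing the embedding as a raw float32 BLOB amounts to packing `dim` 4-byte floats back to back. A minimal sketch using the standard library; little-endian byte order and the helper names are assumptions, since the text only specifies float32 × dim.

```python
import struct
from typing import List


def pack_embedding(vec: List[float]) -> bytes:
    """Serialize a vector as little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob: bytes) -> List[float]:
    """Inverse of pack_embedding; the dimension is implied by the blob length."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))
```

Because the dimension is recovered from the blob length, the same column can hold 1536-dim and 512-dim vectors, which is what keeps the schema dim-agnostic.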

Embedding model default: we start with a provider-agnostic choice — text-embedding-3-small (OpenAI, 1536 dim) for BYOK users on OpenAI, voyage-3-lite (Voyage AI, 512 dim) for Anthropic-leaning deployments, or a local BGE model via sentence-transformers for air-gapped installs. The schema is dim-agnostic; changing models triggers a background reindex.

7.2 Chunker rules (§)¶

Markdown: split on top-level headings first, then H2, then 800-token soft max.
Code (source files): split per function/class; never split mid-function.
PDF: page-aware; no cross-page chunks unless a heading continues.
Chunk metadata carries: source_uri, headings_path, line_range, content_hash.

7.3 Retrieval¶

Query rewrite (optional, cheap model): turn the ticket title + body into 1–3 search queries.
Parallel retrieval: BM25 top-20 ∥ vector top-20.
Reciprocal-rank fusion → top-20 hybrid candidates.
Cross-encoder rerank (cheap model or a small local model) → top-6.
Attach to working memory with explicit citations ([doc:uri#headings#L12-L30]).
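
The fusion step above can be sketched in a few lines. Reciprocal-rank fusion scores each document as the sum of `1 / (k + rank)` over the rankings it appears in; the constant `k = 60` is the conventional default and is an assumption here, as are the function name and the tie-break on document id.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60, top_n: int = 20) -> List[str]:
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Here `fused` is `["b", "a", "d", "c"]`: documents that appear in both lists outrank those that appear in only one, even when the single-list hit ranked higher.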

7.4 Citations in output¶

Every claim in the final summary must end with a citation token. Unsourced claims are re-queried or explicitly disclaimed:

"This user's Authenticator was tied to the old iPhone [kb://runbooks/mfa-lockout#L45-L58] and the Entra ID sign-in log confirms 7 AADSTS50076 failures [entra://signinlogs#user=kevin.reyes@hartwelllaw.com&window=30m]."

Users can click any citation in the UI to see the source.

7.5 Learning loop¶

When an investigation resolves with auto_resolved verdict AND the operator rates it ≥4/5 (or no one disputes it within 7 days), Gaby stages a new KB candidate (the ticket + the resolution + the tool-call trace) in a review queue. An operator accepts / edits / rejects. Accepted entries become new indexed documents.

No silent learning. Human in the loop, always.

8. LLM gateway¶

8.1 Provider interface¶

from __future__ import annotations  # ChatResult stays a forward reference; it is defined elsewhere

from typing import Literal, Protocol

class LLMProvider(Protocol):
    async def chat(self, messages, *, model, tools=None, max_tokens, temperature,
                   cache_control=None, stream=False) -> ChatResult: ...
    async def embed(self, texts, *, model) -> list[list[float]]: ...
    def supports(self, capability: Literal["tools", "streaming", "cache", "json_mode"]) -> bool: ...

Three concrete implementations in v0.1:
- AnthropicProvider (direct anthropic Python SDK): used on hot paths (planner, summarizer)
- OpenAIProvider (direct openai Python SDK): fallback for BYOK
- LiteLLMProvider: wraps ~100 providers for BYOK users who want Azure OpenAI, Bedrock, Vertex, Mistral, local vLLM, etc.

A note on LiteLLM. We use the Python SDK (litellm as a library), not the LiteLLM proxy. The proxy has known production issues in 2026 (GIL-bound throughput, DB logging degradation, SSO gated behind paid tier) and was compromised in a PyPI supply-chain attack in March 2026 (versions 1.82.7 and 1.82.8). Mitigations:

Pin litellm to a known-good version range in uv.lock and update deliberately.
All SDK installs go through PyPI with hashes verified at install time.
Hot paths (planner, tool_selector, verdict, summarizer) bypass LiteLLM entirely and use the direct Anthropic/OpenAI SDKs.
LiteLLM only sees BYOK-only providers (Bedrock, Vertex, Azure, local vLLM) where its breadth is the value.

If BYOK volume becomes a real production load, Bifrost (Apache 2.0, Go-based, ~10μs overhead) and Portkey are the v0.2+ evaluation targets.

8.2 Prompt caching¶

Anthropic (2026 rules). Cache scope is the whole prefix up to the cache breakpoint, in request order: tools → system → messages. Cache reads are billed at ~0.1× base input price, 5-minute writes at 1.25×, 1-hour writes at 2×. Max 4 breakpoints per request. Minimum cacheable block size: 1,024 tokens on Sonnet, 4,096 tokens on Haiku (so short prompts never benefit on Haiku). Cache TTL defaults to 5 minutes, refreshed on every hit — which fits perfectly inside a typical investigation that spans seconds to minutes. As of 2026-02-05 caches are workspace-isolated (not org-wide), so multi-workspace deployments get proper separation for free.

We place our 4 breakpoints as follows:

#	Block	Lifetime	Why
1	Tool manifest (the MCP tool list, serialized)	Connector version	Changes only when a connector updates. Near-permanent cache hits.
2	System prompt (persona-specific instructions + safety rules)	Persona version	Changes rarely. Big win on every planner / verdict call.
3	Retrieved KB chunks for this investigation	Investigation lifetime	Same chunks are re-sent with each planner turn; the 5-min TTL keeps them hot.
4	Accumulated `messages` (ticket + prior tool calls)	Investigation lifetime	The accumulator. Every new turn extends past the last breakpoint.

Below 1,024 tokens on Sonnet (or 4,096 on Haiku) the breakpoint is a no-op and the call pays the full input price — a minor inefficiency, never a bug.
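
To make the breakpoint placement concrete, here is a sketch that builds the request payload with four `cache_control` markers. It only assembles a dictionary and makes no API call. Placing the KB chunks in a second system block is an assumption of this sketch (the table does not say where they live), and `build_cached_request` is a hypothetical helper name.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}  # default 5-minute TTL


def build_cached_request(tools, system_prompt, kb_chunks_text, messages):
    """Return request kwargs carrying the four cache breakpoints.

    Inputs are deep-copied so the caller's objects are never mutated.
    """
    tools = copy.deepcopy(tools)
    messages = copy.deepcopy(messages)

    # Breakpoint 1: marking the last tool caches the whole tool manifest.
    if tools:
        tools[-1]["cache_control"] = dict(EPHEMERAL)

    # Breakpoints 2 and 3: persona prompt, then the retrieved KB chunks.
    system = [
        {"type": "text", "text": system_prompt, "cache_control": dict(EPHEMERAL)},
        {"type": "text", "text": kb_chunks_text, "cache_control": dict(EPHEMERAL)},
    ]

    # Breakpoint 4: last content block of the last message (the accumulator).
    if messages:
        last = messages[-1]
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        last["content"][-1]["cache_control"] = dict(EPHEMERAL)

    return {"tools": tools, "system": system, "messages": messages}
```

Because the cache covers the prefix in request order, anything that changes early (a tool description, the persona prompt) invalidates every later breakpoint, which is why the slowest-changing blocks come first.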

OpenAI. Prompt caching is automatic and server-side, with no API change required and no breakpoints to place. It only applies to prompts of 1,024 tokens or more, so very short prompts see no benefit there either.

Local models. The cache is a no-op but the interface is uniform across providers.

8.3 Budget enforcement¶

Every chat() call passes through a BudgetGuard:

guard.check(investigation_id)  # raises BudgetExceeded before the HTTP call
guard.record(investigation_id, prompt_tokens, completion_tokens, cost_usd)

Budgets are per investigation, set from the persona's profile (default: 50k tokens, $0.50). Breaches halt the investigation and escalate.
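
A minimal in-memory sketch of the guard, assuming the defaults quoted above. The real guard presumably persists usage; the `Usage` dataclass and the dictionary store are illustrative assumptions, and only `check` / `record` mirror the calls shown.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before the HTTP call when an investigation is over budget."""


@dataclass
class Usage:
    tokens: int = 0
    cost_usd: float = 0.0


class BudgetGuard:
    def __init__(self, max_tokens=50_000, max_cost_usd=0.50):
        self.max_tokens = max_tokens
        self.max_cost_usd = max_cost_usd
        self._usage = {}  # investigation_id -> Usage

    def check(self, investigation_id):
        usage = self._usage.get(investigation_id, Usage())
        if usage.tokens >= self.max_tokens or usage.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded(
                f"{investigation_id}: {usage.tokens} tokens, ${usage.cost_usd:.4f}"
            )

    def record(self, investigation_id, prompt_tokens, completion_tokens, cost_usd):
        usage = self._usage.setdefault(investigation_id, Usage())
        usage.tokens += prompt_tokens + completion_tokens
        # cost_usd may be None when the model is missing from the pricing table.
        usage.cost_usd += cost_usd or 0.0
```

Checking before the call means the last call of an investigation can overshoot the budget by at most one response, never by an unbounded amount.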

8.3.1 Cost mapping — where USD comes from¶

Token counts come back from the provider response (usage.input_tokens, usage.output_tokens, usage.cache_creation_input_tokens, usage.cache_read_input_tokens for Anthropic). The tokens→USD conversion uses a single pricing table:

Primary source: the pricing table bundled in litellm (updated by the upstream project regularly).
Override: per-workspace config can set custom per-model rates for BYOK customers with negotiated pricing.
Fallback: if a model isn't in the table, the cost column is NULL and the cost metric isn't incremented for that call. The token metric still is.

This is recorded in the llm_calls table so the cost dashboard (§16) can aggregate by investigation, by workspace, by model, and by purpose.
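
The lookup order (workspace override, then bundled table, then NULL) can be sketched as follows. The model name and per-million-token rates are made up for illustration, and the sketch assumes `input_tokens` excludes the cached tokens, which are reported in their own fields. The 1.25× and 0.1× multipliers are the 5-minute write and read factors from §8.2.

```python
from decimal import Decimal

# Hypothetical rates in USD per million tokens; not real prices.
PRICING = {"example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")}}
CACHE_WRITE_5M = Decimal("1.25")
CACHE_READ = Decimal("0.1")
MILLION = Decimal(1_000_000)


def cost_usd(model, usage, overrides=None):
    """Convert a provider usage dict to USD, or None if the model is unknown."""
    rates = (overrides or {}).get(model) or PRICING.get(model)
    if rates is None:
        return None  # cost column stays NULL; token metrics still count
    rate_in = rates["input"] / MILLION
    rate_out = rates["output"] / MILLION
    return (
        usage.get("input_tokens", 0) * rate_in
        + usage.get("cache_creation_input_tokens", 0) * rate_in * CACHE_WRITE_5M
        + usage.get("cache_read_input_tokens", 0) * rate_in * CACHE_READ
        + usage.get("output_tokens", 0) * rate_out
    )
```

Using `Decimal` keeps per-call costs exact when they are summed across thousands of calls on the dashboard.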

8.4 Model router¶

A 20-line table, not a framework:

ROUTER = {
    "classifier":    "claude-haiku-4-5",
    "verdict":       "claude-haiku-4-5",
    "planner":       "claude-sonnet-4-6",
    "tool_selector": "claude-sonnet-4-6",
    "summarizer":    "claude-sonnet-4-6",
}

Overridable per workspace in config. BYOK users can map these to any provider/model via LITELLM_MODEL_* env vars.
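
A sketch of how a per-workspace override could layer on top of the table. `resolve_model` and the shape of the override mapping are assumptions; the env-var path for BYOK users is not modelled here.

```python
ROUTER = {
    "classifier": "claude-haiku-4-5",
    "verdict": "claude-haiku-4-5",
    "planner": "claude-sonnet-4-6",
    "tool_selector": "claude-sonnet-4-6",
    "summarizer": "claude-sonnet-4-6",
}


def resolve_model(purpose, workspace_overrides=None):
    """Pick the model for a purpose, preferring the workspace's own mapping."""
    overrides = workspace_overrides or {}
    if purpose in overrides:
        return overrides[purpose]
    try:
        return ROUTER[purpose]
    except KeyError:
        # Fail loud: an unknown purpose is a programming error, not a default.
        raise ValueError(f"unknown purpose: {purpose!r}") from None
```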

9. Ticketing adapters — source and sink¶

Every help desk adapter is both a source (new tickets) and a sink (write back results). The base contract:

from datetime import datetime
from typing import Protocol

class TicketAdapter(Protocol):
    async def poll(self, since: datetime) -> list[RawTicket]: ...
    async def subscribe_webhook(self, callback) -> WebhookHandle: ...     # optional
    def normalize(self, raw: RawTicket) -> Ticket: ...
    async def post_reply(self, ticket_id: str, body: str, *, private: bool) -> None: ...
    async def update_status(self, ticket_id: str, status: str) -> None: ...
    async def log_time_entry(self, ticket_id: str, minutes: int, note: str) -> None: ...  # MSP
    def capabilities(self) -> AdapterCapabilities: ...

Adapters in v0.1: Zoho Desk. Adapters in the v0.2-v0.4 window: HaloPSA, Autotask, ConnectWise, Zendesk, Linear, GitHub Issues, Jira SM, Freshdesk, Intercom, email.

Webhooks are preferred when available; polling is the fallback. The poller supports the "since cursor" pattern natively — it stores last_seen_external_id per source in the DB and asks each adapter "give me everything newer than this".

9.1 Canonical `Ticket` model¶

Ticket:
  id:           uuid7
  workspace_id: uuid7
  source_id:    fk → ticket_sources
  external_id:  string     # the source's native ID (ZD-1234, HPS-4871...)
  title:        string
  body:         text
  customer:     string     # free-form, e.g. "Hartwell Law — Kevin Reyes"
  requester_email: string?
  priority:     low | medium | high | critical
  status:       new | queued | investigating | auto_resolved | needs_tech | needs_client | needs_l2 | failed
  sla_at:       datetime?
  received_at:  datetime
  source_metadata: jsonb   # anything the adapter wants to preserve

This maps 1:1 to the existing persona prototypes' ticket shape. Migrations are avoided by keeping the superset.

A React app bundled by Vite in library mode into a single JS file (gaby-widget.js, target <40 KB gzipped).
Mounted into a shadow DOM so the host site's CSS can't leak in or out.
Talks to /api/chat on the Gaby backend via fetch + SSE for streaming replies.
Themable via a single Gaby.init({ theme: { primary: '#0284c7', font: 'Inter' } }) call.

Abuse surface — rate limits and auth¶

The widget is public-facing. It must not become a free LLM token faucet. Controls:

Layer	Limit	Rationale
Per IP	20 messages / minute, 200 / day	Hard cap before any backend work
Per widget session	40 messages total, 15-minute idle timeout	Session-scoped envelope
Per workspace	Configurable daily budget in USD (default $50/day for chat)	Workspace owner sets the ceiling
Challenge	After 3 messages, an invisible Turnstile/hCaptcha challenge	Blocks bots without friction for humans
Auth options	Anonymous (rate-limited), host-provided JWT (verified by shared secret), logged-in user (via host's own auth)	Stricter auth → higher limits

All limits are enforced before any LLM call is made. Rate-limit hits return 429 with a Retry-After header; the widget surfaces "I'm getting a lot of messages right now, please try again in a moment."
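
The per-IP layer can be sketched as a sliding-window counter. This is an in-process illustration only; a multi-backend deployment would need a shared store. The class name and the injectable clock are assumptions made so the behaviour is easy to test.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key):
        """Return (allowed, retry_after_seconds) for one incoming message."""
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            # Time until the oldest hit expires; feeds the Retry-After header.
            return False, self.window - (now - hits[0])
        hits.append(now)
        return True, 0.0


# The per-IP minute limit from the table: 20 messages per 60 seconds.
per_ip_minute = SlidingWindowLimiter(limit=20, window_seconds=60)
```

A rejected call returns the wait time directly, so the HTTP layer can emit the 429 and `Retry-After` without a second lookup.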

10.2 Session lifecycle¶

session.created ──user message──▶ session.active
                                       │
                                       │ (Gaby responds, possibly multiple turns)
                                       │
                                       ▼
                            can_auto_resolve? ──yes──▶ session.resolved
                                       │
                                       no
                                       ▼
                            handoff_requested → session.handoff_pending
                                       │
                         operator accepts
                                       │
                                       ▼
                              session.handoff_active
                                       │
                                       ▼
                            operator closes → session.closed
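
The diagram above translates into a small transition table. The state names come from the diagram; the event names are hypothetical labels chosen for this sketch.

```python
class InvalidTransition(Exception):
    """Raised when an event is not legal in the current session state."""


# (current state, event) -> next state
TRANSITIONS = {
    ("created", "user_message"): "active",
    ("active", "auto_resolved"): "resolved",
    ("active", "handoff_requested"): "handoff_pending",
    ("handoff_pending", "operator_accepted"): "handoff_active",
    ("handoff_active", "operator_closed"): "closed",
}

TERMINAL_STATES = {"resolved", "closed"}


def next_state(state, event):
    """Return the state that follows `event`, or fail loudly."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} is not allowed in {state!r}") from None
```

Keeping the table as data makes it straightforward to assert in tests that every non-terminal state has at least one way out.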

10.3 Handoff bundle¶

When Gaby escalates a chat to a human, the operator receives (in the operator console):

Full transcript so far (both user + Gaby)
Every tool call Gaby made, with arguments and redacted results
Citations used for any KB-backed claims
The current working memory snapshot
A one-sentence "why I couldn't resolve this" from the agent

The operator starts mid-flight, not cold. This is the single biggest satisfaction driver for the human chat surface.

10.4 Slack / Teams¶

Bolt-for-Python for Slack, Bot Framework for Teams.
Same session model, same handoff bundle.
Inbound in Slack is v0.3; v0.1 ships Slack outbound only for escalations.

11. Auth and identity (three surfaces)¶

Surface	Mechanism	Session store
Web UI (operators)	Session cookie (HttpOnly, SameSite=Lax) + CSRF token	`sessions` table
CLI / automation	API key (`gaby-XXXX.YYYY`), prefix + hashed remainder	`api_keys` table
End-user chat widget	Host-provided JWT (verified by a shared key) or anonymous token	`chat_sessions` table
Connector OAuth	Per-connector device-code flow for the ones that support it; API keys for the rest	encrypted `connectors.config`
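
For the API-key row, "prefix + hashed remainder" could look like the sketch below. The hash choice (SHA-256 over a high-entropy random secret), the row shape, and both function names are assumptions; only the `gaby-XXXX.YYYY` format comes from the table.

```python
import hashlib
import hmac
import secrets


def issue_api_key():
    """Create a key; return (plaintext shown once, row to store)."""
    prefix = secrets.token_hex(4)        # stored in clear, used for lookup
    secret = secrets.token_urlsafe(32)   # never stored
    plaintext = f"gaby-{prefix}.{secret}"
    digest = hashlib.sha256(secret.encode()).hexdigest()
    return plaintext, {"prefix": prefix, "secret_sha256": digest}


def verify_api_key(presented, lookup):
    """Check a presented key; `lookup(prefix)` returns the stored row or None."""
    if not presented.startswith("gaby-") or "." not in presented:
        return False
    prefix, _, secret = presented[len("gaby-"):].partition(".")
    row = lookup(prefix)
    if row is None:
        return False
    digest = hashlib.sha256(secret.encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(digest, row["secret_sha256"])
```

The clear-text prefix lets the server find the row with an indexed lookup, while a leaked `api_keys` table does not reveal usable keys.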

First-run bootstrap¶

On first boot, Gaby generates a one-time admin provisioning URL printed to stdout (and to a file if running headless). Opening it creates the first admin user. This URL expires in 15 minutes. After first use, Gaby refuses to issue another unless the DB is wiped — no silent "admin/admin" defaults.

SSO / SAML / SCIM¶

Enterprise Edition feature. Implemented via authlib + SAML2, behind a feature flag keyed to the license.

12. Connector contract — the testable promise¶

Every connector must pass these at pytest time:

Test	Assertion
`test_initialize`	Responds to `initialize` MCP request within 2s
`test_tools_list`	Returns a tool list with every tool carrying `scope` and `description`
`test_tool_scopes_wellformed`	Every tool's `scope` ∈
`test_dangerous_flagged`	Any destructive tool has `dangerous: true`
`test_dry_run_supported`	Every write tool supports a `dry_run=true` argument
`test_healthcheck`	Responds to the healthcheck tool
`test_redaction_noleak`	Tool results never echo back secrets passed in args (paranoia check)
`test_large_result_truncated`	Results over 100 KB are truncated (or paged) with a truncation marker

Contract tests live under connectors/_contract/ and are re-run against every first-party and community connector in CI.

13. Error handling philosophy¶

One rule: fail loud, fail early, degrade only after explicit design.

Category	Handling
Transient (network blips, 5xx)	Retry with exponential backoff + jitter, max 3 attempts, circuit breaker per endpoint
Permanent (4xx, auth expired)	No retry. Investigation enters `needs_tech` with a specific error.
Budget exceeded	Investigation → `failed_budget`, escalate.
Scope denied	Tool call never runs. Audit as `action.denied`. Agent plans an alternative.
LLM refuses or returns garbage	Retry once with a clarifying instruction. Then escalate.
MCP connector crash mid-call	Investigation pauses, connector is restarted, call retried once. Then escalate.
Unrecoverable internal bug	Investigation → `failed`, full stack trace in audit, operator notified.

No bare except: anywhere in the codebase. Enforced by ruff's built-in bare-except rule (E722).
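
The transient-error row of the table can be sketched as below. It covers retry with exponential backoff and full jitter only; the per-endpoint circuit breaker is left out. `TransientError`, the delay constants, and the injectable `sleep` are assumptions for illustration.

```python
import asyncio
import random


class TransientError(Exception):
    """Network blips and 5xx responses; anything else is not retried."""


async def retry_transient(call, *, max_attempts=3, base_delay=0.5,
                          max_delay=8.0, sleep=asyncio.sleep):
    """Await `call()` up to `max_attempts` times, backing off between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller escalate
            # Full jitter: a random delay up to the exponential ceiling.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(random.uniform(0, ceiling))
```

Only `TransientError` is caught, so permanent failures (4xx, expired auth) propagate on the first attempt as the table requires.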

14. Caching layers¶

Cache	Scope	TTL	Invalidation trigger
LLM prompt cache	Per Anthropic cache key	Anthropic-managed (~5 min)	Automatic
Tool manifest cache	Per connector version	process lifetime	Connector restart
Retrieval result cache	Per `(query_hash, corpus_version)`	10 min	KB re-ingest bumps `corpus_version`
Embedding cache	Per `(text_hash, model)`	permanent	Model change invalidates by key
Session ticket cache	Per ticket_id, within one investigation	investigation lifetime	End of investigation
OpenAPI doc cache	Per build	process lifetime	Rebuild
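
As an illustration of the retrieval result cache row, here is a TTL cache keyed on `(query_hash, corpus_version)`. The class name, the in-process dictionary, and the injectable clock are assumptions; the 10-minute TTL and the key shape come from the table.

```python
import hashlib
import time


class RetrievalCache:
    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (query_hash, corpus_version) -> (stored_at, value)

    @staticmethod
    def _key(query, corpus_version):
        query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return (query_hash, corpus_version)

    def get(self, query, corpus_version):
        key = self._key(query, corpus_version)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: evict lazily on read
            return None
        return value

    def put(self, query, corpus_version, value):
        key = self._key(query, corpus_version)
        self._entries[key] = (self._clock(), value)
```

Because `corpus_version` is part of the key, a KB re-ingest never has to walk the cache: old entries simply stop being addressed and age out.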

15. Deployment topologies¶

15.1 Founder quickstart (v0.1 default)¶

┌────────────────────────────────┐
│      docker compose up         │
│                                │
│  ┌──────────┐   ┌──────────┐   │
│  │  gaby    │   │  gaby    │   │
│  │ backend  │◀─▶│   web    │   │
│  │  + SQLite│   │  static  │   │
│  └──────────┘   └──────────┘   │
│         ▲                      │
│         │                      │
│  ┌──────┴───────┐              │
│  │  MCP servers │              │
│  │  (subproc)   │              │
│  └──────────────┘              │
└────────────────────────────────┘

2 containers, embedded SQLite, in-process worker
Total RAM: ~512 MB
Total persistent volumes: 1 (the SQLite DB + KB index)

15.2 MSP / scaled (v0.2)¶

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  gaby       │   │  gaby       │   │  gaby       │
│  backend    │   │  backend    │   │  worker     │
│             │   │             │   │   (arq)     │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └────────┬────────┴────────┬────────┘
                │                 │
                ▼                 ▼
         ┌─────────────┐   ┌─────────────┐
         │  Postgres   │   │   Redis     │
         │  + pgvector │   │             │
         └─────────────┘   └─────────────┘

15.3 Enterprise / air-gapped (EE feature, later)¶

Air-gapped registry (images mirrored)
External secrets provider required
OIDC/SAML enforced
SIEM export of audit log enabled
Local LLM (vLLM or Ollama) for data-sensitive workspaces

16. Observability contracts¶

Signal	Carrier	Required fields
Log	structlog → JSON stdout	`ts`, `level`, `service`, `request_id`, `workspace_id`, `investigation_id?`, `event`, payload
Trace span	OTel span	same attributes as logs + `span.kind`, `span.status`
Metric	Prometheus counter/histogram	`gaby_*` prefix, labels: `workspace`, `connector`, `model`, `outcome`

Canonical metrics (v0.1 must emit these):

Metric	Type	Why
`gaby_investigations_total`	counter	Throughput
`gaby_investigations_duration_seconds`	histogram	Latency p50/p95/p99
`gaby_investigations_verdict_total`	counter	By verdict label — auto-resolution %
`gaby_llm_tokens_total`	counter	By model + by purpose (planner/verdict)
`gaby_llm_cost_usd_total`	counter	By model
`gaby_connector_calls_total`	counter	By connector + by tool + by status
`gaby_connector_health`	gauge	0/1 per connector
`gaby_safety_denials_total`	counter	Every denial is visible
`gaby_approvals_pending`	gauge	Operator queue depth
`gaby_chat_sessions_total`	counter	By channel (widget/slack/teams), by outcome
`gaby_chat_handoffs_total`	counter	Gaby → human takeovers
`gaby_rate_limit_rejections_total`	counter	By surface (widget/api)
`gaby_ticket_queue_depth`	gauge	DB poll result
`gaby_retrieval_hit_rate`	gauge	Share of retrieved chunks ultimately cited in the summary

Every metric above ships with a Grafana dashboard JSON in docs/operations/dashboards/.

17. Failure modes — explicit list¶

Failure	Detection	Response
LLM provider down	HTTP error from provider	Router tries fallback provider; if none, escalate
LLM budget exhausted	`BudgetGuard` pre-check	Investigation halts with `failed_budget`
Connector subprocess crash	`asyncio.subprocess.Process.returncode`	Supervised restart; after 5 failures → DEGRADED
Connector OAuth expired	401/403 from tool call	Pause investigation, send re-auth link to admin via escalation
Help desk webhook delivery failure	Gap between received webhooks and the poll cursor	Poller fallback closes the gap
Database unavailable	SQLAlchemy pool exhaustion	API returns 503; backoff; alert
Redis unavailable (when running arq)	redis-py ConnectionError	Worker pauses; in-process fallback if enabled
Vector index corruption	Query returns 0 results when FTS returns >0	Automatic reindex on next KB sync; alert
Disk full	SQLite write error	API returns 503; alert
Malicious PII in ticket body	Redaction rule	Redact before LLM, record the original in encrypted column
Runaway agent loop	`max_iterations` cap (20)	Escalate with `failed_budget`

18. Open architectural questions (non-blocking for v0.1)¶

These are worth watching but don't hold v0.1:

Streaming the investigation timeline to the UI over SSE vs WebSocket vs polling? Plan: SSE (simpler infra, one-way fits the model).
Multi-region replication for the managed cloud. Defer to v0.5.
Reasoning models for the planner. Worth evaluating on the eval harness before v0.2.
Embedded vLLM for privacy-sensitive workspaces. Planned for v0.3 alongside the EE air-gapped mode.

Previously here: "Cross-investigation memory — defer to v0.3." Now designed — see §22 Memory hierarchy. Long-term memory starts accumulating on day one of v0.1 via SQLiteMemoryGraph.

19. Cross-document index¶

SPEC.md §6.5 — Safety model. This doc §6 is the implementation.
SPEC.md §6.6 — Ticket sources. This doc §9 is the adapter contract.
SPEC.md §6.4 — LLM layer. This doc §8 is the implementation.
FOUNDATION.md §1.1 — Stack choices. This doc §4, §8, §12 are the consequences.
FOUNDATION.md §3 — Data model. This doc §2, §6 are how they're used at runtime.

20. Definition of done for this document¶

This doc is "done enough to build v0.1 against" when:

[x] Every component in the System Map (§1) has its own section in this doc
[x] The agent loop state machine (§3) is explicit, with terminal states and budget rules
[x] The safety pipeline (§6) has a numbered order of operations and a scope DSL sketch
[x] Every failure mode we can think of today is listed (§17)
[x] Observability and metrics are concrete (§16)
[x] The v0.1 deployment topology is drawn (§15.1)
[x] Cross-links to SPEC.md and FOUNDATION.md are in place (§19)
[x] Self-critique pass completed (added §2.1, §4.5–4.6, resume semantics in §3, §5.4–5.5, rate limits in §10.1, cost mapping in §8.3.1, classifier wiring in §2, extra metrics in §16)
[x] Online reference validation pass completed — see §21

21. Reference validation — pass dated 2026-04-11¶

Every load-bearing technology choice was verified against current (2026) online references during architecture review. Summary:

Area	Finding	Architecture impact
MCP transports	Spec 2025-03-26 introduced Streamable HTTP as the remote transport, superseding SSE. Connection recovery via `Last-Event-ID` header. Python SDK supports both stdio and Streamable HTTP with a unified client.	§5.1 locked: stdio subprocess for first-party in-image connectors, Streamable HTTP for remote/community servers. Confirmed current.
Agent frameworks	pydantic-ai shows meaningful advantages over LangGraph in 2026 benchmarks (~44% lower P95, ~5× fewer errors, ~2.7× lower tokens). Ships `pydantic-graph` with durable execution and HITL that map onto our `WAITING_APPROVAL` state.	§3: homegrown loop is still the v0.1 choice (safety on the critical path); pydantic-ai is the locked fallback if homegrown stalls, with a re-eval gate after v0.1 ships.
Anthropic prompt caching	Order tools → system → messages. Sonnet min 1,024 tokens / Haiku min 4,096. Max 4 breakpoints per request. 5-min default TTL, 1-hour at 2× base, cache reads at 0.1×. Workspace-level isolation since 2026-02-05.	§8.2: the 4 breakpoints are now explicit (tools, system, KB chunks, messages) with the minimum-size caveats called out.
LiteLLM	Proxy has production issues (GIL throughput, DB logging degradation). March 2026 supply-chain attack on PyPI versions 1.82.7 and 1.82.8. 800+ open issues. Bifrost and Portkey are the production-grade alternatives.	§8.1: we use the SDK only, not the proxy; hot paths go direct to Anthropic/OpenAI; pin versions; install with hash verification; Bifrost/Portkey noted as v0.2+ migration targets if BYOK volume warrants.
sqlite-vec	Mozilla Builders project, pure C, production-stable, but brute-force search only (no ANN). Fine for small KBs, does not scale. vectorlite is 3–30× faster with ANN; pgvector HNSW is the Postgres story.	§7.1: added a scale cliff table and explicit migration triggers at 5k and 50k chunks. Embedding blob schema is dim-agnostic to keep migration cheap.
uv	Production/Stable status on PyPI. Community consensus in 2026 trending heavily toward uv. 10–100× faster than pip. Drop-in replacement.	`FOUNDATION.md §1.1` locked uv — confirmed current.
FastAPI + SQLAlchemy 2 async	Modern production default is `pool_size=20`, `max_overflow=10` for Postgres (not 5). Use `asyncpg` driver. Dependency-injected session lifecycle.	§4.2 updated: defaults split between SQLite (small) and Postgres (larger) profiles.
Tailwind 4 + shadcn/ui + React 19 + Vite + RR7	Fully compatible and production-ready. Migration notes: use `data-slot` attribute, `React.ComponentProps` instead of `forwardRef`. `@tailwindcss/vite` plugin is the install path.	`FOUNDATION.md §1.2` locked — confirmed current.
Biome	10–25× faster than ESLint+Prettier; covers ~80% of ESLint rules; doesn't support `eslint-plugin-react-hooks` (type-aware rules require the TS language service).	`FOUNDATION.md §1.2` needs a note: Biome primary, keep ESLint running for `react-hooks` only until Biome closes the gap. Added to the plan.
LLM eval tooling	promptfoo is used by Anthropic and OpenAI themselves for prompt regression; Inspect AI (UK AISI) is specifically designed for agent evaluation with tool calls and model-graded rubrics.	`FOUNDATION.md §4.1` updated direction: promptfoo in v0.1 for prompt regression; Inspect AI evaluated for v0.2 for full agent evals with tool calls.

What I did not find any reason to change: FastAPI as web framework, asyncio as concurrency model, Alembic for migrations, arq for background jobs (with in-process fallback), structlog for logging, OpenTelemetry for tracing, pnpm for JS, Vite for bundling, Vitest+Playwright for tests, MkDocs Material for docs, MCP Python SDK as the connector protocol library, DCO over CLA for contributions, Apache 2.0 core + commercial EE for licensing.

Next validation pass: after v0.1 ships, re-run this research with the same queries and diff. Anything that has moved >1 major version or lost community traction goes on the v0.2 re-evaluation list.

22. Memory hierarchy — how Gaby gets smarter over time¶

TL;DR. Three memory tiers (short / medium / long) feed the planner through a bounded context envelope on every LLM call. Long-term memory is a graph-shaped model stored behind a MemoryGraph protocol. The v0.1 default backend is SQLite (two tables, recursive CTE traversals). Two opt-in backends ship stubs in Iter 0 and full implementations in Iter 4: Apache AGE (Postgres extension, the recommended graph-native path) and FalkorDBLite (embedded Cypher, Apache 2.0). Long-term memory starts accumulating on day one regardless of backend choice — the data written through the protocol is the data you migrate later.

22.1 The three tiers¶

                     ┌──────────────────────────────────────────────────────┐
                     │                    SHORT-TERM                          │
                     │    scope:    one investigation                        │
                     │    lifetime: seconds–minutes                          │
                     │    storage:  in-proc WorkingMemory + jsonb snapshot   │
                     │    gate:     redaction on the LLM boundary            │
                     │    purpose:  let the loop reason + crash-resume       │
                     └──────────────────────────────────────────────────────┘
                                            ▲
                                            │ feeds upward on success
                                            │ (via kb_candidates staging)
                                            │
                     ┌──────────────────────────────────────────────────────┐
                     │                   MEDIUM-TERM                          │
                     │    scope:    workspace                                │
                     │    lifetime: hours–days                               │
                     │    storage:  TTL queries + in-proc LRUs + jsonb       │
                     │    gate:     automatic (cache-like, not "learning")   │
                     │    purpose:  avoid rework, spot bursts, context-swap  │
                     └──────────────────────────────────────────────────────┘
                                            ▲
                                            │ promoted via human review
                                            │
                     ┌──────────────────────────────────────────────────────┐
                     │                    LONG-TERM                          │
                     │    scope:    workspace                                │
                     │    lifetime: indefinite                               │
                     │    storage:  MemoryGraph (nodes + edges) + documents  │
                     │    gate:     100% human-in-the-loop — no silent writes│
                     │    purpose:  the product getting smarter              │
                     └──────────────────────────────────────────────────────┘

22.2 Short-term (within one investigation)¶

Already designed in §2.1. WorkingMemory { ticket, messages, tool_calls, retrieved_chunks, budget_state } lives in memory during the loop and is snapshotted to investigations.working_memory_snapshot at every state-machine transition so a crash can resume. PII is redacted before anything crosses into an LLM call. Short-term memory is private to its investigation — it is never read by a different investigation.

22.3 Medium-term (cache-like, automatic)¶

Four operational stores. None of them "learn" — they are all caches with hard TTLs. They are written automatically and read by the planner as hints.

Store	Shape	TTL	Purpose
Recent-tickets window	A query on `tickets` (`WHERE received_at > now() - interval '24h'`) — not a new table	24 h rolling	Dedup (same thing just resolved), burst detection ("4 customers just hit the same VPN error — this is an upstream incident, not 4 individual problems")
Connector-result cache	Process-local LRU keyed on `(workspace_id, connector_id, tool_name, canonical(args))`; Redis-backed when `arq` is on	60 s for reads, 0 s for writes	"I just SELECT-ed this users table 15 s ago during this investigation, don't re-query"
Operator session notes	jsonb column on the existing `sessions` table	Session lifetime	"The operator just approved this kind of action — don't re-prompt them for the rest of their session"
KB candidate staging	`kb_candidates` table — entries awaiting human review, visible in the approval queue UI	30 days → auto-archive if unreviewed	The bridge from auto-resolved investigations to long-term KB. Auto-written when verdict = `auto_resolved` and the quality gate passes.
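
For the connector-result cache row, `canonical(args)` needs to give the same key for logically equal arguments. A minimal sketch, assuming tool arguments are JSON-serializable; both function names are illustrative.

```python
import hashlib
import json


def canonical(args):
    """Serialize tool arguments so key order and spacing never matter."""
    return json.dumps(args, sort_keys=True, separators=(",", ":"))


def connector_cache_key(workspace_id, connector_id, tool_name, args):
    """Build the LRU key; the workspace id comes first so tenants never collide."""
    digest = hashlib.sha256(canonical(args).encode("utf-8")).hexdigest()
    return (workspace_id, connector_id, tool_name, digest)
```

Hashing the canonical form keeps keys short even when a tool is called with a large argument payload.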

22.4 Long-term (the graph memory — the product learning layer)¶

Two stores with different access patterns:

Store	What it holds	How it's created	How it's used	How it's forgotten
Verified KB entries (`documents` + `document_chunks` tables from Iter 2's knowledge pipeline)	Runbooks, past-ticket resolutions promoted from `kb_candidates`, manually-added Markdown	Human accepts/edits a candidate through the approval queue UI	Retrieval pipeline in §7 (hybrid BM25 + vector → top-6 with citations)	Explicit delete; stale-content detection after 6 months of non-use
Memory graph (`memory_nodes` + `memory_edges` behind the `MemoryGraph` protocol)	Entities (customers, users, systems, connectors, tickets, investigations, facts, observations, resolutions) and their typed relationships	Operator clicks "Remember this" on an investigation step, OR Gaby proposes after N≥3 similar observations and the operator accepts	Planner envelope at every LLM call — `neighbors(ticket.customer, depth=1)` loads applicable facts and observations	Explicit delete; `status='archived'` for unused nodes after 90 days; GDPR `forget_subject()` for hard compliance removal

22.5 The node label set (POLE+O, domain-adapted)¶

Borrowed from neo4j-labs/agent-memory's POLE+O model (Persons, Objects, Locations, Events, Observations) and adapted to Gaby's domain:

Label	Meaning	Example natural_key
`customer`	A company / client / account receiving support	`customer:hartwell-law`
`user`	An end-user of a customer (the person who opened a ticket)	`user:kevin.reyes@hartwelllaw.com`
`system`	An application, service, or piece of infrastructure	`system:keycloak-prod`, `system:stripe-webhooks`
`connector`	A configured MCP connector instance	`connector:postgres-main`
`ticket`	A canonical ticket (one node per external ticket)	`ticket:zoho:ZD-8891`
`investigation`	An investigation Gaby ran	`investigation:inv_01HXYZ...`
`fact`	An atomic piece of knowledge (Observations in POLE+O)	`fact:hartwell-legacy-pop3`
`observation`	A time-stamped occurrence ("this happened at this time")	`observation:mfa-lockout@kevin@2026-02-20`
`resolution`	A resolution pattern that worked	`resolution:clear-stale-keycloak-sessions`

Labels are not exhaustive — new labels can be added in v0.2+ without a schema migration (they're just a string column), but these nine cover the v0.1 Founder persona completely.

22.6 The typed relations (seven categories)¶

Borrowed from memory-graph/memory-graph's seven-category relationship model. Edges carry a relation column plus a free-form properties jsonb.

Category	Relations
Causal	`CAUSES`, `TRIGGERS`, `LEADS_TO`, `PREVENTS`
Solution	`SOLVES`, `ADDRESSES`, `ALTERNATIVE_TO`, `IMPROVES`
Context	`OCCURS_IN`, `APPLIES_TO`, `WORKS_WITH`, `REQUIRES`
Learning	`BUILDS_ON`, `CONTRADICTS`, `CONFIRMS`
Similarity	`SIMILAR_TO`, `VARIANT_OF`, `RELATED_TO`
Workflow	`FOLLOWS`, `DEPENDS_ON`, `ENABLES`, `BLOCKS`
Quality	`EFFECTIVE_FOR`, `PREFERRED_OVER`, `DEPRECATED_BY`

Relations are an enum in code, but the DB column is a string so v0.2+ can add relations without a migration. Every backend implementation validates relation strings against the enum at write time.
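
One way to keep the enum and the category table in a single place is to generate the enum from the table. This is a sketch of the write-time validation described above; `RELATION_CATEGORIES` and `validate_relation` are hypothetical names.

```python
from enum import Enum

RELATION_CATEGORIES = {
    "causal": ("CAUSES", "TRIGGERS", "LEADS_TO", "PREVENTS"),
    "solution": ("SOLVES", "ADDRESSES", "ALTERNATIVE_TO", "IMPROVES"),
    "context": ("OCCURS_IN", "APPLIES_TO", "WORKS_WITH", "REQUIRES"),
    "learning": ("BUILDS_ON", "CONTRADICTS", "CONFIRMS"),
    "similarity": ("SIMILAR_TO", "VARIANT_OF", "RELATED_TO"),
    "workflow": ("FOLLOWS", "DEPENDS_ON", "ENABLES", "BLOCKS"),
    "quality": ("EFFECTIVE_FOR", "PREFERRED_OVER", "DEPRECATED_BY"),
}

# String-valued enum built from the table, so the DB column stays a plain string.
Relation = Enum(
    "Relation",
    {name: name for names in RELATION_CATEGORIES.values() for name in names},
    type=str,
)


def validate_relation(value):
    """Return the enum member for a relation string, or fail at write time."""
    try:
        return Relation(value)
    except ValueError:
        raise ValueError(f"unknown relation: {value!r}") from None
```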

22.7 The `MemoryGraph` protocol — the contract¶

Every backend implements exactly these 11 methods. The small surface is deliberate: smaller surface = easier plug-and-play + easier verification via the 3-backend round-trip test in Iter 4.

from collections.abc import AsyncIterator
from datetime import datetime
from typing import Literal, Protocol

class MemoryGraph(Protocol):
    # ---- Writes (every call requires workspace_id) ----
    async def upsert_node(
        self,
        workspace_id: WorkspaceId,
        label: str,            # one of the 9 node labels above
        natural_key: str,      # unique within (workspace_id, label)
        properties: dict,
        provenance: Literal["operator", "proposed", "imported"],
        status: Literal["provisional", "active", "archived"] = "provisional",
    ) -> NodeId: ...

    async def upsert_edge(
        self,
        workspace_id: WorkspaceId,
        from_id: NodeId,
        to_id: NodeId,
        relation: str,         # one of the ~25 typed relations above
        weight: float = 1.0,
        properties: dict | None = None,
        observed_at: datetime | None = None,
    ) -> EdgeId: ...

    async def mark_archived(self, workspace_id: WorkspaceId, node_id: NodeId) -> None: ...
    async def forget_subject(
        self, workspace_id: WorkspaceId, subject_natural_key: str
    ) -> ForgetReport: ...

    # ---- Reads ----
    async def get_node(
        self, workspace_id: WorkspaceId, label: str, natural_key: str
    ) -> Node | None: ...

    async def neighbors(
        self,
        workspace_id: WorkspaceId,
        node: NodeId,
        *,
        depth: int = 1,                         # backend may cap at its comfort zone
        relations: list[str] | None = None,
        limit: int = 10,
    ) -> list[tuple[Node, Edge]]: ...

    async def path(
        self,
        workspace_id: WorkspaceId,
        from_id: NodeId,
        to_id: NodeId,
        max_depth: int = 3,
    ) -> list[Edge] | None: ...

    async def query_by_labels(
        self,
        workspace_id: WorkspaceId,
        labels: list[str],
        limit: int = 50,
    ) -> list[Node]: ...

    # ---- Admin ----
    async def healthcheck(self) -> Health: ...

    # ---- Migration — non-negotiable for every backend ----
    async def export_all(
        self, workspace_id: WorkspaceId
    ) -> AsyncIterator[NodeDump | EdgeDump]: ...
    async def import_all(
        self, workspace_id: WorkspaceId, stream: AsyncIterator[NodeDump | EdgeDump]
    ) -> int: ...

The export_all / import_all pair is the migration contract. SQLite → Postgres+AGE, SQLite → FalkorDBLite, and FalkorDBLite → Postgres+AGE are all the same operation: export_all from the source, streamed through import_all on the destination. Iter 4's test suite round-trips the same fixture through all three backends and asserts that the canonicalized dumps are identical (see §22.13).
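The migration contract above can be sketched with a toy in-memory backend. This is an illustration under stated assumptions, not the shipped code: `Row`, `InMemoryGraph`, and `migrate` are hypothetical names, and the nodes-before-edges export order is an assumption rather than something the protocol promises.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for NodeDump | EdgeDump."""
    kind: str  # "node" or "edge"
    key: str


@dataclass
class InMemoryGraph:
    """Toy backend that implements only the migration pair."""
    rows: dict = field(default_factory=dict)  # workspace_id -> list of Row

    async def export_all(self, workspace_id: str) -> AsyncIterator[Row]:
        # Async generator. Assumption: nodes are emitted before edges so an
        # importer never sees an edge whose endpoints are still missing.
        ordered = sorted(
            self.rows.get(workspace_id, []),
            key=lambda r: (r.kind != "node", r.key),
        )
        for row in ordered:
            yield row

    async def import_all(self, workspace_id: str, stream: AsyncIterator[Row]) -> int:
        count = 0
        async for row in stream:
            self.rows.setdefault(workspace_id, []).append(row)
            count += 1
        return count


async def migrate(src, dst, workspace_id: str) -> int:
    """Pipe the source's export stream straight into the destination."""
    return await dst.import_all(workspace_id, src.export_all(workspace_id))


async def _demo():
    src = InMemoryGraph(
        {"ws-1": [Row("edge", "a->b"), Row("node", "b"), Row("node", "a")]}
    )
    dst = InMemoryGraph()
    imported = await migrate(src, dst, "ws-1")
    return imported, dst.rows["ws-1"]


imported, rows = asyncio.run(_demo())
```

Because `migrate` only touches the two protocol methods, the same function would serve any source/destination pair, which is the point of making the pair mandatory for every backend.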

22.8 The three backends shipped in v0.1¶

Backend	Shipped	When to use	Implementation notes
`SQLiteMemoryGraph`	Default in v0.1. Full implementation lands in Iter 0 (tables + protocol surface) and Iter 4 (all methods).	Everyone using `docker compose up` without a profile flag. Works up to low-5-figure node counts with acceptable query latency.	Two tables (`memory_nodes`, `memory_edges`), SQLite FTS5 on node `properties.text` for label-scoped search, and recursive CTEs for `neighbors`, capped at depth 2 (a request with `depth` > 2 raises `DepthNotSupported`).
`PostgresAGEMemoryGraph`	Stub in Iter 0 (so factory + config compile), full implementation in Iter 4.	Users who want graph-native memory from day one AND are happy to run Postgres. Recommended graph-native path.	Uses the Apache AGE extension on the same Postgres instance that holds the rest of Gaby's data. Nodes/edges become AGE vertices/edges. Traversals use Cypher via the AGE `cypher(...)` SQL function. `depth` can go to 5+ without pain.
`FalkorDBLiteMemoryGraph`	Stub in Iter 0, full implementation in Iter 4.	Users who want graph-native memory embedded (no extra container) AND are comfortable with a Beta embedding layer on top of a production-stable engine. Secondary graph-native path.	`falkordblite` Python package (Apache 2.0) spawns a local FalkorDB inside the backend container. Cypher queries via the `falkordb-py` client. Same API in embedded and server modes — "switch to production FalkorDB" is one config change.

Expressly not shipped in v0.1: Neo4j (JVM cost per §22.11), SurrealDB (license status flagged for verification, not confirmed), Cozo (pre-1.0, maintainers don't promise storage compatibility), Kuzu (archived 2025-10-10).
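To make the SQLite row above concrete, here is a minimal sketch of a depth-capped `neighbors` built on a recursive CTE. Only the table names `memory_nodes` and `memory_edges` and the `DepthNotSupported` error come from this document; the column names, the outbound-only traversal, and the `(natural_key, hops)` return shape are assumptions made for illustration.

```python
import sqlite3

MAX_DEPTH = 2  # cap stated for the SQLite backend


class DepthNotSupported(ValueError):
    """Raised when a caller asks for more hops than the backend supports."""


# Column names are assumptions; only the two table names come from the text.
SCHEMA = """
CREATE TABLE memory_nodes (
    id INTEGER PRIMARY KEY, workspace_id TEXT NOT NULL, natural_key TEXT NOT NULL
);
CREATE TABLE memory_edges (
    id INTEGER PRIMARY KEY, workspace_id TEXT NOT NULL,
    from_id INTEGER NOT NULL, to_id INTEGER NOT NULL, relation TEXT NOT NULL
);
"""

NEIGHBORS_SQL = """
WITH RECURSIVE walk(node_id, hops) AS (
    SELECT :seed, 0
    UNION
    SELECT e.to_id, w.hops + 1
    FROM walk AS w
    JOIN memory_edges AS e
      ON e.from_id = w.node_id AND e.workspace_id = :ws
    WHERE w.hops < :depth
)
SELECT n.natural_key, MIN(w.hops) AS min_hops
FROM walk AS w
JOIN memory_nodes AS n
  ON n.id = w.node_id AND n.workspace_id = :ws
WHERE w.hops > 0 AND w.node_id != :seed
GROUP BY n.id
ORDER BY min_hops, n.natural_key
LIMIT :limit
"""


def neighbors(conn, workspace_id, seed, depth=1, limit=10):
    """Return (natural_key, hops) pairs reachable via outbound edges."""
    if depth > MAX_DEPTH:
        raise DepthNotSupported(f"depth={depth} exceeds cap of {MAX_DEPTH}")
    params = {"seed": seed, "ws": workspace_id, "depth": depth, "limit": limit}
    return conn.execute(NEIGHBORS_SQL, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO memory_nodes (id, workspace_id, natural_key) VALUES (?, ?, ?)",
    [(1, "ws-1", "alice"), (2, "ws-1", "acme"), (3, "ws-1", "plan-pro"), (4, "ws-2", "other")],
)
conn.executemany(
    "INSERT INTO memory_edges (workspace_id, from_id, to_id, relation) VALUES (?, ?, ?, ?)",
    [("ws-1", 1, 2, "WORKS_WITH"), ("ws-1", 2, 3, "USES"), ("ws-2", 1, 4, "USES")],
)
```

The workspace filter appears in both the edge join and the node join, so an edge row written under another workspace never contributes to the walk.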

22.9 The planner context envelope¶

The planner does NOT receive all the memory — that would blow the context window and the budget. It receives a bounded envelope assembled at the start of every LLM call:

planner_input = {
    system_prompt,                           ← cache breakpoint 1 (persona + safety)
    tool_manifest,                           ← cache breakpoint 2 (connectors)
    envelope: {
        retrieved_kb_chunks[0..6],            ← long-term vector   (hybrid retrieval)
        applicable_facts[0..10],              ← long-term graph    (MemoryGraph.neighbors)
        recent_similar_tickets[0..3],         ← medium-term query  (tickets table window)
        connector_recent_results,             ← medium-term cache  (in-proc LRU)
    },                                       ← cache breakpoint 3 (stable during investigation)
    working_memory.messages,                  ← short-term (accumulator)
}                                             ← cache breakpoint 4

Breakpoint 3 is where cache economics matter most: the envelope is stable for an entire investigation, so every planner call after the first reads it at 0.1× base price (per Anthropic prompt cache rules in §8.2).
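A rough worked example of that economics, counting only the envelope segment and measuring cost in "base-price token units". The 0.1× read multiplier is the figure quoted above; the 1.25× write multiplier for the first call is an assumption not stated in this section, as is the premise that every call lands while the cache entry is still live. The function name is hypothetical.

```python
def envelope_input_cost(envelope_tokens, planner_calls,
                        read_multiplier=0.1, write_multiplier=1.25):
    """Return (uncached, cached) envelope cost in base-price token units."""
    if planner_calls < 1:
        raise ValueError("planner_calls must be at least 1")
    # Without caching, every call pays full price for the envelope.
    uncached = float(envelope_tokens * planner_calls)
    # With caching, the first call writes the cache entry (assumed 1.25x)...
    first_call = envelope_tokens * write_multiplier
    # ...and each later call reads it at the discounted rate.
    later_calls = envelope_tokens * read_multiplier * (planner_calls - 1)
    return uncached, first_call + later_calls


# A ~5k-token envelope reused across an 8-call investigation.
uncached, cached = envelope_input_cost(envelope_tokens=5_000, planner_calls=8)
```

Under these assumptions the cached path costs about 9,750 units against 40,000 uncached, roughly a quarter, and the gap widens with every additional planner call in the same investigation.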

The envelope's long-term-graph slot is filled by:

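# get_node returns Node | None, while neighbors expects a NodeId.
customer = await memory.get_node(workspace_id, "customer", ticket.customer_natural_key)
facts = (
    await memory.neighbors(
        workspace_id,
        node=customer.id,  # assumption: Node exposes its NodeId as `.id`
        depth=1,
        relations=["HAS_FACT", "PREFERS", "USES", "WORKS_WITH"],
        limit=10,
    )
    if customer is not None
    else []  # unknown customer: leave the slot empty
)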

Bounds (6 knowledge-base chunks, 10 facts, 3 recent tickets) are configurable per workspace, but the defaults are deliberate: they keep the envelope under ~5k tokens in the common case, which is safely below every v0.1 model's context window and inside the cache discount band.
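One way to picture the per-workspace bounds is a small clamp step that runs before the envelope is handed to the planner. This is a sketch, not the actual assembly code: `EnvelopeBounds` and `build_envelope` are hypothetical names, and the connector results are passed through unbounded only because the section gives no limit for that slot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvelopeBounds:
    """Default slot limits quoted in the text; overridable per workspace."""
    kb_chunks: int = 6
    facts: int = 10
    recent_tickets: int = 3


def build_envelope(kb_chunks, facts, recent_tickets, connector_results,
                   bounds=EnvelopeBounds()):
    """Clamp each ranked input list to its slot limit."""
    return {
        "retrieved_kb_chunks": list(kb_chunks)[: bounds.kb_chunks],
        "applicable_facts": list(facts)[: bounds.facts],
        "recent_similar_tickets": list(recent_tickets)[: bounds.recent_tickets],
        # No bound is stated for this slot, so it is passed through as-is.
        "connector_recent_results": list(connector_results),
    }


default_envelope = build_envelope(
    kb_chunks=[f"chunk-{i}" for i in range(20)],
    facts=[f"fact-{i}" for i in range(20)],
    recent_tickets=[f"ticket-{i}" for i in range(20)],
    connector_results=["crm:ok"],
)
tight_envelope = build_envelope(
    ["c1", "c2", "c3"], ["f1", "f2"], ["t1", "t2"], [],
    bounds=EnvelopeBounds(kb_chunks=2, facts=1, recent_tickets=1),
)
```

Slicing assumes the inputs arrive already ranked, so truncation keeps the most relevant items in each slot.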

22.10 Governance — the hard rules¶

No silent writes to long-term memory. Every node that reaches status='active' goes through a human gate. Gaby may write status='provisional' nodes autonomously after N≥3 confirming observations, but only an operator promotes them.
Workspace isolation is enforced at the query layer, not merely documented. Every read, write, and migration method on the protocol (upsert_*, mark_archived, forget_subject, get_node, neighbors, path, query_by_labels, export_all, import_all) takes workspace_id as its first required argument; only healthcheck does not. No "global traversal" method exists; no default-workspace fallback exists. Cross-workspace traversal requires a dedicated admin API that does not exist in v0.1.
PII redaction happens on the way IN. A node or edge stores redacted text. Raw text is preserved only in the encrypted tickets.body_encrypted column for audit.
Forgetting is a first-class operation. forget_subject() purges medium-term buffers AND cascades status='archived' across the graph for the subject AND records an audit event. GDPR compliance is not retrofitted; it's designed in.
Stale content is demoted, not deleted. Nodes unused for the configured threshold (default: facts 90 d, KB 180 d) get status='archived'. Archived nodes are hidden from planner queries but preserved for audit reconstruction.
Provenance is never lost. Every node carries provenance ∈ {operator, proposed, imported}. The planner envelope's display in the UI shows this so operators know which facts came from them vs which Gaby proposed.
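The first rule (autonomous provisional writes after N≥3 observations, operator-only promotion) can be sketched as a small gate. Everything here is hypothetical scaffolding for illustration: `ProposalGate`, `FactNode`, `GovernanceError`, and the `actor_role` parameter are not names from the codebase, and the real implementation would persist through `upsert_node` rather than an in-memory dict.

```python
from __future__ import annotations

from dataclasses import dataclass

MIN_OBSERVATIONS = 3  # the N >= 3 threshold from the first rule


class GovernanceError(PermissionError):
    """Raised when a non-operator tries to activate a node."""


@dataclass
class FactNode:
    natural_key: str
    provenance: str
    status: str = "provisional"


class ProposalGate:
    """Counts observations per (workspace, key) and gates promotion."""

    def __init__(self) -> None:
        self._counts: dict = {}
        self._nodes: dict = {}

    def observe(self, workspace_id: str, natural_key: str) -> FactNode | None:
        key = (workspace_id, natural_key)  # counters never cross workspaces
        self._counts[key] = self._counts.get(key, 0) + 1
        if self._counts[key] >= MIN_OBSERVATIONS and key not in self._nodes:
            # Written autonomously, but only ever as a provisional proposal.
            self._nodes[key] = FactNode(natural_key, provenance="proposed")
        return self._nodes.get(key)

    def promote(self, workspace_id: str, natural_key: str, actor_role: str) -> FactNode:
        if actor_role != "operator":
            raise GovernanceError("only an operator may promote a node to active")
        node = self._nodes[(workspace_id, natural_key)]
        node.status = "active"
        return node


gate = ProposalGate()
first = gate.observe("ws-a", "prefers-email")
second = gate.observe("ws-a", "prefers-email")
other_ws = gate.observe("ws-b", "prefers-email")  # does not count toward ws-a
third = gate.observe("ws-a", "prefers-email")
```

Promotion keeps `provenance="proposed"` on the node, so the UI can still show that the fact originated with Gaby even after an operator approves it.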

22.11 Why not a single graph DB?¶

Short version: because the MemoryGraph protocol is the commitment; the storage engine is an implementation detail. See the full trade-off discussion in the commit history (SQLite default was chosen to preserve the 5-minute install promise while AGE and FalkorDBLite are offered as opt-in profiles for users who want graph-native from day one). Key rationales by what we rejected:

Neo4j Community — JVM (+300 MB RAM), ~600 MB Docker image, third service in Compose, Cypher as a third query language, GPLv3 licensing friction with our Apache-2.0 core. Not a capability problem; an ops cost problem. Remains a valid swap target for users who want it.
Kuzu — archived 2025-10-10. Dead for v0.1 commitment.
SurrealDB — real production use (Samsung Ads, Verizon, Tencent), embedded Rust binary, but 2024–2025 BSL licensing history needs verification before we commit an Apache-2.0 core project to it. Flagged, not adopted.
Cozo — great vision ("the hippocampus for AI"), embedded like SQLite, but pre-1.0 and the maintainers explicitly do not promise storage compatibility before 1.0. Revisit in v0.3+.
Memgraph — BSL license. Non-OSS by our definition.
TypeDB / NebulaGraph / Dgraph — heavier than we need, none add enough over the three chosen backends to justify the extra ops cost.

22.12 What ships in Iter 0 vs Iter 4¶

Iter 0 (scaffold)	Iter 4 (agent loop)
`memory_nodes` + `memory_edges` + `kb_candidates` tables	`PostgresAGEMemoryGraph` full implementation
`sessions.operator_notes` jsonb column	`FalkorDBLiteMemoryGraph` full implementation
`storage/memory_graph/base.py` — the protocol	Planner envelope integration (`applicable_facts` fill)
`storage/memory_graph/sqlite.py` — full implementation	Migration CLI: `gaby memory export`, `import`, `migrate --to=<backend>`
`storage/memory_graph/postgres_age.py` — class stubs with `NotImplementedError`	3-backend round-trip integration test
`storage/memory_graph/falkor_lite.py` — class stubs with `NotImplementedError`	"Remember this" UI button in the approval queue
`storage/memory_graph/__init__.py` — factory reading `GABY_MEMORY_BACKEND` env var	`gaby memory forget --subject=<key>` CLI (GDPR)
`ops/docker/docker-compose.yml` gets a `graph-age` profile adding Postgres+AGE container	Property tests on workspace isolation (any traversal that leaks across workspaces fails)

Iter 0 still writes graph data from the first ticket. Iter 4 adds the alternative backends and the planner integration.
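A minimal sketch of the Iter 0 factory shape, assuming the backend is chosen by the `GABY_MEMORY_BACKEND` environment variable with SQLite as the default. The accepted values (`sqlite`, `postgres_age`, `falkor_lite`) are guesses derived from the module filenames, the function name `make_memory_graph` is hypothetical, and the classes are reduced to a single method to keep the example short.

```python
import asyncio
import os
from typing import Mapping, Optional


class SQLiteMemoryGraph:
    async def healthcheck(self) -> str:
        return "ok"


class PostgresAGEMemoryGraph:
    """Iter 0 stub: importable and constructible, but not yet usable."""

    async def healthcheck(self) -> str:
        raise NotImplementedError("PostgresAGEMemoryGraph lands in Iter 4")


class FalkorDBLiteMemoryGraph:
    """Iter 0 stub: importable and constructible, but not yet usable."""

    async def healthcheck(self) -> str:
        raise NotImplementedError("FalkorDBLiteMemoryGraph lands in Iter 4")


# Assumption: these value strings are illustrative, not the documented ones.
_BACKENDS = {
    "sqlite": SQLiteMemoryGraph,
    "postgres_age": PostgresAGEMemoryGraph,
    "falkor_lite": FalkorDBLiteMemoryGraph,
}


def make_memory_graph(env: Optional[Mapping[str, str]] = None):
    """Instantiate the backend named by GABY_MEMORY_BACKEND (default: sqlite)."""
    env = os.environ if env is None else env
    name = env.get("GABY_MEMORY_BACKEND", "sqlite").strip().lower()
    if name not in _BACKENDS:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(f"unknown memory backend {name!r}; expected one of: {known}")
    return _BACKENDS[name]()


default_backend = make_memory_graph({})
default_health = asyncio.run(default_backend.healthcheck())
```

Failing loudly on an unknown value, rather than falling back to SQLite, avoids silently writing graph data to a backend the operator did not ask for.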

22.13 Verification — how we'll prove the backend swap works¶

A single integration test runs in CI for every PR that touches storage/memory_graph/:

import pytest

WORKSPACE = "workspace-test"

# ALL_BACKENDS, make_backend, load_fixture, fixture_seeds, _canon and canonicalize
# come from the test harness and are not shown here.


@pytest.mark.integration
@pytest.mark.asyncio  # assumption: pytest-asyncio (or an equivalent plugin) runs async tests
@pytest.mark.parametrize("source", ALL_BACKENDS)
@pytest.mark.parametrize("target", ALL_BACKENDS)
async def test_roundtrip_across_backends(source, target, tmp_fixture_graph):
    # 1. Materialize the fixture graph in `source`
    src = make_backend(source)
    await load_fixture(src, tmp_fixture_graph)

    # 2. Export → import into `target` as a single stream (the migration contract)
    dst = make_backend(target)
    await dst.import_all(WORKSPACE, src.export_all(WORKSPACE))

    # 3. Compare canonical dumps
    src_dump = sorted([r async for r in src.export_all(WORKSPACE)], key=_canon)
    dst_dump = sorted([r async for r in dst.export_all(WORKSPACE)], key=_canon)
    assert src_dump == dst_dump

    # 4. Assert neighbors() returns the same results from both
    for seed in fixture_seeds:
        src_neigh = await src.neighbors(WORKSPACE, seed, depth=1, limit=10)
        dst_neigh = await dst.neighbors(WORKSPACE, seed, depth=1, limit=10)
        assert canonicalize(src_neigh) == canonicalize(dst_neigh)

With 3 backends this is a 3×3 = 9-cell test matrix. The SQLite↔SQLite cell catches regressions in the reference implementation. The cross-backend cells catch drift. Any new backend added in the future must make this matrix green before it can be offered to users.
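Comparing dumps across engines only works if backend-specific details are normalized first. The sketch below shows one possible canonical form; it assumes dump rows are plain dicts and that `id` is the only backend-assigned field, neither of which is specified in this document. The helper names `canon` and `canonical_dump` are hypothetical and are not the `_canon` / `canonicalize` helpers referenced in the test.

```python
import json

# Assumption: fields assigned by the storage engine rather than by Gaby.
VOLATILE_FIELDS = frozenset({"id"})


def canon(row: dict) -> str:
    """Serialize one dump row to a stable string, ignoring volatile fields."""
    stable = {k: v for k, v in row.items() if k not in VOLATILE_FIELDS}
    # sort_keys removes dict-ordering differences between backends.
    return json.dumps(stable, sort_keys=True, separators=(",", ":"))


def canonical_dump(rows) -> list:
    """Order-independent, ID-independent representation of a whole dump."""
    return sorted(canon(row) for row in rows)


sqlite_dump = [
    {"id": 1, "kind": "node", "label": "customer", "natural_key": "alice",
     "properties": {"tier": "pro", "lang": "en"}},
    {"id": 7, "kind": "edge", "relation": "USES", "from": "alice", "to": "plan-pro"},
]
age_dump = [
    {"id": "e-901", "kind": "edge", "to": "plan-pro", "from": "alice", "relation": "USES"},
    {"id": "v-844", "kind": "node", "natural_key": "alice", "label": "customer",
     "properties": {"lang": "en", "tier": "pro"}},
]
drifted_dump = [dict(age_dump[0]), dict(age_dump[1], label="employee")]
```

With this normalization, two backends that return rows in a different order or with different internal IDs still compare equal, while a changed label or property does not.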