Architecture — Gaby

Status: Draft v0.1 · Owner: Guilliano · Last updated: 2026-04-11

Reading order: SPEC.md → FOUNDATION.md → ARCHITECTURE.md (this) → ROADMAP.md.

SPEC.md says what we're building. FOUNDATION.md locks the stack and the repo layout. This doc is the technical how: lifecycles, state machines, contracts, concurrency, failure modes, data flow.

This is a living document. Anything with a § icon is a design decision that can be revisited with evidence; anything with a 🔒 is locked for v1.0.


1. System map — one picture

                                 ┌─────────────────────────────────┐
                                 │  End user (customer / employee) │
                                 └──────────────┬──────────────────┘
                        ┌───────────────────────┼───────────────────────┐
                        │                       │                       │
                    Help desk               Chat widget             Slack / Teams
                   (Zendesk, Halo,        (JS snippet,              (bot user)
                    Linear, Zoho…)         shadow DOM)
                        │                       │                       │
                        └───────────┬───────────┴───────────┬───────────┘
                                    │                       │
                                    ▼                       ▼
                         ┌────────────────────┐    ┌────────────────────┐
                         │ TicketSource       │    │ ChatSession        │
                         │  adapters          │    │  manager           │
                         └─────────┬──────────┘    └─────────┬──────────┘
                                   │                         │
                                   └──────────┬──────────────┘
                                 ┌─────────────────────────┐
                                 │ Event bus (in-proc)     │
                                 │   topic: ticket.new     │
                                 └────────────┬────────────┘
                                 ┌─────────────────────────┐
                                 │ Worker runner           │
                                 │  (in-proc or arq+Redis) │
                                 └────────────┬────────────┘
            ┌────────────────────────────────────────────────────────────┐
            │                    Agent loop                              │
            │   ┌─────────┐  ┌──────────┐  ┌───────────┐  ┌──────────┐ │
            │   │ Plan    │→ │ Retrieve │→ │ Tool call │→ │ Observe  │ │
            │   └─────────┘  └──────────┘  └─────┬─────┘  └────┬─────┘ │
            │        ▲                            │              │      │
            │        └────────────────────────────┴──────────────┘      │
            │                    (loop until verdict)                   │
            └───────┬─────────────┬──────────────────────┬──────────────┘
                    │             │                      │
                    ▼             ▼                      ▼
             ┌──────────┐  ┌────────────┐         ┌─────────────┐
             │ LLM      │  │ Knowledge  │         │ MCP host    │
             │ gateway  │  │ retrieval  │         │ (spawns and │
             │ (litellm)│  │ (hybrid)   │         │  supervises │
             └──────────┘  └────────────┘         │  connectors)│
                                                  └──────┬──────┘
                                      ┌──────────────────┼─────────────────┐
                                      │                  │                 │
                               MCP server          MCP server        MCP server
                               postgres            keycloak         zoho-desk
                               (stdio/HTTP)        (stdio/HTTP)     (stdio/HTTP)
                                      │                  │                 │
                                      ▼                  ▼                 ▼
                               Real Postgres       Real Keycloak     Real Zoho
                                 (read-only)        (read-only)     (read + write)

            The agent loop, before every tool call, passes through:

                     Safety pipeline (§6)
             ┌─────────────────────────────────┐
             │ scope check → redact → dry-run  │
             │       → apply → audit           │
             └─────────────────────────────────┘

Everything else in this document elaborates one of the boxes or one of the arrows.


2. Core request lifecycle — from "ticket arrives" to "ticket closed"

This is the canonical path. Every v0.1 scenario collapses to it.

 1. [TicketSource]  poll() / webhook → raw_ticket
 2.                  .normalize()    → Ticket (canonical form)
 3.                  persist → tickets table, emit "ticket.new"
 4. [Worker]         consumes "ticket.new" → schedules Investigation
 5. [Agent loop]     new Investigation(id, ticket_id, budget)
 5a.                   (optional) classify(ticket) → triage verdict
 5b.                   if triage == "not_worth_investigating":
 5c.                     verdict = "skipped"; go to step 25
 6.                    while not verdict:
 7.                      plan_next_step(working_memory)       # LLM call: planner
 8.                      if needs_retrieval:
 9.                        retrieve(query)                    # knowledge subsystem
10.                        append to working_memory
11.                      if needs_tool_call:
12.                        action = propose_tool_call()       # LLM call: tool_selector
13.                        safety_check(action, scopes, autonomy)   ←── may raise
14.                        if dry_run:
15.                          result = simulate(action)
16.                        else:
17.                          result = mcp_host.call(action)
18.                        audit.write(action, result)
19.                        append to working_memory
20.                      maybe_emit_step_to_ui(step)          # live updates
21.                      if budget_exceeded or max_iterations:
22.                        verdict = "failed_budget"; break
23.                    verdict = classify(working_memory)     # LLM call: verdict
24.                    summary  = summarize(working_memory)   # LLM call: summarizer
25. [TicketSink]     write_back(ticket, summary, verdict)     # via the source adapter
26.                  update tickets.status
27.                  emit "investigation.done"
28. [Escalator]      if verdict ∈ {needs_tech, needs_l2, needs_client}:
29.                    dispatch_to_channel(persona.escalation_target)
30. [KB learner]     if verdict == "auto_resolved" and quality_gate_passes:
31.                    stage the resolution as a candidate KB entry (human review)

Legend for the LLM calls in the loop

| Call name | Purpose | Model tier | Streaming |
|---|---|---|---|
| planner | Given working memory, what should we do next? | big | no |
| tool_selector | Choose a specific MCP tool + its arguments | big | no |
| summarizer | Turn the working memory into a customer-facing message | big | yes |
| verdict | Classify final outcome (auto_resolved / needs_tech / …) | small | no |
| classifier (optional) | Cheap pre-filter at step 5a (is this even worth investigating?) | small | no |

The model router (§8.4) decides "big" vs "small". Classifier-style calls go through a cheap model so we don't spend flagship-model tokens on yes/no questions.


2.1 Working memory vs investigation steps — two things, not one

These are separate and must not be confused.

| Thing | Shape | Scope | Persistence | Consumer |
|---|---|---|---|---|
| Working memory | A typed object WorkingMemory { ticket, messages, tool_calls, retrieved_chunks, budget_state }. The messages array is the LLM conversation history for this investigation. | One in-flight investigation | Snapshotted to investigations.working_memory_snapshot (jsonb) at every state-machine transition | The agent loop |
| Investigation steps | Append-only rows in investigation_steps matching the UI timeline shape (system, action, detail, type, timestamp) | One investigation, historical | Permanent (soft-delete only) | The UI, the audit log, the operator |

Every state transition in §3 does two writes: it updates the working memory snapshot AND appends one or more investigation step rows. The snapshot lets us resume after a crash; the steps let the UI animate in real time and the audit log reconstruct history.


3. Agent loop — state machine

                        ┌──────────────┐
                        │   CREATED    │
                        └──────┬───────┘
                               │ start()
                        ┌──────────────┐
         ┌─────────────▶│   PLANNING   │
         │              └──────┬───────┘
         │                     │ next_step == retrieve
         │                     ▼
         │              ┌──────────────┐
         │              │  RETRIEVING  │
         │              └──────┬───────┘
         │                     │
         │                     ▼  (back to planning with new evidence)
         │              ┌──────────────┐
         └──────────────┤   PLANNING   ├─┐
                        └──────┬───────┘ │
                               │          │ next_step == act
                               ▼          ▼
                        ┌──────────────┐
                        │  SAFETY_CHK  │
                        └──────┬───────┘
                 ┌─────────────┼────────────┐
                 │             │            │
                 │             │            │
            denied          approval       allowed
                 │           required        │
                 ▼             │             ▼
         ┌──────────┐          ▼      ┌─────────────┐
         │ HALTED   │   ┌──────────┐  │  ACTING     │
         └──────────┘   │ WAITING  │  └──────┬──────┘
                        │ APPROVAL │         │
                        └─────┬────┘         ▼
                              │       ┌─────────────┐
                              │       │  OBSERVING  │
                              │       └──────┬──────┘
                              │              │
                              ▼              ▼
                        ┌──────────────┐
                        │   PLANNING   │  (loop)
                        └──────┬───────┘
                               │ verdict_ready
                        ┌──────────────┐
                        │  VERDICT     │
                        └──────┬───────┘
                        ┌──────────────┐
                        │  WRITING_BACK│
                        └──────┬───────┘
                        ┌──────────────┐
                        │   DONE       │
                        └──────────────┘

Terminal states

| State | Meaning |
|---|---|
| DONE | Verdict produced, written back, audit closed. Normal path. |
| HALTED | Safety denial or unrecoverable error. Escalated, audit closed. |
| WAITING_APPROVAL | Paused, waiting on a human. Resumable. Has a TTL (default 24h). On TTL expiry → auto-escalate. Not strictly terminal; APPROVED transitions back into ACTING with the same pending action. |

Resume semantics (after a crash OR after an approval)

Because working memory is snapshotted at every transition, resuming is deterministic:

1. Load investigations.working_memory_snapshot for the target investigation
2. Load investigations.status → the last state
3. Re-enter the state machine at that state with the snapshot as input
4. For WAITING_APPROVAL: when the approval lands, the loop re-enters ACTING,
   calls the already-validated (tool_name, args), and proceeds normally
5. For a crash resume: the loop re-enters PLANNING with the last snapshot.
   We *never* replay a non-idempotent action — if the crash happened inside
   ACTING, the audit log tells us whether the action completed
   (`action.applied` event) or not. Completed actions are skipped on resume.
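
The resume rules above can be sketched as a small pure function. This is a minimal illustration, not the actual loop: the names `resume_plan`, `PendingAction`, and the returned strings are hypothetical, and re-issuing an unconfirmed action with the same idempotency key is an assumption drawn from the idempotency requirement, not something the rules state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class State(str, Enum):
    PLANNING = "PLANNING"
    ACTING = "ACTING"
    WAITING_APPROVAL = "WAITING_APPROVAL"


@dataclass(frozen=True)
class PendingAction:
    action_id: str   # action UUID; also used as the idempotency key
    tool_name: str


def resume_plan(status: State, pending: Optional[PendingAction],
                applied_action_ids: Set[str], approved: bool = False) -> str:
    """Decide what a resumed investigation does first."""
    if status is State.WAITING_APPROVAL:
        # An approval re-enters ACTING with the already-validated action.
        return "execute" if approved else "wait"
    if status is State.ACTING and pending is not None:
        # The audit log (action.applied events) says whether the call landed.
        if pending.action_id in applied_action_ids:
            return "skip"
        # Assumption: an unconfirmed action is re-issued under the same key,
        # so the connector can de-duplicate if it did land.
        return "retry_with_same_key"
    return "plan"


pending = PendingAction(action_id="a1", tool_name="reset_password")
after_crash_applied = resume_plan(State.ACTING, pending, {"a1"})
after_crash_unknown = resume_plan(State.ACTING, pending, set())
```

Keeping the decision free of I/O makes each resume branch testable without a database or a connector.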

Idempotency requirement on MCP tool authors: every write tool must accept an idempotency_key argument (Gaby generates one per action UUID). The connector is responsible for de-duplicating on retry. Contract test §12 enforces this for every dangerous tool.

Budget enforcement

At every transition, the loop checks:

  • tokens_used < budget.tokens
  • usd_spent < budget.usd
  • wall_clock < budget.max_seconds
  • iterations < budget.max_iterations (default 20)

Any breach → verdict failed_budget, escalation. No silent degradation.
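
A minimal sketch of the per-transition check, assuming hypothetical `Budget` and `BudgetState` types. The token and USD defaults are the persona defaults from §8.3 and the iteration cap is the default above; the `max_seconds` default is an illustrative assumption, since this document does not give one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    tokens: int = 50_000
    usd: float = 0.50
    max_seconds: float = 600.0   # assumed value, not from this document
    max_iterations: int = 20


@dataclass
class BudgetState:
    tokens_used: int = 0
    usd_spent: float = 0.0
    wall_clock: float = 0.0
    iterations: int = 0


def first_breach(state: BudgetState, budget: Budget) -> Optional[str]:
    """Return the name of the first exceeded limit, or None if all checks pass."""
    checks = (
        ("tokens", state.tokens_used < budget.tokens),
        ("usd", state.usd_spent < budget.usd),
        ("wall_clock", state.wall_clock < budget.max_seconds),
        ("iterations", state.iterations < budget.max_iterations),
    )
    for name, ok in checks:
        if not ok:
            return name
    return None


budget = Budget()
healthy = first_breach(BudgetState(tokens_used=12_000, usd_spent=0.10), budget)
exhausted = first_breach(BudgetState(iterations=20), budget)
```

A non-None result would map to the failed_budget verdict; returning the limit name gives the audit event something specific to record.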

Why homegrown (a reminder)

We discussed this in FOUNDATION.md §1.1. The state machine above is ~400 Python lines on top of the Anthropic/OpenAI SDKs. The reasons to not adopt LangGraph or pydantic-ai at v0.1 are:

  1. Safety must come before every tool call, not as a decorator. Frameworks make this awkward; in a hand-rolled loop it's one function call on the critical path.
  2. Every transition emits an audit event with the full working memory delta. Frameworks' internal state is opaque to us.
  3. Budget enforcement is per-transition, not per-call. Our loop checks every edge; frameworks expose hooks but not guarantees.
  4. We want streaming of summarizer output directly to the UI. Simple from our loop; non-obvious in a framework that wraps the LLM client.

The public interface of the loop is small enough (start, resume, step, state) that a future swap is a week of work, not a rewrite.

Preferred escape hatch: pydantic-ai

If the homegrown loop stalls — prompt debugging becomes painful, multi-step branching gets tangled, we reimplement checkpointing — the preferred migration target is pydantic-ai, not LangGraph. Published 2026 benchmarks put pydantic-ai at ~44% lower P95 latency, ~5× fewer errors under load, and ~2.7× lower token consumption versus LangGraph on equivalent agent tasks. It also ships pydantic-graph with durable execution across restarts and first-class human-in-the-loop, which maps cleanly onto our WAITING_APPROVAL state.

Re-evaluation gate: after v0.1 ships, run the eval harness (50+ fixture tickets) against both the homegrown loop and a pydantic-ai port. If the pydantic-ai port is within 10% of the homegrown loop on safety compliance AND is either meaningfully shorter in code or faster on latency, we swap for v0.2.


4. Concurrency model

Gaby is I/O-bound: its time goes to LLM calls, DB queries, MCP round-trips, and HTTP calls to help desks. asyncio everywhere is the default; threads are only for hard CPU work (embeddings inference if we run it locally, BM25 scoring on large corpora).

4.1 The runtime shape

                         ┌────────────────────────┐
                         │    FastAPI app         │
                         │  (uvicorn, 1 process)  │
                         └──────────┬─────────────┘
                         same event loop
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
          ┌────────────┐  ┌──────────────┐  ┌─────────────┐
          │ HTTP routes│  │ Worker runner│  │ Chat gateway│
          └────────────┘  └──────┬───────┘  └─────────────┘
                     bounded semaphore (N=8 default)
                     ┌───────────┼───────────┐
                     ▼           ▼           ▼
               Investigation  Investigation  Investigation
                 task           task          task
                 (asyncio.Task)  ...           ...

4.2 Key parameters

| Parameter | SQLite default (v0.1) | Postgres default (v0.2+) | Notes |
|---|---|---|---|
| uvicorn --workers | 1 | 2–4 | Single process in v0.1 is enough; scale out horizontally in v0.5 |
| Concurrent investigations | 8 | 16 | Bounded semaphore; excess tickets queue in the DB |
| Per-investigation LLM concurrency | 1 | 1 | LLM calls inside one investigation are serial, which keeps reasoning simple |
| MCP subprocess pool size | unbounded | unbounded | One MCP server per connector; each serves many investigations concurrently |
| Async DB pool | pool_size=5, max_overflow=5 | pool_size=20, max_overflow=10 | SQLite is single-writer so the pool just serializes writes. Postgres default follows the 2026 production pattern of pool=20/overflow=10 for moderate API servers. Remember: total DB connections = workers × (pool_size + max_overflow). |
| PostgreSQL driver | n/a | asyncpg | asyncpg (not psycopg2) for true async; SQLAlchemy configured with postgresql+asyncpg:// |
| HTTP client (httpx) | shared | shared | Single AsyncClient per process, limits=httpx.Limits(max_keepalive_connections=40) |
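
The connection-count rule in the Async DB pool row is easy to get wrong at deploy time, so here is the arithmetic for the defaults in this table. The helper name `max_db_connections` is illustrative only.

```python
def max_db_connections(workers: int, pool_size: int, max_overflow: int) -> int:
    """Upper bound on connections the app can open against the database."""
    return workers * (pool_size + max_overflow)


sqlite_v01 = max_db_connections(workers=1, pool_size=5, max_overflow=5)
postgres_2_workers = max_db_connections(workers=2, pool_size=20, max_overflow=10)
postgres_4_workers = max_db_connections(workers=4, pool_size=20, max_overflow=10)
```

With the Postgres defaults, two workers can open up to 60 connections and four workers up to 120. The latter is above PostgreSQL's stock max_connections of 100, so a four-worker deployment would need that setting raised or a pooler in front of the database.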

4.3 In-process vs external worker

v0.1 default:           Investigations run inside the FastAPI process, on the same
                        event loop. No Redis. "docker compose up" = 1 container for
                        the app + 1 for the UI + 1 for Postgres (optional).

v0.2 default (scale):   arq worker in a separate container. Same code, different
                        entry point. Switches on `GABY_WORKER_MODE=arq`.

The worker interface is identical in both modes (runner.schedule(investigation)), so the upgrade path is a config flag, not a refactor.

4.4 Backpressure

  • If the investigation semaphore is full, new ticket.new events queue in the DB (tickets.status='queued').
  • The web UI dashboard shows queue depth and expected wait time (queue_length × avg_inv_duration / concurrent_investigations).
  • Above a configurable threshold (default 50 queued), Gaby starts degraded mode: it still intakes tickets, but skips the retrieve step for low-priority tickets to drain faster.
  • We never drop a ticket. The DB is the queue; losing a ticket requires a DB failure, not a process crash.

4.5 The DB is the queue — event bus clarified

To square "in-process event bus" with "never drop a ticket", the actual rule is:

  1. Ticket adapters write tickets(status='queued') and then optionally fire an in-memory notification to wake the worker faster.
  2. The worker runner's main loop is a DB poll: SELECT ... FROM tickets WHERE status='queued' ORDER BY priority DESC, received_at ASC LIMIT 1 FOR UPDATE SKIP LOCKED. SQLite uses BEGIN IMMEDIATE + an app-level mutex in place of FOR UPDATE SKIP LOCKED.
  3. The in-memory notification is a latency optimization, not the source of truth. If it gets dropped (process crash, Python GC pause), the next poll tick catches the ticket.
  4. Poll interval defaults to 2 seconds; notification-driven wakeups push it effectively to zero when the system is loaded.

This makes the system at-least-once: after a crash, we may re-claim a ticket whose investigation was mid-flight. The resume rules in §3 handle that safely because:

  • Working memory is snapshotted per transition → we pick up where we left off.
  • Every write action carries an idempotency_key → no duplicate side-effects.
  • The audit log tells us what was already applied before the crash.

4.6 Ticket claim transaction

Pseudo-code for the claim:

-- Postgres path
BEGIN;
SELECT id, workspace_id, body FROM tickets
  WHERE status = 'queued'
  ORDER BY priority DESC, received_at ASC
  LIMIT 1
  FOR UPDATE SKIP LOCKED;
UPDATE tickets SET status = 'investigating', claimed_by = $worker_id, claimed_at = now()
  WHERE id = $picked_id;
COMMIT;

For SQLite we fall back to a single-writer strategy: one claim task, protected by a process-wide asyncio.Lock, using BEGIN IMMEDIATE to acquire the DB writer lock before the SELECT + UPDATE. This is fine for v0.1 throughput targets.
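
A minimal sketch of the SQLite claim path using the standard-library sqlite3 module. It is synchronous for brevity; the process-wide asyncio.Lock described above would wrap the call in the real runner. The function name `claim_next_ticket` and the trimmed-down table are assumptions for illustration.

```python
import sqlite3
from typing import Optional


def claim_next_ticket(conn: sqlite3.Connection, worker_id: str) -> Optional[int]:
    """Claim the highest-priority, oldest queued ticket; return its id or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the writer lock before the SELECT
    try:
        row = conn.execute(
            "SELECT id FROM tickets WHERE status = 'queued' "
            "ORDER BY priority DESC, received_at ASC LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE tickets SET status = 'investigating', claimed_by = ?, "
                "claimed_at = datetime('now') WHERE id = ?",
                (worker_id, row[0]),
            )
        conn.execute("COMMIT")
        return None if row is None else row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None stops sqlite3 from opening transactions implicitly,
# so the explicit BEGIN IMMEDIATE above is the only transaction boundary.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT, priority INTEGER, "
    "received_at TEXT, claimed_by TEXT, claimed_at TEXT)"
)
conn.executemany(
    "INSERT INTO tickets (status, priority, received_at) VALUES ('queued', ?, ?)",
    [(1, "2026-04-11T10:00"), (5, "2026-04-11T10:05"), (5, "2026-04-11T10:01")],
)
first_claimed = claim_next_ticket(conn, "worker-1")
```

In the demo data, the third inserted row wins the first claim because it ties on priority with the second row but was received earlier.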


5. MCP host — connector lifecycle

Every connector is an MCP server. Gaby is an MCP host (in MCP parlance) and an MCP client (it calls their tools).

5.1 Spawn strategies

| Strategy | When | Implementation |
|---|---|---|
| stdio subprocess (v0.1 default) | Default for first-party connectors bundled in the image | asyncio.create_subprocess_exec(...); framed JSON-RPC over pipes as per the MCP stdio transport |
| Streamable HTTP | Remote / community MCP servers reachable over HTTP | MCP's Streamable HTTP transport, the current spec's HTTP option (it superseded the older SSE transport) |
| in-process | Tiny built-ins that never block (e.g. local filesystem, time) | Direct function calls wearing an MCP-shaped mask |

Why stdio default? First-party connectors ship in the Gaby Docker image and are spawned as subprocesses — no network, no auth, lowest latency, simplest failure modes. Streamable HTTP is for remote/community servers where subprocess isn't an option. The official Python MCP SDK supports both transports with the same client interface.

5.2 Lifecycle

CONFIGURED ──start──▶ LAUNCHING ──handshake──▶ READY ──▶ BUSY ──▶ READY
                          │                      │         │        │
                          └─fail─▶ CRASHED       │         ▼        │
                                     │           │     TIMEOUT      │
                                     ▼           │         │        ▼
                                 RESTARTING ◀────┴─────────┴───── SHUTDOWN
                                 READY (or DEGRADED after N failures)
  • Handshake = MCP initialize + tools/list. The tool list is cached per connector version.
  • Crash recovery: exponential backoff, max 5 restarts in 5 minutes. After that the connector is marked DEGRADED and the UI shows a persistent warning. Investigations that would have used this connector either skip it or halt, depending on connector criticality.
  • Health check: periodic ping every 30s (for HTTP) or "is the subprocess alive?" (for stdio). Surfaced at /health/connectors.
  • Graceful shutdown: SIGTERM → wait for in-flight tool calls to finish → SIGKILL after 10s.
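
The crash-recovery rule (exponential backoff, at most 5 restarts in 5 minutes, then DEGRADED) could look like the sketch below. The class name `RestartPolicy`, the 1-second base delay, and the 60-second cap are assumptions; only the restart limit and window come from the text above.

```python
from collections import deque
from typing import Deque, Optional


class RestartPolicy:
    def __init__(self, max_restarts: int = 5, window_s: float = 300.0,
                 base_delay_s: float = 1.0, max_delay_s: float = 60.0) -> None:
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self._restarts: Deque[float] = deque()  # timestamps of recent restarts

    def next_delay(self, now: float) -> Optional[float]:
        """Seconds to wait before restarting, or None to mark the connector DEGRADED."""
        # Forget restarts that have aged out of the sliding window.
        while self._restarts and now - self._restarts[0] >= self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return None
        # Delay doubles with each restart still inside the window.
        delay = min(self.base_delay_s * 2 ** len(self._restarts), self.max_delay_s)
        self._restarts.append(now)
        return delay


policy = RestartPolicy()
delays = [policy.next_delay(t) for t in (0.0, 10.0, 20.0, 30.0, 40.0)]
sixth_attempt = policy.next_delay(50.0)
```

Passing `now` in explicitly (rather than reading a clock inside the method) keeps the policy deterministic under test.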

5.3 Tool scope declaration — the contract every connector must satisfy

Every connector declares its capabilities in a machine-readable form that Gaby trusts for authorization decisions:

# connector tool manifest (returned by tools/list + scope extension)
tools:
  - name: query_users
    scope: read
    description: "Look up user by email or ID"
    args: [{ name: email, type: string }]
  - name: reset_password
    scope: write
    dangerous: true
    requires_approval_above_autonomy: propose   # auto-approves only when autonomy=act
    description: "Trigger a password reset"
    args: [{ name: user_id, type: string }]
  - name: delete_user
    scope: write
    dangerous: true
    forbidden_in_autonomy: [investigate, propose]   # only allowed in autonomy=act
    description: "Permanently delete a user"
    args: [{ name: user_id, type: string }]

Contract tests (see §12) verify every first-party connector declares these fields. Community connectors that don't are flagged UNSAFE in the UI and cannot be moved out of investigate autonomy.

5.4 Manifest versioning and cache invalidation

Each connector declares a manifest_version (semver) in its initialize response. Gaby stores the last-seen version per connector. On restart, if the version changed, the tool list cache is invalidated and a new tools/list is performed. The manifest hash is also written to the audit log so historical investigations reference a specific immutable version of the tool set.

5.5 Idempotency keys for write tools

Every write tool must accept idempotency_key: string as an argument. Gaby generates one per action UUID and passes it automatically. Connectors use it to de-duplicate on retry after a crash or transient failure. This requirement is enforced by the contract tests.
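
On the connector side, de-duplication can be as simple as remembering the first result per key. This is a sketch under assumptions: `IdempotentTool` and the toy `reset_password` handler are hypothetical, and the in-memory dict stands in for whatever durable store a real connector would need to survive its own restarts.

```python
import uuid
from typing import Any, Callable, Dict, List


class IdempotentTool:
    """Wraps a write handler so a repeated idempotency_key replays the first result."""

    def __init__(self, handler: Callable[..., Any]) -> None:
        self._handler = handler
        self._results: Dict[str, Any] = {}  # in-memory only; illustrative

    def call(self, *, idempotency_key: str, **args: Any) -> Any:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # retry: no second side effect
        result = self._handler(**args)
        self._results[idempotency_key] = result
        return result


side_effects: List[str] = []


def reset_password(user_id: str) -> dict:
    side_effects.append(user_id)  # stands in for the real write
    return {"user_id": user_id, "status": "reset_triggered"}


tool = IdempotentTool(reset_password)
key = str(uuid.uuid4())  # Gaby would derive this from the action UUID
first = tool.call(idempotency_key=key, user_id="u-42")
retry = tool.call(idempotency_key=key, user_id="u-42")
```

A contract test for a dangerous tool could follow the same shape: call twice with one key and assert a single side effect.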


6. Safety pipeline — the thing that cannot break

This is the single most important subsystem. Every non-read action passes through it, in order:

        action (tool_name, args, connector_id)
        ┌──────────────────────────────┐
        │ 1. SCOPE CHECK               │   evaluate(action, connector.scopes,
        │                              │             ticket.workspace_id,
        │                              │             persona.autonomy_level)
        └──────┬───────────────────────┘
               │ denied ──────────────────▶ AUDIT(denied) → raise PermissionError
               │ allowed
        ┌──────────────────────────────┐
        │ 2. REDACTION                 │   strip PII from any string args per
        │                              │   workspace.compliance_profile (HIPAA, SOC2…)
        └──────┬───────────────────────┘
        ┌──────────────────────────────┐
        │ 3. DRY-RUN DECISION          │   dry_run = (autonomy ≠ act)
        │                              │            OR (tool.dangerous AND not approved)
        └──────┬───────────────────────┘
        ┌──────┴──────┐
        │             │
     dry_run       real
        │             │
        ▼             ▼
  simulate()      mcp_host.call()
        │             │
        └──────┬──────┘
        ┌──────────────────────────────┐
        │ 4. AUDIT                     │   append_hash_chained(
        │                              │     actor, action, result, ts)
        └──────┬───────────────────────┘
        return result to agent loop

6.1 Scope DSL (sketch)

Scopes are declarative, per-connector. There are exactly two lanes: read and write. Dry-run is not a scope lane; it is a runtime decision made at step 3 of the safety pipeline (see §6 diagram) and implemented by the connector when its tool manifest sets supports_dry_run=true. See docs/decisions/2026-04-15-dry-run-not-a-scope-lane.md for the rationale.

connector: m365
scopes:
  read:
    allow: ["users/*", "mailboxes/*"]
  write:
    allow: ["users/{id}/reset_password"]
    deny:  ["users/{id}/delete"]

The scope checker resolves action.tool against these globs plus the tool manifest's scope field. Denies beat allows. Everything not explicitly allowed is denied.
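
A minimal sketch of the deny-beats-allow, default-deny evaluation, using the m365 example above. Several things here are assumptions: that an action resolves to a resource path such as `users/42/delete`, that `{id}` placeholders behave like a wildcard, and the function name `is_allowed`. The manifest's scope field is not modelled.

```python
import re
from fnmatch import fnmatchcase
from typing import Dict, List

Scopes = Dict[str, Dict[str, List[str]]]

M365_SCOPES: Scopes = {
    "read": {"allow": ["users/*", "mailboxes/*"]},
    "write": {"allow": ["users/{id}/reset_password"], "deny": ["users/{id}/delete"]},
}


def _to_glob(pattern: str) -> str:
    # Treat "{id}"-style placeholders as wildcards (an assumption of this sketch).
    return re.sub(r"\{[^}]*\}", "*", pattern)


def _matches(resource: str, patterns: List[str]) -> bool:
    # Note: fnmatch's "*" also crosses "/", so a stricter matcher may be wanted.
    return any(fnmatchcase(resource, _to_glob(p)) for p in patterns)


def is_allowed(lane: str, resource: str, scopes: Scopes = M365_SCOPES) -> bool:
    rules = scopes.get(lane, {})
    if _matches(resource, rules.get("deny", [])):
        return False  # denies beat allows
    return _matches(resource, rules.get("allow", []))  # default deny


can_reset = is_allowed("write", "users/42/reset_password")
can_delete = is_allowed("write", "users/42/delete")
```

An unknown lane or an unlisted resource falls through to False, which is the "everything not explicitly allowed is denied" rule.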

6.2 Audit log — hash-chained, append-only

Every entry:

{
  "id": <uuid7>,
  "workspace_id": ...,
  "ts": <monotonic_wall>,
  "actor_kind": "agent" | "user" | "system",
  "actor_id": ...,
  "event": "action.applied" | "action.denied" | "approval.granted" | ...,
  "payload": { ... action + result snapshot ... },
  "prev_hash": <sha256 of previous row>,
  "hash":      <sha256(prev_hash || canonical_json(this row without 'hash'))>
}
  • Verification: a background task re-walks the chain daily and alerts on any mismatch.
  • SIEM export (EE feature): tail the chain to Splunk / Sumo / Elastic via a pluggable exporter.
  • Why not a separate append-only database (e.g. QLDB, immudb)? Added operational complexity. A hash-chained table in the same Postgres, with row-level ACLs and no UPDATE / DELETE grants, gives 95% of the guarantee for 5% of the complexity. EE customers who need stronger guarantees can pipe to a dedicated store.
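
The hash rule in the entry format above can be sketched with the standard library. Assumptions in this sketch: canonical JSON means sorted keys with compact separators, the first row chains from an all-zero `GENESIS` hash, and the chain is held in a list rather than a table.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed prev_hash for the first row


def canonical_json(row: Dict[str, Any]) -> str:
    return json.dumps(row, sort_keys=True, separators=(",", ":"))


def entry_hash(prev_hash: str, row: Dict[str, Any]) -> str:
    # sha256(prev_hash || canonical_json(row without 'hash'))
    body = {k: v for k, v in row.items() if k != "hash"}
    return hashlib.sha256((prev_hash + canonical_json(body)).encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict[str, Any]], row: Dict[str, Any]) -> Dict[str, Any]:
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {**row, "prev_hash": prev}
    entry["hash"] = entry_hash(prev, entry)
    chain.append(entry)
    return entry


def verify(chain: List[Dict[str, Any]]) -> bool:
    """Re-walk the chain; any edited, removed, or reordered row breaks it."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(prev, entry):
            return False
        prev = entry["hash"]
    return True


chain: List[Dict[str, Any]] = []
append_entry(chain, {"event": "action.applied", "payload": {"tool": "reset_password"}})
append_entry(chain, {"event": "approval.granted", "payload": {}})
intact = verify(chain)
```

The daily verification task would be the `verify` walk run over rows read in insertion order.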

6.3 The four autonomy levels (one more than SPEC.md §6.5)

| Level | What the agent does | When to use |
|---|---|---|
| off | Gaby does nothing. Tickets are ingested but not investigated. | Maintenance mode / legal hold. |
| investigate | Gaby reads, retrieves, queries. Never writes. Produces a summary for humans. | First week of deployment. Read-only SRE connectors. |
| propose | Gaby drafts the fix. Every write action goes to the approval queue. | Default for most non-trivial deployments. |
| act | Gaby executes writes itself, with dry-run + audit + rollback. Still respects dangerous/forbidden flags on tools. | Mature deployments with well-understood playbooks. |

Autonomy is set per connector, per workspace. A single investigation can include act calls to Redis and propose calls to Stripe.


7. Knowledge subsystem — retrieval with citations

7.1 Pipeline

source           chunker             embedder            store             retrieval
-----            -------             --------            -----             ---------
git repo         token-aware         provider-agnostic   sqlite-vec        hybrid (BM25 + vector)
dir walker       Markdown/code-aware (pluggable)         or pgvector       + cross-encoder rerank
confluence       respects headings                       + FTS5 / tsvector + top-k=6 default
notion                                                                     + explicit citations
pdf
url crawler
past tickets

Vector store scale cliff — plan for the migration

sqlite-vec uses brute-force search (no ANN index). This is fine for v0.1 — a founder's runbook folder is hundreds, maybe low thousands, of chunks — but it does not scale to large corpora. The migration triggers:

| Corpus size | Recommendation |
|---|---|
| < 5,000 chunks | sqlite-vec (brute-force is fast enough, <20 ms queries) |
| 5,000 – 50,000 | Evaluate vectorlite (sqlite ANN, ~3–30× faster than sqlite-vec on the same hardware) |
| > 50,000 | Switch to the Postgres profile, use pgvector with HNSW indexes |

All three present the same VectorStore protocol, so the migration is a config flag + a background reindex, not a rewrite. The documents table schema stores embedding as raw BLOB (float32 × dim) so the underlying index implementation is swappable.
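
Storing the embedding as a raw float32 BLOB amounts to packing and unpacking 4-byte floats. The sketch below assumes little-endian byte order and uses hypothetical helper names; the actual on-disk layout is whatever the documents table schema fixes.

```python
import struct
from typing import List, Sequence


def pack_embedding(vec: Sequence[float]) -> bytes:
    """Serialize a vector as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob: bytes) -> List[float]:
    if len(blob) % 4 != 0:
        raise ValueError("blob length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


blob = pack_embedding([0.5, -1.25, 2.0])
restored = unpack_embedding(blob)
blob_1536 = pack_embedding([0.0] * 1536)
```

Because the dimension is recoverable from the blob length, the same column can hold 512-dim or 1536-dim vectors, which is what keeps the schema dim-agnostic.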

Embedding model default: we start with a provider-agnostic choice — text-embedding-3-small (OpenAI, 1536 dim) for BYOK users on OpenAI, voyage-3-lite (Voyage AI, 512 dim) for Anthropic-leaning deployments, or a local BGE model via sentence-transformers for air-gapped installs. The schema is dim-agnostic; changing models triggers a background reindex.

7.2 Chunker rules (§)

  • Markdown: split on top-level headings first, then H2, then 800-token soft max.
  • Code (source files): split per function/class; never split mid-function.
  • PDF: page-aware; no cross-page chunks unless a heading continues.
  • Chunk metadata carries: source_uri, headings_path, line_range, content_hash.

7.3 Retrieval

  1. Query rewrite (optional, cheap model): turn the ticket title + body into 1–3 search queries.
  2. Parallel retrieval: BM25 top-20 ∥ vector top-20.
  3. Reciprocal-rank fusion → top-20 hybrid candidates.
  4. Cross-encoder rerank (cheap model or a small local model) → top-6.
  5. Attach to working memory with explicit citations ([doc:uri#headings#L12-L30]).
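
Step 3 above, reciprocal-rank fusion, is small enough to show in full. This sketch assumes the conventional smoothing constant k=60 and breaks score ties by document id; neither choice is specified in this document.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60, top_n: int = 20) -> List[str]:
    """Fuse several ranked lists; each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; doc id as a deterministic tie-break.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Here "b" ranks first because it sits near the top of both lists, while "c" and "d" each appear in only one. RRF needs only ranks, so BM25 and vector scores never have to be normalized against each other.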

7.4 Citations in output

Every claim in the final summary must end with a citation token. Unsourced claims are re-queried or explicitly disclaimed:

"This user's Authenticator was tied to the old iPhone [kb://runbooks/mfa-lockout#L45-L58] and the Entra ID sign-in log confirms 7 AADSTS50076 failures [entra://signinlogs#user=kevin.reyes@hartwelllaw.com&window=30m]."

Users can click any citation in the UI to see the source.

7.5 Learning loop

When an investigation resolves with auto_resolved verdict AND the operator rates it ≥4/5 (or no one disputes it within 7 days), Gaby stages a new KB candidate (the ticket + the resolution + the tool-call trace) in a review queue. An operator accepts / edits / rejects. Accepted entries become new indexed documents.

No silent learning. Human in the loop, always.


8. LLM gateway

8.1 Provider interface

from typing import Literal, Protocol

class LLMProvider(Protocol):
    # "ChatResult" is a forward reference to the gateway's result type (not shown here).
    async def chat(self, messages, *, model, tools=None, max_tokens, temperature,
                   cache_control=None, stream=False) -> "ChatResult": ...
    async def embed(self, texts, *, model) -> list[list[float]]: ...
    def supports(self, capability: Literal["tools", "streaming", "cache", "json_mode"]) -> bool: ...

Three concrete implementations in v0.1:

  • AnthropicProvider (direct anthropic Python SDK): used on hot paths (planner, summarizer)
  • OpenAIProvider (direct openai Python SDK): fallback for BYOK
  • LiteLLMProvider: wraps ~100 providers for BYOK users who want Azure OpenAI, Bedrock, Vertex, Mistral, local vLLM, etc.

A note on LiteLLM. We use the Python SDK (litellm as a library), not the LiteLLM proxy: the proxy has known production issues in 2026 (GIL-bound throughput, DB logging degradation, SSO gated behind paid tier) and was compromised in a PyPI supply-chain attack in March 2026 (versions 1.82.7 and 1.82.8). Mitigations:

  • Pin litellm to a known-good version range in uv.lock and update deliberately.
  • All SDK installs go through PyPI with hashes verified at install time.
  • Hot paths (planner, tool_selector, verdict, summarizer) bypass LiteLLM entirely and use the direct Anthropic/OpenAI SDKs.
  • LiteLLM only sees BYOK-only providers (Bedrock, Vertex, Azure, local vLLM) where its breadth is the value.

If BYOK volume becomes a real production load, Bifrost (Apache 2.0, Go-based, ~10μs overhead) and Portkey are the v0.2+ evaluation targets.

8.2 Prompt caching

Anthropic (2026 rules). Cache scope is the whole prefix up to the cache breakpoint, in request order: tools → system → messages. Cache reads are billed at ~0.1× base input price, 5-minute writes at 1.25×, 1-hour writes at 2×. Max 4 breakpoints per request. Minimum cacheable block size: 1,024 tokens on Sonnet, 4,096 tokens on Haiku (so short prompts never benefit on Haiku). Cache TTL defaults to 5 minutes, refreshed on every hit — which fits perfectly inside a typical investigation that spans seconds to minutes. As of 2026-02-05 caches are workspace-isolated (not org-wide), so multi-workspace deployments get proper separation for free.

We place our 4 breakpoints as follows:

| # | Block | Lifetime | Why |
| --- | --- | --- | --- |
| 1 | Tool manifest (the MCP tool list, serialized) | Connector version | Changes only when a connector updates. Near-permanent cache hits. |
| 2 | System prompt (persona-specific instructions + safety rules) | Persona version | Changes rarely. Big win on every planner / verdict call. |
| 3 | Retrieved KB chunks for this investigation | Investigation lifetime | Same chunks are re-sent with each planner turn; the 5-min TTL keeps them hot. |
| 4 | Accumulated messages (ticket + prior tool calls) | Investigation lifetime | The accumulator. Every new turn extends past the last breakpoint. |

Below 1,024 tokens on Sonnet (or 4,096 on Haiku) the breakpoint is a no-op and the call pays the full input price — a minor inefficiency, never a bug.
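A minimal sketch of how the four breakpoints above could be attached to an Anthropic Messages API request body. The `cache_control: {"type": "ephemeral"}` marker is the API's own mechanism; the helper name `build_cached_request`, its parameters, and the choice to send KB chunks as a second system block are assumptions made for illustration, not Gaby's actual code.

```python
import copy
from typing import Any


def _ephemeral() -> dict[str, str]:
    # Anthropic's cache marker; the default TTL is 5 minutes.
    return {"type": "ephemeral"}


def build_cached_request(
    model: str,
    tools: list[dict[str, Any]],
    system_prompt: str,
    kb_chunks: list[str],
    messages: list[dict[str, Any]],
    max_tokens: int = 1024,
) -> dict[str, Any]:
    """Hypothetical helper: build a payload with up to four cache breakpoints."""
    tools = copy.deepcopy(tools)        # never mutate the caller's objects
    messages = copy.deepcopy(messages)

    # Breakpoint 1: marking the last tool caches the whole tool manifest.
    if tools:
        tools[-1]["cache_control"] = _ephemeral()

    # Breakpoint 2: persona + safety instructions.
    system = [{"type": "text", "text": system_prompt, "cache_control": _ephemeral()}]

    # Breakpoint 3: retrieved KB chunks, sent here as a second system block
    # (an assumption). Skipped when there are no chunks.
    if kb_chunks:
        system.append(
            {"type": "text", "text": "\n\n".join(kb_chunks), "cache_control": _ephemeral()}
        )

    # Breakpoint 4: the last content block of the accumulated messages.
    if messages:
        last = messages[-1]
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        if last["content"]:
            last["content"][-1]["cache_control"] = _ephemeral()

    return {
        "model": model,
        "max_tokens": max_tokens,
        "tools": tools,
        "system": system,
        "messages": messages,
    }
```

The returned dict could be passed as keyword arguments to the SDK's message-creation call. Blocks below the model's minimum cacheable size are simply processed uncached, as noted above.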

OpenAI. Prompt caching is automatic and server-side, so no API change is required. It only applies to prompts of 1,024 tokens or more, so short prompts see no benefit there either.

Local models. The cache is a no-op but the interface is uniform across providers.

8.3 Budget enforcement

Every chat() call passes through a BudgetGuard:

guard.check(investigation_id)  # raises BudgetExceeded before the HTTP call
guard.record(investigation_id, prompt_tokens, completion_tokens, cost_usd)

Budgets are per investigation, set from the persona's profile (default: 50k tokens, $0.50). Breaches halt the investigation and escalate.
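A minimal in-memory sketch of the guard described above, assuming the default budget of 50k tokens and $0.50. Only `check`, `record`, and `BudgetExceeded` are named in this doc; the constructor parameters and internal bookkeeping are assumptions.

```python
from typing import Optional


class BudgetExceeded(Exception):
    """Raised before the HTTP call when an investigation is out of budget."""


class BudgetGuard:
    def __init__(self, max_tokens: int = 50_000, max_cost_usd: float = 0.50) -> None:
        self.max_tokens = max_tokens
        self.max_cost_usd = max_cost_usd
        self._tokens: dict[str, int] = {}
        self._cost: dict[str, float] = {}

    def check(self, investigation_id: str) -> None:
        """Raise if either the token or the USD budget is already used up."""
        tokens = self._tokens.get(investigation_id, 0)
        cost = self._cost.get(investigation_id, 0.0)
        if tokens >= self.max_tokens or cost >= self.max_cost_usd:
            raise BudgetExceeded(
                f"{investigation_id}: {tokens} tokens, ${cost:.4f} spent"
            )

    def record(
        self,
        investigation_id: str,
        prompt_tokens: int,
        completion_tokens: int,
        cost_usd: Optional[float],
    ) -> None:
        """Add one call's usage; a None cost (unknown model) only counts tokens."""
        self._tokens[investigation_id] = (
            self._tokens.get(investigation_id, 0) + prompt_tokens + completion_tokens
        )
        if cost_usd is not None:
            self._cost[investigation_id] = (
                self._cost.get(investigation_id, 0.0) + cost_usd
            )
```

Because the check runs before the call and the cost is only known afterwards, a single call can overshoot the budget slightly; the next `check` then halts the investigation.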

8.3.1 Cost mapping — where USD comes from

Token counts come back from the provider response (usage.input_tokens, usage.output_tokens, usage.cache_creation_input_tokens, usage.cache_read_input_tokens for Anthropic). The tokens→USD conversion uses a single pricing table:

  • Primary source: the pricing table bundled in litellm (updated by the upstream project regularly).
  • Override: per-workspace config can set custom per-model rates for BYOK customers with negotiated pricing.
  • Fallback: if a model isn't in the table, the cost column is NULL and the cost metric isn't incremented for that call. The token metric still is.

Each call's token counts and cost are recorded in the llm_calls table so the cost dashboard (§16) can aggregate by investigation, by workspace, by model, and by purpose.
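The lookup order above (workspace override, then bundled table, then NULL) can be sketched as follows. The `PRICING` dict is a stand-in for the litellm-bundled table and its rates are hypothetical; the cache multipliers are the 5-minute-write and cache-read factors quoted in §8.2. The sketch also assumes `input_tokens` excludes the cached tokens, which are reported in their own usage fields.

```python
from decimal import Decimal
from typing import Optional

Rates = dict[str, Decimal]

# HYPOTHETICAL rates in USD per million tokens, for illustration only.
PRICING: dict[str, Rates] = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}

CACHE_WRITE_5M = Decimal("1.25")  # 5-minute cache write multiplier (§8.2)
CACHE_READ = Decimal("0.1")       # cache read multiplier (§8.2)
PER_MILLION = Decimal(1_000_000)


def cost_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cache_write_tokens: int = 0,
    cache_read_tokens: int = 0,
    overrides: Optional[dict[str, Rates]] = None,
) -> Optional[Decimal]:
    """Return the call's cost, or None when the model has no known rate."""
    rates = (overrides or {}).get(model) or PRICING.get(model)
    if rates is None:
        return None  # stored as NULL; only the token metric is incremented
    weighted_input = (
        input_tokens
        + cache_write_tokens * CACHE_WRITE_5M
        + cache_read_tokens * CACHE_READ
    )
    return (weighted_input * rates["input"] + output_tokens * rates["output"]) / PER_MILLION
```

Decimal is used here instead of float so that many small per-call costs sum without drift; that choice is a suggestion, not something the doc prescribes.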

8.4 Model router

A 20-line table, not a framework:

ROUTER = {
    "classifier":   "claude-haiku-4-5",
    "verdict":      "claude-haiku-4-5",
    "planner":      "claude-sonnet-4-6",
    "tool_selector":"claude-sonnet-4-6",
    "summarizer":   "claude-sonnet-4-6",
}

Overridable per workspace in config. BYOK users can map these to any provider/model via LITELLM_MODEL_* env vars.
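A sketch of how a per-workspace override could be layered on top of the table. The function name `resolve_model` and the shape of the override mapping are assumptions.

```python
from typing import Mapping, Optional

ROUTER = {
    "classifier": "claude-haiku-4-5",
    "verdict": "claude-haiku-4-5",
    "planner": "claude-sonnet-4-6",
    "tool_selector": "claude-sonnet-4-6",
    "summarizer": "claude-sonnet-4-6",
}


def resolve_model(
    purpose: str, workspace_overrides: Optional[Mapping[str, str]] = None
) -> str:
    """Return the workspace's model for a purpose, else the default."""
    if purpose not in ROUTER:
        # Unknown purposes fail loudly instead of silently picking a model.
        raise KeyError(f"unknown purpose: {purpose!r}")
    return (workspace_overrides or {}).get(purpose, ROUTER[purpose])
```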


9. Ticketing adapters — source and sink

Every help desk adapter is both a source (new tickets) and a sink (write back results). The base contract:

class TicketAdapter(Protocol):
    async def poll(self, since: datetime) -> list[RawTicket]: ...
    async def subscribe_webhook(self, callback) -> WebhookHandle: ...     # optional
    def normalize(self, raw: RawTicket) -> Ticket: ...
    async def post_reply(self, ticket_id: str, body: str, *, private: bool) -> None: ...
    async def update_status(self, ticket_id: str, status: str) -> None: ...
    async def log_time_entry(self, ticket_id: str, minutes: int, note: str) -> None: ...  # MSP
    def capabilities(self) -> AdapterCapabilities: ...

Adapters in v0.1: Zoho Desk. Adapters in the v0.2-v0.4 window: HaloPSA, Autotask, ConnectWise, Zendesk, Linear, GitHub Issues, Jira SM, Freshdesk, Intercom, email.

Webhooks are preferred when available; polling is the fallback. The poller supports the "since cursor" pattern natively — it stores last_seen_external_id per source in the DB and asks each adapter "give me everything newer than this".

9.1 Canonical Ticket model

Ticket:
  id:           uuid7
  workspace_id: uuid7
  source_id:    fk → ticket_sources
  external_id:  string     # the source's native ID (ZD-1234, HPS-4871...)
  title:        string
  body:         text
  customer:     string     # free-form, e.g. "Hartwell Law — Kevin Reyes"
  requester_email: string?
  priority:     low | medium | high | critical
  status:       new | queued | investigating | auto_resolved | needs_tech | needs_client | needs_l2 | failed
  sla_at:       datetime?
  received_at:  datetime
  source_metadata: jsonb   # anything the adapter wants to preserve

This maps 1:1 to the existing persona prototypes' ticket shape. Migrations are avoided by keeping the superset.


10. Chat surface — widget, Slack, Teams, operator console

10.1 End-user chat widget

  • A React app bundled by Vite in library mode into a single JS file (gaby-widget.js, target <40 KB gzipped).
  • Mounted into a shadow DOM so the host site's CSS can't leak in or out.
  • Talks to /api/chat on the Gaby backend via fetch + SSE for streaming replies.
  • Themable via a single Gaby.init({ theme: { primary: '#0284c7', font: 'Inter' } }) call.

Abuse surface — rate limits and auth

The widget is public-facing. It must not become a free LLM token faucet. Controls:

| Layer | Limit | Rationale |
| --- | --- | --- |
| Per IP | 20 messages / minute, 200 / day | Hard cap before any backend work |
| Per widget session | 40 messages total, 15-minute idle timeout | Session-scoped envelope |
| Per workspace | Configurable daily budget in USD (default $50/day for chat) | Workspace owner sets the ceiling |
| Challenge | After 3 messages, an invisible Turnstile/hCaptcha challenge | Blocks bots without friction for humans |
| Auth options | Anonymous (rate-limited), host-provided JWT (verified by shared secret), logged-in user (via host's own auth) | Stricter auth → higher limits |

All limits are enforced before any LLM call is made. Rate-limit hits return 429 with a Retry-After header; the widget surfaces "I'm getting a lot of messages right now, please try again in a moment."
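One way the per-IP cap could be enforced is a sliding-window counter. The sketch below is in-process and illustrative only: the class name, the injectable clock, and the choice of a sliding window (rather than, say, a token bucket) are assumptions, and a multi-container deployment would need shared state.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within any `window_seconds` span."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds)."""
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            # The oldest hit leaving the window frees the next slot.
            return False, self.window - (now - hits[0])
        hits.append(now)
        return True, 0.0
```

Usage: keep one limiter per limit, for example `SlidingWindowLimiter(20, 60)` and `SlidingWindowLimiter(200, 86_400)` for the per-IP row, and send the returned delay as the Retry-After value on a 429.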

10.2 Session lifecycle

session.created ──user message──▶ session.active
                                       │ (Gaby responds, possibly multiple turns)
                                       ▼
                            can_auto_resolve? ──yes──▶ session.resolved
                                       │ no
                                       ▼
                            handoff_requested ──▶ session.handoff_pending
                                       │ operator accepts
                                       ▼
                              session.handoff_active
                                       │ operator closes
                                       ▼
                                session.closed
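The lifecycle above can be sketched as a transition table. The state names come from the diagram; the event names and the self-loop on `session.active` for multi-turn replies are assumptions.

```python
from enum import Enum


class SessionState(str, Enum):
    CREATED = "session.created"
    ACTIVE = "session.active"
    RESOLVED = "session.resolved"
    HANDOFF_PENDING = "session.handoff_pending"
    HANDOFF_ACTIVE = "session.handoff_active"
    CLOSED = "session.closed"


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (SessionState.CREATED, "user_message"): SessionState.ACTIVE,
    (SessionState.ACTIVE, "user_message"): SessionState.ACTIVE,
    (SessionState.ACTIVE, "auto_resolved"): SessionState.RESOLVED,
    (SessionState.ACTIVE, "handoff_requested"): SessionState.HANDOFF_PENDING,
    (SessionState.HANDOFF_PENDING, "operator_accepts"): SessionState.HANDOFF_ACTIVE,
    (SessionState.HANDOFF_ACTIVE, "operator_closes"): SessionState.CLOSED,
}


def transition(state: SessionState, event: str) -> SessionState:
    """Return the next state, or raise on a transition the diagram lacks."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.value} on {event!r}") from None
```

Keeping the table as data makes illegal transitions fail loudly, in line with the error-handling rule in §13.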

10.3 Handoff bundle

When Gaby escalates a chat to a human, the operator receives (in the operator console):

  • Full transcript so far (both user + Gaby)
  • Every tool call Gaby made, with arguments and redacted results
  • Citations used for any KB-backed claims
  • The current working memory snapshot
  • A one-sentence "why I couldn't resolve this" from the agent

The operator starts mid-flight, not cold. This is the single biggest satisfaction driver for the human chat surface.

10.4 Slack / Teams

  • Bolt-for-Python for Slack, Bot Framework for Teams.
  • Same session model, same handoff bundle.
  • Inbound in Slack is v0.3; v0.1 ships Slack outbound only for escalations.

11. Auth and identity (three surfaces)

| Surface | Mechanism | Session store |
| --- | --- | --- |
| Web UI (operators) | Session cookie (HttpOnly, SameSite=Lax) + CSRF token | sessions table |
| CLI / automation | API key (gaby-XXXX.YYYY), prefix + hashed remainder | api_keys table |
| End-user chat widget | Host-provided JWT (verified by a shared key) or anonymous token | chat_sessions table |
| Connector OAuth | Per-connector device-code flow for the ones that support it; API keys for the rest | encrypted connectors.config |
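A sketch of the "prefix + hashed remainder" scheme for API keys. The key layout `gaby-<prefix>.<secret>` follows the table; the prefix length, the use of SHA-256 (reasonable for high-entropy random secrets, unlike passwords), and both function names are assumptions.

```python
import hashlib
import hmac
import secrets
from typing import Callable, Optional


def generate_api_key() -> tuple[str, str, str]:
    """Return (full_key, prefix, digest). Only prefix and digest are stored."""
    prefix = secrets.token_hex(4)        # public lookup handle
    secret = secrets.token_urlsafe(32)   # shown to the user exactly once
    digest = hashlib.sha256(secret.encode()).hexdigest()
    return f"gaby-{prefix}.{secret}", prefix, digest


def verify_api_key(
    presented: str, lookup_digest: Callable[[str], Optional[str]]
) -> bool:
    """Check a presented key against the digest stored for its prefix."""
    if not presented.startswith("gaby-") or "." not in presented:
        return False
    prefix, _, secret = presented[len("gaby-"):].partition(".")
    stored = lookup_digest(prefix)
    if stored is None:
        return False
    candidate = hashlib.sha256(secret.encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(candidate, stored)
```

The prefix lets the server find the row without scanning, and a leaked api_keys table does not reveal usable keys.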

First-run bootstrap

On first boot, Gaby generates a one-time admin provisioning URL printed to stdout (and to a file if running headless). Opening it creates the first admin user. This URL expires in 15 minutes. After first use, Gaby refuses to issue another unless the DB is wiped — no silent "admin/admin" defaults.

SSO / SAML / SCIM

Enterprise Edition feature. Implemented via authlib + SAML2, behind a feature flag keyed to the license.


12. Connector contract — the testable promise

Every connector must pass these at pytest time:

| Test | Assertion |
| --- | --- |
| test_initialize | Responds to initialize MCP request within 2s |
| test_tools_list | Returns a tool list with every tool carrying scope and description |
| test_tool_scopes_wellformed | Every tool's scope |
| test_dangerous_flagged | Any destructive tool has dangerous: true |
| test_dry_run_supported | Every write tool supports a dry_run=true argument |
| test_healthcheck | Responds to the healthcheck tool |
| test_redaction_noleak | Tool results never echo back secrets passed in args (paranoia check) |
| test_large_result_truncated | Results over 100 KB are truncated (or paged) with a truncation marker |

Contract tests live under connectors/_contract/ and are re-run against every first-party and community connector in CI.


13. Error handling philosophy

One rule: fail loud, fail early, degrade only after explicit design.

| Category | Handling |
| --- | --- |
| Transient (network blips, 5xx) | Retry with exponential backoff + jitter, max 3 attempts, circuit breaker per endpoint |
| Permanent (4xx, auth expired) | No retry. Investigation enters needs_tech with a specific error. |
| Budget exceeded | Investigation → failed_budget, escalate. |
| Scope denied | Tool call never runs. Audit as action.denied. Agent plans an alternative. |
| LLM refuses or returns garbage | Retry once with a clarifying instruction. Then escalate. |
| MCP connector crash mid-call | Investigation pauses, connector is restarted, call retried once. Then escalate. |
| Unrecoverable internal bug | Investigation → failed, full stack trace in audit, operator notified. |

No bare except: anywhere in the codebase. Enforced by ruff's built-in bare-except rule (E722).
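The transient-error row can be sketched as a small async helper: exponential backoff with full jitter, at most 3 attempts. The names `TransientError` and `retry_transient`, the delay constants, and the injectable `sleep` are assumptions; the per-endpoint circuit breaker is deliberately left out of this sketch.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (network blips, 5xx)."""


async def retry_transient(
    call: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `call`, retrying only TransientError; anything else propagates."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: fail loud
            # Full jitter: a random delay up to the capped exponential bound.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(random.uniform(0, bound))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Only the explicitly retryable exception type is caught, which keeps the helper consistent with the no-bare-except rule.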


14. Caching layers

| Cache | Scope | TTL | Invalidation trigger |
| --- | --- | --- | --- |
| LLM prompt cache | Per Anthropic cache key | Anthropic-managed (~5 min) | Automatic |
| Tool manifest cache | Per connector version | process lifetime | Connector restart |
| Retrieval result cache | Per (query_hash, corpus_version) | 10 min | KB re-ingest bumps corpus_version |
| Embedding cache | Per (text_hash, model) | permanent | Model change invalidates by key |
| Session ticket cache | Per ticket_id, within one investigation | investigation lifetime | End of investigation |
| OpenAPI doc cache | Per build | process lifetime | Rebuild |

15. Deployment topologies

15.1 Founder quickstart (v0.1 default)

┌────────────────────────────────┐
│      docker compose up         │
│                                │
│  ┌──────────┐   ┌──────────┐   │
│  │  gaby    │   │  gaby    │   │
│  │ backend  │◀─▶│   web    │   │
│  │  + SQLite│   │  static  │   │
│  └──────────┘   └──────────┘   │
│         ▲                      │
│         │                      │
│  ┌──────┴───────┐              │
│  │  MCP servers │              │
│  │  (subproc)   │              │
│  └──────────────┘              │
└────────────────────────────────┘
  • 2 containers, embedded SQLite, in-process worker
  • Total RAM: ~512 MB
  • Total persistent volumes: 1 (the SQLite DB + KB index)

15.2 MSP / scaled (v0.2)

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  gaby       │   │  gaby       │   │  gaby       │
│  backend    │   │  backend    │   │  worker     │
│             │   │             │   │   (arq)     │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └────────┬────────┴────────┬────────┘
                │                 │
                ▼                 ▼
         ┌─────────────┐   ┌─────────────┐
         │  Postgres   │   │   Redis     │
         │  + pgvector │   │             │
         └─────────────┘   └─────────────┘

15.3 Enterprise / air-gapped (EE feature, later)

  • Air-gapped registry (images mirrored)
  • External secrets provider required
  • OIDC/SAML enforced
  • SIEM export of audit log enabled
  • Local LLM (vLLM or Ollama) for data-sensitive workspaces

16. Observability contracts

| Signal | Carrier | Required fields |
| --- | --- | --- |
| Log | structlog → JSON stdout | ts, level, service, request_id, workspace_id, investigation_id?, event, payload |
| Trace span | OTel span | same attributes as logs + span.kind, span.status |
| Metric | Prometheus counter/histogram | gaby_* prefix, labels: workspace, connector, model, outcome |

Canonical metrics (v0.1 must emit these):

| Metric | Type | Why |
| --- | --- | --- |
| gaby_investigations_total | counter | Throughput |
| gaby_investigations_duration_seconds | histogram | Latency p50/p95/p99 |
| gaby_investigations_verdict_total | counter | By verdict label (auto-resolution %) |
| gaby_llm_tokens_total | counter | By model + by purpose (planner/verdict) |
| gaby_llm_cost_usd_total | counter | By model |
| gaby_connector_calls_total | counter | By connector + by tool + by status |
| gaby_connector_health | gauge | 0/1 per connector |
| gaby_safety_denials_total | counter | Every denial is visible |
| gaby_approvals_pending | gauge | Operator queue depth |
| gaby_chat_sessions_total | counter | By channel (widget/slack/teams), by outcome |
| gaby_chat_handoffs_total | counter | Gaby → human takeovers |
| gaby_rate_limit_rejections_total | counter | By surface (widget/api) |
| gaby_ticket_queue_depth | gauge | DB poll result |
| gaby_retrieval_hit_rate | gauge | Share of citations ultimately used in the summary |

Every metric above ships with a Grafana dashboard JSON in docs/operations/dashboards/.


17. Failure modes — explicit list

| Failure | Detection | Response |
| --- | --- | --- |
| LLM provider down | HTTP error from provider | Router tries fallback provider; if none, escalate |
| LLM budget exhausted | BudgetGuard pre-check | Investigation halts with failed_budget |
| Connector subprocess crash | asyncio.subprocess.returncode | Supervised restart; after 5 failures → DEGRADED |
| Connector OAuth expired | 401/403 from tool call | Pause investigation, send re-auth link to admin via escalation |
| Help desk webhook delivery fail | Missing poll cursor gap | Poller fallback closes the gap |
| Database unavailable | SQLAlchemy pool exhaustion | API returns 503; backoff; alert |
| Redis unavailable (when running arq) | redis-py ConnectionError | Worker pauses; in-process fallback if enabled |
| Vector index corruption | Query returns 0 results when FTS returns >0 | Automatic reindex on next KB sync; alert |
| Disk full | SQLite write error | API returns 503; alert |
| Malicious PII in ticket body | Redaction rule | Redact before LLM, record the original in encrypted column |
| Runaway agent loop | max_iterations cap (20) | Escalate with failed_budget |

18. Open architectural questions (non-blocking for v0.1)

These are worth watching but don't hold v0.1:

  1. Streaming the investigation timeline to the UI over SSE vs WebSocket vs polling? Plan: SSE (simpler infra, one-way fits the model).
  2. Multi-region replication for the managed cloud. Defer to v0.5.
  3. Reasoning models for the planner. Worth evaluating on the eval harness before v0.2.
  4. Embedded vLLM for privacy-sensitive workspaces. Planned for v0.3 alongside the EE air-gapped mode.

Previously here: "Cross-investigation memory — defer to v0.3." Now designed — see §22 Memory hierarchy. Long-term memory starts accumulating on day one of v0.1 via SQLiteMemoryGraph.


19. Cross-document index

  • SPEC.md §6.5 — Safety model. This doc §6 is the implementation.
  • SPEC.md §6.6 — Ticket sources. This doc §9 is the adapter contract.
  • SPEC.md §6.4 — LLM layer. This doc §8 is the implementation.
  • FOUNDATION.md §1.1 — Stack choices. This doc §4, §8, §12 are the consequences.
  • FOUNDATION.md §3 — Data model. This doc §2, §6 are how they're used at runtime.

20. Definition of done for this document

This doc is "done enough to build v0.1 against" when:

  • [x] Every component in the System Map (§1) has its own section in this doc
  • [x] The agent loop state machine (§3) is explicit, with terminal states and budget rules
  • [x] The safety pipeline (§6) has a numbered order of operations and a scope DSL sketch
  • [x] Every failure mode we can think of today is listed (§17)
  • [x] Observability and metrics are concrete (§16)
  • [x] The v0.1 deployment topology is drawn (§15.1)
  • [x] Cross-links to SPEC.md and FOUNDATION.md are in place (§19)
  • [x] Self-critique pass completed (added §2.1, §4.5–4.6, resume semantics in §3, §5.4–5.5, rate limits in §10.1, cost mapping in §8.3.1, classifier wiring in §2, extra metrics in §16)
  • [x] Online reference validation pass completed — see §21

21. Reference validation — pass dated 2026-04-11

Every load-bearing technology choice was verified against current (2026) online references during architecture review. Summary:

| Area | Finding | Architecture impact |
| --- | --- | --- |
| MCP transports | Spec 2025-03-26 introduced Streamable HTTP as the remote transport, superseding SSE. Connection recovery via Last-Event-ID header. Python SDK supports both stdio and Streamable HTTP with a unified client. | §5.1 locked: stdio subprocess for first-party in-image connectors, Streamable HTTP for remote/community servers. Confirmed current. |
| Agent frameworks | pydantic-ai shows meaningful advantages over LangGraph in 2026 benchmarks (~44% lower P95, ~5× fewer errors, ~2.7× lower tokens). Ships pydantic-graph with durable execution and HITL that map onto our WAITING_APPROVAL state. | §3: homegrown loop is still the v0.1 choice (safety on the critical path); pydantic-ai is the locked fallback if homegrown stalls, with a re-eval gate after v0.1 ships. |
| Anthropic prompt caching | Order tools → system → messages. Sonnet min 1,024 tokens / Haiku min 4,096. Max 4 breakpoints per request. 5-min default TTL, 1-hour at 2× base, cache reads at 0.1×. Workspace-level isolation since 2026-02-05. | §8.2: the 4 breakpoints are now explicit (tools, system, KB chunks, messages) with the minimum-size caveats called out. |
| LiteLLM | Proxy has production issues (GIL throughput, DB logging degradation). March 2026 supply-chain attack on PyPI versions 1.82.7 and 1.82.8. 800+ open issues. Bifrost and Portkey are the production-grade alternatives. | §8.1: we use the SDK only, not the proxy; hot paths go direct to Anthropic/OpenAI; pin versions; install with hash verification; Bifrost/Portkey noted as v0.2+ migration targets if BYOK volume warrants. |
| sqlite-vec | Mozilla Builders project, pure C, production-stable, but brute-force search only (no ANN). Fine for small KBs, does not scale. vectorlite is 3–30× faster with ANN; pgvector HNSW is the Postgres story. | §7.1: added a scale cliff table and explicit migration triggers at 5k and 50k chunks. Embedding blob schema is dim-agnostic to keep migration cheap. |
| uv | Production/Stable status on PyPI. Community consensus in 2026 trending heavily toward uv. 10–100× faster than pip. Drop-in replacement. | FOUNDATION.md §1.1 locked uv; confirmed current. |
| FastAPI + SQLAlchemy 2 async | Modern production default is pool_size=20, max_overflow=10 for Postgres (not 5). Use asyncpg driver. Dependency-injected session lifecycle. | §4.2 updated: defaults split between SQLite (small) and Postgres (larger) profiles. |
| Tailwind 4 + shadcn/ui + React 19 + Vite + RR7 | Fully compatible and production-ready. Migration notes: use data-slot attribute, React.ComponentProps instead of forwardRef. @tailwindcss/vite plugin is the install path. | FOUNDATION.md §1.2 locked; confirmed current. |
| Biome | 10–25× faster than ESLint+Prettier; covers ~80% of ESLint rules; doesn't support eslint-plugin-react-hooks (type-aware rules require the TS language service). | FOUNDATION.md §1.2 needs a note: Biome primary, keep ESLint running for react-hooks only until Biome closes the gap. Added to the plan. |
| LLM eval tooling | promptfoo is used by Anthropic and OpenAI themselves for prompt regression; Inspect AI (UK AISI) is specifically designed for agent evaluation with tool calls and model-graded rubrics. | FOUNDATION.md §4.1 updated direction: promptfoo in v0.1 for prompt regression; Inspect AI evaluated for v0.2 for full agent evals with tool calls. |

What I did not find any reason to change: FastAPI as web framework, asyncio as concurrency model, Alembic for migrations, arq for background jobs (with in-process fallback), structlog for logging, OpenTelemetry for tracing, pnpm for JS, Vite for bundling, Vitest+Playwright for tests, MkDocs Material for docs, MCP Python SDK as the connector protocol library, DCO over CLA for contributions, Apache 2.0 core + commercial EE for licensing.

Next validation pass: after v0.1 ships, re-run this research with the same queries and diff. Anything that has moved >1 major version or lost community traction goes on the v0.2 re-evaluation list.


22. Memory hierarchy — how Gaby gets smarter over time

TL;DR. Three memory tiers (short / medium / long) feed the planner through a bounded context envelope on every LLM call. Long-term memory is a graph-shaped model stored behind a MemoryGraph protocol. The v0.1 default backend is SQLite (two tables, recursive CTE traversals). Two opt-in backends ship stubs in Iter 0 and full implementations in Iter 4: Apache AGE (Postgres extension, the recommended graph-native path) and FalkorDBLite (embedded Cypher, Apache 2.0). Long-term memory starts accumulating on day one regardless of backend choice — the data written through the protocol is the data you migrate later.

22.1 The three tiers

                     ┌───────────────────────────────────────────────────────┐
                     │                      SHORT-TERM                       │
                     │    scope:    one investigation                        │
                     │    lifetime: seconds–minutes                          │
                     │    storage:  in-proc WorkingMemory + jsonb snapshot   │
                     │    gate:     redaction on the LLM boundary            │
                     │    purpose:  let the loop reason + crash-resume       │
                     └───────────────────────────────────────────────────────┘
                                            │ feeds upward on success
                                            │ (via kb_candidates staging)
                     ┌───────────────────────────────────────────────────────┐
                     │                      MEDIUM-TERM                      │
                     │    scope:    workspace                                │
                     │    lifetime: hours–days                               │
                     │    storage:  TTL queries + in-proc LRUs + jsonb       │
                     │    gate:     automatic (cache-like, not "learning")   │
                     │    purpose:  avoid rework, spot bursts, context-swap  │
                     └───────────────────────────────────────────────────────┘
                                            │ promoted via human review
                     ┌───────────────────────────────────────────────────────┐
                     │                       LONG-TERM                       │
                     │    scope:    workspace                                │
                     │    lifetime: indefinite                               │
                     │    storage:  MemoryGraph (nodes + edges) + documents  │
                     │    gate:     100% human-in-the-loop, no silent writes │
                     │    purpose:  the product getting smarter              │
                     └───────────────────────────────────────────────────────┘

22.2 Short-term (within one investigation)

Already designed in §2.1. WorkingMemory { ticket, messages, tool_calls, retrieved_chunks, budget_state } lives in memory during the loop and is snapshotted to investigations.working_memory_snapshot at every state-machine transition so a crash can resume. PII is redacted before anything crosses into an LLM call. Short-term memory is private to its investigation — it is never read by a different investigation.

22.3 Medium-term (cache-like, automatic)

Four operational stores. None of them "learn" — they are all caches with hard TTLs. They are written automatically and read by the planner as hints.

| Store | Shape | TTL | Purpose |
| --- | --- | --- | --- |
| Recent-tickets window | A query on tickets (WHERE received_at > now() - interval '24h'), not a new table | 24 h rolling | Dedup (same thing just resolved), burst detection ("4 customers just hit the same VPN error: this is an upstream incident, not 4 individual problems") |
| Connector-result cache | Process-local LRU keyed on (workspace_id, connector_id, tool_name, canonical(args)); Redis-backed when arq is on | 60 s for reads, 0 s for writes | "I just SELECT-ed this users table 15 s ago during this investigation, don't re-query" |
| Operator session notes | jsonb column on the existing sessions table | Session lifetime | "The operator just approved this kind of action, so don't re-prompt them for the rest of their session" |
| KB candidate staging | kb_candidates table: entries awaiting human review, visible in the approval queue UI | 30 days → auto-archive if unreviewed | The bridge from auto-resolved investigations to long-term KB. Auto-written when verdict = auto_resolved and the quality gate passes. |

22.4 Long-term (the graph memory — the product learning layer)

Two stores with different access patterns:

| Store | What it holds | How it's created | How it's used | How it's forgotten |
| --- | --- | --- | --- | --- |
| Verified KB entries (documents + document_chunks tables from Iter 2's knowledge pipeline) | Runbooks, past-ticket resolutions promoted from kb_candidates, manually-added Markdown | Human accepts/edits a candidate through the approval queue UI | Retrieval pipeline in §7 (hybrid BM25 + vector → top-6 with citations) | Explicit delete; stale-content detection after 6 months of non-use |
| Memory graph (memory_nodes + memory_edges behind the MemoryGraph protocol) | Entities (customers, users, systems, connectors, tickets, investigations, facts, observations, resolutions) and their typed relationships | Operator clicks "Remember this" on an investigation step, OR Gaby proposes after N≥3 similar observations and the operator accepts | Planner envelope at every LLM call: neighbors(ticket.customer, depth=1) loads applicable facts and observations | Explicit delete; status='archived' for unused nodes after 90 days; GDPR forget_subject() for hard compliance removal |

22.5 The node label set (POLE+O, domain-adapted)

Borrowed from neo4j-labs/agent-memory's POLE+O model (Persons, Objects, Locations, Events, Observations) and adapted to Gaby's domain:

| Label | Meaning | Example natural_key |
| --- | --- | --- |
| customer | A company / client / account receiving support | customer:hartwell-law |
| user | An end-user of a customer (the person who opened a ticket) | user:kevin.reyes@hartwelllaw.com |
| system | An application, service, or piece of infrastructure | system:keycloak-prod, system:stripe-webhooks |
| connector | A configured MCP connector instance | connector:postgres-main |
| ticket | A canonical ticket (one node per external ticket) | ticket:zoho:ZD-8891 |
| investigation | An investigation Gaby ran | investigation:inv_01HXYZ... |
| fact | An atomic piece of knowledge (Observations in POLE+O) | fact:hartwell-legacy-pop3 |
| observation | A time-stamped occurrence ("this happened at this time") | observation:mfa-lockout@kevin@2026-02-20 |
| resolution | A resolution pattern that worked | resolution:clear-stale-keycloak-sessions |

Labels are not exhaustive — new labels can be added in v0.2+ without a schema migration (they're just a string column), but these nine cover the v0.1 Founder persona completely.

22.6 The typed relations (seven categories)

Borrowed from memory-graph/memory-graph's seven-category relationship model. Edges carry a relation column plus a free-form properties jsonb.

| Category | Relations |
| --- | --- |
| Causal | CAUSES, TRIGGERS, LEADS_TO, PREVENTS |
| Solution | SOLVES, ADDRESSES, ALTERNATIVE_TO, IMPROVES |
| Context | OCCURS_IN, APPLIES_TO, WORKS_WITH, REQUIRES |
| Learning | BUILDS_ON, CONTRADICTS, CONFIRMS |
| Similarity | SIMILAR_TO, VARIANT_OF, RELATED_TO |
| Workflow | FOLLOWS, DEPENDS_ON, ENABLES, BLOCKS |
| Quality | EFFECTIVE_FOR, PREFERRED_OVER, DEPRECATED_BY |

Relations are an enum in code, but the DB column is a string so v0.2+ can add relations without a migration. Every backend implementation validates relation strings against the enum at write time.
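A sketch of the enum-in-code, string-in-DB arrangement, built from the seven categories above. The names `RELATION_CATEGORIES`, `Relation`, and `validate_relation` are assumptions.

```python
from enum import Enum

RELATION_CATEGORIES = {
    "causal": ["CAUSES", "TRIGGERS", "LEADS_TO", "PREVENTS"],
    "solution": ["SOLVES", "ADDRESSES", "ALTERNATIVE_TO", "IMPROVES"],
    "context": ["OCCURS_IN", "APPLIES_TO", "WORKS_WITH", "REQUIRES"],
    "learning": ["BUILDS_ON", "CONTRADICTS", "CONFIRMS"],
    "similarity": ["SIMILAR_TO", "VARIANT_OF", "RELATED_TO"],
    "workflow": ["FOLLOWS", "DEPENDS_ON", "ENABLES", "BLOCKS"],
    "quality": ["EFFECTIVE_FOR", "PREFERRED_OVER", "DEPRECATED_BY"],
}

# A str-valued enum: members compare equal to the plain strings that are
# stored in the DB column.
Relation = Enum(
    "Relation",
    {name: name for names in RELATION_CATEGORIES.values() for name in names},
    type=str,
)


def validate_relation(relation: str) -> Relation:
    """Write-time check: reject any relation string outside the enum."""
    try:
        return Relation(relation)
    except ValueError:
        raise ValueError(f"unknown relation: {relation!r}") from None
```

Adding a relation in v0.2+ then means adding one string to the category table, with no schema migration.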

22.7 The MemoryGraph protocol — the contract

Every backend implements exactly these 11 methods. The small surface is deliberate: smaller surface = easier plug-and-play + easier verification via the 3-backend round-trip test in Iter 4.

from typing import Protocol, Literal

class MemoryGraph(Protocol):
    # ---- Writes (every call requires workspace_id) ----
    async def upsert_node(
        self,
        workspace_id: WorkspaceId,
        label: str,            # one of the 9 node labels above
        natural_key: str,      # unique within (workspace_id, label)
        properties: dict,
        provenance: Literal["operator", "proposed", "imported"],
        status: Literal["provisional", "active", "archived"] = "provisional",
    ) -> NodeId: ...

    async def upsert_edge(
        self,
        workspace_id: WorkspaceId,
        from_id: NodeId,
        to_id: NodeId,
        relation: str,         # one of the ~25 typed relations above
        weight: float = 1.0,
        properties: dict | None = None,
        observed_at: datetime | None = None,
    ) -> EdgeId: ...

    async def mark_archived(self, workspace_id: WorkspaceId, node_id: NodeId) -> None: ...
    async def forget_subject(
        self, workspace_id: WorkspaceId, subject_natural_key: str
    ) -> ForgetReport: ...

    # ---- Reads ----
    async def get_node(
        self, workspace_id: WorkspaceId, label: str, natural_key: str
    ) -> Node | None: ...

    async def neighbors(
        self,
        workspace_id: WorkspaceId,
        node: NodeId,
        *,
        depth: int = 1,                         # backend may cap at its comfort zone
        relations: list[str] | None = None,
        limit: int = 10,
    ) -> list[tuple[Node, Edge]]: ...

    async def path(
        self,
        workspace_id: WorkspaceId,
        from_id: NodeId,
        to_id: NodeId,
        max_depth: int = 3,
    ) -> list[Edge] | None: ...

    async def query_by_labels(
        self,
        workspace_id: WorkspaceId,
        labels: list[str],
        limit: int = 50,
    ) -> list[Node]: ...

    # ---- Admin ----
    async def healthcheck(self) -> Health: ...

    # ---- Migration — non-negotiable for every backend ----
    async def export_all(
        self, workspace_id: WorkspaceId
    ) -> AsyncIterator[NodeDump | EdgeDump]: ...
    async def import_all(
        self, workspace_id: WorkspaceId, stream: AsyncIterator[NodeDump | EdgeDump]
    ) -> int: ...

The export_all / import_all pair is the migration contract. SQLite → Postgres+AGE, SQLite → FalkorDBLite, FalkorDBLite → Postgres+AGE — all the same operation: export_all from source, stream through import_all on destination. Iter 4's test suite round-trips the same fixture through all three backends and asserts byte-identical dumps.
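The migration operation can be sketched with a toy backend that implements only the two migration methods. `InMemoryGraph` and `migrate` are hypothetical names, plain dicts stand in for `NodeDump | EdgeDump`, and `export_all` is written as an async generator, which is one possible reading of the protocol signature.

```python
import asyncio
from typing import Any, AsyncIterator

Record = dict[str, Any]  # stand-in for NodeDump | EdgeDump


class InMemoryGraph:
    """Toy backend: just enough surface to demonstrate a migration."""

    def __init__(self) -> None:
        self._records: dict[str, list[Record]] = {}

    async def export_all(self, workspace_id: str) -> AsyncIterator[Record]:
        for record in self._records.get(workspace_id, []):
            yield record

    async def import_all(
        self, workspace_id: str, stream: AsyncIterator[Record]
    ) -> int:
        bucket = self._records.setdefault(workspace_id, [])
        count = 0
        async for record in stream:
            bucket.append(record)
            count += 1
        return count


async def migrate(source: Any, dest: Any, workspace_id: str) -> int:
    """Stream one workspace from source to dest; return records copied."""
    return await dest.import_all(workspace_id, source.export_all(workspace_id))


async def collect(graph: InMemoryGraph, workspace_id: str) -> list[Record]:
    return [record async for record in graph.export_all(workspace_id)]
```

Because the destination consumes the source's stream directly, the full dump never has to sit in memory at once.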

22.8 The three backends shipped in v0.1

| Backend | Shipped | When to use | Implementation notes |
| --- | --- | --- | --- |
| SQLiteMemoryGraph | Default in v0.1. Full implementation lands in Iter 0 (tables + protocol surface) and Iter 4 (all methods). | Everyone using docker compose up without a profile flag. Works up to low-5-figure node counts with acceptable query latency. | Two tables (memory_nodes, memory_edges), SQLite FTS5 on node properties.text for label-scoped search, recursive CTEs for neighbors(), capped at depth 2 (higher depths raise DepthNotSupported). |
| PostgresAGEMemoryGraph | Stub in Iter 0 (so factory + config compile), full implementation in Iter 4. | Users who want graph-native memory from day one AND are happy to run Postgres. Recommended graph-native path. | Uses the Apache AGE extension on the same Postgres instance that holds the rest of Gaby's data. Nodes/edges become AGE vertices/edges. Traversals use Cypher via the AGE cypher(...) SQL function. depth can go to 5+ without pain. |
| FalkorDBLiteMemoryGraph | Stub in Iter 0, full implementation in Iter 4. | Users who want graph-native memory embedded (no extra container) AND are comfortable with a Beta embedding layer on top of a production-stable engine. Secondary graph-native path. | falkordblite Python package (Apache 2.0) spawns a local FalkorDB inside the backend container. Cypher queries via the falkordb-py client. Same API in embedded and server modes, so "switch to production FalkorDB" is one config change. |

Expressly not shipped in v0.1: Neo4j (JVM cost per §22.10), SurrealDB (license status flagged for verification, not confirmed), Cozo (pre-1.0, maintainers don't promise storage compatibility), Kuzu (archived 2025-10-10).
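For the default SQLite backend, a depth-capped neighbors traversal with a recursive CTE might look like the sketch below. The column layout of memory_nodes and memory_edges is an assumption (only the table names appear in this doc), the traversal follows outgoing edges only, and the function is synchronous for brevity.

```python
import sqlite3

MAX_DEPTH = 2  # the SQLite backend's traversal cap


class DepthNotSupported(ValueError):
    """Raised when a traversal asks for more hops than this backend allows."""


# Assumed minimal schema; the real tables carry more columns.
SCHEMA = """
CREATE TABLE memory_nodes (
    id INTEGER PRIMARY KEY,
    workspace_id TEXT NOT NULL,
    label TEXT NOT NULL,
    natural_key TEXT NOT NULL,
    UNIQUE (workspace_id, label, natural_key)
);
CREATE TABLE memory_edges (
    id INTEGER PRIMARY KEY,
    workspace_id TEXT NOT NULL,
    from_id INTEGER NOT NULL REFERENCES memory_nodes(id),
    to_id INTEGER NOT NULL REFERENCES memory_nodes(id),
    relation TEXT NOT NULL
);
"""

# UNION (not UNION ALL) de-duplicates rows, and the depth bound guarantees
# termination even when the graph contains cycles.
NEIGHBORS_SQL = """
WITH RECURSIVE walk(node_id, depth) AS (
    SELECT :start, 0
    UNION
    SELECT e.to_id, w.depth + 1
    FROM walk AS w
    JOIN memory_edges AS e ON e.from_id = w.node_id
    WHERE e.workspace_id = :ws AND w.depth < :max_depth
)
SELECT n.natural_key, MIN(w.depth) AS hops
FROM walk AS w
JOIN memory_nodes AS n ON n.id = w.node_id
WHERE n.workspace_id = :ws AND w.depth > 0 AND n.id != :start
GROUP BY n.id, n.natural_key
ORDER BY hops, n.natural_key
LIMIT :limit
"""


def neighbors(
    conn: sqlite3.Connection,
    workspace_id: str,
    start_id: int,
    depth: int = 1,
    limit: int = 10,
) -> list[tuple[str, int]]:
    """Return (natural_key, hops) for nodes reachable within `depth` hops."""
    if not 1 <= depth <= MAX_DEPTH:
        raise DepthNotSupported(f"depth {depth} is outside 1..{MAX_DEPTH}")
    params = {"start": start_id, "ws": workspace_id, "max_depth": depth, "limit": limit}
    return conn.execute(NEIGHBORS_SQL, params).fetchall()
```

The workspace filter appears on both the edge join and the node join, so an edge or node from another workspace can never enter the result.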

22.9 The planner context envelope

The planner does NOT receive all the memory — that would blow the context window and the budget. It receives a bounded envelope assembled at the start of every LLM call:

planner_input = {
    system_prompt,                           ← cache breakpoint 1 (persona + safety)
    tool_manifest,                           ← cache breakpoint 2 (connectors)
    envelope: {
        retrieved_kb_chunks[0..6],            ← long-term vector   (hybrid retrieval)
        applicable_facts[0..10],              ← long-term graph    (MemoryGraph.neighbors)
        recent_similar_tickets[0..3],         ← medium-term query  (tickets table window)
        connector_recent_results,             ← medium-term cache  (in-proc LRU)
    },                                       ← cache breakpoint 3 (stable during investigation)
    working_memory.messages,                  ← short-term (accumulator)
}                                             ← cache breakpoint 4

Breakpoint 3 is where cache economics matter most: the envelope is stable for an entire investigation, so every planner call after the first reads it at 0.1× base price (per Anthropic prompt cache rules in §8.2), provided the cache entry has not expired between calls.
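
A back-of-the-envelope calculation shows why this matters. The 0.1× read multiplier is the figure quoted above; the 1.25× cache-write multiplier, the 5,000-token envelope, and the 8-call investigation are assumptions chosen for illustration, and the sketch assumes the cache stays warm for the whole investigation.

```python
def envelope_cost(
    calls: int,
    envelope_tokens: int,
    cached: bool,
    write_mult: float = 1.25,  # assumed cache-write multiplier
    read_mult: float = 0.1,    # cache-read multiplier quoted in the text
) -> float:
    """Input cost of the envelope, in units of 'tokens at base price'."""
    if calls <= 0:
        return 0.0
    if not cached:
        # Every call pays full price for the whole envelope.
        return float(calls * envelope_tokens)
    # The first call writes the cache; each later call reads it.
    return envelope_tokens * (write_mult + read_mult * (calls - 1))


uncached = envelope_cost(calls=8, envelope_tokens=5000, cached=False)
cached = envelope_cost(calls=8, envelope_tokens=5000, cached=True)
savings_ratio = 1 - cached / uncached
```

Under these assumptions the uncached envelope costs 40,000 base-price tokens over the investigation, while the cached one costs about 9,750, a saving of roughly three quarters. The saving grows with the number of planner calls per investigation.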

The envelope's long-term-graph slot is filled by:

facts = await memory.neighbors(
    workspace_id,
    node=await memory.get_node(workspace_id, "customer", ticket.customer_natural_key),
    depth=1,
    relations=["HAS_FACT", "PREFERS", "USES", "WORKS_WITH"],
    limit=10,
)

Bounds (6 knowledge-base chunks, 10 facts, 3 recent tickets) are configurable per workspace, but the defaults are deliberate: they keep the envelope under ~5k tokens in the common case, which is safely below every v0.1 model's context window and inside the cache discount band.
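
One way to picture the bounds is as a small per-workspace settings object applied when the envelope is assembled. The slot names below follow the planner_input layout above; `EnvelopeBounds`, `build_envelope`, and the assumption that each input list arrives already ranked best-first are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvelopeBounds:
    """Hypothetical per-workspace limits; defaults mirror the documented ones."""
    kb_chunks: int = 6
    facts: int = 10
    recent_tickets: int = 3


def build_envelope(kb_chunks, facts, recent_tickets, connector_results,
                   bounds=EnvelopeBounds()):
    # Inputs are assumed to be ranked best-first, so truncation keeps the top items.
    return {
        "retrieved_kb_chunks": list(kb_chunks)[: bounds.kb_chunks],
        "applicable_facts": list(facts)[: bounds.facts],
        "recent_similar_tickets": list(recent_tickets)[: bounds.recent_tickets],
        "connector_recent_results": list(connector_results),
    }


# Default bounds versus a tighter, workspace-specific override.
default_env = build_envelope(range(20), range(20), range(20), ["crm:ok"])
tight_env = build_envelope(range(20), range(20), range(20), [],
                           bounds=EnvelopeBounds(kb_chunks=2, facts=4))
```

Keeping the limits in one immutable object means a workspace override changes only the numbers, never the shape of the envelope, which helps keep the cached prefix layout stable.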

22.10 Governance — the hard rules

  1. No silent writes to long-term memory. Every node that reaches status='active' goes through a human gate. Gaby may write status='provisional' nodes autonomously after N≥3 confirming observations, but only an operator promotes them.
  2. Workspace isolation is enforced at the query layer, not merely documented. Every upsert_*, neighbors, path, query_by_labels call takes workspace_id as a required positional argument. No "global traversal" method exists; no default-workspace fallback exists. Cross-workspace traversal requires a dedicated admin API that does not exist in v0.1.
  3. PII redaction happens on the way IN. A node or edge stores redacted text. Raw text is preserved only in the encrypted tickets.body_encrypted column for audit.
  4. Forgetting is a first-class operation. forget_subject() purges medium-term buffers AND cascades status='archived' across the graph for the subject AND records an audit event. GDPR compliance is not retrofitted; it's designed in.
  5. Stale content is demoted, not deleted. Nodes unused for the configured threshold (default: facts 90 d, KB 180 d) get status='archived'. Archived nodes are hidden from planner queries but preserved for audit reconstruction.
  6. Provenance is never lost. Every node carries provenance ∈ {operator, proposed, imported}. The UI shows this value next to each fact in the planner envelope, so operators can tell which facts they entered and which Gaby proposed.
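
Rules 1 and 2 can be sketched as a small status gate. The status values 'provisional' and 'active', the N≥3 threshold, and the provenance values come from the rules above; the `Node` shape, the 'candidate' starting status, the role string, and both function names are assumptions for illustration.

```python
from dataclasses import dataclass

MIN_OBSERVATIONS = 3  # "N≥3" from rule 1


@dataclass
class Node:
    workspace_id: str
    key: str
    status: str = "candidate"      # hypothetical pre-provisional state
    provenance: str = "proposed"   # one of: operator, proposed, imported


def propose(node: Node, confirming_observations: int) -> Node:
    """Agent path: can reach 'provisional', never 'active'."""
    if confirming_observations >= MIN_OBSERVATIONS:
        node.status = "provisional"
    return node


def promote(node: Node, workspace_id: str, actor_role: str) -> Node:
    """Human gate: only an operator acting in the node's workspace may activate it."""
    if actor_role != "operator":
        raise PermissionError("only an operator may promote a node to active")
    if node.workspace_id != workspace_id:
        raise PermissionError("cross-workspace promotion is not allowed")
    if node.status != "provisional":
        raise ValueError(f"cannot promote a node in status {node.status!r}")
    node.status = "active"
    return node
```

Note that workspace_id is a required positional argument of `promote`, with no default, which mirrors how rule 2 makes isolation a property of the call signature rather than a convention.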

22.11 Why not a single graph DB?

Short version: because the MemoryGraph protocol is the commitment; the storage engine is an implementation detail. The full trade-off discussion is in the commit history: SQLite was chosen as the default to preserve the 5-minute install promise, while AGE and FalkorDBLite are offered as opt-in profiles for users who want graph-native memory from day one. Rationale for each rejected option:

  • Neo4j Community — JVM (+300 MB RAM), ~600 MB Docker image, third service in Compose, Cypher as a third query language, GPLv3 licensing friction with our Apache-2.0 core. Not a capability problem; an ops cost problem. Remains a valid swap target for users who want it.
  • Kuzu — archived 2025-10-10. Dead for v0.1 commitment.
  • SurrealDB — real production use (Samsung Ads, Verizon, Tencent), embedded Rust binary, but 2024–2025 BSL licensing history needs verification before we commit an Apache-2.0 core project to it. Flagged, not adopted.
  • Cozo — great vision ("the hippocampus for AI"), embedded like SQLite, but pre-1.0 and the maintainers explicitly do not promise storage compatibility before 1.0. Revisit in v0.3+.
  • Memgraph — BSL license. Non-OSS by our definition.
  • TypeDB / NebulaGraph / Dgraph — heavier than we need, none add enough over the three chosen backends to justify the extra ops cost.

22.12 What ships in Iter 0 vs Iter 4

| Iter 0 (scaffold) | Iter 4 (agent loop) |
| --- | --- |
| memory_nodes + memory_edges + kb_candidates tables | PostgresAGEMemoryGraph full implementation |
| sessions.operator_notes jsonb column | FalkorDBLiteMemoryGraph full implementation |
| storage/memory_graph/base.py: the protocol | Planner envelope integration (applicable_facts fill) |
| storage/memory_graph/sqlite.py: full implementation | Migration CLI: gaby memory export, import, migrate --to=<backend> |
| storage/memory_graph/postgres_age.py: class stubs with NotImplementedError | 3-backend round-trip integration test |
| storage/memory_graph/falkor_lite.py: class stubs with NotImplementedError | "Remember this" UI button in the approval queue |
| storage/memory_graph/__init__.py: factory reading GABY_MEMORY_BACKEND env var | gaby memory forget --subject=<key> CLI (GDPR) |
| ops/docker/docker-compose.yml gets a graph-age profile adding Postgres+AGE container | Property tests on workspace isolation (any traversal that leaks across workspaces fails) |

Iter 0 still writes graph data from the first ticket. Iter 4 adds the alternative backends and the planner integration.
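
The Iter 0 factory and stubs might look roughly like the sketch below. The three class names and the GABY_MEMORY_BACKEND variable come from this document; the accepted values ("sqlite", "postgres_age", "falkor_lite", guessed from the module file names), the "sqlite" default string, and the `make_memory_graph` name are assumptions.

```python
import os


class SQLiteMemoryGraph:
    """Placeholder for the default backend (real methods omitted here)."""


class _Iter4Stub:
    """Shared stub behaviour: constructible so config compiles, unusable until Iter 4."""

    async def neighbors(self, workspace_id, node, depth=1, limit=10):
        raise NotImplementedError(f"{type(self).__name__} lands in Iter 4")


class PostgresAGEMemoryGraph(_Iter4Stub):
    pass


class FalkorDBLiteMemoryGraph(_Iter4Stub):
    pass


# Assumed env values; the real names may differ.
_BACKENDS = {
    "sqlite": SQLiteMemoryGraph,
    "postgres_age": PostgresAGEMemoryGraph,
    "falkor_lite": FalkorDBLiteMemoryGraph,
}


def make_memory_graph(env=None):
    """Pick a backend class from GABY_MEMORY_BACKEND, defaulting to SQLite."""
    env = os.environ if env is None else env
    name = env.get("GABY_MEMORY_BACKEND", "sqlite")
    try:
        backend_cls = _BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(f"unknown GABY_MEMORY_BACKEND {name!r}; expected one of: {known}") from None
    return backend_cls()
```

Failing loudly on an unknown value, rather than silently falling back to SQLite, keeps a typo in a Compose profile from quietly writing graph data to the wrong backend.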

22.13 Verification — how we'll prove the backend swap works

A single integration test runs in CI for every PR that touches storage/memory_graph/:

@pytest.mark.integration
@pytest.mark.parametrize("source", ALL_BACKENDS)
@pytest.mark.parametrize("target", ALL_BACKENDS)
async def test_roundtrip_across_backends(source, target, tmp_fixture_graph):
    # 1. Materialize the fixture graph in `source`
    src = make_backend(source)
    await load_fixture(src, tmp_fixture_graph)

    # 2. Export → import into `target`: hand the source's export stream
    #    to import_all as-is, which is the migration contract itself
    dst = make_backend(target)
    await dst.import_all("workspace-test", src.export_all("workspace-test"))

    # 3. Compare canonical dumps
    src_dump = sorted([r async for r in src.export_all("workspace-test")], key=_canon)
    dst_dump = sorted([r async for r in dst.export_all("workspace-test")], key=_canon)
    assert src_dump == dst_dump

    # 4. Assert neighbors() returns the same results from both
    for seed in fixture_seeds:
        src_neigh = await src.neighbors("workspace-test", seed, depth=1, limit=10)
        dst_neigh = await dst.neighbors("workspace-test", seed, depth=1, limit=10)
        assert canonicalize(src_neigh) == canonicalize(dst_neigh)

With 3 backends this is a 3×3 = 9-cell test matrix. The SQLite↔SQLite cell catches regressions in the reference implementation. The cross-backend cells catch drift. Any new backend added in the future must make this matrix green before it can be offered to users.