Architecture — Gaby¶
Status: Draft v0.1 · Owner: Guilliano · Last updated: 2026-04-11
Reading order: SPEC.md → FOUNDATION.md → ARCHITECTURE.md (this) → ROADMAP.md.
SPEC.md says what we're building. FOUNDATION.md locks the stack and the repo layout. This doc is the technical how: lifecycles, state machines, contracts, concurrency, failure modes, data flow.
This is a living document. Anything with a § icon is a design decision that can be revisited with evidence; anything with a 🔒 is locked for v1.0.
1. System map — one picture¶
┌─────────────────────────────────┐
│ End user (customer / employee) │
└──────────────┬──────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
Help desk Chat widget Slack / Teams
(Zendesk, Halo, (JS snippet, (bot user)
Linear, Zoho…) shadow DOM)
│ │ │
└───────────┬───────────┴───────────┬───────────┘
│ │
▼ ▼
┌────────────────────┐ ┌────────────────────┐
│ TicketSource │ │ ChatSession │
│ adapters │ │ manager │
└─────────┬──────────┘ └─────────┬──────────┘
│ │
└──────────┬──────────────┘
▼
┌─────────────────────────┐
│ Event bus (in-proc) │
│ topic: ticket.new │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ Worker runner │
│ (in-proc or arq+Redis) │
└────────────┬────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Agent loop │
│ ┌─────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Plan │→ │ Retrieve │→ │ Tool call │→ │ Observe │ │
│ └─────────┘ └──────────┘ └─────┬─────┘ └────┬─────┘ │
│ ▲ │ │ │
│ └────────────────────────────┴──────────────┘ │
│ (loop until verdict) │
└───────┬─────────────┬──────────────────────┬──────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌────────────┐ ┌─────────────┐
│ LLM │ │ Knowledge │ │ MCP host │
│ gateway │ │ retrieval │ │ (spawns and │
│ (litellm)│ │ (hybrid) │ │ supervises │
└──────────┘ └────────────┘ │ connectors)│
└──────┬──────┘
│
┌──────────────────┼─────────────────┐
│ │ │
MCP server MCP server MCP server
postgres keycloak zoho-desk
(stdio/HTTP) (stdio/HTTP) (stdio/HTTP)
│ │ │
▼ ▼ ▼
Real Postgres Real Keycloak Real Zoho
(read-only) (read-only) (read + write)
The agent loop, before every tool call, passes through:
Safety pipeline (§6)
┌─────────────────────────────────┐
│ scope check → redact → dry-run │
│ → apply → audit │
└─────────────────────────────────┘
Everything else in this document elaborates one of the boxes or one of the arrows.
2. Core request lifecycle — from "ticket arrives" to "ticket closed"¶
This is the canonical path. Every v0.1 scenario collapses to it.
1.  [TicketSource] poll() / webhook → raw_ticket
2.      .normalize() → Ticket (canonical form)
3.      persist → tickets table, emit "ticket.new"
4.  [Worker] consumes "ticket.new" → schedules Investigation
5.  [Agent loop] new Investigation(id, ticket_id, budget)
5a.     (optional) classify(ticket) → triage verdict
5b.     if triage == "not_worth_investigating":
5c.         verdict = "skipped"; go to step 23
6.      while not verdict:
7.          plan_next_step(working_memory)                  # LLM call: planner
8.          if needs_retrieval:
9.              retrieve(query)                             # knowledge subsystem
10.             append to working_memory
11.         if needs_tool_call:
12.             action = propose_tool_call()                # LLM call: tool_selector
13.             safety_check(action, scopes, autonomy)      ←── may raise
14.             if dry_run:
15.                 result = simulate(action)
16.             else:
17.                 result = mcp_host.call(action)
18.             audit.write(action, result)
19.             append to working_memory
20.         maybe_emit_step_to_ui(step)                     # live updates
21.         if budget_exceeded or max_iterations:
22.             verdict = "failed_budget"; break
23.     verdict = classify(working_memory)                  # LLM call: verdict
24.     summary = summarize(working_memory)                 # LLM call: summarizer
25. [TicketSink] write_back(ticket, summary, verdict)       # via the source adapter
26.     update tickets.status
27.     emit "investigation.done"
28. [Escalator] if verdict ∈ {needs_tech, needs_l2, needs_client}:
29.     dispatch_to_channel(persona.escalation_target)
30. [KB learner] if verdict == "auto_resolved" and quality_gate_passes:
31.     stage the resolution as a candidate KB entry (human review)
Legend for the LLM calls in the loop¶
| Call name | Purpose | Model tier | Streaming |
|---|---|---|---|
| planner | Given working memory, what should we do next? | big | no |
| tool_selector | Choose a specific MCP tool + its arguments | big | no |
| summarizer | Turn the working memory into a customer-facing message | big | yes |
| verdict | Classify final outcome (auto_resolved / needs_tech / …) | small | no |
| classifier (optional) | Cheap pre-filter at step 5a (is this even worth investigating?) | small | no |
The model router (§8.4) decides "big" vs "small". Classifier-style calls go through a cheap model so we don't spend flagship-model tokens on yes/no questions.
2.1 Working memory vs investigation steps — two things, not one¶
These are separate and must not be confused.
| Thing | Shape | Scope | Persistence | Consumer |
|---|---|---|---|---|
| Working memory | A typed object WorkingMemory { ticket, messages, tool_calls, retrieved_chunks, budget_state }. The messages array is the LLM conversation history for this investigation. | One in-flight investigation | Snapshotted to investigations.working_memory_snapshot (jsonb) at every state-machine transition | The agent loop |
| Investigation steps | Append-only rows in investigation_steps matching the UI timeline shape (system, action, detail, type, timestamp) | One investigation, historical | Permanent (soft-delete only) | The UI, the audit log, the operator |
Every state transition in §3 does two writes: it updates the working memory snapshot AND appends one or more investigation step rows. The snapshot lets us resume after a crash; the steps let the UI animate in real time and the audit log reconstruct history.
3. Agent loop — state machine¶
┌──────────────┐
│ CREATED │
└──────┬───────┘
│ start()
▼
┌──────────────┐
┌─────────────▶│ PLANNING │
│ └──────┬───────┘
│ │ next_step == retrieve
│ ▼
│ ┌──────────────┐
│ │ RETRIEVING │
│ └──────┬───────┘
│ │
│ ▼ (back to planning with new evidence)
│ ┌──────────────┐
└──────────────┤ PLANNING ├─┐
└──────┬───────┘ │
│ │ next_step == act
▼ ▼
┌──────────────┐
│ SAFETY_CHK │
└──────┬───────┘
│
┌─────────────┼────────────┐
│ │ │
│ │ │
denied approval allowed
│ required │
▼ │ ▼
┌──────────┐ ▼ ┌─────────────┐
│ HALTED │ ┌──────────┐ │ ACTING │
└──────────┘ │ WAITING │ └──────┬──────┘
│ APPROVAL │ │
└─────┬────┘ ▼
│ ┌─────────────┐
│ │ OBSERVING │
│ └──────┬──────┘
│ │
▼ ▼
┌──────────────┐
│ PLANNING │ (loop)
└──────┬───────┘
│ verdict_ready
▼
┌──────────────┐
│ VERDICT │
└──────┬───────┘
│
▼
┌──────────────┐
│ WRITING_BACK│
└──────┬───────┘
│
▼
┌──────────────┐
│ DONE │
└──────────────┘
Terminal states¶
| State | Meaning |
|---|---|
| DONE | Verdict produced, written back, audit closed. Normal path. |
| HALTED | Safety denial or unrecoverable error. Escalated, audit closed. |
| WAITING_APPROVAL | Paused, waiting on a human. Resumable. Has a TTL (default 24h). On TTL expiry → auto-escalate. Not strictly terminal; APPROVED transitions back into ACTING with the same pending action. |
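The diagram and table above can be read as a transition table. The following is a minimal sketch under stated assumptions, not the actual loop: the edge set is read off the diagram, the WAITING_APPROVAL → HALTED edge (for TTL expiry) is an assumption, and `advance` is a hypothetical helper name.

```python
from enum import Enum


class State(str, Enum):
    CREATED = "CREATED"
    PLANNING = "PLANNING"
    RETRIEVING = "RETRIEVING"
    SAFETY_CHK = "SAFETY_CHK"
    WAITING_APPROVAL = "WAITING_APPROVAL"
    ACTING = "ACTING"
    OBSERVING = "OBSERVING"
    VERDICT = "VERDICT"
    WRITING_BACK = "WRITING_BACK"
    DONE = "DONE"
    HALTED = "HALTED"


# Edges read off the diagram. WAITING_APPROVAL -> HALTED (TTL expiry) is an assumption.
TRANSITIONS = {
    State.CREATED: {State.PLANNING},
    State.PLANNING: {State.RETRIEVING, State.SAFETY_CHK, State.VERDICT},
    State.RETRIEVING: {State.PLANNING},
    State.SAFETY_CHK: {State.HALTED, State.WAITING_APPROVAL, State.ACTING},
    State.WAITING_APPROVAL: {State.ACTING, State.HALTED},
    State.ACTING: {State.OBSERVING},
    State.OBSERVING: {State.PLANNING},
    State.VERDICT: {State.WRITING_BACK},
    State.WRITING_BACK: {State.DONE},
    State.DONE: set(),
    State.HALTED: set(),
}


def advance(current: State, nxt: State) -> State:
    """Return the next state, or raise if the edge is not in the table."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt
```

Keeping the edges in one table makes it cheap to assert, at every transition, that the loop never takes a path the diagram does not show.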
Resume semantics (after a crash OR after an approval)¶
Because working memory is snapshotted at every transition, resuming is deterministic:
1. Load investigations.working_memory_snapshot for the target investigation
2. Load investigations.status → the last state
3. Re-enter the state machine at that state with the snapshot as input
4. For WAITING_APPROVAL: when the approval lands, the loop re-enters ACTING,
calls the already-validated (tool_name, args), and proceeds normally
5. For a crash resume: the loop re-enters PLANNING with the last snapshot.
We *never* replay a non-idempotent action — if the crash happened inside
ACTING, the audit log tells us whether the action completed
(`action.applied` event) or not. Completed actions are skipped on resume.
Idempotency requirement on MCP tool authors: every write tool must accept an idempotency_key argument (Gaby generates one per action UUID). The connector is responsible for de-duplicating on retry. Contract test §12 enforces this for every dangerous tool.
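The "completed actions are skipped on resume" rule can be sketched as a filter over the audit log. This is a hedged illustration: the event shape (an `idempotency_key` inside `payload`), the helper names, and the use of `uuid4` as the action UUID are all assumptions, not the real schema.

```python
import uuid


def new_idempotency_key() -> str:
    # Assumption: one random UUID per proposed action.
    return str(uuid.uuid4())


def actions_to_run(pending_actions, audit_events):
    """Drop actions whose key already has an `action.applied` audit event."""
    applied = {
        event["payload"]["idempotency_key"]
        for event in audit_events
        if event["event"] == "action.applied"
    }
    return [a for a in pending_actions if a["idempotency_key"] not in applied]
```

An action that crashed before its `action.applied` event was written is retried with the same key, which is why the connector-side de-duplication is still required.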
Budget enforcement¶
At every transition, the loop checks:
- tokens_used < budget.tokens
- usd_spent < budget.usd
- wall_clock < budget.max_seconds
- iterations < budget.max_iterations (default 20)
Any breach → verdict failed_budget, escalation. No silent degradation.
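A minimal sketch of the per-transition check, assuming simple dataclasses for the budget and the running usage. The token, USD, and iteration defaults come from this document; the `max_seconds` default and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    tokens: int = 50_000
    usd: float = 0.50
    max_seconds: float = 600.0  # assumption: no default is stated in this doc
    max_iterations: int = 20


@dataclass
class Usage:
    tokens_used: int = 0
    usd_spent: float = 0.0
    wall_clock: float = 0.0
    iterations: int = 0


def breached_limits(usage: Usage, budget: Budget) -> list:
    """Return the names of every limit that is no longer strictly under budget."""
    checks = [
        ("tokens", usage.tokens_used, budget.tokens),
        ("usd", usage.usd_spent, budget.usd),
        ("wall_clock", usage.wall_clock, budget.max_seconds),
        ("iterations", usage.iterations, budget.max_iterations),
    ]
    return [name for name, used, limit in checks if not used < limit]
```

A non-empty result would map to the failed_budget verdict; returning the names rather than a boolean lets the escalation message say which limit was hit.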
Why homegrown (a reminder)¶
We discussed this in FOUNDATION.md §1.1. The state machine above is ~400 Python lines on top of the Anthropic/OpenAI SDKs. The reasons to not adopt LangGraph or pydantic-ai at v0.1 are:
- Safety must come before every tool call, not as a decorator. Frameworks make this awkward; in a hand-rolled loop it's one function call on the critical path.
- Every transition emits an audit event with the full working memory delta. Frameworks' internal state is opaque to us.
- Budget enforcement is per-transition, not per-call. Our loop checks every edge; frameworks expose hooks but not guarantees.
- We want streaming of summarizer output directly to the UI. Simple from our loop; non-obvious in a framework that wraps the LLM client.
The public interface of the loop is small enough (start, resume, step, state) that a future swap is a week of work, not a rewrite.
Preferred escape hatch: pydantic-ai¶
If the homegrown loop stalls — prompt debugging becomes painful, multi-step branching gets tangled, we reimplement checkpointing — the preferred migration target is pydantic-ai, not LangGraph. Published 2026 benchmarks put pydantic-ai at ~44% lower P95 latency, ~5× fewer errors under load, and ~2.7× lower token consumption versus LangGraph on equivalent agent tasks. It also ships pydantic-graph with durable execution across restarts and first-class human-in-the-loop, which maps cleanly onto our WAITING_APPROVAL state.
Re-evaluation gate: after v0.1 ships, run the eval harness (50+ fixture tickets) against both the homegrown loop and a pydantic-ai port. If the pydantic-ai port is within 10% of the homegrown loop on safety compliance AND is either meaningfully shorter in code or faster on latency, we swap for v0.2.
4. Concurrency model¶
Gaby is I/O-bound: LLM calls, DB queries, MCP round-trips, and HTTP to help desks dominate its runtime. asyncio everywhere is the default; threads are reserved for CPU-heavy work (embeddings inference if we run it locally, BM25 scoring on large corpora).
4.1 The runtime shape¶
┌────────────────────────┐
│ FastAPI app │
│ (uvicorn, 1 process) │
└──────────┬─────────────┘
│
same event loop
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌────────────┐ ┌──────────────┐ ┌─────────────┐
│ HTTP routes│ │ Worker runner│ │ Chat gateway│
└────────────┘ └──────┬───────┘ └─────────────┘
│
▼
bounded semaphore (N=8 default)
│
┌───────────┼───────────┐
▼ ▼ ▼
Investigation Investigation Investigation
task task task
(asyncio.Task) ... ...
4.2 Key parameters¶
| Parameter | SQLite default (v0.1) | Postgres default (v0.2+) | Notes |
|---|---|---|---|
| uvicorn --workers | 1 | 2–4 | Single process in v0.1 is enough; scale out horizontally in v0.5 |
| Concurrent investigations | 8 | 16 | Bounded semaphore; excess tickets queue in the DB |
| Per-investigation LLM concurrency | 1 | 1 | LLM calls inside one investigation are serial — simpler reasoning |
| MCP subprocess pool size | unbounded | unbounded | One MCP server per connector; each serves many investigations concurrently |
| Async DB pool | pool_size=5, max_overflow=5 | pool_size=20, max_overflow=10 | SQLite is single-writer so the pool just serializes writes. Postgres default follows the 2026 production pattern of pool=20/overflow=10 for moderate API servers. Remember: total DB connections = workers × (pool_size + max_overflow). |
| PostgreSQL driver | n/a | asyncpg | asyncpg (not psycopg2) for true async; SQLAlchemy configured with postgresql+asyncpg:// |
| HTTP client (httpx) | shared | shared | Single AsyncClient per process, limits=Limits(max_keepalive_connections=40) |
4.3 In-process vs external worker¶
v0.1 default: Investigations run inside the FastAPI process, on the same
event loop. No Redis. "docker compose up" = 1 container for
the app + 1 for the UI + 1 for Postgres (optional).
v0.2 default (scale): arq worker in a separate container. Same code, different
entry point. Switches on `GABY_WORKER_MODE=arq`.
The worker interface is identical in both modes (runner.schedule(investigation)), so the upgrade path is a config flag, not a refactor.
4.4 Backpressure¶
- If the investigation semaphore is full, new ticket.new events queue in the DB (tickets.status='queued').
- The web UI dashboard shows queue depth and expected wait time (roughly queue_length × avg_inv_duration / concurrency).
- Above a configurable threshold (default 50 queued), Gaby starts degraded mode: it still intakes tickets, but skips the retrieve step for low-priority tickets to drain faster.
- We never drop a ticket. The DB is the queue; losing a ticket requires a DB failure, not a process crash.
4.5 The DB is the queue — event bus clarified¶
To square "in-process event bus" with "never drop a ticket", the actual rule is:
- Ticket adapters write tickets(status='queued') and then optionally fire an in-memory notification to wake the worker faster.
- The worker runner's main loop is a DB poll: SELECT ... FROM tickets WHERE status='queued' ORDER BY priority, received_at FOR UPDATE SKIP LOCKED LIMIT 1. SQLite uses BEGIN IMMEDIATE + an app-level mutex in place of FOR UPDATE SKIP LOCKED.
- The in-memory notification is a latency optimization, not the source of truth. If it gets dropped (process crash, Python GC pause), the next poll tick catches the ticket.
- Poll interval defaults to 2 seconds; notification-driven wakeups push it effectively to zero when the system is loaded.
This makes the system at-least-once: after a crash, we may re-claim a ticket whose investigation was mid-flight. The resume rules in §3 handle that safely because:
- Working memory is snapshotted per transition → we pick up where we left off.
- Every write action carries an idempotency_key → no duplicate side-effects.
- The audit log tells us what was already applied before the crash.
4.6 Ticket claim transaction¶
Pseudo-code for the claim:
-- Postgres path
BEGIN;
SELECT id, workspace_id, body FROM tickets
WHERE status = 'queued'
ORDER BY priority DESC, received_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
UPDATE tickets SET status = 'investigating', claimed_by = $worker_id, claimed_at = now()
WHERE id = $picked_id;
COMMIT;
For SQLite we fall back to a single-writer strategy: one claim task, protected by a process-wide asyncio.Lock, using BEGIN IMMEDIATE to acquire the DB writer lock before the SELECT + UPDATE. This is fine for v0.1 throughput targets.
5. MCP host — connector lifecycle¶
Every connector is an MCP server. Gaby is an MCP host (in MCP parlance) and an MCP client (it calls their tools).
5.1 Spawn strategies¶
| Strategy | When | Implementation |
|---|---|---|
| stdio subprocess (v0.1 default) | Default for first-party connectors bundled in the image | asyncio.create_subprocess_exec(...); framed JSON-RPC over pipes as per the MCP stdio transport |
| Streamable HTTP | Remote / community MCP servers reachable over HTTP | MCP's Streamable HTTP transport — the current spec's HTTP option (superseded the older SSE transport) |
| in-process | Tiny built-ins that never block (e.g. local filesystem, time) | Direct function calls wearing an MCP-shaped mask |
Why stdio default? First-party connectors ship in the Gaby Docker image and are spawned as subprocesses — no network, no auth, lowest latency, simplest failure modes. Streamable HTTP is for remote/community servers where subprocess isn't an option. The official Python MCP SDK supports both transports with the same client interface.
5.2 Lifecycle¶
CONFIGURED ──start──▶ LAUNCHING ──handshake──▶ READY ──▶ BUSY ──▶ READY
│ │ │ │
└─fail─▶ CRASHED │ ▼ │
│ │ TIMEOUT │
▼ │ │ ▼
RESTARTING ◀────┴─────────┴───── SHUTDOWN
│
▼
READY (or DEGRADED after N failures)
- Handshake = MCP initialize + tools/list. The tool list is cached per connector version.
- Crash recovery: exponential backoff, max 5 restarts in 5 minutes. After that the connector is marked DEGRADED and the UI shows a persistent warning. Investigations that would have used this connector either skip it or halt, depending on connector criticality.
- Health check: periodic ping every 30s (for HTTP) or "is the subprocess alive?" (for stdio). Surfaced at /health/connectors.
- Graceful shutdown: SIGTERM → wait for in-flight tool calls to finish → SIGKILL after 10s.
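The crash-recovery rule (exponential backoff, at most 5 restarts in 5 minutes, then DEGRADED) can be sketched as a small sliding-window policy. `RestartPolicy` is a hypothetical name, and the 1-second base delay and 60-second cap are assumptions; only the 5-in-5-minutes limit comes from this document.

```python
from collections import deque
from typing import Optional


class RestartPolicy:
    def __init__(self, max_restarts=5, window_s=300.0, base_delay_s=1.0, max_delay_s=60.0):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.base_delay_s = base_delay_s  # assumption
        self.max_delay_s = max_delay_s    # assumption
        self._restarts = deque()          # timestamps of recent restarts

    def next_delay(self, now: float) -> Optional[float]:
        """Backoff delay before the next restart, or None to mark DEGRADED."""
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return None
        delay = min(self.base_delay_s * 2 ** len(self._restarts), self.max_delay_s)
        self._restarts.append(now)
        return delay
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the policy deterministic under test.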
5.3 Tool scope declaration — the contract every connector must satisfy¶
Every connector declares its capabilities in a machine-readable form that Gaby trusts for authorization decisions:
# connector tool manifest (returned by tools/list + scope extension)
tools:
  - name: query_users
    scope: read
    description: "Look up user by email or ID"
    args: [{ name: email, type: string }]
  - name: reset_password
    scope: write
    dangerous: true
    requires_approval_above_autonomy: propose   # auto-approves only when autonomy=act
    description: "Trigger a password reset"
    args: [{ name: user_id, type: string }]
  - name: delete_user
    scope: write
    dangerous: true
    forbidden_in_autonomy: [investigate, propose]   # only allowed in autonomy=act
    description: "Permanently delete a user"
    args: [{ name: user_id, type: string }]
Contract tests (see §12) verify every first-party connector declares these fields. Community connectors that don't are flagged UNSAFE in the UI and cannot be moved out of investigate autonomy.
5.4 Manifest versioning and cache invalidation¶
Each connector declares a manifest_version (semver) in its initialize response. Gaby stores the last-seen version per connector. On restart, if the version changed, the tool list cache is invalidated and a new tools/list is performed. The manifest hash is also written to the audit log so historical investigations reference a specific immutable version of the tool set.
5.5 Idempotency keys for write tools¶
Every write tool must accept idempotency_key: string as an argument. Gaby generates one per action UUID and passes it automatically. Connectors use it to de-duplicate on retry after a crash or transient failure. This requirement is enforced by the contract tests.
6. Safety pipeline — the thing that cannot break¶
This is the single most important subsystem. Every non-read action passes through it, in order:
action (tool_name, args, connector_id)
│
▼
┌──────────────────────────────┐
│ 1. SCOPE CHECK │ evaluate(action, connector.scopes,
│ │ ticket.workspace_id,
│ │ persona.autonomy_level)
└──────┬───────────────────────┘
│ denied ──────────────────▶ AUDIT(denied) → raise PermissionError
│ allowed
▼
┌──────────────────────────────┐
│ 2. REDACTION │ strip PII from any string args per
│ │ workspace.compliance_profile (HIPAA, SOC2…)
└──────┬───────────────────────┘
│
▼
┌──────────────────────────────┐
│ 3. DRY-RUN DECISION │ dry_run = (autonomy ≠ act)
│ │ OR (tool.dangerous AND not approved)
└──────┬───────────────────────┘
│
┌──────┴──────┐
│ │
dry_run real
│ │
▼ ▼
simulate() mcp_host.call()
│ │
└──────┬──────┘
▼
┌──────────────────────────────┐
│ 4. AUDIT │ append_hash_chained(
│ │ actor, action, result, ts)
└──────┬───────────────────────┘
│
▼
return result to agent loop
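The ordering in the diagram can be sketched as one function on the critical path. This is an illustrative skeleton only: every collaborator is passed in as a plain callable, the dry-run rule is copied from step 3, and the `action.dry_run` event name is a hypothetical addition (the document only names `action.applied` and `action.denied`).

```python
def run_pipeline(action, *, scope_allows, redact, autonomy, dangerous, approved,
                 simulate, call, audit):
    """scope check -> redact -> dry-run decision -> apply -> audit."""
    # 1. Scope check: denials are audited, then raised.
    if not scope_allows(action):
        audit("action.denied", action, None)
        raise PermissionError(f"scope check denied {action!r}")
    # 2. Redaction happens before anything leaves the process.
    action = redact(action)
    # 3. Dry-run decision, as written in the diagram.
    dry_run = autonomy != "act" or (dangerous and not approved)
    result = simulate(action) if dry_run else call(action)
    # 4. Audit both simulated and real outcomes.
    audit("action.dry_run" if dry_run else "action.applied", action, result)
    return result
```

Because the scope check is the first statement, there is no code path that reaches `call` without passing it.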
6.1 Scope DSL (sketch)¶
Scopes are declarative, per-connector. There are exactly two lanes —
read and write. Dry-run is not a scope lane; it is a runtime
decision made at step 3 of the safety pipeline (see §6 diagram) and
implemented by the connector when its tool manifest sets
supports_dry_run=true. See
docs/decisions/2026-04-15-dry-run-not-a-scope-lane.md
for the rationale.
connector: m365
scopes:
  read:
    allow: ["users/*", "mailboxes/*"]
  write:
    allow: ["users/{id}/reset_password"]
    deny: ["users/{id}/delete"]
The scope checker resolves action.tool against these globs plus the tool manifest's scope field. Denies beat allows. Everything not explicitly allowed is denied.
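A minimal sketch of the resolution rule (denies beat allows, default deny), assuming the action has already been mapped to a resource path such as `users/42/reset_password`. The pattern semantics are assumptions: `*` matches anything, a `{name}` placeholder matches one path segment, and deny rules are evaluated within the same lane.

```python
import re


def _compile(pattern: str):
    # Split into literals, "*" wildcards, and "{name}" placeholders.
    parts = re.split(r"(\{[^}]+\}|\*)", pattern)
    regex = "".join(
        ".*" if part == "*"
        else "[^/]+" if part.startswith("{")
        else re.escape(part)
        for part in parts
    )
    return re.compile(regex + r"\Z")


def is_allowed(resource: str, lane: str, scopes: dict) -> bool:
    rules = scopes.get(lane, {})
    if any(_compile(p).match(resource) for p in rules.get("deny", [])):
        return False  # denies beat allows
    # Anything not explicitly allowed is denied.
    return any(_compile(p).match(resource) for p in rules.get("allow", []))


M365_SCOPES = {
    "read": {"allow": ["users/*", "mailboxes/*"]},
    "write": {"allow": ["users/{id}/reset_password"], "deny": ["users/{id}/delete"]},
}
```

With the m365 example, `is_allowed("users/42/reset_password", "write", M365_SCOPES)` passes while `users/42/delete` and any unlisted resource are rejected.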
6.2 Audit log — hash-chained, append-only¶
Every entry:
{
"id": <uuid7>,
"workspace_id": ...,
"ts": <monotonic_wall>,
"actor_kind": "agent" | "user" | "system",
"actor_id": ...,
"event": "action.applied" | "action.denied" | "approval.granted" | ...,
"payload": { ... action + result snapshot ... },
"prev_hash": <sha256 of previous row>,
"hash": <sha256(prev_hash || canonical_json(this row without 'hash'))>
}
- Verification: a background task re-walks the chain daily and alerts on any mismatch.
- SIEM export (EE feature): tail the chain to Splunk / Sumo / Elastic via a pluggable exporter.
- Why not a separate append-only database (e.g. QLDB, immudb)? Added operational complexity. A hash-chained table in the same Postgres, with row-level ACLs and no UPDATE/DELETE grants, gives 95% of the guarantee for 5% of the complexity. EE customers who need stronger guarantees can pipe to a dedicated store.
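The chaining and the daily verification walk can be sketched in a few lines. This is an in-memory illustration under assumptions: the all-zero genesis hash, the sorted-key JSON canonicalization, and the function names are not taken from the real implementation.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumption: prev_hash used for the first row


def canonical_json(obj) -> str:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def _digest(prev_hash: str, row_without_hash: dict) -> str:
    return hashlib.sha256((prev_hash + canonical_json(row_without_hash)).encode()).hexdigest()


def append_entry(chain: list, entry: dict) -> dict:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    row = {**entry, "prev_hash": prev_hash}
    row["hash"] = _digest(prev_hash, row)  # row has no 'hash' key yet
    chain.append(row)
    return row


def verify_chain(chain: list) -> bool:
    prev = GENESIS
    for row in chain:
        body = {k: v for k, v in row.items() if k != "hash"}
        if row["prev_hash"] != prev or row["hash"] != _digest(prev, body):
            return False
        prev = row["hash"]
    return True
```

Editing any earlier row changes its recomputed hash, so the walk fails at that row even if the attacker leaves every later row untouched.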
6.3 The four autonomy levels (one more than SPEC.md §6.5)¶
| Level | What the agent does | When to use |
|---|---|---|
| off | Gaby does nothing. Tickets are ingested but not investigated. | Maintenance mode / legal hold. |
| investigate | Gaby reads, retrieves, queries. Never writes. Produces a summary for humans. | First week of deployment. Read-only SRE connectors. |
| propose | Gaby drafts the fix. Every write action goes to the approval queue. | Default for most non-trivial deployments. |
| act | Gaby executes writes itself, with dry-run + audit + rollback. Still respects dangerous/forbidden flags on tools. | Mature deployments with well-understood playbooks. |
Autonomy is set per connector, per workspace. A single investigation can include act calls to Redis and propose calls to Stripe.
7. Knowledge subsystem — retrieval with citations¶
7.1 Pipeline¶
source chunker embedder store retrieval
----- ------- -------- ----- ---------
git repo token-aware provider-agnostic sqlite-vec hybrid (BM25 + vector)
dir walker Markdown/code-aware (pluggable) or pgvector + cross-encoder rerank
confluence respects headings + FTS5 / tsvector + top-k=6 default
notion + explicit citations
pdf
url crawler
past tickets
Vector store scale cliff — plan for the migration¶
sqlite-vec uses brute-force search (no ANN index). This is fine for v0.1 — a founder's runbook folder is hundreds, maybe low thousands, of chunks — but it does not scale to large corpora. The migration triggers:
| Corpus size | Recommendation |
|---|---|
| < 5,000 chunks | sqlite-vec (brute-force is fast enough, <20 ms queries) |
| 5,000 – 50,000 | Evaluate vectorlite (sqlite ANN, ~3–30× faster than sqlite-vec on the same hardware) |
| > 50,000 | Switch to the Postgres profile, use pgvector with HNSW indexes |
All three present the same VectorStore protocol, so the migration is a config flag + a background reindex, not a rewrite. The documents table schema stores embedding as raw BLOB (float32 × dim) so the underlying index implementation is swappable.
Embedding model default: we start with a provider-agnostic choice — text-embedding-3-small (OpenAI, 1536 dim) for BYOK users on OpenAI, voyage-3-lite (Voyage AI, 512 dim) for Anthropic-leaning deployments, or a local BGE model via sentence-transformers for air-gapped installs. The schema is dim-agnostic; changing models triggers a background reindex.
7.2 Chunker rules (§)¶
- Markdown: split on top-level headings first, then H2, then 800-token soft max.
- Code (source files): split per function/class; never split mid-function.
- PDF: page-aware; no cross-page chunks unless a heading continues.
- Chunk metadata carries: source_uri, headings_path, line_range, content_hash.
7.3 Retrieval¶
- Query rewrite (optional, cheap model): turn the ticket title + body into 1–3 search queries.
- Parallel retrieval: BM25 top-20 ∥ vector top-20.
- Reciprocal-rank fusion → top-20 hybrid candidates.
- Cross-encoder rerank (cheap model or a small local model) → top-6.
- Attach to working memory with explicit citations ([doc:uri#headings#L12-L30]).
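The fusion step above can be sketched as standard reciprocal-rank fusion. The constant k=60 is the conventional default from the RRF literature and is an assumption here (this document does not fix it); ties are broken by document id so the output is deterministic.

```python
def reciprocal_rank_fusion(rankings, k: int = 60, top_n: int = 20):
    """Fuse several ranked lists of doc ids into one list of at most top_n ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents found by both retrievers ("b" and "a") outrank documents found by only one, which is the property the hybrid stage relies on before the cross-encoder rerank.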
7.4 Citations in output¶
Every claim in the final summary must end with a citation token. Unsourced claims are re-queried or explicitly disclaimed:
"This user's Authenticator was tied to the old iPhone [kb://runbooks/mfa-lockout#L45-L58] and the Entra ID sign-in log confirms 7
AADSTS50076 failures [entra://signinlogs#user=kevin.reyes@hartwelllaw.com&window=30m]."
Users can click any citation in the UI to see the source.
7.5 Learning loop¶
When an investigation resolves with auto_resolved verdict AND the operator rates it ≥4/5 (or no one disputes it within 7 days), Gaby stages a new KB candidate (the ticket + the resolution + the tool-call trace) in a review queue. An operator accepts / edits / rejects. Accepted entries become new indexed documents.
No silent learning. Human in the loop, always.
8. LLM gateway¶
8.1 Provider interface¶
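from __future__ import annotations  # lets ChatResult (defined elsewhere) stay a forward reference

from typing import Literal, Protocol

class LLMProvider(Protocol):
    # max_tokens and temperature are required keyword-only arguments.
    async def chat(self, messages, *, model, tools=None, max_tokens, temperature,
                   cache_control=None, stream=False) -> ChatResult: ...
    async def embed(self, texts, *, model) -> list[list[float]]: ...
    def supports(self, capability: Literal["tools", "streaming", "cache", "json_mode"]) -> bool: ...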
Three concrete implementations in v0.1:
- AnthropicProvider (direct anthropic Python SDK) — used on hot paths (planner, summarizer)
- OpenAIProvider (direct openai Python SDK) — fallback for BYOK
- LiteLLMProvider — wraps ~100 providers for BYOK users who want Azure OpenAI, Bedrock, Vertex, Mistral, local vLLM, etc.
A note on LiteLLM. We use the Python SDK (litellm as a library), not the LiteLLM proxy: the proxy has known production issues in 2026 (GIL-bound throughput, DB logging degradation, SSO gated behind paid tier) and was compromised in a PyPI supply-chain attack in March 2026 (versions 1.82.7 and 1.82.8). Mitigations:
- Pin litellm to a known-good version range in uv.lock and update deliberately.
- All SDK installs go through PyPI with hashes verified at install time.
- Hot paths (planner, tool_selector, verdict, summarizer) bypass LiteLLM entirely and use the direct Anthropic/OpenAI SDKs.
- LiteLLM only sees BYOK-only providers (Bedrock, Vertex, Azure, local vLLM) where its breadth is the value.
If BYOK volume becomes a real production load, Bifrost (Apache 2.0, Go-based, ~10μs overhead) and Portkey are the v0.2+ evaluation targets.
8.2 Prompt caching¶
Anthropic (2026 rules). Cache scope is the whole prefix up to the cache breakpoint, in request order: tools → system → messages. Cache reads are billed at ~0.1× base input price, 5-minute writes at 1.25×, 1-hour writes at 2×. Max 4 breakpoints per request. Minimum cacheable block size: 1,024 tokens on Sonnet, 4,096 tokens on Haiku (so short prompts never benefit on Haiku). Cache TTL defaults to 5 minutes, refreshed on every hit — which fits perfectly inside a typical investigation that spans seconds to minutes. As of 2026-02-05 caches are workspace-isolated (not org-wide), so multi-workspace deployments get proper separation for free.
We place our 4 breakpoints as follows:
| # | Block | Lifetime | Why |
|---|---|---|---|
| 1 | Tool manifest (the MCP tool list, serialized) | Connector version | Changes only when a connector updates. Near-permanent cache hits. |
| 2 | System prompt (persona-specific instructions + safety rules) | Persona version | Changes rarely. Big win on every planner / verdict call. |
| 3 | Retrieved KB chunks for this investigation | Investigation lifetime | Same chunks are re-sent with each planner turn; the 5-min TTL keeps them hot. |
| 4 | Accumulated messages (ticket + prior tool calls) | Investigation lifetime | The accumulator. Every new turn extends past the last breakpoint. |
Below 1,024 tokens on Sonnet (or 4,096 on Haiku) the breakpoint is a no-op and the call pays the full input price — a minor inefficiency, never a bug.
OpenAI. Prompt caching is automatic and server-side, with no API change required. It only applies to prompts of 1,024 tokens or more, so short prompts pay the full input price there as well.
Local models. The cache is a no-op but the interface is uniform across providers.
8.3 Budget enforcement¶
Every chat() call passes through a BudgetGuard:
guard.check(investigation_id) # raises BudgetExceeded before the HTTP call
guard.record(investigation_id, prompt_tokens, completion_tokens, cost_usd)
Budgets are per investigation, set from the persona's profile (default: 50k tokens, $0.50). Breaches halt the investigation and escalate.
8.3.1 Cost mapping — where USD comes from¶
Token counts come back from the provider response (usage.input_tokens, usage.output_tokens, usage.cache_creation_input_tokens, usage.cache_read_input_tokens for Anthropic). The tokens→USD conversion uses a single pricing table:
- Primary source: the pricing table bundled in litellm (updated by the upstream project regularly).
- Override: per-workspace config can set custom per-model rates for BYOK customers with negotiated pricing.
- Fallback: if a model isn't in the table, the cost column is NULL and the cost metric isn't incremented for that call. The token metric still is.
This is recorded in the llm_calls table so the cost dashboard (§16) can aggregate by investigation, by workspace, by model, and by purpose.
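The lookup order (workspace override, then bundled table, then NULL) can be sketched as below. The table shape, the key names, and the example rates are placeholders for illustration, not real prices, and cache-read/cache-write tokens are left out for brevity.

```python
from typing import Optional

# Hypothetical per-token rates in USD; placeholder numbers only.
PRICING = {"model-a": {"input": 1e-6, "output": 2e-6}}


def cost_usd(model: str, input_tokens: int, output_tokens: int,
             pricing: dict, overrides: Optional[dict] = None) -> Optional[float]:
    """Return the call cost, or None (stored as NULL) for an unknown model."""
    rates = (overrides or {}).get(model) or pricing.get(model)
    if rates is None:
        return None  # the token metric is still recorded by the caller
    return input_tokens * rates["input"] + output_tokens * rates["output"]
```

Returning None instead of 0.0 keeps "unknown price" distinguishable from "free" when the dashboard aggregates.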
8.4 Model router¶
A 20-line table, not a framework:
ROUTER = {
    "classifier":    "claude-haiku-4-5",
    "verdict":       "claude-haiku-4-5",
    "planner":       "claude-sonnet-4-6",
    "tool_selector": "claude-sonnet-4-6",
    "summarizer":    "claude-sonnet-4-6",
}
Overridable per workspace in config. BYOK users can map these to any provider/model via LITELLM_MODEL_* env vars.
9. Ticketing adapters — source and sink¶
Every help desk adapter is both a source (new tickets) and a sink (write back results). The base contract:
from __future__ import annotations  # RawTicket, Ticket, etc. are defined elsewhere

from datetime import datetime
from typing import Protocol

class TicketAdapter(Protocol):
    async def poll(self, since: datetime) -> list[RawTicket]: ...
    async def subscribe_webhook(self, callback) -> WebhookHandle: ...   # optional
    def normalize(self, raw: RawTicket) -> Ticket: ...
    async def post_reply(self, ticket_id: str, body: str, *, private: bool) -> None: ...
    async def update_status(self, ticket_id: str, status: str) -> None: ...
    async def log_time_entry(self, ticket_id: str, minutes: int, note: str) -> None: ...   # MSP
    def capabilities(self) -> AdapterCapabilities: ...
Adapters in v0.1: Zoho Desk. Adapters in the v0.2-v0.4 window: HaloPSA, Autotask, ConnectWise, Zendesk, Linear, GitHub Issues, Jira SM, Freshdesk, Intercom, email.
Webhooks are preferred when available; polling is the fallback. The poller supports the "since cursor" pattern natively — it stores last_seen_external_id per source in the DB and asks each adapter "give me everything newer than this".
9.1 Canonical Ticket model¶
Ticket:
  id: uuid7
  workspace_id: uuid7
  source_id: fk → ticket_sources
  external_id: string        # the source's native ID (ZD-1234, HPS-4871...)
  title: string
  body: text
  customer: string           # free-form, e.g. "Hartwell Law — Kevin Reyes"
  requester_email: string?
  priority: low | medium | high | critical
  status: new | queued | investigating | auto_resolved | needs_tech | needs_client | needs_l2 | failed
  sla_at: datetime?
  received_at: datetime
  source_metadata: jsonb     # anything the adapter wants to preserve
This maps 1:1 to the existing persona prototypes' ticket shape. Migrations are avoided by keeping the superset.
10. Chat surface — widget, Slack, Teams, operator console¶
10.1 End-user chat widget¶
- A React app bundled by Vite in library mode into a single JS file (gaby-widget.js, target <40 KB gzipped).
- Mounted into a shadow DOM so the host site's CSS can't leak in or out.
- Talks to /api/chat on the Gaby backend via fetch + SSE for streaming replies.
- Themable via a single Gaby.init({ theme: { primary: '#0284c7', font: 'Inter' } }) call.
Abuse surface — rate limits and auth¶
The widget is public-facing. It must not become a free LLM token faucet. Controls:
| Layer | Limit | Rationale |
|---|---|---|
| Per IP | 20 messages / minute, 200 / day | Hard cap before any backend work |
| Per widget session | 40 messages total, 15-minute idle timeout | Session-scoped envelope |
| Per workspace | Configurable daily budget in USD (default $50/day for chat) | Workspace owner sets the ceiling |
| Challenge | After 3 messages, an invisible Turnstile/hCaptcha challenge | Blocks bots without friction for humans |
| Auth options | Anonymous (rate-limited), host-provided JWT (verified by shared secret), logged-in user (via host's own auth) | Stricter auth → higher limits |
All limits are enforced before any LLM call is made. Rate-limit hits return 429 with a Retry-After header; the widget surfaces "I'm getting a lot of messages right now, please try again in a moment."
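A per-IP limit such as "20 messages / minute" could be enforced with a sliding window along these lines. This is a minimal in-process sketch, not the shipped limiter: `SlidingWindowLimiter` is a hypothetical name, and a multi-process deployment would need a shared store instead of a local dict.

```python
import math
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per `window_s` seconds for each key."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # key -> timestamps of accepted events

    def check(self, key: str, now: float) -> tuple[bool, int]:
        """Return (allowed, retry_after_seconds); retry_after is 0 when allowed."""
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) < self.limit:
            hits.append(now)
            return True, 0
        # The oldest accepted hit leaves the window at hits[0] + window_s.
        return False, max(1, math.ceil(hits[0] + self.window_s - now))
```

The second element of the result maps directly onto the Retry-After header of the 429 response.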
10.2 Session lifecycle¶
session.created ──user message──▶ session.active
│
│ (Gaby responds, possibly multiple turns)
│
▼
can_auto_resolve? ──yes──▶ session.resolved
│
no
▼
handoff_requested → session.handoff_pending
│
operator accepts
│
▼
session.handoff_active
│
▼
operator closes → session.closed
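The lifecycle above can be read as a small transition table. The sketch below is one way to encode it; the state names follow the diagram, while the event names (`user_message`, `auto_resolved`, and so on) are hypothetical labels chosen for illustration.

```python
# (current state, event) -> next state, following the diagram above.
TRANSITIONS = {
    ("created", "user_message"): "active",
    ("active", "auto_resolved"): "resolved",
    ("active", "handoff_requested"): "handoff_pending",
    ("handoff_pending", "operator_accepts"): "handoff_active",
    ("handoff_active", "operator_closes"): "closed",
}

TERMINAL_STATES = {"resolved", "closed"}


def advance(state: str, event: str) -> str:
    """Return the next session state, or raise if the transition is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state!r} on {event!r}") from None
```

Rejecting unknown transitions explicitly keeps a session from, for example, jumping from created straight to handoff_active.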
10.3 Handoff bundle¶
When Gaby escalates a chat to a human, the operator receives (in the operator console):
- Full transcript so far (both user + Gaby)
- Every tool call Gaby made, with arguments and redacted results
- Citations used for any KB-backed claims
- The current working memory snapshot
- A one-sentence "why I couldn't resolve this" from the agent
The operator starts mid-flight, not cold. This is the single biggest satisfaction driver for the human chat surface.
10.4 Slack / Teams¶
- Bolt-for-Python for Slack, Bot Framework for Teams.
- Same session model, same handoff bundle.
- Inbound in Slack is v0.3; v0.1 ships Slack outbound only for escalations.
11. Auth and identity (four surfaces)¶
| Surface | Mechanism | Session store |
|---|---|---|
| Web UI (operators) | Session cookie (HttpOnly, SameSite=Lax) + CSRF token | sessions table |
| CLI / automation | API key (gaby-XXXX.YYYY), prefix + hashed remainder | api_keys table |
| End-user chat widget | Host-provided JWT (verified by a shared key) or anonymous token | chat_sessions table |
| Connector OAuth | Per-connector device-code flow for the ones that support it; API keys for the rest | encrypted connectors.config |
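The "prefix + hashed remainder" scheme for API keys might look like the sketch below. The function names, the prefix length, and the choice of SHA-256 are assumptions for illustration; the source only fixes the gaby-XXXX.YYYY shape. A fast hash is used here on the assumption that the remainder is a high-entropy random string.

```python
import hashlib
import hmac
import secrets


def issue_api_key() -> tuple[str, str, str]:
    """Return (full_key, prefix, remainder_digest); only prefix and digest are stored."""
    prefix = secrets.token_hex(4)          # lookup handle, safe to show in the UI
    remainder = secrets.token_urlsafe(24)  # secret part, shown to the user once
    digest = hashlib.sha256(remainder.encode()).hexdigest()
    return f"gaby-{prefix}.{remainder}", prefix, digest


def verify_api_key(presented: str, stored: dict[str, str]) -> bool:
    """`stored` maps prefix -> SHA-256 hex digest of the remainder."""
    if not presented.startswith("gaby-") or "." not in presented:
        return False
    prefix, remainder = presented[len("gaby-"):].split(".", 1)
    expected = stored.get(prefix)
    if expected is None:
        return False
    actual = hashlib.sha256(remainder.encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(actual, expected)
```

Storing the prefix in clear lets the server find the right row with an indexed lookup while never persisting the secret itself.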
First-run bootstrap¶
On first boot, Gaby generates a one-time admin provisioning URL printed to stdout (and to a file if running headless). Opening it creates the first admin user. This URL expires in 15 minutes. After first use, Gaby refuses to issue another unless the DB is wiped — no silent "admin/admin" defaults.
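The one-time, 15-minute behaviour described above could be modelled as follows. `BootstrapToken` and `redeem` are hypothetical names, and the sketch keeps state in memory, whereas the real system would have to record first use durably so a restart cannot reissue the URL.

```python
import hmac
import secrets
import time


class BootstrapToken:
    """One-time admin provisioning token with an expiry (illustrative helper)."""

    def __init__(self, ttl_s: float = 15 * 60, clock=time.time) -> None:
        self._clock = clock
        self.value = secrets.token_urlsafe(32)  # embedded in the printed URL
        self._expires_at = clock() + ttl_s
        self._used = False

    def redeem(self, presented: str) -> bool:
        """Return True exactly once, for the right token, before expiry."""
        if self._used or self._clock() >= self._expires_at:
            return False
        if not hmac.compare_digest(presented.encode(), self.value.encode()):
            return False
        self._used = True
        return True
```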
SSO / SAML / SCIM¶
Enterprise Edition feature. Implemented via authlib + SAML2, behind a feature flag keyed to the license.
12. Connector contract — the testable promise¶
Every connector must pass these at pytest time:
| Test | Assertion |
|---|---|
| test_initialize | Responds to initialize MCP request within 2s |
| test_tools_list | Returns a tool list with every tool carrying scope and description |
| test_tool_scopes_wellformed | Every tool's scope ∈ the allowed scope set |
| test_dangerous_flagged | Any destructive tool has dangerous: true |
| test_dry_run_supported | Every write tool supports a dry_run=true argument |
| test_healthcheck | Responds to the healthcheck tool |
| test_redaction_noleak | Tool results never echo back secrets passed in args (paranoia check) |
| test_large_result_truncated | Results over 100 KB are truncated (or paged) with a truncation marker |
Contract tests live under connectors/_contract/ and are re-run against every first-party and community connector in CI.
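As an illustration of what test_large_result_truncated expects from a connector, a truncation helper could look like this. The marker text and the function name are assumptions; the contract only requires that some truncation marker is present.

```python
MAX_RESULT_BYTES = 100 * 1024      # the 100 KB ceiling from the contract table
TRUNCATION_MARKER = "\n[truncated]"  # hypothetical marker text


def truncate_result(text: str, max_bytes: int = MAX_RESULT_BYTES) -> str:
    """Cap a tool result at max_bytes of UTF-8, appending a marker when cut."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    keep = max_bytes - len(TRUNCATION_MARKER.encode("utf-8"))
    # errors="ignore" drops a multi-byte character split by the cut.
    return raw[:keep].decode("utf-8", errors="ignore") + TRUNCATION_MARKER
```

Measuring in encoded bytes rather than characters keeps the cap meaningful for non-ASCII results.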
13. Error handling philosophy¶
One rule: fail loud, fail early, degrade only after explicit design.
| Category | Handling |
|---|---|
| Transient (network blips, 5xx) | Retry with exponential backoff + jitter, max 3 attempts, circuit breaker per endpoint |
| Permanent (4xx, auth expired) | No retry. Investigation enters needs_tech with a specific error. |
| Budget exceeded | Investigation → failed_budget, escalate. |
| Scope denied | Tool call never runs. Audit as action.denied. Agent plans an alternative. |
| LLM refuses or returns garbage | Retry once with a clarifying instruction. Then escalate. |
| MCP connector crash mid-call | Investigation pauses, connector is restarted, call retried once. Then escalate. |
| Unrecoverable internal bug | Investigation → failed, full stack trace in audit, operator notified. |
No bare except: anywhere in the codebase. Enforced by ruff's bare-except rule (E722).
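The transient-error row (exponential backoff + jitter, max 3 attempts) can be sketched as below. `TransientError` and `retry_transient` are hypothetical names, the base delay is an assumed value, and the per-endpoint circuit breaker is left out to keep the example short.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for network blips and 5xx responses."""


def retry_transient(fn, *, max_attempts=3, base_delay=0.5,
                    sleep=time.sleep, rng=random.random):
    """Call fn(); retry TransientError with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time in [0, base_delay * 2**(attempt - 1)).
            sleep(rng() * base_delay * 2 ** (attempt - 1))
```

Only `TransientError` is caught, so permanent failures (4xx, expired auth) propagate immediately without a retry, matching the second row of the table.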
14. Caching layers¶
| Cache | Scope | TTL | Invalidation trigger |
|---|---|---|---|
| LLM prompt cache | Per Anthropic cache key | Anthropic-managed (~5 min) | Automatic |
| Tool manifest cache | Per connector version | process lifetime | Connector restart |
| Retrieval result cache | Per (query_hash, corpus_version) | 10 min | KB re-ingest bumps corpus_version |
| Embedding cache | Per (text_hash, model) | permanent | Model change invalidates by key |
| Session ticket cache | Per ticket_id, within one investigation | investigation lifetime | End of investigation |
| OpenAPI doc cache | Per build | process lifetime | Rebuild |
15. Deployment topologies¶
15.1 Founder quickstart (v0.1 default)¶
┌────────────────────────────────┐
│ docker compose up │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ gaby │ │ gaby │ │
│ │ backend │◀─▶│ web │ │
│ │ + SQLite│ │ static │ │
│ └──────────┘ └──────────┘ │
│ ▲ │
│ │ │
│ ┌──────┴───────┐ │
│ │ MCP servers │ │
│ │ (subproc) │ │
│ └──────────────┘ │
└────────────────────────────────┘
- 2 containers, embedded SQLite, in-process worker
- Total RAM: ~512 MB
- Total persistent volumes: 1 (the SQLite DB + KB index)
15.2 MSP / scaled (v0.2)¶
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ gaby │ │ gaby │ │ gaby │
│ backend │ │ backend │ │ worker │
│ │ │ │ │ (arq) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└────────┬────────┴────────┬────────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Postgres │ │ Redis │
│ + pgvector │ │ │
└─────────────┘ └─────────────┘
15.3 Enterprise / air-gapped (EE feature, later)¶
- Air-gapped registry (images mirrored)
- External secrets provider required
- OIDC/SAML enforced
- SIEM export of audit log enabled
- Local LLM (vLLM or Ollama) for data-sensitive workspaces
16. Observability contracts¶
| Signal | Carrier | Required fields |
|---|---|---|
| Log | structlog → JSON stdout | ts, level, service, request_id, workspace_id, investigation_id?, event, payload |
| Trace span | OTel span | same attributes as logs + span.kind, span.status |
| Metric | Prometheus counter/histogram | gaby_* prefix, labels: workspace, connector, model, outcome |
Canonical metrics (v0.1 must emit these):
| Metric | Type | Why |
|---|---|---|
| gaby_investigations_total | counter | Throughput |
| gaby_investigations_duration_seconds | histogram | Latency p50/p95/p99 |
| gaby_investigations_verdict_total | counter | By verdict label (auto-resolution %) |
| gaby_llm_tokens_total | counter | By model + by purpose (planner/verdict) |
| gaby_llm_cost_usd_total | counter | By model |
| gaby_connector_calls_total | counter | By connector + by tool + by status |
| gaby_connector_health | gauge | 0/1 per connector |
| gaby_safety_denials_total | counter | Every denial is visible |
| gaby_approvals_pending | gauge | Operator queue depth |
| gaby_chat_sessions_total | counter | By channel (widget/slack/teams), by outcome |
| gaby_chat_handoffs_total | counter | Gaby → human takeovers |
| gaby_rate_limit_rejections_total | counter | By surface (widget/api) |
| gaby_ticket_queue_depth | gauge | DB poll result |
| gaby_retrieval_hit_rate | gauge | Share of citations ultimately used in the summary |
Every metric above ships with a Grafana dashboard JSON in docs/operations/dashboards/.
17. Failure modes — explicit list¶
| Failure | Detection | Response |
|---|---|---|
| LLM provider down | HTTP error from provider | Router tries fallback provider; if none, escalate |
| LLM budget exhausted | BudgetGuard pre-check | Investigation halts with failed_budget |
| Connector subprocess crash | asyncio.subprocess.returncode | Supervised restart; after 5 failures → DEGRADED |
| Connector OAuth expired | 401/403 from tool call | Pause investigation, send re-auth link to admin via escalation |
| Help desk webhook delivery fail | Gap between webhook deliveries and the poll cursor | Poller fallback closes the gap |
| Database unavailable | SQLAlchemy pool exhaustion | API returns 503; backoff; alert |
| Redis unavailable (when running arq) | redis-py ConnectionError | Worker pauses; in-process fallback if enabled |
| Vector index corruption | Query returns 0 results when FTS returns >0 | Automatic reindex on next KB sync; alert |
| Disk full | SQLite write error | API returns 503; alert |
| Malicious PII in ticket body | Redaction rule | Redact before LLM, record the original in encrypted column |
| Runaway agent loop | max_iterations cap (20) | Escalate with failed_budget |
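The connector-crash row (supervised restart, DEGRADED after 5 failures) could be tracked with a small counter like the one below. This is a sketch under one stated assumption: that the 5 failures are consecutive and that a healthy check resets the count, which the table does not specify. `ConnectorSupervisor` and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConnectorSupervisor:
    """Tracks crashes for one connector subprocess (illustrative only)."""

    max_failures: int = 5
    failures: int = 0
    state: str = "HEALTHY"

    def on_crash(self) -> str:
        """Record a crash and return the action to take: 'restart' or 'degrade'."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.state = "DEGRADED"
            return "degrade"
        return "restart"

    def on_healthy(self) -> None:
        """Assumption: a passing healthcheck resets the consecutive-failure count."""
        self.failures = 0
        self.state = "HEALTHY"
```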
18. Open architectural questions (non-blocking for v0.1)¶
These are worth watching but don't hold v0.1:
- Streaming the investigation timeline to the UI over SSE vs WebSocket vs polling? Plan: SSE (simpler infra, one-way fits the model).
- Multi-region replication for the managed cloud. Defer to v0.5.
- Reasoning models for the planner. Worth evaluating on the eval harness before v0.2.
- Embedded vLLM for privacy-sensitive workspaces. Planned for v0.3 alongside the EE air-gapped mode.
Previously here: "Cross-investigation memory, defer to v0.3." Now designed; see §22 Memory hierarchy. Long-term memory starts accumulating on day one of v0.1 via SQLiteMemoryGraph.
19. Cross-document index¶
- SPEC.md §6.5: Safety model. This doc §6 is the implementation.
- SPEC.md §6.6: Ticket sources. This doc §9 is the adapter contract.
- SPEC.md §6.4: LLM layer. This doc §8 is the implementation.
- FOUNDATION.md §1.1: Stack choices. This doc §4, §8, §12 are the consequences.
- FOUNDATION.md §3: Data model. This doc §2, §6 are how they're used at runtime.
20. Definition of done for this document¶
This doc is "done enough to build v0.1 against" when:
- [x] Every component in the System Map (§1) has its own section in this doc
- [x] The agent loop state machine (§3) is explicit, with terminal states and budget rules
- [x] The safety pipeline (§6) has a numbered order of operations and a scope DSL sketch
- [x] Every failure mode we can think of today is listed (§17)
- [x] Observability and metrics are concrete (§16)
- [x] The v0.1 deployment topology is drawn (§15.1)
- [x] Cross-links to SPEC.md and FOUNDATION.md are in place (§19)
- [x] Self-critique pass completed (added §2.1, §4.5–4.6, resume semantics in §3, §5.4–5.5, rate limits in §10.1, cost mapping in §8.3.1, classifier wiring in §2, extra metrics in §16)
- [x] Online reference validation pass completed — see §21
21. Reference validation — pass dated 2026-04-11¶
Every load-bearing technology choice was verified against current (2026) online references during architecture review. Summary:
| Area | Finding | Architecture impact |
|---|---|---|
| MCP transports | Spec 2025-03-26 introduced Streamable HTTP as the remote transport, superseding SSE. Connection recovery via Last-Event-ID header. Python SDK supports both stdio and Streamable HTTP with a unified client. | §5.1 locked: stdio subprocess for first-party in-image connectors, Streamable HTTP for remote/community servers. Confirmed current. |
| Agent frameworks | pydantic-ai shows meaningful advantages over LangGraph in 2026 benchmarks (~44% lower P95, ~5× fewer errors, ~2.7× lower tokens). Ships pydantic-graph with durable execution and HITL that map onto our WAITING_APPROVAL state. | §3: homegrown loop is still the v0.1 choice (safety on the critical path); pydantic-ai is the locked fallback if homegrown stalls, with a re-eval gate after v0.1 ships. |
| Anthropic prompt caching | Order tools → system → messages. Sonnet min 1,024 tokens / Haiku min 4,096. Max 4 breakpoints per request. 5-min default TTL, 1-hour at 2× base, cache reads at 0.1×. Workspace-level isolation since 2026-02-05. | §8.2: the 4 breakpoints are now explicit (tools, system, KB chunks, messages) with the minimum-size caveats called out. |
| LiteLLM | Proxy has production issues (GIL throughput, DB logging degradation). March 2026 supply-chain attack on PyPI versions 1.82.7 and 1.82.8. 800+ open issues. Bifrost and Portkey are the production-grade alternatives. | §8.1: we use the SDK only, not the proxy; hot paths go direct to Anthropic/OpenAI; pin versions; install with hash verification; Bifrost/Portkey noted as v0.2+ migration targets if BYOK volume warrants. |
| sqlite-vec | Mozilla Builders project, pure C, production-stable, but brute-force search only (no ANN). Fine for small KBs, does not scale. vectorlite is 3–30× faster with ANN; pgvector HNSW is the Postgres story. | §7.1: added a scale cliff table and explicit migration triggers at 5k and 50k chunks. Embedding blob schema is dim-agnostic to keep migration cheap. |
| uv | Production/Stable status on PyPI. Community consensus in 2026 trending heavily toward uv. 10–100× faster than pip. Drop-in replacement. | FOUNDATION.md §1.1 locked uv — confirmed current. |
| FastAPI + SQLAlchemy 2 async | Modern production default is pool_size=20, max_overflow=10 for Postgres (not 5). Use asyncpg driver. Dependency-injected session lifecycle. | §4.2 updated: defaults split between SQLite (small) and Postgres (larger) profiles. |
| Tailwind 4 + shadcn/ui + React 19 + Vite + RR7 | Fully compatible and production-ready. Migration notes: use data-slot attribute, React.ComponentProps instead of forwardRef. @tailwindcss/vite plugin is the install path. | FOUNDATION.md §1.2 locked, confirmed current. |
| Biome | 10–25× faster than ESLint+Prettier; covers ~80% of ESLint rules; doesn't support eslint-plugin-react-hooks (type-aware rules require the TS language service). | FOUNDATION.md §1.2 needs a note: Biome primary, keep ESLint running for react-hooks only until Biome closes the gap. Added to the plan. |
| LLM eval tooling | promptfoo is used by Anthropic and OpenAI themselves for prompt regression; Inspect AI (UK AISI) is specifically designed for agent evaluation with tool calls and model-graded rubrics. | FOUNDATION.md §4.1 updated direction: promptfoo in v0.1 for prompt regression; Inspect AI evaluated for v0.2 for full agent evals with tool calls. |
What I did not find any reason to change: FastAPI as web framework, asyncio as concurrency model, Alembic for migrations, arq for background jobs (with in-process fallback), structlog for logging, OpenTelemetry for tracing, pnpm for JS, Vite for bundling, Vitest+Playwright for tests, MkDocs Material for docs, MCP Python SDK as the connector protocol library, DCO over CLA for contributions, Apache 2.0 core + commercial EE for licensing.
Next validation pass: after v0.1 ships, re-run this research with the same queries and diff. Anything that has moved >1 major version or lost community traction goes on the v0.2 re-evaluation list.
22. Memory hierarchy — how Gaby gets smarter over time¶
TL;DR. Three memory tiers (short / medium / long) feed the planner through a bounded context envelope on every LLM call. Long-term memory is a graph-shaped model stored behind a MemoryGraph protocol. The v0.1 default backend is SQLite (two tables, recursive CTE traversals). Two opt-in backends ship stubs in Iter 0 and full implementations in Iter 4: Apache AGE (Postgres extension, the recommended graph-native path) and FalkorDBLite (embedded Cypher, Apache 2.0). Long-term memory starts accumulating on day one regardless of backend choice: the data written through the protocol is the data you migrate later.
22.1 The three tiers¶
┌──────────────────────────────────────────────────────┐
│ SHORT-TERM │
│ scope: one investigation │
│ lifetime: seconds–minutes │
│ storage: in-proc WorkingMemory + jsonb snapshot │
│ gate: redaction on the LLM boundary │
│ purpose: let the loop reason + crash-resume │
└──────────────────────────────────────────────────────┘
▲
│ feeds upward on success
│ (via kb_candidates staging)
│
┌──────────────────────────────────────────────────────┐
│ MEDIUM-TERM │
│ scope: workspace │
│ lifetime: hours–days │
│ storage: TTL queries + in-proc LRUs + jsonb │
│ gate: automatic (cache-like, not "learning") │
│ purpose: avoid rework, spot bursts, context-swap │
└──────────────────────────────────────────────────────┘
▲
│ promoted via human review
│
┌──────────────────────────────────────────────────────┐
│ LONG-TERM │
│ scope: workspace │
│ lifetime: indefinite │
│ storage: MemoryGraph (nodes + edges) + documents │
│ gate: 100% human-in-the-loop — no silent writes│
│ purpose: the product getting smarter │
└──────────────────────────────────────────────────────┘
22.2 Short-term (within one investigation)¶
Already designed in §2.1. WorkingMemory { ticket, messages, tool_calls, retrieved_chunks, budget_state } lives in memory during the loop and is snapshotted to investigations.working_memory_snapshot at every state-machine transition so a crash can resume. PII is redacted before anything crosses into an LLM call. Short-term memory is private to its investigation — it is never read by a different investigation.
22.3 Medium-term (cache-like, automatic)¶
Four operational stores. None of them "learn" — they are all caches with hard TTLs. They are written automatically and read by the planner as hints.
| Store | Shape | TTL | Purpose |
|---|---|---|---|
| Recent-tickets window | A query on tickets (WHERE received_at > now() - interval '24h'), not a new table | 24 h rolling | Dedup (same thing just resolved), burst detection ("4 customers just hit the same VPN error: this is an upstream incident, not 4 individual problems") |
| Connector-result cache | Process-local LRU keyed on (workspace_id, connector_id, tool_name, canonical(args)); Redis-backed when arq is on | 60 s for reads, 0 s for writes | "I just SELECT-ed this users table 15 s ago during this investigation, don't re-query" |
| Operator session notes | jsonb column on the existing sessions table | Session lifetime | "The operator just approved this kind of action, don't re-prompt them for the rest of their session" |
| KB candidate staging | kb_candidates table: entries awaiting human review, visible in the approval queue UI | 30 days → auto-archive if unreviewed | The bridge from auto-resolved investigations to long-term KB. Auto-written when verdict = auto_resolved and the quality gate passes. |
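The connector-result cache row relies on canonical(args) so that argument order does not change the key. A minimal in-process sketch, assuming JSON-serialisable arguments; `ConnectorResultCache` is a hypothetical name and the Redis-backed variant is not shown.

```python
import json
import time
from collections import OrderedDict


def canonical(args: dict) -> str:
    """Stable serialisation: same arguments in any key order give the same string."""
    return json.dumps(args, sort_keys=True, separators=(",", ":"))


class ConnectorResultCache:
    """Process-local LRU with a TTL for read-only tool results."""

    def __init__(self, max_entries=256, ttl_s=60.0, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl_s = ttl_s
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    @staticmethod
    def key(workspace_id, connector_id, tool_name, args):
        return (workspace_id, connector_id, tool_name, canonical(args))

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl_s, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # evict least recently used
```

The "0 s for writes" rule is left to the caller in this sketch: results of write tools are simply never passed to `put`.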
22.4 Long-term (the graph memory — the product learning layer)¶
Two stores with different access patterns:
| Store | What it holds | How it's created | How it's used | How it's forgotten |
|---|---|---|---|---|
| Verified KB entries (documents + document_chunks tables from Iter 2's knowledge pipeline) | Runbooks, past-ticket resolutions promoted from kb_candidates, manually-added Markdown | Human accepts/edits a candidate through the approval queue UI | Retrieval pipeline in §7 (hybrid BM25 + vector → top-6 with citations) | Explicit delete; stale-content detection after 6 months of non-use |
| Memory graph (memory_nodes + memory_edges behind the MemoryGraph protocol) | Entities (customers, users, systems, connectors, tickets, investigations, facts, observations, resolutions) and their typed relationships | Operator clicks "Remember this" on an investigation step, OR Gaby proposes after N≥3 similar observations and the operator accepts | Planner envelope at every LLM call: neighbors(ticket.customer, depth=1) loads applicable facts and observations | Explicit delete; status='archived' for unused nodes after 90 days; GDPR forget_subject() for hard compliance removal |
22.5 The node label set (POLE+O, domain-adapted)¶
Borrowed from neo4j-labs/agent-memory's POLE+O model (Persons, Objects, Locations, Events, Observations) and adapted to Gaby's domain:
| Label | Meaning | Example natural_key |
|---|---|---|
| customer | A company / client / account receiving support | customer:hartwell-law |
| user | An end-user of a customer (the person who opened a ticket) | user:kevin.reyes@hartwelllaw.com |
| system | An application, service, or piece of infrastructure | system:keycloak-prod, system:stripe-webhooks |
| connector | A configured MCP connector instance | connector:postgres-main |
| ticket | A canonical ticket (one node per external ticket) | ticket:zoho:ZD-8891 |
| investigation | An investigation Gaby ran | investigation:inv_01HXYZ... |
| fact | An atomic piece of knowledge (Observations in POLE+O) | fact:hartwell-legacy-pop3 |
| observation | A time-stamped occurrence ("this happened at this time") | observation:mfa-lockout@kevin@2026-02-20 |
| resolution | A resolution pattern that worked | resolution:clear-stale-keycloak-sessions |
Labels are not exhaustive — new labels can be added in v0.2+ without a schema migration (they're just a string column), but these nine cover the v0.1 Founder persona completely.
22.6 The typed relations (seven categories)¶
Borrowed from memory-graph/memory-graph's seven-category relationship model. Edges carry a relation column plus a free-form properties jsonb.
| Category | Relations |
|---|---|
| Causal | CAUSES, TRIGGERS, LEADS_TO, PREVENTS |
| Solution | SOLVES, ADDRESSES, ALTERNATIVE_TO, IMPROVES |
| Context | OCCURS_IN, APPLIES_TO, WORKS_WITH, REQUIRES |
| Learning | BUILDS_ON, CONTRADICTS, CONFIRMS |
| Similarity | SIMILAR_TO, VARIANT_OF, RELATED_TO |
| Workflow | FOLLOWS, DEPENDS_ON, ENABLES, BLOCKS |
| Quality | EFFECTIVE_FOR, PREFERRED_OVER, DEPRECATED_BY |
Relations are an enum in code, but the DB column is a string so v0.2+ can add relations without a migration. Every backend implementation validates relation strings against the enum at write time.
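Write-time validation against the enum could look like the sketch below. The 25 relation names come from the table above; `Relation` and `validate_relation` are hypothetical names, and the enum is built with the functional API only to keep the listing short.

```python
from enum import Enum

# The seven categories from the table above, one category per line.
_RELATION_NAMES = """
CAUSES TRIGGERS LEADS_TO PREVENTS
SOLVES ADDRESSES ALTERNATIVE_TO IMPROVES
OCCURS_IN APPLIES_TO WORKS_WITH REQUIRES
BUILDS_ON CONTRADICTS CONFIRMS
SIMILAR_TO VARIANT_OF RELATED_TO
FOLLOWS DEPENDS_ON ENABLES BLOCKS
EFFECTIVE_FOR PREFERRED_OVER DEPRECATED_BY
""".split()

# str-valued enum: members compare equal to the plain strings stored in the DB.
Relation = Enum("Relation", [(name, name) for name in _RELATION_NAMES], type=str)


def validate_relation(value: str) -> str:
    """Return the relation string unchanged, or raise if it is not in the enum."""
    try:
        return Relation(value).value
    except ValueError:
        raise ValueError(f"unknown relation: {value!r}") from None
```

Because the column stays a string, adding a relation in a later version only requires extending the list, not a migration.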
22.7 The MemoryGraph protocol — the contract¶
Every backend implements exactly these 11 methods. The small surface is deliberate: smaller surface = easier plug-and-play + easier verification via the 3-backend round-trip test in Iter 4.
from collections.abc import AsyncIterator
from datetime import datetime
from typing import Literal, Protocol

# WorkspaceId, NodeId, EdgeId, Node, Edge, Health, ForgetReport, NodeDump and
# EdgeDump are project types, assumed to be defined elsewhere in the codebase.


class MemoryGraph(Protocol):
    # ---- Writes (every call requires workspace_id) ----
    async def upsert_node(
        self,
        workspace_id: WorkspaceId,
        label: str,                    # one of the 9 node labels above
        natural_key: str,              # unique within (workspace_id, label)
        properties: dict,
        provenance: Literal["operator", "proposed", "imported"],
        status: Literal["provisional", "active", "archived"] = "provisional",
    ) -> NodeId: ...

    async def upsert_edge(
        self,
        workspace_id: WorkspaceId,
        from_id: NodeId,
        to_id: NodeId,
        relation: str,                 # one of the ~25 typed relations above
        weight: float = 1.0,
        properties: dict | None = None,
        observed_at: datetime | None = None,
    ) -> EdgeId: ...

    async def mark_archived(self, workspace_id: WorkspaceId, node_id: NodeId) -> None: ...

    async def forget_subject(
        self, workspace_id: WorkspaceId, subject_natural_key: str
    ) -> ForgetReport: ...

    # ---- Reads ----
    async def get_node(
        self, workspace_id: WorkspaceId, label: str, natural_key: str
    ) -> Node | None: ...

    async def neighbors(
        self,
        workspace_id: WorkspaceId,
        node: NodeId,
        *,
        depth: int = 1,                # backend may cap at its comfort zone
        relations: list[str] | None = None,
        limit: int = 10,
    ) -> list[tuple[Node, Edge]]: ...

    async def path(
        self,
        workspace_id: WorkspaceId,
        from_id: NodeId,
        to_id: NodeId,
        max_depth: int = 3,
    ) -> list[Edge] | None: ...

    async def query_by_labels(
        self,
        workspace_id: WorkspaceId,
        labels: list[str],
        limit: int = 50,
    ) -> list[Node]: ...

    # ---- Admin ----
    async def healthcheck(self) -> Health: ...

    # ---- Migration (non-negotiable for every backend) ----
    # Declared with plain `def`: implementations are async generators, so callers
    # write `async for row in graph.export_all(ws)` without awaiting the call.
    def export_all(
        self, workspace_id: WorkspaceId
    ) -> AsyncIterator[NodeDump | EdgeDump]: ...

    async def import_all(
        self, workspace_id: WorkspaceId, stream: AsyncIterator[NodeDump | EdgeDump]
    ) -> int: ...
The export_all / import_all pair is the migration contract. SQLite → Postgres+AGE, SQLite → FalkorDBLite, FalkorDBLite → Postgres+AGE — all the same operation: export_all from source, stream through import_all on destination. Iter 4's test suite round-trips the same fixture through all three backends and asserts byte-identical dumps.
22.8 The three backends shipped in v0.1¶
| Backend | Shipped | When to use | Implementation notes |
|---|---|---|---|
| SQLiteMemoryGraph | Default in v0.1. Full implementation lands in Iter 0 (tables + protocol surface) and Iter 4 (all methods). | Everyone using docker compose up without a profile flag. Works up to low-5-figure node counts with acceptable query latency. | Two tables (memory_nodes, memory_edges), SQLite FTS5 on node properties.text for label-scoped search, recursive CTEs for neighbors(), capped at depth 2 (higher depths raise DepthNotSupported). |
| PostgresAGEMemoryGraph | Stub in Iter 0 (so factory + config compile), full implementation in Iter 4. | Users who want graph-native memory from day one AND are happy to run Postgres. Recommended graph-native path. | Uses the Apache AGE extension on the same Postgres instance that holds the rest of Gaby's data. Nodes/edges become AGE vertices/edges. Traversals use Cypher via the AGE cypher(...) SQL function. depth can go to 5+ without pain. |
| FalkorDBLiteMemoryGraph | Stub in Iter 0, full implementation in Iter 4. | Users who want graph-native memory embedded (no extra container) AND are comfortable with a Beta embedding layer on top of a production-stable engine. Secondary graph-native path. | falkordblite Python package (Apache 2.0) spawns a local FalkorDB inside the backend container. Cypher queries via the falkordb-py client. Same API in embedded and server modes, so "switch to production FalkorDB" is one config change. |
Expressly not shipped in v0.1: Neo4j (JVM cost per §22.11), SurrealDB (license status flagged for verification, not confirmed), Cozo (pre-1.0, maintainers don't promise storage compatibility), Kuzu (archived 2025-10-10).
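To make the SQLite row concrete, a depth-capped recursive CTE traversal might look like this. The schema is a deliberately reduced, hypothetical version of memory_nodes / memory_edges, the function follows outgoing edges only, and it returns natural keys rather than the (Node, Edge) tuples of the real protocol.

```python
import sqlite3

MAX_DEPTH = 2  # the SQLite backend's cap, per the table above

# Hypothetical minimal schema; the real tables carry more columns.
SCHEMA = """
CREATE TABLE memory_nodes (id INTEGER PRIMARY KEY, workspace_id TEXT, natural_key TEXT);
CREATE TABLE memory_edges (workspace_id TEXT, from_id INTEGER, to_id INTEGER, relation TEXT);
"""

NEIGHBORS_SQL = """
WITH RECURSIVE walk(node_id, depth) AS (
    SELECT ?, 0
    UNION
    SELECT e.to_id, w.depth + 1
    FROM walk AS w
    JOIN memory_edges AS e ON e.from_id = w.node_id
    WHERE e.workspace_id = ? AND w.depth < ?
)
SELECT DISTINCT n.natural_key
FROM walk AS w
JOIN memory_nodes AS n ON n.id = w.node_id
WHERE w.depth > 0 AND n.workspace_id = ?
ORDER BY n.natural_key
LIMIT ?
"""


class DepthNotSupported(ValueError):
    """Raised when a traversal asks for more hops than the backend allows."""


def neighbors(conn, workspace_id, seed_id, depth=1, limit=10):
    """Return natural keys reachable from seed_id within `depth` outgoing hops."""
    if depth > MAX_DEPTH:
        raise DepthNotSupported(f"depth {depth} exceeds cap of {MAX_DEPTH}")
    rows = conn.execute(
        NEIGHBORS_SQL, (seed_id, workspace_id, depth, workspace_id, limit)
    ).fetchall()
    return [row[0] for row in rows]
```

The workspace filter appears on both the edge walk and the final node join, so an edge or node from another workspace cannot enter the result even if IDs collide.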
22.9 The planner context envelope¶
The planner does NOT receive all the memory — that would blow the context window and the budget. It receives a bounded envelope assembled at the start of every LLM call:
planner_input = {
    system_prompt,                      ← cache breakpoint 1 (persona + safety)
    tool_manifest,                      ← cache breakpoint 2 (connectors)
    envelope: {
        retrieved_kb_chunks[0..6],      ← long-term vector (hybrid retrieval)
        applicable_facts[0..10],        ← long-term graph (MemoryGraph.neighbors)
        recent_similar_tickets[0..3],   ← medium-term query (tickets table window)
        connector_recent_results,       ← medium-term cache (in-proc LRU)
    },                                  ← cache breakpoint 3 (stable during investigation)
    working_memory.messages,            ← short-term (accumulator)
}                                       ← cache breakpoint 4
Breakpoint 3 is where cache economics matter most: the envelope is stable for an entire investigation, so every planner call after the first reads it at 0.1× base price (per Anthropic prompt cache rules in §8.2).
The envelope's long-term-graph slot is filled by:
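customer = await memory.get_node(workspace_id, "customer", ticket.customer_natural_key)
facts = (
    await memory.neighbors(
        workspace_id,
        node=customer.id,  # assumes Node exposes its NodeId as .id
        depth=1,
        relations=["HAS_FACT", "PREFERS", "USES", "WORKS_WITH"],
        limit=10,
    )
    if customer is not None
    else []  # unknown customer: nothing to load
)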
Bounds (6 KB chunks, 10 facts, 3 recent tickets) are configurable per workspace but the defaults are deliberate — they keep the envelope under ~5k tokens in the common case, which is safely below every v0.1 model's context window and inside the cache discount band.
22.10 Governance — the hard rules¶
- No silent writes to long-term memory. Every node that reaches status='active' goes through a human gate. Gaby may write status='provisional' nodes autonomously after N≥3 confirming observations, but only an operator promotes them.
- Workspace isolation is enforced at the query layer, not merely documented. Every upsert_*, neighbors, path, query_by_labels call takes workspace_id as a required positional argument. No "global traversal" method exists; no default-workspace fallback exists. Cross-workspace traversal requires a dedicated admin API that does not exist in v0.1.
- PII redaction happens on the way IN. A node or edge stores redacted text. Raw text is preserved only in the encrypted tickets.body_encrypted column for audit.
- Forgetting is a first-class operation. forget_subject() purges medium-term buffers AND cascades status='archived' across the graph for the subject AND records an audit event. GDPR compliance is not retrofitted; it's designed in.
- Stale content is demoted, not deleted. Nodes unused for the configured threshold (default: facts 90 d, KB 180 d) get status='archived'. Archived nodes are hidden from planner queries but preserved for audit reconstruction.
- Provenance is never lost. Every node carries provenance ∈ {operator, proposed, imported}. The planner envelope's display in the UI shows this so operators know which facts came from them vs which Gaby proposed.
22.11 Why not a single graph DB?¶
Short version: because the MemoryGraph protocol is the commitment; the storage engine is an implementation detail. See the full trade-off discussion in the commit history (SQLite default was chosen to preserve the 5-minute install promise while AGE and FalkorDBLite are offered as opt-in profiles for users who want graph-native from day one). Key rationales by what we rejected:
- Neo4j Community — JVM (+300 MB RAM), ~600 MB Docker image, third service in Compose, Cypher as a third query language, GPLv3 licensing friction with our Apache-2.0 core. Not a capability problem; an ops cost problem. Remains a valid swap target for users who want it.
- Kuzu — archived 2025-10-10. Dead for v0.1 commitment.
- SurrealDB — real production use (Samsung Ads, Verizon, Tencent), embedded Rust binary, but 2024–2025 BSL licensing history needs verification before we commit an Apache-2.0 core project to it. Flagged, not adopted.
- Cozo — great vision ("the hippocampus for AI"), embedded like SQLite, but pre-1.0 and the maintainers explicitly do not promise storage compatibility before 1.0. Revisit in v0.3+.
- Memgraph — BSL license. Non-OSS by our definition.
- TypeDB / NebulaGraph / Dgraph — heavier than we need, none add enough over the three chosen backends to justify the extra ops cost.
22.12 What ships in Iter 0 vs Iter 4¶
| Iter 0 (scaffold) | Iter 4 (agent loop) |
|---|---|
| memory_nodes + memory_edges + kb_candidates tables | PostgresAGEMemoryGraph full implementation |
| sessions.operator_notes jsonb column | FalkorDBLiteMemoryGraph full implementation |
| storage/memory_graph/base.py: the protocol | Planner envelope integration (applicable_facts fill) |
| storage/memory_graph/sqlite.py: full implementation | Migration CLI: gaby memory export, import, migrate --to=<backend> |
| storage/memory_graph/postgres_age.py: class stubs with NotImplementedError | 3-backend round-trip integration test |
| storage/memory_graph/falkor_lite.py: class stubs with NotImplementedError | "Remember this" UI button in the approval queue |
| storage/memory_graph/__init__.py: factory reading GABY_MEMORY_BACKEND env var | gaby memory forget --subject=<key> CLI (GDPR) |
| ops/docker/docker-compose.yml gets a graph-age profile adding Postgres+AGE container | Property tests on workspace isolation (any traversal that leaks across workspaces fails) |
Iter 0 still writes graph data from the first ticket. Iter 4 adds the alternative backends and the planner integration.
22.13 Verification — how we'll prove the backend swap works¶
A single integration test runs in CI for every PR that touches storage/memory_graph/:
import pytest


@pytest.mark.integration
@pytest.mark.parametrize("source", ALL_BACKENDS)
@pytest.mark.parametrize("target", ALL_BACKENDS)
async def test_roundtrip_across_backends(source, target, tmp_fixture_graph):
    # 1. Materialize the fixture graph in `source`
    src = make_backend(source)
    await load_fixture(src, tmp_fixture_graph)

    # 2. Export → import into `target`, streaming the export straight into the import
    dst = make_backend(target)
    await dst.import_all("workspace-test", src.export_all("workspace-test"))

    # 3. Compare canonical dumps
    src_dump = sorted([r async for r in src.export_all("workspace-test")], key=_canon)
    dst_dump = sorted([r async for r in dst.export_all("workspace-test")], key=_canon)
    assert src_dump == dst_dump

    # 4. Assert neighbors() returns the same results from both
    for seed in fixture_seeds:
        src_neigh = await src.neighbors("workspace-test", seed, depth=1, limit=10)
        dst_neigh = await dst.neighbors("workspace-test", seed, depth=1, limit=10)
        assert canonicalize(src_neigh) == canonicalize(dst_neigh)
With 3 backends this is a 3×3 = 9-cell test matrix. The SQLite↔SQLite cell catches regressions in the reference implementation. The cross-backend cells catch drift. Any new backend added in the future must make this matrix green before it can be offered to users.