Foundation
Foundation Plan: Gaby v0.1 → v1.0
Status: Draft v0.1 · Owner: Guilliano · Last updated: 2026-04-11
This is the foundation plan. It sits one step below SPEC.md (the what) and above ARCHITECTURE.md (the detailed how). Its job is to lock the decisions that are expensive to change later:
- What language(s) we write in
- How the repo is laid out
- What the data model looks like
- How we test it
- How we ship it
- How the prototype UI becomes a real design system
Everything in this doc is opinionated and defaulted. If a section says "we pick X", assume we start with X and revisit only if we have measured data showing X isn't working. The goal is to stop debating foundation and start building.
Grounding: the first roadmap target is the Founder persona. Every decision below is sized for "ship v0.1 in a small-team time frame" but structured so v0.2 (MSP), v0.3 (Support Lead), and v0.4 (SRE) slot in without rewrites.
0. Guiding principles (for any decision not listed below)
- Boring tech by default. Python, React, Postgres, SQLite, Docker. Every novel dependency is a liability paid in debugging hours at 3 a.m.
- Two languages max. One backend (Python), one frontend (TypeScript/React). No Go CLI, no Rust agent core, no Elixir sidecar. Adding a third language requires a written justification.
- Ship a running `docker compose up` on day one of every week. The demo always works. If it doesn't, that's the next thing you fix.
- Tests are part of the feature. Every PR ships with the tests that prove it works. No "will add tests in a follow-up".
- The prototypes in `personas/` are the UX spec, not a moodboard. The React implementation is a port, not a reinterpretation.
- Every closed item ships with an HTML status report. For each delivered feature, module, or roadmap item, write a standalone HTML file under `reports/` that describes what was done and how to test it. The file is written by whoever closes the item and linked from `reports/index.html`. No exceptions: this is the paper trail for review and handoff. Reports match the visual language of the landing page (Tailwind CDN, Inter font, clean cards) so they can be shared with non-engineers.
1. Stack decisions (locked)
1.1 Backend: Python
| Concern | Choice | Why |
|---|---|---|
| Runtime | Python 3.12+ | Pattern matching, performance wins, the best typing story in the Python 3 line, supported until 2028. Drop support when <5% of users. |
| Package manager | uv | Rust-based, ~10-100× faster than pip/poetry, single-tool for venv + install + lockfile + tool runs. The clear future. |
| Web framework | FastAPI | Async-first, automatic OpenAPI generation (needed to generate the TypeScript API client), Pydantic-native, excellent DX. |
| Agent loop | Homegrown (≤500 LOC) on top of the Anthropic + OpenAI SDKs via litellm | The loop is the product. We do not want to fight LangGraph/pydantic-ai abstractions when we tune prompts and error handling. We adopt a framework piece only when we find ourselves re-implementing it. |
| MCP | Official mcp Python SDK (as a client / host) | Standard. Supports both stdio and streamable HTTP transports. Every connector we ship and every community one is an MCP server. |
| LLM SDK abstraction | litellm (SDK only, not proxy) with direct Anthropic + OpenAI for hot paths | One provider interface for BYOK. Hot paths (planner, verdict) use the direct SDK. We never run the LiteLLM proxy in-process — it has known 2026 production issues. Pin litellm to a known-good version; install with hash verification. See ARCHITECTURE.md §21 for the full rationale. |
| ORM | SQLAlchemy 2.x (async) | Async is non-negotiable for an I/O-heavy service. SQLAlchemy 2 is the mature, typed choice. No SQLModel — it's a thinner wrapper that we don't need. |
| Migrations | Alembic | Paired with SQLAlchemy. Boring, reliable. |
| Primary DB | SQLite (default) / Postgres (opt-in) | SQLite for single-node installs and first-run demos — zero external deps. Postgres via a one-line config change for scale. Same SQLAlchemy models. |
| Vector store | sqlite-vec (default) / pgvector (Postgres) | Matches the DB choice. sqlite-vec is maintained and production-capable; pgvector is the standard for Postgres. No external Qdrant/Pinecone at v0.1. |
| Full-text search | SQLite FTS5 / Postgres tsvector | Hybrid retrieval = BM25-ish + vector. Native to each DB, no extra service. |
| Memory graph | MemoryGraph protocol with SQLiteMemoryGraph default + opt-in PostgresAGEMemoryGraph + opt-in FalkorDBLiteMemoryGraph | Long-term agent memory is graph-shaped from day one (nodes + typed edges + POLE+O-style labels). Default backend is two SQLite tables with recursive CTEs, so zero ops cost. Users who want graph-native from day one can opt into Apache AGE (Postgres extension, openCypher, recommended) or FalkorDBLite (embedded Cypher, Beta embedding on a production-stable engine) via `docker compose --profile graph-age` or `GABY_MEMORY_BACKEND=falkor-lite`. All three backends implement the same 11-method protocol; Iter 4 ships the 3-backend round-trip test that proves the migration path. See ARCHITECTURE.md §22 for the full design. |
| Background jobs | arq (Redis-backed async) with a local in-process fallback | arq is async-native and minimal. For the "docker compose up" founder install, we fall back to an in-process worker so Redis isn't required until scale demands it. |
| Config & secrets | pydantic-settings + pluggable secret providers (env / file / Vault / AWS SM / GCP SM) | Types for free; one interface per provider. |
| Logging | structlog with JSON renderer | Structured JSON to stdout. 12-factor. Works with every log aggregator. |
| Tracing | OpenTelemetry SDK with OTLP exporter | Turn on by env var. Every tool call and LLM call is a span. Non-negotiable for the SRE persona later. |
| Metrics | prometheus-client with /metrics endpoint | Standard. No external push gateway. |
| Linter / formatter | ruff (both) + mypy --strict | ruff replaces black + isort + flake8 + many plugins, in one Rust-fast tool. mypy strict on the core package. |
| Test runner | pytest + pytest-asyncio + pytest-cov + hypothesis | Standard Python testing. Hypothesis for property-based tests on anything safety-critical. |
| HTTP mocking | respx | Async-compatible, integrates cleanly with httpx (which litellm uses). |
| Time / clock | time-machine in tests | Deterministic time; faster than freezegun. |
Why not…
- LangGraph / pydantic-ai / smolagents. They each have real strengths, but the investigation loop is the soul of the product and we need to iterate on prompts, tool-call retry semantics, and error recovery without a framework's opinions in the way. We ship our own 300-500 line loop. If a year in we find we are re-implementing `langgraph.checkpoint`, we adopt that specific piece.
- Django. Gaby is an API-first service with a React frontend. FastAPI is the better fit; Django Admin is not something we want to maintain.
- Go. Strong single-binary story but adds a second language and a second hiring pipeline. Docker Compose gives us "5-minute install" without it. We can add a Go CLI wrapper in v0.5+ if the install friction demands it.
- Node for the backend. The Python AI ecosystem (tokenization, evals, RAG tooling, connector SDKs) is meaningfully ahead of Node's. Every AI team that has switched has switched toward Python, not away.
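To make "we ship our own loop" concrete, here is a deliberately tiny sketch of the control flow such a loop might have: ask a planner for the next move, run the tool, feed errors back as data, and stop at a step cap. Only `run_investigation` and the verdict strings come from this document; `Step`, the planner tuple shape, and `scripted_planner` are assumptions made for illustration, and the real loop would call an LLM where the scripted planner sits.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    tool: str
    argument: str
    result: str


@dataclass
class Investigation:
    ticket: str
    steps: list[Step] = field(default_factory=list)
    verdict: str = "investigating"


# A planner returns ("call", tool_name, argument) or ("finish", verdict, "").
Planner = Callable[[Investigation], tuple[str, str, str]]
Tool = Callable[[str], str]


def run_investigation(
    ticket: str,
    planner: Planner,
    tools: dict[str, Tool],
    max_steps: int = 8,
) -> Investigation:
    inv = Investigation(ticket=ticket)
    for _ in range(max_steps):
        kind, name, argument = planner(inv)
        if kind == "finish":
            inv.verdict = name
            return inv
        tool = tools.get(name)
        if tool is None:
            # Unknown tool: record it so the planner can recover on the next turn.
            inv.steps.append(Step(name, argument, "error: unknown tool"))
            continue
        try:
            result = tool(argument)
        except Exception as exc:  # tool failures are data for the planner, not crashes
            result = f"error: {exc}"
        inv.steps.append(Step(name, argument, result))
    # Step budget exhausted without a verdict.
    inv.verdict = "failed"
    return inv


def scripted_planner(inv: Investigation) -> tuple[str, str, str]:
    """Stand-in for the LLM planner: look the user up once, then resolve."""
    if not inv.steps:
        return ("call", "lookup_user", "alice@example.com")
    return ("finish", "auto_resolved", "")


result = run_investigation(
    "Cannot log in",
    scripted_planner,
    {"lookup_user": lambda email: f"{email}: account locked"},
)
```

Keeping the planner and tools as plain callables is what lets retry semantics and error recovery be changed without touching a framework.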
1.2 Frontend: TypeScript / React
| Concern | Choice | Why |
|---|---|---|
| Framework | Vite + React 19 + React Router 7 | No SSR complexity. The admin UI is a SPA served as static assets from the backend. Vite build is fast, HMR is instant. |
| Language | TypeScript strict | Non-negotiable. Any any requires an inline justification. |
| Styling | Tailwind CSS 4 | Matches shared/styles.css in the prototypes. No migration tax. |
| Component primitives | shadcn/ui | Copy-paste Radix components, fully themable, matches the aesthetic we already shipped. Not a dependency — code we own. |
| Server state | TanStack Query v5 | Cache, refetch, optimistic updates. The category winner. |
| Client state | Zustand | For the small handful of cases that aren't server state (wizard step, theme, transient UI). |
| Forms | React Hook Form + Zod | Zod schemas can be shared with the API client for end-to-end typing. |
| API client | Auto-generated from OpenAPI via openapi-typescript | FastAPI emits OpenAPI; we codegen the TS client. Adding a backend route never requires a manual frontend type update. |
| Charts | Recharts | Simple, well-known, good enough for dashboards. |
| Icons | Custom SVG set (port of shared/icons.js) + lucide-react for generic icons | Keep brand identity; reuse lucide for everything generic. |
| Unit tests | Vitest + React Testing Library | Vitest shares the Vite config; RTL is the standard. |
| E2E tests | Playwright (already in tests/) | Reuse the existing harness and page-object pattern. |
| Linter / formatter | Biome (primary) + ESLint for react-hooks only | Biome is ~25× faster and covers ~80% of ESLint rules. The gap is type-aware rules, specifically eslint-plugin-react-hooks, which Biome does not yet replicate. We run Biome on every file and keep a minimal ESLint config only for the React Hooks plugin until Biome closes that gap. Biome is the canonical formatter. |
| Package manager | pnpm | Disk-efficient, fast, handles monorepos natively. |
Why not…
- Next.js. We don't need SSR for an admin UI; we do want trivial Docker packaging. Next pulls in a server runtime we don't want inside the Docker image. Vite wins on simplicity.
- shadcn+Radix alternatives (Chakra, MUI, Mantine). They're fine frameworks but they own the theme. shadcn leaves us in control of every pixel — and our prototypes already have a distinctive look we don't want to lose.
- ESLint + Prettier. Biome is simply faster and one tool. ESLint is still the fallback if Biome lacks a rule we need.
1.3 Cross-cutting
| Concern | Choice |
|---|---|
| Version control | Git + GitHub |
| Commit signing | DCO (git commit -s) — not a full CLA |
| Issue tracker | GitHub Issues |
| Docs site | MkDocs Material (Python-native, fast build, great search) |
| API docs | Auto from FastAPI's OpenAPI → rendered in MkDocs via mkdocs-swagger-ui-tag |
| Changelog | CHANGELOG.md hand-edited, enforced on release PRs |
| Release automation | GitHub Actions → Docker Hub / GHCR + PyPI + npm (for widget) + Helm chart repo |
| Container registry | GHCR primary, Docker Hub mirror |
| License headers in source | SPDX short form (# SPDX-License-Identifier: Apache-2.0) in every file |
2. Repository layout
gaby/  # the repo root (name matches the product)
├── README.md  # "what is this / 5-minute install"
├── SPEC.md  # (exists) product spec
├── FOUNDATION.md  # (this document) foundation plan
├── ARCHITECTURE.md  # (next) detailed technical architecture
├── ROADMAP.md  # (next) dated roadmap across the 4 personas
├── CHANGELOG.md  # (next) user-facing changes per release
├── LICENSE  # Apache 2.0
├── LICENSE-EE  # commercial enterprise edition license
├── TRADEMARK.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── SECURITY.md  # vuln disclosure
│
├── BUSINESS.md  # OSS-vs-commercial position, kept honest
├── docs/  # MkDocs Material site: canonical marketing
│          # + narrative docs, deployed to gaby.skycloak.io
├── personas/  # (exists) persona prototypes: canonical UX spec
│   ├── founder/
│   ├── msp/
│   ├── sre/
│   └── support-lead/
├── shared/  # (exists) prototype shared assets
│
├── backend/  # Python service
│   ├── pyproject.toml  # single source of truth for deps, tools, build
│   ├── uv.lock
│   ├── README.md  # "how to run the backend locally"
│   ├── src/gaby/
│   │   ├── __init__.py
│   │   ├── __main__.py  # `python -m gaby`
│   │   ├── cli.py  # `gaby` entry point (Typer)
│   │   │
│   │   ├── api/  # FastAPI app
│   │   │   ├── app.py  # FastAPI() instance + wiring
│   │   │   ├── deps.py  # dependency-injection helpers
│   │   │   ├── middleware.py  # auth, CSRF, request-id, structlog binding
│   │   │   ├── errors.py  # typed exception → HTTP mapping
│   │   │   └── routers/
│   │   │       ├── health.py  # /health /ready /metrics
│   │   │       ├── onboarding.py  # wizard-driven setup
│   │   │       ├── tickets.py
│   │   │       ├── investigations.py
│   │   │       ├── connectors.py
│   │   │       ├── knowledge.py
│   │   │       ├── chat.py  # widget + operator console backend
│   │   │       ├── settings.py
│   │   │       └── admin.py
│   │   │
│   │   ├── agent/  # THE investigation loop
│   │   │   ├── loop.py  # run_investigation(ticket) → Investigation
│   │   │   ├── planner.py  # decides next tool / next action
│   │   │   ├── memory.py  # working memory within one investigation
│   │   │   ├── verdict.py  # classify auto-resolved / needs-tech / etc.
│   │   │   ├── prompts/  # versioned prompt templates (just .md files)
│   │   │   │   ├── planner.md
│   │   │   │   ├── tool_selector.md
│   │   │   │   ├── summarizer.md
│   │   │   │   └── verdict.md
│   │   │   └── safety_check.py  # scope enforcement *before* any action
│   │   │
│   │   ├── connectors/  # MCP host + connector framework
│   │   │   ├── base.py  # Connector abstract + Scope / Action types
│   │   │   ├── registry.py  # load/persist connector configs
│   │   │   ├── mcp_host.py  # spawn + supervise MCP server subprocesses
│   │   │   ├── mcp_client.py  # thin wrapper around the official SDK client
│   │   │   ├── catalog.yaml  # first-party connector catalog (metadata only)
│   │   │   └── builtin/  # connectors we ship in-tree for v0.1
│   │   │       ├── postgresql.py
│   │   │       ├── keycloak.py
│   │   │       └── zoho_desk.py
│   │   │
│   │   ├── knowledge/
│   │   │   ├── ingest.py  # fs walker, git puller, url crawler, pdf reader
│   │   │   ├── chunker.py  # token-aware Markdown/code chunking
│   │   │   ├── embeddings.py  # pluggable embedding provider
│   │   │   ├── store.py  # write to vector + FTS indices
│   │   │   ├── retrieve.py  # hybrid BM25 + vector, with citations
│   │   │   └── citations.py  # "where did this claim come from?" helpers
│   │   │
│   │   ├── ticketing/  # help desk adapters (source + sink)
│   │   │   ├── base.py  # TicketSource + TicketSink interfaces
│   │   │   ├── zoho_desk.py  # v0.1 canonical adapter
│   │   │   ├── linear.py
│   │   │   ├── github_issues.py
│   │   │   └── email_inbox.py
│   │   │
│   │   ├── llm/
│   │   │   ├── provider.py  # Protocol: chat, stream, tool_call
│   │   │   ├── anthropic.py
│   │   │   ├── openai.py
│   │   │   ├── litellm_gateway.py  # fallback for any OpenAI-compatible endpoint
│   │   │   ├── budget.py  # per-investigation token & dollar ceilings
│   │   │   ├── router.py  # cheap-model for classification / big-model for verdict
│   │   │   └── cache.py  # prompt-cache friendly hashing
│   │   │
│   │   ├── safety/  # the machine will not wreck production
│   │   │   ├── authz.py  # scope evaluation
│   │   │   ├── audit.py  # append-only hash-chained log
│   │   │   ├── approvals.py  # the approval queue
│   │   │   ├── scopes.py  # scope DSL (read/write/dry_run)
│   │   │   └── redaction.py  # PII redaction before sending to LLM
│   │   │
│   │   ├── chat/  # human chat surface
│   │   │   ├── sessions.py
│   │   │   ├── widget.py  # end-user widget backend
│   │   │   ├── slack.py  # Slack app
│   │   │   ├── teams.py  # Microsoft Teams app
│   │   │   └── handoff.py  # Gaby → human takeover
│   │   │
│   │   ├── storage/
│   │   │   ├── db.py  # engine, session factory, init
│   │   │   ├── encryption.py  # symmetric envelope for sensitive columns
│   │   │   └── models/  # one file per aggregate
│   │   │       ├── workspace.py
│   │   │       ├── user.py
│   │   │       ├── connector.py
│   │   │       ├── knowledge.py
│   │   │       ├── ticket.py
│   │   │       ├── investigation.py
│   │   │       ├── action.py
│   │   │       ├── approval.py
│   │   │       ├── audit.py
│   │   │       ├── llm_call.py
│   │   │       └── chat.py
│   │   │
│   │   ├── observability/
│   │   │   ├── logging.py
│   │   │   ├── tracing.py
│   │   │   └── metrics.py
│   │   │
│   │   ├── workers/
│   │   │   ├── runner.py  # in-process vs arq dispatcher
│   │   │   ├── ticket_poller.py  # poll help desks
│   │   │   ├── investigation_worker.py  # run the agent loop
│   │   │   └── summary_mailer.py  # nightly email for founder persona
│   │   │
│   │   ├── events.py  # internal event bus (pub/sub)
│   │   └── config.py  # the single Settings object
│   │
│   └── tests/
│       ├── conftest.py
│       ├── unit/  # mirrors src/gaby/ structure 1:1
│       ├── integration/  # marked @pytest.mark.integration
│       ├── contract/  # connector contract tests
│       ├── property/  # hypothesis tests
│       └── fixtures/
│           ├── tickets/
│           ├── docs/  # small KB corpora
│           └── llm_transcripts/  # recorded LLM responses for deterministic replay
│
├── web/  # React admin UI
│   ├── package.json
│   ├── pnpm-lock.yaml
│   ├── vite.config.ts
│   ├── tsconfig.json
│   ├── biome.json
│   ├── tailwind.config.ts  # imports tokens from ../shared/styles.css
│   ├── README.md
│   ├── index.html
│   ├── src/
│   │   ├── main.tsx
│   │   ├── App.tsx
│   │   ├── router.tsx
│   │   ├── routes/  # one folder per top-level route
│   │   │   ├── onboarding/
│   │   │   ├── dashboard/
│   │   │   ├── investigation/
│   │   │   ├── tickets/
│   │   │   ├── connectors/
│   │   │   ├── knowledge/
│   │   │   ├── chat-console/  # operator view of the widget
│   │   │   └── settings/
│   │   ├── components/
│   │   │   ├── ui/  # shadcn primitives (button, dialog, etc.)
│   │   │   └── domain/  # TicketRow, TimelineStep, StatCard, Sidebar...
│   │   ├── lib/
│   │   │   ├── api.ts  # generated OpenAPI client
│   │   │   ├── query.ts  # TanStack Query setup
│   │   │   ├── auth.ts
│   │   │   └── theme.ts  # persona palette (indigo/violet/emerald/sky)
│   │   ├── hooks/
│   │   ├── styles/globals.css
│   │   └── icons/  # port of shared/icons.js as .tsx
│   ├── public/
│   └── tests/
│       ├── unit/  # Vitest
│       └── components/
│
├── widget/  # embeddable end-user chat widget
│   ├── package.json
│   ├── vite.config.ts  # library mode; outputs a single JS bundle
│   ├── README.md  # "how to drop the snippet into any site"
│   ├── src/
│   │   ├── index.ts  # entry: window.Gaby.init({...})
│   │   ├── widget.tsx  # React root mounted into a shadow DOM
│   │   ├── api.ts  # talks to /api/chat
│   │   └── styles.css  # scoped to shadow DOM
│   └── dist/  # built bundle shipped to npm + CDN
│
├── connectors/  # first-party MCP servers (separately publishable)
│   ├── README.md  # how to build a connector
│   ├── _contract/  # the contract tests every connector must pass
│   ├── postgres/
│   ├── keycloak/
│   ├── stripe/
│   ├── zoho-desk/
│   ├── m365/
│   ├── entra-id/
│   ├── ninjaone/
│   ├── halopsa/
│   ├── kubernetes/
│   └── datadog/
│
├── docs/  # MkDocs Material
│   ├── mkdocs.yml
│   ├── index.md
│   ├── quickstart/
│   │   ├── docker-compose.md
│   │   ├── helm.md
│   │   └── managed-cloud.md
│   ├── concepts/
│   │   ├── agent-loop.md
│   │   ├── connectors.md
│   │   ├── knowledge.md
│   │   ├── safety.md
│   │   └── autonomy-levels.md
│   ├── connectors/  # one page per connector
│   ├── personas/  # how to set up Gaby for each persona
│   ├── operations/
│   │   ├── observability.md
│   │   ├── backups.md
│   │   ├── upgrades.md
│   │   └── security.md
│   ├── reference/
│   │   ├── api.md  # OpenAPI rendered
│   │   ├── cli.md
│   │   └── config.md
│   └── contributing/
│       ├── dev-setup.md
│       ├── testing.md
│       └── connector-authoring.md
│
├── ops/  # deployment artifacts
│   ├── docker/
│   │   ├── Dockerfile.backend
│   │   ├── Dockerfile.web
│   │   ├── Dockerfile.widget
│   │   └── docker-compose.yml  # canonical v0.1 install path
│   ├── helm/
│   │   └── gaby/  # chart
│   └── profiles/
│       ├── founder-quickstart/  # smallest, friendliest install
│       ├── msp-multiworkspace/
│       └── sre-readonly/
│
├── .github/
│   ├── workflows/
│   │   ├── backend-ci.yml
│   │   ├── web-ci.yml
│   │   ├── widget-ci.yml
│   │   ├── e2e.yml
│   │   ├── release.yml
│   │   └── docs.yml
│   ├── ISSUE_TEMPLATE/
│   └── PULL_REQUEST_TEMPLATE.md
│
└── scripts/
    ├── dev.sh  # runs backend + web concurrently
    ├── seed-dev-data.py  # loads fixture tickets + docs
    ├── gen-api-client.sh  # OpenAPI → TS client
    └── mint-dev-secrets.py
Why this structure
- Monorepo, not multiple repos. Three apps (`backend/`, `web/`, `widget/`) stay in lockstep; the alternative is broken releases and drift.
- `personas/` stays at the root as the canonical UX prototypes (per `SPEC.md` Section 4). The repo-root `index.html` was retired in v0.3.1; `docs/index.md` (deployed to https://gaby.skycloak.io) is now the canonical marketing surface.
- `connectors/` is a sibling of `backend/`, not a sub-package. Each connector is its own MCP server, publishable to PyPI independently. The backend depends on MCP as a protocol, not on connector implementations.
- Tests live next to the code they test (`backend/tests/`, `web/tests/`), not in a top-level `tests/` folder. The one exception is the existing `tests/` harness for the prototype Playwright tests, which stays until the React UI replaces the HTML prototypes.
- `docs/` is MkDocs-managed, not a grab-bag of loose Markdown. One `mkdocs.yml`, deployable as a static site.
3. Data model core
These entities carry us from v0.1 (Founder) to v0.4 (SRE) without schema rewrites. The MSP persona's multi-workspace is baked in by having a workspace_id column on every row from day one, even though v0.1 uses a single hard-coded "default" workspace.
| Aggregate | Key fields | Notes |
|---|---|---|
| workspaces | id, name, plan, compliance_profile, residency_region, created_at | Single "default" workspace in v0.1. Everything else joins on this. |
| users | id, workspace_id, email, role (admin/agent/viewer), password_hash, disabled | Operators of Gaby, not end-users. RBAC defaults to three roles; custom roles are EE. |
| api_keys | id, workspace_id, user_id?, prefix, hash, scopes, expires_at | For CLI and machine access. |
| sessions | id, user_id, expires_at, csrf_token, operator_notes (jsonb) | HTTP cookie sessions for the web UI. operator_notes holds medium-term, session-scoped notes (e.g. "operator just approved X, don't re-prompt for the rest of this session"). |
| connectors | id, workspace_id, kind, name, config_encrypted, status, last_health_check, scopes, autonomy_level | kind = postgresql / m365 / zoho_desk / … ; scopes is a JSON scope DSL; autonomy_level ∈ {investigate, propose, act}. |
| connector_events | id, connector_id, ts, kind, payload | Healthchecks, auth failures, permission denials. |
| knowledge_sources | id, workspace_id, kind (git / confluence / dir / url), locator, config, last_sync | Where to ingest from. |
| documents | id, workspace_id, source_id, uri, title, content_hash, last_ingested_at | One row per source document (a runbook, a PDF, a Confluence page). |
| document_chunks | id, document_id, workspace_id, ordinal, text, token_count, embedding, fts | The retrievable unit. Embedding column uses sqlite-vec or pgvector. |
| tickets | id, workspace_id, source_id, external_id, title, body, customer, priority, status, sla_at, created_at | Canonical form: every help-desk adapter maps to this. |
| investigations | id, workspace_id, ticket_id, started_at, finished_at, verdict, summary, token_cost, dollar_cost | One per ticket. Verdict ∈ {auto_resolved, needs_tech, needs_l2, needs_client, investigating, failed}. |
| investigation_steps | id, investigation_id, ordinal, system, action, detail, type (read/query/action/verify/verdict), ts | Exactly matches renderTimelineStep in the prototypes. |
| actions | id, investigation_id, connector_id, scope, payload, dry_run, result, status, applied_at, rolled_back_at | Every write Gaby does is recorded here, and so is every dry-run shadow. |
| approvals | id, action_id, requested_at, decided_at, decided_by, decision, reason | Drives the approval queue for propose autonomy. |
| audit_log | id, workspace_id, ts, actor_kind (user/agent/system), actor_id, event, payload, prev_hash, hash | Append-only, hash-chained. Exportable to SIEM (EE). |
| llm_calls | id, investigation_id?, purpose, model, prompt_hash, prompt_tokens, completion_tokens, latency_ms, cost, cached | Cost dashboard + prompt debugging. |
| chat_sessions | id, workspace_id, channel (widget/slack/teams), external_user_id, started_at, handed_off_at? | The human-chat surface. |
| chat_messages | id, session_id, role (user/gaby/operator), content, attachments, ts | |
| escalations | id, ticket_id?, session_id?, channel (slack/teams/pagerduty/email), sent_at, acknowledged_at? | Per-persona escalation routing. |
| kb_candidates | id, workspace_id, staged_from_investigation_id, proposed_title, proposed_body, status (pending/accepted/rejected/expired), reviewed_by?, reviewed_at?, expires_at, created_at | Staging area for auto-resolved investigations that want to be promoted to KB entries. Everything here is provisional until a human reviews. TTL 30 days default; expired rows auto-archive. |
| memory_nodes | id, workspace_id, label (customer/user/system/connector/ticket/investigation/fact/observation/resolution), natural_key, properties (jsonb), provenance (operator/proposed/imported), status (active/provisional/archived), first_seen_at, last_seen_at, last_used_at, approved_by?, approved_at? | Nodes of the long-term memory graph. See ARCHITECTURE.md §22 for the full model, the MemoryGraph protocol, and the three backend implementations (SQLite default, Postgres+AGE opt-in, FalkorDBLite opt-in). Unique index on (workspace_id, label, natural_key). |
| memory_edges | id, workspace_id, from_node_id, to_node_id, relation (7 typed categories), weight, properties (jsonb), observed_at, decayed_at? | Edges of the long-term memory graph. Typed relations per ARCHITECTURE.md §22 (Causal / Solution / Context / Learning / Similarity / Workflow / Quality). Composite indexes on (workspace_id, from_node_id, relation) and (workspace_id, to_node_id, relation). |
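As a rough illustration of how two plain SQLite tables can answer graph questions, the sketch below walks outgoing edges with a recursive CTE. It uses a cut-down subset of the `memory_nodes` / `memory_edges` columns listed above; the `reachable` helper, the sample rows, and the depth cap are assumptions for the example and are not the MemoryGraph protocol from ARCHITECTURE.md §22.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE memory_nodes (
        id TEXT PRIMARY KEY,
        workspace_id TEXT NOT NULL,
        label TEXT NOT NULL,
        natural_key TEXT NOT NULL,
        UNIQUE (workspace_id, label, natural_key)
    );
    CREATE TABLE memory_edges (
        id TEXT PRIMARY KEY,
        workspace_id TEXT NOT NULL,
        from_node_id TEXT NOT NULL REFERENCES memory_nodes(id),
        to_node_id TEXT NOT NULL REFERENCES memory_nodes(id),
        relation TEXT NOT NULL
    );
    CREATE INDEX ix_edges_from
        ON memory_edges (workspace_id, from_node_id, relation);
    """
)
conn.executemany(
    "INSERT INTO memory_nodes VALUES (?, 'default', ?, ?)",
    [
        ("n1", "customer", "acme"),
        ("n2", "ticket", "T-1"),
        ("n3", "resolution", "reset-mfa"),
        ("n4", "ticket", "T-9"),  # not connected to anything
    ],
)
conn.executemany(
    "INSERT INTO memory_edges VALUES (?, 'default', ?, ?, ?)",
    [
        ("e1", "n1", "n2", "context"),
        ("e2", "n2", "n3", "solution"),
    ],
)


def reachable(conn, workspace_id, start_id, max_depth=3):
    """Return (node_id, depth) for nodes reachable from start_id via outgoing edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE walk(node_id, depth) AS (
            SELECT ?, 0
            UNION
            SELECT e.to_node_id, w.depth + 1
            FROM memory_edges AS e
            JOIN walk AS w ON e.from_node_id = w.node_id
            WHERE e.workspace_id = ? AND w.depth < ?
        )
        SELECT node_id, MIN(depth) AS depth
        FROM walk
        WHERE node_id != ?
        GROUP BY node_id
        ORDER BY depth, node_id
        """,
        (start_id, workspace_id, max_depth, start_id),
    ).fetchall()
    return [(node_id, depth) for node_id, depth in rows]
```

The depth cap together with `UNION` (rather than `UNION ALL`) keeps the walk finite even if the graph contains cycles.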
Design rules the models must follow
- Every row has `workspace_id`. Even in v0.1. Cheap now, impossible to retrofit later.
- Secrets are encrypted at rest. `connectors.config_encrypted`, any API key-bearing column. Envelope encryption with a data key from the secrets provider.
- Nothing ever hard-deletes. Soft-delete with `deleted_at`. We need audit reconstructibility.
- Every sensitive column has a redaction rule. Before an investigation step is sent to an LLM, PII is stripped per the workspace's compliance profile.
- UUIDv7 for all primary keys (time-sortable, index-friendly).
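The "time-sortable" property of UUIDv7 comes from putting a 48-bit Unix millisecond timestamp in the most significant bits. The Python 3.12 standard library has no UUIDv7 helper, so the sketch below builds one by hand following the RFC 9562 bit layout; the function name `uuid7` and the optional `now_ms` parameter (handy for tests) are choices made for this example, and a vetted library may be preferable in practice.

```python
from __future__ import annotations

import os
import time
import uuid


def uuid7(now_ms: int | None = None) -> uuid.UUID:
    """Build a UUIDv7: 48-bit ms timestamp, version/variant bits, random rest."""
    ts = time.time_ns() // 1_000_000 if now_ms is None else now_ms
    value = (ts & 0xFFFF_FFFF_FFFF) << 80             # timestamp in bits 127..80
    value |= int.from_bytes(os.urandom(10), "big")    # 80 random bits below it
    value = (value & ~(0xF << 76)) | (0x7 << 76)      # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)      # variant bits = 0b10
    return uuid.UUID(int=value)


earlier = uuid7(now_ms=1_700_000_000_000)
later = uuid7(now_ms=1_700_000_000_001)
```

Because the timestamp leads, ids created in different milliseconds sort in creation order both as integers and as their canonical strings, which is what keeps B-tree inserts append-mostly. This sketch makes no ordering guarantee for ids minted within the same millisecond.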
4. Test strategy
Testing an LLM-driven system is where most AI projects rot. We avoid that by separating the deterministic parts from the LLM-driven parts and testing each on its own terms.
4.1 Layers
| Layer | Tool | Scope | Where it runs | Budget |
|---|---|---|---|---|
| Unit | pytest | Pure functions and classes. Chunker, scope evaluator, audit hash chain, prompt builders, retrieval scoring, DB models against in-memory SQLite. No network. | Every PR, pre-commit | <30s full run |
| Property | hypothesis | Invariants on the critical-path: scope evaluator, audit hash chain, retrieval top-k containment, redaction idempotence. | Every PR | <60s |
| Integration | pytest + testcontainers | Real SQLite on disk; real Postgres via testcontainers for the Postgres profile. FastAPI TestClient. Real MCP server spawned as a subprocess. LLM calls go to a deterministic mock provider that replays recorded transcripts. | Every PR | <5 min |
| Connector contract | pytest (shared fixtures) | Every connector (ours + community) must pass: tool-list endpoint, scope declaration, healthcheck, dry-run of each declared action. | On every connector PR | <2 min/connector |
| End-to-end | Playwright (existing harness) | Full stack in Docker Compose: backend + web + mock connectors + deterministic LLM. Walks through founder onboarding → dashboard → investigation. | Every PR, nightly full sweep | <10 min |
| Load | k6 | API throughput: 100 concurrent investigations, p95 latency, error rate. | Weekly scheduled | 15 min |
| Evals (LLM-specific) | promptfoo (v0.1) → Inspect AI (v0.2+) for full agent evals with tool calls | Fixed corpus of ≥50 tickets with known-good resolutions. Measures auto-resolution rate, citation accuracy, safety-boundary compliance. promptfoo handles prompt regression well; Inspect AI (from UK AISI) is purpose-built for agent evaluation and fits our v0.2 tool-calling tests better. | Manual before every release, automated weekly | 30 min |
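To show what an invariant such as "audit hash chain" means at the unit and property layers, here is a minimal sketch of a hash-chained log and its verifier. SHA-256, the JSON canonicalisation, and the helper names `entry_hash`, `append`, and `verify` are assumptions for illustration; the document only fixes the `prev_hash` / `hash` columns.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first row


def entry_hash(prev_hash: str, event: str, payload: dict) -> str:
    # Canonical JSON so the same logical entry always hashes the same way.
    body = json.dumps(
        {"event": event, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list[dict], event: str, payload: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {
            "event": event,
            "payload": payload,
            "prev_hash": prev_hash,
            "hash": entry_hash(prev_hash, event, payload),
        }
    )


def verify(log: list[dict]) -> bool:
    """True only if every row links to its predecessor and matches its own hash."""
    prev_hash = GENESIS
    for row in log:
        if row["prev_hash"] != prev_hash:
            return False
        if row["hash"] != entry_hash(prev_hash, row["event"], row["payload"]):
            return False
        prev_hash = row["hash"]
    return True


log: list[dict] = []
append(log, "connector.created", {"kind": "postgresql"})
append(log, "action.approved", {"action_id": "a-1"})
append(log, "action.applied", {"action_id": "a-1"})
```

A property test would then generate arbitrary event sequences and arbitrary single-row edits or deletions, asserting that `verify` accepts the untouched log and rejects every tampered one.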
4.2 Making LLM-driven code deterministic for CI
The agent loop calls an LLM. We do two things to make that testable:
- Transcript replay. Every integration test that exercises the agent uses a "fake LLM provider" that is seeded with a recorded transcript (tool-call sequences plus final text) stored under `backend/tests/fixtures/llm_transcripts/`. The test asserts on the deterministic behavior (scope checks, DB writes, audit log shape), not the LLM text.
- Evals are a separate beast. Evals use real LLM calls against a fixed corpus, run on a schedule (not per-PR), and measure quality metrics like auto-resolution rate and citation accuracy. They gate releases but not PRs.
This separation is the difference between "our tests pass in 3 minutes" and "our tests cost $400/month in OpenAI credits".
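A transcript-replay provider can be very small. The sketch below is one possible shape, not the project's actual fixture format: `ReplayProvider`, `Turn`, the JSON layout, and the tool name `keycloak.get_user` are all hypothetical, and the transcript is inlined here where a real test would load it from the fixtures directory.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class Turn:
    tool: str | None = None              # set when the model asks for a tool call
    arguments: dict = field(default_factory=dict)
    text: str | None = None              # set on the final answer


class ReplayProvider:
    """Fake LLM provider that returns recorded turns in order, ignoring the prompt."""

    def __init__(self, transcript_json: str) -> None:
        self._turns = [Turn(**entry) for entry in json.loads(transcript_json)]
        self._cursor = 0
        self.prompts: list[str] = []     # captured so tests can assert on them

    def chat(self, prompt: str) -> Turn:
        if self._cursor >= len(self._turns):
            raise AssertionError("agent asked for more turns than were recorded")
        self.prompts.append(prompt)
        turn = self._turns[self._cursor]
        self._cursor += 1
        return turn

    @property
    def exhausted(self) -> bool:
        return self._cursor == len(self._turns)


TRANSCRIPT = json.dumps(
    [
        {"tool": "keycloak.get_user", "arguments": {"email": "alice@example.com"}},
        {"text": "Account was locked; unlocked after verification."},
    ]
)


def run_with(provider: ReplayProvider) -> list[str]:
    """Toy agent: keep calling the provider until it stops requesting tools."""
    calls: list[str] = []
    turn = provider.chat("ticket: cannot log in")
    while turn.tool is not None:
        calls.append(turn.tool)
        turn = provider.chat(f"result of {turn.tool}")
    return calls
```

Failing loudly when the agent requests an unrecorded turn is deliberate: a changed tool-call sequence should break the test rather than silently fall through to a real API.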
4.3 What we refuse to test with mocks
- The DB schema. Integration tests use a real DB engine, not an in-memory mock. Schema bugs cost 10× to find in prod.
- The MCP protocol. Integration tests spawn a real MCP server (a tiny stub is fine) and exchange real messages.
- The FastAPI request lifecycle. Integration tests use `TestClient`, which runs the real middleware stack.
4.4 Test file conventions
- `backend/tests/unit/` mirrors `backend/src/gaby/` 1:1. If you touch `agent/loop.py`, you touch `tests/unit/agent/test_loop.py`.
- Integration tests are in `backend/tests/integration/` and are marked `@pytest.mark.integration`. The default `pytest` command runs unit + property only; `pytest -m integration` runs the slower ones.
- Every test file starts with `from __future__ import annotations`.
- No shared mutable fixtures across tests: everything is function-scoped or rebuilt per test.
4.5 Coverage
- Target: 85% line coverage on `backend/src/gaby/`, enforced in CI. Not because coverage is a quality metric, but because it catches "forgot to wire this up" regressions.
- Critical paths are 100%: `safety/`, `audit`, `scopes`, `authz`. A PR that drops these below 100% is auto-rejected.
- Frontend coverage is best-effort; we rely on Playwright e2e for confidence there.
5. Design system: from prototype to real UI
The persona prototypes under personas/ are already the spec. Porting them to React is a translation, not a redesign. Here is how.
5.1 The tokens we keep
Pulled from shared/styles.css:
| Token family | Prototype source | Target |
|---|---|---|
| Color palettes | .hero-gradient-*, .btn-primary-*, .selected-*, .active-* (indigo / violet / emerald / sky) | web/src/lib/theme.ts → CSS variables + Tailwind theme extension |
| Typography | Inter + JetBrains Mono from index.html | tailwind.config.ts fontFamily |
| Spacing / radius | Tailwind defaults | Unchanged |
| Shadows | stat-card, feature-card styles | shadow-sm/md/lg tokens in Tailwind |
| Animations | float-up, slide-in, pulse-dot, progress-bar | tailwindcss-animate plugin + custom keyframes in globals.css |
5.2 Domain components to build
Each maps to a function in shared/components.js:
| Component | Prototype function | Responsibility |
|---|---|---|
| `<OnboardingWizard>` | steps in personas/*/index.html | Generic 5-6 step wizard container, consumes a config array |
| `<ProgressDots>` | renderProgress | Wizard progress indicator |
| `<Sidebar>` | renderSidebar | Persona-themed nav |
| `<StatCard>` | renderStatCard | Dashboard stat tile |
| `<TicketRow>` | renderTicketRow | One ticket in a queue |
| `<TicketFilters>` | filterTickets logic | Filter buttons over a ticket list |
| `<InvestigationTimeline>` | renderTimelineStep | Full investigation view with timeline steps |
| `<TimelineStep>` | renderTimelineStep | One step in an investigation |
| `<ConnectorCard>` | renderServerCard | A connector in the catalog or settings |
| `<Toast>` | showToast | Transient notification |
| `<SimulatedInvestigation>` | simulateInvestigation | The "live demo" step in onboarding |
5.3 Persona theming
Persona colors become CSS variables scoped to a wrapper class:
/* web/src/styles/globals.css */
:root[data-persona="founder"] { --gaby-primary: 99 102 241; /* indigo-500 */ }
:root[data-persona="support-lead"] { --gaby-primary: 139 92 246; /* violet-500 */ }
:root[data-persona="sre"] { --gaby-primary: 5 150 105; /* emerald-600 */ }
:root[data-persona="msp"] { --gaby-primary: 2 132 199; /* sky-600 */ }
Components then use bg-[rgb(var(--gaby-primary))] / text-[rgb(var(--gaby-primary))]. One set of components, per-persona skin with no code duplication.
5.4 What happens to the prototypes once the React UI ships
- They stay as static files, served from the repo, and remain the visual spec and marketing.
- The React app is available at `/app` (served by the backend); the landing page stays at `/`.
- When the React UI reaches parity with a persona's prototype, the prototype is marked `[reference]` in its header and the React app becomes the runtime.
- We never break the prototypes. They are easier to share with non-engineers than a running SPA.
5.5 Accessibility baseline¶
- All new components are built on Radix primitives (via shadcn), which ship with keyboard navigation and ARIA.
- Every interactive element has a `data-testid` matching the prototype conventions (already observed in the HTML prototypes).
- Color contrast meets WCAG AA at minimum; we test with `@axe-core/react` in dev.
- Every form has labels, and error messages are read out via `aria-live`.
6. Observability, ops, and security baselines (day-1 musts)¶
6.1 Observability (cross-cutting)¶
| Signal | Default | Opt-in |
|---|---|---|
| Logs | Structured JSON to stdout via structlog | Ship to Loki / Datadog via OTLP |
| Traces | OpenTelemetry, off by default | GABY_OTEL_EXPORTER=otlp://... turns it on |
| Metrics | Prometheus `/metrics` endpoint always on | Dashboards ship as JSON in `docs/operations/dashboards/` |
| Status | Built-in `/status` page in the web UI | `/health`, `/ready` for probes |
6.2 Security (non-negotiable at v0.1)¶
| Control | Implementation |
|---|---|
| Authenticated API by default | No "open mode" in v0.1. First run mints an admin user. |
| Secrets at rest | Envelope encryption (AES-GCM) with data keys from the configured secrets provider |
| CSRF | Session cookie + CSRF token on every state-changing web route |
| CSP | Strict-ish CSP by default; MkDocs docs are served from a separate origin |
| Input validation | Pydantic models on every route; reject unknown fields |
| Rate limits | Per-IP and per-API-key, token-bucket, in Redis or in-process |
| Dependency scanning | Dependabot + pip-audit + pnpm audit in CI |
| SBOM | syft in the release workflow; SBOM attached to every Docker image |
| Vuln disclosure | SECURITY.md with a PGP-signed security address |
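The rate-limit row names a token-bucket limiter that can run in-process. A minimal sketch of that in-process variant is shown here, under stated assumptions: the names `TokenBucket`, `RateLimiter`, `capacity`, and `refill_per_sec` are hypothetical and do not come from the Gaby codebase, and the clock is injectable only so the behaviour can be exercised deterministically.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """One bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity          # start full
        self._updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        elapsed = now - self._updated_at
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_sec)
        self._updated_at = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class RateLimiter:
    """Keeps one bucket per key (for example an IP address or an API key id)."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill_per_sec = refill_per_sec
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, key: str) -> bool:
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill_per_sec, self._clock)
            self._buckets[key] = bucket
        return bucket.allow()
```

A per-IP limiter and a per-API-key limiter would be two `RateLimiter` instances with different settings. The Redis-backed variant would need the refill-and-spend step to be atomic across processes, which this sketch does not attempt.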
6.3 CI pipelines (GitHub Actions)¶
| Workflow | Triggers | Steps | Gate |
|---|---|---|---|
| `backend-ci.yml` | PR + main | uv sync → ruff check → mypy strict → pytest (unit + property) → pytest-cov ≥85% | Blocking |
| `backend-int.yml` | PR + nightly | `pytest -m integration` with testcontainers Postgres | Blocking |
| `web-ci.yml` | PR + main | pnpm i → biome check → tsc --noEmit → vitest | Blocking |
| `widget-ci.yml` | PR + main | pnpm i → biome check → vitest → build size-limit check | Blocking |
| `e2e.yml` | PR + nightly | docker compose up → Playwright run → upload traces on failure | Blocking (fast suite) / nightly (full) |
| `connector-ci.yml` | PR touching `connectors/**` | Run contract tests against the touched connectors | Blocking |
| `docs.yml` | PR + main | MkDocs strict build + broken-link check | Blocking |
| `release.yml` | tag `v*.*.*` | Build + push Docker images, Helm chart, PyPI wheel, npm widget; publish SBOM; draft release notes | Manual review |
| `eval.yml` | weekly + manual | Run the LLM eval harness; post a summary comment to a tracking issue | Informational |
Cache strategy: uv and pnpm caches keyed on their respective lockfiles.
6.4 Release cadence¶
- v0.x: weekly or whenever a meaningful slice is ready. Breaking changes are expected and flagged in `CHANGELOG.md`.
- v1.0: the four personas ship, API is stable, upgrade path is documented. Semver from here.
7. v0.1 exit criteria (Founder persona) — what shipping actually means¶
We ship v0.1 when a technical founder can docker compose up Gaby, connect their stack in under 10 minutes, and receive a Slack DM by the next morning showing a real ticket Gaby auto-resolved overnight.
7.1 In scope for v0.1¶
- [ ] Installation: `docker compose up` works on macOS, Linux, WSL2. SQLite only; no Postgres / Redis required.
- [ ] Onboarding wizard: the Founder flow from `personas/founder/index.html`, fully real in React.
- [ ] Connectors shipped: PostgreSQL (read-only + limited write), Keycloak (read-only), Zoho Desk (read + reply).
- [ ] Ticket source: Zoho Desk polling adapter. Canonical ticket model in the DB.
- [ ] Knowledge ingestion: point at a local `./docs` folder; Markdown + PDF.
- [ ] Agent loop: Anthropic Claude via litellm, homegrown loop, token + dollar budget per investigation.
- [ ] Safety: three autonomy modes; dry-run by default on writes; approval queue for `propose`.
- [ ] Audit log: append-only, hash-chained.
- [ ] Web UI: onboarding → dashboard → investigation detail → basic settings.
- [ ] Slack escalation: outbound only. Nightly summary email via SMTP.
- [ ] Observability: structured logs, `/metrics`, `/health`, `/ready`.
- [ ] Docs: quickstart page that matches the install experience; one-page "how Gaby works"; one-page security overview.
- [ ] Tests: every layer above passing in CI. 85% backend coverage. 100% on `safety/`.
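The "append-only, hash-chained" audit log item can be illustrated with a small in-memory sketch. It assumes SHA-256 over canonical JSON and a fixed all-zero genesis hash; the names `AuditLog`, `GENESIS_HASH`, and the entry fields are hypothetical, and the real log would live in the database rather than in a Python list.

```python
import hashlib
import json
from typing import Any, Dict, List

# Assumed placeholder for the predecessor of the very first entry.
GENESIS_HASH = "0" * 64


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) keeps the hash stable.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Each entry stores the hash of the previous entry, forming a chain."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def append(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        stored = dict(payload)  # shallow copy so later caller edits do not leak in
        entry = {
            "prev_hash": prev_hash,
            "payload": stored,
            "hash": _digest(prev_hash, stored),
        }
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited, removed, or reordered entry fails."""
        prev_hash = GENESIS_HASH
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["payload"]):
                return False
            prev_hash = entry["hash"]
        return True
```

Because each hash covers the previous one, changing an old entry invalidates every entry after it, which is what lets the log "reconstruct the full sequence" with some confidence that nothing was rewritten. Truncating the tail is not detected by the chain alone; that would need the latest hash to be anchored somewhere outside the log.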
7.2 Explicitly NOT in v0.1¶
- Multi-workspace mode (it's built into the schema but we serve one default workspace)
- Per-client autonomy rules (MSP persona)
- Human chat widget (v0.3 at earliest)
- Slack / Teams inbound bot (only outbound escalation)
- SSO / SAML / SCIM (Enterprise Edition)
- Air-gapped install mode (Enterprise Edition)
- All connectors not listed above
- The Support Lead / SRE / MSP persona wizards
- The operator chat console
7.3 Definition of done for the v0.1 release¶
- [ ] All in-scope checkboxes above are ticked
- [ ] The founder quickstart (`docs/quickstart/docker-compose.md`) runs green on all three target OSes
- [ ] A five-ticket live test against a staging Zoho Desk: at least three of five tickets are auto-resolved, all writes go through the safety layer, the audit log reconstructs the full sequence
- [ ] The eval suite (50 fixture tickets) achieves ≥60% auto-resolution with zero safety violations
- [ ] `CHANGELOG.md` has a v0.1.0 entry written by a human
- [ ] A tagged release with Docker images on GHCR, Helm chart in the chart repo, and Python wheel on PyPI
- [ ] The landing page's "Join Waitlist" button is replaced with "Install" once v0.1 ships
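The eval-suite criterion above (≥60% auto-resolution with zero safety violations) is mechanical enough to express as a gate function. The sketch below is one possible shape, not the eval harness itself: `EvalResult`, its fields, and `passes_release_gate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalResult:
    """Outcome of running one fixture ticket through the agent (assumed shape)."""
    ticket_id: str
    auto_resolved: bool
    safety_violations: int


def passes_release_gate(results: Iterable[EvalResult],
                        min_resolution_rate: float = 0.60) -> bool:
    """True only if the resolution rate meets the bar and no run violated safety."""
    runs = list(results)
    if not runs:
        return False  # an empty suite proves nothing
    if any(run.safety_violations > 0 for run in runs):
        return False  # a single violation fails the gate regardless of the rate
    resolved = sum(1 for run in runs if run.auto_resolved)
    return resolved / len(runs) >= min_resolution_rate
```

Checking safety violations before the rate keeps the two conditions independent: a suite that resolves every ticket but trips the safety layer once still fails.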
8. Which things we will probably get wrong (so future-us can forgive us)¶
These are the calls I am least confident about. They will likely change. Document them here so the change is expected, not a crisis.
- Agent loop as homegrown. If after 3 months we are spending more time on loop mechanics than on prompts and tools, we adopt the matching piece of pydantic-ai or LangGraph. The loop's public interface is small enough to swap.
- arq vs Celery. arq is simpler but newer. If we hit production pain (worker supervision, retries, visibility), we switch to Celery before v0.5.
- sqlite-vec vs pgvector lock-in. We abstract the vector operations behind a tiny interface so swapping is a day of work, not a week.
- Widget in a shadow DOM. Theoretically clean, practically has quirks (fonts, CSP). If it becomes a maintenance sink, we fall back to an iframe.
- Keeping the HTML prototypes as reference after the React UI ships. This might create drift. Mitigation: a CI check that both prototype and React page use the same `data-testid` set.
- In-process worker vs external. Great for v0.1 demos, but "Gaby is slow" will probably mean "the worker is blocking the API". v0.2 promotes arq+Redis to default in Compose.
9. What this document is not¶
- Not the detailed technical architecture. That is `ARCHITECTURE.md`: sequence diagrams, exact class contracts, concurrency model, scaling notes. It follows this document.
- Not a dated roadmap. That is `ROADMAP.md`: v0.1 through v1.0 with estimates and milestones.
- Not a contributor guide. That is `CONTRIBUTING.md`: dev setup, branch rules, code review rituals.
- Not a product spec. That is `SPEC.md`: the what and the why.
Read in order: SPEC.md → FOUNDATION.md (this) → ARCHITECTURE.md → ROADMAP.md.
10. Appendix — decisions at a glance¶
| Area | Choice |
|---|---|
| Backend lang | Python 3.12+ |
| Backend framework | FastAPI |
| Agent loop | Homegrown, ≤500 LOC |
| LLM abstraction | litellm (BYOK) + direct Anthropic/OpenAI (hot paths) |
| MCP | Official Python SDK |
| ORM | SQLAlchemy 2.x async + Alembic |
| DB default | SQLite (+ sqlite-vec + FTS5) |
| DB opt-in | Postgres (+ pgvector + tsvector) |
| Background jobs | arq (Redis) with in-process fallback |
| Package manager | uv |
| Linter | ruff + mypy --strict |
| Frontend | Vite + React 19 + RR7 + TS strict |
| Styling | Tailwind 4 + shadcn/ui |
| Server state | TanStack Query |
| Client state | Zustand |
| Forms | React Hook Form + Zod |
| API client | openapi-typescript generated from FastAPI OpenAPI |
| Widget | Vite library mode → shadow DOM React |
| Frontend lint | Biome |
| Frontend pm | pnpm |
| Tests (BE) | pytest + hypothesis + testcontainers + respx |
| Tests (FE) | Vitest + RTL + Playwright (reuse tests/) |
| Deterministic LLM testing | transcript replay via a fake provider |
| Evals | promptfoo or homegrown, scheduled not per-PR |
| Docs | MkDocs Material |
| Containers | Docker + Compose (v0.1), Helm (v0.2) |
| Registry | GHCR primary, Docker Hub mirror |
| License (core) | Apache 2.0 |
| License (EE) | Commercial |
| Contribution | DCO |
| Release cadence | Weekly v0.x, monthly from v1.0 |