Skip to content

Foundation

Foundation Plan — Gaby v0.1 → v1.0

Status: Draft v0.1 · Owner: Guilliano · Last updated: 2026-04-11

This is the foundation plan. It sits one step below SPEC.md (the what) and above ARCHITECTURE.md (the detailed how). Its job is to lock the decisions that are expensive to change later:

  • What language(s) we write in
  • How the repo is laid out
  • What the data model looks like
  • How we test it
  • How we ship it
  • How the prototype UI becomes a real design system

Everything in this doc is opinionated and defaulted. If a section says "we pick X", assume we start with X and revisit only if we have measured data showing X isn't working. The goal is to stop debating foundation and start building.

Grounding: the first roadmap target is the Founder persona. Every decision below is sized for "ship v0.1 in a small-team time frame" but structured so v0.2 (MSP), v0.3 (Support Lead), and v0.4 (SRE) slot in without rewrites.


0. Guiding principles (for any decision not listed below)

  1. Boring tech by default. Python, React, Postgres, SQLite, Docker. Every novel dependency is a liability paid in debugging hours at 3 a.m.
  2. Two languages max. One backend (Python), one frontend (TypeScript/React). No Go CLI, no Rust agent core, no Elixir sidecar. Adding a third language requires a written justification.
  3. Ship a running docker compose up on day one of every week. The demo always works. If it doesn't, that's the next thing you fix.
  4. Tests are part of the feature. Every PR ships with the tests that prove it works. No "will add tests in a follow-up".
  5. The prototypes in personas/ are the UX spec, not a moodboard. The React implementation is a port, not a reinterpretation.
  6. Every closed item ships with an HTML status report. For each delivered feature, module, or roadmap item, write a standalone HTML file under reports/ that describes what was done and how to test it. The file is written by whoever closes the item and linked from reports/index.html. No exceptions — this is the paper trail for review and handoff. Reports match the visual language of the landing page (Tailwind CDN, Inter font, clean cards) so they can be shared with non-engineers.

1. Stack decisions (locked)

1.1 Backend — Python

Concern Choice Why
Runtime Python 3.12+ Pattern matching, performance wins, the best typing story in the Python 3 line, supported until 2028. Drop support when <5% of users.
Package manager uv Rust-based, ~10-100× faster than pip/poetry, single-tool for venv + install + lockfile + tool runs. The clear future.
Web framework FastAPI Async-first, automatic OpenAPI generation (needed to generate the TypeScript API client), Pydantic-native, excellent DX.
Agent loop Homegrown (≤500 LOC) on top of the Anthropic + OpenAI SDKs via litellm The loop is the product. We do not want to fight LangGraph/pydantic-ai abstractions when we tune prompts and error handling. We adopt a framework piece only when we find ourselves re-implementing it.
MCP Official mcp Python SDK (as a client / host) Standard. Supports both stdio and streamable HTTP transports. Every connector we ship and every community one is an MCP server.
LLM SDK abstraction litellm (SDK only, not proxy) with direct Anthropic + OpenAI for hot paths One provider interface for BYOK. Hot paths (planner, verdict) use the direct SDK. We never run the LiteLLM proxy in-process — it has known 2026 production issues. Pin litellm to a known-good version; install with hash verification. See ARCHITECTURE.md §21 for the full rationale.
ORM SQLAlchemy 2.x (async) Async is non-negotiable for an I/O-heavy service. SQLAlchemy 2 is the mature, typed choice. No SQLModel — it's a thinner wrapper that we don't need.
Migrations Alembic Paired with SQLAlchemy. Boring, reliable.
Primary DB SQLite (default) / Postgres (opt-in) SQLite for single-node installs and first-run demos — zero external deps. Postgres via a one-line config change for scale. Same SQLAlchemy models.
Vector store sqlite-vec (default) / pgvector (Postgres) Matches the DB choice. sqlite-vec is maintained and production-capable; pgvector is the standard for Postgres. No external Qdrant/Pinecone at v0.1.
Full-text search SQLite FTS5 / Postgres tsvector Hybrid retrieval = BM25-ish + vector. Native to each DB, no extra service.
Memory graph MemoryGraph protocol with SQLiteMemoryGraph default + opt-in PostgresAGEMemoryGraph + opt-in FalkorDBLiteMemoryGraph Long-term agent memory is graph-shaped from day one (nodes + typed edges + POLE+O-style labels). Default backend is two SQLite tables with recursive CTEs — zero ops cost. Users who want graph-native from day one can opt into Apache AGE (Postgres extension, openCypher, recommended) or FalkorDBLite (embedded Cypher, Beta embedding on a production-stable engine) via docker compose --profile graph-age or GABY_MEMORY_BACKEND=falkor-lite. All three backends implement the same 11-method protocol; Iter 4 ships the 3-backend round-trip test that proves the migration path. See ARCHITECTURE.md §22 for the full design.
Background jobs arq (Redis-backed async) with a local in-process fallback arq is async-native and minimal. For the "docker compose up" founder install, we fall back to an in-process worker so Redis isn't required until scale demands it.
Config & secrets pydantic-settings + pluggable secret providers (env / file / Vault / AWS SM / GCP SM) Types for free; one interface per provider.
Logging structlog with JSON renderer Structured JSON to stdout. 12-factor. Works with every log aggregator.
Tracing OpenTelemetry SDK with OTLP exporter Turn on by env var. Every tool call and LLM call is a span. Non-negotiable for the SRE persona later.
Metrics prometheus-client with /metrics endpoint Standard. No external push gateway.
Linter / formatter ruff (both) + mypy --strict ruff replaces black + isort + flake8 + many plugins, in one Rust-fast tool. mypy strict on the core package.
Test runner pytest + pytest-asyncio + pytest-cov + hypothesis Standard Python testing. Hypothesis for property-based tests on anything safety-critical.
HTTP mocking respx Async-compatible, integrates cleanly with httpx (which litellm uses).
Time / clock time-machine in tests Deterministic time; faster than freezegun.

Why not

  • LangGraph / pydantic-ai / smolagents. They each have real strengths, but the investigation loop is the soul of the product and we need to iterate on prompts, tool-call retry semantics, and error recovery without a framework's opinions in the way. We ship our own 300-500 line loop. If a year in we find we are re-implementing langgraph.checkpoint, we adopt that specific piece.
  • Django. Gaby is an API-first service with a React frontend. FastAPI is the better fit; Django Admin is not something we want to maintain.
  • Go. Strong single-binary story but adds a second language and a second hiring pipeline. Docker Compose gives us "5-minute install" without it. We can add a Go CLI wrapper in v0.5+ if the install friction demands it.
  • Node for the backend. The Python AI ecosystem (tokenization, evals, RAG tooling, connector SDKs) is meaningfully ahead of Node's. Every AI team that has switched has switched toward Python, not away.

1.2 Frontend — TypeScript / React

Concern Choice Why
Framework Vite + React 19 + React Router 7 No SSR complexity. The admin UI is a SPA served as static assets from the backend. Vite build is fast, HMR is instant.
Language TypeScript strict Non-negotiable. Any any requires an inline justification.
Styling Tailwind CSS 4 Matches shared/styles.css in the prototypes. No migration tax.
Component primitives shadcn/ui Copy-paste Radix components, fully themable, matches the aesthetic we already shipped. Not a dependency — code we own.
Server state TanStack Query v5 Cache, refetch, optimistic updates. The category winner.
Client state Zustand For the small handful of cases that aren't server state (wizard step, theme, transient UI).
Forms React Hook Form + Zod Zod schemas can be shared with the API client for end-to-end typing.
API client Auto-generated from OpenAPI via openapi-typescript FastAPI emits OpenAPI; we codegen the TS client. Adding a backend route never requires a manual frontend type update.
Charts Recharts Simple, well-known, good enough for dashboards.
Icons Custom SVG set (port of shared/icons.js) + lucide-react for generic icons Keep brand identity; reuse lucide for everything generic.
Unit tests Vitest + React Testing Library Vitest shares the Vite config; RTL is the standard.
E2E tests Playwright (already in tests/) Reuse the existing harness and page-object pattern.
Linter / formatter Biome (primary) + ESLint for react-hooks only Biome is ~25× faster and covers ~80% of ESLint rules. The gap is type-aware rules — specifically eslint-plugin-react-hooks, which Biome does not yet replicate. We run Biome on every file and keep a minimal ESLint config only for the React Hooks plugin until Biome closes that gap. Biome is the canonical formatter.
Package manager pnpm Disk-efficient, fast, handles monorepos natively.

Why not

  • Next.js. We don't need SSR for an admin UI; we do want trivial Docker packaging. Next pulls in a server runtime we don't want inside the Docker image. Vite wins on simplicity.
  • shadcn+Radix alternatives (Chakra, MUI, Mantine). They're fine frameworks but they own the theme. shadcn leaves us in control of every pixel — and our prototypes already have a distinctive look we don't want to lose.
  • ESLint + Prettier. Biome is simply faster and one tool. ESLint is still the fallback if Biome lacks a rule we need.

1.3 Cross-cutting

Concern Choice
Version control Git + GitHub
Commit signing DCO (git commit -s) — not a full CLA
Issue tracker GitHub Issues
Docs site MkDocs Material (Python-native, fast build, great search)
API docs Auto from FastAPI's OpenAPI → rendered in MkDocs via mkdocs-swagger-ui-tag
Changelog CHANGELOG.md hand-edited, enforced on release PRs
Release automation GitHub Actions → Docker Hub / GHCR + PyPI + npm (for widget) + Helm chart repo
Container registry GHCR primary, Docker Hub mirror
License headers in source SPDX short form (# SPDX-License-Identifier: Apache-2.0) in every file

2. Repository layout

gaby/                                       # the repo root (name matches the product)
├── README.md                                # "what is this / 5-minute install"
├── SPEC.md                                  # (exists) product spec
├── FOUNDATION.md                            # (this document) foundation plan
├── ARCHITECTURE.md                          # (next) detailed technical architecture
├── ROADMAP.md                               # (next) dated roadmap across the 4 personas
├── CHANGELOG.md                             # (next) user-facing changes per release
├── LICENSE                                  # Apache 2.0
├── LICENSE-EE                               # commercial enterprise edition license
├── TRADEMARK.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── SECURITY.md                              # vuln disclosure
├── BUSINESS.md                              # OSS-vs-commercial position, kept honest
├── docs/                                    # MkDocs Material site — canonical marketing
│                                            # + narrative docs, deployed to gaby.skycloak.io
├── personas/                                # (exists) persona prototypes — canonical UX spec
│   ├── founder/
│   ├── msp/
│   ├── sre/
│   └── support-lead/
├── shared/                                  # (exists) prototype shared assets
├── backend/                                 # Python service
│   ├── pyproject.toml                       # single source of truth for deps, tools, build
│   ├── uv.lock
│   ├── README.md                            # "how to run the backend locally"
│   ├── src/gaby/
│   │   ├── __init__.py
│   │   ├── __main__.py                      # `python -m gaby`
│   │   ├── cli.py                           # `gaby` entry point (Typer)
│   │   │
│   │   ├── api/                             # FastAPI app
│   │   │   ├── app.py                       # FastAPI() instance + wiring
│   │   │   ├── deps.py                      # dependency-injection helpers
│   │   │   ├── middleware.py                # auth, CSRF, request-id, structlog binding
│   │   │   ├── errors.py                    # typed exception → HTTP mapping
│   │   │   └── routers/
│   │   │       ├── health.py                # /health /ready /metrics
│   │   │       ├── onboarding.py            # wizard-driven setup
│   │   │       ├── tickets.py
│   │   │       ├── investigations.py
│   │   │       ├── connectors.py
│   │   │       ├── knowledge.py
│   │   │       ├── chat.py                  # widget + operator console backend
│   │   │       ├── settings.py
│   │   │       └── admin.py
│   │   │
│   │   ├── agent/                           # THE investigation loop
│   │   │   ├── loop.py                      # run_investigation(ticket) → Investigation
│   │   │   ├── planner.py                   # decides next tool / next action
│   │   │   ├── memory.py                    # working memory within one investigation
│   │   │   ├── verdict.py                   # classify auto-resolved / needs-tech / etc.
│   │   │   ├── prompts/                     # versioned prompt templates (just .md files)
│   │   │   │   ├── planner.md
│   │   │   │   ├── tool_selector.md
│   │   │   │   ├── summarizer.md
│   │   │   │   └── verdict.md
│   │   │   └── safety_check.py              # scope enforcement *before* any action
│   │   │
│   │   ├── connectors/                      # MCP host + connector framework
│   │   │   ├── base.py                      # Connector abstract + Scope / Action types
│   │   │   ├── registry.py                  # load/persist connector configs
│   │   │   ├── mcp_host.py                  # spawn + supervise MCP server subprocesses
│   │   │   ├── mcp_client.py                # thin wrapper around the official SDK client
│   │   │   ├── catalog.yaml                 # first-party connector catalog (metadata only)
│   │   │   └── builtin/                     # connectors we ship in-tree for v0.1
│   │   │       ├── postgresql.py
│   │   │       ├── keycloak.py
│   │   │       └── zoho_desk.py
│   │   │
│   │   ├── knowledge/
│   │   │   ├── ingest.py                    # fs walker, git puller, url crawler, pdf reader
│   │   │   ├── chunker.py                   # token-aware Markdown/code chunking
│   │   │   ├── embeddings.py                # pluggable embedding provider
│   │   │   ├── store.py                     # write to vector + FTS indices
│   │   │   ├── retrieve.py                  # hybrid BM25 + vector, with citations
│   │   │   └── citations.py                 # "where did this claim come from?" helpers
│   │   │
│   │   ├── ticketing/                       # help desk adapters (source + sink)
│   │   │   ├── base.py                      # TicketSource + TicketSink interfaces
│   │   │   ├── zoho_desk.py                 # v0.1 canonical adapter
│   │   │   ├── linear.py
│   │   │   ├── github_issues.py
│   │   │   └── email_inbox.py
│   │   │
│   │   ├── llm/
│   │   │   ├── provider.py                  # Protocol: chat, stream, tool_call
│   │   │   ├── anthropic.py
│   │   │   ├── openai.py
│   │   │   ├── litellm_gateway.py           # fallback for any OpenAI-compatible endpoint
│   │   │   ├── budget.py                    # per-investigation token & dollar ceilings
│   │   │   ├── router.py                    # cheap-model for classification / big-model for verdict
│   │   │   └── cache.py                     # prompt-cache friendly hashing
│   │   │
│   │   ├── safety/                          # the machine will not wreck production
│   │   │   ├── authz.py                     # scope evaluation
│   │   │   ├── audit.py                     # append-only hash-chained log
│   │   │   ├── approvals.py                 # the approval queue
│   │   │   ├── scopes.py                    # scope DSL (read/write/dry_run)
│   │   │   └── redaction.py                 # PII redaction before sending to LLM
│   │   │
│   │   ├── chat/                            # human chat surface
│   │   │   ├── sessions.py
│   │   │   ├── widget.py                    # end-user widget backend
│   │   │   ├── slack.py                     # Slack app
│   │   │   ├── teams.py                     # Microsoft Teams app
│   │   │   └── handoff.py                   # Gaby → human takeover
│   │   │
│   │   ├── storage/
│   │   │   ├── db.py                        # engine, session factory, init
│   │   │   ├── encryption.py                # symmetric envelope for sensitive columns
│   │   │   └── models/                      # one file per aggregate
│   │   │       ├── workspace.py
│   │   │       ├── user.py
│   │   │       ├── connector.py
│   │   │       ├── knowledge.py
│   │   │       ├── ticket.py
│   │   │       ├── investigation.py
│   │   │       ├── action.py
│   │   │       ├── approval.py
│   │   │       ├── audit.py
│   │   │       ├── llm_call.py
│   │   │       └── chat.py
│   │   │
│   │   ├── observability/
│   │   │   ├── logging.py
│   │   │   ├── tracing.py
│   │   │   └── metrics.py
│   │   │
│   │   ├── workers/
│   │   │   ├── runner.py                    # in-process vs arq dispatcher
│   │   │   ├── ticket_poller.py             # poll help desks
│   │   │   ├── investigation_worker.py      # run the agent loop
│   │   │   └── summary_mailer.py            # nightly email for founder persona
│   │   │
│   │   ├── events.py                        # internal event bus (pub/sub)
│   │   └── config.py                        # the single Settings object
│   │
│   └── tests/
│       ├── conftest.py
│       ├── unit/                            # mirrors src/gaby/ structure 1:1
│       ├── integration/                     # marked @pytest.mark.integration
│       ├── contract/                        # connector contract tests
│       ├── property/                        # hypothesis tests
│       └── fixtures/
│           ├── tickets/
│           ├── docs/                        # small KB corpora
│           └── llm_transcripts/             # recorded LLM responses for deterministic replay
├── web/                                     # React admin UI
│   ├── package.json
│   ├── pnpm-lock.yaml
│   ├── vite.config.ts
│   ├── tsconfig.json
│   ├── biome.json
│   ├── tailwind.config.ts                   # imports tokens from ../shared/styles.css
│   ├── README.md
│   ├── index.html
│   ├── src/
│   │   ├── main.tsx
│   │   ├── App.tsx
│   │   ├── router.tsx
│   │   ├── routes/                          # one folder per top-level route
│   │   │   ├── onboarding/
│   │   │   ├── dashboard/
│   │   │   ├── investigation/
│   │   │   ├── tickets/
│   │   │   ├── connectors/
│   │   │   ├── knowledge/
│   │   │   ├── chat-console/                # operator view of the widget
│   │   │   └── settings/
│   │   ├── components/
│   │   │   ├── ui/                          # shadcn primitives (button, dialog, etc.)
│   │   │   └── domain/                      # TicketRow, TimelineStep, StatCard, Sidebar...
│   │   ├── lib/
│   │   │   ├── api.ts                       # generated OpenAPI client
│   │   │   ├── query.ts                     # TanStack Query setup
│   │   │   ├── auth.ts
│   │   │   └── theme.ts                     # persona palette (indigo/violet/emerald/sky)
│   │   ├── hooks/
│   │   ├── styles/globals.css
│   │   └── icons/                           # port of shared/icons.js as .tsx
│   ├── public/
│   └── tests/
│       ├── unit/                            # Vitest
│       └── components/
├── widget/                                  # embeddable end-user chat widget
│   ├── package.json
│   ├── vite.config.ts                       # library mode; outputs a single JS bundle
│   ├── README.md                            # "how to drop the snippet into any site"
│   ├── src/
│   │   ├── index.ts                         # entry: window.Gaby.init({...})
│   │   ├── widget.tsx                       # React root mounted into a shadow DOM
│   │   ├── api.ts                           # talks to /api/chat
│   │   └── styles.css                       # scoped to shadow DOM
│   └── dist/                                # built bundle shipped to npm + CDN
├── connectors/                              # first-party MCP servers (separately publishable)
│   ├── README.md                            # how to build a connector
│   ├── _contract/                           # the contract tests every connector must pass
│   ├── postgres/
│   ├── keycloak/
│   ├── stripe/
│   ├── zoho-desk/
│   ├── m365/
│   ├── entra-id/
│   ├── ninjaone/
│   ├── halopsa/
│   ├── kubernetes/
│   └── datadog/
├── docs/                                    # MkDocs Material
│   ├── mkdocs.yml
│   ├── index.md
│   ├── quickstart/
│   │   ├── docker-compose.md
│   │   ├── helm.md
│   │   └── managed-cloud.md
│   ├── concepts/
│   │   ├── agent-loop.md
│   │   ├── connectors.md
│   │   ├── knowledge.md
│   │   ├── safety.md
│   │   └── autonomy-levels.md
│   ├── connectors/                          # one page per connector
│   ├── personas/                            # how to set up Gaby for each persona
│   ├── operations/
│   │   ├── observability.md
│   │   ├── backups.md
│   │   ├── upgrades.md
│   │   └── security.md
│   ├── reference/
│   │   ├── api.md                           # OpenAPI rendered
│   │   ├── cli.md
│   │   └── config.md
│   └── contributing/
│       ├── dev-setup.md
│       ├── testing.md
│       └── connector-authoring.md
├── ops/                                     # deployment artifacts
│   ├── docker/
│   │   ├── Dockerfile.backend
│   │   ├── Dockerfile.web
│   │   ├── Dockerfile.widget
│   │   └── docker-compose.yml               # canonical v0.1 install path
│   ├── helm/
│   │   └── gaby/                            # chart
│   └── profiles/
│       ├── founder-quickstart/              # smallest, friendliest install
│       ├── msp-multiworkspace/
│       └── sre-readonly/
├── .github/
│   ├── workflows/
│   │   ├── backend-ci.yml
│   │   ├── web-ci.yml
│   │   ├── widget-ci.yml
│   │   ├── e2e.yml
│   │   ├── release.yml
│   │   └── docs.yml
│   ├── ISSUE_TEMPLATE/
│   └── PULL_REQUEST_TEMPLATE.md
└── scripts/
    ├── dev.sh                               # runs backend + web concurrently
    ├── seed-dev-data.py                     # loads fixture tickets + docs
    ├── gen-api-client.sh                    # OpenAPI → TS client
    └── mint-dev-secrets.py

Why this structure

  • Monorepo, not multiple repos. Three apps (backend/, web/, widget/) stay in lockstep; the alternative is broken releases and drift.
  • personas/ stays at the root as the canonical UX prototypes (per SPEC.md Section 4). The repo-root index.html was retired in v0.3.1 — docs/index.md (deployed to https://gaby.skycloak.io) is now the canonical marketing surface.
  • connectors/ is a sibling of backend/, not a sub-package. Each connector is its own MCP server, publishable to PyPI independently. The backend depends on MCP as a protocol, not on connector implementations.
  • Tests live next to the code they test (backend/tests/, web/tests/), not in a top-level tests/ folder — except for the existing tests/ harness for the prototype Playwright tests, which stays until the React UI replaces the HTML prototypes.
  • docs/ is MkDocs-managed, not a grab-bag of loose Markdown. One mkdocs.yml, deployable as a static site.

3. Data model core

These entities carry us from v0.1 (Founder) to v0.4 (SRE) without schema rewrites. The MSP persona's multi-workspace is baked in by having a workspace_id column on every row from day one, even though v0.1 uses a single hard-coded "default" workspace.

Aggregate Key fields Notes
workspaces id, name, plan, compliance_profile, residency_region, created_at Single "default" workspace in v0.1. Everything else joins on this.
users id, workspace_id, email, role (admin/agent/viewer), password_hash, disabled Operators of Gaby, not end-users. RBAC defaults to three roles; custom roles are EE.
api_keys id, workspace_id, user_id?, prefix, hash, scopes, expires_at For CLI and machine access.
sessions id, user_id, expires_at, csrf_token, operator_notes (jsonb) HTTP cookie sessions for the web UI. operator_notes holds medium-term, session-scoped notes (e.g. "operator just approved X — don't re-prompt for the rest of this session").
connectors id, workspace_id, kind, name, config_encrypted, status, last_health_check, scopes, autonomy_level kind = postgresql / m365 / zoho_desk / … ; scopes is a JSON scope DSL; autonomy_level ∈ {investigate, propose, act}.
connector_events id, connector_id, ts, kind, payload Healthchecks, auth failures, permission denials.
knowledge_sources id, workspace_id, kind (git / confluence / dir / url), locator, config, last_sync Where to ingest from.
documents id, workspace_id, source_id, uri, title, content_hash, last_ingested_at One row per source document (a runbook, a PDF, a Confluence page).
document_chunks id, document_id, workspace_id, ordinal, text, token_count, embedding, fts The retrievable unit. Embedding column uses sqlite-vec or pgvector.
tickets id, workspace_id, source_id, external_id, title, body, customer, priority, status, sla_at, created_at Canonical form — every help-desk adapter maps to this.
investigations id, workspace_id, ticket_id, started_at, finished_at, verdict, summary, token_cost, dollar_cost One per ticket. Verdict ∈ {auto_resolved, needs_tech, needs_l2, needs_client, investigating, failed}.
investigation_steps id, investigation_id, ordinal, system, action, detail, type (read/query/action/verify/verdict), ts Exactly matches renderTimelineStep in the prototypes.
actions id, investigation_id, connector_id, scope, payload, dry_run, result, status, applied_at, rolled_back_at Every write Gaby does is recorded here — so is every dry-run shadow.
approvals id, action_id, requested_at, decided_at, decided_by, decision, reason Drives the approval queue for propose autonomy.
audit_log id, workspace_id, ts, actor_kind (user/agent/system), actor_id, event, payload, prev_hash, hash Append-only, hash-chained. Exportable to SIEM (EE).
llm_calls id, investigation_id?, purpose, model, prompt_hash, prompt_tokens, completion_tokens, latency_ms, cost, cached Cost dashboard + prompt debugging.
chat_sessions id, workspace_id, channel (widget/slack/teams), external_user_id, started_at, handed_off_at? The human-chat surface.
chat_messages id, session_id, role (user/gaby/operator), content, attachments, ts
escalations id, ticket_id?, session_id?, channel (slack/teams/pagerduty/email), sent_at, acknowledged_at? Per-persona escalation routing.
kb_candidates id, workspace_id, staged_from_investigation_id, proposed_title, proposed_body, status (pending/accepted/rejected/expired), reviewed_by?, reviewed_at?, expires_at, created_at Staging area for auto-resolved investigations that want to be promoted to KB entries. Everything here is provisional until a human reviews. TTL 30 days default; expired rows auto-archive.
memory_nodes id, workspace_id, label (customer/user/system/connector/ticket/investigation/fact/observation/resolution), natural_key, properties (jsonb), provenance (operator/proposed/imported), status (active/provisional/archived), first_seen_at, last_seen_at, last_used_at, approved_by?, approved_at? Nodes of the long-term memory graph. See ARCHITECTURE.md §22 for the full model, the MemoryGraph protocol, and the three backend implementations (SQLite default, Postgres+AGE opt-in, FalkorDBLite opt-in). Unique index on (workspace_id, label, natural_key).
memory_edges id, workspace_id, from_node_id, to_node_id, relation (7 typed categories), weight, properties (jsonb), observed_at, decayed_at? Edges of the long-term memory graph. Typed relations per ARCHITECTURE.md §22 (Causal / Solution / Context / Learning / Similarity / Workflow / Quality). Composite indexes on (workspace_id, from_node_id, relation) and (workspace_id, to_node_id, relation).

Design rules the models must follow

  1. Every row has workspace_id. Even in v0.1. Cheap now, impossible to retrofit later.
  2. Secrets are encrypted at rest. connectors.config_encrypted, any API key-bearing column. Envelope encryption with a data key from the secrets provider.
  3. Nothing ever hard-deletes. Soft-delete with deleted_at. We need audit reconstructibility.
  4. Every sensitive column has a redaction rule. Before an investigation step is sent to an LLM, PII is stripped per the workspace's compliance profile.
  5. UUIDv7 for all primary keys (time-sortable, index-friendly).

4. Test strategy

Testing an LLM-driven system is where most AI projects rot. We avoid that by separating the deterministic parts from the LLM-driven parts and testing each on its own terms.

4.1 Layers

Layer Tool Scope Where it runs Budget
Unit pytest Pure functions and classes. Chunker, scope evaluator, audit hash chain, prompt builders, retrieval scoring, DB models against in-memory SQLite. No network. Every PR, pre-commit <30s full run
Property hypothesis Invariants on the critical-path: scope evaluator, audit hash chain, retrieval top-k containment, redaction idempotence. Every PR <60s
Integration pytest + testcontainers Real SQLite on disk; real Postgres via testcontainers for the Postgres profile. FastAPI TestClient. Real MCP server spawned as a subprocess. LLM calls go to a deterministic mock provider that replays recorded transcripts. Every PR <5 min
Connector contract pytest (shared fixtures) Every connector (ours + community) must pass: tool-list endpoint, scope declaration, healthcheck, dry-run of each declared action. On every connector PR <2 min/connector
End-to-end Playwright (existing harness) Full stack in Docker Compose: backend + web + mock connectors + deterministic LLM. Walks through founder onboarding → dashboard → investigation. Every PR, nightly full sweep <10 min
Load k6 API throughput: 100 concurrent investigations, p95 latency, error rate. Weekly scheduled 15 min
Evals (LLM-specific) promptfoo (v0.1) → Inspect AI (v0.2+) for full agent evals with tool calls Fixed corpus of ≥50 tickets with known-good resolutions. Measures auto-resolution rate, citation accuracy, safety-boundary compliance. promptfoo handles prompt regression well; Inspect AI (from UK AISI) is purpose-built for agent evaluation and fits our v0.2 tool-calling tests better. Manual before every release, automated weekly 30 min

4.2 Making LLM-driven code deterministic for CI

The agent loop calls an LLM. We do two things to make that testable:

  1. Transcript replay. Every integration test that exercises the agent uses a "fake LLM provider" that is seeded with a recorded transcript (tool-call sequences plus final text) stored under backend/tests/fixtures/llm_transcripts/. The test asserts on the deterministic behavior (scope checks, DB writes, audit log shape), not the LLM text.
  2. Evals are a separate beast. Evals use real LLM calls against a fixed corpus, run on a schedule (not per-PR), and measure quality metrics like auto-resolution rate and citation accuracy. They gate releases but not PRs.

This separation is the difference between "our tests pass in 3 minutes" and "our tests cost $400/month in OpenAI credits".

4.3 What we refuse to test with mocks

  • The DB schema. Integration tests use a real DB engine, not an in-memory mock. Schema bugs cost 10× to find in prod.
  • The MCP protocol. Integration tests spawn a real MCP server (a tiny stub is fine) and exchange real messages.
  • The FastAPI request lifecycle. Integration tests use TestClient, which runs the real middleware stack.

4.4 Test file conventions

  • backend/tests/unit/ mirrors backend/src/gaby/ 1:1. If you touch agent/loop.py, you touch tests/unit/agent/test_loop.py.
  • Integration tests are in backend/tests/integration/ and are marked @pytest.mark.integration. The default pytest command runs unit + property only; pytest -m integration runs the slower ones.
  • Every test file starts with from __future__ import annotations.
  • No shared mutable fixtures across tests — everything is function-scoped or rebuilt per test.

4.5 Coverage

  • Target: 85% line coverage on backend/src/gaby/, enforced in CI. Not because coverage is a quality metric, but because it catches "forgot to wire this up" regressions.
  • Critical paths are 100%: safety/, audit, scopes, authz. A PR that drops these below 100% is auto-rejected.
  • Frontend coverage is best-effort; we rely on Playwright e2e for confidence there.

5. Design system — from prototype to real UI

The persona prototypes under personas/ are already the spec. Porting them to React is a translation, not a redesign. Here is how.

5.1 The tokens we keep

Pulled from shared/styles.css:

Token family Prototype source Target
Color palettes .hero-gradient-*, .btn-primary-*, .selected-*, .active-* (indigo / violet / emerald / sky) web/src/lib/theme.ts → CSS variables + Tailwind theme extension
Typography Inter + JetBrains Mono from index.html tailwind.config.ts fontFamily
Spacing / radius Tailwind defaults Unchanged
Shadows stat-card, feature-card styles shadow-sm/md/lg tokens in Tailwind
Animations float-up, slide-in, pulse-dot, progress-bar tailwindcss-animate plugin + custom keyframes in globals.css

5.2 Domain components to build

Each maps to a function in shared/components.js:

Component Prototype function Responsibility
<OnboardingWizard> steps in personas/*/index.html Generic 5-6 step wizard container, consumes a config array
<ProgressDots> renderProgress Wizard progress indicator
<Sidebar> renderSidebar Persona-themed nav
<StatCard> renderStatCard Dashboard stat tile
<TicketRow> renderTicketRow One ticket in a queue
<TicketFilters> filterTickets logic Filter buttons over a ticket list
<InvestigationTimeline> renderTimelineStep Full investigation view with timeline steps
<TimelineStep> renderTimelineStep One step in an investigation
<ConnectorCard> renderServerCard A connector in the catalog or settings
<Toast> showToast Transient notification
<SimulatedInvestigation> simulateInvestigation The "live demo" step in onboarding

5.3 Persona theming

Persona colors become CSS variables scoped to a wrapper class:

/* web/src/styles/globals.css */
:root[data-persona="founder"]      { --gaby-primary: 99 102 241; /* indigo-500 */ }
:root[data-persona="support-lead"] { --gaby-primary: 139 92 246; /* violet-500 */ }
:root[data-persona="sre"]          { --gaby-primary: 5 150 105;  /* emerald-600 */ }
:root[data-persona="msp"]          { --gaby-primary: 2 132 199;  /* sky-600 */ }

Components then use bg-[rgb(var(--gaby-primary))] / text-[rgb(var(--gaby-primary))]. One set of components, per-persona skin with no code duplication.

5.4 What happens to the prototypes once the React UI ships

  • They stay as static files, served from the repo, and remain the visual spec and marketing.
  • The React app is available at /app (served by the backend); the landing page stays at /.
  • When the React UI reaches parity with a persona's prototype, the prototype is marked [reference] in its header and the React app becomes the runtime.
  • We never break the prototypes. They are easier to share with non-engineers than a running SPA.

5.5 Accessibility baseline

  • All new components are built on Radix primitives (via shadcn), which ship with keyboard navigation and ARIA.
  • Every interactive element has a data-testid matching the prototype conventions (already observed in the HTML prototypes).
  • Color contrast meets WCAG AA at minimum; we test with @axe-core/react in dev.
  • Every form has labels, and error messages are read out via aria-live.

6. Observability, ops, and security baselines (day-1 musts)

6.1 Observability (cross-cutting)

Signal Default Opt-in
Logs Structured JSON to stdout via structlog Ship to Loki / Datadog via OTLP
Traces OpenTelemetry, off by default GABY_OTEL_EXPORTER=otlp://... turns it on
Metrics Prometheus /metrics endpoint always on Dashboards ship as JSON in docs/operations/dashboards/
Status Built-in /status page in the web UI /health, /ready for probes

6.2 Security (non-negotiable at v0.1)

Control Implementation
Authenticated API by default No "open mode" in v0.1. First run mints an admin user.
Secrets at rest Envelope encryption (AES-GCM) with data keys from the configured secrets provider
CSRF Session cookie + CSRF token on every state-changing web route
CSP Strict-ish CSP by default; MkDocs docs are served from a separate origin
Input validation Pydantic models on every route; reject unknown fields
Rate limits Per-IP and per-API-key, token-bucket, in Redis or in-process
Dependency scanning Dependabot + pip-audit + pnpm audit in CI
SBOM syft in the release workflow; SBOM attached to every Docker image
Vuln disclosure SECURITY.md with a PGP-signed security address

6.3 CI pipelines (GitHub Actions)

Workflow Triggers Steps Gate
backend-ci.yml PR + main uv sync → ruff check → mypy strict → pytest (unit + property) → pytest-cov ≥85% Blocking
backend-int.yml PR + nightly pytest -m integration with testcontainers Postgres Blocking
web-ci.yml PR + main pnpm i → biome check → tsc --noEmit → vitest Blocking
widget-ci.yml PR + main pnpm i → biome check → vitest → build size-limit check Blocking
e2e.yml PR + nightly docker compose up → Playwright run → upload traces on failure Blocking (fast suite) / nightly (full)
connector-ci.yml PR touching connectors/** Run contract tests against the touched connectors Blocking
docs.yml PR + main MkDocs strict build + broken-link check Blocking
release.yml tag v*.*.* Build + push Docker images, Helm chart, PyPI wheel, npm widget; publish SBOM; draft release notes Manual review
eval.yml weekly + manual Run the LLM eval harness; post a summary comment to a tracking issue Informational

Cache strategy: uv and pnpm caches keyed on their respective lockfiles.

6.4 Release cadence

  • v0.x: weekly or whenever a meaningful slice is ready. Breaking changes are expected and flagged in CHANGELOG.md.
  • v1.0: the four personas ship, API is stable, upgrade path is documented. Semver from here.

7. v0.1 exit criteria (Founder persona) — what shipping actually means

We ship v0.1 when a technical founder can docker compose up Gaby, connect their stack in under 10 minutes, and receive a Slack DM by the next morning showing a real ticket Gaby auto-resolved overnight.

7.1 In scope for v0.1

  • [ ] Installation: docker compose up works on macOS, Linux, WSL2. SQLite only; no Postgres / Redis required.
  • [ ] Onboarding wizard: the Founder flow from personas/founder/index.html, fully real in React.
  • [ ] Connectors shipped: PostgreSQL (read-only + limited write), Keycloak (read-only), Zoho Desk (read + reply).
  • [ ] Ticket source: Zoho Desk polling adapter. Canonical ticket model in the DB.
  • [ ] Knowledge ingestion: point at a local ./docs folder; Markdown + PDF.
  • [ ] Agent loop: Anthropic Claude via litellm, homegrown loop, token + dollar budget per investigation.
  • [ ] Safety: three autonomy modes; dry-run by default on writes; approval queue for propose.
  • [ ] Audit log: append-only, hash-chained.
  • [ ] Web UI: onboarding → dashboard → investigation detail → basic settings.
  • [ ] Slack escalation: outbound only. Nightly summary email via SMTP.
  • [ ] Observability: structured logs, /metrics, /health, /ready.
  • [ ] Docs: quickstart page that matches the install experience; one-page "how Gaby works"; one-page security overview.
  • [ ] Tests: every layer above passing in CI. 85% backend coverage. 100% on safety/.

7.2 Explicitly NOT in v0.1

  • Multi-workspace mode (it's built into the schema but we serve one default workspace)
  • Per-client autonomy rules (MSP persona)
  • Human chat widget (v0.3 at earliest)
  • Slack / Teams inbound bot (only outbound escalation)
  • SSO / SAML / SCIM (Enterprise Edition)
  • Air-gapped install mode (Enterprise Edition)
  • All connectors not listed above
  • The Support Lead / SRE / MSP persona wizards
  • The operator chat console

7.3 Definition of done for the v0.1 release

  • [ ] All in-scope checkboxes above are ticked
  • [ ] The founder quickstart (docs/quickstart/docker-compose.md) runs green on all three target OSes
  • [ ] A five-ticket live test against a staging Zoho Desk: at least three of five tickets are auto-resolved, all writes go through the safety layer, the audit log reconstructs the full sequence
  • [ ] The eval suite (50 fixture tickets) achieves ≥60% auto-resolution with zero safety violations
  • [ ] CHANGELOG.md has a v0.1.0 entry written by a human
  • [ ] A tagged release with Docker images on GHCR, Helm chart in the chart repo, and Python wheel on PyPI
  • [ ] The landing page's "Join Waitlist" button is replaced with "Install" once v0.1 ships

8. Which things we will probably get wrong (so future-us can forgive us)

These are the calls I am least confident about. They will likely change. Document them here so the change is expected, not a crisis.

  1. Agent loop as homegrown. If after 3 months we are spending more time on loop mechanics than on prompts and tools, we adopt the matching piece of pydantic-ai or LangGraph. The loop's public interface is small enough to swap.
  2. arq vs Celery. arq is simpler but newer. If we hit production pain (worker supervision, retries, visibility), we switch to Celery before v0.5.
  3. sqlite-vec vs pgvector lock-in. We abstract the vector operations behind a tiny interface so swapping is a day of work, not a week.
  4. Widget in a shadow DOM. Theoretically clean, practically has quirks (fonts, CSP). If it becomes a maintenance sink, we fall back to an iframe.
  5. Keeping the HTML prototypes as reference after the React UI ships. This might create drift. Mitigation: a CI check that both prototype and React page use the same data-testid set.
  6. In-process worker vs external. Great for v0.1 demos, but "Gaby is slow" will probably mean "the worker is blocking the API". v0.2 promotes arq+Redis to default in Compose.

9. What this document is not

  • Not the detailed technical architecture. That is ARCHITECTURE.md — sequence diagrams, exact class contracts, concurrency model, scaling notes. It follows this document.
  • Not a dated roadmap. That is ROADMAP.md — v0.1 through v1.0 with estimates and milestones.
  • Not a contributor guide. That is CONTRIBUTING.md — dev setup, branch rules, code review rituals.
  • Not a product spec. That is SPEC.md — the what and the why.

Read in order: SPEC.mdFOUNDATION.md (this) → ARCHITECTURE.mdROADMAP.md.


10. Appendix — decisions at a glance

Area Choice
Backend lang Python 3.12+
Backend framework FastAPI
Agent loop Homegrown, ≤500 LOC
LLM abstraction litellm (BYOK) + direct Anthropic/OpenAI (hot paths)
MCP Official Python SDK
ORM SQLAlchemy 2.x async + Alembic
DB default SQLite (+ sqlite-vec + FTS5)
DB opt-in Postgres (+ pgvector + tsvector)
Background jobs arq (Redis) with in-process fallback
Package manager uv
Linter ruff + mypy --strict
Frontend Vite + React 19 + RR7 + TS strict
Styling Tailwind 4 + shadcn/ui
Server state TanStack Query
Client state Zustand
Forms React Hook Form + Zod
API client openapi-typescript generated from FastAPI OpenAPI
Widget Vite library mode → shadow DOM React
Frontend lint Biome
Frontend pm pnpm
Tests (BE) pytest + hypothesis + testcontainers + respx
Tests (FE) Vitest + RTL + Playwright (reuse tests/)
Deterministic LLM testing transcript replay via a fake provider
Evals promptfoo or homegrown, scheduled not per-PR
Docs MkDocs Material
Containers Docker + Compose (v0.1), Helm (v0.2)
Registry GHCR primary, Docker Hub mirror
License (core) Apache 2.0
License (EE) Commercial
Contribution DCO
Release cadence Weekly v0.x, monthly from v1.0