---
## Why SBproxy
Most teams run one tool for HTTP traffic and another for LLM traffic. That's two systems to configure, deploy, and monitor. SBproxy handles both in one binary.
- **One config file** replaces your reverse proxy, AI gateway, and the middleware glue between them.
- **200+ LLM models** behind an OpenAI-compatible API, with fallback chains, guardrails, and budgets.
- **Secure by default.** Auth, rate limiting, WAF, DDoS protection, and CSRF protection are built in.
- **Hot reload** with no dropped connections.
- **Sub-millisecond p99 overhead.** Idle RSS in single-digit megabytes.
---
## Install
curl (macOS / Linux):
```bash
curl -fsSL https://download.sbproxy.dev | sh
```
The script detects your OS and architecture, fetches the matching release binary from GitHub, and drops it in `~/.local/bin`. Override with `SBPROXY_INSTALL=` for a custom location or `SBPROXY_VERSION=` to pin a release.
Homebrew (macOS / Linux):
```bash
brew tap soapbucket/tap
brew install sbproxy
```
Docker:
```bash
docker pull ghcr.io/soapbucket/sbproxy:latest
```
From source (needs Rust 1.82+):
```bash
git clone https://github.com/soapbucket/sbproxy
cd sbproxy
make build-release
```
---
## Quick start
We host a public HTTP echo service at `test.sbproxy.dev` (request inspection, like httpbin) so you can wire up a real upstream without leaving the SoapBucket ecosystem. Try it directly:
```bash
curl https://test.sbproxy.dev/get
```
Now run the gateway in front of it. Drop this into `sb.yml`:
```yaml
proxy:
  http_bind_port: 8080
origins:
  "myapp.example.com":
    action:
      type: proxy
      url: https://test.sbproxy.dev
```
```bash
make run CONFIG=sb.yml
curl -H "Host: myapp.example.com" http://127.0.0.1:8080/get
```
`myapp.example.com` is the host your client sees; SBproxy matches it against `origins:` and forwards to the upstream. Use any hostname you want here; `example.com` is reserved (RFC 2606), so it never collides with anything real.
That's a reverse proxy. Add AI routing, auth, and rate limiting in the same file. See [`examples/`](examples/) for runnable end-to-end configurations covering each feature.
---
## Documentation
The full documentation lives in [`docs/README.md`](docs/README.md): manual, configuration reference, AI gateway guide, scripting reference, performance, troubleshooting, architecture, and more. Running the operator for the first time? Start with [`docs/quickstart-operator.md`](docs/quickstart-operator.md).
For contributors: [CONTRIBUTING.md](CONTRIBUTING.md).
---
## Community
- [Issue Tracker](https://github.com/soapbucket/sbproxy/issues) for bug reports and feature requests.
- Looking for a managed offering? [SBproxy Enterprise](https://sbproxy.dev/enterprise).
---
## Upgrading from v0.1.x (Go)
SBproxy v1.0 is a Rust rewrite. The Go implementation that previously occupied this repository is archived at [soapbucket/sbproxy-go](https://github.com/soapbucket/sbproxy-go) and tagged `v0.1.2-go-final`. New work happens here. See [MIGRATION.md](./MIGRATION.md) for the upgrade path; existing `sb.yml` files should compile unchanged.
---
## License
Licensed under the [Apache License 2.0](LICENSE). Free for any use, including production and commercial, with no field-of-use restriction.
See also [NOTICE](NOTICE) and [TRADEMARKS](TRADEMARKS.md). A [Soap Bucket LLC](https://www.soapbucket.com) project.
================================================================
# MIGRATION.md
================================================================
## Migrating from v0.1.x (Go) to v1.0 (Rust)
*Last modified: 2026-04-28*
SBproxy v1.0 replaces the Go implementation with a Rust rewrite built on Cloudflare's Pingora. This document covers what changes for operators upgrading from a v0.1.x Go binary to a v1.0 Rust binary.
The v0.1.x Go binary continues to be available at `github.com/soapbucket/sbproxy-go` (archived, read-only) at the `v0.1.2` release tag. New development happens only on v1.0 and later.
## TL;DR
- Your `sb.yml` is mostly portable. Field names match. Most operators upgrade by swapping the binary and re-deploying.
- The install command and binary name are unchanged (`sbproxy`, `brew install sbproxy`, `ghcr.io/soapbucket/sbproxy:latest`).
- No v0.1.x flags or `sb.yml` fields were removed or renamed in v1.0, but two defaults changed. See `Breaking changes` below.
- Performance improves substantially (3x throughput, 3-4x lower p99 on the AI path) with no config changes required.
## What's the same
- **Config language**. `sb.yml` field names, structure, and semantics are preserved across the proxy, AI gateway, auth, policy, transform, and modifier surfaces.
- **Binary name and install paths**. The binary is still `sbproxy`. `brew install sbproxy` and `docker pull ghcr.io/soapbucket/sbproxy:latest` continue to work.
- **Hot reload**. Send `SIGHUP` (or save the config file when watcher mode is on) and the new pipeline atomically swaps in.
- **Admin endpoint**. `/api/health`, `/api/metrics`, `/api/openapi.{json,yaml}` work the same way.
- **CEL and Lua scripts**. Existing CEL expressions and Lua transform scripts run unchanged on the Rust extension engine.
- **Provider catalog**. The 90+ AI provider catalog is the same data file; existing AI routes continue to resolve providers by the same names.
## What's new in v1.0
These are additive and do not require config changes:
- **Cloudflare-style edge security policies**: `ai_crawl_control` (Pay Per Crawl), `exposed_credentials`, `page_shield`, `bulk_redirects`, `cache_reserve`, `dlp_catalog`, `web_bot_auth`. See `docs/` for each.
- **OpenAPI emission**. The gateway publishes its live config as OpenAPI 3.0 at `/api/openapi.json` (admin) and per-host `/.well-known/openapi.json` (opt-in via `expose_openapi: true` on the origin).
- **Storage action with real backends**. The `storage` action now drives S3, GCS, Azure Blob, or local filesystem via `object_store`.
- **JavaScript and WASM scripting** alongside CEL and Lua.
- **Pattern-aware PII redaction at the request boundary** for AI routes.
- **Single-digit-MB idle RSS** and sub-millisecond p99 added latency.
- **Hierarchical budgets across team/project/user/model** with downgrade-on-exceed.
## Breaking changes
### Removed
- No CLI flags or environment variables from v0.1.x have been removed in v1.0. If your v0.1.x deployment uses a non-default flag and you cannot find the equivalent in v1.0, file an issue tagged `migration`.
### Renamed
- No `sb.yml` field renames between the v0.1.x Go config schema and the v1.0 Rust config schema. (The internal config schema is also referred to as `schema-v1`; that label has not changed.) The compatibility promise is pinned by the `v1_compat::v1_fixtures_compile_unmodified` test in `crates/sbproxy-config/`. If a real-world v0.1.x config fails to compile under v1.0, that is a bug; file an issue tagged `migration`.
### Default changes
- The upstream `Host` header now defaults to the upstream URL's hostname (matching nginx and Envoy `auto_host_rewrite`). Set `host_override` per action to keep the v0.1.x client-Host pass-through behavior.
- `proxy.trusted_proxies` is now strictly enforced. When the immediate TCP peer is not in the trust list, inbound `X-Forwarded-*` headers are stripped on ingress (forgery defense). v0.1.x had a more permissive default.
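The stricter `proxy.trusted_proxies` behavior can be pictured with a minimal sketch. It is not SBproxy's code: `sanitize_forwarded_headers` is a hypothetical helper, and the trust list is assumed to hold exact IP addresses (a real trust list would more likely hold CIDR ranges).

```rust
use std::net::IpAddr;

/// Hypothetical helper, not part of SBproxy's API: drop inbound
/// `X-Forwarded-*` headers unless the immediate TCP peer is trusted.
/// Assumption: `trusted` holds exact addresses rather than CIDR ranges.
fn sanitize_forwarded_headers(
    peer: IpAddr,
    trusted: &[IpAddr],
    headers: Vec<(String, String)>,
) -> Vec<(String, String)> {
    if trusted.contains(&peer) {
        // Trusted hop: keep the forwarding chain it reported.
        return headers;
    }
    // Untrusted peer: any X-Forwarded-* value could be forged, so strip it.
    headers
        .into_iter()
        .filter(|(name, _)| !name.to_ascii_lowercase().starts_with("x-forwarded-"))
        .collect()
}
```

If a v0.1.x deployment sat behind a load balancer that was never listed in the trust list, this is the path that would now remove its `X-Forwarded-For` header, so checking the list before upgrading is worthwhile.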
## Recommended upgrade procedure
1. **Read `CHANGELOG.md`** for the full list of changes between your starting v0.1.x version and v1.0.0.
2. **Stage v1.0 alongside v0.1.x** in a non-production environment. Point a copy of your `sb.yml` at the v1.0 binary and run `sbproxy validate sb.yml`. Address any validation errors.
3. **Run a smoke test** against a small percentage of real traffic. Observe `/api/metrics` and `/api/health/targets` for any regressions in 4xx/5xx rates or upstream latency.
4. **Verify the signed binary** before promoting to production. v1.0 ships with cosign signatures and an SBOM; see `SUPPLY-CHAIN.md` for the verification commands.
5. **Promote to production** once the smoke test is clean.
6. **Keep v0.1.x available for rollback** for at least one full deployment cycle. The v0.1.x binary at the `v0.1.2` tag of `github.com/soapbucket/sbproxy-go` is the recommended rollback target.
## Help
- File migration questions as an issue tagged `migration` on `github.com/soapbucket/sbproxy`.
- Security-sensitive issues go through `SECURITY.md`.
- For paid migration support (e.g., enterprise customers with non-trivial v0.1.x customizations), contact support@soapbucket.dev.
================================================================
# CHANGELOG.md
================================================================
## Changelog
All notable changes to SBproxy v1.x. Versions before v1.0 shipped as the
Go implementation and now live in the archived
[`soapbucket/sbproxy-go`](https://github.com/soapbucket/sbproxy-go)
repository.
## [Unreleased]
Work that has merged to `main` since the v1.1.0 tag and is queued for
the next version cut. No promises about backward compatibility for any
of the new YAML fields below until the version that ships them.
## [1.1.0] - 2026-06-06
First minor release on the Rust v1.x line. This release carries
breaking changes to the MCP tool-access policy (now closed-by-default
and principal-aware); read the Breaking section and
`docs/migration-mcp-rbac.md` before upgrading. It also ships 66 native
AI providers behind one OpenAI-compatible API.
### Breaking
- **MCP default-deny**: `ToolAccessPolicy` flipped from
open-by-default to closed-by-default. An unknown caller (no
matching ACL rule) is denied every tool. An empty `allowed: []`
list under an ACL rule means "deny all", not "allow all".
Operators who want the legacy behaviour add `default_allow: true`
on the origin's MCP action. The legacy `key_permissions: { key: [tools] }`
shape is gone; rewrite to the principal-aware `tool_access[]`
selector list. See `docs/migration-mcp-rbac.md`.
- **MCP principal-aware ACL**: `ToolAccessPolicy` now
carries `tool_access[]` rules with `principals[]` selectors
(`virtual_key`, `sub`, `team`, `project`, `user`, `role`,
`tenant_id`) plus an `allowed[]` tool list. The legacy
`key_permissions` map (key to tool list) is removed
along with `ToolAccessPolicy::is_tool_allowed(key, tool)`; the new
surface is `policy.check(&principal, tool) -> ToolAccessDecision`
and `policy.filter_tools(&principal, &tools)`. `tools/list` now
filters by RBAC against the inbound principal (the legacy schema
leaked tool names through `tools/list` even when the gate would
deny the matching `tools/call`). A new `tool_quotas[]` table
enforces per-tool sliding-window quotas keyed on
`(tenant_id, principal_id, tool_name)`. See
`docs/migration-mcp-rbac.md`.
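The closed-by-default semantics described above can be sketched as follows. The types are illustrative stand-ins, not the real `ToolAccessPolicy`: selector matching is reduced to string equality, and "first matching rule decides" is an assumption made for the sketch.

```rust
/// Illustrative stand-ins only; the real SBproxy types differ.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny,
}

struct Rule {
    /// Principals this rule applies to (plain string match in this sketch).
    principals: Vec<String>,
    /// Tools the matched principals may call. Empty means "deny all".
    allowed: Vec<String>,
}

struct Policy {
    rules: Vec<Rule>,
    /// Mirrors `default_allow: true`, the opt-in to the legacy behaviour.
    default_allow: bool,
}

impl Policy {
    fn check(&self, principal: &str, tool: &str) -> Decision {
        // Assumption: the first rule that matches the principal decides.
        for rule in &self.rules {
            if rule.principals.iter().any(|p| p == principal) {
                return if rule.allowed.iter().any(|t| t == tool) {
                    Decision::Allow
                } else {
                    Decision::Deny
                };
            }
        }
        // Unknown caller: denied unless the operator opted back in.
        if self.default_allow {
            Decision::Allow
        } else {
            Decision::Deny
        }
    }

    /// What a `tools/list` filter could look like: only tools the
    /// principal may call are returned, so denied names do not leak.
    fn filter_tools<'a>(&self, principal: &str, tools: &[&'a str]) -> Vec<&'a str> {
        tools
            .iter()
            .copied()
            .filter(|t| self.check(principal, t) == Decision::Allow)
            .collect()
    }
}
```

Under this sketch, a rule with `allowed: []` and a caller with no matching rule both end in `Deny`, which is the behaviour change operators need to plan for.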
### Added
- **66 native AI providers behind one OpenAI-compatible API.** The
embedded `ai_providers.yml` registry ships 66 providers (up from 43),
adding Hugging Face Inference, GitHub Models, Vercel AI Gateway,
Nebius, Baseten, Lambda, FriendliAI, Scaleway, Nscale, DigitalOcean
Gradient, OVHcloud, Inference.net, kluster.ai, OpenPipe, Writer,
Upstage, Aleph Alpha, MiniMax, Volcengine Ark (Doubao), Tencent
Hunyuan, Baidu Qianfan (ERNIE), StepFun, and Mixedbread. The catalog
is plain YAML and operator-extensible at runtime via
`proxy.ai_providers_file`; the `model` field passes through to the
upstream, so any model a provider serves is reachable without
per-model config. The "200+ models" reach is native (bring your own
keys); OpenRouter is one provider among the 66, not a dependency. See
`docs/providers.md#extending-the-provider-catalog`.
- **Session ledger from live MCP traffic.** A new top-level
`session_ledger:` block makes SBproxy emit the canonical
`session-ledger-v1` run record (shared with mcptest) from its
`tools/call` path: one `header` per session, then one `tool_call`
record per call carrying `session_id`, a zero-based `hop_index`, the
bare tool name and server, redacted `params` / `result`, an error
flag, and the round-trip `duration_ms`. `sink: logging` (default)
emits each record as a `session_ledger` tracing line; `sink: file`
with a `path:` appends NDJSON. Off unless `enabled: true`; when off
the tool-call path pays only a single atomic load. Payloads are
redacted with the same secret-stripping the access log uses. See
`docs/mcp.md` and `examples/mcp-federation/sb.yml`.
- **Structured-log schema v2 (`SCHEMA_VERSION = "2"`).** Three changes
land together so downstream tooling can read them in one swing:
optional `session_id` and `user_id` top-level fields parallel the
`RequestEvent` envelope (cross-surface JOIN no longer relies on
`request_id` alone); the field-key redaction marker is normalised
to `[REDACTED:]` everywhere (replacing the v1 marker) so
the schema-v1 layer matches the existing PII-rule replacement
shape; the schema bump is additive on the field set (a v1 reader
parsing a v2 line keeps working because every new field is
`skip_serializing_if = Option::is_none`). Marker normalisation is
a string change; downstream tooling that greps for the old
v1 marker must update.
- **Phase-timing breakdown on the access log + new
`sbproxy_phase_duration_seconds` Prometheus histogram.** The
access log carried `latency_ms` end to end and that was it; an
operator looking at a slow request could not tell from the log
whether the time went to the auth provider, the upstream, or a
response transform. Three new optional fields land on every
`AccessLogEntry`: `auth_ms` (request_start → auth provider
returned), `upstream_ttfb_ms` (request_start → first upstream
response byte), `response_filter_ms` (first upstream byte → end
of `response_filter`). All three are `Option` and
`serde-skip` when None, so origins that short-circuit (cache
hit, auth deny) keep compact lines. The same observations also
feed a new `sbproxy_phase_duration_seconds{phase, origin}`
histogram with buckets identical to
`sbproxy_request_duration_seconds` for cross-cut dashboards. See
`docs/access-log.md` and `docs/metrics-stability.md`.
- **Nine standard HTTP fields on the access log: `host`, `query`,
`protocol`, `scheme`, `user_agent`, `referer`, `upstream_status`,
`response_content_type`, `response_content_encoding`.** The log
was missing the canonical fields most HTTP access-log consumers
expect (Apache, NGINX, Envoy, the cookie-cutter ELK pipeline).
`host` is the client-supplied Host header (distinct from
`origin`, the matched virtual-host pattern); `upstream_status`
is the upstream's response code when the proxy rewrote the
status the client sees. All nine are `Option`, `serde-skip` when
not applicable. Promoted from the generic header allowlist
because nearly every analytics consumer wants them. See
`docs/access-log.md`.
- **Opt-in OpenTelemetry metrics mirror alongside the canonical
Prometheus surface.** New `telemetry.export_metrics: true`
(with `telemetry.metrics_interval_secs` cadence, default 30s)
installs an OTel `MeterProvider` that ships observations to the
same OTLP collector the trace pipeline targets. The first two
mirrored instruments are `sbproxy.phase.duration` and
`sbproxy.request.duration`; record-paths fall back to OTel's
global no-op meter when the export is off, so operators pay
nothing for the mirror unless they opt in. The Prometheus
surface remains canonical; this is for operators who already
aggregate via Mimir / Datadog / Honeycomb and want to skip the
Prometheus scrape.
- **OIDC Relying-Party stack shipped end to end.**
`/oidc/callback` (auth-code + PKCE + sealed session cookie)
plus the helpers + config wiring for
`/.well-known/openid-configuration` discovery, refresh-token
rotation, RP-initiated logout at `/oidc/logout`, userinfo →
`X-Auth-*` trust headers, an optional server-side session store
(in-memory + KV-backed redb/file/Redis) for targeted revocation.
See `docs/configuration.md` § OIDC auth.
- **OpenAI Apps SDK / MCP Apps (SEP-1865) compatibility.**
Gateway-side `_meta.mcpApps` passthrough for tool definitions,
`params.audit.cause` plumbing on `tools/call`, and a typed
validator set (`apps.template_declared`, `apps.iframe_sandbox`,
`apps.csp_present`, `apps.cache_metadata`) usable by sbproxy,
the enterprise extension, and any CI gate over the
`sbproxy-plugin` surface.
- **Web Bot Auth full conformance, publish + sign sides.**
SBproxy now publishes its own JWKS-shaped
directory at `/.well-known/http-message-signatures-directory`
and a Signature Agent Card at
`/.well-known/web-bot-auth/agent-card` (opt in via
`web_bot_auth_publish` per origin). New
`sbproxy-middleware::signatures::MessageSignatureSigner`
primitive signs outbound requests per RFC 9421, round-trips
through the existing verifier. See `docs/web-bot-auth.md` and
`examples/web-bot-auth-publish/`.
- **Three previously-undocumented OSS policies now have docs +
runnable examples:** `object_authz` (BOLA + BFLA with
enumeration detection), `content_digest` (RFC 9530 request-body
verification), `agent_budget` (per-agent semantic rate limit).
See `docs/object-authz.md`, `docs/content-digest.md`,
`docs/agent-budget.md`.
- **Discoverable FAQ.** `docs/faq.md` covers install, common
401 causes, OIDC minimal config, log levels, OSS-vs-enterprise
scope, and pointers into the rest of `docs/`. Wired into
`docs/README.md` under "Getting started".
- **Explicit SIGINT/SIGTERM handling with a structured shutdown
event and a 30s default drain budget.** Pingora's
`Server::run_forever` already trapped SIGTERM and SIGINT, but
the proxy emitted no operator-facing log line on receipt, so a
pod eviction or `docker stop` looked the same as a crash in the
log stream. This change subscribes to Pingora's execution-phase
broadcast and emits `shutdown_signal_received`,
`shutdown_grace_period`, and `shutdown_complete` tracing events
with the resolved grace budget. The Kubernetes operator
(`sbproxy-k8s-operator`) now installs the same SIGINT/SIGTERM
handlers via `tokio::signal::ctrl_c` and
`tokio::signal::unix::signal(SignalKind::terminate())`; before
this change the operator relied on the orchestrator SIGKILL at
`terminationGracePeriodSeconds`. The drain budget is the new
`SBPROXY_SHUTDOWN_GRACE_MS` env var (or `--shutdown-grace-ms`
CLI flag) which defaults to 30000ms, matching Kubernetes'
default `terminationGracePeriodSeconds`. The legacy
`SB_GRACE_TIME` / `--grace-time` (seconds) still works and
takes precedence when explicitly set; an unset legacy var lets
the new 30s default apply. Operator exits 0 on a clean drain,
1 when the grace window is exceeded, so the orchestrator can
alert. Documented in `docs/manual.md` §3 and
`docs/kubernetes.md` §Graceful shutdown.
- **Idempotency middleware now engages on AI gateway origins
(`action: ai_proxy`).** Before this change, the
`Idempotency-Key` middleware only ran on general HTTP origins
(`action: proxy`). AI customers using `Idempotency-Key`
headers for Stripe-style retries were double-billed by the
upstream provider because the proxy did not replay from cache.
The fix engages the same primitive in `handle_ai_proxy` after
the request body is buffered (the AI gateway already buffers
for the JSON parser, model router, and guardrails) and before
the upstream call. On a cache hit the gateway writes the
cached `(status, headers, body)` triple directly to the client
with `x-sbproxy-idempotency: HIT` and never contacts the
provider. On a body conflict the gateway returns 409
`ledger.idempotency_conflict`. On a miss the
gateway forwards, then records the post-translation OpenAI-shape
bytes the client actually saw so retries replay byte-identical.
Reuses the same per-request and pool caps shipped on
`CompiledIdempotency`: `max_request_body_bytes`,
`max_response_body_bytes`, `max_concurrent_buffers`. The four
skip markers (`SKIPPED-OVERSIZE-REQUEST`, `SKIPPED-POOL-FULL`,
`SKIPPED-OVERSIZE-RESPONSE`, `SKIPPED-MULTIPART`) stamp on the
outgoing response so operators see graceful degradation in
dashboards. Multipart bodies (audio transcription, image edit /
variation, file upload) skip caching with `SKIPPED-MULTIPART`
because the cache primitive stores raw bytes and multipart
boundaries may be regenerated by clients on retry. Streaming
(SSE) chat completion responses abandon the cache record on
oversize because framing-aware capture is out of scope for v1.
- **`proxy_status` and `problem_details` now cover upstream
failures.** Before this change, `proxy_status.enabled: true`
stamped the `Proxy-Status` header on proxy-generated errors
(auth deny, policy deny, default 404) but **not** on upstream
failures routed through Pingora's `fail_to_proxy` path (connect
refused, connect timeout, TLS handshake error, mid-stream
connection loss). The fix wires both blocks into the
upstream-failure path so dashboards consuming `Proxy-Status` see
consistent coverage across error sources. The status code +
RFC 9209 `error` token derive from the Pingora `ErrorType` via
a new `map_upstream_failure` translator: 504 +
`connection_timeout` for `ConnectTimedout` /
`ReadTimedout`; 502 + `connection_refused` for `ConnectRefused`;
502 + `tls_protocol_error` for TLS errors; 502 +
`connection_terminated` for mid-stream loss; 502 +
`http_request_error` as the catch-all. When
`problem_details.enabled: true` the body is now rendered as
`application/problem+json` for upstream failures too, with the
RFC 9209 error token in the `detail` field so both signals share
the same vocabulary.
- **Idempotency cache check moved to `request_filter`.** Before this
change, the cache lookup ran in `request_body_filter`, after
Pingora had already opened the upstream TCP connection. On a cache
hit the upstream observed one aborted partial request before the
proxy served the cached response to the client. The check now runs
before Pingora's upstream-peer phase: cache hits and body
conflicts write the response from inside `request_filter` and
return `Ok(true)`, so the upstream is never contacted at all. On
cache miss the proxy buffers the body (bounded by
`max_request_body_bytes` from PR #139), then re-injects it via
`request_body_filter` at end-of-stream so Pingora's normal upstream
forwarding picks it up. Existing e2e tests now assert the
upstream-not-contacted invariant; the previous "may observe one
aborted partial request" caveat has been removed from
`docs/configuration.md` and the example README.
- **Idempotency middleware: per-request and pool caps.** Three new
fields on the `idempotency:` block bound memory usage and let the
middleware gracefully degrade under pressure rather than buffering
unbounded bodies. `max_request_body_bytes` (default 1 MiB) caps
the per-request buffer; bodies above the cap skip caching with
`x-sbproxy-idempotency: SKIPPED-OVERSIZE-REQUEST` stamped on the
response. `max_response_body_bytes` (default 1 MiB) caps the
per-response cache buffer; responses above the cap stream through
uncached. `max_concurrent_buffers` (default 256) is a per-origin
pool over concurrent buffered requests; pool exhaustion skips the
cache with `x-sbproxy-idempotency: SKIPPED-POOL-FULL`. Worst-case
memory is bounded at `max_concurrent_buffers * max_request_body_bytes`
per origin.
- **`Idempotency-Key` middleware (`idempotency:`).** Per-origin
block that engages on POST / PUT / PATCH (configurable via
`methods:`) when an `Idempotency-Key` header is present. The
middleware sits ahead of policies in the handler chain, hashes the
request body, and resolves to one of three branches:
cache hits replay the cached `(status, headers, body)` verbatim
with `x-sbproxy-idempotency: HIT`; conflicts (same key, different
body) return 409 with the `ledger.idempotency_conflict` JSON body;
misses forward to the upstream and capture the response for the
next retry. Workspace-isolated keys prevent cross-tenant
collisions. Memory backend (default) is per-origin and per-replica;
`backend: redis` binds to `proxy.l2_store` at config-compile time
for cluster-wide replay. Cached replays do not consume rate-limit
slots. Documented in `docs/configuration.md` and demonstrated by
`examples/idempotency/`. Known v1 limitation: the cache check
fires in `request_body_filter`, after Pingora has already opened
the upstream connection. On a cache hit the upstream observes one
aborted partial request before the proxy serves the cached
response to the client; future work moves the check earlier so the
upstream never sees the replay.
- **RFC 9457 problem-details default renderer (`problem_details:`).**
New per-origin block that opts in to `application/problem+json` for
proxy-generated errors (authentication denials, policy denials,
default 404) that are not matched by an authored `error_pages`
entry. The two blocks compose: per-status custom pages still win
when authored; `problem_details` catches everything else with a
structured `type` / `title` / `status` / `detail` / `instance`
body. `type_base_uri` produces stable per-status `type` URIs;
`include_detail: false` suppresses the internal error string.
Documented in `docs/configuration.md` and demonstrated by
`examples/problem-details/`.
- **Typed `error_pages` config.** The opaque
`error_pages` field is now typed as an optional list of
`ErrorPageEntry` values. Public types `ErrorPageEntry`,
`StatusSpec`, and `ProblemDetailsConfig` live in `sbproxy-config`.
The authored YAML shape is unchanged: every existing
`error_pages:` list keeps parsing, including the `status:` single-
int / `[status]` list shorthand and `template: true` substitution.
The OpenAPI emitter now walks typed entries to populate
per-status `responses` keys (the previous code inspected the
field as an object and silently produced no entries; this is a
bug fix on top of the migration).
- **AI gateway Realtime WebSocket dispatch (Phase 7, Option C).**
`GET /v1/realtime` requests with `Upgrade: websocket` against an
`ai_proxy` origin are now dispatched through the AI gateway
pipeline:
- Pre-upgrade gating runs the same surface classification, 501
capability check (only providers in
`provider_supports_realtime` are eligible; today: OpenAI),
per-surface rate limit, and provider selection as the rest of
the AI surface set.
- After the gating passes, Pingora forwards bytes between
client and provider transparently through the upgraded
connection. The dispatcher does not terminate the WebSocket;
per-frame guardrails and frame-exact audio metering are
reserved for a future enterprise terminate-and-relay path so
every AI gateway feature added to `handle_action` continues
to apply to realtime through one shared code path.
- `sbproxy_ai_realtime_sessions_active` (gauge),
`sbproxy_ai_realtime_session_duration_seconds` (histogram),
`sbproxy_ai_realtime_audio_seconds_total` (counter), and
`sbproxy_ai_realtime_frames_forwarded_total` (counter) are
registered. The OSS dispatch ticks the gauge on session open
and observes the duration histogram on close. Documented in
`docs/metrics-stability.md`.
- At session close, `logging` emits a session-end
`AiBillingEvent` with `AudioSeconds { seconds }` valued at
the wall-clock session duration so realtime usage appears on
the standard billing-event bus alongside chat/image/audio.
- `RealtimeSessionTracker` (lock-free atomic counters) and
`audio_seconds_from_frame(bytes, sample_rate, channels)` ship
in `sbproxy-ai::realtime` for the eventual terminate-and-relay
path to consume.
- `docs/ai-gateway.md` documents the new dispatch path with a
YAML example and the per-surface rate-limit knob.
- **AI gateway OpenAI surface dispatch (Option A).** The `ai_proxy`
action now routes every OpenAI-compatible surface through a
single classifier with per-surface observability and gating:
- New `AiSurface` enum + `classify_surface(method, path)` cover
chat completions, models, embeddings, assistants and threads
(full v2 surface), batches, fine-tuning, files, realtime,
image generation/edits/variations, audio transcription/speech,
moderations, and reranking. Marked `#[non_exhaustive]` so
future variants don't break downstream pattern matches.
- Method coverage extended past GET/POST: DELETE, PUT, PATCH,
HEAD, and OPTIONS dispatch through `AiClient::forward_with_method`
without engaging the JSON body-parse pipeline.
- Multipart bodies (image edits/variations, audio transcription,
file uploads) byte-forward via `AiClient::forward_bytes` with
the inbound `Content-Type` preserved. Previously these surfaces
returned a 400 "invalid JSON body" from the chat-path body parse.
- Provider capability matrix in `api_routes.rs` corrected:
Anthropic no longer claims audio/reranking/moderations support,
Gemini no longer claims moderations. A new
`provider_supports_surface` matrix gates non-universal surfaces
with **501 Not Implemented** when no configured provider
supports the surface.
- Per-surface observability: new
`sbproxy_ai_surface_requests_total{surface, method}` counter and
`sbproxy_ai_surface_request_duration_seconds{surface, method}`
histogram. Sibling of the existing per-provider metrics so
dashboards can pivot between surface and provider views.
Documented in `docs/metrics-stability.md`.
- Per-surface input guardrails: image generation, audio speech,
reranking, and moderations bodies now have their input field
(`prompt`, `input`, `query`, `input`) extracted and run through
the same guardrail pipeline as chat-style `messages`.
- Per-surface rate limits: new `per_surface_rate_limits` field
on the AI handler config, keyed by surface label. 429 fires
before any upstream call when the cap is hit.
- Surface-aware billing event: new `AiBillingEvent` carrying
`AiUsage` with `Tokens`, `Images { count, resolution }`,
`AudioSeconds`, `Characters`, `RerankUnits`, and `PerCall`
variants. Every dispatched request emits exactly one event.
Image generation, audio speech, and reranking emit real cost
via per-surface pricing tables (`lookup_image_price`,
`lookup_audio_speech_price`, `lookup_rerank_price`,
`lookup_audio_transcription_price`). `docs/ai-gateway.md`
documents the new surface, methods, guardrails, and rate-limit
knobs.
- **Policy verdict audit bus + Plugin dispatch.**
Wires the previously-dead `Policy::Plugin` arm in `server.rs` to
call the trait's `enforce()`, folds the returned `PolicyDecision`
into the existing chain reducer, and emits a
`PolicyVerdictEvent` for every decision on a bounded
`tokio::sync::mpsc` audit bus per
`docs/adr-policy-audit-binding.md`. The OSS substrate ships an
in-memory drain stub; enterprise replaces the consumer with a
NATS-backed audit-chain subscriber. Multi-policy resolution
rules from `docs/adr-policy-verdict-shape.md` are implemented at
the chain level: any Deny wins, the first Confirm wins over
AllowWithHeaders, AllowWithHeaders accumulate, otherwise Allow.
`Confirm` in OSS routes through the existing AllowWithHeaders
mechanism with an `X-Policy-Confirm` header stamped on the
response; an `expires_at` already in the past synthesises a 410
and an SSRF-blocked `webhook_url` synthesises a 502 at decision
time. New metrics:
`sbproxy_policy_audit_events_total{verdict, surface, policy_id}`,
`sbproxy_policy_audit_events_dropped_total{tenant}`,
`sbproxy_policy_decision_duration_seconds{surface}`. New Grafana
dashboard `sbproxy-policy-verdicts` covers the surface.
([crates/sbproxy-observe/src/events.rs],
[crates/sbproxy-observe/src/metrics.rs],
[crates/sbproxy-core/src/policy_bus.rs],
[crates/sbproxy-core/src/policy_dispatch.rs],
[crates/sbproxy-core/src/server.rs],
[crates/sbproxy-plugin/src/traits.rs],
[dashboards/grafana/sbproxy-policy-verdicts.json])
- **Synthetic-transaction `/readyz` probe.** Optional
background driver that fires an in-process request through the
compiled handler chain on a fixed cadence and reports the verdict as
a `synthetic_pipeline` component on `/readyz`. Disabled by default;
opt in via `proxy.synthetic_probe.enabled: true` and define an origin
for the configured sentinel hostname (default `__synthetic.local`)
pointing at a non-network action (`static`, `mock`, `echo`, `noop`).
Failures bump the new
`sbproxy_synthetic_probe_failures_total{reason}` counter so they do
not pollute real-traffic error metrics.
([crates/sbproxy-config/src/types.rs],
[crates/sbproxy-core/src/synthetic.rs],
[crates/sbproxy-observe/src/synthetic.rs],
[crates/sbproxy-observe/src/metrics.rs],
[e2e/tests/synthetic_probe.rs])
- **`GET /admin/drift` config drift endpoint.** Returns
whether the on-disk config file has diverged from what the running
proxy has loaded, without triggering a reload. Compares a
content-hash baseline captured at startup (and refreshed on every
`/admin/reload`) against a fresh hash of the current file. K8s
operators and dashboards scrape this so they can flag an edited
config that has not been hot-reloaded yet. Documented in
`docs/configuration.md` § Admin fields.
([crates/sbproxy-core/src/admin.rs],
[crates/sbproxy-core/src/server.rs],
[docs/configuration.md])
- **Deterministic clock-skew testing hooks.** `ClockSkewMonitor` now
accepts an injected clock source for tests while production continues
to use the system clock.
([crates/sbproxy-observe/src/clock_skew.rs])
- **Operator runbook hooks and fast-track ADR template.** Added a
dashboard-oriented operator runbook, linked all Grafana panels to the
relevant triage sections, and added a fast-track ADR amendment
template plus OSS threat-model refresh checklist.
([docs/operator-runbook.md], [docs/adr-fast-track-amendment.md],
[docs/threat-model.md], [dashboards/grafana/])
- **Live reverse-DNS resolver for agent verification.** `SystemResolver`
now uses `hickory-resolver` for PTR and forward-confirmation lookups,
replacing the previous typed PTR stub.
([crates/sbproxy-security/src/agent_verify.rs])
- **Multi-window SLO burn-rate replay harness.** `sbproxy-observe`
now includes a burn-rate evaluator and `AlertSnapshot` replay helper
for substrate availability and latency alert taxonomy tests.
([crates/sbproxy-observe/src/alerting/burn_rate.rs],
[e2e/tests/slo_burn_rate.rs])
- **Vault-style quote-token seed references.** `ai_crawl_control.quote_token.secret_ref`
now accepts `secret:` references resolved through `sbproxy-vault`
with the existing environment fallback, in addition to the older
`secret_ref.env` and inline `seed_hex` paths.
([crates/sbproxy-modules/src/policy/ai_crawl.rs])
- **Operator first-24-hours quickstart.** Added a concise
`docs/quickstart-operator.md` covering deploy, `/readyz`, metrics,
Grafana, logs, and rollback, linked from the README and Kubernetes
docs.
([docs/quickstart-operator.md])
- **Hostname cardinality override for metrics.** `proxy.metrics.cardinality.hostname_cap`
can lower the `hostname` label budget independently from the default
per-label cap, enabling deterministic overflow tests and tighter
multi-tenant Prometheus budgets.
([crates/sbproxy-config/src/types.rs],
[crates/sbproxy-observe/src/cardinality.rs])
- **`release-fast` build profile for CI images.** Docker-based CI and
local kind smoke-test builds can now use `CARGO_PROFILE=release-fast`
to skip fat LTO and use more codegen units, cutting link memory/time
while leaving production release artifacts on the existing `release`
profile.
([Cargo.toml], [Dockerfile.ci], [Dockerfile.cloudbuild])
- **Reproducible build probe workflow.** CI now has an informational
double-build lane that builds the release binary twice on independent
GitHub-hosted runners, uploads each binary and SHA-256, and publishes
a comparison report without yet treating non-identical output as a
failure.
([.github/workflows/reproducible-build.yml], [SUPPLY-CHAIN.md])
- **Phase 2: CEL `features[...]` namespace.** Per-request
flags parsed from the `x-sb-flags` header and `?_sb.` query
prefix are now exposed to CEL expressions. Built-in flags surface
as bools (`features.debug`, `features.trace`,
`features["no-cache"]`, `features.any_set`); free-form `k=v` extras
surface as strings (`features["env"]`). Wired into the rate-limit
CEL evaluator and `ExpressionPolicy::evaluate_with_views`.
([crates/sbproxy-extension/src/cel/context.rs])
- **`SB_WORKER_THREADS` env var.** Positive integer overrides the
auto-detected Pingora worker thread count
(`std::thread::available_parallelism()`). Useful for benchmarking
with a fixed worker count or capping the pool below a cgroup quota.
([crates/sbproxy-core/src/server.rs])
- **`/live`, `/livez`, `/ready`, `/healthz`, and rich `/health`
admin endpoints.**
`/livez` returns `{"alive":true}` on every call and never 503s, so
K8s liveness probes don't trip on transient readiness failures.
`/live` is a bare alias. `/ready` is an alias for `/readyz`.
`/healthz` stays a fixed liveness body, while `/health` now returns
version, build hash, timestamp, uptime, and readiness checks for
dashboards / SIEM ingestion. Existing `/readyz` behavior unchanged.
([crates/sbproxy-observe/src/health.rs],
[crates/sbproxy-core/src/admin.rs])
- **`--request-log-level` and `SB_REQUEST_LOG_LEVEL`.** Operators can
now tune request/access logging independently from application logs.
The setting appends an `access_log=` target directive to the
effective `tracing-subscriber` filter while preserving the existing
per-target `RUST_LOG` escape hatch.
([crates/sbproxy/src/main.rs])
- **Access-log forced emission and file output.** `access_log` now
supports `slow_request_threshold_ms` and `always_log_errors` so slow
requests and 5xxs bypass sampling after status/method filters match.
It also supports `output: { type: file, path, max_size_mb,
max_backups, compress }` for direct JSON-line access-log files with
size-based rotation and optional gzip compression of rotated files.
([crates/sbproxy-config/src/types.rs],
[crates/sbproxy-core/src/server.rs],
[crates/sbproxy-observe/src/access_log.rs])
- **OCSP stapling for the manual fallback cert.** `OcspStapler`
(which previously existed but was unwired) now does an immediate
fetch on startup, refreshes every 12 hours, and pushes the bytes
into `CertResolver::update_fallback_ocsp` so subsequent rustls
handshakes staple the response on the wire. No-op when no manual
cert is configured or when the cert lacks an AIA extension.
([crates/sbproxy-tls/src/ocsp.rs],
[crates/sbproxy-tls/src/cert_resolver.rs])
- **Readiness synthetic probe primitive.** `sbproxy-observe` now ships a
`SyntheticProbe` type so startup or test wiring can register an
in-process readiness probe that exercises a caller-provided path and
reports through the same `/readyz` component model as built-in probes.
([crates/sbproxy-observe/src/health.rs])
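The multi-policy resolution order described in the policy-verdict entry above (any Deny wins, the first Confirm beats AllowWithHeaders, AllowWithHeaders accumulate, otherwise Allow) can be sketched in a few lines of Rust. This is an illustration only: the `Verdict` enum and `resolve` function are hypothetical stand-ins, not the types defined in `docs/adr-policy-verdict-shape.md`, and returning the first Deny encountered is an assumption.

```rust
/// Hypothetical verdict shape, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Verdict {
    Allow,
    AllowWithHeaders(Vec<(String, String)>),
    Confirm(String),
    Deny(String),
}

/// Fold the verdicts of a policy chain into one outcome.
fn resolve(verdicts: &[Verdict]) -> Verdict {
    let mut first_confirm: Option<&Verdict> = None;
    let mut headers: Vec<(String, String)> = Vec::new();
    let mut saw_headers = false;

    for v in verdicts {
        match v {
            // Any Deny wins immediately (assumption: the first one is reported).
            Verdict::Deny(_) => return v.clone(),
            // Remember only the first Confirm in chain order.
            Verdict::Confirm(_) => {
                if first_confirm.is_none() {
                    first_confirm = Some(v);
                }
            }
            // Header sets from every AllowWithHeaders accumulate in order.
            Verdict::AllowWithHeaders(h) => {
                saw_headers = true;
                headers.extend(h.iter().cloned());
            }
            Verdict::Allow => {}
        }
    }

    if let Some(confirm) = first_confirm {
        return confirm.clone();
    }
    if saw_headers {
        Verdict::AllowWithHeaders(headers)
    } else {
        Verdict::Allow
    }
}
```

An empty chain resolves to `Allow` in this sketch; whether the real dispatcher treats an empty chain the same way is not stated here.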
### Removed
- **`sbproxy_ai::IdempotencyCache`.** The OSS AI gateway never wired
this cache; it was publicly re-exported but had zero callers in the
workspace. The new `idempotency:` block on general HTTP origins
(above) supersedes it. AI gateway integration is a follow-up tracked
in `docs/missing.md`. Plugin authors that imported the removed
type can switch to
`sbproxy_middleware::idempotency::{IdempotencyCache,
InMemoryIdempotencyCache, KvIdempotencyCache}` which carries the
richer surface (workspace isolation, body-hash conflict detection,
conflict body builder).
### Changed
- **mTLS now wired on the ACME path.** Previously, an operator who
configured `mtls:` alongside `acme:` got plain TLS until they
noticed clients reaching the upstream without the expected cert
headers. The ACME branch now mirrors the manual-cert branch:
builds `TlsSettings` with the configured `ClientCertVerifier` and
falls back to plain TLS only when mTLS setup itself fails.
([crates/sbproxy-core/src/server.rs])
- **Examples and Kubernetes smoke checks are local-only.** The
Docker-backed examples smoke lane and kind-based Kubernetes operator
smoke lane no longer run automatically on pull requests. They remain
available as `make examples-smoke` and `make k8s-operator-smoke` for
explicit local / release validation.
([Makefile], [docs/kubernetes.md])
- **Reload drain state is now one coherent atomic snapshot.** The
drain flag and active request count are packed into one `AtomicU64`,
so `is_draining()` no longer combines two independent relaxed loads.
Added loom coverage for the last-request-finish interleaving.
([crates/sbproxy-core/src/reload.rs])
- **Optional readiness dependencies no longer fail `/readyz` by
default.** The default admin health registry now registers absent
ledger and bot-auth-directory probes as `not_configured`, matching the
existing future-wave stubs and keeping `/readyz` green when those
optional services are not wired in a deployment.
([crates/sbproxy-observe/src/health.rs],
[crates/sbproxy-core/src/admin.rs])
- **`docs/manual.md` rewrites** matching what actually ships:
- §6 Health checks: `/livez`, `/readyz`, `/healthz`, and rich
`/health` semantics, replacing the old per-endpoint URL fork
diagram and stale `/health` alias wording.
- §10 Feature flags: CEL accessor table, kill-switch note, and
a "planned, not yet wired" note for Lua / JS / WASM features
namespaces and workspace-level pub/sub flags.
- §3 CPU detection: documents the new `SB_WORKER_THREADS` knob.
- §13 env-var table: adds `SB_WORKER_THREADS` and
`SB_DISABLE_SB_FLAGS`; later updates add
`SB_REQUEST_LOG_LEVEL` and access-log file/forced-emit examples.
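To illustrate the reload-drain entry above, here is a minimal sketch of packing a drain flag and an in-flight request count into one `AtomicU64` so both are observed in a single load. The bit layout (top bit for the flag, low 63 bits for the count), the `DrainState` type, and its method names are assumptions for illustration; the actual layout in `crates/sbproxy-core/src/reload.rs` may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed layout: bit 63 is the drain flag, bits 0..=62 hold the count.
const DRAIN_BIT: u64 = 1 << 63;
const COUNT_MASK: u64 = !DRAIN_BIT;

/// Hypothetical drain tracker backed by one atomic word.
struct DrainState(AtomicU64);

impl DrainState {
    fn new() -> Self {
        DrainState(AtomicU64::new(0))
    }

    /// Record one more in-flight request.
    fn begin_request(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }

    /// Record a finished request. Returns true when this call retired the
    /// last in-flight request while a drain was in progress. Assumes every
    /// call is paired with an earlier `begin_request`.
    fn end_request(&self) -> bool {
        let prev = self.0.fetch_sub(1, Ordering::AcqRel);
        prev == (DRAIN_BIT | 1)
    }

    /// Set the drain flag without disturbing the count.
    fn start_drain(&self) {
        self.0.fetch_or(DRAIN_BIT, Ordering::AcqRel);
    }

    /// One load yields a consistent (draining, active) pair.
    fn snapshot(&self) -> (bool, u64) {
        let v = self.0.load(Ordering::Acquire);
        (v & DRAIN_BIT != 0, v & COUNT_MASK)
    }

    /// True once draining has started and no requests remain.
    fn drain_complete(&self) -> bool {
        self.snapshot() == (true, 0)
    }
}
```

Because the flag and the count share one word, a reader can never see "draining" from one instant combined with a count from another, which is the interleaving the loom coverage targets.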
### Fixed
- **CAP `sub` binding only fires for a genuinely resolved agent.** The
CAP verifier binds a token's `sub` to the request's resolved agent id
(rejecting a mismatch with `403`). Because the agent-class resolver is
installed with the built-in catalog by default and always stamps
*some* id (falling through to the `human` sentinel when no signal
matches), the binding would have rejected every CAP token whose `sub`
was not literally `"human"`, even on origins that never configured
agent classes. The binding now skips the resolver's fallback / `human`
verdict and engages only when the resolver actually identified an
agent, so an unauthenticated caller falls through to the normal CAP
validation path. Set `cap.require_agent_binding: true` to fail closed
when no agent is resolved.
- **Virtual-key model allow/block lists are now enforced.** A virtual
key (or `ai_provider` credential) with `models.allow` / `models.block`
declared its scope but the AI dispatch path never checked it, so a key
confined to a subset of the gateway's models could still call any
model the gateway served. The matched key's allow/block lists are now
enforced against the effective model (after any `route_to_model`
rewrite): a request for a disallowed model is rejected with `403`
before any upstream call, the block-list taking precedence over the
allow-list. Keys with no `models.allow` are unaffected. See
`examples/ai-virtual-keys/`.
- **Licensing-projection wire formats now match the canonical specs [BREAKING].** Two projection emitters were producing
document shapes that didn't match their cited specifications.
`/licenses.xml` previously declared the namespace
`https://rsl.ai/spec/1.0` and emitted a flat
`...` document. The canonical
RSL Collective spec uses the
namespace `https://rslstandard.org/rsl` and a nested
`...`
shape; the element's `url` attribute is the canonical wildcard
`https:///*` for the origin-wide license. `/.well-known/tdmrep.json`
previously wrapped its policies in a `{"version", "generated", "policies": [...]}`
envelope; the W3C TDMRep CG-FINAL spec mandates a bare JSON array
at the document root with `location`, `tdm-reservation`
(integer 0 or 1), and `tdm-policy` (URL of the policy document)
fields per entry. Both emitters now produce the canonical shapes.
Operators consuming `/licenses.xml` or `/.well-known/tdmrep.json`
programmatically must update their parsers to the new shapes; the
in-process JSON envelope and the response middleware that stamps
`TDM-Reservation: 1` and the URN-bearing `license` field are
unaffected. Conformance is asserted by the active structure-shape
tests; the earlier schema-validation tests were removed because
neither standard publishes a machine-readable schema to validate
against (RSL 1.0 is prose-only; W3C TDMRep ships no JSON Schema).
([crates/sbproxy-modules/src/projections/licenses.rs],
[crates/sbproxy-modules/src/projections/tdmrep.rs],
[e2e/tests/rsl_licenses_projection_e2e.rs],
[e2e/tests/tdmrep_projection_e2e.rs])
- **Build under prometheus 0.14 type inference.** Sites in
`sbproxy-observe::metrics` and `sbproxy-core::server` that passed
heterogeneous `&[&String, &str]` arrays to
`prometheus::with_label_values` no longer compile on prometheus
0.14 because Rust unifies the array element type to `&String` and
rejects bare `&str` literals. Coerced all such call sites to
uniform `&[&str]` via `.as_str()` so the workspace builds clean
again. No behavioural change.
([crates/sbproxy-observe/src/metrics.rs],
[crates/sbproxy-core/src/server.rs])
- **WASM extension docs corrected.** `CLAUDE.md` previously labeled the
WASM surface as "WASM stub" while marketing docs claimed
production-grade support; the runtime is real
(`wasmtime` + WASI preview-1 with sandboxed memory and CPU caps,
stderr capture, no FS or network). `llms.txt` also incorrectly
claimed "WASI networking with host allowlist" but `allowed_hosts` is
parsed-but-inert until WASI sockets land. CLAUDE.md and llms.txt now
match the shipped surface.
([CLAUDE.md], [llms.txt],
[crates/sbproxy-extension/src/wasm/mod.rs])
- **E2E proxy startup flake under CPU contention.** The e2e
`ProxyHarness` keeps its HTTP-level readiness probe, but now gives
release/debug proxy boots a 10-second window instead of 5 seconds so
tests like `action_graphql` do not fail spuriously while cargo is
competing for CPU.
([e2e/src/lib.rs])
- **Docs CI Rust snippet failures.** Workspace-dependent documentation
examples that cannot compile as standalone `rust-script` programs are
now tagged `rust,no_run`, keeping docs-ci focused on executable
snippets instead of illustrative API fragments.
([docs/architecture.md], [docs/audit-log.md], [docs/cache-reserve.md])
- **Unsafe-code drift guardrails.** Crates that do not need unsafe now
forbid it at the crate root, while `sbproxy-vault` explicitly allows
its narrowly-scoped volatile zeroization unsafe with an inline
justification.
([crates/sbproxy-*/src/lib.rs])
- **Outbound webhook delivery identity headers.** Signed customer
webhooks now include `Sbproxy-Subscription-Id`,
`Sbproxy-Delivery-Id`, and 1-based `Sbproxy-Attempt` headers, with a
fresh delivery ULID on every retry attempt.
([crates/sbproxy-observe/src/notify.rs])
- **AI client retry resilience.** `MemoryBatchStore` now uses
`parking_lot::Mutex` so a panic in one worker cannot poison the
in-memory batch map for every later operation. Provider retries now
honor `provider.max_retries` as same-provider retry attempts with
bounded jittered exponential backoff before recording provider
failure and moving to the next eligible provider.
([crates/sbproxy-ai/src/batch.rs],
[crates/sbproxy-ai/src/client.rs])
- **Dynamic Web Bot Auth directory dispatch.** The main request auth
path now invokes `BotAuthProvider::verify_async` when a configured
hosted directory and `Signature-Agent` header are present, so dynamic
directory failures surface distinctly instead of falling through the
static inline-agent verifier.
([crates/sbproxy-core/src/server.rs])
- **ACME/Pebble order polling.** Certificate issuance now polls the
authorization to `valid` after responding to the HTTP-01 challenge
before polling the order to `ready`, matching Pebble's stricter state
progression. Finalization also parses the order returned by the
finalize response and falls back to polling the original order URL,
avoiding accidental POST-as-GET polling of the finalize URL when
`Location` is absent.
([crates/sbproxy-tls/src/acme.rs])
- **JWKS unknown-`kid` key rotation.** JWTs that reference an unseen
`kid` now trigger one rate-limited JWKS refetch before failing
closed, with a Prometheus counter for success / failure /
rate-limited outcomes. This avoids requiring operator intervention
for routine IdP key rotation.
([crates/sbproxy-modules/src/auth/jwks.rs],
[crates/sbproxy-modules/src/auth/mod.rs],
[crates/sbproxy-observe/src/metrics.rs])
- **Rate-limit LRU pollution bypass.** Per-key local token buckets now
preserve deny state in a bounded cold tier after hot LRU eviction, so
a spray of attacker keys cannot reset an already-throttled
legitimate client.
([crates/sbproxy-modules/src/policy/mod.rs])
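The rate-limit fix in the last entry above can be pictured with a two-tier sketch: a small hot map of live buckets, plus a bounded cold map that remembers only keys evicted while they were throttled. Everything here is hypothetical and simplified. `TieredLimiter` uses a fixed-window counter instead of a token bucket, and insertion order stands in for true LRU order.

```rust
use std::collections::{HashMap, VecDeque};

struct Bucket {
    remaining: u32,
    reset_at_ms: u64,
}

/// Hypothetical limiter: hot tier of live buckets, cold tier of deny state.
struct TieredLimiter {
    burst: u32,
    window_ms: u64,
    hot_cap: usize,
    cold_cap: usize,
    hot: HashMap<String, Bucket>,
    hot_order: VecDeque<String>, // insertion order, a stand-in for LRU order
    cold: HashMap<String, u64>,  // key -> deny-until timestamp (ms)
    cold_order: VecDeque<String>,
}

impl TieredLimiter {
    fn new(burst: u32, window_ms: u64, hot_cap: usize, cold_cap: usize) -> Self {
        TieredLimiter {
            burst, window_ms, hot_cap, cold_cap,
            hot: HashMap::new(), hot_order: VecDeque::new(),
            cold: HashMap::new(), cold_order: VecDeque::new(),
        }
    }

    fn allow(&mut self, key: &str, now_ms: u64) -> bool {
        // A key evicted while throttled stays denied until its window ends.
        if let Some(&deny_until) = self.cold.get(key) {
            if now_ms < deny_until {
                return false;
            }
            self.cold.remove(key);
            self.cold_order.retain(|k| k.as_str() != key);
        }
        // Admit the key to the hot tier, evicting the oldest entry if full.
        if !self.hot.contains_key(key) {
            if self.hot.len() >= self.hot_cap {
                self.evict_oldest_hot(now_ms);
            }
            let fresh = Bucket { remaining: self.burst, reset_at_ms: now_ms + self.window_ms };
            self.hot.insert(key.to_string(), fresh);
            self.hot_order.push_back(key.to_string());
        }
        let bucket = self.hot.get_mut(key).expect("inserted above");
        if now_ms >= bucket.reset_at_ms {
            bucket.remaining = self.burst;
            bucket.reset_at_ms = now_ms + self.window_ms;
        }
        if bucket.remaining == 0 {
            return false;
        }
        bucket.remaining -= 1;
        true
    }

    fn evict_oldest_hot(&mut self, now_ms: u64) {
        let Some(victim) = self.hot_order.pop_front() else { return };
        let Some(bucket) = self.hot.remove(&victim) else { return };
        // Only exhausted, still-active buckets need their deny state kept.
        if bucket.remaining == 0 && now_ms < bucket.reset_at_ms {
            while self.cold.len() >= self.cold_cap {
                match self.cold_order.pop_front() {
                    Some(old) => { self.cold.remove(&old); }
                    None => break,
                }
            }
            self.cold.insert(victim.clone(), bucket.reset_at_ms);
            self.cold_order.push_back(victim);
        }
    }
}
```

With a hot capacity of two and a burst of one, a throttled `client` key that is pushed out by two attacker keys is still denied on its next request, and is admitted again only once its original window has elapsed.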
### Open follow-ups
Tracked in Linear, not in this changeset:
- Full configurable synthetic transaction through the live request
pipeline (tracked in the upstream issue). The `SyntheticProbe`
readiness primitive has landed; config and pipeline execution
remain.
- Phase 2.5: Lua / JS / WASM `features` namespace, plus
workspace-level flags via messenger pub/sub
- Remaining rate-limiter proptest coverage (tracked in the upstream
issue). The reload-drain loom portion has landed.
## [1.0.1] - 2026-05-04
Patch release. No runtime behavior changes.
### Fixed
- **Container image publish**: the `release.yml` workflow's docker
prepare step extracted the flat-layout tarballs into `/tmp/`
directly, which tripped a sticky-bit `Cannot utime` error on the
archive's `./` entry and caused `ghcr.io/soapbucket/sbproxy:1.0.0`
to never publish. Each platform tarball now extracts to a per-arch
staging dir before the binary moves into the docker context.
## [1.0.0] - 2026-05-03
First Rust release of SBproxy on this repository.
### What changed
- **Implementation**: SBproxy is now written in Rust on Cloudflare's
Pingora. The Go implementation that previously occupied this repo
(`v0.1.0` through `v0.1.2`) has moved to
[`soapbucket/sbproxy-go`](https://github.com/soapbucket/sbproxy-go),
preserved as the `v0.1.2-go-final` branch and tag, and is now in
maintenance-only mode.
- **Data plane**: routing, AI gateway, MCP gateway, guardrails, security
policies, and scripting (CEL, Lua, JavaScript, WebAssembly) all ship
open source in this release. See [`docs/architecture.md`](docs/architecture.md)
for the request pipeline shape.
- **Enterprise tier**: see [`docs/enterprise.md`](docs/enterprise.md) for
what enterprise adds on top of the OSS data plane and how to request
access.
### Upgrading from v0.1.x (Go)
The internal config schema (`schema-v1`) is supported by both the Go
`v0.1.x` line and this Rust `v1.x` line, so existing `sb.yml` files
should compile unchanged. See [`MIGRATION.md`](MIGRATION.md) for the
full upgrade path.
================================================================
# docs/README.md
================================================================
## SBproxy documentation
*Last modified: 2026-06-08*
The AI gateway built like a real proxy. One binary, built on Pingora.
## Where to start
New here? Read [manual.md](manual.md) for install and CLI, then [configuration.md](configuration.md) for the schema. The [examples](../examples/) folder has runnable configs you can point the binary at right away.
## Documentation index
### Getting started
- [manual.md](manual.md) - install, CLI, runtime, TLS, deployment patterns.
- [getting-started-api-estate.md](getting-started-api-estate.md) - put SBproxy in front of existing APIs with auth, rate limits, and header rewrites.
- [getting-started-content-estate.md](getting-started-content-estate.md) - HTML-to-markdown and content transformation for agents.
- [getting-started-ai-estate.md](getting-started-ai-estate.md) - run SBproxy as the LLM gateway in front of model providers.
- [getting-started-agent-identity.md](getting-started-agent-identity.md) - issue and enforce agent identity at the edge.
- [getting-started-sovereign-multicloud.md](getting-started-sovereign-multicloud.md) - Kubernetes, sidecar, and secret-backend deployment.
- [configuration.md](configuration.md) - every `sb.yml` field with examples.
- [json-schema.md](json-schema.md) - JSON Schema for editor autocomplete + validation of `sb.yml`.
- [mcp-schema-drift.md](mcp-schema-drift.md) - CI-friendly schema-drift detection for converted MCP servers (the `sbproxy-mcp-drift` CLI).
- [features.md](features.md) - tour of every feature with copy-paste configs.
- [troubleshooting.md](troubleshooting.md) - common failure modes and fixes.
- [faq.md](faq.md) - quick answers to the questions operators hit most often.
### AI gateway
- [ai-gateway.md](ai-gateway.md) - providers, routing strategies, guardrails, budgets, streaming.
- [ai-lb-benchmark.md](ai-lb-benchmark.md) - P50/P95/P99/P99.9 latency comparison across AI router strategies under skewed load.
- [providers.md](providers.md) - the catalog of supported LLM providers.
- [scripting.md](scripting.md) - CEL, Lua, JavaScript, and WASM scripting reference.
- [wasm-development.md](wasm-development.md) - writing WebAssembly modules for the `wasm` transform against the WASI preview-1 contract.
- [mcp.md](mcp.md) - the MCP gateway: wire shape, capabilities, and `experimental.agentSkillsUrl` advertising.
- [a2a-gateway.md](a2a-gateway.md) - the `a2a` action: typed AgentCard, capability discovery, and modality negotiation helpers.
- [agent-skills.md](agent-skills.md) - Agent Skills v0.2.0 well-known projection: schema, integrity, archive safety, no-script-execution contract.
- [cloudflare-code-mode.md](cloudflare-code-mode.md) - typed TypeScript module emission for Cloudflare Code Mode agents over the MCP federation registry.
- [ai-crawl-control.md](ai-crawl-control.md) - the `ai_crawl_control` policy: Pay Per Crawl token challenge, ledger trait, OSS-advertises / enterprise-settles split.
- [content-for-agents.md](content-for-agents.md) - operator guide to agent-aware content delivery: shape negotiation, body transforms, well-known license posture.
- [rsl.md](rsl.md) - RSL 1.0 licensing cookbook: expressing license stance via YAML and the resulting `/licenses.xml` projection.
- [web-bot-auth.md](web-bot-auth.md) - the `bot_auth` provider: verifying RFC 9421-signed AI crawlers against a published key directory.
- [auth-oidc.md](auth-oidc.md) - the `oidc` auth provider: OpenID Connect Relying-Party login flow (authorization-code + PKCE, sealed session cookie, optional userinfo trust-header projection, RP-initiated logout).
- [prompt-injection-v2.md](prompt-injection-v2.md) - the v2 guardrail: swappable detector returning score + label, with score-to-action mapping.
### Operations
- [access-log.md](access-log.md) - structured JSON access log: filters, sampling, header capture, redaction.
- [audit-log.md](audit-log.md) - tamper-evident audit log of admin actions.
- [observability.md](observability.md) - metrics, logs, traces, and the bundled dashboards.
- [clickhouse-attribution.md](clickhouse-attribution.md) - access-log schema, pre-aggregations, and sample attribution queries.
- [migration-credentials.md](migration-credentials.md) - migrating the legacy `virtual_keys:` shape to the unified `credentials:` block.
- [migration-mcp-rbac.md](migration-mcp-rbac.md) - upgrading MCP `ToolAccessPolicy` to the principal-aware ACL and the default-deny flip.
- [secrets.md](secrets.md) - vault backend setup for HashiCorp Vault, AWS Secrets Manager, and Kubernetes Secrets.
- [multi-tenant.md](multi-tenant.md) - when to use the multi-tenant shape, the three scopes, isolation guarantees, the synthetic `__default__` tenant.
- [operator-runbook.md](operator-runbook.md) - dashboard triage and rollback actions.
- [threat-model.md](threat-model.md) - OSS trust boundaries and per-wave review checklist.
- [events.md](events.md) - the event bus, callback hooks, and emitted event types.
- [openapi-emission.md](openapi-emission.md) - publishing an OpenAPI 3.0 document from the live config.
- [policy.md](policy.md) - the policy engine: `semantic_constraint`, the NL linter L001-L009, and the OSS / enterprise capability boundary.
- [object-authz.md](object-authz.md) - `object_authz` policy: BOLA + BFLA enforcement with tenant-isolation and enumeration detection.
- [headless-detection.md](headless-detection.md) - header-only headless / stealth-browser indicator heuristics surfaced under `request.agent.headless_*`.
- [content-digest.md](content-digest.md) - `content_digest` policy: RFC 9530 request-body verification for integrity-critical inboxes.
- [agent-budget.md](agent-budget.md) - `agent_budget` policy: semantic rate-limit primitive keyed on resolved agent identity.
- [performance.md](performance.md) - tuning guide, benchmark methodology, profiling.
- [degradation.md](degradation.md) - failure modes and graceful degradation behavior.
- [upgrade.md](upgrade.md) - migration notes between releases.
- [quickstart-operator.md](quickstart-operator.md) - first 24 hours running the Kubernetes operator.
- [kubernetes.md](kubernetes.md) - the OSS Kubernetes operator and its CRDs.
- [sidecar-deployment.md](sidecar-deployment.md) - running sbproxy as a per-pod sidecar: traffic capture (iptables / eBPF), service-mesh integration (Istio, Linkerd), and the kustomize overlay under `deploy/k8s/sidecar/`.
### Reference
- [402-challenge.md](402-challenge.md) - wire-format contract for the `402 Payment Required` body, including the OSS-advertises / enterprise-settles split.
- [l402.md](l402.md) - L402 (Lightning HTTP 402) macaroon bearer credential surface: issuer, verifier, attenuation, payment-hash binding.
- [outbound-peer-pricing.md](outbound-peer-pricing.md) - the `peer_pricing_preflight` policy: parse a peer's `llms.txt`, gate egress on budget, return a structured 402 to the agent on overflow.
- [admin-api-reference.md](admin-api-reference.md) - per-route schema for the embedded admin server (`/api/*`, `/admin/*`, and the unauthenticated probe routes).
- [config-stability.md](config-stability.md) - field stability guarantees and versioning.
- [listings.md](listings.md) - the repo-native `Listing` primitive: schema, loader, three pinning modes, plan-validation rules.
- [bulk-redirects.md](bulk-redirects.md) - the `redirect` action's source-to-destination row list, compiled at load time into an O(1) path lookup.
- [cache-reserve.md](cache-reserve.md) - long-tail cold tier under the response cache: backends (memory, filesystem, Redis) and admission sampling.
- [exposed-credentials.md](exposed-credentials.md) - the `exposed_credentials` policy: detect known-leaked basic-auth passwords and tag or block.
- [feature-flags.md](feature-flags.md) - the sticky-bucketing flag store plus the `flag_enabled(name, key)` CEL helper.
- [routing-strategies.md](routing-strategies.md) - the `RoutingStrategy` trait: opt-in extension point for custom upstream selection inside `load_balancer`.
- [openapi-validation.md](openapi-validation.md) - the `openapi_validation` policy: validating request bodies against an OpenAPI 3.0 document at startup.
- [enterprise.md](enterprise.md) - what the enterprise tier adds on top of the OSS data plane and how to request access.
- [glossary.md](glossary.md) - vocabulary used in this documentation set.
- [headers-reference.md](headers-reference.md) - every response header the proxy can emit, with the config that triggers it.
- [metrics-stability.md](metrics-stability.md) - Prometheus metric naming and stability.
- [model-pinning.md](model-pinning.md) - how SHA-256 hashes get computed and pinned for the classifier known-model registry.
- [adr-ai-hub-format.md](adr-ai-hub-format.md) - hub `ChatFormat` trait and the canonical `ChatRequest` / `ChatResponse` shape that backs `/v1/chat/completions`, `/v1/messages`, and `/v1/responses`.
- [adr-outbound-credential-resolver.md](adr-outbound-credential-resolver.md) - the OSS vs enterprise line for the outbound credential resolver (RFC 8693 exchange, client-credentials, and vault resolution in OSS).
- [comparison.md](comparison.md) - how SBproxy compares to other proxies and AI gateways.
### Contributing
- [architecture.md](architecture.md) - internals: pipeline, hot reload, plugin system.
- [build.md](build.md) - building from source, supported platforms, optional features.
- [CONTRIBUTING.md](../CONTRIBUTING.md) - how to set up a dev environment and submit changes.
### AI-discoverable corpora
- [llms.txt](llms.txt) - flat capability catalog (one line per shipped feature), per the [llmstxt.org](https://llmstxt.org/) convention. The small index AI tools fetch first.
- [llms-full.txt](llms-full.txt) - the entire docs corpus (this directory + the top-level `README.md`, `MIGRATION.md`, `CHANGELOG.md`) flattened into one file so AI tools that want the full set get it in one HTTP request. Generated; do not hand-edit. Regenerate with `scripts/regen-llms-full.sh` after any docs change. A live mirror is also published.
## Quick start
```bash
# Build
make build-release
# Run with a config
make run CONFIG=examples/basic-proxy/sb.yml
```
Minimal `sb.yml`:
```yaml
proxy:
http_bind_port: 8080
origins:
"api.example.com":
action:
type: proxy
url: http://backend:3000
```
## What's in the box
- Reverse proxy: HTTP/1.1, HTTP/2, WebSocket, gRPC, connection pooling, hot reload.
- AI gateway: 200+ LLM models, 15 routing strategies, OpenAI-compatible API, guardrails, budgets, virtual keys, MCP server.
- Authentication: API key, basic, bearer, JWT, digest, forward auth, noop.
- Policies: rate limiting, IP filter, CEL expressions, WAF, DDoS, CSRF, security headers.
- Transforms: 18 request and response transforms (JSON, HTML, Markdown, CSS, Lua, JavaScript, encoding, and more).
- Scripting: CEL via cel-rust, Lua via mlua/Luau, JavaScript via QuickJS, WebAssembly via wasmtime.
- Caching: response cache with pluggable backends (memory, file, Redis).
- Load balancing: 7 algorithms with sticky sessions and health checks.
- Observability: Prometheus metrics, structured logging, typed event bus, OpenTelemetry tracing.
- Hot reload: config changes apply with no dropped connections.
================================================================
# docs/402-challenge.md
================================================================
## 402 Challenge contract
*Last modified: 2026-05-25*
The wire format the proxy uses when it returns `402 Payment Required`
to an AI crawler. This document is the canonical reference for the
challenge body shape and for the line that splits OSS-advertises from
enterprise-settles.
The behavioural policy that emits these bodies is `ai_crawl_control`;
see [`ai-crawl-control.md`](ai-crawl-control.md) for configuration,
agent classes, ledger, and tiered pricing.
## Two challenge shapes
The OSS proxy emits one of two 402 shapes, picked per request:
1. **Single-rail (default).** Returned to legacy crawlers and to any
request that has not opted in to multi-rail negotiation. Carries
the `Crawler-Payment` response header and a flat JSON body with the
price and currency. This is the long-standing Pay Per Crawl shape.
2. **Multi-rail (opt-in).** Returned when the agent opts in via either
the `Accept-Payment` request header (a q-value list of rail names)
or one of the multi-rail `Accept` MIME types
(`application/sbproxy-multi-rail+json`, `application/x402+json`,
`application/mpp+json`). Carries `Content-Type:
application/sbproxy-multi-rail+json` and a JSON body that lists
one entry per advertised rail, each with its own per-rail
quote-token JWS.
The multi-rail body is the negotiation contract. It is fully defined
in OSS so the same proxy binary can advertise rails whether or not the
operator is running an enterprise build that can settle them.
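The per-request choice between the two shapes can be sketched as a small function over the two request headers. The MIME types are the ones listed above; the `ChallengeShape` enum and `pick_shape` function are hypothetical names, and treating any non-empty `Accept-Payment` value as an opt-in is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum ChallengeShape {
    SingleRail,
    MultiRail,
}

/// `Accept` media types that opt a request in to multi-rail negotiation.
const MULTI_RAIL_TYPES: [&str; 3] = [
    "application/sbproxy-multi-rail+json",
    "application/x402+json",
    "application/mpp+json",
];

/// Decide which 402 shape to emit from the raw header values, if present.
fn pick_shape(accept_payment: Option<&str>, accept: Option<&str>) -> ChallengeShape {
    // Assumption: any non-empty Accept-Payment header counts as an opt-in.
    let has_accept_payment = accept_payment.map_or(false, |v| !v.trim().is_empty());

    // Compare each Accept item with its parameters (such as ;q=) stripped.
    let has_multi_rail_accept = accept.map_or(false, |v| {
        v.split(',').any(|item| {
            let media = item.split(';').next().unwrap_or("").trim().to_ascii_lowercase();
            MULTI_RAIL_TYPES.iter().any(|t| *t == media.as_str())
        })
    });

    if has_accept_payment || has_multi_rail_accept {
        ChallengeShape::MultiRail
    } else {
        ChallengeShape::SingleRail
    }
}
```

A legacy crawler that sends neither signal falls through to `SingleRail`, which matches the default described in item 1.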
## OSS advertises, enterprise settles
The split between what OSS does and what the enterprise build does is
deliberate, and matches the framing the rail-Lightning example PR
uses (see `examples/rail-lightning/README.md`).
What the OSS proxy does today:
- Parses the `Accept-Payment` header (RFC-style q-values) and the
multi-rail `Accept` MIME types.
- Filters the agent's preference set against the operator's per-tier
`rails:` override and the top-level `rails:` block.
- Emits the multi-rail 402 body with one entry per surviving rail,
each carrying its own quote-token JWS (separate nonce per rail).
- Responds 406 `no_acceptable_rail` when the preference set has no
overlap with the offered rails, listing the operator's offered set
on the response.
- Falls back to the single-rail format for legacy crawlers that did
not opt in.
- Honours the in-memory ledger (`valid_tokens:`) and the HTTPS-only
HTTP ledger client for accept-payment redemption.
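The parse, filter, and 406 steps in the list above can be sketched as follows. This is a simplified illustration, not the proxy's parser: `parse_accept_payment` and `negotiate` are hypothetical names, the per-tier and top-level `rails:` blocks are collapsed into a single `offered` slice, and ordering surviving rails by descending q-value is an assumption.

```rust
/// Parse a header such as "x402;q=1.0, mpp;q=0.5" into (rail, q) pairs,
/// highest preference first. A missing q defaults to 1.0; q=0 drops the rail.
fn parse_accept_payment(header: &str) -> Vec<(String, f32)> {
    let mut prefs = Vec::new();
    for item in header.split(',') {
        let mut parts = item.split(';');
        let rail = parts.next().unwrap_or("").trim().to_ascii_lowercase();
        if rail.is_empty() {
            continue;
        }
        let mut q = 1.0f32;
        for param in parts {
            if let Some((k, v)) = param.split_once('=') {
                if k.trim().eq_ignore_ascii_case("q") {
                    // An unparseable q is treated as "not acceptable".
                    q = v.trim().parse().unwrap_or(0.0);
                }
            }
        }
        if q > 0.0 {
            prefs.push((rail, q.min(1.0)));
        }
    }
    // Stable sort keeps header order among equal q-values.
    prefs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    prefs
}

/// Intersect the agent's preferences with the operator's offered rails.
/// An empty intersection maps to the 406 `no_acceptable_rail` response.
fn negotiate(header: &str, offered: &[&str]) -> Result<Vec<String>, (u16, &'static str)> {
    let surviving: Vec<String> = parse_accept_payment(header)
        .into_iter()
        .map(|(rail, _)| rail)
        .filter(|rail| offered.iter().any(|o| *o == rail.as_str()))
        .collect();
    if surviving.is_empty() {
        Err((406, "no_acceptable_rail"))
    } else {
        Ok(surviving)
    }
}
```

Each surviving rail would then get its own entry and quote token in the multi-rail body; that signing step is outside this sketch.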
What the OSS proxy cannot do today:
- Settle a real-money payment on a stablecoin or fiat rail.
- Verify an x402 redemption token against a facilitator.
- Capture a Stripe `payment_intent`.
- Open or close a Lightning invoice.
Settlement on those rails requires the enterprise build, gated behind
cargo features:
| Feature | Settles |
|----------------------|------------------------------------------------|
| `stripe` | Stripe fiat (cards, ACH). |
| `x402` | x402 v2 stablecoin-on-chain via a facilitator. |
| `mpp` | Stripe Multi-Party Payments. |
| `lightning-cln` | Core Lightning node. |
| `lightning-lnd` | LND node. |
| `lightning-phoenixd` | Phoenix self-custodial daemon. |
Each enterprise feature registers a `BillingRail` impl into the OSS
plugin trait registry under the canonical rail name the OSS schema
already understands (`x402`, `mpp`, `lightning`). The OSS YAML schema
in `sb.yml` does not change across enterprise backends; only the
settlement code does. That is the property this contract pins:
operators write the same `sb.yml` whether they run OSS or an
enterprise build.
## Single-rail body
The default 402 body for legacy crawlers. Returned with the
`Crawler-Payment` response header and `Content-Type: application/json`.
```json
{
"error": "payment_required",
"price": "0.001",
"currency": "USD",
"target": "blog.example.com/article",
"header": "crawler-payment"
}
```
The `header` field tells the crawler which header name to set on its
retry. The default is `crawler-payment`; operators override it via the
policy's `header:` config field.
## Multi-rail body
Emitted when the agent opted in. `Content-Type:
application/sbproxy-multi-rail+json`.
```json
{
"rails": [
{
"kind": "x402",
"version": "2",
"chain": "base",
"facilitator": "https://facilitator-base.x402.org",
"asset": "USDC",
"amount_micros": 1000,
"currency": "USD",
"pay_to": "0x0000000000000000000000000000000000000000",
"expires_at": "2026-05-08T12:34:56Z",
"quote_token": "eyJhbGc..."
},
{
"kind": "mpp",
"version": "1",
"amount_micros": 1000,
"currency": "USD",
"expires_at": "2026-05-08T12:34:56Z",
"quote_token": "eyJhbGc..."
}
],
"agent_choice_method": "header_negotiation",
"policy": "first_match_wins"
}
```
Notes:
- `rails[].kind` is a closed enum: `x402`, `mpp`, `lightning`. Adding
a rail follows the closed-enum amendment rule in
[`adr-fast-track-amendment.md`](adr-fast-track-amendment.md).
- `rails[].quote_token` is a JWS. One nonce per rail per response, so
the agent cannot replay a quote across rails. JWKS publication and
token replay are covered by the
`examples/quote-token-replay-jwks/` example.
- `rails[]` order is the operator's declared preference. Agents break
ties on this order after q-value sorting their own preference set.
- Lightning entries appear in the body only when an enterprise
`lightning-*` feature has registered a `BillingRail` named
`lightning` into the trait registry. With the OSS-default build, a
per-tier `rails: [lightning, x402]` declaration parses cleanly (the
`Rail::Lightning` enum variant ships in OSS) and the proxy still
negotiates against the `lightning` token on the wire; the body just
carries the next surviving rail (here `x402`).
## Cloudflare Pay Per Crawl interop
Set `cloudflare_compat: true` on the `ai_crawl_control` policy to speak
Cloudflare's exact Pay Per Crawl wire contract. A crawler that already
transacts with a Cloudflare origin works unchanged against an SBproxy
origin. The difference is that SBproxy settles on the operator's own
rails, with no Merchant-of-Record cut.
In this mode the negotiation uses Cloudflare's header set instead of
the single-rail JSON body:
- The 402 response carries a `crawler-price` header giving the currency
and amount, for example `crawler-price: USD 0.01`. A JSON body mirrors
the price for clients that read the body instead of the header.
- The crawler retries with `crawler-exact-price` (commit to a precise
amount) or `crawler-max-price` (a cap), plus its payment token on the
configured header (`crawler-payment` by default). The token settles
through the same self-hosted ledger the single-rail path uses.
- A `crawler-max-price` below the quote, or a `crawler-exact-price`
that does not equal the quote, re-quotes with a fresh 402 and does
not spend the token.
- A settled request is served with a `crawler-charged` header so the
crawler learns exactly what it paid.
```yaml
policies:
  - type: ai_crawl_control
    price: 0.01
    currency: USD
    cloudflare_compat: true
    free_paths:
      - "/feed/*"
    valid_tokens:
      - ppc-token-1
```
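The re-quote rules in the list above can be sketched in a few lines of Python. This is an illustration only, not the proxy's code: `parse_price` and `should_requote` are hypothetical names, and the sketch assumes the crawler's `crawler-max-price` and `crawler-exact-price` values use the same currency-then-amount shape as `crawler-price`. Treating a currency mismatch or a retry with neither header as a re-quote is also an assumption.

```python
from decimal import Decimal
from typing import Optional, Tuple


def parse_price(value: str) -> Tuple[str, Decimal]:
    """Parse a 'USD 0.01' style value into (currency, amount)."""
    currency, amount = value.split()
    return currency.upper(), Decimal(amount)


def should_requote(
    quote: str,
    max_price: Optional[str] = None,
    exact_price: Optional[str] = None,
) -> bool:
    """Return True when the retry must be answered with a fresh 402."""
    quote_currency, quote_amount = parse_price(quote)
    if exact_price is not None:
        currency, amount = parse_price(exact_price)
        # An exact price must equal the quote.
        return currency != quote_currency or amount != quote_amount
    if max_price is not None:
        currency, amount = parse_price(max_price)
        # A cap below the quote cannot cover it.
        return currency != quote_currency or amount < quote_amount
    # Assumption: a retry with neither header is re-quoted.
    return True
```

Amounts are compared as `Decimal` values so `0.010` and `0.01` are treated as the same price, which a float or string comparison would not guarantee.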
### Always-free paths
These well-known operational endpoints are never charged, so a crawler
can always discover the site's policy without paying to read it:
- `/robots.txt`
- `/sitemap.xml`
- `/security.txt`
- `/.well-known/security.txt`
- `/crawlers.json`
The per-policy `free_paths:` list extends this built-in allowlist
(Cloudflare's Configuration-Rules equivalent). A trailing `*` is a
prefix match (`/feed/*`); otherwise the entry matches exactly. The
built-in allowlist always applies, so an operator cannot accidentally
start charging for `robots.txt`.
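A minimal Python sketch of the matching rule described above, for illustration only. The function name `is_free_path` is hypothetical; the built-in list and the trailing-`*` prefix rule come from this section.

```python
from typing import Iterable

# Built-in allowlist from this section; always applied.
BUILT_IN_FREE_PATHS = (
    "/robots.txt",
    "/sitemap.xml",
    "/security.txt",
    "/.well-known/security.txt",
    "/crawlers.json",
)


def is_free_path(path: str, free_paths: Iterable[str] = ()) -> bool:
    """Return True when `path` is never charged."""
    for entry in (*BUILT_IN_FREE_PATHS, *free_paths):
        if entry.endswith("*"):
            # Trailing `*`: prefix match on everything before the star.
            if path.startswith(entry[:-1]):
                return True
        elif path == entry:
            # No star: exact match only.
            return True
    return False
```

With `free_paths: ["/feed/*"]`, `/feed/atom.xml` is free, while `/feed` (no trailing slash) is not, because the prefix being matched is `/feed/`.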
### Binding the price headers to a Web Bot Auth signature
The crawler's pre-authorization headers (`crawler-max-price` and
`crawler-exact-price`) are inbound request headers, so an operator who
also runs the `bot_auth` verifier can require them to be signed
components by listing the header name in that agent's
`required_components`. A retry whose Web Bot Auth signature does not
cover the listed price header is then rejected before the ledger is
consulted.
Binding the proxy's outbound price headers (`crawler-price`,
`crawler-charged`) into a signature the crawler can verify is a separate
piece of work: it needs the outbound response-signing path, which is not
part of this contract yet.
### Pluggable pricing model
Pricing can be flat (`price:`) or per-path (`tiers:`). For a learned
model (an LM-Tree-style pricing model is the motivating example), an
embedder injects a `PricingModel` implementation through
`AiCrawlControlPolicy::with_pricing_model`. The model is consulted
before the static tier table; returning a price overrides the static
resolution for that request, and returning nothing defers to the tier
table and the flat-price fallback. The OSS build ships only the seam,
not a model.
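The resolution order can be sketched as follows. This is a Python illustration of the ordering only: the real seam is the Rust `PricingModel` trait, and the callable signature, the `(prefix, price)` tier shape, and prefix matching on tiers are all assumptions made for the example.

```python
from decimal import Decimal
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-in for a pricing model: path in, optional price out.
PricingModel = Callable[[str], Optional[Decimal]]


def resolve_price(
    path: str,
    model: Optional[PricingModel],
    tiers: Sequence[Tuple[str, Decimal]],
    flat_price: Decimal,
) -> Decimal:
    """Model first, then the static tier table, then the flat price."""
    if model is not None:
        learned = model(path)
        if learned is not None:
            return learned  # the model's answer overrides the static table
    for prefix, price in tiers:
        if path.startswith(prefix):  # assumption: tiers match by path prefix
            return price
    return flat_price
```

A model that returns `None` for a request leaves the static configuration in charge, so an operator can roll a model out for a subset of paths without touching `tiers:` or `price:`.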
## 406 fallback
When the agent's `Accept-Payment` preference set has no overlap with
the operator's offered rails, the proxy returns `406 Not Acceptable`
with `Content-Type: application/json`:
```json
{
"error": "no_acceptable_rail",
"supported_rails": ["x402", "mpp"],
"target": "blog.example.com/article"
}
```
`supported_rails` reflects the operator's declared offered set on the
matched tier (the per-tier `rails:` override, or the route default if
no override is set), not the runtime-emittable subset. The agent
retries with one of the listed rails on its `Accept-Payment` header.
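A rough Python sketch of the overlap check behind this response. The helper names are hypothetical, and the sketch assumes that a rail with `q=0` counts as not acceptable (the usual HTTP q-value convention). It also ignores the distinction between the declared offered set and the runtime-emittable subset.

```python
from typing import Dict, List, Sequence, Tuple, Union


def parse_accept_payment(header: str) -> Dict[str, float]:
    """Parse 'lightning;q=1.0, x402;q=0.5' into {rail: q}."""
    prefs: Dict[str, float] = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        rail = parts[0].lower()
        if not rail:
            continue
        q = 1.0  # a rail listed without a q parameter defaults to 1.0
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: an unparseable q disables the rail
        prefs[rail] = q
    return prefs


def negotiate_rails(
    accept_payment: str, offered: Sequence[str], target: str
) -> Tuple[int, Union[List[str], dict]]:
    """Return (402, surviving rails) or (406, error body)."""
    prefs = parse_accept_payment(accept_payment)
    # Keep the operator's declared order for the surviving rails.
    surviving = [rail for rail in offered if prefs.get(rail, 0.0) > 0.0]
    if surviving:
        return 402, surviving
    return 406, {
        "error": "no_acceptable_rail",
        "supported_rails": list(offered),
        "target": target,
    }
```

An agent sending `Accept-Payment: lightning` against an operator offering `x402` and `mpp` gets the 406 body shown above and can retry with one of the listed rails.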
## Opt-in signals
Per A3.1, any of the following signals on the request opts the agent
in to the multi-rail body:
- `Accept-Payment` request header carries a q-value list of rail
names. Example: `Accept-Payment: lightning;q=1.0, x402;q=0.5`.
- `Accept` request header includes
`application/sbproxy-multi-rail+json`,
`application/x402+json`, or `application/mpp+json`. The latter two
are narrowly opt-in: an agent that sends `Accept:
application/x402+json` is asking specifically for the x402 entry,
not for the full multi-rail body.
Without any opt-in signal, the proxy emits the single-rail body so
legacy crawlers keep working unchanged.
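The opt-in test can be sketched as a single predicate. This is a Python illustration with a hypothetical function name. It only answers whether the request opted in, and does not model the narrower behaviour of the two rail-specific MIME types.

```python
from typing import Mapping

MULTI_RAIL_MIME_TYPES = {
    "application/sbproxy-multi-rail+json",
    "application/x402+json",
    "application/mpp+json",
}


def wants_multi_rail(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries any multi-rail opt-in signal."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Signal 1: a non-empty Accept-Payment header.
    if lowered.get("accept-payment", "").strip():
        return True
    # Signal 2: one of the multi-rail MIME types in Accept.
    for item in lowered.get("accept", "").split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in MULTI_RAIL_MIME_TYPES:
            return True
    return False
```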
## Quote-token JWS
Each rail entry in the multi-rail body carries its own `quote_token`,
signed by the proxy under a key whose JWKS the operator publishes at
`/.well-known/sbproxy-quote-jwks`. The token binds the rail kind, the
amount, the route, and a per-rail nonce so the agent cannot replay a
quote across rails or reuse it after expiry.
The `accept_payment` policy verifies the JWS on the agent's retry
before consulting the ledger. A token whose claims do not match the
retry context (different rail, different route, expired) is rejected
without a ledger round-trip.
The token schema is OSS. The settlement that the token underwrites is
enterprise.
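The claim check that runs before the ledger lookup can be sketched as below. This is an assumption-heavy Python illustration: the claim names `rail`, `route`, and `exp` are hypothetical (the actual claim names are not given here), and signature verification against the published JWKS is deliberately left out.

```python
import time
from typing import Mapping, Optional


def claims_match_retry(
    claims: Mapping[str, object],
    rail: str,
    route: str,
    now: Optional[float] = None,
) -> bool:
    """Check already-verified token claims against the retry context.

    Assumes the JWS signature was verified beforehand and that the
    payload exposes `rail`, `route`, and `exp` (hypothetical names).
    """
    current = time.time() if now is None else now
    return (
        claims.get("rail") == rail
        and claims.get("route") == route
        and float(claims.get("exp", 0)) > current
    )
```

A `False` result corresponds to the rejection without a ledger round-trip described above.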
## Related
- [`ai-crawl-control.md`](ai-crawl-control.md) - policy configuration,
agent classes, ledger, tiered pricing.
- [`enterprise.md`](enterprise.md) - the OSS / enterprise split,
including the rail settlement features.
- `examples/rail-x402-base-sepolia/` - x402 rail with a hermetic
mock facilitator.
- `examples/rail-mpp-stripe-test/` - MPP rail with Stripe test
mode and a wiremock fallback.
- `examples/multi-rail-accept-payment/` - x402 + MPP wired
together with q-value negotiation.
- `examples/rail-lightning/` - Lightning rail negotiation contract
(settlement is enterprise-only).
- `examples/quote-token-replay-jwks/` - JWKS endpoint and
single-use quote-token enforcement.
================================================================
# docs/a2a-gateway.md
================================================================
## A2A gateway
*Last modified: 2026-05-31*
The `a2a` action proxies agent-to-agent requests to an upstream A2A endpoint and surfaces the agent's typed AgentCard for capability discovery and modality negotiation. Pairs with MCP federation (one gateway, two protocols) and the AP2 / ACP / RAR payment surfaces.
## Wire shape
The A2A protocol is JSON-RPC over HTTP. Clients call `POST //tasks/sendSubscribe` (or the streaming variant) with a JSON-RPC envelope; the agent responds with a `Task` document. The gateway sits in front of one or more agent endpoints and is responsible for two things the bare proxy cannot do on its own: telling a calling agent what each upstream advertises, and gating the call when the caller and the agent disagree on modality.
## AgentCard
```yaml
origins:
  "agent.example.com":
    action:
      type: a2a
      url: http://backend:9000/a2a
      agent_card:
        name: "Reservation assistant"
        description: "Books and modifies restaurant reservations."
        version: "0.3.0"
        url: "https://agent.example.com/"
        capabilities:
          streaming: true
          pushNotifications: false
          stateTransitionHistory: false
        defaultInputModes:
          - "application/json"
          - "text/plain"
        defaultOutputModes:
          - "application/json"
        skills:
          - id: "find_table"
            description: "Find a free table by time + party size"
```
The whole card round-trips through the gateway: SBproxy types only the fields it consumes (`capabilities`, `defaultInputModes`, `defaultOutputModes`, `name`, `description`, `version`, `url`, `skills`). Anything else the operator pastes (the A2A spec's optional `provider`, `authentication`, `supportsAuthenticatedExtendedCard`, etc.) lives on `extensions` and serialises back verbatim.
## Capability discovery
The gateway can serve the card itself at `/.well-known/agent.json` so an A2A client can probe SBproxy and get back the card of the agent it would route to. The operator opts in to this handler on the action; when it is not configured, the well-known path falls through to the upstream, so an agent that already serves its own card keeps doing so.
`capabilities.streaming` and `capabilities.pushNotifications` are surfaced under CEL so policies can branch on what the agent advertises before forwarding. A typical use is gating an A2A request that requests streaming when the agent does not advertise it; the policy rejects with a 400 before the upstream is contacted.
## Modality negotiation
SBproxy ships pure-function helpers `AgentCard::negotiate_input` and `AgentCard::negotiate_output` that pair the caller's `Content-Type` and `Accept` against the agent's advertised `defaultInputModes` and `defaultOutputModes`. Each call returns one of four typed outcomes:
| Outcome | When | Effect on the upstream call |
|---|---|---|
| `Matched(mode)` | the caller's preference overlaps with the agent's advertised modes | proceed with `mode` |
| `NoCallerPreference(mode)` | the caller omitted `Content-Type` / `Accept` | proceed; gateway echoes `mode` |
| `AgentUndeclared(mode)` | the agent's mode list is empty (no restriction) | proceed with the caller's preference |
| `Mismatch { requested, advertised }` | no overlap | gateway returns 406 with both lists in the error body |
The negotiator is case-insensitive on the MIME `type/subtype` head and strips `;`-parameters before comparing, so `application/json; charset=utf-8` matches `application/json`. The output side honours `*/*` by collapsing to the agent's first declared output mode.
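The four outcomes can be modelled in Python as a sketch of the decision table above. This is not the Rust implementation: the dataclass and function names only mirror the documented outcome names, and the precedence between an empty agent list and a missing caller preference is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Matched:
    mode: str


@dataclass(frozen=True)
class NoCallerPreference:
    mode: str


@dataclass(frozen=True)
class AgentUndeclared:
    mode: Optional[str]


@dataclass(frozen=True)
class Mismatch:
    requested: Tuple[str, ...]
    advertised: Tuple[str, ...]


def _head(value: str) -> str:
    """Lowercase type/subtype with any ;-parameters stripped."""
    return value.split(";")[0].strip().lower()


def _negotiate(requested: Sequence[str], advertised: Sequence[str], wildcard_ok: bool):
    wanted = [_head(v) for v in requested if v.strip()]
    modes = [_head(m) for m in advertised]
    if not modes:  # assumption: checked before the caller's preference
        return AgentUndeclared(wanted[0] if wanted else None)
    if not wanted:
        return NoCallerPreference(modes[0])
    for candidate in wanted:
        if wildcard_ok and candidate == "*/*":
            return Matched(modes[0])  # collapse to the first declared mode
        if candidate in modes:
            return Matched(candidate)
    return Mismatch(tuple(wanted), tuple(modes))


def negotiate_input(content_type: Optional[str], input_modes: Sequence[str]):
    return _negotiate([content_type] if content_type else [], input_modes, False)


def negotiate_output(accept: Optional[str], output_modes: Sequence[str]):
    return _negotiate((accept or "").split(","), output_modes, True)
```

A `Mismatch` result carries both lists, which is what the gateway needs to build the 406 error body.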
## See also
- The A2A x402 payment bridge.
- The agentgateway / Bifrost / SBproxy capability benchmark.
- `crates/sbproxy-modules/src/action/a2a.rs` - the proxy action itself.
- `crates/sbproxy-modules/src/action/a2a_card.rs` - typed AgentCard + negotiator.
================================================================
# docs/access-log.md
================================================================
## Access log
*Last modified: 2026-05-04*
Structured-JSON access logs give every completed request a single line on
stdout, ready to ship to ELK, Loki, Datadog, or any pipeline that already
speaks JSON. The proxy emits the line via the `access_log` tracing target
so log routers can split access logs from application logs without
additional plumbing.
## Default behaviour
Off. SBproxy emits no access-log lines unless the top-level `access_log`
block is present and `enabled: true`. Metrics, traces, and the audit log
are unaffected by this knob.
## Enabling
Add the block to `sb.yml`:
```yaml
access_log:
  enabled: true
origins:
  api.example.com:
    action:
      type: proxy
      url: http://localhost:3000
```
A request to `api.example.com` now produces a line such as:
```json
{"timestamp":"2026-04-27T12:00:03.521Z","request_id":"7f7c","origin":"api.example.com","method":"GET","path":"/health","status":200,"latency_ms":24.7,"auth_ms":1.2,"upstream_ttfb_ms":18.9,"response_filter_ms":4.1,"bytes_in":0,"bytes_out":1024,"client_ip":"203.0.113.10"}
```
The three `*_ms` phase fields (`auth_ms`, `upstream_ttfb_ms`,
`response_filter_ms`) split `latency_ms` into the parts of the
pipeline that contributed to it. They are emitted whenever the
matching phase ran on the request; an origin with no auth provider
omits `auth_ms`, an early WAF block omits `upstream_ttfb_ms` and
`response_filter_ms`, a cache hit served from the proxy omits both
upstream fields. The same observations also feed the
`sbproxy_phase_duration_seconds` Prometheus histogram (see
[metrics-stability.md](./metrics-stability.md)) so the aggregate
view does not require log scraping.
Optional fields (`provider`, `model`, `tokens_in`, `tokens_out`,
`cache_result`, `trace_id`, `request_headers`, `response_headers`,
`upstream_host`) are omitted when not applicable, keeping non-AI lines
compact.
## Filters
`status_codes` and `methods` narrow the set of requests that get logged:
```yaml
access_log:
  enabled: true
  status_codes: [500, 502, 503, 504]
  methods: ["POST", "PUT", "PATCH", "DELETE"]
```
Empty or omitted lists match every value. Method comparison is
case-insensitive.
## Sampling
`sample_rate` is a probability in `[0.0, 1.0]` applied after the
status/method filters:
```yaml
access_log:
  enabled: true
  sample_rate: 0.05   # log 5% of matching requests
```
`1.0` (the default) logs every match. `0.0` is equivalent to disabling
emission entirely.
### Forced emission
Two knobs bypass `sample_rate` after the status/method filters match:
```yaml
access_log:
  enabled: true
  sample_rate: 0.05
  slow_request_threshold_ms: 1000
  always_log_errors: true
```
`slow_request_threshold_ms` logs every matching request whose end-to-end
latency is at or above the threshold. `always_log_errors: true` logs
every matching `5xx` response. Both knobs are off by default, preserving
the sampler-only behavior for existing configs.
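Putting filters, forced emission, and sampling together, the decision order can be sketched in Python. The function `should_log` and its `rng` parameter are hypothetical; the parameter names mirror the config keys.

```python
import random
from typing import Callable, Optional, Sequence


def should_log(
    status: int,
    method: str,
    latency_ms: float,
    *,
    status_codes: Sequence[int] = (),
    methods: Sequence[str] = (),
    sample_rate: float = 1.0,
    slow_request_threshold_ms: Optional[float] = None,
    always_log_errors: bool = False,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether one completed request produces an access-log line."""
    # 1. Filters: empty lists match everything; methods compare case-insensitively.
    if status_codes and status not in status_codes:
        return False
    if methods and method.upper() not in {m.upper() for m in methods}:
        return False
    # 2. Forced emission bypasses the sampler, but only after the filters match.
    if slow_request_threshold_ms is not None and latency_ms >= slow_request_threshold_ms:
        return True
    if always_log_errors and 500 <= status <= 599:
        return True
    # 3. Sampling: rng() is in [0, 1), so 1.0 logs all and 0.0 logs none.
    return rng() < sample_rate
```

Passing a fixed `rng` makes the sampler deterministic, which is convenient when checking a config's behaviour against sample requests.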
## Header capture
Opt in by listing header names in `access_log.capture_headers.request`
and / or `access_log.capture_headers.response`. Captured values land in
the `request_headers` and `response_headers` fields of the emitted entry.
```yaml
access_log:
  enabled: true
  capture_headers:
    request: ["user-agent", "x-request-id", "x-ratelimit-*"]
    response: ["x-sbproxy-cache", "content-length"]
    max_value_bytes: 1024
    redact_pii: false
```
Three pattern shapes are accepted:
* Exact name: `"user-agent"`, `"x-cache"`.
* `"*"`: capture every header (subject to the sensitive-header denylist
below).
* Trailing glob: `"x-ratelimit-*"` captures every header whose name
starts with the prefix before the `*`. Only one trailing `*` is
supported; embedded wildcards are treated as literal.
Header names are matched case-insensitively. Captured values are
truncated to `max_value_bytes` (default 1024) with a trailing `"..."`
that counts toward the cap.
A hardcoded denylist of sensitive headers (`authorization`, `cookie`,
`set-cookie`, `proxy-authorization`, `x-api-key`) is excluded from `*`
and glob matches. To capture one of these, list it by exact name; the
proxy logs a `WARN` at config load so the choice is visible.
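The pattern rules and the truncation cap can be sketched in Python as follows. Both function names are hypothetical. Truncating on a UTF-8 boundary and shortening the marker when the cap is under three bytes are assumptions of the sketch.

```python
from typing import Iterable

SENSITIVE_HEADERS = frozenset(
    {"authorization", "cookie", "set-cookie", "proxy-authorization", "x-api-key"}
)


def is_captured(name: str, patterns: Iterable[str]) -> bool:
    """Return True when header `name` matches the capture list."""
    name = name.lower()
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern == name:
            return True  # exact names capture even denylisted headers
        if pattern.endswith("*") and name not in SENSITIVE_HEADERS:
            # "*" leaves an empty prefix, so it matches every other header.
            if name.startswith(pattern[:-1]):
                return True
    return False


def truncate_value(value: str, max_value_bytes: int = 1024) -> str:
    """Cap a captured value; the trailing "..." counts toward the cap."""
    raw = value.encode("utf-8")
    if len(raw) <= max_value_bytes:
        return value
    marker = "..."[:max_value_bytes]
    keep = max_value_bytes - len(marker)
    # errors="ignore" drops a multi-byte character cut in half.
    return raw[:keep].decode("utf-8", errors="ignore") + marker
```

A pattern such as `x-*-id` only matches a header literally named `x-*-id`, because the wildcard branch requires the `*` to be the last character.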
When `redact_pii: true`, the `sbproxy-security` PII redactor runs over
captured header values. `redact_pii_rules` (empty by default) optionally
restricts the rule set; accepted names are `email`, `us_ssn`,
`credit_card`, `phone_us`, `ipv4`, `openai_key`, `anthropic_key`,
`aws_access`, `github_token`.
## Record shape
| Field | Type | Notes |
|-------|------|-------|
| `timestamp` | string | RFC 3339 (UTC) of when the response was sent. |
| `request_id` | string | Unique per request. Reuses the propagated `X-Request-Id` when set; otherwise a fresh UUIDv4. |
| `origin` | string | Hostname routing matched. |
| `method` | string | HTTP method. |
| `path` | string | Request path, no query string. |
| `status` | int | HTTP response status code. |
| `latency_ms` | float | Wall-clock end-to-end latency in milliseconds. |
| `auth_ms` | float? | Time spent in the auth check (provider dispatch, JWT verify, forward-auth subrequest, OIDC cookie open). Absent when the origin has no auth provider. |
| `upstream_ttfb_ms` | float? | Time from request start to the first byte of the upstream response header. Absent when the request never reached an upstream (early auth/policy short-circuit, cache hit). |
| `response_filter_ms` | float? | Time spent running response transforms between first upstream byte and end of `response_filter`. Absent when no response_filter ran. |
| `query` | string? | Request query string without the leading `?`. Captured separately from `path` so per-route aggregations on `path` are not split by every distinct query. Absent when no query was supplied. |
| `protocol` | string? | HTTP version on the wire (`HTTP/1.1`, `HTTP/2.0`, `HTTP/3.0`). |
| `scheme` | string? | Scheme the client used to reach the proxy (`http` or `https`). Distinct from `upstream_host`'s scheme. |
| `host` | string? | Client-supplied `Host` header. May differ from `origin` (the matched virtual-host pattern, which can be a wildcard) and from `upstream_host` (where the proxy forwarded to). |
| `user_agent` | string? | Client `User-Agent` header. Pulled out as a primary field because nearly every analytics consumer wants it; the header allowlist still works as a redundant capture path. |
| `referer` | string? | Client `Referer` header (the canonical RFC 7231 misspelling). |
| `upstream_status` | int? | Upstream's response status code, when it differs from `status`. Populated when a retry chain, fallback, or `response_modifier` rewrote the status the client sees; absent when the proxy passed the upstream status through unchanged. |
| `response_content_type` | string? | Response `Content-Type` as sent to the client. |
| `response_content_encoding` | string? | Response `Content-Encoding` (`gzip`, `br`, `zstd`, ...) when the body was compressed; absent when uncompressed. |
| `bytes_in` | int | Inbound request body bytes (post header-decode). |
| `bytes_out` | int | Bytes written to the client. |
| `client_ip` | string | Post-trust-boundary client IP. |
| `provider` | string? | AI provider when an AI gateway route handled the request. |
| `model` | string? | Selected AI model identifier. |
| `tokens_in` | int? | Prompt tokens, when known. |
| `tokens_out` | int? | Completion tokens, when known. |
| `trace_id` | string? | W3C trace id when distributed tracing is active, for span correlation. |
| `cache_result` | string? | One of `hit`, `miss`, `stale`, `bypass` for cached responses. |
| `upstream_host` | string? | Upstream host the proxy contacted; absent on short-circuited requests (auth deny, WAF block, cache hit). |
| `request_headers` | object? | Captured request headers, lowercased keys. Absent when no allowlist or no matches. |
| `response_headers` | object? | Captured response headers, same shape as `request_headers`. |
| `attribution` | object? | Resolved business attribution tags (project, feature, okr, team, customer, environment, agent_type, risk_tier, trace_id) merged from the credential `attrs:` and `SB-Attr-*` headers. Same tag set the per-attribution spend metric is labeled by. Absent when none resolved. |
| `custom` | object? | Operator-defined custom fields from `observability.log.custom_fields:`. See below. Absent when none configured or none resolved. |
Optional fields are omitted from the JSON object when their value is
`None`.
## Custom fields
`observability.log.custom_fields:` adds operator-defined keys to each
line's `custom` object, so you can pivot logs on dimensions the built-in
schema does not carry (region, deployment, a derived tier, a routing
decision) without forking the binary. Each field's value is computed per
request from either a static string with `${...}` variable interpolation
or a script.
```yaml
proxy:
  observability:
    log:
      custom_fields:
        - name: region            # static value + interpolation
          value: "${env.REGION}"
        - name: caller_tier       # CEL expression
          engine: cel
          source: 'has(request.headers["x-tier"]) ? request.headers["x-tier"] : "standard"'
        - name: route_class       # Lua script (returns the value)
          engine: lua
          source: 'return string.find(ctx.request.method, "GET") and "read" or "write"'
        - name: upper_method      # JS script
          engine: js
          source: "ctx.request.method.toUpperCase()"
```
Rules:
- Each field sets exactly one of `value` or (`source` + `engine`).
Both, or neither, is a config error.
- `engine` is one of `cel`, `lua`, `js`. WASM is not supported for log
fields because it is a compiled module, not inline source.
- Static `value` interpolation variables: `${env.NAME}`, `${tenant_id}`,
`${method}`, `${path}`, `${host}`, `${status}`, `${provider}`,
`${model}`, `${request.header.NAME}`, `${attribution.KEY}`. An
unresolved variable becomes the empty string.
- CEL expressions see the context keys as top-level variables
(`request`, `response`, `tenant_id`, `provider`, `model`,
`attribution`). Lua and JS scripts see the whole context as a `ctx`
global and `return` (Lua) / evaluate to (JS) the value to log.
- A field whose script errors, or that resolves to the empty string, is
omitted from the line rather than failing the request.
- Custom values pass through the same redaction as every other field.
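The static-value rules can be sketched in Python. The helper names are hypothetical, and the flat `variables` dictionary keyed by the full variable name (for example `env.REGION`) is an assumption about how the context might be represented. Script-backed fields are out of scope here.

```python
import re
from typing import Dict, Iterable, Mapping

_VARIABLE = re.compile(r"\$\{([^}]+)\}")


def interpolate(template: str, variables: Mapping[str, str]) -> str:
    """Expand ${name} references; unresolved names become the empty string."""
    return _VARIABLE.sub(lambda m: str(variables.get(m.group(1), "")), template)


def static_custom_fields(
    fields: Iterable[Mapping[str, str]], variables: Mapping[str, str]
) -> Dict[str, str]:
    """Build the `custom` object from static-value fields only."""
    custom: Dict[str, str] = {}
    for field in fields:
        value = interpolate(field["value"], variables)
        if value:  # a field that resolves to "" is omitted from the line
            custom[field["name"]] = value
    return custom
```

With `REGION` unset, the `region` field from the example above resolves to the empty string and is dropped from the line instead of being logged as `""`.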
### Scopes
`custom_fields:` can be declared at three scopes: `proxy.observability.log`,
`tenants[].observability.log`, and the `observability.log` block of an
individual origin under `origins:`. They
compose per request as **proxy then tenant then origin**: the tenant set is
resolved from the request's `tenant_id`, the origin set from the matched
origin, and a more-specific scope's field overrides a less-specific field
of the same `name` (the broader definition is not evaluated at all for that
name). Fields with distinct names from every scope are unioned. This is the
same composition order redaction uses (see the sink-scope and tenant/origin
redaction sections in the observability guide).
A worked example covering all three scopes is in
`examples/custom-log-fields/`.
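The composition rule amounts to a last-writer-wins merge keyed by `name`. A minimal Python sketch, with a hypothetical function name and field definitions represented as plain dictionaries:

```python
from typing import Dict, Iterable, List, Mapping


def compose_custom_fields(
    proxy_fields: Iterable[Mapping[str, str]],
    tenant_fields: Iterable[Mapping[str, str]],
    origin_fields: Iterable[Mapping[str, str]],
) -> List[Mapping[str, str]]:
    """Merge field definitions as proxy, then tenant, then origin."""
    merged: Dict[str, Mapping[str, str]] = {}
    for scope in (proxy_fields, tenant_fields, origin_fields):
        for field in scope:
            # A more specific scope replaces the broader definition outright.
            merged[field["name"]] = field
    return list(merged.values())
```

Because the merge happens on definitions before any value is computed, an overridden proxy-level script is never evaluated for that name.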
## Redaction
Every line is passed through the same secret redactor that protects
metric labels and audit events. Bearer tokens, API keys with
recognisable prefixes (`sk-`, `pk-`, `ghp_`, ...), and JWT-shaped
strings are replaced with `[REDACTED]` before the line reaches stdout.
Apply additional masking at your log shipper if your origin embeds
custom secrets in URLs or other places the line carries verbatim.
The PII redactor described under [Header capture](#header-capture) runs
before secret redaction, but only over captured header values. Other
fields (`path`, `request_id`, `client_ip`) are not PII-redacted.
## Routing the lines
Every line carries `target = "access_log"` in tracing metadata. Common
patterns:
* Filter via `RUST_LOG=info,access_log=info,sbproxy=warn` to keep
operator logs quiet while keeping access logs.
* Use the JSON log subscriber (default in `sbproxy-observe`) and let
your collector tag by `target`.
* Pipe stdout through `vector` or `fluent-bit` to split on `target`.
### File output
To write access logs directly to disk instead of the tracing target:
```yaml
access_log:
  enabled: true
  output:
    type: file
    path: /var/log/sbproxy/access.log
    max_size_mb: 100
    max_backups: 7
    compress: true
```
When the active file reaches `max_size_mb`, SBproxy rotates it before
writing the next line. Rotated files use suffixes like
`access.log.1` or `access.log.1.gz`; `max_backups` caps how many
rotated files are retained. `compress: true` gzips rotated files.
Omitting `output` keeps the default behavior: emit JSON through the
`access_log` tracing target.
================================================================
# docs/admin-api-reference.md
================================================================
## Admin API reference
*Last modified: 2026-06-06*
The embedded admin server publishes a small set of HTTP routes for
operator tooling: liveness probes, request log, per-target health,
hot reload, drift detection, and the emitted OpenAPI document.
This page is the per-route reference. For the operator workflow
(enabling the server, picking a port, IP allowlisting), see
[manual.md section 9 - Hot reload](manual.md#9-hot-reload) and
[manual.md section 5 - Metrics and observability](manual.md#5-metrics-and-observability).
## Enabling the admin server
```yaml
proxy:
  admin:
    enabled: true
    port: 9090
    username: admin
    password: !env ADMIN_PASSWORD
    max_log_entries: 1000
```
When `enabled: false` (the default) the admin listener does not bind
and every route below is unreachable. When enabled, the server binds to
`127.0.0.1` on the configured `port`, so the admin surface is
loopback-only by default; expose it via a reverse proxy or sidecar with
an IP allowlist when an operator console needs remote access.
## Authentication
Routes split into two tiers:
- **Unauthenticated probe routes** are reachable without credentials so
load balancers and orchestrators can probe liveness without
configuring secrets: `/healthz`, `/health`, `/readyz`, `/ready`,
`/livez`, `/live`, `/.well-known/sbproxy/quote-keys.json`.
- **Authenticated routes** require HTTP Basic auth using the
`username` and `password` from the config block. Every route under
`/api/*` and `/admin/*` is in this tier.
Send credentials with `curl -u admin:secret` followed by the route URL,
or set an `Authorization: Basic` header carrying the Base64-encoded
`username:password` pair.
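For tooling that builds the header itself, the value is the standard HTTP Basic encoding. A small Python sketch, where the helper name is hypothetical and `admin` / `secret` are the placeholder credentials from the curl example:

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP Basic auth."""
    pair = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(pair).decode("ascii")


# Placeholder credentials; read real ones from the environment.
headers = {"Authorization": basic_auth_header("admin", "secret")}
```

Basic auth is an encoding, not encryption, which is one more reason to keep the admin listener on loopback or behind TLS.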
## Rate limiting
The admin server enforces an in-process rate limit with both per-IP
and global caps. The per-IP cap is 60 requests / minute by default;
the global cap is 10x that (600 / minute). A request that exceeds
either cap returns `429` and is not counted against future windows.
The per-IP tracking map is capped at 10000 entries to prevent
unique-IP floods from growing memory.
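The interaction of the two caps can be illustrated with a fixed-window counter in Python. This is a sketch under stated assumptions: the windowing algorithm used by the admin server is not specified here, the class name is hypothetical, and the 10000-entry cap on the tracking map is left out.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict


class AdminRateLimiter:
    """Fixed-window limiter with a per-IP cap and a 10x global cap."""

    def __init__(
        self,
        per_ip_limit: int = 60,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.per_ip_limit = per_ip_limit
        self.global_limit = per_ip_limit * 10
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.per_ip: DefaultDict[str, int] = defaultdict(int)
        self.total = 0

    def allow(self, ip: str) -> bool:
        """Return False when the request should receive a 429."""
        now = self.clock()
        if now - self.window_start >= self.window_s:
            # New window: reset every counter.
            self.window_start = now
            self.per_ip.clear()
            self.total = 0
        if self.per_ip[ip] >= self.per_ip_limit or self.total >= self.global_limit:
            return False  # rejected requests are not counted
        self.per_ip[ip] += 1
        self.total += 1
        return True
```

Rejected requests leave the counters untouched, which matches the rule that a `429` is not counted against future windows.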
## Error envelope
All authenticated routes return JSON errors as:
```json
{"error":""}
```
Status codes follow conventional HTTP: `401` for missing or invalid
credentials, `405` for wrong method on a method-gated route, `409`
when a hot reload is already in flight, `429` when rate-limited,
`5xx` for server-side failures.
---
## Probe routes (unauthenticated)
### `GET /healthz`
Kubernetes-style liveness probe. Returns `200` with body
`{"status":"ok"}` whenever the process is up. Does **not** consult
the live config or any dependency; treat it as "the process is
running and the listener accepted my connection".
### `GET /health`
Component-aware liveness with version and git SHA. Returns `200`
with a JSON document that includes the proxy version, build commit,
and a per-component status table:
```json
{
"status": "ok",
"version": "1.1.0",
"commit": "abc1234",
"components": [
{"name": "config", "status": "ok"},
{"name": "cache_store", "status": "ok"}
]
}
```
When a component reports `"status": "degraded"`, the route still
returns `200`, because the proxy keeps serving traffic on degraded
components. A component in `"status": "failed"` flips the top-level
status.
### `GET /readyz`, `GET /ready`
Kubernetes-style readiness probe. Returns `200` once all required
components are ready to serve traffic, `503` while any required
component is still initialising or has failed. K8s polls this to
gate traffic shifting during rolling restarts.
### `GET /livez`, `GET /live`
Bare liveness probe. Like `/healthz` but with a different name for
load balancers that hardcode this path.
### `GET /.well-known/sbproxy/quote-keys.json`
JWKS document publishing every Ed25519 public key the live config
uses to sign Wave 3 quote tokens (the `402 Payment Required` flow's
agent-verifiable payment quotes). External verifiers (ledger
clients, agent SDKs) fetch this to verify a quote without contacting
the issuer.
Response:
```json
{
"keys": [
{
"kty": "OKP",
"crv": "Ed25519",
"kid": "",
"x": ""
}
]
}
```
Served unauthenticated because the keys themselves are public. The
document aggregates keys across every `ai_crawl_control` policy so a
multi-tenant deployment publishes one document for all of its
issuers.
---
## Read routes (authenticated)
### `GET /api/requests`
Returns the most recent request log entries, newest first. The ring
buffer size is `proxy.admin.max_log_entries` (default `1000`).
Response body: an array of `RequestLogEntry`:
```json
[
{
"timestamp": "2026-05-12T10:15:32.456Z",
"origin": "api.example.com",
"method": "GET",
"path": "/v1/orders?limit=10",
"status": 200,
"latency_ms": 42.7,
"client_ip": "10.0.0.5"
}
]
```
| Field | Type | Description |
|---|---|---|
| `timestamp` | string | RFC 3339 timestamp when the request finished. |
| `origin` | string | Configured origin hostname that handled the request. |
| `method` | string | HTTP method. |
| `path` | string | Request path including query string. |
| `status` | int | Response status code. |
| `latency_ms` | float | End-to-end latency in milliseconds. |
| `client_ip` | string | Client IP as observed by the proxy. |
This is an in-memory ring buffer; entries are lost when the process
exits. For durable request logs, enable the structured access log
(see [access-log.md](access-log.md)).
### `GET /api/health`
Aggregate liveness summary. Returns `200` with:
```json
{"status":"ok","origins":[]}
```
The `origins` array is currently a placeholder; per-origin health
detail lives at `/api/health/targets` below.
### `GET /api/health/targets`
Per-target health for every origin whose action is a
`load_balancer`. Walks the live pipeline and reports the exact state
that `select_target` consults: active health probe result, outlier
detector eject state, and circuit breaker state. Use this to confirm
that an upstream the operator believes is healthy actually is, or to
diagnose why a load balancer is short on candidates.
```json
{
"config_revision": "abc123...",
"origins": [
{
"hostname": "api.example.com",
"origin_id": "api",
"targets": [
{
"index": 0,
"url": "https://upstream-1.internal:8443",
"eligible": true,
"healthy": true,
"outlier_ejected": false,
"circuit_breaker_state": "closed",
"weight": 10,
"backup": false,
"group": null,
"zone": "us-west-1a"
}
]
}
]
}
```
| Field | Type | Description |
|---|---|---|
| `config_revision` | string | Current pipeline revision; matches the `x-sbproxy-debug-config-rev` header when debug mode is on. |
| `origins[].hostname` | string | Origin hostname. |
| `origins[].origin_id` | string | Stable identifier for this origin within its workspace. |
| `origins[].targets[].index` | int | Position in the configured target list. |
| `origins[].targets[].url` | string | Upstream URL. |
| `origins[].targets[].eligible` | bool | True when `healthy && !outlier_ejected && circuit_breaker_state != "open"`; matches what `select_target` honours. |
| `origins[].targets[].healthy` | bool | Latest active-health-check verdict. |
| `origins[].targets[].outlier_ejected` | bool | True when the outlier detector has temporarily ejected this target. |
| `origins[].targets[].circuit_breaker_state` | string \| null | `"closed"`, `"open"`, `"half_open"`, or null when the breaker is unconfigured. |
| `origins[].targets[].weight` | int | Authored weight. |
| `origins[].targets[].backup` | bool | True when this is a backup target. |
| `origins[].targets[].group` | string \| null | Authored group tag, if any. |
| `origins[].targets[].zone` | string \| null | Authored zone tag, if any. |
Origins whose action is not `load_balancer` (e.g. `proxy`,
`ai_proxy`, `static`, `redirect`) are omitted from `origins`.
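The `eligible` rule from the field table can be recomputed client-side, for example in a dashboard that wants to explain why a target was skipped. The following is a minimal sketch, assuming a hypothetical `TargetHealth` struct that mirrors only the three inputs; the struct and function names are illustrative and are not part of SBproxy.

```rust
/// Hypothetical mirror of one `targets[]` entry; only the fields that
/// feed `eligible` are modelled here.
#[derive(Debug, Clone)]
pub struct TargetHealth {
    pub healthy: bool,
    pub outlier_ejected: bool,
    /// `None` when no circuit breaker is configured (JSON `null`).
    pub circuit_breaker_state: Option<String>,
}

/// Recompute `eligible` from its inputs, following the documented rule:
/// healthy, not ejected, and breaker state other than "open".
pub fn is_eligible(t: &TargetHealth) -> bool {
    t.healthy && !t.outlier_ejected && t.circuit_breaker_state.as_deref() != Some("open")
}

/// Name the first reason a target is ineligible, for diagnostics.
pub fn ineligible_reason(t: &TargetHealth) -> Option<&'static str> {
    if !t.healthy {
        Some("failed active health check")
    } else if t.outlier_ejected {
        Some("ejected by outlier detector")
    } else if t.circuit_breaker_state.as_deref() == Some("open") {
        Some("circuit breaker open")
    } else {
        None
    }
}
```

Note that a `half_open` or unconfigured (`null`) breaker leaves the target eligible under this rule.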
### `GET /api/stats`
Basic counters summary.
```json
{"request_log_entries": 42}
```
This is a placeholder; the authoritative metrics surface is the
Prometheus `/metrics` endpoint exposed on the health port (see
[metrics-stability.md](metrics-stability.md)).
### `GET /api/openapi.json`, `GET /api/openapi.yaml`
The live pipeline's emitted OpenAPI 3.0 document. The proxy renders
the document once per pipeline revision and caches both JSON and
YAML renderings; the cache invalidates on hot reload.
The shape and the per-origin mapping are documented in
[openapi-emission.md](openapi-emission.md). The `.json` route
returns `Content-Type: application/json`; the `.yaml` route returns
`Content-Type: application/yaml`.
---
## Control routes (authenticated)
### `POST /admin/reload`
Re-reads `proxy.admin.config_path` from disk, recompiles the
pipeline, and hot-swaps the in-memory pipeline. The route uses the
same single-flight guard as the file watcher, so a manual reload
during a file-watcher reload returns `409`.
`GET /admin/reload` returns `405`; the route is gated on POST.
Success response (`200`):
```json
{
  "config_revision": "abc123...",
  "loaded_at": "2026-05-12T10:15:32.456Z"
}
```
| Status | When |
|---|---|
| `200` | Reload succeeded; pipeline swapped. |
| `400` | YAML parse failed. Error body carries the parse error with the config path scrubbed. |
| `405` | Method other than POST. |
| `409` | Another reload is already in flight. |
| `500` | Could not read the config file (permissions, ENOENT), or pipeline compile failed. |
| `503` | The admin server has no `config_path` wired (in-memory / test mode). |
See [manual.md section 9](manual.md#9-hot-reload) for the full
operator workflow including curl examples and the Kubernetes
operator integration.
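A deploy script calling `POST /admin/reload` usually wants to turn the status table into a decision. The sketch below is one possible mapping, assuming a hypothetical `ReloadOutcome` enum; the grouping of codes into actions is an interpretation of the table above, not behavior SBproxy prescribes.

```rust
/// Hypothetical classification of a `POST /admin/reload` response.
#[derive(Debug, PartialEq, Eq)]
pub enum ReloadOutcome {
    /// Pipeline swapped.
    Done,
    /// Another reload is in flight; try again shortly.
    RetryLater,
    /// The config file could not be parsed, read, or compiled.
    FixConfig,
    /// The caller used the wrong HTTP method.
    FixRequest,
    /// The admin server has no `config_path` wired.
    Unavailable,
    /// Anything the table does not list (for example an auth failure).
    Unexpected,
}

pub fn classify_reload_status(status: u16) -> ReloadOutcome {
    match status {
        200 => ReloadOutcome::Done,
        409 => ReloadOutcome::RetryLater,
        400 | 500 => ReloadOutcome::FixConfig,
        405 => ReloadOutcome::FixRequest,
        503 => ReloadOutcome::Unavailable,
        _ => ReloadOutcome::Unexpected,
    }
}
```

Only `409` is treated as retryable here; retrying a `400` or `500` without changing the file on disk would presumably fail the same way.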
### `GET /admin/drift`
Compares the on-disk config file at `proxy.admin.config_path`
against the content hash captured the last time the proxy loaded a
config (startup, file-watcher reload, or `POST /admin/reload`). Use
this to detect when the running proxy has diverged from the
declared config without triggering a reload.
```json
{
  "config_path": "/etc/sbproxy/sb.yml",
  "loaded_revision": "abc123...",
  "loaded_content_hash": "sha256:...",
  "on_disk_content_hash": "sha256:...",
  "drift": false,
  "on_disk_size_bytes": 8421,
  "checked_at": "2026-05-12T10:15:32.456Z"
}
```
| Field | Type | Description |
|---|---|---|
| `config_path` | string | Absolute path the admin server reads. |
| `loaded_revision` | string | Pipeline `config_revision` of the running proxy. |
| `loaded_content_hash` | string | Content hash of the bytes that produced the running pipeline. |
| `on_disk_content_hash` | string | Content hash of the bytes the admin server just read off disk. |
| `drift` | bool | True when `loaded_content_hash != on_disk_content_hash`. |
| `on_disk_size_bytes` | int | Size in bytes of the on-disk config. |
| `checked_at` | string | RFC 3339 timestamp of this check. |
| Status | When |
|---|---|
| `200` | Drift check completed. The body always describes the comparison. |
| `500` | Could not read the on-disk config file. Path is scrubbed from the error message. |
| `503` | The admin server has no `config_path` wired, or no content-hash baseline has been captured yet. |
Operators typically scrape this every few seconds from their dashboard
or alert pipeline. When `drift: true` is sustained for more than the
expected reload window, page the operator: either the watcher is
stuck, the deploy pipeline forgot to call `POST /admin/reload`, or
someone hand-edited the file out of band.
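The "sustained drift" rule can be sketched as a small state machine fed by each poll of `/admin/drift`. This is an illustrative sketch only: `DriftWatch` and its `reload_window` parameter are hypothetical names, and the caller is assumed to have already extracted the two hash strings from the JSON body.

```rust
use std::time::{Duration, Instant};

/// Tracks how long the on-disk config has differed from the loaded one.
pub struct DriftWatch {
    reload_window: Duration,
    drifting_since: Option<Instant>,
}

impl DriftWatch {
    pub fn new(reload_window: Duration) -> Self {
        DriftWatch { reload_window, drifting_since: None }
    }

    /// Feed one poll result. Returns true when drift has lasted longer
    /// than the expected reload window and the operator should be paged.
    pub fn observe(&mut self, loaded_hash: &str, on_disk_hash: &str, now: Instant) -> bool {
        if loaded_hash == on_disk_hash {
            // Hashes match again: clear any drift episode in progress.
            self.drifting_since = None;
            return false;
        }
        // Remember when this drift episode started (first mismatching poll).
        let since = *self.drifting_since.get_or_insert(now);
        now.duration_since(since) > self.reload_window
    }
}
```

Passing `now` in explicitly keeps the logic deterministic under test; a real poller would pass `Instant::now()`.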
---
## Admin UI (`GET /admin/ui`, `GET /`)
The OSS admin server serves a minimal browser UI at `/admin/ui` for
configuration inspection, drift status, recent requests, and the
runtime prompt-store overlay (see `/admin/prompts` below). `GET /`
redirects to `/admin/ui` so browsing to the admin port lands on the
UI without typing the path. Both routes are authenticated like the
rest of `/api/*` and `/admin/*`.
Response: `200 text/html`. The UI is a static SPA bundled into the
binary; it does not require a separate build step or asset directory.
---
## Prompt store admin (`GET /admin/prompts`, `POST /admin/prompts/...`)
Exposes the runtime prompt-store overlay. `GET /admin/prompts`
returns the in-memory snapshot (every active prompt, its pinned
version, and last-mutation metadata) as JSON. The `POST /admin/prompts/...`
mutators add a new version, pin a version, or roll back. Mutations
persist to the operator-configured redb file when `admin.prompt_store_path`
is set, so changes survive a restart.
The full set of POST shapes and request schemas is documented in
[ai-gateway.md](./ai-gateway.md) under "Stored prompts". This
reference only catalogues the route surface; the request/response
contracts live with the feature.
---
## Chat playground (`POST /admin/api/playground/chat`)
A stub handler for the dashboard's interactive chat surface. The
admin UI scaffold and its cargo feature ship today. The wiring that
routes the request through `proxy_router.oneshot` and streams the
model's response back is deferred to a follow-up ticket, so the
front-end scaffold and the production integration can land
independently.
Today the route returns `501 Not Implemented` with a JSON envelope
naming the follow-up:
```json
{
  "error": "not implemented",
  "detail": "chat playground stub; real handler will route through proxy_router.oneshot and stream the model response back to /admin/ui"
}
```
Other verbs return `405 Method Not Allowed`. The route shares the
admin port's basic-auth gate, so a curious operator pinging it
without credentials still sees `401 Unauthorized` first.
This route is OSS, ships in every build, and lives on the admin
server (next to `/admin/reload`) rather than the production proxy
listener. The path is stable; the follow-up that lights up the
real handler does not move it.
---
## Curl recipes
```bash
# Reload the running config.
curl -s -X POST -u admin:secret \
  http://127.0.0.1:9090/admin/reload
# Check for config drift.
curl -s -u admin:secret \
  http://127.0.0.1:9090/admin/drift | jq
# Watch per-target health.
curl -s -u admin:secret \
  http://127.0.0.1:9090/api/health/targets | jq '.origins[].targets'
# Inspect the last 50 requests.
curl -s -u admin:secret \
  http://127.0.0.1:9090/api/requests | jq '.[0:50]'
# Pull the emitted OpenAPI spec for a Postman import.
curl -s -u admin:secret \
  http://127.0.0.1:9090/api/openapi.json > openapi.json
```
---
## See also
- [manual.md](manual.md) - install, CLI, hot reload workflow.
- [configuration.md](configuration.md) - the `proxy.admin:` block.
- [openapi-emission.md](openapi-emission.md) - the emitted OpenAPI document's shape and per-origin mapping.
- [access-log.md](access-log.md) - the durable structured request log.
- [metrics-stability.md](metrics-stability.md) - the Prometheus `/metrics` surface.
- [audit-log.md](audit-log.md) - tamper-evident log of admin actions.
================================================================
# docs/adr-ai-hub-format.md
================================================================
## ADR: AI gateway hub format and the `ChatFormat` trait
*Last modified: 2026-05-12*
Status: proposed. Drives the hub `ChatFormat` trait plus `/v1/messages` and `/v1/responses` inbound surfaces.
## Context
SBproxy's AI gateway today accepts the OpenAI `POST /v1/chat/completions` shape from clients and either passes it through (OpenAI-compatible upstreams: Groq, Together, DeepSeek, Mistral, Perplexity, OpenRouter, vLLM, Ollama) or hands it to a per-provider translator that rewrites request and response bytes (Anthropic Messages today; Gemini and Bedrock left as TODO in `crates/sbproxy-ai/src/translators/mod.rs:36`). The translator API is two free functions, `translate_request` and `translate_response`, branching on a small `ProviderFormat` enum.
That worked while the only inbound shape was OpenAI chat-completions and the only translated upstream was Anthropic. It does not generalize.
Operators are already asking for two more inbound shapes:
1. `POST /v1/messages` (the Anthropic Messages shape, so the Anthropic SDK and Claude Code can point at SBproxy directly).
2. `POST /v1/responses` (the OpenAI Responses API, which the OpenAI Python and TypeScript SDKs are migrating to).
And five outbound shapes are in scope:
1. OpenAI (and every OpenAI-compatible upstream).
2. Anthropic Messages.
3. Google Gemini and Vertex AI (same wire, two transports).
4. AWS Bedrock InvokeModel / Converse.
5. Custom (per-provider plugin, owned by the operator).
Three inbound shapes times five outbound shapes is fifteen translation pairs. Building each pair by hand would mean fifteen code paths, fifteen test matrices, and fifteen places where a new tool-call field has to be threaded. We have already seen the cost in miniature: the existing Anthropic translator strips seven OpenAI-only fields, hoists `system` messages, defaults `max_tokens`, and rewrites a path; adding a Gemini translator in the same style would duplicate ninety percent of that code.
The cost shows up most clearly in three places.
First, streaming. SSE event shapes differ for every provider. OpenAI emits `delta.content` chunks; Anthropic emits `event: content_block_delta` with a JSON-Patch-like body; Bedrock wraps everything in an AWS event-stream envelope with `:event-type` headers; Gemini emits its own `streamGenerateContent` shape. A per-pair translator means writing the same stream demuxer N times.
Second, observability. We want to emit OpenInference / OTel GenAI spans that name the model, tokens, tools, and finish reason regardless of inbound or outbound format. With per-pair translators we either repeat the extraction logic per translator or add a parallel "extract telemetry from raw bytes" code path.
Third, guardrails. The prompt-injection classifier, PII redactor, response-cache key, semantic cache, cost router, and budget gate all need a stable view of "what the user said" and "what the model said." Today those features only see the inbound OpenAI shape; they will go blind the moment the inbound is Anthropic Messages.
The hub format solves all three by collapsing N times M into N plus M. Every inbound parser writes into one canonical Rust value; every outbound emitter reads from the same canonical Rust value; everything in between (telemetry, guardrails, caching, routing) speaks one shape.
## Decision
We will introduce a `ChatFormat` trait under `crates/sbproxy-ai/src/format/` that owns translation in both directions, and a canonical `ChatRequest` / `ChatResponse` pair that every translator round-trips through. Each format implements the same trait twice over: once as an inbound parser (bytes from the client become a `ChatRequest`) and once as an outbound emitter (a `ChatRequest` becomes bytes for the upstream). Streaming follows the same pattern with `ChatEvent` chunks.
The pseudo-Rust surface is short on purpose. The trait is the contract the whole pipeline depends on, so the smaller it is the fewer places have to change when we add a sixth provider.
```rust,ignore
// crates/sbproxy-ai/src/format/mod.rs
/// A bidirectional translator between a wire format and the hub.
///
/// Implementors are stateless and cheap to construct; the gateway
/// holds one instance per registered format inside a registry.
pub trait ChatFormat: Send + Sync + 'static {
    /// Stable identifier used in config and logs (`openai`,
    /// `anthropic`, `gemini`, `bedrock`, `responses`).
    fn id(&self) -> &'static str;
    /// Inbound paths this format claims (`/v1/chat/completions`,
    /// `/v1/messages`, `/v1/responses`). Returned as a slice because a
    /// format may claim several paths (Bedrock has both
    /// `InvokeModel` and `Converse`).
    fn inbound_paths(&self) -> &'static [&'static str];
    // --- Request direction ---
    /// Parse client bytes on an inbound path into the hub request.
    /// Errors here are HTTP 400 to the client: malformed JSON, missing
    /// required fields, an unsupported feature the format cannot
    /// represent in the hub at all.
    fn parse_request(&self, bytes: &[u8]) -> Result<ChatRequest, ChatError>;
    /// Emit upstream bytes for the hub request, plus the upstream
    /// path. Returned path is the path the AI client should hit on the
    /// upstream (Anthropic rewrites to `/v1/messages`; OpenAI keeps
    /// `/v1/chat/completions`).
    fn emit_request(&self, req: &ChatRequest) -> Result<EmittedRequest, ChatError>;
    // --- Response direction ---
    /// Parse a non-streaming upstream response body into the hub
    /// response.
    fn parse_response(&self, bytes: &[u8]) -> Result<ChatResponse, ChatError>;
    /// Emit the hub response back to the client in this format's
    /// wire shape.
    fn emit_response(&self, resp: &ChatResponse) -> Result<Vec<u8>, ChatError>;
    // --- Streaming ---
    /// Parse a single SSE frame (the bytes between two blank lines)
    /// into zero or more hub events. A single upstream frame can
    /// expand to several hub events (Anthropic's `message_start`
    /// frame emits both `MessageStart` and a first `Usage` event).
    fn parse_event(&self, frame: &SseFrame) -> Result<Vec<ChatEvent>, ChatError>;
    /// Emit hub events back to the client as SSE frames. The
    /// translator owns terminator framing (`data: [DONE]` for OpenAI,
    /// `event: message_stop` for Anthropic).
    fn emit_event(&self, ev: &ChatEvent) -> Result<Vec<SseFrame>, ChatError>;
}
pub struct EmittedRequest {
    pub path: String,
    pub body: Vec<u8>,
    pub headers: Vec<(String, String)>, // `anthropic-version`, etc.
}
```
The trait makes four deliberate choices.
First, parse-and-emit are separate methods, not a single round-trip. The pipeline often parses on one format and emits on another; baking that asymmetry into the trait means there is no temptation to write a "translator" that only works for one direction.
Second, the trait is bytes-in / bytes-out at the edges and a typed `ChatRequest` / `ChatResponse` in the middle. That keeps wire formats out of the rest of the codebase: telemetry, guardrails, and cache code never look at raw JSON.
Third, streaming is opaque-frame in, hub-event out, not "parse the whole stream." A frame is the unit Pingora's response body filter sees, and the SSE framing layer (`event:` / `data:` / blank line) is identical across providers. Only the payload differs.
Fourth, `ChatError` is the formats' error type, with HTTP status carried inline. Format errors map directly to client errors; transport errors are caught upstream and never reach the format layer.
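To make the fourth choice concrete, here is one way an error type with the HTTP status carried inline could look. This is a hedged sketch: the ADR does not specify `ChatError`'s fields, so the `status` / `message` layout, the `bad_request` constructor, and the `require_body` helper are all assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical shape for the format layer's error type: the HTTP
/// status travels with the error so the caller can map it directly.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChatError {
    pub status: u16,
    pub message: String,
}

impl ChatError {
    /// Convenience constructor for client-side (400) format errors.
    pub fn bad_request(message: impl Into<String>) -> Self {
        ChatError { status: 400, message: message.into() }
    }
}

impl fmt::Display for ChatError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} {}", self.status, self.message)
    }
}

impl std::error::Error for ChatError {}

/// Toy validation step of the kind a `parse_request` might start with:
/// an empty body is a client error, not a transport error.
pub fn require_body(bytes: &[u8]) -> Result<&[u8], ChatError> {
    if bytes.is_empty() {
        Err(ChatError::bad_request("empty request body"))
    } else {
        Ok(bytes)
    }
}
```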
## Hub format shape
The hub `ChatRequest` and `ChatResponse` shapes are deliberately close to the OpenAI chat-completions JSON shape. OpenAI's chat-completions is the closest existing shape to a lowest common denominator: it has roles, message-level content arrays, tool calls, tool results, finish reasons, usage tokens, and streaming deltas, and every other provider's shape can be projected into it without losing the load-bearing fields.
```rust,ignore
// crates/sbproxy-ai/src/format/types.rs
pub struct ChatRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
    pub tools: Vec<Tool>,
    pub tool_choice: ToolChoice,
    pub max_tokens: Option<u32>,
    pub temperature: Option<f32>,
    pub top_p: Option<f32>,
    pub top_k: Option<u32>,                  // hub keeps it even though OpenAI lacks it
    pub stop: Vec<String>,
    pub stream: bool,
    pub system: Option<String>,              // hoisted out of messages on parse
    pub metadata: ChatMetadata,              // request id, user id, workspace id
    pub extensions: BTreeMap<String, Value>, // see below
}
pub struct ChatMessage {
    pub role: Role,                   // System | User | Assistant | Tool
    pub content: Vec<ContentPart>,
    pub name: Option<String>,
    pub tool_call_id: Option<String>, // set when role == Tool
}
pub enum ContentPart {
    Text { text: String },
    Image { source: ImageSource, media_type: String },
    ToolUse { id: String, name: String, input: Value },
    ToolResult { tool_call_id: String, content: String, is_error: bool },
}
pub struct ToolCall {
    pub id: String,
    pub name: String,
    pub arguments: Value, // typed JSON, not the OpenAI string-of-JSON
}
pub struct ChatResponse {
    pub id: String,
    pub model: String,
    pub content: Vec<ContentPart>,
    pub tool_calls: Vec<ToolCall>,
    pub finish_reason: FinishReason,
    pub usage: Usage,
    pub extensions: BTreeMap<String, Value>,
}
pub enum FinishReason {
    Stop,
    Length,
    ToolCalls,
    ContentFilter,
    Other(String), // a provider can survive a finish_reason we have not seen
}
```
Three places the hub deliberately diverges from OpenAI's shape:
1. **Tool-call `arguments` are typed JSON, not a string.** OpenAI ships `function.arguments` as a string containing JSON, because the OpenAI streaming protocol assembles that string token by token. Anthropic ships it as a real JSON object. Storing the typed value in the hub means the OpenAI emitter is responsible for stringification (a one-line `serde_json::to_string`) and every other consumer (Anthropic, Gemini, Bedrock, telemetry, guardrails) gets the structured form for free.
2. **`top_k` is in the hub even though OpenAI lacks it.** Anthropic, Gemini, and Bedrock all accept `top_k`, and dropping it on the OpenAI inbound would silently degrade sampling control for users routing OpenAI-shape requests at an Anthropic upstream. The OpenAI emitter drops it on the way out.
3. **`system` is a single optional string, not interleaved.** OpenAI permits `system` messages anywhere in the array; Anthropic requires a single top-level `system` field. The hub stores `system` as a single string (concatenated with `\n\n` on parse if the inbound had several system turns) and every emitter that wants per-turn system has to re-derive it. In practice no upstream wants per-turn system; the round-trip is lossy at the wire level (you cannot tell after the fact whether the original had one system message or three concatenated ones), but lossless at the semantic level (the model sees the same prompt).
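The third divergence, hoisting `system` turns into one string, can be sketched as follows. This is a simplified illustration that models each message as a `(Role, String)` pair rather than the full `ChatMessage`; the `hoist_system` function name is hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}

/// Split OpenAI-style messages into the hub's single `system` string
/// and the remaining non-system turns, preserving their order. Several
/// system turns are joined with a blank line ("\n\n").
pub fn hoist_system(messages: Vec<(Role, String)>) -> (Option<String>, Vec<(Role, String)>) {
    let mut system_parts = Vec::new();
    let mut rest = Vec::new();
    for (role, text) in messages {
        if role == Role::System {
            system_parts.push(text);
        } else {
            rest.push((role, text));
        }
    }
    let system = if system_parts.is_empty() {
        None
    } else {
        Some(system_parts.join("\n\n"))
    };
    (system, rest)
}
```

After the join there is no way to tell how many system turns the client sent, which is the wire-level lossiness the list item describes.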
The `extensions` map is the escape valve for provider-specific knobs the hub does not model. Anthropic `cache_control` blocks land in `extensions["anthropic.cache_control"]`; OpenAI `response_format: json_object` lands in `extensions["openai.response_format"]`. Each emitter looks for the extensions namespaced to its own format and applies them; everyone else ignores them. The namespacing rule is enforced at parse time so a misnamed key is a 400 to the client, not a silent drop on the upstream.
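The namespacing rule might be enforced with a check along these lines. It is a sketch under stated assumptions: the `KNOWN_FORMAT_IDS` list and both function names are hypothetical, and the error strings stand in for whatever a 400 response body would actually carry.

```rust
/// Format ids accepted as extension namespaces; the list is illustrative.
const KNOWN_FORMAT_IDS: &[&str] = &["openai", "anthropic", "gemini", "bedrock"];

/// Accept keys of the form `<format id>.<field>`; anything else is the
/// kind of error a parser would surface to the client as a 400.
pub fn validate_extension_key(key: &str) -> Result<(), String> {
    match key.split_once('.') {
        Some((ns, name)) if !name.is_empty() && KNOWN_FORMAT_IDS.contains(&ns) => Ok(()),
        Some((ns, _)) if !KNOWN_FORMAT_IDS.contains(&ns) => {
            Err(format!("unknown extension namespace `{ns}`"))
        }
        _ => Err(format!("extension key `{key}` must look like `<format>.<field>`")),
    }
}

/// The subset of extension keys an emitter for `format_id` would apply;
/// every other key is ignored by that emitter.
pub fn keys_for_format<'a>(keys: &[&'a str], format_id: &str) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|k| k.split_once('.').map_or(false, |(ns, _)| ns == format_id))
        .collect()
}
```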
`ChatEvent` is the streaming counterpart and has a deliberately small vocabulary, covered in its own section below.
## Inbound endpoints
Three inbound parsers, registered into a parser registry keyed by inbound path:
- `/v1/chat/completions` (OpenAI): the existing route, refactored to call `OpenAiFormat::parse_request`. This is the pass-through path; the registry can short-circuit it when both inbound and outbound are OpenAI, skipping the hub entirely so the no-translation hot path is byte-for-byte identical.
- `/v1/messages` (Anthropic): new route. Backed by `AnthropicFormat::parse_request`. Existing Anthropic clients (the Anthropic SDK, Claude Code, Cursor) point at this path and Just Work, including when the configured upstream is OpenAI or Gemini.
- `/v1/responses` (OpenAI Responses): new route. Backed by `OpenAiResponsesFormat::parse_request`. The Responses shape is OpenAI's stateful-conversation API; the hub parser flattens it into a stateless `ChatRequest` and the response emitter re-wraps the result.
The registry is a small struct in `crates/sbproxy-ai/src/format/registry.rs` that holds a map from inbound path to `Arc<dyn ChatFormat>`. Outbound is selected from the provider config (each provider declares its format in `ai_providers.yml`), so the runtime never has to guess which emitter to use.
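A path-keyed registry of this kind could look roughly like the sketch below. The `ChatFormat` trait here is a cut-down stand-in with only the two methods the registry needs, and `FormatRegistry::register` / `for_path` are hypothetical method names; the duplicate-path check is an added assumption, not something the ADR specifies.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Cut-down stand-in for the ADR's `ChatFormat` trait.
pub trait ChatFormat: Send + Sync + 'static {
    fn id(&self) -> &'static str;
    fn inbound_paths(&self) -> &'static [&'static str];
}

pub struct OpenAiFormat;
impl ChatFormat for OpenAiFormat {
    fn id(&self) -> &'static str { "openai" }
    fn inbound_paths(&self) -> &'static [&'static str] { &["/v1/chat/completions"] }
}

pub struct AnthropicFormat;
impl ChatFormat for AnthropicFormat {
    fn id(&self) -> &'static str { "anthropic" }
    fn inbound_paths(&self) -> &'static [&'static str] { &["/v1/messages"] }
}

#[derive(Default)]
pub struct FormatRegistry {
    by_path: HashMap<&'static str, Arc<dyn ChatFormat>>,
}

impl FormatRegistry {
    /// Register a format under every inbound path it claims. If another
    /// format already claims one of them, nothing is inserted and the
    /// conflicting path is returned as the error.
    pub fn register(&mut self, format: Arc<dyn ChatFormat>) -> Result<(), &'static str> {
        for &path in format.inbound_paths() {
            if self.by_path.contains_key(path) {
                return Err(path);
            }
        }
        for &path in format.inbound_paths() {
            self.by_path.insert(path, Arc::clone(&format));
        }
        Ok(())
    }

    /// Look up the inbound parser for a request path.
    pub fn for_path(&self, path: &str) -> Option<Arc<dyn ChatFormat>> {
        self.by_path.get(path).cloned()
    }
}
```

Checking all paths before inserting any keeps a failed registration from leaving the map half-updated.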
Configuration touches one new field on the AI gateway block, and inbound-path support is opt-in:
```yaml
ai:
  inbound_formats:
    - openai            # /v1/chat/completions, always on for back-compat
    - anthropic         # /v1/messages, opt-in
    - openai_responses  # /v1/responses, opt-in
  providers:
    - id: claude-sonnet
      format: anthropic
      url: https://api.anthropic.com
      models: [claude-3-5-sonnet]
```
Making inbound formats opt-in is the conservative default. If we turned on `/v1/messages` for every operator who upgrades, we would hijack traffic for any operator who already routes `/v1/messages` through SBproxy to a real Anthropic upstream as a transparent proxy.
## Streaming translation
Streaming is the highest-leverage and the highest-risk part of this design, so the hub event vocabulary is deliberately tiny.
```rust,ignore
pub enum ChatEvent {
    MessageStart { id: String, model: String },
    ContentDelta { index: usize, part: ContentPartDelta },
    ToolCallDelta { index: usize, delta: ToolCallDelta },
    Usage(Usage),
    MessageStop { finish_reason: FinishReason },
}
pub enum ContentPartDelta {
    Text(String),
    // Image / ToolResult are non-streaming today; they appear in full
    // inside MessageStart-adjacent metadata, not as deltas.
}
pub struct ToolCallDelta {
    pub id: Option<String>,              // present in the first delta
    pub name: Option<String>,            // present in the first delta
    pub arguments_chunk: Option<String>, // raw JSON chunk for OpenAI;
                                         // Anthropic emits whole objects
}
```
Five events cover every provider we have looked at. The mapping table:
| Hub event | OpenAI SSE | Anthropic SSE | Gemini SSE | Bedrock event-stream |
|---|---|---|---|---|
| `MessageStart` | first `data:` with `id` | `event: message_start` | first chunk with `responseId` | `:event-type: messageStart` |
| `ContentDelta` | `delta.content` | `event: content_block_delta` (text) | `candidates[0].content.parts[].text` | `:event-type: contentBlockDelta` (text) |
| `ToolCallDelta` | `delta.tool_calls[]` | `event: content_block_delta` (input_json_delta) | `functionCall.args` partials | `:event-type: contentBlockDelta` (toolUse) |
| `Usage` | last chunk (`usage` block when `stream_options.include_usage`) | `event: message_delta` (`usage`) | `usageMetadata` on final chunk | `:event-type: metadata` |
| `MessageStop` | `data: [DONE]` after `finish_reason` chunk | `event: message_stop` | `finishReason` field | `:event-type: messageStop` |
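On the consuming side, `ToolCallDelta` events have to be folded back into complete tool calls before the arguments can be parsed as JSON. The sketch below shows one way to do that; `ToolCallAssembler` and `AssembledToolCall` are hypothetical helper types, and the arguments are kept as a raw string so the example needs no JSON crate.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Default, Clone, PartialEq)]
pub struct ToolCallDelta {
    pub id: Option<String>,
    pub name: Option<String>,
    pub arguments_chunk: Option<String>,
}

#[derive(Debug, Default, Clone, PartialEq)]
pub struct AssembledToolCall {
    pub id: String,
    pub name: String,
    /// Concatenated argument chunks; valid JSON only once the stream ends.
    pub arguments_json: String,
}

/// Accumulates deltas per tool-call index, in index order.
#[derive(Default)]
pub struct ToolCallAssembler {
    calls: BTreeMap<usize, AssembledToolCall>,
}

impl ToolCallAssembler {
    pub fn push(&mut self, index: usize, delta: ToolCallDelta) {
        let call = self.calls.entry(index).or_default();
        // `id` and `name` arrive on the first delta; later deltas leave them unset.
        if let Some(id) = delta.id {
            call.id = id;
        }
        if let Some(name) = delta.name {
            call.name = name;
        }
        if let Some(chunk) = delta.arguments_chunk {
            call.arguments_json.push_str(&chunk);
        }
    }

    /// Consume the assembler and return the completed calls by index.
    pub fn finish(self) -> Vec<AssembledToolCall> {
        self.calls.into_values().collect()
    }
}
```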
Three rules keep the streaming path honest.
First, **frames are the unit, not bytes.** Every translator gets a complete SSE frame (parsed by the same SSE framer in `sbproxy-transport`, which already exists for HTTP/2 push and gRPC). A translator never sees a partial frame, so it never has to buffer.
Second, **a single upstream frame may produce zero or many hub events.** Anthropic's `message_start` frame carries enough state to emit both `MessageStart` and a "seed" usage record; OpenAI's first chunk emits only `MessageStart`. Returning `Vec<ChatEvent>` makes that explicit.
Third, **emitters own terminator framing.** OpenAI requires a trailing `data: [DONE]`; Anthropic does not. Bedrock has a binary event-stream framing layer that wraps the SSE payload. Each emitter is responsible for getting the goodbye right.
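The "frames are the unit" rule implies a framing step that runs before any translator. A minimal sketch of such a framer follows. It is not the `sbproxy-transport` framer: the `SseFrame` fields and both function names are assumptions, it works on `&str` rather than bytes, and it only recognizes LF line endings, whereas real SSE streams may also use CRLF.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SseFrame {
    pub event: Option<String>,
    pub data: String,
}

/// Split a buffer into complete frames (each terminated by a blank
/// line) and return them together with the unconsumed tail, which the
/// caller keeps until more bytes arrive.
pub fn split_frames(buf: &str) -> (Vec<SseFrame>, &str) {
    let mut frames = Vec::new();
    let mut rest = buf;
    while let Some(end) = rest.find("\n\n") {
        let raw = &rest[..end];
        rest = &rest[end + 2..];
        let mut event = None;
        let mut data: Vec<&str> = Vec::new();
        for line in raw.lines() {
            if let Some(v) = line.strip_prefix("event:") {
                event = Some(v.trim_start().to_string());
            } else if let Some(v) = line.strip_prefix("data:") {
                // A single leading space after the colon is not part of the value.
                data.push(v.strip_prefix(' ').unwrap_or(v));
            }
        }
        if event.is_some() || !data.is_empty() {
            frames.push(SseFrame { event, data: data.join("\n") });
        }
    }
    (frames, rest)
}

/// True for the OpenAI-style terminator frame (`data: [DONE]`).
pub fn is_openai_done(frame: &SseFrame) -> bool {
    frame.event.is_none() && frame.data == "[DONE]"
}
```

Because incomplete input is handed back as the tail, a translator built on top of this never sees a partial frame.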
The pass-through hot path is unchanged: when inbound and outbound are both OpenAI, the registry detects the match and the streaming bytes are forwarded with zero parsing. This matters because OpenAI-compatible upstreams are still the common case and any streaming overhead is paid per token.
## Cross-format lossiness
Three classes of feature do not survive every cross-format hop, and the hub will say so out loud rather than dropping silently.
**Anthropic `cache_control` blocks** mark message content for Anthropic's prompt caching. There is no OpenAI analog. When the inbound is Anthropic and the outbound is OpenAI:
1. The parser stashes the blocks in `extensions["anthropic.cache_control"]` so they round-trip if the outbound is also Anthropic.
2. The OpenAI emitter drops the extension and adds one entry to the request's `lossiness` log (a `Vec` of lossiness notes on `ChatRequest` that telemetry exports as a span attribute).
3. The classifier increments a `sbproxy_ai_format_lossy_field_total{field="anthropic.cache_control",direction="downgrade"}` counter so operators can see it on a dashboard.
This is "warn and best-effort." The request still goes through; the model still answers; the operator can see in metrics and traces that the cache hint was dropped.
**Anthropic thinking blocks** (`type: thinking` content blocks) come back from extended-thinking models. OpenAI o1 and o3 emit a similar concept (`reasoning_content`) but with different framing and no streamable shape. The hub keeps thinking as a first-class `ContentPart::Thinking { signature, text }` variant so any inbound parser that sees it preserves it on the way to any outbound emitter that knows what to do with it; emitters that do not (OpenAI Chat Completions today) drop it with a `lossiness` note.
**OpenAI `response_format: json_schema`** is a structured-output mode OpenAI implements at decoding time. Anthropic and Gemini have similar features with different schemas and different field names. The hub does not model structured output as a first-class field today; it lives in `extensions["openai.response_format"]` and only the OpenAI emitter applies it. Cross-emitting from OpenAI to Anthropic with a `response_format` request adds a lossiness note and the operator's tests are likely to fail. This is the loudest of the three: we will document it in `ai-gateway.md` as a known limitation and revisit when WOR-... follow-ups land.
Lossiness notes carry three fields: the field name, the direction (`downgrade` or `unsupported`), and a short string explaining the effect. They surface in OpenInference spans (as a `lossiness` attribute on the parent span) and in structured logs at WARN level once per request. They do not block the request.
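The three-field note and the "drop and record" step an emitter performs might be sketched like this. The type and function names (`LossinessNote`, `Direction`, `drop_foreign_extensions`) are hypothetical, extension values are plain strings to avoid a JSON dependency, and every dropped key is recorded as a `downgrade` for simplicity.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Direction {
    Downgrade,
    Unsupported,
}

/// One lossiness note: which field, which direction, and what happened.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LossinessNote {
    pub field: String,
    pub direction: Direction,
    pub effect: String,
}

/// Remove every extension not namespaced to `target_format` and return
/// one note per dropped key. The request itself is never rejected.
pub fn drop_foreign_extensions(
    extensions: &mut BTreeMap<String, String>,
    target_format: &str,
) -> Vec<LossinessNote> {
    let prefix = format!("{target_format}.");
    let foreign: Vec<String> = extensions
        .keys()
        .filter(|k| !k.starts_with(&prefix))
        .cloned()
        .collect();
    foreign
        .into_iter()
        .map(|key| {
            extensions.remove(&key);
            LossinessNote {
                effect: format!("dropped when emitting to {target_format}"),
                field: key,
                direction: Direction::Downgrade,
            }
        })
        .collect()
}
```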
## Migration path
The existing Anthropic translator at `crates/sbproxy-ai/src/translators/anthropic.rs` becomes two halves of one `AnthropicFormat` implementor. `request_to_native` is the bones of `emit_request`; `response_to_openai` is the bones of `parse_response` plus a no-op `emit_response`. The free-function API in `translators/mod.rs` stays as a deprecated shim for one release so any out-of-tree callers do not break.
Implementation breaks into roughly six to eight chunks. Each one is small enough to land and be CI-gated on its own, in line with the workspace's tracer-bullet preference.
1. **Hub types and registry.** Land `ChatRequest`, `ChatResponse`, `ChatMessage`, `ContentPart`, `ToolCall`, `ChatEvent`, the `ChatFormat` trait, and an empty `FormatRegistry`. No wire integration yet; the crate compiles and has unit tests for the types.
2. **OpenAI format as the identity.** Implement `OpenAiFormat: ChatFormat` so the existing `/v1/chat/completions` path can go through the hub on a feature flag. Round-trip every existing AI e2e test through the hub under the flag; flip the flag once green.
3. **Anthropic format migration.** Port the current translator into `AnthropicFormat`. Add an outbound test matrix (OpenAI inbound, Anthropic outbound) that proves byte-equivalent behavior with the legacy free-function path. Delete the free functions once the matrix is green for two releases.
4. **`/v1/messages` inbound.** Register `AnthropicFormat` as an inbound parser, gated by `inbound_formats: [..., anthropic]`. Add a route handler that picks the format from path. New e2e: Anthropic SDK against SBproxy against an OpenAI upstream.
5. **`/v1/responses` inbound.** Add `OpenAiResponsesFormat`. The Responses shape has stateful conversation handling that the hub will flatten; add a stateless emitter back to Responses for the round-trip.
6. **Streaming.** Implement `parse_event` / `emit_event` for OpenAI, Anthropic, and OpenAI Responses. Add a streaming conformance test (one fixture per provider, replayed deterministically).
7. **Gemini format.** Add `GeminiFormat` (request + response + streaming). Lights up Gemini and Vertex upstreams without a Google-side translator code path elsewhere.
8. **Bedrock format.** Add `BedrockFormat`. Bedrock's binary event-stream wrapping is the tricky part; SigV4 stays in the existing auth layer.
The first six chunks ship a working hub with three inbound shapes and three outbound shapes. Chunks seven and eight are independent and can ship in either order.
## Alternatives considered
**Per-pair translators (the status quo).** Keep adding `translate_request_anthropic_to_openai`, `translate_request_gemini_to_openai`, and so on, fanning out to one function per pair. The translator file already has Gemini and Bedrock as TODO comments. Cost: N times M code paths, duplicated streaming logic, observability hooks duplicated per pair. Wins: zero new types, no abstraction, easy to grep. We rejected this because the duplication compounds with every provider and the streaming demuxer in particular is too large to write five times.
**Upstream-only routing through OpenRouter or LiteLLM.** Send every non-OpenAI provider through OpenRouter or a sidecar LiteLLM. Wins: zero in-process translation; OpenRouter's pricing is already integrated. Cost: an extra network hop, opaque routing decisions, no control over guardrails or PII redaction (they fire after the hop), no streaming visibility, vendor lock to OpenRouter's evolution. We rejected this because the whole pitch of "the AI gateway built like a real proxy" is that everything happens in process; an external hop defeats that.
**Fork OpenAI's Python SDK shapes and use them verbatim as the hub.** Mirror OpenAI's Python `Pydantic` types in Rust and treat the OpenAI shape (with `.arguments` as a string, no `top_k`) as the canonical form. Wins: zero invention; copy from a working spec. Cost: locks the hub to OpenAI's evolution (Responses already obsoletes parts of it), forces every Anthropic-only field through a string-of-JSON keyhole, and makes structured tool arguments awkward to inspect. We rejected this because the OpenAI shape is the closest existing shape, not a correct hub. The hub diverges in three places (typed `arguments`, hub-only `top_k`, single `system`) on purpose.
**One trait, but bytes-in / bytes-out at the trait surface (no hub types).** Make `ChatFormat` a `(format_a, format_b, bytes_in) -> bytes_out` API and skip the canonical types. Wins: minimum allocations on the no-translation path. Cost: telemetry, guardrails, caching, and cost routing all have to re-parse the bytes; we are back to N times M for those features. We rejected this because the bytes-in / bytes-out surface only solves the translation problem and leaves four other features uncovered.
## Open questions
These are genuinely undecided and need an answer before this ADR closes; do not treat the absence of an answer as a sign the design will not change.
1. **Cost routing and inbound model names.** Today the cost router keys on the OpenAI model name. When the inbound is Anthropic Messages with `model: claude-3-5-sonnet`, does the router look up Anthropic pricing, or does it expect the operator's `ai_providers.yml` to declare an alias? Probably the latter, but the alias-resolution path needs a design.
2. **Guardrail input scope on multi-turn conversations.** The prompt-injection classifier inspects the latest user message today. With Anthropic-style messages where a `tool_result` block can carry attacker-controlled text from a previous tool call, the "latest user message" is the wrong scope. Hub-level: scan every `Tool` role message too? Open.
3. **Streaming back-pressure.** The hub emits `Vec<ChatEvent>` per upstream frame. If a slow client cannot keep up with the upstream's frame rate, we either buffer (memory pressure) or drop (correctness loss). Pingora already has body-write back-pressure; need to confirm that the trait surface composes with it cleanly when the emitter produces several SSE frames per hub event.
4. **`extensions` versioning.** Provider wire formats evolve. If Anthropic adds a new `cache_control` mode, every old parser will silently drop it. Do we pin a wire-version per format, fail closed on unknown extensions, or warn? Probably "warn and pass through under a versioned key," but the policy is not written yet.
5. **`/v1/responses` stateful mode.** The Responses API has a `previous_response_id` field that points at a prior conversation. The hub flattens to stateless requests; the operator-facing question is whether SBproxy stores those conversations itself or refuses the field. Refusing is the conservative answer for v1, but it breaks `client.responses.create(previous_response_id=...)` calls.
6. **Schema discipline for `extensions`.** Today the rule is "namespace by format id" but it is not enforced beyond a runtime check. A JSON Schema fragment per format would let the config compiler validate at load time. Worth doing in chunk one or worth deferring? Open.
7. **Where does the AWS event-stream wrapper live?** Bedrock's streaming layer is non-trivial. Inside `BedrockFormat::parse_event`, or in a `sbproxy-transport` helper that other AWS services could share? Leaning toward the helper, but not certain until the second AWS-shape provider lands.
================================================================
# docs/adr-outbound-credential-resolver.md
================================================================
## ADR: outbound credential resolver, OSS vs enterprise line
*Last modified: 2026-05-24*
Status: accepted. Drives the move of outbound-credential-resolver basics into OSS.
## Context
SBproxy's stated differentiator is the outbound credential resolver: the
gateway mints or exchanges the right credential for each upstream so the
agent or client never handles a per-upstream secret. A request arrives
with one identity; the proxy presents a different, correctly-scoped
credential to each upstream it talks to.
Until now the whole resolver was an enterprise capability. The OSS binary
shipped `sbproxy-vault` (secret resolution and rotation) but no outbound
*minting*: RFC 8693 token exchange, the OAuth client-credentials grant,
broker JWT re-sign, DPoP, and stored per-user OAuth grants were all paid.
Two things changed that make this line wrong:
1. **The basic mechanism is no longer category-unique.** Per-upstream
outbound credential brokering is now offered by AWS Bedrock AgentCore
Gateway, Pomerium, Auth0 / Okta Token Vault, Arcade, and Scalekit. RFC
8693 token exchange is generally available in Keycloak 26.2 and Okta.
A self-hostable gateway whose headline differentiator is paywalled
looks behind on its own pitch.
2. **Two open competitors are racing for the same square.** agentgateway
(Rust, open) and Bifrost (Go, open) target the self-hostable agent
gateway niche. If the OSS binary cannot even demonstrate the resolver,
the wedge is undefended.
The differentiator has to move up the stack. The basic minting mechanism
becomes table-stakes that OSS must show; the durable, monetizable value
moves to operating that mechanism at scale.
## Decision
OSS ships the **mechanism**: enough to resolve a per-upstream outbound
credential three ways, single-tenant, statically configured, with the
safety rails that make exchange safe to run. Enterprise keeps **operation
at scale**: per-user delegated identity, sender-constrained tokens,
broker-as-issuer, multi-tenant and multi-source entitlements, and the
hardware-backed and compliance tooling around all of it.
This mirrors the split already used elsewhere in the product: the
mechanism is OSS; the operational, multi-tenant, hardware-backed, and
compliance-grade layers are enterprise.
### OSS (the basics)
- **RFC 8693 token exchange.** Exchange a subject token for an
upstream-audience token (`grant_type=urn:ietf:params:oauth:grant-type:token-exchange`).
- **OAuth client-credentials grant** per upstream.
- **Vault-resolved static secret** per upstream (already in OSS; exposed
through the unified resolver).
- **The unified `outbound_credential_resolver` config surface**: per
origin, select one of the three modes. This is the artifact that
demonstrates the wedge.
- **The safety rails that ride with exchange**, shipped together with it
and never separable: `subject_token_issuers` and
`allowed_token_exchange_audiences` allowlists, the `act` delegation
chain with a depth cap, and a single-process minted-token cache with
TTL. A basic feature must not ship in an unsafe configuration; security
rails are not a paid add-on.
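
As an illustration of the token-exchange mode and its audience
allowlist, here is a minimal Python sketch. It is not the resolver's
code: `build_token_exchange_form` and `ALLOWED_AUDIENCES` are
hypothetical names, and the audiences are made up. Only the form
parameter names and URNs come from RFC 8693.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical stand-in for the `allowed_token_exchange_audiences` allowlist.
ALLOWED_AUDIENCES = {"https://billing.internal", "https://search.internal"}

TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_token_exchange_form(subject_token: str, audience: str) -> str:
    """Return the form-encoded body of an RFC 8693 token-exchange request."""
    if audience not in ALLOWED_AUDIENCES:
        # Fail closed: never request an audience the operator did not allow.
        raise ValueError(f"audience not allowlisted: {audience}")
    return urlencode({
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": subject_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,
        "requested_token_type": ACCESS_TOKEN_TYPE,
    })


body = build_token_exchange_form("inbound-token", "https://billing.internal")
fields = parse_qs(body)  # decoded view of what would be POSTed to the IdP
```

The allowlist check runs before any request is built, which is the
point of shipping the rails together with the mechanism.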
### Enterprise (operation at scale)
- **Stored OAuth grants / per-user token vault**: device-code and
interactive-consent flows, refresh-token lifecycle, per-user delegated
identity. This is the operationally hard, high-value capability that
comparable products charge for.
- **Broker JWT re-sign and issuer-vouched / broker-augmented identity
(CIMD)**: the broker becomes the issuer. Needs hardware-backed keys and
is compliance-grade.
- **Sender-constrained tokens (DPoP, mTLS-bound).**
- **Multi-source entitlements, multi-tenant credential isolation, and
hardware-backed broker keys.** Combining identity across an identity
provider, workload identity, and an entitlement service, isolated per
tenant, is the enterprise operational job.
### The crux: RFC 8693 itself is OSS
The one genuinely debatable item is token exchange. It is OSS. Keeping it
paid is indefensible now that it is generally available across the IdP
market, and an open binary that cannot show token exchange cedes the
narrative to the open competitors. The differentiator survives because
the operational layer (stored per-user grants, broker-as-issuer,
multi-tenant, hardware-backed, audited) stays enterprise, and that is
where buyers actually spend.
## Consequences
- The OSS binary can demonstrate, end to end and without a license:
"per-upstream credentials, minted three ways, no client-side secret
handling, self-hosted." That is the wedge, defended.
- Enterprise sells the operational story: "operate that for thousands of
users across dozens of upstreams, sender-constrained, broker-issued,
and audited."
- The OSS resolver is single-tenant and statically configured by design.
Multi-tenant isolation and dynamic, per-user credential lifecycle are
the natural upgrade boundary, so the line is legible to operators
rather than arbitrary.
- The resolver is a closed enum of modes, so an operator who needs a mode
the OSS binary does not implement gets a config-load error rather than
a silent fallback to an unsafe default.
## Implementation
PR 1 lands this ADR and the OSS resolver subsystem: the config surface,
the three minting modes, the allowlists, and the `act`-chain depth cap,
with unit coverage including a mock token endpoint. A follow-up wires the
resolver into the outbound request path per upstream and adds the
end-to-end test (request to upstream A gets credential A; request to
upstream B gets credential B).
================================================================
# docs/agent-budget.md
================================================================
## agent_budget policy
*Last modified: 2026-05-31*
The `agent_budget` policy is a semantic rate-limit primitive keyed on the resolved `agent_id`. Standard per-IP / per-user / per-key limits assume humans pause between requests; agents driven by an LLM loop fire at network speed and trip those buckets immediately. Datadog reports roughly a third of LLM-span errors in production are rate-limit denials for exactly that reason.
One bucket per named agent collapses "every request from the Cursor instance" or "every request from the same OpenAI Assistant" into a single budget that an operator can actually size. The `agent_id` comes from the agent-class resolver (`sbproxy-agent-detect` / `sbproxy-classifiers`); when no `agent_id` resolves, the policy applies the `on_anonymous` rule.
## Config
```yaml
origins:
  "ai.example.com":
    upstream: https://api.openai.com
    auth:
      type: bearer
    policies:
      - type: agent_budget
        # Token-bucket refill rate, per agent_id.
        requests_per_minute: 60
        # Rolling LLM-token budget per agent_id. The token bucket
        # exists in the policy API; consumption is wired in via the
        # AI-usage tracker. Configuring without that wiring is a no-op
        # on the token field today.
        tokens_per_hour: 100000
        # Max simultaneous in-flight requests per agent_id. RAII guard
        # releases the slot when the request completes.
        burst: 10
        # What to do when the cap fires.
        # - deny (default): respond 429.
        # - log: emit the decision metric, pass the request through.
        # - downgrade: dispatcher routes to a cheaper model.
        on_exceed: deny
        # What to do when the request has no resolved agent_id.
        # - skip (default): no enforcement.
        # - shared: all anonymous requests share one bucket.
        on_anonymous: skip
```
## Decisions
The policy reports its verdict to the dispatcher; the dispatcher maps the verdict to a real action:
| Verdict | `on_exceed` | HTTP outcome |
|---|---|---|
| Within budget | n/a | pass through |
| Cap fired, deny | `deny` | 429 with `Retry-After` |
| Cap fired, log | `log` | pass through, metric increments |
| Cap fired, downgrade | `downgrade` | dispatcher picks the cheaper AI provider for this request |
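
The verdict logic can be sketched as a per-agent token bucket. This is a minimal Python illustration under stated assumptions, not the policy's implementation: the class names, the `__anonymous__` shared-bucket key, and the explicit `now` argument are all hypothetical, and the `burst` concurrency cap and `tokens_per_hour` are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    tokens: float
    last_refill: float


@dataclass
class AgentBudget:
    requests_per_minute: int = 60
    on_exceed: str = "deny"      # deny | log | downgrade
    on_anonymous: str = "skip"   # skip | shared
    buckets: dict = field(default_factory=dict)

    def check(self, agent_id, now):
        """Return the verdict for one request arriving at `now` (seconds)."""
        if agent_id is None:
            if self.on_anonymous == "skip":
                return "pass"
            agent_id = "__anonymous__"  # hypothetical key for the shared bucket
        capacity = float(self.requests_per_minute)
        bucket = self.buckets.setdefault(agent_id, Bucket(capacity, now))
        # Refill in proportion to elapsed time, capped at one minute of budget.
        elapsed = now - bucket.last_refill
        bucket.tokens = min(capacity, bucket.tokens + elapsed * capacity / 60.0)
        bucket.last_refill = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return "pass"
        # The dispatcher maps this verdict to a 429, a log line, or a downgrade.
        return self.on_exceed
```

Passing the clock in as `now` keeps the sketch deterministic; with `requests_per_minute=2`, two requests at `t=0` pass, a third gets the `on_exceed` verdict, and one more passes at `t=30` once a token has refilled.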
## Observability
* `sbproxy_policy_triggers_total{origin, policy_type="agent_budget", action="block"}` increments each time the policy denies a request (`on_exceed: deny`).
* `sbproxy_ai_budget_utilization_ratio{origin, agent_id}` gauge reports the current utilisation per agent.
* Access log: `policy_action` set to the verdict; `agent_id`, `agent_class`, `agent_vendor` carry the resolved agent identity.
## Why per-agent
A standard rate-limit policy keyed on IP or API key cannot distinguish "Cursor making 200 background completions while the user types" from "an attacker fanning out 200 distinct concurrent prompts". Both look identical to an IP-keyed bucket. Keying on `agent_id` (the resolved agent identity, not the network address) lets the operator size the legitimate background traffic without hardening to it, and lets the abuse path get blocked cleanly because the attacker cannot produce a fresh `agent_id` per request without re-resolving against the agent registry.
## Out of scope for slice 1
* Cluster-shared budgets. Each proxy enforces its own local view; an attacker spreading across replicas sees N times the per-instance budget. A cluster-shared backend (Redis or shared KV) is the obvious follow-up; for now, treat the per-instance budget as the floor.
* Upstream token accounting. `tokens_per_hour` is wired into the policy API but only consumed when the AI gateway calls `AgentBudgetPolicy::consume_tokens`. A follow-up wires that into `sbproxy-ai`'s usage tracker.
## See also
* [features.md](./features.md) - tour with policy examples.
* [examples/agent-budget/](../examples/agent-budget/) - runnable per-agent rate-limit fixture.
* [ai-gateway.md](./ai-gateway.md) - the AI surfaces the budget protects.
* [configuration.md](./configuration.md) - the full schema.
================================================================
# docs/agent-skills.md
================================================================
## Agent Skills v0.2.0
*Last modified: 2026-05-09*
SBproxy serves an Agent Skills v0.2.0 discovery manifest at
`/.well-known/agent-skills/index.json`. Cooperative agents fetch the
manifest to discover the skills the origin advertises, then fetch each
artifact at the URL the manifest pins. Every artifact body is
hashed (SHA-256) at config-load time and re-hashed on every serve.
The schema lives at
`https://schemas.agentskills.io/discovery/0.2.0/schema.json`. The
originating RFC is at
`https://github.com/cloudflare/agent-skills-discovery-rfc`.
## What it does
The Agent Skills projection is a sibling of the four Wave 4
projections (`robots.txt`, `llms.txt`, `licenses.xml`,
`tdmrep.json`). All five are derived from the compiled config snapshot
and refreshed atomically on every config reload.
Each entry in the manifest carries:
- `name` - stable identifier.
- `type` - closed enum, `skill-md` or `archive`.
- `description` - one-line capability summary.
- `url` - relative, path-absolute, or fully-qualified.
- `digest` - `sha256:`-prefixed SHA-256 digest of the artifact body.
URLs are resolved per RFC 3986 against the request authority at serve
time, so the manifest's URLs stay portable across hostnames and
schemes.
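
A minimal Python sketch of the two derived values, assuming the digest
is lowercase hex after the `sha256:` prefix and that relative URLs
resolve against the manifest's own well-known path. Both are
assumptions, and `artifact_digest` and `resolve_skill_url` are
hypothetical helper names.

```python
import hashlib
from urllib.parse import urljoin


def artifact_digest(body: bytes) -> str:
    # Assumption: lowercase hex SHA-256 after the "sha256:" prefix.
    return "sha256:" + hashlib.sha256(body).hexdigest()


def resolve_skill_url(scheme: str, authority: str, url: str) -> str:
    """Resolve a manifest `url` against the request authority (RFC 3986)."""
    # Assumption: the base URI is the manifest's own location.
    base = f"{scheme}://{authority}/.well-known/agent-skills/index.json"
    return urljoin(base, url)


path_absolute = resolve_skill_url("https", "api.example.com", "/skills/deploy-via-pr.md")
relative = resolve_skill_url("https", "api.example.com", "deploy-via-pr.md")
absolute = resolve_skill_url("https", "api.example.com", "https://cdn.example.net/s.md")
```

Because the scheme and authority come from the request, the same
manifest entry yields a correct absolute URL on every hostname the
origin answers to.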
## Configuration
```yaml
proxy:
  http_bind_port: 8080
origins:
  "test.sbproxy.dev":
    action:
      type: proxy
      url: https://test.sbproxy.dev
    agent_skills:
      - name: "deploy-via-pr"
        type: skill-md
        description: "Open a PR to deploy a config change."
        url: "/skills/deploy-via-pr.md"
        visibility: public
      - name: "internal-rotate-secret"
        type: skill-md
        description: "Rotate a service credential via vault."
        url: "/skills/internal-rotate-secret.md"
        visibility: authenticated
```
Every field except `name`, `type`, `description`, and `url` is
optional. Skills can declare an inline `body:` literal, an explicit
filesystem `path:`, or rely on the workspace-relative resolution that
the URL implies (the example above resolves
`/skills/deploy-via-pr.md` against the directory `sbproxy serve` was
invoked from).
### Visibility
`public` (the default) returns the entry to every caller.
`authenticated` filters the entry out of the manifest served to
anonymous callers. Callers that present an `Authorization` header
receive the full set.
The serve-time filter walks the manifest fresh on every request, so
an authenticated upgrade does not require a manifest reload. SHA-256
digests are computed once at config-load and pin the artifact body
across all callers.
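
The serve-time filter amounts to a few lines. The sketch below is
illustrative Python, with `visible_entries` as a hypothetical name; it
treats the mere presence of an `Authorization` header as
"authenticated", which is how the paragraph above describes it.

```python
def visible_entries(entries, request_headers):
    """Return the manifest entries this caller may see."""
    # Header names are case-insensitive, so normalise before the lookup.
    names = {name.lower() for name in request_headers}
    authenticated = "authorization" in names
    return [
        entry for entry in entries
        if entry.get("visibility", "public") == "public" or authenticated
    ]


entries = [
    {"name": "deploy-via-pr", "visibility": "public"},
    {"name": "internal-rotate-secret", "visibility": "authenticated"},
]
anonymous_view = visible_entries(entries, {"Host": "test.sbproxy.dev"})
full_view = visible_entries(entries, {"Authorization": "Bearer demo"})
```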
### Archive entries (`type: archive`)
`archive` entries point at a `.tar.gz` or `.zip` bundle. The proxy
sniffs the magic bytes, validates the bundle once at config-load time,
and serves it as opaque bytes on every request.
The archive parser refuses to load a bundle that:
- traverses outside the archive root via `..` or absolute paths,
- contains a symlink whose target escapes the archive root (or any
symlink at all in the zip case),
- exceeds the configured decompression ratio (default 100:1),
- exceeds the configured entry count (default 1000), or
- exceeds the configured expanded byte budget (default 10 MiB).
Each cap is configurable per entry:
| Field | Default | Purpose |
|---|---|---|
| `max_decompression_ratio` | 100 | Compressed:expanded ratio cap. |
| `max_entries` | 1000 | Max entries per archive. |
| `max_expanded_bytes` | 10485760 | Max expanded archive bytes. |
| `max_clock_skew_secs` | 60 | Tolerance for time-sensitive headers. |
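
The zip-side checks can be sketched with Python's standard `zipfile`
module. This is an illustration of the caps above, not the proxy's
parser: `validate_zip` is a hypothetical name, the ratio is measured
against the whole archive's byte length, and the sizes are the ones
the archive declares in its headers.

```python
import io
import zipfile
from pathlib import PurePosixPath


def validate_zip(data: bytes, max_ratio=100, max_entries=1000,
                 max_expanded_bytes=10 * 1024 * 1024):
    """Raise ValueError if the zip bundle breaks one of the caps."""
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        infos = bundle.infolist()
    if len(infos) > max_entries:
        raise ValueError("too many entries")
    expanded = 0
    for info in infos:
        path = PurePosixPath(info.filename)
        if path.is_absolute() or ".." in path.parts:
            raise ValueError(f"path escapes archive root: {info.filename}")
        # The Unix mode sits in the high 16 bits; 0o120000 marks a symlink.
        if (info.external_attr >> 16) & 0o170000 == 0o120000:
            raise ValueError(f"symlink not allowed in zip: {info.filename}")
        expanded += info.file_size
    if expanded > max_expanded_bytes:
        raise ValueError("expanded size over budget")
    if expanded > max_ratio * len(data):
        raise ValueError("decompression ratio over cap")


def make_zip(entries, compression=zipfile.ZIP_STORED):
    """Build an in-memory zip from (name, bytes) pairs, for trying the checks."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression) as bundle:
        for name, payload in entries:
            bundle.writestr(name, payload)
    return buffer.getvalue()
```

Nothing is extracted: the checks read only the central directory, in
line with validating once at config-load and serving opaque bytes
afterwards.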
## Integrity contract
Every artifact `GET` re-hashes the served body and compares to the
manifest digest. On mismatch the proxy:
1. Returns HTTP 503 with a generic "service unavailable" body.
2. Emits a structured `agent_skill.digest_mismatch` audit event with
`{ skill_name, hostname, expected_digest, observed_digest }`.
3. Increments
`sbproxy_agent_skill_digest_mismatch_total{skill=""}`.
The runtime check is the contract that lets cooperative agents trust
the digest. Operators who wire an audit sink see the mismatch land on
their existing audit pipeline.
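
A minimal Python sketch of the serve-time check, assuming a hex
`sha256:` digest and using a plain list as a stand-in for the audit
sink. `serve_artifact` is a hypothetical name; the metric increment is
omitted.

```python
import hashlib


def serve_artifact(skill_name, hostname, body, expected_digest, audit_log):
    """Re-hash `body` and return (status, response_body)."""
    observed = "sha256:" + hashlib.sha256(body).hexdigest()
    if observed != expected_digest:
        # Record the mismatch, then answer with a generic 503.
        audit_log.append({
            "event": "agent_skill.digest_mismatch",
            "skill_name": skill_name,
            "hostname": hostname,
            "expected_digest": expected_digest,
            "observed_digest": observed,
        })
        return 503, b"service unavailable"
    return 200, body


audit_log = []
pinned = "sha256:" + hashlib.sha256(b"# Deploy via PR").hexdigest()
ok = serve_artifact("deploy-via-pr", "api.example.com", b"# Deploy via PR", pinned, audit_log)
tampered = serve_artifact("deploy-via-pr", "api.example.com", b"# Tampered", pinned, audit_log)
```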
## No script execution
Per the v0.2.0 spec, SBproxy does not execute pre-/post-hooks or any
embedded scripts shipped inside an artifact. Artifacts are served as
opaque bytes. Archives are validated for size and traversal safety at
config-load time but are never extracted to disk during a request, and
the request handler never invokes a subprocess on the artifact body.
## MCP `experimental.agentSkillsUrl` advertising
When the origin's action is an MCP gateway and `agent_skills:` is
configured, the `initialize` JSON-RPC response includes a
`capabilities.experimental.agentSkillsUrl` field pointing at the
manifest. The advertised URL is the absolute URL of the origin's
`/.well-known/agent-skills/index.json`, resolved from the request
`Host` and the proxy's TLS posture.
```json
{
  "protocol_version": "2025-06-18",
  "capabilities": {
    "tools": {},
    "experimental": {
      "agentSkillsUrl": "https://api.example.com/.well-known/agent-skills/index.json"
    }
  },
  "server_info": { "name": "sbproxy-mcp", "version": "1.0" }
}
```
The advertised path is the same regardless of caller identity; the
manifest itself filters by visibility at serve time. When
`agent_skills:` is not configured for the origin, the field is omitted
entirely (no empty advertisement).
## `resources.listChanged` capability and manifest refresh
When `agent_skills:` is configured, the `initialize` response also
advertises `capabilities.resources.listChanged: true`. The manifest is
exposed to MCP clients as a resource; `listChanged` is the signal that
the resource set can change and the client should subscribe to
refresh notifications instead of caching the manifest forever.
```json
"capabilities": {
"resources": { "listChanged": true },
"experimental": { "agentSkillsUrl": "..." }
}
```
How a client uses this depends on its transport:
* **Persistent server-push transport** (the MCP streamable HTTP
transport's GET-SSE channel, when present): the client opens the
SSE channel and waits for a `notifications/resources/list_changed`
push. The proxy will emit that frame when the manifest regenerates,
once the server-side SSE push channel ships in a future release.
* **Request/response only** (the common case today): the client
treats the manifest like any other long-cached HTTP resource and
uses the `Cache-Control` / `Last-Modified` headers on the
well-known endpoint, polling with `If-Modified-Since` when its
internal cadence allows. The advertised `listChanged: true` is the
hint that polling is expected; without it, a client might cache
the manifest indefinitely.
The capability is omitted entirely when `agent_skills:` is not
configured, so a legacy client that keys off field presence does not
subscribe to a channel that has nothing to emit.
## Inspection
```bash
curl -s -H 'Host: api.example.com' \
http://127.0.0.1:8080/.well-known/agent-skills/index.json | jq
curl -s -H 'Host: api.example.com' -H 'Authorization: Bearer demo' \
http://127.0.0.1:8080/.well-known/agent-skills/index.json | jq
```
The example bundle at `examples/agent-skills/` is runnable with
`sbproxy serve -f sb.yml` and demonstrates the manifest, the
visibility filter, and the digest contract end-to-end.
## See also
- [`mcp.md`](mcp.md) for the broader MCP gateway story.
- [`threat-model.md`](threat-model.md) for the OSS trust boundaries
that constrain the digest verifier.
- [`features.md`](features.md) for the projection family overview.
================================================================
# docs/ai-crawl-control.md
================================================================
## AI Crawl Control + Pay Per Crawl
*Last modified: 2026-05-08*
The `ai_crawl_control` policy implements the "Pay Per Crawl" pattern: AI crawlers that arrive without a valid `Crawler-Payment` token receive `402 Payment Required` along with a JSON challenge body. A crawler that wants the content reads the challenge, posts a payment to your billing system, and retries with the issued token in the `Crawler-Payment` header. Each token redeems exactly once.
The OSS implementation ships an in-memory ledger seeded from config and an HTTPS-only HTTP ledger client for production. The enterprise build extends the same `Ledger` trait with managed adapters so the proxy can authorise tokens against Stripe, x402, MPP, and Lightning rails.
## OSS scope: challenge body only
The OSS proxy emits two challenge shapes:
1. **Single-rail (default).** A 402 with the `Crawler-Payment` header and a flat JSON body describing the price. This is the path legacy crawlers see.
2. **Multi-rail (opt-in).** When the agent sends `Accept-Payment:` or one of the multi-rail `Accept` MIME types (`application/sbproxy-multi-rail+json`, `application/x402+json`, `application/mpp+json`), the OSS proxy emits a 402 with `Content-Type: application/sbproxy-multi-rail+json` and a body that lists one entry per rail the operator declared (x402, MPP, Lightning), each with its own quote-token JWS.
The multi-rail body is the wire-format contract. The OSS build can negotiate it, advertise rails, mint per-rail quote tokens, and respond 406 when the agent's preference set has no overlap with the operator's offered rails.
What the OSS build cannot do is settle a payment on x402, MPP, Stripe, or Lightning. Settlement code lives in the enterprise build behind the `stripe`, `x402`, `mpp`, `lightning-cln`, `lightning-lnd`, and `lightning-phoenixd` cargo features. With an OSS-only build, the rails advertised in the multi-rail body are honoured by the in-memory or HTTP ledger; the enterprise BillingRail registrations are what actually authorise a real-money settlement.
This is the same framing the rail-Lightning example uses: see `examples/rail-lightning/README.md`. For the wire-shape contract on its own, see [`402-challenge.md`](402-challenge.md).
## Request flow
```
crawler GET /article
User-Agent: GPTBot/1.0
proxy <- 402 Payment Required
Crawler-Payment: realm="ai-crawl" currency="USD" price="0.001"
Content-Type: application/json
body: {"error":"payment_required","price":"0.001","currency":"USD","target":"blog.example.com/article","header":"crawler-payment"}
crawler GET /article (after paying out-of-band)
User-Agent: GPTBot/1.0
crawler-payment: tok_a89be2...
proxy <- 200 OK
body:
crawler GET /article (replay attempt)
User-Agent: GPTBot/1.0
crawler-payment: tok_a89be2...
proxy <- 402 (single-use ledger; token already spent)
```
## Configuration
```yaml
policies:
  - type: ai_crawl_control
    price: 0.001
    currency: USD
    header: crawler-payment        # default
    crawler_user_agents:           # case-insensitive substring match
      - GPTBot
      - ChatGPT-User
      - ClaudeBot
      - anthropic-ai
      - Google-Extended
      - PerplexityBot
      - CCBot
    valid_tokens:                  # in-memory ledger
      - tok_a89be2f1
      - tok_b7cf012e
      - tok_c34f9a82
```
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `price` | float | unset | Price emitted in the challenge body and the `price=` parameter of the challenge header. Used as the fallback when no tier matches. |
| `currency` | string | `USD` | ISO-4217 code surfaced in the challenge header and body. |
| `header` | string | `crawler-payment` | Header the crawler reads from the 402 response and writes to its retry. |
| `crawler_user_agents` | list | covers GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, Google-Extended, PerplexityBot, CCBot, FacebookBot | Case-insensitive substring matches against the request User-Agent. Empty list treats every GET / HEAD as a crawler. |
| `valid_tokens` | list | `[]` | Seeds the in-memory ledger. Each token redeems once, then leaves the set. |
| `tiers` | list | `[]` | Pricing tiers. First match wins. See "Tiered pricing" below. |
| `ledger` | block | unset | HTTP ledger client config. See "HTTP ledger" below. Mutually exclusive with `valid_tokens`. |
Only `GET` and `HEAD` requests are subject to charging today. `POST`, `PUT`, `PATCH`, and `DELETE` pass through without charge.
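
The single-rail flow with the in-memory ledger can be sketched in a few lines of Python. This is an illustration under assumptions, not the policy's code: `InMemoryLedger` and `handle` are hypothetical names, headers are passed as a dict with lowercase keys, and the challenge header is omitted.

```python
import json

CRAWLER_USER_AGENTS = ["GPTBot", "ClaudeBot", "CCBot"]


class InMemoryLedger:
    def __init__(self, valid_tokens):
        self._tokens = set(valid_tokens)

    def redeem(self, token):
        """Spend `token`; return True only the first time it is presented."""
        if token in self._tokens:
            self._tokens.remove(token)
            return True
        return False


def handle(method, headers, ledger, price="0.001", currency="USD"):
    """Return (status, body) for one request."""
    user_agent = headers.get("user-agent", "").lower()
    # Case-insensitive substring match against the configured crawler list.
    is_crawler = any(ua.lower() in user_agent for ua in CRAWLER_USER_AGENTS)
    if method not in ("GET", "HEAD") or not is_crawler:
        return 200, "content"
    token = headers.get("crawler-payment")
    if token and ledger.redeem(token):
        return 200, "content"
    challenge = {"error": "payment_required", "price": price,
                 "currency": currency, "header": "crawler-payment"}
    return 402, json.dumps(challenge)


ledger = InMemoryLedger(["tok_a89be2f1"])
paid = {"user-agent": "GPTBot/1.0", "crawler-payment": "tok_a89be2f1"}
first = handle("GET", paid, ledger)
replay = handle("GET", paid, ledger)
```

The replay gets a fresh 402 because the token left the set on first redemption.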
## Tiered pricing
A flat per-site price is the right starting point but not the right long-term shape. Different routes carry different commercial value, and the same article in three formats (HTML, Markdown, PDF) is worth three different prices to a training crawler. The `tiers:` field lets you price by route pattern and content shape without forking the policy.
```yaml
policies:
  - type: ai_crawl_control
    price: 0.0005                      # fallback when no tier matches
    currency: USD
    tiers:
      - route_pattern: /premium/*
        price:
          amount_micros: 5000          # $0.005 per crawl
          currency: USD
        free_preview_bytes: 1024       # cooperative crawlers get 1 KiB free
        paywall_position: hard
      - route_pattern: /articles/*
        price:
          amount_micros: 1000          # $0.001 per crawl
          currency: USD
        content_shape: markdown        # Markdown form only
        free_preview_bytes: 4096
        paywall_position: soft
      - route_pattern: /articles/*
        price:
          amount_micros: 500           # $0.0005 per crawl
          currency: USD
        content_shape: html
      - route_pattern: /docs/*
        price:
          amount_micros: 250
          currency: USD
```
| Field | Type | Description |
|---|---|---|
| `route_pattern` | string | Path matcher. Supports literal paths (`/about`) and a `*` suffix wildcard (`/articles/*`). First match wins; later tiers act as fallbacks. |
| `price.amount_micros` | u64 | Price in micros (1e-6 of one unit of `currency`). 1000 micros = $0.001. Floats never enter the wire format. |
| `price.currency` | string | ISO-4217 code. Must match the policy-level `currency` for now. |
| `content_shape` | enum | One of `html`, `markdown`, `json`, `pdf`, `other`. Advisory; surfaced in metrics and the redeem payload but not yet used as a tier filter. |
| `free_preview_bytes` | u64, optional | Byte budget the crawler may read without paying. Surfaced in the challenge body so cooperative crawlers can decide up front whether the preview alone meets their need. |
| `paywall_position` | enum, optional | Hint to the crawler about where the paywall sits: `hard` (no content without payment), `soft` (preview, then paywall), `metered` (N free per period). |
The first tier whose `route_pattern` matches wins. When no tier matches, the policy falls back to the top-level `price` and `currency`. An empty `tiers` list keeps the original flat-price behaviour.
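
First-match resolution and the micros arithmetic can be sketched as follows. This is a minimal Python illustration with hypothetical function names; it handles only the two pattern forms the table describes (a literal path and a trailing `*`).

```python
from decimal import Decimal


def pattern_matches(route_pattern, path):
    # A trailing "*" is a prefix wildcard; anything else is a literal path.
    if route_pattern.endswith("*"):
        return path.startswith(route_pattern[:-1])
    return path == route_pattern


def resolve_price_micros(tiers, path, fallback_micros):
    """Return the price of the first matching tier, else the fallback."""
    for tier in tiers:
        if pattern_matches(tier["route_pattern"], path):
            return tier["price"]["amount_micros"]
    return fallback_micros


def micros_to_units(amount_micros):
    # Decimal keeps the conversion exact; 1000 micros is 0.001 of a unit.
    return Decimal(amount_micros) / Decimal(1_000_000)


tiers = [
    {"route_pattern": "/premium/*", "price": {"amount_micros": 5000}},
    {"route_pattern": "/articles/*", "price": {"amount_micros": 1000}},
    {"route_pattern": "/articles/*", "price": {"amount_micros": 500}},
]
premium = resolve_price_micros(tiers, "/premium/report", 500)
article = resolve_price_micros(tiers, "/articles/foo", 500)
other = resolve_price_micros(tiers, "/about", 500)
```

Here `/articles/foo` resolves to 1000 micros: the second `/articles/*` tier never wins, because the first one matches the same pattern.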
### Per-shape pricing
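`content_shape` is advisory: configurations may set the field on a tier so metrics and the redeem payload carry the shape, but the policy does not yet match against it. Until it does, two tiers that share a `route_pattern` and differ only in `content_shape` both resolve to the first one listed, so in the example above the `html` tier for `/articles/*` is never selected. The wire format is stable, so configurations that set `content_shape` today will keep working when the resolver lands.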
## HTTP ledger
The OSS in-memory ledger (`valid_tokens:`) is fine for tests, fixed-token issuance, or one-off content gates. Production deployments with multiple proxy replicas need a network-callable ledger so one token spends across all nodes. The HTTP ledger client speaks a JSON-over-HTTPS protocol with HMAC-SHA256 envelope signatures over a fixed eight-line canonical form.
```yaml
policies:
  - type: ai_crawl_control
    price: 0.001
    currency: USD
    ledger:
      endpoint: "https://ledger.internal"
      key_id: "sb-ledger-2026-q2"
      key_file: "${SBPROXY_LEDGER_HMAC_KEY_FILE}"
      workspace_id: "default"
      agent_id: "openai-gptbot"      # forwarded into the redeem payload
      agent_vendor: "OpenAI"
      per_attempt_timeout_ms: 5000
      total_timeout_ms: 30000
      max_attempts: 5                # hard-capped at 5 by the ADR
      breaker:
        failure_threshold: 10
        success_threshold: 1
        open_duration_ms: 5000
```
The client refuses to construct against a non-HTTPS endpoint at config-load time. Plain HTTP is a hard error: the HMAC in the request envelope authenticates the body but does not encrypt it, so TLS is the only thing keeping the body itself confidential.
### Request envelope
Every redeem call carries the eight-line canonical envelope:
```json
{
  "v": 1,
  "request_id": "01HZX...",
  "timestamp": "2026-04-30T12:34:56.789Z",
  "nonce": "8f4a...32-hex...",
  "agent_id": "openai-gptbot",
  "agent_vendor": "OpenAI",
  "workspace_id": "default",
  "payload": {
    "token": "tok_abc...",
    "host": "blog.example.com",
    "path": "/articles/foo",
    "amount_micros": 1000,
    "currency": "USD",
    "content_shape": "markdown"
  }
}
```
The signature is HMAC-SHA256 over the canonical signing string (eight `\n`-separated fields, last one being the SHA-256 of the request body). The signature lands in the `X-Sb-Ledger-Signature: v1=` header. The `v1=` prefix reserves room for future MAC migrations without breaking peers.
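
The shape of the signing step can be sketched in Python. The text above fixes only the field count, the `\n` separator, the body hash in last position, and the `v1=` prefix; everything else here is an assumption made for illustration, including which seven envelope fields precede the body hash, their order, hex encoding, and the compact sorted-key JSON body. `sign_envelope` and `verify_signature` are hypothetical names.

```python
import hashlib
import hmac
import json


def sign_envelope(envelope: dict, key: bytes):
    """Return (request_body, signature_header_value)."""
    # Assumption: the body is compact JSON with sorted keys.
    body = json.dumps(envelope, separators=(",", ":"), sort_keys=True).encode()
    # Assumption: seven envelope fields in this order, then the body hash.
    fields = [
        str(envelope["v"]), envelope["request_id"], envelope["timestamp"],
        envelope["nonce"], envelope["agent_id"], envelope["agent_vendor"],
        envelope["workspace_id"], hashlib.sha256(body).hexdigest(),
    ]
    signing_string = "\n".join(fields).encode()
    mac = hmac.new(key, signing_string, hashlib.sha256).hexdigest()
    return body, "v1=" + mac


def verify_signature(envelope: dict, header_value: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    _, expected = sign_envelope(envelope, key)
    return hmac.compare_digest(expected, header_value)


envelope = {
    "v": 1, "request_id": "req-1", "timestamp": "2026-04-30T12:34:56.789Z",
    "nonce": "00" * 16, "agent_id": "openai-gptbot", "agent_vendor": "OpenAI",
    "workspace_id": "default", "payload": {"token": "tok_example"},
}
body, signature = sign_envelope(envelope, b"demo-key")
```

Hashing the body into the signing string means a change to any payload field invalidates the signature even though the payload is not one of the listed fields.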
### Idempotency
Every attempt carries an `Idempotency-Key` header (a fresh ULID per logical operation). Retries reuse the same key; the ledger short-circuits the second attempt with the cached response. A different body under the same key returns 409 `ledger.idempotency_conflict`, which protects against accidental key reuse across operations.
`Idempotency-Key` is distinct from the envelope's `request_id`: the request id identifies the inbound 402 from the agent, while the idempotency key identifies a single conversation with the ledger about that request.
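
From the ledger's side, the behaviour described above looks roughly like the following Python sketch. `IdempotencyCache` is a hypothetical name and the in-process dict is a stand-in for whatever store a real ledger uses.

```python
import hashlib


class IdempotencyCache:
    def __init__(self):
        self._entries = {}

    def handle(self, key, body: bytes, process):
        """Run `process(body)` once per key; replay the cached response after."""
        fingerprint = hashlib.sha256(body).hexdigest()
        if key in self._entries:
            cached_fingerprint, response = self._entries[key]
            if cached_fingerprint != fingerprint:
                # Same key, different body: refuse rather than guess.
                return 409, "ledger.idempotency_conflict"
            return response
        response = process(body)
        self._entries[key] = (fingerprint, response)
        return response


calls = []


def redeem(body):
    calls.append(body)
    return 200, "redeemed"


cache = IdempotencyCache()
first = cache.handle("key-1", b'{"token":"a"}', redeem)
retry = cache.handle("key-1", b'{"token":"a"}', redeem)
conflict = cache.handle("key-1", b'{"token":"b"}', redeem)
```

The retry returns the cached response without calling `redeem` again, which is what makes client-side retries safe against double spending.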
### Retry and circuit breaker
Exponential backoff with full jitter, max 5 attempts, per-attempt deadline 5 s, total deadline 30 s. The base schedule is 0 ms, 250 ms, 500 ms, 1 s, 2 s, each with `[0, base)` jitter added. Retries fire only on:
- network errors (DNS, TCP RST, TLS handshake, read timeout)
- HTTP 429 (with `Retry-After` honoured)
- HTTP 502 / 503 / 504
- error envelopes with `retryable: true`
Hard failures (`ledger.token_already_spent`, `ledger.signature_invalid`, `ledger.bad_request`) translate directly to a 402 to the crawler. There is no point retrying a token the ledger already rejected as spent.
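
The retry schedule and the retry predicate can be sketched as below. This Python illustration reads "each with `[0, base)` jitter added" literally, so the delay is the base plus a random amount below the base; if the client instead draws the whole delay from `[0, base)`, only `delay_before_attempt_ms` changes. The function names are hypothetical.

```python
import random

BASE_DELAYS_MS = [0, 250, 500, 1000, 2000]  # one entry per attempt, five in total
RETRYABLE_STATUS = {429, 502, 503, 504}


def delay_before_attempt_ms(attempt, rng=random):
    """Delay before 0-based `attempt`: the base plus jitter drawn from [0, base)."""
    base = BASE_DELAYS_MS[attempt]
    return base + rng.random() * base


def should_retry(attempt, status=None, network_error=False, retryable_flag=False):
    """Decide whether to schedule another attempt after 0-based `attempt` failed."""
    if attempt + 1 >= len(BASE_DELAYS_MS):
        return False  # five attempts is the hard cap
    return network_error or retryable_flag or status in RETRYABLE_STATUS
```

The per-attempt (5 s) and total (30 s) deadlines and `Retry-After` handling are left out of the sketch.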
The circuit breaker opens after 10 consecutive failures over a 30 s window, half-opens after 5 s with one probe, and closes on probe success. While the breaker is open, the client returns a synthetic `ledger.unavailable` error without making the network call. The policy treats that as "ledger is down" and applies the configured `on_ledger_failure` action (default fail-closed).
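
The breaker's three states can be sketched as a small Python state machine. It is illustrative only: `CircuitBreaker` is a hypothetical class, time is passed in explicitly, and the 30 s failure window is omitted, so the sketch counts consecutive failures regardless of spacing.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=10, open_duration_s=5.0):
        self.failure_threshold = failure_threshold
        self.open_duration_s = open_duration_s
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow(self, now):
        """Return True if a network call may be made at time `now` (seconds)."""
        if self.state == "open" and now - self.opened_at >= self.open_duration_s:
            self.state = "half-open"  # let exactly one probe through
            return True
        return self.state == "closed"

    def record(self, success, now):
        """Record the outcome of a call that `allow` let through."""
        if success:
            self.state, self.failures = "closed", 0
            return
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", now
```

When `allow` returns False, the caller would return the synthetic `ledger.unavailable` error instead of touching the network.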
A 503 response with `Retry-After` propagates straight to the crawler: the 402 response carries `Retry-After` so the crawler knows when to come back. This is the one case where the policy emits `Retry-After` on a 402.
### Failure modes
| Ledger response | Policy action |
|---|---|
| 200 success, redeemed | Pass the request through. |
| 200 success, not redeemed | 402 with the challenge body. The token was valid format but the ledger refused (out of balance, expired). |
| 409 `token_already_spent` | 402, no retry. |
| 4xx other | 402, no retry, log at WARN. |
| 5xx, transient envelope, breaker open | Apply `on_ledger_failure` (default fail-closed -> 503). |
## Agent classes and per-vendor pricing
An `agent_class` taxonomy lets metrics, audit logs, and ledger payloads attribute revenue per vendor. The agent class is resolved at request time via three signals (in order of confidence):
1. Verified Web Bot Auth `keyid` matches an `expected_keyids` entry. Highest confidence.
2. Forward-confirmed reverse-DNS suffix matches an `expected_reverse_dns_suffixes` entry. Strong confidence.
3. User-Agent regex match. Advisory unless the policy explicitly trusts UAs.
Three reserved sentinels round out the resolver:
- `human` is emitted when no automated-agent signal is present.
- `unknown` is the fall-through bucket for an automated UA without a registry match.
- `anonymous` is emitted for anonymous Web Bot Auth requests with no known `keyid`.
Operators see all three values in metrics and dashboards; alerting on a sustained climb in `unknown` is the normal way to spot a new crawler that needs a registry entry.
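
The resolution order can be sketched in Python as below. The registry entry, the `key-openai-1` keyid, the `AUTOMATED_UA` heuristic for "looks automated", and the point at which `anonymous` is chosen are all assumptions made for illustration; only the three-signal ordering and the sentinel names come from the text above.

```python
import re

REGISTRY = [
    {
        "id": "openai-gptbot",
        "expected_keyids": ["key-openai-1"],  # hypothetical keyid
        "expected_reverse_dns_suffixes": [".gptbot.openai.com"],
        "expected_user_agent_pattern": r"(?i)\bGPTBot/\d",
    },
]
AUTOMATED_UA = re.compile(r"(?i)bot|crawler|spider")  # hypothetical heuristic


def resolve_agent(user_agent, verified_keyid=None, reverse_dns=None):
    """Return an agent id or one of the sentinels human / unknown / anonymous."""
    for entry in REGISTRY:  # 1. verified Web Bot Auth keyid
        if verified_keyid and verified_keyid in entry["expected_keyids"]:
            return entry["id"]
    for entry in REGISTRY:  # 2. forward-confirmed reverse DNS suffix
        suffixes = entry["expected_reverse_dns_suffixes"]
        if reverse_dns and any(reverse_dns.endswith(s) for s in suffixes):
            return entry["id"]
    for entry in REGISTRY:  # 3. User-Agent regex (advisory)
        if re.search(entry["expected_user_agent_pattern"], user_agent):
            return entry["id"]
    if verified_keyid:
        return "anonymous"  # signed request, but the keyid is not in the registry
    if AUTOMATED_UA.search(user_agent):
        return "unknown"
    return "human"
```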
### Per-vendor pricing example
```yaml
agent_classes:
  - id: openai-gptbot
    vendor: OpenAI
    purpose: training
    expected_user_agent_pattern: "(?i)\\bGPTBot/\\d"
    expected_reverse_dns_suffixes: [".gptbot.openai.com"]
  - id: anthropic-claudebot
    vendor: Anthropic
    purpose: training
    expected_user_agent_pattern: "(?i)\\bClaudeBot/\\d"
  - id: commoncrawl-ccbot
    vendor: Common Crawl
    purpose: archival
    expected_user_agent_pattern: "(?i)\\bCCBot/\\d"
policies:
  - type: ai_crawl_control
    currency: USD
    tiers:
      # Training crawlers pay full price.
      - route_pattern: /articles/*
        agent_id: openai-gptbot
        price: { amount_micros: 2000, currency: USD }
      - route_pattern: /articles/*
        agent_id: anthropic-claudebot
        price: { amount_micros: 2000, currency: USD }
      # Archival crawlers get a discount.
      - route_pattern: /articles/*
        agent_id: commoncrawl-ccbot
        price: { amount_micros: 500, currency: USD }
      # Sentinel buckets price differently for diagnostics.
      - route_pattern: /articles/*
        agent_id: anonymous
        price: { amount_micros: 1000, currency: USD }
      - route_pattern: /articles/*
        agent_id: unknown
        price: { amount_micros: 1500, currency: USD }
```
`agent_id` on a tier matches against the resolver's verdict. The first tier whose route pattern AND agent id both match wins. A tier without `agent_id` matches every agent.
The ten default agent classes (`openai-gptbot`, `openai-chatgpt-user`, `anthropic-claudebot`, `perplexity-perplexitybot`, `google-googlebot`, `google-extended`, `microsoft-bingbot`, `duckduckgo-duckduckbot`, `apple-applebot`, `commoncrawl-ccbot`) ship embedded in the binary. Operators extend or override entries inline in `sb.yml`.
## Observability
Every redeem fires a metric and a structured-log line. The label set:
| Label | Source | Cardinality cap |
|---|---|---|
| `agent_id` | Agent-class resolver. Bounded to registry plus `human`, `unknown`, `anonymous` sentinels. | 200 |
| `agent_class` | Closed enum from the taxonomy. | 8 |
| `agent_vendor` | Free-form vendor name from the taxonomy. | 20 |
| `payment_rail` | Closed enum: `none`, `x402`, `mpp_card`, `mpp_stablecoin`, `stripe_fiat`, `lightning`. | 6 |
| `content_shape` | Closed enum: `html`, `markdown`, `json`, `pdf`, `other`. | 5 |
Cardinality budgets are enforced by `sbproxy-observe::cardinality::CardinalityLimiter`; over-cap label values demote to `__other__` and increment `sbproxy_label_cardinality_overflow_total`.
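The demotion behaviour can be pictured with a small Python stand-in. It is not the `sbproxy-observe` type; the class and attribute names below are hypothetical and only model the cap, the `__other__` fallback, and the overflow counter.

```python
class CardinalityLimiter:
    """Illustrative stand-in for a per-label cardinality cap (hypothetical API)."""

    def __init__(self, cap):
        self.cap = cap
        self.seen = set()
        # Plays the role of sbproxy_label_cardinality_overflow_total.
        self.overflow_total = 0

    def admit(self, value):
        """Return the label value to emit: the value itself, or `__other__` once over cap."""
        if value in self.seen:
            return value
        if len(self.seen) < self.cap:
            self.seen.add(value)
            return value
        self.overflow_total += 1
        return "__other__"
```

Values admitted before the cap was reached keep their own label; only new values arriving after the cap are demoted.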
### Metrics
| Metric | Type | Notes |
|---|---|---|
| `sbproxy_ledger_redeem_total{result, agent_id, agent_vendor, payment_rail}` | counter | Per-redeem outcome. `result` is one of `success`, `denied`, `error`. |
| `sbproxy_ledger_redeem_duration_seconds_bucket` | histogram | Tail-latency of the ledger round-trip. Carries trace exemplars. |
| `sbproxy_ledger_circuit_breaker_state{endpoint}` | gauge | 0 closed, 1 half-open, 2 open. |
| `sbproxy_ledger_circuit_breaker_transitions_total{endpoint, from, to}` | counter | Breaker flap counter. |
| `sbproxy_requests_total{agent_id, agent_class, agent_vendor, payment_rail, content_shape}` | counter | Per-request outcome. |
The per-agent dashboard (`deploy/dashboards/per-agent.json`) groups every panel by `agent_class` plus `agent_vendor`, so operators see one row per vendor and one row each for the sentinels. The audit-log dashboard (`deploy/dashboards/audit-log.json`) shows admin actions on `ai_crawl_control` tier edits.
### Tracing
The HTTP ledger client emits one outbound span per attempt, named `sbproxy.ledger.redeem`. The span carries `sbproxy.ledger.idempotency_key` so operators correlating across the proxy and the ledger can grep both sides for the same key. W3C TraceContext propagates on the outbound request; if the ledger emits OTel spans, the trace stitches end-to-end without manual correlation.
Exemplars on `sbproxy_ledger_redeem_duration_seconds_bucket` let Grafana jump from "this latency outlier" straight to the matching trace in Tempo.
## Limitations
- Detection is User-Agent based by default. Crawlers that lie about their UA bypass the check unless reverse-DNS or Web Bot Auth signals catch them; layer this with bot-detection or WAF policies for defence in depth.
- The OSS in-memory ledger is single-process. Multi-replica deployments without an HTTP ledger need sticky session affinity to one replica.
- `content_shape` is advisory. The field flows through metrics and the redeem payload but is not yet used as a tier filter.
- Per-agent pricing requires the agent-class resolver. The resolver is enabled by default; operators who explicitly disable it fall back to UA-only matching and lose the per-vendor distinction.
## See also
- [configuration.md](configuration.md#ai_crawl_control) - schema reference.
- [ai-gateway.md](ai-gateway.md) - how this policy interacts with `ai_proxy` upstreams.
- [observability.md](observability.md) - metrics, logs, traces, dashboards.
- `examples/ai-crawl-control/` - runnable example.
================================================================
# docs/ai-gateway.md
================================================================
## SBproxy AI gateway guide
*Last modified: 2026-06-06*
SBproxy includes an AI gateway that sits between your application and LLM providers. You get one API endpoint with automatic failover, cost tracking, rate limits, and programmable routing across OpenAI, Anthropic, and other providers. The proxy ships with 66 native providers behind one OpenAI-compatible API, including a native Anthropic translator. You bring your own provider keys and the model name passes straight through, so you reach 200+ models without waiting on us to add them.
## Provider setup
Configure one or more providers under the `action` block. Each provider needs a name, API key, and model list:
```yaml
origins:
  "ai.example.com":
    action:
      type: ai_proxy
      providers:
        - name: openai
          api_key: ${OPENAI_API_KEY}
          models: [gpt-4o, gpt-4o-mini, gpt-4-turbo]
        - name: anthropic
          api_key: ${ANTHROPIC_API_KEY}
          models: [claude-sonnet-4-20250514, claude-3-5-haiku-20241022]
      default_model: gpt-4o-mini
      routing:
        strategy: round_robin
```
API keys support environment variable interpolation with `${VAR_NAME}` syntax. Never put raw keys in config files.
### Native providers
66 native providers ship in-tree alongside a native Anthropic translator. You bring your own key per provider and the `model` field passes straight through, so the gateway reaches 200+ models (and any model a provider ships next) without enumerating them. Direct adapters include `openai`, `anthropic`, `gemini`, `azure`, `bedrock`, `cohere`, `mistral`, `groq`, `deepseek`, `together`, `fireworks`, `cerebras`, `sambanova`, `nvidia`, `vertex`, `databricks`, `huggingface`, `vllm`, and `openrouter`.
Any model a listed provider serves works without extra config. For a self-hosted or proprietary endpoint, point `vllm` or any provider at it with a custom `base_url`. `openrouter` is available as one of the providers when you want many vendors behind a single key. See `providers.md` for the full per-provider table.
## Routing strategies
The `routing.strategy` field controls how the proxy picks a provider for each request.
### round_robin
Spreads requests evenly across healthy providers. A reasonable default.
```yaml
routing:
  strategy: round_robin
```
### weighted
Assigns a weight to each provider. Higher weight means more traffic.
```yaml
routing:
  strategy: weighted
```
### fallback_chain
Tries providers in priority order. When the selected provider fails or returns 5xx, the router moves to the next provider.
```yaml
routing:
  strategy: fallback_chain
```
### cost_optimized
Picks the cheapest provider that is not already loaded. The router scores each provider as `in_flight_requests * 1000 + weight` and routes to the lowest score. Set a lower `weight` on cheaper providers so they win ties when utilization is similar.
```yaml
routing:
  strategy: cost_optimized
```
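The scoring formula is easy to check by hand. The Python sketch below applies `in_flight_requests * 1000 + weight` to a hypothetical list of provider records; the record layout and provider names are assumptions for illustration.

```python
def pick_cost_optimized(providers):
    """Return the name of the provider with the lowest score.

    providers: list of dicts with `name`, `in_flight_requests`, and `weight`.
    """
    return min(
        providers,
        key=lambda p: p["in_flight_requests"] * 1000 + p["weight"],
    )["name"]
```

With both providers idle, the lower `weight` wins. A single in-flight request adds 1000 to the score, so load dominates the weight as soon as utilization differs.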
### lowest_latency
Routes to the provider with the lowest observed latency based on recent request history.
```yaml
routing:
  strategy: lowest_latency
```
### least_connections
Routes to the provider with the fewest in-flight requests.
```yaml
routing:
  strategy: least_connections
```
### sticky
Pins a user or session to the same provider. Falls back to round_robin for the initial pick.
```yaml
routing:
  strategy: sticky
```
### random
Picks a provider uniformly at random. Useful for spreading load when no other signal applies.
```yaml
routing:
  strategy: random
```
### token_rate
Routes to the provider with the most remaining token-per-minute capacity. Pair with per-provider token limits so the router can score headroom.
```yaml
routing:
  strategy: token_rate
```
### race
Fans the request out to every eligible provider in parallel, returns the first 2xx, and cancels the in-flight losers. Optimizes p99 latency at the cost of N times the API spend per request. Pair with `resilience` so persistently slow providers fall out of the eligible set.
```yaml
routing:
  strategy: race
```
See [examples/ai-race](../examples/ai-race/sb.yml).
### least_token_usage
Routes to the provider with the lowest absolute observed token throughput in the current minute, regardless of any configured limit. Unlike `token_rate`, which scores remaining headroom against a declared per-provider TPM cap, this scores raw observed throughput, so it suits self-hosted vLLM or SGLang pools that do not pre-declare a token cap. Untried providers sort lowest and are explored first.
```yaml
routing:
  strategy: least_token_usage
```
### prefix_affinity
Hashes a stable prefix of the request body to an enabled provider so requests that share a prompt prefix land on the same upstream and reuse its KV cache (vLLM, SGLang). The hash is deterministic and stable across reloads as long as the provider list does not reorder. Falls back to round_robin when no prefix can be extracted.
```yaml
routing:
  strategy: prefix_affinity
```
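A minimal Python sketch of the idea, under stated assumptions: the prefix length (256 characters), the hash (SHA-256), and the modulo mapping are all illustrative choices, not the proxy's actual parameters.

```python
import hashlib

def pick_by_prefix(providers, body_text, prefix_len=256):
    """Map a stable prefix of the request body onto one provider.

    prefix_len, SHA-256, and the modulo step are assumptions for illustration.
    """
    prefix = body_text[:prefix_len].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    index = int.from_bytes(digest[:8], "big") % len(providers)
    return providers[index]
```

The index depends on list position, which is why a mapping of this kind stays stable only while the provider list keeps its order.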
### peak_ewma
Power-of-two-choices over observed latency: sample two eligible providers and route to the one with the lower recently observed latency. Cuts tail latency under skewed load versus always picking the single lowest-latency provider, which herds traffic. An untried provider is explored first.
```yaml
routing:
  strategy: peak_ewma
```
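Power-of-two-choices is short enough to sketch in Python. The latency map and function name are hypothetical, and the sketch assumes at least two eligible providers; the real strategy's latency estimator is not shown.

```python
import random

def pick_p2c(latency_ms, rng=random):
    """Sample two providers and return the one with lower recent latency.

    latency_ms maps provider name -> recent latency in ms, or None if untried.
    Requires at least two providers.
    """
    a, b = rng.sample(sorted(latency_ms), 2)  # two distinct candidates
    for name in (a, b):
        if latency_ms[name] is None:
            return name  # explore an untried provider first
    return a if latency_ms[a] <= latency_ms[b] else b
```

Sampling two candidates instead of scanning for the global minimum is what keeps traffic from herding onto one momentarily fast provider.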
### cascade
Tries a sequence of `(provider, model)` tiers from cheapest to most expensive. Each tier's response is graded against its `quality_threshold`; a response that is below threshold, empty, or refused retries on the next tier. `max_total_cost` (micro-USD) is an optional cumulative budget cap. Streaming requests dispatch only to the first tier.
```yaml
routing:
  strategy: cascade
  max_total_cost: 100000
  tiers:
    - provider_id: openai
      model: gpt-4o-mini
      quality_threshold: 0.7
    - provider_id: openai
      model: gpt-4o
      quality_threshold: 0.85
```
See [examples/ai-cascade-routing](../examples/ai-cascade-routing/sb.yml).
### cost_quality
Scores each prompt's difficulty and routes simple prompts to a cheap model and hard prompts to a frontier model, on a single `cost_threshold` dial (`0.0` sends almost everything to the frontier, `1.0` sends almost everything to the cheap model).
```yaml
routing:
  strategy: cost_quality
  cheap_provider: openai-mini
  frontier_provider: openai
  cost_threshold: 0.5
```
## Resilience
Per-provider circuit breaker, outlier detection, and active health probes layered on top of the routing strategy. Each signal independently ejects a provider; when every provider is ejected, the router falls back to the unfiltered enabled list rather than refusing the request.
```yaml
resilience:
  circuit_breaker:
    failure_threshold: 5
    success_threshold: 2
    open_duration_secs: 30
  outlier_detection:
    threshold: 0.5
    window_secs: 60
    min_requests: 5
    ejection_duration_secs: 30
  health_check:
    path: /models
    interval_secs: 30
    timeout_ms: 5000
    unhealthy_threshold: 3
    healthy_threshold: 2
```
See [examples/ai-resilience](../examples/ai-resilience/sb.yml). Field reference in [configuration.md#resilience-resilience](configuration.md#resilience-resilience).
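For readers new to the pattern, here is a generic circuit breaker in Python using the three `circuit_breaker` field names. The semantics are inferred from the names (consecutive failures open the breaker, successes in half-open close it); treat them as assumptions rather than a description of the proxy's implementation.

```python
class CircuitBreaker:
    """Generic closed / open / half-open breaker (illustrative, not sbproxy's code)."""

    def __init__(self, failure_threshold=5, success_threshold=2, open_duration_secs=30):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_duration_secs = open_duration_secs
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow(self, now):
        """Return True if a request may be sent; move open -> half_open after the timeout."""
        if self.state == "open" and now - self.opened_at >= self.open_duration_secs:
            self.state, self.successes = "half_open", 0
        return self.state != "open"

    def record(self, ok, now):
        """Record the outcome of a request and update the state."""
        if ok:
            if self.state == "half_open":
                self.successes += 1
                if self.successes >= self.success_threshold:
                    self.state, self.failures = "closed", 0
            else:
                self.failures = 0  # assumed: only consecutive failures count
        else:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "open", now
```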
## Shadow eval
Mirror each request to a second provider concurrently. The primary's response is what the client sees; the shadow body is drained and metrics are emitted at `target=sbproxy_ai_shadow` (status, latency, prompt/completion tokens, finish_reason). Useful for prompt regression checks before swapping a primary model.
```yaml
shadow:
  provider: anthropic
  sample_rate: 0.1
  timeout_ms: 30000
```
See [examples/ai-shadow](../examples/ai-shadow/sb.yml).
## Proxy-native AI patterns
SBproxy is a proxy first, so AI traffic composes with everything else the proxy offers: CEL policies, forward rules, regex guardrails, request modifiers. Patterns that are awkward or impossible to express in a pure AI gateway library:
| Pattern | Mechanism | Example |
|---------|-----------|---------|
| Tenant access control before any AI call | `policies` (CEL expression) | [examples/ai-cel-tenant-gate](../examples/ai-cel-tenant-gate/sb.yml) |
| Mixed AI + non-AI on one hostname (health probes, docs, model catalog) | `forward_rules` with inline child origins | [examples/ai-mixed-traffic](../examples/ai-mixed-traffic/sb.yml) |
| Custom DLP beyond built-in PII (codenames, ticket IDs, internal hostnames) | `guardrails.input` with `regex` patterns | [examples/ai-regex-dlp](../examples/ai-regex-dlp/sb.yml) |
| Topic enforcement (allow-list of approved keywords) | `regex` guardrail with `action: allow` | [examples/ai-regex-dlp](../examples/ai-regex-dlp/sb.yml) |
CEL policies and request modifiers run before the AI handler dispatches, so a rejection costs no provider tokens. Forward rules dispatch by path, which means health checks and probe traffic can stay on the same hostname without billing a model. Regex guardrails inspect the parsed prompt body and slot in next to PII, injection, jailbreak, and schema guardrails.
## Native format translation
Clients always speak the OpenAI chat completions shape. When the upstream provider speaks a different protocol, sbproxy rewrites the request body and path for that provider and converts the response back to OpenAI shape.
| Provider format | Direction | Status |
|-----------------|-----------|--------|
| OpenAI | pass-through | always |
| Anthropic Messages API | bidirectional, non-streaming | shipped |
| Anthropic SSE events | streaming | not yet translated, passes through native |
| Google Gemini | bidirectional | not yet implemented |
| AWS Bedrock | bidirectional | not yet implemented |
For Anthropic, the request hoists `system` role messages to the top-level `system` field, defaults `max_tokens` when missing, strips OpenAI-only knobs (`logit_bias`, `n`, `presence_penalty`, `frequency_penalty`, `response_format`, `seed`, `user`), and rewrites the path from `/v1/chat/completions` to `/v1/messages`. The response converts text and tool_use blocks back into the OpenAI `choices[].message.content` and `tool_calls` shape, maps `stop_reason` to `finish_reason`, and renames `usage.input_tokens` / `output_tokens` to `prompt_tokens` / `completion_tokens`.
See [examples/ai-claude](../examples/ai-claude/sb.yml) and [providers.md](providers.md).
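The request-side rewrite can be sketched in Python as follows. It covers only plain string `content`, joins multiple system messages with a newline, and uses a placeholder `DEFAULT_MAX_TOKENS`; all three are assumptions, since the actual default and multi-message handling are not stated here.

```python
OPENAI_ONLY = {"logit_bias", "n", "presence_penalty", "frequency_penalty",
               "response_format", "seed", "user"}
DEFAULT_MAX_TOKENS = 1024  # assumed placeholder; the real default is not documented here

def to_anthropic(body):
    """Translate an OpenAI chat body into (path, Anthropic Messages body)."""
    system_parts = [m["content"] for m in body["messages"] if m["role"] == "system"]
    # Drop OpenAI-only knobs; messages are rebuilt below without system entries.
    out = {k: v for k, v in body.items() if k not in OPENAI_ONLY and k != "messages"}
    out["messages"] = [m for m in body["messages"] if m["role"] != "system"]
    if system_parts:
        out["system"] = "\n".join(system_parts)  # hoisted to the top-level field
    out.setdefault("max_tokens", DEFAULT_MAX_TOKENS)
    return "/v1/messages", out

def usage_to_openai(usage):
    """Rename Anthropic usage counters to the OpenAI names."""
    return {"prompt_tokens": usage["input_tokens"],
            "completion_tokens": usage["output_tokens"]}
```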
## Rate limits
Apply rate limits per client or globally to control costs and prevent abuse:
```yaml
origins:
  "ai.example.com":
    action:
      type: ai_proxy
      providers:
        - name: openai
          api_key: ${OPENAI_API_KEY}
          models: [gpt-4o-mini]
      default_model: gpt-4o-mini
      routing:
        strategy: round_robin
    policies:
      - type: rate_limiting
        requests_per_minute: 100
```
Clients exceeding the limit receive a `429 Too Many Requests` response with a `Retry-After` header.
### Per-surface rate limits
Per-model and per-tenant rate limits cap each user, key, or model independently. The AI gateway also supports per-surface caps that apply to a classified API surface (chat completions, assistants, image generation, audio speech, ...) so expensive paths can be throttled without affecting cheap ones.
```yaml
origins:
  "ai.example.com":
    action:
      type: ai_proxy
      providers:
        - name: openai
          api_key: ${OPENAI_API_KEY}
      per_surface_rate_limits:
        image_generation:
          requests_per_minute: 30
        audio_speech:
          requests_per_minute: 60
        chat_completions:
          requests_per_minute: 600
```
Keys are the `AiSurface` labels emitted on metrics (`chat_completions`, `models`, `embeddings`, `assistants`, `threads`, `batches`, `fine_tuning`, `files`, `realtime`, `image_generation`, `image_edits`, `image_variations`, `audio_transcription`, `audio_speech`, `moderations`, `reranking`). Surfaces without an entry are uncapped. When the cap fires, the proxy returns 429 before any upstream call.
The sliding window is one minute, shared across all configured origins (state is process-global). Audio-seconds-per-hour caps for realtime sessions are reserved for the realtime dispatch phase.
## Guardrails
The proxy supports nine guardrail types: `pii`, `injection`, `jailbreak`, `toxicity`, `content_safety`, `schema`, `regex`, `context_poisoning`, and `agent_alignment`. Guardrails run on input (before the provider call) or output (after), and they can block, flag, or rewrite content. See the CEL guardrails section below for inline CEL conditions, and `features.md` for the higher-level configuration of each guardrail type.
Input guardrails apply to whichever body field the surface carries user text in:
| Surface | Field guarded |
|---|---|
| `chat_completions`, `assistants`, `threads` | `body["messages"][].content` |
| `image_generation`, `image_edits`, `image_variations` | `body["prompt"]` |
| `audio_speech` | `body["input"]` |
| `reranking` | `body["query"]` |
| `moderations` | `body["input"]` |
A single guardrail block on the AI handler config covers every supported surface; the proxy picks the right field automatically based on the classified surface. Multipart-bodied surfaces (image edits, image variations, audio transcription) bypass the input-guardrail check today because their bodies are forwarded byte-transparently; output-side scanning for those surfaces is reserved for a follow-up.
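The surface-to-field table amounts to a small dispatch function. The Python sketch below uses a hypothetical `guarded_text` helper and handles only plain string values; it does not model the multipart bypass.

```python
def guarded_text(surface, body):
    """Return the list of user-text strings an input guardrail would inspect."""
    if surface in ("chat_completions", "assistants", "threads"):
        return [m["content"] for m in body.get("messages", [])
                if isinstance(m.get("content"), str)]
    if surface in ("image_generation", "image_edits", "image_variations"):
        field = "prompt"
    elif surface in ("audio_speech", "moderations"):
        field = "input"
    elif surface == "reranking":
        field = "query"
    else:
        return []  # surface has no guarded field in the table
    value = body.get(field)
    return [value] if isinstance(value, str) else []
```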
### Streaming policy
A guardrail is *streaming-safe* when its block decision for a chunk is final as soon as that chunk has been evaluated, so later chunks cannot change it. The proxy classifies the built-in guardrails as follows:
| Guardrail | Streaming-safe | Reason |
|---|---|---|
| `regex` | yes | per-chunk regex match is stable |
| `pii` | yes | PII patterns match per-chunk |
| `schema` | yes | JSON schema validation is decided on the parsed value |
| `context_poisoning` | yes | rule matches are per-message |
| `injection` | no | multi-token context windows; partial windows produce false negatives |
| `toxicity` | no | full-text classifier; partial-window scores are misleading |
| `jailbreak` | no | multi-pattern + multi-token detector |
| `content_safety` | no | full-text classifier (self-harm, violence, etc.) |
| `agent_alignment` | no | runs on the input body only (it inspects assistant tool_calls); streaming output is not in scope |
On the buffered (non-streaming) path the proxy runs every configured output guardrail against the full response. On the streaming output path the proxy runs only the streaming-safe guardrails on each chunk; non-safe guardrails are skipped because evaluating them against a partial window produces both false positives (tripping on benign mid-stream substrings) and false negatives (missing late-stream signal). Input guardrails always run against the full request regardless of `stream`.
Operators who want a non-safe guardrail to apply to streaming responses anyway should accept the partial-window risk explicitly and run a second buffered pass once the stream closes. A per-entry `streaming_safe` override for that case is planned as a follow-up.
### Context-poisoning guardrail
The `context_poisoning` input guardrail flags untrusted retrieval content that tries to manipulate the model before a downstream tool call. This is the indirect prompt injection vector from Greshake et al. (2023): a RAG pipeline pulls a poisoned page into the model's context, and the model then issues a tool call influenced by that content.
The check runs on the full input, including any `role: tool` or `role: function` messages that the AI gateway treats as retrieval content. Findings carry a stable `rule_id` and a confidence weight; the `min_confidence` setting filters out low-weight rules.
```yaml
guardrails:
  input:
    - type: context_poisoning
      enabled: true
      action: deny             # log | score | deny (default deny)
      min_confidence: 0.5
      rules:                   # optional allowlist; omit for all rules
        - cp_instruction_ignore_previous
        - cp_tool_call_scaffold
        - cp_encoded_instruction
        - cp_conflicting_directive
```
The rule catalogue covers four families:
| Family | Sample rule IDs | Detects |
|---|---|---|
| Instruction-like patterns | `cp_instruction_ignore_previous`, `cp_instruction_you_are_now`, `cp_instruction_system_prompt_leak`, `cp_suspicious_url` | "ignore previous instructions" style payloads, role-swap framings, exfiltration URL shapes |
| Tool-call hints | `cp_tool_call_scaffold`, `cp_tool_call_json_shape` | Literal ``, `function_call:`, or JSON tool invocations inside passive content |
| Encoded instructions | `cp_encoded_instruction` | Base64 and hex blobs that decode to instruction-like text |
| Conflicting directives | `cp_conflicting_directive`, `cp_instruction_imperative_regex` | Imperative second-person language in `role: tool` or `role: function` content |
Every hit emits `sbproxy_ai_context_poisoning_findings_total{rule_id, action}`. When `action: deny`, the request is also counted in `sbproxy_ai_context_poisoning_blocked_total` and the proxy returns a 4xx before any upstream call. `action: log` and `action: score` keep the request flowing; they differ only in the metric label so dashboards can separate observability volume from scoring volume.
See `examples/ai-context-poisoning/` for a complete sample configuration and curl commands.
### Agent-alignment guardrail
The `agent_alignment` input guardrail audits the assistant's `tool_calls` array against operator-declared rules: an allow list of tools the agent is permitted to invoke, an explicit deny list that always trips even when allowed elsewhere, a forbidden-substring scan over the tool arguments, and a per-turn budget on the number of tool calls. The check is the LlamaFirewall (arXiv:2505.03574) "Agent Alignment Check" use case rendered as a deterministic ruleset so the per-request cost is bounded. An LLM-judge advisory variant is planned as a follow-up and will slot into the same configuration.
Unlike the other guardrails this one runs against the raw request body so it can read the OpenAI / Anthropic / MCP tool-call shapes; the flat-text view that backs `pii` / `injection` / etc. strips `tool_calls` and would silently miss the goal-divergence cases.
```yaml
guardrails:
  input:
    - type: agent_alignment
      enabled: true
      mode: flag               # flag (default, observability only) | block
      allowed_tools: [search, fetch]
      denied_tools: [delete_account]
      forbidden_arg_substrings:
        - "/etc/passwd"
        - "AKIA"               # leaked AWS-key shapes
      max_tool_calls_per_turn: 4
```
`mode: flag` records every violation as a log line + access-log entry but lets the request through; once the operator has tuned the rule lists they flip to `mode: block` so the dispatch loop short-circuits to a 400 on the next violation. Tool calls in any of three shapes are recognised: OpenAI (`tool_calls[*].function.name` + `function.arguments`), Anthropic (`tool_calls[*].name` + `input`), and MCP (`tool_calls[*].tool` or `tool_calls[*].name` + `arguments`). The forbidden-substring scan is case-insensitive against the JSON encoding of whichever argument field is present.
See `examples/ai-agent-alignment/` for a runnable configuration that exercises every rule.
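The four rules and three tool-call shapes can be sketched in Python. The violation labels are hypothetical, and treating an empty allow list as "no restriction" is an assumption; the sketch only mirrors the checks described above.

```python
import json

def tool_call_parts(call):
    """Return (name, arguments) for OpenAI, Anthropic, or MCP shaped tool calls."""
    if "function" in call:  # OpenAI: function.name + function.arguments
        return call["function"].get("name"), call["function"].get("arguments")
    if "input" in call:     # Anthropic: name + input
        return call.get("name"), call["input"]
    return call.get("tool") or call.get("name"), call.get("arguments")  # MCP

def violations(tool_calls, allowed, denied, forbidden, max_calls):
    """Collect rule violations for one turn (labels are illustrative)."""
    found = []
    if len(tool_calls) > max_calls:
        found.append("too_many_tool_calls")
    for call in tool_calls:
        name, args = tool_call_parts(call)
        # Deny list always trips; an empty allow list is assumed to mean "no restriction".
        if name in denied or (allowed and name not in allowed):
            found.append(f"tool_not_permitted:{name}")
        encoded = json.dumps(args).lower()  # case-insensitive scan of the JSON encoding
        found += [f"forbidden_arg:{s}" for s in forbidden if s.lower() in encoded]
    return found
```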
## Lua hooks
Use Lua scripts for more complex routing logic. Lua hooks run in a sandbox with access to request context variables.
Example: route coding questions to Anthropic based on the request path:
```yaml
origins:
  "ai.example.com":
    action:
      type: ai_proxy
      providers:
        - name: openai
          api_key: ${OPENAI_API_KEY}
          models: [gpt-4o-mini]
        - name: anthropic
          api_key: ${ANTHROPIC_API_KEY}
          models: [claude-sonnet-4-20250514]
      default_model: gpt-4o-mini
      routing:
        strategy: round_robin
    request_modifiers:
      lua:
        script: |
          local path = request.path
          if string.find(path, "/code") then
            return {
              add_headers = {
                ["X-Preferred-Provider"] = "anthropic"
              }
            }
          end
          return {}
```
## CEL guardrails
Block or modify AI requests with CEL expressions:
```yaml
origins:
  "ai.example.com":
    action:
      type: ai_proxy
      providers:
        - name: openai
          api_key: ${OPENAI_API_KEY}
          models: [gpt-4o-mini]
      default_model: gpt-4o-mini
      routing:
        strategy: round_robin
    policies:
      - type: rate_limiting
        requests_per_minute: 100
    request_modifiers:
      cel:
        - expression: >
            request.headers['x-department'] == ''
              ? {"set_headers": {"X-Block": "true"}}
              : {}
```
## Budgets
Set token or dollar caps that apply across a workspace, a single virtual key, an end user, a model, an origin, or a metadata tag. The `budget` block sits under `action` and is parsed by `BudgetConfig` in `crates/sbproxy-ai/src/budget.rs`.
```yaml
action:
  type: ai_proxy
  budget:
    on_exceed: downgrade
    limits:
      - scope: workspace
        max_cost_usd: 500
        period: monthly
      - scope: api_key
        max_tokens: 1000000
        period: daily
        downgrade_to: gpt-4o-mini
      - scope: user
        max_cost_usd: 5
        period: daily
      - scope: model
        max_tokens: 200000
        period: daily
      - scope: origin
        max_cost_usd: 50
        period: daily
      - scope: tag
        max_cost_usd: 25
        period: monthly
```
### `budget` fields
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `limits` | list | `[]` | One or more `BudgetLimit` entries. Each is checked on every request. |
| `on_exceed` | enum | `block` | One of `block`, `log`, `downgrade`. Applies to whichever limit fires. |
### `BudgetLimit` fields
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `scope` | enum | required | One of `workspace`, `api_key`, `user`, `model`, `origin`, `tag`. |
| `max_tokens` | u64 | unset | Total prompt + completion tokens allowed for the scope. |
| `max_cost_usd` | f64 | unset | Total cost ceiling in USD across all requests in the scope. |
| `period` | string | unset | One of `daily`, `weekly`, `monthly`, `total`. Window over which usage accumulates. |
| `downgrade_to` | string | unset | Model name routed to when this limit fires and `on_exceed` is `downgrade`. |
### Behaviour notes
- A limit fires the first time `usage >= max_tokens` or `usage >= max_cost_usd`. Limits are checked in declaration order and the first match wins.
- `on_exceed: log` records a warning and a `sbproxy_ai_budget_utilization_ratio` gauge update, then lets the request through.
- `on_exceed: downgrade` swaps the request's model to the firing limit's `downgrade_to` and proceeds. If `downgrade_to` is unset, the request is blocked.
- Setting only `max_tokens` and leaving `max_cost_usd` unset (or vice versa) is supported. A limit with neither field is a no-op.
- A hierarchical view (`org`, `team`, `project`, `user`, `model` keys with 80% warning band) is exposed to in-process callers via `HierarchicalBudget` in `hierarchical_budget.rs`. There is no top-level YAML knob for it today; it is wired by the runtime when the gateway tracks spend.
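The firing and `on_exceed` rules above can be sketched as one Python function. The `usage` shape and the return tuples are hypothetical, and period windows are left out; the sketch only models declaration-order matching and the three outcomes.

```python
def check_budget(limits, usage, on_exceed="block"):
    """Return (decision, model) for a request.

    usage maps scope -> {"tokens": int, "cost_usd": float} (hypothetical shape).
    """
    for limit in limits:  # declaration order; the first firing limit wins
        used = usage.get(limit["scope"], {})
        over_tokens = "max_tokens" in limit and used.get("tokens", 0) >= limit["max_tokens"]
        over_cost = "max_cost_usd" in limit and used.get("cost_usd", 0.0) >= limit["max_cost_usd"]
        if not (over_tokens or over_cost):
            continue  # a limit with neither field set never fires
        if on_exceed == "log":
            return ("allow", None)
        if on_exceed == "downgrade" and limit.get("downgrade_to"):
            return ("downgrade", limit["downgrade_to"])
        return ("block", None)  # includes downgrade with no downgrade_to
    return ("allow", None)
```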
## Virtual API keys
Issue per-team or per-app keys that the gateway validates locally. Each key can restrict allowed providers and models, set its own request and token rates, carry its own budget ceiling, and tag requests for downstream attribution. The `virtual_keys` list sits under `action` and is parsed by `VirtualKeyConfig` in `crates/sbproxy-ai/src/identity.rs`.
```yaml
action:
  type: ai_proxy
  virtual_keys:
    - key: ${TEAM_A_KEY}
      name: team-a
      enabled: true
      allowed_providers: [openai, anthropic]
      allowed_models: [gpt-4o-mini, claude-3-5-haiku-20241022]
      blocked_models: [gpt-4-turbo]
      max_requests_per_minute: 60
      max_tokens_per_minute: 200000
      budget:
        max_tokens: 5000000
        max_cost_usd: 100
      tags: [team-a, beta]
```
### `virtual_keys[]` fields
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `key` | string | required | The token clients send. Treat it like a secret and inject via `${VAR}`. |
| `name` | string | unset | Human label used in logs and metrics. |
| `enabled` | bool | `true` | Disable a key without deleting the entry. |
| `allowed_providers` | list of string | `[]` | Empty list allows all configured providers. |
| `allowed_models` | list of string | `[]` | Empty list allows all models. Otherwise the request model must match one entry. |
| `blocked_models` | list of string | `[]` | Takes precedence over `allowed_models`. A blocked model is rejected even if it appears in the allow list. |
| `max_requests_per_minute` | u64 | unset | Per-key RPM cap. The 60-second window starts on the first request and resets after one minute of wall time. |
| `max_tokens_per_minute` | u64 | unset | Per-key TPM cap. Tokens are recorded after the response is read. |
| `budget` | object | unset | `KeyBudget` with `max_tokens` and `max_cost_usd`. Independent of the global `budget` block. |
| `tags` | list of string | `[]` | Free-form labels attached to every request the key authenticates. Surfaced in logs and emitted in the `sbproxy_ai_key_*` metric labels. |
Per-key usage shows up in the `sbproxy_ai_key_*` metrics.
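The allow and block precedence from the table can be expressed as a short Python check. The function name is hypothetical and exact string matching on model names is an assumption.

```python
def model_permitted(key, provider, model):
    """Apply the virtual-key provider and model rules to one request."""
    if not key.get("enabled", True):
        return False
    if model in key.get("blocked_models", []):
        return False  # the block list wins over the allow list
    allowed_providers = key.get("allowed_providers", [])
    if allowed_providers and provider not in allowed_providers:
        return False
    allowed_models = key.get("allowed_models", [])
    return not allowed_models or model in allowed_models  # empty list allows all
```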
## Caching
Three independent caches sit in front of providers. Each has its own runtime configuration in `crates/sbproxy-ai/src/`. Hit and miss counts land in `sbproxy_ai_cache_results_total`.
### Exact prompt cache
Hashes the request body and serves byte-for-byte hits. Implemented in `prompt_cache.rs`. The cache key is the SHA-256 of the canonicalised JSON `messages` array, so request key ordering does not affect lookups. The module also detects Anthropic's native `cache_control` blocks (top-level `system`, per-message, or per-content-part) and lets those pass through to the upstream provider.
The exact-match path is a runtime construct rather than an `action` field today. It is enabled implicitly when the gateway is built with a cache backing store. There are no YAML knobs for the exact prompt cache.
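To see why key ordering does not matter, here is a Python sketch of a canonicalise-then-hash key. Sorted keys with compact separators is one assumed canonical form; the exact canonicalisation in `prompt_cache.rs` may differ.

```python
import hashlib
import json

def prompt_cache_key(messages):
    """SHA-256 over a canonical JSON encoding of the messages array."""
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests whose message objects differ only in key order hash to the same key, while any change to the content yields a different one.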
### Semantic cache
Stores responses keyed by the SHA-256 of the messages array with TTL and capacity bounds. Implemented in `semantic_cache.rs` as `SemanticCache`. The constructor takes `max_entries: usize` and `ttl_secs: u64`; when the cache is full, entries are evicted in insertion order (oldest insert first), and expired entries are removed lazily on lookup.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `max_entries` | usize | constructor arg | Hard cap on cached responses. The oldest insert is evicted on overflow. |
| `ttl_secs` | u64 | constructor arg | Seconds before an entry is treated as a miss and removed. |
The semantic cache is configured via per-origin `extensions.semantic_cache` rather than `action.semantic_cache`. Example:
```yaml
origins:
  ai.example.com:
    action:
      type: ai_proxy
      providers: [...]
    extensions:
      semantic_cache:
        enabled: true
        ttl_secs: 1200
        key_template: "{embedding_model}:{lsh_bucket}"
```
The `extensions` map is opaque to the OSS config parser; runtime components that recognise the key apply it.
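The `max_entries` and `ttl_secs` behaviour described for this cache (oldest insert evicted on overflow, lazy expiry on lookup) can be modelled in Python. `TtlCache` is a hypothetical stand-in for `SemanticCache` and assumes `max_entries` is at least 1.

```python
from collections import OrderedDict

class TtlCache:
    """Capacity- and TTL-bounded cache (illustrative model, max_entries >= 1)."""

    def __init__(self, max_entries, ttl_secs):
        self.max_entries, self.ttl_secs = max_entries, ttl_secs
        self.entries = OrderedDict()  # key -> (stored_at, value), oldest insert first

    def put(self, key, value, now):
        self.entries.pop(key, None)
        while len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)  # evict the oldest insert
        self.entries[key] = (now, value)

    def get(self, key, now):
        hit = self.entries.get(key)
        if hit is None:
            return None
        stored_at, value = hit
        if now - stored_at >= self.ttl_secs:
            del self.entries[key]  # lazy expiry: removed only when looked up
            return None
        return value
```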
### Idempotency middleware (IETF `Idempotency-Key` draft)
Engages on `action: ai_proxy` origins when an `Idempotency-Key`
header is present on a POST / PUT / PATCH request. The middleware
sits ahead of the upstream provider call: on a cache hit the
gateway replays the cached `(status, headers, body)` triple
directly to the client with `x-sbproxy-idempotency: HIT` and
never contacts the provider, so Stripe-style retries do not
double-bill the upstream. On a body conflict the gateway returns
409 `ledger.idempotency_conflict`. On a miss the gateway forwards
and records the post-translation OpenAI-shape bytes the client
saw so retries replay byte-identical.
Per-origin caps (`max_request_body_bytes`,
`max_response_body_bytes`, `max_concurrent_buffers`) bound memory
and skip caching gracefully when a request exceeds them. Skip
reasons stamp on the outgoing response as
`x-sbproxy-idempotency: SKIPPED-...` so operators can spot
graceful degradation in dashboards.
Configuration is identical to general HTTP origins: see the
`idempotency:` block reference under
[`configuration.md`](configuration.md). v1 limitations: multipart
request bodies (audio transcription, image edit / variation, file
upload) are not cached, and SSE streaming responses abandon the
cache record above the response cap.
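The hit, conflict, and miss paths can be sketched in Python. The in-memory `store`, the body hashing, and the shape of the 409 body are assumptions for illustration; caps and skip reasons are not modelled.

```python
import hashlib

def handle(store, key, body, forward):
    """Serve one request carrying an Idempotency-Key.

    store: dict of key -> (body_hash, response); forward: callable that contacts
    the provider and returns a (status, headers, body) tuple.
    """
    body_hash = hashlib.sha256(body).hexdigest()
    if key in store:
        saved_hash, response = store[key]
        if saved_hash != body_hash:
            # Same key, different body: conflict (body format is illustrative).
            return (409, {}, b"ledger.idempotency_conflict")
        status, headers, payload = response
        return (status, {**headers, "x-sbproxy-idempotency": "HIT"}, payload)
    response = forward(body)          # miss: contact the provider once
    store[key] = (body_hash, response)
    return response
```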
## Per-provider limits
The proxy reads rate limit headers off provider responses and pre-emptively throttles when remaining capacity falls under a configured fraction. Implemented in `provider_ratelimit.rs` as `ProviderRateLimitTracker`.
Recognised response headers (case-insensitive):
- `x-ratelimit-remaining-requests`, `x-ratelimit-remaining-tokens`
- `x-ratelimit-reset-requests`, `x-ratelimit-reset-tokens` (formats: `1s`, `500ms`, plain seconds)
- `retry-after` (plain seconds)
- `anthropic-ratelimit-requests-remaining`, `anthropic-ratelimit-tokens-remaining`
- `anthropic-ratelimit-requests-reset`
The tracker takes a single `throttle_threshold: f64` between 0.0 and 1.0. The implementation throttles when remaining requests fall to or below `floor(1000 * threshold)`, treating 1000 req/min as a baseline. Default: `0.1`, which throttles at 100 remaining requests or fewer.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `throttle_threshold` | f64 | `0.1` | Clamped to `[0.0, 1.0]`. Lower values delay throttling until the provider is closer to its hard limit. |
Per-provider throttling is a runtime construct. There is no top-level YAML field; the tracker is instantiated alongside the provider pool and updated from every upstream response.
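The threshold arithmetic and the reset-header formats are simple to check in Python. The function names are hypothetical; the formula and the `1s` / `500ms` / plain-seconds formats come from the description above.

```python
import math

def throttle_floor(threshold):
    """Remaining-request count at or below which throttling starts."""
    clamped = min(max(threshold, 0.0), 1.0)  # clamp to [0.0, 1.0]
    return math.floor(1000 * clamped)        # 1000 req/min baseline

def should_throttle(remaining_requests, threshold=0.1):
    return remaining_requests <= throttle_floor(threshold)

def parse_reset_seconds(value):
    """Parse `1s`, `500ms`, or plain seconds into a float number of seconds."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    return float(value)
```

With the default `0.1`, throttling starts at 100 remaining requests; at `0.05` it would wait until 50 remain.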
For per-model rate limits configurable in YAML, use `model_rate_limits` on the `action` block. The struct is `ModelRateConfig` in `ratelimit.rs`:
```yaml
action:
  type: ai_proxy
  model_rate_limits:
    gpt-4o:
      requests_per_minute: 200
      tokens_per_minute: 400000
    claude-sonnet-4-20250514:
      requests_per_minute: 100
      tokens_per_minute: 200000
```
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `requests_per_minute` | u64 | unset | Sliding one-minute window cap on requests for the model. |
| `tokens_per_minute` | u64 | unset | Sliding one-minute window cap on tokens for the model. |
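To make the "sliding one-minute window" wording concrete, here is a minimal Python sketch of one way such a cap can work. It is an assumption about the general technique, not a port of `ModelRateConfig` or its enforcement code; the class name `SlidingWindowLimit` is hypothetical. The same structure covers both fields: use a weight of 1 per request for `requests_per_minute` and the token count for `tokens_per_minute`.

```python
from collections import deque


class SlidingWindowLimit:
    """Caps the summed weight of events seen in the trailing window."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # (timestamp, weight) pairs, oldest first
        self.total = 0

    def try_acquire(self, now: float, weight: int = 1) -> bool:
        # Drop events that have aged out of the window.
        cutoff = now - self.window
        while self.events and self.events[0][0] <= cutoff:
            _, old_weight = self.events.popleft()
            self.total -= old_weight
        # Reject when admitting this event would exceed the cap.
        if self.total + weight > self.limit:
            return False
        self.events.append((now, weight))
        self.total += weight
        return True
```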
## Model aliases
Map friendly names onto specific provider plus model pairs, with optional deprecation pointers. Implemented in `model_alias.rs` as `ModelAliasRegistry`, with each entry typed as `ModelAlias`. The registry is constructed by the runtime; entries deserialise from YAML or JSON when loaded.
```yaml
model_aliases:
  - alias: fast
    provider: openai
    model_id: gpt-4o-mini
  - alias: smart
    provider: anthropic
    model_id: claude-sonnet-4-20250514
  - alias: claude-old
    provider: anthropic
    model_id: claude-3-opus-20240229
    deprecated: true
    replacement: smart
```
### `ModelAlias` fields
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `alias` | string | required | The friendly name clients send. |
| `provider` | string | required | Provider name to route to. |
| `model_id` | string | required | The model ID actually sent upstream. |
| `deprecated` | bool | `false` | When true, a warning is logged on every resolution. |
| `replacement` | string | unset | Suggested alias to migrate to. Surfaces in the deprecation log line. |
Resolution returns `None` for unknown names so the request falls back to literal model ID matching. Re-registering the same alias overwrites the previous entry.
The alias registry is wired by the runtime rather than read off the `action` block. Treat the YAML above as the canonical shape when serialising aliases for code paths that load them.
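The resolution rules described above (unknown names return nothing, re-registration overwrites, deprecated entries log a warning) can be sketched as follows. This Python version mirrors the documented `ModelAlias` fields but is otherwise hypothetical; the class `AliasRegistry` and the log message format are illustrative, not the Rust API.

```python
import logging
from dataclasses import dataclass
from typing import Dict, Optional

log = logging.getLogger("alias_sketch")


@dataclass
class ModelAlias:
    alias: str
    provider: str
    model_id: str
    deprecated: bool = False
    replacement: Optional[str] = None


class AliasRegistry:
    def __init__(self) -> None:
        self._entries: Dict[str, ModelAlias] = {}

    def register(self, entry: ModelAlias) -> None:
        # Re-registering the same alias overwrites the previous entry.
        self._entries[entry.alias] = entry

    def resolve(self, name: str) -> Optional[ModelAlias]:
        entry = self._entries.get(name)
        if entry is not None and entry.deprecated:
            # Logged on every resolution of a deprecated alias.
            log.warning("alias %r is deprecated; suggested replacement: %r",
                        entry.alias, entry.replacement)
        # None means the caller falls back to literal model ID matching.
        return entry
```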
## Supported endpoints
Every inbound request to an `action: ai_proxy` origin is classified into an `AiSurface` by `classify_surface(method, path)` in `crates/sbproxy-ai/src/handler.rs`. The classifier accepts canonical OpenAI paths with optional `/v1` or `/api/v1` prefix and any trailing slash. The surface label appears on the per-surface metrics, on the request tracing span, and on every per-surface decision (rate limit, guardrail extractor, 501 gate).
Provider capability is the source of truth for which surfaces a configured provider can serve. The matrix lives in `crates/sbproxy-ai/src/api_routes.rs::provider_supports_surface`. When no configured provider supports the requested surface, the proxy returns **501 Not Implemented** before any upstream call. Universal surfaces (chat completions and models) bypass the gate. Unknown surfaces fall through to the existing dispatch and 404 at the upstream.
| Surface label | Method(s) | Path(s) | Providers (today) |
|---|---|---|---|
| `chat_completions` | POST | `/v1/chat/completions` | All |
| `models` | GET | `/v1/models`, `/v1/models/{id}` | All |
| `embeddings` | POST | `/v1/embeddings` | OpenAI, Gemini, Cohere |
| `assistants` | POST, GET, DELETE | `/v1/assistants[/{id}[/files[/{file_id}]]]` | OpenAI |
| `threads` | POST, GET, DELETE | `/v1/threads[/{id}[/messages[/{id}] \| /runs[/{id}[/cancel]]]]`, `/v1/threads/runs` | OpenAI |
| `batches` | POST, GET | `/v1/batches[/{id}[/cancel]]` | OpenAI |
| `fine_tuning` | POST, GET | `/v1/fine_tuning/jobs[/{id}[/cancel \| /events]]` | OpenAI |
| `files` | POST, GET, DELETE | `/v1/files[/{id}[/content]]` | OpenAI |
| `realtime` | GET (WebSocket upgrade) | `/v1/realtime` | OpenAI |
| `image_generation` | POST | `/v1/images/generations` | OpenAI, Gemini |
| `image_edits` | POST (multipart) | `/v1/images/edits` | OpenAI, Gemini |
| `image_variations` | POST (multipart) | `/v1/images/variations` | OpenAI, Gemini |
| `audio_transcription` | POST (multipart) | `/v1/audio/transcriptions`, `/v1/audio/translations` | OpenAI, Gemini |
| `audio_speech` | POST | `/v1/audio/speech` | OpenAI, Gemini |
| `moderations` | POST | `/v1/moderations` | OpenAI |
| `reranking` | POST | `/v1/rerank`, `/v1/reranking` | Cohere |
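As a rough illustration of the prefix and trailing-slash handling described above, the Python sketch below classifies a path into a surface label for a subset of the table. It is not the logic in `handler.rs`: it ignores the HTTP method, covers only a few rows, and the name `classify_surface_sketch` is hypothetical.

```python
from typing import Optional

# Subset of the table above: (path after prefix stripping, surface label).
_SURFACES = [
    ("/chat/completions", "chat_completions"),
    ("/models", "models"),
    ("/embeddings", "embeddings"),
    ("/images/generations", "image_generation"),
    ("/audio/speech", "audio_speech"),
    ("/rerank", "reranking"),
    ("/reranking", "reranking"),
]


def classify_surface_sketch(path: str) -> Optional[str]:
    trimmed = path.rstrip("/")  # tolerate any trailing slash
    # Strip the optional /api/v1 or /v1 prefix.
    for prefix in ("/api/v1", "/v1"):
        if trimmed == prefix or trimmed.startswith(prefix + "/"):
            trimmed = trimmed[len(prefix):]
            break
    for route, label in _SURFACES:
        if trimmed == route or trimmed.startswith(route + "/"):
            return label
    return None  # unknown surface: falls through to the existing dispatch
```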
### Response shape contract
"Supported" in the table above means the gateway accepts the surface and routes it. It does NOT mean the gateway normalises the response. Per-surface translation behaviour:
| Surface | Response shape |
|---|---|
| `chat_completions` | normalised to / from the OpenAI shape on Anthropic and Google (Gemini) formats; passthrough on OpenAI-compatible upstreams |
| `messages`, `responses` | native-format inbound shims that translate down to the same hub shape as chat completions |
| `models` | **passthrough only**: the gateway forwards the upstream's native model-list body unchanged. Clients calling `/v1/models` through a non-OpenAI provider see the upstream's shape, not the OpenAI `{"object": "list", "data": [...]}` envelope |
| everything else | passthrough on the providers listed in the table; clients see the upstream's native response shape |
The Models passthrough decision is deliberate. OpenAI returns `{"object": "list", "data": [{"id": "...", "owned_by": "..."}]}`; Anthropic returns `{"data": [{"id": "...", "display_name": "..."}], "has_more": false}`; Google's `models.list` returns `{"models": [{"name": "models/...", "displayName": "..."}]}`. A lossy normalisation would conflate these and mislead clients about per-model metadata. Callers that need a unified shape across providers should consume the proxy's own model registry instead of the passthrough.
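For clients that only need the list of model IDs and cannot use the proxy's registry, the three body shapes quoted above can be handled on the client side. The Python helper below is a sketch under that assumption; `model_ids` is a hypothetical name, and it deliberately discards all per-model metadata, which is exactly the lossy step the gateway avoids.

```python
from typing import Any, Dict, List


def model_ids(body: Dict[str, Any]) -> List[str]:
    """Extract bare model IDs from an OpenAI, Anthropic, or Google list body."""
    if "models" in body:
        # Google models.list: names look like "models/<id>".
        return [m["name"].removeprefix("models/") for m in body["models"]]
    # OpenAI and Anthropic both carry entries with an "id" under "data".
    return [m["id"] for m in body.get("data", [])]
```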
### Method coverage
The gateway accepts any standard HTTP method for any supported surface. GET, POST, PUT, DELETE, PATCH, HEAD, and OPTIONS all dispatch through the same provider-selection and observability surface. Methods other than GET/POST forward via `AiClient::forward_with_method` and do not engage the chat-completions body-parse pipeline (no JSON parsing, no budget enforcement, no input guardrails). Method-aware dispatch is what makes `DELETE /v1/assistants/{id}`, `POST /v1/threads/{id}/runs/{id}/cancel`, and the other calls outside chat completions work end-to-end.
### Multipart bodies
Image edits, image variations, audio transcription, and audio translation send multipart request bodies. The proxy detects multipart by inspecting the inbound `Content-Type` header; when it starts with `multipart/`, the body is forwarded byte-for-byte via `AiClient::forward_bytes` with the original Content-Type preserved. Provider format translation (Anthropic, etc.) does not run for multipart, since these surfaces are OpenAI-only.
### Per-surface configuration
Per-surface knobs live under `per_surface_rate_limits` (see [Per-surface rate limits](#per-surface-rate-limits)) and apply automatically based on the classified surface. Surfaces have no dedicated YAML config block beyond that; they share the top-level `providers`, `routing`, `virtual_keys`, `budget`, `model_rate_limits`, `max_concurrent`, and `guardrails` settings.
### Surfaces marked enterprise-only
Dispatch for `reranking` ships only in the enterprise build. In the OSS build the surface is still classified (so observability tags requests with `surface = "reranking"`), and the 501 gate fires unless an enterprise license check passes. The same surface label and matrix entry exist in both builds.
## Context handling
Three modules handle prompts that approach or exceed a model's context window. They are layered: relay carries history across rotations, overflow decides what to do when the next request will not fit, and compress trims when the answer is to keep going with a smaller history.
### Context relay
`crates/sbproxy-ai/src/context_relay.rs` is a thread-safe map of session ID to message history. When the router rotates between providers or virtual keys mid-session, it pulls the prior message list out of the relay and replays it to the new provider so the conversation does not reset. Messages are kept as raw `serde_json::Value` so provider-specific shapes survive the round trip. No YAML config: it is internal state used by the router.
### Context overflow
`crates/sbproxy-ai/src/context_overflow.rs` ships a registry of context windows for the OpenAI, Anthropic, Gemini, Mistral, and Llama families and decides what to do when a request would overflow. Three actions are available:
- `Error`: return a 4xx to the client.
- `FallbackToLarger(model)`: resend to a larger-window model named in config.
- `Truncate`: drop oldest turns and retry, available through `check_overflow_with_truncate`.
The choice is driven by a `context_overflow` block on the AI handler:
```yaml
action:
  type: ai_proxy
  context_overflow:
    fallback_model: gpt-4o   # used when the current model overflows and gpt-4o has a larger window
    on_overflow: truncate    # error | fallback | truncate
```
If the requested model is not in the registry, overflow checks are skipped (no window to compare against) and the request is forwarded as-is.
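The decision flow can be summarised in a short Python sketch. It assumes a plain dictionary of context windows and treats a fallback model that is missing from the registry, or not larger than the current one, as an error; that last rule is an assumption, and `decide_overflow` is a hypothetical name rather than a function in `context_overflow.rs`.

```python
from typing import Dict, Optional, Tuple


def decide_overflow(
    model: str,
    prompt_tokens: int,
    windows: Dict[str, int],
    on_overflow: str = "error",
    fallback_model: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Return (action, model) where action is forward, truncate, fallback, or error."""
    window = windows.get(model)
    # Unknown model or request fits: forward as-is.
    if window is None or prompt_tokens <= window:
        return ("forward", model)
    if on_overflow == "truncate":
        return ("truncate", model)
    if on_overflow == "fallback" and fallback_model is not None:
        # Assumption: only fall back when the target window is strictly larger.
        if windows.get(fallback_model, 0) > window:
            return ("fallback", fallback_model)
    return ("error", None)
```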
### Context compress
`crates/sbproxy-ai/src/context_compress.rs` does cost-aware history trimming. `estimate_message_tokens` uses a four-characters-per-token approximation. `trim_to_budget` always keeps the leading system message, then walks remaining messages newest-to-oldest, including each one only if it fits in the remaining token budget, then restores chronological order before returning.
This module exposes pure functions; it is invoked by the routing strategy and overflow handler. There is no `context_compress:` YAML block.
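The trimming walk described above is easy to get subtly wrong, so here is a minimal Python sketch of it. It follows the stated steps (keep the leading system message, walk newest-to-oldest, restore chronological order), but two details are assumptions: the estimate floors at one token per message, and a message that does not fit is skipped rather than ending the walk. The Rust functions may differ on both.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def estimate_message_tokens(message: Message) -> int:
    # Four-characters-per-token approximation; the floor of 1 is an assumption.
    content = message.get("content", "")
    text = content if isinstance(content, str) else str(content)
    return max(1, len(text) // 4)


def trim_to_budget(messages: List[Message], budget: int) -> List[Message]:
    system: List[Message] = []
    rest = messages
    # Always keep the leading system message and charge it to the budget.
    if messages and messages[0].get("role") == "system":
        system = [messages[0]]
        rest = messages[1:]
        budget -= estimate_message_tokens(messages[0])
    kept: List[Message] = []
    # Walk newest-to-oldest, keeping each message that still fits.
    for message in reversed(rest):
        cost = estimate_message_tokens(message)
        if cost <= budget:
            kept.append(message)
            budget -= cost
    kept.reverse()  # restore chronological order
    return system + kept
```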
## Streaming analytics
`crates/sbproxy-ai/src/streaming_analytics.rs` tracks per-stream timing for SSE responses. `StreamTracker` records start time, first-token instant, and last-token instant; from these it computes Time to First Token (`ttft_ms`), Tokens Per Second (`tps`), and average inter-token latency (`avg_itl_ms`). `StreamRegistry` is the global map of in-flight streams keyed by request ID.
These values feed the `sbproxy_ai_request_duration_seconds` histogram and request-scoped log records. The module has no YAML config; it is wired in whenever streaming responses are observed.
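The three derived values can be computed from the three recorded instants. The Python sketch below shows one plausible set of definitions; the exact formulas in `streaming_analytics.rs` are not stated here, so treat the choice of the first-to-last-token window for `tps` and `avg_itl_ms` as an assumption. `StreamTimings` is a hypothetical stand-in for `StreamTracker`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamTimings:
    start: float                         # seconds, any monotonic clock
    first_token: Optional[float] = None
    last_token: Optional[float] = None
    tokens: int = 0

    def record_token(self, now: float, count: int = 1) -> None:
        if self.first_token is None:
            self.first_token = now
        self.last_token = now
        self.tokens += count

    def ttft_ms(self) -> Optional[float]:
        """Time to first token, measured from stream start."""
        if self.first_token is None:
            return None
        return (self.first_token - self.start) * 1000.0

    def tps(self) -> Optional[float]:
        """Tokens per second over the first-to-last-token window (assumed)."""
        if self.first_token is None or self.last_token == self.first_token:
            return None
        return self.tokens / (self.last_token - self.first_token)

    def avg_itl_ms(self) -> Optional[float]:
        """Mean gap between consecutive tokens."""
        if self.tokens < 2:
            return None
        return (self.last_token - self.first_token) * 1000.0 / (self.tokens - 1)
```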
## Structured output
`crates/sbproxy-ai/src/structured_output.rs` validates responses against a JSON Schema. The config struct sits on the AI handler:
```yaml
action:
  type: ai_proxy
  structured_output:
    schema:                  # JSON Schema the response must conform to
      type: object
      required: [name, age]
      properties:
        name: {type: string}
        age: {type: integer}
    retry_on_failure: true   # default: false
    max_retries: 2           # default: 1
```
When `retry_on_failure` is true, a failed validation triggers a retry with the schema injected into the system prompt via `build_schema_instruction`. `extract_json` strips ` ```json ` and ` ``` ` fences before parsing, so models that wrap output in markdown still validate. Validation is structural: required-field presence and per-property type checks (`string`, `number`, `integer`, `boolean`, `array`, `object`, `null`). Full JSON Schema features such as `$ref` and `oneOf` are not implemented.
The validator and the schema-instruction builder are live functions; the wiring that calls them on every chat response is a runtime construct rather than a top-level YAML field. The YAML block above is the shape that ships when a runtime caller threads `StructuredOutputConfig` into the chat handler. Source: `crates/sbproxy-ai/src/structured_output.rs`.
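To show what "structural" validation covers, the Python sketch below strips a markdown fence and then checks required fields and per-property types against the schema from the YAML above. It is an approximation of the described behaviour, not a port of the Rust functions: `validate` is a hypothetical name, only top-level object properties are checked, and booleans are rejected for `number` and `integer` as an assumption.

```python
import json
from typing import Any, Dict, List

FENCE = "`" * 3  # the markdown code-fence marker

_TYPES = {
    "string": str, "number": (int, float), "integer": int, "boolean": bool,
    "array": list, "object": dict, "null": type(None),
}


def extract_json(text: str) -> Any:
    """Strip an optional fence (with or without a json tag) and parse."""
    body = text.strip()
    if body.startswith(FENCE):
        body = body[len(FENCE):]
        if body.startswith("json"):
            body = body[len("json"):]
        if body.endswith(FENCE):
            body = body[:-len(FENCE)]
    return json.loads(body.strip())


def validate(value: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of structural errors; an empty list means valid."""
    errors = [f"missing required field: {name}"
              for name in schema.get("required", []) if name not in value]
    for name, prop in schema.get("properties", {}).items():
        expected = prop.get("type")
        if name not in value or expected not in _TYPES:
            continue
        actual = value[name]
        ok = isinstance(actual, _TYPES[expected])
        # bool is a subclass of int in Python; do not accept it as a number.
        if expected in ("number", "integer") and isinstance(actual, bool):
            ok = False
        if not ok:
            errors.append(f"field {name} is not of type {expected}")
    return errors
```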
## OpenAI surface-area modules
The `sbproxy-ai` crate ships shape definitions and lightweight handlers for the OpenAI surface beyond chat completions: assistants, threads, batch jobs, image generation, audio, fine-tuning, realtime sessions, and structured output. The shapes are stable and round-trip through `serde_json`; the chat-path router (`crates/sbproxy-ai/src/handler.rs:parse_ai_path` and `crates/sbproxy-ai/src/api_routes.rs:parse_endpoint`) recognises a subset (chat, embeddings, models, rerank, moderations, image generation, audio transcription, audio speech) and falls back to `Unknown` for the rest. The remaining shapes are present so plugin authors can build on top of them and so the action config surface is forward-compatible.
The subsections below describe what each module contributes today.
### `assistants`
Shape definitions for the OpenAI Assistants API. `AssistantHandler::route_request(path, method)` classifies a request into one of: `CreateAssistant`, `ListAssistants`, `GetAssistant(id)`, `CreateThread`, `CreateMessage(thread_id)`, `CreateRun(thread_id)`, `GetRun(thread_id, run_id)`, or `Unknown`. The optional `/v1` prefix is stripped before matching. `AssistantConfig { enabled: bool }` is the on/off shape.
```yaml
action:
  type: ai_proxy
  providers: [...]
  # Forward-compatible flag, recognised by the parser but not yet enforced.
  assistants:
    enabled: true
```
The router classifier is implemented; routing into the chat dispatcher is not yet wired in the OSS build. Use chat completions for assistant-style flows until the dispatcher lands. Source: `crates/sbproxy-ai/src/assistants.rs:AssistantHandler`.
### `threads`
In-memory `ThreadStore` for OpenAI-style threads and their messages. Stores `Thread { id, created_at, metadata }` and ordered `ThreadMessage { id, thread_id, role, content, created_at }`. The store is thread-safe (mutex-backed) and used by the assistants handler for local session continuity. There is no YAML field that selects a backing store today; the in-memory store is the only implementation. Source: `crates/sbproxy-ai/src/threads.rs:ThreadStore`.
### `batch`
`BatchJob` shape (id, status, created_at, completed_at, total_requests, completed_requests, failed_requests, metadata) plus a `BatchStore` trait with one implementation, `MemoryBatchStore`. Status lifecycle: `pending`, `in_progress`, `completed`, `failed`, `cancelled`. The store is wired by the runtime when a batch dispatcher is constructed; there is no top-level `batch:` YAML block. Source: `crates/sbproxy-ai/src/batch.rs`.
### `image`
Request and response shapes for image generation, edit, and variation. `ImageGenerationRequest { prompt, model, size, n }` and `ImageGenerationResponse { images: Vec<ImageData> }`, where each `ImageData` carries either a `url` or a base-64 `b64_json` payload depending on the provider's `response_format`. `/v1/images/generations` is routed by `api_routes.rs`; the per-call dispatch is built by the runtime. No dedicated YAML knobs. Source: `crates/sbproxy-ai/src/image.rs`.
### `audio`
Request and response shapes for audio transcription and speech synthesis. `TranscriptionRequest { file_url, model, language }`, `TranscriptionResponse { text, duration }`, and `SpeechRequest { input, model, voice }`. `/v1/audio/transcriptions` and `/v1/audio/speech` are recognised by `api_routes.rs`. No dedicated YAML knobs; the audio dispatcher reuses the top-level provider list and routing strategy. Source: `crates/sbproxy-ai/src/audio.rs`.
### `finetune`
Fine-tuning API classifier. `FinetuneHandler::route_request(path, method)` classifies into `CreateJob`, `ListJobs`, `GetJob(id)`, `CancelJob(id)`, `ListEvents(id)`, or `Unknown`, with the optional `/v1` prefix stripped. `FinetuneConfig { enabled: bool }` is the on/off shape.
```yaml
action:
  type: ai_proxy
  providers: [...]
  # Forward-compatible flag, recognised by the parser but not yet enforced.
  finetune:
    enabled: true
```
Like `assistants`, the classifier is implemented; routing into the chat dispatcher is not yet wired in the OSS build. Source: `crates/sbproxy-ai/src/finetune.rs:FinetuneHandler`.
### `realtime`
Shape definitions and config for OpenAI's Realtime websocket API. `RealtimeConfig { enabled, model }` defaults to `enabled: false` and `model: "gpt-4o-realtime-preview"`. `RealtimeSession { session_id, model, created_at, status }` and `RealtimeEvent { event_type, data }` round-trip through serde. The `/v1/realtime` websocket path is recognised by the proxy but session bridging requires a runtime-level dispatcher; the config shape above is the YAML form that the dispatcher reads.
```yaml
action:
  type: ai_proxy
  providers: [...]
  realtime:
    enabled: true
    model: gpt-4o-realtime-preview
```
Source: `crates/sbproxy-ai/src/realtime.rs`.
### `structured_output`
Already covered above under [Structured output](#structured-output). Shape and validator are live (`extract_json`, `validate_response`, `build_schema_instruction`); the wiring that runs the validator on every chat response is a runtime construct rather than a top-level YAML field. Source: `crates/sbproxy-ai/src/structured_output.rs`.
## Per-request attribution
The gateway records provider, model, token counts, and estimated cost for every AI request and exposes them through Prometheus metrics (see below). Direct response headers for these fields are not emitted today.
## Token usage metrics
The proxy exposes aggregate AI usage as Prometheus metrics. When `telemetry.bind_port` is configured, the following counters, gauges, and histograms are available at `/metrics` under the `sbproxy_ai_*` namespace:
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `sbproxy_ai_requests_total` | Counter | `provider`, `model`, `status` | Total AI requests |
| `sbproxy_ai_surface_requests_total` | Counter | `surface`, `method` | Total AI requests partitioned by classified surface (chat completions, assistants, image generation, ...) and HTTP method |
| `sbproxy_ai_surface_request_duration_seconds` | Histogram | `surface`, `method` | Per-surface request latency. Buckets match `sbproxy_ai_request_duration_seconds` for side-by-side dashboards |
| `sbproxy_ai_tokens_total` | Counter | `provider`, `model`, `direction` | Tokens consumed (`direction` is `input` or `output`) |
| `sbproxy_ai_cost_dollars_total` | Counter | `provider`, `model` | Estimated cost in USD |
| `sbproxy_ai_request_duration_seconds` | Histogram | `provider`, `model` | End-to-end AI request latency |
| `sbproxy_ai_failovers_total` | Counter | `from_provider`, `to_provider`, `reason` | Provider failover events |
| `sbproxy_ai_guardrail_blocks_total` | Counter | `category` | Guardrail block events (pii, injection, jailbreak, etc.) |
| `sbproxy_ai_cache_results_total` | Counter | `provider`, `cache_type`, `result` | AI response cache results (`cache_type` is `exact` or `semantic`, `result` is `hit` or `miss`) |
| `sbproxy_ai_budget_utilization_ratio` | Gauge | `scope` | Current budget utilization as a 0 to 1 ratio |
| `sbproxy_ai_key_requests_total` | Counter | `virtual_key`, `provider`, `model` | Requests per virtual key |
| `sbproxy_ai_key_tokens_total` | Counter | `virtual_key`, `direction` | Tokens per virtual key |
| `sbproxy_ai_key_cost_dollars_total` | Counter | `virtual_key` | Cost in USD per virtual key |
| `sbproxy_ai_realtime_sessions_active` | Gauge | | Currently open OpenAI Realtime API WebSocket sessions |
| `sbproxy_ai_realtime_session_duration_seconds` | Histogram | `provider`, `close_reason` | Wall-clock duration of a Realtime WebSocket session, observed at close. `close_reason` is `client_closed` or `error` |
| `sbproxy_ai_realtime_audio_seconds_total` | Counter | `provider`, `direction` | Cumulative audio seconds forwarded over Realtime sessions. Frame-exact accounting requires terminate-and-relay (not on the OSS path); the OSS dispatcher uses session wall-clock as a duration proxy on close |
| `sbproxy_ai_realtime_frames_forwarded_total` | Counter | `provider`, `direction`, `kind` | Cumulative frames forwarded over Realtime sessions (`kind` is `text` or `audio`). Reserved for a future enterprise terminate-and-relay path |
Use these to build spending dashboards, set budget alerts, and track provider reliability without any application-level instrumentation.
## Dashboards
The metrics above can be wired into any Prometheus-compatible dashboard tool. A pre-built JSON for AI gateway health is on the roadmap; for now, point your existing Prometheus or Grafana setup at `/metrics` and chart the counters and histograms listed above.
## Streaming
The proxy supports streaming responses. When your client sends a streaming request (e.g. `"stream": true` in the OpenAI API), the proxy:
1. Validates the request (auth, rate limits, guardrails).
2. Picks a provider using the configured routing strategy.
3. Opens a streaming connection to the provider.
4. Forwards SSE chunks to the client as they arrive.
5. Reads token usage from the final chunk and records it to the metrics counters.
No special configuration is needed. Streaming works with all routing strategies and all providers.
### Usage extraction
Different providers report streaming token counts in different SSE shapes. The streaming relay scans every chunk through a pluggable parser and records the captured tokens against the configured budget scopes when the stream closes. Pick the parser explicitly with `usage_parser`, or leave it at the default `auto` and the proxy resolves it from the upstream URL host, response `Content-Type`, and an optional `X-Provider` response header.
| `usage_parser` | Wire format | Notes |
|---|---|---|
| `openai` | `data: {..., "usage": {...}}\n\n` terminal frame | OpenAI, Azure OpenAI, OpenAI-compatible relays |
| `anthropic` | `event: message_start` plus `event: message_delta` with `usage` | Takes the maximum seen across both events; `input_tokens` from start, `output_tokens` from delta |
| `vertex` | `data: {..., "usageMetadata": {...}}` on every chunk | Vertex AI / Gemini; values grow monotonically |
| `bedrock` | `data: {"bytes": ""}` envelope | Decodes the envelope and delegates to the Anthropic parser for the inner stream |
| `cohere` | `data: {..., "event_type": "stream-end", ..., "billed_units": {...}}` | Reads `response.meta.billed_units` or `meta.billed_units` |
| `ollama` | NDJSON: `{..., "done": true, "prompt_eval_count": N, "eval_count": M}\n` | Line-delimited JSON instead of SSE |
| `generic` | Best-effort across all of the above | Default fallback when `auto` cannot match a known upstream |
| `auto` | Resolved at request time | See order below |
| `none` | Skip parsing | Disables streaming budget recording for this origin |
`auto` resolves in this order:
1. Response `X-Provider` header (operator-controlled).
2. Upstream URL host: `*.openai.com` or `*.openai.azure.com` -> `openai`, `*.anthropic.com` -> `anthropic`, `*.googleapis.com` or any host containing `aiplatform` -> `vertex`, `bedrock-*` or `*.amazonaws.com` -> `bedrock`, `*.cohere.ai` or `*.cohere.com` -> `cohere`, `localhost:11434` or any host containing `ollama` -> `ollama`.
3. Response `Content-Type`: `application/x-ndjson` or `application/jsonl` -> `ollama`.
4. Fall back to `generic`.
Unknown values warn once and fall back to `generic` so a typo never silently disables budget recording.
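The `auto` resolution order can be read as a simple cascade. The Python sketch below encodes the four steps as listed; it is illustrative rather than the proxy's code. Two points are assumptions: the `X-Provider` header is taken to carry a parser name directly, and an unrecognised header value is ignored so resolution continues with the host rules. `resolve_usage_parser` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Optional
from urllib.parse import urlsplit

PARSERS = {"openai", "anthropic", "vertex", "bedrock", "cohere",
           "ollama", "generic", "none"}


def resolve_usage_parser(
    upstream_url: str,
    content_type: Optional[str] = None,
    x_provider: Optional[str] = None,
) -> str:
    # 1. Operator-controlled response header (assumed to hold a parser name).
    if x_provider and x_provider.strip().lower() in PARSERS:
        return x_provider.strip().lower()

    # 2. Upstream URL host.
    parts = urlsplit(upstream_url)
    host = (parts.hostname or "").lower()
    netloc = parts.netloc.lower()  # keeps the port, needed for localhost:11434

    def matches(*patterns: str) -> bool:
        return any(fnmatchcase(host, p) for p in patterns)

    if matches("*.openai.com", "*.openai.azure.com"):
        return "openai"
    if matches("*.anthropic.com"):
        return "anthropic"
    if matches("*.googleapis.com") or "aiplatform" in host:
        return "vertex"
    if matches("bedrock-*", "*.amazonaws.com"):
        return "bedrock"
    if matches("*.cohere.ai", "*.cohere.com"):
        return "cohere"
    if netloc == "localhost:11434" or "ollama" in host:
        return "ollama"

    # 3. Response Content-Type (parameters such as charset are ignored).
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type in ("application/x-ndjson", "application/jsonl"):
        return "ollama"

    # 4. Nothing matched.
    return "generic"
```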
```yaml
origins:
  "ai.example.com":
    action:
      type: ai_proxy
      usage_parser: anthropic   # or auto, openai, vertex, bedrock, cohere, ollama, generic, none
      providers:
        - name: anthropic
          api_key: ${ANTHROPIC_API_KEY}
          base_url: https://api.anthropic.com/v1
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused",
    default_headers={"Host": "ai.example.com"},
)
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about proxies."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (for example a trailing usage frame) carry no choices.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
## Realtime
The AI gateway routes OpenAI Realtime API WebSocket sessions through the same dispatch path as the rest of the surface set. A client opens `GET /v1/realtime` with `Upgrade: websocket` against the proxy, the gateway runs its standard pre-upgrade gating, picks an enabled provider that supports Realtime (today: OpenAI), and lets Pingora forward bytes between the client and the provider after the `101 Switching Protocols` handshake.
What runs before the upgrade:
- Surface classification stamps `ai.surface = "realtime"` on the request span and the access log.
- The 501 capability gate fires if no configured provider supports Realtime.
- The per-surface rate limit (`per_surface_rate_limits.realtime`) fires before the upgrade is attempted, returning 429 when the cap is hit.
- The active-sessions gauge `sbproxy_ai_realtime_sessions_active` ticks up.
What runs during the session:
- Pingora forwards WebSocket frames byte-transparently. The proxy does not inspect individual frames (per-frame guardrails are not on the OSS path; they would require terminate-and-relay, which is reserved for an enterprise build).
What runs at session close (the `logging` hook):
- The active-sessions gauge ticks down.
- `sbproxy_ai_realtime_session_duration_seconds` records the wall-clock session lifetime.
- An `AiBillingEvent` fires with `usage = AudioSeconds { seconds = wall_clock }` so operators see realtime usage on the standard billing event bus. Cost is reported as 0.0 in OSS until the realtime rate card lands in the pricing helper; downstream consumers can compute cost from the duration.
```yaml
origins:
  "ai.example.com":
    action:
      type: ai_proxy
      providers:
        - name: openai
          api_key: ${OPENAI_API_KEY}
          base_url: https://api.openai.com/v1
          models: [gpt-4o-realtime-preview]
      per_surface_rate_limits:
        realtime:
          requests_per_minute: 30
```
A client connects with the standard OpenAI Realtime URL, replacing the OpenAI host with the proxy host:
```python
import websocket  # websocket-client

ws = websocket.create_connection(
    "wss://ai.example.com/v1/realtime?model=gpt-4o-realtime-preview",
    header=[
        "Authorization: Bearer ",
        "OpenAI-Beta: realtime=v1",
    ],
)
```
The proxy enforces gating before the upgrade and emits a session-end billing event after close; per-frame inspection is reserved for an enterprise terminate-and-relay path that would land alongside a dedicated Pingora `Service` impl.
## Full example
An AI gateway with two providers, fallback routing, API key auth, and a rate limit:
```yaml
proxy:
  http_bind_port: 8080
origins:
  "ai.example.com":
    action:
      type: ai_proxy
      providers:
        - name: openai
          api_key: ${OPENAI_API_KEY}
          priority: 1
          models: [gpt-4o, gpt-4o-mini, gpt-4-turbo]
        - name: anthropic
          api_key: ${ANTHROPIC_API_KEY}
          priority: 2
          models: [claude-sonnet-4-20250514, claude-3-5-haiku-20241022]
      default_model: gpt-4o-mini
      routing:
        strategy: fallback_chain
    authentication:
      type: api_key
      api_keys:
        - ${AI_GATEWAY_KEY}
    policies:
      - type: rate_limiting
        requests_per_minute: 200
```
## Hot-reload behavior
A `SIGHUP`, an admin-API reload, or an in-place edit of `sb.yml` (when the file watcher is on) refreshes the AI gateway without restarting the proxy. The provider catalog under `proxy.ai_providers_file`, the live `AiClient`, and the compiled handler chain are rebuilt and swapped atomically; in-flight requests continue against their existing snapshot until they finish, and subsequent requests pick up the new state. Adding a provider, rotating a `default_base_url`, or fixing a typo in `ai_providers.yml` no longer requires shedding connections.
The process-wide AI budget tracker is deliberately left alone on reload. Budget windows are wall-clock-relative (daily, monthly, custom), so the per-scope token and cost accumulators must outlive a config reload. Wiping the tracker would silently roll counters back to zero and let already-spent budget through a second time. To clear a budget intentionally, restart the process or call the per-scope reset path on the admin surface.
## See also
- [providers.md](providers.md) - full provider table and per-provider model lists.
- [scripting.md](scripting.md) - CEL and Lua reference, including AI selector and guardrail variables.
- [configuration.md](configuration.md) - general configuration model, origin schema, and the full `sb.yml` field reference.
- [features.md](features.md) - higher-level overview of features including guardrails.
================================================================
# docs/ai-lb-benchmark.md
================================================================
## AI router load-balancing benchmark
*Last modified: 2026-05-31*
The AI router supports several load-balancing strategies (round-robin,
peak-EWMA, least-connections, least-token-usage, prefix-affinity, and
others). This page compares them on a synthetic, skewed workload and
publishes the P50 / P95 / P99 / P99.9 numbers an operator can compare
against when picking a strategy.
## What the bench measures
The harness at `sbproxy-bench/harness/ai_lb_strategy/` drives a
synthetic, skewed workload through the live
`sbproxy_ai::routing::Router` for each declared strategy, then
prints a P50 / P95 / P99 / P99.9 / max comparison table plus a
Jain fairness index and (for `prefix_affinity`) a KV-cache hit
rate.
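For readers unfamiliar with the fairness column: Jain's index over per-provider request counts x_1..x_n is (sum of x)^2 divided by (n times the sum of x^2). It is 1.0 for a perfectly even split and 1/n when one provider takes everything. A small Python sketch follows; the function name is illustrative, the harness's own implementation is not shown here, and returning 1.0 for an all-zero input is a convention chosen for this sketch.

```python
from typing import Sequence


def jain_index(loads: Sequence[float]) -> float:
    """Jain fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(loads)
    if n == 0:
        raise ValueError("need at least one provider")
    sum_sq = sum(x * x for x in loads)
    if sum_sq == 0:
        return 1.0  # all-zero load: treated as trivially fair (a convention)
    total = sum(loads)
    return (total * total) / (n * sum_sq)
```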
The bench is in-process, not HTTP-driven. The variable under test
is the LB algorithm; an HTTP backend would have to fake the
KV-cache and provider-latency skews anyway, so the in-process
driver lets the bench measure the router without confounds from
the proxy substrate.
## The workload
Three orthogonal skews, each tunable via CLI:
| Skew | Default | Models the real-world case where ... |
| --- | --- | --- |
| Provider latency heterogeneity | one slow provider out of four at 5x base latency | A vLLM pool has one warm-but-overloaded worker |
| Prompt-prefix Zipf | s = 1.1 over 100 prefixes | Chat traffic where some system prompts repeat |
| Tenant token-burst Zipf | s = 1.0 over 10 tenants | A small fleet with one hot tenant emitting most tokens |
## Simulated latency model
```text
observed_ms = base_ms * provider_factor
            - kv_cache_bonus_ms   if prefix was seen on this provider
                                  in the last 64 requests
            + queue_term_ms       (in-flight count * 5ms)
            + lognormal noise     (mu=0, sigma=0.3)
```
The lognormal noise creates the heavy tail that makes P99 the
right comparison metric. The KV-cache bonus is what lets
`prefix_affinity` show its value in simulation; without it the
strategy is indistinguishable from round-robin.
These assumptions are not validated against a real vLLM pool. A
follow-up bench against a Docker vLLM fixture is tracked under
the bench harness's README.
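The latency formula above translates directly into code. The Python sketch below follows it term by term, with the noise applied additively as written; `base_ms` and `kv_cache_bonus_ms` are left as parameters because their values are not given here, and the function is an illustration rather than the harness's Rust implementation.

```python
import random


def observed_ms(
    base_ms: float,
    provider_factor: float,
    in_flight: int,
    prefix_is_warm: bool,
    kv_cache_bonus_ms: float,
    rng: random.Random,
) -> float:
    latency = base_ms * provider_factor
    if prefix_is_warm:
        # Prefix was seen on this provider in the last 64 requests.
        latency -= kv_cache_bonus_ms
    latency += in_flight * 5.0                # queue term: 5 ms per in-flight request
    latency += rng.lognormvariate(0.0, 0.3)   # lognormal noise, mu=0, sigma=0.3
    return latency
```

Seeding `random.Random` makes a run repeatable, which helps when comparing a warm-prefix sample against a cold one for the same provider.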
## Reproducing the run
```bash
cd sbproxy-bench/harness/ai_lb_strategy
SBPROXY_BENCH=1 cargo run --release -- --total-requests 50000
```
The `SBPROXY_BENCH=1` env-var gate is enforced in `main.rs` so an
accidental local invocation cannot saturate a core. CI does not
run this; it is a lab-only artifact.
## What to expect
Under the default skewed workload:
- **`round_robin`** posts the worst P99 because it does not avoid
the slow provider. Per-provider request distribution is uniform
(Jain ~1.0) which looks fair but produces the tail.
- **`peak_ewma`** posts the best P99 of the latency-aware strategies.
Two-of-N sampling avoids the herd-on-one-fast-provider pathology
that `lowest_latency` falls into.
- **`prefix_affinity`** posts the best P99 when the Zipf parameter
is at least ~1.0 (default 1.1). The KV-cache hit rate column shows
why: the same prefix lands on the same provider often enough to
reuse a warm cache. Lower the prefix-Zipf to 0.0 (uniform) and
the strategy degenerates toward round-robin's number.
- **`least_token_usage`** posts a fairness Jain index above 0.95
on the tenant-skewed workload because it spreads the hot tenant's
tokens evenly across providers.
- **`least_connections`** behaves similarly to `peak_ewma` here
because the queue term in the latency model is what its in-flight
signal tracks. In a real vLLM pool the queue term is more
pronounced and the two diverge.
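The "two-of-N sampling" mentioned for `peak_ewma` is commonly known as power-of-two-choices: sample two candidates at random and send the request to the one with the better score. The Python sketch below shows that idea in isolation. The score is assumed to be a per-provider latency estimate where lower is better; how the router computes its EWMA is not covered, and `pick_two_choices` is a hypothetical name.

```python
import random
from typing import Dict


def pick_two_choices(scores: Dict[str, float], rng: random.Random) -> str:
    """Sample two distinct providers and return the one with the lower score."""
    names = list(scores)
    if not names:
        raise ValueError("no providers configured")
    if len(names) == 1:
        return names[0]
    first, second = rng.sample(names, 2)
    return first if scores[first] <= scores[second] else second
```

Because only two candidates are compared per request, traffic does not all pile onto the single current-best provider, yet the worst-scoring provider is never selected when at least two providers exist.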
The README at `sbproxy-bench/harness/ai_lb_strategy/README.md` is
the canonical reference for the flags and the model assumptions.
## Caveats
1. The KV-cache bonus and lognormal-noise sigma are unvalidated
against production traffic. The doc calls them out so a reader
can challenge them.
2. The bench writes to `Router::record_latency` with `Relaxed`
atomic semantics. Two strategies (`lowest_latency`, `peak_ewma`)
read the same field as ground truth. The most recent write
wins; under the bench's single-threaded sample loop this is
deterministic, but under multi-threaded production traffic the
reads see slightly stale numbers.
3. `prefix_affinity` looks bad with uniform prompts. The default
prefix-Zipf of 1.1 ships the strategy in its strong configuration;
operators considering it should match against their own traffic
shape before turning it on.
4. The bench does not measure cost. Strategies with cost in their
name (`cost_optimized`, `cascade`) are not in the comparison
table because P99 is the wrong axis for them.
## Related
- `crates/sbproxy-ai/src/routing.rs` is where the strategies live.
- `BENCHMARK.md` at the repo root covers workspace-level proxy
overhead numbers; this page is the AI router-specific axis.
- The `sbproxy_ai_lb_decisions_total{strategy, provider}` metric
emitted by the router lets you reproduce the per-provider
distribution table on a live deployment.
================================================================
# docs/architecture.md
================================================================
## SBproxy architecture and deployment guide
*Last modified: 2026-06-08*
This document covers the internal architecture of SBproxy, the request lifecycle, the plugin
system, the AI gateway, caching, events, and common deployment topologies.
---
## 1. Overview
SBproxy is a single static binary with no required external runtime dependencies. It is
written in Rust and ships as a self-contained executable. There is no JVM, no Python
interpreter, no Node.js runtime, and no shared library requirement beyond libc (or none at
all when statically linked against musl via a `*-unknown-linux-musl` target).
The proxy is built on Cloudflare's [Pingora](https://github.com/cloudflare/pingora)
framework. Pingora supplies the tokio runtime, listener management, HTTP/1.1 and HTTP/2,
TLS termination, and a phase-based callback model for the request pipeline. HTTP/3 is
currently disabled pending native Pingora HTTP/3 support. SBproxy layers its host router,
compiled origin pipeline, plugin registry, and hot-reload machinery on top of those
primitives.
The plugin system is modeled on Caddy's module pattern. Every extensible component type
(action handlers, auth providers, policy evaluators, transforms, middleware) registers
itself at compile time through the `inventory` crate. The proxy crate is the binary
composition root; pulling a feature in or out is a matter of which workspace crates are
linked into the final executable.
Key properties:
- Single binary. One file to copy, one process to manage. mimalloc is the global
allocator, typically 5 to 10 percent faster than glibc's allocator under contention.
- Zero-dependency startup. Runs without Redis, a database, or a sidecar. External
integrations (Redis cache, webhook events, OTEL tracing) are opt-in and fail gracefully
when unavailable.
- Hot reload. Config changes are applied without restarting. The watcher detects file
changes and atomically swaps the compiled origin map via `arc-swap`. In-flight requests
finish on their snapshot; new requests pick up the new map immediately.
- Embeddable. The `sbproxy-core` crate exposes a small `run` / `shutdown` API for use as a
library inside another Rust binary.
---
## 2. Workspace layout
```
sbproxy/
  crates/
    sbproxy/            - Binary entry point. Wires modules and starts the server.
    sbproxy-core/       - Pingora server, host router, phase dispatch,
                          hot reload, hook registry.
    sbproxy-config/     - YAML/JSON schema, type definitions, parsing,
                          compilation (RawOrigin -> CompiledOrigin).
    sbproxy-plugin/     - Plugin trait definitions and `inventory` registry
                          (PUBLIC API for third-party modules).
    sbproxy-modules/    - Built-in modules:
                          action/   - proxy, loadbalancer, redirect, static,
                                      echo, mock, beacon, websocket, grpc,
                                      ai_proxy, mcp, noop, storage
                          auth/     - api_key, basic_auth, bearer, jwt,
                                      digest, forward_auth, jwks
                          policy/   - rate_limit, ip_filter, waf, ddos,
                                      csrf, security_headers, request_limit,
                                      assertion, sri, cel
                          transform/- json, json_projection, html, markdown,
                                      template, lua, javascript, css,
                                      encoding, format_convert, normalize,
                                      payload_limit, replace_strings,
                                      html_to_markdown, sse_chunking, noop
    sbproxy-ai/         - AI gateway: 66 native providers, routing,
                          guardrails, budget enforcement, key vault,
                          memory store, MCP federation.
    sbproxy-extension/  - Scripting and extension runtimes:
                          cel/  - cel-rust expression evaluation
                          lua/  - mlua + Luau scripting
                          wasm/ - wasmtime sandboxed plugins
                          js/   - QuickJS via rquickjs
                          mcp/  - Model Context Protocol server
    sbproxy-middleware/ - CORS, HSTS, compression (gzip/brotli/zstd),
                          header modifiers, error pages, forward rules.
    sbproxy-cache/      - Response cache trait, memory backend,
                          pluggable store interface, cache key partitioning.
    sbproxy-security/   - Cross-cutting security primitives: crypto helpers,
                          host filter (bloom + HashMap lookup), client-IP
                          extraction with trusted-proxy CIDRs, PII redactor,
                          SSRF guard, plus optional headless-browser
                          detection and bot/agent verification helpers.
                          The WAF, DDoS, CSRF, and security_headers
                          policies live in sbproxy-modules/src/policy/.
    sbproxy-tls/        - TLS termination via rustls 0.23 with the `ring`
                          crypto provider, ACME auto-cert (Let's Encrypt),
                          HTTP/3 listener wiring (currently disabled
                          pending native Pingora HTTP/3), OCSP stapling.
    sbproxy-transport/  - Outbound transport: retry with exponential backoff,
                          request coalescing, hedged requests,
                          circuit breaker, upstream rate limiting.
    sbproxy-vault/      - Secret management. Encrypted local vault,
                          rotation hooks, secret reference resolution.
    sbproxy-observe/    - tracing-based structured logging,
                          Prometheus metrics, typed event bus.
    sbproxy-platform/   - Infrastructure primitives: KV store abstraction,
                          DNS cache, messenger, health tracking,
                          circuit breaker.
    sbproxy-httpkit/    - HTTP utilities: client IP extraction,
                          host:port splitting, buffer pools, body limit
                          readers.
  examples/             - Working sb.yml examples per feature
  docs/                 - Documentation
  e2e/                  - End-to-end test harness
  schemas/              - JSON schema for sb.yml
```
The dependency graph is enforced by the workspace structure. `sbproxy-plugin` is the public
API surface and depends only on `sbproxy-config`. Built-in modules depend on
`sbproxy-plugin`, never on `sbproxy-core`. Third-party plugins built against the published
`sbproxy-plugin` crate are link-compatible with the binary.
---
## 3. Request pipeline
Every inbound request passes through the following stages in order. A rejection at any stage
short-circuits the rest and writes the error response immediately. The pipeline is
implemented as a sequence of `ProxyHttp` callbacks; the per-request work happens inside
those callbacks rather than in a separate dispatcher.
```
request_filter:
  1.  Trace context extract (W3C / B3)
  2.  ACME HTTP-01 challenge interception
  3.  /health and /metrics short-circuit
  4.  Hostname extraction and origin resolution (bloom + HashMap)
  5.  Force-SSL redirect
  6.  Allowed methods check
  7.  CORS preflight handling
  8.  Bot detection
  9.  Threat protection (JSON body checks)
  10. Authentication
  11. Policy enforcement (rate limit, IP filter, WAF, CSRF, DDoS, CEL, ...)
  12. Response cache lookup
  13. on_request callbacks
  14. Forward rule matching
  15. Non-proxy action dispatch (static, redirect, echo, mock, beacon, AI, ...)
upstream_peer:
  Resolve upstream peer for proxy actions.
upstream_request_filter:
  URL rewrite, query injection, method override, body replacement, request
  header modifiers, distributed tracing headers.
response_filter:
  CORS, HSTS, security headers, response modifiers, forward rule echo,
  rate limit headers, Alt-Svc, CSRF cookie, session cookie, on_response
  callbacks, traceparent echo.
response_body_filter:
  Response cache write on miss, transform pipeline, fallback body swap.
logging:
  Metrics emission, access log, event publication.
```
Action types dispatched inside `request_filter` step 15 (or via `upstream_peer` for
`proxy` actions): `proxy`, `load_balancer`, `ai_proxy`, `static`, `mock`, `redirect`,
`echo`, `beacon`, `noop`, `websocket`, `grpc`. Built-in actions are enum variants; the
compiler turns the dispatch site into a branch-predicted match. Third-party plugins use
`Plugin(Box<dyn ActionHandler>)` and pay one indirect call per request.
---
## 4. Plugin system
All extensible component types use a single pattern: register at compile time via the
`inventory` crate, keyed by the type string that appears in YAML configs.
### Registry traits (sbproxy-plugin)
```rust,no_run
pub trait ActionHandler: Send + Sync + 'static {
    fn handler_type(&self) -> &'static str;
    fn handle(
        &self,
        req: &mut http::Request<...>,
        ctx: &mut dyn std::any::Any,
    ) -> Pin<Box<dyn Future<Output = ...> + Send + '_>>;
}
// Same shape for AuthProvider, PolicyEnforcer, TransformHandler, RequestEnricher.
```
Factory closures construct concrete handlers from a `serde_json::Value` config blob and
return a boxed trait object (`Box<dyn ...>`). The factory itself is the registration unit.
### Registration pattern
```rust,no_run
inventory::submit! {
    PluginRegistration {
        kind: PluginKind::Policy,
        name: "rate_limit_custom",
        factory: |raw| {
            let cfg: MyConfig = serde_json::from_value(raw)?;
            Ok(Box::new(MyPolicy::new(cfg)))
        },
    }
}
```
`inventory::submit!` writes a static descriptor into a link-section that the binary
enumerates at startup. There is no central wiring file. Adding a policy is:
1. Implement `PolicyEnforcer` for the new struct.
2. Drop the file in `sbproxy-modules/src/policy/`.
3. Add an `inventory::submit!` block.
4. Add `pub mod my_policy;` to the parent `mod.rs`.
The `compile_config` step in `sbproxy-config` looks up factories by name from the inventory
registry. Built-in modules are exposed as enum variants (`Policy::RateLimit(...)`,
`Policy::Plugin(Box<dyn PolicyEnforcer>)`); the compiler prefers the enum variant when
available for cache locality and branch prediction, falling back to dynamic dispatch for
third-party names.
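The lookup-by-name step can be sketched without the `inventory` crate. This is a std-only illustration, not the real registry: a static slice stands in for the link-time collection, a string map stands in for the `serde_json::Value` blob, and `PolicyEnforcer::allow`, `MaxRequests`, `Registration`, and `build_policy` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the real `PolicyEnforcer` trait.
pub trait PolicyEnforcer: Send + Sync {
    fn allow(&self, requests_so_far: u32) -> bool;
}

/// Illustrative policy: allow until a fixed request count is reached.
pub struct MaxRequests {
    limit: u32,
}

impl PolicyEnforcer for MaxRequests {
    fn allow(&self, requests_so_far: u32) -> bool {
        requests_so_far < self.limit
    }
}

/// A factory turns a raw config blob into a boxed handler.
pub type Factory = fn(&HashMap<String, String>) -> Result<Box<dyn PolicyEnforcer>, String>;

/// One registration record: the YAML type string plus its factory.
pub struct Registration {
    pub name: &'static str,
    pub factory: Factory,
}

fn max_requests_factory(raw: &HashMap<String, String>) -> Result<Box<dyn PolicyEnforcer>, String> {
    let limit = raw
        .get("limit")
        .ok_or("missing `limit`")?
        .parse::<u32>()
        .map_err(|e| e.to_string())?;
    Ok(Box::new(MaxRequests { limit }))
}

/// Stand-in for the set of descriptors `inventory` would collect at link time.
pub static REGISTRY: &[Registration] = &[Registration {
    name: "max_requests",
    factory: max_requests_factory,
}];

/// Compile step: resolve the type string once, then keep the built handler.
pub fn build_policy(
    name: &str,
    raw: &HashMap<String, String>,
) -> Result<Box<dyn PolicyEnforcer>, String> {
    let reg = REGISTRY
        .iter()
        .find(|r| r.name == name)
        .ok_or_else(|| format!("unknown policy type `{name}`"))?;
    (reg.factory)(raw)
}
```

The point of the shape is that the name lookup happens once at compile time of the config; the request path only ever sees the already-built handler.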
### Built-in vs plugin dispatch
Built-in modules are enum variants. Match dispatch over enums is a single
branch-predicted jump that the compiler typically inlines. Third-party plugins go through
`Box<dyn Trait>` for dynamic dispatch. That costs one indirect call per phase but keeps
the plugin ABI stable across compiler versions.
```rust,no_run
enum Action {
    Proxy(ProxyAction),
    Static(StaticAction),
    Redirect(RedirectAction),
    LoadBalancer(LoadBalancerAction),
    AiProxy(AiProxyAction),
    // ... built-ins
    Plugin(Box<dyn ActionHandler>), // third-party
}
```
### Thread safety
`inventory` is populated at link time before `main` runs. All registry reads happen after
that, against an immutable slice. There is no lock on the hot path: the compiled origin
holds direct `Arc` pointers to the handler instances, so per-request dispatch is a pointer
dereference followed by a virtual or static call.
---
## 5. Config architecture
### Pure types layer (sbproxy-config)
The `sbproxy-config` crate contains type definitions, serde derives, and the
compilation step. Its workspace dependencies are limited to `sbproxy-plugin`,
`sbproxy-httpkit`, and `sbproxy-platform` (for the `KVStore` trait used by `l2_store`).
It does not pull in Pingora, the module set, or any networking runtime.
The serde tags in `sbproxy-config` are the canonical field names. When in doubt about a
YAML field name, read the struct definition, not prose documentation.
### Config lifecycle
```
sb.yml (YAML file or API-delivered bytes)
|
v
serde_yaml::from_str -> ConfigFile { proxy, origins, secrets, ... }
|
v
validate_schema() - Reject unknown fields, type-check.
|
v
resolve_secrets() - Expand ${secret.X} references via the vault.
|
v
apply_inheritance() - Parent / child origin merge.
|
v
compile_config() - For each origin:
                   build CompiledOrigin {
                     action,
                     auths: SmallVec<[Auth; 2]>,
                     policies: SmallVec<[Policy; 4]>,
                     request_modifiers, response_modifiers,
                     transforms, hooks, cache, error_pages, ...
                   }
|
v
build host_map: bloom filter + HashMap of hostname -> origin index
|
v
Arc - Immutable snapshot.
|
v
ArcSwap::store() - Atomic publish. Old readers continue
against the previous snapshot.
```
### Parent/child origin inheritance
Origins can declare a `parent` field that references another origin by name. The child
inherits all fields from the parent and can override any of them. This is resolved at
parse time, not at request time. The resulting child config is fully materialized before
compilation.
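As a rough illustration of parse-time inheritance, the merge can be thought of as "child value if set, otherwise parent value" applied field by field. The struct and its fields (`upstream_url`, `timeout_ms`) are hypothetical; the real `RawOrigin` has a different and larger field set.

```rust
/// Hypothetical, trimmed-down origin definition as parsed from YAML.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct RawOrigin {
    pub parent: Option<String>,
    pub upstream_url: Option<String>,
    pub timeout_ms: Option<u64>,
}

/// Produce a fully materialized child: every field the child leaves unset
/// is copied from the parent. The result no longer refers to a parent, so
/// nothing has to be resolved at request time.
pub fn materialize(child: &RawOrigin, parent: &RawOrigin) -> RawOrigin {
    RawOrigin {
        parent: None,
        upstream_url: child
            .upstream_url
            .clone()
            .or_else(|| parent.upstream_url.clone()),
        timeout_ms: child.timeout_ms.or(parent.timeout_ms),
    }
}
```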
### Hot reload
The config watcher (`sbproxy-core::reload`) uses the `notify` crate to detect file changes.
On change it re-parses, re-resolves, and recompiles the config. The new
`Arc` is published via `ArcSwap::store`. Requests that already loaded a
snapshot continue with it; new requests pick up the new pointer on their next snapshot
load. Old snapshots are dropped when their refcount hits zero, after all in-flight
requests using them complete. There is no global lock and no quiescence period.
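The snapshot semantics can be demonstrated with the standard library alone. Note the substitution: this sketch uses `RwLock<Arc<_>>` as a stand-in for `arc-swap`, so readers briefly take a read lock here, whereas the real read path is lock-free. `Snapshot` and `ConfigCell` are illustrative names.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for the immutable compiled config snapshot.
pub struct Snapshot {
    pub version: u64,
}

/// Std-only approximation of an atomically swappable config pointer.
pub struct ConfigCell {
    current: RwLock<Arc<Snapshot>>,
}

impl ConfigCell {
    pub fn new(initial: Snapshot) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Called once per request: clone the pointer, not the config.
    pub fn load(&self) -> Arc<Snapshot> {
        self.current.read().unwrap().clone()
    }

    /// Called by the reload path: publish a new snapshot. The old one is
    /// freed when the last in-flight request drops its `Arc`.
    pub fn store(&self, next: Snapshot) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

A request that called `load` before a reload keeps seeing its original snapshot, and once it finishes, the old snapshot's refcount reaches zero and it is dropped.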
---
## 6. AI gateway architecture
The `ai_proxy` action delegates entirely to the `sbproxy-ai` crate. It presents an
OpenAI-compatible API surface and routes requests to any supported LLM provider.
```
Client (OpenAI-compatible request)
|
v
+------------------+
| AI Handler | Validates request format. Extracts consumer identity.
| | Checks per-key concurrency limits.
+------------------+
|
v
+------------------+
| Guardrails | Pre-request evaluation. CEL/Lua selectors determine
| (pre-request) | which guardrail rules apply. Rules may block, flag,
| | or redact content before the request leaves the proxy.
| | Built-in types: PII, prompt injection, toxicity,
| | jailbreak, content safety, JSON schema, regex.
+------------------+
|
v
+------------------+
| Router | Selects provider and model based on routing strategy.
| | Strategies: round_robin, weighted, fallback_chain,
| | random, lowest_latency, least_connections,
| | cost_optimized, token_rate, sticky.
| | Context window validation: token count checked against
| | provider model limits. Oversized requests routed to a
| | model with a larger context window or rejected.
+------------------+
|
v
+------------------+
| Budget Enforcer | Hierarchical scopes (workspace, key, route).
| | Action on exceed: log, downgrade to cheaper model,
| | or hard-block with 402.
+------------------+
|
v
+------------------+
| Provider | Translates normalized request to provider-specific
| | wire format. Injects API key from vault.
+------------------+
|
v
LLM API (OpenAI / Anthropic / Gemini / Bedrock / ...)
|
v
+------------------+
| Response Handler | For streaming: SSE proxy with buffered guardrail
| | evaluation on accumulated chunks. Token usage and
| | cost updated atomically. Conversation memory written.
| | For non-streaming: full response passed to post-request
| | guardrails before returning to client.
+------------------+
|
v
Client
```
### Provider registry
Providers register through the same `inventory` mechanism as actions. Each provider
implements `sbproxy_ai::providers::Provider`. The provider list is also driven by
`providers.yaml`, which maps provider names to their base URLs and supported models. Rust
implementations handle request serialization and response normalization.
66 native providers ship in-tree alongside a native Anthropic
translator. The `model` field passes straight through to the upstream,
so the gateway reaches 200+ models without enumerating them.
Direct adapters include OpenAI, Anthropic, Google Gemini, Azure
OpenAI, AWS Bedrock, Cohere, Mistral, DeepSeek, xAI / Grok, Perplexity,
Groq, Together AI, Fireworks AI, OpenRouter, Ollama, vLLM, AWS SageMaker,
Databricks, Oracle Cloud GenAI, IBM Watsonx, plus three local-runtime
adapters (Hugging Face TGI, LM Studio, llama.cpp).
### Routing strategies
| Strategy | Behavior |
|---------------------|----------|
| `round_robin` | Rotate through providers in order. |
| `weighted` | Distribute proportional to provider weight. |
| `fallback_chain` | Try providers in priority order, falling back on failure. |
| `random` | Uniform random pick. |
| `lowest_latency` | Provider with the lowest observed latency (microseconds, atomic counter). |
| `least_connections` | Provider with the fewest in-flight requests. |
| `cost_optimized` | Lowest score of `connections * 1000 + weight`. Utilization dominates; weight breaks ties in favor of cheaper providers. |
| `token_rate` | Provider with the most remaining tokens-per-minute headroom. |
| `sticky` | Pin a session key to one provider. Falls back to round robin without a session key. |
| `race` | Fan out to every healthy provider in parallel; first non-error response wins, the rest are cancelled. |
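To make the `cost_optimized` row concrete, here is a small sketch of the documented score, `connections * 1000 + weight`, with the lowest score winning. The `ProviderState` struct and the reading of "lower weight means cheaper" are assumptions made for the example.

```rust
/// Hypothetical per-provider view used by the router.
pub struct ProviderState {
    pub name: &'static str,
    pub in_flight: u64,
    pub weight: u64,
}

/// Score from the table: in-flight connections dominate, weight breaks ties.
pub fn cost_score(p: &ProviderState) -> u64 {
    p.in_flight * 1000 + p.weight
}

/// Pick the provider with the lowest score (first one wins on exact ties).
pub fn pick_cost_optimized(providers: &[ProviderState]) -> Option<&ProviderState> {
    providers.iter().min_by_key(|p| cost_score(p))
}
```

With providers at (2 in flight, weight 1), (1, 50), and (1, 10), the scores are 2001, 1050, and 1010, so the third is chosen. One consequence of the formula is that weights of 1000 or more would start to outweigh a connection, so the tie-break reading only holds for smaller weights.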
### Streaming
The SSE proxy reads chunks from the upstream provider and forwards them to the client
immediately. For guardrail evaluation, the proxy keeps a rolling window of the last N
tokens. When the stream completes, a final guardrail pass runs against the accumulated
content. If a violation shows up mid-stream, the proxy injects a stop chunk and closes
the stream.
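A minimal sketch of the rolling-window idea, assuming the simplest possible guardrail (a blocked substring). `RollingWindow` and `push_and_check` are hypothetical names, and real guardrails evaluate richer rules than substring matching.

```rust
use std::collections::VecDeque;

/// Keeps the last `capacity` streamed tokens for mid-stream checks.
pub struct RollingWindow {
    tokens: VecDeque<String>,
    capacity: usize,
}

impl RollingWindow {
    pub fn new(capacity: usize) -> Self {
        let capacity = capacity.max(1); // a zero-sized window could never match
        Self { tokens: VecDeque::with_capacity(capacity), capacity }
    }

    /// Append one token, evicting the oldest if the window is full, and
    /// report whether the window now contains the blocked phrase.
    pub fn push_and_check(&mut self, token: &str, blocked: &str) -> bool {
        if self.tokens.len() == self.capacity {
            self.tokens.pop_front();
        }
        self.tokens.push_back(token.to_string());
        let joined: String = self.tokens.iter().map(String::as_str).collect();
        joined.contains(blocked)
    }
}
```

The window size bounds what a mid-stream check can see: a phrase split across more tokens than the window holds is only caught by the final pass over the accumulated content.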
### Streaming cache recorder hook
`StreamCacheRecorderHook` (in `sbproxy-core/src/hooks.rs`) is the OSS-side seam that lets
an enterprise build record streaming AI responses for later replay. It mirrors the shape
of `SemanticLookupHook` and `StreamSafetyHook`: a trait, a per-session context type
(`StreamCacheCtx`), and a unit slot on the `Hooks` bundle that defaults to `None`.
The hook lives in OSS because the emit point is on the SSE forwarding hot path. Threading
chunks across a crate boundary at runtime would be expensive; landing the trait in
`sbproxy-core` keeps the per-chunk fan-out cheap and lets the enterprise impl plug in
through `EnterpriseStartupHook::on_startup` exactly like every other slot.
When the slot is wired, `relay_ai_stream` calls `start_session` once at stream start,
forwards a copy of every chunk into the returned channel, and emits exactly one terminal
`StreamCacheEvent::End { complete }`. The `complete` flag is true on a clean
end-of-stream and false on every other terminal condition (client cancel, upstream
error, mid-stream abort). A `StreamCacheGuard` RAII wrapper owns this terminal-event
invariant: `guard.finish()` sends `complete: true`, and the guard's `Drop` impl sends
`complete: false` if `finish` was never called.
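The terminal-event invariant described above is a standard RAII pattern. The sketch below shows one way it could look, using `std::sync::mpsc` as the channel; the `Chunk` variant, the `chunk` method, and the constructor are assumptions, and only `End { complete }`, `finish`, and the `Drop` behavior come from the text.

```rust
use std::sync::mpsc::{channel, Sender};

#[derive(Debug, PartialEq)]
pub enum StreamCacheEvent {
    Chunk(Vec<u8>),          // assumed shape for forwarded chunks
    End { complete: bool },  // exactly one per session
}

/// Guarantees that exactly one `End` event is sent per session.
pub struct StreamCacheGuard {
    tx: Sender<StreamCacheEvent>,
    finished: bool,
}

impl StreamCacheGuard {
    pub fn new(tx: Sender<StreamCacheEvent>) -> Self {
        Self { tx, finished: false }
    }

    pub fn chunk(&self, bytes: &[u8]) {
        // A closed receiver is not the relay's problem; ignore send errors.
        let _ = self.tx.send(StreamCacheEvent::Chunk(bytes.to_vec()));
    }

    /// Clean end-of-stream. Consumes the guard so it cannot be reused.
    pub fn finish(mut self) {
        self.finished = true;
        let _ = self.tx.send(StreamCacheEvent::End { complete: true });
    }
}

impl Drop for StreamCacheGuard {
    fn drop(&mut self) {
        // Any exit path that skipped `finish` (cancel, error, abort)
        // reports an incomplete stream.
        if !self.finished {
            let _ = self.tx.send(StreamCacheEvent::End { complete: false });
        }
    }
}
```

Because `finish` takes `self` by value, early returns and `?` propagation in the relay loop all fall through to `Drop`, so no code path can forget the terminal event.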
What stays out of OSS: caching policy decisions (deterministic tool calls only, image
data by reference only), replay pacing (`as_fast_as_possible` vs `natural`), eviction,
and persistence. The OSS proxy passes the AI handler's `semantic_cache.streaming` config
block through verbatim as a `serde_json::Value` so the enterprise recorder reads
whatever shape it expects without OSS validating those fields. The enterprise crate
fills the slot from its `EnterpriseStartupHook::on_startup` implementation.
### MCP federation
`sbproxy-extension::mcp` implements a Model Context Protocol server. Tools from upstream
MCP endpoints can be federated and exposed as a single combined tool surface to clients.
Tool calls are routed to the registered upstream by name, with optional auth injection.
---
## 7. Event system
SBproxy uses two event mechanisms with different scopes and semantics.
### Internal bus (sbproxy-observe::events)
High-throughput, in-process publish/subscribe. Components call
`events::emit(SystemEvent { ... })`. Subscribers register for specific event type strings.
Used for:
- Circuit breaker state transitions.
- Config hot-reload completion.
- Buffer overflow warnings.
- Rate limit threshold crossings.
- Workspace quota alerts.
Events carry a `workspace_id` field. Per-workspace bounded queues (backed by
`sbproxy-platform::messenger` with a 10k-entry cap) prevent one active workspace from
starving event delivery to others. The bus is implemented over tokio broadcast channels
plus per-subscriber filter predicates.
### Public bus
The `EventBus` trait is exposed to external consumers via the embedding API. The default
implementation is a no-op. Three built-in subscriber types ship with the binary:
- log subscriber: writes events as structured JSON via `tracing`.
- webhook subscriber: POSTs event payloads to a configurable HTTPS endpoint with HMAC
signing.
- prometheus subscriber: increments labeled counters for each event type.
### Event filtering
Subscribers declare a filter predicate at registration time. The bus evaluates predicates
before delivering the event, so filtered subscribers never receive irrelevant events. The
filter is evaluated inline (no spawn per delivery in the common case).
---
## 8. Caching architecture
### Response cache
The response cache sits inside the request pipeline at two points: before the action handler
(cache hit check) and after the action handler (cache write on miss). It is keyed by a
signature derived from the request method, URL, selected request headers, and optionally
the request body hash.
Configurable per origin:
- `ttl` - Time-to-live for cached entries.
- `stale_while_revalidate` - Serve stale content while a background refresh runs.
- `vary` - List of request headers to include in the cache key.
- `methods` - Which HTTP methods are eligible for caching (default: GET, HEAD).
### Store backends
| Backend | Use case |
|-----------|----------|
| `memory` | Single-instance deployments. LRU eviction. No persistence. |
| `file` | Survives restarts. Suitable for low-traffic origins with slow upstreams. |
| `memcached` | Distributed cache via memcached protocol. |
| `redis` | Shared cache across multiple proxy instances. Requires Redis 6+. JSON serialization with TTL. Circuit breaker on Redis failures. |
The `Cacher` trait is the pluggable surface; new backends are added without touching the
pipeline.
### Object cache
Separate from the response cache. Stores arbitrary objects (compiled CEL programs, parsed
Lua scripts, provider capability metadata). Backed by the same store interface. TTL and
LRU eviction policy are configured independently.
### Cache key partitioning
Keys are namespaced as `workspace_id:config_id:hostname:signature`. This prevents
cross-tenant collisions when multiple origins share a backend store. A test-mode fallback
omits the workspace and config prefix for isolation in unit tests.
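A sketch of how such a key could be assembled. The layout `workspace_id:config_id:hostname:signature` is from the text; the signature function is a placeholder that uses the standard library's `DefaultHasher`, which is not stable across Rust releases and is therefore not suitable for a persisted or shared cache. The real signature algorithm is not specified here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Placeholder signature over method, URL, and the `vary` headers.
/// Header names are lowercased so `Accept` and `accept` hash the same.
pub fn signature(method: &str, url: &str, vary: &[(&str, &str)]) -> u64 {
    let mut h = DefaultHasher::new();
    method.hash(&mut h);
    url.hash(&mut h);
    for (name, value) in vary {
        name.to_ascii_lowercase().hash(&mut h);
        value.hash(&mut h);
    }
    h.finish()
}

/// Namespaced key: two tenants with identical requests never collide.
pub fn cache_key(workspace_id: &str, config_id: &str, hostname: &str, sig: u64) -> String {
    format!("{workspace_id}:{config_id}:{hostname}:{sig:016x}")
}
```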
---
## 9. Observability
The observability stack has three components: Prometheus metrics, OpenTelemetry tracing,
and structured logging via `tracing`.
### Prometheus metrics
When `telemetry.bind_port` is configured, SBproxy runs a dedicated HTTP server that exposes
a `/metrics` endpoint in Prometheus exposition format. Metric names share a single
`sbproxy_*` namespace. Core HTTP counters include `sbproxy_requests_total`,
`sbproxy_request_duration_seconds`, `sbproxy_errors_total`, and
`sbproxy_active_connections`. AI gateway metrics carry `sbproxy_ai_*`. Per-origin
breakdowns use `sbproxy_origin_*` variants. Auth, policy, cache, and circuit breaker
counters follow the same convention.
### Grafana dashboards
Two Grafana dashboards ship in `crates/sbproxy-observe/dashboards/`:
- `proxy-overview.json` - Request rates, latency, active connections,
cache hit ratio, error breakdown.
- `mesh-overview.json` - Per-origin and per-edge topology view.
Pre-built Prometheus alert rules are not bundled today; build your own
against the `sbproxy_*` metric names.
### Structured logging
Logging uses the `tracing` crate. `release_max_level_info` is set at the workspace level,
which compile-strips `debug!` and `trace!` calls from release builds entirely. On hot paths
the macro arguments are eliminated rather than evaluated and filtered at runtime.
### Distributed tracing
Distributed tracing extracts W3C Trace Context (`traceparent` / `tracestate`)
and B3 single / multi-header formats, generates a child span ID for each
upstream call, and echoes the propagation headers back to the downstream
client. Full OTLP export to an external collector is wireframed in
`sbproxy-observe::export::otlp_grpc` but not yet shipped; the runtime
emits structured logs and Prometheus counters today.
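For readers unfamiliar with the W3C header, the following sketch parses a version-00 `traceparent` value and builds the header for a child span. It follows the published format (`00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`); the function names are illustrative, span-id generation is left to the caller, and B3 handling is omitted.

```rust
#[derive(Debug, PartialEq)]
pub struct TraceParent {
    pub trace_id: String,
    pub parent_id: String,
    pub flags: u8,
}

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parse a version-00 `traceparent` header. Returns `None` on any
/// malformed field or on the all-zero ids the spec forbids.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let (trace_id, parent_id, flags) = (parts[1], parts[2], parts[3]);
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        flags: u8::from_str_radix(flags, 16).ok()?,
    })
}

/// Header for an upstream call: same trace id and flags, new span id.
pub fn child_header(parent: &TraceParent, new_span_id: u64) -> String {
    format!("00-{}-{:016x}-{:02x}", parent.trace_id, new_span_id, parent.flags)
}
```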
---
## 10. Deployment topologies
### Single instance (simplest)
```
Internet
|
v
[ sbproxy ] <-- single binary, one process
|
v
[ Upstream services / APIs ]
```
One process, one config file. TLS handled by SBproxy via ACME (Let's Encrypt). Fine for
internal tools, development environments, and low-traffic production services.
### Behind a load balancer (horizontal scaling)
```
Internet
|
v
[ Load Balancer ] (e.g., AWS ALB, Nginx, HAProxy)
| |
v v
[ sbproxy ] [ sbproxy ] (2+ instances, same sb.yml)
| |
v v
[ Upstream services / APIs ]
```
For shared cache and session state, configure the `redis` store backend. All instances
connect to the same Redis. TLS is terminated at the load balancer.
### Kubernetes with Ingress
```
Internet
|
v
[ Ingress Controller ] (nginx, traefik, etc.)
|
v
[ sbproxy Service ] (ClusterIP or NodePort)
/ | \
v v v
[pod] [pod] [pod] (3+ replicas, Deployment)
|
v
[ Upstream Services ] (other Deployments or external APIs)
```
Sample topology:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sbproxy
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels.
  # The `app: sbproxy` label is illustrative; use your own convention.
  selector:
    matchLabels:
      app: sbproxy
  template:
    metadata:
      labels:
        app: sbproxy
    spec:
      containers:
        - name: sbproxy
          image: sbproxy:latest
          args: ["--config", "/config/sb.yml"]
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: sbproxy-config
```
Config is supplied via a ConfigMap. The hot-reload watcher detects the kubelet's atomic
symlink swap when the ConfigMap updates.
### Docker Compose (dev and test)
```
Browser / curl
|
v
[ sbproxy ] (port 8080)
|
+---> [ mock-api ] (local upstream for testing)
|
+---> [ redis ] (shared cache for multi-instance testing)
```
Sample `docker-compose.yml` fragment:
```yaml
services:
  sbproxy:
    image: sbproxy:latest
    ports:
      - "8080:8080"
    volumes:
      - ./sb.yml:/config/sb.yml:ro
    command: ["--config", "/config/sb.yml"]
    depends_on:
      - redis
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
```
---
## 11. Performance characteristics
### Compiled pipeline, not interpreted
The biggest win in the request path is that auth chains, policy chains, modifier chains,
and the action handler are compiled exactly once per origin and stored as inline
collections of trait objects (or enum variants for built-ins). A request through a
compiled pipeline is a slice iteration with no map lookups, no JSON re-parsing, and no
config re-reads.
### Per-request allocation budget
The goal is near-zero heap allocations on the hot path for a proxy-type request:
- Per-request state lives in a `bumpalo` arena that resets after the response is written.
Many small allocations become a single bump-pointer increment.
- `bytes::Bytes` and `BytesMut` carry request and response bodies, avoiding copies as
data moves through pipeline phases.
- `compact_str::CompactString` stores short strings (hostnames, IDs, header names; up to
24 bytes on 64-bit targets) inline in the value itself, with no heap allocation.
- `smallvec::SmallVec<[T; N]>` keeps policies, transforms, and modifiers inline; most
origins have 1 to 3 of each.
- The compiled pipeline itself allocates nothing at call time.
### Connection pooling and HTTP/2
Pingora maintains a connection pool per upstream peer with tuned idle connection limits.
HTTP/2 multiplexing is enabled for upstreams that negotiate it via ALPN. Connection reuse
eliminates TCP and TLS setup cost for repeated requests to the same upstream. Pingora is
production-tested at Cloudflare scale; SBproxy inherits its IO model directly.
### DNS cache
`sbproxy-platform::dns` wraps the system resolver with an LRU cache. Cache entries are
keyed by hostname and carry a configurable TTL (default: 30 seconds). Lookups are O(1).
Eviction uses a doubly-linked list to maintain LRU order without O(n) scans. This matters
most for AI proxy routes, which resolve provider hostnames on every request.
### Bloom filter for hostname pre-check
The host router maintains an in-memory bloom filter over all configured hostnames. On
each request, the filter is checked before any HashMap lookup. Requests for unconfigured
hostnames (scanners, bots, misconfigurations) are rejected in sub-microsecond time without
touching the HashMap.
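The pre-check can be sketched as a small bloom filter in front of a `HashMap`. This is an illustration under assumptions: the sizing (16 bits per host), the three probe positions, and the use of `DefaultHasher` with double hashing are all choices made for the example, and `HostRouter` here is not the real type.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

pub struct HostRouter {
    bits: Vec<u64>,
    num_bits: u64,
    origins: HashMap<String, usize>,
}

/// Three probe positions via double hashing: h1 + i * h2 (mod num_bits).
fn bit_positions(host: &str, num_bits: u64) -> [u64; 3] {
    let mut a = DefaultHasher::new();
    host.hash(&mut a);
    let h1 = a.finish();
    let mut b = DefaultHasher::new();
    (host, 0x9e37_79b9_7f4a_7c15u64).hash(&mut b);
    let h2 = b.finish() | 1; // keep the stride odd so probes differ
    [0u64, 1, 2].map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % num_bits)
}

impl HostRouter {
    pub fn new(hosts: &[&str]) -> Self {
        let num_bits = ((hosts.len() as u64) * 16).max(64);
        let mut bits = vec![0u64; ((num_bits + 63) / 64) as usize];
        let mut origins = HashMap::new();
        for (idx, host) in hosts.iter().enumerate() {
            for pos in bit_positions(host, num_bits) {
                bits[(pos / 64) as usize] |= 1u64 << (pos % 64);
            }
            origins.insert(host.to_string(), idx);
        }
        Self { bits, num_bits, origins }
    }

    /// False means "definitely not configured"; true means "maybe".
    fn maybe_contains(&self, host: &str) -> bool {
        bit_positions(host, self.num_bits)
            .iter()
            .all(|&pos| self.bits[(pos / 64) as usize] & (1u64 << (pos % 64)) != 0)
    }

    /// Bloom check first; the HashMap is only consulted on a "maybe".
    pub fn resolve(&self, host: &str) -> Option<usize> {
        if !self.maybe_contains(host) {
            return None;
        }
        self.origins.get(host).copied()
    }
}
```

A bloom filter never produces false negatives, so configured hosts always reach the map; an occasional false positive only costs one extra `HashMap` miss.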
### Sharded counters for hot state
Subsystems that track per-consumer or per-origin state (rate limiters, AI session counters)
shard their state across N buckets based on a hash of the key. Each shard uses
`parking_lot::Mutex` or atomic counters. When keys spread evenly across shards, that cuts
lock contention by roughly a factor of N under concurrent load from many distinct keys.
The rate limiter also has atomic-only fast paths when the bucket has clear capacity.
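A minimal sketch of key-hashed sharding. It uses `std::sync::Mutex` instead of `parking_lot::Mutex` to stay dependency-free, and `ShardedCounter` is an illustrative type, not the rate limiter's actual state.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Per-key counters split across N independently locked shards.
pub struct ShardedCounter {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedCounter {
    pub fn new(n: usize) -> Self {
        Self {
            shards: (0..n.max(1)).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// The same key always maps to the same shard.
    fn shard_for(&self, key: &str) -> &Mutex<HashMap<String, u64>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    /// Increment and return the new value. Only one shard is locked, so
    /// keys in other shards proceed without contention.
    pub fn incr(&self, key: &str) -> u64 {
        let mut shard = self.shard_for(key).lock().unwrap();
        let slot = shard.entry(key.to_string()).or_insert(0);
        *slot += 1;
        *slot
    }

    pub fn get(&self, key: &str) -> u64 {
        self.shard_for(key).lock().unwrap().get(key).copied().unwrap_or(0)
    }
}
```

Sharding helps when load is spread over many keys; a single hot key still serializes on its one shard, which is where an atomic fast path earns its keep.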
### Lock-free config reads
`arc-swap` provides atomic pointer swap with no locking on the read side. Every request
loads the current `Arc` once, which is a single atomic read plus a refcount
increment. Hot reload publishes a new pointer; in-flight requests continue against their
existing snapshot until they complete and drop their `Arc`.
### Circuit breaker design
Each upstream has a circuit breaker backed by atomic compare-and-swap operations. The
open / half-open / closed state transition uses a single atomic int. Only one probe request
is allowed through per recovery cycle. All other requests during the open state fail fast
without acquiring any lock or making any network call.
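The single-probe rule falls out of one compare-and-swap. The sketch below models only the state machine on a single atomic; the cool-down timer, failure counting, and the names `Breaker`, `trip`, and `try_probe` are assumptions left out of or added to the description above.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const CLOSED: u8 = 0;
const OPEN: u8 = 1;
const HALF_OPEN: u8 = 2;

pub struct Breaker {
    state: AtomicU8,
}

impl Breaker {
    pub fn new() -> Self {
        Self { state: AtomicU8::new(CLOSED) }
    }

    /// Normal traffic is only admitted while the breaker is closed.
    pub fn allow(&self) -> bool {
        self.state.load(Ordering::Acquire) == CLOSED
    }

    /// Failure threshold reached: fail fast from now on.
    pub fn trip(&self) {
        self.state.store(OPEN, Ordering::Release);
    }

    /// After the cool-down (timer omitted), callers race on this CAS.
    /// Exactly one wins and becomes the probe request.
    pub fn try_probe(&self) -> bool {
        self.state
            .compare_exchange(OPEN, HALF_OPEN, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// The probe's outcome closes the breaker or re-opens it.
    pub fn on_probe_result(&self, ok: bool) {
        let next = if ok { CLOSED } else { OPEN };
        let _ = self
            .state
            .compare_exchange(HALF_OPEN, next, Ordering::AcqRel, Ordering::Acquire);
    }
}
```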
### Compiler optimizations
Release builds use `lto = "fat"`, `codegen-units = 1`, and `panic = "abort"`. mimalloc
replaces the system allocator. `tracing`'s `release_max_level_info` feature compile-strips
all debug and trace logging from the binary.
### Observed overhead
Under typical workloads (no Lua, no CEL, no response transforms), the proxy adds well
under 1 millisecond of overhead at p99 to end-to-end request latency. The dominant cost
is the upstream network round-trip. Microbenchmarks for static and echo actions clear
100k requests per second on a single core; full-pipeline scenarios with auth, rate
limiting, CORS, and HSTS sustain 80k or more.
For benchmark methodology, scenario definitions, and how to reproduce these numbers, see
[performance.md](performance.md). For feature-by-feature comparisons against other proxies
and AI gateways, see [comparison.md](comparison.md). For the YAML schema reference, see
[configuration.md](configuration.md).
================================================================
# docs/audit-log.md
================================================================
## Audit log
*Last modified: 2026-05-04*
Every state-mutating endpoint in SBproxy emits one audit envelope. The envelope is typed and append-only. This guide covers what gets audited, the schema, the `target_kind` JSON discriminator note, and the structured-log audit sink that ships with the OSS distribution.
The OSS surface emits the envelope through the structured-log audit sink so every deployment gets an audit trail. Durable persistence (Postgres, S3, hash-chained verification) lives in the commercial distribution and is out of scope for this repo.
## What is audited
Audit emission is on **writes** by default. Every mutating handler emits one envelope per call: agent registration / approval / revocation, key rotation, registry edit, policy edit, login, logout.
Reads are audited only when:
1. The read targets the audit log itself (export, verify). The auditor must be auditable.
2. The read targets secret material (key-management endpoints, even when the response redacts the secret).
3. The read is a bulk-export endpoint.
Routine reads (list agents, get balance) are not audited; they live in the access log and the request-event stream. Adding read-audit to a routine endpoint requires an ADR amendment because the cardinality cost is high.
## Envelope schema
Every event is an `AdminAuditEvent`. Wire format is JSON; field order is significant only for canonical hashing.
| Field | Type | Required | Notes |
|---|---|---|---|
| `event_id` | ULID (string) | yes | Generated at emission. Lexicographically time-sortable. |
| `schema_version` | u16 | yes | Currently `0`. |
| `ts` | RFC 3339 UTC | yes | Wall-clock time at emission. |
| `tenant_id` | string | yes | `default` in OSS. |
| `subject` | tagged enum | yes | Who initiated the action. See subjects below. |
| `action` | enum | yes | What was done. Closed enum with an `Other(String)` escape hatch. |
| `target` | tagged enum | yes | What was acted on. See targets below. |
| `before` | JSON value | optional | Pre-mutation snapshot, redacted. `None` on pure-read operations. |
| `after` | JSON value | optional | Post-mutation snapshot, redacted. `None` on failed mutations. |
| `reason` | string | optional | Operator justification. Capped at 4 KiB; over-cap truncates with `...[truncated]`. Not redacted. |
| `result` | tagged enum | yes | Outcome: `Success`, `Failure { error_code, error_message }`, `Denied { reason }`. |
| `request_id` | ULID | yes | Correlation: the in-flight HTTP request. |
| `trace_id` | string (32 hex) | yes | Correlation: OTel trace id. Empty string when no trace context. |
| `span_id` | string (16 hex) | yes | Correlation: OTel span id. |
| `ip` | IpAddr | yes | Caller IP, post-trusted-proxy resolution. |
| `user_agent` | string | optional | Capped at 512 bytes. |
| `chain_position` | object | optional | Reserved for future hash-chained log support. Always `None` in OSS. |
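The `reason` length cap can be illustrated with a short sketch. It assumes the cap is measured in bytes and that the `...[truncated]` marker is appended after the capped text (so the result may exceed 4 KiB by the marker length); the source does not say which way that goes. `cap_reason` and the constants are hypothetical names.

```rust
/// Cap from the schema table: 4 KiB.
const REASON_CAP_BYTES: usize = 4 * 1024;
/// Marker from the schema table.
const TRUNCATION_MARKER: &str = "...[truncated]";

/// Returns `reason` unchanged when it fits, otherwise the capped prefix
/// followed by the truncation marker.
pub fn cap_reason(reason: &str) -> String {
    if reason.len() <= REASON_CAP_BYTES {
        return reason.to_string();
    }
    // Walk back to a char boundary so a multi-byte character is never split.
    let mut cut = REASON_CAP_BYTES;
    while !reason.is_char_boundary(cut) {
        cut -= 1;
    }
    let mut out = String::with_capacity(cut + TRUNCATION_MARKER.len());
    out.push_str(&reason[..cut]);
    out.push_str(TRUNCATION_MARKER);
    out
}
```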
### Subjects
```rust,no_run
pub enum AuditSubject {
    User { user_id: String, session_id: Option<String> },
    Service { principal_id: String },
    Agent { agent_id: String, agent_class: Option<String> },
    System { component: String },
}
```
`User` is a portal-authenticated human. `Service` is CI or internal automation. `Agent` is a registered agent acting on its own behalf. `System` is the subject of last resort and SHOULD be rare; config reload and scheduled jobs use it.
### Actions
Closed enum. Adding a new variant is an ADR amendment. The current set:
`Create`, `Update`, `Delete`, `Read`, `Approve`, `Revoke`, `RotateKey`, `Disable`, `Enable`, `Export`, `Import`, `Login`, `Logout`, `PolicyEdit`, `Other(String)`.
`Other(String)` is the escape hatch for variants not yet hoisted into the closed enum; persistent uses require an ADR amendment to add a proper variant.
### Targets
```rust,no_run
pub enum AuditTarget {
    Agent { agent_id: String },
    RegistryEntry { feed: String, entry_id: String },
    Key { kind: KeyKind, key_id: String },
    Policy { policy_path: String },
    Origin { hostname: String },
    User { user_id: String },
    Tenant { tenant_id: String },
    Config { path: String },
    AuditLog,
    Other { kind: String, id: String },
}
```
`KeyKind` is closed: `OutboundWebhook`, `RegistryFeed`, `Tls`, `Tenant`.
## JSON discriminator note: `target_kind`
`AuditTarget` serializes as an internally tagged enum whose tag field is named `target_kind`, **not** `kind`. The rename avoids a field collision: the `Other { kind, id }` variant carries its own `kind` field, and a tag of the same name would collide with it.
The wire format looks like this:
```json
{"target_kind": "registry_entry", "feed": "agents", "entry_id": "openai-gptbot"}
{"target_kind": "other", "kind": "rate-limit", "id": "rl_us_east_1"}
```
Verifier CLIs and replay tooling MUST read the discriminator from `target_kind`. The trailing `kind` inside the `Other` variant is opaque payload.
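As a minimal sketch of the rule for tooling authors: the function below picks the variant discriminator out of an already-parsed target object. The `(key, value)` slice is a stand-in for a parsed JSON object (a real verifier would use a JSON library), and `target_discriminator` is a hypothetical name.

```rust
/// Returns the variant discriminator of a target object.
/// Only `target_kind` is consulted; a `kind` key is treated as payload.
pub fn target_discriminator<'a>(fields: &[(&'a str, &'a str)]) -> Option<&'a str> {
    for (key, value) in fields {
        if *key == "target_kind" {
            return Some(*value);
        }
    }
    None
}
```

For the `Other` example above, this yields `other`, and `rate-limit` stays untouched as opaque payload.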
## Append-only contract
The storage backend MUST reject updates and deletes. The contract is enforced at the trait level:
```rust,no_run
#[async_trait::async_trait]
pub trait Emitter: Send + Sync {
    async fn emit(&self, event: AdminAuditEvent) -> Result<(), AuditError>;
    async fn read_range(
        &self,
        from: chrono::DateTime<chrono::Utc>,
        to: chrono::DateTime<chrono::Utc>,
    ) -> Result<Vec<AdminAuditEvent>, AuditError>;
    // No update(), no delete(). Compile-time enforcement.
}
```
A refactor that wants to mutate prior events would have to add a method to the trait, which is an ADR-amendment-level change.
PII deletion (GDPR Article 17, CCPA right-to-delete) is handled by tombstoning, not by mutating the audit log. A separate `audit_tombstones` table records the deletion request, and the verifier CLI redacts matching subjects on read.
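Redact-on-read can be sketched as follows. This is an illustration under stated assumptions, not the verifier CLI's code: `StoredEvent` is a simplified stand-in for the envelope, the tombstone table is modelled as a set of subject ids, and the `[redacted]` placeholder is invented for the example.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a stored audit envelope (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredEvent {
    pub event_id: String,
    pub subject_id: String,
}

/// Placeholder written over tombstoned subjects (assumed value).
pub const REDACTED: &str = "[redacted]";

/// Builds the read view: stored events are never mutated, but any event
/// whose subject has a tombstone is returned with the subject redacted.
pub fn read_with_tombstones(
    log: &[StoredEvent],
    tombstones: &HashSet<String>,
) -> Vec<StoredEvent> {
    log.iter()
        .map(|event| {
            let mut view = event.clone();
            if tombstones.contains(&event.subject_id) {
                view.subject_id = REDACTED.to_string();
            }
            view
        })
        .collect()
}
```

The log slice is borrowed immutably, so the append-only property holds even while the redacted view is built.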
## Adapters
### In-memory
Used for tests. Append to a `Vec`; no removal API.
```rust,no_run
use sbproxy_audit::{InMemoryEmitter, AdminAuditEvent};
use std::sync::Arc;
let emitter = Arc::new(InMemoryEmitter::default());
emitter.emit(event).await?;
let range = emitter.read_range(from, to).await?;
```
### Structured log
The default OSS sink writes envelopes to the structured log stream so every deployment gets an audit trail. Pair it with whatever log shipper you already run.
## EmitterMiddleware
A Tower / Axum `Layer` wraps every state-mutating handler. The middleware:
1. Captures envelope context up front (`request_id`, `trace_id`, `span_id`, caller IP, User-Agent, subject).
2. Runs the handler.
3. Pulls the `AuditDescriptor` the handler attached to the response extensions (action, target, before, after, optional reason).
4. Builds the envelope, applies the length caps, redacts `before` and `after` per the internal profile, and emits.
```rust,no_run
use axum::Router;
use sbproxy_audit::{AuditLayer, EmitterArc, InMemoryEmitter};
use std::sync::Arc;

let emitter: EmitterArc = Arc::new(InMemoryEmitter::default());
let app: Router = Router::new()
    .route("/agents/:id/approve", axum::routing::post(approve_handler))
    .layer(AuditLayer::new(emitter, "tenant_42"));
```
State-mutating handlers opt in by implementing `Auditable`:
```rust,no_run
use sbproxy_audit::{
    AuditAction, AuditDescriptor, AuditTarget, Auditable,
};

impl Auditable for ApproveHandler {
    fn audit_action(&self) -> AuditAction { AuditAction::Approve }

    fn audit_target(&self, req: &axum::extract::Request) -> AuditTarget {
        AuditTarget::Agent { agent_id: extract_agent_id(req) }
    }

    fn audit_snapshot(&self, req: &axum::extract::Request) -> Option<serde_json::Value> {
        Some(snapshot_agent_state(req))
    }
}
```
A clippy lint and a CI grep ensure every mutating handler is wrapped or wears an explicit `#[allow(audit_required)]` with a comment.
### Failure handling
Audit emission failure does not fail the underlying request. The handler succeeds even if the audit append fails; the failure pages on `SLO-AUDIT-WRITE` so durable audit gets restored. The OSS sink logs and drops on emit failure.
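The "log and drop" behaviour can be sketched like this. It is a simplified, synchronous illustration: `finish_request` and the failure counter are hypothetical, and the real middleware is async and reports through its own metrics.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Returns the handler's result unchanged, whatever happened to the audit
/// append. An emit failure is counted and logged, then dropped.
pub fn finish_request<T, E>(
    handler_result: Result<T, E>,
    emit_result: Result<(), String>,
    emit_failures: &AtomicU64,
) -> Result<T, E> {
    if let Err(err) = emit_result {
        // Surface the failure for alerting, but never fail the request.
        emit_failures.fetch_add(1, Ordering::Relaxed);
        eprintln!("audit emit failed, dropping event: {}", err);
    }
    handler_result
}
```

The handler result is passed through untouched in both branches, which is the property the paragraph above describes.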
## See also
- [observability.md](observability.md) - audit metrics (`sbproxy_audit_emit_total`), the `SLO-AUDIT-WRITE` page tier, and the audit-log Grafana dashboard.
================================================================
# docs/auth-oidc.md
================================================================
## OIDC Relying-Party login
*Last modified: 2026-06-03*
The `oidc` auth provider turns SBproxy into an OpenID Connect
Relying Party. Unlike the `jwt` provider, which only validates a
bearer JWT that the caller already holds, this provider drives
the full authorization-code + PKCE login dance: it redirects an
unauthenticated caller to the IdP, exchanges the returned code
for an ID token, validates the token, and mints a sealed session
cookie. Subsequent requests authenticate from the cookie until
the session expires.
This is the "put SSO in front of an app that has none" use case
that operators reach for with oauth2-proxy, Pomerium, or
Cloudflare Access. SBproxy ships it as a built-in auth provider
enabled through config; no separate sidecar is needed.
## Quick start
```yaml
origins:
  "app.example.com":
    action:
      type: proxy
      url: http://upstream-app:3000
    auth:
      type: oidc
      authorization_endpoint: https://idp.example.com/authorize
      token_endpoint: https://idp.example.com/oauth/token
      jwks_uri: https://idp.example.com/.well-known/jwks.json
      issuer: https://idp.example.com/
      client_id: sbproxy-app-example-com
      client_secret: vault://idp/client_secret
      cookie_secret: vault://oidc/cookie_secret
      scope: "openid email profile"
```
The minimum fields are the four IdP endpoints (`authorization_endpoint`,
`token_endpoint`, `jwks_uri`, `issuer`), the OAuth `client_id`
and `client_secret`, and a `cookie_secret` used to seal the
session cookie. Everything else has a sensible default.
A runnable example lives at
[`examples/oidc/`](../examples/oidc/) with a mock IdP shape and
the curl invocations to walk through.
## Flow
1. The browser requests a protected origin without a session cookie.
2. SBproxy mints a transaction cookie (sealed PKCE verifier + state
+ nonce, TTL `tx_ttl_secs`) and 302's the browser to
`authorization_endpoint?response_type=code&client_id=...&code_challenge=...&state=...&nonce=...&scope=...&redirect_uri=https://app.example.com/oidc/callback`.
3. The IdP authenticates the user and 302's back to
`https://app.example.com/oidc/callback?code=...&state=...`.
4. The `/oidc/callback` handler (a synthetic endpoint mounted by
the OIDC provider, the same shape as MCP's well-known
endpoints) unseals the transaction cookie, verifies the
`state` matches, POSTs to `token_endpoint` with the `code` and
the PKCE `code_verifier`, validates the returned ID token
against `issuer` + `client_id` + `nonce`, mints a sealed
session cookie (TTL `session_ttl_secs`), and 302's the browser
back to the originally-requested URL.
5. Subsequent requests carry the session cookie; the proxy
decrypts it and treats the caller as authenticated.
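The redirect in step 2 can be sketched as a small URL builder. This is a minimal illustration, not the provider's code: `AuthRedirect`, `authorization_url`, and `percent_encode` are hypothetical names, the PKCE challenge is taken as an input (deriving it needs SHA-256, which is outside the standard library), and `code_challenge_method=S256` is an assumption added here because RFC 7636 treats a missing method as `plain`.

```rust
/// Percent-encodes everything except RFC 3986 unreserved characters.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Inputs for the step-2 redirect (hypothetical struct).
pub struct AuthRedirect<'a> {
    pub authorization_endpoint: &'a str,
    pub client_id: &'a str,
    pub redirect_uri: &'a str,
    pub scope: &'a str,
    pub state: &'a str,
    pub nonce: &'a str,
    /// Precomputed PKCE challenge derived from the sealed verifier.
    pub code_challenge: &'a str,
}

/// Builds the URL the browser is sent to with a 302.
pub fn authorization_url(p: &AuthRedirect) -> String {
    let params = [
        ("response_type", "code"),
        ("client_id", p.client_id),
        ("redirect_uri", p.redirect_uri),
        ("scope", p.scope),
        ("state", p.state),
        ("nonce", p.nonce),
        ("code_challenge", p.code_challenge),
        // Assumption: the S256 method is sent explicitly.
        ("code_challenge_method", "S256"),
    ];
    let query: Vec<String> = params
        .iter()
        .map(|(key, value)| format!("{}={}", key, percent_encode(value)))
        .collect();
    format!("{}?{}", p.authorization_endpoint, query.join("&"))
}
```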
All cookies use the `__Host-` prefix per RFC 6265bis (forces
`Secure` + `Path=/` + no `Domain`), which closes the cookie-tossing
attack against the session cookie.
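A `Set-Cookie` value that satisfies the `__Host-` rules might be assembled as in the sketch below. `host_cookie` is a hypothetical helper; the `HttpOnly` and `SameSite=Lax` attributes are assumptions added for illustration and are not stated by this guide.

```rust
/// Builds a `Set-Cookie` header value for a `__Host-` cookie.
/// The prefix requires `Secure`, `Path=/`, and no `Domain` attribute.
pub fn host_cookie(name: &str, sealed_value: &str, ttl_secs: u64) -> Result<String, String> {
    if !name.starts_with("__Host-") {
        return Err(format!("cookie name {:?} lacks the __Host- prefix", name));
    }
    // No `Domain` attribute is emitted, as the prefix demands.
    // `HttpOnly` and `SameSite=Lax` are assumed hardening defaults.
    Ok(format!(
        "{}={}; Path=/; Secure; HttpOnly; SameSite=Lax; Max-Age={}",
        name, sealed_value, ttl_secs
    ))
}
```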
## Configuration reference
| Field | Type | Default | Description |
|---|---|---|---|
| `authorization_endpoint` | URL | (required) | IdP's authorization endpoint. |
| `token_endpoint` | URL | (required) | IdP's token endpoint. The callback POSTs `code` + `code_verifier` here. |
| `jwks_uri` | URL | (required) | IdP's JWKS endpoint. Fetched through the same `JwksCache` the `jwt` provider uses, so the keys are cached across origins. |
| `issuer` | URL | (required) | Expected `iss` on the ID token. Pinned by config so a rogue token from a different IdP (even one signed by a key pulled from `jwks_uri`) is rejected. |
| `client_id` | string | (required) | OAuth client ID. Sent on the auth redirect and matched against the ID token `aud`. |
| `client_secret` | string | (required) | OAuth client secret. Sent via HTTP Basic authentication on the token-endpoint POST. Supports `vault://` references. |
| `cookie_secret` | string | (required) | 32+ byte secret used as the HKDF IKM for the session + transaction cookie keys. Supports `vault://`. Rotating this invalidates every outstanding session and tx cookie. |
| `redirect_path` | path | `/oidc/callback` | Path the IdP redirects back to. Must be one of the URIs you registered with the IdP under `redirect_uris`. |
| `logout_path` | path | `/oidc/logout` | Path that triggers RP-initiated logout. |
| `end_session_endpoint` | URL | unset | IdP's `end_session_endpoint`. When set, `/oidc/logout` deletes the session cookie and 302's to the OP so the IdP terminates its own session too. When unset, `/oidc/logout` only deletes the cookie and 302's to `post_logout_redirect_default`. |
| `userinfo_endpoint` | URL | unset | IdP's userinfo endpoint. When set, the callback handler calls userinfo after the token exchange and projects the resulting claims as trust headers on the request to the upstream. |
| `post_logout_redirect_default` | path or URL | `/` | Where to send the browser after a logout completes if the caller did not supply a `post_logout_redirect_uri`, or supplied one that is not in `post_logout_redirect_allowlist`. |
| `post_logout_redirect_allowlist` | list of URLs | `[]` | Permitted values for the `post_logout_redirect_uri` query parameter on `/oidc/logout`. Without this gate the endpoint becomes an open-redirect. Match is verbatim. |
| `scope` | string | `openid` | Space-separated OIDC scope list. Minimum is `openid` (the scope that produces an ID token); add `email profile groups` etc. as needed. |
| `session_ttl_secs` | integer | `3600` | Session cookie TTL in seconds. |
| `tx_ttl_secs` | integer | `300` | Transaction cookie TTL in seconds. Should comfortably exceed the time a user needs to get from the auth redirect to the callback redirect (the IdP login itself); a stale tx cookie aborts the login. |
| `session_cookie_name` | string | `__Host-sbproxy_session` | Name of the session cookie. The `__Host-` prefix forces `Secure` + `Path=/` + no `Domain`. |
| `tx_cookie_name` | string | `__Host-sbproxy_oidc_tx` | Name of the transaction cookie. |
| `attrs` | block | `{}` | Provider-level attribution metadata stamped onto the resolved `Principal` on a successful OIDC session validation. Same shape as the other auth providers. |
## Trust-header injection (optional)
When `userinfo_endpoint` is set, the callback handler:
1. Calls the userinfo endpoint with the access token from the
token exchange.
2. Projects the returned claims through
`userinfo::trust_headers_from_claims`.
3. Stashes the projection in the sealed session cookie.
On every subsequent request, the request-time auth check replays
the trust headers onto the upstream request. Downstream policies
(for example the `object_authz` BOLA + BFLA policy) see the
verified subject and groups without an additional round trip.
The headers stamped are:
| Header | Source claim |
|---|---|
| `X-Auth-Subject` | `sub` |
| `X-Auth-Email` | `email` (when present and `email_verified` is `true`) |
| `X-Auth-Name` | `name` (when present) |
| `X-Auth-Groups` | `groups` (comma-joined when array-shaped) |
Upstreams MUST be configured to trust these headers only from
the proxy (e.g. via mTLS or a tight network boundary); the proxy
strips inbound copies of these headers from the client before
adding its own so a malicious client cannot inject identity.
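The projection table and the inbound-strip rule can be sketched together. This is not the `userinfo::trust_headers_from_claims` implementation; `UserinfoClaims`, `project_trust_headers`, and `strip_inbound_trust_headers` are hypothetical names, and omitting `X-Auth-Groups` when the claim is empty is an assumption.

```rust
/// Subset of userinfo claims used for the projection (hypothetical struct).
#[derive(Debug, Default, Clone)]
pub struct UserinfoClaims {
    pub sub: String,
    pub email: Option<String>,
    pub email_verified: bool,
    pub name: Option<String>,
    pub groups: Vec<String>,
}

/// Lower-cased names of the headers the proxy owns.
const TRUST_HEADERS: [&str; 4] = ["x-auth-subject", "x-auth-email", "x-auth-name", "x-auth-groups"];

/// Projects claims into trust headers following the table above.
pub fn project_trust_headers(claims: &UserinfoClaims) -> Vec<(&'static str, String)> {
    let mut headers = vec![("X-Auth-Subject", claims.sub.clone())];
    if let Some(email) = &claims.email {
        // Email is only projected when the IdP marks it verified.
        if claims.email_verified {
            headers.push(("X-Auth-Email", email.clone()));
        }
    }
    if let Some(name) = &claims.name {
        headers.push(("X-Auth-Name", name.clone()));
    }
    if !claims.groups.is_empty() {
        headers.push(("X-Auth-Groups", claims.groups.join(",")));
    }
    headers
}

/// Drops client-supplied copies of the trust headers (case-insensitive).
pub fn strip_inbound_trust_headers(headers: &mut Vec<(String, String)>) {
    headers.retain(|(name, _)| {
        let lower = name.to_ascii_lowercase();
        !TRUST_HEADERS.iter().any(|owned| *owned == lower)
    });
}
```

Stripping runs before projection on each request, so the only `X-Auth-*` values the upstream ever sees are the ones replayed from the sealed session.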
## Logout
Send the browser to `logout_path` (default `/oidc/logout`). The
handler:
1. Deletes the session cookie.
2. If `end_session_endpoint` is set, 302's the browser to the IdP
so the OP terminates its own session.
3. Otherwise, 302's the browser to `post_logout_redirect_default`
(or, if the caller supplied a `post_logout_redirect_uri` query
parameter that appears in `post_logout_redirect_allowlist`,
honours that value verbatim).
The allowlist is the open-redirect gate: without it, the endpoint
would honour any `post_logout_redirect_uri` a caller supplies.
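The verbatim allowlist match can be sketched in a few lines. `post_logout_target` is a hypothetical name; the behaviour shown (exact string comparison, fall back to the default otherwise) follows the description above.

```rust
/// Picks the post-logout redirect target.
/// `requested` is the caller's `post_logout_redirect_uri`, if any.
pub fn post_logout_target<'a>(
    requested: Option<&'a str>,
    allowlist: &[&str],
    default_target: &'a str,
) -> &'a str {
    match requested {
        // Verbatim match only: no prefix, host, or normalised comparison.
        Some(uri) if allowlist.iter().any(|allowed| *allowed == uri) => uri,
        _ => default_target,
    }
}
```

Because the match is an exact string comparison, a URL that merely starts with an allowlisted value still falls back to the default.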
## Discovery
Today the IdP endpoints are explicit config fields. Fetching the
OIDC discovery document at `/.well-known/openid-configuration`
is not implemented yet. A follow-up (PR2) adds an optional
discovery-time fetch: when an operator points the provider at a
discovery URL, the proxy will populate `authorization_endpoint`,
`token_endpoint`, `jwks_uri`, and `end_session_endpoint` from the
fetched document instead of from explicit config. Until that
lands, populate the endpoints by hand from the IdP's discovery
document.
## Session storage
Default is **stateless encrypted cookie**: the session claims
travel in the cookie body, sealed with the per-origin cookie
key. No proxy-side state, no Redis. The cookie size grows with
the projected trust headers, so keep the trust-header projection
narrow.
For long-lived sessions or for sessions that need server-side
revocation, the `oidc::store` helpers offer a server-side
session-store hook (KV-backed) that operators can wire under the
existing `kv` storage. The default is stateless because the
cookie shape covers the common case and avoids the operational
cost of a session store.
## Relationship to the other auth providers
| Provider | Validates | Issues | Drives a login flow |
|---|---|---|---|
| `noop` | nothing | nothing | no |
| `api_key`, `basic_auth`, `bearer`, `digest` | per-credential lookup | nothing | no |
| `jwt` | bearer JWT (issuer / audience / signature) | nothing | no |
| `forward_auth` | delegates to an external authorizer | nothing | no |
| `oidc` (this provider) | session cookie + ID token | session cookie | **yes** |
The `oidc` provider shares the JWKS cache with `jwt` so two
origins backed by the same IdP do not duplicate key fetches.
Operators that want to layer "validate a bearer JWT issued by a
different system" on top of "log in via OIDC" can combine
`oidc` here with `jwt` on a different origin in the same
config; the providers are independent.
## What's not in this provider
* **Discovery-document auto-population** of the four endpoint
fields. Tracked as a follow-up; today the operator pastes the
values from the IdP's published `.well-known/openid-configuration`.
* **Refresh-token rotation.** The session TTL bounds the time
between IdP round-trips. A follow-up adds rotating refresh
tokens behind a server-side session store.
* **DPoP-bound sessions.** The session cookie today is a sealed
bearer; DPoP binding to a client-held key is a follow-up.
* **MFA enforcement / step-up.** The provider honours whatever
the IdP does on the auth side; in-proxy step-up is not in
scope.
## See also
- [Example: `examples/oidc/`](../examples/oidc/)
- [`configuration.md`](configuration.md) for the auth-provider
registry surface.
================================================================
# docs/build.md
================================================================
## Build pipeline
*Last modified: 2026-04-30*
How the proxy container images are built, what stays warm between
runs, and what the expected wall-clock numbers are. Companion to
`docs/architecture.md` (request pipeline) and the workspace
`CLAUDE.md` (pre-commit local loop).
## Container image layout
Two Dockerfiles live at the repo root and share the same layered
cargo-chef layout:
| File | Purpose | Consumer |
|---|---|---|
| `Dockerfile.cloudbuild` | Cloud Build / GCR amd64 image. | `gcloud builds submit`; bench loadtest stack. |
| `Dockerfile.ci` | Kind-based smoke-test image. | `make k8s-operator-smoke`. |
The layout has six stages; `cert-gen` exists only in `Dockerfile.cloudbuild`, so `Dockerfile.ci` has five:
1. **chef-base**: `rust:1.94-bookworm` plus the apt deps (`pkg-config`,
`libclang-dev`, `build-essential`, `cmake`, `perl`) plus a pinned
`cargo-chef@0.1.71`. Reused by every later Rust stage.
2. **planner**: copies the workspace, runs `cargo chef prepare`, emits
`recipe.json`. The recipe captures every `Cargo.toml` and
`Cargo.lock` digest in the workspace; nothing under
`crates/*/src/` affects it.
3. **cacher**: `cargo chef cook --profile release-fast --bin sbproxy
--recipe-path recipe.json`. Compiles every dependency from
crates.io. This is the layer the warm-rebuild path reuses.
4. **builder**: copies `/src/target` from cacher, then the workspace
source, then runs `cargo build --profile release-fast --bin sbproxy
--locked`.
The dep `target/` from the cacher stage is the entire reason this
step does not have to recompile crates like `pingora`,
`aws-lc-sys`, or `tokio` again.
5. **cert-gen** (cloudbuild only): self-signed loadtest cert.
Production deploys mount real certs over `/etc/sbproxy/` at
runtime.
6. **runtime**: `gcr.io/distroless/cc-debian12`. Carries the binary
and (cloudbuild) the loadtest cert pair.
## Build-time numbers
Cold = empty BuildKit cache (`docker buildx prune -f` first). Warm =
touch a file under `crates/sbproxy/src/` and rebuild without
clearing the cache.
| Build | Before chef | After chef |
|---|---|---|
| Cold (Cloud Build amd64) | ~12 min | ~3-4 min |
| Warm (only first-party source changed) | ~12 min (no caching) | <90s |
The warm path's win comes from the `cacher` layer: as long as
`recipe.json` is byte-identical to the previous build, Docker
short-circuits stages 1-3 and only re-runs stages 4 + 6.
The Dockerfiles default to `CARGO_PROFILE=release-fast`, which inherits
the production release settings but disables fat LTO and raises
`codegen-units` for lower link time and memory. Pass
`--build-arg CARGO_PROFILE=release` when you intentionally want the
full production release profile inside these Dockerfiles.
The cold path's win comes from BuildKit `--mount=type=cache` on
`/usr/local/cargo/{registry,git}`: even when the layer cache is cold
(e.g. a fresh Cloud Build worker), the cargo registry tarballs are
re-used across builds of the same Cloud Build trigger.
## BuildKit requirement
Both Dockerfiles use the cache-mount syntax (`RUN
--mount=type=cache,...`). That syntax is BuildKit-only.
- Local: `export DOCKER_BUILDKIT=1` or use `docker buildx build`.
- Cloud Build: builders that consume these Dockerfiles must set
`DOCKER_BUILDKIT=1` in the build step env, or use a `docker buildx
build` invocation. Cloud Build's standard `gcr.io/cloud-builders/docker`
step honors `DOCKER_BUILDKIT=1`. If a build step ever drops back to
the legacy builder, `RUN --mount=type=cache` is rejected and the
build fails, because the legacy builder does not support `--mount`.
## Validating a build
The fast smoke test, locally:
```bash
DOCKER_BUILDKIT=1 docker build \
-f Dockerfile.cloudbuild \
--target builder \
-t sbproxy:builder-smoke .
```
The `--target builder` short-circuits before the runtime stage so the
test does not pay for the cert-gen + distroless copy. To validate the
runtime image:
```bash
DOCKER_BUILDKIT=1 docker build -f Dockerfile.cloudbuild -t sbproxy:rt .
docker run --rm sbproxy:rt --version
```
## Warm-path verification
To prove the chef layer is doing its job, after a cold build, touch a
file under `crates/sbproxy/src/`:
```bash
touch crates/sbproxy/src/main.rs
DOCKER_BUILDKIT=1 docker build -f Dockerfile.cloudbuild --target builder -t sbproxy:warm .
```
The output should show stages `chef-base`, `planner`, and `cacher`
all `CACHED`, and only `builder` running. Wall-clock time on a
modern amd64 worker should be under 90s.
## Troubleshooting
- **The cacher stage rebuilds every time.** Some change touched a
`Cargo.toml` or `Cargo.lock` (added a dep, bumped a version,
changed a feature flag). The recipe digest is keyed on those
files; the cacher stage cooks fresh.
- **`cargo build` in the builder stage refuses to use the cooked
artifacts.** Symptom: stage 4 takes ~12 min, ignoring the COPY
from cacher. Most likely cause: `--locked` and a stale
`Cargo.lock` in cacher's COPY. Re-run `cargo update` and rebuild.
- **OOM on Cloud Build.** Set `options.machineType` in the build config to
`E2_HIGHCPU_8` or higher; the chef cacher stage holds the full
`target/` of cooked deps in memory while linking.
================================================================
# docs/bulk-redirects.md
================================================================
## Bulk redirects
*Last modified: 2026-04-27*
The `redirect` action accepts a list of source-to-destination rows
in addition to (or instead of) a single `url:`. Each origin owns its
own list. The proxy compiles the rows once at config-load time into
an O(1) lookup table keyed on the request path; runtime cost is one
hash hit on the redirect dispatch path.
## Sources
| `bulk_list.type` | What it loads |
|------------------|---------------|
| `inline` | YAML rows embedded directly in the config under `rows:`. |
| `file` | A local file. CSV when the path ends in `.csv`, YAML otherwise. |
| `url` | An HTTPS URL fetched at config-compile time (startup and every reload). CSV/YAML by URL extension or explicit `format:`. The proxy refuses plain HTTP because list contents drive 30x responses. |
```yaml
origins:
  "marketing.local":
    action:
      type: redirect
      status_code: 301
      preserve_query: true
      bulk_list:
        type: file
        path: /etc/sbproxy/marketing-redirects.csv
```
## Row shape
CSV columns: `from,to[,status]`. Lines starting with `#` and blank
lines are ignored. A leading row whose first column is the literal
`from` is treated as a header.
```csv
from,to,status
/old/about,/about,301
# status omitted: defaults to the action's status_code
/old/help,/help
/blog/2023,https://blog.example.com/2023,308
```
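The row rules above can be sketched as a small parser. This is an illustration under stated assumptions: it splits naively on commas (no quoting support), treats the first non-blank, non-comment row as the "leading row" for header detection, and uses hypothetical names (`RedirectRow`, `parse_redirect_csv`).

```rust
/// One parsed CSV row; `status` is `None` when the column is omitted.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RedirectRow {
    pub from: String,
    pub to: String,
    pub status: Option<u16>,
}

/// Parses `from,to[,status]` rows, skipping blanks, `#` lines, and a header.
pub fn parse_redirect_csv(input: &str) -> Result<Vec<RedirectRow>, String> {
    let mut rows = Vec::new();
    let mut seen_row = false;
    for (idx, raw) in input.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let cols: Vec<&str> = line.split(',').map(|c| c.trim()).collect();
        let is_leading_row = !seen_row;
        seen_row = true;
        // A leading row whose first column is literally `from` is a header.
        if is_leading_row && cols[0] == "from" {
            continue;
        }
        if cols.len() < 2 || cols.len() > 3 || cols[0].is_empty() || cols[1].is_empty() {
            return Err(format!("line {}: expected from,to[,status]", idx + 1));
        }
        let status = match cols.get(2) {
            Some(s) if !s.is_empty() => Some(
                s.parse::<u16>()
                    .map_err(|e| format!("line {}: bad status: {}", idx + 1, e))?,
            ),
            _ => None,
        };
        rows.push(RedirectRow {
            from: cols[0].to_string(),
            to: cols[1].to_string(),
            status,
        });
    }
    Ok(rows)
}
```

A row with `status: None` is the case where the action-level `status_code` applies.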
YAML or inline:
```yaml
bulk_list:
  type: inline
  rows:
    - from: /category/legacy
      to: /category/2024
      status: 308
    - from: /docs/v1
      to: https://docs.example.com/v2
      preserve_query: false   # override per row
```
## Lookup semantics
- Exact-match on the request path. Wildcards and prefix matching are
not supported; use the existing `forward_rules` for those.
- A row's `status` and `preserve_query` default to the action's
values when omitted; per-row overrides win when set.
- Unmapped paths fall through to the action's `url:`. When `url:`
is empty, the proxy returns `404`.
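These lookup rules can be sketched with a `HashMap` keyed on the exact request path. The types are hypothetical, and two details are assumptions: the query string is appended with a plain `?`, and the fallthrough `url:` is served with the action's status and without the query.

```rust
use std::collections::HashMap;

/// Compiled row; `None` means "inherit the action's value".
pub struct Row {
    pub to: String,
    pub status: Option<u16>,
    pub preserve_query: Option<bool>,
}

/// Per-origin compiled table plus the action-level defaults.
pub struct RedirectTable {
    pub rows: HashMap<String, Row>,
    pub default_status: u16,
    pub default_preserve_query: bool,
    /// The action's `url:`, used for unmapped paths.
    pub fallback_url: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Redirect { location: String, status: u16 },
    NotFound,
}

impl RedirectTable {
    /// Exact-match lookup on the path; per-row overrides win over defaults.
    pub fn resolve(&self, path: &str, query: Option<&str>) -> Outcome {
        match self.rows.get(path) {
            Some(row) => {
                let status = row.status.unwrap_or(self.default_status);
                let keep = row.preserve_query.unwrap_or(self.default_preserve_query);
                let location = match (keep, query) {
                    (true, Some(q)) if !q.is_empty() => format!("{}?{}", row.to, q),
                    _ => row.to.clone(),
                };
                Outcome::Redirect { location, status }
            }
            // Unmapped paths fall through to `url:`; empty means 404.
            None => match &self.fallback_url {
                Some(url) if !url.is_empty() => Outcome::Redirect {
                    location: url.clone(),
                    status: self.default_status,
                },
                _ => Outcome::NotFound,
            },
        }
    }
}
```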
## Per-origin isolation
Lists never cross origins. Two origins can declare lists with
overlapping paths and no row leaks; each origin's compiled table is
scoped to its hostname.
## Reload
The list reloads on the next config swap. There is no per-row hot
reload; redeploy the config to pick up new rows. URL-backed lists
re-fetch on each config compile.
## Performance
A 100k-row CSV compiles in well under a second on a warm cache and
serves redirects in tens of nanoseconds per request (HashMap lookup
on a `String` key). Cap the list length at the size your operators
can audit.
## See also
- [configuration.md](configuration.md#redirect) - full action schema.
- `examples/bulk-redirects/` - runnable CSV + inline example.
================================================================
# docs/cache-reserve.md
================================================================
## Cache Reserve
*Last modified: 2026-04-27*
Cache Reserve is a long-tail cold tier sitting under the per-origin response cache. Items evicted from the hot cache are admitted into the reserve subject to a sample rate and size threshold; on a hot miss the proxy consults the reserve before falling through to origin and promotes the entry back into the hot tier on hit.
The OSS package ships three reserve backends out of the box (memory, filesystem, redis) plus the [`CacheReserveBackend`](#backend-trait) trait that enterprise builds extend with an S3 + KMS implementation.
## Configuration
Cache Reserve is configured at the top level of `sb.yml`. It applies to every origin whose `response_cache.enabled` is true.
```yaml
proxy:
  http_bind_port: 8080
cache_reserve:
  enabled: true
  backend:
    type: filesystem
    path: /var/lib/sbproxy/reserve
  sample_rate: 0.1         # mirror 10% of hot-cache writes
  min_ttl: 3600            # only items with TTL >= 1 hour are admitted
  max_size_bytes: 1048576  # skip entries above 1 MiB
origins:
  "api.example.com":
    action: { type: proxy, url: "https://upstream.example.com" }
    response_cache:
      enabled: true
      ttl: 7200
      cacheable_status: [200]
```
### Backends
| `type` | Required fields | Notes |
|--------|-----------------|-------|
| `memory` | none | In-process map. For tests and ephemeral single-replica setups; nothing survives a restart. |
| `filesystem` | `path` | One body file plus a sidecar metadata JSON per key, fanned out by SHA-256 hash. Survives restarts. |
| `redis` | `redis_url`, optional `key_prefix` | Connection pooling via `ConnectionManager`. Entries self-evict on the server side via `PEXPIREAT`. |
Enterprise builds register additional types (e.g. `s3`) through the `CacheReserveBackend` trait. The OSS pipeline ignores unknown types with a warning so the enterprise startup hook can swap in its own implementation.
### Admission filter
| Field | Default | Behaviour |
|-------|---------|-----------|
| `sample_rate` | `0.1` | Fraction of hot-cache writes mirrored into the reserve. Use a low rate when the reserve is on a paid object store. |
| `min_ttl` | `3600` | Skip entries whose TTL is below this (seconds). Items that won't outlive a typical hot eviction window aren't worth carrying. |
| `max_size_bytes` | `1048576` | Skip oversize objects. `0` disables the cap. |
The filter runs before any reserve I/O happens so a misconfigured admission window doesn't show up as a reserve write spike.
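The three admission checks can be sketched as a pure function. `AdmissionFilter` and `admits` are hypothetical names, and the random draw is passed in by the caller (a uniform value in `[0, 1)`) so the sketch stays deterministic; how the proxy actually samples is not specified here.

```rust
/// Admission settings, mirroring the config fields above.
pub struct AdmissionFilter {
    pub sample_rate: f64,
    pub min_ttl_secs: u64,
    /// `0` disables the size cap.
    pub max_size_bytes: u64,
}

impl AdmissionFilter {
    /// Decides whether an entry is mirrored into the reserve.
    /// `draw` is a caller-supplied uniform random number in [0, 1).
    pub fn admits(&self, ttl_secs: u64, size_bytes: u64, draw: f64) -> bool {
        // Short-lived entries won't outlive a hot eviction window.
        if ttl_secs < self.min_ttl_secs {
            return false;
        }
        // Oversize objects are skipped unless the cap is disabled.
        if self.max_size_bytes != 0 && size_bytes > self.max_size_bytes {
            return false;
        }
        // Sampling comes last; no reserve I/O has happened at this point.
        draw < self.sample_rate
    }
}
```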
## Request flow
1. Hot cache lookup runs first.
2. On a hot miss, the proxy consults the reserve. A reserve hit replays the body to the client with `x-sbproxy-cache: HIT-RESERVE` and promotes the entry back into the hot tier so subsequent reads stay hot.
3. On a hot miss + reserve miss, the request goes to origin as normal.
4. On the response path, every cacheable upstream reply lands in the hot tier; the reserve admits a sampled subset that passes the TTL and size filters.
5. When a hot entry's TTL is exhausted (and it's outside any SWR window), the entry is mirrored to the reserve before being deleted from the hot tier so the long-tail content gets a second life.
6. `POST` / `PUT` / `PATCH` / `DELETE` invalidations evict the no-Vary canonical reserve key alongside the hot-tier prefix sweep. Vary-based variants in the reserve must wait for natural expiry; the trait surface is intentionally narrow so backends like S3 don't need to scan keys.
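Steps 1 to 3 of the flow can be sketched with two in-memory maps. `TieredCache` and `CacheStatus` are hypothetical names; the sketch keeps the entry in the reserve after promotion, which is an assumption, and ignores TTLs and Vary handling entirely.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum CacheStatus {
    /// Served from the hot tier.
    Hit,
    /// Served from the reserve (`x-sbproxy-cache: HIT-RESERVE`).
    HitReserve,
    /// Both tiers missed; the request goes to origin.
    Miss,
}

/// Two-tier cache stand-in: both tiers are plain maps here.
pub struct TieredCache {
    pub hot: HashMap<String, Vec<u8>>,
    pub reserve: HashMap<String, Vec<u8>>,
}

impl TieredCache {
    /// Hot lookup first, then the reserve, promoting on a reserve hit.
    pub fn lookup(&mut self, key: &str) -> (CacheStatus, Option<Vec<u8>>) {
        if let Some(body) = self.hot.get(key) {
            return (CacheStatus::Hit, Some(body.clone()));
        }
        if let Some(body) = self.reserve.get(key).cloned() {
            // Promote so subsequent reads stay hot.
            self.hot.insert(key.to_string(), body.clone());
            return (CacheStatus::HitReserve, Some(body));
        }
        (CacheStatus::Miss, None)
    }
}
```

A second lookup for the same key returns `Hit`, which is the promotion behaviour step 2 describes.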
## Backend trait
The integration point for cold-tier backends is the async [`CacheReserveBackend`](../crates/sbproxy-cache/src/reserve/mod.rs) trait. Enterprise builds ship their own `impl CacheReserveBackend` (S3 + KMS, GCS, Azure Blob) without re-vendoring the OSS data plane.
```rust,no_run
use async_trait::async_trait;
use bytes::Bytes;
use std::time::SystemTime;
use sbproxy_cache::{CacheReserveBackend, ReserveMetadata};
pub struct MyBackend { /* ... */ }
#[async_trait]
impl CacheReserveBackend for MyBackend {
async fn put(&self, key: &str, value: Bytes, metadata: ReserveMetadata) -> anyhow::Result<()> {
// ...
Ok(())
}
async fn get(&self, key: &str) -> anyhow::Result