diff --git a/README.md b/README.md
index 7e8afe5..96b4c1a 100644
--- a/README.md
+++ b/README.md
@@ -1,324 +1,138 @@
-# pixa caching image reverse proxy server
+# pixa

-This is a web service written in go that is designed to proxy images from
-source URLs, optionally resizing or transforming them, and serving the
-results. Both the source images as well as the transformed images are
-cached. The images served to the client are cached a configurable interval
-so that subsequent requests to the same path on the pixa server are served
-from disk without origin server requests or additional processing.
+pixa is a GPL-3.0-licensed Go web server by
+[@sneak](https://sneak.berlin) that proxies images from upstream
+sources, optionally resizing or transforming them, and serves the
+results. Both source and transformed images are cached to disk so that
+subsequent requests are served without origin fetches or additional
+processing.

-# storage
+## Getting Started

-* unaltered source file straight from upstream:
-  * `/cache/src-content///`
-* source path metadata
-  * `/cache/src-metadata//.json`
-    * fetch time
-    * all original resp headers
-    * original request
-    * sha256 hash
+```bash
+# clone and build
+git clone https://git.eeqj.de/sneak/pixa.git
+cd pixa
+make build

-Note that multiple source paths may reference the same content blob. We
-won't do refcounting here, we'll use the state database for that.
+# run with a config file
+./bin/pixad --config config.example.yml

-* database:
-  * `/state.sqlite3`
+# or build and run via Docker
+make docker
+docker run -p 8080:8080 pixad:latest
+```

-* output documents:
-  * `/cache/dst-content///`
+## Rationale

-While the database is the long-term authority on what we have in the output
-cache, we must aggressively cache in-process the mapping between requests
-and output content hashes so as to serve as a maximally efficient caching
-proxy for extremely popular/hot request paths. The goal is the ability to
-easily support 1-5k r/s.
+Image-heavy web applications need a fast, caching reverse proxy that
+can resize and transcode images on the fly. pixa fills that role as a
+single, self-contained binary with no external runtime dependencies
+beyond libvips. It supports HMAC-SHA256 signed URLs with expiration to
+prevent abuse, and whitelisted source hosts for open access.

-# Routes
+## Design

-/img///?signature=&format=
+### Storage

-Images are only fetched from origins using TLS. Origin certificates must be
-valid at time of fetch.
+- **Source content**:
+  `/cache/src-content///`
+- **Source metadata**:
+  `/cache/src-metadata//.json`
+  (fetch time, original headers, request, content hash)
+- **Database**: `/state.sqlite3` (SQLite)
+- **Output documents**:
+  `/cache/dst-content///`

- is one of 'orig', 'png', 'jpeg', 'webp'
+Multiple source paths may reference the same content blob; the
+database tracks references rather than using filesystem refcounting.
+In-process caching of request-to-output mappings targets 1-5k r/s.

- is one of 'orig' or 'x'
+### Routes

-# Source Hosts
+```
+/v1/image///.?sig=&exp=
+```

-Source hosts may be whitelisted in the pixa configuration. If not in the
-explicit whitelist, a signature using a shared secret must be appended.
+Images are only fetched from origins using TLS with valid certificates.

-## Signature Specification
+- ``: one of `orig`, `png`, `jpeg`, `webp`
+- ``: `orig` or `x` (e.g. `800x600`)

-Signatures use HMAC-SHA256 and include an expiration timestamp to prevent replay attacks.
+### Source Hosts

-### Signed Data Format
+Source hosts may be whitelisted in the configuration. Non-whitelisted
+hosts require an HMAC-SHA256 signature.

-The signature is computed over a colon-separated string:
+#### Signature Specification
+
+Signatures use HMAC-SHA256 and include an expiration timestamp to
+prevent replay attacks.
+
+**Signed data format** (colon-separated):

 ```
 HMAC-SHA256(secret, "host:path:query:width:height:format:expiration")
 ```

 Where:
-- `host` - Source origin hostname (e.g., `cdn.example.com`)
-- `path` - Source path (e.g., `/photos/cat.jpg`)
-- `query` - Source query string, empty string if none
-- `width` - Requested width in pixels, `0` for original
-- `height` - Requested height in pixels, `0` for original
-- `format` - Output format (jpeg, png, webp, avif, gif, orig)
-- `expiration` - Unix timestamp when signature expires

-### URL Format with Signature
+- `host` — source origin hostname (e.g. `cdn.example.com`)
+- `path` — source path (e.g. `/photos/cat.jpg`)
+- `query` — source query string, empty string if none
+- `width` — requested width in pixels, `0` for original
+- `height` — requested height in pixels, `0` for original
+- `format` — output format (jpeg, png, webp, avif, gif, orig)
+- `expiration` — Unix timestamp when signature expires

-```
-/v1/image///.?sig=&exp=
-```
-
-### Example
-
-For a request to resize `https://cdn.example.com/photos/cat.jpg` to 800x600 WebP
-with expiration at Unix timestamp 1704067200:
-
-1. Build the signature input:
-   ```
-   cdn.example.com:/photos/cat.jpg::800:600:webp:1704067200
-   ```
+**Example:** resize
+`https://cdn.example.com/photos/cat.jpg` to 800x600 WebP with
+expiration 1704067200:

+1. Build input:
+   `cdn.example.com:/photos/cat.jpg::800:600:webp:1704067200`
 2. Compute HMAC-SHA256 with your secret key
-
 3. Base64URL-encode the result
+4. URL:
+   `/v1/image/cdn.example.com/photos/cat.jpg/800x600.webp?sig=&exp=1704067200`

-4. Final URL:
-   ```
-   /v1/image/cdn.example.com/photos/cat.jpg/800x600.webp?sig=&exp=1704067200
-   ```
+**Whitelist patterns:**

-### Whitelist Patterns
+- **Exact match**: `cdn.example.com` — matches only that host
+- **Suffix match**: `.example.com` — matches `cdn.example.com`,
+  `images.example.com`, and `example.com`

-The whitelist supports two pattern types:
-- **Exact match**: `cdn.example.com` - matches only that host
-- **Suffix match**: `.example.com` - matches `cdn.example.com`, `images.example.com`, and `example.com`
+### Configuration

-# configuration
+Configured via YAML file (`--config`). Key settings:

-* access-control-allow-origin config
-* source host whitelist
-* upstream fetch timeout
-* upstream max response size
-* downstream timeout
-* downstream max request size
-* downstream max response size
-* internal processing timeout
-* referer blacklist
+- `access_control_allow_origin` — CORS origin
+- `source_host_whitelist` — list of allowed upstream hosts
+- `upstream_fetch_timeout` — timeout for origin requests
+- `upstream_max_response_size` — max origin response size
+- `downstream_timeout` — client response timeout
+- `signing_key` — HMAC secret for URL signatures

-# Design Review & Recommendations
+See `config.example.yml` for all options with defaults.

-## Security Concerns
+### Architecture

-### Critical
-- **HMAC signature scheme is undefined** - The "FIXME" for signature
-  construction is a blocker. Recommend HMAC-SHA256 over the full path:
-  `HMAC-SHA256(secret, "///?format=")`
-- **No signature expiration** - Signatures should include a timestamp to
-  prevent indefinite replay. Add `&expires=` and include it in the
-  HMAC input
-- **Path traversal risk** - Ensure `` cannot contain `..`
-  sequences or be used to access unintended resources on origin
-- **SSRF potential** - Even with TLS requirement, internal/private IPs
-  (10.x, 172.16.x, 192.168.x, 127.x, ::1, link-local) must be blocked to
-  prevent server-side request forgery
-- **Open redirect via Host header** - Validate that requests cannot be
-  manipulated to cache content under incorrect keys
+- **Dependency injection**: Uber fx
+- **HTTP router**: go-chi
+- **Image processing**: govips (CGO wrapper for libvips)
+- **Database**: SQLite via modernc.org/sqlite
+- **Static assets**: embedded via `//go:embed`
+- **Metrics**: Prometheus
+- **Logging**: stdlib slog

-### Important
-- **No authentication for cache purge** - If cache invalidation is needed, it requires auth
-- **Response header sanitization** - Strip sensitive headers from upstream before forwarding (X-Powered-By, Server, etc.)
-- **Content-Type validation** - Verify upstream Content-Type matches expected image types before processing
-- **Maximum image dimensions** - Limit output dimensions to prevent resource exhaustion (e.g., max 4096x4096)
+## TODO

-## URL Route Improvements
+See [TODO.md](TODO.md) for the full prioritized task list.

-Current: `/img///?signature=&format=`
+## License

-### Recommended Scheme
-```
-/v1/image///x.?sig=&exp=
-```
+GPL-3.0. See [LICENSE](LICENSE).

-The size+format segment (e.g., `800x600.webp`) is appended to the source path and stripped when constructing the upstream request. This pattern is unambiguous (regex: `(\d+x\d+|orig)\.(webp|jpg|jpeg|png|avif)$`) and won't collide with real paths.
+## Author

-**Size options:**
-- `800x600.` - resize to 800x600
-- `0x0.` - original size, format conversion only
-- `orig.` - original size, format conversion only (human-friendly alias)
-
-**Benefits:**
-- API versioning (`/v1/`) allows breaking changes later
-- Human-readable URLs that can be manually constructed for whitelisted domains
-- Format as extension is intuitive and CDN-friendly
-
-### Examples
-
-**Basic resize and convert:**
-```
-/v1/image/cdn.example.com/photos/cat.jpg/800x600.webp?sig=abc123&exp=1704067200
-```
-Fetches `https://cdn.example.com/photos/cat.jpg`, resizes to 800x600, converts to webp.
-
-**Source URL with query parameters:**
-```
-/v1/image/cdn.example.com/photos/cat.jpg%3Farg1=val1%26arg2=val2/800x600.webp?sig=abc123&exp=1704067200
-```
-Fetches `https://cdn.example.com/photos/cat.jpg?arg1=val1&arg2=val2`, resizes to 800x600, converts to webp.
-
-Note: The source query string must be URL-encoded (`?` → `%3F`, `&` → `%26`) to avoid ambiguity with pixa's own query parameters.
-
-**Original size, format conversion only:**
-```
-/v1/image/cdn.example.com/photos/cat.jpg/orig.webp?sig=abc123&exp=1704067200
-/v1/image/cdn.example.com/photos/cat.jpg/0x0.webp?sig=abc123&exp=1704067200
-```
-Both fetch the original image and convert to webp without resizing.
-
-## Additional Formats
-
-### Output Formats to Support
-- `avif` - Superior compression, growing browser support
-- `gif` - For animated image passthrough (with frame limit)
-- `svg` - Passthrough only, no resizing (vector)
-
-### Input Format Whitelist (MIME types to accept)
-- `image/jpeg`
-- `image/png`
-- `image/webp`
-- `image/gif`
-- `image/avif`
-- `image/svg+xml` (passthrough or rasterize)
-- **Reject all others** - Especially `image/x-*`, `application/*`
-
-### Input Validation
-- Verify magic bytes match declared Content-Type
-- Maximum input file size (e.g., 50MB)
-- Maximum input dimensions (e.g., 16384x16384)
-- Reject files with embedded scripts (SVG sanitization)
-
-## Rate Limiting
-
-### Per-IP Limits
-- Requests per second (e.g., 10 req/s burst, 100 req/min sustained)
-- Concurrent connections (e.g., 50 per IP)
-
-### Global Limits
-- Total concurrent upstream fetches (prevent origin overwhelm)
-- Per-origin fetch rate limiting (be a good citizen)
-- Cache miss rate limiting (prevent cache-busting attacks)
-
-### Response
-- Return `429 Too Many Requests` with `Retry-After` header
-- Consider `X-RateLimit-*` headers for transparency
-
-## Additional Features for 1.0
-
-### Must Have
-- **Health check endpoint** - `/health` or `/healthz` for load balancers
-- **Metrics endpoint** - `/metrics` (Prometheus format) for observability
-- **Graceful shutdown** - Drain connections on SIGTERM
-- **Request ID/tracing** - `X-Request-ID` header propagation
-- **Cache-Control headers** - Proper `Cache-Control`, `ETag`, `Last-Modified` on responses
-- **Vary header** - `Vary: Accept` if doing content negotiation
-
-### Should Have
-- **Auto-format selection** - If `format=auto`, pick best format based on `Accept` header
-- **Quality parameter** - `&q=85` for lossy format quality control
-- **Fit modes** - `fit=cover|contain|fill|inside|outside` for resize behavior
-- **Background color** - For transparent-to-JPEG conversion
-- **Blur/sharpen** - Common post-resize operations
-- **Watermarking** - Optional overlay support
-
-### Nice to Have
-- **Cache warming API** - Pre-populate cache for known images
-- **Cache stats API** - Hit/miss rates, storage usage
-- **Admin UI** - Simple dashboard for monitoring
-
-## Configuration Additions
-
-```yaml
-server:
-  listen: ":8080"
-  read_timeout: 30s
-  write_timeout: 60s
-  max_header_bytes: 8192
-
-cache:
-  directory: "/var/cache/pixa"
-  max_size_gb: 100
-  ttl: 168h # 7 days
-  negative_ttl: 5m # Cache 404s briefly
-
-upstream:
-  timeout: 30s
-  max_response_size: 52428800 # 50MB
-  max_concurrent: 100
-  user_agent: "Pixa/1.0"
-
-processing:
-  max_input_pixels: 268435456 # 16384x16384
-  max_output_dimension: 4096
-  default_quality: 85
-  strip_metadata: true # Remove EXIF etc.
-
-security:
-  hmac_secret: "${PIXA_HMAC_SECRET}" # From env
-  signature_ttl: 3600 # 1 hour
-  blocked_networks:
-    - "10.0.0.0/8"
-    - "172.16.0.0/12"
-    - "192.168.0.0/16"
-    - "127.0.0.0/8"
-    - "::1/128"
-    - "fc00::/7"
-
-rate_limit:
-  per_ip_rps: 10
-  per_ip_burst: 50
-  per_origin_rps: 100
-
-cors:
-  allowed_origins: ["*"] # Or specific list
-  allowed_methods: ["GET", "HEAD", "OPTIONS"]
-  max_age: 86400
-```
-
-## Error Handling
-
-### HTTP Status Codes
-- `400` - Bad request (invalid parameters, malformed URL)
-- `403` - Forbidden (invalid/expired signature, blocked origin)
-- `404` - Origin returned 404 (cache negative response briefly)
-- `413` - Payload too large (origin image exceeds limits)
-- `415` - Unsupported media type (origin returned non-image)
-- `422` - Unprocessable (valid image but cannot transform as requested)
-- `429` - Rate limited
-- `500` - Internal error
-- `502` - Bad gateway (origin connection failed)
-- `503` - Service unavailable (overloaded)
-- `504` - Gateway timeout (origin timeout)
-
-### Error Response Format
-```json
-{
-  "error": "invalid_signature",
-  "message": "Signature has expired",
-  "request_id": "abc123"
-}
-```
-
-## Quick Wins
-
-1. **Conditional requests** - Support `If-None-Match` / `If-Modified-Since` to return `304 Not Modified`
-2. **HEAD support** - Allow clients to check image metadata without downloading
-3. **Canonical URLs** - Redirect non-canonical requests to prevent cache fragmentation
-4. **Debug header** - `X-Pixa-Cache: HIT|MISS|STALE` for debugging
-5. **Robots.txt** - Serve a robots.txt to prevent search engine crawling of proxy URLs
+[@sneak](https://sneak.berlin)
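
Review note: the signature scheme documented in the new README (colon-separated input, HMAC-SHA256, Base64URL encoding, `exp` timestamp) could be exercised by a client roughly as in the sketch below. It is an illustration only, not code taken from pixa. The function names `signInput`, `sign`, and `verify` and the secret `example-secret` are hypothetical. The README does not say whether the Base64URL output is padded, so the unpadded variant is assumed here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// signInput joins the fields in the order the README gives:
// host:path:query:width:height:format:expiration.
func signInput(host, path, query string, width, height int, format string, exp int64) string {
	return fmt.Sprintf("%s:%s:%s:%d:%d:%s:%d", host, path, query, width, height, format, exp)
}

// sign computes HMAC-SHA256 over input and Base64URL-encodes the digest.
// Assumption: unpadded encoding; the README does not specify padding.
func sign(secret []byte, input string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(input))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// verify rejects expired signatures, then compares in constant time.
func verify(secret []byte, input, sig string, exp, now int64) bool {
	if now > exp {
		return false
	}
	return hmac.Equal([]byte(sign(secret, input)), []byte(sig))
}

func main() {
	secret := []byte("example-secret") // hypothetical key
	const exp int64 = 1704067200

	input := signInput("cdn.example.com", "/photos/cat.jpg", "", 800, 600, "webp", exp)
	sig := sign(secret, input)

	fmt.Println(input)
	fmt.Printf("/v1/image/cdn.example.com/photos/cat.jpg/800x600.webp?sig=%s&exp=%d\n", sig, exp)
	fmt.Println(verify(secret, input, sig, exp, exp-1))
}
```

The comparison uses `hmac.Equal` so that verification time does not depend on how many leading characters of a forged signature happen to match.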