initial design

Jeffrey Paul 2025-07-20 08:51:38 +02:00
commit 67319a4699
5 changed files with 665 additions and 0 deletions

105
AGENTS.md Normal file

@@ -0,0 +1,105 @@
# Policies for AI Agents
Version: 2025-06-08
# Instructions and Contextual Information
* Be direct, robotic, expert, accurate, and professional.
* Do not butter me up or kiss my ass.
* Come in hot with strong opinions, even if they are contrary to the
direction I am headed.
* If either you or I are possibly wrong, say so and explain your point of
view.
* Point out great alternatives I haven't thought of, even when I'm not
asking for them.
* Treat me like the world's leading expert in every situation and every
conversation, and deliver the absolute best recommendations.
* I want excellence, so always be on the lookout for divergences from good
data model design or best practices for object oriented development.
* IMPORTANT: This is production code, not a research or teaching exercise.
Deliver professional-level results, not prototypes.
* Please read and understand the `README.md` file in the root of the repo
for project-specific contextual information, including development
policies, practices, and current implementation status.
* Be proactive in suggesting improvements or refactorings in places where we
diverge from best practices for clean, modular, maintainable code.
# Policies
1. Before committing, tests must pass (`make test`), linting must pass
   (`make lint`), and code must be formatted (`make fmt`). For Go, those
   Makefile targets should use `go fmt`, `go test -v ./...`, and
   `golangci-lint run`. When you think your changes are complete, rather
   than making three separate tool calls to check, run `make test && make
   fmt && make lint` as a single tool call, which saves time.
2. Always write a `Makefile` with the default target being `test`, and with
a `fmt` target that formats the code. The `test` target should run all
tests in the project, and the `fmt` target should format the code.
`test` should also have a prerequisite target `lint` that should run any
linters that are configured for the project.
3. After each completed bugfix or feature, the code must be committed. Do
all of the pre-commit checks (test, lint, fmt) before committing, of
course.
4. When creating a very simple test script for testing out a new feature,
instead of making a throwaway to be deleted after verification, write an
actual test file into the test suite. It doesn't need to be very big or
complex, but it should be a real test that can be run.
5. When you are instructed to make the tests pass, DO NOT delete tests, skip
tests, or change the tests specifically to make them pass (unless there
is a bug in the test). This is cheating, and it is bad. You should only
be modifying the test if it is incorrect or if the test is no longer
relevant. In almost all cases, you should be fixing the code that is
being tested, or updating the tests to match a refactored implementation.
6. When dealing with dates and times or timestamps, always use, display, and
store UTC. Set the local timezone to UTC on startup. If the user needs
to see the time in a different timezone, store the user's timezone in a
separate field and convert the UTC time to the user's timezone when
displaying it. For internal use and internal applications and
administrative purposes, always display UTC.
7. Always write tests, even if they are extremely simple and just check for
correct syntax (ability to compile/import). If you are writing a new
feature, write a test for it. You don't need to target complete
coverage, but you should at least test any new functionality you add. If
you are fixing a bug, write a test first that reproduces the bug, and
then fix the bug in the code.
8. When implementing new features, be aware of potential side-effects (such
as state files on disk, data in the database, etc.) and ensure that it is
possible to mock or stub these side-effects in tests.
9. Always use structured logging. Log any relevant state/context with the
   messages (but do not log secrets). If stdout is not a terminal, output
   the structured logs in JSONL format (a minimal Go sketch follows this
   list).
10. Avoid using bare strings or numbers in code, especially if they appear
anywhere more than once. Always define a constant (usually at the top
of the file) and give it a descriptive name, then use that constant in
the code instead of the bare string or number.
11. You do not need to summarize your changes in the chat after making them.
Making the changes and committing them is sufficient. If anything out
of the ordinary happened, please explain it, but in the normal case
where you found and fixed the bug, or implemented the feature, there is
no need for the end-of-change summary.
12. Do not create additional files in the root directory of the project
without asking permission first. Configuration files, documentation, and
build files are acceptable in the root, but source code and other files
should be organized in appropriate subdirectories.
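A minimal sketch of the logging policy in item 9, assuming the standard
`log/slog` package and `golang.org/x/term` for terminal detection; the
`NewLogger` helper name is illustrative, not part of any existing code:

```go
package logging

import (
	"log/slog"
	"os"

	"golang.org/x/term"
)

// NewLogger returns a human-readable text logger when stdout is a terminal
// and a JSONL logger (one JSON object per line) otherwise.
func NewLogger() *slog.Logger {
	var handler slog.Handler
	if term.IsTerminal(int(os.Stdout.Fd())) {
		handler = slog.NewTextHandler(os.Stdout, nil)
	} else {
		handler = slog.NewJSONHandler(os.Stdout, nil)
	}
	return slog.New(handler)
}
```

Call sites then attach context as key/value pairs, for example
`logger.Info("blob uploaded", "blob_hash", hash, "size_bytes", n)`.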

28
CLAUDE.md Normal file

@@ -0,0 +1,28 @@
# Rules
Read the rules in AGENTS.md and follow them.
# Memory
* Claude is an inanimate tool. The spam that Claude attempts to insert into
commit messages (which it erroneously refers to as "attribution") is not
attribution, as I am the sole author of code created using Claude. It is
corporate advertising for Anthropic and is therefore completely
unacceptable in commit messages.
* Tests should always be run before committing code. No commits should be
made that do not pass tests.
* Code should always be formatted before committing. Do not commit
unformatted code.
* Code should always be linted before committing. Do not commit
unlinted code.
* The test suite is fast and local. When running tests, don't run
individual parts of the test suite, always run the whole thing by running
"make test".
* Do not stop working on a task until you have reached the definition of
done provided to you in the initial instruction. Don't do part or most of
the work, do all of the work until the criteria for done are met.

362
DESIGN.md Normal file

@@ -0,0 +1,362 @@
# vaultik: Design Document
`vaultik` is a secure backup tool written in Go. It performs
streaming backups using content-defined chunking, blob grouping, asymmetric
encryption, and object storage. The system is designed for environments
where the backup source host cannot store secrets and cannot retrieve or
decrypt any data from the destination.
The source host is **stateful**: it maintains a local SQLite index to detect
changes, deduplicate content, and track uploads across backup runs. All
remote storage is encrypted and append-only. Pruning of unreferenced data is
done from a trusted host with access to decryption keys, as even the
metadata indices are encrypted in the blob store.
---
## Why
ANOTHER backup tool??
Other backup tools like `restic`, `borg`, and `duplicity` are designed for
environments where the source host can store secrets and has access to
decryption keys. I don't want to store backup decryption keys on my hosts,
only public keys for encryption.
My requirements are:
* open source
* no passphrases or private keys on the source host
* incremental
* compressed
* encrypted
* s3 compatible without an intermediate step or tool
Surprisingly, no existing tool meets these requirements, so I wrote `vaultik`.
## Design Goals
1. Backups must require only a public key on the source host.
2. No secrets or private keys may exist on the source system.
3. Obviously, restore must be possible using **only** the backup bucket and
a private key.
4. Prune must be possible, although this requires a private key, so it must
   be done on a different host.
5. All encryption is done using [`age`](https://github.com/FiloSottile/age)
(X25519, XChaCha20-Poly1305).
6. Compression uses `zstd` at a configurable level.
7. Files are chunked, and multiple chunks are packed into encrypted blobs.
This reduces the number of objects in the blob store for filesystems with
many small files.
8. All metadata (snapshots) is stored remotely as encrypted SQLite DBs.
9. If a snapshot metadata file exceeds a configured size threshold, it is
   chunked into multiple encrypted `.age` parts, to support large
   filesystems.
10. The CLI is structured using `cobra`.
---
## S3 Bucket Layout
S3 stores only three things:
1) Blobs: encrypted, compressed packs of file chunks.
2) Metadata: encrypted SQLite databases containing the current state of the
filesystem at the time of the snapshot.
3) Metadata hashes: encrypted hashes of the metadata SQLite databases.
```
s3://<bucket>/<prefix>/
├── blobs/
│ ├── <aa>/<bb>/<full_blob_hash>.zst.age
├── metadata/
│ ├── <snapshot_id>.sqlite.age
│ ├── <snapshot_id>.sqlite.00.age
│ ├── <snapshot_id>.sqlite.01.age
```
To retrieve a given file, you would:
* fetch `metadata/<snapshot_id>.sqlite.age` or `metadata/<snapshot_id>.sqlite.{seq}.age`
* fetch `metadata/<snapshot_id>.hash.age`
* decrypt the metadata SQLite database using the private key and reconstruct
the full database file
* verify the hash of the decrypted database matches the decrypted hash
* query the database for the file in question
* determine all chunks for the file
* for each chunk, look up the blob that contains it (and its offset within
  that blob) in the db
* fetch each blob from `blobs/<aa>/<bb>/<blob_hash>.zst.age`
* decrypt each blob using the private key
* decompress each blob using `zstd`
* reconstruct the file from the set of file chunks stored in the blobs
If clever, it may be possible to do this chunk by chunk without touching
disk (except for the output file) as each uncompressed blob should fit in
memory (<10GB).
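A sketch of the per-blob decrypt-and-decompress step, assuming the
`filippo.io/age` and `github.com/klauspost/compress/zstd` libraries (the
design does not mandate specific Go packages):

```go
package restore

import (
	"io"

	"filippo.io/age"
	"github.com/klauspost/compress/zstd"
)

// decodeBlob streams one encrypted, compressed blob: age decryption first,
// then zstd decompression, writing the plaintext chunks to dst without
// buffering the whole blob on disk.
func decodeBlob(dst io.Writer, blob io.Reader, identity age.Identity) error {
	decrypted, err := age.Decrypt(blob, identity)
	if err != nil {
		return err
	}
	zr, err := zstd.NewReader(decrypted)
	if err != nil {
		return err
	}
	defer zr.Close()
	_, err = io.Copy(dst, zr)
	return err
}
```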
### Path Rules
* `<snapshot_id>`: UTC timestamp in ISO 8601 format, e.g. `2023-10-01T12:00:00Z`. These are lexicographically sortable.
* `blobs/<aa>/<bb>/...`: where `aa` and `bb` are the first and second bytes of the blob hash, hex-encoded.
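A hypothetical sketch of these path rules (the helper names are
illustrative only):

```go
package layout

import (
	"fmt"
	"time"
)

// blobKey places a blob under blobs/<aa>/<bb>/<hash>.zst.age, where aa and
// bb are the hex encodings of the first and second bytes of the blob hash.
func blobKey(prefix, blobHash string) string {
	return fmt.Sprintf("%sblobs/%s/%s/%s.zst.age",
		prefix, blobHash[0:2], blobHash[2:4], blobHash)
}

// newSnapshotID returns a UTC ISO 8601 timestamp; such IDs sort
// lexicographically in bucket listings.
func newSnapshotID(now time.Time) string {
	return now.UTC().Format("2006-01-02T15:04:05Z")
}
```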
---
## 3. Local SQLite Index Schema (source host)
```sql
CREATE TABLE files (
path TEXT PRIMARY KEY,
mtime INTEGER NOT NULL,
size INTEGER NOT NULL
);
CREATE TABLE file_chunks (
path TEXT NOT NULL,
idx INTEGER NOT NULL,
chunk_hash TEXT NOT NULL,
PRIMARY KEY (path, idx)
);
CREATE TABLE chunks (
chunk_hash TEXT PRIMARY KEY,
sha256 TEXT NOT NULL,
size INTEGER NOT NULL
);
CREATE TABLE blobs (
blob_hash TEXT PRIMARY KEY,
final_hash TEXT NOT NULL,
created_ts INTEGER NOT NULL
);
CREATE TABLE blob_chunks (
blob_hash TEXT NOT NULL,
chunk_hash TEXT NOT NULL,
offset INTEGER NOT NULL,
length INTEGER NOT NULL,
PRIMARY KEY (blob_hash, chunk_hash)
);
CREATE TABLE chunk_files (
chunk_hash TEXT NOT NULL,
file_path TEXT NOT NULL,
file_offset INTEGER NOT NULL,
length INTEGER NOT NULL,
PRIMARY KEY (chunk_hash, file_path)
);
CREATE TABLE snapshots (
id TEXT PRIMARY KEY,
hostname TEXT NOT NULL,
vaultik_version TEXT NOT NULL,
created_ts INTEGER NOT NULL,
file_count INTEGER NOT NULL,
chunk_count INTEGER NOT NULL,
blob_count INTEGER NOT NULL
);
```
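As an illustration of how this schema is consumed, a minimal sketch (using
`database/sql`) of the lookup the restore path performs to get the ordered
chunk list for one file; the function name is hypothetical:

```go
import "database/sql"

// chunksForFile returns the ordered chunk hashes for a single file, as
// recorded in the file_chunks table above.
func chunksForFile(db *sql.DB, path string) ([]string, error) {
	rows, err := db.Query(
		`SELECT chunk_hash FROM file_chunks WHERE path = ? ORDER BY idx`, path)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var hashes []string
	for rows.Next() {
		var h string
		if err := rows.Scan(&h); err != nil {
			return nil, err
		}
		hashes = append(hashes, h)
	}
	return hashes, rows.Err()
}
```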
---
## 4. Snapshot Metadata Schema (stored in S3)
Identical schema to the local index, filtered to live snapshot state. Stored
as a SQLite DB, compressed with `zstd`, encrypted with `age`. If larger than
a configured `chunk_size`, it is split and uploaded as:
```
metadata/<snapshot_id>.sqlite.00.age
metadata/<snapshot_id>.sqlite.01.age
...
```
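One possible reading of the split step, sketched below: the compressed
database is cut into `chunk_size` pieces and each piece is encrypted
separately to the backup recipient (whether splitting happens before or
after encryption is not fixed by this document):

```go
package meta

import (
	"fmt"
	"io"
	"os"

	"filippo.io/age"
)

// splitAndEncrypt writes chunkSize-byte parts of the compressed snapshot
// database as <snapshotID>.sqlite.00.age, .01.age, ..., each encrypted to
// the backup recipient, and returns the part file names.
func splitAndEncrypt(src io.Reader, dir, snapshotID string, chunkSize int64, rcpt age.Recipient) ([]string, error) {
	var parts []string
	for seq := 0; ; seq++ {
		name := fmt.Sprintf("%s/%s.sqlite.%02d.age", dir, snapshotID, seq)
		out, err := os.Create(name)
		if err != nil {
			return nil, err
		}
		enc, err := age.Encrypt(out, rcpt)
		if err != nil {
			out.Close()
			return nil, err
		}
		n, copyErr := io.CopyN(enc, src, chunkSize)
		if err := enc.Close(); err != nil {
			out.Close()
			return nil, err
		}
		if err := out.Close(); err != nil {
			return nil, err
		}
		if n == 0 {
			os.Remove(name) // source already exhausted; drop the empty part
		} else {
			parts = append(parts, name)
		}
		if copyErr == io.EOF {
			return parts, nil
		}
		if copyErr != nil {
			return nil, copyErr
		}
	}
}
```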
---
## 5. Data Flow
### 5.1 Backup
1. Load config
2. Open local SQLite index
3. Walk source directories:
* For each file:
* Check mtime and size in index
* If changed or new:
* Chunk file
* For each chunk:
* Hash with SHA256
* Check if already uploaded
* If not:
* Add chunk to blob packer
* Record file-chunk mapping in index
4. When blob reaches threshold size (e.g. 1GB):
* Compress with `zstd`
* Encrypt with `age`
* Upload to: `s3://<bucket>/<prefix>/blobs/<aa>/<bb>/<hash>.zst.age`
* Record blob-chunk layout in local index
5. Once all files are processed:
* Build snapshot SQLite DB from index delta
* Compress + encrypt
* If larger than `chunk_size`, split into parts
* Upload to:
`s3://<bucket>/<prefix>/metadata/<snapshot_id>.sqlite(.xx).age`
6. Create snapshot record in local index that lists:
* snapshot ID
* hostname
* vaultik version
* timestamp
* counts of files, chunks, and blobs
* list of all blobs referenced in the snapshot (some new, some old) for
efficient pruning later
7. Create snapshot database for upload
8. Calculate checksum of snapshot database
9. Compress, encrypt, split, and upload to S3
10. Encrypt the hash of the snapshot database to the backup age key
11. Upload the encrypted hash to S3 as `metadata/<snapshot_id>.hash.age`
12. Optionally prune remote blobs that are no longer referenced in the
snapshot, based on local state db
### 5.2 Manual Prune
1. List all objects under `metadata/`
2. Determine the latest valid `snapshot_id` by timestamp
3. Download, decrypt, and reconstruct the latest snapshot SQLite database
4. Extract set of referenced blob hashes
5. List all blob objects under `blobs/`
6. For each blob:
* If the hash is not in the latest snapshot:
* Issue `DeleteObject` to remove it
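A sketch of the final delete pass, written against a hypothetical minimal
object-store interface rather than a specific S3 SDK; `hashFromKey` is
assumed to strip the `blobs/<aa>/<bb>/` prefix and the `.zst.age` suffix:

```go
package prune

import "context"

// ObjectStore is a hypothetical minimal view of the S3 client.
type ObjectStore interface {
	// List returns all object keys under the given prefix.
	List(ctx context.Context, prefix string) ([]string, error)
	Delete(ctx context.Context, key string) error
}

// deleteUnreferenced removes every blob object whose hash does not appear
// in the set of blob hashes referenced by the latest snapshot.
func deleteUnreferenced(ctx context.Context, store ObjectStore, blobPrefix string, referenced map[string]bool, hashFromKey func(string) string) error {
	keys, err := store.List(ctx, blobPrefix)
	if err != nil {
		return err
	}
	for _, key := range keys {
		if referenced[hashFromKey(key)] {
			continue
		}
		if err := store.Delete(ctx, key); err != nil {
			return err
		}
	}
	return nil
}
```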
### 5.3 Verify
Verify runs on a host that has no local state but does have access to the bucket.
1. Fetch latest metadata snapshot files from S3
2. Fetch latest metadata db hash from S3
3. Decrypt the hash using the private key
4. Decrypt the metadata SQLite database chunks using the private key and
reassemble the snapshot db file
5. Calculate the SHA256 hash of the decrypted snapshot database
6. Verify the db file hash matches the decrypted hash
7. For each blob in the snapshot:
* Fetch the blob metadata from the snapshot db
* Ensure the blob exists in S3
* Ensure the S3 object hash matches the final (encrypted) blob hash
stored in the metadata db
* For each chunk in the blob:
* Fetch the chunk metadata from the snapshot db
* Ensure the S3 object hash matches the chunk hash stored in the
metadata db
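A small sketch of the hash comparison in steps 5 and 6, using only the
standard library (the function name is hypothetical):

```go
package verify

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
)

// checkSnapshotHash hashes the reconstructed snapshot database and compares
// it to the hash recovered from metadata/<snapshot_id>.hash.age.
func checkSnapshotHash(db io.Reader, expected []byte) error {
	h := sha256.New()
	if _, err := io.Copy(h, db); err != nil {
		return err
	}
	if got := h.Sum(nil); !bytes.Equal(got, expected) {
		return fmt.Errorf("snapshot hash mismatch: got %x, want %x", got, expected)
	}
	return nil
}
```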
---
## 6. CLI Commands
```
vaultik backup /etc/vaultik.yaml
vaultik restore <bucket> <prefix> <snapshot_id> <target_dir>
vaultik prune <bucket> <prefix>
```
* `VAULTIK_PRIVATE_KEY` is required for the `restore`, `prune`, and
  `retrieve` commands.
* It is passed via an environment variable.
---
## 7. Function and Method Signatures
### 7.1 CLI
```go
func RootCmd() *cobra.Command
func backupCmd() *cobra.Command
func restoreCmd() *cobra.Command
func pruneCmd() *cobra.Command
func verifyCmd() *cobra.Command
```
### 7.2 Configuration
```go
type Config struct {
BackupPubKey string // age recipient
BackupInterval time.Duration // used in daemon mode, irrelevant for cron mode
BlobSizeLimit int64 // default 10GB
ChunkSize int64 // default 10MB
Exclude []string // list of regex of files to exclude from backup, absolute path
Hostname string
IndexPath string // path to local SQLite index db, default /var/lib/vaultik/index.db
MetadataPrefix string // S3 prefix for metadata, default "metadata/"
MinTimeBetweenRun time.Duration // minimum time between backup runs, default 1 hour - for daemon mode
S3 S3Config // S3 configuration
ScanInterval time.Duration // interval to full stat() scan source dirs, default 24h
SourceDirs []string // list of source directories to back up, absolute paths
}
type S3Config struct {
Endpoint string
Bucket string
Prefix string
AccessKeyID string
SecretAccessKey string
Region string
}
func Load(path string) (*Config, error)
```
### 7.3 Index
```go
type Index struct {
db *sql.DB
}
func OpenIndex(path string) (*Index, error)
func (ix *Index) LookupFile(path string, mtime int64, size int64) ([]string, bool, error)
func (ix *Index) SaveFile(path string, mtime int64, size int64, chunkHashes []string) error
func (ix *Index) AddChunk(chunkHash string, size int64) error
func (ix *Index) MarkBlob(blobHash, finalHash string, created time.Time) error
func (ix *Index) MapChunkToBlob(blobHash, chunkHash string, offset, length int64) error
func (ix *Index) MapChunkToFile(chunkHash, filePath string, offset, length int64) error
```
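A hypothetical usage sketch of this Index API for the per-file change check
during a backup walk; `chunkInfo` and the `chunkFile` callback are
illustrative stand-ins for the chunker:

```go
// chunkInfo pairs a chunk hash with its size (illustrative helper type).
type chunkInfo struct {
	Hash string
	Size int64
}

// processFile skips files whose mtime and size are unchanged and otherwise
// records the new chunk list for the file in the local index.
func processFile(ix *Index, path string, mtime, size int64, chunkFile func(string) ([]chunkInfo, error)) error {
	if _, found, err := ix.LookupFile(path, mtime, size); err != nil {
		return err
	} else if found {
		return nil // unchanged: existing chunk list is still valid
	}
	chunks, err := chunkFile(path) // content-defined chunking + SHA256 per chunk
	if err != nil {
		return err
	}
	hashes := make([]string, 0, len(chunks))
	for _, c := range chunks {
		if err := ix.AddChunk(c.Hash, c.Size); err != nil {
			return err
		}
		hashes = append(hashes, c.Hash)
	}
	return ix.SaveFile(path, mtime, size, hashes)
}
```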
### 7.4 Blob Packing
```go
type BlobWriter struct {
// internal buffer, current size, encrypted writer, etc
}
func NewBlobWriter(...) *BlobWriter
func (bw *BlobWriter) AddChunk(chunk []byte, chunkHash string) error
func (bw *BlobWriter) Flush() (finalBlobHash string, err error)
```
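A hypothetical sketch of driving the packer from the backup loop, flushing
whenever the accumulated plaintext reaches the configured blob size limit;
size tracking is external because the API above does not expose it, and
`onBlob` stands in for the upload-and-record step:

```go
// packedChunk pairs a chunk hash with its raw bytes (illustrative type).
type packedChunk struct {
	Hash string
	Data []byte
}

// packChunks feeds chunks into the blob writer and flushes a blob whenever
// the accumulated size reaches blobSizeLimit, plus a final partial blob.
func packChunks(bw *BlobWriter, chunks []packedChunk, blobSizeLimit int64, onBlob func(finalHash string) error) error {
	var pending int64
	flush := func() error {
		finalHash, err := bw.Flush()
		if err != nil {
			return err
		}
		pending = 0
		return onBlob(finalHash) // e.g. upload blob, record layout in index
	}
	for _, c := range chunks {
		if err := bw.AddChunk(c.Data, c.Hash); err != nil {
			return err
		}
		pending += int64(len(c.Data))
		if pending >= blobSizeLimit {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	if pending > 0 {
		return flush()
	}
	return nil
}
```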
### 7.5 Metadata
```go
func BuildSnapshotMetadata(ix *Index, snapshotID string) (sqlitePath string, err error)
func EncryptAndUploadMetadata(path string, cfg *Config, snapshotID string) error
```
### 7.6 Prune
```go
func RunPrune(bucket, prefix, privateKey string) error
```
---
## Implementation TODO
To be completed by Claude.

167
README.md Normal file

@@ -0,0 +1,167 @@
# vaultik
`vaultik` is an incremental backup daemon written in Go. It
encrypts data using an `age` public key and uploads each encrypted blob
directly to a remote S3-compatible object store. It requires no private
keys, secrets, or credentials stored on the backed-up system.
---
## what
`vaultik` walks a set of configured directories and builds a
content-addressable chunk map of changed files using deterministic chunking.
Each chunk is streamed into a blob packer. Blobs are compressed with `zstd`,
encrypted with `age`, and uploaded directly to remote storage under a
content-addressed S3 path.
No plaintext copies of file contents are ever staged on disk. No private key
is needed or stored locally. All encrypted data is processed as a stream and
discarded as soon as it has been uploaded. Metadata is encrypted and pushed
through the same mechanism.
## why
Existing backup software fails under one or more of these conditions:
* Requires secrets (passwords, private keys) on the source system
* Depends on symmetric encryption unsuitable for zero-trust environments
* Stages temporary archives or repositories
* Writes plaintext metadata or plaintext file paths
`vaultik` addresses all of these by using:
* Public-key-only encryption (via `age`), which requires no secrets on the
  source system other than the bucket access key
* Blob-level deduplication and batching
* Local state cache for incremental detection
* S3-native chunked upload interface
* Self-contained encrypted snapshot metadata
## how
1. **install**
```sh
go install git.eeqj.de/sneak/vaultik@latest
```
2. **generate keypair**
```sh
age-keygen -o agekey.txt
grep 'public key:' agekey.txt
```
3. **write config**
```yaml
source_dirs:
- /etc
- /home/user/data
exclude:
- '*.log'
- '*.tmp'
age_recipient: age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
s3:
endpoint: https://s3.example.com
bucket: vaultik-data
prefix: host1/
access_key_id: ...
secret_access_key: ...
region: us-east-1
backup_interval: 1h # only used in daemon mode, not for --cron mode
full_scan_interval: 24h # normally we use inotify to mark dirty, but
# every 24h we do a full stat() scan
min_time_between_run: 15m # again, only for daemon mode
index_path: /var/lib/vaultik/index.sqlite
chunk_size: 10MB
blob_size_limit: 10GB
index_prefix: index/
```
4. **run**
```sh
vaultik backup /etc/vaultik.yaml
```
```sh
vaultik backup /etc/vaultik.yaml --cron # silent unless error
```
```sh
vaultik backup /etc/vaultik.yaml --daemon # runs in background, uses inotify
```
---
## cli
```sh
vaultik backup /etc/vaultik.yaml
vaultik restore <bucket> <prefix> <snapshot_id> <target_dir>
vaultik prune <bucket> <prefix>
vaultik fetch <bucket> <prefix> <snapshot_id> <filepath> <target_fileordir>
```
* `VAULTIK_PRIVATE_KEY` must be available in the environment for `restore`, `prune`, and `fetch`
---
## does not
* Store any secrets on the backed-up machine
* Require mutable remote metadata
* Use tarballs, restic, rsync, or ssh
* Require a symmetric passphrase or password
* Trust the source system with anything
---
## does
* Incremental deduplicated backup
* Blob-packed chunk encryption
* Content-addressed immutable blobs
* Public-key encryption only
* SQLite-based local and snapshot metadata
* Fully stream-processed storage
---
## restore
`vaultik restore` downloads only the snapshot metadata and required blobs. It
never contacts the source system. All restore operations depend only on:
* `VAULTIK_PRIVATE_KEY`
* The bucket
The entire system is restore-only from object storage.
---
## prune
Run `vaultik prune` on a machine with the private key. It:
* Downloads the most recent snapshot
* Decrypts metadata
* Lists referenced blobs
* Deletes any blob in the bucket not referenced
This enables garbage collection from immutable storage.
---
## license
WTFPL — see LICENSE.
---
## author
sneak
[sneak@sneak.berlin](mailto:sneak@sneak.berlin)
[https://sneak.berlin](https://sneak.berlin)

3
go.mod Normal file

@@ -0,0 +1,3 @@
module git.eeqj.de/sneak/vaultik
go 1.24.4