commit 67319a46990af5f651997e8c1f74497c49b22e77 Author: sneak Date: Sun Jul 20 08:51:38 2025 +0200 initial design diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..3ae7952 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,105 @@ +# Policies for AI Agents + +Version: 2025-06-08 + +# Instructions and Contextual Information + +* Be direct, robotic, expert, accurate, and professional. + +* Do not butter me up or kiss my ass. + +* Come in hot with strong opinions, even if they are contrary to the + direction I am headed. + +* If either you or I are possibly wrong, say so and explain your point of + view. + +* Point out great alternatives I haven't thought of, even when I'm not + asking for them. + +* Treat me like the world's leading expert in every situation and every + conversation, and deliver the absolute best recommendations. + +* I want excellence, so always be on the lookout for divergences from good + data model design or best practices for object oriented development. + +* IMPORTANT: This is production code, not a research or teaching exercise. + Deliver professional-level results, not prototypes. + +* Please read and understand the `README.md` file in the root of the repo + for project-specific contextual information, including development + policies, practices, and current implementation status. + +* Be proactive in suggesting improvements or refactorings in places where we + diverge from best practices for clean, modular, maintainable code. + +# Policies + +1. Before committing, tests must pass (`make test`), linting must pass + (`make lint`), and code must be formatted (`make fmt`). For go, those + makefile targets should use `go fmt` and `go test -v ./...` and + `golangci-lint run`. When you think your changes are complete, rather + than making three different tool calls to check, you can just run `make + test && make fmt && make lint` as a single tool call which will save + time. + +2. Always write a `Makefile` with the default target being `test`, and with + a `fmt` target that formats the code. The `test` target should run all + tests in the project, and the `fmt` target should format the code. + `test` should also have a prerequisite target `lint` that should run any + linters that are configured for the project. + +3. After each completed bugfix or feature, the code must be committed. Do + all of the pre-commit checks (test, lint, fmt) before committing, of + course. + +4. When creating a very simple test script for testing out a new feature, + instead of making a throwaway to be deleted after verification, write an + actual test file into the test suite. It doesn't need to be very big or + complex, but it should be a real test that can be run. + +5. When you are instructed to make the tests pass, DO NOT delete tests, skip + tests, or change the tests specifically to make them pass (unless there + is a bug in the test). This is cheating, and it is bad. You should only + be modifying the test if it is incorrect or if the test is no longer + relevant. In almost all cases, you should be fixing the code that is + being tested, or updating the tests to match a refactored implementation. + +6. When dealing with dates and times or timestamps, always use, display, and + store UTC. Set the local timezone to UTC on startup. If the user needs + to see the time in a different timezone, store the user's timezone in a + separate field and convert the UTC time to the user's timezone when + displaying it. 
For internal use and internal applications and + administrative purposes, always display UTC. + +7. Always write tests, even if they are extremely simple and just check for + correct syntax (ability to compile/import). If you are writing a new + feature, write a test for it. You don't need to target complete + coverage, but you should at least test any new functionality you add. If + you are fixing a bug, write a test first that reproduces the bug, and + then fix the bug in the code. + +8. When implementing new features, be aware of potential side-effects (such + as state files on disk, data in the database, etc.) and ensure that it is + possible to mock or stub these side-effects in tests. + +9. Always use structured logging. Log any relevant state/context with the + messages (but do not log secrets). If stdout is not a terminal, output + the structured logs in jsonl format. + +10. Avoid using bare strings or numbers in code, especially if they appear + anywhere more than once. Always define a constant (usually at the top + of the file) and give it a descriptive name, then use that constant in + the code instead of the bare string or number. + +11. You do not need to summarize your changes in the chat after making them. + Making the changes and committing them is sufficient. If anything out + of the ordinary happened, please explain it, but in the normal case + where you found and fixed the bug, or implemented the feature, there is + no need for the end-of-change summary. + +12. Do not create additional files in the root directory of the project + without asking permission first. Configuration files, documentation, and + build files are acceptable in the root, but source code and other files + should be organized in appropriate subdirectories. + diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..fc06c5c --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,28 @@ +# Rules + +Read the rules in AGENTS.md and follow them. + +# Memory + +* Claude is an inanimate tool. The spam that Claude attempts to insert into + commit messages (which it erroneously refers to as "attribution") is not + attribution, as I am the sole author of code created using Claude. It is + corporate advertising for Anthropic and is therefore completely + unacceptable in commit messages. + +* Tests should always be run before committing code. No commits should be + made that do not pass tests. + +* Code should always be formatted before committing. Do not commit + unformatted code. + +* Code should always be linted before committing. Do not commit + unlinted code. + +* The test suite is fast and local. When running tests, don't run + individual parts of the test suite, always run the whole thing by running + "make test". + +* Do not stop working on a task until you have reached the definition of + done provided to you in the initial instruction. Don't do part or most of + the work, do all of the work until the criteria for done are met. diff --git a/DESIGN.md b/DESIGN.md new file mode 100644 index 0000000..f1675ae --- /dev/null +++ b/DESIGN.md @@ -0,0 +1,362 @@ +# vaultik: Design Document + +`vaultik` is a secure backup tool written in Go. It performs +streaming backups using content-defined chunking, blob grouping, asymmetric +encryption, and object storage. The system is designed for environments +where the backup source host cannot store secrets and cannot retrieve or +decrypt any data from the destination. 
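+
+As a sketch of the core write path (illustrative only, not the final
+implementation), the compress-then-encrypt pipeline can be built entirely
+from streaming writers. This assumes the `filippo.io/age` and
+`github.com/klauspost/compress/zstd` packages; the design does not mandate
+specific libraries.
+
+```go
+package vaultik
+
+import (
+	"fmt"
+	"io"
+
+	"filippo.io/age"
+	"github.com/klauspost/compress/zstd"
+)
+
+// encryptStream compresses plaintext with zstd and encrypts it to the backup
+// age public key, writing the result to dst (an S3 upload stream in
+// practice). Only the recipient (public key) is needed on the source host.
+func encryptStream(dst io.Writer, recipientStr string, plaintext io.Reader) error {
+	recipient, err := age.ParseX25519Recipient(recipientStr)
+	if err != nil {
+		return fmt.Errorf("parse recipient: %w", err)
+	}
+
+	// Outer layer: age encryption to the backup public key.
+	encWriter, err := age.Encrypt(dst, recipient)
+	if err != nil {
+		return fmt.Errorf("age encrypt: %w", err)
+	}
+
+	// Inner layer: zstd compression, so data is compressed before it is
+	// encrypted.
+	zw, err := zstd.NewWriter(encWriter)
+	if err != nil {
+		return fmt.Errorf("zstd writer: %w", err)
+	}
+
+	if _, err := io.Copy(zw, plaintext); err != nil {
+		return fmt.Errorf("stream copy: %w", err)
+	}
+	if err := zw.Close(); err != nil { // flush the compressed frame
+		return err
+	}
+	return encWriter.Close() // finalize the age stream
+}
+```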
+
+The source host is **stateful**: it maintains a local SQLite index to detect
+changes, deduplicate content, and track uploads across backup runs. All
+remote storage is encrypted and append-only. Pruning of unreferenced data is
+done from a trusted host with access to decryption keys, as even the
+metadata indices are encrypted in the blob store.
+
+---
+
+## Why
+
+ANOTHER backup tool??
+
+Other backup tools like `restic`, `borg`, and `duplicity` are designed for
+environments where the source host can store secrets and has access to
+decryption keys. I don't want to store backup decryption keys on my hosts,
+only public keys for encryption.
+
+My requirements are:
+
+* open source
+* no passphrases or private keys on the source host
+* incremental
+* compressed
+* encrypted
+* S3-compatible without an intermediate step or tool
+
+Surprisingly, no existing tool meets these requirements, so I wrote `vaultik`.
+
+## Design Goals
+
+1. Backups must require only a public key on the source host.
+2. No secrets or private keys may exist on the source system.
+3. Obviously, restore must be possible using **only** the backup bucket and
+   a private key.
+4. Prune must be possible, although this requires a private key, so it must
+   be done on a different host.
+5. All encryption is done using [`age`](https://github.com/FiloSottile/age)
+   (X25519, XChaCha20-Poly1305).
+6. Compression uses `zstd` at a configurable level.
+7. Files are chunked, and multiple chunks are packed into encrypted blobs.
+   This reduces the number of objects in the blob store for filesystems with
+   many small files.
+8. All metadata (snapshots) is stored remotely as encrypted SQLite DBs.
+9. If a snapshot metadata file exceeds a configured size threshold, it is
+   chunked into multiple encrypted `.age` parts, to support large
+   filesystems.
+10. The CLI interface is structured using `cobra`.
+
+---
+
+## S3 Bucket Layout
+
+S3 stores only three things:
+
+1) Blobs: encrypted, compressed packs of file chunks.
+2) Metadata: encrypted SQLite databases containing the current state of the
+   filesystem at the time of the snapshot.
+3) Metadata hashes: encrypted hashes of the metadata SQLite databases.
+
+```
+s3://<bucket>/<prefix>/
+├── blobs/
+│   ├── <aa>/<bb>/<blob_hash>.zst.age
+├── metadata/
+│   ├── <snapshot_id>.sqlite.age
+│   ├── <snapshot_id>.sqlite.00.age
+│   ├── <snapshot_id>.sqlite.01.age
+```
+
+To retrieve a given file, you would:
+
+* fetch `metadata/<snapshot_id>.sqlite.age` or
+  `metadata/<snapshot_id>.sqlite.{seq}.age`
+* fetch `metadata/<snapshot_id>.hash.age`
+* decrypt the metadata SQLite database using the private key and reconstruct
+  the full database file
+* verify that the hash of the decrypted database matches the decrypted hash
+* query the database for the file in question
+* determine all chunks for the file
+* for each chunk, look up the metadata for all blobs in the db
+* fetch each blob from `blobs/<aa>/<bb>/<blob_hash>.zst.age`
+* decrypt each blob using the private key
+* decompress each blob using `zstd`
+* reconstruct the file from the set of file chunks stored in the blobs
+
+With care, this can be done chunk by chunk without touching disk (except for
+the output file), as each uncompressed blob should fit in memory (<10GB). A
+sketch of the per-blob step follows below.
+
+### Path Rules
+
+* `<snapshot_id>`: UTC timestamp in ISO 8601 format, e.g.
+  `2023-10-01T12:00:00Z`. These are lexicographically sortable.
+* `blobs/<aa>/<bb>/...`: `aa` and `bb` are the first two hex bytes of the
+  blob hash.
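+
+The following is a minimal sketch of that per-blob step: decrypt with the
+private key, decompress, and slice out a single chunk by its recorded offset
+and length. It assumes the `filippo.io/age` and
+`github.com/klauspost/compress/zstd` packages; the helper name and signature
+are illustrative, not part of the design.
+
+```go
+// extractChunk pulls one chunk out of a fetched blob object
+// (blobs/<aa>/<bb>/<blob_hash>.zst.age). identity is the age private key;
+// offset and length come from the blob_chunks table. Nothing touches disk.
+func extractChunk(blobObj io.Reader, identity age.Identity, offset, length int64) ([]byte, error) {
+	// The blob was compressed and then encrypted, so decrypt first.
+	decrypted, err := age.Decrypt(blobObj, identity)
+	if err != nil {
+		return nil, fmt.Errorf("age decrypt: %w", err)
+	}
+
+	// Decompress the decrypted stream.
+	zr, err := zstd.NewReader(decrypted)
+	if err != nil {
+		return nil, fmt.Errorf("zstd reader: %w", err)
+	}
+	defer zr.Close()
+
+	// Skip to the chunk offset inside the decompressed blob, then read
+	// exactly `length` bytes.
+	if _, err := io.CopyN(io.Discard, zr, offset); err != nil {
+		return nil, fmt.Errorf("seek to chunk offset: %w", err)
+	}
+	chunk := make([]byte, length)
+	if _, err := io.ReadFull(zr, chunk); err != nil {
+		return nil, fmt.Errorf("read chunk: %w", err)
+	}
+	return chunk, nil
+}
+```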
+
+---
+
+## 3. Local SQLite Index Schema (source host)
+
+```sql
+CREATE TABLE files (
+    path TEXT PRIMARY KEY,
+    mtime INTEGER NOT NULL,
+    size INTEGER NOT NULL
+);
+
+CREATE TABLE file_chunks (
+    path TEXT NOT NULL,
+    idx INTEGER NOT NULL,
+    chunk_hash TEXT NOT NULL,
+    PRIMARY KEY (path, idx)
+);
+
+CREATE TABLE chunks (
+    chunk_hash TEXT PRIMARY KEY,
+    sha256 TEXT NOT NULL,
+    size INTEGER NOT NULL
+);
+
+CREATE TABLE blobs (
+    blob_hash TEXT PRIMARY KEY,
+    final_hash TEXT NOT NULL,
+    created_ts INTEGER NOT NULL
+);
+
+CREATE TABLE blob_chunks (
+    blob_hash TEXT NOT NULL,
+    chunk_hash TEXT NOT NULL,
+    offset INTEGER NOT NULL,
+    length INTEGER NOT NULL,
+    PRIMARY KEY (blob_hash, chunk_hash)
+);
+
+CREATE TABLE chunk_files (
+    chunk_hash TEXT NOT NULL,
+    file_path TEXT NOT NULL,
+    file_offset INTEGER NOT NULL,
+    length INTEGER NOT NULL,
+    PRIMARY KEY (chunk_hash, file_path)
+);
+
+CREATE TABLE snapshots (
+    id TEXT PRIMARY KEY,
+    hostname TEXT NOT NULL,
+    vaultik_version TEXT NOT NULL,
+    created_ts INTEGER NOT NULL,
+    file_count INTEGER NOT NULL,
+    chunk_count INTEGER NOT NULL,
+    blob_count INTEGER NOT NULL
+);
+```
+
+---
+
+## 4. Snapshot Metadata Schema (stored in S3)
+
+Identical schema to the local index, filtered to live snapshot state. Stored
+as a SQLite DB, compressed with `zstd`, encrypted with `age`. If larger than
+a configured `chunk_size`, it is split and uploaded as:
+
+```
+metadata/<snapshot_id>.sqlite.00.age
+metadata/<snapshot_id>.sqlite.01.age
+...
+```
+
+---
+
+## 5. Data Flow
+
+### 5.1 Backup
+
+1. Load config
+2. Open local SQLite index
+3. Walk source directories:
+
+   * For each file:
+
+     * Check mtime and size in index
+     * If changed or new:
+
+       * Chunk the file (see the chunker sketch at the end of this section)
+       * For each chunk:
+
+         * Hash with SHA256
+         * Check if already uploaded
+         * If not:
+
+           * Add chunk to blob packer
+         * Record file-chunk mapping in index
+4. When a blob reaches the threshold size (e.g. 1GB):
+
+   * Compress with `zstd`
+   * Encrypt with `age`
+   * Upload to: `s3://<bucket>/<prefix>/blobs/<aa>/<bb>/<blob_hash>.zst.age`
+   * Record blob-chunk layout in local index
+5. Once all files are processed:
+   * Build snapshot SQLite DB from index delta
+   * Compress + encrypt
+   * If larger than `chunk_size`, split into parts
+   * Upload to:
+     `s3://<bucket>/<prefix>/metadata/<snapshot_id>.sqlite(.xx).age`
+6. Create snapshot record in local index that lists:
+   * snapshot ID
+   * hostname
+   * vaultik version
+   * timestamp
+   * counts of files, chunks, and blobs
+   * list of all blobs referenced in the snapshot (some new, some old) for
+     efficient pruning later
+7. Create snapshot database for upload
+8. Calculate checksum of snapshot database
+9. Compress, encrypt, split, and upload to S3
+10. Encrypt the hash of the snapshot database to the backup age key
+11. Upload the encrypted hash to S3 as `metadata/<snapshot_id>.hash.age`
+12. Optionally prune remote blobs that are no longer referenced in the
+    snapshot, based on the local state db
+
+### 5.2 Manual Prune
+
+1. List all objects under `metadata/`
+2. Determine the latest valid `snapshot_id` by timestamp
+3. Download, decrypt, and reconstruct the latest snapshot SQLite database
+4. Extract the set of referenced blob hashes
+5. List all blob objects under `blobs/`
+6. For each blob:
+   * If the hash is not in the latest snapshot:
+     * Issue `DeleteObject` to remove it
+
+### 5.3 Verify
+
+Verify runs on a host that has no state but does have access to the bucket.
+
+1. Fetch the latest metadata snapshot files from S3
+2. Fetch the latest metadata db hash from S3
+3. Decrypt the hash using the private key
+4. Decrypt the metadata SQLite database chunks using the private key and
+   reassemble the snapshot db file
+5. Calculate the SHA256 hash of the decrypted snapshot database
+6. Verify the db file hash matches the decrypted hash
+7. For each blob in the snapshot:
+   * Fetch the blob metadata from the snapshot db
+   * Ensure the blob exists in S3
+   * Ensure the S3 object hash matches the final (encrypted) blob hash
+     stored in the metadata db
+   * For each chunk in the blob:
+     * Fetch the chunk metadata from the snapshot db
+     * Ensure the S3 object hash matches the chunk hash stored in the
+       metadata db
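+
+The "chunk the file" step in 5.1 is what makes deduplication work: chunk
+boundaries are derived from the content itself, so a small edit only changes
+the chunks around it. The following is an illustrative sketch of one
+possible chunker (imports of `bufio` and `io` are elided); the helper name,
+the constants, and the simplified rolling hash (an FNV-1a running hash
+checked against a bit mask) are placeholders, and a production
+implementation would likely use a windowed scheme such as a buzhash or
+FastCDC.
+
+```go
+// chunkReader splits r into content-defined chunks and calls emit for each
+// one. emit must consume or copy the chunk before returning, because the
+// buffer is reused for the next chunk.
+func chunkReader(r io.Reader, emit func(chunk []byte) error) error {
+	const (
+		minSize = 512 * 1024      // never cut a chunk smaller than this
+		maxSize = 8 * 1024 * 1024 // always cut at this size
+		mask    = (1 << 20) - 1   // ~1 MiB average chunk size
+	)
+	br := bufio.NewReader(r)
+	buf := make([]byte, 0, maxSize)
+	var h uint32 = 2166136261 // FNV-1a offset basis
+
+	flush := func() error {
+		if len(buf) == 0 {
+			return nil
+		}
+		if err := emit(buf); err != nil {
+			return err
+		}
+		buf = buf[:0]
+		h = 2166136261
+		return nil
+	}
+
+	for {
+		b, err := br.ReadByte()
+		if err == io.EOF {
+			return flush() // final, possibly short, chunk
+		}
+		if err != nil {
+			return err
+		}
+		buf = append(buf, b)
+		h = (h ^ uint32(b)) * 16777619 // FNV-1a step
+		// Cut when the hash hits the mask (content-defined) or at maxSize.
+		if (len(buf) >= minSize && h&mask == mask) || len(buf) >= maxSize {
+			if err := flush(); err != nil {
+				return err
+			}
+		}
+	}
+}
+```
+
+Each emitted chunk is then hashed with SHA256, checked against the local
+index, and, if new, appended to the current blob packer as described in 5.1.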
+
+---
+
+## 6. CLI Commands
+
+```
+vaultik backup /etc/vaultik.yaml
+vaultik restore
+vaultik prune
+```
+
+* `VAULTIK_PRIVATE_KEY` is required for the `restore`, `prune`, and
+  `retrieve` commands. It is passed via environment variable.
+
+---
+
+## 7. Function and Method Signatures
+
+### 7.1 CLI
+
+```go
+func RootCmd() *cobra.Command
+func backupCmd() *cobra.Command
+func restoreCmd() *cobra.Command
+func pruneCmd() *cobra.Command
+func verifyCmd() *cobra.Command
+```
+
+### 7.2 Configuration
+
+```go
+type Config struct {
+    BackupPubKey      string        // age recipient
+    BackupInterval    time.Duration // used in daemon mode, irrelevant for cron mode
+    BlobSizeLimit     int64         // default 10GB
+    ChunkSize         int64         // default 10MB
+    Exclude           []string      // regexes of absolute paths to exclude from backup
+    Hostname          string
+    IndexPath         string        // path to local SQLite index db, default /var/lib/vaultik/index.db
+    MetadataPrefix    string        // S3 prefix for metadata, default "metadata/"
+    MinTimeBetweenRun time.Duration // minimum time between backup runs, default 1 hour (daemon mode)
+    S3                S3Config      // S3 configuration
+    ScanInterval      time.Duration // interval for a full stat() scan of source dirs, default 24h
+    SourceDirs        []string      // absolute paths of source directories to back up
+}
+
+type S3Config struct {
+    Endpoint        string
+    Bucket          string
+    Prefix          string
+    AccessKeyID     string
+    SecretAccessKey string
+    Region          string
+}
+
+func Load(path string) (*Config, error)
+```
+
+### 7.3 Index
+
+```go
+type Index struct {
+    db *sql.DB
+}
+
+func OpenIndex(path string) (*Index, error)
+
+func (ix *Index) LookupFile(path string, mtime int64, size int64) ([]string, bool, error)
+func (ix *Index) SaveFile(path string, mtime int64, size int64, chunkHashes []string) error
+func (ix *Index) AddChunk(chunkHash string, size int64) error
+func (ix *Index) MarkBlob(blobHash, finalHash string, created time.Time) error
+func (ix *Index) MapChunkToBlob(blobHash, chunkHash string, offset, length int64) error
+func (ix *Index) MapChunkToFile(chunkHash, filePath string, offset, length int64) error
+```
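+
+To make the intended call pattern concrete, here is an illustrative,
+non-normative sketch of how the backup walk from 5.1 might drive this
+interface together with the `BlobWriter` defined in 7.4 below. `chunkFile`
+stands in for a chunker such as the sketch after section 5, the
+already-uploaded check is elided, and imports of `crypto/sha256`,
+`encoding/hex`, and `os` are omitted.
+
+```go
+// backupFile processes a single file during the walk. It records chunk and
+// file-chunk rows in the index and feeds chunk data to the blob packer.
+func backupFile(ix *Index, packer *BlobWriter, path string, info os.FileInfo) error {
+	mtime := info.ModTime().UTC().Unix() // timestamps are always UTC
+	size := info.Size()
+
+	// Skip files whose mtime and size are unchanged since the last run.
+	// The bool is interpreted here as "found and unchanged".
+	if _, unchanged, err := ix.LookupFile(path, mtime, size); err != nil {
+		return err
+	} else if unchanged {
+		return nil
+	}
+
+	f, err := os.Open(path)
+	if err != nil {
+		return err
+	}
+	defer f.Close()
+
+	var chunkHashes []string
+	var offset int64
+	err = chunkFile(f, func(chunk []byte) error {
+		sum := sha256.Sum256(chunk)
+		hash := hex.EncodeToString(sum[:])
+		chunkHashes = append(chunkHashes, hash)
+
+		if err := ix.AddChunk(hash, int64(len(chunk))); err != nil {
+			return err
+		}
+		if err := ix.MapChunkToFile(hash, path, offset, int64(len(chunk))); err != nil {
+			return err
+		}
+		offset += int64(len(chunk))
+
+		// A real implementation would skip chunks that are already
+		// uploaded; here every chunk goes to the packer.
+		return packer.AddChunk(chunk, hash)
+	})
+	if err != nil {
+		return err
+	}
+	return ix.SaveFile(path, mtime, size, chunkHashes)
+}
+```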
+
+### 7.4 Blob Packing
+
+```go
+type BlobWriter struct {
+    // internal buffer, current size, encrypted writer, etc.
+}
+
+func NewBlobWriter(...) *BlobWriter
+func (bw *BlobWriter) AddChunk(chunk []byte, chunkHash string) error
+func (bw *BlobWriter) Flush() (finalBlobHash string, err error)
+```
+
+### 7.5 Metadata
+
+```go
+func BuildSnapshotMetadata(ix *Index, snapshotID string) (sqlitePath string, err error)
+func EncryptAndUploadMetadata(path string, cfg *Config, snapshotID string) error
+```
+
+### 7.6 Prune
+
+```go
+func RunPrune(bucket, prefix, privateKey string) error
+```
+
+---
+
+## Implementation TODO
+
+To be completed by claude
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..a8ca991
--- /dev/null
+++ b/README.md
@@ -0,0 +1,167 @@
+# vaultik
+
+`vaultik` is an incremental backup daemon written in Go. It encrypts data
+using an `age` public key and uploads each encrypted blob directly to a
+remote S3-compatible object store. It requires no private keys or other
+decryption secrets stored on the backed-up system.
+
+---
+
+## what
+
+`vaultik` walks a set of configured directories and builds a
+content-addressable chunk map of changed files using deterministic chunking.
+Each chunk is streamed into a blob packer. Blobs are compressed with `zstd`,
+encrypted with `age`, and uploaded directly to remote storage under a
+content-addressed S3 path.
+
+No plaintext file contents ever hit disk. No private key is needed or stored
+locally. All encrypted data is streaming-processed and immediately discarded
+once uploaded. Metadata is encrypted and pushed with the same mechanism.
+
+## why
+
+Existing backup software fails under one or more of these conditions:
+
+* Requires secrets (passwords, private keys) on the source system
+* Depends on symmetric encryption unsuitable for zero-trust environments
+* Stages temporary archives or repositories
+* Writes plaintext metadata or plaintext file paths
+
+`vaultik` addresses all of these by using:
+
+* Public-key-only encryption (via `age`), requiring no secrets on the source
+  system other than the bucket access key
+* Blob-level deduplication and batching
+* Local state cache for incremental detection
+* S3-native chunked upload interface
+* Self-contained encrypted snapshot metadata
+
+## how
+
+1. **install**
+
+   ```sh
+   go install git.eeqj.de/sneak/vaultik@latest
+   ```
+
+2. **generate keypair**
+
+   ```sh
+   age-keygen -o agekey.txt
+   grep 'public key:' agekey.txt
+   ```
+
+3. **write config**
+
+   ```yaml
+   source_dirs:
+     - /etc
+     - /home/user/data
+   exclude:
+     - '*.log'
+     - '*.tmp'
+   age_recipient: age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+   s3:
+     endpoint: https://s3.example.com
+     bucket: vaultik-data
+     prefix: host1/
+     access_key_id: ...
+     secret_access_key: ...
+     region: us-east-1
+   backup_interval: 1h        # only used in daemon mode, not for --cron mode
+   full_scan_interval: 24h    # normally we use inotify to mark dirty, but
+                              # every 24h we do a full stat() scan
+   min_time_between_run: 15m  # again, only for daemon mode
+   index_path: /var/lib/vaultik/index.sqlite
+   chunk_size: 10MB
+   blob_size_limit: 10GB
+   index_prefix: index/
+   ```
+
+4. 
**run** + + ```sh + vaultik backup /etc/vaultik.yaml + ``` + + ```sh + vaultik backup /etc/vaultik.yaml --cron # silent unless error + ``` + + ```sh + vaultik backup /etc/vaultik.yaml --daemon # runs in background, uses inotify + ``` + +--- + +## cli + +```sh +vaultik backup /etc/vaultik.yaml +vaultik restore +vaultik prune +vaultik fetch +``` + +* `VAULTIK_PRIVATE_KEY` must be available in environment for `restore` and `prune` + +--- + +## does not + +* Store any secrets on the backed-up machine +* Require mutable remote metadata +* Use tarballs, restic, rsync, or ssh +* Require a symmetric passphrase or password +* Trust the source system with anything + +--- + +## does + +* Incremental deduplicated backup +* Blob-packed chunk encryption +* Content-addressed immutable blobs +* Public-key encryption only +* SQLite-based local and snapshot metadata +* Fully stream-processed storage + +--- + +## restore + +`vaultik restore` downloads only the snapshot metadata and required blobs. It +never contacts the source system. All restore operations depend only on: + +* `VAULTIK_PRIVATE_KEY` +* The bucket + +The entire system is restore-only from object storage. + +--- + +## prune + +Run `vaultik prune` on a machine with the private key. It: + +* Downloads the most recent snapshot +* Decrypts metadata +* Lists referenced blobs +* Deletes any blob in the bucket not referenced + +This enables garbage collection from immutable storage. + +--- + +## license + +WTFPL — see LICENSE. + +--- + +## author + +sneak +[sneak@sneak.berlin](mailto:sneak@sneak.berlin) +[https://sneak.berlin](https://sneak.berlin) diff --git a/go.mod b/go.mod new file mode 100644 index 0000000..f090138 --- /dev/null +++ b/go.mod @@ -0,0 +1,3 @@ +module git.eeqj.de/sneak/vaultik + +go 1.24.4