Add custom types, version command, and restore --verify flag

- Add internal/types package with type-safe wrappers for IDs, hashes,
  paths, and credentials (FileID, BlobID, ChunkHash, etc.)
- Implement driver.Valuer and sql.Scanner for UUID-based types
- Add `vaultik version` command showing version, commit, go version
- Add `--verify` flag to restore command that checksums all restored
  files against expected chunk hashes with progress bar
- Remove fetch.go (dead code, functionality in restore)
- Clean up TODO.md, remove completed items
- Update all database and snapshot code to use new custom types
2026-01-14 17:11:52 -08:00
parent 2afd54d693
commit 417b25a5f5
53 changed files with 2330 additions and 1581 deletions

README.md

@@ -2,7 +2,7 @@
WIP: pre-1.0, some functions may not be fully implemented yet
`vaultik` is an incremental backup daemon written in Go. It encrypts data
using an `age` public key and uploads each encrypted blob directly to a
remote S3-compatible object store. It requires no private keys, secrets, or
credentials (other than those required to PUT to encrypted object storage,
@@ -22,19 +22,6 @@ It includes table-stakes features such as:
* does not create huge numbers of small files (to keep S3 operation counts
down) even if the source system has many small files
## why
Existing backup software fails under one or more of these conditions:
@@ -45,15 +32,46 @@ Existing backup software fails under one or more of these conditions:
* Creates one-blob-per-file, which results in excessive S3 operation counts
* is slow
Other backup tools like `restic`, `borg`, and `duplicity` are designed for
environments where the source host can store secrets and has access to
decryption keys. I don't want to store backup decryption keys on my hosts,
only public keys for encryption.
My requirements are:
* open source
* no passphrases or private keys on the source host
* incremental
* compressed
* encrypted
* s3 compatible without an intermediate step or tool
Surprisingly, no existing tool meets these requirements, so I wrote `vaultik`.
## design goals
1. Backups must require only a public key on the source host.
1. No secrets or private keys may exist on the source system.
1. Restore must be possible using **only** the backup bucket and a private key.
1. Prune must be possible (requires private key, done on different hosts).
1. All encryption uses [`age`](https://age-encryption.org/) (X25519, XChaCha20-Poly1305).
1. Compression uses `zstd` at a configurable level.
1. Files are chunked, and multiple chunks are packed into encrypted blobs
to reduce object count for filesystems with many small files.
1. All metadata (snapshots) is stored remotely as encrypted SQLite DBs.
## what
`vaultik` walks a set of configured directories and builds a
content-addressable chunk map of changed files using deterministic chunking.
Each chunk is streamed into a blob packer. Blobs are compressed with `zstd`,
encrypted with `age`, and uploaded directly to remote storage under a
content-addressed S3 path. At the end, a pruned, snapshot-specific SQLite
database of metadata is created, encrypted, and uploaded alongside the
blobs.
No plaintext file contents ever hit disk. No private key or secret
passphrase is needed or stored locally.
## how
@@ -63,59 +81,63 @@ Existing backup software fails under one or more of these conditions:
go install git.eeqj.de/sneak/vaultik@latest
```
1. **generate keypair**
```sh
age-keygen -o agekey.txt
grep 'public key:' agekey.txt
```
1. **write config**
```yaml
# Named snapshots - each snapshot can contain multiple paths
snapshots:
  system:
    paths:
      - /etc
      - /var/lib
    exclude:
      - '*.cache' # Snapshot-specific exclusions
  home:
    paths:
      - /home/user/documents
      - /home/user/photos

# Global exclusions (apply to all snapshots)
exclude:
  - '*.log'
  - '*.tmp'
  - '.git'
  - 'node_modules'

age_recipients:
  - age1278m9q7dp3chsh2dcy82qk27v047zywyvtxwnj4cvt0z65jw6a7q5dqhfj

s3:
  # endpoint is optional if using AWS S3, but who even does that?
  endpoint: https://s3.example.com
  bucket: vaultik-data
  prefix: host1/
  access_key_id: ...
  secret_access_key: ...
  region: us-east-1

backup_interval: 1h
full_scan_interval: 24h
min_time_between_run: 15m
chunk_size: 10MB
blob_size_limit: 1GB
```
1. **run**
```sh
# Create all configured snapshots
vaultik --config /etc/vaultik.yaml snapshot create

# Create specific snapshots by name
vaultik --config /etc/vaultik.yaml snapshot create home system

# Silent mode for cron
vaultik --config /etc/vaultik.yaml snapshot create --cron
```
---
@@ -125,76 +147,211 @@ Existing backup software fails under one or more of these conditions:
### commands
```sh
vaultik [--config <path>] snapshot create [snapshot-names...] [--cron] [--daemon] [--prune]
vaultik [--config <path>] snapshot list [--json]
vaultik [--config <path>] snapshot verify <snapshot-id> [--deep]
vaultik [--config <path>] snapshot purge [--keep-latest | --older-than <duration>] [--force]
vaultik [--config <path>] snapshot remove <snapshot-id> [--dry-run] [--force]
vaultik [--config <path>] snapshot prune
vaultik [--config <path>] restore <snapshot-id> <target-dir> [paths...]
vaultik [--config <path>] prune [--dry-run] [--force]
vaultik [--config <path>] info
vaultik [--config <path>] store info
```
### environment
* `VAULTIK_AGE_SECRET_KEY`: Required for `restore` and deep `verify`. Contains the age private key for decryption.
* `VAULTIK_CONFIG`: Optional path to config file.
### command details
**snapshot create**: Perform incremental backup of configured snapshots
* Config is located at `/etc/vaultik/config.yml` by default
* Optional snapshot names argument to create specific snapshots (default: all)
* `--cron`: Silent unless error (for crontab)
* `--daemon`: Run continuously with inotify monitoring and periodic scans
* `--prune`: Delete old snapshots and orphaned blobs after backup
**snapshot list**: List all snapshots with their timestamps and sizes
* `--json`: Output in JSON format
**snapshot purge**: Remove old snapshots based on criteria
* `--keep-latest`: Keep only the most recent snapshot
* `--older-than`: Remove snapshots older than duration (e.g., 30d, 6mo, 1y)
* `--force`: Skip confirmation prompt
**snapshot verify**: Verify snapshot integrity
* `--deep`: Download and verify blob hashes (not just existence)
**snapshot remove**: Remove a specific snapshot
* `--dry-run`: Show what would be deleted without deleting
* `--force`: Skip confirmation prompt
**snapshot prune**: Clean orphaned data from local database
**restore**: Restore snapshot to target directory
* Requires `VAULTIK_AGE_SECRET_KEY` environment variable with age private key
* Optional path arguments to restore specific files/directories (default: all)
* Downloads and decrypts metadata, fetches required blobs, reconstructs files
* Preserves file permissions, timestamps, and ownership (ownership requires root)
* Handles symlinks and directories
**prune**: Remove unreferenced blobs from remote storage
* Scans all snapshots for referenced blobs
* Deletes orphaned blobs
**info**: Display system and configuration information
**store info**: Display S3 bucket configuration and storage statistics
---
## architecture
### s3 bucket layout
```
s3://<bucket>/<prefix>/
├── blobs/
│   └── <aa>/<bb>/<full_blob_hash>
└── metadata/
    └── <snapshot_id>/
        ├── db.zst.age
        └── manifest.json.zst
```
* `blobs/<aa>/<bb>/...`: Two-level directory sharding using first 4 hex chars of blob hash
* `metadata/<snapshot_id>/db.zst.age`: Encrypted, compressed SQLite database
* `metadata/<snapshot_id>/manifest.json.zst`: Unencrypted blob list for pruning
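The sharding rule above is simple to sketch in Go. This is illustrative only: `blobPath` is a hypothetical helper, and SHA-256 stands in for whatever hash vaultik actually uses for blob addressing.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobPath maps a hex-encoded blob hash to its sharded S3 key: the
// first two hex pairs become directory levels, keeping any single
// "directory" from accumulating millions of objects.
func blobPath(prefix, blobHash string) string {
	return fmt.Sprintf("%sblobs/%s/%s/%s", prefix, blobHash[:2], blobHash[2:4], blobHash)
}

func main() {
	// Content-addressing: the key is derived from the blob bytes themselves.
	blob := []byte("example encrypted blob contents")
	sum := sha256.Sum256(blob)
	fmt.Println(blobPath("host1/", hex.EncodeToString(sum[:])))
}
```

Because the key is a pure function of the blob's content, re-uploading identical data is idempotent.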
### blob manifest format
The `manifest.json.zst` file is unencrypted (compressed JSON) to enable pruning without decryption:
```json
{
"snapshot_id": "hostname_snapshotname_2025-01-01T12:00:00Z",
"blob_hashes": [
"aa1234567890abcdef...",
"bb2345678901bcdef0..."
]
}
```
Snapshot IDs follow the format `<hostname>_<snapshot-name>_<timestamp>` (e.g., `server1_home_2025-01-01T12:00:00Z`).
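A manifest reader can mirror this JSON with a small struct. Sketch only: `parseManifest` is a hypothetical helper, and the zstd decompression step is elided (it would run before unmarshalling, using a zstd decoder).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlobManifest mirrors the manifest.json.zst payload after zstd
// decompression. Only hashes and the snapshot ID are present, so
// pruning needs no decryption key.
type BlobManifest struct {
	SnapshotID string   `json:"snapshot_id"`
	BlobHashes []string `json:"blob_hashes"`
}

func parseManifest(raw []byte) (BlobManifest, error) {
	var m BlobManifest
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	raw := []byte(`{"snapshot_id":"server1_home_2025-01-01T12:00:00Z","blob_hashes":["aa12","bb23"]}`)
	m, err := parseManifest(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.SnapshotID, len(m.BlobHashes)) // server1_home_2025-01-01T12:00:00Z 2
}
```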
### local sqlite schema
```sql
CREATE TABLE files (
  id TEXT PRIMARY KEY,
  path TEXT NOT NULL UNIQUE,
  mtime INTEGER NOT NULL,
  size INTEGER NOT NULL,
  mode INTEGER NOT NULL,
  uid INTEGER NOT NULL,
  gid INTEGER NOT NULL
);

CREATE TABLE file_chunks (
  file_id TEXT NOT NULL,
  idx INTEGER NOT NULL,
  chunk_hash TEXT NOT NULL,
  PRIMARY KEY (file_id, idx),
  FOREIGN KEY (file_id) REFERENCES files(id) ON DELETE CASCADE
);

CREATE TABLE chunks (
  chunk_hash TEXT PRIMARY KEY,
  size INTEGER NOT NULL
);

CREATE TABLE blobs (
  id TEXT PRIMARY KEY,
  blob_hash TEXT NOT NULL UNIQUE,
  uncompressed INTEGER NOT NULL,
  compressed INTEGER NOT NULL,
  uploaded_at INTEGER
);

CREATE TABLE blob_chunks (
  blob_hash TEXT NOT NULL,
  chunk_hash TEXT NOT NULL,
  offset INTEGER NOT NULL,
  length INTEGER NOT NULL,
  PRIMARY KEY (blob_hash, chunk_hash)
);

CREATE TABLE chunk_files (
  chunk_hash TEXT NOT NULL,
  file_id TEXT NOT NULL,
  file_offset INTEGER NOT NULL,
  length INTEGER NOT NULL,
  PRIMARY KEY (chunk_hash, file_id)
);

CREATE TABLE snapshots (
  id TEXT PRIMARY KEY,
  hostname TEXT NOT NULL,
  vaultik_version TEXT NOT NULL,
  started_at INTEGER NOT NULL,
  completed_at INTEGER,
  file_count INTEGER NOT NULL,
  chunk_count INTEGER NOT NULL,
  blob_count INTEGER NOT NULL,
  total_size INTEGER NOT NULL,
  blob_size INTEGER NOT NULL,
  compression_ratio REAL NOT NULL
);

CREATE TABLE snapshot_files (
  snapshot_id TEXT NOT NULL,
  file_id TEXT NOT NULL,
  PRIMARY KEY (snapshot_id, file_id)
);

CREATE TABLE snapshot_blobs (
  snapshot_id TEXT NOT NULL,
  blob_id TEXT NOT NULL,
  blob_hash TEXT NOT NULL,
  PRIMARY KEY (snapshot_id, blob_id)
);
```
### data flow
#### backup
1. Load config, open local SQLite index
1. Walk source directories, check mtime/size against index
1. For changed/new files: chunk using content-defined chunking
1. For each chunk: hash, check if already uploaded, add to blob packer
1. When blob reaches threshold: compress, encrypt, upload to S3
1. Build snapshot metadata, compress, encrypt, upload
1. Create blob manifest (unencrypted) for pruning support
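The change detection in step 2 compares a file's current mtime/size against the index. A minimal sketch, with hypothetical `IndexEntry`/`changed` names and a map standing in for the SQLite index:

```go
package main

import "fmt"

// IndexEntry is what the local index remembers about a file.
type IndexEntry struct {
	Mtime int64
	Size  int64
}

// changed reports whether a file needs re-chunking: new files and files
// whose mtime or size differ from the index are dirty; everything else
// is skipped, which is what makes the backup incremental.
func changed(index map[string]IndexEntry, path string, mtime, size int64) bool {
	prev, ok := index[path]
	if !ok {
		return true // never seen before
	}
	return prev.Mtime != mtime || prev.Size != size
}

func main() {
	index := map[string]IndexEntry{"/etc/hosts": {Mtime: 100, Size: 42}}
	fmt.Println(changed(index, "/etc/hosts", 100, 42))  // unchanged
	fmt.Println(changed(index, "/etc/hosts", 101, 42))  // touched
	fmt.Println(changed(index, "/etc/passwd", 100, 10)) // new file
}
```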
#### restore
1. Download `metadata/<snapshot_id>/db.zst.age`
1. Decrypt and decompress SQLite database
1. Query files table (optionally filtered by paths)
1. For each file, get ordered chunk list from file_chunks
1. Download required blobs, decrypt, decompress
1. Extract chunks and reconstruct files
1. Restore permissions, mtime, uid/gid
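Steps 4-6 reduce to concatenating a file's chunks in index order. A sketch assuming the needed chunks have already been fetched and decrypted into an in-memory cache (`reconstruct` is a hypothetical helper, not vaultik's actual code):

```go
package main

import (
	"bytes"
	"fmt"
)

// reconstruct rebuilds a file's contents from its ordered chunk-hash
// list, looking each hash up in a cache of already-decrypted chunk data.
func reconstruct(chunkOrder []string, cache map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	for _, h := range chunkOrder {
		data, ok := cache[h]
		if !ok {
			return nil, fmt.Errorf("missing chunk %s", h)
		}
		buf.Write(data)
	}
	return buf.Bytes(), nil
}

func main() {
	cache := map[string][]byte{"c1": []byte("hello "), "c2": []byte("world")}
	out, err := reconstruct([]string{"c1", "c2"}, cache)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // hello world
}
```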
#### prune
1. List all snapshot manifests
1. Build set of all referenced blob hashes
1. List all blobs in storage
1. Delete any blob not in referenced set
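The four steps above amount to a set difference. A minimal sketch (`pruneTargets` is hypothetical; in-memory slices stand in for S3 listings and manifests):

```go
package main

import "fmt"

// pruneTargets returns the blobs present in storage but referenced by
// no snapshot manifest — exactly the set a prune run would delete.
func pruneTargets(stored []string, manifests [][]string) []string {
	referenced := make(map[string]bool)
	for _, m := range manifests {
		for _, h := range m {
			referenced[h] = true
		}
	}
	var orphans []string
	for _, h := range stored {
		if !referenced[h] {
			orphans = append(orphans, h)
		}
	}
	return orphans
}

func main() {
	stored := []string{"aa11", "bb22", "cc33"}
	manifests := [][]string{{"aa11"}, {"bb22", "aa11"}}
	fmt.Println(pruneTargets(stored, manifests)) // [cc33]
}
```

Because manifests are unencrypted, this whole computation needs only bucket access, never a private key.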
### chunking
* Content-defined chunking using FastCDC algorithm
* Average chunk size: configurable (default 10MB)
* Deduplication at chunk level
* Multiple chunks packed into blobs for efficiency
@@ -205,19 +362,13 @@ vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
* Each blob encrypted independently
* Metadata databases also encrypted
### compression
* zstd compression at configurable level
* Applied before encryption
* Blob-level compression for efficiency
### state tracking
* Local SQLite database for incremental state
* Tracks file mtimes and chunk mappings
* Enables efficient change detection
* Supports inotify monitoring in daemon mode
---
## does not
@@ -227,8 +378,6 @@ vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
* Require a symmetric passphrase or password
* Trust the source system with anything
---
## does
* Incremental deduplicated backup
@@ -240,70 +389,16 @@ vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
---
## restore
`vaultik restore` downloads only the snapshot metadata and required blobs. It
never contacts the source system. All restore operations depend only on:
* `VAULTIK_AGE_SECRET_KEY`
* The bucket
The entire system is restore-only from object storage.
---
## requirements
* Go 1.24 or later
* S3-compatible object storage
* Sufficient disk space for local index (typically <1GB)
## license
[MIT](https://opensource.org/license/mit/)
## author
Made with love and lots of expensive SOTA AI by [sneak](https://sneak.berlin) in Berlin in the summer of 2025.