Add custom types, version command, and restore --verify flag

- Add internal/types package with type-safe wrappers for IDs, hashes,
  paths, and credentials (FileID, BlobID, ChunkHash, etc.)
- Implement driver.Valuer and sql.Scanner for UUID-based types
- Add `vaultik version` command showing version, commit, go version
- Add `--verify` flag to restore command that checksums all restored
  files against expected chunk hashes with progress bar
- Remove fetch.go (dead code, functionality in restore)
- Clean up TODO.md, remove completed items
- Update all database and snapshot code to use new custom types
2026-01-14 17:11:52 -08:00
parent 2afd54d693
commit 417b25a5f5
53 changed files with 2330 additions and 1581 deletions

README.md

@@ -2,7 +2,7 @@
WIP: pre-1.0, some functions may not be fully implemented yet
`vaultik` is an incremental backup daemon written in Go. It encrypts data
using an `age` public key and uploads each encrypted blob directly to a
remote S3-compatible object store. It requires no private keys, secrets, or
credentials (other than those required to PUT to encrypted object storage,
@@ -22,19 +22,6 @@ It includes table-stakes features such as:
* does not create huge numbers of small files (to keep S3 operation counts
down) even if the source system has many small files
## why
Existing backup software fails under one or more of these conditions:
@@ -45,15 +32,46 @@ Existing backup software fails under one or more of these conditions:
* Creates one-blob-per-file, which results in excessive S3 operation counts
* is slow
Other backup tools like `restic`, `borg`, and `duplicity` are designed for
environments where the source host can store secrets and has access to
decryption keys. I don't want to store backup decryption keys on my hosts,
only public keys for encryption.
My requirements are:
* open source
* no passphrases or private keys on the source host
* incremental
* compressed
* encrypted
* s3 compatible without an intermediate step or tool
Surprisingly, no existing tool meets these requirements, so I wrote `vaultik`.
## design goals
1. Backups must require only a public key on the source host.
1. No secrets or private keys may exist on the source system.
1. Restore must be possible using **only** the backup bucket and a private key.
1. Prune must be possible (requires private key, done on different hosts).
1. All encryption uses [`age`](https://age-encryption.org/) (X25519, XChaCha20-Poly1305).
1. Compression uses `zstd` at a configurable level.
1. Files are chunked, and multiple chunks are packed into encrypted blobs
to reduce object count for filesystems with many small files.
1. All metadata (snapshots) is stored remotely as encrypted SQLite DBs.
## what
`vaultik` walks a set of configured directories and builds a
content-addressable chunk map of changed files using deterministic chunking.
Each chunk is streamed into a blob packer. Blobs are compressed with `zstd`,
encrypted with `age`, and uploaded directly to remote storage under a
content-addressed S3 path. At the end, a pruned, snapshot-specific SQLite
database of metadata is created, encrypted, and uploaded alongside the
blobs.
No plaintext file contents ever hit disk. No private key or secret
passphrase is needed or stored locally.
## how
@@ -63,59 +81,63 @@ Existing backup software fails under one or more of these conditions:
go install git.eeqj.de/sneak/vaultik@latest
```
1. **generate keypair**
```sh
age-keygen -o agekey.txt
grep 'public key:' agekey.txt
```
1. **write config**
```yaml
# Named snapshots - each snapshot can contain multiple paths
snapshots:
  system:
    paths:
      - /etc
      - /var/lib
    exclude:
      - '*.cache' # Snapshot-specific exclusions
  home:
    paths:
      - /home/user/documents
      - /home/user/photos

# Global exclusions (apply to all snapshots)
exclude:
  - '*.log'
  - '*.tmp'
  - '.git'
  - 'node_modules'

age_recipients:
  - age1278m9q7dp3chsh2dcy82qk27v047zywyvtxwnj4cvt0z65jw6a7q5dqhfj

s3:
  # endpoint is optional if using AWS S3, but who even does that?
  endpoint: https://s3.example.com
  bucket: vaultik-data
  prefix: host1/
  access_key_id: ...
  secret_access_key: ...
  region: us-east-1

backup_interval: 1h
full_scan_interval: 24h
min_time_between_run: 15m
chunk_size: 10MB
blob_size_limit: 1GB
```
1. **run**
```sh
# Create all configured snapshots
vaultik --config /etc/vaultik.yaml snapshot create

# Create specific snapshots by name
vaultik --config /etc/vaultik.yaml snapshot create home system

# Silent mode for cron
vaultik --config /etc/vaultik.yaml snapshot create --cron
```
---
@@ -125,76 +147,211 @@ Existing backup software fails under one or more of these conditions:
### commands
```sh
vaultik [--config <path>] snapshot create [snapshot-names...] [--cron] [--daemon] [--prune]
vaultik [--config <path>] snapshot list [--json]
vaultik [--config <path>] snapshot verify <snapshot-id> [--deep]
vaultik [--config <path>] snapshot purge [--keep-latest | --older-than <duration>] [--force]
vaultik [--config <path>] snapshot remove <snapshot-id> [--dry-run] [--force]
vaultik [--config <path>] snapshot prune
vaultik [--config <path>] restore <snapshot-id> <target-dir> [paths...]
vaultik [--config <path>] prune [--dry-run] [--force]
vaultik [--config <path>] info
vaultik [--config <path>] store info
```
### environment
* `VAULTIK_AGE_SECRET_KEY`: Required for `restore` and deep `verify`. Contains the age private key for decryption.
* `VAULTIK_CONFIG`: Optional path to config file.
### command details
**snapshot create**: Perform incremental backup of configured snapshots
* Config is located at `/etc/vaultik/config.yml` by default
* Optional snapshot names argument to create specific snapshots (default: all)
* `--cron`: Silent unless error (for crontab)
* `--daemon`: Run continuously with inotify monitoring and periodic scans
* `--prune`: Delete old snapshots and orphaned blobs after backup
**snapshot list**: List all snapshots with their timestamps and sizes
* `--json`: Output in JSON format
**snapshot purge**: Remove old snapshots based on criteria
* `--keep-latest`: Keep only the most recent snapshot
* `--older-than`: Remove snapshots older than duration (e.g., 30d, 6mo, 1y)
* `--force`: Skip confirmation prompt
**snapshot verify**: Verify snapshot integrity
* `--deep`: Download and verify blob hashes (not just existence)
**snapshot remove**: Remove a specific snapshot
* `--dry-run`: Show what would be deleted without deleting
* `--force`: Skip confirmation prompt
**snapshot prune**: Clean orphaned data from local database
**restore**: Restore snapshot to target directory
* Requires `VAULTIK_AGE_SECRET_KEY` environment variable with age private key
* Optional path arguments to restore specific files/directories (default: all)
* Downloads and decrypts metadata, fetches required blobs, reconstructs files
* Preserves file permissions, timestamps, and ownership (ownership requires root)
* Handles symlinks and directories
**prune**: Remove unreferenced blobs from remote storage
* Scans all snapshots for referenced blobs
* Deletes orphaned blobs
**info**: Display system and configuration information
**store info**: Display S3 bucket configuration and storage statistics
---
## architecture
### s3 bucket layout
```
s3://<bucket>/<prefix>/
├── blobs/
│   └── <aa>/<bb>/<full_blob_hash>
└── metadata/
    └── <snapshot_id>/
        ├── db.zst.age
        └── manifest.json.zst
```
* `blobs/<aa>/<bb>/...`: Two-level directory sharding using first 4 hex chars of blob hash
* `metadata/<snapshot_id>/db.zst.age`: Encrypted, compressed SQLite database
* `metadata/<snapshot_id>/manifest.json.zst`: Unencrypted blob list for pruning
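The sharding rule above is simple to sketch in Go. This is illustrative only: `blobPath` is a hypothetical helper, and SHA-256 stands in for whatever hash vaultik actually uses for blob addressing.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobPath maps a hex-encoded blob hash to its sharded S3 key: the
// first two hex pairs become directory levels, keeping any single
// "directory" from accumulating millions of objects.
func blobPath(prefix, blobHash string) string {
	return fmt.Sprintf("%sblobs/%s/%s/%s", prefix, blobHash[:2], blobHash[2:4], blobHash)
}

func main() {
	// Content-addressing: the key is derived from the blob bytes themselves.
	blob := []byte("example encrypted blob contents")
	sum := sha256.Sum256(blob)
	fmt.Println(blobPath("host1/", hex.EncodeToString(sum[:])))
}
```

Because the key is a pure function of the blob's content, re-uploading identical data is idempotent.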
### blob manifest format
The `manifest.json.zst` file is unencrypted (compressed JSON) to enable pruning without decryption:
```json
{
"snapshot_id": "hostname_snapshotname_2025-01-01T12:00:00Z",
"blob_hashes": [
"aa1234567890abcdef...",
"bb2345678901bcdef0..."
]
}
```
Snapshot IDs follow the format `<hostname>_<snapshot-name>_<timestamp>` (e.g., `server1_home_2025-01-01T12:00:00Z`).
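A manifest reader can mirror this JSON with a small struct. Sketch only: `parseManifest` is a hypothetical helper, and the zstd decompression step is elided (it would run before unmarshalling, using a zstd decoder).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlobManifest mirrors the manifest.json.zst payload after zstd
// decompression. Only hashes and the snapshot ID are present, so
// pruning needs no decryption key.
type BlobManifest struct {
	SnapshotID string   `json:"snapshot_id"`
	BlobHashes []string `json:"blob_hashes"`
}

func parseManifest(raw []byte) (BlobManifest, error) {
	var m BlobManifest
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	raw := []byte(`{"snapshot_id":"server1_home_2025-01-01T12:00:00Z","blob_hashes":["aa12","bb23"]}`)
	m, err := parseManifest(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.SnapshotID, len(m.BlobHashes)) // server1_home_2025-01-01T12:00:00Z 2
}
```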
### local sqlite schema
```sql
CREATE TABLE files (
  id TEXT PRIMARY KEY,
  path TEXT NOT NULL UNIQUE,
  mtime INTEGER NOT NULL,
  size INTEGER NOT NULL,
  mode INTEGER NOT NULL,
  uid INTEGER NOT NULL,
  gid INTEGER NOT NULL
);

CREATE TABLE file_chunks (
  file_id TEXT NOT NULL,
  idx INTEGER NOT NULL,
  chunk_hash TEXT NOT NULL,
  PRIMARY KEY (file_id, idx),
  FOREIGN KEY (file_id) REFERENCES files(id) ON DELETE CASCADE
);

CREATE TABLE chunks (
  chunk_hash TEXT PRIMARY KEY,
  size INTEGER NOT NULL
);

CREATE TABLE blobs (
  id TEXT PRIMARY KEY,
  blob_hash TEXT NOT NULL UNIQUE,
  uncompressed INTEGER NOT NULL,
  compressed INTEGER NOT NULL,
  uploaded_at INTEGER
);

CREATE TABLE blob_chunks (
  blob_hash TEXT NOT NULL,
  chunk_hash TEXT NOT NULL,
  offset INTEGER NOT NULL,
  length INTEGER NOT NULL,
  PRIMARY KEY (blob_hash, chunk_hash)
);

CREATE TABLE chunk_files (
  chunk_hash TEXT NOT NULL,
  file_id TEXT NOT NULL,
  file_offset INTEGER NOT NULL,
  length INTEGER NOT NULL,
  PRIMARY KEY (chunk_hash, file_id)
);

CREATE TABLE snapshots (
  id TEXT PRIMARY KEY,
  hostname TEXT NOT NULL,
  vaultik_version TEXT NOT NULL,
  started_at INTEGER NOT NULL,
  completed_at INTEGER,
  file_count INTEGER NOT NULL,
  chunk_count INTEGER NOT NULL,
  blob_count INTEGER NOT NULL,
  total_size INTEGER NOT NULL,
  blob_size INTEGER NOT NULL,
  compression_ratio REAL NOT NULL
);

CREATE TABLE snapshot_files (
  snapshot_id TEXT NOT NULL,
  file_id TEXT NOT NULL,
  PRIMARY KEY (snapshot_id, file_id)
);

CREATE TABLE snapshot_blobs (
  snapshot_id TEXT NOT NULL,
  blob_id TEXT NOT NULL,
  blob_hash TEXT NOT NULL,
  PRIMARY KEY (snapshot_id, blob_id)
);
```
### data flow
#### backup
1. Load config, open local SQLite index
1. Walk source directories, check mtime/size against index
1. For changed/new files: chunk using content-defined chunking
1. For each chunk: hash, check if already uploaded, add to blob packer
1. When blob reaches threshold: compress, encrypt, upload to S3
1. Build snapshot metadata, compress, encrypt, upload
1. Create blob manifest (unencrypted) for pruning support
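The change detection in step 2 compares a file's current mtime/size against the index. A minimal sketch, with hypothetical `IndexEntry`/`changed` names and a map standing in for the SQLite index:

```go
package main

import "fmt"

// IndexEntry is what the local index remembers about a file.
type IndexEntry struct {
	Mtime int64
	Size  int64
}

// changed reports whether a file needs re-chunking: new files and files
// whose mtime or size differ from the index are dirty; everything else
// is skipped, which is what makes the backup incremental.
func changed(index map[string]IndexEntry, path string, mtime, size int64) bool {
	prev, ok := index[path]
	if !ok {
		return true // never seen before
	}
	return prev.Mtime != mtime || prev.Size != size
}

func main() {
	index := map[string]IndexEntry{"/etc/hosts": {Mtime: 100, Size: 42}}
	fmt.Println(changed(index, "/etc/hosts", 100, 42))  // unchanged
	fmt.Println(changed(index, "/etc/hosts", 101, 42))  // touched
	fmt.Println(changed(index, "/etc/passwd", 100, 10)) // new file
}
```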
#### restore
1. Download `metadata/<snapshot_id>/db.zst.age`
1. Decrypt and decompress SQLite database
1. Query files table (optionally filtered by paths)
1. For each file, get ordered chunk list from file_chunks
1. Download required blobs, decrypt, decompress
1. Extract chunks and reconstruct files
1. Restore permissions, mtime, uid/gid
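Steps 4-6 reduce to concatenating a file's chunks in index order. A sketch assuming the needed chunks have already been fetched and decrypted into an in-memory cache (`reconstruct` is a hypothetical helper, not vaultik's actual code):

```go
package main

import (
	"bytes"
	"fmt"
)

// reconstruct rebuilds a file's contents from its ordered chunk-hash
// list, looking each hash up in a cache of already-decrypted chunk data.
func reconstruct(chunkOrder []string, cache map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	for _, h := range chunkOrder {
		data, ok := cache[h]
		if !ok {
			return nil, fmt.Errorf("missing chunk %s", h)
		}
		buf.Write(data)
	}
	return buf.Bytes(), nil
}

func main() {
	cache := map[string][]byte{"c1": []byte("hello "), "c2": []byte("world")}
	out, err := reconstruct([]string{"c1", "c2"}, cache)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // hello world
}
```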
#### prune
1. List all snapshot manifests
1. Build set of all referenced blob hashes
1. List all blobs in storage
1. Delete any blob not in referenced set
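The four steps above amount to a set difference. A minimal sketch (`pruneTargets` is hypothetical; in-memory slices stand in for S3 listings and manifests):

```go
package main

import "fmt"

// pruneTargets returns the blobs present in storage but referenced by
// no snapshot manifest — exactly the set a prune run would delete.
func pruneTargets(stored []string, manifests [][]string) []string {
	referenced := make(map[string]bool)
	for _, m := range manifests {
		for _, h := range m {
			referenced[h] = true
		}
	}
	var orphans []string
	for _, h := range stored {
		if !referenced[h] {
			orphans = append(orphans, h)
		}
	}
	return orphans
}

func main() {
	stored := []string{"aa11", "bb22", "cc33"}
	manifests := [][]string{{"aa11"}, {"bb22", "aa11"}}
	fmt.Println(pruneTargets(stored, manifests)) // [cc33]
}
```

Because manifests are unencrypted, this whole computation needs only bucket access, never a private key.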
### chunking
* Content-defined chunking using FastCDC algorithm
* Average chunk size: configurable (default 10MB)
* Deduplication at chunk level
* Multiple chunks packed into blobs for efficiency
@@ -205,19 +362,13 @@ vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
* Each blob encrypted independently
* Metadata databases also encrypted
### compression
* zstd compression at configurable level
* Applied before encryption
* Blob-level compression for efficiency
### state tracking
* Local SQLite database for incremental state
* Tracks file mtimes and chunk mappings
* Enables efficient change detection
* Supports inotify monitoring in daemon mode
---
## does not
@@ -227,8 +378,6 @@ vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
* Require a symmetric passphrase or password
* Trust the source system with anything
---
## does
* Incremental deduplicated backup
@@ -240,70 +389,16 @@ vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
---
## restore
`vaultik restore` downloads only the snapshot metadata and required blobs. It
never contacts the source system. All restore operations depend only on:
* `VAULTIK_AGE_SECRET_KEY`
* The bucket
The entire system is restore-only from object storage.
---
## requirements
* Go 1.24 or later
* S3-compatible object storage
* Sufficient disk space for local index (typically <1GB)
## license
[MIT](https://opensource.org/license/mit/)
## author
Made with love and lots of expensive SOTA AI by [sneak](https://sneak.berlin) in Berlin in the summer of 2025.