Major refactoring: UUID-based storage, streaming architecture, and CLI improvements

This commit represents a significant architectural overhaul of vaultik:

Database Schema Changes:
- Switch files table to use UUID primary keys instead of path-based keys
- Add UUID primary keys to blobs table for immediate chunk association
- Update all foreign key relationships to use UUIDs
- Add comprehensive schema documentation in DATAMODEL.md
- Add SQLite busy timeout handling for concurrent operations
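
The UUID primary keys above need no third-party library; a minimal sketch of generating them (assuming random version-4 UUIDs, which the schema itself does not mandate):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 returns a random (version 4, variant 1) UUID in the
// canonical 8-4-4-4-12 hex form, suitable for use as a primary key.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version nibble to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set variant bits to RFC 4122
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // 36-character canonical UUID string
}
```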

Streaming and Performance Improvements:
- Implement true streaming blob packing without intermediate storage
- Add streaming chunk processing to reduce memory usage
- Improve progress reporting with real-time metrics
- Add upload metrics tracking in new uploads table

CLI Refactoring:
- Restructure CLI to use subcommands: snapshot create/list/purge/verify
- Add store info command for S3 configuration display
- Add custom duration parser supporting days/weeks/months/years
- Remove old backup.go in favor of enhanced snapshot.go
- Add --cron flag for silent operation

Configuration Changes:
- Remove unused index_prefix configuration option
- Add support for snapshot pruning retention policies
- Improve configuration validation and error messages

Testing Improvements:
- Add comprehensive repository tests with edge cases
- Add cascade delete debugging tests
- Fix concurrent operation tests to use SQLite busy timeout
- Remove tolerance for SQLITE_BUSY errors in tests
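
The busy-timeout handling referenced above boils down to one pragma issued on each connection; with it set, SQLite blocks and retries for up to the given number of milliseconds instead of returning SQLITE_BUSY immediately (the 5000 ms value here is illustrative, not necessarily what vaultik uses):

```sql
-- Issued once per connection, before other statements.
-- Without this, a concurrent writer causes immediate SQLITE_BUSY errors.
PRAGMA busy_timeout = 5000;  -- wait up to 5 seconds for locks
```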

Documentation:
- Add MIT LICENSE file
- Update README with new command structure
- Add comprehensive DATAMODEL.md explaining database schema
- Update DESIGN.md with UUID-based architecture

Other Changes:
- Add test-config.yml for testing
- Update Makefile with better test output formatting
- Fix various race conditions in concurrent operations
- Improve error handling throughout
Date:   2025-07-22 14:54:37 +02:00
Parent: 86b533d6ee
Commit: 78af626759

54 changed files with 5525 additions and 1109 deletions

README.md (115 changed lines)

@@ -5,7 +5,21 @@ encrypts data using an `age` public key and uploads each encrypted blob
 directly to a remote S3-compatible object store. It requires no private
 keys, secrets, or credentials stored on the backed-up system.
 
 ---
 
+It includes table-stakes features such as:
+* modern authenticated encryption
+* deduplication
+* incremental backups
+* modern multithreaded zstd compression with configurable levels
+* content-addressed immutable storage
+* local state tracking in standard SQLite database
+* inotify-based change detection
+* streaming processing of all data to not require lots of ram or temp file
+  storage
+* no mutable remote metadata
+* no plaintext file paths or metadata stored in remote
+* does not create huge numbers of small files (to keep S3 operation counts
+  down) even if the source system has many small files
 
 ## what
@@ -15,27 +29,29 @@ Each chunk is streamed into a blob packer. Blobs are compressed with `zstd`,
 encrypted with `age`, and uploaded directly to remote storage under a
 content-addressed S3 path.
 
-No plaintext file contents ever hit disk. No private key is needed or stored
-locally. All encrypted data is streaming-processed and immediately discarded
-once uploaded. Metadata is encrypted and pushed with the same mechanism.
+No plaintext file contents ever hit disk. No private key or secret
+passphrase is needed or stored locally. All encrypted data is
+streaming-processed and immediately discarded once uploaded. Metadata is
+encrypted and pushed with the same mechanism.
 
 ## why
 
 Existing backup software fails under one or more of these conditions:
 
-* Requires secrets (passwords, private keys) on the source system
+* Requires secrets (passwords, private keys) on the source system, which
+  compromises encrypted backups in the case of host system compromise
 * Depends on symmetric encryption unsuitable for zero-trust environments
 * Stages temporary archives or repositories
 * Writes plaintext metadata or plaintext file paths
 * Creates one-blob-per-file, which results in excessive S3 operation counts
 
-`vaultik` addresses all of these by using:
+`vaultik` addresses these by using:
 
 * Public-key-only encryption (via `age`) requires no secrets (other than
-  bucket access key) on the source system
-* Blob-level deduplication and batching
-* Local state cache for incremental detection
-* S3-native chunked upload interface
-* Self-contained encrypted snapshot metadata
+  remote storage api key) on the source system
+* Local state cache for incremental detection does not require reading from
+  or decrypting remote storage
+* Content-addressed immutable storage allows efficient deduplication
+* Storage only of large encrypted blobs of configurable size (1G by default)
+  reduces S3 operation counts and improves performance
 
 ## how
@@ -63,6 +79,7 @@ Existing backup software fails under one or more of these conditions:
     - '*.tmp'
 age_recipient: age1278m9q7dp3chsh2dcy82qk27v047zywyvtxwnj4cvt0z65jw6a7q5dqhfj
 s3:
+  # endpoint is optional if using AWS S3, but who even does that?
   endpoint: https://s3.example.com
   bucket: vaultik-data
   prefix: host1/
@@ -73,24 +90,30 @@ Existing backup software fails under one or more of these conditions:
 full_scan_interval: 24h    # normally we use inotify to mark dirty, but
                            # every 24h we do a full stat() scan
 min_time_between_run: 15m  # again, only for daemon mode
-index_path: /var/lib/vaultik/index.sqlite
+#index_path: /var/lib/vaultik/index.sqlite
 chunk_size: 10MB
 blob_size_limit: 10GB
-index_prefix: index/
 ```
 
 4. **run**
 
 ```sh
-vaultik backup /etc/vaultik.yaml
+vaultik --config /etc/vaultik.yaml snapshot create
 ```
 
 ```sh
-vaultik backup /etc/vaultik.yaml --cron # silent unless error
+vaultik --config /etc/vaultik.yaml snapshot create --cron # silent unless error
 ```
 
 ```sh
-vaultik backup /etc/vaultik.yaml --daemon # runs in background, uses inotify
+vaultik --config /etc/vaultik.yaml snapshot daemon # runs continuously in foreground, uses inotify to detect changes
+
+# TODO
+* make sure daemon mode does not make a snapshot if no files have
+changed, even if the backup_interval has passed
+* in daemon mode, if we are long enough since the last snapshot event, and we get
+an inotify event, we should schedule the next snapshot creation for 10 minutes from the
+time of the mark-dirty event.
 ```
 
 ---
@@ -100,26 +123,48 @@ Existing backup software fails under one or more of these conditions:
 ### commands
 
 ```sh
-vaultik backup [--config <path>] [--cron] [--daemon]
+vaultik [--config <path>] snapshot create [--cron] [--daemon]
+vaultik [--config <path>] snapshot list [--json]
+vaultik [--config <path>] snapshot purge [--keep-latest | --older-than <duration>] [--force]
+vaultik [--config <path>] snapshot verify <snapshot-id> [--deep]
+vaultik [--config <path>] store info
 
+# FIXME: remove 'bucket' and 'prefix' and 'snapshot' flags. it should be
+# 'vaultik restore snapshot <snapshot> --target <dir>'. bucket and prefix are always
+# from config file.
 vaultik restore --bucket <bucket> --prefix <prefix> --snapshot <id> --target <dir>
+
+# FIXME: remove prune, it's the old version of "snapshot purge"
 vaultik prune --bucket <bucket> --prefix <prefix> [--dry-run]
+
+# FIXME: change fetch to 'vaultik restore path <snapshot> <path> --target <path>'
 vaultik fetch --bucket <bucket> --prefix <prefix> --snapshot <id> --file <path> --target <path>
+
+# FIXME: remove this, it's redundant with 'snapshot verify'
 vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
 ```
 
 ### environment
 
 * `VAULTIK_PRIVATE_KEY`: Required for `restore`, `prune`, `fetch`, and `verify` commands. Contains the age private key for decryption.
-* `VAULTIK_CONFIG`: Optional path to config file. If set, `vaultik backup` can be run without specifying the config file path.
+* `VAULTIK_CONFIG`: Optional path to config file. If set, config file path doesn't need to be specified on the command line.
 
 ### command details
 
-**backup**: Perform incremental backup of configured directories
+**snapshot create**: Perform incremental backup of configured directories
 * Config is located at `/etc/vaultik/config.yml` by default
 * `--config`: Override config file path
 * `--cron`: Silent unless error (for crontab)
 * `--daemon`: Run continuously with inotify monitoring and periodic scans
 
+**snapshot list**: List all snapshots with their timestamps and sizes
+* `--json`: Output in JSON format
+
+**snapshot purge**: Remove old snapshots based on criteria
+* `--keep-latest`: Keep only the most recent snapshot
+* `--older-than`: Remove snapshots older than duration (e.g., 30d, 6mo, 1y)
+* `--force`: Skip confirmation prompt
+
+**snapshot verify**: Verify snapshot integrity
+* `--deep`: Download and verify blob hashes (not just existence)
+
+**store info**: Display S3 bucket configuration and storage statistics
+
 **restore**: Restore entire snapshot to target directory
 * Downloads and decrypts metadata
 * Fetches only required blobs
@@ -245,41 +290,23 @@ This enables garbage collection from immutable storage.
 
 ---
 
-## license
+## LICENSE
 
-WTFPL — see LICENSE.
+[MIT](https://opensource.org/license/mit/)
 
 ---
 
-## security considerations
-
-* Source host compromise cannot decrypt backups
-* No replay attacks possible (append-only)
-* Each blob independently encrypted
-* Metadata tampering detectable via hash verification
-* S3 credentials only allow write access to backup prefix
-
-## performance
-
-* Streaming processing (no temp files)
-* Parallel blob uploads
-* Deduplication reduces storage and bandwidth
-* Local index enables fast incremental detection
-* Configurable compression levels
-
 ## requirements
 
 * Go 1.24.4 or later
 * S3-compatible object storage
 * age command-line tool (for key generation)
 * SQLite3
-* Sufficient disk space for local index
+* Sufficient disk space for local index (typically <1GB)
 
 ## author
 
 Made with love and lots of expensive SOTA AI by [sneak](https://sneak.berlin) in Berlin in the summer of 2025.
 
-Released as a free software gift to the world, no strings attached, under the [WTFPL](https://www.wtfpl.net/) license.
+Released as a free software gift to the world, no strings attached.
 
 Contact: [sneak@sneak.berlin](mailto:sneak@sneak.berlin)