Major refactoring: UUID-based storage, streaming architecture, and CLI improvements

This commit represents a significant architectural overhaul of vaultik:

Database Schema Changes:
- Switch files table to use UUID primary keys instead of path-based keys
- Add UUID primary keys to blobs table for immediate chunk association
- Update all foreign key relationships to use UUIDs
- Add comprehensive schema documentation in DATAMODEL.md
- Add SQLite busy timeout handling for concurrent operations
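
The UUID primary keys above need no third-party library; a minimal sketch of generating them (assuming random version-4 UUIDs, which the schema itself does not mandate):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 returns a random (version 4, variant 1) UUID in the
// canonical 8-4-4-4-12 hex form, suitable for use as a primary key.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version nibble to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set variant bits to RFC 4122
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // 36-character canonical UUID string
}
```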

Streaming and Performance Improvements:
- Implement true streaming blob packing without intermediate storage
- Add streaming chunk processing to reduce memory usage
- Improve progress reporting with real-time metrics
- Add upload metrics tracking in new uploads table

CLI Refactoring:
- Restructure CLI to use subcommands: snapshot create/list/purge/verify
- Add store info command for S3 configuration display
- Add custom duration parser supporting days/weeks/months/years
- Remove old backup.go in favor of enhanced snapshot.go
- Add --cron flag for silent operation

Configuration Changes:
- Remove unused index_prefix configuration option
- Add support for snapshot pruning retention policies
- Improve configuration validation and error messages

Testing Improvements:
- Add comprehensive repository tests with edge cases
- Add cascade delete debugging tests
- Fix concurrent operation tests to use SQLite busy timeout
- Remove tolerance for SQLITE_BUSY errors in tests
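
The busy-timeout handling referenced above boils down to one pragma issued on each connection; with it set, SQLite blocks and retries for up to the given number of milliseconds instead of returning SQLITE_BUSY immediately (the 5000 ms value here is illustrative, not necessarily what vaultik uses):

```sql
-- Issued once per connection, before other statements.
-- Without this, a concurrent writer causes immediate SQLITE_BUSY errors.
PRAGMA busy_timeout = 5000;  -- wait up to 5 seconds for locks
```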

Documentation:
- Add MIT LICENSE file
- Update README with new command structure
- Add comprehensive DATAMODEL.md explaining database schema
- Update DESIGN.md with UUID-based architecture

Other Changes:
- Add test-config.yml for testing
- Update Makefile with better test output formatting
- Fix various race conditions in concurrent operations
- Improve error handling throughout
Date:   2025-07-22 14:54:37 +02:00
Parent: 86b533d6ee
Commit: 78af626759

54 changed files with 5525 additions and 1109 deletions

README.md (115 changed lines)

@@ -5,7 +5,21 @@ encrypts data using an `age` public key and uploads each encrypted blob
 directly to a remote S3-compatible object store. It requires no private
 keys, secrets, or credentials stored on the backed-up system.
 
 ---
 
+It includes table-stakes features such as:
+* modern authenticated encryption
+* deduplication
+* incremental backups
+* modern multithreaded zstd compression with configurable levels
+* content-addressed immutable storage
+* local state tracking in standard SQLite database
+* inotify-based change detection
+* streaming processing of all data to not require lots of ram or temp file
+  storage
+* no mutable remote metadata
+* no plaintext file paths or metadata stored in remote
+* does not create huge numbers of small files (to keep S3 operation counts
+  down) even if the source system has many small files
 
 ## what
@@ -15,27 +29,29 @@ Each chunk is streamed into a blob packer. Blobs are compressed with `zstd`,
 encrypted with `age`, and uploaded directly to remote storage under a
 content-addressed S3 path.
 
-No plaintext file contents ever hit disk. No private key is needed or stored
-locally. All encrypted data is streaming-processed and immediately discarded
-once uploaded. Metadata is encrypted and pushed with the same mechanism.
+No plaintext file contents ever hit disk. No private key or secret
+passphrase is needed or stored locally. All encrypted data is
+streaming-processed and immediately discarded once uploaded. Metadata is
+encrypted and pushed with the same mechanism.
 
 ## why
 
 Existing backup software fails under one or more of these conditions:
 
-* Requires secrets (passwords, private keys) on the source system
+* Requires secrets (passwords, private keys) on the source system, which
+  compromises encrypted backups in the case of host system compromise
 * Depends on symmetric encryption unsuitable for zero-trust environments
 * Stages temporary archives or repositories
 * Writes plaintext metadata or plaintext file paths
 * Creates one-blob-per-file, which results in excessive S3 operation counts
 
-`vaultik` addresses all of these by using:
+`vaultik` addresses these by using:
 
 * Public-key-only encryption (via `age`) requires no secrets (other than
-  bucket access key) on the source system
-* Blob-level deduplication and batching
-* Local state cache for incremental detection
-* S3-native chunked upload interface
-* Self-contained encrypted snapshot metadata
+  remote storage api key) on the source system
+* Local state cache for incremental detection does not require reading from
+  or decrypting remote storage
+* Content-addressed immutable storage allows efficient deduplication
+* Storage only of large encrypted blobs of configurable size (1G by default)
+  reduces S3 operation counts and improves performance
 
 ## how
@@ -63,6 +79,7 @@ Existing backup software fails under one or more of these conditions:
     - '*.tmp'
 age_recipient: age1278m9q7dp3chsh2dcy82qk27v047zywyvtxwnj4cvt0z65jw6a7q5dqhfj
 s3:
+  # endpoint is optional if using AWS S3, but who even does that?
   endpoint: https://s3.example.com
   bucket: vaultik-data
   prefix: host1/
@@ -73,24 +90,30 @@ Existing backup software fails under one or more of these conditions:
 full_scan_interval: 24h    # normally we use inotify to mark dirty, but
                            # every 24h we do a full stat() scan
 min_time_between_run: 15m  # again, only for daemon mode
-index_path: /var/lib/vaultik/index.sqlite
+#index_path: /var/lib/vaultik/index.sqlite
 chunk_size: 10MB
 blob_size_limit: 10GB
-index_prefix: index/
 ```
 
 4. **run**
 
 ```sh
-vaultik backup /etc/vaultik.yaml
+vaultik --config /etc/vaultik.yaml snapshot create
 ```
 
 ```sh
-vaultik backup /etc/vaultik.yaml --cron # silent unless error
+vaultik --config /etc/vaultik.yaml snapshot create --cron # silent unless error
 ```
 
 ```sh
-vaultik backup /etc/vaultik.yaml --daemon # runs in background, uses inotify
+vaultik --config /etc/vaultik.yaml snapshot daemon # runs continuously in foreground, uses inotify to detect changes
+
+# TODO
+* make sure daemon mode does not make a snapshot if no files have
+changed, even if the backup_interval has passed
+* in daemon mode, if we are long enough since the last snapshot event, and we get
+an inotify event, we should schedule the next snapshot creation for 10 minutes from the
+time of the mark-dirty event.
 ```
 
 ---
@@ -100,26 +123,48 @@ Existing backup software fails under one or more of these conditions:
 ### commands
 
 ```sh
-vaultik backup [--config <path>] [--cron] [--daemon]
+vaultik [--config <path>] snapshot create [--cron] [--daemon]
+vaultik [--config <path>] snapshot list [--json]
+vaultik [--config <path>] snapshot purge [--keep-latest | --older-than <duration>] [--force]
+vaultik [--config <path>] snapshot verify <snapshot-id> [--deep]
+vaultik [--config <path>] store info
 
+# FIXME: remove 'bucket' and 'prefix' and 'snapshot' flags. it should be
+# 'vaultik restore snapshot <snapshot> --target <dir>'. bucket and prefix are always
+# from config file.
 vaultik restore --bucket <bucket> --prefix <prefix> --snapshot <id> --target <dir>
+
+# FIXME: remove prune, it's the old version of "snapshot purge"
 vaultik prune --bucket <bucket> --prefix <prefix> [--dry-run]
+
+# FIXME: change fetch to 'vaultik restore path <snapshot> <path> --target <path>'
 vaultik fetch --bucket <bucket> --prefix <prefix> --snapshot <id> --file <path> --target <path>
+
+# FIXME: remove this, it's redundant with 'snapshot verify'
 vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
 ```
 
 ### environment
 
 * `VAULTIK_PRIVATE_KEY`: Required for `restore`, `prune`, `fetch`, and `verify` commands. Contains the age private key for decryption.
-* `VAULTIK_CONFIG`: Optional path to config file. If set, `vaultik backup` can be run without specifying the config file path.
+* `VAULTIK_CONFIG`: Optional path to config file. If set, config file path doesn't need to be specified on the command line.
 
 ### command details
 
-**backup**: Perform incremental backup of configured directories
+**snapshot create**: Perform incremental backup of configured directories
 * Config is located at `/etc/vaultik/config.yml` by default
 * `--config`: Override config file path
 * `--cron`: Silent unless error (for crontab)
 * `--daemon`: Run continuously with inotify monitoring and periodic scans
 
+**snapshot list**: List all snapshots with their timestamps and sizes
+* `--json`: Output in JSON format
+
+**snapshot purge**: Remove old snapshots based on criteria
+* `--keep-latest`: Keep only the most recent snapshot
+* `--older-than`: Remove snapshots older than duration (e.g., 30d, 6mo, 1y)
+* `--force`: Skip confirmation prompt
+
+**snapshot verify**: Verify snapshot integrity
+* `--deep`: Download and verify blob hashes (not just existence)
+
+**store info**: Display S3 bucket configuration and storage statistics
+
 **restore**: Restore entire snapshot to target directory
 * Downloads and decrypts metadata
 * Fetches only required blobs
@@ -245,41 +290,23 @@ This enables garbage collection from immutable storage.
 
 ---
 
-## license
+## LICENSE
 
-WTFPL — see LICENSE.
+[MIT](https://opensource.org/license/mit/)
 
 ---
 
-## security considerations
-
-* Source host compromise cannot decrypt backups
-* No replay attacks possible (append-only)
-* Each blob independently encrypted
-* Metadata tampering detectable via hash verification
-* S3 credentials only allow write access to backup prefix
-
-## performance
-
-* Streaming processing (no temp files)
-* Parallel blob uploads
-* Deduplication reduces storage and bandwidth
-* Local index enables fast incremental detection
-* Configurable compression levels
-
 ## requirements
 
 * Go 1.24.4 or later
 * S3-compatible object storage
 * age command-line tool (for key generation)
 * SQLite3
-* Sufficient disk space for local index
+* Sufficient disk space for local index (typically <1GB)
 
 ## author
 
 Made with love and lots of expensive SOTA AI by [sneak](https://sneak.berlin) in Berlin in the summer of 2025.
 
-Released as a free software gift to the world, no strings attached, under the [WTFPL](https://www.wtfpl.net/) license.
+Released as a free software gift to the world, no strings attached.
 
 Contact: [sneak@sneak.berlin](mailto:sneak@sneak.berlin)