# vaultik: Design Document

`vaultik` is a secure backup tool written in Go. It performs
streaming backups using content-defined chunking, blob grouping, asymmetric
encryption, and object storage. The system is designed for environments
where the backup source host cannot store secrets and cannot retrieve or
decrypt any data from the destination.

The source host is **stateful**: it maintains a local SQLite index to detect
changes, deduplicate content, and track uploads across backup runs. All
remote storage is encrypted and append-only. Pruning of unreferenced data is
done from a trusted host with access to decryption keys, as even the
metadata indices are encrypted in the blob store.

---
## Why

ANOTHER backup tool??

Other backup tools like `restic`, `borg`, and `duplicity` are designed for
environments where the source host can store secrets and has access to
decryption keys. I don't want to store backup decryption keys on my hosts,
only public keys for encryption.

My requirements are:

* open source
* no passphrases or private keys on the source host
* incremental
* compressed
* encrypted
* s3 compatible without an intermediate step or tool

Surprisingly, no existing tool meets these requirements, so I wrote `vaultik`.
## Design Goals

1. Backups must require only a public key on the source host.
2. No secrets or private keys may exist on the source system.
3. Obviously, restore must be possible using **only** the backup bucket and
   a private key.
4. Prune must be possible, although this requires a private key, so it must
   be done on a different, trusted host.
5. All encryption is done using [`age`](https://github.com/FiloSottile/age)
   (X25519, XChaCha20-Poly1305).
6. Compression uses `zstd` at a configurable level.
7. Files are chunked, and multiple chunks are packed into encrypted blobs.
   This reduces the number of objects in the blob store for filesystems with
   many small files.
8. All metadata (snapshots) is stored remotely as encrypted SQLite DBs.
9. If a snapshot metadata file exceeds a configured size threshold, it is
   chunked into multiple encrypted `.age` parts, to support large
   filesystems.
10. CLI interface is structured using `cobra`.

---
## S3 Bucket Layout
S3 stores only four things:
1) Blobs: encrypted, compressed packs of file chunks.
2) Metadata: encrypted SQLite databases containing the current state of the
filesystem at the time of the snapshot.
3) Metadata hashes: encrypted hashes of the metadata SQLite databases.
4) Blob manifests: unencrypted compressed JSON files listing all blob hashes
referenced in the snapshot, enabling pruning without decryption.
```
s3://<bucket>/<prefix>/
├── blobs/
│   ├── <aa>/<bb>/<full_blob_hash>.zst.age
├── metadata/
│   ├── <snapshot_id>.sqlite.age
│   ├── <snapshot_id>.sqlite.00.age
│   ├── <snapshot_id>.sqlite.01.age
│   ├── <snapshot_id>.hash.age
│   ├── <snapshot_id>.manifest.json.zst
```
To retrieve a given file, you would:
* fetch `metadata/<snapshot_id>.sqlite.age` or `metadata/<snapshot_id>.sqlite.{seq}.age`
* fetch `metadata/<snapshot_id>.hash.age`
* decrypt the metadata SQLite database using the private key and reconstruct
the full database file
* verify the hash of the decrypted database matches the decrypted hash
* query the database for the file in question
* determine all chunks for the file
* for each chunk, look up the metadata for all blobs in the db
* fetch each blob from `blobs/<aa>/<bb>/<blob_hash>.zst.age`
* decrypt each blob using the private key
* decompress each blob using `zstd`
* reconstruct the file from the set of file chunks stored in the blobs

With a bit of care, this can be done chunk by chunk without touching disk
(except for the output file), as each uncompressed blob should fit in
memory (<10GB).
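As an illustration of that per-file reconstruction, the sketch below streams one file's chunks in index order out of a decrypted snapshot database; `fetchChunk` is a hypothetical helper (not part of the codebase) that resolves each chunk via `blob_chunks`, downloads the containing blob, decrypts and decompresses it, and returns the chunk bytes.

```go
// restoreFile streams one file's chunks, in index order, into out.
// The query runs against the snapshot metadata schema described below.
func restoreFile(db *sql.DB, out io.Writer, fileID string,
    fetchChunk func(chunkHash string) ([]byte, error)) error {
    rows, err := db.Query(
        `SELECT chunk_hash FROM file_chunks WHERE file_id = ? ORDER BY idx`,
        fileID)
    if err != nil {
        return err
    }
    defer rows.Close()
    for rows.Next() {
        var chunkHash string
        if err := rows.Scan(&chunkHash); err != nil {
            return err
        }
        data, err := fetchChunk(chunkHash) // hypothetical helper
        if err != nil {
            return err
        }
        if _, err := out.Write(data); err != nil {
            return err
        }
    }
    return rows.Err()
}
```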
### Path Rules
* `<snapshot_id>`: UTC timestamp in ISO 8601 format, e.g. `2023-10-01T12:00:00Z`. These are lexicographically sortable.
* `blobs/<aa>/<bb>/...`: where `aa` and `bb` are the first 2 hex bytes of the blob hash.
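For illustration, the object key for a blob can be derived from its hex hash like this (a sketch; `blobKey` is not a function in the codebase):

```go
// blobKey builds the S3 object key for a blob from its hex-encoded hash,
// using the first and second hash bytes as shard directories.
func blobKey(prefix, blobHash string) string {
    return fmt.Sprintf("%s/blobs/%s/%s/%s.zst.age",
        prefix, blobHash[0:2], blobHash[2:4], blobHash)
}
```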
### Blob Manifest Format
The `<snapshot_id>.manifest.json.zst` file is an unencrypted, compressed JSON file containing:
```json
{
  "snapshot_id": "2023-10-01T12:00:00Z",
  "blob_hashes": [
    "aa1234567890abcdef...",
    "bb2345678901bcdef0...",
    ...
  ]
}
```
This allows pruning operations to determine which blobs are referenced without requiring decryption keys.
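In Go, the manifest can be modeled and read with a small struct. This is a sketch that assumes `github.com/klauspost/compress/zstd` for decompression; the names are illustrative:

```go
// BlobManifest mirrors the manifest JSON written alongside each snapshot.
type BlobManifest struct {
    SnapshotID string   `json:"snapshot_id"`
    BlobHashes []string `json:"blob_hashes"`
}

// readManifest decompresses and decodes a manifest from r.
func readManifest(r io.Reader) (*BlobManifest, error) {
    zr, err := zstd.NewReader(r)
    if err != nil {
        return nil, err
    }
    defer zr.Close()
    var m BlobManifest
    if err := json.NewDecoder(zr).Decode(&m); err != nil {
        return nil, err
    }
    return &m, nil
}
```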
---
## 3. Local SQLite Index Schema (source host)
```sql
CREATE TABLE files (
    id TEXT PRIMARY KEY, -- UUID
    path TEXT NOT NULL UNIQUE,
    mtime INTEGER NOT NULL,
    size INTEGER NOT NULL
);

-- Maps files to their constituent chunks in sequence order
-- Used for reconstructing files from chunks during restore
CREATE TABLE file_chunks (
    file_id TEXT NOT NULL,
    idx INTEGER NOT NULL,
    chunk_hash TEXT NOT NULL,
    PRIMARY KEY (file_id, idx)
);

CREATE TABLE chunks (
    chunk_hash TEXT PRIMARY KEY,
    sha256 TEXT NOT NULL,
    size INTEGER NOT NULL
);

CREATE TABLE blobs (
    blob_hash TEXT PRIMARY KEY,
    final_hash TEXT NOT NULL,
    created_ts INTEGER NOT NULL
);

CREATE TABLE blob_chunks (
    blob_hash TEXT NOT NULL,
    chunk_hash TEXT NOT NULL,
    offset INTEGER NOT NULL,
    length INTEGER NOT NULL,
    PRIMARY KEY (blob_hash, chunk_hash)
);

-- Reverse mapping: tracks which files contain a given chunk
-- Used for deduplication and tracking chunk usage across files
CREATE TABLE chunk_files (
    chunk_hash TEXT NOT NULL,
    file_id TEXT NOT NULL,
    file_offset INTEGER NOT NULL,
    length INTEGER NOT NULL,
    PRIMARY KEY (chunk_hash, file_id)
);

CREATE TABLE snapshots (
    id TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    vaultik_version TEXT NOT NULL,
    vaultik_git_revision TEXT NOT NULL,
    created_ts INTEGER NOT NULL,
    file_count INTEGER NOT NULL,
    chunk_count INTEGER NOT NULL,
    blob_count INTEGER NOT NULL
);
```
---
## 4. Snapshot Metadata Schema (stored in S3)
Identical schema to the local index, filtered to live snapshot state. Stored
as a SQLite DB, compressed with `zstd`, encrypted with `age`. If larger than
a configured `chunk_size`, it is split and uploaded as:
```
metadata/<snapshot_id>.sqlite.00.age
metadata/<snapshot_id>.sqlite.01.age
...
```
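The splitting itself is just bounded copying of the encrypted stream into numbered parts. A minimal sketch, with illustrative names and the upload step left as a callback:

```go
// splitUpload copies r into sequential parts of at most chunkSize bytes and
// hands each part to upload together with its sequence number.
func splitUpload(r io.Reader, chunkSize int64,
    upload func(seq int, part io.Reader) error) error {
    for seq := 0; ; seq++ {
        buf := new(bytes.Buffer)
        n, err := io.CopyN(buf, r, chunkSize)
        if err != nil && err != io.EOF {
            return err
        }
        if n == 0 {
            return nil // stream exhausted
        }
        if uploadErr := upload(seq, buf); uploadErr != nil {
            return uploadErr
        }
        if err == io.EOF {
            return nil // short final part
        }
    }
}
```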
---
## 5. Data Flow
### 5.1 Backup
1. Load config
2. Open local SQLite index
3. Walk source directories:
   * For each file:
     * Check mtime and size in index
     * If changed or new:
       * Chunk file
       * For each chunk:
         * Hash with SHA256
         * Check if already uploaded
         * If not:
           * Add chunk to blob packer
       * Record file-chunk mapping in index
4. When blob reaches threshold size (e.g. 1GB):
   * Compress with `zstd`
   * Encrypt with `age`
   * Upload to: `s3://<bucket>/<prefix>/blobs/<aa>/<bb>/<hash>.zst.age`
   * Record blob-chunk layout in local index
5. Once all files are processed:
   * Build snapshot SQLite DB from index delta
   * Compress + encrypt
   * If larger than `chunk_size`, split into parts
   * Upload to:
     `s3://<bucket>/<prefix>/metadata/<snapshot_id>.sqlite(.xx).age`
6. Create snapshot record in local index that lists:
   * snapshot ID
   * hostname
   * vaultik version
   * timestamp
   * counts of files, chunks, and blobs
   * list of all blobs referenced in the snapshot (some new, some old) for
     efficient pruning later
7. Create snapshot database for upload
8. Calculate checksum of snapshot database
9. Compress, encrypt, split, and upload to S3
10. Encrypt the hash of the snapshot database to the backup age key
11. Upload the encrypted hash to S3 as `metadata/<snapshot_id>.hash.age`
12. Create blob manifest JSON listing all blob hashes referenced in snapshot
13. Compress manifest with zstd and upload as `metadata/<snapshot_id>.manifest.json.zst`
14. Optionally prune remote blobs that are no longer referenced in the
    snapshot, based on local state db
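Steps 4, 5, and 9 all rely on the same streaming writer pipeline: plaintext is written through a zstd encoder into an age encryption writer, which writes to the upload stream, so nothing is staged on disk. A minimal sketch of that wrapper, assuming `filippo.io/age` and `github.com/klauspost/compress/zstd`; the type and function names are illustrative, not from the codebase:

```go
// compressEncrypt wraps dst so that everything written to the returned
// writer is zstd-compressed and then age-encrypted to recipient.
// Close must be called to flush both layers.
func compressEncrypt(dst io.Writer, recipient age.Recipient) (io.WriteCloser, error) {
    encWriter, err := age.Encrypt(dst, recipient)
    if err != nil {
        return nil, err
    }
    zw, err := zstd.NewWriter(encWriter)
    if err != nil {
        encWriter.Close()
        return nil, err
    }
    return &pipelineWriter{zw: zw, enc: encWriter}, nil
}

type pipelineWriter struct {
    zw  *zstd.Encoder
    enc io.WriteCloser
}

func (p *pipelineWriter) Write(b []byte) (int, error) { return p.zw.Write(b) }

func (p *pipelineWriter) Close() error {
    // Flush the compressor before finalizing the encryption layer.
    if err := p.zw.Close(); err != nil {
        return err
    }
    return p.enc.Close()
}
```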
### 5.2 Manual Prune
1. List all objects under `metadata/`
2. Determine the latest valid `snapshot_id` by timestamp
3. Download and decompress the latest `<snapshot_id>.manifest.json.zst`
4. Extract set of referenced blob hashes from manifest (no decryption needed)
5. List all blob objects under `blobs/`
6. For each blob:
   * If the hash is not in the manifest:
     * Issue `DeleteObject` to remove it
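A sketch of that loop, written against a hypothetical object-store interface rather than a concrete S3 client, and reusing the `BlobManifest` type sketched earlier:

```go
// ObjectStore is a hypothetical abstraction over the S3 client.
type ObjectStore interface {
    List(prefix string) ([]string, error) // returns object keys
    Delete(key string) error
}

// prune deletes every blob object whose hash is absent from the manifest.
func prune(store ObjectStore, m *BlobManifest, prefix string, dryRun bool) error {
    referenced := make(map[string]bool, len(m.BlobHashes))
    for _, h := range m.BlobHashes {
        referenced[h] = true
    }
    keys, err := store.List(prefix + "/blobs/")
    if err != nil {
        return err
    }
    for _, key := range keys {
        // Keys look like <prefix>/blobs/<aa>/<bb>/<hash>.zst.age.
        hash := strings.TrimSuffix(path.Base(key), ".zst.age")
        if referenced[hash] {
            continue
        }
        if dryRun {
            log.Printf("would delete %s", key)
            continue
        }
        if err := store.Delete(key); err != nil {
            return err
        }
    }
    return nil
}
```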
### 5.3 Verify
Verify runs on a host that has no local state but does have access to the bucket.
1. Fetch latest metadata snapshot files from S3
2. Fetch latest metadata db hash from S3
3. Decrypt the hash using the private key
4. Decrypt the metadata SQLite database chunks using the private key and
reassemble the snapshot db file
5. Calculate the SHA256 hash of the decrypted snapshot database
6. Verify the db file hash matches the decrypted hash
7. For each blob in the snapshot:
   * Fetch the blob metadata from the snapshot db
   * Ensure the blob exists in S3
   * Check the S3 content hash matches the expected blob hash
   * If not using --quick mode:
     * Download and decrypt the blob
     * Decompress and verify chunk hashes match metadata
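Steps 5 and 6 reduce to a hash comparison over the reassembled database file. A sketch (the function name is illustrative):

```go
// verifyDBHash compares the SHA-256 of the reassembled snapshot database
// against the hash recovered from metadata/<snapshot_id>.hash.age.
func verifyDBHash(dbPath, expectedHex string) error {
    f, err := os.Open(dbPath)
    if err != nil {
        return err
    }
    defer f.Close()
    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        return err
    }
    got := hex.EncodeToString(h.Sum(nil))
    if got != expectedHex {
        return fmt.Errorf("snapshot db hash mismatch: got %s, want %s", got, expectedHex)
    }
    return nil
}
```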
---
## 6. CLI Commands
```
vaultik backup [--config <path>] [--cron] [--daemon] [--prune]
vaultik restore --bucket <bucket> --prefix <prefix> --snapshot <id> --target <dir>
vaultik prune --bucket <bucket> --prefix <prefix> [--dry-run]
vaultik verify --bucket <bucket> --prefix <prefix> [--snapshot <id>] [--quick]
vaultik fetch --bucket <bucket> --prefix <prefix> --snapshot <id> --file <path> --target <path>
vaultik snapshot list --bucket <bucket> --prefix <prefix> [--limit <n>]
vaultik snapshot rm --bucket <bucket> --prefix <prefix> --snapshot <id>
vaultik snapshot latest --bucket <bucket> --prefix <prefix>
```
* `VAULTIK_PRIVATE_KEY` is required for the `restore`, `prune`, `verify`, and
  `fetch` commands.
* It is an environment variable containing the age private key.
---
## 7. Function and Method Signatures
### 7.1 CLI
```go
func RootCmd() *cobra.Command
func backupCmd() *cobra.Command
func restoreCmd() *cobra.Command
func pruneCmd() *cobra.Command
func verifyCmd() *cobra.Command
```
### 7.2 Configuration
```go
type Config struct {
    BackupPubKey      string        // age recipient
    BackupInterval    time.Duration // used in daemon mode, irrelevant for cron mode
    BlobSizeLimit     int64         // default 10GB
    ChunkSize         int64         // default 10MB
    Exclude           []string      // list of regex of files to exclude from backup, absolute path
    Hostname          string
    IndexPath         string        // path to local SQLite index db, default /var/lib/vaultik/index.db
    MetadataPrefix    string        // S3 prefix for metadata, default "metadata/"
    MinTimeBetweenRun time.Duration // minimum time between backup runs, default 1 hour - for daemon mode
    S3                S3Config      // S3 configuration
    ScanInterval      time.Duration // interval to full stat() scan source dirs, default 24h
    SourceDirs        []string      // list of source directories to back up, absolute paths
}

type S3Config struct {
    Endpoint        string
    Bucket          string
    Prefix          string
    AccessKeyID     string
    SecretAccessKey string
    Region          string
}
func Load(path string) (*Config, error)
```
### 7.3 Index
```go
type Index struct {
    db *sql.DB
}
func OpenIndex(path string) (*Index, error)
func (ix *Index) LookupFile(path string, mtime int64, size int64) ([]string, bool, error)
func (ix *Index) SaveFile(path string, mtime int64, size int64, chunkHashes []string) error
func (ix *Index) AddChunk(chunkHash string, size int64) error
func (ix *Index) MarkBlob(blobHash, finalHash string, created time.Time) error
func (ix *Index) MapChunkToBlob(blobHash, chunkHash string, offset, length int64) error
func (ix *Index) MapChunkToFile(chunkHash, filePath string, offset, length int64) error
```
### 7.4 Blob Packing
```go
type BlobWriter struct {
    // internal buffer, current size, encrypted writer, etc.
}
func NewBlobWriter(...) *BlobWriter
func (bw *BlobWriter) AddChunk(chunk []byte, chunkHash string) error
func (bw *BlobWriter) Flush() (finalBlobHash string, err error)
```
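Intended usage, sketched against the signatures above; the `Size` accessor and the surrounding loop are assumptions, not part of the interface as defined:

```go
// Sketch: pack pending chunks into blobs, flushing whenever the blob
// reaches the configured size limit.
bw := NewBlobWriter( /* config, age recipient, upload target, ... */ )
for _, c := range pendingChunks { // hypothetical slice of chunks to upload
    if err := bw.AddChunk(c.Data, c.Hash); err != nil {
        return err
    }
    if bw.Size() >= cfg.BlobSizeLimit { // Size() is assumed here
        blobHash, err := bw.Flush()
        if err != nil {
            return err
        }
        log.Printf("uploaded blob %s", blobHash)
    }
}
// Flush whatever is still buffered at the end of the run.
if _, err := bw.Flush(); err != nil {
    return err
}
```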
### 7.5 Metadata
```go
func BuildSnapshotMetadata(ix *Index, snapshotID string) (sqlitePath string, err error)
func EncryptAndUploadMetadata(path string, cfg *Config, snapshotID string) error
```
### 7.6 Prune
```go
func RunPrune(bucket, prefix, privateKey string) error
```