# vaultik: Design Document
`vaultik` is a secure backup tool written in Go. It performs
streaming backups using content-defined chunking, blob grouping, asymmetric
encryption, and object storage. The system is designed for environments
where the backup source host cannot store secrets and cannot retrieve or
decrypt any data from the destination.

The source host is **stateful**: it maintains a local SQLite index to detect
changes, deduplicate content, and track uploads across backup runs. All
remote storage is encrypted and append-only. Pruning of unreferenced data is
done from a trusted host with access to decryption keys, as even the
metadata indices are encrypted in the blob store.

---
## Why
ANOTHER backup tool??

Other backup tools like `restic`, `borg`, and `duplicity` are designed for
environments where the source host can store secrets and has access to
decryption keys. I don't want to store backup decryption keys on my hosts,
only public keys for encryption.
My requirements are:
* open source
* no passphrases or private keys on the source host
* incremental
* compressed
* encrypted
* s3 compatible without an intermediate step or tool

Surprisingly, no existing tool meets these requirements, so I wrote `vaultik`.
## Design Goals
1. Backups must require only a public key on the source host.
2. No secrets or private keys may exist on the source system.
3. Obviously, restore must be possible using **only** the backup bucket and
   a private key.
4. Prune must be possible, although this requires a private key, so it must
   be done on a different, trusted host.
5. All encryption is done using [`age`](https://github.com/FiloSottile/age)
   (X25519, XChaCha20-Poly1305).
6. Compression uses `zstd` at a configurable level.
7. Files are chunked, and multiple chunks are packed into encrypted blobs.
   This reduces the number of objects in the blob store for filesystems with
   many small files.
8. All metadata (snapshots) is stored remotely as encrypted SQLite DBs.
9. If a snapshot metadata file exceeds a configured size threshold, it is
   chunked into multiple encrypted `.age` parts, to support large
   filesystems.
10. CLI interface is structured using `cobra`.
---
## S3 Bucket Layout
S3 stores only three things:
1) Blobs: encrypted, compressed packs of file chunks.
2) Metadata: encrypted SQLite databases containing the current state of the
filesystem at the time of the snapshot.
3) Metadata hashes: encrypted hashes of the metadata SQLite databases.
```
s3://<bucket>/<prefix>/
├── blobs/
│   └── <aa>/<bb>/<full_blob_hash>.zst.age
└── metadata/
    ├── <snapshot_id>.sqlite.age
    ├── <snapshot_id>.sqlite.00.age
    ├── <snapshot_id>.sqlite.01.age
    └── <snapshot_id>.hash.age
```
To retrieve a given file, you would:
* fetch `metadata/<snapshot_id>.sqlite.age` or `metadata/<snapshot_id>.sqlite.{seq}.age`
* fetch `metadata/<snapshot_id>.hash.age`
* decrypt the metadata SQLite database using the private key and reconstruct
the full database file
* verify the hash of the decrypted database matches the decrypted hash
* query the database for the file in question
* determine all chunks for the file
* for each chunk, look up which blob contains it (and at what offset) in the db
* fetch each blob from `blobs/<aa>/<bb>/<blob_hash>.zst.age`
* decrypt each blob using the private key
* decompress each blob using `zstd`
* reconstruct the file from the set of chunks stored in the blobs

If clever, it may be possible to do this chunk by chunk without touching
disk (except for the output file), as each uncompressed blob should fit in
memory (<10GB).
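
For concreteness, here is a minimal sketch of the per-blob read path, assuming
the `filippo.io/age` and `github.com/klauspost/compress/zstd` libraries (the
library choices are illustrative; the design only mandates age and zstd):
```go
package vaultik

import (
    "io"

    "filippo.io/age"
    "github.com/klauspost/compress/zstd"
)

// openBlob wraps an encrypted, compressed blob stream fetched from
// blobs/<aa>/<bb>/<blob_hash>.zst.age and returns a reader of plaintext
// chunk data. Blobs are compressed first and encrypted second, so reads
// decrypt first and decompress second.
func openBlob(blob io.Reader, identity age.Identity) (io.ReadCloser, error) {
    decrypted, err := age.Decrypt(blob, identity)
    if err != nil {
        return nil, err
    }
    zr, err := zstd.NewReader(decrypted)
    if err != nil {
        return nil, err
    }
    return zr.IOReadCloser(), nil
}
```
Reconstructing a file is then a matter of reading each chunk at its recorded
offset within its decrypted blob and concatenating the chunks in index order.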
### Path Rules
* `<snapshot_id>`: UTC timestamp in ISO 8601 format, e.g. `2023-10-01T12:00:00Z`. These are lexicographically sortable.
* `blobs/<aa>/<bb>/...`: where `aa` and `bb` are the first and second bytes of the blob hash, hex-encoded.
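
As an illustration of the path rule, a tiny helper (the name is hypothetical)
that derives a blob's object key from its hex-encoded hash:
```go
package vaultik

import "fmt"

// blobKey derives the object key for a blob from its hex-encoded hash and
// the configured key prefix, following blobs/<aa>/<bb>/<hash>.zst.age.
// Example: blobKey("backups/", "deadbeef...") -> "backups/blobs/de/ad/deadbeef....zst.age".
func blobKey(prefix, blobHashHex string) string {
    return fmt.Sprintf("%sblobs/%s/%s/%s.zst.age",
        prefix, blobHashHex[:2], blobHashHex[2:4], blobHashHex)
}
```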
---
## 3. Local SQLite Index Schema (source host)
```sql
CREATE TABLE files (
    path  TEXT PRIMARY KEY,
    mtime INTEGER NOT NULL,
    size  INTEGER NOT NULL
);
CREATE TABLE file_chunks (
    path       TEXT NOT NULL,
    idx        INTEGER NOT NULL,
    chunk_hash TEXT NOT NULL,
    PRIMARY KEY (path, idx)
);
CREATE TABLE chunks (
    chunk_hash TEXT PRIMARY KEY,
    sha256     TEXT NOT NULL,
    size       INTEGER NOT NULL
);
CREATE TABLE blobs (
    blob_hash  TEXT PRIMARY KEY,
    final_hash TEXT NOT NULL,
    created_ts INTEGER NOT NULL
);
CREATE TABLE blob_chunks (
    blob_hash  TEXT NOT NULL,
    chunk_hash TEXT NOT NULL,
    offset     INTEGER NOT NULL,
    length     INTEGER NOT NULL,
    PRIMARY KEY (blob_hash, chunk_hash)
);
CREATE TABLE chunk_files (
    chunk_hash  TEXT NOT NULL,
    file_path   TEXT NOT NULL,
    file_offset INTEGER NOT NULL,
    length      INTEGER NOT NULL,
    PRIMARY KEY (chunk_hash, file_path)
);
CREATE TABLE snapshots (
    id              TEXT PRIMARY KEY,
    hostname        TEXT NOT NULL,
    vaultik_version TEXT NOT NULL,
    created_ts      INTEGER NOT NULL,
    file_count      INTEGER NOT NULL,
    chunk_count     INTEGER NOT NULL,
    blob_count      INTEGER NOT NULL
);
```
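
This schema is what drives change detection. A sketch of the lookup it
supports, using `database/sql` (it mirrors `LookupFile` from section 7.3, but
is illustrative only):
```go
package vaultik

import "database/sql"

// lookupFile reports whether a path is already indexed with the same mtime
// and size; if so, it returns the ordered chunk hashes recorded for it.
// Illustrative sketch of the query shape the schema supports.
func lookupFile(db *sql.DB, path string, mtime, size int64) ([]string, bool, error) {
    var m, s int64
    err := db.QueryRow(`SELECT mtime, size FROM files WHERE path = ?`, path).Scan(&m, &s)
    if err == sql.ErrNoRows {
        return nil, false, nil // new file
    }
    if err != nil {
        return nil, false, err
    }
    if m != mtime || s != size {
        return nil, false, nil // changed file
    }
    rows, err := db.Query(`SELECT chunk_hash FROM file_chunks WHERE path = ? ORDER BY idx`, path)
    if err != nil {
        return nil, false, err
    }
    defer rows.Close()
    var hashes []string
    for rows.Next() {
        var h string
        if err := rows.Scan(&h); err != nil {
            return nil, false, err
        }
        hashes = append(hashes, h)
    }
    return hashes, true, rows.Err()
}
```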
---
## 4. Snapshot Metadata Schema (stored in S3)
Identical schema to the local index, filtered to live snapshot state. Stored
as a SQLite DB, compressed with `zstd`, encrypted with `age`. If larger than
a configured `chunk_size`, it is split and uploaded as:
```
metadata/<snapshot_id>.sqlite.00.age
metadata/<snapshot_id>.sqlite.01.age
...
```
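
A sketch of the split step, assuming the stream is already compressed and
encrypted and the caller has decided it exceeds `chunk_size` (the `upload`
callback is a hypothetical stand-in for the S3 put):
```go
package vaultik

import (
    "fmt"
    "io"
)

// splitParts reads an already compressed and encrypted metadata stream and
// emits fixed-size parts named metadata/<snapshot_id>.sqlite.00.age, .01.age, ...
// The upload callback is hypothetical; it stands in for the S3 put.
func splitParts(r io.Reader, snapshotID string, partSize int64,
    upload func(key string, data []byte) error) error {
    buf := make([]byte, partSize)
    for i := 0; ; i++ {
        n, err := io.ReadFull(r, buf)
        if n > 0 {
            key := fmt.Sprintf("metadata/%s.sqlite.%02d.age", snapshotID, i)
            if uerr := upload(key, buf[:n]); uerr != nil {
                return uerr
            }
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            return nil // last (possibly short) part written
        }
        if err != nil {
            return err
        }
    }
}
```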
---
## 5. Data Flow
### 5.1 Backup
1. Load config
2. Open local SQLite index
3. Walk source directories:
   * For each file:
     * Check mtime and size in index
     * If changed or new:
       * Chunk file
       * For each chunk:
         * Hash with SHA256
         * Check if already uploaded
         * If not:
           * Add chunk to blob packer
       * Record file-chunk mapping in index
4. When a blob reaches the threshold size (e.g. 1GB):
   * Compress with `zstd`
   * Encrypt with `age`
   * Upload to: `s3://<bucket>/<prefix>/blobs/<aa>/<bb>/<hash>.zst.age`
   * Record blob-chunk layout in local index
5. Once all files are processed:
   * Build snapshot SQLite DB from index delta
   * Compress + encrypt
   * If larger than `chunk_size`, split into parts
   * Upload to:
     `s3://<bucket>/<prefix>/metadata/<snapshot_id>.sqlite(.xx).age`
6. Create snapshot record in local index that lists:
   * snapshot ID
   * hostname
   * vaultik version
   * timestamp
   * counts of files, chunks, and blobs
   * list of all blobs referenced in the snapshot (some new, some old) for
     efficient pruning later
7. Create snapshot database for upload
8. Calculate checksum of snapshot database
9. Compress, encrypt, split, and upload to S3
10. Encrypt the hash of the snapshot database to the backup age key
11. Upload the encrypted hash to S3 as `metadata/<snapshot_id>.hash.age`
12. Optionally prune remote blobs that are no longer referenced in the
snapshot, based on local state db
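
A condensed sketch of steps 3–4 above, wired against the `Index` and
`BlobWriter` types defined in section 7 (the chunker signature and error
handling are simplified, and chunks are materialised in memory here only for
brevity; the real pipeline is streaming):
```go
package vaultik

import (
    "crypto/sha256"
    "encoding/hex"
    "io/fs"
    "os"
    "path/filepath"
)

// backupDir walks one source directory and feeds new or changed files through
// the chunk -> hash -> pack pipeline. ix and packer are the Index and
// BlobWriter from section 7; chunker stands in for the content-defined
// chunker and is assumed to return the file's chunks in order.
func backupDir(root string, ix *Index, packer *BlobWriter, chunker func(path string) ([][]byte, error)) error {
    return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() {
            return err
        }
        info, statErr := os.Stat(path)
        if statErr != nil {
            return statErr
        }
        mtime, size := info.ModTime().Unix(), info.Size()
        // Unchanged files (same mtime and size) are skipped entirely.
        if _, unchanged, lerr := ix.LookupFile(path, mtime, size); lerr != nil || unchanged {
            return lerr
        }
        chunks, cerr := chunker(path)
        if cerr != nil {
            return cerr
        }
        hashes := make([]string, 0, len(chunks))
        for _, c := range chunks {
            sum := sha256.Sum256(c)
            h := hex.EncodeToString(sum[:])
            hashes = append(hashes, h)
            if err := ix.AddChunk(h, int64(len(c))); err != nil {
                return err
            }
            // A full implementation would skip packing chunks the index
            // already knows are uploaded; this sketch packs unconditionally.
            if err := packer.AddChunk(c, h); err != nil {
                return err
            }
        }
        return ix.SaveFile(path, mtime, size, hashes)
    })
}
```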
### 5.2 Manual Prune
1. List all objects under `metadata/`
2. Determine the latest valid `snapshot_id` by timestamp
3. Download, decrypt, and reconstruct the latest snapshot SQLite database
4. Extract set of referenced blob hashes
5. List all blob objects under `blobs/`
6. For each blob:
   * If the hash is not in the latest snapshot:
     * Issue `DeleteObject` to remove it
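
The heart of prune is a set difference between the blobs referenced by the
latest snapshot and the blobs present in the bucket. A sketch follows; the
`listBlobs` and `deleteBlob` callbacks are hypothetical wrappers around the S3
client, and the dry-run flag matches Phase 10 of the implementation plan:
```go
package vaultik

// pruneBlobs deletes every blob object whose hash is not referenced by the
// latest snapshot. referenced comes from the decrypted snapshot DB;
// listBlobs and deleteBlob are hypothetical wrappers over the S3 client.
// It returns the hashes that were (or, in dry-run mode, would be) deleted.
func pruneBlobs(referenced map[string]bool,
    listBlobs func() ([]string, error),
    deleteBlob func(hash string) error,
    dryRun bool) ([]string, error) {
    all, err := listBlobs()
    if err != nil {
        return nil, err
    }
    var victims []string
    for _, hash := range all {
        if referenced[hash] {
            continue // still reachable from the latest snapshot
        }
        victims = append(victims, hash)
        if dryRun {
            continue // dry-run: report what would be deleted, touch nothing
        }
        if err := deleteBlob(hash); err != nil {
            return victims, err
        }
    }
    return victims, nil
}
```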
### 5.3 Verify
Verify runs on a host that has no state but does have access to the bucket.
1. Fetch latest metadata snapshot files from S3
2. Fetch latest metadata db hash from S3
3. Decrypt the hash using the private key
4. Decrypt the metadata SQLite database chunks using the private key and
   reassemble the snapshot db file
5. Calculate the SHA256 hash of the decrypted snapshot database
6. Verify the db file hash matches the decrypted hash
7. For each blob in the snapshot:
   * Fetch the blob metadata from the snapshot db
   * Ensure the blob exists in S3
   * Ensure the S3 object hash matches the final (encrypted) blob hash
     stored in the metadata db
   * For each chunk in the blob:
     * Fetch the chunk metadata from the snapshot db
     * Verify the chunk data matches the chunk hash stored in the metadata
       db (this requires downloading and decrypting the blob, i.e. deep
       verification)
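
A sketch of the metadata hash check (steps 2–6), assuming the uploaded hash
object contains a hex-encoded SHA256 string; the function name and that
encoding are assumptions, not fixed by this design:
```go
package vaultik

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "strings"

    "filippo.io/age"
)

// verifyMetadataHash decrypts metadata/<snapshot_id>.hash.age and compares it
// against the SHA256 of the reassembled, already-decrypted snapshot database.
func verifyMetadataHash(encryptedHash, snapshotDB io.Reader, id age.Identity) error {
    hr, err := age.Decrypt(encryptedHash, id)
    if err != nil {
        return err
    }
    wantBytes, err := io.ReadAll(hr)
    if err != nil {
        return err
    }
    want := strings.TrimSpace(string(wantBytes)) // assumed to be a hex string
    h := sha256.New()
    if _, err := io.Copy(h, snapshotDB); err != nil {
        return err
    }
    got := hex.EncodeToString(h.Sum(nil))
    if want != got {
        return fmt.Errorf("snapshot metadata hash mismatch: want %s, got %s", want, got)
    }
    return nil
}
```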
---
## 6. CLI Commands
```
vaultik backup /etc/vaultik.yaml
vaultik restore <bucket> <prefix> <snapshot_id> <target_dir>
vaultik prune <bucket> <prefix>
```
* `VAULTIK_PRIVATE_KEY` is required for the `restore`, `prune`, and
  `retrieve` commands.
* It is passed via an environment variable.
---
## 7. Function and Method Signatures
### 7.1 CLI
```go
func RootCmd() *cobra.Command
func backupCmd() *cobra.Command
func restoreCmd() *cobra.Command
func pruneCmd() *cobra.Command
func verifyCmd() *cobra.Command
```
### 7.2 Configuration
```go
type Config struct {
    BackupPubKey      string        // age recipient
    BackupInterval    time.Duration // used in daemon mode, irrelevant for cron mode
    BlobSizeLimit     int64         // default 10GB
    ChunkSize         int64         // default 10MB
    Exclude           []string      // list of regex of files to exclude from backup, absolute path
    Hostname          string
    IndexPath         string        // path to local SQLite index db, default /var/lib/vaultik/index.db
    MetadataPrefix    string        // S3 prefix for metadata, default "metadata/"
    MinTimeBetweenRun time.Duration // minimum time between backup runs, default 1 hour - for daemon mode
    S3                S3Config      // S3 configuration
    ScanInterval      time.Duration // interval to full stat() scan source dirs, default 24h
    SourceDirs        []string      // list of source directories to back up, absolute paths
}

type S3Config struct {
    Endpoint        string
    Bucket          string
    Prefix          string
    AccessKeyID     string
    SecretAccessKey string
    Region          string
}
func Load(path string) (*Config, error)
```
### 7.3 Index
```go
type Index struct {
    db *sql.DB
}
func OpenIndex(path string) (*Index, error)
func (ix *Index) LookupFile(path string, mtime int64, size int64) ([]string, bool, error)
func (ix *Index) SaveFile(path string, mtime int64, size int64, chunkHashes []string) error
func (ix *Index) AddChunk(chunkHash string, size int64) error
func (ix *Index) MarkBlob(blobHash, finalHash string, created time.Time) error
func (ix *Index) MapChunkToBlob(blobHash, chunkHash string, offset, length int64) error
func (ix *Index) MapChunkToFile(chunkHash, filePath string, offset, length int64) error
```
### 7.4 Blob Packing
```go
type BlobWriter struct {
    // internal buffer, current size, encrypted writer, etc
}
func NewBlobWriter(...) *BlobWriter
func (bw *BlobWriter) AddChunk(chunk []byte, chunkHash string) error
func (bw *BlobWriter) Flush() (finalBlobHash string, err error)
```
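
One way to drive this contract from the backup loop is sketched below. Whether
size tracking lives inside or outside `BlobWriter` is an open choice; only
`AddChunk` and `Flush` come from the design, everything else is illustrative:
```go
package vaultik

// chunk pairs a chunk's SHA256 hash with its raw bytes (illustrative type).
type chunk struct {
    hash string
    data []byte
}

// packChunks accumulates chunks until the configured blob size limit is
// reached, then flushes and hands the finished blob's hash to onFlush
// (e.g. to upload it and record the blob-chunk layout in the index).
func packChunks(bw *BlobWriter, chunks []chunk, sizeLimit int64, onFlush func(blobHash string) error) error {
    var pending int64
    flush := func() error {
        blobHash, err := bw.Flush()
        if err != nil {
            return err
        }
        pending = 0
        return onFlush(blobHash)
    }
    for _, c := range chunks {
        if err := bw.AddChunk(c.data, c.hash); err != nil {
            return err
        }
        pending += int64(len(c.data))
        if pending >= sizeLimit {
            if err := flush(); err != nil {
                return err
            }
        }
    }
    if pending > 0 {
        return flush() // final, possibly undersized blob
    }
    return nil
}
```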
### 7.5 Metadata
```go
func BuildSnapshotMetadata(ix *Index, snapshotID string) (sqlitePath string, err error)
func EncryptAndUploadMetadata(path string, cfg *Config, snapshotID string) error
```
### 7.6 Prune
```go
func RunPrune(bucket, prefix, privateKey string) error
```
---
## Implementation TODO
### Phase 1: Core Infrastructure
1. Set up Go module and project structure
2. Create Makefile with test, fmt, and lint targets
3. Set up cobra CLI skeleton with all commands
4. Implement config loading and validation from YAML
5. Create data structures for FileInfo, ChunkInfo, BlobInfo, etc.
### Phase 2: Local Index Database
6. Implement SQLite schema creation and migrations
7. Create Index type with all database operations
8. Add transaction support and proper locking
9. Implement file tracking (save, lookup, delete)
10. Implement chunk tracking and deduplication
11. Implement blob tracking and chunk-to-blob mapping
12. Write tests for all index operations
### Phase 3: Chunking and Hashing
13. Implement Rabin fingerprint chunker
14. Create streaming chunk processor
15. Implement SHA256 hashing for chunks
16. Add configurable chunk size parameters
17. Write tests for chunking consistency
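
For reference, the boundary-detection idea behind content-defined chunking
looks like the sketch below. It uses a simple windowed rolling hash rather
than real Rabin fingerprints, and the size constants are placeholders rather
than the configured defaults:
```go
package vaultik

import (
    "bufio"
    "io"
)

// chunkStream splits r into content-defined chunks: a cut point is declared
// whenever a rolling hash over the last windowSize bytes has its low bits all
// zero, subject to minimum and maximum chunk sizes. This is a simplified
// stand-in for the Rabin fingerprint chunker; the constants are illustrative.
func chunkStream(r io.Reader, emit func(chunk []byte) error) error {
    const (
        windowSize = 64
        minChunk   = 1 << 20           // 1 MiB floor
        maxChunk   = 16 << 20          // 16 MiB ceiling
        mask       = uint64(1)<<20 - 1 // cut fires roughly once per 1 MiB of input
        multiplier = 1099511628211     // toy multiplier, not a Rabin polynomial
    )
    // multiplier^(windowSize-1) mod 2^64, used to roll the oldest byte out.
    var pow uint64 = 1
    for i := 0; i < windowSize-1; i++ {
        pow *= multiplier
    }
    br := bufio.NewReader(r)
    var (
        window [windowSize]byte
        hash   uint64
        pos    int
        chunk  []byte
    )
    for {
        b, err := br.ReadByte()
        if err == io.EOF {
            if len(chunk) > 0 {
                return emit(chunk)
            }
            return nil
        }
        if err != nil {
            return err
        }
        // Roll the hash: remove the byte leaving the window, add the new one.
        i := pos % windowSize
        hash = (hash-uint64(window[i])*pow)*multiplier + uint64(b)
        window[i] = b
        pos++
        chunk = append(chunk, b)
        if (len(chunk) >= minChunk && hash&mask == 0) || len(chunk) >= maxChunk {
            if err := emit(chunk); err != nil {
                return err
            }
            chunk = nil
        }
    }
}
```
Because cut points depend only on local content, an insertion early in a file
shifts only nearby boundaries, which is what makes deduplication across backup
runs effective.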
### Phase 4: Compression and Encryption
18. Implement zstd compression wrapper
19. Integrate age encryption library
20. Create Encryptor type for public key encryption
21. Create Decryptor type for private key decryption
22. Implement streaming encrypt/decrypt pipelines
23. Write tests for compression and encryption
### Phase 5: Blob Packing
24. Implement BlobWriter with size limits
25. Add chunk accumulation and flushing
26. Create blob hash calculation
27. Implement proper error handling and rollback
28. Write tests for blob packing scenarios
### Phase 6: S3 Operations
29. Integrate MinIO client library
30. Implement S3Client wrapper type
31. Add multipart upload support for large blobs
32. Implement retry logic with exponential backoff
33. Add connection pooling and timeout handling
34. Write tests using MinIO container
### Phase 7: Backup Command - Basic
35. Implement directory walking with exclusion patterns
36. Add file change detection using index
37. Integrate chunking pipeline for changed files
38. Implement blob upload coordination
39. Add progress reporting to stderr
40. Write integration tests for backup
### Phase 8: Snapshot Metadata
41. Implement snapshot metadata extraction from index
42. Create SQLite snapshot database builder
43. Add metadata compression and encryption
44. Implement metadata chunking for large snapshots
45. Add hash calculation and verification
46. Implement metadata upload to S3
47. Write tests for metadata operations
### Phase 9: Restore Command
48. Implement snapshot listing and selection
49. Add metadata download and reconstruction
50. Implement hash verification for metadata
51. Create file restoration logic with chunk retrieval
52. Add blob caching for efficiency
53. Implement proper file permissions and mtime restoration
54. Write integration tests for restore
### Phase 10: Prune Command
55. Implement latest snapshot detection
56. Add referenced blob extraction from metadata
57. Create S3 blob listing and comparison
58. Implement safe deletion of unreferenced blobs
59. Add dry-run mode for safety
60. Write tests for prune scenarios
### Phase 11: Verify Command
61. Implement metadata integrity checking
62. Add blob existence verification
63. Create optional deep verification mode
64. Implement detailed error reporting
65. Write tests for verification
### Phase 12: Fetch Command
66. Implement single-file metadata query
67. Add minimal blob downloading for file
68. Create streaming file reconstruction
69. Add support for output redirection
70. Write tests for fetch command
### Phase 13: Daemon Mode
71. Implement inotify watcher for Linux
72. Add dirty path tracking in index
73. Create periodic full scan scheduler
74. Implement backup interval enforcement
75. Add proper signal handling and shutdown
76. Write tests for daemon behavior
### Phase 14: Cron Mode
77. Implement silent operation mode
78. Add proper exit codes for cron
79. Implement lock file to prevent concurrent runs
80. Add error summary reporting
81. Write tests for cron mode
### Phase 15: Finalization
82. Add comprehensive logging throughout
83. Implement proper error wrapping and context
84. Add performance metrics collection
85. Create end-to-end integration tests
86. Write documentation and examples
87. Set up CI/CD pipeline