Add ARCHITECTURE.md documenting internal design
Document the data model, type instantiation flow, and module responsibilities. Covers chunker, packer, vaultik, cli, snapshot, and database modules with detailed explanations of relationships between File, Chunk, Blob, and Snapshot entities.
Commit: cda0cf865a (parent: 0736bd070b)
ARCHITECTURE.md · 380 lines · new file

# Vaultik Architecture

This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.

## Overview

Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using [uber-go/fx](https://github.com/uber-go/fx).

## Data Flow

```
   Source Files
         │
         ▼
┌─────────────────┐
│     Scanner     │  Walks directories, detects changed files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Chunker     │  Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Packer      │  Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    S3 Client    │  Uploads blobs to remote storage
└─────────────────┘
```

## Data Model

### Core Entities

The database tracks five primary entities and their relationships:

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Snapshot   │────▶│     File     │────▶│    Chunk     │
└──────────────┘     └──────────────┘     └──────────────┘
        │                                        │
        │                                        │
        ▼                                        ▼
┌──────────────┐                          ┌──────────────┐
│     Blob     │◀─────────────────────────│  BlobChunk   │
└──────────────┘                          └──────────────┘
```

### Entity Descriptions

#### File (`database.File`)
Represents a file or directory in the backup system. Stores metadata needed for restoration:
- Path, timestamps (mtime, ctime)
- Size, mode, ownership (uid, gid)
- Symlink target (if applicable)

#### Chunk (`database.Chunk`)
A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:
- `ChunkHash`: SHA256 hash of chunk content (primary key)
- `Size`: Chunk size in bytes

Chunk sizes vary between `avgChunkSize/4` and `avgChunkSize*4` (typically 16KB-256KB for a 64KB average).
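
As a minimal illustration of content addressing (field names follow the schema above; the helper itself is a sketch, not Vaultik's implementation), a chunk's primary key is derived purely from its bytes, so identical content always maps to the same row:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Chunk mirrors the database.Chunk fields described above.
type Chunk struct {
	ChunkHash string // SHA256 of content, primary key
	Size      int64
}

// newChunk is a hypothetical helper: the key is computed from the data
// alone, which is what makes chunk-level deduplication possible.
func newChunk(data []byte) Chunk {
	sum := sha256.Sum256(data)
	return Chunk{ChunkHash: hex.EncodeToString(sum[:]), Size: int64(len(data))}
}

func main() {
	a := newChunk([]byte("hello"))
	b := newChunk([]byte("hello"))
	fmt.Println(a.ChunkHash == b.ChunkHash) // identical content, identical key
}
```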

#### FileChunk (`database.FileChunk`)
Maps files to their constituent chunks:
- `FileID`: Reference to the file
- `Idx`: Position of this chunk within the file (0-indexed)
- `ChunkHash`: Reference to the chunk
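
Restoring a file is then a matter of reading its FileChunk rows back in `Idx` order and concatenating the referenced chunk payloads. A sketch with hypothetical types (the real code fetches chunk bytes out of blobs via BlobChunk offsets):

```go
package main

import (
	"fmt"
	"sort"
)

// FileChunk mirrors the mapping fields described above.
type FileChunk struct {
	FileID    int64
	Idx       int
	ChunkHash string
}

// reassemble concatenates chunk payloads in Idx order. chunkData is a
// stand-in for resolving a ChunkHash to its bytes inside a blob.
func reassemble(rows []FileChunk, chunkData map[string][]byte) []byte {
	sort.Slice(rows, func(i, j int) bool { return rows[i].Idx < rows[j].Idx })
	var out []byte
	for _, r := range rows {
		out = append(out, chunkData[r.ChunkHash]...)
	}
	return out
}

func main() {
	data := map[string][]byte{"h1": []byte("hello "), "h2": []byte("world")}
	rows := []FileChunk{{1, 1, "h2"}, {1, 0, "h1"}} // rows may arrive unordered
	fmt.Printf("%s\n", reassemble(rows, data))      // hello world
}
```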

#### Blob (`database.Blob`)
The final storage unit uploaded to S3. Contains many compressed and encrypted chunks:
- `ID`: UUID assigned at creation
- `Hash`: SHA256 of final compressed+encrypted content
- `UncompressedSize`: Total raw chunk data before compression
- `CompressedSize`: Size after zstd compression and age encryption
- `CreatedTS`, `FinishedTS`, `UploadedTS`: Lifecycle timestamps

Blob creation process:
1. Chunks are accumulated (up to `MaxBlobSize`, typically 10GB)
2. Compressed with zstd
3. Encrypted with age (recipients configured in config)
4. SHA256 hash computed → becomes filename in S3
5. Uploaded to `blobs/{hash[0:2]}/{hash[2:4]}/{hash}`
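
The fan-out key in step 5 can be derived from the hash alone; a sketch (the helper name is illustrative, not Vaultik's API):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobKey builds the S3 object key blobs/{hash[0:2]}/{hash[2:4]}/{hash},
// spreading blobs across prefixes instead of one huge flat listing.
func blobKey(hash string) string {
	return fmt.Sprintf("blobs/%s/%s/%s", hash[0:2], hash[2:4], hash)
}

func main() {
	// In the real pipeline the hash covers the compressed+encrypted blob;
	// here a placeholder payload is hashed just to produce a realistic key.
	sum := sha256.Sum256([]byte("example blob bytes"))
	fmt.Println(blobKey(hex.EncodeToString(sum[:])))
}
```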

#### BlobChunk (`database.BlobChunk`)
Maps chunks to their position within blobs:
- `BlobID`: Reference to the blob
- `ChunkHash`: Reference to the chunk
- `Offset`: Byte offset within the uncompressed blob
- `Length`: Chunk size

#### Snapshot (`database.Snapshot`)
Represents a point-in-time backup:
- `ID`: Format is `{hostname}-{YYYYMMDD}-{HHMMSS}Z`
- Tracks file count, chunk count, blob count, sizes, compression ratio
- `CompletedAt`: Null until snapshot finishes successfully

#### SnapshotFile / SnapshotBlob
Join tables linking snapshots to their files and blobs.

### Relationship Summary

```
Snapshot 1──────────▶ N SnapshotFile N ◀────────── 1 File
Snapshot 1──────────▶ N SnapshotBlob N ◀────────── 1 Blob
File     1──────────▶ N FileChunk    N ◀────────── 1 Chunk
Blob     1──────────▶ N BlobChunk    N ◀────────── 1 Chunk
```

## Type Instantiation

### Application Startup

The CLI uses fx for dependency injection. Here's the instantiation order:

```go
// cli/app.go: NewApp()
fx.New(
    fx.Supply(config.ConfigPath(opts.ConfigPath)), // 1. Config path
    fx.Supply(opts.LogOptions),                    // 2. Log options
    fx.Provide(globals.New),                       // 3. Globals
    fx.Provide(log.New),                           // 4. Logger config
    config.Module,                                 // 5. Config
    database.Module,                               // 6. Database + Repositories
    log.Module,                                    // 7. Logger initialization
    s3.Module,                                     // 8. S3 client
    snapshot.Module,                               // 9. SnapshotManager + ScannerFactory
    fx.Provide(vaultik.New),                       // 10. Vaultik orchestrator
)
```

### Key Type Instantiation Points

#### 1. Config (`config.Config`)
- **Created by**: `config.Module` via `config.LoadConfig()`
- **When**: Application startup (fx DI)
- **Contains**: All configuration from the YAML file (S3 credentials, encryption keys, paths, etc.)

#### 2. Database (`database.DB`)
- **Created by**: `database.Module` via `database.New()`
- **When**: Application startup (fx DI)
- **Contains**: SQLite connection, path reference

#### 3. Repositories (`database.Repositories`)
- **Created by**: `database.Module` via `database.NewRepositories()`
- **When**: Application startup (fx DI)
- **Contains**: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)

#### 4. Vaultik (`vaultik.Vaultik`)
- **Created by**: `vaultik.New(VaultikParams)`
- **When**: Application startup (fx DI)
- **Contains**: All dependencies for backup operations

```go
type Vaultik struct {
    Globals         *globals.Globals
    Config          *config.Config
    DB              *database.DB
    Repositories    *database.Repositories
    S3Client        *s3.Client
    ScannerFactory  snapshot.ScannerFactory
    SnapshotManager *snapshot.SnapshotManager
    Shutdowner      fx.Shutdowner
    Fs              afero.Fs

    ctx    context.Context
    cancel context.CancelFunc
}
```

#### 5. SnapshotManager (`snapshot.SnapshotManager`)
- **Created by**: `snapshot.Module` via `snapshot.NewSnapshotManager()`
- **When**: Application startup (fx DI)
- **Responsibility**: Creates/completes snapshots, exports metadata to S3

#### 6. Scanner (`snapshot.Scanner`)
- **Created by**: `ScannerFactory(ScannerParams)`
- **When**: Each `CreateSnapshot()` call
- **Contains**: Chunker, Packer, progress reporter

```go
// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
    EnableProgress: !opts.Cron,
    Fs:             v.Fs,
})
```

#### 7. Chunker (`chunker.Chunker`)
- **Created by**: `chunker.NewChunker(avgChunkSize)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
  - `avgChunkSize`: From config (typically 64KB)
  - `minChunkSize`: avgChunkSize / 4
  - `maxChunkSize`: avgChunkSize * 4

#### 8. Packer (`blob.Packer`)
- **Created by**: `blob.NewPacker(PackerConfig)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
  - `MaxBlobSize`: Maximum blob size before finalization (typically 10GB)
  - `CompressionLevel`: zstd level (1-19)
  - `Recipients`: age public keys for encryption

```go
// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
    MaxBlobSize:      cfg.MaxBlobSize,
    CompressionLevel: cfg.CompressionLevel,
    Recipients:       cfg.AgeRecipients,
    Repositories:     cfg.Repositories,
    Fs:               cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)
```

## Module Responsibilities

### `internal/cli`
Entry point for the fx application. Combines all modules and handles signal interrupts.

Key functions:
- `NewApp(AppOptions)` → Creates fx.App with all modules
- `RunApp(ctx, app)` → Starts app, handles graceful shutdown
- `RunWithApp(ctx, opts)` → Convenience wrapper

### `internal/vaultik`
Main orchestrator containing all dependencies and command implementations.

Key methods:
- `New(VaultikParams)` → Constructor (fx DI)
- `CreateSnapshot(opts)` → Main backup operation
- `ListSnapshots(jsonOutput)` → List available snapshots
- `VerifySnapshot(id, deep)` → Verify snapshot integrity
- `PurgeSnapshots(...)` → Remove old snapshots

### `internal/chunker`
Content-defined chunking using the FastCDC algorithm.

Key types:
- `Chunk` → Hash, Data, Offset, Size
- `Chunker` → avgChunkSize, minChunkSize, maxChunkSize

Key methods:
- `NewChunker(avgChunkSize)` → Constructor
- `ChunkReaderStreaming(reader, callback)` → Stream chunks with callback (preferred)
- `ChunkReader(reader)` → Return all chunks at once (memory-intensive)

### `internal/blob`
Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.

Key types:
- `Packer` → Thread-safe blob accumulator
- `ChunkRef` → Hash + Data for adding to packer
- `FinishedBlob` → Completed blob ready for upload
- `BlobWithReader` → FinishedBlob + io.Reader for streaming upload

Key methods:
- `NewPacker(PackerConfig)` → Constructor
- `AddChunk(ChunkRef)` → Add chunk to current blob
- `FinalizeBlob()` → Compress, encrypt, hash current blob
- `Flush()` → Finalize any in-progress blob
- `SetBlobHandler(func)` → Set callback for upload

### `internal/snapshot`

#### Scanner
Orchestrates the backup process for a directory.

Key methods:
- `NewScanner(ScannerConfig)` → Constructor (creates Chunker + Packer)
- `Scan(ctx, path, snapshotID)` → Main scan operation

Scan phases:
1. **Phase 0**: Detect deleted files from previous snapshots
2. **Phase 1**: Walk directory, identify files needing processing
3. **Phase 2**: Process files (chunk → pack → upload)

#### SnapshotManager
Manages snapshot lifecycle and metadata export.

Key methods:
- `CreateSnapshot(ctx, hostname, version, commit)` → Create snapshot record
- `CompleteSnapshot(ctx, snapshotID)` → Mark snapshot complete
- `ExportSnapshotMetadata(ctx, dbPath, snapshotID)` → Export to S3
- `CleanupIncompleteSnapshots(ctx, hostname)` → Remove failed snapshots

### `internal/database`
SQLite database for the local index. Single-writer mode for thread safety.

Key types:
- `DB` → Database connection wrapper
- `Repositories` → Collection of all repository interfaces

Repository interfaces:
- `FilesRepository` → CRUD for File records
- `ChunksRepository` → CRUD for Chunk records
- `BlobsRepository` → CRUD for Blob records
- `SnapshotsRepository` → CRUD for Snapshot records
- Plus join-table repositories (FileChunks, BlobChunks, etc.)

## Snapshot Creation Flow

```
CreateSnapshot(opts)
  │
  ├─► CleanupIncompleteSnapshots()        // Critical: avoid dedup errors
  │
  ├─► SnapshotManager.CreateSnapshot()    // Create DB record
  │
  ├─► For each source directory:
  │     │
  │     ├─► scanner.Scan(ctx, path, snapshotID)
  │     │     │
  │     │     ├─► Phase 0: detectDeletedFiles()
  │     │     │
  │     │     ├─► Phase 1: scanPhase()
  │     │     │     Walk directory
  │     │     │     Check file metadata changes
  │     │     │     Build list of files to process
  │     │     │
  │     │     └─► Phase 2: processPhase()
  │     │           For each file:
  │     │             chunker.ChunkReaderStreaming()
  │     │             For each chunk:
  │     │               packer.AddChunk()
  │     │               If blob full → FinalizeBlob()
  │     │                 → handleBlobReady()
  │     │                 → s3Client.PutObjectWithProgress()
  │     │           packer.Flush()        // Final blob
  │     │
  │     └─► Accumulate statistics
  │
  ├─► SnapshotManager.UpdateSnapshotStatsExtended()
  │
  ├─► SnapshotManager.CompleteSnapshot()
  │
  └─► SnapshotManager.ExportSnapshotMetadata()
        │
        ├─► Copy database to temp file
        ├─► Clean to only current snapshot data
        ├─► Dump to SQL
        ├─► Compress with zstd
        ├─► Encrypt with age
        ├─► Upload db.zst.age to S3
        └─► Upload manifest.json.zst to S3
```

## Deduplication Strategy

1. **File-level**: Files unchanged since the last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)

2. **Chunk-level**: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.

3. **Blob-level**: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
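
Chunk-level dedup reduces to a membership test on the hash before any bytes are packed. A sketch with an in-memory set standing in for the `ChunksRepository` lookup (names are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// dedupIndex stands in for the ChunksRepository lookup: once a hash is
// known, that chunk's bytes never need to be packed or uploaded again.
type dedupIndex map[string]bool

// addChunk returns true only when the chunk content was new.
func (d dedupIndex) addChunk(data []byte) bool {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	if d[key] {
		return false // already have this content; skip upload
	}
	d[key] = true
	return true
}

func main() {
	idx := dedupIndex{}
	fmt.Println(idx.addChunk([]byte("block A"))) // true: new
	fmt.Println(idx.addChunk([]byte("block B"))) // true: new
	fmt.Println(idx.addChunk([]byte("block A"))) // false: duplicate, skipped
}
```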

## Storage Layout in S3

```
bucket/
├── blobs/
│   └── {hash[0:2]}/
│       └── {hash[2:4]}/
│           └── {full-hash}          # Compressed+encrypted blob
│
└── metadata/
    └── {snapshot-id}/
        ├── db.zst.age               # Encrypted database dump
        └── manifest.json.zst        # Blob list (for verification)
```

## Thread Safety

- `Packer`: Thread-safe via mutex. Multiple goroutines can call `AddChunk()`.
- `Scanner`: Uses a `packerMu` mutex to coordinate blob finalization.
- `Database`: Single-writer mode (`MaxOpenConns=1`) ensures SQLite thread safety.
- `Repositories.WithTx()`: Handles transaction lifecycle automatically.