diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000..4cdb844
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,380 @@
+# Vaultik Architecture
+
+This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.
+
+## Overview
+
+Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using [uber-go/fx](https://github.com/uber-go/fx).
+
+## Data Flow
+
+```
+Source Files
+         │
+         ▼
+┌─────────────────┐
+│     Scanner     │  Walks directories, detects changed files
+└────────┬────────┘
+         │
+         ▼
+┌─────────────────┐
+│     Chunker     │  Splits files into variable-size chunks (FastCDC)
+└────────┬────────┘
+         │
+         ▼
+┌─────────────────┐
+│     Packer      │  Accumulates chunks, compresses (zstd), encrypts (age)
+└────────┬────────┘
+         │
+         ▼
+┌─────────────────┐
+│    S3 Client    │  Uploads blobs to remote storage
+└─────────────────┘
+```
+
+## Data Model
+
+### Core Entities
+
+The database tracks five primary entities and their relationships:
+
+```
+┌──────────────┐     ┌──────────────┐     ┌──────────────┐
+│   Snapshot   │────▶│     File     │────▶│    Chunk     │
+└──────────────┘     └──────────────┘     └──────────────┘
+        │                                         │
+        │                                         │
+        ▼                                         ▼
+┌──────────────┐                          ┌──────────────┐
+│     Blob     │◀─────────────────────────│  BlobChunk   │
+└──────────────┘                          └──────────────┘
+```
+
+### Entity Descriptions
+
+#### File (`database.File`)
+Represents a file or directory in the backup system. Stores metadata needed for restoration:
+- Path, timestamps (mtime, ctime)
+- Size, mode, ownership (uid, gid)
+- Symlink target (if applicable)
+
+#### Chunk (`database.Chunk`)
+A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:
+- `ChunkHash`: SHA256 hash of chunk content (primary key)
+- `Size`: Chunk size in bytes
+
+Chunk sizes vary between `avgChunkSize/4` and `avgChunkSize*4` (typically 16KB-256KB for 64KB average).
+
+#### FileChunk (`database.FileChunk`)
+Maps files to their constituent chunks:
+- `FileID`: Reference to the file
+- `Idx`: Position of this chunk within the file (0-indexed)
+- `ChunkHash`: Reference to the chunk
+
+#### Blob (`database.Blob`)
+The final storage unit uploaded to S3. Contains many compressed and encrypted chunks:
+- `ID`: UUID assigned at creation
+- `Hash`: SHA256 of final compressed+encrypted content
+- `UncompressedSize`: Total raw chunk data before compression
+- `CompressedSize`: Size after zstd compression and age encryption
+- `CreatedTS`, `FinishedTS`, `UploadedTS`: Lifecycle timestamps
+
+Blob creation process:
+1. Chunks are accumulated (up to MaxBlobSize, typically 10GB)
+2. Compressed with zstd
+3. Encrypted with age (recipients configured in config)
+4. SHA256 hash computed → becomes filename in S3
+5. Uploaded to `blobs/{hash[0:2]}/{hash[2:4]}/{hash}`
+
+#### BlobChunk (`database.BlobChunk`)
+Maps chunks to their position within blobs:
+- `BlobID`: Reference to the blob
+- `ChunkHash`: Reference to the chunk
+- `Offset`: Byte offset within the uncompressed blob
+- `Length`: Chunk size
+
+#### Snapshot (`database.Snapshot`)
+Represents a point-in-time backup:
+- `ID`: Format is `{hostname}-{YYYYMMDD}-{HHMMSS}Z`
+- Tracks file count, chunk count, blob count, sizes, compression ratio
+- `CompletedAt`: Null until snapshot finishes successfully
+
+#### SnapshotFile / SnapshotBlob
+Join tables linking snapshots to their files and blobs.
+
+### Relationship Summary
+
+```
+Snapshot 1──────────▶ N SnapshotFile N ◀────────── 1 File
+Snapshot 1──────────▶ N SnapshotBlob N ◀────────── 1 Blob
+File     1──────────▶ N FileChunk    N ◀────────── 1 Chunk
+Blob     1──────────▶ N BlobChunk    N ◀────────── 1 Chunk
+```
+
+## Type Instantiation
+
+### Application Startup
+
+The CLI uses fx for dependency injection. Here's the instantiation order:
+
+```go
+// cli/app.go: NewApp()
+fx.New(
+    fx.Supply(config.ConfigPath(opts.ConfigPath)),  // 1. Config path
+    fx.Supply(opts.LogOptions),                     // 2. Log options
+    fx.Provide(globals.New),                        // 3. Globals
+    fx.Provide(log.New),                            // 4. Logger config
+    config.Module,                                  // 5. Config
+    database.Module,                                // 6. Database + Repositories
+    log.Module,                                     // 7. Logger initialization
+    s3.Module,                                      // 8. S3 client
+    snapshot.Module,                                // 9. SnapshotManager + ScannerFactory
+    fx.Provide(vaultik.New),                        // 10. Vaultik orchestrator
+)
+```
+
+### Key Type Instantiation Points
+
+#### 1. Config (`config.Config`)
+- **Created by**: `config.Module` via `config.LoadConfig()`
+- **When**: Application startup (fx DI)
+- **Contains**: All configuration from YAML file (S3 credentials, encryption keys, paths, etc.)
+
+#### 2. Database (`database.DB`)
+- **Created by**: `database.Module` via `database.New()`
+- **When**: Application startup (fx DI)
+- **Contains**: SQLite connection, path reference
+
+#### 3. Repositories (`database.Repositories`)
+- **Created by**: `database.Module` via `database.NewRepositories()`
+- **When**: Application startup (fx DI)
+- **Contains**: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)
+
+#### 4. Vaultik (`vaultik.Vaultik`)
+- **Created by**: `vaultik.New(VaultikParams)`
+- **When**: Application startup (fx DI)
+- **Contains**: All dependencies for backup operations
+
+```go
+type Vaultik struct {
+    Globals         *globals.Globals
+    Config          *config.Config
+    DB              *database.DB
+    Repositories    *database.Repositories
+    S3Client        *s3.Client
+    ScannerFactory  snapshot.ScannerFactory
+    SnapshotManager *snapshot.SnapshotManager
+    Shutdowner      fx.Shutdowner
+    Fs              afero.Fs
+    ctx             context.Context
+    cancel          context.CancelFunc
+}
+```
+
+#### 5. SnapshotManager (`snapshot.SnapshotManager`)
+- **Created by**: `snapshot.Module` via `snapshot.NewSnapshotManager()`
+- **When**: Application startup (fx DI)
+- **Responsibility**: Creates/completes snapshots, exports metadata to S3
+
+#### 6. Scanner (`snapshot.Scanner`)
+- **Created by**: `ScannerFactory(ScannerParams)`
+- **When**: Each `CreateSnapshot()` call
+- **Contains**: Chunker, Packer, progress reporter
+
+```go
+// vaultik/snapshot.go: CreateSnapshot()
+scanner := v.ScannerFactory(snapshot.ScannerParams{
+    EnableProgress: !opts.Cron,
+    Fs:             v.Fs,
+})
+```
+
+#### 7. Chunker (`chunker.Chunker`)
+- **Created by**: `chunker.NewChunker(avgChunkSize)`
+- **When**: Inside `snapshot.NewScanner()`
+- **Configuration**:
+  - `avgChunkSize`: From config (typically 64KB)
+  - `minChunkSize`: avgChunkSize / 4
+  - `maxChunkSize`: avgChunkSize * 4
+
+#### 8. Packer (`blob.Packer`)
+- **Created by**: `blob.NewPacker(PackerConfig)`
+- **When**: Inside `snapshot.NewScanner()`
+- **Configuration**:
+  - `MaxBlobSize`: Maximum blob size before finalization (typically 10GB)
+  - `CompressionLevel`: zstd level (1-19)
+  - `Recipients`: age public keys for encryption
+
+```go
+// snapshot/scanner.go: NewScanner()
+packerCfg := blob.PackerConfig{
+    MaxBlobSize:      cfg.MaxBlobSize,
+    CompressionLevel: cfg.CompressionLevel,
+    Recipients:       cfg.AgeRecipients,
+    Repositories:     cfg.Repositories,
+    Fs:               cfg.FS,
+}
+packer, err := blob.NewPacker(packerCfg)
+```
+
+## Module Responsibilities
+
+### `internal/cli`
+Entry point for fx application. Combines all modules and handles signal interrupts.
+
+Key functions:
+- `NewApp(AppOptions)` → Creates fx.App with all modules
+- `RunApp(ctx, app)` → Starts app, handles graceful shutdown
+- `RunWithApp(ctx, opts)` → Convenience wrapper
+
+### `internal/vaultik`
+Main orchestrator containing all dependencies and command implementations.
+
+Key methods:
+- `New(VaultikParams)` → Constructor (fx DI)
+- `CreateSnapshot(opts)` → Main backup operation
+- `ListSnapshots(jsonOutput)` → List available snapshots
+- `VerifySnapshot(id, deep)` → Verify snapshot integrity
+- `PurgeSnapshots(...)` → Remove old snapshots
+
+### `internal/chunker`
+Content-defined chunking using FastCDC algorithm.
+
+Key types:
+- `Chunk` → Hash, Data, Offset, Size
+- `Chunker` → avgChunkSize, minChunkSize, maxChunkSize
+
+Key methods:
+- `NewChunker(avgChunkSize)` → Constructor
+- `ChunkReaderStreaming(reader, callback)` → Stream chunks with callback (preferred)
+- `ChunkReader(reader)` → Return all chunks at once (memory-intensive)
+
+### `internal/blob`
+Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.
+
+Key types:
+- `Packer` → Thread-safe blob accumulator
+- `ChunkRef` → Hash + Data for adding to packer
+- `FinishedBlob` → Completed blob ready for upload
+- `BlobWithReader` → FinishedBlob + io.Reader for streaming upload
+
+Key methods:
+- `NewPacker(PackerConfig)` → Constructor
+- `AddChunk(ChunkRef)` → Add chunk to current blob
+- `FinalizeBlob()` → Compress, encrypt, hash current blob
+- `Flush()` → Finalize any in-progress blob
+- `SetBlobHandler(func)` → Set callback for upload
+
+### `internal/snapshot`
+
+#### Scanner
+Orchestrates the backup process for a directory.
+
+Key methods:
+- `NewScanner(ScannerConfig)` → Constructor (creates Chunker + Packer)
+- `Scan(ctx, path, snapshotID)` → Main scan operation
+
+Scan phases:
+1. **Phase 0**: Detect deleted files from previous snapshots
+2. **Phase 1**: Walk directory, identify files needing processing
+3. **Phase 2**: Process files (chunk → pack → upload)
+
+#### SnapshotManager
+Manages snapshot lifecycle and metadata export.
+
+Key methods:
+- `CreateSnapshot(ctx, hostname, version, commit)` → Create snapshot record
+- `CompleteSnapshot(ctx, snapshotID)` → Mark snapshot complete
+- `ExportSnapshotMetadata(ctx, dbPath, snapshotID)` → Export to S3
+- `CleanupIncompleteSnapshots(ctx, hostname)` → Remove failed snapshots
+
+### `internal/database`
+SQLite database for local index. Single-writer mode for thread safety.
+
+Key types:
+- `DB` → Database connection wrapper
+- `Repositories` → Collection of all repository interfaces
+
+Repository interfaces:
+- `FilesRepository` → CRUD for File records
+- `ChunksRepository` → CRUD for Chunk records
+- `BlobsRepository` → CRUD for Blob records
+- `SnapshotsRepository` → CRUD for Snapshot records
+- Plus join table repositories (FileChunks, BlobChunks, etc.)
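
To tie the module descriptions above together, here is a rough sketch of how the `internal/blob` method set (`AddChunk`, `FinalizeBlob`, `Flush`, `SetBlobHandler`) could interact. It is a simplified stand-in under stated assumptions, not the real `blob.Packer`: the `sketchPacker` type and `packAll` helper are hypothetical, compression, encryption, hashing and locking are omitted, and the rule "finalize before a chunk would overflow the blob" is an assumption about when a blob counts as full.

```go
package main

import "fmt"

// ChunkRef mirrors the Hash + Data pair described for the packer input.
type ChunkRef struct {
	Hash string
	Data []byte
}

// sketchPacker is a hypothetical, single-threaded stand-in for blob.Packer.
// It only tracks sizes; it does not compress, encrypt, or hash anything.
type sketchPacker struct {
	maxBlobSize int
	current     []ChunkRef
	currentSize int
	handler     func(chunks []ChunkRef, size int) error
}

// SetBlobHandler registers the callback invoked for every finalized blob.
func (p *sketchPacker) SetBlobHandler(h func(chunks []ChunkRef, size int) error) {
	p.handler = h
}

// AddChunk appends a chunk, finalizing the in-progress blob first if the
// new chunk would push it past maxBlobSize (assumed fullness rule).
func (p *sketchPacker) AddChunk(c ChunkRef) error {
	if len(c.Data) == 0 {
		return fmt.Errorf("chunk %s has no data", c.Hash)
	}
	if p.currentSize > 0 && p.currentSize+len(c.Data) > p.maxBlobSize {
		if err := p.FinalizeBlob(); err != nil {
			return err
		}
	}
	p.current = append(p.current, c)
	p.currentSize += len(c.Data)
	return nil
}

// FinalizeBlob hands the accumulated chunks to the handler and resets state.
func (p *sketchPacker) FinalizeBlob() error {
	if len(p.current) == 0 {
		return nil
	}
	chunks, size := p.current, p.currentSize
	p.current, p.currentSize = nil, 0
	if p.handler == nil {
		return nil
	}
	return p.handler(chunks, size)
}

// Flush finalizes whatever is still in progress at the end of a scan.
func (p *sketchPacker) Flush() error { return p.FinalizeBlob() }

// packAll feeds chunks through the packer and returns the size of each blob.
func packAll(maxBlobSize int, chunks []ChunkRef) ([]int, error) {
	var sizes []int
	p := &sketchPacker{maxBlobSize: maxBlobSize}
	p.SetBlobHandler(func(_ []ChunkRef, size int) error {
		sizes = append(sizes, size)
		return nil
	})
	for _, c := range chunks {
		if err := p.AddChunk(c); err != nil {
			return nil, err
		}
	}
	if err := p.Flush(); err != nil {
		return nil, err
	}
	return sizes, nil
}

func main() {
	chunks := []ChunkRef{
		{Hash: "a", Data: make([]byte, 40)},
		{Hash: "b", Data: make([]byte, 40)},
		{Hash: "c", Data: make([]byte, 40)},
	}
	sizes, err := packAll(100, chunks)
	fmt.Println(sizes, err)
}
```

The handler callback is what lets the scanner start an upload as soon as a blob is complete, rather than waiting for the whole scan to finish.
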
+
+## Snapshot Creation Flow
+
+```
+CreateSnapshot(opts)
+    │
+    ├─► CleanupIncompleteSnapshots()        // Critical: avoid dedup errors
+    │
+    ├─► SnapshotManager.CreateSnapshot()    // Create DB record
+    │
+    ├─► For each source directory:
+    │   │
+    │   ├─► scanner.Scan(ctx, path, snapshotID)
+    │   │   │
+    │   │   ├─► Phase 0: detectDeletedFiles()
+    │   │   │
+    │   │   ├─► Phase 1: scanPhase()
+    │   │   │       Walk directory
+    │   │   │       Check file metadata changes
+    │   │   │       Build list of files to process
+    │   │   │
+    │   │   └─► Phase 2: processPhase()
+    │   │           For each file:
+    │   │             chunker.ChunkReaderStreaming()
+    │   │             For each chunk:
+    │   │               packer.AddChunk()
+    │   │               If blob full → FinalizeBlob()
+    │   │                 → handleBlobReady()
+    │   │                 → s3Client.PutObjectWithProgress()
+    │   │             packer.Flush()  // Final blob
+    │   │
+    │   └─► Accumulate statistics
+    │
+    ├─► SnapshotManager.UpdateSnapshotStatsExtended()
+    │
+    ├─► SnapshotManager.CompleteSnapshot()
+    │
+    └─► SnapshotManager.ExportSnapshotMetadata()
+            │
+            ├─► Copy database to temp file
+            ├─► Clean to only current snapshot data
+            ├─► Dump to SQL
+            ├─► Compress with zstd
+            ├─► Encrypt with age
+            ├─► Upload db.zst.age to S3
+            └─► Upload manifest.json.zst to S3
+```
+
+## Deduplication Strategy
+
+1. **File-level**: Files unchanged since last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)
+
+2. **Chunk-level**: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.
+
+3. **Blob-level**: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
+
+## Storage Layout in S3
+
+```
+bucket/
+├── blobs/
+│   └── {hash[0:2]}/
+│       └── {hash[2:4]}/
+│           └── {full-hash}          # Compressed+encrypted blob
+│
+└── metadata/
+    └── {snapshot-id}/
+        ├── db.zst.age               # Encrypted database dump
+        └── manifest.json.zst        # Blob list (for verification)
+```
+
+## Thread Safety
+
+- `Packer`: Thread-safe via mutex. Multiple goroutines can call `AddChunk()`.
+- `Scanner`: Uses `packerMu` mutex to coordinate blob finalization.
+- `Database`: Single-writer mode (`MaxOpenConns=1`) ensures SQLite thread safety.
+- `Repositories.WithTx()`: Handles transaction lifecycle automatically.