# Vaultik Architecture

This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.

## Overview

Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using [uber-go/fx](https://github.com/uber-go/fx).

## Data Flow

```
Source Files
      │
      ▼
┌─────────────────┐
│     Scanner     │  Walks directories, detects changed files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Chunker     │  Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Packer      │  Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    S3 Client    │  Uploads blobs to remote storage
└─────────────────┘
```

## Data Model

### Core Entities

The database tracks five primary entities and their relationships:

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Snapshot   │────▶│     File     │────▶│    Chunk     │
└──────────────┘     └──────────────┘     └──────────────┘
        │                                        │
        │                                        │
        ▼                                        ▼
┌──────────────┐                          ┌──────────────┐
│     Blob     │◀─────────────────────────│  BlobChunk   │
└──────────────┘                          └──────────────┘
```

### Entity Descriptions

#### File (`database.File`)

Represents a file or directory in the backup system. Stores the metadata needed for restoration:

- Path, timestamps (mtime, ctime)
- Size, mode, ownership (uid, gid)
- Symlink target (if applicable)

#### Chunk (`database.Chunk`)

A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:

- `ChunkHash`: SHA256 hash of the chunk content (primary key)
- `Size`: Chunk size in bytes

Chunk sizes vary between `avgChunkSize/4` and `avgChunkSize*4` (typically 16KB–256KB for a 64KB average).
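As a minimal sketch of the size bounds above (the `chunkBounds` helper is hypothetical, not part of Vaultik's API — the real limits are derived inside `chunker.NewChunker`):

```go
package main

import "fmt"

// chunkBounds illustrates how the chunker's size limits are derived
// from the configured average chunk size: min = avg/4, max = avg*4.
func chunkBounds(avgChunkSize int) (min, max int) {
	return avgChunkSize / 4, avgChunkSize * 4
}

func main() {
	// For the typical 64KB average from the config:
	min, max := chunkBounds(64 * 1024)
	fmt.Printf("min=%dKB max=%dKB\n", min/1024, max/1024) // min=16KB max=256KB
}
```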
#### FileChunk (`database.FileChunk`)

Maps files to their constituent chunks:

- `FileID`: Reference to the file
- `Idx`: Position of this chunk within the file (0-indexed)
- `ChunkHash`: Reference to the chunk

#### Blob (`database.Blob`)

The final storage unit uploaded to S3. Contains many chunks, compressed and encrypted together:

- `ID`: UUID assigned at creation
- `Hash`: SHA256 of the final compressed+encrypted content
- `UncompressedSize`: Total raw chunk data before compression
- `CompressedSize`: Size after zstd compression and age encryption
- `CreatedTS`, `FinishedTS`, `UploadedTS`: Lifecycle timestamps

Blob creation process:

1. Chunks are accumulated (up to MaxBlobSize, typically 10GB)
2. Compressed with zstd
3. Encrypted with age (recipients configured in config)
4. SHA256 hash computed → becomes the filename in S3
5. Uploaded to `blobs/{hash[0:2]}/{hash[2:4]}/{hash}`

#### BlobChunk (`database.BlobChunk`)

Maps chunks to their position within blobs:

- `BlobID`: Reference to the blob
- `ChunkHash`: Reference to the chunk
- `Offset`: Byte offset within the uncompressed blob
- `Length`: Chunk size

#### Snapshot (`database.Snapshot`)

Represents a point-in-time backup:

- `ID`: Format is `{hostname}-{YYYYMMDD}-{HHMMSS}Z`
- Tracks file count, chunk count, blob count, sizes, compression ratio
- `CompletedAt`: Null until the snapshot finishes successfully

#### SnapshotFile / SnapshotBlob

Join tables linking snapshots to their files and blobs.

### Relationship Summary

```
Snapshot 1──────────▶ N SnapshotFile N ◀────────── 1 File
Snapshot 1──────────▶ N SnapshotBlob N ◀────────── 1 Blob
File     1──────────▶ N FileChunk    N ◀────────── 1 Chunk
Blob     1──────────▶ N BlobChunk    N ◀────────── 1 Chunk
```

## Type Instantiation

### Application Startup

The CLI uses fx for dependency injection. Here is the instantiation order:

```go
// cli/app.go: NewApp()
fx.New(
    fx.Supply(config.ConfigPath(opts.ConfigPath)), // 1. Config path
    fx.Supply(opts.LogOptions),                    // 2. Log options
    fx.Provide(globals.New),                       // 3. Globals
    fx.Provide(log.New),                           // 4. Logger config
    config.Module,                                 // 5. Config
    database.Module,                               // 6. Database + Repositories
    log.Module,                                    // 7. Logger initialization
    s3.Module,                                     // 8. S3 client
    snapshot.Module,                               // 9. SnapshotManager + ScannerFactory
    fx.Provide(vaultik.New),                       // 10. Vaultik orchestrator
)
```

### Key Type Instantiation Points

#### 1. Config (`config.Config`)

- **Created by**: `config.Module` via `config.LoadConfig()`
- **When**: Application startup (fx DI)
- **Contains**: All configuration from the YAML file (S3 credentials, encryption keys, paths, etc.)

#### 2. Database (`database.DB`)

- **Created by**: `database.Module` via `database.New()`
- **When**: Application startup (fx DI)
- **Contains**: SQLite connection, path reference

#### 3. Repositories (`database.Repositories`)

- **Created by**: `database.Module` via `database.NewRepositories()`
- **When**: Application startup (fx DI)
- **Contains**: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)

#### 4. Vaultik (`vaultik.Vaultik`)

- **Created by**: `vaultik.New(VaultikParams)`
- **When**: Application startup (fx DI)
- **Contains**: All dependencies for backup operations

```go
type Vaultik struct {
    Globals         *globals.Globals
    Config          *config.Config
    DB              *database.DB
    Repositories    *database.Repositories
    S3Client        *s3.Client
    ScannerFactory  snapshot.ScannerFactory
    SnapshotManager *snapshot.SnapshotManager
    Shutdowner      fx.Shutdowner
    Fs              afero.Fs

    ctx    context.Context
    cancel context.CancelFunc
}
```

#### 5. SnapshotManager (`snapshot.SnapshotManager`)

- **Created by**: `snapshot.Module` via `snapshot.NewSnapshotManager()`
- **When**: Application startup (fx DI)
- **Responsibility**: Creates/completes snapshots, exports metadata to S3

#### 6. Scanner (`snapshot.Scanner`)

- **Created by**: `ScannerFactory(ScannerParams)`
- **When**: Each `CreateSnapshot()` call
- **Contains**: Chunker, Packer, progress reporter

```go
// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
    EnableProgress: !opts.Cron,
    Fs:             v.Fs,
})
```

#### 7. Chunker (`chunker.Chunker`)

- **Created by**: `chunker.NewChunker(avgChunkSize)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
  - `avgChunkSize`: From config (typically 64KB)
  - `minChunkSize`: avgChunkSize / 4
  - `maxChunkSize`: avgChunkSize * 4

#### 8. Packer (`blob.Packer`)

- **Created by**: `blob.NewPacker(PackerConfig)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
  - `MaxBlobSize`: Maximum blob size before finalization (typically 10GB)
  - `CompressionLevel`: zstd level (1-19)
  - `Recipients`: age public keys for encryption

```go
// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
    MaxBlobSize:      cfg.MaxBlobSize,
    CompressionLevel: cfg.CompressionLevel,
    Recipients:       cfg.AgeRecipients,
    Repositories:     cfg.Repositories,
    Fs:               cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)
```

## Module Responsibilities

### `internal/cli`

Entry point for the fx application. Combines all modules and handles signal interrupts.

Key functions:

- `NewApp(AppOptions)` → Creates fx.App with all modules
- `RunApp(ctx, app)` → Starts the app, handles graceful shutdown
- `RunWithApp(ctx, opts)` → Convenience wrapper

### `internal/vaultik`

Main orchestrator containing all dependencies and command implementations.

Key methods:

- `New(VaultikParams)` → Constructor (fx DI)
- `CreateSnapshot(opts)` → Main backup operation
- `ListSnapshots(jsonOutput)` → List available snapshots
- `VerifySnapshot(id, deep)` → Verify snapshot integrity
- `PurgeSnapshots(...)` → Remove old snapshots

### `internal/chunker`

Content-defined chunking using the FastCDC algorithm.
Key types:

- `Chunk` → Hash, Data, Offset, Size
- `Chunker` → avgChunkSize, minChunkSize, maxChunkSize

Key methods:

- `NewChunker(avgChunkSize)` → Constructor
- `ChunkReaderStreaming(reader, callback)` → Stream chunks with a callback (preferred)
- `ChunkReader(reader)` → Return all chunks at once (memory-intensive)

### `internal/blob`

Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.

Key types:

- `Packer` → Thread-safe blob accumulator
- `ChunkRef` → Hash + Data for adding to the packer
- `FinishedBlob` → Completed blob ready for upload
- `BlobWithReader` → FinishedBlob + io.Reader for streaming upload

Key methods:

- `NewPacker(PackerConfig)` → Constructor
- `AddChunk(ChunkRef)` → Add a chunk to the current blob
- `FinalizeBlob()` → Compress, encrypt, and hash the current blob
- `Flush()` → Finalize any in-progress blob
- `SetBlobHandler(func)` → Set the callback for upload

### `internal/snapshot`

#### Scanner

Orchestrates the backup process for a directory.

Key methods:

- `NewScanner(ScannerConfig)` → Constructor (creates Chunker + Packer)
- `Scan(ctx, path, snapshotID)` → Main scan operation

Scan phases:

1. **Phase 0**: Detect files deleted since previous snapshots
2. **Phase 1**: Walk the directory, identify files needing processing
3. **Phase 2**: Process files (chunk → pack → upload)

#### SnapshotManager

Manages the snapshot lifecycle and metadata export.

Key methods:

- `CreateSnapshot(ctx, hostname, version, commit)` → Create snapshot record
- `CompleteSnapshot(ctx, snapshotID)` → Mark snapshot complete
- `ExportSnapshotMetadata(ctx, dbPath, snapshotID)` → Export to S3
- `CleanupIncompleteSnapshots(ctx, hostname)` → Remove failed snapshots

### `internal/database`

SQLite database for the local index. Single-writer mode for thread safety.
Key types:

- `DB` → Database connection wrapper
- `Repositories` → Collection of all repository interfaces

Repository interfaces:

- `FilesRepository` → CRUD for File records
- `ChunksRepository` → CRUD for Chunk records
- `BlobsRepository` → CRUD for Blob records
- `SnapshotsRepository` → CRUD for Snapshot records
- Plus join table repositories (FileChunks, BlobChunks, etc.)

## Snapshot Creation Flow

```
CreateSnapshot(opts)
│
├─► CleanupIncompleteSnapshots()        // Critical: avoid dedup errors
│
├─► SnapshotManager.CreateSnapshot()    // Create DB record
│
├─► For each source directory:
│   │
│   ├─► scanner.Scan(ctx, path, snapshotID)
│   │   │
│   │   ├─► Phase 0: detectDeletedFiles()
│   │   │
│   │   ├─► Phase 1: scanPhase()
│   │   │     Walk directory
│   │   │     Check file metadata changes
│   │   │     Build list of files to process
│   │   │
│   │   └─► Phase 2: processPhase()
│   │         For each file:
│   │           chunker.ChunkReaderStreaming()
│   │           For each chunk:
│   │             packer.AddChunk()
│   │             If blob full → FinalizeBlob()
│   │               → handleBlobReady()
│   │               → s3Client.PutObjectWithProgress()
│   │         packer.Flush()            // Final blob
│   │
│   └─► Accumulate statistics
│
├─► SnapshotManager.UpdateSnapshotStatsExtended()
│
├─► SnapshotManager.CompleteSnapshot()
│
└─► SnapshotManager.ExportSnapshotMetadata()
    │
    ├─► Copy database to temp file
    ├─► Clean to only current snapshot data
    ├─► Dump to SQL
    ├─► Compress with zstd
    ├─► Encrypt with age
    ├─► Upload db.zst.age to S3
    └─► Upload manifest.json.zst to S3
```

## Deduplication Strategy

1. **File-level**: Files unchanged since the last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)
2. **Chunk-level**: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.
3. **Blob-level**: Blobs contain only unique chunks; duplicate chunks within a blob are skipped.
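Chunk-level deduplication can be sketched in a few lines. This is a simplified stand-in, not Vaultik's actual repository code: `chunkIndex` plays the role of the `ChunksRepository` existence check, using only the standard library.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkIndex is a hypothetical in-memory stand-in for the chunk table:
// chunks are keyed by the SHA256 of their content, so identical data
// always maps to the same key regardless of which file it came from.
type chunkIndex map[string]bool

// addChunk returns the content hash and whether the chunk was new
// (must be packed and uploaded) or already known (deduplicated).
func (idx chunkIndex) addChunk(data []byte) (hash string, isNew bool) {
	sum := sha256.Sum256(data)
	hash = hex.EncodeToString(sum[:])
	if idx[hash] {
		return hash, false // hash already in the index: skip upload
	}
	idx[hash] = true
	return hash, true
}

func main() {
	idx := chunkIndex{}
	_, new1 := idx.addChunk([]byte("chunk contents"))
	_, new2 := idx.addChunk([]byte("chunk contents")) // same content → same hash
	fmt.Println(new1, new2) // true false
}
```

The same principle applies at blob granularity: before appending a chunk, the packer checks whether its hash is already present in the current blob.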
## Storage Layout in S3

```
bucket/
├── blobs/
│   └── {hash[0:2]}/
│       └── {hash[2:4]}/
│           └── {full-hash}        # Compressed+encrypted blob
│
└── metadata/
    └── {snapshot-id}/
        ├── db.zst.age             # Encrypted database dump
        └── manifest.json.zst      # Blob list (for verification)
```

## Thread Safety

- `Packer`: Thread-safe via mutex. Multiple goroutines can call `AddChunk()`.
- `Scanner`: Uses a `packerMu` mutex to coordinate blob finalization.
- `Database`: Single-writer mode (`MaxOpenConns=1`) ensures SQLite thread safety.
- `Repositories.WithTx()`: Handles transaction lifecycle automatically.
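The blob key derivation in the layout above can be sketched as follows. The `blobKey` helper is hypothetical (the real logic lives in the packer/S3 path), but the shape matches the documented scheme: the SHA256 of the final compressed+encrypted bytes becomes the object name, with two levels of hash-prefix directories to fan out the keyspace.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobKey derives an S3 object key of the form
// blobs/{hash[0:2]}/{hash[2:4]}/{hash} from the blob's final bytes.
func blobKey(encryptedBlob []byte) string {
	sum := sha256.Sum256(encryptedBlob)
	h := hex.EncodeToString(sum[:])
	return fmt.Sprintf("blobs/%s/%s/%s", h[0:2], h[2:4], h)
}

func main() {
	fmt.Println(blobKey([]byte("example blob bytes")))
}
```

Because the key is derived from content, re-uploading an identical blob is a no-op at the storage layer, which complements the chunk-level deduplication described earlier.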