sneak cda0cf865a Add ARCHITECTURE.md documenting internal design
Document the data model, type instantiation flow, and module
responsibilities. Covers chunker, packer, vaultik, cli, snapshot,
and database modules with detailed explanations of relationships
between File, Chunk, Blob, and Snapshot entities.
2025-12-18 19:49:42 -08:00


# Vaultik Architecture
This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.
## Overview
Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using [uber-go/fx](https://github.com/uber-go/fx).
## Data Flow
```
Source Files
         │
         ▼
┌─────────────────┐
│     Scanner     │  Walks directories, detects changed files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Chunker     │  Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Packer      │  Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    S3 Client    │  Uploads blobs to remote storage
└─────────────────┘
```
## Data Model
### Core Entities
The database tracks four primary entities (Snapshot, File, Chunk, Blob) plus the join tables (FileChunk, BlobChunk, SnapshotFile, SnapshotBlob) that relate them:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Snapshot   │────▶│     File     │────▶│    Chunk     │
└──────────────┘     └──────────────┘     └──────────────┘
        │                                         │
        │                                         │
        ▼                                         ▼
┌──────────────┐                          ┌──────────────┐
│     Blob     │◀─────────────────────────│  BlobChunk   │
└──────────────┘                          └──────────────┘
```
### Entity Descriptions
#### File (`database.File`)
Represents a file or directory in the backup system. Stores metadata needed for restoration:
- Path, timestamps (mtime, ctime)
- Size, mode, ownership (uid, gid)
- Symlink target (if applicable)
#### Chunk (`database.Chunk`)
A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:
- `ChunkHash`: SHA256 hash of chunk content (primary key)
- `Size`: Chunk size in bytes
Chunk sizes vary between `avgChunkSize/4` and `avgChunkSize*4` (typically 16KB-256KB for 64KB average).
#### FileChunk (`database.FileChunk`)
Maps files to their constituent chunks:
- `FileID`: Reference to the file
- `Idx`: Position of this chunk within the file (0-indexed)
- `ChunkHash`: Reference to the chunk
#### Blob (`database.Blob`)
The final storage unit uploaded to S3. A blob packs many chunks together, and the combined data is compressed and encrypted as a unit:
- `ID`: UUID assigned at creation
- `Hash`: SHA256 of final compressed+encrypted content
- `UncompressedSize`: Total raw chunk data before compression
- `CompressedSize`: Size after zstd compression and age encryption
- `CreatedTS`, `FinishedTS`, `UploadedTS`: Lifecycle timestamps
Blob creation process:
1. Chunks are accumulated (up to MaxBlobSize, typically 10GB)
2. Compressed with zstd
3. Encrypted with age (recipients configured in config)
4. SHA256 hash computed → becomes filename in S3
5. Uploaded to `blobs/{hash[0:2]}/{hash[2:4]}/{hash}`
#### BlobChunk (`database.BlobChunk`)
Maps chunks to their position within blobs:
- `BlobID`: Reference to the blob
- `ChunkHash`: Reference to the chunk
- `Offset`: Byte offset within the uncompressed blob
- `Length`: Chunk size
#### Snapshot (`database.Snapshot`)
Represents a point-in-time backup:
- `ID`: Format is `{hostname}-{YYYYMMDD}-{HHMMSS}Z`
- Tracks file count, chunk count, blob count, sizes, compression ratio
- `CompletedAt`: Null until snapshot finishes successfully
#### SnapshotFile / SnapshotBlob
Join tables linking snapshots to their files and blobs.
### Relationship Summary
```
Snapshot 1 ────────▶ N SnapshotFile N ◀──────── 1 File
Snapshot 1 ────────▶ N SnapshotBlob N ◀──────── 1 Blob
File     1 ────────▶ N FileChunk    N ◀──────── 1 Chunk
Blob     1 ────────▶ N BlobChunk    N ◀──────── 1 Chunk
```
## Type Instantiation
### Application Startup
The CLI uses fx for dependency injection. Here's the instantiation order:
```go
// cli/app.go: NewApp()
fx.New(
	fx.Supply(config.ConfigPath(opts.ConfigPath)), //  1. Config path
	fx.Supply(opts.LogOptions),                    //  2. Log options
	fx.Provide(globals.New),                       //  3. Globals
	fx.Provide(log.New),                           //  4. Logger config
	config.Module,                                 //  5. Config
	database.Module,                               //  6. Database + Repositories
	log.Module,                                    //  7. Logger initialization
	s3.Module,                                     //  8. S3 client
	snapshot.Module,                               //  9. SnapshotManager + ScannerFactory
	fx.Provide(vaultik.New),                       // 10. Vaultik orchestrator
)
```
### Key Type Instantiation Points
#### 1. Config (`config.Config`)
- **Created by**: `config.Module` via `config.LoadConfig()`
- **When**: Application startup (fx DI)
- **Contains**: All configuration from YAML file (S3 credentials, encryption keys, paths, etc.)
#### 2. Database (`database.DB`)
- **Created by**: `database.Module` via `database.New()`
- **When**: Application startup (fx DI)
- **Contains**: SQLite connection, path reference
#### 3. Repositories (`database.Repositories`)
- **Created by**: `database.Module` via `database.NewRepositories()`
- **When**: Application startup (fx DI)
- **Contains**: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)
#### 4. Vaultik (`vaultik.Vaultik`)
- **Created by**: `vaultik.New(VaultikParams)`
- **When**: Application startup (fx DI)
- **Contains**: All dependencies for backup operations
```go
type Vaultik struct {
	Globals         *globals.Globals
	Config          *config.Config
	DB              *database.DB
	Repositories    *database.Repositories
	S3Client        *s3.Client
	ScannerFactory  snapshot.ScannerFactory
	SnapshotManager *snapshot.SnapshotManager
	Shutdowner      fx.Shutdowner
	Fs              afero.Fs

	ctx    context.Context
	cancel context.CancelFunc
}
```
#### 5. SnapshotManager (`snapshot.SnapshotManager`)
- **Created by**: `snapshot.Module` via `snapshot.NewSnapshotManager()`
- **When**: Application startup (fx DI)
- **Responsibility**: Creates/completes snapshots, exports metadata to S3
#### 6. Scanner (`snapshot.Scanner`)
- **Created by**: `ScannerFactory(ScannerParams)`
- **When**: Each `CreateSnapshot()` call
- **Contains**: Chunker, Packer, progress reporter
```go
// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
	EnableProgress: !opts.Cron,
	Fs:             v.Fs,
})
```
#### 7. Chunker (`chunker.Chunker`)
- **Created by**: `chunker.NewChunker(avgChunkSize)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
- `avgChunkSize`: From config (typically 64KB)
- `minChunkSize`: avgChunkSize / 4
- `maxChunkSize`: avgChunkSize * 4
#### 8. Packer (`blob.Packer`)
- **Created by**: `blob.NewPacker(PackerConfig)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
- `MaxBlobSize`: Maximum blob size before finalization (typically 10GB)
- `CompressionLevel`: zstd level (1-19)
- `Recipients`: age public keys for encryption
```go
// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
	MaxBlobSize:      cfg.MaxBlobSize,
	CompressionLevel: cfg.CompressionLevel,
	Recipients:       cfg.AgeRecipients,
	Repositories:     cfg.Repositories,
	Fs:               cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)
```
## Module Responsibilities
### `internal/cli`
Entry point for fx application. Combines all modules and handles signal interrupts.
Key functions:
- `NewApp(AppOptions)` → Creates fx.App with all modules
- `RunApp(ctx, app)` → Starts app, handles graceful shutdown
- `RunWithApp(ctx, opts)` → Convenience wrapper
### `internal/vaultik`
Main orchestrator containing all dependencies and command implementations.
Key methods:
- `New(VaultikParams)` → Constructor (fx DI)
- `CreateSnapshot(opts)` → Main backup operation
- `ListSnapshots(jsonOutput)` → List available snapshots
- `VerifySnapshot(id, deep)` → Verify snapshot integrity
- `PurgeSnapshots(...)` → Remove old snapshots
### `internal/chunker`
Content-defined chunking using FastCDC algorithm.
Key types:
- `Chunk` → Hash, Data, Offset, Size
- `Chunker` → avgChunkSize, minChunkSize, maxChunkSize
Key methods:
- `NewChunker(avgChunkSize)` → Constructor
- `ChunkReaderStreaming(reader, callback)` → Stream chunks with callback (preferred)
- `ChunkReader(reader)` → Return all chunks at once (memory-intensive)
### `internal/blob`
Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.
Key types:
- `Packer` → Thread-safe blob accumulator
- `ChunkRef` → Hash + Data for adding to packer
- `FinishedBlob` → Completed blob ready for upload
- `BlobWithReader` → FinishedBlob + io.Reader for streaming upload
Key methods:
- `NewPacker(PackerConfig)` → Constructor
- `AddChunk(ChunkRef)` → Add chunk to current blob
- `FinalizeBlob()` → Compress, encrypt, hash current blob
- `Flush()` → Finalize any in-progress blob
- `SetBlobHandler(func)` → Set callback for upload
### `internal/snapshot`
#### Scanner
Orchestrates the backup process for a directory.
Key methods:
- `NewScanner(ScannerConfig)` → Constructor (creates Chunker + Packer)
- `Scan(ctx, path, snapshotID)` → Main scan operation
Scan phases:
1. **Phase 0**: Detect deleted files from previous snapshots
2. **Phase 1**: Walk directory, identify files needing processing
3. **Phase 2**: Process files (chunk → pack → upload)
#### SnapshotManager
Manages snapshot lifecycle and metadata export.
Key methods:
- `CreateSnapshot(ctx, hostname, version, commit)` → Create snapshot record
- `CompleteSnapshot(ctx, snapshotID)` → Mark snapshot complete
- `ExportSnapshotMetadata(ctx, dbPath, snapshotID)` → Export to S3
- `CleanupIncompleteSnapshots(ctx, hostname)` → Remove failed snapshots
### `internal/database`
SQLite database for local index. Single-writer mode for thread safety.
Key types:
- `DB` → Database connection wrapper
- `Repositories` → Collection of all repository interfaces
Repository interfaces:
- `FilesRepository` → CRUD for File records
- `ChunksRepository` → CRUD for Chunk records
- `BlobsRepository` → CRUD for Blob records
- `SnapshotsRepository` → CRUD for Snapshot records
- Plus join table repositories (FileChunks, BlobChunks, etc.)
## Snapshot Creation Flow
```
CreateSnapshot(opts)
├─► CleanupIncompleteSnapshots()       // Critical: avoid dedup errors
├─► SnapshotManager.CreateSnapshot()   // Create DB record
├─► For each source directory:
│   │
│   ├─► scanner.Scan(ctx, path, snapshotID)
│   │   │
│   │   ├─► Phase 0: detectDeletedFiles()
│   │   │
│   │   ├─► Phase 1: scanPhase()
│   │   │       Walk directory
│   │   │       Check file metadata changes
│   │   │       Build list of files to process
│   │   │
│   │   └─► Phase 2: processPhase()
│   │           For each file:
│   │             chunker.ChunkReaderStreaming()
│   │             For each chunk:
│   │               packer.AddChunk()
│   │               If blob full → FinalizeBlob()
│   │                 → handleBlobReady()
│   │                 → s3Client.PutObjectWithProgress()
│   │           packer.Flush()         // Final blob
│   │
│   └─► Accumulate statistics
├─► SnapshotManager.UpdateSnapshotStatsExtended()
├─► SnapshotManager.CompleteSnapshot()
└─► SnapshotManager.ExportSnapshotMetadata()
    ├─► Copy database to temp file
    ├─► Clean to only current snapshot data
    ├─► Dump to SQL
    ├─► Compress with zstd
    ├─► Encrypt with age
    ├─► Upload db.zst.age to S3
    └─► Upload manifest.json.zst to S3
```
## Deduplication Strategy
1. **File-level**: Files unchanged since last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)
2. **Chunk-level**: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.
3. **Blob-level**: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
## Storage Layout in S3
```
bucket/
├── blobs/
│   └── {hash[0:2]}/
│       └── {hash[2:4]}/
│           └── {full-hash}         # Compressed+encrypted blob
└── metadata/
    └── {snapshot-id}/
        ├── db.zst.age              # Encrypted database dump
        └── manifest.json.zst       # Blob list (for verification)
```
## Thread Safety
- `Packer`: Thread-safe via mutex. Multiple goroutines can call `AddChunk()`.
- `Scanner`: Uses `packerMu` mutex to coordinate blob finalization.
- `Database`: Single-writer mode (`MaxOpenConns=1`) ensures SQLite thread safety.
- `Repositories.WithTx()`: Handles transaction lifecycle automatically.