vaultik/ARCHITECTURE.md
sneak cda0cf865a Add ARCHITECTURE.md documenting internal design
Document the data model, type instantiation flow, and module
responsibilities. Covers chunker, packer, vaultik, cli, snapshot,
and database modules with detailed explanations of relationships
between File, Chunk, Blob, and Snapshot entities.
2025-12-18 19:49:42 -08:00

Vaultik Architecture

This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.

Overview

Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using uber-go/fx.

Data Flow

Source Files
     │
     ▼
┌─────────────────┐
│    Scanner      │  Walks directories, detects changed files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Chunker      │  Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Packer       │  Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   S3 Client     │  Uploads blobs to remote storage
└─────────────────┘

Data Model

Core Entities

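The database tracks four primary entities (Snapshot, File, Chunk, Blob) and the join tables that link them. The diagram shows BlobChunk; FileChunk, SnapshotFile, and SnapshotBlob are described below: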

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Snapshot   │────▶│     File     │────▶│    Chunk     │
└──────────────┘     └──────────────┘     └──────────────┘
       │                                         │
       │                                         │
       ▼                                         ▼
┌──────────────┐                          ┌──────────────┐
│     Blob     │◀─────────────────────────│  BlobChunk   │
└──────────────┘                          └──────────────┘

Entity Descriptions

File (database.File)

Represents a file or directory in the backup system. Stores metadata needed for restoration:

  • Path, timestamps (mtime, ctime)
  • Size, mode, ownership (uid, gid)
  • Symlink target (if applicable)

Chunk (database.Chunk)

A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:

  • ChunkHash: SHA256 hash of chunk content (primary key)
  • Size: Chunk size in bytes

Chunk sizes vary between avgChunkSize/4 and avgChunkSize*4 (typically 16KB-256KB for 64KB average).
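
The size bounds and the content address can be illustrated with a minimal Go sketch. The function names chunkBounds and chunkHash are hypothetical, and hex encoding of the digest is an assumption (the text above only states that SHA256 is used).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkBounds derives the minimum and maximum chunk sizes from the average,
// using the avg/4 and avg*4 rule described above.
func chunkBounds(avgChunkSize int) (minSize, maxSize int) {
	return avgChunkSize / 4, avgChunkSize * 4
}

// chunkHash returns the hex-encoded SHA256 digest of the chunk content.
// Hex encoding is an assumption made for this sketch.
func chunkHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	minSize, maxSize := chunkBounds(64 * 1024)
	fmt.Println(minSize, maxSize) // 16384 262144
	fmt.Println(chunkHash([]byte("abc")))
}
```

Because the key is derived only from content, two identical chunks from different files map to the same ChunkHash.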

FileChunk (database.FileChunk)

Maps files to their constituent chunks:

  • FileID: Reference to the file
  • Idx: Position of this chunk within the file (0-indexed)
  • ChunkHash: Reference to the chunk
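
As an illustration of how Idx orders chunks within a file, the following sketch rebuilds file content from FileChunk rows. The Go field types, the reassemble function, and the in-memory store map (standing in for chunk retrieval) are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// FileChunk mirrors the three fields listed above; the Go types are assumptions.
type FileChunk struct {
	FileID    string
	Idx       int
	ChunkHash string
}

// reassemble concatenates chunk data in Idx order. The store map is a
// hypothetical stand-in for fetching chunk data by hash.
func reassemble(rows []FileChunk, store map[string][]byte) ([]byte, error) {
	ordered := append([]FileChunk(nil), rows...) // copy so the caller's slice is untouched
	sort.Slice(ordered, func(i, j int) bool { return ordered[i].Idx < ordered[j].Idx })
	var out []byte
	for i, fc := range ordered {
		if fc.Idx != i {
			return nil, fmt.Errorf("missing chunk at index %d", i)
		}
		data, ok := store[fc.ChunkHash]
		if !ok {
			return nil, fmt.Errorf("unknown chunk %s", fc.ChunkHash)
		}
		out = append(out, data...)
	}
	return out, nil
}

func main() {
	store := map[string][]byte{"h1": []byte("ab"), "h2": []byte("cd")}
	rows := []FileChunk{
		{FileID: "f", Idx: 2, ChunkHash: "h1"}, // the same chunk may appear twice in a file
		{FileID: "f", Idx: 0, ChunkHash: "h1"},
		{FileID: "f", Idx: 1, ChunkHash: "h2"},
	}
	data, err := reassemble(rows, store)
	fmt.Println(string(data), err)
}
```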

Blob (database.Blob)

The final storage unit uploaded to S3. Contains many compressed and encrypted chunks:

  • ID: UUID assigned at creation
  • Hash: SHA256 of final compressed+encrypted content
  • UncompressedSize: Total raw chunk data before compression
  • CompressedSize: Size after zstd compression and age encryption
  • CreatedTS, FinishedTS, UploadedTS: Lifecycle timestamps

Blob creation process:

  1. Chunks are accumulated (up to MaxBlobSize, typically 10GB)
  2. Compressed with zstd
  3. Encrypted with age (recipients configured in config)
  4. SHA256 hash computed → becomes filename in S3
  5. Uploaded to blobs/{hash[0:2]}/{hash[2:4]}/{hash}
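
The key layout in step 5 can be sketched as a small helper. blobKey is a hypothetical name, and the sketch assumes the hash is already a text string (for example hex) of at least four characters.

```go
package main

import "fmt"

// blobKey builds the object key blobs/{hash[0:2]}/{hash[2:4]}/{hash}.
// The two prefix levels fan blobs out across many key prefixes.
func blobKey(hash string) (string, error) {
	if len(hash) < 4 {
		return "", fmt.Errorf("hash too short: %q", hash)
	}
	return "blobs/" + hash[0:2] + "/" + hash[2:4] + "/" + hash, nil
}

func main() {
	key, err := blobKey("abcdef0123")
	fmt.Println(key, err) // blobs/ab/cd/abcdef0123 <nil>
}
```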

BlobChunk (database.BlobChunk)

Maps chunks to their position within blobs:

  • BlobID: Reference to the blob
  • ChunkHash: Reference to the chunk
  • Offset: Byte offset within the uncompressed blob
  • Length: Chunk size
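
Given a blob that has already been decrypted and decompressed, Offset and Length are enough to slice out a single chunk. The sketch below assumes int64 fields and uses a hypothetical extractChunk helper.

```go
package main

import "fmt"

// BlobChunk mirrors the fields listed above; the Go types are assumptions.
type BlobChunk struct {
	BlobID    string
	ChunkHash string
	Offset    int64
	Length    int64
}

// extractChunk returns the chunk's bytes from the uncompressed blob content,
// rejecting ranges that fall outside the blob.
func extractChunk(uncompressed []byte, bc BlobChunk) ([]byte, error) {
	n := int64(len(uncompressed))
	if bc.Offset < 0 || bc.Length < 0 || bc.Offset > n || bc.Length > n-bc.Offset {
		return nil, fmt.Errorf("chunk range [%d, +%d) outside blob of %d bytes", bc.Offset, bc.Length, n)
	}
	return uncompressed[bc.Offset : bc.Offset+bc.Length], nil
}

func main() {
	blob := []byte("aaabbbbcc")
	chunk, err := extractChunk(blob, BlobChunk{Offset: 3, Length: 4})
	fmt.Println(string(chunk), err) // bbbb <nil>
}
```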

Snapshot (database.Snapshot)

Represents a point-in-time backup:

  • ID: Format is {hostname}-{YYYYMMDD}-{HHMMSS}Z
  • Tracks file count, chunk count, blob count, sizes, compression ratio
  • CompletedAt: Null until snapshot finishes successfully
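
A sketch of the ID format, assuming the trailing Z means the timestamp is rendered in UTC. snapshotID and exampleTime are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// snapshotID formats {hostname}-{YYYYMMDD}-{HHMMSS}Z.
// Converting to UTC first is an assumption based on the Z suffix.
func snapshotID(hostname string, t time.Time) string {
	return fmt.Sprintf("%s-%sZ", hostname, t.UTC().Format("20060102-150405"))
}

// exampleTime is 19:49:42 at UTC-8, which is 03:49:42 UTC on the following day.
var exampleTime = time.Date(2025, time.December, 18, 19, 49, 42, 0, time.FixedZone("UTC-8", -8*60*60))

func main() {
	fmt.Println(snapshotID("backuphost", exampleTime)) // backuphost-20251219-034942Z
}
```

With UTC timestamps in this layout, IDs from the same host sort chronologically as plain strings.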

SnapshotFile / SnapshotBlob

Join tables linking snapshots to their files and blobs.

Relationship Summary

Snapshot 1──────────▶ N SnapshotFile N ◀────────── 1 File
Snapshot 1──────────▶ N SnapshotBlob N ◀────────── 1 Blob
File     1──────────▶ N FileChunk    N ◀────────── 1 Chunk
Blob     1──────────▶ N BlobChunk    N ◀────────── 1 Chunk

Type Instantiation

Application Startup

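The CLI uses fx for dependency injection. Options are registered in the order below. fx calls each constructor lazily, once something depends on its result, so the numbering reflects registration order and dependency layering, not a guaranteed call order: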

// cli/app.go: NewApp()
fx.New(
    fx.Supply(config.ConfigPath(opts.ConfigPath)),  // 1. Config path
    fx.Supply(opts.LogOptions),                      // 2. Log options
    fx.Provide(globals.New),                         // 3. Globals
    fx.Provide(log.New),                             // 4. Logger config
    config.Module,                                   // 5. Config
    database.Module,                                 // 6. Database + Repositories
    log.Module,                                      // 7. Logger initialization
    s3.Module,                                       // 8. S3 client
    snapshot.Module,                                 // 9. SnapshotManager + ScannerFactory
    fx.Provide(vaultik.New),                         // 10. Vaultik orchestrator
)

Key Type Instantiation Points

1. Config (config.Config)

  • Created by: config.Module via config.LoadConfig()
  • When: Application startup (fx DI)
  • Contains: All configuration from YAML file (S3 credentials, encryption keys, paths, etc.)

2. Database (database.DB)

  • Created by: database.Module via database.New()
  • When: Application startup (fx DI)
  • Contains: SQLite connection, path reference

3. Repositories (database.Repositories)

  • Created by: database.Module via database.NewRepositories()
  • When: Application startup (fx DI)
  • Contains: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)

4. Vaultik (vaultik.Vaultik)

  • Created by: vaultik.New(VaultikParams)
  • When: Application startup (fx DI)
  • Contains: All dependencies for backup operations

type Vaultik struct {
    Globals         *globals.Globals
    Config          *config.Config
    DB              *database.DB
    Repositories    *database.Repositories
    S3Client        *s3.Client
    ScannerFactory  snapshot.ScannerFactory
    SnapshotManager *snapshot.SnapshotManager
    Shutdowner      fx.Shutdowner
    Fs              afero.Fs
    ctx             context.Context
    cancel          context.CancelFunc
}

5. SnapshotManager (snapshot.SnapshotManager)

  • Created by: snapshot.Module via snapshot.NewSnapshotManager()
  • When: Application startup (fx DI)
  • Responsibility: Creates/completes snapshots, exports metadata to S3

6. Scanner (snapshot.Scanner)

  • Created by: ScannerFactory(ScannerParams)
  • When: Each CreateSnapshot() call
  • Contains: Chunker, Packer, progress reporter

// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
    EnableProgress: !opts.Cron,
    Fs:             v.Fs,
})
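
The factory pattern used here can be sketched in plain Go: a function value closes over long-lived dependencies and builds a fresh Scanner for each call. The types below are simplified stand-ins, not the real snapshot package types; the Fs field is omitted and maxBlobSize is a hypothetical captured setting.

```go
package main

import "fmt"

// ScannerParams holds per-call options (simplified; the real struct also carries Fs).
type ScannerParams struct {
	EnableProgress bool
}

// Scanner is a stand-in for the per-snapshot scanner.
type Scanner struct {
	maxBlobSize    int64 // long-lived setting captured by the factory
	enableProgress bool  // per-call option
}

// ScannerFactory builds a new Scanner for each snapshot run.
type ScannerFactory func(ScannerParams) *Scanner

// newScannerFactory captures startup-time dependencies once and returns
// a factory that combines them with per-call parameters.
func newScannerFactory(maxBlobSize int64) ScannerFactory {
	return func(p ScannerParams) *Scanner {
		return &Scanner{maxBlobSize: maxBlobSize, enableProgress: p.EnableProgress}
	}
}

func main() {
	factory := newScannerFactory(10 << 30)
	interactive := factory(ScannerParams{EnableProgress: true})
	cron := factory(ScannerParams{EnableProgress: false})
	fmt.Println(interactive.enableProgress, cron.enableProgress) // true false
}
```

This keeps per-run state (progress reporting, chunker, packer) out of the long-lived dependency graph while still letting the graph supply configuration.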

7. Chunker (chunker.Chunker)

  • Created by: chunker.NewChunker(avgChunkSize)
  • When: Inside snapshot.NewScanner()
  • Configuration:
    • avgChunkSize: From config (typically 64KB)
    • minChunkSize: avgChunkSize / 4
    • maxChunkSize: avgChunkSize * 4

8. Packer (blob.Packer)

  • Created by: blob.NewPacker(PackerConfig)
  • When: Inside snapshot.NewScanner()
  • Configuration:
    • MaxBlobSize: Maximum blob size before finalization (typically 10GB)
    • CompressionLevel: zstd level (1-19)
    • Recipients: age public keys for encryption

// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
    MaxBlobSize:      cfg.MaxBlobSize,
    CompressionLevel: cfg.CompressionLevel,
    Recipients:       cfg.AgeRecipients,
    Repositories:     cfg.Repositories,
    Fs:               cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)

Module Responsibilities

internal/cli

Entry point for the fx application. Combines all modules and handles signal interrupts.

Key functions:

  • NewApp(AppOptions) → Creates fx.App with all modules
  • RunApp(ctx, app) → Starts app, handles graceful shutdown
  • RunWithApp(ctx, opts) → Convenience wrapper

internal/vaultik

Main orchestrator containing all dependencies and command implementations.

Key methods:

  • New(VaultikParams) → Constructor (fx DI)
  • CreateSnapshot(opts) → Main backup operation
  • ListSnapshots(jsonOutput) → List available snapshots
  • VerifySnapshot(id, deep) → Verify snapshot integrity
  • PurgeSnapshots(...) → Remove old snapshots

internal/chunker

Content-defined chunking using the FastCDC algorithm.

Key types:

  • Chunk → Hash, Data, Offset, Size
  • Chunker → avgChunkSize, minChunkSize, maxChunkSize

Key methods:

  • NewChunker(avgChunkSize) → Constructor
  • ChunkReaderStreaming(reader, callback) → Stream chunks with callback (preferred)
  • ChunkReader(reader) → Return all chunks at once (memory-intensive)
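
The difference between the streaming and all-at-once styles can be sketched as follows. Fixed-size splitting is used here purely as a stand-in for FastCDC boundaries, and the function names and signatures are hypothetical, not the chunker package's API.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// chunkStreaming reads r in pieces of at most size bytes and hands each piece
// to cb. The buffer is reused, so cb must not retain the slice after returning.
func chunkStreaming(r io.Reader, size int, cb func(chunk []byte) error) error {
	if size <= 0 {
		return fmt.Errorf("size must be positive, got %d", size)
	}
	buf := make([]byte, size)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if cbErr := cb(buf[:n]); cbErr != nil {
				return cbErr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // input exhausted
		}
		if err != nil {
			return err
		}
	}
}

// chunkAll collects every chunk in memory, so usage grows with input size.
func chunkAll(r io.Reader, size int) ([][]byte, error) {
	var chunks [][]byte
	err := chunkStreaming(r, size, func(c []byte) error {
		chunks = append(chunks, append([]byte(nil), c...)) // copy: buffer is reused
		return nil
	})
	return chunks, err
}

// chunkSizes is a small demo helper that reports the size of each chunk.
func chunkSizes(s string, size int) ([]int, error) {
	var sizes []int
	err := chunkStreaming(strings.NewReader(s), size, func(c []byte) error {
		sizes = append(sizes, len(c))
		return nil
	})
	return sizes, err
}

func main() {
	sizes, _ := chunkSizes("abcdefghij", 4)
	fmt.Println(sizes) // [4 4 2]
	chunks, _ := chunkAll(strings.NewReader("abcdefghij"), 4)
	fmt.Println(len(chunks)) // 3
}
```

The streaming form holds one buffer at a time, which is why it is the better fit for large files.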

internal/blob

Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.

Key types:

  • Packer → Thread-safe blob accumulator
  • ChunkRef → Hash + Data for adding to packer
  • FinishedBlob → Completed blob ready for upload
  • BlobWithReader → FinishedBlob + io.Reader for streaming upload

Key methods:

  • NewPacker(PackerConfig) → Constructor
  • AddChunk(ChunkRef) → Add chunk to current blob
  • FinalizeBlob() → Compress, encrypt, hash current blob
  • Flush() → Finalize any in-progress blob
  • SetBlobHandler(func) → Set callback for upload

internal/snapshot

Scanner

Orchestrates the backup process for a directory.

Key methods:

  • NewScanner(ScannerConfig) → Constructor (creates Chunker + Packer)
  • Scan(ctx, path, snapshotID) → Main scan operation

Scan phases:

  • Phase 0: Detect deleted files from previous snapshots
  • Phase 1: Walk directory, identify files needing processing
  • Phase 2: Process files (chunk → pack → upload)

SnapshotManager

Manages snapshot lifecycle and metadata export.

Key methods:

  • CreateSnapshot(ctx, hostname, version, commit) → Create snapshot record
  • CompleteSnapshot(ctx, snapshotID) → Mark snapshot complete
  • ExportSnapshotMetadata(ctx, dbPath, snapshotID) → Export to S3
  • CleanupIncompleteSnapshots(ctx, hostname) → Remove failed snapshots

internal/database

SQLite database for the local index. Single-writer mode for thread safety.

Key types:

  • DB → Database connection wrapper
  • Repositories → Collection of all repository interfaces

Repository interfaces:

  • FilesRepository → CRUD for File records
  • ChunksRepository → CRUD for Chunk records
  • BlobsRepository → CRUD for Blob records
  • SnapshotsRepository → CRUD for Snapshot records
  • Plus join table repositories (FileChunks, BlobChunks, etc.)

Snapshot Creation Flow

CreateSnapshot(opts)
    │
    ├─► CleanupIncompleteSnapshots()   // Critical: avoid dedup errors
    │
    ├─► SnapshotManager.CreateSnapshot()   // Create DB record
    │
    ├─► For each source directory:
    │       │
    │       ├─► scanner.Scan(ctx, path, snapshotID)
    │       │       │
    │       │       ├─► Phase 0: detectDeletedFiles()
    │       │       │
    │       │       ├─► Phase 1: scanPhase()
    │       │       │       Walk directory
    │       │       │       Check file metadata changes
    │       │       │       Build list of files to process
    │       │       │
    │       │       └─► Phase 2: processPhase()
    │       │               For each file:
    │       │                   chunker.ChunkReaderStreaming()
    │       │                   For each chunk:
    │       │                       packer.AddChunk()
    │       │                       If blob full → FinalizeBlob()
    │       │                           → handleBlobReady()
    │       │                           → s3Client.PutObjectWithProgress()
    │       │               packer.Flush()  // Final blob
    │       │
    │       └─► Accumulate statistics
    │
    ├─► SnapshotManager.UpdateSnapshotStatsExtended()
    │
    ├─► SnapshotManager.CompleteSnapshot()
    │
    └─► SnapshotManager.ExportSnapshotMetadata()
            │
            ├─► Copy database to temp file
            ├─► Clean to only current snapshot data
            ├─► Dump to SQL
            ├─► Compress with zstd
            ├─► Encrypt with age
            ├─► Upload db.zst.age to S3
            └─► Upload manifest.json.zst to S3

Deduplication Strategy

  1. File-level: Files unchanged since last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)

  2. Chunk-level: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.

  3. Blob-level: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
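
The first two levels can be sketched with an in-memory stand-in for the database. FileMeta, unchanged, and chunkIndex are hypothetical names, and the field types are assumptions; the compared fields are the ones listed in item 1.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// FileMeta holds the fields compared for file-level skipping.
type FileMeta struct {
	Size      int64
	MtimeUnix int64
	Mode      uint32
	UID, GID  uint32
}

// unchanged reports whether a file can be skipped: every compared field matches.
func unchanged(prev, cur FileMeta) bool {
	return prev == cur
}

// chunkIndex is an in-memory stand-in for the set of known chunk hashes.
type chunkIndex map[string]bool

// addIfNew hashes the data and reports whether the chunk still needs to be
// stored. Known chunks are only referenced, not uploaded again.
func (idx chunkIndex) addIfNew(data []byte) (hash string, isNew bool) {
	sum := sha256.Sum256(data)
	hash = hex.EncodeToString(sum[:])
	if idx[hash] {
		return hash, false
	}
	idx[hash] = true
	return hash, true
}

func main() {
	idx := chunkIndex{}
	_, first := idx.addIfNew([]byte("same bytes"))
	_, second := idx.addIfNew([]byte("same bytes"))
	fmt.Println(first, second) // true false

	prev := FileMeta{Size: 10, MtimeUnix: 1, Mode: 0o644, UID: 1000, GID: 1000}
	cur := prev
	cur.MtimeUnix = 2
	fmt.Println(unchanged(prev, prev), unchanged(prev, cur)) // true false
}
```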

Storage Layout in S3

bucket/
├── blobs/
│   └── {hash[0:2]}/
│       └── {hash[2:4]}/
│           └── {full-hash}          # Compressed+encrypted blob
│
└── metadata/
    └── {snapshot-id}/
        ├── db.zst.age               # Encrypted database dump
        └── manifest.json.zst        # Blob list (for verification)

Thread Safety

  • Packer: Thread-safe via mutex. Multiple goroutines can call AddChunk().
  • Scanner: Uses packerMu mutex to coordinate blob finalization.
  • Database: Single-writer mode (MaxOpenConns=1) ensures SQLite thread safety.
  • Repositories.WithTx(): Handles transaction lifecycle automatically.
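
A transaction helper of this kind typically commits when the callback succeeds and rolls back when it fails. The sketch below shows that pattern against a minimal interface; withTx, tx, and fakeTx are hypothetical and do not reproduce the actual Repositories.WithTx() signature. Panic recovery is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// tx is the minimal transaction surface the sketch needs.
// *sql.Tx from database/sql provides both methods.
type tx interface {
	Commit() error
	Rollback() error
}

// withTx begins a transaction, runs fn, and commits on success or rolls back
// on failure, so callers never manage the lifecycle themselves.
func withTx(begin func() (tx, error), fn func(tx) error) error {
	t, err := begin()
	if err != nil {
		return err
	}
	if err := fn(t); err != nil {
		if rbErr := t.Rollback(); rbErr != nil {
			return fmt.Errorf("%w (rollback also failed: %v)", err, rbErr)
		}
		return err
	}
	return t.Commit()
}

// fakeTx records which lifecycle method was called.
type fakeTx struct{ committed, rolledBack bool }

func (f *fakeTx) Commit() error   { f.committed = true; return nil }
func (f *fakeTx) Rollback() error { f.rolledBack = true; return nil }

// runFake runs withTx against a fakeTx, optionally failing inside the callback.
func runFake(fail bool) (committed, rolledBack bool, err error) {
	f := &fakeTx{}
	err = withTx(
		func() (tx, error) { return f, nil },
		func(_ tx) error {
			if fail {
				return errors.New("insert failed")
			}
			return nil
		},
	)
	return f.committed, f.rolledBack, err
}

func main() {
	fmt.Println(runFake(false)) // true false <nil>
	fmt.Println(runFake(true))  // false true insert failed
}
```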