Vaultik Architecture
This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.
Overview
Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using uber-go/fx.
Data Flow
Source Files
│
▼
┌─────────────────┐
│ Scanner │ Walks directories, detects changed files
└────────┬────────┘
│
▼
┌─────────────────┐
│ Chunker │ Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Packer │ Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
│
▼
┌─────────────────┐
│ S3 Client │ Uploads blobs to remote storage
└─────────────────┘
Data Model
Core Entities
The database tracks five primary entities and their relationships:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Snapshot │────▶│ File │────▶│ Chunk │
└──────────────┘ └──────────────┘ └──────────────┘
│ │
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Blob │◀─────────────────────────│ BlobChunk │
└──────────────┘ └──────────────┘
Entity Descriptions
File (database.File)
Represents a file or directory in the backup system. Stores metadata needed for restoration:
- Path, timestamps (mtime, ctime)
- Size, mode, ownership (uid, gid)
- Symlink target (if applicable)
Chunk (database.Chunk)
A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:
- ChunkHash: SHA256 hash of chunk content (primary key)
- Size: Chunk size in bytes
Chunk sizes vary between avgChunkSize/4 and avgChunkSize*4 (typically 16KB-256KB for 64KB average).
FileChunk (database.FileChunk)
Maps files to their constituent chunks:
- FileID: Reference to the file
- Idx: Position of this chunk within the file (0-indexed)
- ChunkHash: Reference to the chunk
Blob (database.Blob)
The final storage unit uploaded to S3. Contains many chunks, packed together and then compressed and encrypted as a unit:
- ID: UUID assigned at creation
- Hash: SHA256 of final compressed+encrypted content
- UncompressedSize: Total raw chunk data before compression
- CompressedSize: Size after zstd compression and age encryption
- CreatedTS, FinishedTS, UploadedTS: Lifecycle timestamps
Blob creation process:
- Chunks are accumulated (up to MaxBlobSize, typically 10GB)
- Compressed with zstd
- Encrypted with age (recipients configured in config)
- SHA256 hash computed → becomes filename in S3
- Uploaded to blobs/{hash[0:2]}/{hash[2:4]}/{hash}
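For illustration, here is a minimal sketch of that pipeline, assuming the github.com/klauspost/compress/zstd and filippo.io/age libraries; finalizeBlob and its signature are illustrative, not the actual Packer internals:

// Hypothetical sketch of the compress → encrypt → hash pipeline; not the real Packer code.
// Assumes github.com/klauspost/compress/zstd and filippo.io/age.
func finalizeBlob(raw []byte, recipients []age.Recipient, level int) (s3Key string, sealed []byte, err error) {
	var buf bytes.Buffer
	hasher := sha256.New()

	// Everything written through the age encryptor is also hashed, so the final
	// hash covers the compressed+encrypted bytes, matching Blob.Hash above.
	enc, err := age.Encrypt(io.MultiWriter(&buf, hasher), recipients...)
	if err != nil {
		return "", nil, err
	}
	zw, err := zstd.NewWriter(enc, zstd.WithEncoderLevel(zstd.EncoderLevelFromZstd(level)))
	if err != nil {
		return "", nil, err
	}
	if _, err := zw.Write(raw); err != nil {
		return "", nil, err
	}
	if err := zw.Close(); err != nil { // flush compressed data into the encryptor
		return "", nil, err
	}
	if err := enc.Close(); err != nil { // finalize the age stream
		return "", nil, err
	}

	hash := hex.EncodeToString(hasher.Sum(nil))
	return fmt.Sprintf("blobs/%s/%s/%s", hash[:2], hash[2:4], hash), buf.Bytes(), nil
}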
BlobChunk (database.BlobChunk)
Maps chunks to their position within blobs:
- BlobID: Reference to the blob
- ChunkHash: Reference to the chunk
- Offset: Byte offset within the uncompressed blob
- Length: Chunk size
Snapshot (database.Snapshot)
Represents a point-in-time backup:
- ID: Format is {hostname}-{YYYYMMDD}-{HHMMSS}Z
- Tracks file count, chunk count, blob count, sizes, compression ratio
CompletedAt: Null until snapshot finishes successfully
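A small sketch of how an ID in that format can be derived (illustrative only; the actual formatting lives in the snapshot package):

// Illustrative only: builds an ID of the form {hostname}-{YYYYMMDD}-{HHMMSS}Z.
hostname, _ := os.Hostname()
id := fmt.Sprintf("%s-%sZ", hostname, time.Now().UTC().Format("20060102-150405"))
// e.g. "myhost-20250101-130501Z"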
SnapshotFile / SnapshotBlob
Join tables linking snapshots to their files and blobs.
Relationship Summary
Snapshot 1──────────▶ N SnapshotFile N ◀────────── 1 File
Snapshot 1──────────▶ N SnapshotBlob N ◀────────── 1 Blob
File 1──────────▶ N FileChunk N ◀────────── 1 Chunk
Blob 1──────────▶ N BlobChunk N ◀────────── 1 Chunk
Type Instantiation
Application Startup
The CLI uses fx for dependency injection. Here's the instantiation order:
// cli/app.go: NewApp()
fx.New(
fx.Supply(config.ConfigPath(opts.ConfigPath)), // 1. Config path
fx.Supply(opts.LogOptions), // 2. Log options
fx.Provide(globals.New), // 3. Globals
fx.Provide(log.New), // 4. Logger config
config.Module, // 5. Config
database.Module, // 6. Database + Repositories
log.Module, // 7. Logger initialization
s3.Module, // 8. S3 client
snapshot.Module, // 9. SnapshotManager + ScannerFactory
fx.Provide(vaultik.New), // 10. Vaultik orchestrator
)
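Each of the module values above (config.Module, database.Module, and so on) is an fx.Option that bundles that package's constructors. As a rough sketch of the pattern (the real definitions may register more providers), database.Module could be assembled as:

// Hypothetical wiring, for illustration only; assumes go.uber.org/fx.
var Module = fx.Module("database",
	fx.Provide(New),             // constructs *database.DB
	fx.Provide(NewRepositories), // constructs *database.Repositories
)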
Key Type Instantiation Points
1. Config (config.Config)
- Created by: config.Module via config.LoadConfig()
- When: Application startup (fx DI)
- Contains: All configuration from YAML file (S3 credentials, encryption keys, paths, etc.)
2. Database (database.DB)
- Created by: database.Module via database.New()
- When: Application startup (fx DI)
- Contains: SQLite connection, path reference
3. Repositories (database.Repositories)
- Created by: database.Module via database.NewRepositories()
- When: Application startup (fx DI)
- Contains: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)
4. Vaultik (vaultik.Vaultik)
- Created by: vaultik.New(VaultikParams)
- When: Application startup (fx DI)
- Contains: All dependencies for backup operations
type Vaultik struct {
Globals *globals.Globals
Config *config.Config
DB *database.DB
Repositories *database.Repositories
S3Client *s3.Client
ScannerFactory snapshot.ScannerFactory
SnapshotManager *snapshot.SnapshotManager
Shutdowner fx.Shutdowner
Fs afero.Fs
ctx context.Context
cancel context.CancelFunc
}
5. SnapshotManager (snapshot.SnapshotManager)
- Created by: snapshot.Module via snapshot.NewSnapshotManager()
- When: Application startup (fx DI)
- Responsibility: Creates/completes snapshots, exports metadata to S3
6. Scanner (snapshot.Scanner)
- Created by: ScannerFactory(ScannerParams)
- When: Each CreateSnapshot() call
- Contains: Chunker, Packer, progress reporter
// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
EnableProgress: !opts.Cron,
Fs: v.Fs,
})
7. Chunker (chunker.Chunker)
- Created by: chunker.NewChunker(avgChunkSize)
- When: Inside snapshot.NewScanner()
- Configuration:
  - avgChunkSize: From config (typically 64KB)
  - minChunkSize: avgChunkSize / 4
  - maxChunkSize: avgChunkSize * 4
8. Packer (blob.Packer)
- Created by: blob.NewPacker(PackerConfig)
- When: Inside snapshot.NewScanner()
- Configuration:
  - MaxBlobSize: Maximum blob size before finalization (typically 10GB)
  - CompressionLevel: zstd level (1-19)
  - Recipients: age public keys for encryption
// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
MaxBlobSize: cfg.MaxBlobSize,
CompressionLevel: cfg.CompressionLevel,
Recipients: cfg.AgeRecipients,
Repositories: cfg.Repositories,
Fs: cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)
Module Responsibilities
internal/cli
Entry point for fx application. Combines all modules and handles signal interrupts.
Key functions:
- NewApp(AppOptions) → Creates fx.App with all modules
- RunApp(ctx, app) → Starts app, handles graceful shutdown
- RunWithApp(ctx, opts) → Convenience wrapper
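For orientation, a minimal entry point using these functions might look like the following sketch; the cli import path, the config file location, and any AppOptions fields beyond ConfigPath are assumptions:

// Hypothetical entry point; option fields and error handling are illustrative.
func main() {
	ctx := context.Background()
	err := cli.RunWithApp(ctx, cli.AppOptions{
		ConfigPath: "/etc/vaultik/config.yaml", // assumed default location
	})
	if err != nil {
		os.Exit(1)
	}
}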
internal/vaultik
Main orchestrator containing all dependencies and command implementations.
Key methods:
- New(VaultikParams) → Constructor (fx DI)
- CreateSnapshot(opts) → Main backup operation
- ListSnapshots(jsonOutput) → List available snapshots
- VerifySnapshot(id, deep) → Verify snapshot integrity
- PurgeSnapshots(...) → Remove old snapshots
internal/chunker
Content-defined chunking using FastCDC algorithm.
Key types:
- Chunk → Hash, Data, Offset, Size
- Chunker → avgChunkSize, minChunkSize, maxChunkSize
Key methods:
- NewChunker(avgChunkSize) → Constructor
- ChunkReaderStreaming(reader, callback) → Stream chunks with callback (preferred)
- ChunkReader(reader) → Return all chunks at once (memory-intensive)
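A sketch of the streaming interface in use; the callback parameter type and the Chunk field types are assumed from the lists above:

// Hypothetical usage; c is a chunker from NewChunker. Prints each chunk instead of packing it.
f, err := os.Open(path)
if err != nil {
	return err
}
defer f.Close()

err = c.ChunkReaderStreaming(f, func(chunk chunker.Chunk) error {
	fmt.Printf("chunk %s: %d bytes at offset %d\n", chunk.Hash, chunk.Size, chunk.Offset)
	return nil
})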
internal/blob
Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.
Key types:
- Packer → Thread-safe blob accumulator
- ChunkRef → Hash + Data for adding to packer
- FinishedBlob → Completed blob ready for upload
- BlobWithReader → FinishedBlob + io.Reader for streaming upload
Key methods:
- NewPacker(PackerConfig) → Constructor
- AddChunk(ChunkRef) → Add chunk to current blob
- FinalizeBlob() → Compress, encrypt, hash current blob
- Flush() → Finalize any in-progress blob
- SetBlobHandler(func) → Set callback for upload
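Putting those together, a hedged sketch of how a caller might drive the packer; the handler signature and the uploadBlob helper are assumptions standing in for the real S3 upload path:

// Hypothetical usage of the packer lifecycle.
packer.SetBlobHandler(func(b *blob.BlobWithReader) error {
	// Called whenever a blob is finalized; stream b to S3 here.
	return uploadBlob(ctx, b)
})

for _, chunk := range chunks {
	if err := packer.AddChunk(blob.ChunkRef{Hash: chunk.Hash, Data: chunk.Data}); err != nil {
		return err
	}
}
// Finalize and hand off whatever remains in the current blob.
if err := packer.Flush(); err != nil {
	return err
}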
internal/snapshot
Scanner
Orchestrates the backup process for a directory.
Key methods:
- NewScanner(ScannerConfig) → Constructor (creates Chunker + Packer)
- Scan(ctx, path, snapshotID) → Main scan operation
Scan phases:
- Phase 0: Detect deleted files from previous snapshots
- Phase 1: Walk directory, identify files needing processing
- Phase 2: Process files (chunk → pack → upload)
SnapshotManager
Manages snapshot lifecycle and metadata export.
Key methods:
- CreateSnapshot(ctx, hostname, version, commit) → Create snapshot record
- CompleteSnapshot(ctx, snapshotID) → Mark snapshot complete
- ExportSnapshotMetadata(ctx, dbPath, snapshotID) → Export to S3
- CleanupIncompleteSnapshots(ctx, hostname) → Remove failed snapshots
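During a backup these are invoked roughly in the following order (a sketch; return values and error handling are assumed):

// Hypothetical sequence of SnapshotManager calls during CreateSnapshot.
snapshotID, err := sm.CreateSnapshot(ctx, hostname, version, commit)
if err != nil {
	return err
}
// ... scanner.Scan() runs for each source directory ...
if err := sm.CompleteSnapshot(ctx, snapshotID); err != nil {
	return err
}
if err := sm.ExportSnapshotMetadata(ctx, dbPath, snapshotID); err != nil {
	return err
}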
internal/database
SQLite database for local index. Single-writer mode for thread safety.
Key types:
- DB → Database connection wrapper
- Repositories → Collection of all repository interfaces
Repository interfaces:
- FilesRepository → CRUD for File records
- ChunksRepository → CRUD for Chunk records
- BlobsRepository → CRUD for Blob records
- SnapshotsRepository → CRUD for Snapshot records
- Plus join table repositories (FileChunks, BlobChunks, etc.)
Snapshot Creation Flow
CreateSnapshot(opts)
│
├─► CleanupIncompleteSnapshots() // Critical: avoid dedup errors
│
├─► SnapshotManager.CreateSnapshot() // Create DB record
│
├─► For each source directory:
│ │
│ ├─► scanner.Scan(ctx, path, snapshotID)
│ │ │
│ │ ├─► Phase 0: detectDeletedFiles()
│ │ │
│ │ ├─► Phase 1: scanPhase()
│ │ │ Walk directory
│ │ │ Check file metadata changes
│ │ │ Build list of files to process
│ │ │
│ │ └─► Phase 2: processPhase()
│ │ For each file:
│ │ chunker.ChunkReaderStreaming()
│ │ For each chunk:
│ │ packer.AddChunk()
│ │ If blob full → FinalizeBlob()
│ │ → handleBlobReady()
│ │ → s3Client.PutObjectWithProgress()
│ │ packer.Flush() // Final blob
│ │
│ └─► Accumulate statistics
│
├─► SnapshotManager.UpdateSnapshotStatsExtended()
│
├─► SnapshotManager.CompleteSnapshot()
│
└─► SnapshotManager.ExportSnapshotMetadata()
│
├─► Copy database to temp file
├─► Clean to only current snapshot data
├─► Dump to SQL
├─► Compress with zstd
├─► Encrypt with age
├─► Upload db.zst.age to S3
└─► Upload manifest.json.zst to S3
Deduplication Strategy
- File-level: Files unchanged since the last backup are skipped (metadata comparison: size, mtime, mode, uid, gid).
- Chunk-level: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.
- Blob-level: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
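A sketch of the chunk-level check as it might appear in the scanner; the Exists method name is hypothetical:

// Hypothetical dedup check; only previously unseen chunks reach the packer.
exists, err := repos.Chunks.Exists(ctx, chunkHash)
if err != nil {
	return err
}
if !exists {
	if err := packer.AddChunk(blob.ChunkRef{Hash: chunkHash, Data: data}); err != nil {
		return err
	}
}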
Storage Layout in S3
bucket/
├── blobs/
│ └── {hash[0:2]}/
│ └── {hash[2:4]}/
│ └── {full-hash} # Compressed+encrypted blob
│
└── metadata/
└── {snapshot-id}/
├── db.zst.age # Encrypted database dump
└── manifest.json.zst # Blob list (for verification)
Thread Safety
- Packer: Thread-safe via mutex. Multiple goroutines can call AddChunk().
- Scanner: Uses a packerMu mutex to coordinate blob finalization.
- Database: Single-writer mode (MaxOpenConns=1) ensures SQLite thread safety.
- Repositories.WithTx(): Handles transaction lifecycle automatically.
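A sketch of WithTx in use; the callback signature and the Create method names are assumptions:

// Hypothetical transactional write via the repositories.
err := repos.WithTx(ctx, func(ctx context.Context, tx *database.Repositories) error {
	if err := tx.Files.Create(ctx, file); err != nil {
		return err
	}
	return tx.FileChunks.Create(ctx, fileChunk)
})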