Vaultik Architecture

This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.

Overview

Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using uber-go/fx.

Data Flow

Source Files
     │
     ▼
┌─────────────────┐
│    Scanner      │  Walks directories, detects changed files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Chunker      │  Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Packer       │  Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   S3 Client     │  Uploads blobs to remote storage
└─────────────────┘

Data Model

Core Entities

The database tracks five primary entities and their relationships:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Snapshot   │────▶│     File     │────▶│    Chunk     │
└──────────────┘     └──────────────┘     └──────────────┘
       │                                         │
       │                                         │
       ▼                                         ▼
┌──────────────┐                          ┌──────────────┐
│     Blob     │◀─────────────────────────│  BlobChunk   │
└──────────────┘                          └──────────────┘

Entity Descriptions

File (database.File)

Represents a file or directory in the backup system. Stores metadata needed for restoration:

  • Path, mtime
  • Size, mode, ownership (uid, gid)
  • Symlink target (if applicable)

Chunk (database.Chunk)

A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:

  • ChunkHash: SHA256 hash of chunk content (primary key)
  • Size: Chunk size in bytes

Chunk sizes vary between avgChunkSize/4 and avgChunkSize*4 (typically 16KB-256KB for 64KB average).

FileChunk (database.FileChunk)

Maps files to their constituent chunks:

  • FileID: Reference to the file
  • Idx: Position of this chunk within the file (0-indexed)
  • ChunkHash: Reference to the chunk
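
Restoring a file amounts to reading its FileChunk rows ordered by Idx and concatenating the referenced chunk contents. A minimal sketch of that step (the types here are illustrative, not the actual database package):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// FileChunk mirrors the join-table row: which chunk sits at which
// position within the file. Field types are illustrative only.
type FileChunk struct {
	FileID    string
	Idx       int
	ChunkHash string
}

// reassemble rebuilds the original file bytes by concatenating
// chunk contents in Idx order. chunks maps ChunkHash -> content.
func reassemble(fcs []FileChunk, chunks map[string][]byte) []byte {
	sort.Slice(fcs, func(i, j int) bool { return fcs[i].Idx < fcs[j].Idx })
	var buf bytes.Buffer
	for _, fc := range fcs {
		buf.Write(chunks[fc.ChunkHash])
	}
	return buf.Bytes()
}

func main() {
	chunks := map[string][]byte{"h1": []byte("hello "), "h2": []byte("world")}
	fcs := []FileChunk{
		{FileID: "f1", Idx: 1, ChunkHash: "h2"},
		{FileID: "f1", Idx: 0, ChunkHash: "h1"},
	}
	fmt.Println(string(reassemble(fcs, chunks))) // hello world
}
```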

Blob (database.Blob)

The final storage unit uploaded to S3. Contains many chunks, compressed and encrypted together:

  • ID: UUID assigned at creation
  • Hash: SHA256 of final compressed+encrypted content
  • UncompressedSize: Total raw chunk data before compression
  • CompressedSize: Size after zstd compression and age encryption
  • CreatedTS, FinishedTS, UploadedTS: Lifecycle timestamps

Blob creation process:

  1. Chunks are accumulated (up to MaxBlobSize, typically 10GB)
  2. Compressed with zstd
  3. Encrypted with age (recipients configured in config)
  4. SHA256 hash computed → becomes filename in S3
  5. Uploaded to blobs/{hash[0:2]}/{hash[2:4]}/{hash}
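
The hash-to-key mapping in steps 4-5 can be expressed directly with the standard library; a sketch (the helper name blobKey is illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobKey derives the S3 object key from the SHA256 of the final
// compressed+encrypted blob content, fanning out into two levels of
// two-character prefixes: blobs/{hash[0:2]}/{hash[2:4]}/{hash}.
func blobKey(content []byte) string {
	sum := sha256.Sum256(content)
	hash := hex.EncodeToString(sum[:])
	return fmt.Sprintf("blobs/%s/%s/%s", hash[0:2], hash[2:4], hash)
}

func main() {
	fmt.Println(blobKey([]byte("example blob bytes")))
}
```

The two-level prefix keeps any single key listing small even with millions of blobs in the bucket.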

BlobChunk (database.BlobChunk)

Maps chunks to their position within blobs:

  • BlobID: Reference to the blob
  • ChunkHash: Reference to the chunk
  • Offset: Byte offset within the uncompressed blob
  • Length: Chunk size
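
During restore, once a blob has been decrypted and decompressed, each chunk is recovered by slicing at the recorded offset. A sketch of that slicing step (type and function names are illustrative):

```go
package main

import "fmt"

// BlobChunk mirrors the join-table row locating a chunk inside a blob.
type BlobChunk struct {
	BlobID    string
	ChunkHash string
	Offset    int64
	Length    int64
}

// chunkFromBlob slices a chunk out of the already-decrypted,
// already-decompressed blob contents using Offset and Length.
func chunkFromBlob(blob []byte, bc BlobChunk) []byte {
	return blob[bc.Offset : bc.Offset+bc.Length]
}

func main() {
	blob := []byte("aaaabbbbcccc")
	bc := BlobChunk{BlobID: "b1", ChunkHash: "h2", Offset: 4, Length: 4}
	fmt.Printf("%s\n", chunkFromBlob(blob, bc)) // bbbb
}
```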

Snapshot (database.Snapshot)

Represents a point-in-time backup:

  • ID: Format is {hostname}-{YYYYMMDD}-{HHMMSS}Z
  • Tracks file count, chunk count, blob count, sizes, compression ratio
  • CompletedAt: Null until snapshot finishes successfully

SnapshotFile / SnapshotBlob

Join tables linking snapshots to their files and blobs.

Relationship Summary

Snapshot 1──────────▶ N SnapshotFile N ◀────────── 1 File
Snapshot 1──────────▶ N SnapshotBlob N ◀────────── 1 Blob
File     1──────────▶ N FileChunk    N ◀────────── 1 Chunk
Blob     1──────────▶ N BlobChunk    N ◀────────── 1 Chunk

Type Instantiation

Application Startup

The CLI uses fx for dependency injection. Here's the instantiation order:

// cli/app.go: NewApp()
fx.New(
    fx.Supply(config.ConfigPath(opts.ConfigPath)),  // 1. Config path
    fx.Supply(opts.LogOptions),                      // 2. Log options
    fx.Provide(globals.New),                         // 3. Globals
    fx.Provide(log.New),                             // 4. Logger config
    config.Module,                                   // 5. Config
    database.Module,                                 // 6. Database + Repositories
    log.Module,                                      // 7. Logger initialization
    s3.Module,                                       // 8. S3 client
    snapshot.Module,                                 // 9. SnapshotManager + ScannerFactory
    fx.Provide(vaultik.New),                         // 10. Vaultik orchestrator
)

Key Type Instantiation Points

1. Config (config.Config)

  • Created by: config.Module via config.LoadConfig()
  • When: Application startup (fx DI)
  • Contains: All configuration from YAML file (S3 credentials, encryption keys, paths, etc.)

2. Database (database.DB)

  • Created by: database.Module via database.New()
  • When: Application startup (fx DI)
  • Contains: SQLite connection, path reference

3. Repositories (database.Repositories)

  • Created by: database.Module via database.NewRepositories()
  • When: Application startup (fx DI)
  • Contains: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)

4. Vaultik (vaultik.Vaultik)

  • Created by: vaultik.New(VaultikParams)
  • When: Application startup (fx DI)
  • Contains: All dependencies for backup operations

type Vaultik struct {
    Globals         *globals.Globals
    Config          *config.Config
    DB              *database.DB
    Repositories    *database.Repositories
    S3Client        *s3.Client
    ScannerFactory  snapshot.ScannerFactory
    SnapshotManager *snapshot.SnapshotManager
    Shutdowner      fx.Shutdowner
    Fs              afero.Fs
    ctx             context.Context
    cancel          context.CancelFunc
}

5. SnapshotManager (snapshot.SnapshotManager)

  • Created by: snapshot.Module via snapshot.NewSnapshotManager()
  • When: Application startup (fx DI)
  • Responsibility: Creates/completes snapshots, exports metadata to S3

6. Scanner (snapshot.Scanner)

  • Created by: ScannerFactory(ScannerParams)
  • When: Each CreateSnapshot() call
  • Contains: Chunker, Packer, progress reporter

// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
    EnableProgress: !opts.Cron,
    Fs:             v.Fs,
})

7. Chunker (chunker.Chunker)

  • Created by: chunker.NewChunker(avgChunkSize)
  • When: Inside snapshot.NewScanner()
  • Configuration:
    • avgChunkSize: From config (typically 64KB)
    • minChunkSize: avgChunkSize / 4
    • maxChunkSize: avgChunkSize * 4
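
The derivation of the bounds can be sketched as follows (mirroring the configuration above; this is not the actual chunker package):

```go
package main

import "fmt"

// Chunker holds the FastCDC size parameters. The min and max bounds
// are derived from the configured average as described above.
type Chunker struct {
	avgChunkSize int
	minChunkSize int
	maxChunkSize int
}

// NewChunker derives minChunkSize = avg/4 and maxChunkSize = avg*4.
func NewChunker(avg int) *Chunker {
	return &Chunker{
		avgChunkSize: avg,
		minChunkSize: avg / 4,
		maxChunkSize: avg * 4,
	}
}

func main() {
	c := NewChunker(64 * 1024) // 64KB average from config
	fmt.Printf("avg=%dKB min=%dKB max=%dKB\n",
		c.avgChunkSize/1024, c.minChunkSize/1024, c.maxChunkSize/1024)
	// avg=64KB min=16KB max=256KB
}
```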

8. Packer (blob.Packer)

  • Created by: blob.NewPacker(PackerConfig)
  • When: Inside snapshot.NewScanner()
  • Configuration:
    • MaxBlobSize: Maximum blob size before finalization (typically 10GB)
    • CompressionLevel: zstd level (1-19)
    • Recipients: age public keys for encryption

// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
    MaxBlobSize:      cfg.MaxBlobSize,
    CompressionLevel: cfg.CompressionLevel,
    Recipients:       cfg.AgeRecipients,
    Repositories:     cfg.Repositories,
    Fs:               cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)

Module Responsibilities

internal/cli

Entry point for fx application. Combines all modules and handles signal interrupts.

Key functions:

  • NewApp(AppOptions) → Creates fx.App with all modules
  • RunApp(ctx, app) → Starts app, handles graceful shutdown
  • RunWithApp(ctx, opts) → Convenience wrapper

internal/vaultik

Main orchestrator containing all dependencies and command implementations.

Key methods:

  • New(VaultikParams) → Constructor (fx DI)
  • CreateSnapshot(opts) → Main backup operation
  • ListSnapshots(jsonOutput) → List available snapshots
  • VerifySnapshot(id, deep) → Verify snapshot integrity
  • PurgeSnapshots(...) → Remove old snapshots

internal/chunker

Content-defined chunking using FastCDC algorithm.

Key types:

  • Chunk → Hash, Data, Offset, Size
  • Chunker → avgChunkSize, minChunkSize, maxChunkSize

Key methods:

  • NewChunker(avgChunkSize) → Constructor
  • ChunkReaderStreaming(reader, callback) → Stream chunks with callback (preferred)
  • ChunkReader(reader) → Return all chunks at once (memory-intensive)

internal/blob

Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.

Key types:

  • Packer → Thread-safe blob accumulator
  • ChunkRef → Hash + Data for adding to packer
  • FinishedBlob → Completed blob ready for upload
  • BlobWithReader → FinishedBlob + io.Reader for streaming upload

Key methods:

  • NewPacker(PackerConfig) → Constructor
  • AddChunk(ChunkRef) → Add chunk to current blob
  • FinalizeBlob() → Compress, encrypt, hash current blob
  • Flush() → Finalize any in-progress blob
  • SetBlobHandler(func) → Set callback for upload

internal/snapshot

Scanner

Orchestrates the backup process for a directory.

Key methods:

  • NewScanner(ScannerConfig) → Constructor (creates Chunker + Packer)
  • Scan(ctx, path, snapshotID) → Main scan operation

Scan phases:

  1. Phase 0: Detect deleted files from previous snapshots
  2. Phase 1: Walk directory, identify files needing processing
  3. Phase 2: Process files (chunk → pack → upload)

SnapshotManager

Manages snapshot lifecycle and metadata export.

Key methods:

  • CreateSnapshot(ctx, hostname, version, commit) → Create snapshot record
  • CompleteSnapshot(ctx, snapshotID) → Mark snapshot complete
  • ExportSnapshotMetadata(ctx, dbPath, snapshotID) → Export to S3
  • CleanupIncompleteSnapshots(ctx, hostname) → Remove failed snapshots

internal/database

SQLite database for local index. Single-writer mode for thread safety.

Key types:

  • DB → Database connection wrapper
  • Repositories → Collection of all repository interfaces

Repository interfaces:

  • FilesRepository → CRUD for File records
  • ChunksRepository → CRUD for Chunk records
  • BlobsRepository → CRUD for Blob records
  • SnapshotsRepository → CRUD for Snapshot records
  • Plus join table repositories (FileChunks, BlobChunks, etc.)

Snapshot Creation Flow

CreateSnapshot(opts)
    │
    ├─► CleanupIncompleteSnapshots()   // Critical: avoid dedup errors
    │
    ├─► SnapshotManager.CreateSnapshot()   // Create DB record
    │
    ├─► For each source directory:
    │       │
    │       ├─► scanner.Scan(ctx, path, snapshotID)
    │       │       │
    │       │       ├─► Phase 0: detectDeletedFiles()
    │       │       │
    │       │       ├─► Phase 1: scanPhase()
    │       │       │       Walk directory
    │       │       │       Check file metadata changes
    │       │       │       Build list of files to process
    │       │       │
    │       │       └─► Phase 2: processPhase()
    │       │               For each file:
    │       │                   chunker.ChunkReaderStreaming()
    │       │                   For each chunk:
    │       │                       packer.AddChunk()
    │       │                       If blob full → FinalizeBlob()
    │       │                           → handleBlobReady()
    │       │                           → s3Client.PutObjectWithProgress()
    │       │               packer.Flush()  // Final blob
    │       │
    │       └─► Accumulate statistics
    │
    ├─► SnapshotManager.UpdateSnapshotStatsExtended()
    │
    ├─► SnapshotManager.CompleteSnapshot()
    │
    └─► SnapshotManager.ExportSnapshotMetadata()
            │
            ├─► Copy database to temp file
            ├─► Clean to only current snapshot data
            ├─► Dump to SQL
            ├─► Compress with zstd
            ├─► Encrypt with age
            ├─► Upload db.zst.age to S3
            └─► Upload manifest.json.zst to S3

Deduplication Strategy

  1. File-level: Files unchanged since last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)

  2. Chunk-level: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.

  3. Blob-level: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
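
The file-level check reduces to a field-by-field comparison between the stat result and the stored record; a sketch (the File struct here is illustrative, not the database model):

```go
package main

import "fmt"

// File carries the metadata fields compared for change detection,
// mirroring the list above: size, mtime, mode, uid, gid.
type File struct {
	Size  int64
	Mtime int64 // unix seconds
	Mode  uint32
	UID   int
	GID   int
}

// unchanged reports whether a file can be skipped entirely: every
// tracked metadata field matches the previous backup's record.
func unchanged(prev, cur File) bool {
	return prev.Size == cur.Size &&
		prev.Mtime == cur.Mtime &&
		prev.Mode == cur.Mode &&
		prev.UID == cur.UID &&
		prev.GID == cur.GID
}

func main() {
	a := File{Size: 1024, Mtime: 1700000000, Mode: 0o644, UID: 1000, GID: 1000}
	b := a
	b.Mtime++ // a touched file is re-chunked even if its bytes are identical
	fmt.Println(unchanged(a, a), unchanged(a, b)) // true false
}
```

Chunk-level deduplication then catches the case where a re-chunked file still produces mostly known chunks, so only genuinely new data is uploaded.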

Storage Layout in S3

bucket/
├── blobs/
│   └── {hash[0:2]}/
│       └── {hash[2:4]}/
│           └── {full-hash}          # Compressed+encrypted blob
│
└── metadata/
    └── {snapshot-id}/
        ├── db.zst.age               # Encrypted database dump
        └── manifest.json.zst        # Blob list (for verification)

Thread Safety

  • Packer: Thread-safe via mutex. Multiple goroutines can call AddChunk().
  • Scanner: Uses packerMu mutex to coordinate blob finalization.
  • Database: Single-writer mode (MaxOpenConns=1) ensures SQLite thread safety.
  • Repositories.WithTx(): Handles transaction lifecycle automatically.
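
The Packer's concurrency contract can be sketched as a mutex guarding the accumulating blob (heavily simplified; finalization, compression, and encryption are omitted):

```go
package main

import (
	"fmt"
	"sync"
)

// packer is a simplified stand-in: one mutex serializes concurrent
// AddChunk calls so the in-progress blob is never appended to racily.
type packer struct {
	mu    sync.Mutex
	blob  []byte
	count int
}

// AddChunk appends chunk data under the lock, matching the doc's
// claim that multiple goroutines may call it safely.
func (p *packer) AddChunk(data []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.blob = append(p.blob, data...)
	p.count++
}

func main() {
	p := &packer{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				p.AddChunk([]byte("chunk"))
			}
		}()
	}
	wg.Wait()
	fmt.Println(p.count, len(p.blob)) // 800 4000
}
```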