

# Vaultik Architecture
This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.
## Overview
Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using [uber-go/fx](https://github.com/uber-go/fx).
## Data Flow
```
Source Files
     │
     ▼
┌─────────────────┐
│     Scanner     │  Walks directories, detects changed files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Chunker     │  Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Packer      │  Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    S3 Client    │  Uploads blobs to remote storage
└─────────────────┘
```
## Data Model
### Core Entities
The database tracks five primary entities and their relationships:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Snapshot   │────▶│     File     │────▶│    Chunk     │
└──────────────┘     └──────────────┘     └──────────────┘
        │                                        │
        │                                        │
        ▼                                        ▼
┌──────────────┐                          ┌──────────────┐
│     Blob     │◀─────────────────────────│  BlobChunk   │
└──────────────┘                          └──────────────┘
```
### Entity Descriptions
#### File (`database.File`)
Represents a file or directory in the backup system. Stores metadata needed for restoration:
- Path, mtime
- Size, mode, ownership (uid, gid)
- Symlink target (if applicable)
#### Chunk (`database.Chunk`)
A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:
- `ChunkHash`: SHA256 hash of chunk content (primary key)
- `Size`: Chunk size in bytes
Chunk sizes vary between `avgChunkSize/4` and `avgChunkSize*4` (typically 16KB-256KB for 64KB average).
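As a quick sanity check, those bounds follow directly from the configured average via the `avg/4` and `avg*4` ratios; this small sketch (the helper name is illustrative, not Vaultik's API) reproduces them:

```go
package main

import "fmt"

// chunkBounds derives FastCDC min/max chunk sizes from the configured
// average, per the avg/4 and avg*4 ratios described above.
// (Illustrative helper; not Vaultik's actual chunker API.)
func chunkBounds(avg int) (min, max int) {
	return avg / 4, avg * 4
}

func main() {
	min, max := chunkBounds(64 * 1024) // 64 KiB average
	fmt.Printf("min=%d KiB max=%d KiB\n", min/1024, max/1024)
	// For a 64 KiB average: min=16 KiB, max=256 KiB.
}
```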
#### FileChunk (`database.FileChunk`)
Maps files to their constituent chunks:
- `FileID`: Reference to the file
- `Idx`: Position of this chunk within the file (0-indexed)
- `ChunkHash`: Reference to the chunk
#### Blob (`database.Blob`)
The final storage unit uploaded to S3. Packs many chunks, which are compressed and encrypted together:
- `ID`: UUID assigned at creation
- `Hash`: SHA256 of final compressed+encrypted content
- `UncompressedSize`: Total raw chunk data before compression
- `CompressedSize`: Size after zstd compression and age encryption
- `CreatedTS`, `FinishedTS`, `UploadedTS`: Lifecycle timestamps
Blob creation process:
1. Chunks are accumulated (up to MaxBlobSize, typically 10GB)
2. Compressed with zstd
3. Encrypted with age (recipients configured in config)
4. SHA256 hash computed → becomes filename in S3
5. Uploaded to `blobs/{hash[0:2]}/{hash[2:4]}/{hash}`
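The key layout in step 5 can be sketched as follows (`blobKey` is an illustrative name, not necessarily the real function):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobKey builds the S3 object key for a finished blob from its SHA256
// hex digest, using the blobs/{hash[0:2]}/{hash[2:4]}/{hash} layout
// described above. The two-level prefix fans objects out across many
// "directories" so no single listing grows unboundedly.
func blobKey(hexHash string) string {
	return fmt.Sprintf("blobs/%s/%s/%s", hexHash[0:2], hexHash[2:4], hexHash)
}

func main() {
	sum := sha256.Sum256([]byte("example blob content"))
	fmt.Println(blobKey(hex.EncodeToString(sum[:])))
}
```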
#### BlobChunk (`database.BlobChunk`)
Maps chunks to their position within blobs:
- `BlobID`: Reference to the blob
- `ChunkHash`: Reference to the chunk
- `Offset`: Byte offset within the uncompressed blob
- `Length`: Chunk size
#### Snapshot (`database.Snapshot`)
Represents a point-in-time backup:
- `ID`: Format is `{hostname}-{YYYYMMDD}-{HHMMSS}Z`
- Tracks file count, chunk count, blob count, sizes, compression ratio
- `CompletedAt`: Null until snapshot finishes successfully
#### SnapshotFile / SnapshotBlob
Join tables linking snapshots to their files and blobs.
### Relationship Summary
```
Snapshot 1 ────▶ N SnapshotFile N ◀──── 1 File
Snapshot 1 ────▶ N SnapshotBlob N ◀──── 1 Blob
File     1 ────▶ N FileChunk    N ◀──── 1 Chunk
Blob     1 ────▶ N BlobChunk    N ◀──── 1 Chunk
```
## Type Instantiation
### Application Startup
The CLI uses fx for dependency injection. Here's the instantiation order:
```go
// cli/app.go: NewApp()
fx.New(
	fx.Supply(config.ConfigPath(opts.ConfigPath)), // 1. Config path
	fx.Supply(opts.LogOptions),                    // 2. Log options
	fx.Provide(globals.New),                       // 3. Globals
	fx.Provide(log.New),                           // 4. Logger config
	config.Module,                                 // 5. Config
	database.Module,                               // 6. Database + Repositories
	log.Module,                                    // 7. Logger initialization
	s3.Module,                                     // 8. S3 client
	snapshot.Module,                               // 9. SnapshotManager + ScannerFactory
	fx.Provide(vaultik.New),                       // 10. Vaultik orchestrator
)
```
### Key Type Instantiation Points
#### 1. Config (`config.Config`)
- **Created by**: `config.Module` via `config.LoadConfig()`
- **When**: Application startup (fx DI)
- **Contains**: All configuration from YAML file (S3 credentials, encryption keys, paths, etc.)
#### 2. Database (`database.DB`)
- **Created by**: `database.Module` via `database.New()`
- **When**: Application startup (fx DI)
- **Contains**: SQLite connection, path reference
#### 3. Repositories (`database.Repositories`)
- **Created by**: `database.Module` via `database.NewRepositories()`
- **When**: Application startup (fx DI)
- **Contains**: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)
#### 4. Vaultik (`vaultik.Vaultik`)
- **Created by**: `vaultik.New(VaultikParams)`
- **When**: Application startup (fx DI)
- **Contains**: All dependencies for backup operations
```go
type Vaultik struct {
	Globals         *globals.Globals
	Config          *config.Config
	DB              *database.DB
	Repositories    *database.Repositories
	S3Client        *s3.Client
	ScannerFactory  snapshot.ScannerFactory
	SnapshotManager *snapshot.SnapshotManager
	Shutdowner      fx.Shutdowner
	Fs              afero.Fs

	ctx    context.Context
	cancel context.CancelFunc
}
```
#### 5. SnapshotManager (`snapshot.SnapshotManager`)
- **Created by**: `snapshot.Module` via `snapshot.NewSnapshotManager()`
- **When**: Application startup (fx DI)
- **Responsibility**: Creates/completes snapshots, exports metadata to S3
#### 6. Scanner (`snapshot.Scanner`)
- **Created by**: `ScannerFactory(ScannerParams)`
- **When**: Each `CreateSnapshot()` call
- **Contains**: Chunker, Packer, progress reporter
```go
// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
	EnableProgress: !opts.Cron,
	Fs:             v.Fs,
})
```
#### 7. Chunker (`chunker.Chunker`)
- **Created by**: `chunker.NewChunker(avgChunkSize)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
- `avgChunkSize`: From config (typically 64KB)
- `minChunkSize`: avgChunkSize / 4
- `maxChunkSize`: avgChunkSize * 4
#### 8. Packer (`blob.Packer`)
- **Created by**: `blob.NewPacker(PackerConfig)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
- `MaxBlobSize`: Maximum blob size before finalization (typically 10GB)
- `CompressionLevel`: zstd level (1-19)
- `Recipients`: age public keys for encryption
```go
// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
	MaxBlobSize:      cfg.MaxBlobSize,
	CompressionLevel: cfg.CompressionLevel,
	Recipients:       cfg.AgeRecipients,
	Repositories:     cfg.Repositories,
	Fs:               cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)
```
## Module Responsibilities
### `internal/cli`
Entry point for fx application. Combines all modules and handles signal interrupts.
Key functions:
- `NewApp(AppOptions)` → Creates fx.App with all modules
- `RunApp(ctx, app)` → Starts app, handles graceful shutdown
- `RunWithApp(ctx, opts)` → Convenience wrapper
### `internal/vaultik`
Main orchestrator containing all dependencies and command implementations.
Key methods:
- `New(VaultikParams)` → Constructor (fx DI)
- `CreateSnapshot(opts)` → Main backup operation
- `ListSnapshots(jsonOutput)` → List available snapshots
- `VerifySnapshot(id, deep)` → Verify snapshot integrity
- `PurgeSnapshots(...)` → Remove old snapshots
### `internal/chunker`
Content-defined chunking using FastCDC algorithm.
Key types:
- `Chunk` → Hash, Data, Offset, Size
- `Chunker` → avgChunkSize, minChunkSize, maxChunkSize
Key methods:
- `NewChunker(avgChunkSize)` → Constructor
- `ChunkReaderStreaming(reader, callback)` → Stream chunks with callback (preferred)
- `ChunkReader(reader)` → Return all chunks at once (memory-intensive)
### `internal/blob`
Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.
Key types:
- `Packer` → Thread-safe blob accumulator
- `ChunkRef` → Hash + Data for adding to packer
- `FinishedBlob` → Completed blob ready for upload
- `BlobWithReader` → FinishedBlob + io.Reader for streaming upload
Key methods:
- `NewPacker(PackerConfig)` → Constructor
- `AddChunk(ChunkRef)` → Add chunk to current blob
- `FinalizeBlob()` → Compress, encrypt, hash current blob
- `Flush()` → Finalize any in-progress blob
- `SetBlobHandler(func)` → Set callback for upload
### `internal/snapshot`
#### Scanner
Orchestrates the backup process for a directory.
Key methods:
- `NewScanner(ScannerConfig)` → Constructor (creates Chunker + Packer)
- `Scan(ctx, path, snapshotID)` → Main scan operation
Scan phases:
1. **Phase 0**: Detect deleted files from previous snapshots
2. **Phase 1**: Walk directory, identify files needing processing
3. **Phase 2**: Process files (chunk → pack → upload)
#### SnapshotManager
Manages snapshot lifecycle and metadata export.
Key methods:
- `CreateSnapshot(ctx, hostname, version, commit)` → Create snapshot record
- `CompleteSnapshot(ctx, snapshotID)` → Mark snapshot complete
- `ExportSnapshotMetadata(ctx, dbPath, snapshotID)` → Export to S3
- `CleanupIncompleteSnapshots(ctx, hostname)` → Remove failed snapshots
### `internal/database`
SQLite database for local index. Single-writer mode for thread safety.
Key types:
- `DB` → Database connection wrapper
- `Repositories` → Collection of all repository interfaces
Repository interfaces:
- `FilesRepository` → CRUD for File records
- `ChunksRepository` → CRUD for Chunk records
- `BlobsRepository` → CRUD for Blob records
- `SnapshotsRepository` → CRUD for Snapshot records
- Plus join table repositories (FileChunks, BlobChunks, etc.)
## Snapshot Creation Flow
```
CreateSnapshot(opts)
├─► CleanupIncompleteSnapshots()        // Critical: avoid dedup errors
├─► SnapshotManager.CreateSnapshot()    // Create DB record
├─► For each source directory:
│   │
│   ├─► scanner.Scan(ctx, path, snapshotID)
│   │   │
│   │   ├─► Phase 0: detectDeletedFiles()
│   │   │
│   │   ├─► Phase 1: scanPhase()
│   │   │       Walk directory
│   │   │       Check file metadata changes
│   │   │       Build list of files to process
│   │   │
│   │   └─► Phase 2: processPhase()
│   │           For each file:
│   │             chunker.ChunkReaderStreaming()
│   │             For each chunk:
│   │               packer.AddChunk()
│   │               If blob full → FinalizeBlob()
│   │                 → handleBlobReady()
│   │                 → s3Client.PutObjectWithProgress()
│   │           packer.Flush()          // Final blob
│   │
│   └─► Accumulate statistics
├─► SnapshotManager.UpdateSnapshotStatsExtended()
├─► SnapshotManager.CompleteSnapshot()
└─► SnapshotManager.ExportSnapshotMetadata()
        ├─► Copy database to temp file
        ├─► Clean to only current snapshot data
        ├─► Dump to SQL
        ├─► Compress with zstd
        ├─► Encrypt with age
        ├─► Upload db.zst.age to S3
        └─► Upload manifest.json.zst to S3
```
## Deduplication Strategy
1. **File-level**: Files unchanged since last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)
2. **Chunk-level**: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.
3. **Blob-level**: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
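A minimal sketch of the chunk-level check, using an in-memory set where Vaultik queries its SQLite index via the chunks repository:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seenChunks stands in for the chunks table: a lookup by SHA256 hash.
// (In Vaultik this is a SQLite query, not a map.)
var seenChunks = map[string]bool{}

// addChunk records a chunk only when its hash is new, mirroring the
// chunk-level deduplication described above. It reports whether the
// chunk data actually needs to be packed and uploaded.
func addChunk(data []byte) (uploaded bool) {
	sum := sha256.Sum256(data)
	h := hex.EncodeToString(sum[:])
	if seenChunks[h] {
		return false // hash already known; skip re-upload
	}
	seenChunks[h] = true
	return true
}

func main() {
	fmt.Println(addChunk([]byte("hello"))) // true  (new chunk)
	fmt.Println(addChunk([]byte("hello"))) // false (deduplicated)
	fmt.Println(addChunk([]byte("world"))) // true
}
```

Because chunks are content-addressed, identical data in two different files (or two snapshots) hashes to the same key and is stored exactly once.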
## Storage Layout in S3
```
bucket/
├── blobs/
│   └── {hash[0:2]}/
│       └── {hash[2:4]}/
│           └── {full-hash}       # Compressed+encrypted blob
└── metadata/
    └── {snapshot-id}/
        ├── db.zst.age            # Encrypted database dump
        └── manifest.json.zst     # Blob list (for verification)
```
## Thread Safety
- `Packer`: Thread-safe via mutex. Multiple goroutines can call `AddChunk()`.
- `Scanner`: Uses `packerMu` mutex to coordinate blob finalization.
- `Database`: Single-writer mode (`MaxOpenConns=1`) ensures SQLite thread safety.
- `Repositories.WithTx()`: Handles transaction lifecycle automatically.
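The `Packer` locking discipline can be sketched as follows (a simplified stand-in; the real type also buffers chunk data, and compresses, encrypts, and finalizes blobs under the same lock):

```go
package main

import (
	"fmt"
	"sync"
)

// packer sketches the mutex-protected accumulator pattern described
// above: many goroutines may call AddChunk concurrently, and the mutex
// serializes updates to the in-progress blob's state.
type packer struct {
	mu    sync.Mutex
	total int // bytes accumulated in the current blob
}

func (p *packer) AddChunk(data []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.total += len(data)
	// The real Packer would also check total against MaxBlobSize here
	// and finalize the blob when it is full.
}

func main() {
	p := &packer{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				p.AddChunk(make([]byte, 10))
			}
		}()
	}
	wg.Wait()
	fmt.Println(p.total) // 8000: no updates lost despite 8 writers
}
```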