

# Vaultik Architecture
This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.
## Overview
Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using [uber-go/fx](https://github.com/uber-go/fx).
## Data Flow
```
Source Files
     │
     ▼
┌─────────────────┐
│     Scanner     │  Walks directories, detects changed files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Chunker     │  Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Packer      │  Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    S3 Client    │  Uploads blobs to remote storage
└─────────────────┘
```
## Data Model
### Core Entities
The database tracks five primary entities and their relationships:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Snapshot   │────▶│     File     │────▶│    Chunk     │
└──────────────┘     └──────────────┘     └──────────────┘
        │                                        │
        │                                        │
        ▼                                        ▼
┌──────────────┐                          ┌──────────────┐
│     Blob     │◀─────────────────────────│  BlobChunk   │
└──────────────┘                          └──────────────┘
```
### Entity Descriptions
#### File (`database.File`)
Represents a file or directory in the backup system. Stores metadata needed for restoration:
- Path, mtime
- Size, mode, ownership (uid, gid)
- Symlink target (if applicable)
#### Chunk (`database.Chunk`)
A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:
- `ChunkHash`: SHA256 hash of chunk content (primary key)
- `Size`: Chunk size in bytes
Chunk sizes vary between `avgChunkSize/4` and `avgChunkSize*4` (typically 16KB-256KB for 64KB average).
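As a quick sanity check, those bounds follow directly from the configured average via the `avg/4` and `avg*4` ratios; this small sketch (the helper name is illustrative, not Vaultik's API) reproduces them:

```go
package main

import "fmt"

// chunkBounds derives FastCDC min/max chunk sizes from the configured
// average, per the avg/4 and avg*4 ratios described above.
// (Illustrative helper; not Vaultik's actual chunker API.)
func chunkBounds(avg int) (min, max int) {
	return avg / 4, avg * 4
}

func main() {
	min, max := chunkBounds(64 * 1024) // 64 KiB average
	fmt.Printf("min=%d KiB max=%d KiB\n", min/1024, max/1024)
	// For a 64 KiB average: min=16 KiB, max=256 KiB.
}
```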
#### FileChunk (`database.FileChunk`)
Maps files to their constituent chunks:
- `FileID`: Reference to the file
- `Idx`: Position of this chunk within the file (0-indexed)
- `ChunkHash`: Reference to the chunk
#### Blob (`database.Blob`)
The final storage unit uploaded to S3. Packs many chunks, which are compressed and encrypted together:
- `ID`: UUID assigned at creation
- `Hash`: SHA256 of final compressed+encrypted content
- `UncompressedSize`: Total raw chunk data before compression
- `CompressedSize`: Size after zstd compression and age encryption
- `CreatedTS`, `FinishedTS`, `UploadedTS`: Lifecycle timestamps
Blob creation process:
1. Chunks are accumulated (up to MaxBlobSize, typically 10GB)
2. Compressed with zstd
3. Encrypted with age (recipients configured in config)
4. SHA256 hash computed → becomes filename in S3
5. Uploaded to `blobs/{hash[0:2]}/{hash[2:4]}/{hash}`
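The key layout in step 5 can be sketched as follows (`blobKey` is an illustrative name, not necessarily the real function):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobKey builds the S3 object key for a finished blob from its SHA256
// hex digest, using the blobs/{hash[0:2]}/{hash[2:4]}/{hash} layout
// described above. The two-level prefix fans objects out across many
// "directories" so no single listing grows unboundedly.
func blobKey(hexHash string) string {
	return fmt.Sprintf("blobs/%s/%s/%s", hexHash[0:2], hexHash[2:4], hexHash)
}

func main() {
	sum := sha256.Sum256([]byte("example blob content"))
	fmt.Println(blobKey(hex.EncodeToString(sum[:])))
}
```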
#### BlobChunk (`database.BlobChunk`)
Maps chunks to their position within blobs:
- `BlobID`: Reference to the blob
- `ChunkHash`: Reference to the chunk
- `Offset`: Byte offset within the uncompressed blob
- `Length`: Chunk size
#### Snapshot (`database.Snapshot`)
Represents a point-in-time backup:
- `ID`: Format is `{hostname}-{YYYYMMDD}-{HHMMSS}Z`
- Tracks file count, chunk count, blob count, sizes, compression ratio
- `CompletedAt`: Null until snapshot finishes successfully
#### SnapshotFile / SnapshotBlob
Join tables linking snapshots to their files and blobs.
### Relationship Summary
```
Snapshot 1 ────▶ N SnapshotFile N ◀──── 1 File
Snapshot 1 ────▶ N SnapshotBlob N ◀──── 1 Blob
File     1 ────▶ N FileChunk    N ◀──── 1 Chunk
Blob     1 ────▶ N BlobChunk    N ◀──── 1 Chunk
```
## Type Instantiation
### Application Startup
The CLI uses fx for dependency injection. Here's the instantiation order:
```go
// cli/app.go: NewApp()
fx.New(
	fx.Supply(config.ConfigPath(opts.ConfigPath)), // 1. Config path
	fx.Supply(opts.LogOptions),                    // 2. Log options
	fx.Provide(globals.New),                       // 3. Globals
	fx.Provide(log.New),                           // 4. Logger config
	config.Module,                                 // 5. Config
	database.Module,                               // 6. Database + Repositories
	log.Module,                                    // 7. Logger initialization
	s3.Module,                                     // 8. S3 client
	snapshot.Module,                               // 9. SnapshotManager + ScannerFactory
	fx.Provide(vaultik.New),                       // 10. Vaultik orchestrator
)
```
### Key Type Instantiation Points
#### 1. Config (`config.Config`)
- **Created by**: `config.Module` via `config.LoadConfig()`
- **When**: Application startup (fx DI)
- **Contains**: All configuration from YAML file (S3 credentials, encryption keys, paths, etc.)
#### 2. Database (`database.DB`)
- **Created by**: `database.Module` via `database.New()`
- **When**: Application startup (fx DI)
- **Contains**: SQLite connection, path reference
#### 3. Repositories (`database.Repositories`)
- **Created by**: `database.Module` via `database.NewRepositories()`
- **When**: Application startup (fx DI)
- **Contains**: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)
#### 4. Vaultik (`vaultik.Vaultik`)
- **Created by**: `vaultik.New(VaultikParams)`
- **When**: Application startup (fx DI)
- **Contains**: All dependencies for backup operations
```go
type Vaultik struct {
	Globals         *globals.Globals
	Config          *config.Config
	DB              *database.DB
	Repositories    *database.Repositories
	S3Client        *s3.Client
	ScannerFactory  snapshot.ScannerFactory
	SnapshotManager *snapshot.SnapshotManager
	Shutdowner      fx.Shutdowner
	Fs              afero.Fs

	ctx    context.Context
	cancel context.CancelFunc
}
```
#### 5. SnapshotManager (`snapshot.SnapshotManager`)
- **Created by**: `snapshot.Module` via `snapshot.NewSnapshotManager()`
- **When**: Application startup (fx DI)
- **Responsibility**: Creates/completes snapshots, exports metadata to S3
#### 6. Scanner (`snapshot.Scanner`)
- **Created by**: `ScannerFactory(ScannerParams)`
- **When**: Each `CreateSnapshot()` call
- **Contains**: Chunker, Packer, progress reporter
```go
// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
	EnableProgress: !opts.Cron,
	Fs:             v.Fs,
})
```
#### 7. Chunker (`chunker.Chunker`)
- **Created by**: `chunker.NewChunker(avgChunkSize)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
- `avgChunkSize`: From config (typically 64KB)
- `minChunkSize`: avgChunkSize / 4
- `maxChunkSize`: avgChunkSize * 4
#### 8. Packer (`blob.Packer`)
- **Created by**: `blob.NewPacker(PackerConfig)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
- `MaxBlobSize`: Maximum blob size before finalization (typically 10GB)
- `CompressionLevel`: zstd level (1-19)
- `Recipients`: age public keys for encryption
```go
// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
	MaxBlobSize:      cfg.MaxBlobSize,
	CompressionLevel: cfg.CompressionLevel,
	Recipients:       cfg.AgeRecipients,
	Repositories:     cfg.Repositories,
	Fs:               cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)
```
## Module Responsibilities
### `internal/cli`
Entry point for fx application. Combines all modules and handles signal interrupts.
Key functions:
- `NewApp(AppOptions)` → Creates fx.App with all modules
- `RunApp(ctx, app)` → Starts app, handles graceful shutdown
- `RunWithApp(ctx, opts)` → Convenience wrapper
### `internal/vaultik`
Main orchestrator containing all dependencies and command implementations.
Key methods:
- `New(VaultikParams)` → Constructor (fx DI)
- `CreateSnapshot(opts)` → Main backup operation
- `ListSnapshots(jsonOutput)` → List available snapshots
- `VerifySnapshot(id, deep)` → Verify snapshot integrity
- `PurgeSnapshots(...)` → Remove old snapshots
### `internal/chunker`
Content-defined chunking using FastCDC algorithm.
Key types:
- `Chunk` → Hash, Data, Offset, Size
- `Chunker` → avgChunkSize, minChunkSize, maxChunkSize
Key methods:
- `NewChunker(avgChunkSize)` → Constructor
- `ChunkReaderStreaming(reader, callback)` → Stream chunks with callback (preferred)
- `ChunkReader(reader)` → Return all chunks at once (memory-intensive)
### `internal/blob`
Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.
Key types:
- `Packer` → Thread-safe blob accumulator
- `ChunkRef` → Hash + Data for adding to packer
- `FinishedBlob` → Completed blob ready for upload
- `BlobWithReader` → FinishedBlob + io.Reader for streaming upload
Key methods:
- `NewPacker(PackerConfig)` → Constructor
- `AddChunk(ChunkRef)` → Add chunk to current blob
- `FinalizeBlob()` → Compress, encrypt, hash current blob
- `Flush()` → Finalize any in-progress blob
- `SetBlobHandler(func)` → Set callback for upload
### `internal/snapshot`
#### Scanner
Orchestrates the backup process for a directory.
Key methods:
- `NewScanner(ScannerConfig)` → Constructor (creates Chunker + Packer)
- `Scan(ctx, path, snapshotID)` → Main scan operation
Scan phases:
1. **Phase 0**: Detect deleted files from previous snapshots
2. **Phase 1**: Walk directory, identify files needing processing
3. **Phase 2**: Process files (chunk → pack → upload)
#### SnapshotManager
Manages snapshot lifecycle and metadata export.
Key methods:
- `CreateSnapshot(ctx, hostname, version, commit)` → Create snapshot record
- `CompleteSnapshot(ctx, snapshotID)` → Mark snapshot complete
- `ExportSnapshotMetadata(ctx, dbPath, snapshotID)` → Export to S3
- `CleanupIncompleteSnapshots(ctx, hostname)` → Remove failed snapshots
### `internal/database`
SQLite database for local index. Single-writer mode for thread safety.
Key types:
- `DB` → Database connection wrapper
- `Repositories` → Collection of all repository interfaces
Repository interfaces:
- `FilesRepository` → CRUD for File records
- `ChunksRepository` → CRUD for Chunk records
- `BlobsRepository` → CRUD for Blob records
- `SnapshotsRepository` → CRUD for Snapshot records
- Plus join table repositories (FileChunks, BlobChunks, etc.)
## Snapshot Creation Flow
```
CreateSnapshot(opts)
├─► CleanupIncompleteSnapshots()        // Critical: avoid dedup errors
├─► SnapshotManager.CreateSnapshot()    // Create DB record
├─► For each source directory:
│   │
│   ├─► scanner.Scan(ctx, path, snapshotID)
│   │   │
│   │   ├─► Phase 0: detectDeletedFiles()
│   │   │
│   │   ├─► Phase 1: scanPhase()
│   │   │       Walk directory
│   │   │       Check file metadata changes
│   │   │       Build list of files to process
│   │   │
│   │   └─► Phase 2: processPhase()
│   │           For each file:
│   │             chunker.ChunkReaderStreaming()
│   │             For each chunk:
│   │               packer.AddChunk()
│   │               If blob full → FinalizeBlob()
│   │                 → handleBlobReady()
│   │                 → s3Client.PutObjectWithProgress()
│   │           packer.Flush()          // Final blob
│   │
│   └─► Accumulate statistics
├─► SnapshotManager.UpdateSnapshotStatsExtended()
├─► SnapshotManager.CompleteSnapshot()
└─► SnapshotManager.ExportSnapshotMetadata()
        ├─► Copy database to temp file
        ├─► Clean to only current snapshot data
        ├─► Dump to SQL
        ├─► Compress with zstd
        ├─► Encrypt with age
        ├─► Upload db.zst.age to S3
        └─► Upload manifest.json.zst to S3
```
## Deduplication Strategy
1. **File-level**: Files unchanged since last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)
2. **Chunk-level**: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.
3. **Blob-level**: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
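A minimal sketch of the chunk-level check, using an in-memory set where Vaultik queries its SQLite index via the chunks repository:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seenChunks stands in for the chunks table: a lookup by SHA256 hash.
// (In Vaultik this is a SQLite query, not a map.)
var seenChunks = map[string]bool{}

// addChunk records a chunk only when its hash is new, mirroring the
// chunk-level deduplication described above. It reports whether the
// chunk data actually needs to be packed and uploaded.
func addChunk(data []byte) (uploaded bool) {
	sum := sha256.Sum256(data)
	h := hex.EncodeToString(sum[:])
	if seenChunks[h] {
		return false // hash already known; skip re-upload
	}
	seenChunks[h] = true
	return true
}

func main() {
	fmt.Println(addChunk([]byte("hello"))) // true  (new chunk)
	fmt.Println(addChunk([]byte("hello"))) // false (deduplicated)
	fmt.Println(addChunk([]byte("world"))) // true
}
```

Because chunks are content-addressed, identical data in two different files (or two snapshots) hashes to the same key and is stored exactly once.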
## Storage Layout in S3
```
bucket/
├── blobs/
│   └── {hash[0:2]}/
│       └── {hash[2:4]}/
│           └── {full-hash}       # Compressed+encrypted blob
└── metadata/
    └── {snapshot-id}/
        ├── db.zst.age            # Encrypted database dump
        └── manifest.json.zst     # Blob list (for verification)
```
## Thread Safety
- `Packer`: Thread-safe via mutex. Multiple goroutines can call `AddChunk()`.
- `Scanner`: Uses `packerMu` mutex to coordinate blob finalization.
- `Database`: Single-writer mode (`MaxOpenConns=1`) ensures SQLite thread safety.
- `Repositories.WithTx()`: Handles transaction lifecycle automatically.
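The `Packer` locking discipline can be sketched as follows (a simplified stand-in; the real type also buffers chunk data, and compresses, encrypts, and finalizes blobs under the same lock):

```go
package main

import (
	"fmt"
	"sync"
)

// packer sketches the mutex-protected accumulator pattern described
// above: many goroutines may call AddChunk concurrently, and the mutex
// serializes updates to the in-progress blob's state.
type packer struct {
	mu    sync.Mutex
	total int // bytes accumulated in the current blob
}

func (p *packer) AddChunk(data []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.total += len(data)
	// The real Packer would also check total against MaxBlobSize here
	// and finalize the blob when it is full.
}

func main() {
	p := &packer{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				p.AddChunk(make([]byte, 10))
			}
		}()
	}
	wg.Wait()
	fmt.Println(p.total) // 8000: no updates lost despite 8 writers
}
```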