# Vaultik Architecture

This document describes the internal architecture of Vaultik, focusing on the data model, type instantiation, and the relationships between core modules.

## Overview

Vaultik is a backup system that uses content-defined chunking for deduplication and packs chunks into large, compressed, encrypted blobs for efficient cloud storage. The system is built around dependency injection using [uber-go/fx](https://github.com/uber-go/fx).

## Data Flow

```
Source Files
      │
      ▼
┌─────────────────┐
│     Scanner     │  Walks directories, detects changed files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Chunker     │  Splits files into variable-size chunks (FastCDC)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Packer      │  Accumulates chunks, compresses (zstd), encrypts (age)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    S3 Client    │  Uploads blobs to remote storage
└─────────────────┘
```

## Data Model

### Core Entities

The database tracks five primary entities and their relationships:

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Snapshot   │────▶│     File     │────▶│    Chunk     │
└──────────────┘     └──────────────┘     └──────────────┘
        │                                        │
        │                                        │
        ▼                                        ▼
┌──────────────┐                         ┌──────────────┐
│     Blob     │◀────────────────────────│  BlobChunk   │
└──────────────┘                         └──────────────┘
```

### Entity Descriptions

#### File (`database.File`)

Represents a file or directory in the backup system. Stores metadata needed for restoration:

- Path, mtime
- Size, mode, ownership (uid, gid)
- Symlink target (if applicable)

#### Chunk (`database.Chunk`)

A content-addressed unit of data. Files are split into variable-size chunks using the FastCDC algorithm:

- `ChunkHash`: SHA256 hash of chunk content (primary key)
- `Size`: Chunk size in bytes

Chunk sizes vary between `avgChunkSize/4` and `avgChunkSize*4` (typically 16KB-256KB for a 64KB average).
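
The avg/4 and avg*4 rule can be sketched as a small helper; `chunkBounds` is a hypothetical name for illustration, not the real constructor in `internal/chunker`:

```go
package main

import "fmt"

// chunkBounds derives the FastCDC min/max chunk sizes from the average,
// per the avgChunkSize/4 and avgChunkSize*4 rule described above.
// (Illustrative sketch; the real bounds are set in chunker.NewChunker.)
func chunkBounds(avgChunkSize int) (minSize, maxSize int) {
	return avgChunkSize / 4, avgChunkSize * 4
}

func main() {
	lo, hi := chunkBounds(64 * 1024) // 64 KB average
	fmt.Println(lo, hi)              // 16384 262144
}
```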

#### FileChunk (`database.FileChunk`)

Maps files to their constituent chunks:

- `FileID`: Reference to the file
- `Idx`: Position of this chunk within the file (0-indexed)
- `ChunkHash`: Reference to the chunk

#### Blob (`database.Blob`)

The final storage unit uploaded to S3. Contains many chunks, compressed and encrypted as a unit:

- `ID`: UUID assigned at creation
- `Hash`: SHA256 of final compressed+encrypted content
- `UncompressedSize`: Total raw chunk data before compression
- `CompressedSize`: Size after zstd compression and age encryption
- `CreatedTS`, `FinishedTS`, `UploadedTS`: Lifecycle timestamps

Blob creation process:

1. Chunks are accumulated (up to MaxBlobSize, typically 10GB)
2. Compressed with zstd
3. Encrypted with age (recipients configured in config)
4. SHA256 hash computed → becomes the filename in S3
5. Uploaded to `blobs/{hash[0:2]}/{hash[2:4]}/{hash}`

#### BlobChunk (`database.BlobChunk`)

Maps chunks to their position within blobs:

- `BlobID`: Reference to the blob
- `ChunkHash`: Reference to the chunk
- `Offset`: Byte offset within the uncompressed blob
- `Length`: Chunk size

#### Snapshot (`database.Snapshot`)

Represents a point-in-time backup:

- `ID`: Format is `{hostname}-{YYYYMMDD}-{HHMMSS}Z`
- Tracks file count, chunk count, blob count, sizes, compression ratio
- `CompletedAt`: Null until snapshot finishes successfully

#### SnapshotFile / SnapshotBlob

Join tables linking snapshots to their files and blobs.

### Relationship Summary

```
Snapshot 1──────────▶ N SnapshotFile N ◀────────── 1 File
Snapshot 1──────────▶ N SnapshotBlob N ◀────────── 1 Blob
File     1──────────▶ N FileChunk    N ◀────────── 1 Chunk
Blob     1──────────▶ N BlobChunk    N ◀────────── 1 Chunk
```

## Type Instantiation

### Application Startup

The CLI uses fx for dependency injection. Here's the instantiation order:

```go
// cli/app.go: NewApp()
fx.New(
	fx.Supply(config.ConfigPath(opts.ConfigPath)), // 1. Config path
	fx.Supply(opts.LogOptions),                    // 2. Log options
	fx.Provide(globals.New),                       // 3. Globals
	fx.Provide(log.New),                           // 4. Logger config
	config.Module,                                 // 5. Config
	database.Module,                               // 6. Database + Repositories
	log.Module,                                    // 7. Logger initialization
	s3.Module,                                     // 8. S3 client
	snapshot.Module,                               // 9. SnapshotManager + ScannerFactory
	fx.Provide(vaultik.New),                       // 10. Vaultik orchestrator
)
```

### Key Type Instantiation Points

#### 1. Config (`config.Config`)

- **Created by**: `config.Module` via `config.LoadConfig()`
- **When**: Application startup (fx DI)
- **Contains**: All configuration from the YAML file (S3 credentials, encryption keys, paths, etc.)

#### 2. Database (`database.DB`)

- **Created by**: `database.Module` via `database.New()`
- **When**: Application startup (fx DI)
- **Contains**: SQLite connection, path reference

#### 3. Repositories (`database.Repositories`)

- **Created by**: `database.Module` via `database.NewRepositories()`
- **When**: Application startup (fx DI)
- **Contains**: All repository interfaces (Files, Chunks, Blobs, Snapshots, etc.)

#### 4. Vaultik (`vaultik.Vaultik`)

- **Created by**: `vaultik.New(VaultikParams)`
- **When**: Application startup (fx DI)
- **Contains**: All dependencies for backup operations

```go
type Vaultik struct {
	Globals         *globals.Globals
	Config          *config.Config
	DB              *database.DB
	Repositories    *database.Repositories
	S3Client        *s3.Client
	ScannerFactory  snapshot.ScannerFactory
	SnapshotManager *snapshot.SnapshotManager
	Shutdowner      fx.Shutdowner
	Fs              afero.Fs

	ctx    context.Context
	cancel context.CancelFunc
}
```

#### 5. SnapshotManager (`snapshot.SnapshotManager`)

- **Created by**: `snapshot.Module` via `snapshot.NewSnapshotManager()`
- **When**: Application startup (fx DI)
- **Responsibility**: Creates/completes snapshots, exports metadata to S3

#### 6. Scanner (`snapshot.Scanner`)

- **Created by**: `ScannerFactory(ScannerParams)`
- **When**: Each `CreateSnapshot()` call
- **Contains**: Chunker, Packer, progress reporter

```go
// vaultik/snapshot.go: CreateSnapshot()
scanner := v.ScannerFactory(snapshot.ScannerParams{
	EnableProgress: !opts.Cron,
	Fs:             v.Fs,
})
```
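
The factory pattern used here — long-lived dependencies captured once at startup, per-snapshot options supplied at call time — can be sketched as a closure. All types and names below are hypothetical stand-ins, not the real `snapshot` package API:

```go
package main

import "fmt"

// ScannerParams carries the per-call options, mirroring the usage above.
type ScannerParams struct {
	EnableProgress bool
}

// Scanner is a stand-in for the real snapshot.Scanner.
type Scanner struct {
	progress bool
}

// ScannerFactory builds a Scanner per CreateSnapshot() call.
type ScannerFactory func(ScannerParams) *Scanner

// newScannerFactory plays the role of the constructor that fx provides:
// it closes over startup-time dependencies (just a label here) and
// defers per-snapshot options to call time.
func newScannerFactory(dbLabel string) ScannerFactory {
	return func(p ScannerParams) *Scanner {
		_ = dbLabel // real code would wire DB, repositories, packer, etc.
		return &Scanner{progress: p.EnableProgress}
	}
}

func main() {
	factory := newScannerFactory("sqlite")
	s := factory(ScannerParams{EnableProgress: true})
	fmt.Println(s.progress) // true
}
```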

#### 7. Chunker (`chunker.Chunker`)

- **Created by**: `chunker.NewChunker(avgChunkSize)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
  - `avgChunkSize`: From config (typically 64KB)
  - `minChunkSize`: avgChunkSize / 4
  - `maxChunkSize`: avgChunkSize * 4

#### 8. Packer (`blob.Packer`)

- **Created by**: `blob.NewPacker(PackerConfig)`
- **When**: Inside `snapshot.NewScanner()`
- **Configuration**:
  - `MaxBlobSize`: Maximum blob size before finalization (typically 10GB)
  - `CompressionLevel`: zstd level (1-19)
  - `Recipients`: age public keys for encryption

```go
// snapshot/scanner.go: NewScanner()
packerCfg := blob.PackerConfig{
	MaxBlobSize:      cfg.MaxBlobSize,
	CompressionLevel: cfg.CompressionLevel,
	Recipients:       cfg.AgeRecipients,
	Repositories:     cfg.Repositories,
	Fs:               cfg.FS,
}
packer, err := blob.NewPacker(packerCfg)
```

## Module Responsibilities

### `internal/cli`

Entry point for the fx application. Combines all modules and handles signal interrupts.

Key functions:

- `NewApp(AppOptions)` → Creates fx.App with all modules
- `RunApp(ctx, app)` → Starts app, handles graceful shutdown
- `RunWithApp(ctx, opts)` → Convenience wrapper

### `internal/vaultik`

Main orchestrator containing all dependencies and command implementations.

Key methods:

- `New(VaultikParams)` → Constructor (fx DI)
- `CreateSnapshot(opts)` → Main backup operation
- `ListSnapshots(jsonOutput)` → List available snapshots
- `VerifySnapshot(id, deep)` → Verify snapshot integrity
- `PurgeSnapshots(...)` → Remove old snapshots

### `internal/chunker`

Content-defined chunking using the FastCDC algorithm.

Key types:

- `Chunk` → Hash, Data, Offset, Size
- `Chunker` → avgChunkSize, minChunkSize, maxChunkSize

Key methods:

- `NewChunker(avgChunkSize)` → Constructor
- `ChunkReaderStreaming(reader, callback)` → Stream chunks with a callback (preferred)
- `ChunkReader(reader)` → Return all chunks at once (memory-intensive)

### `internal/blob`

Blob packing: accumulates chunks, compresses, encrypts, tracks metadata.

Key types:

- `Packer` → Thread-safe blob accumulator
- `ChunkRef` → Hash + Data for adding to packer
- `FinishedBlob` → Completed blob ready for upload
- `BlobWithReader` → FinishedBlob + io.Reader for streaming upload

Key methods:

- `NewPacker(PackerConfig)` → Constructor
- `AddChunk(ChunkRef)` → Add chunk to current blob
- `FinalizeBlob()` → Compress, encrypt, hash current blob
- `Flush()` → Finalize any in-progress blob
- `SetBlobHandler(func)` → Set callback for upload

### `internal/snapshot`

#### Scanner

Orchestrates the backup process for a directory.

Key methods:

- `NewScanner(ScannerConfig)` → Constructor (creates Chunker + Packer)
- `Scan(ctx, path, snapshotID)` → Main scan operation

Scan phases:

1. **Phase 0**: Detect deleted files from previous snapshots
2. **Phase 1**: Walk directory, identify files needing processing
3. **Phase 2**: Process files (chunk → pack → upload)

#### SnapshotManager

Manages snapshot lifecycle and metadata export.

Key methods:

- `CreateSnapshot(ctx, hostname, version, commit)` → Create snapshot record
- `CompleteSnapshot(ctx, snapshotID)` → Mark snapshot complete
- `ExportSnapshotMetadata(ctx, dbPath, snapshotID)` → Export to S3
- `CleanupIncompleteSnapshots(ctx, hostname)` → Remove failed snapshots

### `internal/database`

SQLite database for the local index. Single-writer mode for thread safety.

Key types:

- `DB` → Database connection wrapper
- `Repositories` → Collection of all repository interfaces

Repository interfaces:

- `FilesRepository` → CRUD for File records
- `ChunksRepository` → CRUD for Chunk records
- `BlobsRepository` → CRUD for Blob records
- `SnapshotsRepository` → CRUD for Snapshot records
- Plus join-table repositories (FileChunks, BlobChunks, etc.)

## Snapshot Creation Flow

```
CreateSnapshot(opts)
  │
  ├─► CleanupIncompleteSnapshots()        // Critical: avoid dedup errors
  │
  ├─► SnapshotManager.CreateSnapshot()    // Create DB record
  │
  ├─► For each source directory:
  │     │
  │     ├─► scanner.Scan(ctx, path, snapshotID)
  │     │     │
  │     │     ├─► Phase 0: detectDeletedFiles()
  │     │     │
  │     │     ├─► Phase 1: scanPhase()
  │     │     │     Walk directory
  │     │     │     Check file metadata changes
  │     │     │     Build list of files to process
  │     │     │
  │     │     └─► Phase 2: processPhase()
  │     │           For each file:
  │     │             chunker.ChunkReaderStreaming()
  │     │             For each chunk:
  │     │               packer.AddChunk()
  │     │               If blob full → FinalizeBlob()
  │     │                 → handleBlobReady()
  │     │                 → s3Client.PutObjectWithProgress()
  │     │           packer.Flush()        // Final blob
  │     │
  │     └─► Accumulate statistics
  │
  ├─► SnapshotManager.UpdateSnapshotStatsExtended()
  │
  ├─► SnapshotManager.CompleteSnapshot()
  │
  └─► SnapshotManager.ExportSnapshotMetadata()
        │
        ├─► Copy database to temp file
        ├─► Clean to only current snapshot data
        ├─► Dump to SQL
        ├─► Compress with zstd
        ├─► Encrypt with age
        ├─► Upload db.zst.age to S3
        └─► Upload manifest.json.zst to S3
```

## Deduplication Strategy

1. **File-level**: Files unchanged since the last backup are skipped (metadata comparison: size, mtime, mode, uid, gid)

2. **Chunk-level**: Chunks are content-addressed by SHA256 hash. If a chunk hash already exists in the database, the chunk data is not re-uploaded.

3. **Blob-level**: Blobs contain only unique chunks. Duplicate chunks within a blob are skipped.
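
The chunk-level rule reduces to a membership test on content hashes. A minimal sketch under that assumption — `dedupIndex` is an illustrative stand-in for the chunks table, not the real repository API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// dedupIndex plays the role of the chunks table: content-addressed by
// SHA256, so a known hash means the data is never stored or uploaded twice.
type dedupIndex map[string]struct{}

// add reports whether the chunk is new and must be stored.
func (d dedupIndex) add(data []byte) bool {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	if _, ok := d[key]; ok {
		return false // duplicate: skip upload
	}
	d[key] = struct{}{}
	return true
}

func main() {
	idx := dedupIndex{}
	fmt.Println(idx.add([]byte("hello"))) // true  (new chunk)
	fmt.Println(idx.add([]byte("hello"))) // false (deduplicated)
	fmt.Println(idx.add([]byte("world"))) // true
}
```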

## Storage Layout in S3

```
bucket/
├── blobs/
│   └── {hash[0:2]}/
│       └── {hash[2:4]}/
│           └── {full-hash}          # Compressed+encrypted blob
│
└── metadata/
    └── {snapshot-id}/
        ├── db.zst.age               # Encrypted database dump
        └── manifest.json.zst        # Blob list (for verification)
```

## Thread Safety

- `Packer`: Thread-safe via mutex. Multiple goroutines can call `AddChunk()`.
- `Scanner`: Uses `packerMu` mutex to coordinate blob finalization.
- `Database`: Single-writer mode (`MaxOpenConns=1`) ensures SQLite thread safety.
- `Repositories.WithTx()`: Handles transaction lifecycle automatically.