# Vaultik Data Model

## Overview

Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.

**Important Notes:**

- **No Migration Support**: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
- **Version Compatibility**: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3.
## Database Tables

### 1. `files`

Stores metadata about files in the filesystem being backed up.

**Columns:**

- `id` (TEXT PRIMARY KEY) - UUID for the file record
- `path` (TEXT NOT NULL UNIQUE) - Absolute file path
- `mtime` (INTEGER NOT NULL) - Modification time as Unix timestamp
- `ctime` (INTEGER NOT NULL) - Change time as Unix timestamp
- `size` (INTEGER NOT NULL) - File size in bytes
- `mode` (INTEGER NOT NULL) - Unix file permissions and type
- `uid` (INTEGER NOT NULL) - User ID of file owner
- `gid` (INTEGER NOT NULL) - Group ID of file owner
- `link_target` (TEXT) - Symlink target path (NULL for regular files)

**Indexes:**

- `idx_files_path` on `path` for efficient lookups

**Purpose:** Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.
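The column-to-type mapping above can be mirrored in a Go struct. This is an illustrative sketch, not Vaultik's actual model type; the `FileRecord` name and field set are assumptions.

```go
package main

import (
	"fmt"
	"io/fs"
	"time"
)

// FileRecord mirrors the columns of the files table.
// Illustrative only; field names are assumptions.
type FileRecord struct {
	ID         string      // UUID primary key
	Path       string      // absolute path, unique
	MTime      int64       // modification time, Unix timestamp
	CTime      int64       // change time, Unix timestamp
	Size       int64       // file size in bytes
	Mode       fs.FileMode // Unix permissions and type bits
	UID        int         // owner user ID
	GID        int         // owner group ID
	LinkTarget string      // symlink target; empty for regular files (NULL in SQLite)
}

func main() {
	f := FileRecord{
		ID:    "6f1c2e7a-0000-0000-0000-000000000000", // hypothetical UUID
		Path:  "/etc/hosts",
		MTime: time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC).Unix(),
		Size:  220,
		Mode:  0o644,
	}
	fmt.Println(f.Path, f.Size, f.Mode)
}
```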
### 2. `chunks`

Stores information about content-defined chunks created from files.

**Columns:**

- `chunk_hash` (TEXT PRIMARY KEY) - SHA256 hash of chunk content
- `size` (INTEGER NOT NULL) - Chunk size in bytes

**Purpose:** Enables deduplication by tracking unique chunks across all files.
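The dedup key is the SHA256 of the chunk's content, so identical chunks always produce identical keys. A minimal sketch (hex encoding is an assumption; the stored encoding may differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkHash computes the content hash used as the chunks table
// primary key: SHA256 of the raw chunk bytes, hex-encoded.
func chunkHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := chunkHash([]byte("hello"))
	b := chunkHash([]byte("hello"))
	// Identical content yields the same key, so the second
	// occurrence hits INSERT OR IGNORE and is deduplicated.
	fmt.Println(a == b) // prints "true"
}
```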
### 3. `file_chunks`

Maps files to their constituent chunks in order.

**Columns:**

- `file_id` (TEXT) - File ID (FK to files.id)
- `idx` (INTEGER) - Chunk index within file (0-based)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- PRIMARY KEY (`file_id`, `idx`)

**Purpose:** Allows reconstruction of files from chunks during restore.
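Reconstruction amounts to concatenating chunk contents in `idx` order. A sketch of that step, with an in-memory map standing in for decrypted blob contents (names here are hypothetical, not Vaultik's API):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// fileChunkRow mirrors a file_chunks row: (file_id, idx, chunk_hash).
type fileChunkRow struct {
	FileID    string
	Idx       int
	ChunkHash string
}

// reassemble concatenates chunk contents in idx order, as a restore
// would. chunkStore stands in for chunks extracted from blobs.
func reassemble(rows []fileChunkRow, chunkStore map[string][]byte) []byte {
	sort.Slice(rows, func(i, j int) bool { return rows[i].Idx < rows[j].Idx })
	var buf bytes.Buffer
	for _, r := range rows {
		buf.Write(chunkStore[r.ChunkHash])
	}
	return buf.Bytes()
}

func main() {
	store := map[string][]byte{"h1": []byte("hello "), "h2": []byte("world")}
	rows := []fileChunkRow{
		{FileID: "f1", Idx: 1, ChunkHash: "h2"},
		{FileID: "f1", Idx: 0, ChunkHash: "h1"},
	}
	fmt.Printf("%s\n", reassemble(rows, store)) // prints "hello world"
}
```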
### 4. `chunk_files`

Reverse mapping showing which files contain each chunk.

**Columns:**

- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `file_id` (TEXT) - File ID (FK to files.id)
- `file_offset` (INTEGER) - Byte offset of chunk within file
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`chunk_hash`, `file_id`)

**Purpose:** Supports efficient queries for chunk usage and deduplication statistics.
### 5. `blobs`

Stores information about packed, compressed, and encrypted blob files.

**Columns:**

- `id` (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
- `blob_hash` (TEXT UNIQUE) - SHA256 hash of final blob (NULL until finalized)
- `created_ts` (INTEGER NOT NULL) - Creation timestamp
- `finished_ts` (INTEGER) - Finalization timestamp (NULL if in progress)
- `uncompressed_size` (INTEGER NOT NULL DEFAULT 0) - Total size of chunks before compression
- `compressed_size` (INTEGER NOT NULL DEFAULT 0) - Size after compression and encryption
- `uploaded_ts` (INTEGER) - Upload completion timestamp (NULL if not uploaded)

**Purpose:** Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.
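The two size columns are what the per-snapshot `compression_ratio` is derived from. A sketch, assuming the ratio is defined as compressed over uncompressed (the direction Vaultik stores is not specified here):

```go
package main

import "fmt"

// compressionRatio derives a ratio from the blobs table's size
// columns. Defining it as compressed/uncompressed is an assumption;
// the inverse convention is equally plausible.
func compressionRatio(uncompressed, compressed int64) float64 {
	if uncompressed == 0 {
		return 0 // avoid division by zero for empty blobs
	}
	return float64(compressed) / float64(uncompressed)
}

func main() {
	// 1000 bytes of chunks packed into a 250-byte blob.
	fmt.Println(compressionRatio(1000, 250)) // prints "0.25"
}
```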
### 6. `blob_chunks`

Maps chunks to the blobs that contain them.

**Columns:**

- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `offset` (INTEGER) - Byte offset of chunk within blob (before compression)
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`blob_id`, `chunk_hash`)

**Purpose:** Enables chunk retrieval from blobs during restore operations.
### 7. `snapshots`

Tracks backup snapshots.

**Columns:**

- `id` (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
- `hostname` (TEXT) - Hostname where backup was created
- `vaultik_version` (TEXT) - Version of Vaultik used
- `vaultik_git_revision` (TEXT) - Git revision of Vaultik used
- `started_at` (INTEGER) - Start timestamp
- `completed_at` (INTEGER) - Completion timestamp (NULL if in progress)
- `file_count` (INTEGER) - Number of files in snapshot
- `chunk_count` (INTEGER) - Number of unique chunks
- `blob_count` (INTEGER) - Number of blobs referenced
- `total_size` (INTEGER) - Total size of all files
- `blob_size` (INTEGER) - Total size of all blobs (compressed)
- `blob_uncompressed_size` (INTEGER) - Total uncompressed size of all referenced blobs
- `compression_ratio` (REAL) - Compression ratio achieved
- `compression_level` (INTEGER) - Compression level used for this snapshot
- `upload_bytes` (INTEGER) - Total bytes uploaded during this snapshot
- `upload_duration_ms` (INTEGER) - Total milliseconds spent uploading to S3

**Purpose:** Provides snapshot metadata and statistics, including version tracking for compatibility.
### 8. `snapshot_files`

Maps snapshots to the files they contain.

**Columns:**

- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `file_id` (TEXT) - File ID (FK to files.id)
- PRIMARY KEY (`snapshot_id`, `file_id`)

**Purpose:** Records which files are included in each snapshot.
### 9. `snapshot_blobs`

Maps snapshots to the blobs they reference.

**Columns:**

- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `blob_hash` (TEXT) - Denormalized blob hash for manifest generation
- PRIMARY KEY (`snapshot_id`, `blob_id`)

**Purpose:** Tracks blob dependencies for snapshots and enables manifest generation.
### 10. `uploads`

Tracks blob upload metrics.

**Columns:**

- `blob_hash` (TEXT PRIMARY KEY) - Hash of uploaded blob
- `snapshot_id` (TEXT NOT NULL) - The snapshot that triggered this upload (FK to snapshots.id)
- `uploaded_at` (INTEGER) - Upload timestamp
- `size` (INTEGER) - Size of uploaded blob
- `duration_ms` (INTEGER) - Upload duration in milliseconds

**Purpose:** Performance monitoring and tracking which blobs were newly created (uploaded) during each snapshot.
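A typical use of these metrics is deriving throughput per blob. A sketch, not part of Vaultik itself, using the table's `size` (bytes) and `duration_ms` columns:

```go
package main

import "fmt"

// uploadThroughput derives MB/s from an uploads row's size (bytes)
// and duration_ms columns. Illustrative helper, not Vaultik's API.
func uploadThroughput(sizeBytes, durationMS int64) float64 {
	if durationMS == 0 {
		return 0 // no measurable duration recorded
	}
	return (float64(sizeBytes) / 1e6) / (float64(durationMS) / 1e3)
}

func main() {
	// 50 MB uploaded in 2000 ms.
	fmt.Println(uploadThroughput(50_000_000, 2000)) // prints "25"
}
```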
## Data Flow and Operations

### 1. Backup Process

1. **File Scanning**
   - `INSERT OR REPLACE INTO files` - Update file metadata
   - `SELECT * FROM files WHERE path = ?` - Check if file has changed
   - `INSERT INTO snapshot_files` - Add file to current snapshot

2. **Chunking** (for changed files)
   - `INSERT OR IGNORE INTO chunks` - Store new chunks
   - `INSERT INTO file_chunks` - Map chunks to file
   - `INSERT INTO chunk_files` - Create reverse mapping

3. **Blob Packing**
   - `INSERT INTO blobs` - Create blob record with UUID (`blob_hash` NULL)
   - `INSERT INTO blob_chunks` - Associate chunks with blob immediately
   - `UPDATE blobs SET blob_hash = ?, finished_ts = ?` - Finalize blob after packing

4. **Upload**
   - `UPDATE blobs SET uploaded_ts = ?` - Mark blob as uploaded
   - `INSERT INTO uploads` - Record upload metrics with `snapshot_id`
   - `INSERT INTO snapshot_blobs` - Associate blob with snapshot

5. **Snapshot Completion**
   - `UPDATE snapshots SET completed_at = ?, ...` - Finalize snapshot and its statistics
   - Generate and upload blob manifest from `snapshot_blobs`
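During blob packing, the pre-compression `offset` recorded in `blob_chunks` for each appended chunk is simply a running sum of the lengths of the chunks packed before it. A sketch of that bookkeeping:

```go
package main

import "fmt"

// packOffsets computes the pre-compression offset of each chunk
// appended to a blob, given the chunk lengths in packing order.
func packOffsets(lengths []int64) []int64 {
	offsets := make([]int64, len(lengths))
	var off int64
	for i, n := range lengths {
		offsets[i] = off // this chunk starts where the previous ones ended
		off += n
	}
	return offsets
}

func main() {
	fmt.Println(packOffsets([]int64{100, 250, 75})) // prints "[0 100 350]"
}
```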
### 2. Incremental Backup

1. **Change Detection**
   - `SELECT * FROM files WHERE path = ?` - Get previous file metadata
   - Compare mtime, size, mode to detect changes
   - Skip unchanged files but still add to `snapshot_files`

2. **Chunk Reuse**
   - `SELECT * FROM blob_chunks WHERE chunk_hash = ?` - Find existing chunks
   - `INSERT INTO snapshot_blobs` - Reference existing blobs for unchanged files
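The change-detection comparison above can be sketched as a pure function over the compared fields. This mirrors the documented mtime/size/mode check only; any additional checks Vaultik performs are out of scope here.

```go
package main

import "fmt"

// fileMeta holds the fields compared between backup runs.
type fileMeta struct {
	MTime int64  // Unix timestamp
	Size  int64  // bytes
	Mode  uint32 // Unix permissions and type
}

// changed reports whether a file must be re-chunked: any difference
// in mtime, size, or mode counts as a change.
func changed(prev, cur fileMeta) bool {
	return prev.MTime != cur.MTime || prev.Size != cur.Size || prev.Mode != cur.Mode
}

func main() {
	prev := fileMeta{MTime: 1700000000, Size: 42, Mode: 0o644}
	same := prev
	touched := fileMeta{MTime: 1700000100, Size: 42, Mode: 0o644}
	fmt.Println(changed(prev, same), changed(prev, touched)) // prints "false true"
}
```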
### 3. Snapshot Metadata Export

After a snapshot is completed:

1. Copy the database to a temporary file
2. Clean the temporary database so it contains only current snapshot data
3. Export to a SQL dump using sqlite3
4. Compress with zstd and encrypt with age
5. Upload to S3 as `metadata/{snapshot-id}/db.zst.age`
6. Generate the blob manifest and upload it as `metadata/{snapshot-id}/manifest.json.zst.age`
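The two object keys named in the steps above can be built from the snapshot ID alone. A small sketch (the helper name is hypothetical; the key layout is taken verbatim from the steps):

```go
package main

import "fmt"

// metadataKeys builds the S3 object keys used by the export steps
// for a given snapshot ID.
func metadataKeys(snapshotID string) (db, manifest string) {
	db = fmt.Sprintf("metadata/%s/db.zst.age", snapshotID)
	manifest = fmt.Sprintf("metadata/%s/manifest.json.zst.age", snapshotID)
	return
}

func main() {
	db, manifest := metadataKeys("myhost-20240309-143005Z")
	fmt.Println(db)
	fmt.Println(manifest)
}
```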
### 4. Restore Process

The restore process doesn't use the local database. Instead, it:

1. Downloads snapshot metadata from S3
2. Downloads the required blobs based on the manifest
3. Reconstructs files from decrypted and decompressed chunks
### 5. Pruning

1. **Identify Unreferenced Blobs**
   - Query blobs not referenced by any remaining snapshot
   - Delete from S3 and local database
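Identifying unreferenced blobs is a set difference between the `blobs` table and the blob IDs still appearing in `snapshot_blobs`. A sketch of the logic (in memory, for illustration; Vaultik presumably does this in SQL):

```go
package main

import (
	"fmt"
	"sort"
)

// unreferencedBlobs returns blob IDs present in the blobs table but
// absent from every remaining snapshot_blobs row: prune candidates.
func unreferencedBlobs(allBlobs []string, referenced map[string]bool) []string {
	var orphans []string
	for _, id := range allBlobs {
		if !referenced[id] {
			orphans = append(orphans, id)
		}
	}
	sort.Strings(orphans) // deterministic order for logging/deletion
	return orphans
}

func main() {
	all := []string{"b1", "b2", "b3"}
	refs := map[string]bool{"b1": true, "b3": true} // from snapshot_blobs
	fmt.Println(unreferencedBlobs(all, refs))       // prints "[b2]"
}
```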
### 6. Incomplete Snapshot Cleanup

Before each backup:

1. Query incomplete snapshots (where `completed_at IS NULL`)
2. Check if metadata exists in S3
3. If no metadata exists, delete the snapshot and all its associations
4. Clean up orphaned files, chunks, and blobs
## Repository Pattern

Vaultik uses a repository pattern for database access:

- `FileRepository` - CRUD operations for files and file metadata
- `ChunkRepository` - CRUD operations for content chunks
- `FileChunkRepository` - Manage file-to-chunk mappings
- `ChunkFileRepository` - Manage chunk-to-file reverse mappings
- `BlobRepository` - Manage blob lifecycle (creation, finalization, upload)
- `BlobChunkRepository` - Manage blob-to-chunk associations
- `SnapshotRepository` - Manage snapshots and their relationships
- `UploadRepository` - Track blob upload metrics

Each repository provides methods like:

- `Create()` - Insert a new record
- `GetByID()` / `GetByPath()` / `GetByHash()` - Retrieve records
- `Update()` - Update existing records
- `Delete()` - Remove records
- Specialized queries for each entity type (e.g., `DeleteOrphaned()`, `GetIncompleteByHostname()`)
## Transaction Management

All database operations that modify multiple tables are wrapped in transactions:

```go
err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
	// Multiple repository operations using tx; returning a non-nil
	// error rolls the transaction back, nil commits it.
	return nil
})
```

This ensures consistency, which is especially important for operations like:

- Creating file-chunk mappings
- Associating chunks with blobs
- Updating snapshot statistics
## Performance Considerations

1. **Indexes**:
   - Primary keys are automatically indexed
   - `idx_files_path` on `files(path)` for efficient file lookups

2. **Prepared Statements**: All queries use prepared statements for performance and security

3. **Batch Operations**: Where possible, operations are batched within transactions

4. **Write-Ahead Logging**: SQLite WAL mode is enabled for better concurrency
## Data Integrity

1. **Foreign Keys**: Enforced through CASCADE DELETE and application-level repository methods
2. **Unique Constraints**: Chunk hashes, file paths, and blob hashes are unique
3. **Null Handling**: Nullable fields clearly indicate in-progress operations
4. **Timestamp Tracking**: All major operations record timestamps for auditing