
Vaultik Data Model

Overview

Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.

Important Notes:

  • No Migration Support: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
  • Version Compatibility: In rare cases, you may need to restore a backup with the same version of Vaultik that created it. This ensures compatibility with the metadata format stored in S3.

Database Tables

1. files

Stores metadata about files in the filesystem being backed up.

Columns:

  • id (TEXT PRIMARY KEY) - UUID for the file record
  • path (TEXT NOT NULL UNIQUE) - Absolute file path
  • mtime (INTEGER NOT NULL) - Modification time as Unix timestamp
  • ctime (INTEGER NOT NULL) - Change time as Unix timestamp
  • size (INTEGER NOT NULL) - File size in bytes
  • mode (INTEGER NOT NULL) - Unix file permissions and type
  • uid (INTEGER NOT NULL) - User ID of file owner
  • gid (INTEGER NOT NULL) - Group ID of file owner
  • link_target (TEXT) - Symlink target path (NULL for regular files)

Indexes:

  • idx_files_path on path for efficient lookups

Purpose: Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.
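The column list above can be sketched as SQL DDL. This is a reconstruction from the descriptions in this section, exercised here via Python's stdlib sqlite3 module, not Vaultik's literal schema:

```python
import sqlite3

# In-memory database; the files table reconstructed from the column
# descriptions above (illustrative, not Vaultik's exact DDL).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (
    id          TEXT PRIMARY KEY,     -- UUID for the file record
    path        TEXT NOT NULL UNIQUE,
    mtime       INTEGER NOT NULL,     -- Unix timestamp
    ctime       INTEGER NOT NULL,
    size        INTEGER NOT NULL,
    mode        INTEGER NOT NULL,     -- permission and type bits
    uid         INTEGER NOT NULL,
    gid         INTEGER NOT NULL,
    link_target TEXT                  -- NULL for regular files
);
CREATE INDEX idx_files_path ON files(path);
""")
db.execute(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("uuid-1", "/etc/hosts", 1700000000, 1700000000, 220, 0o100644, 0, 0, None),
)
# Path lookup used for incremental change detection.
row = db.execute("SELECT size FROM files WHERE path = ?", ("/etc/hosts",)).fetchone()
```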

2. chunks

Stores information about content-defined chunks created from files.

Columns:

  • chunk_hash (TEXT PRIMARY KEY) - SHA256 hash of chunk content
  • size (INTEGER NOT NULL) - Chunk size in bytes

Purpose: Enables deduplication by tracking unique chunks across all files.

3. file_chunks

Maps files to their constituent chunks in order.

Columns:

  • file_id (TEXT) - File ID (FK to files.id)
  • idx (INTEGER) - Chunk index within file (0-based)
  • chunk_hash (TEXT) - Chunk hash (FK to chunks.chunk_hash)
  • PRIMARY KEY (file_id, idx)

Purpose: Allows reconstruction of files from chunks during restore.
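Ordering by idx is what makes reconstruction possible. A minimal sketch (table DDL reconstructed from the column list above; hashes and IDs are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE file_chunks (
    file_id    TEXT,
    idx        INTEGER,
    chunk_hash TEXT,
    PRIMARY KEY (file_id, idx)
);
""")
# Chunks inserted out of order; ORDER BY idx recovers the file layout.
db.executemany("INSERT INTO file_chunks VALUES (?, ?, ?)", [
    ("file-1", 1, "hash-b"),
    ("file-1", 0, "hash-a"),
])
ordered = [r[0] for r in db.execute(
    "SELECT chunk_hash FROM file_chunks WHERE file_id = ? ORDER BY idx",
    ("file-1",),
)]
```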

4. chunk_files

Reverse mapping showing which files contain each chunk.

Columns:

  • chunk_hash (TEXT) - Chunk hash (FK to chunks.chunk_hash)
  • file_id (TEXT) - File ID (FK to files.id)
  • file_offset (INTEGER) - Byte offset of chunk within file
  • length (INTEGER) - Length of chunk in bytes
  • PRIMARY KEY (chunk_hash, file_id)

Purpose: Supports efficient queries for chunk usage and deduplication statistics.
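A deduplication-statistics query over this reverse mapping might look like the following (a sketch against a table reconstructed from the column list above; the sample rows are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE chunk_files (
    chunk_hash  TEXT,
    file_id     TEXT,
    file_offset INTEGER,
    length      INTEGER,
    PRIMARY KEY (chunk_hash, file_id)
);
""")
db.executemany("INSERT INTO chunk_files VALUES (?, ?, ?, ?)", [
    ("hash-a", "file-1", 0, 4096),
    ("hash-a", "file-2", 8192, 4096),  # same chunk shared by two files
    ("hash-b", "file-1", 4096, 1024),
])
# How many files reference each chunk?
usage = dict(db.execute(
    "SELECT chunk_hash, COUNT(*) FROM chunk_files GROUP BY chunk_hash"
))
```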

5. blobs

Stores information about packed, compressed, and encrypted blob files.

Columns:

  • id (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
  • blob_hash (TEXT UNIQUE) - SHA256 hash of final blob (NULL until finalized)
  • created_ts (INTEGER NOT NULL) - Creation timestamp
  • finished_ts (INTEGER) - Finalization timestamp (NULL if in progress)
  • uncompressed_size (INTEGER NOT NULL DEFAULT 0) - Total size of chunks before compression
  • compressed_size (INTEGER NOT NULL DEFAULT 0) - Size after compression and encryption
  • uploaded_ts (INTEGER) - Upload completion timestamp (NULL if not uploaded)

Purpose: Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.

6. blob_chunks

Maps chunks to the blobs that contain them.

Columns:

  • blob_id (TEXT) - Blob ID (FK to blobs.id)
  • chunk_hash (TEXT) - Chunk hash (FK to chunks.chunk_hash)
  • offset (INTEGER) - Byte offset of chunk within blob (before compression)
  • length (INTEGER) - Length of chunk in bytes
  • PRIMARY KEY (blob_id, chunk_hash)

Purpose: Enables chunk retrieval from blobs during restore operations.
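During restore, this table answers "which blob holds this chunk, and where inside it?". A sketch (DDL reconstructed from the column list above; IDs invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE blob_chunks (
    blob_id    TEXT,
    chunk_hash TEXT,
    offset     INTEGER,   -- offset within the blob before compression
    length     INTEGER,
    PRIMARY KEY (blob_id, chunk_hash)
);
""")
db.execute("INSERT INTO blob_chunks VALUES (?, ?, ?, ?)",
           ("blob-1", "hash-a", 0, 4096))
# Locate the blob containing a chunk, plus the byte range to read.
loc = db.execute(
    "SELECT blob_id, offset, length FROM blob_chunks WHERE chunk_hash = ?",
    ("hash-a",),
).fetchone()
```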

7. snapshots

Tracks backup snapshots.

Columns:

  • id (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
  • hostname (TEXT) - Hostname where backup was created
  • vaultik_version (TEXT) - Version of Vaultik used
  • vaultik_git_revision (TEXT) - Git revision of Vaultik used
  • started_at (INTEGER) - Start timestamp
  • completed_at (INTEGER) - Completion timestamp (NULL if in progress)
  • file_count (INTEGER) - Number of files in snapshot
  • chunk_count (INTEGER) - Number of unique chunks
  • blob_count (INTEGER) - Number of blobs referenced
  • total_size (INTEGER) - Total size of all files
  • blob_size (INTEGER) - Total size of all blobs (compressed)
  • blob_uncompressed_size (INTEGER) - Total uncompressed size of all referenced blobs
  • compression_ratio (REAL) - Compression ratio achieved
  • compression_level (INTEGER) - Compression level used for this snapshot
  • upload_bytes (INTEGER) - Total bytes uploaded during this snapshot
  • upload_duration_ms (INTEGER) - Total milliseconds spent uploading to S3

Purpose: Provides snapshot metadata and statistics including version tracking for compatibility.

8. snapshot_files

Maps snapshots to the files they contain.

Columns:

  • snapshot_id (TEXT) - Snapshot ID (FK to snapshots.id)
  • file_id (TEXT) - File ID (FK to files.id)
  • PRIMARY KEY (snapshot_id, file_id)

Purpose: Records which files are included in each snapshot.

9. snapshot_blobs

Maps snapshots to the blobs they reference.

Columns:

  • snapshot_id (TEXT) - Snapshot ID (FK to snapshots.id)
  • blob_id (TEXT) - Blob ID (FK to blobs.id)
  • blob_hash (TEXT) - Denormalized blob hash for manifest generation
  • PRIMARY KEY (snapshot_id, blob_id)

Purpose: Tracks blob dependencies for snapshots and enables manifest generation.

10. uploads

Tracks blob upload metrics.

Columns:

  • blob_hash (TEXT PRIMARY KEY) - Hash of uploaded blob
  • snapshot_id (TEXT NOT NULL) - The snapshot that triggered this upload (FK to snapshots.id)
  • uploaded_at (INTEGER) - Upload timestamp
  • size (INTEGER) - Size of uploaded blob
  • duration_ms (INTEGER) - Upload duration in milliseconds

Purpose: Supports performance monitoring and records which blobs were newly created (and therefore uploaded) during each snapshot.
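Aggregating this table per snapshot yields the new-blob count and upload totals for a run. A sketch (DDL reconstructed from the column list above; all values invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE uploads (
    blob_hash   TEXT PRIMARY KEY,
    snapshot_id TEXT NOT NULL,
    uploaded_at INTEGER,
    size        INTEGER,
    duration_ms INTEGER
);
""")
db.executemany("INSERT INTO uploads VALUES (?, ?, ?, ?, ?)", [
    ("hash-1", "host-20250726-000000Z", 1753480000, 1048576, 350),
    ("hash-2", "host-20250726-000000Z", 1753480001, 2097152, 610),
])
# Blobs newly uploaded during one snapshot, with total bytes and time spent.
count, total_bytes, total_ms = db.execute(
    "SELECT COUNT(*), SUM(size), SUM(duration_ms) FROM uploads WHERE snapshot_id = ?",
    ("host-20250726-000000Z",),
).fetchone()
```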

Data Flow and Operations

1. Backup Process

  1. File Scanning

    • INSERT OR REPLACE INTO files - Update file metadata
    • SELECT * FROM files WHERE path = ? - Check if file has changed
    • INSERT INTO snapshot_files - Add file to current snapshot
  2. Chunking (for changed files)

    • INSERT OR IGNORE INTO chunks - Store new chunks
    • INSERT INTO file_chunks - Map chunks to file
    • INSERT INTO chunk_files - Create reverse mapping
  3. Blob Packing

    • INSERT INTO blobs - Create blob record with UUID (blob_hash NULL)
    • INSERT INTO blob_chunks - Associate chunks with blob immediately
    • UPDATE blobs SET blob_hash = ?, finished_ts = ? - Finalize blob after packing
  4. Upload

    • UPDATE blobs SET uploaded_ts = ? - Mark blob as uploaded
    • INSERT INTO uploads - Record upload metrics with snapshot_id
    • INSERT INTO snapshot_blobs - Associate blob with snapshot
  5. Snapshot Completion

    • UPDATE snapshots SET completed_at = ?, stats... - Finalize snapshot
    • Generate and upload blob manifest from snapshot_blobs

2. Incremental Backup

  1. Change Detection

    • SELECT * FROM files WHERE path = ? - Get previous file metadata
    • Compare mtime, size, mode to detect changes
    • Skip unchanged files but still add to snapshot_files
  2. Chunk Reuse

    • SELECT * FROM blob_chunks WHERE chunk_hash = ? - Find existing chunks
    • INSERT INTO snapshot_blobs - Reference existing blobs for unchanged files
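The change-detection comparison in step 1 can be sketched as follows. This is an illustrative reimplementation of the rule described above (compare mtime, size, and mode), not Vaultik's actual Go code; the FileMeta type and has_changed helper are hypothetical names:

```python
from typing import NamedTuple, Optional

class FileMeta(NamedTuple):
    mtime: int
    size: int
    mode: int

def has_changed(previous: Optional[FileMeta], current: FileMeta) -> bool:
    """A file is re-chunked only if mtime, size, or mode differ
    from the metadata recorded in the previous run."""
    if previous is None:  # never backed up before: treat as changed
        return True
    return (previous.mtime != current.mtime
            or previous.size != current.size
            or previous.mode != current.mode)

old = FileMeta(mtime=1700000000, size=220, mode=0o100644)
same = FileMeta(mtime=1700000000, size=220, mode=0o100644)
touched = FileMeta(mtime=1700000999, size=220, mode=0o100644)
```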

3. Snapshot Metadata Export

After a snapshot is completed:

  1. Copy database to temporary file
  2. Clean temporary database to contain only current snapshot data
  3. Export to SQL dump using sqlite3
  4. Compress with zstd and encrypt with age
  5. Upload to S3 as metadata/{snapshot-id}/db.zst.age
  6. Generate blob manifest and upload as metadata/{snapshot-id}/manifest.json.zst.age

4. Restore Process

The restore process doesn't use the local database. Instead:

  1. Downloads snapshot metadata from S3
  2. Downloads required blobs based on manifest
  3. Reconstructs files from decrypted and decompressed chunks

5. Pruning

  1. Identify Unreferenced Blobs
    • Query blobs not referenced by any remaining snapshot
    • Delete from S3 and local database
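The unreferenced-blob query can be sketched as an anti-join against snapshot_blobs (tables reduced to the relevant columns; one possible formulation, not necessarily the query Vaultik runs):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE blobs (id TEXT PRIMARY KEY);
CREATE TABLE snapshot_blobs (
    snapshot_id TEXT,
    blob_id     TEXT,
    PRIMARY KEY (snapshot_id, blob_id)
);
""")
db.executemany("INSERT INTO blobs VALUES (?)", [("blob-1",), ("blob-2",)])
db.execute("INSERT INTO snapshot_blobs VALUES (?, ?)", ("snap-1", "blob-1"))
# blob-2 is referenced by no remaining snapshot, so pruning may delete it.
orphans = [r[0] for r in db.execute("""
    SELECT id FROM blobs
    WHERE id NOT IN (SELECT blob_id FROM snapshot_blobs)
""")]
```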

6. Incomplete Snapshot Cleanup

Before each backup:

  1. Query incomplete snapshots (where completed_at IS NULL)
  2. Check if metadata exists in S3
  3. If no metadata, delete snapshot and all associations
  4. Clean up orphaned files, chunks, and blobs
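Step 1 above amounts to a NULL check on completed_at. A sketch (snapshots table reduced to the columns involved; sample data invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE snapshots (
    id           TEXT PRIMARY KEY,
    hostname     TEXT,
    completed_at INTEGER   -- NULL while the snapshot is in progress
);
""")
db.executemany("INSERT INTO snapshots VALUES (?, ?, ?)", [
    ("host-20250725-000000Z", "host", 1753390000),
    ("host-20250726-000000Z", "host", None),  # interrupted backup
])
incomplete = [r[0] for r in db.execute(
    "SELECT id FROM snapshots WHERE completed_at IS NULL AND hostname = ?",
    ("host",),
)]
```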

Repository Pattern

Vaultik uses a repository pattern for database access:

  • FileRepository - CRUD operations for files and file metadata
  • ChunkRepository - CRUD operations for content chunks
  • FileChunkRepository - Manage file-to-chunk mappings
  • ChunkFileRepository - Manage chunk-to-file reverse mappings
  • BlobRepository - Manage blob lifecycle (creation, finalization, upload)
  • BlobChunkRepository - Manage blob-to-chunk associations
  • SnapshotRepository - Manage snapshots and their relationships
  • UploadRepository - Track blob upload metrics

Each repository provides methods like:

  • Create() - Insert new record
  • GetByID() / GetByPath() / GetByHash() - Retrieve records
  • Update() - Update existing records
  • Delete() - Remove records
  • Specialized queries for each entity type (e.g., DeleteOrphaned(), GetIncompleteByHostname())

Transaction Management

All database operations that modify multiple tables are wrapped in transactions:

err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
    // Multiple repository operations using tx; returning a non-nil
    // error rolls the whole transaction back.
    return nil
})

This ensures consistency, especially important for operations like:

  • Creating file-chunk mappings
  • Associating chunks with blobs
  • Updating snapshot statistics

Performance Considerations

  1. Indexes:

    • Primary keys are automatically indexed
    • idx_files_path on files(path) for efficient file lookups
  2. Prepared Statements: All queries use prepared statements for performance and security

  3. Batch Operations: Where possible, operations are batched within transactions

  4. Write-Ahead Logging: SQLite WAL mode is enabled for better concurrency

Data Integrity

  1. Foreign Keys: Enforced through CASCADE DELETE and application-level repository methods
  2. Unique Constraints: Chunk hashes, file paths, and blob hashes are unique
  3. Null Handling: Nullable fields clearly indicate in-progress operations
  4. Timestamp Tracking: All major operations record timestamps for auditing