# Vaultik Data Model

## Overview
Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.
Important Notes:
- No Migration Support: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
- Version Compatibility: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3.
## Database Tables

### 1. files
Stores metadata about files in the filesystem being backed up.
Columns:

- `id` (TEXT PRIMARY KEY) - UUID for the file record
- `path` (TEXT NOT NULL UNIQUE) - Absolute file path
- `mtime` (INTEGER NOT NULL) - Modification time as Unix timestamp
- `ctime` (INTEGER NOT NULL) - Change time as Unix timestamp
- `size` (INTEGER NOT NULL) - File size in bytes
- `mode` (INTEGER NOT NULL) - Unix file permissions and type
- `uid` (INTEGER NOT NULL) - User ID of file owner
- `gid` (INTEGER NOT NULL) - Group ID of file owner
- `link_target` (TEXT) - Symlink target path (NULL for regular files)

Indexes:

- `idx_files_path` on `path` for efficient lookups
Purpose: Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.
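For orientation, the schema can be sketched in SQL. The DDL below is reconstructed from the column list above and is illustrative only; the exact statements Vaultik executes may differ.

```sql
-- Illustrative DDL for the files table, derived from the columns described above.
CREATE TABLE IF NOT EXISTS files (
    id          TEXT PRIMARY KEY,     -- UUID for the file record
    path        TEXT NOT NULL UNIQUE, -- absolute file path
    mtime       INTEGER NOT NULL,     -- modification time (Unix timestamp)
    ctime       INTEGER NOT NULL,     -- change time (Unix timestamp)
    size        INTEGER NOT NULL,     -- file size in bytes
    mode        INTEGER NOT NULL,     -- Unix permissions and file type
    uid         INTEGER NOT NULL,     -- owning user ID
    gid         INTEGER NOT NULL,     -- owning group ID
    link_target TEXT                  -- symlink target, NULL for regular files
);

-- Secondary index on path for efficient lookups.
CREATE INDEX IF NOT EXISTS idx_files_path ON files(path);
```

The remaining tables follow the same conventions, so only example queries are sketched for them below.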
### 2. chunks

Stores information about content-defined chunks created from files.

Columns:

- `chunk_hash` (TEXT PRIMARY KEY) - SHA256 hash of chunk content
- `size` (INTEGER NOT NULL) - Chunk size in bytes
Purpose: Enables deduplication by tracking unique chunks across all files.
### 3. file_chunks

Maps files to their constituent chunks in order.

Columns:

- `file_id` (TEXT) - File ID (FK to files.id)
- `idx` (INTEGER) - Chunk index within file (0-based)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- PRIMARY KEY (`file_id`, `idx`)
Purpose: Allows reconstruction of files from chunks during restore.
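A restore-side lookup can be sketched directly from this mapping: fetch a file's chunks in index order and concatenate them. This is an illustrative query, not Vaultik's actual code path (restore works from the downloaded snapshot metadata rather than the live local database).

```sql
-- List a file's chunks in order so the file can be reassembled.
SELECT fc.idx, fc.chunk_hash, c.size
FROM file_chunks fc
JOIN chunks c ON c.chunk_hash = fc.chunk_hash
WHERE fc.file_id = ?
ORDER BY fc.idx;
```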
### 4. chunk_files

Reverse mapping showing which files contain each chunk.

Columns:

- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `file_id` (TEXT) - File ID (FK to files.id)
- `file_offset` (INTEGER) - Byte offset of chunk within file
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`chunk_hash`, `file_id`)
Purpose: Supports efficient queries for chunk usage and deduplication statistics.
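As an example of such a statistic, the illustrative query below counts how many files share each chunk; chunks with a count above one represent deduplication savings.

```sql
-- Chunks referenced by more than one file (i.e., deduplicated data).
SELECT chunk_hash, COUNT(DISTINCT file_id) AS file_count
FROM chunk_files
GROUP BY chunk_hash
HAVING COUNT(DISTINCT file_id) > 1
ORDER BY file_count DESC;
```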
### 5. blobs

Stores information about packed, compressed, and encrypted blob files.

Columns:

- `id` (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
- `blob_hash` (TEXT UNIQUE) - SHA256 hash of final blob (NULL until finalized)
- `created_ts` (INTEGER NOT NULL) - Creation timestamp
- `finished_ts` (INTEGER) - Finalization timestamp (NULL if in progress)
- `uncompressed_size` (INTEGER NOT NULL DEFAULT 0) - Total size of chunks before compression
- `compressed_size` (INTEGER NOT NULL DEFAULT 0) - Size after compression and encryption
- `uploaded_ts` (INTEGER) - Upload completion timestamp (NULL if not uploaded)
Purpose: Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.
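The nullable columns encode the lifecycle states, which makes state queries straightforward. The queries below are illustrative, not taken from the Vaultik source.

```sql
-- Blobs still being packed: no finish timestamp (or content hash) yet.
SELECT id, created_ts
FROM blobs
WHERE finished_ts IS NULL;

-- Blobs packed but not yet uploaded to S3.
SELECT id, blob_hash, compressed_size
FROM blobs
WHERE finished_ts IS NOT NULL
  AND uploaded_ts IS NULL;
```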
### 6. blob_chunks

Maps chunks to the blobs that contain them.

Columns:

- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `offset` (INTEGER) - Byte offset of chunk within blob (before compression)
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`blob_id`, `chunk_hash`)
Purpose: Enables chunk retrieval from blobs during restore operations.
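For illustration, locating a chunk inside a blob is a single join. The actual restore path reads the downloaded snapshot metadata rather than this local database, but the lookup has the same shape.

```sql
-- Find a blob containing the given chunk and the chunk's position in the
-- blob's uncompressed stream.
SELECT b.blob_hash, bc.offset, bc.length
FROM blob_chunks bc
JOIN blobs b ON b.id = bc.blob_id
WHERE bc.chunk_hash = ?
LIMIT 1;
```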
### 7. snapshots

Tracks backup snapshots.

Columns:

- `id` (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
- `hostname` (TEXT) - Hostname where backup was created
- `vaultik_version` (TEXT) - Version of Vaultik used
- `vaultik_git_revision` (TEXT) - Git revision of Vaultik used
- `started_at` (INTEGER) - Start timestamp
- `completed_at` (INTEGER) - Completion timestamp (NULL if in progress)
- `file_count` (INTEGER) - Number of files in snapshot
- `chunk_count` (INTEGER) - Number of unique chunks
- `blob_count` (INTEGER) - Number of blobs referenced
- `total_size` (INTEGER) - Total size of all files
- `blob_size` (INTEGER) - Total size of all blobs (compressed)
- `blob_uncompressed_size` (INTEGER) - Total uncompressed size of all referenced blobs
- `compression_ratio` (REAL) - Compression ratio achieved
- `compression_level` (INTEGER) - Compression level used for this snapshot
- `upload_bytes` (INTEGER) - Total bytes uploaded during this snapshot
- `upload_duration_ms` (INTEGER) - Total milliseconds spent uploading to S3
Purpose: Provides snapshot metadata and statistics including version tracking for compatibility.
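An illustrative reporting query over these columns, listing completed snapshots with their recorded statistics:

```sql
-- Completed snapshots, newest first, with size and compression statistics.
SELECT id, hostname, file_count, total_size,
       blob_size, blob_uncompressed_size, compression_ratio
FROM snapshots
WHERE completed_at IS NOT NULL
ORDER BY started_at DESC;
```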
### 8. snapshot_files

Maps snapshots to the files they contain.

Columns:

- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `file_id` (TEXT) - File ID (FK to files.id)
- PRIMARY KEY (`snapshot_id`, `file_id`)
Purpose: Records which files are included in each snapshot.
### 9. snapshot_blobs

Maps snapshots to the blobs they reference.

Columns:

- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `blob_hash` (TEXT) - Denormalized blob hash for manifest generation
- PRIMARY KEY (`snapshot_id`, `blob_id`)
Purpose: Tracks blob dependencies for snapshots and enables manifest generation.
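The denormalized `blob_hash` column means manifest generation can be a single query; an illustrative form of it is:

```sql
-- All blob hashes a snapshot depends on, suitable for building its manifest.
SELECT blob_hash
FROM snapshot_blobs
WHERE snapshot_id = ?
ORDER BY blob_hash;
```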
### 10. uploads

Tracks blob upload metrics.

Columns:

- `blob_hash` (TEXT PRIMARY KEY) - Hash of uploaded blob
- `snapshot_id` (TEXT NOT NULL) - The snapshot that triggered this upload (FK to snapshots.id)
- `uploaded_at` (INTEGER) - Upload timestamp
- `size` (INTEGER) - Size of uploaded blob
- `duration_ms` (INTEGER) - Upload duration in milliseconds
Purpose: Performance monitoring and tracking which blobs were newly created (uploaded) during each snapshot.
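For example, per-snapshot upload volume and throughput can be derived from this table. The query below is illustrative and assumes `size` is in bytes and `duration_ms` in milliseconds, as described above.

```sql
-- Upload volume and average throughput per snapshot.
SELECT snapshot_id,
       COUNT(*)  AS blobs_uploaded,
       SUM(size) AS bytes_uploaded,
       SUM(size) * 1000.0 / NULLIF(SUM(duration_ms), 0) AS bytes_per_second
FROM uploads
GROUP BY snapshot_id;
```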
## Data Flow and Operations

### 1. Backup Process

- File Scanning
  - `INSERT OR REPLACE INTO files` - Update file metadata
  - `SELECT * FROM files WHERE path = ?` - Check if file has changed
  - `INSERT INTO snapshot_files` - Add file to current snapshot
- Chunking (for changed files)
  - `INSERT OR IGNORE INTO chunks` - Store new chunks
  - `INSERT INTO file_chunks` - Map chunks to file
  - `INSERT INTO chunk_files` - Create reverse mapping
- Blob Packing (see the consolidated sketch after this list)
  - `INSERT INTO blobs` - Create blob record with UUID (blob_hash NULL)
  - `INSERT INTO blob_chunks` - Associate chunks with blob immediately
  - `UPDATE blobs SET blob_hash = ?, finished_ts = ?` - Finalize blob after packing
- Upload
  - `UPDATE blobs SET uploaded_ts = ?` - Mark blob as uploaded
  - `INSERT INTO uploads` - Record upload metrics with snapshot_id
  - `INSERT INTO snapshot_blobs` - Associate blob with snapshot
- Snapshot Completion
  - `UPDATE snapshots SET completed_at = ?, stats...` - Finalize snapshot
  - Generate and upload blob manifest from `snapshot_blobs`
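The blob-packing step can be read as a short SQL sequence. The consolidated sketch below restates the statements from the Blob Packing step; the named parameters (`:blob_id`, `:now`, and so on) are placeholders rather than Vaultik's actual bindings.

```sql
-- 1. Create the blob row before its content hash is known.
INSERT INTO blobs (id, created_ts) VALUES (:blob_id, :now);

-- 2. Attach each chunk as it is packed into the blob.
INSERT INTO blob_chunks (blob_id, chunk_hash, offset, length)
VALUES (:blob_id, :chunk_hash, :offset, :length);

-- 3. Finalize once the compressed, encrypted blob is complete.
UPDATE blobs
SET blob_hash         = :hash,
    finished_ts       = :now,
    uncompressed_size = :raw_bytes,
    compressed_size   = :packed_bytes
WHERE id = :blob_id;
```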
### 2. Incremental Backup

- Change Detection
  - `SELECT * FROM files WHERE path = ?` - Get previous file metadata
  - Compare mtime, size, mode to detect changes
  - Skip unchanged files but still add to `snapshot_files`
- Chunk Reuse (sketched after this list)
  - `SELECT * FROM blob_chunks WHERE chunk_hash = ?` - Find existing chunks
  - `INSERT INTO snapshot_blobs` - Reference existing blobs for unchanged files
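The chunk-reuse check amounts to a lookup against `blob_chunks`; if the chunk is already stored, only the snapshot references need to be written. An illustrative form:

```sql
-- A non-empty result means the chunk already lives in an existing blob
-- and does not need to be packed or uploaded again.
SELECT blob_id, offset, length
FROM blob_chunks
WHERE chunk_hash = ?;
```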
### 3. Snapshot Metadata Export

After a snapshot is completed:

- Copy database to temporary file
- Clean temporary database to contain only current snapshot data
- Export to SQL dump using sqlite3
- Compress with zstd and encrypt with age
- Upload to S3 as `metadata/{snapshot-id}/db.zst.age`
- Generate blob manifest and upload as `metadata/{snapshot-id}/manifest.json.zst`
### 4. Restore Process
The restore process doesn't use the local database. Instead:
- Downloads snapshot metadata from S3
- Downloads required blobs based on manifest
- Reconstructs files from decrypted and decompressed chunks
### 5. Pruning

- Identify Unreferenced Blobs
  - Query blobs not referenced by any remaining snapshot (see the query sketch after this list)
  - Delete from S3 and local database
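An illustrative form of the unreferenced-blob query described above:

```sql
-- Blobs that no remaining snapshot references; candidates for deletion.
SELECT b.id, b.blob_hash
FROM blobs b
WHERE NOT EXISTS (
    SELECT 1 FROM snapshot_blobs sb WHERE sb.blob_id = b.id
);
```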
### 6. Incomplete Snapshot Cleanup

Before each backup:

- Query incomplete snapshots (where `completed_at IS NULL`), as sketched after this list
- Check if metadata exists in S3
- If no metadata exists, delete the snapshot and all its associations
- Clean up orphaned files, chunks, and blobs
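An illustrative form of the incomplete-snapshot query used by this cleanup pass:

```sql
-- Snapshots that started but never recorded a completion timestamp.
SELECT id, hostname, started_at
FROM snapshots
WHERE completed_at IS NULL;
```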
## Repository Pattern

Vaultik uses a repository pattern for database access:

- `FileRepository` - CRUD operations for files and file metadata
- `ChunkRepository` - CRUD operations for content chunks
- `FileChunkRepository` - Manage file-to-chunk mappings
- `ChunkFileRepository` - Manage chunk-to-file reverse mappings
- `BlobRepository` - Manage blob lifecycle (creation, finalization, upload)
- `BlobChunkRepository` - Manage blob-to-chunk associations
- `SnapshotRepository` - Manage snapshots and their relationships
- `UploadRepository` - Track blob upload metrics

Each repository provides methods like:

- `Create()` - Insert new records
- `GetByID()` / `GetByPath()` / `GetByHash()` - Retrieve records
- `Update()` - Update existing records
- `Delete()` - Remove records
- Specialized queries for each entity type (e.g., `DeleteOrphaned()`, `GetIncompleteByHostname()`)
## Transaction Management

All database operations that modify multiple tables are wrapped in transactions:

```go
err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
	// Multiple repository operations using tx
	return nil
})
```
This ensures consistency, especially important for operations like:
- Creating file-chunk mappings
- Associating chunks with blobs
- Updating snapshot statistics
## Performance Considerations

- Indexes:
  - Primary keys are automatically indexed
  - `idx_files_path` on `files(path)` for efficient file lookups
- Prepared Statements: All queries use prepared statements for performance and security
- Batch Operations: Where possible, operations are batched within transactions
- Write-Ahead Logging: SQLite WAL mode is enabled for better concurrency
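WAL mode is a per-database setting; a minimal sketch of enabling it is shown below (Vaultik's actual connection setup may configure additional pragmas).

```sql
-- Enable write-ahead logging for better read/write concurrency.
PRAGMA journal_mode = WAL;
```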
## Data Integrity
- Foreign Keys: Enforced through CASCADE DELETE and application-level repository methods
- Unique Constraints: Chunk hashes, file paths, and blob hashes are unique
- Null Handling: Nullable fields clearly indicate in-progress operations
- Timestamp Tracking: All major operations record timestamps for auditing