# Vaultik Data Model

## Overview

Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.

**Important Notes:**

- **No Migration Support**: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
- **Version Compatibility**: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3.

## Database Tables

### 1. `files`

Stores metadata about files in the filesystem being backed up.

**Columns:**

- `id` (TEXT PRIMARY KEY) - UUID for the file record
- `path` (TEXT UNIQUE) - Absolute file path
- `mtime` (INTEGER) - Modification time as Unix timestamp
- `ctime` (INTEGER) - Change time as Unix timestamp
- `size` (INTEGER) - File size in bytes
- `mode` (INTEGER) - Unix file permissions and type
- `uid` (INTEGER) - User ID of file owner
- `gid` (INTEGER) - Group ID of file owner
- `link_target` (TEXT) - Symlink target path (empty for regular files)

**Purpose:** Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.

### 2. `chunks`

Stores information about content-defined chunks created from files.

**Columns:**

- `chunk_hash` (TEXT PRIMARY KEY) - SHA256 hash of chunk content
- `sha256` (TEXT) - SHA256 hash (currently same as `chunk_hash`)
- `size` (INTEGER) - Chunk size in bytes

**Purpose:** Enables deduplication by tracking unique chunks across all files.

### 3. `file_chunks`

Maps files to their constituent chunks in order.

**Columns:**

- `file_id` (TEXT) - File ID (FK to files.id)
- `idx` (INTEGER) - Chunk index within file (0-based)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- PRIMARY KEY (`file_id`, `idx`)

**Purpose:** Allows reconstruction of files from chunks during restore.

### 4. `chunk_files`

Reverse mapping showing which files contain each chunk.

**Columns:**

- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `file_id` (TEXT) - File ID (FK to files.id)
- `file_offset` (INTEGER) - Byte offset of chunk within file
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`chunk_hash`, `file_id`)

**Purpose:** Supports efficient queries for chunk usage and deduplication statistics.

### 5. `blobs`

Stores information about packed, compressed, and encrypted blob files.

**Columns:**

- `id` (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
- `hash` (TEXT) - SHA256 hash of final blob (empty until finalized)
- `created_ts` (INTEGER) - Creation timestamp
- `finished_ts` (INTEGER) - Finalization timestamp (NULL if in progress)
- `uncompressed_size` (INTEGER) - Total size of chunks before compression
- `compressed_size` (INTEGER) - Size after compression and encryption
- `uploaded_ts` (INTEGER) - Upload completion timestamp (NULL if not uploaded)

**Purpose:** Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.
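Vaultik's actual DDL lives in its source and may differ in detail. The following sketch restates the `files`, `chunks`, and `file_chunks` descriptions above as illustrative SQLite DDL; the `NOT NULL` and foreign-key constraint details are assumptions, not the real schema definition:

```go
// Illustrative DDL only: table and column names follow the descriptions
// above, but the constraint details are assumptions, not Vaultik's
// actual schema definition.
const schemaSketch = `
CREATE TABLE files (
    id          TEXT PRIMARY KEY,           -- UUID
    path        TEXT UNIQUE NOT NULL,       -- absolute path
    mtime       INTEGER NOT NULL,
    ctime       INTEGER NOT NULL,
    size        INTEGER NOT NULL,
    mode        INTEGER NOT NULL,
    uid         INTEGER NOT NULL,
    gid         INTEGER NOT NULL,
    link_target TEXT NOT NULL DEFAULT ''    -- empty for regular files
);

CREATE TABLE chunks (
    chunk_hash TEXT PRIMARY KEY,            -- SHA256 of chunk content
    sha256     TEXT NOT NULL,               -- currently same as chunk_hash
    size       INTEGER NOT NULL
);

CREATE TABLE file_chunks (
    file_id    TEXT NOT NULL REFERENCES files(id),
    idx        INTEGER NOT NULL,            -- 0-based position within the file
    chunk_hash TEXT NOT NULL REFERENCES chunks(chunk_hash),
    PRIMARY KEY (file_id, idx)
);
`
```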
### 6. `blob_chunks`

Maps chunks to the blobs that contain them.

**Columns:**

- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `offset` (INTEGER) - Byte offset of chunk within blob (before compression)
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`blob_id`, `chunk_hash`)

**Purpose:** Enables chunk retrieval from blobs during restore operations.

### 7. `snapshots`

Tracks backup snapshots.

**Columns:**

- `id` (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
- `hostname` (TEXT) - Hostname where backup was created
- `vaultik_version` (TEXT) - Version of Vaultik used
- `vaultik_git_revision` (TEXT) - Git revision of Vaultik used
- `started_at` (INTEGER) - Start timestamp
- `completed_at` (INTEGER) - Completion timestamp (NULL if in progress)
- `file_count` (INTEGER) - Number of files in snapshot
- `chunk_count` (INTEGER) - Number of unique chunks
- `blob_count` (INTEGER) - Number of blobs referenced
- `total_size` (INTEGER) - Total size of all files
- `blob_size` (INTEGER) - Total size of all blobs (compressed)
- `blob_uncompressed_size` (INTEGER) - Total uncompressed size of all referenced blobs
- `compression_ratio` (REAL) - Compression ratio achieved
- `compression_level` (INTEGER) - Compression level used for this snapshot
- `upload_bytes` (INTEGER) - Total bytes uploaded during this snapshot
- `upload_duration_ms` (INTEGER) - Total milliseconds spent uploading to S3

**Purpose:** Provides snapshot metadata and statistics, including version tracking for compatibility.

### 8. `snapshot_files`

Maps snapshots to the files they contain.

**Columns:**

- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `file_id` (TEXT) - File ID (FK to files.id)
- PRIMARY KEY (`snapshot_id`, `file_id`)

**Purpose:** Records which files are included in each snapshot.

### 9. `snapshot_blobs`

Maps snapshots to the blobs they reference.

**Columns:**

- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `blob_hash` (TEXT) - Denormalized blob hash for manifest generation
- PRIMARY KEY (`snapshot_id`, `blob_id`)

**Purpose:** Tracks blob dependencies for snapshots and enables manifest generation.

### 10. `uploads`

Tracks blob upload metrics.

**Columns:**

- `blob_hash` (TEXT PRIMARY KEY) - Hash of uploaded blob
- `uploaded_at` (INTEGER) - Upload timestamp
- `size` (INTEGER) - Size of uploaded blob
- `duration_ms` (INTEGER) - Upload duration in milliseconds

**Purpose:** Performance monitoring and upload tracking.

## Data Flow and Operations

### 1. Backup Process

1. **File Scanning**
   - `INSERT OR REPLACE INTO files` - Update file metadata
   - `SELECT * FROM files WHERE path = ?` - Check if file has changed
   - `INSERT INTO snapshot_files` - Add file to current snapshot
2. **Chunking** (for changed files)
   - `INSERT OR IGNORE INTO chunks` - Store new chunks
   - `INSERT INTO file_chunks` - Map chunks to file
   - `INSERT INTO chunk_files` - Create reverse mapping
3. **Blob Packing** (see the sketch after this list)
   - `INSERT INTO blobs` - Create blob record with UUID (hash empty)
   - `INSERT INTO blob_chunks` - Associate chunks with blob immediately
   - `UPDATE blobs SET hash = ?, finished_ts = ?` - Finalize blob after packing
4. **Upload**
   - `UPDATE blobs SET uploaded_ts = ?` - Mark blob as uploaded
   - `INSERT INTO uploads` - Record upload metrics
   - `INSERT INTO snapshot_blobs` - Associate blob with snapshot
5. **Snapshot Completion**
   - `UPDATE snapshots SET completed_at = ?, stats...` - Finalize snapshot
   - Generate and upload blob manifest from `snapshot_blobs`
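The ordering in the blob-packing step is what lets chunks be associated with a blob before its content hash is known. Below is a minimal sketch of that sequence using plain `database/sql`; the `chunkRef` type, the `packBlob` helper, and the caller-supplied `finalHash` are illustrative, not Vaultik's actual repository code:

```go
package blobpack

import (
	"context"
	"database/sql"
	"time"

	"github.com/google/uuid"
)

// chunkRef is an illustrative type, not Vaultik's actual model.
type chunkRef struct {
	Hash   string
	Length int64
}

// packBlob mirrors the "Blob Packing" statements above: create the blob
// row with an empty hash, associate chunks immediately, then finalize.
// finalHash is the SHA256 of the compressed, encrypted blob, computed
// by the caller once packing completes.
func packBlob(ctx context.Context, tx *sql.Tx, chunks []chunkRef, finalHash string) error {
	blobID := uuid.NewString() // UUID assigned before the content hash is known

	// 1. Create the blob record with an empty hash.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO blobs (id, hash, created_ts) VALUES (?, '', ?)`,
		blobID, time.Now().Unix(),
	); err != nil {
		return err
	}

	// 2. Associate chunks with the blob immediately, tracking offsets.
	var offset int64
	for _, c := range chunks {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO blob_chunks (blob_id, chunk_hash, offset, length)
			 VALUES (?, ?, ?, ?)`,
			blobID, c.Hash, offset, c.Length,
		); err != nil {
			return err
		}
		offset += c.Length
	}

	// 3. Finalize the blob once packing is done and the hash is known.
	_, err := tx.ExecContext(ctx,
		`UPDATE blobs SET hash = ?, finished_ts = ? WHERE id = ?`,
		finalHash, time.Now().Unix(), blobID,
	)
	return err
}
```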
### 2. Incremental Backup

1. **Change Detection**
   - `SELECT * FROM files WHERE path = ?` - Get previous file metadata
   - Compare mtime, size, and mode to detect changes
   - Skip unchanged files but still add them to `snapshot_files`
2. **Chunk Reuse**
   - `SELECT * FROM blob_chunks WHERE chunk_hash = ?` - Find existing chunks
   - `INSERT INTO snapshot_blobs` - Reference existing blobs for unchanged files

### 3. Restore Process

The restore process does not use the local database. Instead, it:

1. Downloads snapshot metadata from S3
2. Downloads the required blobs based on the manifest
3. Reconstructs files from the decrypted and decompressed chunks

### 4. Pruning

1. **Identify Unreferenced Blobs**
   - Query blobs not referenced by any remaining snapshot
   - Delete them from S3 and the local database

## Repository Pattern

Vaultik uses a repository pattern for database access:

- `FileRepository` - CRUD operations for files
- `ChunkRepository` - CRUD operations for chunks
- `FileChunkRepository` - Manage file-chunk mappings
- `BlobRepository` - Manage blob lifecycle
- `BlobChunkRepository` - Manage blob-chunk associations
- `SnapshotRepository` - Manage snapshots
- `UploadRepository` - Track upload metrics

Each repository provides methods like:

- `Create()` - Insert new record
- `GetByID()` / `GetByPath()` / `GetByHash()` - Retrieve records
- `Update()` - Update existing records
- `Delete()` - Remove records
- Specialized queries for each entity type

## Transaction Management

All database operations that modify multiple tables are wrapped in transactions:

```go
err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
	// Multiple repository operations using tx
	return nil
})
```

This ensures consistency, which is especially important for operations like:

- Creating file-chunk mappings
- Associating chunks with blobs
- Updating snapshot statistics

## Performance Considerations

1. **Indexes**: Primary keys are automatically indexed. Additional indexes may be needed for:
   - `blobs.hash` for lookup performance
   - `blob_chunks.chunk_hash` for chunk location queries
2. **Prepared Statements**: All queries use prepared statements for performance and security
3. **Batch Operations**: Where possible, operations are batched within transactions
4. **Write-Ahead Logging**: SQLite WAL mode is enabled for better concurrency (see the sketch at the end of this document)

## Data Integrity

1. **Foreign Keys**: Enforced at the application level through repository methods
2. **Unique Constraints**: Chunk hashes and file paths are unique
3. **Null Handling**: Nullable fields clearly indicate in-progress operations
4. **Timestamp Tracking**: All major operations record timestamps for auditing
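As a rough illustration of the WAL and indexing points under Performance Considerations, the snippet below opens an index database, enables WAL mode, and creates the two suggested secondary indexes. The database path, index names, and the `mattn/go-sqlite3` driver choice are assumptions for the sketch, not necessarily what Vaultik itself does:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver; Vaultik may use another
)

func main() {
	// Hypothetical path for the local index database; assumes the
	// tables described above already exist.
	db, err := sql.Open("sqlite3", "vaultik-index.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Enable write-ahead logging for better read/write concurrency.
	if _, err := db.Exec(`PRAGMA journal_mode=WAL`); err != nil {
		log.Fatal(err)
	}

	// The secondary indexes suggested under Performance Considerations;
	// the index names are illustrative.
	stmts := []string{
		`CREATE INDEX IF NOT EXISTS idx_blobs_hash ON blobs(hash)`,
		`CREATE INDEX IF NOT EXISTS idx_blob_chunks_chunk_hash ON blob_chunks(chunk_hash)`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```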