Vaultik Data Model

Overview

Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.

Important Notes:

  • No Migration Support: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
  • Version Compatibility: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3.

Database Tables

1. files

Stores metadata about files in the filesystem being backed up.

Columns:

  • id (TEXT PRIMARY KEY) - UUID for the file record
  • path (TEXT UNIQUE) - Absolute file path
  • mtime (INTEGER) - Modification time as Unix timestamp
  • ctime (INTEGER) - Change time as Unix timestamp
  • size (INTEGER) - File size in bytes
  • mode (INTEGER) - Unix file permissions and type
  • uid (INTEGER) - User ID of file owner
  • gid (INTEGER) - Group ID of file owner
  • link_target (TEXT) - Symlink target path (empty for regular files)

Purpose: Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.
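
As a concrete sketch, the column list above corresponds to DDL along these lines (illustrative only; the authoritative schema lives in vaultik's Go source and may differ in constraints, and the other tables follow analogously):

const filesSchema = `
CREATE TABLE files (
    id          TEXT PRIMARY KEY,        -- UUID
    path        TEXT UNIQUE NOT NULL,    -- absolute path
    mtime       INTEGER NOT NULL,
    ctime       INTEGER NOT NULL,
    size        INTEGER NOT NULL,
    mode        INTEGER NOT NULL,
    uid         INTEGER NOT NULL,
    gid         INTEGER NOT NULL,
    link_target TEXT NOT NULL DEFAULT '' -- empty for regular files
);`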

2. chunks

Stores information about content-defined chunks created from files.

Columns:

  • chunk_hash (TEXT PRIMARY KEY) - SHA256 hash of chunk content
  • sha256 (TEXT) - SHA256 hash (currently same as chunk_hash)
  • size (INTEGER) - Chunk size in bytes

Purpose: Enables deduplication by tracking unique chunks across all files.

3. file_chunks

Maps files to their constituent chunks in order.

Columns:

  • file_id (TEXT) - File ID (FK to files.id)
  • idx (INTEGER) - Chunk index within file (0-based)
  • chunk_hash (TEXT) - Chunk hash (FK to chunks.chunk_hash)
  • PRIMARY KEY (file_id, idx)

Purpose: Allows reconstruction of files from chunks during restore.
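
Reconstructing a file's chunk sequence is then a single ordered lookup on this table. A sketch using database/sql (variable names are hypothetical):

// Fetch a file's chunk hashes in reconstruction order.
rows, err := db.QueryContext(ctx,
    `SELECT chunk_hash FROM file_chunks WHERE file_id = ? ORDER BY idx`,
    fileID)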

4. chunk_files

Reverse mapping showing which files contain each chunk.

Columns:

  • chunk_hash (TEXT) - Chunk hash (FK to chunks.chunk_hash)
  • file_id (TEXT) - File ID (FK to files.id)
  • file_offset (INTEGER) - Byte offset of chunk within file
  • length (INTEGER) - Length of chunk in bytes
  • PRIMARY KEY (chunk_hash, file_id)

Purpose: Supports efficient queries for chunk usage and deduplication statistics.

5. blobs

Stores information about packed, compressed, and encrypted blob files.

Columns:

  • id (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
  • hash (TEXT) - SHA256 hash of final blob (empty until finalized)
  • created_ts (INTEGER) - Creation timestamp
  • finished_ts (INTEGER) - Finalization timestamp (NULL if in progress)
  • uncompressed_size (INTEGER) - Total size of chunks before compression
  • compressed_size (INTEGER) - Size after compression and encryption
  • uploaded_ts (INTEGER) - Upload completion timestamp (NULL if not uploaded)

Purpose: Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.
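
Because the nullable timestamps encode lifecycle state, questions like "which blobs were packed but never uploaded" reduce to simple queries. A sketch of how these columns could be used (not vaultik's actual recovery logic):

// Blobs that finished packing but have no recorded upload,
// e.g. candidates for retry after an interrupted run.
rows, err := db.QueryContext(ctx,
    `SELECT id, hash FROM blobs
     WHERE finished_ts IS NOT NULL AND uploaded_ts IS NULL`)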

6. blob_chunks

Maps chunks to the blobs that contain them.

Columns:

  • blob_id (TEXT) - Blob ID (FK to blobs.id)
  • chunk_hash (TEXT) - Chunk hash (FK to chunks.chunk_hash)
  • offset (INTEGER) - Byte offset of chunk within blob (before compression)
  • length (INTEGER) - Length of chunk in bytes
  • PRIMARY KEY (blob_id, chunk_hash)

Purpose: Enables chunk retrieval from blobs during restore operations.

7. snapshots

Tracks backup snapshots.

Columns:

  • id (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
  • hostname (TEXT) - Hostname where backup was created
  • vaultik_version (TEXT) - Version of Vaultik used
  • vaultik_git_revision (TEXT) - Git revision of Vaultik used
  • started_at (INTEGER) - Start timestamp
  • completed_at (INTEGER) - Completion timestamp (NULL if in progress)
  • file_count (INTEGER) - Number of files in snapshot
  • chunk_count (INTEGER) - Number of unique chunks
  • blob_count (INTEGER) - Number of blobs referenced
  • total_size (INTEGER) - Total size of all files
  • blob_size (INTEGER) - Total size of all blobs (compressed)
  • blob_uncompressed_size (INTEGER) - Total uncompressed size of all referenced blobs
  • compression_ratio (REAL) - Compression ratio achieved
  • compression_level (INTEGER) - Compression level used for this snapshot
  • upload_bytes (INTEGER) - Total bytes uploaded during this snapshot
  • upload_duration_ms (INTEGER) - Total milliseconds spent uploading to S3

Purpose: Provides snapshot metadata and statistics including version tracking for compatibility.
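
Generating an ID in the documented hostname-YYYYMMDD-HHMMSSZ format is straightforward in Go (a sketch, not necessarily vaultik's actual code):

// Produces e.g. "myhost-20250722-125644Z" using Go's
// reference-time layout; the timestamp is UTC.
id := fmt.Sprintf("%s-%sZ", hostname,
    time.Now().UTC().Format("20060102-150405"))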

8. snapshot_files

Maps snapshots to the files they contain.

Columns:

  • snapshot_id (TEXT) - Snapshot ID (FK to snapshots.id)
  • file_id (TEXT) - File ID (FK to files.id)
  • PRIMARY KEY (snapshot_id, file_id)

Purpose: Records which files are included in each snapshot.

9. snapshot_blobs

Maps snapshots to the blobs they reference.

Columns:

  • snapshot_id (TEXT) - Snapshot ID (FK to snapshots.id)
  • blob_id (TEXT) - Blob ID (FK to blobs.id)
  • blob_hash (TEXT) - Denormalized blob hash for manifest generation
  • PRIMARY KEY (snapshot_id, blob_id)

Purpose: Tracks blob dependencies for snapshots and enables manifest generation.
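
Because blob_hash is denormalized into this table, a snapshot's manifest can be produced without joining blobs (a sketch):

// List every blob hash a snapshot depends on, e.g. for manifest output.
rows, err := db.QueryContext(ctx,
    `SELECT blob_hash FROM snapshot_blobs WHERE snapshot_id = ?`,
    snapshotID)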

10. uploads

Tracks blob upload metrics.

Columns:

  • blob_hash (TEXT PRIMARY KEY) - Hash of uploaded blob
  • uploaded_at (INTEGER) - Upload timestamp
  • size (INTEGER) - Size of uploaded blob
  • duration_ms (INTEGER) - Upload duration in milliseconds

Purpose: Performance monitoring and upload tracking.

Data Flow and Operations

1. Backup Process

  1. File Scanning

    • INSERT OR REPLACE INTO files - Update file metadata
    • SELECT * FROM files WHERE path = ? - Check if file has changed
    • INSERT INTO snapshot_files - Add file to current snapshot
  2. Chunking (for changed files)

    • INSERT OR IGNORE INTO chunks - Store new chunks
    • INSERT INTO file_chunks - Map chunks to file
    • INSERT INTO chunk_files - Create reverse mapping
  3. Blob Packing

    • INSERT INTO blobs - Create blob record with UUID (hash empty; see the sketch after this list)
    • INSERT INTO blob_chunks - Associate chunks with blob immediately
    • UPDATE blobs SET hash = ?, finished_ts = ? - Finalize blob after packing
  4. Upload

    • UPDATE blobs SET uploaded_ts = ? - Mark blob as uploaded
    • INSERT INTO uploads - Record upload metrics
    • INSERT INTO snapshot_blobs - Associate blob with snapshot
  5. Snapshot Completion

    • UPDATE snapshots SET completed_at = ?, stats... - Finalize snapshot
    • Generate and upload blob manifest from snapshot_blobs
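
Step 3 hinges on the UUID-first design: the blob row exists before its content hash is known, so chunk associations never wait on packing. A condensed sketch of that step (blobID, now, chunkHash, chunkOffset, and chunkLen are hypothetical variables; the SQL shapes follow the tables above):

err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
    // Create the blob row first (hash still empty) so blob_chunks
    // rows can reference blobID immediately.
    if _, err := tx.ExecContext(ctx,
        `INSERT INTO blobs (id, hash, created_ts) VALUES (?, '', ?)`,
        blobID, now); err != nil {
        return err
    }
    // Associate a chunk as it streams into the packer.
    _, err := tx.ExecContext(ctx,
        `INSERT INTO blob_chunks (blob_id, chunk_hash, offset, length)
         VALUES (?, ?, ?, ?)`,
        blobID, chunkHash, chunkOffset, chunkLen)
    return err
})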

2. Incremental Backup

  1. Change Detection

    • SELECT * FROM files WHERE path = ? - Get previous file metadata
    • Compare mtime, size, mode to detect changes (see the sketch after this list)
    • Skip unchanged files but still add to snapshot_files
  2. Chunk Reuse

    • SELECT * FROM blob_chunks WHERE chunk_hash = ? - Find existing chunks
    • INSERT INTO snapshot_blobs - Reference existing blobs for unchanged files
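
The comparison itself is a plain equality check on the stat fields. A sketch (types and names are hypothetical):

// A file is unchanged when its stat fields match the previous run.
unchanged := prev != nil &&
    prev.MTime == cur.MTime &&
    prev.Size == cur.Size &&
    prev.Mode == cur.Mode
if unchanged {
    // Skip re-chunking, but still add the file to snapshot_files.
}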

3. Restore Process

The restore process doesn't use the local database. Instead:

  1. Downloads snapshot metadata from S3
  2. Downloads required blobs based on manifest
  3. Reconstructs files from decrypted and decompressed chunks

4. Pruning

  1. Identify Unreferenced Blobs
    • Query blobs not referenced by any remaining snapshot (see the sketch below)
    • Delete from S3 and local database
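
The unreferenced-blob query is a NOT EXISTS over snapshot_blobs. A sketch (the actual pruning logic lives in vaultik's repositories):

// Find blobs that no surviving snapshot references.
rows, err := db.QueryContext(ctx, `
    SELECT b.id, b.hash FROM blobs b
    WHERE NOT EXISTS (
        SELECT 1 FROM snapshot_blobs sb WHERE sb.blob_id = b.id)`)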

Repository Pattern

Vaultik uses a repository pattern for database access:

  • FileRepository - CRUD operations for files
  • ChunkRepository - CRUD operations for chunks
  • FileChunkRepository - Manage file-chunk mappings
  • BlobRepository - Manage blob lifecycle
  • BlobChunkRepository - Manage blob-chunk associations
  • SnapshotRepository - Manage snapshots
  • UploadRepository - Track upload metrics

Each repository provides methods like:

  • Create() - Insert new record
  • GetByID() / GetByPath() / GetByHash() - Retrieve records
  • Update() - Update existing records
  • Delete() - Remove records
  • Specialized queries for each entity type
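
As a shape, a repository might look like this (a hypothetical interface; method names mirror the conventions above, not vaultik's exact signatures):

type FileRepository interface {
    Create(ctx context.Context, f *File) error
    GetByID(ctx context.Context, id string) (*File, error)
    GetByPath(ctx context.Context, path string) (*File, error)
    Update(ctx context.Context, f *File) error
    Delete(ctx context.Context, id string) error
}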

Transaction Management

All database operations that modify multiple tables are wrapped in transactions:

err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
    // Perform the related repository operations using tx here;
    // returning a non-nil error rolls back the whole transaction.
    return nil
})

This ensures consistency, especially important for operations like:

  • Creating file-chunk mappings
  • Associating chunks with blobs
  • Updating snapshot statistics

Performance Considerations

  1. Indexes: Primary keys are automatically indexed. Additional indexes (sketched after this list) may be needed for:

    • blobs.hash for lookup performance
    • blob_chunks.chunk_hash for chunk location queries
  2. Prepared Statements: All queries use prepared statements for performance and security

  3. Batch Operations: Where possible, operations are batched within transactions

  4. Write-Ahead Logging: SQLite WAL mode is enabled for better concurrency
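
A sketch of the candidate indexes and concurrency settings (assumptions drawn from the considerations above, not the shipped schema; index names are hypothetical):

// Secondary indexes for blob lookup and chunk location queries.
_, err := db.ExecContext(ctx,
    `CREATE INDEX IF NOT EXISTS idx_blobs_hash ON blobs(hash)`)
_, err = db.ExecContext(ctx,
    `CREATE INDEX IF NOT EXISTS idx_blob_chunks_hash ON blob_chunks(chunk_hash)`)

// WAL mode and a busy timeout let concurrent writers wait briefly
// instead of failing immediately with SQLITE_BUSY.
_, err = db.ExecContext(ctx, `PRAGMA journal_mode=WAL`)
_, err = db.ExecContext(ctx, `PRAGMA busy_timeout=5000`)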

Data Integrity

  1. Foreign Keys: Enforced at the application level through repository methods
  2. Unique Constraints: Chunk hashes and file paths are unique
  3. Null Handling: Nullable timestamp fields such as blobs.finished_ts and blobs.uploaded_ts indicate in-progress operations
  4. Timestamp Tracking: All major operations record timestamps for auditing