Fix manifest generation to not encrypt manifests

- Manifests are now only compressed (not encrypted) so pruning operations can work without private keys
- Updated generateBlobManifest to use zstd compression directly
- Updated prune command to handle unencrypted manifests
- Updated snapshot list command to handle new manifest format
- Updated documentation to reflect manifest.json.zst (not .age)
- Removed unnecessary VAULTIK_PRIVATE_KEY check from prune command

Commit fb220685a2 (parent 1d027bde57), 2025-07-26 02:54:52 +02:00
4 changed files with 352 additions and 34 deletions

docs/DATAMODEL.md (new file, 268 lines)
# Vaultik Data Model
## Overview
Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.
**Important Notes:**
- **No Migration Support**: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
- **Version Compatibility**: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3.
## Database Tables
### 1. `files`
Stores metadata about files in the filesystem being backed up.
**Columns:**
- `id` (TEXT PRIMARY KEY) - UUID for the file record
- `path` (TEXT NOT NULL UNIQUE) - Absolute file path
- `mtime` (INTEGER NOT NULL) - Modification time as Unix timestamp
- `ctime` (INTEGER NOT NULL) - Change time as Unix timestamp
- `size` (INTEGER NOT NULL) - File size in bytes
- `mode` (INTEGER NOT NULL) - Unix file permissions and type
- `uid` (INTEGER NOT NULL) - User ID of file owner
- `gid` (INTEGER NOT NULL) - Group ID of file owner
- `link_target` (TEXT) - Symlink target path (NULL for regular files)
**Indexes:**
- `idx_files_path` on `path` for efficient lookups
**Purpose:** Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.
### 2. `chunks`
Stores information about content-defined chunks created from files.
**Columns:**
- `chunk_hash` (TEXT PRIMARY KEY) - SHA256 hash of chunk content
- `size` (INTEGER NOT NULL) - Chunk size in bytes
**Purpose:** Enables deduplication by tracking unique chunks across all files.
### 3. `file_chunks`
Maps files to their constituent chunks in order.
**Columns:**
- `file_id` (TEXT) - File ID (FK to files.id)
- `idx` (INTEGER) - Chunk index within file (0-based)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- PRIMARY KEY (`file_id`, `idx`)
**Purpose:** Allows reconstruction of files from chunks during restore.
### 4. `chunk_files`
Reverse mapping showing which files contain each chunk.
**Columns:**
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `file_id` (TEXT) - File ID (FK to files.id)
- `file_offset` (INTEGER) - Byte offset of chunk within file
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`chunk_hash`, `file_id`)
**Purpose:** Supports efficient queries for chunk usage and deduplication statistics.
### 5. `blobs`
Stores information about packed, compressed, and encrypted blob files.
**Columns:**
- `id` (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
- `blob_hash` (TEXT UNIQUE) - SHA256 hash of final blob (NULL until finalized)
- `created_ts` (INTEGER NOT NULL) - Creation timestamp
- `finished_ts` (INTEGER) - Finalization timestamp (NULL if in progress)
- `uncompressed_size` (INTEGER NOT NULL DEFAULT 0) - Total size of chunks before compression
- `compressed_size` (INTEGER NOT NULL DEFAULT 0) - Size after compression and encryption
- `uploaded_ts` (INTEGER) - Upload completion timestamp (NULL if not uploaded)
**Purpose:** Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.
### 6. `blob_chunks`
Maps chunks to the blobs that contain them.
**Columns:**
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `offset` (INTEGER) - Byte offset of chunk within blob (before compression)
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`blob_id`, `chunk_hash`)
**Purpose:** Enables chunk retrieval from blobs during restore operations.
### 7. `snapshots`
Tracks backup snapshots.
**Columns:**
- `id` (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
- `hostname` (TEXT) - Hostname where backup was created
- `vaultik_version` (TEXT) - Version of Vaultik used
- `vaultik_git_revision` (TEXT) - Git revision of Vaultik used
- `started_at` (INTEGER) - Start timestamp
- `completed_at` (INTEGER) - Completion timestamp (NULL if in progress)
- `file_count` (INTEGER) - Number of files in snapshot
- `chunk_count` (INTEGER) - Number of unique chunks
- `blob_count` (INTEGER) - Number of blobs referenced
- `total_size` (INTEGER) - Total size of all files
- `blob_size` (INTEGER) - Total size of all blobs (compressed)
- `blob_uncompressed_size` (INTEGER) - Total uncompressed size of all referenced blobs
- `compression_ratio` (REAL) - Compression ratio achieved
- `compression_level` (INTEGER) - Compression level used for this snapshot
- `upload_bytes` (INTEGER) - Total bytes uploaded during this snapshot
- `upload_duration_ms` (INTEGER) - Total milliseconds spent uploading to S3
**Purpose:** Provides snapshot metadata and statistics including version tracking for compatibility.
### 8. `snapshot_files`
Maps snapshots to the files they contain.
**Columns:**
- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `file_id` (TEXT) - File ID (FK to files.id)
- PRIMARY KEY (`snapshot_id`, `file_id`)
**Purpose:** Records which files are included in each snapshot.
### 9. `snapshot_blobs`
Maps snapshots to the blobs they reference.
**Columns:**
- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `blob_hash` (TEXT) - Denormalized blob hash for manifest generation
- PRIMARY KEY (`snapshot_id`, `blob_id`)
**Purpose:** Tracks blob dependencies for snapshots and enables manifest generation.
### 10. `uploads`
Tracks blob upload metrics.
**Columns:**
- `blob_hash` (TEXT PRIMARY KEY) - Hash of uploaded blob
- `snapshot_id` (TEXT NOT NULL) - The snapshot that triggered this upload (FK to snapshots.id)
- `uploaded_at` (INTEGER) - Upload timestamp
- `size` (INTEGER) - Size of uploaded blob
- `duration_ms` (INTEGER) - Upload duration in milliseconds
**Purpose:** Performance monitoring and tracking which blobs were newly created (uploaded) during each snapshot.
## Data Flow and Operations
### 1. Backup Process
1. **File Scanning**
- `INSERT OR REPLACE INTO files` - Update file metadata
- `SELECT * FROM files WHERE path = ?` - Check if file has changed
- `INSERT INTO snapshot_files` - Add file to current snapshot
2. **Chunking** (for changed files)
- `INSERT OR IGNORE INTO chunks` - Store new chunks
- `INSERT INTO file_chunks` - Map chunks to file
- `INSERT INTO chunk_files` - Create reverse mapping
3. **Blob Packing**
- `INSERT INTO blobs` - Create blob record with UUID (blob_hash NULL)
- `INSERT INTO blob_chunks` - Associate chunks with blob immediately
- `UPDATE blobs SET blob_hash = ?, finished_ts = ?` - Finalize blob after packing
4. **Upload**
- `UPDATE blobs SET uploaded_ts = ?` - Mark blob as uploaded
- `INSERT INTO uploads` - Record upload metrics with snapshot_id
- `INSERT INTO snapshot_blobs` - Associate blob with snapshot
5. **Snapshot Completion**
- `UPDATE snapshots SET completed_at = ?, stats...` - Finalize snapshot
- Generate and upload blob manifest from `snapshot_blobs`
### 2. Incremental Backup
1. **Change Detection**
- `SELECT * FROM files WHERE path = ?` - Get previous file metadata
- Compare mtime, size, mode to detect changes
- Skip unchanged files but still add to `snapshot_files`
2. **Chunk Reuse**
- `SELECT * FROM blob_chunks WHERE chunk_hash = ?` - Find existing chunks
- `INSERT INTO snapshot_blobs` - Reference existing blobs for unchanged files
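The mtime/size/mode comparison from step 1 can be sketched as a small predicate (field names here are illustrative, not Vaultik's actual types):

```go
package main

import "fmt"

// FileMeta mirrors the columns compared during change detection.
// Field names are illustrative, not Vaultik's actual types.
type FileMeta struct {
	MTime int64 // Unix timestamp
	Size  int64
	Mode  uint32
}

// changed reports whether a file must be re-chunked: any difference
// in mtime, size, or mode means the cached chunk list is stale.
func changed(prev, cur FileMeta) bool {
	return prev.MTime != cur.MTime || prev.Size != cur.Size || prev.Mode != cur.Mode
}

func main() {
	prev := FileMeta{MTime: 1700000000, Size: 4096, Mode: 0644}
	same := prev
	touched := FileMeta{MTime: 1700000100, Size: 4096, Mode: 0644}
	fmt.Println(changed(prev, same), changed(prev, touched)) // false true
}
```

Unchanged files short-circuit here: they are added to `snapshot_files` without re-chunking.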
### 3. Snapshot Metadata Export
After a snapshot is completed:
1. Copy database to temporary file
2. Clean temporary database to contain only current snapshot data
3. Export to SQL dump using sqlite3
4. Compress with zstd and encrypt with age
5. Upload to S3 as `metadata/{snapshot-id}/db.zst.age`
6. Generate blob manifest and upload as `metadata/{snapshot-id}/manifest.json.zst`
### 4. Restore Process
The restore process doesn't use the local database. Instead:
1. Downloads snapshot metadata from S3
2. Downloads required blobs based on manifest
3. Reconstructs files from decrypted and decompressed chunks
### 5. Pruning
1. **Identify Unreferenced Blobs**
- Query blobs not referenced by any remaining snapshot
- Delete from S3 and local database
### 6. Incomplete Snapshot Cleanup
Before each backup:
1. Query incomplete snapshots (where `completed_at IS NULL`)
2. Check if metadata exists in S3
3. If no metadata, delete snapshot and all associations
4. Clean up orphaned files, chunks, and blobs
## Repository Pattern
Vaultik uses a repository pattern for database access:
- `FileRepository` - CRUD operations for files and file metadata
- `ChunkRepository` - CRUD operations for content chunks
- `FileChunkRepository` - Manage file-to-chunk mappings
- `ChunkFileRepository` - Manage chunk-to-file reverse mappings
- `BlobRepository` - Manage blob lifecycle (creation, finalization, upload)
- `BlobChunkRepository` - Manage blob-to-chunk associations
- `SnapshotRepository` - Manage snapshots and their relationships
- `UploadRepository` - Track blob upload metrics
Each repository provides methods like:
- `Create()` - Insert new record
- `GetByID()` / `GetByPath()` / `GetByHash()` - Retrieve records
- `Update()` - Update existing records
- `Delete()` - Remove records
- Specialized queries for each entity type (e.g., `DeleteOrphaned()`, `GetIncompleteByHostname()`)
## Transaction Management
All database operations that modify multiple tables are wrapped in transactions:
```go
err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
	// Multiple repository operations using tx;
	// returning a non-nil error rolls the transaction back.
	return nil
})
```
This ensures consistency, especially important for operations like:
- Creating file-chunk mappings
- Associating chunks with blobs
- Updating snapshot statistics
## Performance Considerations
1. **Indexes**:
- Primary keys are automatically indexed
- `idx_files_path` on `files(path)` for efficient file lookups
2. **Prepared Statements**: All queries use prepared statements for performance and security
3. **Batch Operations**: Where possible, operations are batched within transactions
4. **Write-Ahead Logging**: SQLite WAL mode is enabled for better concurrency
## Data Integrity
1. **Foreign Keys**: Enforced through CASCADE DELETE and application-level repository methods
2. **Unique Constraints**: Chunk hashes, file paths, and blob hashes are unique
3. **Null Handling**: Nullable fields clearly indicate in-progress operations
4. **Timestamp Tracking**: All major operations record timestamps for auditing

docs/REPOSTRUCTURE.md (new file, 143 lines)
# Vaultik S3 Repository Structure
This document describes the structure and organization of data stored in the S3 bucket by Vaultik.
## Overview
Vaultik stores all backup data in an S3-compatible object store. The repository consists of two main components:
1. **Blobs** - The actual backup data (content-addressed, encrypted)
2. **Metadata** - Snapshot information and manifests (partially encrypted)
## Directory Structure
```
<bucket>/<prefix>/
├── blobs/
│ └── <hash[0:2]>/
│ └── <hash[2:4]>/
│ └── <full-hash>
└── metadata/
└── <snapshot-id>/
├── db.zst.age
└── manifest.json.zst
```
## Blobs Directory (`blobs/`)
### Structure
- **Path format**: `blobs/<first-2-chars>/<next-2-chars>/<full-hash>`
- **Example**: `blobs/ca/fe/cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678`
- **Sharding**: The two-level directory structure (using the first 4 characters of the hash) prevents any single directory from containing too many objects
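The sharding scheme can be expressed as a one-line key builder (the function name is illustrative):

```go
package main

import "fmt"

// blobKey shards a blob hash into the two-level directory layout
// described above: blobs/<hash[0:2]>/<hash[2:4]>/<full-hash>.
func blobKey(hash string) string {
	return fmt.Sprintf("blobs/%s/%s/%s", hash[0:2], hash[2:4], hash)
}

func main() {
	h := "cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678"
	fmt.Println(blobKey(h)) // blobs/ca/fe/cafebabe1234...
}
```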
### Content
- **What it contains**: Packed collections of content-defined chunks from files
- **Format**: Zstandard compressed, then Age encrypted
- **Encryption**: Always encrypted with Age using the configured recipients
- **Naming**: Content-addressed using SHA256 hash of the encrypted blob
### Why Encrypted
Blobs contain the actual file data from backups and must be encrypted for security. The content-addressing ensures deduplication while the encryption ensures privacy.
## Metadata Directory (`metadata/`)
Each snapshot has its own subdirectory named with the snapshot ID.
### Snapshot ID Format
- **Format**: `<hostname>-<YYYYMMDD>-<HHMMSSZ>`
- **Example**: `laptop-20240115-143052Z`
- **Components**:
- Hostname (may contain hyphens)
- Date in YYYYMMDD format
- Time in HHMMSSZ format (Z indicates UTC)
### Files in Each Snapshot Directory
#### `db.zst.age` - Encrypted Database Dump
- **What it contains**: Complete SQLite database dump for this snapshot
- **Format**: SQL dump → Zstandard compressed → Age encrypted
- **Encryption**: Encrypted with Age
- **Purpose**: Contains full file metadata, chunk mappings, and all relationships
- **Why encrypted**: Contains sensitive metadata like file paths, permissions, and ownership
#### `manifest.json.zst` - Unencrypted Blob Manifest
- **What it contains**: JSON list of all blob hashes referenced by this snapshot
- **Format**: JSON → Zstandard compressed (NOT encrypted)
- **Encryption**: NOT encrypted
- **Purpose**: Enables pruning operations without requiring decryption keys
- **Structure**:
```json
{
"snapshot_id": "laptop-20240115-143052Z",
"timestamp": "2024-01-15T14:30:52Z",
"blob_count": 42,
"blobs": [
"cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678",
"deadbeef1234567890abcdef1234567890abcdef1234567890abcdef12345678",
...
]
}
```
### Why Manifest is Unencrypted
The manifest must be readable without the private key to enable:
1. **Pruning operations** - Identifying unreferenced blobs for deletion
2. **Storage analysis** - Understanding space usage without decryption
3. **Verification** - Checking blob existence without decryption
4. **Cross-snapshot deduplication analysis** - Finding shared blobs between snapshots
The manifest only contains blob hashes, not file names or any other sensitive information.
## Security Considerations
### What's Encrypted
- **All file content** (in blobs)
- **All file metadata** (paths, permissions, timestamps, ownership in db.zst.age)
- **File-to-chunk mappings** (in db.zst.age)
### What's Not Encrypted
- **Blob hashes** (in manifest.json.zst)
- **Snapshot IDs** (directory names)
- **Blob count per snapshot** (in manifest.json.zst)
### Privacy Implications
From the unencrypted data, an observer can determine:
- When backups were taken (from snapshot IDs)
- Which hostname created backups (from snapshot IDs)
- How many blobs each snapshot references
- Which blobs are shared between snapshots (deduplication patterns)
- The size of each encrypted blob
An observer cannot determine:
- File names or paths
- File contents
- File permissions or ownership
- Directory structure
- Which chunks belong to which files
## Consistency Guarantees
1. **Blobs are immutable** - Once written, a blob is never modified
2. **Blobs are written before metadata** - A snapshot's metadata is only written after all its blobs are successfully uploaded
3. **Metadata is written atomically** - Both db.zst.age and manifest.json.zst are written as complete files
4. **Snapshots are marked complete in local DB only after metadata upload** - Ensures consistency between local and remote state
## Pruning Safety
The prune operation is safe because:
1. It only deletes blobs not referenced in any manifest
2. Manifests are unencrypted and can be read without keys
3. The operation compares the latest local DB snapshot with the latest S3 snapshot to ensure consistency
4. Pruning will fail if these don't match, preventing accidental deletion of needed blobs
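The core of step 1 is a set difference between stored blob hashes and the union of all manifests — sketched here with plain slices and a map (names are illustrative):

```go
package main

import "fmt"

// pruneCandidates returns blob hashes present in the store but not
// referenced by any snapshot manifest. Reading the manifests needs
// no private key because they are stored unencrypted.
func pruneCandidates(stored []string, manifests [][]string) []string {
	referenced := make(map[string]bool)
	for _, m := range manifests {
		for _, h := range m {
			referenced[h] = true
		}
	}
	var unused []string
	for _, h := range stored {
		if !referenced[h] {
			unused = append(unused, h)
		}
	}
	return unused
}

func main() {
	stored := []string{"aaaa", "bbbb", "cccc"}
	manifests := [][]string{{"aaaa"}, {"aaaa", "cccc"}}
	fmt.Println(pruneCandidates(stored, manifests)) // [bbbb]
}
```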
## Restoration Requirements
To restore from a backup, you need:
1. **The Age private key** - To decrypt blobs and database
2. **The snapshot metadata** - Both files from the snapshot's metadata directory
3. **All referenced blobs** - As listed in the manifest
The restoration process:
1. Download and decrypt the database dump to understand file structure
2. Download and decrypt the required blobs
3. Reconstruct files from their chunks
4. Restore file metadata (permissions, timestamps, etc.)