Fix manifest generation to not encrypt manifests

- Manifests are now only compressed (not encrypted) so pruning operations can work without private keys
- Updated generateBlobManifest to use zstd compression directly
- Updated prune command to handle unencrypted manifests
- Updated snapshot list command to handle new manifest format
- Updated documentation to reflect manifest.json.zst (not .age)
- Removed unnecessary VAULTIK_PRIVATE_KEY check from prune command

Commit fb220685a2 (parent 1d027bde57), 2025-07-26 02:54:52 +02:00
4 changed files with 352 additions and 34 deletions

docs/DATAMODEL.md (new file, 268 lines)
# Vaultik Data Model
## Overview
Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.
**Important Notes:**
- **No Migration Support**: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
- **Version Compatibility**: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3.
## Database Tables
### 1. `files`
Stores metadata about files in the filesystem being backed up.
**Columns:**
- `id` (TEXT PRIMARY KEY) - UUID for the file record
- `path` (TEXT NOT NULL UNIQUE) - Absolute file path
- `mtime` (INTEGER NOT NULL) - Modification time as Unix timestamp
- `ctime` (INTEGER NOT NULL) - Change time as Unix timestamp
- `size` (INTEGER NOT NULL) - File size in bytes
- `mode` (INTEGER NOT NULL) - Unix file permissions and type
- `uid` (INTEGER NOT NULL) - User ID of file owner
- `gid` (INTEGER NOT NULL) - Group ID of file owner
- `link_target` (TEXT) - Symlink target path (NULL for regular files)
**Indexes:**
- `idx_files_path` on `path` for efficient lookups
**Purpose:** Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.
### 2. `chunks`
Stores information about content-defined chunks created from files.
**Columns:**
- `chunk_hash` (TEXT PRIMARY KEY) - SHA256 hash of chunk content
- `size` (INTEGER NOT NULL) - Chunk size in bytes
**Purpose:** Enables deduplication by tracking unique chunks across all files.
### 3. `file_chunks`
Maps files to their constituent chunks in order.
**Columns:**
- `file_id` (TEXT) - File ID (FK to files.id)
- `idx` (INTEGER) - Chunk index within file (0-based)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- PRIMARY KEY (`file_id`, `idx`)
**Purpose:** Allows reconstruction of files from chunks during restore.
### 4. `chunk_files`
Reverse mapping showing which files contain each chunk.
**Columns:**
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `file_id` (TEXT) - File ID (FK to files.id)
- `file_offset` (INTEGER) - Byte offset of chunk within file
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`chunk_hash`, `file_id`)
**Purpose:** Supports efficient queries for chunk usage and deduplication statistics.
### 5. `blobs`
Stores information about packed, compressed, and encrypted blob files.
**Columns:**
- `id` (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
- `blob_hash` (TEXT UNIQUE) - SHA256 hash of final blob (NULL until finalized)
- `created_ts` (INTEGER NOT NULL) - Creation timestamp
- `finished_ts` (INTEGER) - Finalization timestamp (NULL if in progress)
- `uncompressed_size` (INTEGER NOT NULL DEFAULT 0) - Total size of chunks before compression
- `compressed_size` (INTEGER NOT NULL DEFAULT 0) - Size after compression and encryption
- `uploaded_ts` (INTEGER) - Upload completion timestamp (NULL if not uploaded)
**Purpose:** Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.
### 6. `blob_chunks`
Maps chunks to the blobs that contain them.
**Columns:**
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `offset` (INTEGER) - Byte offset of chunk within blob (before compression)
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`blob_id`, `chunk_hash`)
**Purpose:** Enables chunk retrieval from blobs during restore operations.
### 7. `snapshots`
Tracks backup snapshots.
**Columns:**
- `id` (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
- `hostname` (TEXT) - Hostname where backup was created
- `vaultik_version` (TEXT) - Version of Vaultik used
- `vaultik_git_revision` (TEXT) - Git revision of Vaultik used
- `started_at` (INTEGER) - Start timestamp
- `completed_at` (INTEGER) - Completion timestamp (NULL if in progress)
- `file_count` (INTEGER) - Number of files in snapshot
- `chunk_count` (INTEGER) - Number of unique chunks
- `blob_count` (INTEGER) - Number of blobs referenced
- `total_size` (INTEGER) - Total size of all files
- `blob_size` (INTEGER) - Total size of all blobs (compressed)
- `blob_uncompressed_size` (INTEGER) - Total uncompressed size of all referenced blobs
- `compression_ratio` (REAL) - Compression ratio achieved
- `compression_level` (INTEGER) - Compression level used for this snapshot
- `upload_bytes` (INTEGER) - Total bytes uploaded during this snapshot
- `upload_duration_ms` (INTEGER) - Total milliseconds spent uploading to S3
**Purpose:** Provides snapshot metadata and statistics including version tracking for compatibility.
### 8. `snapshot_files`
Maps snapshots to the files they contain.
**Columns:**
- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `file_id` (TEXT) - File ID (FK to files.id)
- PRIMARY KEY (`snapshot_id`, `file_id`)
**Purpose:** Records which files are included in each snapshot.
### 9. `snapshot_blobs`
Maps snapshots to the blobs they reference.
**Columns:**
- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `blob_hash` (TEXT) - Denormalized blob hash for manifest generation
- PRIMARY KEY (`snapshot_id`, `blob_id`)
**Purpose:** Tracks blob dependencies for snapshots and enables manifest generation.
### 10. `uploads`
Tracks blob upload metrics.
**Columns:**
- `blob_hash` (TEXT PRIMARY KEY) - Hash of uploaded blob
- `snapshot_id` (TEXT NOT NULL) - The snapshot that triggered this upload (FK to snapshots.id)
- `uploaded_at` (INTEGER) - Upload timestamp
- `size` (INTEGER) - Size of uploaded blob
- `duration_ms` (INTEGER) - Upload duration in milliseconds
**Purpose:** Performance monitoring and tracking which blobs were newly created (uploaded) during each snapshot.
## Data Flow and Operations
### 1. Backup Process
1. **File Scanning**
- `INSERT OR REPLACE INTO files` - Update file metadata
- `SELECT * FROM files WHERE path = ?` - Check if file has changed
- `INSERT INTO snapshot_files` - Add file to current snapshot
2. **Chunking** (for changed files)
- `INSERT OR IGNORE INTO chunks` - Store new chunks
- `INSERT INTO file_chunks` - Map chunks to file
- `INSERT INTO chunk_files` - Create reverse mapping
3. **Blob Packing**
- `INSERT INTO blobs` - Create blob record with UUID (blob_hash NULL)
- `INSERT INTO blob_chunks` - Associate chunks with blob immediately
- `UPDATE blobs SET blob_hash = ?, finished_ts = ?` - Finalize blob after packing
4. **Upload**
- `UPDATE blobs SET uploaded_ts = ?` - Mark blob as uploaded
- `INSERT INTO uploads` - Record upload metrics with snapshot_id
- `INSERT INTO snapshot_blobs` - Associate blob with snapshot
5. **Snapshot Completion**
- `UPDATE snapshots SET completed_at = ?, stats...` - Finalize snapshot
- Generate and upload blob manifest from `snapshot_blobs`
### 2. Incremental Backup
1. **Change Detection**
- `SELECT * FROM files WHERE path = ?` - Get previous file metadata
- Compare mtime, size, mode to detect changes
- Skip unchanged files but still add to `snapshot_files`
2. **Chunk Reuse**
- `SELECT * FROM blob_chunks WHERE chunk_hash = ?` - Find existing chunks
- `INSERT INTO snapshot_blobs` - Reference existing blobs for unchanged files
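The mtime/size/mode comparison from step 1 can be sketched as a small predicate (field names here are illustrative, not Vaultik's actual types):

```go
package main

import "fmt"

// FileMeta mirrors the columns compared during change detection.
// Field names are illustrative, not Vaultik's actual types.
type FileMeta struct {
	MTime int64 // Unix timestamp
	Size  int64
	Mode  uint32
}

// changed reports whether a file must be re-chunked: any difference
// in mtime, size, or mode means the cached chunk list is stale.
func changed(prev, cur FileMeta) bool {
	return prev.MTime != cur.MTime || prev.Size != cur.Size || prev.Mode != cur.Mode
}

func main() {
	prev := FileMeta{MTime: 1700000000, Size: 4096, Mode: 0644}
	same := prev
	touched := FileMeta{MTime: 1700000100, Size: 4096, Mode: 0644}
	fmt.Println(changed(prev, same), changed(prev, touched)) // false true
}
```

Unchanged files short-circuit here: they are added to `snapshot_files` without re-chunking.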
### 3. Snapshot Metadata Export
After a snapshot is completed:
1. Copy database to temporary file
2. Clean temporary database to contain only current snapshot data
3. Export to SQL dump using sqlite3
4. Compress with zstd and encrypt with age
5. Upload to S3 as `metadata/{snapshot-id}/db.zst.age`
6. Generate blob manifest and upload as `metadata/{snapshot-id}/manifest.json.zst`
### 4. Restore Process
The restore process doesn't use the local database. Instead:
1. Downloads snapshot metadata from S3
2. Downloads required blobs based on manifest
3. Reconstructs files from decrypted and decompressed chunks
### 5. Pruning
1. **Identify Unreferenced Blobs**
- Query blobs not referenced by any remaining snapshot
- Delete from S3 and local database
### 6. Incomplete Snapshot Cleanup
Before each backup:
1. Query incomplete snapshots (where `completed_at IS NULL`)
2. Check if metadata exists in S3
3. If no metadata, delete snapshot and all associations
4. Clean up orphaned files, chunks, and blobs
## Repository Pattern
Vaultik uses a repository pattern for database access:
- `FileRepository` - CRUD operations for files and file metadata
- `ChunkRepository` - CRUD operations for content chunks
- `FileChunkRepository` - Manage file-to-chunk mappings
- `ChunkFileRepository` - Manage chunk-to-file reverse mappings
- `BlobRepository` - Manage blob lifecycle (creation, finalization, upload)
- `BlobChunkRepository` - Manage blob-to-chunk associations
- `SnapshotRepository` - Manage snapshots and their relationships
- `UploadRepository` - Track blob upload metrics
Each repository provides methods like:
- `Create()` - Insert new record
- `GetByID()` / `GetByPath()` / `GetByHash()` - Retrieve records
- `Update()` - Update existing records
- `Delete()` - Remove records
- Specialized queries for each entity type (e.g., `DeleteOrphaned()`, `GetIncompleteByHostname()`)
## Transaction Management
All database operations that modify multiple tables are wrapped in transactions:
```go
err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
	// Multiple repository operations using tx;
	// returning a non-nil error rolls the transaction back.
	return nil
})
```
This ensures consistency, especially important for operations like:
- Creating file-chunk mappings
- Associating chunks with blobs
- Updating snapshot statistics
## Performance Considerations
1. **Indexes**:
- Primary keys are automatically indexed
- `idx_files_path` on `files(path)` for efficient file lookups
2. **Prepared Statements**: All queries use prepared statements for performance and security
3. **Batch Operations**: Where possible, operations are batched within transactions
4. **Write-Ahead Logging**: SQLite WAL mode is enabled for better concurrency
## Data Integrity
1. **Foreign Keys**: Enforced through CASCADE DELETE and application-level repository methods
2. **Unique Constraints**: Chunk hashes, file paths, and blob hashes are unique
3. **Null Handling**: Nullable fields clearly indicate in-progress operations
4. **Timestamp Tracking**: All major operations record timestamps for auditing

docs/REPOSTRUCTURE.md (new file, 143 lines)
# Vaultik S3 Repository Structure
This document describes the structure and organization of data stored in the S3 bucket by Vaultik.
## Overview
Vaultik stores all backup data in an S3-compatible object store. The repository consists of two main components:
1. **Blobs** - The actual backup data (content-addressed, encrypted)
2. **Metadata** - Snapshot information and manifests (partially encrypted)
## Directory Structure
```
<bucket>/<prefix>/
├── blobs/
│ └── <hash[0:2]>/
│ └── <hash[2:4]>/
│ └── <full-hash>
└── metadata/
└── <snapshot-id>/
├── db.zst.age
└── manifest.json.zst
```
## Blobs Directory (`blobs/`)
### Structure
- **Path format**: `blobs/<first-2-chars>/<next-2-chars>/<full-hash>`
- **Example**: `blobs/ca/fe/cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678`
- **Sharding**: The two-level directory structure (using the first 4 characters of the hash) prevents any single directory from containing too many objects
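The sharding scheme can be expressed as a one-line key builder (the function name is illustrative):

```go
package main

import "fmt"

// blobKey shards a blob hash into the two-level directory layout
// described above: blobs/<hash[0:2]>/<hash[2:4]>/<full-hash>.
func blobKey(hash string) string {
	return fmt.Sprintf("blobs/%s/%s/%s", hash[0:2], hash[2:4], hash)
}

func main() {
	h := "cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678"
	fmt.Println(blobKey(h)) // blobs/ca/fe/cafebabe1234...
}
```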
### Content
- **What it contains**: Packed collections of content-defined chunks from files
- **Format**: Zstandard compressed, then Age encrypted
- **Encryption**: Always encrypted with Age using the configured recipients
- **Naming**: Content-addressed using SHA256 hash of the encrypted blob
### Why Encrypted
Blobs contain the actual file data from backups and must be encrypted for security. The content-addressing ensures deduplication while the encryption ensures privacy.
## Metadata Directory (`metadata/`)
Each snapshot has its own subdirectory named with the snapshot ID.
### Snapshot ID Format
- **Format**: `<hostname>-<YYYYMMDD>-<HHMMSSZ>`
- **Example**: `laptop-20240115-143052Z`
- **Components**:
- Hostname (may contain hyphens)
- Date in YYYYMMDD format
- Time in HHMMSSZ format (Z indicates UTC)
### Files in Each Snapshot Directory
#### `db.zst.age` - Encrypted Database Dump
- **What it contains**: Complete SQLite database dump for this snapshot
- **Format**: SQL dump → Zstandard compressed → Age encrypted
- **Encryption**: Encrypted with Age
- **Purpose**: Contains full file metadata, chunk mappings, and all relationships
- **Why encrypted**: Contains sensitive metadata like file paths, permissions, and ownership
#### `manifest.json.zst` - Unencrypted Blob Manifest
- **What it contains**: JSON list of all blob hashes referenced by this snapshot
- **Format**: JSON → Zstandard compressed (NOT encrypted)
- **Encryption**: NOT encrypted
- **Purpose**: Enables pruning operations without requiring decryption keys
- **Structure**:
```json
{
"snapshot_id": "laptop-20240115-143052Z",
"timestamp": "2024-01-15T14:30:52Z",
"blob_count": 42,
"blobs": [
"cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678",
"deadbeef1234567890abcdef1234567890abcdef1234567890abcdef12345678",
...
]
}
```
### Why Manifest is Unencrypted
The manifest must be readable without the private key to enable:
1. **Pruning operations** - Identifying unreferenced blobs for deletion
2. **Storage analysis** - Understanding space usage without decryption
3. **Verification** - Checking blob existence without decryption
4. **Cross-snapshot deduplication analysis** - Finding shared blobs between snapshots
The manifest only contains blob hashes, not file names or any other sensitive information.
## Security Considerations
### What's Encrypted
- **All file content** (in blobs)
- **All file metadata** (paths, permissions, timestamps, ownership in db.zst.age)
- **File-to-chunk mappings** (in db.zst.age)
### What's Not Encrypted
- **Blob hashes** (in manifest.json.zst)
- **Snapshot IDs** (directory names)
- **Blob count per snapshot** (in manifest.json.zst)
### Privacy Implications
From the unencrypted data, an observer can determine:
- When backups were taken (from snapshot IDs)
- Which hostname created backups (from snapshot IDs)
- How many blobs each snapshot references
- Which blobs are shared between snapshots (deduplication patterns)
- The size of each encrypted blob
An observer cannot determine:
- File names or paths
- File contents
- File permissions or ownership
- Directory structure
- Which chunks belong to which files
## Consistency Guarantees
1. **Blobs are immutable** - Once written, a blob is never modified
2. **Blobs are written before metadata** - A snapshot's metadata is only written after all its blobs are successfully uploaded
3. **Metadata is written atomically** - Both db.zst.age and manifest.json.zst are written as complete files
4. **Snapshots are marked complete in local DB only after metadata upload** - Ensures consistency between local and remote state
## Pruning Safety
The prune operation is safe because:
1. It only deletes blobs not referenced in any manifest
2. Manifests are unencrypted and can be read without keys
3. The operation compares the latest local DB snapshot with the latest S3 snapshot to ensure consistency
4. Pruning will fail if these don't match, preventing accidental deletion of needed blobs
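The core of step 1 is a set difference between stored blob hashes and the union of all manifests — sketched here with plain slices and a map (names are illustrative):

```go
package main

import "fmt"

// pruneCandidates returns blob hashes present in the store but not
// referenced by any snapshot manifest. Reading the manifests needs
// no private key because they are stored unencrypted.
func pruneCandidates(stored []string, manifests [][]string) []string {
	referenced := make(map[string]bool)
	for _, m := range manifests {
		for _, h := range m {
			referenced[h] = true
		}
	}
	var unused []string
	for _, h := range stored {
		if !referenced[h] {
			unused = append(unused, h)
		}
	}
	return unused
}

func main() {
	stored := []string{"aaaa", "bbbb", "cccc"}
	manifests := [][]string{{"aaaa"}, {"aaaa", "cccc"}}
	fmt.Println(pruneCandidates(stored, manifests)) // [bbbb]
}
```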
## Restoration Requirements
To restore from a backup, you need:
1. **The Age private key** - To decrypt blobs and database
2. **The snapshot metadata** - Both files from the snapshot's metadata directory
3. **All referenced blobs** - As listed in the manifest
The restoration process:
1. Download and decrypt the database dump to understand file structure
2. Download and decrypt the required blobs
3. Reconstruct files from their chunks
4. Restore file metadata (permissions, timestamps, etc.)