Fix manifest generation to not encrypt manifests
- Manifests are now only compressed (not encrypted) so pruning operations can work without private keys
- Updated generateBlobManifest to use zstd compression directly
- Updated prune command to handle unencrypted manifests
- Updated snapshot list command to handle new manifest format
- Updated documentation to reflect manifest.json.zst (not .age)
- Removed unnecessary VAULTIK_PRIVATE_KEY check from prune command
268
docs/DATAMODEL.md
Normal file
@@ -0,0 +1,268 @@
# Vaultik Data Model

## Overview

Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.

**Important Notes:**
- **No Migration Support**: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
- **Version Compatibility**: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3.

## Database Tables

### 1. `files`
Stores metadata about files in the filesystem being backed up.

**Columns:**
- `id` (TEXT PRIMARY KEY) - UUID for the file record
- `path` (TEXT NOT NULL UNIQUE) - Absolute file path
- `mtime` (INTEGER NOT NULL) - Modification time as Unix timestamp
- `ctime` (INTEGER NOT NULL) - Change time as Unix timestamp
- `size` (INTEGER NOT NULL) - File size in bytes
- `mode` (INTEGER NOT NULL) - Unix file permissions and type
- `uid` (INTEGER NOT NULL) - User ID of file owner
- `gid` (INTEGER NOT NULL) - Group ID of file owner
- `link_target` (TEXT) - Symlink target path (NULL for regular files)

**Indexes:**
- `idx_files_path` on `path` for efficient lookups

**Purpose:** Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.

### 2. `chunks`
Stores information about content-defined chunks created from files.

**Columns:**
- `chunk_hash` (TEXT PRIMARY KEY) - SHA256 hash of chunk content
- `size` (INTEGER NOT NULL) - Chunk size in bytes

**Purpose:** Enables deduplication by tracking unique chunks across all files.

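As a rough illustration of how a content hash used as a primary key yields deduplication, the sketch below keeps an in-memory index keyed by chunk hash. `Chunk`, `ChunkIndex`, and `Add` are hypothetical names, the map only stands in for the `chunks` table, and hex encoding of the SHA256 digest is an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Chunk mirrors a row of the chunks table (hypothetical type).
type Chunk struct {
	Hash string // hex-encoded SHA256 of the chunk content (encoding assumed)
	Size int64  // chunk size in bytes
}

// ChunkIndex is an in-memory stand-in for the chunks table.
type ChunkIndex map[string]Chunk

// Add records the chunk if its hash is unseen and reports whether it was new,
// similar in spirit to INSERT OR IGNORE INTO chunks.
func (idx ChunkIndex) Add(data []byte) (Chunk, bool) {
	sum := sha256.Sum256(data)
	hash := hex.EncodeToString(sum[:])
	if c, ok := idx[hash]; ok {
		return c, false
	}
	c := Chunk{Hash: hash, Size: int64(len(data))}
	idx[hash] = c
	return c, true
}

func main() {
	idx := ChunkIndex{}
	for _, data := range [][]byte{[]byte("hello"), []byte("world"), []byte("hello")} {
		c, isNew := idx.Add(data)
		fmt.Printf("%s size=%d new=%v\n", c.Hash[:8], c.Size, isNew)
	}
	fmt.Println("unique chunks:", len(idx))
}
```

Identical content always produces the same key, so the second `hello` is recognized as already stored.
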
### 3. `file_chunks`
Maps files to their constituent chunks in order.

**Columns:**
- `file_id` (TEXT) - File ID (FK to files.id)
- `idx` (INTEGER) - Chunk index within file (0-based)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- PRIMARY KEY (`file_id`, `idx`)

**Purpose:** Allows reconstruction of files from chunks during restore.

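The role of `idx` can be sketched as follows: sort a file's rows by index, verify there are no gaps, and concatenate the chunk contents. `FileChunk`, `Reassemble`, and the `chunkData` lookup are hypothetical; in a real restore the chunk bytes would come from decrypted blobs.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// FileChunk mirrors a row of the file_chunks table (hypothetical type).
type FileChunk struct {
	FileID    string
	Idx       int
	ChunkHash string
}

// Reassemble concatenates a file's chunks in idx order.
// chunkData is a hypothetical lookup from chunk hash to chunk bytes.
func Reassemble(rows []FileChunk, chunkData map[string][]byte) ([]byte, error) {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]FileChunk(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Idx < sorted[j].Idx })

	var out bytes.Buffer
	for want, fc := range sorted {
		// Indexes are 0-based and must be contiguous.
		if fc.Idx != want {
			return nil, fmt.Errorf("file %s: missing chunk index %d", fc.FileID, want)
		}
		data, ok := chunkData[fc.ChunkHash]
		if !ok {
			return nil, fmt.Errorf("file %s: chunk %s not available", fc.FileID, fc.ChunkHash)
		}
		out.Write(data)
	}
	return out.Bytes(), nil
}

func main() {
	chunks := map[string][]byte{"h1": []byte("foo"), "h2": []byte("bar")}
	rows := []FileChunk{{"f1", 1, "h2"}, {"f1", 0, "h1"}}
	data, err := Reassemble(rows, chunks)
	fmt.Println(string(data), err)
}
```
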
### 4. `chunk_files`
Reverse mapping showing which files contain each chunk.

**Columns:**
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `file_id` (TEXT) - File ID (FK to files.id)
- `file_offset` (INTEGER) - Byte offset of chunk within file
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`chunk_hash`, `file_id`)

**Purpose:** Supports efficient queries for chunk usage and deduplication statistics.

### 5. `blobs`
Stores information about packed, compressed, and encrypted blob files.

**Columns:**
- `id` (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
- `blob_hash` (TEXT UNIQUE) - SHA256 hash of final blob (NULL until finalized)
- `created_ts` (INTEGER NOT NULL) - Creation timestamp
- `finished_ts` (INTEGER) - Finalization timestamp (NULL if in progress)
- `uncompressed_size` (INTEGER NOT NULL DEFAULT 0) - Total size of chunks before compression
- `compressed_size` (INTEGER NOT NULL DEFAULT 0) - Size after compression and encryption
- `uploaded_ts` (INTEGER) - Upload completion timestamp (NULL if not uploaded)

**Purpose:** Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.

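One way to read the nullable columns is as a small state machine. The sketch below derives a lifecycle state from which fields are set; the `Blob` type and the state names are illustrative and are not taken from Vaultik's code.

```go
package main

import "fmt"

// Blob mirrors the lifecycle columns of the blobs table (hypothetical type).
// A nil pointer represents a NULL column.
type Blob struct {
	ID         string
	BlobHash   *string
	CreatedTS  int64
	FinishedTS *int64
	UploadedTS *int64
}

// State derives an illustrative lifecycle state from the nullable columns.
func (b Blob) State() string {
	switch {
	case b.UploadedTS != nil:
		return "uploaded"
	case b.BlobHash != nil && b.FinishedTS != nil:
		return "finalized"
	default:
		return "in-progress"
	}
}

func main() {
	b := Blob{ID: "example-uuid", CreatedTS: 1705329052}
	fmt.Println(b.State())

	hash, finished := "cafebabe", int64(1705329060)
	b.BlobHash, b.FinishedTS = &hash, &finished
	fmt.Println(b.State())
}
```
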
### 6. `blob_chunks`
Maps chunks to the blobs that contain them.

**Columns:**
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `offset` (INTEGER) - Byte offset of chunk within blob (before compression)
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`blob_id`, `chunk_hash`)

**Purpose:** Enables chunk retrieval from blobs during restore operations.

### 7. `snapshots`
Tracks backup snapshots.

**Columns:**
- `id` (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
- `hostname` (TEXT) - Hostname where backup was created
- `vaultik_version` (TEXT) - Version of Vaultik used
- `vaultik_git_revision` (TEXT) - Git revision of Vaultik used
- `started_at` (INTEGER) - Start timestamp
- `completed_at` (INTEGER) - Completion timestamp (NULL if in progress)
- `file_count` (INTEGER) - Number of files in snapshot
- `chunk_count` (INTEGER) - Number of unique chunks
- `blob_count` (INTEGER) - Number of blobs referenced
- `total_size` (INTEGER) - Total size of all files
- `blob_size` (INTEGER) - Total size of all blobs (compressed)
- `blob_uncompressed_size` (INTEGER) - Total uncompressed size of all referenced blobs
- `compression_ratio` (REAL) - Compression ratio achieved
- `compression_level` (INTEGER) - Compression level used for this snapshot
- `upload_bytes` (INTEGER) - Total bytes uploaded during this snapshot
- `upload_duration_ms` (INTEGER) - Total milliseconds spent uploading to S3

**Purpose:** Provides snapshot metadata and statistics including version tracking for compatibility.

### 8. `snapshot_files`
Maps snapshots to the files they contain.

**Columns:**
- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `file_id` (TEXT) - File ID (FK to files.id)
- PRIMARY KEY (`snapshot_id`, `file_id`)

**Purpose:** Records which files are included in each snapshot.

### 9. `snapshot_blobs`
Maps snapshots to the blobs they reference.

**Columns:**
- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `blob_hash` (TEXT) - Denormalized blob hash for manifest generation
- PRIMARY KEY (`snapshot_id`, `blob_id`)

**Purpose:** Tracks blob dependencies for snapshots and enables manifest generation.

### 10. `uploads`
Tracks blob upload metrics.

**Columns:**
- `blob_hash` (TEXT PRIMARY KEY) - Hash of uploaded blob
- `snapshot_id` (TEXT NOT NULL) - The snapshot that triggered this upload (FK to snapshots.id)
- `uploaded_at` (INTEGER) - Upload timestamp
- `size` (INTEGER) - Size of uploaded blob
- `duration_ms` (INTEGER) - Upload duration in milliseconds

**Purpose:** Performance monitoring and tracking which blobs were newly created (uploaded) during each snapshot.

## Data Flow and Operations

### 1. Backup Process

1. **File Scanning**
   - `INSERT OR REPLACE INTO files` - Update file metadata
   - `SELECT * FROM files WHERE path = ?` - Check if file has changed
   - `INSERT INTO snapshot_files` - Add file to current snapshot

2. **Chunking** (for changed files)
   - `INSERT OR IGNORE INTO chunks` - Store new chunks
   - `INSERT INTO file_chunks` - Map chunks to file
   - `INSERT INTO chunk_files` - Create reverse mapping

3. **Blob Packing**
   - `INSERT INTO blobs` - Create blob record with UUID (blob_hash NULL)
   - `INSERT INTO blob_chunks` - Associate chunks with blob immediately
   - `UPDATE blobs SET blob_hash = ?, finished_ts = ?` - Finalize blob after packing

4. **Upload**
   - `UPDATE blobs SET uploaded_ts = ?` - Mark blob as uploaded
   - `INSERT INTO uploads` - Record upload metrics with snapshot_id
   - `INSERT INTO snapshot_blobs` - Associate blob with snapshot

5. **Snapshot Completion**
   - `UPDATE snapshots SET completed_at = ?, stats...` - Finalize snapshot
   - Generate and upload blob manifest from `snapshot_blobs`

### 2. Incremental Backup

1. **Change Detection**
   - `SELECT * FROM files WHERE path = ?` - Get previous file metadata
   - Compare mtime, size, and mode to detect changes
   - Skip unchanged files but still add them to `snapshot_files`

2. **Chunk Reuse**
   - `SELECT * FROM blob_chunks WHERE chunk_hash = ?` - Find existing chunks
   - `INSERT INTO snapshot_blobs` - Reference existing blobs for unchanged files

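The change-detection step can be sketched as a plain field comparison. This assumes the decision uses exactly the three fields listed above (mtime, size, mode); `FileMeta` and `NeedsChunking` are hypothetical names, and a nil previous record models a path with no row in `files` yet.

```go
package main

import "fmt"

// FileMeta holds the subset of files columns used for change detection
// (hypothetical type).
type FileMeta struct {
	Path  string
	MTime int64  // modification time as Unix timestamp
	Size  int64  // file size in bytes
	Mode  uint32 // Unix permissions and type
}

// NeedsChunking reports whether a file must be re-chunked.
// prev is nil when the path was not seen in an earlier backup.
func NeedsChunking(prev *FileMeta, cur FileMeta) bool {
	if prev == nil {
		return true
	}
	return prev.MTime != cur.MTime || prev.Size != cur.Size || prev.Mode != cur.Mode
}

func main() {
	prev := FileMeta{Path: "/etc/hosts", MTime: 1705329052, Size: 220, Mode: 0o644}
	cur := prev
	fmt.Println("unchanged file:", NeedsChunking(&prev, cur))

	cur.Size = 240
	fmt.Println("resized file:", NeedsChunking(&prev, cur))
	fmt.Println("new file:", NeedsChunking(nil, cur))
}
```

Files for which this returns false are skipped for chunking but are still recorded in `snapshot_files`, as described above.
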
### 3. Snapshot Metadata Export

After a snapshot is completed:
1. Copy database to temporary file
2. Clean temporary database to contain only current snapshot data
3. Export to SQL dump using sqlite3
4. Compress with zstd and encrypt with age
5. Upload to S3 as `metadata/{snapshot-id}/db.zst.age`
6. Generate blob manifest and upload as `metadata/{snapshot-id}/manifest.json.zst`

### 4. Restore Process

The restore process doesn't use the local database. Instead, it:
1. Downloads snapshot metadata from S3
2. Downloads required blobs based on the manifest
3. Reconstructs files from decrypted and decompressed chunks

### 5. Pruning

1. **Identify Unreferenced Blobs**
   - Query blobs not referenced by any remaining snapshot
   - Delete from S3 and local database

### 6. Incomplete Snapshot Cleanup

Before each backup:
1. Query incomplete snapshots (where `completed_at IS NULL`)
2. Check if metadata exists in S3
3. If no metadata, delete snapshot and all associations
4. Clean up orphaned files, chunks, and blobs

## Repository Pattern

Vaultik uses a repository pattern for database access:

- `FileRepository` - CRUD operations for files and file metadata
- `ChunkRepository` - CRUD operations for content chunks
- `FileChunkRepository` - Manage file-to-chunk mappings
- `ChunkFileRepository` - Manage chunk-to-file reverse mappings
- `BlobRepository` - Manage blob lifecycle (creation, finalization, upload)
- `BlobChunkRepository` - Manage blob-to-chunk associations
- `SnapshotRepository` - Manage snapshots and their relationships
- `UploadRepository` - Track blob upload metrics

Each repository provides methods like:
- `Create()` - Insert new record
- `GetByID()` / `GetByPath()` / `GetByHash()` - Retrieve records
- `Update()` - Update existing records
- `Delete()` - Remove records
- Specialized queries for each entity type (e.g., `DeleteOrphaned()`, `GetIncompleteByHostname()`)

## Transaction Management

All database operations that modify multiple tables are wrapped in transactions:

```go
err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
	// Multiple repository operations using tx
	return nil
})
```

This ensures consistency, which is especially important for operations like:
- Creating file-chunk mappings
- Associating chunks with blobs
- Updating snapshot statistics

## Performance Considerations

1. **Indexes**:
   - Primary keys are automatically indexed
   - `idx_files_path` on `files(path)` for efficient file lookups

2. **Prepared Statements**: All queries use prepared statements for performance and security

3. **Batch Operations**: Where possible, operations are batched within transactions

4. **Write-Ahead Logging**: SQLite WAL mode is enabled for better concurrency

## Data Integrity

1. **Foreign Keys**: Enforced through `ON DELETE CASCADE` constraints and application-level repository methods
2. **Unique Constraints**: Chunk hashes, file paths, and blob hashes are unique
3. **Null Handling**: Nullable fields clearly indicate in-progress operations
4. **Timestamp Tracking**: All major operations record timestamps for auditing
143
docs/REPOSTRUCTURE.md
Normal file
@@ -0,0 +1,143 @@
# Vaultik S3 Repository Structure

This document describes the structure and organization of data stored in the S3 bucket by Vaultik.

## Overview

Vaultik stores all backup data in an S3-compatible object store. The repository consists of two main components:
1. **Blobs** - The actual backup data (content-addressed, encrypted)
2. **Metadata** - Snapshot information and manifests (partially encrypted)

## Directory Structure

```
<bucket>/<prefix>/
├── blobs/
│   └── <hash[0:2]>/
│       └── <hash[2:4]>/
│           └── <full-hash>
└── metadata/
    └── <snapshot-id>/
        ├── db.zst.age
        └── manifest.json.zst
```

## Blobs Directory (`blobs/`)

### Structure
- **Path format**: `blobs/<first-2-chars>/<next-2-chars>/<full-hash>`
- **Example**: `blobs/ca/fe/cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678`
- **Sharding**: The two-level directory structure (using the first 4 characters of the hash) prevents any single directory from containing too many objects

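The path format above can be expressed as a small helper. This is a minimal sketch: `BlobKey` is a hypothetical name, and the returned key is relative to `<bucket>/<prefix>/`.

```go
package main

import "fmt"

// BlobKey builds the sharded object key for a blob hash, relative to the
// repository prefix: blobs/<hash[0:2]>/<hash[2:4]>/<full-hash>.
func BlobKey(hash string) (string, error) {
	if len(hash) < 4 {
		return "", fmt.Errorf("blob hash %q is too short to shard", hash)
	}
	return "blobs/" + hash[0:2] + "/" + hash[2:4] + "/" + hash, nil
}

func main() {
	key, err := BlobKey("cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678")
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
}
```
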
### Content
- **What it contains**: Packed collections of content-defined chunks from files
- **Format**: Zstandard compressed, then Age encrypted
- **Encryption**: Always encrypted with Age using the configured recipients
- **Naming**: Content-addressed using SHA256 hash of the encrypted blob

### Why Encrypted
Blobs contain the actual file data from backups and must be encrypted for security. The content-addressing ensures deduplication while the encryption ensures privacy.

## Metadata Directory (`metadata/`)

Each snapshot has its own subdirectory named with the snapshot ID.

### Snapshot ID Format
- **Format**: `<hostname>-<YYYYMMDD>-<HHMMSSZ>`
- **Example**: `laptop-20240115-143052Z`
- **Components**:
  - Hostname (may contain hyphens)
  - Date in YYYYMMDD format
  - Time in HHMMSSZ format (Z indicates UTC)

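Because the hostname may itself contain hyphens, a parser cannot simply split on `-`. One approach, sketched below, relies on the fixed width of the date and time suffix and takes everything before it as the hostname. `FormatSnapshotID` and `ParseSnapshotID` are hypothetical helpers, not Vaultik's API.

```go
package main

import (
	"fmt"
	"time"
)

// stampLayout is the Go time layout for the YYYYMMDD-HHMMSSZ suffix.
// The trailing Z is a literal character, not a zone directive.
const stampLayout = "20060102-150405Z"

// FormatSnapshotID builds <hostname>-<YYYYMMDD>-<HHMMSSZ> in UTC.
func FormatSnapshotID(hostname string, t time.Time) string {
	return hostname + "-" + t.UTC().Format(stampLayout)
}

// ParseSnapshotID splits an ID from the right, so hyphens in the hostname
// are preserved.
func ParseSnapshotID(id string) (string, time.Time, error) {
	n := len(stampLayout) // the timestamp suffix has a fixed width
	if len(id) < n+2 || id[len(id)-n-1] != '-' {
		return "", time.Time{}, fmt.Errorf("malformed snapshot ID %q", id)
	}
	t, err := time.Parse(stampLayout, id[len(id)-n:])
	if err != nil {
		return "", time.Time{}, fmt.Errorf("malformed snapshot ID %q: %w", id, err)
	}
	return id[:len(id)-n-1], t, nil
}

func main() {
	host, ts, err := ParseSnapshotID("laptop-20240115-143052Z")
	if err != nil {
		panic(err)
	}
	fmt.Println(host, ts.Format(time.RFC3339))
	fmt.Println(FormatSnapshotID("web-01", ts))
}
```
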
### Files in Each Snapshot Directory

#### `db.zst.age` - Encrypted Database Dump
- **What it contains**: Complete SQLite database dump for this snapshot
- **Format**: SQL dump → Zstandard compressed → Age encrypted
- **Encryption**: Encrypted with Age
- **Purpose**: Contains full file metadata, chunk mappings, and all relationships
- **Why encrypted**: Contains sensitive metadata like file paths, permissions, and ownership

#### `manifest.json.zst` - Unencrypted Blob Manifest
- **What it contains**: JSON list of all blob hashes referenced by this snapshot
- **Format**: JSON → Zstandard compressed (NOT encrypted)
- **Encryption**: NOT encrypted
- **Purpose**: Enables pruning operations without requiring decryption keys
- **Structure**:
  ```json
  {
    "snapshot_id": "laptop-20240115-143052Z",
    "timestamp": "2024-01-15T14:30:52Z",
    "blob_count": 42,
    "blobs": [
      "cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678",
      "deadbeef1234567890abcdef1234567890abcdef1234567890abcdef12345678",
      ...
    ]
  }
  ```

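A reader of the manifest could model the JSON structure with a struct like the one below. This is a sketch only: the `Manifest` type and `DecodeManifest` are hypothetical, the `blob_count` consistency check is an assumption rather than documented behavior, and the input is taken to be already-decompressed JSON (Zstandard is not in Go's standard library, so that step is left out).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Manifest mirrors the JSON fields shown above (hypothetical type).
type Manifest struct {
	SnapshotID string   `json:"snapshot_id"`
	Timestamp  string   `json:"timestamp"`
	BlobCount  int      `json:"blob_count"`
	Blobs      []string `json:"blobs"`
}

// DecodeManifest parses decompressed manifest JSON and checks that
// blob_count matches the number of listed hashes (an assumed sanity check).
func DecodeManifest(data []byte) (Manifest, error) {
	var m Manifest
	if err := json.Unmarshal(data, &m); err != nil {
		return Manifest{}, fmt.Errorf("decoding manifest: %w", err)
	}
	if m.BlobCount != len(m.Blobs) {
		return Manifest{}, fmt.Errorf("manifest %s: blob_count is %d but %d blobs are listed",
			m.SnapshotID, m.BlobCount, len(m.Blobs))
	}
	return m, nil
}

func main() {
	raw := []byte(`{
		"snapshot_id": "laptop-20240115-143052Z",
		"timestamp": "2024-01-15T14:30:52Z",
		"blob_count": 1,
		"blobs": ["cafebabe1234567890abcdef1234567890abcdef1234567890abcdef12345678"]
	}`)
	m, err := DecodeManifest(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.SnapshotID, len(m.Blobs))
}
```
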
### Why Manifest is Unencrypted
The manifest must be readable without the private key to enable:
1. **Pruning operations** - Identifying unreferenced blobs for deletion
2. **Storage analysis** - Understanding space usage without decryption
3. **Verification** - Checking blob existence without decryption
4. **Cross-snapshot deduplication analysis** - Finding shared blobs between snapshots

Apart from the snapshot ID, timestamp, and blob count, the manifest contains only blob hashes. It holds no file names or other sensitive file metadata.

## Security Considerations

### What's Encrypted
- **All file content** (in blobs)
- **All file metadata** (paths, permissions, timestamps, ownership in db.zst.age)
- **File-to-chunk mappings** (in db.zst.age)

### What's Not Encrypted
- **Blob hashes** (in manifest.json.zst)
- **Snapshot IDs** (directory names)
- **Blob count per snapshot** (in manifest.json.zst)

### Privacy Implications
From the unencrypted data, an observer can determine:
- When backups were taken (from snapshot IDs)
- Which hostname created backups (from snapshot IDs)
- How many blobs each snapshot references
- Which blobs are shared between snapshots (deduplication patterns)
- The size of each encrypted blob

An observer cannot determine:
- File names or paths
- File contents
- File permissions or ownership
- Directory structure
- Which chunks belong to which files

## Consistency Guarantees

1. **Blobs are immutable** - Once written, a blob is never modified
2. **Blobs are written before metadata** - A snapshot's metadata is only written after all its blobs are successfully uploaded
3. **Metadata is written atomically** - Both db.zst.age and manifest.json.zst are written as complete files
4. **Snapshots are marked complete in local DB only after metadata upload** - Ensures consistency between local and remote state

## Pruning Safety

The prune operation is safe because:
1. It only deletes blobs not referenced in any manifest
2. Manifests are unencrypted and can be read without keys
3. The operation compares the latest local DB snapshot with the latest S3 snapshot to ensure consistency
4. Pruning fails if the two do not match, preventing accidental deletion of needed blobs

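The first point amounts to a set difference between the blobs present in storage and the union of all manifest blob lists. The sketch below shows that computation on plain string slices; `UnreferencedBlobs` is a hypothetical helper, and listing the bucket and reading manifests are assumed to happen elsewhere.

```go
package main

import (
	"fmt"
	"sort"
)

// UnreferencedBlobs returns, in sorted order, the stored blob hashes that
// do not appear in any manifest's blob list.
func UnreferencedBlobs(stored []string, manifests [][]string) []string {
	// Build the union of all hashes referenced by any snapshot.
	referenced := make(map[string]struct{})
	for _, blobs := range manifests {
		for _, h := range blobs {
			referenced[h] = struct{}{}
		}
	}

	var orphans []string
	for _, h := range stored {
		if _, ok := referenced[h]; !ok {
			orphans = append(orphans, h)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	stored := []string{"cafebabe", "deadbeef", "0badf00d"}
	manifests := [][]string{
		{"cafebabe"},
		{"cafebabe", "deadbeef"},
	}
	fmt.Println(UnreferencedBlobs(stored, manifests))
}
```

A blob shared by several snapshots stays referenced until the last manifest listing it is gone, so removing one snapshot never orphans blobs that another snapshot still needs.
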
## Restoration Requirements

To restore from a backup, you need:
1. **The Age private key** - To decrypt blobs and database
2. **The snapshot metadata** - Both files from the snapshot's metadata directory
3. **All referenced blobs** - As listed in the manifest

The restoration process:
1. Download and decrypt the database dump to understand file structure
2. Download and decrypt the required blobs
3. Reconstruct files from their chunks
4. Restore file metadata (permissions, timestamps, etc.)