diff --git a/AGENTS.md b/AGENTS.md index 3ae7952..102f70c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -38,10 +38,9 @@ Version: 2025-06-08 1. Before committing, tests must pass (`make test`), linting must pass (`make lint`), and code must be formatted (`make fmt`). For go, those makefile targets should use `go fmt` and `go test -v ./...` and - `golangci-lint run`. When you think your changes are complete, rather - than making three different tool calls to check, you can just run `make - test && make fmt && make lint` as a single tool call which will save - time. + `golangci-lint run`. Each Makefile target does exactly one thing — to + run lint + fmt-check + test together (the standard pre-commit gate), + use `make check`. 2. Always write a `Makefile` with the default target being `test`, and with a `fmt` target that formats the code. The `test` target should run all @@ -103,3 +102,9 @@ Version: 2025-06-08 build files are acceptable in the root, but source code and other files should be organized in appropriate subdirectories. +13. Pre-1.0: NEVER write database migrations. There are no live databases + anywhere — every user's local index can be rebuilt from a fresh full + backup. When the schema changes, just change `schema.sql` (and any code + that touches the affected tables). The local index is disposable until + 1.0 ships and is tagged. + diff --git a/PROCESS.md b/PROCESS.md deleted file mode 100644 index 356b90e..0000000 --- a/PROCESS.md +++ /dev/null @@ -1,556 +0,0 @@ -# Vaultik Snapshot Creation Process - -This document describes the lifecycle of objects during snapshot creation, with a focus on database transactions and foreign key constraints. - -## Database Schema Overview - -### Tables and Foreign Key Dependencies - -``` -┌─────────────────────────────────────────────────────────────────────────┐ -│ FOREIGN KEY GRAPH │ -│ │ -│ snapshots ◄────── snapshot_files ────────► files │ -│ │ │ │ -│ └───────── snapshot_blobs ────────► blobs │ │ -│ │ │ │ -│ │ ├──► file_chunks ◄── chunks│ -│ │ │ ▲ │ -│ │ └──► chunk_files ────┘ │ -│ │ │ -│ └──► blob_chunks ─────────────┘│ -│ │ -│ uploads ───────► blobs.blob_hash │ -│ └──────────► snapshots.id │ -└─────────────────────────────────────────────────────────────────────────┘ -``` - -### Critical Constraint: `chunks` Must Exist First - -These tables reference `chunks.chunk_hash` **without CASCADE**: -- `file_chunks.chunk_hash` → `chunks.chunk_hash` -- `chunk_files.chunk_hash` → `chunks.chunk_hash` -- `blob_chunks.chunk_hash` → `chunks.chunk_hash` - -**Implication**: A chunk record MUST be committed to the database BEFORE any of these referencing records can be created. - -### Order of Operations Required by Schema - -``` -1. snapshots (created first, before scan) -2. blobs (created when packer starts new blob) -3. chunks (created during file processing) -4. blob_chunks (created immediately after chunk added to packer) -5. files (created after file fully chunked) -6. file_chunks (created with file record) -7. chunk_files (created with file record) -8. snapshot_files (created with file record) -9. snapshot_blobs (created after blob uploaded) -10. uploads (created after blob uploaded) -``` - ---- - -## Snapshot Creation Phases - -### Phase 0: Initialization - -**Actions:** -1. Snapshot record created in database (Transaction T0) -2. Known files loaded into memory from `files` table -3. Known chunks loaded into memory from `chunks` table - -**Transactions:** -``` -T0: INSERT INTO snapshots (id, hostname, ...) VALUES (...) - COMMIT -``` - ---- - -### Phase 1: Scan Directory - -**Actions:** -1. Walk filesystem directory tree -2. For each file, compare against in-memory `knownFiles` map -3. Classify files as: unchanged, new, or modified -4. Collect unchanged file IDs for later association -5. Collect new/modified files for processing - -**Transactions:** -``` -(None during scan - all in-memory) -``` - ---- - -### Phase 1b: Associate Unchanged Files - -**Actions:** -1. For unchanged files, add entries to `snapshot_files` table -2. Done in batches of 1000 - -**Transactions:** -``` -For each batch of 1000 file IDs: - T: BEGIN - INSERT INTO snapshot_files (snapshot_id, file_id) VALUES (?, ?) - ... (up to 1000 inserts) - COMMIT -``` - ---- - -### Phase 2: Process Files - -For each file that needs processing: - -#### Step 2a: Open and Chunk File - -**Location:** `processFileStreaming()` - -For each chunk produced by content-defined chunking: - -##### Step 2a-1: Check Chunk Existence -```go -chunkExists := s.chunkExists(chunk.Hash) // In-memory lookup -``` - -##### Step 2a-2: Create Chunk Record (if new) -```go -// TRANSACTION: Create chunk in database -err := s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error { - dbChunk := &database.Chunk{ChunkHash: chunk.Hash, Size: chunk.Size} - return s.repos.Chunks.Create(txCtx, tx, dbChunk) -}) -// COMMIT immediately after WithTx returns - -// Update in-memory cache -s.addKnownChunk(chunk.Hash) -``` - -**Transaction:** -``` -T_chunk: BEGIN - INSERT INTO chunks (chunk_hash, size) VALUES (?, ?) - COMMIT -``` - -##### Step 2a-3: Add Chunk to Packer - -```go -s.packer.AddChunk(&blob.ChunkRef{Hash: chunk.Hash, Data: chunk.Data}) -``` - -**Inside packer.AddChunk → addChunkToCurrentBlob():** - -```go -// TRANSACTION: Create blob_chunks record IMMEDIATELY -if p.repos != nil { - blobChunk := &database.BlobChunk{ - BlobID: p.currentBlob.id, - ChunkHash: chunk.Hash, - Offset: offset, - Length: chunkSize, - } - err := p.repos.WithTx(context.Background(), func(ctx context.Context, tx *sql.Tx) error { - return p.repos.BlobChunks.Create(ctx, tx, blobChunk) - }) - // COMMIT immediately -} -``` - -**Transaction:** -``` -T_blob_chunk: BEGIN - INSERT INTO blob_chunks (blob_id, chunk_hash, offset, length) VALUES (?, ?, ?, ?) - COMMIT -``` - -**⚠️ CRITICAL DEPENDENCY**: This transaction requires `chunks.chunk_hash` to exist (FK constraint). -The chunk MUST be committed in Step 2a-2 BEFORE this can succeed. - ---- - -#### Step 2b: Blob Size Limit Handling - -If adding a chunk would exceed blob size limit: - -```go -if err == blob.ErrBlobSizeLimitExceeded { - if err := s.packer.FinalizeBlob(); err != nil { ... } - // Retry adding the chunk - if err := s.packer.AddChunk(...); err != nil { ... } -} -``` - -**FinalizeBlob() transactions:** -``` -T_blob_finish: BEGIN - UPDATE blobs SET blob_hash=?, uncompressed_size=?, compressed_size=?, finished_ts=? WHERE id=? - COMMIT -``` - -Then blob handler is called (handleBlobReady): -``` -(Upload to S3 - no transaction) - -T_blob_uploaded: BEGIN - UPDATE blobs SET uploaded_ts=? WHERE id=? - INSERT INTO snapshot_blobs (snapshot_id, blob_id, blob_hash) VALUES (?, ?, ?) - INSERT INTO uploads (blob_hash, snapshot_id, uploaded_at, size, duration_ms) VALUES (?, ?, ?, ?, ?) - COMMIT -``` - ---- - -#### Step 2c: Queue File for Batch Insertion - -After all chunks for a file are processed: - -```go -// Build file data (in-memory, no DB) -fileChunks := make([]database.FileChunk, len(chunks)) -chunkFiles := make([]database.ChunkFile, len(chunks)) - -// Queue for batch insertion -return s.addPendingFile(ctx, pendingFileData{ - file: fileToProcess.File, - fileChunks: fileChunks, - chunkFiles: chunkFiles, -}) -``` - -**No transaction yet** - just adds to `pendingFiles` slice. - -If `len(pendingFiles) >= fileBatchSize (100)`, triggers `flushPendingFiles()`. - ---- - -### Step 2d: Flush Pending Files - -**Location:** `flushPendingFiles()` - called when batch is full or at end of processing - -```go -return s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error { - for _, data := range files { - // 1. Create file record - s.repos.Files.Create(txCtx, tx, data.file) // INSERT OR REPLACE - - // 2. Delete old associations - s.repos.FileChunks.DeleteByFileID(txCtx, tx, data.file.ID) - s.repos.ChunkFiles.DeleteByFileID(txCtx, tx, data.file.ID) - - // 3. Create file_chunks records - for _, fc := range data.fileChunks { - s.repos.FileChunks.Create(txCtx, tx, &fc) // FK: chunks.chunk_hash - } - - // 4. Create chunk_files records - for _, cf := range data.chunkFiles { - s.repos.ChunkFiles.Create(txCtx, tx, &cf) // FK: chunks.chunk_hash - } - - // 5. Add file to snapshot - s.repos.Snapshots.AddFileByID(txCtx, tx, s.snapshotID, data.file.ID) - } - return nil -}) -// COMMIT (all or nothing for the batch) -``` - -**Transaction:** -``` -T_files_batch: BEGIN - -- For each file in batch: - INSERT OR REPLACE INTO files (...) VALUES (...) - DELETE FROM file_chunks WHERE file_id = ? - DELETE FROM chunk_files WHERE file_id = ? - INSERT INTO file_chunks (file_id, idx, chunk_hash) VALUES (?, ?, ?) -- FK: chunks - INSERT INTO chunk_files (chunk_hash, file_id, ...) VALUES (?, ?, ...) -- FK: chunks - INSERT INTO snapshot_files (snapshot_id, file_id) VALUES (?, ?) - -- Repeat for each file - COMMIT -``` - -**⚠️ CRITICAL DEPENDENCY**: `file_chunks` and `chunk_files` require `chunks.chunk_hash` to exist. - ---- - -### Phase 2 End: Final Flush - -```go -// Flush any remaining pending files -if err := s.flushAllPending(ctx); err != nil { ... } - -// Final packer flush -s.packer.Flush() -``` - ---- - -## The Current Bug - -### Problem - -The current code attempts to batch file insertions, but `file_chunks` and `chunk_files` have foreign keys to `chunks.chunk_hash`. The batched file flush tries to insert these records, but if the chunks haven't been committed yet, the FK constraint fails. - -### Why It's Happening - -Looking at the sequence: - -1. Process file A, chunk X -2. Create chunk X in DB (Transaction commits) -3. Add chunk X to packer -4. Packer creates blob_chunks for chunk X (needs chunk X - OK, committed in step 2) -5. Queue file A with chunk references -6. Process file B, chunk Y -7. Create chunk Y in DB (Transaction commits) -8. ... etc ... -9. At end: flushPendingFiles() -10. Insert file_chunks for file A referencing chunk X (chunk X committed - should work) - -The chunks ARE being created individually. But something is going wrong. - -### Actual Issue - -Wait - let me re-read the code. The issue is: - -In `processFileStreaming`, when we queue file data: -```go -fileChunks[i] = database.FileChunk{ - FileID: fileToProcess.File.ID, - Idx: ci.fileChunk.Idx, - ChunkHash: ci.fileChunk.ChunkHash, -} -``` - -The `FileID` is set, but `fileToProcess.File.ID` might be empty at this point because the file record hasn't been created yet! - -Looking at `checkFileInMemory`: -```go -// For new files: -if !exists { - return file, true // file.ID is empty string! -} - -// For existing files: -file.ID = existingFile.ID // Reuse existing ID -``` - -**For NEW files, `file.ID` is empty!** - -Then in `flushPendingFiles`: -```go -s.repos.Files.Create(txCtx, tx, data.file) // This generates/uses the ID -``` - -But `data.fileChunks` was built with the EMPTY ID! - -### The Real Problem - -For new files: -1. `checkFileInMemory` creates file record with empty ID -2. `processFileStreaming` queues file_chunks with empty `FileID` -3. `flushPendingFiles` creates file (generates ID), but file_chunks still have empty `FileID` - -Wait, but `Files.Create` should be INSERT OR REPLACE by path, and the file struct should get updated... Let me check. - -Actually, looking more carefully at the code path - the file IS created first in the flush, but the `fileChunks` slice was already built with the old (possibly empty) ID. The ID isn't updated after the file is created. - -Hmm, but looking at the current code: -```go -fileChunks[i] = database.FileChunk{ - FileID: fileToProcess.File.ID, // This uses the ID from the File struct -``` - -And in `checkFileInMemory` for new files, we create a file struct but don't set the ID. However, looking at the database repository, `Files.Create` should be doing `INSERT OR REPLACE` and the ID should be pre-generated... - -Let me check if IDs are being generated. Looking at the File struct usage, it seems like UUIDs should be generated somewhere... - -Actually, looking at the test failures again: -``` -creating file chunk: inserting file_chunk: constraint failed: FOREIGN KEY constraint failed (787) -``` - -Error 787 is SQLite's foreign key constraint error. The failing FK is on `file_chunks.chunk_hash → chunks.chunk_hash`. - -So the chunks ARE NOT in the database when we try to insert file_chunks. Let me trace through more carefully... - ---- - -## Transaction Timing Issue - -The problem is transaction visibility in SQLite. - -Each `WithTx` creates a new transaction that commits at the end. But with batched file insertion: - -1. Chunk transactions commit one at a time -2. File batch transaction runs later - -If chunks are being inserted but something goes wrong with transaction isolation, the file batch might not see them. - -But actually SQLite in WAL mode should have SERIALIZABLE isolation by default, so committed transactions should be visible. - -Let me check if the in-memory cache is masking a database problem... - -Actually, wait. Let me re-check the current broken code more carefully. The issue might be simpler. - ---- - -## Current Code Flow Analysis - -Looking at `processFileStreaming` in the current broken state: - -```go -// For each chunk: -if !chunkExists { - err := s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error { - dbChunk := &database.Chunk{ChunkHash: chunk.Hash, Size: chunk.Size} - return s.repos.Chunks.Create(txCtx, tx, dbChunk) - }) - // ... check error ... - s.addKnownChunk(chunk.Hash) -} - -// ... add to packer (creates blob_chunks) ... - -// Collect chunk info for file -chunks = append(chunks, chunkInfo{...}) -``` - -Then at end of function: -```go -// Queue file for batch insertion -return s.addPendingFile(ctx, pendingFileData{ - file: fileToProcess.File, - fileChunks: fileChunks, - chunkFiles: chunkFiles, -}) -``` - -At end of `processPhase`: -```go -if err := s.flushAllPending(ctx); err != nil { ... } -``` - -The chunks are being created one-by-one with individual transactions. By the time `flushPendingFiles` runs, all chunk transactions should have committed. - -Unless... there's a bug in how the chunks are being referenced. Let me check if the chunk_hash values are correct. - -Or... maybe the test database is being recreated between operations somehow? - -Actually, let me check the test setup. Maybe the issue is specific to the test environment. - ---- - -## Summary of Object Lifecycle - -| Object | When Created | Transaction | Dependencies | -|--------|--------------|-------------|--------------| -| snapshot | Before scan | Individual tx | None | -| blob | When packer needs new blob | Individual tx | None | -| chunk | During file chunking (each chunk) | Individual tx | None | -| blob_chunks | Immediately after adding chunk to packer | Individual tx | chunks, blobs | -| files | Batched at end of processing | Batch tx | None | -| file_chunks | With file (batched) | Batch tx | files, chunks | -| chunk_files | With file (batched) | Batch tx | files, chunks | -| snapshot_files | With file (batched) | Batch tx | snapshots, files | -| snapshot_blobs | After blob upload | Individual tx | snapshots, blobs | -| uploads | After blob upload | Same tx as snapshot_blobs | blobs, snapshots | - ---- - -## Root Cause Analysis - -After detailed analysis, I believe the issue is one of the following: - -### Hypothesis 1: File ID Not Set - -Looking at `checkFileInMemory()` for NEW files: -```go -if !exists { - return file, true // file.ID is empty string! -} -``` - -For new files, `file.ID` is empty. Then in `processFileStreaming`: -```go -fileChunks[i] = database.FileChunk{ - FileID: fileToProcess.File.ID, // Empty for new files! - ... -} -``` - -The `FileID` in the built `fileChunks` slice is empty. - -Then in `flushPendingFiles`: -```go -s.repos.Files.Create(txCtx, tx, data.file) // This generates the ID -// But data.fileChunks still has empty FileID! -for i := range data.fileChunks { - s.repos.FileChunks.Create(...) // Uses empty FileID -} -``` - -**Solution**: Generate file IDs upfront in `checkFileInMemory()`: -```go -file := &database.File{ - ID: uuid.New().String(), // Generate ID immediately - Path: path, - ... -} -``` - -### Hypothesis 2: Transaction Isolation - -SQLite with a single connection pool (`MaxOpenConns(1)`) should serialize all transactions. Committed data should be visible to subsequent transactions. - -However, there might be a subtle issue with how `context.Background()` is used in the packer vs the scanner's context. - -## Recommended Fix - -**Step 1: Generate file IDs upfront** - -In `checkFileInMemory()`, generate the UUID for new files immediately: -```go -file := &database.File{ - ID: uuid.New().String(), // Always generate ID - Path: path, - ... -} -``` - -This ensures `file.ID` is set when building `fileChunks` and `chunkFiles` slices. - -**Step 2: Verify by reverting to per-file transactions** - -If Step 1 doesn't fix it, revert to non-batched file insertion to isolate the issue: - -```go -// Instead of queuing: -// return s.addPendingFile(ctx, pendingFileData{...}) - -// Do immediate insertion: -return s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error { - // Create file - s.repos.Files.Create(txCtx, tx, fileToProcess.File) - // Delete old associations - s.repos.FileChunks.DeleteByFileID(...) - s.repos.ChunkFiles.DeleteByFileID(...) - // Create new associations - for _, fc := range fileChunks { - s.repos.FileChunks.Create(...) - } - for _, cf := range chunkFiles { - s.repos.ChunkFiles.Create(...) - } - // Add to snapshot - s.repos.Snapshots.AddFileByID(...) - return nil -}) -``` - -**Step 3: If batching is still desired** - -After confirming per-file transactions work, re-implement batching with the ID fix in place, and add debug logging to trace exactly which chunk_hash is failing and why. diff --git a/README.md b/README.md index 4a5fd80..92dfedf 100644 --- a/README.md +++ b/README.md @@ -147,13 +147,11 @@ passphrase is needed or stored locally. vaultik [--config ] snapshot create [snapshot-names...] [--cron] [--prune] [--skip-errors] vaultik [--config ] snapshot list [--json] vaultik [--config ] snapshot verify [--deep] [--json] -vaultik [--config ] snapshot purge [--keep-latest | --older-than ] [--force] +vaultik [--config ] snapshot purge [--keep-latest | --older-than ] [--snapshot ...] [--force] vaultik [--config ] snapshot remove [--dry-run] [--force] [--remote] [--json] vaultik [--config ] snapshot prune vaultik [--config ] restore [paths...] [--verify] vaultik [--config ] prune [--force] [--json] -vaultik [--config ] purge [--keep-latest | --older-than ] [--force] -vaultik [--config ] verify [--deep] [--json] vaultik [--config ] info vaultik [--config ] remote info [--json] vaultik [--config ] store info @@ -172,7 +170,8 @@ vaultik version * Config is located at `/etc/vaultik/config.yml` by default * Optional snapshot names argument to create specific snapshots (default: all) * `--cron`: Silent unless error (for crontab) -* `--prune`: Delete old snapshots and orphaned blobs after backup +* `--prune`: After backup, drop older snapshots of each backed-up name (keeping + only the latest) and remove orphaned blobs from remote storage * `--skip-errors`: Skip file read errors (log them loudly but continue) **snapshot list**: List all snapshots with their timestamps and sizes @@ -181,9 +180,12 @@ vaultik version **snapshot verify**: Verify snapshot integrity * `--deep`: Download and verify blob contents (not just existence) -**snapshot purge**: Remove old snapshots based on criteria -* `--keep-latest`: Keep only the most recent snapshot +**snapshot purge**: Remove old snapshots based on criteria. Retention is +applied per-snapshot-name (e.g. `--keep-latest` keeps the latest of each +configured name, not the latest globally). +* `--keep-latest`: Keep only the most recent snapshot of each name * `--older-than`: Remove snapshots older than duration (e.g., 30d, 6mo, 1y) +* `--snapshot `: Restrict to specific snapshot names (repeat for multiple) * `--force`: Skip confirmation prompt **snapshot remove**: Remove a specific snapshot diff --git a/docs/DATAMODEL.md b/docs/DATAMODEL.md index 71d4b08..37f9480 100644 --- a/docs/DATAMODEL.md +++ b/docs/DATAMODEL.md @@ -5,8 +5,14 @@ Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication. **Important Notes:** -- **No Migration Support**: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup. -- **Version Compatibility**: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3. +- **No Migration Support (pre-1.0)**: Vaultik does not support database schema + migrations. The local index is treated as disposable — if the schema changes, + delete the local SQLite database (`vaultik database purge`) and run a full + backup. The remote storage is unaffected; the new index will re-deduplicate + against existing remote blobs. +- **Version Compatibility**: In rare cases, you may need to use the same version + of Vaultik to restore a backup as was used to create it. This ensures + compatibility with the metadata format stored in S3. ## Database Tables diff --git a/docs/REPOSTRUCTURE.md b/docs/REPOSTRUCTURE.md index e6527f3..4777f1a 100644 --- a/docs/REPOSTRUCTURE.md +++ b/docs/REPOSTRUCTURE.md @@ -43,18 +43,19 @@ Blobs contain the actual file data from backups and must be encrypted for securi Each snapshot has its own subdirectory named with the snapshot ID. ### Snapshot ID Format -- **Format**: `--` -- **Example**: `laptop-20240115-143052Z` +- **Format**: `__` (or `_` if no + name was specified) +- **Example**: `laptop_home_2024-01-15T14:30:52Z` - **Components**: - - Hostname (may contain hyphens) - - Date in YYYYMMDD format - - Time in HHMMSSZ format (Z indicates UTC) + - Short hostname (everything before the first dot is stripped from the FQDN) + - Snapshot name from the configured `snapshots:` map (optional) + - RFC3339 UTC timestamp ### Files in Each Snapshot Directory -#### `db.zst.age` - Encrypted Database Dump -- **What it contains**: Complete SQLite database dump for this snapshot -- **Format**: SQL dump → Zstandard compressed → Age encrypted +#### `db.zst.age` - Encrypted Database +- **What it contains**: Pruned binary SQLite database for this snapshot +- **Format**: Binary SQLite → Zstandard compressed → Age encrypted - **Encryption**: Encrypted with Age - **Purpose**: Contains full file metadata, chunk mappings, and all relationships - **Why encrypted**: Contains sensitive metadata like file paths, permissions, and ownership @@ -67,7 +68,7 @@ Each snapshot has its own subdirectory named with the snapshot ID. - **Structure**: ```json { - "snapshot_id": "laptop-20240115-143052Z", + "snapshot_id": "laptop_home_2024-01-15T14:30:52Z", "timestamp": "2024-01-15T14:30:52Z", "blob_count": 42, "blobs": [