Generate file UUIDs upfront in checkFileInMemory() rather than deferring to Files.Create(). This ensures file_chunks and chunk_files records have valid FileID values when constructed during file processing, before the batch insert transaction. Root cause: For new files, file.ID was empty when building the fileChunks and chunkFiles slices. The ID was only generated later in Files.Create(), but by then the slices already had empty FileID values, causing FK constraint failures. Also adds PROCESS.md documenting the snapshot creation lifecycle, database transactions, and FK dependency ordering.
# Vaultik Snapshot Creation Process

This document describes the lifecycle of objects during snapshot creation, with a focus on database transactions and foreign key constraints.

## Database Schema Overview

### Tables and Foreign Key Dependencies

```
FOREIGN KEY GRAPH

snapshots ◄── snapshot_files ──► files
snapshots ◄── snapshot_blobs ──► blobs
files     ◄── file_chunks    ──► chunks
files     ◄── chunk_files    ──► chunks
blobs     ◄── blob_chunks    ──► chunks
uploads ──► blobs.blob_hash
uploads ──► snapshots.id
```

### Critical Constraint: `chunks` Must Exist First

These tables reference `chunks.chunk_hash` **without CASCADE**:

- `file_chunks.chunk_hash` → `chunks.chunk_hash`
- `chunk_files.chunk_hash` → `chunks.chunk_hash`
- `blob_chunks.chunk_hash` → `chunks.chunk_hash`

**Implication**: A chunk record MUST be committed to the database BEFORE any of these referencing records can be created.
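
For reference, a minimal sketch of the kind of DDL this constraint implies. The column lists are abbreviated and the real schema lives in the migrations, so treat the details as assumptions:

```
-- Hypothetical sketch, not the actual migration.
CREATE TABLE chunks (
    chunk_hash TEXT PRIMARY KEY,
    size       INTEGER NOT NULL
);

CREATE TABLE file_chunks (
    file_id    TEXT NOT NULL,
    idx        INTEGER NOT NULL,
    chunk_hash TEXT NOT NULL,
    PRIMARY KEY (file_id, idx),
    -- No ON DELETE CASCADE: the referenced chunk row must already be
    -- committed, or this insert fails with a FK constraint error.
    FOREIGN KEY (chunk_hash) REFERENCES chunks (chunk_hash)
);
```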

### Order of Operations Required by Schema

```
1.  snapshots       (created first, before scan)
2.  blobs           (created when packer starts new blob)
3.  chunks          (created during file processing)
4.  blob_chunks     (created immediately after chunk added to packer)
5.  files           (created after file fully chunked)
6.  file_chunks     (created with file record)
7.  chunk_files     (created with file record)
8.  snapshot_files  (created with file record)
9.  snapshot_blobs  (created after blob uploaded)
10. uploads         (created after blob uploaded)
```

---

## Snapshot Creation Phases

### Phase 0: Initialization

**Actions:**

1. Snapshot record created in database (Transaction T0)
2. Known files loaded into memory from `files` table
3. Known chunks loaded into memory from `chunks` table (see the sketch below)

**Transactions:**

```
T0: INSERT INTO snapshots (id, hostname, ...) VALUES (...)
    COMMIT
```
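
A sketch of what the known-chunk preload plausibly looks like (the repository method name `List` is an assumption, not the confirmed API):

```go
// Hypothetical sketch: preload chunk hashes into an in-memory set so the
// per-chunk existence checks in Phase 2 never have to hit the database.
knownChunks := make(map[string]struct{})
allChunks, err := s.repos.Chunks.List(ctx) // assumed method
if err != nil {
    return fmt.Errorf("loading known chunks: %w", err)
}
for _, c := range allChunks {
    knownChunks[c.ChunkHash] = struct{}{}
}
```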

---

### Phase 1: Scan Directory

**Actions:**

1. Walk filesystem directory tree
2. For each file, compare against in-memory `knownFiles` map
3. Classify files as: unchanged, new, or modified (sketched below)
4. Collect unchanged file IDs for later association
5. Collect new/modified files for processing

**Transactions:**

```
(None during scan - all in-memory)
```
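
A hedged sketch of the classification step (the `knownFiles` lookup and the mtime/size comparison are assumptions based on the action list above):

```go
// Hypothetical sketch: classify one scanned file against the cache.
existing, exists := s.knownFiles[path]
switch {
case !exists:
    toProcess = append(toProcess, path) // new: needs chunking
case existing.MTime != info.ModTime().Unix() || existing.Size != info.Size():
    toProcess = append(toProcess, path) // modified: needs re-chunking
default:
    unchangedIDs = append(unchangedIDs, existing.ID) // unchanged: associate only
}
```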

---

### Phase 1b: Associate Unchanged Files

**Actions:**

1. For unchanged files, add entries to `snapshot_files` table
2. Done in batches of 1000 (see the sketch below)

**Transactions:**

```
For each batch of 1000 file IDs:
T: BEGIN
   INSERT INTO snapshot_files (snapshot_id, file_id) VALUES (?, ?)
   ... (up to 1000 inserts)
   COMMIT
```
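
A minimal sketch of that batching loop, reusing `Snapshots.AddFileByID` from the flush path below (the loop shape itself is an assumption):

```go
// Hypothetical sketch: associate unchanged files in batches of 1000.
const assocBatchSize = 1000
for start := 0; start < len(unchangedIDs); start += assocBatchSize {
    end := min(start+assocBatchSize, len(unchangedIDs))
    batch := unchangedIDs[start:end]
    err := s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error {
        for _, fileID := range batch {
            if err := s.repos.Snapshots.AddFileByID(txCtx, tx, s.snapshotID, fileID); err != nil {
                return err
            }
        }
        return nil
    })
    if err != nil {
        return fmt.Errorf("associating unchanged files: %w", err)
    }
}
```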

---

### Phase 2: Process Files

For each file that needs processing:

#### Step 2a: Open and Chunk File

**Location:** `processFileStreaming()`

For each chunk produced by content-defined chunking:

##### Step 2a-1: Check Chunk Existence

```go
chunkExists := s.chunkExists(chunk.Hash) // In-memory lookup
```

##### Step 2a-2: Create Chunk Record (if new)

```go
// TRANSACTION: Create chunk in database
err := s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error {
    dbChunk := &database.Chunk{ChunkHash: chunk.Hash, Size: chunk.Size}
    return s.repos.Chunks.Create(txCtx, tx, dbChunk)
})
// COMMIT immediately after WithTx returns

// Update in-memory cache
s.addKnownChunk(chunk.Hash)
```

**Transaction:**

```
T_chunk: BEGIN
         INSERT INTO chunks (chunk_hash, size) VALUES (?, ?)
         COMMIT
```

##### Step 2a-3: Add Chunk to Packer

```go
s.packer.AddChunk(&blob.ChunkRef{Hash: chunk.Hash, Data: chunk.Data})
```

**Inside packer.AddChunk → addChunkToCurrentBlob():**

```go
// TRANSACTION: Create blob_chunks record IMMEDIATELY
if p.repos != nil {
    blobChunk := &database.BlobChunk{
        BlobID:    p.currentBlob.id,
        ChunkHash: chunk.Hash,
        Offset:    offset,
        Length:    chunkSize,
    }
    err := p.repos.WithTx(context.Background(), func(ctx context.Context, tx *sql.Tx) error {
        return p.repos.BlobChunks.Create(ctx, tx, blobChunk)
    })
    if err != nil {
        return fmt.Errorf("creating blob_chunk: %w", err)
    }
    // COMMIT happened as soon as WithTx returned
}
```

**Transaction:**

```
T_blob_chunk: BEGIN
              INSERT INTO blob_chunks (blob_id, chunk_hash, offset, length) VALUES (?, ?, ?, ?)
              COMMIT
```

**⚠️ CRITICAL DEPENDENCY**: This transaction requires `chunks.chunk_hash` to exist (FK constraint). The chunk MUST be committed in Step 2a-2 BEFORE this can succeed.
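
If this invariant ever needs to be enforced rather than assumed, a cheap guard at the call site could turn the cryptic FK failure into a clear error (a hypothetical defensive check, not present in the code):

```go
// Hypothetical guard before handing the chunk to the packer.
if !s.chunkExists(chunk.Hash) {
    return fmt.Errorf("invariant violated: chunk %s not committed before packing", chunk.Hash)
}
```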

---

#### Step 2b: Blob Size Limit Handling

If adding a chunk would exceed the blob size limit:

```go
if err == blob.ErrBlobSizeLimitExceeded {
    if err := s.packer.FinalizeBlob(); err != nil { ... }
    // Retry adding the chunk
    if err := s.packer.AddChunk(...); err != nil { ... }
}
```

**FinalizeBlob() transactions:**

```
T_blob_finish: BEGIN
               UPDATE blobs SET blob_hash=?, uncompressed_size=?, compressed_size=?, finished_ts=? WHERE id=?
               COMMIT
```

Then the blob handler (`handleBlobReady`) is called:

```
(Upload to S3 - no transaction)

T_blob_uploaded: BEGIN
                 UPDATE blobs SET uploaded_ts=? WHERE id=?
                 INSERT INTO snapshot_blobs (snapshot_id, blob_id, blob_hash) VALUES (?, ?, ?)
                 INSERT INTO uploads (blob_hash, snapshot_id, uploaded_at, size, duration_ms) VALUES (?, ?, ?, ?, ?)
                 COMMIT
```

---

#### Step 2c: Queue File for Batch Insertion

After all chunks for a file are processed:

```go
// Build file data (in-memory, no DB)
fileChunks := make([]database.FileChunk, len(chunks))
chunkFiles := make([]database.ChunkFile, len(chunks))

// Queue for batch insertion
return s.addPendingFile(ctx, pendingFileData{
    file:       fileToProcess.File,
    fileChunks: fileChunks,
    chunkFiles: chunkFiles,
})
```

**No transaction yet** - this just appends to the `pendingFiles` slice.

If `len(pendingFiles) >= fileBatchSize` (100), this triggers `flushPendingFiles()`; a sketch of the mechanism follows.
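
A hedged sketch of what the queueing helper plausibly looks like (the receiver type and field names are assumptions based on the calls above):

```go
// Hypothetical sketch of the queue-then-flush mechanism.
const fileBatchSize = 100

func (s *Scanner) addPendingFile(ctx context.Context, data pendingFileData) error {
    s.pendingFiles = append(s.pendingFiles, data)
    if len(s.pendingFiles) >= fileBatchSize {
        return s.flushPendingFiles(ctx)
    }
    return nil
}
```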

---

#### Step 2d: Flush Pending Files

**Location:** `flushPendingFiles()` - called when the batch is full or at the end of processing

```go
return s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error {
    for _, data := range files {
        // 1. Create file record
        s.repos.Files.Create(txCtx, tx, data.file) // INSERT OR REPLACE

        // 2. Delete old associations
        s.repos.FileChunks.DeleteByFileID(txCtx, tx, data.file.ID)
        s.repos.ChunkFiles.DeleteByFileID(txCtx, tx, data.file.ID)

        // 3. Create file_chunks records
        for _, fc := range data.fileChunks {
            s.repos.FileChunks.Create(txCtx, tx, &fc) // FK: chunks.chunk_hash
        }

        // 4. Create chunk_files records
        for _, cf := range data.chunkFiles {
            s.repos.ChunkFiles.Create(txCtx, tx, &cf) // FK: chunks.chunk_hash
        }

        // 5. Add file to snapshot
        s.repos.Snapshots.AddFileByID(txCtx, tx, s.snapshotID, data.file.ID)
    }
    return nil
})
// COMMIT (all or nothing for the batch)
```

**Transaction:**

```
T_files_batch: BEGIN
    -- For each file in batch:
    INSERT OR REPLACE INTO files (...) VALUES (...)
    DELETE FROM file_chunks WHERE file_id = ?
    DELETE FROM chunk_files WHERE file_id = ?
    INSERT INTO file_chunks (file_id, idx, chunk_hash) VALUES (?, ?, ?)   -- FK: chunks
    INSERT INTO chunk_files (chunk_hash, file_id, ...) VALUES (?, ?, ...) -- FK: chunks
    INSERT INTO snapshot_files (snapshot_id, file_id) VALUES (?, ?)
    -- Repeat for each file
    COMMIT
```

**⚠️ CRITICAL DEPENDENCY**: `file_chunks` and `chunk_files` require `chunks.chunk_hash` to exist.

---

### Phase 2 End: Final Flush

```go
// Flush any remaining pending files
if err := s.flushAllPending(ctx); err != nil { ... }

// Final packer flush
s.packer.Flush()
```

---

## The Current Bug

### Problem

The current code attempts to batch file insertions, but `file_chunks` and `chunk_files` have foreign keys to `chunks.chunk_hash`. The batched file flush tries to insert these records, but if the chunks haven't been committed yet, the FK constraint fails.

### Why It's Happening

Looking at the sequence:

1. Process file A, chunk X
2. Create chunk X in DB (transaction commits)
3. Add chunk X to packer
4. Packer creates blob_chunks for chunk X (needs chunk X - OK, committed in step 2)
5. Queue file A with chunk references
6. Process file B, chunk Y
7. Create chunk Y in DB (transaction commits)
8. ... etc ...
9. At end: flushPendingFiles()
10. Insert file_chunks for file A referencing chunk X (chunk X committed - should work)

The chunks ARE being created individually, so the `chunk_hash` FK should be satisfied by the time the flush runs. Something else must be going wrong.

### Actual Issue

Re-reading the code, the problem is in `processFileStreaming`, where we queue file data:

```go
fileChunks[i] = database.FileChunk{
    FileID:    fileToProcess.File.ID,
    Idx:       ci.fileChunk.Idx,
    ChunkHash: ci.fileChunk.ChunkHash,
}
```

The `FileID` field is set, but `fileToProcess.File.ID` might be empty at this point because the file record hasn't been created yet!

Looking at `checkFileInMemory`:

```go
// For new files:
if !exists {
    return file, true // file.ID is empty string!
}

// For existing files:
file.ID = existingFile.ID // Reuse existing ID
```

**For NEW files, `file.ID` is empty!**

Then in `flushPendingFiles`:

```go
s.repos.Files.Create(txCtx, tx, data.file) // This generates/uses the ID
```

But `data.fileChunks` was built with the EMPTY ID!

### The Real Problem

For new files:

1. `checkFileInMemory` creates a file record with an empty ID
2. `processFileStreaming` queues file_chunks with an empty `FileID`
3. `flushPendingFiles` creates the file (generating an ID), but the queued file_chunks still carry the empty `FileID`

One might expect `Files.Create` to be an INSERT OR REPLACE keyed by path that back-fills the file struct's ID, but it doesn't: the file IS created first in the flush, yet the `fileChunks` slice was already built with the old (possibly empty) ID, and that ID is never updated after the file is created.

The current code:

```go
fileChunks[i] = database.FileChunk{
    FileID: fileToProcess.File.ID, // This uses the ID from the File struct
```

And in `checkFileInMemory` for new files, we create a file struct but never set the ID; nothing in this path generates a UUID before the slices are built.

Looking at the test failures:

```
creating file chunk: inserting file_chunk: constraint failed: FOREIGN KEY constraint failed (787)
```

Error 787 is SQLite's extended error code for a foreign key constraint failure (SQLITE_CONSTRAINT_FOREIGNKEY). SQLite does not report which FK failed; `file_chunks` has FKs on both `chunk_hash → chunks.chunk_hash` and `file_id → files.id`, so either the chunk row or the referenced file row is missing when we insert file_chunks. The sections below trace both possibilities.

---

## Transaction Timing Issue

Another candidate explanation is transaction visibility in SQLite.

Each `WithTx` creates a new transaction that commits at the end. But with batched file insertion:

1. Chunk transactions commit one at a time
2. The file batch transaction runs later

If something went wrong with transaction isolation, the file batch might not see the committed chunks. In practice, though, SQLite in WAL mode gives each transaction a consistent snapshot, and anything committed before a transaction starts is visible to it, so committed chunk rows should be seen by the later batch. The remaining suspect on this theory would be the in-memory chunk cache masking a database problem (e.g. the cache reports a chunk as existing when its insert actually failed).

The issue might also be simpler than isolation; the next section re-checks the current code flow.

---

## Current Code Flow Analysis

Looking at `processFileStreaming` in the current broken state:

```go
// For each chunk:
if !chunkExists {
    err := s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error {
        dbChunk := &database.Chunk{ChunkHash: chunk.Hash, Size: chunk.Size}
        return s.repos.Chunks.Create(txCtx, tx, dbChunk)
    })
    // ... check error ...
    s.addKnownChunk(chunk.Hash)
}

// ... add to packer (creates blob_chunks) ...

// Collect chunk info for file
chunks = append(chunks, chunkInfo{...})
```

Then at the end of the function:

```go
// Queue file for batch insertion
return s.addPendingFile(ctx, pendingFileData{
    file:       fileToProcess.File,
    fileChunks: fileChunks,
    chunkFiles: chunkFiles,
})
```

And at the end of `processPhase`:

```go
if err := s.flushAllPending(ctx); err != nil { ... }
```

The chunks are created one-by-one in individual transactions, so by the time `flushPendingFiles` runs, all chunk transactions have committed. What remains to rule out: the chunk_hash values being referenced are wrong, the test database is being recreated between operations, or the failure is specific to the test environment.

---

## Summary of Object Lifecycle

| Object | When Created | Transaction | Dependencies |
|--------|--------------|-------------|--------------|
| snapshot | Before scan | Individual tx | None |
| blob | When packer needs new blob | Individual tx | None |
| chunk | During file chunking (each chunk) | Individual tx | None |
| blob_chunks | Immediately after adding chunk to packer | Individual tx | chunks, blobs |
| files | Batched at end of processing | Batch tx | None |
| file_chunks | With file (batched) | Batch tx | files, chunks |
| chunk_files | With file (batched) | Batch tx | files, chunks |
| snapshot_files | With file (batched) | Batch tx | snapshots, files |
| snapshot_blobs | After blob upload | Individual tx | snapshots, blobs |
| uploads | After blob upload | Same tx as snapshot_blobs | blobs, snapshots |

---

## Root Cause Analysis

After detailed analysis, I believe the issue is one of the following:

### Hypothesis 1: File ID Not Set

Looking at `checkFileInMemory()` for NEW files:

```go
if !exists {
    return file, true // file.ID is empty string!
}
```

For new files, `file.ID` is empty. Then in `processFileStreaming`:

```go
fileChunks[i] = database.FileChunk{
    FileID: fileToProcess.File.ID, // Empty for new files!
    ...
}
```

The `FileID` in the built `fileChunks` slice is empty.

Then in `flushPendingFiles`:

```go
s.repos.Files.Create(txCtx, tx, data.file) // This generates the ID
// But data.fileChunks still has empty FileID!
for i := range data.fileChunks {
    s.repos.FileChunks.Create(...) // Uses empty FileID
}
```

**Solution**: Generate file IDs upfront in `checkFileInMemory()`:

```go
file := &database.File{
    ID:   uuid.New().String(), // Generate ID immediately
    Path: path,
    ...
}
```

### Hypothesis 2: Transaction Isolation

SQLite with a single-connection pool (`MaxOpenConns(1)`) should serialize all transactions, so committed data should be visible to subsequent transactions.

However, there might be a subtle issue with how `context.Background()` is used in the packer vs the scanner's context.
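
For reference, a sketch of the single-connection setup this hypothesis assumes, using standard `database/sql` calls (the driver choice and pragma handling are illustrative, not Vaultik's confirmed wiring):

```go
import (
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // assumed driver
)

func openDB(path string) (*sql.DB, error) {
    db, err := sql.Open("sqlite3", path)
    if err != nil {
        return nil, err
    }
    // A single connection serializes every transaction, so a committed
    // chunk row is always visible to the later file-batch transaction.
    db.SetMaxOpenConns(1)
    // SQLite leaves FK enforcement off by default; it must be enabled
    // for the constraints described in this document to fire at all.
    if _, err := db.Exec("PRAGMA foreign_keys = ON"); err != nil {
        return nil, err
    }
    return db, nil
}
```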

## Recommended Fix

**Step 1: Generate file IDs upfront**

In `checkFileInMemory()`, generate the UUID for new files immediately:

```go
file := &database.File{
    ID:   uuid.New().String(), // Always generate ID
    Path: path,
    ...
}
```

This ensures `file.ID` is set when building the `fileChunks` and `chunkFiles` slices.
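
To catch regressions, a cheap invariant check could be added where files are queued (a hypothetical guard, not part of the fix itself):

```go
// Hypothetical guard: a file must never be queued without an ID, since
// its fileChunks and chunkFiles rows were built from that ID.
if data.file.ID == "" {
    return fmt.Errorf("refusing to queue %s: file.ID was never generated", data.file.Path)
}
```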

**Step 2: Verify by reverting to per-file transactions**

If Step 1 doesn't fix it, revert to non-batched file insertion to isolate the issue:

```go
// Instead of queuing:
// return s.addPendingFile(ctx, pendingFileData{...})

// Do immediate insertion:
return s.repos.WithTx(ctx, func(txCtx context.Context, tx *sql.Tx) error {
    // Create file
    s.repos.Files.Create(txCtx, tx, fileToProcess.File)
    // Delete old associations
    s.repos.FileChunks.DeleteByFileID(...)
    s.repos.ChunkFiles.DeleteByFileID(...)
    // Create new associations
    for _, fc := range fileChunks {
        s.repos.FileChunks.Create(...)
    }
    for _, cf := range chunkFiles {
        s.repos.ChunkFiles.Create(...)
    }
    // Add to snapshot
    s.repos.Snapshots.AddFileByID(...)
    return nil
})
```

**Step 3: If batching is still desired**

After confirming per-file transactions work, re-implement batching with the ID fix in place, and add debug logging to trace exactly which chunk_hash is failing and why; a sketch follows.
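
A sketch of that instrumentation inside the flush loop (hedged: `log` stands in for whatever logger the project actually uses):

```go
// Hypothetical debug logging: before inserting each file_chunk, record
// the row being written and whether its referenced chunk is in the DB.
for _, fc := range data.fileChunks {
    var n int
    row := tx.QueryRowContext(txCtx, "SELECT COUNT(*) FROM chunks WHERE chunk_hash = ?", fc.ChunkHash)
    if err := row.Scan(&n); err != nil {
        return err
    }
    log.Printf("file_chunk file_id=%q idx=%d chunk_hash=%s chunk_in_db=%d", fc.FileID, fc.Idx, fc.ChunkHash, n)
    if err := s.repos.FileChunks.Create(txCtx, tx, &fc); err != nil {
        return fmt.Errorf("file_chunk %s idx %d: %w", fc.ChunkHash, fc.Idx, err)
    }
}
```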