# Vaultik Data Model

## Overview

Vaultik uses a local SQLite database to track file metadata, chunk mappings, and blob associations during the backup process. This database serves as an index for incremental backups and enables efficient deduplication.

**Important Notes:**

- **No Migration Support**: Vaultik does not support database schema migrations. If the schema changes, the local database must be deleted and recreated by performing a full backup.
- **Version Compatibility**: In rare cases, you may need to use the same version of Vaultik to restore a backup as was used to create it. This ensures compatibility with the metadata format stored in S3.

## Database Tables

### 1. `files`

Stores metadata about files in the filesystem being backed up.

**Columns:**

- `id` (TEXT PRIMARY KEY) - UUID for the file record
- `path` (TEXT UNIQUE) - Absolute file path
- `mtime` (INTEGER) - Modification time as Unix timestamp
- `ctime` (INTEGER) - Change time as Unix timestamp
- `size` (INTEGER) - File size in bytes
- `mode` (INTEGER) - Unix file permissions and type
- `uid` (INTEGER) - User ID of file owner
- `gid` (INTEGER) - Group ID of file owner
- `link_target` (TEXT) - Symlink target path (empty for regular files)

**Purpose:** Tracks file metadata to detect changes between backup runs. Used for incremental backup decisions. The UUID primary key provides stable references that don't change if files are moved.

### 2. `chunks`

Stores information about content-defined chunks created from files.

**Columns:**

- `chunk_hash` (TEXT PRIMARY KEY) - SHA256 hash of chunk content
- `sha256` (TEXT) - SHA256 hash (currently same as `chunk_hash`)
- `size` (INTEGER) - Chunk size in bytes

**Purpose:** Enables deduplication by tracking unique chunks across all files.

### 3. `file_chunks`

Maps files to their constituent chunks in order.

**Columns:**

- `file_id` (TEXT) - File ID (FK to files.id)
- `idx` (INTEGER) - Chunk index within file (0-based)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- PRIMARY KEY (`file_id`, `idx`)

**Purpose:** Allows reconstruction of files from chunks during restore.

### 4. `chunk_files`

Reverse mapping showing which files contain each chunk.

**Columns:**

- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `file_id` (TEXT) - File ID (FK to files.id)
- `file_offset` (INTEGER) - Byte offset of chunk within file
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`chunk_hash`, `file_id`)

**Purpose:** Supports efficient queries for chunk usage and deduplication statistics.

### 5. `blobs`

Stores information about packed, compressed, and encrypted blob files.

**Columns:**

- `id` (TEXT PRIMARY KEY) - UUID assigned when blob creation starts
- `hash` (TEXT) - SHA256 hash of final blob (empty until finalized)
- `created_ts` (INTEGER) - Creation timestamp
- `finished_ts` (INTEGER) - Finalization timestamp (NULL if in progress)
- `uncompressed_size` (INTEGER) - Total size of chunks before compression
- `compressed_size` (INTEGER) - Size after compression and encryption
- `uploaded_ts` (INTEGER) - Upload completion timestamp (NULL if not uploaded)

**Purpose:** Tracks blob lifecycle from creation through upload. The UUID primary key allows immediate association of chunks with blobs.
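Vaultik's actual DDL lives in its source and may differ in detail. The following sketch restates the `files`, `chunks`, and `file_chunks` descriptions above as illustrative SQLite DDL; the `NOT NULL` and foreign-key constraint details are assumptions, not the real schema definition:

```go
// Illustrative DDL only: table and column names follow the descriptions
// above, but the constraint details are assumptions, not Vaultik's
// actual schema definition.
const schemaSketch = `
CREATE TABLE files (
    id          TEXT PRIMARY KEY,           -- UUID
    path        TEXT UNIQUE NOT NULL,       -- absolute path
    mtime       INTEGER NOT NULL,
    ctime       INTEGER NOT NULL,
    size        INTEGER NOT NULL,
    mode        INTEGER NOT NULL,
    uid         INTEGER NOT NULL,
    gid         INTEGER NOT NULL,
    link_target TEXT NOT NULL DEFAULT ''    -- empty for regular files
);

CREATE TABLE chunks (
    chunk_hash TEXT PRIMARY KEY,            -- SHA256 of chunk content
    sha256     TEXT NOT NULL,               -- currently same as chunk_hash
    size       INTEGER NOT NULL
);

CREATE TABLE file_chunks (
    file_id    TEXT NOT NULL REFERENCES files(id),
    idx        INTEGER NOT NULL,            -- 0-based position within the file
    chunk_hash TEXT NOT NULL REFERENCES chunks(chunk_hash),
    PRIMARY KEY (file_id, idx)
);
`
```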
### 6. `blob_chunks`

Maps chunks to the blobs that contain them.

**Columns:**

- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `chunk_hash` (TEXT) - Chunk hash (FK to chunks.chunk_hash)
- `offset` (INTEGER) - Byte offset of chunk within blob (before compression)
- `length` (INTEGER) - Length of chunk in bytes
- PRIMARY KEY (`blob_id`, `chunk_hash`)

**Purpose:** Enables chunk retrieval from blobs during restore operations.

### 7. `snapshots`

Tracks backup snapshots.

**Columns:**

- `id` (TEXT PRIMARY KEY) - Snapshot ID (format: hostname-YYYYMMDD-HHMMSSZ)
- `hostname` (TEXT) - Hostname where backup was created
- `vaultik_version` (TEXT) - Version of Vaultik used
- `vaultik_git_revision` (TEXT) - Git revision of Vaultik used
- `started_at` (INTEGER) - Start timestamp
- `completed_at` (INTEGER) - Completion timestamp (NULL if in progress)
- `file_count` (INTEGER) - Number of files in snapshot
- `chunk_count` (INTEGER) - Number of unique chunks
- `blob_count` (INTEGER) - Number of blobs referenced
- `total_size` (INTEGER) - Total size of all files
- `blob_size` (INTEGER) - Total size of all blobs (compressed)
- `blob_uncompressed_size` (INTEGER) - Total uncompressed size of all referenced blobs
- `compression_ratio` (REAL) - Compression ratio achieved
- `compression_level` (INTEGER) - Compression level used for this snapshot
- `upload_bytes` (INTEGER) - Total bytes uploaded during this snapshot
- `upload_duration_ms` (INTEGER) - Total milliseconds spent uploading to S3

**Purpose:** Provides snapshot metadata and statistics, including version tracking for compatibility.

### 8. `snapshot_files`

Maps snapshots to the files they contain.

**Columns:**

- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `file_id` (TEXT) - File ID (FK to files.id)
- PRIMARY KEY (`snapshot_id`, `file_id`)

**Purpose:** Records which files are included in each snapshot.

### 9. `snapshot_blobs`

Maps snapshots to the blobs they reference.

**Columns:**

- `snapshot_id` (TEXT) - Snapshot ID (FK to snapshots.id)
- `blob_id` (TEXT) - Blob ID (FK to blobs.id)
- `blob_hash` (TEXT) - Denormalized blob hash for manifest generation
- PRIMARY KEY (`snapshot_id`, `blob_id`)

**Purpose:** Tracks blob dependencies for snapshots and enables manifest generation.

### 10. `uploads`

Tracks blob upload metrics.

**Columns:**

- `blob_hash` (TEXT PRIMARY KEY) - Hash of uploaded blob
- `uploaded_at` (INTEGER) - Upload timestamp
- `size` (INTEGER) - Size of uploaded blob
- `duration_ms` (INTEGER) - Upload duration in milliseconds

**Purpose:** Performance monitoring and upload tracking.

## Data Flow and Operations

### 1. Backup Process

1. **File Scanning**
   - `INSERT OR REPLACE INTO files` - Update file metadata
   - `SELECT * FROM files WHERE path = ?` - Check if file has changed
   - `INSERT INTO snapshot_files` - Add file to current snapshot
2. **Chunking** (for changed files)
   - `INSERT OR IGNORE INTO chunks` - Store new chunks
   - `INSERT INTO file_chunks` - Map chunks to file
   - `INSERT INTO chunk_files` - Create reverse mapping
3. **Blob Packing** (see the sketch after this list)
   - `INSERT INTO blobs` - Create blob record with UUID (hash empty)
   - `INSERT INTO blob_chunks` - Associate chunks with blob immediately
   - `UPDATE blobs SET hash = ?, finished_ts = ?` - Finalize blob after packing
4. **Upload**
   - `UPDATE blobs SET uploaded_ts = ?` - Mark blob as uploaded
   - `INSERT INTO uploads` - Record upload metrics
   - `INSERT INTO snapshot_blobs` - Associate blob with snapshot
5. **Snapshot Completion**
   - `UPDATE snapshots SET completed_at = ?, stats...` - Finalize snapshot
   - Generate and upload blob manifest from `snapshot_blobs`
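The ordering in the blob-packing step is what lets chunks be associated with a blob before its content hash is known. Below is a minimal sketch of that sequence using plain `database/sql`; the `chunkRef` type, the `packBlob` helper, and the caller-supplied `finalHash` are illustrative, not Vaultik's actual repository code:

```go
package blobpack

import (
	"context"
	"database/sql"
	"time"

	"github.com/google/uuid"
)

// chunkRef is an illustrative type, not Vaultik's actual model.
type chunkRef struct {
	Hash   string
	Length int64
}

// packBlob mirrors the "Blob Packing" statements above: create the blob
// row with an empty hash, associate chunks immediately, then finalize.
// finalHash is the SHA256 of the compressed, encrypted blob, computed
// by the caller once packing completes.
func packBlob(ctx context.Context, tx *sql.Tx, chunks []chunkRef, finalHash string) error {
	blobID := uuid.NewString() // UUID assigned before the content hash is known

	// 1. Create the blob record with an empty hash.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO blobs (id, hash, created_ts) VALUES (?, '', ?)`,
		blobID, time.Now().Unix(),
	); err != nil {
		return err
	}

	// 2. Associate chunks with the blob immediately, tracking offsets.
	var offset int64
	for _, c := range chunks {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO blob_chunks (blob_id, chunk_hash, offset, length)
			 VALUES (?, ?, ?, ?)`,
			blobID, c.Hash, offset, c.Length,
		); err != nil {
			return err
		}
		offset += c.Length
	}

	// 3. Finalize the blob once packing is done and the hash is known.
	_, err := tx.ExecContext(ctx,
		`UPDATE blobs SET hash = ?, finished_ts = ? WHERE id = ?`,
		finalHash, time.Now().Unix(), blobID,
	)
	return err
}
```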
### 2. Incremental Backup

1. **Change Detection**
   - `SELECT * FROM files WHERE path = ?` - Get previous file metadata
   - Compare mtime, size, and mode to detect changes
   - Skip unchanged files but still add them to `snapshot_files`
2. **Chunk Reuse**
   - `SELECT * FROM blob_chunks WHERE chunk_hash = ?` - Find existing chunks
   - `INSERT INTO snapshot_blobs` - Reference existing blobs for unchanged files

### 3. Restore Process

The restore process does not use the local database. Instead, it:

1. Downloads snapshot metadata from S3
2. Downloads the required blobs based on the manifest
3. Reconstructs files from the decrypted and decompressed chunks

### 4. Pruning

1. **Identify Unreferenced Blobs**
   - Query blobs not referenced by any remaining snapshot
   - Delete them from S3 and the local database

## Repository Pattern

Vaultik uses a repository pattern for database access:

- `FileRepository` - CRUD operations for files
- `ChunkRepository` - CRUD operations for chunks
- `FileChunkRepository` - Manage file-chunk mappings
- `BlobRepository` - Manage blob lifecycle
- `BlobChunkRepository` - Manage blob-chunk associations
- `SnapshotRepository` - Manage snapshots
- `UploadRepository` - Track upload metrics

Each repository provides methods like:

- `Create()` - Insert new record
- `GetByID()` / `GetByPath()` / `GetByHash()` - Retrieve records
- `Update()` - Update existing records
- `Delete()` - Remove records
- Specialized queries for each entity type

## Transaction Management

All database operations that modify multiple tables are wrapped in transactions:

```go
err := repos.WithTx(ctx, func(ctx context.Context, tx *sql.Tx) error {
	// Multiple repository operations using tx
	return nil
})
```

This ensures consistency, which is especially important for operations like:

- Creating file-chunk mappings
- Associating chunks with blobs
- Updating snapshot statistics

## Performance Considerations

1. **Indexes**: Primary keys are automatically indexed. Additional indexes may be needed for:
   - `blobs.hash` for lookup performance
   - `blob_chunks.chunk_hash` for chunk location queries
2. **Prepared Statements**: All queries use prepared statements for performance and security
3. **Batch Operations**: Where possible, operations are batched within transactions
4. **Write-Ahead Logging**: SQLite WAL mode is enabled for better concurrency (see the sketch at the end of this document)

## Data Integrity

1. **Foreign Keys**: Enforced at the application level through repository methods
2. **Unique Constraints**: Chunk hashes and file paths are unique
3. **Null Handling**: Nullable fields clearly indicate in-progress operations
4. **Timestamp Tracking**: All major operations record timestamps for auditing
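As a rough illustration of the WAL and indexing points under Performance Considerations, the snippet below opens an index database, enables WAL mode, and creates the two suggested secondary indexes. The database path, index names, and the `mattn/go-sqlite3` driver choice are assumptions for the sketch, not necessarily what Vaultik itself does:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver; Vaultik may use another
)

func main() {
	// Hypothetical path for the local index database; assumes the
	// tables described above already exist.
	db, err := sql.Open("sqlite3", "vaultik-index.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Enable write-ahead logging for better read/write concurrency.
	if _, err := db.Exec(`PRAGMA journal_mode=WAL`); err != nil {
		log.Fatal(err)
	}

	// The secondary indexes suggested under Performance Considerations;
	// the index names are illustrative.
	stmts := []string{
		`CREATE INDEX IF NOT EXISTS idx_blobs_hash ON blobs(hash)`,
		`CREATE INDEX IF NOT EXISTS idx_blob_chunks_chunk_hash ON blob_chunks(chunk_hash)`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```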