Refactor blob storage to use UUID primary keys and implement streaming chunking

- Changed blob table to use ID (UUID) as primary key instead of hash
- Blob records are now created at packing start, enabling immediate chunk associations
- Implemented streaming chunking to process large files without memory exhaustion
- Fixed blob manifest generation to include all referenced blobs
- Updated all foreign key references from blob_hash to blob_id
- Added progress reporting and improved error handling
- Enforced encryption requirement for all blob packing
- Updated tests to use test encryption keys
- Added Cyrillic transliteration to README
2025-07-22 07:43:39 +02:00
parent 26db096913
commit 86b533d6ee
49 changed files with 5709 additions and 324 deletions

94
TODO.md

@@ -1,40 +1,92 @@
# Implementation TODO
## Proposed: Store and Snapshot Commands
### Overview
Reorganize commands to provide better visibility into stored data and snapshots.
### Command Structure
#### `vaultik store` - Storage information commands
- `vaultik store info`
  - Lists S3 bucket configuration
  - Shows total number of snapshots (from metadata/ listing)
  - Shows total number of blobs (from blobs/ listing)
  - Shows total size of all blobs
  - **No decryption required** - uses S3 listing only
#### `vaultik snapshot` - Snapshot management commands
- `vaultik snapshot create [path]`
  - Renamed from `vaultik backup`
  - Same functionality as current backup command
- `vaultik snapshot list [--json]`
  - Lists all snapshots with:
    - Snapshot ID
    - Creation timestamp (parsed from snapshot ID)
    - Compressed size (sum of referenced blob sizes from manifest)
  - **No decryption required** - uses blob manifests only
  - `--json` flag outputs in JSON format instead of table
- `vaultik snapshot purge`
  - Requires one of:
    - `--keep-latest` - keeps only the most recent snapshot
    - `--older-than <duration>` - removes snapshots older than duration (e.g., "30d", "6m", "1y")
  - Removes snapshot metadata and runs pruning to clean up unreferenced blobs
  - Shows what would be deleted and requires confirmation
- `vaultik snapshot verify [--deep] <snapshot-id>`
  - Basic mode: Verifies all blobs referenced in manifest exist in S3
  - `--deep` mode: Downloads each blob and verifies its hash matches the stored hash
  - **Stub implementation for now**
### Implementation Notes
1. **No Decryption Required**: No command needs decryption keys; they rely on S3 listings and unencrypted blob manifests
2. **Blob Manifests**: Located at `metadata/{snapshot-id}/manifest.json.zst`
3. **S3 Operations**: Use S3 ListObjects to enumerate snapshots and blobs
4. **Size Calculations**: `store info` sums object sizes from the S3 listing; `snapshot list` sums blob sizes from the manifest
5. **Timestamp Parsing**: Extract from snapshot ID format (e.g., `2024-01-15-143052-hostname`)
6. **S3 Metadata**: Only used for `snapshot verify` command
### Benefits
- Users can see storage usage without decryption keys
- Snapshot management doesn't require access to encrypted metadata
- Clean separation between storage info and snapshot operations
## Chunking and Hashing
1. Implement Rabin fingerprint chunker
1. Create streaming chunk processor
1. ~~Implement content-defined chunking~~ (done with FastCDC)
1. ~~Create streaming chunk processor~~ (done in chunker)
1. ~~Implement SHA256 hashing for chunks~~ (done in scanner)
1. ~~Add configurable chunk size parameters~~ (done in scanner)
1. Write tests for chunking consistency
1. ~~Write tests for chunking consistency~~ (done)
## Compression and Encryption
1. Implement zstd compression wrapper
1. Integrate age encryption library
1. Create Encryptor type for public key encryption
1. Create Decryptor type for private key decryption
1. Implement streaming encrypt/decrypt pipelines
1. Write tests for compression and encryption
1. ~~Implement compression~~ (done with zlib in blob packer)
1. ~~Integrate age encryption library~~ (done in crypto package)
1. ~~Create Encryptor type for public key encryption~~ (done)
1. ~~Implement streaming encrypt/decrypt pipelines~~ (done in packer)
1. ~~Write tests for compression and encryption~~ (done)
## Blob Packing
1. Implement BlobWriter with size limits
1. Add chunk accumulation and flushing
1. Create blob hash calculation
1. Implement proper error handling and rollback
1. Write tests for blob packing scenarios
1. ~~Implement BlobWriter with size limits~~ (done in packer)
1. ~~Add chunk accumulation and flushing~~ (done)
1. ~~Create blob hash calculation~~ (done)
1. ~~Implement proper error handling and rollback~~ (done with transactions)
1. ~~Write tests for blob packing scenarios~~ (done)
## S3 Operations
1. Integrate MinIO client library
1. Implement S3Client wrapper type
1. Add multipart upload support for large blobs
1. Implement retry logic with exponential backoff
1. Add connection pooling and timeout handling
1. Write tests using MinIO container
1. ~~Integrate MinIO client library~~ (done in s3 package)
1. ~~Implement S3Client wrapper type~~ (done)
1. ~~Add multipart upload support for large blobs~~ (done - using standard upload)
1. ~~Implement retry logic~~ (handled by MinIO client)
1. ~~Write tests using MinIO container~~ (done with testcontainers)
## Backup Command - Basic
1. ~~Implement directory walking with exclusion patterns~~ (done with afero)
1. Add file change detection using index
1. ~~Integrate chunking pipeline for changed files~~ (done in scanner)
1. Implement blob upload coordination
1. Implement blob upload coordination to S3
1. Add progress reporting to stderr
1. Write integration tests for backup