1.0 quality polish — code review, tests, bug fixes, documentation (#32)
Comprehensive quality pass targeting 1.0 release:

- Code review and refactoring
- Fix open bugs (#14, #16, #23)
- Expand test coverage
- Lint clean
- README update with build instructions (#9)
- Documentation improvements

Branched from `next` (active dev branch).

Reviewed-on: #32
Co-authored-by: clawbot <clawbot@noreply.example.org>
Co-committed-by: clawbot <clawbot@noreply.example.org>
This commit was merged in pull request #32.
142
FORMAT.md
Normal file
@@ -0,0 +1,142 @@
# .mf File Format Specification

Version 1.0

## Overview

An `.mf` file is a binary manifest that describes a directory tree of files,
including their paths, sizes, and cryptographic checksums. It supports
optional GPG signatures for integrity verification and optional timestamps
for metadata preservation.

## File Structure

An `.mf` file consists of two parts, concatenated:

1. **Magic bytes** (8 bytes): the ASCII string `ZNAVSRFG`
2. **Outer message**: a Protocol Buffers serialized `MFFileOuter` message

There is no length prefix or version byte between the magic and the protobuf
message. The protobuf message extends to the end of the file.

See [`mfer/mf.proto`](mfer/mf.proto) for exact field numbers and types.
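
As a minimal sketch of the layout described above, a reader can check the
8-byte magic and treat everything after it as the serialized `MFFileOuter`.
The function name `split_mf` is hypothetical and not part of the reference
implementation; parsing the protobuf itself is out of scope here.

```python
MAGIC = b"ZNAVSRFG"  # 8-byte ASCII magic from the spec


def split_mf(data: bytes) -> bytes:
    """Return the serialized MFFileOuter bytes that follow the magic.

    Hypothetical helper for illustration only.
    """
    if len(data) < len(MAGIC) or data[:len(MAGIC)] != MAGIC:
        raise ValueError("not an .mf file: bad magic bytes")
    # No length prefix or version byte: the rest of the file is protobuf.
    return data[len(MAGIC):]
```

The returned bytes would then be handed to whatever protobuf binding was
generated from `mfer/mf.proto`.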

## Outer Message (`MFFileOuter`)

The outer message contains:

| Field             | Number | Type             | Description                                                              |
|-------------------|--------|------------------|--------------------------------------------------------------------------|
| `version`         | 101    | enum             | Must be `VERSION_ONE` (1)                                                |
| `compressionType` | 102    | enum             | Compression of `innerMessage`; must be `COMPRESSION_ZSTD` (1)            |
| `size`            | 103    | int64            | Uncompressed size of `innerMessage` (corruption detection)               |
| `sha256`          | 104    | bytes            | SHA-256 hash of the **compressed** `innerMessage` (corruption detection) |
| `uuid`            | 105    | bytes            | Random v4 UUID; must match the inner message UUID                        |
| `innerMessage`    | 199    | bytes            | Zstd-compressed serialized `MFFile` message                              |
| `signature`       | 201    | bytes (optional) | GPG signature (ASCII-armored or binary)                                  |
| `signer`          | 202    | bytes (optional) | Full GPG key ID of the signer                                            |
| `signingPubKey`   | 203    | bytes (optional) | Full GPG signing public key                                              |

### SHA-256 Hash

The `sha256` field (104) covers the **compressed** `innerMessage` bytes.
This allows verifying data integrity before decompression.
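
The check described above might look like the following sketch. It assumes
the caller has already extracted the raw `innerMessage` and `sha256` bytes
from the outer message; `verify_outer_sha256` is a hypothetical name.

```python
import hashlib
import hmac


def verify_outer_sha256(inner_message: bytes, expected_sha256: bytes) -> bool:
    """Check the outer `sha256` field against the still-compressed bytes."""
    actual = hashlib.sha256(inner_message).digest()
    # compare_digest performs a constant-time comparison of the two digests.
    return hmac.compare_digest(actual, expected_sha256)
```

Running this before decompression means a corrupted payload is rejected
without ever being passed to the zstd decoder.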

## Compression

The `innerMessage` field is compressed with [Zstandard (zstd)](https://facebook.github.io/zstd/).
Implementations must enforce a decompression size limit to prevent
decompression bombs. The reference implementation limits decompressed size to
256 MB.
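
One way to enforce such a limit is to read the decompressed stream in chunks
and stop as soon as the running total exceeds the cap. The sketch below is
decoder-agnostic: it accepts any binary file-like object, which in practice
could be a streaming reader from a third-party zstd binding (an assumption,
since Python's standard library has no zstd module in most versions). The
name `read_limited` and the reading of "256 MB" as 256 MiB are also
assumptions.

```python
import io
from typing import BinaryIO

# Assumption: "256 MB" is interpreted as 256 MiB for this sketch.
MAX_DECOMPRESSED = 256 * 1024 * 1024


def read_limited(stream: BinaryIO, limit: int = MAX_DECOMPRESSED) -> bytes:
    """Read a decompressed stream, failing once it exceeds `limit` bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(64 * 1024)
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            # Stop before buffering more than the cap allows.
            raise ValueError("decompressed data exceeds size limit")
        chunks.append(chunk)
    return b"".join(chunks)
```

A reader could additionally compare the length of the result with the outer
`size` field, which records the uncompressed size of `innerMessage`.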

## Inner Message (`MFFile`)

After decompressing `innerMessage`, the result is a serialized `MFFile`
(referred to as the manifest):

| Field       | Number | Type                  | Description                           |
|-------------|--------|-----------------------|---------------------------------------|
| `version`   | 100    | enum                  | Must be `VERSION_ONE` (1)             |
| `files`     | 101    | repeated `MFFilePath` | List of files in the manifest         |
| `uuid`      | 102    | bytes                 | Random v4 UUID; must match outer UUID |
| `createdAt` | 201    | Timestamp (optional)  | When the manifest was created         |

## File Entries (`MFFilePath`)

Each file entry contains:

| Field      | Number | Type                      | Description                         |
|------------|--------|---------------------------|-------------------------------------|
| `path`     | 1      | string                    | Relative file path (see Path Rules) |
| `size`     | 2      | int64                     | File size in bytes                  |
| `hashes`   | 3      | repeated `MFFileChecksum` | At least one hash required          |
| `mimeType` | 301    | string (optional)         | MIME type                           |
| `mtime`    | 302    | Timestamp (optional)      | Modification time                   |
| `ctime`    | 303    | Timestamp (optional)      | Change time (inode metadata change) |

Field 304 (`atime`) has been removed from the specification. Access time is
volatile and non-deterministic; it is not useful for integrity verification.

## Path Rules

All `path` values must satisfy these invariants:

- **UTF-8**: paths must be valid UTF-8
- **Forward slashes**: use `/` as the path separator (never `\`)
- **Relative only**: no leading `/`
- **No parent traversal**: no `..` path segments
- **No empty segments**: no `//` sequences
- **No trailing slash**: paths refer to files, not directories

Implementations must validate these invariants when reading and writing
manifests. Paths that violate these rules must be rejected.
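
The invariants above translate fairly directly into a validator. The sketch
below is one possible reading, not the reference implementation:
`validate_path` is a hypothetical name, and two choices are assumptions that
the spec does not spell out, namely rejecting any backslash anywhere in the
path and rejecting the empty path.

```python
def validate_path(raw: bytes) -> str:
    """Return the decoded path, or raise ValueError if it breaks a rule."""
    try:
        path = raw.decode("utf-8")  # UTF-8: must decode cleanly
    except UnicodeDecodeError as exc:
        raise ValueError("path is not valid UTF-8") from exc
    if not path:
        # Assumption: an empty path is treated as invalid.
        raise ValueError("path is empty")
    if "\\" in path:
        # Assumption: strict reading of "never `\`" rejects every backslash.
        raise ValueError("backslash is not allowed; use '/'")
    if path.startswith("/"):
        raise ValueError("path must be relative (no leading '/')")
    if path.endswith("/"):
        raise ValueError("path must not end with '/'")
    segments = path.split("/")
    if "" in segments:
        raise ValueError("empty path segment ('//')")
    if ".." in segments:
        raise ValueError("parent traversal ('..') is not allowed")
    return path
```

Checking `..` per segment rather than as a substring keeps names such as
`a..b` valid while still rejecting `a/../b`.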

## Hash Format (`MFFileChecksum`)

Each checksum is a single `bytes multiHash` field containing a
[multihash](https://multiformats.io/multihash/)-encoded value. Multihash is
self-describing: the encoded bytes include a varint algorithm identifier
followed by a varint digest length followed by the digest itself.

The 1.0 implementation writes SHA-256 multihashes (`0x12` algorithm code).
Implementations must be able to verify SHA-256 multihashes at minimum.
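
To make the byte layout concrete, here is a small sketch that encodes and
decodes a SHA-256 multihash by hand using only the standard library. The
function names are hypothetical; a real implementation would more likely use
an existing multihash library.

```python
import hashlib

SHA2_256 = 0x12  # multihash algorithm code for SHA-256


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode an unsigned varint at `pos`; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7
        if shift >= 63:
            # Sanity bound for this sketch: give up after 9 bytes.
            raise ValueError("varint too long")


def encode_sha256_multihash(data: bytes) -> bytes:
    digest = hashlib.sha256(data).digest()
    # Both 0x12 and 32 are below 0x80, so each varint is a single byte.
    return bytes([SHA2_256, len(digest)]) + digest


def decode_multihash(mh: bytes) -> tuple[int, bytes]:
    """Return (algorithm code, digest), checking the declared length."""
    code, pos = read_varint(mh, 0)
    length, pos = read_varint(mh, pos)
    digest = mh[pos:]
    if len(digest) != length:
        raise ValueError("digest length mismatch")
    return code, digest
```

For SHA-256 the encoded value is always 34 bytes: `0x12`, `0x20` (32), and
the 32-byte digest.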

## Signature Scheme

Signing is optional. When present, the signature covers a canonical string
constructed as:

```
ZNAVSRFG-<UUID>-<SHA256>
```

Where:
- `ZNAVSRFG` is the magic bytes string (literal ASCII)
- `<UUID>` is the hex-encoded UUID from the outer message
- `<SHA256>` is the hex-encoded SHA-256 hash from the outer message (covering compressed data)

Components are separated by hyphens. The signature is produced by GPG over
this canonical string and stored in the `signature` field of the outer
message.
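
Building the canonical string is a matter of hex-encoding two fields from
the outer message. In the sketch below, `canonical_signing_string` is a
hypothetical name, and the use of lowercase hex with no dashes inside the
UUID is an assumption; the spec text only says "hex-encoded". Invoking GPG
is left out.

```python
MAGIC = "ZNAVSRFG"  # literal ASCII magic string


def canonical_signing_string(uuid_bytes: bytes, sha256_bytes: bytes) -> str:
    """Assemble the string that the GPG signature is computed over."""
    # bytes.hex() yields lowercase hex without separators; both the case and
    # the undashed UUID form are assumptions of this sketch.
    return f"{MAGIC}-{uuid_bytes.hex()}-{sha256_bytes.hex()}"
```

Because the string includes both the UUID and the hash of the compressed
payload, a signature cannot be moved to a manifest with different contents
without failing verification.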

## Deterministic Serialization

By default, manifests are generated deterministically:

- File entries are sorted by `path` in **lexicographic byte order**
- `createdAt` is omitted unless explicitly requested
- `atime` is never included (field removed from schema)

This ensures that two independent runs over the same directory tree produce
byte-identical `.mf` files (assuming file contents and metadata have not
changed).
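
The sort rule can be sketched as a comparison of each path's UTF-8 bytes,
which avoids any dependence on locale or case-insensitive collation. The
helper name `sort_paths` is hypothetical.

```python
def sort_paths(paths: list[str]) -> list[str]:
    """Sort paths by the raw bytes of their UTF-8 encoding."""
    return sorted(paths, key=lambda p: p.encode("utf-8"))


print(sort_paths(["b.txt", "A.txt", "a/c", "a.txt"]))
# ['A.txt', 'a.txt', 'a/c', 'b.txt']
```

Uppercase `A` (0x41) sorts before lowercase `a` (0x61), and `.` (0x2E) sorts
before `/` (0x2F), so `a.txt` precedes `a/c`.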

## MIME Type

The recommended MIME type for `.mf` files is `application/octet-stream`.
The `.mf` file extension is the canonical identifier.

## Reference

- Proto definition: [`mfer/mf.proto`](mfer/mf.proto)
- Reference implementation: [git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer)