diff --git a/FORMAT.md b/FORMAT.md
new file mode 100644
index 0000000..e09dfb8
--- /dev/null
+++ b/FORMAT.md
@@ -0,0 +1,142 @@
+# .mf File Format Specification
+
+Version 1.0
+
+## Overview
+
+An `.mf` file is a binary manifest that describes a directory tree of files,
+including their paths, sizes, and cryptographic checksums. It supports
+optional GPG signatures for integrity verification and optional timestamps
+for metadata preservation.
+
+## File Structure
+
+An `.mf` file consists of two parts, concatenated:
+
+1. **Magic bytes** (8 bytes): the ASCII string `ZNAVSRFG`
+2. **Outer message**: a Protocol Buffers serialized `MFFileOuter` message
+
+There is no length prefix or version byte between the magic and the protobuf
+message. The protobuf message extends to the end of the file.
+
+See [`mfer/mf.proto`](mfer/mf.proto) for exact field numbers and types.
+
+## Outer Message (`MFFileOuter`)
+
+The outer message contains:
+
+| Field | Number | Type | Description |
+|--------------------|--------|-------------------|--------------------------------------------------|
+| `version` | 101 | enum | Must be `VERSION_ONE` (1) |
+| `compressionType` | 102 | enum | Compression of `innerMessage`; must be `COMPRESSION_ZSTD` (1) |
+| `size` | 103 | int64 | Uncompressed size of `innerMessage` (corruption detection) |
+| `sha256` | 104 | bytes | SHA-256 hash of the **compressed** `innerMessage` (corruption detection) |
+| `uuid` | 105 | bytes | Random v4 UUID; must match the inner message UUID |
+| `innerMessage` | 199 | bytes | Zstd-compressed serialized `MFFile` message |
+| `signature` | 201 | bytes (optional) | GPG signature (ASCII-armored or binary) |
+| `signer` | 202 | bytes (optional) | Full GPG key ID of the signer |
+| `signingPubKey` | 203 | bytes (optional) | Full GPG signing public key |
+
+### SHA-256 Hash
+
+The `sha256` field (104) covers the **compressed** `innerMessage` bytes.
+This allows verifying data integrity before decompression.
+
+## Compression
+
+The `innerMessage` field is compressed with [Zstandard (zstd)](https://facebook.github.io/zstd/).
+Implementations must enforce a decompression size limit to prevent
+decompression bombs. The reference implementation limits decompressed size to
+256 MB.
+
+## Inner Message (`MFFile`)
+
+After decompressing `innerMessage`, the result is a serialized `MFFile`
+(referred to as the manifest):
+
+| Field | Number | Type | Description |
+|-------------|--------|-----------------------|--------------------------------------------|
+| `version` | 100 | enum | Must be `VERSION_ONE` (1) |
+| `files` | 101 | repeated `MFFilePath` | List of files in the manifest |
+| `uuid` | 102 | bytes | Random v4 UUID; must match outer UUID |
+| `createdAt` | 201 | Timestamp (optional) | When the manifest was created |
+
+## File Entries (`MFFilePath`)
+
+Each file entry contains:
+
+| Field | Number | Type | Description |
+|------------|--------|---------------------------|--------------------------------------|
+| `path` | 1 | string | Relative file path (see Path Rules) |
+| `size` | 2 | int64 | File size in bytes |
+| `hashes` | 3 | repeated `MFFileChecksum` | At least one hash required |
+| `mimeType` | 301 | string (optional) | MIME type |
+| `mtime` | 302 | Timestamp (optional) | Modification time |
+| `ctime` | 303 | Timestamp (optional) | Change time (inode metadata change) |
+
+Field 304 (`atime`) has been removed from the specification. Access time is
+volatile and non-deterministic; it is not useful for integrity verification.
+
+## Path Rules
+
+All `path` values must satisfy these invariants:
+
+- **UTF-8**: paths must be valid UTF-8
+- **Forward slashes**: use `/` as the path separator (never `\`)
+- **Relative only**: no leading `/`
+- **No parent traversal**: no `..` path segments
+- **No empty segments**: no `//` sequences
+- **No trailing slash**: paths refer to files, not directories
+
+Implementations must validate these invariants when reading and writing
+manifests. Paths that violate these rules must be rejected.
+
+## Hash Format (`MFFileChecksum`)
+
+Each checksum is a single `bytes multiHash` field containing a
+[multihash](https://multiformats.io/multihash/)-encoded value. Multihash is
+self-describing: the encoded bytes include a varint algorithm identifier
+followed by a varint digest length followed by the digest itself.
+
+The 1.0 implementation writes SHA-256 multihashes (`0x12` algorithm code).
+Implementations must be able to verify SHA-256 multihashes at minimum.
+
+## Signature Scheme
+
+Signing is optional. When present, the signature covers a canonical string
+constructed as:
+
+```
+ZNAVSRFG-<UUID>-<SHA256>
+```
+
+Where:
+- `ZNAVSRFG` is the magic bytes string (literal ASCII)
+- `<UUID>` is the hex-encoded UUID from the outer message
+- `<SHA256>` is the hex-encoded SHA-256 hash from the outer message (covering compressed data)
+
+Components are separated by hyphens. The signature is produced by GPG over
+this canonical string and stored in the `signature` field of the outer
+message.
+
+## Deterministic Serialization
+
+By default, manifests are generated deterministically:
+
+- File entries are sorted by `path` in **lexicographic byte order**
+- `createdAt` is omitted unless explicitly requested
+- `atime` is never included (field removed from schema)
+
+This ensures that two independent runs over the same directory tree produce
+byte-identical `.mf` files (assuming file contents and metadata have not
+changed).
+
+## MIME Type
+
+The recommended MIME type for `.mf` files is `application/octet-stream`.
+The `.mf` file extension is the canonical identifier.
+
+## Reference
+
+- Proto definition: [`mfer/mf.proto`](mfer/mf.proto)
+- Reference implementation: [git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer)
diff --git a/Makefile b/Makefile
index e27258f..1ec6919 100644
--- a/Makefile
+++ b/Makefile
@@ -7,7 +7,7 @@ SOURCEFILES := mfer/*.go mfer/*.proto internal/*/*.go cmd/*/*.go go.mod go.sum
 ARCH := $(shell uname -m)
 GITREV_BUILD := $(shell bash $(PWD)/bin/gitrev.sh)
 APPNAME := mfer
-VERSION := 0.1.0
+VERSION := 1.0.0
 export DOCKER_IMAGE_CACHE_DIR := $(HOME)/Library/Caches/Docker/$(APPNAME)-$(ARCH)
 GOLDFLAGS += -X main.Version=$(VERSION)
 GOLDFLAGS += -X main.Gitrev=$(GITREV_BUILD)
diff --git a/README.md b/README.md
index 8d9a3a4..99af8a2 100644
--- a/README.md
+++ b/README.md
@@ -9,14 +9,12 @@ cryptographic checksums or signatures over same) to aid in archiving,
 downloading, and streaming, or mirroring. The manifest files' data is
 serialized with Google's [protobuf serialization
 format](https://developers.google.com/protocol-buffers). The structure of
-these files can be found [in the format
-specification](https://git.eeqj.de/sneak/mfer/src/branch/main/mfer/mf.proto)
-which is included in the [project
+these files can be found in the [format specification](FORMAT.md) and the
+[protobuf schema](mfer/mf.proto), both included in the [project
 repository](https://git.eeqj.de/sneak/mfer).
 
-The current version is pre-1.0 and while the repo was published in 2022,
-there has not yet been any versioned release. [SemVer](https://semver.org)
-will be used for releases.
+The current version is 1.0. [SemVer](https://semver.org) is used for
+releases.
 
 This project was started by [@sneak](https://sneak.berlin) to scratch an
 itch in 2022 and is currently a one-person effort, though the goal is for
diff --git a/TODO.md b/TODO.md
index 6c4cd3e..b03d1b7 100644
--- a/TODO.md
+++ b/TODO.md
@@ -9,76 +9,76 @@
 **1. Should `MFFileChecksum` be simplified?**
 Currently it's a separate message wrapping a single `bytes multiHash` field. Since multihash already self-describes the algorithm, `repeated bytes hashes` directly on `MFFilePath` would be simpler and reduce per-file protobuf overhead. Is the extra message layer intentional (e.g. planning to add per-hash metadata like `verified_at`)?
 
-> *answer:*
+> *answer:* Leave as-is for now.
 
 **2. Should file permissions/mode be stored?**
 The format stores mtime/ctime but not Unix file permissions. For archival use (ExFAT, filesystem-independent checksums) this may not matter, but for software distribution or filesystem restoration it's a gap. Should we reserve a field now (e.g. `optional uint32 mode = 305`) even if we don't populate it yet?
 
-> *answer:*
+> *answer:* No, not right now.
 
 **3. Should `atime` be removed from the schema?**
 Access time is volatile, non-deterministic, and often disabled (`noatime`). Including it means two manifests of the same directory at different times will differ, which conflicts with the determinism goal. Remove it, or document it as "never set by default"?
 
-> *answer:*
+> *answer:* REMOVED — done. Field 304 has been removed from the proto schema.
 
 **4. What are the path normalization rules?**
 The proto has `string path` with no specification about: always forward-slash? Must be relative? No `..` components allowed? UTF-8 NFC vs NFD normalization (macOS vs Linux)? Max path length? This is a security issue (path traversal) and a cross-platform compatibility issue. What rules should the spec mandate?
 
-> *answer:*
+> *answer:* Implemented — UTF-8, forward-slash only, relative paths only, no `..` segments. Documented in FORMAT.md.
 
 **5. Should we add a version byte after the magic?**
 Currently `ZNAVSRFG` is followed immediately by protobuf. Adding a version byte (`ZNAVSRFG\x01`) would allow future framing changes without requiring protobuf parsing to detect the version. `MFFileOuter.Version` serves this purpose but requires successful deserialization to read. Worth the extra byte?
 
-> *answer:*
+> *answer:* No — protobuf handles versioning via the `MFFileOuter.Version` field.
 
 **6. Should we add a length-prefix after the magic?**
 Protobuf is not self-delimiting. If we ever want to concatenate manifests or append data after the protobuf, the current framing is insufficient. Add a varint or fixed-width length-prefix?
 
-> *answer:*
+> *answer:* Not needed now.
 
 ### Signature Design
 
 **7. What does the outer SHA-256 hash cover — compressed or uncompressed data?**
 The review notes it currently hashes compressed data (good for verifying before decompression), but this should be explicitly documented. Which is the intended behavior?
 
-> *answer:*
+> *answer:* Hash covers compressed data. Documented in FORMAT.md.
 
 **8. Should `signatureString()` sign raw bytes instead of a hex-encoded string?**
 Currently the canonical string is `MAGIC-UUID-MULTIHASH` with hex encoding, which adds a transformation layer. Signing the raw `sha256` bytes (or compressed `innerMessage` directly) would be simpler. Keep the string format or switch to raw bytes?
 
-> *answer:*
+> *answer:* Keep string format as-is (established).
 
 **9. Should we support detached signature files (`.mf.sig`)?**
 Embedded signatures are better for single-file distribution. Detached `.mf.sig` files follow the familiar `SHASUMS`/`SHASUMS.asc` pattern and are simpler for HTTP serving. Support both modes?
 
-> *answer:*
+> *answer:* Not for 1.0.
 
 **10. GPG vs pure-Go crypto for signatures?**
 Shelling out to `gpg` is fragile (may not be installed, version-dependent output). `github.com/ProtonMail/go-crypto` provides pure-Go OpenPGP, or we could go Ed25519/signify (simpler, no key management). Which direction?
 
-> *answer:*
+> *answer:* Keep GPG shelling for now (established).
 
 ### Implementation Design
 
 **11. Should manifests be deterministic by default?**
 This means: sort file entries by path, omit `createdAt` timestamp (or make it opt-in), no `atime`. Should determinism be the default, with a `--include-timestamps` flag to opt in?
 
-> *answer:*
+> *answer:* YES — implemented, default behavior.
 
 **12. Should we consolidate or keep both scanner/checker implementations?**
 There are two parallel implementations: `mfer/scanner.go` + `mfer/checker.go` (typed with `FileSize`, `RelFilePath`) and `internal/scanner/` + `internal/checker/` (raw `int64`, `string`). The `mfer/` versions are superior. Delete the `internal/` versions?
 
-> *answer:*
+> *answer:* Consolidated — done (PR#27).
 
 **13. Should the `manifest` type be exported?**
 Currently unexported with exported constructors (`New`, `NewFromPaths`, etc.). Consumers can't declare `var m *mfer.manifest`. Export the type, or define an interface?
 
-> *answer:*
+> *answer:* Keep unexported.
 
 **14. What should the Go module path be for 1.0?**
 Currently mixed between `sneak.berlin/go/mfer` and `git.eeqj.de/sneak/mfer`. Which is canonical?
 
-> *answer:*
+> *answer:* `sneak.berlin/go/mfer`
 
 ---
 
@@ -86,19 +86,19 @@ Currently mixed between `sneak.berlin/go/mfer` and `git.eeqj.de/sneak/mfer`. Whi
 
 ### Phase 1: Foundation (format correctness)
 
-- [ ] Delete `internal/scanner/` and `internal/checker/` — consolidate on `mfer/` package versions; update CLI code
-- [ ] Add deterministic file ordering — sort entries by path (lexicographic, byte-order) in `Builder.Build()`; add test asserting byte-identical output from two runs
-- [ ] Add decompression size limit — `io.LimitReader` in `deserializeInner()` with `m.pbOuter.Size` as bound
+- [x] Delete `internal/scanner/` and `internal/checker/` — consolidate on `mfer/` package versions; update CLI code
+- [x] Add deterministic file ordering — sort entries by path (lexicographic, byte-order) in `Builder.Build()`; add test asserting byte-identical output from two runs
+- [x] Add decompression size limit — `io.LimitReader` in `deserializeInner()` with `m.pbOuter.Size` as bound
 - [ ] Fix `errors.Is` dead code in checker — replace with `os.IsNotExist(err)` or `errors.Is(err, fs.ErrNotExist)`
 - [ ] Fix `AddFile` to verify size — check `totalRead == size` after reading, return error on mismatch
-- [ ] Specify path invariants — add proto comments (UTF-8, forward-slash, relative, no `..`, no leading `/`); validate in `Builder.AddFile` and `Builder.AddFileWithHash`
+- [x] Specify path invariants — add proto comments (UTF-8, forward-slash, relative, no `..`, no leading `/`); validate in `Builder.AddFile` and `Builder.AddFileWithHash`
 
 ### Phase 2: CLI polish
 
 - [ ] Fix flag naming — all CLI flags use kebab-case as primary (`--include-dotfiles`, `--follow-symlinks`)
 - [ ] Fix URL construction in fetch — use `BaseURL.JoinPath()` or `url.JoinPath()` instead of string concatenation
 - [ ] Add progress rate-limiting to Checker — throttle to once per second, matching Scanner
-- [ ] Add `--deterministic` flag (or make it default) — omit `createdAt`, sort files
+- [x] Add `--deterministic` flag (or make it default) — omit `createdAt`, sort files
 
 ### Phase 3: Robustness
 
@@ -109,10 +109,10 @@ Currently mixed between `sneak.berlin/go/mfer` and `git.eeqj.de/sneak/mfer`. Whi
 
 ### Phase 4: Format finalization
 
-- [ ] Remove or deprecate `atime` from proto (pending design question answer)
+- [x] Remove or deprecate `atime` from proto (pending design question answer)
 - [ ] Reserve `optional uint32 mode = 305` in `MFFilePath` for future file permissions
 - [ ] Add version byte after magic — `ZNAVSRFG\x01` for format version 1
-- [ ] Write format specification document — separate from README: magic, outer structure, compression, inner structure, path invariants, signature scheme, canonical serialization
+- [x] Write format specification document — separate from README: magic, outer structure, compression, inner structure, path invariants, signature scheme, canonical serialization
 
 ### Phase 5: Release prep