docs: add FORMAT.md, answer design questions, bump version to 1.0.0
- Write complete .mf format specification (FORMAT.md)
- Fill in all design question answers in TODO.md
- Mark completed implementation items in TODO.md
- Bump VERSION from 0.1.0 to 1.0.0 in Makefile
- Update README to reference FORMAT.md and reflect 1.0 status
parent 472221a7f6
commit ca3e29e802
142
FORMAT.md
Normal file
@@ -0,0 +1,142 @@
# .mf File Format Specification

Version 1.0

## Overview

An `.mf` file is a binary manifest that describes a directory tree of files,
including their paths, sizes, and cryptographic checksums. It supports
optional GPG signatures for integrity verification and optional timestamps
for metadata preservation.

## File Structure

An `.mf` file consists of two parts, concatenated:

1. **Magic bytes** (8 bytes): the ASCII string `ZNAVSRFG`
2. **Outer message**: a Protocol Buffers serialized `MFFileOuter` message

There is no length prefix or version byte between the magic and the protobuf
message. The protobuf message extends to the end of the file.

See [`mfer/mf.proto`](mfer/mf.proto) for exact field numbers and types.
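
The framing described in this section can be sketched in a few lines of Go. This is a minimal illustration, not the reference implementation: the helper name `splitMagic` is hypothetical, and the bytes it returns would still need to be parsed as an `MFFileOuter` message by a protobuf library.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// magic is the 8-byte ASCII prefix named in the File Structure section.
var magic = []byte("ZNAVSRFG")

// splitMagic (hypothetical helper) checks the magic prefix and returns the
// remaining bytes, which the spec says form a single serialized MFFileOuter
// message running to the end of the file.
func splitMagic(data []byte) ([]byte, error) {
	if !bytes.HasPrefix(data, magic) {
		return nil, errors.New("not an .mf file: bad magic bytes")
	}
	return data[len(magic):], nil
}

func main() {
	// Three placeholder bytes stand in for the protobuf payload.
	payload, err := splitMagic([]byte("ZNAVSRFG\x01\x02\x03"))
	fmt.Println(len(payload), err) // prints "3 <nil>"
}
```

Because there is no length prefix, everything after the first 8 bytes is handed to the protobuf parser as one message.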

## Outer Message (`MFFileOuter`)

The outer message contains:

| Field | Number | Type | Description |
|--------------------|--------|-------------------|--------------------------------------------------|
| `version` | 101 | enum | Must be `VERSION_ONE` (1) |
| `compressionType` | 102 | enum | Compression of `innerMessage`; must be `COMPRESSION_ZSTD` (1) |
| `size` | 103 | int64 | Uncompressed size of `innerMessage` (corruption detection) |
| `sha256` | 104 | bytes | SHA-256 hash of the **compressed** `innerMessage` (corruption detection) |
| `uuid` | 105 | bytes | Random v4 UUID; must match the inner message UUID |
| `innerMessage` | 199 | bytes | Zstd-compressed serialized `MFFile` message |
| `signature` | 201 | bytes (optional) | GPG signature (ASCII-armored or binary) |
| `signer` | 202 | bytes (optional) | Full GPG key ID of the signer |
| `signingPubKey` | 203 | bytes (optional) | Full GPG signing public key |

### SHA-256 Hash

The `sha256` field (104) covers the **compressed** `innerMessage` bytes.
This allows verifying data integrity before decompression.

## Compression

The `innerMessage` field is compressed with [Zstandard (zstd)](https://facebook.github.io/zstd/).
Implementations must enforce a decompression size limit to prevent
decompression bombs. The reference implementation limits decompressed size to
256 MB.

## Inner Message (`MFFile`)

After decompressing `innerMessage`, the result is a serialized `MFFile`
(referred to as the manifest):

| Field | Number | Type | Description |
|-------------|--------|-----------------------|--------------------------------------------|
| `version` | 100 | enum | Must be `VERSION_ONE` (1) |
| `files` | 101 | repeated `MFFilePath` | List of files in the manifest |
| `uuid` | 102 | bytes | Random v4 UUID; must match outer UUID |
| `createdAt` | 201 | Timestamp (optional) | When the manifest was created |
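
The version and UUID requirements in the two message tables can be checked together once both messages are decoded. The Go sketch below uses hypothetical stand-in structs (`outerHeader`, `innerHeader`) rather than the generated protobuf types, whose exact Go field names are not shown here.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// versionOne mirrors the VERSION_ONE enum value (1) from the field tables.
const versionOne = 1

// outerHeader and innerHeader are hypothetical stand-ins for the generated
// MFFileOuter and MFFile types; only the fields needed here are modelled.
type outerHeader struct {
	Version int32
	UUID    []byte
}

type innerHeader struct {
	Version int32
	UUID    []byte
}

// checkConsistency applies the "must be VERSION_ONE" and "UUIDs must match"
// rules to an already-decoded outer and inner message.
func checkConsistency(o outerHeader, i innerHeader) error {
	if o.Version != versionOne || i.Version != versionOne {
		return errors.New("unsupported manifest version")
	}
	if !bytes.Equal(o.UUID, i.UUID) {
		return errors.New("inner UUID does not match outer UUID")
	}
	return nil
}

func main() {
	id := []byte{0x01, 0x02, 0x03, 0x04} // shortened placeholder UUID
	fmt.Println(checkConsistency(outerHeader{1, id}, innerHeader{1, id}))
	fmt.Println(checkConsistency(outerHeader{1, id}, innerHeader{1, []byte{0xff}}))
}
```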

## File Entries (`MFFilePath`)

Each file entry contains:

| Field | Number | Type | Description |
|------------|--------|---------------------------|--------------------------------------|
| `path` | 1 | string | Relative file path (see Path Rules) |
| `size` | 2 | int64 | File size in bytes |
| `hashes` | 3 | repeated `MFFileChecksum` | At least one hash required |
| `mimeType` | 301 | string (optional) | MIME type |
| `mtime` | 302 | Timestamp (optional) | Modification time |
| `ctime` | 303 | Timestamp (optional) | Change time (inode metadata change) |

Field 304 (`atime`) has been removed from the specification. Access time is
volatile and non-deterministic; it is not useful for integrity verification.

## Path Rules

All `path` values must satisfy these invariants:

- **UTF-8**: paths must be valid UTF-8
- **Forward slashes**: use `/` as the path separator (never `\`)
- **Relative only**: no leading `/`
- **No parent traversal**: no `..` path segments
- **No empty segments**: no `//` sequences
- **No trailing slash**: paths refer to files, not directories

Implementations must validate these invariants when reading and writing
manifests. Paths that violate these rules must be rejected.
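
The invariants above translate fairly directly into a validation function. The following Go sketch is one possible reading of the rules, with `validatePath` as a hypothetical name; rejecting the empty string is an added assumption, since the list does not mention it explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// validatePath (hypothetical helper) checks a manifest path against the
// invariants listed in the Path Rules section.
func validatePath(p string) error {
	switch {
	case p == "":
		// Assumption: the rules do not mention empty paths explicitly.
		return errors.New("path is empty")
	case !utf8.ValidString(p):
		return errors.New("path is not valid UTF-8")
	case strings.Contains(p, `\`):
		return errors.New("path contains a backslash")
	case strings.HasPrefix(p, "/"):
		return errors.New("path is not relative")
	case strings.HasSuffix(p, "/"):
		return errors.New("path has a trailing slash")
	}
	for _, seg := range strings.Split(p, "/") {
		switch seg {
		case "":
			return errors.New("path has an empty segment")
		case "..":
			return errors.New("path has a parent-traversal segment")
		}
	}
	return nil
}

func main() {
	for _, p := range []string{"docs/a.txt", "/etc/passwd", "a/../b", "a//b", "dir/", `a\b`} {
		fmt.Printf("%q: %v\n", p, validatePath(p))
	}
}
```

Only the first path in the loop passes; each of the others trips exactly one rule.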

## Hash Format (`MFFileChecksum`)

Each checksum is a single `bytes multiHash` field containing a
[multihash](https://multiformats.io/multihash/)-encoded value. Multihash is
self-describing: the encoded bytes include a varint algorithm identifier
followed by a varint digest length followed by the digest itself.

The 1.0 implementation writes SHA-256 multihashes (`0x12` algorithm code).
Implementations must be able to verify SHA-256 multihashes at minimum.

## Signature Scheme

Signing is optional. When present, the signature covers a canonical string
constructed as:

```
ZNAVSRFG-<UUID>-<SHA256>
```

Where:
- `ZNAVSRFG` is the magic bytes string (literal ASCII)
- `<UUID>` is the hex-encoded UUID from the outer message
- `<SHA256>` is the hex-encoded SHA-256 hash from the outer message (covering compressed data)

Components are separated by hyphens. The signature is produced by GPG over
this canonical string and stored in the `signature` field of the outer
message.
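
Building the canonical string is a matter of hex-encoding two byte fields and joining them with hyphens. The Go sketch below makes two assumptions the text does not settle: that the hex is lowercase, and that the UUID is encoded without internal dashes. The name `canonicalString` is hypothetical, and the GPG step itself is left out.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// canonicalString (hypothetical helper) builds the string that gets signed:
// the literal magic, the hex-encoded UUID and the hex-encoded SHA-256 digest,
// joined by hyphens. Lowercase, undashed hex is an assumption of this sketch.
func canonicalString(uuid, sha256sum []byte) string {
	return fmt.Sprintf("ZNAVSRFG-%s-%s",
		hex.EncodeToString(uuid),
		hex.EncodeToString(sha256sum))
}

func main() {
	// Shortened placeholder values keep the output readable; real inputs
	// would be the uuid and sha256 fields of the outer message.
	fmt.Println(canonicalString([]byte{0xde, 0xad}, []byte{0xbe, 0xef}))
	// prints "ZNAVSRFG-dead-beef"
}
```

Since the signed string includes the outer `sha256`, which in turn covers the compressed `innerMessage`, a valid signature indirectly covers the whole manifest.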

## Deterministic Serialization

By default, manifests are generated deterministically:

- File entries are sorted by `path` in **lexicographic byte order**
- `createdAt` is omitted unless explicitly requested
- `atime` is never included (field removed from schema)

This ensures that two independent runs over the same directory tree produce
byte-identical `.mf` files (assuming file contents and metadata have not
changed).

## MIME Type

The recommended MIME type for `.mf` files is `application/octet-stream`.
The `.mf` file extension is the canonical identifier.

## Reference

- Proto definition: [`mfer/mf.proto`](mfer/mf.proto)
- Reference implementation: [git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer)
2
Makefile
@@ -7,7 +7,7 @@ SOURCEFILES := mfer/*.go mfer/*.proto internal/*/*.go cmd/*/*.go go.mod go.sum
 ARCH := $(shell uname -m)
 GITREV_BUILD := $(shell bash $(PWD)/bin/gitrev.sh)
 APPNAME := mfer
-VERSION := 0.1.0
+VERSION := 1.0.0
 export DOCKER_IMAGE_CACHE_DIR := $(HOME)/Library/Caches/Docker/$(APPNAME)-$(ARCH)
 GOLDFLAGS += -X main.Version=$(VERSION)
 GOLDFLAGS += -X main.Gitrev=$(GITREV_BUILD)
10
README.md
@@ -9,14 +9,12 @@ cryptographic checksums or signatures over same) to aid in archiving,
 downloading, and streaming, or mirroring. The manifest files' data is
 serialized with Google's [protobuf serialization
 format](https://developers.google.com/protocol-buffers). The structure of
-these files can be found [in the format
-specification](https://git.eeqj.de/sneak/mfer/src/branch/main/mfer/mf.proto)
-which is included in the [project
+these files can be found in the [format specification](FORMAT.md) and the
+[protobuf schema](mfer/mf.proto), both included in the [project
 repository](https://git.eeqj.de/sneak/mfer).

-The current version is pre-1.0 and while the repo was published in 2022,
-there has not yet been any versioned release. [SemVer](https://semver.org)
-will be used for releases.
+The current version is 1.0. [SemVer](https://semver.org) is used for
+releases.

 This project was started by [@sneak](https://sneak.berlin) to scratch an
 itch in 2022 and is currently a one-person effort, though the goal is for
42
TODO.md
@@ -9,76 +9,76 @@
 **1. Should `MFFileChecksum` be simplified?**
 Currently it's a separate message wrapping a single `bytes multiHash` field. Since multihash already self-describes the algorithm, `repeated bytes hashes` directly on `MFFilePath` would be simpler and reduce per-file protobuf overhead. Is the extra message layer intentional (e.g. planning to add per-hash metadata like `verified_at`)?

-> *answer:*
+> *answer:* Leave as-is for now.

 **2. Should file permissions/mode be stored?**
 The format stores mtime/ctime but not Unix file permissions. For archival use (ExFAT, filesystem-independent checksums) this may not matter, but for software distribution or filesystem restoration it's a gap. Should we reserve a field now (e.g. `optional uint32 mode = 305`) even if we don't populate it yet?

-> *answer:*
+> *answer:* No, not right now.

 **3. Should `atime` be removed from the schema?**
 Access time is volatile, non-deterministic, and often disabled (`noatime`). Including it means two manifests of the same directory at different times will differ, which conflicts with the determinism goal. Remove it, or document it as "never set by default"?

-> *answer:*
+> *answer:* REMOVED — done. Field 304 has been removed from the proto schema.

 **4. What are the path normalization rules?**
 The proto has `string path` with no specification about: always forward-slash? Must be relative? No `..` components allowed? UTF-8 NFC vs NFD normalization (macOS vs Linux)? Max path length? This is a security issue (path traversal) and a cross-platform compatibility issue. What rules should the spec mandate?

-> *answer:*
+> *answer:* Implemented — UTF-8, forward-slash only, relative paths only, no `..` segments. Documented in FORMAT.md.

 **5. Should we add a version byte after the magic?**
 Currently `ZNAVSRFG` is followed immediately by protobuf. Adding a version byte (`ZNAVSRFG\x01`) would allow future framing changes without requiring protobuf parsing to detect the version. `MFFileOuter.Version` serves this purpose but requires successful deserialization to read. Worth the extra byte?

-> *answer:*
+> *answer:* No — protobuf handles versioning via the `MFFileOuter.Version` field.

 **6. Should we add a length-prefix after the magic?**
 Protobuf is not self-delimiting. If we ever want to concatenate manifests or append data after the protobuf, the current framing is insufficient. Add a varint or fixed-width length-prefix?

-> *answer:*
+> *answer:* Not needed now.

 ### Signature Design

 **7. What does the outer SHA-256 hash cover — compressed or uncompressed data?**
 The review notes it currently hashes compressed data (good for verifying before decompression), but this should be explicitly documented. Which is the intended behavior?

-> *answer:*
+> *answer:* Hash covers compressed data. Documented in FORMAT.md.

 **8. Should `signatureString()` sign raw bytes instead of a hex-encoded string?**
 Currently the canonical string is `MAGIC-UUID-MULTIHASH` with hex encoding, which adds a transformation layer. Signing the raw `sha256` bytes (or compressed `innerMessage` directly) would be simpler. Keep the string format or switch to raw bytes?

-> *answer:*
+> *answer:* Keep string format as-is (established).

 **9. Should we support detached signature files (`.mf.sig`)?**
 Embedded signatures are better for single-file distribution. Detached `.mf.sig` files follow the familiar `SHASUMS`/`SHASUMS.asc` pattern and are simpler for HTTP serving. Support both modes?

-> *answer:*
+> *answer:* Not for 1.0.

 **10. GPG vs pure-Go crypto for signatures?**
 Shelling out to `gpg` is fragile (may not be installed, version-dependent output). `github.com/ProtonMail/go-crypto` provides pure-Go OpenPGP, or we could go Ed25519/signify (simpler, no key management). Which direction?

-> *answer:*
+> *answer:* Keep GPG shelling for now (established).

 ### Implementation Design

 **11. Should manifests be deterministic by default?**
 This means: sort file entries by path, omit `createdAt` timestamp (or make it opt-in), no `atime`. Should determinism be the default, with a `--include-timestamps` flag to opt in?

-> *answer:*
+> *answer:* YES — implemented, default behavior.

 **12. Should we consolidate or keep both scanner/checker implementations?**
 There are two parallel implementations: `mfer/scanner.go` + `mfer/checker.go` (typed with `FileSize`, `RelFilePath`) and `internal/scanner/` + `internal/checker/` (raw `int64`, `string`). The `mfer/` versions are superior. Delete the `internal/` versions?

-> *answer:*
+> *answer:* Consolidated — done (PR#27).

 **13. Should the `manifest` type be exported?**
 Currently unexported with exported constructors (`New`, `NewFromPaths`, etc.). Consumers can't declare `var m *mfer.manifest`. Export the type, or define an interface?

-> *answer:*
+> *answer:* Keep unexported.

 **14. What should the Go module path be for 1.0?**
 Currently mixed between `sneak.berlin/go/mfer` and `git.eeqj.de/sneak/mfer`. Which is canonical?

-> *answer:*
+> *answer:* `sneak.berlin/go/mfer`

 ---

@@ -86,19 +86,19 @@ Currently mixed between `sneak.berlin/go/mfer` and `git.eeqj.de/sneak/mfer`. Whi

 ### Phase 1: Foundation (format correctness)

-- [ ] Delete `internal/scanner/` and `internal/checker/` — consolidate on `mfer/` package versions; update CLI code
-- [ ] Add deterministic file ordering — sort entries by path (lexicographic, byte-order) in `Builder.Build()`; add test asserting byte-identical output from two runs
-- [ ] Add decompression size limit — `io.LimitReader` in `deserializeInner()` with `m.pbOuter.Size` as bound
+- [x] Delete `internal/scanner/` and `internal/checker/` — consolidate on `mfer/` package versions; update CLI code
+- [x] Add deterministic file ordering — sort entries by path (lexicographic, byte-order) in `Builder.Build()`; add test asserting byte-identical output from two runs
+- [x] Add decompression size limit — `io.LimitReader` in `deserializeInner()` with `m.pbOuter.Size` as bound
 - [ ] Fix `errors.Is` dead code in checker — replace with `os.IsNotExist(err)` or `errors.Is(err, fs.ErrNotExist)`
 - [ ] Fix `AddFile` to verify size — check `totalRead == size` after reading, return error on mismatch
-- [ ] Specify path invariants — add proto comments (UTF-8, forward-slash, relative, no `..`, no leading `/`); validate in `Builder.AddFile` and `Builder.AddFileWithHash`
+- [x] Specify path invariants — add proto comments (UTF-8, forward-slash, relative, no `..`, no leading `/`); validate in `Builder.AddFile` and `Builder.AddFileWithHash`

 ### Phase 2: CLI polish

 - [ ] Fix flag naming — all CLI flags use kebab-case as primary (`--include-dotfiles`, `--follow-symlinks`)
 - [ ] Fix URL construction in fetch — use `BaseURL.JoinPath()` or `url.JoinPath()` instead of string concatenation
 - [ ] Add progress rate-limiting to Checker — throttle to once per second, matching Scanner
-- [ ] Add `--deterministic` flag (or make it default) — omit `createdAt`, sort files
+- [x] Add `--deterministic` flag (or make it default) — omit `createdAt`, sort files

 ### Phase 3: Robustness

@@ -109,10 +109,10 @@ Currently mixed between `sneak.berlin/go/mfer` and `git.eeqj.de/sneak/mfer`. Whi

 ### Phase 4: Format finalization

-- [ ] Remove or deprecate `atime` from proto (pending design question answer)
+- [x] Remove or deprecate `atime` from proto (pending design question answer)
 - [ ] Reserve `optional uint32 mode = 305` in `MFFilePath` for future file permissions
 - [ ] Add version byte after magic — `ZNAVSRFG\x01` for format version 1
-- [ ] Write format specification document — separate from README: magic, outer structure, compression, inner structure, path invariants, signature scheme, canonical serialization
+- [x] Write format specification document — separate from README: magic, outer structure, compression, inner structure, path invariants, signature scheme, canonical serialization

 ### Phase 5: Release prep
