diff --git a/AGENTS.md b/AGENTS.md index 89f7c2c..a71dee0 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -26,4 +26,5 @@ source for coding standards, formatting, linting, and workflow rules. - The proto definition is in `mfer/mf.proto`; generated `.pb.go` files are committed (required for `go get` compatibility). - The format specification is in `FORMAT.md`. -- See `TODO.md` for the 1.0 implementation plan and open design questions. +- See the TODO section in `README.md` for the 1.0 implementation plan + and open design questions. diff --git a/README.md b/README.md index 102a68f..259a9b0 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,8 @@ software. A compatible javascript library is planned. # Build Status -[![Build Status](https://drone.datavi.be/api/badges/sneak/mfer/status.svg)](https://drone.datavi.be/sneak/mfer) +CI runs via `docker build .` which executes `make check` (formatting, +linting, tests). The `main` branch must always be green. # Participation @@ -208,6 +209,218 @@ regardless of filesystem format. Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your desired username for an account on this Gitea instance. +# TODO: Remaining Work for 1.0 + +## Design Questions (Owner Decision Required) + +These require @sneak's input before implementation. Answers should be added +inline below each question. + +### Format Design + +**1. Should `MFFileChecksum` be simplified?** Currently it's a separate +message wrapping a single `bytes multiHash` field. Since multihash +already self-describes the algorithm, `repeated bytes hashes` directly on +`MFFilePath` would be simpler and reduce per-file protobuf overhead. Is +the extra message layer intentional (e.g. planning to add per-hash +metadata like `verified_at`)? + +> _answer:_ + +**2. Should file permissions/mode be stored?** The format stores +mtime/ctime but not Unix file permissions. For archival use this may not +matter, but for software distribution or filesystem restoration it's a +gap. Should we reserve a field now (e.g. `optional uint32 mode = 305`) +even if we don't populate it yet? + +> _answer:_ + +**3. Should `atime` be removed from the schema?** Access time is +volatile, non-deterministic, and often disabled (`noatime`). Including it +means two manifests of the same directory at different times will differ, +which conflicts with the determinism goal. Remove it, or document it as +"never set by default"? + +> _answer:_ + +**4. What are the path normalization rules?** The proto has `string path` +with no specification about: always forward-slash? Must be relative? No +`..` components allowed? UTF-8 NFC vs NFD normalization (macOS vs +Linux)? Max path length? This is a security issue (path traversal) and a +cross-platform compatibility issue. What rules should the spec mandate? + +> _answer:_ + +**5. Should we add a version byte after the magic?** Currently +`ZNAVSRFG` is followed immediately by protobuf. Adding a version byte +(`ZNAVSRFG\x01`) would allow future framing changes without requiring +protobuf parsing to detect the version. `MFFileOuter.Version` serves +this purpose but requires successful deserialization to read. Worth the +extra byte? + +> _answer:_ + +**6. Should we add a length-prefix after the magic?** Protobuf is not +self-delimiting. If we ever want to concatenate manifests or append data +after the protobuf, the current framing is insufficient. Add a varint or +fixed-width length-prefix? + +> _answer:_ + +### Signature Design + +**7. What does the outer SHA-256 hash cover — compressed or uncompressed +data?** The code currently hashes compressed data (good for verifying +before decompression), but this should be explicitly documented. Which is +the intended behavior? + +> _answer:_ + +**8. Should `signatureString()` sign raw bytes instead of a hex-encoded +string?** Currently the canonical string is `MAGIC-UUID-MULTIHASH` with +hex encoding, which adds a transformation layer. Signing the raw `sha256` +bytes (or compressed `innerMessage` directly) would be simpler. Keep the +string format or switch to raw bytes? + +> _answer:_ + +**9. Should we support detached signature files (`.mf.sig`)?** Embedded +signatures are better for single-file distribution. Detached `.mf.sig` +files follow the familiar `SHASUMS`/`SHASUMS.asc` pattern and are +simpler for HTTP serving. Support both modes? + +> _answer:_ + +**10. GPG vs pure-Go crypto for signatures?** Shelling out to `gpg` is +fragile (may not be installed, version-dependent output). +`github.com/ProtonMail/go-crypto` provides pure-Go OpenPGP, or we could +use Ed25519/signify (simpler, no key management). Which direction? + +> _answer:_ + +### Implementation Design + +**11. Should manifests be deterministic by default?** This means: sort +file entries by path, omit `createdAt` timestamp (or make it opt-in), no +`atime`. Should determinism be the default, with a +`--include-timestamps` flag to opt in? + +> _answer:_ + +**12. Should we consolidate or keep both scanner/checker +implementations?** There are two parallel implementations: +`mfer/scanner.go` + `mfer/checker.go` (typed with `FileSize`, +`RelFilePath`) and `internal/scanner/` + `internal/checker/` (raw +`int64`, `string`). The `mfer/` versions are superior. Delete the +`internal/` versions? + +> _answer:_ + +**13. Should the `manifest` type be exported?** Currently unexported with +exported constructors (`NewManifestFromReader`, `NewManifestFromFile`). +Consumers can't declare `var m *mfer.manifest`. Export the type, or +define an interface? + +> _answer:_ + +**14. What should the Go module path be for 1.0?** Currently +`sneak.berlin/go/mfer` in `go.mod` but `git.eeqj.de/sneak/mfer/mfer` in +the proto `go_package` option. Which is canonical? + +> _answer:_ + +## Implementation Tasks + +### Repo Infrastructure + +- [ ] Add `.golangci.yml` (fetch from + `https://git.eeqj.de/sneak/prompts/raw/branch/main/.golangci.yml`) +- [ ] Add `.editorconfig` +- [ ] Add `.gitea/workflows/check.yml` that runs `docker build .` + +### Format & Correctness + +- [ ] Resolve proto `go_package` path inconsistency + (`git.eeqj.de/sneak/mfer/mfer` vs `sneak.berlin/go/mfer`) +- [ ] Specify path invariants — add proto comments requiring UTF-8, + forward-slash, relative paths, no `..`, no leading `/`; validate + in `Builder.AddFile` and `Builder.AddFileWithHash` (pending design + question answer) +- [ ] Remove or deprecate `atime` from proto (pending design question + answer) +- [ ] Reserve `optional uint32 mode = 305` in `MFFilePath` for future + file permissions (pending design question answer) +- [ ] Add version byte after magic — `ZNAVSRFG\x01` for format version + 1 (pending design question answer) +- [ ] Write format specification document — separate from README: + magic, outer structure, compression, inner structure, path + invariants, signature scheme, canonical serialization + +### Library + +- [ ] Delete `internal/scanner/` and `internal/checker/` — consolidate + on `mfer/` package versions; update CLI code (pending design + question answer) +- [ ] Add deterministic file ordering — sort entries by path + (lexicographic, byte-order) in `Builder.Build()`; add test + asserting byte-identical output from two runs +- [ ] Add decompression size limit — `io.LimitReader` in + `deserializeInner()` with `m.pbOuter.Size` as bound +- [ ] Fix `errors.Is` dead code in checker — replace with + `os.IsNotExist(err)` or `errors.Is(err, fs.ErrNotExist)` +- [ ] Fix `AddFile` to verify size — check `totalRead == size` after + reading, return error on mismatch +- [ ] Export the `manifest` type or define a public interface (pending + design question answer) — currently consumers cannot hold a reference + to a loaded manifest in their own type declarations +- [ ] Replace GPG subprocess calls with pure-Go crypto (pending design + question answer) — current implementation shells out to `gpg` which + may not be installed +- [ ] Add timeout to any remaining subprocess calls + +### CLI + +- [ ] Fix flag naming — all CLI flags should use kebab-case as primary + (`--include-dotfiles`, `--follow-symlinks`) +- [ ] Fix URL construction in fetch — use `BaseURL.JoinPath()` or + `url.JoinPath()` instead of string concatenation +- [ ] Add progress rate-limiting to Checker — throttle to once per + second, matching Scanner +- [ ] Add `--deterministic` flag or make it default — omit `createdAt`, + sort files (pending design question answer) +- [ ] Wire `--version` flag properly (currently only a `version` + subcommand exists; top-level `--version` shows urfave/cli generic + output) +- [ ] Add retry logic to `fetch` — currently no retries on transient + HTTP errors; needs exponential backoff +- [ ] `fetch` command uses bare `http.Get` with no timeout — needs + `http.Client` with configurable timeout + +### Testing & Robustness + +- [ ] Add fuzzing tests for `NewManifestFromReader` — protobuf + deserialization of untrusted input needs fuzz coverage +- [ ] Add integration test for `freshen` CLI command — current tests + only verify setup, not the actual freshen operation end-to-end +- [ ] Add test for `fetch` CLI command end-to-end (currently only + `downloadFile` is tested) + +### Documentation + +- [ ] Promote `FORMAT.md` as primary spec reference; README should link + to it more prominently +- [ ] Audit and update all error messages for consistency and + helpfulness +- [ ] Document the signature scheme more thoroughly (canonical string + format, verification steps) + +### Release + +- [ ] Finalize Go module path +- [ ] Update version constant in `mfer/constants.go` +- [ ] Add `--version` output matching SemVer +- [ ] Tag `v1.0.0` + # See Also ## Prior Art: Metalink diff --git a/TODO.md b/TODO.md deleted file mode 100644 index 266b231..0000000 --- a/TODO.md +++ /dev/null @@ -1,122 +0,0 @@ -# TODO: mfer 1.0 - -## Design Questions - -_sneak: please answer inline below each question. These are preserved for posterity._ - -### Format Design - -**1. Should `MFFileChecksum` be simplified?** -Currently it's a separate message wrapping a single `bytes multiHash` field. Since multihash already self-describes the algorithm, `repeated bytes hashes` directly on `MFFilePath` would be simpler and reduce per-file protobuf overhead. Is the extra message layer intentional (e.g. planning to add per-hash metadata like `verified_at`)? - -> _answer:_ - -**2. Should file permissions/mode be stored?** -The format stores mtime/ctime but not Unix file permissions. For archival use (ExFAT, filesystem-independent checksums) this may not matter, but for software distribution or filesystem restoration it's a gap. Should we reserve a field now (e.g. `optional uint32 mode = 305`) even if we don't populate it yet? - -> _answer:_ - -**3. Should `atime` be removed from the schema?** -Access time is volatile, non-deterministic, and often disabled (`noatime`). Including it means two manifests of the same directory at different times will differ, which conflicts with the determinism goal. Remove it, or document it as "never set by default"? - -> _answer:_ - -**4. What are the path normalization rules?** -The proto has `string path` with no specification about: always forward-slash? Must be relative? No `..` components allowed? UTF-8 NFC vs NFD normalization (macOS vs Linux)? Max path length? This is a security issue (path traversal) and a cross-platform compatibility issue. What rules should the spec mandate? - -> _answer:_ - -**5. Should we add a version byte after the magic?** -Currently `ZNAVSRFG` is followed immediately by protobuf. Adding a version byte (`ZNAVSRFG\x01`) would allow future framing changes without requiring protobuf parsing to detect the version. `MFFileOuter.Version` serves this purpose but requires successful deserialization to read. Worth the extra byte? - -> _answer:_ - -**6. Should we add a length-prefix after the magic?** -Protobuf is not self-delimiting. If we ever want to concatenate manifests or append data after the protobuf, the current framing is insufficient. Add a varint or fixed-width length-prefix? - -> _answer:_ - -### Signature Design - -**7. What does the outer SHA-256 hash cover — compressed or uncompressed data?** -The review notes it currently hashes compressed data (good for verifying before decompression), but this should be explicitly documented. Which is the intended behavior? - -> _answer:_ - -**8. Should `signatureString()` sign raw bytes instead of a hex-encoded string?** -Currently the canonical string is `MAGIC-UUID-MULTIHASH` with hex encoding, which adds a transformation layer. Signing the raw `sha256` bytes (or compressed `innerMessage` directly) would be simpler. Keep the string format or switch to raw bytes? - -> _answer:_ - -**9. Should we support detached signature files (`.mf.sig`)?** -Embedded signatures are better for single-file distribution. Detached `.mf.sig` files follow the familiar `SHASUMS`/`SHASUMS.asc` pattern and are simpler for HTTP serving. Support both modes? - -> _answer:_ - -**10. GPG vs pure-Go crypto for signatures?** -Shelling out to `gpg` is fragile (may not be installed, version-dependent output). `github.com/ProtonMail/go-crypto` provides pure-Go OpenPGP, or we could go Ed25519/signify (simpler, no key management). Which direction? - -> _answer:_ - -### Implementation Design - -**11. Should manifests be deterministic by default?** -This means: sort file entries by path, omit `createdAt` timestamp (or make it opt-in), no `atime`. Should determinism be the default, with a `--include-timestamps` flag to opt in? - -> _answer:_ - -**12. Should we consolidate or keep both scanner/checker implementations?** -There are two parallel implementations: `mfer/scanner.go` + `mfer/checker.go` (typed with `FileSize`, `RelFilePath`) and `internal/scanner/` + `internal/checker/` (raw `int64`, `string`). The `mfer/` versions are superior. Delete the `internal/` versions? - -> _answer:_ - -**13. Should the `manifest` type be exported?** -Currently unexported with exported constructors (`New`, `NewFromPaths`, etc.). Consumers can't declare `var m *mfer.manifest`. Export the type, or define an interface? - -> _answer:_ - -**14. What should the Go module path be for 1.0?** -Currently mixed between `sneak.berlin/go/mfer` and `git.eeqj.de/sneak/mfer`. Which is canonical? - -> _answer:_ - ---- - -## Implementation Plan - -### Phase 1: Foundation (format correctness) - -- [ ] Delete `internal/scanner/` and `internal/checker/` — consolidate on `mfer/` package versions; update CLI code -- [ ] Add deterministic file ordering — sort entries by path (lexicographic, byte-order) in `Builder.Build()`; add test asserting byte-identical output from two runs -- [ ] Add decompression size limit — `io.LimitReader` in `deserializeInner()` with `m.pbOuter.Size` as bound -- [ ] Fix `errors.Is` dead code in checker — replace with `os.IsNotExist(err)` or `errors.Is(err, fs.ErrNotExist)` -- [ ] Fix `AddFile` to verify size — check `totalRead == size` after reading, return error on mismatch -- [ ] Specify path invariants — add proto comments (UTF-8, forward-slash, relative, no `..`, no leading `/`); validate in `Builder.AddFile` and `Builder.AddFileWithHash` - -### Phase 2: CLI polish - -- [ ] Fix flag naming — all CLI flags use kebab-case as primary (`--include-dotfiles`, `--follow-symlinks`) -- [ ] Fix URL construction in fetch — use `BaseURL.JoinPath()` or `url.JoinPath()` instead of string concatenation -- [ ] Add progress rate-limiting to Checker — throttle to once per second, matching Scanner -- [ ] Add `--deterministic` flag (or make it default) — omit `createdAt`, sort files - -### Phase 3: Robustness - -- [ ] Replace GPG subprocess with pure-Go crypto — `github.com/ProtonMail/go-crypto` or Ed25519/signify -- [ ] Add timeout to any remaining subprocess calls -- [ ] Add fuzzing tests for `NewManifestFromReader` -- [ ] Add retry logic to fetch — exponential backoff for transient HTTP errors - -### Phase 4: Format finalization - -- [ ] Remove or deprecate `atime` from proto (pending design question answer) -- [ ] Reserve `optional uint32 mode = 305` in `MFFilePath` for future file permissions -- [ ] Add version byte after magic — `ZNAVSRFG\x01` for format version 1 -- [ ] Write format specification document — separate from README: magic, outer structure, compression, inner structure, path invariants, signature scheme, canonical serialization - -### Phase 5: Release prep - -- [ ] Finalize Go module path -- [ ] Audit all error messages for consistency and helpfulness -- [ ] Add `--version` output matching SemVer -- [ ] Tag v1.0.0