2 Commits

Author SHA1 Message Date
clawbot
963dd829e6 Integrate all TODO.md items into README TODO section
Add all 14 design questions (7 were missing: atime removal, path
normalization, outer SHA-256 scope, signatureString format, detached
signatures, deterministic-by-default, scanner/checker consolidation).

Add all missing implementation tasks from TODO.md phases 1-4:
- Phase 1 foundation: consolidate internal packages, deterministic
  ordering, decompression size limit, errors.Is fix, AddFile size
  verification, path invariants
- Phase 2 CLI: flag naming, URL construction, progress rate-limiting,
  deterministic flag
- Phase 3: subprocess timeouts
- Phase 4: atime deprecation

Organize design questions with sub-headers (Format Design, Signature
Design, Implementation Design) matching original TODO.md structure.
2026-03-17 01:57:57 -07:00
user
cebf9785fc Add TODO section to README with 1.0 roadmap, remove TODO.md
Perform design and status review of the codebase. Add a comprehensive
TODO section to README.md covering design questions requiring owner
input, implementation tasks (repo infrastructure, format, library, CLI,
testing, documentation), and release checklist.

Integrate items from TODO.md into the README TODO section and remove
TODO.md from the repo. Update AGENTS.md to reference the new location.
2026-03-17 01:45:11 -07:00
4 changed files with 216 additions and 133 deletions

View File

@@ -1,9 +0,0 @@
name: check
on: [push]
jobs:
check:
runs-on: ubuntu-latest
steps:
# actions/checkout v4.2.2, 2026-03-16
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
- run: docker build .

View File

@@ -26,4 +26,5 @@ source for coding standards, formatting, linting, and workflow rules.
- The proto definition is in `mfer/mf.proto`; generated `.pb.go` files are - The proto definition is in `mfer/mf.proto`; generated `.pb.go` files are
committed (required for `go get` compatibility). committed (required for `go get` compatibility).
- The format specification is in `FORMAT.md`. - The format specification is in `FORMAT.md`.
- See `TODO.md` for the 1.0 implementation plan and open design questions. - See the TODO section in `README.md` for the 1.0 implementation plan
and open design questions.

215
README.md
View File

@@ -25,7 +25,8 @@ software. A compatible javascript library is planned.
# Build Status # Build Status
[![Build Status](https://drone.datavi.be/api/badges/sneak/mfer/status.svg)](https://drone.datavi.be/sneak/mfer) CI runs via `docker build .` which executes `make check` (formatting,
linting, tests). The `main` branch must always be green.
# Participation # Participation
@@ -208,6 +209,218 @@ regardless of filesystem format.
Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your
desired username for an account on this Gitea instance. desired username for an account on this Gitea instance.
# TODO: Remaining Work for 1.0
## Design Questions (Owner Decision Required)
These require @sneak's input before implementation. Answers should be added
inline below each question.
### Format Design
**1. Should `MFFileChecksum` be simplified?** Currently it's a separate
message wrapping a single `bytes multiHash` field. Since multihash
already self-describes the algorithm, `repeated bytes hashes` directly on
`MFFilePath` would be simpler and reduce per-file protobuf overhead. Is
the extra message layer intentional (e.g. planning to add per-hash
metadata like `verified_at`)?
> _answer:_
**2. Should file permissions/mode be stored?** The format stores
mtime/ctime but not Unix file permissions. For archival use this may not
matter, but for software distribution or filesystem restoration it's a
gap. Should we reserve a field now (e.g. `optional uint32 mode = 305`)
even if we don't populate it yet?
> _answer:_
**3. Should `atime` be removed from the schema?** Access time is
volatile, non-deterministic, and often disabled (`noatime`). Including it
means two manifests of the same directory at different times will differ,
which conflicts with the determinism goal. Remove it, or document it as
"never set by default"?
> _answer:_
**4. What are the path normalization rules?** The proto has `string path`
with no specification about: always forward-slash? Must be relative? No
`..` components allowed? UTF-8 NFC vs NFD normalization (macOS vs
Linux)? Max path length? This is a security issue (path traversal) and a
cross-platform compatibility issue. What rules should the spec mandate?
> _answer:_
**5. Should we add a version byte after the magic?** Currently
`ZNAVSRFG` is followed immediately by protobuf. Adding a version byte
(`ZNAVSRFG\x01`) would allow future framing changes without requiring
protobuf parsing to detect the version. `MFFileOuter.Version` serves
this purpose but requires successful deserialization to read. Worth the
extra byte?
> _answer:_
**6. Should we add a length-prefix after the magic?** Protobuf is not
self-delimiting. If we ever want to concatenate manifests or append data
after the protobuf, the current framing is insufficient. Add a varint or
fixed-width length-prefix?
> _answer:_
### Signature Design
**7. What does the outer SHA-256 hash cover — compressed or uncompressed
data?** The code currently hashes compressed data (good for verifying
before decompression), but this should be explicitly documented. Which is
the intended behavior?
> _answer:_
**8. Should `signatureString()` sign raw bytes instead of a hex-encoded
string?** Currently the canonical string is `MAGIC-UUID-MULTIHASH` with
hex encoding, which adds a transformation layer. Signing the raw `sha256`
bytes (or compressed `innerMessage` directly) would be simpler. Keep the
string format or switch to raw bytes?
> _answer:_
**9. Should we support detached signature files (`.mf.sig`)?** Embedded
signatures are better for single-file distribution. Detached `.mf.sig`
files follow the familiar `SHASUMS`/`SHASUMS.asc` pattern and are
simpler for HTTP serving. Support both modes?
> _answer:_
**10. GPG vs pure-Go crypto for signatures?** Shelling out to `gpg` is
fragile (may not be installed, version-dependent output).
`github.com/ProtonMail/go-crypto` provides pure-Go OpenPGP, or we could
use Ed25519/signify (simpler, no key management). Which direction?
> _answer:_
### Implementation Design
**11. Should manifests be deterministic by default?** This means: sort
file entries by path, omit `createdAt` timestamp (or make it opt-in), no
`atime`. Should determinism be the default, with a
`--include-timestamps` flag to opt in?
> _answer:_
**12. Should we consolidate or keep both scanner/checker
implementations?** There are two parallel implementations:
`mfer/scanner.go` + `mfer/checker.go` (typed with `FileSize`,
`RelFilePath`) and `internal/scanner/` + `internal/checker/` (raw
`int64`, `string`). The `mfer/` versions are superior. Delete the
`internal/` versions?
> _answer:_
**13. Should the `manifest` type be exported?** Currently unexported with
exported constructors (`NewManifestFromReader`, `NewManifestFromFile`).
Consumers can't declare `var m *mfer.manifest`. Export the type, or
define an interface?
> _answer:_
**14. What should the Go module path be for 1.0?** Currently
`sneak.berlin/go/mfer` in `go.mod` but `git.eeqj.de/sneak/mfer/mfer` in
the proto `go_package` option. Which is canonical?
> _answer:_
## Implementation Tasks
### Repo Infrastructure
- [ ] Add `.golangci.yml` (fetch from
`https://git.eeqj.de/sneak/prompts/raw/branch/main/.golangci.yml`)
- [ ] Add `.editorconfig`
- [ ] Add `.gitea/workflows/check.yml` that runs `docker build .`
### Format & Correctness
- [ ] Resolve proto `go_package` path inconsistency
(`git.eeqj.de/sneak/mfer/mfer` vs `sneak.berlin/go/mfer`)
- [ ] Specify path invariants — add proto comments requiring UTF-8,
forward-slash, relative paths, no `..`, no leading `/`; validate
in `Builder.AddFile` and `Builder.AddFileWithHash` (pending design
question answer)
- [ ] Remove or deprecate `atime` from proto (pending design question
answer)
- [ ] Reserve `optional uint32 mode = 305` in `MFFilePath` for future
file permissions (pending design question answer)
- [ ] Add version byte after magic — `ZNAVSRFG\x01` for format version
1 (pending design question answer)
- [ ] Write format specification document — separate from README:
magic, outer structure, compression, inner structure, path
invariants, signature scheme, canonical serialization
### Library
- [ ] Delete `internal/scanner/` and `internal/checker/` — consolidate
on `mfer/` package versions; update CLI code (pending design
question answer)
- [ ] Add deterministic file ordering — sort entries by path
(lexicographic, byte-order) in `Builder.Build()`; add test
asserting byte-identical output from two runs
- [ ] Add decompression size limit — `io.LimitReader` in
`deserializeInner()` with `m.pbOuter.Size` as bound
- [ ] Fix `errors.Is` dead code in checker — replace with
`os.IsNotExist(err)` or `errors.Is(err, fs.ErrNotExist)`
- [ ] Fix `AddFile` to verify size — check `totalRead == size` after
reading, return error on mismatch
- [ ] Export the `manifest` type or define a public interface (pending
design question answer) — currently consumers cannot hold a reference
to a loaded manifest in their own type declarations
- [ ] Replace GPG subprocess calls with pure-Go crypto (pending design
question answer) — current implementation shells out to `gpg` which
may not be installed
- [ ] Add timeout to any remaining subprocess calls
### CLI
- [ ] Fix flag naming — all CLI flags should use kebab-case as primary
(`--include-dotfiles`, `--follow-symlinks`)
- [ ] Fix URL construction in fetch — use `BaseURL.JoinPath()` or
`url.JoinPath()` instead of string concatenation
- [ ] Add progress rate-limiting to Checker — throttle to once per
second, matching Scanner
- [ ] Add `--deterministic` flag or make it default — omit `createdAt`,
sort files (pending design question answer)
- [ ] Wire `--version` flag properly (currently only a `version`
subcommand exists; top-level `--version` shows urfave/cli generic
output)
- [ ] Add retry logic to `fetch` — currently no retries on transient
HTTP errors; needs exponential backoff
- [ ] `fetch` command uses bare `http.Get` with no timeout — needs
`http.Client` with configurable timeout
### Testing & Robustness
- [ ] Add fuzzing tests for `NewManifestFromReader` — protobuf
deserialization of untrusted input needs fuzz coverage
- [ ] Add integration test for `freshen` CLI command — current tests
only verify setup, not the actual freshen operation end-to-end
- [ ] Add test for `fetch` CLI command end-to-end (currently only
`downloadFile` is tested)
### Documentation
- [ ] Promote `FORMAT.md` as primary spec reference; README should link
to it more prominently
- [ ] Audit and update all error messages for consistency and
helpfulness
- [ ] Document the signature scheme more thoroughly (canonical string
format, verification steps)
### Release
- [ ] Finalize Go module path
- [ ] Update version constant in `mfer/constants.go`
- [ ] Add `--version` output matching SemVer
- [ ] Tag `v1.0.0`
# See Also # See Also
## Prior Art: Metalink ## Prior Art: Metalink

122
TODO.md
View File

@@ -1,122 +0,0 @@
# TODO: mfer 1.0
## Design Questions
_sneak: please answer inline below each question. These are preserved for posterity._
### Format Design
**1. Should `MFFileChecksum` be simplified?**
Currently it's a separate message wrapping a single `bytes multiHash` field. Since multihash already self-describes the algorithm, `repeated bytes hashes` directly on `MFFilePath` would be simpler and reduce per-file protobuf overhead. Is the extra message layer intentional (e.g. planning to add per-hash metadata like `verified_at`)?
> _answer:_
**2. Should file permissions/mode be stored?**
The format stores mtime/ctime but not Unix file permissions. For archival use (ExFAT, filesystem-independent checksums) this may not matter, but for software distribution or filesystem restoration it's a gap. Should we reserve a field now (e.g. `optional uint32 mode = 305`) even if we don't populate it yet?
> _answer:_
**3. Should `atime` be removed from the schema?**
Access time is volatile, non-deterministic, and often disabled (`noatime`). Including it means two manifests of the same directory at different times will differ, which conflicts with the determinism goal. Remove it, or document it as "never set by default"?
> _answer:_
**4. What are the path normalization rules?**
The proto has `string path` with no specification about: always forward-slash? Must be relative? No `..` components allowed? UTF-8 NFC vs NFD normalization (macOS vs Linux)? Max path length? This is a security issue (path traversal) and a cross-platform compatibility issue. What rules should the spec mandate?
> _answer:_
**5. Should we add a version byte after the magic?**
Currently `ZNAVSRFG` is followed immediately by protobuf. Adding a version byte (`ZNAVSRFG\x01`) would allow future framing changes without requiring protobuf parsing to detect the version. `MFFileOuter.Version` serves this purpose but requires successful deserialization to read. Worth the extra byte?
> _answer:_
**6. Should we add a length-prefix after the magic?**
Protobuf is not self-delimiting. If we ever want to concatenate manifests or append data after the protobuf, the current framing is insufficient. Add a varint or fixed-width length-prefix?
> _answer:_
### Signature Design
**7. What does the outer SHA-256 hash cover — compressed or uncompressed data?**
The review notes it currently hashes compressed data (good for verifying before decompression), but this should be explicitly documented. Which is the intended behavior?
> _answer:_
**8. Should `signatureString()` sign raw bytes instead of a hex-encoded string?**
Currently the canonical string is `MAGIC-UUID-MULTIHASH` with hex encoding, which adds a transformation layer. Signing the raw `sha256` bytes (or compressed `innerMessage` directly) would be simpler. Keep the string format or switch to raw bytes?
> _answer:_
**9. Should we support detached signature files (`.mf.sig`)?**
Embedded signatures are better for single-file distribution. Detached `.mf.sig` files follow the familiar `SHASUMS`/`SHASUMS.asc` pattern and are simpler for HTTP serving. Support both modes?
> _answer:_
**10. GPG vs pure-Go crypto for signatures?**
Shelling out to `gpg` is fragile (may not be installed, version-dependent output). `github.com/ProtonMail/go-crypto` provides pure-Go OpenPGP, or we could go Ed25519/signify (simpler, no key management). Which direction?
> _answer:_
### Implementation Design
**11. Should manifests be deterministic by default?**
This means: sort file entries by path, omit `createdAt` timestamp (or make it opt-in), no `atime`. Should determinism be the default, with a `--include-timestamps` flag to opt in?
> _answer:_
**12. Should we consolidate or keep both scanner/checker implementations?**
There are two parallel implementations: `mfer/scanner.go` + `mfer/checker.go` (typed with `FileSize`, `RelFilePath`) and `internal/scanner/` + `internal/checker/` (raw `int64`, `string`). The `mfer/` versions are superior. Delete the `internal/` versions?
> _answer:_
**13. Should the `manifest` type be exported?**
Currently unexported with exported constructors (`New`, `NewFromPaths`, etc.). Consumers can't declare `var m *mfer.manifest`. Export the type, or define an interface?
> _answer:_
**14. What should the Go module path be for 1.0?**
Currently mixed between `sneak.berlin/go/mfer` and `git.eeqj.de/sneak/mfer`. Which is canonical?
> _answer:_
---
## Implementation Plan
### Phase 1: Foundation (format correctness)
- [ ] Delete `internal/scanner/` and `internal/checker/` — consolidate on `mfer/` package versions; update CLI code
- [ ] Add deterministic file ordering — sort entries by path (lexicographic, byte-order) in `Builder.Build()`; add test asserting byte-identical output from two runs
- [ ] Add decompression size limit — `io.LimitReader` in `deserializeInner()` with `m.pbOuter.Size` as bound
- [ ] Fix `errors.Is` dead code in checker — replace with `os.IsNotExist(err)` or `errors.Is(err, fs.ErrNotExist)`
- [ ] Fix `AddFile` to verify size — check `totalRead == size` after reading, return error on mismatch
- [ ] Specify path invariants — add proto comments (UTF-8, forward-slash, relative, no `..`, no leading `/`); validate in `Builder.AddFile` and `Builder.AddFileWithHash`
### Phase 2: CLI polish
- [ ] Fix flag naming — all CLI flags use kebab-case as primary (`--include-dotfiles`, `--follow-symlinks`)
- [ ] Fix URL construction in fetch — use `BaseURL.JoinPath()` or `url.JoinPath()` instead of string concatenation
- [ ] Add progress rate-limiting to Checker — throttle to once per second, matching Scanner
- [ ] Add `--deterministic` flag (or make it default) — omit `createdAt`, sort files
### Phase 3: Robustness
- [ ] Replace GPG subprocess with pure-Go crypto — `github.com/ProtonMail/go-crypto` or Ed25519/signify
- [ ] Add timeout to any remaining subprocess calls
- [ ] Add fuzzing tests for `NewManifestFromReader`
- [ ] Add retry logic to fetch — exponential backoff for transient HTTP errors
### Phase 4: Format finalization
- [ ] Remove or deprecate `atime` from proto (pending design question answer)
- [ ] Reserve `optional uint32 mode = 305` in `MFFilePath` for future file permissions
- [ ] Add version byte after magic — `ZNAVSRFG\x01` for format version 1
- [ ] Write format specification document — separate from README: magic, outer structure, compression, inner structure, path invariants, signature scheme, canonical serialization
### Phase 5: Release prep
- [ ] Finalize Go module path
- [ ] Audit all error messages for consistency and helpfulness
- [ ] Add `--version` output matching SemVer
- [ ] Tag v1.0.0