# mfer

[mfer](https://git.eeqj.de/sneak/mfer) is a reference implementation library
and thin wrapper command-line utility written in [Go](https://golang.org)
and first published in 2022 under the [WTFPL](https://wtfpl.net) (public
domain) license. It specifies and generates `.mf` manifest files over a
directory tree of files to encapsulate metadata about them (such as
cryptographic checksums or signatures over same) to aid in archiving,
downloading, streaming, or mirroring. The manifest files' data is
serialized with Google's [protobuf serialization
format](https://developers.google.com/protocol-buffers). The structure of
these files can be found [in the format
specification](https://git.eeqj.de/sneak/mfer/src/branch/main/mfer/mf.proto)
which is included in the [project
repository](https://git.eeqj.de/sneak/mfer).

The current version is pre-1.0 and while the repo was published in 2022,
there has not yet been any versioned release. [SemVer](https://semver.org)
will be used for releases.

This project was started by [@sneak](https://sneak.berlin) to scratch an
itch in 2022 and is currently a one-person effort, though the goal is for
this to emerge as a de-facto standard and be incorporated into other
software. A compatible JavaScript library is planned.

# Build Status

CI runs via `docker build .` which executes `make check` (formatting,
linting, tests). The `main` branch must always be green.

# Participation

The community is as yet nonexistent so there are no defined policies or
norms yet. Primary development happens on a privately-run Gitea instance at
[https://git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer) and issues
are [tracked there](https://git.eeqj.de/sneak/mfer/issues).

Changes must always be formatted with a standard `go fmt`, be syntactically
valid, and pass the linting defined in the repository (presently only
the `golangci-lint` defaults), which can be run with `make lint`. The
`main` branch is protected and all changes must be made via [pull
requests](https://git.eeqj.de/sneak/mfer/pulls) and pass CI to be merged.
Any changes submitted to this project must also be
[WTFPL-licensed](https://wtfpl.net) to be considered.

See [`REPO_POLICIES.md`](REPO_POLICIES.md) for detailed coding standards,
tooling requirements, and workflow conventions.

# Problem Statement

Given a plain URL, there is no standard way to safely and programmatically
download everything "under" that URL path. `wget -r` can traverse directory
listings if they're enabled, but every server has a different format, and
this does not verify cryptographic integrity of the files, or enable them to
be fetched using a protocol other than HTTP(S).

Currently, the common solution is sidecar files in the format of `SHASUMS`
checksum files, as well as a `SHASUMS.asc` PGP detached signature. This is
not checksum-algorithm-agnostic and the sidecar file is not always
consistently named.

Real issues I face:

- when I plug in an ExFAT hard drive, I don't know if any files on the
  filesystem are corrupted or missing
  - the current ad-hoc solution is `SHASUMS`/`SHASUMS.asc` files
- when I want to mirror an HTTP archive, I have to use special tools like
  debmirror that understand the archive format
  - the debian repository metadata structure is hot garbage
- when I download a large file via HTTP, I have no way of knowing if the
  file content is what it's supposed to be

# Proposed Solution

A standard, a manifest file format, and a tool for generating same.

The manifest file would be called `index.mf`, and the tool for generating
such would be called `mfer`.

The manifest file would do several important things:

- have a standard filename, so if given
  `https://example.com/downloadpackage/` one could fetch
  `https://example.com/downloadpackage/index.mf` to enumerate the full
  directory listing.
- contain a version field for extensibility
- contain structured data (protobuf, json, or cbor)
- provide an inner signed container, so that the manifest file itself can
  embed a signature and a public key alongside in a single file
- contain a list of files, each with a relative path to the manifest
- contain manifest timestamp
- contain ctime/mtime information for files so that file metadata can be
  preserved
- contain cryptographic checksums in several different algorithms for each
  file
  - probably encoded with multihash to indicate algo + hash
  - sha256 at the minimum
  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file,
    which likely involves doing an ipfs file object chunking
    - maybe even including the complete IPFS/IPLD directory tree objects and
      chunklists?
    - this is because generating an `index.mf` does not imply publishing on
      ipfs at that time
- maybe a bittorrent chunklist for torrent client compatibility? perhaps a
  top-level infohash for the whole manifest?

# Design Goals

- Replace SHASUMS/SHASUMS.asc files
- be easy to download/resume a whole directory tree published via HTTP
- be easy to use across protocols (given an HTTPS url, fetch manifest, then
  download file contents via bittorrent or ipfs)
- not strongly coupled to HTTP use case, should not require special hosting,
  content types, or HTTP headers being sent

# Non-Goals

- Manifest generation speed
  - likely involves IPFS chunking, bittorrent chunking, and several
    different cryptographic hash functions over the entirety of each and
    every file
- Small manifest file size (within reason)
  - 30MiB files are "small" these days, given modern storage/bandwidth
  - metadata size should not be used as an excuse to sacrifice utility (such
    as providing checksums over each chunk of a large file)

# Open Questions

- Should the manifest file include checksums of individual file chunks, or
  just for the whole assembled file?
  - If so, should the chunksize be fixed or dynamic?

- Should the manifest signature format be GnuPG signatures, or those from
  OpenBSD's signify (of which there is a good [golang
  implementation](https://github.com/frankbraun/gosignify))?

- Should the on-disk serialization format be proto3 or json?

# Tool Examples

- `mfer gen` / `mfer gen .`
  - recurses under current directory and writes out an `index.mf`
- `mfer check` / `mfer check .`
  - verifies checksums of all files in manifest, displaying an error and
    exiting nonzero if any files are missing or corrupted
- `mfer fetch https://example.com/stuff/`
  - fetches `/stuff/index.mf` and downloads all files listed in manifest,
    optionally resuming any that already exist locally, and ensures the
    cryptographic integrity of downloaded files.

# Implementation Plan

## Phase One:

- golang module for reusability/embedding
- golang module client providing `mfer` CLI

## Phase Two:

- ES6 or TypeScript module for reusability/embedding
- ES6/TypeScript module client providing `mfer.js` CLI

# Hopes And Dreams

- `aria2c https://example.com/manifestdirectory/`
  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and
    checksums all files, resumes any that exist locally already)
- `mfer fetch https://example.com/manifestdirectory/`
- a command line option to zero/omit mtime/ctime, as well as manifest
  timestamp, and sort all directory listings so that manifest file
  generation is deterministic/reproducible
- URL format `mfer fetch https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2`
  to assert in the URL which PGP signing key should be used in the manifest,
  so that shared URLs have a cryptographic trust root
- a "well-known" key in the manifest that maps well known keys (could reuse
  the http spec) to specific file paths in the manifest.
  - example: a `berlin.sneak.app.slideshow` key that maps to a json
    slideshow config listing what image paths to show, and for how long, and
    in what order

# Use Cases

## Web Images

I'd like to be able to put a bunch of images into a directory, generate a
manifest, and then point a slideshow client (such as an ambient display, or
a react app with the target directory in a query string arg) at that
statically hosted directory, and have it discover the full list of images
available at that URL.

## Software Distribution

I'd like to be able to download a whole tree of files available via HTTP
resumably by either HTTP or IPFS/BitTorrent without a .torrent file.

## Filesystem Archive Integrity

I use filesystems that don't include data checksums, and I would like a
cryptographically signed checksum file so that I can later verify that a set
of archive files have not been modified, none are missing, and that the
checksums have not been altered in storage by a second party.

## Filesystem-Independent Checksums

I would like to be able to plug in a hard drive or flash drive and, if there
is an `index.mf` in the root, automatically detect missing/corrupted files,
regardless of filesystem format.

# Collaboration

Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your
desired username for an account on this Gitea instance.

# TODO: Remaining Work for 1.0

## Design Questions (Owner Decision Required)

These require @sneak's input before implementation. Answers should be added
inline below each question.

### Format Design

**1. Should `MFFileChecksum` be simplified?** Currently it's a separate
message wrapping a single `bytes multiHash` field. Since multihash
already self-describes the algorithm, `repeated bytes hashes` directly on
`MFFilePath` would be simpler and reduce per-file protobuf overhead. Is
the extra message layer intentional (e.g. planning to add per-hash
metadata like `verified_at`)?

> _answer:_

**2. Should file permissions/mode be stored?** The format stores
mtime/ctime but not Unix file permissions. For archival use this may not
matter, but for software distribution or filesystem restoration it's a
gap. Should we reserve a field now (e.g. `optional uint32 mode = 305`)
even if we don't populate it yet?

> _answer:_

**3. Should `atime` be removed from the schema?** Access time is
volatile, non-deterministic, and often disabled (`noatime`). Including it
means two manifests of the same directory at different times will differ,
which conflicts with the determinism goal. Remove it, or document it as
"never set by default"?

> _answer:_

**4. What are the path normalization rules?** The proto has `string path`
with no specification about: always forward-slash? Must be relative? No
`..` components allowed? UTF-8 NFC vs NFD normalization (macOS vs
Linux)? Max path length? This is a security issue (path traversal) and a
cross-platform compatibility issue. What rules should the spec mandate?

> _answer:_

**5. Should we add a version byte after the magic?** Currently
`ZNAVSRFG` is followed immediately by protobuf. Adding a version byte
(`ZNAVSRFG\x01`) would allow future framing changes without requiring
protobuf parsing to detect the version. `MFFileOuter.Version` serves
this purpose but requires successful deserialization to read. Worth the
extra byte?

> _answer:_

**6. Should we add a length-prefix after the magic?** Protobuf is not
self-delimiting. If we ever want to concatenate manifests or append data
after the protobuf, the current framing is insufficient. Add a varint or
fixed-width length-prefix?

> _answer:_

### Signature Design

**7. What does the outer SHA-256 hash cover: compressed or uncompressed
data?** The code currently hashes compressed data (good for verifying
before decompression), but this should be explicitly documented. Which is
the intended behavior?

> _answer:_

**8. Should `signatureString()` sign raw bytes instead of a hex-encoded
string?** Currently the canonical string is `MAGIC-UUID-MULTIHASH` with
hex encoding, which adds a transformation layer. Signing the raw `sha256`
bytes (or compressed `innerMessage` directly) would be simpler. Keep the
string format or switch to raw bytes?

> _answer:_

**9. Should we support detached signature files (`.mf.sig`)?** Embedded
signatures are better for single-file distribution. Detached `.mf.sig`
files follow the familiar `SHASUMS`/`SHASUMS.asc` pattern and are
simpler for HTTP serving. Support both modes?

> _answer:_

**10. GPG vs pure-Go crypto for signatures?** Shelling out to `gpg` is
fragile (may not be installed, version-dependent output).
`github.com/ProtonMail/go-crypto` provides pure-Go OpenPGP, or we could
use Ed25519/signify (simpler, no key management). Which direction?

> _answer:_

### Implementation Design

**11. Should manifests be deterministic by default?** This means: sort
file entries by path, omit `createdAt` timestamp (or make it opt-in), no
`atime`. Should determinism be the default, with a
`--include-timestamps` flag to opt in?

> _answer:_

**12. Should we consolidate or keep both scanner/checker
implementations?** There are two parallel implementations:
`mfer/scanner.go` + `mfer/checker.go` (typed with `FileSize`,
`RelFilePath`) and `internal/scanner/` + `internal/checker/` (raw
`int64`, `string`). The `mfer/` versions are superior. Delete the
`internal/` versions?

> _answer:_

**13. Should the `manifest` type be exported?** Currently unexported with
exported constructors (`NewManifestFromReader`, `NewManifestFromFile`).
Consumers can't declare `var m *mfer.manifest`. Export the type, or
define an interface?

> _answer:_

**14. What should the Go module path be for 1.0?** Currently
`sneak.berlin/go/mfer` in `go.mod` but `git.eeqj.de/sneak/mfer/mfer` in
the proto `go_package` option. Which is canonical?

> _answer:_

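The ordering half of question 11 is small. A sketch, assuming a simplified
stand-in `entry` type rather than the real `MFFilePath` message:

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical stand-in for a manifest file record.
type entry struct {
	Path string
	Size int64
}

// sortEntries orders entries by path. Go compares strings byte by byte,
// so the result does not depend on locale or on directory walk order.
func sortEntries(entries []entry) {
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].Path < entries[j].Path
	})
}

func main() {
	entries := []entry{
		{Path: "b", Size: 1},
		{Path: "a/z", Size: 2},
		{Path: "B", Size: 3},
		{Path: "a", Size: 4},
	}
	sortEntries(entries)
	for _, e := range entries {
		fmt.Println(e.Path)
	}
}
```

Byte order puts uppercase ASCII before lowercase, which may look odd to
humans but is unambiguous across platforms. Sorting alone does not make
output reproducible; timestamps such as `createdAt` would still need to be
omitted or fixed.
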
## Implementation Tasks

### Repo Infrastructure

- [ ] Add `.golangci.yml` (fetch from
  `https://git.eeqj.de/sneak/prompts/raw/branch/main/.golangci.yml`)
- [ ] Add `.editorconfig`
- [ ] Add `.gitea/workflows/check.yml` that runs `docker build .`

### Format & Correctness

- [ ] Resolve proto `go_package` path inconsistency
  (`git.eeqj.de/sneak/mfer/mfer` vs `sneak.berlin/go/mfer`)
- [ ] Specify path invariants: add proto comments requiring UTF-8,
  forward-slash, relative paths, no `..`, no leading `/`; validate
  in `Builder.AddFile` and `Builder.AddFileWithHash` (pending design
  question answer)
- [ ] Remove or deprecate `atime` from proto (pending design question
  answer)
- [ ] Reserve `optional uint32 mode = 305` in `MFFilePath` for future
  file permissions (pending design question answer)
- [ ] Add version byte after magic: `ZNAVSRFG\x01` for format version
  1 (pending design question answer)
- [ ] Write format specification document, separate from README:
  magic, outer structure, compression, inner structure, path
  invariants, signature scheme, canonical serialization

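The path invariants listed above could be enforced with a check along these
lines. This is a sketch pending the design question, not the validation in
`Builder.AddFile`: `validatePath` is a hypothetical name, and rejecting
backslashes, `.` segments, and empty segments are assumptions that go
beyond the rules stated in the task. Unicode NFC/NFD normalization is not
checked here because it needs a package outside the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// validatePath rejects paths that break the proposed invariants:
// valid UTF-8, forward slashes, relative, and no ".." components.
func validatePath(p string) error {
	if p == "" {
		return errors.New("empty path")
	}
	if !utf8.ValidString(p) {
		return errors.New("path is not valid UTF-8")
	}
	if strings.HasPrefix(p, "/") {
		return fmt.Errorf("path %q must be relative", p)
	}
	// Assumption: a backslash is treated as a foreign separator and
	// rejected, even though it is a legal filename byte on Linux.
	if strings.Contains(p, "\\") {
		return fmt.Errorf("path %q must use forward slashes", p)
	}
	for _, seg := range strings.Split(p, "/") {
		switch seg {
		case "":
			// Assumption: "a//b" and "a/" are rejected, not cleaned.
			return fmt.Errorf("path %q has an empty segment", p)
		case ".", "..":
			return fmt.Errorf("path %q contains a %q segment", p, seg)
		}
	}
	return nil
}

func main() {
	for _, p := range []string{"docs/readme.txt", "../etc/passwd", "/abs", "a//b"} {
		if err := validatePath(p); err != nil {
			fmt.Println("reject:", err)
			continue
		}
		fmt.Println("accept:", p)
	}
}
```

Rejecting rather than silently cleaning keeps the check usable on both the
generating side and the consuming side, where a cleaned path could hide a
traversal attempt.
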
### Library

- [ ] Delete `internal/scanner/` and `internal/checker/`: consolidate
  on `mfer/` package versions; update CLI code (pending design
  question answer)
- [ ] Add deterministic file ordering: sort entries by path
  (lexicographic, byte-order) in `Builder.Build()`; add test
  asserting byte-identical output from two runs
- [ ] Add decompression size limit: `io.LimitReader` in
  `deserializeInner()` with `m.pbOuter.Size` as bound
- [ ] Fix `errors.Is` dead code in checker: replace with
  `os.IsNotExist(err)` or `errors.Is(err, fs.ErrNotExist)`
- [ ] Fix `AddFile` to verify size: check `totalRead == size` after
  reading, return error on mismatch
- [ ] Export the `manifest` type or define a public interface (pending
  design question answer); currently consumers cannot hold a reference
  to a loaded manifest in their own type declarations
- [ ] Replace GPG subprocess calls with pure-Go crypto (pending design
  question answer); current implementation shells out to `gpg` which
  may not be installed
- [ ] Add timeout to any remaining subprocess calls

### CLI

- [ ] Fix flag naming: all CLI flags should use kebab-case as primary
  (`--include-dotfiles`, `--follow-symlinks`)
- [ ] Fix URL construction in fetch: use `BaseURL.JoinPath()` or
  `url.JoinPath()` instead of string concatenation
- [ ] Add progress rate-limiting to Checker: throttle to once per
  second, matching Scanner
- [ ] Add `--deterministic` flag or make it default: omit `createdAt`,
  sort files (pending design question answer)
- [ ] Wire `--version` flag properly (currently only a `version`
  subcommand exists; top-level `--version` shows urfave/cli generic
  output)
- [ ] Add retry logic to `fetch`: currently no retries on transient
  HTTP errors; needs exponential backoff
- [ ] Give `fetch` an `http.Client` with a configurable timeout; it
  currently uses bare `http.Get` with no timeout

### Testing & Robustness

- [ ] Add fuzzing tests for `NewManifestFromReader`: protobuf
  deserialization of untrusted input needs fuzz coverage
- [ ] Add integration test for `freshen` CLI command: current tests
  only verify setup, not the actual freshen operation end-to-end
- [ ] Add test for `fetch` CLI command end-to-end (currently only
  `downloadFile` is tested)

### Documentation

- [ ] Promote `FORMAT.md` as primary spec reference; README should link
  to it more prominently
- [ ] Audit and update all error messages for consistency and
  helpfulness
- [ ] Document the signature scheme more thoroughly (canonical string
  format, verification steps)

### Release

- [ ] Finalize Go module path
- [ ] Update version constant in `mfer/constants.go`
- [ ] Add `--version` output matching SemVer
- [ ] Tag `v1.0.0`

# See Also

## Prior Art: Metalink

- [Metalink - Mozilla Wiki](https://wiki.mozilla.org/Metalink)
- [Metalink - Wikipedia](https://en.wikipedia.org/wiki/Metalink)
- [RFC 5854 - The Metalink Download Description Format](https://datatracker.ietf.org/doc/html/rfc5854)
- [RFC 6249 - Metalink/HTTP: Mirrors and Hashes](https://www.rfc-editor.org/rfc/rfc6249.html)

## Links

- Repo: [https://git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer)
- Issues: [https://git.eeqj.de/sneak/mfer/issues](https://git.eeqj.de/sneak/mfer/issues)

# Authors

- [@sneak <sneak@sneak.berlin>](mailto:sneak@sneak.berlin)

# License

- [WTFPL](https://wtfpl.net)