# mfer

[mfer](https://git.eeqj.de/sneak/mfer) is a reference implementation library
and thin wrapper command-line utility written in [Go](https://golang.org)
and first published in 2022 under the [WTFPL](https://wtfpl.net) (public
domain) license. It specifies and generates `.mf` manifest files over a
directory tree of files to encapsulate metadata about them (such as
cryptographic checksums or signatures over same) to aid in archiving,
downloading, streaming, or mirroring. The manifest files' data is
serialized with Google's [protobuf serialization
format](https://developers.google.com/protocol-buffers). The structure of
these files can be found [in the format
specification](https://git.eeqj.de/sneak/mfer/src/branch/main/mfer/mf.proto)
which is included in the [project
repository](https://git.eeqj.de/sneak/mfer).

The current version is pre-1.0 and, while the repo was published in 2022,
there has not yet been any versioned release. [SemVer](https://semver.org)
will be used for releases.

This project was started by [@sneak](https://sneak.berlin) to scratch an
itch in 2022 and is currently a one-person effort, though the goal is for
this to emerge as a de facto standard and be incorporated into other
software. A compatible JavaScript library is planned.

# Build Status

CI runs via `docker build .` which executes `make check` (formatting,
linting, tests). The `main` branch must always be green.

# Participation

The community is as yet nonexistent, so there are no defined policies or
norms yet. Primary development happens on a privately-run Gitea instance at
[https://git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer) and issues
are [tracked there](https://git.eeqj.de/sneak/mfer/issues).

Changes must always be formatted with a standard `go fmt`, be syntactically
valid, and pass the linting defined in the repository (presently only
the `golangci-lint` defaults), which can be run with `make lint`. The
`main` branch is protected and all changes must be made via [pull
requests](https://git.eeqj.de/sneak/mfer/pulls) and pass CI to be merged.
Any changes submitted to this project must also be
[WTFPL-licensed](https://wtfpl.net) to be considered.

See [`REPO_POLICIES.md`](REPO_POLICIES.md) for detailed coding standards,
tooling requirements, and workflow conventions.

# Problem Statement

Given a plain URL, there is no standard way to safely and programmatically
download everything "under" that URL path. `wget -r` can traverse directory
listings if they're enabled, but every server has a different format, and
this does not verify the cryptographic integrity of the files or enable them
to be fetched using a protocol other than HTTP(S).

Currently, the solution people use is sidecar files in the `SHASUMS`
checksum format, along with a `SHASUMS.asc` PGP detached signature. This is
not checksum-algorithm-agnostic and the sidecar file is not always
consistently named.

Real issues I face:

- when I plug in an ExFAT hard drive, I don't know if any files on the
  filesystem are corrupted or missing
  - the current ad-hoc solution is `SHASUMS`/`SHASUMS.asc` files
- when I want to mirror an HTTP archive, I have to use special tools like
  debmirror that understand the archive format
  - the debian repository metadata structure is hot garbage
- when I download a large file via HTTP, I have no way of knowing if the
  file content is what it's supposed to be

# Proposed Solution

A standard, a manifest file format, and a tool for generating same.

The manifest file would be called `index.mf`, and the tool for generating
such would be called `mfer`.

The manifest file would do several important things:

- have a standard filename, so if given
  `https://example.com/downloadpackage/` one could fetch
  `https://example.com/downloadpackage/index.mf` to enumerate the full
  directory listing.
- contain a version field for extensibility
- contain structured data (protobuf, json, or cbor)
- provide an inner signed container, so that the manifest file itself can
  embed a signature and a public key alongside in a single file
- contain a list of files, each with a relative path to the manifest
- contain manifest timestamp
- contain ctime/mtime information for files so that file metadata can be
  preserved
- contain cryptographic checksums in several different algorithms for each
  file
  - probably encoded with multihash to indicate algo + hash
  - sha256 at the minimum
  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file,
    which likely involves doing an ipfs file object chunking
    - maybe even including the complete IPFS/IPLD directory tree objects and
      chunklists?
    - this is because generating an `index.mf` does not imply publishing on
      ipfs at that time
  - maybe a bittorrent chunklist for torrent client compatibility? perhaps a
    top-level infohash for the whole manifest?

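
The "multihash to indicate algo + hash" idea above can be illustrated with a minimal sketch. It assumes the standard multihash layout for SHA-256 (function code `0x12`, digest length `0x20`, then the digest); the helper name `sha256Multihash` is made up for this example and is not part of the mfer library.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sha256Multihash returns the multihash encoding of the SHA-256 digest of
// data: the function code 0x12, the digest length 0x20 (32 bytes), then the
// digest itself. The name is hypothetical, chosen for this sketch only.
func sha256Multihash(data []byte) []byte {
	sum := sha256.Sum256(data)
	out := make([]byte, 0, 2+len(sum))
	out = append(out, 0x12, 0x20)
	return append(out, sum[:]...)
}

func main() {
	mh := sha256Multihash([]byte("hello"))
	// The first two bytes identify the algorithm and digest length, so a
	// reader can tell how to verify the hash without out-of-band metadata.
	fmt.Println(hex.EncodeToString(mh))
}
```

Because the algorithm travels with the digest, a manifest could carry several hashes per file without a separate "algorithm" field for each.
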
# Design Goals

- Replace SHASUMS/SHASUMS.asc files
- be easy to download/resume a whole directory tree published via HTTP
- be easy to use across protocols (given an HTTPS url, fetch manifest, then
  download file contents via bittorrent or ipfs)
- not strongly coupled to HTTP use case, should not require special hosting,
  content types, or HTTP headers being sent

# Non-Goals

- Manifest generation speed
  - likely involves IPFS chunking, bittorrent chunking, and several
    different cryptographic hash functions over the entirety of each and
    every file
- Small manifest file size (within reason)
  - 30MiB files are "small" these days, given modern storage/bandwidth
  - metadata size should not be used as an excuse to sacrifice utility (such
    as providing checksums over each chunk of a large file)

# Open Questions

- Should the manifest file include checksums of individual file chunks, or
  just for the whole assembled file?
  - If so, should the chunk size be fixed or dynamic?

- Should the manifest signature format be GnuPG signatures, or those from
  OpenBSD's signify (of which there is a good [golang
  implementation](https://github.com/frankbraun/gosignify))?

- Should the on-disk serialization format be proto3 or json?

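
To make the chunk-checksum question concrete, here is a minimal sketch of the fixed-chunk-size option only. The function names and the choice of SHA-256 are assumptions for illustration; nothing here reflects a decision about the format.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
	"io"
)

// chunkSums reads r to the end and returns one SHA-256 digest per
// chunkSize-byte chunk; the final chunk may be shorter. Hypothetical helper.
func chunkSums(r io.Reader, chunkSize int) ([][sha256.Size]byte, error) {
	if chunkSize <= 0 {
		return nil, errors.New("chunkSize must be positive")
	}
	buf := make([]byte, chunkSize)
	var sums [][sha256.Size]byte
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			sums = append(sums, sha256.Sum256(buf[:n]))
		}
		switch {
		case err == nil:
			continue // a full chunk was read; keep going
		case errors.Is(err, io.EOF), errors.Is(err, io.ErrUnexpectedEOF):
			return sums, nil // end of input, possibly after a short chunk
		default:
			return nil, err
		}
	}
}

// chunkSumsBytes is a convenience wrapper for in-memory data.
func chunkSumsBytes(data []byte, chunkSize int) ([][sha256.Size]byte, error) {
	return chunkSums(bytes.NewReader(data), chunkSize)
}

func main() {
	sums, err := chunkSumsBytes([]byte("abcdefghij"), 4)
	if err != nil {
		panic(err)
	}
	for i, s := range sums {
		fmt.Printf("chunk %d: %x\n", i, s)
	}
}
```

A fixed chunk size keeps the chunk boundaries implicit (offset = index × size), whereas a dynamic scheme would need to record each chunk's length in the manifest.
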
# Tool Examples

- `mfer gen` / `mfer gen .`
  - recurses under current directory and writes out an `index.mf`
- `mfer check` / `mfer check .`
  - verifies checksums of all files in manifest, displaying error and
    exiting nonzero if any files are missing or corrupted
- `mfer fetch https://example.com/stuff/`
  - fetches `/stuff/index.mf` and downloads all files listed in manifest,
    optionally resuming any that already exist locally, and assures
    cryptographic integrity of downloaded files.

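
The behaviour described for `mfer check` can be sketched roughly as follows. This is not the tool's actual implementation: the expected checksums are a plain map rather than a parsed manifest, only SHA-256 is used, and `checkTree` and `demoProblems` are hypothetical names. The demo uses an in-memory tree from `testing/fstest` so it does not touch the disk.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"io/fs"
	"sort"
	"testing/fstest"
)

// checkTree compares every expected entry against the file tree and returns
// one human-readable problem per missing or corrupted file, sorted by path.
// For brevity it reads each file fully into memory instead of streaming.
func checkTree(fsys fs.FS, want map[string][sha256.Size]byte) []string {
	var problems []string
	for path, sum := range want {
		data, err := fs.ReadFile(fsys, path)
		switch {
		case errors.Is(err, fs.ErrNotExist):
			problems = append(problems, path+": missing")
		case err != nil:
			problems = append(problems, path+": "+err.Error())
		case sha256.Sum256(data) != sum:
			problems = append(problems, path+": checksum mismatch")
		}
	}
	sort.Strings(problems)
	return problems
}

// demoProblems checks an in-memory tree holding one intact file and one
// corrupted file against expectations that also list a file that is absent.
func demoProblems() []string {
	tree := fstest.MapFS{
		"ok.txt":  {Data: []byte("intact")},
		"bad.txt": {Data: []byte("tampered")},
	}
	want := map[string][sha256.Size]byte{
		"ok.txt":   sha256.Sum256([]byte("intact")),
		"bad.txt":  sha256.Sum256([]byte("original")),
		"gone.txt": sha256.Sum256([]byte("never written")),
	}
	return checkTree(tree, want)
}

func main() {
	problems := demoProblems()
	for _, p := range problems {
		fmt.Println(p)
	}
	// A real CLI would exit nonzero here when len(problems) > 0.
	fmt.Printf("%d problem(s) found\n", len(problems))
}
```
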
# Implementation Plan

## Phase One:

- golang module for reusability/embedding
- golang module client providing `mfer` CLI

## Phase Two:

- ES6 or TypeScript module for reusability/embedding
- ES6/TypeScript module client providing `mfer.js` CLI

# Hopes And Dreams

- `aria2c https://example.com/manifestdirectory/`
  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and
    checksums all files, resumes any that exist locally already)
- `mfer fetch https://example.com/manifestdirectory/`
- a command line option to zero/omit mtime/ctime, as well as manifest
  timestamp, and sort all directory listings so that manifest file
  generation is deterministic/reproducible
- URL format `mfer fetch https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2`
  to assert in the URL which PGP signing key should be used in the manifest,
  so that shared URLs have a cryptographic trust root
- a "well-known" key in the manifest that maps well known keys (could reuse
  the http spec) to specific file paths in the manifest.
  - example: a `berlin.sneak.app.slideshow` key that maps to a json
    slideshow config listing what image paths to show, and for how long, and
    in what order

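
As a rough illustration of the `?key=` URL idea above, the following sketch splits such a URL into the manifest location and the asserted signing-key fingerprint. The function name `parseFetchURL` and the validation rule (a 40-hex-digit fingerprint) are assumptions made for this example; the wished-for feature does not exist yet.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"net/url"
	"path"
	"strings"
)

// parseFetchURL returns the URL of index.mf under the given directory URL
// and the upper-cased fingerprint from the optional "key" query parameter.
// Hypothetical helper; an empty fingerprint means no key was asserted.
func parseFetchURL(raw string) (manifestURL, fingerprint string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	fingerprint = strings.ToUpper(u.Query().Get("key"))
	if fingerprint != "" {
		b, decErr := hex.DecodeString(fingerprint)
		if decErr != nil || len(b) != 20 {
			return "", "", errors.New("key must be a 40-hex-digit fingerprint")
		}
	}
	// Point at the manifest itself and drop the query and fragment.
	u.Path = path.Join("/", u.Path, "index.mf")
	u.RawPath = ""
	u.RawQuery = ""
	u.Fragment = ""
	return u.String(), fingerprint, nil
}

func main() {
	raw := "https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2"
	manifest, key, err := parseFetchURL(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("manifest:", manifest)
	fmt.Println("required signing key:", key)
}
```

A client following this scheme would refuse the manifest unless its embedded signature verifies against the key with that fingerprint.
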
# Use Cases

## Web Images

I'd like to be able to put a bunch of images into a directory, generate a
manifest, and then point a slideshow client (such as an ambient display, or
a react app with the target directory in a query string arg) at that
statically hosted directory, and have it discover the full list of images
available at that URL.

## Software Distribution

I'd like to be able to download a whole tree of files available via HTTP
resumably by either HTTP or IPFS/BitTorrent without a .torrent file.

## Filesystem Archive Integrity

I use filesystems that don't include data checksums, and I would like a
cryptographically signed checksum file so that I can later verify that a set
of archive files have not been modified, none are missing, and that the
checksums have not been altered in storage by a second party.

## Filesystem-Independent Checksums

I would like to be able to plug in a hard drive or flash drive and, if there
is an `index.mf` in the root, automatically detect missing/corrupted files,
regardless of filesystem format.

# Collaboration

Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your
desired username for an account on this Gitea instance.

# TODO: Remaining Work for 1.0

## Design Questions (Owner Decision Required)

These require @sneak's input before implementation. Answers should be added
inline below each question.

**1. Should `MFFileChecksum` be simplified?** Currently it's a separate
message wrapping a single `bytes multiHash` field. Since multihash
already self-describes the algorithm, `repeated bytes hashes` directly on
`MFFilePath` would be simpler and reduce per-file protobuf overhead. Is
the extra message layer intentional (e.g. planning to add per-hash
metadata like `verified_at`)?

> _answer:_

**2. Should file permissions/mode be stored?** The format stores
mtime/ctime but not Unix file permissions. For archival use this may not
matter, but for software distribution or filesystem restoration it's a
gap. Should we reserve a field now (e.g. `optional uint32 mode = 305`)
even if we don't populate it yet?

> _answer:_

**3. Should the `manifest` type be exported?** Currently unexported with
exported constructors (`NewManifestFromReader`, `NewManifestFromFile`).
Consumers can't declare `var m *mfer.manifest`. Export the type, or
define an interface?

> _answer:_

**4. What should the Go module path be for 1.0?** Currently
`sneak.berlin/go/mfer` in `go.mod` but `git.eeqj.de/sneak/mfer/mfer` in
the proto `go_package` option. Which is canonical?

> _answer:_

**5. GPG vs pure-Go crypto for signatures?** Shelling out to `gpg` is
fragile (may not be installed, version-dependent output).
`github.com/ProtonMail/go-crypto` provides pure-Go OpenPGP, or we could
use Ed25519/signify (simpler, no key management). Which direction?

> _answer:_

**6. Should we add a version byte after the magic?** Currently
`ZNAVSRFG` is followed immediately by protobuf. Adding a version byte
(`ZNAVSRFG\x01`) would allow future framing changes without requiring
protobuf parsing to detect the version. `MFFileOuter.Version` serves
this purpose but requires successful deserialization to read. Worth the
extra byte?

> _answer:_

**7. Should we add a length-prefix after the magic?** Protobuf is not
self-delimiting. If we ever want to concatenate manifests or append data
after the protobuf, the current framing is insufficient. Add a varint or
fixed-width length-prefix?

> _answer:_

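
To help weigh questions 6 and 7, here is a minimal sketch of what a combined framing could look like: magic, one version byte, a varint length prefix, then the payload. This layout is purely hypothetical and is not the current format; `frame` and `unframe` are illustrative names, and the payload is treated as opaque bytes rather than real protobuf.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	magic         = "ZNAVSRFG" // magic from the existing format
	formatVersion = 1          // hypothetical version byte (question 6)
)

// frame wraps payload as magic + version byte + uvarint length + payload.
func frame(payload []byte) []byte {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(payload)))
	out := make([]byte, 0, len(magic)+1+n+len(payload))
	out = append(out, magic...)
	out = append(out, formatVersion)
	out = append(out, lenBuf[:n]...)
	return append(out, payload...)
}

// unframe parses one frame from the front of data and returns its version,
// its payload, and whatever bytes follow the frame.
func unframe(data []byte) (version byte, payload, rest []byte, err error) {
	if !bytes.HasPrefix(data, []byte(magic)) {
		return 0, nil, nil, errors.New("bad magic")
	}
	data = data[len(magic):]
	if len(data) < 1 {
		return 0, nil, nil, errors.New("missing version byte")
	}
	version, data = data[0], data[1:]
	size, n := binary.Uvarint(data)
	if n <= 0 {
		return 0, nil, nil, errors.New("bad length prefix")
	}
	data = data[n:]
	if size > uint64(len(data)) {
		return 0, nil, nil, errors.New("truncated payload")
	}
	return version, data[:size], data[size:], nil
}

func main() {
	// Two frames back to back: the length prefix makes the boundary explicit.
	stream := append(frame([]byte("first")), frame([]byte("second"))...)
	for len(stream) > 0 {
		v, payload, rest, err := unframe(stream)
		if err != nil {
			panic(err)
		}
		fmt.Printf("version %d, payload %q\n", v, payload)
		stream = rest
	}
}
```

The version byte can be read before any protobuf parsing, and the length prefix is what allows concatenation or trailing data, at the cost of a few extra bytes per manifest.
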
## Implementation Tasks

### Repo Infrastructure

- [ ] Add `.golangci.yml` (fetch from
  `https://git.eeqj.de/sneak/prompts/raw/branch/main/.golangci.yml`)
- [ ] Add `.editorconfig`
- [ ] Add `.gitea/workflows/check.yml` that runs `docker build .`

### Format & Correctness

- [ ] Resolve proto `go_package` path inconsistency
  (`git.eeqj.de/sneak/mfer/mfer` vs `sneak.berlin/go/mfer`)
- [ ] Reserve `optional uint32 mode = 305` in `MFFilePath` for future
  file permissions (pending design question answer)
- [ ] Add version byte after magic: `ZNAVSRFG\x01` for format version
  1 (pending design question answer)

### Library

- [ ] Export the `manifest` type or define a public interface (pending
  design question answer); currently consumers cannot hold a reference
  to a loaded manifest in their own type declarations
- [ ] Replace GPG subprocess calls with pure-Go crypto (pending design
  question answer); the current implementation shells out to `gpg`, which
  may not be installed

### CLI

- [ ] Wire `--version` flag properly (currently only a `version`
  subcommand exists; top-level `--version` shows urfave/cli generic
  output)
- [ ] Add `--deterministic` flag documentation: deterministic output is
  the default, but this isn't obvious from CLI help
- [ ] Add retry logic to `fetch`: there are currently no retries on
  transient HTTP errors; needs exponential backoff
- [ ] Give `fetch` an `http.Client` with a configurable timeout (it
  currently uses bare `http.Get` with no timeout)

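
The last two `fetch` tasks above (retries with exponential backoff, and a client with a timeout) could be approached along these lines. This is a sketch under stated assumptions, not the planned implementation: the helper names, the 30-second cap, and the choice of which statuses count as transient (429 and 5xx) are all illustrative. For simplicity it also retries every transport error, although not all of them are transient.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// backoffDelay returns the wait before retry number attempt (0-based):
// base doubled attempt times, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// retryable reports whether an HTTP status is treated as transient here.
func retryable(status int) bool {
	return status == http.StatusTooManyRequests || status >= 500
}

// fetchWithRetry GETs url, retrying transport errors and transient statuses
// up to maxAttempts times with exponential backoff between attempts.
func fetchWithRetry(client *http.Client, url string, maxAttempts int, base time.Duration) ([]byte, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			time.Sleep(backoffDelay(attempt-1, base, 30*time.Second))
		}
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		switch {
		case readErr != nil:
			lastErr = readErr
		case resp.StatusCode == http.StatusOK:
			return body, nil
		case retryable(resp.StatusCode):
			lastErr = fmt.Errorf("GET %s: %s", url, resp.Status)
		default:
			return nil, fmt.Errorf("GET %s: %s", url, resp.Status)
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// A client with an overall per-request timeout instead of bare http.Get.
	client := &http.Client{Timeout: 30 * time.Second}
	fmt.Println("request timeout:", client.Timeout)
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("retry %d waits %v\n", attempt, backoffDelay(attempt, 500*time.Millisecond, 8*time.Second))
	}
	// Usage (not executed here to avoid network access):
	//   body, err := fetchWithRetry(client, "https://example.com/stuff/index.mf", 5, 500*time.Millisecond)
}
```

Adding random jitter to each delay would be a natural refinement if many clients fetch from the same server.
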
### Testing & Robustness

- [ ] Add fuzzing tests for `NewManifestFromReader`: protobuf
  deserialization of untrusted input needs fuzz coverage
- [ ] Add integration test for `freshen` CLI command: current tests
  only verify setup, not the actual freshen operation end-to-end
- [ ] Add test for `fetch` CLI command end-to-end (currently only
  `downloadFile` is tested)

### Documentation

- [ ] Write standalone format specification document: `FORMAT.md`
  exists but should be promoted to the primary spec reference; README
  should link to it more prominently
- [ ] Audit and update all error messages for consistency and
  helpfulness
- [ ] Document the signature scheme more thoroughly (canonical string
  format, verification steps)

### Release

- [ ] Finalize Go module path
- [ ] Update version constant in `mfer/constants.go`
- [ ] Add `--version` output matching SemVer
- [ ] Tag `v1.0.0`

# See Also

## Prior Art: Metalink

- [Metalink - Mozilla Wiki](https://wiki.mozilla.org/Metalink)
- [Metalink - Wikipedia](https://en.wikipedia.org/wiki/Metalink)
- [RFC 5854 - The Metalink Download Description Format](https://datatracker.ietf.org/doc/html/rfc5854)
- [RFC 6249 - Metalink/HTTP: Mirrors and Hashes](https://www.rfc-editor.org/rfc/rfc6249.html)

## Links

- Repo: [https://git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer)
- Issues: [https://git.eeqj.de/sneak/mfer/issues](https://git.eeqj.de/sneak/mfer/issues)

# Authors

- [@sneak <sneak@sneak.berlin>](mailto:sneak@sneak.berlin)

# License

- [WTFPL](https://wtfpl.net)