mfer

Manifest file generator and checker.

Problem Statement

Given a plain URL, there is no standard way to safely and programmatically download everything "under" that URL path. wget -r can traverse directory listings if they're enabled, but every server formats its listings differently, and this approach neither verifies the cryptographic integrity of the files nor allows them to be fetched over a protocol other than HTTP/HTTPS.

Currently, people work around this with sidecar files: a SHASUMS checksum file plus a SHASUMS.asc detached PGP signature. This approach is not checksum-algorithm-agnostic, and the sidecar file is not consistently named.

Proposed Solution

A standard, a manifest file format, and a tool for generating same.

The manifest file would be called index.mf, and the tool for generating such would be called mfer.

The manifest file would do several important things:

  • have a standard filename, so if given https://example.com/downloadpackage/ one could fetch https://example.com/downloadpackage/index.mf to enumerate the full directory listing.
  • contain a version field for extensibility
  • contain structured data (Protobuf, JSON, or CBOR)
  • provide an inner signed container, so that the manifest can carry its own signature and public key in a single file
  • contain a list of files, each with a path relative to the manifest
  • contain a manifest timestamp
  • contain ctime/mtime information for files so that file metadata can be preserved
  • contain cryptographic checksums in several different algorithms for each file
    • probably encoded with multihash to indicate algo + hash
    • sha256 at the minimum
    • would be nice to include an IPFS/IPLD CIDv1 root hash for each file, which likely involves performing IPFS-style file object chunking
    • maybe a bittorrent chunklist for torrent client compatibility? perhaps a top-level infohash for the whole manifest?
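
As a rough illustration of the list above, here is a minimal Go sketch of what a manifest might look like if JSON were the chosen encoding. Every type name, field name, and the choice of hex-encoded multihashes is an assumption made for illustration; none of it is a settled format, and the signed inner container is left out. The only fixed detail is the multihash prefix for sha2-256: the code byte 0x12 followed by the digest length 0x20.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"time"
)

// FileEntry describes one file listed in the manifest (hypothetical fields).
type FileEntry struct {
	Path   string    `json:"path"` // relative to the directory holding index.mf
	Size   int64     `json:"size"`
	Mtime  time.Time `json:"mtime"`
	Ctime  time.Time `json:"ctime"`
	Hashes []string  `json:"hashes"` // hex-encoded multihashes, sha256 at minimum
}

// Manifest is the top-level document (hypothetical layout).
type Manifest struct {
	Version   int         `json:"version"`
	CreatedAt time.Time   `json:"createdAt"`
	Files     []FileEntry `json:"files"`
}

// sha256Multihash returns the hex multihash of data: the sha2-256 code (0x12),
// the digest length in bytes (0x20), then the 32-byte digest itself.
func sha256Multihash(data []byte) string {
	sum := sha256.Sum256(data)
	mh := append([]byte{0x12, 0x20}, sum[:]...)
	return hex.EncodeToString(mh)
}

// encodeManifest serializes a manifest as indented JSON.
func encodeManifest(m Manifest) ([]byte, error) {
	return json.MarshalIndent(m, "", "  ")
}
```

Because the algorithm is named inside each hash string, further algorithms could be added to Hashes later without changing the surrounding structure.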

Design Goals

  • Replace SHASUMS/SHASUMS.asc files
  • be easy to download/resume
  • be easy to use across protocols (given an HTTPS url, fetch manifest, then download file contents via bittorrent or ipfs)

Non-Goals

  • Manifest generation speed
  • Small manifest file size (within reason)

Open Questions

  • Should the manifest file include checksums of individual file chunks, or just of the whole assembled file?
    • If so, should the chunk size be fixed or dynamic?
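
To make the fixed-size option concrete, here is a small Go sketch that hashes a stream in fixed-size chunks while also computing the whole-file hash in the same pass. The function names and the use of plain hex SHA-256 are assumptions for illustration; this does not answer the open question, it only shows that both kinds of checksum can be produced from a single read.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"io"
)

// hashChunks reads r in chunks of chunkSize bytes and returns the hex SHA-256
// of each chunk plus the hex SHA-256 of the whole stream. The last chunk may
// be shorter than chunkSize.
func hashChunks(r io.Reader, chunkSize int) (chunks []string, whole string, err error) {
	if chunkSize <= 0 {
		return nil, "", errors.New("chunk size must be positive")
	}
	total := sha256.New()
	buf := make([]byte, chunkSize)
	for {
		n, readErr := io.ReadFull(r, buf)
		if n > 0 {
			sum := sha256.Sum256(buf[:n])
			chunks = append(chunks, hex.EncodeToString(sum[:]))
			total.Write(buf[:n])
		}
		// io.EOF means nothing was left; io.ErrUnexpectedEOF means a short final chunk.
		if readErr == io.EOF || readErr == io.ErrUnexpectedEOF {
			break
		}
		if readErr != nil {
			return nil, "", readErr
		}
	}
	return chunks, hex.EncodeToString(total.Sum(nil)), nil
}

// hashChunksOf is a convenience wrapper for in-memory data.
func hashChunksOf(data []byte, chunkSize int) ([]string, string, error) {
	return hashChunks(bytes.NewReader(data), chunkSize)
}
```

A dynamic (content-defined) chunker would replace the fixed buffer with a boundary-detection step, but the per-chunk and whole-file hashing would stay the same.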

Tool Examples

  • mfer gen / mfer gen .
    • recurses under current directory and writes out an index.mf
  • mfer check / mfer check .
    • verifies checksums of all files in manifest, displaying error and exiting nonzero if any files are missing or corrupted
  • mfer fetch https://example.com/stuff/
    • fetches /stuff/index.mf and downloads all files listed in the manifest, optionally resuming any that already exist locally, and verifies the cryptographic integrity of the downloaded files.
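
The core of mfer check could be sketched in Go roughly as follows. This is only an illustration under stated assumptions: the manifest is reduced to a hypothetical map from relative path to expected hex SHA-256, and the function names are invented here. A real tool would also need to parse index.mf and reject paths that escape the root directory.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
)

// sha256File returns the hex SHA-256 of the file at path.
func sha256File(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// checkFiles verifies each entry in want (relative path -> expected hex
// SHA-256) under root and returns one sorted message per problem found.
func checkFiles(root string, want map[string]string) []string {
	var problems []string
	for rel, expected := range want {
		got, err := sha256File(filepath.Join(root, filepath.FromSlash(rel)))
		switch {
		case errors.Is(err, os.ErrNotExist):
			problems = append(problems, fmt.Sprintf("%s: missing", rel))
		case err != nil:
			problems = append(problems, fmt.Sprintf("%s: %v", rel, err))
		case got != expected:
			problems = append(problems, fmt.Sprintf("%s: corrupted", rel))
		}
	}
	sort.Strings(problems)
	return problems
}
```

A CLI wrapper would print each returned message and exit nonzero whenever the slice is not empty.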

Implementation Plan

Phase One:

  • golang module for reusability/embedding
  • golang module client providing mfer CLI

Phase Two:

  • ES5 or TypeScript module for reusability/embedding
  • ES5/TypeScript module client providing mfjs CLI

Hopes And Dreams

  • aria2c https://example.com/manifestdirectory/
    • (fetches https://example.com/manifestdirectory/index.mf, downloads and checksums all files, resumes any that exist locally already)
  • mfer fetch https://example.com/manifestdirectory/

Use Cases

Web Images

I'd like to be able to put a bunch of images into a directory, generate a manifest, and then point a slideshow client (such as an ambient display, or a React app that takes the target directory as a query string argument) at that statically hosted directory, and have it discover the full list of images available at that URL.
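
The discovery step might look like the following Go sketch (Go is used here only to match the planned first implementation; a browser client would do the same in JavaScript). It derives the index.mf URL from a directory URL and turns relative manifest paths into absolute image URLs. The function names and the list of image extensions are assumptions, and fetching and parsing the manifest are left out.

```go
package main

import (
	"net/url"
	"path"
	"strings"
)

// dirURL parses a directory URL and makes sure its path ends in "/", so that
// relative references resolve inside the directory rather than beside it.
func dirURL(dir string) (*url.URL, error) {
	base, err := url.Parse(dir)
	if err != nil {
		return nil, err
	}
	if !strings.HasSuffix(base.Path, "/") {
		base.Path += "/"
	}
	return base, nil
}

// manifestURL returns the location of index.mf for a directory URL.
func manifestURL(dir string) (string, error) {
	base, err := dirURL(dir)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(&url.URL{Path: "index.mf"}).String(), nil
}

// imageURLs keeps the manifest paths that look like images (by extension)
// and resolves each against the directory URL.
func imageURLs(dir string, paths []string) ([]string, error) {
	base, err := dirURL(dir)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, p := range paths {
		switch strings.ToLower(path.Ext(p)) {
		case ".jpg", ".jpeg", ".png", ".gif", ".webp":
			out = append(out, base.ResolveReference(&url.URL{Path: p}).String())
		}
	}
	return out, nil
}
```

For example, manifestURL("https://example.com/pics") and manifestURL("https://example.com/pics/") both yield https://example.com/pics/index.mf.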

Software Distribution

I'd like to be able to resumably download a whole tree of files published via HTTP, using either HTTP or IPFS/BitTorrent, without needing a .torrent file.
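
For the plain-HTTP case, resuming can be built on the standard Range request header. The Go sketch below is one possible shape, with invented function names; it assumes the partial file on disk is a valid prefix of the remote file, which is exactly what the manifest checksum would confirm after the download finishes.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// resumeRequest builds a GET that asks only for the bytes from offset onward.
func resumeRequest(rawURL string, offset int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	if offset > 0 {
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	}
	return req, nil
}

// resumeDownload continues a download into dest, starting from however many
// bytes are already on disk.
func resumeDownload(rawURL, dest string) error {
	var offset int64
	if info, err := os.Stat(dest); err == nil {
		offset = info.Size()
	}
	req, err := resumeRequest(rawURL, offset)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	flags := os.O_CREATE | os.O_WRONLY
	switch resp.StatusCode {
	case http.StatusPartialContent:
		flags |= os.O_APPEND // server honored the range: append the remainder
	case http.StatusOK:
		flags |= os.O_TRUNC // server ignored the range: start over
	default:
		// Includes 416, which a server may send when the file is already complete.
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	f, err := os.OpenFile(dest, flags, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```

Knowing the expected size and checksum from the manifest would let a client skip files that are already complete instead of issuing a request at all.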

Filesystem Archive Integrity

I use filesystems that don't include data checksums, and I would like a cryptographically signed checksum file so that I can later verify that a set of archive files has not been modified, that none are missing, and that the checksums themselves have not been altered in storage by another party.
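
The signing side of this use case could be sketched in Go as below. The signature scheme is not decided anywhere above, so Ed25519 from the Go standard library is used purely as an assumption, and the type and function names are hypothetical. The sketch also shows why an embedded public key is not enough on its own: verification compares it against a key the verifier already trusts.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
)

// SignedManifest bundles the manifest bytes with a signature and the signer's
// public key, so everything travels in a single file (hypothetical layout).
type SignedManifest struct {
	Payload   []byte
	PublicKey ed25519.PublicKey
	Signature []byte
}

// generateKey creates a fresh Ed25519 key pair.
func generateKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// signManifest signs payload and embeds the matching public key.
func signManifest(payload []byte, priv ed25519.PrivateKey) SignedManifest {
	return SignedManifest{
		Payload:   payload,
		PublicKey: priv.Public().(ed25519.PublicKey),
		Signature: ed25519.Sign(priv, payload),
	}
}

// verify reports whether the manifest was signed by the trusted key. The
// embedded key only identifies the signer; trust has to come from a key
// obtained out of band, otherwise anyone could re-sign altered checksums.
func (s SignedManifest) verify(trusted ed25519.PublicKey) bool {
	if len(trusted) != ed25519.PublicKeySize || !trusted.Equal(s.PublicKey) {
		return false
	}
	return ed25519.Verify(trusted, s.Payload, s.Signature)
}
```

With this shape, altering either a listed checksum or the signature makes verify return false, which covers the "checksums altered in storage" concern.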