# mfer

Manifest file generator and checker.

# Build Status

[![Build Status](https://drone.datavi.be/api/badges/sneak/mfer/status.svg)](https://drone.datavi.be/sneak/mfer)

# Problem Statement

Given a plain URL, there is no standard way to safely and programmatically
download everything "under" that URL path. `wget -r` can traverse directory
listings if they're enabled, but every server formats them differently, and
it neither verifies the cryptographic integrity of the files nor allows them
to be fetched over a protocol other than HTTP(S).

Currently, the solution people use is sidecar files in the format of
`SHASUMS` checksum files, accompanied by a `SHASUMS.asc` PGP detached
signature. This approach is not checksum-algorithm-agnostic, and the sidecar
file is not consistently named.

Real issues I face:

- when I plug in an ExFAT hard drive, I don't know if any files on the
  filesystem are corrupted or missing
  - the current ad-hoc solution is `SHASUMS`/`SHASUMS.asc` files
- when I want to mirror an HTTP archive, I have to use special tools like
  debmirror that understand the archive format
  - the debian repository metadata structure is hot garbage
- when I download a large file via HTTP, I have no way of knowing if the
  file content is what it's supposed to be

# Proposed Solution

A standard, a manifest file format, and a tool for generating same.

The manifest file would be called `index.mf`, and the tool for generating it
would be called `mfer`.

The manifest file would do several important things:

- have a standard filename, so that, given
  `https://example.com/downloadpackage/`, one could fetch
  `https://example.com/downloadpackage/index.mf` to enumerate the full
  directory listing
- contain a version field for extensibility
- contain structured data (protobuf, json, or cbor)
- provide an inner signed container, so that the manifest file itself can
  embed a signature and a public key alongside in a single file
- contain a list of files, each with a path relative to the manifest
- contain a manifest timestamp
- contain ctime/mtime information for files so that file metadata can be
  preserved
- contain cryptographic checksums in several different algorithms for each
  file
  - probably encoded with multihash to indicate algo + hash
  - sha256 at the minimum
  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file,
    which likely involves doing an ipfs file object chunking
  - maybe even including the complete IPFS/IPLD directory tree objects and
    chunklists?
    - this is because generating an `index.mf` does not imply publishing on
      ipfs at that time
  - maybe a bittorrent chunklist for torrent client compatibility? perhaps a
    top-level infohash for the whole manifest?
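
To make the list above concrete, here is a minimal Go sketch of what a manifest might look like in memory, serialized as JSON. The type names, field names, and JSON layout are assumptions made for illustration only; the format is not yet specified, and the signed container is left out. The one fixed detail is the multihash prefix for sha2-256: function code `0x12`, digest length `0x20`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// Manifest is a hypothetical in-memory shape for index.mf. Every field name
// here is an illustrative assumption, not a finalized format.
type Manifest struct {
	Version   int         `json:"version"`
	CreatedAt time.Time   `json:"createdAt"`
	Files     []FileEntry `json:"files"`
}

// FileEntry describes one file, with a path relative to the manifest.
type FileEntry struct {
	Path   string    `json:"path"`
	Size   int64     `json:"size"`
	Mtime  time.Time `json:"mtime"`
	Hashes []string  `json:"hashes"` // hex-encoded multihashes
}

// sha256Multihash returns the hex-encoded multihash of data: the sha2-256
// function code (0x12), the digest length (0x20), then the 32-byte digest.
func sha256Multihash(data []byte) string {
	sum := sha256.Sum256(data)
	mh := append([]byte{0x12, 0x20}, sum[:]...)
	return hex.EncodeToString(mh)
}

func main() {
	content := []byte("hello\n")
	m := Manifest{
		Version:   1,
		CreatedAt: time.Date(2022, 2, 2, 0, 0, 0, 0, time.UTC),
		Files: []FileEntry{{
			Path:   "docs/hello.txt",
			Size:   int64(len(content)),
			Mtime:  time.Date(2022, 2, 1, 0, 0, 0, 0, time.UTC),
			Hashes: []string{sha256Multihash(content)},
		}},
	}
	out, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping `Hashes` as a list of self-describing multihashes is one way to stay checksum-algorithm-agnostic: a reader can skip entries whose function code it does not recognize.
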

# Design Goals

- replace `SHASUMS`/`SHASUMS.asc` files
- be easy to download/resume a whole directory tree published via HTTP
- be easy to use across protocols (given an HTTPS url, fetch manifest, then
  download file contents via bittorrent or ipfs)
- not be strongly coupled to the HTTP use case; it should not require special
  hosting, content types, or HTTP headers being sent

# Non-Goals

- Manifest generation speed
  - likely involves IPFS chunking, bittorrent chunking, and several
    different cryptographic hash functions over the entirety of each and
    every file
- Small manifest file size (within reason)
  - 30MiB files are "small" these days, given modern storage/bandwidth
  - metadata size should not be used as an excuse to sacrifice utility (such
    as providing checksums over each chunk of a large file)

# Open Questions

- Should the manifest file include checksums of individual file chunks, or
  just for the whole assembled file?
  - If so, should the chunksize be fixed or dynamic?
- Should the manifest signature format be GnuPG signatures, or those from
  OpenBSD's signify (of which there is a good [golang
  implementation](https://github.com/frankbraun/gosignify))?
- Should the on-disk serialization format be proto3 or json?
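
To illustrate the first question, the following Go sketch shows the fixed-chunksize option: it computes a sha256 per chunk plus one over the whole stream in a single pass. The function names and the choice of sha256 are assumptions for illustration; nothing here settles whether chunk checksums belong in the manifest or what the chunk size should be.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
)

// chunkSums reads r in fixed-size chunks and returns the hex sha256 of each
// chunk, plus the hex sha256 of the whole stream. The last chunk may be short.
func chunkSums(r io.Reader, chunkSize int) (chunks []string, whole string, err error) {
	if chunkSize <= 0 {
		return nil, "", errors.New("chunk size must be positive")
	}
	total := sha256.New()
	buf := make([]byte, chunkSize)
	for {
		n, readErr := io.ReadFull(r, buf)
		if n > 0 {
			sum := sha256.Sum256(buf[:n])
			chunks = append(chunks, hex.EncodeToString(sum[:]))
			total.Write(buf[:n])
		}
		// io.EOF means nothing was left; io.ErrUnexpectedEOF means a short
		// final chunk was read. Both end the stream normally.
		if readErr == io.EOF || readErr == io.ErrUnexpectedEOF {
			break
		}
		if readErr != nil {
			return nil, "", readErr
		}
	}
	return chunks, hex.EncodeToString(total.Sum(nil)), nil
}

// sumBytes is a small convenience wrapper for in-memory data.
func sumBytes(data []byte, chunkSize int) ([]string, string, error) {
	return chunkSums(bytes.NewReader(data), chunkSize)
}

func main() {
	// Ten bytes in chunks of four gives chunks of 4, 4, and 2 bytes.
	chunks, whole, err := sumBytes(bytes.Repeat([]byte("a"), 10), 4)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(chunks), "chunks; whole-file sha256:", whole)
}
```

A dynamic (content-defined) chunk size would replace the fixed-length read with a boundary-detection step; the per-chunk and whole-file hashing would stay the same.
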

# Tool Examples

- `mfer gen` / `mfer gen .`
  - recurses under the current directory and writes out an `index.mf`
- `mfer check` / `mfer check .`
  - verifies checksums of all files in the manifest, displaying an error and
    exiting nonzero if any files are missing or corrupted
- `mfer fetch https://example.com/stuff/`
  - fetches `/stuff/index.mf` and downloads all files listed in the manifest,
    optionally resuming any that already exist locally, and verifies the
    cryptographic integrity of downloaded files
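
The core of `gen` and `check` can be sketched in Go roughly as follows. This is not the `mfer` implementation: the function names are hypothetical, the "manifest" is just an in-memory map from relative path to sha256, and nothing is written to or read from `index.mf`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// hashFile returns the hex sha256 of the file at path.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// gen walks root and maps each regular file's slash-separated relative path
// to its checksum.
func gen(root string) (map[string]string, error) {
	sums := map[string]string{}
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks, devices
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		sum, err := hashFile(p)
		if err != nil {
			return err
		}
		sums[filepath.ToSlash(rel)] = sum
		return nil
	})
	return sums, err
}

// check returns the sorted paths that are missing, unreadable, or changed.
func check(root string, sums map[string]string) []string {
	var bad []string
	for rel, want := range sums {
		got, err := hashFile(filepath.Join(root, filepath.FromSlash(rel)))
		if err != nil || got != want {
			bad = append(bad, rel)
		}
	}
	sort.Strings(bad)
	return bad
}

// demo builds a temp tree, records checksums, then corrupts one file and
// deletes another before checking.
func demo() ([]string, error) {
	root, err := os.MkdirTemp("", "mfer-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	a, b := filepath.Join(root, "a.txt"), filepath.Join(root, "b.txt")
	if err := os.WriteFile(a, []byte("alpha"), 0o644); err != nil {
		return nil, err
	}
	if err := os.WriteFile(b, []byte("beta"), 0o644); err != nil {
		return nil, err
	}
	sums, err := gen(root)
	if err != nil {
		return nil, err
	}
	if err := os.WriteFile(a, []byte("tampered"), 0o644); err != nil {
		return nil, err
	}
	if err := os.Remove(b); err != nil {
		return nil, err
	}
	return check(root, sums), nil
}

func main() {
	bad, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// A real CLI would exit nonzero here when bad is non-empty.
	fmt.Println("missing or corrupted:", bad)
}
```

Storing slash-separated relative paths keeps the recorded paths independent of the operating system that generated them.
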

# Implementation Plan

## Phase One:

- golang module for reusability/embedding
- golang module client providing `mfer` CLI

## Phase Two:

- ES6 or TypeScript module for reusability/embedding
- ES6/TypeScript module client providing `mfer.js` CLI

# Hopes And Dreams

- `aria2c https://example.com/manifestdirectory/`
  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and
    checksums all files, resumes any that exist locally already)
- `mfer fetch https://example.com/manifestdirectory/`
- a command line option to zero/omit mtime/ctime, as well as manifest
  timestamp, and sort all directory listings so that manifest file
  generation is deterministic/reproducible
- URL format `mfer fetch https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2`
  to assert in the URL which PGP signing key should be used in the manifest,
  so that shared URLs have a cryptographic trust root
- a "well-known" key in the manifest that maps well known keys (could reuse
  the http spec) to specific file paths in the manifest
  - example: a `berlin.sneak.app.slideshow` key that maps to a json
    slideshow config listing what image paths to show, and for how long, and
    in what order
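
The `?key=` URL idea above could be handled along these lines in Go. This is a sketch under assumptions: the function name is hypothetical, the `key` parameter is taken to be a 40-hex-digit fingerprint like the one in the example URL, and the manifest is assumed to live at `index.mf` directly under the given path.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// parseFetchURL splits a fetch URL into the manifest URL and the optional
// pinned signing-key fingerprint carried in the `key` query parameter.
func parseFetchURL(raw string) (manifestURL, fingerprint string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", "", errors.New("expected an absolute URL")
	}
	fingerprint = strings.ToUpper(u.Query().Get("key"))
	if fingerprint != "" {
		// Assumption: fingerprints are 20 bytes, written as 40 hex digits.
		if _, decErr := hex.DecodeString(fingerprint); decErr != nil || len(fingerprint) != 40 {
			return "", "", errors.New("key must be a 40-digit hex fingerprint")
		}
	}
	// Drop the query and fragment, then point at index.mf in that directory.
	u.RawQuery = ""
	u.Fragment = ""
	if !strings.HasSuffix(u.Path, "/") {
		u.Path += "/"
	}
	u.Path += "index.mf"
	return u.String(), fingerprint, nil
}

func main() {
	m, fp, err := parseFetchURL("https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2")
	if err != nil {
		panic(err)
	}
	fmt.Println("manifest:", m)
	fmt.Println("pinned key:", fp)
}
```

A fetcher built this way would compare the pinned fingerprint against the public key embedded in the manifest before trusting its signature.
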

# Use Cases

## Web Images

I'd like to be able to put a bunch of images into a directory, generate a
manifest, and then point a slideshow client (such as an ambient display, or
a react app with the target directory in a query string arg) at that
statically hosted directory, and have it discover the full list of images
available at that URL.

## Software Distribution

I'd like to be able to resumably download a whole tree of files published
via HTTP, using either HTTP or IPFS/BitTorrent, without needing a .torrent
file.

## Filesystem Archive Integrity

I use filesystems that don't include data checksums, and I would like a
cryptographically signed checksum file so that I can later verify that a set
of archive files has not been modified, that none are missing, and that the
checksums have not been altered in storage by a second party.

## Filesystem-Independent Checksums

I would like to be able to plug in a hard drive or flash drive and, if there
is an `index.mf` in the root, automatically detect missing/corrupted files,
regardless of filesystem format.

# Collaboration

Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your
desired username for an account on this Gitea instance.

I am currently interested in hiring a contractor skilled with the Go
standard library interfaces to specify this tool in full and develop a
prototype implementation.