From 7e4f8366a7227e7a20c08bee6638dbd243513e34 Mon Sep 17 00:00:00 2001
From: sneak
Date: Tue, 1 Feb 2022 21:45:03 -0800
Subject: [PATCH] format readme, add build status badge

---
 README.md | 114 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 84 insertions(+), 30 deletions(-)

diff --git a/README.md b/README.md
index 6ed8db3..c3fbd22 100644
--- a/README.md
+++ b/README.md
@@ -2,19 +2,33 @@

 Manifest file generator and checker.

+# Build Status
+
+[![Build Status](https://drone.datavi.be/api/badges/sneak/mfer/status.svg)](https://drone.datavi.be/sneak/mfer)
+
 # Problem Statement

-Given a plain URL, there is no standard way to safely and programmatically download everything "under" that URL path. `wget -r` can traverse directory listings if they're enabled, but every server has a different format, and this does not verify cryptographic integrity of the files, or enable them to be fetched using a different protocol other than HTTP/s.
+Given a plain URL, there is no standard way to safely and programmatically
+download everything "under" that URL path. `wget -r` can traverse directory
+listings if they're enabled, but every server has a different format, and
+this does not verify cryptographic integrity of the files, or enable them to
+be fetched using a different protocol other than HTTP/s.

-Currently, the solution that people are using are sidecar files in the format of `SHASUMS` checksum files, as well as a `SHASUMS.asc` PGP detached signature. This is not checksum-algorithm-agnostic and the sidecar file is not always consistently named.
+Currently, the solution that people are using are sidecar files in the
+format of `SHASUMS` checksum files, as well as a `SHASUMS.asc` PGP detached
+signature. This is not checksum-algorithm-agnostic and the sidecar file is
+not always consistently named.

 Real issues I face:

-- when I plug in an ExFAT hard drive, I don't know if any files on the filesystem are corrupted or missing
+- when I plug in an ExFAT hard drive, I don't know if any files on the
+  filesystem are corrupted or missing
   - current ad-hoc solution are `SHASUMS`/`SHASUMS.asc` files
-- when I want to mirror an HTTP archive, I have to use special tools like debmirror that understand the archive format
+- when I want to mirror an HTTP archive, I have to use special tools like
+  debmirror that understand the archive format
   - the debian repository metadata structure is hot garbage
-- when I download a large file via HTTP, I have no way of knowing if the file content is what it's supposed to be
+- when I download a large file via HTTP, I have no way of knowing if the
+  file content is what it's supposed to be

 # Proposed Solution

@@ -24,35 +38,50 @@ The manifest file would be called `index.mf`, and the tool for generating such w

 The manifest file would do several important things:

-- have a standard filename, so if given `https://example.com/downloadpackage/` one could fetch `https://example.com/downloadpackage/index.mf` to enumerate the full directory listing.
+- have a standard filename, so if given
+  `https://example.com/downloadpackage/` one could fetch
+  `https://example.com/downloadpackage/index.mf` to enumerate the full
+  directory listing.
 - contain a version field for extensibility
 - contain structured data (protobuf, json, or cbor)
-- provide an inner signed container, so that the manifest file itself can embed a signature and a public key alongside in a single file
+- provide an inner signed container, so that the manifest file itself can
+  embed a signature and a public key alongside in a single file
 - contain a list of files, each with a relative path to the manifest
 - contain manifest timestamp
-- contain ctime/mtime information for files so that file metadata can be preserved
-- contain cryptographic checksums in several different algorithms for each file
+- contain ctime/mtime information for files so that file metadata can be
+  preserved
+- contain cryptographic checksums in several different algorithms for each
+  file
   - probably encoded with multihash to indicate algo + hash
   - sha256 at the minimum
-  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file, which likely involves doing an ipfs file object chunking
-  - maybe even including the complete IPFS/IPLD directory tree objects and chunklists?
-  - this is because generating an `index.mf` does not imply publishing on ipfs at that time
-  - maybe a bittorrent chunklist for torrent client compatibility? perhaps a top-level infohash for the whole manifest?
+  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file,
+    which likely involves doing an ipfs file object chunking
+  - maybe even including the complete IPFS/IPLD directory tree objects and
+    chunklists?
+  - this is because generating an `index.mf` does not imply publishing on
+    ipfs at that time
+  - maybe a bittorrent chunklist for torrent client compatibility? perhaps a
+    top-level infohash for the whole manifest?

 # Design Goals

 - Replace SHASUMS/SHASUMS.asc files
 - be easy to download/resume a whole directory tree published via HTTP
-- be easy to use across protocols (given an HTTPS url, fetch manifest, then download file contents via bittorrent or ipfs)
-- not strongly coupled to HTTP use case, should not require special hosting, content types, or HTTP headers being sent
+- be easy to use across protocols (given an HTTPS url, fetch manifest, then
+  download file contents via bittorrent or ipfs)
+- not strongly coupled to HTTP use case, should not require special hosting,
+  content types, or HTTP headers being sent

 # Non-Goals

 - Manifest generation speed
-  - likely involves IPFS chunking, bittorrent chunking, and several different cryptographic hash functions over the entirety of each and every file
+  - likely involves IPFS chunking, bittorrent chunking, and several
+    different cryptographic hash functions over the entirety of each and
+    every file
 - Small manifest file size (within reason)
   - 30MiB files are "small" these days, given modern storage/bandwidth
-  - metadata size should not be used as an excuse to sacrifice utility (such as providing checksums over each chunk of a large file)
+  - metadata size should not be used as an excuse to sacrifice utility (such
+    as providing checksums over each chunk of a large file)

 # Open Questions

@@ -71,9 +100,12 @@ The manifest file would do several important things:
 - `mfer gen` / `mfer gen .`
   - recurses under current directory and writes out an `index.mf`
 - `mfer check` / `mfer check .`
-  - verifies checksums of all files in manifest, displaying error and exiting nonzero if any files are missing or corrupted
+  - verifies checksums of all files in manifest, displaying error and
+    exiting nonzero if any files are missing or corrupted
 - `mfer fetch https://example.com/stuff/`
-  - fetches `/stuff/index.mf` and downloads all files listed in manifest, optionally resuming any that already exist locally, and assures cryptographic integrity of downloaded files.
+  - fetches `/stuff/index.mf` and downloads all files listed in manifest,
+    optionally resuming any that already exist locally, and assures
+    cryptographic integrity of downloaded files.

 # Implementation Plan

@@ -90,33 +122,55 @@ The manifest file would do several important things:
 # Hopes And Dreams

 - `aria2c https://example.com/manifestdirectory/`
-  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and checksums all files, resumes any that exist locally already)
+  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and
+    checksums all files, resumes any that exist locally already)
 - `mfer fetch https://example.com/manifestdirectory/`
-- a command line option to zero/omit mtime/ctime, as well as manifest timestamp, and sort all directory listings so that manifest file generation is deterministic/reproducible
-- URL format `mfer fetch https://exmaple.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2` to assert in the URL which PGP signing key should be used in the manifest, so that shared URLs have a cryptographic trust root
-- a "well-known" key in the manifest that maps well known keys (could reuse the http spec) to specific file paths in the manifest.
-  - example: a `berlin.sneak.app.slideshow` key that maps to a json slideshow config listing what image paths to show, and for how long, and in what order
+- a command line option to zero/omit mtime/ctime, as well as manifest
+  timestamp, and sort all directory listings so that manifest file
+  generation is deterministic/reproducible
+- URL format `mfer fetch
+  https://exmaple.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2`
+  to assert in the URL which PGP signing key should be used in the manifest,
+  so that shared URLs have a cryptographic trust root
+- a "well-known" key in the manifest that maps well known keys (could reuse
+  the http spec) to specific file paths in the manifest.
+  - example: a `berlin.sneak.app.slideshow` key that maps to a json
+    slideshow config listing what image paths to show, and for how long, and
+    in what order

 # Use Cases

 ## Web Images

-I'd like to be able to put a bunch of images into a directory, generate a manifest, and then point a slideshow client (such as an ambient display, or a react app with the target directory in a query string arg) at that statically hosted directory, and have it discover the full list of images available at that URL.
+I'd like to be able to put a bunch of images into a directory, generate a
+manifest, and then point a slideshow client (such as an ambient display, or
+a react app with the target directory in a query string arg) at that
+statically hosted directory, and have it discover the full list of images
+available at that URL.

 ## Software Distribution

-I'd like to be able to download a whole tree of files available via HTTP resumably by either HTTP or IPFS/BitTorrent without a .torrent file.
+I'd like to be able to download a whole tree of files available via HTTP
+resumably by either HTTP or IPFS/BitTorrent without a .torrent file.

 ## Filesystem Archive Integrity

-I use filesystems that don't include data checksums, and I would like a cryptographically signed checksum file so that I can later verify that a set of archive files have not been modified, none are missing, and that the checksums have not been altered in storage by a second party.
+I use filesystems that don't include data checksums, and I would like a
+cryptographically signed checksum file so that I can later verify that a set
+of archive files have not been modified, none are missing, and that the
+checksums have not been altered in storage by a second party.

 ## Filesystem-Independent Checksums

-I would like to be able to plug in a hard drive or flash drive and, if there is an `index.mf` in the root, automatically detect missing/corrupted files, regardless of filesystem format.
+I would like to be able to plug in a hard drive or flash drive and, if there
+is an `index.mf` in the root, automatically detect missing/corrupted files,
+regardless of filesystem format.

 # Collaboration

-Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your desired username for an account on this Gitea instance.
+Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your
+desired username for an account on this Gitea instance.

-I am currently interested in hiring a contractor skilled with the Go standard library interfaces to specify this tool in full and develop a prototype implementation.
+I am currently interested in hiring a contractor skilled with the Go
+standard library interfaces to specify this tool in full and develop a
+prototype implementation.