From 4a6469b003fdbbe45e0fd9c8ac247d3ef743fb90 Mon Sep 17 00:00:00 2001
From: sneak
Date: Wed, 3 Nov 2021 03:49:24 -0700
Subject: [PATCH] fmt

---
 README.md | 105 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 54 insertions(+), 51 deletions(-)

diff --git a/README.md b/README.md
index 98226af..6ed8db3 100644
--- a/README.md
+++ b/README.md
@@ -4,17 +4,17 @@ Manifest file generator and checker.
 
 # Problem Statement
 
-Given a plain URL, there is no standard way to safely and programmatically download everything "under" that URL path. `wget -r` can traverse directory listings if they're enabled, but every server has a different format, and this does not verify cryptographic integrity of the files, or enable them to be fetched using a different protocol other than HTTP/s.
+Given a plain URL, there is no standard way to safely and programmatically download everything "under" that URL path. `wget -r` can traverse directory listings if they're enabled, but every server has a different format, and this does not verify cryptographic integrity of the files, or enable them to be fetched using a different protocol other than HTTP/s.
 
-Currently, the solution that people are using are sidecar files in the format of `SHASUMS` checksum files, as well as a `SHASUMS.asc` PGP detached signature. This is not checksum-algorithm-agnostic and the sidecar file is not always consistently named.
+Currently, the solution that people are using are sidecar files in the format of `SHASUMS` checksum files, as well as a `SHASUMS.asc` PGP detached signature. This is not checksum-algorithm-agnostic and the sidecar file is not always consistently named.
 
Real issues I face: -* when I plug in an ExFAT hard drive, I don't know if any files on the filesystem are corrupted or missing - * current ad-hoc solution are `SHASUMS`/`SHASUMS.asc` files -* when I want to mirror an HTTP archive, I have to use special tools like debmirror that understand the archive format - * the debian repository metadata structure is hot garbage -* when I download a large file via HTTP, I have no way of knowing if the file content is what it's supposed to be +- when I plug in an ExFAT hard drive, I don't know if any files on the filesystem are corrupted or missing + - current ad-hoc solution are `SHASUMS`/`SHASUMS.asc` files +- when I want to mirror an HTTP archive, I have to use special tools like debmirror that understand the archive format + - the debian repository metadata structure is hot garbage +- when I download a large file via HTTP, I have no way of knowing if the file content is what it's supposed to be # Proposed Solution @@ -24,76 +24,78 @@ The manifest file would be called `index.mf`, and the tool for generating such w The manifest file would do several important things: -* have a standard filename, so if given `https://example.com/downloadpackage/` one could fetch `https://example.com/downloadpackage/index.mf` to enumerate the full directory listing. 
-* contain a version field for extensibility -* contain structured data (protobuf, json, or cbor) -* provide an inner signed container, so that the manifest file itself can embed a signature and a public key alongside in a single file -* contain a list of files, each with a relative path to the manifest -* contain manifest timestamp -* contain ctime/mtime information for files so that file metadata can be preserved -* contain cryptographic checksums in several different algorithms for each file - * probably encoded with multihash to indicate algo + hash - * sha256 at the minimum - * would be nice to include an IPFS/IPLD CIDv1 root hash for each file, which likely involves doing an ipfs file object chunking - * maybe even including the complete IPFS/IPLD directory tree objects and chunklists? - * this is because generating an `index.mf` does not imply publishing on ipfs at that time - * maybe a bittorrent chunklist for torrent client compatibility? perhaps a top-level infohash for the whole manifest? +- have a standard filename, so if given `https://example.com/downloadpackage/` one could fetch `https://example.com/downloadpackage/index.mf` to enumerate the full directory listing. 
+- contain a version field for extensibility +- contain structured data (protobuf, json, or cbor) +- provide an inner signed container, so that the manifest file itself can embed a signature and a public key alongside in a single file +- contain a list of files, each with a relative path to the manifest +- contain manifest timestamp +- contain ctime/mtime information for files so that file metadata can be preserved +- contain cryptographic checksums in several different algorithms for each file + - probably encoded with multihash to indicate algo + hash + - sha256 at the minimum + - would be nice to include an IPFS/IPLD CIDv1 root hash for each file, which likely involves doing an ipfs file object chunking + - maybe even including the complete IPFS/IPLD directory tree objects and chunklists? + - this is because generating an `index.mf` does not imply publishing on ipfs at that time + - maybe a bittorrent chunklist for torrent client compatibility? perhaps a top-level infohash for the whole manifest? 
# Design Goals -* Replace SHASUMS/SHASUMS.asc files -* be easy to download/resume a whole directory tree published via HTTP -* be easy to use across protocols (given an HTTPS url, fetch manifest, then download file contents via bittorrent or ipfs) -* not strongly coupled to HTTP use case, should not require special hosting, content types, or HTTP headers being sent +- Replace SHASUMS/SHASUMS.asc files +- be easy to download/resume a whole directory tree published via HTTP +- be easy to use across protocols (given an HTTPS url, fetch manifest, then download file contents via bittorrent or ipfs) +- not strongly coupled to HTTP use case, should not require special hosting, content types, or HTTP headers being sent # Non-Goals -* Manifest generation speed - * likely involves IPFS chunking, bittorrent chunking, and several different cryptographic hash functions over the entirety of each and every file -* Small manifest file size (within reason) - * 30MiB files are "small" these days, given modern storage/bandwidth - * metadata size should not be used as an excuse to sacrifice utility (such as providing checksums over each chunk of a large file) + +- Manifest generation speed + - likely involves IPFS chunking, bittorrent chunking, and several different cryptographic hash functions over the entirety of each and every file +- Small manifest file size (within reason) + - 30MiB files are "small" these days, given modern storage/bandwidth + - metadata size should not be used as an excuse to sacrifice utility (such as providing checksums over each chunk of a large file) # Open Questions -* Should the manifest file include checksums of individual file chunks, or just for the whole assembled file? - * If so, should the chunksize be fixed or dynamic? +- Should the manifest file include checksums of individual file chunks, or just for the whole assembled file? 
-* Should the manifest signature format be GnuPG signatures, or those from + - If so, should the chunksize be fixed or dynamic? + +- Should the manifest signature format be GnuPG signatures, or those from OpenBSD's signify (of which there is a good [golang implementation](https://github.com/frankbraun/gosignify)? -* Should the on-disk serialization format be proto3 or json? +- Should the on-disk serialization format be proto3 or json? # Tool Examples -* `mfer gen` / `mfer gen .` - * recurses under current directory and writes out an `index.mf` -* `mfer check` / `mfer check .` - * verifies checksums of all files in manifest, displaying error and exiting nonzero if any files are missing or corrupted -* `mfer fetch https://example.com/stuff/` - * fetches `/stuff/index.mf` and downloads all files listed in manifest, optionally resuming any that already exist locally, and assures cryptographic integrity of downloaded files. +- `mfer gen` / `mfer gen .` + - recurses under current directory and writes out an `index.mf` +- `mfer check` / `mfer check .` + - verifies checksums of all files in manifest, displaying error and exiting nonzero if any files are missing or corrupted +- `mfer fetch https://example.com/stuff/` + - fetches `/stuff/index.mf` and downloads all files listed in manifest, optionally resuming any that already exist locally, and assures cryptographic integrity of downloaded files. 
# Implementation Plan ## Phase One: -* golang module for reusability/embedding -* golang module client providing `mfer` CLI +- golang module for reusability/embedding +- golang module client providing `mfer` CLI ## Phase Two: -* ES6 or TypeScript module for reusability/embedding -* ES6/TypeScript module client providing `mfer.js` CLI +- ES6 or TypeScript module for reusability/embedding +- ES6/TypeScript module client providing `mfer.js` CLI # Hopes And Dreams -* `aria2c https://example.com/manifestdirectory/` - * (fetches `https://example.com/manifestdirectory/index.mf`, downloads and checksums all files, resumes any that exist locally already) -* `mfer fetch https://example.com/manifestdirectory/` -* a command line option to zero/omit mtime/ctime, as well as manifest timestamp, and sort all directory listings so that manifest file generation is deterministic/reproducible -* URL format `mfer fetch https://exmaple.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2` to assert in the URL which PGP signing key should be used in the manifest, so that shared URLs have a cryptographic trust root -* a "well-known" key in the manifest that maps well known keys (could reuse the http spec) to specific file paths in the manifest. 
- * example: a `berlin.sneak.app.slideshow` key that maps to a json slideshow config listing what image paths to show, and for how long, and in what order +- `aria2c https://example.com/manifestdirectory/` + - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and checksums all files, resumes any that exist locally already) +- `mfer fetch https://example.com/manifestdirectory/` +- a command line option to zero/omit mtime/ctime, as well as manifest timestamp, and sort all directory listings so that manifest file generation is deterministic/reproducible +- URL format `mfer fetch https://exmaple.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2` to assert in the URL which PGP signing key should be used in the manifest, so that shared URLs have a cryptographic trust root +- a "well-known" key in the manifest that maps well known keys (could reuse the http spec) to specific file paths in the manifest. + - example: a `berlin.sneak.app.slideshow` key that maps to a json slideshow config listing what image paths to show, and for how long, and in what order # Use Cases @@ -114,6 +116,7 @@ I use filesystems that don't include data checksums, and I would like a cryptogr I would like to be able to plug in a hard drive or flash drive and, if there is an `index.mf` in the root, automatically detect missing/corrupted files, regardless of filesystem format. # Collaboration + Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your desired username for an account on this Gitea instance. I am currently interested in hiring a contractor skilled with the Go standard library interfaces to specify this tool in full and develop a prototype implementation.
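
A minimal sketch of the `mfer gen` / `mfer check` behavior described in the README's Tool Examples section, not the project's implementation. It assumes JSON serialization (the README lists proto3 and json as open candidates) and a plain hex SHA-256 digest where the README suggests multihash. The names `Entry`, `Manifest`, `Generate`, and `Check` are hypothetical, as is the JSON field layout. Signatures, ctime, and chunk hashes are left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// Entry is a hypothetical manifest record for a single file.
type Entry struct {
	Path   string    `json:"path"` // relative to the manifest, slash-separated
	Size   int64     `json:"size"`
	Mtime  time.Time `json:"mtime"`
	SHA256 string    `json:"sha256"` // hex digest; an assumption, not multihash
}

// Manifest is a hypothetical top-level structure with a version field.
type Manifest struct {
	Version int       `json:"version"`
	Created time.Time `json:"created"`
	Files   []Entry   `json:"files"`
}

// hashFile returns the hex-encoded SHA-256 digest of the file at path.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// Generate walks root and records every regular file except a top-level index.mf.
// filepath.WalkDir visits entries in lexical order, so the listing is sorted.
func Generate(root string) (*Manifest, error) {
	m := &Manifest{Version: 1, Created: time.Now().UTC()}
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks, devices
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		if rel == "index.mf" {
			return nil // the manifest does not list itself
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		sum, err := hashFile(p)
		if err != nil {
			return err
		}
		m.Files = append(m.Files, Entry{
			Path:   filepath.ToSlash(rel),
			Size:   info.Size(),
			Mtime:  info.ModTime().UTC(),
			SHA256: sum,
		})
		return nil
	})
	return m, err
}

// Check re-hashes each listed file under root and returns the paths that are
// missing, unreadable, or whose digest no longer matches the manifest.
func Check(root string, m *Manifest) []string {
	var bad []string
	for _, e := range m.Files {
		sum, err := hashFile(filepath.Join(root, filepath.FromSlash(e.Path)))
		if err != nil || sum != e.SHA256 {
			bad = append(bad, e.Path)
		}
	}
	return bad
}

func main() {
	m, err := Generate(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, "gen:", err)
		os.Exit(1)
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(m); err != nil {
		fmt.Fprintln(os.Stderr, "encode:", err)
		os.Exit(1)
	}
	if bad := Check(".", m); len(bad) > 0 {
		fmt.Fprintln(os.Stderr, "missing or corrupted:", bad)
		os.Exit(1)
	}
}
```

Run in a directory, this prints the manifest as JSON to stdout instead of writing `index.mf`, then re-verifies the files and exits nonzero if any are missing or corrupted, mirroring the `mfer check` behavior described in the README.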