Given a plain URL, there is no standard way to safely and programmatically download everything "under" that URL path. `wget -r` can traverse directory listings if they're enabled, but every server formats its listings differently, and this approach neither verifies the cryptographic integrity of the files nor allows them to be fetched over a protocol other than HTTP(S).
Currently, the solution people use is sidecar files: a `SHASUMS` checksum file along with a `SHASUMS.asc` detached PGP signature. This is not checksum-algorithm-agnostic, and the sidecar file is not consistently named.
Real issues I face:
- when I plug in an exFAT hard drive, I don't know if any files on the filesystem are corrupted or missing
  - the current ad-hoc solution is `SHASUMS`/`SHASUMS.asc` files
- when I want to mirror an HTTP archive, I have to use special tools like debmirror that understand the archive format
  - the Debian repository metadata structure is hot garbage
- when I download a large file via HTTP, I have no way of knowing if the file content is what it's supposed to be
# Proposed Solution
The manifest file would be called `index.mf`.
The manifest file would do several important things:
- have a standard filename, so if given `https://example.com/downloadpackage/` one could fetch `https://example.com/downloadpackage/index.mf` to enumerate the full directory listing.
- contain a version field for extensibility
- contain structured data (protobuf, json, or cbor)
- provide an inner signed container, so that the manifest file itself can embed a signature and a public key alongside in a single file
- contain a list of files, each with a relative path to the manifest
- contain a manifest timestamp
- contain ctime/mtime information for files so that file metadata can be preserved
- contain cryptographic checksums in several different algorithms for each file
  - probably encoded with multihash to indicate algo + hash
  - sha256 at the minimum
  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file, which likely involves running IPFS file object chunking over it
    - maybe even including the complete IPFS/IPLD directory tree objects and chunklists?
    - this is because generating an `index.mf` does not imply publishing on IPFS at that time
  - maybe a BitTorrent chunklist for torrent client compatibility? perhaps a top-level infohash for the whole manifest?
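
As a rough illustration of the list above, here is a minimal Python sketch of manifest generation. The field names (`version`, `created`, `files`, `path`, `size`, `mtime`, `hashes`) and the function names are hypothetical and only stand in for whatever structured format is eventually chosen; signing, IPFS CIDs, and BitTorrent chunklists are left out. The one fixed detail is the multihash prefix: sha2-256 is function code `0x12` with a `0x20`-byte digest.

```python
import hashlib
import os
import time

MANIFEST_NAME = "index.mf"  # filename proposed in the text


def sha256_multihash(path, bufsize=1 << 20):
    """Return the file's sha2-256 digest as a hex multihash string."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    # multihash = <hash function code><digest length><digest>;
    # sha2-256 is code 0x12 and its digest is 0x20 (32) bytes long.
    return "1220" + h.hexdigest()


def build_manifest(root):
    """Walk root and return a manifest dict (hypothetical layout)."""
    files = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorted traversal keeps the file order stable
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            # Paths are stored relative to the manifest, with "/" separators.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if rel == MANIFEST_NAME:
                continue  # the manifest does not list itself
            st = os.stat(full)
            files.append({
                "path": rel,
                "size": st.st_size,
                "mtime": int(st.st_mtime),
                "hashes": [sha256_multihash(full)],
            })
    return {"version": 1, "created": int(time.time()), "files": files}
```

Calling `build_manifest(".")` and serializing the result (for example with `json.dumps`) would give something that could be written out as `index.mf`, assuming JSON were the chosen encoding.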
# Design Goals
- Replace SHASUMS/SHASUMS.asc files
- be easy to download/resume a whole directory tree published via HTTP
- be easy to use across protocols (given an HTTPS url, fetch manifest, then download file contents via bittorrent or ipfs)
- not be strongly coupled to the HTTP use case; should not require special hosting, content types, or HTTP headers
# Non-Goals
- Manifest generation speed
  - likely involves IPFS chunking, bittorrent chunking, and several different cryptographic hash functions over the entirety of each and every file
- Small manifest file size (within reason)
  - 30MiB files are "small" these days, given modern storage/bandwidth
  - metadata size should not be used as an excuse to sacrifice utility (such as providing checksums over each chunk of a large file)
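
To make the per-chunk checksum idea concrete, here is a small Python sketch. The function name `chunk_checksums` and the 1 MiB default chunk size are assumptions for illustration, not part of the proposal.

```python
import hashlib


def chunk_checksums(path, chunk_size=1 << 20):
    """Return (whole_file_sha256_hex, [sha256 hex of each chunk])."""
    whole = hashlib.sha256()
    chunks = []
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            whole.update(block)  # running hash over the entire file
            chunks.append(hashlib.sha256(block).hexdigest())  # hash of this chunk only
    return whole.hexdigest(), chunks
```

With these numbers, a 1 GiB file yields 1024 chunk digests of 32 bytes each, about 32 KiB of raw hash data, which fits comfortably inside the size budget described above while letting a client pinpoint and re-fetch only the damaged chunk.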
# Open Questions
- `mfer gen` / `mfer gen .`
  - recurses under the current directory and writes out an `index.mf`
- `mfer check` / `mfer check .`
  - verifies checksums of all files in the manifest, displaying an error and exiting nonzero if any files are missing or corrupted
- `mfer fetch https://example.com/stuff/`
  - fetches `/stuff/index.mf` and downloads all files listed in the manifest, optionally resuming any that already exist locally, and verifies the cryptographic integrity of the downloaded files
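
The core of a `check` step might look like the following Python sketch. It assumes a hypothetical, already-parsed manifest dict whose `files` entries carry a relative `path` and a `hashes` list of hex multihash strings; parsing `index.mf` and verifying its signature are out of scope here.

```python
import hashlib
import os

# multihash header for sha2-256: function code 0x12, digest length 0x20
SHA256_PREFIX = "1220"


def check_manifest(root, manifest):
    """Return a list of (path, problem) tuples; an empty list means all files verified."""
    problems = []
    for entry in manifest["files"]:
        rel = entry["path"]
        full = os.path.join(root, *rel.split("/"))
        if not os.path.isfile(full):
            problems.append((rel, "missing"))
            continue
        expected = [h for h in entry["hashes"] if h.startswith(SHA256_PREFIX)]
        if not expected:
            problems.append((rel, "no sha256 checksum"))
            continue
        h = hashlib.sha256()
        with open(full, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        if SHA256_PREFIX + h.hexdigest() not in expected:
            problems.append((rel, "corrupted"))
    return problems
```

A command-line wrapper would print each problem and call `sys.exit(1)` when the returned list is non-empty, matching the "exit nonzero" behavior described above.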
# Implementation Plan
# Hopes And Dreams
- `aria2c https://example.com/manifestdirectory/`
  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and checksums all files, resumes any that exist locally already)
- a command line option to zero/omit mtime/ctime, as well as the manifest timestamp, and sort all directory listings so that manifest file generation is deterministic/reproducible
- a URL format such as `mfer fetch https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2` that asserts in the URL which PGP signing key must have signed the manifest, so that shared URLs have a cryptographic trust root
- a "well-known" key in the manifest that maps well-known keys (could reuse the HTTP spec) to specific file paths in the manifest
  - example: a `berlin.sneak.app.slideshow` key that maps to a JSON slideshow config listing what image paths to show, for how long, and in what order
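
The `?key=` idea could be handled on the client side roughly as follows. This Python sketch only splits a shared URL into the manifest location and the pinned fingerprint; the function name `manifest_request` is hypothetical, and checking the manifest's signature against that fingerprint is not shown.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def manifest_request(url):
    """Return (manifest_url, pinned_key_fingerprint or None) for a directory URL."""
    parts = urlsplit(url)
    keys = parse_qs(parts.query).get("key", [])
    fingerprint = keys[0].upper() if keys else None
    # Treat the URL path as a directory and append the standard manifest name.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    manifest_url = urlunsplit((parts.scheme, parts.netloc, path + "index.mf", "", ""))
    return manifest_url, fingerprint
```

A client would then fetch the returned manifest URL and refuse to proceed unless the manifest is signed by the key with the pinned fingerprint.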
# Use Cases
## Web Images
I'd like to be able to put a bunch of images into a directory, generate a manifest, and then point a slideshow client (such as an ambient display, or a react app with the target directory in a query string arg) at that statically hosted directory, and have it discover the full list of images available at that URL.
## Software Distribution
I'd like to be able to resumably download a whole tree of files published via HTTP, using either HTTP or IPFS/BitTorrent, without needing a .torrent file.
## Filesystem Archive Integrity
I use filesystems that don't include data checksums, and I would like a cryptographically signed checksum file so that I can later verify that a set of archive files have not been modified, none are missing, and that the checksums have not been altered in storage by a second party.
## Filesystem-Independent Checksums
I would like to be able to plug in a hard drive or flash drive and, if there is an `index.mf` in the root, automatically detect missing/corrupted files, regardless of filesystem format.
# Collaboration
Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your desired username for an account on this Gitea instance.
I am currently interested in hiring a contractor skilled with the Go standard library interfaces to specify this tool in full and develop a prototype implementation.