# mfer
[mfer](https://git.eeqj.de/sneak/mfer) is a reference implementation library
and thin wrapper command-line utility written in [Go](https://golang.org)
and first published in 2022 under the [WTFPL](https://wtfpl.net) (public
domain) license. It specifies and generates `.mf` manifest files over a
directory tree of files to encapsulate metadata about them (such as
cryptographic checksums or signatures over same) to aid in archiving,
downloading, streaming, or mirroring. The manifest files' data is
serialized with Google's [protobuf serialization
format](https://developers.google.com/protocol-buffers). The structure of
these files can be found [in the format
specification](https://git.eeqj.de/sneak/mfer/src/branch/main/mfer/mf.proto)
which is included in the [project
repository](https://git.eeqj.de/sneak/mfer).

The current version is pre-1.0 and, while the repo was published in 2022,
there has not yet been any versioned release. [SemVer](https://semver.org)
will be used for releases.

This project was started by [@sneak](https://sneak.berlin) to scratch an
itch in 2022 and is currently a one-person effort, though the goal is for
this to emerge as a de facto standard and be incorporated into other
software. A compatible JavaScript library is planned.
# Phases
Manifest generation happens in two distinct phases:
## Phase 1: Enumeration
Walking directories and calling `stat()` on files to collect metadata (path, size, mtime, ctime). This builds the list of files to be scanned. This phase is relatively fast because it reads only filesystem metadata, not file contents.

**Progress:** `EnumerateStatus` with `FilesFound` and `BytesFound`
## Phase 2: Scan (ToManifest)
Reading file contents and computing cryptographic hashes for manifest generation. This is the expensive phase because it reads all file data from disk.

**Progress:** `ScanStatus` with `TotalFiles`, `ScannedFiles`, `TotalBytes`, `ScannedBytes`, `BytesPerSec`, `ETA`
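
The split between the two phases can be illustrated with a minimal, self-contained sketch that uses only the Go standard library. It does not use mfer's scanner API: the names `entry`, `enumerate`, `hashAll`, and `demo` are hypothetical and exist only for this illustration, and SHA-256 is assumed as the hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// entry is a hypothetical stand-in for one enumerated file.
type entry struct {
	Path string // path relative to the scanned root
	Size int64  // size in bytes, taken from filesystem metadata
}

// enumerate is phase 1: walk root and record metadata without reading contents.
func enumerate(root string) (entries []entry, totalBytes int64, err error) {
	err = filepath.WalkDir(root, func(p string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if !d.Type().IsRegular() {
			return nil // directories, symlinks, and devices are not hashed
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		entries = append(entries, entry{Path: rel, Size: info.Size()})
		totalBytes += info.Size()
		return nil
	})
	return entries, totalBytes, err
}

// hashAll is phase 2: read every enumerated file and compute its SHA-256.
func hashAll(root string, entries []entry) (map[string]string, error) {
	sums := make(map[string]string, len(entries))
	for _, e := range entries {
		f, err := os.Open(filepath.Join(root, e.Path))
		if err != nil {
			return nil, err
		}
		h := sha256.New()
		_, err = io.Copy(h, f)
		f.Close()
		if err != nil {
			return nil, err
		}
		sums[e.Path] = hex.EncodeToString(h.Sum(nil))
	}
	return sums, nil
}

// demo builds a two-file temporary tree and runs both phases over it.
func demo() (files int, totalBytes int64, sumA string, err error) {
	dir, err := os.MkdirTemp("", "mfer-sketch-")
	if err != nil {
		return 0, 0, "", err
	}
	defer os.RemoveAll(dir)
	if err = os.MkdirAll(filepath.Join(dir, "sub"), 0o755); err != nil {
		return 0, 0, "", err
	}
	if err = os.WriteFile(filepath.Join(dir, "a.txt"), []byte("abc"), 0o644); err != nil {
		return 0, 0, "", err
	}
	if err = os.WriteFile(filepath.Join(dir, "sub", "b.txt"), []byte("hello"), 0o644); err != nil {
		return 0, 0, "", err
	}
	entries, totalBytes, err := enumerate(dir)
	if err != nil {
		return 0, 0, "", err
	}
	sums, err := hashAll(dir, entries)
	if err != nil {
		return 0, 0, "", err
	}
	return len(entries), totalBytes, sums["a.txt"], nil
}

func main() {
	files, totalBytes, sumA, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("enumerated %d files, %d bytes\n", files, totalBytes)
	fmt.Println("sha256(a.txt) =", sumA)
}
```

Because phase 1 already knows the total byte count, phase 2 can report meaningful progress from its first file onward.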
# Code Conventions
- **Logging:** Never use `fmt.Printf` or write to stdout/stderr directly in normal code. Use the `internal/log` package for all output (`log.Info`, `log.Infof`, `log.Debug`, `log.Debugf`, `log.Progressf`, `log.ProgressDone`).
- **Filesystem abstraction:** Use `github.com/spf13/afero` for filesystem operations to enable testing and flexibility.
- **CLI framework:** Use `github.com/urfave/cli/v2` for the command-line interface.
- **Serialization:** Use Protocol Buffers for the manifest file format.
- **Internal packages:** Non-exported implementation details go in `internal/` subdirectories.
- **Concurrency:** Use `sync.RWMutex` for protecting shared state; prefer channels for progress reporting.
- **Progress channels:** Use buffered channels (size 1) with non-blocking sends to avoid blocking the main operation if the consumer is slow.
- **Context support:** Long-running operations should accept `context.Context` for cancellation.
- **NO_COLOR:** Respect the `NO_COLOR` environment variable for disabling colored output.
- **Options pattern:** Use the `NewWithOptions(opts *Options)` constructor pattern for configurable types.
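
The progress-channel and context conventions above can be sketched as follows. This is an illustration only, not code from the repository: `Status`, `sendLatest`, and `work` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// Status is a hypothetical progress snapshot.
type Status struct{ Done, Total int }

// sendLatest offers s to ch without ever blocking the producer.
// It reports false when the buffer is full and the update was dropped.
func sendLatest(ch chan<- Status, s Status) bool {
	select {
	case ch <- s:
		return true
	default:
		return false
	}
}

// work simulates a long-running operation that reports progress and
// stops early when ctx is cancelled.
func work(ctx context.Context, total int, progress chan<- Status) (int, error) {
	for i := 1; i <= total; i++ {
		if err := ctx.Err(); err != nil {
			return i - 1, err
		}
		sendLatest(progress, Status{Done: i, Total: total})
	}
	return total, nil
}

func main() {
	progress := make(chan Status, 1) // buffered, size 1
	done, err := work(context.Background(), 5, progress)
	fmt.Println(done, err)
	// Nothing consumed the channel during work, so only the first
	// update fit in the buffer and the later ones were dropped.
	fmt.Println(<-progress)
}
```

One consequence of this simple form is that a slow consumer sees the oldest unread update rather than the newest. A variant that drains the channel before sending would keep the newest update instead; which behavior mfer uses is not stated here.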
# Codebase Structure
## cmd/mfer/
### main.go
- **Variables**
  - `Appname string` - Application name
  - `Version string` - Version string (set at build time)
  - `Gitrev string` - Git revision (set at build time)
## internal/cli/
### entry.go
- **Variables**
  - `NO_COLOR bool` - Disables color output when `NO_COLOR` env var is set
- **Functions**
  - `Run(Appname, Version, Gitrev string) int` - Main entry point for the CLI
### mfer.go
- **Types**
  - `CLIApp struct` - Main CLI application container
- **Methods**
  - `(*CLIApp) VersionString() string` - Returns formatted version string
## internal/log/
### log.go
- **Functions**
  - `Init()` - Initializes the logger
  - `Info(arg string)` - Logs at info level
  - `Infof(format string, args ...interface{})` - Logs at info level with formatting
  - `Debug(arg string)` - Logs at debug level with caller info
  - `Debugf(format string, args ...interface{})` - Logs at debug level with formatting and caller info
  - `Dump(args ...interface{})` - Logs spew dump at debug level
  - `Progressf(format string, args ...interface{})` - Prints progress message (overwrites current line)
  - `ProgressDone()` - Completes progress line with newline
  - `EnableDebugLogging()` - Sets log level to debug
  - `SetLevel(arg log.Level)` - Sets log level
  - `SetLevelFromVerbosity(l int)` - Sets log level from verbosity count
  - `GetLevel() log.Level` - Returns current log level
  - `GetLogger() *log.Logger` - Returns underlying logger
  - `WithError(e error) *log.Entry` - Returns log entry with error attached
  - `DisableStyling()` - Disables colors and styling (for `NO_COLOR`)
## internal/scanner/
### scanner.go
- **Types**
  - `Options struct` - Options for scanner behavior
    - `IncludeDotfiles bool` - Include dot (hidden) files (excluded by default)
    - `FollowSymLinks bool`
  - `EnumerateStatus struct` - Progress information for enumeration phase
    - `FilesFound int64`
    - `BytesFound int64`
  - `ScanStatus struct` - Progress information for scan phase
    - `TotalFiles int64`
    - `ScannedFiles int64`
    - `TotalBytes int64`
    - `ScannedBytes int64`
    - `BytesPerSec float64`
    - `ETA time.Duration`
  - `FileEntry struct` - Represents an enumerated file
    - `Path string` - Relative path (used in manifest)
    - `AbsPath string` - Absolute path (used for reading file content)
    - `Size int64`
    - `Mtime time.Time`
    - `Ctime time.Time`
  - `Scanner struct` - Accumulates files and generates manifests
- **Functions**
  - `New() *Scanner` - Creates a new Scanner with default options
  - `NewWithOptions(opts *Options) *Scanner` - Creates a new Scanner with given options
- **Methods (Enumeration Phase)**
  - `(*Scanner) EnumerateFile(path string) error` - Enumerates a single file, calling stat() for metadata
  - `(*Scanner) EnumeratePath(inputPath string, progress chan<- EnumerateStatus) error` - Walks a directory and enumerates all files
  - `(*Scanner) EnumeratePaths(progress chan<- EnumerateStatus, inputPaths ...string) error` - Walks multiple directories
  - `(*Scanner) EnumerateFS(afs afero.Fs, basePath string, progress chan<- EnumerateStatus) error` - Walks an afero filesystem
- **Methods (Accessors)**
  - `(*Scanner) Files() []*FileEntry` - Returns copy of all enumerated files
  - `(*Scanner) FileCount() int64` - Returns number of files
  - `(*Scanner) TotalBytes() int64` - Returns total size of all files
- **Methods (Scan Phase)**
  - `(*Scanner) ToManifest(ctx context.Context, w io.Writer, progress chan<- ScanStatus) error` - Reads file contents, computes hashes, generates manifest
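
How `BytesPerSec` and `ETA` are derived is not documented above. The following is a minimal sketch under the assumption that throughput is the average since the scan started and that the ETA is the remaining bytes divided by that throughput. The local `ScanStatus` type mirrors the field list above, and `estimate` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// ScanStatus mirrors the fields listed above; it is redeclared here so
// the sketch is self-contained.
type ScanStatus struct {
	TotalFiles   int64
	ScannedFiles int64
	TotalBytes   int64
	ScannedBytes int64
	BytesPerSec  float64
	ETA          time.Duration
}

// estimate fills BytesPerSec and ETA from the bytes scanned so far and
// the elapsed wall-clock time in seconds (assumed average-rate model).
func estimate(s ScanStatus, elapsedSeconds float64) ScanStatus {
	if elapsedSeconds <= 0 || s.ScannedBytes <= 0 {
		return s // not enough information to estimate yet
	}
	s.BytesPerSec = float64(s.ScannedBytes) / elapsedSeconds
	remaining := s.TotalBytes - s.ScannedBytes
	if remaining < 0 {
		remaining = 0
	}
	s.ETA = time.Duration(float64(remaining) / s.BytesPerSec * float64(time.Second))
	return s
}

func main() {
	s := estimate(ScanStatus{TotalBytes: 1000, ScannedBytes: 250}, 5)
	fmt.Println(s.BytesPerSec, s.ETA)
}
```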
## internal/checker/
### checker.go
- **Types**
  - `Result struct` - Outcome of checking a single file
    - `Path string` - File path from manifest
    - `Status Status` - Verification status
    - `Message string` - Error or status message
  - `Status int` - Verification status enumeration
    - `StatusOK` - File matches manifest
    - `StatusMissing` - File not found
    - `StatusSizeMismatch` - File size differs from manifest
    - `StatusHashMismatch` - File hash differs from manifest
    - `StatusError` - Error occurred during verification
  - `CheckStatus struct` - Progress information for check operation
    - `TotalFiles int64`
    - `CheckedFiles int64`
    - `TotalBytes int64`
    - `CheckedBytes int64`
    - `BytesPerSec float64`
    - `ETA time.Duration`
    - `Failures int64`
  - `Checker struct` - Verifies files against a manifest
- **Functions**
  - `NewChecker(manifestPath string, basePath string) (*Checker, error)` - Creates a new Checker for the given manifest and base path
- **Methods**
  - `(s Status) String() string` - Returns string representation of status
  - `(*Checker) FileCount() int64` - Returns number of files in the manifest
  - `(*Checker) TotalBytes() int64` - Returns total size of all files in manifest
  - `(*Checker) Check(ctx context.Context, results chan<- Result, progress chan<- CheckStatus) error` - Verifies all files against the manifest
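
The status values above suggest a per-file decision order of existence, then size, then hash. The sketch below illustrates that order against an in-memory map rather than a real filesystem. It is not the checker's implementation: the numeric values of the constants, the strings returned by `String()`, and the `expected` and `checkOne` names are all assumptions made for the example, as is the choice of SHA-256.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Status mirrors the enumeration listed above (values are assumed).
type Status int

const (
	StatusOK Status = iota
	StatusMissing
	StatusSizeMismatch
	StatusHashMismatch
	StatusError
)

// String returns a label for s; the labels are illustrative only.
func (s Status) String() string {
	switch s {
	case StatusOK:
		return "OK"
	case StatusMissing:
		return "MISSING"
	case StatusSizeMismatch:
		return "SIZE_MISMATCH"
	case StatusHashMismatch:
		return "HASH_MISMATCH"
	default:
		return "ERROR"
	}
}

// expected is a hypothetical manifest record for one file.
type expected struct {
	Size   int64
	SHA256 string // lowercase hex digest
}

// checkOne compares one in-memory file with its manifest record. The
// cheap size comparison runs before the hash so that obviously wrong
// files are rejected without hashing their contents.
func checkOne(files map[string][]byte, path string, want expected) Status {
	data, found := files[path]
	if !found {
		return StatusMissing
	}
	if int64(len(data)) != want.Size {
		return StatusSizeMismatch
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != want.SHA256 {
		return StatusHashMismatch
	}
	return StatusOK
}

func main() {
	files := map[string][]byte{"a.txt": []byte("abc")}
	want := expected{
		Size:   3,
		SHA256: "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
	}
	fmt.Println(checkOne(files, "a.txt", want))
	fmt.Println(checkOne(files, "b.txt", want))
}
```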
## mfer/
### manifest.go
- **Types**
  - `ManifestScanOptions struct` - Options for scanning directories
    - `IncludeDotfiles bool` - Include dot (hidden) files (excluded by default)
    - `FollowSymLinks bool`
- **Functions**
  - `New() *manifest` - Creates a new empty manifest
  - `NewFromPaths(options *ManifestScanOptions, inputPaths ...string) (*manifest, error)` - Creates manifest from filesystem paths
  - `NewFromFS(options *ManifestScanOptions, fs afero.Fs) (*manifest, error)` - Creates manifest from afero filesystem
- **Methods**
  - `(*manifest) HasError() bool` - Returns true if manifest has errors
  - `(*manifest) AddError(e error) *manifest` - Adds an error to the manifest
  - `(*manifest) WithContext(c context.Context) *manifest` - Sets context for cancellation
  - `(*manifest) GetFileCount() int64` - Returns number of files in manifest
  - `(*manifest) GetTotalFileSize() int64` - Returns total size of all files
  - `(*manifest) Files() []*MFFilePath` - Returns all file entries from a loaded manifest
  - `(*manifest) Scan() error` - Scans source filesystems and populates file list
### output.go
- **Methods**
  - `(*manifest) WriteToFile(path string) error` - Writes manifest to file path
  - `(*manifest) WriteTo(output io.Writer) error` - Writes manifest to io.Writer
### builder.go
- **Types**
  - `FileProgress func(bytesRead int64)` - Callback for file processing progress
  - `Builder struct` - Constructs manifests by adding files one at a time
- **Functions**
  - `NewBuilder() *Builder` - Creates a new Builder
- **Methods**
  - `(*Builder) AddFile(path string, size int64, mtime time.Time, reader io.Reader, progress FileProgress) (int64, error)` - Reads file, computes hash, adds to manifest
  - `(*Builder) FileCount() int` - Returns number of files added
  - `(*Builder) Build(w io.Writer) error` - Finalizes and writes manifest
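
A progress callback of the `FileProgress` shape can be driven by wrapping the reader while its contents are hashed. The sketch below shows one way to do that with the standard library. It is not the Builder's code: `progressReader`, `hashWithProgress`, and `hashBytes` are hypothetical, and it assumes `bytesRead` means the cumulative count for the current file, which the listing above does not specify.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// FileProgress matches the callback signature listed above.
type FileProgress func(bytesRead int64)

// progressReader counts bytes as they pass through and reports the
// running total (an assumption about what bytesRead means).
type progressReader struct {
	r        io.Reader
	total    int64
	progress FileProgress
}

func (p *progressReader) Read(b []byte) (int, error) {
	n, err := p.r.Read(b)
	if n > 0 {
		p.total += int64(n)
		if p.progress != nil {
			p.progress(p.total)
		}
	}
	return n, err
}

// hashWithProgress streams r through SHA-256, invoking progress as data
// is consumed, and returns the hex digest and the number of bytes read.
func hashWithProgress(r io.Reader, progress FileProgress) (string, int64, error) {
	h := sha256.New()
	n, err := io.Copy(h, &progressReader{r: r, progress: progress})
	if err != nil {
		return "", n, err
	}
	return hex.EncodeToString(h.Sum(nil)), n, nil
}

// hashBytes is a convenience wrapper for in-memory data.
func hashBytes(data []byte, progress FileProgress) (string, int64, error) {
	return hashWithProgress(bytes.NewReader(data), progress)
}

func main() {
	sum, n, err := hashBytes([]byte("abc"), func(read int64) {
		fmt.Println("bytes read so far:", read)
	})
	fmt.Println(sum, n, err)
}
```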
### serialize.go
- **Constants**
  - `MAGIC string` - Magic bytes prefix for manifest files ("ZNAVSRFG")
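
A reader can use the magic prefix to reject non-manifest input before attempting to parse it. The sketch below checks and strips the prefix only. What follows the prefix is not described in this section, so the sketch leaves the remaining bytes untouched; `stripMagic` and `errBadMagic` are hypothetical names.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// magic is the prefix value documented above.
const magic = "ZNAVSRFG"

var errBadMagic = errors.New("missing manifest magic prefix")

// stripMagic verifies that data starts with the magic bytes and returns
// the remainder. It makes no assumption about the remainder's layout.
func stripMagic(data []byte) ([]byte, error) {
	if !bytes.HasPrefix(data, []byte(magic)) {
		return nil, errBadMagic
	}
	return data[len(magic):], nil
}

func main() {
	body, err := stripMagic([]byte(magic + "payload"))
	fmt.Println(string(body), err)

	_, err = stripMagic([]byte("not a manifest"))
	fmt.Println(err)
}
```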
### deserialize.go
- **Functions**
  - `NewFromProto(input io.Reader) (*manifest, error)` - Deserializes manifest from protobuf
  - `NewManifestFromReader(input io.Reader) (*manifest, error)` - Reads and parses manifest from io.Reader
  - `NewManifestFromFile(path string) (*manifest, error)` - Reads and parses manifest from file path
### mf.pb.go (generated from mf.proto)
- **Enum Types**
  - `MFFileOuter_Version` - Outer file format version
    - `MFFileOuter_VERSION_NONE`
    - `MFFileOuter_VERSION_ONE`
  - `MFFileOuter_CompressionType` - Compression type for inner message
    - `MFFileOuter_COMPRESSION_NONE`
    - `MFFileOuter_COMPRESSION_ZSTD`
  - `MFFile_Version` - Inner file format version
    - `MFFile_VERSION_NONE`
    - `MFFile_VERSION_ONE`
- **Message Types**
  - `Timestamp struct` - Timestamp with seconds and nanoseconds
    - `GetSeconds() int64`
    - `GetNanos() int32`
  - `MFFileOuter struct` - Outer wrapper containing compressed/signed inner message
    - `GetVersion() MFFileOuter_Version`
    - `GetCompressionType() MFFileOuter_CompressionType`
    - `GetSize() int64`
    - `GetSha256() []byte`
    - `GetInnerMessage() []byte`
    - `GetSignature() []byte`
    - `GetSigner() []byte`
    - `GetSigningPubKey() []byte`
  - `MFFilePath struct` - Individual file entry in manifest
    - `GetPath() string`
    - `GetSize() int64`
    - `GetHashes() []*MFFileChecksum`
    - `GetMimeType() string`
    - `GetMtime() *Timestamp`
    - `GetCtime() *Timestamp`
    - `GetAtime() *Timestamp`
  - `MFFileChecksum struct` - File checksum using multihash
    - `GetMultiHash() []byte`
  - `MFFile struct` - Inner manifest containing file list
    - `GetVersion() MFFile_Version`
    - `GetFiles() []*MFFilePath`
    - `GetCreatedAt() *Timestamp`
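
For readers unfamiliar with multihash: a multihash is a hash-function code, a digest length, and then the digest itself. For sha2-256 the code is `0x12` and the length is `0x20` (32 bytes), so the encoded value is 34 bytes. The sketch below encodes and decodes that one case with the standard library. It is an illustration, not the code mfer uses, and the function names are hypothetical. Both header fields are varints in the general format, but they fit in a single byte each here.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

const (
	codeSHA256 = 0x12 // multihash function code for sha2-256
	lenSHA256  = 0x20 // digest length in bytes (32)
)

// encodeSHA256Multihash hashes data and prefixes the digest with the
// sha2-256 multihash header.
func encodeSHA256Multihash(data []byte) []byte {
	sum := sha256.Sum256(data)
	out := make([]byte, 0, 2+len(sum))
	out = append(out, codeSHA256, lenSHA256)
	return append(out, sum[:]...)
}

// decodeSHA256Multihash validates the header and returns the raw digest.
// Any other hash function is rejected by this sketch.
func decodeSHA256Multihash(mh []byte) ([]byte, error) {
	if len(mh) != 2+lenSHA256 {
		return nil, errors.New("unexpected multihash length")
	}
	if mh[0] != codeSHA256 || mh[1] != lenSHA256 {
		return nil, errors.New("not a sha2-256 multihash")
	}
	return mh[2:], nil
}

func main() {
	mh := encodeSHA256Multihash([]byte("abc"))
	fmt.Printf("%x\n", mh)

	digest, err := decodeSHA256Multihash(mh)
	fmt.Printf("%x %v\n", digest, err)
}
```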
# Build Status
[![Build Status](https://drone.datavi.be/api/badges/sneak/mfer/status.svg)](https://drone.datavi.be/sneak/mfer)
# Participation
The community is as yet nonexistent, so there are no defined policies or
norms. Primary development happens on a privately-run Gitea instance at
[https://git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer) and issues
are [tracked there](https://git.eeqj.de/sneak/mfer/issues).

Changes must always be formatted with a standard `go fmt`, be syntactically
valid, and pass the linting defined in the repository (presently only
the `golangci-lint` defaults), which can be run with `make lint`. The
`main` branch is protected and all changes must be made via [pull
requests](https://git.eeqj.de/sneak/mfer/pulls) and pass CI to be merged.

Any changes submitted to this project must also be
[WTFPL-licensed](https://wtfpl.net) to be considered.
# Problem Statement
Given a plain URL, there is no standard way to safely and programmatically
download everything "under" that URL path. `wget -r` can traverse directory
listings if they're enabled, but every server has a different format, and
this does not verify the cryptographic integrity of the files or enable them
to be fetched using a protocol other than HTTP(S).

Currently, the solution people use is sidecar files in the
format of `SHASUMS` checksum files, along with a `SHASUMS.asc` PGP detached
signature. This is not checksum-algorithm-agnostic and the sidecar file is
not always consistently named.

Real issues I face:

- when I plug in an exFAT hard drive, I don't know if any files on the
  filesystem are corrupted or missing
  - the current ad-hoc solution is `SHASUMS`/`SHASUMS.asc` files
- when I want to mirror an HTTP archive, I have to use special tools like
  debmirror that understand the archive format
  - the Debian repository metadata structure is hot garbage
- when I download a large file via HTTP, I have no way of knowing if the
  file content is what it's supposed to be
# Proposed Solution
A standard, a manifest file format, and a tool for generating same.

The manifest file would be called `index.mf`, and the tool for generating such would be called `mfer`.

The manifest file would do several important things:

- have a standard filename, so if given
  `https://example.com/downloadpackage/` one could fetch
  `https://example.com/downloadpackage/index.mf` to enumerate the full
  directory listing.
- contain a version field for extensibility
- contain structured data (protobuf, JSON, or CBOR)
- provide an inner signed container, so that the manifest file itself can
  embed a signature and a public key alongside in a single file
- contain a list of files, each with a path relative to the manifest
- contain a manifest timestamp
- contain ctime/mtime information for files so that file metadata can be
  preserved
- contain cryptographic checksums in several different algorithms for each
  file
  - probably encoded with multihash to indicate algo + hash
  - sha256 at the minimum
  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file,
    which likely involves doing an IPFS file object chunking
  - maybe even including the complete IPFS/IPLD directory tree objects and
    chunklists?
    - this is because generating an `index.mf` does not imply publishing on
      IPFS at that time
- maybe a BitTorrent chunklist for torrent client compatibility? perhaps a
  top-level infohash for the whole manifest?
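
The standard-filename idea means a client can derive the manifest location from a directory URL alone. A minimal sketch of that derivation follows, using only `net/url`. The `manifestURL` helper is hypothetical, and treating a URL without a trailing slash as a directory is an assumption of this example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// manifestURL returns the location of index.mf for a directory URL.
// Any query string or fragment on the input is not carried over.
func manifestURL(dirURL string) (string, error) {
	u, err := url.Parse(dirURL)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", dirURL)
	}
	// Assumption: the input names a directory even without a trailing slash.
	if !strings.HasSuffix(u.Path, "/") {
		u.Path += "/"
	}
	return u.ResolveReference(&url.URL{Path: "index.mf"}).String(), nil
}

func main() {
	for _, in := range []string{
		"https://example.com/downloadpackage/",
		"https://example.com/downloadpackage",
	} {
		out, err := manifestURL(in)
		fmt.Println(out, err)
	}
}
```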
# Design Goals
- replace SHASUMS/SHASUMS.asc files
- be easy to download/resume a whole directory tree published via HTTP
- be easy to use across protocols (given an HTTPS URL, fetch manifest, then
  download file contents via BitTorrent or IPFS)
- not be strongly coupled to the HTTP use case; should not require special
  hosting, content types, or HTTP headers being sent
# Non-Goals
- Manifest generation speed
  - likely involves IPFS chunking, BitTorrent chunking, and several
    different cryptographic hash functions over the entirety of each and
    every file
- Small manifest file size (within reason)
  - 30MiB files are "small" these days, given modern storage/bandwidth
  - metadata size should not be used as an excuse to sacrifice utility (such
    as providing checksums over each chunk of a large file)
# Limitations
- **Manifest size:** Manifests must fit entirely in system memory during reading and writing.
# TODO
## Medium Priority
- [x] **Atomic writes for `mfer gen`** - Writes to temp file then atomic rename; cleans up temp file on error/interrupt.
- [ ] **Change FileProgress callback to channel** - `mfer/builder.go` uses a callback for progress reporting; should use channels like `EnumerateStatus` and `ScanStatus` for consistency.
- [ ] **Consolidate legacy manifest code** - `mfer/manifest.go` has old scanning code (`Scan()`, `addFile()`) that duplicates the new `internal/scanner` + `mfer/builder.go` pattern.
- [ ] **Add context cancellation to legacy code** - The old `manifest.Scan()` doesn't support context cancellation; the new scanner does.
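
The temp-file-then-rename approach named in the atomic-writes item is a common pattern. A minimal sketch of it follows. This is not the code in `mfer gen`: `writeFileAtomic` and `demo` are hypothetical, and the sketch handles errors only, not interrupts or signals.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file in the same directory
// as path, then renames it into place. Readers see either the old file
// or the complete new one, never a partial write. The temp file is
// removed if any step fails.
func writeFileAtomic(path string, data []byte) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), "."+filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			tmp.Close()           // harmless if already closed
			os.Remove(tmp.Name()) // do not leave a partial file behind
		}
	}()
	if _, err = tmp.Write(data); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil {
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo overwrites the same target twice in a scratch directory and
// reports the final content and how many directory entries remain.
func demo() (content string, entries int, err error) {
	dir, err := os.MkdirTemp("", "mfer-atomic-")
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)
	target := filepath.Join(dir, "index.mf")
	for _, body := range []string{"first", "second"} {
		if err = writeFileAtomic(target, []byte(body)); err != nil {
			return "", 0, err
		}
	}
	data, err := os.ReadFile(target)
	if err != nil {
		return "", 0, err
	}
	list, err := os.ReadDir(dir)
	if err != nil {
		return "", 0, err
	}
	return string(data), len(list), nil
}

func main() {
	content, entries, err := demo()
	fmt.Println(content, entries, err)
}
```

Creating the temporary file in the destination directory keeps the rename on one filesystem, which is what makes it atomic on POSIX systems. Note that `os.CreateTemp` creates files with mode 0600, so a real tool may want to adjust permissions before the rename.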
## Lower Priority
- [x] **Add unit tests for `internal/checker`** - 88.5% coverage.
- [x] **Add unit tests for `internal/scanner`** - 80.1% coverage.
- [x] **Clean up FIXMEs in manifest.go** - Validated input paths, validated filesystem, added context cancellation.
- [x] **Validate input paths before scanning** - Fails fast with clear error if paths don't exist.
# Open Questions
- Should the manifest file include checksums of individual file chunks, or just for the whole assembled file?
  - If so, should the chunk size be fixed or dynamic?
- Should the manifest signature format be GnuPG signatures, or those from
  OpenBSD's signify (of which there is a good [golang
  implementation](https://github.com/frankbraun/gosignify))?
- Should the on-disk serialization format be proto3 or JSON?
# Tool Examples
- `mfer gen` / `mfer gen .`
  - recurses under the current directory and writes out an `index.mf`
- `mfer check` / `mfer check .`
  - verifies checksums of all files in the manifest, displaying an error and
    exiting nonzero if any files are missing or corrupted
- `mfer fetch https://example.com/stuff/`
  - fetches `/stuff/index.mf` and downloads all files listed in the manifest,
    optionally resuming any that already exist locally, and verifies the
    cryptographic integrity of downloaded files.
# Implementation Plan
## Phase One:
- golang module for reusability/embedding
- golang module client providing `mfer` CLI
## Phase Two:
- ES6 or TypeScript module for reusability/embedding
- ES6/TypeScript module client providing `mfer.js` CLI
# Hopes And Dreams
- `aria2c https://example.com/manifestdirectory/`
  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and
    checksums all files, resumes any that exist locally already)
- `mfer fetch https://example.com/manifestdirectory/`
- a command line option to zero/omit mtime/ctime, as well as manifest
  timestamp, and sort all directory listings so that manifest file
  generation is deterministic/reproducible
- URL format `mfer fetch https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2`
  to assert in the URL which PGP signing key should be used in the manifest,
  so that shared URLs have a cryptographic trust root
- a "well-known" key in the manifest that maps well known keys (could reuse
  the HTTP spec) to specific file paths in the manifest.
  - example: a `berlin.sneak.app.slideshow` key that maps to a JSON
    slideshow config listing what image paths to show, and for how long, and
    in what order
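
The `?key=` idea above could be handled on the client roughly as follows. This is a speculative sketch of a feature that does not exist yet: `pinnedKey` is a hypothetical helper, and it assumes the value is a 40-character hexadecimal PGP v4 fingerprint, as in the example URL.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net/url"
	"strings"
)

// pinnedKey extracts the signing-key fingerprint from the "key" query
// parameter. It returns "" when the URL pins no key, and an error when
// the value is not a 40-character hexadecimal string.
func pinnedKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	key := u.Query().Get("key")
	if key == "" {
		return "", nil // no trust root asserted in the URL
	}
	if len(key) != 40 {
		return "", fmt.Errorf("key fingerprint must be 40 hex characters, got %d", len(key))
	}
	if _, err := hex.DecodeString(key); err != nil {
		return "", fmt.Errorf("key fingerprint is not hexadecimal: %w", err)
	}
	return strings.ToUpper(key), nil
}

func main() {
	key, err := pinnedKey("https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2")
	fmt.Println(key, err)
}
```

A client would then compare this fingerprint with the signing key embedded in the fetched manifest and refuse to proceed on a mismatch.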
# Use Cases
## Web Images
I'd like to be able to put a bunch of images into a directory, generate a
manifest, and then point a slideshow client (such as an ambient display, or
a react app with the target directory in a query string arg) at that
statically hosted directory, and have it discover the full list of images
available at that URL.
## Software Distribution
I'd like to be able to resumably download a whole tree of files published
via HTTP, using either HTTP or IPFS/BitTorrent, without needing a .torrent
file.
## Filesystem Archive Integrity
I use filesystems that don't include data checksums, and I would like a
cryptographically signed checksum file so that I can later verify that a set
of archive files have not been modified, none are missing, and that the
checksums have not been altered in storage by a second party.
## Filesystem-Independent Checksums
I would like to be able to plug in a hard drive or flash drive and, if there
is an `index.mf` in the root, automatically detect missing/corrupted files,
regardless of filesystem format.
# Collaboration
Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your
desired username for an account on this Gitea instance.
## Links
* Repo: [https://git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer)
* Issues: [https://git.eeqj.de/sneak/mfer/issues](https://git.eeqj.de/sneak/mfer/issues)
# Authors
* [@sneak &lt;sneak@sneak.berlin&gt;](mailto:sneak@sneak.berlin)
# License
* [WTFPL](https://wtfpl.net)