# mfer

[mfer](https://git.eeqj.de/sneak/mfer) is a reference implementation library
and thin wrapper command-line utility written in [Go](https://golang.org)
and first published in 2022 under the [WTFPL](https://wtfpl.net) (public
domain) license. It specifies and generates `.mf` manifest files over a
directory tree of files to encapsulate metadata about them (such as
cryptographic checksums or signatures over same) to aid in archiving,
downloading, streaming, or mirroring. The manifest files' data is
serialized with Google's [protobuf serialization
format](https://developers.google.com/protocol-buffers). The structure of
these files can be found [in the format
specification](https://git.eeqj.de/sneak/mfer/src/branch/main/mfer/mf.proto)
which is included in the [project
repository](https://git.eeqj.de/sneak/mfer).

The current version is pre-1.0 and while the repo was published in 2022,
there has not yet been any versioned release. [SemVer](https://semver.org)
will be used for releases.

This project was started by [@sneak](https://sneak.berlin) to scratch an
itch in 2022 and is currently a one-person effort, though the goal is for
this to emerge as a de facto standard and be incorporated into other
software. A compatible JavaScript library is planned.

# Phases

Manifest generation happens in two distinct phases:

## Phase 1: Enumeration

The scanner walks the input directories and calls `stat()` on each file to collect metadata (path, size, mtime, ctime), building the list of files to be scanned. This phase is relatively fast because it reads only filesystem metadata, not file contents.

**Progress:** `EnumerateStatus` with `FilesFound` and `BytesFound`

## Phase 2: Scan (ToManifest)

The scanner reads every file's contents and computes cryptographic hashes for the manifest. This is the expensive phase, since it reads all file data from disk.

**Progress:** `ScanStatus` with `TotalFiles`, `ScannedFiles`, `TotalBytes`, `ScannedBytes`, `BytesPerSec`

# Code Conventions

- **Logging:** Never use `fmt.Printf` or write to stdout/stderr directly in normal code. Use the `internal/log` package for all output (`log.Info`, `log.Infof`, `log.Debug`, `log.Debugf`, `log.Progressf`, `log.ProgressDone`).
- **Filesystem abstraction:** Use `github.com/spf13/afero` for filesystem operations to enable testing and flexibility.
- **CLI framework:** Use `github.com/urfave/cli/v2` for command-line interface.
- **Serialization:** Use Protocol Buffers for manifest file format.
- **Internal packages:** Non-exported implementation details go in `internal/` subdirectories.
- **Concurrency:** Use `sync.RWMutex` for protecting shared state; prefer channels for progress reporting.
- **Progress channels:** Use buffered channels (size 1) with non-blocking sends to avoid blocking the main operation if the consumer is slow.
- **Context support:** Long-running operations should accept `context.Context` for cancellation.
- **NO_COLOR:** Respect the `NO_COLOR` environment variable for disabling colored output.
- **Options pattern:** Use `NewWithOptions(opts *Options)` constructor pattern for configurable types.

# Codebase Structure

## cmd/mfer/

### main.go
- **Variables**
  - `Appname string` - Application name
  - `Version string` - Version string (set at build time)
  - `Gitrev string` - Git revision (set at build time)

## internal/cli/

### entry.go
- **Variables**
  - `NO_COLOR bool` - Disables color output when NO_COLOR env var is set
- **Functions**
  - `Run(Appname, Version, Gitrev string) int` - Main entry point for the CLI

### mfer.go
- **Types**
  - `CLIApp struct` - Main CLI application container
- **Methods**
  - `(*CLIApp) VersionString() string` - Returns formatted version string

## internal/log/

### log.go
- **Functions**
  - `Init()` - Initializes the logger
  - `Info(arg string)` - Logs at info level
  - `Infof(format string, args ...interface{})` - Logs at info level with formatting
  - `Debug(arg string)` - Logs at debug level with caller info
  - `Debugf(format string, args ...interface{})` - Logs at debug level with formatting and caller info
  - `Dump(args ...interface{})` - Logs spew dump at debug level
  - `Progressf(format string, args ...interface{})` - Prints progress message (overwrites current line)
  - `ProgressDone()` - Completes progress line with newline
  - `EnableDebugLogging()` - Sets log level to debug
  - `SetLevel(arg log.Level)` - Sets log level
  - `SetLevelFromVerbosity(l int)` - Sets log level from verbosity count
  - `GetLevel() log.Level` - Returns current log level
  - `GetLogger() *log.Logger` - Returns underlying logger
  - `WithError(e error) *log.Entry` - Returns log entry with error attached
  - `DisableStyling()` - Disables colors and styling (for NO_COLOR)

## internal/scanner/

### scanner.go
- **Types**
  - `Options struct` - Options for scanner behavior
    - `IgnoreDotfiles bool`
    - `FollowSymLinks bool`
  - `EnumerateStatus struct` - Progress information for enumeration phase
    - `FilesFound int64`
    - `BytesFound int64`
  - `ScanStatus struct` - Progress information for scan phase
    - `TotalFiles int64`
    - `ScannedFiles int64`
    - `TotalBytes int64`
    - `ScannedBytes int64`
    - `BytesPerSec float64`
  - `FileEntry struct` - Represents an enumerated file
    - `Path string` - Relative path (used in manifest)
    - `AbsPath string` - Absolute path (used for reading file content)
    - `Size int64`
    - `Mtime time.Time`
    - `Ctime time.Time`
  - `Scanner struct` - Accumulates files and generates manifests
- **Functions**
  - `New() *Scanner` - Creates a new Scanner with default options
  - `NewWithOptions(opts *Options) *Scanner` - Creates a new Scanner with given options
- **Methods (Enumeration Phase)**
  - `(*Scanner) EnumerateFile(path string) error` - Enumerates a single file, calling stat() for metadata
  - `(*Scanner) EnumeratePath(inputPath string, progress chan<- EnumerateStatus) error` - Walks a directory and enumerates all files
  - `(*Scanner) EnumeratePaths(progress chan<- EnumerateStatus, inputPaths ...string) error` - Walks multiple directories
  - `(*Scanner) EnumerateFS(afs afero.Fs, basePath string, progress chan<- EnumerateStatus) error` - Walks an afero filesystem
- **Methods (Accessors)**
  - `(*Scanner) Files() []*FileEntry` - Returns copy of all enumerated files
  - `(*Scanner) FileCount() int64` - Returns number of files
  - `(*Scanner) TotalBytes() int64` - Returns total size of all files
- **Methods (Scan Phase)**
  - `(*Scanner) ToManifest(ctx context.Context, w io.Writer, progress chan<- ScanStatus) error` - Reads file contents, computes hashes, generates manifest

## internal/checker/

### checker.go
- **Types**
  - `Result struct` - Outcome of checking a single file
    - `Path string` - File path from manifest
    - `Status Status` - Verification status
    - `Message string` - Error or status message
  - `Status int` - Verification status enumeration
    - `StatusOK` - File matches manifest
    - `StatusMissing` - File not found
    - `StatusSizeMismatch` - File size differs from manifest
    - `StatusHashMismatch` - File hash differs from manifest
    - `StatusError` - Error occurred during verification
  - `CheckStatus struct` - Progress information for check operation
    - `TotalFiles int64`
    - `CheckedFiles int64`
    - `TotalBytes int64`
    - `CheckedBytes int64`
    - `BytesPerSec float64`
    - `Failures int64`
  - `Checker struct` - Verifies files against a manifest
- **Functions**
  - `NewChecker(manifestPath string, basePath string) (*Checker, error)` - Creates a new Checker for the given manifest and base path
- **Methods**
  - `(s Status) String() string` - Returns string representation of status
  - `(*Checker) FileCount() int64` - Returns number of files in the manifest
  - `(*Checker) TotalBytes() int64` - Returns total size of all files in manifest
  - `(*Checker) Check(ctx context.Context, results chan<- Result, progress chan<- CheckStatus) error` - Verifies all files against the manifest

## mfer/

### manifest.go
- **Types**
  - `ManifestScanOptions struct` - Options for scanning directories
    - `IgnoreDotfiles bool`
    - `FollowSymLinks bool`
- **Functions**
  - `New() *manifest` - Creates a new empty manifest
  - `NewFromPaths(options *ManifestScanOptions, inputPaths ...string) (*manifest, error)` - Creates manifest from filesystem paths
  - `NewFromFS(options *ManifestScanOptions, fs afero.Fs) (*manifest, error)` - Creates manifest from afero filesystem
- **Methods**
  - `(*manifest) HasError() bool` - Returns true if manifest has errors
  - `(*manifest) AddError(e error) *manifest` - Adds an error to the manifest
  - `(*manifest) WithContext(c context.Context) *manifest` - Sets context for cancellation
  - `(*manifest) GetFileCount() int64` - Returns number of files in manifest
  - `(*manifest) GetTotalFileSize() int64` - Returns total size of all files
  - `(*manifest) Files() []*MFFilePath` - Returns all file entries from a loaded manifest
  - `(*manifest) Scan() error` - Scans source filesystems and populates file list

### output.go
- **Methods**
  - `(*manifest) WriteToFile(path string) error` - Writes manifest to file path
  - `(*manifest) WriteTo(output io.Writer) error` - Writes manifest to io.Writer

### builder.go
- **Types**
  - `FileProgress func(bytesRead int64)` - Callback for file processing progress
  - `Builder struct` - Constructs manifests by adding files one at a time
- **Functions**
  - `NewBuilder() *Builder` - Creates a new Builder
- **Methods**
  - `(*Builder) AddFile(path string, size int64, mtime time.Time, reader io.Reader, progress FileProgress) (int64, error)` - Reads file, computes hash, adds to manifest
  - `(*Builder) FileCount() int` - Returns number of files added
  - `(*Builder) Build(w io.Writer) error` - Finalizes and writes manifest

### serialize.go
- **Constants**
  - `MAGIC string` - Magic bytes prefix for manifest files ("ZNAVSRFG")

### deserialize.go
- **Functions**
  - `NewFromProto(input io.Reader) (*manifest, error)` - Deserializes manifest from protobuf
  - `NewManifestFromReader(input io.Reader) (*manifest, error)` - Reads and parses manifest from io.Reader
  - `NewManifestFromFile(path string) (*manifest, error)` - Reads and parses manifest from file path

### mf.pb.go (generated from mf.proto)
- **Enum Types**
  - `MFFileOuter_Version` - Outer file format version
    - `MFFileOuter_VERSION_NONE`
    - `MFFileOuter_VERSION_ONE`
  - `MFFileOuter_CompressionType` - Compression type for inner message
    - `MFFileOuter_COMPRESSION_NONE`
    - `MFFileOuter_COMPRESSION_GZIP`
  - `MFFile_Version` - Inner file format version
    - `MFFile_VERSION_NONE`
    - `MFFile_VERSION_ONE`
- **Message Types**
  - `Timestamp struct` - Timestamp with seconds and nanoseconds
    - `GetSeconds() int64`
    - `GetNanos() int32`
  - `MFFileOuter struct` - Outer wrapper containing compressed/signed inner message
    - `GetVersion() MFFileOuter_Version`
    - `GetCompressionType() MFFileOuter_CompressionType`
    - `GetSize() int64`
    - `GetSha256() []byte`
    - `GetInnerMessage() []byte`
    - `GetSignature() []byte`
    - `GetSigner() []byte`
    - `GetSigningPubKey() []byte`
  - `MFFilePath struct` - Individual file entry in manifest
    - `GetPath() string`
    - `GetSize() int64`
    - `GetHashes() []*MFFileChecksum`
    - `GetMimeType() string`
    - `GetMtime() *Timestamp`
    - `GetCtime() *Timestamp`
    - `GetAtime() *Timestamp`
  - `MFFileChecksum struct` - File checksum using multihash
    - `GetMultiHash() []byte`
  - `MFFile struct` - Inner manifest containing file list
    - `GetVersion() MFFile_Version`
    - `GetFiles() []*MFFilePath`
    - `GetCreatedAt() *Timestamp`

# Build Status

[Build status](https://drone.datavi.be/sneak/mfer)

# Participation

The community is as yet nonexistent so there are no defined policies or
norms yet. Primary development happens on a privately-run Gitea instance at
[https://git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer) and issues
are [tracked there](https://git.eeqj.de/sneak/mfer/issues).

Changes must be formatted with standard `go fmt`, be syntactically
valid, and pass the linting defined in the repository (presently only
the `golangci-lint` defaults), which can be run with `make lint`. The
`main` branch is protected and all changes must be made via [pull
requests](https://git.eeqj.de/sneak/mfer/pulls) and pass CI to be merged.
Any changes submitted to this project must also be
[WTFPL-licensed](https://wtfpl.net) to be considered.

# Problem Statement

Given a plain URL, there is no standard way to safely and programmatically
download everything "under" that URL path. `wget -r` can traverse directory
listings if they're enabled, but every server has a different format, and
this does not verify cryptographic integrity of the files, or enable them to
be fetched using a protocol other than HTTP(S).

Currently, the solution people use is sidecar files in the
format of `SHASUMS` checksum files, as well as a `SHASUMS.asc` PGP detached
signature. This is not checksum-algorithm-agnostic and the sidecar file is
not always consistently named.

Real issues I face:

- when I plug in an ExFAT hard drive, I don't know if any files on the
  filesystem are corrupted or missing
  - the current ad-hoc solution is `SHASUMS`/`SHASUMS.asc` files
- when I want to mirror an HTTP archive, I have to use special tools like
  debmirror that understand the archive format
  - the Debian repository metadata structure is hot garbage
- when I download a large file via HTTP, I have no way of knowing if the
  file content is what it's supposed to be

# Proposed Solution

A standard, a manifest file format, and a tool for generating same.

The manifest file would be called `index.mf`, and the tool for generating such would be called `mfer`.

The manifest file would do several important things:

- have a standard filename, so if given
  `https://example.com/downloadpackage/` one could fetch
  `https://example.com/downloadpackage/index.mf` to enumerate the full
  directory listing.
- contain a version field for extensibility
- contain structured data (protobuf, json, or cbor)
- provide an inner signed container, so that the manifest file itself can
  embed a signature and a public key alongside in a single file
- contain a list of files, each with a relative path to the manifest
- contain manifest timestamp
- contain ctime/mtime information for files so that file metadata can be
  preserved
- contain cryptographic checksums in several different algorithms for each
  file
  - probably encoded with multihash to indicate algo + hash
  - sha256 at the minimum
  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file,
    which likely involves doing an ipfs file object chunking
  - maybe even including the complete IPFS/IPLD directory tree objects and
    chunklists?
    - this is because generating an `index.mf` does not imply publishing on
      ipfs at that time
  - maybe a bittorrent chunklist for torrent client compatibility? perhaps a
    top-level infohash for the whole manifest?

# Design Goals

- Replace SHASUMS/SHASUMS.asc files
- be easy to download/resume a whole directory tree published via HTTP
- be easy to use across protocols (given an HTTPS url, fetch manifest, then
  download file contents via bittorrent or ipfs)
- not strongly coupled to HTTP use case, should not require special hosting,
  content types, or HTTP headers being sent

# Non-Goals

- Manifest generation speed
  - likely involves IPFS chunking, bittorrent chunking, and several
    different cryptographic hash functions over the entirety of each and
    every file
- Small manifest file size (within reason)
  - 30MiB files are "small" these days, given modern storage/bandwidth
  - metadata size should not be used as an excuse to sacrifice utility (such
    as providing checksums over each chunk of a large file)

# Limitations

- **Manifest size:** Manifests must fit entirely in system memory during reading and writing.

# TODO

## High Priority

- [ ] **Implement `fetch` command** - Currently panics with "not implemented". Should download a manifest and its referenced files from a URL.
- [ ] **Fix import in fetch.go** - Uses `github.com/apex/log` directly instead of `internal/log`, violating codebase conventions.

## Medium Priority

- [ ] **Add `--force` flag for overwrites** - Currently silently overwrites existing manifest files. Should require `-f` to overwrite.
- [ ] **Implement FollowSymLinks option** - The flag exists in CLI and Options structs but does nothing. Scanner should use `EvalSymlinks` or `Lstat`.
- [ ] **Change FileProgress callback to channel** - `mfer/builder.go` uses a callback for progress reporting; should use channels like `EnumerateStatus` and `ScanStatus` for consistency.
- [ ] **Consolidate legacy manifest code** - `mfer/manifest.go` has old scanning code (`Scan()`, `addFile()`, `generateInner()`) that duplicates the new `internal/scanner` + `mfer/builder.go` pattern.
- [ ] **Add context cancellation to legacy code** - The old `manifest.Scan()` doesn't support context cancellation; the new scanner does.

## Lower Priority

- [ ] **Add unit tests for `internal/checker`** - Currently has no test files; only tested indirectly via CLI tests.
- [ ] **Add unit tests for `internal/scanner`** - Currently has no test files.
- [ ] **Clean up FIXMEs in manifest.go** - Validate input paths exist, validate filesystem, avoid redundant stat calls.
- [ ] **Validate input paths before scanning** - Should fail fast with a clear error if paths don't exist.

# Open Questions

- Should the manifest file include checksums of individual file chunks, or just for the whole assembled file?

  - If so, should the chunk size be fixed or dynamic?

- Should the manifest signature format be GnuPG signatures, or those from
  OpenBSD's signify (of which there is a good [golang
  implementation](https://github.com/frankbraun/gosignify))?

- Should the on-disk serialization format be proto3 or json?

# Tool Examples

- `mfer gen` / `mfer gen .`
  - recurses under current directory and writes out an `index.mf`
- `mfer check` / `mfer check .`
  - verifies checksums of all files in manifest, displaying error and
    exiting nonzero if any files are missing or corrupted
- `mfer fetch https://example.com/stuff/`
  - fetches `/stuff/index.mf` and downloads all files listed in manifest,
    optionally resuming any that already exist locally, and assures
    cryptographic integrity of downloaded files.

# Implementation Plan

## Phase One:

- golang module for reusability/embedding
- golang module client providing `mfer` CLI

## Phase Two:

- ES6 or TypeScript module for reusability/embedding
- ES6/TypeScript module client providing `mfer.js` CLI

# Hopes And Dreams

- `aria2c https://example.com/manifestdirectory/`
  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and
    checksums all files, resumes any that exist locally already)
- `mfer fetch https://example.com/manifestdirectory/`
- a command line option to zero/omit mtime/ctime, as well as manifest
  timestamp, and sort all directory listings so that manifest file
  generation is deterministic/reproducible
- URL format `mfer fetch https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2`
  to assert in the URL which PGP signing key should be used in the manifest,
  so that shared URLs have a cryptographic trust root
- a "well-known" key in the manifest that maps well known keys (could reuse
  the http spec) to specific file paths in the manifest.
  - example: a `berlin.sneak.app.slideshow` key that maps to a json
    slideshow config listing what image paths to show, and for how long, and
    in what order

# Use Cases

## Web Images

I'd like to be able to put a bunch of images into a directory, generate a
manifest, and then point a slideshow client (such as an ambient display, or
a react app with the target directory in a query string arg) at that
statically hosted directory, and have it discover the full list of images
available at that URL.

## Software Distribution

I'd like to be able to download a whole tree of files available via HTTP
resumably by either HTTP or IPFS/BitTorrent without a .torrent file.

## Filesystem Archive Integrity

I use filesystems that don't include data checksums, and I would like a
cryptographically signed checksum file so that I can later verify that a set
of archive files have not been modified, none are missing, and that the
checksums have not been altered in storage by a second party.

## Filesystem-Independent Checksums

I would like to be able to plug in a hard drive or flash drive and, if there
is an `index.mf` in the root, automatically detect missing/corrupted files,
regardless of filesystem format.

# Collaboration

Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your
desired username for an account on this Gitea instance.

## Links

* Repo: [https://git.eeqj.de/sneak/mfer](https://git.eeqj.de/sneak/mfer)
* Issues: [https://git.eeqj.de/sneak/mfer/issues](https://git.eeqj.de/sneak/mfer/issues)

# Authors

* [@sneak <sneak@sneak.berlin>](mailto:sneak@sneak.berlin)

# License

* [WTFPL](https://wtfpl.net)