# mfer

Manifest file generator and checker.

# Phases

Manifest generation happens in two distinct phases:

## Phase 1: Enumeration

Walking directories and calling `stat()` on files to collect metadata (path, size, mtime, ctime). This builds the list of files to be scanned. Relatively fast as it only reads filesystem metadata, not file contents.

**Progress:** `EnumerateStatus` with `FilesFound` and `BytesFound`

## Phase 2: Scan (ToManifest)

Reading file contents and computing cryptographic hashes for manifest generation. This is the expensive phase that reads all file data from disk.

**Progress:** `ScanStatus` with `TotalFiles`, `ScannedFiles`, `TotalBytes`, `ScannedBytes`, `BytesPerSec`

# Code Conventions

- **Logging:** Never use `fmt.Printf` or write to stdout/stderr directly in normal code. Use the `internal/log` package for all output (`log.Info`, `log.Infof`, `log.Debug`, `log.Debugf`, `log.Progressf`, `log.ProgressDone`).
- **Filesystem abstraction:** Use `github.com/spf13/afero` for filesystem operations to enable testing and flexibility.
- **CLI framework:** Use `github.com/urfave/cli/v2` for command-line interface.
- **Serialization:** Use Protocol Buffers for manifest file format.
- **Internal packages:** Non-exported implementation details go in `internal/` subdirectories.
- **Concurrency:** Use `sync.RWMutex` for protecting shared state; prefer channels for progress reporting.
- **Progress channels:** Use buffered channels (size 1) with non-blocking sends to avoid blocking the main operation if the consumer is slow.
- **Context support:** Long-running operations should accept `context.Context` for cancellation.
- **NO_COLOR:** Respect the `NO_COLOR` environment variable for disabling colored output.
- **Options pattern:** Use `NewWithOptions(opts *Options)` constructor pattern for configurable types.
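
The progress-channel convention above (buffer of size 1, non-blocking send) can be sketched as follows. This is a minimal illustration, not the project's actual code: `sendStatus` is a hypothetical helper name, and only the `EnumerateStatus` field names are taken from this document.

```go
package main

import "fmt"

// EnumerateStatus uses the field names listed in this README;
// everything else in this file is illustrative.
type EnumerateStatus struct {
	FilesFound int64
	BytesFound int64
}

// sendStatus is a hypothetical helper. It tries to deliver a progress
// update without ever blocking the caller, and reports whether the
// update was delivered (true) or dropped (false).
func sendStatus(progress chan<- EnumerateStatus, st EnumerateStatus) bool {
	if progress == nil {
		return false // caller did not ask for progress updates
	}
	select {
	case progress <- st:
		return true
	default:
		// The single buffer slot is still full because the consumer
		// is behind; drop this update instead of stalling the walk.
		return false
	}
}

func main() {
	progress := make(chan EnumerateStatus, 1) // buffered, size 1

	fmt.Println(sendStatus(progress, EnumerateStatus{FilesFound: 1, BytesFound: 100})) // true
	fmt.Println(sendStatus(progress, EnumerateStatus{FilesFound: 2, BytesFound: 250})) // false
	fmt.Println(<-progress)                                                           // {1 100}
}
```

One consequence of this design is that a slow consumer may see a stale update rather than the newest one. That is usually acceptable for progress display, since the next successful send carries cumulative totals.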

# Codebase Structure

## cmd/mfer/

### main.go

- **Variables**
  - `Appname string` - Application name
  - `Version string` - Version string (set at build time)
  - `Gitrev string` - Git revision (set at build time)

## internal/cli/

### entry.go

- **Variables**
  - `NO_COLOR bool` - Disables color output when NO_COLOR env var is set
- **Functions**
  - `Run(Appname, Version, Gitrev string) int` - Main entry point for the CLI

### mfer.go

- **Types**
  - `CLIApp struct` - Main CLI application container
- **Methods**
  - `(*CLIApp) VersionString() string` - Returns formatted version string

## internal/log/

### log.go

- **Functions**
  - `Init()` - Initializes the logger
  - `Info(arg string)` - Logs at info level
  - `Infof(format string, args ...interface{})` - Logs at info level with formatting
  - `Debug(arg string)` - Logs at debug level with caller info
  - `Debugf(format string, args ...interface{})` - Logs at debug level with formatting and caller info
  - `Dump(args ...interface{})` - Logs spew dump at debug level
  - `Progressf(format string, args ...interface{})` - Prints progress message (overwrites current line)
  - `ProgressDone()` - Completes progress line with newline
  - `EnableDebugLogging()` - Sets log level to debug
  - `SetLevel(arg log.Level)` - Sets log level
  - `SetLevelFromVerbosity(l int)` - Sets log level from verbosity count
  - `GetLevel() log.Level` - Returns current log level
  - `GetLogger() *log.Logger` - Returns underlying logger
  - `WithError(e error) *log.Entry` - Returns log entry with error attached
  - `DisableStyling()` - Disables colors and styling (for NO_COLOR)

## internal/scanner/

### scanner.go

- **Types**
  - `Options struct` - Options for scanner behavior
    - `IgnoreDotfiles bool`
    - `FollowSymLinks bool`
  - `EnumerateStatus struct` - Progress information for enumeration phase
    - `FilesFound int64`
    - `BytesFound int64`
  - `ScanStatus struct` - Progress information for scan phase
    - `TotalFiles int64`
    - `ScannedFiles int64`
    - `TotalBytes int64`
    - `ScannedBytes int64`
    - `BytesPerSec float64`
  - `FileEntry struct` - Represents an enumerated file
    - `Path string` - Relative path (used in manifest)
    - `AbsPath string` - Absolute path (used for reading file content)
    - `Size int64`
    - `Mtime time.Time`
    - `Ctime time.Time`
  - `Scanner struct` - Accumulates files and generates manifests
- **Functions**
  - `New() *Scanner` - Creates a new Scanner with default options
  - `NewWithOptions(opts *Options) *Scanner` - Creates a new Scanner with given options
- **Methods (Enumeration Phase)**
  - `(*Scanner) EnumerateFile(path string) error` - Enumerates a single file, calling stat() for metadata
  - `(*Scanner) EnumeratePath(inputPath string, progress chan<- EnumerateStatus) error` - Walks a directory and enumerates all files
  - `(*Scanner) EnumeratePaths(progress chan<- EnumerateStatus, inputPaths ...string) error` - Walks multiple directories
  - `(*Scanner) EnumerateFS(afs afero.Fs, basePath string, progress chan<- EnumerateStatus) error` - Walks an afero filesystem
- **Methods (Accessors)**
  - `(*Scanner) Files() []*FileEntry` - Returns copy of all enumerated files
  - `(*Scanner) FileCount() int64` - Returns number of files
  - `(*Scanner) TotalBytes() int64` - Returns total size of all files
- **Methods (Scan Phase)**
  - `(*Scanner) ToManifest(ctx context.Context, w io.Writer, progress chan<- ScanStatus) error` - Reads file contents, computes hashes, generates manifest

## internal/checker/

### checker.go

- **Types**
  - `Result struct` - Outcome of checking a single file
    - `Path string` - File path from manifest
    - `Status Status` - Verification status
    - `Message string` - Error or status message
  - `Status int` - Verification status enumeration
    - `StatusOK` - File matches manifest
    - `StatusMissing` - File not found
    - `StatusSizeMismatch` - File size differs from manifest
    - `StatusHashMismatch` - File hash differs from manifest
    - `StatusError` - Error occurred during verification
  - `CheckStatus struct` - Progress information for check operation
    - `TotalFiles int64`
    - `CheckedFiles int64`
    - `TotalBytes int64`
    - `CheckedBytes int64`
    - `BytesPerSec float64`
    - `Failures int64`
  - `Checker struct` - Verifies files against a manifest
- **Functions**
  - `NewChecker(manifestPath string, basePath string) (*Checker, error)` - Creates a new Checker for the given manifest and base path
- **Methods**
  - `(s Status) String() string` - Returns string representation of status
  - `(*Checker) FileCount() int64` - Returns number of files in the manifest
  - `(*Checker) TotalBytes() int64` - Returns total size of all files in manifest
  - `(*Checker) Check(ctx context.Context, results chan<- Result, progress chan<- CheckStatus) error` - Verifies all files against the manifest

## mfer/

### manifest.go

- **Types**
  - `ManifestScanOptions struct` - Options for scanning directories
    - `IgnoreDotfiles bool`
    - `FollowSymLinks bool`
- **Functions**
  - `New() *manifest` - Creates a new empty manifest
  - `NewFromPaths(options *ManifestScanOptions, inputPaths ...string) (*manifest, error)` - Creates manifest from filesystem paths
  - `NewFromFS(options *ManifestScanOptions, fs afero.Fs) (*manifest, error)` - Creates manifest from afero filesystem
- **Methods**
  - `(*manifest) HasError() bool` - Returns true if manifest has errors
  - `(*manifest) AddError(e error) *manifest` - Adds an error to the manifest
  - `(*manifest) WithContext(c context.Context) *manifest` - Sets context for cancellation
  - `(*manifest) GetFileCount() int64` - Returns number of files in manifest
  - `(*manifest) GetTotalFileSize() int64` - Returns total size of all files
  - `(*manifest) Files() []*MFFilePath` - Returns all file entries from a loaded manifest
  - `(*manifest) Scan() error` - Scans source filesystems and populates file list

### output.go

- **Methods**
  - `(*manifest) WriteToFile(path string) error` - Writes manifest to file path
  - `(*manifest) WriteTo(output io.Writer) error` - Writes manifest to io.Writer

### builder.go

- **Types**
  - `FileProgress func(bytesRead int64)` - Callback for file processing progress
  - `ManifestBuilder struct` - Constructs manifests by adding files one at a time
- **Functions**
  - `NewBuilder() *ManifestBuilder` - Creates a new ManifestBuilder
- **Methods**
  - `(*ManifestBuilder) AddFile(path string, size int64, mtime time.Time, reader io.Reader, progress FileProgress) (int64, error)` - Reads file, computes hash, adds to manifest
  - `(*ManifestBuilder) FileCount() int` - Returns number of files added
  - `(*ManifestBuilder) Build(w io.Writer) error` - Finalizes and writes manifest

### serialize.go

- **Constants**
  - `MAGIC string` - Magic bytes prefix for manifest files ("ZNAVSRFG")

### deserialize.go

- **Functions**
  - `NewFromProto(input io.Reader) (*manifest, error)` - Deserializes manifest from protobuf
  - `NewManifestFromReader(input io.Reader) (*manifest, error)` - Reads and parses manifest from io.Reader
  - `NewManifestFromFile(path string) (*manifest, error)` - Reads and parses manifest from file path

### mf.pb.go (generated from mf.proto)

- **Enum Types**
  - `MFFileOuter_Version` - Outer file format version
    - `MFFileOuter_VERSION_NONE`
    - `MFFileOuter_VERSION_ONE`
  - `MFFileOuter_CompressionType` - Compression type for inner message
    - `MFFileOuter_COMPRESSION_NONE`
    - `MFFileOuter_COMPRESSION_GZIP`
  - `MFFile_Version` - Inner file format version
    - `MFFile_VERSION_NONE`
    - `MFFile_VERSION_ONE`
- **Message Types**
  - `Timestamp struct` - Timestamp with seconds and nanoseconds
    - `GetSeconds() int64`
    - `GetNanos() int32`
  - `MFFileOuter struct` - Outer wrapper containing compressed/signed inner message
    - `GetVersion() MFFileOuter_Version`
    - `GetCompressionType() MFFileOuter_CompressionType`
    - `GetSize() int64`
    - `GetSha256() []byte`
    - `GetInnerMessage() []byte`
    - `GetSignature() []byte`
    - `GetSigner() []byte`
    - `GetSigningPubKey() []byte`
  - `MFFilePath struct` - Individual file entry in manifest
    - `GetPath() string`
    - `GetSize() int64`
    - `GetHashes() []*MFFileChecksum`
    - `GetMimeType() string`
    - `GetMtime() *Timestamp`
    - `GetCtime() *Timestamp`
    - `GetAtime() *Timestamp`
  - `MFFileChecksum struct` - File checksum using multihash
    - `GetMultiHash() []byte`
  - `MFFile struct` - Inner manifest containing file list
    - `GetVersion() MFFile_Version`
    - `GetFiles() []*MFFilePath`
    - `GetCreatedAt() *Timestamp`

# Build Status

[![Build Status](https://drone.datavi.be/api/badges/sneak/mfer/status.svg)](https://drone.datavi.be/sneak/mfer)

# Problem Statement

Given a plain URL, there is no standard way to safely and programmatically download everything "under" that URL path. `wget -r` can traverse directory listings if they're enabled, but every server has a different format, and this does not verify cryptographic integrity of the files, or enable them to be fetched using a protocol other than HTTP/S.

Currently, the solution people use is sidecar files in the format of `SHASUMS` checksum files, together with a `SHASUMS.asc` PGP detached signature. This is not checksum-algorithm-agnostic and the sidecar file is not always consistently named.

Real issues I face:

- when I plug in an ExFAT hard drive, I don't know if any files on the filesystem are corrupted or missing
  - the current ad-hoc solution is `SHASUMS`/`SHASUMS.asc` files
- when I want to mirror an HTTP archive, I have to use special tools like debmirror that understand the archive format
  - the debian repository metadata structure is hot garbage
- when I download a large file via HTTP, I have no way of knowing if the file content is what it's supposed to be

# Proposed Solution

A standard, a manifest file format, and a tool for generating same. The manifest file would be called `index.mf`, and the tool for generating such would be called `mfer`.

The manifest file would do several important things:

- have a standard filename, so if given `https://example.com/downloadpackage/` one could fetch `https://example.com/downloadpackage/index.mf` to enumerate the full directory listing.
- contain a version field for extensibility
- contain structured data (protobuf, json, or cbor)
- provide an inner signed container, so that the manifest file itself can embed a signature and a public key alongside in a single file
- contain a list of files, each with a relative path to the manifest
- contain manifest timestamp
- contain ctime/mtime information for files so that file metadata can be preserved
- contain cryptographic checksums in several different algorithms for each file
  - probably encoded with multihash to indicate algo + hash
  - sha256 at the minimum
  - would be nice to include an IPFS/IPLD CIDv1 root hash for each file, which likely involves doing an ipfs file object chunking
  - maybe even including the complete IPFS/IPLD directory tree objects and chunklists?
    - this is because generating an `index.mf` does not imply publishing on ipfs at that time
  - maybe a bittorrent chunklist for torrent client compatibility? perhaps a top-level infohash for the whole manifest?

# Design Goals

- Replace SHASUMS/SHASUMS.asc files
- be easy to download/resume a whole directory tree published via HTTP
- be easy to use across protocols (given an HTTPS url, fetch manifest, then download file contents via bittorrent or ipfs)
- not strongly coupled to HTTP use case, should not require special hosting, content types, or HTTP headers being sent

# Non-Goals

- Manifest generation speed
  - likely involves IPFS chunking, bittorrent chunking, and several different cryptographic hash functions over the entirety of each and every file
- Small manifest file size (within reason)
  - 30MiB files are "small" these days, given modern storage/bandwidth
  - metadata size should not be used as an excuse to sacrifice utility (such as providing checksums over each chunk of a large file)

# Limitations

- **Manifest size:** Manifests must fit entirely in system memory during reading and writing.

# Open Questions

- Should the manifest file include checksums of individual file chunks, or just for the whole assembled file?
  - If so, should the chunksize be fixed or dynamic?
- Should the manifest signature format be GnuPG signatures, or those from OpenBSD's signify (of which there is a good [golang implementation](https://github.com/frankbraun/gosignify))?
- Should the on-disk serialization format be proto3 or json?

# Tool Examples

- `mfer gen` / `mfer gen .`
  - recurses under current directory and writes out an `index.mf`
- `mfer check` / `mfer check .`
  - verifies checksums of all files in manifest, displaying error and exiting nonzero if any files are missing or corrupted
- `mfer fetch https://example.com/stuff/`
  - fetches `/stuff/index.mf` and downloads all files listed in manifest, optionally resuming any that already exist locally, and ensures cryptographic integrity of downloaded files.

# Implementation Plan

## Phase One:

- golang module for reusability/embedding
- golang module client providing `mfer` CLI

## Phase Two:

- ES6 or TypeScript module for reusability/embedding
- ES6/TypeScript module client providing `mfer.js` CLI

# Hopes And Dreams

- `aria2c https://example.com/manifestdirectory/`
  - (fetches `https://example.com/manifestdirectory/index.mf`, downloads and checksums all files, resumes any that exist locally already)
- `mfer fetch https://example.com/manifestdirectory/`
- a command line option to zero/omit mtime/ctime, as well as manifest timestamp, and sort all directory listings so that manifest file generation is deterministic/reproducible
- URL format `mfer fetch https://example.com/manifestdirectory/?key=5539AD00DE4C42F3AFE11575052443F4DF2A55C2` to assert in the URL which PGP signing key should be used in the manifest, so that shared URLs have a cryptographic trust root
- a "well-known" key in the manifest that maps well known keys (could reuse the http spec) to specific file paths in the manifest.
  - example: a `berlin.sneak.app.slideshow` key that maps to a json slideshow config listing what image paths to show, and for how long, and in what order

# Use Cases

## Web Images

I'd like to be able to put a bunch of images into a directory, generate a manifest, and then point a slideshow client (such as an ambient display, or a react app with the target directory in a query string arg) at that statically hosted directory, and have it discover the full list of images available at that URL.

## Software Distribution

I'd like to be able to download a whole tree of files available via HTTP resumably by either HTTP or IPFS/BitTorrent without a .torrent file.

## Filesystem Archive Integrity

I use filesystems that don't include data checksums, and I would like a cryptographically signed checksum file so that I can later verify that a set of archive files have not been modified, none are missing, and that the checksums have not been altered in storage by a second party.

## Filesystem-Independent Checksums

I would like to be able to plug in a hard drive or flash drive and, if there is an `index.mf` in the root, automatically detect missing/corrupted files, regardless of filesystem format.

# Collaboration

Please email [`sneak@sneak.berlin`](mailto:sneak@sneak.berlin) with your desired username for an account on this Gitea instance.

I am currently interested in hiring a contractor skilled with the Go standard library interfaces to specify this tool in full and develop a prototype implementation.