AGENTS.md — AI Agent Instructions for dockersnap

Project Overview

dockersnap is a Go CLI/daemon that manages isolated Docker daemon instances on ZFS datasets, providing instant snapshot/revert/clone of fully-deployed Docker-based dev environments (e.g. kind clusters).

Each "instance" is a self-contained environment: its own ZFS dataset, its own Docker daemon, its own Docker network, and one workload (such as a kind cluster) inside it.

Architecture Context

  • Runtime: Single binary (dockersnap) with two modes: CLI commands (local or remote) and serve (API daemon).
  • Storage: ZFS pool dockersnap with datasets under dockersnap/instances/<name>. Docker uses overlay2 on top of the ZFS mount (NOT Docker's ZFS storage driver).
  • Isolation: One dockerd process per instance, listening on /run/dockersnap/<name>.sock. Each daemon runs with live-restore: true so containers survive service restarts.
  • Networking: Each instance gets an auto-allocated /16 subnet from a configurable range.
  • Proxy: Per-instance dockerd processes inherit proxy env vars from config (docker.proxy section).
  • Config: /etc/dockersnap/config.yaml (written by Ansible during setup).
  • State: /var/lib/dockersnap/state.json (tracks instances, their datasets, snapshots, and network allocations).
  • Daemon configs: /var/lib/dockersnap/daemon-configs/<name>.json (per-instance dockerd daemon.json).

Key Invariants (NEVER violate these)

  1. Stop-before-mutate: NEVER rollback or snapshot a ZFS dataset while its dockerd is running. Always: stop all containers → stop dockerd → mutate ZFS → start dockerd.
  2. Recursive snapshots: Always use zfs snapshot -r so all child datasets are captured atomically. On restore, use zfs rollback -r — but note that -r on rollback destroys snapshots newer than the target rather than recursing into children, so child datasets must be rolled back as well.
  3. One dockerd per instance: No sharing Docker daemons between instances. Full process isolation.
  4. State file is source of truth: If the state file says an instance exists, the corresponding ZFS dataset and dockerd should exist. Reconcile discrepancies on daemon startup.
  5. Network determinism: Subnet allocation is deterministic based on instance index (not hash) to avoid collisions and allow predictable addressing.
  6. Robust stop: Always stop containers before stopping dockerd. After dockerd stops, clean up leftover mounts (netns, overlay) and orphan processes (containerd-shim) before touching the ZFS dataset.
  7. Network namespace isolation: Each dockerd runs in its own Linux network namespace (ds-<name>) with a veth pair to the host. This gives fully isolated iptables, bridges, and routing. kubectl access requires running inside the namespace: ip netns exec ds-<name> kubectl ....

Workflow Rules (ALWAYS follow these)

  1. Keep docs up to date: When making design decisions or discovering important behavior, update the relevant doc (docs/DESIGN.md, docs/SNAPSHOT-INTERNALS.md, or this file). Docs are not an afterthought — they are part of the deliverable.
  2. Commit whenever it makes sense: Don't accumulate huge uncommitted changesets. Commit after each logical unit of work (feature, bugfix, refactor). Use descriptive multi-line commit messages with sign-off.
  3. Test before deploying: Always run go build ./... and go test ./... before deploying to the VM.
  4. Deploy pattern: Build → scp → stop service → replace binary → start service. The service has reconciliation so instances auto-recover.
  5. Remote VM access: SSH to your dockersnap host (set DOCKERSNAP_VM_HOST=user@vm.example.com for the deploy tasks). Use sudo for privileged ops. If you're behind a corporate proxy, set HTTP(S)_PROXY for external downloads. sudo bash is blocked — use sudo sh -c instead.
  6. Git no-pager: ALWAYS use git --no-pager for all git commands (log, diff, show, branch, etc.) to avoid hanging on pager prompts.

Git Commit Message Convention

This repo follows Conventional Commits.

Subject format: <type>(<scope>): <subject> where <type> is one of feat / fix / docs / refactor / test / chore / perf / build / ci / style / revert. Scope is optional; use it when a change is contained to one area (feat(api): …, fix(plugin/kind): …, docs(readme): …). Append ! to the type/scope (or add a BREAKING CHANGE: footer) when the change breaks consumers.

Examples:

feat(plugin): expose health TTL via /workload/health?fresh=true
fix(dockerd): wait for socket before declaring instance healthy
docs(snapshot-internals): describe page-cache drop after rollback
refactor(state)!: rename Status field values to typed constants

zsh quirk: never use git commit -m "multiline..." in zsh — newlines get swallowed. Write the message to a temp file and use -F instead:

printf 'feat(scope): subject line here\n\nBody paragraph explaining what and why.\n- bullet points for details\n' > /tmp/commit-msg.txt
git --no-pager commit -F /tmp/commit-msg.txt

Code Conventions

  • Language: Go 1.26+ (see go.mod).
  • CLI framework: cobra. rootCmd.SilenceUsage = true — RunE errors print just the Error: line, not the full usage block.
  • API framework: chi (lightweight, stdlib-compatible). Token middleware is mounted on a r.Group(...) so /api/v1/health stays public regardless of route order.
  • Error handling: Wrap errors with fmt.Errorf("operation: %w", err). Never swallow errors silently — even shell-outs (sync, drop_caches, umount) should at minimum log on failure.
  • Logging: log/slog (structured logging). Use slog.With("instance", name) for context.
  • ZFS interaction: Shell out to zfs/zpool commands via os/exec. No CGO ZFS bindings.
  • Docker interaction: Official Docker SDK for Go API calls; docker CLI for ad-hoc operations and docker events streams.
  • File paths: All system paths via helpers on config.Config (SocketPath, PidFilePath, DatasetPath, MountPoint).
  • Instance names: Validated by instance.ValidateName (^[a-z][a-z0-9-]{0,31}$). All API/CLI entry points must validate before reaching the manager.
  • State mutations: Always go through state.Store.Update(fn) / state.Store.View(fn) so the load-modify-save cycle is atomic. Never Load() → mutate → Save() directly outside tests.
  • Manager lifecycle ops that take long-running I/O follow a three-phase pattern: reserve under lock → heavy I/O outside lock → commit/rollback under lock.
  • Snapshot/Revert share runStoppedZFSOp — extend it (don't duplicate it) when adding new ZFS-while-stopped operations.
  • Status field: state.Status typed string — use state.StatusRunning / StatusStopped / StatusError, never bare strings.
  • Workload plugins: layered on top of an instance via dockersnap create <name> --plugin <name> [--config k=v | --config-file <path>]. Plugin binaries live in /usr/local/lib/dockersnap/plugins/. Daemon-side runner: internal/plugin. Public Go SDK for plugin authors: pkg/pluginsdk. Reference plugin: plugins/kind/. Contract: see docs/PLUGIN-DESIGN.md. Plugins run on the daemon host; the CLI reaches them via API endpoints (/access, /workload, /workload/health, /plugins, /plugins/reload) so behavior is identical for local and remote DOCKERSNAP_REMOTE.
  • Testing: github.com/stretchr/testify (assert for soft checks, require for hard preconditions) — both unit tests and the integration suite under tests/e2e/. Table-driven where the input grid is the point. Mocks for zfs.Commander and instance.DockerdManager. Plugin tests use shell-script fakes in t.TempDir() (not exec mocking). Integration tests tagged //go:build integration; system-state probes (zfsDatasetExists, netnsExists, iptablesHasComment, …) live in tests/e2e/helpers_test.go and return bools so the test decides what to assert.
  • Test layout: Every package has _test.go companions for its pure logic; HTTP layer (internal/api, internal/client) is exercised via httptest.Server. Code that shells out to systemctl/iptables/docker/zfs (e.g. internal/dockerd, internal/proxy/scan.go, EventListener) is validated only in the integration suite — don't try to unit-test those by mocking exec.Command.
  • Integration tests are split: tests/e2e/ covers core dockersnap (ZFS lifecycle, netns, iptables, systemd) and tests/plugins/<name>/ is a per-plugin suite (one binary, one Go package). Each suite has its own main_test.go that calls integrationutil.Bootstrap("<suite>") for client + namespaced instance prefix. Run them remotely via task e2e:run, task e2e:plugins:kind, task e2e:plugins:echo, or task e2e:plugins:run for everything.

Directory Structure

cmd/           → Cobra CLI command definitions. Thin wrappers that call into internal/.
internal/      → All business logic. Packages:
  api/         → REST API server (chi), handlers, middleware
  client/      → HTTP client wrapping the REST API for the CLI
  instance/    → Instance lifecycle (create, delete, list, start, stop)
  dockerd/     → Docker daemon process management (start/stop/health)
  zfs/         → ZFS operations (dataset, snapshot, rollback, clone, destroy)
  network/     → Subnet auto-allocation
  proxy/       → Host-side TCP port proxy + auto-discovery watchers
  config/      → Configuration loading from /etc/dockersnap/config.yaml
  state/       → Persistent state management (JSON file)
  plugin/      → Workload-plugin runner (discovers and invokes plugin binaries)
pkg/           → Public Go modules (importable by plugin authors).
  pluginsdk/   → Plugin SDK: types, runner helpers, typed config, progress, sdktest
deploy/        → Ansible role for one-time infrastructure setup
docs/          → Design documents (DESIGN.md, SNAPSHOT-INTERNALS.md, PLUGIN-DESIGN.md)
plugins/       → Reference plugin binaries (e.g. plugins/kind/) — once implemented

What NOT to Do

  • Do NOT use Docker's ZFS storage driver configuration. We manage ZFS datasets ourselves and point each dockerd's --data-root at a dataset mountpoint.
  • Do NOT attempt live/hot snapshots. Always stop all containers AND dockerd first.
  • Do NOT use zfs rollback without the -r flag if intermediate snapshots might exist.
  • Do NOT hardcode subnet ranges. Always read from config.
  • Do NOT assume ZFS pool name. Always read from config.
  • Do NOT bind the API to 0.0.0.0 by default. Default to 127.0.0.1 for security. Configurable.
  • Do NOT stop dockerd without stopping containers first — orphan containerd-shim processes will hold the ZFS dataset busy.
  • Do NOT use kind export kubeconfig for stdout output — use kind get kubeconfig instead.

Testing Approach

  • Unit tests: Mock the zfs and dockerd interfaces. Test business logic in internal/instance/.
  • Integration tests: Require actual ZFS pool (created in CI or locally). Use //go:build integration tag.
  • End-to-end: Shell script that runs create → snapshot → revert → verify.

Ansible Role Context

The deploy/ directory contains a standalone Ansible playbook + role for setting up the host:

  • Installs ZFS kernel modules and userspace tools
  • Creates the ZFS pool on a specified vdev
  • Sets zfs_arc_max to the configured value (default 5GB)
  • Installs Docker CE (disables the system-wide daemon)
  • Installs runtime dependencies (socat, iproute2, iptables, util-linux/nsenter, procps)
  • Installs the dockersnap binary
  • Writes /etc/dockersnap/config.yaml (including proxy settings)
  • Enables and starts the dockersnap systemd service
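
For orientation, a config.yaml written by the role might look roughly like this. This is a hypothetical shape — every key name is illustrative, not the actual schema; check internal/config for the real one:

```yaml
# Illustrative sketch only — key names are NOT the real schema.
pool: dockersnap              # ZFS pool name (never hardcode; always read this)
api:
  listen: 127.0.0.1:8080      # default bind is loopback, not 0.0.0.0
network:
  subnet_range: 10.0.0.0/8    # base range that /16 instance subnets come from
docker:
  proxy:                      # inherited by each per-instance dockerd
    http_proxy: http://proxy.example.com:3128
    https_proxy: http://proxy.example.com:3128
```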