# Troubleshooting
A short list of issues you're likely to hit on first contact, with the diagnostic and the fix. Open an issue if you trip over something not covered here.
## "cannot mount … no such pool" / "permission denied" on first run
The daemon expects a ZFS pool named per `zfs.pool` in the config (default
`dockersnap`) with a child dataset `instances`.
```shell
sudo zpool list                        # is the pool there?
sudo zfs list dockersnap/instances     # is the parent dataset there?
# If not:
sudo zpool create dockersnap /dev/sdb  # adapt to your hardware
sudo zfs create dockersnap/instances
```
The daemon must run as root (it shells out to `zfs`/`ip`/`iptables`/`systemd-run`).
## `dockersnap` CLI says "connection refused"
The CLI talks to the daemon at `$DOCKERSNAP_REMOTE` (or `--remote`).
```shell
echo $DOCKERSNAP_REMOTE                   # what URL is the CLI hitting?
curl -s http://my-vm:9847/api/v1/health   # is the daemon up?
sudo systemctl status dockersnap          # ...if installed as a service
sudo journalctl -u dockersnap -n 200      # what does the daemon say?
```
If `api.token` is set in the daemon config, also export `DOCKERSNAP_TOKEN`.
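To sanity-check what the CLI would actually send, here is a minimal sketch of building the health request from those two environment variables. This is illustrative Python, not the CLI's real code, and the `Authorization: Bearer` scheme is an assumption — verify how your daemon build expects the token.

```python
import os
import urllib.request

def health_request(remote, token=None):
    """Build the health-check request the CLI would send (sketch).
    The Bearer auth scheme is an assumption, not confirmed by the docs."""
    req = urllib.request.Request(remote.rstrip("/") + "/api/v1/health")
    if token:
        req.add_header("Authorization", "Bearer " + token)
    return req

req = health_request(
    os.environ.get("DOCKERSNAP_REMOTE", "http://my-vm:9847"),
    os.environ.get("DOCKERSNAP_TOKEN"),
)
print(req.full_url)  # the URL the CLI is actually hitting
```

If the printed URL is not what you expect, fix `$DOCKERSNAP_REMOTE` before digging into the daemon side.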
## Containers inside an instance can't reach the internet
Each instance's dockerd lives in netns `ds-<name>` with a veth pair to the
host. NAT and DNS need to be in place.
```shell
sudo ip netns exec ds-<name> ip a   # interfaces present? IP assigned?
sudo ip netns exec ds-<name> ping -c1 1.1.1.1
sudo ip netns exec ds-<name> getent hosts example.com
```
Common causes:
- systemd-networkd hijacking the veth. The daemon writes
  `/etc/systemd/network/10-dockersnap-veth.network` to mark `ve-*` as
  unmanaged; if that file is missing or the priority is wrong, the host side
  gets a DHCP/link-local address and routing breaks. Re-running the daemon
  (or the Ansible role) reinstates it.
- dnsmasq not forwarding upstream. The daemon's dnsmasq config uses
  `resolv-file=/run/systemd/resolve/resolv.conf`. If `systemd-resolved`
  isn't running or that file is empty, external hostnames don't resolve.
- Corporate proxy. Set `dockersnap_proxy_http`/`dockersnap_proxy_https` in
  `deploy/host_vars/<host>.yml` so each dockerd inherits the right env vars.
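The exact contents of `10-dockersnap-veth.network` aren't reproduced here, but a minimal unit that keeps systemd-networkd's hands off the host-side veths would look roughly like this (illustrative sketch; the real file may set more options):

```
# /etc/systemd/network/10-dockersnap-veth.network (illustrative)
[Match]
Name=ve-*

[Link]
Unmanaged=yes
```

After restoring it, `sudo networkctl reload` (or restarting systemd-networkd) picks it up.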
## After revert, containerd or etcd panic with bbolt errors
This is the page-cache-after-rollback corruption scenario. The fix is
already in `manager.go`: `sync` → `zfs rollback -r` → `drop_caches` →
start dockerd. If you're seeing it, you're either on an older binary or
something is racing the page-cache drop. Check `journalctl -u dockersnap`
around the revert for the `drop_caches` line.
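The ordering is the whole fix: caches must be dropped after the rollback and before dockerd touches the datasets again, or stale pre-rollback pages get served to containerd/etcd. A dry-run sketch of the sequence (illustrative Python — the real logic is Go in `manager.go`, and the dataset/snapshot names are hypothetical):

```python
def revert_steps(dataset, snapshot):
    """Ordered host-side commands for a crash-consistent revert (sketch).
    After these complete, the daemon restarts the instance's dockerd."""
    return [
        ["sync"],                                           # flush dirty pages to disk
        ["zfs", "rollback", "-r", f"{dataset}@{snapshot}"], # recursive rollback
        ["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"],  # evict stale page cache
    ]

for cmd in revert_steps("dockersnap/instances/dev1", "clean"):
    print(" ".join(cmd))
```

Writing `3` to `/proc/sys/vm/drop_caches` evicts both the page cache and dentries/inodes, which is why the bbolt files reread from disk instead of from pre-rollback memory.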
## Parallel kind clusters fail with "too many open files" / inotify errors
Each kind node runs systemd, which needs an inotify instance. Default
limits (`fs.inotify.max_user_instances=128`) are too low for two parallel
kind instances. The daemon raises these on startup; the Ansible role also
persists them via sysctl. Verify:
```shell
sudo sysctl fs.inotify.max_user_instances   # should be 1024
sudo sysctl fs.inotify.max_user_watches     # should be 524288
```
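If you need to persist the limits by hand (as the Ansible role does), a sysctl drop-in along these lines works; the filename is illustrative:

```
# /etc/sysctl.d/99-dockersnap.conf (filename illustrative)
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches = 524288
```

Apply it without rebooting via `sudo sysctl --system`.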
## A port published inside an instance isn't reachable from the host
Port forwarding is owned by an in-daemon TCP proxy that runs `nsenter` +
`socat` to bridge into the netns. Discovery is event-driven (`docker
events`) with a 30-second polling fallback.
```shell
dockersnap status <name>   # check the Ports section
curl http://<host>:9847/api/v1/instances/<name>/ports
# Force a re-scan if needed:
curl -X POST http://<host>:9847/api/v1/instances/<name>/ports/refresh
```
If a port shows up in `dockersnap status` but isn't reachable, check host
firewall / cloud security-group rules.
## A clone reuses the source's host port and fails to start
Each clone gets a fresh subnet, but Docker's ephemeral host-port allocator
isn't aware of other instances and can hand out the same number twice. The
proxy compensates by detecting `EADDRINUSE` and falling back to `:0`. If
you hit this and the proxy doesn't recover, file an issue with the output of
`dockersnap status <clone> --json`.
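The fallback behavior can be sketched in a few lines (illustrative Python; the real proxy lives inside the daemon): try the requested host port, and on `EADDRINUSE` let the kernel pick a free one by binding port 0.

```python
import errno
import socket

def listen_with_fallback(port):
    """Listen on the requested host port; on EADDRINUSE fall back to an
    OS-assigned port (port 0). Returns (socket, actual_port)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("0.0.0.0", port))
    except OSError as e:
        if e.errno != errno.EADDRINUSE:
            raise
        s.bind(("0.0.0.0", 0))  # kernel picks a free port
    s.listen()
    return s, s.getsockname()[1]
```

The consequence for users: the forwarded port a clone ends up with may differ from the source's, which is why `dockersnap status` is the source of truth for port numbers.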