From 31f7172cf336ac5d22b5e247c49861f6dd238be5 Mon Sep 17 00:00:00 2001 From: Carsten Date: Wed, 22 Apr 2026 08:51:04 +0200 Subject: [PATCH] docs: SETUP-AND-DEPLOY runbook for phase 5 production rollout Single top-to-bottom runbook covering preflight, local build, server deploy, first-agent dry run, test tier, full rollout, rollback, and ongoing ops. Each step has a verification command. Ends with a Go/No-Go sign-off list. --- SETUP-AND-DEPLOY.md | 838 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 838 insertions(+) create mode 100644 SETUP-AND-DEPLOY.md diff --git a/SETUP-AND-DEPLOY.md b/SETUP-AND-DEPLOY.md new file mode 100644 index 0000000..f03ba47 --- /dev/null +++ b/SETUP-AND-DEPLOY.md @@ -0,0 +1,838 @@ +# Proxmox Monitor — Setup & Deploy Runbook + +> **This document is a runbook, not reference material.** Work from top to bottom. +> Each checkbox is an action. Don't skip verification steps — they exist because +> we've been burned by skipping them. + +**Target audience:** the operator who owns the monitor service. + +**Total time:** 2–3 hours end-to-end with a small test tier, plus however long +your Proxmox fleet takes to roll out (10–15 min per host). 
+ +**Phases of this runbook:** + +| § | Purpose | Touch points | +|-----|-----------------------------------------------|-----------------------| +| 1 | **Preflight** — confirm prerequisites | Local only | +| 2 | **Local build** — produce artifacts | Your workstation | +| 3 | **Server deploy** — one-time LXC bring-up | Hypervisor + LXC | +| 4 | **First agent** — prove the pipeline | One Proxmox host | +| 5 | **Test tier** — 2–3 hosts for 24h | Small batch | +| 6 | **Full rollout** — the remaining hosts | Fleet-wide | +| 7 | **Rollback** — when something goes wrong | — | +| 8 | **Ongoing operations** | Upgrades, backups | +| 9 | **Go / No-Go** sign-off | Final gate | + +Related docs (reference, not sequential): +- `server/docs/deploy-lxc.md` — deeper LXC detail +- `agent/docs/install.md` — single-host agent install +- `server/docs/Caddyfile.example` — TLS/WSS proxy template +- `proxmox-monitor-konzept.md` — design concept +- `docs/deployment-overview.md` — high-level picture + +--- + +## § 1. Preflight Checklist + +### 1.1 Hardware & network + +- [ ] **Server LXC** can be provisioned on a Proxmox host in the RZ. + Minimum: 1 GB RAM, 2 cores, 10 GB disk. Debian 12 template available. +- [ ] **DNS A record** for `monitor.` points at the public IP of the + Proxmox host that hosts the LXC. Verify with `dig +short monitor.`. +- [ ] **Port 443 inbound** to the server LXC's public IP is open (Caddy will + get Let's Encrypt certs via HTTP-01 and serve on 443). +- [ ] **Outbound HTTPS from every Proxmox host to `monitor.`** is + open. Agents connect out; no inbound port is required on Proxmox hosts. +- [ ] You have **SSH root access** to: + - The hypervisor running the server LXC (for `pct create` / `pct enter`) + - Every Proxmox host that will run an agent +- [ ] **Docker** is installed and daemon is running on your build machine + (`docker --version` should succeed and `docker ps` should not error). + If not, use a Linux box (even the server LXC itself) as the build host. 
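Most of § 1.1 is eyeballs-and-SSH, but the tool checks can be scripted. A
minimal sketch (`need` is a hypothetical helper name, not part of the repo):

```bash
#!/bin/sh
# Preflight tool check (sketch). Prints one line per tool; anything
# marked MISSING must be installed before continuing.
need() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "MISSING: $1"
  fi
}

need ssh
need scp
need dig
need docker
```

Run it on your workstation. `docker` may legitimately come back MISSING there
if you plan to build on a Linux box instead, per the last checkbox above.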
+ +### 1.2 Versions + +- [ ] Proxmox hosts are **VE 8.3+** with **OpenZFS 2.3+** (check with + `pveversion` and `zfs --version`). If some hosts are older, either + upgrade them first or accept that ZFS payloads will be empty on those. + +### 1.3 Tools on your workstation + +- [ ] Elixir 1.19 + OTP 28 (`elixir --version`) +- [ ] Mix + Hex (`mix local.hex`) +- [ ] SSH + scp +- [ ] `sqlite3` CLI (for smoke-test DB inspection; optional) + +### 1.4 Secrets plan + +Write down (don't commit) the three secrets you'll need. Keep them in a password +manager. + +| Secret | Generated how | +|-------------------------------|--------------------------------------------------------------| +| Dashboard password (plaintext) | You choose it. Use a strong random string. | +| `SECRET_KEY_BASE` | `cd server && mix phx.gen.secret` (64-byte base64) | +| Agent tokens | Created by the admin UI, one per host, revealed once. | + +--- + +## § 2. Local Build + +Do this once, on your build machine. Re-run for every upgrade. + +### 2.1 Clone the repo + +- [ ] `git clone ` if you don't already have it. +- [ ] `cd proxmox_monitor` +- [ ] `git pull --ff-only origin main` + +### 2.2 Confirm tests are green + +- [ ] `cd server && mix deps.get && mix test` +- [ ] `cd ../agent && mix deps.get && mix test` + +**Expected:** both suites pass. If any test fails, **stop here**. Fix or cherry-pick +a known-good commit before continuing. + +### 2.3 Build the server release + +- [ ] Generate `DASHBOARD_PASSWORD_HASH` once: + +```bash +cd server +mix run -e 'IO.puts(Argon2.hash_pwd_salt(""))' +``` + +Copy the `$argon2id$...` line into your password manager. You'll paste it +into the LXC env file later. 
+ +- [ ] Build the release (the placeholder is only needed to satisfy runtime.exs + during build; the real value is set on the LXC at start time): + +```bash +MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' mix release --overwrite +``` + +**Expected:** `_build/prod/rel/server/` contains `bin/server`, `bin/migrate`, +`erts-*`, `lib/`, `releases/`. + +- [ ] Package the release: + +```bash +tar -czf /tmp/server_release.tgz -C _build/prod/rel server +ls -lh /tmp/server_release.tgz +``` + +**Expected:** ~30–60 MB tarball. + +### 2.4 Build the agent binaries + +Requires Docker running locally (or do this on a Linux host). + +- [ ] `cd ../agent` +- [ ] `./scripts/build-linux.sh` + +**Expected output (~5–10 min first run, much faster with Docker layer cache on +subsequent runs):** + +``` +Binaries written to /path/to/agent/dist: +proxmox-monitor-agent_linux_amd64 +proxmox-monitor-agent_linux_arm64 +``` + +- [ ] Sanity check: + +```bash +file dist/proxmox-monitor-agent_linux_amd64 | grep -E "ELF 64-bit" +``` + +**Expected:** `ELF 64-bit LSB executable, x86-64`. + +If Docker isn't available on your workstation: scp the `agent/` directory onto +the server LXC after § 3, run `./scripts/build-linux.sh` there, then scp the +binaries back. The LXC doesn't need Docker at runtime. + +--- + +## § 3. Server Deployment + +One-time. Subsequent upgrades use § 8.1. + +### 3.1 Create the LXC (on the hypervisor) + +- [ ] SSH to the hypervisor and run: + +```bash +pct create 200 \ + /var/lib/vz/template/cache/debian-12-standard_12.7-1_amd64.tar.zst \ + --hostname proxmox-monitor \ + --memory 1024 --cores 2 \ + --rootfs local-zfs:10 \ + --net0 name=eth0,bridge=vmbr0,ip=dhcp \ + --unprivileged 1 --features nesting=0 --onboot 1 +pct start 200 +``` + +Adjust the container ID (`200`), bridge, and rootfs to match your environment. 
- [ ] Get the LXC's IP:

```bash
pct exec 200 -- ip -4 addr show eth0 | grep -Po 'inet \K[\d.]+'
```

Put this IP in `LXC_IP` for the rest of this section (use a shell variable,
not a literal in every command — typos here cost hours).

### 3.2 Base packages inside the LXC

```bash
pct enter 200
```

- [ ] Install Caddy + SQLite + tools:

```bash
apt-get update
apt-get install -y ca-certificates curl gnupg debian-keyring debian-archive-keyring apt-transport-https sqlite3

# Caddy's apt repo
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | \
  gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  > /etc/apt/sources.list.d/caddy-stable.list
apt-get update
apt-get install -y caddy

caddy version # sanity
```

- [ ] Exit the container: `exit`.

### 3.3 Upload the release

On your workstation:

- [ ] `scp /tmp/server_release.tgz root@$LXC_IP:/tmp/`

Back inside the LXC (`pct enter 200`):

- [ ] Unpack:

```bash
mkdir -p /opt/proxmox-monitor
tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
ls /opt/proxmox-monitor/server/bin/
# Expected: server, migrate, server.bat, migrate.bat
```

### 3.4 Data directory + environment file

- [ ] Create the data dir:

```bash
install -d -m 0700 /var/lib/proxmox-monitor
install -d -m 0755 /etc/default
```

- [ ] Create `/etc/default/proxmox-monitor`. Substitute the values you
  generated in § 2.3:

```bash
cat > /etc/default/proxmox-monitor <<'EOF'
DATABASE_PATH=/var/lib/proxmox-monitor/monitor.db
SECRET_KEY_BASE=
DASHBOARD_PASSWORD_HASH=
PHX_SERVER=true
PHX_HOST=monitor.example.com
PORT=4000
EOF
chmod 0600 /etc/default/proxmox-monitor
```

**Gotchas:**
- `DASHBOARD_PASSWORD_HASH` contains `$` characters. The heredoc above quotes
  its delimiter (`<<'EOF'`), which keeps them literal. If you instead write
  the file with an unquoted `<<EOF` heredoc, or paste the value into a
  double-quoted string, the shell will silently expand the `$...` pieces and
  truncate the hash; escape each `$` with `\$` in that case.
- No spaces around `=`.
- No quotes around values in the file itself.

### 3.5 Run the first migration

- [ ] Apply migrations:

```bash
set -a; . /etc/default/proxmox-monitor; set +a
/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
```

**Expected output:**

```
[info] == Running 20260421200116 Server.Repo.Migrations.CreateHosts.change/0 forward
[info] create table hosts
...
[info] == Migrated 20260421200116 in 0.0s
[info] == Running 20260421202512 Server.Repo.Migrations.CreateMetrics.change/0 forward
[info] create table metrics
...
[info] == Migrated 20260421202512 in 0.0s
```

- [ ] Verify the DB exists:

```bash
ls -la /var/lib/proxmox-monitor/monitor.db
sqlite3 /var/lib/proxmox-monitor/monitor.db '.tables'
# Expected: hosts metrics schema_migrations
```

### 3.6 systemd unit for the server

- [ ] Write the unit:

```bash
cat > /etc/systemd/system/proxmox-monitor.service <<'EOF'
[Unit]
Description=Proxmox Monitor Server
After=network-online.target
Wants=network-online.target

[Service]
Type=exec
EnvironmentFile=/etc/default/proxmox-monitor
ExecStartPre=/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
ExecStart=/opt/proxmox-monitor/server/bin/server start
ExecStop=/opt/proxmox-monitor/server/bin/server stop
Restart=always
RestartSec=5
User=root

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now proxmox-monitor
```

- [ ] Watch it come up:

```bash
journalctl -u proxmox-monitor -f
# wait for: "Running ServerWeb.Endpoint with Bandit"
# then Ctrl+C
```

- [ ] Smoke-test from inside the LXC:

```bash
curl -s http://127.0.0.1:4000/health
# Expected: {"db":"ok","status":"ok","version":"0.1.0"}
```

**If you see anything other than `status:ok`**: stop. Check `journalctl -u
proxmox-monitor -n 100`.
Common causes: missing env var (check `/etc/default`), +DB path not writable. + +### 3.7 Caddy TLS + reverse proxy + +- [ ] Copy the template and edit: + +```bash +cp /opt/proxmox-monitor/server/lib/server-0.1.0/priv/docs/Caddyfile.example \ + /etc/caddy/Caddyfile \ + 2>/dev/null || \ + scp root@:proxmox_monitor/server/docs/Caddyfile.example \ + /etc/caddy/Caddyfile +# (The first form only works if you bundled docs into the release; the second +# pulls fresh from your checkout.) + +sed -i "s/monitor.example.com/$ACTUAL_HOST/g" /etc/caddy/Caddyfile +caddy validate --config /etc/caddy/Caddyfile +``` + +- [ ] Reload Caddy: + +```bash +systemctl reload caddy +journalctl -u caddy -n 30 +# Expected: "certificate obtained successfully" from Let's Encrypt +# (only on first reload after DNS is set correctly) +``` + +- [ ] Verify from the public internet: + +```bash +# From any outside machine: +curl -s https://monitor.example.com/health +# Expected: {"db":"ok","status":"ok","version":"0.1.0"} +``` + +**If this fails:** +- `curl -vI` to see where it stops. Name resolution? TCP? TLS? +- `dig +short monitor.example.com` — does it point to the expected IP? +- Check the hypervisor's firewall / any cloud-level firewall for port 443. + +### 3.8 Browser smoke-test + +- [ ] Open `https://monitor.example.com/` in a browser. +- [ ] Confirm: redirect to `/login`. +- [ ] Log in with your dashboard password. +- [ ] Confirm: empty overview page with "No hosts registered yet." + +**If login fails** with "Incorrect password": your `DASHBOARD_PASSWORD_HASH` env +doesn't match the password you typed. Re-generate and re-deploy §3.4. + +--- + +## § 4. First Agent — Dry Run + +Pick one Proxmox host. This run will validate the whole pipeline before you +touch more hosts. + +### 4.1 Register the host in the dashboard + +- [ ] Browser → `https://monitor.example.com/admin/hosts`. +- [ ] Enter the short name (`pve-host-01` or whatever matches your + convention). Click **Add**. 
+- [ ] The page reveals a token. **Copy it now** — it is shown only once. + +### 4.2 Copy the binary + systemd unit to the host + +From your workstation (substitute ``): + +```bash +export HOST= + +scp agent/dist/proxmox-monitor-agent_linux_amd64 \ + root@$HOST:/usr/local/bin/proxmox-monitor-agent +ssh root@$HOST 'chmod 0755 /usr/local/bin/proxmox-monitor-agent' + +scp agent/rel/proxmox-monitor-agent.service \ + root@$HOST:/etc/systemd/system/ +``` + +### 4.3 Write the agent config + +SSH to the host (`ssh root@$HOST`) and: + +```bash +install -d -m 0700 /etc/proxmox-monitor +install -d -m 0700 /var/cache/proxmox-monitor-agent + +cat > /etc/proxmox-monitor/agent.toml <<'EOF' +server_url = "wss://monitor.example.com/socket/websocket" +token = "" +host_id = "pve-host-01" + +[intervals] +fast_seconds = 30 +medium_seconds = 300 +slow_seconds = 1800 +EOF + +chmod 0600 /etc/proxmox-monitor/agent.toml +``` + +### 4.4 Enable the agent + +Still on the Proxmox host: + +```bash +systemctl daemon-reload +systemctl enable --now proxmox-monitor-agent +``` + +- [ ] Watch the log: + +```bash +journalctl -u proxmox-monitor-agent -f +``` + +**Expected within 10s:** + +``` +agent: starting with host_id=pve-host-01 +reporter: connected, joining host:pve-host-01 +reporter: joined host:pve-host-01 +``` + +Ctrl+C to stop tailing. + +### 4.5 Confirm in the dashboard + +- [ ] Reload `https://monitor.example.com/` — the card for `pve-host-01` + should show **online**, status green, with Load/RAM/Pools/VMs populated. +- [ ] Click the card. Verify each section (ZFS pools, snapshots, storage, + VMs) has real data. + +### 4.6 Stop-and-restart verification + +Verify the offline flip works as designed. + +- [ ] On the Proxmox host: `systemctl stop proxmox-monitor-agent`. +- [ ] Dashboard card should switch to **offline** (grey border) within ~1s. +- [ ] `systemctl start proxmox-monitor-agent` — card flips back to green + within ~30s. 
**If the card stays green when the agent is stopped**: the Channel terminate
callback didn't fire, which usually means Caddy's `read_timeout` is set too
short or absent. Check `/etc/caddy/Caddyfile` contains `read_timeout 90s`.

### 4.7 Token rotation sanity-check

- [ ] In the admin UI, click **Rotate** on the host. Confirm.
- [ ] On the Proxmox host, `journalctl -u proxmox-monitor-agent -f` —
  within ~30s the agent should log `reporter: disconnected` then begin
  reconnecting, failing with `invalid_token`.
- [ ] Update `/etc/proxmox-monitor/agent.toml` with the new token and
  `systemctl restart proxmox-monitor-agent`. Verify green again.

---

## § 5. Test Tier (2–3 Hosts)

Pick 2–3 Proxmox hosts that are either non-critical, or critical but with
existing independent monitoring you can fall back on.

### 5.1 Roll out

- [ ] For each host, repeat § 4.1–4.5. Use distinct `host_id` values.

### 5.2 Observe for 24 hours

- [ ] Leave the test tier running overnight.
- [ ] Next morning, verify every test-tier card still shows **online**.
- [ ] Check `journalctl -u proxmox-monitor` on the server:
  - No `[error]` lines repeating.
  - `retention: pruned N stale samples` appears once the oldest data passes
    48h (retention fires hourly but only deletes samples older than 48h, so
    on a fresh install this line won't show during the first two days; don't
    block on it yet).

### 5.3 Restart test

Reboot one of the Proxmox hosts. Watch the dashboard:

- [ ] Card goes offline during the reboot.
- [ ] Card flips back to online within a minute of the host coming back,
  without you touching anything.

### 5.4 Server reboot test

- [ ] On the server LXC: `systemctl restart proxmox-monitor`.
- [ ] All agents should briefly flip to offline, then back to online within
  ~30s as their Slipstream clients reconnect.
- [ ] No agents should end up stuck offline requiring manual restart.

**If any agent stays offline**: its Slipstream reconnect backoff may need
investigation. `journalctl -u proxmox-monitor-agent -f` on the affected host.
### 5.5 Go / No-Go gate for full rollout

Do NOT proceed to § 6 until **all** of these are true for 24h:

- [ ] All test-tier hosts show **online** continuously.
- [ ] No repeating error lines in server logs.
- [ ] Retention has pruned ≥ 1 row (this needs ≥ 48h of data; extend the
  soak until it happens rather than skipping it).
- [ ] Rotate + restart behavior works as expected.
- [ ] Dashboard is responsive (<1s LiveView updates).

---

## § 6. Full Rollout

For each remaining Proxmox host:

1. Admin UI → register host, copy token.
2. `scp` binary + systemd unit.
3. Write `/etc/proxmox-monitor/agent.toml`.
4. `systemctl enable --now proxmox-monitor-agent`.
5. Verify in dashboard.

### 6.1 Loop shortcut

Once you've done 3–4 hosts by hand and are confident, you can batch. The
tricky part is that each host needs a unique token, so the admin-UI step
still has to be interactive. One workflow:

```bash
# On your workstation:
for HOST in pve-host-04 pve-host-05 pve-host-06; do
  echo ">>> Setting up $HOST"
  echo "Register $HOST in the admin UI, paste its token here, then press Enter:"
  read -r -s TOKEN
  scp agent/dist/proxmox-monitor-agent_linux_amd64 \
    root@$HOST:/usr/local/bin/proxmox-monitor-agent
  scp agent/rel/proxmox-monitor-agent.service \
    root@$HOST:/etc/systemd/system/
  # $TOKEN and $HOST expand locally; \" produces literal quotes in the TOML.
  ssh root@$HOST "
    set -e
    chmod 0755 /usr/local/bin/proxmox-monitor-agent
    install -d -m 0700 /etc/proxmox-monitor /var/cache/proxmox-monitor-agent
    cat > /etc/proxmox-monitor/agent.toml <<EOF
server_url = \"wss://monitor.example.com/socket/websocket\"
token = \"$TOKEN\"
host_id = \"$HOST\"

[intervals]
fast_seconds = 30
medium_seconds = 300
slow_seconds = 1800
EOF
    chmod 0600 /etc/proxmox-monitor/agent.toml
    systemctl daemon-reload
    systemctl enable --now proxmox-monitor-agent
  "
  echo ">>> $HOST done."
done
```

### 6.2 Validation at scale

After every batch of ~5 hosts:

- [ ] Open `/` and confirm the card count matches how many agents you've
  configured.
- [ ] Sort/filter by offline — should be empty.
- [ ] Click a random card and confirm real payload data.

### 6.3 Completion check

- [ ] Overview shows all N hosts.
- [ ] None are in `offline` or `critical` state (unless that's actually true
  of the host, e.g. a real DEGRADED pool).
- [ ] VM search returns hits for a well-known VM name.

---

## § 7.
Rollback + +### 7.1 Disable a single agent + +```bash +ssh root@$HOST 'systemctl disable --now proxmox-monitor-agent' +``` + +Dashboard card flips to offline. Delete from `/admin/hosts` if you want it +gone entirely. + +### 7.2 Take the whole service down + +```bash +# Inside the server LXC +systemctl stop proxmox-monitor +systemctl stop caddy +``` + +Agents will keep trying to reconnect every few seconds (harmless). Dashboard +is gone. + +### 7.3 Roll back to a previous server release + +If a new version misbehaves: + +```bash +# On the LXC — assuming you kept the previous /tmp/server_release_PREV.tgz +systemctl stop proxmox-monitor +rm -rf /opt/proxmox-monitor/server +tar -xzf /tmp/server_release_PREV.tgz -C /opt/proxmox-monitor +systemctl start proxmox-monitor +``` + +Your SQLite DB has not been touched — rollbacks are cheap as long as the +migration list didn't change between versions. + +### 7.4 DB restore from backup + +See § 8.4 for creating backups. To restore: + +```bash +systemctl stop proxmox-monitor +cp /var/backups/proxmox-monitor/monitor-YYYY-MM-DD.db /var/lib/proxmox-monitor/monitor.db +chown root:root /var/lib/proxmox-monitor/monitor.db +systemctl start proxmox-monitor +``` + +Host tokens in the restored DB are still valid. Metrics from after the backup +are lost — that's 48h max given the retention policy. + +--- + +## § 8. Ongoing Operations + +### 8.1 Upgrading the server + +Work from the repo on your workstation: + +```bash +# 1. Build +cd server +MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' mix release --overwrite +tar -czf /tmp/server_release.tgz -C _build/prod/rel server + +# 2. Upload, keeping the previous around for rollback +scp /tmp/server_release.tgz root@:/tmp/ + +# 3. 
Swap on the LXC
ssh root@ '
  systemctl stop proxmox-monitor
  mv /opt/proxmox-monitor/server /opt/proxmox-monitor/server.old
  tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
  systemctl start proxmox-monitor # ExecStartPre runs migrate
  sleep 5
  systemctl status proxmox-monitor --no-pager
'
```

Verify `/health` responds before deleting the `.old` copy. Once it does, keep
the tarball for the *next* upgrade's rollback (§ 7.3 expects this file):

```bash
ssh root@ 'cp /tmp/server_release.tgz /tmp/server_release_PREV.tgz'
```

### 8.2 Upgrading an agent

```bash
scp agent/dist/proxmox-monitor-agent_linux_amd64 \
  root@$HOST:/usr/local/bin/proxmox-monitor-agent.new
ssh root@$HOST '
  mv /usr/local/bin/proxmox-monitor-agent{.new,}
  systemctl restart proxmox-monitor-agent
'
```

### 8.3 Token rotation (leak or routine)

1. Dashboard → Admin → **Rotate** on the affected host.
2. Copy the new token.
3. SSH to the host: update `/etc/proxmox-monitor/agent.toml`, `systemctl
   restart proxmox-monitor-agent`.
4. Verify card flips back to green.

### 8.4 SQLite backup (daily via cron)

The DB is small. SQLite's online backup is safe while the server runs.

Install a cron job inside the LXC. Note that `/etc/cron.d` entries do not
support backslash line continuations, so the command stays on one line:

```bash
cat > /etc/cron.d/proxmox-monitor-backup <<'EOF'
# Minute Hour Dom Month Dow User Command
30 3 * * * root install -d -m 0700 /var/backups/proxmox-monitor && sqlite3 /var/lib/proxmox-monitor/monitor.db ".backup /var/backups/proxmox-monitor/monitor-$(date +\%Y-\%m-\%d).db" && find /var/backups/proxmox-monitor -name 'monitor-*.db' -mtime +30 -delete
EOF
```

Keeps 30 days of daily snapshots.
+ +### 8.5 Log inspection + +Server: + +```bash +# Live +journalctl -u proxmox-monitor -f + +# Last 500 +journalctl -u proxmox-monitor -n 500 --no-pager + +# Errors only +journalctl -u proxmox-monitor -p err --no-pager +``` + +Agents (from the server for any host): + +```bash +ssh root@$HOST 'journalctl -u proxmox-monitor-agent -n 200 --no-pager' +``` + +### 8.6 External uptime monitoring + +Point your uptime service (UptimeRobot, BetterUptime, your-own-Prometheus, +etc.) at: + +``` +https://monitor.example.com/health +``` + +Expect `{"status":"ok","db":"ok","version":"0.1.0"}` with HTTP 200. Alert on +anything else. + +### 8.7 Changing the dashboard password + +1. On your workstation: + +```bash +cd server +mix run -e 'IO.puts(Argon2.hash_pwd_salt(""))' +``` + +2. On the server LXC: edit `/etc/default/proxmox-monitor`, replace + `DASHBOARD_PASSWORD_HASH`, `systemctl restart proxmox-monitor`. +3. All existing sessions are invalidated on next request. + +--- + +## § 9. Go / No-Go Sign-Off + +Tick each box before declaring the rollout complete. + +### Production readiness + +- [ ] `https://monitor.example.com/health` returns 200 / `status:ok`. +- [ ] External uptime monitor is configured and reporting green. +- [ ] All intended Proxmox hosts appear on the overview and show **online**. +- [ ] At least one full 48h retention cycle has completed (retention log + shows pruning). +- [ ] SQLite backup cron is installed and yesterday's `.db` file exists. +- [ ] You have rolled back once on purpose (drill), proving § 7 works. + +### Access & secrets hygiene + +- [ ] Dashboard password is in a password manager, not a text file. +- [ ] `SECRET_KEY_BASE` is in a password manager. +- [ ] `/etc/default/proxmox-monitor` is `0600 root:root`. +- [ ] `/etc/proxmox-monitor/agent.toml` is `0600 root:root` on every host. +- [ ] You know how to rotate an agent token in < 2 minutes. + +### Documentation handoff + +- [ ] This runbook's checkboxes are all green for the current rollout. 
- [ ] If you're handing this to a teammate, you've walked them through one
  agent install and one token rotation live.

**If all of the above are green, the monitor is in production.**

---

## Appendix A — Common Errors

| Symptom | First thing to check |
|------------------------------------------------------|--------------------------------------------------------------------|
| Browser gets `NET::ERR_CERT_AUTHORITY_INVALID` | Caddy didn't finish LE cert issuance. Wait 60s; then `journalctl -u caddy`. |
| Login page loops — correct password rejected | `DASHBOARD_PASSWORD_HASH` mismatch. Regenerate. |
| Card stays offline after agent restart | Wrong token or `unknown_host` (name mismatch). Check agent journal. |
| All agents reconnect every ~30s | Caddy `read_timeout` missing or too short. |
| `/health` returns 503 | Server process is up but the SQLite path is unreadable or has wrong permissions. |
| LXC can't bind port 4000 | Another process owns it. `ss -ltnp \| grep 4000`. |
| `mix release` fails with DASHBOARD error | You forgot to set `DASHBOARD_PASSWORD_HASH=placeholder` at build. |
| Agent logs `{:enoent, "pvesh"}` | Agent is running on a non-Proxmox host, or `$PATH` is empty under systemd. |

## Appendix B — File & Port Cheat Sheet

```
Server LXC
  /opt/proxmox-monitor/server/ release tree
  /etc/default/proxmox-monitor env secrets, 0600
  /etc/systemd/system/proxmox-monitor.service
  /etc/caddy/Caddyfile
  /var/lib/proxmox-monitor/monitor.db SQLite
  /var/backups/proxmox-monitor/ daily backups
  tcp 443 (caddy) → tcp 127.0.0.1:4000 (phoenix)

Proxmox host (per agent)
  /usr/local/bin/proxmox-monitor-agent
  /etc/proxmox-monitor/agent.toml token + intervals, 0600
  /etc/systemd/system/proxmox-monitor-agent.service
  /var/cache/proxmox-monitor-agent/ Burrito unpack cache
  no listening ports
```
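## Appendix C — Health Probe Sketch

The § 8.6 uptime check reduces to one predicate. A sketch (`healthy` is a
hypothetical helper; the response body is passed as an argument so the check
itself can be exercised without network access):

```bash
# Succeeds only if the /health body reports "status":"ok".
healthy() {
  printf '%s' "$1" | grep -q '"status":"ok"'
}

# Live usage, from any machine with outbound 443
# (monitor.example.com is the usual placeholder):
#   healthy "$(curl -fsS https://monitor.example.com/health)" && echo GREEN || echo RED
```

Useful as the body of a cron-driven alert if you don't run an external
uptime service.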