From 31f7172cf336ac5d22b5e247c49861f6dd238be5 Mon Sep 17 00:00:00 2001 From: Carsten Date: Wed, 22 Apr 2026 08:51:04 +0200 Subject: [PATCH] docs: SETUP-AND-DEPLOY runbook for phase 5 production rollout Single top-to-bottom runbook covering preflight, local build, server deploy, first-agent dry run, test tier, full rollout, rollback, and ongoing ops. Each step has a verification command. Ends with a Go/No-Go sign-off list. --- SETUP-AND-DEPLOY.md | 838 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 838 insertions(+) create mode 100644 SETUP-AND-DEPLOY.md diff --git a/SETUP-AND-DEPLOY.md b/SETUP-AND-DEPLOY.md new file mode 100644 index 0000000..f03ba47 --- /dev/null +++ b/SETUP-AND-DEPLOY.md @@ -0,0 +1,838 @@ +# Proxmox Monitor — Setup & Deploy Runbook + +> **This document is a runbook, not reference material.** Work from top to bottom. +> Each checkbox is an action. Don't skip verification steps — they exist because +> we've been burned by skipping them. + +**Target audience:** the operator who owns the monitor service. + +**Total time:** 2–3 hours end-to-end with a small test tier, plus however long +your Proxmox fleet takes to roll out (10–15 min per host). 
+ +**Phases of this runbook:** + +| § | Purpose | Touch points | +|-----|-----------------------------------------------|-----------------------| +| 1 | **Preflight** — confirm prerequisites | Local only | +| 2 | **Local build** — produce artifacts | Your workstation | +| 3 | **Server deploy** — one-time LXC bring-up | Hypervisor + LXC | +| 4 | **First agent** — prove the pipeline | One Proxmox host | +| 5 | **Test tier** — 2–3 hosts for 24h | Small batch | +| 6 | **Full rollout** — the remaining hosts | Fleet-wide | +| 7 | **Rollback** — when something goes wrong | — | +| 8 | **Ongoing operations** | Upgrades, backups | +| 9 | **Go / No-Go** sign-off | Final gate | + +Related docs (reference, not sequential): +- `server/docs/deploy-lxc.md` — deeper LXC detail +- `agent/docs/install.md` — single-host agent install +- `server/docs/Caddyfile.example` — TLS/WSS proxy template +- `proxmox-monitor-konzept.md` — design concept +- `docs/deployment-overview.md` — high-level picture + +--- + +## § 1. Preflight Checklist + +### 1.1 Hardware & network + +- [ ] **Server LXC** can be provisioned on a Proxmox host in the RZ. + Minimum: 1 GB RAM, 2 cores, 10 GB disk. Debian 12 template available. +- [ ] **DNS A record** for `monitor.` points at the public IP of the + Proxmox host that hosts the LXC. Verify with `dig +short monitor.`. +- [ ] **Port 443 inbound** to the server LXC's public IP is open (Caddy will + get Let's Encrypt certs via HTTP-01 and serve on 443). +- [ ] **Outbound HTTPS from every Proxmox host to `monitor.`** is + open. Agents connect out; no inbound port is required on Proxmox hosts. +- [ ] You have **SSH root access** to: + - The hypervisor running the server LXC (for `pct create` / `pct enter`) + - Every Proxmox host that will run an agent +- [ ] **Docker** is installed and daemon is running on your build machine + (`docker --version` should succeed and `docker ps` should not error). + If not, use a Linux box (even the server LXC itself) as the build host. 
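Most of § 1.1 is eyeballs-and-SSH, but the tool checks can be scripted. A
minimal sketch (`need` is a hypothetical helper name, not part of the repo):

```bash
#!/bin/sh
# Preflight tool check (sketch). Prints one line per tool; anything
# marked MISSING must be installed before continuing.
need() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "MISSING: $1"
  fi
}

need ssh
need scp
need dig
need docker
```

Run it on your workstation. `docker` may legitimately come back MISSING there
if you plan to build on a Linux box instead, per the last checkbox above.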
+ +### 1.2 Versions + +- [ ] Proxmox hosts are **VE 8.3+** with **OpenZFS 2.3+** (check with + `pveversion` and `zfs --version`). If some hosts are older, either + upgrade them first or accept that ZFS payloads will be empty on those. + +### 1.3 Tools on your workstation + +- [ ] Elixir 1.19 + OTP 28 (`elixir --version`) +- [ ] Mix + Hex (`mix local.hex`) +- [ ] SSH + scp +- [ ] `sqlite3` CLI (for smoke-test DB inspection; optional) + +### 1.4 Secrets plan + +Write down (don't commit) the three secrets you'll need. Keep them in a password +manager. + +| Secret | Generated how | +|-------------------------------|--------------------------------------------------------------| +| Dashboard password (plaintext) | You choose it. Use a strong random string. | +| `SECRET_KEY_BASE` | `cd server && mix phx.gen.secret` (64-byte base64) | +| Agent tokens | Created by the admin UI, one per host, revealed once. | + +--- + +## § 2. Local Build + +Do this once, on your build machine. Re-run for every upgrade. + +### 2.1 Clone the repo + +- [ ] `git clone ` if you don't already have it. +- [ ] `cd proxmox_monitor` +- [ ] `git pull --ff-only origin main` + +### 2.2 Confirm tests are green + +- [ ] `cd server && mix deps.get && mix test` +- [ ] `cd ../agent && mix deps.get && mix test` + +**Expected:** both suites pass. If any test fails, **stop here**. Fix or cherry-pick +a known-good commit before continuing. + +### 2.3 Build the server release + +- [ ] Generate `DASHBOARD_PASSWORD_HASH` once: + +```bash +cd server +mix run -e 'IO.puts(Argon2.hash_pwd_salt(""))' +``` + +Copy the `$argon2id$...` line into your password manager. You'll paste it +into the LXC env file later. 
+ +- [ ] Build the release (the placeholder is only needed to satisfy runtime.exs + during build; the real value is set on the LXC at start time): + +```bash +MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' mix release --overwrite +``` + +**Expected:** `_build/prod/rel/server/` contains `bin/server`, `bin/migrate`, +`erts-*`, `lib/`, `releases/`. + +- [ ] Package the release: + +```bash +tar -czf /tmp/server_release.tgz -C _build/prod/rel server +ls -lh /tmp/server_release.tgz +``` + +**Expected:** ~30–60 MB tarball. + +### 2.4 Build the agent binaries + +Requires Docker running locally (or do this on a Linux host). + +- [ ] `cd ../agent` +- [ ] `./scripts/build-linux.sh` + +**Expected output (~5–10 min first run, much faster with Docker layer cache on +subsequent runs):** + +``` +Binaries written to /path/to/agent/dist: +proxmox-monitor-agent_linux_amd64 +proxmox-monitor-agent_linux_arm64 +``` + +- [ ] Sanity check: + +```bash +file dist/proxmox-monitor-agent_linux_amd64 | grep -E "ELF 64-bit" +``` + +**Expected:** `ELF 64-bit LSB executable, x86-64`. + +If Docker isn't available on your workstation: scp the `agent/` directory onto +the server LXC after § 3, run `./scripts/build-linux.sh` there, then scp the +binaries back. The LXC doesn't need Docker at runtime. + +--- + +## § 3. Server Deployment + +One-time. Subsequent upgrades use § 8.1. + +### 3.1 Create the LXC (on the hypervisor) + +- [ ] SSH to the hypervisor and run: + +```bash +pct create 200 \ + /var/lib/vz/template/cache/debian-12-standard_12.7-1_amd64.tar.zst \ + --hostname proxmox-monitor \ + --memory 1024 --cores 2 \ + --rootfs local-zfs:10 \ + --net0 name=eth0,bridge=vmbr0,ip=dhcp \ + --unprivileged 1 --features nesting=0 --onboot 1 +pct start 200 +``` + +Adjust the container ID (`200`), bridge, and rootfs to match your environment. 
- [ ] Get the LXC's IP:

```bash
pct exec 200 -- ip -4 addr show eth0 | grep -Po 'inet \K[\d.]+'
```

Put this IP in `LXC_IP` for the rest of this section (use a shell variable,
not a literal in every command — typos here cost hours).

### 3.2 Base packages inside the LXC

```bash
pct enter 200
```

- [ ] Install Caddy + SQLite + tools:

```bash
apt-get update
apt-get install -y ca-certificates curl gnupg debian-keyring debian-archive-keyring apt-transport-https sqlite3

# Caddy's apt repo
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | \
  gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  > /etc/apt/sources.list.d/caddy-stable.list
apt-get update
apt-get install -y caddy

caddy version # sanity
```

- [ ] Exit the container: `exit`.

### 3.3 Upload the release

On your workstation:

- [ ] `scp /tmp/server_release.tgz root@$LXC_IP:/tmp/`

Back inside the LXC (`pct enter 200`):

- [ ] Unpack:

```bash
mkdir -p /opt/proxmox-monitor
tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
ls /opt/proxmox-monitor/server/bin/
# Expected: server, migrate, server.bat, migrate.bat
```

### 3.4 Data directory + environment file

- [ ] Create the data dir:

```bash
install -d -m 0700 /var/lib/proxmox-monitor
install -d -m 0755 /etc/default
```

- [ ] Create `/etc/default/proxmox-monitor`. Substitute the values you
  generated in § 2.3:

```bash
cat > /etc/default/proxmox-monitor <<'EOF'
DATABASE_PATH=/var/lib/proxmox-monitor/monitor.db
SECRET_KEY_BASE=
DASHBOARD_PASSWORD_HASH=
PHX_SERVER=true
PHX_HOST=monitor.example.com
PORT=4000
EOF
chmod 0600 /etc/default/proxmox-monitor
```

**Gotchas:**
- `DASHBOARD_PASSWORD_HASH` contains `$` characters. The heredoc above quotes
  its delimiter (`<<'EOF'`), which keeps them literal. If you instead write
  the file with an unquoted `<<EOF` heredoc, or paste the value into a
  double-quoted string, the shell will silently expand the `$...` pieces and
  truncate the hash; escape each `$` with `\$` in that case.
- No spaces around `=`.
- No quotes around values in the file itself.

### 3.5 Run the first migration

- [ ] Apply migrations:

```bash
set -a; . /etc/default/proxmox-monitor; set +a
/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
```

**Expected output:**

```
[info] == Running 20260421200116 Server.Repo.Migrations.CreateHosts.change/0 forward
[info] create table hosts
...
[info] == Migrated 20260421200116 in 0.0s
[info] == Running 20260421202512 Server.Repo.Migrations.CreateMetrics.change/0 forward
[info] create table metrics
...
[info] == Migrated 20260421202512 in 0.0s
```

- [ ] Verify the DB exists:

```bash
ls -la /var/lib/proxmox-monitor/monitor.db
sqlite3 /var/lib/proxmox-monitor/monitor.db '.tables'
# Expected: hosts metrics schema_migrations
```

### 3.6 systemd unit for the server

- [ ] Write the unit:

```bash
cat > /etc/systemd/system/proxmox-monitor.service <<'EOF'
[Unit]
Description=Proxmox Monitor Server
After=network-online.target
Wants=network-online.target

[Service]
Type=exec
EnvironmentFile=/etc/default/proxmox-monitor
ExecStartPre=/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
ExecStart=/opt/proxmox-monitor/server/bin/server start
ExecStop=/opt/proxmox-monitor/server/bin/server stop
Restart=always
RestartSec=5
User=root

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now proxmox-monitor
```

- [ ] Watch it come up:

```bash
journalctl -u proxmox-monitor -f
# wait for: "Running ServerWeb.Endpoint with Bandit"
# then Ctrl+C
```

- [ ] Smoke-test from inside the LXC:

```bash
curl -s http://127.0.0.1:4000/health
# Expected: {"db":"ok","status":"ok","version":"0.1.0"}
```

**If you see anything other than `status:ok`**: stop. Check `journalctl -u
proxmox-monitor -n 100`.
Common causes: missing env var (check `/etc/default`), +DB path not writable. + +### 3.7 Caddy TLS + reverse proxy + +- [ ] Copy the template and edit: + +```bash +cp /opt/proxmox-monitor/server/lib/server-0.1.0/priv/docs/Caddyfile.example \ + /etc/caddy/Caddyfile \ + 2>/dev/null || \ + scp root@:proxmox_monitor/server/docs/Caddyfile.example \ + /etc/caddy/Caddyfile +# (The first form only works if you bundled docs into the release; the second +# pulls fresh from your checkout.) + +sed -i "s/monitor.example.com/$ACTUAL_HOST/g" /etc/caddy/Caddyfile +caddy validate --config /etc/caddy/Caddyfile +``` + +- [ ] Reload Caddy: + +```bash +systemctl reload caddy +journalctl -u caddy -n 30 +# Expected: "certificate obtained successfully" from Let's Encrypt +# (only on first reload after DNS is set correctly) +``` + +- [ ] Verify from the public internet: + +```bash +# From any outside machine: +curl -s https://monitor.example.com/health +# Expected: {"db":"ok","status":"ok","version":"0.1.0"} +``` + +**If this fails:** +- `curl -vI` to see where it stops. Name resolution? TCP? TLS? +- `dig +short monitor.example.com` — does it point to the expected IP? +- Check the hypervisor's firewall / any cloud-level firewall for port 443. + +### 3.8 Browser smoke-test + +- [ ] Open `https://monitor.example.com/` in a browser. +- [ ] Confirm: redirect to `/login`. +- [ ] Log in with your dashboard password. +- [ ] Confirm: empty overview page with "No hosts registered yet." + +**If login fails** with "Incorrect password": your `DASHBOARD_PASSWORD_HASH` env +doesn't match the password you typed. Re-generate and re-deploy §3.4. + +--- + +## § 4. First Agent — Dry Run + +Pick one Proxmox host. This run will validate the whole pipeline before you +touch more hosts. + +### 4.1 Register the host in the dashboard + +- [ ] Browser → `https://monitor.example.com/admin/hosts`. +- [ ] Enter the short name (`pve-host-01` or whatever matches your + convention). Click **Add**. 
+- [ ] The page reveals a token. **Copy it now** — it is shown only once. + +### 4.2 Copy the binary + systemd unit to the host + +From your workstation (substitute ``): + +```bash +export HOST= + +scp agent/dist/proxmox-monitor-agent_linux_amd64 \ + root@$HOST:/usr/local/bin/proxmox-monitor-agent +ssh root@$HOST 'chmod 0755 /usr/local/bin/proxmox-monitor-agent' + +scp agent/rel/proxmox-monitor-agent.service \ + root@$HOST:/etc/systemd/system/ +``` + +### 4.3 Write the agent config + +SSH to the host (`ssh root@$HOST`) and: + +```bash +install -d -m 0700 /etc/proxmox-monitor +install -d -m 0700 /var/cache/proxmox-monitor-agent + +cat > /etc/proxmox-monitor/agent.toml <<'EOF' +server_url = "wss://monitor.example.com/socket/websocket" +token = "" +host_id = "pve-host-01" + +[intervals] +fast_seconds = 30 +medium_seconds = 300 +slow_seconds = 1800 +EOF + +chmod 0600 /etc/proxmox-monitor/agent.toml +``` + +### 4.4 Enable the agent + +Still on the Proxmox host: + +```bash +systemctl daemon-reload +systemctl enable --now proxmox-monitor-agent +``` + +- [ ] Watch the log: + +```bash +journalctl -u proxmox-monitor-agent -f +``` + +**Expected within 10s:** + +``` +agent: starting with host_id=pve-host-01 +reporter: connected, joining host:pve-host-01 +reporter: joined host:pve-host-01 +``` + +Ctrl+C to stop tailing. + +### 4.5 Confirm in the dashboard + +- [ ] Reload `https://monitor.example.com/` — the card for `pve-host-01` + should show **online**, status green, with Load/RAM/Pools/VMs populated. +- [ ] Click the card. Verify each section (ZFS pools, snapshots, storage, + VMs) has real data. + +### 4.6 Stop-and-restart verification + +Verify the offline flip works as designed. + +- [ ] On the Proxmox host: `systemctl stop proxmox-monitor-agent`. +- [ ] Dashboard card should switch to **offline** (grey border) within ~1s. +- [ ] `systemctl start proxmox-monitor-agent` — card flips back to green + within ~30s. 
**If the card stays green when the agent is stopped**: the Channel terminate
callback didn't fire, which usually means Caddy's `read_timeout` is set too
short or absent. Check `/etc/caddy/Caddyfile` contains `read_timeout 90s`.

### 4.7 Token rotation sanity-check

- [ ] In the admin UI, click **Rotate** on the host. Confirm.
- [ ] On the Proxmox host, `journalctl -u proxmox-monitor-agent -f` —
  within ~30s the agent should log `reporter: disconnected` then begin
  reconnecting, failing with `invalid_token`.
- [ ] Update `/etc/proxmox-monitor/agent.toml` with the new token and
  `systemctl restart proxmox-monitor-agent`. Verify green again.

---

## § 5. Test Tier (2–3 Hosts)

Pick 2–3 Proxmox hosts that are either non-critical, or critical but with
existing independent monitoring you can fall back on.

### 5.1 Roll out

- [ ] For each host, repeat § 4.1–4.5. Use distinct `host_id` values.

### 5.2 Observe for 24 hours

- [ ] Leave the test tier running overnight.
- [ ] Next morning, verify every test-tier card still shows **online**.
- [ ] Check `journalctl -u proxmox-monitor` on the server:
  - No `[error]` lines repeating.
  - `retention: pruned N stale samples` appears once the oldest data passes
    48h (retention fires hourly but only deletes samples older than 48h, so
    on a fresh install this line won't show during the first two days; don't
    block on it yet).

### 5.3 Restart test

Reboot one of the Proxmox hosts. Watch the dashboard:

- [ ] Card goes offline during the reboot.
- [ ] Card flips back to online within a minute of the host coming back,
  without you touching anything.

### 5.4 Server reboot test

- [ ] On the server LXC: `systemctl restart proxmox-monitor`.
- [ ] All agents should briefly flip to offline, then back to online within
  ~30s as their Slipstream clients reconnect.
- [ ] No agents should end up stuck offline requiring manual restart.

**If any agent stays offline**: its Slipstream reconnect backoff may need
investigation. `journalctl -u proxmox-monitor-agent -f` on the affected host.
### 5.5 Go / No-Go gate for full rollout

Do NOT proceed to § 6 until **all** of these are true for 24h:

- [ ] All test-tier hosts show **online** continuously.
- [ ] No repeating error lines in server logs.
- [ ] Retention has pruned ≥ 1 row (this needs ≥ 48h of data; extend the
  soak until it happens rather than skipping it).
- [ ] Rotate + restart behavior works as expected.
- [ ] Dashboard is responsive (<1s LiveView updates).

---

## § 6. Full Rollout

For each remaining Proxmox host:

1. Admin UI → register host, copy token.
2. `scp` binary + systemd unit.
3. Write `/etc/proxmox-monitor/agent.toml`.
4. `systemctl enable --now proxmox-monitor-agent`.
5. Verify in dashboard.

### 6.1 Loop shortcut

Once you've done 3–4 hosts by hand and are confident, you can batch. The
tricky part is that each host needs a unique token, so the admin-UI step
still has to be interactive. One workflow:

```bash
# On your workstation:
for HOST in pve-host-04 pve-host-05 pve-host-06; do
  echo ">>> Setting up $HOST"
  echo "Register $HOST in the admin UI, paste its token here, then press Enter:"
  read -r -s TOKEN
  scp agent/dist/proxmox-monitor-agent_linux_amd64 \
    root@$HOST:/usr/local/bin/proxmox-monitor-agent
  scp agent/rel/proxmox-monitor-agent.service \
    root@$HOST:/etc/systemd/system/
  # $TOKEN and $HOST expand locally; \" produces literal quotes in the TOML.
  ssh root@$HOST "
    set -e
    chmod 0755 /usr/local/bin/proxmox-monitor-agent
    install -d -m 0700 /etc/proxmox-monitor /var/cache/proxmox-monitor-agent
    cat > /etc/proxmox-monitor/agent.toml <<EOF
server_url = \"wss://monitor.example.com/socket/websocket\"
token = \"$TOKEN\"
host_id = \"$HOST\"

[intervals]
fast_seconds = 30
medium_seconds = 300
slow_seconds = 1800
EOF
    chmod 0600 /etc/proxmox-monitor/agent.toml
    systemctl daemon-reload
    systemctl enable --now proxmox-monitor-agent
  "
  echo ">>> $HOST done."
done
```

### 6.2 Validation at scale

After every batch of ~5 hosts:

- [ ] Open `/` and confirm the card count matches how many agents you've
  configured.
- [ ] Sort/filter by offline — should be empty.
- [ ] Click a random card and confirm real payload data.

### 6.3 Completion check

- [ ] Overview shows all N hosts.
- [ ] None are in `offline` or `critical` state (unless that's actually true
  of the host, e.g. a real DEGRADED pool).
- [ ] VM search returns hits for a well-known VM name.

---

## § 7.
Rollback + +### 7.1 Disable a single agent + +```bash +ssh root@$HOST 'systemctl disable --now proxmox-monitor-agent' +``` + +Dashboard card flips to offline. Delete from `/admin/hosts` if you want it +gone entirely. + +### 7.2 Take the whole service down + +```bash +# Inside the server LXC +systemctl stop proxmox-monitor +systemctl stop caddy +``` + +Agents will keep trying to reconnect every few seconds (harmless). Dashboard +is gone. + +### 7.3 Roll back to a previous server release + +If a new version misbehaves: + +```bash +# On the LXC — assuming you kept the previous /tmp/server_release_PREV.tgz +systemctl stop proxmox-monitor +rm -rf /opt/proxmox-monitor/server +tar -xzf /tmp/server_release_PREV.tgz -C /opt/proxmox-monitor +systemctl start proxmox-monitor +``` + +Your SQLite DB has not been touched — rollbacks are cheap as long as the +migration list didn't change between versions. + +### 7.4 DB restore from backup + +See § 8.4 for creating backups. To restore: + +```bash +systemctl stop proxmox-monitor +cp /var/backups/proxmox-monitor/monitor-YYYY-MM-DD.db /var/lib/proxmox-monitor/monitor.db +chown root:root /var/lib/proxmox-monitor/monitor.db +systemctl start proxmox-monitor +``` + +Host tokens in the restored DB are still valid. Metrics from after the backup +are lost — that's 48h max given the retention policy. + +--- + +## § 8. Ongoing Operations + +### 8.1 Upgrading the server + +Work from the repo on your workstation: + +```bash +# 1. Build +cd server +MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' mix release --overwrite +tar -czf /tmp/server_release.tgz -C _build/prod/rel server + +# 2. Upload, keeping the previous around for rollback +scp /tmp/server_release.tgz root@:/tmp/ + +# 3. 
Swap on the LXC
ssh root@ '
  systemctl stop proxmox-monitor
  mv /opt/proxmox-monitor/server /opt/proxmox-monitor/server.old
  tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
  systemctl start proxmox-monitor # ExecStartPre runs migrate
  sleep 5
  systemctl status proxmox-monitor --no-pager
'
```

Verify `/health` responds before deleting the `.old` copy. Once it does, keep
the tarball for the *next* upgrade's rollback (§ 7.3 expects this file):

```bash
ssh root@ 'cp /tmp/server_release.tgz /tmp/server_release_PREV.tgz'
```

### 8.2 Upgrading an agent

```bash
scp agent/dist/proxmox-monitor-agent_linux_amd64 \
  root@$HOST:/usr/local/bin/proxmox-monitor-agent.new
ssh root@$HOST '
  mv /usr/local/bin/proxmox-monitor-agent{.new,}
  systemctl restart proxmox-monitor-agent
'
```

### 8.3 Token rotation (leak or routine)

1. Dashboard → Admin → **Rotate** on the affected host.
2. Copy the new token.
3. SSH to the host: update `/etc/proxmox-monitor/agent.toml`, `systemctl
   restart proxmox-monitor-agent`.
4. Verify card flips back to green.

### 8.4 SQLite backup (daily via cron)

The DB is small. SQLite's online backup is safe while the server runs.

Install a cron job inside the LXC. Note that `/etc/cron.d` entries do not
support backslash line continuations, so the command stays on one line:

```bash
cat > /etc/cron.d/proxmox-monitor-backup <<'EOF'
# Minute Hour Dom Month Dow User Command
30 3 * * * root install -d -m 0700 /var/backups/proxmox-monitor && sqlite3 /var/lib/proxmox-monitor/monitor.db ".backup /var/backups/proxmox-monitor/monitor-$(date +\%Y-\%m-\%d).db" && find /var/backups/proxmox-monitor -name 'monitor-*.db' -mtime +30 -delete
EOF
```

Keeps 30 days of daily snapshots.
+ +### 8.5 Log inspection + +Server: + +```bash +# Live +journalctl -u proxmox-monitor -f + +# Last 500 +journalctl -u proxmox-monitor -n 500 --no-pager + +# Errors only +journalctl -u proxmox-monitor -p err --no-pager +``` + +Agents (from the server for any host): + +```bash +ssh root@$HOST 'journalctl -u proxmox-monitor-agent -n 200 --no-pager' +``` + +### 8.6 External uptime monitoring + +Point your uptime service (UptimeRobot, BetterUptime, your-own-Prometheus, +etc.) at: + +``` +https://monitor.example.com/health +``` + +Expect `{"status":"ok","db":"ok","version":"0.1.0"}` with HTTP 200. Alert on +anything else. + +### 8.7 Changing the dashboard password + +1. On your workstation: + +```bash +cd server +mix run -e 'IO.puts(Argon2.hash_pwd_salt(""))' +``` + +2. On the server LXC: edit `/etc/default/proxmox-monitor`, replace + `DASHBOARD_PASSWORD_HASH`, `systemctl restart proxmox-monitor`. +3. All existing sessions are invalidated on next request. + +--- + +## § 9. Go / No-Go Sign-Off + +Tick each box before declaring the rollout complete. + +### Production readiness + +- [ ] `https://monitor.example.com/health` returns 200 / `status:ok`. +- [ ] External uptime monitor is configured and reporting green. +- [ ] All intended Proxmox hosts appear on the overview and show **online**. +- [ ] At least one full 48h retention cycle has completed (retention log + shows pruning). +- [ ] SQLite backup cron is installed and yesterday's `.db` file exists. +- [ ] You have rolled back once on purpose (drill), proving § 7 works. + +### Access & secrets hygiene + +- [ ] Dashboard password is in a password manager, not a text file. +- [ ] `SECRET_KEY_BASE` is in a password manager. +- [ ] `/etc/default/proxmox-monitor` is `0600 root:root`. +- [ ] `/etc/proxmox-monitor/agent.toml` is `0600 root:root` on every host. +- [ ] You know how to rotate an agent token in < 2 minutes. + +### Documentation handoff + +- [ ] This runbook's checkboxes are all green for the current rollout. 
- [ ] If you're handing this to a teammate, you've walked them through one
  agent install and one token rotation live.

**If all of the above are green, the monitor is in production.**

---

## Appendix A — Common Errors

| Symptom | First thing to check |
|------------------------------------------------------|--------------------------------------------------------------------|
| Browser gets `NET::ERR_CERT_AUTHORITY_INVALID` | Caddy didn't finish LE cert issuance. Wait 60s; then `journalctl -u caddy`. |
| Login page loops — correct password rejected | `DASHBOARD_PASSWORD_HASH` mismatch. Regenerate. |
| Card stays offline after agent restart | Wrong token or `unknown_host` (name mismatch). Check agent journal. |
| All agents reconnect every ~30s | Caddy `read_timeout` missing or too short. |
| `/health` returns 503 | Server process is up but the SQLite path is unreadable or has wrong permissions. |
| LXC can't bind port 4000 | Another process owns it. `ss -ltnp \| grep 4000`. |
| `mix release` fails with DASHBOARD error | You forgot to set `DASHBOARD_PASSWORD_HASH=placeholder` at build. |
| Agent logs `{:enoent, "pvesh"}` | Agent is running on a non-Proxmox host, or `$PATH` is empty under systemd. |

## Appendix B — File & Port Cheat Sheet

```
Server LXC
  /opt/proxmox-monitor/server/ release tree
  /etc/default/proxmox-monitor env secrets, 0600
  /etc/systemd/system/proxmox-monitor.service
  /etc/caddy/Caddyfile
  /var/lib/proxmox-monitor/monitor.db SQLite
  /var/backups/proxmox-monitor/ daily backups
  tcp 443 (caddy) → tcp 127.0.0.1:4000 (phoenix)

Proxmox host (per agent)
  /usr/local/bin/proxmox-monitor-agent
  /etc/proxmox-monitor/agent.toml token + intervals, 0600
  /etc/systemd/system/proxmox-monitor-agent.service
  /var/cache/proxmox-monitor-agent/ Burrito unpack cache
  no listening ports
```
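## Appendix C — Health Probe Sketch

The § 8.6 uptime check reduces to one predicate. A sketch (`healthy` is a
hypothetical helper; the response body is passed as an argument so the check
itself can be exercised without network access):

```bash
# Succeeds only if the /health body reports "status":"ok".
healthy() {
  printf '%s' "$1" | grep -q '"status":"ok"'
}

# Live usage, from any machine with outbound 443
# (monitor.example.com is the usual placeholder):
#   healthy "$(curl -fsS https://monitor.example.com/health)" && echo GREEN || echo RED
```

Useful as the body of a cron-driven alert if you don't run an external
uptime service.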