# Proxmox Monitor — Setup & Deploy Runbook
> **This document is a runbook, not reference material.** Work from top to bottom.
> Each checkbox is an action. Don't skip verification steps — they exist because
> we've been burned by skipping them.

**Target audience:** the operator who owns the monitor service.
**Total time:** 2-3 hours end-to-end with a small test tier, plus however long
your Proxmox fleet takes to roll out (10-15 min per host).
**Phases of this runbook:**

| § | Purpose | Touch points |
|-----|-----------------------------------------------|-----------------------|
| 1 | **Preflight** — confirm prerequisites | Local only |
| 2 | **Local build** — produce artifacts | Your workstation |
| 3 | **Server deploy** — one-time LXC bring-up | Hypervisor + LXC |
| 4 | **First agent** — prove the pipeline | One Proxmox host |
| 5 | **Test tier** — 2-3 hosts for 24h | Small batch |
| 6 | **Full rollout** — the remaining hosts | Fleet-wide |
| 7 | **Rollback** — when something goes wrong | — |
| 8 | **Ongoing operations** | Upgrades, backups |
| 9 | **Go / No-Go** sign-off | Final gate |
Related docs (reference, not sequential):
- `server/docs/deploy-lxc.md` — deeper LXC detail
- `agent/docs/install.md` — single-host agent install
- `server/docs/Caddyfile.example` — TLS/WSS proxy template
- `proxmox-monitor-konzept.md` — design concept
- `docs/deployment-overview.md` — high-level picture
---
## § 1. Preflight Checklist
### 1.1 Hardware & network
- [ ] **Server LXC** can be provisioned on a Proxmox host in the RZ.
Minimum: 1 GB RAM, 2 cores, 10 GB disk. Debian 12 template available.
- [ ] **DNS A record** for `monitor.<yourdomain>` points at the public IP of the
Proxmox host that hosts the LXC. Verify with `dig +short monitor.<yourdomain>`.
- [ ] **Port 443 inbound** to the server LXC's public IP is open (Caddy will
get Let's Encrypt certs via HTTP-01 and serve on 443).
- [ ] **Outbound HTTPS from every Proxmox host to `monitor.<yourdomain>`** is
open. Agents connect out; no inbound port is required on Proxmox hosts.
- [ ] You have **SSH root access** to:
- The hypervisor running the server LXC (for `pct create` / `pct enter`)
- Every Proxmox host that will run an agent
- [ ] **Docker** is installed and daemon is running on your build machine
(`docker --version` should succeed and `docker ps` should not error).
If not, use a Linux box (even the server LXC itself) as the build host.
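The port-reachability checks above can be scripted instead of eyeballed. A minimal sketch using bash's `/dev/tcp` redirection (the helper name is ours, not part of the repo):

```shell
# check_port: report whether a TCP port on a host is reachable from this
# machine. Relies on bash's /dev/tcp pseudo-device and coreutils timeout.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Usage, once DNS exists:
#   check_port monitor.<yourdomain> 443
```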
### 1.2 Versions
- [ ] Proxmox hosts are **VE 8.3+** with **OpenZFS 2.3+** (check with
`pveversion` and `zfs --version`). If some hosts are older, either
upgrade them first or accept that ZFS payloads will be empty on those.
### 1.3 Tools on your workstation
- [ ] Elixir 1.19 + OTP 28 (`elixir --version`)
- [ ] Mix + Hex (`mix local.hex`)
- [ ] SSH + scp
- [ ] `sqlite3` CLI (for smoke-test DB inspection; optional)
### 1.4 Secrets plan
Write down (don't commit) the three secrets you'll need. Keep them in a password
manager.

| Secret | Generated how |
|-------------------------------|--------------------------------------------------------------|
| Dashboard password (plaintext) | You choose it. Use a strong random string. |
| `SECRET_KEY_BASE` | `cd server && mix phx.gen.secret` (64-byte base64) |
| Agent tokens | Created by the admin UI, one per host, revealed once. |
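For the dashboard password, any generator your password manager offers is fine; a plain-shell alternative, if you want one (32 alphanumeric characters is our choice, adjust to taste):

```shell
# Draw 32 alphanumeric characters from the kernel CSPRNG.
PASSWORD=$(tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 32)
echo "${#PASSWORD}"   # length check: prints 32
```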
---
## § 2. Local Build
Do this once, on your build machine. Re-run for every upgrade.
### 2.1 Clone the repo
- [ ] `git clone <repo>` if you don't already have it.
- [ ] `cd proxmox_monitor`
- [ ] `git pull --ff-only origin main`
### 2.2 Confirm tests are green
- [ ] `cd server && mix deps.get && mix test`
- [ ] `cd ../agent && mix deps.get && mix test`
**Expected:** both suites pass. If any test fails, **stop here**. Fix or cherry-pick
a known-good commit before continuing.
### 2.3 Build the server release
- [ ] Generate `DASHBOARD_PASSWORD_HASH` once:
```bash
cd server
mix run -e 'IO.puts(Argon2.hash_pwd_salt("<your-dashboard-password>"))'
```
Copy the `$argon2id$...` line into your password manager. You'll paste it
into the LXC env file later.
- [ ] Build the release (the placeholder is only needed to satisfy runtime.exs
during build; the real value is set on the LXC at start time):
```bash
MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' mix release --overwrite
```
The release step also runs `mix assets.deploy` as a pre-assemble step, so
minified + digested JS/CSS are baked into the tarball automatically. You
don't need to run `assets.deploy` separately.
**Expected:** `_build/prod/rel/server/` contains `bin/server`, `bin/migrate`,
`erts-*`, `lib/`, `releases/`, and `lib/server-0.1.0/priv/static/cache_manifest.json`.
- [ ] Package the release:
```bash
tar -czf /tmp/server_release.tgz -C _build/prod/rel server
ls -lh /tmp/server_release.tgz
```
**Expected:** ~30-60 MB tarball.
### 2.4 Build the agent binaries
Requires Docker running locally (or do this on a Linux host).
- [ ] `cd ../agent`
- [ ] `./scripts/build-linux.sh`
**Expected output (~5-10 min first run, much faster with Docker layer cache on
subsequent runs):**
```
Binaries written to /path/to/agent/dist:
proxmox-monitor-agent_linux_amd64
proxmox-monitor-agent_linux_arm64
```
- [ ] Sanity check:
```bash
file dist/proxmox-monitor-agent_linux_amd64 | grep -E "ELF 64-bit"
```
**Expected:** `ELF 64-bit LSB executable, x86-64`.
If Docker isn't available on your workstation: scp the `agent/` directory onto
the server LXC after § 3, run `./scripts/build-linux.sh` there, then scp the
binaries back. The LXC doesn't need Docker at runtime.
---
## § 3. Server Deployment
One-time. Subsequent upgrades use § 8.1.
### 3.1 Create the LXC (on the hypervisor)
- [ ] SSH to the hypervisor and run:
```bash
pct create 200 \
/var/lib/vz/template/cache/debian-12-standard_12.7-1_amd64.tar.zst \
--hostname proxmox-monitor \
--memory 1024 --cores 2 \
--rootfs local-zfs:10 \
--net0 name=eth0,bridge=vmbr0,ip=dhcp \
--unprivileged 1 --features nesting=0 --onboot 1
pct start 200
```
Adjust the container ID (`200`), bridge, and rootfs to match your environment.
- [ ] Get the LXC's IP:
```bash
pct exec 200 -- ip -4 addr show eth0 | grep -Po 'inet \K[\d.]+'
```
Put this IP in `LXC_IP` for the rest of this section (use a shell variable,
not a literal in every command — typos here cost hours).
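If you want the extraction to be reusable, the `grep` can be wrapped in a tiny stdin filter (our naming, not the repo's; requires GNU grep for `-P`):

```shell
# first_ipv4: print the first IPv4 address found in `ip -4 addr` output
# read from stdin.
first_ipv4() {
  grep -Po 'inet \K[\d.]+' | head -n1
}

# LXC_IP=$(pct exec 200 -- ip -4 addr show eth0 | first_ipv4)
```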
### 3.2 Base packages inside the LXC
```bash
pct enter 200
```
- [ ] Install Caddy + SQLite + tools:
```bash
apt-get update
apt-get install -y ca-certificates curl gnupg debian-keyring debian-archive-keyring apt-transport-https sqlite3
# Caddy's apt repo
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | \
gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
> /etc/apt/sources.list.d/caddy-stable.list
apt-get update
apt-get install -y caddy
caddy version # sanity
```
- [ ] Exit the container: `exit`.
### 3.3 Upload the release
On your workstation:
- [ ] `scp /tmp/server_release.tgz root@<LXC_IP>:/tmp/`
Back inside the LXC (`pct enter 200`):
- [ ] Unpack:
```bash
mkdir -p /opt/proxmox-monitor
tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
ls /opt/proxmox-monitor/server/bin/
# Expected: server, migrate, server.bat, migrate.bat
```
### 3.4 Data directory + environment file
- [ ] Create the data dir:
```bash
install -d -m 0700 /var/lib/proxmox-monitor
install -d -m 0755 /etc/default
```
- [ ] Create `/etc/default/proxmox-monitor`. Substitute the values you
generated in § 2.3:
```bash
cat > /etc/default/proxmox-monitor <<'EOF'
DATABASE_PATH=/var/lib/proxmox-monitor/monitor.db
SECRET_KEY_BASE=<paste-output-of-mix-phx.gen.secret>
DASHBOARD_PASSWORD_HASH=<paste-$argon2id$-hash-from-2.3>
PHX_SERVER=true
PHX_HOST=monitor.example.com
PORT=4000
EOF
chmod 0600 /etc/default/proxmox-monitor
```
**Gotchas:**
- `DASHBOARD_PASSWORD_HASH` contains `$` characters. The quoted heredoc above
  (`<<'EOF'`) preserves them; if you write the file any other way (an unquoted
  heredoc, or `echo` with double quotes), the shell silently expands and eats
  them. In that case escape each `$` as `\$`.
- No spaces around `=`.
- No quotes around values in the file itself.
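A quick way to confirm the hash survived the write (helper name is ours, not shipped with the project):

```shell
# check_hash_intact: verify the Argon2 hash kept its "$" separators after
# being written to the env file. A shell that expanded the "$..." segments
# would leave a mangled value behind.
check_hash_intact() {
  if grep -q 'DASHBOARD_PASSWORD_HASH=\$argon2' "$1"; then
    echo "intact"
  else
    echo "mangled"
  fi
}

# check_hash_intact /etc/default/proxmox-monitor
```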
### 3.5 Run the first migration
- [ ] Apply migrations:
```bash
set -a; . /etc/default/proxmox-monitor; set +a
/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
```
**Expected output:**
```
[info] == Running 20260421200116 Server.Repo.Migrations.CreateHosts.change/0 forward
[info] create table hosts
...
[info] == Migrated 20260421200116 in 0.0s
[info] == Running 20260421202512 Server.Repo.Migrations.CreateMetrics.change/0 forward
[info] create table metrics
...
[info] == Migrated 20260421202512 in 0.0s
```
- [ ] Verify the DB exists:
```bash
ls -la /var/lib/proxmox-monitor/monitor.db
sqlite3 /var/lib/proxmox-monitor/monitor.db '.tables'
# Expected: hosts metrics schema_migrations
```
### 3.6 systemd unit for the server
- [ ] Write the unit:
```bash
cat > /etc/systemd/system/proxmox-monitor.service <<'EOF'
[Unit]
Description=Proxmox Monitor Server
After=network-online.target
Wants=network-online.target
[Service]
Type=exec
EnvironmentFile=/etc/default/proxmox-monitor
ExecStartPre=/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
ExecStart=/opt/proxmox-monitor/server/bin/server start
ExecStop=/opt/proxmox-monitor/server/bin/server stop
Restart=always
RestartSec=5
User=root
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now proxmox-monitor
```
- [ ] Watch it come up:
```bash
journalctl -u proxmox-monitor -f
# wait for: "Running ServerWeb.Endpoint with Bandit"
# then Ctrl+C
```
- [ ] Smoke-test from inside the LXC:
```bash
curl -s http://127.0.0.1:4000/health
# Expected: {"db":"ok","status":"ok","version":"0.1.0"}
```
**If you see anything other than `status:ok`**: stop. Check `journalctl -u
proxmox-monitor -n 100`. Common causes: missing env var (check `/etc/default`),
DB path not writable.
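If you'd rather script the wait than watch journalctl, a small polling helper works (name and 30s timeout are our choices, not from the repo):

```shell
# wait_healthy: poll a /health URL until it reports status ok, up to ~30s.
wait_healthy() {
  local url=$1 i
  for i in $(seq 1 30); do
    if curl -fsS "$url" 2>/dev/null | grep -q '"status":"ok"'; then
      echo "healthy"
      return 0
    fi
    sleep 1
  done
  echo "timeout"
  return 1
}

# wait_healthy http://127.0.0.1:4000/health
```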
### 3.7 Caddy TLS + reverse proxy
- [ ] Copy the template and edit:
```bash
cp /opt/proxmox-monitor/server/lib/server-0.1.0/priv/docs/Caddyfile.example \
/etc/caddy/Caddyfile \
2>/dev/null || \
scp root@<YOUR-WORKSTATION>:proxmox_monitor/server/docs/Caddyfile.example \
/etc/caddy/Caddyfile
# (The first form only works if you bundled docs into the release; the second
# pulls fresh from your checkout.)
ACTUAL_HOST=monitor.<yourdomain>   # substitute your real hostname first
sed -i "s/monitor.example.com/$ACTUAL_HOST/g" /etc/caddy/Caddyfile
caddy validate --config /etc/caddy/Caddyfile
```
- [ ] Reload Caddy:
```bash
systemctl reload caddy
journalctl -u caddy -n 30
# Expected: "certificate obtained successfully" from Let's Encrypt
# (only on first reload after DNS is set correctly)
```
- [ ] Verify from the public internet:
```bash
# From any outside machine:
curl -s https://monitor.example.com/health
# Expected: {"db":"ok","status":"ok","version":"0.1.0"}
```
**If this fails:**
- `curl -vI` to see where it stops. Name resolution? TCP? TLS?
- `dig +short monitor.example.com` — does it point to the expected IP?
- Check the hypervisor's firewall / any cloud-level firewall for port 443.
### 3.8 Browser smoke-test
- [ ] Open `https://monitor.example.com/` in a browser.
- [ ] Confirm: redirect to `/login`.
- [ ] Log in with your dashboard password.
- [ ] Confirm: empty overview page with "No hosts registered yet."
**If login fails** with "Incorrect password": the `DASHBOARD_PASSWORD_HASH` in
the env file doesn't match the password you typed. Regenerate the hash and
repeat § 3.4.
---
## § 4. First Agent — Dry Run
Pick one Proxmox host. This run will validate the whole pipeline before you
touch more hosts.
### 4.1 Register the host in the dashboard
- [ ] Browser → `https://monitor.example.com/admin/hosts`.
- [ ] Enter the short name (`pve-host-01` or whatever matches your
convention). Click **Add**.
- [ ] The page reveals a token. **Copy it now** — it is shown only once.
### 4.2 Copy the binary + systemd unit to the host
From your workstation (substitute `<HOST>`):
```bash
export HOST=<proxmox-host-ip-or-name>
scp agent/dist/proxmox-monitor-agent_linux_amd64 \
root@$HOST:/usr/local/bin/proxmox-monitor-agent
ssh root@$HOST 'chmod 0755 /usr/local/bin/proxmox-monitor-agent'
scp agent/rel/proxmox-monitor-agent.service \
root@$HOST:/etc/systemd/system/
```
### 4.3 Write the agent config
SSH to the host (`ssh root@$HOST`) and:
```bash
install -d -m 0700 /etc/proxmox-monitor
install -d -m 0700 /var/cache/proxmox-monitor-agent
cat > /etc/proxmox-monitor/agent.toml <<'EOF'
server_url = "wss://monitor.example.com/socket/websocket"
token = "<paste-token-from-dashboard>"
host_id = "pve-host-01"
[intervals]
fast_seconds = 30
medium_seconds = 300
slow_seconds = 1800
EOF
chmod 0600 /etc/proxmox-monitor/agent.toml
```
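Before enabling the service, a crude presence check on the required keys can save a restart loop (the helper is ours, not shipped with the agent):

```shell
# agent_toml_check: confirm the three required top-level keys exist in the
# agent config. This is a line-based sketch, not a real TOML parser.
agent_toml_check() {
  local f=$1 k ok=1
  for k in server_url token host_id; do
    grep -q "^${k}[[:space:]]*=" "$f" || { echo "missing: $k"; ok=0; }
  done
  [ "$ok" -eq 1 ] && echo "config ok"
}

# agent_toml_check /etc/proxmox-monitor/agent.toml
```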
### 4.4 Enable the agent
Still on the Proxmox host:
```bash
systemctl daemon-reload
systemctl enable --now proxmox-monitor-agent
```
- [ ] Watch the log:
```bash
journalctl -u proxmox-monitor-agent -f
```
**Expected within 10s:**
```
agent: starting with host_id=pve-host-01
reporter: connected, joining host:pve-host-01
reporter: joined host:pve-host-01
```
Ctrl+C to stop tailing.
### 4.5 Confirm in the dashboard
- [ ] Reload `https://monitor.example.com/` — the card for `pve-host-01`
should show **online**, status green, with Load/RAM/Pools/VMs populated.
- [ ] Click the card. Verify each section (ZFS pools, snapshots, storage,
VMs) has real data.
### 4.6 Stop-and-restart verification
Verify the offline flip works as designed.
- [ ] On the Proxmox host: `systemctl stop proxmox-monitor-agent`.
- [ ] Dashboard card should switch to **offline** (grey border) within ~1s.
- [ ] `systemctl start proxmox-monitor-agent` — card flips back to green
within ~30s.
**If the card stays green when the agent is stopped**: the Channel terminate
callback didn't fire, which usually means Caddy's `read_timeout` is set too
short or absent. Check `/etc/caddy/Caddyfile` contains `read_timeout 90s`.
### 4.7 Token rotation sanity-check
- [ ] In the admin UI, click **Rotate** on the host. Confirm.
- [ ] On the Proxmox host, run `journalctl -u proxmox-monitor-agent -f`.
      Within ~30s the agent should log `reporter: disconnected`, then begin
      reconnecting and fail with `invalid_token`.
- [ ] Update `/etc/proxmox-monitor/agent.toml` with the new token and
`systemctl restart proxmox-monitor-agent`. Verify green again.
---
## § 5. Test Tier (2-3 Hosts)
Pick 2-3 Proxmox hosts that are either non-critical, or critical but with
existing independent monitoring you can fall back on.
### 5.1 Roll out
- [ ] For each host, repeat § 4.1-4.5. Use distinct `host_id` values.
### 5.2 Observe for 24 hours
- [ ] Leave the test tier running overnight.
- [ ] Next morning, verify all three cards still show **online**.
- [ ] Check `journalctl -u proxmox-monitor` on the server:
- No `[error]` lines repeating.
- `retention: pruned N stale samples` appears ≥ 1 time (retention fires
hourly; after 48h it starts deleting).
### 5.3 Restart test
Reboot one of the Proxmox hosts. Watch the dashboard:
- [ ] Card goes offline during the reboot.
- [ ] Card flips back to online within a minute of the host coming back,
without you touching anything.
### 5.4 Server reboot test
- [ ] On the server LXC: `systemctl restart proxmox-monitor`.
- [ ] All agents should briefly flip to offline, then back to online within
~30s as their Slipstream clients reconnect.
- [ ] No agents should end up stuck offline requiring manual restart.
**If any agent stays offline**: its Slipstream reconnect backoff may need
investigation. `journalctl -u proxmox-monitor-agent -f` on the affected host.
### 5.5 Go / No-Go gate for full rollout
Do NOT proceed to § 6 until **all** of these are true for 24h:
- [ ] All test-tier hosts show **online** continuously.
- [ ] No repeating error lines in server logs.
- [ ] Retention has pruned ≥ 1 row.
- [ ] Rotate + restart behavior works as expected.
- [ ] Dashboard is responsive (<1s LiveView updates).
---
## § 6. Full Rollout
For each remaining Proxmox host:
1. Admin UI register host, copy token.
2. `scp` binary + systemd unit.
3. Write `/etc/proxmox-monitor/agent.toml`.
4. `systemctl enable --now proxmox-monitor-agent`.
5. Verify in dashboard.
### 6.1 Loop shortcut
Once you've done 3-4 hosts by hand and are confident, you can batch. The
tricky part is that each host needs a unique token, so the admin-UI step
still has to be interactive. One workflow:
```bash
# On your workstation:
for HOST in pve-host-04 pve-host-05 pve-host-06; do
echo ">>> Setting up $HOST"
echo "Register in the admin UI, paste token here, then press Enter:"
read -s TOKEN
scp agent/dist/proxmox-monitor-agent_linux_amd64 \
root@$HOST:/usr/local/bin/proxmox-monitor-agent
scp agent/rel/proxmox-monitor-agent.service \
root@$HOST:/etc/systemd/system/
ssh root@$HOST "chmod 0755 /usr/local/bin/proxmox-monitor-agent && \
install -d -m 0700 /etc/proxmox-monitor /var/cache/proxmox-monitor-agent && \
cat > /etc/proxmox-monitor/agent.toml <<EOF
server_url = \"wss://monitor.example.com/socket/websocket\"
token = \"$TOKEN\"
host_id = \"$HOST\"
[intervals]
fast_seconds = 30
medium_seconds = 300
slow_seconds = 1800
EOF
chmod 0600 /etc/proxmox-monitor/agent.toml && \
systemctl daemon-reload && \
systemctl enable --now proxmox-monitor-agent"
echo ">>> $HOST done."
done
```
### 6.2 Validation at scale
After every batch of ~5 hosts:
- [ ] Open `/` and confirm the card count matches how many agents you've
configured.
- [ ] Filtering the overview by **offline** should show no hosts.
- [ ] Click a random card and confirm real payload data.
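After each batch you can also check service state over SSH rather than clicking through cards. A sketch, assuming the same key-based root SSH used throughout (the function name is ours):

```shell
# batch_agent_status: print each host's agent service state, one per line;
# "unreachable" when ssh itself fails (empty output from the remote command).
batch_agent_status() {
  local host state
  for host in "$@"; do
    state=$(ssh root@"$host" systemctl is-active proxmox-monitor-agent 2>/dev/null)
    printf '%-14s %s\n' "$host" "${state:-unreachable}"
  done
}

# batch_agent_status pve-host-04 pve-host-05 pve-host-06
```

Note that `systemctl is-active` prints `inactive`/`failed` on its own for hosts where the unit exists but isn't running, so those show up distinctly from SSH failures.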
### 6.3 Completion check
- [ ] Overview shows all N hosts.
- [ ] None are in `offline` or `critical` state (unless that's actually true
of the host, e.g. a real DEGRADED pool).
- [ ] VM search returns hits for a well-known VM name.
---
## § 7. Rollback
### 7.1 Disable a single agent
```bash
ssh root@$HOST 'systemctl disable --now proxmox-monitor-agent'
```
Dashboard card flips to offline. Delete from `/admin/hosts` if you want it
gone entirely.
### 7.2 Take the whole service down
```bash
# Inside the server LXC
systemctl stop proxmox-monitor
systemctl stop caddy
```
Agents will keep trying to reconnect every few seconds (harmless). Dashboard
is gone.
### 7.3 Roll back to a previous server release
If a new version misbehaves:
```bash
# On the LXC — assuming you kept the previous /tmp/server_release_PREV.tgz
systemctl stop proxmox-monitor
rm -rf /opt/proxmox-monitor/server
tar -xzf /tmp/server_release_PREV.tgz -C /opt/proxmox-monitor
systemctl start proxmox-monitor
```
Your SQLite DB has not been touched; rollbacks are cheap as long as the
migration list didn't change between versions.
### 7.4 DB restore from backup
See § 8.4 for creating backups. To restore:
```bash
systemctl stop proxmox-monitor
cp /var/backups/proxmox-monitor/monitor-YYYY-MM-DD.db /var/lib/proxmox-monitor/monitor.db
chown root:root /var/lib/proxmox-monitor/monitor.db
systemctl start proxmox-monitor
```
Host tokens in the restored DB are still valid. Metrics from after the backup
are lost; that's 48h at most given the retention policy.
---
## § 8. Ongoing Operations
### 8.1 Upgrading the server
Work from the repo on your workstation:
```bash
# 1. Build
cd server
MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' mix release --overwrite
tar -czf /tmp/server_release.tgz -C _build/prod/rel server
# 2. Upload, keeping the previous release around for rollback (§ 7.3).
#    Move the old tarball aside BEFORE scp overwrites it.
ssh root@<LXC> 'mv -f /tmp/server_release.tgz /tmp/server_release_PREV.tgz 2>/dev/null || true'
scp /tmp/server_release.tgz root@<LXC>:/tmp/
# 3. Swap on the LXC
ssh root@<LXC> '
systemctl stop proxmox-monitor
mv /opt/proxmox-monitor/server /opt/proxmox-monitor/server.old
tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
systemctl start proxmox-monitor # ExecStartPre runs migrate
sleep 5
systemctl status proxmox-monitor --no-pager
'
```
Verify `/health` responds before deleting the `.old` copy.
### 8.2 Upgrading an agent
```bash
scp agent/dist/proxmox-monitor-agent_linux_amd64 \
root@$HOST:/usr/local/bin/proxmox-monitor-agent.new
ssh root@$HOST '
mv /usr/local/bin/proxmox-monitor-agent{.new,}
systemctl restart proxmox-monitor-agent
'
```
### 8.3 Token rotation (leak or routine)
1. Dashboard Admin **Rotate** on the affected host.
2. Copy the new token.
3. SSH to the host: update `/etc/proxmox-monitor/agent.toml`, `systemctl
restart proxmox-monitor-agent`.
4. Verify card flips back to green.
### 8.4 SQLite backup (recommended weekly)
The DB is small. SQLite's online backup is safe while the server runs.
Install a cron job inside the LXC:
```bash
cat > /etc/cron.d/proxmox-monitor-backup <<'EOF'
# Minute Hour Dom Month Dow User Command
# (a cron.d entry must be a single line; cron does not honor "\" continuations)
30 3 * * * root install -d -m 0700 /var/backups/proxmox-monitor && sqlite3 /var/lib/proxmox-monitor/monitor.db ".backup /var/backups/proxmox-monitor/monitor-$(date +\%Y-\%m-\%d).db" && find /var/backups/proxmox-monitor -name 'monitor-*.db' -mtime +30 -delete
EOF
```
Keeps 30 days of daily snapshots.
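Worth pairing with the cron job: an occasional integrity check of the newest backup before you trust it (the helper name is ours; `sqlite3` is already installed in the LXC from § 3.2):

```shell
# verify_backup: run SQLite's built-in integrity check on a backup file.
# Prints "ok" for a healthy database.
verify_backup() {
  sqlite3 "$1" 'PRAGMA integrity_check;'
}

# verify_backup "$(ls -t /var/backups/proxmox-monitor/monitor-*.db | head -n1)"
```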
### 8.5 Log inspection
Server:
```bash
# Live
journalctl -u proxmox-monitor -f
# Last 500
journalctl -u proxmox-monitor -n 500 --no-pager
# Errors only
journalctl -u proxmox-monitor -p err --no-pager
```
Agents (from the server for any host):
```bash
ssh root@$HOST 'journalctl -u proxmox-monitor-agent -n 200 --no-pager'
```
### 8.6 External uptime monitoring
Point your uptime service (UptimeRobot, BetterUptime, your-own-Prometheus,
etc.) at:
```
https://monitor.example.com/health
```
Expect `{"status":"ok","db":"ok","version":"0.1.0"}` with HTTP 200. Alert on
anything else.
### 8.7 Changing the dashboard password
1. On your workstation:
```bash
cd server
mix run -e 'IO.puts(Argon2.hash_pwd_salt("<new-password>"))'
```
2. On the server LXC: edit `/etc/default/proxmox-monitor`, replace
`DASHBOARD_PASSWORD_HASH`, `systemctl restart proxmox-monitor`.
3. All existing sessions are invalidated on next request.
---
## § 9. Go / No-Go Sign-Off
Tick each box before declaring the rollout complete.
### Production readiness
- [ ] `https://monitor.example.com/health` returns 200 / `status:ok`.
- [ ] External uptime monitor is configured and reporting green.
- [ ] All intended Proxmox hosts appear on the overview and show **online**.
- [ ] At least one full 48h retention cycle has completed (retention log
shows pruning).
- [ ] SQLite backup cron is installed and yesterday's `.db` file exists.
- [ ] You have rolled back once on purpose (drill), proving § 7 works.
### Access & secrets hygiene
- [ ] Dashboard password is in a password manager, not a text file.
- [ ] `SECRET_KEY_BASE` is in a password manager.
- [ ] `/etc/default/proxmox-monitor` is `0600 root:root`.
- [ ] `/etc/proxmox-monitor/agent.toml` is `0600 root:root` on every host.
- [ ] You know how to rotate an agent token in < 2 minutes.
### Documentation handoff
- [ ] This runbook's checkboxes are all green for the current rollout.
- [ ] If you're handing this to a teammate, you've walked them through one
agent install and one token rotation live.
**If all of the above are green, the monitor is in production.**
---
## Appendix A — Common Errors
| Symptom | First thing to check |
|------------------------------------------------------|--------------------------------------------------------------------|
| Browser gets `NET::ERR_CERT_AUTHORITY_INVALID` | Caddy didn't finish LE cert issuance. Wait 60s; then `journalctl -u caddy`. |
| Login page loops — correct password rejected | `DASHBOARD_PASSWORD_HASH` mismatch. Regenerate. |
| Card stays offline after agent restart | Wrong token or `unknown_host` (name mismatch). Check agent journal. |
| All agents reconnect every ~30s | Caddy `read_timeout` missing or too short. |
| `/health` returns 503 | Server process up, SQLite path unreadable or wrong permissions. |
| LXC can't bind port 4000 | Another process owns it. `ss -ltnp \| grep 4000`. |
| `mix release` fails with DASHBOARD error | You forgot to set `DASHBOARD_PASSWORD_HASH=placeholder` at build. |
| Agent logs `{:enoent, "pvesh"}` | Agent is running on a non-Proxmox host, or `$PATH` is empty under systemd. |
| Admin "Add host" redirects to `/admin/hosts?host%5Bname%5D=` | Asset bundle didn't ship; `cache_manifest.json` missing → LiveView JS never attaches → native HTML GET submit. Rebuild the release and redeploy. |
## Appendix B — File & Port Cheat Sheet
```
Server LXC
/opt/proxmox-monitor/server/ release tree
/etc/default/proxmox-monitor env secrets, 0600
/etc/systemd/system/proxmox-monitor.service
/etc/caddy/Caddyfile
/var/lib/proxmox-monitor/monitor.db SQLite
/var/backups/proxmox-monitor/ daily backups
tcp 443 (caddy) → tcp 127.0.0.1:4000 (phoenix)
Proxmox host (per agent)
/usr/local/bin/proxmox-monitor-agent
/etc/proxmox-monitor/agent.toml token + intervals, 0600
/etc/systemd/system/proxmox-monitor-agent.service
/var/cache/proxmox-monitor-agent/ Burrito unpack cache
no listening ports
```