docs: SETUP-AND-DEPLOY runbook for phase 5 production rollout
Single top-to-bottom runbook covering preflight, local build, server deploy, first-agent dry run, test tier, full rollout, rollback, and ongoing ops. Each step has a verification command. Ends with a Go/No-Go sign-off list.
# Proxmox Monitor — Setup & Deploy Runbook

> **This document is a runbook, not reference material.** Work from top to bottom.
> Each checkbox is an action. Don't skip verification steps — they exist because
> we've been burned by skipping them.

**Target audience:** the operator who owns the monitor service.

**Total time:** 2–3 hours end-to-end with a small test tier, plus however long
your Proxmox fleet takes to roll out (10–15 min per host).

**Phases of this runbook:**

| § | Purpose | Touch points |
|-----|-----------------------------------------------|-----------------------|
| 1 | **Preflight** — confirm prerequisites | Local only |
| 2 | **Local build** — produce artifacts | Your workstation |
| 3 | **Server deploy** — one-time LXC bring-up | Hypervisor + LXC |
| 4 | **First agent** — prove the pipeline | One Proxmox host |
| 5 | **Test tier** — 2–3 hosts for 24h | Small batch |
| 6 | **Full rollout** — the remaining hosts | Fleet-wide |
| 7 | **Rollback** — when something goes wrong | — |
| 8 | **Ongoing operations** | Upgrades, backups |
| 9 | **Go / No-Go** sign-off | Final gate |

Related docs (reference, not sequential):

- `server/docs/deploy-lxc.md` — deeper LXC detail
- `agent/docs/install.md` — single-host agent install
- `server/docs/Caddyfile.example` — TLS/WSS proxy template
- `proxmox-monitor-konzept.md` — design concept
- `docs/deployment-overview.md` — high-level picture

---

## § 1. Preflight Checklist

### 1.1 Hardware & network

- [ ] **Server LXC** can be provisioned on a Proxmox host in the RZ.
  Minimum: 1 GB RAM, 2 cores, 10 GB disk. Debian 12 template available.
- [ ] **DNS A record** for `monitor.<yourdomain>` points at the public IP of the
  Proxmox host that hosts the LXC. Verify with `dig +short monitor.<yourdomain>`.
- [ ] **Port 443 inbound** to the server LXC's public IP is open (Caddy will
  get Let's Encrypt certs via HTTP-01 and serve on 443).
- [ ] **Outbound HTTPS from every Proxmox host to `monitor.<yourdomain>`** is
  open. Agents connect out; no inbound port is required on Proxmox hosts.
- [ ] You have **SSH root access** to:
  - The hypervisor running the server LXC (for `pct create` / `pct enter`)
  - Every Proxmox host that will run an agent
- [ ] **Docker** is installed and its daemon is running on your build machine
  (`docker --version` should succeed and `docker ps` should not error).
  If not, use a Linux box (even the server LXC itself) as the build host.

### 1.2 Versions

- [ ] Proxmox hosts are **VE 8.3+** with **OpenZFS 2.3+** (check with
  `pveversion` and `zfs --version`). If some hosts are older, either
  upgrade them first or accept that ZFS payloads will be empty on those.

### 1.3 Tools on your workstation

- [ ] Elixir 1.19 + OTP 28 (`elixir --version`)
- [ ] Mix + Hex (`mix local.hex`)
- [ ] SSH + scp
- [ ] `sqlite3` CLI (for smoke-test DB inspection; optional)

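The tool checks above can be run in one shot. This is a sketch of a preflight helper; the tool list is an assumption, trim it to your environment:

```shell
# check_tool NAME: prints "ok NAME" when the tool is on PATH, "MISSING NAME" otherwise.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok $1"
  else
    echo "MISSING $1"
    return 1
  fi
}

# Workstation tools this runbook relies on:
for t in elixir mix ssh scp sqlite3 docker; do
  check_tool "$t" || true   # keep going so you see the full list
done
```

Any `MISSING` line means the matching § 1.3 checkbox isn't actually done.
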
### 1.4 Secrets plan

Write down (don't commit) the three secrets you'll need. Keep them in a password
manager.

| Secret | Generated how |
|--------------------------------|--------------------------------------------------------------|
| Dashboard password (plaintext) | You choose it. Use a strong random string. |
| `SECRET_KEY_BASE` | `cd server && mix phx.gen.secret` (64-char random string) |
| Agent tokens | Created by the admin UI, one per host, revealed once. |

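A strong dashboard password can come straight from the kernel's entropy pool; this is one sketch of a generator (any password manager's generator works just as well):

```shell
# 32 random hex characters (16 bytes of entropy). Paste into your password
# manager, then feed it to Argon2 in § 2.3.
PW=$(od -An -tx1 -N16 /dev/urandom | tr -d ' \n')
echo "$PW"
```
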
---

## § 2. Local Build

Do this once, on your build machine. Re-run for every upgrade.

### 2.1 Clone the repo

- [ ] `git clone <repo>` if you don't already have it.
- [ ] `cd proxmox_monitor`
- [ ] `git pull --ff-only origin main`

### 2.2 Confirm tests are green

- [ ] `cd server && mix deps.get && mix test`
- [ ] `cd ../agent && mix deps.get && mix test`

**Expected:** both suites pass. If any test fails, **stop here**. Fix or cherry-pick
a known-good commit before continuing.

### 2.3 Build the server release

- [ ] Generate `DASHBOARD_PASSWORD_HASH` once:

```bash
cd server
mix run -e 'IO.puts(Argon2.hash_pwd_salt("<your-dashboard-password>"))'
```

Copy the `$argon2id$...` line into your password manager. You'll paste it
into the LXC env file later.

- [ ] Build the release (the placeholder is only needed to satisfy runtime.exs
  during build; the real value is set on the LXC at start time):

```bash
MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' mix release --overwrite
```

**Expected:** `_build/prod/rel/server/` contains `bin/server`, `bin/migrate`,
`erts-*`, `lib/`, `releases/`.

- [ ] Package the release:

```bash
tar -czf /tmp/server_release.tgz -C _build/prod/rel server
ls -lh /tmp/server_release.tgz
```

**Expected:** ~30–60 MB tarball.

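The tarball's layout can also be checked mechanically before shipping it. A sketch (`tar_has` is a name invented here; the expected `server/bin/server` path comes from the release tree above):

```shell
# tar_has ARCHIVE ENTRY: succeed if ARCHIVE contains an entry starting with ENTRY.
tar_has() { tar -tzf "$1" | grep "^$2" >/dev/null; }

# Usage against the real artifact:
#   tar_has /tmp/server_release.tgz server/bin/server && echo "release layout ok"
```
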
### 2.4 Build the agent binaries

Requires Docker running locally (or do this on a Linux host).

- [ ] `cd ../agent`
- [ ] `./scripts/build-linux.sh`

**Expected output (~5–10 min first run, much faster with Docker layer cache on
subsequent runs):**

```
Binaries written to /path/to/agent/dist:
  proxmox-monitor-agent_linux_amd64
  proxmox-monitor-agent_linux_arm64
```

- [ ] Sanity check:

```bash
file dist/proxmox-monitor-agent_linux_amd64 | grep -E "ELF 64-bit"
```

**Expected:** `ELF 64-bit LSB executable, x86-64`.

If Docker isn't available on your workstation: scp the `agent/` directory onto
the server LXC after § 3, run `./scripts/build-linux.sh` there, then scp the
binaries back. The LXC doesn't need Docker at runtime.

---

## § 3. Server Deployment

One-time. Subsequent upgrades use § 8.1.

### 3.1 Create the LXC (on the hypervisor)

- [ ] SSH to the hypervisor and run:

```bash
pct create 200 \
  /var/lib/vz/template/cache/debian-12-standard_12.7-1_amd64.tar.zst \
  --hostname proxmox-monitor \
  --memory 1024 --cores 2 \
  --rootfs local-zfs:10 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --unprivileged 1 --features nesting=0 --onboot 1
pct start 200
```

Adjust the container ID (`200`), bridge, and rootfs to match your environment.

- [ ] Get the LXC's IP:

```bash
pct exec 200 -- ip -4 addr show eth0 | grep -Po 'inet \K[\d.]+'
```

Put this IP in `LXC_IP` for the rest of this section (use a shell variable,
not a literal in every command — typos here cost hours).

### 3.2 Base packages inside the LXC

```bash
pct enter 200
```

- [ ] Install Caddy + SQLite + tools:

```bash
apt-get update
apt-get install -y ca-certificates curl gnupg debian-keyring debian-archive-keyring apt-transport-https sqlite3

# Caddy's apt repo
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | \
  gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  > /etc/apt/sources.list.d/caddy-stable.list
apt-get update
apt-get install -y caddy

caddy version  # sanity
```

- [ ] Exit the container: `exit`.

### 3.3 Upload the release

On your workstation:

- [ ] `scp /tmp/server_release.tgz root@<LXC_IP>:/tmp/`

Back inside the LXC (`pct enter 200`):

- [ ] Unpack:

```bash
mkdir -p /opt/proxmox-monitor
tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
ls /opt/proxmox-monitor/server/bin/
# Expected: server, migrate, server.bat, migrate.bat
```

### 3.4 Data directory + environment file

- [ ] Create the data dir:

```bash
install -d -m 0700 /var/lib/proxmox-monitor
install -d -m 0755 /etc/default
```

- [ ] Create `/etc/default/proxmox-monitor`. Substitute the values you
  generated in § 2.3:

```bash
cat > /etc/default/proxmox-monitor <<'EOF'
DATABASE_PATH=/var/lib/proxmox-monitor/monitor.db
SECRET_KEY_BASE=<paste-output-of-mix-phx.gen.secret>
DASHBOARD_PASSWORD_HASH=<paste-$argon2id$-hash-from-2.3>
PHX_SERVER=true
PHX_HOST=monitor.example.com
PORT=4000
EOF
chmod 0600 /etc/default/proxmox-monitor
```

**Gotchas:**

- `DASHBOARD_PASSWORD_HASH` contains `$` characters. The heredoc above quotes
  its delimiter (`<<'EOF'`), so nothing is expanded. If you use an unquoted
  heredoc or double quotes instead, the shell will silently eat each `$...`
  unless you escape it as `\$`.
- No spaces around `=`.
- No quotes around values in the file itself.

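The `$`-eating failure mode is easy to reproduce; this sketch shows why the quoted delimiter matters (the hash below is a made-up example, not a real Argon2 digest):

```shell
# BAD: unquoted delimiter. $argon2id, $v, etc. are treated as (empty) shell
# variables and the hash is silently mangled.
cat > /tmp/env-bad <<EOF
DASHBOARD_PASSWORD_HASH=$argon2id$v=19$m=65536,t=3,p=4$c2FsdHNhbHQ$aGFzaGhhc2g
EOF

# GOOD: quoted delimiter. Every character is kept literal.
cat > /tmp/env-good <<'EOF'
DASHBOARD_PASSWORD_HASH=$argon2id$v=19$m=65536,t=3,p=4$c2FsdHNhbHQ$aGFzaGhhc2g
EOF

grep argon2id /tmp/env-good          # hash intact
grep argon2id /tmp/env-bad || echo "hash was mangled"
```
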
### 3.5 Run the first migration

- [ ] Apply migrations:

```bash
set -a; . /etc/default/proxmox-monitor; set +a
/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
```

**Expected output:**

```
[info] == Running 20260421200116 Server.Repo.Migrations.CreateHosts.change/0 forward
[info] create table hosts
...
[info] == Migrated 20260421200116 in 0.0s
[info] == Running 20260421202512 Server.Repo.Migrations.CreateMetrics.change/0 forward
[info] create table metrics
...
[info] == Migrated 20260421202512 in 0.0s
```

- [ ] Verify the DB exists:

```bash
ls -la /var/lib/proxmox-monitor/monitor.db
sqlite3 /var/lib/proxmox-monitor/monitor.db '.tables'
# Expected: hosts metrics schema_migrations
```

### 3.6 systemd unit for the server

- [ ] Write the unit:

```bash
cat > /etc/systemd/system/proxmox-monitor.service <<'EOF'
[Unit]
Description=Proxmox Monitor Server
After=network-online.target
Wants=network-online.target

[Service]
Type=exec
EnvironmentFile=/etc/default/proxmox-monitor
ExecStartPre=/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
ExecStart=/opt/proxmox-monitor/server/bin/server start
ExecStop=/opt/proxmox-monitor/server/bin/server stop
Restart=always
RestartSec=5
User=root

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now proxmox-monitor
```

- [ ] Watch it come up:

```bash
journalctl -u proxmox-monitor -f
# wait for: "Running ServerWeb.Endpoint with Bandit"
# then Ctrl+C
```

- [ ] Smoke-test from inside the LXC:

```bash
curl -s http://127.0.0.1:4000/health
# Expected: {"db":"ok","status":"ok","version":"0.1.0"}
```

**If you see anything other than `status:ok`**: stop. Check `journalctl -u
proxmox-monitor -n 100`. Common causes: missing env var (check `/etc/default`),
DB path not writable.

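For scripted checks later (cron, CI, the uptime monitor in § 8.6), the payload test can live in a tiny helper. A sketch, assuming the `/health` shape shown above (`health_ok` is a name invented here):

```shell
# health_ok JSON: succeed only when both status and db report "ok".
health_ok() {
  printf '%s' "$1" | grep '"status":"ok"' >/dev/null &&
  printf '%s' "$1" | grep '"db":"ok"' >/dev/null
}

# Usage on the live endpoint:
#   health_ok "$(curl -s http://127.0.0.1:4000/health)" && echo healthy
```
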
### 3.7 Caddy TLS + reverse proxy

- [ ] Copy the template and edit:

```bash
cp /opt/proxmox-monitor/server/lib/server-0.1.0/priv/docs/Caddyfile.example \
  /etc/caddy/Caddyfile \
  2>/dev/null || \
scp root@<YOUR-WORKSTATION>:proxmox_monitor/server/docs/Caddyfile.example \
  /etc/caddy/Caddyfile
# (The first form only works if you bundled docs into the release; the second
# pulls fresh from your checkout.)

ACTUAL_HOST=monitor.example.com   # set this to your real hostname
sed -i "s/monitor.example.com/$ACTUAL_HOST/g" /etc/caddy/Caddyfile
caddy validate --config /etc/caddy/Caddyfile
```

- [ ] Reload Caddy:

```bash
systemctl reload caddy
journalctl -u caddy -n 30
# Expected: "certificate obtained successfully" from Let's Encrypt
# (only on first reload after DNS is set correctly)
```

- [ ] Verify from the public internet:

```bash
# From any outside machine:
curl -s https://monitor.example.com/health
# Expected: {"db":"ok","status":"ok","version":"0.1.0"}
```

**If this fails:**

- `curl -vI` to see where it stops. Name resolution? TCP? TLS?
- `dig +short monitor.example.com` — does it point to the expected IP?
- Check the hypervisor's firewall / any cloud-level firewall for port 443.

### 3.8 Browser smoke-test

- [ ] Open `https://monitor.example.com/` in a browser.
- [ ] Confirm: redirect to `/login`.
- [ ] Log in with your dashboard password.
- [ ] Confirm: empty overview page with "No hosts registered yet."

**If login fails** with "Incorrect password": your `DASHBOARD_PASSWORD_HASH` env
doesn't match the password you typed. Re-generate and re-deploy § 3.4.

---

## § 4. First Agent — Dry Run

Pick one Proxmox host. This run will validate the whole pipeline before you
touch more hosts.

### 4.1 Register the host in the dashboard

- [ ] Browser → `https://monitor.example.com/admin/hosts`.
- [ ] Enter the short name (`pve-host-01` or whatever matches your
  convention). Click **Add**.
- [ ] The page reveals a token. **Copy it now** — it is shown only once.

### 4.2 Copy the binary + systemd unit to the host

From your workstation (substitute `<HOST>`):

```bash
export HOST=<proxmox-host-ip-or-name>

scp agent/dist/proxmox-monitor-agent_linux_amd64 \
  root@$HOST:/usr/local/bin/proxmox-monitor-agent
ssh root@$HOST 'chmod 0755 /usr/local/bin/proxmox-monitor-agent'

scp agent/rel/proxmox-monitor-agent.service \
  root@$HOST:/etc/systemd/system/
```

### 4.3 Write the agent config

SSH to the host (`ssh root@$HOST`) and:

```bash
install -d -m 0700 /etc/proxmox-monitor
install -d -m 0700 /var/cache/proxmox-monitor-agent

cat > /etc/proxmox-monitor/agent.toml <<'EOF'
server_url = "wss://monitor.example.com/socket/websocket"
token = "<paste-token-from-dashboard>"
host_id = "pve-host-01"

[intervals]
fast_seconds = 30
medium_seconds = 300
slow_seconds = 1800
EOF

chmod 0600 /etc/proxmox-monitor/agent.toml
```

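Paste errors in this file are a common first-run failure. A quick sanity check before enabling the unit is a sketch like the following; the key list mirrors the config above, and `agent_cfg_ok` is a name invented here:

```shell
# agent_cfg_ok FILE: succeed when the three required keys exist and the
# token placeholder has actually been replaced.
agent_cfg_ok() {
  grep '^server_url = "wss://' "$1" >/dev/null &&
  grep '^token = "' "$1" >/dev/null &&
  grep '^host_id = "' "$1" >/dev/null &&
  ! grep '<paste-token' "$1" >/dev/null
}

# Usage:
#   agent_cfg_ok /etc/proxmox-monitor/agent.toml && echo "agent.toml looks sane"
```
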
### 4.4 Enable the agent

Still on the Proxmox host:

```bash
systemctl daemon-reload
systemctl enable --now proxmox-monitor-agent
```

- [ ] Watch the log:

```bash
journalctl -u proxmox-monitor-agent -f
```

**Expected within 10s:**

```
agent: starting with host_id=pve-host-01
reporter: connected, joining host:pve-host-01
reporter: joined host:pve-host-01
```

Ctrl+C to stop tailing.

### 4.5 Confirm in the dashboard

- [ ] Reload `https://monitor.example.com/` — the card for `pve-host-01`
  should show **online**, status green, with Load/RAM/Pools/VMs populated.
- [ ] Click the card. Verify each section (ZFS pools, snapshots, storage,
  VMs) has real data.

### 4.6 Stop-and-restart verification

Verify the offline flip works as designed.

- [ ] On the Proxmox host: `systemctl stop proxmox-monitor-agent`.
- [ ] Dashboard card should switch to **offline** (grey border) within ~1s.
- [ ] `systemctl start proxmox-monitor-agent` — card flips back to green
  within ~30s.

**If the card stays green when the agent is stopped**: the Channel terminate
callback didn't fire, which usually means Caddy's `read_timeout` is set too
short or absent. Check `/etc/caddy/Caddyfile` contains `read_timeout 90s`.

### 4.7 Token rotation sanity-check

- [ ] In the admin UI, click **Rotate** on the host. Confirm.
- [ ] On the Proxmox host, `journalctl -u proxmox-monitor-agent -f` —
  within ~30s the agent should log `reporter: disconnected`, then begin
  reconnecting and fail with `invalid_token`.
- [ ] Update `/etc/proxmox-monitor/agent.toml` with the new token and
  `systemctl restart proxmox-monitor-agent`. Verify green again.

---

## § 5. Test Tier (2–3 Hosts)

Pick 2–3 Proxmox hosts that are either non-critical, or critical but with
existing independent monitoring you can fall back on.

### 5.1 Roll out

- [ ] For each host, repeat § 4.1–4.5. Use distinct `host_id` values.

### 5.2 Observe for 24 hours

- [ ] Leave the test tier running overnight.
- [ ] Next morning, verify all test-tier cards still show **online**.
- [ ] Check `journalctl -u proxmox-monitor` on the server:
  - No `[error]` lines repeating.
  - `retention: pruned N stale samples` appears ≥ 1 time (retention fires
    hourly; after 48h it starts deleting).

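The error-line check is easy to script over piped journal output (a sketch; `count_errors` is a name invented here):

```shell
# count_errors: count "[error]" lines on stdin (prints 0 when clean).
count_errors() { grep -c '\[error\]' || true; }

# Usage:
#   journalctl -u proxmox-monitor --since "24 hours ago" --no-pager | count_errors
```
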
### 5.3 Restart test

Reboot one of the Proxmox hosts. Watch the dashboard:

- [ ] Card goes offline during the reboot.
- [ ] Card flips back to online within a minute of the host coming back,
  without you touching anything.

### 5.4 Server reboot test

- [ ] On the server LXC: `systemctl restart proxmox-monitor`.
- [ ] All agents should briefly flip to offline, then back to online within
  ~30s as their Slipstream clients reconnect.
- [ ] No agents should end up stuck offline requiring manual restart.

**If any agent stays offline**: its Slipstream reconnect backoff may need
investigation. `journalctl -u proxmox-monitor-agent -f` on the affected host.

### 5.5 Go / No-Go gate for full rollout

Do NOT proceed to § 6 until **all** of these are true for 24h:

- [ ] All test-tier hosts show **online** continuously.
- [ ] No repeating error lines in server logs.
- [ ] Retention has pruned ≥ 1 row.
- [ ] Rotate + restart behavior works as expected.
- [ ] Dashboard is responsive (<1s LiveView updates).

---

## § 6. Full Rollout

For each remaining Proxmox host:

1. Admin UI → register host, copy token.
2. `scp` binary + systemd unit.
3. Write `/etc/proxmox-monitor/agent.toml`.
4. `systemctl enable --now proxmox-monitor-agent`.
5. Verify in dashboard.

### 6.1 Loop shortcut

Once you've done 3–4 hosts by hand and are confident, you can batch. The
tricky part is that each host needs a unique token, so the admin-UI step
still has to be interactive. One workflow:

```bash
# On your workstation:
for HOST in pve-host-04 pve-host-05 pve-host-06; do
  echo ">>> Setting up $HOST"
  echo "Register in the admin UI, paste token here, then press Enter:"
  read -s TOKEN
  scp agent/dist/proxmox-monitor-agent_linux_amd64 \
    root@$HOST:/usr/local/bin/proxmox-monitor-agent
  scp agent/rel/proxmox-monitor-agent.service \
    root@$HOST:/etc/systemd/system/
  ssh root@$HOST "chmod 0755 /usr/local/bin/proxmox-monitor-agent && \
    install -d -m 0700 /etc/proxmox-monitor /var/cache/proxmox-monitor-agent && \
    cat > /etc/proxmox-monitor/agent.toml <<EOF
server_url = \"wss://monitor.example.com/socket/websocket\"
token = \"$TOKEN\"
host_id = \"$HOST\"

[intervals]
fast_seconds = 30
medium_seconds = 300
slow_seconds = 1800
EOF
chmod 0600 /etc/proxmox-monitor/agent.toml && \
systemctl daemon-reload && \
systemctl enable --now proxmox-monitor-agent"
  echo ">>> $HOST done."
done
```

### 6.2 Validation at scale

After every batch of ~5 hosts:

- [ ] Open `/` and confirm the card count matches how many agents you've
  configured.
- [ ] Sort/filter by offline — should be empty.
- [ ] Click a random card and confirm real payload data.

### 6.3 Completion check

- [ ] Overview shows all N hosts.
- [ ] None are in `offline` or `critical` state (unless that's actually true
  of the host, e.g. a real DEGRADED pool).
- [ ] VM search returns hits for a well-known VM name.

---

## § 7. Rollback

### 7.1 Disable a single agent

```bash
ssh root@$HOST 'systemctl disable --now proxmox-monitor-agent'
```

Dashboard card flips to offline. Delete from `/admin/hosts` if you want it
gone entirely.

### 7.2 Take the whole service down

```bash
# Inside the server LXC
systemctl stop proxmox-monitor
systemctl stop caddy
```

Agents will keep trying to reconnect every few seconds (harmless). Dashboard
is gone.

### 7.3 Roll back to a previous server release

If a new version misbehaves:

```bash
# On the LXC — assuming you kept the previous /tmp/server_release_PREV.tgz
systemctl stop proxmox-monitor
rm -rf /opt/proxmox-monitor/server
tar -xzf /tmp/server_release_PREV.tgz -C /opt/proxmox-monitor
systemctl start proxmox-monitor
```

Your SQLite DB has not been touched — rollbacks are cheap as long as the
migration list didn't change between versions.

### 7.4 DB restore from backup

See § 8.4 for creating backups. To restore:

```bash
systemctl stop proxmox-monitor
cp /var/backups/proxmox-monitor/monitor-YYYY-MM-DD.db /var/lib/proxmox-monitor/monitor.db
chown root:root /var/lib/proxmox-monitor/monitor.db
systemctl start proxmox-monitor
```

Host tokens in the restored DB are still valid. Metrics from after the backup
are lost — that's 48h max given the retention policy.

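Picking the newest backup file can be scripted rather than read off `ls` output by hand (a sketch; `newest_backup` is a name invented here):

```shell
# newest_backup DIR: print the most recently modified monitor-*.db in DIR.
newest_backup() { ls -1t "$1"/monitor-*.db 2>/dev/null | head -1; }

# Usage (server stopped, as above):
#   cp "$(newest_backup /var/backups/proxmox-monitor)" /var/lib/proxmox-monitor/monitor.db
```
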
---

## § 8. Ongoing Operations

### 8.1 Upgrading the server

Work from the repo on your workstation:

```bash
# 1. Build
cd server
MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' mix release --overwrite
tar -czf /tmp/server_release.tgz -C _build/prod/rel server

# 2. Upload. First keep the tarball already on the LXC for rollback (§ 7.3)
ssh root@<LXC> 'cp /tmp/server_release.tgz /tmp/server_release_PREV.tgz 2>/dev/null || true'
scp /tmp/server_release.tgz root@<LXC>:/tmp/

# 3. Swap on the LXC
ssh root@<LXC> '
  systemctl stop proxmox-monitor
  mv /opt/proxmox-monitor/server /opt/proxmox-monitor/server.old
  tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
  systemctl start proxmox-monitor   # ExecStartPre runs migrate
  sleep 5
  systemctl status proxmox-monitor --no-pager
'
```

Verify `/health` responds before deleting the `.old` copy.

### 8.2 Upgrading an agent

```bash
scp agent/dist/proxmox-monitor-agent_linux_amd64 \
  root@$HOST:/usr/local/bin/proxmox-monitor-agent.new
ssh root@$HOST '
  mv /usr/local/bin/proxmox-monitor-agent{.new,}
  systemctl restart proxmox-monitor-agent
'
```

### 8.3 Token rotation (leak or routine)

1. Dashboard → Admin → **Rotate** on the affected host.
2. Copy the new token.
3. SSH to the host: update `/etc/proxmox-monitor/agent.toml`, `systemctl
   restart proxmox-monitor-agent`.
4. Verify card flips back to green.

### 8.4 SQLite backup (daily, keep 30 days)

The DB is small. SQLite's online backup is safe while the server runs.

Install a cron job inside the LXC. Note that a cron entry must be a single
line; cron does not support backslash continuations:

```bash
cat > /etc/cron.d/proxmox-monitor-backup <<'EOF'
# Minute Hour Dom Month Dow User Command
30 3 * * * root install -d -m 0700 /var/backups/proxmox-monitor && sqlite3 /var/lib/proxmox-monitor/monitor.db ".backup /var/backups/proxmox-monitor/monitor-$(date +\%Y-\%m-\%d).db" && find /var/backups/proxmox-monitor -name 'monitor-*.db' -mtime +30 -delete
EOF
```

Keeps 30 days of daily snapshots.

### 8.5 Log inspection

Server:

```bash
# Live
journalctl -u proxmox-monitor -f

# Last 500 lines
journalctl -u proxmox-monitor -n 500 --no-pager

# Errors only
journalctl -u proxmox-monitor -p err --no-pager
```

Agents (from the server, for any host):

```bash
ssh root@$HOST 'journalctl -u proxmox-monitor-agent -n 200 --no-pager'
```

### 8.6 External uptime monitoring

Point your uptime service (UptimeRobot, BetterUptime, your-own-Prometheus,
etc.) at:

```
https://monitor.example.com/health
```

Expect `{"status":"ok","db":"ok","version":"0.1.0"}` with HTTP 200. Alert on
anything else.

### 8.7 Changing the dashboard password

1. On your workstation:

```bash
cd server
mix run -e 'IO.puts(Argon2.hash_pwd_salt("<new-password>"))'
```

2. On the server LXC: edit `/etc/default/proxmox-monitor`, replace
   `DASHBOARD_PASSWORD_HASH`, `systemctl restart proxmox-monitor`.
3. All existing sessions are invalidated on next request.

---

## § 9. Go / No-Go Sign-Off

Tick each box before declaring the rollout complete.

### Production readiness

- [ ] `https://monitor.example.com/health` returns 200 / `status:ok`.
- [ ] External uptime monitor is configured and reporting green.
- [ ] All intended Proxmox hosts appear on the overview and show **online**.
- [ ] At least one full 48h retention cycle has completed (retention log
  shows pruning).
- [ ] SQLite backup cron is installed and yesterday's `.db` file exists.
- [ ] You have rolled back once on purpose (drill), proving § 7 works.

### Access & secrets hygiene

- [ ] Dashboard password is in a password manager, not a text file.
- [ ] `SECRET_KEY_BASE` is in a password manager.
- [ ] `/etc/default/proxmox-monitor` is `0600 root:root`.
- [ ] `/etc/proxmox-monitor/agent.toml` is `0600 root:root` on every host.
- [ ] You know how to rotate an agent token in < 2 minutes.

### Documentation handoff

- [ ] This runbook's checkboxes are all green for the current rollout.
- [ ] If you're handing this to a teammate, you've walked them through one
  agent install and one token rotation live.

**If all of the above are green, the monitor is in production.**

---

## Appendix A — Common Errors

| Symptom | First thing to check |
|------------------------------------------------------|--------------------------------------------------------------------|
| Browser gets `NET::ERR_CERT_AUTHORITY_INVALID` | Caddy didn't finish LE cert issuance. Wait 60s; then `journalctl -u caddy`. |
| Login page loops — correct password rejected | `DASHBOARD_PASSWORD_HASH` mismatch. Regenerate. |
| Card stays offline after agent restart | Wrong token or `unknown_host` (name mismatch). Check agent journal. |
| All agents reconnect every ~30s | Caddy `read_timeout` missing or too short. |
| `/health` returns 503 | Server process up, SQLite path unreadable or wrong permissions. |
| LXC can't bind port 4000 | Another process owns it. `ss -ltnp \| grep 4000`. |
| `mix release` fails with DASHBOARD error | You forgot to set `DASHBOARD_PASSWORD_HASH=placeholder` at build. |
| Agent logs `{:enoent, "pvesh"}` | Agent is running on a non-Proxmox host, or `$PATH` is empty under systemd. |

## Appendix B — File & Port Cheat Sheet

```
Server LXC
  /opt/proxmox-monitor/server/                 release tree
  /etc/default/proxmox-monitor                 env secrets, 0600
  /etc/systemd/system/proxmox-monitor.service
  /etc/caddy/Caddyfile
  /var/lib/proxmox-monitor/monitor.db          SQLite
  /var/backups/proxmox-monitor/                daily backups
  tcp 443 (caddy) → tcp 127.0.0.1:4000 (phoenix)

Proxmox host (per agent)
  /usr/local/bin/proxmox-monitor-agent
  /etc/proxmox-monitor/agent.toml              token + intervals, 0600
  /etc/systemd/system/proxmox-monitor-agent.service
  /var/cache/proxmox-monitor-agent/            Burrito unpack cache
  no listening ports
```