diff --git a/SETUP-AND-DEPLOY-slides.html b/SETUP-AND-DEPLOY-slides.html new file mode 100644 index 0000000..a10f2a1 --- /dev/null +++ b/SETUP-AND-DEPLOY-slides.html @@ -0,0 +1,1028 @@ + + + + + +Proxmox Monitor — Setup & Deploy + + + + +
+ Proxmox Monitor · Setup & Deploy + 1 / 1 +
+
+ + +
+ + +
+ Runbook +

Proxmox Monitor

+
Setup & Deployment — Production Rollout
+

+ Agent-server monitoring for Proxmox hosts. Elixir/OTP backend, Burrito-packaged agents, + Phoenix LiveView dashboard. This deck walks you from a clean environment to 20 hosts reporting, + in order, with verification at every step. +

+
+ Reference: SETUP-AND-DEPLOY.md · ~2–3h end-to-end + host rollout time +
+
+ + +
+ What you're deploying +

Architecture

+

Two artifacts, independent pipelines, one dashboard

+
+ ┌─────────────────────────┐ + │ Server (LXC in RZ) │ + agents ──WSS──│ · Phoenix release │ + │ · SQLite │ + │ · Caddy (TLS) │ + └─────────────────────────┘ + ▲ + │ ssh + ┌─────────────────────────┐ + │ Operator workstation │ + │ · builds server │ + │ · builds agent binary │ + └─────────────────────────┘ + │ scp + ▼ + ┌─────────────────────────┐ + │ Proxmox host (1 of N) │ + │ · Burrito binary │ + │ · systemd unit │ + └─────────────────────────┘ +
+

Agents initiate outbound WSS — no inbound ports on Proxmox hosts.

+
+ + +
+ Phases +

Roadmap for this deck

+
    +
  1. Preflight — confirm prerequisites
  2. Local build — produce the two artifacts
  3. Server deploy — one-time LXC bring-up
  4. First agent — prove the pipeline end-to-end
  5. Test tier — 2–3 hosts for 24h
  6. Full rollout — the remaining fleet
  7. Rollback — because things go wrong
  8. Ongoing operations — upgrades, backups, rotation
  9. Go / No-Go — final sign-off
+
+ + +
+ § 1 Preflight +

Hardware & network

+
+
+
Server LXC
+
Debian 12 · 1 GB RAM · 2 cores · 10 GB
+

Unprivileged. Covers >20 agents comfortably.

+
+
+
DNS
+
A record → public IP
+

Verify: dig +short monitor.example.com

+
+
+
Inbound
+
TCP 443 → server LXC
+

Caddy handles Let's Encrypt via HTTP-01.

+
+
+
Outbound
+
HTTPS from every Proxmox host
+

No inbound port required on hosts.

+
+
+

SSH root access: hypervisor + every Proxmox host.

+
+ + +
+ § 1 Preflight +

Versions & tools

+
+
+

Proxmox fleet

+
    +
  • VE 8.3+
  • OpenZFS 2.3+ (for -j JSON output)
  • Older hosts will report empty ZFS payloads
+
+
+

Build machine

+
    +
  • Elixir 1.19 + OTP 28
  • Mix + Hex
  • Docker daemon running (for Linux binaries)
  • SSH, scp, sqlite3 (optional)
+
+
+
+ No Docker? Run ./scripts/build-linux.sh on the server LXC itself instead. +
+
+ + +
+ § 1 Preflight +

Secrets plan

+

Three values — keep in a password manager, never in git

+ + + + + + + + + + + + + + + + +
SecretHow to generate
DASHBOARD_PASSWORD_HASHmix run -e 'IO.puts(Argon2.hash_pwd_salt("<pw>"))'
SECRET_KEY_BASEmix phx.gen.secret (64-byte base64)
Per-agent tokensAdmin UI → Add host reveals token once
+
+ Tokens are shown once. Paste into your password manager before clicking away. +
+
+ + +
+ § 2 Local build +

Tests first

+

If either suite is red, stop

+
cd server && mix deps.get && mix test
+cd ../agent && mix deps.get && mix test
+
+
+
Server
+
58 tests, 0 failures
+
+
+
Agent
+
23 tests, 0 failures
+
+
+

Never build a release from a branch with failing tests.

+
+ + +
+ § 2 Local build +

Hash the password

+
cd server
+mix run -e 'IO.puts(Argon2.hash_pwd_salt("your-password"))'
+

Output looks like:

+
$argon2id$v=19$m=65536,t=3,p=4$dSB9...$x0OQ...
+
+ Copy the whole $argon2id$... string into your password manager. + The plaintext password never leaves your head / password manager. +
+
+ + +
+ § 2 Local build +

Server release

+
MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' \
+  mix release --overwrite
+
+tar -czf /tmp/server_release.tgz -C _build/prod/rel server
+ls -lh /tmp/server_release.tgz
+

Expected: ~30–60 MB tarball.

+
+ The placeholder hash only needs to exist so config/runtime.exs + accepts it. The real hash is supplied on the LXC at start time. +
+
+ + +
+ § 2 Local build +

Agent binaries

+
cd ../agent
+./scripts/build-linux.sh
+

Expected output:

+
Binaries written to /.../agent/dist:
+  proxmox-monitor-agent_linux_amd64
+  proxmox-monitor-agent_linux_arm64
+

Sanity check:

+
file dist/proxmox-monitor-agent_linux_amd64 | grep 'ELF 64-bit'
+

First build: 5–10 min. Subsequent builds: seconds (Docker layer cache).

+
+ + +
+ § 3 Server deploy +

Create the LXC

+

On the hypervisor:

+
pct create 200 \
+  /var/lib/vz/template/cache/debian-12-standard_12.7-1_amd64.tar.zst \
+  --hostname proxmox-monitor \
+  --memory 1024 --cores 2 \
+  --rootfs local-zfs:10 \
+  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
+  --unprivileged 1 --features nesting=0 --onboot 1
+
+pct start 200
+pct exec 200 -- ip -4 addr show eth0 | grep -Po 'inet \K[\d.]+'
+

Save the IP as LXC_IP. Typos here cost hours.
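To avoid the typo risk entirely, capture the IP into a variable instead of copying it by eye. On the hypervisor that would be `LXC_IP=$(pct exec 200 -- ip -4 addr show eth0 | grep -Po 'inet \K[\d.]+')`; the sketch below demonstrates the same extraction against canned `ip` output so the grep is visible:

```shell
# Extract the first IPv4 address from `ip -4 addr show` output.
# Canned sample line; on the hypervisor, pipe `pct exec 200 -- ip -4 addr show eth0` instead.
sample='    inet 192.168.1.50/24 brd 192.168.1.255 scope global eth0'
LXC_IP=$(printf '%s\n' "$sample" | grep -Po 'inet \K[\d.]+')
echo "$LXC_IP"   # 192.168.1.50
```

`\K` (GNU grep with `-P`) discards the `inet ` prefix from the match, so only the address itself is printed.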

+
+ + +
+ § 3 Server deploy +

Base packages

+

pct enter 200 then:

+
apt-get update
+apt-get install -y ca-certificates curl gnupg \
+  debian-keyring debian-archive-keyring apt-transport-https sqlite3
+
+curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | \
+  gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
+curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
+  > /etc/apt/sources.list.d/caddy-stable.list
+
+apt-get update && apt-get install -y caddy
+caddy version
+
+ + +
+ § 3 Server deploy +

Upload and extract the release

+

From your workstation:

+
scp /tmp/server_release.tgz root@$LXC_IP:/tmp/
+

Inside the LXC:

+
mkdir -p /opt/proxmox-monitor
+tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
+ls /opt/proxmox-monitor/server/bin/
+# server  migrate  server.bat  migrate.bat
+
+ + +
+ § 3 Server deploy +

Environment file

+
install -d -m 0700 /var/lib/proxmox-monitor
+
+cat > /etc/default/proxmox-monitor <<'EOF'
+DATABASE_PATH=/var/lib/proxmox-monitor/monitor.db
+SECRET_KEY_BASE=<paste-mix-phx.gen.secret-output>
+DASHBOARD_PASSWORD_HASH=<paste-$argon2id$-hash>
+PHX_SERVER=true
+PHX_HOST=monitor.example.com
+PORT=4000
+EOF
+chmod 0600 /etc/default/proxmox-monitor
+
+ The quoted heredoc delimiter (<<'EOF') matters. With an unquoted delimiter the shell + expands the $ segments of the Argon2 hash into empty strings.
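A quick way to convince yourself the hash survived intact, sketched on a scratch file (in production, grep /etc/default/proxmox-monitor the same way):

```shell
# With the quoted 'EOF' delimiter the $ signs survive verbatim;
# an unquoted EOF would have expanded them away.
cat > /tmp/pm-env-check <<'EOF'
DASHBOARD_PASSWORD_HASH=$argon2id$v=19$m=65536,t=3,p=4$abc$def
EOF
grep -c '\$argon2id\$' /tmp/pm-env-check   # expect: 1
```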
+
+ + +
+ § 3 Server deploy +

Migrate & systemd

+
set -a; . /etc/default/proxmox-monitor; set +a
+/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
+
+sqlite3 /var/lib/proxmox-monitor/monitor.db '.tables'
+# hosts  metrics  schema_migrations
+

Then install the systemd unit (see runbook §3.6) with:

+
ExecStartPre=/opt/proxmox-monitor/server/bin/server eval 'Server.Release.migrate()'
+ExecStart=/opt/proxmox-monitor/server/bin/server start
+Restart=always
+RestartSec=5
+
systemctl daemon-reload && systemctl enable --now proxmox-monitor
+
+ + +
+ § 3 Server deploy +

Caddy: TLS + WSS reverse-proxy

+
monitor.example.com {
+    reverse_proxy 127.0.0.1:4000 {
+        header_up X-Forwarded-Proto {scheme}
+        header_up X-Forwarded-For {remote_host}
+        transport http {
+            read_timeout 90s
+            dial_timeout 10s
+        }
+    }
+}
+
+ read_timeout 90s is critical. Without it, Caddy tears down every agent's + WebSocket roughly every 30s and hosts never settle into a steady online state on the dashboard.
+
caddy validate --config /etc/caddy/Caddyfile
+systemctl reload caddy
+
+ + +
+ § 3 Server deploy +

Server smoke test

+

From anywhere on the internet:

+
curl -s https://monitor.example.com/health
+

Expected:

+
{"db":"ok","status":"ok","version":"0.1.0"}
+

Then browser:

+ +
+ Login loops on "Incorrect password"? DASHBOARD_PASSWORD_HASH was not pasted + correctly. Re-generate and redeploy §3.4. +
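A frequent cause is a truncated paste. This rough shape check (a sketch; it validates the string's structure only, never the password) catches a hash that lost its tail:

```shell
# A complete argon2id hash has $-separated fields: $argon2id$v=..$m=..$salt$digest.
# Example value; substitute the hash you actually stored.
HASH='$argon2id$v=19$m=65536,t=3,p=4$dSB9abc$x0OQdef'
case "$HASH" in
  '$argon2id$v='*'$m='*'$'*'$'*) echo format-ok ;;
  *)                             echo format-bad ;;
esac
```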
+
+ + +
+ § 4 First agent +

Register in the admin UI

+
    +
  1. Browser → /admin/hosts
  2. "Register a new host" → enter short name (e.g. pve-host-01)
  3. Click Add
  4. The page reveals a token — copy it now
+
+ Tokens are shown exactly once. If you close the page without copying, + use Rotate and try again.
+
+ + +
+ § 4 First agent +

Deploy binary + config

+
export HOST=pve-host-01
+
+scp agent/dist/proxmox-monitor-agent_linux_amd64 \
+    root@$HOST:/usr/local/bin/proxmox-monitor-agent
+ssh root@$HOST 'chmod 0755 /usr/local/bin/proxmox-monitor-agent'
+
+scp agent/rel/proxmox-monitor-agent.service \
+    root@$HOST:/etc/systemd/system/
+

On the host — write the TOML config:

+
install -d -m 0700 /etc/proxmox-monitor /var/cache/proxmox-monitor-agent
+
+cat > /etc/proxmox-monitor/agent.toml <<'EOF'
+server_url = "wss://monitor.example.com/socket/websocket"
+token = "<paste-token-from-dashboard>"
+host_id = "pve-host-01"
+
+[intervals]
+fast_seconds = 30
+medium_seconds = 300
+slow_seconds = 1800
+EOF
+chmod 0600 /etc/proxmox-monitor/agent.toml
+
+ + +
+ § 4 First agent +

Enable and verify

+
systemctl daemon-reload
+systemctl enable --now proxmox-monitor-agent
+journalctl -u proxmox-monitor-agent -f
+

Expected within 10s:

+
agent: starting with host_id=pve-host-01
+reporter: connected, joining host:pve-host-01
+reporter: joined host:pve-host-01
+

Reload the dashboard — the card should be online (green border) + with Load / RAM / Pools / VMs populated.

+
+ + +
+ § 4 First agent +

Verify the offline flip

+
+
+

Test

+
ssh root@$HOST \
+  'systemctl stop proxmox-monitor-agent'
+

Dashboard card grey within ~1s.

+
ssh root@$HOST \
+  'systemctl start proxmox-monitor-agent'
+

Green again within 30s.

+
+
+

If the card stays green

+

Channel terminate callback didn't run — usually Caddy.

+
    +
  • Check /etc/caddy/Caddyfile has read_timeout 90s
  • systemctl reload caddy after fixing
+
+
+
+ + +
+ § 5 Test tier +

2–3 hosts for 24h

+

Pick non-critical hosts, or hosts with independent monitoring to fall back on.

+

What to look for overnight:

+ +

Tests to actively run

+ +
+ + +
+ § 5 Test tier +

Go / No-Go gate

+

Do NOT proceed to full rollout unless ALL are true for 24h

+
All test-tier hosts show online continuously
+
No repeating error lines in server logs
+
Retention has pruned at least one row
+
Token rotation + restart behaves as designed
+
Server-reboot drill: all agents recover without intervention
+
Dashboard is responsive (<1s LiveView updates)
+
+ + +
+ § 6 Full rollout +

Batch loop

+

After 3–4 hosts by hand, batch:

+
for HOST in pve-host-04 pve-host-05 pve-host-06; do
+  echo "Register $HOST in admin UI, paste token:"
+  read -s TOKEN
+
+  scp agent/dist/proxmox-monitor-agent_linux_amd64 \
+      root@$HOST:/usr/local/bin/proxmox-monitor-agent
+  scp agent/rel/proxmox-monitor-agent.service \
+      root@$HOST:/etc/systemd/system/
+
+  ssh root@$HOST "chmod 0755 /usr/local/bin/proxmox-monitor-agent &&
+    install -d -m 0700 /etc/proxmox-monitor /var/cache/proxmox-monitor-agent &&
+    cat > /etc/proxmox-monitor/agent.toml <<EOF
+server_url = \"wss://monitor.example.com/socket/websocket\"
+token = \"$TOKEN\"
+host_id = \"$HOST\"
+EOF
+    chmod 0600 /etc/proxmox-monitor/agent.toml &&
+    systemctl daemon-reload &&
+    systemctl enable --now proxmox-monitor-agent"
+done
+

After each batch of ~5: spot-check cards, filter for offline, open a random host detail.
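The spot-check can be scripted too. A hedged sketch, reusing the same host names and root-SSH assumption as the rollout loop; unreachable hosts are reported rather than aborting the run:

```shell
# Ask each host in the batch whether the agent unit is active.
# BatchMode avoids password prompts; failures print "unreachable".
for HOST in pve-host-04 pve-host-05 pve-host-06; do
  status=$(ssh -o BatchMode=yes -o ConnectTimeout=3 root@$HOST \
    'systemctl is-active proxmox-monitor-agent' 2>/dev/null || echo unreachable)
  printf '%-14s %s\n' "$HOST" "$status"
done
```

Anything other than `active` on a host you just provisioned means checking its agent journal before moving to the next batch.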

+
+ + +
+ § 7 Rollback +

Four escape hatches

+
+
+
One agent
+
ssh root@$HOST \
+  'systemctl disable --now proxmox-monitor-agent'
+
+
+
Whole service
+
systemctl stop proxmox-monitor
+systemctl stop caddy
+
+
+
Previous release
+
systemctl stop proxmox-monitor
+rm -rf /opt/proxmox-monitor/server
+tar -xzf /tmp/server_release_PREV.tgz \
+    -C /opt/proxmox-monitor
+systemctl start proxmox-monitor
+
+
+
Restore DB
+
systemctl stop proxmox-monitor
+cp /var/backups/proxmox-monitor/monitor-YYYY-MM-DD.db \
+   /var/lib/proxmox-monitor/monitor.db
+systemctl start proxmox-monitor
+
+
+

Tokens survive DB restores. Metrics post-backup are lost (48h max by retention policy).

+
+ + +
+ § 8 Ongoing ops +

Upgrades

+
+
+

Server

+
cd server
+MIX_ENV=prod DASHBOARD_PASSWORD_HASH='placeholder' \
+  mix release --overwrite
+tar -czf /tmp/server_release.tgz -C _build/prod/rel server
+scp /tmp/server_release.tgz root@$LXC:/tmp/
+
+ssh root@$LXC '
+  systemctl stop proxmox-monitor
+  mv /opt/proxmox-monitor/server{,.old}
+  tar -xzf /tmp/server_release.tgz -C /opt/proxmox-monitor
+  systemctl start proxmox-monitor   # ExecStartPre runs migrate
+'
+

Verify /health then delete server.old.

+
+
+

Agent

+
scp agent/dist/proxmox-monitor-agent_linux_amd64 \
+    root@$HOST:/usr/local/bin/proxmox-monitor-agent.new
+
+ssh root@$HOST '
+  mv /usr/local/bin/proxmox-monitor-agent{.new,}
+  systemctl restart proxmox-monitor-agent
+'
+

No DB on the host, so agent upgrades are trivially atomic.

+
+
+
+ + +
+ § 8 Ongoing ops +

SQLite backups

+

Install as a cron inside the LXC — keeps 30 daily snapshots:

+
cat > /etc/cron.d/proxmox-monitor-backup <<'EOF'
+30 3 * * * root install -d -m 0700 /var/backups/proxmox-monitor && \
+  sqlite3 /var/lib/proxmox-monitor/monitor.db \
+    ".backup /var/backups/proxmox-monitor/monitor-$(date +\%Y-\%m-\%d).db" && \
+  find /var/backups/proxmox-monitor -name 'monitor-*.db' -mtime +30 -delete
+EOF
+

SQLite's online-backup command is safe while the server is running.

+

Verify at least one run before declaring the rollout complete.
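It is also worth drilling the verification itself on a scratch database before a real restore is ever needed. The same `.backup` / `PRAGMA integrity_check` pair applies to `/var/backups/proxmox-monitor/monitor-<date>.db`:

```shell
# Build a throwaway DB, back it up the same way the cron job does,
# then confirm the copy is a healthy SQLite file.
rm -f /tmp/pm-drill.db /tmp/pm-drill-copy.db
sqlite3 /tmp/pm-drill.db 'CREATE TABLE t(x); INSERT INTO t VALUES(1);'
sqlite3 /tmp/pm-drill.db '.backup /tmp/pm-drill-copy.db'
sqlite3 /tmp/pm-drill-copy.db 'PRAGMA integrity_check;'   # expect: ok
```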

+
+ + +
+ § 9 Sign-off +

Production readiness

+
/health returns 200 with status:ok
+
External uptime monitor configured and green
+
All intended Proxmox hosts on overview, all online
+
≥1 full 48h retention cycle observed (pruning log present)
+
SQLite backup cron installed and yesterday's file exists
+
You have rolled back once on purpose (drill)
+
+ + +
+ § 9 Sign-off +

Secrets hygiene

+
Dashboard password in a password manager, not a text file
+
SECRET_KEY_BASE in a password manager
+
/etc/default/proxmox-monitor is 0600 root:root
+
/etc/proxmox-monitor/agent.toml is 0600 root:root on every host
+
You can rotate an agent token in <2 minutes
+
A teammate has been walked through one agent install and one token rotation live
+
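The two-minute rotation claim is easy to rehearse. A sketch on a scratch copy of agent.toml (the real file is /etc/proxmox-monitor/agent.toml on the host; the sed-based edit is one possible approach, not the runbook's prescribed one — rotate in the admin UI first, then push the new token):

```shell
# Scratch copy of the agent config for the drill.
cat > /tmp/agent.toml <<'EOF'
server_url = "wss://monitor.example.com/socket/websocket"
token = "old-token"
host_id = "pve-host-01"
EOF

# Swap in the token revealed by the admin UI's Rotate action (example value).
NEW_TOKEN='token-from-admin-ui'
sed -i "s|^token = .*|token = \"$NEW_TOKEN\"|" /tmp/agent.toml
grep '^token' /tmp/agent.toml
# On the real host, follow with: systemctl restart proxmox-monitor-agent
```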
+ + +
+ Appendix A +

Common errors

+ + + + + + + + + + + +
SymptomFirst thing to check
CERT_AUTHORITY_INVALID in browserCaddy hasn't finished LE issuance. Wait 60s. journalctl -u caddy.
Login loops on correct passwordDASHBOARD_PASSWORD_HASH mismatch. Regenerate and redeploy.
Card stays offline after agent restartWrong token or unknown_host. Check agent journal.
All agents reconnect every ~30sCaddy read_timeout missing or too short.
/health returns 503Process up but DB unreadable. Check permissions + DATABASE_PATH.
LXC can't bind port 4000Another process owns it. ss -ltnp | grep 4000.
Agent logs {:enoent, "pvesh"}Not a Proxmox host, or empty $PATH under systemd.
+
+ + +
+ Appendix B +

File & port cheat sheet

+
+
+

Server LXC

+
/opt/proxmox-monitor/server/       release tree
+/etc/default/proxmox-monitor       env secrets, 0600
+/etc/systemd/system/proxmox-monitor.service
+/etc/caddy/Caddyfile
+/var/lib/proxmox-monitor/monitor.db
+/var/backups/proxmox-monitor/      daily backups
+
+tcp 443 (caddy) → tcp 127.0.0.1:4000 (phoenix)
+
+
+

Proxmox host (per agent)

+
/usr/local/bin/proxmox-monitor-agent
+/etc/proxmox-monitor/agent.toml    token, 0600
+/etc/systemd/system/proxmox-monitor-agent.service
+/var/cache/proxmox-monitor-agent/  Burrito unpack
+
+no listening ports
+
+
+
+ + +
+ Done +

MVP in production

+

+ All four phases from the concept shipped: monitoring skeleton, ZFS/VM/storage collectors, + LiveView dashboard, packaged binaries. The operator has the runbook; agents report; retention + prunes; backups run. Everything else is iteration. +

+

+ Full runbook: SETUP-AND-DEPLOY.md · Concept: proxmox-monitor-konzept.md +

+
+ +
+ + + + +