commit fab512f1e18d9d5006247c6e577b94f7c9a3bd89 Author: Carsten Date: Tue Apr 21 21:59:29 2026 +0200 chore: project skeleton + phase-1 plan diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..2b16eb5 --- /dev/null +++ b/.gitignore @@ -0,0 +1,27 @@ +# Elixir/Mix +/server/_build/ +/server/deps/ +/server/cover/ +/server/doc/ +/server/.fetch +/server/erl_crash.dump +/server/*.ez +/server/priv/static/assets/ +/server/priv/static/cache_manifest.json +/server/*.db +/server/*.db-journal +/server/*.db-wal +/server/*.db-shm + +/agent/_build/ +/agent/deps/ +/agent/cover/ +/agent/doc/ +/agent/.fetch +/agent/erl_crash.dump +/agent/*.ez + +# Editors / OS +.DS_Store +.vscode/ +.idea/ diff --git a/README.md b/README.md new file mode 100644 index 0000000..c1aa044 --- /dev/null +++ b/README.md @@ -0,0 +1,8 @@ +# Proxmox Monitor + +Agent-Server monitoring for Proxmox hosts. Elixir/OTP. See `proxmox-monitor-konzept.md`. + +- `server/` — Phoenix + SQLite + LiveView +- `agent/` — Slipstream Channels client, deploys as Burrito binary + +Phase 1 focuses on end-to-end metric push. Later phases add ZFS/VM collectors, persistence, LiveView dashboard. diff --git a/docs/superpowers/plans/2026-04-21-phase1-grundgeruest.md b/docs/superpowers/plans/2026-04-21-phase1-grundgeruest.md new file mode 100644 index 0000000..bd6aa0c --- /dev/null +++ b/docs/superpowers/plans/2026-04-21-phase1-grundgeruest.md @@ -0,0 +1,1595 @@ +# Phase 1 — Grundgerüst Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Stand up a minimal agent+server pair where an Elixir agent running locally connects via Phoenix Channels to a Phoenix server, authenticates with a token, and pushes host CPU/RAM metrics every 30 seconds. Server logs the incoming payloads. 
**Architecture:** Monorepo with two independent Mix projects (`server/` Phoenix+SQLite, `agent/` plain OTP app using Slipstream). Agent initiates a persistent WSS connection, joins topic `host:<name>`, pushes `metric:fast` events. Server persists only `hosts` in Phase 1 — metric storage lands in Phase 2.

**Tech Stack:** Elixir 1.19 / OTP 28, Phoenix 1.7.14, Ecto + `ecto_sqlite3`, `bcrypt_elixir` (token hashing), `slipstream` (agent Channels client), `toml` (agent config), ExUnit.

---

## File Structure

```
proxmox_monitor/
├── .gitignore
├── README.md
├── proxmox-monitor-konzept.md                        (existing)
├── docs/superpowers/plans/2026-04-21-phase1-grundgeruest.md
│
├── server/                                           (created by mix phx.new)
│   ├── mix.exs                                       modify: add :bcrypt_elixir
│   ├── config/{config,dev,test,runtime}.exs          scaffolded
│   ├── priv/repo/migrations/_create_hosts.exs        create
│   ├── lib/server/application.ex                     scaffolded
│   ├── lib/server/repo.ex                            scaffolded
│   ├── lib/server/schema/host.ex                     create
│   ├── lib/server/hosts.ex                           create (context)
│   ├── lib/server_web/endpoint.ex                    modify: add agent socket
│   ├── lib/server_web/channels/agent_socket.ex       create
│   ├── lib/server_web/channels/host_channel.ex       create
│   ├── test/server/hosts_test.exs                    create
│   └── test/server_web/channels/host_channel_test.exs  create
│
└── agent/                                            (created by mix new --sup)
    ├── mix.exs                                       modify: deps + app config
    ├── config/config.exs                             create
    ├── config/runtime.exs                            create
    ├── lib/agent.ex                                  scaffolded
    ├── lib/agent/application.ex                      modify
    ├── lib/agent/config.ex                           create
    ├── lib/agent/collectors/host.ex                  create
    ├── lib/agent/reporter.ex                         create
    ├── test/agent/config_test.exs                    create
    ├── test/agent/collectors/host_test.exs           create
    └── test/fixtures/proc/                           create (loadavg, meminfo, stat samples)
```

Each file has one responsibility: schema, context (business logic), channel (transport), collector (data acquisition), reporter (transmission). Test files mirror the source tree.
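For orientation, this is roughly what one fast-interval push will look like on the wire once Phase 1 is complete. The field names follow the channel tests in Task 6 and the collector in Task 11; the concrete values are illustrative:

```elixir
# Event "metric:fast", pushed on topic "host:<name>" every fast_seconds:
%{
  "collected_at" => "2026-04-21T12:00:00Z",
  "data" => %{
    "hostname" => "pve-01",
    "load1" => 0.42,
    "load5" => 0.55,
    "load15" => 0.31,
    "mem_total_bytes" => 16_777_216_000,
    "mem_available_bytes" => 8_388_608_000,
    "mem_used_bytes" => 8_388_608_000,
    "uptime_seconds" => 123_456,
    "errors" => []
  }
}
```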
+ +--- + +## Task 1: Monorepo Init + +**Files:** +- Create: `.gitignore` +- Create: `README.md` + +- [ ] **Step 1: Write `.gitignore` (covers both Mix projects)** + +``` +# Elixir/Mix +/server/_build/ +/server/deps/ +/server/cover/ +/server/doc/ +/server/.fetch +/server/erl_crash.dump +/server/*.ez +/server/priv/static/assets/ +/server/priv/static/cache_manifest.json +/server/*.db +/server/*.db-journal +/server/*.db-wal +/server/*.db-shm + +/agent/_build/ +/agent/deps/ +/agent/cover/ +/agent/doc/ +/agent/.fetch +/agent/erl_crash.dump +/agent/*.ez + +# Editors / OS +.DS_Store +.vscode/ +.idea/ +``` + +- [ ] **Step 2: Write `README.md` (minimal)** + +```markdown +# Proxmox Monitor + +Agent-Server monitoring for Proxmox hosts. Elixir/OTP. See `proxmox-monitor-konzept.md`. + +- `server/` — Phoenix + SQLite + LiveView +- `agent/` — Slipstream Channels client, deploys as Burrito binary + +Phase 1 focuses on end-to-end metric push. Later phases add ZFS/VM collectors, persistence, LiveView dashboard. +``` + +- [ ] **Step 3: Initial commit** + +```bash +git add .gitignore README.md proxmox-monitor-konzept.md docs/ +git commit -m "chore: project skeleton + phase-1 plan" +``` + +--- + +## Task 2: Server — Phoenix Bootstrap + +**Files:** +- Create: entire `server/` tree via `mix phx.new` + +- [ ] **Step 1: Generate Phoenix project** + +Run from `/Users/cabele/claudeprojects/proxmox_monitor`: + +```bash +mix phx.new server --database sqlite3 --no-mailer --no-gettext --live --install +``` + +If prompted, answer `Y` to fetch deps. + +Expected: creates `server/` with Phoenix scaffold, SQLite adapter, LiveView enabled, no Gettext, no Mailer. Deps fetched, assets installed. + +- [ ] **Step 2: Verify scaffold builds and tests pass** + +```bash +cd server && mix compile && mix test +``` + +Expected: compiles clean, default `PageControllerTest` passes. 
+ +- [ ] **Step 3: Commit the scaffold** + +```bash +cd /Users/cabele/claudeprojects/proxmox_monitor +git add server/ +git commit -m "feat(server): phoenix 1.7 scaffold with sqlite + liveview" +``` + +--- + +## Task 3: Server — Bcrypt Dependency + +**Files:** +- Modify: `server/mix.exs` + +- [ ] **Step 1: Add `:bcrypt_elixir` to deps** + +In `server/mix.exs`, locate the `defp deps do` list and add the line below alongside existing entries: + +```elixir + {:bcrypt_elixir, "~> 3.1"}, +``` + +- [ ] **Step 2: Fetch and compile** + +```bash +cd server && mix deps.get && mix compile +``` + +Expected: bcrypt_elixir and cc_precompiler fetched; compile succeeds (bcrypt NIF builds). + +- [ ] **Step 3: Commit** + +```bash +git add server/mix.exs server/mix.lock +git commit -m "feat(server): add bcrypt_elixir for token hashing" +``` + +--- + +## Task 4: Server — Host Schema + Context (TDD) + +**Files:** +- Create: `server/priv/repo/migrations/_create_hosts.exs` +- Create: `server/lib/server/schema/host.ex` +- Create: `server/lib/server/hosts.ex` +- Create: `server/test/server/hosts_test.exs` + +- [ ] **Step 1: Generate migration file** + +```bash +cd server && mix ecto.gen.migration create_hosts +``` + +Fill the generated file (timestamped name) with: + +```elixir +defmodule Server.Repo.Migrations.CreateHosts do + use Ecto.Migration + + def change do + create table(:hosts) do + add :name, :string, null: false + add :token_hash, :string, null: false + add :agent_version, :string + add :proxmox_version, :string + add :zfs_version, :string + add :status, :string, null: false, default: "never_connected" + add :last_seen_at, :utc_datetime_usec + + timestamps(type: :utc_datetime_usec) + end + + create unique_index(:hosts, [:name]) + end +end +``` + +- [ ] **Step 2: Write schema module** + +Create `server/lib/server/schema/host.ex`: + +```elixir +defmodule Server.Schema.Host do + use Ecto.Schema + import Ecto.Changeset + + @statuses ~w(never_connected online offline) + + schema 
"hosts" do + field :name, :string + field :token_hash, :string + field :agent_version, :string + field :proxmox_version, :string + field :zfs_version, :string + field :status, :string, default: "never_connected" + field :last_seen_at, :utc_datetime_usec + + timestamps(type: :utc_datetime_usec) + end + + def create_changeset(host, attrs) do + host + |> cast(attrs, [:name, :token_hash]) + |> validate_required([:name, :token_hash]) + |> validate_length(:name, min: 1, max: 100) + |> unique_constraint(:name) + end + + def status_changeset(host, attrs) do + host + |> cast(attrs, [:status, :last_seen_at, :agent_version]) + |> validate_inclusion(:status, @statuses) + end +end +``` + +- [ ] **Step 3: Write failing tests for the context** + +Create `server/test/server/hosts_test.exs`: + +```elixir +defmodule Server.HostsTest do + use Server.DataCase, async: true + + alias Server.Hosts + + describe "create_host/1" do + test "returns host and a plaintext token on success" do + assert {:ok, {host, token}} = Hosts.create_host("pve-01") + assert host.name == "pve-01" + assert host.status == "never_connected" + assert is_binary(token) and byte_size(token) >= 32 + refute host.token_hash == token + end + + test "rejects duplicate names" do + {:ok, _} = Hosts.create_host("pve-01") + assert {:error, changeset} = Hosts.create_host("pve-01") + assert %{name: ["has already been taken"]} = errors_on(changeset) + end + end + + describe "authenticate/2" do + test "returns host for valid name+token" do + {:ok, {host, token}} = Hosts.create_host("pve-01") + assert {:ok, found} = Hosts.authenticate("pve-01", token) + assert found.id == host.id + end + + test "returns :invalid_token for wrong token" do + {:ok, {_host, _token}} = Hosts.create_host("pve-01") + assert {:error, :invalid_token} = Hosts.authenticate("pve-01", "wrong") + end + + test "returns :unknown_host when name does not exist" do + assert {:error, :unknown_host} = Hosts.authenticate("nope", "whatever") + end + end + + describe 
"mark_online/2 and mark_offline/1" do + test "mark_online stamps status, last_seen_at, agent_version" do + {:ok, {host, _}} = Hosts.create_host("pve-01") + assert {:ok, updated} = Hosts.mark_online(host, "0.1.0") + assert updated.status == "online" + assert updated.agent_version == "0.1.0" + assert updated.last_seen_at != nil + end + + test "mark_offline sets status to offline" do + {:ok, {host, _}} = Hosts.create_host("pve-01") + {:ok, online} = Hosts.mark_online(host, "0.1.0") + assert {:ok, offline} = Hosts.mark_offline(online) + assert offline.status == "offline" + end + end +end +``` + +- [ ] **Step 4: Run tests — expect failure** + +```bash +cd server && mix test test/server/hosts_test.exs +``` + +Expected: compile error `Server.Hosts is not available` or similar. + +- [ ] **Step 5: Implement the context** + +Create `server/lib/server/hosts.ex`: + +```elixir +defmodule Server.Hosts do + @moduledoc "Host registration, authentication, status tracking." + + alias Server.Repo + alias Server.Schema.Host + + @spec create_host(String.t()) :: {:ok, {Host.t(), String.t()}} | {:error, Ecto.Changeset.t()} + def create_host(name) do + token = generate_token() + hash = Bcrypt.hash_pwd_salt(token) + + %Host{} + |> Host.create_changeset(%{name: name, token_hash: hash}) + |> Repo.insert() + |> case do + {:ok, host} -> {:ok, {host, token}} + {:error, cs} -> {:error, cs} + end + end + + @spec authenticate(String.t(), String.t()) :: + {:ok, Host.t()} | {:error, :unknown_host | :invalid_token} + def authenticate(name, token) when is_binary(name) and is_binary(token) do + case Repo.get_by(Host, name: name) do + nil -> + Bcrypt.no_user_verify() + {:error, :unknown_host} + + host -> + if Bcrypt.verify_pass(token, host.token_hash) do + {:ok, host} + else + {:error, :invalid_token} + end + end + end + + @spec mark_online(Host.t(), String.t() | nil) :: {:ok, Host.t()} | {:error, Ecto.Changeset.t()} + def mark_online(%Host{} = host, agent_version) do + host + |> 
Host.status_changeset(%{
      status: "online",
      last_seen_at: DateTime.utc_now(),
      agent_version: agent_version
    })
    |> Repo.update()
  end

  @spec mark_offline(Host.t()) :: {:ok, Host.t()} | {:error, Ecto.Changeset.t()}
  def mark_offline(%Host{} = host) do
    host
    |> Host.status_changeset(%{status: "offline"})
    |> Repo.update()
  end

  @doc "Mark all online hosts offline — called on server boot to clear stale online flags."
  @spec mark_all_offline() :: {integer(), nil}
  def mark_all_offline do
    import Ecto.Query

    # Scope to "online" so hosts that have never connected keep their
    # "never_connected" status across server restarts.
    from(h in Host, where: h.status == "online")
    |> Repo.update_all(set: [status: "offline", updated_at: DateTime.utc_now()])
  end

  defp generate_token do
    :crypto.strong_rand_bytes(32) |> Base.url_encode64(padding: false)
  end
end
```

- [ ] **Step 6: Speed up bcrypt in tests**

In `server/config/test.exs`, add the following at the top level (its position within the file does not matter):

```elixir
config :bcrypt_elixir, :log_rounds, 4
```

- [ ] **Step 7: Run tests — expect all pass**

```bash
cd server && mix ecto.reset && mix test test/server/hosts_test.exs
```

Expected: 7 tests pass.

- [ ] **Step 8: Commit**

```bash
git add server/priv server/lib/server server/test/server server/config/test.exs
git commit -m "feat(server): host schema, context, auth, status transitions"
```

---

## Task 5: Server — AgentSocket + Mark-All-Offline on Boot

**Files:**
- Create: `server/lib/server_web/channels/agent_socket.ex`
- Modify: `server/lib/server_web/endpoint.ex`
- Modify: `server/lib/server/application.ex`

- [ ] **Step 1: Write AgentSocket**

Create `server/lib/server_web/channels/agent_socket.ex`:

```elixir
defmodule ServerWeb.AgentSocket do
  @moduledoc "Entry socket for agents. Actual authentication happens in HostChannel.join/3."
  use Phoenix.Socket

  channel "host:*", ServerWeb.HostChannel

  @impl true
  def connect(_params, socket, _connect_info), do: {:ok, socket}

  @impl true
  def id(_socket), do: nil
end
```

- [ ] **Step 2: Mount the socket in the endpoint**

In `server/lib/server_web/endpoint.ex`, find the existing `socket "/live"` line and add just below it:

```elixir
  socket "/socket", ServerWeb.AgentSocket,
    websocket: [timeout: 45_000],
    longpoll: false
```

- [ ] **Step 3: Clear stale online flags on boot**

In `server/lib/server/application.ex`, find the existing `start/2` function. It currently ends with something like:

```elixir
    opts = [strategy: :one_for_one, name: Server.Supervisor]
    Supervisor.start_link(children, opts)
  end
```

Replace those two lines with:

```elixir
    opts = [strategy: :one_for_one, name: Server.Supervisor]
    result = Supervisor.start_link(children, opts)
    with {:ok, _} <- result, do: Server.Hosts.mark_all_offline()
    result
  end
```

Rationale: if the server restarts while agents are connected, their rows stay marked `online` even though no channel exists anymore. Clearing the stale flags on boot lets each agent's next channel join flip its row back to `online` cleanly.

- [ ] **Step 4: Compile to verify**

```bash
cd server && mix compile
```

Expected: compiles cleanly. `ServerWeb.HostChannel` does not exist yet, but that is fine: `channel/2` only registers the module name for topic routing and performs no compile-time check, so no warning is emitted. The module is created in the next task.
- [ ] **Step 5: Commit**

```bash
git add server/lib/server_web/channels/agent_socket.ex server/lib/server_web/endpoint.ex server/lib/server/application.ex
git commit -m "feat(server): agent socket endpoint, clear online status on boot"
```

---

## Task 6: Server — HostChannel (TDD)

**Files:**
- Create: `server/lib/server_web/channels/host_channel.ex`
- Create: `server/test/server_web/channels/host_channel_test.exs`
- Modify: `server/test/support/channel_case.ex` (verify it exists; create it if the scaffold did not)

- [ ] **Step 1: Confirm ChannelCase exists**

```bash
ls server/test/support/channel_case.ex
```

Expected: the file exists. Note that `mix phx.new` scaffolds typically generate only `ConnCase` and `DataCase`; `channel_case.ex` is normally created by `mix phx.gen.channel`. If it is missing, create it by hand along the same lines as the generated `DataCase`, because the channel tests below require it.

- [ ] **Step 2: Write failing channel tests**

Create `server/test/server_web/channels/host_channel_test.exs`:

```elixir
defmodule ServerWeb.HostChannelTest do
  use ServerWeb.ChannelCase, async: false

  alias Server.Hosts
  alias ServerWeb.AgentSocket

  setup do
    {:ok, {host, token}} = Hosts.create_host("pve-01")
    %{host: host, token: token}
  end

  describe "join" do
    test "succeeds with valid token and marks host online", %{host: host, token: token} do
      {:ok, socket} = connect(AgentSocket, %{})

      assert {:ok, _reply, socket} =
               subscribe_and_join(socket, "host:pve-01", %{
                 "token" => token,
                 "agent_version" => "0.1.0"
               })

      assert socket.assigns.host_id == host.id

      reloaded = Server.Repo.reload!(host)
      assert reloaded.status == "online"
      assert reloaded.agent_version == "0.1.0"
      assert reloaded.last_seen_at != nil
    end

    test "rejects invalid token", %{host: _host} do
      {:ok, socket} = connect(AgentSocket, %{})

      assert {:error, %{reason: "invalid_token"}} =
               subscribe_and_join(socket, "host:pve-01", %{
                 "token" => "garbage",
                 "agent_version" => "0.1.0"
               })
    end

    test "rejects unknown host name" do
      {:ok, socket} = connect(AgentSocket, 
%{}) + + assert {:error, %{reason: "unknown_host"}} = + subscribe_and_join(socket, "host:nope", %{ + "token" => "x", + "agent_version" => "0.1.0" + }) + end + + test "rejects topic mismatch" do + {:ok, socket} = connect(AgentSocket, %{}) + + assert {:error, %{reason: "bad_topic"}} = + subscribe_and_join(socket, "host:", %{"token" => "x", "agent_version" => "0.1.0"}) + end + end + + describe "metric:fast event" do + setup %{token: token} do + {:ok, socket} = connect(AgentSocket, %{}) + + {:ok, _reply, joined} = + subscribe_and_join(socket, "host:pve-01", %{ + "token" => token, + "agent_version" => "0.1.0" + }) + + %{socket: joined} + end + + test "accepts metric payload and replies :ok", %{socket: socket} do + ref = + push(socket, "metric:fast", %{ + "collected_at" => "2026-04-21T12:00:00Z", + "data" => %{"cpu_percent" => 12.3, "load1" => 0.2} + }) + + assert_reply ref, :ok + end + end + + describe "terminate" do + test "marks host offline when channel process exits", %{host: host, token: token} do + {:ok, socket} = connect(AgentSocket, %{}) + + {:ok, _, joined} = + subscribe_and_join(socket, "host:pve-01", %{ + "token" => token, + "agent_version" => "0.1.0" + }) + + Process.unlink(joined.channel_pid) + ref = Process.monitor(joined.channel_pid) + close(joined) + assert_receive {:DOWN, ^ref, :process, _, _}, 1_000 + + reloaded = Server.Repo.reload!(host) + assert reloaded.status == "offline" + end + end +end +``` + +- [ ] **Step 3: Run tests — expect failure (HostChannel not implemented)** + +```bash +cd server && mix test test/server_web/channels/host_channel_test.exs +``` + +Expected: compile error `ServerWeb.HostChannel is not available`. 
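If Step 1 found no `test/support/channel_case.ex` (current `phx.new` scaffolds generate it only via `mix phx.gen.channel`), the tests above will not even compile. A minimal ChannelCase along standard Phoenix 1.7 lines, assuming the scaffold's `Server.DataCase.setup_sandbox/1` helper exists, would be:

```elixir
defmodule ServerWeb.ChannelCase do
  use ExUnit.CaseTemplate

  using do
    quote do
      # Channel test helpers: connect/2, subscribe_and_join/3, push/3, ...
      import Phoenix.ChannelTest
      import ServerWeb.ChannelCase

      # The default endpoint for testing
      @endpoint ServerWeb.Endpoint
    end
  end

  setup tags do
    # Checks out a sandboxed repo connection per test, shared with
    # spawned processes (such as channel processes) when async: false.
    Server.DataCase.setup_sandbox(tags)
    :ok
  end
end
```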
- [ ] **Step 4: Implement HostChannel**

Create `server/lib/server_web/channels/host_channel.ex`:

```elixir
defmodule ServerWeb.HostChannel do
  use ServerWeb, :channel
  require Logger

  alias Server.Hosts

  @impl true
  def join("host:" <> name, params, socket) when name != "" do
    token = Map.get(params, "token", "")
    agent_version = Map.get(params, "agent_version")

    case Hosts.authenticate(name, token) do
      {:ok, host} ->
        {:ok, _} = Hosts.mark_online(host, agent_version)
        Logger.info("agent joined host:#{name}")

        socket =
          socket
          |> assign(:host_id, host.id)
          |> assign(:host_name, name)

        {:ok, socket}

      {:error, :unknown_host} ->
        {:error, %{reason: "unknown_host"}}

      {:error, :invalid_token} ->
        {:error, %{reason: "invalid_token"}}
    end
  end

  def join(_topic, _params, _socket), do: {:error, %{reason: "bad_topic"}}

  @impl true
  def handle_in("metric:fast", payload, socket) do
    Logger.info("metric:fast host=#{socket.assigns.host_name} data=#{inspect(payload["data"])}")
    {:reply, :ok, socket}
  end

  def handle_in("metric:medium", payload, socket) do
    Logger.info("metric:medium host=#{socket.assigns.host_name} payload=#{inspect(payload)}")
    {:reply, :ok, socket}
  end

  def handle_in("metric:slow", payload, socket) do
    Logger.info("metric:slow host=#{socket.assigns.host_name} payload=#{inspect(payload)}")
    {:reply, :ok, socket}
  end

  @impl true
  def terminate(_reason, socket) do
    case socket.assigns[:host_id] do
      nil ->
        :ok

      id ->
        with host when not is_nil(host) <- Server.Repo.get(Server.Schema.Host, id) do
          Hosts.mark_offline(host)
        end

        :ok
    end
  end
end
```

- [ ] **Step 5: Run tests — expect pass**

```bash
cd server && mix test test/server_web/channels/host_channel_test.exs
```

Expected: all tests pass.

- [ ] **Step 6: Run full test suite**

```bash
cd server && mix test
```

Expected: all tests green.
- [ ] **Step 7: Commit**

```bash
git add server/lib/server_web/channels/host_channel.ex server/test/server_web/channels/host_channel_test.exs
git commit -m "feat(server): host channel with token auth and metric events"
```

---

## Task 7: Server — Smoke-Test Helper

**Files:**
- Create: `server/lib/server/release.ex` (minimal helper for IEx-driven host creation)

- [ ] **Step 1: Add a tiny release helper**

Create `server/lib/server/release.ex`:

```elixir
defmodule Server.Release do
  @moduledoc "Convenience functions for IEx and future release tasks."

  @doc "Create a host and print the plaintext token once."
  def register_host(name) do
    case Server.Hosts.create_host(name) do
      {:ok, {host, token}} ->
        IO.puts("Host '#{host.name}' registered (id=#{host.id}).")
        IO.puts("TOKEN: #{token}")
        IO.puts("Store this token NOW — it will never be shown again.")
        {:ok, host, token}

      {:error, cs} ->
        IO.puts("Failed to register host: #{inspect(cs.errors)}")
        {:error, cs}
    end
  end
end
```

- [ ] **Step 2: Compile**

```bash
cd server && mix compile
```

- [ ] **Step 3: Commit**

```bash
git add server/lib/server/release.ex
git commit -m "chore(server): iex helper for host registration"
```

---

## Task 8: Agent — Mix Project Bootstrap

**Files:**
- Create: `agent/` directory tree via `mix new`

- [ ] **Step 1: Generate the OTP app**

Run from `/Users/cabele/claudeprojects/proxmox_monitor`. Note that a bare `mix new agent --sup` fails here: the generator rejects the inferred module name `Agent` because Elixir's standard library already defines it ("Module name Agent is already taken"). Pass an explicit placeholder module instead:

```bash
mix new agent --sup --module ProxmoxAgent
```

Then rename `ProxmoxAgent.Application` to `Agent.Application` in `agent/lib/agent/application.ex`. Only the bare top-level name `Agent` trips the generator check; nested `Agent.*` modules are fine. The scaffolded `mix.exs` and `lib/agent.ex` still reference `ProxmoxAgent`, but Step 2 and Task 9 replace both files wholesale; the scaffolded `test/agent_test.exs` can simply be deleted.

Expected: creates `agent/` with `mix.exs`, `lib/agent.ex`, `lib/agent/application.ex`, `test/`.
- [ ] **Step 2: Replace `agent/mix.exs` contents**

Open `agent/mix.exs` and replace with:

```elixir
defmodule Agent.MixProject do
  use Mix.Project

  @version "0.1.0"

  def project do
    [
      app: :agent,
      version: @version,
      elixir: "~> 1.17",
      start_permanent: Mix.env() == :prod,
      deps: deps(),
      elixirc_paths: elixirc_paths(Mix.env())
    ]
  end

  def application do
    [
      extra_applications: [:logger, :crypto],
      mod: {Agent.Application, []}
    ]
  end

  defp deps do
    [
      {:slipstream, "~> 1.1"},
      {:jason, "~> 1.4"},
      {:toml, "~> 0.7"}
    ]
  end

  defp elixirc_paths(:test), do: ["lib", "test/support"]
  defp elixirc_paths(_), do: ["lib"]
end
```

- [ ] **Step 3: Fetch deps and compile**

```bash
cd agent && mix deps.get && mix compile
```

Expected: slipstream, mint_web_socket, jason, toml fetched; compile succeeds.

- [ ] **Step 4: Commit**

```bash
cd /Users/cabele/claudeprojects/proxmox_monitor
git add agent/
git commit -m "feat(agent): otp app scaffold with slipstream + toml deps"
```

---

## Task 9: Agent — Version Constant

**Files:**
- Modify: `agent/lib/agent.ex`

- [ ] **Step 1: Replace the scaffolded Agent module**

Replace the entire contents of `agent/lib/agent.ex` with the module below. Defining `Agent` shadows Elixir's built-in `Agent` module within this application, so expect a "redefining module Agent" compiler warning; that is acceptable here because nothing in this plan uses the standard-library `Agent` API.

```elixir
defmodule Agent do
  @moduledoc "Top-level namespace. Exposes the compiled version for reporting."
  @version Mix.Project.config()[:version]

  @spec version() :: String.t()
  def version, do: @version
end
```

- [ ] **Step 2: Compile and quick-check in IEx**

```bash
cd agent && mix compile
```

Then run `iex -S mix` and confirm that `Agent.version()` returns `"0.1.0"`.

- [ ] **Step 3: Commit**

```bash
git add agent/lib/agent.ex
git commit -m "feat(agent): expose compile-time version"
```

---

## Task 10: Agent — Config Module (TDD)

**Files:**
- Create: `agent/lib/agent/config.ex`
- Create: `agent/test/agent/config_test.exs`
- Create: `agent/test/fixtures/agent.toml` (sample config used by test)

- [ ] **Step 1: Write a fixture config**

Create `agent/test/fixtures/agent.toml`:

```toml
server_url = "wss://monitor.example.com/socket/websocket"
token = "test_token_123"
host_id = "pve-test-01"

[intervals]
fast_seconds = 15
medium_seconds = 120
slow_seconds = 600
```

- [ ] **Step 2: Write failing tests**

Create `agent/test/agent/config_test.exs`:

```elixir
defmodule Agent.ConfigTest do
  use ExUnit.Case, async: true

  alias Agent.Config

  @fixture Path.expand("../fixtures/agent.toml", __DIR__)

  describe "load/1" do
    test "parses required fields" do
      assert {:ok, cfg} = Config.load(@fixture)
      assert cfg.server_url == "wss://monitor.example.com/socket/websocket"
      assert cfg.token == "test_token_123"
      assert cfg.host_id == "pve-test-01"
      assert cfg.fast_seconds == 15
      assert cfg.medium_seconds == 120
      assert cfg.slow_seconds == 600
    end

    test "returns error for missing file" do
      assert {:error, {:file_read, _}} = Config.load("/does/not/exist.toml")
    end

    test "defaults host_id to system hostname when absent" do
      tmp = Path.join(System.tmp_dir!(), "agent_nohost.toml")

      File.write!(tmp, """
      server_url = "wss://x/socket/websocket"
      token = "t"
      """)

      on_exit(fn -> File.rm(tmp) end)

      assert {:ok, cfg} = Config.load(tmp)
      assert is_binary(cfg.host_id)
      assert cfg.host_id != ""
    end

    test "applies default intervals when [intervals] is absent" do
tmp = Path.join(System.tmp_dir!(), "agent_nointervals.toml") + + File.write!(tmp, """ + server_url = "wss://x/socket/websocket" + token = "t" + host_id = "h" + """) + + on_exit(fn -> File.rm(tmp) end) + + assert {:ok, cfg} = Config.load(tmp) + assert cfg.fast_seconds == 30 + assert cfg.medium_seconds == 300 + assert cfg.slow_seconds == 1800 + end + + test "returns error when required keys missing" do + tmp = Path.join(System.tmp_dir!(), "agent_bad.toml") + File.write!(tmp, "token = \"t\"\n") + on_exit(fn -> File.rm(tmp) end) + assert {:error, {:missing_key, :server_url}} = Config.load(tmp) + end + end +end +``` + +- [ ] **Step 3: Run tests — expect failure** + +```bash +cd agent && mix test test/agent/config_test.exs +``` + +Expected: `Agent.Config is not available`. + +- [ ] **Step 4: Implement the config loader** + +Create `agent/lib/agent/config.ex`: + +```elixir +defmodule Agent.Config do + @moduledoc "Loads and validates the TOML agent config." + + defstruct [ + :server_url, + :token, + :host_id, + fast_seconds: 30, + medium_seconds: 300, + slow_seconds: 1800 + ] + + @type t :: %__MODULE__{ + server_url: String.t(), + token: String.t(), + host_id: String.t(), + fast_seconds: pos_integer(), + medium_seconds: pos_integer(), + slow_seconds: pos_integer() + } + + @required ~w(server_url token)a + + @spec load(Path.t()) :: + {:ok, t()} + | {:error, {:file_read, term()} | {:parse, term()} | {:missing_key, atom()}} + def load(path) do + with {:ok, body} <- read_file(path), + {:ok, parsed} <- parse_toml(body), + :ok <- validate_required(parsed) do + {:ok, build(parsed)} + end + end + + defp read_file(path) do + case File.read(path) do + {:ok, body} -> {:ok, body} + {:error, reason} -> {:error, {:file_read, reason}} + end + end + + defp parse_toml(body) do + case Toml.decode(body) do + {:ok, map} -> {:ok, map} + {:error, reason} -> {:error, {:parse, reason}} + end + end + + defp validate_required(map) do + Enum.find_value(@required, :ok, fn key -> + case Map.get(map, 
Atom.to_string(key)) do + v when is_binary(v) and v != "" -> nil + _ -> {:error, {:missing_key, key}} + end + end) + end + + defp build(map) do + intervals = Map.get(map, "intervals", %{}) + + %__MODULE__{ + server_url: map["server_url"], + token: map["token"], + host_id: map["host_id"] || hostname(), + fast_seconds: Map.get(intervals, "fast_seconds", 30), + medium_seconds: Map.get(intervals, "medium_seconds", 300), + slow_seconds: Map.get(intervals, "slow_seconds", 1800) + } + end + + defp hostname do + case :inet.gethostname() do + {:ok, name} -> List.to_string(name) + _ -> "unknown-host" + end + end +end +``` + +- [ ] **Step 5: Run tests — expect pass** + +```bash +cd agent && mix test test/agent/config_test.exs +``` + +Expected: 5 tests pass. + +- [ ] **Step 6: Commit** + +```bash +git add agent/lib/agent/config.ex agent/test/agent/config_test.exs agent/test/fixtures/agent.toml +git commit -m "feat(agent): toml config loader with defaults and validation" +``` + +--- + +## Task 11: Agent — Host Collector (TDD with /proc fixtures) + +**Files:** +- Create: `agent/lib/agent/collectors/host.ex` +- Create: `agent/test/agent/collectors/host_test.exs` +- Create: `agent/test/fixtures/proc/loadavg` +- Create: `agent/test/fixtures/proc/meminfo` +- Create: `agent/test/fixtures/proc/uptime` + +The collector reads Linux `/proc`. Tests run on macOS too — they point the collector at fixture files instead. 
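Note that the file tree above lists a `stat` sample among the fixtures and the channel test in Task 6 pushes a `cpu_percent` field, but this task's collector only reads `loadavg`, `meminfo`, and `uptime`. A CPU percentage needs two `/proc/stat` readings taken some interval apart; a sketch of that delta calculation for a later task (the module name and field handling are illustrative, not part of this task's tests):

```elixir
defmodule CpuPercentSketch do
  @moduledoc false
  # Aggregate "cpu" line layout in /proc/stat:
  #   cpu user nice system idle iowait irq softirq steal [guest guest_nice]
  # Values are cumulative jiffies, so a percentage needs two samples.

  def parse(stat_body) do
    ["cpu" | fields] =
      stat_body
      |> String.split("\n", trim: true)
      |> hd()
      |> String.split(~r/\s+/, trim: true)

    Enum.map(fields, &String.to_integer/1)
  end

  def busy_percent(prev, curr) do
    # Idle time is the idle + iowait columns (zero-based indices 3 and 4).
    idle = fn f -> Enum.at(f, 3, 0) + Enum.at(f, 4, 0) end

    d_total = Enum.sum(curr) - Enum.sum(prev)
    d_idle = idle.(curr) - idle.(prev)

    if d_total > 0, do: (d_total - d_idle) / d_total * 100.0, else: 0.0
  end
end
```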
+ +- [ ] **Step 1: Write fixture files** + +Create `agent/test/fixtures/proc/loadavg`: + +``` +0.42 0.55 0.31 3/512 12345 +``` + +Create `agent/test/fixtures/proc/meminfo`: + +``` +MemTotal: 16384000 kB +MemFree: 2048000 kB +MemAvailable: 8192000 kB +Buffers: 256000 kB +Cached: 4096000 kB +SwapTotal: 4194304 kB +SwapFree: 4194304 kB +``` + +Create `agent/test/fixtures/proc/uptime`: + +``` +123456.78 987654.32 +``` + +- [ ] **Step 2: Write failing tests** + +Create `agent/test/agent/collectors/host_test.exs`: + +```elixir +defmodule Agent.Collectors.HostTest do + use ExUnit.Case, async: true + + alias Agent.Collectors.Host + + @proc Path.expand("../../fixtures/proc", __DIR__) + + test "collects load average" do + sample = Host.collect(proc_dir: @proc) + assert sample.load1 == 0.42 + assert sample.load5 == 0.55 + assert sample.load15 == 0.31 + end + + test "collects memory in bytes" do + sample = Host.collect(proc_dir: @proc) + assert sample.mem_total_bytes == 16_384_000 * 1024 + assert sample.mem_available_bytes == 8_192_000 * 1024 + assert sample.mem_used_bytes == sample.mem_total_bytes - sample.mem_available_bytes + end + + test "collects uptime seconds" do + sample = Host.collect(proc_dir: @proc) + assert sample.uptime_seconds == 123_456 + end + + test "includes hostname string" do + sample = Host.collect(proc_dir: @proc) + assert is_binary(sample.hostname) + assert sample.hostname != "" + end + + test "missing proc files yield :error field, not a crash" do + sample = Host.collect(proc_dir: "/nonexistent/path/xyz") + assert sample.errors != [] + end +end +``` + +- [ ] **Step 3: Run tests — expect failure** + +```bash +cd agent && mix test test/agent/collectors/host_test.exs +``` + +Expected: `Agent.Collectors.Host is not available`. + +- [ ] **Step 4: Implement collector** + +Create `agent/lib/agent/collectors/host.ex`: + +```elixir +defmodule Agent.Collectors.Host do + @moduledoc """ + Reads host metrics from /proc. Accepts `proc_dir:` option for testability. 
+  Never raises — on read failure, populates `:errors` and leaves the field nil.
+  """
+
+  @type sample :: %{
+          hostname: String.t(),
+          load1: float() | nil,
+          load5: float() | nil,
+          load15: float() | nil,
+          mem_total_bytes: non_neg_integer() | nil,
+          mem_available_bytes: non_neg_integer() | nil,
+          mem_used_bytes: non_neg_integer() | nil,
+          uptime_seconds: non_neg_integer() | nil,
+          errors: [term()]
+        }
+
+  @spec collect(keyword()) :: sample()
+  def collect(opts \\ []) do
+    proc_dir = Keyword.get(opts, :proc_dir, "/proc")
+
+    {load, e1} = safe(&read_loadavg/1, [proc_dir], {nil, nil, nil})
+    {mem, e2} = safe(&read_meminfo/1, [proc_dir], %{total: nil, available: nil})
+    {uptime, e3} = safe(&read_uptime/1, [proc_dir], nil)
+
+    total = mem.total
+    avail = mem.available
+    used = if total && avail, do: total - avail, else: nil
+    {load1, load5, load15} = load
+
+    %{
+      hostname: hostname(),
+      load1: load1,
+      load5: load5,
+      load15: load15,
+      mem_total_bytes: total,
+      mem_available_bytes: avail,
+      mem_used_bytes: used,
+      uptime_seconds: uptime,
+      errors: Enum.filter([e1, e2, e3], & &1)
+    }
+  end
+
+  defp safe(fun, args, fallback) do
+    try do
+      {apply(fun, args), nil}
+    rescue
+      e -> {fallback, {fun_name(fun), Exception.message(e)}}
+    catch
+      :error, reason -> {fallback, {fun_name(fun), reason}}
+    end
+  end
+
+  defp fun_name(fun), do: Function.info(fun)[:name]
+
+  defp read_loadavg(proc_dir) do
+    body = File.read!(Path.join(proc_dir, "loadavg"))
+    [l1, l5, l15 | _] = String.split(body, ~r/\s+/, trim: true)
+    {to_float(l1), to_float(l5), to_float(l15)}
+  end
+
+  defp read_meminfo(proc_dir) do
+    body = File.read!(Path.join(proc_dir, "meminfo"))
+
+    parsed =
+      body
+      |> String.split("\n", trim: true)
+      |> Enum.reduce(%{}, fn line, acc ->
+        case String.split(line, ~r/:\s+/, parts: 2) do
+          [key, val] -> Map.put(acc, key, val)
+          _ -> acc
+        end
+      end)
+
+    %{
+      total: kb_to_bytes(parsed["MemTotal"]),
+      available: kb_to_bytes(parsed["MemAvailable"])
+    }
+  end
+
+  defp read_uptime(proc_dir) do
+    body = File.read!(Path.join(proc_dir, "uptime"))
+    [secs | _] = String.split(body, " ", trim: true)
+    secs |> to_float() |> trunc()
+  end
+
+  defp kb_to_bytes(nil), do: nil
+
+  defp kb_to_bytes(str) do
+    case Regex.run(~r/(\d+)\s*kB/, str) do
+      [_, kb] -> String.to_integer(kb) * 1024
+      _ -> nil
+    end
+  end
+
+  defp to_float(s) do
+    {f, _} = Float.parse(s)
+    f
+  end
+
+  defp hostname do
+    case :inet.gethostname() do
+      {:ok, name} -> List.to_string(name)
+      _ -> "unknown-host"
+    end
+  end
+end
+```
+
+- [ ] **Step 5: Run tests — expect pass**
+
+```bash
+cd agent && mix test test/agent/collectors/host_test.exs
+```
+
+Expected: 5 tests pass.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add agent/lib/agent/collectors agent/test/agent/collectors agent/test/fixtures/proc
+git commit -m "feat(agent): host collector for /proc loadavg, meminfo, uptime"
+```
+
+---
+
+## Task 12: Agent — Reporter (Slipstream Client)
+
+**Files:**
+- Create: `agent/lib/agent/reporter.ex`
+
+The Reporter is a Slipstream-backed GenServer. Unit-testing a real WS client is out of scope for Phase 1 — coverage comes from the end-to-end smoke test in Task 14.
+
+- [ ] **Step 1: Implement Reporter**
+
+Create `agent/lib/agent/reporter.ex`:
+
+```elixir
+defmodule Agent.Reporter do
+  @moduledoc """
+  Maintains a persistent Phoenix Channel connection to the server, joins
+  `host:<host_id>`, and pushes metric samples on the configured fast interval.
+  """
+
+  use Slipstream, restart: :permanent
+  require Logger
+
+  alias Agent.Collectors.Host
+
+  def start_link(%Agent.Config{} = cfg) do
+    Slipstream.start_link(__MODULE__, cfg, name: __MODULE__)
+  end
+
+  @impl Slipstream
+  def init(cfg) do
+    socket =
+      new_socket()
+      |> assign(:cfg, cfg)
+      |> assign(:topic, "host:" <> cfg.host_id)
+      |> connect!(uri: cfg.server_url)
+
+    {:ok, socket}
+  end
+
+  @impl Slipstream
+  def handle_connect(socket) do
+    topic = socket.assigns.topic
+    cfg = socket.assigns.cfg
+
+    payload = %{"token" => cfg.token, "agent_version" => Agent.version()}
+    Logger.info("reporter: connected, joining #{topic}")
+    {:ok, join(socket, topic, payload)}
+  end
+
+  @impl Slipstream
+  def handle_join(topic, _reply, socket) do
+    Logger.info("reporter: joined #{topic}")
+    send(self(), :collect_fast)
+    {:ok, socket}
+  end
+
+  @impl Slipstream
+  def handle_info(:collect_fast, socket) do
+    sample = Host.collect()
+    payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: sample}
+    :ok = push_metric(socket, "metric:fast", payload)
+    Process.send_after(self(), :collect_fast, socket.assigns.cfg.fast_seconds * 1000)
+    # Slipstream's handle_info follows GenServer conventions, so return :noreply.
+    {:noreply, socket}
+  end
+
+  @impl Slipstream
+  def handle_disconnect(reason, socket) do
+    Logger.warning("reporter: disconnected — #{inspect(reason)}; reconnecting")
+    reconnect(socket)
+  end
+
+  @impl Slipstream
+  def handle_topic_close(topic, reason, socket) do
+    Logger.warning("reporter: topic #{topic} closed: #{inspect(reason)}; rejoining")
+    rejoin(socket, topic)
+  end
+
+  defp push_metric(socket, event, payload) do
+    case push(socket, socket.assigns.topic, event, payload) do
+      {:ok, _ref} ->
+        :ok
+
+      {:error, reason} ->
+        Logger.warning("reporter: push failed: #{inspect(reason)}")
+        :ok
+    end
+  end
+end
+```
+
+- [ ] **Step 2: Compile**
+
+```bash
+cd agent && mix compile
+```
+
+Expected: compiles cleanly with no errors or warnings.
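+
+Before the end-to-end test in Task 14, the `metric:fast` payload shape can be sanity-checked from IEx without any server running. An illustrative snippet (it mirrors what `handle_info(:collect_fast, ...)` builds; not a required step):
+
+```elixir
+# In `cd agent && iex -S mix`: collect one sample and build the payload
+# the Reporter would push as "metric:fast".
+sample = Agent.Collectors.Host.collect()
+
+payload = %{
+  collected_at: DateTime.utc_now() |> DateTime.to_iso8601(),
+  data: sample
+}
+
+IO.inspect(payload, label: "metric:fast")
+```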
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add agent/lib/agent/reporter.ex
+git commit -m "feat(agent): slipstream reporter — join, push, auto-reconnect"
+```
+
+---
+
+## Task 13: Agent — Application Supervisor
+
+**Files:**
+- Modify: `agent/lib/agent/application.ex`
+- Create: `agent/config/config.exs`
+- Create: `agent/config/runtime.exs`
+
+- [ ] **Step 1: Replace application module**
+
+Replace `agent/lib/agent/application.ex` with:
+
+```elixir
+defmodule Agent.Application do
+  @moduledoc false
+  use Application
+  require Logger
+
+  @impl true
+  def start(_type, _args) do
+    children =
+      case load_config() do
+        {:ok, cfg} ->
+          Logger.info("agent: starting with host_id=#{cfg.host_id}")
+          [{Agent.Reporter, cfg}]
+
+        {:error, reason} ->
+          Logger.error("agent: no config loaded (#{inspect(reason)}); running in idle mode")
+          []
+      end
+
+    Supervisor.start_link(children, strategy: :one_for_one, name: Agent.Supervisor)
+  end
+
+  defp load_config do
+    path =
+      System.get_env("AGENT_CONFIG") ||
+        Application.get_env(:agent, :config_path, "/etc/proxmox-monitor/agent.toml")
+
+    case File.exists?(path) do
+      true -> Agent.Config.load(path)
+      false -> {:error, {:file_missing, path}}
+    end
+  end
+end
+```
+
+- [ ] **Step 2: Add minimal compile-time config**
+
+Create `agent/config/config.exs`:
+
+```elixir
+import Config
+
+config :logger, :default_formatter, format: "$time [$level] $message\n"
+
+if File.exists?(Path.join([__DIR__, "#{config_env()}.exs"])) do
+  import_config "#{config_env()}.exs"
+end
+```
+
+Create `agent/config/runtime.exs`:
+
+```elixir
+import Config
+
+if path = System.get_env("AGENT_CONFIG") do
+  config :agent, :config_path, path
+end
+```
+
+- [ ] **Step 3: Compile and run existing tests**
+
+```bash
+cd agent && mix compile && mix test
+```
+
+Expected: all tests pass. On cold boot with no config present, the app starts in idle mode (no crash).
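+
+For reference while reviewing this task: the supervisor calls `Agent.Config.load/1`, which is created in an earlier task not shown in this excerpt. A minimal sketch of what that module might look like, assuming the `toml` dependency from the tech stack (field names match the Reporter's usage; the real task may differ):
+
+```elixir
+defmodule Agent.Config do
+  @moduledoc "Loads and validates the agent's TOML config file."
+
+  @enforce_keys [:server_url, :token, :host_id]
+  defstruct [:server_url, :token, :host_id, fast_seconds: 30]
+
+  @spec load(Path.t()) :: {:ok, %__MODULE__{}} | {:error, term()}
+  def load(path) do
+    with {:ok, raw} <- Toml.decode_file(path) do
+      case raw do
+        %{"server_url" => url, "token" => token} ->
+          {:ok,
+           %__MODULE__{
+             server_url: url,
+             token: token,
+             # Per the concept: host_id is optional, falls back to hostname.
+             host_id: raw["host_id"] || default_host_id(),
+             fast_seconds: get_in(raw, ["intervals", "fast_seconds"]) || 30
+           }}
+
+        _ ->
+          {:error, {:missing_keys, path}}
+      end
+    end
+  end
+
+  defp default_host_id do
+    {:ok, name} = :inet.gethostname()
+    List.to_string(name)
+  end
+end
+```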
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add agent/lib/agent/application.ex agent/config
+git commit -m "feat(agent): supervisor boots reporter when config is present"
+```
+
+---
+
+## Task 14: End-to-End Smoke Test
+
+**Goal:** Prove the agent connects to a locally-running server, joins the channel, and the server logs an incoming `metric:fast` payload.
+
+**Files:**
+- Create: `/tmp/agent-local.toml` (ad-hoc, not committed)
+
+- [ ] **Step 1: Start the server**
+
+In terminal A:
+
+```bash
+cd /Users/cabele/claudeprojects/proxmox_monitor/server
+mix ecto.create
+mix ecto.migrate
+iex -S mix phx.server
+```
+
+Expected: `[info] Running ServerWeb.Endpoint with Bandit ... http://localhost:4000`
+
+- [ ] **Step 2: Register a host from the IEx shell in terminal A**
+
+```elixir
+iex> Server.Release.register_host("pve-dev-01")
+```
+
+Expected output:
+
+```
+Host 'pve-dev-01' registered (id=1).
+TOKEN: <32+ char string>
+Store this token NOW — it will never be shown again.
+```
+
+Copy the token for the next step.
+
+- [ ] **Step 3: Write a local agent config**
+
+In terminal B, with `<TOKEN>` from the previous step:
+
+```bash
+cat > /tmp/agent-local.toml <<EOF
+server_url = "ws://localhost:4000/socket/websocket"  # must match the agent socket path in endpoint.ex
+token = "<TOKEN>"
+host_id = "pve-dev-01"
+
+[intervals]
+fast_seconds = 30
+EOF
+```
+
+- [ ] **Step 4: Start the agent**
+
+In terminal B:
+
+```bash
+cd /Users/cabele/claudeprojects/proxmox_monitor/agent
+AGENT_CONFIG=/tmp/agent-local.toml iex -S mix
+```
+
+Expected log lines: `reporter: connected, joining host:pve-dev-01`, then `reporter: joined host:pve-dev-01`.
+
+- [ ] **Step 5: Watch the server log**
+
+In terminal A, expect a logged `metric:fast` payload roughly every 30 seconds.
+
+- [ ] **Step 6: Verify the host row in the IEx shell in terminal A**
+
+```elixir
+iex> Server.Repo.get_by(Server.Schema.Host, name: "pve-dev-01") |> Map.take([:status, :agent_version, :last_seen_at])
+```
+
+Expected: `%{status: "online", agent_version: "0.1.0", last_seen_at: ~U[...]}`.
+
+- [ ] **Step 7: Verify terminate marks host offline**
+
+Stop the agent in terminal B with `Ctrl+C, a`. Re-run the query from Step 6.
+
+Expected: `status: "offline"`, `last_seen_at` preserved from the last online stamp.
+
+- [ ] **Step 8: Clean up the temp file**
+
+```bash
+rm /tmp/agent-local.toml
+```
+
+No code changes — no commit needed. Phase 1 is functionally complete.
+
+---
+
+## Phase 1 Exit Criteria
+
+- Monorepo with `server/` and `agent/` each building clean.
+- `cd server && mix test` — all green.
+- `cd agent && mix test` — all green.
+- Manual smoke test in Task 14 — agent joins channel, server logs metrics, host status transitions online→offline on disconnect.
+- All commits on `main`.
+
+Next up (Phase 2): metric persistence in SQLite, ZFS collector, VM collector, Storage collector. See roadmap in `proxmox-monitor-konzept.md`.
diff --git a/proxmox-monitor-konzept.md b/proxmox-monitor-konzept.md
new file mode 100644
index 0000000..831fa99
--- /dev/null
+++ b/proxmox-monitor-konzept.md
@@ -0,0 +1,418 @@
+# Proxmox Monitor — Concept
+
+An agent-server application for monitoring Proxmox hosts, focused on ZFS health and VM overview. Implemented in Elixir/OTP.
+
+## Design Principles
+
+- **KISS**: every decision favors the simpler solution, as long as it works.
+- **YAGNI**: features are built only when they are concretely needed, not preemptively.
+- **Read-only**: the agent runs no mutating commands on the Proxmox host.
+- **Push architecture**: agents initiate the connection to the server (NAT-friendly).
+
+---
+
+## Architecture Overview
+
+```
+┌──────────────────┐         ┌──────────────────────┐
+│  Proxmox host 1  │         │     Server (LXC)     │
+│  ┌────────────┐  │         │  in the datacenter   │
+│  │   Agent    │──┼──WSS─┐  │                      │
+│  └────────────┘  │      │  │  ┌───────────────┐   │
+└──────────────────┘      └─▶│  │     Caddy     │   │
+                             │  │ Reverse Proxy │   │
+┌──────────────────┐         │  └───────┬───────┘   │
+│  Proxmox host 2  │         │          │           │
+│  ┌────────────┐  │         │  ┌───────▼───────┐   │
+│  │   Agent    │──┼──WSS───▶│  │    Phoenix    │   │
+│  └────────────┘  │         │  │   LiveView    │   │
+└──────────────────┘         │  └───────┬───────┘   │
+        ...                  │          │           │
+┌──────────────────┐         │  ┌───────▼───────┐   │
+│  Proxmox host N  │         │  │    SQLite     │   │
+│  ┌────────────┐  │         │  └───────────────┘   │
+│  │   Agent    │──┼──WSS───▶│                      │
+│  └────────────┘  │         │                      │
+└──────────────────┘         └──────────────────────┘
+```
+
+## Technology Stack
+
+| Component | Technology | Rationale |
+|-----------|------------|-----------|
+| Agent | Elixir + Burrito | Standalone binary, no Erlang installation needed on Proxmox |
+| Server | Phoenix + LiveView | Realtime UI without a separate frontend |
+| Transport | Phoenix Channels (WSS) | Persistent connection, auto-reconnect, offline detection for free |
+| Database | SQLite + Ecto | Entirely sufficient for ~20 hosts, no separate DB instance |
+| TLS / reverse proxy | Caddy (already in place) | Automatic Let's Encrypt |
+| Deployment (server) | LXC container on Proxmox in the datacenter | Low overhead, clean isolation |
+| Deployment (agent) | systemd service | Standard on Debian/Proxmox |
+
+## System Requirements
+
+- **Proxmox hosts**: Proxmox VE 8.3+ with OpenZFS 2.3+ (for `-j` JSON output)
+- **Server**: LXC or VM with sufficient RAM (1 GB is enough), Debian/Ubuntu
+- **Network**: the server must be publicly reachable over HTTPS (via Caddy)
+
+---
+
+## Agent
+
+### Responsibilities
+
+The agent runs on every Proxmox host, collects metrics at fixed intervals, and sends them to the server. It holds a persistent WebSocket connection to the server via Phoenix Channels.
+
+### Collection Intervals
+
+Not everything needs to be collected equally often. Data with a high rate of change is polled more frequently, static information less often.
+
+| Interval | Data |
+|----------|------|
+| 30 seconds | Host metrics, VM runtime status, ZFS pool status, storage utilization |
+| 5 minutes | Snapshots, dataset list, VM config, guest-agent IPs |
+| 30 minutes | Proxmox version, pending APT updates, ZFS version |
+
+### Data to Collect
+
+**Host metrics** (from `/proc` and `uptime`)
+- CPU utilization (%), load average (1/5/15)
+- RAM: used, total, available
+- Uptime in seconds
+- Root filesystem: used, total
+- Hostname, kernel version
+
+**Proxmox storage** (`pvesh get /nodes/<node>/storage --output-format json`)
+- All configured storages (ZFS, NFS, local, etc.)
+- Type, status (active/inactive), used, total
+- Content types (images, backup, iso, ...)
+
+**ZFS pools** (`zpool status -j --json-flat-vdevs --json-int`, `zpool list -j --json-int`)
+- Raw JSON (stored in full for later analysis)
+- Plus an extracted summary:
+  - Pool name, health state
+  - Size, allocated, free (bytes)
+  - Fragmentation (%), capacity (%)
+  - Error counters (read/write/checksum)
+  - Scrub status, last successful scrub
+  - Number of vdevs, number of degraded vdevs
+
+**ZFS datasets & snapshots** (`zfs list -j --json-int`)
+- Raw JSON (datasets and snapshots)
+- Snapshot summary per dataset:
+  - Number of snapshots
+  - Age of the oldest / newest snapshot
+  - Total space (usedbysnapshots)
+
+**Virtual machines & LXC containers** (`pvesh`)
+- Static (from config): VMID, name, type (qemu/lxc), cores, RAM (max), disks (with storage backend), tags, autostart
+- Dynamic (from runtime status): status, uptime, CPU utilization, RAM usage, disk I/O, network I/O
+- Via guest agent (QEMU, optional): IP addresses of all interfaces, OS info, in-guest disk usage
+- LXC: IPs directly from the container config
+
+**System info**
+- Proxmox version (`pveversion`)
+- Number of available updates (`apt list --upgradable 2>/dev/null | wc -l`)
+- ZFS version
+- Agent version (compiled in)
+
+### Configuration
+
+File: `/etc/proxmox-monitor/agent.toml`, permissions `0600`, owned by `root`.
+
+```toml
+server_url = "wss://monitor.example.com/socket/websocket"
+token = "agent_abc123xyz..."
+host_id = "pve-host-01"  # optional, defaults to the hostname
+
+[intervals]
+fast_seconds = 30
+medium_seconds = 300
+slow_seconds = 1800
+```
+
+### Deployment
+
+```bash
+# One-time setup on a Proxmox host
+scp proxmox-monitor-agent root@pve-host-01:/usr/local/bin/
+scp agent.toml root@pve-host-01:/etc/proxmox-monitor/
+scp proxmox-monitor-agent.service root@pve-host-01:/etc/systemd/system/
+ssh root@pve-host-01 "systemctl enable --now proxmox-monitor-agent"
+```
+
+The agent must run as root, since `zpool status` on degraded pools and some ZFS details require root privileges. This is acceptable because the agent runs exclusively read-only commands and accepts no inbound connections.
+
+---
+
+## Server
+
+### Responsibilities
+
+- Accepts connections from agents and authenticates them
+- Stores metrics in SQLite
+- Serves the LiveView dashboard
+- Cleans up old metrics (retention)
+
+### Data Model
+
+```elixir
+# hosts
+- id (PK)
+- name (unique)
+- token_hash (bcrypt)
+- created_at
+- last_seen_at
+- agent_version (nullable)
+- proxmox_version (nullable)
+- zfs_version (nullable)
+- status ("online" | "offline" | "never_connected")
+
+# metrics (time series, one row per sample)
+- id (PK)
+- host_id (FK)
+- collected_at (indexed)
+- interval_type ("fast" | "medium" | "slow")
+- payload (JSON)  # the entire sample content
+
+# Indexed for dashboard queries:
+# - (host_id, collected_at DESC) for "last N samples of a host"
+# - (collected_at) for retention cleanup
+```
+
+**Rationale for a JSON payload instead of normalized columns:**
+The data is almost always read in full per host/timestamp (for the detail view). Separate columns per metric would make schema evolution cumbersome. SQLite supports JSON operators (`->>`) for queries in case we ever need to filter on individual fields.
+
+### Phoenix Channels — Protocol
+
+**Join** (agent → server):
+```json
+{
+  "topic": "host:pve-host-01",
+  "payload": {
+    "token": "agent_abc123xyz...",
+    "agent_version": "0.1.0"
+  }
+}
+```
+
+The server validates the token against `hosts.token_hash`. On success: assign the `host_id` to the socket, update `last_seen_at`, set status to `online`.
+
+**Events** (agent → server):
+```json
+// Event: "metric:fast" | "metric:medium" | "metric:slow"
+{
+  "collected_at": "2026-04-21T12:34:56Z",
+  "data": { /* depends on the interval type */ }
+}
+```
+
+**Disconnect handling** (server):
+When the channel process dies (terminate callback), the host is immediately marked `offline`. No polling, no timeouts needed.
+
+### LiveView Pages
+
+**1. Overview (`/`)**
+- One card per host, grid layout
+- Status light: green (all OK) / yellow (warning) / red (critical) / gray (offline)
+- Critical = pool DEGRADED/FAULTED, capacity > 90%, agent offline
+- Warning = capacity 80-90%, stale snapshots, pending updates, overdue scrub
+- Visible per card: host name, CPU, RAM, uptime, pool status (summary), VM count
+- Sort/filter options: by status, by host name
+
+**2. Host detail (`/hosts/:name`)**
+- Header with Proxmox version, uptime, last contact
+- Tabs or sections:
+  - **Metrics**: CPU/RAM/load graphs over 24h (simple line charts with Chart.js or Contex)
+  - **ZFS pools**: one box per pool with health, capacity bar, fragmentation, error counters, scrub info, vdev list
+  - **Snapshots**: table per dataset with count and age of the oldest/newest snapshot
+  - **Storage**: table of all Proxmox storages with utilization
+  - **VMs/LXCs**: table with name, type, status, CPU/RAM utilization, IPs, tags
+
+**3. VM search (`/vms`)**
+- Global search across all hosts
+- Input: name or IP
+- Result: table with VM name, host, status, IP, resources
+- Killer feature with 20 hosts: "Where does `nginx-proxy` run?"
+
+**4. Host management (`/admin/hosts`)**
+- List of all hosts with status, last contact, agent version
+- "Add new host" button: form with a name → generates a token, shows install instructions
+- Per host: "revoke token" (sets a new token), "delete host"
+
+### Authentication
+
+**Web UI:**
+- Single user, login via password (Argon2 hash in an environment variable or secret config)
+- Minimal setup with a single session/cookie
+- No user-management UI needed
+
+**Agents:**
+- Shared token per agent
+- Token generation: 32 random bytes, Base64-URL-encoded
+- Stored in the DB as a bcrypt hash
+- Revoked by regenerating the token in the admin UI
+
+### Retention / Cleanup
+
+A GenServer runs hourly and deletes:
+- `metrics` older than 48 hours
+
+That is enough for "what happened in the last 2 days" analyses. At 20 hosts × one 30 s sample × 48 h ≈ 115,000 rows. Entirely unproblematic for SQLite.
+
+---
+
+## Project Structure
+
+Monorepo with two separate Mix projects:
+
+```
+proxmox_monitor/
+├── README.md
+├── konzept.md                     # this document
+│
+├── agent/                         # Mix project → Burrito binary
+│   ├── mix.exs
+│   ├── config/
+│   │   ├── config.exs
+│   │   └── runtime.exs
+│   ├── lib/
+│   │   ├── agent.ex
+│   │   ├── agent/
+│   │   │   ├── application.ex
+│   │   │   ├── config.ex
+│   │   │   ├── reporter.ex        # Phoenix Channel client
+│   │   │   ├── supervisor.ex
+│   │   │   └── collectors/
+│   │   │       ├── host.ex
+│   │   │       ├── storage.ex
+│   │   │       ├── zfs.ex
+│   │   │       ├── vms.ex
+│   │   │       └── system_info.ex
+│   │   └── agent/schema/          # structs for samples
+│   │       ├── sample.ex
+│   │       ├── pool_summary.ex
+│   │       └── ...
+│   ├── rel/
+│   │   └── burrito.exs
+│   └── test/
+│
+└── server/                        # Phoenix project
+    ├── mix.exs
+    ├── config/
+    ├── lib/
+    │   ├── server/
+    │   │   ├── application.ex
+    │   │   ├── hosts.ex           # context: host management
+    │   │   ├── metrics.ex         # context: metric storage
+    │   │   ├── retention.ex       # GenServer: cleanup
+    │   │   └── schema/
+    │   │       ├── host.ex
+    │   │       └── metric.ex
+    │   └── server_web/
+    │       ├── router.ex
+    │       ├── endpoint.ex
+    │       ├── channels/
+    │       │   ├── agent_socket.ex
+    │       │   └── host_channel.ex
+    │       ├── live/
+    │       │   ├── overview_live.ex
+    │       │   ├── host_detail_live.ex
+    │       │   ├── vm_search_live.ex
+    │       │   └── admin_hosts_live.ex
+    │       └── controllers/
+    │           └── auth_controller.ex
+    ├── priv/
+    │   └── repo/migrations/
+    └── test/
+```
+
+---
+
+## Security
+
+- **TLS enforced** via Caddy; the agent verifies the server certificate
+- **Tokens never in plaintext** in the DB (bcrypt) or in logs
+- **Agent config with permissions `0600`, owned by root**
+- **Rate limiting** on channel join via the Hammer library (e.g. 5 attempts/minute per IP)
+- **No inbound connections to the agent** — the agent always initiates
+- **Server URL pinned in the agent** (no redirects)
+- **Read-only on the Proxmox host** — the agent runs no mutating commands
+
+---
+
+## MVP Scope — What's In, What's Not
+
+### ✅ In the MVP
+
+- Agent collects host, ZFS, VM, storage, and system metrics
+- Persistent WebSocket connection via Phoenix Channels
+- SQLite storage with 48 h retention
+- Dashboard: overview, host detail, VM search, admin
+- Single-user auth for the web UI
+- Token-based agent auth
+- Burrito binary for the agent
+- LXC deployment for the server
+
+### ❌ Deliberately not in the MVP (YAGNI)
+
+- **Alerts via e-mail/Telegram/Gotify** — the dashboard shows colors, that's enough for now
+- **SMART values of the disks** — its own sub-project, later
+- **Backup task history** — elaborate parsing, later
+- **Cluster support** — not needed
+- **Agent self-update** — manual via scp is fine for 20 hosts
+- **Remote actions** (starting a VM, triggering a scrub) — read-only stays read-only
+- **Multi-user, RBAC** — single user is enough
+- **Long-term history (>48 h)** — separate concept (downsampling) if ever needed
+- **Mobile app** — a responsive web UI is enough
+
+---
+
+## Roadmap
+
+### Phase 1 — Grundgerüst (weeks 1-2)
+- Agent skeleton: one collector (host metrics), output to console
+- Server skeleton: Phoenix project, host schema, a simple channel that receives and logs data
+- End state: the agent connects to the server locally and pushes CPU metrics
+
+### Phase 2 — ZFS & VMs (weeks 3-4)
+- ZFS collector (pool status, list, snapshots)
+- VM collector (pvesh)
+- Storage collector
+- Server stores to SQLite, simple display route
+
+### Phase 3 — LiveView Dashboard (weeks 5-6)
+- Overview page with status lights
+- Host detail with all sections
+- VM search
+- Auth for the web UI
+
+### Phase 4 — Admin & Deployment (week 7)
+- Admin UI for host management
+- Retention GenServer
+- Burrito build for the agent
+- LXC deployment documentation
+
+### Phase 5 — Production Rollout (week 8)
+- Deployment to the server in the datacenter
+- Caddy configuration
+- Roll the agent out to 2-3 test hosts
+- After a successful test: rollout to all 20 hosts
+
+### After the MVP — possible next steps
+
+1. **Alerts** (once it's clear which thresholds make sense)
+2. **SMART monitoring** (for ZFS failure prediction)
+3. **Backup task tracking**
+4. **Long-term history** with downsampling
+
+---
+
+## Open Decisions / Notes for Later
+
+These points can be decided during implementation; they are not critical for the concept:
+
+- **Chart library**: Chart.js (JS, via a LiveView hook) vs. Contex (SVG, pure Elixir). Contex is "more elegant", Chart.js offers more options out of the box.
+- **Shared structs between agent/server**: duplicate at first. If that hurts, extract into a shared lib later.
+- **Exception handling for pvesh/zpool failures**: the failure of one data source must not discard the whole sample. Accept partial samples with an error flag.
+- **Logs**: structured (JSON) vs. text. Proposal: server structured for later analysis, agent text-based for easy journalctl debugging.
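+
+As an aside, the JSON-operator escape hatch mentioned in the data-model rationale can be exercised with a schemaless Ecto query. A hedged sketch against the planned Phase-2 `metrics` table (column and module names are assumptions based on this concept, not implemented code):
+
+```elixir
+import Ecto.Query
+
+# Recent fast samples whose load1 exceeds a threshold, using SQLite's
+# json_extract on the JSON payload column of the planned `metrics` table.
+high_load =
+  from m in "metrics",
+    where: m.interval_type == "fast",
+    where: fragment("json_extract(?, '$.data.load1') > ?", m.payload, 4.0),
+    select: %{host_id: m.host_id, collected_at: m.collected_at}
+
+# Server.Repo.all(high_load)
+```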