Phase 1: Foundation (Grundgerüst) Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Stand up a minimal agent+server pair where an Elixir agent running locally connects via Phoenix Channels to a Phoenix server, authenticates with a token, and pushes host load, memory, and uptime metrics every 30 seconds. The server only logs the incoming payloads in this phase.

Architecture: Monorepo with two independent Mix projects (server/ Phoenix+SQLite, agent/ plain OTP app using Slipstream). Agent initiates a persistent WSS connection, joins topic host:<name>, pushes metric:fast events. Server persists only hosts in Phase 1 — metric storage lands in Phase 2.
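
As a rough illustration of the wire contract described above, the sketch below builds the topic name and a metric:fast payload from plain maps. The module WireSketch and its function names are hypothetical; only the "host:" topic prefix, the event name, and the collected_at/data keys are taken from this plan.

```elixir
defmodule WireSketch do
  @moduledoc "Hypothetical helper illustrating the agent-to-server message shape."

  # Topic the agent joins: "host:" followed by the configured host name.
  def topic(host_name) when is_binary(host_name) and host_name != "" do
    "host:" <> host_name
  end

  # Payload for a "metric:fast" push: an ISO 8601 timestamp plus the raw sample.
  def fast_payload(sample, now \\ DateTime.utc_now()) when is_map(sample) do
    %{
      "collected_at" => DateTime.to_iso8601(now),
      "data" => sample
    }
  end
end

payload = WireSketch.fast_payload(%{"load1" => 0.42}, ~U[2026-04-21 12:00:00Z])
IO.inspect({WireSketch.topic("pve-01"), payload})
```

Passing the timestamp as an optional argument keeps the helper deterministic in tests while defaulting to the current UTC time in normal use.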

Tech Stack: Elixir 1.19 / OTP 28, Phoenix 1.7.14, Ecto + ecto_sqlite3, bcrypt_elixir (token hashing), slipstream (agent Channels client), toml (agent config), ExUnit.


File Structure

proxmox_monitor/
├── .gitignore
├── README.md
├── proxmox-monitor-konzept.md         (existing)
├── docs/superpowers/plans/2026-04-21-phase1-grundgeruest.md
│
├── server/                            (created by mix phx.new)
│   ├── mix.exs                        modify: add :bcrypt_elixir
│   ├── config/{config,dev,test,runtime}.exs   scaffolded
│   ├── priv/repo/migrations/<ts>_create_hosts.exs     create
│   ├── lib/server/application.ex      scaffolded
│   ├── lib/server/repo.ex             scaffolded
│   ├── lib/server/schema/host.ex      create
│   ├── lib/server/hosts.ex            create (context)
│   ├── lib/server/release.ex          create (IEx helper for host registration)
│   ├── lib/server_web/endpoint.ex     modify: add agent socket
│   ├── lib/server_web/channels/agent_socket.ex   create
│   ├── lib/server_web/channels/host_channel.ex   create
│   ├── test/server/hosts_test.exs     create
│   └── test/server_web/channels/host_channel_test.exs   create
│
└── agent/                             (created by mix new --sup)
    ├── mix.exs                        modify: deps + app config
    ├── config/config.exs              create
    ├── config/runtime.exs             create
    ├── lib/agent.ex                   scaffolded
    ├── lib/agent/application.ex       modify
    ├── lib/agent/config.ex            create
    ├── lib/agent/collectors/host.ex   create
    ├── lib/agent/reporter.ex          create
    ├── test/agent/config_test.exs     create
    ├── test/agent/collectors/host_test.exs   create
    ├── test/fixtures/agent.toml       create (sample config for the config tests)
    └── test/fixtures/proc/            create (loadavg, meminfo, uptime samples)

Each file has one responsibility: schema, context (business logic), channel (transport), collector (data acquisition), reporter (transmission). Test files mirror the source tree.


Task 1: Monorepo Init

Files:

  • Create: .gitignore

  • Create: README.md

  • Step 1: Write .gitignore (covers both Mix projects)

# Elixir/Mix
/server/_build/
/server/deps/
/server/cover/
/server/doc/
/server/.fetch
/server/erl_crash.dump
/server/*.ez
/server/priv/static/assets/
/server/priv/static/cache_manifest.json
/server/*.db
/server/*.db-journal
/server/*.db-wal
/server/*.db-shm

/agent/_build/
/agent/deps/
/agent/cover/
/agent/doc/
/agent/.fetch
/agent/erl_crash.dump
/agent/*.ez

# Editors / OS
.DS_Store
.vscode/
.idea/
  • Step 2: Write README.md (minimal)
# Proxmox Monitor

Agent-Server monitoring for Proxmox hosts. Elixir/OTP. See `proxmox-monitor-konzept.md`.

- `server/` — Phoenix + SQLite + LiveView
- `agent/` — Slipstream Channels client, deploys as Burrito binary

Phase 1 focuses on end-to-end metric push. Later phases add ZFS/VM collectors, persistence, LiveView dashboard.
  • Step 3: Initial commit
git add .gitignore README.md proxmox-monitor-konzept.md docs/
git commit -m "chore: project skeleton + phase-1 plan"

Task 2: Server — Phoenix Bootstrap

Files:

  • Create: entire server/ tree via mix phx.new

  • Step 1: Generate Phoenix project

Run from /Users/cabele/claudeprojects/proxmox_monitor:

mix phx.new server --database sqlite3 --no-mailer --no-gettext --live --install

If prompted, answer Y to fetch deps.

Expected: creates server/ with Phoenix scaffold, SQLite adapter, LiveView enabled, no Gettext, no Mailer. Deps fetched, assets installed.

  • Step 2: Verify scaffold builds and tests pass
cd server && mix compile && mix test

Expected: compiles clean, default PageControllerTest passes.

  • Step 3: Commit the scaffold
cd /Users/cabele/claudeprojects/proxmox_monitor
git add server/
git commit -m "feat(server): phoenix 1.7 scaffold with sqlite + liveview"

Task 3: Server — Bcrypt Dependency

Files:

  • Modify: server/mix.exs

  • Step 1: Add :bcrypt_elixir to deps

In server/mix.exs, locate the defp deps do list and add the line below alongside existing entries:

      {:bcrypt_elixir, "~> 3.1"},
  • Step 2: Fetch and compile
cd server && mix deps.get && mix compile

Expected: bcrypt_elixir and its dependency comeonin fetched; compile succeeds (bcrypt NIF builds).

  • Step 3: Commit
git add server/mix.exs server/mix.lock
git commit -m "feat(server): add bcrypt_elixir for token hashing"

Task 4: Server — Host Schema + Context (TDD)

Files:

  • Create: server/priv/repo/migrations/<ts>_create_hosts.exs

  • Create: server/lib/server/schema/host.ex

  • Create: server/lib/server/hosts.ex

  • Create: server/test/server/hosts_test.exs

  • Step 1: Generate migration file

cd server && mix ecto.gen.migration create_hosts

Fill the generated file (timestamped name) with:

defmodule Server.Repo.Migrations.CreateHosts do
  use Ecto.Migration

  def change do
    create table(:hosts) do
      add :name, :string, null: false
      add :token_hash, :string, null: false
      add :agent_version, :string
      add :proxmox_version, :string
      add :zfs_version, :string
      add :status, :string, null: false, default: "never_connected"
      add :last_seen_at, :utc_datetime_usec

      timestamps(type: :utc_datetime_usec)
    end

    create unique_index(:hosts, [:name])
  end
end
  • Step 2: Write schema module

Create server/lib/server/schema/host.ex:

defmodule Server.Schema.Host do
  use Ecto.Schema
  import Ecto.Changeset

  # Referenced as Host.t() in the @spec declarations of Server.Hosts.
  @type t :: %__MODULE__{}

  @statuses ~w(never_connected online offline)

  schema "hosts" do
    field :name, :string
    field :token_hash, :string
    field :agent_version, :string
    field :proxmox_version, :string
    field :zfs_version, :string
    field :status, :string, default: "never_connected"
    field :last_seen_at, :utc_datetime_usec

    timestamps(type: :utc_datetime_usec)
  end

  def create_changeset(host, attrs) do
    host
    |> cast(attrs, [:name, :token_hash])
    |> validate_required([:name, :token_hash])
    |> validate_length(:name, min: 1, max: 100)
    |> unique_constraint(:name)
  end

  def status_changeset(host, attrs) do
    host
    |> cast(attrs, [:status, :last_seen_at, :agent_version])
    |> validate_inclusion(:status, @statuses)
  end
end
  • Step 3: Write failing tests for the context

Create server/test/server/hosts_test.exs:

defmodule Server.HostsTest do
  use Server.DataCase, async: true

  alias Server.Hosts

  describe "create_host/1" do
    test "returns host and a plaintext token on success" do
      assert {:ok, {host, token}} = Hosts.create_host("pve-01")
      assert host.name == "pve-01"
      assert host.status == "never_connected"
      assert is_binary(token) and byte_size(token) >= 32
      refute host.token_hash == token
    end

    test "rejects duplicate names" do
      {:ok, _} = Hosts.create_host("pve-01")
      assert {:error, changeset} = Hosts.create_host("pve-01")
      assert %{name: ["has already been taken"]} = errors_on(changeset)
    end
  end

  describe "authenticate/2" do
    test "returns host for valid name+token" do
      {:ok, {host, token}} = Hosts.create_host("pve-01")
      assert {:ok, found} = Hosts.authenticate("pve-01", token)
      assert found.id == host.id
    end

    test "returns :invalid_token for wrong token" do
      {:ok, {_host, _token}} = Hosts.create_host("pve-01")
      assert {:error, :invalid_token} = Hosts.authenticate("pve-01", "wrong")
    end

    test "returns :unknown_host when name does not exist" do
      assert {:error, :unknown_host} = Hosts.authenticate("nope", "whatever")
    end
  end

  describe "mark_online/2 and mark_offline/1" do
    test "mark_online stamps status, last_seen_at, agent_version" do
      {:ok, {host, _}} = Hosts.create_host("pve-01")
      assert {:ok, updated} = Hosts.mark_online(host, "0.1.0")
      assert updated.status == "online"
      assert updated.agent_version == "0.1.0"
      assert updated.last_seen_at != nil
    end

    test "mark_offline sets status to offline" do
      {:ok, {host, _}} = Hosts.create_host("pve-01")
      {:ok, online} = Hosts.mark_online(host, "0.1.0")
      assert {:ok, offline} = Hosts.mark_offline(online)
      assert offline.status == "offline"
    end
  end
end
  • Step 4: Run tests — expect failure
cd server && mix test test/server/hosts_test.exs

Expected: every test fails with an UndefinedFunctionError (module Server.Hosts is not available) or a similar error.

  • Step 5: Implement the context

Create server/lib/server/hosts.ex:

defmodule Server.Hosts do
  @moduledoc "Host registration, authentication, status tracking."

  alias Server.Repo
  alias Server.Schema.Host

  @spec create_host(String.t()) :: {:ok, {Host.t(), String.t()}} | {:error, Ecto.Changeset.t()}
  def create_host(name) do
    token = generate_token()
    hash = Bcrypt.hash_pwd_salt(token)

    %Host{}
    |> Host.create_changeset(%{name: name, token_hash: hash})
    |> Repo.insert()
    |> case do
      {:ok, host} -> {:ok, {host, token}}
      {:error, cs} -> {:error, cs}
    end
  end

  @spec authenticate(String.t(), String.t()) ::
          {:ok, Host.t()} | {:error, :unknown_host | :invalid_token}
  def authenticate(name, token) when is_binary(name) and is_binary(token) do
    case Repo.get_by(Host, name: name) do
      nil ->
        Bcrypt.no_user_verify()
        {:error, :unknown_host}

      host ->
        if Bcrypt.verify_pass(token, host.token_hash) do
          {:ok, host}
        else
          {:error, :invalid_token}
        end
    end
  end

  @spec mark_online(Host.t(), String.t() | nil) :: {:ok, Host.t()} | {:error, Ecto.Changeset.t()}
  def mark_online(%Host{} = host, agent_version) do
    host
    |> Host.status_changeset(%{
      status: "online",
      last_seen_at: DateTime.utc_now(),
      agent_version: agent_version
    })
    |> Repo.update()
  end

  @spec mark_offline(Host.t()) :: {:ok, Host.t()} | {:error, Ecto.Changeset.t()}
  def mark_offline(%Host{} = host) do
    host
    |> Host.status_changeset(%{status: "offline"})
    |> Repo.update()
  end

  @doc "Mark every host offline — called on server boot to clear stale online flags."
  @spec mark_all_offline() :: {integer(), nil}
  def mark_all_offline do
    import Ecto.Query
    Repo.update_all(from(h in Host), set: [status: "offline", updated_at: DateTime.utc_now()])
  end

  defp generate_token do
    :crypto.strong_rand_bytes(32) |> Base.url_encode64(padding: false)
  end
end
  • Step 6: Speed up bcrypt in tests

In server/config/test.exs, add the following line at top level (anywhere after import Config):

config :bcrypt_elixir, :log_rounds, 4
  • Step 7: Run tests — expect all pass
cd server && mix ecto.reset && mix test test/server/hosts_test.exs

Expected: 7 tests pass.

  • Step 8: Commit
git add server/priv server/lib/server server/test/server server/config/test.exs
git commit -m "feat(server): host schema, context, auth, status transitions"
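
The token format used in this task can be illustrated without the database or bcrypt. The sketch below only shows how 32 random bytes become a URL-safe token; TokenSketch is a hypothetical module, and hashing and verification remain with Bcrypt as in Server.Hosts.

```elixir
defmodule TokenSketch do
  # 32 random bytes from the OS CSPRNG, encoded as unpadded URL-safe Base64.
  def generate do
    :crypto.strong_rand_bytes(32) |> Base.url_encode64(padding: false)
  end
end

token = TokenSketch.generate()

# 32 bytes encode to 43 Base64 characters once the "=" padding is dropped,
# which satisfies the `byte_size(token) >= 32` assertion in the tests.
IO.puts("token length: #{byte_size(token)}")

# The token decodes back to the original 32 bytes.
{:ok, raw} = Base.url_decode64(token, padding: false)
IO.puts("raw bytes: #{byte_size(raw)}")
```

Because only the bcrypt hash is stored, the plaintext token is available exactly once, at creation time, which is why create_host/1 returns it alongside the host.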

Task 5: Server — AgentSocket + Mark-All-Offline on Boot

Files:

  • Create: server/lib/server_web/channels/agent_socket.ex

  • Modify: server/lib/server_web/endpoint.ex

  • Modify: server/lib/server/application.ex

  • Step 1: Write AgentSocket

Create server/lib/server_web/channels/agent_socket.ex:

defmodule ServerWeb.AgentSocket do
  @moduledoc "Entry socket for agents. Actual authentication happens in HostChannel.join/3."
  use Phoenix.Socket

  channel "host:*", ServerWeb.HostChannel

  @impl true
  def connect(_params, socket, _connect_info), do: {:ok, socket}

  @impl true
  def id(_socket), do: nil
end
  • Step 2: Mount the socket in the endpoint

In server/lib/server_web/endpoint.ex, find the existing socket "/live" line and add just below it:

  socket "/socket", ServerWeb.AgentSocket,
    websocket: [timeout: 45_000],
    longpoll: false
  • Step 3: Clear stale online flags on boot

In server/lib/server/application.ex, find the existing start/2 function. It currently ends with something like:

    opts = [strategy: :one_for_one, name: Server.Supervisor]
    Supervisor.start_link(children, opts)
  end

Replace those two lines with:

    opts = [strategy: :one_for_one, name: Server.Supervisor]
    result = Supervisor.start_link(children, opts)
    with {:ok, _} <- result, do: Server.Hosts.mark_all_offline()
    result
  end

Rationale: if the server is restarted while agents were connected, their online row persists stale. Marking everything offline on boot lets the agent's next channel join flip it back to online cleanly.

  • Step 4: Compile to verify
cd server && mix compile

Expected: compiles without warnings about ServerWeb.HostChannel. The module does not exist yet, which is acceptable because channel/2 only records the module name; the module itself is created in Task 6.

  • Step 5: Commit
git add server/lib/server_web/channels/agent_socket.ex server/lib/server_web/endpoint.ex server/lib/server/application.ex
git commit -m "feat(server): agent socket endpoint, clear online status on boot"
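
The boot hook in Step 3 relies on a small pattern: run a side effect only when startup succeeded, and return the original result unchanged. A minimal sketch of that pattern in isolation, with no Ecto or supervisor involved (BootHookSketch and its arguments are hypothetical stand-ins):

```elixir
defmodule BootHookSketch do
  # Runs `hook` only when startup succeeded, and always returns the original
  # startup result so the caller sees exactly what the start function returned.
  def start(start_fun, hook) when is_function(start_fun, 0) and is_function(hook, 0) do
    result = start_fun.()
    with {:ok, _pid} <- result, do: hook.()
    result
  end
end

parent = self()
hook = fn -> send(parent, :hook_ran) end

# Successful start: the hook runs and {:ok, pid} is passed through.
{:ok, _pid} = BootHookSketch.start(fn -> {:ok, self()} end, hook)

# Failed start: the hook is skipped and the error is passed through.
{:error, :boom} = BootHookSketch.start(fn -> {:error, :boom} end, hook)
```

In the real application the hook is Server.Hosts.mark_all_offline/0, so it needs the Repo to be started and the hosts table to exist by the time the supervisor returns.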

Task 6: Server — HostChannel (TDD)

Files:

  • Create: server/lib/server_web/channels/host_channel.ex

  • Create: server/test/server_web/channels/host_channel_test.exs

  • Verify: server/test/support/channel_case.ex (required by the channel tests; not modified)

  • Step 1: Confirm ChannelCase exists

ls server/test/support/channel_case.ex

Expected: the file exists. If it is missing (recent phx.new scaffolds may omit it until a channel is generated), create it before continuing, because the tests below depend on ServerWeb.ChannelCase.

  • Step 2: Write failing channel tests

Create server/test/server_web/channels/host_channel_test.exs:

defmodule ServerWeb.HostChannelTest do
  use ServerWeb.ChannelCase, async: false

  alias Server.Hosts
  alias ServerWeb.AgentSocket

  setup do
    {:ok, {host, token}} = Hosts.create_host("pve-01")
    %{host: host, token: token}
  end

  describe "join" do
    test "succeeds with valid token and marks host online", %{host: host, token: token} do
      {:ok, socket} = connect(AgentSocket, %{})

      assert {:ok, _reply, socket} =
               subscribe_and_join(socket, "host:pve-01", %{
                 "token" => token,
                 "agent_version" => "0.1.0"
               })

      assert socket.assigns.host_id == host.id

      reloaded = Server.Repo.reload!(host)
      assert reloaded.status == "online"
      assert reloaded.agent_version == "0.1.0"
      assert reloaded.last_seen_at != nil
    end

    test "rejects invalid token", %{host: _host} do
      {:ok, socket} = connect(AgentSocket, %{})

      assert {:error, %{reason: "invalid_token"}} =
               subscribe_and_join(socket, "host:pve-01", %{
                 "token" => "garbage",
                 "agent_version" => "0.1.0"
               })
    end

    test "rejects unknown host name" do
      {:ok, socket} = connect(AgentSocket, %{})

      assert {:error, %{reason: "unknown_host"}} =
               subscribe_and_join(socket, "host:nope", %{
                 "token" => "x",
                 "agent_version" => "0.1.0"
               })
    end

    test "rejects topic mismatch" do
      {:ok, socket} = connect(AgentSocket, %{})

      assert {:error, %{reason: "bad_topic"}} =
               subscribe_and_join(socket, "host:", %{"token" => "x", "agent_version" => "0.1.0"})
    end
  end

  describe "metric:fast event" do
    setup %{token: token} do
      {:ok, socket} = connect(AgentSocket, %{})

      {:ok, _reply, joined} =
        subscribe_and_join(socket, "host:pve-01", %{
          "token" => token,
          "agent_version" => "0.1.0"
        })

      %{socket: joined}
    end

    test "accepts metric payload and replies :ok", %{socket: socket} do
      ref =
        push(socket, "metric:fast", %{
          "collected_at" => "2026-04-21T12:00:00Z",
          "data" => %{"cpu_percent" => 12.3, "load1" => 0.2}
        })

      assert_reply ref, :ok
    end
  end

  describe "terminate" do
    test "marks host offline when channel process exits", %{host: host, token: token} do
      {:ok, socket} = connect(AgentSocket, %{})

      {:ok, _, joined} =
        subscribe_and_join(socket, "host:pve-01", %{
          "token" => token,
          "agent_version" => "0.1.0"
        })

      Process.unlink(joined.channel_pid)
      ref = Process.monitor(joined.channel_pid)
      close(joined)
      assert_receive {:DOWN, ^ref, :process, _, _}, 1_000

      reloaded = Server.Repo.reload!(host)
      assert reloaded.status == "offline"
    end
  end
end
  • Step 3: Run tests — expect failure (HostChannel not implemented)
cd server && mix test test/server_web/channels/host_channel_test.exs

Expected: the tests fail because ServerWeb.HostChannel is not available yet.

  • Step 4: Implement HostChannel

Create server/lib/server_web/channels/host_channel.ex:

defmodule ServerWeb.HostChannel do
  use ServerWeb, :channel
  require Logger

  alias Server.Hosts

  @impl true
  def join("host:" <> name, params, socket) when name != "" do
    token = Map.get(params, "token", "")
    agent_version = Map.get(params, "agent_version")

    case Hosts.authenticate(name, token) do
      {:ok, host} ->
        {:ok, _} = Hosts.mark_online(host, agent_version)
        Logger.info("agent joined host:#{name}")
        {:ok, assign(socket, :host_id, host.id) |> assign(:host_name, name)}

      {:error, :unknown_host} ->
        {:error, %{reason: "unknown_host"}}

      {:error, :invalid_token} ->
        {:error, %{reason: "invalid_token"}}
    end
  end

  def join(_topic, _params, _socket), do: {:error, %{reason: "bad_topic"}}

  @impl true
  def handle_in("metric:fast", payload, socket) do
    Logger.info("metric:fast host=#{socket.assigns.host_name} data=#{inspect(payload["data"])}")
    {:reply, :ok, socket}
  end

  def handle_in("metric:medium", payload, socket) do
    Logger.info("metric:medium host=#{socket.assigns.host_name} payload=#{inspect(payload)}")
    {:reply, :ok, socket}
  end

  def handle_in("metric:slow", payload, socket) do
    Logger.info("metric:slow host=#{socket.assigns.host_name} payload=#{inspect(payload)}")
    {:reply, :ok, socket}
  end

  @impl true
  def terminate(_reason, socket) do
    case socket.assigns[:host_id] do
      nil ->
        :ok

      id ->
        with host when not is_nil(host) <- Server.Repo.get(Server.Schema.Host, id) do
          Hosts.mark_offline(host)
        end

        :ok
    end
  end
end
  • Step 5: Run tests — expect pass
cd server && mix test test/server_web/channels/host_channel_test.exs

Expected: all tests pass.

  • Step 6: Run full test suite
cd server && mix test

Expected: all tests green.

  • Step 7: Commit
git add server/lib/server_web/channels/host_channel.ex server/test/server_web/channels/host_channel_test.exs
git commit -m "feat(server): host channel with token auth and metric events"
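
The two join/3 clauses in HostChannel dispatch purely on binary pattern matching plus a guard. The same technique, reduced to a standalone sketch (TopicSketch is a hypothetical module used only for illustration):

```elixir
defmodule TopicSketch do
  # The binary pattern strips the "host:" prefix; the guard rejects an empty name.
  def parse("host:" <> name) when name != "", do: {:ok, name}

  # Anything else, including the bare "host:" topic, is a bad topic.
  def parse(_topic), do: {:error, :bad_topic}
end

IO.inspect(TopicSketch.parse("host:pve-01"))
IO.inspect(TopicSketch.parse("host:"))
IO.inspect(TopicSketch.parse("metrics:pve-01"))
```

In the running server the third case would not reach HostChannel at all, since AgentSocket only routes "host:*" topics to it; the fallback clause therefore mainly covers the empty-name case exercised by the "rejects topic mismatch" test.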

Task 7: Server — Smoke-Test Helper

Files:

  • Create: server/lib/server/release.ex (minimal helper for IEx-driven host creation)

  • Step 1: Add a tiny release helper

Create server/lib/server/release.ex:

defmodule Server.Release do
  @moduledoc "Convenience functions for IEx and future release tasks."

  @doc "Create a host and print the plaintext token once."
  def register_host(name) do
    case Server.Hosts.create_host(name) do
      {:ok, {host, token}} ->
        IO.puts("Host '#{host.name}' registered (id=#{host.id}).")
        IO.puts("TOKEN: #{token}")
        IO.puts("Store this token NOW — it will never be shown again.")
        {:ok, host, token}

      {:error, cs} ->
        IO.puts("Failed to register host: #{inspect(cs.errors)}")
        {:error, cs}
    end
  end
end
  • Step 2: Compile
cd server && mix compile
  • Step 3: Commit
git add server/lib/server/release.ex
git commit -m "chore(server): iex helper for host registration"

Task 8: Agent — Mix Project Bootstrap

Files:

  • Create: agent/ directory tree via mix new

  • Step 1: Generate the OTP app

Run from /Users/cabele/claudeprojects/proxmox_monitor:

mix new agent --sup

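Expected: creates agent/ with mix.exs, lib/agent.ex, lib/agent/application.ex, test/. Caveat: mix new derives the module name Agent from the project name, and Agent is already an Elixir standard-library module, so the generator may abort with a "module name is already taken" error. In that case pass a different name via --module and rename the Agent.* modules in the following tasks to match.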

  • Step 2: Replace agent/mix.exs contents

Open agent/mix.exs and replace with:

defmodule Agent.MixProject do
  use Mix.Project

  @version "0.1.0"

  def project do
    [
      app: :agent,
      version: @version,
      elixir: "~> 1.17",
      start_permanent: Mix.env() == :prod,
      deps: deps(),
      elixirc_paths: elixirc_paths(Mix.env())
    ]
  end

  def application do
    [
      extra_applications: [:logger, :crypto],
      mod: {Agent.Application, []}
    ]
  end

  defp deps do
    [
      {:slipstream, "~> 1.1"},
      {:jason, "~> 1.4"},
      {:toml, "~> 0.7"}
    ]
  end

  defp elixirc_paths(:test), do: ["lib", "test/support"]
  defp elixirc_paths(_), do: ["lib"]
end
  • Step 3: Fetch deps and compile
cd agent && mix deps.get && mix compile

Expected: slipstream, mint_web_socket, jason, toml fetched; compile succeeds.

  • Step 4: Commit
cd /Users/cabele/claudeprojects/proxmox_monitor
git add agent/
git commit -m "feat(agent): otp app scaffold with slipstream + toml deps"

Task 9: Agent — Version Constant

Files:

  • Modify: agent/lib/agent.ex

  • Step 1: Replace the scaffolded Agent module

Replace the entire contents of agent/lib/agent.ex with:

defmodule Agent do
  @moduledoc "Top-level namespace. Exposes the compiled version for reporting."

  @version Mix.Project.config()[:version]

  @spec version() :: String.t()
  def version, do: @version
end
  • Step 2: Compile (optionally confirm in iex -S mix that Agent.version() returns "0.1.0")
cd agent && mix compile
  • Step 3: Commit
git add agent/lib/agent.ex
git commit -m "feat(agent): expose compile-time version"

Task 10: Agent — Config Module (TDD)

Files:

  • Create: agent/lib/agent/config.ex

  • Create: agent/test/agent/config_test.exs

  • Create: agent/test/fixtures/agent.toml (sample config used by test)

  • Step 1: Write a fixture config

Create agent/test/fixtures/agent.toml:

server_url = "wss://monitor.example.com/socket/websocket"
token = "test_token_123"
host_id = "pve-test-01"

[intervals]
fast_seconds = 15
medium_seconds = 120
slow_seconds = 600
  • Step 2: Write failing tests

Create agent/test/agent/config_test.exs:

defmodule Agent.ConfigTest do
  use ExUnit.Case, async: true

  alias Agent.Config

  @fixture Path.expand("../fixtures/agent.toml", __DIR__)

  describe "load/1" do
    test "parses required fields" do
      assert {:ok, cfg} = Config.load(@fixture)
      assert cfg.server_url == "wss://monitor.example.com/socket/websocket"
      assert cfg.token == "test_token_123"
      assert cfg.host_id == "pve-test-01"
      assert cfg.fast_seconds == 15
      assert cfg.medium_seconds == 120
      assert cfg.slow_seconds == 600
    end

    test "returns error for missing file" do
      assert {:error, {:file_read, _}} = Config.load("/does/not/exist.toml")
    end

    test "defaults host_id to system hostname when absent" do
      tmp = Path.join(System.tmp_dir!(), "agent_nohost.toml")

      File.write!(tmp, """
      server_url = "wss://x/socket/websocket"
      token = "t"
      """)

      on_exit(fn -> File.rm(tmp) end)

      assert {:ok, cfg} = Config.load(tmp)
      assert is_binary(cfg.host_id)
      assert cfg.host_id != ""
    end

    test "applies default intervals when [intervals] is absent" do
      tmp = Path.join(System.tmp_dir!(), "agent_nointervals.toml")

      File.write!(tmp, """
      server_url = "wss://x/socket/websocket"
      token = "t"
      host_id = "h"
      """)

      on_exit(fn -> File.rm(tmp) end)

      assert {:ok, cfg} = Config.load(tmp)
      assert cfg.fast_seconds == 30
      assert cfg.medium_seconds == 300
      assert cfg.slow_seconds == 1800
    end

    test "returns error when required keys missing" do
      tmp = Path.join(System.tmp_dir!(), "agent_bad.toml")
      File.write!(tmp, "token = \"t\"\n")
      on_exit(fn -> File.rm(tmp) end)
      assert {:error, {:missing_key, :server_url}} = Config.load(tmp)
    end
  end
end
  • Step 3: Run tests — expect failure
cd agent && mix test test/agent/config_test.exs

Expected: Agent.Config is not available.

  • Step 4: Implement the config loader

Create agent/lib/agent/config.ex:

defmodule Agent.Config do
  @moduledoc "Loads and validates the TOML agent config."

  defstruct [
    :server_url,
    :token,
    :host_id,
    fast_seconds: 30,
    medium_seconds: 300,
    slow_seconds: 1800
  ]

  @type t :: %__MODULE__{
          server_url: String.t(),
          token: String.t(),
          host_id: String.t(),
          fast_seconds: pos_integer(),
          medium_seconds: pos_integer(),
          slow_seconds: pos_integer()
        }

  @required ~w(server_url token)a

  @spec load(Path.t()) ::
          {:ok, t()}
          | {:error, {:file_read, term()} | {:parse, term()} | {:missing_key, atom()}}
  def load(path) do
    with {:ok, body} <- read_file(path),
         {:ok, parsed} <- parse_toml(body),
         :ok <- validate_required(parsed) do
      {:ok, build(parsed)}
    end
  end

  defp read_file(path) do
    case File.read(path) do
      {:ok, body} -> {:ok, body}
      {:error, reason} -> {:error, {:file_read, reason}}
    end
  end

  defp parse_toml(body) do
    case Toml.decode(body) do
      {:ok, map} -> {:ok, map}
      {:error, reason} -> {:error, {:parse, reason}}
    end
  end

  defp validate_required(map) do
    Enum.find_value(@required, :ok, fn key ->
      case Map.get(map, Atom.to_string(key)) do
        v when is_binary(v) and v != "" -> nil
        _ -> {:error, {:missing_key, key}}
      end
    end)
  end

  defp build(map) do
    intervals = Map.get(map, "intervals", %{})

    %__MODULE__{
      server_url: map["server_url"],
      token: map["token"],
      host_id: map["host_id"] || hostname(),
      fast_seconds: Map.get(intervals, "fast_seconds", 30),
      medium_seconds: Map.get(intervals, "medium_seconds", 300),
      slow_seconds: Map.get(intervals, "slow_seconds", 1800)
    }
  end

  defp hostname do
    case :inet.gethostname() do
      {:ok, name} -> List.to_string(name)
      _ -> "unknown-host"
    end
  end
end
  • Step 5: Run tests — expect pass
cd agent && mix test test/agent/config_test.exs

Expected: 5 tests pass.

  • Step 6: Commit
git add agent/lib/agent/config.ex agent/test/agent/config_test.exs agent/test/fixtures/agent.toml
git commit -m "feat(agent): toml config loader with defaults and validation"
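
The required-key check in Agent.Config depends on Enum.find_value/3 returning its default (:ok) when the callback yields nil for every key. A standalone sketch of that idiom follows; RequiredKeysSketch is a hypothetical module, and the key list is passed in as an argument instead of being a module attribute.

```elixir
defmodule RequiredKeysSketch do
  # Returns :ok when every required key maps to a non-empty string,
  # otherwise the error for the first offending key.
  def validate(map, required) when is_map(map) and is_list(required) do
    Enum.find_value(required, :ok, fn key ->
      case Map.get(map, Atom.to_string(key)) do
        # nil tells find_value/3 to keep looking at the next key.
        value when is_binary(value) and value != "" -> nil
        # The first truthy value stops the search and becomes the result.
        _ -> {:error, {:missing_key, key}}
      end
    end)
  end
end

required = [:server_url, :token]
IO.inspect(RequiredKeysSketch.validate(%{"server_url" => "wss://x", "token" => "t"}, required))
IO.inspect(RequiredKeysSketch.validate(%{"token" => "t"}, required))
```

Note that an empty string counts as missing, so a config with token = "" is rejected the same way as one without a token key.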

Task 11: Agent — Host Collector (TDD with /proc fixtures)

Files:

  • Create: agent/lib/agent/collectors/host.ex
  • Create: agent/test/agent/collectors/host_test.exs
  • Create: agent/test/fixtures/proc/loadavg
  • Create: agent/test/fixtures/proc/meminfo
  • Create: agent/test/fixtures/proc/uptime

The collector reads Linux /proc. Tests run on macOS too — they point the collector at fixture files instead.
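
The fixture approach amounts to passing the directory in as a parameter. A minimal sketch of the idea, assuming a hypothetical ProcDirSketch module that reads only loadavg and returns :error on any failure instead of raising:

```elixir
defmodule ProcDirSketch do
  # Reads the three load averages from `<proc_dir>/loadavg`.
  def load_averages(proc_dir \\ "/proc") do
    with {:ok, body} <- File.read(Path.join(proc_dir, "loadavg")),
         [l1, l5, l15 | _] <- String.split(body, ~r/\s+/, trim: true),
         {load1, _} <- Float.parse(l1),
         {load5, _} <- Float.parse(l5),
         {load15, _} <- Float.parse(l15) do
      {:ok, {load1, load5, load15}}
    else
      _ -> :error
    end
  end
end

# Build a throwaway fixture directory instead of touching the real /proc.
dir = Path.join(System.tmp_dir!(), "proc_sketch_#{System.unique_integer([:positive])}")
File.mkdir_p!(dir)
File.write!(Path.join(dir, "loadavg"), "0.42 0.55 0.31 3/512 12345\n")

IO.inspect(ProcDirSketch.load_averages(dir))
File.rm_rf!(dir)
```

The same call with no argument reads the real /proc on Linux and returns :error on systems without it, such as macOS.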

  • Step 1: Write fixture files

Create agent/test/fixtures/proc/loadavg:

0.42 0.55 0.31 3/512 12345

Create agent/test/fixtures/proc/meminfo:

MemTotal:       16384000 kB
MemFree:         2048000 kB
MemAvailable:    8192000 kB
Buffers:          256000 kB
Cached:          4096000 kB
SwapTotal:       4194304 kB
SwapFree:        4194304 kB

Create agent/test/fixtures/proc/uptime:

123456.78 987654.32
  • Step 2: Write failing tests

Create agent/test/agent/collectors/host_test.exs:

defmodule Agent.Collectors.HostTest do
  use ExUnit.Case, async: true

  alias Agent.Collectors.Host

  @proc Path.expand("../../fixtures/proc", __DIR__)

  test "collects load average" do
    sample = Host.collect(proc_dir: @proc)
    assert sample.load1 == 0.42
    assert sample.load5 == 0.55
    assert sample.load15 == 0.31
  end

  test "collects memory in bytes" do
    sample = Host.collect(proc_dir: @proc)
    assert sample.mem_total_bytes == 16_384_000 * 1024
    assert sample.mem_available_bytes == 8_192_000 * 1024
    assert sample.mem_used_bytes == sample.mem_total_bytes - sample.mem_available_bytes
  end

  test "collects uptime seconds" do
    sample = Host.collect(proc_dir: @proc)
    assert sample.uptime_seconds == 123_456
  end

  test "includes hostname string" do
    sample = Host.collect(proc_dir: @proc)
    assert is_binary(sample.hostname)
    assert sample.hostname != ""
  end

  test "missing proc files populate :errors instead of crashing" do
    sample = Host.collect(proc_dir: "/nonexistent/path/xyz")
    assert sample.errors != []
  end
end
  • Step 3: Run tests — expect failure
cd agent && mix test test/agent/collectors/host_test.exs

Expected: Agent.Collectors.Host is not available.

  • Step 4: Implement collector

Create agent/lib/agent/collectors/host.ex:

defmodule Agent.Collectors.Host do
  @moduledoc """
  Reads host metrics from /proc. Accepts `proc_dir:` option for testability.
  Never raises — on read failure, populates `:errors` and leaves the field nil.
  """

  @type sample :: %{
          hostname: String.t(),
          load1: float() | nil,
          load5: float() | nil,
          load15: float() | nil,
          mem_total_bytes: non_neg_integer() | nil,
          mem_available_bytes: non_neg_integer() | nil,
          mem_used_bytes: non_neg_integer() | nil,
          uptime_seconds: non_neg_integer() | nil,
          errors: [term()]
        }

  @spec collect(keyword()) :: sample()
  def collect(opts \\ []) do
    proc_dir = Keyword.get(opts, :proc_dir, "/proc")

    {load, e1} = safe(&read_loadavg/1, [proc_dir], {nil, nil, nil})
    {mem, e2} = safe(&read_meminfo/1, [proc_dir], %{total: nil, available: nil})
    {uptime, e3} = safe(&read_uptime/1, [proc_dir], nil)

    total = mem.total
    avail = mem.available
    used = if total && avail, do: total - avail, else: nil
    {load1, load5, load15} = load

    %{
      hostname: hostname(),
      load1: load1,
      load5: load5,
      load15: load15,
      mem_total_bytes: total,
      mem_available_bytes: avail,
      mem_used_bytes: used,
      uptime_seconds: uptime,
      errors: Enum.filter([e1, e2, e3], & &1)
    }
  end

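  defp safe(fun, args, fallback) do
    # Errors are stored as plain strings so the sample stays JSON-encodable
    # when it is pushed over the channel (Jason cannot encode tuples).
    try do
      {apply(fun, args), nil}
    rescue
      e -> {fallback, "#{fun_name(fun)}: #{Exception.message(e)}"}
    catch
      # Also trap throws and exits so collect/1 really never raises.
      kind, reason -> {fallback, "#{fun_name(fun)}: #{kind} #{inspect(reason)}"}
    end
  end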

  defp fun_name(fun), do: Function.info(fun)[:name]

  defp read_loadavg(proc_dir) do
    body = File.read!(Path.join(proc_dir, "loadavg"))
    [l1, l5, l15 | _] = String.split(body, ~r/\s+/, trim: true)
    {to_float(l1), to_float(l5), to_float(l15)}
  end

  defp read_meminfo(proc_dir) do
    body = File.read!(Path.join(proc_dir, "meminfo"))

    parsed =
      body
      |> String.split("\n", trim: true)
      |> Enum.reduce(%{}, fn line, acc ->
        case String.split(line, ~r/:\s+/, parts: 2) do
          [key, val] -> Map.put(acc, key, val)
          _ -> acc
        end
      end)

    %{
      total: kb_to_bytes(parsed["MemTotal"]),
      available: kb_to_bytes(parsed["MemAvailable"])
    }
  end

  defp read_uptime(proc_dir) do
    body = File.read!(Path.join(proc_dir, "uptime"))
    [secs | _] = String.split(body, " ", trim: true)
    secs |> to_float() |> trunc()
  end

  defp kb_to_bytes(nil), do: nil

  defp kb_to_bytes(str) do
    case Regex.run(~r/(\d+)\s*kB/, str) do
      [_, kb] -> String.to_integer(kb) * 1024
      _ -> nil
    end
  end

  defp to_float(s) do
    {f, _} = Float.parse(s)
    f
  end

  defp hostname do
    case :inet.gethostname() do
      {:ok, name} -> List.to_string(name)
      _ -> "unknown-host"
    end
  end
end
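The kB-to-bytes conversion and the derived "used" figure can be checked in isolation. The following is a minimal sketch, not part of the agent: `MeminfoSketch` is a hypothetical module name that copies the parsing logic above so it can run against an inline sample instead of a fixture file.

```elixir
# Hypothetical standalone module; mirrors read_meminfo/1 and kb_to_bytes/1.
defmodule MeminfoSketch do
  def parse(body) do
    fields =
      body
      |> String.split("\n", trim: true)
      |> Enum.reduce(%{}, fn line, acc ->
        case String.split(line, ~r/:\s+/, parts: 2) do
          [key, val] -> Map.put(acc, key, val)
          _ -> acc
        end
      end)

    total = kb_to_bytes(fields["MemTotal"])
    available = kb_to_bytes(fields["MemAvailable"])
    # "used" is only meaningful when both inputs were parsed.
    used = if total && available, do: total - available, else: nil

    %{total: total, available: available, used: used}
  end

  defp kb_to_bytes(nil), do: nil

  defp kb_to_bytes(str) do
    case Regex.run(~r/(\d+)\s*kB/, str) do
      [_, kb] -> String.to_integer(kb) * 1024
      _ -> nil
    end
  end
end

# Made-up sample values, chosen for easy arithmetic.
sample = """
MemTotal:       16384000 kB
MemFree:         1024000 kB
MemAvailable:    8192000 kB
"""

IO.inspect(MeminfoSketch.parse(sample))
```

For this sample the total is 16384000 × 1024 = 16,777,216,000 bytes, available is 8,388,608,000 bytes, and used is their difference, 8,388,608,000 bytes. If `MemAvailable` is absent, `available` and `used` both come back as `nil`.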
  • Step 5: Run tests — expect pass
cd agent && mix test test/agent/collectors/host_test.exs

Expected: 5 tests pass.

  • Step 6: Commit
git add agent/lib/agent/collectors agent/test/agent/collectors agent/test/fixtures/proc
git commit -m "feat(agent): host collector for /proc loadavg, meminfo, uptime"

Task 12: Agent — Reporter (Slipstream Client)

Files:

  • Create: agent/lib/agent/reporter.ex

The Reporter is a Slipstream-backed GenServer. Unit-testing a real WS client is out of scope for Phase 1 — coverage comes from the end-to-end smoke test in Task 14.

  • Step 1: Implement Reporter

Create agent/lib/agent/reporter.ex:

defmodule Agent.Reporter do
  @moduledoc """
  Maintains a persistent Phoenix Channel connection to the server, joins
  `host:<host_id>`, and pushes metric samples on the configured fast interval.
  """

  use Slipstream, restart: :permanent
  require Logger

  alias Agent.Collectors.Host

  def start_link(%Agent.Config{} = cfg) do
    Slipstream.start_link(__MODULE__, cfg, name: __MODULE__)
  end

  @impl Slipstream
  def init(cfg) do
    socket =
      new_socket()
      |> assign(:cfg, cfg)
      |> assign(:topic, "host:" <> cfg.host_id)
      |> connect!(uri: cfg.server_url)

    {:ok, socket}
  end

  @impl Slipstream
  def handle_connect(socket) do
    topic = socket.assigns.topic
    cfg = socket.assigns.cfg

    payload = %{"token" => cfg.token, "agent_version" => Agent.version()}
    Logger.info("reporter: connected, joining #{topic}")
    {:ok, join(socket, topic, payload)}
  end

  @impl Slipstream
  def handle_join(topic, _reply, socket) do
    Logger.info("reporter: joined #{topic}")
    send(self(), :collect_fast)
    {:ok, socket}
  end

  @impl Slipstream
  def handle_info(:collect_fast, socket) do
    sample = Host.collect()
    payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: sample}
    :ok = push_metric(socket, "metric:fast", payload)
    Process.send_after(self(), :collect_fast, socket.assigns.cfg.fast_seconds * 1000)
    # handle_info/2 follows GenServer conventions, so it returns :noreply.
    {:noreply, socket}
  end

  @impl Slipstream
  def handle_disconnect(reason, socket) do
    Logger.warning("reporter: disconnected — #{inspect(reason)}; reconnecting")
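    # reconnect/1 may return {:error, reason}, which is not a valid callback
    # return value, so translate it into a stop.
    case reconnect(socket) do
      {:ok, socket} -> {:ok, socket}
      {:error, reason} -> {:stop, reason, socket}
    end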
  end

  @impl Slipstream
  def handle_topic_close(topic, reason, socket) do
    Logger.warning("reporter: topic #{topic} closed: #{inspect(reason)}; rejoining")
    rejoin(socket, topic)
  end

  defp push_metric(socket, event, payload) do
    case push(socket, socket.assigns.topic, event, payload) do
      {:ok, _ref} -> :ok
      {:error, reason} ->
        Logger.warning("reporter: push failed: #{inspect(reason)}")
        :ok
    end
  end
end
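One behaviour worth watching in the Reporter: `handle_join/3` sends `:collect_fast` on every join, while `handle_info/2` keeps rescheduling itself, so a rejoin after a dropped connection appears to start a second collection loop alongside the first. A minimal sketch of one way to keep a single pending timer follows. `TickerSketch`, the plain `state` map (standing in for `socket.assigns`), and the `:timer_ref` key are all hypothetical names, not part of the plan.

```elixir
# Hypothetical helper: guarantees at most one pending :collect_fast timer.
defmodule TickerSketch do
  def schedule(state, interval_ms) do
    # Cancel the previous timer, if any, before arming a new one.
    if ref = state[:timer_ref], do: Process.cancel_timer(ref)

    new_ref = Process.send_after(self(), :collect_fast, interval_ms)
    Map.put(state, :timer_ref, new_ref)
  end
end

# Simulate a join followed by a rejoin before the first tick fires.
state = TickerSketch.schedule(%{}, 50)
_state = TickerSketch.schedule(state, 50)

# Only one tick should arrive.
first =
  receive do
    :collect_fast -> :tick
  after
    500 -> :timeout
  end

second =
  receive do
    :collect_fast -> :duplicate
  after
    200 -> :none
  end

IO.inspect({first, second})
```

In the Reporter this would mean storing the reference with `assign/3` and scheduling from both `handle_join/3` and `handle_info/2` through the same helper. Whether that is worth doing in Phase 1 is a judgment call, since the duplicate loop only shows up after a reconnect.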
  • Step 2: Compile
cd agent && mix compile

Expected: no errors.

  • Step 3: Commit
git add agent/lib/agent/reporter.ex
git commit -m "feat(agent): slipstream reporter — join, push, auto-reconnect"

Task 13: Agent — Application Supervisor

Files:

  • Modify: agent/lib/agent/application.ex

  • Create: agent/config/config.exs

  • Create: agent/config/runtime.exs

  • Step 1: Replace application module

Replace agent/lib/agent/application.ex with:

defmodule Agent.Application do
  @moduledoc false
  use Application
  require Logger

  @impl true
  def start(_type, _args) do
    children =
      case load_config() do
        {:ok, cfg} ->
          Logger.info("agent: starting with host_id=#{cfg.host_id}")
          [{Agent.Reporter, cfg}]

        {:error, reason} ->
          Logger.error("agent: no config loaded (#{inspect(reason)}); running in idle mode")
          []
      end

    Supervisor.start_link(children, strategy: :one_for_one, name: Agent.Supervisor)
  end

  defp load_config do
    path =
      System.get_env("AGENT_CONFIG") ||
        Application.get_env(:agent, :config_path, "/etc/proxmox-monitor/agent.toml")

    if File.exists?(path) do
      Agent.Config.load(path)
    else
      {:error, {:file_missing, path}}
    end
  end
end
  • Step 2: Add minimal compile-time and runtime config

Create agent/config/config.exs:

import Config

config :logger, :default_formatter, format: "$time [$level] $message\n"

if File.exists?(Path.join([__DIR__, "#{config_env()}.exs"])) do
  import_config "#{config_env()}.exs"
end

Create agent/config/runtime.exs:

import Config

if path = System.get_env("AGENT_CONFIG") do
  config :agent, :config_path, path
end
  • Step 3: Compile and run existing tests
cd agent && mix compile && mix test

Expected: all tests pass. On cold boot with no config present, the app starts in idle mode (no crash).

  • Step 4: Commit
git add agent/lib/agent/application.ex agent/config
git commit -m "feat(agent): supervisor boots reporter when config is present"

Task 14: End-to-End Smoke Test

Goal: Prove the agent connects to a locally-running server, joins the channel, and the server logs an incoming metric:fast payload.

Files:

  • Create: /tmp/agent-local.toml (ad-hoc, not committed)

  • Step 1: Start the server

In terminal A:

cd /Users/cabele/claudeprojects/proxmox_monitor/server
mix ecto.create
mix ecto.migrate
iex -S mix phx.server

Expected: [info] Running ServerWeb.Endpoint with Bandit ... http://localhost:4000

  • Step 2: Register a host from the IEx shell in terminal A
iex> Server.Release.register_host("pve-dev-01")

Expected output:

Host 'pve-dev-01' registered (id=1).
TOKEN: <32+ char string>
Store this token NOW — it will never be shown again.

Copy the token for the next step.

  • Step 3: Write a local agent config

In terminal B, with <TOKEN> from the previous step:

cat > /tmp/agent-local.toml <<EOF
server_url = "ws://localhost:4000/socket/websocket"
token = "<TOKEN>"
host_id = "pve-dev-01"

[intervals]
fast_seconds = 5
medium_seconds = 60
slow_seconds = 300
EOF
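A typo in `server_url` is an easy way to lose time in a smoke test. As an optional pre-flight check, the URL can be parsed with the standard library before starting the agent. This is a sketch only: `UrlCheckSketch` is a hypothetical module and is not something `Agent.Config` is known to do.

```elixir
# Hypothetical pre-flight check for the server_url value in the TOML file.
defmodule UrlCheckSketch do
  # Accepts only ws:// and wss:// URLs that carry a host.
  def check(url) do
    case URI.parse(url) do
      %URI{scheme: scheme, host: host}
      when scheme in ["ws", "wss"] and is_binary(host) and host != "" ->
        :ok

      _ ->
        {:error, :invalid_ws_url}
    end
  end
end

IO.inspect(UrlCheckSketch.check("ws://localhost:4000/socket/websocket"))
IO.inspect(UrlCheckSketch.check("http://localhost:4000/socket/websocket"))
```

The first call returns `:ok`; the second returns `{:error, :invalid_ws_url}` because the scheme is `http`.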
  • Step 4: Start the agent

Still in terminal B:

cd /Users/cabele/claudeprojects/proxmox_monitor/agent
AGENT_CONFIG=/tmp/agent-local.toml iex -S mix

Expected in terminal B: agent: starting with host_id=pve-dev-01 then reporter: connected, joining host:pve-dev-01 then reporter: joined host:pve-dev-01.

  • Step 5: Observe metrics in terminal A

Within 5 seconds, terminal A should show:

[info] agent joined host:pve-dev-01
[info] metric:fast host=pve-dev-01 data=%{...}

The data= map contains the keys hostname, load1/load5/load15, mem_*_bytes, and uptime_seconds. They arrive as strings rather than atoms because the payload is decoded from JSON on the server. On macOS dev machines, errors will be populated (no /proc). That's expected: the network path and channel protocol are what we're verifying here.
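If the push fails on a machine without /proc, the error entries are a likely cause: the collector records them as `{name, message}` tuples, and, assuming the channel payload is encoded with Jason, tuples have no encoder implementation. A minimal sketch of converting them to plain maps before the push is shown below. `PayloadSketch` and the `source`/`message` field names are hypothetical, and the sample error entries are made up.

```elixir
# Hypothetical helper: makes the :errors list JSON-friendly.
defmodule PayloadSketch do
  def json_safe(sample) do
    Map.update(sample, :errors, [], fn errors ->
      Enum.map(errors, fn {source, reason} ->
        %{source: to_string(source), message: format_reason(reason)}
      end)
    end)
  end

  # Messages from Exception.message/1 are already strings; anything else
  # (an atom such as :enoent, for example) is rendered with inspect/1.
  defp format_reason(reason) when is_binary(reason), do: reason
  defp format_reason(reason), do: inspect(reason)
end

sample = %{
  load1: nil,
  errors: [{:read_loadavg, "could not read file"}, {:read_uptime, :enoent}]
}

IO.inspect(PayloadSketch.json_safe(sample).errors)
```

In the Reporter this would sit between `Host.collect()` and the push. Changing the error shape inside the collector itself would also work, but would require updating the collector tests.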

  • Step 6: Verify host status in DB

In terminal A IEx:

iex> Server.Repo.get_by(Server.Schema.Host, name: "pve-dev-01") |> Map.take([:status, :agent_version, :last_seen_at])

Expected: %{status: "online", agent_version: "0.1.0", last_seen_at: ~U[...]}.

  • Step 7: Verify terminate marks host offline

Stop the agent in terminal B by pressing Ctrl+C and then a (abort). Re-run the query from Step 6.

Expected: status: "offline", last_seen_at preserved from the last online stamp.

  • Step 8: Clean up the temp file
rm /tmp/agent-local.toml

No code changes — no commit needed. Phase 1 is functionally complete.


Phase 1 Exit Criteria

  • Monorepo with server/ and agent/ each building clean.
  • cd server && mix test — all green.
  • cd agent && mix test — all green.
  • Manual smoke test in Task 14 — agent joins channel, server logs metrics, host status transitions online→offline on disconnect.
  • All commits on main.

Next up (Phase 2): metric persistence in SQLite, ZFS collector, VM collector, Storage collector. See roadmap in proxmox-monitor-konzept.md.