Carsten b2070b6a39 docs: spec for agent diagnostic dump

Opt-in per-command and per-sample dump to configurable dump_dir.
Config-gated via [debug] dump_dir, no change when unset. Serialized
through a single writer GenServer to avoid interleaving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-22 21:57:09 +02:00

10 KiB

Raw Blame History

Agent Diagnostic Dump — Design

Date: 2026-04-22 Status: approved (pending implementation)

Problem

When the dashboard shows data that looks wrong, it's impossible to tell whether the agent's shell commands returned the wrong data, the collector parsed it wrong, the channel dropped a field, or the LiveView rendered it wrong. The agent currently does not log the raw zpool / zfs / pvesh output, nor the assembled payloads it pushes.

To verify source data is correct, the operator needs to see:

Every shell command the agent runs and its full raw output.
The final assembled payload for each fast/medium/slow push, right before it leaves the agent.

Both need to be inspectable with tail -f, on demand, without rebuilding the Burrito binary.

Goal

Opt-in diagnostic dump, controlled by a single TOML config key. When enabled, the agent writes two append-only files containing command results and outgoing payloads. When disabled (default), behavior and I/O cost are unchanged.

Scope

In scope:

Config — new [debug] dump_dir key in agent.toml.
ProxmoxAgent.Shell.run/2 — when dump is enabled, append one entry per command containing timestamp, command + args, exit/error, duration, output size, full raw body, and a separator line.
ProxmoxAgent.Reporter.push_metric/3 — when dump is enabled, append one entry per metric push right before push/4, containing timestamp, kind (fast/medium/slow), full payload as indented JSON, and a separator line.
Unit tests covering the dump-on / dump-off branches.

Out of scope (YAGNI):

Log rotation / retention — rely on logrotate or manual truncation.
CLI subcommand for a one-shot dump — can add later if tailing proves awkward.
Structured formats (e.g. NDJSON, SQLite). Plain text with separator lines is sufficient for grep / less / diffing.

Configuration

Add to agent.toml:

[debug]
dump_dir = "/var/log/proxmox-monitor"

When dump_dir is absent or empty, dumping is off. This is the default.
When dump_dir is set, the agent creates the directory at startup (ignoring "already exists") and writes commands.log and samples.log inside it.
If directory creation fails, the agent logs a single warning and runs as if dumping is off. It does not crash.
Permissions — the directory is created with mode 0755. Files are created with mode 0644 (default). The agent runs as root per the existing deployment, so no additional permission handling is needed.

Config struct changes

ProxmoxAgent.Config gains one field:

defstruct [
  :server_url,
  :token,
  :host_id,
  :dump_dir,                 # new — String.t() | nil
  fast_seconds: 30,
  medium_seconds: 300,
  slow_seconds: 1800
]

build/1 reads Map.get(map, "debug", %{}) |> Map.get("dump_dir"), treating empty strings as nil.

Runtime wiring

ProxmoxAgent.Application.start/2 loads the config, then calls ProxmoxAgent.Diagnostics.configure/1 with the dump_dir value. Diagnostics is a new thin module (see below) that holds the dump state in Application.put_env/3 (not :persistent_term — persistent_term is GC-expensive and overkill for a single nilable string). The Shell and Reporter both read it via Application.get_env(:agent, :dump_dir).

Rationale: passing dump_dir through every collector and shell call via opts pollutes signatures for a cross-cutting concern. A single application-env read is simple and the contract ("set once at startup, read everywhere") is clear.

New module: `ProxmoxAgent.Diagnostics`

One module, one responsibility: decide whether dumping is on and append entries to the right file.

defmodule ProxmoxAgent.Diagnostics do
  @moduledoc """
  Optional diagnostic dump for verifying agent source data.
  Disabled when `:dump_dir` is nil. Entries append to commands.log or samples.log.
  """

  @spec configure(String.t() | nil) :: :ok
  def configure(nil), do: Application.delete_env(:agent, :dump_dir)
  def configure(""),  do: Application.delete_env(:agent, :dump_dir)
  def configure(dir) when is_binary(dir) do
    case File.mkdir_p(dir) do
      :ok ->
        Application.put_env(:agent, :dump_dir, dir)
        :ok

      {:error, reason} ->
        Logger.warning("diagnostics: mkdir #{dir} failed (#{inspect(reason)}); dumping disabled")
        Application.delete_env(:agent, :dump_dir)
        :ok
    end
  end

  @spec enabled?() :: boolean()
  def enabled?, do: not is_nil(Application.get_env(:agent, :dump_dir))

  @spec log_command(String.t(), [String.t()], result, non_neg_integer()) :: :ok
  def log_command(cmd, args, result, duration_us) do
    case Application.get_env(:agent, :dump_dir) do
      nil -> :ok
      dir -> append(Path.join(dir, "commands.log"), format_command(cmd, args, result, duration_us))
    end
  end

  @spec log_sample(String.t(), map()) :: :ok
  def log_sample(kind, payload) do
    case Application.get_env(:agent, :dump_dir) do
      nil -> :ok
      dir -> append(Path.join(dir, "samples.log"), format_sample(kind, payload))
    end
  end

  # private: append/2, format_command/4, format_sample/2
end

Entry formats

commands.log:

===== 2026-04-22T15:32:17.104Z =====
$ zpool status -j --json-flat-vdevs --json-int
exit=0  duration=124ms  size=12847

<full raw output body here, verbatim>

Error form:

===== 2026-04-22T15:32:19.312Z =====
$ zpool status -j --json-flat-vdevs --json-int
error={:enoent, "zpool"}  duration=2ms

samples.log:

===== 2026-04-22T15:32:17.220Z  kind=fast =====
{
  "host": { ... },
  "zfs_pools": { "pools": [ ... ] },
  ...
}

Separators (===== ...) are grep-friendly. Timestamps use DateTime.utc_now() |> DateTime.to_iso8601(). The payload is Jason.encode!(payload, pretty: true).

Append safety

Multiple Shell calls can happen concurrently across collectors. File.write!/3 with [:append] mode is atomic enough for line-oriented appends on ext4/xfs/btrfs on Linux up to PIPE_BUF (4 KB), but the full output bodies we write are larger than that. To avoid interleaved bytes, serialize writes through a single GenServer process:

ProxmoxAgent.Diagnostics.Writer — a GenServer started by the supervision tree when dump_dir is set. It owns both file handles and handles two casts: {:command, formatted} and {:sample, formatted}. Each cast does a single IO.write/2 and returns. File handles open once at init; on :file.write error, log warning and continue (don't crash the agent).

Diagnostics.log_command/4 and Diagnostics.log_sample/2 cast to the writer if dumping is enabled; otherwise they are no-ops. The module encapsulates the enabled/disabled branching so Shell and Reporter don't need to know.

Shell changes

Wrap the existing System.cmd call with timing and a post-call dump:

def run(command, args) do
  start = System.monotonic_time(:microsecond)
  result = do_run(command, args)
  duration = System.monotonic_time(:microsecond) - start
  ProxmoxAgent.Diagnostics.log_command(command, args, result, duration)
  result
end

do_run/2 is the existing body moved unchanged.

Reporter changes

One line added in push_metric/3 before the existing push/4:

defp push_metric(socket, event, data) do
  payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: data}
  kind = String.replace_prefix(event, "metric:", "")
  ProxmoxAgent.Diagnostics.log_sample(kind, payload)

  case push(socket, socket.assigns.topic, event, payload) do
    ...
  end
end

Supervision

ProxmoxAgent.Application.start/2 starts Diagnostics.Writer conditionally. If cfg.dump_dir is set and File.mkdir_p succeeded, add the writer to the child spec list; otherwise don't start it. The writer's absence is fine because log_command / log_sample no-op when the application env key is absent.

Concretely: the existing application loads the config, then computes dump_dir = configure(cfg.dump_dir) returning either a valid dir or nil. If non-nil, add ProxmoxAgent.Diagnostics.Writer to the children list, before ProxmoxAgent.Reporter.

Tests

Two targets, both in agent/test/proxmox_agent/:

`diagnostics_test.exs`

enabled?/0 returns false when env key unset.
configure/1 with nil or "" disables.
configure/1 with a valid path creates the directory and sets the env key.
log_command/4 with dumping enabled writes a well-formed entry containing the command, args, exit code, and raw body.
log_sample/2 with dumping enabled writes a well-formed JSON entry with the kind and payload.
Both loggers are no-ops when disabled (no file created, no exception).

Uses System.tmp_dir!() <> "/diag-<random>" for the dump dir; cleans up on exit.

`shell_test.exs` (extension)

One new test confirming that when dumping is on, Shell.run/2 appends to commands.log. Uses a config where the shell command is echo hello so the test is deterministic and doesn't require zpool on the test host.

`reporter` integration

Skipped — Reporter is a Slipstream client whose integration is awkward to mock. The Diagnostics.log_sample/2 call is a one-line invocation with no branching; covering it via diagnostics_test.exs is sufficient.

Risks

File size growth — accepted. Verification is a bounded activity. A fast-path zpool status of ~12 KB every 30 s is ~35 MB/day. Bigger hosts more. User turns off when done.
Writer backlog — if the writer GenServer falls behind (shouldn't, given small message rates — fast is ~30 s, medium ~5 min, slow ~30 min, and Shell calls are serialized by the collectors anyway), casts queue in the mailbox. Not a production concern; debug feature.
Concurrent shell errors during writer start — if Shell.run/2 fires before Diagnostics.Writer is up, log_command no-ops because Application.get_env returns nil until configure/1 runs. The supervision order — Diagnostics.Writer before Reporter — ensures the dump path is primed before any collectors fire.

Rollout

Additive. Default off. Operator enables by editing agent.toml and restarting the service. No migration, no impact on server or dashboard.

10 KiB Raw Blame History