# Agent Diagnostic Dump — Design Date: 2026-04-22 Status: approved (pending implementation) ## Problem When the dashboard shows data that looks wrong, it's impossible to tell whether the agent's shell commands returned the wrong data, the collector parsed it wrong, the channel dropped a field, or the LiveView rendered it wrong. The agent currently does not log the raw `zpool` / `zfs` / `pvesh` output, nor the assembled payloads it pushes. To verify source data is correct, the operator needs to see: 1. Every shell command the agent runs and its full raw output. 2. The final assembled payload for each fast/medium/slow push, right before it leaves the agent. Both need to be inspectable with `tail -f`, on demand, without rebuilding the Burrito binary. ## Goal Opt-in diagnostic dump, controlled by a single TOML config key. When enabled, the agent writes two append-only files containing command results and outgoing payloads. When disabled (default), behavior and I/O cost are unchanged. ## Scope In scope: 1. Config — new `[debug] dump_dir` key in `agent.toml`. 2. `ProxmoxAgent.Shell.run/2` — when dump is enabled, append one entry per command containing timestamp, command + args, exit/error, duration, output size, full raw body, and a separator line. 3. `ProxmoxAgent.Reporter.push_metric/3` — when dump is enabled, append one entry per metric push right before `push/4`, containing timestamp, kind (`fast`/`medium`/`slow`), full payload as indented JSON, and a separator line. 4. Unit tests covering the dump-on / dump-off branches. Out of scope (YAGNI): - Log rotation / retention — rely on `logrotate` or manual truncation. - CLI subcommand for a one-shot dump — can add later if tailing proves awkward. - Structured formats (e.g. NDJSON, SQLite). Plain text with separator lines is sufficient for `grep` / `less` / diffing. ## Configuration Add to `agent.toml`: ```toml [debug] dump_dir = "/var/log/proxmox-monitor" ``` - When `dump_dir` is absent or empty, dumping is off. This is the default. - When `dump_dir` is set, the agent creates the directory at startup (ignoring "already exists") and writes `commands.log` and `samples.log` inside it. - If directory creation fails, the agent logs a single warning and runs as if dumping is off. It does not crash. - Permissions — the directory is created with mode `0755`. Files are created with mode `0644` (default). The agent runs as root per the existing deployment, so no additional permission handling is needed. ## Config struct changes `ProxmoxAgent.Config` gains one field: ```elixir defstruct [ :server_url, :token, :host_id, :dump_dir, # new — String.t() | nil fast_seconds: 30, medium_seconds: 300, slow_seconds: 1800 ] ``` `build/1` reads `Map.get(map, "debug", %{}) |> Map.get("dump_dir")`, treating empty strings as nil. ## Runtime wiring `ProxmoxAgent.Application.start/2` loads the config, then calls `ProxmoxAgent.Diagnostics.configure/1` with the `dump_dir` value. `Diagnostics` is a new thin module (see below) that holds the dump state in `Application.put_env/3` (not `:persistent_term` — persistent_term is GC-expensive and overkill for a single nilable string). The Shell and Reporter both read it via `Application.get_env(:agent, :dump_dir)`. Rationale: passing `dump_dir` through every collector and shell call via opts pollutes signatures for a cross-cutting concern. A single application-env read is simple and the contract ("set once at startup, read everywhere") is clear. ## New module: `ProxmoxAgent.Diagnostics` One module, one responsibility: decide whether dumping is on and append entries to the right file. ```elixir defmodule ProxmoxAgent.Diagnostics do @moduledoc """ Optional diagnostic dump for verifying agent source data. Disabled when `:dump_dir` is nil. Entries append to commands.log or samples.log. """ @spec configure(String.t() | nil) :: :ok def configure(nil), do: Application.delete_env(:agent, :dump_dir) def configure(""), do: Application.delete_env(:agent, :dump_dir) def configure(dir) when is_binary(dir) do case File.mkdir_p(dir) do :ok -> Application.put_env(:agent, :dump_dir, dir) :ok {:error, reason} -> Logger.warning("diagnostics: mkdir #{dir} failed (#{inspect(reason)}); dumping disabled") Application.delete_env(:agent, :dump_dir) :ok end end @spec enabled?() :: boolean() def enabled?, do: not is_nil(Application.get_env(:agent, :dump_dir)) @spec log_command(String.t(), [String.t()], result, non_neg_integer()) :: :ok def log_command(cmd, args, result, duration_us) do case Application.get_env(:agent, :dump_dir) do nil -> :ok dir -> append(Path.join(dir, "commands.log"), format_command(cmd, args, result, duration_us)) end end @spec log_sample(String.t(), map()) :: :ok def log_sample(kind, payload) do case Application.get_env(:agent, :dump_dir) do nil -> :ok dir -> append(Path.join(dir, "samples.log"), format_sample(kind, payload)) end end # private: append/2, format_command/4, format_sample/2 end ``` ### Entry formats `commands.log`: ``` ===== 2026-04-22T15:32:17.104Z ===== $ zpool status -j --json-flat-vdevs --json-int exit=0 duration=124ms size=12847 ``` Error form: ``` ===== 2026-04-22T15:32:19.312Z ===== $ zpool status -j --json-flat-vdevs --json-int error={:enoent, "zpool"} duration=2ms ``` `samples.log`: ``` ===== 2026-04-22T15:32:17.220Z kind=fast ===== { "host": { ... }, "zfs_pools": { "pools": [ ... ] }, ... } ``` Separators (`===== ...`) are grep-friendly. Timestamps use `DateTime.utc_now() |> DateTime.to_iso8601()`. The payload is Jason.encode!(payload, pretty: true). ### Append safety Multiple Shell calls can happen concurrently across collectors. `File.write!/3` with `[:append]` mode is atomic enough for line-oriented appends on ext4/xfs/btrfs on Linux up to PIPE_BUF (4 KB), but the full output bodies we write are larger than that. To avoid interleaved bytes, serialize writes through a single GenServer process: `ProxmoxAgent.Diagnostics.Writer` — a GenServer started by the supervision tree when `dump_dir` is set. It owns both file handles and handles two casts: `{:command, formatted}` and `{:sample, formatted}`. Each cast does a single `IO.write/2` and returns. File handles open once at init; on `:file.write` error, log warning and continue (don't crash the agent). `Diagnostics.log_command/4` and `Diagnostics.log_sample/2` cast to the writer if dumping is enabled; otherwise they are no-ops. The module encapsulates the enabled/disabled branching so Shell and Reporter don't need to know. ## Shell changes Wrap the existing `System.cmd` call with timing and a post-call dump: ```elixir def run(command, args) do start = System.monotonic_time(:microsecond) result = do_run(command, args) duration = System.monotonic_time(:microsecond) - start ProxmoxAgent.Diagnostics.log_command(command, args, result, duration) result end ``` `do_run/2` is the existing body moved unchanged. ## Reporter changes One line added in `push_metric/3` before the existing `push/4`: ```elixir defp push_metric(socket, event, data) do payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: data} kind = String.replace_prefix(event, "metric:", "") ProxmoxAgent.Diagnostics.log_sample(kind, payload) case push(socket, socket.assigns.topic, event, payload) do ... end end ``` ## Supervision `ProxmoxAgent.Application.start/2` starts `Diagnostics.Writer` conditionally. If `cfg.dump_dir` is set and `File.mkdir_p` succeeded, add the writer to the child spec list; otherwise don't start it. The writer's absence is fine because `log_command` / `log_sample` no-op when the application env key is absent. Concretely: the existing application loads the config, then computes `dump_dir = configure(cfg.dump_dir)` returning either a valid dir or `nil`. If non-nil, add `ProxmoxAgent.Diagnostics.Writer` to the children list, before `ProxmoxAgent.Reporter`. ## Tests Two targets, both in `agent/test/proxmox_agent/`: ### `diagnostics_test.exs` - `enabled?/0` returns false when env key unset. - `configure/1` with `nil` or `""` disables. - `configure/1` with a valid path creates the directory and sets the env key. - `log_command/4` with dumping enabled writes a well-formed entry containing the command, args, exit code, and raw body. - `log_sample/2` with dumping enabled writes a well-formed JSON entry with the kind and payload. - Both loggers are no-ops when disabled (no file created, no exception). Uses `System.tmp_dir!() <> "/diag-"` for the dump dir; cleans up on exit. ### `shell_test.exs` (extension) One new test confirming that when dumping is on, `Shell.run/2` appends to `commands.log`. Uses a config where the shell command is `echo hello` so the test is deterministic and doesn't require `zpool` on the test host. ### `reporter` integration Skipped — `Reporter` is a Slipstream client whose integration is awkward to mock. The `Diagnostics.log_sample/2` call is a one-line invocation with no branching; covering it via `diagnostics_test.exs` is sufficient. ## Risks - **File size growth** — accepted. Verification is a bounded activity. A fast-path zpool status of ~12 KB every 30 s is ~35 MB/day. Bigger hosts more. User turns off when done. - **Writer backlog** — if the writer GenServer falls behind (shouldn't, given small message rates — fast is ~30 s, medium ~5 min, slow ~30 min, and Shell calls are serialized by the collectors anyway), casts queue in the mailbox. Not a production concern; debug feature. - **Concurrent shell errors during writer start** — if `Shell.run/2` fires before `Diagnostics.Writer` is up, `log_command` no-ops because `Application.get_env` returns nil until `configure/1` runs. The supervision order — Diagnostics.Writer before Reporter — ensures the dump path is primed before any collectors fire. ## Rollout Additive. Default off. Operator enables by editing `agent.toml` and restarting the service. No migration, no impact on server or dashboard.