Opt-in per-command and per-sample dump to configurable dump_dir. Config-gated via [debug] dump_dir, no change when unset. Serialized through a single writer GenServer to avoid interleaving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 KiB
Agent Diagnostic Dump — Design
Date: 2026-04-22 Status: approved (pending implementation)
Problem
When the dashboard shows data that looks wrong, it's impossible to tell whether the agent's shell commands returned the wrong data, the collector parsed it wrong, the channel dropped a field, or the LiveView rendered it wrong. The agent currently does not log the raw zpool / zfs / pvesh output, nor the assembled payloads it pushes.
To verify source data is correct, the operator needs to see:
- Every shell command the agent runs and its full raw output.
- The final assembled payload for each fast/medium/slow push, right before it leaves the agent.
Both need to be inspectable with tail -f, on demand, without rebuilding the Burrito binary.
Goal
Opt-in diagnostic dump, controlled by a single TOML config key. When enabled, the agent writes two append-only files containing command results and outgoing payloads. When disabled (default), behavior and I/O cost are unchanged.
Scope
In scope:
- Config — new
[debug] dump_dirkey inagent.toml. ProxmoxAgent.Shell.run/2— when dump is enabled, append one entry per command containing timestamp, command + args, exit/error, duration, output size, full raw body, and a separator line.ProxmoxAgent.Reporter.push_metric/3— when dump is enabled, append one entry per metric push right beforepush/4, containing timestamp, kind (fast/medium/slow), full payload as indented JSON, and a separator line.- Unit tests covering the dump-on / dump-off branches.
Out of scope (YAGNI):
- Log rotation / retention — rely on
logrotateor manual truncation. - CLI subcommand for a one-shot dump — can add later if tailing proves awkward.
- Structured formats (e.g. NDJSON, SQLite). Plain text with separator lines is sufficient for
grep/less/ diffing.
Configuration
Add to agent.toml:
[debug]
dump_dir = "/var/log/proxmox-monitor"
- When
dump_diris absent or empty, dumping is off. This is the default. - When
dump_diris set, the agent creates the directory at startup (ignoring "already exists") and writescommands.logandsamples.loginside it. - If directory creation fails, the agent logs a single warning and runs as if dumping is off. It does not crash.
- Permissions — the directory is created with mode
0755. Files are created with mode0644(default). The agent runs as root per the existing deployment, so no additional permission handling is needed.
Config struct changes
ProxmoxAgent.Config gains one field:
defstruct [
:server_url,
:token,
:host_id,
:dump_dir, # new — String.t() | nil
fast_seconds: 30,
medium_seconds: 300,
slow_seconds: 1800
]
build/1 reads Map.get(map, "debug", %{}) |> Map.get("dump_dir"), treating empty strings as nil.
Runtime wiring
ProxmoxAgent.Application.start/2 loads the config, then calls ProxmoxAgent.Diagnostics.configure/1 with the dump_dir value. Diagnostics is a new thin module (see below) that holds the dump state in Application.put_env/3 (not :persistent_term — persistent_term is GC-expensive and overkill for a single nilable string). The Shell and Reporter both read it via Application.get_env(:agent, :dump_dir).
Rationale: passing dump_dir through every collector and shell call via opts pollutes signatures for a cross-cutting concern. A single application-env read is simple and the contract ("set once at startup, read everywhere") is clear.
New module: ProxmoxAgent.Diagnostics
One module, one responsibility: decide whether dumping is on and append entries to the right file.
defmodule ProxmoxAgent.Diagnostics do
@moduledoc """
Optional diagnostic dump for verifying agent source data.
Disabled when `:dump_dir` is nil. Entries append to commands.log or samples.log.
"""
@spec configure(String.t() | nil) :: :ok
def configure(nil), do: Application.delete_env(:agent, :dump_dir)
def configure(""), do: Application.delete_env(:agent, :dump_dir)
def configure(dir) when is_binary(dir) do
case File.mkdir_p(dir) do
:ok ->
Application.put_env(:agent, :dump_dir, dir)
:ok
{:error, reason} ->
Logger.warning("diagnostics: mkdir #{dir} failed (#{inspect(reason)}); dumping disabled")
Application.delete_env(:agent, :dump_dir)
:ok
end
end
@spec enabled?() :: boolean()
def enabled?, do: not is_nil(Application.get_env(:agent, :dump_dir))
@spec log_command(String.t(), [String.t()], result, non_neg_integer()) :: :ok
def log_command(cmd, args, result, duration_us) do
case Application.get_env(:agent, :dump_dir) do
nil -> :ok
dir -> append(Path.join(dir, "commands.log"), format_command(cmd, args, result, duration_us))
end
end
@spec log_sample(String.t(), map()) :: :ok
def log_sample(kind, payload) do
case Application.get_env(:agent, :dump_dir) do
nil -> :ok
dir -> append(Path.join(dir, "samples.log"), format_sample(kind, payload))
end
end
# private: append/2, format_command/4, format_sample/2
end
Entry formats
commands.log:
===== 2026-04-22T15:32:17.104Z =====
$ zpool status -j --json-flat-vdevs --json-int
exit=0 duration=124ms size=12847
<full raw output body here, verbatim>
Error form:
===== 2026-04-22T15:32:19.312Z =====
$ zpool status -j --json-flat-vdevs --json-int
error={:enoent, "zpool"} duration=2ms
samples.log:
===== 2026-04-22T15:32:17.220Z kind=fast =====
{
"host": { ... },
"zfs_pools": { "pools": [ ... ] },
...
}
Separators (===== ...) are grep-friendly. Timestamps use DateTime.utc_now() |> DateTime.to_iso8601(). The payload is Jason.encode!(payload, pretty: true).
Append safety
Multiple Shell calls can happen concurrently across collectors. File.write!/3 with [:append] mode is atomic enough for line-oriented appends on ext4/xfs/btrfs on Linux up to PIPE_BUF (4 KB), but the full output bodies we write are larger than that. To avoid interleaved bytes, serialize writes through a single GenServer process:
ProxmoxAgent.Diagnostics.Writer — a GenServer started by the supervision tree when dump_dir is set. It owns both file handles and handles two casts: {:command, formatted} and {:sample, formatted}. Each cast does a single IO.write/2 and returns. File handles open once at init; on :file.write error, log warning and continue (don't crash the agent).
Diagnostics.log_command/4 and Diagnostics.log_sample/2 cast to the writer if dumping is enabled; otherwise they are no-ops. The module encapsulates the enabled/disabled branching so Shell and Reporter don't need to know.
Shell changes
Wrap the existing System.cmd call with timing and a post-call dump:
def run(command, args) do
start = System.monotonic_time(:microsecond)
result = do_run(command, args)
duration = System.monotonic_time(:microsecond) - start
ProxmoxAgent.Diagnostics.log_command(command, args, result, duration)
result
end
do_run/2 is the existing body moved unchanged.
Reporter changes
One line added in push_metric/3 before the existing push/4:
defp push_metric(socket, event, data) do
payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: data}
kind = String.replace_prefix(event, "metric:", "")
ProxmoxAgent.Diagnostics.log_sample(kind, payload)
case push(socket, socket.assigns.topic, event, payload) do
...
end
end
Supervision
ProxmoxAgent.Application.start/2 starts Diagnostics.Writer conditionally. If cfg.dump_dir is set and File.mkdir_p succeeded, add the writer to the child spec list; otherwise don't start it. The writer's absence is fine because log_command / log_sample no-op when the application env key is absent.
Concretely: the existing application loads the config, then computes dump_dir = configure(cfg.dump_dir) returning either a valid dir or nil. If non-nil, add ProxmoxAgent.Diagnostics.Writer to the children list, before ProxmoxAgent.Reporter.
Tests
Two targets, both in agent/test/proxmox_agent/:
diagnostics_test.exs
enabled?/0returns false when env key unset.configure/1withnilor""disables.configure/1with a valid path creates the directory and sets the env key.log_command/4with dumping enabled writes a well-formed entry containing the command, args, exit code, and raw body.log_sample/2with dumping enabled writes a well-formed JSON entry with the kind and payload.- Both loggers are no-ops when disabled (no file created, no exception).
Uses System.tmp_dir!() <> "/diag-<random>" for the dump dir; cleans up on exit.
shell_test.exs (extension)
One new test confirming that when dumping is on, Shell.run/2 appends to commands.log. Uses a config where the shell command is echo hello so the test is deterministic and doesn't require zpool on the test host.
reporter integration
Skipped — Reporter is a Slipstream client whose integration is awkward to mock. The Diagnostics.log_sample/2 call is a one-line invocation with no branching; covering it via diagnostics_test.exs is sufficient.
Risks
- File size growth — accepted. Verification is a bounded activity. A fast-path zpool status of ~12 KB every 30 s is ~35 MB/day. Bigger hosts more. User turns off when done.
- Writer backlog — if the writer GenServer falls behind (shouldn't, given small message rates — fast is ~30 s, medium ~5 min, slow ~30 min, and Shell calls are serialized by the collectors anyway), casts queue in the mailbox. Not a production concern; debug feature.
- Concurrent shell errors during writer start — if
Shell.run/2fires beforeDiagnostics.Writeris up,log_commandno-ops becauseApplication.get_envreturns nil untilconfigure/1runs. The supervision order — Diagnostics.Writer before Reporter — ensures the dump path is primed before any collectors fire.
Rollout
Additive. Default off. Operator enables by editing agent.toml and restarting the service. No migration, no impact on server or dashboard.