From b2070b6a396d0f87411f70bc576e2fb5e41b68c5 Mon Sep 17 00:00:00 2001 From: Carsten Date: Wed, 22 Apr 2026 21:57:09 +0200 Subject: [PATCH] docs: spec for agent diagnostic dump Opt-in per-command and per-sample dump to configurable dump_dir. Config-gated via [debug] dump_dir, no change when unset. Serialized through a single writer GenServer to avoid interleaving. Co-Authored-By: Claude Opus 4.7 (1M context) --- ...2026-04-22-agent-diagnostic-dump-design.md | 237 ++++++++++++++++++ 1 file changed, 237 insertions(+) create mode 100644 docs/superpowers/specs/2026-04-22-agent-diagnostic-dump-design.md diff --git a/docs/superpowers/specs/2026-04-22-agent-diagnostic-dump-design.md b/docs/superpowers/specs/2026-04-22-agent-diagnostic-dump-design.md new file mode 100644 index 0000000..3e4e0c6 --- /dev/null +++ b/docs/superpowers/specs/2026-04-22-agent-diagnostic-dump-design.md @@ -0,0 +1,237 @@ +# Agent Diagnostic Dump — Design + +Date: 2026-04-22 +Status: approved (pending implementation) + +## Problem + +When the dashboard shows data that looks wrong, it's impossible to tell whether the agent's shell commands returned the wrong data, the collector parsed it wrong, the channel dropped a field, or the LiveView rendered it wrong. The agent currently does not log the raw `zpool` / `zfs` / `pvesh` output, nor the assembled payloads it pushes. + +To verify source data is correct, the operator needs to see: + +1. Every shell command the agent runs and its full raw output. +2. The final assembled payload for each fast/medium/slow push, right before it leaves the agent. + +Both need to be inspectable with `tail -f`, on demand, without rebuilding the Burrito binary. + +## Goal + +Opt-in diagnostic dump, controlled by a single TOML config key. When enabled, the agent writes two append-only files containing command results and outgoing payloads. When disabled (default), behavior and I/O cost are unchanged. + +## Scope + +In scope: + +1. Config — new `[debug] dump_dir` key in `agent.toml`. +2. `ProxmoxAgent.Shell.run/2` — when dump is enabled, append one entry per command containing timestamp, command + args, exit/error, duration, output size, full raw body, and a separator line. +3. `ProxmoxAgent.Reporter.push_metric/3` — when dump is enabled, append one entry per metric push right before `push/4`, containing timestamp, kind (`fast`/`medium`/`slow`), full payload as indented JSON, and a separator line. +4. Unit tests covering the dump-on / dump-off branches. + +Out of scope (YAGNI): + +- Log rotation / retention — rely on `logrotate` or manual truncation. +- CLI subcommand for a one-shot dump — can add later if tailing proves awkward. +- Structured formats (e.g. NDJSON, SQLite). Plain text with separator lines is sufficient for `grep` / `less` / diffing. + +## Configuration + +Add to `agent.toml`: + +```toml +[debug] +dump_dir = "/var/log/proxmox-monitor" +``` + +- When `dump_dir` is absent or empty, dumping is off. This is the default. +- When `dump_dir` is set, the agent creates the directory at startup (ignoring "already exists") and writes `commands.log` and `samples.log` inside it. +- If directory creation fails, the agent logs a single warning and runs as if dumping is off. It does not crash. +- Permissions — the directory is created with mode `0755`. Files are created with mode `0644` (default). The agent runs as root per the existing deployment, so no additional permission handling is needed. + +## Config struct changes + +`ProxmoxAgent.Config` gains one field: + +```elixir +defstruct [ + :server_url, + :token, + :host_id, + :dump_dir, # new — String.t() | nil + fast_seconds: 30, + medium_seconds: 300, + slow_seconds: 1800 +] +``` + +`build/1` reads `Map.get(map, "debug", %{}) |> Map.get("dump_dir")`, treating empty strings as nil. + +## Runtime wiring + +`ProxmoxAgent.Application.start/2` loads the config, then calls `ProxmoxAgent.Diagnostics.configure/1` with the `dump_dir` value. `Diagnostics` is a new thin module (see below) that holds the dump state in `Application.put_env/3` (not `:persistent_term` — persistent_term is GC-expensive and overkill for a single nilable string). The Shell and Reporter both read it via `Application.get_env(:agent, :dump_dir)`. + +Rationale: passing `dump_dir` through every collector and shell call via opts pollutes signatures for a cross-cutting concern. A single application-env read is simple and the contract ("set once at startup, read everywhere") is clear. + +## New module: `ProxmoxAgent.Diagnostics` + +One module, one responsibility: decide whether dumping is on and append entries to the right file. + +```elixir +defmodule ProxmoxAgent.Diagnostics do + @moduledoc """ + Optional diagnostic dump for verifying agent source data. + Disabled when `:dump_dir` is nil. Entries append to commands.log or samples.log. + """ + + @spec configure(String.t() | nil) :: :ok + def configure(nil), do: Application.delete_env(:agent, :dump_dir) + def configure(""), do: Application.delete_env(:agent, :dump_dir) + def configure(dir) when is_binary(dir) do + case File.mkdir_p(dir) do + :ok -> + Application.put_env(:agent, :dump_dir, dir) + :ok + + {:error, reason} -> + Logger.warning("diagnostics: mkdir #{dir} failed (#{inspect(reason)}); dumping disabled") + Application.delete_env(:agent, :dump_dir) + :ok + end + end + + @spec enabled?() :: boolean() + def enabled?, do: not is_nil(Application.get_env(:agent, :dump_dir)) + + @spec log_command(String.t(), [String.t()], result, non_neg_integer()) :: :ok + def log_command(cmd, args, result, duration_us) do + case Application.get_env(:agent, :dump_dir) do + nil -> :ok + dir -> append(Path.join(dir, "commands.log"), format_command(cmd, args, result, duration_us)) + end + end + + @spec log_sample(String.t(), map()) :: :ok + def log_sample(kind, payload) do + case Application.get_env(:agent, :dump_dir) do + nil -> :ok + dir -> append(Path.join(dir, "samples.log"), format_sample(kind, payload)) + end + end + + # private: append/2, format_command/4, format_sample/2 +end +``` + +### Entry formats + +`commands.log`: + +``` +===== 2026-04-22T15:32:17.104Z ===== +$ zpool status -j --json-flat-vdevs --json-int +exit=0 duration=124ms size=12847 + + + +``` + +Error form: + +``` +===== 2026-04-22T15:32:19.312Z ===== +$ zpool status -j --json-flat-vdevs --json-int +error={:enoent, "zpool"} duration=2ms + +``` + +`samples.log`: + +``` +===== 2026-04-22T15:32:17.220Z kind=fast ===== +{ + "host": { ... }, + "zfs_pools": { "pools": [ ... ] }, + ... +} + +``` + +Separators (`===== ...`) are grep-friendly. Timestamps use `DateTime.utc_now() |> DateTime.to_iso8601()`. The payload is Jason.encode!(payload, pretty: true). + +### Append safety + +Multiple Shell calls can happen concurrently across collectors. `File.write!/3` with `[:append]` mode is atomic enough for line-oriented appends on ext4/xfs/btrfs on Linux up to PIPE_BUF (4 KB), but the full output bodies we write are larger than that. To avoid interleaved bytes, serialize writes through a single GenServer process: + +`ProxmoxAgent.Diagnostics.Writer` — a GenServer started by the supervision tree when `dump_dir` is set. It owns both file handles and handles two casts: `{:command, formatted}` and `{:sample, formatted}`. Each cast does a single `IO.write/2` and returns. File handles open once at init; on `:file.write` error, log warning and continue (don't crash the agent). + +`Diagnostics.log_command/4` and `Diagnostics.log_sample/2` cast to the writer if dumping is enabled; otherwise they are no-ops. The module encapsulates the enabled/disabled branching so Shell and Reporter don't need to know. + +## Shell changes + +Wrap the existing `System.cmd` call with timing and a post-call dump: + +```elixir +def run(command, args) do + start = System.monotonic_time(:microsecond) + result = do_run(command, args) + duration = System.monotonic_time(:microsecond) - start + ProxmoxAgent.Diagnostics.log_command(command, args, result, duration) + result +end +``` + +`do_run/2` is the existing body moved unchanged. + +## Reporter changes + +One line added in `push_metric/3` before the existing `push/4`: + +```elixir +defp push_metric(socket, event, data) do + payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: data} + kind = String.replace_prefix(event, "metric:", "") + ProxmoxAgent.Diagnostics.log_sample(kind, payload) + + case push(socket, socket.assigns.topic, event, payload) do + ... + end +end +``` + +## Supervision + +`ProxmoxAgent.Application.start/2` starts `Diagnostics.Writer` conditionally. If `cfg.dump_dir` is set and `File.mkdir_p` succeeded, add the writer to the child spec list; otherwise don't start it. The writer's absence is fine because `log_command` / `log_sample` no-op when the application env key is absent. + +Concretely: the existing application loads the config, then computes `dump_dir = configure(cfg.dump_dir)` returning either a valid dir or `nil`. If non-nil, add `ProxmoxAgent.Diagnostics.Writer` to the children list, before `ProxmoxAgent.Reporter`. + +## Tests + +Two targets, both in `agent/test/proxmox_agent/`: + +### `diagnostics_test.exs` + +- `enabled?/0` returns false when env key unset. +- `configure/1` with `nil` or `""` disables. +- `configure/1` with a valid path creates the directory and sets the env key. +- `log_command/4` with dumping enabled writes a well-formed entry containing the command, args, exit code, and raw body. +- `log_sample/2` with dumping enabled writes a well-formed JSON entry with the kind and payload. +- Both loggers are no-ops when disabled (no file created, no exception). + +Uses `System.tmp_dir!() <> "/diag-"` for the dump dir; cleans up on exit. + +### `shell_test.exs` (extension) + +One new test confirming that when dumping is on, `Shell.run/2` appends to `commands.log`. Uses a config where the shell command is `echo hello` so the test is deterministic and doesn't require `zpool` on the test host. + +### `reporter` integration + +Skipped — `Reporter` is a Slipstream client whose integration is awkward to mock. The `Diagnostics.log_sample/2` call is a one-line invocation with no branching; covering it via `diagnostics_test.exs` is sufficient. + +## Risks + +- **File size growth** — accepted. Verification is a bounded activity. A fast-path zpool status of ~12 KB every 30 s is ~35 MB/day. Bigger hosts more. User turns off when done. +- **Writer backlog** — if the writer GenServer falls behind (shouldn't, given small message rates — fast is ~30 s, medium ~5 min, slow ~30 min, and Shell calls are serialized by the collectors anyway), casts queue in the mailbox. Not a production concern; debug feature. +- **Concurrent shell errors during writer start** — if `Shell.run/2` fires before `Diagnostics.Writer` is up, `log_command` no-ops because `Application.get_env` returns nil until `configure/1` runs. The supervision order — Diagnostics.Writer before Reporter — ensures the dump path is primed before any collectors fire. + +## Rollout + +Additive. Default off. Operator enables by editing `agent.toml` and restarting the service. No migration, no impact on server or dashboard.