proxMon/docs/superpowers/specs/2026-04-22-agent-diagnostic-dump-design.md
Carsten b2070b6a39 docs: spec for agent diagnostic dump
Opt-in per-command and per-sample dump to configurable dump_dir.
Config-gated via [debug] dump_dir, no change when unset. Serialized
through a single writer GenServer to avoid interleaving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:57:09 +02:00

237 lines
10 KiB
Markdown

# Agent Diagnostic Dump — Design
Date: 2026-04-22
Status: approved (pending implementation)
## Problem
When the dashboard shows data that looks wrong, it's impossible to tell whether the agent's shell commands returned the wrong data, the collector parsed it wrong, the channel dropped a field, or the LiveView rendered it wrong. The agent currently does not log the raw `zpool` / `zfs` / `pvesh` output, nor the assembled payloads it pushes.
To verify source data is correct, the operator needs to see:
1. Every shell command the agent runs and its full raw output.
2. The final assembled payload for each fast/medium/slow push, right before it leaves the agent.
Both need to be inspectable with `tail -f`, on demand, without rebuilding the Burrito binary.
## Goal
Opt-in diagnostic dump, controlled by a single TOML config key. When enabled, the agent writes two append-only files containing command results and outgoing payloads. When disabled (default), behavior and I/O cost are unchanged.
## Scope
In scope:
1. Config — new `[debug] dump_dir` key in `agent.toml`.
2. `ProxmoxAgent.Shell.run/2` — when dump is enabled, append one entry per command containing timestamp, command + args, exit/error, duration, output size, full raw body, and a separator line.
3. `ProxmoxAgent.Reporter.push_metric/3` — when dump is enabled, append one entry per metric push right before `push/4`, containing timestamp, kind (`fast`/`medium`/`slow`), full payload as indented JSON, and a separator line.
4. Unit tests covering the dump-on / dump-off branches.
Out of scope (YAGNI):
- Log rotation / retention — rely on `logrotate` or manual truncation.
- CLI subcommand for a one-shot dump — can add later if tailing proves awkward.
- Structured formats (e.g. NDJSON, SQLite). Plain text with separator lines is sufficient for `grep` / `less` / diffing.
## Configuration
Add to `agent.toml`:
```toml
[debug]
dump_dir = "/var/log/proxmox-monitor"
```
- When `dump_dir` is absent or empty, dumping is off. This is the default.
- When `dump_dir` is set, the agent creates the directory at startup (ignoring "already exists") and writes `commands.log` and `samples.log` inside it.
- If directory creation fails, the agent logs a single warning and runs as if dumping is off. It does not crash.
- Permissions — the directory is created with mode `0755`. Files are created with mode `0644` (default). The agent runs as root per the existing deployment, so no additional permission handling is needed.
## Config struct changes
`ProxmoxAgent.Config` gains one field:
```elixir
defstruct [
:server_url,
:token,
:host_id,
:dump_dir, # new — String.t() | nil
fast_seconds: 30,
medium_seconds: 300,
slow_seconds: 1800
]
```
`build/1` reads `Map.get(map, "debug", %{}) |> Map.get("dump_dir")`, treating empty strings as nil.
## Runtime wiring
`ProxmoxAgent.Application.start/2` loads the config, then calls `ProxmoxAgent.Diagnostics.configure/1` with the `dump_dir` value. `Diagnostics` is a new thin module (see below) that holds the dump state in `Application.put_env/3` (not `:persistent_term` — persistent_term is GC-expensive and overkill for a single nilable string). The Shell and Reporter both read it via `Application.get_env(:agent, :dump_dir)`.
Rationale: passing `dump_dir` through every collector and shell call via opts pollutes signatures for a cross-cutting concern. A single application-env read is simple and the contract ("set once at startup, read everywhere") is clear.
## New module: `ProxmoxAgent.Diagnostics`
One module, one responsibility: decide whether dumping is on and append entries to the right file.
```elixir
defmodule ProxmoxAgent.Diagnostics do
@moduledoc """
Optional diagnostic dump for verifying agent source data.
Disabled when `:dump_dir` is nil. Entries append to commands.log or samples.log.
"""
@spec configure(String.t() | nil) :: :ok
def configure(nil), do: Application.delete_env(:agent, :dump_dir)
def configure(""), do: Application.delete_env(:agent, :dump_dir)
def configure(dir) when is_binary(dir) do
case File.mkdir_p(dir) do
:ok ->
Application.put_env(:agent, :dump_dir, dir)
:ok
{:error, reason} ->
Logger.warning("diagnostics: mkdir #{dir} failed (#{inspect(reason)}); dumping disabled")
Application.delete_env(:agent, :dump_dir)
:ok
end
end
@spec enabled?() :: boolean()
def enabled?, do: not is_nil(Application.get_env(:agent, :dump_dir))
@spec log_command(String.t(), [String.t()], result, non_neg_integer()) :: :ok
def log_command(cmd, args, result, duration_us) do
case Application.get_env(:agent, :dump_dir) do
nil -> :ok
dir -> append(Path.join(dir, "commands.log"), format_command(cmd, args, result, duration_us))
end
end
@spec log_sample(String.t(), map()) :: :ok
def log_sample(kind, payload) do
case Application.get_env(:agent, :dump_dir) do
nil -> :ok
dir -> append(Path.join(dir, "samples.log"), format_sample(kind, payload))
end
end
# private: append/2, format_command/4, format_sample/2
end
```
### Entry formats
`commands.log`:
```
===== 2026-04-22T15:32:17.104Z =====
$ zpool status -j --json-flat-vdevs --json-int
exit=0 duration=124ms size=12847
<full raw output body here, verbatim>
```
Error form:
```
===== 2026-04-22T15:32:19.312Z =====
$ zpool status -j --json-flat-vdevs --json-int
error={:enoent, "zpool"} duration=2ms
```
`samples.log`:
```
===== 2026-04-22T15:32:17.220Z kind=fast =====
{
"host": { ... },
"zfs_pools": { "pools": [ ... ] },
...
}
```
Separators (`===== ...`) are grep-friendly. Timestamps use `DateTime.utc_now() |> DateTime.to_iso8601()`. The payload is Jason.encode!(payload, pretty: true).
### Append safety
Multiple Shell calls can happen concurrently across collectors. `File.write!/3` with `[:append]` mode is atomic enough for line-oriented appends on ext4/xfs/btrfs on Linux up to PIPE_BUF (4 KB), but the full output bodies we write are larger than that. To avoid interleaved bytes, serialize writes through a single GenServer process:
`ProxmoxAgent.Diagnostics.Writer` — a GenServer started by the supervision tree when `dump_dir` is set. It owns both file handles and handles two casts: `{:command, formatted}` and `{:sample, formatted}`. Each cast does a single `IO.write/2` and returns. File handles open once at init; on `:file.write` error, log warning and continue (don't crash the agent).
`Diagnostics.log_command/4` and `Diagnostics.log_sample/2` cast to the writer if dumping is enabled; otherwise they are no-ops. The module encapsulates the enabled/disabled branching so Shell and Reporter don't need to know.
## Shell changes
Wrap the existing `System.cmd` call with timing and a post-call dump:
```elixir
def run(command, args) do
start = System.monotonic_time(:microsecond)
result = do_run(command, args)
duration = System.monotonic_time(:microsecond) - start
ProxmoxAgent.Diagnostics.log_command(command, args, result, duration)
result
end
```
`do_run/2` is the existing body moved unchanged.
## Reporter changes
One line added in `push_metric/3` before the existing `push/4`:
```elixir
defp push_metric(socket, event, data) do
payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: data}
kind = String.replace_prefix(event, "metric:", "")
ProxmoxAgent.Diagnostics.log_sample(kind, payload)
case push(socket, socket.assigns.topic, event, payload) do
...
end
end
```
## Supervision
`ProxmoxAgent.Application.start/2` starts `Diagnostics.Writer` conditionally. If `cfg.dump_dir` is set and `File.mkdir_p` succeeded, add the writer to the child spec list; otherwise don't start it. The writer's absence is fine because `log_command` / `log_sample` no-op when the application env key is absent.
Concretely: the existing application loads the config, then computes `dump_dir = configure(cfg.dump_dir)` returning either a valid dir or `nil`. If non-nil, add `ProxmoxAgent.Diagnostics.Writer` to the children list, before `ProxmoxAgent.Reporter`.
## Tests
Two targets, both in `agent/test/proxmox_agent/`:
### `diagnostics_test.exs`
- `enabled?/0` returns false when env key unset.
- `configure/1` with `nil` or `""` disables.
- `configure/1` with a valid path creates the directory and sets the env key.
- `log_command/4` with dumping enabled writes a well-formed entry containing the command, args, exit code, and raw body.
- `log_sample/2` with dumping enabled writes a well-formed JSON entry with the kind and payload.
- Both loggers are no-ops when disabled (no file created, no exception).
Uses `System.tmp_dir!() <> "/diag-<random>"` for the dump dir; cleans up on exit.
### `shell_test.exs` (extension)
One new test confirming that when dumping is on, `Shell.run/2` appends to `commands.log`. Uses a config where the shell command is `echo hello` so the test is deterministic and doesn't require `zpool` on the test host.
### `reporter` integration
Skipped — `Reporter` is a Slipstream client whose integration is awkward to mock. The `Diagnostics.log_sample/2` call is a one-line invocation with no branching; covering it via `diagnostics_test.exs` is sufficient.
## Risks
- **File size growth** — accepted. Verification is a bounded activity. A fast-path zpool status of ~12 KB every 30 s is ~35 MB/day. Bigger hosts more. User turns off when done.
- **Writer backlog** — if the writer GenServer falls behind (shouldn't, given small message rates — fast is ~30 s, medium ~5 min, slow ~30 min, and Shell calls are serialized by the collectors anyway), casts queue in the mailbox. Not a production concern; debug feature.
- **Concurrent shell errors during writer start** — if `Shell.run/2` fires before `Diagnostics.Writer` is up, `log_command` no-ops because `Application.get_env` returns nil until `configure/1` runs. The supervision order — Diagnostics.Writer before Reporter — ensures the dump path is primed before any collectors fire.
## Rollout
Additive. Default off. Operator enables by editing `agent.toml` and restarting the service. No migration, no impact on server or dashboard.