proxMon/docs/superpowers/specs/2026-04-22-agent-diagnostic-dump-design.md
Carsten b2070b6a39 docs: spec for agent diagnostic dump
Opt-in per-command and per-sample dump to configurable dump_dir.
Config-gated via [debug] dump_dir, no change when unset. Serialized
through a single writer GenServer to avoid interleaving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:57:09 +02:00


# Agent Diagnostic Dump — Design
Date: 2026-04-22
Status: approved (pending implementation)
## Problem
When the dashboard shows data that looks wrong, it's impossible to tell whether the agent's shell commands returned the wrong data, the collector parsed it wrong, the channel dropped a field, or the LiveView rendered it wrong. The agent currently does not log the raw `zpool` / `zfs` / `pvesh` output, nor the assembled payloads it pushes.
To verify source data is correct, the operator needs to see:
1. Every shell command the agent runs and its full raw output.
2. The final assembled payload for each fast/medium/slow push, right before it leaves the agent.
Both need to be inspectable with `tail -f`, on demand, without rebuilding the Burrito binary.
## Goal
Opt-in diagnostic dump, controlled by a single TOML config key. When enabled, the agent writes two append-only files containing command results and outgoing payloads. When disabled (default), behavior and I/O cost are unchanged.
## Scope
In scope:
1. Config — new `[debug] dump_dir` key in `agent.toml`.
2. `ProxmoxAgent.Shell.run/2` — when dump is enabled, append one entry per command containing timestamp, command + args, exit/error, duration, output size, full raw body, and a separator line.
3. `ProxmoxAgent.Reporter.push_metric/3` — when dump is enabled, append one entry per metric push right before `push/4`, containing timestamp, kind (`fast`/`medium`/`slow`), full payload as indented JSON, and a separator line.
4. Unit tests covering the dump-on / dump-off branches.
Out of scope (YAGNI):
- Log rotation / retention — rely on `logrotate` or manual truncation.
- CLI subcommand for a one-shot dump — can add later if tailing proves awkward.
- Structured formats (e.g. NDJSON, SQLite). Plain text with separator lines is sufficient for `grep` / `less` / diffing.
## Configuration
Add to `agent.toml`:
```toml
[debug]
dump_dir = "/var/log/proxmox-monitor"
```
- When `dump_dir` is absent or empty, dumping is off. This is the default.
- When `dump_dir` is set, the agent creates the directory at startup (ignoring "already exists") and writes `commands.log` and `samples.log` inside it.
- If directory creation fails, the agent logs a single warning and runs as if dumping is off. It does not crash.
- Permissions — the directory is created with mode `0755`. Files are created with mode `0644` (default). The agent runs as root per the existing deployment, so no additional permission handling is needed.
## Config struct changes
`ProxmoxAgent.Config` gains one field:
```elixir
defstruct [
  :server_url,
  :token,
  :host_id,
  :dump_dir, # new — String.t() | nil
  fast_seconds: 30,
  medium_seconds: 300,
  slow_seconds: 1800
]
```
`build/1` reads `Map.get(map, "debug", %{}) |> Map.get("dump_dir")`, treating empty strings as nil.
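The absent/empty normalization can be sketched as a standalone helper (the helper name is hypothetical; the real logic lives inline in `build/1`):

```elixir
defmodule ConfigSketch do
  # Hypothetical helper mirroring the build/1 logic above:
  # a missing [debug] table and an empty string both normalize to nil.
  def parse_dump_dir(decoded_toml) do
    case decoded_toml |> Map.get("debug", %{}) |> Map.get("dump_dir") do
      "" -> nil
      dir -> dir
    end
  end
end
```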
## Runtime wiring
`ProxmoxAgent.Application.start/2` loads the config, then calls `ProxmoxAgent.Diagnostics.configure/1` with the `dump_dir` value. `Diagnostics` is a new thin module (see below) that holds the dump state in `Application.put_env/3` (not `:persistent_term` — updating `:persistent_term` triggers a global GC pass over every process, which is overkill for a single nilable string). The Shell and Reporter both read it via `Application.get_env(:agent, :dump_dir)`.
Rationale: passing `dump_dir` through every collector and shell call via opts pollutes signatures for a cross-cutting concern. A single application-env read is simple and the contract ("set once at startup, read everywhere") is clear.
## New module: `ProxmoxAgent.Diagnostics`
One module, one responsibility: decide whether dumping is on and append entries to the right file.
```elixir
defmodule ProxmoxAgent.Diagnostics do
  @moduledoc """
  Optional diagnostic dump for verifying agent source data.
  Disabled when `:dump_dir` is nil. Entries append to commands.log or samples.log.
  """
  require Logger

  @spec configure(String.t() | nil) :: String.t() | nil
  def configure(nil), do: disable()
  def configure(""), do: disable()

  def configure(dir) when is_binary(dir) do
    case File.mkdir_p(dir) do
      :ok ->
        Application.put_env(:agent, :dump_dir, dir)
        dir

      {:error, reason} ->
        Logger.warning("diagnostics: mkdir #{dir} failed (#{inspect(reason)}); dumping disabled")
        disable()
    end
  end

  defp disable do
    Application.delete_env(:agent, :dump_dir)
    nil
  end

  @spec enabled?() :: boolean()
  def enabled?, do: not is_nil(Application.get_env(:agent, :dump_dir))

  @spec log_command(String.t(), [String.t()], term(), non_neg_integer()) :: :ok
  def log_command(cmd, args, result, duration_us) do
    case Application.get_env(:agent, :dump_dir) do
      nil -> :ok
      dir -> append(Path.join(dir, "commands.log"), format_command(cmd, args, result, duration_us))
    end
  end

  @spec log_sample(String.t(), map()) :: :ok
  def log_sample(kind, payload) do
    case Application.get_env(:agent, :dump_dir) do
      nil -> :ok
      dir -> append(Path.join(dir, "samples.log"), format_sample(kind, payload))
    end
  end

  # private: append/2, format_command/4, format_sample/2
end
```
```
### Entry formats
`commands.log`:
```
===== 2026-04-22T15:32:17.104Z =====
$ zpool status -j --json-flat-vdevs --json-int
exit=0 duration=124ms size=12847
<full raw output body here, verbatim>
```
Error form:
```
===== 2026-04-22T15:32:19.312Z =====
$ zpool status -j --json-flat-vdevs --json-int
error={:enoent, "zpool"} duration=2ms
```
`samples.log`:
```
===== 2026-04-22T15:32:17.220Z kind=fast =====
{
"host": { ... },
"zfs_pools": { "pools": [ ... ] },
...
}
```
Separators (`===== ...`) are grep-friendly. Timestamps use `DateTime.utc_now() |> DateTime.to_iso8601()`. The payload is `Jason.encode!(payload, pretty: true)`.
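A sketch of the private `format_command/4` producing the entry shapes above. The `{:ok, exit_code, body} | {:error, reason}` result shape is an assumption, since this spec does not pin down `Shell.run/2`'s return type:

```elixir
defmodule FormatSketch do
  # Hypothetical standalone version of Diagnostics' private format_command/4.
  # Result shape {:ok, exit_code, body} | {:error, reason} is assumed.
  def format_command(cmd, args, result, duration_us) do
    header =
      "===== #{DateTime.to_iso8601(DateTime.utc_now())} =====\n" <>
        "$ #{Enum.join([cmd | args], " ")}\n"

    ms = div(duration_us, 1000)

    case result do
      {:ok, exit_code, body} ->
        # Success entry: status line, then the raw body verbatim.
        [header, "exit=#{exit_code} duration=#{ms}ms size=#{byte_size(body)}\n", body, "\n"]

      {:error, reason} ->
        # Error entry: no body to dump.
        [header, "error=#{inspect(reason)} duration=#{ms}ms\n"]
    end
  end
end
```

Returning iodata rather than one large binary lets the writer hand the pieces straight to `IO.write/2`.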
### Append safety
Multiple Shell calls can happen concurrently across collectors. Appending via `File.write!/3` with `[:append]` relies on `O_APPEND`, which positions each underlying `write(2)` atomically — small line-oriented appends stay intact in practice, but the multi-kilobyte output bodies we dump can be split across several writes and interleave with other writers. To avoid interleaved bytes, serialize writes through a single GenServer process:
`ProxmoxAgent.Diagnostics.Writer` — a GenServer started by the supervision tree when `dump_dir` is set. It owns both file handles and handles two casts: `{:command, formatted}` and `{:sample, formatted}`. Each cast does a single `IO.write/2` and returns. File handles open once at init; on `:file.write` error, log warning and continue (don't crash the agent).
`Diagnostics.log_command/4` and `Diagnostics.log_sample/2` cast to the writer if dumping is enabled; otherwise they are no-ops. The module encapsulates the enabled/disabled branching so Shell and Reporter don't need to know.
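The writer described above can be sketched as follows — file names and the two cast shapes follow the spec, while the `start_link` signature and the simplified error handling are assumptions:

```elixir
defmodule WriterSketch do
  @moduledoc false
  # Minimal sketch of the single-writer GenServer. The real module
  # (ProxmoxAgent.Diagnostics.Writer) would warn-and-continue on write
  # errors; this sketch omits that handling.
  use GenServer

  def start_link(dump_dir), do: GenServer.start_link(__MODULE__, dump_dir, name: __MODULE__)

  def log(:command, iodata), do: GenServer.cast(__MODULE__, {:command, iodata})
  def log(:sample, iodata), do: GenServer.cast(__MODULE__, {:sample, iodata})

  @impl true
  def init(dump_dir) do
    # Open both handles once at init; [:append] preserves existing content.
    {:ok, commands} = File.open(Path.join(dump_dir, "commands.log"), [:append, :utf8])
    {:ok, samples} = File.open(Path.join(dump_dir, "samples.log"), [:append, :utf8])
    {:ok, %{command: commands, sample: samples}}
  end

  @impl true
  def handle_cast({kind, iodata}, state) do
    # One write per entry; serialization through this process prevents interleaving.
    IO.write(state[kind], iodata)
    {:noreply, state}
  end
end
```

Because every entry funnels through one process mailbox, two concurrent collectors can never interleave bytes within an entry.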
## Shell changes
Wrap the existing `System.cmd` call with timing and a post-call dump:
```elixir
def run(command, args) do
  start = System.monotonic_time(:microsecond)
  result = do_run(command, args)
  duration = System.monotonic_time(:microsecond) - start
  ProxmoxAgent.Diagnostics.log_command(command, args, result, duration)
  result
end
```
`do_run/2` is the existing body moved unchanged.
## Reporter changes
A `log_sample/2` call added in `push_metric/3` right before the existing `push/4`:
```elixir
defp push_metric(socket, event, data) do
  payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: data}
  kind = String.replace_prefix(event, "metric:", "")
  ProxmoxAgent.Diagnostics.log_sample(kind, payload)

  case push(socket, socket.assigns.topic, event, payload) do
    ...
  end
end
```
## Supervision
`ProxmoxAgent.Application.start/2` starts `Diagnostics.Writer` conditionally. If `cfg.dump_dir` is set and `File.mkdir_p` succeeded, add the writer to the child spec list; otherwise don't start it. The writer's absence is fine because `log_command` / `log_sample` no-op when the application env key is absent.
Concretely: the existing application loads the config, then computes `dump_dir = configure(cfg.dump_dir)` returning either a valid dir or `nil`. If non-nil, add `ProxmoxAgent.Diagnostics.Writer` to the children list, before `ProxmoxAgent.Reporter`.
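The conditional wiring can be sketched with placeholder atoms standing in for the real child specs:

```elixir
defmodule SupervisionSketch do
  # Hypothetical: compute the child list from the configure/1 result
  # (a valid dir or nil). :writer / :reporter stand in for the real
  # ProxmoxAgent.Diagnostics.Writer / ProxmoxAgent.Reporter specs.
  def children(dump_dir) do
    maybe_writer = if dump_dir, do: [:writer], else: []
    # The writer (when enabled) precedes the reporter so the dump path
    # is primed before any push happens.
    maybe_writer ++ [:reporter]
  end
end
```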
## Tests
Two targets, both in `agent/test/proxmox_agent/`:
### `diagnostics_test.exs`
- `enabled?/0` returns false when env key unset.
- `configure/1` with `nil` or `""` disables.
- `configure/1` with a valid path creates the directory and sets the env key.
- `log_command/4` with dumping enabled writes a well-formed entry containing the command, args, exit code, and raw body.
- `log_sample/2` with dumping enabled writes a well-formed JSON entry with the kind and payload.
- Both loggers are no-ops when disabled (no file created, no exception).
Uses `System.tmp_dir!() <> "/diag-<random>"` for the dump dir; cleans up on exit.
### `shell_test.exs` (extension)
One new test confirming that when dumping is on, `Shell.run/2` appends to `commands.log`. Uses a config where the shell command is `echo hello` so the test is deterministic and doesn't require `zpool` on the test host.
### `reporter` integration
Skipped — `Reporter` is a Slipstream client whose integration is awkward to mock. The `Diagnostics.log_sample/2` call is a one-line invocation with no branching; covering it via `diagnostics_test.exs` is sufficient.
## Risks
- **File size growth** — accepted. Verification is a bounded activity. A fast-path zpool status of ~12 KB every 30 s is ~35 MB/day; bigger hosts write proportionally more. The operator turns dumping off when done.
- **Writer backlog** — if the writer GenServer falls behind (shouldn't, given small message rates — fast is ~30 s, medium ~5 min, slow ~30 min, and Shell calls are serialized by the collectors anyway), casts queue in the mailbox. Not a production concern; debug feature.
- **Concurrent shell errors during writer start** — if `Shell.run/2` fires before `Diagnostics.Writer` is up, `log_command` no-ops because `Application.get_env` returns nil until `configure/1` runs. The supervision order — Diagnostics.Writer before Reporter — ensures the dump path is primed before any collectors fire.
## Rollout
Additive. Default off. Operator enables by editing `agent.toml` and restarting the service. No migration, no impact on server or dashboard.