docs: spec for agent diagnostic dump
Opt-in per-command and per-sample dump to configurable dump_dir. Config-gated via [debug] dump_dir, no change when unset. Serialized through a single writer GenServer to avoid interleaving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
28a40a2650
commit
b2070b6a39
1 changed files with 237 additions and 0 deletions
|
|
@ -0,0 +1,237 @@
|
||||||
|
# Agent Diagnostic Dump — Design
|
||||||
|
|
||||||
|
Date: 2026-04-22
|
||||||
|
Status: approved (pending implementation)
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
When the dashboard shows data that looks wrong, it's impossible to tell whether the agent's shell commands returned the wrong data, the collector parsed it wrong, the channel dropped a field, or the LiveView rendered it wrong. The agent currently does not log the raw `zpool` / `zfs` / `pvesh` output, nor the assembled payloads it pushes.
|
||||||
|
|
||||||
|
To verify source data is correct, the operator needs to see:
|
||||||
|
|
||||||
|
1. Every shell command the agent runs and its full raw output.
|
||||||
|
2. The final assembled payload for each fast/medium/slow push, right before it leaves the agent.
|
||||||
|
|
||||||
|
Both need to be inspectable with `tail -f`, on demand, without rebuilding the Burrito binary.
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
|
||||||
|
Opt-in diagnostic dump, controlled by a single TOML config key. When enabled, the agent writes two append-only files containing command results and outgoing payloads. When disabled (default), behavior and I/O cost are unchanged.
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
In scope:
|
||||||
|
|
||||||
|
1. Config — new `[debug] dump_dir` key in `agent.toml`.
|
||||||
|
2. `ProxmoxAgent.Shell.run/2` — when dump is enabled, append one entry per command containing timestamp, command + args, exit/error, duration, output size, full raw body, and a separator line.
|
||||||
|
3. `ProxmoxAgent.Reporter.push_metric/3` — when dump is enabled, append one entry per metric push right before `push/4`, containing timestamp, kind (`fast`/`medium`/`slow`), full payload as indented JSON, and a separator line.
|
||||||
|
4. Unit tests covering the dump-on / dump-off branches.
|
||||||
|
|
||||||
|
Out of scope (YAGNI):
|
||||||
|
|
||||||
|
- Log rotation / retention — rely on `logrotate` or manual truncation.
|
||||||
|
- CLI subcommand for a one-shot dump — can add later if tailing proves awkward.
|
||||||
|
- Structured formats (e.g. NDJSON, SQLite). Plain text with separator lines is sufficient for `grep` / `less` / diffing.
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
Add to `agent.toml`:
|
||||||
|
|
||||||
|
```toml
|
||||||
|
[debug]
|
||||||
|
dump_dir = "/var/log/proxmox-monitor"
|
||||||
|
```
|
||||||
|
|
||||||
|
- When `dump_dir` is absent or empty, dumping is off. This is the default.
|
||||||
|
- When `dump_dir` is set, the agent creates the directory at startup (ignoring "already exists") and writes `commands.log` and `samples.log` inside it.
|
||||||
|
- If directory creation fails, the agent logs a single warning and runs as if dumping is off. It does not crash.
|
||||||
|
- Permissions — the directory is created with mode `0755`. Files are created with mode `0644` (default). The agent runs as root per the existing deployment, so no additional permission handling is needed.
|
||||||
|
|
||||||
|
## Config struct changes
|
||||||
|
|
||||||
|
`ProxmoxAgent.Config` gains one field:
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
defstruct [
|
||||||
|
:server_url,
|
||||||
|
:token,
|
||||||
|
:host_id,
|
||||||
|
:dump_dir, # new — String.t() | nil
|
||||||
|
fast_seconds: 30,
|
||||||
|
medium_seconds: 300,
|
||||||
|
slow_seconds: 1800
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
`build/1` reads `Map.get(map, "debug", %{}) |> Map.get("dump_dir")`, treating empty strings as nil.
|
||||||
|
|
||||||
|
## Runtime wiring
|
||||||
|
|
||||||
|
`ProxmoxAgent.Application.start/2` loads the config, then calls `ProxmoxAgent.Diagnostics.configure/1` with the `dump_dir` value. `Diagnostics` is a new thin module (see below) that holds the dump state in `Application.put_env/3` (not `:persistent_term` — persistent_term is GC-expensive and overkill for a single nilable string). The Shell and Reporter both read it via `Application.get_env(:agent, :dump_dir)`.
|
||||||
|
|
||||||
|
Rationale: passing `dump_dir` through every collector and shell call via opts pollutes signatures for a cross-cutting concern. A single application-env read is simple and the contract ("set once at startup, read everywhere") is clear.
|
||||||
|
|
||||||
|
## New module: `ProxmoxAgent.Diagnostics`
|
||||||
|
|
||||||
|
One module, one responsibility: decide whether dumping is on and append entries to the right file.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
defmodule ProxmoxAgent.Diagnostics do
|
||||||
|
@moduledoc """
|
||||||
|
Optional diagnostic dump for verifying agent source data.
|
||||||
|
Disabled when `:dump_dir` is nil. Entries append to commands.log or samples.log.
|
||||||
|
"""
|
||||||
|
|
||||||
|
@spec configure(String.t() | nil) :: :ok
|
||||||
|
def configure(nil), do: Application.delete_env(:agent, :dump_dir)
|
||||||
|
def configure(""), do: Application.delete_env(:agent, :dump_dir)
|
||||||
|
def configure(dir) when is_binary(dir) do
|
||||||
|
case File.mkdir_p(dir) do
|
||||||
|
:ok ->
|
||||||
|
Application.put_env(:agent, :dump_dir, dir)
|
||||||
|
:ok
|
||||||
|
|
||||||
|
{:error, reason} ->
|
||||||
|
Logger.warning("diagnostics: mkdir #{dir} failed (#{inspect(reason)}); dumping disabled")
|
||||||
|
Application.delete_env(:agent, :dump_dir)
|
||||||
|
:ok
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
@spec enabled?() :: boolean()
|
||||||
|
def enabled?, do: not is_nil(Application.get_env(:agent, :dump_dir))
|
||||||
|
|
||||||
|
@spec log_command(String.t(), [String.t()], result, non_neg_integer()) :: :ok
|
||||||
|
def log_command(cmd, args, result, duration_us) do
|
||||||
|
case Application.get_env(:agent, :dump_dir) do
|
||||||
|
nil -> :ok
|
||||||
|
dir -> append(Path.join(dir, "commands.log"), format_command(cmd, args, result, duration_us))
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
@spec log_sample(String.t(), map()) :: :ok
|
||||||
|
def log_sample(kind, payload) do
|
||||||
|
case Application.get_env(:agent, :dump_dir) do
|
||||||
|
nil -> :ok
|
||||||
|
dir -> append(Path.join(dir, "samples.log"), format_sample(kind, payload))
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
# private: append/2, format_command/4, format_sample/2
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
### Entry formats
|
||||||
|
|
||||||
|
`commands.log`:
|
||||||
|
|
||||||
|
```
|
||||||
|
===== 2026-04-22T15:32:17.104Z =====
|
||||||
|
$ zpool status -j --json-flat-vdevs --json-int
|
||||||
|
exit=0 duration=124ms size=12847
|
||||||
|
|
||||||
|
<full raw output body here, verbatim>
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
Error form:
|
||||||
|
|
||||||
|
```
|
||||||
|
===== 2026-04-22T15:32:19.312Z =====
|
||||||
|
$ zpool status -j --json-flat-vdevs --json-int
|
||||||
|
error={:enoent, "zpool"} duration=2ms
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
`samples.log`:
|
||||||
|
|
||||||
|
```
|
||||||
|
===== 2026-04-22T15:32:17.220Z kind=fast =====
|
||||||
|
{
|
||||||
|
"host": { ... },
|
||||||
|
"zfs_pools": { "pools": [ ... ] },
|
||||||
|
...
|
||||||
|
}
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
Separators (`===== ...`) are grep-friendly. Timestamps use `DateTime.utc_now() |> DateTime.to_iso8601()`. The payload is Jason.encode!(payload, pretty: true).
|
||||||
|
|
||||||
|
### Append safety
|
||||||
|
|
||||||
|
Multiple Shell calls can happen concurrently across collectors. `File.write!/3` with `[:append]` mode is atomic enough for line-oriented appends on ext4/xfs/btrfs on Linux up to PIPE_BUF (4 KB), but the full output bodies we write are larger than that. To avoid interleaved bytes, serialize writes through a single GenServer process:
|
||||||
|
|
||||||
|
`ProxmoxAgent.Diagnostics.Writer` — a GenServer started by the supervision tree when `dump_dir` is set. It owns both file handles and handles two casts: `{:command, formatted}` and `{:sample, formatted}`. Each cast does a single `IO.write/2` and returns. File handles open once at init; on `:file.write` error, log warning and continue (don't crash the agent).
|
||||||
|
|
||||||
|
`Diagnostics.log_command/4` and `Diagnostics.log_sample/2` cast to the writer if dumping is enabled; otherwise they are no-ops. The module encapsulates the enabled/disabled branching so Shell and Reporter don't need to know.
|
||||||
|
|
||||||
|
## Shell changes
|
||||||
|
|
||||||
|
Wrap the existing `System.cmd` call with timing and a post-call dump:
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def run(command, args) do
|
||||||
|
start = System.monotonic_time(:microsecond)
|
||||||
|
result = do_run(command, args)
|
||||||
|
duration = System.monotonic_time(:microsecond) - start
|
||||||
|
ProxmoxAgent.Diagnostics.log_command(command, args, result, duration)
|
||||||
|
result
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
`do_run/2` is the existing body moved unchanged.
|
||||||
|
|
||||||
|
## Reporter changes
|
||||||
|
|
||||||
|
One line added in `push_metric/3` before the existing `push/4`:
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
defp push_metric(socket, event, data) do
|
||||||
|
payload = %{collected_at: DateTime.utc_now() |> DateTime.to_iso8601(), data: data}
|
||||||
|
kind = String.replace_prefix(event, "metric:", "")
|
||||||
|
ProxmoxAgent.Diagnostics.log_sample(kind, payload)
|
||||||
|
|
||||||
|
case push(socket, socket.assigns.topic, event, payload) do
|
||||||
|
...
|
||||||
|
end
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
## Supervision
|
||||||
|
|
||||||
|
`ProxmoxAgent.Application.start/2` starts `Diagnostics.Writer` conditionally. If `cfg.dump_dir` is set and `File.mkdir_p` succeeded, add the writer to the child spec list; otherwise don't start it. The writer's absence is fine because `log_command` / `log_sample` no-op when the application env key is absent.
|
||||||
|
|
||||||
|
Concretely: the existing application loads the config, then computes `dump_dir = configure(cfg.dump_dir)` returning either a valid dir or `nil`. If non-nil, add `ProxmoxAgent.Diagnostics.Writer` to the children list, before `ProxmoxAgent.Reporter`.
|
||||||
|
|
||||||
|
## Tests
|
||||||
|
|
||||||
|
Two targets, both in `agent/test/proxmox_agent/`:
|
||||||
|
|
||||||
|
### `diagnostics_test.exs`
|
||||||
|
|
||||||
|
- `enabled?/0` returns false when env key unset.
|
||||||
|
- `configure/1` with `nil` or `""` disables.
|
||||||
|
- `configure/1` with a valid path creates the directory and sets the env key.
|
||||||
|
- `log_command/4` with dumping enabled writes a well-formed entry containing the command, args, exit code, and raw body.
|
||||||
|
- `log_sample/2` with dumping enabled writes a well-formed JSON entry with the kind and payload.
|
||||||
|
- Both loggers are no-ops when disabled (no file created, no exception).
|
||||||
|
|
||||||
|
Uses `System.tmp_dir!() <> "/diag-<random>"` for the dump dir; cleans up on exit.
|
||||||
|
|
||||||
|
### `shell_test.exs` (extension)
|
||||||
|
|
||||||
|
One new test confirming that when dumping is on, `Shell.run/2` appends to `commands.log`. Uses a config where the shell command is `echo hello` so the test is deterministic and doesn't require `zpool` on the test host.
|
||||||
|
|
||||||
|
### `reporter` integration
|
||||||
|
|
||||||
|
Skipped — `Reporter` is a Slipstream client whose integration is awkward to mock. The `Diagnostics.log_sample/2` call is a one-line invocation with no branching; covering it via `diagnostics_test.exs` is sufficient.
|
||||||
|
|
||||||
|
## Risks
|
||||||
|
|
||||||
|
- **File size growth** — accepted. Verification is a bounded activity. A fast-path zpool status of ~12 KB every 30 s is ~35 MB/day. Bigger hosts more. User turns off when done.
|
||||||
|
- **Writer backlog** — if the writer GenServer falls behind (shouldn't, given small message rates — fast is ~30 s, medium ~5 min, slow ~30 min, and Shell calls are serialized by the collectors anyway), casts queue in the mailbox. Not a production concern; debug feature.
|
||||||
|
- **Concurrent shell errors during writer start** — if `Shell.run/2` fires before `Diagnostics.Writer` is up, `log_command` no-ops because `Application.get_env` returns nil until `configure/1` runs. The supervision order — Diagnostics.Writer before Reporter — ensures the dump path is primed before any collectors fire.
|
||||||
|
|
||||||
|
## Rollout
|
||||||
|
|
||||||
|
Additive. Default off. Operator enables by editing `agent.toml` and restarting the service. No migration, no impact on server or dashboard.
|
||||||
Loading…
Add table
Add a link
Reference in a new issue