proxMon/docs/superpowers/specs/2026-04-22-zfs-pool-detail-design.md
Carsten 45f59eb163 docs: spec for ZFS pool detail enrichment
Compact per-pool block with type, capacity bar, used/free/total,
scrub state, and vdev summary. Collector gets pool_type derivation,
scan state, and vdev list — no new shell-outs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:21:06 +02:00


ZFS Pool Detail — Design

Date: 2026-04-22
Status: approved (pending implementation)

Problem

The host detail view renders a compact row per ZFS pool today (server/lib/server_web/live/host_detail_live.ex:67):

rpool [ONLINE]
cap 0% · frag 0% · err 0 · vdevs 4 (deg 0)                          scrub never

This hides information the user needs at first glance:

  • Total / used / free size (bytes are already collected but never rendered).
  • Pool layout (mirror / raidz1 / raidz2 / stripe / mixed) — not collected.
  • Scan state — only end_time is kept, so an in-progress scrub looks like a finished one.

The original concept doc calls for health, a capacity bar, fragmentation, error counters, scrub info, and a vdev list per pool (proxmox-monitor-konzept.md:227). We never finished that.

Goal

One compact block per pool that answers at a glance: is it healthy, what layout is it, how full is it, is a scrub running. No drill-down yet.

Scope

In scope:

  1. Agent collector enrichment — derive pool_type, keep vdev summary list, keep scan function/state. No new shell-outs; zpool status -j --json-flat-vdevs --json-int already returns all of this.
  2. Host detail LiveView — replace the current single-line pool row with a richer compact block (see layout below).
  3. Capacity bar styling in assets/css/app.css.
  4. Tests — extend agent/test/proxmox_agent/collectors/zfs_test.exs fixtures and assertions for the new fields.

Out of scope (YAGNI):

  • Drill-down view with per-vdev disk state, resilver progress bars, or scan history.
  • Persistence schema changes — payload is stored as JSON blob; adding keys is additive.
  • Storage/dataset/VM panel changes — separate conversation.

Agent changes

Collector output

Extend ProxmoxAgent.Collectors.Zfs.pool_summary with three fields:

%{
  # existing fields unchanged:
  name:, health:, size_bytes:, allocated_bytes:, free_bytes:,
  fragmentation_percent:, capacity_percent:, error_count:,
  vdev_count:, degraded_vdev_count:, last_scrub_end:,

  # new:
  pool_type: String.t(),            # "mirror" | "raidz1" | "raidz2" | "raidz3" | "stripe" | "mixed"
  scan_function: String.t() | nil,  # "scrub" | "resilver" | nil
  scan_state: String.t() | nil,     # "SCANNING" | "FINISHED" | "CANCELED" | nil
  vdevs: [%{name: String.t(), type: String.t(), state: String.t(),
            read_errors: non_neg_integer(), write_errors: non_neg_integer(),
            checksum_errors: non_neg_integer()}]
}

Derivation rules

pool_type is derived from the set of vdev_type values across top-level vdevs:

  • All vdevs the same type → that type ("mirror", "raidz1", "raidz2", "raidz3").
  • All vdevs are disk (plain top-level disk with no redundancy) → "stripe".
  • Anything else → "mixed".

Special vdev types (log, cache, spare, dedup, special) are ignored for layout classification — they don't change the data redundancy story. They are still included in the vdevs list.
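The rules above can be sketched as a small function. This is a sketch only, assuming the collector has the list of top-level vdev type strings in hand; the module and function names are hypothetical, not the final collector API:

```elixir
defmodule PoolTypeSketch do
  # vdev classes that carry no data redundancy — excluded from layout classification
  @special ~w(log cache spare dedup special)

  # Derive the pool layout from the top-level vdev type strings.
  def derive(vdev_types) do
    case vdev_types |> Enum.reject(&(&1 in @special)) |> Enum.uniq() do
      ["disk"] -> "stripe"   # only plain top-level disks, no redundancy
      [single] -> single     # uniform layout: "mirror" / "raidz1" / "raidz2" / "raidz3"
      _other -> "mixed"      # heterogeneous layouts (or nothing data-bearing at all)
    end
  end
end

PoolTypeSketch.derive(~w(mirror mirror))   # => "mirror"
PoolTypeSketch.derive(~w(disk disk disk))  # => "stripe"
PoolTypeSketch.derive(~w(mirror raidz1))   # => "mixed"
PoolTypeSketch.derive(~w(raidz2 log))      # => "raidz2" (log ignored)
```

Note the clause order: `["disk"]` must match before the generic single-type clause, otherwise a pure stripe pool would report "disk".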

scan_function / scan_state are read via get_in(status_info, ["scan", "function"]) and get_in(status_info, ["scan", "state"]) respectively.

Per-vdev numeric fields (read_errors, write_errors, checksum_errors) are parsed the same way error_count already is (string or int tolerant).
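The string-or-int tolerance could look like the following sketch (helper name hypothetical; with --json-int the counters should arrive as integers, but older or unflagged output may carry strings):

```elixir
defmodule CountSketch do
  # Accept integers as-is, parse decimal strings, fall back to 0 for anything else.
  def to_count(n) when is_integer(n), do: n

  def to_count(s) when is_binary(s) do
    case Integer.parse(s) do
      {n, _rest} -> n
      :error -> 0
    end
  end

  def to_count(_other), do: 0
end

CountSketch.to_count(3)     # => 3
CountSketch.to_count("12")  # => 12
CountSketch.to_count(nil)   # => 0
```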

Tests

agent/test/fixtures/zfs/zpool_status.json already has a mirror and a raidz2 pool; extend assertions in zfs_test.exs:

  • rpool.pool_type == "mirror"
  • tank.pool_type == "raidz2"
  • rpool.scan_state == "FINISHED"
  • rpool.vdevs has length 1 with type: "mirror", state: "ONLINE"

Add one new fixture-free unit test covering the "stripe" and "mixed" branches by injecting a synthetic runner.

Server changes

None in the collector pipeline. The channel handler already stores the whole zfs_pools.pools list as JSON (server/lib/server_web/channels/host_channel.ex — to confirm in plan) and the LiveView reads it with get_in/2. New keys flow through automatically.

UI changes

Layout

Replace the current .pool-row flex block in host_detail_live.ex:69-86 with a per-pool compact block:

rpool  mirror                                               [ONLINE]
████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░  40%
used 200.0 GB  ·  free 300.0 GB  ·  total 500.0 GB
frag 17%  ·  err 0  ·  vdevs 1 (deg 0)  ·  scrub finished 2026-04-19

Element mapping:

  • Line 1: pool name (bright mono, bold) · pool_type (muted) · health badge (right).
  • Line 2: capacity bar (div with width % + background color keyed to capacity thresholds).
  • Line 3: used / free / total — rendered with the existing format_bytes/1 helper.
  • Line 4: the existing compact details line, plus scrub state — scrub scanning / scrub finished <date> / scrub never.
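The four lines could map onto HEEx roughly as below. This is a sketch only: the assign shape, the CSS class names other than .capbar, and the helpers health_class/1 and capacity_level/1 are assumptions, not the final template.

```heex
<div class="pool-block">
  <div class="pool-head">
    <span class="pool-name"><%= @pool["name"] %></span>
    <span class="pool-type"><%= @pool["pool_type"] %></span>
    <span class={"badge #{health_class(@pool["health"])}"}><%= @pool["health"] %></span>
  </div>
  <div class="capbar" data-level={capacity_level(@pool["capacity_percent"])}>
    <span style={"width: #{@pool["capacity_percent"]}%"}></span>
  </div>
  <div class="pool-bytes">
    used <%= format_bytes(@pool["allocated_bytes"]) %> ·
    free <%= format_bytes(@pool["free_bytes"]) %> ·
    total <%= format_bytes(@pool["size_bytes"]) %>
  </div>
</div>
```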

Capacity bar

CSS in server/assets/css/app.css:

.capbar {
  height: 4px; background: var(--panel-2); border-radius: 2px;
  overflow: hidden; margin: 0.25rem 0;
}
.capbar > span { display: block; height: 100%; background: var(--ok); }
.capbar[data-level="warn"] > span { background: var(--warn); }
.capbar[data-level="crit"] > span { background: var(--crit); }

Thresholds (matching the concept doc's thresholds at proxmox-monitor-konzept.md:218-219):

  • cap >= 90 → data-level="crit"
  • cap >= 80 → data-level="warn"
  • else → default (ok green).
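The threshold mapping is small enough to sketch (function name hypothetical; returning nil lets HEEx omit the data-level attribute entirely, so the bar falls back to the ok color):

```elixir
defmodule CapLevelSketch do
  # Concept-doc thresholds: >= 90 crit, >= 80 warn, else default.
  def capacity_level(cap) when is_number(cap) and cap >= 90, do: "crit"
  def capacity_level(cap) when is_number(cap) and cap >= 80, do: "warn"
  def capacity_level(_cap), do: nil

end

CapLevelSketch.capacity_level(95)  # => "crit"
CapLevelSketch.capacity_level(85)  # => "warn"
CapLevelSketch.capacity_level(40)  # => nil
```

The guard also catches a nil capacity (old agent payload) and falls through to the default branch instead of crashing.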

Degraded pool callout

For ONLINE pools with degraded_vdev_count == 0, do not render per-vdev detail — keep it simple. For anything else, render one line per non-ONLINE vdev below the detail line:

!  mirror-1  DEGRADED  r=0 w=0 cksum=12

Styled with the existing .callout.err class.
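Selecting and rendering the callout vdevs can be sketched as follows (keys are assumed to arrive as strings after the JSON round-trip; helper names hypothetical):

```elixir
defmodule CalloutSketch do
  # One callout per non-ONLINE vdev; ONLINE vdevs stay hidden.
  def problem_vdevs(pool) do
    pool
    |> Map.get("vdevs", [])
    |> Enum.reject(&(&1["state"] == "ONLINE"))
  end

  # Render one callout line in the format shown above.
  def callout_line(v) do
    "!  #{v["name"]}  #{v["state"]}  " <>
      "r=#{v["read_errors"]} w=#{v["write_errors"]} cksum=#{v["checksum_errors"]}"
  end
end
```

Map.get/3 with a [] default keeps old-agent payloads (no vdevs key) from crashing the view.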

Scrub rendering

  • scan_state == "SCANNING" → "scrub scanning" (no date).
  • scan_state == "FINISHED" and last_scrub_end present → "scrub finished #{format_date(last_scrub_end)}".
  • Otherwise → "scrub never".

last_scrub_end is a string like "Sat Apr 19 02:00:00 2026" — keep as-is or reformat to YYYY-MM-DD with a tiny helper (strptime isn't stdlib-trivial in Elixir; simplest: split on whitespace and reorder). Accept "as-is" if reformatting is ugly.
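The split-and-reorder idea can be sketched like this (module and helper names hypothetical; falls back to the raw string when the shape is unexpected, matching the "accept as-is" escape hatch above):

```elixir
defmodule ScrubSketch do
  # "Sat Apr 19 02:00:00 2026" -> "2026-04-19"; strptime isn't stdlib-trivial,
  # so split on whitespace and reorder the fields.
  @months %{"Jan" => "01", "Feb" => "02", "Mar" => "03", "Apr" => "04",
            "May" => "05", "Jun" => "06", "Jul" => "07", "Aug" => "08",
            "Sep" => "09", "Oct" => "10", "Nov" => "11", "Dec" => "12"}

  def scrub_label("SCANNING", _last), do: "scrub scanning"
  def scrub_label("FINISHED", last) when is_binary(last), do: "scrub finished #{format_date(last)}"
  def scrub_label(_state, _last), do: "scrub never"

  def format_date(str) do
    case String.split(str) do
      [_dow, mon, day, _time, year] when is_map_key(@months, mon) ->
        "#{year}-#{@months[mon]}-#{String.pad_leading(day, 2, "0")}"

      _other ->
        str  # unexpected shape: render as-is
    end
  end
end

ScrubSketch.format_date("Sat Apr 19 02:00:00 2026")  # => "2026-04-19"
ScrubSketch.scrub_label("SCANNING", nil)             # => "scrub scanning"
ScrubSketch.scrub_label(nil, nil)                    # => "scrub never"
```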

Risks

  • ZFS JSON output has changed shape between OpenZFS releases. The concept doc requires OpenZFS 2.3+. Agent code tolerates missing keys via Map.get/3 defaults — keep that discipline.
  • zpool status --json-flat-vdevs flattens nested mirrors-of-mirrors. Top-level vdevs are keyed by name; pool_type derivation inspects only top-level entries (no child vdev walking needed in the flat form).

Rollout

Additive collector changes + additive UI. No DB migration, no breaking payload change. Old agents without the new fields take the gracefully degraded path: pool_type renders blank, the scrub line falls back to never, and the capacity bar still renders from the existing byte fields.