# Deep Review: 20260503-142128-pr-348

| | |
|---|---|
| **Date** | 2026-05-03 14:35 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 4 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#348](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348) — rdd kubectl: embed kuberlr-style version resolver |
| **Branch** | `embed-kuberlr` |
| **Commits** | `563ff15` rdd kubectl: embed kuberlr-style version resolver |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — S3 (`docker_socket` missing from `ALL_KEYS`, `environment.md`, and the `cmd_service.md` example, raised by Codex) is the most actionable: tests and docs claim a complete key set but skip a key the implementation returns. S2 (BATS uses command substitution where the project explicitly requires `run -0`, raised by Codex and Gemini) violates a documented project rule. S5 (uppercase SHA-512 hex normalization, raised by Codex) is a real defensive gap an operator-configured mirror could trip. The `--as-user-extra` map gap (S6) and `serverVersion` ctx-propagation gap (S7) are both verified but narrow. Round 3's three Importants and four Suggestions all landed cleanly; no regressions. |
| **Wall-clock time** | `5 h 37 min 38 s` |

## Executive Summary

Round 3's three Importants (BATS apiserver-500 test counted via `>= 2 GET /version`, `--client=true` parsed via `boolFlagValue`, `genericclioptions.ConfigFlags`-backed `loadClientConfig`) and four Fixed Suggestions all landed cleanly. Round 4 surfaces no Critical or Important findings; the agents converge on minor gaps that fall into three buckets: (1) completeness gaps in the `rdd svc paths` story — `docker_socket` is missing from `ALL_KEYS` in `5-paths.bats`, from the `environment.md` shell-variable table, and from the `cmd_service.md` example output, and five remaining `config.json` references in `cmd_service.md` were not swept along with the line-154 fix; (2) defensive cleanup in the resolver — `--as-user-extra` is not in `kubectlGlobalFlagsTakingValue` (so spaced form makes `isClientOnly` mis-walk), `fetchSha512` accepts uppercase hex but never normalises it (case-sensitive comparison fails), `serverVersion` runs the discovery client without the caller's `ctx`; (3) a documented BATS style-guide violation in the new `version_requests=$(awk ...)` capture.

The Windows `os.Rename` race surfaced again from all three agents (Claude S4, Codex I1, Gemini C1) and was dropped per the prior-round-resolutions gate (declined Rounds 1, 2, and 3). Gemini also produced one outright hallucination (claimed BATS `@test` decorators were corrupted to `@pkg/util/tail/tail_test.go`; verified absent in every cited line) and one false positive (cited a fictitious KEP for relaxing `kubectl` skew to ±3, which contradicts Kubernetes' published version-skew policy of ±1 minor for kubectl).

Structure: 0 Critical · 0 Important · 10 Suggestions.

## Critical Issues

None.

## Important Issues

None.

## Suggestions

S1. **`docker_socket` missing from `ALL_KEYS` in `5-paths.bats`, from the `environment.md` shell-variable table, and from the `cmd_service.md` `rdd svc paths` example** — `bats/tests/10-cli/5-paths.bats:3`, `docs/design/cmd_service.md:148-156`, `docs/design/environment.md:46-55` [Codex GPT 5.5]

```bash
ALL_KEYS="args_file config dir kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
```

`instancePaths()` returns ten keys (`cmd/rdd/service_paths.go:21-33`): the nine in `ALL_KEYS` plus `docker_socket`. The BATS suite iterates `ALL_KEYS` to assert each key appears in `rdd svc paths` output, so `docker_socket` is silently exempt — a regression that drops `docker_socket` from the output ships green.

The same omission carries through both docs touched by this PR: `environment.md`'s table lists nine `RDD_*` rows (no `RDD_DOCKER_SOCKET`), and the `cmd_service.md` example block (line 148–156) shows nine table rows (no `docker_socket`). The PR added the new `kubectl_cache` row to all three but did not add the `docker_socket` row that the implementation has emitted since the docker-socket-address PR landed.

Fix: include `docker_socket` everywhere these lists claim to be complete.

```diff
-ALL_KEYS="args_file config dir kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
+ALL_KEYS="args_file config dir docker_socket kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
```

Add an `RDD_DOCKER_SOCKET` row to the `environment.md` table and a `docker_socket` row to the `cmd_service.md` example. The durable fix derives the BATS key list from `rdd svc paths` itself rather than restating it.

S2. **BATS test captures `version_requests` via command substitution where the project rule mandates `run -0`** — `bats/tests/10-cli/7-kubectl-cache.bats:216` [Codex GPT 5.5][Gemini 3.1 Pro]

```bash
version_requests=$(awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}")
[[ ${version_requests} -ge 2 ]] ||
    fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: $(cat "${LOG_FILE}")"
```

The project's BATS style (`deep-review-context.md` and `bats-style.md`) explicitly forbids command substitution for command output: "Use `run -0` instead of command substitution. `run -0` fails with a useful error (command, exit code, output) if the command returns non-zero. A bare substitution gives an empty variable and a cryptic failure later." The new assertion uses both the forbidden capture form and a second `$(cat …)` inside the `fail` argument.

Fix:

```diff
-    version_requests=$(awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}")
-    [[ ${version_requests} -ge 2 ]] ||
-        fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: $(cat "${LOG_FILE}")"
+    run -0 awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}"
+    version_requests=${output}
+    if [[ ${version_requests} -lt 2 ]]; then
+        run -0 cat "${LOG_FILE}"
+        fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: ${output}"
+    fi
```

S3. **`docs/design/cmd_service.md` retains five `config.json` references after the line-154 `config.yaml` fix** — `docs/design/cmd_service.md:15,41,78,80,87` [Claude Opus 4.7]

Round 3 fixed line 154 (the `rdd svc paths` example) per S1; the rest of the document still describes the file as `config.json`:

- Line 15 — file table: `` `config.json` | service config settings ``
- Line 41 — `rdd service create`: "Stores the configuration settings in the `config.json` file."
- Line 78 — `rdd service start`: "starts the apiserver and the controller-manager using the configuration found in `config.json`."
- Line 80 — same paragraph: "No controller should access `config.json` directly."
- Line 87 — `rdd service config`: "updates `config.json` with latest settings."

`pkg/instance/instance.go:91` returns `config.yaml`. Either rename all five at once, or leave the file alone — a partial rename causes new readers to chase the wrong filename.

Fix: apply the same rename at lines 15, 41, 78, 80, 87.

S4. **`fetchSha512` accepts uppercase hex but never normalises it; the case-sensitive `if got != want` rejects a valid digest from an uppercase-emitting mirror** — `pkg/kuberlr/downloader.go:138-156`, `pkg/kuberlr/downloader.go:110-112` [Codex GPT 5.5]

```go
for _, c := range digest {
    if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f' || c >= 'A' && c <= 'F') {
        return "", fmt.Errorf("malformed checksum from %s: non-hex character %q", url, c)
    }
}
return digest, nil
```

`fetchSha512` admits `A-F` in the hex validator and returns the raw digest. `download` computes `got := hex.EncodeToString(h.Sum(nil))` (lowercase by Go's spec) and compares with `if got != want`. A mirror populated by PowerShell `Get-FileHash` or any tool emitting uppercase hex passes validation, then surfaces a misleading "checksum mismatch" — exactly the misdirection the digest-format check was added to prevent.

Two coherent options: (a) reject uppercase in the validator (mirror operators normalise), or (b) accept and normalise in `fetchSha512`. (b) is the user-friendlier choice and a one-line edit.

```diff
-    digest := fields[0]
+    digest := strings.ToLower(fields[0])
```

The 128-hex length and character checks then run against the normalised form.

S5. **`isClientOnly` walks past `--as-user-extra` without consuming its value, so the spaced form mistakes `k=v` for the subcommand and triggers an unwanted probe** — `pkg/kuberlr/resolver.go:195-227`, `pkg/kuberlr/resolver.go:160-162` [Claude Opus 4.7]

```go
"--as":                    true,
"--as-group":              true,
"--as-uid":                true,
// --as-user-extra is absent
```

`genericclioptions.NewConfigFlags(true).AddFlags` binds `--as-user-extra` (a `StringArrayVar`, spaced form takes the next arg). The docstring at lines 160–162 names it explicitly:

```
// --insecure-skip-tls-verify, --request-timeout, --as, --as-group,
// --as-uid, --as-user-extra, --cache-dir, --disable-compression,
```

But the value-taking map omits it. `isClientOnly` walks args manually; `--as-user-extra k=v help` lands at `arg = "--as-user-extra"`, the map lookup misses, the walker advances by one (no value consumed), the next arg `k=v` becomes the subcommand candidate, and `clientOnlySubcommands["k=v"]` returns false — so the resolver probes the cluster for an invocation that should have short-circuited.

Net effect: a help / completion / config invocation that uses `--as-user-extra` in spaced form pays the 2 s probe cost. No correctness break (the kubectl child still receives the args correctly), but it regresses the Round 2 `isClientOnly` work for one flag.

Fix: add the entry.

```diff
     "--as-group":              true,
     "--as-uid":                true,
+    "--as-user-extra":         true,
     "--profile":               true,
```

The longer-term option is to drive `isClientOnly` from the same pflag set `parseKubeConfigFlags` already builds; see Design Observations.

S6. **`Resolve` accepts `ctx` but `serverVersion` ignores it; Ctrl+C during the discovery probe waits up to `serverProbeTimeout` (2 s) instead of cancelling immediately** — `pkg/kuberlr/resolver.go:65-93,121-144` [Claude Opus 4.7]

```go
func Resolve(ctx context.Context, args []string) (string, error) {
    ...
    server, ok, err := serverVersion(args)   // ctx dropped
    ...
    return ensureCached(ctx, server)         // ctx forwarded
}

func serverVersion(args []string) (_ semver.Version, ok bool, _ error) {
    cfg, err := loadClientConfig(args)
    ...
    cfg.Timeout = serverProbeTimeout
    info, err := client.ServerVersion()      // no ctx
}
```

`Resolve` forwards `ctx` only to `ensureCached`. The discovery client at line 133 uses `cfg.Timeout` to cap each request at 2 s, but does not honour `ctx.Done()`. Cancellation does not propagate while the probe is in flight; a user pressing Ctrl+C waits up to 2 s per kubectl invocation.

Bounded UX issue, not a correctness issue (the embedded kubectl runs after the probe times out either way). Low priority because the `ensureCached` path already honours `ctx` and the 2 s bound puts a ceiling on the wait.

Fix: thread `ctx` through and bind it to the request. The discovery client exposes the underlying REST client:

```diff
-func serverVersion(args []string) (_ semver.Version, ok bool, _ error) {
+func serverVersion(ctx context.Context, args []string) (_ semver.Version, ok bool, _ error) {
     cfg, err := loadClientConfig(args)
     ...
-    info, err := client.ServerVersion()
+    body, err := client.RESTClient().Get().AbsPath("/version").DoRaw(ctx)
+    var info apimachineryversion.Info
+    if err == nil { err = json.Unmarshal(body, &info) }
```

S7. **Comment in `fetchSha512` cites "the comparison at line 110"; the comparison sits at line 111 (line 110 is the `hex.EncodeToString` assignment)** — `pkg/kuberlr/downloader.go:131-137` [Claude Opus 4.7]

```go
// rejects non-128-hex tokens so a misconfigured mirror surfaces as a
// "malformed checksum" error rather than the misleading "checksum
// mismatch" the comparison at line 110 would otherwise produce.
```

Hard-coded line numbers in comments rot on the next edit (and have already drifted by one). Replace the locator with a name.

```diff
-// "malformed checksum" error rather than the misleading "checksum
-// mismatch" the comparison at line 110 would otherwise produce.
+// "malformed checksum" error rather than the misleading "checksum
+// mismatch" that download's digest comparison would otherwise produce.
```

S8. **`TestParseKubeConfigFlags` row for `--insecure-skip-tls-verify=false` does not exercise the binding; the assertion matches the pre-parse default** — `pkg/kuberlr/resolver_test.go:121` [Claude Opus 4.7]

```go
{"--insecure-skip-tls-verify=false", []string{"--insecure-skip-tls-verify=false", "get"}, want{insecure: false}},
```

`NewConfigFlags(true)` initialises `Insecure` to `*bool → false`. The test asserts the post-parse value equals `false` — the default. The row passes even when the flag binding is absent.

Two fixes: drop the row (the bare and `=true` rows already cover the binding), or force a non-default starting state and assert the parser flipped it back.

```go
// option 2: force a true → false transition
cf := parseKubeConfigFlagsFor("--insecure-skip-tls-verify=false", "get")
*cf.Insecure = true                  // would be overwritten by re-parse; not viable for table tests
```

The table-test shape favours dropping the row.

S9. **Aborted downloads leak `.kubectl-*` temp files in `CacheDir()`; `defer os.Remove` does not cover SIGKILL or hardware crash** — `pkg/kuberlr/downloader.go:96-101` [Gemini 3.1 Pro]

```go
tmp, err := os.CreateTemp(filepath.Dir(dst), ".kubectl-*")
if err != nil {
    return err
}
tmpName := tmp.Name()
defer os.Remove(tmpName)
```

Normal failure paths and orderly cancellation hit the `defer`; SIGKILL during a slow download or a sudden power loss leave a ~50 MB temp file behind. Each subsequent failure adds another. The eviction TODO in `cache.go:33-36` does not currently target temp leftovers either.

Fix: sweep `*.kubectl-*` from `CacheDir()` once at the start of `ensureCached`, or fold a sweep into the eventual eviction pass. Either fix is small enough to land alongside the eviction work; flag here so it is not forgotten.

S10. **`httpClient.Timeout = 5 minutes` covers headers, body, and redirects together; on a 100 kB/s link the body alone exceeds the cap** — `pkg/kuberlr/downloader.go:33-42` [Claude Opus 4.7]

```go
const downloadTimeout = 5 * time.Minute
var httpClient = &http.Client{Timeout: downloadTimeout}
```

The comment estimates "covers a ~50 MB kubectl binary on slow links (~170 kB/s)" — exactly five minutes at that rate, leaving zero headroom for TLS handshake or initial latency, and failing outright below that throughput. Round 3 added the deadline-composition note (a backstop). Round 4 raises the magnitude question: 5 minutes assumes a fast-enough link, and the user on a slow link sees a hard "context deadline exceeded" with no obvious diagnosis.

Two coherent options: (a) bump `downloadTimeout` to 10 minutes (covers ~85 kB/s links), or (b) move from `Client.Timeout` to a transport with `ResponseHeaderTimeout` so only stalls (not slow throughput) trip the cap. (b) is the durable fix; (a) is a one-character edit.

## Design Observations

### Concerns

- **Hand-rolled subcommand walker duplicates the bound pflag set (future)** [Claude Opus 4.7]. `parseKubeConfigFlags` already binds the kubectl global flags via `genericclioptions.ConfigFlags.AddFlags`; after `fs.Parse(args)`, `fs.Args()` returns the positional tokens whose first entry is the subcommand. `isClientOnly` re-walks the args manually with a hand-maintained value-taking flag map (the source of S5). Driving `isClientOnly` from the same pflag set eliminates both the map and the S5 class of bug. The trade-off: pflag with `UnknownFlags=true` may consume an unknown spaced-form flag's value as a positional token, so the walker would need a small validation step before extracting the subcommand. Out of scope for this PR; flagging as the durable follow-up.

- **Manual kubectl flag-coverage discipline is recurring (future, persistent)** [Claude Opus 4.7]. Round 2's I1 added cluster-selection flags to `extractKubeFlags`. Round 3's I3 added the auth/TLS flags via `genericclioptions.ConfigFlags`. Round 4's S5 finds another spaced-form omission (`--as-user-extra`). Each round closes one drift, but the gap is the inventory itself: the resolver maintains its own list of value-taking flags in parallel to kubectl's bound set. The same Design Observation appeared in Round 3 ("manual kubectl flag parsing carries drift risk"); the persistence makes the case for the pflag-driven walker concrete.

### Strengths

- **Round 3 Important fixes landed without regression** [Claude Opus 4.7][Codex GPT 5.5]. The Round 3 I1 BATS-test fix (`>= 2 GET /version` count) discriminates probe-ran from probe-skipped; the Round 3 I2 `boolFlagValue` helper unifies `--client=` and `--help=` parsing with last-wins semantics; the Round 3 I3 `genericclioptions.ConfigFlags`-backed `loadClientConfig` aligns the resolver's flag binding with kubectl's by construction.
- **`parseKubeConfigFlags` test coverage is now exhaustive** [Claude Opus 4.7]. `TestParseKubeConfigFlags` covers spaced/equals/aliases for every bound flag plus the auth/TLS surface, the `--` separator, the "later flag wins" semantics, and the "flag at end without value" edge.
- **Recursion guard via env var (`RDD_KUBECTL_RESOLVED`)** [Claude Opus 4.7]. The simplest mechanism that survives `syscall.Exec` on Unix and `CreateProcess` on Windows; documented in `environment.md` as an internal var so process-tree readers can identify it.
- **Cache layout anticipates future cache consumers** [Claude Opus 4.7]. The `<root>/kubectl/<os>-<arch>/` shape under `RDD_CACHE_DIR` gives a single override surface for tests and offline operation without coupling kuberlr to its hypothetical siblings (k3s images, Lima templates).

## Testing Assessment

Coverage gaps surfaced this round, ordered by user-reachable risk:

1. **`docker_socket` not asserted in `5-paths.bats`** (S1). The BATS suite iterates `ALL_KEYS` to verify every key appears in `rdd svc paths` output. `docker_socket` is missing from the list, so a regression that drops the key from `instancePaths()` ships green.
2. **`--as-user-extra` parsing path unverified** (S5). `TestParseKubeConfigFlags` does not exercise `--as-user-extra`, and `TestIsClientOnly` does not cover any case using it. A row in each table closes the loop alongside the map fix.
3. **`--insecure-skip-tls-verify=false` row asserts the default** (S8). The pre-parse and post-parse states match (`*bool → false`); the test passes even when the binding is absent. Drop the row, or force a `true → false` transition.
4. **Uppercase SHA-512 mirror response untested** (S4). No fixture serves uppercase hex; the validator accepts it and the comparison rejects it without any test catching the inconsistency.
5. **Concurrent first-use cache miss untested** (declined Rounds 1-3, surfaced again this round by all three agents). Acknowledged limitation; the prior-round-resolutions gate dropped it.
6. **End-to-end Windows execution path** (`pkg/kuberlr/exec_windows.go`). PR description gates the merge on Windows BATS support arriving; the path remains compile-only.

Verification runs this round: Codex ran `make build-rdd` and `go test ./pkg/kuberlr` in its worktree (both passed). Claude verified the round 3 fixes against the new files. Gemini ran the discovery walk but did not execute tests.

## Documentation Assessment

- `docs/design/cmd_service.md` example block (line 154) was updated to `config.yaml` per Round 3 S1; five other references to `config.json` in the same file remain (S3). The same file's `rdd svc paths` example also misses the `docker_socket` row (S1).
- `docs/design/environment.md` documents `RDD_CACHE_DIR`, `RDD_KUBECTL_MIRROR`, `RDD_KUBECTL_CACHE`, and the internal `RDD_KUBECTL_RESOLVED`. The shell-variable table at lines 46–55 omits `RDD_DOCKER_SOCKET` (S1).
- `parseKubeConfigFlags` docstring at `pkg/kuberlr/resolver.go:155-173` enumerates the bound flag surface explicitly and explains why `--username` / `--password` stay unbound. The enumeration includes `--as-user-extra` — which is exactly how Claude spotted the missing map entry (S5). The docstring is doing review-grade work; preserve it.
- `Resolve` docstring at `pkg/kuberlr/resolver.go:43-64` notes the deprecated `--username` / `--password` carve-out and the offline TODO. Complete.

## Commit Structure

Single commit `563ff15`. Message describes the change and matches the diff. Clean.

## Acknowledged Limitations

- **Cache eviction not implemented** — `pkg/kuberlr/cache.go:33-36`. The `TODO(eviction)` comment carries forward; the design observation on the cache becoming load-bearing stands.
- **Offline mode not implemented** — `pkg/kuberlr/resolver.go:61-64`, `pkg/kuberlr/downloader.go:46-47`. TODOs cover the planned env-var/config opt-out for download attempts.
- **Windows BATS host deferred** — `bats/tests/10-cli/7-kubectl-cache.bats:14-20`. PR description gates the merge on Windows BATS support; the resolver's Windows-specific code paths (Exec, the rename race that prior rounds declined) remain compile-only.
- **`local_teardown_file()` deviation from the BATS convention** — `bats/tests/10-cli/7-kubectl-cache.bats:90-98`. Inline rationale stands.
- **Falling back to embedded on any cluster-probe failure** — `pkg/kuberlr/resolver.go:113-144`. Each `nilerr` branch is documented; the design choice is sound.
- **Embedded version unparseable in dev builds** — `pkg/kuberlr/resolver.go:73-80`. `go run ./cmd/rdd kubectl ...` and IDE debug builds skip the Makefile's `-ldflags -X`; falling through to embedded kubectl rather than breaking every dev invocation is the right call.
- **`--username` / `--password` deliberately unbound on the resolver side** — `pkg/kuberlr/resolver.go:165-173`. Binding them via `WithDeprecatedPasswordFlag` would force the interactive kubeconfig loader; the resolver's silent fall-through on probe failure covers the asymmetry.

### Declined in prior rounds

- **Concurrent first-use Windows `os.Rename` race** — declined Round 1 (R1 raw I1, dropped as false positive), Round 2 (R2 raw I2/S5), and Round 3 (Round 3 R3 dropped per gate). Raised this round by all three agents (Claude S4, Codex I1, Gemini C1); dropped per the resolutions gate. The recurring re-raises strengthen the case for a follow-up TODO documenting the Windows file-mapping subtlety, but the underlying call site is unchanged.
- **`CacheDir()` panics when `os.UserCacheDir()` fails** — declined Round 1 (R1 S11), Round 2 (R2 raw S1), and Round 3 (raised by Gemini, dropped). Failure mode requires neither `$HOME` nor `%LocalAppData%` to resolve.
- **`RDD_CACHE_DIR` scope rename** — declined Round 1 (R1 S5): deliberate forward-looking knob.
- **Server `log.Fatal*` bypassing `defer logFD.Close()`** — declined Round 1 (R1 S10): test fixture; OS reclaims fd.
- **Pure-Go unit coverage thin on download/cache machinery** — declined Round 2 (R2 S5): BATS suite covers the integration concerns.
- **BATS coverage for HTTP timeout/hang, network cancel, Windows Exec, sha512sum-format mirror, trailing-slash mirror, default-log-level message capture, unreachable-server probe** — declined Rounds 1-3.
- **`info.GitVersion` empty/unknown unit test** — declined Round 1: requires mocked discovery client.
- **Resolver-policy summary and recursion-guard rationale duplicated across files** — declined Round 1: each duplicate serves a different audience.
- **`SkipResolver` swallows `os.Setenv` error** — declined Round 3 (R3 S3): stylistic only; constant env name and value cannot fail.
- **Cache the server-version probe across invocations** — declined Round 3 (R3 S6): performance optimisation; explicit follow-up.
- **Move package doc-comment from `cache.go` to `resolver.go`** — declined Round 3 (R3 S7): cosmetic; `go doc` and pkg.go.dev render either location equally.
- **Drop `is_macos` branch in `write_kubectl_sha512`** — declined Round 3 (R3 S8): both branches produce identical output; conditional is harmless.

## Agent Performance Retro

### [Claude]

Claude carried this round's breadth: two findings the orchestrator downgraded from Important to Suggestion (S5 `--as-user-extra`, S6 ctx propagation) plus three pure Suggestions covering test quality, doc drift, and a comment off-by-one. The S5 finding required reading both the docstring at lines 160-162 and the value-taking map at lines 195-227 against the current cli-runtime binding — the kind of cross-reference reading that catches the "documented but not implemented" gap. The S6 ctx-propagation finding is genuine but bounded by the 2 s probe timeout; calling it Important was over-calibrated. Claude's S4 (Windows `os.Rename` race) is the third consecutive re-raise from Claude on a finding declined in three prior rounds; the agent has no visibility into the resolutions table, which is by design, but the pattern is worth noting.

### [Codex]

Codex produced the round's most actionable finding: S1 (`docker_socket` missing across BATS, environment.md, and cmd_service.md) is a multi-site completeness gap that the other two agents missed entirely. The fix is mechanical, but the gap is the kind of regression a release would ship with no ill effects until someone tries to read the docs. Codex also caught S4 (uppercase hex normalisation) — a precise read of the case-sensitive comparison against the case-tolerant validator — and S2 (BATS command substitution) independently of Gemini. Tight one-Important / three-Suggestion shape, zero false positives. Best severity calibration of the round on the wins; only miss is re-raising the Windows rename race (declined three prior rounds).

### [Gemini]

Gemini's round was mixed. C2 ("corrupted BATS test decorators replaced by `@pkg/util/tail/tail_test.go`") is a complete hallucination — every line cited still uses `@test`, and `grep -r "@pkg/util/tail" bats/` returns nothing. S2 ("relax skew to ±3 for K8s ≥ 1.28") cites a nonexistent KEP-3335 against Kubernetes' published version-skew policy, which keeps kubectl at ±1. C1 (Windows rename race) is the same declined-in-three-prior-rounds finding the other two agents also raised. The win this round: I1 (BATS command substitution) caught the same project-style violation as Codex independently; downgraded to S2 (BATS) on consolidation. S1 (tmp file leak on SIGKILL) is a legitimate defensive observation that the other agents missed. Two real findings out of five; the C2 hallucination is the most concerning quality issue across all three agents this round.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 9m 46s | 9m 54s | — |
| Findings | 6S | 3S | 2S |
| Tool calls | 44 (Bash 24, Read 20) | 77 (shell 70, stdin 4, plan 3) | — |
| Design observations | 1 | 1 | 1 |
| False positives | 0 | 0 | 2 |
| Unique insights | 5 | 2 | 1 |
| Files reviewed | 18 | 18 | 18 |
| Coverage misses | 4 | 2 | 4 |
| **Totals** | **6S** | **3S** | **2S** |
| Downgraded | 2 (I→S) | 0 | 1 (I→S) |
| Dropped | 1 | 1 | 1 |


#### Reconciliation

- Claude I1 ctx propagation: important → suggestion S6 (UX gap bounded by the 2 s probe timeout, not a correctness issue).
- Claude I2 `--as-user-extra` missing: important → suggestion S5 (narrow case — spaced form on a client-only command pays an unwanted probe; does not affect kubectl correctness).
- Claude S4 Windows `os.Rename` race: suggestion → dropped (declined Rounds 1, 2, 3).
- Codex I1 Windows concurrent rename: important → dropped (declined Rounds 1, 2, 3).
- Gemini C1 Windows rename race: critical → dropped (declined Rounds 1, 2, 3; same finding as above).
- Gemini C2 corrupted BATS decorators: critical → dropped (false positive — all cited lines verified to use `@test`; no `@pkg/util/tail` string anywhere in the repo).
- Gemini I1 BATS command substitution: important → suggestion S2 (style violation, not a correctness break; merged with Codex S2).
- Gemini S2 expand skew to ±3: suggestion → dropped (false positive — kubectl version-skew policy is ±1; the cited KEP-3335 does not address kubectl skew).

Codex provided the most signal-per-finding (S1 multi-site docker_socket gap is the round's most actionable finding); Claude provided the most coverage (5 of 8 retained findings); Gemini contributed one defensive observation (S9 tmp leak) and one duplicate catch (S2 BATS style). The two false positives from Gemini and the round-on-round Windows rename race re-raises consume reviewer attention without producing actionable signal.

Net Round 4 contribution after the resolutions gate and severity reconciliation: 0 Critical, 0 Important, 10 Suggestions, 2 design observations.

## Review Process Notes

### Skill improvements

- **Add a hallucination-defence dimension to the prompt's instructions block.** When a finding describes file contents that would constitute a tool-breaking change (BATS syntax error, malformed Go file, corrupted YAML), the agent is making a high-confidence claim about bytes on disk. Rule: before reporting any finding whose existence depends on a parser-breaking malformation, read the raw file at the cited line and confirm the malformation is present. Three independent models flagging the same false positive does not save the reviewer time; one model's confident hallucination still consumes a verification round. Add to `dimensions.md` or `instructions.md`: "When a finding alleges a file is malformed in a way that would break the language's parser (syntax error, corrupted decorator, invalid header), read the cited bytes before reporting. A confidently wrong claim about file contents is the most expensive false positive class."

### Repo context updates

None this round. The existing `deep-review-context.md` content carried Codex S3 (`docker_socket` completeness) into reach because the BATS style guide is already inlined; the remaining gaps are PR-specific or pre-existing convention work that already appears in earlier rounds' notes.

## Appendix: Original Reviews

### [Claude] — Pass 1

Now I have enough context to write the review. Let me also do a quick check on whether `kubectl version` properly distinguishes from the resolver's own probe.

I have enough context. Let me write the review.

# Review: PR #348 — rdd kubectl: embed kuberlr-style version resolver

| | |
|---|---|
| **Verdict** | Approve with minor suggestions |
| **Branch** | `kuberlr-resolver` (single commit `563ff15`) |
| **Risk** | Low for `rdd ctl` (unchanged via `SkipResolver()`); moderate for `rdd kubectl` against external clusters (new download/exec path) |

## Executive Summary

The PR adds an in-tree port of `flavio/kuberlr` so `rdd kubectl` execs a version-matched kubectl when the embedded one is outside ±1 minor of the target cluster. The implementation is well-factored, flag-precise, and backed by a thoughtful BATS fixture (paired Go module). Findings are minor — a couple of correctness gaps in flag-list maintenance, a brittle line-numbered comment, a test that doesn't exercise what it claims, and a documentation drive-by that left siblings inconsistent.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `serverVersion` does not propagate the caller's context** — `pkg/kuberlr/resolver.go:121-144`

```go
func serverVersion(args []string) (_ semver.Version, ok bool, _ error) {
    cfg, err := loadClientConfig(args)
    ...
    cfg.Timeout = serverProbeTimeout
    client, err := discovery.NewDiscoveryClientForConfig(cfg)
    ...
    info, err := client.ServerVersion()
```

`Resolve` accepts `ctx context.Context` (line 65) and forwards it only to `ensureCached` (line 92); the probe at line 133 runs with the discovery client's internal context bounded only by `cfg.Timeout = 2s` (line 127). If the user hits Ctrl+C while the probe is in flight, cancellation does not propagate — the user keeps waiting up to two seconds per kubectl invocation. Minor but real for slow links and exec-credential auth providers (AWS, OIDC) that themselves can stall.

The discovery client builds an HTTP transport from `cfg`; `cfg.Timeout` translates to `restConfig.Timeout` and bounds each request, but it does not respond to `ctx.Done()`.

Fix: thread `ctx` through and use it for the request bound. `RESTClient().Get().AbsPath("/version").DoRaw(ctx)` plus a manual unmarshal achieves what `ServerVersion()` does without losing context.

```diff
-func serverVersion(args []string) (_ semver.Version, ok bool, _ error) {
+func serverVersion(ctx context.Context, args []string) (_ semver.Version, ok bool, _ error) {
     cfg, err := loadClientConfig(args)
     ...
-    info, err := client.ServerVersion()
+    body, err := client.RESTClient().Get().AbsPath("/version").DoRaw(ctx)
+    if err != nil { ... }
+    var info apimachineryversion.Info
+    if err := json.Unmarshal(body, &info); err != nil { ... }
```

(important, gap)

---

**I2. `--as-user-extra` missing from `kubectlGlobalFlagsTakingValue`** — `pkg/kuberlr/resolver.go:195-227`

```go
"--as":                    true,
"--as-group":              true,
"--as-uid":                true,
// --as-user-extra is not present
```

`genericclioptions.ConfigFlags.AddFlags` binds `--as-user-extra` (a `StringArrayVar`, spaced form takes the next arg). The docstring at lines 161-162 even names it explicitly:

> `--as-uid, --as-user-extra, --cache-dir, --disable-compression,`

`isClientOnly` walks args without consulting the bound flag set, so the spaced form `kubectl --as-user-extra k=v get pods` slips past the map check at line 300, the value `k=v` is mistaken for the subcommand, and `clientOnlySubcommands["k=v"]` returns false. Net effect: the resolver probes the cluster when it should have fallen through (e.g. `kubectl --as-user-extra k=v help` triggers a probe). Doesn't break anything — kubectl still receives the args correctly — but it does an unwanted network round trip for a help invocation.

Fix: add the entry.

```diff
     "--as-group":              true,
     "--as-uid":                true,
+    "--as-user-extra":         true,
     "--profile":               true,
```

There is a longer-term design concern here too: the map is hand-maintained against the upstream `AddFlags` binding. The same pflag set already exists in `parseKubeConfigFlags` — `fs.Args()` after parsing returns the positional args and `isClientOnly` could read the subcommand from there instead of re-walking. See Design Observations.

(important, gap)

### Suggestions

**S1. `fetchSha512` comment hard-codes a line number that is already off-by-one** — `pkg/kuberlr/downloader.go:131-137`

```go
// rejects non-128-hex tokens so a misconfigured mirror surfaces as a
// "malformed checksum" error rather than the misleading "checksum
// mismatch" the comparison at line 110 would otherwise produce.
```

The comparison referenced is `if got != want` at line 111, not line 110 (line 110 is `got := hex.EncodeToString(h.Sum(nil))`). Hard-coded line numbers in comments rot on every refactor. Replace the locator with a name.

```diff
-// "malformed checksum" error rather than the misleading "checksum
-// mismatch" the comparison at line 110 would otherwise produce.
+// "malformed checksum" error rather than the misleading "checksum
+// mismatch" that download's digest comparison would otherwise produce.
```

(suggestion, gap)

---

**S2. `TestParseKubeConfigFlags` does not actually test `--insecure-skip-tls-verify=false`** — `pkg/kuberlr/resolver_test.go:121`

```go
{"--insecure-skip-tls-verify=false", []string{"--insecure-skip-tls-verify=false", "get"}, want{insecure: false}},
```

`NewConfigFlags(true)` initializes `Insecure` to `*bool → false`. The test asserts the post-parse value equals `false`, but that's also the pre-parse value, so the assertion would pass even if the flag binding were absent. To confirm `=false` actually parses, force a non-default starting state, or assert the flag pointer was touched.

Easiest fix: drop the case and trust the `=true` and bare cases to cover the binding. If you want to keep it, prove the binding by setting `cf.Insecure` to `true` before parse and asserting it became `false`.

(suggestion, gap)

---

**S3. `cmd_service.md` partially renames `config.json` → `config.yaml`** — `docs/design/cmd_service.md:154` (changed) vs lines 15, 41, 78, 80, 87 (still wrong)

The PR fixed the `rdd svc paths` example output (line 154) to say `config.yaml`, matching what `pkg/instance/instance.go:91` actually returns. Five other references to `config.json` in the same doc remain, written when the file used `.json`. The file-rename happened in an earlier commit; this PR partially addressed the doc drift but stopped short.

```diff
-| `config.json` | service config settings (written by `rdd service create` or `rdd service start`)
+| `config.yaml` | service config settings (written by `rdd service create` or `rdd service start`)
```

Apply the same rename at lines 41, 78, 80, 87.

(suggestion, gap)

---

**S4. Concurrent `rdd kubectl` invocations may race on Windows rename** — `pkg/kuberlr/downloader.go:117`

```go
return os.Rename(tmpName, dst)
```

Two simultaneous `rdd kubectl` invocations with no cached binary both reach `download`, both compute the same `dst`, both succeed checksum, both call `Rename`. On Linux this is a benign last-writer-wins. On Windows, `os.Rename` uses `MoveFileEx` with `MOVEFILE_REPLACE_EXISTING`, but if the first process has already exec'd the target binary, the second `Rename` can fail with `ERROR_ACCESS_DENIED` because the destination file is held open by the running kubectl child. The user sees a confusing "resolving kubectl version: rename ...: Access is denied" for what looks like a no-op.

Mitigation is inexpensive: re-`Stat` `dst` after computing the digest; if it now exists and (optionally) matches our digest, drop the temp and return success.

```diff
     if got != want {
         return fmt.Errorf("checksum mismatch for %s: want %s, got %s", base, want, got)
     }
+    if _, err := os.Stat(dst); err == nil {
+        return nil // someone else won the race
+    }
     if err := os.Chmod(tmpName, 0o755); err != nil {
```

Probably rare in practice (cache misses are once per minor-version per platform), so suggestion-level.

(suggestion, regression)

---

**S5. `httpClient.Timeout = 5min` covers headers + body together** — `pkg/kuberlr/downloader.go:33-42`

```go
const downloadTimeout = 5 * time.Minute
var httpClient = &http.Client{Timeout: downloadTimeout}
```

The comment notes "still covers a ~50 MB kubectl binary on slow links (~170 kB/s)" — that math gives exactly 5 minutes, leaving zero headroom for TLS handshake or initial latency. A user on a 100 kB/s link would hit the cap mid-stream and see a hard failure. The contract for `http.Client.Timeout` says it includes connection time, redirects, and reading the response body — not just the header.

Suggestion: bump to 10 minutes, or move from `Client.Timeout` to a `Transport` with `ResponseHeaderTimeout` so only stalls (not slow throughput) trip the cap.

(suggestion, enhancement)

## Design Observations

### Concerns

**Hand-rolled subcommand walker duplicates the bound pflag set** *(future)*

`parseKubeConfigFlags` already binds the kubectl global flags via `genericclioptions.ConfigFlags.AddFlags`, and after `fs.Parse(args)`, `fs.Args()` returns positional tokens — the first positional is the subcommand. `isClientOnly` re-walks the args manually with a hand-maintained value-taking flag map (the source of finding I2). Reusing the FlagSet's understanding eliminates the map and the I2 class of bug. The catch is that pflag with `UnknownFlags=true` may consume an unknown spaced-form flag's value as a positional token; verify behavior before adopting this. For this PR, the current shape is fine; flagging for a follow-up.

```go
// rough sketch
fs := pflag.NewFlagSet(...)
fs.ParseErrorsAllowlist.UnknownFlags = true
NewConfigFlags(true).AddFlags(fs)
_ = fs.Parse(args)
positional := fs.Args()
if len(positional) == 0 { return true }
subcommand := positional[0]
```

### Strengths

- Cache layout (`<root>/kubectl/<os>-<arch>/`) anticipating future cache consumers (k3s images, Lima templates) is well chosen — the env-var contract gives a single override surface for tests and offline operation without the resolver knowing about siblings.
- Treating probe failure (`ok=false, nil error`) and download failure (loud error) as distinct categories matches the user-visible policy described in the PR body — embedded fallback is silent for "I don't know," loud for "I tried and failed."
- The fake-kube fixture as a separate Go module is the right call: it keeps test-only code out of the main `go.mod` graph and out of the rdd binary, while sitting next to the test that uses it.
- Recursion guard via env var (`RDD_KUBECTL_RESOLVED`) is the simplest mechanism that survives `exec` on Unix and `CreateProcess` on Windows; documented in `environment.md` as an internal var so a developer reading process trees can identify it.

## Testing Assessment

Coverage is solid for what's tested:
- `TestCompatible` covers the skew rule with vendor suffixes (EKS/GKE/AKS).
- `TestParseKubeConfigFlags` covers spaced/equals/aliases for every bound flag.
- `TestIsClientOnly` is exhaustive on `--help`/`--client` semantics and equals-form parsing.
- BATS `7-kubectl-cache.bats` exercises download-on-miss, skip-on-hit, mirror-404, sha-mismatch, and apiserver-500 fall-through with explicit asserts on the request log.

Gaps, ordered by risk:

1. **Windows execution path** — `pkg/kuberlr/exec_windows.go` has no automated coverage. The PR description notes this explicitly ("Do not merge until BATS works on Windows CI"). The `os.Exit(exitCode)` branch is untested anywhere.
2. **Concurrent invocations** — neither unit nor BATS tests run two `rdd kubectl` against the same cold cache. Linked to S4.
3. **`--as-user-extra` parsing path** — see I2; both `parseKubeConfigFlags` (which does bind it) and `isClientOnly` (which does not) lack coverage of that flag.
4. **Exec-credential auth providers** — kubeconfigs with `exec:` blocks (AWS aws-iam-authenticator, OIDC) trigger a child process during the probe. Bounded by `cfg.Timeout=2s` but no test confirms that.
5. **Body-size cap (256 MiB)** — `streamGet`'s `io.LimitReader` cap has no integration test; a corrupted mirror serving an oversized payload would silently truncate, then the SHA mismatch trips. Works by construction but untested.

## Documentation Assessment

- `docs/design/environment.md` cleanly documents `RDD_CACHE_DIR`, `RDD_KUBECTL_MIRROR`, and the internal `RDD_KUBECTL_RESOLVED`. Good.
- `docs/design/cmd_service.md` partially fixes `config.json` → `config.yaml` (S3) — the rest of the file still says `config.json`.
- Package doc on `cache.go` clearly states the cache layout and the eviction TODO.
- `parseKubeConfigFlags` docstring lists the bound flag surface explicitly — that's exactly the kind of reference that catches bugs like I2 next time around.

## Commit Structure

Single commit `563ff15`. Message is descriptive and matches the change. Clean.

## Acknowledged Limitations

- **`pkg/kuberlr/cache.go:33-36`** — TODO(eviction): "the cache only grows; a user switching across many remote clusters accumulates ~50 MB per minor version indefinitely." Sound deferral; bound is small in practice.
- **`pkg/kuberlr/downloader.go:46-47`** — TODO(offline): "pair this with a 'may we download?' toggle so air-gapped users can pre-populate CacheDir() and forbid network fetches." Mentioned in PR body and in `Resolve`'s docstring at lines 61-64. Acceptable to land separately.
- **`bats/tests/10-cli/7-kubectl-cache.bats:14-20`** — Windows BATS skip with comment "fake-kube wiring on Windows BATS host deferred." PR body explicitly calls out "⚠️ Do not merge until BATS works on Windows CI ⚠️" — author has flagged this as a gating item, so I am not raising it as a finding.
- **`bats/tests/10-cli/7-kubectl-cache.bats:90-94`** — `local_teardown_file` deviates from the project's "no `local_teardown_file`" convention. The author quotes the rule and explains the deviation: an HTTP server bound to an ephemeral port leaks ports if not killed. Justification holds.
- **`pkg/kuberlr/resolver.go:73-80`** — Embedded version unparseable (dev builds without `-ldflags -X`) silently falls through to embedded kubectl. Documented inline; correct behavior for `go run` / IDE workflows.
- **`pkg/kuberlr/resolver.go:165-173`** — `--username` / `--password` deliberately unbound on the resolver side (would force the interactive kubeconfig loader). Documented at length.

## Coverage Summary

| File | Status |
|---|---|
| `.github/dependabot.yml` | Reviewed, no issues |
| `bats/tests/10-cli/5-paths.bats` | Trivial (key list extension) |
| `bats/tests/10-cli/7-kubectl-cache.bats` | See Acknowledged Limitations (Windows skip, `local_teardown_file`); otherwise clean |
| `bats/tests/10-cli/fake-kube/README.md` | Reviewed, no issues |
| `bats/tests/10-cli/fake-kube/go.mod` | Reviewed, no issues |
| `bats/tests/10-cli/fake-kube/kubectl/main.go` | Reviewed, no issues |
| `bats/tests/10-cli/fake-kube/server/main.go` | Reviewed, no issues |
| `cmd/rdd/kubectl.go` | Reviewed, no issues |
| `cmd/rdd/service_paths.go` | Reviewed, no issues |
| `docs/design/cmd_service.md` | S3 |
| `docs/design/environment.md` | Reviewed, no issues |
| `go.mod` | Reviewed, no issues (cli-runtime promoted from indirect, blang/semver promoted from indirect) |
| `pkg/kuberlr/cache.go` | Reviewed, no issues |
| `pkg/kuberlr/downloader.go` | S1, S4, S5 |
| `pkg/kuberlr/exec_unix.go` | Reviewed, no issues |
| `pkg/kuberlr/exec_windows.go` | Reviewed, no issues (untested — see Testing Assessment #1) |
| `pkg/kuberlr/resolver.go` | I1, I2 |
| `pkg/kuberlr/resolver_test.go` | S2 |

### [Codex] — Pass 1

### Executive Summary

This PR adds a kubectl version resolver with cache-backed downloads, checksum verification, and integration coverage through a fake apiserver/mirror. The resolver design is mostly sound, but cache publication has a Windows race that can make one of two concurrent first-use commands fail, and a few test/docs details need tightening.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Concurrent cache misses can fail on Windows** — `pkg/kuberlr/downloader.go:117` (important, regression)

```go
if err := os.Chmod(tmpName, 0o755); err != nil {
	return err
}
return os.Rename(tmpName, dst)
```

`ensureCached` only checks for an existing cache entry before downloading at `pkg/kuberlr/downloader.go:60`, then calls `download` at `pkg/kuberlr/downloader.go:68`. If two `rdd kubectl` processes miss the same version at the same time, both download to separate temp files and both try to publish to the same `dst` at line 117. On Unix the second rename replaces the first cache entry, but on Windows `os.Rename` fails when the destination exists, so the second command returns `resolving kubectl version: ...` even though a valid cache entry now exists.

Fix: handle the “another process published it” state after `download` fails, or add a per-version lock. The low-impact fix is to re-stat the final path before returning the error:

```diff
 if err := download(ctx, want, path); err != nil {
+	if _, statErr := os.Stat(path); statErr == nil {
+		return path, nil
+	}
 	return "", err
 }
```

#### Suggestions

S1. **Uppercase SHA512 digests are accepted but then fail comparison** — `pkg/kuberlr/downloader.go:147-156` (suggestion, regression)

```go
digest := fields[0]
...
for _, c := range digest {
	if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f' || c >= 'A' && c <= 'F') {
		return "", fmt.Errorf("malformed checksum from %s: non-hex character %q", url, c)
	}
}
return digest, nil
```

`fetchSha512` accepts uppercase hex at line 152, but returns the digest unchanged at line 156. The computed hash is produced by `hex.EncodeToString` at `pkg/kuberlr/downloader.go:110`, which is lowercase, and the comparison at `pkg/kuberlr/downloader.go:111` is case-sensitive. A mirror populated from tools that emit uppercase hashes, for example PowerShell `Get-FileHash`, will fail with a false checksum mismatch.

Fix:

```diff
-	digest := fields[0]
+	digest := strings.ToLower(fields[0])
```

S2. **BATS assertion uses command substitution instead of `run -0`** — `bats/tests/10-cli/7-kubectl-cache.bats:216` (suggestion, regression)

```bash
version_requests=$(awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}")
[[ ${version_requests} -ge 2 ]] ||
    fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: $(cat "${LOG_FILE}")"
```

The new test uses command substitution at line 216 for a value that drives the assertion. This loses BATS’ command/status/output diagnostics if `awk` fails, and it goes against the repo’s BATS style for captured command output.

Fix:

```diff
-    version_requests=$(awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}")
-    [[ ${version_requests} -ge 2 ]] ||
-        fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: $(cat "${LOG_FILE}")"
+    run -0 awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}"
+    version_requests=${output}
+    if [[ ${version_requests} -lt 2 ]]; then
+        run -0 cat "${LOG_FILE}"
+        fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: ${output}"
+    fi
```

S3. **The updated svc paths lists still omit `docker_socket`** — `bats/tests/10-cli/5-paths.bats:3` (suggestion, gap)

```bash
ALL_KEYS="args_file config dir kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
```

`instancePaths` includes `docker_socket` at `cmd/rdd/service_paths.go:31` and the new `kubectl_cache` key at `cmd/rdd/service_paths.go:32`, but the updated `ALL_KEYS` list at line 3 still omits `docker_socket`. The updated docs have the same gap: the `rdd svc paths` example added `kubectl_cache` at `docs/design/cmd_service.md:156`, and the shell-variable table added `RDD_KUBECTL_CACHE` at `docs/design/environment.md:50`, but neither includes the existing Docker socket key. The tests and docs therefore still do not match actual `rdd svc paths` output.

Fix: include `docker_socket` / `RDD_DOCKER_SOCKET` everywhere these lists claim to be complete.

```diff
-ALL_KEYS="args_file config dir kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
+ALL_KEYS="args_file config dir docker_socket kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
```

### Design Observations

**Strengths** — The resolver reuses `genericclioptions.ConfigFlags` instead of hand-parsing kubeconfig semantics, which keeps the probe aligned with kubectl’s cluster-selection behavior. The download path verifies into a temp file before publishing, so failed downloads and checksum mismatches do not leave the final cache filename behind.

### Testing Assessment

1. Windows resolver behavior is not exercised because `7-kubectl-cache.bats` skips on Windows at `bats/tests/10-cli/7-kubectl-cache.bats:15-19`; this leaves I1 untested.
2. There is no concurrent cache-miss test for two commands resolving the same version at the same time; this is the path that exposes I1.
3. Alternate checksum formats are only tested with lowercase `sha512sum` / `shasum`; uppercase-but-valid SHA512 output is untested, which covers S1.
4. `svc paths` tests still assert a manually duplicated subset of keys rather than the exact key set, which leaves S3 easy to miss.

### Documentation Assessment

The new kubectl cache documentation is present, but the `svc paths` documentation should be reconciled with actual output as described in S3.

### Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/cache.go:33` notes that “the cache only grows.” This is an accepted storage-growth limitation and is not made more likely by the specific resolver logic beyond adding the cache itself.
- **Code comment**: `pkg/kuberlr/downloader.go:46` and `pkg/kuberlr/resolver.go:61` acknowledge that offline mode is still TODO. The current failure-on-download-error policy is deliberate and documented.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:15-19` skips the fake-kube BATS coverage on Windows. This is consistent with the PR warning not to merge until Windows BATS works.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:90-94` explicitly documents the `local_teardown_file()` exception for the fake HTTP server. The rationale is scoped to cleaning up a test helper process, not deleting RDD control-plane state.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats` — S3
- `bats/tests/10-cli/7-kubectl-cache.bats` — S2
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — S3
- `docs/design/cmd_service.md` — S3
- `docs/design/environment.md` — S3
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — Reviewed, no issues
- `pkg/kuberlr/downloader.go` — I1, S1
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — Reviewed, no issues
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues


### [Gemini] — Pass 1

### Executive Summary

This change implements a kuberlr-style version resolver for `rdd kubectl` that fetches, caches, and executes a version-matched `kubectl` when communicating with external clusters. The implementation robustly tackles KUBECONFIG resolution, gracefully falls through to the embedded `kubectl` when appropriate, and enforces network request boundaries. However, a mass find-and-replace error has corrupted the BATS decorators across the test suite, and a race condition on Windows threatens concurrent downloads due to `os.Rename` platform discrepancies. 

### Findings

#### Critical Issues

C1. **Concurrent cache updates fail on Windows** — `pkg/kuberlr/downloader.go:117` (critical, gap)

```go
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
}
```

When two instances of `rdd kubectl` concurrently resolve the same missing version, both processes download it to independent temp files and attempt to rename to the shared cache path (`dst`). On Unix, the second rename atomically replaces the first. On Windows, `os.Rename` fails with an "already exists" error if `dst` is present. Because `CacheDir()` is shared globally, parallel CLI invocations will trigger this race, causing one invocation to abort rather than executing the cached `kubectl`.

Fix: Check if the destination file exists after a rename failure. If it does, the concurrent writer successfully wrote it, and we can safely proceed.

```diff
-	return os.Rename(tmpName, dst)
+	if err := os.Rename(tmpName, dst); err != nil {
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil // Another process successfully committed the file
+		}
+		return err
+	}
+	return nil
```

C2. **Corrupted BATS test decorators** — `bats/tests/10-cli/7-kubectl-cache.bats:117` (critical, regression)

```bash
 @pkg/util/tail/tail_test.go 'rdd kubectl downloads a version-matched kubectl on cache miss' {
    cached=${CACHE_DIR}/kubectl/${GOOS}-${GOARCH}/kubectl-v1.99.0${EXE}
```

The BATS test decorators have been accidentally replaced globally (likely via a rogue find-and-replace command). Lines 117, 132, 146, 161, and 177 in `7-kubectl-cache.bats`, along with line 4 in `5-paths.bats`, use `@pkg/util/tail/tail_test.go` instead of `@test`. This constitutes a BATS syntax error that completely breaks test execution and renders the assertions unreachable.

Fix: Restore the ` @test` decorators across all affected `.bats` files.

```diff
- @pkg/util/tail/tail_test.go 'rdd kubectl downloads a version-matched kubectl on cache miss' {
+ @test 'rdd kubectl downloads a version-matched kubectl on cache miss' {
```

#### Important Issues

I1. **Command substitution swallows exit codes** — `bats/tests/10-cli/7-kubectl-cache.bats:186` (important, regression)

```bash
    version_requests=$(awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}")
    [[ ${version_requests} -ge 2 ]] ||
```

The test utilizes command substitution to extract `version_requests`, violating the project's strict BATS style guide: "Prefer `run -0 cmd ...; value=${output}` over `value=$(cmd ...)` command substitution. Substitution swallows non-zero exits and leaves `value` empty." 

Fix: Capture the output via `run -0` and assert against `${output}`.

```diff
-    version_requests=$(awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}")
+    run -0 awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}"
+    version_requests=${output}
```

#### Suggestions

S1. **Clean up incomplete downloads on start/restart** — `pkg/kuberlr/downloader.go:101` (suggestion, gap)

```go
	tmpName := tmp.Name()
	defer os.Remove(tmpName)
```

The defer perfectly cleans up temporary `.kubectl-*` files during normal operations or handled errors. However, if the `rdd kubectl` process is killed via `SIGKILL` or an abrupt hardware crash during a slow download, the un-renamed temp file permanently leaks in the cache folder. Over time, these `~50MB` artifacts could accumulate. 

Fix: Consider adding a quick cleanup sweep of `*.kubectl-*` temp files inside `CacheDir()` during cache directory initialization, or document this leak vector alongside the eviction TODO.

S2. **Expand supported skew for modern Kubernetes versions** — `pkg/kuberlr/resolver.go:98` (suggestion, enhancement)

```go
func compatible(client, server semver.Version) bool {
	if client.Major != server.Major {
		return false
	}
	diff := int64(client.Minor) - int64(server.Minor)
	return diff >= -1 && diff <= 1
}
```

The code strictly enforces the historical Kubernetes ±1 minor version skew. However, starting in Kubernetes 1.28, the official supported version skew for `kubectl` relative to `kube-apiserver` was officially expanded to ±3 minor versions (KEP-3335). A hard ±1 check churns the cache and incurs network overhead unnecessarily when interacting with modern clusters. 

Fix: Consider relaxing the skew tolerance to ±3 for `server.Minor >= 28`.

### Design Observations

**Concerns**
- **Hardcoded ±1 version skew limit** (in-scope): As detailed in S2, enforcing ±1 minor version skew ignores the modern Kubernetes policy change that expanded `kubectl` compatibility to ±3. A narrower limit aggressively mandates new binaries when the embedded `kubectl` is strictly still in skew with the target cluster. Updating the logic to allow ±3 for newer versions reduces external cluster connection overhead and cache footprint over time.

**Strengths**
- `isClientOnly` does a phenomenal job safely short-circuiting network lookups for offline or strictly local `kubectl` commands, utilizing a strong "conservative fallback to probe" policy for ambiguous inputs.
- Bounding the HTTP download stream directly through `io.LimitReader` offers critical defense-in-depth against runaway network fetches or malicious mirrors serving infinitely looping payloads.
- Instantiating and discarding `genericclioptions.NewConfigFlags(true)` guarantees the resolver flawlessly mimics `kubectl`'s own complex connection-override flag prioritization.

### Testing Assessment

The BATS tests cover the network lifecycle efficiently using fake Go servers without polling real infrastructure.
Missing test coverage scenarios, ranked by risk:
1. **Concurrent Cache Updates**: Parallel downloads of the same binary are not covered, which allowed the Windows `os.Rename` collision to slip through unobserved.
2. **Flag Parsing Edge Cases**: Tests do not explicitly verify `rdd kubectl` with complex connection overrides (e.g., `--insecure-skip-tls-verify=false` vs `-n=foo`) skipping the resolver correctly, though unit tests do broadly evaluate `parseKubeConfigFlags`.

### Documentation Assessment

Documentation updates to `cmd_service.md` and `environment.md` are comprehensive, clean, and accurately reflect the new path variables available to operators.

### Acknowledged Limitations

1. **Cache growth without eviction**
   - **Code comment**: `pkg/kuberlr/cache.go:33` - `"TODO(eviction): the cache only grows; a user switching across many remote clusters accumulates ~50 MB per minor version indefinitely."`
   - **Note**: Safe initially, but highlights that the system relies on future development to cap infinite disk accumulation.
2. **Lack of offline mode**
   - **Code comment**: `pkg/kuberlr/resolver.go:61` - `"TODO(offline): when no cached kubectl matches and the network is down, Resolve currently fails with the download error."`
   - **Note**: Network failures currently abort the execution explicitly, preventing silent fallbacks to vastly outdated cached binaries.
3. **No teardown_file rule deviation**
   - **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:89` - `"Deviates from the 'no teardown_file' rule. The rule's intent is to preserve rdd state... leaving it running just leaks ports between sessions."`
   - **Note**: The author correctly identified that preserving an ephemeral HTTP testing server works entirely contrary to the rule's debugging intent. 
4. **Deferred Windows BATS support for fake-kube**
   - **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:15` - `"The fake kubectl cross-compiles fine... Defer until we have a Windows BATS runner with Go."`
   - **Note**: Limits regression guarding on Windows; manual tests on the platform must be relied upon for process execution.

### Coverage Summary

- .github/dependabot.yml: Reviewed, trivial
- bats/tests/10-cli/5-paths.bats: C2
- bats/tests/10-cli/7-kubectl-cache.bats: C2, I1
- bats/tests/10-cli/fake-kube/README.md: Reviewed, trivial
- bats/tests/10-cli/fake-kube/go.mod: Reviewed, no issues
- bats/tests/10-cli/fake-kube/kubectl/main.go: Reviewed, no issues
- bats/tests/10-cli/fake-kube/server/main.go: Reviewed, no issues
- cmd/rdd/kubectl.go: Reviewed, no issues
- cmd/rdd/service_paths.go: Reviewed, no issues
- docs/design/cmd_service.md: Reviewed, no issues
- docs/design/environment.md: Reviewed, no issues
- go.mod: Reviewed, no issues
- pkg/kuberlr/cache.go: Reviewed, no issues
- pkg/kuberlr/downloader.go: C1, S1
- pkg/kuberlr/exec_unix.go: Reviewed, no issues
- pkg/kuberlr/exec_windows.go: Reviewed, no issues
- pkg/kuberlr/resolver.go: S2
- pkg/kuberlr/resolver_test.go: Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-05-04 |
| **Commit** | `3dd6d90 rdd kubectl: embed kuberlr-style version resolver` (amended) |

| # | Finding | Action | Reason |
|---|---------|--------|--------|
| 1 | Suggestion #1: docker_socket missing from ALL_KEYS, environment.md, and cmd_service.md example | Fixed | Folded into earlier amend alongside Mark's RDD_CACHE_DIR addition; cache_dir and docker_socket both added to all three sites. |
| 2 | Suggestion #2: BATS test captures version_requests via command substitution | Fixed | Converted to `run -0; ${output}` form; failure branch reads LOG_FILE through a second `run -0`. |
| 3 | Suggestion #3: cmd_service.md retains five config.json references after the line-154 fix | Fixed | Renamed all five remaining refs to config.yaml. |
| 4 | Suggestion #4: fetchSha512 accepts uppercase hex but the case-sensitive comparison rejects it | Fixed | strings.ToLower normalisation on the digest; dropped redundant A-F branch in the validator. |
| 5 | Suggestion #5: kubectlGlobalFlagsTakingValue omits --as-user-extra | Fixed | Added the map entry; added two TestIsClientOnly rows covering spaced and equals forms. |
| 6 | Suggestion #6: serverVersion ignores ctx; Ctrl+C waits up to 2 s | Skipped | UX gap bounded by the 2 s probe timeout; fix replaces client.ServerVersion() with raw REST + JSON unmarshal — defer until a probe-cache redesign. |
| 7 | Suggestion #7: fetchSha512 comment cites "the comparison at line 110" | Fixed | Replaced the line-number locator with "download's digest comparison". |
| 8 | Suggestion #8: TestParseKubeConfigFlags --insecure-skip-tls-verify=false row asserts the default | Fixed | Dropped the row; bare and =true rows still cover the binding. |
| 9 | Suggestion #9: aborted downloads leak .kubectl-* temp files | Commented | Extended the eviction TODO in cache.go to cover .kubectl-* leftovers so the eventual sweep clears both. |
| 10 | Suggestion #10: 5 min downloadTimeout covers headers + body together | Skipped | Speculative; round 3 treated 5 min as adequate, and reaching for a Transport with ResponseHeaderTimeout adds complexity for a rare case. |
| 11 | Testing Gap #1: docker_socket not asserted in 5-paths.bats | Test written | Auto-covered by S1 ALL_KEYS update. |
| 12 | Testing Gap #2: --as-user-extra parsing path unverified | Test written | Auto-covered by S5 (added TestIsClientOnly rows). |
| 13 | Testing Gap #3: --insecure-skip-tls-verify=false row asserts the default | Test written | Auto-covered by S8 (dropped the row). |
| 14 | Testing Gap #4: Uppercase SHA-512 mirror response untested | Skipped | No existing unit-test infrastructure for downloader.go; would need new httptest fixture or BATS uppercase variant — defer. |
| 15 | Testing Gap #5: Concurrent first-use cache miss untested | Skipped | Declined Rounds 1-3 (Windows os.Rename race); call site unchanged. |
| 16 | Testing Gap #6: End-to-end Windows execution path | Skipped | Gated on Windows BATS support per PR description ("Do not merge until BATS works on Windows CI"). |
| 17 | Documentation Assessment | Documentation updated | Auto-covered by S1 + S3 Fixes (cmd_service.md and environment.md). |
