# Deep Review: 20260504-180530-pr-348

| | |
|---|---|
| **Date** | 2026-05-04 18:05 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 5 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#348](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348) — rdd kubectl: embed kuberlr-style version resolver |
| **Branch** | `embed-kuberlr` |
| **Commits** | `fb0811d` rdd kubectl: embed kuberlr-style version resolver |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — S1 (Codex + Claude) is the round's most actionable: `RDD_CACHE_DIR` is documented in `environment.md` both as a behavior-changing override AND as a Path Variable whose section preamble says "setting them has no effect on RDD's behavior" — the contradiction is internal and visible to anyone who reads the file. S3 (Claude + Gemini) brings the long-standing `CacheRoot` panic finding back with a material change: `instancePaths()` now calls `kuberlr.CacheRoot()` unconditionally, so a user with no `$HOME`/`%LocalAppData%` sees `rdd svc paths` panic with a stack trace. The remaining suggestions (S2 download UX, S4 misleading test comment, S5 process-env coupling, S6 dead error scaffold) are all genuine but lower priority. Round 4's ten suggestions all landed cleanly with no regressions. |
| **Wall-clock time** | `23 min 8 s` |

## Executive Summary

Round 4's ten Suggestions all landed cleanly. Round 5 surfaces no Critical or Important findings; the agents converge on six smaller items in three buckets:

1. **Documentation contradiction around `RDD_CACHE_DIR`** (S1, two agents). `environment.md` line 16 documents the variable as a behavior-changing override; the Path Variables section preamble at line 39 explicitly states "setting them has no effect on RDD's behavior"; line 48 then lists `RDD_CACHE_DIR` in that table. Codex also notes that `cmd_service.md`'s key table at lines 127–137 lists `kubectl_cache` but omits the new `cache_dir` key the example output (line 154) and the `instancePaths()` map (`cmd/rdd/service_paths.go:32`) include.

2. **Cache-root panic now reaches `rdd svc paths`** (S3, Claude + Gemini). The panic in `CacheRoot` on `os.UserCacheDir()` failure is the same finding declined in Rounds 1–3, but the blast radius materially changed in this round: `instancePaths()` now calls `kuberlr.CacheRoot()` and `kuberlr.CacheDir()` unconditionally, so any `rdd svc paths` invocation (the most basic introspection command) crashes for a user with no `$HOME`/`%LocalAppData%`. Per the resolutions gate, a material code change since the decline justifies a re-raise; Suggestion-level because the trigger remains rare.

3. **Defensive cleanup in the new code** — S2 (silent first-run download has no user feedback), S4 (BATS comment misnames the `EXE`-export mechanism), S5 (`SkipResolver` mutates the global process env where a package var would express the same intent without leaking into spawned children), and S6 (`serverVersion` declares an `error` return but every path returns `nil`, leaving `Resolve`'s `if err != nil` as documented dead code).

The Windows `os.Rename` race surfaced again from all three agents (Codex I1, Gemini I1, Claude S3) and was dropped per the prior-round-resolutions gate (declined Rounds 1–4).

Structure: 0 Critical · 0 Important · 6 Suggestions.

## Critical Issues

None.

## Important Issues

None.

## Suggestions

S1. **`environment.md` describes `RDD_CACHE_DIR` as both behavior-changing and inert; `cmd_service.md` key table omits the new `cache_dir` row** — `docs/design/environment.md:16,39,48`, `docs/design/cmd_service.md:127-137,154` [Codex GPT 5.5][Claude Opus 4.7]

```markdown
| `RDD_CACHE_DIR` | Overrides the rdd-wide cache root. ... |
...
## Path Variables

`rdd svc paths --output=shell` exports these variables. They reflect
the paths RDD uses for the current instance; setting them has no effect
on RDD's behavior.
...
| `RDD_CACHE_DIR` | rdd-wide cache root (shared across instances) | `~/Library/Caches/rancher-desktop` |
```

`environment.md` line 16 (Configuration Variables) correctly documents `RDD_CACHE_DIR` as a behavior-changing override that the kubectl resolver reads at `pkg/kuberlr/cache.go:48`. Line 39 then opens the Path Variables section with "setting them has no effect on RDD's behavior," and line 48 lists `RDD_CACHE_DIR` in that table. A reader who lands in the Path Variables section reads the opposite of what the resolver actually does.

`RDD_KUBECTL_CACHE` (line 52) is correctly placed: kuberlr only reads `RDD_CACHE_DIR` and always derives `kubectl/<os>-<arch>/`, so setting `RDD_KUBECTL_CACHE` truly has no effect.

The same gap carries through `cmd_service.md`. The example output at line 154 shows a `cache_dir` row, and `instancePaths()` at `cmd/rdd/service_paths.go:32` returns `cache_dir` alongside `kubectl_cache`. The key table at lines 127–137 added `kubectl_cache` (line 137) but skipped `cache_dir`, so the table no longer matches `rdd svc paths` output.

Fix: qualify the Path Variables preamble (or drop `RDD_CACHE_DIR` from the path table and reference the Configuration Variables entry), and add a `cache_dir` row to the `cmd_service.md` key table.

```diff
-`rdd svc paths --output=shell` exports these variables. They reflect
-the paths RDD uses for the current instance; setting them has no effect
-on RDD's behavior.
+`rdd svc paths --output=shell` exports these variables. Most are
+informational; `RDD_CACHE_DIR` is also honored as the cache-root
+override (see Configuration Variables above).
```

```diff
 | `pid_file` | PID file path |
 | `args_file` | Saved arguments file path |
+| `cache_dir` | rdd-wide cache root (shared across instances) |
 | `kubectl_cache` | Cache directory for downloaded kubectl binaries (shared across instances) |
```

S2. **First-run `rdd kubectl` blocks silently for 30 s+ on slow links** — `pkg/kuberlr/downloader.go:58-72`, `cmd/rdd/kubectl.go:77` [Claude Opus 4.7]

```go
func ensureCached(ctx context.Context, want semver.Version) (string, error) {
    path := cachedPath(want)
    if _, err := os.Stat(path); err == nil {
        return path, nil
    } else if !errors.Is(err, os.ErrNotExist) {
        return "", err
    }
    ...
    if err := download(ctx, want, path); err != nil {
        return "", err
    }
```

Default log level is `warn` (`cmd/rdd/main.go:90`). The "using cached kubectl" line in `kubectlAction` (`cmd/rdd/kubectl.go:77`) and any `Debug`/`Info` logging in `download` stay invisible at the default level. A first-run cache miss against `dl.k8s.io` over a slow link blocks the terminal for 30 s or more for a ~50 MB binary; near the 5 min `downloadTimeout` it looks identical to a hung process. The user has no signal that the resolver is working.

Fix: emit one warn-level line before the download starts, naming the version and the mirror. The line shows up at the default log level and disappears once the cache hits.

```diff
     if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
         return "", err
     }
+    logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
     if err := download(ctx, want, path); err != nil {
         return "", err
     }
```

S3. **`CacheRoot` panic now reaches `rdd svc paths`; declined in Rounds 1–3 but the call surface materially changed this round** — `pkg/kuberlr/cache.go:47-56`, `cmd/rdd/service_paths.go:32-33` [Claude Opus 4.7][Gemini 3.1 Pro]

```go
func CacheRoot() string {
    if root := os.Getenv(envCacheDir); root != "" {
        return root
    }
    cache, err := os.UserCacheDir()
    if err != nil {
        panic(fmt.Errorf("could not determine user cache directory: %w", err))
    }
    return filepath.Join(cache, "rancher-desktop")
}
```

The panic on `os.UserCacheDir()` failure is the same finding declined in Rounds 1, 2, and 3 ("declined: failure mode requires neither $HOME nor %LocalAppData% to resolve"). The blast radius materially changed this round: `instancePaths()` at `cmd/rdd/service_paths.go:32-33` now calls `kuberlr.CacheRoot()` and `kuberlr.CacheDir()` unconditionally, so `rdd svc paths` (basic introspection, unrelated to the kubectl resolver) panics with a stack trace for any user whose `$HOME` or `%LocalAppData%` is unset.

The trigger remains rare in normal desktop environments. It is reachable from stripped CI containers, restricted services, and `sudo -i` shells without HOME forwarding. `instance.Dir()` and the other path helpers do not panic on similar failures; aligning `CacheRoot` with the rest of the codebase keeps the surface uniform.

Fix: change the signature to `(string, error)`, propagate from `CacheDir()`, and have `instancePaths` return `(map[string]string, error)`. `Resolve` would treat a `CacheRoot` error like an unparseable embedded version — log and fall through to embedded.

S4. **`local_setup_file` comment misnames the `EXE` mechanism — `commands.bash` sets but does not export** — `bats/tests/10-cli/7-kubectl-cache.bats:29-30`, `bats/helpers/commands.bash:1-5` [Claude Opus 4.7]

```bash
local_setup_file() {
    GOOS=$(go env GOOS)
    GOARCH=$(go env GOARCH)
    # EXE is exported by commands.bash: ".exe" on Windows, empty
    # elsewhere. Don't shadow it locally.
```

`commands.bash` sets `EXE=""` at line 1 and `EXE=".exe"` at line 5 with no `export`. `EXE` is in scope for `@test` blocks because BATS sources `commands.bash` once and `@test` runs in the same shell. The variable inherits, but the test re-exports `EXE` later in the same `local_setup_file` (line 99, alongside the other `export GOOS GOARCH EXE ...`) so the rdd subprocess sees it. The comment names `commands.bash` as the exporter, which misleads the next reader who reorganizes BATS helpers.

Fix:

```diff
-    # EXE is exported by commands.bash: ".exe" on Windows, empty
-    # elsewhere. Don't shadow it locally.
+    # EXE is set in commands.bash (".exe" on Windows, empty elsewhere)
+    # and re-exported below for the rdd subprocess. Don't shadow it
+    # locally.
```

S5. **`SkipResolver` mutates the global process env where a package-level bool would express the same intent without leaking into spawned children** — `pkg/kuberlr/resolver.go:30-36` [Claude Opus 4.7]

```go
func SkipResolver() {
    os.Setenv(envSkipResolver, "1")
}
```

`SkipResolver`'s docstring describes a same-process toggle. The implementation mutates the global process env, which leaks into every subprocess spawned after the call. The current callers (only `rdd ctl` invokes it before `kubectlAction`, which never re-spawns rdd) make this harmless today, but the function name and comment hide the side effect. A future caller who does spawn unrelated subprocesses inherits the leaked env without warning.

A package-level `bool` checked inside `Resolve` matches the documented intent and removes the coupling. The existing `envSkipResolver` env var still covers the recursion guard for the kubectl child, which `Exec` sets explicitly in `exec_unix.go:22` and `exec_windows.go:28`.

Fix:

```diff
-func SkipResolver() {
-    os.Setenv(envSkipResolver, "1")
-}
+var skipResolver bool
+
+// SkipResolver short-circuits Resolve for the rest of this process.
+// rdd ctl calls this before kubectlAction because it always targets
+// the embedded apiserver, whose version matches the embedded kubectl
+// by construction — the probe would always fall through anyway.
+func SkipResolver() {
+    skipResolver = true
+}
```

In `Resolve`, replace the env check with `if skipResolver || os.Getenv(envSkipResolver) != ""`.

S6. **`serverVersion` declares an `error` return but every path returns `nil`; `Resolve`'s `if err != nil` is dead** — `pkg/kuberlr/resolver.go:82-85`, `pkg/kuberlr/resolver.go:121-144` [Gemini 3.1 Pro]

```go
server, ok, err := serverVersion(args)
if err != nil {
    return "", err
}
```

`serverVersion` has five return paths. Four of them return `(zero, false, nil)` with `//nolint:nilerr` annotations documenting the deliberate fall-through; the fifth returns `(v, true, nil)` on success. No path returns a non-nil error. The `if err != nil` check at `Resolve` line 83 cannot fire; the `nilerr` suppressions even document the function's deliberate "always nil" contract.

Two coherent options: drop the `error` return entirely (making `serverVersion` a `(semver.Version, bool)`-returning helper that matches its actual contract), or keep the scaffold and document it as deliberate forward-compatibility. The former matches what the code does; the latter matches a hypothetical future where some probe failure should abort the resolver instead of falling through.

Fix (drop the dead path):

```diff
-func serverVersion(args []string) (_ semver.Version, ok bool, _ error) {
+func serverVersion(args []string) (semver.Version, bool) {
     ...
-    return semver.Version{}, false, nil //nolint:nilerr // ...
+    return semver.Version{}, false
     ...
-    return v, true, nil
+    return v, true
 }
```

And in `Resolve`:

```diff
-    server, ok, err := serverVersion(args)
-    if err != nil {
-        return "", err
-    }
+    server, ok := serverVersion(args)
     if !ok {
         return "", nil
     }
```


## Design Observations

### Concerns

- **Mirror checksum gives no security against a compromised mirror (future)** [Gemini 3.1 Pro]. `download` fetches the kubectl binary and its `.sha512` from the same mirror (`dl.k8s.io` or `RDD_KUBECTL_MIRROR`); the hash protects against truncation and disk corruption but not against a mirror serving a malicious binary alongside a matching malicious checksum. SIG Release signs binaries with Sigstore/Cosign. Verifying the Cosign signatures (in addition to or in place of the mirror's own checksum) closes the supply-chain gap. Out of scope for this PR; flagging as durable follow-up.

- **Hand-rolled subcommand walker duplicates the bound pflag set (future, persistent)** [carry-over from Rounds 3–4]. `parseKubeConfigFlags` already binds the kubectl global flags via `genericclioptions.ConfigFlags.AddFlags`; `isClientOnly` re-walks args manually with `kubectlGlobalFlagsTakingValue`, a hand-maintained value-taking flag map. Round 4's S5 (`--as-user-extra`) was the third drift in three rounds. Driving the walker from the same FlagSet (`fs.Args()` after parse returns the positional tokens) eliminates the map and the drift class. Trade-off: pflag with `UnknownFlags=true` may consume an unknown spaced-form flag's value as a positional token, so the walker needs a small validation step before extracting the subcommand. Round 5 surfaces no new drift, but the design observation persists.

### Strengths

- **Round 4 fixes landed without regression** [Claude Opus 4.7][Codex GPT 5.5]. `ALL_KEYS` now covers `cache_dir` and `docker_socket`; `cmd_service.md` consistently uses `config.yaml`; `fetchSha512` lowercases the digest; `--as-user-extra` is in `kubectlGlobalFlagsTakingValue` with covering rows in `TestIsClientOnly`; the line-numbered comment in `fetchSha512` is replaced with a name; the `--insecure-skip-tls-verify=false` row is gone; the eviction TODO covers `.kubectl-*` leftovers.

- **Body caps at the right place** [Claude Opus 4.7][Gemini 3.1 Pro]. `streamGet` wraps the response in `io.LimitReader` with separate caps for the digest (4 KiB) and the binary (256 MiB). A misconfigured mirror serving the binary at the `.sha512` URL fails fast with a clear error rather than streaming megabytes into a `strings.Builder`.

- **Hash before mutation** [Claude Opus 4.7]. `download` feeds the response body through `MultiWriter(tmp, sha512.New())` and verifies the digest before any chmod or rename. A failed verification leaves no executable in place for the next call to mistake for a valid binary.

- **Reusing kubectl's own flag binding** [Claude Opus 4.7][Codex GPT 5.5]. `parseKubeConfigFlags` binds `genericclioptions.NewConfigFlags(true)` rather than reimplementing kubectl's connection-override parsing. The probe and the kubectl child target the same cluster by construction.

- **Conservative bias in `isClientOnly`** [Claude Opus 4.7][Gemini 3.1 Pro]. When the walker cannot classify, it falls through to the probe. The comment at `resolver.go:264-267` explicitly calls out the failure mode of the opposite default — a misclassification toward "skip the probe" silently uses a mismatched embedded kubectl.

## Testing Assessment

Coverage gaps surfaced this round, ordered by user-reachable risk:

1. **`os.UserCacheDir` failure path** (S3). The panic in `CacheRoot` has no test. A unit test that unsets `HOME` (Linux) or `LocalAppData` (Windows) and asserts a clean error would lock in the recommended fix.
2. **First-run download UX** (S2). No test asserts that the user sees something on cache miss — neither output nor log lines. The fake-kube fixture already serves the binary; a one-line BATS assertion against `${output}` after the download case would close the gap.
3. **Concurrent first-use cache miss** (declined Rounds 1–4, surfaced again this round by all three agents). Acknowledged limitation.
4. **End-to-end Windows execution path** (`pkg/kuberlr/exec_windows.go`). PR description gates the merge on Windows BATS support arriving; the path remains compile-only.
5. **Cluster reporting unparseable `gitVersion`** [Claude Opus 4.7]. `serverVersion` logs at `warn` and falls through, but no test covers it. The fake server already has `--git-version-file`; writing `"not-a-version"` and asserting fall-through to embedded kubectl would close the gap.

Verification runs this round: Codex ran `make build-rdd` and `go test ./pkg/kuberlr ./cmd/rdd -count=1` (both passed). Claude verified Round 4 fixes against the current files. Gemini ran the discovery walk only.

## Documentation Assessment

- `docs/design/environment.md` carries the contradiction in S1: line 16 documents `RDD_CACHE_DIR` as a behavior-changing override; line 39 says path-table variables have no effect; line 48 lists `RDD_CACHE_DIR` in the path table. Internal Variables section (lines 29–35) is well-placed and correctly explains both the recursion guard and the `rdd ctl` short-circuit reuse of the same env var.
- `docs/design/cmd_service.md` key table at lines 127–137 is missing `cache_dir` (S1). Round 4's `config.json` → `config.yaml` sweep landed cleanly. Example output at lines 152–163 includes both `cache_dir` and `kubectl_cache`.
- `pkg/kuberlr/cache.go` package doc at lines 5–9 clearly describes the resolve-or-fall-through model and credits the upstream kuberlr.
- `parseKubeConfigFlags` docstring at `pkg/kuberlr/resolver.go:155-173` enumerates the bound flag surface and explains why `--username`/`--password` stay unbound.
- `local_setup_file` comment at `7-kubectl-cache.bats:29-30` misnames the `EXE`-export mechanism (S4).

## Commit Structure

Single commit `fb0811d`. Message describes the change and matches the diff. Clean.

## Acknowledged Limitations

- **Cache eviction not implemented** — `pkg/kuberlr/cache.go:33-39`. The `TODO(eviction)` covers both growth (~50 MB per minor version) and `.kubectl-*` temp leftovers from SIGKILL/power loss; Round 4's S9 extended the TODO to name both.
- **Offline mode not implemented** — `pkg/kuberlr/resolver.go:61-64`, `pkg/kuberlr/downloader.go:46-47`. TODOs cover the planned env-var/config opt-out for download attempts.
- **Windows BATS host deferred** — `bats/tests/10-cli/7-kubectl-cache.bats:14-20`. PR description gates the merge on Windows BATS support; the resolver's Windows-specific code paths remain compile-only.
- **`local_teardown_file()` deviation from BATS convention** — `bats/tests/10-cli/7-kubectl-cache.bats:102-110`. Inline rationale stands.
- **Falling back to embedded on any cluster-probe failure** — `pkg/kuberlr/resolver.go:113-144`. Each `nilerr` branch is documented; the design choice is sound (S6 proposes simplifying the dead-error scaffolding without changing the policy).
- **Embedded version unparseable in dev builds** — `pkg/kuberlr/resolver.go:73-80`. `go run ./cmd/rdd kubectl ...` and IDE debug builds skip `-ldflags -X`; falling through to embedded kubectl rather than breaking every dev invocation is the right call.
- **`--username`/`--password` deliberately unbound on the resolver side** — `pkg/kuberlr/resolver.go:165-173`. Binding them via `WithDeprecatedPasswordFlag` would force the interactive kubeconfig loader; the resolver's silent fall-through on probe failure covers the asymmetry.
- **`RDD_KUBECTL_MIRROR` accepts non-HTTPS URLs** [Claude Opus 4.7]. Implicit. Acknowledged in `mirrorURL()` (the BATS test relies on `http://127.0.0.1:port`). For users who set this to a non-localhost http URL, MITM is possible. Trusting the env-var owner is consistent with other `RDD_*` overrides.
- **Empty-message `cliexit.Error` contract** — `cmd/rdd/main.go:159-160`. The contract is clear in context but currently single-call-site (only `kuberlr.Exec` produces one).

### Declined in prior rounds

- **Concurrent first-use Windows `os.Rename` race** — declined Rounds 1, 2, 3, 4. Raised this round by all three agents (Codex I1, Gemini I1, Claude S3); dropped per the resolutions gate. The recurring re-raises strengthen the case for an inline TODO documenting the Windows file-mapping subtlety so future agents recognize the deliberate design choice.
- **`RDD_CACHE_DIR` scope rename** — declined Round 1: deliberate forward-looking knob.
- **Server `log.Fatal*` bypassing `defer logFD.Close()`** — declined Round 1: test fixture; OS reclaims fd.
- **Pure-Go unit coverage thin on download/cache machinery** — declined Round 2: BATS suite covers the integration concerns.
- **BATS coverage for HTTP timeout/hang, network cancel, Windows Exec, sha512sum-format mirror, trailing-slash mirror, default-log-level message capture, unreachable-server probe** — declined Rounds 1–3.
- **`info.GitVersion` empty/unknown unit test** — declined Round 1: requires mocked discovery client.
- **Resolver-policy summary and recursion-guard rationale duplicated across files** — declined Round 1: each duplicate serves a different audience.
- **`SkipResolver` swallows `os.Setenv` error** — declined Round 3: stylistic only; constant env name and value cannot fail.
- **Cache the server-version probe across invocations** — declined Round 3: performance optimisation; explicit follow-up.
- **Move package doc-comment from `cache.go` to `resolver.go`** — declined Round 3: cosmetic; `go doc` and pkg.go.dev render either location equally.
- **Drop `is_macos` branch in `write_kubectl_sha512`** — declined Round 3: both branches produce identical output; conditional is harmless.
- **Round 4 S6: `serverVersion` ignores ctx** — skipped Round 4: UX gap bounded by 2 s probe timeout.
- **Round 4 S10: 5 min `downloadTimeout` covers headers + body** — skipped Round 4: speculative.

## Unresolved Feedback

- **@mook-as on `cmd/rdd/service_paths.go:33`** — `https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348#discussion_r3182959199`: "Would it be possible to expose `RDD_CACHE_DIR` here too?"
  Resolved. Both `cache_dir` (the rdd-wide root) and `kubectl_cache` (the kubectl-specific subdir) appear in `instancePaths()` (`service_paths.go:32-33`), the `environment.md` Path Variables table (lines 48, 52), and the `cmd_service.md` example output (lines 154, 158). The `cmd_service.md` key table still misses `cache_dir` (S1).

## Agent Performance Retro

### [Claude]

Claude carried this round's coverage: five Suggestions retained (S1 docs shared with Codex, S2 download UX, S3 panic with the blast-radius re-raise, S4 BATS comment, S5 SkipResolver env coupling). The S3 re-raise was the round's best work — Claude noticed that `instancePaths()` now calls `kuberlr.CacheRoot()` unconditionally, so the same panic that prior rounds declined as too narrow now reaches `rdd svc paths`. That call-surface check is exactly the discipline the resolutions gate asks for. Two findings declined or downgraded: I1 (download UX) was over-calibrated as Important — the gap is real but bounded; S5 (drive-by `config.json → config.yaml` in `cmd_service.md`) was rejected because Round 4's S3 (raised by Claude itself) explicitly asked for that rename, and re-flagging it as scope creep after it landed moves the goalposts. Claude's S3 (Windows rename race) is the fourth consecutive re-raise from Claude on a finding declined four rounds running.

### [Codex]

Codex produced one shared Important (Windows rename race, dropped per gate) and one shared Suggestion (S1 docs contradiction). The S1 work was tight: Codex spotted both the `environment.md` self-contradiction and the `cmd_service.md` key-table omission in a single finding, where Claude reported only the `environment.md` half. Codex also ran `make build-rdd` and `go test ./pkg/kuberlr ./cmd/rdd` in its worktree (both passed) — that verification work is hard to parallelize across agents and Codex consistently provides it. Tight one-Important / one-Suggestion shape, zero false positives.

### [Gemini]

Gemini's round was on-target after Round 4's hallucination. C-level findings disappeared; the two real findings (I1 Windows rename race shared with the others, S2 panic on `os.UserCacheDir`) were both legitimate. S1 (dead error return in `serverVersion`) is a fresh Suggestion that neither other agent caught — verified by reading every return path in the function and confirming all four `nilerr` branches plus the success path return `nil`. The supply-chain Cosign observation in Design Concerns is a future-scope idea worth carrying forward. No false positives this round; Gemini's coverage tightened up materially compared to Round 4.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 16m 22s | 8m 02s | — |
| Findings | 5S | 1S | 2S |
| Tool calls | 35 (Bash 17, Read 17, Grep 1) | 74 (shell 66, stdin 8) | — |
| Design observations | 5 | 1 | 2 |
| False positives | 0 | 0 | 0 |
| Unique insights | 4 | 0 | 2 |
| Files reviewed | 19 | 19 | 19 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **5S** | **1S** | **2S** |
| Downgraded | 2 (I→S) | 0 | 0 |
| Dropped | 2 | 1 | 1 |


Codex provided the tightest signal-per-finding ratio (one shared S1 was the round's most actionable). Claude provided the most coverage (five retained Suggestions, including the S3 blast-radius re-raise). Gemini contributed one defensive observation (S6 dead error scaffold) plus one shared finding (S3 panic).

Net Round 5 contribution after the resolutions gate: 0 Critical, 0 Important, 6 Suggestions, 2 design concerns.

#### Reconciliation

- Claude I1 download UX: important → suggestion S2 (UX gap, bounded by the 5 min `downloadTimeout` ceiling; no correctness break).
- Claude I2 `CacheRoot` panic: important → suggestion S3 (rare trigger; the re-raise is justified by the new `rdd svc paths` blast radius, but severity stays at Suggestion).
- Claude S3 Windows `os.Rename` race: suggestion → dropped (declined Rounds 1, 2, 3, 4).
- Claude S5 drive-by `config.yaml` rename: suggestion → dropped (Round 4's S3 — raised by Claude itself — explicitly asked for the rename, which then landed).
- Codex I1 concurrent cold-cache publishes: important → dropped (declined Rounds 1, 2, 3, 4; same finding as above).
- Gemini I1 Windows rename race: important → dropped (declined Rounds 1, 2, 3, 4; same finding as above).
- Gemini S2 `CacheRoot` panic: kept and merged with Claude I2 into S3.

## Review Process Notes

### Skill improvements

- **Add a "blast-radius check on declined re-raises" rule to the resolutions gate.** A finding declined in earlier rounds can become valid again when later commits widen its trigger surface, even if the offending lines themselves are unchanged. Today the gate says "re-raise only when the surrounding code has materially changed since the decline," which an agent reading "code unchanged" reads as "drop." Strengthen to: "Before dropping a re-raised declined finding, list every caller of the cited symbol on the current SHA and on the SHA that produced the original decline. If new callers exist on the current SHA, the blast radius has changed; treat that as material code change and re-raise (with severity recalibrated for the new surface). Recording the call-site count alongside the decline reason in the Resolution table makes the comparison mechanical." Future panic-on-startup or fatal-on-rare-input findings carry the same shape: the bug is "rare" until it isn't, and the trigger surface is what shifts.

### Repo context updates

None this round. The existing `deep-review-context.md` carried Codex S1 (`environment.md` Path Variables preamble vs. behavior-changing override) into reach because the BATS style and CRD-default rules already train agents to read documentation tables against implementation; the same instinct surfaced the contradiction here.


## Appendix: Original Reviews

### [Claude] — Pass 1

Now I have enough context to write the review. Let me also briefly check one thing related to `exec.CommandContext` lifecycle on Windows.

# Code Review — PR #348: rdd kubectl version resolver

## Header

| | |
|---|---|
| **Branch** | `fb0811d` |
| **Verdict** | Approve with suggestions — Windows BATS gating noted by author |

## Executive Summary

This PR adds a kuberlr-style version resolver that exec's a downloaded, sha512-verified kubectl when the embedded one falls outside the ±1 minor skew of the user's cluster. The implementation is well-structured: the resolver, downloader, cache, and OS-specific exec are cleanly separated; unit tests cover the skew rule, kubectl-flag scanner, and `--client` short-circuit; and a BATS suite with a fake apiserver/mirror exercises the real download/verify/exec lifecycle. Findings are mostly minor — UX gaps around silent downloads, a documentation inconsistency, a misleading comment, and a panic in `CacheRoot` that should surface as an error.

## Findings

### Critical Issues

None.

### Important Issues

**I1. No user feedback during kubectl download — `pkg/kuberlr/downloader.go:58-72`** *(important, gap)*

```go
func ensureCached(ctx context.Context, want semver.Version) (string, error) {
	path := cachedPath(want)
	if _, err := os.Stat(path); err == nil {
		return path, nil
	} else if !errors.Is(err, os.ErrNotExist) {
		return "", err
	}
	...
	if err := download(ctx, want, path); err != nil {
		return "", err
	}
```

Default log level is `warn` (cmd/rdd/main.go:90), so neither the `Debug` log in `kubectlAction` (cmd/rdd/kubectl.go:77) nor anything inside `download` surfaces to the user. A first-run cache miss against the real `dl.k8s.io` over a slow link can take 30 s+ for a ~50 MB binary — the user sees no output and no progress. Network hangs near the 5 min `downloadTimeout` look identical to "rdd is broken." This is a UX bug for a flagship CLI surface.

Fix: emit a single warn-level line (or `fmt.Fprintln(os.Stderr, ...)`) before the download starts, identifying the version and mirror. Even a one-shot "Downloading kubectl v1.x.y from https://dl.k8s.io ..." closes the gap.

```diff
 	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
 		return "", err
 	}
+	logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
 	if err := download(ctx, want, path); err != nil {
 		return "", err
 	}
```

**I2. `CacheRoot` panics on `os.UserCacheDir` failure — `pkg/kuberlr/cache.go:47-56`** *(important, regression)*

```go
func CacheRoot() string {
	if root := os.Getenv(envCacheDir); root != "" {
		return root
	}
	cache, err := os.UserCacheDir()
	if err != nil {
		panic(fmt.Errorf("could not determine user cache directory: %w", err))
	}
	return filepath.Join(cache, "rancher-desktop")
}
```

`os.UserCacheDir()` returns an error when `$HOME` is unset on Unix or `%LocalAppData%` on Windows — rare but reachable in stripped CI containers, restricted services, and `sudo -i` shells without HOME forwarding. A panic here aborts every `rdd` invocation that touches the resolver path, including `rdd ctl` (which calls `kuberlr.SkipResolver()` → reads no cache) **and** every `rdd svc paths` invocation (cmd/rdd/service_paths.go:32-33), since `instancePaths` calls `kuberlr.CacheRoot()` and `kuberlr.CacheDir()` unconditionally. A user with an unset HOME would see `rdd svc paths` crash with a stack trace.

`os.UserCacheDir()`'s string is already explanatory. Returning an error and surfacing it as a regular failure is more in line with the rest of the codebase (`instance.Dir()` and friends do not panic).

Fix: change the signature to `(string, error)` and propagate. `instancePaths` already returns a `map[string]string`; have it return `(map[string]string, error)` and surface the error from `servicePathsAction`. Within `Resolve`, treat a CacheRoot error like an unparseable embedded version — log + fall through to embedded.

### Suggestions

**S1. Docs contradict themselves on `RDD_CACHE_DIR` — `docs/design/environment.md:16,39,48`** *(suggestion, regression)*

```markdown
| `RDD_CACHE_DIR` | Overrides the rdd-wide cache root. ... |
...
## Path Variables

`rdd svc paths --output=shell` exports these variables. They reflect the paths RDD uses for the current instance; setting them has no effect on RDD's behavior.
...
| `RDD_CACHE_DIR` | rdd-wide cache root (shared across instances) | `~/Library/Caches/rancher-desktop` |
```

`RDD_CACHE_DIR` appears as a Configuration Variable (line 16 — does affect behavior) **and** a Path Variable (line 48 — section preamble at line 39 explicitly states "setting them has no effect on RDD's behavior"). A reader who lands in the Path Variables section is told the opposite of what `cache.go:47-50` actually does.

`RDD_KUBECTL_CACHE` (line 52) is correctly described — kuberlr only reads `RDD_CACHE_DIR` and always derives the kubectl subpath, so setting `RDD_KUBECTL_CACHE` truly has no effect.

Fix: either drop `RDD_CACHE_DIR` from the Path Variables table and add a note, or qualify the Path Variables preamble — e.g., "setting them has no effect on RDD's behavior, except where noted in the Configuration Variables section above."

**S2. Misleading comment about `EXE` export — `bats/tests/10-cli/7-kubectl-cache.bats:29-30`** *(suggestion, regression)*

```bash
local_setup_file() {
    GOOS=$(go env GOOS)
    GOARCH=$(go env GOARCH)
    # EXE is exported by commands.bash: ".exe" on Windows, empty
    # elsewhere. Don't shadow it locally.
```

`bats/helpers/commands.bash:1` sets `EXE=""` (and `EXE=".exe"` at line 5) without exporting. The variable is in scope for `@test` blocks because BATS sources `commands.bash` once and inherits the shell, but it is not "exported by commands.bash." The test then re-exports `EXE` at line 99 for the rdd subprocess. The comment misnames the mechanism, which will mislead the next person who reorganizes BATS helpers.

Fix:
```diff
-    # EXE is exported by commands.bash: ".exe" on Windows, empty
-    # elsewhere. Don't shadow it locally.
+    # EXE is set in commands.bash (".exe" on Windows, empty elsewhere)
+    # and exported below for the rdd subprocess. Don't shadow it locally.
```

**S3. Concurrent first-run download race on Windows — `pkg/kuberlr/downloader.go:96-117`** *(suggestion, gap)*

```go
tmp, err := os.CreateTemp(filepath.Dir(dst), ".kubectl-*")
...
return os.Rename(tmpName, dst)
```

If two `rdd kubectl` processes hit a cold cache simultaneously, both pass the `os.Stat(path) == ErrNotExist` check (downloader.go:60), both download, and both `os.Rename(tmp, dst)`. Per the repo's own `CLAUDE.md` note, Windows `os.Rename` can fail when the destination is held open — and the first process's rename completes a fraction of a second before the second process starts execing the just-renamed file. The losing process surfaces an obscure rename error to the user, even though the cache has the right bytes. It is rare (cold cache + two concurrent invocations) and self-heals on retry, so this is SUGGESTION territory.

Fix: treat `os.Rename` failure as benign when `dst` exists and matches the same digest. Or, more conservatively, retry once after re-Stat'ing.

**S4. `SkipResolver` mutates process env — `pkg/kuberlr/resolver.go:34-36`** *(suggestion, gap)*

```go
func SkipResolver() {
	os.Setenv(envSkipResolver, "1")
}
```

`SkipResolver` is documented as a same-process toggle but actually mutates the global process env, which leaks into every subprocess `rdd` later spawns. For the current callers this is harmless (`rdd ctl` only uses the embedded kubectl, no subprocess inherits the env), but the function name and comment hide a process-wide side effect that could surprise a future caller who does spawn subprocesses for unrelated work. A package-level `bool` (read inside `Resolve`) would express the same intent without the env coupling — and the existing `envSkipResolver` env var would still cover the recursion guard for the actual exec'd kubectl child (set explicitly in `exec_unix.go:22` / `exec_windows.go:28`).

Fix:
```diff
-func SkipResolver() {
-	os.Setenv(envSkipResolver, "1")
-}
+var skipResolver bool
+
+// SkipResolver short-circuits Resolve for the rest of this process. ...
+func SkipResolver() {
+	skipResolver = true
+}
```
And in `Resolve`, check `if skipResolver || os.Getenv(envSkipResolver) != ""`.

**S5. Drive-by `config.json` → `config.yaml` rename in cmd_service.md — `docs/design/cmd_service.md:15,41,79,82,87`** *(suggestion, regression)*

The change is correct (`pkg/instance/instance.go:91` returns `config.yaml`), but it is a stale-doc fix unrelated to the kubectl resolver. Per the repo's "smallest reasonable change" guidance, this belongs in a separate doc-cleanup commit; bundling it in this PR makes the diff noisier and the commit message ("rdd kubectl: embed kuberlr-style version resolver") contradicts the actual scope. Not worth blocking on, but worth noting.

## Design Observations

### Strengths

- **Hash before mutation.** `download` (`downloader.go:103-117`) feeds the response body through a `MultiWriter(tmp, sha512.New())` and verifies the digest before any chmod or rename, satisfying the repo's "hash the as-downloaded bytes before any mutation" rule.
- **Body caps at the right place.** `streamGet` wraps the response in `io.LimitReader` with separate caps for the digest (4 KiB) and the binary (256 MiB) — a misconfigured mirror serving the binary at the `.sha512` URL fails fast with a clear error rather than streaming megabytes into a `strings.Builder`.
- **Reusing kubectl's own flag binding.** `parseKubeConfigFlags` (`resolver.go:177-186`) binds `genericclioptions.NewConfigFlags(true)` rather than reimplementing kubectl's connection-override parsing. The probe and the kubectl child target the same cluster by construction. The unit tests in `TestParseKubeConfigFlags` cover spaced/equals forms, `-s`/`-n` aliases, and `--` semantics.
- **Conservative bias in `isClientOnly`.** When the walker can't classify, it falls through to the probe — the comment at resolver.go:264-267 explicitly calls out that a misclassification toward "skip the probe" would silently use a mismatched embedded kubectl. The TestIsClientOnly cases exercise the surprising shapes (`--client=garbage`, `-h=false`, repeated flags with last-wins).
- **Probe timeout cap.** The 2-second `serverProbeTimeout` (resolver.go:41) keeps unreachable clusters from stalling every kubectl invocation, and the warn-vs-debug split for unparseable server versions is the right trade-off.

## Testing Assessment

The unit + BATS coverage is strong for a first cut. Untested scenarios, ranked by risk:

1. **Concurrent first-run downloads.** No test exercises two simultaneous cache misses. Adding a BATS test that backgrounds two `rdd kubectl` calls and asserts both succeed would catch S3 and any future regressions.
2. **`os.UserCacheDir` failure path.** The panic in `CacheRoot` (I2) has no test. A unit test that unsets `HOME` (Linux) or `LocalAppData` (Windows) and asserts a clean error would lock in the recommended fix.
3. **Atomic-write contract.** `download` is correct (verify before publish), but no test asserts that a partial download leaves no file at `dst`. A BATS test that kills the server mid-stream and asserts `kubectl-vX.Y.Z` does not appear in the cache would harden the contract.
4. **Empty-message cliexit.Error path in `main`.** The new branch in `cmd/rdd/main.go:163-165` is exercised only via the BATS Windows path (where `kuberlr.Exec` is the only producer). A small unit test in `pkg/cli/exit/` covering `Error.Error()` for both `Err: nil` and wrapped variants would protect the contract.
5. **Cluster reporting unparseable `gitVersion`.** `serverVersion` logs at `warn` and falls through, but no test covers it. The fake server already has the `--git-version-file` knob — a one-line test that writes `"not-a-version"` and asserts the resolver falls through to embedded kubectl would close the gap.

## Documentation Assessment

- **Inconsistent `RDD_CACHE_DIR` description**: see S1.
- **Misleading comment in BATS test**: see S2.
- **Internal Variables section** (`docs/design/environment.md:29-35`) is a nice addition — it correctly explains both the recursion guard and the `rdd ctl` short-circuit reuse of the same env var.
- The `docs/design/cmd_service.md` `kubectl_cache` row at line 137 is well-placed; the example output at line 158 includes the new keys. Good.
- `pkg/kuberlr/cache.go:5-9` package doc clearly describes the resolve-or-fall-through model and credits the upstream kuberlr.

## Commit Structure

Single commit, scoped reasonably. Title accurately reflects the bulk of the change. The `config.json` → `config.yaml` doc rename (S5) is the only out-of-scope hunk; in isolation, separate-PR-worthy, but trivial.

## Acknowledged Limitations

- **Cache eviction and orphaned `.kubectl-*` temp files** — `pkg/kuberlr/cache.go:33-39`:
  > TODO(eviction): the cache only grows; ... SIGKILL or power loss during a download also leaks the .kubectl-* temp file that defer os.Remove normally cleans up. A per-version TTL or LRU sweep should clear both stale binaries and stale .kubectl-* leftovers ...

  Acknowledged. Impact stays small until users routinely switch across many minor versions.

- **Offline mode with cached fallback** — `pkg/kuberlr/resolver.go:61-64`:
  > TODO(offline): when no cached kubectl matches and the network is down, Resolve currently fails with the download error. Add an opt-out for download attempts (config flag or RDD_KUBECTL_OFFLINE) and fall back to the cached binary closest to the server's minor.

  Acknowledged.

- **`RDD_KUBECTL_MIRROR` accepts non-HTTPS URLs** — implicit. Acknowledged in `mirrorURL()` (the BATS test relies on `http://127.0.0.1:port`). For production users who set this to a non-localhost http URL, MITM is possible. Trusting the env-var owner is consistent with other `RDD_*` overrides.

- **`local_teardown_file` deviates from the repo's "no teardown_file" convention** — `bats/tests/10-cli/7-kubectl-cache.bats:102-110`:
  > Deviates from the "no teardown_file" rule. The rule's intent is to preserve rdd state for post-mortem inspection; an HTTP server bound to an ephemeral port is the opposite — leaving it running just leaks ports between sessions.

  Author's rationale is sound: the rule exists to preserve rdd state for inspection, not to leave HTTP servers leaking ephemeral ports.

- **Windows BATS gating** — PR description: "⚠️ Do not merge until BATS works on Windows CI ⚠️". Acknowledged.

- **Empty-message cliexit.Error contract** — `cmd/rdd/main.go:159-160`:
  > Empty Err means the subcommand has already written its own output (e.g. kuberlr.Exec propagating a kubectl child's exit code), so skip the redundant log line.

  Acknowledged. The contract is clear in context but currently single-call-site.

## Unresolved Feedback

- **@mook-as on `cmd/rdd/service_paths.go:33`** — `https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348#discussion_r3182959199`: "Would it be possible to expose `RDD_CACHE_DIR` here too?"
  Resolved. Both `cache_dir` (the rdd-wide root) and `kubectl_cache` (the kubectl-specific subdir) appear in `instancePaths` (service_paths.go:32-33) and the docs (cmd_service.md:137, environment.md:48). No outstanding action.

## Coverage Summary

| File | Result |
|---|---|
| `.github/dependabot.yml` | Reviewed, no issues |
| `bats/tests/10-cli/5-paths.bats` | Reviewed, no issues (correctly adds `cache_dir`/`kubectl_cache` and the previously-missing `docker_socket`) |
| `bats/tests/10-cli/7-kubectl-cache.bats` | S2 |
| `bats/tests/10-cli/fake-kube/README.md` | Trivial |
| `bats/tests/10-cli/fake-kube/go.mod` | Trivial |
| `bats/tests/10-cli/fake-kube/kubectl/main.go` | Reviewed, no issues |
| `bats/tests/10-cli/fake-kube/server/main.go` | Reviewed, no issues |
| `cmd/rdd/kubectl.go` | Reviewed, no issues |
| `cmd/rdd/main.go` | Reviewed, no issues (empty-message contract documented; see Acknowledged Limitations) |
| `cmd/rdd/service_paths.go` | Reviewed, no issues |
| `docs/design/cmd_service.md` | S5 |
| `docs/design/environment.md` | S1 |
| `go.mod` | Reviewed, no issues (cli-runtime correctly promoted from indirect to direct; blang/semver/v4 promoted to direct) |
| `pkg/kuberlr/cache.go` | I2 |
| `pkg/kuberlr/downloader.go` | I1, S3 |
| `pkg/kuberlr/exec_unix.go` | Reviewed, no issues |
| `pkg/kuberlr/exec_windows.go` | Reviewed, no issues |
| `pkg/kuberlr/resolver.go` | S4 |
| `pkg/kuberlr/resolver_test.go` | Reviewed, no issues (TestParseKubeConfigFlags and TestIsClientOnly are thorough) |
### [Codex] — Pass 1

### Executive Summary

This PR adds a kubectl version resolver with a shared verified cache and covers the main miss/hit/error flows. The main issue is a shared-cache publish race that can turn a valid concurrent download into a user-visible failure on Windows; the docs also need a small consistency pass around the new cache path export.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Concurrent cold-cache publishes can fail on Windows** — `pkg/kuberlr/downloader.go:60-69`, `pkg/kuberlr/downloader.go:117`, `cmd/rdd/kubectl.go:74` (important, regression)

```go
if _, err := os.Stat(path); err == nil {
	return path, nil
}
// ...
if err := download(ctx, want, path); err != nil {
	return "", err
}
// ...
return os.Rename(tmpName, dst)
```

On a cold shared cache, every process that observes the binary missing at `pkg/kuberlr/downloader.go:60` enters `download` via `pkg/kuberlr/downloader.go:68`. The verified temp file is then published with an unconditional `os.Rename` at `pkg/kuberlr/downloader.go:117`. If two `rdd kubectl` processes resolve the same out-of-skew version concurrently, the first process can publish and start executing the cached `.exe` while the second is still downloading; on Windows, replacing an existing/running executable can fail, and that error returns through `pkg/kuberlr/downloader.go:69` and becomes a top-level `resolving kubectl version` failure at `cmd/rdd/kubectl.go:74`. At that point the cache is already valid, so the second command should use it instead of failing.

Fix: treat an existing destination after rename failure as a successful concurrent publish. This does not introduce a new trust state because `pkg/kuberlr/downloader.go:60` already treats any existing destination as a cache hit.

```diff
-	return os.Rename(tmpName, dst)
+	if err := os.Rename(tmpName, dst); err != nil {
+		// Another rdd process may have published the verified binary while we downloaded.
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil
+		}
+		return err
+	}
+	return nil
```

#### Suggestions

S1. **Path documentation contradicts the new cache export** — `docs/design/environment.md:16`, `docs/design/environment.md:39-48`, `docs/design/cmd_service.md:127-158`, `cmd/rdd/service_paths.go:32` (suggestion, regression)

```md
| `RDD_CACHE_DIR` | Overrides the rdd-wide cache root. ... |

`rdd svc paths --output=shell` exports these variables. They reflect the paths RDD uses for the current instance; setting them has no effect on RDD's behavior.

| `RDD_CACHE_DIR` | rdd-wide cache root (shared across instances) | ... |
```

`docs/design/environment.md:16` correctly documents `RDD_CACHE_DIR` as a behavior-changing override, but `docs/design/environment.md:39` still says path exports have no effect on RDD behavior while `docs/design/environment.md:48` newly lists `RDD_CACHE_DIR` as one of those exports. Also, `cmd/rdd/service_paths.go:32` adds the new `cache_dir` key and `docs/design/cmd_service.md:154` shows it in the example, but the key table at `docs/design/cmd_service.md:127-137` only adds `kubectl_cache` at `docs/design/cmd_service.md:137`.

Fix: call out `RDD_CACHE_DIR` as the exception in `environment.md`, and add `cache_dir` to the service paths key table.

```diff
-`rdd svc paths --output=shell` exports these variables. They reflect the paths RDD uses for the current instance; setting them has no effect on RDD's behavior.
+`rdd svc paths --output=shell` exports these variables. Most are informational; `RDD_CACHE_DIR` is also honored by `rdd` as the cache-root override.
```

```diff
 | `pid_file` | PID file path |
 | `args_file` | Saved arguments file path |
+| `cache_dir` | rdd-wide cache root (shared across instances) |
 | `kubectl_cache` | Cache directory for downloaded kubectl binaries (shared across instances) |
```

### Design Observations

**Strengths**

The resolver’s split between probe failures and download/checksum failures is sound: unreachable or malformed clusters fall through, while cache population failures surface loudly. Reusing `genericclioptions.ConfigFlags` for kubeconfig and context flag binding is also the right choice because it keeps resolver targeting aligned with kubectl’s own client config behavior.

### Testing Assessment

1. I1 is not covered: the BATS cache tests exercise sequential miss and hit paths, but not two concurrent cold-cache resolvers publishing the same version, especially with Windows executable locking.
2. The `--kubeconfig` and `--context` scanner has unit coverage, but there is no end-to-end resolver test proving those args steer the actual `/version` probe.

Verification run: `make build-rdd` passed; `go test ./pkg/kuberlr ./cmd/rdd -count=1` passed. BATS was not run because the BATS submodules are not initialized in this worktree.

### Documentation Assessment

S1 covers the only documentation gap found. The new environment variables are otherwise documented in `docs/design/environment.md`.

### Acknowledged Limitations

- **Code comment**: `TODO(eviction): the cache only grows...` at `pkg/kuberlr/cache.go:33`. This is a known footprint issue for users who visit many cluster minors; the current change makes it possible but does not need to solve eviction in this PR.
- **Code comment**: `TODO(offline): pair this with a "may we download?" toggle...` at `pkg/kuberlr/downloader.go:46`, expanded at `pkg/kuberlr/resolver.go:61`. This explicitly defers offline/no-download behavior; the implementation is consistent with that deferral.
- **Code comment**: `Deviates from the "no teardown_file" rule...` at `bats/tests/10-cli/7-kubectl-cache.bats:103`. The comment scopes the teardown to the fake HTTP server rather than RDD state, so it is acknowledged rather than reported as a BATS-style finding.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats` — Reviewed, no issues
- `bats/tests/10-cli/7-kubectl-cache.bats` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — I1
- `cmd/rdd/main.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — S1
- `docs/design/cmd_service.md` — S1
- `docs/design/environment.md` — S1
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — Reviewed, no issues
- `pkg/kuberlr/downloader.go` — I1
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — Reviewed, no issues
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues

### [Gemini] — Pass 1

### Executive Summary

This change introduces a version-aware `kubectl` resolver based on the `kuberlr` pattern. When `rdd kubectl` runs against an external cluster, it probes the cluster's version, and if the embedded `kubectl` is out of skew, it fetches a matching binary from `dl.k8s.io` into a shared cache. The implementation is robust, featuring capped stream limits, careful short-circuiting for client-only subcommands, and comprehensive BATS tests. The primary finding is a Windows-specific race condition during concurrent downloads where `os.Rename` can fail against an executing binary.

### Findings

#### Critical Issues
None.

#### Important Issues

I1. **Race condition during concurrent downloads on Windows** — `pkg/kuberlr/downloader.go:116-118` (important, regression)

```go
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
```

On Windows, executing binaries are locked by the OS. If two concurrent `rdd kubectl` processes race to download the same missing version, both will pass the initial `os.Stat` check and download to temporary files. The first to finish renames its temp file to `dst` and execs it. The second process then attempts `os.Rename` into `dst`. Because the first process is actively running the binary, the second `os.Rename` will fail with an `ERROR_ACCESS_DENIED` syscall error. The second process will abort and surface an error to the user, even though a valid, verified binary is now present at `dst`.

Fix: On Windows, catch the `os.Rename` failure. If the destination file already exists, treat the failure as a success. The `defer os.Remove(tmpName)` will automatically clean up the unused temp file.

```diff
 	if err := os.Chmod(tmpName, 0o755); err != nil {
 		return err
 	}
-	return os.Rename(tmpName, dst)
+	err := os.Rename(tmpName, dst)
+	if err != nil && runtime.GOOS == "windows" {
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil // Another process beat us to it and published the file.
+		}
+	}
+	return err
```

#### Suggestions

S1. **Redundant error return and dead code** — `pkg/kuberlr/resolver.go:82-84` (suggestion, enhancement)

```go
	server, ok, err := serverVersion(args)
	if err != nil {
		return "", err
	}
```

The `serverVersion` function is designed to swallow all probe errors (unreachable, malformed kubeconfig, unparseable version) so the resolver can silently fall back to the embedded `kubectl`. It unconditionally returns `nil` for its `error` result. As a result, the `err != nil` check shown above is dead code.

Fix: Remove the third `error` return value from the `serverVersion` signature entirely, and drop the `if err != nil` block in `Resolve`.

S2. **Unhandled panic on missing user cache directory** — `pkg/kuberlr/cache.go:52-55` (suggestion, gap)

```go
	cache, err := os.UserCacheDir()
	if err != nil {
		panic(fmt.Errorf("could not determine user cache directory: %w", err))
	}
```

If the environment lacks a `HOME` (or `LocalAppData` on Windows) variable, `os.UserCacheDir()` returns an error, causing `CacheRoot()` to panic. While a missing home directory is rare in desktop environments, a hard panic in a CLI tool offers a poor user experience.

Fix: Consider returning an error from `CacheRoot` and `CacheDir`, or falling back to `os.TempDir()` if the user cache directory is unavailable.

### Design Observations

#### Concerns

- **Supply Chain Trust**: Currently, `downloader.go` fetches the `kubectl` binary and its `.sha512` checksum from the exact same mirror (`dl.k8s.io` or `RDD_KUBECTL_MIRROR`). While verifying the hash protects against network truncation and disk corruption, it provides no security against a compromised mirror serving a malicious binary alongside a matching malicious `.sha512` file. Upstream Kubernetes releases are signed with Sigstore/Cosign. Moving forward, verifying the Cosign signatures rather than relying solely on the mirror's own checksum file would close this supply-chain gap. (future)

#### Strengths

- **Bounded Stream Reads**: The use of `io.LimitReader` coupled with `maxKubectlBytes` (256 MiB) and `maxSha512Bytes` in `streamGet` is an excellent resilience pattern. It turns potentially endless streams (from misconfigured or malicious mirrors) into bounded failures, ensuring the daemon won't exhaust memory or disk space.
- **Client-only Short-circuiting**: `isClientOnly` implements a well-considered "conservative bias". By defaulting to probing when encountering unknown flags with values, the resolver avoids silently falling through to a version-mismatched embedded `kubectl` when it is unsure, heavily favoring correctness over speed.

### Testing Assessment

The BATS test suite provides robust coverage of the new cache lifecycle, including success paths, SHA512 mismatches, 404s, and 500s from the apiserver.
1. The only untested scenario is the Windows-specific process replacement exit code propagation (`pkg/kuberlr/exec_windows.go`). Simulating a `kubectl` child returning a non-zero exit code and verifying `rdd kubectl` accurately forwards it would seal this gap.

### Documentation Assessment

The documentation explicitly notes `RDD_CACHE_DIR` and `RDD_KUBECTL_MIRROR`, and rectifies a pre-existing typo (`config.json` to `config.yaml`). No further documentation updates are necessary.

### Acknowledged Limitations

- **No Teardown File Deviation**: "Deviates from the "no teardown_file" rule. The rule's intent is to preserve rdd state for post-mortem inspection; an HTTP server bound to an ephemeral port is the opposite..." (`bats/tests/10-cli/7-kubectl-cache.bats:102`). The deviation is justified and correctly cleans up the ephemeral fake server to avoid port leaks.
- **Offline Mode Stub**: "TODO(offline): when no cached kubectl matches and the network is down, Resolve currently fails with the download error..." (`pkg/kuberlr/resolver.go:61`). The author acknowledges the lack of closest-match fallback offline.
- **Cache Eviction**: "TODO(eviction): the cache only grows; a user switching across many remote clusters accumulates ~50 MB per minor version indefinitely..." (`pkg/kuberlr/cache.go:33`). The author notes the lack of TTL/LRU sweep for stale binaries and abandoned `.kubectl-*` temp files.

### Coverage Summary

- `.github/dependabot.yml`: Trivial
- `bats/tests/10-cli/5-paths.bats`: Reviewed, no issues
- `bats/tests/10-cli/7-kubectl-cache.bats`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go`: Reviewed, no issues
- `cmd/rdd/kubectl.go`: Reviewed, no issues
- `cmd/rdd/main.go`: Reviewed, no issues
- `cmd/rdd/service_paths.go`: Reviewed, no issues
- `docs/design/cmd_service.md`: Reviewed, no issues
- `docs/design/environment.md`: Reviewed, no issues
- `go.mod`: Reviewed, no issues
- `pkg/kuberlr/cache.go`: S2
- `pkg/kuberlr/downloader.go`: I1
- `pkg/kuberlr/exec_unix.go`: Reviewed, no issues
- `pkg/kuberlr/exec_windows.go`: Reviewed, no issues
- `pkg/kuberlr/resolver.go`: S1
- `pkg/kuberlr/resolver_test.go`: Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-05-04 |
| **Commit** | `72bfad2 rdd kubectl: embed kuberlr-style version resolver` (amended) |

| # | Finding | Action | Reason |
|---|---------|--------|--------|
| 1 | Suggestion #1: environment.md describes RDD_CACHE_DIR as both behavior-changing and inert; cmd_service.md key table omits cache_dir | Fixed | |
| 2 | Suggestion #2: First-run rdd kubectl blocks silently for 30 s+ on slow links | Fixed | |
| 3 | Suggestion #3: CacheRoot panic now reaches rdd svc paths | Skipped | Trigger requires neither $HOME nor %LocalAppData% to resolve; declined Rounds 1-3 for the same reason. The new rdd svc paths blast radius is real but exotic in actual desktop use; the structural change touches multiple files for a low-probability defensive case. |
| 4 | Suggestion #4: local_setup_file comment misnames the EXE mechanism | Fixed | |
| 5 | Suggestion #5: SkipResolver mutates the global process env where a package-level bool would express the same intent without leaking into spawned children | Fixed | |
| 6 | Suggestion #6: serverVersion declares an error return but every path returns nil | Fixed | |
| 7 | Testing Gap #1: os.UserCacheDir failure path | Skipped | Depends on Skipped Suggestion #3. |
| 8 | Testing Gap #2: First-run download UX | Skipped | Single assert_output addition is light coverage; the warn line is trivially correct. Existing tests verify the download itself; a UX-line assertion is brittle if the message changes. |
| 9 | Testing Gap #3: Concurrent first-use cache miss | Skipped | Declined per resolutions gate (Rounds 1-4). Windows os.Rename race remains an acknowledged limitation. |
| 10 | Testing Gap #4: End-to-end Windows execution path | Skipped | Gated on Windows BATS support per PR description ("Do not merge until BATS works on Windows CI"). |
| 11 | Testing Gap #5: Cluster reporting unparseable gitVersion | Skipped | The code is straightforward (semver.ParseTolerant + warn log + fall-through) and the BATS test would mostly assert the embedded kubectl runs, which the existing apiserver-500 fall-through test already verifies in spirit. |
