# Deep Review: 20260519-235059-pr-348

| | |
|---|---|
| **Date** | 2026-05-19 23:50 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 10 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#348](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348) — rdd kubectl: embed kuberlr-style version resolver |
| **Branch** | `embed-kuberlr` |
| **Commits** | `37ac64c` bats: fix 7-kubectl-cache.bats failures on Git Bash<br>`30289ab` rdd kubectl: embed kuberlr-style version resolver |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — Round 10 reviews commit `37ac64c`, a focused Git Bash fix on top of the Round 9 commit `30289ab`. The fix correctly resolves the MSYS env-var prefix-stripping bug it documents, but the substitution it introduces — `env(1) rdd "$@"` in `rdd_env` (`bats/tests/10-cli/7-kubectl-cache.bats:100-106`) — bypasses the WSL-aware `rdd()` shell wrapper at `bats/helpers/commands.bash:42`. On WSL, the bin dir contains only `rdd.exe` and Linux `env(1)` does not auto-extend to `.exe`, so `env rdd` returns command-not-found before the test reaches the resolver. Codex (I1) caught this. All seven Round 9 findings (`Warnf → Infof` overshoot, WSL `go env GOOS` mismatch, five polish suggestions) remain unaddressed in `37ac64c`'s scope and carry over. Claude added a round of design observations on resolver structure (probe-on-every-call, error-envelope decoration) and 11 polish suggestions, the most substantive of which target shared-helper duplication and operator-facing error messages. Dropped per the resolutions/declined gates: the concurrent Windows `os.Rename` race (Codex I1, Gemini C1, Claude I1 — declined Rounds 1-9), the `CacheRoot` panic (Claude I5, Gemini I2 — declined Rounds 1-8), and `winpath` command substitution swallowing exits (Gemini I1 — declined Round 8 S5, Round 9 S2). |
| **Wall-clock time** | `1 h 4 min 43 s` |

## Executive Summary

Round 10 reviews commit `37ac64c` (Round 9 was `30289ab`). The fix is sound for Git Bash but its substitution — `env(1)` invoking bare `rdd` — bypasses the `rdd()` shell wrapper that handles WSL's WSLENV translation and the `.exe` suffix. The result regresses WSL test coverage (Codex I1). Round 9's seven findings (Warnf overshoot, WSL GOOS mismatch, five polish items) sit unaddressed because `37ac64c` touches only two files. Claude adds polish around shared-helper duplication and operator-facing error messages.

Structure: 0 Critical · 3 Important · 14 Suggestions.


---

## Critical Issues

None.

---

## Important Issues

I1. **`env(1) rdd "$@"` bypasses the WSL-aware `rdd()` shell wrapper, breaking WSL test coverage** — `bats/tests/10-cli/7-kubectl-cache.bats:100-106` [Codex GPT 5.5]

```bash
rdd_env() {
    env \
        "RDD_CACHE_DIR=$(winpath "${CACHE_DIR}")" \
        "RDD_KUBECTL_MIRROR=http://127.0.0.1:${PORT}" \
        "KUBECONFIG=$(winpath "${KUBECONFIG_PATH}")" \
        rdd "$@"
}
```

Commit `37ac64c` introduces `rdd_env` to dodge Git Bash's MSYS env-var prefix stripping. On WSL the bypass has two consequences. The Makefile sets `EXE=.exe` when `winver.exe` exists, so `bin/` holds `rdd.exe` not `rdd`; Linux `env(1)` does not auto-extend argv[0] with `.exe` the way the MSYS port does, so `env ... rdd "$@"` returns command-not-found before the resolver runs.

Second, the `rdd()` wrapper at `commands.bash:71-82` builds a `WSLENV` string so the Windows child sees `RDD_*` vars; bypassing it means the env vars reach `rdd.exe` empty even if the binary were findable. Round 9 I2 already noted WSL coverage breaks via bare `go env GOOS` at lines 15-17; this finding fails earlier and with a less informative error.

Fix: keep the `env(1)` shape for MSYS, restore the wrapper for WSL/Linux/macOS:

```diff
 rdd_env() {
-    env "RDD_CACHE_DIR=..." ... rdd "$@"
+    if is_msys; then
+        env "RDD_CACHE_DIR=$(winpath "${CACHE_DIR}")" ... \
+            "${PATH_REPO_ROOT}/bin/rdd${EXE}" "$@"
+    else
+        RDD_CACHE_DIR="$(winpath "${CACHE_DIR}")" ... rdd "$@"
+    fi
 }
```

(important, regression)

I2. **Round 8 `Warnf → Infof` fix overshoots: download notice invisible at default log level** — `pkg/kuberlr/downloader.go:78` [carry-over from Round 9 I1]

```go
logrus.Infof("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
```

Round 9 I1 (Claude) flagged that demoting from `Warnf` to `Infof` made the download notice invisible at the default `RDD_LOG_LEVEL=warn`, so a multi-minute kubectl download against a slow link gives the user no feedback. Commit `37ac64c` touches only `bats/helpers/os.bash` and `bats/tests/10-cli/7-kubectl-cache.bats`; this line is unchanged and the trade-off is unresolved. The Round 9 framing stands: a `fmt.Fprintln(os.Stderr, ...)` rendering as a plain operator notice sidesteps the level-vs-rendering tradeoff that `Warnf`/`Infof` cannot resolve.

```diff
-	logrus.Infof("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
+	fmt.Fprintf(os.Stderr, "downloading kubectl v%d.%d.%d from %s\n", want.Major, want.Minor, want.Patch, mirrorURL())
```

(important, regression, round 9 I1 unaddressed)

I3. **WSL BATS run populates mirror at `bin/linux/...` while `rdd.exe` requests `bin/windows/...`** — `bats/tests/10-cli/7-kubectl-cache.bats:15-17,35,38` [carry-over from Round 9 I2]

```bash
GOOS=$(go env GOOS)
export GOOS
GOARCH=$(go env GOARCH)
export GOARCH
```

Round 9 I2 (Codex) flagged that on a WSL workstation with Linux Go on `PATH`, `go env GOOS` returns `linux` while `using_windows_exe()` at `bats/helpers/defaults.bash:10-18` selects `rdd.exe` and `runtime.GOOS=windows` inside it. The mirror tree lands at `bin/linux/$GOARCH` while `rdd.exe` GETs `bin/windows/$GOARCH/kubectl.exe.sha512` and 404s. Commit `37ac64c` does not touch this code; the finding stands. Pin GOOS/GOARCH to match the rdd binary when `using_windows_exe` is true:

```diff
-GOOS=$(go env GOOS); export GOOS
-GOARCH=$(go env GOARCH); export GOARCH
+if using_windows_exe; then
+    export GOOS=windows
+else
+    GOOS=$(go env GOOS); export GOOS
+fi
+GOARCH=$(go env GOARCH); export GOARCH
```

`GOOS=windows go build` cross-compiles a Windows binary on WSL using the Linux Go cross-compiler that ships out of the box.

(important, regression, round 9 I2 unaddressed)

---

## Suggestions

S1. **`parseKubeConfigFlags` test does not assert that `--username` / `--password` stay unbound** — `pkg/kuberlr/resolver_test.go:69-153` [carry-over from Round 9 S1]

The `want` struct carries `username` and `password` fields and the comparison loop dereferences `cf.Username` / `cf.Password`, but no test case sets either flag in `args`. The docstring at `resolver.go:168-175` justifies leaving them unbound to dodge `ToRawKubeConfigLoader`'s interactive credential prompt; a future refactor that calls `WithDeprecatedPasswordFlag` would silently bind them and the test suite as written would not catch the regression.

```diff
+		{"--username unbound by design", []string{"--username=alice", "get"}, want{}},
+		{"--password unbound by design", []string{"--password=secret", "get"}, want{}},
```

(suggestion, gap, round 9 S1 unaddressed)

S2. **Sentinel comparison in fake server uses `!=` instead of `errors.Is`** — `bats/tests/10-cli/fake-kube/server/main.go:91` [carry-over from Round 9 S2]

```go
if err := server.Serve(listener); err != nil && err != http.ErrServerClosed {
    log.Fatalf("serve: %v", err)
}
```

`http.ErrServerClosed` is a sentinel value; `errors.Is(err, http.ErrServerClosed)` is the standard idiom. The fixture never wraps the error today so the equality compare works, but the code reads like production-shaped Go from a quick scan, and the repo uses `errors.Is` for sentinels elsewhere.

(suggestion, enhancement, round 9 S2 unaddressed)

S3. **`defer os.Remove(tmpName)` runs after a successful `Rename`; scope to the failure path** — `pkg/kuberlr/downloader.go:107-128` [carry-over from Round 9 S3]

```go
tmpName := tmp.Name()
defer os.Remove(tmpName)
...
return os.Rename(tmpName, dst)
```

After a successful `Rename`, `tmpName` no longer exists, so the deferred `os.Remove` returns `ENOENT` silently. Harmless today, but the reader has to mentally verify the deferred call cannot delete the new `dst`. A boolean-guarded defer expresses intent more clearly.

(suggestion, enhancement, round 9 S3 unaddressed)

S4. **`cliexit.Error{Err: nil}` no-log contract is implicit** — `cmd/rdd/main.go:172-178`, `pkg/kuberlr/exec_windows.go:39` [carry-over from Round 9 S4]

Only `pkg/kuberlr/exec_windows.go:39` builds the struct with `Err == nil` today. The contract lives only in a comment at `main.go:170-171`; a future caller that omits `Err` silently loses its log output. A named constructor documents the convention at the producer site:

```go
func Silent(code int) *Error { return &Error{Code: code} }
```

(suggestion, enhancement, round 9 S4 unaddressed)

S5. **`serverProbeTimeout = 2 * time.Second` may be tight for VPN-bound corporate clusters** — `pkg/kuberlr/resolver.go:41-44` [carry-over from Round 9 S5]

```go
const serverProbeTimeout = 2 * time.Second
```

A GKE / EKS / AKS control plane reached through a corporate split-tunnel VPN can routinely take 1-3 seconds to complete TLS handshake + IAM auth + return `/version`, especially on a cold connection. A spurious timeout falls through silently to the embedded kubectl per `serverVersion`'s contract at `resolver.go:135-138`, so an out-of-skew cluster reached over a slow VPN runs the embedded mismatched kubectl with no operator-visible reason. 5 seconds absorbs the realistic tail without weakening the unreachable-cluster guard.

(suggestion, enhancement, round 9 S5 unaddressed)

S6. **`streamGet` error message omits proxy and remote-IP context** — `pkg/kuberlr/downloader.go:189-198` [Claude Opus 4.7]

```go
if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("GET %s: %s", url, resp.Status)
}
```

When a corporate proxy intercepts the connection and returns a custom 403 page, the operator sees `GET https://dl.k8s.io/...: 403 Forbidden` without an indication that traffic was intercepted. Including the `Via` and `X-Cache` response headers on non-2xx, or echoing the response body's first ~256 bytes, points the operator at the right team. Proxies typically embed enough text in the rejection to identify themselves.

(suggestion, enhancement)

S7. **`cachedPath` and `download` duplicate the GOOS/.exe formatting** — `pkg/kuberlr/downloader.go:86-92, 98-102` [Claude Opus 4.7]

```go
name := fmt.Sprintf("kubectl-v%d.%d.%d", v.Major, v.Minor, v.Patch)
if runtime.GOOS == "windows" {
    name += ".exe"
}
```

The same `runtime.GOOS == "windows"` + `.exe` append appears in both `cachedPath` (line 88) and `download` (line 100). A refactor changing one and not the other lands a mismatched cache key vs. mirror URL; `TestCachedPath` (resolver_test.go:65) only checks the suffix, so the bug would slip through. Factor a `kubectlBinaryName(v)` helper that both call.

(suggestion, gap)

S8. **`parseKubeConfigFlags` and `isClientOnly` each build a fresh ConfigFlags FlagSet for the same args** — `pkg/kuberlr/resolver.go:179-188, 230-264` [Claude Opus 4.7]

Both helpers run `genericclioptions.NewConfigFlags(true).AddFlags(fs)` and `fs.Parse(args)` on the same args within a single `Resolve` call (`isClientOnly` → `loadClientConfig` → `parseKubeConfigFlags`). The duplication risks the binding surface drifting between the two — a future change to ConfigFlags would update one site and miss the other. Share a single parsed ConfigFlags through `Resolve`, or split out the bare-bool resolver-specific binding (`--client`, `--warnings-as-errors`, `--match-server-version`) into a separate small function that `isClientOnly` composes onto whatever `parseKubeConfigFlags` returns.

(suggestion, gap)

S9. **`envSkipResolver` doc comment omits the env-var spelling** — `pkg/kuberlr/resolver.go:23-26` [Claude Opus 4.7]

The doc comment names "Exec sets it on the kubectl child process" but does not spell out `RDD_KUBECTL_RESOLVED`. A reader debugging a re-exec loop greps for the literal env-var name, not the Go identifier. Adding `(RDD_KUBECTL_RESOLVED)` matches how `envCacheDir` and `envKubeMirror` comment their constants.

(suggestion, documentation)

S10. **`dependabot.yml` registers a second go.mod directory that has no dependencies** — `.github/dependabot.yml` [Claude Opus 4.7]

```yaml
- package-ecosystem: gomod
  directories:
  - /
  - /bats/tests/10-cli/fake-kube
```

The `bats/tests/10-cli/fake-kube/go.mod` declares zero dependencies (stdlib only plus the `fake-kube` self-import). Dependabot's `gomod` ecosystem still scans the directory on every interval. Drop the second entry until the module grows a real dependency, or add a comment explaining the deliberate scaffold.

(suggestion, enhancement)

S11. **Fake-kube server hand-rolls `version.Info` JSON instead of marshaling the upstream struct** — `bats/tests/10-cli/fake-kube/server/main.go:73-77` [Claude Opus 4.7]

```go
_, _ = fmt.Fprintf(w,
    `{"major":%q,"minor":%q,"gitVersion":%q,"gitCommit":"fake","gitTreeState":"clean","buildDate":"2026-01-01T00:00:00Z","goVersion":"go1.21","compiler":"gc","platform":"%s/%s"}`,
    ...)
```

The current JSON covers `gitCommit`, `gitTreeState`, `buildDate`, `goVersion`, `compiler`, `platform`. A future Kubernetes bump that adds a required field to `k8s.io/apimachinery/pkg/version.Info` would break discovery silently without the test noticing. Construct the value via the upstream struct and `encoding/json.Marshal` so the fixture stays trivially in sync.

(suggestion, gap)

S12. **`local_setup_file` polling loop fails without context when the fake server cannot bind** — `bats/tests/10-cli/7-kubectl-cache.bats:64-69` [Claude Opus 4.7]

```bash
for i in {1..50}; do
    [[ -s ${PORT_FILE} ]] && break
    sleep 0.1
done
[[ -s ${PORT_FILE} ]] || fail "fake-kube-server did not write a port file"
```

A flaky CI host that misses the 5-second budget fails with a generic message and no clue why the server failed to start. `log.Fatalf` errors from the server (e.g., "listen: bind: address already in use") go to stderr only, not to `LOG_FILE`. Capture stderr (`2>"${BATS_FILE_TMPDIR}/server.stderr"`) and `cat` it on failure.

(suggestion, enhancement)

S13. **`streamGet` returns without draining the response body on non-200, defeating connection reuse** — `pkg/kuberlr/downloader.go:189-198` [Claude Opus 4.7]

```go
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("GET %s: %s", url, resp.Status)
}
```

`resp.Body.Close()` on an unread body prevents the keep-alive connection from being returned to the pool. For one-shot CLI use this does not matter, but the same pattern in a long-lived daemon (the TODO mentions k3s images, Lima templates) would leak connections. `io.Copy(io.Discard, resp.Body)` before return is the idiomatic fix.

(suggestion, gap)

S14. **`--client` test surface and binding scope leave a documented behavior unverified** — `pkg/kuberlr/resolver.go:218-264`, `resolver_test.go:178` [Claude Opus 4.7]

The comment at `resolver.go:218-224` documents that `--client=garbage version` (leading-flag form) halts pflag mid-args, lands in the empty-positionals branch, and returns `true` (treats as client-only). The existing test case "version --client=garbage stays false" at `resolver_test.go:178` only exercises the trailing-flag form. Add the leading-flag case so the documented behavior is locked in:

```diff
+		{"--client=garbage leading halts parse, no positionals", []string{"--client=garbage", "version"}, true},
```

Separately, `--client` binds at the global level via `fs.BoolVar(&client, "client", false, "")` at `resolver.go:251`. Today this is harmless because `isClientOnly` only consults `client` when `subcommand == "version"` (line 262), but the binding's signal-to-noise would improve if the bool lived only when relevant. Either form is fine; documenting the choice or scoping the bind keeps the surface honest.

(suggestion, gap)

---

## Design Observations

### Concerns

- **Resolver probes `/version` on every non-client-only kubectl invocation (in-scope)** [Claude Opus 4.7] — `Resolve` at `pkg/kuberlr/resolver.go:68-93` calls `serverVersion(args)` before checking the cache. A reachable cluster's discovery adds a round-trip per invocation; an unreachable cluster pays the full 2-second `serverProbeTimeout`. A tight loop of `kubectl get pod -w` retries pays the probe cost every retry. The natural inversion — check `cachedPath(serverFromLastProbe)` first and fall back to a probe only when no cached binary matches the most recent observed version — matches what kuberlr upstream does. A small per-context cache (`{context, version, mtime}` under `CacheDir()`) with a 30-second TTL probes once per minute instead of once per call. Future, layered on top of the existing TODO(eviction) work.

- **`kubectlAction`'s resolver-failure path produces a `level=fatal` decoration on the CLI** [Claude Opus 4.7] — `cmd/rdd/kubectl.go:50` returns `fmt.Errorf("resolving kubectl version: %w", err)`. `cmd/rdd/main.go:178` falls through to `logrus.Fatal(err)` for any non-`*cliexit.Error`, prepending `level=fatal msg=` and exiting 1. The decoration reads like an internal logger dump rather than a CLI error. The same pre-existing behavior affects every CLI command that returns a plain `error`, so this is not a regression; raise it here only because the resolver introduces a new high-frequency error path. A clean fix is project-wide rather than kuberlr-local — surface `*cliexit.Error{Code: 1}` after `fmt.Fprintln(os.Stderr, ...)` whenever the CLI wants kubectl-style output.

- **`pkg/kuberlr/` mixes cache layout, HTTP download, and version probe (future)** [Claude Opus 4.7] — `cache.go` is exposed to all of rdd via `rdd svc paths`; `downloader.go` and `resolver.go` are internal to `rdd kubectl`. Once a future consumer (k3s images, Lima templates per the existing TODO) lands, `cache.go`'s 56 lines move up (perhaps to `pkg/cache/`) and `kuberlr` becomes just the resolver. The split is cheap; pre-empting it would clarify the boundary now.

### Strengths

- **Recursion-guard belt-and-braces** [Claude Opus 4.7] — cross-process via `RDD_KUBECTL_RESOLVED` env var, same-process via `skipResolver` bool, covering both axes (downloaded kubectl re-execs rdd through a shim vs. `rdd ctl` calling `kubectlAction` in-process). `TestResolveSkipsWhenRecursionGuardSet` (Round 8 testing gap, closed in Round 9) locks the cross-process axis in.

- **Streaming caps on `streamGet` with explicit truncation tradeoff comments** [Claude Opus 4.7] — `maxSha512Bytes` (4 KiB) and `maxKubectlBytes` (256 MiB), paired with `downloadTimeout` (5 min) and the in-code comment naming the silent-truncation-as-checksum-mismatch behavior, turn a malicious or misconfigured mirror into a bounded failure rather than an unbounded write.

- **Conservative probe fallback policy** [Codex GPT 5.5, Gemini 3.1 Pro] — missing kubeconfig, malformed kubeconfig, unreachable cluster, and unparseable server version all fall through silently to the embedded kubectl. Only download and sha errors surface loudly, matching the policy the resolver comments draw explicitly.

- **Flag-binding test surface is unusually thorough** [Claude Opus 4.7] — `TestParseKubeConfigFlags` and `TestIsClientOnly` together exercise 50+ cases of pflag's UnknownFlags handling, `--client=garbage` short-circuit, `--` separator, `-n` namespace aliases, kubectl globals. Hard-won coverage of pflag quirks that future kubectl-flag changes benefit from.

- **Sibling Go module for fake-kube fixtures** [Claude Opus 4.7] — `bats/tests/10-cli/fake-kube/go.mod` keeps test-only dependencies out of the parent module; one server plays both apiserver `/version` and release mirror `/release/*` roles with file-based per-test overrides.

---

## Testing Assessment

`pkg/kuberlr/resolver_test.go` and the five-test BATS lifecycle in `7-kubectl-cache.bats` cover the user-reachable paths well. Round 10 surfaces test-side gaps, no production-code gaps:

1. **WSL with native `rdd.exe` plus a Linux `go` on PATH** (I1, I3) — the env(1)/wrapper-bypass and GOOS-mismatch failures compound; the test suite has no WSL coverage that would catch either before merge.
2. **`--username` / `--password` deliberately unbound on the resolver side** (S1) — add cases that pass the flags and assert the bound values remain empty.
3. **`--client=garbage` leading-flag form** (S14) — the comment documents the case; lock it in with a test.
4. **Server `/version` returns 200 OK with an empty body** — the existing 404 test (mirror_404) covers the not-found case, but a "200 OK empty body" exercises the `fetchSha512` empty-string branch.
5. **Resolver probe under cluster-reachable-but-slow** — no test asserts the 2-second timeout actually fires; a fake server that delays `/version` by 3 seconds would lock in the timeout.
6. **`rdd ctl` does not re-trigger the resolver** — `cmd/rdd/kubectl.go:48` asserts the contract by inspection only; no test exercises `rdd ctl get pods` and confirms the resolver short-circuits via the same-process `skipResolver` flag.

---

## Documentation Assessment

`docs/design/cmd_service.md` and `docs/design/environment.md` match the resolver behavior. The probe-on-every-call characteristic (Design Concern) is undocumented in user-facing help text and design docs — worth a one-line note in `environment.md` so an operator notices the extra discovery round-trip per invocation. Round 9's documentation assessment also stands: no gaps in the changed docs.

---

## Commit Structure

The branch carries two commits:

- `30289ab rdd kubectl: embed kuberlr-style version resolver` — large feature commit; self-contained.
- `37ac64c bats: fix 7-kubectl-cache.bats failures on Git Bash` — a focused BATS fix for code added in `30289ab`. Both Claude and the dimensions guidance on fixup commits flag this as a fold-back candidate: it patches lines from the immediately-preceding commit, the subject names the fix not new functionality, and the diff stat (41 insertions / 17 deletions, two files) is small. The repo's working-memory note allows amending during development; folding `37ac64c` into `30289ab` before merge tightens the history. Not a blocker.

---

## Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/cache.go:33-39` documents unbounded cache growth and leaked `.kubectl-*` temp files after SIGKILL or power loss; eviction work is deferred.
- **Code comment**: `pkg/kuberlr/downloader.go:55-57` and `pkg/kuberlr/resolver.go:64-67` document the planned `RDD_KUBECTL_OFFLINE` opt-out; intentionally deferred.
- **Code comment**: `pkg/kuberlr/downloader.go:138-141` documents `maxKubectlBytes` silent truncation surfacing as "checksum mismatch"; left as a comment per Round 6.
- **Code comment**: `pkg/kuberlr/exec_windows.go:27-30` documents the dormant `CommandContext` hard-kill race against console-delivered Ctrl+C; a future refactor that wires signal-driven cancellation must revisit this site (Round 7 S3 → Commented). Claude S7 in the original review (not raised here) noted the same surface.
- **Code comment**: `pkg/kuberlr/resolver.go:217-228` documents the `--client=garbage` pflag short-circuit and the deviation from the conservative bias. See I3 / S14.
- **Code comment**: `pkg/kuberlr/resolver.go:168-175` documents leaving `--username` / `--password` unbound to dodge `ToRawKubeConfigLoader`'s interactive prompt.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:108-116` documents the deliberate deviation from the "no `teardown_file`" rule for the ephemeral HTTP-server fixture.
- **PR comment by @mook-as on `cmd/rdd/service_paths.go:33`** — "Would it be possible to expose `RDD_CACHE_DIR` here too?" Addressed in `30289ab`: `cache_dir` and `kubectl_cache` keys are exposed via `instancePaths()`, and `--output=shell` exports `RDD_CACHE_DIR` and `RDD_KUBECTL_CACHE` per `docs/design/environment.md:48-52`.

### Declined in prior rounds

- **Concurrent first-use Windows `os.Rename` cache-publish race** — declined Rounds 1-9. Re-raised this round by Codex (I1), Gemini (C1), and Claude (I1). Dropped per the resolutions gate. Ten rounds running.
- **`CacheRoot` panic on `os.UserCacheDir` failure** — declined Rounds 1, 2, 3, 5, 7, 8. Re-raised this round by Claude (I5) and Gemini (I2). Dropped per the resolutions gate.
- **Command substitution swallows non-zero exits** (`$(winpath ...)`) — declined Round 8 S5, declined Round 9 S2. Re-raised this round by Gemini (I1). Dropped per the resolutions gate.

---

## Unresolved Feedback

None. The only PR review comment (RDD_CACHE_DIR exposure) is addressed in commit `30289ab`.

---

## Agent Performance Retro

### [Claude]

Claude carried this round again. Five Important findings and twelve Suggestions, with two Important dropped by the declined-prior-rounds gate (I1 `os.Rename`, I5 `CacheRoot` panic) and two reframed as Design Observations (I2 cliexit envelope decoration, I4 probe-on-every-call). One Important (I3 `--client=garbage`) downgraded to a Suggestion because the comment at `resolver.go:218-224` documents the deviation as deliberate. Net contribution: nine Suggestions (S6-S14) covering shared-helper duplication, operator-facing error messages, fake-server fixture fidelity, and BATS debuggability. The three design observations — probe-on-call, error-envelope decoration, and the cache/download/probe split — were the unique signal of this round; no other agent traced into the upstream kuberlr design enough to spot them. The session ran ~13 minutes at xhigh effort; the first attempt hit the 15-minute watchdog mid-investigation, the second completed cleanly under a 30-minute ceiling.

### [Codex]

Codex returned its trademark tight, focused review: two Important findings, zero Suggestions. I1 (concurrent `os.Rename`) was the long-declined re-raise; dropped per the resolutions gate without prejudice — ten rounds running. I2 (`env(1) rdd "$@"` bypasses the WSL-aware `rdd()` shell wrapper) is the most valuable finding this round and the only fresh Important that survived consolidation. The chain from "commit `37ac64c` introduced env(1)" through "Linux env(1) doesn't auto-extend .exe" to "WSLENV translation lost" is the kind of cross-file, cross-platform trace only available to an agent that reads `commands.bash` and `defaults.bash` in addition to the diff. Coverage was complete on all 21 changed files.

### [Gemini]

Gemini's signal was thin this round and entirely declined items. C1 (`os.Rename`), I1 (command substitution swallowing exits), and I2 (`CacheRoot` panic) all map to long-standing declined-in-prior-rounds entries; S1 (drive-by `docker_socket` in `ALL_KEYS`) recapitulates a Round 7 green-lit precedent. The Design Strengths overlapped with Codex's. Net new contribution: zero. Gemini does not run `git blame` (daily quota limits) so the declined-prior-rounds signal can't reach its prompt without explicit re-prompting; the consolidation-side resolutions gate caught the drops cheaply, but the round still spent tokens on shapes already settled.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 13m 15s | 6m 44s | 3m 56s |
| Findings | 9S | 1I | none |
| Tool calls | 44 (Bash 25, Read 19) | 64 (shell 64) | — |
| Design observations | 3 | 0 | 0 |
| False positives | 0 | 0 | 0 |
| Unique insights | 12 | 1 | 0 |
| Files reviewed | 21 | 21 | 21 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **9S** | **1I** | **none** |
| Downgraded | 3 (I→S) | 0 | 0 |
| Dropped | 2 | 1 | 4 |


Reconciliation: Claude I1 (critical Windows `os.Rename` race) and I5 (CacheRoot panic) → dropped per the declined-prior-rounds gate. Claude I2 (cliexit envelope) → reframed as Design Observation because the same issue affects every CLI command, not just kuberlr. Claude I3 (`--client=garbage` leading flag) → downgraded to Suggestion S14 because the comment at `resolver.go:218-224` documents the deviation as deliberate. Claude I4 (probe-on-every-call) → reframed as Design Observation, in-scope; the cache-then-probe inversion is a future enhancement rather than a Round 10 must-fix. Codex I1 → dropped (declined Rounds 1-9). Gemini C1, I1, I2 → all dropped per the resolutions/declined gates.

Overall: Claude provided the highest signal this round (one Important kept, nine consolidated Suggestions, three Design Observations). Codex contributed the most valuable single finding (I1 env(1) bypass). Gemini contributed no net-new findings.

---

## Review Process Notes

**Skill improvements**

- **Document the `claude-xhigh` subagent registration requirement and recovery path.** A user-level subagent file under `~/.claude/agents/` loads only at session start; if the session begins before the file lands on disk, the in-process Claude path returns "agent type not found" and the orchestrator must fall back to `--claude-external`. Add to the deep-review skill's prerequisites: a one-line check that `claude-xhigh` is registered (e.g., grep the session's available-agents list for the name) before the first Step 4 dispatch, with a clear recovery instruction ("resume the Claude Code session, or pass `--claude-external` for this run"). The current skill assumes the subagent is always available; when it is not, the failure mode looks like a transient agent error rather than a setup gap, which costs an investigation cycle.

- **Tighten the prior-round-resolutions gate to enumerate every prior round, not just the most recent.** Round 10's declined-list pulled "concurrent Windows `os.Rename` race — declined Rounds 1-9" because the Round 9 retro paragraph names the count. The gate as written says "Read the most recent prior `.reviews/{ID}.md` for the same target and parse its `## Resolution` table (if present)". The most recent prior review for PR #348 has no Resolution table yet (it has not been resolved), so the gate's literal interpretation would not catch the declined-prior-rounds entries that landed *before* the most recent unresolved round. Add: "When the most recent prior review has no Resolution table, walk back to the most recent review that does." Round 9's `20260519-105717-pr-348.md` is the right anchor for this PR; the Round 10 review at `20260519-131414-pr-348.md` is informational only.

**Repo context updates**

None this round. The Repository section's long-declined-findings entry (added after Rounds 7-8) is doing its job: Codex, Gemini, and Claude all re-raised the `os.Rename` race, and consolidation dropped them quickly. Adding more shape-specific entries (CacheRoot panic, winpath command substitution) would help even more, but each is already in the declined-list per the gate.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

# Code Review: PR #348 — rdd kubectl: embed kuberlr-style version resolver

## Executive Summary

The change adds a kuberlr-style kubectl resolver in `pkg/kuberlr/`, wires `rdd kubectl` through it, and adds BATS coverage with a fake mirror + apiserver. The implementation is careful (sha512 verification, body-size caps, atomic rename via temp file, recursion guard, conservative client-only detection) and well-tested. The headline concern is platform-specific: concurrent `rdd kubectl` cache-miss races and post-crash artifacts both interact poorly with Windows `os.Rename` semantics. Several smaller robustness and UX issues are worth tightening before merge.

## Findings

### Critical Issues

None.

### Important Issues

**I1. Concurrent cache miss races on Windows `os.Rename` — `pkg/kuberlr/downloader.go:107-128`** *(important, regression)*

```go
tmp, err := os.CreateTemp(filepath.Dir(dst), ".kubectl-*")
...
return os.Rename(tmpName, dst)
```

Two concurrent `rdd kubectl` invocations against the same out-of-skew cluster both observe a cache miss at `ensureCached`'s `os.Stat` (line 70), both call `download`, both `os.CreateTemp` to distinct names, and both race to `os.Rename(tmpName, dst)` (line 128). On POSIX the second rename atomically replaces; on Windows `os.Rename` fails with `ERROR_ALREADY_EXISTS` (Go does not pass `MOVEFILE_REPLACE_EXISTING`). The losing invocation then returns `fmt.Errorf("rename ... already exists")` wrapped through `kubectlAction`'s "resolving kubectl version" prefix, exiting non-zero — a flaky failure mode that surfaces under shells that pipeline or parallel-launch kubectl. The cache is explicitly cross-instance, so two RDD instances driven by separate shells hit it easily.

Fix:
```diff
 if err := os.Chmod(tmpName, 0o755); err != nil {
     return err
 }
-return os.Rename(tmpName, dst)
+if err := os.Rename(tmpName, dst); err != nil {
+    // On Windows os.Rename refuses to clobber. If a concurrent
+    // download just published the same binary, accept it and
+    // discard ours.
+    if _, statErr := os.Stat(dst); statErr == nil {
+        return nil
+    }
+    return err
+}
+return nil
```

**I2. `kubectlAction` reports a download/checksum failure but `main` re-prints it as a Fatal — `cmd/rdd/kubectl.go:63-79`, `cmd/rdd/main.go:168-181`** *(important, regression)*

```go
return fmt.Errorf("resolving kubectl version: %w", err)
```

`kubectlAction` returns a plain wrapped error on resolver failure. In `main.go`'s `errors.As(err, &exitErr)` branch, the new code at line 174 only skips logging when `exitErr.Err == nil`. A plain `fmt.Errorf(...)` is not a `*cliexit.Error`, so the fallthrough at line 179 hits `logrus.Fatal(err)`. `logrus.Fatal` prints `fatal level=fatal msg="resolving kubectl version: ..."` and exits 1. The user-visible message reads like an internal logger dump rather than a kubectl-style error.

Fix:
```diff
-return fmt.Errorf("resolving kubectl version: %w", err)
+fmt.Fprintf(os.Stderr, "rdd: resolving kubectl version: %v\n", err)
+return &cliexit.Error{Code: 1}
```

**I3. `--client=garbage version` order misclassified as client-only — `pkg/kuberlr/resolver.go:230-264`** *(important, gap)*

```go
fs.BoolVar(&client, "client", false, "")
...
_ = fs.Parse(args)
```

For `["--client=garbage", "version"]`, pflag's bool-parse error halts parsing mid-args. Positionals = `[]`, so `isClientOnly` returns `true` from the empty-positionals branch — meaning a real `kubectl version` invocation against a remote cluster is treated as client-only. Test case `version --client=garbage stays false` (resolver_test.go:178) covers the trailing-flag form but not this leading-flag form.

The comment at lines 218-224 names this exact behavior ("halts parse mid-args, leaves no positionals, and lands in the empty-positionals branch below") and accepts it as a deliberate tradeoff. So this is more a test gap than a behavior gap — add the leading-flag case to lock the deviation in.

**I4. `Resolve` probes /version on every kubectl invocation, including cache hits — `pkg/kuberlr/resolver.go:68-93`** *(important, gap)*

Every non-client-only `rdd kubectl` invocation makes a fresh HTTP probe to the apiserver before checking the cache. The 2-second timeout caps the worst case, but on an unreachable cluster every invocation pays the full timeout. More importantly, even with a reachable cluster a `kubectl get pods` call now does an extra discovery round-trip every time. A tight loop of `kubectl get pod -w` retries pays the probe cost per retry. This is a visible regression for users on slow links.

Fix: cache the most recent server version per kubeconfig context for a short TTL (e.g. 30s) under `CacheDir()` with `{context, version, mtime}` keys, refreshed on miss/expiry — so a steady-state stream of kubectl invocations probes once per minute rather than once per call. Smaller step: skip the probe when an exact cached version already exists.

**I5. `os.UserCacheDir` panic propagates into `rdd svc paths` — `pkg/kuberlr/cache.go:47-55`, `cmd/rdd/service_paths.go:32-33`** *(important, regression)*

```go
cache, err := os.UserCacheDir()
if err != nil {
    panic(fmt.Errorf("could not determine user cache directory: %w", err))
}
```

`os.UserCacheDir` fails when neither `$XDG_CACHE_HOME` nor `$HOME` is set (POSIX) or `%LocalAppData%` is unset (Windows). `instancePaths()` now calls `kuberlr.CacheRoot()` and `kuberlr.CacheDir()` unconditionally to build the table that backs every `rdd svc paths` invocation. A previously inert error path (only `rdd kubectl` would have hit it) now crashes any `rdd svc paths` invocation with a Go panic stack trace. Minimal CI images that lack `HOME` are the realistic trigger.

### Suggestions

**S1. `fetchSha512` empty-body branch reachability comment — `pkg/kuberlr/downloader.go:147-176`** *(suggestion, gap)*

`fetchSha512`'s "empty checksum response" (line 161) is reachable only when a mirror returns 200 OK with a zero-byte body. A 404 errors earlier in `streamGet`. Worth noting in the comment so the next reader doesn't think it's unreachable.

**S2. `streamGet` does not surface remote IP / proxy in the error — `pkg/kuberlr/downloader.go:183-199`** *(suggestion, enhancement)*

```go
return fmt.Errorf("GET %s: %s", url, resp.Status)
```

When a corporate proxy intercepts the connection and returns a custom 403 page, the operator sees `GET https://dl.k8s.io/...: 403 Forbidden` without an indication that traffic was intercepted. Consider including the `Via` and `X-Cache` response headers when the status is non-2xx, or echoing the response body's first ~256 bytes.

**S3. `cachedPath` and `download` duplicate the GOOS/.exe formatting — `pkg/kuberlr/downloader.go:86-92, 98-102`** *(suggestion, gap)*

```go
name := fmt.Sprintf("kubectl-v%d.%d.%d", v.Major, v.Minor, v.Patch)
if runtime.GOOS == "windows" {
    name += ".exe"
}
```

The same pattern appears twice. Factor a `kubectlBinaryName()` helper; today a refactor could change one and not the other and `TestCachedPath` would not catch it because the assertion at resolver_test.go:65 only checks the suffix.

**S4. `parseKubeConfigFlags` and `isClientOnly` each build a fresh ConfigFlags FlagSet — `pkg/kuberlr/resolver.go:179-188, 230-264`** *(suggestion, gap)*

Both helpers run `genericclioptions.NewConfigFlags(true).AddFlags(fs)` and `fs.Parse(args)` on the same args within a single `Resolve` call. The duplication risks the binding surface drifting between the two. Either share a single parsed ConfigFlags through `Resolve`, or split out the resolver-specific bare-bool binding into a separate small function.

**S5. Recursion-guard env-var name belongs in the Go comment — `pkg/kuberlr/resolver.go:23-26`** *(suggestion, documentation)*

The doc comment for `envSkipResolver` says "Exec sets it on the kubectl child process" but does not name the env-var spelling (`RDD_KUBECTL_RESOLVED`). A reader debugging a re-exec loop greps for the literal name. Adding `(RDD_KUBECTL_RESOLVED)` to the comment matches how `envCacheDir` and `envKubeMirror` are commented.

**S6. `dependabot.yml` directory entry for fake-kube has no current deps — `.github/dependabot.yml`** *(suggestion, regression)*

The fake-kube module has zero declared dependencies (only stdlib + the self-import). Dependabot's `gomod` ecosystem still triggers periodic scans against an empty go.mod. Worth dropping the second `directories:` entry until the fake-kube module grows a dependency.

**S7. Windows `Exec` swallows `*exec.ExitError`'s nested context cancellation — `pkg/kuberlr/exec_windows.go:36-44`** *(suggestion, gap)*

```go
var exitErr *exec.ExitError
if errors.As(err, &exitErr) {
    return &cliexit.Error{Code: exitErr.ExitCode()}
}
```

When `CommandContext`'s ctx cancels, `cmd.Run()` returns the child's exit error with code 1 (kubectl killed by signal). The code returns `&cliexit.Error{Code: 1}` and main.go prints nothing — the user sees rdd hang on shutdown then exit silently. The TODO at lines 27-30 names this issue. Once signal-driven cancellation lands, distinguish "kubectl exited" from "rdd killed kubectl via context cancel".

**S8. `fake-kube/server/main.go` hand-rolls `version.Info` JSON — `bats/tests/10-cli/fake-kube/server/main.go:73-77`** *(suggestion, gap)*

The current JSON sets `gitCommit`, `gitTreeState`, `buildDate`, `goVersion`, `compiler`, `platform`. A future k8s bump could add a required field. Mirror the upstream `k8s.io/apimachinery/pkg/version.Info` struct via `encoding/json` rather than hand-rolling the JSON string so the test fixture stays trivially in sync.

**S9. `local_setup_file` polling loop has no stderr capture on server startup failure — `bats/tests/10-cli/7-kubectl-cache.bats:64-69`** *(suggestion, gap)*

```bash
for i in {1..50}; do
    [[ -s ${PORT_FILE} ]] && break
    sleep 0.1
done
```

A flaky CI host can miss this 5-second budget and fail with `fake-kube-server did not write a port file` without context. Capture the server's stderr to a file (`2>"${BATS_FILE_TMPDIR}/server.stderr"`) and `cat` it on failure. `log.Fatalf` errors (e.g., "listen: bind: address already in use") go to stderr only, not to `LOG_FILE`.

**S10. `streamGet` returns without draining the response body on non-200 — `pkg/kuberlr/downloader.go:189-198`** *(suggestion, gap)*

```go
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("GET %s: %s", url, resp.Status)
}
```

Correct but the unread body prevents connection reuse from the pool. For one-shot CLI use this does not matter; a long-lived daemon caller of the same pattern would leak connections. An `io.Copy(io.Discard, resp.Body)` before return is the idiomatic fix.

**S11. `--client` binds at the global level, not scoped to `version` — `pkg/kuberlr/resolver.go:242-263`** *(suggestion, gap)*

`fs.BoolVar(&client, "client", false, "")` binds `--client` at the global level, so `rdd kubectl --client foo` parses `client=true` and a bogus subcommand `foo`. Today this is harmless: `isClientOnly` returns false (foo is not in the map, not "version"), so the resolver probes anyway and kubectl rejects `--client` at the top level. Worth a one-line test for the bind-before-subcommand confusion, or restructure to only consult `client` when `subcommand == "version"` (which the code already does at line 262).

### Design Observations

**Concerns**

*(in-scope)* The package mixes three responsibilities in one directory: (1) cache layout (`cache.go`, used by `rdd svc paths`), (2) HTTP download + sha512 verify (`downloader.go`), and (3) version probe + flag parsing (`resolver.go`). The first is exposed to all of rdd; the latter two are internal to `rdd kubectl`. Once a future consumer (k3s images, Lima templates) lands as the TODO promises, `cache.go` will need to move up (perhaps to `pkg/cache/`) and `kuberlr` should become just the resolver.

*(future)* The "probe before consulting cache" design (I4) inverts the natural cache-then-validate flow. A more efficient design checks `cachedPath(serverFromLastProbe)` first and falls back to a probe only when no cached binary matches the most recent observed version. The kuberlr upstream does this; the simplification here drops it. Calling out the regression in the README would help users decide whether to set `RDD_KUBECTL_OFFLINE` once it lands.

**Strengths**

- The recursion-guard contract (in-process `skipResolver` for `rdd ctl`; env-var `RDD_KUBECTL_RESOLVED` for re-exec'd kubectl shims) is well-thought-out and the tests at `resolver_test.go:243-258` lock it in.
- The body-size caps on `streamGet` with explicit comments about silent-truncation tradeoffs are the right defensive posture for a binary fetch from a CDN.
- The flag-binding test surface at `resolver_test.go:99-153` is unusually thorough — every kubeconfig override pflag form is exercised, including the `--` boundary case.
- The fake-kube fixture as a sibling Go module avoids polluting the main `go.mod` with test-only deps.

### Testing Assessment

1. Concurrent download race (I1) — no test exercises two simultaneous `ensureCached` calls.
2. `UserCacheDir` failure (I5) — no test runs with `HOME=""` and `XDG_CACHE_HOME=""`.
3. Mirror returns 200 OK with empty body for the checksum — existing 404 test covers the not-found case, but not the empty-body case.
4. Resolver probe under cluster-reachable-but-slow — no test asserts the 2-second timeout actually fires.
5. `rdd kubectl` exit code propagation for non-cliexit errors — the BATS suite asserts `assert_failure` but not the exact code.

### Documentation Assessment

- `docs/design/environment.md` correctly documents the new `RDD_CACHE_DIR` and `RDD_KUBECTL_MIRROR` env vars and the internal `RDD_KUBECTL_RESOLVED` recursion guard.
- `docs/design/cmd_service.md` updates the example output and the keys table.
- The kubectl-version-resolver behavior change (extra probe per call, I4) is not documented in the rdd help text or design docs.

### Commit Structure

- `30289ab rdd kubectl: embed kuberlr-style version resolver` — large feature commit; self-contained.
- `37ac64c bats: fix 7-kubectl-cache.bats failures on Git Bash` — a fixup for the BATS test added in the previous commit. The repo's working-memory note says "Amending commits during development is fine — use it to keep changes organized logically." This second commit fixes lines added in the first; folding it back would tighten history.

### Acknowledged Limitations

- **PR comment by @mook-as on `cmd/rdd/service_paths.go:33`** — "Would it be possible to expose RDD_CACHE_DIR here too?" The same commit (`30289ab`) adds both `cache_dir` and `kubectl_cache` keys in `instancePaths()`. Addressed.
- `pkg/kuberlr/cache.go:33-39` `TODO(eviction)`: cache only grows; SIGKILL leaks `.kubectl-*` temp files.
- `pkg/kuberlr/downloader.go:54-57` `TODO(offline)`: opt-out for download attempts deferred.
- `pkg/kuberlr/downloader.go:137-141`: `io.LimitReader` silent truncation surfaces as checksum mismatch.
- `pkg/kuberlr/exec_windows.go:27-30`: `CommandContext` hard-kills on ctx cancellation.
- `pkg/kuberlr/resolver.go:64-67` `TODO(offline)`: download failure rather than fall back to closest cached binary.
- `pkg/kuberlr/resolver.go:77-84`: dev-build fall-through documented.
- `pkg/kuberlr/resolver.go:218-224`: bool-parse-error halts-parse behavior. See I3.
- `bats/tests/10-cli/7-kubectl-cache.bats:108-116`: `local_teardown_file` deviation documented.

### Unresolved Feedback

None.

### Coverage Summary

- `.github/dependabot.yml` — S6.
- `bats/helpers/os.bash` — Reviewed, no issues.
- `bats/helpers/paths.bash` — Reviewed, no issues.
- `bats/tests/10-cli/5-paths.bats` — Reviewed, no issues.
- `bats/tests/10-cli/7-kubectl-cache.bats` — S9.
- `bats/tests/10-cli/fake-kube/README.md` — Trivial.
- `bats/tests/10-cli/fake-kube/go.mod` — Trivial.
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues.
- `bats/tests/10-cli/fake-kube/server/main.go` — S8.
- `cmd/rdd/kubectl.go` — I2.
- `cmd/rdd/main.go` — I2.
- `cmd/rdd/service_paths.go` — I5.
- `docs/design/cmd_service.md` — Reviewed, no issues.
- `docs/design/environment.md` — Reviewed, no issues.
- `go.mod` — Reviewed, no issues.
- `pkg/kuberlr/cache.go` — I5.
- `pkg/kuberlr/downloader.go` — I1, S1, S2, S3, S10.
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues.
- `pkg/kuberlr/exec_windows.go` — S7.
- `pkg/kuberlr/resolver.go` — I3, I4, S4, S5, S11.
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues.

---

### [Codex] — Pass 1

### Executive Summary

The resolver design is mostly sound and has meaningful unit/BATS coverage. The main merge blocker is cache publication: concurrent first-use downloads can fail on Windows because the shared cache uses a check-then-rename sequence without handling an already-published destination. I also found a Windows test helper path that bypasses the repo’s WSL-aware `rdd()` wrapper.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Concurrent cache population fails on Windows** — `pkg/kuberlr/downloader.go:68-80` `(important, regression)`

```go
path := cachedPath(want)
if _, err := os.Stat(path); err == nil {
	return path, nil
}
...
if err := download(ctx, want, path); err != nil {
	return "", err
}
```

`ensureCached` checks for the target at line 70, then downloads and publishes later via `os.Rename(tmpName, dst)` at `pkg/kuberlr/downloader.go:128`. The cache is explicitly shared across instances, so two `rdd kubectl` processes can both miss line 70 and race to publish the same version. On Windows, `os.Rename` fails when `dst` already exists, so the loser returns `resolving kubectl version: ...` even though a valid kubectl is now cached.

Fix: serialize per-version downloads or handle `ErrExist` after rename by hashing the existing `dst` against the fetched checksum and returning success only if it matches.

```diff
-	return os.Rename(tmpName, dst)
+	if err := os.Rename(tmpName, dst); err != nil {
+		if errors.Is(err, os.ErrExist) && fileMatchesSHA512(dst, want) {
+			return nil
+		}
+		return err
+	}
+	return nil
```

I2. **`rdd_env` bypasses the WSL-aware test wrapper** — `bats/tests/10-cli/7-kubectl-cache.bats:100-105` `(important, regression)`

```bash
rdd_env() {
    env \
        "RDD_CACHE_DIR=$(winpath "${CACHE_DIR}")" \
        "RDD_KUBECTL_MIRROR=http://127.0.0.1:${PORT}" \
        "KUBECONFIG=$(winpath "${KUBECONFIG_PATH}")" \
        rdd "$@"
}
```

The helper invokes `env ... rdd` at line 105, which runs an external binary and bypasses the shell function in `bats/helpers/commands.bash:42`. That wrapper is where the WSL path and `WSLENV` handling live (`bats/helpers/commands.bash:63-82`). On WSL, the built binary is `rdd.exe`, environment propagation for `RDD_*` is wrapper-managed, and plain `rdd` may not resolve to `rdd.exe`.

Fix: keep the direct `env` path only for MSYS, and use the wrapper for WSL/Linux/macOS.

```bash
if is_msys; then
    env "RDD_CACHE_DIR=$(winpath "${CACHE_DIR}")" ... "${PATH_REPO_ROOT}/bin/rdd${EXE}" "$@"
else
    RDD_CACHE_DIR="$(winpath "${CACHE_DIR}")" \
    RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
    KUBECONFIG="$(winpath "${KUBECONFIG_PATH}")" rdd "$@"
fi
```

#### Suggestions

None.

### Design Observations

**Strengths**

The resolver keeps probe failures non-fatal while surfacing download and checksum failures, which matches the operational difference between “cluster unreachable” and “we knowingly selected a mismatched client.”

The download path verifies while streaming into a temp file and only publishes after checksum success, which avoids normal partial-file cache poisoning.

### Testing Assessment

1. Concurrent first-use cache population is untested, especially the Windows `os.Rename` destination-exists case covered by I1.
2. The new BATS path is not exercised under WSL with Windows executables, which would catch I2.
3. Existing tests cover cache hit/miss, mirror 404, checksum mismatch, and `/version` fall-through well.

### Documentation Assessment

The new cache variables and `kubectl_cache` path are documented. Existing design docs outside this PR still mention `config.json` in places, but that drift predates this change.

### Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/cache.go:33` documents that the cache grows indefinitely and stale `.kubectl-*` temp files can survive crashes. The change makes this relevant over time, but the limitation is explicitly marked with a TODO.
- **Code comment**: `pkg/kuberlr/downloader.go:56` documents the missing offline mode. This is an intentional follow-up, not a review finding.
- **Code comment**: `pkg/kuberlr/downloader.go:138` documents that oversized kubectl downloads surface as checksum mismatches rather than size-cap errors. This is acknowledged and bounded.
- **Code comment**: `pkg/kuberlr/exec_windows.go:27` documents `CommandContext` hard-kill behavior on future cancellation wiring. Current cobra contexts are not cancelled, so this is not an active regression.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:108` explicitly explains the `local_teardown_file` exception for the fake HTTP server.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/paths.bash` — Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats` — Reviewed, no issues
- `bats/tests/10-cli/7-kubectl-cache.bats` — I2
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — Reviewed, no issues
- `cmd/rdd/main.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `docs/design/environment.md` — Reviewed, no issues
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — Reviewed, no issues
- `pkg/kuberlr/downloader.go` — I1
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — Reviewed, no issues
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues

`go test ./pkg/kuberlr` passed. `go test ./cmd/rdd` could not run because `pkg/guestagent/lima-guestagent.gz` is absent in this checkout.

---

### [Gemini] — Pass 1

### Executive Summary
This PR implements a version resolver for `rdd kubectl` that downloads, verifies, and executes a version-matched `kubectl` from an upstream mirror if the embedded binary falls outside the ±1 minor version skew. The implementation is well-tested and robust, though it introduces a race condition on Windows where concurrent cache misses fail due to `os.Rename` collision semantics.

### Findings

#### Critical Issues
C1. **Concurrent downloads fail on Windows** — `pkg/kuberlr/downloader.go:128`

```go
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
```

On Windows, `os.Rename` fails with `os.ErrExist` if the destination file already exists. If multiple `rdd kubectl` processes encounter a cache miss concurrently, both will download to temporary files, and the second process to call `os.Rename` will abort with a download error. Since the first process successfully populated and verified the file, the second process should treat the pre-existing destination as a success rather than surfacing an error.

Fix: Catch and swallow `os.ErrExist` on Windows.

```diff
 	if err := os.Chmod(tmpName, 0o755); err != nil {
 		return err
 	}
-	return os.Rename(tmpName, dst)
+	err := os.Rename(tmpName, dst)
+	if err != nil && runtime.GOOS == "windows" && errors.Is(err, os.ErrExist) {
+		return nil
+	}
+	return err
```
(critical, regression)

#### Important Issues
I1. **Command substitution swallows non-zero exits** — `bats/tests/10-cli/5-paths.bats:43`

```bash
 @pkg/util/tail/tail_test.go 'rdd svc paths honors RDD_CACHE_DIR override' {
    cache_override=$(winpath "${BATS_TEST_TMPDIR}/cache-override")
```

The use of command substitution `$(...)` swallows non-zero exit codes from the underlying helper commands (like `cygpath` or `wslpath` inside `winpath`). If the helper fails, the test continues silently with an empty variable, leading to cryptic downstream assertion failures. The project convention requires using `run -0` to surface failures immediately. (See also `bats/helpers/os.bash:107`).

Fix: Extract the substitution into a direct `run -0` call.

```diff
-    cache_override=$(winpath "${BATS_TEST_TMPDIR}/cache-override")
+    run -0 winpath "${BATS_TEST_TMPDIR}/cache-override"
+    cache_override=${output}
```
(important, gap)

I2. **CacheRoot panics if the user cache directory is undetermined** — `pkg/kuberlr/cache.go:53`

```go
	cache, err := os.UserCacheDir()
	if err != nil {
		panic(fmt.Errorf("could not determine user cache directory: %w", err))
	}
	return filepath.Join(cache, "rancher-desktop")
```

When `$HOME` (on Unix) or `%LocalAppData%` (on Windows) is unset in a minimal CI or container environment, `os.UserCacheDir()` returns an error. The current implementation responds with a `panic`. Because `rdd svc paths` calls `CacheRoot()` to populate its paths table, running that command in such an environment will crash the CLI entirely instead of failing gracefully.

Fix: Fall back to a reasonable default like `os.TempDir()` when the standard cache directory cannot be determined.

```diff
 	cache, err := os.UserCacheDir()
 	if err != nil {
-		panic(fmt.Errorf("could not determine user cache directory: %w", err))
+		cache = os.TempDir()
 	}
 	return filepath.Join(cache, "rancher-desktop")
```
(important, regression)

#### Suggestions
S1. **Bundled test fix obscures PR scope** — `bats/tests/10-cli/5-paths.bats:3`

```bash
-ALL_KEYS="args_file config dir lima_home log_dir pid_file short_dir tls_dir"
+ALL_KEYS="args_file cache_dir config dir docker_socket kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
```

The PR adds `docker_socket` to `ALL_KEYS` alongside the expected `cache_dir` and `kubectl_cache` keys. This implicitly patches a pre-existing gap where the `docker_socket` path was missing from the test suite's validation list. While the fix is helpful, bundling unrelated test suite corrections obscures the PR's core intent. Keep the addition, but in the future, extract unrelated cleanup into separate commits.

Fix: No code change required.
(suggestion, enhancement)

### Design Observations

**Concerns**
None.

**Strengths**
- The separation of global flags parsing in `isClientOnly` with `UnknownFlags = true` gracefully handles unknown flags without polluting the cluster probe. The fallback bias is extremely safe and well-documented.
- The use of `io.LimitReader` alongside hash validation against the mirrored checksum defends beautifully against misconfigured or malicious HTTP servers, correctly bounding memory without intricate size-checking loops.
- The exit-code propagation mechanism cleanly passes `cliexit.Error` upwards on Windows, shielding the user from redundant log lines natively.

### Testing Assessment
The test suite comprehensively covers the version skew logic, download-and-verify loops, missing upstream binaries, checksum validation failures, and silent fall-throughs during cluster probe failures. Error behaviors and test server components are rigorously mocked.

### Documentation Assessment
The architectural diagrams and docs in `docs/design/` are properly updated with the new environment overrides and cache path exposures. The inline comments explaining tricky execution patterns and edge cases (like `limitReader` truncation) are exceptionally clear.

### Acknowledged Limitations
- **Cache eviction strategy:** The author added a TODO in `pkg/kuberlr/cache.go:33` acknowledging that the cache only grows and lacks a TTL or LRU eviction strategy.
- **Offline mode support:** The author added a TODO in `pkg/kuberlr/resolver.go:64` acknowledging that offline operations currently fail with a network error instead of falling back to the closest cached binary.
- **Hard-kill on context cancellation (Windows):** The author added a comment in `pkg/kuberlr/exec_windows.go:27` acknowledging that `exec.CommandContext` issues a hard kill upon cancellation, racing `kubectl`'s graceful shutdown.
- **Teardown file deviation:** The author added a comment in `bats/tests/10-cli/7-kubectl-cache.bats:108` explicitly deviating from the "no `local_teardown_file`" rule to prevent the test HTTP server from leaking ephemeral ports across sessions.

### Coverage Summary
- `.github/dependabot.yml`: Trivial
- `bats/helpers/os.bash`: I1
- `bats/helpers/paths.bash`: Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats`: I1, S1
- `bats/tests/10-cli/7-kubectl-cache.bats`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md`: Trivial
- `bats/tests/10-cli/fake-kube/go.mod`: Trivial
- `bats/tests/10-cli/fake-kube/kubectl/main.go`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go`: Reviewed, no issues
- `cmd/rdd/kubectl.go`: Reviewed, no issues
- `cmd/rdd/main.go`: Reviewed, no issues
- `cmd/rdd/service_paths.go`: Reviewed, no issues
- `docs/design/cmd_service.md`: Trivial
- `docs/design/environment.md`: Trivial
- `go.mod`: Trivial
- `pkg/kuberlr/cache.go`: I2
- `pkg/kuberlr/downloader.go`: C1
- `pkg/kuberlr/exec_unix.go`: Reviewed, no issues
- `pkg/kuberlr/exec_windows.go`: Reviewed, no issues
- `pkg/kuberlr/resolver.go`: Reviewed, no issues
- `pkg/kuberlr/resolver_test.go`: Reviewed, no issues
