# Deep Review: 20260519-105717-pr-348

| | |
|---|---|
| **Date** | 2026-05-19 10:57 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 8 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#348](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348) — rdd kubectl: embed kuberlr-style version resolver |
| **Branch** | `embed-kuberlr` |
| **Commits** | `ce035d5` rdd kubectl: embed kuberlr-style version resolver |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — Round 8 surfaces three new Important findings on the Windows/WSL BATS plumbing introduced by Round 7's tightening of the `5-paths.bats` RDD_CACHE_DIR override test: the `kill ${SERVER_PID}` call in `local_teardown_file()` cannot reach the native Win32 `fake-kube-server.exe` (the helper file documents `process_kill` for this case), `env … rdd …` bypasses the `rdd()` shell function and its WSL path / WSLENV conversion, and `winpath` only handles MSYS so WSL invocations of native rdd.exe receive Linux-form paths. None of these affect macOS/Linux runs; they degrade the Windows/WSL CI matrix. I1 (`logrus.Warnf` for routine cache-miss download) is a judgment call against the repo's `RDD_LOG_LEVEL=warn` default — surface for decision rather than auto-flag. Three long-declined re-raises (Windows `os.Rename` race, `CacheRoot` panic) dropped per the resolutions gate; Claude's S2 was the same race shape and is dropped likewise. |
| **Wall-clock time** | `17 min 30 s` |

## Executive Summary

Round 8 reviews commit `ce035d5b` (Round 7 was `6b419bc`; Round 7 fixes — `kustomize`/`plugin` in `clientOnlySubcommands`, `docker_socket`/`args.json` cmd_service.md sync, inline note on `exec_windows.go:27` — verified landed). The new commit also introduced `bats/helpers/paths.bash::winpath`, the `local_teardown_file()` cleanup hook on `7-kubectl-cache.bats`, and the `rdd svc paths honors RDD_CACHE_DIR override` test on `5-paths.bats`. All three Round 8 Important findings cluster on the BATS plumbing those Round 7 changes added — and all three are Windows/WSL-specific. On macOS and Linux the test surface is clean.

The keepers: I1 (Claude + Gemini) flags `logrus.Warnf` at downloader.go:78 in a routine download path — at the repo's default `RDD_LOG_LEVEL=warn` this surfaces a yellow WARN banner on every cache miss; the intent (operator-relevant "downloading kubectl..." status) is legitimate, but the level is a judgment call worth Jan's decision. I2 (Codex) notes that `local_teardown_file()` calls `kill ${SERVER_PID}` against `fake-kube-server.exe`, a native Win32 process MSYS `kill` cannot reach — `bats/helpers/os.bash:96-117` documents this exact case and provides `process_kill`. I3 (Codex) notes that the new `run -0 env RDD_CACHE_DIR=... rdd ...` form in `5-paths.bats:45,48` and across `7-kubectl-cache.bats` execs `bin/rdd${EXE}` directly via PATH, bypassing the `rdd()` bash function — losing WSL `/mnt/...` path conversion and the WSLENV adjustment that propagates `RDD_*` vars across the WSL boundary.

Verification dropped Codex I1, Gemini I1, Gemini I2, and Claude S2 — all re-raises of the long-declined Windows `os.Rename` race or `CacheRoot` panic. The Round 7 retro already proposed a "long-declined findings" repo-context entry; Round 8 makes that recommendation more urgent: eight rounds, four agents, same two shapes.

Structure: 0 Critical · 3 Important · 7 Suggestions.

## Critical Issues

None.

## Important Issues

I1. **`logrus.Warnf` in routine cache-miss download surfaces a WARN banner at default log level** — `pkg/kuberlr/downloader.go:78` [Claude Opus 4.7, Gemini 3.1 Pro]

```go
logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
```

Outside developer mode, `RDD_LOG_LEVEL` defaults to `warn`, so every `logrus.Warn` reaches the user's terminal. A cache miss on a remote cluster's first invocation at a new minor version is a routine expected event: the resolver downloads, sha-verifies, and execs. The user sees a yellow `WARN[…]` banner above their normal `kubectl get pods` output. Compare to the `logrus.Debug` lines elsewhere in the resolver for routine fall-through paths.

The intent is legitimate — telling the user "you are about to wait for a ~50 MB download" is operator-relevant — so the choice of level is a judgment call. Claude raised it as such; Gemini flagged it as Warn-level misuse outright. Two reasonable shapes for the fix:

```diff
- logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
+ // One-time per (server-version, cache state) event. Surface what the
+ // resolver is doing without raising a WARN; the user sees kubectl
+ // output immediately after.
+ logrus.Infof("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
```

Or keep `Warnf` and document the rationale inline against the project's logging convention.

I2. **`local_teardown_file()` cannot terminate the native Win32 fake-kube-server.exe** — `bats/tests/10-cli/7-kubectl-cache.bats:91-99` [Codex GPT 5.5]

```bash
local_teardown_file() {
    if load_var SERVER_PID; then
        kill "${SERVER_PID}" 2>/dev/null || true
    fi
}
```

`local_setup_file()` builds the fake server as `fake-kube-server.exe` on Windows and launches it (lines 35-39, 49-55). MSYS2 `kill` cannot reach Win32 processes — `bats/helpers/os.bash:96-117` documents this case explicitly and provides `process_kill`, which dispatches to `taskkill.exe /F /PID` on Windows. The existing teardown silently fails on MSYS, leaking the HTTP server and the bound ephemeral port across BATS sessions, which is the exact behavior the deviation from the "no `teardown_file`" rule was added to prevent. The cleanup works on macOS and Linux.

```diff
 local_teardown_file() {
     if load_var SERVER_PID; then
-        kill "${SERVER_PID}" 2>/dev/null || true
+        process_kill "${SERVER_PID}" 2>/dev/null || true
     fi
 }
```

I3. **`run -0 env … rdd …` bypasses the `rdd()` shell function on WSL** — `bats/tests/10-cli/5-paths.bats:45,48`; `bats/tests/10-cli/7-kubectl-cache.bats:124-128,143-147,159-163,178-182,199-203` [Codex GPT 5.5]

```bash
run -0 env \
    RDD_CACHE_DIR="$(winpath "${CACHE_DIR}")" \
    RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
    KUBECONFIG="$(winpath "${KUBECONFIG_PATH}")" \
    rdd kubectl get pods
```

`bats/helpers/commands.bash:42-89` defines `rdd()` as a bash function that handles three Windows boundaries: MSYS path conversion (lines 47-61), WSL `/mnt/...` path conversion (lines 62-70), and the WSLENV augmentation (lines 71-82) that adds every `RDD_*` env var to WSLENV so they cross the WSL→Win32 boundary into `rdd.exe`. `bats/helpers/load.bash:65` puts `bin/` on PATH, so `env … rdd …` execvp's `bin/rdd${EXE}` directly — none of the wrapper logic runs. Two consequences on WSL:

```bash
# bats/helpers/commands.bash:71-82 — WSLENV augmentation
WSLENV=""
for env in "${envs[@]}"; do
    WSLENV="${WSLENV}:${env}"
done
```

First, `RDD_CACHE_DIR` and `RDD_KUBECTL_MIRROR` set on the `env` command line do not get added to WSLENV, so the values do not cross into `rdd.exe`. Second, `winpath` (see S1) does not handle WSL, so the pre-conversion in the test produces Linux-form paths that `rdd.exe` cannot read.

The fix routes the call through the bash function. Two ways:

```diff
-    run -0 env \
-        RDD_CACHE_DIR="$(winpath "${CACHE_DIR}")" \
-        RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
-        KUBECONFIG="$(winpath "${KUBECONFIG_PATH}")" \
-        rdd kubectl get pods
+    RDD_CACHE_DIR="${CACHE_DIR}" \
+    RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
+    KUBECONFIG="${KUBECONFIG_PATH}" \
+    run -0 rdd kubectl get pods
```

The temporary-assignment-before-command form invokes the `rdd` function with the env vars set for that one call; the function then handles path conversion and WSLENV. (`winpath` becomes redundant once the wrapper handles it.)

## Suggestions

S1. **`winpath` does not handle WSL→Windows.exe path conversion** — `bats/helpers/paths.bash:14-20` [Codex GPT 5.5]

```bash
winpath() {
    if is_msys; then
        cygpath -w "$1"
    else
        printf '%s' "$1"
    fi
}
```

`winpath` only converts on MSYS. WSL is the other shell that calls a native Windows `rdd.exe` (`commands.bash:4-6` sets `EXE=.exe` whenever `using_windows_exe` is true), and the existing repo convention for WSL path conversion is `wslpath -w` (see `commands.bash:65-66` and `paths.bash:26-27`). On WSL the `$(winpath ...)` call sites pass Linux-form paths into `rdd.exe`, which cannot read them. The narrow fix:

```diff
 winpath() {
     if is_msys; then
         cygpath -w "$1"
+    elif is_windows && using_windows_exe; then
+        wslpath -w "$1"
     else
         printf '%s' "$1"
     fi
 }
```

A wider fix routes the call sites through the `rdd()` shell function (see I3); doing both is defensible because `winpath` is also used in `local_setup_file` for raw `${SERVER_BIN}` args that never traverse the `rdd()` wrapper.

S2. **Unbound kubectl-global bool flags can hide the subcommand from `isClientOnly`** — `pkg/kuberlr/resolver.go:230-264` [Claude Opus 4.7]

```go
var help, client, warningsAsErrors bool
fs.BoolVarP(&help, "help", "h", false, "")
fs.BoolVar(&client, "client", false, "")
fs.BoolVar(&warningsAsErrors, "warnings-as-errors", false, "")
```

The pflag UnknownFlags handling at lines 217-224 documents — and the test at `resolver_test.go:200` confirms — that an unknown spaced flag consumes its next token as the assumed value. `kubectl --match-server-version config view` therefore has pflag swallow `config`, leaves `view` as the sole positional, and `view` is not in `clientOnlySubcommands`, so the resolver probes the cluster for a command kubectl can answer locally. The fall-through is correct (no silent mismatched execution), but the user pays the 2-second probe penalty. `--match-server-version` is the obvious candidate to add; an audit of upstream `staging/src/k8s.io/kubectl/pkg/cmd/cmd.go` would surface any other bool globals kubectl declares above the subcommand layer.

```diff
 var help, client, warningsAsErrors bool
 fs.BoolVarP(&help, "help", "h", false, "")
 fs.BoolVar(&client, "client", false, "")
 fs.BoolVar(&warningsAsErrors, "warnings-as-errors", false, "")
+var matchServerVersion bool
+fs.BoolVar(&matchServerVersion, "match-server-version", false, "")
```

A mirror of the existing `--warnings-as-errors` test cases locks the behavior in: `{"--match-server-version before config", []string{"--match-server-version", "config", "view"}, true}`.

S3. **`skipResolver` global is a set-only toggle that leaks across hypothetical future tests** — `pkg/kuberlr/resolver.go:31-39`, `pkg/kuberlr/resolver_test.go:222-233` [Claude Opus 4.7]

```go
var skipResolver bool

func SkipResolver() {
    skipResolver = true
}
```

`SkipResolver()` is a one-way toggle on a package-global; nothing flips it back. `TestResolveEmbeddedVersionUnparseable` swaps `embeddedVersion` via `t.Cleanup` but never touches `skipResolver`, so a future test that calls `SkipResolver()` would leave the bit set for any later test that exercises Resolve, and Resolve short-circuits on line 69 before the path the test means to exercise. The test would still pass but for the wrong reason.

In production the CLI process exits after one command, so the global is fine; the concern is purely test hygiene against future test additions.

```diff
 func TestResolveEmbeddedVersionUnparseable(t *testing.T) {
+    origSkip := skipResolver
+    t.Cleanup(func() { skipResolver = origSkip })
     orig := embeddedVersion
     t.Cleanup(func() { embeddedVersion = orig })
```

S4. **`main.go` `errors.As` block captures `msg` it does not use; check `exitErr.Err` directly** — `cmd/rdd/main.go:172-178` [Claude Opus 4.7]

```go
if errors.As(err, &exitErr) {
    if msg := err.Error(); msg != "" {
        logrus.Error(err)
    }
    os.Exit(exitErr.Code)
}
```

The captured `msg` is unused; `logrus.Error(err)` calls `err.Error()` again. The empty-string check works today because `cliexit.Error.Error()` returns `""` iff `Err == nil`. If a future caller wraps the `cliexit.Error` (e.g. `fmt.Errorf("foo: %w", &cliexit.Error{Code: c})`), the outer `err.Error()` is non-empty even when the inner sentinel marks "subcommand already wrote output," and the redundant log line returns. Checking `exitErr.Err != nil` states intent precisely and survives the wrap.

```diff
 if errors.As(err, &exitErr) {
-    if msg := err.Error(); msg != "" {
+    if exitErr.Err != nil {
         logrus.Error(err)
     }
     os.Exit(exitErr.Code)
 }
```

S5. **`winpath` via `$(...)` swallows non-zero exits in standalone assignments** — `bats/tests/10-cli/5-paths.bats:43`; `bats/tests/10-cli/7-kubectl-cache.bats:39` [Claude Opus 4.7]

```bash
cache_override=$(winpath "${BATS_TEST_TMPDIR}/cache-override")
```

The repo's BATS style rule prefers `run -0 cmd ...; value=${output}` over `value=$(cmd ...)` because command substitution discards the exit code and leaves the variable empty on failure. `winpath` delegates to `cygpath -w` on MSYS, which can fail in pathological cases. At the standalone assignment sites — `5-paths.bats:43` and the `SERVER_BIN=...` shape near `7-kubectl-cache.bats:39` — the `run -0` form locks in the safety. Sites inside a `run` command line (most of `7-kubectl-cache.bats`) cannot use `run -0` recursively; those are unavoidable shell composition.

S6. **BATS coverage forces every test out of skew; the within-skew "no download" path runs only in unit tests** — `bats/tests/10-cli/7-kubectl-cache.bats` [Claude Opus 4.7]

The five `7-kubectl-cache.bats` tests all pin the fake apiserver at `v1.99.0`, always out of skew with the embedded kubectl. The end-to-end coverage exercises the download lifecycle (miss, hit, mirror 404, sha mismatch, apiserver 500 fall-through) but never the within-skew branch where `compatible()` returns true and `Resolve` returns `""` without probing the mirror. The within-skew rule is covered by `TestCompatible` at the unit level; an end-to-end test (e.g. set `--git-version` to `v$(EMBEDDED_MINOR).0`) would prove the whole resolver respects the rule end-to-end.

S7. **Per-patch cache key accumulates one binary per patch release** — `pkg/kuberlr/downloader.go:86-92` [Claude Opus 4.7] (future)

```go
name := fmt.Sprintf("kubectl-v%d.%d.%d", v.Major, v.Minor, v.Patch)
```

A cluster stepping through `v1.30.5 → v1.30.6 → v1.30.7` accumulates three ~50 MB binaries even though any one would satisfy the skew rule. The eviction TODO in `cache.go:33-39` addresses growth but not consolidation. A future enhancement could cache one binary per server minor (the highest patch the mirror knows about for that minor) and reuse it for any server in the same minor band. Within scope of the existing TODO; flag for sequencing.

## Design Observations

### Concerns

- **Double execution of KUBECONFIG `exec` auth plugins (in-scope)** [carried over from Round 7 retro]. `serverVersion` calls `discovery.NewDiscoveryClientForConfig(cfg).ServerVersion()`, which runs the user's kubeconfig `exec` credential plugin (SSO, IDP CLI, hardware-token). The probe runs the plugin once; if Resolve then downloads and execs a remote kubectl, kubectl invokes the same plugin again. Non-interactive plugins double the latency; interactive plugins double the prompts. The TODO at `resolver.go:64-67` for `RDD_KUBECTL_OFFLINE` is the natural escape hatch — offline mode should also suppress the probe so the user's `exec` plugin runs exactly once.

### Strengths

- **`genericclioptions.ConfigFlags` for resolver flag binding** [Codex, Gemini]. The probe and the kubectl child stay aligned on kubeconfig precedence by construction — any future kubectl flag rename or addition flows through `NewConfigFlags(true).AddFlags(fs)` without a parallel implementation to drift.
- **Body-size caps and per-request timeout on `streamGet`** [Claude]. `maxSha512Bytes` (4 KiB) and `maxKubectlBytes` (256 MiB), paired with `downloadTimeout` (5 min), turn a malicious or misconfigured mirror into a bounded failure rather than an unbounded write.
- **Sibling Go module for fake-kube fixtures** [Claude, Gemini]. `bats/tests/10-cli/fake-kube/go.mod` keeps test-only dependencies out of the parent module, and a single server plays both apiserver `/version` and release mirror `/release/*` roles with file-based per-test overrides.
- **Conservative probe fallback policy** [Codex, Gemini]. Missing kubeconfig, malformed kubeconfig, unreachable cluster, and unparseable server version all fall through silently to the embedded kubectl. Only download and sha errors surface loudly, matching the "user wants to know" cut explicitly drawn in the resolver's comments.
- **Cross-process recursion guard** [Claude]. The `RDD_KUBECTL_RESOLVED` env var plus the same-process `skipResolver` flag together cover both axes of recursion: kubectl reentry through a shim, and `rdd ctl` short-circuiting before it dispatches to `kubectlAction`.

## Testing Assessment

The Pure-Go coverage in `pkg/kuberlr/resolver_test.go` and the five-test BATS lifecycle in `7-kubectl-cache.bats` cover the user-reachable paths well. Round 8 surfaces three coverage gaps, all on the testing side rather than the production code:

1. **`run -0 env … rdd …` form does not exercise the WSL wrapper path** (I3). Five sites in `7-kubectl-cache.bats` plus two in `5-paths.bats` use the bypass form. Re-routing them through the `rdd()` function (see I3's diff) restores Windows/WSL coverage without changing macOS/Linux behavior.
2. **No BATS test exercises the within-skew "no download" path** (S6). `TestCompatible` covers the rule at the unit level; an end-to-end assertion that the resolver did not GET `/release/...` when the server version is within ±1 minor would close the loop.
3. **No unit test exercises the cross-process recursion guard.** A test that sets `RDD_KUBECTL_RESOLVED=1` and asserts `Resolve(ctx, []string{"get", "pods"})` returns `("", nil)` would lock the `envSkipResolver` short-circuit in.

## Documentation Assessment

`docs/design/cmd_service.md` and `docs/design/environment.md` match the resolver behavior — the Round 7 fixes landed (key table now lists `docker_socket`, `cache_dir`, `kubectl_cache`; example output shows `args.json` and `config.yaml`). No new documentation gaps surfaced this round.

## Commit Structure

Single commit `ce035d5` "rdd kubectl: embed kuberlr-style version resolver". The `config.json` → `config.yaml`, `rdd.args` → `args.json`, and `docker_socket` key-table additions in `cmd_service.md` are doc-sync corrections that predate this PR; Round 7 explicitly green-lit bundling them as drive-bys while editing the same paragraph. Worth noting in the commit body that the doc fixes are bundled with the feature, but the bundling decision itself is settled.

## Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/cache.go:33-39` documents unbounded cache growth and leaked `.kubectl-*` temp files after SIGKILL or power loss; eviction work is deferred.
- **Code comment**: `pkg/kuberlr/downloader.go:55-57` and `pkg/kuberlr/resolver.go:64-67` document the planned `RDD_KUBECTL_OFFLINE` opt-out; intentionally deferred.
- **Code comment**: `pkg/kuberlr/downloader.go:138-141` documents `maxKubectlBytes` truncation surfacing as "checksum mismatch"; left as a comment per Round 6.
- **Code comment**: `pkg/kuberlr/exec_windows.go:27-30` documents the dormant `CommandContext` hard-kill race against console-delivered Ctrl+C; a future refactor that wires signal-driven cancellation must revisit this site (Round 7 S3 → Commented).
- **Code comment**: `pkg/kuberlr/resolver.go:217-228` documents the `--client=garbage` pflag short-circuit and the explicit deviation from the conservative bias.
- **Code comment**: `pkg/kuberlr/resolver.go:168-175` documents leaving `--username`/`--password` unbound on the resolver side to dodge `ToRawKubeConfigLoader`'s interactive prompt.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:91-99` documents the deliberate deviation from the "no `teardown_file`" rule for the ephemeral HTTP-server fixture. See I2 for the implementation defect this deviation now needs to actually deliver on its intent.

### Declined in prior rounds

- **Concurrent first-use Windows `os.Rename` race** — declined Rounds 1, 2, 3, 4, 5, 6, 7. Re-raised this round by Codex (I1), Gemini (I1), and Claude (S2). Dropped per the resolutions gate. Eight rounds running.
- **`CacheRoot` panic on `os.UserCacheDir` failure** — declined Rounds 1, 2, 3, 5, 7. Re-raised this round by Gemini (I2) and Claude (S8). Dropped per the resolutions gate.

## Unresolved Feedback

None. The only PR review comment was @mook-as on `cmd/rdd/service_paths.go:33` asking to expose `RDD_CACHE_DIR`; the current diff adds `cache_dir` and `kubectl_cache` to `instancePaths()`, and `docs/design/environment.md:39` documents the dual role. Round 7 marked this resolved.

## Agent Performance Retro

### [Claude]

Claude returned cleanly after Round 7's stream-idle timeout. The shape was thorough: three Important findings (I1 the `Warnf` UX call, I2 the `--match-server-version` flag binding, I3 the `skipResolver` test hygiene) plus eight Suggestions ranging from doc bundling to per-patch cache accumulation. Consolidation downgraded Claude I2 and I3 to Suggestions (S2 and S3) because the behavior Claude flagged is documented in code and covered by an existing test case (the `--unknown config view` test at `resolver_test.go:200`), and the test fragility hits only hypothetical future tests. The unique value: I1's read of the `RDD_LOG_LEVEL=warn` default against this PR's new `Warnf` (matched by Gemini), plus the `errors.As` redundancy in `main.go` (S4) that neither other agent surfaced. Coverage was complete on the 20 changed files.

### [Codex]

Codex anchored the Round 8 keepers. Three new Important findings landed on the BATS plumbing introduced by the Round 7 fixes — and Codex was the only agent to read `bats/helpers/commands.bash` and `bats/helpers/os.bash` deeply enough to surface them (I2 `kill` vs `process_kill` for Win32, I3 `env … rdd …` bypass of the wrapper, S1 `winpath` WSL gap). Codex also re-raised the long-declined Windows `os.Rename` race (I1 → dropped per gate), but with no severity escalation and verified `go test ./pkg/kuberlr` passed before writing.

### [Gemini]

Gemini brought back two declined-in-prior-rounds findings (I1 Windows `os.Rename`, I2 `CacheRoot` panic) for the eighth and sixth time respectively. The single new contribution was I3 (the `Warnf` level critique), which Claude raised independently and consolidates with Claude's I1. No suggestions, no design observations. Gemini reviewed 20 files with no coverage misses but produced the thinnest net new signal of the three agents this round.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 9m 40s | 7m 03s | — |
| Findings | 1I 6S | 2I 1S | 1I |
| Tool calls | 24 (Read 16, Bash 4, Grep 4) | 48 (shell 48) | — |
| Design observations | 0 | 1 | 0 |
| False positives | 0 | 0 | 0 |
| Unique insights | 3 | 3 | 0 |
| Files reviewed | 20 | 20 | 20 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **1I 6S** | **2I 1S** | **1I** |
| Downgraded | 2 (I→S) | 0 | 0 |
| Dropped | 1 | 1 | 2 |


Overall: Codex provided the highest signal this round — three new Important findings on the Round 7 plumbing surface that the other two agents missed entirely. Claude rounded out the round with the unique main.go redundancy and the `--match-server-version` defensive binding. Gemini's contribution was confined to seconding Claude's `Warnf` call; its other findings were re-raises of long-declined items.

## Review Process Notes

**Repo context updates**

- **Add a "long-declined findings on PR #348" section to `deep-review-context.md`** for the duration of this PR's review window. Round 7 made this recommendation; Round 8's repetition confirms the cost. List the two declined-three-or-more-rounds findings — Windows `os.Rename` cache-publish race (rounds 1-8) and `CacheRoot` panic on `os.UserCacheDir` failure (rounds 1, 2, 3, 5, 7, 8) — with one-line decline rationales. Agents reviewing future amendments will see them in the inlined repo context and can either skip or argue with new evidence rather than re-raise the same shape. Keep the list short and PR-scoped.

## Appendix: Original Reviews

### [Claude] — Pass 1

I'll skip tasks for a single-shot review. Let me read the BATS tests and the fake-kube server.

Let me check how klog/discovery-client side effects might surface, and explore a few more files.

Now I have enough context. Let me write the review.

# Code Review — PR #348: rdd kubectl version resolver

## Executive Summary

This PR adds a kuberlr-style version resolver to `rdd kubectl`: when the user's cluster is more than ±1 minor out of skew with the embedded kubectl, rdd probes `/version`, downloads a matching kubectl from `dl.k8s.io` (sha512-verified) into a shared `RDD_CACHE_DIR` cache, and exec's it. The design is well documented and the failure paths are deliberately chosen. Findings cluster around a UX-regression Warn at the default log level, fragility in the package-global `skipResolver` flag, and a small set of kubectl global flags the resolver's flag binding does not cover.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `logrus.Warnf` in the routine "download kubectl" path surfaces at default log level** — `pkg/kuberlr/downloader.go:78`

```go
logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
```

Per the repo's deep-review context, `RDD_LOG_LEVEL` defaults to `warn` outside developer mode, so every `logrus.Warn` reaches the user's terminal on every invocation that hits this branch. Downloading a missing kubectl is the expected steady-state behavior for any user with an out-of-skew remote cluster — once per minor-version delta. The user will see a yellow `WARN[…]` banner above their normal `kubectl get pods` output. Compare to `logrus.Debug` lines used elsewhere in the resolver for the routine "fall-through to embedded" paths.

This is a deliberate user-visible message ("downloading X from Y" is operator-relevant), so the level is a judgment call. The guidance says to surface the chosen level for review rather than auto-flag — calling it out here.

Fix: demote to `Info` (the standard level for one-time operator-visible events) and document the rationale, or keep `Warn` and add a code comment justifying it against the project's logging convention.

```diff
- logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
+ // Info: a one-time event per (server-version, cache state). The user
+ // is about to see kubectl output after this line; surface what we did
+ // without raising it as a warning.
+ logrus.Infof("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
```

(important, regression)

---

**I2. Unbound kubectl-global bool flags can hide the real subcommand from `isClientOnly`** — `pkg/kuberlr/resolver.go:230-264`

```go
var help, client, warningsAsErrors bool
fs.BoolVarP(&help, "help", "h", false, "")
fs.BoolVar(&client, "client", false, "")
fs.BoolVar(&warningsAsErrors, "warnings-as-errors", false, "")
```

The comment at line 211-216 documents why klog flags stay unbound but explains pflag's UnknownFlags behavior: pflag consumes the *next* non-flag token as the "assumed value" of an unknown flag. That behavior is correct for value-taking flags but wrong for unbound bool flags. The walk binds `--help`, `--client`, and `--warnings-as-errors`, but `kubectl` exposes other globals as bools that are not bound here, most notably `--match-server-version` (declared in `kubectl/pkg/cmd/cmd.go` as a `BoolVar` global). Running `rdd kubectl --match-server-version config view` causes pflag to swallow `config` as `--match-server-version`'s "assumed value", leaving positionals as `[view]`. `view` is not in `clientOnlySubcommands` → `isClientOnly` returns false → the resolver probes the cluster unnecessarily for a command kubectl can answer locally.

Severity is IMPORTANT, not CRITICAL: misclassification produces an unnecessary 2-second probe and falls through to the embedded kubectl correctly, no silent mismatched execution. But the comment's conservative-bias claim (probe rather than skip) does not hold the other direction either — a `config` subcommand wrongly probes against an unreachable cluster on a per-invocation 2-second penalty.

Fix: bind the kubectl globals that are bool-typed so pflag does not swallow the next positional. Audit upstream `staging/src/k8s.io/kubectl/pkg/cmd/cmd.go` and `pkg/component-base/version/verflag` for the full set; at minimum add `--match-server-version`.

```diff
  var help, client, warningsAsErrors bool
  fs.BoolVarP(&help, "help", "h", false, "")
  fs.BoolVar(&client, "client", false, "")
  fs.BoolVar(&warningsAsErrors, "warnings-as-errors", false, "")
+ var matchServerVersion bool
+ fs.BoolVar(&matchServerVersion, "match-server-version", false, "")
```

A test case mirroring the existing `--warnings-as-errors` cases would lock the behavior in: `{"--match-server-version before config", []string{"--match-server-version", "config", "view"}, true}`. (important, gap)

---

**I3. `skipResolver` global is set-only and leaks across tests** — `pkg/kuberlr/resolver.go:31-39`, `pkg/kuberlr/resolver_test.go:222-233`

```go
var skipResolver bool

func SkipResolver() {
    skipResolver = true
}
```

`SkipResolver` is a one-way toggle on a package-global. `TestResolveEmbeddedVersionUnparseable` swaps `embeddedVersion` (via `t.Cleanup`) but never touches `skipResolver`. If a future test in this package — or any other test that imports the package transitively — calls `SkipResolver()`, this test still passes because `Resolve` short-circuits on line 69 *before* the embedded-version branch. The test would silently exercise the wrong code path. Same concern for any future `Resolve` test added to the file.

In production this is fine — the CLI process exits after handling one command. But the test fragility traps real regressions.

Fix: make the test set the global explicitly (and restore it via `t.Cleanup`), or introduce a private helper that takes the skip flag as a parameter and a thin `SkipResolver()` wrapper that sets the global, so tests bypass the global.

```diff
 func TestResolveEmbeddedVersionUnparseable(t *testing.T) {
+    origSkip := skipResolver
+    t.Cleanup(func() { skipResolver = origSkip })
+    skipResolver = false
     orig := embeddedVersion
     t.Cleanup(func() { embeddedVersion = orig })
     embeddedVersion = func() (semver.Version, error) {
         return semver.Version{}, errors.New("test: unparseable embedded version")
     }
```

(important, gap)

### Suggestions

**S1. `main.go` `errors.As → err.Error()` redundancy** — `cmd/rdd/main.go:172-178`

```go
if errors.As(err, &exitErr) {
    if msg := err.Error(); msg != "" {
        logrus.Error(err)
    }
    os.Exit(exitErr.Code)
}
```

The captured `msg` is unused; the next line calls `err.Error()` again indirectly via `logrus.Error(err)`. Also, the empty-string check is a proxy for "no wrapped error" — works correctly today because `kuberlr.Exec` on Windows returns the raw `&cliexit.Error{Code: ...}`, but if a future caller wraps the cliexit.Error (`fmt.Errorf("foo: %w", &cliexit.Error{Code: c})`), `err.Error()` is non-empty even though the inner Err is nil, and the log line reappears.

Fix: collapse the redundancy and check the inner error directly, which states intent precisely.

```diff
- if errors.As(err, &exitErr) {
-     if msg := err.Error(); msg != "" {
-         logrus.Error(err)
-     }
-     os.Exit(exitErr.Code)
- }
+ if errors.As(err, &exitErr) {
+     if exitErr.Err != nil {
+         logrus.Error(err)
+     }
+     os.Exit(exitErr.Code)
+ }
```

(suggestion, regression)

---

**S2. Concurrent `rdd kubectl` cache-fill race on Windows** — `pkg/kuberlr/downloader.go:68-129`

```go
func ensureCached(ctx context.Context, want semver.Version) (string, error) {
    path := cachedPath(want)
    if _, err := os.Stat(path); err == nil {
        return path, nil
    }
    ...
    return os.Rename(tmpName, dst)
}
```

Two concurrent invocations (different terminals, same user) with an empty cache both stat → ENOENT, both download to distinct temp names, both verify, both `os.Rename` to the same `dst`. On Unix the last writer wins and both processes hold the inode they exec'd, so both work. On Windows, if process A has already exec'd `dst` via `CreateProcess`, the OS holds an exclusive lock on the file and B's `os.Rename` fails with `ERROR_SHARING_VIOLATION`, surfacing as `resolving kubectl version: rename ...`. A user retry succeeds because the cache now contains the binary.

This is rare and self-heals on retry, so it is SUGGESTION severity. But the failure mode is operator-confusing ("rename" error on a download path is hard to map back to the actual issue).

Fix: on rename failure, re-stat the destination and accept it if it now exists with a matching sha512.

```diff
- if err := os.Chmod(tmpName, 0o755); err != nil {
-     return err
- }
- return os.Rename(tmpName, dst)
+ if err := os.Chmod(tmpName, 0o755); err != nil {
+     return err
+ }
+ if err := os.Rename(tmpName, dst); err != nil {
+     // Concurrent install? Accept the racer's binary if it landed.
+     if _, statErr := os.Stat(dst); statErr == nil {
+         return nil
+     }
+     return err
+ }
+ return nil
```

(suggestion, gap)

---

**S3. `winpath` via command substitution swallows errors** — `bats/tests/10-cli/5-paths.bats:43`, `bats/tests/10-cli/7-kubectl-cache.bats` (multiple sites)

```bash
cache_override=$(winpath "${BATS_TEST_TMPDIR}/cache-override")
```

The repo's BATS style requires `run -0 cmd ...; value=${output}` over `value=$(cmd ...)` because substitution swallows non-zero exits and leaves the variable empty. `winpath` is a bash function that delegates to `cygpath -w` on MSYS; cygpath can fail in pathological cases. In practice unlikely, but the style rule is general.

Fix: where the conversion is part of a `run` command line (most call sites in `7-kubectl-cache.bats`), this is unavoidable shell composition. Where it's a standalone assignment (5-paths.bats:43, and the `SERVER_BIN=...` shape in 7-kubectl-cache.bats local_setup_file), convert to the `run -0`/`${output}` idiom or document why this site is exempt.

(suggestion, gap)

---

**S4. Stale-config doc fixes bundled with the resolver change** — `docs/design/cmd_service.md`

The PR folds in unrelated documentation corrections that pre-date this work: `config.json` → `config.yaml` (multiple), `rdd.args` → `args.json`, and the example output now includes `docker_socket` (which was added in a prior commit `1d71d4a8`, never reflected in docs). Adding `cache_dir` and `kubectl_cache` is in scope; the rest is drive-by cleanup.

Per the repo convention (smallest reasonable change) these belong in a separate "docs: sync svc-paths examples to current code" commit. Doing it inline is mild scope creep but makes the diff harder to bisect against the resolver feature later.

Fix: split into a precursor commit titled along the lines of "docs(cmd_service): sync paths table to current code" or simply note the bundling in the commit message.

(suggestion, regression)

---

**S5. `5-paths.bats` ALL_KEYS now tests `docker_socket` along with `cache_dir`/`kubectl_cache`** — `bats/tests/10-cli/5-paths.bats:3`

```diff
-ALL_KEYS="args_file config dir lima_home log_dir pid_file short_dir tls_dir"
+ALL_KEYS="args_file cache_dir config dir docker_socket kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
```

`docker_socket` was added to `instancePaths()` in commit `1d71d4a8` (Apr 14) but never added to the BATS test fixture. This PR fixes the gap as a side-effect. Surfacing this so the author is aware (the commit message and PR description don't mention it). Nothing to change.

(suggestion, regression — but a net-positive one)

---

**S6. Cache key includes patch version, accumulates per-patch binaries** — `pkg/kuberlr/downloader.go:86-92`

```go
name := fmt.Sprintf("kubectl-v%d.%d.%d", v.Major, v.Minor, v.Patch)
```

Cache lookup is per-patch, but K8s skew policy permits any patch within a minor band. A user whose remote cluster steps through `v1.30.5 → v1.30.6 → v1.30.7` accumulates three ~50 MB binaries even though any one would work. The eviction TODO in `cache.go:33` mentions a TTL/LRU but doesn't address the consolidation lever.

Alternative: when the embedded kubectl is at the same minor but a different patch from the server, skip the download (already handled — compatible() returns true). When out-of-skew, you could download "latest patch of server's minor" via a small mirror query, then reuse for any other server in the same minor band. This is a future enhancement and the current behavior is correct.

(suggestion, future)

---

**S7. BATS test does not exercise the actual `compatible()` skew rule** — `bats/tests/10-cli/7-kubectl-cache.bats`

All five tests force the fake apiserver to `v1.99.0`, which is always out of skew. They cover the download lifecycle (miss, hit, 404, sha-mismatch, 500-fall-through) but never exercise the within-skew path that returns "use the embedded kubectl, no download". Coverage for the within-skew path lives only in the unit `TestCompatible`. A BATS test with `--major 1 --minor X` where X is +/-1 of the embedded version would prove the resolver does not download when it shouldn't — currently this is asserted only by the unit test on the function `compatible`, not on the end-to-end behavior through `Resolve`.

Severity is SUGGESTION because the unit test is sufficient for the function under test; this is a thoroughness enhancement.

(suggestion, gap)

---

**S8. `CacheRoot` panics on `os.UserCacheDir` failure** — `pkg/kuberlr/cache.go:47-56`

```go
func CacheRoot() string {
    if root := os.Getenv(envCacheDir); root != "" {
        return root
    }
    cache, err := os.UserCacheDir()
    if err != nil {
        panic(fmt.Errorf("could not determine user cache directory: %w", err))
    }
    return filepath.Join(cache, "rancher-desktop")
}
```

`CacheRoot()` is called from `rdd svc paths` (per `service_paths.go:32-33`). On a misconfigured system where `os.UserCacheDir` returns an error (no `HOME`, no `LocalAppData`), `rdd svc paths` panics — including for keys that do not need the cache (e.g., `rdd svc paths log_dir`). The map is built eagerly in `instancePaths()`.

`os.UserCacheDir`'s only documented failure is "if the directory cannot be determined" — extremely rare in practice. SUGGESTION-only.

Fix options: either return an error from CacheRoot/CacheDir and bubble it up, or treat the failure as "fall back to a relative path" (less ideal). Alternatively, recover the panic in `instancePaths()` so unrelated keys still work.

(suggestion, gap)

## Design Observations

### Concerns

None — the design is well-scoped, the failure-path policy (loud on download error, silent on probe failure) matches the upstream kuberlr philosophy, and the cross-process recursion guard (env var) plus same-process flag (`skipResolver`) cover both axes of recursion appropriately.

### Strengths

- The comments at `resolver.go:115-122` and `kubectl.go:64-79` explicitly partition errors into "user wants to know" (download/sha) vs. "fall through silently" (probe-failure) and document why each falls where it does. This is the right policy decomposition.
- `parseKubeConfigFlags` reuses kubectl's own `genericclioptions.ConfigFlags` rather than hand-rolling a flag parser — the resolver probe and the kubectl child necessarily agree on what `--server`, `--context`, etc. mean.
- Sibling Go module for fake-kube fixtures (`bats/tests/10-cli/fake-kube/go.mod`) keeps test-only dependencies out of the parent module — clean approach.
- Body-size caps on `streamGet` (`maxSha512Bytes`, `maxKubectlBytes`) plus per-request HTTP timeout (`downloadTimeout`) match the repo guidance for fetched-executable code paths.

## Testing Assessment

Ranked by risk:

1. End-to-end coverage for the "within-skew, no download" path (S7).
2. Negative coverage for unbound kubectl-global bool flags (`--match-server-version`, etc.) — see I2.
3. No test for the cross-process recursion guard (`envSkipResolver`): a unit test that sets the env var and confirms `Resolve` returns `("", nil)` would lock the behavior in.
4. No test for `serverVersion`'s malformed-version branch (logs at warn, returns `(_, false)`). Unit-testable by injecting a discovery client that returns a malformed `gitVersion`.
5. No test for concurrent cache fill (S2). Hard to make deterministic; an acceptable answer is to document the behavior in `ensureCached` and skip the test.

## Documentation Assessment

- `docs/design/environment.md` covers all four new env vars (`RDD_CACHE_DIR`, `RDD_KUBECTL_MIRROR`, `RDD_KUBECTL_CACHE` export, `RDD_KUBECTL_RESOLVED` internal) and explains the relationship between path-exported and config-honored variables.
- `docs/design/cmd_service.md` adds the new path keys to the table and example.
- `cmd/rdd/main.go:170-176` and `pkg/kuberlr/resolver.go:46-67`, `64-67`, `204-229` carry detailed rationale comments — at the high end of what should be commented in this codebase.
- No documentation for the user-visible behavior change: the `rdd kubectl` command can now download binaries. This belongs in the user-facing changelog/README, not just the design docs. If a top-level README mentions `rdd kubectl`, update it. (Did not locate a top-level user-facing README in the diff; the design docs are not user-facing.)

## Commit Structure

The PR is a single commit `ce035d5`. Subject ("rdd kubectl: embed kuberlr-style version resolver") accurately describes the change. The doc-sync cleanup in `cmd_service.md` (`config.json` → `config.yaml`, etc.) is bundled — see S4. Otherwise clean.

## Acknowledged Limitations

- **`pkg/kuberlr/cache.go:33-39`**: TODO(eviction) — the cache only grows; SIGKILL or power loss during download leaks `.kubectl-*` temp files. The author notes a per-version TTL/LRU sweep as the cleanup mechanism. This change does not make the situation worse — fresh code, fresh tradeoff.
- **`pkg/kuberlr/resolver.go:64-67`** and **`pkg/kuberlr/downloader.go:55-57`**: TODO(offline) — no offline mode; missing-mirror + missing-cache produces a download error. The author plans to add an `RDD_KUBECTL_OFFLINE` toggle.
- **`pkg/kuberlr/downloader.go:138-141`**: if `kubectl` ever exceeds 256 MiB, `io.LimitReader` truncates silently and the sha512 mismatches. Comment instructs maintainers to bump the cap.
- **`pkg/kuberlr/exec_unix.go:19-20`**: `ctx` accepted for parity with Windows; `syscall.Exec` does not honor cancellation. Documented.
- **`pkg/kuberlr/exec_windows.go:27-30`**: `CommandContext` hard-kills on cancel; the cobra context is never canceled today, so this race is dormant. Documented for future signal-handling work.
- **`pkg/kuberlr/resolver.go:217-224`**: `--client=garbage` halts pflag parse mid-args, leaving no positionals, so `isClientOnly` returns true; the embedded kubectl rejects the same malformed flag before contacting a cluster. Author documents the deviation from the conservative bias and the absence of silent mismatched execution.
- **`pkg/kuberlr/resolver.go:168-175`**: deprecated `--username`/`--password` left unbound on the resolver side to avoid `ToRawKubeConfigLoader` switching to the interactive variant that prompts on stdin during the background version probe. Documented; kubectl child still receives the flags via its own binding.
- **`bats/tests/10-cli/7-kubectl-cache.bats:91-99`**: deviation from the "no `teardown_file`" rule — author explains the deviation (HTTP server on ephemeral port has no preservation value, leaks ports between sessions). Reasonable exception, well-documented.
- **PR comment by @mook-as on `cmd/rdd/service_paths.go:33`**: "Would it be possible to expose `RDD_CACHE_DIR` here too?" — the current diff adds `cache_dir` (root override target) and `kubectl_cache` (subdir) to `instancePaths()`, so this is addressed. `docs/design/environment.md:39` calls out the dual role.

## Unresolved Feedback

None — the only PR review comment was @mook-as's request to expose `RDD_CACHE_DIR`, which the current diff addresses (see Acknowledged Limitations).

## Coverage Summary

| File | Coverage |
| --- | --- |
| `.github/dependabot.yml` | Trivial |
| `bats/helpers/paths.bash` | Reviewed, no issues (new `winpath` helper) |
| `bats/tests/10-cli/5-paths.bats` | S3 (winpath substitution), S5 (docker_socket gap closed) |
| `bats/tests/10-cli/7-kubectl-cache.bats` | S3, S7 |
| `bats/tests/10-cli/fake-kube/README.md` | Trivial |
| `bats/tests/10-cli/fake-kube/go.mod` | Trivial |
| `bats/tests/10-cli/fake-kube/kubectl/main.go` | Reviewed, no issues |
| `bats/tests/10-cli/fake-kube/server/main.go` | Reviewed, no issues |
| `cmd/rdd/kubectl.go` | Reviewed, no issues |
| `cmd/rdd/main.go` | S1 |
| `cmd/rdd/service_paths.go` | Reviewed, no issues (addresses mook-as comment) |
| `docs/design/cmd_service.md` | S4 (bundled scope) |
| `docs/design/environment.md` | Reviewed, no issues |
| `go.mod` | Trivial (semver promoted to direct, cli-runtime promoted to direct) |
| `pkg/kuberlr/cache.go` | S8 (panic on UserCacheDir) |
| `pkg/kuberlr/downloader.go` | I1 (Warn level), S2 (concurrent rename), S6 (per-patch cache key) |
| `pkg/kuberlr/exec_unix.go` | Reviewed, no issues |
| `pkg/kuberlr/exec_windows.go` | Reviewed, no issues |
| `pkg/kuberlr/resolver.go` | I2 (unbound bool flags), I3 (skipResolver fragility) |
| `pkg/kuberlr/resolver_test.go` | I3 |
### [Codex] — Pass 1

### Executive Summary

This PR adds a kuberlr-style resolver that probes the target cluster, downloads a version-matched kubectl into a shared cache when needed, and exposes the cache paths. The main product issue is a Windows cache publication race; the test suite also has Windows cleanup/path portability problems.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Concurrent cache misses can fail on Windows** — `pkg/kuberlr/downloader.go:68-80,128` (important, regression)

```go
if _, err := os.Stat(path); err == nil {
	return path, nil
}
...
if err := download(ctx, want, path); err != nil {
	return "", err
}
```

```go
return os.Rename(tmpName, dst)
```

The cache is shared across instances, but `ensureCached` checks for absence at line 70 and then downloads without any publication guard at line 79. If two `rdd kubectl` processes miss the same version concurrently, both can validate a temp file; the first `os.Rename` at line 128 publishes the cache entry, and the second fails on Windows because `os.Rename` does not replace an existing destination there. The second command then exits with `resolving kubectl version` even though a valid cache entry now exists.

Fix: either serialize per-version downloads with a lock file, or make publication tolerate a winner. If handling `os.IsExist`, re-hash the now-existing destination against `want` before returning success so a stale/corrupt file is not accepted blindly.

```diff
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
-	return os.Rename(tmpName, dst)
+	if err := os.Rename(tmpName, dst); err != nil {
+		if os.IsExist(err) {
+			return verifyCachedFile(dst, want)
+		}
+		return err
+	}
+	return nil
```

I2. **Windows test teardown leaks the fake server** — `bats/tests/10-cli/7-kubectl-cache.bats:91-98` (important, regression)

```bash
local_teardown_file() {
    if load_var SERVER_PID; then
        kill "${SERVER_PID}" 2>/dev/null || true
    fi
}
```

The fake server is explicitly a native Windows `.exe` on MSYS at lines 45-55, but the teardown uses plain `kill` at line 97. The existing helper documentation in `bats/helpers/os.bash:96-113` says MSYS2 `kill` cannot reach Win32 processes and provides `process_kill` for this case. As written, the cleanup path that was added to avoid leaking an ephemeral HTTP server still leaks it on Windows.

Fix:

```diff
 local_teardown_file() {
     if load_var SERVER_PID; then
-        kill "${SERVER_PID}" 2>/dev/null || true
+        process_kill "${SERVER_PID}" 2>/dev/null || true
     fi
 }
```

I3. **New BATS env invocations bypass the rdd wrapper on WSL** — `bats/tests/10-cli/7-kubectl-cache.bats:124-128` (important, regression)

```bash
run -0 env \
    RDD_CACHE_DIR="$(winpath "${CACHE_DIR}")" \
    RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
    KUBECONFIG="$(winpath "${KUBECONFIG_PATH}")" \
    rdd kubectl get pods
```

The `rdd()` helper in `bats/helpers/commands.bash:42-88` is where Windows argument conversion and WSL `WSLENV` handling live. Calling `env ... rdd` at line 124 executes a PATH lookup instead of the shell function, so WSL will not get the `rdd.exe` suffix/path handling and the Windows process will receive unconverted env path values. The same pattern appears in `bats/tests/10-cli/5-paths.bats:45` and `bats/tests/10-cli/5-paths.bats:48`.

Fix: keep the call going through the `rdd` function. For example, export per-command env vars around `run -0 rdd ...`, or add a small helper that sets env vars then invokes the `rdd` function.

#### Suggestions

S1. **`winpath` does not cover WSL native-Windows paths** — `bats/helpers/paths.bash:14-19` (suggestion, regression)

```bash
winpath() {
    if is_msys; then
        cygpath -w "$1"
    else
        printf '%s' "$1"
    fi
}
```

The helper is used for env vars and args that go straight to native Windows binaries, but line 15 only handles MSYS. In WSL, `is_windows` is also true, and the suite normally uses `wslpath` in the `rdd` wrapper for this boundary. Leaving WSL paths unchanged at line 18 makes the new cache/kubeconfig/server path plumbing fragile for WSL Windows-binary runs.

Fix:

```diff
 winpath() {
     if is_msys; then
         cygpath -w "$1"
+    elif is_windows && using_windows_exe; then
+        wslpath -w "$1"
     else
         printf '%s' "$1"
     fi
 }
```

### Design Observations

**Concerns**

- The resolver duplicates kubectl flag interpretation in `pkg/kuberlr/resolver.go`. Reusing `genericclioptions.ConfigFlags` is the right mitigation, but the client-only classifier still hand-rolls kubectl command knowledge. Keeping tests broad here is important because kubectl’s flag surface changes upstream.

**Strengths**

- The download path hashes as it streams into a temp file and only publishes after checksum verification, which keeps partial downloads out of the executable cache.
- `rdd ctl` explicitly skips the resolver before delegating to the shared kubectl action, preserving the embedded-control-plane path.

### Testing Assessment

1. No test covers concurrent cache misses, especially the Windows destination-exists behavior from I1.
2. The BATS coverage exercises cache hit/miss/error paths, but the new env invocation style misses the WSL/native-Windows wrapper path described in I3.
3. Unit tests cover skew and flag parsing well; they do not cover download publication errors.

Local verification run:
- `go test ./pkg/kuberlr` passed.
- `go test ./...` inside `bats/tests/10-cli/fake-kube` passed.
- `go test ./cmd/rdd` could not run in this checkout because `pkg/guestagent/guestagent.go` references missing generated asset `lima-guestagent.gz`.

### Documentation Assessment

The new environment and service-path docs are updated for `RDD_CACHE_DIR`, `RDD_KUBECTL_MIRROR`, and `kubectl_cache`. No additional documentation gap found beyond the behavior risks above.

### Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/cache.go:33-39` acknowledges unbounded cache growth and leaked temp files after hard crashes. This is consistent with the implementation and should not block the PR.
- **Code comment**: `pkg/kuberlr/resolver.go:64-67` acknowledges missing offline mode. This is an explicit future enhancement.
- **Code comment**: `pkg/kuberlr/downloader.go:56-57` acknowledges the missing “may download” toggle for air-gapped users.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:91-95` acknowledges the new `local_teardown_file()` deviation. The rationale is reasonable for a test-only HTTP server, but I2 shows the cleanup does not work on Windows as written.
- **PR comment**: @mook-as asked whether `RDD_CACHE_DIR` could be exposed via `rdd svc paths`; the PR now adds `cache_dir` in `cmd/rdd/service_paths.go:32`, updates docs, and tests the override.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/helpers/paths.bash` — S1
- `bats/tests/10-cli/5-paths.bats` — I3
- `bats/tests/10-cli/7-kubectl-cache.bats` — I2, I3
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — Reviewed, no issues
- `cmd/rdd/main.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `docs/design/environment.md` — Reviewed, no issues
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — Reviewed, no issues
- `pkg/kuberlr/downloader.go` — I1
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — Reviewed, no issues
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues

### [Gemini] — Pass 1

### Executive Summary

This PR integrates a `kuberlr`-style version resolver into the `rdd kubectl` command. It correctly proxies invocations, parses user connection overrides, probes the target cluster's version, and falls back to a transparent download/cache layer when the embedded binary is out of skew. The implementation is robust, with excellent test coverage via the `fake-kube` BATS fixtures. A few important issues were found regarding a Windows-specific file collision race condition, a strict panic on missing cache directories, and an inappropriate `Warnf` log level for routine downloads.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Windows `os.Rename` race condition in concurrent downloads** — `pkg/kuberlr/downloader.go:126-129` (important, gap)

```go
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
```

On Windows, `os.Rename` fails with `ERROR_ALREADY_EXISTS` (mapped to `os.ErrExist`) if the destination file already exists, unlike Unix which replaces it atomically. If two `rdd kubectl` invocations hit a cache miss for the same version concurrently, they will both download to unique `.kubectl-*` temp files and both attempt to rename them to `dst`. The first process succeeds, but the second fails and returns an error, aborting the user's `kubectl` invocation even though a valid binary is now successfully cached.

Fix: Catch the error on Windows and consider the operation successful if the destination file exists.

```diff
 	if err := os.Chmod(tmpName, 0o755); err != nil {
 		return err
 	}
-	return os.Rename(tmpName, dst)
+	err := os.Rename(tmpName, dst)
+	if err != nil && runtime.GOOS == "windows" {
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil
+		}
+	}
+	return err
 }
```

I2. **Panic on missing user cache directory** — `pkg/kuberlr/cache.go:48-51` (important, gap)

```go
	cache, err := os.UserCacheDir()
	if err != nil {
		panic(fmt.Errorf("could not determine user cache directory: %w", err))
	}
```

`CacheRoot()` evaluates `os.UserCacheDir()` and panics if it fails (e.g., if `$HOME` is unset in a constrained CI/Docker environment). Because `instancePaths()` in `cmd/rdd/service_paths.go` evaluates `kuberlr.CacheRoot()` unconditionally, running `rdd svc paths` in such an environment will crash the entire `rdd` CLI process. While a missing cache directory might impede binary downloads, it should not crash informational CLI commands.

Fix: Provide a graceful fallback to the OS temporary directory rather than panicking.

```diff
 	cache, err := os.UserCacheDir()
 	if err != nil {
-		panic(fmt.Errorf("could not determine user cache directory: %w", err))
+		cache = os.TempDir()
 	}
 	return filepath.Join(cache, "rancher-desktop")
```

I3. **Misuse of Warn level for routine informational output** — `pkg/kuberlr/downloader.go:78-80` (important, regression)

```go
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", err
	}
	logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
	if err := download(ctx, want, path); err != nil {
```

A cache miss on a new cluster version is a routine, expected path. Logging the download start at `Warn` level breaks standard logging semantics and surfaces a yellow/red warning to the user for a successful operation. While this bypasses the default `--log-level=warn` filter to keep the user informed, warning channels should be reserved for degraded or unexpected conditions.

Fix: Standard CLI tools print routine status updates directly to standard error, leaving structured logs for actual warnings.

```diff
 	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
 		return "", err
 	}
-	logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
+	fmt.Fprintf(os.Stderr, "downloading kubectl v%d.%d.%d from %s\n", want.Major, want.Minor, want.Patch, mirrorURL())
 	if err := download(ctx, want, path); err != nil {
```

#### Suggestions

None.

### Design Observations

**Concerns**

None.

**Strengths**

- **`fake-kube` BATS integration**: Implementing a small, parallel Go module for the test fixtures allows the test suite to emulate both the apiserver response and the mirror download natively, without injecting messy dependency-heavy mocks into the `rdd` binary.
- **Robust flag parsing**: Using `pflag.FlagSet` with `UnknownFlags = true` to accurately isolate connection-override globals without breaking positional arguments is an excellent and deeply defensive way to reuse `kubectl` flag semantics securely.
- **Omission of basic-auth overrides**: Intentionally bypassing `--username` and `--password` binding to dodge the interactive prompt inside `ToRawKubeConfigLoader` is extremely insightful.

### Testing Assessment

The integration coverage is extensive and well-calibrated. The edge cases in the version skew matching have full unit coverage, and the BATS test accurately triggers downloads and simulates apiserver errors using the standalone `fake-kube` server. No additional scenarios are immediately necessary.

### Documentation Assessment

The CLI service paths and environment variables added by the feature are nicely documented in the corresponding markdown files under `docs/design/`.

### Acknowledged Limitations

- **Temp file leak on SIGKILL**: `pkg/kuberlr/cache.go:34` — "SIGKILL or power loss during a download also leaks the .kubectl-* temp file..." The author explicitly recognizes the lack of a garbage collection loop.
- **Offline download fallback**: `pkg/kuberlr/resolver.go:64` — "TODO(offline): when no cached kubectl matches and the network is down... Add an opt-out for download attempts..." The feature currently fails instead of finding the next best cached matching version.
- **Windows Ctrl+C propagation**: `pkg/kuberlr/exec_windows.go:20` — "CommandContext hard-kills on ctx cancellation... rdd's hard-kill races kubectl's graceful shutdown of the console-delivered Ctrl+C." The author documented that graceful signal handling is deferred.
- **Max file size truncation limits**: `pkg/kuberlr/downloader.go:142` — "If kubectl ever exceeds maxKubectlBytes, io.LimitReader truncates silently and the digest covers a partial body..." The author notes it will safely fail checksum verification rather than logging a dedicated size-limit warning.
- **Malformed boolean flags in client-only check**: `pkg/kuberlr/resolver.go:197` — "Bool-parse errors short-circuit pflag's parser... [and] lands in the empty-positionals branch below (return true)." The codebase notes that `kubectl` itself will reject the bad flag later, avoiding any unverified network request.
- **BATS custom teardown usage**: `bats/tests/10-cli/7-kubectl-cache.bats:90` — "Deviates from the 'no teardown_file' rule. The rule's intent is to preserve rdd state for post-mortem inspection..." The local ephemeral server runs as a background process and must be cleaned up to prevent leaking HTTP ports across BATS sessions.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/helpers/paths.bash` — Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats` — Reviewed, no issues
- `bats/tests/10-cli/7-kubectl-cache.bats` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — Reviewed, no issues
- `cmd/rdd/main.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `docs/design/environment.md` — Reviewed, no issues
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — I2
- `pkg/kuberlr/downloader.go` — I1, I3
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — Reviewed, no issues
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-05-19 |
| **Commit** | `30289ab rdd kubectl: embed kuberlr-style version resolver` (amended) |

| # | Finding | Action | Reason |
|---|---------|--------|--------|
| 1 | Important #1: `logrus.Warnf` for routine cache-miss download | Fixed | |
| 2 | Important #2: `kill` cannot terminate Win32 fake-kube-server.exe | Fixed | |
| 3 | Important #3: `env … rdd …` bypasses `rdd()` shell wrapper on WSL | Fixed | Lifted shared env vars (`RDD_CACHE_DIR`, `RDD_KUBECTL_MIRROR`, `KUBECONFIG`) to `local_setup_file` as per-line `export VAR=value` (or split-assign for subshell forms, per SC2155); each `@test` now calls `run -0 rdd …` directly. |
| 4 | Suggestion #1: `winpath` does not handle WSL→Windows.exe paths | Fixed | |
| 5 | Suggestion #2: Unbound kubectl-global bool flags hide subcommand from `isClientOnly` | Fixed | Added `--match-server-version` binding plus two TestIsClientOnly cases mirroring the `--warnings-as-errors` pattern |
| 6 | Suggestion #3: `skipResolver` global is set-only test fragility | Skipped | Future-test hygiene concern with no current harm; the next test that adds Resolve coverage can handle the global at that time |
| 7 | Suggestion #4: `main.go` `errors.As` block captures unused `msg` | Fixed | |
| 8 | Suggestion #5: `winpath` via `$(...)` swallows non-zero exits | Skipped | `cygpath -w` failure is pathological; the `run -0` rewrite at standalone sites buys minimal practical safety while adding noise to the surrounding code |
| 9 | Suggestion #6: BATS coverage forces every test out of skew | Skipped | Unit-level `TestCompatible` already covers the skew rule; end-to-end coverage of the within-skew branch is thoroughness, not correctness |
| 10 | Suggestion #7: Per-patch cache key accumulates one binary per patch | Skipped | Future enhancement layered on top of the existing eviction TODO; defer until eviction is designed |
| 11 | Testing Gap #3: No unit test for cross-process recursion guard | Test written | Added `TestResolveSkipsWhenRecursionGuardSet` with mock `embeddedVersion` that sets a sentinel flag; verified the test fails when the env-var check is removed from `Resolve` |
