# Deep Review: 20260504-192042-pr-348

| | |
|---|---|
| **Date** | 2026-05-04 19:20 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 6 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#348](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348) — rdd kubectl: embed kuberlr-style version resolver |
| **Branch** | `embed-kuberlr` |
| **Commits** | `72bfad2` rdd kubectl: embed kuberlr-style version resolver |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — Round 5 fixes for S1 docs, S2 download UX, S4 BATS comment, S5 SkipResolver bool, and S6 dead error return all landed cleanly. Round 6 surfaces no Critical or Important findings; the round's most actionable item is S1 — the same Round 5 amendment that swapped `SkipResolver` from env-var to package bool left `docs/design/environment.md:35` describing `RDD_KUBECTL_RESOLVED` as written by `rdd ctl`, which is now untrue. Both Claude (I3) and Codex (S1) caught this regression. The remaining Suggestions are smaller defensive items (Gemini's `--` separator misclassification in `isClientOnly`, plus three Claude items: 256 MiB cap surfaces as checksum mismatch, missing User-Agent header, dead `os.Chmod` call on Windows). |
| **Wall-clock time** | `33 min 47 s` |

## Executive Summary

Round 5's six Suggestions all landed; Round 6 produces no Critical or Important findings. The five surviving Suggestions group into three buckets:

1. **Documentation drift introduced by Round 5's `SkipResolver` fix** (S1, two agents). Round 5's S5 changed `SkipResolver()` from `os.Setenv(envSkipResolver, "1")` to a package-level `skipResolver` bool. The internal-variables table in `docs/design/environment.md:35` still says "`rdd ctl` sets it before invoking the kubectl action," which is no longer true — `rdd ctl` calls `kuberlr.SkipResolver()` (`cmd/rdd/kubectl.go:48`), and the env var is set only by `kuberlr.Exec` on the kubectl child. A developer grepping a process tree for `RDD_KUBECTL_RESOLVED=1` will see it on a downloaded kubectl child but never on the `rdd ctl` parent the doc claims sets it.

2. **`--` separator boundary misclassified as client-only** (S2, Gemini). `isClientOnly` walks `args` and breaks on `--` (`pkg/kuberlr/resolver.go:278-280`). For `kubectl -- get pods`, the loop breaks before recording any subcommand, the empty-`subcommand` fallback at line 312-314 returns `true`, and the resolver skips the probe — exactly the misclassification the cautionary comment at lines 264-269 calls out. The trigger is rare (no `kubectl` recipe puts `--` before the subcommand) but the fallback should still resolve, not skip.

3. **Defensive cleanup in the new code** — S3 (`io.LimitReader`'s 256 MiB cap silently truncates and surfaces as "checksum mismatch" rather than naming the cap), S4 (`Go-http-client/1.1` UA may be rate-limited by corporate/air-gapped mirrors), S5 (`os.Chmod(tmpName, 0o755)` is a no-op on Windows; comment or wrap).

The Windows `os.Rename` race surfaced again from all three agents (Codex I1, Gemini I1, Claude I1) and was dropped per the prior-round-resolutions gate (declined Rounds 1–5). Gemini also brought back the `CacheRoot` panic as Critical with a different fix shape (silent `os.TempDir()` fallback); same finding declined in Round 5 as Suggestion S3, dropped per the gate.

Structure: 0 Critical · 0 Important · 5 Suggestions.

## Critical Issues

None.

## Important Issues

None.

## Suggestions

S1. **`docs/design/environment.md:35` says `rdd ctl` sets `RDD_KUBECTL_RESOLVED`, but Round 5's S5 fix removed that env-var write** — `docs/design/environment.md:35`, `pkg/kuberlr/resolver.go:30-41`, `cmd/rdd/kubectl.go:48` [Codex GPT 5.5][Claude Opus 4.7]

```md
| `RDD_KUBECTL_RESOLVED` | Recursion guard for the kubectl version resolver. `kuberlr.Exec` sets it on the kubectl child process so a downloaded kubectl that re-execs `rdd` through a shim cannot recurse back into version resolution. `rdd ctl` sets it before invoking the kubectl action so the resolver skips the redundant probe against the embedded apiserver, whose version matches the embedded kubectl by construction. |
```

Round 5's S5 fix replaced the `os.Setenv(envSkipResolver, "1")` body of `SkipResolver` with a package-level `skipResolver` bool (`pkg/kuberlr/resolver.go:30-41`); `rdd ctl` then calls that helper at `cmd/rdd/kubectl.go:48`. The internal-variables table in `environment.md` still credits `rdd ctl` with writing the env var. The intro at line 31 — "they appear here so a developer reading a process tree or strace output can identify them" — sets the wrong expectation: only the downloaded kubectl child carries `RDD_KUBECTL_RESOLVED=1` now.

Fix: rewrite the second sentence to describe the in-process toggle.

```diff
-| `RDD_KUBECTL_RESOLVED` | Recursion guard for the kubectl version resolver. `kuberlr.Exec` sets it on the kubectl child process so a downloaded kubectl that re-execs `rdd` through a shim cannot recurse back into version resolution. `rdd ctl` sets it before invoking the kubectl action so the resolver skips the redundant probe against the embedded apiserver, whose version matches the embedded kubectl by construction. |
+| `RDD_KUBECTL_RESOLVED` | Recursion guard for the kubectl version resolver. `kuberlr.Exec` sets it on the kubectl child process so a downloaded kubectl that re-execs `rdd` through a shim cannot recurse back into version resolution. `rdd ctl` achieves the same skip via the in-process `kuberlr.SkipResolver()` helper, not this env var. |
```

S2. **`isClientOnly` misclassifies `--` before subcommand as client-only** — `pkg/kuberlr/resolver.go:270-322` [Gemini 3.1 Pro]

```go
for i := 0; i < len(args); i++ {
    arg := args[i]
    if arg == "--" {
        break
    }
    ...
}
if subcommand == "" {
    return true
}
```

`kubectl -- get pods` is syntactically valid: pflag treats `--` as the end-of-flags marker and Cobra reads the first positional as the subcommand. `isClientOnly` walks `args`, breaks on `--` at line 278-280, finds no subcommand, and returns `true` via the empty-`subcommand` fallback at line 312-314. The resolver then skips the probe and runs the embedded kubectl — exactly the misclassification the comment at lines 264-269 warns against. Probability is low (no documented `kubectl` recipe puts `--` before the subcommand) but the fallback's bias should still be "probe, don't skip."

Fix: take the next positional after `--` as the subcommand candidate before breaking.

```diff
 for i := 0; i < len(args); i++ {
     arg := args[i]
     if arg == "--" {
+        if subcommand == "" && i+1 < len(args) {
+            subcommand = args[i+1]
+        }
         break
     }
```

A `TestIsClientOnly` row covering `["--", "get", "pods"]` (expect `false`) would lock the fix in.

S3. **`streamGet`'s 256 MiB cap surfaces as "checksum mismatch", not as "binary too large"** — `pkg/kuberlr/downloader.go:105,128-130,169-184` [Claude Opus 4.7]

```go
if err := streamGet(ctx, base, io.MultiWriter(tmp, h), maxKubectlBytes); err != nil {
    tmp.Close()
    return fmt.Errorf("downloading %s: %w", base, err)
}
...
got := hex.EncodeToString(h.Sum(nil))
if got != want {
    return fmt.Errorf("checksum mismatch for %s: want %s, got %s", base, want, got)
}
```

If kubectl ever exceeds `maxKubectlBytes` (256 MiB), `io.LimitReader` returns silently after `maxBytes`, the digest covers a truncated body, and the caller sees a `checksum mismatch` error pointing at integrity instead of the size cap. `fetchSha512` already gets this right: a misconfigured mirror serving the binary at the `.sha512` URL surfaces as "malformed checksum from %s: %d chars" (`downloader.go:153-159`). Mirror the dedicated error path on the binary side so a future cap bump is signalled by a clear "exceeds N MiB" message rather than a misleading hash mismatch.

Fix: wrap the limit reader so reaching the cap returns a distinct sentinel and the error message names the cap.

S4. **HTTP requests omit a `User-Agent` header** — `pkg/kuberlr/downloader.go:169-184` [Claude Opus 4.7]

```go
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
```

`dl.k8s.io` is operated by SIG Release and accepts the default Go client UA today, but corporate proxies and air-gapped mirrors regularly use UA-based rate limiting or denylists for `Go-http-client/1.1`. A stable UA also helps the SIG Release team correlate kuberlr-style traffic. One header on each request closes the gap.

Fix:

```diff
 req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
 if err != nil {
     return err
 }
+req.Header.Set("User-Agent", "rdd-kuberlr/"+componentbasever.Get().GitVersion)
```

S5. **`os.Chmod(tmpName, 0o755)` is a no-op on Windows** — `pkg/kuberlr/downloader.go:116-118` [Claude Opus 4.7]

```go
if err := os.Chmod(tmpName, 0o755); err != nil {
    return err
}
```

Windows file mode is reduced to a single read-only bit; executability is determined by extension (`.exe`) and PE header. `0o755` toggles only the read-only bit (writable, since the write bit is set), so the chmod call has no functional effect. Mode 0755 doesn't break anything, but the next reader of `downloader.go` will read the line as "ensure +x on Windows," which it isn't. A one-line comment or a `runtime.GOOS != "windows"` guard removes the ambiguity.

Fix:

```diff
+// On Windows the executable bit doesn't exist (extension and PE
+// header decide); chmod only affects the read-only bit. Skip the
+// call rather than mislead the next reader.
+if runtime.GOOS != "windows" {
+    if err := os.Chmod(tmpName, 0o755); err != nil {
+        return err
+    }
+}
-if err := os.Chmod(tmpName, 0o755); err != nil {
-    return err
-}
```


## Design Observations

### Concerns

- **Resolver runs against `context.Background()` (future)** [Claude Opus 4.7]. `cmd.Context()` returns `context.Background()` because `cli.RunNoErrOutput` calls `cmd.Execute()`, not `cmd.ExecuteContext()`. The 5 min `httpClient.Timeout` and the 2 s discovery `cfg.Timeout` are the only cancellation backstops; SIGINT to `rdd` during a download kills the process without running the `defer os.Remove(tmpName)`, so a Ctrl-C mid-download leaks a `.kubectl-*` tempfile. Wiring `signal.NotifyContext(context.Background(), os.Interrupt)` in `main` and switching to `cmd.SetContext(ctx)` would close this. Pairs with the existing `TODO(eviction)` at `cache.go:33-39`, which already covers the cleanup side.

- **Cache hit trusts the on-disk binary without re-verification (future)** [Claude Opus 4.7]. `ensureCached` returns immediately when `cachedPath(want)` exists; nothing checks the cached file's sha512 against the upstream digest on hit. Filesystem corruption or a manually-tampered cache file surfaces as a confusing kubectl exec failure rather than a "cache integrity error" the resolver could re-download from. Re-hashing every cached binary on every kubectl invocation is too expensive; a future LRU/TTL sweep (the `TODO(eviction)` at `cache.go:33-39`) is the right place to verify before evicting or reusing.

- **Mirror checksum gives no security against a compromised mirror (future, persistent)** [carry-over from Round 5]. `download` fetches the kubectl binary and its `.sha512` from the same mirror; the hash protects against truncation and disk corruption but not against a mirror serving a malicious binary alongside a matching malicious checksum. SIG Release signs binaries with Sigstore/Cosign. Verifying the Cosign signatures (in addition to or in place of the mirror's own checksum) closes the supply-chain gap. Out of scope for this PR.

- **Hand-rolled subcommand walker duplicates the bound pflag set (future, persistent)** [carry-over from Rounds 3–5]. `parseKubeConfigFlags` already binds the kubectl global flags via `genericclioptions.ConfigFlags.AddFlags`; `isClientOnly` re-walks args manually with the hand-maintained `kubectlGlobalFlagsTakingValue` map. S2 this round is a fresh shape of the same drift class — `--` boundary handling slipping past the manual walker. Driving the walker from the same FlagSet (`fs.Args()` after parse returns the positional tokens) eliminates the map and most of the drift class.

### Strengths

- **Round 5 fixes landed without regression** [Claude Opus 4.7][Codex GPT 5.5]. `environment.md` Path Variables preamble now scopes the no-effect claim correctly; `cmd_service.md` key table includes `cache_dir`; the BATS comment names the right export site; `SkipResolver` uses a package-level bool; `serverVersion` returns `(semver.Version, bool)` with the dead error path gone; the warn-level "downloading kubectl" line surfaces at the default log level.

- **Process replacement on Unix vs. spawn on Windows is the right split** [Claude Opus 4.7]. `syscall.Exec` keeps the kubectl child at the expected place in the process tree, and the Windows path's `*cliexit.Error` envelope plus the empty-message branch in `cmd/rdd/main.go:163-165` propagates the kubectl exit code without a confusing `logrus.Error` line.

- **Same-process toggle plus cross-process env-var recursion guard** [Claude Opus 4.7]. `SkipResolver()` for `rdd ctl` (in-process bool) and `RDD_KUBECTL_RESOLVED` for the rare downloaded-shim re-exec (env-based) cover their respective shapes minimally; the env var no longer leaks into unrelated subprocesses.

- **Hash before mutation; body caps at the right place; reusing kubectl's flag binding** [Claude Opus 4.7][Codex GPT 5.5][Gemini 3.1 Pro]. Round 5 strengths persist: `download` verifies the digest before any chmod or rename; `streamGet` caps the digest body at 4 KiB and the binary at 256 MiB; `parseKubeConfigFlags` binds `genericclioptions.NewConfigFlags(true)` so the resolver and kubectl child target the same cluster by construction.

- **Graceful fall-through on probe failure** [Gemini 3.1 Pro]. `serverVersion` returns `ok=false` for missing kubeconfig, malformed config, unreachable cluster, and unparseable version. Each path falls through to the embedded kubectl rather than aborting the kubectl invocation. Client-only commands keep working when the cluster is unreachable.

- **Robust BATS server fixture** [Gemini 3.1 Pro]. `fake-kube-server` opens its log file with `O_APPEND` so the BATS-side `: >"${LOG_FILE}"` truncation between tests does not race with the server's writes.

## Testing Assessment

Coverage gaps surfaced this round, ordered by user-reachable risk:

1. **`isClientOnly` with `--` before subcommand** (S2). `TestIsClientOnly` covers `["exec", "pod", "--", "completion"]` (`--` after subcommand) but not `["--", "get", "pods"]` (`--` before subcommand). One row pinning the post-fix behavior would close the gap and lock the conservative bias in.

2. **Concurrent first-use cache miss** (declined Rounds 1–5; raised again this round by all three agents). Acknowledged limitation.

3. **Download larger than `maxKubectlBytes`** (S3). `streamGet`'s 256 MiB cap has no test asserting the post-cap behavior; a unit test that points `streamGet` at a body larger than `maxBytes` would document the truncation contract.

4. **End-to-end Windows execution path** (`pkg/kuberlr/exec_windows.go`). PR description gates the merge on Windows BATS support; the path remains compile-only.

5. **`embeddedVersion` parse failure** (the `go run` case the comment at `resolver.go:79-83` names). A unit test that injects an unparseable string and asserts `Resolve` returns `("", nil)` with no probe would lock in the documented dev-build fall-through.

6. **`RDD_KUBECTL_RESOLVED=1` on the parent `rdd kubectl`** [Claude Opus 4.7]. No test confirms the env-var recursion guard short-circuits Resolve; only the same-process `SkipResolver` boolean is exercised indirectly via `rdd ctl`.

Verification this round: Codex ran `go test ./pkg/kuberlr` (passed) and `go test ./...` inside `bats/tests/10-cli/fake-kube` (passed); `go test ./cmd/rdd` failed in the worktree on a missing embedded asset (`pkg/guestagent/lima-guestagent.gz`), unrelated to the PR.

## Documentation Assessment

- `docs/design/environment.md:35` (Internal Variables) carries the regression in S1: the `RDD_KUBECTL_RESOLVED` row still credits `rdd ctl` with writing the env var that Round 5's `SkipResolver` change removed.
- The Path Variables preamble at line 39 (post Round 5 fix) now correctly carves out `RDD_CACHE_DIR` from the "no effect on RDD's behavior" claim.
- `docs/design/cmd_service.md` key table at lines 127-138 includes `cache_dir` after Round 5; example output at lines 152-164 matches.
- `pkg/kuberlr/cache.go` package doc at lines 5-10 still accurately describes the resolve-or-fall-through model.
- `parseKubeConfigFlags` docstring at `pkg/kuberlr/resolver.go:157-178` enumerates the bound flag surface and explains the deliberate `--username`/`--password` carve-out.

## Commit Structure

Single commit `72bfad2` (amended from Round 5's `fb0811d`). Message accurately describes the change. Clean.

## Acknowledged Limitations

- **Cache eviction not implemented** — `pkg/kuberlr/cache.go:33-39`. `TODO(eviction)` covers both growth (~50 MB per minor version) and `.kubectl-*` temp leftovers from SIGKILL/power loss; SIGINT mid-download (resolver runs against `context.Background()`) is the most likely source of a leaked tempfile and pairs with the cache-hit re-verification design observation above.
- **Offline mode not implemented** — `pkg/kuberlr/resolver.go:66-69`, `pkg/kuberlr/downloader.go:46-48`. TODOs cover the planned env-var/config opt-out for download attempts.
- **Windows BATS host deferred** — `bats/tests/10-cli/7-kubectl-cache.bats:14-20`. PR description gates the merge on Windows BATS support; the resolver's Windows-specific code paths remain compile-only.
- **`local_teardown_file()` deviation from BATS convention** — `bats/tests/10-cli/7-kubectl-cache.bats:103-110`. Inline rationale stands.
- **Falling back to embedded on any cluster-probe failure** — `pkg/kuberlr/resolver.go:115-145`. Each fall-through path is documented; the design choice is sound.
- **Embedded version unparseable in dev builds** — `pkg/kuberlr/resolver.go:77-86`. `go run ./cmd/rdd kubectl ...` and IDE debug builds skip `-ldflags -X`; falling through to embedded kubectl rather than breaking every dev invocation is the right call.
- **`--username`/`--password` deliberately unbound on the resolver side** — `pkg/kuberlr/resolver.go:168-175`. Binding them via `WithDeprecatedPasswordFlag` would force the interactive kubeconfig loader; the resolver's silent fall-through on probe failure covers the asymmetry.
- **`RDD_KUBECTL_MIRROR` accepts non-HTTPS URLs** — Implicit. The BATS test relies on `http://127.0.0.1:port`; for users who set this to a non-localhost http URL, MITM is possible. Trusting the env-var owner is consistent with other `RDD_*` overrides.
- **Empty-message `cliexit.Error` contract** — `cmd/rdd/main.go:159-160`. The contract is clear in context but currently single-call-site (only `kuberlr.Exec` produces one).

### Declined in prior rounds

- **Concurrent first-use Windows `os.Rename` race** — declined Rounds 1, 2, 3, 4, 5. Raised this round by all three agents (Claude I1, Codex I1, Gemini I1); dropped per the resolutions gate. The recurring re-raise (now six rounds running) keeps strengthening the case for an inline `// design note: see review history` style TODO documenting the deliberate design choice so future agents recognize the decline pattern in the code itself.
- **`CacheRoot` panic on `os.UserCacheDir` failure** — declined Rounds 1, 2, 3 ("trigger requires neither $HOME nor %LocalAppData% to resolve"); declined Round 5 as Suggestion S3 (blast-radius re-raise: structural change touches multiple files for a low-probability defensive case). Raised this round by Gemini as C1 with a different fix shape (silent `os.TempDir()` fallback). Dropped per the resolutions gate; the underlying decline reasoning (rare trigger, structural change disproportionate) holds across the new fix proposal.
- **Drive-by `config.json` → `config.yaml` rename in `cmd_service.md`** — declined Round 5 ("Round 4's S3 — raised by Claude itself — explicitly asked for the rename, which then landed"). Raised again this round by Claude as S4 (drive-by); dropped per the gate.
- **`RDD_CACHE_DIR` scope rename** — declined Round 1.
- **Server `log.Fatal*` bypassing `defer logFD.Close()`** — declined Round 1: test fixture; OS reclaims fd.
- **Pure-Go unit coverage thin on download/cache machinery** — declined Round 2: BATS suite covers the integration concerns.
- **BATS coverage for HTTP timeout/hang, network cancel, Windows Exec, sha512sum-format mirror, trailing-slash mirror, default-log-level message capture, unreachable-server probe** — declined Rounds 1–3.
- **`info.GitVersion` empty/unknown unit test** — declined Round 1: requires mocked discovery client.
- **Resolver-policy summary and recursion-guard rationale duplicated across files** — declined Round 1: each duplicate serves a different audience.
- **`SkipResolver` swallows `os.Setenv` error** — declined Round 3; moot since Round 5 removed the `Setenv` call entirely.
- **Cache the server-version probe across invocations** — declined Round 3: performance optimisation; explicit follow-up.
- **Move package doc-comment from `cache.go` to `resolver.go`** — declined Round 3.
- **Drop `is_macos` branch in `write_kubectl_sha512`** — declined Round 3.
- **Round 4 S6: `serverVersion` ignores ctx** — skipped Round 4: UX gap bounded by 2 s probe timeout.
- **Round 4 S10: 5 min `downloadTimeout` covers headers + body** — skipped Round 4: speculative.
- **Round 5 testing gaps (S3 panic test, download UX assertion, concurrent cache miss, Windows execution path, unparseable gitVersion BATS)** — Skipped Round 5; rationale stands.

## Unresolved Feedback

- **@mook-as on `cmd/rdd/service_paths.go:33`** — `https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348#discussion_r3182959199`: "Would it be possible to expose `RDD_CACHE_DIR` here too?"
  Resolved across rounds. `cache_dir` (`kuberlr.CacheRoot()`) and `kubectl_cache` (`kuberlr.CacheDir()`) appear in `instancePaths()` (`service_paths.go:32-33`), the `environment.md` Path Variables table, and the `cmd_service.md` example output and key table.


## Agent Performance Retro

### [Claude]

Claude carried this round's coverage: four Suggestions kept (the shared S1 docs regression with Codex, plus three unique items — S3 the 256 MiB cap surfacing as checksum mismatch, S4 the missing User-Agent, S5 the dead `os.Chmod` on Windows) plus two Design Observations (resolver running against `context.Background()` and trusting cache hits without re-verification). The S1 catch was the round's most actionable item; Claude noticed that Round 5's `SkipResolver` change broke the env-var documentation in the same file Round 5 fixed earlier. Three findings dropped: I1 (Windows rename race) per the resolutions gate (six rounds running), I2 (a self-downgraded note about `--instance` interaction with the walker), and S4 (drive-by `config.json → config.yaml`, which Round 5 already declined as the rename Round 4 explicitly asked for). Claude S2 (cache-hit re-verification) moved to Design Observations as Claude himself flagged it as "no fix required for this PR."

### [Codex]

Codex ran a tight one-Important / one-Suggestion shape: I1 (Windows rename race, dropped per the gate) and S1 (RDD_KUBECTL_RESOLVED docs, kept and shared with Claude). Codex framed the docs finding as Suggestion-severity from the start, which matches the calibration the consolidation settled on; Claude's I3 was downgraded to align. Codex also ran `go test ./pkg/kuberlr` and `go test ./...` inside `bats/tests/10-cli/fake-kube` (both passed) — verification work hard to parallelize across agents and consistent across rounds.

### [Gemini]

Gemini brought back the `CacheRoot` panic as Critical (C1, dropped per the gate — declined Round 5 for the same blast-radius reasoning) and the Windows rename race as Important (I1, also dropped). The keeper was I2: the `--` boundary misclassification in `isClientOnly`. No other agent traced the empty-`subcommand` fallback path through to the cautionary comment at lines 264-269. Severity calibrated down from Important to Suggestion (S2) given how rare `kubectl -- subcommand` is in practice. One unique finding from Gemini balances out the two declined re-raises.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 15m 21s | 6m 02s | — |
| Findings | 4S | 1S | 1S |
| Tool calls | 45 (Bash 24, Read 21) | 35 (shell 35) | — |
| Design observations | 2 | 0 | 0 |
| False positives | 0 | 0 | 0 |
| Unique insights | 3 | 0 | 1 |
| Files reviewed | 19 | 19 | 19 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **4S** | **1S** | **1S** |
| Downgraded | 1 (I→S) | 0 | 1 (I→S) |
| Dropped | 4 | 1 | 2 |


Claude provided the most coverage (three unique Suggestions plus two Design Observations). Codex's S1 was the round's most actionable item (the regression Round 5 introduced). Gemini's S2 was the round's most surprising — a real walker-bias bug in code that has been reviewed five times.

#### Reconciliation

- Claude I1 (Windows rename race): important → dropped per resolutions gate (declined Rounds 1–5).
- Claude I2 (--instance walker interaction): important → dropped (Claude self-downgraded to a doc-comment note; marginal value).
- Claude I3 (RDD_KUBECTL_RESOLVED docs): important → suggestion S1 (docs-only regression; merged with Codex S1).
- Claude S2 (cache hit re-verification): suggestion → Design Observation (Claude flagged "no fix required for this PR").
- Claude S4 (drive-by config.yaml rename): suggestion → dropped per resolutions gate (declined Round 5).
- Codex I1 (Windows rename race): important → dropped per resolutions gate.
- Gemini C1 (CacheRoot panic): critical → dropped per resolutions gate (declined Round 5 with same blast-radius reasoning; Gemini's TempDir-fallback fix shape doesn't change the decline rationale).
- Gemini I1 (Windows rename race): important → dropped per resolutions gate.
- Gemini I2 (`--` separator misclassification): important → suggestion S2 (rare trigger; conservative-bias correction).

## Review Process Notes

### Skill improvements

- **Add a "watch for documentation regressions in the same file as the code fix" rule to the consolidation gates.** When a prior round amends a function's behavior (env-var write removed, return type changed, semantics inverted) and a docs file in the same PR describes that behavior, the docs are highly likely to drift in the round that lands the code fix. Today the prior-round-resolutions gate verifies that fixes landed; add a sibling check: for every Resolution table row marked "Fixed" that touched named functions or constants, grep the diff for documentation references to those names and confirm they describe the post-fix behavior. The check would have caught S1 mechanically — Round 5's S5 fix removed `os.Setenv(envSkipResolver, …)` from `SkipResolver`, but `environment.md:35` still says "rdd ctl sets it." Future amendments that rename, remove, or invert behavior carry the same shape.

### Repo context updates

None this round. The existing `deep-review-context.md` carries enough for agents to surface S1 (docs-vs-code drift) and S2 (walker bias against the conservative-bias comment) on a careful read; no new repo-specific patterns emerged.


## Appendix: Original Reviews

### [Claude] — Pass 1

Now I have enough context to write the review.

# Code Review: PR #348 — rdd kubectl: embed kuberlr-style version resolver

## Executive Summary

Adds a kuberlr-style resolver under `pkg/kuberlr/` that fetches a version-matched kubectl from `dl.k8s.io` when the embedded kubectl is more than ±1 minor out of skew with the cluster, sha512-verifies it, caches it under `os.UserCacheDir()`, and execs it. Comprehensive flag-binding/skew/client-only unit tests plus a dedicated BATS file with a Go-based fake apiserver/mirror. The design is cleanly separated and the test coverage is unusually thorough; the most material concern is a Windows-specific concurrent-download race that can produce a confusing failure on first cache miss.

## Findings

### Critical Issues

None.

### Important Issues

**I1. Concurrent download race produces a hard failure on Windows** — `pkg/kuberlr/downloader.go:59-74, 98-119`

```go
func ensureCached(ctx context.Context, want semver.Version) (string, error) {
	path := cachedPath(want)
	if _, err := os.Stat(path); err == nil {
		return path, nil
	} else if !errors.Is(err, os.ErrNotExist) {
		return "", err
	}
	...
	return path, nil
}
```

```go
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
```

`ensureCached` has a TOCTOU window between the `os.Stat` (line 61) and the final `os.Rename` (line 119). When two `rdd kubectl` invocations both miss the cache against the same out-of-skew cluster (a routine pattern: a new-to-the-user remote cluster, two terminals, two scripts), both processes download to their own `.kubectl-*` tempfile and both call `os.Rename(tmp, dst)`.

On Unix this is benign — both renames succeed, the second wins, and the running kubectl child of the loser still resolves through its inode. On Windows (per the repo overview's explicit reminder about `os.Rename`), if process A has already renamed and is mid-`CreateProcess` on the cached binary, process B's rename returns `ERROR_ACCESS_DENIED` / `ERROR_SHARING_VIOLATION` because the file is currently held as an executable image. B then surfaces "resolving kubectl version: …" and exits non-zero even though the file it just wrote is byte-identical to what A produced and what's already on disk.

The race window is wide (the entire download duration — seconds to minutes), so this is not a theoretical edge case for users on slow links.

Fix: when the rename fails, re-check the destination — if it now exists with a matching sha512, treat the rename as "lost the race" and use the existing file. (No new failure modes: if the sha mismatches, fall through to the original error.)

```diff
 	if err := os.Chmod(tmpName, 0o755); err != nil {
 		return err
 	}
-	return os.Rename(tmpName, dst)
+	if err := os.Rename(tmpName, dst); err != nil {
+		// Another rdd invocation may have won the race and placed dst
+		// while we were downloading; on Windows os.Rename fails when
+		// the winner's child process is currently holding dst as an
+		// executable image. Accept dst if it is already in place.
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil
+		}
+		return err
+	}
+	return nil
 }
```

(important, regression)

---

**I2. `--instance=…` and `--log-level=…` swallow the next positional as a value when they appear in resolver `args`** — `pkg/kuberlr/resolver.go:197-230, 270-322`

```go
var kubectlGlobalFlagsTakingValue = map[string]bool{
	"--kubeconfig":            true,
	...
	"--log-file":              true,
	...
}
```

The `kubectl` command in `cmd/rdd/kubectl.go` is registered with `DisableFlagParsing: true`, so any `rdd`-side flag that appears AFTER `kubectl` lands in `args` verbatim. Two specific flags interact badly with `isClientOnly`:

- `--log-level` (rdd's persistent flag, `cmd/rdd/main.go:136`) takes a value but is unknown to the resolver. With `--log-level info get pods`, the walk treats `--log-level` as a no-value flag (line 302–306), then sees `info` as the subcommand. `clientOnlySubcommands["info"]` is `false`, so the resolver probes — correct outcome. The bigger issue is `info` isn't a kubectl subcommand at all, but `kubectl` will reject it later, so the resolver's classification doesn't matter here. OK.
- `klog`-style `-v 4` lines: `-v` IS in `kubectlGlobalFlagsTakingValue`, so `-v 4 get pods` correctly skips `4` and finds `get`. ✓
- `--log-file` is in the map (line 223). If a user passes `kubectl --log-file something.log get pods` (unusual but legal as a klog flag forwarded to kubectl), the walk consumes `something.log` as the value and finds `get` as the subcommand. ✓

The actual regression risk is the inverse: an `rdd`-side flag that takes a value but is NOT in the map. `--instance` is the canonical example, but `parseInstanceFlag` strips it from `os.Args[1]` before cobra runs (`cmd/rdd/main.go:51-74`). For `rdd kubectl --instance foo get pods`, parseInstanceFlag does NOT strip (it only handles position 1), so `--instance` reaches the resolver. The walk treats it as no-value, then `foo` becomes the subcommand candidate, then `get` is ignored. `clientOnlySubcommands["foo"]` is false → resolver probes. The downstream `kubectl` will fail with "unknown flag --instance", but the resolver's classification (probe) is the safe default. Pre-existing limitation, not a regression.

Downgraded to a note: the conservative bias is correct, but the comment at line 264-269 should explicitly call out `--instance` as a known case so a future contributor doesn't add `--instance` to the map "to be helpful" and break the kubectl subcommand `instance` (none exists today, but the precedent matters). (suggestion, gap)

---

**I3. `RDD_KUBECTL_RESOLVED` documentation contradicts the code** — `docs/design/environment.md:35`

```
| `RDD_KUBECTL_RESOLVED` | … `rdd ctl` sets it before invoking the
kubectl action so the resolver skips the redundant probe against the
embedded apiserver, whose version matches the embedded kubectl by
construction. |
```

`rdd ctl` does NOT set the env var. `cmd/rdd/kubectl.go:48` calls `kuberlr.SkipResolver()` which flips an in-process `bool` (`pkg/kuberlr/resolver.go:33, 39-41`); the env var is set ONLY by `kuberlr.Exec` on the kubectl child (`exec_unix.go:22`, `exec_windows.go:28`). A developer reading the doc and grepping their process tree for `RDD_KUBECTL_RESOLVED=1` will see it on a downloaded `kubectl` child but never on the `rdd ctl` parent that the doc claims set it.

Fix: drop the second sentence or rewrite it to describe the in-process toggle.

```diff
-| `RDD_KUBECTL_RESOLVED` | Recursion guard for the kubectl version resolver. `kuberlr.Exec` sets it on the kubectl child process so a downloaded kubectl that re-execs `rdd` through a shim cannot recurse back into version resolution. `rdd ctl` sets it before invoking the kubectl action so the resolver skips the redundant probe against the embedded apiserver, whose version matches the embedded kubectl by construction. |
+| `RDD_KUBECTL_RESOLVED` | Recursion guard for the kubectl version resolver. `kuberlr.Exec` sets it on the kubectl child process so a downloaded kubectl that re-execs `rdd` through a shim cannot recurse back into version resolution. (`rdd ctl` achieves the same skip via the in-process `kuberlr.SkipResolver()` toggle, not this env var.) |
```

(important, regression)

### Suggestions

**S1. `streamGet`'s 256 MiB cap surfaces as "checksum mismatch", not as "binary too large"** — `pkg/kuberlr/downloader.go:105, 169-184`

```go
if err := streamGet(ctx, base, io.MultiWriter(tmp, h), maxKubectlBytes); err != nil {
    ...
}
...
got := hex.EncodeToString(h.Sum(nil))
if got != want {
    return fmt.Errorf("checksum mismatch for %s: want %s, got %s", base, want, got)
}
```

If kubectl ever exceeds `maxKubectlBytes` (256 MiB), `io.LimitReader` truncates silently and the comparison fails as a checksum mismatch. The user message — "checksum mismatch" — points at integrity, not at the size cap that actually fired. The same code already gets this right for `fetchSha512` (the dedicated "malformed checksum" path) per the comment at lines 122-127. Worth doing the same for the binary path: wrap the limit reader so reaching the cap returns a distinct sentinel and the error message names the cap. (suggestion, enhancement)

---

**S2. Cached file is trusted without re-verification on hit** — `pkg/kuberlr/downloader.go:60-65`

```go
path := cachedPath(want)
if _, err := os.Stat(path); err == nil {
    return path, nil
}
```

Filesystem corruption, a partial write from a SIGKILL/power-loss between `Rename` and the next caller (impossible on POSIX, but `Rename` on Windows is two ops under the hood), or a manually-tampered cache file all surface as a confusing exec-time failure rather than a "cache integrity error" the resolver could re-download from. The TODO at `cache.go:33-39` acknowledges leaked `.kubectl-*` tempfiles; a sha512 re-check on hit would also let a future LRU sweeper distinguish "stale" from "corrupt." No fix required for this PR — flag for the same future sweep that picks up the eviction TODO. (suggestion, enhancement)

---

**S3. HTTP requests omit a `User-Agent` header** — `pkg/kuberlr/downloader.go:169-184`

```go
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
```

dl.k8s.io is operated by SIG Release and currently accepts the default Go client UA, but corporate/air-gapped mirrors (and CDNs in general) often use UA-based rate limiting or denylist `Go-http-client/1.1`. A stable UA also helps the SIG Release team correlate kuberlr-style traffic. Set `req.Header.Set("User-Agent", "rdd-kuberlr/<version>")` on each request. (suggestion, enhancement)

---

**S4. Drive-by doc and BATS-key fixes are folded into a feature commit** — `docs/design/cmd_service.md:15, 41, 78, 80, 87, 153-164`; `bats/tests/10-cli/5-paths.bats:3`

The diff includes two unrelated cleanups: `config.json` → `config.yaml` throughout `cmd_service.md` (the actual filename per `pkg/instance/instance.go:91`) and the addition of `docker_socket` to `ALL_KEYS` (`docker_socket` was added to `instancePaths` in commit `1d71d4a` but the test wasn't updated). Both fixes are correct, but they belong in their own commit so a future bisect over the doc rename or the test gap doesn't land on a kubectl-resolver feature commit. Severity is suggestion-only because each fix is small and obviously correct in isolation. (suggestion, enhancement)

---

**S5. `os.Chmod(tmpName, 0o755)` is a no-op on Windows** — `pkg/kuberlr/downloader.go:116`

```go
if err := os.Chmod(tmpName, 0o755); err != nil {
    return err
}
```

On Windows, `os.Chmod` only toggles the read-only bit; the executable bit doesn't exist (executability is determined by the file extension and PE header). Mode `0o755` happens to mean "writable" so no harm done, but the chmod call is dead on Windows. Wrap with `runtime.GOOS != "windows"` for clarity, or leave as-is and add a one-line comment so a future Windows-bound reader doesn't try to "fix" the missing executable bit. (suggestion, enhancement)

## Design Observations

### Strengths

- **Process replacement on Unix vs. spawn on Windows is the right split.** `syscall.Exec` keeps the kubectl child at PID 1's expected place in the process tree (no extra `rdd` parent), and the Windows path's `*cliexit.Error` envelope plus the empty-message branch in `cmd/rdd/main.go:163-165` cleanly propagates the kubectl exit code without a confusing `logrus.Error` line.
- **Same-process toggle + cross-process env-var recursion guard is well-factored.** `SkipResolver()` for `rdd ctl` (cheap in-process boolean) and `RDD_KUBECTL_RESOLVED` for the rare downloaded-shim re-exec case (env-based) cover their respective shapes minimally.
- **Reusing `genericclioptions.NewConfigFlags(true).AddFlags` to bind the resolver's probe to the same kubeconfig surface kubectl will use** is the right call — keeping the resolver's flag set in sync with kubectl by construction is a real win over hand-rolling a parser. The deprecated `--username`/`--password` carve-out and its rationale are well-explained at lines 168-175.
- **The fake-kube test fixture is an excellent design.** A single Go server playing both apiserver `/version` and release-mirror `/release/*` lets tests steer behavior via per-request file reads without restarts. The `LOG_FILE`-grep assertions ("did the resolver probe? did it download? did it skip?") are the right granularity for an end-to-end test.

### Concerns

- **The resolver runs against `context.Background()` (Concern, future).** `cmd.Context()` returns `context.Background()` because `cli.RunNoErrOutput` calls `cmd.Execute()`, not `cmd.ExecuteContext()`. The 5-minute `httpClient.Timeout` and the 2-second discovery `cfg.Timeout` are the only cancellation backstops; SIGINT to `rdd` during a download kills the process without running the `defer os.Remove(tmpName)`. The cache.go TODO at lines 33-39 acknowledges the resulting tempfile leak. Wiring `signal.NotifyContext(context.Background(), os.Interrupt)` in `main` and switching to `cmd.SetContext(ctx)` would close this, and would benefit other long-running paths in the CLI too.

## Testing Assessment

The unit test surface is unusually thorough — `TestParseKubeConfigFlags` exercises every binding kubectl exposes, `TestIsClientOnly` covers bare/equals/garbage forms of `--client` and `--help` plus the `--` boundary, and the BATS file pins down all five resolver lifecycle outcomes (download-on-miss, skip-on-hit, mirror-404, sha-mismatch, apiserver-500 fall-through). Untested scenarios, in rough order of risk:

1. **Concurrent download against an out-of-skew cluster.** Per finding I1, this is the most likely real-world failure shape on Windows. A two-process race is hard to write deterministically, but the BATS fixture could simulate it by holding `dst` open via a separate process and asserting `Rename`'s graceful fallback.
2. **`embeddedVersion` parse failure (the `go run` case the comment at resolver.go:79-83 names).** Easy to add: a unit test that injects an unparseable string and asserts Resolve returns `("", nil)` with no probe.
3. **`apply -f manifest.yaml` against an out-of-skew cluster.** BATS only exercises `kubectl get pods` and `kubectl version` against the fake apiserver; an apply (or any subcommand that reads stdin/local files) is the more common real-world out-of-skew use case.
4. **`RDD_KUBECTL_RESOLVED=1` on the parent `rdd kubectl` invocation.** No test confirms the env-var recursion guard short-circuits Resolve; only the same-process `SkipResolver` boolean is exercised indirectly via `rdd ctl`.
5. **Download larger than `maxKubectlBytes`.** Per S1, the failure mode is misleading; a unit test on `streamGet` against a >256 MiB body would pin down the current behavior.

## Documentation Assessment

`docs/design/environment.md` and `docs/design/cmd_service.md` are kept in sync with the new env vars and `svc paths` keys — good. The `RDD_KUBECTL_RESOLVED` row is wrong about who sets it (see I3). The `cmd_service.md` `config.json` → `config.yaml` cleanup is correct and overdue, but is unrelated to the kubectl resolver (see S4). All exported symbols in `pkg/kuberlr/` carry a Godoc comment in the project's house style.

## Commit Structure

The commit message and the diff are clean and self-contained. Two minor scope notes (covered as S4): the `cmd_service.md` `config.json` rename and the `5-paths.bats` `docker_socket` addition are both pre-existing fixes that don't belong with the kuberlr feature. Splitting them into a "doc/test fixups" precursor commit would tighten the bisect surface, but neither is large enough to materially harm reviewability.

## Acknowledged Limitations

- **`pkg/kuberlr/cache.go:33-39` (TODO eviction):** The cache only grows; SIGKILL/power-loss leaks a `.kubectl-*` tempfile that the `defer os.Remove` would normally clean up. The author plans an LRU/TTL sweep covering both stale binaries and stale tempfiles. This PR makes the limitation marginally more important (cache is now actually used), but the TODO is well-placed and the failure mode is recoverable (~50 MB per leak is bounded).
- **`pkg/kuberlr/resolver.go:66-69` (TODO offline mode):** When no cached kubectl matches and the network is down, Resolve fails with the download error rather than falling back to the closest cached binary. Author plans an `RDD_KUBECTL_OFFLINE` opt-out layered on the same env-var surface.
- **`pkg/kuberlr/downloader.go:45-48` (TODO offline toggle):** Pairs with the resolver TODO above — air-gapped users can't yet pre-populate `CacheDir()` and forbid network fetches.
- **`pkg/kuberlr/resolver.go:79-83` (`go run`/IDE debug):** `componentbasever.Get().GitVersion` returns the unparseable in-tree default `v0.0.0-master+$Format:%H$` when `-ldflags -X` is skipped; the resolver falls through to the embedded kubectl rather than break dev workflows. Documented and intentional.
- **`pkg/kuberlr/resolver.go:168-175` (`--username`/`--password` unbound on the resolver side):** The author explicitly chose to leave these unbound to avoid switching `ToRawKubeConfigLoader` to the interactive variant that prompts on stdin. Probe falls through to the embedded kubectl on auth failure either way.
- **`bats/tests/10-cli/7-kubectl-cache.bats:103-111` (`local_teardown_file` deviation):** The test adds a teardown_file to kill the fake-kube server, deviating from the project rule. The deviation is well-justified (the rule's intent is post-mortem inspection of `rdd` state; an HTTP server on an ephemeral port is the opposite). Author acknowledged in code.
- **`bats/tests/10-cli/7-kubectl-cache.bats:202-235` (apiserver-500 exit code unasserted):** `kubectl version` behavior on a 500 from `/version` varies across kubectl releases; the test asserts probe-then-fall-through via the request-count check rather than exit code.
- **PR Review Comment (`@mook-as`, `cmd/rdd/service_paths.go:33`):** Asked whether `RDD_CACHE_DIR` could be exposed via `rdd svc paths`. Addressed in the current diff: `cache_dir` (kuberlr.CacheRoot()) and `kubectl_cache` (kuberlr.CacheDir()) are both registered at lines 32-33 and reflected in `bats/tests/10-cli/5-paths.bats:3` and `docs/design/{cmd_service,environment}.md`. Resolved.

## Coverage Summary

- `.github/dependabot.yml` — Trivial (adds `/bats/tests/10-cli/fake-kube` to the gomod ecosystem list)
- `bats/tests/10-cli/5-paths.bats` — S4 (drive-by `docker_socket` fix bundled with `cache_dir`/`kubectl_cache`)
- `bats/tests/10-cli/7-kubectl-cache.bats` — Reviewed, no issues (well-structured end-to-end coverage; teardown_file deviation acknowledged in code)
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Trivial
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues (the `O_APPEND` rationale at lines 55-58 is a particularly nice catch)
- `cmd/rdd/kubectl.go` — Reviewed, no issues
- `cmd/rdd/main.go` — Reviewed, no issues (empty-message envelope cleanly propagates the kubectl child exit code without a confusing `logrus.Error` line)
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `docs/design/cmd_service.md` — S4 (`config.json` → `config.yaml` rename is correct but unrelated to the kuberlr feature)
- `docs/design/environment.md` — I3 (`RDD_KUBECTL_RESOLVED` row attributes the env-var write to the wrong code path)
- `go.mod` — Trivial (promotes `blang/semver/v4` and `k8s.io/cli-runtime` from indirect to direct)
- `pkg/kuberlr/cache.go` — Reviewed, no issues
- `pkg/kuberlr/downloader.go` — I1, S1, S2, S3, S5
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — I2 (note only)
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues
### [Codex] — Pass 1

### Executive Summary

The PR adds a kubectl version resolver that probes the target cluster, downloads an out-of-skew kubectl into a shared cache, verifies it, and execs it. The main correctness issue is a Windows-specific race when two processes populate the same shared cache entry concurrently; there is also one documentation mismatch around the resolver skip mechanism.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Concurrent cache misses can fail on Windows after another process publishes the binary** — `pkg/kuberlr/downloader.go:61-119` (important, regression)

```go
if _, err := os.Stat(path); err == nil {
	return path, nil
}
// ...
if err := os.Chmod(tmpName, 0o755); err != nil {
	return err
}
return os.Rename(tmpName, dst)
```

`CacheDir` is intentionally shared across instances at `pkg/kuberlr/cache.go:24-25`, and `ensureCached` only checks whether the final path exists before downloading at `pkg/kuberlr/downloader.go:61`. If two `rdd kubectl` processes miss the same cache entry, both can download and verify successfully, but the loser then publishes with `os.Rename(tmpName, dst)` at `pkg/kuberlr/downloader.go:119`. On Windows, `os.Rename` fails when `dst` already exists, so the second command returns an error through `cmd/rdd/kubectl.go:74` even though a valid kubectl is now cached. This self-heals on the next invocation, but it makes concurrent first use of the shared cache flaky on Windows.

Fix: after a failed publish, treat an already-published regular destination as success, or serialize per-version downloads with a lock file. The minimal fix keeps the current temp-file flow:

```diff
 if err := os.Chmod(tmpName, 0o755); err != nil {
 	return err
 }
-return os.Rename(tmpName, dst)
+if err := os.Rename(tmpName, dst); err != nil {
+	if runtime.GOOS == "windows" {
+		if info, statErr := os.Stat(dst); statErr == nil && info.Mode().IsRegular() {
+			return nil
+		}
+	}
+	return err
+}
+return nil
```

#### Suggestions

S1. **Internal variable docs say `rdd ctl` sets an env var it does not set** — `docs/design/environment.md:35` (suggestion, regression)

```md
| `RDD_KUBECTL_RESOLVED` | Recursion guard for the kubectl version resolver. `kuberlr.Exec` sets it on the kubectl child process so a downloaded kubectl that re-execs `rdd` through a shim cannot recurse back into version resolution. `rdd ctl` sets it before invoking the kubectl action so the resolver skips the redundant probe against the embedded apiserver, whose version matches the embedded kubectl by construction. |
```

The documentation at `docs/design/environment.md:31` says these variables appear so developers can identify them in a process tree, but the `rdd ctl` path calls `kuberlr.SkipResolver()` at `cmd/rdd/kubectl.go:48`, and `SkipResolver` only sets the package-global `skipResolver` at `pkg/kuberlr/resolver.go:39-40`. No `RDD_KUBECTL_RESOLVED` environment variable is set for `rdd ctl`, so the table is misleading.

Fix:

```diff
-| `RDD_KUBECTL_RESOLVED` | Recursion guard for the kubectl version resolver. `kuberlr.Exec` sets it on the kubectl child process so a downloaded kubectl that re-execs `rdd` through a shim cannot recurse back into version resolution. `rdd ctl` sets it before invoking the kubectl action so the resolver skips the redundant probe against the embedded apiserver, whose version matches the embedded kubectl by construction. |
+| `RDD_KUBECTL_RESOLVED` | Recursion guard for the kubectl version resolver. `kuberlr.Exec` sets it on the kubectl child process so a downloaded kubectl that re-execs `rdd` through a shim cannot recurse back into version resolution. `rdd ctl` uses the in-process `kuberlr.SkipResolver()` helper instead because it invokes the embedded kubectl in the same process. |
```

### Design Observations

**Strengths**

The resolver reuses `genericclioptions.ConfigFlags` for kubeconfig and connection overrides, which keeps the probe aligned with the kubectl invocation instead of maintaining a separate flag parser.

The download path verifies the sha512 before publishing the binary and keeps incomplete downloads in temp files, which avoids treating interrupted downloads as cache hits.

### Testing Assessment

1. Concurrent cache population on Windows is not covered; that is the scenario behind I1.
2. Focused checks run: `go test ./pkg/kuberlr` passed, and `go test ./...` inside `bats/tests/10-cli/fake-kube` passed.
3. `go test ./cmd/rdd` could not run in this checkout because setup fails on missing embedded asset `pkg/guestagent/lima-guestagent.gz`.

### Documentation Assessment

S1 covers the only documentation mismatch found. The cache path and environment-variable docs otherwise track the new behavior.

### Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/cache.go:33` notes `TODO(eviction): the cache only grows`. The change makes this relevant over long-term use against many cluster versions, but the limitation is explicitly acknowledged and does not affect immediate correctness.
- **Code comment**: `pkg/kuberlr/resolver.go:66` notes offline mode is deferred when no cached kubectl matches and the network is down. `pkg/kuberlr/downloader.go:47` repeats the same future offline toggle at the mirror layer.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:103` explicitly acknowledges the `local_teardown_file()` deviation and explains it is only cleaning up the fake HTTP server, not RDD state.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats` — Reviewed, no issues
- `bats/tests/10-cli/7-kubectl-cache.bats` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — I1
- `cmd/rdd/main.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `docs/design/environment.md` — S1
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — I1
- `pkg/kuberlr/downloader.go` — I1
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — Reviewed, no issues
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues

### [Gemini] — Pass 1

### Executive Summary

This PR introduces `kuberlr`-style version resolution to `rdd kubectl`, allowing it to dynamically download and execute a version-matched binary when the user's cluster is out of skew with the embedded control plane. The design is robust, handling fallbacks gracefully without blocking offline workflows, but contains a critical panic path when `HOME` is absent and an important race condition on Windows cache population.

### Findings

#### Critical Issues

**C1. `CacheRoot` panics when `os.UserCacheDir` fails** — `pkg/kuberlr/cache.go:51-54` (critical, regression)

```go
	cache, err := os.UserCacheDir()
	if err != nil {
		panic(fmt.Errorf("could not determine user cache directory: %w", err))
	}
```

When `os.UserCacheDir()` fails (e.g., if `$HOME` is unset in a minimal container, systemd service, or headless CI environment), `CacheRoot` panics. Because `CacheRoot` is called by background systems and CLI commands like `rdd svc paths`, this panic will bring down the entire `rdd` daemon or command. A newly-absorbed package must not introduce unrecoverable crash paths.

**Fix:** Fall back to `os.TempDir()` rather than panicking.

```diff
 	cache, err := os.UserCacheDir()
 	if err != nil {
-		panic(fmt.Errorf("could not determine user cache directory: %w", err))
+		cache = os.TempDir()
 	}
 	return filepath.Join(cache, "rancher-desktop")
```

#### Important Issues

**I1. Concurrent downloads fail on Windows due to `os.Rename` race** — `pkg/kuberlr/downloader.go:118-120` (important, regression)

```go
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
```

On Windows, `os.Rename` does not atomically replace existing files; it fails with `ERROR_ALREADY_EXISTS` / `os.ErrExist` if the destination is present. If multiple `rdd kubectl` invocations race to download the same version concurrently (e.g., via parallel scripts), the first one will succeed, but all others will fail at the rename step and bubble up a download error. Since a valid binary is now present in the cache, the failing processes should just use it.

**Fix:** Catch the error and treat the operation as successful if the destination file now exists.

```diff
 	if err := os.Chmod(tmpName, 0o755); err != nil {
 		return err
 	}
-	return os.Rename(tmpName, dst)
+	if err := os.Rename(tmpName, dst); err != nil {
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil
+		}
+		return err
+	}
+	return nil
```

**I2. `isClientOnly` bypasses resolver when `--` is used** — `pkg/kuberlr/resolver.go:266-268` (important, regression)

```go
	for i := 0; i < len(args); i++ {
		arg := args[i]
		if arg == "--" {
			break
		}
```

The `--` separator stops flag parsing in upstream `kubectl`, so an invocation like `kubectl -- get pods` is perfectly valid. However, `isClientOnly` breaks early on `--` and leaves `subcommand` empty. The fallback `if subcommand == "" { return true }` at the end incorrectly classifies the command as client-only. This causes `rdd kubectl -- get pods` to silently skip the version probe and run the embedded binary against the remote cluster. 

**Fix:** If `--` is encountered, grab the next positional argument as the subcommand before breaking.

```diff
 	for i := 0; i < len(args); i++ {
 		arg := args[i]
 		if arg == "--" {
+			if subcommand == "" && i+1 < len(args) {
+				subcommand = args[i+1]
+			}
 			break
 		}
```

#### Suggestions

None.

### Design Observations

#### Strengths

- **Graceful Error Degradation:** The decision to fail open (falling back to the embedded kubectl) on unreachable clusters, misconfigured kubeconfig files, and garbage HTTP responses is excellent. It preserves local `rdd ctl` capabilities and offline functionality.
- **Developer UX:** Catching the unparseable `v0.0.0-master+$Format:%H$` dev version from `componentbasever` and bypassing the resolver avoids breaking local `make run` cycles for project contributors.
- **Robust BATS Server:** The `fake-kube-server` correctly anticipates test log truncation by opening the log file with `O_APPEND`. This prevents the sparse file bugs commonly seen when external test processes hold file descriptors across `>` shell truncations.

### Testing Assessment

The test suite covers standard cache hits, misses, and mirror failures. The following scenarios are unverified:

1. **`isClientOnly` handling of `--`** (Addresses I2): The test suite checks `exec pod -- completion` (where the subcommand precedes `--`), but misses `--` before the subcommand (e.g., `-- get pods`).
2. **Exit Code Propagation**: `7-kubectl-cache.bats` validates that a checksum mismatch fails (`assert_failure`), but does not assert that `rdd kubectl` correctly propagates the exact non-zero exit code of a downloaded or embedded `kubectl` child process when the probe succeeds. 
3. **Concurrent Download Race**: There are no parallel invocations validating behavior when two processes populate the cache simultaneously (Addresses I1).

### Documentation Assessment

The CLI documentation in `cmd_service.md` and `environment.md` correctly maps out the new paths and cache variables. No further updates needed.

### Acknowledged Limitations

- **Temp file leaks on SIGKILL:** The author acknowledged in a TODO (`pkg/kuberlr/cache.go:33-36`) that process crashes leave `.kubectl-*` temp files stranded in the cache directory, deferring an eviction/sweep pass to the future.
- **No offline cache fallback:** The author acknowledged in a TODO (`pkg/kuberlr/resolver.go:66-69`) that when a mirror is unreachable but a close-enough downloaded binary already exists in the cache, the resolver currently hard-fails rather than using the cached binary.
- **BATS Teardown:** The author added an explicit inline comment (`bats/tests/10-cli/7-kubectl-cache.bats:103-108`) documenting their deliberate deviation from the "no teardown_file" repo rule to avoid leaking ephemeral HTTP server ports across test sessions.

### Coverage Summary

- `.github/dependabot.yml`: Trivial
- `bats/tests/10-cli/5-paths.bats`: Trivial
- `bats/tests/10-cli/7-kubectl-cache.bats`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md`: Trivial
- `bats/tests/10-cli/fake-kube/go.mod`: Trivial
- `bats/tests/10-cli/fake-kube/kubectl/main.go`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go`: Reviewed, no issues
- `cmd/rdd/kubectl.go`: Reviewed, no issues
- `cmd/rdd/main.go`: Reviewed, no issues
- `cmd/rdd/service_paths.go`: Reviewed, no issues
- `docs/design/cmd_service.md`: Trivial
- `docs/design/environment.md`: Trivial
- `go.mod`: Trivial
- `pkg/kuberlr/cache.go`: C1
- `pkg/kuberlr/downloader.go`: I1
- `pkg/kuberlr/exec_unix.go`: Reviewed, no issues
- `pkg/kuberlr/exec_windows.go`: Reviewed, no issues
- `pkg/kuberlr/resolver.go`: I2
- `pkg/kuberlr/resolver_test.go`: I2

## Resolution

| | |
|---|---|
| **Addressed** | 2026-05-04 |
| **Commit** | `6b419bc rdd kubectl: embed kuberlr-style version resolver` (amended) |

| # | Finding | Action | Reason |
|---|---------|--------|--------|
| 1 | Suggestion #1: `RDD_KUBECTL_RESOLVED` row in environment.md credits `rdd ctl` with an env-var write Round 5's `SkipResolver` change removed | Fixed | |
| 2 | Suggestion #2: `isClientOnly` misclassifies `--` before subcommand as client-only | Fixed | Replaced the manual map+walker with a per-call pflag FlagSet bound to genericclioptions plus `--help`/`-h`, `--client`, and `--warnings-as-errors`. Klog flags ride pflag's UnknownFlags handling to avoid leaking parse state into klog's process globals. |
| 3 | Suggestion #3: 256 MiB `streamGet` cap surfaces silently as "checksum mismatch" | Commented | Comment near `maxKubectlBytes` names the truncation failure mode so a future cap miss does not look like an integrity bug |
| 4 | Suggestion #4: HTTP requests omit `User-Agent` | Fixed | |
| 5 | Suggestion #5: `os.Chmod(tmpName, 0o755)` no-op on Windows | Skipped | Mode 0755 is harmless on Windows; the call is dead but not wrong |
| 6 | Testing Gap #1: `isClientOnly` with `--` before subcommand | Test written | Added `-- before subcommand still finds the subcommand` row plus six other shapes covering klog `--v`/`--vmodule`, `--warnings-as-errors`, and unknown-flag fall-through. Built into the S2 refactor. |
| 7 | Testing Gap #2: Concurrent first-use cache miss | Skipped | Declined Rounds 1–5; acknowledged limitation |
| 8 | Testing Gap #3: Download larger than `maxKubectlBytes` | Skipped | Trigger is hypothetical (kubectl ~50 MB today, cap is 256 MiB); the new comment near `maxKubectlBytes` names the failure mode |
| 9 | Testing Gap #4: End-to-end Windows execution path | Skipped | Acknowledged limitation; PR description gates merge on Windows BATS support |
| 10 | Testing Gap #5: `embeddedVersion` parse failure | Test written | `TestResolveEmbeddedVersionUnparseable` injects an unparseable version via the now-overridable `embeddedVersion` var and asserts Resolve returns `("", nil)` |
| 11 | Testing Gap #6: `RDD_KUBECTL_RESOLVED=1` on the parent `rdd kubectl` | Skipped | Recursion guard is two lines; BATS download-on-miss exercises Exec setting the env var on the kubectl child indirectly |
