# Deep Review: 20260503-123001-pr-348

| | |
|---|---|
| **Date** | 2026-05-03 12:30 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 3 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#348](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348) — rdd kubectl: embed kuberlr-style version resolver |
| **Branch** | `embed-kuberlr` |
| **Commits** | `75346bd` rdd kubectl: embed kuberlr-style version resolver |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — I1 (BATS 500-fall-through test cannot fire its key assertion after the Round 2 isClientOnly short-circuit) and I2 (`--client=true` reaches the probe path) are real correctness regressions. I3 (auth/TLS overrides not threaded into the probe) is a genuine extension of Round 2's I1 partial fix; the docstring documents the limitation, but a `--server` + `--token` invocation will silently pick the wrong kubectl version. |
| **Wall-clock time** | `28 min 20 s` |

## Executive Summary

Round 2's three Important fixes (kubectl flag scanner, log-level demotion, isClientOnly short-circuit) all landed cleanly. Round 3 surfaces three new Importants on the same code paths the Round 2 fixes touched: a BATS test (`apiserver returns 500 on /version`) calls `kubectl version --client`, which the Round 2 isClientOnly short-circuit now skips before any probe — making the test's `assert_file_contains "${LOG_FILE}" '^GET /version'` assertion impossible to satisfy (I1, raised independently by Claude and Codex); `isClientOnly` recognises `--client` exactly but treats `--client=true` as a generic flag, so `kubectl version --client=true` reaches the probe and stalls 2 s against an unreachable cluster (I2); and the Round 2 extension of `extractKubeFlags` to `--server/--cluster/--user` did not include the auth/TLS overrides (`--token`, `--certificate-authority`, `--client-certificate`, `--client-key`, `--tls-server-name`, `--username`, `--password`, `--insecure-skip-tls-verify`), so a probe against a cluster whose access requires those flags fails 401/TLS and silently falls through to a possibly-mismatched embedded kubectl (I3).

Suggestions are localised: a `config.json` → `config.yaml` typo in the new `cmd_service.md` example block, missing body-size caps on `streamGet` (raised by both Claude and Gemini), an `os.Setenv` error swallowed in `SkipResolver`, three documentation/comment polish items.

Structure: 0 Critical · 3 Important · 8 Suggestions.


## Critical Issues

None.

## Important Issues

I1. **BATS `apiserver returns 500` test asserts a `/version` GET that the Round 2 `isClientOnly` short-circuit skips** — `bats/tests/10-cli/7-kubectl-cache.bats:190-206`, `pkg/kuberlr/resolver.go:65-66,272-307` [Claude Opus 4.7][Codex GPT 5.5]

```bash
@test 'rdd kubectl falls through to embedded when the apiserver returns 500 on /version' {
    echo 500 >"${VERSION_STATUS_FILE}"

    run -0 env \
        RDD_CACHE_DIR="${CACHE_DIR}" \
        RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
        KUBECONFIG="${KUBECONFIG_PATH}" \
        rdd kubectl version --client

    assert_output
    refute_output --partial 'fake-kubectl:'
    refute_output --partial 'resolving kubectl version'
    assert_file_contains "${LOG_FILE}" '^GET /version'
    refute_file_contains "${LOG_FILE}" '/release/'
}
```

The Round 2 fix added `isClientOnly` and short-circuits `Resolve` for `version --client` (`pkg/kuberlr/resolver.go:65-66`). The unit test at `pkg/kuberlr/resolver_test.go:96` confirms `{"version --client", []string{"version", "--client"}, true}`. So `rdd kubectl version --client` skips the probe entirely; `LOG_FILE` ends up empty for the `/version` GET; the assertion at line 204 cannot fire.

The test's stated intent — verifying graceful fall-through when the apiserver returns 500 — needs a command that actually triggers the probe. The current command bypasses the probe before the 500 status matters.

Fix: pick a probe-triggering command. Drop `--client` so plain `kubectl version` runs (it probes the server for its version, then prints both client and server output; modern kubectl exits 0 with a "couldn't find server version" diagnostic on probe failure).

```diff
-        rdd kubectl version --client
+        rdd kubectl version
```

Verify the chosen command end-to-end before merge — if `kubectl version` against a 500ing apiserver exits non-zero, switch to `run env ...` and assert on the exit code separately.

I2. **`isClientOnly` matches `--client` exactly but not `--client=true` — `kubectl version --client=true` reaches the probe and stalls 2 s against an unreachable cluster** — `pkg/kuberlr/resolver.go:286-294` [Codex GPT 5.5]

```go
if arg == "--client" {
    sawClient = true
    continue
}
if strings.HasPrefix(arg, "-") {
    if !strings.Contains(arg, "=") && kubectlGlobalFlagsTakingValue[arg] && i+1 < len(args) {
        i++
    }
    continue
}
```

`--client=true` is the documented bool form kubectl accepts (along with `--client=false`, `--client=1`, `--client=0`). It matches the second branch (`strings.HasPrefix(arg, "-")`), so `sawClient` stays false. The walk lands at line 304 with `subcommand="version"` and `sawClient=false`, returning `false`. Resolve then probes the cluster — exactly the regression Round 2's I3 short-circuit was meant to prevent. `--client=false` should also work the other direction (force-probe), and repeated flags should follow kubectl's "last wins" semantics.

The unit test table at `pkg/kuberlr/resolver_test.go:96-97` covers `version --client` and `version --client --output=json` but no equals-form case.

Fix: parse `--client=<bool>` and let later occurrences overwrite earlier ones, matching kubectl's flag-parser behaviour.

```diff
+import "strconv"
+
 if arg == "--client" {
     sawClient = true
     continue
 }
+if strings.HasPrefix(arg, "--client=") {
+    if v, err := strconv.ParseBool(strings.TrimPrefix(arg, "--client=")); err == nil {
+        sawClient = v
+    }
+    continue
+}
```

Add unit-test rows for `version --client=true`, `version --client=false`, and `version --client --client=false` (last wins).

I3. **Round 2's `extractKubeFlags` extension covered cluster-selection but not auth/TLS overrides; `rdd kubectl --server X --token Y` makes the probe fail 401 and silently exec mismatched embedded kubectl** — `pkg/kuberlr/resolver.go:146-156,161-167,209-248` [Codex GPT 5.5]

```go
overrides := &clientcmd.ConfigOverrides{
    CurrentContext: flags.context,
    ClusterInfo:    clientapi.Cluster{Server: flags.server},
    Context:        clientapi.Context{Cluster: flags.cluster, AuthInfo: flags.user},
}
```

Round 2's I1 fix added `--server`, `--cluster`, `--user` to `extractKubeFlags`. The Round 2 finding text named the rest of kubectl's connection-override surface (`--token`, `--certificate-authority`, `--client-certificate`, `--client-key`, `--tls-server-name`, `--username`, `--password`, `--insecure-skip-tls-verify`) and said "If the full flag set is too much code for this PR, document the gap in the Resolve doc." The author chose the minimum fix and updated the docstring to enumerate the five handled flags (`pkg/kuberlr/resolver.go:41-47`). The auth/TLS surface stayed out.

`kubectlGlobalFlagsTakingValue` (`pkg/kuberlr/resolver.go:216-248`) already lists the auth/TLS flags as value-taking — `isClientOnly` knows to skip them when locating the subcommand — but `extractKubeFlags` never records them, so they never reach `clientcmd.ConfigOverrides`. A user running

```
rdd kubectl --server=https://api.staging:6443 --token=ey... --insecure-skip-tls-verify get pods
```

makes the resolver probe `staging` without `--token` and without `--insecure-skip-tls-verify`. The probe fails (401 or TLS error), `serverVersion` returns `ok=false`, `Resolve` returns `""`, and the embedded kubectl runs against `staging` with the full flag set. If `staging` is at v1.34 and embedded is at v1.30, the user gets a silent skew violation — the very case the resolver exists to prevent.

Fix: thread the standard auth/TLS overrides through `clientcmd.ConfigOverrides`. The minimum is `--token` → `AuthInfo.Token`, `--certificate-authority` → `ClusterInfo.CertificateAuthority`, `--client-certificate`/`--client-key` → `AuthInfo.ClientCertificate`/`ClientKey`, `--tls-server-name` → `ClusterInfo.TLSServerName`, `--insecure-skip-tls-verify` → `ClusterInfo.InsecureSkipTLSVerify`. The durable fix delegates to `genericclioptions.ConfigFlags` so resolver and kubectl child cannot drift by construction:

```go
import "k8s.io/cli-runtime/pkg/genericclioptions"

func loadClientConfig(args []string) (*rest.Config, error) {
    flags := genericclioptions.NewConfigFlags(false)
    fs := pflag.NewFlagSet("rdd-kubectl", pflag.ContinueOnError)
    flags.AddFlags(fs)
    _ = fs.Parse(stripPositional(args))
    return flags.ToRESTConfig()
}
```

If the durable fix is too much for this PR, extend `extractKubeFlags` and `loadClientConfig` to cover the auth/TLS minimum above. Either way, add `TestExtractKubeFlags` rows for `--token`, `--certificate-authority`, and `--insecure-skip-tls-verify`.

## Suggestions

S1. **`docs/design/cmd_service.md` example shows `config.json`; implementation reports `config.yaml`** — `docs/design/cmd_service.md:154` [Codex GPT 5.5]

```console
args_file      /path/to/rancher-desktop-default/rdd.args
config         /path/to/rancher-desktop-default/config.json
dir            /path/to/rancher-desktop-default
kubectl_cache  /path/to/Caches/rancher-desktop/kubectl/darwin-arm64
```

`instance.Config()` returns `filepath.Join(Dir(), "config.yaml")` (`pkg/instance/instance.go:91`). `git blame` shows the example block was rewritten by this PR's commit (`75346bd`), so the typo is a regression introduced here, not a pre-existing doc drift.

Fix:

```diff
-config         /path/to/rancher-desktop-default/config.json
+config         /path/to/rancher-desktop-default/config.yaml
```

S2. **`streamGet` has no body-size cap; a misconfigured mirror returning the binary URL for the `.sha512` path streams ~50 MB into a `strings.Builder`** — `pkg/kuberlr/downloader.go:123-152` [Claude Opus 4.7][Gemini 3.1 Pro]

```go
func fetchSha512(ctx context.Context, url string) (string, error) {
    var sb strings.Builder
    if err := streamGet(ctx, url, &sb); err != nil {
        return "", err
    }
    fields := strings.Fields(sb.String())
```

`streamGet` does `io.Copy(w, resp.Body)` with no `io.LimitReader`. The 5-minute `httpClient.Timeout` bounds total wall-time, but at modest LAN speeds 5 minutes is gigabytes. The realistic trigger is a misconfigured mirror that serves the binary at the `.sha512` URL (or any mirror error page returned with status 200) — `fetchSha512` allocates the whole response in memory, then `strings.Fields(sb.String())` returns gibberish that fails the digest comparison with a misleading `checksum mismatch` error.

Default mirror (`dl.k8s.io`) is trusted, so this matters only with operator-configured mirrors. Severity is Suggestion, not Important — Gemini raised it as Important; reconciled to Suggestion (see Reconciliation).

Fix: bound both endpoints. The sha512 response is at most a few hundred bytes; the kubectl binary is ~50 MB but bounding to e.g. 200 MB gives headroom.

```diff
-func streamGet(ctx context.Context, url string, w io.Writer) error {
+func streamGet(ctx context.Context, url string, w io.Writer, maxBytes int64) error {
     ...
-    _, err = io.Copy(w, resp.Body)
+    _, err = io.Copy(w, io.LimitReader(resp.Body, maxBytes))
     return err
 }
```

Pass `1<<10` from `fetchSha512`, `256<<20` from `download`. Belt-and-suspenders: validate `fields[0]` is exactly 128 hex characters in `fetchSha512` (`sha512.Size*2`) so a non-digest response surfaces with a clearer error than `checksum mismatch`.

S3. **`SkipResolver` swallows `os.Setenv` error; the surrounding `ctlAction` propagates the same call's error for `KUBECONFIG`** — `pkg/kuberlr/resolver.go:32-34`, `cmd/rdd/kubectl.go:42-48` [Claude Opus 4.7]

```go
func SkipResolver() {
    os.Setenv(envSkipResolver, "1")
}
```

`ctlAction` two lines up does `if err := os.Setenv("KUBECONFIG", instance.Config()); err != nil { return ... }`. The inconsistency reads as an oversight rather than a deliberate choice; the constant name and value `"1"` cannot trigger the `EINVAL` path in practice, but for stylistic alignment with the sibling call, surface the error.

Fix:

```diff
-func SkipResolver() {
-    os.Setenv(envSkipResolver, "1")
+func SkipResolver() error {
+    if err := os.Setenv(envSkipResolver, "1"); err != nil {
+        return fmt.Errorf("setting %s: %w", envSkipResolver, err)
+    }
+    return nil
 }
```

Then in `ctlAction`:

```diff
-    kuberlr.SkipResolver()
+    if err := kuberlr.SkipResolver(); err != nil {
+        return err
+    }
```

S4. **`isClientOnly` lacks a comment explaining why bare `version` is cluster-bound** — `pkg/kuberlr/resolver.go:303-305` [Claude Opus 4.7]

```go
if subcommand == "version" {
    return sawClient
}
```

The unit test `{"version without --client", []string{"version"}, false}` covers the behaviour; the rationale ("kubectl version probes the server for its version, so we want a version-matched binary doing that probe") is non-obvious from the code. A one-line comment locks the intent.

S5. **Document `httpClient.Timeout` as a backstop, not the primary deadline** — `pkg/kuberlr/downloader.go:33-42` [Claude Opus 4.7]

```go
// downloadTimeout caps each mirror request. Five minutes turns a hung
// mirror into a bounded failure and still covers a ~50 MB kubectl
// binary on slow links (~170 kB/s).
const downloadTimeout = 5 * time.Minute

// httpClient enforces downloadTimeout on every kuberlr request.
// http.DefaultClient sets no Timeout, and the inherited cobra context
// carries no deadline.
var httpClient = &http.Client{Timeout: downloadTimeout}
```

If a future caller adds a deadline to the cobra context, the request honours both — whichever fires first. The current comment explains why the cap exists today but does not name the composition rule. A one-line addition prevents a future maintainer from concluding "context deadline must be wrong; the client wins."

Fix: end the `httpClient` comment with "If the request context carries a shorter deadline, that deadline wins."

S6. **Cache the server-version probe across invocations to remove RTT from interactive `rdd kubectl` loops** — `pkg/kuberlr/resolver.go:78-88` [Claude Opus 4.7]

Every non-client-only invocation pays a fresh `/version` round trip. For an interactive workflow running many short `rdd kubectl` commands in sequence, that is RTT × N of latency that a `(server-URL, server-version, expires-at)` cache keyed on the resolved `flags.server` would amortise away. Out of scope for this PR; flagging as a follow-up.

S7. **Move package doc-comment from `cache.go` to `resolver.go` (or a dedicated `doc.go`) for findability** — `pkg/kuberlr/cache.go:5-10` [Claude Opus 4.7]

```go
// Package kuberlr resolves a kubectl binary compatible with the cluster
// targeted by the user's kubeconfig. ...
package kuberlr
```

The package doc on `cache.go` is the convention-breaker — most readers open `resolver.go` first. Cosmetic.

S8. **`write_kubectl_sha512` could drop the `is_macos` branch — `shasum -a 512` ships on Linux too** — `bats/tests/10-cli/7-kubectl-cache.bats:111-117` [Claude Opus 4.7]

```bash
write_kubectl_sha512() {
    if is_macos; then
        shasum -a 512 "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
    else
        sha512sum "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
    fi
}
```

`shasum` is part of Perl's standard distribution and present on most Linuxes by default. Picking one tool removes the conditional. Trivial; do not block the merge.

## Design Observations

### Concerns

- **Double execution of client-go exec auth plugins (in-scope)** [Gemini 3.1 Pro]. `serverVersion` builds a complete `rest.Config` and issues a `/version` request through client-go, which executes any `exec` authenticator (AWS IAM Authenticator, gcloud, kubelogin, etc.) configured on the kubeconfig user. The downloaded kubectl child then executes the same `exec` authenticator a second time. For non-interactive plugins this doubles latency on every `rdd kubectl` invocation; for plugins that prompt for MFA, the user gets two prompts back-to-back. Caching the probe (S6) would mask the latency cost and would also fix the double-MFA case for cached server versions; until then, an `--insecure-skip-tls-verify`-equivalent escape hatch (skip the probe when the user passes a flag the resolver knows it cannot honour without re-executing the authenticator) is the cheap mitigation.

- **Manual kubectl flag parsing carries drift risk (in-scope, persistent)** [Codex GPT 5.5]. Raised in Round 1 and Round 2; the underlying issue surfaces again as I3 in this round (the auth/TLS flag set was not threaded through the Round 2 fix) and as I2 (`--client=true` not handled). Both are concrete instances of the broader pattern: the resolver approximates kubectl's flag semantics by hand. Switching to `genericclioptions.ConfigFlags` plus `pflag` (per the I3 fix sketch) makes resolver and child kubectl share the parser by construction.

- **Cache eviction acknowledged TODO is reaching the point of being load-bearing (future)** [Claude Opus 4.7]. The cache only grows; ~50 MB per minor version. A user spanning 6 minors over a year accumulates ~300 MB; one spanning vendor-tagged minors (`-eks-1234`, `-gke.500`) accumulates one entry per tag. A simple "delete kubectl binaries not accessed in 30 days" sweep on each `rdd kubectl` invocation would suffice and is cheap to implement.

### Strengths

- **Round 2 fixes landed cleanly** [Claude Opus 4.7][Codex GPT 5.5]. `extractKubeFlags` now covers the cluster-selection flags, `isClientOnly` short-circuits routine fall-through paths, log levels are demoted to debug for the three routine branches and stay at warn for the surprising one. The Round 2 review's concerns about UX regressions (warn-level log noise, 2 s probe stall on client-only commands) are addressed.
- **Atomic publish stays clean** [Claude Opus 4.7][Codex GPT 5.5]. `os.CreateTemp` → `io.MultiWriter(tmp, h)` → `os.Rename` with `defer os.Remove(tmpName)` covers every error path.
- **Recursion guard** [Claude Opus 4.7]. `RDD_KUBECTL_RESOLVED` set in `Exec` plus the new `SkipResolver()` entry point keep the resolver structurally non-recursive.
- **Test fixture as a separate Go module** [Claude Opus 4.7]. `bats/tests/10-cli/fake-kube/` keeps test-only dependencies out of the main `go.mod`; `dependabot.yml`'s `directories:` extension keeps the fixture patched.

## Testing Assessment

Highest-risk untested scenarios in Round 3 (excluding items declined in prior rounds):

1. **`--client=true` and `--client=false` equals-form** (I2). No test asserts `kubectl version --client=true` is client-only.
2. **`--token` / `--certificate-authority` / `--insecure-skip-tls-verify` overrides** (I3). No `TestExtractKubeFlags` row covers these.
3. **End-to-end exec of the downloaded binary on Windows.** `pkg/kuberlr/exec_windows.go` has no automated coverage; declined in prior rounds (Windows BATS runner deferred), surfacing again because I3's fix interacts with TLS code paths that diverge between platforms.
4. **`info.GitVersion` from a vendor cluster** (EKS, GKE, AKS — `v1.30.0-eks-1234`, `v1.30.7-gke.500`). `semver.ParseTolerant` should handle suffixed forms; a `TestCompatible`-adjacent unit test would lock the contract.

Verification runs: Codex ran `make build-rdd` and `go test ./pkg/kuberlr` in its worktree (both passed). Claude verified `isClientOnly(["version", "--client"]) == true` against the implementation directly (relevant to I1).

## Documentation Assessment

- `docs/design/cmd_service.md` example block at line 154 lists `config.json` where the implementation reports `config.yaml` (S1; PR-introduced regression).
- `docs/design/environment.md` documents `RDD_CACHE_DIR`, `RDD_KUBECTL_MIRROR`, `RDD_KUBECTL_CACHE`, and (per Round 2) `RDD_KUBECTL_RESOLVED` under Internal Variables. Complete.
- The `Resolve` docstring (`pkg/kuberlr/resolver.go:41-47`) enumerates only the five flags `extractKubeFlags` actually handles, which documents I3's gap obliquely (positive enumeration). A negative statement — "The resolver does NOT honour --token, --certificate-authority, --client-certificate, --client-key, --tls-server-name, --insecure-skip-tls-verify; probes against clusters that require them fall through to the embedded kubectl" — would make the limitation explicit until I3 is fixed.

## Acknowledged Limitations

- **Cache eviction not implemented** — `pkg/kuberlr/cache.go:33-36`. The `TODO(eviction)` comment carries the framing forward; see the design observation on this becoming load-bearing.
- **Offline mode not implemented** — `pkg/kuberlr/resolver.go:57-60`, `pkg/kuberlr/downloader.go:45-46`. TODOs cover the planned env-var/config opt-out for download attempts. Consistent across rounds.
- **Windows BATS host deferred** — `bats/tests/10-cli/7-kubectl-cache.bats:14-20`. Documented; Windows-specific code paths (Exec, the os.Rename concurrent-publish race that prior rounds declined) remain compile-only.
- **`local_teardown_file()` deviation from BATS convention** — `bats/tests/10-cli/7-kubectl-cache.bats:90-98`. Inline rationale stands.
- **Falling back to embedded on any cluster-probe failure** — `pkg/kuberlr/resolver.go:88-117`. Each `nilerr` branch is documented; the design choice is sound. I3 highlights a case where this fall-through is the wrong outcome (probe failed because the resolver did not pass the user's auth flags), but the fall-through itself is not the bug.

### Declined in prior rounds

- **`CacheDir()` panics when `os.UserCacheDir()` fails** — declined Round 1 (R1 S11) and Round 2 (R2 raw S1): failure mode requires neither `$HOME` nor `%LocalAppData%` to resolve; clear stack on the rare failure. Gemini raised again as S1; dropped per the resolutions gate.
- **Concurrent first-use Windows `os.Rename` race** — declined Round 1 (R1 raw I1, dropped as false positive on the basis of modern Go's `REPLACE_IF_EXISTS|POSIX_SEMANTICS`) and Round 2 (R2 raw I2/S5): surrounding code unchanged. Raised this round by all three agents (Claude I2, Codex S1, Gemini I1); dropped per the resolutions gate. The R2 retro noted this is worth filing as a follow-up TODO documenting the Windows file-mapping subtlety if Jan wants a record beyond the prior decisions; the Round 3 re-raises strengthen that case.
- **`RDD_CACHE_DIR` scope rename** — declined Round 1 (R1 S5): deliberate forward-looking knob.
- **Server `log.Fatal*` bypassing `defer logFD.Close()`** — declined Round 1 (R1 S10): test fixture; OS reclaims fd.
- **Pure-Go unit coverage thin on download/cache machinery** — declined Round 2 (R2 S5): BATS suite covers the integration concerns.
- **BATS coverage for HTTP timeout/hang, network cancel, Windows Exec, sha512sum-format mirror, trailing-slash mirror, default-log-level message capture, unreachable-server probe** — declined Round 1 and Round 2 (R1 testing gap #1, R2 testing gaps #2-7): defer; no new test infrastructure this round.
- **`info.GitVersion` empty/unknown unit test** — declined Round 1: requires mocked discovery client.
- **Resolver-policy summary and recursion-guard rationale duplicated across files** — declined Round 1: each duplicate serves a different audience (package vs. symbol godoc).

## Agent Performance Retro

### [Claude]

Claude's strongest move this round was independent verification of I1 — running the args through `isClientOnly` against the implementation, then cross-referencing the unit test row that confirms `version --client` returns true. Codex caught the same test bug at the same severity. Beyond I1, Claude's contribution was breadth: seven Suggestions covering body-size caps (also caught by Gemini), `SkipResolver` error handling, comment polish, and a server-version probe cache. Two of Claude's initial Importants (I4 httpClient.Timeout composition, I5 every-invocation probe) were withdrawn within the review and downgraded to Suggestions — visible self-correction is a positive signal, not a knock. Claude's I2 (Windows `os.Rename` race) and S7 (`CacheDir` panic) re-raised findings declined in Round 1 and Round 2; both dropped per the resolutions gate.

### [Codex]

Codex carried this round's correctness work. I2 (`--client=true`) and I3 (auth/TLS overrides) are both genuine follow-ups to Round 2's fixes that Claude and Gemini missed; Codex was the only agent to spot the equals-form gap and the only one to enumerate the auth/TLS flag set against `kubectlGlobalFlagsTakingValue`. S2 (`config.json` typo in cmd_service.md) required reading the example block against the implementation — `instance.Config()` returns `config.yaml`, the example was rewritten in this PR's commit, no other agent caught it. Tight three Importants, one Suggestion, one declined-prior-rounds Suggestion (Windows rename); zero false positives. Best severity calibration of the round.

### [Gemini]

Gemini ran successfully this round (R1 and R2 had auth/trust failures). Of three findings, two (I1 Windows `os.Rename` race, S1 `CacheDir` panic) were declined-in-prior-rounds and dropped per the resolutions gate. The surviving I2 (unbounded read in `fetchSha512`) overlapped with Claude's S4 — Gemini called it Important, Claude called it Suggestion; reconciled to Suggestion (S2 in this report) because the trigger requires an operator-configured malicious or misconfigured mirror. Gemini's design observation on double execution of client-go exec auth plugins (AWS IAM Authenticator, gcloud, kubelogin) is a genuinely new architectural insight neither Round 1 nor Round 2 surfaced; it is the kind of cross-component reasoning that justifies running three agents.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 21m 02s | 8m 39s | — |
| Findings | 1I 7S | 3I 1S | 1S |
| Tool calls | 39 (Read 19, Bash 17, Grep 3) | 71 (shell 70, stdin 1) | — |
| Design observations | 1 | 1 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | 6 | 4 | 1 |
| Files reviewed | 18 | 18 | 18 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **1I 7S** | **3I 1S** | **1S** |
| Downgraded | 2 (I→S) | 0 | 1 (I→S) |
| Dropped | 3 | 1 | 2 |


#### Reconciliation

- Gemini I2 unbounded `fetchSha512` read: important → suggestion S2 (default mirror is trusted; trigger requires an operator-configured mirror).
- Gemini I1 Windows `os.Rename` race: important → dropped (declined Round 1+2; surrounding code unchanged).
- Gemini S1 `CacheDir` panic: suggestion → dropped (declined Round 1+2).
- Claude I2 Windows `os.Rename` race: important → dropped (declined Round 1+2).
- Claude I3 `SkipResolver` `os.Setenv` error: important → suggestion S3 (the failure mode for a constant env name is unreachable in practice; finding stands as stylistic alignment).
- Claude I4 `httpClient.Timeout` composition: withdrawn by Claude → suggestion S5.
- Claude I5 every-invocation probe: withdrawn by Claude → suggestion S6.
- Claude S4 body-size cap: kept; merged with Gemini I2 reconciled to S2.
- Claude S7 `CacheDir` panic: suggestion → dropped (declined Round 1+2).
- Claude S9 `clientOnlySubcommands` missing api-resources/etc: dropped (Claude himself notes "no fix needed"; the conservative bias is documented).
- Codex S1 Windows cold-cache rename: suggestion → dropped (declined Round 1+2).

Claude provided the most volume; Codex provided the most value. The two complement each other — Codex tight on correctness, Claude broad on polish and forward-looking suggestions. Gemini contributed one architecturally interesting observation that the other agents did not surface.

Net Round 3 contribution after the resolutions gate: 3 Important findings (all on Round 2's recently-touched code paths), 8 Suggestions, 3 design observations.

## Review Process Notes

### Skill improvements

None. The prior-round resolutions gate did its job: nine declined-in-prior-rounds re-raises were dropped without consuming clarification rounds or finding slots. The agents' recurring re-raises of the Windows `os.Rename` race (now declined in three consecutive rounds across all three agents) suggest the gate is the right mechanism but the agents lack visibility into prior decisions; that is by design — they review the code, not the prior reviews.

### Repo context updates

- **Add a "follow-up to a partial fix" check to `deep-review-context.md` for review rounds ≥ 2.** When a prior-round Important finding's resolution table marks an item Fixed, the agents do not see whether the fix covered the entire surface the prior finding named. If the prior finding's text enumerated a wider surface (e.g., "the minimum is X, Y, Z; the durable fix is W") and the author chose the minimum, the next round should re-evaluate whether the residual gap warrants Important severity given the documented limitation. Add: "When a prior-round Important is marked Fixed and its finding text named a wider surface than the implemented fix, look for the residual gap. The author may have made a deliberate scope choice — verify the documented limitation matches the residual surface, and treat any user-reachable correctness gap as Important. This applies to flag scanners, error handlers, retry policies, and any other 'add the missing case' fix."


## Appendix: Original Reviews

### [Claude] — Pass 1

Now I have enough context. Let me write up the review.

# Deep Review — PR #348: rdd kubectl version resolver

## Executive Summary

This PR embeds a kuberlr-style version resolver into `rdd kubectl` that fetches a version-matched kubectl from `dl.k8s.io` (or a configured mirror), sha512-verifies it, and execs it when the embedded kubectl is more than ±1 minor out of skew with the target cluster. The resolver is well-structured with clean separation of concerns (cache, downloader, exec, resolver) and good test coverage. Main concerns: the BATS test for the 500-fall-through path appears to assert behavior that contradicts the resolver's own short-circuit logic, and there's a Windows-only race when concurrent processes finish a download for the same version simultaneously.

## Findings

### Critical Issues

None.

### Important Issues

**I1. BATS "apiserver returns 500" test contradicts `isClientOnly` short-circuit — `bats/tests/10-cli/7-kubectl-cache.bats:190-206`** (important, regression)

```bash
@test 'rdd kubectl falls through to embedded when the apiserver returns 500 on /version' {
    echo 500 >"${VERSION_STATUS_FILE}"

    run -0 env \
        RDD_CACHE_DIR="${CACHE_DIR}" \
        RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
        KUBECONFIG="${KUBECONFIG_PATH}" \
        rdd kubectl version --client

    ...
    assert_file_contains "${LOG_FILE}" '^GET /version'
    refute_file_contains "${LOG_FILE}" '/release/'
}
```

The command `rdd kubectl version --client` is classified as client-only by `isClientOnly` at `pkg/kuberlr/resolver.go:272-307` (its own unit test at `pkg/kuberlr/resolver_test.go:96` asserts this with `{"version --client", []string{"version", "--client"}, true}`). When `isClientOnly` returns true, `Resolve` short-circuits at `pkg/kuberlr/resolver.go:65-67` and returns `""` without probing `/version`. The embedded kubectl then runs `version --client`, which prints only the client version and never contacts the server. As a result, `LOG_FILE` should be empty after this test, and the assertion `assert_file_contains "${LOG_FILE}" '^GET /version'` should fail.

The test's stated intent — verifying graceful fall-through when the apiserver returns 500 — cannot be exercised with a client-only command, because the resolver never reaches the probe path.

I verified `isClientOnly(["version", "--client"]) == true` against the implementation directly. Either the test was written but never run, or the implementation diverged from the test's design.

Fix: pick a command that triggers the probe but where embedded kubectl can still exit 0 (e.g., `kubectl options` is not workable since it's also client-only). One option is to drop `--client` so `version` triggers the probe; modern kubectl prints a "couldn't find server version" diagnostic and still exits 0 even when the server is unhelpful. Verify the chosen command end-to-end before merge.

```diff
-        rdd kubectl version --client
+        rdd kubectl version
```

If the chosen command does not exit 0, change to `run env ...` and assert specific output — but ensure the probe-then-fall-through path is actually exercised. As written, the test's fourth assertion is the only one that proves the 500 was observed; without a working version of that assertion, the test does not prove what its name claims.

---

**I2. Concurrent downloads race on Windows `os.Rename` — `pkg/kuberlr/downloader.go:57-70, 116`** (important, regression)

```go
func ensureCached(ctx context.Context, want semver.Version) (string, error) {
    path := cachedPath(want)
    if _, err := os.Stat(path); err == nil {
        return path, nil
    }
    ...
    if err := download(ctx, want, path); err != nil {
        return "", err
    }
    return path, nil
}
```

```go
// download(...)
return os.Rename(tmpName, dst)
```

The cache is shared across all rdd instances (per `cache.go:24-25`). If two `rdd kubectl` invocations (across instances or in parallel scripts) miss the same version simultaneously, both pass the `os.Stat` at line 59, both download to separate temp files, and both call `os.Rename` to the same destination. On Unix, `rename(2)` atomically replaces the destination, so the second wins silently. On Windows, `os.Rename` fails when the destination exists (the user-instructions context calls this out as a Windows-specific concern), so the second invocation surfaces a "rename ...: file exists" error to the user even though a perfectly valid kubectl is in place.

Fix: after the sha verification but before `os.Rename`, re-check whether the destination already exists; if so, drop the temp file and return the existing path. Alternatively, on Windows, attempt the rename and treat "destination exists" as success (re-stat the destination to confirm, then remove the temp).

```diff
     if err := os.Chmod(tmpName, 0o755); err != nil {
         return err
     }
+    if _, err := os.Stat(dst); err == nil {
+        // Concurrent rdd kubectl from another instance won the race; the file
+        // on disk has the same sha512 we just verified. Drop our temp.
+        return nil
+    }
     return os.Rename(tmpName, dst)
```

Note that the `defer os.Remove(tmpName)` at line 100 already cleans up the temp on the early return.

---

**I3. `os.Setenv` ignores its error — `pkg/kuberlr/resolver.go:32-34`** (important, gap)

```go
func SkipResolver() {
    os.Setenv(envSkipResolver, "1")
}
```

`os.Setenv` returns an error (e.g., on a name with `=` or NUL on some platforms, or syscall failure). The caller in `cmd/rdd/kubectl.go:48` cannot react. While this is in practice unlikely to fail, the surrounding code (`cmd/rdd/kubectl.go:42-44` for `KUBECONFIG`) does propagate the error from `os.Setenv`. Either propagate the error from `SkipResolver` for consistency, or document the choice not to.

Fix:
```diff
-func SkipResolver() {
-    os.Setenv(envSkipResolver, "1")
+func SkipResolver() error {
+    if err := os.Setenv(envSkipResolver, "1"); err != nil {
+        return fmt.Errorf("setting %s: %w", envSkipResolver, err)
+    }
+    return nil
 }
```

Or, if the call is genuinely infallible at the platforms we care about, a brief `//nolint:errcheck` comment with rationale.

---

**I4. `httpClient.Timeout` overrides the `ctx` deadline silently — `pkg/kuberlr/downloader.go:38-42, 137-152`** (important, gap)

```go
var httpClient = &http.Client{Timeout: downloadTimeout}
```

```go
func streamGet(ctx context.Context, url string, w io.Writer) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
    ...
}
```

`http.Client.Timeout` covers the whole request including body read, regardless of the ctx. If the cobra-supplied `ctx` is canceled (e.g., user hits Ctrl+C, or some upstream timeout fires), the `http.Client.Do` honors ctx via the request context. If the ctx has a deadline shorter than 5 minutes, the deadline wins (the earlier of the two fires). If ctx has no deadline (the typical cobra case), the client's 5-minute timeout caps the request. So the two compose correctly.

But: the comment at line 40 says "the inherited cobra context carries no deadline." That is true today, but if someone later wires a deadline into the cobra context, the test assertion `assert_output --partial '404'` in the BATS suite would still pass (status comes back fast), while a slow-mirror scenario would silently switch from a 5-minute timeout to whatever the cobra timeout is. The combination is correct; the intent isn't immediately obvious. A short note that the two timeouts are intentionally redundant (whichever fires first wins) would make this less brittle for future readers.

Suggestion for a follow-up comment, severity SUGGESTION more than IMPORTANT — moving to Suggestions below.

(Withdrawing this as Important — moved to S2.)

---

**I5. Resolver probes on every cluster-bound `rdd kubectl` invocation — `pkg/kuberlr/resolver.go:78-88`** (important, gap)

```go
server, ok, err := serverVersion(args)
if err != nil {
    return "", err
}
if !ok {
    return "", nil
}
if compatible(embedded, server) {
    return "", nil
}
return ensureCached(ctx, server)
```

Every non-client-only `rdd kubectl` invocation (i.e., the common case) does a fresh `/version` round trip with a 2-second timeout cap. For an interactive workflow where the user runs many short kubectl commands in sequence, this adds latency proportional to network RTT to every command. A short-lived cache of `(server-URL, server-version, expires-at)` keyed on the resolved `flags.server` would eliminate the round trip for the steady-state case.

This is a performance gap rather than a correctness bug, and it composes with the existing 2-second timeout reasonably well. Worth flagging as a follow-up rather than a blocker; downgrading to Suggestion S3 below.

(Withdrawing this as Important — moved to S3.)

### Suggestions

**S1. `kubectl version` (without `--client`) skips the probe in the unit test but is treated as cluster-bound by `isClientOnly` — `pkg/kuberlr/resolver_test.go:98`** (suggestion, gap)

```go
{"version without --client", []string{"version"}, false},
```

`kubectl version` (no flag) prints client version, then asks the server for its version. With an unreachable cluster, modern kubectl exits non-zero. The current resolver classification (probe; if cluster reachable but skew differs, exec the downloaded binary which then makes its own server probe) is correct and matches expectations.

Worth adding a comment in `isClientOnly` near the `subcommand == "version"` branch: "we treat `kubectl version` (no --client) as cluster-bound because kubectl will probe the server for its version, and we want a version-matched binary to do that probe."

Fix: add a brief comment near `pkg/kuberlr/resolver.go:303-305`.

---

**S2. Document that `httpClient.Timeout` is a backstop, not the primary timeout — `pkg/kuberlr/downloader.go:33-42`** (suggestion, enhancement)

```go
const downloadTimeout = 5 * time.Minute
var httpClient = &http.Client{Timeout: downloadTimeout}
```

The current comment ("the inherited cobra context carries no deadline") explains why the cap exists today, but reading the code you cannot tell whether the cap is meant to override a future ctx deadline or coexist. Restate it as: "this cap is a backstop; if the ctx gains a shorter deadline later, that deadline wins." Costs zero today, and saves a future reader from debugging "why is my context-set timeout being ignored?"

---

**S3. Cache the server-version probe across invocations — `pkg/kuberlr/resolver.go:78-88`** (suggestion, enhancement)

For interactive workflows running `rdd kubectl` in tight loops, every invocation pays a 2-second-capped network round trip. A short TTL cache (per server URL or per `(kubeconfig, context)` tuple) keyed on the resolved flags would eliminate all but the first probe for a stable cluster. This is a follow-up optimization rather than a blocker; flagging here so the design observation isn't lost.

---

**S4. `streamGet` has no body size limit — `pkg/kuberlr/downloader.go:137-152`** (suggestion, enhancement)

```go
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("GET %s: %s", url, resp.Status)
}
_, err = io.Copy(w, resp.Body)
return err
```

A malicious or buggy mirror could stream gigabytes of zeros at `fetchSha512`'s `strings.Builder` (memory-bound) or at the kubectl download (disk-bound). For sha512: the response should be a single hex digest plus optional filename — bound it to a few hundred bytes. For the kubectl binary: bound it to e.g. 200 MB (current kubectl is ~50 MB).

Fix:
```diff
-    _, err = io.Copy(w, resp.Body)
+    _, err = io.Copy(w, io.LimitReader(resp.Body, maxBytes))
     return err
```

Pass `maxBytes` from the call site (`fetchSha512` uses 1 KiB; `download` uses 256 MiB). Low priority since the user triggers the fetch and `httpClient.Timeout` provides a bandwidth-time cap.

---

**S5. Package doc comment lives in `cache.go`, not in a dedicated file — `pkg/kuberlr/cache.go:5-10`** (suggestion, enhancement)

```go
// Package kuberlr resolves a kubectl binary compatible with the cluster
// targeted by the user's kubeconfig. ...
package kuberlr
```

The package doc-comment is attached to `cache.go`. Most Go projects place package docs in either `doc.go` or in the file most central to the package (here, `resolver.go`). Putting it in `cache.go` makes it slightly less discoverable, since most readers will open `resolver.go` first. Move to `resolver.go` (or to a dedicated `doc.go`) for findability. Trivial.

---

**S6. `strings.Fields` will accept a sha file with a leading filename — `pkg/kuberlr/downloader.go:123-133`** (suggestion, enhancement)

```go
fields := strings.Fields(sb.String())
if len(fields) == 0 {
    return "", fmt.Errorf("empty checksum response from %s", url)
}
return fields[0], nil
```

The comment at line 119-122 explains that this handles both `sha512sum`-style (`digest  filename`) and the bare-digest format. Good. But it also accepts arbitrary garbage at `fields[0]` if the response is e.g. an HTML 200 from a misconfigured mirror. The follow-up sha verification will catch this (the "garbage" will fail to match the computed digest), so this is a defense-in-depth concern rather than a bug. Optional: validate that `fields[0]` is exactly 128 hex characters before returning, with a clearer error than "checksum mismatch."

```diff
-    return fields[0], nil
+    digest := fields[0]
+    if len(digest) != sha512.Size*2 {
+        return "", fmt.Errorf("malformed checksum from %s: %d chars, want %d", url, len(digest), sha512.Size*2)
+    }
+    return digest, nil
```

---

**S7. `os.Setenv` panic on `CacheDir` may surprise — `pkg/kuberlr/cache.go:37-47`** (suggestion, gap)

```go
cache, err := os.UserCacheDir()
if err != nil {
    panic(fmt.Errorf("could not determine user cache directory: %w", err))
}
```

`os.UserCacheDir` fails when neither `$XDG_CACHE_HOME` nor `$HOME` is set on Unix, or when neither `%LocalAppData%` nor `%USERPROFILE%` is set on Windows — corner cases that can happen in CI or stripped-down container shells. Panicking on the path-discovery routine is a heavyweight failure mode for a side-effect-free getter; an `error` return would let the caller (`cmd/rdd/service_paths.go:32` and `pkg/kuberlr/resolver.go`'s call site) decide.

Note that the existing `instancePaths()` map at `cmd/rdd/service_paths.go:21-34` already mixes other panic-free path getters; making `CacheDir()` panic on misconfigured environments is inconsistent with siblings like `instance.LogDir()` etc. Worth aligning with the project style.

Low severity because the failure is rare and fail-fast is defensible. Suggesting a follow-up rather than blocking the merge.

---

**S8. `is_macos` branch in `write_kubectl_sha512` could just use `shasum` — `bats/tests/10-cli/7-kubectl-cache.bats:111-117`** (suggestion, enhancement)

```bash
write_kubectl_sha512() {
    if is_macos; then
        shasum -a 512 "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
    else
        sha512sum "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
    fi
}
```

`shasum -a 512` works on Linux as well as macOS (it's part of Perl's standard distribution). The OS-conditional could be dropped. Trivial.

---

**S9. `clientOnlySubcommands` is missing `api-resources`/`api-versions`/`explain` etc. — `pkg/kuberlr/resolver.go:253-258`** (suggestion, gap)

```go
var clientOnlySubcommands = map[string]bool{
    "completion": true,
    "config":     true,
    "options":    true,
    "help":       true,
}
```

`kubectl explain --recursive deployments.spec` is sometimes used purely for client-side inspection. `kubectl api-resources` and `kubectl api-versions` are cluster-bound (so they don't belong here). The current list is conservative (returns false → probe) which is the safe direction; misclassifying a cluster-bound command as client-only would silently skip the version skew check.

The comment at 269-271 already documents the conservative bias. No fix needed; mentioning here so the maintainer can confirm `explain` and any other purely-client subcommands don't need adding.

## Design Observations

**Concerns**

- **Resolver probe runs against the apiserver every invocation, even when args don't reach a cluster** (future). For example, `kubectl get -h` probes the cluster (returns false from `isClientOnly` because `-h` is parsed... actually wait, the loop hits `--help`/`-h` first and short-circuits). Re-reading: `-h` short-circuits at line 283-285 before subcommand recognition. So `kubectl get -h` is correctly classified as client-only. OK; this concern is unfounded. Leaving the note so a future reviewer can confirm.

- **Cache eviction is acknowledged TODO but worth surfacing as a design question** (future). The TODO at `cache.go:33-36` mentions per-version TTL/LRU. With one version cached per minor of every cluster the user touches, the cache could grow to ~1 GB if a user spans many minors. A simple "delete kubectl binaries not accessed in 30 days" sweep on each `rdd kubectl` invocation would suffice, and is cheap to implement.

**Strengths**

- The split between `cache.go` (path discovery), `downloader.go` (HTTP and integrity), `exec_unix.go`/`exec_windows.go` (process replacement), and `resolver.go` (orchestration + arg parsing) is clean and each file is independently understandable.
- Recursion guard via `RDD_KUBECTL_RESOLVED` is documented end-to-end (env var name, who sets it, why) in `docs/design/environment.md:33-35` and explained at the call sites.
- Tests for `compatible`, `extractKubeFlags`, and `isClientOnly` cover the tricky edge cases (--, equals form, alias form, end-of-args).
- The fake-kube test fixture as a sibling Go module is a clean way to keep test-only dependencies out of the main `go.mod`.

## Testing Assessment

Untested scenarios, ranked by risk:

1. **Concurrent downloads of the same version** (Windows). I2 above. No test reproduces two parallel `rdd kubectl` invocations missing the cache simultaneously; with `os.Rename` failing on Windows the second errors out.
2. **End-to-end exec of the downloaded binary on Windows.** The BATS test currently skips on Windows entirely (line 14-20). The Windows `Exec` helper at `exec_windows.go` uses `cmd.Run()` with `os.Exit(exitErr.ExitCode())` — this path has unit-test coverage neither in `resolver_test.go` nor in BATS.
3. **Kubectl plugins discovery from the cache directory.** If a user adds the cache directory to `PATH`, `kubectl-v1.30.0` could be picked up by kubectl's plugin discovery as a plugin named `v1.30.0`. Not a regression (the user has to add the dir to PATH) but worth a doc note.
4. **`info.GitVersion` from a vendor cluster** (EKS, GKE, AKS). `semver.ParseTolerant` should handle the suffixed forms (`v1.30.0-eks-1234`, `v1.30.7-gke.500`). A unit test case in `TestCompatible`-adjacent territory would lock in the contract.
5. **Compatibility downgrade scenario.** What if the embedded kubectl is `v1.35` and the user's cluster is `v1.32` (more than ±1 minor behind)? The resolver fetches `kubectl v1.32`. There's no test that the fetched version actually matches the server (vs. e.g., an off-by-one).

The `TestIsClientOnly` test could add `{"version --client equals form", []string{"version=--client"}, false}` etc. to lock the behavior down, but the existing 19 cases are already strong.

## Documentation Assessment

- `docs/design/environment.md` is updated correctly.
- `docs/design/cmd_service.md` is updated with `kubectl_cache`.
- `RDD_KUBECTL_CACHE` (path-export form) and `RDD_CACHE_DIR` (input env var) are easy to confuse. The two docs each explain their own variable but don't cross-reference — a one-line note in either would help.
- The `pkg/kuberlr` package docstring documents the package well.
- The `Resolve` docstring is unusually thorough and very welcome.

## Commit Structure

Single commit with a clear, descriptive message. Title and body match the diff. No issues.

## Acknowledged Limitations

- **Offline mode TODO** — `pkg/kuberlr/resolver.go:57-60` and `pkg/kuberlr/downloader.go:45-46`: when no cached kubectl matches and the network is down, the resolver fails with the download error. The author marks this as a TODO with a suggested API (`RDD_KUBECTL_OFFLINE` env or config flag, fall back to closest cached binary). The current PR makes the limitation more user-visible (resolver now exists at all), but the TODO accurately describes the gap. Not promoting to a finding.
- **Cache eviction TODO** — `pkg/kuberlr/cache.go:33-36`: the cache only grows. Author acknowledges and proposes per-version TTL/LRU as a follow-up. Not a finding.
- **`local_teardown_file` deviates from "no teardown_file" rule** — `bats/tests/10-cli/7-kubectl-cache.bats:90-98`: author explains the deviation in the comment (HTTP server bound to ephemeral port leaks if not reaped). Multiple existing BATS files also use `local_teardown_file` (limavm-instance, notary-external, crd-upgrade, etc.), so the deviation has prior art. Not promoting to a finding.
- **Windows BATS host deferred** — `bats/tests/10-cli/7-kubectl-cache.bats:15-20`: the cache lifecycle test is skipped on Windows BATS until a Windows BATS runner with Go is available. Documented limitation.

## Coverage Summary

| File | Status |
| --- | --- |
| `.github/dependabot.yml` | Reviewed, no issues |
| `bats/tests/10-cli/5-paths.bats` | Trivial (one constant updated) |
| `bats/tests/10-cli/7-kubectl-cache.bats` | I1, S8 |
| `bats/tests/10-cli/fake-kube/README.md` | Reviewed, no issues |
| `bats/tests/10-cli/fake-kube/go.mod` | Reviewed, no issues |
| `bats/tests/10-cli/fake-kube/kubectl/main.go` | Reviewed, no issues |
| `bats/tests/10-cli/fake-kube/server/main.go` | Reviewed, no issues |
| `cmd/rdd/kubectl.go` | Reviewed, no issues (interacts with I3 via `SkipResolver`) |
| `cmd/rdd/service_paths.go` | Reviewed, no issues (interacts with S7) |
| `docs/design/cmd_service.md` | Reviewed, no issues |
| `docs/design/environment.md` | Reviewed, no issues |
| `go.mod` | Trivial (semver moved from indirect to direct) |
| `pkg/kuberlr/cache.go` | S5, S7 |
| `pkg/kuberlr/downloader.go` | I2, S2, S4, S6 |
| `pkg/kuberlr/exec_unix.go` | Reviewed, no issues |
| `pkg/kuberlr/exec_windows.go` | Reviewed, no issues (testing gap noted under Testing Assessment item 2) |
| `pkg/kuberlr/resolver.go` | I3, S1, S3, S9 |
| `pkg/kuberlr/resolver_test.go` | S1 |
### [Codex] — Pass 1

### Executive Summary

This PR adds a kubectl version resolver with cache/download behavior and wires it into `rdd kubectl`. The main implementation is close, but the argument scanner currently diverges from kubectl for auth/TLS and bool flag handling, and one new BATS test contradicts the resolver’s intentional client-only fast path.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Resolver probes without kubectl auth/TLS overrides** — `pkg/kuberlr/resolver.go:146-155` (important, regression)

```go
overrides := &clientcmd.ConfigOverrides{
	CurrentContext: flags.context,
	ClusterInfo:    clientapi.Cluster{Server: flags.server},
	Context:        clientapi.Context{Cluster: flags.cluster, AuthInfo: flags.user},
}
```

`loadClientConfig` only carries kubeconfig/context/server/cluster/user into the probe at lines 147-153. However kubectl also accepts connection/auth overrides such as `--token`, `--certificate-authority`, `--client-certificate`, `--client-key`, `--tls-server-name`, `--username`, `--password`, and `--insecure-skip-tls-verify`; this resolver even knows several of those consume values at lines 225-233, but `extractKubeFlags` never records them. A user running `rdd kubectl --server https://... --token ... --insecure-skip-tls-verify get pods` will have kubectl connect successfully, while the resolver probes without those settings, fails at `ServerVersion` line 129, returns `ok=false` at line 132, and falls through to the embedded kubectl at lines 82-83 even when the cluster is out of skew.

Fix: parse and populate the same clientcmd override fields kubectl uses, or reuse kubectl/client-go’s config flag machinery so the probe and child command share semantics.

```diff
 type kubeFlags struct {
 	kubeconfig string
 	context    string
 	server     string
 	cluster    string
 	user       string
+	token      string
+	caFile     string
+	certFile   string
+	keyFile    string
+	tlsName    string
+	username   string
+	password   string
+	insecure   *bool
 }
 
 overrides := &clientcmd.ConfigOverrides{
 	CurrentContext: flags.context,
-	ClusterInfo:    clientapi.Cluster{Server: flags.server},
+	ClusterInfo: clientapi.Cluster{
+		Server:                flags.server,
+		CertificateAuthority:  flags.caFile,
+		TLSServerName:         flags.tlsName,
+		InsecureSkipTLSVerify: flags.insecure != nil && *flags.insecure,
+	},
 	Context:        clientapi.Context{Cluster: flags.cluster, AuthInfo: flags.user},
+	AuthInfo:       clientapi.AuthInfo{Token: flags.token, ClientCertificate: flags.certFile, ClientKey: flags.keyFile, Username: flags.username, Password: flags.password},
 }
```

I2. **`version --client=true` is treated as server-contacting** — `pkg/kuberlr/resolver.go:286-304` (important, regression)

```go
if arg == "--client" {
	sawClient = true
	continue
}
if strings.HasPrefix(arg, "-") {
	// ...
	continue
}
```

`isClientOnly` sets `sawClient` only for the exact `--client` token at lines 286-288. The valid bool form `--client=true` falls into the generic flag branch at lines 290-294, so `version --client=true` reaches line 304 with `sawClient=false` and the resolver may probe, download, or fail a mirror lookup for a command that should not need the server. Repeated bool flags also do not follow kubectl’s “last flag wins” behavior.

Fix: parse `--client=<bool>` and let later occurrences overwrite earlier ones.

```diff
+import "strconv"
+
 if arg == "--client" {
 	sawClient = true
 	continue
 }
+if strings.HasPrefix(arg, "--client=") {
+	if v, err := strconv.ParseBool(strings.TrimPrefix(arg, "--client=")); err == nil {
+		sawClient = v
+	}
+	continue
+}
```

I3. **BATS 500 test asserts a probe for a skipped client-only command** — `bats/tests/10-cli/7-kubectl-cache.bats:190-205` (important, regression)

```bash
run -0 env \
    RDD_CACHE_DIR="${CACHE_DIR}" \
    RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
    KUBECONFIG="${KUBECONFIG_PATH}" \
    rdd kubectl version --client

assert_file_contains "${LOG_FILE}" '^GET /version'
```

The command at line 197 is `version --client`, and `Resolve` intentionally returns before probing for client-only commands at `pkg/kuberlr/resolver.go:65-66`. Therefore the assertion at line 204 expects a `/version` request that should never happen. This test will fail once the BATS harness runs, and even if adjusted to pass, it currently does not exercise the “apiserver returns 500” fall-through path named by the test.

Fix: use a server-contacting command and assert the embedded kubectl path is used after the failed probe.

```diff
-    run -0 env \
+    run env \
         RDD_CACHE_DIR="${CACHE_DIR}" \
         RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
         KUBECONFIG="${KUBECONFIG_PATH}" \
-        rdd kubectl version --client
+        rdd kubectl get pods
 
-    assert_output
+    assert_failure
     refute_output --partial 'fake-kubectl:'
     refute_output --partial 'resolving kubectl version'
```

#### Suggestions

S1. **Cold-cache concurrent downloads can fail on Windows rename** — `pkg/kuberlr/downloader.go:57-70`, `pkg/kuberlr/downloader.go:95-116` (suggestion, regression)

```go
if _, err := os.Stat(path); err == nil {
	return path, nil
}
// ...
return os.Rename(tmpName, dst)
```

The shared cache check at line 59 and publish at line 116 are not coordinated. On Unix, a second concurrent cold-cache download can rename over the first result; on Windows, `os.Rename` fails when `dst` already exists. The next invocation self-heals because line 59 sees the file, but one of two parallel first-time invocations returns a resolver error.

Fix: after a rename failure, if `dst` now exists, treat the race as a cache hit; otherwise return the original error. A lock file would also avoid duplicate downloads.

S2. **Service paths example still shows `config.json`** — `docs/design/cmd_service.md:153-156` (suggestion, regression)

```console
args_file      /path/to/rancher-desktop-default/rdd.args
config         /path/to/rancher-desktop-default/config.json
dir            /path/to/rancher-desktop-default
kubectl_cache  /path/to/Caches/rancher-desktop/kubectl/darwin-arm64
```

The implementation reports `config.yaml` via `instance.Config()` and `rdd svc paths config`; the updated example at line 154 still says `config.json`.

Fix: change the example to `config.yaml`.

### Design Observations

**Concerns** — The manual kubectl argument scanner is now responsible for mirroring kubectl flag semantics, and I1/I2 are both drift from that contract. Reusing `clientcmd`/kubectl config flag parsing where possible would reduce the edge-case surface.

**Strengths** — The download path hashes the bytes as downloaded before chmod/rename, and temp-file publishing avoids treating partial downloads as valid cache entries. The resolver also deliberately keeps probe failures separate from download/checksum failures, which preserves embedded kubectl behavior when the cluster is unreachable.

### Testing Assessment

1. Missing coverage for `--server` plus auth/TLS override flags, which is the scenario in I1.
2. Missing coverage for bool flag forms such as `version --client=true`, `version --client=false`, and repeated `--client` flags, covered by I2.
3. The new BATS apiserver-500 test is currently invalid, covered by I3.
4. No Windows coverage for the cache publish race in S1.

Verification run: `make build-rdd` passed; `go test ./pkg/kuberlr` passed. I did not run BATS because the BATS submodules are empty in this checkout.

### Documentation Assessment

S2 is the only documentation issue found. The new Dependabot `directories` syntax matches GitHub’s current Dependabot options reference: https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference

### Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/resolver.go:57` and `pkg/kuberlr/downloader.go:45` acknowledge offline mode is deferred. This is consistent with the current explicit “download failures are loud” policy.
- **Code comment**: `pkg/kuberlr/cache.go:33` notes the cache currently grows without eviction. This matters only after many distinct Kubernetes versions accumulate.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:91` explicitly acknowledges the `local_teardown_file` convention deviation for the fake HTTP server. I did not raise it as a finding because the comment distinguishes it from preserving RDD control-plane state.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats` — Reviewed, no issues
- `bats/tests/10-cli/7-kubectl-cache.bats` — I3
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `docs/design/cmd_service.md` — S2
- `docs/design/environment.md` — Reviewed, no issues
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — Reviewed, no issues
- `pkg/kuberlr/downloader.go` — S1
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — I1, I2, I3
- `pkg/kuberlr/resolver_test.go` — I2

### [Gemini] — Pass 1

### Executive Summary
This change introduces a version-aware `kubectl` execution path (`kuberlr`-style) for `rdd kubectl`. When the targeted cluster's minor version falls outside the supported skew of the embedded `kubectl`, the resolver downloads, caches, and executes a version-matched binary from a release mirror. The implementation effectively handles discovery, cache resolution, and argument translation, though it introduces a Windows-specific file replacement race condition and an unbounded memory read that should be tightened before merge.

### Findings

#### Critical Issues
None.

#### Important Issues
I1. **`os.Rename` fails on Windows when the cached binary already exists** — `pkg/kuberlr/downloader.go:115`

```go
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
```
When two `rdd kubectl` invocations run concurrently and cache the same version, both processes download to temporary files and call `os.Rename(tmpName, dst)`. On Unix, the second rename atomically replaces the first. On Windows, `os.Rename` fails if the destination already exists. The second invocation will fail entirely and return an error to the user, despite a valid cached binary having just been successfully written to disk by the other process.

Fix: Check if the destination already exists when `os.Rename` fails. If it does, the rename failure is benign (another process won the race).

```diff
-	return os.Rename(tmpName, dst)
+	if err := os.Rename(tmpName, dst); err != nil {
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil // Another process populated the cache
+		}
+		return err
+	}
+	return nil
```
`(important, regression)`

I2. **Unbounded in-memory read in `fetchSha512` risks OOM** — `pkg/kuberlr/downloader.go:123-134`

```go
func fetchSha512(ctx context.Context, url string) (string, error) {
	var sb strings.Builder
	if err := streamGet(ctx, url, &sb); err != nil {
		return "", err
	}
	fields := strings.Fields(sb.String())
```
`fetchSha512` uses `streamGet` (which uses `io.Copy`) to read the entire checksum response into a `strings.Builder`. If an alternate mirror configured via `RDD_KUBECTL_MIRROR` is malicious or misconfigured and returns a never-ending stream, the process will consume memory until it OOM-crashes. Because `streamGet` abstracts the `resp.Body` behind an `io.Writer` interface, fixing this safely requires open-coding the HTTP request.

Fix: Open-code the request and use `io.LimitReader` to cap the read to a safe size (e.g., 4096 bytes).

```go
func fetchSha512(ctx context.Context, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
	if err != nil {
		return "", err
	}
	resp, err := httpClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET %s: %s", url, resp.Status)
	}
	b, err := io.ReadAll(io.LimitReader(resp.Body, 4096))
	if err != nil {
		return "", err
	}
	fields := strings.Fields(string(b))
	if len(fields) == 0 {
		return "", fmt.Errorf("empty checksum response from %s", url)
	}
	return fields[0], nil
}
```
`(important, gap)`

#### Suggestions
S1. **Avoid panicking on cache directory discovery failures** — `pkg/kuberlr/cache.go:38-40`

```go
		cache, err := os.UserCacheDir()
		if err != nil {
			panic(fmt.Errorf("could not determine user cache directory: %w", err))
		}
```
Library code generally should not panic. While a missing `HOME` directory is a fatal startup condition, panicking abruptly crashes commands like `rdd svc paths` instead of letting the CLI framework surface the error gracefully.

Fix: Refactor `CacheDir` to return `(string, error)` and handle the failure gracefully in callers.
`(suggestion, enhancement)`

### Design Observations

**Concerns**
- **Double execution of client auth plugins**: `serverVersion` builds a complete `rest.Config` and issues a `/version` request. For clusters secured by `exec` auth plugins (e.g., AWS IAM Authenticator, gcloud, kubelogin), this executes the plugin. Once the resolver delegates to the downloaded `kubectl`, the child process executes the plugin *again*. For plugins requiring interactive MFA, this prompts the user twice; for others, it doubles latency. This is a known architectural trade-off of `kuberlr`-style interception wrappers. `(in-scope)`

**Strengths**
- The `isClientOnly` check correctly bypasses the cluster probe for local operations, preserving the responsive feel of local `kubectl` subcommands and handling offline environments cleanly.
- The use of BATS with a standalone Go fake apiserver accurately simulates the network semantics and binary caching behaviors without introducing heavy test dependencies.

### Testing Assessment
1. The Windows behavior for `os.Rename` is untested locally because the fake apiserver wiring is skipped on Windows BATS hosts. Testing concurrent `rdd kubectl` invocations locally on Windows would immediately surface the `ERROR_ALREADY_EXISTS` race condition identified in I1.
2. The error handling for the checksum download (`fetchSha512`) does not test the edge case of a misbehaving server (e.g., a server returning an endless stream) to catch unbounded memory growth conditions.

### Documentation Assessment
No major documentation gaps. The new configuration paths and caching behaviors are accurately covered in `docs/design/cmd_service.md` and `docs/design/environment.md`.

### Acknowledged Limitations
- **`local_teardown_file` deviation**: The author acknowledged deviating from the strict BATS teardown rule.
  - Code comment (`bats/tests/10-cli/7-kubectl-cache.bats:89`): `# Deviates from the "no teardown_file" rule. The rule's intent is to preserve rdd state for post-mortem inspection; an HTTP server bound to an ephemeral port is the opposite...`
  - This exception is sensible given the fixture's nature.
- **Windows BATS wiring deferred**: The author deferred Windows E2E tests for the fake server.
  - Code comment (`bats/tests/10-cli/7-kubectl-cache.bats:15`): `# Defer until we have a Windows BATS runner with Go.`
- **Unknown flag classification bias**: The author acknowledged that treating unknown long flags as no-value might temporarily misclassify the command structure but relies on `kubectl` ultimately rejecting unknown globals.
  - Code comment (`pkg/kuberlr/resolver.go:284`): `// The conservative bias keeps Resolve probing on commands isClientOnly cannot identify; misclassifying as client-only would skip the version skew check...`
- **Offline Mode Deferred**: The author explicitly deferred adding an offline mode override.
  - Code comment (`pkg/kuberlr/resolver.go:54`): `// TODO(offline): when no cached kubectl matches and the network is down... Add an opt-out for download attempts`
- **Cache Eviction Deferred**: Unbounded cache accumulation is noted for future improvement.
  - Code comment (`pkg/kuberlr/cache.go:30`): `// TODO(eviction): the cache only grows...`

### Coverage Summary
- `.github/dependabot.yml`: Trivial
- `bats/tests/10-cli/5-paths.bats`: Trivial
- `bats/tests/10-cli/7-kubectl-cache.bats`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md`: Trivial
- `bats/tests/10-cli/fake-kube/go.mod`: Trivial
- `bats/tests/10-cli/fake-kube/kubectl/main.go`: Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go`: Reviewed, no issues
- `cmd/rdd/kubectl.go`: Reviewed, no issues
- `cmd/rdd/service_paths.go`: Reviewed, no issues
- `docs/design/cmd_service.md`: Trivial
- `docs/design/environment.md`: Trivial
- `go.mod`: Trivial
- `pkg/kuberlr/cache.go`: S1
- `pkg/kuberlr/downloader.go`: I1, I2
- `pkg/kuberlr/exec_unix.go`: Reviewed, no issues
- `pkg/kuberlr/exec_windows.go`: Reviewed, no issues
- `pkg/kuberlr/resolver.go`: Reviewed, no issues
- `pkg/kuberlr/resolver_test.go`: Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-05-03 |
| **Commit** | `563ff15 rdd kubectl: embed kuberlr-style version resolver` (squashed via --fixup into 75346bd) |

| # | Finding | Action | Reason |
|---|---------|--------|--------|
| 1 | Important #1: BATS apiserver-500 test asserts a /version GET that the Round 2 isClientOnly short-circuit skips | Fixed | Subagent caught a follow-up flaw: `kubectl version` itself queries /version, so the simple assertion no longer discriminated probe-ran from probe-skipped. Final fix counts /version requests (>= 2 expected). |
| 2 | Important #2: `isClientOnly` matches `--client` exactly but not `--client=true` | Fixed | Subagent caught the same equals-form bug for --help/-h; extended fix with a shared boolFlagValue helper covering both. |
| 3 | Important #3: extractKubeFlags extension covered cluster-selection but not auth/TLS overrides | Fixed | Refactored loadClientConfig to use genericclioptions.ConfigFlags + pflag. Subagent caught WithDeprecatedPasswordFlag's interactive-prompt risk; dropped that binding. |
| 4 | Suggestion #1: cmd_service.md example shows config.json instead of config.yaml | Fixed | |
| 5 | Suggestion #2: streamGet has no body-size cap | Fixed | Added maxSha512Bytes (4 KiB) and maxKubectlBytes (256 MiB) caps via io.LimitReader; added 128-hex digest validation in fetchSha512 for clearer errors than misleading "checksum mismatch". |
| 6 | Suggestion #3: SkipResolver swallows os.Setenv error | Skipped | Constant env name and value cannot fail; finding is stylistic only. |
| 7 | Suggestion #4: comment why bare `version` is cluster-bound | Fixed | |
| 8 | Suggestion #5: document httpClient.Timeout as a backstop | Fixed | |
| 9 | Suggestion #6: cache the server-version probe across invocations | Skipped | Performance optimization; review explicitly framed as follow-up. Probe cost is bounded by the 2s timeout. |
| 10 | Suggestion #7: move package doc from cache.go to resolver.go | Skipped | Cosmetic; package doc is discoverable in either location via `go doc` or pkg.go.dev. |
| 11 | Suggestion #8: drop is_macos branch in write_kubectl_sha512 | Skipped | Both branches produce identical output; conditional is harmless. `sha512sum` is the more idiomatic choice on Linux. |
| 12 | Testing Gap #1: vendor GitVersion (EKS, GKE, AKS) unit test rows | Test written | Added EKS/GKE/AKS rows to TestCompatible. |
| 13 | Lint: forbidigo on t.Fatalf | Fixed | Used assert.NilError instead. |
| 14 | Lint: staticcheck SA1019 ParseErrorsWhitelist deprecation | Fixed | Renamed to ParseErrorsAllowlist. |