# Deep Review: 20260502-225351-pr-348

| | |
|---|---|
| **Date** | 2026-05-02 22:53 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 1 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#348](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348) — rdd kubectl: embed kuberlr-style version resolver |
| **Branch** | `embed-kuberlr` |
| **Commits** | `9624299` rdd kubectl: embed kuberlr-style version resolver |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh) |
| **Verdict** | **Merge with fixes** — fix download timeout and BATS sha512-fixture restoration before merging; the rest are small follow-ups. |
| **Wall-clock time** | `44 min 29 s` |

## Executive Summary

The PR embeds a kuberlr-style kubectl version resolver: when the cluster is more than ±1 minor out of skew, `rdd kubectl` downloads a version-matched kubectl from `dl.k8s.io`, sha512-verifies it against the published checksum, caches it under `os.UserCacheDir()`, and execs it in place. Within skew, the embedded kubectl runs. The implementation is small (~280 lines), the responsibilities are cleanly partitioned across `cache.go`/`downloader.go`/`resolver.go`/`exec_*.go`, and the BATS lifecycle test is a good fit for the resolver's contract.

The two Important findings are both narrow: the download HTTP path lacks a per-request deadline (a hung mirror wedges the CLI), and the sha512-mismatch BATS test corrupts the shared mirror fixture without restoring it. Neither blocks merge. Suggestions cover smaller policy/style points and a couple of TODO-worthy enhancements (cache eviction, doc updates).

Structure: 0 Critical · 2 Important · 11 Suggestions.

## Critical Issues

None.

## Important Issues

I1. **Mirror download has no deadline; a slow or hung mirror wedges the CLI** — `pkg/kuberlr/downloader.go:119-134` [Codex GPT 5.5][Claude Opus 4.7]

```go
func streamGet(ctx context.Context, url string, w io.Writer) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
```

`http.DefaultClient` carries no `Timeout`, and the inherited `cmd.Context()` from cobra has no deadline either. The apiserver probe at `pkg/kuberlr/resolver.go:91` explicitly sets `cfg.Timeout = serverProbeTimeout` so an unreachable cluster fails fast, but the checksum and binary download paths have no equivalent. A wedged TCP connection or a byte-trickling mirror will hang `rdd kubectl` until the user kills it — the same failure mode the apiserver-probe timeout was added to prevent.

Both checksum and binary GETs use the same `streamGet`, so a single bounded deadline (or a private `*http.Client` instance) covers both. Since the kubectl binary is ~50 MB, pick a deadline generous enough for slow links but short enough that the user notices a hang in minutes, not hours.

Fix:

```diff
+const downloadTimeout = 5 * time.Minute
+
+var httpClient = &http.Client{Timeout: downloadTimeout}
+
 func streamGet(ctx context.Context, url string, w io.Writer) error {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
 	if err != nil {
 		return err
 	}
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
```

Codex flagged this as Important; Claude raised the same point as a Suggestion (S4). I align with Codex on severity — an indefinite hang is a real operator-facing issue, not a stylistic gap.

I2. **BATS sha512-mismatch test corrupts the shared mirror fixture without restoring it** — `bats/tests/10-cli/7-kubectl-cache.bats:169-186` [Claude Opus 4.7]

```bash
@test 'rdd kubectl errors on sha512 mismatch' {
    echo "00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" \
        >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
    ...
}
```

`local_setup_file()` writes the real sha512 once; `local_setup()` resets `CACHE_DIR`, `LOG_FILE`, and the `/version` override files but does not restore `${MIRROR_BIN_DIR}/kubectl${EXE}.sha512`. The mismatch test overwrites that fixture with zeroes and never puts it back. The next test (`apiserver returns 500 → fall through to embedded`) runs `kubectl version --client` and never triggers a download, so the corruption is invisible today. Any future test that relies on a working mirror will fail at random distance from the cause.

Fix: restore the published checksum in `local_setup`, or factor the `shasum`/`sha512sum` block out of `local_setup_file` into a helper used by both:

```diff
 local_setup() {
     rm -rf "${CACHE_DIR}"
     : >"${LOG_FILE}"
     rm -f "${GIT_VERSION_FILE}" "${VERSION_STATUS_FILE}"
+    if is_macos; then
+        shasum -a 512 "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
+    else
+        sha512sum "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
+    fi
 }
```

## Suggestions

S1. **Flag scanner does not stop at kubectl's `--` argument delimiter** — `pkg/kuberlr/resolver.go:124-138` [Codex GPT 5.5][Claude Opus 4.7]

```go
for i, arg := range args {
    switch {
    case arg == "--kubeconfig" && i+1 < len(args):
        kubeconfig = args[i+1]
    case strings.HasPrefix(arg, "--kubeconfig="):
        kubeconfig = strings.TrimPrefix(arg, "--kubeconfig=")
    case arg == "--context" && i+1 < len(args):
        context = args[i+1]
    case strings.HasPrefix(arg, "--context="):
        context = strings.TrimPrefix(arg, "--context=")
    }
}
```

`extractKubeFlags` scans every argument. `cmd/rdd/kubectl.go:60` passes the same raw args to `Resolve` and to the kubectl child, but kubectl treats arguments after `--` as positional. `rdd kubectl exec mypod -- mycmd --kubeconfig=/other` makes the resolver probe `/other`, while the real kubectl invocation still targets the original kubeconfig. The current consequence is benign — a stray probe goes to a context the user never asked for, returns `ok=false`, and the resolver falls through — but the parsed values lie about the user's intent.

Fix: stop scanning at the first `--`, and add unit tests covering both the spaced and `=` forms after the delimiter.

```diff
 for i, arg := range args {
+    if arg == "--" {
+        break
+    }
     switch {
```

S2. **`serverVersion` parse failure errors out instead of falling through** — `pkg/kuberlr/resolver.go:100-103` [Claude Opus 4.7]

```go
v, err := semver.ParseTolerant(info.GitVersion)
if err != nil {
    return semver.Version{}, false, fmt.Errorf("parsing server version %q: %w", info.GitVersion, err)
}
```

The other three failure modes — no kubeconfig, malformed config, unreachable cluster — return `ok=false, err=nil` and let the embedded kubectl handle client-only commands. An unparseable `info.GitVersion` is logically the same situation: with no comparable version, the resolver cannot make an informed decision, so the documented fall-through policy applies. Today this branch errors out and blocks every kubectl invocation against that cluster, including `kubectl config view` and `kubectl version --client` that never needed a server version.

Fix:

```diff
     v, err := semver.ParseTolerant(info.GitVersion)
     if err != nil {
-        return semver.Version{}, false, fmt.Errorf("parsing server version %q: %w", info.GitVersion, err)
+        return semver.Version{}, false, nil //nolint:nilerr // unparseable server version: fall through to embedded
     }
```

S3. **`compatible` checks Minor only; cross-major skew slips through** — `pkg/kuberlr/resolver.go:70-73` [Claude Opus 4.7]

```go
func compatible(client, server semver.Version) bool {
    diff := int64(client.Minor) - int64(server.Minor)
    return diff >= -1 && diff <= 1
}
```

A future server at `v2.30.0` against an embedded kubectl `v1.30.0` computes `diff = 0` and returns `true`, silently routing through a major-mismatched embedded kubectl. Kubernetes 2.x has no realistic horizon, so the impact today is theoretical, but the fix is one line and the unit table at `resolver_test.go:26-33` only covers `Major == 1`.

Fix:

```diff
 func compatible(client, server semver.Version) bool {
+    if client.Major != server.Major {
+        return false
+    }
     diff := int64(client.Minor) - int64(server.Minor)
     return diff >= -1 && diff <= 1
 }
```

Add a test case: `{"different major", v(1, 30), semver.Version{Major: 2, Minor: 30}, false}`.

S4. **New user-facing path key and env vars are not in the design docs** — `docs/design/cmd_service.md`, `docs/design/environment.md` [Codex GPT 5.5]

```go
"kubectl_cache": kuberlr.CacheDir(),
```

The PR adds the `kubectl_cache` service path (`cmd/rdd/service_paths.go:32`) and introduces operator-facing `RDD_CACHE_DIR` (`pkg/kuberlr/cache.go:22`) and `RDD_KUBECTL_MIRROR` (`pkg/kuberlr/downloader.go:30`). `docs/design/cmd_service.md` lists no `kubectl_cache` key, and `docs/design/environment.md` describes neither variable. Verified against the worktree: neither file contains the new identifiers.

Fix: add `kubectl_cache` to the paths table in `cmd_service.md`, and document `RDD_CACHE_DIR` (rdd-wide cache root with kubectl appending its own subdirectory) and `RDD_KUBECTL_MIRROR` (alternate release mirror, default `https://dl.k8s.io`) in `environment.md`. The path-vars table in `environment.md` should also gain `RDD_KUBECTL_CACHE` so `rdd svc paths --output=shell` is fully described.

S5. **`RDD_CACHE_DIR` is documented as the rdd-wide cache root, but only kubectl writes to it** — `pkg/kuberlr/cache.go:19-21` [Claude Opus 4.7]

```go
// envCacheDir lets tests (and operators) override the rdd-wide cache root.
// Anticipates a shared rdd cache that future consumers (k3s images, lima
// templates) can join — kuberlr appends its own subdirectory.
const envCacheDir = "RDD_CACHE_DIR"
```

The variable name promises an rdd-wide root, but kubectl is the sole consumer. If a future cache user adopts a different env var, the "rdd-wide" promise breaks and operators have to learn a second variable. Either rename to `RDD_KUBECTL_CACHE_DIR` so the name matches the actual scope (and reserve `RDD_CACHE_DIR` for the day a real shared cache lands), or scope the comment: "kubectl cache root; future consumers may layer their own subdirectories under the same value."

S6. **BATS test uses `grep -q` where the suite uses `assert_file_contains`** — `bats/tests/10-cli/7-kubectl-cache.bats:127-129, 145-148, 184-185, 202-205` [Claude Opus 4.7]

```bash
grep -q '^GET /version' "${LOG_FILE}"
...
if grep -q '/release/' "${LOG_FILE}"; then
    fail "expected no GET to /release/, but log contained: $(cat "${LOG_FILE}")"
fi
```

`bats/helpers/utils.bash:91` defines `refute_file_contains`; the suite-wide `assert_file_contains` is used by `bats/tests/20-service/4-start-parameters.bats` and `bats/tests/20-service/5-port-fallback.bats`. The bats-assert idioms surface both the expected pattern and the actual file contents on failure, while bare `grep -q` only sets the exit status.

Fix sketch:

```diff
-    grep -q '^GET /version' "${LOG_FILE}"
-    grep -q "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}.sha512$" "${LOG_FILE}"
-    grep -q "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}$" "${LOG_FILE}"
+    assert_file_contains "${LOG_FILE}" '^GET /version'
+    assert_file_contains "${LOG_FILE}" "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}.sha512$"
+    assert_file_contains "${LOG_FILE}" "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}$"
```

S7. **GOOS/GOARCH set on the kubectl build, not on the server build** — `bats/tests/10-cli/7-kubectl-cache.bats:33-42` [Claude Opus 4.7]

```bash
(
    cd "${BATS_TEST_DIRNAME}/fake-kube"
    GOOS=${GOOS} GOARCH=${GOARCH} \
        go build -ldflags='-s -w' \
        -o "${MIRROR_BIN_DIR}/kubectl${EXE}" \
        ./kubectl
    go build -ldflags='-s -w' \
        -o "${BATS_FILE_TMPDIR}/fake-kube-server" \
        ./server
)
```

Both invocations target the host today because `GOOS=$(go env GOOS)` returns the host value, but the asymmetry hints at a forgotten cross-compile concern. Drop the explicit env on the kubectl build (host targets host) or pass it on both for consistency.

S8. **Server's `log.Fatal*` calls bypass the `defer logFD.Close()`** — `bats/tests/10-cli/fake-kube/server/main.go:51-93` [Claude Opus 4.7]

```go
log.Fatal("--root, --port-file, and --log-file are required")
...
defer logFD.Close()
...
log.Fatalf("listen: %v", err)
log.Fatalf("writing port file: %v", err)
log.Fatalf("serve: %v", err)
```

`log.Fatal*` calls `os.Exit(1)`, which skips deferred functions. The `defer logFD.Close()` at line 62 reads as if it cleans up on every exit path; in practice it runs only on a clean `Serve` return, which never happens during a test run. The cleanup is harmless inside a fixture, but the deferred call misleads a future reader. Either drop the defer (the OS reclaims the fd anyway) or replace `log.Fatal*` with `log.Print*` followed by an explicit `os.Exit(1)` after a single deferred cleanup block.

S9. **PR description "rdd ctl stays untouched" overstates** — `cmd/rdd/kubectl.go:35-46, 59-82` [Claude Opus 4.7]

```go
func ctlAction(cmd *cobra.Command, args []string) error {
    ...
    return kubectlAction(cmd, args)
}

func kubectlAction(cmd *cobra.Command, args []string) error {
    path, err := kuberlr.Resolve(cmd.Context(), args)
    ...
}
```

`rdd ctl` shares `kubectlAction`, so it now performs a `serverVersion` probe against the embedded apiserver on every invocation. The probe always reports the matching version, `compatible` returns true, and `Resolve` returns `""` — functional behavior is unchanged. The PR description's "rdd ctl stays untouched" is technically inaccurate.

Either update the description to "rdd ctl behavior is unchanged; the resolver always falls through because the embedded apiserver and kubectl share a version", or short-circuit `Resolve` for the embedded path. A check against `instance.Config()` would skip the (small) extra probe; it is purely cosmetic since the probe goes to a localhost socket.

S10. **Cache grows without bound** — `pkg/kuberlr/cache.go`, `pkg/kuberlr/downloader.go` [Claude Opus 4.7]

`ensureCached` only adds; no eviction, no GC, no per-version TTL. Each downloaded kubectl is roughly 50 MB. A user who switches across several remote clusters over months accumulates indefinitely. Acceptable for v1, but a `TODO(eviction)` near the existing `TODO(offline)` comment lets a future consumer reuse the framing instead of reinventing it.

S11. **`CacheDir()` panics on `os.UserCacheDir()` failure; `instancePaths()` cannot recover** — `pkg/kuberlr/cache.go:35-38`, `cmd/rdd/service_paths.go:21-34` [Claude Opus 4.7]

```go
cache, err := os.UserCacheDir()
if err != nil {
    panic(fmt.Errorf("could not determine user cache directory: %w", err))
}
```

`CacheDir()` is invoked from `instancePaths()`, which `rdd svc paths` calls. `os.UserCacheDir()` errors only when neither `$HOME` nor `$XDG_CACHE_HOME` (Linux) nor `%LocalAppData%` (Windows) resolves — extremely rare, but a panic in `rdd svc paths` is a poor error UX for a path-listing command. Either fall back to `os.TempDir()` with a logged warning, or change `CacheDir` to return `(string, error)` so the path command surfaces a normal error.


## Design Observations

### Concerns

- **Manual kubectl flag parsing carries drift risk (in-scope)** [Codex GPT 5.5]. S1 above is one consequence of the broader pattern: `extractKubeFlags` approximates kubectl's flag semantics by hand. If the resolver ever needs more flags (`--cluster`, `--user`, `--namespace`), the surface area for divergence grows. Delegating to `clientcmd`'s flag-binding helpers or a pflag-based parser keeps the resolver and the kubectl child aligned by construction.

### Strengths

- **Clean separation of concerns** [Claude Opus 4.7]. `cache.go` (path layout), `downloader.go` (HTTP and sha512), `resolver.go` (skew policy), `exec_*.go` (process launch). Each file owns one responsibility; the boundary between policy and I/O is easy to follow.
- **Hash-as-you-stream verification** [Claude Opus 4.7][Codex GPT 5.5]. `download` writes through `io.MultiWriter(tmp, h)` so the digest covers exactly the bytes received, before chmod or rename. Failed downloads leave only a temp file, which `defer os.Remove(tmpName)` cleans up. Cache hits are never poisoned by partial writes.
- **Recursion guard via env** [Claude Opus 4.7]. `RDD_KUBECTL_RESOLVED` set in `Exec` (`exec_unix.go:19`, `exec_windows.go:24`) prevents a downloaded kubectl that re-execs rdd through a shim from looping back into `Resolve`.
- **Documented fall-through policy** [Claude Opus 4.7][Codex GPT 5.5]. Each `nilerr` site in `serverVersion` carries a `//nolint:nilerr` comment naming the legitimate reason for silent fall-through. The contrast with the loud download-error policy is explicit in both the `Resolve` doc and the comment in `kubectlAction`, so a future maintainer does not have to reverse-engineer the intent.
- **Test fixture design** [Claude Opus 4.7]. One Go server playing both apiserver `/version` and release-mirror `/release/*` (with per-test override files) is a low-friction way to drive the resolver under deterministic conditions without restarting the server between cases. Per-test `git-version` and `version-status` overrides keep tests fast and isolated.

## Testing Assessment

Coverage is solid for host-platform behavior. Five BATS cases exercise download-on-miss, skip-on-hit, mirror-404, sha-mismatch, and apiserver-500 fall-through; Go unit tests cover `compatible`, `cachedPath`, and `extractKubeFlags`. Gaps tied to specific findings:

1. **Untested: HTTP timeout / hang** — see I1.
2. **Untested: BATS test isolation after fixture mutation** — see I2.
3. **Untested: `--` separator handling** — see S1.
4. **Untested: empty / unknown `info.GitVersion`** — see S2.
5. **Untested: cross-major skew** — see S3.
6. **Untested: Windows path of `Exec`** (`pkg/kuberlr/exec_windows.go`). The BATS file skips Windows. `cmd.Run` plus `os.Exit` on a non-zero exit is a meaningfully different code path; no exercise. The Windows BATS-host TODO at `bats/tests/10-cli/7-kubectl-cache.bats:14-20` covers the harness gap, not the production code path.
7. **Untested: concurrent invocations**. Two parallel `rdd kubectl` against the same out-of-skew cluster both hit the cache-miss path and both download. On Linux/macOS the second `os.Rename` overwrites; in Go 1.21+ Windows `os.Rename` uses `FILE_RENAME_INFORMATION_EX` with `REPLACE_IF_EXISTS|POSIX_SEMANTICS`, so the dst-exists case is also handled atomically. (Codex flagged a Windows-rename race here; modern Go handles it — see Reconciliation.) A single concurrency test would still help if practical.

## Documentation Assessment

- All exported symbols carry godoc-style comments (`CacheDir`, `Resolve`, `Exec`).
- The new `kubectl_cache` path key and the `RDD_CACHE_DIR` / `RDD_KUBECTL_MIRROR` environment variables are not in the design docs — see S4.
- The resolver-policy summary appears in both `cache.go` (package doc) and `resolver.go` (`Resolve` doc). The duplication is minor, but a single canonical source would be easier to keep in sync.
- The recursion-guard rationale appears in both `resolver.go:22-25` and `exec_windows.go:21`. Keep the canonical version in `resolver.go` and reference it from the platform files.

## Commit Structure

Single-commit branch (`9624299`), self-contained, accurately summarized. No fix-up commits, no scope bleed. Nothing to flag.

## Acknowledged Limitations

- `pkg/kuberlr/resolver.go:43-46` — `TODO(offline)` for the case "no cached kubectl matches and the network is down". The current behavior is the documented failure mode; the TODO carries the future opt-out. **Status**: consistent.
- `pkg/kuberlr/downloader.go:34-35` — `TODO(offline)` for the "may we download?" toggle for air-gapped users. The current change makes this matter on first cache miss; the comment calls it out. **Status**: consistent.
- `bats/tests/10-cli/7-kubectl-cache.bats:14-20` — Windows BATS host deferred. The fake kubectl cross-builds for `windows/<arch>`, but the server-build and service-pid wiring assume a unix parent. **Status**: consistent — code paths are platform-clean; only the harness is unix-bound.
- `bats/tests/10-cli/7-kubectl-cache.bats:97-105` — `local_teardown_file` deviates from the repo "no teardown_file" rule, with explicit rationale: a long-lived ephemeral-port HTTP server is not the kind of state worth preserving across runs. **Status**: rationale stands; the deviation is narrow.


## Agent Performance Retro

### [Claude]

Claude produced the most detailed review by far — 12 raw findings spanning correctness (Major-version check, BATS fixture pollution), policy (parse-failure fall-through, naming scope of `RDD_CACHE_DIR`), test hygiene (`assert_file_contains` idiom, GOOS/GOARCH asymmetry), and product framing (cache eviction TODO, PR description correction). Claude was the only agent to catch the BATS sha512-fixture pollution (I2), which is the second Important finding in this report. It was also the only agent to verify the test isolation chain end-to-end and to flag that the next test happens not to hit the corrupted fixture.

The Major-version check (I1 → S3) was raised at Important; I downgraded it to Suggestion because Kubernetes 2.x has no realistic horizon. Claude's HTTP-timeout finding (S4) was undersold at Suggestion; promoted to Important to match Codex.

### [Codex]

Codex stayed tight — four findings, all on real concerns. Its HTTP-timeout finding (I2) was the right severity call from the start, and it was the only agent to flag the design-doc gap (S4) by checking the docs against the new env vars. Its "Windows os.Rename race" claim (I1 raw) was a false positive: the diagnosis was right that two concurrent processes can both hit the cache-miss path, but modern Go (1.21+) handles dst-exists on Windows transparently with `REPLACE_IF_EXISTS|POSIX_SEMANTICS`, and the proposed fix targeted `errors.Is(err, os.ErrExist)` — the wrong error class. Dropped.

Codex's coverage table was tight and accurate; its design-observations section identified the same drift risk in manual flag parsing that Claude's S1 hinted at.

### [Gemini]

Gemini failed at startup with `ProjectIdRequiredError` (needs `GOOGLE_CLOUD_PROJECT` env var) and a "not running in a trusted directory" rejection. The CLI returned exit 55 before reading the prompt. Excluded from this round.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 36m 56s | 5m 19s | — |
| Findings | 1I 10S | 1I 2S | — |
| Tool calls | 50 (Bash 30, Read 20) | 49 (shell 49) | — |
| Design observations | 5 | 1 | — |
| False positives | 0 | 1 | 0 |
| Unique insights | 9 | 2 | 0 |
| Files reviewed | 16 | 16 | — |
| Coverage misses | 0 | 0 | — |
| **Totals** | **1I 10S** | **1I 2S** | **none** |
| Downgraded | 1 (I→S) | 0 | 0 |
| Dropped | 0 | 1 | 0 |


Claude provided the most value on this review: the only Important finding it surfaced uniquely (I2 BATS fixture pollution) is the kind of latent test-pollution bug that escapes review and surfaces months later as a flaky test. Codex provided the best severity calibration (HTTP timeout) and the best documentation-gap coverage. The two agents complemented each other — Codex was tighter, Claude was deeper.

#### Reconciliation

- Codex P1 Windows os.Rename race: important → dropped (false positive — modern Go handles dst-exists on Windows; proposed fix targets the wrong error class).
- Claude P1 Major-version check: important → suggestion S3 (trigger condition is hypothetical; one-line fix retained).
- Claude P1 HTTP timeout: suggestion S4 → important I1 (aligned with Codex's call; an indefinite hang is operator-facing).

## Review Process Notes

### Repo context updates

- **Add a "third-party CLI download verification" pattern** to `deep-review-context.md`. When code under review fetches an executable from a remote mirror, expect (a) a per-request HTTP deadline distinct from the caller's context, (b) hash-as-you-stream verification before publication, (c) atomic rename with no partial-file leftovers, and (d) bounded cache growth or a documented eviction TODO. Future PRs that add similar fetchers (k3s images, lima templates, registry mirrors) should be reviewed against the same checklist.

- **Document the "embedded-vs-external" decision pattern** for commands that may target either an internal control plane or an external one. When a command has two callers — one internal (`rdd ctl`, embedded apiserver) and one external (`rdd kubectl`, user kubeconfig) — flag any added side effect (probe, lookup, fetch) that runs unconditionally for both. Either short-circuit for the embedded case or update user-facing docs to describe the new flow.


## Appendix: Original Reviews

### [Claude] — Pass 1

Now I have enough context to write the review.

# Code Review: rdd kubectl version resolver (PR #348)

## Header

| | |
| - | - |
| Branch | resolver branch (commit `9624299`) |
| Verdict | ✅ Ready to merge with minor fixes |
| Files reviewed | 16/16 |

## Executive Summary

The PR introduces a kuberlr-style version resolver that downloads version-matched `kubectl` binaries from `dl.k8s.io` when the cluster is more than ±1 minor out of skew with the embedded `kubectl`. The implementation is small (~280 lines of production code), well-commented, and the test surface (Go unit tests + a five-case BATS lifecycle test backed by a per-test fake apiserver/mirror) is solid. Findings are mostly minor robustness gaps (Major-version check, BATS fixture state leakage, `--` separator handling).

## Findings

### Critical Issues

None.

### Important Issues

**I1. `compatible` ignores Major version, accepts cross-major skew as compatible — `pkg/kuberlr/resolver.go:70-73` (important, gap)**

```go
func compatible(client, server semver.Version) bool {
    diff := int64(client.Minor) - int64(server.Minor)
    return diff >= -1 && diff <= 1
}
```

Comparison only inspects `Minor`. A future server at `v2.30.0` with embedded kubectl `v1.30.0` would compute `diff = 0`, return `true`, and silently use a major-mismatched embedded `kubectl`. The unit test at `pkg/kuberlr/resolver_test.go:26-33` only covers `Major == 1` cases, so the gap is not surfaced.

While Kubernetes 2.x is not on any near-term horizon, the missing check is one line and leaves no latent landmine. The same comparison also wraps poorly if `Minor` were ever `> 2^63` — irrelevant in practice but trivial to address with the same fix.

Fix:

```diff
 func compatible(client, server semver.Version) bool {
+    if client.Major != server.Major {
+        return false
+    }
     diff := int64(client.Minor) - int64(server.Minor)
     return diff >= -1 && diff <= 1
 }
```

Add a test case: `{"different major", v(1, 30) /*client*/, semver.Version{Major: 2, Minor: 30}, false}`.

---

**I2. BATS test 'sha mismatch' corrupts the shared `.sha512` fixture without restoring it — `bats/tests/10-cli/7-kubectl-cache.bats:169-186` (important, gap)**

```bash
@test 'rdd kubectl errors on sha512 mismatch' {
    echo "00000…000" \
        >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
    …
}
```

`local_setup()` (lines 107-113) resets `CACHE_DIR`, `LOG_FILE`, and the `/version` override files, but not the `.sha512` fixture written into `MIRROR_BIN_DIR` during `local_setup_file()` (lines 43-47). The mismatch test overwrites that fixture and never restores it. The current next test (line 188, "falls through to embedded …") runs `version --client` which never triggers a download, so the corruption is invisible — but a future test that depends on a working mirror will fail mysteriously and far from the cause.

Fix: restore the published checksum in `local_setup`, e.g.

```diff
 local_setup() {
     rm -rf "${CACHE_DIR}"
     : >"${LOG_FILE}"
     rm -f "${GIT_VERSION_FILE}" "${VERSION_STATUS_FILE}"
+    if is_macos; then
+        shasum -a 512 "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
+    else
+        sha512sum "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
+    fi
 }
```

(Alternatively, factor the `shasum`/`sha512sum` block out of `local_setup_file` into a helper.)

### Suggestions

**S1. `extractKubeFlags` keeps scanning past `--`, picks up flags meant for an embedded subcommand — `pkg/kuberlr/resolver.go:124-138` (suggestion, gap)**

```go
for i, arg := range args {
    switch {
    case arg == "--kubeconfig" && i+1 < len(args):
        kubeconfig = args[i+1]
    …
    case strings.HasPrefix(arg, "--context="):
        context = strings.TrimPrefix(arg, "--context=")
    }
}
```

For `rdd kubectl exec mypod -- mycmd --context=elsewhere`, the embedded scanner happily picks `--context=elsewhere` from inside the post-`--` tail and points the resolver's probe at a context the user never asked for. kubectl's own flag parser respects `--`. The bug is benign (worst case: probe goes nowhere → ok=false → fall-through), but the test case `"both flags"` would silently lie if any future caller relied on the parsed value.

Fix:

```diff
 func extractKubeFlags(args []string) (kubeconfig, context string) {
     for i, arg := range args {
+        if arg == "--" {
+            return kubeconfig, context
+        }
         switch {
         case arg == "--kubeconfig" && i+1 < len(args):
```

Add a unit test case: `{"stops at --", []string{"--", "--kubeconfig=/k"}, "", ""}`.

---

**S2. `serverVersion` parse failure errors out instead of falling through — `pkg/kuberlr/resolver.go:100-103` (suggestion, gap)**

```go
v, err := semver.ParseTolerant(info.GitVersion)
if err != nil {
    return semver.Version{}, false, fmt.Errorf("parsing server version %q: %w", info.GitVersion, err)
}
```

The other three failure modes — no kubeconfig, malformed config, unreachable cluster — return `ok=false, err=nil` and let the embedded kubectl handle client-only commands. An unparseable `info.GitVersion` (apiserver returns `"unknown"`, `""`, or a non-semver build) is logically the same situation: we have no version to skew-compare against, so we cannot make an informed download decision. Surfacing it as a hard `resolving kubectl version: parsing server version "...": ...` error blocks every kubectl invocation against that cluster — including `kubectl config view` or `kubectl version --client`, which never needed the server version. Per the PR description's stated policy ("Apiserver-probe failures … stay silent"), this is the cleaner outcome.

Fix:

```diff
     v, err := semver.ParseTolerant(info.GitVersion)
     if err != nil {
-        return semver.Version{}, false, fmt.Errorf("parsing server version %q: %w", info.GitVersion, err)
+        return semver.Version{}, false, nil //nolint:nilerr // unparseable server version: fall through to embedded
     }
     return v, true, nil
```

---

**S3. `RDD_CACHE_DIR` documented as cache root, but only kubectl cache writes through it — `pkg/kuberlr/cache.go:19-21` and `cmd/rdd/service_paths.go:32` (suggestion, gap)**

```go
// envCacheDir lets tests (and operators) override the rdd-wide cache root.
// Anticipates a shared rdd cache that future consumers (k3s images, lima
// templates) can join — kuberlr appends its own subdirectory.
const envCacheDir = "RDD_CACHE_DIR"
```

The variable is documented as the rdd-wide cache root, and the new `paths` key `kubectl_cache` is the only writer. If a future consumer (k3s images, lima templates) adopts a different env var for its cache root, the "rdd-wide" promise breaks. The current scope is fine, but consider naming the variable `RDD_KUBECTL_CACHE_DIR` so the env-var shape matches the actual scope and a future shared cache can introduce a properly rdd-wide variable without having to migrate users.

Either rename or update the doc to scope the promise: "kubectl cache root; future consumers may layer their own subdirectories under the same value."

---

**S4. HTTP download has no client-level timeout; a slow/hung mirror hangs `rdd kubectl` indefinitely — `pkg/kuberlr/downloader.go:119-134` (suggestion, gap)**

```go
func streamGet(ctx context.Context, url string, w io.Writer) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
    …
    resp, err := http.DefaultClient.Do(req)
```

`http.DefaultClient` has no `Timeout` field set, and the inherited `cmd.Context()` is `context.Background()` (no signal handling, no deadline). A wedged TCP connection or a slow byte-trickling mirror will hang the CLI until the user kills it. The kubectl binary is ~50 MB, so even 30-60s is generous. Add a deadline to the per-request context (or to a private `*http.Client` instance) to bound the worst case.

```diff
-func streamGet(ctx context.Context, url string, w io.Writer) error {
+func streamGet(ctx context.Context, url string, w io.Writer) error {
+    ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
+    defer cancel()
     req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
```

---

**S5. Test relies on `grep -q` instead of repo's idiomatic `assert_file_contains` / `refute_file_contains` — `bats/tests/10-cli/7-kubectl-cache.bats:127-129, 145-148, 184-185, 202-205` (suggestion, gap)**

```bash
grep -q '^GET /version' "${LOG_FILE}"
…
if grep -q '/release/' "${LOG_FILE}"; then
    fail "expected no GET to /release/, but log contained: $(cat "${LOG_FILE}")"
fi
```

`bats/helpers/utils.bash:91` defines `refute_file_contains`; bats-file ships `assert_file_contains`. The current bare `grep -q` works (it's correctly under `set -e`), but the bats-assert idioms surface both the expected pattern and the actual file contents on failure, matching the rest of the suite (`bats/tests/20-service/4-start-parameters.bats` etc.).

Fix sketch:

```diff
-    grep -q '^GET /version' "${LOG_FILE}"
-    grep -q "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}.sha512$" "${LOG_FILE}"
-    grep -q "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}$" "${LOG_FILE}"
+    assert_file_contains "${LOG_FILE}" '^GET /version'
+    assert_file_contains "${LOG_FILE}" "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}.sha512$"
+    assert_file_contains "${LOG_FILE}" "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}$"
```

---

**S6. `CacheDir()` panics when `os.UserCacheDir()` fails; `instancePaths()` cannot recover — `pkg/kuberlr/cache.go:35-38` and `cmd/rdd/service_paths.go:21-34` (suggestion, gap)**

```go
cache, err := os.UserCacheDir()
if err != nil {
    panic(fmt.Errorf("could not determine user cache directory: %w", err))
}
```

`CacheDir()` is now invoked from `instancePaths()` (called by `rdd svc paths`, which is expected to enumerate paths and not crash). `os.UserCacheDir()` only errors when neither `$HOME` nor `$XDG_CACHE_HOME` (Linux) or `%LocalAppData%` (Windows) resolve — extremely rare, but a panic in `rdd svc paths` is a poor error UX. Either fall back to `os.TempDir()` plus a logged warning, or have `CacheDir` return `(string, error)` so the path command surfaces a regular error.

(SUGGESTION since this requires an unusual environment to trigger.)

---

**S7. Build sets `GOOS`/`GOARCH` for kubectl but not for server — `bats/tests/10-cli/7-kubectl-cache.bats:33-42` (suggestion, gap)**

```bash
(
    cd "${BATS_TEST_DIRNAME}/fake-kube"
    GOOS=${GOOS} GOARCH=${GOARCH} \
        go build -ldflags='-s -w' \
        -o "${MIRROR_BIN_DIR}/kubectl${EXE}" \
        ./kubectl
    go build -ldflags='-s -w' \
        -o "${BATS_FILE_TMPDIR}/fake-kube-server" \
        ./server
)
```

The first `go build` exports `GOOS`/`GOARCH` explicitly; the second one doesn't. Both currently target the host because `GOOS=$(go env GOOS)` returns the host value, but the asymmetry hints at a forgotten cross-compile concern. Either drop the explicit env on the kubectl build (host = host) or pass it on both — consistency removes the maintenance trap.

```diff
     (
         cd "${BATS_TEST_DIRNAME}/fake-kube"
-        GOOS=${GOOS} GOARCH=${GOARCH} \
-            go build -ldflags='-s -w' \
+        go build -ldflags='-s -w' \
             -o "${MIRROR_BIN_DIR}/kubectl${EXE}" \
             ./kubectl
         go build -ldflags='-s -w' \
             -o "${BATS_FILE_TMPDIR}/fake-kube-server" \
             ./server
     )
```

(Or keep both with `GOOS`/`GOARCH` if cross-build is genuinely intended for future Windows runners.)

---

**S8. Cache grows without bound — `pkg/kuberlr/cache.go` and `pkg/kuberlr/downloader.go` (suggestion, enhancement)**

`ensureCached` only adds to the cache; no eviction, no GC, no per-version TTL. Each downloaded kubectl is ~50 MB; a user who flips through several remote clusters over months accumulates indefinitely. The `TODO(offline)` covers a related concern (offline mode) but not eviction.

Acceptable for a v1, but worth a `TODO(eviction)` next to the existing offline TODO so a future consumer doesn't reinvent the framing. No code change required.

---

**S9. PR description says "rdd ctl stays untouched" — but `ctlAction` now flows through `Resolve` — `cmd/rdd/kubectl.go:35-46, 59-82` (suggestion, gap)**

```go
func ctlAction(cmd *cobra.Command, args []string) error {
    …
    return kubectlAction(cmd, args)
}

func kubectlAction(cmd *cobra.Command, args []string) error {
    path, err := kuberlr.Resolve(cmd.Context(), args)
    …
}
```

`rdd ctl` shares `kubectlAction`, so it now performs a `serverVersion` probe against the embedded apiserver on every invocation. Functionally a no-op (skew is always 0; resolver returns `""`), but the wording in the PR description is imprecise. Either:

1. Update the description to "rdd ctl behavior is unchanged; the resolver always falls through because the embedded apiserver and kubectl share a version", or
2. Short-circuit `Resolve` for the embedded path. A simple sentinel — `if os.Getenv("KUBECONFIG") == instance.Config() { return "", nil }` — avoids the (small) extra probe. Since the probe goes to a localhost socket and is preceded by `ensureServiceRunning`, this is purely cosmetic.

No code change strictly required; mention here so future readers don't fall into the same trap.

---

**S10. Server uses `log.Fatal*` for setup errors with no cleanup — `bats/tests/10-cli/fake-kube/server/main.go:51-93` (suggestion, gap)**

```go
log.Fatal("--root, --port-file, and --log-file are required")
…
log.Fatalf("opening log file: %v", err)
…
log.Fatalf("listen: %v", err)
…
log.Fatalf("writing port file: %v", err)
```

If the server fails after opening `logFD` (line 58) but before `Serve` (line 91), `defer logFD.Close()` does not run — `log.Fatal` calls `os.Exit(1)`, bypassing deferred functions. Inside a test fixture this is harmless, but the `defer logFD.Close()` reads as if it provides cleanup. Either drop the defer (it's unreachable on the failure paths and on `Serve` blocking forever) or replace `log.Fatal*` with `log.Print*` + explicit `os.Exit(1)` after a single deferred cleanup block.

No functional bug; just confusing.

## Design Observations

### Strengths

- **Clear separation of concerns**: `cache.go` (path layout), `downloader.go` (HTTP + sha), `resolver.go` (policy), `exec_*.go` (platform process launch). Each file has a single, narrow responsibility.
- **Hash-as-you-stream**: `io.MultiWriter(tmp, h)` in `download` (line 91) computes the digest over the bytes as received, before any chmod/rename. This matches the repo guidance about hashing pre-mutation and avoids a re-read of the temp file.
- **Excellent test fixture design**: a single Go server playing both apiserver `/version` and release-mirror `/release/*` (with per-test override files) is a clean, low-friction way to drive the resolver under deterministic conditions without restarting the server between cases.
- **Recursion guard**: `RDD_KUBECTL_RESOLVED` set in `Exec` (`exec_unix.go:19`, `exec_windows.go:24`) handles the obscure but real failure mode of a downloaded kubectl re-execing rdd through a shim.
- **Documented fall-through policy**: the `nilerr` sites in `serverVersion` (`resolver.go:88-99`) each carry a one-line `//nolint:nilerr` comment naming why silent fall-through is correct. Future maintainers will not have to reverse-engineer the policy.

## Testing Assessment

Coverage is solid for the resolver's behavior on the host platform:

1. **Untested: Windows path of `Exec` (`pkg/kuberlr/exec_windows.go`)**. The BATS file skips Windows entirely (line 15-20). `cmd.Run` + `os.Exit` on a non-zero exit is a meaningfully different code path; no exercise. Note that the recursion guard cannot reach `os.Exit(0)` either (success case returns `nil` and the cobra layer returns 0).
2. **Untested: Major version mismatch** — see I1.
3. **Untested: empty/unknown `info.GitVersion`** — see S2.
4. **Untested: `--` separator handling** — see S1.
5. **Untested: concurrent invocations**. Two parallel `rdd kubectl` against the same out-of-skew cluster both hit the cache miss path; both download, both rename. On Linux/macOS the second rename overwrites; on modern Go (1.21+) `os.Rename` on Windows now uses `FILE_RENAME_INFORMATION_EX` with `REPLACE_IF_EXISTS`, so it should also replace atomically. Worth a single concurrency test if practical, otherwise a noted limitation.
6. **Untested: HTTP timeout / hang scenarios** — see S4.

## Documentation Assessment

- All exported symbols have docstrings (`CacheDir`, `Resolve`, `Exec`).
- `Resolve` doc clearly enumerates the fall-through cases, but the embedded resolver-policy summary is duplicated between the package doc (`cache.go:5-9`) and `Resolve`'s doc (`resolver.go:32-46`). Acceptable, but a single canonical source would be easier to maintain.
- `defaultKubeMirror = "https://dl.k8s.io"` — the doc comment at `downloader.go:23-24` says "SIG Release publishes dl.k8s.io as the canonical CDN endpoint." Accurate but the URL itself encodes the contract; the comment adds little. Optional cleanup.
- `pkg/kuberlr/exec_windows.go:21` — "The recursion guard prevents kubectl from looping through Resolve if it ever re-execs us" duplicates the (better) comment in `resolver.go:22-25`. Single source preferred.

The PR description and `cache.go` package doc both anchor the design on `github.com/flavio/kuberlr`. The "everything functional sits under internal/" rationale for re-implementation is clear.

## Commit Structure

Single-commit branch (`9624299`). Self-contained, well-scoped, and the message accurately summarizes the change. No fix-up commits, no scope bleed.

## Acknowledged Limitations

- `pkg/kuberlr/resolver.go:43-46` — `TODO(offline)`: when no cached kubectl matches and the network is down, fail the download. Author plans an opt-out. **Status**: consistent; this PR's behavior is the documented failure mode.
- `pkg/kuberlr/downloader.go:34-35` — `TODO(offline)`: pair with a "may we download?" toggle for air-gapped users. **Status**: consistent.
- `bats/tests/10-cli/7-kubectl-cache.bats:14-20` — Windows BATS host deferred. The fake kubectl builds fine for `windows/<arch>` but the `setup_file`/server-pid wiring assumes a unix parent. **Status**: consistent — code path is platform-clean; only the harness is unix-bound.
- `bats/tests/10-cli/7-kubectl-cache.bats:97-105` — `local_teardown_file` deviates from the repo "no teardown_file" rule, with explicit rationale: a long-lived HTTP server is not the kind of state worth preserving across runs. **Status**: rationale stands; the deviation is narrow and well-justified.

## Coverage Summary

| File | Status |
| --- | --- |
| `.github/dependabot.yml` | Trivial — adds the new helper module to gomod scanning. |
| `bats/tests/10-cli/5-paths.bats` | Reviewed, no issues — adds `kubectl_cache` to `ALL_KEYS`. |
| `bats/tests/10-cli/7-kubectl-cache.bats` | I2, S5, S7 |
| `bats/tests/10-cli/fake-kube/README.md` | Reviewed, no issues. |
| `bats/tests/10-cli/fake-kube/go.mod` | Trivial. |
| `bats/tests/10-cli/fake-kube/kubectl/main.go` | Reviewed, no issues. |
| `bats/tests/10-cli/fake-kube/server/main.go` | S10 |
| `cmd/rdd/kubectl.go` | S9 |
| `cmd/rdd/service_paths.go` | S3 |
| `go.mod` | Trivial — promotes `blang/semver/v4` from indirect to direct. |
| `pkg/kuberlr/cache.go` | S3, S6 |
| `pkg/kuberlr/downloader.go` | S4, S8 |
| `pkg/kuberlr/exec_unix.go` | Reviewed, no issues. |
| `pkg/kuberlr/exec_windows.go` | Reviewed, no issues (testing gap noted). |
| `pkg/kuberlr/resolver.go` | I1, S1, S2 |
| `pkg/kuberlr/resolver_test.go` | Reviewed, no issues — tests cover what's there; coverage gaps live with the production code. |
### [Codex] — Pass 1

### Executive Summary

This PR adds a kubectl version resolver, cache, downloader, exec handoff, service path reporting, and integration coverage with a fake Kubernetes server/mirror. The main correctness risks are around cache publication under concurrent Windows cache misses and unbounded mirror downloads; the remaining issues are narrower parity/documentation gaps.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Concurrent Windows cache misses can fail after another process fills the shared cache** — `pkg/kuberlr/downloader.go:46-105` *(important, regression)*

```go
func ensureCached(ctx context.Context, want semver.Version) (string, error) {
	path := cachedPath(want)
	if _, err := os.Stat(path); err == nil {
		return path, nil
	}
	// ...
	if err := download(ctx, want, path); err != nil {
		return "", err
	}
```

```go
if err := os.Chmod(tmpName, 0o755); err != nil {
	return err
}
return os.Rename(tmpName, dst)
```

`ensureCached` checks for the cached file once at line 48, then downloads at line 56 and publishes with `os.Rename` at line 105. Because the cache is intentionally shared across instances, two `rdd kubectl` invocations can both observe a miss at line 48 and both download. On Unix the second rename replaces the first file, but on Windows `os.Rename` fails when `dst` already exists, so one of the two successful downloads reports `resolving kubectl version` even though the cache is now valid.

Fix: either serialize per-version downloads with a lock file, or make publication tolerant of a peer winning the race. If handling the race after `Rename`, distinguish the "destination already exists" state from real publish errors and return the completed cache path only after confirming `dst` exists.

```diff
 if err := os.Chmod(tmpName, 0o755); err != nil {
 	return err
 }
-return os.Rename(tmpName, dst)
+if err := os.Rename(tmpName, dst); err != nil {
+	if runtime.GOOS == "windows" && errors.Is(err, os.ErrExist) {
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil
+		}
+	}
+	return err
+}
+return nil
```

I2. **Mirror downloads have no deadline** — `pkg/kuberlr/downloader.go:119-124` *(important, regression)*

```go
func streamGet(ctx context.Context, url string, w io.Writer) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
	// ...
	resp, err := http.DefaultClient.Do(req)
```

The apiserver probe has an explicit timeout at `pkg/kuberlr/resolver.go:91`, but the checksum and kubectl download path uses `http.DefaultClient.Do` at line 124 with only the caller’s command context. On a cache miss, a stalled mirror connection can hang `rdd kubectl` indefinitely instead of surfacing the intended download failure.

Fix: add a bounded HTTP client or wrap download requests in a generous timeout. Keep the timeout large enough for slow kubectl binary downloads.

```diff
+const downloadTimeout = 5 * time.Minute
+
+var httpClient = &http.Client{Timeout: downloadTimeout}
+
 func streamGet(ctx context.Context, url string, w io.Writer) error {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
 	if err != nil {
 		return err
 	}
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
```

#### Suggestions

S1. **Flag scanner should stop at kubectl’s argument delimiter** — `pkg/kuberlr/resolver.go:124-134` *(suggestion, regression)*

```go
for i, arg := range args {
	switch {
	case arg == "--kubeconfig" && i+1 < len(args):
		kubeconfig = args[i+1]
	case strings.HasPrefix(arg, "--kubeconfig="):
		kubeconfig = strings.TrimPrefix(arg, "--kubeconfig=")
```

`kubectlAction` passes the same raw args to `Resolve` at `cmd/rdd/kubectl.go:60` and later to the selected kubectl at `cmd/rdd/kubectl.go:74`, but `extractKubeFlags` scans every arg at line 125. Kubectl treats arguments after `--` as positional command arguments, so `rdd kubectl exec pod -- tool --kubeconfig=/other` can make the resolver probe `/other` even though the real kubectl invocation targets the original kubeconfig.

Fix: stop scanning at `--` and add a unit test for both `--kubeconfig` and `--context` after the delimiter.

```diff
 for i, arg := range args {
+	if arg == "--" {
+		break
+	}
 	switch {
```

S2. **New user-facing cache path and env vars are not reflected in design docs** — `cmd/rdd/service_paths.go:32` *(suggestion, regression)*

```go
"kubectl_cache": kuberlr.CacheDir(),
```

The PR adds the `kubectl_cache` service path at line 32 and introduces operator-facing `RDD_CACHE_DIR` / `RDD_KUBECTL_MIRROR` constants at `pkg/kuberlr/cache.go:22` and `pkg/kuberlr/downloader.go:30`, but `docs/design/cmd_service.md` still lists the old `rdd service paths` keys and `docs/design/environment.md` does not describe the new environment variables.

Fix: add `kubectl_cache` to the service paths table/example and document that `RDD_CACHE_DIR` changes resolver cache placement while `RDD_KUBECTL_MIRROR` changes the kubectl release mirror.

### Design Observations

**Concerns**

- **Manual kubectl flag parsing has drift risk (in-scope).** S1 is one example of a broader issue: the resolver is approximating kubectl’s flag semantics. If this grows beyond `--kubeconfig` and `--context`, consider delegating to Kubernetes’ existing config flag machinery or parsing with pflag semantics so resolver target selection cannot drift from the kubectl invocation.

**Strengths**

- The cache writes downloaded bytes to a temp file, hashes the downloaded stream before publishing, and removes partial temp files on failure. That keeps ordinary download and checksum failures from poisoning later cache hits.
- The resolver keeps apiserver probe failures silent while surfacing mirror/checksum failures loudly, which matches the distinction between client-only kubectl usage and failed version resolution.

### Testing Assessment

1. Windows cache behavior is untested; this leaves I1 exposed, especially because the new BATS file skips Windows.
2. Stalled mirror responses are untested; current tests cover 404 and checksum mismatch but not a server that accepts the connection and never completes I2.
3. Argument delimiter behavior is untested; add `extractKubeFlags([]string{"exec", "p", "--", "tool", "--kubeconfig=/other"})`.
4. Verified locally: `go test ./pkg/kuberlr` passed, and `go test ./...` in `bats/tests/10-cli/fake-kube` passed. `go test ./cmd/rdd` could not run in this worktree because `pkg/guestagent/lima-guestagent.gz` is absent.

### Documentation Assessment

S2 covers the main documentation gap: the new path key and environment variables need user-facing design documentation updates.

### Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/resolver.go:43` notes that offline mode is still TODO: “when no cached kubectl matches and the network is down”. This is an acknowledged product limitation; it does not cover I2’s stalled-connection hang.
- **Code comment**: `pkg/kuberlr/downloader.go:34` notes a future “may we download?” toggle for air-gapped users. The current change makes the limitation relevant on first cache miss, but the comment accurately calls it out.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:16` says Windows BATS wiring is deferred. That explains the skipped Windows integration coverage; it also means Windows-specific cache behavior needs separate review attention.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:98` explicitly documents the `local_teardown_file` deviation and why the fake HTTP server should be killed. The rationale is consistent with the helper process being unrelated to preserving RDD control-plane state.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats` — Reviewed, no issues
- `bats/tests/10-cli/7-kubectl-cache.bats` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — S1
- `cmd/rdd/service_paths.go` — S2
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — S2
- `pkg/kuberlr/downloader.go` — I1, I2, S2
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — S1
- `pkg/kuberlr/resolver_test.go` — S1

### [Gemini] — Pass 1

Gemini failed at startup before reading the prompt. `stderr-gemini-pass-1.log`:

```
Error authenticating: ProjectIdRequiredError: This account requires setting the GOOGLE_CLOUD_PROJECT or GOOGLE_CLOUD_PROJECT_ID env var.
Gemini CLI is not running in a trusted directory. To proceed, either use --skip-trust, set the GEMINI_CLI_TRUST_WORKSPACE=true environment variable, or trust this directory in interactive mode.
```

The CLI returned exit 55. The auth/trust configuration is local to this machine and unrelated to the PR; Gemini was excluded from this round.

## Resolution

| | |
|---|---|
| **Addressed** | 2026-05-03 |
| **Commit** | `7531daf rdd kubectl: embed kuberlr-style version resolver` (amended into the original commit) |

| # | Finding | Action | Reason |
|---|---------|--------|--------|
| 1 | Important #1: Mirror download has no deadline | Fixed | |
| 2 | Important #2: BATS sha512-mismatch test corrupts the shared mirror fixture | Fixed | |
| 3 | Suggestion #1: Flag scanner does not stop at kubectl's `--` argument delimiter | Fixed | |
| 4 | Suggestion #2: `serverVersion` parse failure errors out instead of falling through | Fixed | Bundled with WARN-level logging on all four silent fall-through paths |
| 5 | Suggestion #3: `compatible` checks Minor only; cross-major skew slips through | Fixed | |
| 6 | Suggestion #4: New user-facing path key and env vars are not in the design docs | Documentation updated | |
| 7 | Suggestion #5: `RDD_CACHE_DIR` is documented as the rdd-wide cache root, but only kubectl writes to it | Skipped | RDD_CACHE_DIR is a deliberate forward-looking knob; future consumers will join it without forcing a migration |
| 8 | Suggestion #6: BATS test uses `grep -q` where the suite uses `assert_file_contains` | Fixed | |
| 9 | Suggestion #7: GOOS/GOARCH set on the kubectl build but not on the server build | Fixed | |
| 10 | Suggestion #8: Server's `log.Fatal*` calls bypass the `defer logFD.Close()` | Skipped | Test fixture; OS reclaims the fd. Cleanup correctness is moot. Not worth the diff. |
| 11 | Suggestion #9: PR description "rdd ctl stays untouched" overstates | Documentation updated | PR body edited via `gh pr edit 348` |
| 12 | Suggestion #10: Cache grows without bound | Commented | TODO(eviction) marker added on CacheDir; defers eviction policy to a future consumer |
| 13 | Suggestion #11: `CacheDir()` panics on `os.UserCacheDir()` failure | Skipped | Failure mode requires an unusual environment (neither $HOME nor %LocalAppData% set). Panic surfaces a clear stack; not worth the API change. |
| 14 | Testing Gap #1: HTTP timeout / hang | Skipped | Requires slow-mock server fault injection; deferred |
| 15 | Testing Gap #2: BATS test isolation after fixture mutation | Test written | Covered by I2 fix (sha-mismatch test now restores fixture via local_setup) |
| 16 | Testing Gap #3: `--` separator handling | Test written | Unit test added to TestExtractKubeFlags |
| 17 | Testing Gap #4: empty / unknown info.GitVersion | Skipped | Requires mocked discovery client; deferred |
| 18 | Testing Gap #5: cross-major skew | Test written | Unit test added to TestCompatible |
| 19 | Testing Gap #6: Windows path of Exec | Skipped | Requires a Windows BATS runner with Go; deferred per acknowledged limitation |
| 20 | Testing Gap #7: Concurrent invocations | Skipped | Modern Go handles dst-exists on os.Rename across platforms; concurrency harness deferred |
| 21 | Doc Gap #2: Resolver-policy summary duplicated in cache.go and resolver.go | Skipped | Each duplicate serves a different audience (package godoc vs symbol godoc); deliberate, not drift |
| 22 | Doc Gap #3: Recursion-guard rationale duplicated in resolver.go and exec_windows.go | Skipped | Each duplicate serves a different audience; deliberate, not drift |
