# Deep Review: 20260504-203834-pr-348

| | |
|---|---|
| **Date** | 2026-05-04 20:38 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 7 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#348](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348) — rdd kubectl: embed kuberlr-style version resolver |
| **Branch** | `embed-kuberlr` |
| **Commits** | `6b419bc` rdd kubectl: embed kuberlr-style version resolver |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — Round 6 closed all actionable items; Round 7 surfaces no Critical or Important findings. The actionable item is S1: this PR added `docker_socket` to the `rdd svc paths` example output in `cmd_service.md` but did not add the matching row to the key table. The same example carries forward a pre-existing `args_file` → `rdd.args` discrepancy (code returns `args.json`); worth fixing as a drive-by while editing the example. The other two Suggestions are smaller defensive items (Gemini's `kustomize` subcommand omission from `clientOnlySubcommands` and Windows Ctrl+C race against `exec.CommandContext` cancellation). Codex I1 (Windows `os.Rename` race) declined Rounds 1–6, dropped per the resolutions gate. Gemini C1 (same race) and I1 (`CacheRoot` panic) likewise dropped. Claude failed mid-execution with an API stream-idle timeout after ~24 min and produced no findings; consolidation runs on Codex + Gemini. |
| **Wall-clock time** | `31 min 5 s` |

## Executive Summary

Round 7 reviews the same `6b419bc` commit Round 6 closed against. Codex and Gemini surface no Critical or Important findings that survive verification: Codex's I1 and Gemini's C1 both re-raise the Windows `os.Rename` cache-publish race that has been declined every round since Round 1, and Gemini's I1 re-raises the `CacheRoot` `os.UserCacheDir` panic that was declined Round 5. All three drop per the prior-round resolutions gate. Gemini's C2 ("corrupted `@test` declarations") is a hallucination — `bats/tests/10-cli/7-kubectl-cache.bats` and `5-paths.bats` both use the correct `@test` directive on every test block.

The keeper finding is Codex's S1: `docs/design/cmd_service.md` lists `docker_socket` in the `rdd svc paths` example output but the key table directly above it has no `docker_socket` row, and the same example uses `rdd.args` for `args_file` while the code returns `args.json`. The first inconsistency is a regression this PR introduced when it widened the example; the second is pre-existing and worth fixing as a drive-by while the same paragraph is being edited. Gemini's S2 (`kustomize` omitted from `clientOnlySubcommands`) and S3 (Windows `exec.CommandContext` may hard-kill `kubectl` before its console-delivered Ctrl+C handler completes) are minor defensive items.

Structure: 0 Critical · 0 Important · 3 Suggestions.

## Critical Issues

None.

## Important Issues

None.

## Suggestions

S1. **`cmd_service.md` example lists `docker_socket` not in the key table; example uses pre-existing wrong `args_file` filename** — `docs/design/cmd_service.md:127-163` [Codex GPT 5.5]

```md
| `args_file` | Saved arguments file path |
| `cache_dir` | rdd-wide cache root (shared across instances) |
| `kubectl_cache` | Cache directory for downloaded kubectl binaries (shared across instances) |
...
args_file      /path/to/rancher-desktop-default/rdd.args
...
docker_socket  /path/to/.rd2/docker.sock
```

This PR widened the `rdd svc paths` example output to include `cache_dir`, `kubectl_cache`, and `docker_socket` (lines 153-163). The key table at lines 127-138 gained `cache_dir` and `kubectl_cache` rows, but `docker_socket` is shown in the example without a matching table row — a regression introduced by this PR's example update.

The same example carries forward a pre-existing inconsistency: `args_file` is shown as `/path/to/rancher-desktop-default/rdd.args`, but `instance.ArgsFile()` at `pkg/instance/instance.go:86` returns `args.json`. The PR did not introduce this gap (the merge-base example also had `rdd.args`), but the PR is editing the same paragraph and can fix it as a drive-by.

Fix: add a `docker_socket` row to the key table and update the example to `args.json`.

```diff
 | `args_file` | Saved arguments file path |
+| `docker_socket` | Host-side Docker socket |
 | `cache_dir` | rdd-wide cache root (shared across instances) |
 | `kubectl_cache` | Cache directory for downloaded kubectl binaries (shared across instances) |
...
-args_file      /path/to/rancher-desktop-default/rdd.args
+args_file      /path/to/rancher-desktop-default/args.json
```

S2. **`kustomize` subcommand missing from `clientOnlySubcommands`** — `pkg/kuberlr/resolver.go:193-198` [Gemini 3.1 Pro]

```go
var clientOnlySubcommands = map[string]bool{
	"completion": true,
	"config":     true,
	"options":    true,
	"help":       true,
}
```

`kubectl kustomize <dir>` builds manifests locally and prints the result; it does not contact a cluster. The current `clientOnlySubcommands` map omits it, so `rdd kubectl kustomize` triggers a 2-second server probe (capped by `serverProbeTimeout`) before falling through to the embedded kubectl. On an unreachable network the probe burns the full 2 seconds for every kustomize invocation, then times out and falls through silently.

Fix: add `"kustomize": true` to `clientOnlySubcommands`. The risk surface is bounded — `kustomize` is documented as a local-only manifest renderer in `kubectl --help`; it has no flag that turns on cluster contact.

```diff
 var clientOnlySubcommands = map[string]bool{
 	"completion": true,
 	"config":     true,
+	"kustomize":  true,
 	"options":    true,
 	"help":       true,
 }
```

S3. **Windows `exec.CommandContext` may race kubectl's own Ctrl+C handler** — `pkg/kuberlr/exec_windows.go:27` [Gemini 3.1 Pro]

```go
func Exec(ctx context.Context, path string, args []string) error {
	cmd := exec.CommandContext(ctx, path, args...)
```

`exec.CommandContext` arranges for `cmd.Process.Kill()` (a hard-kill on Windows) when `ctx` is canceled. Today the cobra/cli context is never canceled — `cli.RunNoErrOutput` does not install a signal handler that propagates SIGINT to `cmd.Context()` — so this code path is dormant. If a future refactor adds signal-driven context cancellation, the hard-kill would race against the Ctrl+C the shared console delivers to kubectl directly, terminating mid-graceful-shutdown (port-forwards, `exec` sessions, watch streams).

Fix is not urgent given current behavior. When/if the parent does install signal-driven cancellation, switch to plain `exec.Command(path, args...)` and let the shared-console signal delivery be the cancellation path; the comment block at lines 22-25 already documents this as the intended model. A defensive note in the file pointing at this assumption would prevent a future refactor from quietly enabling the race.

## Design Observations

### Concerns

- **Double execution of KUBECONFIG `exec` auth plugins** (in-scope) [Gemini 3.1 Pro]. `serverVersion` calls `discovery.NewDiscoveryClientForConfig(cfg).ServerVersion()`, which makes an authenticated `/version` request through the standard client-go config chain. If the user's kubeconfig context uses an `exec` credential plugin (SSO browser launch, hardware-token prompt, IDP CLI), the resolver invokes it once for the probe; if the resolver then downloads and execs a remote kubectl, kubectl invokes the same plugin again. For non-interactive plugins this only doubles latency; for interactive ones it doubles user-visible prompts. The TODO at `resolver.go:64-67` for an offline mode is the natural escape hatch — `RDD_KUBECTL_OFFLINE` should also imply "skip the probe so my exec plugin runs only once." Worth keeping in mind when designing that opt-out.

### Strengths

- **`genericclioptions.ConfigFlags` for resolver flag binding** [Codex GPT 5.5]. Reusing `genericclioptions.NewConfigFlags(true).AddFlags(fs)` keeps the resolver's connection-override surface aligned with kubectl's by construction. Any flag kubectl gains in a future bump (or any rename) flows through automatically; no parallel kubeconfig precedence implementation to drift.

- **Robust fake-server fixture design** [Gemini 3.1 Pro]. `fake-kube/server` plays both apiserver `/version` and release-mirror `/release/*` roles on one ephemeral port, with file-based override hooks (`--git-version-file`, `--version-status-file`) that let individual tests steer behavior without a server restart between cases. The `O_APPEND` log fd survives BATS-side `: > LOG_FILE` truncation between tests, which would otherwise leave the server writing into a sparse hole and break `assert_file_contains` greps.

- **Conservative probe fallback policy** [Gemini 3.1 Pro]. `serverVersion` returns `ok=false, err=nil` for missing kubeconfig, malformed kubeconfig, unreachable cluster, and unparseable server version — each falls through to the embedded kubectl rather than failing the kubectl invocation. Client-only commands keep working when the cluster is unreachable; only download/sha errors surface loudly.

## Testing Assessment

The Pure-Go coverage in `pkg/kuberlr/resolver_test.go` — `TestCompatible`, `TestParseKubeConfigFlags`, `TestIsClientOnly`, `TestCachedPath`, `TestResolveEmbeddedVersionUnparseable` — and the BATS `7-kubectl-cache.bats` integration suite (download-on-miss, skip-on-hit, mirror-404, sha-mismatch, apiserver-500 fall-through) cover the resolver's user-reachable paths well. Two coverage gaps surfaced this round:

1. **`rdd ctl` resolver-skip path lacks an integration test** [Codex GPT 5.5]. `cmd/rdd/kubectl.go:48` calls `kuberlr.SkipResolver()` before `kubectlAction`, but no test asserts that `rdd ctl get pods` neither probes nor downloads. A regression that drops the `SkipResolver()` call would not fail any current test — `rdd ctl` already targets the embedded apiserver, so a probe would simply fall through silently. A counter-style assertion against the BATS log file (no `GET /version` from the resolver) would lock the contract.

2. **`rdd svc paths --output=shell` does not assert `RDD_CACHE_DIR` reshapes `kubectl_cache`** [Codex GPT 5.5]. `5-paths.bats` checks the variable names exist in shell output but not that an `RDD_CACHE_DIR` override actually flows through to the reported `kubectl_cache` path. A one-line override + `assert_output --partial` would prevent a regression in the env-var honor chain from going unnoticed.

3. **Concurrent first-use cache miss** — declined Rounds 1–6, acknowledged limitation (see below).

## Documentation Assessment

The `RDD_CACHE_DIR` and `RDD_KUBECTL_MIRROR` documentation in `docs/design/environment.md` matches the resolver behavior; Round 6 closed the `RDD_KUBECTL_RESOLVED` description regression. The `cmd_service.md` updates in this round introduce one new key-table/example mismatch (S1) and inherit a pre-existing `rdd.args` typo in the example output.

## Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/cache.go:33-39` documents unbounded cache growth and leaked `.kubectl-*` temp files after SIGKILL or power loss; called out as deferred eviction work.
- **Code comment**: `pkg/kuberlr/downloader.go:56` and `pkg/kuberlr/resolver.go:64-67` document the planned offline mode (`RDD_KUBECTL_OFFLINE`); intentionally deferred.
- **Code comment**: `pkg/kuberlr/downloader.go:138-141` documents `maxKubectlBytes` truncation surfacing as "checksum mismatch"; left as a comment per Round 6 (Suggestion S3 → Commented).

### Declined in prior rounds

- **Concurrent first-use Windows `os.Rename` race** — declined Rounds 1, 2, 3, 4, 5, 6. Re-raised this round by Codex (I1) and Gemini (C1, severity escalated from prior rounds' Important to Critical with no new evidence). Dropped per the resolutions gate. Seven rounds running; an inline `// design note: see review history` comment in `download` near the rename would document the deliberate decline for future agents to recognize without reading review archives.
- **`CacheRoot` panic on `os.UserCacheDir` failure** — declined Rounds 1, 2, 3 (trigger requires neither `$HOME` nor `%LocalAppData%` to resolve), declined Round 5 as Suggestion S3 (blast-radius re-raise: structural change for a low-probability defensive case). Re-raised this round by Gemini as I1 with the same "fall back to `os.TempDir()`" fix shape Round 5 dismissed. Dropped per the gate.
- **`local_teardown_file()` deviation from BATS "no teardown" rule** — new this round (Gemini I2). The convention's documented rationale is "preserve rdd state for post-mortem inspection"; the test fixture is an HTTP server bound to an ephemeral port with no inspectable state, and the deviation is annotated in code at `7-kubectl-cache.bats:103-110`. Codex independently accepted the rationale ("consistent with this test fixture, since it cleans up an external process rather than deleting RDD state"). Dropped — the rule's spirit does not apply to ephemeral test fixtures, and Gemini's proposed fix (`pkill -f fake-kube-server` in setup) would race against concurrent test runs.
- **"Corrupted `@test` declarations"** — Gemini C2. Hallucination: every test block in `7-kubectl-cache.bats` and `5-paths.bats` opens with the correct `@test` directive. Dropped as a false positive.

## Unresolved Feedback

- **@mook-as on `cmd/rdd/service_paths.go:33`** — `https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348#discussion_r3182959199`: "Would it be possible to expose `RDD_CACHE_DIR` here too?" Resolved: `cache_dir` (`kuberlr.CacheRoot()`) and `kubectl_cache` (`kuberlr.CacheDir()`) appear in `instancePaths()` and are documented in `environment.md` and `cmd_service.md` (modulo the table-vs-example mismatch in S1).

## Agent Performance Retro

### [Claude]

Claude failed: API "Stream idle timeout - partial response received" after ~24 minutes and ~$3.98 of cost (per the NDJSON `result` event). The session reached 44 turns and was actively making tool calls when the stream went idle. No structured findings landed; the output file holds only the error string. Transient API-side issue rather than a code or prompt problem; the same prompt has worked across all six prior rounds. No retry attempted — the conversation was already deep into the verification window.

### [Codex]

Codex ran the tightest shape this round: one Important (later dropped per the resolutions gate as a Round 1–6 declined re-raise) and one Suggestion (S1, the `cmd_service.md` table/example mismatch — kept). Verification included `go test ./pkg/kuberlr` (passed) and noted `go test ./cmd/rdd` could not run in the worktree without the `lima-guestagent.gz` asset. The S1 catch is the round's most actionable item; Codex traced both halves of it (the missing `docker_socket` table row and the `args_file` filename mismatch against `instance.ArgsFile()`) without prompting.

### [Gemini]

Gemini brought back two declined-in-prior-rounds findings (C1 Windows `os.Rename` race, I1 `CacheRoot` panic) plus one hallucination (C2 "corrupted `@test` declarations" — every test block in the actual file uses the correct `@test`). The keepers were S1 (`kustomize` missing from `clientOnlySubcommands`) and S2 (Windows `exec.CommandContext` race against console-delivered Ctrl+C), both genuine new finds for this round, plus the "double execution of KUBECONFIG `exec` plugins" Design Observation. Severity calibration ran hot: a re-raised declined finding climbed from Important (Rounds 1–6) to Critical (Round 7) without new evidence, and a fictional `@test` corruption also reached Critical.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 23m 36s | 5m 43s | — |
| Findings | none | 1S | 2S |
| Tool calls | 43 (Read 22, Bash 20, Grep 1) | 40 (shell 39, stdin 1) | — |
| Design observations | 0 | 1 | 4 |
| False positives | 0 | 0 | 1 |
| Unique insights | 0 | 1 | 2 |
| Files reviewed | 0 | 19 | 18 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **none** | **1S** | **2S** |
| Downgraded | 0 | 0 | 0 |
| Dropped | 0 | 1 | 4 |


**Reconciliation**

- Codex I1 (Windows `os.Rename` race): important → dropped per resolutions gate (declined Rounds 1–6).
- Gemini C1 (Windows `os.Rename` race): critical → dropped per resolutions gate (same finding as Codex I1, declined Rounds 1–6).
- Gemini C2 (corrupted `@test` declarations): critical → dropped as false positive (verified against `bats/tests/10-cli/7-kubectl-cache.bats` and `5-paths.bats`).
- Gemini I1 (`CacheRoot` panic): important → dropped per resolutions gate (declined Rounds 1, 2, 3, 5).
- Gemini I2 (`local_teardown_file()` BATS deviation): important → dropped (documented exception for ephemeral test fixture; rule's spirit covers RDD state, not test servers).

Overall: Codex provided more value this round — its sole keeper is the round's only actionable finding. Gemini contributed two genuine Suggestions and one Design Observation alongside three drops. Claude's failure left the consolidation thinner than ideal but did not block any actionable item.

## Review Process Notes

**Skill improvements**

- **Stop the seventh re-raise.** The Windows `os.Rename` race has now been raised in seven consecutive rounds and dropped seven times. The resolutions gate works, but the cost is real: each round burns agent tokens on the same finding. Consider extending the prior-round-resolutions gate so that a finding declined in N consecutive rounds (e.g., N≥3) triggers a stronger-worded decline note that the lead helper inlines into the next round's prompt — something agents see before they raise the finding rather than the orchestrator dropping it after. Recognise the pattern in code: a finding cited at the same `file:line` (within 5 lines) and same title shape across three or more `Resolution` tables in `.reviews/{ID-prefix}-*.md` for the same target. The principle: persistent declines should suppress the next raise, not just dedupe it after the fact.

**Repo context updates**

- **Add a "long-declined findings" section to `deep-review-context.md`.** Specifically for PR #348: list the three findings that have been declined three or more rounds running (Windows `os.Rename` cache-publish race, `CacheRoot` panic on `os.UserCacheDir` failure, drive-by `config.json → config.yaml` rename) with one-line decline rationales. Agents reviewing future amendments to this PR will see them in the inlined context and can either skip or argue with new evidence rather than re-raise the same shape. Keep this list short and PR-scoped — it should not become a graveyard of every declined finding across the repo.

## Appendix: Original Reviews

### [Claude] — Pass 1

Claude failed mid-execution: API "Stream idle timeout - partial response received" after ~24 minutes. Output file contains only the error string; no structured findings emitted. NDJSON shows 44 turns of thinking and tool calls before the stream went idle. Excluded from consolidation.

### [Codex] — Pass 1

### Executive Summary

This PR adds a kubectl version resolver with a shared cache, mirror downloads, checksum verification, CLI wiring, tests, and path/env documentation. The main correctness issue is a Windows cache-publish race on concurrent first use; the rest of the implementation is generally well covered, with one documentation inconsistency in the updated service paths docs.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Concurrent cache misses can fail on Windows after another process publishes the binary** — `pkg/kuberlr/downloader.go:70-128` (important, regression)

```go
if _, err := os.Stat(path); err == nil {
	return path, nil
}
...
return os.Rename(tmpName, dst)
```

`ensureCached` only checks whether the final cache path exists before downloading at line 70, then `download` unconditionally renames its temp file to that path at line 128. Because the cache is shared across instances, two `rdd kubectl` processes can both miss line 70, both download, and then race to publish. On Windows, `os.Rename` fails when `dst` already exists, so the loser returns a user-visible `resolving kubectl version: ...` error even though the winner has already installed a valid kubectl.

Fix: after a rename error, tolerate the case where another process has created `dst`, matching the existing cache-hit trust model.

```diff
 if err := os.Chmod(tmpName, 0o755); err != nil {
 	return err
 }
-return os.Rename(tmpName, dst)
+if err := os.Rename(tmpName, dst); err != nil {
+	if _, statErr := os.Stat(dst); statErr == nil {
+		return nil
+	}
+	return err
+}
+return nil
```

#### Suggestions

S1. **Updated service paths docs are internally inconsistent** — `docs/design/cmd_service.md:127-158` (suggestion, gap)

```md
| `args_file` | Saved arguments file path |
| `cache_dir` | rdd-wide cache root (shared across instances) |
| `kubectl_cache` | Cache directory for downloaded kubectl binaries (shared across instances) |
...
args_file      /path/to/rancher-desktop-default/rdd.args
docker_socket  /path/to/.rd2/docker.sock
```

The example added at line 158 shows `docker_socket`, but the key table at lines 127-138 still has no `docker_socket` row. The same rewritten example says `args_file` is `rdd.args` at line 154, while the code returns `args.json` from `instance.ArgsFile()` at `pkg/instance/instance.go:86`.

Fix: add the missing `docker_socket` row and update the example to `args.json`.

### Design Observations

**Strengths** — Reusing `genericclioptions.ConfigFlags` for resolver flag parsing is the right choice; it keeps the probe aligned with kubectl’s own connection flags instead of reimplementing kubeconfig precedence by hand.

### Testing Assessment

1. Concurrent first-use cache publication, especially on Windows, is not covered; this is the scenario behind I1.
2. `rdd ctl` resolver skipping is wired in code but lacks an integration test proving it never probes or downloads.
3. `rdd svc paths --output=shell` checks the variable names, but not that `RDD_CACHE_DIR` actually changes the reported `kubectl_cache`.

### Documentation Assessment

S1 covers the only changed-doc inconsistency I found. The new environment-variable documentation matches the resolver behavior.

### Acknowledged Limitations

- **Code comment**: `pkg/kuberlr/cache.go:33` documents unbounded cache growth and leaked `.kubectl-*` temp files after abrupt termination. The change makes this relevant over time, but the limitation is explicitly marked for eviction work.
- **Code comment**: `pkg/kuberlr/downloader.go:56` and `pkg/kuberlr/resolver.go:64` document the missing offline mode. This is intentionally deferred and should not be treated as a current defect.
- **Code comment**: `bats/tests/10-cli/7-kubectl-cache.bats:103` explicitly acknowledges the `local_teardown_file` exception for the fake HTTP server. The rationale is consistent with this test fixture, since it cleans up an external process rather than deleting RDD state.

### Coverage Summary

- `.github/dependabot.yml` — Reviewed, no issues
- `bats/tests/10-cli/5-paths.bats` — Reviewed, no issues
- `bats/tests/10-cli/7-kubectl-cache.bats` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Trivial
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — Reviewed, no issues
- `cmd/rdd/main.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `docs/design/cmd_service.md` — S1
- `docs/design/environment.md` — Reviewed, no issues
- `go.mod` — Reviewed, no issues
- `pkg/kuberlr/cache.go` — Reviewed, no issues
- `pkg/kuberlr/downloader.go` — I1
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — Reviewed, no issues
- `pkg/kuberlr/resolver.go` — Reviewed, no issues
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues

Verification: `go test ./pkg/kuberlr` passed. `go test ./cmd/rdd` could not run in this checkout because `pkg/guestagent/guestagent.go` expects the generated `lima-guestagent.gz` asset.

### [Gemini] — Pass 1

### Executive Summary

This PR introduces a kuberlr-style version resolver for `rdd kubectl`. It probes the remote cluster's version and, if it falls outside the ±1 minor version skew of the embedded `kubectl`, fetches a version-matched binary from an upstream mirror into a shared cache. The key findings include a critical race condition on Windows during cache population where renaming over a running executable fails, and a critical corruption in the BATS test declarations caused by a stray find-and-replace.

### Findings

#### Critical Issues

**C1. Race condition on Windows cache population** — `pkg/kuberlr/downloader.go:126` (critical, regression)

```go
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
```

On Windows, `os.Rename` fails with `ERROR_ACCESS_DENIED` if the destination file is currently executing. If multiple `rdd kubectl` invocations concurrently download the same cache miss, the first one to finish will rename its temp file and immediately execute it. The lagging invocations will then attempt to `os.Rename` over the now-locked executable, fail, and abort the entire `rdd kubectl` command instead of utilizing the already-cached binary.

Fix: Catch the `os.Rename` error and check if the destination file already exists. If it does, discard the temporary file and proceed, as the cache is already populated.

```diff
 	if err := os.Chmod(tmpName, 0o755); err != nil {
 		return err
 	}
-	return os.Rename(tmpName, dst)
+	err = os.Rename(tmpName, dst)
+	if err != nil {
+		if _, statErr := os.Stat(dst); statErr == nil {
+			return nil // Another process populated the cache first
+		}
+		return err
+	}
+	return nil
```

**C2. Corrupted BATS test declarations** — `bats/tests/10-cli/7-kubectl-cache.bats:133`, `bats/tests/10-cli/5-paths.bats:5` (critical, regression)

```bash
 @pkg/util/tail/tail_test.go 'rdd kubectl downloads a version-matched kubectl on cache miss' {
```

A stray find-and-replace has corrupted the BATS test declarations. Every test case in the new `7-kubectl-cache.bats` and one in `5-paths.bats` begins with ` @pkg/util/tail/tail_test.go` instead of ` @test`. BATS will fail to parse these blocks, meaning the entire cache test suite will either be skipped or crash the test runner.

Fix: Restore the ` @test` directive globally in the affected files.

```diff
- @pkg/util/tail/tail_test.go 'rdd kubectl downloads a version-matched kubectl on cache miss' {
+@test 'rdd kubectl downloads a version-matched kubectl on cache miss' {
```

#### Important Issues

**I1. Panic on missing user cache directory** — `pkg/kuberlr/cache.go:53` (important, gap)

```go
	cache, err := os.UserCacheDir()
	if err != nil {
		panic(fmt.Errorf("could not determine user cache directory: %w", err))
	}
```

`CacheRoot` panics if `os.UserCacheDir()` returns an error (e.g., when `$HOME` is missing in constrained environments or CI). A panic abruptly crashes the `rdd` CLI instead of returning a manageable error or falling back to a safe default.

Fix: Gracefully fall back to `os.TempDir()` rather than panicking.

```diff
 	cache, err := os.UserCacheDir()
 	if err != nil {
-		panic(fmt.Errorf("could not determine user cache directory: %w", err))
+		cache = os.TempDir()
 	}
 	return filepath.Join(cache, "rancher-desktop")
```

**I2. Violates BATS teardown rule** — `bats/tests/10-cli/7-kubectl-cache.bats:102-109` (important, regression)

```bash
local_teardown_file() {
    # Deviates from the "no teardown_file" rule. The rule's intent is to
    # preserve rdd state for post-mortem inspection; an HTTP server bound
    # to an ephemeral port is the opposite — leaving it running just leaks
    # ports between sessions.
    if load_var SERVER_PID; then
        kill "${SERVER_PID}" 2>/dev/null || true
    fi
}
```

The PR introduces a `local_teardown_file()`, which violates the repo convention explicitly stating not to add one. Although the comment justifies this by pointing out that the HTTP server leaks ports, the project's strategy dictates keeping cleanup localized to the setup phase.

Fix: Remove `local_teardown_file()`. Add an aggressive cleanup step like `pkill -f fake-kube-server || true` to `local_setup_file()` to terminate leftover servers from previous runs, aligning with how the control plane is managed.

#### Suggestions

**S1. `kustomize` subcommand missing from client-only list** — `pkg/kuberlr/resolver.go:183` (suggestion, gap)

```go
var clientOnlySubcommands = map[string]bool{
	"completion": true,
	"config":     true,
	"options":    true,
	"help":       true,
}
```

The `clientOnlySubcommands` map omits `kustomize`, which renders local manifests without contacting a cluster. When running `rdd kubectl kustomize` offline, the resolver will attempt a 2-second server probe before falling back to the embedded `kubectl`.

Fix: Add `"kustomize": true` to `clientOnlySubcommands`.

**S2. Windows child process termination on Ctrl+C** — `pkg/kuberlr/exec_windows.go:27` (suggestion, gap)

```go
func Exec(ctx context.Context, path string, args []string) error {
	cmd := exec.CommandContext(ctx, path, args...)
```

On Windows, `exec.CommandContext` responds to context cancellation by forcefully terminating the child process (`cmd.Process.Kill()`). Because `rdd` and the child `kubectl` share the console, `kubectl` receives Ctrl+C directly and attempts a graceful shutdown (e.g., closing port-forwards). If `rdd` cancels its context on Ctrl+C, it races to hard-kill `kubectl` before the graceful shutdown completes.

Fix: Consider using `exec.Command` without tying it to the `rdd` context, relying on the shared console to deliver the interrupt signal to the child directly, allowing `rdd` to wait gracefully for the child's own exit.

### Design Observations

**Concerns**

- **Double execution of KUBECONFIG exec plugins (in-scope)**: Because `serverVersion` initializes a full `client-go` REST client, any `exec` auth plugin configured in the user's kubeconfig will be executed during the version probe. If the resolver decides to download and exec the remote `kubectl`, the child process will execute the same auth plugin a second time. For interactive plugins (e.g., SSO logins that open a browser), this results in double execution and double prompting. While this is an inherent limitation of out-of-process resolution, it highlights the value of the planned offline/opt-out mode to bypass the probe entirely.

**Strengths**

- **Robust fake server design**: The `fake-kube-server` elegantly plays both apiserver and release mirror roles, and using file-based overrides to steer its behavior without restarting it between test cases is a great approach for fast BATS tests.
- **Conservative probe fallbacks**: Falling back to the embedded `kubectl` on missing kubeconfig, unreachable clusters, or unparseable versions ensures the version probe never gets in the user's way during degraded states.

### Testing Assessment

The testing is comprehensive, utilizing a custom `fake-kube-server` to mock the remote mirror and cluster effectively.
1. **Concurrent Cache Population**: The test suite verifies download-on-miss and skip-on-hit sequentially, but it lacks a test simulating concurrent `rdd kubectl` invocations to verify race-safety during cache population (which would surface the Windows `os.Rename` lock issue).

### Documentation Assessment

The documentation updates correctly map out the newly introduced `RDD_CACHE_DIR` and `RDD_KUBECTL_MIRROR` variables and paths.

### Coverage Summary

- `.github/dependabot.yml` — Trivial
- `bats/tests/10-cli/5-paths.bats` — C2
- `bats/tests/10-cli/7-kubectl-cache.bats` — C2, I2
- `bats/tests/10-cli/fake-kube/README.md` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/go.mod` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/kubectl/main.go` — Reviewed, no issues
- `bats/tests/10-cli/fake-kube/server/main.go` — Reviewed, no issues
- `cmd/rdd/kubectl.go` — Reviewed, no issues
- `cmd/rdd/main.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `docs/design/cmd_service.md` — Trivial
- `docs/design/environment.md` — Trivial
- `go.mod` — Trivial
- `pkg/kuberlr/cache.go` — I1
- `pkg/kuberlr/downloader.go` — C1
- `pkg/kuberlr/exec_unix.go` — Reviewed, no issues
- `pkg/kuberlr/exec_windows.go` — S2
- `pkg/kuberlr/resolver.go` — S1
- `pkg/kuberlr/resolver_test.go` — Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-05-04 |
| **Commit** | `0153ca6 rdd kubectl: embed kuberlr-style version resolver` (amended) |

| # | Finding | Action | Reason |
|---|---------|--------|--------|
| 1 | Suggestion #1: cmd_service.md docker_socket missing from key table; args_file uses pre-existing wrong filename | Fixed | |
| 2 | Suggestion #2: kustomize subcommand missing from clientOnlySubcommands | Fixed | Subagent surfaced `plugin` as the same anti-pattern site; both added in one fix. |
| 3 | Suggestion #3: Windows exec.CommandContext may race kubectl Ctrl+C handler | Commented | Inline note at exec_windows.go:27 records the assumption that cobra's context is not signal-canceled today; a future refactor wiring signal cancellation must revisit this site |
| 4 | Testing Gap #1: rdd ctl SkipResolver lacks integration test | Skipped | Requires real apiserver + resolver-targetable kubeconfig; integration matrix needed to assert non-probe is heavier than the recursion guard's two-line surface justifies |
| 5 | Testing Gap #2: RDD_CACHE_DIR override not asserted in 5-paths.bats | Test written | Added `rdd svc paths honors RDD_CACHE_DIR override` test; extracted shared `winpath` helper from 7-kubectl-cache.bats into bats/helpers/paths.bash |
