Deep Review: 20260504-180530-pr-348

Date	2026-05-04 18:05
Repo	rancher-sandbox/rancher-desktop-daemon
Round	5
Author	@jandubois
PR	#348 — rdd kubectl: embed kuberlr-style version resolver
Branch	`embed-kuberlr`
Commits	`fb0811d` rdd kubectl: embed kuberlr-style version resolver
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — S1 (Codex + Claude) is the round's most actionable: `RDD_CACHE_DIR` is documented in `environment.md` both as a behavior-changing override AND as a Path Variable whose section preamble says "setting them has no effect on RDD's behavior" — the contradiction is internal and visible to anyone who reads the file. S3 (Claude + Gemini) brings the long-standing `CacheRoot` panic finding back with a material change: `instancePaths()` now calls `kuberlr.CacheRoot()` unconditionally, so a user with no `$HOME`/`%LocalAppData%` sees `rdd svc paths` panic with a stack trace. The remaining suggestions (S2 download UX, S4 misleading test comment, S5 process-env coupling, S6 dead error scaffold) are all genuine but lower priority. Round 4's ten suggestions all landed cleanly with no regressions.
Wall-clock time	`23 min 8 s`

Executive Summary ¶

Round 4's ten Suggestions all landed cleanly. Round 5 surfaces no Critical or Important findings; the agents converge on six smaller items in three buckets:

Documentation contradiction around RDD_CACHE_DIR (S1, two agents). environment.md line 16 documents the variable as a behavior-changing override; the Path Variables section preamble at line 39 explicitly states "setting them has no effect on RDD's behavior"; line 48 then lists RDD_CACHE_DIR in that table. Codex also notes that cmd_service.md's key table at lines 127–137 lists kubectl_cache but omits the new cache_dir key the example output (line 154) and the instancePaths() map (cmd/rdd/service_paths.go:32) include.
Cache-root panic now reaches rdd svc paths (S3, Claude + Gemini). The panic in CacheRoot on os.UserCacheDir() failure is the same finding declined in Rounds 1–3, but the blast radius materially changed in this round: instancePaths() now calls kuberlr.CacheRoot() and kuberlr.CacheDir() unconditionally, so any rdd svc paths invocation (the most basic introspection command) crashes for a user with no $HOME/%LocalAppData%. Per the resolutions gate, a material code change since the decline justifies a re-raise; Suggestion-level because the trigger remains rare.
Defensive cleanup in the new code — S2 (silent first-run download has no user feedback), S4 (BATS comment misnames the EXE-export mechanism), S5 (SkipResolver mutates the global process env where a package var would express the same intent without leaking into spawned children), and S6 (serverVersion declares an error return but every path returns nil, leaving Resolve's if err != nil as documented dead code).

The Windows os.Rename race surfaced again from all three agents (Codex I1, Gemini I1, Claude S3) and was dropped per the prior-round-resolutions gate (declined Rounds 1–4).

Structure: 0 Critical · 0 Important · 6 Suggestions.

Critical Issues ¶

None.

Important Issues ¶

None.

Suggestions ¶

S1. environment.md describes RDD_CACHE_DIR as both behavior-changing and inert; cmd_service.md key table omits the new cache_dir row Codex Claude¶

docs/design/environment.md

| `RDD_INSTANCE` | Instance identifier. Determines which control plane and directories to use. Also settable via `rdd --instance`. | `2` |
| `RDD_DEVELOPER_MODE` | Enables developer mode: exposes hidden CLI flags, detects source tree for local builds. | unset |
| `RDD_KEEP_LOGS` | Preserves logs for post-mortem debugging. See [Log Preservation](#log-preservation) for details. | unset |
| `RDD_LOG_LEVEL` | Sets the log level (`fatal`, `error`, `warn`, `info`, `debug`, `trace`). Overridden by `--log-level` flag. When unset, defaults to `debug` in developer mode, `warn` otherwise. | unset |
| `RDD_LOG_TITLE` | When set, writes this string as the first line of each new log file. Useful for identifying log files from specific test runs or sessions. | unset |
| `RDD_CACHE_DIR` | Overrides the rdd-wide cache root. The kubectl resolver appends `kubectl/<os>-<arch>/` inside it. | `os.UserCacheDir()`/`rancher-desktop` (e.g. `~/Library/Caches/rancher-desktop` on macOS) |
| `RDD_KUBECTL_MIRROR` | Overrides the Kubernetes release mirror used to download version-matched kubectl binaries. | `https://dl.k8s.io` |

### BATS Test Variables

These variables configure the BATS test framework. They have no effect on `rdd` itself.

environment.md line 16 (Configuration Variables) correctly documents RDD_CACHE_DIR as a behavior-changing override that the kubectl resolver reads at pkg/kuberlr/cache.go:48. Line 39 then opens the Path Variables section with "setting them has no effect on RDD's behavior," and line 48 lists RDD_CACHE_DIR in that table. A reader who lands in the Path Variables section reads the opposite of what the resolver actually does.

pkg/kuberlr/cache.go


// CacheRoot returns the rdd-wide cache root. RDD_CACHE_DIR overrides the
// OS default. Future cache consumers (k3s images, Lima templates) should
// append their own subdirectory inside this root.
func CacheRoot() string {
	if root := os.Getenv(envCacheDir); root != "" {
		return root
	}
	cache, err := os.UserCacheDir()
	if err != nil {
		panic(fmt.Errorf("could not determine user cache directory: %w", err))

docs/design/environment.md

```

| Variable | Description | Example (macOS, instance `2`) |
| --- | --- | --- |
| `RDD_ARGS_FILE` | Saved startup arguments | `~/Library/Application Support/rancher-desktop-2/args.json` |
| `RDD_CACHE_DIR` | rdd-wide cache root (shared across instances) | `~/Library/Caches/rancher-desktop` |
| `RDD_DIR` | Service instance directory | `~/Library/Application Support/rancher-desktop-2` |
| `RDD_CONFIG` | RDD control plane config file | `~/Library/Application Support/rancher-desktop-2/config.yaml` |
| `RDD_DOCKER_SOCKET` | Host-side Docker socket | `~/.rd2/docker.sock` |
| `RDD_KUBECTL_CACHE` | Cache directory for downloaded kubectl binaries (shared across instances) | `~/Library/Caches/rancher-desktop/kubectl/darwin-arm64` |
| `RDD_LIMA_HOME` | Lima home directory | `~/.rd2/lima` |

RDD_KUBECTL_CACHE (line 52) is correctly placed: kuberlr only reads RDD_CACHE_DIR and always derives kubectl/<os>-<arch>/, so setting RDD_KUBECTL_CACHE truly has no effect.

docs/design/environment.md

| `RDD_ARGS_FILE` | Saved startup arguments | `~/Library/Application Support/rancher-desktop-2/args.json` |
| `RDD_CACHE_DIR` | rdd-wide cache root (shared across instances) | `~/Library/Caches/rancher-desktop` |
| `RDD_DIR` | Service instance directory | `~/Library/Application Support/rancher-desktop-2` |
| `RDD_CONFIG` | RDD control plane config file | `~/Library/Application Support/rancher-desktop-2/config.yaml` |
| `RDD_DOCKER_SOCKET` | Host-side Docker socket | `~/.rd2/docker.sock` |
| `RDD_KUBECTL_CACHE` | Cache directory for downloaded kubectl binaries (shared across instances) | `~/Library/Caches/rancher-desktop/kubectl/darwin-arm64` |
| `RDD_LIMA_HOME` | Lima home directory | `~/.rd2/lima` |
| `RDD_LOG_DIR` | Log directory | `~/Library/Logs/rancher-desktop-2` |
| `RDD_PID_FILE` | Service PID file | `~/Library/Application Support/rancher-desktop-2/rdd.pid` |
| `RDD_SHORT_DIR` | Short directory path | `~/.rd2` |
| `RDD_TLS_DIR` | TLS certificate directory | `~/Library/Application Support/rancher-desktop-2/tls` |

The same gap carries through cmd_service.md. The example output at line 154 shows a cache_dir row, and instancePaths() at cmd/rdd/service_paths.go:32 returns cache_dir alongside kubectl_cache. The key table at lines 127–137 added kubectl_cache (line 137) but skipped cache_dir, so the table no longer matches rdd svc paths output.

cmd/rdd/service_paths.go

		"tls_dir":       instance.TLSDir(),
		"config":        instance.Config(),
		"pid_file":      instance.PIDFile(),
		"args_file":     instance.ArgsFile(),
		"docker_socket": instance.DockerSocket(),
		"cache_dir":     kuberlr.CacheRoot(),
		"kubectl_cache": kuberlr.CacheDir(),
	}
}

const (

Fix: qualify the Path Variables preamble (or drop RDD_CACHE_DIR from the path table and reference the Configuration Variables entry), and add a cache_dir row to the cmd_service.md key table.

-`rdd svc paths --output=shell` exports these variables. They reflect
-the paths RDD uses for the current instance; setting them has no effect
-on RDD's behavior.
+`rdd svc paths --output=shell` exports these variables. Most are
+informational; `RDD_CACHE_DIR` is also honored as the cache-root
+override (see Configuration Variables above).

 | `pid_file` | PID file path |
 | `args_file` | Saved arguments file path |
+| `cache_dir` | rdd-wide cache root (shared across instances) |
 | `kubectl_cache` | Cache directory for downloaded kubectl binaries (shared across instances) |

S2. First-run rdd kubectl blocks silently for 30 s+ on slow links Claude¶

pkg/kuberlr/downloader.go

}

// ensureCached returns the path to a cached kubectl matching want. If no
// matching binary exists on disk, ensureCached downloads one from the
// upstream mirror and verifies its sha512 before publishing it atomically.
func ensureCached(ctx context.Context, want semver.Version) (string, error) {
	path := cachedPath(want)
	if _, err := os.Stat(path); err == nil {
		return path, nil
	} else if !errors.Is(err, os.ErrNotExist) {
		return "", err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", err
	}
	if err := download(ctx, want, path); err != nil {
		return "", err
	}
	return path, nil
}

// cachedPath returns the absolute path of the cached kubectl for v.
func cachedPath(v semver.Version) string {
	name := fmt.Sprintf("kubectl-v%d.%d.%d", v.Major, v.Minor, v.Patch)
	if runtime.GOOS == "windows" {

Default log level is warn (cmd/rdd/main.go:90). The "using cached kubectl" line in kubectlAction (cmd/rdd/kubectl.go:77) and any Debug/Info logging in download stay invisible at the default level. A first-run cache miss against dl.k8s.io over a slow link blocks the terminal for 30 s or more for a ~50 MB binary; near the 5 min downloadTimeout it looks identical to a hung process. The user has no signal that the resolver is working.

cmd/rdd/main.go

	if logLevel == "" {
		// Default log level: warn for regular mode, debug for developer mode
		if developer.Mode() {
			logLevel = "debug"
		} else {
			logLevel = "warn"
		}
	}
	level, err := logrus.ParseLevel(logLevel)
	if err != nil {
		return err

cmd/rdd/kubectl.go

		// with a nil error, so Resolve returns "" and the embedded
		// kubectl handles client-only commands.
		return fmt.Errorf("resolving kubectl version: %w", err)
	}
	if path != "" {
		logrus.WithField("path", path).Debug("using cached kubectl")
		return kuberlr.Exec(cmd.Context(), path, args)
	}
	os.Args = os.Args[1:]
	command := kubectlcmd.NewDefaultKubectlCommand()
	if err := cli.RunNoErrOutput(command); err != nil {

Fix: emit one warn-level line before the download starts, naming the version and the mirror. The line shows up at the default log level and disappears once the cache hits.

     if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
         return "", err
     }
+    logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
     if err := download(ctx, want, path); err != nil {
         return "", err
     }

S3. CacheRoot panic now reaches rdd svc paths; declined in Rounds 1–3 but the call surface materially changed this round Claude Gemini¶

pkg/kuberlr/cache.go

}

// CacheRoot returns the rdd-wide cache root. RDD_CACHE_DIR overrides the
// OS default. Future cache consumers (k3s images, Lima templates) should
// append their own subdirectory inside this root.
func CacheRoot() string {
	if root := os.Getenv(envCacheDir); root != "" {
		return root
	}
	cache, err := os.UserCacheDir()
	if err != nil {
		panic(fmt.Errorf("could not determine user cache directory: %w", err))
	}
	return filepath.Join(cache, "rancher-desktop")
}

The panic on os.UserCacheDir() failure is the same finding declined in Rounds 1, 2, and 3 ("declined: failure mode requires neither $HOME nor %LocalAppData% to resolve"). The blast radius materially changed this round: instancePaths() at cmd/rdd/service_paths.go:32-33 now calls kuberlr.CacheRoot() and kuberlr.CacheDir() unconditionally, so rdd svc paths (basic introspection, unrelated to the kubectl resolver) panics with a stack trace for any user whose $HOME or %LocalAppData% is unset.

cmd/rdd/service_paths.go

		"tls_dir":       instance.TLSDir(),
		"config":        instance.Config(),
		"pid_file":      instance.PIDFile(),
		"args_file":     instance.ArgsFile(),
		"docker_socket": instance.DockerSocket(),
		"cache_dir":     kuberlr.CacheRoot(),
		"kubectl_cache": kuberlr.CacheDir(),
	}
}

const (
	outputTable = "table"

The trigger remains rare in normal desktop environments. It is reachable from stripped CI containers, restricted services, and sudo -i shells without HOME forwarding. instance.Dir() and the other path helpers do not panic on similar failures; aligning CacheRoot with the rest of the codebase keeps the surface uniform.

Fix: change the signature to (string, error), propagate from CacheDir(), and have instancePaths return (map[string]string, error). Resolve would treat a CacheRoot error like an unparseable embedded version — log and fall through to embedded.

S4. local_setup_file comment misnames the EXE mechanism — commands.bash sets but does not export Claude¶

bats/tests/10-cli/7-kubectl-cache.bats

}

local_setup_file() {
    GOOS=$(go env GOOS)
    GOARCH=$(go env GOARCH)
    # EXE is exported by commands.bash: ".exe" on Windows, empty
    # elsewhere. Don't shadow it locally.

    # Stage the fake kubectl into the mirror tree at the path the
    # resolver will GET when it sees serverVersion=v1.99.0.
    MIRROR_ROOT=${BATS_FILE_TMPDIR}/mirror
    MIRROR_BIN_DIR=${MIRROR_ROOT}/release/v1.99.0/bin/${GOOS}/${GOARCH}

commands.bash sets EXE="" at line 1 and EXE=".exe" at line 5 with no export. EXE is in scope for @test blocks because BATS sources commands.bash once and @test runs in the same shell. The variable inherits, but the test re-exports EXE later in the same local_setup_file (line 99, alongside the other export GOOS GOARCH EXE ...) so the rdd subprocess sees it. The comment names commands.bash as the exporter, which misleads the next reader who reorganizes BATS helpers.

bats/tests/10-cli/7-kubectl-cache.bats

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: SUSE LLC
# SPDX-FileCopyrightText: The Rancher Desktop Authors

# End-to-end test of the rdd kubectl version resolver's download / cache
# lifecycle. A fake apiserver advertises an out-of-skew version (v1.99.0),
# and a fake mirror serves a matching fake kubectl. The Go server under
# the sibling fake-kube/server directory plays both roles. The fake
# kubectl prints "fake-kubectl: <args>"; a passing test proves the
# resolver downloaded, sha-verified, and exec'd it.

bats/tests/10-cli/7-kubectl-cache.bats

current-context: fake
EOF

    CACHE_DIR=${BATS_FILE_TMPDIR}/cache

    export GOOS GOARCH EXE MIRROR_ROOT MIRROR_BIN_DIR LOG_FILE KUBECONFIG_PATH CACHE_DIR PORT GIT_VERSION_FILE VERSION_STATUS_FILE
}

local_teardown_file() {
    # Deviates from the "no teardown_file" rule. The rule's intent is to
    # preserve rdd state for post-mortem inspection; an HTTP server bound

Fix:

-    # EXE is exported by commands.bash: ".exe" on Windows, empty
-    # elsewhere. Don't shadow it locally.
+    # EXE is set in commands.bash (".exe" on Windows, empty elsewhere)
+    # and re-exported below for the rdd subprocess. Don't shadow it
+    # locally.

S5. SkipResolver mutates the global process env where a package-level bool would express the same intent without leaking into spawned children Claude¶

pkg/kuberlr/resolver.go

// envSkipResolver tells Resolve to short-circuit. Exec sets it on the
// kubectl child process so a downloaded kubectl that re-execs us through
// a shim cannot recurse back into version resolution.
const envSkipResolver = "RDD_KUBECTL_RESOLVED"

// SkipResolver short-circuits Resolve for the rest of this process.
// rdd ctl calls this before kubectlAction because it always targets
// the embedded apiserver, whose version matches the embedded kubectl
// by construction — the probe would always fall through anyway.
func SkipResolver() {
	os.Setenv(envSkipResolver, "1")
}

// serverProbeTimeout caps the discovery call. A reachable cluster answers
// in under 100 ms; an unreachable one would otherwise stall every kubectl
// invocation.
const serverProbeTimeout = 2 * time.Second

SkipResolver's docstring describes a same-process toggle. The implementation mutates the global process env, which leaks into every subprocess spawned after the call. The current callers (only rdd ctl invokes it before kubectlAction, which never re-spawns rdd) make this harmless today, but the function name and comment hide the side effect. A future caller who does spawn unrelated subprocesses inherits the leaked env without warning.

A package-level bool checked inside Resolve matches the documented intent and removes the coupling. The existing envSkipResolver env var still covers the recursion guard for the kubectl child, which Exec sets explicitly in exec_unix.go:22 and exec_windows.go:28.

pkg/kuberlr/exec_unix.go

// the current environment plus a recursion guard. The kubectl process
// inherits our PID, so the shell sees its exit status directly. ctx is
// accepted for signature parity with the Windows variant; syscall.Exec
// replaces the process so cancellation no longer applies.
func Exec(_ context.Context, path string, args []string) error {
	env := append(os.Environ(), envSkipResolver+"=1")
	if err := syscall.Exec(path, append([]string{path}, args...), env); err != nil {
		return fmt.Errorf("exec %s: %w", path, err)
	}
	return nil
}

pkg/kuberlr/exec_windows.go

// Ctrl+C and Ctrl+Break reach kubectl directly without explicit signal
// forwarding. The recursion guard prevents kubectl from looping through
// Resolve if it ever re-execs us.
func Exec(ctx context.Context, path string, args []string) error {
	cmd := exec.CommandContext(ctx, path, args...)
	cmd.Env = append(os.Environ(), envSkipResolver+"=1")
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	var exitErr *exec.ExitError

Fix:

-func SkipResolver() {
-    os.Setenv(envSkipResolver, "1")
-}
+var skipResolver bool
+
+// SkipResolver short-circuits Resolve for the rest of this process.
+// rdd ctl calls this before kubectlAction because it always targets
+// the embedded apiserver, whose version matches the embedded kubectl
+// by construction — the probe would always fall through anyway.
+func SkipResolver() {
+    skipResolver = true
+}

In Resolve, replace the env check with if skipResolver || os.Getenv(envSkipResolver) != "".

S6. serverVersion declares an error return but every path returns nil; Resolve's if err != nil is dead Gemini¶

pkg/kuberlr/resolver.go

		// semver parsing. Fall through to the embedded kubectl rather
		// than break every dev invocation.
		logrus.WithError(err).Debug("kubectl resolver: embedded version not parseable; using embedded kubectl")
		return "", nil
	}
	server, ok, err := serverVersion(args)
	if err != nil {
		return "", err
	}
	if !ok {
		return "", nil
	}
	if compatible(embedded, server) {
		return "", nil

serverVersion has five return paths. Four of them return (zero, false, nil) with //nolint:nilerr annotations documenting the deliberate fall-through; the fifth returns (v, true, nil) on success. No path returns a non-nil error. The if err != nil check at Resolve line 83 cannot fire; the nilerr suppressions even document the function's deliberate "always nil" contract.

Two coherent options: drop the error return entirely (making serverVersion a (semver.Version, bool)-returning helper that matches its actual contract), or keep the scaffold and document it as deliberate forward-compatibility. The former matches what the code does; the latter matches a hypothetical future where some probe failure should abort the resolver instead of falling through.

Fix (drop the dead path):

-func serverVersion(args []string) (_ semver.Version, ok bool, _ error) {
+func serverVersion(args []string) (semver.Version, bool) {
     ...
-    return semver.Version{}, false, nil //nolint:nilerr // ...
+    return semver.Version{}, false
     ...
-    return v, true, nil
+    return v, true
 }

And in Resolve:

-    server, ok, err := serverVersion(args)
-    if err != nil {
-        return "", err
-    }
+    server, ok := serverVersion(args)
     if !ok {
         return "", nil
     }

Design Observations ¶

Concerns ¶

Mirror checksum gives no security against a compromised mirror (future) Gemini. download fetches the kubectl binary and its .sha512 from the same mirror (dl.k8s.io or RDD_KUBECTL_MIRROR); the hash protects against truncation and disk corruption but not against a mirror serving a malicious binary alongside a matching malicious checksum. SIG Release signs binaries with Sigstore/Cosign. Verifying the Cosign signatures (in addition to or in place of the mirror's own checksum) closes the supply-chain gap. Out of scope for this PR; flagging as durable follow-up.
Hand-rolled subcommand walker duplicates the bound pflag set (future, persistent) [carry-over from Rounds 3–4]. parseKubeConfigFlags already binds the kubectl global flags via genericclioptions.ConfigFlags.AddFlags; isClientOnly re-walks args manually with kubectlGlobalFlagsTakingValue, a hand-maintained value-taking flag map. Round 4's S5 (--as-user-extra) was the third drift in three rounds. Driving the walker from the same FlagSet (fs.Args() after parse returns the positional tokens) eliminates the map and the drift class. Trade-off: pflag with UnknownFlags=true may consume an unknown spaced-form flag's value as a positional token, so the walker needs a small validation step before extracting the subcommand. Round 5 surfaces no new drift, but the design observation persists.

Strengths ¶

Round 4 fixes landed without regression ClaudeCodex. ALL_KEYS now covers cache_dir and docker_socket; cmd_service.md consistently uses config.yaml; fetchSha512 lowercases the digest; --as-user-extra is in kubectlGlobalFlagsTakingValue with covering rows in TestIsClientOnly; the line-numbered comment in fetchSha512 is replaced with a name; the --insecure-skip-tls-verify=false row is gone; the eviction TODO covers .kubectl-* leftovers.
Body caps at the right place ClaudeGemini. streamGet wraps the response in io.LimitReader with separate caps for the digest (4 KiB) and the binary (256 MiB). A misconfigured mirror serving the binary at the .sha512 URL fails fast with a clear error rather than streaming megabytes into a strings.Builder.
Hash before mutation Claude. download feeds the response body through MultiWriter(tmp, sha512.New()) and verifies the digest before any chmod or rename. A failed verification leaves no executable in place for the next call to mistake for a valid binary.
Reusing kubectl's own flag binding ClaudeCodex. parseKubeConfigFlags binds genericclioptions.NewConfigFlags(true) rather than reimplementing kubectl's connection-override parsing. The probe and the kubectl child target the same cluster by construction.
Conservative bias in isClientOnly ClaudeGemini. When the walker cannot classify, it falls through to the probe. The comment at resolver.go:264-267 explicitly calls out the failure mode of the opposite default — a misclassification toward "skip the probe" silently uses a mismatched embedded kubectl.

Testing Assessment ¶

Coverage gaps surfaced this round, ordered by user-reachable risk:

os.UserCacheDir failure path (S3). The panic in CacheRoot has no test. A unit test that unsets HOME (Linux) or LocalAppData (Windows) and asserts a clean error would lock in the recommended fix.
First-run download UX (S2). No test asserts that the user sees something on cache miss — neither output nor log lines. The fake-kube fixture already serves the binary; a one-line BATS assertion against ${output} after the download case would close the gap.
Concurrent first-use cache miss (declined Rounds 1–4, surfaced again this round by all three agents). Acknowledged limitation.
End-to-end Windows execution path (pkg/kuberlr/exec_windows.go). PR description gates the merge on Windows BATS support arriving; the path remains compile-only.
Cluster reporting unparseable gitVersion Claude. serverVersion logs at warn and falls through, but no test covers it. The fake server already has --git-version-file; writing "not-a-version" and asserting fall-through to embedded kubectl would close the gap.

Verification runs this round: Codex ran make build-rdd and go test ./pkg/kuberlr ./cmd/rdd -count=1 (both passed). Claude verified Round 4 fixes against the current files. Gemini ran the discovery walk only.

Documentation Assessment ¶

docs/design/environment.md carries the contradiction in S1: line 16 documents RDD_CACHE_DIR as a behavior-changing override; line 39 says path-table variables have no effect; line 48 lists RDD_CACHE_DIR in the path table. Internal Variables section (lines 29–35) is well-placed and correctly explains both the recursion guard and the rdd ctl short-circuit reuse of the same env var.
docs/design/cmd_service.md key table at lines 127–137 is missing cache_dir (S1). Round 4's config.json → config.yaml sweep landed cleanly. Example output at lines 152–163 includes both cache_dir and kubectl_cache.
pkg/kuberlr/cache.go package doc at lines 5–9 clearly describes the resolve-or-fall-through model and credits the upstream kuberlr.
parseKubeConfigFlags docstring at pkg/kuberlr/resolver.go:155-173 enumerates the bound flag surface and explains why --username/--password stay unbound.
local_setup_file comment at 7-kubectl-cache.bats:29-30 misnames the EXE-export mechanism (S4).

Commit Structure ¶

Single commit fb0811d. Message describes the change and matches the diff. Clean.

Acknowledged Limitations ¶

Cache eviction not implemented — pkg/kuberlr/cache.go:33-39. The TODO(eviction) covers both growth (~50 MB per minor version) and .kubectl-* temp leftovers from SIGKILL/power loss; Round 4's S9 extended the TODO to name both.
Offline mode not implemented — pkg/kuberlr/resolver.go:61-64, pkg/kuberlr/downloader.go:46-47. TODOs cover the planned env-var/config opt-out for download attempts.
Windows BATS host deferred — bats/tests/10-cli/7-kubectl-cache.bats:14-20. PR description gates the merge on Windows BATS support; the resolver's Windows-specific code paths remain compile-only.
local_teardown_file() deviation from BATS convention — bats/tests/10-cli/7-kubectl-cache.bats:102-110. Inline rationale stands.
Falling back to embedded on any cluster-probe failure — pkg/kuberlr/resolver.go:113-144. Each nilerr branch is documented; the design choice is sound (S6 proposes simplifying the dead-error scaffolding without changing the policy).
Embedded version unparseable in dev builds — pkg/kuberlr/resolver.go:73-80. go run ./cmd/rdd kubectl ... and IDE debug builds skip -ldflags -X; falling through to embedded kubectl rather than breaking every dev invocation is the right call.
--username/--password deliberately unbound on the resolver side — pkg/kuberlr/resolver.go:165-173. Binding them via WithDeprecatedPasswordFlag would force the interactive kubeconfig loader; the resolver's silent fall-through on probe failure covers the asymmetry.
RDD_KUBECTL_MIRROR accepts non-HTTPS URLs Claude. Implicit. Acknowledged in mirrorURL() (the BATS test relies on http://127.0.0.1:port). For users who set this to a non-localhost http URL, MITM is possible. Trusting the env-var owner is consistent with other RDD_* overrides.
Empty-message cliexit.Error contract — cmd/rdd/main.go:159-160. The contract is clear in context but currently single-call-site (only kuberlr.Exec produces one).

Declined in prior rounds ¶

Concurrent first-use Windows os.Rename race — declined Rounds 1, 2, 3, 4. Raised this round by all three agents (Codex I1, Gemini I1, Claude S3); dropped per the resolutions gate. The recurring re-raises strengthen the case for an inline TODO documenting the Windows file-mapping subtlety so future agents recognize the deliberate design choice.
RDD_CACHE_DIR scope rename — declined Round 1: deliberate forward-looking knob.
Server log.Fatal* bypassing defer logFD.Close() — declined Round 1: test fixture; OS reclaims fd.
Pure-Go unit coverage thin on download/cache machinery — declined Round 2: BATS suite covers the integration concerns.
BATS coverage for HTTP timeout/hang, network cancel, Windows Exec, sha512sum-format mirror, trailing-slash mirror, default-log-level message capture, unreachable-server probe — declined Rounds 1–3.
info.GitVersion empty/unknown unit test — declined Round 1: requires mocked discovery client.
Resolver-policy summary and recursion-guard rationale duplicated across files — declined Round 1: each duplicate serves a different audience.
SkipResolver swallows os.Setenv error — declined Round 3: stylistic only; constant env name and value cannot fail.
Cache the server-version probe across invocations — declined Round 3: performance optimisation; explicit follow-up.
Move package doc-comment from cache.go to resolver.go — declined Round 3: cosmetic; go doc and pkg.go.dev render either location equally.
Drop is_macos branch in write_kubectl_sha512 — declined Round 3: both branches produce identical output; conditional is harmless.
Round 4 S6: serverVersion ignores ctx — skipped Round 4: UX gap bounded by 2 s probe timeout.
Round 4 S10: 5 min downloadTimeout covers headers + body — skipped Round 4: speculative.

Unresolved Feedback ¶

@mook-as on cmd/rdd/service_paths.go:33 — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/348#discussion_r3182959199: "Would it be possible to expose RDD_CACHE_DIR here too?" Resolved. Both cache_dir (the rdd-wide root) and kubectl_cache (the kubectl-specific subdir) appear in instancePaths() (service_paths.go:32-33), the environment.md Path Variables table (lines 48, 52), and the cmd_service.md example output (lines 154, 158). The cmd_service.md key table still misses cache_dir (S1).

Agent Performance Retro ¶

Claude ¶

Claude carried this round's coverage: five Suggestions retained (S1 docs shared with Codex, S2 download UX, S3 panic with the blast-radius re-raise, S4 BATS comment, S5 SkipResolver env coupling). The S3 re-raise was the round's best work — Claude noticed that instancePaths() now calls kuberlr.CacheRoot() unconditionally, so the same panic that prior rounds declined as too narrow now reaches rdd svc paths. That call-surface check is exactly the discipline the resolutions gate asks for. Two findings declined or downgraded: I1 (download UX) was over-calibrated as Important — the gap is real but bounded; S5 (drive-by config.json → config.yaml in cmd_service.md) was rejected because Round 4's S3 (raised by Claude itself) explicitly asked for that rename, and re-flagging it as scope creep after it landed moves the goalposts. Claude's S3 (Windows rename race) is the fourth consecutive re-raise from Claude on a finding declined four rounds running.

Codex ¶

Codex produced one shared Important (Windows rename race, dropped per gate) and one shared Suggestion (S1 docs contradiction). The S1 work was tight: Codex spotted both the environment.md self-contradiction and the cmd_service.md key-table omission in a single finding, where Claude reported only the environment.md half. Codex also ran make build-rdd and go test ./pkg/kuberlr ./cmd/rdd in its worktree (both passed) — that verification work is hard to parallelize across agents and Codex consistently provides it. Tight one-Important / one-Suggestion shape, zero false positives.

Gemini ¶

Gemini's round was on-target after Round 4's hallucination. C-level findings disappeared; the two real findings (I1 Windows rename race shared with the others, S2 panic on os.UserCacheDir) were both legitimate. S1 (dead error return in serverVersion) is a fresh Suggestion that neither other agent caught — verified by reading every return path in the function and confirming all four nilerr branches plus the success path return nil. The supply-chain Cosign observation in Design Concerns is a future-scope idea worth carrying forward. No false positives this round; Gemini's coverage tightened up materially compared to Round 4.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.5	Gemini 3.1 Pro
Duration	16m 22s	8m 02s	—
Findings	5S	1S	2S
Tool calls	35 (Bash 17, Read 17, Grep 1)	74 (shell 66, stdin 8)	—
Design observations	5	1	2
False positives	0	0	0
Unique insights	4	0	2
Files reviewed	19	19	19
Coverage misses	0	0	0
Totals	5S	1S	2S
Downgraded	2 (I→S)	0	0
Dropped	2	1	1

Codex provided the tightest signal-per-finding ratio (one shared S1 was the round's most actionable). Claude provided the most coverage (five retained Suggestions, including the S3 blast-radius re-raise). Gemini contributed one defensive observation (S6 dead error scaffold) plus one shared finding (S3 panic).

Net Round 5 contribution after the resolutions gate: 0 Critical, 0 Important, 6 Suggestions, 2 design concerns.

#### Reconciliation

Claude I1 download UX: important → suggestion S2 (UX gap, bounded by the 5 min downloadTimeout ceiling; no correctness break).
Claude I2 CacheRoot panic: important → suggestion S3 (rare trigger; the re-raise is justified by the new rdd svc paths blast radius, but severity stays at Suggestion).
Claude S3 Windows os.Rename race: suggestion → dropped (declined Rounds 1, 2, 3, 4).
Claude S5 drive-by config.yaml rename: suggestion → dropped (Round 4's S3 — raised by Claude itself — explicitly asked for the rename, which then landed).
Codex I1 concurrent cold-cache publishes: important → dropped (declined Rounds 1, 2, 3, 4; same finding as above).
Gemini I1 Windows rename race: important → dropped (declined Rounds 1, 2, 3, 4; same finding as above).
Gemini S2 CacheRoot panic: kept and merged with Claude I2 into S3.

Review Process Notes ¶

Skill improvements ¶

Add a "blast-radius check on declined re-raises" rule to the resolutions gate. A finding declined in earlier rounds can become valid again when later commits widen its trigger surface, even if the offending lines themselves are unchanged. Today the gate says "re-raise only when the surrounding code has materially changed since the decline," which an agent reading "code unchanged" reads as "drop." Strengthen to: "Before dropping a re-raised declined finding, list every caller of the cited symbol on the current SHA and on the SHA that produced the original decline. If new callers exist on the current SHA, the blast radius has changed; treat that as material code change and re-raise (with severity recalibrated for the new surface). Recording the call-site count alongside the decline reason in the Resolution table makes the comparison mechanical." Future panic-on-startup or fatal-on-rare-input findings carry the same shape: the bug is "rare" until it isn't, and the trigger surface is what shifts.

Repo context updates ¶

None this round. The existing deep-review-context.md carried Codex S1 (environment.md Path Variables preamble vs. behavior-changing override) into reach because the BATS style and CRD-default rules already train agents to read documentation tables against implementation; the same instinct surfaced the contradiction here.