Deep Review: 20260502-225351-pr-348

Date	2026-05-02 22:53
Repo	rancher-sandbox/rancher-desktop-daemon
Round	1
Author	@jandubois
PR	#348 — rdd kubectl: embed kuberlr-style version resolver
Branch	`embed-kuberlr`
Commits	`9624299` rdd kubectl: embed kuberlr-style version resolver
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh)
Verdict	Merge with fixes — fix download timeout and BATS sha512-fixture restoration before merging; the rest are small follow-ups.
Wall-clock time	`44 min 29 s`

Executive Summary ¶

The PR embeds a kuberlr-style kubectl version resolver: when the cluster is more than ±1 minor out of skew, rdd kubectl downloads a version-matched kubectl from dl.k8s.io, sha512-verifies it against the published checksum, caches it under os.UserCacheDir(), and execs it in place. Within skew, the embedded kubectl runs. The implementation is small (~280 lines), the responsibilities are cleanly partitioned across cache.go/downloader.go/resolver.go/exec_*.go, and the BATS lifecycle test is a good fit for the resolver's contract.

The two Important findings are both narrow: the download HTTP path lacks a per-request deadline (a hung mirror wedges the CLI), and the sha512-mismatch BATS test corrupts the shared mirror fixture without restoring it. Neither blocks merge. Suggestions cover smaller policy/style points and a couple of TODO-worthy enhancements (cache eviction, doc updates).

Structure: 0 Critical · 2 Important · 11 Suggestions.

Critical Issues ¶

None.

Important Issues ¶

I1. Mirror download has no deadline; a slow or hung mirror wedges the CLI Codex Claude¶

pkg/kuberlr/downloader.go

	return strings.TrimSpace(sb.String()), nil
}

// streamGet performs a GET on url and copies the response body into w.
// streamGet returns an error on any non-200 status.
func streamGet(ctx context.Context, url string, w io.Writer) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: %s", url, resp.Status)
	}
	_, err = io.Copy(w, resp.Body)
	return err
}

http.DefaultClient carries no Timeout, and the inherited cmd.Context() from cobra has no deadline either. The apiserver probe at pkg/kuberlr/resolver.go:91 explicitly sets cfg.Timeout = serverProbeTimeout so an unreachable cluster fails fast, but the checksum and binary download paths have no equivalent. A wedged TCP connection or a byte-trickling mirror will hang rdd kubectl until the user kills it — the same failure mode the apiserver-probe timeout was added to prevent.

pkg/kuberlr/resolver.go

func serverVersion(args []string) (_ semver.Version, ok bool, _ error) {
	cfg, err := loadClientConfig(args)
	if err != nil {
		return semver.Version{}, false, nil //nolint:nilerr // no loadable kubeconfig: kubectl can still run client-only commands
	}
	cfg.Timeout = serverProbeTimeout
	client, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return semver.Version{}, false, nil //nolint:nilerr // malformed config: kubectl can still run client-only commands
	}
	info, err := client.ServerVersion()

Both checksum and binary GETs use the same streamGet, so a single bounded deadline (or a private *http.Client instance) covers both. Since the kubectl binary is ~50 MB, pick a deadline generous enough for slow links but short enough that the user notices a hang in minutes, not hours.

Fix:

+const downloadTimeout = 5 * time.Minute
+
+var httpClient = &http.Client{Timeout: downloadTimeout}
+
 func streamGet(ctx context.Context, url string, w io.Writer) error {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, http.NoBody)
 	if err != nil {
 		return err
 	}
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)

Codex flagged this as Important; Claude raised the same point as a Suggestion (S4). I align with Codex on severity — an indefinite hang is a real operator-facing issue, not a stylistic gap.

I2. BATS sha512-mismatch test corrupts the shared mirror fixture without restoring it Claude¶

bats/tests/10-cli/7-kubectl-cache.bats

    assert_output --partial '404'
    refute_output --partial 'fake-kubectl:'
    assert_file_not_exists "${CACHE_DIR}/kubectl/${GOOS}-${GOARCH}/kubectl-v1.99.1${EXE}"
}

@test 'rdd kubectl errors on sha512 mismatch' {
    # Replace the published checksum with a wrong one. The download
    # succeeds, the sha verification fails, and no cache file lands.
    echo "00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" \
        >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"

    run env \
        RDD_CACHE_DIR="${CACHE_DIR}" \
        RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
        KUBECONFIG="${KUBECONFIG_PATH}" \
        rdd kubectl get pods

    assert_failure
    assert_output --partial 'resolving kubectl version'
    assert_output --partial 'checksum mismatch'
    refute_output --partial 'fake-kubectl:'
    assert_file_not_exists "${CACHE_DIR}/kubectl/${GOOS}-${GOARCH}/kubectl-v1.99.0${EXE}"
}

@test 'rdd kubectl falls through to embedded when the apiserver returns 500 on /version' {
    echo 500 >"${VERSION_STATUS_FILE}"

    run -0 env \

local_setup_file() writes the real sha512 once; local_setup() resets CACHE_DIR, LOG_FILE, and the /version override files but does not restore ${MIRROR_BIN_DIR}/kubectl${EXE}.sha512. The mismatch test overwrites that fixture with zeroes and never puts it back. The next test (apiserver returns 500 → fall through to embedded) runs kubectl version --client and never triggers a download, so the corruption is invisible today. Any future test that relies on a working mirror will fail at random distance from the cause.

Fix: restore the published checksum in local_setup, or factor the shasum/sha512sum block out of local_setup_file into a helper used by both:

 local_setup() {
     rm -rf "${CACHE_DIR}"
     : >"${LOG_FILE}"
     rm -f "${GIT_VERSION_FILE}" "${VERSION_STATUS_FILE}"
+    if is_macos; then
+        shasum -a 512 "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
+    else
+        sha512sum "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
+    fi
 }

Suggestions ¶

S1. Flag scanner does not stop at kubectl's -- argument delimiter Codex Claude¶

pkg/kuberlr/resolver.go

// extractKubeFlags scans args for the kubectl flags Resolve cares about
// (--kubeconfig and --context) and returns their values. Both the spaced
// form (--kubeconfig path) and the equals form (--kubeconfig=path) match.
// A flag missing from args yields an empty string, which clientcmd treats
// as "use the default".
func extractKubeFlags(args []string) (kubeconfig, context string) {
	for i, arg := range args {
		switch {
		case arg == "--kubeconfig" && i+1 < len(args):
			kubeconfig = args[i+1]
		case strings.HasPrefix(arg, "--kubeconfig="):
			kubeconfig = strings.TrimPrefix(arg, "--kubeconfig=")
		case arg == "--context" && i+1 < len(args):
			context = args[i+1]
		case strings.HasPrefix(arg, "--context="):
			context = strings.TrimPrefix(arg, "--context=")
		}
	}
	return kubeconfig, context
}

extractKubeFlags scans every argument. cmd/rdd/kubectl.go:60 passes the same raw args to Resolve and to the kubectl child, but kubectl treats arguments after -- as positional. rdd kubectl exec mypod -- mycmd --kubeconfig=/other makes the resolver probe /other, while the real kubectl invocation still targets the original kubeconfig. The current consequence is benign — a stray probe goes to a context the user never asked for, returns ok=false, and the resolver falls through — but the parsed values lie about the user's intent.

cmd/rdd/kubectl.go

	}
	return command
}

func kubectlAction(cmd *cobra.Command, args []string) error {
	path, err := kuberlr.Resolve(cmd.Context(), args)
	if err != nil {
		// A download or sha-mismatch failure surfaces here. We refuse
		// to silently fall back to the embedded kubectl — running a
		// version-mismatched binary against the user's cluster would
		// hide the failure behind weird kubectl errors. Apiserver-probe

Fix: stop scanning at the first --, and add unit tests covering both the spaced and = forms after the delimiter.

 for i, arg := range args {
+    if arg == "--" {
+        break
+    }
     switch {

S2. serverVersion parse failure errors out instead of falling through Claude¶

pkg/kuberlr/resolver.go

	}
	info, err := client.ServerVersion()
	if err != nil {
		return semver.Version{}, false, nil //nolint:nilerr // unreachable cluster: kubectl can still run client-only commands
	}
	v, err := semver.ParseTolerant(info.GitVersion)
	if err != nil {
		return semver.Version{}, false, fmt.Errorf("parsing server version %q: %w", info.GitVersion, err)
	}
	return v, true, nil
}

// loadClientConfig builds a rest.Config that honors KUBECONFIG plus any
// --kubeconfig and --context flags in args, the way kubectl itself

The other three failure modes — no kubeconfig, malformed config, unreachable cluster — return ok=false, err=nil and let the embedded kubectl handle client-only commands. An unparseable info.GitVersion is logically the same situation: with no comparable version, the resolver cannot make an informed decision, so the documented fall-through policy applies. Today this branch errors out and blocks every kubectl invocation against that cluster, including kubectl config view and kubectl version --client that never needed a server version.

Fix:

     v, err := semver.ParseTolerant(info.GitVersion)
     if err != nil {
-        return semver.Version{}, false, fmt.Errorf("parsing server version %q: %w", info.GitVersion, err)
+        return semver.Version{}, false, nil //nolint:nilerr // unparseable server version: fall through to embedded
     }

S3. compatible checks Minor only; cross-major skew slips through Claude¶

pkg/kuberlr/resolver.go

	return ensureCached(ctx, server)
}

// compatible reports whether a kubectl of client.Minor can drive a server
// of server.Minor under the standard Kubernetes skew of ±1 minor.
func compatible(client, server semver.Version) bool {
	diff := int64(client.Minor) - int64(server.Minor)
	return diff >= -1 && diff <= 1
}

// embeddedVersion reads the kubectl version baked into the rdd binary
// from k8s.io/component-base/version, populated by -X linker flags in
// the Makefile.
func embeddedVersion() (semver.Version, error) {

A future server at v2.30.0 against an embedded kubectl v1.30.0 computes diff = 0 and returns true, silently routing through a major-mismatched embedded kubectl. Kubernetes 2.x has no realistic horizon, so the impact today is theoretical, but the fix is one line and the unit table at resolver_test.go:26-33 only covers Major == 1.

pkg/kuberlr/resolver_test.go

		name   string
		client semver.Version
		server semver.Version
		want   bool
	}{
		{"same minor", v(1, 30), v(1, 30), true},
		{"client one ahead", v(1, 31), v(1, 30), true},
		{"client one behind", v(1, 29), v(1, 30), true},
		{"client two ahead", v(1, 32), v(1, 30), false},
		{"client two behind", v(1, 28), v(1, 30), false},
		{"server at zero, client at zero", v(1, 0), v(1, 0), true},
		{"server at zero, client at one", v(1, 1), v(1, 0), true},
		{"patch differences ignored", semver.Version{Major: 1, Minor: 30, Patch: 99}, semver.Version{Major: 1, Minor: 30, Patch: 0}, true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			assert.Equal(t, compatible(tc.client, tc.server), tc.want)
		})

Fix:

 func compatible(client, server semver.Version) bool {
+    if client.Major != server.Major {
+        return false
+    }
     diff := int64(client.Minor) - int64(server.Minor)
     return diff >= -1 && diff <= 1
 }

Add a test case: {"different major", v(1, 30), semver.Version{Major: 2, Minor: 30}, false}.

S4. New user-facing path key and env vars are not in the design docs Codex¶

docs/design/cmd_service.md

# Service Commands

Service commands are for working with the control plane. The [Service API](api_service.md) document has additional information about the control plane internals, like startup and shutdown logic.

## Directories

The PR adds the kubectl_cache service path (cmd/rdd/service_paths.go:32) and introduces operator-facing RDD_CACHE_DIR (pkg/kuberlr/cache.go:22) and RDD_KUBECTL_MIRROR (pkg/kuberlr/downloader.go:30). docs/design/cmd_service.md lists no kubectl_cache key, and docs/design/environment.md describes neither variable. Verified against the worktree: neither file contains the new identifiers.

cmd/rdd/service_paths.go

		"tls_dir":       instance.TLSDir(),
		"config":        instance.Config(),
		"pid_file":      instance.PIDFile(),
		"args_file":     instance.ArgsFile(),
		"docker_socket": instance.DockerSocket(),
		"kubectl_cache": kuberlr.CacheDir(),
	}
}

const (
	outputTable = "table"

pkg/kuberlr/cache.go

)

// envCacheDir lets tests (and operators) override the rdd-wide cache root.
// Anticipates a shared rdd cache that future consumers (k3s images, lima
// templates) can join — kuberlr appends its own subdirectory.
const envCacheDir = "RDD_CACHE_DIR"

// CacheDir returns the directory holding downloaded kubectl binaries. All
// rdd instances share this cache; kubectl does not vary per instance.
// Setting RDD_CACHE_DIR overrides the rdd-wide root; kuberlr always
// appends kubectl/<os>-<arch>/ inside that root.

pkg/kuberlr/downloader.go

const defaultKubeMirror = "https://dl.k8s.io"

// envKubeMirror lets tests (and operators) point the resolver at an
// alternate mirror. Useful for offline mirrors and for the BATS cache
// lifecycle test, which serves a fake binary from a local HTTP server.
const envKubeMirror = "RDD_KUBECTL_MIRROR"

// mirrorURL returns the mirror base URL the resolver downloads from.
//
// TODO(offline): pair this with a "may we download?" toggle so air-gapped
// users can pre-populate CacheDir() and forbid network fetches.

Fix: add kubectl_cache to the paths table in cmd_service.md, and document RDD_CACHE_DIR (rdd-wide cache root with kubectl appending its own subdirectory) and RDD_KUBECTL_MIRROR (alternate release mirror, default https://dl.k8s.io) in environment.md. The path-vars table in environment.md should also gain RDD_KUBECTL_CACHE so rdd svc paths --output=shell is fully described.

S5. RDD_CACHE_DIR is documented as the rdd-wide cache root, but only kubectl writes to it Claude¶

pkg/kuberlr/cache.go

	"os"
	"path/filepath"
	"runtime"
)

// envCacheDir lets tests (and operators) override the rdd-wide cache root.
// Anticipates a shared rdd cache that future consumers (k3s images, lima
// templates) can join — kuberlr appends its own subdirectory.
const envCacheDir = "RDD_CACHE_DIR"

// CacheDir returns the directory holding downloaded kubectl binaries. All
// rdd instances share this cache; kubectl does not vary per instance.
// Setting RDD_CACHE_DIR overrides the rdd-wide root; kuberlr always

The variable name promises an rdd-wide root, but kubectl is the sole consumer. If a future cache user adopts a different env var, the "rdd-wide" promise breaks and operators have to learn a second variable. Either rename to RDD_KUBECTL_CACHE_DIR so the name matches the actual scope (and reserve RDD_CACHE_DIR for the day a real shared cache lands), or scope the comment: "kubectl cache root; future consumers may layer their own subdirectories under the same value."

S6. BATS test uses grep -q where the suite uses assert_file_contains Claude¶

bats/tests/10-cli/7-kubectl-cache.bats

        KUBECONFIG="${KUBECONFIG_PATH}" \
        rdd kubectl get pods

    assert_output --partial 'fake-kubectl: get pods'
    assert_file_executable "${cached}"
    grep -q '^GET /version' "${LOG_FILE}"
    grep -q "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}.sha512$" "${LOG_FILE}"
    grep -q "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}$" "${LOG_FILE}"
}

@test 'rdd kubectl skips the download when the cache already has a match' {
    cache_subdir=${CACHE_DIR}/kubectl/${GOOS}-${GOARCH}
    mkdir -p "${cache_subdir}"

bats/helpers/utils.bash:91 defines refute_file_contains; the suite-wide assert_file_contains is used by bats/tests/20-service/4-start-parameters.bats and bats/tests/20-service/5-port-fallback.bats. The bats-assert idioms surface both the expected pattern and the actual file contents on failure, while bare grep -q only sets the exit status.

bats/helpers/utils.bash


refute_file_exists() {
    assert_file_not_exists "$@"
}

refute_file_contains() {
    assert_file_not_contains "$@"
}

########################################################################

Fix sketch:

-    grep -q '^GET /version' "${LOG_FILE}"
-    grep -q "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}.sha512$" "${LOG_FILE}"
-    grep -q "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}$" "${LOG_FILE}"
+    assert_file_contains "${LOG_FILE}" '^GET /version'
+    assert_file_contains "${LOG_FILE}" "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}.sha512$"
+    assert_file_contains "${LOG_FILE}" "^GET /release/v1.99.0/bin/${GOOS}/${GOARCH}/kubectl${EXE}$"

S7. GOOS/GOARCH set on the kubectl build, not on the server build Claude¶

bats/tests/10-cli/7-kubectl-cache.bats

    MIRROR_ROOT=${BATS_FILE_TMPDIR}/mirror
    MIRROR_BIN_DIR=${MIRROR_ROOT}/release/v1.99.0/bin/${GOOS}/${GOARCH}
    mkdir -p "${MIRROR_BIN_DIR}"
    # Build inside the helper module so go.mod resolution picks up
    # bats/fake-kube/go.mod, not the parent rancher-desktop-daemon one.
    (
        cd "${BATS_TEST_DIRNAME}/fake-kube"
        GOOS=${GOOS} GOARCH=${GOARCH} \
            go build -ldflags='-s -w' \
            -o "${MIRROR_BIN_DIR}/kubectl${EXE}" \
            ./kubectl
        go build -ldflags='-s -w' \
            -o "${BATS_FILE_TMPDIR}/fake-kube-server" \
            ./server
    )
    if is_macos; then
        shasum -a 512 "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
    else
        sha512sum "${MIRROR_BIN_DIR}/kubectl${EXE}" | awk '{print $1}' >"${MIRROR_BIN_DIR}/kubectl${EXE}.sha512"
    fi

Both invocations target the host today because GOOS=$(go env GOOS) returns the host value, but the asymmetry hints at a forgotten cross-compile concern. Drop the explicit env on the kubectl build (host targets host) or pass it on both for consistency.

S8. Server's log.Fatal* calls bypass the defer logFD.Close() Claude¶

bats/tests/10-cli/fake-kube/server/main.go

	gitVersionFile := flag.String("git-version-file", "", "optional file overriding gitVersion at request time")
	versionStatusFile := flag.String("version-status-file", "", "optional file overriding /version HTTP status at request time")
	portFile := flag.String("port-file", "", "file to receive the assigned port")
	logFile := flag.String("log-file", "", "file to receive one request line per request")
	flag.Parse()
	if *root == "" || *portFile == "" || *logFile == "" {
		log.Fatal("--root, --port-file, and --log-file are required")
	}

	// O_APPEND so per-test truncation (`: > log`) by BATS is honored:
	// without it, the server's fd position drifts past the new EOF and
	// writes land in a sparse hole that grep treats as binary noise.
	logFD, err := os.OpenFile(*logFile, os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatalf("opening log file: %v", err)
	}
	defer logFD.Close()
	var logMu sync.Mutex

	mux := http.NewServeMux()
	mux.HandleFunc("/version", func(w http.ResponseWriter, r *http.Request) {
		recordRequest(logFD, &logMu, r)
		status := readIntFile(*versionStatusFile, http.StatusOK)
		gv := readStringFile(*gitVersionFile, *gitVersion)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		if status == http.StatusOK {
			_, _ = fmt.Fprintf(w,
				`{"major":%q,"minor":%q,"gitVersion":%q,"gitCommit":"fake","gitTreeState":"clean","buildDate":"2026-01-01T00:00:00Z","goVersion":"go1.21","compiler":"gc","platform":"%s/%s"}`,
				*major, *minor, gv, runtime.GOOS, runtime.GOARCH,
			)
		}
	})
	mux.Handle("/release/", logging(logFD, &logMu, http.StripPrefix("/release/", http.FileServer(http.Dir(filepath.Join(*root, "release"))))))

	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	port := listener.Addr().(*net.TCPAddr).Port
	if err := os.WriteFile(*portFile, []byte(strconv.Itoa(port)), 0o644); err != nil {
		log.Fatalf("writing port file: %v", err)
	}

	server := &http.Server{Handler: mux}
	if err := server.Serve(listener); err != nil && err != http.ErrServerClosed {
		log.Fatalf("serve: %v", err)
	}
}

// logging wraps h so every request lands in the request log before h sees
// it. The release file server does not get to bypass logging just because
// it sits behind a StripPrefix.

log.Fatal* calls os.Exit(1), which skips deferred functions. The defer logFD.Close() at line 62 reads as if it cleans up on every exit path; in practice it runs only on a clean Serve return, which never happens during a test run. The cleanup is harmless inside a fixture, but the deferred call misleads a future reader. Either drop the defer (the OS reclaims the fd anyway) or replace log.Fatal* with log.Print* followed by an explicit os.Exit(1) after a single deferred cleanup block.

S9. PR description "rdd ctl stays untouched" overstates Claude¶

cmd/rdd/kubectl.go

		RunE:               ctlAction,
	}
	return command
}

func ctlAction(cmd *cobra.Command, args []string) error {
	if err := ensureServiceRunning(cmd.Context()); err != nil {
		return err
	}
	if len(args) > 0 && args[0] == "wait-condition" {
		return ctlWaitConditionAction(cmd, args[1:])
	}
	if err := os.Setenv("KUBECONFIG", instance.Config()); err != nil {
		return fmt.Errorf("failed to set KUBECONFIG: %w", err)
	}
	return kubectlAction(cmd, args)
}

func newKubectlCommand() *cobra.Command {
	command := &cobra.Command{
		Use:                "kubectl",
		Short:              "Run the kubectl command",

rdd ctl shares kubectlAction, so it now performs a serverVersion probe against the embedded apiserver on every invocation. The probe always reports the matching version, compatible returns true, and Resolve returns "" — functional behavior is unchanged. The PR description's "rdd ctl stays untouched" is technically inaccurate.

Either update the description to "rdd ctl behavior is unchanged; the resolver always falls through because the embedded apiserver and kubectl share a version", or short-circuit Resolve for the embedded path. A check against instance.Config() would skip the (small) extra probe; it is purely cosmetic since the probe goes to a localhost socket.

S10. Cache grows without bound Claude¶

ensureCached only adds; no eviction, no GC, no per-version TTL. Each downloaded kubectl is roughly 50 MB. A user who switches across several remote clusters over months accumulates indefinitely. Acceptable for v1, but a TODO(eviction) near the existing TODO(offline) comment lets a future consumer reuse the framing instead of reinventing it.

S11. CacheDir() panics on os.UserCacheDir() failure; instancePaths() cannot recover Claude¶

pkg/kuberlr/cache.go

//	Linux:   ~/.cache/rancher-desktop/kubectl/<os>-<arch>/  ($XDG_CACHE_HOME)
//	Windows: %LocalAppData%\rancher-desktop\kubectl\<os>-<arch>\
func CacheDir() string {
	root := os.Getenv(envCacheDir)
	if root == "" {
		cache, err := os.UserCacheDir()
		if err != nil {
			panic(fmt.Errorf("could not determine user cache directory: %w", err))
		}
		root = filepath.Join(cache, "rancher-desktop")
	}
	return filepath.Join(root, "kubectl", runtime.GOOS+"-"+runtime.GOARCH)
}

CacheDir() is invoked from instancePaths(), which rdd svc paths calls. os.UserCacheDir() errors only when neither $HOME nor $XDG_CACHE_HOME (Linux) nor %LocalAppData% (Windows) resolves — extremely rare, but a panic in rdd svc paths is a poor error UX for a path-listing command. Either fall back to os.TempDir() with a logged warning, or change CacheDir to return (string, error) so the path command surfaces a normal error.

Design Observations ¶

Concerns ¶

Manual kubectl flag parsing carries drift risk (in-scope) Codex. S1 above is one consequence of the broader pattern: extractKubeFlags approximates kubectl's flag semantics by hand. If the resolver ever needs more flags (--cluster, --user, --namespace), the surface area for divergence grows. Delegating to clientcmd's flag-binding helpers or a pflag-based parser keeps the resolver and the kubectl child aligned by construction.

Strengths ¶

Clean separation of concerns Claude. cache.go (path layout), downloader.go (HTTP and sha512), resolver.go (skew policy), exec_*.go (process launch). Each file owns one responsibility; the boundary between policy and I/O is easy to follow.
Hash-as-you-stream verification ClaudeCodex. download writes through io.MultiWriter(tmp, h) so the digest covers exactly the bytes received, before chmod or rename. Failed downloads leave only a temp file, which defer os.Remove(tmpName) cleans up. Cache hits are never poisoned by partial writes.
Recursion guard via env Claude. RDD_KUBECTL_RESOLVED set in Exec (exec_unix.go:19, exec_windows.go:24) prevents a downloaded kubectl that re-execs rdd through a shim from looping back into Resolve.
Documented fall-through policy ClaudeCodex. Each nilerr site in serverVersion carries a //nolint:nilerr comment naming the legitimate reason for silent fall-through. The contrast with the loud download-error policy is explicit in both the Resolve doc and the comment in kubectlAction, so a future maintainer does not have to reverse-engineer the intent.
Test fixture design Claude. One Go server playing both apiserver /version and release-mirror /release/* (with per-test override files) is a low-friction way to drive the resolver under deterministic conditions without restarting the server between cases. Per-test git-version and version-status overrides keep tests fast and isolated.

Testing Assessment ¶

Coverage is solid for host-platform behavior. Five BATS cases exercise download-on-miss, skip-on-hit, mirror-404, sha-mismatch, and apiserver-500 fall-through; Go unit tests cover compatible, cachedPath, and extractKubeFlags. Gaps tied to specific findings:

Untested: HTTP timeout / hang — see I1.
Untested: BATS test isolation after fixture mutation — see I2.
Untested: -- separator handling — see S1.
Untested: empty / unknown info.GitVersion — see S2.
Untested: cross-major skew — see S3.
Untested: Windows path of Exec (pkg/kuberlr/exec_windows.go). The BATS file skips Windows. cmd.Run plus os.Exit on a non-zero exit is a meaningfully different code path; no exercise. The Windows BATS-host TODO at bats/tests/10-cli/7-kubectl-cache.bats:14-20 covers the harness gap, not the production code path.
Untested: concurrent invocations. Two parallel rdd kubectl against the same out-of-skew cluster both hit the cache-miss path and both download. On Linux/macOS the second os.Rename overwrites; in Go 1.21+ Windows os.Rename uses FILE_RENAME_INFORMATION_EX with REPLACE_IF_EXISTS|POSIX_SEMANTICS, so the dst-exists case is also handled atomically. (Codex flagged a Windows-rename race here; modern Go handles it — see Reconciliation.) A single concurrency test would still help if practical.

Documentation Assessment ¶

All exported symbols carry godoc-style comments (CacheDir, Resolve, Exec).
The new kubectl_cache path key and the RDD_CACHE_DIR / RDD_KUBECTL_MIRROR environment variables are not in the design docs — see S4.
The resolver-policy summary appears in both cache.go (package doc) and resolver.go (Resolve doc). The duplication is minor, but a single canonical source would be easier to keep in sync.
The recursion-guard rationale appears in both resolver.go:22-25 and exec_windows.go:21. Keep the canonical version in resolver.go and reference it from the platform files.

Commit Structure ¶

Single-commit branch (9624299), self-contained, accurately summarized. No fix-up commits, no scope bleed. Nothing to flag.

Acknowledged Limitations ¶

pkg/kuberlr/resolver.go:43-46 — TODO(offline) for the case "no cached kubectl matches and the network is down". The current behavior is the documented failure mode; the TODO carries the future opt-out. Status: consistent.
pkg/kuberlr/downloader.go:34-35 — TODO(offline) for the "may we download?" toggle for air-gapped users. The current change makes this matter on first cache miss; the comment calls it out. Status: consistent.
bats/tests/10-cli/7-kubectl-cache.bats:14-20 — Windows BATS host deferred. The fake kubectl cross-builds for windows/<arch>, but the server-build and service-pid wiring assume a unix parent. Status: consistent — code paths are platform-clean; only the harness is unix-bound.
bats/tests/10-cli/7-kubectl-cache.bats:97-105 — local_teardown_file deviates from the repo "no teardown_file" rule, with explicit rationale: a long-lived ephemeral-port HTTP server is not the kind of state worth preserving across runs. Status: rationale stands; the deviation is narrow.

Agent Performance Retro ¶

Claude ¶

Claude produced the most detailed review by far — 12 raw findings spanning correctness (Major-version check, BATS fixture pollution), policy (parse-failure fall-through, naming scope of RDD_CACHE_DIR), test hygiene (assert_file_contains idiom, GOOS/GOARCH asymmetry), and product framing (cache eviction TODO, PR description correction). Claude was the only agent to catch the BATS sha512-fixture pollution (I2), which is the second Important finding in this report. It was also the only agent to verify the test isolation chain end-to-end and to flag that the next test happens not to hit the corrupted fixture.

The Major-version check (I1 → S3) was raised at Important; I downgraded it to Suggestion because Kubernetes 2.x has no realistic horizon. Claude's HTTP-timeout finding (S4) was undersold at Suggestion; promoted to Important to match Codex.

Codex ¶

Codex stayed tight — four findings, all on real concerns. Its HTTP-timeout finding (I2) was the right severity call from the start, and it was the only agent to flag the design-doc gap (S4) by checking the docs against the new env vars. Its "Windows os.Rename race" claim (I1 raw) was a false positive: the diagnosis was right that two concurrent processes can both hit the cache-miss path, but modern Go (1.21+) handles dst-exists on Windows transparently with REPLACE_IF_EXISTS|POSIX_SEMANTICS, and the proposed fix targeted errors.Is(err, os.ErrExist) — the wrong error class. Dropped.

Codex's coverage table was tight and accurate; its design-observations section identified the same drift risk in manual flag parsing that Claude's S1 hinted at.

Gemini ¶

Gemini failed at startup with ProjectIdRequiredError (needs GOOGLE_CLOUD_PROJECT env var) and a "not running in a trusted directory" rejection. The CLI returned exit 55 before reading the prompt. Excluded from this round.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.5	Gemini 3.1 Pro
Duration	36m 56s	5m 19s	—
Findings	1I 10S	1I 2S	—
Tool calls	50 (Bash 30, Read 20)	49 (shell 49)	—
Design observations	5	1	—
False positives	0	1	0
Unique insights	9	2	0
Files reviewed	16	16	—
Coverage misses	0	0	—
Totals	1I 10S	1I 2S	none
Downgraded	1 (I→S)	0	0
Dropped	0	1	0

Claude provided the most value on this review: the only Important finding it surfaced uniquely (I2 BATS fixture pollution) is the kind of latent test-pollution bug that escapes review and surfaces months later as a flaky test. Codex provided the best severity calibration (HTTP timeout) and the best documentation-gap coverage. The two agents complemented each other — Codex was tighter, Claude was deeper.

#### Reconciliation

Codex P1 Windows os.Rename race: important → dropped (false positive — modern Go handles dst-exists on Windows; proposed fix targets the wrong error class).
Claude P1 Major-version check: important → suggestion S3 (trigger condition is hypothetical; one-line fix retained).
Claude P1 HTTP timeout: suggestion S4 → important I1 (aligned with Codex's call; an indefinite hang is operator-facing).

Review Process Notes ¶

Repo context updates ¶

Add a "third-party CLI download verification" pattern to deep-review-context.md. When code under review fetches an executable from a remote mirror, expect (a) a per-request HTTP deadline distinct from the caller's context, (b) hash-as-you-stream verification before publication, (c) atomic rename with no partial-file leftovers, and (d) bounded cache growth or a documented eviction TODO. Future PRs that add similar fetchers (k3s images, lima templates, registry mirrors) should be reviewed against the same checklist.
Document the "embedded-vs-external" decision pattern for commands that may target either an internal control plane or an external one. When a command has two callers — one internal (rdd ctl, embedded apiserver) and one external (rdd kubectl, user kubeconfig) — flag any added side effect (probe, lookup, fetch) that runs unconditionally for both. Either short-circuit for the embedded case or update user-facing docs to describe the new flow.