Deep Review: 20260519-131414-pr-348

Date	2026-05-19 13:14
Repo	rancher-sandbox/rancher-desktop-daemon
Round	9
Author	@jandubois
PR	#348 — rdd kubectl: embed kuberlr-style version resolver
Branch	`embed-kuberlr`
Commits	`30289ab` rdd kubectl: embed kuberlr-style version resolver
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — Two new judgment calls surfaced after the Round 8 fixes landed on commit `30289ab`. I1 (Claude) is the contrapositive of Round 8 I1: the `Warnf → Infof` demotion that Round 8 recommended now hides the download notice entirely at `RDD_LOG_LEVEL=warn` (the non-developer default), so users see no feedback during a multi-second-to-multi-minute download. I2 (Codex) is a genuine new defect in the BATS test infrastructure added by this PR: `7-kubectl-cache.bats` calls bare `go env GOOS` while `using_windows_exe()` always selects Windows binaries on WSL, so a WSL dev with Linux Go on PATH populates the mirror at `bin/linux/...` while `rdd.exe` requests `bin/windows/...`. The remaining five Suggestions are polish (test coverage for unbound auth flags, `errors.Is` for sentinel comparison, `defer os.Remove` intent, `cliexit.Error{Err: nil}` contract, and a 2-second probe timeout that may be tight for VPN-bound clusters). Dropped per the resolutions/declined gates: the Windows `os.Rename` race (Codex I1, Gemini C1 — declined Rounds 1-8) and the `exec` auth-plugin probe duplication (Gemini I1 — Round 8 design observation, framed slightly more strongly this round).
Wall-clock time	`18 min 39 s`

Executive Summary ¶

Round 9 reviews commit 30289ab (Round 8 was ce035d5; the Round 8 fixes — Warnf → Infof, kill → process_kill, exports lifted to local_setup_file, winpath WSL branch, --match-server-version binding, errors.As exitErr.Err != nil check, and TestResolveSkipsWhenRecursionGuardSet — all verified landed). The PR is in good shape; this round surfaces two judgment-call findings rather than fresh structural defects.

I1 (Claude) is the opposite shape of Round 8 I1. Round 8 said Warnf was too loud at the default log level; Jan fixed by demoting to Infof. Round 9 now observes that Infof is invisible at the default RDD_LOG_LEVEL=warn, so a multi-minute kubectl download from a 3G or VPN link shows the user no feedback — they will think rdd has hung. The right answer is some level that does render at the default threshold while not feeling like a spurious warning; a plain fmt.Fprintln(os.Stderr, ...) is one option that side-steps the level-vs-rendering tradeoff.

I2 (Codex) is a genuine new defect in the BATS plumbing this PR adds. bats/tests/10-cli/7-kubectl-cache.bats:15-17,35,38 calls bare go env GOOS / go env GOOS / go build. On WSL, bats/helpers/defaults.bash:10-18 using_windows_exe() always returns true and commands.bash:1-8 sets EXE=.exe, so rdd is rdd.exe and runtime.GOOS=windows inside it. If the WSL dev's PATH resolves go to Linux Go, the fake mirror lands at bin/linux/$GOARCH/ while rdd.exe requests bin/windows/$GOARCH/, producing a 404 instead of exercising the resolver.

Dropped per the resolutions/declined gates: Codex I1 and Gemini C1 (concurrent Windows os.Rename cache-publish race) — declined Rounds 1-8 inclusive, ninth re-raise this round with no new evidence. Gemini I1 (the user's KUBECONFIG exec credential plugin can prompt or open a browser before kubectl prints anything) is the same shape as the Round 8 Design Observation "Double execution of KUBECONFIG exec auth plugins"; carried over with Gemini's stronger framing rather than re-raised as new.

Structure: 0 Critical · 2 Important · 5 Suggestions.

Critical Issues ¶

None.

Important Issues ¶

I1. Round 8 Warnf → Infof fix overshoots: download notice is now invisible at default log level Claude¶

pkg/kuberlr/downloader.go

		return "", err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", err
	}
	logrus.Infof("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
	if err := download(ctx, want, path); err != nil {
		return "", err
	}
	return path, nil
}

Round 8 I1 flagged this line as Warnf, arguing the WARN banner at default log level was too loud for a routine cache-miss event. Jan demoted to Infof. Round 9 surfaces the contrapositive: cmd/rdd/main.go:96-101 defaults RDD_LOG_LEVEL to warn outside developer mode, so Infof never reaches the user's terminal. A rdd kubectl get pods against an out-of-skew cluster on a slow link now appears to hang for the full downloadTimeout window (up to 5 minutes, per downloader.go:39) with no indication that rdd is downloading. A user who Ctrl+C's the apparent hang loses the partial download and may chase a "rdd is broken" debugging path.

cmd/rdd/main.go

	}
	if logLevel == "" {
		logLevel = strings.TrimSpace(os.Getenv("RDD_LOG_LEVEL"))
	}
	if logLevel == "" {
		// Default log level: warn for regular mode, debug for developer mode
		if developer.Mode() {
			logLevel = "debug"
		} else {
			logLevel = "warn"
		}
	}
	level, err := logrus.ParseLevel(logLevel)
	if err != nil {
		return err
	}

pkg/kuberlr/downloader.go

const envKubeMirror = "RDD_KUBECTL_MIRROR"

// downloadTimeout caps each mirror request. Five minutes turns a hung
// mirror into a bounded failure and still covers a ~50 MB kubectl
// binary on slow links (~170 kB/s).
const downloadTimeout = 5 * time.Minute

// httpClient enforces downloadTimeout on every kuberlr request.
// http.DefaultClient sets no Timeout, and the inherited cobra context
// carries no deadline. If the request context later carries a shorter
// deadline, that deadline wins.

This is a genuine three-way trade-off the Round 8 framing flattened:

Level	Visible at default	Banner color	Connotation
`Warnf`	yes	yellow `WARN[...]`	"something is wrong"
`Infof`	no	n/a	invisible
`fmt.Fprintln(os.Stderr, ...)`	yes	none	plain operator notice

The third option side-steps the level-vs-rendering tradeoff: it always renders, carries no severity connotation, and matches how most CLIs surface progress on a long operation. Compare git clone's Cloning into 'x'..., cargo build's Compiling x v1.2.3, apt's Get:1 http://... — none use a logging framework's WARN level. The line is operator-facing CLI output, not application telemetry.

-	logrus.Infof("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
+	fmt.Fprintf(os.Stderr, "downloading kubectl v%d.%d.%d from %s\n", want.Major, want.Minor, want.Patch, mirrorURL())

Alternatives if Jan prefers to stay inside logrus: keep Warnf and document the deliberate choice inline (reversing Round 8 I1), or add a project-specific log level for one-shot operator notices.

I2. WSL BATS run on 7-kubectl-cache.bats populates the mirror at bin/linux/... while rdd.exe requests bin/windows/... Codex¶

bats/tests/10-cli/7-kubectl-cache.bats

# resolver downloaded, sha-verified, and exec'd it.

load '../../helpers/load'

local_setup_file() {
    GOOS=$(go env GOOS)
    export GOOS
    GOARCH=$(go env GOARCH)
    export GOARCH
    # EXE is set in commands.bash (".exe" on Windows, empty elsewhere)
    # and re-exported below for the rdd subprocess. Don't shadow it
    # locally.
    export EXE

bats/helpers/defaults.bash:10-18 using_windows_exe() returns true on every WSL run today (the TODO at line 16 notes Linux-binary coverage on WSL is not yet wired up), and bats/helpers/commands.bash:1-8 sets EXE=.exe for that path. So rdd resolves to native rdd.exe, whose runtime.GOOS is windows. The resolver downloads from bin/${runtime.GOOS}/${runtime.GOARCH}/kubectl.exe.

bats/helpers/defaults.bash

: "${RDD_TRACE:=false}"
: "${RDD_NAMESPACE:=rdd-bats}"
: "${RDD_KEEP_LOGS:=1}"
export RDD_KEEP_LOGS

using_windows_exe() {
    # MSYS2 always uses Windows executables; there is no Linux alternative.
    if is_msys; then
        return 0
    fi
    # WSL: currently always uses Windows executables.
    # TODO: Support testing with the Linux binary on WSL.
    true
}

bats/helpers/commands.bash

EXE=""
PLATFORM=${OS}
if is_windows; then
    if using_windows_exe; then
        EXE=".exe"
        PLATFORM=win32
    else
        # WSL with Linux binary (not yet supported).
        PLATFORM=linux
    fi
fi

no_cr() {

But setup_file calls bare go env GOOS. On a WSL dev workstation that has Linux Go on PATH (the common case — apt install golang or the official Go tarball), go env GOOS returns linux. MIRROR_BIN_DIR then evaluates to ${MIRROR_ROOT}/release/v1.99.0/bin/linux/$GOARCH, and the go build calls produce ELF binaries at that path. When rdd.exe probes the mirror, it GETs bin/windows/$GOARCH/kubectl.exe.sha512 and hits a 404. The test fails not because the resolver is broken but because the fixture was built for the wrong platform.

The test passes on macOS, Linux, and MSYS2 (which has no Linux Go option). The failure surface is specifically WSL with a Linux Go toolchain.

Codex's go${EXE} fix is one option but assumes go.exe reaches a Windows Go toolchain on PATH — not guaranteed on every WSL setup. A more robust fix hard-pins GOOS/GOARCH to match the rdd binary the test will exercise:

-GOOS=$(go env GOOS)
-export GOOS
-GOARCH=$(go env GOARCH)
-export GOARCH
+if using_windows_exe; then
+    export GOOS=windows
+    export GOARCH=$(go env GOARCH)  # arch matches host even cross-OS
+else
+    GOOS=$(go env GOOS)
+    export GOOS
+    GOARCH=$(go env GOARCH)
+    export GOARCH
+fi
...
-        go build -ldflags='-s -w' \
+        GOOS="${GOOS}" go build -ldflags='-s -w' \
             -o "$(winpath "${MIRROR_BIN_DIR}/kubectl${EXE}")" \
             ./kubectl
-        go build -ldflags='-s -w' \
+        GOOS="${GOOS}" go build -ldflags='-s -w' \
             -o "$(winpath "${BATS_FILE_TMPDIR}/fake-kube-server${EXE}")" \
             ./server

The GOOS=... inline on go build makes Linux Go cross-compile a Windows binary on WSL. Linux Go ships a working cross-compiler out of the box, so this needs no additional toolchain on the dev machine. The fake-kube-server${EXE} build also needs to produce a Windows binary on WSL because local_setup_file execs it directly at line 52 — production-side rdd.exe and the test's server need to match the OS.

bats/tests/10-cli/7-kubectl-cache.bats

    export VERSION_STATUS_FILE=${BATS_FILE_TMPDIR}/version-status
    # On MSYS, the server is a native Windows .exe that cannot read
    # MSYS-namespace paths (/tmp/..., /c/...). Convert each path arg
    # with cygpath. Production rdd never crosses this boundary;
    # MSYS_NO_PATHCONV=1 in load.bash leaves URL-like paths alone.
    "${SERVER_BIN}" \
        --root "$(winpath "${MIRROR_ROOT}")" \
        --major 1 --minor 99 --git-version v1.99.0 \
        --git-version-file "$(winpath "${GIT_VERSION_FILE}")" \
        --version-status-file "$(winpath "${VERSION_STATUS_FILE}")" \
        --port-file "$(winpath "${PORT_FILE}")" \

(important, regression)

Suggestions ¶

S1. parseKubeConfigFlags test does not assert that --username / --password stay unbound Claude¶

pkg/kuberlr/resolver_test.go

	}
	assert.Assert(t, strings.HasSuffix(got, wantSuffix), "cachedPath = %q, want suffix %q", got, wantSuffix)
	assert.Assert(t, strings.HasPrefix(got, CacheDir()), "cachedPath = %q, want prefix %q", got, CacheDir())
}

func TestParseKubeConfigFlags(t *testing.T) {
	deref := func(p *string) string {
		if p == nil {
			return ""
		}
		return *p
	}
	derefBool := func(p *bool) bool {
		if p == nil {
			return false
		}
		return *p
	}

	type want struct {
		kubeconfig string
		context    string
		server     string
		cluster    string
		user       string
		token      string
		caFile     string
		certFile   string
		keyFile    string
		tlsName    string
		insecure   bool
		username   string
		password   string
		namespace  string
	}
	cases := []struct {
		name string
		args []string
		want want
	}{
		{"empty", nil, want{}},
		{"unrelated args", []string{"get", "pods"}, want{}},
		{"--kubeconfig spaced", []string{"--kubeconfig", "/k", "get", "pods"}, want{kubeconfig: "/k"}},
		{"--kubeconfig equals", []string{"--kubeconfig=/k", "get", "pods"}, want{kubeconfig: "/k"}},
		{"--context spaced", []string{"--context", "c", "get", "pods"}, want{context: "c"}},
		{"--context equals", []string{"--context=c", "get", "pods"}, want{context: "c"}},
		{"--server spaced", []string{"--server", "https://x:6443", "get"}, want{server: "https://x:6443"}},
		{"--server equals", []string{"--server=https://x:6443", "get"}, want{server: "https://x:6443"}},
		{"-s alias spaced", []string{"-s", "https://x:6443", "get"}, want{server: "https://x:6443"}},
		{"-s alias equals", []string{"-s=https://x:6443", "get"}, want{server: "https://x:6443"}},
		{"--cluster", []string{"--cluster=prod", "get"}, want{cluster: "prod"}},
		{"--user", []string{"--user=alice", "get"}, want{user: "alice"}},
		{"--token", []string{"--token=eyJabc", "get"}, want{token: "eyJabc"}},
		{"--certificate-authority", []string{"--certificate-authority=/ca.crt", "get"}, want{caFile: "/ca.crt"}},
		{"--client-certificate", []string{"--client-certificate=/c.crt", "get"}, want{certFile: "/c.crt"}},
		{"--client-key", []string{"--client-key=/c.key", "get"}, want{keyFile: "/c.key"}},
		{"--tls-server-name", []string{"--tls-server-name=api.example", "get"}, want{tlsName: "api.example"}},
		{"--insecure-skip-tls-verify bare", []string{"--insecure-skip-tls-verify", "get"}, want{insecure: true}},
		{"--insecure-skip-tls-verify=true", []string{"--insecure-skip-tls-verify=true", "get"}, want{insecure: true}},
		{"--namespace spaced", []string{"--namespace", "kube-system", "get"}, want{namespace: "kube-system"}},
		{"-n alias", []string{"-n", "kube-system", "get"}, want{namespace: "kube-system"}},
		{"all auth/tls together", []string{"--server=https://x", "--token=tk", "--certificate-authority=/ca", "--insecure-skip-tls-verify", "get"}, want{server: "https://x", token: "tk", caFile: "/ca", insecure: true}},
		{"flag at end without value is ignored", []string{"get", "--server"}, want{}},
		{"later flag wins", []string{"--context=a", "--context=b"}, want{context: "b"}},
		{"stops at --", []string{"exec", "pod", "--", "tool", "--kubeconfig=/other", "--server=https://other"}, want{}},
		{"flags before -- still parse", []string{"--kubeconfig=/k", "--server=https://x", "exec", "pod", "--", "tool", "--context=other"}, want{kubeconfig: "/k", server: "https://x"}},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			cf := parseKubeConfigFlags(tc.args)
			actual := want{
				kubeconfig: deref(cf.KubeConfig),
				context:    deref(cf.Context),
				server:     deref(cf.APIServer),
				cluster:    deref(cf.ClusterName),
				user:       deref(cf.AuthInfoName),
				token:      deref(cf.BearerToken),
				caFile:     deref(cf.CAFile),
				certFile:   deref(cf.CertFile),
				keyFile:    deref(cf.KeyFile),
				tlsName:    deref(cf.TLSServerName),
				insecure:   derefBool(cf.Insecure),
				username:   deref(cf.Username),
				password:   deref(cf.Password),
				namespace:  deref(cf.Namespace),
			}
			assert.Equal(t, actual, tc.want)
		})
	}
}

func TestIsClientOnly(t *testing.T) {
	cases := []struct {
		name string
		args []string

The want struct carries username and password fields and the comparison loop dereferences cf.Username / cf.Password, but no test case sets either flag in args. The docstring at resolver.go:168-175 justifies leaving them unbound to dodge ToRawKubeConfigLoader's interactive credential prompt; a future refactor that calls WithDeprecatedPasswordFlag would silently bind them and the test suite as written would not catch the regression. Add cases that pass the flags and assert the bound values remain empty:

pkg/kuberlr/resolver.go

// --insecure-skip-tls-verify, --request-timeout, --as, --as-group,
// --as-uid, --as-user-extra, --cache-dir, --disable-compression,
// --namespace/-n). Unknown flags (kubectl subcommands and their own
// flags) and positional arguments pass through silently.
//
// The deprecated --username and --password flags stay unbound. Binding
// them via WithDeprecatedPasswordFlag would make ToRawKubeConfigLoader
// return the interactive variant that prompts on stdin when the
// kubeconfig context lacks credentials — the wrong behavior for a
// background version probe. The kubectl child still receives those
// flags and processes them per its own binding; the asymmetry only
// affects the resolver's probe, which falls through silently to the
// embedded kubectl on probe failure either way.
//
// parseKubeConfigFlags exists as a separate helper so unit tests can
// verify the binding without reaching for a real cluster.
func parseKubeConfigFlags(args []string) *genericclioptions.ConfigFlags {
	fs := pflag.NewFlagSet("rdd-kubectl", pflag.ContinueOnError)

+		{"--username unbound by design", []string{"--username=alice", "get"}, want{}},
+		{"--password unbound by design", []string{"--password=secret", "get"}, want{}},

(suggestion, gap)

S2. Sentinel comparison in fake server uses != instead of errors.Is Claude¶

bats/tests/10-cli/fake-kube/server/main.go

	if err := os.WriteFile(*portFile, []byte(strconv.Itoa(port)), 0o644); err != nil {
		log.Fatalf("writing port file: %v", err)
	}

	server := &http.Server{Handler: mux}
	if err := server.Serve(listener); err != nil && err != http.ErrServerClosed {
		log.Fatalf("serve: %v", err)
	}
}

// logging wraps h so every request lands in the request log before h sees

http.ErrServerClosed is a sentinel value; errors.Is(err, http.ErrServerClosed) is the standard idiom for sentinel comparison. The fixture never wraps the error today, so the equality compare works, but the code reads like production-shaped Go from a quick scan and the repo uses errors.Is for sentinels elsewhere.

-	if err := server.Serve(listener); err != nil && err != http.ErrServerClosed {
+	if err := server.Serve(listener); err != nil && !errors.Is(err, http.ErrServerClosed) {

(suggestion, enhancement)

S3. defer os.Remove(tmpName) runs after a successful Rename; consider scoping to the failure path Claude¶

pkg/kuberlr/downloader.go

	}
	want, err := fetchSha512(ctx, base+".sha512")
	if err != nil {
		return fmt.Errorf("fetching checksum: %w", err)
	}
	tmp, err := os.CreateTemp(filepath.Dir(dst), ".kubectl-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName)
	h := sha512.New()
	if err := streamGet(ctx, base, io.MultiWriter(tmp, h), maxKubectlBytes); err != nil {
		tmp.Close()
		return fmt.Errorf("downloading %s: %w", base, err)
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	got := hex.EncodeToString(h.Sum(nil))
	if got != want {
		return fmt.Errorf("checksum mismatch for %s: want %s, got %s", base, want, got)
	}
	if err := os.Chmod(tmpName, 0o755); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
}

// Body-size caps for streamGet. maxSha512Bytes covers the longest
// `sha512sum`-format line (128 hex chars + two spaces + a generous
// filename); maxKubectlBytes covers the kubectl binary with headroom

After a successful Rename, tmpName no longer exists, so the deferred os.Remove returns ENOENT silently — harmless on both Unix and Windows. The reader still has to mentally verify the deferred call cannot delete the new dst (it cannot: Rename moves the inode, leaving tmpName empty). A small boolean-guarded defer expresses intent more clearly:

-	tmpName := tmp.Name()
-	defer os.Remove(tmpName)
+	tmpName := tmp.Name()
+	cleanup := true
+	defer func() {
+		if cleanup {
+			os.Remove(tmpName)
+		}
+	}()
...
-	return os.Rename(tmpName, dst)
+	if err := os.Rename(tmpName, dst); err != nil {
+		return err
+	}
+	cleanup = false
+	return nil

Purely an intent / readability change; no behavioral difference today.

(suggestion, enhancement)

S4. cliexit.Error{Err: nil} no-log contract is implicit; only one producer relies on it Claude¶

cmd/rdd/main.go

	)
	if err := cli.RunNoErrOutput(cmd); err != nil {
		// *cliexit.Error lets a subcommand opt into a specific exit code; everything else gets exit 1.
		// A nil inner Err means the subcommand has already written its own output (e.g. kuberlr.Exec
		// propagating a kubectl child's exit code), so skip the redundant log line.
		var exitErr *cliexit.Error
		if errors.As(err, &exitErr) {
			if exitErr.Err != nil {
				logrus.Error(err)
			}
			os.Exit(exitErr.Code)
		}
		logrus.Fatal(err)
	}
}

The only producer of &cliexit.Error{Code: ...} with Err == nil today is pkg/kuberlr/exec_windows.go:39 (kubectl's child has already written its own stderr; rdd should not log on top of it). Round 8 S4's resolution captured the consumer-side change cleanly. The implicit contract still lives only in the comment at main.go:170-171; a future caller that builds &cliexit.Error{Code: 5} without an Err will silently lose its log output, which is the right behavior only if the caller has already written to stderr. A named constructor documents the convention at the producer site:

pkg/kuberlr/exec_windows.go

	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return &cliexit.Error{Code: exitErr.ExitCode()}
	}
	if err != nil {
		return fmt.Errorf("running %s: %w", path, err)
	}
	return nil

+// Silent wraps a kubectl-child exit code without an inner error. main.go
+// suppresses its log line in this case because the child has already
+// written its own stderr.
+func Silent(code int) *Error { return &Error{Code: code} }

(suggestion, enhancement)

S5. serverProbeTimeout = 2 * time.Second may be tight for VPN-bound corporate clusters Claude¶

pkg/kuberlr/resolver.go

// by construction — the probe would always fall through anyway.
func SkipResolver() {
	skipResolver = true
}

// serverProbeTimeout caps the discovery call. A reachable cluster answers
// in under 100 ms; an unreachable one would otherwise stall every kubectl
// invocation.
const serverProbeTimeout = 2 * time.Second

// Resolve returns the path to a kubectl binary compatible with the cluster
// the user's kubeconfig points at. An empty path means "use the embedded
// kubectl"; a non-empty path means "exec this binary instead". args holds
// the kubectl-style arguments the caller will pass through; Resolve binds

The comment's "under 100 ms" holds for LAN clusters and direct internet routes. A GKE / EKS / AKS control plane reached through a corporate split-tunnel VPN can routinely take 1-3 seconds to complete TLS handshake + IAM auth + return /version, especially on a cold connection. A spurious timeout falls through silently to the embedded kubectl (per serverVersion's contract at resolver.go:135-138), so an out-of-skew cluster reached over a slow VPN runs the embedded mismatched kubectl with no operator-visible reason. 5 seconds is the same order of magnitude and absorbs the realistic tail without weakening the unreachable-cluster guard.

pkg/kuberlr/resolver.go

	client, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		logrus.WithError(err).Debug("kubectl resolver: cannot build discovery client; using embedded kubectl")
		return semver.Version{}, false
	}
	info, err := client.ServerVersion()
	if err != nil {
		logrus.WithError(err).Debug("kubectl resolver: cluster probe failed; using embedded kubectl")
		return semver.Version{}, false
	}
	v, err := semver.ParseTolerant(info.GitVersion)
	if err != nil {
		logrus.WithError(err).Warnf("kubectl resolver: cannot parse server version %q; using embedded kubectl", info.GitVersion)
		return semver.Version{}, false

(suggestion, enhancement)

Design Observations ¶

Concerns ¶

KUBECONFIG exec credential plugin can prompt or open a browser before kubectl runs (in-scope) Gemini — carried over from the Round 8 Design Observation "Double execution of KUBECONFIG exec auth plugins". Gemini's framing this round is sharper: the first user-visible side effect of rdd kubectl get pods against an EKS / GKE cluster whose exec plugin (aws eks get-token, gke-gcloud-auth-plugin, kubelogin) lacks a cached token is the plugin firing — including an interactive prompt, OS keychain unlock, or browser window — from rdd's serverVersion probe, before the user knows their command has even started. The Round 8 framing was "double execution" (plugin runs in the probe, then again under kubectl); Gemini's framing is "first execution is unexpected" (user thinks kubectl get pods triggered the prompt, not rdd's version check). The TODO at resolver.go:64-67 (RDD_KUBECTL_OFFLINE) and the operator escape hatch are still the natural answer; surface here so the design observation does not regress into "carried over without rationale" in future rounds.

Strengths ¶

Warnf → Infof → ? review loop is working as intended Claude — Round 8 surfaced the level choice as a judgment call; Jan picked one direction; Round 9 surfaces the contrapositive. This is the right way for the review process to handle reversible UX tradeoffs: state the trade-off, ship a reasonable answer, accept feedback if the answer overshoots. No process change needed.
Single-file flag-parsing test surface locks in subtle behavior Claude — TestIsClientOnly and TestParseKubeConfigFlags together exercise 50+ cases of pflag's UnknownFlags handling, --client=garbage short-circuit, -- separator, -n namespace aliases, kubectl globals (--warnings-as-errors, --match-server-version). Hard-won coverage of pflag's quirks that future kubectl-flag changes will benefit from.
Recursion guard belt-and-braces Claude — cross-process via RDD_KUBECTL_RESOLVED env var, same-process via skipResolver bool, covering both axes (downloaded kubectl re-execs rdd through a shim vs. rdd ctl calling kubectlAction in-process). TestResolveSkipsWhenRecursionGuardSet (Round 8 testing gap, now closed) locks the cross-process axis in.
Streaming caps on streamGet Claude — maxSha512Bytes (4 KiB) and maxKubectlBytes (256 MiB), paired with downloadTimeout (5 min), turn a malicious or misconfigured mirror into a bounded failure rather than an unbounded write.
Sibling Go module for fake-kube fixtures Claude — bats/tests/10-cli/fake-kube/go.mod keeps test-only dependencies out of the parent module; one server plays both apiserver /version and release mirror /release/* roles with file-based per-test overrides.
Conservative probe fallback policy Codex Gemini — missing kubeconfig, malformed kubeconfig, unreachable cluster, and unparseable server version all fall through silently to the embedded kubectl. Only download and sha errors surface loudly, matching the policy the resolver comments draw explicitly.

Testing Assessment ¶

pkg/kuberlr/resolver_test.go and the five-test BATS lifecycle in 7-kubectl-cache.bats cover the user-reachable paths well. Round 9 surfaces test-side gaps, no production-code gaps:

WSL with native rdd.exe plus a Linux go on PATH (I2). The mirror tree and rdd.exe's GET path mismatch on OS; the test fails for a reason unrelated to the resolver. Pin GOOS/GOARCH to match rdd.exe's platform when using_windows_exe is true.
--username / --password deliberately unbound on the resolver side (S1). Add the two test cases that lock the unbound invariant in.
Server /version returns valid JSON with an unexpected gitVersion (e.g., v0.0.0). The current tests cover unparseable ("") and within-skew / out-of-skew cases; an apiserver that answers with a parseable-but-bogus version would currently trigger a download attempt for a release that does not exist on the mirror.
rdd ctl does not re-trigger the resolver — cmd/rdd/kubectl.go:48 asserts the contract by inspection only; no test exercises rdd ctl get pods and confirms the resolver short-circuits via the same-process skipResolver flag.

Documentation Assessment ¶

docs/design/cmd_service.md and docs/design/environment.md match the resolver behavior. No documentation gaps surfaced this round.

Commit Structure ¶

Single commit 30289ab. The config.json → config.yaml, rdd.args → args.json, and docker_socket key-table additions in cmd_service.md are doc-sync corrections that predate this PR; Round 7 explicitly green-lit bundling them as drive-bys. Settled.

Acknowledged Limitations ¶

Code comment: pkg/kuberlr/cache.go:33-39 documents unbounded cache growth and leaked .kubectl-* temp files after SIGKILL or power loss; eviction work is deferred.
Code comment: pkg/kuberlr/downloader.go:55-57 and pkg/kuberlr/resolver.go:64-67 document the planned RDD_KUBECTL_OFFLINE opt-out; intentionally deferred.
Code comment: pkg/kuberlr/downloader.go:138-141 documents maxKubectlBytes truncation surfacing as "checksum mismatch"; left as a comment per Round 6.
Code comment: pkg/kuberlr/exec_windows.go:27-30 documents the dormant CommandContext hard-kill race against console-delivered Ctrl+C; a future refactor that wires signal-driven cancellation must revisit this site (Round 7 S3 → Commented).
Code comment: pkg/kuberlr/resolver.go:217-228 documents the --client=garbage pflag short-circuit and the explicit deviation from the conservative bias.
Code comment: pkg/kuberlr/resolver.go:168-175 documents leaving --username / --password unbound on the resolver side to dodge ToRawKubeConfigLoader's interactive prompt.
Code comment: bats/tests/10-cli/7-kubectl-cache.bats:102-105 documents the deliberate deviation from the "no teardown_file" rule for the ephemeral HTTP-server fixture.

Declined in prior rounds ¶

Concurrent first-use Windows os.Rename cache-publish race — declined Rounds 1, 2, 3, 4, 5, 6, 7, 8. Re-raised this round by Codex (I1) and Gemini (C1). Dropped per the resolutions gate. Nine rounds running.
CacheRoot panic on os.UserCacheDir failure — declined Rounds 1, 2, 3, 5, 7, 8. Not re-raised this round.
local_teardown_file exists — Gemini S1. The deviation is documented in-code at lines 102-105; the comment names the exact rule it deviates from and gives a rationale ("preserve rdd state vs. leak ephemeral port"). Dropped — not a re-raise of a prior-round finding but a re-raise of a documented in-code deviation.
Command substitution swallows non-zero exits ($(winpath ...)) — Gemini S2. Same shape as Round 8 S5; declined in Round 8 with reason "winpath/cygpath failure is pathological; the run -0 rewrite at standalone sites buys minimal practical safety while adding noise to the surrounding code." Dropped per the resolutions gate.

Unresolved Feedback ¶

None. The Round 8 resolution table lists 11 items, of which 7 were Fixed, 3 were Skipped (S3, S5, S6, S7 — covered by the declined-in-prior-rounds list above), and 1 had a test written. The Round 9 commit 30289ab reflects all Fixed items.

Agent Performance Retro ¶

Claude ¶

Claude carried this round. Six findings (1 Important + 5 Suggestions), four of them unique to Claude. The unique I1 — observing that the Round 8 fix overshoots — is the kind of judgment-call reversal review processes are supposed to surface and that no fixed-prompt agent can be expected to produce reliably; it required reading both the prior .reviews/ round and the post-fix code with attention to default log levels. The five suggestions split between code polish (S3, S4) and test coverage gaps (S1) and one operator-facing tuning (S5). Coverage was complete on the 20 changed files. Calibration was on — nothing was inflated to Important; nothing should be promoted from Suggestion to Important in consolidation.

Codex ¶

Codex returned its trademark tight, focused review: two Important findings, no Suggestions, no design observations beyond the two strengths. I1 (Windows os.Rename race) was the long-declined re-raise; dropped per the resolutions gate without prejudice. I2 (WSL go env GOOS mismatch) was the second-most-valuable finding this round — a genuine new defect in the test infrastructure this PR added, surface area no other agent reached. Codex ran go test ./pkg/kuberlr and make build-rdd to verify before writing, the right gate for a build-or-test claim. The Coverage Summary called out 20/20 files; no misses.

Gemini ¶

Gemini's signal was thin this round. Three of four findings were re-raises of declined items (C1 Windows os.Rename ninth round; S1 local_teardown_file against a documented in-code deviation; S2 command substitution declined Round 8 S5). I1 (KUBECONFIG exec plugin user-visible prompt) is a sharper framing of the Round 8 Design Observation but not a new finding; carried over with the refined framing rather than re-raised as Important. Gemini reviewed 20 files with no coverage misses but produced no net-new signal this round.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.5	Gemini 3.1 Pro
Duration	12m 45s	7m 28s	—
Findings	1I 5S	1I	none
Tool calls	51 (Read 27, Grep 15, Bash 9)	50 (shell 46, stdin 4)	—
Design observations	1	2	0
False positives	0	0	0
Unique insights	5	1	0
Files reviewed	20	20	20
Coverage misses	0	0	0
Totals	1I 5S	1I	none
Downgraded	0	0	0
Dropped	0	1	4

Overall: Claude provided the highest signal this round (unique I1 contrapositive + four polish suggestions). Codex contributed the second-most-valuable finding (I2 WSL toolchain). Gemini contributed no net-new findings; its single non-re-raise (I1 exec-plugin) was a refinement of an existing design observation rather than a fresh finding.

Review Process Notes ¶

Skill improvements

Add explicit guidance for "contrapositive" findings after a Fixed prior-round Important. When a prior round flagged a knob as set wrong in direction A and the author fixed it by moving to direction B, the new direction can be wrong for the opposite reason. Agents that read the resolution table and see "Fixed" tend to treat the line as settled. Add to the prior-round resolutions gate: "When the fix changed a UX-facing knob (log level, timeout, default, error severity), evaluate the new value with the same scrutiny the prior round applied to the old value — a fix can overshoot." This is the principle, not the specific bug; it applies to any reversible UX tradeoff a prior round surfaced as a judgment call.

Repo context updates

None this round. The "long-declined findings on PR #348" entry added after Round 7-8 is doing its job — Codex and Gemini still re-raised the os.Rename race, but consolidation dropped them in seconds rather than spending clarification-round budget.