Deep Review: 20260503-142128-pr-348

Date	2026-05-03 14:35
Repo	rancher-sandbox/rancher-desktop-daemon
Round	4
Author	@jandubois
PR	#348 — rdd kubectl: embed kuberlr-style version resolver
Branch	`embed-kuberlr`
Commits	`563ff15` rdd kubectl: embed kuberlr-style version resolver
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — S3 (`docker_socket` missing from `ALL_KEYS`, `environment.md`, and the `cmd_service.md` example, raised by Codex) is the most actionable: tests and docs claim a complete key set but skip a key the implementation returns. S2 (BATS uses command substitution where the project explicitly requires `run -0`, raised by Codex and Gemini) violates a documented project rule. S5 (uppercase SHA-512 hex normalization, raised by Codex) is a real defensive gap an operator-configured mirror could trip. The `--as-user-extra` map gap (S6) and `serverVersion` ctx-propagation gap (S7) are both verified but narrow. Round 3's three Importants and four Suggestions all landed cleanly; no regressions.
Wall-clock time	`5 h 37 min 38 s`

Executive Summary ¶

Round 3's three Importants (BATS apiserver-500 test counted via >= 2 GET /version, --client=true parsed via boolFlagValue, genericclioptions.ConfigFlags-backed loadClientConfig) and four Fixed Suggestions all landed cleanly. Round 4 surfaces no Critical or Important findings; the agents converge on minor gaps that fall into three buckets: (1) completeness gaps in the rdd svc paths story — docker_socket is missing from ALL_KEYS in 5-paths.bats, from the environment.md shell-variable table, and from the cmd_service.md example output, and five remaining config.json references in cmd_service.md were not swept along with the line-154 fix; (2) defensive cleanup in the resolver — --as-user-extra is not in kubectlGlobalFlagsTakingValue (so spaced form makes isClientOnly mis-walk), fetchSha512 accepts uppercase hex but never normalises it (case-sensitive comparison fails), serverVersion runs the discovery client without the caller's ctx; (3) a documented BATS style-guide violation in the new version_requests=$(awk ...) capture.

The Windows os.Rename race surfaced again from all three agents (Claude S4, Codex I1, Gemini C1) and was dropped per the prior-round-resolutions gate (declined Rounds 1, 2, and 3). Gemini also produced one outright hallucination (claimed BATS @test decorators were corrupted to @pkg/util/tail/tail_test.go; verified absent in every cited line) and one false positive (cited a fictitious KEP for relaxing kubectl skew to ±3, which contradicts Kubernetes' published version-skew policy of ±1 minor for kubectl).

Structure: 0 Critical · 0 Important · 10 Suggestions.

Critical Issues ¶

None.

Important Issues ¶

None.

Suggestions ¶

S1. docker_socket missing from ALL_KEYS in 5-paths.bats, from the environment.md shell-variable table, and from the cmd_service.md rdd svc paths example Codex¶

bats/tests/10-cli/5-paths.bats

load '../../helpers/load'

ALL_KEYS="args_file config dir kubectl_cache lima_home log_dir pid_file short_dir tls_dir"

@test 'rdd svc paths prints all keys in table format' {
    run -0 rdd svc paths
    for key in ${ALL_KEYS}; do
        assert_line --regexp "^${key} "

instancePaths() returns ten keys (cmd/rdd/service_paths.go:21-33): the nine in ALL_KEYS plus docker_socket. The BATS suite iterates ALL_KEYS to assert each key appears in rdd svc paths output, so docker_socket is silently exempt — a regression that drops docker_socket from the output ships green.

cmd/rdd/service_paths.go


	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/instance"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/kuberlr"
)

func instancePaths() map[string]string {
	return map[string]string{
		"dir":           instance.Dir(),
		"log_dir":       instance.LogDir(),
		"short_dir":     instance.ShortDir(),
		"lima_home":     instance.LimaHome(),
		"tls_dir":       instance.TLSDir(),
		"config":        instance.Config(),
		"pid_file":      instance.PIDFile(),
		"args_file":     instance.ArgsFile(),
		"docker_socket": instance.DockerSocket(),
		"kubectl_cache": kuberlr.CacheDir(),
	}
}

const (
	outputTable = "table"
	outputJSON  = "json"

The same omission carries through both docs touched by this PR: environment.md's table lists nine RDD_* rows (no RDD_DOCKER_SOCKET), and the cmd_service.md example block (line 148–156) shows nine table rows (no docker_socket). The PR added the new kubectl_cache row to all three but did not add the docker_socket row that the implementation has emitted since the docker-socket-address PR landed.

Fix: include docker_socket everywhere these lists claim to be complete.

-ALL_KEYS="args_file config dir kubectl_cache lima_home log_dir pid_file short_dir tls_dir"
+ALL_KEYS="args_file config dir docker_socket kubectl_cache lima_home log_dir pid_file short_dir tls_dir"

Add an RDD_DOCKER_SOCKET row to the environment.md table and a docker_socket row to the cmd_service.md example. The durable fix derives the BATS key list from rdd svc paths itself rather than restating it.

S2. BATS test captures version_requests via command substitution where the project rule mandates run -0 Codex Gemini¶

bats/tests/10-cli/7-kubectl-cache.bats

    # Count /version requests to prove the resolver probed: the resolver
    # sends one, then the embedded kubectl's `version` subcommand sends
    # one of its own. A regression that short-circuits the resolver
    # would drop the count to 1 — assert_file_contains alone cannot
    # tell that case from this one.
    version_requests=$(awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}")
    [[ ${version_requests} -ge 2 ]] ||
        fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: $(cat "${LOG_FILE}")"
}

The project's BATS style (deep-review-context.md and bats-style.md) explicitly forbids command substitution for command output: "Use run -0 instead of command substitution. run -0 fails with a useful error (command, exit code, output) if the command returns non-zero. A bare substitution gives an empty variable and a cryptic failure later." The new assertion uses both the forbidden capture form and a second $(cat …) inside the fail argument.

Fix:

-    version_requests=$(awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}")
-    [[ ${version_requests} -ge 2 ]] ||
-        fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: $(cat "${LOG_FILE}")"
+    run -0 awk '/^GET \/version$/ {n++} END {print n+0}' "${LOG_FILE}"
+    version_requests=${output}
+    if [[ ${version_requests} -lt 2 ]]; then
+        run -0 cat "${LOG_FILE}"
+        fail "expected >= 2 GET /version (resolver + embedded kubectl), got ${version_requests}: ${output}"
+    fi

S3. docs/design/cmd_service.md retains five config.json references after the line-154 config.yaml fix Claude¶

Round 3 fixed line 154 (the rdd svc paths example) per S1; the rest of the document still describes the file as config.json:

docs/design/cmd_service.md

Examples:

```console
$ rdd svc paths
args_file      /path/to/rancher-desktop-default/rdd.args
config         /path/to/rancher-desktop-default/config.yaml
dir            /path/to/rancher-desktop-default
kubectl_cache  /path/to/Caches/rancher-desktop/kubectl/darwin-arm64
lima_home      /path/to/.rd2/lima
log_dir        /path/to/rancher-desktop-default/log
pid_file       /path/to/rancher-desktop-default/rdd.pid

Line 15 — file table: config.json | service config settings
Line 41 — rdd service create: "Stores the configuration settings in the config.json file."
Line 78 — rdd service start: "starts the apiserver and the controller-manager using the configuration found in config.json."
Line 80 — same paragraph: "No controller should access config.json directly."
Line 87 — rdd service config: "updates config.json with latest settings."

pkg/instance/instance.go:91 returns config.yaml. Either rename all five at once, or leave the file alone — a partial rename causes new readers to chase the wrong filename.

pkg/instance/instance.go

	return filepath.Join(Dir(), "args.json")
})

// Config returns the path to the kubeconfig file.
var Config = sync.OnceValue(func() string {
	return filepath.Join(Dir(), "config.yaml")
})

// PIDFile returns the path to the service PID file.
var PIDFile = sync.OnceValue(func() string {
	return filepath.Join(Dir(), "rdd.pid")

Fix: apply the same rename at lines 15, 41, 78, 80, 87.

S4. fetchSha512 accepts uppercase hex but never normalises it; the case-sensitive if got != want rejects a valid digest from an uppercase-emitting mirror Codex¶

pkg/kuberlr/downloader.go

// digest, two spaces, and the filename — fields() takes the digest and
// drops the trailing filename so both formats verify. fetchSha512
// rejects non-128-hex tokens so a misconfigured mirror surfaces as a
// "malformed checksum" error rather than the misleading "checksum
// mismatch" the comparison at line 110 would otherwise produce.
func fetchSha512(ctx context.Context, url string) (string, error) {
	var sb strings.Builder
	if err := streamGet(ctx, url, &sb, maxSha512Bytes); err != nil {
		return "", err
	}
	fields := strings.Fields(sb.String())
	if len(fields) == 0 {
		return "", fmt.Errorf("empty checksum response from %s", url)
	}
	digest := fields[0]
	if len(digest) != sha512.Size*2 {
		return "", fmt.Errorf("malformed checksum from %s: %d chars, want %d", url, len(digest), sha512.Size*2)
	}
	for _, c := range digest {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f' || c >= 'A' && c <= 'F') {
			return "", fmt.Errorf("malformed checksum from %s: non-hex character %q", url, c)
		}
	}
	return digest, nil
}

// streamGet performs a GET on url and copies the response body into w,
// capped at maxBytes. The cap turns a malicious or misconfigured mirror
// into a bounded failure: the underlying io.LimitReader stops reading

fetchSha512 admits A-F in the hex validator and returns the raw digest. download computes got := hex.EncodeToString(h.Sum(nil)) (lowercase by Go's spec) and compares with if got != want. A mirror populated by PowerShell Get-FileHash or any tool emitting uppercase hex passes validation, then surfaces a misleading "checksum mismatch" — exactly the misdirection the digest-format check was added to prevent.

Two coherent options: (a) reject uppercase in the validator (mirror operators normalise), or (b) accept and normalise in fetchSha512. (b) is the user-friendlier choice and a one-line edit.

-    digest := fields[0]
+    digest := strings.ToLower(fields[0])

The 128-hex length and character checks then run against the normalised form.

S5. isClientOnly walks past --as-user-extra without consuming its value, so the spaced form mistakes k=v for the subcommand and triggers an unwanted probe Claude¶

pkg/kuberlr/resolver.go

// and its value when it walks args looking for the subcommand. The
// list covers what kubectl exposes via genericclioptions.ConfigFlags
// and the klog flags. An unknown flag falls through as no-value;
// kubectl rejects unknown globals anyway, so the realistic miss set
// is empty.
var kubectlGlobalFlagsTakingValue = map[string]bool{
	"--kubeconfig":            true,
	"--context":               true,
	"--cluster":               true,
	"--user":                  true,
	"--namespace":             true,
	"-n":                      true,
	"--server":                true,
	"-s":                      true,
	"--token":                 true,
	"--certificate-authority": true,
	"--client-certificate":    true,
	"--client-key":            true,
	"--tls-server-name":       true,
	"--cache-dir":             true,
	"--request-timeout":       true,
	"--password":              true,
	"--username":              true,
	"--as":                    true,
	"--as-group":              true,
	"--as-uid":                true,
	"--profile":               true,
	"--profile-output":        true,
	"--log-backtrace-at":      true,
	"--log-dir":               true,
	"--log-file":              true,
	"--log-file-max-size":     true,
	"--log-flush-frequency":   true,
	"--stderrthreshold":       true,
	"-v":                      true,
	"--v":                     true,
	"--vmodule":               true,
}

// clientOnlySubcommands lists kubectl subcommands that never contact
// a cluster. config and completion manipulate local state only;
// options prints kubectl's global flags; help prints help text.
var clientOnlySubcommands = map[string]bool{

genericclioptions.NewConfigFlags(true).AddFlags binds --as-user-extra (a StringArrayVar, spaced form takes the next arg). The docstring at lines 160–162 names it explicitly:

pkg/kuberlr/resolver.go

// parseKubeConfigFlags binds genericclioptions.NewConfigFlags(true) to
// a pflag FlagSet, parses args, and returns the populated flags. The
// bound surface is whatever NewConfigFlags(true).AddFlags exposes —
// the kubectl connection-override flags (--kubeconfig, --context,
// --server/-s, --cluster, --user, --token, --certificate-authority,
// --client-certificate, --client-key, --tls-server-name,
// --insecure-skip-tls-verify, --request-timeout, --as, --as-group,
// --as-uid, --as-user-extra, --cache-dir, --disable-compression,
// --namespace/-n). Unknown flags (kubectl subcommands and their own
// flags) and positional arguments pass through silently.
//
// The deprecated --username and --password flags stay unbound. Binding
// them via WithDeprecatedPasswordFlag would make ToRawKubeConfigLoader

// --insecure-skip-tls-verify, --request-timeout, --as, --as-group,
// --as-uid, --as-user-extra, --cache-dir, --disable-compression,

But the value-taking map omits it. isClientOnly walks args manually; --as-user-extra k=v help lands at arg = "--as-user-extra", the map lookup misses, the walker advances by one (no value consumed), the next arg k=v becomes the subcommand candidate, and clientOnlySubcommands["k=v"] returns false — so the resolver probes the cluster for an invocation that should have short-circuited.

Net effect: a help / completion / config invocation that uses --as-user-extra in spaced form pays the 2 s probe cost. No correctness break (the kubectl child still receives the args correctly), but it regresses the Round 2 isClientOnly work for one flag.

Fix: add the entry.

     "--as-group":              true,
     "--as-uid":                true,
+    "--as-user-extra":         true,
     "--profile":               true,

The longer-term option is to drive isClientOnly from the same pflag set parseKubeConfigFlags already builds; see Design Observations.

S6. Resolve accepts ctx but serverVersion ignores it; Ctrl+C during the discovery probe waits up to serverProbeTimeout (2 s) instead of cancelling immediately Claude¶

pkg/kuberlr/resolver.go

//
// TODO(offline): when no cached kubectl matches and the network is down,
// Resolve currently fails with the download error. Add an opt-out for
// download attempts (config flag or RDD_KUBECTL_OFFLINE) and fall back to
// the cached binary closest to the server's minor.
func Resolve(ctx context.Context, args []string) (string, error) {
	if os.Getenv(envSkipResolver) != "" {
		return "", nil
	}
	if isClientOnly(args) {
		return "", nil
	}
	embedded, err := embeddedVersion()
	if err != nil {
		// `go run ./cmd/rdd kubectl ...` and IDE debug builds skip the
		// Makefile's -ldflags -X, so componentbasever returns the
		// in-tree default `v0.0.0-master+$Format:%H$`, which fails
		// semver parsing. Fall through to the embedded kubectl rather
		// than break every dev invocation.
		logrus.WithError(err).Debug("kubectl resolver: embedded version not parseable; using embedded kubectl")
		return "", nil
	}
	server, ok, err := serverVersion(args)
	if err != nil {
		return "", err
	}
	if !ok {
		return "", nil
	}
	if compatible(embedded, server) {
		return "", nil
	}
	return ensureCached(ctx, server)
}

// compatible reports whether a kubectl of client.Minor can drive a server
// of server.Minor under the standard Kubernetes skew of ±1 minor. A
// different Major rules out compatibility outright.
func compatible(client, server semver.Version) bool {

Resolve forwards ctx only to ensureCached. The discovery client at line 133 uses cfg.Timeout to cap each request at 2 s, but does not honour ctx.Done(). Cancellation does not propagate while the probe is in flight; a user pressing Ctrl+C waits up to 2 s per kubectl invocation.

pkg/kuberlr/resolver.go

	client, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		logrus.WithError(err).Debug("kubectl resolver: cannot build discovery client; using embedded kubectl")
		return semver.Version{}, false, nil //nolint:nilerr // legitimate fall-through to embedded kubectl
	}
	info, err := client.ServerVersion()
	if err != nil {
		logrus.WithError(err).Debug("kubectl resolver: cluster probe failed; using embedded kubectl")
		return semver.Version{}, false, nil //nolint:nilerr // legitimate fall-through to embedded kubectl
	}
	v, err := semver.ParseTolerant(info.GitVersion)

Bounded UX issue, not a correctness issue (the embedded kubectl runs after the probe times out either way). Low priority because the ensureCached path already honours ctx and the 2 s bound puts a ceiling on the wait.

Fix: thread ctx through and bind it to the request. The discovery client exposes the underlying REST client:

-func serverVersion(args []string) (_ semver.Version, ok bool, _ error) {
+func serverVersion(ctx context.Context, args []string) (_ semver.Version, ok bool, _ error) {
     cfg, err := loadClientConfig(args)
     ...
-    info, err := client.ServerVersion()
+    body, err := client.RESTClient().Get().AbsPath("/version").DoRaw(ctx)
+    var info apimachineryversion.Info
+    if err == nil { err = json.Unmarshal(body, &info) }

S7. Comment in fetchSha512 cites "the comparison at line 110"; the comparison sits at line 111 (line 110 is the hex.EncodeToString assignment) Claude¶

pkg/kuberlr/downloader.go

const (
	maxSha512Bytes  = 4 << 10   // 4 KiB
	maxKubectlBytes = 256 << 20 // 256 MiB
)

// fetchSha512 downloads the sha512 hex digest at url. dl.k8s.io serves
// the bare digest, but a mirror populated with `sha512sum` writes the
// digest, two spaces, and the filename — fields() takes the digest and
// drops the trailing filename so both formats verify. fetchSha512
// rejects non-128-hex tokens so a misconfigured mirror surfaces as a
// "malformed checksum" error rather than the misleading "checksum
// mismatch" the comparison at line 110 would otherwise produce.
func fetchSha512(ctx context.Context, url string) (string, error) {
	var sb strings.Builder
	if err := streamGet(ctx, url, &sb, maxSha512Bytes); err != nil {
		return "", err
	}

Hard-coded line numbers in comments rot on the next edit (and have already drifted by one). Replace the locator with a name.

-// "malformed checksum" error rather than the misleading "checksum
-// mismatch" the comparison at line 110 would otherwise produce.
+// "malformed checksum" error rather than the misleading "checksum
+// mismatch" that download's digest comparison would otherwise produce.

S8. TestParseKubeConfigFlags row for --insecure-skip-tls-verify=false does not exercise the binding; the assertion matches the pre-parse default Claude¶

pkg/kuberlr/resolver_test.go

		{"--client-certificate", []string{"--client-certificate=/c.crt", "get"}, want{certFile: "/c.crt"}},
		{"--client-key", []string{"--client-key=/c.key", "get"}, want{keyFile: "/c.key"}},
		{"--tls-server-name", []string{"--tls-server-name=api.example", "get"}, want{tlsName: "api.example"}},
		{"--insecure-skip-tls-verify bare", []string{"--insecure-skip-tls-verify", "get"}, want{insecure: true}},
		{"--insecure-skip-tls-verify=true", []string{"--insecure-skip-tls-verify=true", "get"}, want{insecure: true}},
		{"--insecure-skip-tls-verify=false", []string{"--insecure-skip-tls-verify=false", "get"}, want{insecure: false}},
		{"--namespace spaced", []string{"--namespace", "kube-system", "get"}, want{namespace: "kube-system"}},
		{"-n alias", []string{"-n", "kube-system", "get"}, want{namespace: "kube-system"}},
		{"all auth/tls together", []string{"--server=https://x", "--token=tk", "--certificate-authority=/ca", "--insecure-skip-tls-verify", "get"}, want{server: "https://x", token: "tk", caFile: "/ca", insecure: true}},
		{"flag at end without value is ignored", []string{"get", "--server"}, want{}},
		{"later flag wins", []string{"--context=a", "--context=b"}, want{context: "b"}},

NewConfigFlags(true) initialises Insecure to *bool → false. The test asserts the post-parse value equals false — the default. The row passes even when the flag binding is absent.

Two fixes: drop the row (the bare and =true rows already cover the binding), or force a non-default starting state and assert the parser flipped it back.

// option 2: force a true → false transition
cf := parseKubeConfigFlagsFor("--insecure-skip-tls-verify=false", "get")
*cf.Insecure = true                  // would be overwritten by re-parse; not viable for table tests

The table-test shape favours dropping the row.

S9. Aborted downloads leak .kubectl-* temp files in CacheDir(); defer os.Remove does not cover SIGKILL or hardware crash Gemini¶

pkg/kuberlr/downloader.go

	}
	want, err := fetchSha512(ctx, base+".sha512")
	if err != nil {
		return fmt.Errorf("fetching checksum: %w", err)
	}
	tmp, err := os.CreateTemp(filepath.Dir(dst), ".kubectl-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName)
	h := sha512.New()
	if err := streamGet(ctx, base, io.MultiWriter(tmp, h), maxKubectlBytes); err != nil {
		tmp.Close()
		return fmt.Errorf("downloading %s: %w", base, err)
	}

Normal failure paths and orderly cancellation hit the defer; SIGKILL during a slow download or a sudden power loss leave a ~50 MB temp file behind. Each subsequent failure adds another. The eviction TODO in cache.go:33-36 does not currently target temp leftovers either.

pkg/kuberlr/cache.go

//
//	macOS:   ~/Library/Caches/rancher-desktop/kubectl/<os>-<arch>/
//	Linux:   ~/.cache/rancher-desktop/kubectl/<os>-<arch>/  ($XDG_CACHE_HOME)
//	Windows: %LocalAppData%\rancher-desktop\kubectl\<os>-<arch>\
//
// TODO(eviction): the cache only grows; a user switching across many
// remote clusters accumulates ~50 MB per minor version indefinitely.
// Pair with a per-version TTL or LRU sweep before the directory becomes
// a noticeable footprint.
func CacheDir() string {
	root := os.Getenv(envCacheDir)
	if root == "" {
		cache, err := os.UserCacheDir()
		if err != nil {

Fix: sweep *.kubectl-* from CacheDir() once at the start of ensureCached, or fold a sweep into the eventual eviction pass. Either fix is small enough to land alongside the eviction work; flag here so it is not forgotten.

S10. httpClient.Timeout = 5 minutes covers headers, body, and redirects together; on a 100 kB/s link the body alone exceeds the cap Claude¶

pkg/kuberlr/downloader.go

// envKubeMirror lets tests (and operators) point the resolver at an
// alternate mirror. Useful for offline mirrors and for the BATS cache
// lifecycle test, which serves a fake binary from a local HTTP server.
const envKubeMirror = "RDD_KUBECTL_MIRROR"

// downloadTimeout caps each mirror request. Five minutes turns a hung
// mirror into a bounded failure and still covers a ~50 MB kubectl
// binary on slow links (~170 kB/s).
const downloadTimeout = 5 * time.Minute

// httpClient enforces downloadTimeout on every kuberlr request.
// http.DefaultClient sets no Timeout, and the inherited cobra context
// carries no deadline. If the request context later carries a shorter
// deadline, that deadline wins.
var httpClient = &http.Client{Timeout: downloadTimeout}

// mirrorURL returns the mirror base URL the resolver downloads from.
//
// TODO(offline): pair this with a "may we download?" toggle so air-gapped
// users can pre-populate CacheDir() and forbid network fetches.

The comment estimates "covers a ~50 MB kubectl binary on slow links (~170 kB/s)" — exactly five minutes at that rate, leaving zero headroom for TLS handshake or initial latency, and failing outright below that throughput. Round 3 added the deadline-composition note (a backstop). Round 4 raises the magnitude question: 5 minutes assumes a fast-enough link, and the user on a slow link sees a hard "context deadline exceeded" with no obvious diagnosis.

Two coherent options: (a) bump downloadTimeout to 10 minutes (covers ~85 kB/s links), or (b) move from Client.Timeout to a transport with ResponseHeaderTimeout so only stalls (not slow throughput) trip the cap. (b) is the durable fix; (a) is a one-character edit.

Design Observations ¶

Concerns ¶

Hand-rolled subcommand walker duplicates the bound pflag set (future) Claude. parseKubeConfigFlags already binds the kubectl global flags via genericclioptions.ConfigFlags.AddFlags; after fs.Parse(args), fs.Args() returns the positional tokens whose first entry is the subcommand. isClientOnly re-walks the args manually with a hand-maintained value-taking flag map (the source of S5). Driving isClientOnly from the same pflag set eliminates both the map and the S5 class of bug. The trade-off: pflag with UnknownFlags=true may consume an unknown spaced-form flag's value as a positional token, so the walker would need a small validation step before extracting the subcommand. Out of scope for this PR; flagging as the durable follow-up.
Manual kubectl flag-coverage discipline is recurring (future, persistent) Claude. Round 2's I1 added cluster-selection flags to extractKubeFlags. Round 3's I3 added the auth/TLS flags via genericclioptions.ConfigFlags. Round 4's S5 finds another spaced-form omission (--as-user-extra). Each round closes one drift, but the gap is the inventory itself: the resolver maintains its own list of value-taking flags in parallel to kubectl's bound set. The same Design Observation appeared in Round 3 ("manual kubectl flag parsing carries drift risk"); the persistence makes the case for the pflag-driven walker concrete.

Strengths ¶

Round 3 Important fixes landed without regression ClaudeCodex. The Round 3 I1 BATS-test fix (>= 2 GET /version count) discriminates probe-ran from probe-skipped; the Round 3 I2 boolFlagValue helper unifies --client= and --help= parsing with last-wins semantics; the Round 3 I3 genericclioptions.ConfigFlags-backed loadClientConfig aligns the resolver's flag binding with kubectl's by construction.
parseKubeConfigFlags test coverage is now exhaustive Claude. TestParseKubeConfigFlags covers spaced/equals/aliases for every bound flag plus the auth/TLS surface, the -- separator, the "later flag wins" semantics, and the "flag at end without value" edge.
Recursion guard via env var (RDD_KUBECTL_RESOLVED) Claude. The simplest mechanism that survives syscall.Exec on Unix and CreateProcess on Windows; documented in environment.md as an internal var so process-tree readers can identify it.
Cache layout anticipates future cache consumers Claude. The <root>/kubectl/<os>-<arch>/ shape under RDD_CACHE_DIR gives a single override surface for tests and offline operation without coupling kuberlr to its hypothetical siblings (k3s images, Lima templates).

Testing Assessment ¶

Coverage gaps surfaced this round, ordered by user-reachable risk:

docker_socket not asserted in 5-paths.bats (S1). The BATS suite iterates ALL_KEYS to verify every key appears in rdd svc paths output. docker_socket is missing from the list, so a regression that drops the key from instancePaths() ships green.
--as-user-extra parsing path unverified (S5). TestParseKubeConfigFlags does not exercise --as-user-extra, and TestIsClientOnly does not cover any case using it. A row in each table closes the loop alongside the map fix.
--insecure-skip-tls-verify=false row asserts the default (S8). The pre-parse and post-parse states match (*bool → false); the test passes even when the binding is absent. Drop the row, or force a true → false transition.
Uppercase SHA-512 mirror response untested (S4). No fixture serves uppercase hex; the validator accepts it and the comparison rejects it without any test catching the inconsistency.
Concurrent first-use cache miss untested (declined Rounds 1-3, surfaced again this round by all three agents). Acknowledged limitation; the prior-round-resolutions gate dropped it.
End-to-end Windows execution path (pkg/kuberlr/exec_windows.go). PR description gates the merge on Windows BATS support arriving; the path remains compile-only.

Verification runs this round: Codex ran make build-rdd and go test ./pkg/kuberlr in its worktree (both passed). Claude verified the round 3 fixes against the new files. Gemini ran the discovery walk but did not execute tests.

Documentation Assessment ¶

docs/design/cmd_service.md example block (line 154) was updated to config.yaml per Round 3 S1; five other references to config.json in the same file remain (S3). The same file's rdd svc paths example also misses the docker_socket row (S1).
docs/design/environment.md documents RDD_CACHE_DIR, RDD_KUBECTL_MIRROR, RDD_KUBECTL_CACHE, and the internal RDD_KUBECTL_RESOLVED. The shell-variable table at lines 46–55 omits RDD_DOCKER_SOCKET (S1).
parseKubeConfigFlags docstring at pkg/kuberlr/resolver.go:155-173 enumerates the bound flag surface explicitly and explains why --username / --password stay unbound. The enumeration includes --as-user-extra — which is exactly how Claude spotted the missing map entry (S5). The docstring is doing review-grade work; preserve it.
Resolve docstring at pkg/kuberlr/resolver.go:43-64 notes the deprecated --username / --password carve-out and the offline TODO. Complete.

Commit Structure ¶

Single commit 563ff15. Message describes the change and matches the diff. Clean.

Acknowledged Limitations ¶

Cache eviction not implemented — pkg/kuberlr/cache.go:33-36. The TODO(eviction) comment carries forward; the design observation on the cache becoming load-bearing stands.
Offline mode not implemented — pkg/kuberlr/resolver.go:61-64, pkg/kuberlr/downloader.go:46-47. TODOs cover the planned env-var/config opt-out for download attempts.
Windows BATS host deferred — bats/tests/10-cli/7-kubectl-cache.bats:14-20. PR description gates the merge on Windows BATS support; the resolver's Windows-specific code paths (Exec, the rename race that prior rounds declined) remain compile-only.
local_teardown_file() deviation from the BATS convention — bats/tests/10-cli/7-kubectl-cache.bats:90-98. Inline rationale stands.
Falling back to embedded on any cluster-probe failure — pkg/kuberlr/resolver.go:113-144. Each nilerr branch is documented; the design choice is sound.
Embedded version unparseable in dev builds — pkg/kuberlr/resolver.go:73-80. go run ./cmd/rdd kubectl ... and IDE debug builds skip the Makefile's -ldflags -X; falling through to embedded kubectl rather than breaking every dev invocation is the right call.
--username / --password deliberately unbound on the resolver side — pkg/kuberlr/resolver.go:165-173. Binding them via WithDeprecatedPasswordFlag would force the interactive kubeconfig loader; the resolver's silent fall-through on probe failure covers the asymmetry.

Declined in prior rounds ¶

Concurrent first-use Windows os.Rename race — declined Round 1 (R1 raw I1, dropped as false positive), Round 2 (R2 raw I2/S5), and Round 3 (Round 3 R3 dropped per gate). Raised this round by all three agents (Claude S4, Codex I1, Gemini C1); dropped per the resolutions gate. The recurring re-raises strengthen the case for a follow-up TODO documenting the Windows file-mapping subtlety, but the underlying call site is unchanged.
CacheDir() panics when os.UserCacheDir() fails — declined Round 1 (R1 S11), Round 2 (R2 raw S1), and Round 3 (raised by Gemini, dropped). Failure mode requires neither $HOME nor %LocalAppData% to resolve.
RDD_CACHE_DIR scope rename — declined Round 1 (R1 S5): deliberate forward-looking knob.
Server log.Fatal* bypassing defer logFD.Close() — declined Round 1 (R1 S10): test fixture; OS reclaims fd.
Pure-Go unit coverage thin on download/cache machinery — declined Round 2 (R2 S5): BATS suite covers the integration concerns.
BATS coverage for HTTP timeout/hang, network cancel, Windows Exec, sha512sum-format mirror, trailing-slash mirror, default-log-level message capture, unreachable-server probe — declined Rounds 1-3.
info.GitVersion empty/unknown unit test — declined Round 1: requires mocked discovery client.
Resolver-policy summary and recursion-guard rationale duplicated across files — declined Round 1: each duplicate serves a different audience.
SkipResolver swallows os.Setenv error — declined Round 3 (R3 S3): stylistic only; constant env name and value cannot fail.
Cache the server-version probe across invocations — declined Round 3 (R3 S6): performance optimisation; explicit follow-up.
Move package doc-comment from cache.go to resolver.go — declined Round 3 (R3 S7): cosmetic; go doc and pkg.go.dev render either location equally.
Drop is_macos branch in write_kubectl_sha512 — declined Round 3 (R3 S8): both branches produce identical output; conditional is harmless.

Agent Performance Retro ¶

Claude ¶

Claude carried this round's breadth: two findings the orchestrator downgraded from Important to Suggestion (S5 --as-user-extra, S6 ctx propagation) plus three pure Suggestions covering test quality, doc drift, and a comment off-by-one. The S5 finding required reading both the docstring at lines 160-162 and the value-taking map at lines 195-227 against the current cli-runtime binding — the kind of cross-reference reading that catches the "documented but not implemented" gap. The S6 ctx-propagation finding is genuine but bounded by the 2 s probe timeout; calling it Important was over-calibrated. Claude's S4 (Windows os.Rename race) is the third consecutive re-raise from Claude on a finding declined in three prior rounds; the agent has no visibility into the resolutions table, which is by design, but the pattern is worth noting.

Codex ¶

Codex produced the round's most actionable finding: S1 (docker_socket missing across BATS, environment.md, and cmd_service.md) is a multi-site completeness gap that the other two agents missed entirely. The fix is mechanical, but the gap is the kind of regression a release would ship with no ill effects until someone tries to read the docs. Codex also caught S4 (uppercase hex normalisation) — a precise read of the case-sensitive comparison against the case-tolerant validator — and S2 (BATS command substitution) independently of Gemini. Tight one-Important / three-Suggestion shape, zero false positives. Best severity calibration of the round on the wins; only miss is re-raising the Windows rename race (declined three prior rounds).

Gemini ¶

Gemini's round was mixed. C2 ("corrupted BATS test decorators replaced by @pkg/util/tail/tail_test.go") is a complete hallucination — every line cited still uses @test, and grep -r "@pkg/util/tail" bats/ returns nothing. S2 ("relax skew to ±3 for K8s ≥ 1.28") cites a nonexistent KEP-3335 against Kubernetes' published version-skew policy, which keeps kubectl at ±1. C1 (Windows rename race) is the same declined-in-three-prior-rounds finding the other two agents also raised. The win this round: I1 (BATS command substitution) caught the same project-style violation as Codex independently; downgraded to S2 (BATS) on consolidation. S1 (tmp file leak on SIGKILL) is a legitimate defensive observation that the other agents missed. Two real findings out of five; the C2 hallucination is the most concerning quality issue across all three agents this round.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.5	Gemini 3.1 Pro
Duration	9m 46s	9m 54s	—
Findings	6S	3S	2S
Tool calls	44 (Bash 24, Read 20)	77 (shell 70, stdin 4, plan 3)	—
Design observations	1	1	1
False positives	0	0	2
Unique insights	5	2	1
Files reviewed	18	18	18
Coverage misses	4	2	4
Totals	6S	3S	2S
Downgraded	2 (I→S)	0	1 (I→S)
Dropped	1	1	1

#### Reconciliation

Claude I1 ctx propagation: important → suggestion S6 (UX gap bounded by the 2 s probe timeout, not a correctness issue).
Claude I2 --as-user-extra missing: important → suggestion S5 (narrow case — spaced form on a client-only command pays an unwanted probe; does not affect kubectl correctness).
Claude S4 Windows os.Rename race: suggestion → dropped (declined Rounds 1, 2, 3).
Codex I1 Windows concurrent rename: important → dropped (declined Rounds 1, 2, 3).
Gemini C1 Windows rename race: critical → dropped (declined Rounds 1, 2, 3; same finding as above).
Gemini C2 corrupted BATS decorators: critical → dropped (false positive — all cited lines verified to use @test; no @pkg/util/tail string anywhere in the repo).
Gemini I1 BATS command substitution: important → suggestion S2 (style violation, not a correctness break; merged with Codex S2).
Gemini S2 expand skew to ±3: suggestion → dropped (false positive — kubectl version-skew policy is ±1; the cited KEP-3335 does not address kubectl skew).

Codex provided the most signal-per-finding (S1 multi-site docker_socket gap is the round's most actionable finding); Claude provided the most coverage (5 of 8 retained findings); Gemini contributed one defensive observation (S9 tmp leak) and one duplicate catch (S2 BATS style). The two false positives from Gemini and the round-on-round Windows rename race re-raises consume reviewer attention without producing actionable signal.

Net Round 4 contribution after the resolutions gate and severity reconciliation: 0 Critical, 0 Important, 10 Suggestions, 2 design observations.

Review Process Notes ¶

Skill improvements ¶

Add a hallucination-defence dimension to the prompt's instructions block. When a finding describes file contents that would constitute a tool-breaking change (BATS syntax error, malformed Go file, corrupted YAML), the agent is making a high-confidence claim about bytes on disk. Rule: before reporting any finding whose existence depends on a parser-breaking malformation, read the raw file at the cited line and confirm the malformation is present. Three independent models flagging the same false positive does not save the reviewer time; one model's confident hallucination still consumes a verification round. Add to dimensions.md or instructions.md: "When a finding alleges a file is malformed in a way that would break the language's parser (syntax error, corrupted decorator, invalid header), read the cited bytes before reporting. A confidently wrong claim about file contents is the most expensive false positive class."

Repo context updates ¶

None this round. The existing deep-review-context.md content carried Codex S3 (docker_socket completeness) into reach because the BATS style guide is already inlined; the remaining gaps are PR-specific or pre-existing convention work that already appears in earlier rounds' notes.