Deep Review: 20260519-105717-pr-348

Date	2026-05-19 10:57
Repo	rancher-sandbox/rancher-desktop-daemon
Round	8
Author	@jandubois
PR	#348 — rdd kubectl: embed kuberlr-style version resolver
Branch	`embed-kuberlr`
Commits	`ce035d5` rdd kubectl: embed kuberlr-style version resolver
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — Round 8 surfaces three new Important findings on the Windows/WSL BATS plumbing introduced by Round 7's tightening of the `5-paths.bats` RDDCACHEDIR override test: the `kill ${SERVER_PID}` call in `local_teardown_file()` cannot reach the native Win32 `fake-kube-server.exe` (the helper file documents `process_kill` for this case), `env … rdd …` bypasses the `rdd()` shell function and its WSL path / WSLENV conversion, and `winpath` only handles MSYS so WSL invocations of native rdd.exe receive Linux-form paths. None of these affect macOS/Linux runs; they degrade the Windows/WSL CI matrix. I1 (`logrus.Warnf` for routine cache-miss download) is a judgment call against the repo's `RDD_LOG_LEVEL=warn` default — surface for decision rather than auto-flag. Three long-declined re-raises (Windows `os.Rename` race, `CacheRoot` panic) dropped per the resolutions gate; Claude's S2 was the same race shape and is dropped likewise.
Wall-clock time	`17 min 30 s`

Executive Summary ¶

Round 8 reviews commit ce035d5b (Round 7 was 6b419bc; Round 7 fixes — kustomize/plugin in clientOnlySubcommands, docker_socket/args.json cmd_service.md sync, inline note on exec_windows.go:27 — verified landed). The new commit also introduced bats/helpers/paths.bash::winpath, the local_teardown_file() cleanup hook on 7-kubectl-cache.bats, and the rdd svc paths honors RDD_CACHE_DIR override test on 5-paths.bats. All three Round 8 Important findings cluster on the BATS plumbing those Round 7 changes added — and all three are Windows/WSL-specific. On macOS and Linux the test surface is clean.

The keepers: I1 (Claude + Gemini) flags logrus.Warnf at downloader.go:78 in a routine download path — at the repo's default RDD_LOG_LEVEL=warn this surfaces a yellow WARN banner on every cache miss; the intent (operator-relevant "downloading kubectl..." status) is legitimate, but the level is a judgment call worth Jan's decision. I2 (Codex) notes that local_teardown_file() calls kill ${SERVER_PID} against fake-kube-server.exe, a native Win32 process MSYS kill cannot reach — bats/helpers/os.bash:96-117 documents this exact case and provides process_kill. I3 (Codex) notes that the new run -0 env RDD_CACHE_DIR=... rdd ... form in 5-paths.bats:45,48 and across 7-kubectl-cache.bats execs bin/rdd${EXE} directly via PATH, bypassing the rdd() bash function — losing WSL /mnt/... path conversion and the WSLENV adjustment that propagates RDD_* vars across the WSL boundary.

Verification dropped Codex I1, Gemini I1, Gemini I2, and Claude S2 — all re-raises of the long-declined Windows os.Rename race or CacheRoot panic. The Round 7 retro already proposed a "long-declined findings" repo-context entry; Round 8 makes that recommendation more urgent: eight rounds, four agents, same two shapes.

Structure: 0 Critical · 3 Important · 7 Suggestions.

Critical Issues ¶

None.

Important Issues ¶

I1. logrus.Warnf in routine cache-miss download surfaces a WARN banner at default log level Claude Gemini¶

pkg/kuberlr/downloader.go

		return "", err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", err
	}
	logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
	if err := download(ctx, want, path); err != nil {
		return "", err
	}
	return path, nil
}

Outside developer mode, RDD_LOG_LEVEL defaults to warn, so every logrus.Warn reaches the user's terminal. A cache miss on a remote cluster's first invocation at a new minor version is a routine expected event: the resolver downloads, sha-verifies, and execs. The user sees a yellow WARN[…] banner above their normal kubectl get pods output. Compare to the logrus.Debug lines elsewhere in the resolver for routine fall-through paths.

The intent is legitimate — telling the user "you are about to wait for a ~50 MB download" is operator-relevant — so the choice of level is a judgment call. Claude raised it as such; Gemini flagged it as Warn-level misuse outright. Two reasonable shapes for the fix:

- logrus.Warnf("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())
+ // One-time per (server-version, cache state) event. Surface what the
+ // resolver is doing without raising a WARN; the user sees kubectl
+ // output immediately after.
+ logrus.Infof("downloading kubectl v%d.%d.%d from %s", want.Major, want.Minor, want.Patch, mirrorURL())

Or keep Warnf and document the rationale inline against the project's logging convention.

I2. local_teardown_file() cannot terminate the native Win32 fake-kube-server.exe Codex¶

bats/tests/10-cli/7-kubectl-cache.bats

    CACHE_DIR=${BATS_FILE_TMPDIR}/cache

    export GOOS GOARCH EXE MIRROR_ROOT MIRROR_BIN_DIR LOG_FILE KUBECONFIG_PATH CACHE_DIR PORT GIT_VERSION_FILE VERSION_STATUS_FILE
}

local_teardown_file() {
    # Deviates from the "no teardown_file" rule. The rule's intent is to
    # preserve rdd state for post-mortem inspection; an HTTP server bound
    # to an ephemeral port is the opposite — leaving it running just leaks
    # ports between sessions.
    if load_var SERVER_PID; then
        kill "${SERVER_PID}" 2>/dev/null || true
    fi
}

local_setup() {
    rm -rf "${CACHE_DIR}"
    : >"${LOG_FILE}"
    # Reset both /version override files so each test starts with the

local_setup_file() builds the fake server as fake-kube-server.exe on Windows and launches it (lines 35-39, 49-55). MSYS2 kill cannot reach Win32 processes — bats/helpers/os.bash:96-117 documents this case explicitly and provides process_kill, which dispatches to taskkill.exe /F /PID on Windows. The existing teardown silently fails on MSYS, leaking the HTTP server and the bound ephemeral port across BATS sessions, which is the exact behavior the deviation from the "no teardown_file" rule was added to prevent. The cleanup works on macOS and Linux.

bats/helpers/os.bash

    # Check if we can run /usr/bin/true (or /bin/true) without requiring a password
    run sudo --non-interactive --reset-timestamp true
    ((status != 0))
}

# Cross-platform process management.
# MSYS2's kill cannot reach Win32 processes, so use taskkill.exe on Windows.

# Check if a process is alive. Returns 0 if alive, non-zero otherwise.
assert_process_alive() {
    local pid=$1
    if is_windows; then
        MSYS_NO_PATHCONV=1 tasklist.exe /FI "PID eq ${pid}" /NH 2>/dev/null | grep -qw "${pid}"
    else
        kill -0 "${pid}" 2>/dev/null
    fi
}

# Send SIGKILL (or TerminateProcess on Windows) to a process.
process_kill() {
    local pid=$1
    if is_windows; then
        MSYS_NO_PATHCONV=1 taskkill.exe /PID "${pid}" /F >/dev/null 2>&1
    else
        kill -9 "${pid}"
    fi
}

bats/tests/10-cli/7-kubectl-cache.bats

    (
        cd "${BATS_TEST_DIRNAME}/fake-kube" || exit
        go build -ldflags='-s -w' \
            -o "$(winpath "${MIRROR_BIN_DIR}/kubectl${EXE}")" \
            ./kubectl
        go build -ldflags='-s -w' \
            -o "$(winpath "${BATS_FILE_TMPDIR}/fake-kube-server${EXE}")" \
            ./server
    )
    SERVER_BIN=${BATS_FILE_TMPDIR}/fake-kube-server${EXE}

    PORT_FILE=${BATS_FILE_TMPDIR}/port
    LOG_FILE=${BATS_FILE_TMPDIR}/server.log
    GIT_VERSION_FILE=${BATS_FILE_TMPDIR}/git-version
    VERSION_STATUS_FILE=${BATS_FILE_TMPDIR}/version-status

 local_teardown_file() {
     if load_var SERVER_PID; then
-        kill "${SERVER_PID}" 2>/dev/null || true
+        process_kill "${SERVER_PID}" 2>/dev/null || true
     fi
 }

I3. run -0 env … rdd … bypasses the rdd() shell function on WSL Codex¶

bats/tests/10-cli/5-paths.bats

}

@test 'rdd svc paths honors RDD_CACHE_DIR override' {
    cache_override=$(winpath "${BATS_TEST_TMPDIR}/cache-override")

    run -0 env RDD_CACHE_DIR="${cache_override}" rdd svc paths cache_dir
    assert_output "${cache_override}"

    run -0 env RDD_CACHE_DIR="${cache_override}" rdd svc paths kubectl_cache
    assert_output --partial "${cache_override}"
    assert_output --regexp 'kubectl[/\\][a-z0-9]+-[a-z0-9]+$'

bats/helpers/commands.bash:42-89 defines rdd() as a bash function that handles three Windows boundaries: MSYS path conversion (lines 47-61), WSL /mnt/... path conversion (lines 62-70), and the WSLENV augmentation (lines 71-82) that adds every RDD_* env var to WSLENV so they cross the WSL→Win32 boundary into rdd.exe. bats/helpers/load.bash:65 puts bin/ on PATH, so env … rdd … execvp's bin/rdd${EXE} directly — none of the wrapper logic runs. Two consequences on WSL:

bats/helpers/commands.bash

        return 0
    fi
    return 1
}

rdd() {
    local arg
    local args=("$@")
    if is_windows; then
        args=()
        if is_msys; then
            # MSYS_NO_PATHCONV is set globally to protect URL-like paths
            # (e.g. /passthrough/demo/hello), so convert filesystem paths
            # to Windows format manually.
            for arg in "$@"; do
                if [[ "${arg}" =~ ^/([a-zA-Z]/|tmp(/|$)) ]]; then
                    # Drive-letter mounts (/c/...) and /tmp are always
                    # filesystem paths; convert without existence check.
                    args+=("$(cygpath -w "${arg}")")
                elif [[ "${arg}" == /* ]] && [[ -e "${arg}" ]]; then
                    args+=("$(cygpath -w "${arg}")")
                else
                    args+=("${arg}")
                fi
            done
        else
            # WSL: convert /mnt/... paths.
            for arg in "$@"; do
                if [[ "${arg}" != "${arg#/mnt/}" ]]; then
                    args+=("$(wslpath -w "${arg}")")
                else
                    args+=("${arg}")
                fi
            done
            # Adjust WSLENV to include everything starting with RDD_
            local env envs
            mapfile -t envs < <({
                tr : '\n' <<<"${WSLENV}"
                env | awk -F= '/^RDD_/ { print $1 }'
            } | sort -u || true)
            WSLENV=""
            for env in "${envs[@]}"; do
                WSLENV="${WSLENV}:${env}"
            done
            WSLENV=${WSLENV#:}
            export WSLENV
        fi
    fi
    # Close BATS's internal fds 3 and 4 so any daemon rdd spawns (hostagent,
    # qemu) cannot inherit them. A grandchild that keeps fd 3 open will hang
    # bats when it next tries to capture output via run/$(...).
    "${PATH_REPO_ROOT}/bin/rdd${EXE}" "${args[@]}" 3>&- 4>&-
}

bats/helpers/load.bash

# is_windows() etc from os.bash,
# and PATH_* variables from paths.bash
source "${PATH_BATS_HELPERS}/commands.bash"

# Add repo-root/bin directory to the PATH. This is where the Makefile puts all compiled programs.
export PATH="${PATH_BATS_ROOT}/../bin:${PATH}"

# If called from foo() this function will call local_foo() if it exist.
call_local_function() {
    local func
    func="local_$(calling_function)"

bats/tests/10-cli/5-paths.bats

@test 'rdd svc paths honors RDD_CACHE_DIR override' {
    cache_override=$(winpath "${BATS_TEST_TMPDIR}/cache-override")

    run -0 env RDD_CACHE_DIR="${cache_override}" rdd svc paths cache_dir
    assert_output "${cache_override}"

    run -0 env RDD_CACHE_DIR="${cache_override}" rdd svc paths kubectl_cache
    assert_output --partial "${cache_override}"
    assert_output --regexp 'kubectl[/\\][a-z0-9]+-[a-z0-9]+$'
}

# bats/helpers/commands.bash:71-82 — WSLENV augmentation
WSLENV=""
for env in "${envs[@]}"; do
    WSLENV="${WSLENV}:${env}"
done

First, RDD_CACHE_DIR and RDD_KUBECTL_MIRROR set on the env command line do not get added to WSLENV, so the values do not cross into rdd.exe. Second, winpath (see S1) does not handle WSL, so the pre-conversion in the test produces Linux-form paths that rdd.exe cannot read.

The fix routes the call through the bash function. Two ways:

-    run -0 env \
-        RDD_CACHE_DIR="$(winpath "${CACHE_DIR}")" \
-        RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
-        KUBECONFIG="$(winpath "${KUBECONFIG_PATH}")" \
-        rdd kubectl get pods
+    RDD_CACHE_DIR="${CACHE_DIR}" \
+    RDD_KUBECTL_MIRROR="http://127.0.0.1:${PORT}" \
+    KUBECONFIG="${KUBECONFIG_PATH}" \
+    run -0 rdd kubectl get pods

The temporary-assignment-before-command form invokes the rdd function with the env vars set for that one call; the function then handles path conversion and WSLENV. (winpath becomes redundant once the wrapper handles it.)

Suggestions ¶

S1. winpath does not handle WSL→Windows.exe path conversion Codex¶

bats/helpers/paths.bash

# winpath converts an MSYS-namespace path to its Windows-native
# equivalent (e.g. /tmp/x -> C:\msys64\tmp\x). Use it for env vars
# and command args that pass straight to a native Windows binary
# without traversing the rdd shell function's path-conversion path.
# Passes through unchanged on Linux and macOS.
winpath() {
    if is_msys; then
        cygpath -w "$1"
    else
        printf '%s' "$1"
    fi
}

# Convert 'export KEY="C:\win\path"' to POSIX paths (WSL or MSYS2).
win_to_posix_exports() {
    tr -d "\r" | while IFS= read -r line; do
        if [[ "${line}" =~ ^(export\ [A-Z_]+=)\"(.+)\"$ ]]; then

winpath only converts on MSYS. WSL is the other shell that calls a native Windows rdd.exe (commands.bash:4-6 sets EXE=.exe whenever using_windows_exe is true), and the existing repo convention for WSL path conversion is wslpath -w (see commands.bash:65-66 and paths.bash:26-27). On WSL the $(winpath ...) call sites pass Linux-form paths into rdd.exe, which cannot read them. The narrow fix:

bats/helpers/commands.bash

EXE=""
PLATFORM=${OS}
if is_windows; then
    if using_windows_exe; then
        EXE=".exe"
        PLATFORM=win32
    else
        # WSL with Linux binary (not yet supported).
        PLATFORM=linux
    fi
fi

bats/helpers/commands.bash

                fi
            done
        else
            # WSL: convert /mnt/... paths.
            for arg in "$@"; do
                if [[ "${arg}" != "${arg#/mnt/}" ]]; then
                    args+=("$(wslpath -w "${arg}")")
                else
                    args+=("${arg}")
                fi
            done
            # Adjust WSLENV to include everything starting with RDD_

bats/helpers/paths.bash


# Convert 'export KEY="C:\win\path"' to POSIX paths (WSL or MSYS2).
win_to_posix_exports() {
    tr -d "\r" | while IFS= read -r line; do
        if [[ "${line}" =~ ^(export\ [A-Z_]+=)\"(.+)\"$ ]]; then
            if command -v wslpath >/dev/null 2>&1; then
                printf '%s"%s"\n' "${BASH_REMATCH[1]}" "$(wslpath -u "${BASH_REMATCH[2]}")"
            else
                printf '%s"%s"\n' "${BASH_REMATCH[1]}" "$(cygpath -u "${BASH_REMATCH[2]}")"
            fi
        fi
    done

 winpath() {
     if is_msys; then
         cygpath -w "$1"
+    elif is_windows && using_windows_exe; then
+        wslpath -w "$1"
     else
         printf '%s' "$1"
     fi
 }

A wider fix routes the call sites through the rdd() shell function (see I3); doing both is defensible because winpath is also used in local_setup_file for raw ${SERVER_BIN} args that never traverse the rdd() wrapper.

S2. Unbound kubectl-global bool flags can hide the subcommand from isClientOnly Claude¶

pkg/kuberlr/resolver.go

//
// Conservative bias: when isClientOnly cannot identify the
// invocation, prefer to probe; misclassifying as client-only would
// skip the version skew check and silently use a mismatched embedded
// kubectl.
func isClientOnly(args []string) bool {
	if len(args) == 0 {
		return true
	}
	fs := pflag.NewFlagSet("rdd-kubectl", pflag.ContinueOnError)
	fs.ParseErrorsAllowlist.UnknownFlags = true
	fs.SetOutput(io.Discard)
	fs.Usage = func() {}

	cf := genericclioptions.NewConfigFlags(true)
	cf.AddFlags(fs)

	var help, client, warningsAsErrors bool
	fs.BoolVarP(&help, "help", "h", false, "")
	fs.BoolVar(&client, "client", false, "")
	fs.BoolVar(&warningsAsErrors, "warnings-as-errors", false, "")

	_ = fs.Parse(args)

	if help {
		return true
	}
	positionals := fs.Args()
	if len(positionals) == 0 {
		return true
	}
	subcommand := positionals[0]
	if subcommand == "version" {
		// `kubectl version` (no --client) probes the server for its
		// version. Treat that as cluster-bound so the resolver picks
		// a version-matched binary to do the probe.
		return client
	}
	return clientOnlySubcommands[subcommand]
}

The pflag UnknownFlags handling at lines 217-224 documents — and the test at resolver_test.go:200 confirms — that an unknown spaced flag consumes its next token as the assumed value. kubectl --match-server-version config view therefore has pflag swallow config, leaves view as the sole positional, and view is not in clientOnlySubcommands, so the resolver probes the cluster for a command kubectl can answer locally. The fall-through is correct (no silent mismatched execution), but the user pays the 2-second probe penalty. --match-server-version is the obvious candidate to add; an audit of upstream staging/src/k8s.io/kubectl/pkg/cmd/cmd.go would surface any other bool globals kubectl declares above the subcommand layer.

pkg/kuberlr/resolver_test.go

		{"--server URL must not be read as subcommand", []string{"--server", "completion", "get", "pods"}, false},
		{"-n namespace before subcommand", []string{"-n", "kube-system", "get", "pods"}, false},
		{"-n with config", []string{"-n", "ns", "config", "view"}, true},
		{"--as-user-extra spaced before subcommand", []string{"--as-user-extra", "k=v", "config", "view"}, true},
		{"--as-user-extra equals before subcommand", []string{"--as-user-extra=k=v", "config", "view"}, true},
		{"unknown long flag swallows next non-flag arg as assumed value", []string{"--unknown", "config", "view"}, false},
		{"stops at --", []string{"exec", "pod", "--", "completion"}, false},
		{"-- before subcommand still finds the subcommand", []string{"--", "get", "pods"}, false},
		{"unknown --v spaced rides UnknownFlags path", []string{"--v", "4", "config", "view"}, true},
		{"unknown --v=N equals form rides UnknownFlags path", []string{"--v=4", "config", "view"}, true},
		{"unknown --vmodule spaced rides UnknownFlags path", []string{"--vmodule", "foo=2", "config", "view"}, true},

pkg/kuberlr/resolver.go

// (--v, --vmodule, --log_dir, …) intentionally stay unbound so their
// values do not leak into klog's process-global state; pflag's
// UnknownFlags handling consumes the assumed value of any spaced
// unknown flag, so `--v 4 config view` correctly leaves [config,
// view] as positionals.
//
// Bool-parse errors short-circuit pflag's parser: a malformed value
// like `--client=garbage version` halts parse mid-args, leaves no
// positionals, and lands in the empty-positionals branch below
// (return true). This deviates from a strict conservative bias, but
// kubectl rejects the same malformed flag in the embedded path
// before contacting any cluster, so no silent mismatched execution
// follows.
//
// Conservative bias: when isClientOnly cannot identify the
// invocation, prefer to probe; misclassifying as client-only would
// skip the version skew check and silently use a mismatched embedded
// kubectl.

 var help, client, warningsAsErrors bool
 fs.BoolVarP(&help, "help", "h", false, "")
 fs.BoolVar(&client, "client", false, "")
 fs.BoolVar(&warningsAsErrors, "warnings-as-errors", false, "")
+var matchServerVersion bool
+fs.BoolVar(&matchServerVersion, "match-server-version", false, "")

A mirror of the existing --warnings-as-errors test cases locks the behavior in: {"--match-server-version before config", []string{"--match-server-version", "config", "view"}, true}.

S3. skipResolver global is a set-only toggle that leaks across hypothetical future tests Claude¶

pkg/kuberlr/resolver.go

const envSkipResolver = "RDD_KUBECTL_RESOLVED"

// skipResolver short-circuits Resolve for the rest of this process.
// Same-process toggle; envSkipResolver covers the cross-process recursion
// guard that Exec sets on the kubectl child.
var skipResolver bool

// SkipResolver short-circuits Resolve for the rest of this process.
// rdd ctl calls this before kubectlAction because it always targets
// the embedded apiserver, whose version matches the embedded kubectl
// by construction — the probe would always fall through anyway.
func SkipResolver() {
	skipResolver = true
}

// serverProbeTimeout caps the discovery call. A reachable cluster answers
// in under 100 ms; an unreachable one would otherwise stall every kubectl
// invocation.
const serverProbeTimeout = 2 * time.Second

SkipResolver() is a one-way toggle on a package-global; nothing flips it back. TestResolveEmbeddedVersionUnparseable swaps embeddedVersion via t.Cleanup but never touches skipResolver, so a future test that calls SkipResolver() would leave the bit set for any later test that exercises Resolve, and Resolve short-circuits on line 69 before the path the test means to exercise. The test would still pass but for the wrong reason.

pkg/kuberlr/resolver.go

// TODO(offline): when no cached kubectl matches and the network is down,
// Resolve currently fails with the download error. Add an opt-out for
// download attempts (config flag or RDD_KUBECTL_OFFLINE) and fall back to
// the cached binary closest to the server's minor.
func Resolve(ctx context.Context, args []string) (string, error) {
	if skipResolver || os.Getenv(envSkipResolver) != "" {
		return "", nil
	}
	if isClientOnly(args) {
		return "", nil
	}

In production the CLI process exits after one command, so the global is fine; the concern is purely test hygiene against future test additions.

 func TestResolveEmbeddedVersionUnparseable(t *testing.T) {
+    origSkip := skipResolver
+    t.Cleanup(func() { skipResolver = origSkip })
     orig := embeddedVersion
     t.Cleanup(func() { embeddedVersion = orig })

S4. main.go errors.As block captures msg it does not use; check exitErr.Err directly Claude¶

cmd/rdd/main.go

	)
	if err := cli.RunNoErrOutput(cmd); err != nil {
		// *cliexit.Error lets a subcommand opt into a specific exit code; everything else gets exit 1.
		// Empty Err means the subcommand has already written its own output (e.g. kuberlr.Exec
		// propagating a kubectl child's exit code), so skip the redundant log line.
		var exitErr *cliexit.Error
		if errors.As(err, &exitErr) {
			if msg := err.Error(); msg != "" {
				logrus.Error(err)
			}
			os.Exit(exitErr.Code)
		}
		logrus.Fatal(err)
	}
}

The captured msg is unused; logrus.Error(err) calls err.Error() again. The empty-string check works today because cliexit.Error.Error() returns "" iff Err == nil. If a future caller wraps the cliexit.Error (e.g. fmt.Errorf("foo: %w", &cliexit.Error{Code: c})), the outer err.Error() is non-empty even when the inner sentinel marks "subcommand already wrote output," and the redundant log line returns. Checking exitErr.Err != nil states intent precisely and survives the wrap.

 if errors.As(err, &exitErr) {
-    if msg := err.Error(); msg != "" {
+    if exitErr.Err != nil {
         logrus.Error(err)
     }
     os.Exit(exitErr.Code)
 }

S5. winpath via $(...) swallows non-zero exits in standalone assignments Claude¶

bats/tests/10-cli/5-paths.bats

    assert_output --partial 'no_such_key'
    assert_output --partial 'valid keys'
}

@test 'rdd svc paths honors RDD_CACHE_DIR override' {
    cache_override=$(winpath "${BATS_TEST_TMPDIR}/cache-override")

    run -0 env RDD_CACHE_DIR="${cache_override}" rdd svc paths cache_dir
    assert_output "${cache_override}"

    run -0 env RDD_CACHE_DIR="${cache_override}" rdd svc paths kubectl_cache

The repo's BATS style rule prefers run -0 cmd ...; value=${output} over value=$(cmd ...) because command substitution discards the exit code and leaves the variable empty on failure. winpath delegates to cygpath -w on MSYS, which can fail in pathological cases. At the standalone assignment sites — 5-paths.bats:43 and the SERVER_BIN=... shape near 7-kubectl-cache.bats:39 — the run -0 form locks in the safety. Sites inside a run command line (most of 7-kubectl-cache.bats) cannot use run -0 recursively; those are unavoidable shell composition.

bats/tests/10-cli/5-paths.bats

    assert_output --partial 'no_such_key'
    assert_output --partial 'valid keys'
}

@test 'rdd svc paths honors RDD_CACHE_DIR override' {
    cache_override=$(winpath "${BATS_TEST_TMPDIR}/cache-override")

    run -0 env RDD_CACHE_DIR="${cache_override}" rdd svc paths cache_dir
    assert_output "${cache_override}"

    run -0 env RDD_CACHE_DIR="${cache_override}" rdd svc paths kubectl_cache

bats/tests/10-cli/7-kubectl-cache.bats

            ./kubectl
        go build -ldflags='-s -w' \
            -o "$(winpath "${BATS_FILE_TMPDIR}/fake-kube-server${EXE}")" \
            ./server
    )
    SERVER_BIN=${BATS_FILE_TMPDIR}/fake-kube-server${EXE}

    PORT_FILE=${BATS_FILE_TMPDIR}/port
    LOG_FILE=${BATS_FILE_TMPDIR}/server.log
    GIT_VERSION_FILE=${BATS_FILE_TMPDIR}/git-version
    VERSION_STATUS_FILE=${BATS_FILE_TMPDIR}/version-status

S6. BATS coverage forces every test out of skew; the within-skew "no download" path runs only in unit tests Claude¶

The five 7-kubectl-cache.bats tests all pin the fake apiserver at v1.99.0, always out of skew with the embedded kubectl. The end-to-end coverage exercises the download lifecycle (miss, hit, mirror 404, sha mismatch, apiserver 500 fall-through) but never the within-skew branch where compatible() returns true and Resolve returns "" without probing the mirror. The within-skew rule is covered by TestCompatible at the unit level; an end-to-end test (e.g. set --git-version to v$(EMBEDDED_MINOR).0) would prove the whole resolver respects the rule end-to-end.

S7. Per-patch cache key accumulates one binary per patch release Claude¶

pkg/kuberlr/downloader.go

	}
	return path, nil
}

// cachedPath returns the absolute path of the cached kubectl for v.
func cachedPath(v semver.Version) string {
	name := fmt.Sprintf("kubectl-v%d.%d.%d", v.Major, v.Minor, v.Patch)
	if runtime.GOOS == "windows" {
		name += ".exe"
	}
	return filepath.Join(CacheDir(), name)
}

// download fetches kubectl v from the upstream mirror, verifies its sha512
// against the mirror's published checksum, and renames the verified file to
// dst. A failure leaves no partial file behind for the next call to mistake
// for a valid binary.

A cluster stepping through v1.30.5 → v1.30.6 → v1.30.7 accumulates three ~50 MB binaries even though any one would satisfy the skew rule. The eviction TODO in cache.go:33-39 addresses growth but not consolidation. A future enhancement could cache one binary per server minor (the highest patch the mirror knows about for that minor) and reuse it for any server in the same minor band. Within scope of the existing TODO; flag for sequencing.

pkg/kuberlr/cache.go

//
//	macOS:   ~/Library/Caches/rancher-desktop/kubectl/<os>-<arch>/
//	Linux:   ~/.cache/rancher-desktop/kubectl/<os>-<arch>/  ($XDG_CACHE_HOME)
//	Windows: %LocalAppData%\rancher-desktop\kubectl\<os>-<arch>\
//
// TODO(eviction): the cache only grows; a user switching across many
// remote clusters accumulates ~50 MB per minor version indefinitely.
// SIGKILL or power loss during a download also leaks the .kubectl-*
// temp file that defer os.Remove normally cleans up. A per-version
// TTL or LRU sweep should clear both stale binaries and stale
// .kubectl-* leftovers before the directory becomes a noticeable
// footprint.
func CacheDir() string {
	return filepath.Join(CacheRoot(), "kubectl", runtime.GOOS+"-"+runtime.GOARCH)
}

// CacheRoot returns the rdd-wide cache root. RDD_CACHE_DIR overrides the

Design Observations ¶

Concerns ¶

Double execution of KUBECONFIG exec auth plugins (in-scope) [carried over from Round 7 retro]. serverVersion calls discovery.NewDiscoveryClientForConfig(cfg).ServerVersion(), which runs the user's kubeconfig exec credential plugin (SSO, IDP CLI, hardware-token). The probe runs the plugin once; if Resolve then downloads and execs a remote kubectl, kubectl invokes the same plugin again. Non-interactive plugins double the latency; interactive plugins double the prompts. The TODO at resolver.go:64-67 for RDD_KUBECTL_OFFLINE is the natural escape hatch — offline mode should also suppress the probe so the user's exec plugin runs exactly once.

Strengths ¶

genericclioptions.ConfigFlags for resolver flag binding Codex Gemini. The probe and the kubectl child stay aligned on kubeconfig precedence by construction — any future kubectl flag rename or addition flows through NewConfigFlags(true).AddFlags(fs) without a parallel implementation to drift.
Body-size caps and per-request timeout on streamGet Claude. maxSha512Bytes (4 KiB) and maxKubectlBytes (256 MiB), paired with downloadTimeout (5 min), turn a malicious or misconfigured mirror into a bounded failure rather than an unbounded write.
Sibling Go module for fake-kube fixtures Claude Gemini. bats/tests/10-cli/fake-kube/go.mod keeps test-only dependencies out of the parent module, and a single server plays both apiserver /version and release mirror /release/* roles with file-based per-test overrides.
Conservative probe fallback policy Codex Gemini. Missing kubeconfig, malformed kubeconfig, unreachable cluster, and unparseable server version all fall through silently to the embedded kubectl. Only download and sha errors surface loudly, matching the "user wants to know" cut explicitly drawn in the resolver's comments.
Cross-process recursion guard Claude. The RDD_KUBECTL_RESOLVED env var plus the same-process skipResolver flag together cover both axes of recursion: kubectl reentry through a shim, and rdd ctl short-circuiting before it dispatches to kubectlAction.

Testing Assessment ¶

The Pure-Go coverage in pkg/kuberlr/resolver_test.go and the five-test BATS lifecycle in 7-kubectl-cache.bats cover the user-reachable paths well. Round 8 surfaces three coverage gaps, all on the testing side rather than the production code:

run -0 env … rdd … form does not exercise the WSL wrapper path (I3). Five sites in 7-kubectl-cache.bats plus two in 5-paths.bats use the bypass form. Re-routing them through the rdd() function (see I3's diff) restores Windows/WSL coverage without changing macOS/Linux behavior.
No BATS test exercises the within-skew "no download" path (S6). TestCompatible covers the rule at the unit level; an end-to-end assertion that the resolver did not GET /release/... when the server version is within ±1 minor would close the loop.
No unit test exercises the cross-process recursion guard. A test that sets RDD_KUBECTL_RESOLVED=1 and asserts Resolve(ctx, []string{"get", "pods"}) returns ("", nil) would lock the envSkipResolver short-circuit in.

Documentation Assessment ¶

docs/design/cmd_service.md and docs/design/environment.md match the resolver behavior — the Round 7 fixes landed (key table now lists docker_socket, cache_dir, kubectl_cache; example output shows args.json and config.yaml). No new documentation gaps surfaced this round.

Commit Structure ¶

Single commit ce035d5 "rdd kubectl: embed kuberlr-style version resolver". The config.json → config.yaml, rdd.args → args.json, and docker_socket key-table additions in cmd_service.md are doc-sync corrections that predate this PR; Round 7 explicitly green-lit bundling them as drive-bys while editing the same paragraph. Worth noting in the commit body that the doc fixes are bundled with the feature, but the bundling decision itself is settled.

Acknowledged Limitations ¶

Code comment: pkg/kuberlr/cache.go:33-39 documents unbounded cache growth and leaked .kubectl-* temp files after SIGKILL or power loss; eviction work is deferred.
Code comment: pkg/kuberlr/downloader.go:55-57 and pkg/kuberlr/resolver.go:64-67 document the planned RDD_KUBECTL_OFFLINE opt-out; intentionally deferred.
Code comment: pkg/kuberlr/downloader.go:138-141 documents maxKubectlBytes truncation surfacing as "checksum mismatch"; left as a comment per Round 6.
Code comment: pkg/kuberlr/exec_windows.go:27-30 documents the dormant CommandContext hard-kill race against console-delivered Ctrl+C; a future refactor that wires signal-driven cancellation must revisit this site (Round 7 S3 → Commented).
Code comment: pkg/kuberlr/resolver.go:217-228 documents the --client=garbage pflag short-circuit and the explicit deviation from the conservative bias.
Code comment: pkg/kuberlr/resolver.go:168-175 documents leaving --username/--password unbound on the resolver side to dodge ToRawKubeConfigLoader's interactive prompt.
Code comment: bats/tests/10-cli/7-kubectl-cache.bats:91-99 documents the deliberate deviation from the "no teardown_file" rule for the ephemeral HTTP-server fixture. See I2 for the implementation defect this deviation now needs to actually deliver on its intent.

Declined in prior rounds ¶

Concurrent first-use Windows os.Rename race — declined Rounds 1, 2, 3, 4, 5, 6, 7. Re-raised this round by Codex (I1), Gemini (I1), and Claude (S2). Dropped per the resolutions gate. Eight rounds running.
CacheRoot panic on os.UserCacheDir failure — declined Rounds 1, 2, 3, 5, 7. Re-raised this round by Gemini (I2) and Claude (S8). Dropped per the resolutions gate.

Unresolved Feedback ¶

None. The only PR review comment was @mook-as on cmd/rdd/service_paths.go:33 asking to expose RDD_CACHE_DIR; the current diff adds cache_dir and kubectl_cache to instancePaths(), and docs/design/environment.md:39 documents the dual role. Round 7 marked this resolved.

Agent Performance Retro ¶

Claude ¶

Claude returned cleanly after Round 7's stream-idle timeout. The shape was thorough: three Important findings (I1 the Warnf UX call, I2 the --match-server-version flag binding, I3 the skipResolver test hygiene) plus eight Suggestions ranging from doc bundling to per-patch cache accumulation. Consolidation downgraded Claude I2 and I3 to Suggestions (S2 and S3) because the behavior Claude flagged is documented in code and covered by an existing test case (the --unknown config view test at resolver_test.go:200), and the test fragility hits only hypothetical future tests. The unique value: I1's read of the RDD_LOG_LEVEL=warn default against this PR's new Warnf (matched by Gemini), plus the errors.As redundancy in main.go (S4) that neither other agent surfaced. Coverage was complete on the 20 changed files.

Codex ¶

Codex anchored the Round 8 keepers. Three new Important findings landed on the BATS plumbing introduced by the Round 7 fixes — and Codex was the only agent to read bats/helpers/commands.bash and bats/helpers/os.bash deeply enough to surface them (I2 kill vs process_kill for Win32, I3 env … rdd … bypass of the wrapper, S1 winpath WSL gap). Codex also re-raised the long-declined Windows os.Rename race (I1 → dropped per gate), but with no severity escalation and verified go test ./pkg/kuberlr passed before writing.

Gemini ¶

Gemini brought back two declined-in-prior-rounds findings (I1 Windows os.Rename, I2 CacheRoot panic) for the eighth and sixth time respectively. The single new contribution was I3 (the Warnf level critique), which Claude raised independently and consolidates with Claude's I1. No suggestions, no design observations. Gemini reviewed 20 files with no coverage misses but produced the thinnest net new signal of the three agents this round.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.5	Gemini 3.1 Pro
Duration	9m 40s	7m 03s	—
Findings	1I 6S	2I 1S	1I
Tool calls	24 (Read 16, Bash 4, Grep 4)	48 (shell 48)	—
Design observations	0	1	0
False positives	0	0	0
Unique insights	3	3	0
Files reviewed	20	20	20
Coverage misses	0	0	0
Totals	1I 6S	2I 1S	1I
Downgraded	2 (I→S)	0	0
Dropped	1	1	2

Overall: Codex provided the highest signal this round — three new Important findings on the Round 7 plumbing surface that the other two agents missed entirely. Claude rounded out the round with the unique main.go redundancy and the --match-server-version defensive binding. Gemini's contribution was confined to seconding Claude's Warnf call; its other findings were re-raises of long-declined items.

Review Process Notes ¶

Repo context updates

Add a "long-declined findings on PR #348" section to deep-review-context.md for the duration of this PR's review window. Round 7 made this recommendation; Round 8's repetition confirms the cost. List the two declined-three-or-more-rounds findings — Windows os.Rename cache-publish race (rounds 1-8) and CacheRoot panic on os.UserCacheDir failure (rounds 1, 2, 3, 5, 7, 8) — with one-line decline rationales. Agents reviewing future amendments will see them in the inlined repo context and can either skip or argue with new evidence rather than re-raise the same shape. Keep the list short and PR-scoped.