Deep Review: 20260421-132014-pr-333

Date2026-04-21 13:20
Reporancher-sandbox/rancher-desktop-daemon
Round1
Author@Nino-K
PR#333 — Docker Engine: improve Docker context probe safety and isolation
Commits179c097 Docker Engine: improve Docker context probe safety and isolation
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
VerdictMerge with fixes — the BATS test for the "no healthy context" path becomes unreliable on hosts with an active Docker daemon, and a handful of cleanups (dead helper, silent error paths, tmpdir scope) are worth addressing before merge.
Wall-clock time29 min 10 s


Executive Summary

The PR replaces getDockerContextHost + hand-rolled dockerclient.WithHost(host) probing with a docker/cli-native probeNamedDockerContext that goes through ContextStore, WithTLSData, and EndpointMeta.ClientOpts(). This fixes four false-override cases enumerated in the PR description (TCP+TLS, mTLS, SkipTLSVerify, SSH) and adds an early return when config.json is unreadable so a corrupt file is never round-tripped.

The core change is a solid correctness improvement. The remaining issues cluster around two themes:

  1. Test semantics drifted. The new probeDockerContext() uses dockerclient.FromEnv, which falls back to the platform default socket. The test moby engine sets currentContext when no healthy context exists no longer guarantees the precondition in its own name: on a developer machine or a CI runner with Docker pre-installed, the default socket does respond and the reconciler correctly refrains from overriding — but the test then times out waiting for an override that never comes.
  2. Dead and under-logged code. getDockerContextHost is now only referenced by its own unit tests, and the five new return true bailouts in probeNamedDockerContext give operators no breadcrumb when RDD silently declines to promote its context.

All three agents agreed on the core test-reliability issue (framed as either a test bug or a production-semantics gap). One agent raised a WaitGroup-race concern that, on closer inspection, is prevented by the existing watcherMu ordering; it is reconciled out below.

Report structure: 0 Critical, 1 Important, 5 Suggestions, 1 Design Concern, 2 Design Strengths.


Critical Issues

None.


Important Issues

I1. moby engine sets currentContext when no healthy context exists is flaky wherever the host default Docker socket responds Claude Codex Gemini
    local socket_path=${output}
    run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
    assert_output "unix://${socket_path}"
}

@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"

    rdd set running=false
    rdd set running=true containerEngine.name=moby
    rdd ctl wait --for=condition=ContainerEngineReady app/app --timeout=30s

    # The probe goroutine runs asynchronously; give it a moment.
    try --max 6 --delay 1 -- \
        bash -c "jq -r '.currentContext' '${DOCKER_CONFIG}/config.json' | grep -qx '${context_name}'"

    run -0 jq -r '.currentContext' "${DOCKER_CONFIG}/config.json"
    assert_output "${context_name}"
}

@test "stopping moby engine removes Docker context and clears currentContext" {
    rdd set running=false

    local context_name="rancher-desktop-${RDD_INSTANCE}"

Before this PR, when currentContext was "" or "default", the reconciler skipped probing and treated the default as unhealthy, so the override always fired. After this PR, probeDockerContext calls pingDocker(ctx, dockerclient.FromEnv), which falls back to /var/run/docker.sock on macOS/Linux when DOCKER_HOST is unset. The BATS local_setup_file exports DOCKER_HOST after rdd set running=true has already spawned the service (engine-docker.bats:29-31), and service.Start passes no explicit cmd.Env, so the service inherits whatever the BATS shell had at spawn — typically nothing. On any host with a reachable default socket (GitHub-hosted ubuntu-latest runners ship Docker pre-installed and running; developer machines with Docker Desktop), the probe returns healthy, setCurrentDockerContext never fires, and the try block times out.

    # Deliberately skip setup_rdd_control_plane here: `rdd set` internally
    # calls ensureServiceRunning, which is exactly the CLI path we want to
    # exercise — the engine controller is part of the default controller
    # set, so no explicit --controllers selection is needed.
    rdd svc delete
    rdd set running=true
    run -0 rdd svc paths docker_socket
    export DOCKER_HOST="unix://${output}"
    # Mirror resources live in App.spec.namespace. Override RDD_NAMESPACE
    # to whatever the App was created with so the test queries the same
    # namespace the engine controller uses, regardless of CRD defaults.
    RDD_NAMESPACE=$(rdd ctl get app app -o jsonpath='{.spec.namespace}')
    export RDD_NAMESPACE

The production side of the same finding: rdd set reuses an already-running service via cmd/rdd/service.go:73-87, so the daemon's DOCKER_HOST/DOCKER_CERT_PATH/DOCKER_TLS_VERIFY environment is frozen at the moment the user first ran rdd svc start. Any subsequent shell-side change is invisible to the probe, and the common "service already running, toggle App back to moby" path consults stale values.

		RunE:  serviceConfigAction,
	}
	return command
}

func ensureServiceRunning(ctx context.Context) error {
	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		ctx, cancel := context.WithTimeout(ctx, 90*time.Second)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return err
		}
		return waitForDiscoveryConfigMap(ctx)
	}
	return startAndWaitForReady(ctx, nil)
}

// startAndWaitForReady starts the service, waits for the API server and the
// discovery ConfigMap to be ready. After an unclean shutdown the ConfigMap may
// survive with stale data, so the freshness check waits for a ConfigMap whose

Fix (test): seed a deliberately unreachable named context and mark it current, so the reconciler takes the probeNamedDockerContext branch with a guaranteed-failing endpoint independent of what the host's default socket does:

 @test "moby engine sets currentContext when no healthy context exists" {
     local context_name="rancher-desktop-${RDD_INSTANCE}"

+    # Force the probe to fail regardless of whether the host has a running
+    # Docker daemon at the default socket. Seed a named context pointing at
+    # an unreachable socket and make it current.
+    local probe_target="test-unreachable"
+    local probe_hash
+    if command -v sha256sum &>/dev/null; then
+        probe_hash=$(printf '%s' "${probe_target}" | sha256sum | awk '{print $1}')
+    else
+        probe_hash=$(printf '%s' "${probe_target}" | shasum -a 256 | awk '{print $1}')
+    fi
+    mkdir -p "${DOCKER_CONFIG}/contexts/meta/${probe_hash}"
+    cat >"${DOCKER_CONFIG}/contexts/meta/${probe_hash}/meta.json" <<EOF
+{"Name":"${probe_target}","Endpoints":{"docker":{"Host":"unix:///nonexistent/does-not-exist.sock","SkipTLSVerify":false}}}
+EOF
+    printf '{"currentContext":"%s"}\n' "${probe_target}" >"${DOCKER_CONFIG}/config.json"
+
     rdd set running=false
     rdd set running=true containerEngine.name=moby

Fix (production, optional): decide whether ""/"default" should ever be probed from inside a long-lived daemon. If the stale-env risk outweighs the "don't stomp on a healthy default daemon" goal, treat "default" as healthy and only probe named contexts. The current behavior is a reasonable middle ground in the common case — the service usually starts from the user's shell — but it deserves an explicit decision.


Suggestions

S1. Remove the now-dead getDockerContextHost helper Claude Gemini
		return err
	}
	return s.Remove(name)
}

// getDockerContextHost returns the full Docker host URL (e.g. "unix:///path/to/docker.sock"
// or "tcp://192.168.1.1:2376") for the named context's docker endpoint.
// Returns an empty string if the context does not exist or has no docker endpoint.
func getDockerContextHost(name string) (string, error) {
	s, err := newContextStore()
	if err != nil {
		return "", err
	}
	md, err := s.GetMetadata(name)
	if err != nil {
		if errdefs.IsNotFound(err) {
			return "", nil
		}
		return "", err
	}
	ep, err := docker.EndpointFromContext(md)
	if err != nil {
		// Missing or mistyped docker endpoint — treat as empty, not an error.
		// The configured EndpointTypeGetter normally prevents the mistyped case.
		return "", nil
	}
	return ep.Host, nil
}

// getCurrentDockerContext reads the currentContext field from ~/.docker/config.json.
// Returns an empty string if the file does not exist or no context is set.
// config.Load validates every "auths" entry; a malformed credential fails
// this call (and set/clearCurrentDockerContext) until the user repairs

After the probe-rewrite, engine_reconciler.go calls probeNamedDockerContext directly. A repo-wide grep shows getDockerContextHost now appears only in docker_context.go (the definition) and four spots in docker_context_test.go — the function survives solely to support its own unit tests.

Fix: delete getDockerContextHost; rewrite the three existing tests that use it (Test_createReplaceDockerContext, Test_deleteDockerContext, Test_getDockerContextHost_missing) to fetch the endpoint via newContextStore + docker.EndpointFromContext, or collapse them into a single probeNamedDockerContext unit test that exercises the same paths end-to-end.

S2. Log the silent return true bailouts in probeNamedDockerContext Claude

// probeNamedDockerContext pings the Docker endpoint for the
// named context within dockerContextProbeTimeout.
// For contexts with non-tcp/unix endpoints (e.g. SSH), or with TLS config we can't load,
// we assume healthy to avoid overriding the user's choice.
func probeNamedDockerContext(ctx context.Context, name string) bool {
	s, err := newContextStore()
	if err != nil {
		return true
	}
	md, err := s.GetMetadata(name)
	if err != nil {
		return !errdefs.IsNotFound(err)
	}
	epMeta, err := docker.EndpointFromContext(md)
	if err != nil {
		return true
	}
	scheme, _, _ := strings.Cut(epMeta.Host, "://")
	switch scheme {
	case "unix", "tcp":
	default:
		return true
	}
	ep, err := docker.WithTLSData(s, name, epMeta)
	if err != nil {
		return true
	}
	opts, err := ep.ClientOpts()
	if err != nil {
		return true
	}
	probeCtx, cancel := context.WithTimeout(ctx, dockerContextProbeTimeout)
	defer cancel()
	return pingDocker(probeCtx, opts...)
}

Four error paths plus the non-NotFound branch all return true without logging. The "assume healthy" posture is the right default — the function must not stomp on a context it cannot decode — but an operator debugging "why does RDD refuse to become the default context?" gets no signal. A single V(1) log per bailout preserves the hot-path silence while making diagnosis tractable.

Fix: take a logr.Logger (or pull it from the context via logf.FromContext) and log the underlying err and name at V(1) before each return true. The caller in manageDockerContext already carries a log := logf.FromContext(r.watcherCtx).WithName("docker-context"); thread it through or have the probe accept a logger.

S3. probeNamedDockerContext treats an undecodable Docker endpoint as healthy, broader than the comment promises Codex
func probeNamedDockerContext(ctx context.Context, name string) bool {
	s, err := newContextStore()
	if err != nil {
		return true
	}
	md, err := s.GetMetadata(name)
	if err != nil {
		return !errdefs.IsNotFound(err)
	}
	epMeta, err := docker.EndpointFromContext(md)
	if err != nil {
		return true
	}
	scheme, _, _ := strings.Cut(epMeta.Host, "://")
	switch scheme {
	case "unix", "tcp":
	default:
		return true
	}
	ep, err := docker.WithTLSData(s, name, epMeta)
	if err != nil {
		return true
	}

The comment on the function says the optimistic return covers "non-tcp/unix endpoints (e.g. SSH), or with TLS config we can't load." EndpointFromContext upstream (github.com/docker/cli/.../load.go:153-160) returns an error when the context has no Docker endpoint at all or when the endpoint metadata fails to decode. That broader case — a malformed or missing endpoint — is not SSH-or-TLS; it is a context that docker itself cannot use. Treating it as healthy prevents RDD from helping a user whose Docker config is actually broken.

The counter-argument is conservatism: a partially corrupted config may be mid-edit, and stomping on it is worse than leaving it alone. Either position is defensible; the point is that the current code is more permissive than its docstring claims.

Fix: pick a stance and align the code with the comment. Either narrow the return true to the specific non-tcp/unix + TLS-load error paths, or broaden the comment to "any error decoding the context is treated as healthy to avoid clobbering a partially edited config."

S4. Prefer BATS_FILE_TMPDIR over BATS_SUITE_TMPDIR for per-file isolation Claude

    # Isolate all Docker config reads and writes from the developer's real
    # ~/.docker directory. Set DOCKER_CONFIG before starting the service so
    # the controller process inherits it (service.Start uses exec.Command
    # without an explicit Env, so it inherits the caller's environment).
    export DOCKER_CONFIG="${BATS_SUITE_TMPDIR}/docker-config"
    mkdir -p "${DOCKER_CONFIG}"

    # Deliberately skip setup_rdd_control_plane here: `rdd set` internally
    # calls ensureServiceRunning, which is exactly the CLI path we want to
    # exercise — the engine controller is part of the default controller

Every other test file in the repo that needs a tmpdir uses BATS_FILE_TMPDIR (see bats/tests/34-containers-controllers/containers-mock.bats:20, bats/tests/31-rdd-controllers/crd-upgrade.bats:16). BATS_SUITE_TMPDIR is shared across every file in the suite, so any future test file that also touches DOCKER_CONFIG would collide.

    # external mock-controller on the shared Container/Image/Volume
    # mirrors in the rancher-desktop namespace.
    setup_rdd_control_plane "containers,-engine"
    echo "${RDD_LOG_DIR}/mock-controller.log" >&3
    "mock-controller${EXE}" &>"${RDD_LOG_DIR}/mock-controller.log" &
    echo "$!" >"${BATS_FILE_TMPDIR}/controller_pid"
}

local_teardown_file() {
    if [[ -f "${BATS_FILE_TMPDIR}/controller_pid" ]]; then
        read -r controller_pid <"${BATS_FILE_TMPDIR}/controller_pid"
    rdd svc delete
    rdd svc start
}

local_teardown_file() {
    if [[ -f "${BATS_FILE_TMPDIR}/controller_pid" ]]; then
        kill "$(cat "${BATS_FILE_TMPDIR}/controller_pid")" 2>/dev/null || true
    fi
    rdd svc delete
}

Fix:

-    export DOCKER_CONFIG="${BATS_SUITE_TMPDIR}/docker-config"
+    export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
S5. TODO(windows) comment lists a file that does not need Windows-specific work Claude

// manageDockerContext creates the instance Docker context and, in a goroutine,
// probes the user's current context; if it is absent or unhealthy, switches
// the default to the new context. At most one probe runs at a time.
func (r *EngineReconciler) manageDockerContext(socketPath string) {
	// TODO(windows): Docker context management uses unix:// sockets; Windows
	// requires npipe:// in the context meta file, an npipe-aware probe client,
	// and platform-specific path handling in getCurrentDockerContext and
	// createReplaceDockerContext. Track in a follow-up issue before shipping
	// Windows support.
	if runtime.GOOS == "windows" {
		return
	}
	contextName := instance.Name()
	log := logf.FromContext(r.watcherCtx).WithName("docker-context")

getCurrentDockerContext delegates to docker/cli's config.Load, which already handles USERPROFILE on Windows via os.UserHomeDir(). The Windows-specific work concentrates in createReplaceDockerContext (scheme must be npipe://, socket path formatting differs) and in the probe client. Dropping getCurrentDockerContext from the list keeps the TODO honest.

Fix:

-    // TODO(windows): Docker context management uses unix:// sockets; Windows
-    // requires npipe:// in the context meta file, an npipe-aware probe client,
-    // and platform-specific path handling in getCurrentDockerContext and
-    // createReplaceDockerContext. Track in a follow-up issue before shipping
-    // Windows support.
+    // TODO(windows): Docker context management uses unix:// sockets; Windows
+    // requires npipe:// in the context meta file written by
+    // createReplaceDockerContext, and an npipe-aware probe client. Track in
+    // a follow-up issue before shipping Windows support.

Design Observations

Concerns

Probing "default" couples a durable control-plane decision to ephemeral shell state (in-scope) Codex

Named contexts are on-disk state the daemon can read deterministically; "default" is partly process environment (DOCKER_HOST, DOCKER_CERT_PATH, DOCKER_TLS_VERIFY). The current design probes both from a long-lived reconciler, but the reconciler only sees the environment it inherited at spawn. Moving the "default" decision into the CLI/startup path — or persisting the effective Docker env alongside the App request — would eliminate the entire stale-environment class of bugs, including the test flake in I1.

Strengths

Docker CLI primitives are the right abstraction (in-scope) Claude Codex Gemini

ContextStore, WithTLSData, and EndpointMeta.ClientOpts() already know how to assemble an HTTP client for every endpoint shape the docker CLI supports (unix, tcp, tcp+TLS, mTLS, SkipTLSVerify). Delegating to them retires four false-override classes enumerated in the PR description and keeps RDD aligned with Docker's own context format going forward.

Two-function split makes the "did we load TLS?" question unmissable (in-scope) Claude

The old single probeDockerContext(ctx, host) silently called dockerclient.WithHost(host) even when TLS was required. Splitting into probeDockerContext(ctx) (FromEnv-backed default probe) and probeNamedDockerContext(ctx, name) (store-backed named probe) makes the intent at each call site obvious and forecloses the old silent-WithHost failure mode.


Testing Assessment

Unit tests in docker_context_test.go cover the context-store read/write surface; the BATS suite exercises the reconciler end-to-end against a real Docker daemon inside Lima. Residual gaps, ranked by risk:

  1. No unit coverage for the TCP+TLS, mTLS, or SkipTLSVerify probe paths — the whole motivation for the PR. A fake context store seeded with TLS data, probed against an unreachable TLS endpoint, would verify the TLS config is actually constructed (not silently dropped as in the old WithHost(host) call).
  2. No unit coverage for the SSH/npipe short-circuit at lines 191-196. A table-driven test that seeds a context with Host: "ssh://u@h" and asserts probeNamedDockerContext returns true without attempting a ping would pin the new behavior.
  3. The "default probed via FromEnv" behavior change has no dedicated unit test. Indirect coverage via I1's BATS case is too fragile. Injecting DOCKER_HOST and asserting probeDockerContext returns false for an unreachable default would lock the semantic in.
  4. No unit test for the early-return on getCurrentDockerContext error. Test_getCurrentDockerContext_malformedAuth verifies the error surfaces from the helper, but not that the reconciler refuses to call setCurrentDockerContext when the config is malformed.
  5. No regression test exercises the EndpointFromContext failure branch (see S3). A context seeded with a missing or malformed Docker endpoint would guard the "healthy vs unhealthy" decision whichever way S3 lands.

Documentation Assessment

The new docstrings on pingDocker, probeDockerContext, and probeNamedDockerContext are accurate and concise. The probeNamedDockerContext comment overstates its scope slightly (see S3). TODO(windows) comments are helpful but the scope list in one of them is inaccurate (see S5).


Commit Structure

Single commit, scope matches the title. The commit message enumerates four independent improvements (TCP+TLS, mTLS, SkipTLSVerify, SSH) that together form a coherent "probe safety" unit. Test isolation via DOCKER_CONFIG is bundled into the same commit, which is defensible: the new probe paths rely on a predictable context store, and pinning the store location in tests keeps both changes verifiable together.


Acknowledged Limitations


Agent Performance Retro

Claude

Produced the most thorough pass: caught the exact failing BATS test by name, traced the failure through service.Start's missing cmd.Env into dockerclient.FromEnv's platform-default fallback, and supplied the cleanest fix (seed an unreachable named context) instead of the more invasive option of rewriting service env plumbing. Unique insights: the silent return true bailouts (S2), the BATS_FILE_TMPDIR convention mismatch (S3), and the TODO comment that lists getCurrentDockerContext despite that helper not needing Windows work (S5). Coverage Summary was complete and cross-references finding IDs cleanly.

Codex

Framed the I1 finding as a production-semantics concern (stale daemon env) rather than a test bug, which turned out to be the complementary half of the same issue — once the test is fixed, the production concern still stands. Raised S3 as the only agent to notice that EndpointFromContext errors extend beyond the SSH/TLS cases the docstring claims. Missed the unused getDockerContextHost helper that both other agents caught. Ran six test commands to validate the passing unit-test suite, which added confidence at modest cost.

Gemini

Caught I1 (test flakiness) and S1 (unused helper) in parallel with the other agents. Its I2 (WaitGroup misuse race) does not hold up under inspection: manageDockerContext calls Add(1) inside startWatcherAndSync, which holds watcherMu throughout. The shutdown hook's stopWatcher also takes watcherMu before calling removeDockerContextWait(), so the two cannot interleave. Reconcile's own direct removeDockerContext at line 155 only reaches the else-branch where no watcher is running, so no concurrent Add(1) is in flight. Reconciled out below. Gemini skipped git blame (known daily quota behavior); Claude and Codex covered that gap.

Summary

Claude Opus 4.7Codex GPT 5.4Gemini 3.1 Pro
Duration13m 56s6m 07s4m 25s
Findings1I 4S1I 1S1I 1S
Tool calls67 (Bash 37, Read 16, Grep 14)39 (shell 37, stdin 2)9 (runshellcommand 4, readfile 3, grepsearch 2)
Design observations222
False positives001
Unique insights320
Files reviewed333
Coverage misses000
Totals1I 4S1I 1S1I 1S
Downgraded000
Dropped001

Reconciliation: Gemini P1 I2 (manageDockerContext WaitGroup race) → rejected as false positive (Add(1) is called inside startWatcherAndSync while holding watcherMu; stopWatcher blocks on watcherMu before calling Wait(), preventing interleaving). The finding does not appear in the consolidated report.

Claude, Codex, and Gemini all contributed to I1. The consolidated finding merges the three framings: Claude and Gemini framed it as a BATS test bug; Codex framed it as a production-semantics gap. Both halves are real and point at the same underlying issue (daemon env vs user env).


Review Process Notes

Skill improvements

Repo context updates