Deep Review: 20260428-102700-pr-333

Date	2026-04-28 10:27
Repo	rancher-sandbox/rancher-desktop-daemon
Round	3 (of target)
Author	@Nino-K
PR	#333 — Docker Engine: improve Docker context probe safety and isolation
Commits	`751d6e5` Docker Engine: improve Docker context probe safety and isolation
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — round 2's `DOCKER_CONTEXT` gap is closed by an env-skip in `manageDockerContext`, but the BATS suite does not isolate the daemon's inherited env from the dev's parent shell, so the round-2 test fix becomes flaky on hosts that export `DOCKER_HOST` or `DOCKER_CONTEXT`. Plus several round-2 housekeeping items remain unaddressed.
Wall-clock time	`25 min 42 s`

Executive Summary ¶

Round 3 closes the round-2 DOCKER_CONTEXT false-override path by skipping the entire probe-and-set when the daemon process inherits either DOCKER_HOST or DOCKER_CONTEXT (engine_reconciler.go:372-381), and adds a four-case unit suite for probeNamedDockerContext covering missing, SSH, unreachable unix, and unreachable tcp endpoints. The author's choice to short-circuit on either env var (rather than the narrower "honor DOCKER_CONTEXT as the named context") is a defensible conservative read: the daemon cannot tell whether the inherited env reflects the user's global preference, so it declines to write config.json at all when in doubt.

The round-3 changes introduce one regression and surface several housekeeping items the prior rounds left open:

The new env-skip silently disables the round-2 BATS test fix on dev machines. local_setup_file does not unset DOCKER_HOST / DOCKER_CONTEXT before spawning the service, and service.Start passes no explicit cmd.Env, so the daemon inherits whatever the parent shell happened to have set. A developer who runs BATS from a shell with DOCKER_CONTEXT=staging exported (a common case for users with multiple Docker contexts) sees the new test fail because the env-skip bails out before the seeded named context is probed. CI runners typically lack these vars, so this surfaces only on local runs.
Round-2 housekeeping items remain. The manageDockerContext function-level docstring and docs/design/bootstrap.md:58,86 still describe the pre-refactor "absent or unhealthy → promote ours" contract; the round-2 S6 TODO(windows) wording still names getCurrentDockerContext despite that helper not needing Windows work; and the docker_context_dir BATS helper docstring still references ~/.docker/contexts/... even though the body returns ${DOCKER_CONFIG}/contexts/....
A pre-existing WaitGroup race remains in manageDockerContext / removeDockerContext. Gemini's round-1 finding was dismissed in the round-1 retro on incorrect reasoning (the retro claimed stopWatcher holds watcherMu across the Wait() call; in fact stopWatcher releases watcherMu before invoking removeDockerContext, so the Add(1) in manageDockerContext and the Wait() in removeDockerContext are not serialized by watcherMu). The race is narrow — it requires an in-flight Reconcile to call manageDockerContext after the shutdown hook released watcherMu — but it is real and out-of-scope for this PR.

Report structure: 0 Critical, 1 Important, 9 Suggestions, 2 Design Concerns, 4 Design Strengths.

Critical Issues ¶

None.

Important Issues ¶

I1. local_setup_file does not isolate the daemon from DOCKER_HOST / DOCKER_CONTEXT in the parent shell Claude¶

bats/tests/32-app-controllers/engine-docker.bats

# Engine controller tests — verify that the engine controller mirrors Docker
# containers, images, and volumes into Container, Image, and Volume
# resources, and that deletions and action annotations on Container
# resources are forwarded to Docker.

local_setup_file() {
    # The Docker socket access pattern used by these tests is not yet wired
    # up for Windows/WSL2.
    skip_on_windows

    # Isolate all Docker config reads and writes from the developer's real
    # ~/.docker directory. Set DOCKER_CONFIG before starting the service so
    # the controller process inherits it (service.Start uses exec.Command
    # without an explicit Env, so it inherits the caller's environment).
    export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
    mkdir -p "${DOCKER_CONFIG}"

    # Deliberately skip setup_rdd_control_plane here: `rdd set` internally
    # calls ensureServiceRunning, which is exactly the CLI path we want to
    # exercise — the engine controller is part of the default controller
    # set, so no explicit --controllers selection is needed.
    rdd svc delete
    rdd set running=true
    run -0 rdd svc paths docker_socket
    export DOCKER_HOST="unix://${output}"
    # Mirror resources live in App.spec.namespace. Override RDD_NAMESPACE
    # to whatever the App was created with so the test queries the same
    # namespace the engine controller uses, regardless of CRD defaults.
    RDD_NAMESPACE=$(rdd ctl get app app -o jsonpath='{.spec.namespace}')
    export RDD_NAMESPACE
}

# Make a WebSocket request, with the endpoint given as the first argument in the
# same form as `kubectl get --raw`; the output is captured in `${output}`.
do_websocket() { # endpoint
    local endpoint=$1

pkg/service/cmd/service.go spawns the daemon via exec.CommandContext(...) with no explicit cmd.Env, so the daemon inherits the parent shell's environment verbatim. The new skip in engine_reconciler.go:379-381 then short-circuits the entire probe-and-set goroutine when the daemon's process env carries DOCKER_HOST or DOCKER_CONTEXT:

pkg/controllers/app/engine/controllers/engine_reconciler.go

		// vars: the daemon inherits them frozen from the shell that ran
		// rdd svc start and they may not reflect the user's global
		// preference. Rewriting config.json based on them could clobber a
		// correct currentContext for sessions that don't have those vars
		// set. Skip the probe-and-set entirely when either is present.
		if os.Getenv("DOCKER_HOST") != "" || os.Getenv("DOCKER_CONTEXT") != "" {
			return
		}

		var healthy bool
		if current != "" && current != "default" {
			healthy = probeNamedDockerContext(probeCtx, current)
		} else {

if os.Getenv("DOCKER_HOST") != "" || os.Getenv("DOCKER_CONTEXT") != "" {
    return
}

The round-2 test fix moby engine sets currentContext when no healthy context exists (engine-docker.bats:779-809) seeds currentContext: test-unreachable, restarts the watcher, and asserts the reconciler promotes rancher-desktop-${RDD_INSTANCE}. A developer who ran bats ... from a shell that exported either env var (a common case for anyone using a remote/secondary Docker context) trips the skip; currentContext stays at test-unreachable; the try --max 6 --delay 1 -- ... loop times out and the test fails. CI runners typically lack these vars, so this surfaces only on local runs.

bats/tests/32-app-controllers/engine-docker.bats

    local socket_path=${output}
    run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
    assert_output "unix://${socket_path}"
}

@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"

    # Seed a named context pointing at an unreachable socket and make it
    # current. This forces the reconciler into the probeNamedDockerContext
    # branch with a guaranteed-failing endpoint, independent of whether the
    # host has a running Docker daemon at the default socket (e.g.
    # ubuntu-latest runners ship Docker pre-installed). Using a named context
    # also avoids the stale-env issue: the service process's DOCKER_HOST is
    # frozen at first start, so FromEnv-based probing of "" / "default" is
    # host-dependent and unreliable from inside a long-lived daemon.
    probe_target="test-unreachable"
    run_e -0 docker_context_dir "${probe_target}"
    probe_dir="${output}"
    mkdir -p "${probe_dir}"
    cat >"${probe_dir}/meta.json" <<EOF
{"Name":"${probe_target}","Endpoints":{"docker":{"Host":"unix:///nonexistent/does-not-exist.sock","SkipTLSVerify":false}}}
EOF
    printf '{"currentContext":"%s"}\n' "${probe_target}" >"${DOCKER_CONFIG}/config.json"

    rdd set running=false
    rdd set running=true containerEngine.name=moby
    rdd ctl wait --for=condition=ContainerEngineReady app/app --timeout=30s

    # The probe goroutine runs asynchronously; give it a moment.
    try --max 6 --delay 1 -- \
        bash -c "jq -r '.currentContext' '${DOCKER_CONFIG}/config.json' | grep -qx '${context_name}'"

    run -0 jq -r '.currentContext' "${DOCKER_CONFIG}/config.json"
    assert_output "${context_name}"
}

@test "stopping moby engine removes Docker context and clears currentContext" {
    rdd set running=false

    local context_name="rancher-desktop-${RDD_INSTANCE}"

The in-test export DOCKER_HOST="unix://${output}" at line 31 fires after rdd set running=true on line 29, so it does not reach the already-spawned daemon — only the BATS shell's child docker CLIs see it. The daemon's env is whatever was inherited at spawn time.

Fix: clear the env vars before spawning the daemon.

 local_setup_file() {
     skip_on_windows
     export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
     mkdir -p "${DOCKER_CONFIG}"
+    # The daemon inherits these env vars at spawn (service.Start does not
+    # set cmd.Env). Clear them so manageDockerContext exercises the
+    # probe-and-set path regardless of the parent shell.
+    unset DOCKER_HOST DOCKER_CONTEXT
     rdd svc delete
     rdd set running=true

(important, regression)

Suggestions ¶

S1. Comment explaining the named-context detour misstates why it works Claude¶

bats/tests/32-app-controllers/engine-docker.bats

}

@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"

    # Seed a named context pointing at an unreachable socket and make it
    # current. This forces the reconciler into the probeNamedDockerContext
    # branch with a guaranteed-failing endpoint, independent of whether the
    # host has a running Docker daemon at the default socket (e.g.
    # ubuntu-latest runners ship Docker pre-installed). Using a named context
    # also avoids the stale-env issue: the service process's DOCKER_HOST is
    # frozen at first start, so FromEnv-based probing of "" / "default" is
    # host-dependent and unreliable from inside a long-lived daemon.
    probe_target="test-unreachable"
    run_e -0 docker_context_dir "${probe_target}"
    probe_dir="${output}"
    mkdir -p "${probe_dir}"
    cat >"${probe_dir}/meta.json" <<EOF

The "Using a named context also avoids the stale-env issue" claim does not hold once round 3's env-skip lands. The skip at engine_reconciler.go:379-381 runs before the named-context probe, so a frozen DOCKER_HOST in the daemon's env disables both branches equally — the named-context detour buys nothing on that axis. The detour only escapes the FromEnv flake when the daemon inherits no env vars at all. Pair this rewrite with the I1 unset so the comment describes the actual contract.

pkg/controllers/app/engine/controllers/engine_reconciler.go

		// vars: the daemon inherits them frozen from the shell that ran
		// rdd svc start and they may not reflect the user's global
		// preference. Rewriting config.json based on them could clobber a
		// correct currentContext for sessions that don't have those vars
		// set. Skip the probe-and-set entirely when either is present.
		if os.Getenv("DOCKER_HOST") != "" || os.Getenv("DOCKER_CONTEXT") != "" {
			return
		}

		var healthy bool
		if current != "" && current != "default" {
			healthy = probeNamedDockerContext(probeCtx, current)
		} else {

-    # Using a named context
-    # also avoids the stale-env issue: the service process's DOCKER_HOST is
-    # frozen at first start, so FromEnv-based probing of "" / "default" is
-    # host-dependent and unreliable from inside a long-lived daemon.
+    # Both probe branches (named and FromEnv) require the daemon's env to
+    # have neither DOCKER_HOST nor DOCKER_CONTEXT set, otherwise
+    # manageDockerContext skips the probe entirely. local_setup_file
+    # unsets both before spawning the daemon.

(suggestion, gap)

S2. probeDockerContext docstring overstates what the function reaches in production Claude¶

pkg/controllers/app/engine/controllers/docker_context.go

	defer cli.Close()
	_, err = cli.Ping(ctx, dockerclient.PingOptions{})
	return err == nil
}

// probeDockerContext pings the implicit default Docker endpoint
// (DOCKER_HOST or the platform default socket) within dockerContextProbeTimeout.
// Used when currentContext is "" or "default".
func probeDockerContext(ctx context.Context) bool {
	probeCtx, cancel := context.WithTimeout(ctx, dockerContextProbeTimeout)
	defer cancel()
	return pingDocker(probeCtx, dockerclient.FromEnv)
}

// probeNamedDockerContext pings the Docker endpoint for the named context
// within dockerContextProbeTimeout and returns true if the endpoint is healthy.
//
// When we cannot determine whether the context is healthy — because the store

By the time probeDockerContext(probeCtx) runs at engine_reconciler.go:387, the goroutine has already returned if either DOCKER_HOST or DOCKER_CONTEXT was set (lines 379-381). So FromEnv's WithHostFromEnv only ever resolves to the platform default socket — DOCKER_HOST is empty by construction here. The docstring's "DOCKER_HOST or the platform default socket" phrasing is reachable from unit tests but never on the production path. Tighten the comment so a future refactor that removes the env-skip does not silently change probe semantics.

pkg/controllers/app/engine/controllers/engine_reconciler.go


		var healthy bool
		if current != "" && current != "default" {
			healthy = probeNamedDockerContext(probeCtx, current)
		} else {
			healthy = probeDockerContext(probeCtx)
		}
		// Guard against writing currentContext after removeDockerContext has
		// already cancelled probeCtx and deleted the context directory.
		if !healthy && probeCtx.Err() == nil {
			if err := setCurrentDockerContext(contextName); err != nil {

 // probeDockerContext pings the implicit default Docker endpoint
-// (DOCKER_HOST or the platform default socket) within dockerContextProbeTimeout.
-// Used when currentContext is "" or "default".
+// (the platform default socket) within dockerContextProbeTimeout. Used
+// when currentContext is "" or "default". The caller short-circuits when
+// DOCKER_HOST or DOCKER_CONTEXT is set, so FromEnv's host-from-env path
+// is unreachable on the production path and the probe always targets
+// the platform default socket.
 func probeDockerContext(ctx context.Context) bool {

(suggestion, documentation)

S3. probeNamedDockerContext lacks coverage for the four "assume healthy" library-error branches Claude¶

The new Test_probeNamedDockerContext covers four common outcomes (missing, SSH, unreachable unix, unreachable tcp), but the four "assume healthy" branches at docker_context.go:172-174, 178-180, 188-191, 193-196 (metadata read error, decode error, TLS load error, ClientOpts error) and the TCP+TLS / mTLS / SkipTLSVerify paths motivating the PR have no unit anchor. The malformed-meta case is the cheapest to lock in: write a meta.json with an unparseable endpoint payload and assert the function returns true.

pkg/controllers/app/engine/controllers/docker_context_test.go

		_, hasAuths := cfg["auths"]
		assert.Assert(t, hasAuths, "auths key must be preserved across clear")
	})
}

func Test_probeNamedDockerContext(t *testing.T) {
	// seedContext writes a named docker context with the given Host into the
	// store. It uses the same store config as production code so the context
	// is readable by probeNamedDockerContext.
	seedContext := func(t *testing.T, name, host string) {
		t.Helper()
		s, err := newContextStore()
		assert.NilError(t, err)
		assert.NilError(t, s.CreateOrUpdate(store.Metadata{
			Name:     name,
			Metadata: map[string]any{},
			Endpoints: map[string]any{
				docker.DockerEndpoint: docker.EndpointMeta{Host: host},
			},
		}))
	}

	t.Run("missing context returns false", func(t *testing.T) {
		newDockerTestDir(t)
		result := probeNamedDockerContext(context.Background(), "does-not-exist")
		assert.Assert(t, !result, "missing context must be treated as unhealthy")
	})

	t.Run("ssh context returns true (non-probed scheme)", func(t *testing.T) {
		newDockerTestDir(t)
		seedContext(t, "ssh-ctx", "ssh://user@remote-host")
		result := probeNamedDockerContext(context.Background(), "ssh-ctx")
		assert.Assert(t, result, "ssh context must be assumed healthy to avoid clobbering user's choice")
	})

	t.Run("unix context with unreachable socket returns false", func(t *testing.T) {
		newDockerTestDir(t)
		seedContext(t, "local-ctx", "unix:///nonexistent/does-not-exist.sock")
		result := probeNamedDockerContext(context.Background(), "local-ctx")
		assert.Assert(t, !result, "unreachable unix socket must be treated as unhealthy")
	})

	t.Run("tcp context with unreachable endpoint returns false", func(t *testing.T) {
		newDockerTestDir(t)
		// Port 1 is reserved and will be refused on any sane host.
		seedContext(t, "tcp-ctx", "tcp://127.0.0.1:1")
		result := probeNamedDockerContext(context.Background(), "tcp-ctx")
		assert.Assert(t, !result, "unreachable tcp endpoint must be treated as unhealthy")
	})
}

A fake context store seeded with TLS data plus an unreachable TLS endpoint would pin the TLS construction without needing a live TLS daemon and close the largest residual gap from round 2 S2.

(suggestion, gap)

S4. The env-skip and the empty/default probe routing have no unit test Claude¶

The four-line env-var skip and the two-branch probe selector are entirely behavioral additions to the goroutine in manageDockerContext and have no direct test. The BATS path covers a subset (named-unhealthy promotes ours), but no test asserts:

setCurrentDockerContext is never called when DOCKER_HOST is set;
setCurrentDockerContext is never called when DOCKER_CONTEXT is set;
current == "default" routes through probeDockerContext (FromEnv), not the named branch.

A small refactor that extracts the goroutine body into a testable helper (taking getEnv func(string) string and the probe functions as parameters) would make all three reachable from _test.go. Worth doing because the env-skip is the headline behavior of this PR — silent regression of it would not be caught.

(suggestion, gap)

S5. getCurrentDockerContext early-return comment misses the recovery path Claude¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

			}
			r.contextMu.Unlock()
			cancel()
		}()

		current, err := getCurrentDockerContext()
		if err != nil {
			// An error here means the config file exists but could not be
			// read or parsed. Do not proceed — writing currentContext now
			// would overwrite a file we cannot safely round-trip.
			log.Error(err, "Failed to read current Docker context")
			return
		}

		// DOCKER_HOST and DOCKER_CONTEXT take precedence over config.json
		// in the Docker CLI's resolution order. Both are shell-local env
		// vars: the daemon inherits them frozen from the shell that ran
		// rdd svc start and they may not reflect the user's global

The early return is the safe choice — setCurrentDockerContext would have failed on the same config.Load call in the old flow. A future reader could mistake the early return for "RDD ignores broken config.json forever"; in fact the next reconcile (e.g., another rdd set) re-enters manageDockerContext, so repairing the file and triggering any reconcile is enough to recover.

     // An error here means the config file exists but could not be
     // read or parsed. Do not proceed — writing currentContext now
-    // would overwrite a file we cannot safely round-trip.
+    // would overwrite a file we cannot safely round-trip. The next
+    // reconcile (e.g. another `rdd set`) re-enters this function, so
+    // repairing config.json restores the probe-and-set behavior.

(suggestion, documentation)

S6. Use t.Context() in new tests Gemini¶

pkg/controllers/app/engine/controllers/docker_context_test.go

		}))
	}

	t.Run("missing context returns false", func(t *testing.T) {
		newDockerTestDir(t)
		result := probeNamedDockerContext(context.Background(), "does-not-exist")
		assert.Assert(t, !result, "missing context must be treated as unhealthy")
	})

	t.Run("ssh context returns true (non-probed scheme)", func(t *testing.T) {
		newDockerTestDir(t)

Go 1.25 (go.mod) gives every *testing.T a per-test Context() that cancels when the test finishes. Switch the four context.Background() callers in Test_probeNamedDockerContext so test timeouts and cancellations propagate into the probe.

-result := probeNamedDockerContext(context.Background(), "does-not-exist")
+result := probeNamedDockerContext(t.Context(), "does-not-exist")

(suggestion, enhancement)

S7. manageDockerContext docstring and docs/design/bootstrap.md still describe the pre-refactor contract¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

		e.stop()
	}
	r.removeDockerContext()
}

// manageDockerContext creates the instance Docker context and, in a goroutine,
// probes the user's current context; if it is absent or unhealthy, switches
// the default to the new context. At most one probe runs at a time.
func (r *EngineReconciler) manageDockerContext(socketPath string) {
	// TODO(windows): Docker context management uses unix:// sockets; Windows
	// requires npipe:// in the context meta file, an npipe-aware probe client,
	// and platform-specific path handling in getCurrentDockerContext and
	// createReplaceDockerContext. Track in a follow-up issue before shipping

After round 2 the empty / "default" branch routes through probeDockerContext (FromEnv) and treats a foreign default daemon as healthy; round 3 adds a further env-skip for DOCKER_HOST / DOCKER_CONTEXT. Neither the docstring nor docs/design/bootstrap.md:58,86 reflects either change. Round 2 S3 flagged the same drift and was not picked up.

docs/design/bootstrap.md


    * The `LimaVM` controller creates the VM. Since the status is set to `running` it also starts the VM. Once the VM has completed startup the controller sets the status to "ready".

  * The `App` controller is notified of the status change of the `LimaVM` object and detects that the VM is running. It now monitors the VM to check when the `docker.sock` is accepting requests.

  * It creates a `rancher-desktop-2` docker context. If there is no current context, or if the context isn't accepting connections, then it makes the new context the default.

  * It sets the status of the `App` object to `ready`.

* `rdd start` is notified that the `App` status has changed. It verifies that the status is "ready" and exits.

docs/design/bootstrap.md:58
   "It creates a `rancher-desktop-2` docker context. If there is no
   current context, or if the context isn't accepting connections, then
   it makes the new context the default."

docs/design/bootstrap.md:86
   "`rdd start` has created the `rancher-desktop-2` docker context, but
   it will only activate it if the current context doesn't exist, or is
   non-functional."

Fix: rewrite the docstring and both bullets to spell out the new contract — RDD probes the current named context, falls back to probing the implicit default endpoint when none is set, and skips the entire promotion when DOCKER_HOST or DOCKER_CONTEXT is set in the daemon's env.

(suggestion, documentation, round 2 S3 unaddressed)

S8. TODO(windows) comment overstates what needs platform-specific handling¶

pkg/controllers/app/engine/controllers/engine_reconciler.go


// manageDockerContext creates the instance Docker context and, in a goroutine,
// probes the user's current context; if it is absent or unhealthy, switches
// the default to the new context. At most one probe runs at a time.
func (r *EngineReconciler) manageDockerContext(socketPath string) {
	// TODO(windows): Docker context management uses unix:// sockets; Windows
	// requires npipe:// in the context meta file, an npipe-aware probe client,
	// and platform-specific path handling in getCurrentDockerContext and
	// createReplaceDockerContext. Track in a follow-up issue before shipping
	// Windows support.
	if runtime.GOOS == "windows" {
		return
	}
	contextName := instance.Name()
	log := logf.FromContext(r.watcherCtx).WithName("docker-context")

getCurrentDockerContext delegates to docker/cli's config.Load, and createReplaceDockerContext plus newContextStore delegate to the store's path handling — both resolve %USERPROFILE%\.docker on Windows through os.UserHomeDir() with no extra work. The Windows-specific work concentrates in one line of createReplaceDockerContext (the hard-coded "unix://" prefix) and in the probe client selection. Round 1 S5 and round 2 S6 flagged this and were not picked up.

-    // TODO(windows): Docker context management uses unix:// sockets; Windows
-    // requires npipe:// in the context meta file, an npipe-aware probe client,
-    // and platform-specific path handling in getCurrentDockerContext and
-    // createReplaceDockerContext. Track in a follow-up issue before shipping
-    // Windows support.
+    // TODO(windows): createReplaceDockerContext hard-codes "unix://"; Windows
+    // needs "npipe://" plus an npipe-aware probe client in
+    // probeDockerContext / probeNamedDockerContext. Track in a follow-up
+    // issue before shipping Windows support.

(suggestion, documentation, round 2 S6 / round 1 S5 unaddressed)

S9. Stale path reference in docker_context_dir helper docstring¶

bats/tests/32-app-controllers/engine-docker.bats

    refute_output
}

# --- Docker context management ---

# docker_context_dir returns the ~/.docker/contexts/meta/<hash> directory for
# the named context. Docker derives the sub-directory from sha256(name).
docker_context_dir() {
    local name="$1"
    local hash
    # sha256sum on Linux, shasum on macOS
    if command -v sha256sum &>/dev/null; then

The body returns ${DOCKER_CONFIG}/contexts/meta/${hash}, but the docstring still says ~/.docker/contexts/meta/<hash>. The two diverged when the round-2 fix isolated tests via DOCKER_CONFIG, and the docstring still has not been updated.

-# docker_context_dir returns the ~/.docker/contexts/meta/<hash> directory for
-# the named context. Docker derives the sub-directory from sha256(name).
+# docker_context_dir returns the contexts/meta/<hash> directory under
+# DOCKER_CONFIG for the named context. Docker derives the sub-directory
+# from sha256(name).

(suggestion, documentation, round 2 S8 unaddressed)

Design Observations ¶

Concerns ¶

Add(1) and Wait() on contextProbeWg are not serialized by an outer lock (in-scope, pre-existing) Gemini

manageDockerContext calls r.contextProbeWg.Add(1) at engine_reconciler.go:350 after releasing contextMu; removeDockerContext calls r.contextProbeWg.Wait() at line 420 after releasing contextMu. The two are not on the same mutex, and stopWatcher releases watcherMu before invoking removeDockerContext (line 311 vs line 316), so watcherMu does not bridge them either. Round 1 dismissed Gemini's identical finding on the rationale that "stopWatcher takes watcherMu before calling Wait()" — that is incorrect; the lock is released first.

The exploit window is narrow: it needs an in-flight Reconcile to acquire watcherMu after the shutdown hook released it from stopWatcher, then succeed in newDockerWatcher(r.watcherCtx, ...) despite watcherCtx already being cancelled, then reach Add(1) while the shutdown hook's removeDockerContext is at Wait() with the counter at 0. If this lands, the WaitGroup panics with "Add called concurrently with Wait" and the daemon crashes during graceful shutdown rather than cleaning up.

This is pre-existing — predates PR #333 — so it does not block this PR. Mention it here so the round-1 dismissal does not bury the issue. The right fix is to swap the shared WaitGroup for a per-probe state struct holding both the cancel func and a per-probe done channel, both swappable atomically under contextMu.

Daemon's frozen env is a long-lived property masquerading as a per-invocation choice (future) Claude

The env-skip uses the daemon process's own environment — whatever shell happened to spawn rdd svc start minutes or hours earlier. A developer who once started rdd from a shell exporting DOCKER_CONTEXT=staging permanently disables RDD's context promotion on that machine — even from new shells without those vars — until they rdd svc delete && rdd svc start from a clean shell. The author's PR comment acknowledges this is conservative; a follow-up could either (a) document the gotcha in user-facing docs, (b) re-resolve the env at probe time from a stable source (e.g., the user's ~/.docker/config.json environment block — Docker CLI honors this), or (c) emit a one-shot info log on startup recording which env vars are inherited so the daemon's behavior stays observable.

Strengths ¶

DOCKER_CONTEXT env-skip closes the round-2 false-override path (in-scope) Codex Gemini

The author chose the most conservative reading of round-2 I1: rather than honor DOCKER_CONTEXT as a named-context selector (which would still write config.json based on shell-local state), the daemon now declines to write config.json at all when either DOCKER_HOST or DOCKER_CONTEXT is set in its env. This eliminates the "user unsets DOCKER_CONTEXT and finds RDD's context as their new default" surprise that round 2 enumerated.

Conservative "assume healthy" stance with explicit logging at every fall-through (in-scope) Claude Gemini

probeNamedDockerContext returns true on every error path that cannot decide healthiness — store unreadable, metadata missing, endpoint malformed, scheme not tcp/unix, TLS load failed, ClientOpts failed — and logs each at default level (round 2 S7 was picked up). For a tool that mutates user config, this is the right default: refuse to clobber a config the controller cannot evaluate.

Test_probeNamedDockerContext covers the four common probe outcomes (in-scope) Codex

The new four-case suite (missing context → false, SSH scheme → true, unreachable unix → false, unreachable tcp → false) anchors the contract for the most common shapes a real user config produces. Round 1 S2 and round 2 S2 raised this gap; the round-3 addition closes most of it (S3 above lists the residual library-error branches).

Two-function probe split makes the transport decision explicit (in-scope) Claude Codex

Splitting the old single-signature probeDockerContext(ctx, host) into probeDockerContext(ctx) (FromEnv-backed default probe) and probeNamedDockerContext(ctx, name) (store-backed named probe) makes the "did we load TLS for this?" question unmissable at each call site. The old signature silently called WithHost(host) even when TLS data was required; the new signatures foreclose that class entirely.

Testing Assessment ¶

Coverage gaps, ranked by risk:

Daemon env contamination from the parent shell — the round-2 BATS test fix flakes on dev machines where DOCKER_HOST or DOCKER_CONTEXT is exported (per I1).
DOCKER_HOST / DOCKER_CONTEXT env-skip path — the headline behavior of this PR, untested at any layer (per S4).
probeNamedDockerContext "assume healthy" library-error branches — TLS load failure, ClientOpts failure, malformed meta (per S3).
TCP+TLS / mTLS / SkipTLSVerify probe paths — the original motivation for the PR has no unit anchor (per S3, residual round 2 S2).
current == "default" routing — neither unit nor BATS tests assert that this routes through probeDockerContext, not probeNamedDockerContext (Docker treats "default" as a magic non-name).
Early return on getCurrentDockerContext error — Test_getCurrentDockerContext_malformedAuth verifies the helper surfaces the error, but no test confirms the reconciler refuses to call setCurrentDockerContext when the config is malformed.

Codex ran go test ./pkg/controllers/app/engine/controllers and go test -race ./pkg/controllers/app/engine/controllers during its review and reported both passed.

Documentation Assessment ¶

The manageDockerContext docstring at engine_reconciler.go:319-321 still describes the pre-refactor "absent or unhealthy → promote ours" contract (S7).
docs/design/bootstrap.md:58,86 carries the same stale contract (S7).
The probeDockerContext docstring at docker_context.go:143-145 overstates what the function reaches in production (S2).
The probeNamedDockerContext docstring at docker_context.go:152-159 accurately documents the conservative policy and now lists "endpoint metadata missing or malformed" explicitly — a round-1 S3 follow-up that landed cleanly.
The getCurrentDockerContext early-return comment misses the recovery path (S5).
TODO(windows) at engine_reconciler.go:323-327 overstates the work (S8, round 2 S6 / round 1 S5 unaddressed).
The docker_context_dir BATS helper docstring still references ~/.docker/contexts/... (S9, round 2 S8 unaddressed).
The BATS comment explaining the named-context detour misstates why it works (S1).

Commit Structure ¶

Single commit (751d6e5), self-contained, message accurately describes the change. The bundling — env-skip, unit-test additions, BATS env-isolation choice — is internally consistent. No objection.

Acknowledged Limitations ¶

Windows Docker context management is still deferred. Both manageDockerContext and removeDockerContext early-return on runtime.GOOS == "windows". The TODO calls this out explicitly; see S8 for the wording.
SSH/npipe endpoints are trusted implicitly. probeNamedDockerContext returns true for any scheme that is not unix or tcp. The docstring acknowledges the tradeoff: the controller cannot authoritatively probe these endpoints, so it accepts the risk of a dead user-configured context in exchange for not overriding the user's legitimate choice.
config.json updates are bypassed when DOCKER_HOST or DOCKER_CONTEXT is set in the daemon's env. Documented in the inline comment at engine_reconciler.go:372-378 and consistent with the author's PR-comment rationale: shell-local env vars do not necessarily reflect the user's global preference, so the conservative choice is to leave config.json alone.

Unresolved Feedback ¶

mook-as inline (config.json vs DOCKER_CONTEXT) at engine_reconciler.go:0: addressed — this commit's env-skip at lines 379-381 short-circuits the entire probe-and-set when DOCKER_HOST or DOCKER_CONTEXT is set, matching the reviewer's intent. The author's PR comment confirms the design choice.
mook-as inline (V(1) is pointless) at docker_context.go:0: addressed — docker_context.go now uses log.Error(err, ...) at default level (round 2 S7 picked up).
mook-as inline (spell check) at docker_context.go:18: cosmetic; the author indicated the merge would be redone. Not a code issue.
jandubois (AI review claimed missed DOCKER_CONTEXT override): addressed — the round-3 env-skip closes the false-override path the AI review described. The author chose the more conservative direction (never write) over the alternative (always honor env).

Agent Performance Retro ¶

Claude ¶

Carried this round single-handed: produced the only Important finding (I1, the BATS env-contamination flake) and four of the five new suggestions. The I1 trace is the strongest piece of work in the round — it spotted that the round-3 env-skip silently disables the round-2 BATS test fix, and tracked the failure through service.Start having no explicit cmd.Env to the parent-shell inheritance window. Also caught the round-2 leftovers (S7 docstring/bootstrap.md drift, S8 TODO Windows wording, S9 docker_context_dir comment) that both peers missed. No false positives, no coverage misses.

Codex ¶

Returned an empty findings list — 0 critical, 0 important, 0 suggestions — with all four files marked "Reviewed, no issues" and a brief Design Strengths / Testing Assessment section. The review ran a real go test -race and reported both passed; that is genuinely useful evidence for the verdict but does not by itself justify a slot in this round's report. Codex missed I1 entirely (the env-skip / BATS interaction was directly in its strike zone — round 2 was Codex's headline finding) and missed every round-2 leftover. Below threshold.

Gemini ¶

Two suggestions (one valid t.Context() modernization in S6; one local-in-@test re-flag that this round dropped as out-of-scope, since the offending line is pre-existing and not modified by this PR). The unique contribution is the WaitGroup race re-raise — Gemini was correct in round 1 and is correct now; the round-1 retro dismissed it on incorrect mutex reasoning. Surfacing the race a second time was the right call. Gemini again skipped git blame (known daily-quota behavior); Claude and Codex covered that gap.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.5	Gemini 3.1 Pro
Duration	12m 17s	3m 39s	7m 34s
Findings	1I 5S	none	1S
Tool calls	36 (Bash 16, Read 13, Grep 7)	37 (shell 36, stdin 1)	11 (runshellcommand 7, read_file 4)
Design observations	1	2	2
False positives	0	0	0
Unique insights	5	0	2
Files reviewed	4	4	4
Coverage misses	0	1	1
Totals	1I 5S	none	1S
Downgraded	0	0	0
Dropped	0	0	1

Reconciliation:

Gemini P1 S2 (local in @test) → dropped as out-of-scope. Reason: the only offending line (local context_name="rancher-desktop-${RDD_INSTANCE}" at engine-docker.bats:780) is pre-existing in the round-3 base; round 2 already declared the same line out-of-scope and only the round-2-introduced local lines (790, 792) were in scope at that time. Round 3 removed those, leaving line 780 untouched. No new violation in this PR's diff.
Gemini P1 Design Concern (WaitGroup race) → kept as Design Concern, tagged (in-scope, pre-existing). Reason: the race is real and the round-1 retro dismissed it on incorrect reasoning, but the bug pre-dates this PR and the exploit window is narrow.

Review Process Notes ¶

Skill improvements ¶

[ ] Add a "verify mutex coverage extends across the entire critical section" check to concurrency claim verification. When evaluating a race claim, do not stop at "lock X is taken before Y is called." Trace the lock through the entire call chain to the race-relevant primitive (Add/Wait, Read/Write, etc.). A lock that is taken and released before reaching the critical operation does not serialize anything past that release. The retro that dismisses such a claim must explicitly state which release-acquire pair establishes the happens-before relationship, not just which two functions take the lock. Past reviews have dismissed valid races on the rationale "function X takes mutex M before calling function Y" without noticing that M was released inside X before Y started.

Repo context updates ¶

[+] Daemon env vars are inherited at spawn from the shell that ran rdd svc start. pkg/service/cmd/service.go calls exec.CommandContext(...) with no explicit cmd.Env, so the daemon process inherits the parent shell's environment. Code that reads env vars inside the long-lived daemon (e.g., os.Getenv("DOCKER_HOST"), os.Getenv("DOCKER_CONTEXT"), anything else affecting reconciler decisions) sees those vars frozen at spawn time, not whatever the user's current shell has set. Flag any new daemon-side os.Getenv call that does not also document this freeze, and flag any BATS test that depends on the daemon NOT having a particular env var without unset-ing it before rdd set running=true / rdd svc start.