# Deep Review: 20260428-102700-pr-333

| | |
|---|---|
| **Date** | 2026-04-28 10:27 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 3 (of target) |
| **Author** | [@Nino-K](https://github.com/Nino-K) |
| **PR** | [#333](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/333) — Docker Engine: improve Docker context probe safety and isolation |
| **Commits** | `751d6e5` Docker Engine: improve Docker context probe safety and isolation |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — round 2's `DOCKER_CONTEXT` gap is closed by an env-skip in `manageDockerContext`, but the BATS suite does not isolate the daemon's inherited env from the dev's parent shell, so the round-2 test fix becomes flaky on hosts that export `DOCKER_HOST` or `DOCKER_CONTEXT`. Plus several round-2 housekeeping items remain unaddressed. |
| **Wall-clock time** | `25 min 42 s` |

## Executive Summary

Round 3 closes the round-2 `DOCKER_CONTEXT` false-override path by skipping the entire probe-and-set when the daemon process inherits either `DOCKER_HOST` or `DOCKER_CONTEXT` (`engine_reconciler.go:372-381`), and adds a four-case unit suite for `probeNamedDockerContext` covering missing, SSH, unreachable unix, and unreachable tcp endpoints. The author's choice to short-circuit on either env var (rather than the narrower "honor `DOCKER_CONTEXT` as the named context") is a defensible conservative read: the daemon cannot tell whether the inherited env reflects the user's global preference, so it declines to write `config.json` at all when in doubt.

The round-3 changes introduce one regression and surface several housekeeping items the prior rounds left open:

1. **The new env-skip silently disables the round-2 BATS test fix on dev machines.** `local_setup_file` does not unset `DOCKER_HOST` / `DOCKER_CONTEXT` before spawning the service, and `service.Start` passes no explicit `cmd.Env`, so the daemon inherits whatever the parent shell happened to have set. A developer who runs BATS from a shell with `DOCKER_CONTEXT=staging` exported (a common case for users with multiple Docker contexts) sees the new test fail because the env-skip bails out before the seeded named context is probed. CI runners typically lack these vars, so this surfaces only on local runs.
2. **Round-2 housekeeping items remain.** The `manageDockerContext` function-level docstring and `docs/design/bootstrap.md:58,86` still describe the pre-refactor "absent or unhealthy → promote ours" contract; the round-2 S6 TODO(windows) wording still names `getCurrentDockerContext` despite that helper not needing Windows work; and the `docker_context_dir` BATS helper docstring still references `~/.docker/contexts/...` even though the body returns `${DOCKER_CONFIG}/contexts/...`.
3. **A pre-existing `WaitGroup` race remains in `manageDockerContext` / `removeDockerContext`.** Gemini's round-1 finding was dismissed in the round-1 retro on incorrect reasoning (the retro claimed `stopWatcher` holds `watcherMu` across the `Wait()` call; in fact `stopWatcher` releases `watcherMu` *before* invoking `removeDockerContext`, so the `Add(1)` in `manageDockerContext` and the `Wait()` in `removeDockerContext` are not serialized by `watcherMu`). The race is narrow — it requires an in-flight Reconcile to call `manageDockerContext` after the shutdown hook released `watcherMu` — but it is real and out-of-scope for this PR.

Report structure: 0 Critical, 1 Important, 9 Suggestions, 2 Design Concerns, 4 Design Strengths.

---

## Critical Issues

None.

## Important Issues

I1. **`local_setup_file` does not isolate the daemon from `DOCKER_HOST` / `DOCKER_CONTEXT` in the parent shell** — `bats/tests/32-app-controllers/engine-docker.bats:12-37, 779-809` [Claude Opus 4.7]

```bash
local_setup_file() {
    skip_on_windows
    export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
    mkdir -p "${DOCKER_CONFIG}"
    rdd svc delete
    rdd set running=true                             # <- spawns daemon
    run -0 rdd svc paths docker_socket
    export DOCKER_HOST="unix://${output}"            # <- after daemon spawn
    ...
}
```

`pkg/service/cmd/service.go` spawns the daemon via `exec.CommandContext(...)` with no explicit `cmd.Env`, so the daemon inherits the parent shell's environment verbatim. The new skip in `engine_reconciler.go:379-381` then short-circuits the entire probe-and-set goroutine when the daemon's process env carries `DOCKER_HOST` or `DOCKER_CONTEXT`:

```go
if os.Getenv("DOCKER_HOST") != "" || os.Getenv("DOCKER_CONTEXT") != "" {
    return
}
```

The round-2 test fix `moby engine sets currentContext when no healthy context exists` (`engine-docker.bats:779-809`) seeds `currentContext: test-unreachable`, restarts the watcher, and asserts the reconciler promotes `rancher-desktop-${RDD_INSTANCE}`. A developer who ran `bats ...` from a shell that exported either env var (a common case for anyone using a remote/secondary Docker context) trips the skip; `currentContext` stays at `test-unreachable`; the `try --max 6 --delay 1 -- ...` loop times out and the test fails. CI runners typically lack these vars, so this surfaces only on local runs.

The in-test `export DOCKER_HOST="unix://${output}"` at line 31 fires *after* `rdd set running=true` on line 29, so it does not reach the already-spawned daemon — only the BATS shell's child docker CLIs see it. The daemon's env is whatever was inherited at spawn time.

Fix: clear the env vars before spawning the daemon.

```diff
 local_setup_file() {
     skip_on_windows
     export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
     mkdir -p "${DOCKER_CONFIG}"
+    # The daemon inherits these env vars at spawn (service.Start does not
+    # set cmd.Env). Clear them so manageDockerContext exercises the
+    # probe-and-set path regardless of the parent shell.
+    unset DOCKER_HOST DOCKER_CONTEXT
     rdd svc delete
     rdd set running=true
```

`(important, regression)`

## Suggestions

S1. **Comment explaining the named-context detour misstates why it works** — `bats/tests/32-app-controllers/engine-docker.bats:782-789` [Claude Opus 4.7]

```bash
# Seed a named context pointing at an unreachable socket and make it
# current. This forces the reconciler into the probeNamedDockerContext
# branch with a guaranteed-failing endpoint, independent of whether the
# host has a running Docker daemon at the default socket (e.g.
# ubuntu-latest runners ship Docker pre-installed). Using a named context
# also avoids the stale-env issue: the service process's DOCKER_HOST is
# frozen at first start, so FromEnv-based probing of "" / "default" is
# host-dependent and unreliable from inside a long-lived daemon.
```

The "Using a named context also avoids the stale-env issue" claim does not hold once round 3's env-skip lands. The skip at `engine_reconciler.go:379-381` runs *before* the named-context probe, so a frozen `DOCKER_HOST` in the daemon's env disables both branches equally — the named-context detour buys nothing on that axis. The detour only escapes the `FromEnv` flake when the daemon inherits no env vars at all. Pair this rewrite with the I1 `unset` so the comment describes the actual contract.

```diff
-    # Using a named context
-    # also avoids the stale-env issue: the service process's DOCKER_HOST is
-    # frozen at first start, so FromEnv-based probing of "" / "default" is
-    # host-dependent and unreliable from inside a long-lived daemon.
+    # Both probe branches (named and FromEnv) require the daemon's env to
+    # have neither DOCKER_HOST nor DOCKER_CONTEXT set, otherwise
+    # manageDockerContext skips the probe entirely. local_setup_file
+    # unsets both before spawning the daemon.
```

`(suggestion, gap)`

S2. **`probeDockerContext` docstring overstates what the function reaches in production** — `pkg/controllers/app/engine/controllers/docker_context.go:143-150` [Claude Opus 4.7]

```go
// probeDockerContext pings the implicit default Docker endpoint
// (DOCKER_HOST or the platform default socket) within dockerContextProbeTimeout.
// Used when currentContext is "" or "default".
func probeDockerContext(ctx context.Context) bool {
    ...
    return pingDocker(probeCtx, dockerclient.FromEnv)
}
```

By the time `probeDockerContext(probeCtx)` runs at `engine_reconciler.go:387`, the goroutine has already returned if either `DOCKER_HOST` or `DOCKER_CONTEXT` was set (lines 379-381). So `FromEnv`'s `WithHostFromEnv` only ever resolves to the platform default socket — `DOCKER_HOST` is empty by construction here. The docstring's "DOCKER_HOST or the platform default socket" phrasing is reachable from unit tests but never on the production path. Tighten the comment so a future refactor that removes the env-skip does not silently change probe semantics.

```diff
 // probeDockerContext pings the implicit default Docker endpoint
-// (DOCKER_HOST or the platform default socket) within dockerContextProbeTimeout.
-// Used when currentContext is "" or "default".
+// (the platform default socket) within dockerContextProbeTimeout. Used
+// when currentContext is "" or "default". The caller short-circuits when
+// DOCKER_HOST or DOCKER_CONTEXT is set, so FromEnv's host-from-env path
+// is unreachable on the production path and the probe always targets
+// the platform default socket.
 func probeDockerContext(ctx context.Context) bool {
```

`(suggestion, documentation)`

S3. **`probeNamedDockerContext` lacks coverage for the four "assume healthy" library-error branches** — `pkg/controllers/app/engine/controllers/docker_context_test.go:215-259` [Claude Opus 4.7] (extends round 2 S2)

The new `Test_probeNamedDockerContext` covers four common outcomes (missing, SSH, unreachable unix, unreachable tcp), but the four "assume healthy" branches at `docker_context.go:172-174, 178-180, 188-191, 193-196` (metadata read error, decode error, TLS load error, ClientOpts error) and the TCP+TLS / mTLS / `SkipTLSVerify` paths motivating the PR have no unit anchor. The malformed-meta case is the cheapest to lock in: write a `meta.json` with an unparseable endpoint payload and assert the function returns `true`.

```go
t.Run("malformed meta returns true (assume healthy)", func(t *testing.T) {
    newDockerTestDir(t)
    seedContext(t, "ok-ctx", "unix:///var/run/docker.sock")
    dir, _ := dockerConfigDir()
    sum := sha256.Sum256([]byte("ok-ctx"))
    metaPath := filepath.Join(dir, "contexts", "meta", hex.EncodeToString(sum[:]), "meta.json")
    assert.NilError(t, os.WriteFile(metaPath, []byte("{ not json"), 0o644))
    result := probeNamedDockerContext(t.Context(), "ok-ctx")
    assert.Assert(t, result, "unparseable meta must be assumed healthy")
})
```

A fake context store seeded with TLS data plus an unreachable TLS endpoint would pin the TLS construction without needing a live TLS daemon and close the largest residual gap from round 2 S2.

`(suggestion, gap)`

S4. **The env-skip and the empty/`default` probe routing have no unit test** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:363-396` [Claude Opus 4.7]

The four-line env-var skip and the two-branch probe selector are entirely behavioral additions to the goroutine in `manageDockerContext` and have no direct test. The BATS path covers a subset (named-unhealthy promotes ours), but no test asserts:

1. `setCurrentDockerContext` is never called when `DOCKER_HOST` is set;
2. `setCurrentDockerContext` is never called when `DOCKER_CONTEXT` is set;
3. `current == "default"` routes through `probeDockerContext` (FromEnv), not the named branch.

A small refactor that extracts the goroutine body into a testable helper (taking `getEnv func(string) string` and the probe functions as parameters) would make all three reachable from `_test.go`. Worth doing because the env-skip is the headline behavior of this PR — silent regression of it would not be caught.

`(suggestion, gap)`

S5. **`getCurrentDockerContext` early-return comment misses the recovery path** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:363-370` [Claude Opus 4.7]

```go
current, err := getCurrentDockerContext()
if err != nil {
    // An error here means the config file exists but could not be
    // read or parsed. Do not proceed — writing currentContext now
    // would overwrite a file we cannot safely round-trip.
    log.Error(err, "Failed to read current Docker context")
    return
}
```

The early return is the safe choice — `setCurrentDockerContext` would have failed on the same `config.Load` call in the old flow. A future reader could mistake the early return for "RDD ignores broken config.json forever"; in fact the next reconcile (e.g., another `rdd set`) re-enters `manageDockerContext`, so repairing the file and triggering any reconcile is enough to recover.

```diff
     // An error here means the config file exists but could not be
     // read or parsed. Do not proceed — writing currentContext now
-    // would overwrite a file we cannot safely round-trip.
+    // would overwrite a file we cannot safely round-trip. The next
+    // reconcile (e.g. another `rdd set`) re-enters this function, so
+    // repairing config.json restores the probe-and-set behavior.
```

`(suggestion, documentation)`

S6. **Use `t.Context()` in new tests** — `pkg/controllers/app/engine/controllers/docker_context_test.go:234, 241, 248, 256` [Gemini 3.1 Pro]

```go
t.Run("missing context returns false", func(t *testing.T) {
    newDockerTestDir(t)
    result := probeNamedDockerContext(context.Background(), "does-not-exist")
    assert.Assert(t, !result, "missing context must be treated as unhealthy")
})
```

Go 1.25 (`go.mod`) gives every `*testing.T` a per-test `Context()` that cancels when the test finishes. Switch the four `context.Background()` callers in `Test_probeNamedDockerContext` so test timeouts and cancellations propagate into the probe.

```diff
-result := probeNamedDockerContext(context.Background(), "does-not-exist")
+result := probeNamedDockerContext(t.Context(), "does-not-exist")
```

`(suggestion, enhancement)`

S7. **`manageDockerContext` docstring and `docs/design/bootstrap.md` still describe the pre-refactor contract** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:319-321` and `docs/design/bootstrap.md:58,86` (round 2 S3 unaddressed)

```go
// manageDockerContext creates the instance Docker context and, in a goroutine,
// probes the user's current context; if it is absent or unhealthy, switches
// the default to the new context. At most one probe runs at a time.
```

After round 2 the empty / `"default"` branch routes through `probeDockerContext` (FromEnv) and treats a foreign default daemon as healthy; round 3 adds a further env-skip for `DOCKER_HOST` / `DOCKER_CONTEXT`. Neither the docstring nor `docs/design/bootstrap.md:58,86` reflects either change. Round 2 S3 flagged the same drift and was not picked up.

```
docs/design/bootstrap.md:58
   "It creates a `rancher-desktop-2` docker context. If there is no
   current context, or if the context isn't accepting connections, then
   it makes the new context the default."

docs/design/bootstrap.md:86
   "`rdd start` has created the `rancher-desktop-2` docker context, but
   it will only activate it if the current context doesn't exist, or is
   non-functional."
```

Fix: rewrite the docstring and both bullets to spell out the new contract — RDD probes the current named context, falls back to probing the implicit default endpoint when none is set, and skips the entire promotion when `DOCKER_HOST` or `DOCKER_CONTEXT` is set in the daemon's env.

`(suggestion, documentation, round 2 S3 unaddressed)`

S8. **TODO(windows) comment overstates what needs platform-specific handling** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:323-327` (round 2 S6, round 1 S5 unaddressed)

```go
// TODO(windows): Docker context management uses unix:// sockets; Windows
// requires npipe:// in the context meta file, an npipe-aware probe client,
// and platform-specific path handling in getCurrentDockerContext and
// createReplaceDockerContext. Track in a follow-up issue before shipping
// Windows support.
```

`getCurrentDockerContext` delegates to `docker/cli`'s `config.Load`, and `createReplaceDockerContext` plus `newContextStore` delegate to the store's path handling — both resolve `%USERPROFILE%\.docker` on Windows through `os.UserHomeDir()` with no extra work. The Windows-specific work concentrates in one line of `createReplaceDockerContext` (the hard-coded `"unix://"` prefix) and in the probe client selection. Round 1 S5 and round 2 S6 flagged this and were not picked up.

```diff
-    // TODO(windows): Docker context management uses unix:// sockets; Windows
-    // requires npipe:// in the context meta file, an npipe-aware probe client,
-    // and platform-specific path handling in getCurrentDockerContext and
-    // createReplaceDockerContext. Track in a follow-up issue before shipping
-    // Windows support.
+    // TODO(windows): createReplaceDockerContext hard-codes "unix://"; Windows
+    // needs "npipe://" plus an npipe-aware probe client in
+    // probeDockerContext / probeNamedDockerContext. Track in a follow-up
+    // issue before shipping Windows support.
```

`(suggestion, documentation, round 2 S6 / round 1 S5 unaddressed)`

S9. **Stale path reference in `docker_context_dir` helper docstring** — `bats/tests/32-app-controllers/engine-docker.bats:743-744` (round 2 S8 unaddressed)

```bash
# docker_context_dir returns the ~/.docker/contexts/meta/<hash> directory for
# the named context. Docker derives the sub-directory from sha256(name).
docker_context_dir() {
    ...
    echo "${DOCKER_CONFIG}/contexts/meta/${hash}"
}
```

The body returns `${DOCKER_CONFIG}/contexts/meta/${hash}`, but the docstring still says `~/.docker/contexts/meta/<hash>`. The two diverged when the round-2 fix isolated tests via `DOCKER_CONFIG`, and the docstring still has not been updated.

```diff
-# docker_context_dir returns the ~/.docker/contexts/meta/<hash> directory for
-# the named context. Docker derives the sub-directory from sha256(name).
+# docker_context_dir returns the contexts/meta/<hash> directory under
+# DOCKER_CONFIG for the named context. Docker derives the sub-directory
+# from sha256(name).
```

`(suggestion, documentation, round 2 S8 unaddressed)`


## Design Observations

### Concerns

**`Add(1)` and `Wait()` on `contextProbeWg` are not serialized by an outer lock** `(in-scope, pre-existing)` [Gemini 3.1 Pro]

`manageDockerContext` calls `r.contextProbeWg.Add(1)` at `engine_reconciler.go:350` after releasing `contextMu`; `removeDockerContext` calls `r.contextProbeWg.Wait()` at line 420 after releasing `contextMu`. The two are not on the same mutex, and `stopWatcher` releases `watcherMu` *before* invoking `removeDockerContext` (line 311 vs line 316), so `watcherMu` does not bridge them either. Round 1 dismissed Gemini's identical finding on the rationale that "stopWatcher takes watcherMu before calling Wait()" — that is incorrect; the lock is released first.

The exploit window is narrow: it needs an in-flight Reconcile to acquire `watcherMu` after the shutdown hook released it from `stopWatcher`, then succeed in `newDockerWatcher(r.watcherCtx, ...)` despite `watcherCtx` already being cancelled, then reach `Add(1)` while the shutdown hook's `removeDockerContext` is at `Wait()` with the counter at 0. If this lands, the WaitGroup panics with "Add called concurrently with Wait" and the daemon crashes during graceful shutdown rather than cleaning up.

This is pre-existing — predates PR #333 — so it does not block this PR. Mention it here so the round-1 dismissal does not bury the issue. The right fix is to swap the shared `WaitGroup` for a per-probe state struct holding both the cancel func and a per-probe done channel, both swappable atomically under `contextMu`.

**Daemon's frozen env is a long-lived property masquerading as a per-invocation choice** `(future)` [Claude Opus 4.7]

The env-skip uses the daemon process's own environment — whatever shell happened to spawn `rdd svc start` minutes or hours earlier. A developer who once started `rdd` from a shell exporting `DOCKER_CONTEXT=staging` permanently disables RDD's context promotion on that machine — even from new shells without those vars — until they `rdd svc delete && rdd svc start` from a clean shell. The author's PR comment acknowledges this is conservative; a follow-up could either (a) document the gotcha in user-facing docs, (b) re-resolve the env at probe time from a stable source (e.g., the user's `~/.docker/config.json` `environment` block — Docker CLI honors this), or (c) emit a one-shot info log on startup recording which env vars are inherited so the daemon's behavior stays observable.

### Strengths

**`DOCKER_CONTEXT` env-skip closes the round-2 false-override path** `(in-scope)` [Codex GPT 5.5] [Gemini 3.1 Pro]

The author chose the most conservative reading of round-2 I1: rather than honor `DOCKER_CONTEXT` as a named-context selector (which would still write `config.json` based on shell-local state), the daemon now declines to write `config.json` at all when either `DOCKER_HOST` or `DOCKER_CONTEXT` is set in its env. This eliminates the "user unsets `DOCKER_CONTEXT` and finds RDD's context as their new default" surprise that round 2 enumerated.

**Conservative "assume healthy" stance with explicit logging at every fall-through** `(in-scope)` [Claude Opus 4.7] [Gemini 3.1 Pro]

`probeNamedDockerContext` returns `true` on every error path that cannot decide healthiness — store unreadable, metadata missing, endpoint malformed, scheme not tcp/unix, TLS load failed, ClientOpts failed — and logs each at default level (round 2 S7 was picked up). For a tool that mutates user config, this is the right default: refuse to clobber a config the controller cannot evaluate.

**`Test_probeNamedDockerContext` covers the four common probe outcomes** `(in-scope)` [Codex GPT 5.5]

The new four-case suite (missing context → false, SSH scheme → true, unreachable unix → false, unreachable tcp → false) anchors the contract for the most common shapes a real user config produces. Round 1 S2 and round 2 S2 raised this gap; the round-3 addition closes most of it (S3 above lists the residual library-error branches).

**Two-function probe split makes the transport decision explicit** `(in-scope)` [Claude Opus 4.7] [Codex GPT 5.5]

Splitting the old single-signature `probeDockerContext(ctx, host)` into `probeDockerContext(ctx)` (FromEnv-backed default probe) and `probeNamedDockerContext(ctx, name)` (store-backed named probe) makes the "did we load TLS for this?" question unmissable at each call site. The old signature silently called `WithHost(host)` even when TLS data was required; the new signatures foreclose that class entirely.

## Testing Assessment

Coverage gaps, ranked by risk:

1. **Daemon env contamination from the parent shell** — the round-2 BATS test fix flakes on dev machines where `DOCKER_HOST` or `DOCKER_CONTEXT` is exported (per I1).
2. **`DOCKER_HOST` / `DOCKER_CONTEXT` env-skip path** — the headline behavior of this PR, untested at any layer (per S4).
3. **`probeNamedDockerContext` "assume healthy" library-error branches** — TLS load failure, ClientOpts failure, malformed meta (per S3).
4. **TCP+TLS / mTLS / `SkipTLSVerify` probe paths** — the original motivation for the PR has no unit anchor (per S3, residual round 2 S2).
5. **`current == "default"` routing** — neither unit nor BATS tests assert that this routes through `probeDockerContext`, not `probeNamedDockerContext` (Docker treats `"default"` as a magic non-name).
6. **Early return on `getCurrentDockerContext` error** — `Test_getCurrentDockerContext_malformedAuth` verifies the helper surfaces the error, but no test confirms the reconciler refuses to call `setCurrentDockerContext` when the config is malformed.

Codex ran `go test ./pkg/controllers/app/engine/controllers` and `go test -race ./pkg/controllers/app/engine/controllers` during its review and reported both passed.

## Documentation Assessment

- The `manageDockerContext` docstring at `engine_reconciler.go:319-321` still describes the pre-refactor "absent or unhealthy → promote ours" contract (S7).
- `docs/design/bootstrap.md:58,86` carries the same stale contract (S7).
- The `probeDockerContext` docstring at `docker_context.go:143-145` overstates what the function reaches in production (S2).
- The `probeNamedDockerContext` docstring at `docker_context.go:152-159` accurately documents the conservative policy and now lists "endpoint metadata missing or malformed" explicitly — a round-1 S3 follow-up that landed cleanly.
- The `getCurrentDockerContext` early-return comment misses the recovery path (S5).
- TODO(windows) at `engine_reconciler.go:323-327` overstates the work (S8, round 2 S6 / round 1 S5 unaddressed).
- The `docker_context_dir` BATS helper docstring still references `~/.docker/contexts/...` (S9, round 2 S8 unaddressed).
- The BATS comment explaining the named-context detour misstates why it works (S1).

## Commit Structure

Single commit (`751d6e5`), self-contained, message accurately describes the change. The bundling — env-skip, unit-test additions, BATS env-isolation choice — is internally consistent. No objection.

## Acknowledged Limitations

- **Windows Docker context management is still deferred.** Both `manageDockerContext` and `removeDockerContext` early-return on `runtime.GOOS == "windows"`. The TODO calls this out explicitly; see S8 for the wording.
- **SSH/npipe endpoints are trusted implicitly.** `probeNamedDockerContext` returns `true` for any scheme that is not `unix` or `tcp`. The docstring acknowledges the tradeoff: the controller cannot authoritatively probe these endpoints, so it accepts the risk of a dead user-configured context in exchange for not overriding the user's legitimate choice.
- **`config.json` updates are bypassed when `DOCKER_HOST` or `DOCKER_CONTEXT` is set in the daemon's env.** Documented in the inline comment at `engine_reconciler.go:372-378` and consistent with the author's PR-comment rationale: shell-local env vars do not necessarily reflect the user's global preference, so the conservative choice is to leave `config.json` alone.

## Unresolved Feedback

- **mook-as inline (config.json vs `DOCKER_CONTEXT`)** at `engine_reconciler.go:0`: addressed — this commit's env-skip at lines 379-381 short-circuits the entire probe-and-set when `DOCKER_HOST` or `DOCKER_CONTEXT` is set, matching the reviewer's intent. The author's PR comment confirms the design choice.
- **mook-as inline (`V(1)` is pointless)** at `docker_context.go:0`: addressed — `docker_context.go` now uses `log.Error(err, ...)` at default level (round 2 S7 picked up).
- **mook-as inline (spell check)** at `docker_context.go:18`: cosmetic; the author indicated the merge would be redone. Not a code issue.
- **jandubois (AI review claimed missed `DOCKER_CONTEXT` override)**: addressed — the round-3 env-skip closes the false-override path the AI review described. The author chose the more conservative direction (never write) over the alternative (always honor env).


---

## Agent Performance Retro

### [Claude]

Carried this round single-handed: produced the only Important finding (I1, the BATS env-contamination flake) and four of the five new suggestions. The I1 trace is the strongest piece of work in the round — it spotted that the round-3 env-skip silently disables the round-2 BATS test fix, and tracked the failure through `service.Start` having no explicit `cmd.Env` to the parent-shell inheritance window. Also caught the round-2 leftovers (S7 docstring/bootstrap.md drift, S8 TODO Windows wording, S9 `docker_context_dir` comment) that both peers missed. No false positives, no coverage misses.

### [Codex]

Returned an empty findings list — 0 critical, 0 important, 0 suggestions — with all four files marked "Reviewed, no issues" and a brief Design Strengths / Testing Assessment section. The review ran a real `go test -race` and reported both passed; that is genuinely useful evidence for the verdict but does not by itself justify a slot in this round's report. Codex missed I1 entirely (the env-skip / BATS interaction was directly in its strike zone — round 2 was Codex's headline finding) and missed every round-2 leftover. Below threshold.

### [Gemini]

Two suggestions (one valid `t.Context()` modernization in S6; one `local`-in-`@test` re-flag that this round dropped as out-of-scope, since the offending line is pre-existing and not modified by this PR). The unique contribution is the `WaitGroup` race re-raise — Gemini was correct in round 1 and is correct now; the round-1 retro dismissed it on incorrect mutex reasoning. Surfacing the race a second time was the right call. Gemini again skipped `git blame` (known daily-quota behavior); Claude and Codex covered that gap.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 12m 17s | 3m 39s | 7m 34s |
| Findings | 1I 5S | none | 1S |
| Tool calls | 36 (Bash 16, Read 13, Grep 7) | 37 (shell 36, stdin 1) | 11 (run_shell_command 7, read_file 4) |
| Design observations | 1 | 2 | 2 |
| False positives | 0 | 0 | 0 |
| Unique insights | 5 | 0 | 2 |
| Files reviewed | 4 | 4 | 4 |
| Coverage misses | 0 | 1 | 1 |
| **Totals** | **1I 5S** | **none** | **1S** |
| Downgraded | 0 | 0 | 0 |
| Dropped | 0 | 0 | 1 |


**Reconciliation:**

- Gemini P1 S2 (`local` in `@test`) → dropped as out-of-scope. Reason: the only offending line (`local context_name="rancher-desktop-${RDD_INSTANCE}"` at `engine-docker.bats:780`) is pre-existing in the round-3 base; round 2 already declared the same line out-of-scope and only the round-2-introduced `local` lines (790, 792) were in scope at that time. Round 3 removed those, leaving line 780 untouched. No new violation in this PR's diff.
- Gemini P1 Design Concern (`WaitGroup` race) → kept as Design Concern, tagged `(in-scope, pre-existing)`. Reason: the race is real and the round-1 retro dismissed it on incorrect reasoning, but the bug pre-dates this PR and the exploit window is narrow.

## Review Process Notes

### Skill improvements

- [ ] **Add a "verify mutex coverage extends across the entire critical section" check to concurrency claim verification.** When evaluating a race claim, do not stop at "lock X is taken before Y is called." Trace the lock through the *entire* call chain to the race-relevant primitive (Add/Wait, Read/Write, etc.). A lock that is taken and *released* before reaching the critical operation does not serialize anything past that release. The retro that dismisses such a claim must explicitly state which release-acquire pair establishes the happens-before relationship, not just which two functions take the lock. Past reviews have dismissed valid races on the rationale "function X takes mutex M before calling function Y" without noticing that M was released inside X before Y started.

### Repo context updates

- [+] **Daemon env vars are inherited at spawn from the shell that ran `rdd svc start`.** `pkg/service/cmd/service.go` calls `exec.CommandContext(...)` with no explicit `cmd.Env`, so the daemon process inherits the parent shell's environment. Code that reads env vars inside the long-lived daemon (e.g., `os.Getenv("DOCKER_HOST")`, `os.Getenv("DOCKER_CONTEXT")`, anything else affecting reconciler decisions) sees those vars frozen at spawn time, not whatever the user's current shell has set. Flag any new daemon-side `os.Getenv` call that does not also document this freeze, and flag any BATS test that depends on the daemon NOT having a particular env var without `unset`-ing it before `rdd set running=true` / `rdd svc start`.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

Let me check the Docker cli store implementation to verify the type getter behavior.

Let me look at the rest of the test directly to verify line numbers.

I have enough context. Let me verify a few more details on key trace paths before writing findings.

# Code Review: PR #333 — Docker Engine context probe safety and isolation

## Executive Summary

This change replaces the naive `mobyclient.New(WithHost(...))` probe with a docker/cli `ContextStore`-driven probe that loads TLS data via `WithTLSData` + `ClientOpts()`, treats SSH/npipe as healthy, and adds a daemon-env skip for `DOCKER_HOST`/`DOCKER_CONTEXT`. The implementation is largely sound and the conservative "assume healthy on any error" stance is well-documented. The most concrete concern is a hidden coupling in the BATS suite to the parent shell's environment that can silently disable the test the new code added.

## Findings

### Critical Issues

None.

### Important Issues

**I1. BATS setup_file does not isolate `DOCKER_HOST` / `DOCKER_CONTEXT` from the parent shell** — `bats/tests/32-app-controllers/engine-docker.bats:12-37, 779-809`

```bash
local_setup_file() {
    skip_on_windows
    export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
    mkdir -p "${DOCKER_CONFIG}"
    rdd svc delete
    rdd set running=true                             # <- spawns daemon
    run -0 rdd svc paths docker_socket
    export DOCKER_HOST="unix://${output}"            # <- after daemon spawn
    ...
}
```

`pkg/service/cmd/service.go:248` spawns the daemon with `exec.CommandContext(...)` and no explicit `cmd.Env`, so the daemon inherits the parent shell's environment verbatim. The new skip in `engine_reconciler.go:379-381` then short-circuits the entire probe-and-set goroutine when the daemon's process env has `DOCKER_HOST` or `DOCKER_CONTEXT` set:

```go
if os.Getenv("DOCKER_HOST") != "" || os.Getenv("DOCKER_CONTEXT") != "" {
    return
}
```

The new test `moby engine sets currentContext when no healthy context exists` (`engine-docker.bats:779-809`) seeds `currentContext: test-unreachable`, triggers a watcher restart, and asserts that the reconciler promotes `rancher-desktop-${RDD_INSTANCE}`. If a developer runs `bats ...` from a shell that already exported either env var (a common case for anyone using a remote/secondary Docker context), the skip fires and `currentContext` stays at `test-unreachable`. The `try --max 6 --delay 1 -- ...` loop then times out and the test fails non-deterministically depending on who runs it.

Note that the in-test `export DOCKER_HOST="unix://${output}"` at line 31 happens *after* `rdd set running=true` on line 29, so it does not propagate to the already-spawned daemon — only to the BATS shell's child docker CLIs. The daemon's env is whatever was inherited at spawn time.

The same fragility applies to all tests in the file because the daemon is shared across tests, but only the new test asserts behavior that depends on the skip not firing.

Fix: explicitly clear the env vars in `local_setup_file` before spawning the daemon:

```diff
 local_setup_file() {
     skip_on_windows
     export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
     mkdir -p "${DOCKER_CONFIG}"
+    # The daemon inherits these env vars at spawn (service.Start does not
+    # set cmd.Env). Clear them so the probe-and-set in manageDockerContext
+    # is exercised regardless of the parent shell.
+    unset DOCKER_HOST DOCKER_CONTEXT
     rdd svc delete
     rdd set running=true
```

(important, regression)

### Suggestions

**S1. Comment in the BATS test misrepresents why the named context is used** — `bats/tests/32-app-controllers/engine-docker.bats:782-789`

```bash
# Seed a named context pointing at an unreachable socket and make it
# current. This forces the reconciler into the probeNamedDockerContext
# branch with a guaranteed-failing endpoint, independent of whether the
# host has a running Docker daemon at the default socket (e.g.
# ubuntu-latest runners ship Docker pre-installed). Using a named context
# also avoids the stale-env issue: the service process's DOCKER_HOST is
# frozen at first start, so FromEnv-based probing of "" / "default" is
# host-dependent and unreliable from inside a long-lived daemon.
```

The "Using a named context also avoids the stale-env issue" claim is incorrect. The new env-var skip at `engine_reconciler.go:379-381` runs *before* the named-context probe, so a frozen `DOCKER_HOST` in the daemon's env disables both branches equally — the named-context detour does not help. The detour only avoids the FromEnv issue when no env vars are inherited at all, which is the precondition the comment glosses over. Tying this comment together with I1 will make future readers understand the actual contract.

Fix: rewrite the second half of the comment to state the real precondition:

```diff
-    # Using a named context
-    # also avoids the stale-env issue: the service process's DOCKER_HOST is
-    # frozen at first start, so FromEnv-based probing of "" / "default" is
-    # host-dependent and unreliable from inside a long-lived daemon.
+    # Both probe branches (named and FromEnv) require the daemon's env to
+    # have neither DOCKER_HOST nor DOCKER_CONTEXT set, otherwise
+    # manageDockerContext skips the probe entirely. local_setup_file
+    # unsets both before spawning the daemon.
```

(suggestion, gap)

**S2. `getCurrentDockerContext` error path no longer falls through to a probe attempt** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:363-370`

```go
current, err := getCurrentDockerContext()
if err != nil {
    // An error here means the config file exists but could not be
    // read or parsed. Do not proceed — writing currentContext now
    // would overwrite a file we cannot safely round-trip.
    log.Error(err, "Failed to read current Docker context")
    return
}
```

The new early return is the safe choice — `setCurrentDockerContext` would have failed the same `config.Load` call anyway in the old flow. Worth a one-line addition to the comment so future readers do not mistake the early return for "RDD ignores broken config.json forever": the next reconcile (e.g., another `rdd set` call) re-enters `manageDockerContext` and re-tries, so the user repairing the file and triggering any reconcile is enough to recover.

```diff
     // An error here means the config file exists but could not be
     // read or parsed. Do not proceed — writing currentContext now
-    // would overwrite a file we cannot safely round-trip.
+    // would overwrite a file we cannot safely round-trip. The next
+    // reconcile (e.g. another `rdd set`) re-enters this function, so
+    // repairing config.json restores the probe-and-set behavior.
```

(suggestion, gap)

**S3. `probeDockerContext` falls back to `FromEnv`, but the `DOCKER_HOST` env-skip already returned** — `pkg/controllers/app/engine/controllers/docker_context.go:143-150` and `engine_reconciler.go:386-388`

```go
// docker_context.go
// probeDockerContext pings the implicit default Docker endpoint
// (DOCKER_HOST or the platform default socket) within dockerContextProbeTimeout.
func probeDockerContext(ctx context.Context) bool {
    ...
    return pingDocker(probeCtx, dockerclient.FromEnv)
}
```

By the time `probeDockerContext(probeCtx)` runs at `engine_reconciler.go:387`, the goroutine has already returned if either `DOCKER_HOST` or `DOCKER_CONTEXT` was set (lines 379-381). So `FromEnv`'s `WithHostFromEnv` will only ever resolve to the platform default socket — `DOCKER_HOST` is, by construction, empty here. The function comment "DOCKER_HOST or the platform default socket" is technically reachable in unit tests but never on the production path. Either tighten the comment, or call out this invariant so a later refactor doesn't accidentally drop the env-skip and silently change behavior:

```diff
 // probeDockerContext pings the implicit default Docker endpoint
-// (DOCKER_HOST or the platform default socket) within dockerContextProbeTimeout.
-// Used when currentContext is "" or "default".
+// (the platform default socket) within dockerContextProbeTimeout.
+// Used when currentContext is "" or "default". The caller is expected
+// to have already short-circuited when DOCKER_HOST/DOCKER_CONTEXT are
+// set, so FromEnv's host-from-env path is effectively dead code here
+// and the probe always targets the platform default socket.
 func probeDockerContext(ctx context.Context) bool {
```

(suggestion, gap)

**S4. `probeNamedDockerContext` lacks coverage for the "assume healthy" library-error branches** — `pkg/controllers/app/engine/controllers/docker_context_test.go:215-259`

The unit test exercises the four observable outcomes most likely to arise from real configs: missing context (false), SSH (true), unreachable unix socket (false), unreachable tcp (false). The remaining "assume healthy" branches at `docker_context.go:172-174, 178-180, 188-191, 193-196` (metadata read error, decode error, TLS load error, ClientOpts error) are not covered. They are admittedly hard to trigger from outside without injecting a corrupt store, but at minimum the malformed-endpoint case can be reached cheaply by writing a `meta.json` with an unparseable endpoint payload and asserting the function returns `true`. Worth one extra subtest to lock the conservative behavior in.

```go
t.Run("malformed meta returns true (assume healthy)", func(t *testing.T) {
    newDockerTestDir(t)
    seedContext(t, "ok-ctx", "unix:///var/run/docker.sock")
    // Overwrite meta.json with garbage so EndpointFromContext / GetMetadata fails.
    dir, _ := dockerConfigDir()
    sum := sha256.Sum256([]byte("ok-ctx"))
    metaPath := filepath.Join(dir, "contexts", "meta", hex.EncodeToString(sum[:]), "meta.json")
    assert.NilError(t, os.WriteFile(metaPath, []byte("{ not json"), 0o644))
    result := probeNamedDockerContext(context.Background(), "ok-ctx")
    assert.Assert(t, result, "unparseable meta must be assumed healthy")
})
```

(suggestion, gap)

**S5. New env-skip and probe-routing in `manageDockerContext` are not unit tested** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:363-396`

The four-line env-var skip and the two-branch probe selector are entirely behavioral additions to the goroutine and have no unit test. The BATS path covers a subset of the routing (named-unhealthy promotes ours), but no test asserts that:

1. when `DOCKER_HOST` is set, `setCurrentDockerContext` is never called;
2. when `DOCKER_CONTEXT` is set, `setCurrentDockerContext` is never called;
3. when `current == "default"`, the FromEnv branch is taken (not the named branch).

A small refactor to extract the goroutine body into a testable helper (taking `getEnv func(string) string` and the probe functions as parameters) would make all three reachable from `_test.go`. Not blocking — flagging because the env-skip is the very behavior the PR description calls out, and it would be a shame to regress it silently.

(suggestion, gap)

## Design Observations

### Concerns

**Daemon's frozen env is a long-lived property masquerading as a per-invocation choice (future).** The env-var skip uses the daemon process's own environment, which is whatever shell happened to spawn `rdd svc start` minutes or hours earlier. A developer who once started `rdd` from a shell with `DOCKER_CONTEXT=staging` exported will *permanently* disable RDD's context promotion on that machine — even from new shells without those vars — until they `rdd svc delete && rdd svc start` from a clean shell. The author's PR comment acknowledges this is conservative; a follow-up could either (a) document the gotcha in user-facing docs, (b) re-resolve the env at probe time from a known stable source (e.g., user's `~/.docker/config.json` `environment` block — Docker CLI honors this), or (c) emit a one-shot info log on startup recording which env vars are inherited so the daemon's behavior is observable. Not in scope for this PR.

### Strengths

- The "assume healthy" stance with explicit logging at every fall-through (`docker_context.go:165, 173, 178, 185, 190, 195`) is the right default for a tool that mutates user config.
- Splitting probe into `probeDockerContext` (FromEnv) and `probeNamedDockerContext` (Store-driven) cleanly separates the two semantically different cases; the naming reads correctly at the call site.
- Using `DOCKER_CONFIG` for BATS isolation is a real improvement over the old save-and-restore-via-jq dance, which had no atomicity guarantees and would corrupt the developer's real `~/.docker/config.json` on a crashed test.
- Re-exporting the test helper `testGetContextHost` rather than the production helper `getDockerContextHost` is the right call now that the production code no longer needs it — avoids leaving dead code in the package.

## Testing Assessment

Coverage gaps, ranked by risk:

1. **`DOCKER_HOST`/`DOCKER_CONTEXT` env-skip path** (untested) — the headline behavior of this PR (per S5).
2. **BATS env contamination** — the test passes only when the parent shell happens to lack the env vars; CI usually does not set them, but local-dev invocations frequently do (per I1).
3. **`probeNamedDockerContext` "assume healthy" library-error branches** — TLS load failure, ClientOpts failure, malformed meta (per S4).
4. **`current == "default"` routing** — neither unit nor BATS tests assert that this routes through `probeDockerContext`, not `probeNamedDockerContext` (Docker treats `"default"` as a magic non-name).
5. **Real TLS context** — there is no end-to-end exercise of `WithTLSData` + `ClientOpts()` for a real `tcp+TLS` context. Not realistic to fixture in unit tests, but worth a follow-up integration test if Rancher Desktop is ever run alongside an mTLS-protected Docker daemon.

## Documentation Assessment

- The TODO comments at `engine_reconciler.go:323-327` and `:403` are an upgrade over the previous one-liner — they enumerate the concrete Windows blockers (npipe schema, npipe-aware client, path handling) rather than just saying "not implemented yet". The "Track in a follow-up issue before shipping Windows support" sentence would be more useful with an issue link; consider opening one and linking it.
- The function doc on `probeNamedDockerContext` (`docker_context.go:152-159`) accurately describes the conservative contract. Good.
- The setup_file comment (`engine-docker.bats:17-20`) correctly explains why `DOCKER_CONFIG` must be exported before `rdd set running=true`. Add the same reasoning for `DOCKER_HOST`/`DOCKER_CONTEXT` after applying the I1 fix.
- The `probeDockerContext` doc (`docker_context.go:143-145`) overstates what the function reaches in production (per S3).

## Commit Structure

Single commit (`751d6e5`), self-contained, message accurately describes the change. No concerns.

## Acknowledged Limitations

- **mook-as inline (config.json vs `DOCKER_CONTEXT`)** at `engine_reconciler.go:0`: reviewer asked whether RDD should ever set `currentContext` in `config.json` when `DOCKER_CONTEXT` is set, given that the env var would override it anyway. Author addressed this in this commit by short-circuiting the entire probe-and-set when either `DOCKER_HOST` or `DOCKER_CONTEXT` is set in the daemon's env (`engine_reconciler.go:379-381`), and called out the reasoning in the PR comment ("more conservative... we skip the probe and set entirely"). The implementation matches the reviewer's intent.
- **mook-as inline (`V(1)` is pointless)** at `docker_context.go:0`: addressed — the current file has no `.V(...)` calls, only direct `log.Error(err, ...)` and `log.Info(...)`.
- **mook-as inline (spell check)** at `docker_context.go:18`: cosmetic; author's reply notes the merge will be redone. Not a code issue.
- **jandubois (AI review claimed missed `DOCKER_CONTEXT` override)**: per Nino-K's reply, the conservative env-skip closes the false-override path the AI review described, and chose the more conservative direction (never write) over the path the AI suggested (always honor env). Resolved by the env-skip in this commit.

## Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — **I1, S1**
- `pkg/controllers/app/engine/controllers/docker_context.go` — **S3, S4** (new probe path is otherwise correct; SSH/npipe fall-through and ContextStore-based ClientOpts handling verified against `docker/cli@v29.4.0` source)
- `pkg/controllers/app/engine/controllers/docker_context_test.go` — **S4** (good coverage for the four common outcomes; library-error branches missing)
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — **S2, S3, S5** (lifecycle, locking, generation tracking, and Wait/cancel ordering all verified; env-skip is the main untested addition)
### [Codex] — Pass 1

### Executive Summary

This change moves Docker context probing onto Docker CLI `ContextStore` APIs and avoids overriding user contexts when env overrides or non-probeable endpoints are present. I found no merge-blocking regressions in the changed files; the main remaining risk is test coverage around real TLS/mTLS Docker contexts.

### Findings

#### Critical Issues

None.

#### Important Issues

None.

#### Suggestions

None.

### Design Observations

**Strengths**

Using Docker CLI’s context store and `Endpoint.ClientOpts()` is the right direction: it avoids reimplementing Docker’s context layout and TLS option translation.

The `DOCKER_HOST` / `DOCKER_CONTEXT` guard is a conservative choice that avoids rewriting global `config.json` based on daemon-inherited shell-local state.

### Testing Assessment

1. Real TCP+TLS, mTLS, and `SkipTLSVerify=true` contexts are not exercised end-to-end. The unit tests cover missing, SSH, unreachable Unix, and unreachable TCP contexts, but not the exact TLS cases called out in the PR motivation.
2. The env-override path is not directly tested: a test with `DOCKER_CONTEXT` or `DOCKER_HOST` set in the daemon environment would pin the intended “skip probe-and-set” behavior.

Targeted verification run:

```text
go test ./pkg/controllers/app/engine/controllers
go test -race ./pkg/controllers/app/engine/controllers
```

Both passed.

### Documentation Assessment

No user-facing documentation update appears necessary for this internal controller behavior. The Windows limitation is explicitly documented in code and should remain tracked for the later Windows implementation.

### Acknowledged Limitations

- **Code comment**: `pkg/controllers/app/engine/controllers/engine_reconciler.go:323` documents that Docker context management is skipped on Windows because it still needs `npipe://`, npipe-aware probing, and platform-specific path handling. This PR makes the limitation more explicit, not worse.
- **Code comment**: `pkg/controllers/app/engine/controllers/engine_reconciler.go:403` repeats that removal has the same Windows gap. This is consistent with the create/probe path.
- **PR comment**: The author explicitly declined treating daemon-inherited `DOCKER_CONTEXT` as the user’s global preference and instead skips probe-and-set when `DOCKER_HOST` or `DOCKER_CONTEXT` is set. The current implementation is consistent with that rationale and avoids listing it as a finding.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/docker_context.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/docker_context_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — Reviewed, no issues

### [Gemini] — Pass 1

### Executive Summary

This PR correctly refactors Docker context probing by migrating from custom endpoint parsing to Docker CLI's native `ContextStore` and `ClientOpts` generators. The change successfully resolves false-override bugs (such as those affecting SSH, mTLS, or `SkipTLSVerify` configurations) by correctly inheriting the user's TLS and endpoint parameters. It also cleanly isolates the daemon from polluting the global `config.json` when started in a shell with `DOCKER_HOST` present. The implementation is robust, but there are a few minor test gaps and a pre-existing WaitGroup race condition that should be addressed.

### Findings

#### Critical Issues
None.

#### Important Issues
None.

#### Suggestions

S1. **Use `t.Context()` in new tests** — `pkg/controllers/app/engine/controllers/docker_context_test.go:234` `(suggestion, gap)`

```go
	t.Run("missing context returns false", func(t *testing.T) {
		newDockerTestDir(t)
		result := probeNamedDockerContext(context.Background(), "does-not-exist")
		assert.Assert(t, !result, "missing context must be treated as unhealthy")
	})
```

The new `probeNamedDockerContext` test suite uses `context.Background()` in four separate `t.Run` blocks. Because the project builds on Go 1.25, tests should leverage `t.Context()` so that test timeouts and cancellations propagate through the tested logic automatically.

Fix: Replace `context.Background()` with `t.Context()` in all test cases.

```diff
-		result := probeNamedDockerContext(context.Background(), "does-not-exist")
+		result := probeNamedDockerContext(t.Context(), "does-not-exist")
```

S2. **Local variable declarations inside `@test` blocks** — `bats/tests/32-app-controllers/engine-docker.bats:779` `(suggestion, gap)`

```bash
@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"
```

The BATS tests touched by this PR contain pre-existing `local` declarations. Because each `@test` block runs in an isolated subshell, variables are inherently scoped. The project's BATS style guide forbids `local` inside test blocks to prevent unnecessary clutter. While the PR correctly removed one such `local` declaration on line 784 (`local saved_context`), it left the surrounding ones intact.

Fix: Remove `local` from variable declarations inside the modified test blocks.

```diff
-    local context_name="rancher-desktop-${RDD_INSTANCE}"
+    context_name="rancher-desktop-${RDD_INSTANCE}"
```

### Design Observations

#### Concerns

**Pre-existing WaitGroup race condition on `contextProbeWg` can panic** `(in-scope)`
The `contextProbeWg` coordinates in-flight probes between `manageDockerContext` and `removeDockerContext`. However, it invites a data race that panics the daemon: if `Reconcile` evaluates `wantWatcher = true` and launches `startWatcherAndSync` *concurrently* with the manager shutdown hook executing `stopWatcher`, the WaitGroup will be misused. `stopWatcher` drops `watcherMu` before calling `removeDockerContext`, which executes `r.contextProbeWg.Wait()` without holding `contextMu`. Concurrently, `startWatcherAndSync` acquires `watcherMu` and calls `manageDockerContext`, which executes `r.contextProbeWg.Add(1)` outside of `contextMu`. 

If `Add(1)` fires concurrently with `Wait()` while the counter is 0, Go's WaitGroup panics. To eliminate this race, replace the shared `WaitGroup` with a probe-specific state struct (containing both the cancel func and a dedicated channel or WaitGroup for that specific probe) that can be swapped atomically under `contextMu`.

#### Strengths
The fallback behavior in `probeNamedDockerContext` is exceptionally conservative and safe. By returning `true` (healthy) whenever the endpoint scheme is unhandled (e.g., `ssh://`), or when the context metadata/TLS configs cannot be cleanly loaded, the implementation ensures Rancher Desktop will never overwrite a user's global `config.json` context simply because it failed to parse it. 

### Testing Assessment

1. **Early return on `DOCKER_HOST` / `DOCKER_CONTEXT`**: The daemon skips probing and writing `config.json` if these environment variables are set in its environment. There is currently no BATS test exercising this short-circuit logic. A test should verify that launching the service with `DOCKER_HOST` explicitly present preserves a dummy `currentContext` in `config.json`.
2. **mTLS context parsing**: The logic correctly uses `docker.WithTLSData(s, name, epMeta)` to attach client certificates to the probe. However, there is no unit or integration test simulating a context populated with valid TLS configurations. Ensuring `ep.ClientOpts()` properly passes these materials through to `pingDocker` would prevent regressions where mTLS endpoints are falsely marked unhealthy.

### Documentation Assessment

The internal documentation correctly explains why the logic diverges from standard paths (such as the comments documenting Windows skips or the shell-local `DOCKER_HOST` early return logic). No user-facing API or design docs are impacted directly by this internal fix.

### Acknowledged Limitations

- **`config.json` updates are bypassed if `DOCKER_HOST` or `DOCKER_CONTEXT` is set**: 
  - **Code comment**: `pkg/controllers/app/engine/controllers/engine_reconciler.go:369`: *"DOCKER_HOST and DOCKER_CONTEXT take precedence over config.json... Skip the probe-and-set entirely when either is present."*
  - **PR comment**: The author explicitly notes that this prevents treating a local shell env var as representative of the user's global preference. Since this preserves the user's intent when the daemon inherits a polluted shell environment, it is accepted by design and adequately documented.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — **S2**
- `pkg/controllers/app/engine/controllers/docker_context.go` — **Reviewed, no issues**
- `pkg/controllers/app/engine/controllers/docker_context_test.go` — **S1**
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — **Reviewed, no issues** (WaitGroup race condition addressed under Design Observations)
