# Deep Review: 20260424-130819-pr-333

| | |
|---|---|
| **Date** | 2026-04-24 13:18 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 (of target) |
| **Author** | [@Nino-K](https://github.com/Nino-K) |
| **PR** | [#333](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/333) — Docker Engine: improve Docker context probe safety and isolation |
| **Commits** | `34cb44d` Docker Engine: improve Docker context probe safety and isolation |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — the round-1 cleanup landed cleanly (dead helper removed, bailouts now log, tests isolated via `DOCKER_CONFIG`, flaky test stabilized), but one false-override class the PR claims to close remains (`DOCKER_CONTEXT` env), plus two new `local`-in-`@test` regressions and one round-1 suggestion still unaddressed. |
| **Wall-clock time** | `14 min 24 s` |

## Executive Summary

Round 2 folds in every round-1 fix: `getDockerContextHost` is gone, the five "assume healthy" bailouts all log at `V(1)`, the BATS suite uses `DOCKER_CONFIG=${BATS_FILE_TMPDIR}/...`, and the flaky "no healthy context" test now seeds an unreachable named context so the probe takes the `probeNamedDockerContext` branch regardless of the host's default Docker socket. The core refactor — `ContextStore` + `WithTLSData` + `EndpointMeta.ClientOpts()` — is sound and correctly retires the TCP+TLS / mTLS / `SkipTLSVerify` / SSH false-override classes the PR description calls out.

What remains clusters in three buckets:

1. **One false-override class is still open.** `getCurrentDockerContext()` reads `config.json` only; the Docker CLI's `DOCKER_CONTEXT` env var is ignored. When `DOCKER_CONTEXT=<named-context>` is set in the daemon's (frozen) startup env and `config.json` has no `currentContext`, the code probes the implicit default endpoint via `FromEnv` (which does not consult `DOCKER_CONTEXT`) and — if that default is unreachable — overwrites `config.json` with `rancher-desktop-{instance}`. A user who later unsets `DOCKER_CONTEXT` sees our context as their new default.
2. **Two new BATS regressions.** The round-1 test fix introduced two `local` declarations inside an `@test` block at `engine-docker.bats:790-792`, which violates the repo's documented BATS style (`local` inside `@test` is redundant and explicitly flagged).
3. **Documentation drift and one carry-over.** The old inline comment explaining why empty/`default` were treated identically disappeared with the refactor; the new behavior (probe via `FromEnv`) is not documented at the call site, and `docs/design/bootstrap.md:58,86` still describes the old always-promote semantics. The round-1 S5 TODO(windows) wording fix was not picked up.

Report structure: 0 Critical, 1 Important, 8 Suggestions, 1 Design Concern, 4 Design Strengths.

---

## Critical Issues

None.

## Important Issues

I1. **`DOCKER_CONTEXT` env var is not consulted — residual false-override path** — `pkg/controllers/app/engine/controllers/docker_context.go:88-98` and `engine_reconciler.go:362-383` [Codex GPT 5.5]

```go
current, err := getCurrentDockerContext()
...
if current != "" && current != "default" {
    healthy = probeNamedDockerContext(probeCtx, current)
} else {
    healthy = probeDockerContext(probeCtx)
}
...
if !healthy && probeCtx.Err() == nil {
    if err := setCurrentDockerContext(contextName); err != nil {
```

`getCurrentDockerContext()` reads `cf.CurrentContext` from `config.json` and nothing else (`docker_context.go:88-98`). The Docker CLI resolves the effective context with an env-first precedence: `DOCKER_HOST` bypasses contexts entirely, then `DOCKER_CONTEXT` selects a named context, then `config.json`'s `currentContext`, then the platform default. `dockerclient.FromEnv` (used by `probeDockerContext`) reads `DOCKER_HOST` / `DOCKER_CERT_PATH` / `DOCKER_TLS_VERIFY` but **not** `DOCKER_CONTEXT`.

The unpatched sequence:

1. User sets `DOCKER_CONTEXT=myremote` (e.g., an SSH or mTLS context to a remote Docker), then runs `rdd svc start`. `rdd svc start` spawns the daemon via `exec.Command` with no explicit `Env`, so the daemon inherits `DOCKER_CONTEXT=myremote` in its frozen startup env.
2. `config.json` has no `currentContext`.
3. `getCurrentDockerContext()` returns `""`, the reconciler enters the else branch, and `probeDockerContext` pings the platform default socket.
4. The user has no local Docker (that is why they were using `DOCKER_CONTEXT=myremote` in the first place), so the default socket probe fails.
5. `setCurrentDockerContext("rancher-desktop-{instance}")` overwrites `config.json`.

While `DOCKER_CONTEXT` is set, the user's active context is still `myremote` (env wins). The damage surfaces when the user unsets `DOCKER_CONTEXT`: their previous "default" — the implicit `myremote` — is gone, replaced by `rancher-desktop-{instance}`. This is exactly the false-override class the PR description enumerates for SSH/mTLS contexts.

Fix: resolve the effective context using the daemon's frozen env before probing. The simplest correct stance, given the daemon can only see its spawn-time env, is to not auto-promote when `DOCKER_CONTEXT` is set and `DOCKER_HOST` is not:

```diff
 current, err := getCurrentDockerContext()
 if err != nil {
     log.Error(err, "Failed to read current Docker context")
     return
 }
+// Docker CLI resolves a named context from DOCKER_CONTEXT when
+// config.json has no currentContext; getCurrentDockerContext does
+// not. Consult the env so the probe targets the right endpoint and
+// we do not rewrite config.json to shadow an env-selected context.
+if current == "" && os.Getenv("DOCKER_HOST") == "" {
+    if envCtx := os.Getenv("DOCKER_CONTEXT"); envCtx != "" {
+        current = envCtx
+    }
+}

 var healthy bool
 if current != "" && current != "default" {
     healthy = probeNamedDockerContext(probeCtx, current)
```

Note the prior-round design concern still stands independently: the daemon's env is frozen at spawn, so any user change to `DOCKER_HOST` / `DOCKER_CERT_PATH` / `DOCKER_TLS_VERIFY` / `DOCKER_CONTEXT` after `rdd svc start` is invisible to the probe. The fix above does not close that bigger gap — it only addresses the at-spawn-time case — but it closes the specific false-override path a `DOCKER_CONTEXT`-only user hits today.

`(important, gap)`


## Suggestions

S1. **New `local` declarations inside an `@test` block** — `bats/tests/32-app-controllers/engine-docker.bats:790,792` [Codex GPT 5.5] [Gemini 3.1 Pro]

```bash
@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"
    ...
    local probe_target="test-unreachable"
    run_e -0 docker_context_dir "${probe_target}"
    local probe_dir="${output}"
```

The round-1 test fix added two new `local` declarations at lines 790 and 792 (the diff against the round-1 base confirms both are new). The repo's BATS style rule flags `local` inside an `@test` block as redundant — each `@test` already runs in its own subshell. The pre-existing `local context_name` on line 780 is out of scope for this PR, but the two new ones are.

Fix:

```diff
-    local probe_target="test-unreachable"
+    probe_target="test-unreachable"
     run_e -0 docker_context_dir "${probe_target}"
-    local probe_dir="${output}"
+    probe_dir="${output}"
```

`(suggestion, regression)`

S2. **No unit coverage for the new `probeNamedDockerContext` branches** — `pkg/controllers/app/engine/controllers/docker_context.go:160-201` [Claude Opus 4.7] [Gemini 3.1 Pro]

```go
func probeNamedDockerContext(ctx context.Context, name string) bool {
    ...
    md, err := s.GetMetadata(name)
    if err != nil {
        if errdefs.IsNotFound(err) {
            return false
        }
        ...
    }
    ...
    switch scheme {
    case "unix", "tcp":
    default:
        log.V(1).Info("Non-tcp/unix endpoint scheme; assuming context healthy", "scheme", scheme)
        return true
    }
```

The new function has five distinct return paths (open-store error, `NotFound` → false, decode error, non-tcp/unix scheme → true, TLS/ClientOpts error, unreachable ping). Only the unreachable-unix-socket path has integration coverage via the BATS test; the unit-test file (`docker_context_test.go`) only renames `getDockerContextHost` calls to `testGetContextHost` and does not exercise the new function. The branches most likely to regress on a future rewrite — `errdefs.IsNotFound` returning `false`, and the `ssh://` short-circuit returning `true` — have no unit test anchoring the contract the docstring describes.

Fix: add three table-driven cases to `docker_context_test.go`: (a) seed a context with `Host: "ssh://user@host"` and assert `probeNamedDockerContext` returns `true`; (b) call with a name the store does not contain and assert `false`; (c) seed a `unix://` context pointing at a nonexistent socket and assert `false`. All three can run against the existing `newDockerTestDir` fixture. The TCP+TLS / mTLS / `SkipTLSVerify` paths — the motivation for the PR — deserve coverage too, and a seeded fake context store with TLS data plus an unreachable endpoint would pin the TLS construction without needing a live TLS daemon.

`(suggestion, gap)`

S3. **Design docs and the `manageDockerContext` docstring no longer match the behavior** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:317-320` and `docs/design/bootstrap.md:58,86` [Codex GPT 5.5]

```go
// manageDockerContext creates the instance Docker context and, in a goroutine,
// probes the user's current context; if it is absent or unhealthy, switches
// the default to the new context. At most one probe runs at a time.
```

The function-level comment says RDD switches to its context when the user's current is "absent or unhealthy". After the refactor, "absent" (empty or `"default"` current) no longer unconditionally counts as "unhealthy" — `probeDockerContext(ctx)` pings the implicit default endpoint via `FromEnv` and returns `true` when a foreign daemon responds. `docs/design/bootstrap.md` carries the same stale contract:

- Line 58: *"It creates a `rancher-desktop-2` docker context. If there is no current context, or if the context isn't accepting connections, then it makes the new context the default."*
- Line 86: *"`rdd start` has created the `rancher-desktop-2` docker context, but it will only activate it if the current context doesn't exist, or is non-functional."*

Fix: update the docstring and both design-doc bullets to say empty / `"default"` current is now treated as "use Docker's implicit default endpoint, and only promote RDD when that endpoint does not respond."

`(suggestion, documentation)`

S4. **Test name `Test_getDockerContextHost_missing` refers to a production function that no longer exists** — `pkg/controllers/app/engine/controllers/docker_context_test.go:113` [Claude Opus 4.7]

```go
func Test_getDockerContextHost_missing(t *testing.T) {
    newDockerTestDir(t)
    host, err := testGetContextHost(t, "does-not-exist")
```

The commit deletes `getDockerContextHost` from production code and introduces a test-only `testGetContextHost` helper. The `Test_getDockerContextHost_missing` test calls the helper, not the removed production function, so a future reader grepping for `getDockerContextHost` lands on a test whose subject is gone. The other two call sites (`Test_createReplaceDockerContext`, `Test_deleteDockerContext`) already name their subject by the production function they exercise, so the pattern is inconsistent.

Fix: rename to match the helper, or fold the NotFound case into a `Test_probeNamedDockerContext_missing` that asserts `probeNamedDockerContext` returns `false` for an absent context (and delete the helper, which now has only one caller).

`(suggestion, enhancement)`

S5. **Empty / `"default"` branch loses its inline rationale** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:371-376` [Claude Opus 4.7]

```go
var healthy bool
if current != "" && current != "default" {
    healthy = probeNamedDockerContext(probeCtx, current)
} else {
    healthy = probeDockerContext(probeCtx)
}
```

The old code carried an explicit comment above this block: *"Probe the current context's host — not our own — to decide whether to promote ours. `default` and empty both mean no working context is set; treat them identically."* The refactor deleted the comment, and the new behavior (probe the implicit default via `FromEnv` so a foreign Docker daemon does not get clobbered) now has no call-site explanation. `probeDockerContext`'s own docstring states *what* it does but not *why* the reconciler routes empty/default into it.

Fix: add a one-line comment above the branch — e.g. `// Probe the implicit default endpoint (DOCKER_HOST or platform default socket) so a foreign Docker daemon is not overridden.` — so a reader sees the intent without cross-referencing `probeDockerContext`.

`(suggestion, documentation)`

S6. **TODO(windows) comment overstates what needs platform-specific handling (round 1 S5 unaddressed)** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:322-326` [Claude Opus 4.7]

```go
// TODO(windows): Docker context management uses unix:// sockets; Windows
// requires npipe:// in the context meta file, an npipe-aware probe client,
// and platform-specific path handling in getCurrentDockerContext and
// createReplaceDockerContext. Track in a follow-up issue before shipping
// Windows support.
```

`getCurrentDockerContext` delegates to `docker/cli`'s `config.Load`, and `createReplaceDockerContext` (as well as `newContextStore`) delegates to the store's built-in path handling; both resolve `%USERPROFILE%\.docker` on Windows through `os.UserHomeDir()` with no extra work required on RDD's side. The Windows-specific work concentrates in one line of `createReplaceDockerContext` (the hard-coded `"unix://"` prefix) and in the probe client selection. Round 1 (S5) flagged the same inaccuracy and it was not picked up.

Fix:

```diff
-	// TODO(windows): Docker context management uses unix:// sockets; Windows
-	// requires npipe:// in the context meta file, an npipe-aware probe client,
-	// and platform-specific path handling in getCurrentDockerContext and
-	// createReplaceDockerContext. Track in a follow-up issue before shipping
-	// Windows support.
+	// TODO(windows): createReplaceDockerContext hard-codes "unix://"; Windows
+	// needs "npipe://" plus an npipe-aware probe client in
+	// probeDockerContext / probeNamedDockerContext. Track in a follow-up
+	// issue before shipping Windows support.
```

`(suggestion, enhancement, round 1 S5 unaddressed)`

S7. **"Assume healthy" bailouts log at `V(1)` — consider promoting to the default level** — `pkg/controllers/app/engine/controllers/docker_context.go:165,173,178,190,195` [Claude Opus 4.7]

```go
s, err := newContextStore()
if err != nil {
    log.V(1).Error(err, "Cannot open context store; assuming context healthy")
    return true
}
```

The five bailouts (store open error, metadata read error, endpoint decode error, TLS-data load error, `ClientOpts` error) all log at `V(1)`. `V(1)` is off at the default verbosity, so an operator diagnosing "why didn't RDD promote its context?" sees no trace unless they raise verbosity first. Each bailout fires at most once per probe transition and covers a genuinely unusual condition (corrupt or partial Docker config), so the default level would not be noisy. The non-tcp/unix scheme branch at line 185 is the exception — SSH/npipe are common and the `V(1).Info` there is correct.

Fix: change the five `log.V(1).Error` calls to `log.Error` (default level). Leave the scheme-mismatch `V(1).Info` as-is.

`(suggestion, enhancement)`

S8. **Stale path reference in `docker_context_dir` helper comment** — `bats/tests/32-app-controllers/engine-docker.bats:742-743` [Codex GPT 5.5]

```bash
# docker_context_dir returns the ~/.docker/contexts/meta/<hash> directory for
# the named context. Docker derives the sub-directory from sha256(name).
docker_context_dir() {
    ...
    echo "${DOCKER_CONFIG}/contexts/meta/${hash}"
}
```

The helper's body now returns `${DOCKER_CONFIG}/contexts/meta/${hash}`, but the docstring still says `~/.docker/contexts/meta/<hash>`. The two diverged with the `DOCKER_CONFIG` isolation change.

Fix: update the comment to say `${DOCKER_CONFIG}/contexts/meta/<hash>`, or parameterize the description as "the `contexts/meta/<hash>` directory under `DOCKER_CONFIG`".

`(suggestion, documentation)`

## Design Observations

### Concerns

**Context restoration on shutdown** `(future)` [Gemini 3.1 Pro]

When RDD promotes itself to `currentContext` on startup and later shuts down, `removeDockerContext` clears `currentContext` rather than restoring whatever the user had before. A user whose previous default was set silently loses that selection. Saving the previous value when promoting and restoring it on shutdown would preserve the user's intent across an RDD lifecycle. Out of scope for this PR, but worth tracking.

### Strengths

**Docker CLI primitives are the right abstraction** `(in-scope)` [Claude Opus 4.7] [Codex GPT 5.5] [Gemini 3.1 Pro]

`ContextStore`, `WithTLSData`, and `EndpointMeta.ClientOpts()` already know how to assemble an HTTP client for every endpoint shape docker CLI supports (unix, tcp, tcp+TLS, mTLS, `SkipTLSVerify`). Delegating retires four false-override classes the PR description calls out and keeps RDD aligned with Docker's own context format.

**Two-function split makes the transport decision explicit** `(in-scope)` [Claude Opus 4.7] [Codex GPT 5.5]

Splitting the old single-signature `probeDockerContext(ctx, host)` into `probeDockerContext(ctx)` (FromEnv-backed default probe) and `probeNamedDockerContext(ctx, name)` (store-backed named probe) makes the "did we load TLS for this?" question unmissable at each call site. The old signature silently used `WithHost(host)` even when TLS data was required; the new signatures foreclose that class entirely.

**Scheme gate short-circuits SSH without building a connhelper** `(in-scope)` [Claude Opus 4.7]

`probeNamedDockerContext` checks the URL scheme before calling `WithTLSData`, so SSH and npipe endpoints skip the `commandconn` subprocess construction `ClientOpts` would otherwise build. This keeps the "assume healthy for non-tcp/unix" path cheap and avoids spawning an ssh-dial-stdio helper only to time out.

**`pingDocker` extraction centralizes the timeout idiom** `(in-scope)` [Claude Opus 4.7]

Hoisting the "new client + defer close + Ping within ctx" sequence into `pingDocker` keeps the timeout-creation pattern in one place and makes both probe wrappers one-liners at the call site. The call graph is shallow and the helper has no surprising side effects.

## Testing Assessment

Unit tests in `docker_context_test.go` cover context-store CRUD via the new `testGetContextHost` helper; the BATS suite exercises the reconciler end-to-end against a real Docker daemon inside Lima, and the `DOCKER_CONFIG` isolation is a clear improvement over round 1. Residual gaps, ranked by risk:

1. **`DOCKER_CONTEXT` env-var handling has no coverage** (see I1). A unit test that sets `DOCKER_CONTEXT=myname`, leaves `config.json` empty, and asserts the probe targets `myname` — not the platform default — would pin the I1 fix.
2. **No unit coverage for the TCP+TLS, mTLS, or `SkipTLSVerify` probe paths** — the motivation for the PR. A fake context store seeded with TLS data, probed against an unreachable TLS endpoint, would verify `ep.ClientOpts()` actually constructs the TLS config and the failure is a reachability failure, not a config failure.
3. **No unit coverage for the SSH/npipe short-circuit** (S2). Seeding a context with `Host: "ssh://u@h"` and asserting `probeNamedDockerContext` returns `true` without attempting a ping would lock the branch.
4. **No unit coverage for `probeDockerContext(FromEnv)`** (S2). `DOCKER_HOST=unix:///nonexistent/...` should return `false` — pinning that behavior guards the default-probe contract.
5. **No unit coverage for the `NotFound` → false branch of `probeNamedDockerContext`** (S2). Calling with a missing context should return `false`; this is the one branch whose contract flipped (the old `getDockerContextHost` returned empty + nil; the new function returns `false`).
6. **Early return on `getCurrentDockerContext` error lacks a reconciler-level assertion.** `Test_getCurrentDockerContext_malformedAuth` verifies the error surfaces from the helper, but no test confirms the reconciler refuses to call `setCurrentDockerContext` when the config is malformed.

## Documentation Assessment

- The new `probeNamedDockerContext` docstring at `docker_context.go:152-159` accurately documents the "assume healthy" policy and its rationale.
- The new `probeDockerContext` docstring at `docker_context.go:143-145` states *what* (FromEnv) but not *when* the reconciler calls it — a cross-reference to the `manageDockerContext` call site would help readers (see S5).
- `engine_reconciler.go:317-320` and `docs/design/bootstrap.md:58,86` describe pre-refactor semantics (see S3).
- `engine-docker.bats:742-743` references `~/.docker/contexts/...` although the body now returns `${DOCKER_CONFIG}/contexts/...` (see S8).
- TODO(windows) at `engine_reconciler.go:322-326` overstates the work (see S6, round 1 S5 unaddressed).

## Commit Structure

Single commit, scope matches the title, message enumerates the four independent improvements (FromEnv default probe, ContextStore-based named probe, early return on context-lookup errors, `DOCKER_CONFIG` isolation). The bundling is defensible: the new probe paths depend on the context-store reads, and pinning the store location in tests keeps both changes verifiable together. No objection.

## Acknowledged Limitations

- **Windows Docker context management is still deferred.** Both `manageDockerContext` and `removeDockerContext` early-return on `runtime.GOOS == "windows"`. The TODO calls this out explicitly; see S6 for the wording.
- **SSH/npipe endpoints are trusted implicitly.** `probeNamedDockerContext` returns `true` for any scheme that is not `unix` or `tcp`. The docstring acknowledges the tradeoff: the controller cannot authoritatively probe these endpoints, so it accepts the risk of a dead user-configured context in exchange for not overriding the user's legitimate choice.
- **Conservative "assume healthy" on unreadable store / malformed metadata.** `docker_context.go:154-159` documents this as deliberate — clobbering a partially edited or migrated config is worse than leaving it alone. S7 suggests promoting the bailout logs to the default level so operators can diagnose silent non-promotion.

## Unresolved Feedback

None. The `gh api` fetch of PR review comments returned no inline comments on the current commit.

---

## Agent Performance Retro

### [Claude]

Produced the deepest pass on the round-2 code: caught the stale test-function name (S4), the missing call-site comment for the empty/default branch (S5), the carry-over round-1 S5 TODO wording (S6), and the `V(1)` verbosity question on the new bailout logs (S7). Missed the two `local` declarations at `engine-docker.bats:790,792` despite having produced the BATS test fix suggestion in round 1 — a coverage miss, since Claude marked the BATS file "Reviewed, no issues" while both peers flagged the style regression. Also missed Codex's `DOCKER_CONTEXT` I1; the omission is notable because the prior-round memory already documents the "daemon env is frozen" pattern, which is exactly what I1 turns on. Coverage Summary was complete and cross-referenced finding IDs.

### [Codex]

Delivered the strongest single finding of the round — I1 (`DOCKER_CONTEXT` not consulted) — which is in the exact class the PR claims to close and is the only genuinely new bug surface the round uncovered. Also caught the design-doc drift at `bootstrap.md:58,86` plus the `manageDockerContext` docstring (S3) and the stale `docker_context_dir` helper comment (S8). The diff-only `local` regression (S1) came in parallel with Gemini. Missed the stale test-function name (S4) and the round-1 TODO carry-over (S6), neither of which costs much — both are within Claude's strike zone.

### [Gemini]

Matched peers on the `local` regression and on the missing unit-test coverage. The unique design observation about shutdown not restoring the user's previous `currentContext` is a valid forward-looking concern and worth tracking. Otherwise low yield: both of Gemini's initial "important" findings landed as suggestions after consolidation — the unit-test coverage gap is a testing concern (not a bug), and the `local`-in-`@test` regression is a style-rule violation whose blast radius is minimal (tests still pass). Gemini again skipped `git blame` (known daily-quota behavior); Claude and Codex covered that gap. No coverage misses, no false positives.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 5m 40s | 4m 39s | 4m 36s |
| Findings | 5S | 1I 3S | 2S |
| Tool calls | 20 (Bash 12, Read 7, Grep 1) | 40 (shell 40) | 15 (run_shell_command 7, read_file 4, grep_search 3) |
| Design observations | 3 | 2 | 3 |
| False positives | 0 | 0 | 0 |
| Unique insights | 4 | 3 | 0 |
| Files reviewed | 4 | 4 | 4 |
| Coverage misses | 1 | 0 | 0 |
| **Totals** | **5S** | **1I 3S** | **2S** |
| Downgraded | 0 | 0 | 2 (I→S) |
| Dropped | 0 | 0 | 0 |


**Reconciliation:**

- Gemini P1 I1 (missing unit-test coverage) → consolidated as S2 (suggestion) and merged with Claude P1 S2. Reason: test-coverage gaps do not land an Important absent a concrete regression.
- Gemini P1 I2 (`local` in `@test`) → consolidated as S1 (suggestion) and merged with Codex P1 S2. Reason: repo style violation, tests still pass; regression severity sits at suggestion.

## Review Process Notes

### Skill improvements

- [ ] **Add an "env-based resolution" check to review dimensions when reviewing code that reads configuration from files.** When code loads configuration from a file to pick a default, check whether the surrounding tool's CLI resolves the same value from an environment variable with higher precedence. If so, the file-only read silently ignores the env case. This applies to any `config.json`/YAML/TOML loader that backs a CLI where the CLI also consults env vars. Flag the gap as a false-default path.

### Repo context updates

- [+] **Docker context resolution precedence.** Docker CLI resolves the effective context as: `--context` flag → `--host` flag → `DOCKER_HOST` env (bypasses contexts) → `DOCKER_CONTEXT` env (names a context) → `config.json`'s `currentContext` → platform default. `dockerclient.FromEnv` reads `DOCKER_HOST`/`DOCKER_CERT_PATH`/`DOCKER_TLS_VERIFY` but **not** `DOCKER_CONTEXT`. Any RDD code that reads `config.json` to determine the user's "current" context, or that calls `dockerclient.FromEnv` as a stand-in for the user's environment, must also consult `DOCKER_CONTEXT` or document why it ignores it. Flag probe/override code that only reads `config.json`.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

# Deep Code Review — PR #333 Docker Engine: improve Docker context probe safety and isolation

## Executive Summary

Replaces the hand-rolled `probeDockerContext(host)` helper with two purpose-built probes: `probeDockerContext(ctx)` uses `dockerclient.FromEnv` for the default/empty `currentContext`, and `probeNamedDockerContext` uses docker/cli's `ContextStore` + `WithTLSData` + `ClientOpts` so TCP+TLS / mTLS / `SkipTLSVerify` / SSH contexts no longer probe with the wrong transport. Conservative "assume healthy" fall-through for unparsable or non-tcp/unix endpoints prevents RDD from overwriting contexts it cannot correctly evaluate. Bats tests were rewritten to point at `$DOCKER_CONFIG` so the developer's real `~/.docker` is never touched. The logic is sound; findings are minor.

## Findings

### Critical Issues

None.

### Important Issues

None.

### Suggestions

**S1. Stale test-function name after `getDockerContextHost` removal** — `pkg/controllers/app/engine/controllers/docker_context_test.go:113`

The commit removes `getDockerContextHost` and introduces a test-only `testGetContextHost` helper. The function under test in `Test_getDockerContextHost_missing` no longer exists in the production code — the test exercises the helper.

**S2. No unit coverage for `probeNamedDockerContext` branches** — `pkg/controllers/app/engine/controllers/docker_context.go:160-201`

The new function has five distinct return paths. Only the unreachable-unix-socket happy path is covered by the bats test. The important branches — `errdefs.IsNotFound` returning `false`, and the `ssh://` / other-scheme short-circuit returning `true` — have no unit test anchoring the contract.

**S3. Behavior change for empty / "default" `currentContext` is not captured in the code comment** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:371-376`

Before this commit, the old code never set `healthy` for empty/default, so it stayed `false` and RDD always promoted. The new code probes the default endpoint via `FromEnv`. The rationale is gone from the file.

**S4. TODO comment overstates the work required for Windows** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:322-326`

`getCurrentDockerContext` and `createReplaceDockerContext` delegate all path handling to `docker/cli`. The only Windows-specific work needed is the `npipe://` host in `createReplaceDockerContext`.

**S5. `probeNamedDockerContext` drops the returned error instead of surfacing it** — `pkg/controllers/app/engine/controllers/docker_context.go:160-201`

The function collapses every failure into a `bool`. The decision "we chose not to clobber because the TLS data is unreadable" is a reasonably unusual path worth logging at default level.

## Design Observations

Strengths: scheme gate before `WithTLSData`, `pingDocker` split, tightly scoped probe refactor.

## Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — Reviewed, no issues.
- `pkg/controllers/app/engine/controllers/docker_context.go` — S2, S5.
- `pkg/controllers/app/engine/controllers/docker_context_test.go` — S1, S2.
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — S3, S4.

### [Codex] — Pass 1

### Executive Summary

This change moves Docker context probing onto docker/cli context metadata so TLS and mTLS contexts are handled more safely, and it isolates the BATS Docker config from the developer's real `~/.docker`. The main remaining issue is that the probe still resolves the "current" context from `config.json` only, so an environment-selected Docker context can still be ignored and persistently overwritten.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Environment-selected Docker contexts are still ignored** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:362-380` (important, gap)

`getCurrentDockerContext()` reads only `config.json`, but Docker CLI context resolution can select a named context from `DOCKER_CONTEXT` even when `config.json` has no `currentContext`. In that case the code probes the default endpoint instead of the env-selected named context and can persistently rewrite `currentContext` to Rancher Desktop.

#### Suggestions

S1. **Default-context docs no longer match behavior** — `engine_reconciler.go:371-375` (suggestion, regression)

The new default branch preserves an empty/`default` Docker context when the implicit default endpoint is healthy. That contradicts `docs/design/bootstrap.md:58` and `bootstrap.md:86`.

S2. **New BATS variables use `local` inside an `@test` block** — `bats/tests/32-app-controllers/engine-docker.bats:790-792` (suggestion, regression)

### Design Observations

Strengths: docker/cli `WithTLSData` + `ClientOpts()` is the right abstraction; conservative treatment of SSH/npipe preserves user intent.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — S2
- `pkg/controllers/app/engine/controllers/docker_context.go` — I1
- `pkg/controllers/app/engine/controllers/docker_context_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I1, S1

### [Gemini] — Pass 1

### Executive Summary

This PR improves the robustness and safety of the Docker context probe by utilizing docker/cli context store APIs (`WithTLSData` and `ClientOpts`) to correctly configure the probe client for TLS and mTLS endpoints. It also treats SSH/non-TCP/Unix endpoints as implicitly healthy, preventing Rancher Desktop from falsely overriding the user's working configuration. The core probing logic is currently missing unit test coverage.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Missing test coverage for context probing logic** — `pkg/controllers/app/engine/controllers/docker_context.go:160-193` (important, gap)

`probeNamedDockerContext` introduces non-trivial logic with multiple fallback paths; the test file only renames the old helper. No unit tests were added to exercise the new function.

I2. **Local variables in BATS @test blocks** — `bats/tests/32-app-controllers/engine-docker.bats:790-792` (important, regression)

The repository's BATS style guide prohibits `local` inside `@test` blocks. Lines 790 and 792 introduce new `local` declarations.

### Design Observations

**Concerns:** Context restoration on shutdown (pre-existing) — the daemon overwrites `currentContext` if the user's active context is unhealthy, but on clean shutdown simply clears `currentContext` rather than restoring the user's original.

**Strengths:** Safe failure modes in `probeNamedDockerContext`; cleanup sequencing via `r.contextProbeWg.Wait()`.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats`: **I2**
- `pkg/controllers/app/engine/controllers/docker_context.go`: **I1**
- `pkg/controllers/app/engine/controllers/docker_context_test.go`: **I1**
- `pkg/controllers/app/engine/controllers/engine_reconciler.go`: **Reviewed, no issues**
