# Deep Review: 20260417-123813-pr-320

| | |
|---|---|
| **Date** | 2026-04-17 12:38 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 |
| **Author** | [@Nino-K](https://github.com/Nino-K) |
| **PR** | [#320](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320) — Manage Docker CLI context on ContainerEngineReady transitions |
| **Branch** | `docker-context` |
| **Commits** | `ee25067` Manage Docker CLI context on ContainerEngineReady transitions<br>`fe0962c` Add bats to support the docker context |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — round 1's three structural bugs are gone, but the probe still silently steals the default context, an unrelated read-failure path can clobber the user's config, and BATS tests still mutate the developer's real `~/.docker`. |
| **Wall-clock time** | `15 min 24 s` |

## Executive Summary

Round 2 fixes every structural defect from round 1: the cancel-vs-write
race in the probe goroutine is closed by `contextProbeWg.Wait()` plus a
`probeCtx.Err()` recheck, the hardcoded `unix://` scheme is replaced
with full-host pass-through, `updateDockerConfig` now uses an atomic
tmp-file + rename with a no-op fast path, the Windows divergence is
walled off with `runtime.GOOS == "windows"` early returns, and every
new error site logs through the scoped `logf.FromContext(...)` logger
with `.Error`. The concurrency and lifecycle wiring (`contextMu` +
`contextProbeGen` + `contextProbeWg`) is now solid.

What's left are six issues, none catastrophic but all material:

1. **Default context is still skipped without a probe.** The condition
   `current != "" && current != "default"` jumps straight to "promote
   ours" — so any user with Docker Engine on Linux pointed at the
   default `/var/run/docker.sock` (the most common Linux setup) will
   silently have their `currentContext` overwritten the first time
   `ContainerEngineReady` fires.
2. **The probe drops TLS/SSH transport state.** The host URL alone is
   not enough to dial a context with `TLSData` or SSH options; the
   probe creates a bare client, fails to connect, and silently
   overrides a working remote/TLS context.
3. **A swallowed `getCurrentDockerContext` error can still clobber the
   user's config.** When the first read fails transiently (EBUSY,
   antivirus lock on a network home, etc.), the code logs and proceeds
   with `current=""`; the second read inside `updateDockerConfig`
   succeeds, and the user's chosen `currentContext` is silently
   replaced. Reviewer @mook-as raised this and the author agreed but
   deferred — the fix is one `return`.
4. **BATS tests still operate on the developer's real
   `~/.docker/config.json`.** No `HOME`/`DOCKER_CONFIG` isolation;
   any mid-test failure leaves local Docker state mutated.
5. **`DOCKER_CONFIG` is ignored.** Users with a relocated config dir
   (dev containers, CI, Nix profiles) get a context written under
   `~/.docker` while their CLI reads from `$DOCKER_CONFIG`.
6. **The Windows skip has no `TODO:` or filed issue.** @mook-as's
   approval was conditioned on it ("Just needs to have an issue filed
   to keep track of the Windows thing").

The hand-rolled context store still duplicates docker/cli semantics —
each of I1, I2, I5, and the Windows gap flows from a piece of logic
that `github.com/docker/cli/cli/context/store` already gets right.
That trade-off was made deliberately, but the unfixed bugs above are
all the same shape, so the design observation persists.

Verdict: Merge with fixes. I1 (default), I3 (clobber), I4 (BATS
isolation), and I6 (Windows TODO) are merge-blockers; I2 and I5 can
ship as follow-ups if the design observation about adopting the docker
context store lands.

---

## Critical Issues

None.

---

## Important Issues

I1. **Default/empty context is unconditionally overridden without a probe** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:319-330` [Gemini 3.1 Pro]

```go
// Probe the current context's host — not our own — to decide
// whether to promote ours. "default" and empty both mean no
// working context is set; treat them identically.
var healthy bool
if current != "" && current != "default" {
    currentHost, err := getDockerContextHost(current)
    if err != nil {
        log.Error(err, "Failed to resolve current Docker context host", "context", current)
    } else if currentHost != "" {
        healthy = probeDockerContext(probeCtx, currentHost)
    }
}
```

The comment claims `""` and `"default"` both mean "no working context",
but `default` is a real Docker context: on Linux with Docker Engine
installed it points to `/var/run/docker.sock`, which is typically
running. By skipping the probe entirely for these values, the code
leaves `healthy=false` and then unconditionally writes
`currentContext = "rancher-desktop-<suffix>"`, silently overriding the
user's working local Docker daemon on the most common Linux
configuration. Same gap on a freshly-installed Docker Desktop
alternative where `currentContext` was never set.

This was raised as round 1 S2 and is unaddressed; Gemini correctly
re-flags it (at Critical, downgraded here to Important because
recovery is one `docker context use default` away).

Fix: probe the default endpoint instead of skipping it. The Docker
client's `FromEnv` resolver picks up `DOCKER_HOST` and falls back to
the platform default Unix socket:

```diff
 var healthy bool
 if current != "" && current != "default" {
     currentHost, err := getDockerContextHost(current)
     if err != nil {
         log.Error(err, "Failed to resolve current Docker context host", "context", current)
     } else if currentHost != "" {
         healthy = probeDockerContext(probeCtx, currentHost)
     }
+} else {
+    // "" or "default" → probe the implicit default endpoint
+    // (DOCKER_HOST or the platform default socket).
+    healthy = probeDockerContext(probeCtx, "")
 }
```

And teach `probeDockerContext` to use `dockerclient.FromEnv` when the
host is empty:

```diff
 func probeDockerContext(ctx context.Context, host string) bool {
     probeCtx, cancel := context.WithTimeout(ctx, dockerContextProbeTimeout)
     defer cancel()
-    cli, err := dockerclient.New(dockerclient.WithHost(host))
+    var opts []dockerclient.Opt
+    if host != "" {
+        opts = append(opts, dockerclient.WithHost(host))
+    } else {
+        opts = append(opts, dockerclient.FromEnv)
+    }
+    cli, err := dockerclient.New(opts...)
     ...
 }
```

I2. **Probe drops TLS/SSH transport state, can silently steal healthy non-trivial contexts** — `pkg/controllers/app/engine/controllers/docker_context.go:92-108, 163-172` [Codex GPT 5.4]

```go
func getDockerContextHost(name string) (string, error) {
    ...
    return meta.Endpoints["docker"].Host, nil
}

func probeDockerContext(ctx context.Context, host string) bool {
    ...
    cli, err := dockerclient.New(dockerclient.WithHost(host))
    if err != nil {
        return false
    }
    defer cli.Close()
    _, err = cli.Ping(probeCtx, dockerclient.PingOptions{})
    return err == nil
}
```

`getDockerContextHost` returns only `Endpoints["docker"].Host`, so any
context that depends on extra transport state — TLS certificate paths
in `~/.docker/contexts/tls/<hash>/`, SSH options, mTLS, identity files —
is probed as if it were a bare host URL. The client built with only
`WithHost` cannot complete the TLS handshake or SSH dial, the `Ping`
fails, and the goroutine treats the context as unhealthy and overrides
it. This violates the PR's documented "leave healthy contexts alone"
contract for any user whose `currentContext` is a TCP+TLS or SSH
endpoint (a common pattern for CI runners and remote-engine setups).

Round 1's I2 fix moved the probe from `unix://`-prefix-stripping to
full host pass-through, which closed the `tcp://` URL-construction bug,
but the deeper "context resolution requires more than the host URL"
issue remained.

Fix: reconstruct the full Docker context client from the context
store, not from the host URL alone. The smallest change is to read the
TLS material from `~/.docker/contexts/tls/<hash>/docker/{ca,cert,key}.pem`
when present and pass it via `WithTLSClientConfig`. The cleaner change
is to use `github.com/docker/cli/cli/context/store` (see Design
Observations) so the probe inherits whatever the user configured.

I3. **Swallowed `getCurrentDockerContext` error can clobber the user's `currentContext` on a transient read failure** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:314-337` [Claude Opus 4.7]

```go
current, err := getCurrentDockerContext()
if err != nil {
    log.Error(err, "Failed to read current Docker context")
}

var healthy bool
if current != "" && current != "default" { ... }
if !healthy && probeCtx.Err() == nil {
    if err := setCurrentDockerContext(contextName); err != nil {
        log.Error(err, "Failed to set current Docker context", "context", contextName)
    }
}
```

When `getCurrentDockerContext` fails (transient EBUSY/EIO, antivirus
lock on a CIFS-mounted `$HOME`, fs snapshot, etc.), the code logs and
proceeds with `current=""`. `healthy` stays `false`, and
`setCurrentDockerContext` is called. That call routes through
`updateDockerConfig`, which **reads the file again**. If the second
read succeeds, the user's other keys are preserved but their chosen
`currentContext` is silently replaced with ours. The user picked a
specific context deliberately; the next time they run `docker ps` it
talks to our VM instead.

Reviewer @mook-as raised this at
[discussion_r3102364628](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102364628)
and @Nino-K agreed
([discussion_r3102730266](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102730266):
"I guess it would make sense to return here, since an error here
means the config file exists but could not be read or parsed") but
deferred to a Windows follow-up. Fix is one `return`:

```diff
 current, err := getCurrentDockerContext()
 if err != nil {
-    log.Error(err, "Failed to read current Docker context")
+    log.Error(err, "Failed to read current Docker context; skipping promotion to avoid clobbering user config")
+    return
 }
```

The user can still address our context explicitly via
`DOCKER_CONTEXT=rancher-desktop-<N> docker ...`; not promoting on a
read failure is strictly safer than promoting from a stale `""`.

I4. **BATS tests mutate the developer's real `~/.docker/config.json` without HOME / DOCKER_CONFIG isolation** — `bats/tests/32-app-controllers/engine-docker.bats:12-28, 447-476` [Codex GPT 5.4, Gemini 3.1 Pro]

```bash
local_setup_file() {
    skip_on_windows
    rdd svc delete
    rdd set running=true
    run -0 rdd svc paths docker_socket
    export DOCKER_HOST="unix://${output}"
    ...
}

@test "moby engine sets currentContext when no healthy context exists" {
    ...
    saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)

    jq 'del(.currentContext)' "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
        mv "${HOME}/.docker/config.json.tmp" "${HOME}/.docker/config.json"
    ...
    if [[ -n "${saved_context}" ]]; then
        jq --arg ctx "${saved_context}" ...
    fi
}
```

`local_setup_file` exports `DOCKER_HOST` but does not isolate `HOME`
or `DOCKER_CONFIG`. The "save → mutate → restore" inline block in the
`currentContext` test rewrites the developer's real `~/.docker/config.json`
in place, and the restore only runs when control reaches the bottom of
the test — any failed assertion above leaves `currentContext` cleared
on the developer's machine. The base BATS `teardown_file` only stops
`rdd`. There is no safety net.

This was raised as round 1 S5 and is unaddressed.

Fix (in order of preference):

1. Override `DOCKER_CONFIG` (or `HOME`) in `local_setup_file` to a
   per-suite temp directory. Once I5 lands, exporting `DOCKER_CONFIG`
   is the cross-platform-clean version (Windows uses `USERPROFILE`,
   not `HOME`).
2. Move the save/restore into a BATS `teardown` so it runs on failure
   too. The `set -e`/`set -o pipefail` framing inside `try` means a
   failed assertion above the inline restore aborts the test body
   without touching the cleanup.

I5. **`DOCKER_CONFIG` environment variable is ignored** — `pkg/controllers/app/engine/controllers/docker_context.go:36-42` [Gemini 3.1 Pro]

```go
func dockerConfigDir() (string, error) {
    home, err := os.UserHomeDir()
    if err != nil {
        return "", err
    }
    return filepath.Join(home, ".docker"), nil
}
```

The Docker CLI honours `$DOCKER_CONFIG` to relocate the config
directory; it's standard for dev containers, CI runners, Nix profiles,
and any user who keeps `$HOME` clean. With this implementation, the
context meta file lands under `~/.docker/contexts/meta/<hash>/`
while the user's CLI reads from `$DOCKER_CONFIG/contexts/meta/<hash>/`,
so the new context is invisible. The promote step writes to
`~/.docker/config.json` for the same reason, and the user's actual
config dir is left untouched.

This was raised as round 1 S3 and is unaddressed.

Fix:

```diff
 func dockerConfigDir() (string, error) {
+    if dir := os.Getenv("DOCKER_CONFIG"); dir != "" {
+        return dir, nil
+    }
     home, err := os.UserHomeDir()
     if err != nil {
         return "", err
     }
     return filepath.Join(home, ".docker"), nil
 }
```

Side benefit: the unit tests that currently do `t.Setenv("HOME", ...)`
become `t.Setenv("DOCKER_CONFIG", ...)`, which is portable to the
Windows test runner where `os.UserHomeDir()` consults `USERPROFILE`.

I6. **No `TODO:` or filed issue marks the Windows no-op path** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:277-281, 345-347` [Claude Opus 4.7]

```go
func (r *EngineReconciler) manageDockerContext(socketPath string) {
    // Docker context management uses unix:// sockets; Windows requires
    // npipe:// which is not yet implemented.
    if runtime.GOOS == "windows" {
        return
    }
```

@mook-as approved the PR conditionally
([discussion_r3101962597](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3101962597)):
"Just needs to have an issue filed to keep track of the Windows thing
I think." Neither a `TODO:` marker that `grep TODO` would surface nor
a referenced issue number appears in the code. Without one, the
asymmetry is invisible to a future contributor scanning for follow-up
work.

The comment also understates the gap: only the URL scheme is called
out, but `probeDockerContext` would also need a different client
construction (named-pipe transport), and `getCurrentDockerContext` /
`createReplaceDockerContext` would need npipe-aware host
serialization.

Fix: file a tracking issue and reference it, or add a grep-able
marker in both functions:

```diff
 func (r *EngineReconciler) manageDockerContext(socketPath string) {
-    // Docker context management uses unix:// sockets; Windows requires
-    // npipe:// which is not yet implemented.
+    // TODO(windows): Docker context management uses unix:// sockets;
+    // Windows requires npipe:// in the meta file, an npipe-aware probe
+    // client, and platform-specific path handling. Tracked in #<N>.
     if runtime.GOOS == "windows" {
         return
     }
```

Apply the same TODO + issue reference to `removeDockerContext`.

---

## Suggestions

S1. **`createReplaceDockerContext` is not atomic** — `pkg/controllers/app/engine/controllers/docker_context.go:55-75` [Claude Opus 4.7]

```go
func createReplaceDockerContext(name, socketPath string) error {
    path, err := contextMetaPath(name)
    ...
    return os.WriteFile(path, append(data, '\n'), 0o600)
}
```

`updateDockerConfig` (lines 207-225) goes to considerable lengths to do a
tmp-file + rename for `config.json`. `createReplaceDockerContext` does
not — `os.WriteFile` truncates before writing, so a crash mid-write
leaves `meta.json` zero- or partial-length. The next start re-creates
the file, so this is self-healing for our use, but a `docker context
inspect rancher-desktop-<N>` from the user during the window returns a
parse error.

Fix: extract the tmp-file + rename pattern from `updateDockerConfig`
into a `writeFileAtomic(path string, data []byte, perm os.FileMode)`
helper and use it from both call sites. Keeps the atomicity invariant
uniform across both files we own under `~/.docker/`.

S2. **`removeDockerContext` runs unconditionally on every non-moby reconcile** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:158-163` [Claude Opus 4.7]

```go
} else {
    // The watcher was never started or died on its own (e.g. the VM
    // socket closed). stopWatcher was not called, so clean up the
    // Docker context directly.
    r.removeDockerContext()
}
```

Every reconcile with `!wantWatcher && !watcherRunning` (containerd
mode, or stopped) now calls `removeDockerContext`. The no-op fast
path in `updateDockerConfig` (mutate returns false → no write) keeps
this cheap, but it still issues a `ReadFile` on
`~/.docker/config.json` and an `os.RemoveAll` on the contexts dir on
every App event for a user who has never run moby. Stable terminal
state should be I/O-free.

Fix: gate `removeDockerContext()` behind the same `!alreadyClean`
condition that already guards `cleanupMirrorResources`:

```diff
 if !wantWatcher {
     if watcherRunning {
         log.Info("Stopping Docker watcher", ...)
         r.stopWatcher()
-    } else {
-        // The watcher was never started or died on its own (e.g. the VM
-        // socket closed). stopWatcher was not called, so clean up the
-        // Docker context directly.
-        r.removeDockerContext()
     }
     ...
     alreadyClean := !watcherDied && current != nil && current.Reason == terminalReason && current.Status == terminalStatus
     if !alreadyClean {
+        if !watcherRunning {
+            r.removeDockerContext()
+        }
         if err := r.cleanupMirrorResources(ctx); err != nil {
             ...
```

S3. **Dead `contextsDir` field in test helper** — `pkg/controllers/app/engine/controllers/docker_context_test.go:17-32` [Claude Opus 4.7]

```go
type dockerTestPaths struct {
    configFile  string
    contextsDir string
}
```

`contextsDir` is assigned but never read by any test in the file
(round 1 S9 was the same finding; still unaddressed). Drop it:

```diff
 type dockerTestPaths struct {
     configFile  string
-    contextsDir string
 }
@@
     return dockerTestPaths{
         configFile:  filepath.Join(dockerDir, "config.json"),
-        contextsDir: filepath.Join(dockerDir, "contexts"),
     }
```

S4. **`rdd service paths` in one site, `rdd svc paths` everywhere else** — `bats/tests/32-app-controllers/engine-docker.bats:441` [Claude Opus 4.7]

```bash
    run -0 rdd service paths docker_socket
```

Both spellings work (`svc` is an alias). Line 22 in the same file
uses `rdd svc paths docker_socket`, and the rest of the bats suite
uses `svc`. Keep consistent:

```diff
-    run -0 rdd service paths docker_socket
+    run -0 rdd svc paths docker_socket
```

S5. **`$(jq ...)` command substitution where `run -0` is the convention** — `bats/tests/32-app-controllers/engine-docker.bats:453` [Codex GPT 5.4, Gemini 3.1 Pro]

```bash
local saved_context
saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
```

Repo BATS style forbids `$(cmd)` for assertion-feeding values: failures
inside `$(...)` produce opaque BATS errors with no output, and the
`|| true` makes a real failure indistinguishable from a missing
file. Reviewer @mook-as flagged the same line at
[discussion_r3096228639](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096228639);
unaddressed.

Fix: branch on file existence and capture via `run -0`:

```diff
-    local saved_context
-    saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
+    local saved_context=""
+    if [[ -f "${HOME}/.docker/config.json" ]]; then
+        run -0 jq -r '.currentContext // empty' "${HOME}/.docker/config.json"
+        saved_context=${output}
+    fi
```

Once I4 lands, the whole save/restore block disappears (the
DOCKER_CONFIG-isolated dir is discarded by the temp-dir cleanup).


---

## Design Observations

**Concerns**

- *Hand-rolled Docker context store still duplicates docker/cli semantics* `(in-scope)` [Codex GPT 5.4, Claude Opus 4.7] — Round 1 raised this; round 2 leaves it unchanged. Each of I1 (default-context resolution), I2 (transport reconstruction), I5 (DOCKER_CONFIG), and the Windows gap is a piece of logic that `github.com/docker/cli/cli/context/store` already gets right. The author argued in [discussion_r3096716690](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096716690) that the dependency cost is too high for ~30 lines we control, and the trade-off is defensible — but the unfixed bugs above are all the same shape, and each follow-up round is likely to add more of them. Worth a re-evaluation once a decision on TLS-context probing is made: writing a bug-free TLS resolver from scratch is harder than swallowing the dep.

- *`ContainerEngineReady=True` fires before the probe completes* `(future)` [Claude Opus 4.7] — @mook-as noted at [discussion_r3096360169](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096360169) that "ContainerEngineReady shouldn't be set until the the context is set up correctly." The `meta.json` is written synchronously inside `startWatcherAndSync` before `Connected` is asserted, so `DOCKER_CONTEXT=rancher-desktop-<N> docker ...` works the moment `ContainerEngineReady=True` lands. Only the `currentContext` promotion is async — which is unavoidable given a 3s probe timeout. Current behavior is consistent with the design in issue #318: promotion is an ergonomic effect, not a precondition for engine readiness. Worth a comment near the condition update so future readers don't try to "fix" the asynchrony.

**Strengths**

- *Round 1's three structural bugs are cleanly fixed* [Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro] — `contextProbeWg` + `probeCtx.Err()` recheck closes the cancel-vs-write race; full-host pass-through replaces the `unix://` strip-and-reapply for non-Unix endpoints; `updateDockerConfig` now uses tmp-file + rename with a `mutate(cfg) bool` no-op fast path; `runtime.GOOS == "windows"` early-returns wall off both lifecycle entry points; every new error site logs through the scoped `logf.FromContext(...)` logger with `.Error`. The fixes match the round 1 finding text closely.
- *Goroutine concurrency design is well-reasoned* [Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro] — the `contextProbeGen` / `contextProbeCancel` / `contextProbeWg` trio handles all the tricky races: superseding probes (gen compare in defer), blocking the cleanup until the probe cannot write, and the `probeCtx.Err()` recheck right before the write. The field-level comments at `engine_reconciler.go:95-107` are genuinely helpful for future readers.
- *`updateDockerConfig` preserves unrelated keys and writes atomically* [Codex GPT 5.4, Claude Opus 4.7] — the right baseline for any mutation of `~/.docker/config.json`. The no-op fast path means stable terminal state (after I4 lands) costs zero I/O.

---

## Testing Assessment

The new unit tests cover the file-shape helpers in isolation
(`updateDockerConfig` set/clear/no-op, `createReplaceDockerContext`,
`getCurrentDockerContext`) and the BATS additions exercise create,
async promote, and remove. The gaps are the same shape as round 1,
concentrated on the probe goroutine and the integration-with-real-CLI
paths:

1. **No test for the `getCurrentDockerContext` error path (I3)** — a
   malformed or unreadable `config.json` should not cause
   `setCurrentDockerContext` to silently clobber. Even with the I3
   fix (early return), a regression test guarding the behavior is
   cheap.
2. **No test for the default-context skip (I1)** — a `currentContext`
   of `""` or `"default"` pointing at a healthy default endpoint
   should remain selected. No unit or integration test exercises this.
3. **No test for non-`unix://` current contexts (I2)** — TLS, SSH,
   and TCP probe behavior is unverified.
4. **No test for the generation race** — two back-to-back
   `manageDockerContext` calls verifying that the second probe wins
   and the first cannot write post-supersede. The subtlest piece of
   the concurrency design is currently untested.
5. **No test for `removeDockerContext` with a probe in flight** — the
   whole point of `contextProbeWg.Wait()` is to block delete until the
   probe finishes. A test that stages a slow probe (fake hung socket)
   and verifies the cleanup blocks would exercise this directly.
6. **No `DOCKER_CONFIG`-isolated test setup (I4)** — mutating the
   developer's real `~/.docker` is a structural test-isolation gap,
   not just a per-test bug.
7. **Windows skip path (I6)** — the bats suite has a file-level
   `skip_on_windows`; the production-code Windows no-op is unexercised
   in any form. A unit test asserting `manageDockerContext(...)` is a
   safe nil on `runtime.GOOS == "windows"` would close this until the
   npipe path lands.

---

## Documentation Assessment

- `docs/design/api_app.md:39` and `docs/design/bootstrap.md:58` still
  attribute Docker context creation to the App controller / `rdd
  start` flow. This PR places the responsibility in the engine
  reconciler. The mismatch was raised in round 1 and is unaddressed.
  Either update the docs or move the code (the engine reconciler is
  the right home — it owns `ContainerEngineReady`).
- No godoc on `createReplaceDockerContext`, `deleteDockerContext`,
  `setCurrentDockerContext`, `clearCurrentDockerContext`,
  `dockerConfigDir`, or `dockerContextMeta`/`dockerEndpointMeta`. The
  rest of the helpers in the same file have one. Add for consistency.
- `manageDockerContext`'s docstring at `engine_reconciler.go:273-275`
  says "At most one probe runs at a time." Mechanically accurate via
  the generation counter; worth noting that the WaitGroup count *can*
  briefly exceed 1 during transitions (acknowledged in PR review),
  and that `Wait()` only cares about reaching zero.

---

## Commit Structure

Two commits:
- `ee25067` Manage Docker CLI context on ContainerEngineReady transitions — Go code, unit tests
- `fe0962c` Add bats to support the docker context — integration tests

Clean split, readable in either order; no fixup noise. Round 1
suggested squashing for `git bisect` cleanliness; round 2 keeps the
split, which is also fine — the bats commit alone is a coherent
addition. Not blocking.

---

## Acknowledged Limitations

- **Windows not implemented** — `engine_reconciler.go:277-281, 345-347`: both lifecycle functions early-return on `runtime.GOOS == "windows"`. The comment claims only the URL scheme is the gap; in practice `probeDockerContext` and the meta-file serialization would also need npipe-aware paths. See I6 for the missing TODO/issue marker.
- **Non-atomic `createReplaceDockerContext`** — `docker_context.go:55-75`: author consciously routed `config.json` through an atomic helper but left `meta.json` with a direct `os.WriteFile`. See S1 for the gap.
- **`contextProbeWg` count can briefly exceed 1** — `engine_reconciler.go:301-303`: @mook-as flagged this at [discussion_r3102603375](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102603375); @Nino-K confirmed at [discussion_r3102753949](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102753949) that `Wait()` only cares about reaching zero. Analysis agrees: the set-currentContext write is idempotent (same `contextName` from a stable `instance.Name()`), so a second goroutine writing the same value is harmless.
- **No file-level locking around `~/.docker/config.json`** — `docker_context.go:175`: @mook-as raised and self-resolved at [discussion_r3102609534](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102609534) since the Docker CLI itself does not lock; equivalent guarantee is the best we can offer.
- **Not adopting `github.com/docker/cli`'s `ContextStore`** — `docker_context.go:25`: @Nino-K's response at [discussion_r3096716690](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096716690) explains the dep cost. Defensible, but see Design Concerns — the unfixed bugs are all in scope of what the dep would handle.
- **One-shot probe (no continuous reassertion)** — [discussion_r3101959340](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3101959340): a polling loop fighting the user's later context changes was explicitly rejected. Implementation matches.

---

## Unresolved Feedback

- [discussion_r3101962597](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3101962597) (@mook-as — TODO or filed issue for Windows): **promoted to I6** — @mook-as's approval was conditioned on it.
- [discussion_r3102364628](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102364628) (@mook-as — continue-on-read-failure clobber risk): **promoted to I3** — the author agreed in [discussion_r3102730266](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102730266) but deferred. The fix is one `return`.
- [discussion_r3096228639](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096228639) (@mook-as — `run -0` instead of `$()`): **promoted to S5** — unaddressed.
- [discussion_r3096360169](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096360169) (@mook-as — ContainerEngineReady timing): see Design Observations. Architectural; not a regression.
- [discussion_r3096335657](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096335657) (@mook-as — `yq --inplace`): low-priority. The current `jq … > tmp && mv tmp` pattern is atomic on the same filesystem on Unix, so safe in practice. Will be obviated by I4's HOME/DOCKER_CONFIG isolation.


---

## Agent Performance Retro

**[Claude]** Took the longest by far and used the budget on round 2-specific work: read the PR review thread carefully and promoted two unresolved-but-agreed reviewer comments (the swallowed read error → I3, the missing Windows TODO → I6) into Important findings the other agents missed entirely. I3 is the most consequential finding in this round — silent data corruption on a transient read failure, with reviewer concurrence on the fix — and only Claude raised it. Caught two cleanup misses that round 1 also flagged but the author left in place (`contextsDir` unused, `rdd service` vs `rdd svc`). Did *not* re-flag the default-context-skip (round 1 S2) or DOCKER_CONFIG miss (round 1 S3) — those were both unaddressed and a thorough re-read of round 1's open suggestions would have caught them.

**[Codex]** Fastest of the three, with the tightest finding-to-noise ratio. Independently identified the TLS/SSH transport drop (I2) as the deeper version of round 1's I2 fix — a real upgrade in framing, since the simple "non-`unix://` URL" framing from round 1 has been closed but the broader transport-state issue remained. Caught the BATS isolation regression (I4 with Gemini). Did not raise the default-context skip, the DOCKER_CONFIG miss, the swallowed-read-error clobber, or the missing Windows TODO. Spent ~5MB of stderr on its session log; the actual review was 109 lines.

**[Gemini]** Correctly re-flagged the two round-1 suggestions the author skipped (default-context skip → C1, DOCKER_CONFIG ignored → I2). Severity calibration was hot again: marked the default skip as Critical (downgraded here to Important — recoverable via one CLI command) and produced one false positive (C2: "`os.Rename` fails on Windows when destination exists") that cited non-existent "system instructions" and contradicts Go's well-documented Windows behavior (`MoveFileEx` with `MOVEFILE_REPLACE_EXISTING`). The hallucinated source is a calibration concern; the underlying finding does not survive verification. Did not flag the swallowed-read-error or the missing Windows TODO.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 8m 10s | 7m 13s | 4m 51s |
| Findings | 2I 4S | 2I 1S | 2I 1S |
| Tool calls | 34 (Bash 16, Grep 12, Read 6) | 56 (shell 52, stdin 4) | 18 (run_shell_command 11, grep_search 6, write_file 1) |
| Design observations | 4 | 3 | 2 |
| False positives | 0 | 0 | 1 |
| Unique insights | 4 | 1 | 2 |
| Files reviewed | 4 | 4 | 4 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **2I 4S** | **2I 1S** | **2I 1S** |
| Downgraded | 0 | 0 | 1 (I→S) |
| Dropped | 0 | 0 | 1 |


Three of the most material round-2 findings (I3 swallowed-read clobber, I1 default-skip, I6 Windows TODO) were each found by exactly one agent — no agent caught all three. The ensemble's value here was the union, not the intersection: any single-agent run would have left at least one Important issue on the floor. Claude's PR-review-thread reading was the highest-leverage technique this round.

### Reconciliation

- Gemini C1 default-context skip → Important (I1). Recoverable with one CLI command; not data loss.
- Gemini C2 `os.Rename` Windows failure → **rejected as false positive**. Go's Windows `os.Rename` uses `MoveFileEx` with `MOVEFILE_REPLACE_EXISTING` and replaces existing files; the cited "system instructions" do not exist in this repo.


---

## Review Process Notes

**Skill improvements**

- *On a re-review (Round ≥ 2), explicitly re-check every Suggestion from the prior round against the current diff. Suggestions that were not addressed should either be re-raised at the same severity (with a "round N S<X> unaddressed" note) or explicitly demoted with reasoning. The author has had a chance to triage them; persistent suggestions are a stronger signal than first-time ones, and silently dropping them lets unaddressed issues accumulate across rounds.* In this review, two prior-round suggestions (default-context skip, DOCKER_CONFIG) survived only because Gemini happened to re-find them — Claude and Codex both missed them. A re-check pass against the prior report would catch these systematically rather than relying on agent recall.

- *Reject findings that cite non-existent "system instructions" or repo conventions. Treat unverified citations as a calibration signal — when an agent grounds a finding in a fabricated source, the underlying claim itself often does not survive verification (the agent reached for evidence because the claim felt thin).* Specifically: when a finding's framing relies on a documented behavior or a repo rule, verify the rule exists before accepting the finding. If the citation is hallucinated, downgrade to "verify independently" rather than accept on the agent's confidence alone.


---

## Appendix: Original Reviews

### [Claude]

Let me verify a few things about the code — particularly consistency of CLI commands, usage of `unix://` on Windows, and error paths.

Let me check one more thing — the test isolation via HOME env var.

Now let me verify how `mgr.GetLogger()` is available before controllers run — specifically whether `watcherCtx` carries the logger properly at the reconcile-time:

I have enough context now. Let me write the review.

---

# Review — PR #320: Manage Docker CLI context on ContainerEngineReady transitions

## Verdict

| | |
|---|---|
| **Recommendation** | Request changes (1 important data-loss risk already flagged by @mook-as has no code fix) |
| **Critical** | 0 |
| **Important** | 2 |
| **Suggestions** | 4 |

## Executive Summary

The PR introduces a Docker CLI context synchronized with `ContainerEngineReady` transitions, with a background probe that promotes the new context when the user has no healthy default. Correctness of the goroutine/mutex/generation machinery is solid and well reasoned. The two real concerns are a swallowed read error that can lead to clobbering a healthy but temporarily-unreadable `config.json`, and a few acknowledged-but-not-yet-tracked follow-ups (Windows, atomic write, TODO comment).

## Findings

### Critical Issues

None.

### Important Issues

**I1. Swallowed `getCurrentDockerContext` error can still lead to clobber on a transient read failure** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:314-337` (important, regression)

```go
current, err := getCurrentDockerContext()
if err != nil {
    log.Error(err, "Failed to read current Docker context")
}

// Probe the current context's host — not our own — to decide ...
var healthy bool
if current != "" && current != "default" {
    ...
}
// Guard against writing currentContext after removeDockerContext has
// already cancelled probeCtx and deleted the context directory.
if !healthy && probeCtx.Err() == nil {
    if err := setCurrentDockerContext(contextName); err != nil {
        log.Error(err, "Failed to set current Docker context", "context", contextName)
    }
}
```

When `getCurrentDockerContext` fails (line 314), the code logs and proceeds with `current=""`. The follow-on `setCurrentDockerContext` call will re-read the same file through `updateDockerConfig`; on a persistent corruption it fails benignly. But the failure cases are not symmetric:

- **Transient I/O error** (e.g., brief EBUSY/EIO, fs snapshot, antivirus lock on CIFS/SMB-mounted `$HOME`): first read fails → `current=""`; second read, seconds later inside `updateDockerConfig`, succeeds → the file is **overwritten with the user's original keys minus their `currentContext`, replaced by ours**. This is silent loss of a user-chosen healthy context they picked deliberately.
- **Permission error** on read but not on write (unlikely but possible mid-process owner/ACL change): same outcome.

@mook-as raised this at `engine_reconciler.go:316` (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102364628). @Nino-K agreed (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102730266: "I guess it would make sense to return here, since an error here means the config file exists but could not be read or parsed") but deferred to the Windows follow-up. Per the repo's CLAUDE.md note about data integrity and atomic `config.json` edits (jandubois' general comment), this deserves fixing in this PR rather than deferring.

Fix:

```diff
 current, err := getCurrentDockerContext()
 if err != nil {
-    log.Error(err, "Failed to read current Docker context")
+    log.Error(err, "Failed to read current Docker context; skipping promotion to avoid clobbering the user's config")
+    return
 }
```

Rationale: the probe exists to *help* the user; if we cannot read their intent, the safe action is to do nothing. The context meta file still exists, so `DOCKER_CONTEXT=rancher-desktop-<N> docker …` works regardless.

---

**I2. No TODO or tracked issue reference for Windows support in `manageDockerContext` / `removeDockerContext`** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:277-281, 345-347` (important, gap)

```go
func (r *EngineReconciler) manageDockerContext(socketPath string) {
	// Docker context management uses unix:// sockets; Windows requires
	// npipe:// which is not yet implemented.
	if runtime.GOOS == "windows" {
		return
	}
```

@mook-as requested a grep-able `TODO:` marker or filed issue (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3101962597) and his approval was conditioned on it ("Just needs to have an issue filed to keep track of the Windows thing I think"). Neither a TODO nor an issue number appears in the submitted code.

This matters more than style: Windows users will silently have no Docker context management — no meta file, no promotion, no teardown. With no marker, a future grep will miss it, and the asymmetry is easy to forget.

Fix: either file an upstream issue and reference it, or insert a `// TODO(windows):` comment at both branches:

```diff
 func (r *EngineReconciler) manageDockerContext(socketPath string) {
-	// Docker context management uses unix:// sockets; Windows requires
-	// npipe:// which is not yet implemented.
+	// TODO(windows): Docker context management uses unix:// sockets;
+	// Windows requires npipe:// encoding in the context meta file and
+	// different probe client construction. Not yet implemented.
 	if runtime.GOOS == "windows" {
 		return
 	}
```

Apply the same `TODO(windows)` comment at `removeDockerContext` (line 345).

### Suggestions

**S1. `createReplaceDockerContext` is not atomic** — `pkg/controllers/app/engine/controllers/docker_context.go:55-75` (suggestion, gap)

```go
func createReplaceDockerContext(name, socketPath string) error {
	path, err := contextMetaPath(name)
	...
	return os.WriteFile(path, append(data, '\n'), 0o600)
}
```

`updateDockerConfig` goes to considerable length to do atomic tmp-file + rename for `config.json` (line 207-225), matching @jandubois' general PR comment that "any replacement must be done atomically." `createReplaceDockerContext` does not — `os.WriteFile` truncates before writing, so a crash mid-write leaves a zero- or half-length `meta.json`.

Since our code always fully rewrites this file at start and re-creates it after `deleteDockerContext`, a corrupt-mid-write state is self-healing on the next start. But the `docker` CLI itself may fail while rdd is down if the user runs `docker context inspect rancher-desktop-2`. Given the atomic helper already exists in the same file, routing through it or a parallel `writeFileAtomic` keeps the invariant uniform.

Fix: extract the tmp-file + rename pattern from `updateDockerConfig` into a small helper (e.g., `writeFileAtomic(path string, data []byte, perm os.FileMode) error`) and call it from both places.

---

**S2. `removeDockerContext` runs unconditionally on every non-moby reconcile** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:158-163` (suggestion, regression)

```go
} else {
    // The watcher was never started or died on its own (e.g. the VM
    // socket closed). stopWatcher was not called, so clean up the
    // Docker context directly.
    r.removeDockerContext()
}
```

Every reconcile with `!wantWatcher && !watcherRunning` (containerd mode, or stopped) now calls `removeDockerContext`, which reads `~/.docker/config.json` and stats `~/.docker/contexts/meta/<hash>/`. For a user who has never used moby and only uses containerd, this is file I/O on the user's dotfile on every App event, forever — mostly no-ops, but the `ReadFile` fires every time.

The existing `alreadyClean` guard (line 179) skips the mirror sweep when the terminal condition is already published. The same guard can gate the context cleanup:

```diff
 if !wantWatcher {
     if watcherRunning {
         log.Info("Stopping Docker watcher", ...)
         r.stopWatcher()
-    } else {
-        // The watcher was never started or died on its own (e.g. the VM
-        // socket closed). stopWatcher was not called, so clean up the
-        // Docker context directly.
-        r.removeDockerContext()
     }
     ...
     alreadyClean := !watcherDied && current != nil && current.Reason == terminalReason && current.Status == terminalStatus
     if !alreadyClean {
+        if !watcherRunning {
+            r.removeDockerContext()
+        }
         if err := r.cleanupMirrorResources(ctx); err != nil {
             ...
```

This keeps the semantics (stopWatcher path still gets cleanup; steady-state terminal reconciles skip it).

---

**S3. Dead `contextsDir` field in test helper** — `pkg/controllers/app/engine/controllers/docker_context_test.go:19, 30` (suggestion, regression)

```go
type dockerTestPaths struct {
	configFile  string
	contextsDir string
}
```

`contextsDir` is assigned (line 30) but never read anywhere in the test file (confirmed via grep). Remove:

```diff
 type dockerTestPaths struct {
 	configFile  string
-	contextsDir string
 }
@@
 	return dockerTestPaths{
 		configFile:  filepath.Join(dockerDir, "config.json"),
-		contextsDir: filepath.Join(dockerDir, "contexts"),
 	}
```

---

**S4. `rdd service paths` in one test site, `rdd svc paths` everywhere else** — `bats/tests/32-app-controllers/engine-docker.bats:441` (suggestion, regression)

```bash
    run -0 rdd service paths docker_socket
```

Both spellings work (`svc` is an alias). The only other instance in this file is line 22 (`rdd svc paths docker_socket`), and the rest of the repo's bats suite uses `svc`. Keep the file consistent:

```diff
-    run -0 rdd service paths docker_socket
+    run -0 rdd svc paths docker_socket
```

## Design Observations

### Strengths

- The `contextProbeGen`/`contextProbeCancel`/`contextProbeWg` trio correctly handles all the tricky races: superseding probes (gen compare in the defer), blocking `removeDockerContext` until the probe cannot write anymore, and the `probeCtx.Err() == nil` re-check right before `setCurrentDockerContext`. The comments on the fields (`engine_reconciler.go:95-107`) are genuinely helpful.
- Splitting the pure file-shape code into `docker_context.go` with a clean unit-test surface (no HOME/env tricks for `updateDockerConfig`) is a nice separation — the tests exercise the real logic of the atomic write without needing a Docker CLI.
- Treating `""` and `"default"` identically in the probe (`engine_reconciler.go:323`) is the right call and the comment explains why.

### Concerns

- **ContainerEngineReady fires before the probe completes.** @mook-as noted at line 465 of the bats file (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096360169) that "ContainerEngineReady shouldn't be set until the the context is set up correctly." The `meta.json` *is* written synchronously inside `startWatcherAndSync` before the reconciler re-asserts `Connected`, so `DOCKER_CONTEXT=rancher-desktop-<N> docker …` works the moment `ContainerEngineReady=True` fires. Only the `currentContext` promotion is async — which is unavoidable given a 3s probe timeout. The current behavior is therefore consistent with the design in issue #318 ("switch the default to the new context" is an ergonomic-but-secondary effect, not a precondition for the engine being ready). Worth an explicit comment near the condition update clarifying this, so future readers don't try to "fix" the asynchrony. `(future)`

## Testing Assessment

The unit tests (`docker_context_test.go`) cover the atomic-write happy paths and the no-op short-circuits well. The bats tests cover creation, promotion (async via `try`), and deletion. Ranked gaps:

1. **No unit test for the generation race**: two back-to-back `manageDockerContext` calls (e.g., watcher-died-then-restarted), verifying that the *second* probe is the one that determines `currentContext` and the first cannot write post-supersede. This is the subtlest piece of the concurrency design and currently has no test.
2. **No test for `removeDockerContext` when a probe is in-flight**: the whole point of the `contextProbeWg.Wait()` is to block the delete until the probe is done. A test that starts a probe against a slow/hung fake socket, calls `removeDockerContext`, and verifies the probe neither writes nor logs-after-delete would exercise this directly.
3. **No test for the `getCurrentDockerContext` error path** (related to I1): a malformed `config.json` should not cause `setCurrentDockerContext` to silently clobber. Even with the fix for I1 (early return), a regression test guarding this behavior is cheap.
4. **No negative bats test for Windows skip**: currently the bats suite has `skip_on_windows` at the file level, so the explicit Windows no-op isn't exercised. Out of scope for this PR, but worth a `unit-test` guarding that `manageDockerContext(... )` returns nil-error on `runtime.GOOS == "windows"` (can be simulated by a `goos` build-tag split file down the line).

## Documentation Assessment

- `docs/design/bootstrap.md` is referenced in the issue as the current home for the "documented" behavior but was not updated in this PR. Worth a short note there describing the promote-on-unhealthy behavior and the symmetry of the teardown so future code readers find it.
- No godoc on `createReplaceDockerContext`, `deleteDockerContext`, `setCurrentDockerContext`, `clearCurrentDockerContext`, `dockerConfigDir`, or `dockerContextMeta`/`dockerEndpointMeta`. Minor (these are unexported helpers), but the other exported-looking helpers in this file (`contextMetaPath`, `getDockerContextHost`, `getCurrentDockerContext`, `probeDockerContext`, `updateDockerConfig`) all have one. Add for consistency:

```go
// createReplaceDockerContext writes the Docker CLI context meta file for
// name pointing at socketPath, creating parent directories as needed.
// If the file exists it is overwritten.
func createReplaceDockerContext(name, socketPath string) error { ... }

// deleteDockerContext removes the Docker CLI context directory for name.
// Returns nil if the directory does not exist.
func deleteDockerContext(name string) error { ... }
```

## Commit Structure

Two commits:
- `ee25067 Manage Docker CLI context on ContainerEngineReady transitions` — Go code, unit tests.
- `fe0962c Add bats to support the docker context` — integration tests.

Clean split, readable in either order. No fixup noise. No concerns.

## Acknowledged Limitations

- **Windows not implemented** — `engine_reconciler.go:277-281, 345-347`: both functions early-return on `runtime.GOOS == "windows"`. @mook-as approved the PR on the condition that a tracking issue be filed; see I2 for the request for an in-code TODO or filed issue. The comment claims only the socket URL scheme is wrong; in fact `probeDockerContext` would also need a different client transport.
- **Non-atomic `createReplaceDockerContext`** — `docker_context.go:55-75`: author consciously uses `os.WriteFile` here but routed `config.json` mutations through an atomic helper. @jandubois explicitly asked for atomic handling in his general comment; @Nino-K's response cites "the issue I3 in the AI review" (not visible to this reviewer) as addressing it. The split coverage (atomic for `config.json`, not for `meta.json`) is in the code as-is; see S1 for the gap.
- **`contextProbeWg` count can exceed 1** — `engine_reconciler.go:357`: @mook-as noticed the count can briefly climb above 1 (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102603375); @Nino-K responded that `Wait()` just blocks until zero regardless of count (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102753949). Analysis agrees: the design is correct — the set-currentContext write is idempotent (same `contextName` from a stable `instance.Name()`), and `Wait()` semantics hold.
- **No file-level locking around `~/.docker/config.json`** — `docker_context.go:55`: @mook-as raised and immediately self-resolved (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102609534) since the Docker CLI itself doesn't lock.
- **Not adopting `github.com/docker/cli`'s `ContextStore`** — @Nino-K's response (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096716690) explains that it would pull in a large dep tree for ~30 lines we control. Agree — especially since the context metadata schema is stable.

## Unresolved Feedback

- **https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3101962597** (mook — TODO or filed issue for Windows): not visibly addressed. Promoted to **I2**.
- **https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102364628** (mook — continue on read failure = clobber risk): partially addressed in conversation but no code change. Promoted to **I1**.
- **https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096335657** (mook — `yq --inplace` for the bats fixture): low-priority. The `jq … > tmp && mv tmp …` pattern in `engine-docker.bats:456-457` and `472-474` still uses non-atomic `mv` (which *is* atomic on the same filesystem on Unix, so safe in practice). Not actionable beyond stylistic preference.
- **https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096360169** (mook — ContainerEngineReady timing): see Design Observations. Not a regression; worth a clarifying comment but not a code fix.

## Coverage Summary

| File | Status |
|---|---|
| `pkg/controllers/app/engine/controllers/engine_reconciler.go` | I1, I2, S2 |
| `pkg/controllers/app/engine/controllers/docker_context.go` | S1 |
| `pkg/controllers/app/engine/controllers/docker_context_test.go` | S3 |
| `bats/tests/32-app-controllers/engine-docker.bats` | S4 |
---

### [Codex]

Verdict: Changes requested

### Executive Summary

This PR adds Docker CLI context lifecycle management around `ContainerEngineReady` for the moby engine and extends the BATS suite to cover the new behavior. The core lifecycle wiring is reasonable, but two regressions stand out: the new liveness probe misclassifies healthy non-`unix://` contexts as dead and can overwrite the user’s active context, and the new BATS coverage edits the operator’s real `~/.docker/config.json` without isolating it from test failures.

### Findings

**Critical Issues**

None.

**Important Issues**

I1. **Current-context probing drops transport settings and will steal healthy TLS/SSH contexts** — `pkg/controllers/app/engine/controllers/docker_context.go:92-108,163-172; pkg/controllers/app/engine/controllers/engine_reconciler.go:323-335` (important, regression)

```go
currentHost, err := getDockerContextHost(current)
if err != nil {
	log.Error(err, "Failed to resolve current Docker context host", "context", current)
} else if currentHost != "" {
	healthy = probeDockerContext(probeCtx, currentHost)
}

cli, err := dockerclient.New(dockerclient.WithHost(host))
_, err = cli.Ping(probeCtx, dockerclient.PingOptions{})
```

`getDockerContextHost()` at `docker_context.go:92-108` only returns the raw `Endpoints["docker"].Host`, `manageDockerContext()` at `engine_reconciler.go:323-328` feeds that directly into `probeDockerContext()`, and `probeDockerContext()` at `docker_context.go:166-171` builds a client with only `WithHost(host)`. That means any healthy current context that depends on extra transport state (for example TLS certs from the context store, or any future non-host endpoint settings) is probed as if it were a bare host URL. When that probe fails, `manageDockerContext()` treats the context as unhealthy and overwrites `currentContext` at `engine_reconciler.go:333-335`, violating the stated “leave healthy contexts alone” behavior.

Fix: resolve and apply the full Docker context transport, not just `Host`. In practice that means either using Docker’s context store APIs or reproducing the full `docker context inspect`/TLS-path resolution before constructing the client, then adding coverage for a healthy non-`unix://` context that must remain selected.

I2. **The new BATS test mutates the developer’s real Docker config and can leave it corrupted after a failed run** — `bats/tests/32-app-controllers/engine-docker.bats:12-24,456-474` (important, regression)

```bash
local_setup_file() {
    rdd svc delete
    rdd set running=true
    run -0 rdd svc paths docker_socket
    export DOCKER_HOST="unix://${output}"
}

jq 'del(.currentContext)' "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
    mv "${HOME}/.docker/config.json.tmp" "${HOME}/.docker/config.json"
...
jq --arg ctx "${saved_context}" '.currentContext = $ctx' \
    "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
    mv "${HOME}/.docker/config.json.tmp" "${HOME}/.docker/config.json"
```

`local_setup_file()` at `engine-docker.bats:12-24` points Docker at the test daemon via `DOCKER_HOST`, but it does not isolate `HOME` or `DOCKER_CONFIG`. The new test then rewrites `${HOME}/.docker/config.json` in place at `engine-docker.bats:456-474`. If the test times out, aborts, or fails between the delete and the manual restore, the operator’s real Docker context selection is left changed outside the repo; `teardown_file()` from `bats/helpers/load.bash` only stops `rdd`, so there is no safety net. That is a practical developer-state corruption bug in the test suite.

Fix: give these tests an isolated Docker config directory, e.g. copy `~/.docker` into `${BATS_TEST_TMPDIR}/docker`, export `DOCKER_CONFIG` to that temp directory for the duration of the file, and restore via `teardown_file()`/trap instead of inline best-effort cleanup inside a single test.

**Suggestions**

S1. **Use BATS `run -0` instead of command substitution for `saved_context`** — `bats/tests/32-app-controllers/engine-docker.bats:452-453` (suggestion, regression)

```bash
local saved_context
saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
```

The capture at `engine-docker.bats:453` uses command substitution plus `|| true`, which is the exact opaque-failure pattern the repo’s BATS guidelines try to avoid. If `jq` fails because the file is malformed or unreadable, the test silently proceeds with `saved_context=""`, weakening diagnostics and making it harder to distinguish a real regression from setup damage.

Fix: use `run -0 jq -r '.currentContext // empty' ...` and assign `local saved_context=$output`, or branch explicitly on file existence before reading it.

### Design Observations

**Concerns**

1. **Hand-rolling Docker context resolution is already duplicating Docker CLI semantics** `(in-scope)`. The new helper owns hashing, on-disk layout, config mutation, and probe transport reconstruction, but only the simplest `unix://` case is implemented. Moving the probe path onto Docker’s context store semantics would remove a whole class of drift bugs around TLS/default/SSH handling and make future Windows/npipe support a focused transport addition instead of another special case.

**Strengths**

1. Reusing the watcher-lifetime context for the background probe is the right lifecycle boundary. The stop path cancels the probe and waits before deleting the context, which avoids the worst post-shutdown race where a slow probe re-promotes the just-removed context.
2. `updateDockerConfig()` preserves unrelated keys and uses same-directory temp-file replacement, which is the right baseline for mutating `config.json`.

### Testing Assessment

`make` and `go test ./pkg/controllers/app/engine/controllers/...` passed locally, but the new coverage still misses the highest-risk cases:

1. A healthy current context that is not a bare `unix://` socket should remain selected. There is no unit or integration test covering TLS/SSH/TCP context probing, which is how I1 slips through.
2. The suite does not prove that the operator’s original Docker config is restored on early exit. The new BATS coverage only tests the happy-path restore inside one test body.

### Documentation Assessment

The existing design docs already describe the intended Docker-context behavior, so no broad doc rewrite is needed. The Windows/npipe deferral should be tracked explicitly in docs or an issue, because `bootstrap.md` currently reads as if context management is universally available.

### Acknowledged Limitations

- **Code comment** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:277-278`: “Docker context management uses unix:// sockets; Windows requires npipe:// which is not yet implemented.” The change is consistent with that limitation; it still leaves Windows without context management.
- **PR comment** — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096716690: the author chose not to adopt `github.com/docker/cli`’s `ContextStore`, citing dependency size and the desire to keep the implementation to a simple Unix-socket JSON writer. The current patch is consistent with that rationale.
- **PR comment** — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096742825: the author acknowledged Windows would need separate handling and only renamed the helper/comment away from the Unix-specific wording for now. The implementation still intentionally no-ops on Windows.
- **PR comment** — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102730266: the author agreed that config-read/parse failures probably should stop promotion instead of continuing, but deferred that decision to the Windows npipe follow-up. The current code is unchanged.
- **PR comment** — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3102753949: the author agreed `contextProbeWg` can exceed 1 and judged that harmless because `Wait()` only cares about the count reaching zero. The current implementation still relies on that reasoning.
- **PR comment** — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3101959340: the one-shot probe-on-start / cleanup-on-stop design was intentional; continuous reassertion was explicitly rejected. The implementation matches that design choice.

### Unresolved Feedback

1. https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096599500 — still actionable, and it overlaps with I1. The review comment correctly predicted that probing arbitrary contexts by extracting only the socket/host would fail once the active context is not the simple local-socket case.
2. https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3101962597 — still actionable, low priority. The review asked for a TODO or filed issue for the Windows gap; the code comment remains, but I do not see a corresponding tracking issue in this PR.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — I2, S1
- `pkg/controllers/app/engine/controllers/docker_context.go` — I1
- `pkg/controllers/app/engine/controllers/docker_context_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I1

---

### [Gemini]

### Executive Summary

This PR implements automatic Docker CLI context management for the moby engine, seamlessly redirecting the user's `docker` CLI to the Rancher Desktop Daemon. While the core orchestration correctly bounds the probe lifecycle to the manager context and cleanly serializes transitions, the PR introduces two critical regressions. First, it skips probing the implicit `default` context, silently hijacking the `docker` connection even if the user's existing engine is healthy. Second, its atomic file replacement relies on `os.Rename` semantics that fail on Windows, causing immediate CI failures for the cross-platform unit tests. 

### Findings

#### Critical Issues

C1. **Missing default context probe hijacks existing healthy engines** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:323-331` (critical, regression)

```go
		// Probe the current context's host — not our own — to decide
		// whether to promote ours. "default" and empty both mean no
		// working context is set; treat them identically.
		var healthy bool
		if current != "" && current != "default" {
			currentHost, err := getDockerContextHost(current)
```
The probe explicitly ignores empty and `"default"` contexts, under the assumption that they represent "no working context". This is a misunderstanding of Docker's CLI behavior: if `config.json` has no `currentContext` (`getCurrentDockerContext` at `docker_context.go:129` returns `""`), or if it is explicitly `"default"`, Docker connects to the implicit default endpoint (e.g., `/var/run/docker.sock` or `DOCKER_HOST`). By skipping the probe for these values, the code assumes `healthy = false` and unconditionally overwrites `config.json` with the `rancher-desktop-x` context. This silently hijacks the user's Docker CLI even when their existing local daemon (like Docker CE) is perfectly healthy, violating the documented fallback behavior.

Fix: Probe the default context when `current` is `""` or `"default"`. Initialize the client with `client.FromEnv` so the Docker SDK resolves the default environment endpoint:
```diff
-		var healthy bool
-		if current != "" && current != "default" {
-			currentHost, err := getDockerContextHost(current)
-			if err != nil {
-				log.Error(err, "Failed to resolve current Docker context host", "context", current)
-			} else if currentHost != "" {
-				healthy = probeDockerContext(probeCtx, currentHost)
-			}
-		}
+		var healthy bool
+		if current != "" && current != "default" {
+			currentHost, err := getDockerContextHost(current)
+			if err != nil {
+				log.Error(err, "Failed to resolve current Docker context host", "context", current)
+			} else if currentHost != "" {
+				healthy = probeDockerContext(probeCtx, currentHost)
+			}
+		} else {
+			healthy = probeDockerContext(probeCtx, "")
+		}
```
Update `probeDockerContext` (`docker_context.go:163`) to use `FromEnv` on an empty host:
```diff
 func probeDockerContext(ctx context.Context, host string) bool {
 	probeCtx, cancel := context.WithTimeout(ctx, dockerContextProbeTimeout)
 	defer cancel()
-	cli, err := dockerclient.New(dockerclient.WithHost(host))
+	var opts []dockerclient.Opt
+	if host != "" {
+		opts = append(opts, dockerclient.WithHost(host))
+	} else {
+		opts = append(opts, dockerclient.FromEnv)
+	}
+	cli, err := dockerclient.New(opts...)
 	if err != nil {
 		return false
 	}
```

C2. **`os.Rename` fails on Windows, breaking cross-platform tests** — `pkg/controllers/app/engine/controllers/docker_context.go:224-226` (critical, regression)

```go
	if err := tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return err
	}
	return os.Rename(tmp.Name(), path)
```
The repository's system instructions note that `os.Rename` on Windows fails when the destination file already exists. While the reconciler caller `manageDockerContext` returns early on Windows, `updateDockerConfig` is evaluated in the cross-platform `docker_context_test.go:53` unit tests. Because `Test_updateDockerConfig_set` writes a file and calls this logic, `os.Rename` reliably throws "Cannot create a file when that file already exists", breaking Windows CI immediately.

Fix: Explicitly remove the destination file before renaming on Windows:
```diff
 	if err := tmp.Close(); err != nil {
 		os.Remove(tmp.Name())
 		return err
 	}
+	if runtime.GOOS == "windows" {
+		_ = os.Remove(path)
+	}
 	return os.Rename(tmp.Name(), path)
```

#### Important Issues

I1. **BATS test assumes `config.json` exists and uses unsafe `$()` substitutions** — `bats/tests/32-app-controllers/engine-docker.bats:449-453` (important, regression)

```bash
    local saved_context
    saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)

    # Clear the current context and restart the engine so the probe runs fresh.
    jq 'del(.currentContext)' "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
```
The test assumes `~/.docker/config.json` already exists. On a clean CI runner or developer machine without prior Docker configuration, the second `jq` invocation fails with "No such file or directory," causing the BATS framework (which runs under `set -e`) to abort immediately. Furthermore, the repository instructions forbid `$(command)` substitutions because they swallow opaque errors. If `jq` fails, `saved_context` evaluates to empty, meaning the teardown block will not restore the user's active context — silently deleting developer configurations locally.

Fix: Check if the file exists first, and use `run` to capture output defensively:
```diff
-    local saved_context
-    saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
-
-    # Clear the current context and restart the engine so the probe runs fresh.
-    jq 'del(.currentContext)' "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
-        mv "${HOME}/.docker/config.json.tmp" "${HOME}/.docker/config.json"
+    local saved_context=""
+    if [[ -f "${HOME}/.docker/config.json" ]]; then
+        run -0 jq -r '.currentContext // empty' "${HOME}/.docker/config.json"
+        saved_context="${output}"
+        
+        run -0 jq 'del(.currentContext)' "${HOME}/.docker/config.json"
+        echo "${output}" > "${HOME}/.docker/config.json"
+    fi
```

I2. **Context resolution ignores `DOCKER_CONFIG` environment variable** — `pkg/controllers/app/engine/controllers/docker_context.go:36-39` (important, gap)

```go
func dockerConfigDir() (string, error) {
	home, err := os.UserHomeDir()
```
The `dockerConfigDir` hardcodes the path to `~/.docker`. If the user has configured the `DOCKER_CONFIG` environment variable (a standard feature of the Docker CLI), the daemon will incorrectly read and overwrite contexts in the default location, completely ignoring their active configuration. This leads to duplicate configurations and unpredictable CLI behavior.

Fix: Check for `DOCKER_CONFIG` before defaulting to the home directory. This also simplifies the unit tests: instead of mocking `t.Setenv("HOME", ...)` (which breaks on Windows where it needs `USERPROFILE`), tests can simply use `t.Setenv("DOCKER_CONFIG", ...)` securely across platforms.
```diff
 func dockerConfigDir() (string, error) {
+	if configDir := os.Getenv("DOCKER_CONFIG"); configDir != "" {
+		return configDir, nil
+	}
 	home, err := os.UserHomeDir()
```

#### Suggestions

S1. **Avoid `$(command)` in `docker_context_dir` assignment** — `bats/tests/32-app-controllers/engine-docker.bats:430` (suggestion, enhancement)

```bash
    meta_file="$(docker_context_dir "${context_name}")/meta.json"
```
While this calls a harmless local bash function (`docker_context_dir`), using `$(...)` still violates the rigid project guideline prohibiting substitution to extract values. Using `run` aligns strictly with the requested BATS conventions.

### Design Observations

#### Strengths
- **Goroutine orchestration**: Deriving the probe goroutine's context (`probeCtx`) from the manager's `watcherCtx` and tightly synchronizing cancellation via `sync.WaitGroup` and `contextMu` ensures rapid condition transitions gracefully supersede one another without race conditions or dangling connections. 
- **Non-blocking Reconcile**: Using an asynchronous probe bounded by a hard timeout perfectly avoids blocking the worker pool during API round trips.

### Testing Assessment

1. `Test_updateDockerConfig_set` exercises the atomic rename logic on a cross-platform scope. Because it lacks Windows `os.Rename` safeguards, it breaks Windows CI immediately. (Caught by C2)
2. The BATS tests correctly exercise the CLI fallback logic but destructively assume the user/CI environment already has a `config.json`. If a local developer runs this suite without Docker pre-installed, it fails midway and swallows its state. (Caught by I1)

### Acknowledged Limitations

- **`github.com/docker/cli` usage replacing JSON logic**: `pkg/controllers/app/engine/controllers/docker_context.go`. The author noted in PR comments that pulling the upstream `ContextStore` package would introduce a massive dependency tree just for ~30 lines of basic JSON mutation that we fully control.
- **Windows `npipe` socket implementation**: `pkg/controllers/app/engine/controllers/engine_reconciler.go:277`. The author acknowledged dropping Windows socket generation (`return` early) and stated this behavior will be fulfilled in an upcoming `npipe` PR.
- **Atomic replacement without file locks**: `pkg/controllers/app/engine/controllers/docker_context.go:175`. Acknowledged by the reviewer: avoiding data races against parallel processes (e.g. the Docker CLI) using locks is impractical as external ecosystem tools may not respect identical lock mechanisms anyway.
- **Background probe running infinitely**: The PR explicitly runs the check strictly once per engine transition. The reviewer clarified that a polling loop watching for deviation "sounds like an objectively terrible idea" as it would fight the user trying to legitimately alter their context.
- **Probe `contextProbeWg` going above 1**: `pkg/controllers/app/engine/controllers/engine_reconciler.go:357`. The author confirmed the wait group counter can safely increase if called concurrently, as `Wait()` correctly blocks until every inflight generation gracefully decrements via cancellation.

### Unresolved Feedback

- [Review comment by @mook-as](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3101962597) — "We should either have `TODO:` that we can grep for, or an issue filed." 
  **Assessment**: Actionable. The author deferred the Windows `npipe` work to an upcoming PR but left no searchable `TODO:` marker next to the `if runtime.GOOS == "windows"` early return at `engine_reconciler.go:278`.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats`: I1, S1
- `pkg/controllers/app/engine/controllers/docker_context.go`: C2, I2
- `pkg/controllers/app/engine/controllers/docker_context_test.go`: C2
- `pkg/controllers/app/engine/controllers/engine_reconciler.go`: C1
