# Deep Review: 20260416-144802-pr-320

| | |
|---|---|
| **Date** | 2026-04-16 15:14 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 1 |
| **Author** | [@Nino-K](https://github.com/Nino-K) |
| **PR** | [#320](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320) — Manage Docker CLI context on ContainerEngineReady transitions |
| **Branch** | `docker-context` |
| **Commits** | `001fd95` Manage Docker CLI context on ContainerEngineReady transitions<br>`ef5114d` Add bats to support the docker context |
| **Reviewers** | Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Rework** — three serious correctness bugs in the probe and config-write paths must be fixed before merge. |
| **Wall-clock time** | `30 min 15 s` |

## Executive Summary

PR 320 wires Docker CLI context creation/teardown into the engine reconciler:
on `ContainerEngineReady=True` it writes a `rancher-desktop-<suffix>` context
pointing at `instance.DockerSocket()`, and a background probe promotes that
context to `currentContext` if the user's existing context is missing or
unreachable. On the way back to False, the probe is cancelled and the
context is removed.

The design (probe in a goroutine derived from `watcherCtx`, generation
counter to deduplicate concurrent probes) is sensible, but three correctness
bugs make this PR unsafe to ship as-is, and all three were identified
independently by all reviewers:

1. **Probe-vs-cleanup race resurrects a deleted context.** The goroutine
   does not check `probeCtx.Err()` before writing `currentContext`, so a
   cancelled probe interprets cancellation as "current context is unhealthy"
   and writes our context name to `~/.docker/config.json` after
   `removeDockerContext` has already cleared it and deleted the meta dir.
   The user's `docker` CLI then fails with "Current context … is not found"
   until they manually run `docker context use default`.
2. **Hardcoded `unix://` breaks every non-Unix Docker context.** The probe
   strips `unix://` from the user's existing context Host and unconditionally
   re-prepends it before dialing, so `tcp://`, `ssh://`, and `npipe://`
   contexts probe as `unix://tcp://…`, get marked unhealthy, and silently
   get overridden — Windows users and remote-Docker users lose their
   working context.
3. **Non-atomic, repeated writes to `~/.docker/config.json` risk data
   loss.** `os.WriteFile` truncates before writing, and
   `removeDockerContext` runs on every reconcile in the stopped state, so
   the user's auths/credsStore can be lost to a partial write or a race
   with a concurrent `docker login`.

Two further important issues complete the must-fix list: the new code logs
errors via `logf.Log.Info(..., "error", err)` instead of the scoped
`log.Error(...)` pattern used everywhere else in the reconciler, and the
hardcoded `unix://` scheme in `createReplaceDockerContext` makes the
context Docker CLI on Windows cannot open. Reviewer @mook-as raised the
same Windows concern in inline review comments that remain unresolved.

The BATS tests mutate the developer's actual `~/.docker/config.json` —
they should run against an isolated `HOME` (or `DOCKER_CONFIG`) so a
mid-test failure does not corrupt local Docker state. None of the
reviewer's prior-round comments have been addressed.

Structure: the change is small (4 files, ~559 LoC) but split into one
"code" commit and one "tests" commit; squash before merge.

---

## Critical Issues

None.

---

## Important Issues

I1. **Probe goroutine ignores cancellation and resurrects the deleted context** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:290-323` [Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro]

```go
go func() {
    defer func() {
        r.contextMu.Lock()
        if r.contextProbeGen == myGen {
            r.contextProbeCancel = nil
        }
        r.contextMu.Unlock()
        cancel()
    }()

    current, err := getCurrentDockerContext()
    ...
    var healthy bool
    if current != "" && current != "default" {
        currentSocket, err := getDockerContextSocket(current)
        ...
        healthy = probeDockerContext(probeCtx, currentSocket)
    }
    if !healthy {
        if err := setCurrentDockerContext(contextName); err != nil {
            logf.Log.Info("Failed to set current Docker context", "context", contextName, "error", err)
        }
    }
}()
```

When the engine stops, `removeDockerContext` (line 328) cancels `probeCtx`,
clears `currentContext` if it points at us, and deletes the context dir —
without waiting for the probe goroutine to finish. The probe interprets the
cancellation in `probeDockerContext`'s `Ping` as "context is unhealthy" and
proceeds to write `setCurrentDockerContext(contextName)`. That write can
land *after* `removeDockerContext` has already deleted the meta dir,
leaving `currentContext` pointing at a non-existent context. Every
subsequent `docker ps`/`docker build`/`docker context inspect` then fails
with `"Current context … is not found"` until the user manually runs
`docker context use default`.

The generation counter (`myGen`) only protects the cancel-slot bookkeeping;
the disk write on line 319 is not gated on `probeCtx.Err()`, generation
match, or any wait on the cleanup path.

Fix: gate the promotion on `probeCtx.Err()`. A fully race-free fix also
adds a `sync.WaitGroup` so `removeDockerContext` blocks until the probe
returns:

```diff
 go func() {
+    r.contextProbeWg.Add(1)
+    defer r.contextProbeWg.Done()
     defer func() {
         ...
     }()
     ...
     if !healthy {
+        if probeCtx.Err() != nil {
+            return
+        }
         if err := setCurrentDockerContext(contextName); err != nil {
             ...
         }
     }
 }()

 func (r *EngineReconciler) removeDockerContext() {
     r.contextMu.Lock()
     ...
     r.contextMu.Unlock()
+    r.contextProbeWg.Wait()
     ...
 }
```

The wait is bounded by `dockerContextProbeTimeout` (3s) because `probeCtx`
gets cancelled before the wait. The `Err()` check alone closes most of the
window; the WaitGroup closes the remainder where the goroutine has read
`!healthy` and is about to call `setCurrentDockerContext` when cancellation
fires.

I2. **Hardcoded `unix://` scheme breaks tcp / ssh / npipe Docker contexts** — `pkg/controllers/app/engine/controllers/docker_context.go:90-111, 157-167` [Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro]

```go
func getDockerContextSocket(name string) (string, error) {
    ...
    host := meta.Endpoints["docker"].Host
    return strings.TrimPrefix(host, "unix://"), nil
}

func probeDockerContext(ctx context.Context, socketPath string) bool {
    ...
    cli, err := dockerclient.New(dockerclient.WithHost("unix://" + socketPath))
    ...
}
```

If the user's current context has `Host: "tcp://foo:2375"` (or `ssh://`,
`npipe://`), `getDockerContextSocket` returns the string unchanged because
the `"unix://"` prefix doesn't match. `probeDockerContext` then constructs
`"unix://tcp://foo:2375"`, the moby client parses it as
`proto=unix, addr="tcp://foo:2375"`, attempts to dial a Unix socket at the
literal path, and fails. The probe reports `healthy=false`, and the
goroutine silently overrides the user's working remote/Windows-pipe
context with ours.

This violates the PR's documented contract ("if the current context is
healthy, leave it alone") for every non-Unix endpoint and is the same
class of bug Windows users will hit in I4.

Fix: stop stripping the scheme — pass the full Host URL through to the
client. Rename for clarity:

```diff
-func getDockerContextSocket(name string) (string, error) {
+// getDockerContextHost returns the Docker Host URL (e.g. "unix:///path",
+// "tcp://host:port") that the named context points at.
+func getDockerContextHost(name string) (string, error) {
     ...
-    host := meta.Endpoints["docker"].Host
-    return strings.TrimPrefix(host, "unix://"), nil
+    return meta.Endpoints["docker"].Host, nil
 }

-func probeDockerContext(ctx context.Context, socketPath string) bool {
+func probeDockerContext(ctx context.Context, host string) bool {
     ...
-    cli, err := dockerclient.New(dockerclient.WithHost("unix://" + socketPath))
+    cli, err := dockerclient.New(dockerclient.WithHost(host))
     ...
 }
```

I3. **Non-atomic write of `~/.docker/config.json` can lose user data** — `pkg/controllers/app/engine/controllers/docker_context.go:172-203` [Gemini 3.1 Pro]

```go
return os.WriteFile(path, append(out, '\n'), 0o600)
```

`updateDockerConfig` uses `os.WriteFile`, which opens with `O_TRUNC` and
writes in place. Two failure modes:

1. *Crash after truncate, before write completes:* The user's
   `~/.docker/config.json` is left empty or partial, losing every key it
   held (`auths`, `credsStore`, `currentContext`, plugin settings, etc.).
2. *Race with concurrent docker CLI:* A `docker login` or `docker context
   use` running concurrently performs its own read-modify-write. Whichever
   write lands second silently clobbers the other's changes — the user
   loses an auth entry or a context switch with no error.

Severity is amplified by the design: `removeDockerContext` is called on
every reconcile when the engine is stopped (see S1), so the file is read,
re-marshalled, and rewritten on a steady cadence in the background even
when nothing has changed.

Fix: write to a sibling temp file and `rename(2)` — atomic on POSIX, and
fixable on Windows by removing the destination first. Also short-circuit
when `mutate` produces no change so steady state pays no I/O cost:

```diff
-func updateDockerConfig(path string, mutate func(map[string]any)) error {
+func updateDockerConfig(path string, mutate func(map[string]any) bool) error {
     cfg := map[string]any{}
     ...
-    before := len(cfg)
-    mutate(cfg)
-
-    if notExist && len(cfg) == 0 && before == 0 {
+    if changed := mutate(cfg); !changed {
         return nil
     }

     out, err := json.MarshalIndent(cfg, "", "\t")
     ...
-    return os.WriteFile(path, append(out, '\n'), 0o600)
+    tmp, err := os.CreateTemp(filepath.Dir(path), ".config.json.*")
+    if err != nil {
+        return err
+    }
+    if _, err := tmp.Write(append(out, '\n')); err != nil {
+        tmp.Close()
+        os.Remove(tmp.Name())
+        return err
+    }
+    if err := tmp.Chmod(0o600); err != nil {
+        tmp.Close()
+        os.Remove(tmp.Name())
+        return err
+    }
+    if err := tmp.Close(); err != nil {
+        os.Remove(tmp.Name())
+        return err
+    }
+    return os.Rename(tmp.Name(), path)
 }
```

The `mutate` callback returns `bool` so the caller can signal whether it
actually changed anything — `setCurrentDockerContext` returns true only
when the field changes, `clearCurrentDockerContext` only when the matching
key existed. This collapses I3 with S1 (no rewrite-on-every-reconcile).

I4. **Windows: hardcoded `unix://` scheme produces a context Docker CLI on Windows cannot open** — `pkg/controllers/app/engine/controllers/docker_context.go:64-69` [Claude Opus 4.7]

```go
Endpoints: map[string]dockerEndpointMeta{
    "docker": {Host: "unix://" + socketPath},
},
```

`instance.DockerSocket()` returns a Windows path like
`%USERPROFILE%\.rd<suffix>\docker.sock`; the resulting `meta.json`
contains `Host: "unix://C:\\Users\\…\\docker.sock"`, which Docker CLI on
Windows cannot use (it expects `npipe://./pipe/...`). The BATS file skips
Windows at line 15, but `manageDockerContext` runs unconditionally in
production, so the controller will write a broken context for every
Windows user. Reviewer @mook-as raised the same concern at
[discussion_r3096484641](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096484641),
which is unresolved.

Fix (smallest): gate `manageDockerContext` and `removeDockerContext` on
`runtime.GOOS != "windows"` until the named-pipe path is wired up.
Long-term: switch to `github.com/docker/cli/cli/context/store`, which
constructs the platform-appropriate Host (see Design Concerns).

```diff
 func (r *EngineReconciler) manageDockerContext(socketPath string) {
+    if runtime.GOOS == "windows" {
+        return
+    }
     contextName := instance.Name()
     ...
 }
```

Apply the same guard to `removeDockerContext`.

I5. **New error sites use `logf.Log.Info(..., "error", err)` instead of the scoped `log.Error(...)` pattern** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:275, 303, 313, 320, 338, 341` [Claude Opus 4.7]

```go
logf.Log.Info("Failed to create Docker context", "context", contextName, "error", err)
...
logf.Log.Info("Failed to set current Docker context", "context", contextName, "error", err)
...
logf.Log.Info("Failed to delete Docker context", "context", contextName, "error", err)
```

Every other error site in this package uses
`log := logf.FromContext(ctx); log.Error(err, msg, kv...)`. The new code
deviates on two axes:

1. `.Info` logs at info severity with an `"error"` field, so operators
   filtering on error level miss these and structured-log parsers do not
   promote `err` into a dedicated field.
2. `logf.Log` is the unscoped global logger. `SetupWithManager` primes
   `r.watcherCtx` with `logf.IntoContext(..., mgr.GetLogger().WithName("engine-watcher"))`,
   but the new code ignores it — these lines have no controller name and
   no goroutine identity.

Fix: capture a scoped logger at goroutine launch and use `.Error`:

```diff
 func (r *EngineReconciler) manageDockerContext(socketPath string) {
     contextName := instance.Name()
+    log := logf.FromContext(r.watcherCtx).WithName("docker-context")

     if err := createReplaceDockerContext(contextName, socketPath); err != nil {
-        logf.Log.Info("Failed to create Docker context", "context", contextName, "error", err)
+        log.Error(err, "Failed to create Docker context", "context", contextName)
         return
     }
     ...
 }
```

Apply the same pattern to `removeDockerContext`.

---

## Suggestions

S1. **`removeDockerContext` re-runs config.json I/O on every stopped-state reconcile** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:148-158` [Claude Opus 4.7]

```go
if !wantWatcher {
    if watcherRunning {
        log.Info("Stopping Docker watcher", ...)
        r.stopWatcher()
    } else {
        r.removeDockerContext()    // line 157
    }
    ...
    alreadyClean := !watcherDied && current != nil && current.Reason == terminalReason && current.Status == terminalStatus
    if !alreadyClean {
        ...
    }
}
```

The `else` branch fires every reconcile while the engine is stopped, so
`removeDockerContext → clearCurrentDockerContext → updateDockerConfig`
reads, re-marshals, and rewrites `~/.docker/config.json` continuously even
in steady state. With the current non-atomic write (I3), this turns a
single stale-config-cleanup into ongoing risk. Stable state should be
no-I/O.

Fix: move `removeDockerContext()` inside the `!alreadyClean` block, or
fold it into the same condition that gates `cleanupMirrorResources`. With
I3's "no rewrite when nothing changed" path, the cost goes away
automatically.

S2. **`"default"` context is unconditionally overridden without a probe** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:308-317` [Claude Opus 4.7, Codex GPT 5.4]

```go
if current != "" && current != "default" {
    currentSocket, err := getDockerContextSocket(current)
    ...
}
```

The comment claims `"default"` and `""` both mean "no working context",
but `default` is a real Docker context — on Linux with Docker Engine
installed it points to `/var/run/docker.sock`, which may be running. The
code treats it as broken and silently promotes ours. The PR's stated
contract ("if the current context is healthy, leave it alone") would be
violated on every Linux box with both Docker Engine and Rancher Desktop
installed.

Fix: probe `default` explicitly via `dockerclient.DefaultDockerHost`
(or `client.FromEnv`) instead of skipping it. The change is small once I2's
"pass the full Host through" refactor is in place.

S3. **`DOCKER_CONFIG` environment variable is ignored** — `pkg/controllers/app/engine/controllers/docker_context.go:37-43` [Claude Opus 4.7]

```go
func dockerConfigDir() (string, error) {
    home, err := os.UserHomeDir()
    if err != nil {
        return "", err
    }
    return filepath.Join(home, ".docker"), nil
}
```

Docker CLI honours `$DOCKER_CONFIG` to relocate the config dir. Users who
set it (common in dev containers, CI runners, or Nix profiles) will have
the context written under `~/.docker` while their CLI reads from
`$DOCKER_CONFIG`, so the new context never appears.

```diff
 func dockerConfigDir() (string, error) {
+    if dir := os.Getenv("DOCKER_CONFIG"); dir != "" {
+        return dir, nil
+    }
     home, err := os.UserHomeDir()
     if err != nil {
         return "", err
     }
     return filepath.Join(home, ".docker"), nil
 }
```

S4. **Empty `docker` endpoint is treated as "no current context"** — `pkg/controllers/app/engine/controllers/docker_context.go:93-111` [Claude Opus 4.7]

```go
host := meta.Endpoints["docker"].Host
return strings.TrimPrefix(host, "unix://"), nil
```

If `meta.json` exists but has no `docker` endpoint (a kubernetes-only
context, which Docker CLI does write), the map lookup returns a zero
`dockerEndpointMeta{}` with `Host=""`. The caller then takes the
`currentSocket != ""` false branch, treats `healthy=false`, and overrides
a context the user is actively using — just one without a docker endpoint.

Fix shares the I2 refactor: return the raw `Host` and let the caller
distinguish `"" → not probeable, leave alone` from `currentSocket missing
→ promote ours`.

S5. **BATS tests modify the developer's real `~/.docker/config.json` without HOME isolation** — `bats/tests/32-app-controllers/engine-docker.bats:425-494` [Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro]

```bash
@test "moby engine sets currentContext when no healthy context exists" {
    ...
    saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)

    jq 'del(.currentContext)' "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
        mv "${HOME}/.docker/config.json.tmp" "${HOME}/.docker/config.json"
    ...
    if [[ -n "${saved_context}" ]]; then
        jq --arg ctx "${saved_context}" ...
    fi
}
```

The tests operate on the developer's actual `$HOME/.docker/config.json`
(no helper overrides `HOME`). The "restore in teardown" inline block at
lines 471–475 runs only when execution reaches it — any `assert_output`
failure above it leaves `currentContext` cleared. There is no
file-level `teardown`.

Fix options, in order of preference:

1. Override `HOME` (or `DOCKER_CONFIG`, after S3 lands) in
   `local_setup_file` to a per-test temp dir; the engine controller picks
   it up when started by the test harness.
2. Move the save/restore into a BATS `teardown` so it runs unconditionally.

S6. **`$(rdd ...)` and `$(jq ...)` command substitution in BATS** — `bats/tests/32-app-controllers/engine-docker.bats:442, 453` [Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro]

```bash
socket_path=$(rdd svc paths docker_socket)
saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
```

Repo convention forbids `$(cmd)` substitution when the result feeds an
assertion — failures inside `$(...)` produce opaque BATS errors with no
output. Use `run -0 <cmd>; var=$output`. Reviewer @mook-as flagged the
same line at
[discussion_r3096228639](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096228639).

S7. **Typo "currentContextshouldhavebeenremoved" in failure message** — `pkg/controllers/app/engine/controllers/docker_context_test.go:102` [Claude Opus 4.7]

```go
assert.Assert(t, !hasCtx, "currentContextshouldhavebeenremoved")
```

```diff
-assert.Assert(t, !hasCtx, "currentContextshouldhavebeenremoved")
+assert.Assert(t, !hasCtx, "currentContext should have been removed")
```

S8. **No integration test exercises the healthy-context-skip path** — `bats/tests/32-app-controllers/engine-docker.bats` [Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro]

The probe's central responsibility — "if the user's context is healthy,
leave it alone" — has no integration test. Existing tests cover only
"unset → promote ours" and "stop → remove". A test that stages a healthy
alternative context (e.g. a second context name pointing at the same
socket, set as current) and asserts `currentContext` is *unchanged* after
the engine starts would catch I2 regressions and document the policy.

S9. **`newDockerTestDir` returns `contextsDir` that no test reads** — `pkg/controllers/app/engine/controllers/docker_context_test.go:17-32` [Claude Opus 4.7]

```go
type dockerTestPaths struct {
    configFile  string
    contextsDir string
}
```

`contextsDir` is assigned but never referenced by any test in the file.
Drop the field, or convert it to a helper returning the meta-path that
removes the `t.Setenv("HOME", …)` gymnastics in the create/delete tests.

---

## Design Observations

**Concerns**

- *Hand-rolled Docker context store duplicates docker/cli and is the source of every correctness bug here* `(in-scope)` [Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro] — The PR re-implements context directory hashing, JSON metadata serialization, default-context resolution, and atomic config-file mutation. Each of I2, I3, I4, S2, S3 flows directly from one of those re-implementations. `github.com/docker/cli/cli/context/store` (suggested by reviewer @mook-as at [discussion_r3096468463](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096468463)) provides a tested implementation that constructs the platform-appropriate Host, atomically writes the metadata, and resolves `default` via `client.FromEnv`. The trade is one new dependency (already transitively present via moby/moby). Worth re-visiting before adding more code on top of the hand-written path.

- *Design docs still attribute Docker context creation to the App controller* `(in-scope)` [Claude Opus 4.7] — `docs/design/api_app.md:39` and `docs/design/bootstrap.md:58` say the App controller creates the Docker context during start. This PR places the responsibility in the engine reconciler instead, which is the right home (engine owns `ContainerEngineReady`), but the design docs now lie. Update the docs to match, or move the code to the App controller.

**Strengths**

- *Generation counter for probe deduplication* [Claude Opus 4.7, Codex GPT 5.4] — `contextProbeGen + myGen` is the right shape for deduplicating concurrent probes; a superseded goroutine cannot accidentally clear a newer probe's cancel slot. The pattern just needs to be extended to gate the disk write as well (I1).
- *Probe lifetime tied to `watcherCtx`, not Reconcile's `ctx`* [Claude Opus 4.7, Codex GPT 5.4] — Correct: the probe outlives a single Reconcile call (a per-reconcile deadline must not kill it) but dies on manager shutdown.
- *`!watcherRunning` else-branch handles the dead-watcher case* [Claude Opus 4.7] — A watcher that died between reconciles would otherwise leave the Docker context behind forever; the explicit `removeDockerContext` call closes that hole. (S1 is about how *often* it then runs, not whether it should run at all.)

---

## Testing Assessment

The new unit tests cover the file I/O helpers in isolation
(`updateDockerConfig`, `createReplaceDockerContext`, `getCurrentDockerContext`)
and the BATS additions exercise the create-on-start and remove-on-stop
end-to-end paths. The gaps are concentrated in the probe goroutine and the
config-file integration:

1. **Probe-vs-cleanup race (I1)** — no test triggers a stop while a probe
   is in flight; a test that stages a slow probe (e.g. via a context
   pointing at a socket that hangs) and then sets `running=false` would
   reproduce the resurrection bug.
2. **Non-Unix current context (I2)** — no test stages `Host: "tcp://…"`
   or `npipe://…` in `currentContext`. The probe behavior with those
   schemes is unverified.
3. **Healthy-context-skip (S8)** — the central policy is untested.
4. **`default` context interaction (S2)** — no test of a healthy
   `/var/run/docker.sock` (the realistic Linux scenario) being preserved.
5. **Crash-during-write (I3)** — no test for the partial-write recovery,
   though once `os.Rename`-based atomicity is in place this becomes
   structurally impossible.
6. **Repeated engine restarts** — `manageDockerContext` runs once per
   start in tests, never twice; the cancel-on-supersede path is unexercised.
7. **Windows (I4)** — entirely skipped via `skip_on_windows`; production
   code runs unconditionally there.

---

## Documentation Assessment

- `docs/design/bootstrap.md:58` and `docs/design/api_app.md:39` still
  attribute Docker context creation to the App controller / `rdd start`
  flow. This PR moves the work to the engine reconciler. Either the docs
  must be updated, or the code belongs elsewhere (see Design Concerns).
- `getDockerContextSocket`'s docstring at `docker_context.go:90-92`
  describes the return as a "Unix socket path"; once I2 is fixed and the
  function returns a generic Host URL, the docstring needs to follow.
- `manageDockerContext`'s docstring at `engine_reconciler.go:268-270`
  says "At most one probe runs at a time." Mechanically true via the
  generation cancel, but it does not warn about the I1 race with
  `removeDockerContext`. Update once I1 is fixed (or note the
  limitation if the fix is deferred).

---

## Commit Structure

Two commits: `001fd95` (code) + `ef5114d` (tests). The tests verify the
code in the preceding commit; squashing into one commit is conventional
for this repo and keeps `git bisect` honest. Not blocking.

---

## Unresolved Feedback

Eight inline review comments from @mook-as remain unresolved on the
current diff:

- [discussion_r3096224461](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096224461) (`bats:436`) — use `assert_file_exists` instead of `run -0 test -f`. Style.
- [discussion_r3096228639](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096228639) (`bats:442`) — use `run -0 rdd …; var=$output` instead of `$(rdd …)`. Covered by S6.
- [discussion_r3096335657](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096335657) (`bats:457`) — consider `yq --inplace`. Optional; obviated by S5 if HOME isolation lands.
- [discussion_r3096360169](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096360169) (`bats:465`) — "ContainerEngineReady shouldn't be set until the context is set up correctly." Architectural: should `ContainerEngineReady=Connected` wait for `manageDockerContext` to complete, including the probe? Today it does not. Worth answering in the PR.
- [discussion_r3096439251](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096439251) (`bats:489`) — precompute `context_dir`, drop `bash -c`, use `assert_dir_not_exists`. Covered by S5/S6.
- [discussion_r3096468463](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096468463) (`docker_context.go:26`) — adopt `github.com/docker/cli/cli/context/store`. See Design Concerns; would resolve I2/I3/I4/S2/S3.
- [discussion_r3096484641](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096484641) (`docker_context.go:90`) — Windows: named pipes vs `unix://`. Covered by I4.
- [discussion_r3096522025](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096522025) (`engine_reconciler.go:271`) — reviewer self-correction; effectively resolved.

---

## Agent Performance Retro

**[Claude]** Produced the most thorough, finely-segmented review: separated the `unix://` scheme bug from the Windows-specific Host construction (I2 vs I4), the "default" override (S2), the `DOCKER_CONFIG` env-var miss (S3), the empty-endpoint case (S4), the per-reconcile rewrite (S1), and the unscoped `logf.Log` logging (I5) — none of which Codex or Gemini surfaced individually. The cancel-race fix it proposed was the most defensive (gen-check + ctx-check + WaitGroup); the others stopped at the `ctx.Err()` check alone, which closes most but not all of the window. Verdict line was the only one to call out Windows production behavior as a blocker.

**[Codex]** Concise and correct. Caught the cancel-race and the `unix://` bug independently, and explicitly bundled the "default" override into the same finding rather than separating it — a defensible call. Did not surface the non-atomic write, the Windows compile-but-broken issue, or the logging-style deviation. The fix proposals were the smallest possible diffs; useful when the priority is "minimum surface to land the fix."

**[Gemini]** Caught the same two consensus bugs as the others *and* uniquely flagged the non-atomic `os.WriteFile` data-loss path (I3) — the only finding no other agent raised. Severity calibration was hot: marked all three findings as Critical, where the consolidator downgraded to Important (recoverable in all three cases without data permanently lost on the median user system). The atomic-write framing was the most actionable, with a concrete `tmp+rename` patch.

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 18m 27s | 3m 34s | 3m 02s |
| Findings | 5I 9S | 2I 1S | 3I |
| Tool calls | 47 (Grep 19, Bash 14, Read 9) | 21 (shell 20, stdin 1) | 7 (grep_search 3, run_shell_command 3, read_file 1) |
| Design observations | 5 | 3 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | 5 | 0 | 1 |
| Files reviewed | 4 | 4 | 4 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **5I 9S** | **2I 1S** | **3I** |
| Downgraded | 0 | 0 | 3 (I→S) |
| Dropped | 0 | 0 | 0 |


### Reconciliation

- Gemini C1 cancel-race → Important (I1). Recoverable by `docker context use default`; no persistent data loss. Same severity as Claude I1 / Codex I2.
- Gemini C2 hardcoded `unix://` → Important (I2). Same severity as Claude I2 / Codex I1; "silent override of working remote context" is a major usability bug but not data loss.
- Gemini C3 non-atomic write → Important (I3). Crash-window for partial write is real but narrow on the median user system; concurrent docker-CLI writer is plausible but not common. Atomic fix is trivial.

---

## Review Process Notes

**Skill improvements**

- *When PR code re-implements a well-known external library's on-disk format (e.g. Docker context store, kubeconfig, OCI image layout), call it out as a structural risk in Design Observations even if no individual bug is yet visible.* Hand-rolled re-implementations of stable third-party file formats tend to inherit a long tail of edge cases (platform-specific schemes, atomic-write semantics, default-value resolution, environment-variable overrides). Several findings against a small new package are a strong signal that the underlying design is reaching for an external contract; suggest the existing library as the alternative even when the immediate bugs are individually small.

**Repo context updates**

- *`pkg/controllers/app/engine/controllers/` uses scoped `log := logf.FromContext(ctx)` with `log.Error(err, msg, kv...)` for error sites and `log.Info(msg, kv...)` for normal events. Code that runs from goroutines launched by `Reconcile`/`SetupWithManager` should capture a derived logger (`logf.FromContext(parentCtx).WithName("…")`) at goroutine launch rather than fall back to the global `logf.Log`, which is unscoped and emits at info-level only.* Flag any new `logf.Log.Info(..., "error", err)` pattern in this directory.


---

## Appendix: Original Reviews

### [Claude]

Let me check the BATS harness for HOME isolation and understand the reconciler patterns better.

Now I have enough context to write the review. Let me verify a couple of last details before finalizing.

I have all the context I need. Writing the review now.

# Deep Review: PR #320 — Manage Docker CLI context on ContainerEngineReady transitions

**Verdict:** Request changes (one race, one URL-construction bug, Windows behavior undefined).

## Executive Summary

The change adds Docker CLI context management to the engine reconciler: creating/replacing a `rancher-desktop-<suffix>` context on `ContainerEngineReady=True`, probing the user's current context in a goroutine, and removing the context on the way back out. The generation-counter/cancel machinery is a thoughtful guard against superseded probes, but the probe goroutine can race with `removeDockerContext` to leave `currentContext` pointing at a just-deleted context, and the probe URL-construction silently breaks any user context whose `Host` is not `unix://` (tcp://, ssh://, npipe://). Windows is compiled-in but produces a context Docker CLI cannot open.

## Findings

### Critical Issues

None. The race in I1 is narrow and the URL bug in I2 only affects non-unix contexts; neither qualifies as CRITICAL by the review rubric.

### Important Issues

**I1. Probe goroutine races with `removeDockerContext` — can leave `currentContext` pointing at a deleted context** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:290-323, 326-343` (important, regression)

```go
go func() {
    defer func() {
        r.contextMu.Lock()
        if r.contextProbeGen == myGen {
            r.contextProbeCancel = nil
        }
        r.contextMu.Unlock()
        cancel()
    }()
    ...
    if !healthy {
        if err := setCurrentDockerContext(contextName); err != nil {
            logf.Log.Info("Failed to set current Docker context", ...)
        }
    }
}()
```

`removeDockerContext` (line 328) only cancels the probe context and drops the cancel slot; it never waits for the probe goroutine to exit. If the probe has already decided `healthy=false` at line 318 and is scheduled to call `setCurrentDockerContext` at line 319, the following interleaving is possible:

1. `removeDockerContext` locks `contextMu`, calls `r.contextProbeCancel()`, unlocks (lines 329–334).
2. `removeDockerContext` calls `clearCurrentDockerContext(ourName)` (line 337) — typically a no-op because `currentContext` may not yet be ours.
3. `removeDockerContext` calls `deleteDockerContext(ourName)` (line 340) — removes `~/.docker/contexts/meta/<hash>/`.
4. Probe runs `setCurrentDockerContext(contextName)` (line 319) — writes `currentContext = "rancher-desktop-<suffix>"` into `~/.docker/config.json`.

End state: `currentContext` names a context whose `meta.json` no longer exists, so every `docker ps`/`docker build` fails with "Current context … is not found." The user now has a broken Docker CLI until they manually `docker context use default`. Cancellation of `probeCtx` does not prevent this — `setCurrentDockerContext` does file I/O and never checks `ctx.Err()`.

Fix: make `removeDockerContext` block until the probe goroutine has returned. A `sync.WaitGroup` works:

```diff
 type EngineReconciler struct {
     ...
     contextMu sync.Mutex
+    // contextProbeWg tracks in-flight probe goroutines so removeDockerContext
+    // can wait for them before mutating the user's Docker config.
+    contextProbeWg sync.WaitGroup
     contextProbeCancel context.CancelFunc
     contextProbeGen int
 }

 func (r *EngineReconciler) manageDockerContext(socketPath string) {
     ...
+    r.contextProbeWg.Add(1)
     go func() {
-        defer func() {
+        defer r.contextProbeWg.Done()
+        defer func() {
             r.contextMu.Lock()
             ...

 func (r *EngineReconciler) removeDockerContext() {
     r.contextMu.Lock()
     if r.contextProbeCancel != nil {
         r.contextProbeCancel()
         r.contextProbeCancel = nil
     }
     r.contextMu.Unlock()
+    r.contextProbeWg.Wait()
     ...
```

The wait is bounded by `dockerContextProbeTimeout` (3s) because `probeCtx` gets cancelled before the wait.

---

**I2. `probeDockerContext` misconstructs non-unix URLs, silently overriding healthy tcp/ssh/npipe contexts** — `pkg/controllers/app/engine/controllers/docker_context.go:93-111, 157-167` (important, regression)

```go
func getDockerContextSocket(name string) (string, error) {
    ...
    host := meta.Endpoints["docker"].Host
    return strings.TrimPrefix(host, "unix://"), nil  // line 110
}

func probeDockerContext(ctx context.Context, socketPath string) bool {
    ...
    cli, err := dockerclient.New(dockerclient.WithHost("unix://" + socketPath))  // line 160
    ...
    _, err = cli.Ping(probeCtx, dockerclient.PingOptions{})
    return err == nil
}
```

If the user's current context has `Host: "tcp://foo:2375"` (or `ssh://`, `npipe://`), line 110 returns that string unchanged because the `"unix://"` prefix doesn't match. Line 160 then constructs `"unix://tcp://foo:2375"`, which `ParseHostURL` in moby/moby/client (`client/client.go:412`) cuts into `proto="unix"`, `addr="tcp://foo:2375"`. The dial then attempts to open a Unix socket at the literal path `tcp://foo:2375`, which fails.

Result: `healthy=false`, and line 319 in the reconciler silently promotes `rancher-desktop-<suffix>` over the user's working remote/npipe context. The user's `docker` CLI now talks to our VM instead of their remote daemon, with no log that explains why.

Fix: pass the full Host URL around instead of the socket path. Don't strip the scheme in `getDockerContextSocket`, and don't re-prepend in `probeDockerContext`:

```diff
-func getDockerContextSocket(name string) (string, error) {
+// getDockerContextHost returns the Docker Host URL (e.g. "unix:///path",
+// "tcp://host:port") that the named context points at. Returns an empty
+// string if the context does not exist or has no docker endpoint.
+func getDockerContextHost(name string) (string, error) {
     ...
-    host := meta.Endpoints["docker"].Host
-    return strings.TrimPrefix(host, "unix://"), nil
+    return meta.Endpoints["docker"].Host, nil
 }

-func probeDockerContext(ctx context.Context, socketPath string) bool {
+func probeDockerContext(ctx context.Context, host string) bool {
     probeCtx, cancel := context.WithTimeout(ctx, dockerContextProbeTimeout)
     defer cancel()
-    cli, err := dockerclient.New(dockerclient.WithHost("unix://" + socketPath))
+    cli, err := dockerclient.New(dockerclient.WithHost(host))
     ...
 }
```

Callers in `engine_reconciler.go:311-315` then pass `currentHost` straight through.

---

**I3. Error-path logs use `.Info` on the global logger instead of `.Error` on the scoped logger** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:275, 303, 313, 320, 338, 341` (important, gap)

```go
logf.Log.Info("Failed to create Docker context", "context", contextName, "error", err)
...
logf.Log.Info("Failed to read current Docker context", "error", err)
...
logf.Log.Info("Failed to resolve current Docker context socket", "context", current, "error", err)
...
logf.Log.Info("Failed to set current Docker context", "context", contextName, "error", err)
...
logf.Log.Info("Failed to clear current Docker context", "context", contextName, "error", err)
...
logf.Log.Info("Failed to delete Docker context", "context", contextName, "error", err)
```

Every other error site in the engine reconciler (e.g. lines 177, 187, and all of `docker_watcher.go`) uses `log.Error(err, msg, kv...)` where `log := logf.FromContext(ctx)`. The six new sites deviate on both dimensions:

1. `.Info` logs at info severity with an `"error"` field, so operators filtering on error-level miss these and log-format parsers do not promote the err into a dedicated field.
2. `logf.Log` is the logr package default — unscoped. The watcher context is primed with a named logger at `SetupWithManager` (`r.watcherCtx = ...logf.IntoContext(..., mgr.GetLogger().WithName("engine-watcher"))`), but the new code ignores it, so these lines have no controller name, no request key, and no goroutine name.

Fix: capture a scoped logger at goroutine launch time and use `.Error`:

```diff
 func (r *EngineReconciler) manageDockerContext(socketPath string) {
     contextName := instance.Name()
+    log := logf.FromContext(r.watcherCtx).WithName("docker-context")

     if err := createReplaceDockerContext(contextName, socketPath); err != nil {
-        logf.Log.Info("Failed to create Docker context", "context", contextName, "error", err)
+        log.Error(err, "Failed to create Docker context", "context", contextName)
         return
     }
     ...
     go func() {
         ...
         if err != nil {
-            logf.Log.Info("Failed to read current Docker context", "error", err)
+            log.Error(err, "Failed to read current Docker context")
         }
         ...
 }
```

Apply the same pattern to `removeDockerContext`.

---

**I4. Windows: hardcoded `unix://` scheme produces a context Docker CLI on Windows cannot use** — `pkg/controllers/app/engine/controllers/docker_context.go:68, 157-167` (important, gap)

```go
Endpoints: map[string]dockerEndpointMeta{
    "docker": {Host: "unix://" + socketPath},   // line 68
},
```

`instance.DockerSocket()` returns `%USERPROFILE%\.rd<suffix>\docker.sock` on Windows; the written `meta.json` has `Host: "unix://C:\Users\...\docker.sock"`, which Docker CLI on Windows cannot use (named pipes would be `npipe://./pipe/...`, TCP would be `tcp://localhost:port`). The BATS file skips Windows at line 15, so CI never notices, but `manageDockerContext` runs unconditionally in production regardless of GOOS. The reviewer asked this explicitly (comment on docker_context.go:90) and it is unresolved.

Fix: either (a) skip `manageDockerContext`/`removeDockerContext` on Windows by gating on `runtime.GOOS` until the named-pipe path lands, or (b) construct a platform-appropriate Host. Option (a) is smaller and consistent with the BATS skip:

```diff
 func (r *EngineReconciler) manageDockerContext(socketPath string) {
+    if runtime.GOOS == "windows" {
+        // TODO: Docker CLI on Windows uses named pipes; revisit once the
+        // WSL2 socket path is wired up.
+        return
+    }
     contextName := instance.Name()
     ...
 }
```

Either way, also add a `runtime.GOOS` guard to `removeDockerContext` so the Windows code does not attempt cleanup of a context it never created.

### Suggestions

**S1. `removeDockerContext` re-runs config.json I/O on every stopped-state reconcile** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:153-158` (suggestion, regression)

```go
if !wantWatcher {
    if watcherRunning {
        log.Info("Stopping Docker watcher", ...)
        r.stopWatcher()
    } else {
        r.removeDockerContext()    // line 157
    }
    ...
    alreadyClean := !watcherDied && current != nil && current.Reason == terminalReason && current.Status == terminalStatus
    if !alreadyClean {
        ...
    }
}
```

Once the engine is stopped, subsequent reconciles still take the `else` branch and call `removeDockerContext` → `clearCurrentDockerContext` → `updateDockerConfig`, which reads `config.json`, runs a no-op mutate, and rewrites the file. `updateDockerConfig` only skips the write when the file was absent (line 191) — it does not compare before/after content when the file existed. Every reconcile of a stopped engine therefore rewrites `~/.docker/config.json` and re-executes `RemoveAll` on a missing directory. Stable state should be idempotent-with-no-I/O.

Fix: move the call inside `!alreadyClean`, or make `updateDockerConfig` skip the write when the mutate produced identical JSON.

---

**S2. "default" context is unconditionally overridden without a probe** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:310` (suggestion, regression)

```go
if current != "" && current != "default" {
    currentSocket, err := getDockerContextSocket(current)
    ...
}
```

The code comment at line 307 asserts `"default"` and `""` both mean "no working context." But `default` is a real Docker context representing the OS-native socket (e.g. `/var/run/docker.sock` when Docker Engine is installed directly on Linux). On systems where that socket is actually running Docker, we silently override it without a liveness check — diverging from the PR description which says "probe the user's current context … if the current context is healthy, leave it alone."

The fix is either (a) probe `default` via `dockerclient.DefaultDockerHost` / `client.FromEnv` (most faithful to the design doc), or (b) keep the current behavior and update the docstring to be explicit about the policy choice. Flagging because the current comment is a design claim, not a rationale — a future reader has no way to tell this was intentional.

---

**S3. `DOCKER_CONFIG` environment variable is ignored** — `pkg/controllers/app/engine/controllers/docker_context.go:37-43` (suggestion, gap)

```go
func dockerConfigDir() (string, error) {
    home, err := os.UserHomeDir()
    if err != nil {
        return "", err
    }
    return filepath.Join(home, ".docker"), nil
}
```

Docker CLI respects `DOCKER_CONFIG` to relocate the config directory; users who set it will have the created context written to `~/.docker/…` while their CLI reads from `$DOCKER_CONFIG/…`, so the context is invisible.

```diff
 func dockerConfigDir() (string, error) {
+    if dir := os.Getenv("DOCKER_CONFIG"); dir != "" {
+        return dir, nil
+    }
     home, err := os.UserHomeDir()
     if err != nil {
         return "", err
     }
     return filepath.Join(home, ".docker"), nil
 }
```

---

**S4. Empty `docker` endpoint is indistinguishable from "context missing" in the probe flow** — `pkg/controllers/app/engine/controllers/docker_context.go:93-111` (suggestion, regression)

```go
host := meta.Endpoints["docker"].Host
return strings.TrimPrefix(host, "unix://"), nil
```

If `meta.json` exists but has no `docker` endpoint (kubernetes-only context, which Docker CLI does write), the map lookup returns a zero `dockerEndpointMeta{}` with `Host=""`. The caller at `engine_reconciler.go:314` then takes the `currentSocket != ""` false branch, treats `healthy=false`, and promotes ours — overriding a context the user is actively using, just one without a docker endpoint.

Same intent as I2's fix: return the raw `Host`, and in the caller treat `""` as "not probeable, assume healthy" or "treat as absent" explicitly.

---

**S5. BATS tests modify the developer's real `~/.docker/config.json` without HOME isolation** — `bats/tests/32-app-controllers/engine-docker.bats:425-494` (suggestion, regression)

```bash
@test "moby engine sets currentContext when no healthy context exists" {
    ...
    saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)

    jq 'del(.currentContext)' "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
        mv "${HOME}/.docker/config.json.tmp" "${HOME}/.docker/config.json"
    ...
    # Restore the original context if there was one.
    if [[ -n "${saved_context}" ]]; then
        jq --arg ctx "${saved_context}" ...
    fi
}
```

The tests operate on the real `$HOME` (no BATS helper overrides it — confirmed by `grep -R HOME bats/helpers` returning nothing). The "restore in teardown" inline block at lines 471–475 only runs when the test passes through that point — any `assert_output` failure above line 470 leaves the developer's `currentContext` deleted. There is no `teardown` function in the file.

Fix: either set `HOME` (or `DOCKER_CONFIG`) to a per-test temporary directory in `local_setup_file` (this also requires the S3 fix so the reconciler sees it), or use BATS's standard `teardown` to restore state unconditionally.

---

**S6. `$(rdd ...)` command substitution in BATS** — `bats/tests/32-app-controllers/engine-docker.bats:442, 453` (suggestion, regression)

```bash
socket_path=$(rdd svc paths docker_socket)     # line 442
saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)  # line 453
```

Project convention (stored repo-wide) forbids `$(cmd)` substitution when the result is needed for an assertion because failures inside `$(...)` produce opaque errors. Use `run -0 <cmd>; var=$output`. The reviewer flagged line 442 explicitly (`r3096228639`).

---

**S7. Typo "currentContextshouldhavebeenremoved"** — `pkg/controllers/app/engine/controllers/docker_context_test.go:102` (suggestion, regression)

```go
assert.Assert(t, !hasCtx, "currentContextshouldhavebeenremoved")
```

Missing spaces in the failure message.

```diff
-assert.Assert(t, !hasCtx, "currentContextshouldhavebeenremoved")
+assert.Assert(t, !hasCtx, "currentContext should have been removed")
```

---

**S8. No integration test exercises the healthy-current-context path** — `bats/tests/32-app-controllers/engine-docker.bats` (suggestion, gap)

The probe's central responsibility — "if the user's context is healthy, leave it alone" — has no integration test. The existing tests cover only the "unset → promote ours" and "stop → remove" branches. A test that stages a healthy alternative context (e.g. points to the same socket via a second context named `other`, sets it as current, waits, asserts `currentContext` is unchanged) would catch I2 regressions and cement the behavior documented in `bootstrap.md`.

---

**S9. `newDockerTestDir` returns `contextsDir` that no test reads** — `pkg/controllers/app/engine/controllers/docker_context_test.go:17-32` (suggestion, enhancement)

```go
type dockerTestPaths struct {
    configFile  string
    contextsDir string
}
```

Grep shows `contextsDir` is assigned but never read. It can be dropped or converted to a helper that computes `contextMetaPath` without needing `t.Setenv("HOME", ...)` gymnastics in `Test_createDockerContext`/`Test_deleteDockerContext`.

### Design Observations

**Concerns**

1. *Hand-rolled metadata format* — The code duplicates the Docker CLI's on-disk schema (file layout, sha256 hashing, JSON field tags). The reviewer already suggested `github.com/docker/cli/cli/context/store` (`r3096468463`). Using that package would (a) insulate us from any on-disk format change, (b) give us correct Windows behavior out of the box, and (c) cut ~60 LoC of hand-rolled JSON. The cost is the docker/cli dependency, which the project already transitively pulls in via moby. *(future)*

2. *Context setup belongs to the engine reconciler, not the App controller* — `docs/design/api_app.md:39` says "When the App is starting it creates the Docker context and sets up the kubeconfig." `docs/design/bootstrap.md:58` says the same. This PR implements the docker side in the engine reconciler instead, which is the right home (it owns `ContainerEngineReady`), but the design docs now lie. Either update the docs to reflect the split (kubeconfig stays with App, docker context moves to engine), or keep the docs as source-of-truth and move the code. *(in-scope)*

**Strengths**

- The generation-counter pattern (`contextProbeGen` + `myGen`) is a clean ABA guard: a superseded goroutine can't accidentally clear a newer probe's cancel slot. This is the right shape for the problem, even though I1 shows it doesn't cover the setCurrent-vs-remove race on config.json.
- Deriving `probeCtx` from `watcherCtx` rather than Reconcile's `ctx` is correct — the probe outlives a single Reconcile call but dies on manager shutdown.
- The `!watcherRunning` else branch at line 157 closes a real hole: a watcher that died between reconciles would otherwise leave the Docker context in place forever.

### Testing Assessment

Untested scenarios, ranked by risk:

1. **Probe-vs-remove race (I1):** No unit or integration test covers the window where a probe is in-flight when `removeDockerContext` runs. A test that sets a sentinel long-sleep probe and then triggers a stop could reproduce it.
2. **Non-unix current context (I2):** No test with `Host: "tcp://..."` or `npipe://...` in the user's current context. The probe behavior with these hosts is unverified.
3. **Healthy-context-skip (S8):** Core policy untested.
4. **Docker Desktop / `default` interaction (S2):** No test of a running `/var/run/docker.sock` or `desktop-linux` that we correctly leave alone.
5. **Repeated engine restarts:** The `moby → containerd → moby` transition exercises manageDockerContext twice; integration tests only hit it once per file.
6. **Windows:** Entirely skipped by `skip_on_windows`. See I4.

### Documentation Assessment

- `docs/design/bootstrap.md:58` and `docs/design/api_app.md:39` still attribute docker-context creation to the App controller and `rdd start`; this PR moves it to the engine reconciler. Either the doc must be updated, or this code belongs elsewhere (see Design Concern 2).
- `manageDockerContext`'s docstring at line 268 says "At most one probe runs at a time." That's mechanically true via the generation cancel, but the docstring should also note the cleanup race with `removeDockerContext` (I1) until that race is fixed.

### Commit Structure

Two commits: `001fd95` (code) + `ef5114d` (tests). The tests belong with the code they exercise — for a reviewer these should be one commit (same author/day, tests verify the code in the preceding commit). Not blocking but worth squashing before merge.

### Acknowledged Limitations

None — the author did not explicitly defer or dismiss any prior-round findings on this PR. All review comments on the PR are currently unresolved and appear below.

### Unresolved Feedback

All seven in-line review comments from @mook-as remain open and the current diff does not address them:

- `bats/tests/32-app-controllers/engine-docker.bats:436` — use `assert_file_exists` instead of `run -0 test -f` — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096224461. Still actionable.
- `bats/tests/32-app-controllers/engine-docker.bats:442` — use `run -0 rdd service paths docker_socket` + `local socket_path=${output}` — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096228639. Overlaps with S6 above.
- `bats/tests/32-app-controllers/engine-docker.bats:457` — consider `yq --inplace` — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096335657. Style; optional if S5 (HOME isolation) lands and obviates the edit/restore dance.
- `bats/tests/32-app-controllers/engine-docker.bats:465` — "ContainerEngineReady shouldn't be set until the context is set up correctly" — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096360169. The PR description argues the probe is async to avoid blocking Reconcile; `createReplaceDockerContext` (the synchronous part) does run before the condition flips to `Connected`. But the condition flipping to True while the probe hasn't yet decided whether to promote is what the comment really flags — a concern that overlaps with S8 (test for the steady-state). Still actionable as either a docstring clarification or a code change to gate the condition on probe completion.
- `bats/tests/32-app-controllers/engine-docker.bats:489` — precompute `context_dir`, use plain `test` instead of `bash -c`, prefer `assert_dir_not_exists` — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096439251. All three still apply.
- `pkg/controllers/app/engine/controllers/docker_context.go:26` — use docker/cli `ContextStore` — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096468463. See Design Concern 1.
- `pkg/controllers/app/engine/controllers/docker_context.go:90` — Windows: named pipes vs default path — https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096484641. Overlaps with I4 above; the question has not been answered in code.

The eighth comment at `engine_reconciler.go:271` (https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096522025) is the reviewer self-correcting a misreading of the PR description and is effectively resolved; no action required.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — S5, S6, S8; plus five unresolved review comments (lines 436, 442, 457, 465, 489).
- `pkg/controllers/app/engine/controllers/docker_context.go` — I2, I4, S3, S4; plus two unresolved review comments (lines 26, 90).
- `pkg/controllers/app/engine/controllers/docker_context_test.go` — S7, S9.
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I1, I3, S1, S2.
### [Codex]

### Executive Summary

This PR adds Docker CLI context creation/removal around `ContainerEngineReady` and a background probe that may promote the Rancher Desktop context to the user's active context. The main correctness issues are in the probe path: it overwrites healthy existing Docker contexts that are not unnamed Unix-socket contexts, and its cancellation path can still re-promote a deleted Rancher Desktop context after the engine has already stopped.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Healthy `default` / non-Unix Docker contexts are treated as broken and get overwritten** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:306-319` (`important, regression`)

```go
// Probe the current context's socket — not our own — to decide
// whether to promote ours. "default" and empty both mean no
// working context is set; treat them identically.
var healthy bool
if current != "" && current != "default" {
	currentSocket, err := getDockerContextSocket(current)
	...
		healthy = probeDockerContext(probeCtx, currentSocket)
}
if !healthy {
	if err := setCurrentDockerContext(contextName); err != nil {
```

At `engine_reconciler.go:307-310`, the new logic explicitly treats `"default"` as equivalent to “no context”, so a healthy default Docker daemon is always replaced. For named contexts, `getDockerContextSocket()` at `docker_context.go:109-110` strips only the `unix://` prefix, and `probeDockerContext()` at `docker_context.go:160` then unconditionally prepends `unix://` again. Any valid Docker context using `tcp://`, `ssh://`, or Windows `npipe://` therefore probes as `unix://tcp://...` / `unix://ssh://...` and is marked unhealthy. That violates the PR contract of “leave the current context alone if it is healthy” and will switch users away from working remote or default contexts.

Fix:

```diff
- if current != "" && current != "default" {
-     currentSocket, err := getDockerContextSocket(current)
+ if current != "" {
+     endpoint, err := resolveDockerEndpoint(current)
      ...
-     } else if currentSocket != "" {
-         healthy = probeDockerContext(probeCtx, currentSocket)
+     } else if endpoint != "" {
+         healthy = probeDockerContext(probeCtx, endpoint)
      }
  }
```

Use Docker's own context resolution rules for `"default"` and preserve the full endpoint URI when probing instead of hard-coding `unix://`.

I2. **Cancelling an in-flight probe does not stop it from re-selecting the deleted context** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:318-320` (`important, regression`)

```go
if !healthy {
	if err := setCurrentDockerContext(contextName); err != nil {
		logf.Log.Info("Failed to set current Docker context", "context", contextName, "error", err)
	}
}
```

`removeDockerContext()` cancels the probe at `engine_reconciler.go:329-333`, clears `currentContext` at `engine_reconciler.go:337-339`, and deletes the context directory at `engine_reconciler.go:340-341`. But the goroutine launched by `manageDockerContext()` never checks whether its `probeCtx` was cancelled before calling `setCurrentDockerContext()` at `engine_reconciler.go:318-320`. If the engine stops while a probe is blocked in `probeDockerContext()` (`docker_context.go:157-166`), or even just between reading the current context at `engine_reconciler.go:301` and writing the new one at `engine_reconciler.go:319`, the canceled goroutine can still rewrite `~/.docker/config.json` to point at the Rancher Desktop context after that context was already deleted. That is exactly the stale-promotion race the issue description said cancellation must prevent.

Fix:

```diff
- if !healthy {
+ if probeCtx.Err() != nil {
+     return
+ }
+ if !healthy {
      if err := setCurrentDockerContext(contextName); err != nil {
          ...
      }
  }
```

Also gate the write on the probe generation (`myGen`) so only the most recent, still-active probe is allowed to mutate `currentContext`.

#### Suggestions

S1. **The new BATS tests bypass the repo’s `run -0` pattern and do not guarantee config restoration on failure** — `bats/tests/32-app-controllers/engine-docker.bats:453-474` (`suggestion, regression`)

```bash
saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)

jq 'del(.currentContext)' "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
    mv "${HOME}/.docker/config.json.tmp" "${HOME}/.docker/config.json"
...
try --max 6 --delay 1 -- \
    bash -c "jq -r '.currentContext' '${HOME}/.docker/config.json' | grep -qx '${context_name}'"
```

The new tests use command substitution at `engine-docker.bats:453`, direct shell pipelines at `engine-docker.bats:456-457` / `472-474`, and `bash -c` wrappers at `engine-docker.bats:465` / `486`. In this repo's BATS conventions, that makes failures harder to diagnose and skips the normal `$output` capture path. It also restores `currentContext` only at the end of the test body, so an earlier assertion failure can leak mutations into later tests or the local `~/.docker/config.json`.

Fix:

```diff
- saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
+ run -0 jq -r '.currentContext // empty' "${HOME}/.docker/config.json"
+ local saved_context=${output}
...
- try --max 6 --delay 1 -- \
-     bash -c "jq -r '.currentContext' '${HOME}/.docker/config.json' | grep -qx '${context_name}'"
+ try --max 6 --delay 1 -- jq -e --arg ctx "${context_name}" \
+     '.currentContext == $ctx' "${HOME}/.docker/config.json"
```

Move the restore logic into `teardown`/`local_teardown` so it runs even when the test aborts early.

### Design Observations

**Concerns**

1. **Hand-rolling Docker context storage duplicated Docker CLI behavior and introduced both controller bugs** `(in-scope)`  
The new code reimplements context directory hashing, metadata parsing, default-context resolution, and config-file mutation in [`docker_context.go`]. Using Docker's own context store and default-endpoint resolution would remove the need to guess how `"default"` works, avoid hard-coding `unix://`, and reduce the surface area for future context-format drift.

**Strengths**

1. Reusing the reconciler's watcher-scoped context for probe lifetime is the right shape: it aligns the probe with manager shutdown instead of individual `Reconcile()` calls.
2. The controller continues to use conflict-retrying status updates on `App.Status.Conditions`, which is consistent with the documented multi-writer contract in `docs/design/api_app.md`.

### Testing Assessment

1. No test covers a healthy `"default"` Docker context, so the unconditional takeover in I1 ships green.
2. No test covers a healthy named context with a non-`unix://` endpoint (`tcp://`, `ssh://`, `npipe://`), so the scheme-mangling path in I1 is untested.
3. No test covers `running=false` / backend switch while the probe is still in flight, so the stale-promotion race in I2 is untested.
4. No test exercises failure to create or update the Docker context files, so the current “log and continue” behavior is not validated against the intended readiness semantics.

### Documentation Assessment

`getDockerContextSocket()` is documented as returning a “Unix socket path” in `pkg/controllers/app/engine/controllers/docker_context.go:90-92`, but the caller now uses it to evaluate arbitrary user-selected Docker contexts. Once I1 is fixed, that comment should be updated to describe generic endpoint resolution instead of Unix-only sockets.

### Unresolved Feedback

1. `https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096468463`  
Still actionable. The suggestion to use Docker's `ContextStore` was not addressed, and the current hand-written store logic is the source of the correctness problems in I1.

2. `https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096360169`  
Still actionable, but lower priority than I1/I2. `manageDockerContext()` logs and returns on context-creation failure at `engine_reconciler.go:274-276`, yet `Reconcile()` still reports `ContainerEngineReady=True` later in the same pass.

3. `https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096228639`  
Covered by S1. The test still uses command substitution instead of `run -0` / `$output`.

4. `https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096439251`  
Covered by S1. The test still shells out through `bash -c` and recalculates `context_dir` instead of using the simpler BATS-native assertions.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — S1
- `pkg/controllers/app/engine/controllers/docker_context.go` — I1
- `pkg/controllers/app/engine/controllers/docker_context_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I1, I2

### [Gemini]

### Executive Summary

This PR introduces automated Docker CLI context management that synchronizes the user's `currentContext` with the Rancher Desktop Daemon engine state. While the lifecycle hooks correctly tie into `ContainerEngineReady`, the custom context parsing and modification logic contains severe bugs: it races to resurrect deleted contexts upon cancellation, overwrites non-unix contexts, and risks data loss by continuously truncating `~/.docker/config.json` in the background.

### Findings

#### Critical Issues

**C1. Probe cancellation race leads to context hijacking** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:303-317` (critical, regression)

When the engine stops, the controller invokes `r.removeDockerContext()`. This function deletes the context directory, clears it from the user's `~/.docker/config.json`, and cancels the running probe goroutine's context (`contextProbeCancel`). However, the goroutine ignores `probeCtx.Err()`: the cancellation causes the `cli.Ping` probe to fail, which the goroutine misinterprets as an "unhealthy" context, prompting it to call `setCurrentDockerContext(contextName)`. This races with the cleanup and effectively resurrects the deleted context as the active default, entirely breaking the user's Docker CLI.

```diff
 		if !healthy {
+			if probeCtx.Err() != nil {
+				return // probe was cancelled, abort promotion
+			}
 			if err := setCurrentDockerContext(contextName); err != nil {
 				logf.Log.Info("Failed to set current Docker context", "context", contextName, "error", err)
 			}
```

**C2. Hardcoded unix scheme breaks non-unix Docker contexts** — `pkg/controllers/app/engine/controllers/docker_context.go:90-109` (critical, regression)

`getDockerContextSocket` strips the `unix://` prefix but leaves other schemes (e.g., `tcp://`, `ssh://`, `npipe://`) intact. The probe then unconditionally prepends `unix://` when building the client. If the user's existing context points to a remote TCP endpoint, the probe constructs an invalid URI (`unix://tcp://...`), fails to connect, and erroneously concludes the user's valid context is dead. It then steals the default context setting. 

```diff
-func getDockerContextSocket(name string) (string, error) {
+func getDockerContextHost(name string) (string, error) {
 	path, err := contextMetaPath(name)
 	if err != nil {
-		return "", err
+		return "", err
 	}
 	// ... unmarshal logic ...
 	host := meta.Endpoints["docker"].Host
-	return strings.TrimPrefix(host, "unix://"), nil
+	return host, nil
 }

-func probeDockerContext(ctx context.Context, socketPath string) bool {
+func probeDockerContext(ctx context.Context, hostURI string) bool {
 	probeCtx, cancel := context.WithTimeout(ctx, dockerContextProbeTimeout)
 	defer cancel()
-	cli, err := dockerclient.New(dockerclient.WithHost("unix://" + socketPath))
+	cli, err := dockerclient.New(dockerclient.WithHost(hostURI))
```
*(Update `manageDockerContext` to use the new return value accordingly).*

**C3. updateDockerConfig unconditionally truncates user config** — `pkg/controllers/app/engine/controllers/docker_context.go:171-197` (critical, regression)

`updateDockerConfig` parses the user's `~/.docker/config.json`, mutates it, and unconditionally overwrites it via `os.WriteFile`, which opens the file with `O_TRUNC`. Because `r.removeDockerContext()` is executed on *every* reconcile when the App is stopped or using a different engine, the controller parses, truncates, and rewrites the user's configuration file in a continuous background loop. Any crash immediately after truncation results in irrecoverable loss of the user's auth tokens, and the continuous writing creates a massive race condition against the user's manual Docker CLI usage.

```diff
-func updateDockerConfig(path string, mutate func(map[string]any)) error {
+func updateDockerConfig(path string, mutate func(map[string]any) bool) error {
 	// ... read/unmarshal logic ...
 
-	before := len(cfg)
-	mutate(cfg)
-
-	// file didn't exist and there is nothing left nothing to write.
-	if notExist && len(cfg) == 0 && before == 0 {
+	if changed := mutate(cfg); !changed {
 		return nil
 	}
 
 	out, err := json.MarshalIndent(cfg, "", "\t")
 	if err != nil {
 		return err
 	}
 	if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
 		return err
 	}
-	return os.WriteFile(path, append(out, '\n'), 0o600)
+	tmpPath := path + ".tmp"
+	if err := os.WriteFile(tmpPath, append(out, '\n'), 0o600); err != nil {
+		return err
+	}
+	// On Windows, os.Rename fails if the destination exists.
+	_ = os.Remove(path)
+	if err := os.Rename(tmpPath, path); err != nil {
+		os.Remove(tmpPath)
+		return err
+	}
+	return nil
 }
```
*(Update the callbacks in `setCurrentDockerContext` and `clearCurrentDockerContext` (as well as tests) to return a `bool` representing whether a change actually occurred).*

#### Important Issues

**I1. BATS tests use forbidden command substitution** — `bats/tests/32-app-controllers/engine-docker.bats:434-453` (important, regression)

The test additions extract values using command substitution (`$(rdd ...)`, `$(jq ...)`), which violates the repository's testing requirements. If the inner command fails, it cascades into opaque BATS errors without logging the actual failure output.

```diff
-    local socket_path
-    socket_path=$(rdd svc paths docker_socket)
+    run -0 rdd svc paths docker_socket
+    local socket_path="${output}"

-    saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
+    run jq -r '.currentContext // empty' "${HOME}/.docker/config.json"
+    local saved_context="${output}"
```

#### Suggestions
None.

### Design Observations

**Concerns**
- **Reinventing Context Management (future):** The PR implements a custom JSON parser for Docker context generation and lookup. The official library at `github.com/docker/cli/cli/context/store` provides a rigorous, heavily-tested `ContextStore`. Adopting this upstream package would bypass the risks of unsafe atomic writes on Windows, URI-scheme parsing nuances (`unix` vs `npipe` vs `tcp`), and edge cases handled natively by Docker CLI.

### Testing Assessment

1. **Context Cancellation:** There are no tests validating the race condition described in **C1**. A test should assert that when the `Running` condition transitions away quickly, `config.json` doesn't get populated with the aborted context name.
2. **Non-Unix Existing Context:** Tests assume the default or empty `currentContext`. A scenario configuring `tcp://127.0.0.1:2375` before the engine turns ready would have caught the `unix://` scheme injection bug in **C2**.

### Documentation Assessment

No issues found.

### Acknowledged Limitations

None.

### Unresolved Feedback

The following PR review comments from `@mook-as` were unaddressed and remain actionable:
- Using `assert_file_exists` instead of `run -0 test -f`: [Comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096224461)
- Using `run -0 rdd` instead of `$()`: [Comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096228639) (Overlaps with **I1**)
- Using `yq --inplace` instead of redirection/move: [Comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096335657)
- `ContainerEngineReady` shouldn't be set until the context is correctly set up: [Comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096360169)
- Using the `test` executable and `assert_dir_not_exists`: [Comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096439251)
- Consideration for `github.com/docker/cli` context store instead of custom JSON processing: [Comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096468463)
- Hardcoded `unix://` on Windows: [Comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/320#discussion_r3096484641) (Overlaps with **C2**)

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats`: I1
- `pkg/controllers/app/engine/controllers/docker_context.go`: C2, C3
- `pkg/controllers/app/engine/controllers/docker_context_test.go`: C3
- `pkg/controllers/app/engine/controllers/engine_reconciler.go`: C1, C3
