# Deep Review: 20260421-132014-pr-333

| | |
|---|---|
| **Date** | 2026-04-21 13:20 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 1 |
| **Author** | [@Nino-K](https://github.com/Nino-K) |
| **PR** | [#333](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/333) — Docker Engine: improve Docker context probe safety and isolation |
| **Commits** | `179c097` Docker Engine: improve Docker context probe safety and isolation |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — the BATS test for the "no healthy context" path becomes unreliable on hosts with an active Docker daemon, and a handful of cleanups (dead helper, silent error paths, tmpdir scope) are worth addressing before merge. |
| **Wall-clock time** | `29 min 10 s` |

## Executive Summary

The PR replaces `getDockerContextHost` + hand-rolled `dockerclient.WithHost(host)` probing with a docker/cli-native `probeNamedDockerContext` that goes through `ContextStore`, `WithTLSData`, and `EndpointMeta.ClientOpts()`. This fixes four false-override cases enumerated in the PR description (TCP+TLS, mTLS, SkipTLSVerify, SSH) and adds an early return when `config.json` is unreadable so a corrupt file is never round-tripped.

The core change is a solid correctness improvement. The remaining issues cluster around two themes:

1. **Test semantics drifted.** The new `probeDockerContext()` uses `dockerclient.FromEnv`, which falls back to the platform default socket. The test `moby engine sets currentContext when no healthy context exists` no longer guarantees the precondition in its own name: on a developer machine or a CI runner with Docker pre-installed, the default socket *does* respond and the reconciler correctly refrains from overriding — but the test then times out waiting for an override that never comes.
2. **Dead and under-logged code.** `getDockerContextHost` is now only referenced by its own unit tests, and the five new `return true` bailouts in `probeNamedDockerContext` give operators no breadcrumb when RDD silently declines to promote its context.

All three agents agreed on the core test-reliability issue (framed as either a test bug or a production-semantics gap). One agent raised a WaitGroup-race concern that, on closer inspection, is prevented by the existing `watcherMu` ordering; it is reconciled out below.

Report structure: 0 Critical, 1 Important, 5 Suggestions, 1 Design Concern, 2 Design Strengths.

---

## Critical Issues

None.

## Important Issues

I1. **`moby engine sets currentContext when no healthy context exists` is flaky wherever the host default Docker socket responds** — `bats/tests/32-app-controllers/engine-docker.bats:606-619` [Claude Opus 4.7] [Codex GPT 5.4] [Gemini 3.1 Pro]

```bash
@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"

    rdd set running=false
    rdd set running=true containerEngine.name=moby
    rdd ctl wait --for=condition=ContainerEngineReady app/app --timeout=30s

    # The probe goroutine runs asynchronously; give it a moment.
    try --max 6 --delay 1 -- \
        bash -c "jq -r '.currentContext' '${DOCKER_CONFIG}/config.json' | grep -qx '${context_name}'"

    run -0 jq -r '.currentContext' "${DOCKER_CONFIG}/config.json"
    assert_output "${context_name}"
}
```

Before this PR, when `currentContext` was `""` or `"default"`, the reconciler skipped probing and treated the default as unhealthy, so the override always fired. After this PR, `probeDockerContext` calls `pingDocker(ctx, dockerclient.FromEnv)`, which falls back to `/var/run/docker.sock` on macOS/Linux when `DOCKER_HOST` is unset. The BATS `local_setup_file` exports `DOCKER_HOST` *after* `rdd set running=true` has already spawned the service (`engine-docker.bats:29-31`), and `service.Start` passes no explicit `cmd.Env`, so the service inherits whatever the BATS shell had at spawn — typically nothing. On any host with a reachable default socket (GitHub-hosted `ubuntu-latest` runners ship Docker pre-installed and running; developer machines with Docker Desktop), the probe returns healthy, `setCurrentDockerContext` never fires, and the `try` block times out.

The production side of the same finding: `rdd set` reuses an already-running service via `cmd/rdd/service.go:73-87`, so the daemon's `DOCKER_HOST`/`DOCKER_CERT_PATH`/`DOCKER_TLS_VERIFY` environment is frozen at the moment the user first ran `rdd svc start`. Any subsequent shell-side change is invisible to the probe, and the common "service already running, toggle App back to moby" path consults stale values.

Fix (test): seed a deliberately unreachable named context and mark it current, so the reconciler takes the `probeNamedDockerContext` branch with a guaranteed-failing endpoint independent of what the host's default socket does:

```diff
 @test "moby engine sets currentContext when no healthy context exists" {
     local context_name="rancher-desktop-${RDD_INSTANCE}"

+    # Force the probe to fail regardless of whether the host has a running
+    # Docker daemon at the default socket. Seed a named context pointing at
+    # an unreachable socket and make it current.
+    local probe_target="test-unreachable"
+    local probe_hash
+    if command -v sha256sum &>/dev/null; then
+        probe_hash=$(printf '%s' "${probe_target}" | sha256sum | awk '{print $1}')
+    else
+        probe_hash=$(printf '%s' "${probe_target}" | shasum -a 256 | awk '{print $1}')
+    fi
+    mkdir -p "${DOCKER_CONFIG}/contexts/meta/${probe_hash}"
+    cat >"${DOCKER_CONFIG}/contexts/meta/${probe_hash}/meta.json" <<EOF
+{"Name":"${probe_target}","Endpoints":{"docker":{"Host":"unix:///nonexistent/does-not-exist.sock","SkipTLSVerify":false}}}
+EOF
+    printf '{"currentContext":"%s"}\n' "${probe_target}" >"${DOCKER_CONFIG}/config.json"
+
     rdd set running=false
     rdd set running=true containerEngine.name=moby
```

Fix (production, optional): decide whether `""`/`"default"` should ever be probed from inside a long-lived daemon. If the stale-env risk outweighs the "don't stomp on a healthy default daemon" goal, treat `"default"` as healthy and only probe named contexts. The current behavior is a reasonable middle ground in the common case — the service usually starts from the user's shell — but it deserves an explicit decision.

## Suggestions

S1. **Remove the now-dead `getDockerContextHost` helper** — `pkg/controllers/app/engine/controllers/docker_context.go:81-103` [Claude Opus 4.7] [Gemini 3.1 Pro]

```go
// getDockerContextHost returns the full Docker host URL (e.g. "unix:///path/to/docker.sock"
// or "tcp://192.168.1.1:2376") for the named context's docker endpoint.
// Returns an empty string if the context does not exist or has no docker endpoint.
func getDockerContextHost(name string) (string, error) {
```

After the probe-rewrite, `engine_reconciler.go` calls `probeNamedDockerContext` directly. A repo-wide grep shows `getDockerContextHost` now appears only in `docker_context.go` (the definition) and four spots in `docker_context_test.go` — the function survives solely to support its own unit tests.

Fix: delete `getDockerContextHost`; rewrite the three existing tests that use it (`Test_createReplaceDockerContext`, `Test_deleteDockerContext`, `Test_getDockerContextHost_missing`) to fetch the endpoint via `newContextStore` + `docker.EndpointFromContext`, or collapse them into a single `probeNamedDockerContext` unit test that exercises the same paths end-to-end.

S2. **Log the silent `return true` bailouts in `probeNamedDockerContext`** — `pkg/controllers/app/engine/controllers/docker_context.go:178-208` [Claude Opus 4.7]

```go
func probeNamedDockerContext(ctx context.Context, name string) bool {
    s, err := newContextStore()
    if err != nil {
        return true
    }
    md, err := s.GetMetadata(name)
    if err != nil {
        return !errdefs.IsNotFound(err)
    }
    epMeta, err := docker.EndpointFromContext(md)
    if err != nil {
        return true
    }
    ...
    ep, err := docker.WithTLSData(s, name, epMeta)
    if err != nil {
        return true
    }
    opts, err := ep.ClientOpts()
    if err != nil {
        return true
    }
```

Four error paths plus the non-NotFound branch all return `true` without logging. The "assume healthy" posture is the right default — the function must not stomp on a context it cannot decode — but an operator debugging "why does RDD refuse to become the default context?" gets no signal. A single V(1) log per bailout preserves the hot-path silence while making diagnosis tractable.

Fix: take a `logr.Logger` (or pull it from the context via `logf.FromContext`) and log the underlying `err` and `name` at V(1) before each `return true`. The caller in `manageDockerContext` already carries a `log := logf.FromContext(r.watcherCtx).WithName("docker-context")`; thread it through or have the probe accept a logger.

S3. **`probeNamedDockerContext` treats an undecodable Docker endpoint as healthy, broader than the comment promises** — `pkg/controllers/app/engine/controllers/docker_context.go:183-195` [Codex GPT 5.4]

```go
md, err := s.GetMetadata(name)
if err != nil {
    return !errdefs.IsNotFound(err)
}
epMeta, err := docker.EndpointFromContext(md)
if err != nil {
    return true
}
scheme, _, _ := strings.Cut(epMeta.Host, "://")
```

The comment on the function says the optimistic return covers "non-tcp/unix endpoints (e.g. SSH), or with TLS config we can't load." `EndpointFromContext` upstream (`github.com/docker/cli/.../load.go:153-160`) returns an error when the context has no Docker endpoint at all or when the endpoint metadata fails to decode. That broader case — a malformed or missing endpoint — is not SSH-or-TLS; it is a context that `docker` itself cannot use. Treating it as healthy prevents RDD from helping a user whose Docker config is actually broken.

The counter-argument is conservatism: a partially corrupted config may be mid-edit, and stomping on it is worse than leaving it alone. Either position is defensible; the point is that the current code is more permissive than its docstring claims.

Fix: pick a stance and align the code with the comment. Either narrow the `return true` to the specific non-tcp/unix + TLS-load error paths, or broaden the comment to "any error decoding the context is treated as healthy to avoid clobbering a partially edited config."

S4. **Prefer `BATS_FILE_TMPDIR` over `BATS_SUITE_TMPDIR` for per-file isolation** — `bats/tests/32-app-controllers/engine-docker.bats:21` [Claude Opus 4.7]

```bash
export DOCKER_CONFIG="${BATS_SUITE_TMPDIR}/docker-config"
```

Every other test file in the repo that needs a tmpdir uses `BATS_FILE_TMPDIR` (see `bats/tests/34-containers-controllers/containers-mock.bats:20`, `bats/tests/31-rdd-controllers/crd-upgrade.bats:16`). `BATS_SUITE_TMPDIR` is shared across every file in the suite, so any future test file that also touches `DOCKER_CONFIG` would collide.

Fix:

```diff
-    export DOCKER_CONFIG="${BATS_SUITE_TMPDIR}/docker-config"
+    export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
```

S5. **TODO(windows) comment lists a file that does not need Windows-specific work** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:270-274` [Claude Opus 4.7]

```go
// TODO(windows): Docker context management uses unix:// sockets; Windows
// requires npipe:// in the context meta file, an npipe-aware probe client,
// and platform-specific path handling in getCurrentDockerContext and
// createReplaceDockerContext. Track in a follow-up issue before shipping
// Windows support.
```

`getCurrentDockerContext` delegates to `docker/cli`'s `config.Load`, which already handles `USERPROFILE` on Windows via `os.UserHomeDir()`. The Windows-specific work concentrates in `createReplaceDockerContext` (scheme must be `npipe://`, socket path formatting differs) and in the probe client. Dropping `getCurrentDockerContext` from the list keeps the TODO honest.

Fix:

```diff
-    // TODO(windows): Docker context management uses unix:// sockets; Windows
-    // requires npipe:// in the context meta file, an npipe-aware probe client,
-    // and platform-specific path handling in getCurrentDockerContext and
-    // createReplaceDockerContext. Track in a follow-up issue before shipping
-    // Windows support.
+    // TODO(windows): Docker context management uses unix:// sockets; Windows
+    // requires npipe:// in the context meta file written by
+    // createReplaceDockerContext, and an npipe-aware probe client. Track in
+    // a follow-up issue before shipping Windows support.
```

## Design Observations

### Concerns

**Probing "default" couples a durable control-plane decision to ephemeral shell state** `(in-scope)` [Codex GPT 5.4]

Named contexts are on-disk state the daemon can read deterministically; `"default"` is partly process environment (`DOCKER_HOST`, `DOCKER_CERT_PATH`, `DOCKER_TLS_VERIFY`). The current design probes both from a long-lived reconciler, but the reconciler only sees the environment it inherited at spawn. Moving the "default" decision into the CLI/startup path — or persisting the effective Docker env alongside the App request — would eliminate the entire stale-environment class of bugs, including the test flake in I1.

### Strengths

**Docker CLI primitives are the right abstraction** `(in-scope)` [Claude Opus 4.7] [Codex GPT 5.4] [Gemini 3.1 Pro]

`ContextStore`, `WithTLSData`, and `EndpointMeta.ClientOpts()` already know how to assemble an HTTP client for every endpoint shape the docker CLI supports (unix, tcp, tcp+TLS, mTLS, SkipTLSVerify). Delegating to them retires four false-override classes enumerated in the PR description and keeps RDD aligned with Docker's own context format going forward.

**Two-function split makes the "did we load TLS?" question unmissable** `(in-scope)` [Claude Opus 4.7]

The old single `probeDockerContext(ctx, host)` silently called `dockerclient.WithHost(host)` even when TLS was required. Splitting into `probeDockerContext(ctx)` (FromEnv-backed default probe) and `probeNamedDockerContext(ctx, name)` (store-backed named probe) makes the intent at each call site obvious and forecloses the old silent-WithHost failure mode.

## Testing Assessment

Unit tests in `docker_context_test.go` cover the context-store read/write surface; the BATS suite exercises the reconciler end-to-end against a real Docker daemon inside Lima. Residual gaps, ranked by risk:

1. **No unit coverage for the TCP+TLS, mTLS, or SkipTLSVerify probe paths** — the whole motivation for the PR. A fake context store seeded with TLS data, probed against an unreachable TLS endpoint, would verify the TLS config is actually constructed (not silently dropped as in the old `WithHost(host)` call).
2. **No unit coverage for the SSH/npipe short-circuit at lines 191-196.** A table-driven test that seeds a context with `Host: "ssh://u@h"` and asserts `probeNamedDockerContext` returns `true` without attempting a ping would pin the new behavior.
3. **The "default probed via `FromEnv`" behavior change has no dedicated unit test.** Indirect coverage via I1's BATS case is too fragile. Injecting `DOCKER_HOST` and asserting `probeDockerContext` returns false for an unreachable default would lock the semantic in.
4. **No unit test for the early-return on `getCurrentDockerContext` error.** `Test_getCurrentDockerContext_malformedAuth` verifies the error surfaces from the helper, but not that the reconciler refuses to call `setCurrentDockerContext` when the config is malformed.
5. **No regression test exercises the `EndpointFromContext` failure branch** (see S3). A context seeded with a missing or malformed Docker endpoint would guard the "healthy vs unhealthy" decision whichever way S3 lands.

## Documentation Assessment

The new docstrings on `pingDocker`, `probeDockerContext`, and `probeNamedDockerContext` are accurate and concise. The `probeNamedDockerContext` comment overstates its scope slightly (see S3). TODO(windows) comments are helpful but the scope list in one of them is inaccurate (see S5).

## Commit Structure

Single commit, scope matches the title. The commit message enumerates four independent improvements (TCP+TLS, mTLS, SkipTLSVerify, SSH) that together form a coherent "probe safety" unit. Test isolation via `DOCKER_CONFIG` is bundled into the same commit, which is defensible: the new probe paths rely on a predictable context store, and pinning the store location in tests keeps both changes verifiable together.

## Acknowledged Limitations

- **Windows Docker context management is still deferred.** Both `manageDockerContext` and `removeDockerContext` early-return on `runtime.GOOS == "windows"` and carry `TODO(windows)` comments pointing at the gap. This PR makes the limitation more explicit but does not materially worsen it.
- **SSH/npipe endpoints are trusted implicitly.** `probeNamedDockerContext` returns `true` for any scheme that is not `unix` or `tcp`. The docstring acknowledges the tradeoff: the controller cannot authoritatively probe these endpoints, so it accepts the risk of a dead user-configured context in exchange for not overriding a user's legitimate choice.

---

## Agent Performance Retro

### [Claude]

Produced the most thorough pass: caught the exact failing BATS test by name, traced the failure through `service.Start`'s missing `cmd.Env` into `dockerclient.FromEnv`'s platform-default fallback, and supplied the cleanest fix (seed an unreachable named context) instead of the more invasive option of rewriting service env plumbing. Unique insights: the silent `return true` bailouts (S2), the `BATS_FILE_TMPDIR` convention mismatch (S3), and the TODO comment that lists `getCurrentDockerContext` despite that helper not needing Windows work (S5). Coverage Summary was complete and cross-references finding IDs cleanly.

### [Codex]

Framed the I1 finding as a production-semantics concern (stale daemon env) rather than a test bug, which turned out to be the complementary half of the same issue — once the test is fixed, the production concern still stands. Raised S3 as the only agent to notice that `EndpointFromContext` errors extend beyond the SSH/TLS cases the docstring claims. Missed the unused `getDockerContextHost` helper that both other agents caught. Ran six test commands to validate the passing unit-test suite, which added confidence at modest cost.

### [Gemini]

Caught I1 (test flakiness) and S1 (unused helper) in parallel with the other agents. Its I2 (WaitGroup misuse race) does not hold up under inspection: `manageDockerContext` calls `Add(1)` inside `startWatcherAndSync`, which holds `watcherMu` throughout. The shutdown hook's `stopWatcher` also takes `watcherMu` before calling `removeDockerContext` → `Wait()`, so the two cannot interleave. Reconcile's own direct `removeDockerContext` at line 155 only reaches the else-branch where no watcher is running, so no concurrent `Add(1)` is in flight. Reconciled out below. Gemini skipped `git blame` (known daily quota behavior); Claude and Codex covered that gap.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 13m 56s | 6m 07s | 4m 25s |
| Findings | 1I 4S | 1I 1S | 1I 1S |
| Tool calls | 67 (Bash 37, Read 16, Grep 14) | 39 (shell 37, stdin 2) | 9 (run_shell_command 4, read_file 3, grep_search 2) |
| Design observations | 2 | 2 | 2 |
| False positives | 0 | 0 | 1 |
| Unique insights | 3 | 2 | 0 |
| Files reviewed | 3 | 3 | 3 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **1I 4S** | **1I 1S** | **1I 1S** |
| Downgraded | 0 | 0 | 0 |
| Dropped | 0 | 0 | 1 |


**Reconciliation:** Gemini P1 I2 (`manageDockerContext` WaitGroup race) → rejected as false positive (Add(1) is called inside `startWatcherAndSync` while holding `watcherMu`; `stopWatcher` blocks on `watcherMu` before calling `Wait()`, preventing interleaving). The finding does not appear in the consolidated report.

Claude, Codex, and Gemini all contributed to I1. The consolidated finding merges the three framings: Claude and Gemini framed it as a BATS test bug; Codex framed it as a production-semantics gap. Both halves are real and point at the same underlying issue (daemon env vs user env).

## Review Process Notes

### Skill improvements

- [ ] **Add a "concurrency-claim verification" step to the dimensions.** When an agent raises a race or lock-ordering issue, the orchestrator should trace every call site that touches the implicated primitive (mutex, WaitGroup, channel) and confirm the claim against the actual lock graph before promoting it into the report. Concurrency claims are high-confidence-sounding but easy to get wrong when an outer lock serializes the calls; an orchestrator pass that enumerates the callers is cheap and catches false positives that the raising agent cannot see in isolation.

### Repo context updates

- [+] **Document the RDD service env freeze.** `rdd set` reuses an already-running service rather than restarting it (`cmd/rdd/service.go:73-87`), so the daemon's environment — including `DOCKER_HOST`, `DOCKER_CERT_PATH`, `DOCKER_TLS_VERIFY`, and anything else read at process start — is frozen at the moment the user first ran `rdd svc start`. Any code in the controllers that reads environment variables directly (as opposed to reading from App spec) will see stale values on subsequent `rdd set` invocations. Flag reviews that introduce environment reads inside reconcilers or long-lived goroutines.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

Now I have enough context. Let me verify a couple more things before writing the review.

Now I have enough. Let me write the review.

# Deep Review — PR #333: Docker Engine context probe safety

## Executive Summary

The PR replaces the simplistic `dockerclient.WithHost(host)` probe with `probeNamedDockerContext` that uses `docker/cli`'s `EndpointMeta.ClientOpts()`, correctly handling TCP+TLS/mTLS/SkipTLSVerify contexts via `WithTLSData`. It also short-circuits SSH/npipe endpoints as healthy, adds an early return when `config.json` is unreadable, and isolates the BATS tests from the developer's `~/.docker` via `DOCKER_CONFIG`. The core change is a solid improvement; the main concern is test reliability on hosts where a default Docker daemon is reachable.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `moby engine sets currentContext when no healthy context exists` becomes flaky where the host has a running Docker daemon** — `bats/tests/32-app-controllers/engine-docker.bats:606-619`

```bash
@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"

    rdd set running=false
    rdd set running=true containerEngine.name=moby
    rdd ctl wait --for=condition=ContainerEngineReady app/app --timeout=30s

    # The probe goroutine runs asynchronously; give it a moment.
    try --max 6 --delay 1 -- \
        bash -c "jq -r '.currentContext' '${DOCKER_CONFIG}/config.json' | grep -qx '${context_name}'"

    run -0 jq -r '.currentContext' "${DOCKER_CONFIG}/config.json"
    assert_output "${context_name}"
}
```

Before this PR, when `currentContext` was `""` or `"default"`, the reconciler took `healthy` as zero-valued `false` without probing and unconditionally promoted our context (`engine_reconciler.go:319-324` historical). The old test forced that branch by clearing `currentContext`. After this PR, the empty-/default branch now probes the host default via `probeDockerContext` → `dockerclient.FromEnv` (`docker_context.go:168-172`), which falls back to `unix:///var/run/docker.sock` when `DOCKER_HOST` is unset (`moby/moby/client@v0.4.0 client_unix.go:13`).

In `local_setup_file` (`engine-docker.bats:28-31`), the service is spawned by `rdd set running=true` **before** `DOCKER_HOST` is exported, so the service inherits no `DOCKER_HOST`. `service.Start` at `pkg/service/cmd/service.go:248` sets no `cmd.Env`, so the service subprocess uses whatever `DOCKER_HOST`/lack thereof was present in the BATS shell at spawn.

Concretely, on any host where `/var/run/docker.sock` responds (GitHub-hosted `ubuntu-latest` runners ship Docker pre-installed and running; developer machines with Docker Desktop), the probe succeeds, `healthy=true`, `setCurrentDockerContext` is never called, and the `try` block times out. The test then fails — even though the code is behaving exactly as the PR intends.

The previous test at least worked on a dev machine because it poked the user's real `~/.docker/config.json` directly; now the test is only robust on environments with no ambient Docker daemon.

Fix: force the probe into a failure by pre-seeding a doomed `currentContext` in the isolated config, so the reconciler takes the `probeNamedDockerContext` branch with a deliberately unreachable endpoint. For example:

```diff
 @test "moby engine sets currentContext when no healthy context exists" {
     local context_name="rancher-desktop-${RDD_INSTANCE}"

+    # Force the probe to fail deterministically regardless of whether the
+    # host has a running Docker daemon at the default socket. Seed a named
+    # context pointing at an unreachable socket and make it current.
+    local probe_target="test-unreachable"
+    local probe_hash
+    if command -v sha256sum &>/dev/null; then
+        probe_hash=$(printf '%s' "${probe_target}" | sha256sum | awk '{print $1}')
+    else
+        probe_hash=$(printf '%s' "${probe_target}" | shasum -a 256 | awk '{print $1}')
+    fi
+    mkdir -p "${DOCKER_CONFIG}/contexts/meta/${probe_hash}"
+    cat >"${DOCKER_CONFIG}/contexts/meta/${probe_hash}/meta.json" <<EOF
+{"Name":"${probe_target}","Endpoints":{"docker":{"Host":"unix:///nonexistent/does-not-exist.sock","SkipTLSVerify":false}}}
+EOF
+    printf '{"currentContext":"%s"}\n' "${probe_target}" >"${DOCKER_CONFIG}/config.json"
+
     rdd set running=false
     rdd set running=true containerEngine.name=moby
```

Alternative: clear `DOCKER_HOST` in the service's environment and set `DOCKER_HOST` to an unreachable path when spawning the service in `local_setup_file` — but this requires service.Start changes or a wrapper around `rdd set`, so the config-seed approach is simpler.

`(important, regression)`

### Suggestions

**S1. `getDockerContextHost` is no longer referenced from production code** — `pkg/controllers/app/engine/controllers/docker_context.go:81-103`

```go
// getDockerContextHost returns the full Docker host URL (e.g. "unix:///path/to/docker.sock"
// or "tcp://192.168.1.1:2376") for the named context's docker endpoint.
// Returns an empty string if the context does not exist or has no docker endpoint.
func getDockerContextHost(name string) (string, error) {
```

`engine_reconciler.go` now calls `probeNamedDockerContext` directly; a repo-wide grep for `getDockerContextHost` shows only the function definition and four calls inside `docker_context_test.go`. The function survives this PR solely to support its own unit tests — a textbook tautology.

Fix: delete `getDockerContextHost`, and rewrite the two tests that use it (`Test_createReplaceDockerContext`, `Test_deleteDockerContext`, `Test_getDockerContextHost_missing`) to fetch the endpoint via `newContextStore` and `docker.EndpointFromContext`, which exercises the same public API the production code now uses. `Test_getDockerContextHost_missing` can be dropped — the missing-context path is now covered by `probeNamedDockerContext`, so test it there instead.

`(suggestion, enhancement)`

**S2. Silent error swallowing in `probeNamedDockerContext` hides operator-diagnosable failures** — `pkg/controllers/app/engine/controllers/docker_context.go:178-208`

```go
func probeNamedDockerContext(ctx context.Context, name string) bool {
    s, err := newContextStore()
    if err != nil {
        return true
    }
    md, err := s.GetMetadata(name)
    if err != nil {
        return !errdefs.IsNotFound(err)
    }
    epMeta, err := docker.EndpointFromContext(md)
    if err != nil {
        return true
    }
    ...
    ep, err := docker.WithTLSData(s, name, epMeta)
    if err != nil {
        return true
    }
    opts, err := ep.ClientOpts()
    if err != nil {
        return true
    }
```

Four error paths (line 180, 187, 198, 202) and one non-NotFound branch (line 184-185) return `true` with no logging. When a user's Docker config is broken (corrupt context metadata, malformed TLS files, unparseable endpoint), the reconciler silently refuses to promote and gives the operator no breadcrumb. The stated rationale — "assume healthy to avoid overriding the user's choice" — is right, but the operator still needs to know *why* RDD won't take over.

Fix: thread a logger through (or use `logf.FromContext`) and log at V(1) when any of these errors fire. The hot path (success) stays silent; operators diagnosing "why is my RDD context not the default" get a single warning line that names the underlying cause.

`(suggestion, gap)`

**S3. Prefer `BATS_FILE_TMPDIR` over `BATS_SUITE_TMPDIR` for per-file isolation** — `bats/tests/32-app-controllers/engine-docker.bats:21`

```bash
export DOCKER_CONFIG="${BATS_SUITE_TMPDIR}/docker-config"
```

Every other test file in this repo that needs a tmpdir uses `BATS_FILE_TMPDIR` (see `bats/tests/34-containers-controllers/containers-mock.bats:20`, `bats/tests/31-rdd-controllers/crd-upgrade.bats:16`, etc.). `BATS_SUITE_TMPDIR` is suite-scoped, which means other test files in the suite would share this `DOCKER_CONFIG` if they also opted in. `bats/helpers/utils.bash:486` carries a stale-but-not-wrong note suggesting `BATS_SUITE_TMPDIR` has lifecycle gotchas (technically it *is* available in setup_file per BATS docs, but the convention in this repo is file-scoped).

Fix:

```diff
-    export DOCKER_CONFIG="${BATS_SUITE_TMPDIR}/docker-config"
+    export DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
```

`(suggestion, enhancement)`

**S4. TODO comment slightly misdescribes what needs platform-specific handling** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:270-274`

```go
    // TODO(windows): Docker context management uses unix:// sockets; Windows
    // requires npipe:// in the context meta file, an npipe-aware probe client,
    // and platform-specific path handling in getCurrentDockerContext and
    // createReplaceDockerContext. Track in a follow-up issue before shipping
    // Windows support.
```

`getCurrentDockerContext` relies on `docker/cli`'s `config.Load`, which is already cross-platform via `os.UserHomeDir()` and handles `USERPROFILE` correctly on Windows. The platform-specific work is concentrated in `createReplaceDockerContext` (scheme must be `npipe://` and the `socketPath` formatting differs) and the probe client selection — but not in `getCurrentDockerContext`.

Fix: trim the mention of `getCurrentDockerContext` so the TODO reflects what actually needs Windows porting:

```diff
-    // TODO(windows): Docker context management uses unix:// sockets; Windows
-    // requires npipe:// in the context meta file, an npipe-aware probe client,
-    // and platform-specific path handling in getCurrentDockerContext and
-    // createReplaceDockerContext. Track in a follow-up issue before shipping
-    // Windows support.
+    // TODO(windows): Docker context management uses unix:// sockets; Windows
+    // requires npipe:// in the context meta file written by
+    // createReplaceDockerContext, and an npipe-aware probe client. Track in
+    // a follow-up issue before shipping Windows support.
```

`(suggestion, enhancement)`

## Design Observations

### Strengths

- Leaning on `docker/cli`'s `ClientOpts()` is the right call: the library already knows how to assemble an HTTP client for every endpoint shape (unix, tcp, tcp+TLS, mTLS, SkipTLSVerify), and the SSH/npipe short-circuit correctly defers to the user rather than re-implementing connhelper in this reconciler.
- Splitting `probeDockerContext(ctx, host)` into `probeDockerContext(ctx)` (FromEnv-backed default probe) and `probeNamedDockerContext(ctx, name)` (store-backed named probe) makes the intent at each call site obvious — the old single-arg call site silently used `WithHost(host)` even when TLS was needed; the two-function design makes the "did we load TLS?" question impossible to skip.
- The early return at `engine_reconciler.go:310-317` on `getCurrentDockerContext` failure removes a latent clobber risk: the old path logged and continued with `current=""`, then ran the probe and tried `setCurrentDockerContext`, which would then fail with its own Load error — net-effect equivalent today, but one upstream fix (e.g., `config.Load` becoming more lenient about malformed auths) would have silently turned that into a real overwrite. Fail-fast is safer.

## Testing Assessment

Unit tests in `docker_context_test.go` cover the context-store read/write surface; the BATS suite exercises the reconciler end-to-end. Ranked by residual risk:

1. **No coverage for the TCP+TLS, mTLS, or SkipTLSVerify probe paths** — the whole motivation for the PR. These are hard to BATS-test without a TLS daemon, but unit tests could be added by building a fake context store fixture with TLS data and verifying that `probeNamedDockerContext` returns false against an unreachable TLS endpoint (ensuring the TLS config is actually constructed and the call fails for reachability reasons, not config reasons).
2. **No coverage for the SSH/npipe short-circuit** — `probeNamedDockerContext` now returns `true` for any endpoint whose scheme isn't `unix` or `tcp`. A table-driven unit test that seeds a context with `Host: "ssh://u@h"` and asserts the function returns `true` without attempting a ping would pin this behavior.
3. **The "probe of default context via FromEnv" behavior change has no dedicated test** — covered only indirectly by I1, which is too fragile. A unit test could inject a dummy `DOCKER_HOST` and verify `probeDockerContext` correctly returns false for an unreachable default endpoint.
4. **Early-return on `getCurrentDockerContext` error is not asserted** — the existing `Test_getCurrentDockerContext_malformedAuth` only verifies the error surfaces from the helper. A reconciler-level test (or an integration assertion) that checks no `setCurrentDockerContext` write happens when config is malformed would lock in the new fail-fast semantics.

## Documentation Assessment

The new docstrings on `pingDocker`, `probeDockerContext`, and `probeNamedDockerContext` are accurate and concise. The TODO comments in `manageDockerContext` / `removeDockerContext` are helpful but have a minor inaccuracy called out in S4. No design-doc updates needed.

## Commit Structure

Single commit, scope matches the title. The commit message clearly enumerates the four independent improvements; each would stand alone but together they form a coherent "probe safety" unit. No objection.

## Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — **I1, S3**
- `pkg/controllers/app/engine/controllers/docker_context.go` — **S1, S2**
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — **S4** (the probe-branch change at line 319-324 is verified against the new semantics; early return at 310-317 is verified safe)
### [Codex] — Pass 1

### Executive Summary

This PR moves Docker-context probing onto Docker CLI primitives (`ContextStore`, `WithTLSData`, `ClientOpts`) and isolates the BATS suite from the developer's real `~/.docker` directory. The main remaining risk is that the new `"default"` probe still evaluates the daemon's startup environment instead of the caller's live Docker environment, so a long-running service can still overwrite a healthy default context; there is also a smaller regression where malformed named contexts are now treated as healthy.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Default-context probing still uses the daemon's startup environment** — `pkg/controllers/app/engine/controllers/docker_context.go:165-171` (important, gap)

```go
// probeDockerContext pings the implicit default Docker endpoint
// (DOCKER_HOST or the platform default socket) within dockerContextProbeTimeout.
// Used when currentContext is "" or "default".
func probeDockerContext(ctx context.Context) bool {
	probeCtx, cancel := context.WithTimeout(ctx, dockerContextProbeTimeout)
	defer cancel()
	return pingDocker(probeCtx, dockerclient.FromEnv)
}
```

At line 171, `dockerclient.FromEnv` reads `DOCKER_HOST`, `DOCKER_CERT_PATH`, and `DOCKER_TLS_VERIFY` from the service process environment. That environment is fixed when the daemon is spawned at `pkg/service/cmd/service.go:248`, and later `rdd set` calls reuse the already-running service via `cmd/rdd/service.go:79-87` instead of restarting it. When `manageDockerContext` takes the `"default"` branch at `pkg/controllers/app/engine/controllers/engine_reconciler.go:320-323`, a user who changed their shell's Docker env after the control plane started can still have a healthy default endpoint, but the probe will consult stale values and overwrite `currentContext` anyway. This means the new safety check only works on the first daemon start, not on the common "service already running, toggle App back to moby" path.

Fix: do not auto-promote over `"default"` from inside the long-lived daemon unless the effective Docker env is passed in explicitly. The safe short-term fix is to treat `"default"` as healthy in this code path; the complete fix is to move the default-endpoint decision into the CLI/startup path that has the caller's live environment.

```diff
-		} else {
-			healthy = probeDockerContext(probeCtx)
+		} else {
+			// The daemon only sees the env it inherited at startup; probing
+			// "default" here can be stale after the service is already running.
+			healthy = true
 		}
```

#### Suggestions

S1. **Malformed named contexts are now preserved as “healthy”** — `pkg/controllers/app/engine/controllers/docker_context.go:183-195` (suggestion, regression)

```go
md, err := s.GetMetadata(name)
if err != nil {
	return !errdefs.IsNotFound(err)
}
epMeta, err := docker.EndpointFromContext(md)
if err != nil {
	return true
}
scheme, _, _ := strings.Cut(epMeta.Host, "://")
```

At lines 187-189, any `docker.EndpointFromContext` failure is treated as healthy. Upstream, `EndpointFromContext` returns an error when the context has no Docker endpoint or the endpoint metadata cannot decode (`github.com/docker/cli/.../load.go:153-160`). Before this change, `getDockerContextHost` treated that case as empty/unhealthy, so the controller would promote its own context instead. After this change, a corrupt named context with no Docker endpoint or malformed host survives at lines 187-195 even though `docker` cannot use it. That is broader than the acknowledged "SSH or TLS we can't load" tradeoff in the comment at lines 176-177.

Fix: keep the optimistic fallback for unsupported helper schemes and intentionally skipped TLS-loading failures, but treat missing/undecodable Docker endpoint metadata as unhealthy.

```diff
 	epMeta, err := docker.EndpointFromContext(md)
 	if err != nil {
-		return true
+		return false
 	}
 	scheme, _, _ := strings.Cut(epMeta.Host, "://")
+	if scheme == "" {
+		return false
+	}
 	switch scheme {
 	case "unix", "tcp":
```

### Design Observations

**Concerns**

1. `(in-scope)` The `"default"` case is fundamentally different from named contexts: named contexts are on-disk state the daemon can read deterministically, while `"default"` is partly process environment. Handling both in the same background reconciler couples a durable control-plane decision to ephemeral shell state. Moving the `"default"` decision into the CLI/startup path, or persisting the effective Docker env alongside the App request, would eliminate this entire stale-environment class.

**Strengths**

1. Replacing hand-rolled context parsing with Docker CLI's `ContextStore`, `WithTLSData`, and `ClientOpts` is the right direction. It removes the old false-negative probes for TCP+TLS, mTLS, and `SkipTLSVerify` contexts and aligns the controller with Docker's own context format.

### Testing Assessment

The existing Docker-context unit tests passed: `go test ./pkg/controllers/app/engine/controllers -run 'Test_(createReplaceDockerContext|dockerConfigDir_DOCKER_CONFIG|getCurrentDockerContext_malformedAuth|getDockerContextHost_missing|deleteDockerContext|currentDockerContext)$'`.

1. Highest risk untested path: a service started with one Docker env, then the user changes `DOCKER_HOST` / `DOCKER_TLS_VERIFY` / `DOCKER_CERT_PATH` before `rdd set running=true`. The current BATS setup exports `DOCKER_HOST` only after the service has already started (`bats/tests/32-app-controllers/engine-docker.bats:29-31`), so it does not exercise the daemon's live/default probe behavior.
2. No unit test seeds a named context with missing or malformed Docker endpoint metadata, so the new `return true` branch at `pkg/controllers/app/engine/controllers/docker_context.go:187-195` is unguarded.
3. The PR description's TCP+TLS, mTLS, `SkipTLSVerify`, and SSH scenarios still have no direct regression tests; `pkg/controllers/app/engine/controllers/docker_context_test.go:31-194` only covers context-file CRUD and `config.json` round-tripping.

### Documentation Assessment

None.

### Acknowledged Limitations

- **Code comment**: `"TODO(windows): Docker context management uses unix:// sockets; Windows requires npipe:// in the context meta file, an npipe-aware probe client, and platform-specific path handling..."` — `pkg/controllers/app/engine/controllers/engine_reconciler.go:270-274`, plus `"TODO(windows): see manageDockerContext — same gap applies here."` at `pkg/controllers/app/engine/controllers/engine_reconciler.go:339`. This change makes the limitation more explicit but does not materially worsen it; the code still exits early on Windows.
- **Code comment**: `"For contexts with non-tcp/unix endpoints (e.g. SSH), or with TLS config we can't load, we assume healthy to avoid overriding the user's choice."` — `pkg/controllers/app/engine/controllers/docker_context.go:176-177`. This is an intentional tradeoff introduced by the patch: it reduces false overrides for SSH/TLS-backed contexts, while knowingly preserving some contexts the controller cannot authoritatively probe.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/docker_context.go` — I1, S1
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I1

### [Gemini] — Pass 1

### Executive Summary
This PR replaces the raw host string URL probe with a full Docker endpoint probe (`probeNamedDockerContext`) that properly respects `ClientOpts` for TLS and non-TCP/Unix endpoints, preventing false overrides of user contexts. It also adjusts the default context probe to evaluate the environment's default Docker socket, respecting an active local engine if one exists. However, the BATS test suite fails to isolate the daemon's environment from the test host's default Docker socket, making the test flaky on developer machines. Additionally, a race condition in the engine reconciler's shutdown logic can trigger a `sync.WaitGroup` panic, resulting in an orphaned Docker context being left behind.

### Findings

#### Critical Issues
None.

#### Important Issues
I1. **BATS test fails on developer machines with an active Docker daemon** — `bats/tests/32-app-controllers/engine-docker.bats:13-26`
(important, gap)

The PR changed `probeDockerContext` to ping the environment's default Docker socket (`dockerclient.FromEnv`) instead of blindly overriding an empty or `"default"` context. If the default endpoint is healthy, the daemon now correctly refrains from stealing the `currentContext`. However, the BATS test fails to isolate the daemon from the test host's default Docker socket. In `local_setup_file()`, `DOCKER_HOST` is exported *after* the daemon is started via `rdd set running=true`. Thus, the daemon process inherits the test runner's environment (typically an empty `DOCKER_HOST`). When the test clears `currentContext` and restarts the engine, `probeDockerContext` pings the platform default (e.g., `/var/run/docker.sock`). If the developer running the test has an active Docker daemon, the probe returns `true`, the context is not overridden, and the test fails while waiting for the context to be updated.

Fix: Isolate the daemon's default context probe by exporting a dead `DOCKER_HOST` before starting the service.

```diff
     # exercise — the engine controller is part of the default controller
     # set, so no explicit --controllers selection is needed.
     rdd svc delete
+    # Isolate the daemon's default context probe from the host's Docker daemon
+    export DOCKER_HOST="unix:///dev/null/docker.sock"
     rdd set running=true
     run -0 rdd svc paths docker_socket
```

I2. **Concurrent Add and Wait on sync.WaitGroup causes panic during shutdown** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:323-324`
(important, regression)

`manageDockerContext` increments `r.contextProbeWg` outside of the `contextMu` lock. If the manager shuts down while a `Reconcile` is executing, the shutdown hook calls `stopWatcher`, which invokes `removeDockerContext()`. Since `stopWatcher` releases `watcherMu` before calling `removeDockerContext`, the `removeDockerContext` call can interleave with `manageDockerContext`. If `removeDockerContext` calls `Wait()` concurrently with `manageDockerContext` calling `Add(1)` while the counter is zero, Go's `sync.WaitGroup` will panic (`panic: sync: WaitGroup misuse: Add called concurrently with Wait`). This panic during shutdown crashes the process before `deleteDockerContext` executes, leaving the user with an orphaned `rancher-desktop` context in their `config.json` that points to a dead socket.

Fix: Move `r.contextProbeWg.Add(1)` inside the `contextMu` lock to order `Wait()` strictly after `Add(1)` or strictly before it.

```diff
 	probeCtx, cancel := context.WithCancel(r.watcherCtx)
 	r.contextProbeCancel = cancel
 	r.contextProbeGen++
 	myGen := r.contextProbeGen
+	r.contextProbeWg.Add(1)
 	r.contextMu.Unlock()
 
-	r.contextProbeWg.Add(1)
 	go func() {
```

#### Suggestions
S1. **Remove unused `getDockerContextHost` function** — `pkg/controllers/app/engine/controllers/docker_context.go:84-104`
(suggestion, enhancement)

The PR replaced calls to `getDockerContextHost` with `probeNamedDockerContext` in `engine_reconciler.go`. As a result, `getDockerContextHost` is no longer used in production code and is only called by its own unit tests. It should be removed to reduce maintenance burden.

Fix: Delete the `getDockerContextHost` function and its corresponding tests in `docker_context_test.go`.

### Design Observations

#### Concerns
None.

#### Strengths
**Safe Context Updates**: The new `getCurrentDockerContext` error handling in `engine_reconciler.go` correctly distinguishes between a missing configuration file (which `config.Load` returns without error) and a malformed one (which returns an error). By bailing out early on error, the daemon safely avoids overwriting a corrupted file, preventing data loss in the user's `config.json`.

**Respecting Host Environment**: Changing `probeDockerContext` to probe `dockerclient.FromEnv` for the default/empty context is an elegant way to prevent RDD from hijacking the Docker CLI if the user already has a working Docker Desktop or OS-level Docker daemon running at the default socket.

### Testing Assessment
Tested scenarios are adequate, but as identified in I1, the test environment isolation is incomplete. The current tests will fail on a developer workstation with an active default Docker daemon.

### Documentation Assessment
No missing docstrings were introduced by this change. The new helper functions `pingDocker`, `probeDockerContext`, and `probeNamedDockerContext` include accurate and descriptive comments.

### Acknowledged Limitations
- **Code comment**: `// TODO(windows): Docker context management uses unix:// sockets; Windows requires npipe:// in the context meta file, an npipe-aware probe client, and platform-specific path handling... Track in a follow-up issue before shipping Windows support.` (`pkg/controllers/app/engine/controllers/engine_reconciler.go:270-274`, `339`)
  The lack of Windows support for Docker context management is explicitly deferred. The changes correctly preserve the early returns (`if runtime.GOOS == "windows" { return }`) in `manageDockerContext` and `removeDockerContext`.

### Coverage Summary
- `bats/tests/32-app-controllers/engine-docker.bats`: I1
- `pkg/controllers/app/engine/controllers/docker_context.go`: S1
- `pkg/controllers/app/engine/controllers/engine_reconciler.go`: I2
