# Deep Review: 20260415-015141-pr-309

| | |
|---|---|
| **Date** | 2026-04-15 01:51 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 4 (of PR #309) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#309](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/309) — engine: mirror Docker container engine state into Kubernetes |
| **Branch** | `engine-v3` |
| **Commits** | `a5235c8` engine: mirror Docker container engine state into Kubernetes |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — three important regressions introduced by prior-round fixes: a discovery ordering race that re-opens the `rdd set running=true` gap on cold start, engine cleanup that deletes unowned mirror resources, and the standalone `app-controller` binary importing `engine` under the wrong manager group. |
| **Wall-clock time** | `16 min 6 s` |

---

## Executive Summary

Round 4 of PR #309 reviews the fixes applied after round 3. Four of the five prior findings were addressed cleanly (nil-pointer guards on `inspect.Config`, engine controller moved to the `containers` API group, webhook dead code removed, watcher context decoupled via `context.WithoutCancel`). One was skipped by design (namespace self-healing) and one remains a standing style disagreement (BATS `$(...)` substitution at file scope).

This round surfaces three new Important regressions, each a direct consequence of a round-3 fix interacting with the rest of the control plane. (1) `rdd.rancherdesktop.io/ready=true` now fires before `registerDiscovery` populates the enabled-controller list, so on cold start `isEngineControllerEnabled` sees an empty list and `rdd set running=true` silently falls through to the weaker `Running=True` wait — exactly reintroducing the race the PR aims to close. (2) `cleanupMirrorResources` and the full-sync prune paths operate on every `Container`/`Image`/`Volume` in `rancher-desktop`, not only on engine-owned mirrors, which forced the `containers-mock.bats` suite to spell `containers,-engine` to keep the two controllers from fighting over shared kinds. (3) `cmd/app-controller/main.go` now imports `engine`, but `pkg/external/main.go` does not filter by `GetAPIGroup()`, so a standalone `app-controller` run can register a `containers`-group reconciler under the `app` shared manager without the `containers` CRDs it needs.

Four lower-severity items cover a latent watcher-goroutine leak on manager shutdown (the `context.WithoutCancel` fix from round 3 has no shutdown hook), missing teardown cleanup in `engine.bats`, a stylistic `errs := []error{applyErr}` seed, and stale `--controllers` help text that now omits the `containers` group. Gemini filed a Critical "invalid `@test` syntax" finding against `engine.bats` that is a hallucination — the file uses `@test` correctly — and an Important "API-group mismatch" claim whose cause-and-effect trace is backwards from what `pkg/external/main.go` actually does.

Structure: 0 critical, 3 important, 4 suggestions.

---

## Critical Issues

None.

---

## Important Issues

I1. **`rdd set running=true` cold-start race: `ReadyAnnotation` fires before the enabled-controller list is populated** — `pkg/service/controllers/manager.go:105-112,200-202`, `cmd/rdd/set.go:490-505`, `pkg/service/controllers/discovery.go:270-289` [Codex GPT 5.4] [Claude Opus 4.6]

```go
// manager.go:101-112
if err := scm.installControllerCRDs(ctx); err != nil {
    return fmt.Errorf("failed to install controller CRDs: %w", err)
}

// Mark the control plane ready before the rest of manager setup.
// Clients waiting on the discovery ConfigMap can now proceed;
// ...
if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
    return fmt.Errorf("failed to mark control plane as ready: %w", err)
}
```

`SharedControllerManager.Start` installs CRDs, then sets `ReadyAnnotation`, and only later — at `manager.go:202` — calls `registerDiscovery`, which is what writes the `EnabledControllers` list into `configMap.Data`.

```go
// set.go:490-505
engineEnabled, err := isEngineControllerEnabled(ctx, config)
if err != nil {
    logrus.WithError(err).Debug("Falling back to ContainerEngineReady wait after discovery lookup failed")
    engineEnabled = true
}
if engineEnabled {
    logrus.Info("Waiting for container engine to be ready")
    return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
        status, gen, present := conditionInfo(obj, "ContainerEngineReady")
        return present && gen >= minGen && status == metav1.ConditionTrue
    })
}
logrus.Info("Waiting for the VM to start (engine controller not enabled)")
return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
    status, gen, present := conditionInfo(obj, "Running")
    return present && gen >= minGen && status == metav1.ConditionTrue
})
```

`rdd set running=true` calls `waitForDesiredState` → `isEngineControllerEnabled`, which reads `configMap.Data` via `GetEnabledControllers` (`discovery.go:270-289`) and returns `(false, nil)` for a ConfigMap with no data entries. In the window between `MarkControlPlaneReady` and `registerDiscovery`, the predicate is: ready annotation set, but `EnabledControllers` empty. `rdd svc start --wait` and `cmd/rdd/service.go` already wait on the ready annotation, so `rdd set` proceeds into the wait, sees `engineEnabled=false`, and falls through to the weaker `Running=True` wait.

`Running=True` is satisfied by `LimaVM` reaching Running — well before the engine controller has connected to Docker, run `fullSync`, and stamped `ContainerEngineReady=True`. That is exactly the race this PR is designed to close. `engine.bats` never catches it because its tests either start the service with `rdd set running=true` against an already-ready control plane or run after startup has fully settled.

Codex originally traced this as Important; Claude independently flagged the same ordering (as S1 "`isEngineControllerEnabled` queries discovery before controller managers register") and proposed defaulting to `ContainerEngineReady` when the controller list is empty-during-startup. We keep the Important classification because the failure mode reintroduces a ship-blocking behaviour from before the PR.

Fix options, in order of preference:
1. Move `registerDiscovery` before `MarkControlPlaneReady` in `SharedControllerManager.Start`. The enabled-controller list is then visible to every client that waits for the ready annotation. This is the minimal change and preserves the "ready = clients may use the control plane" contract.
2. If keeping readiness independent of controller-manager health is load-bearing, precompute the enabled-controller list into the discovery ConfigMap (or a sibling key) before `MarkControlPlaneReady`, and have `isEngineControllerEnabled` read that precomputed record.
3. As a defensive change on top, let `isEngineControllerEnabled` distinguish "empty list" from "registered with zero engine" and default to the stricter `ContainerEngineReady` wait when the list is empty. This alone is not sufficient — it papers over the ordering — but it is a belt-and-braces safeguard.

---

I2. **Engine cleanup and full-sync prune delete every `Container`/`Image`/`Volume` in `rancher-desktop`, not only engine-owned mirrors** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:293-336`, `pkg/controllers/app/engine/controllers/sync_containers.go:62-75`, `pkg/controllers/app/engine/controllers/sync_images.go:220-239`, `pkg/controllers/app/engine/controllers/sync_volumes.go` [Codex GPT 5.4]

```go
// engine_reconciler.go:293-337
func (r *EngineReconciler) deleteAllOfType(ctx context.Context, list client.ObjectList) error {
    if err := r.List(ctx, list, client.InNamespace(apiNamespace)); err != nil {
        return err
    }

    items, err := meta.ExtractList(list)
    // ...
    for _, item := range items {
        obj := item.(client.Object)
        // ...
        retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
            latest := obj.DeepCopyObject().(client.Object)
            if err := r.Get(ctx, key, latest); err != nil { /* ... */ }
            if !controllerutil.RemoveFinalizer(latest, mirrorFinalizer) {
                return nil
            }
            return r.Update(ctx, latest)
        })
        // ...
        if err := client.IgnoreNotFound(r.Delete(ctx, obj)); err != nil {
            errs = append(errs, fmt.Errorf("failed to delete %T %s: %w", obj, obj.GetName(), err))
        }
    }
}
```

When `cleanupMirrorResources` runs (App stopped, backend switched, controller teardown), `deleteAllOfType` lists every object of a given kind in the `rancher-desktop` namespace. For each object, it enters a retry closure that removes the engine `mirrorFinalizer` if present — but the closure exits with `nil` on "finalizer not present" (line 319). The outer loop then falls through to `r.Delete(ctx, obj)` unconditionally at line 331. An object created by any other controller in `rancher-desktop`, lacking the engine finalizer, is deleted anyway.

The full-sync prune paths have the same shape:

```go
// sync_containers.go:62-75
var containerMirrors containersv1alpha1.ContainerList
if err := w.k8s.List(ctx, &containerMirrors, client.InNamespace(apiNamespace)); err != nil {
    return fmt.Errorf("failed to list Containers: %w", err)
}
for i := range containerMirrors.Items {
    c := &containerMirrors.Items[i]
    if !dockerIDs[c.Name] {
        log.V(1).Info("Removing stale Container", "id", c.Name)
        if err := w.removeMirrorResource(ctx, c, c.Name); err != nil {
            errs = append(errs, err)
        }
    }
}
```

Any `Container` whose `metadata.name` does not correspond to a live Docker container ID is pruned, regardless of whether it was created by the engine controller. `sync_images.go:220-239` and `sync_volumes.go` apply the same logic to their kinds. The fallout is already visible in this PR: `bats/tests/34-containers-controllers/containers-mock.bats` had to switch to `--controllers=containers,-engine` so that the mock container controller keeps exclusive ownership of the mirror kinds while its tests run. That is a test-only workaround for an architectural problem: two controllers in the same group both write authoritative state into the same namespace and the same kinds, with no ownership marker distinguishing their objects.

Fix: stamp engine-owned objects with a distinguishing label or annotation (e.g. `engine.rancherdesktop.io/owned-by=engine-controller`) in addition to the finalizer, and filter both `deleteAllOfType` and every full-sync prune by that marker. Unmarked objects must be skipped entirely so other controllers — the mock used by `containers-mock.bats` today, any future containerd mirror, or a human-created test fixture — can coexist in the same namespace without losing state on a VM stop or backend switch. Deleting by marker also removes the need for the `-engine` carve-out in `containers-mock.bats`.

Two earlier rounds discussed the related "namespace self-healing" question and decided not to re-create `rancher-desktop` / `containernamespace/moby` after a user deletes them mid-run. That decision is distinct from this one: self-healing is about re-creating your own objects, while this finding is about not destroying someone else's.

---

I3. **`cmd/app-controller` imports the engine package, but `external.RunControllers` does not filter by API group — the standalone binary can register `engine` under the wrong shared manager** — `cmd/app-controller/main.go:13-20`, `pkg/external/main.go:62-72,97-111`, `pkg/controllers/app/engine/controller.go:32-38` [Codex GPT 5.4]

```go
// cmd/app-controller/main.go
import (
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/app"
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/demo"
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
    // ...
)

func main() {
    os.Exit(external.RunControllers("app"))
}
```

`cmd/app-controller/main.go` now blank-imports `engine` to trigger its `init()`. The intent of round 3 was the opposite: engine declares itself in the `containers` API group (`pkg/controllers/app/engine/controller.go:32-38`) precisely so `--controllers=app` selections cannot pull it in without its CRDs.

```go
// pkg/external/main.go:62-72
for _, controller := range base.GetAllControllers() {
    shouldStart, err := shouldStartController(ctx, config, controller.GetName(), setupLog)
    // ...
    } else if shouldStart {
        controllersToStart = append(controllersToStart, controller)
    }
}
```

`RunControllers` iterates the global registry and filters only by `shouldStartController`, which checks whether the controller is already running in another manager. It never compares `controller.GetAPIGroup()` to the `apiGroupName` passed in at line 31. If the embedded control plane is running with engine disabled (`--controllers=-engine`), or if engine has not registered itself in discovery yet, `shouldStartController` returns true, and `cmd/app-controller` adds engine to its `controllersToStart` list. Engine is then registered into the shared manager named `"app"` at line 99 — not `"containers"` — which is exactly the scenario round 3 tried to prevent. Once in, engine's `SetupWithManager` watches `Container`/`Image`/`Volume` from the `containers` group; if those CRDs are not installed (they are not installed by `app-controller` itself — `demo` and `app` have no containers dependency), the watch caches fail to establish and the manager startup errors out.

Gemini filed a related finding but reversed the effect: it claimed `external.RunControllers("app")` filters by API group and would therefore silently *omit* engine. That would be a safer outcome than what the code actually does, but `RunControllers` does not filter. The engine controller is imported *and* registered under the wrong shared manager.

Fix options, in order of preference:
1. Drop the `_ "...controllers/app/engine"` import from `cmd/app-controller/main.go`. The blank import contributes nothing the standalone binary needs now that engine belongs to a different group, and the other two `app`-group controllers continue to work. This is a one-line change.
2. In `pkg/external/main.go`, add an `if controller.GetAPIGroup() != apiGroupName { continue }` guard to the controllers-to-start loop. This turns the filtering into a systemic guardrail: a future mistaken import in any standalone binary is prevented from cross-registering into the wrong manager. Pair with (1) for belt-and-braces.
3. If the intent is to eventually run engine from a containers-specific standalone binary (`cmd/containers-controller`), add that binary and move the engine import there instead.

---

## Suggestions

S1. **`context.WithoutCancel` leaks the watcher goroutine on manager shutdown** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:201-222` [Claude Opus 4.6]

```go
// engine_reconciler.go:201-222
func (r *EngineReconciler) startWatcher(ctx context.Context) error {
    // ...
    // context.WithoutCancel decouples the watcher's lifetime from
    // this Reconcile call. Controller-runtime does not per-request
    // cancel Reconcile today, but adding a ReconciliationTimeout or
    // a middleware that attaches a deadline would otherwise kill
    // the watcher the moment Reconcile returns, triggering an
    // immediate reconnect loop.
    w, err := newDockerWatcher(context.WithoutCancel(ctx), r.Client, r.reconcileChan)
    // ...
}
```

The round-3 fix for "watcher bound to Reconcile context" correctly prevents per-request cancellation from killing the watcher, but it also prevents *manager-shutdown* cancellation from reaching the watcher. The only shutdown path for the watcher is `stopWatcher`, which is called from `Reconcile`. On manager shutdown, controller-runtime cancels the manager context, `Reconcile` stops being invoked, and `stopWatcher` never runs. The watcher goroutine stays alive — and holds the Docker client connection open — until the process exits moments later.

Practical impact is small: `GracefulShutdownTimeout` on the manager is 10 s (`manager.go:161`), process exit follows quickly, and the watcher goroutine dies with the process. But `cli.Close()` never runs, the watcher's shutdown logs are missing, and any future refactor that wants ordered cleanup (flush pending events, fsync metadata) would have no hook to plug into. Claude rated this Important; we classify as Suggestion given the narrow shutdown-only impact, while noting it regresses the explicit intent of the round-3 fix.

Fix: register a `manager.RunnableFunc` with the shared manager in `RegisterWithManager` that listens on the manager context and calls `stopWatcher` on cancel. Or store the `watchCancel` alongside `watcherMu` and cancel it from a `mgr.Add(manager.RunnableFunc(...))` runnable. Either pattern gives the watcher a proper shutdown hook while keeping its lifetime independent of any individual Reconcile ctx.

---

S2. **`engine.bats` has no `local_teardown_file` — failed tests leak Docker containers/volumes into the next run** — `bats/tests/32-app-controllers/engine.bats` [Claude Opus 4.6]

```bash
local_setup_file() {
    skip_on_windows
    rdd svc delete
    rdd set running=true
}
# No local_teardown_file.
```

The new suite creates named Docker objects (`test-lifecycle`, `test-state`, `test-delete`, `test-inuse`, `test-unpause`, `dangling-pin`, `dangling-tag-test:v1`, `test-vol`, `My_Vol_Ume`) across many tests. A failure mid-suite leaves them running; subsequent runs error out with `container name "test-lifecycle" is already in use`. The `rdd svc delete` in `local_setup_file` destroys the VM and its Docker state, which is what catches this today — but only because the suite always performs a full VM recreate on every run. Dropping that `rdd svc delete` (an obvious future optimisation for test speed) would reintroduce the leaks immediately.

Fix: add a `local_teardown_file` that force-removes the named test objects:

```bash
local_teardown_file() {
    docker rm --force test-lifecycle test-state test-delete test-inuse \
        test-unpause dangling-pin 2>/dev/null || true
    docker rmi dangling-tag-test:v1 2>/dev/null || true
    docker volume rm test-vol My_Vol_Ume 2>/dev/null || true
}
```

The suite is currently self-cleaning only because of the VM-recreate in `local_setup_file`, which is a load-bearing side effect worth replacing with explicit teardown.

---

S3. **`errs := []error{applyErr}` pre-seeds the slice with a possibly-nil error** — `pkg/controllers/app/engine/controllers/sync_images.go:226` [Claude Opus 4.6]

```go
freshNames, applyErr := w.applyImageFromInspect(ctx, result.InspectResponse)
// ...
errs := []error{applyErr}
```

`errors.Join` at line 239 discards nils, so the behaviour is correct. The pattern reads as if every call returns at least one error, and diverges from the more common `var errs []error; if applyErr != nil { errs = append(errs, applyErr) }` shape used elsewhere in the same package. Pure style; change only if you are in the file anyway.

---

S4. **`--controllers` help text does not list the new `containers` group** — `pkg/service/controllers/options.go:32`, `cmd/rdd/service.go:174`, `pkg/controllers/app/engine/controller.go:32-38` [Codex GPT 5.4]

```go
fs.StringVar(&o.Controllers, "controllers", "*",
    "Controllers to enable. Use '*' for all, or specify comma-separated list. API groups: 'rdd' (configmapreplicaset, notary), 'app' (demo). Prefix with '-' to exclude, e.g., '*,-demo'")
```

The round-3 fix moved `engine` from the `app` group to the `containers` group, but both help strings (`options.go:32` and the equivalent line in `cmd/rdd/service.go:174`) still only document `rdd` and `app`. A user who reads `--help` has no indication that `-engine`, `containers,-engine`, or `containers` are valid selectors — and this PR introduces a `containers-mock.bats` test that already relies on the `containers,-engine` form.

Fix: update both help strings to list every registered API group, including `containers` and `lima`. A structural fix would derive the help text from the registered controllers at startup so it cannot drift again; a minimal fix is to just edit both strings and add a test that every selector used in `bats/` appears in the help.

---

## Design Observations

### Strengths

- **`retry.RetryOnConflict` + re-`Get` on every multi-writer status write** (in-scope) [Claude Opus 4.6] [Codex GPT 5.4] [Gemini 3.1 Pro] — `AppReconciler` mirroring LimaVM conditions and `EngineReconciler.setEngineCondition` both wrap their App-status updates in retry + re-Get. Controller-runtime serialises Reconcile within a controller but not across controllers, so this is the correct response to two reconcilers writing different conditions on the same object; the pattern is applied consistently and matches `removeMirrorResource` and `deleteAllOfType`.
- **Daemon-clock anchoring for Docker event replay** (in-scope) [Codex GPT 5.4] — `dockerEventsSince` uses `Info.SystemTime` rather than host `time.Now()` for the event replay filter, which avoids dropping events inside a host/guest clock-skew window. The biased host-clock fallback on `Info` failure is a sensible safety net.
- **Dual SSA field managers for spec vs finalizer apply** (in-scope) [Claude Opus 4.6] — `engine-controller` owns spec; `engine-controller-finalizer` owns the finalizer-only apply. A finalizer-only SSA through the main field owner would release ownership of `spec` and fail the CRD's required-field validation. The pattern is non-obvious enough that the code comment documenting it is the right move.
- **Stable sort before atomic-list apply** (in-scope) [Gemini 3.1 Pro] — `RepoDigests`, `ContainerPort` bindings, and port names are sorted before every apply, so Go's randomised map iteration does not mint a new `resourceVersion` on each sync. Matches the atomic-list SSA semantics the CRD declares.
- **Idempotent `fullSync` preserving mirror identities across watcher reconnects** (in-scope) [Claude Opus 4.6] — Container IDs, image hashes, and volume hashes are stable across reconnects, so transient Docker disconnects cause no downstream churn. The combination with the dead-watcher detection + reconnect path handles daemon restarts cleanly.

### Concerns

- **Engine controller is a namespace-scanner, not an owned-object reconciler** (in-scope) [Codex GPT 5.4] — `reconcileContainerSpecs`, `processFinalizers`, `cleanupMirrorResources`, and all three full-sync prune paths list every object of a kind in `rancher-desktop` and act on them regardless of ownership. Covered in I2 as the immediate concrete bug; the deeper concern is architectural. The code comment at `engine_reconciler.go:181-185` names per-resource reconcilers with watch predicates as the long-term fix, and pairing that with an ownership label would eliminate both the O(N)-per-event cost and the cross-controller interference in one change.
- **Watcher shutdown hook missing from the `context.WithoutCancel` rewrite** (in-scope) [Claude Opus 4.6] — Covered as S1. The round-3 fix decoupled the watcher from individual Reconcile calls but left no manager-level cancellation path. A `manager.RunnableFunc` registered in `RegisterWithManager` would complete the lifecycle.
- **Per-reconcile List cost for specs and finalizers** (future) [Claude Opus 4.6] — Already flagged in prior rounds and acknowledged in code comments. Becomes load-bearing once mirror counts exceed ~100 per type; fine for dev workstations today.

---

## Testing Assessment

BATS coverage for the new functionality is solid: `engine.bats` exercises image pull/untag/dangling, container lifecycle, volume mirroring (including RFC-1123-invalid names), bidirectional deletion, `spec.state` transitions, the non-moby `NotApplicable` path, and `rdd set` wait semantics for already-stopped / `--timeout=0`. The new `sync_containers_test.go` unit test covers `parseContainerName`, and `discovery_test.go` covers the `InitDiscovery`/`MarkControlPlaneReady` lifecycle.

Untested scenarios surfaced by the findings above:

1. **Cold-start race on `rdd set running=true`** (I1). `engine.bats` starts the service against an already-ready control plane or lets startup fully settle before asserting. A test that patches App the moment `ReadyAnnotation` lands — before `registerDiscovery` completes — would catch the ordering regression. Given the tight window, a unit-level test on `waitForDesiredState` against a mocked discovery ConfigMap is more practical.
2. **Non-engine resources in `rancher-desktop`** (I2). No test creates a Container/Image/Volume without the engine finalizer and verifies that engine stop / full-sync leaves it alone. The `containers-mock.bats` suite has already been modified to exclude engine, which documents the problem instead of guarding against it.
3. **Standalone `app-controller` with the engine import** (I3). No test runs `cmd/app-controller` with the embedded control plane disabled or with engine explicitly missing, so the misregistration path is unguarded.
4. **Watcher survives manager shutdown without `cli.Close`** (S1). The missing shutdown hook is invisible at the unit-test level because the process exits anyway; an integration test that inspects `/proc/self/fd` before and after shutdown on Linux would surface the leaked Docker client connection.
5. **`engine.bats` teardown** (S2). A forced-failure test that asserts the next run starts cleanly would catch the leak.

None of the reviewers ran `go test` this round; all analysis is static. The round-3 review ran `go test ./pkg/controllers/app/engine/controllers ./pkg/service/controllers ./pkg/controllers/containers/container ./pkg/controllers/app/app/controllers` against a prior revision and all four packages passed; re-running on `a5235c8` would be worthwhile before merge.

---

## Documentation Assessment

Design docs remain consistent with the implementation: `api_app.md` (new `ContainerEngineReady` condition), `api_containers.md` (engine mirroring + `spec.state` + finalizer lifecycle), `cmd_app.md` (`--timeout` and wait semantics, with the `--controllers=-engine` fallback caveat added in round 3), and `discovery.md` (ready annotation).

Two gaps remain after this round:

1. `--controllers` help text in `pkg/service/controllers/options.go:32` and `cmd/rdd/service.go:174` does not mention the `containers` group — tracked as S4.
2. `docs/design/discovery.md` describes the ready annotation contract but does not mention that `EnabledControllers` is populated *after* the annotation. Documenting that ordering would have helped this round catch I1 earlier, and clients that depend on the list should know to wait past just the annotation.

---

## Acknowledged Limitations

- `cmd/rdd/set.go:469-472` — Changes that trigger a VM restart without changing `running` (for example `containerEngine.name` alone) are still not waited on. The TODO proposes a `Reconciled` condition as the long-term fix. Unchanged from prior rounds.
- `pkg/controllers/app/engine/controllers/engine_reconciler.go:181-185` — `reconcileContainerSpecs` and `processFinalizers` are O(N) per child event. Per-resource reconcilers with watch predicates are the named long-term fix. Interacts with I2 — the ownership-label change would compose well with the split.
- `pkg/controllers/app/engine/controllers/docker_watcher.go:179-185` — Transient `handleEvent` errors for image/volume create are logged and dropped; the mirror stays missing until the next full resync. Periodic `fullSync` tick is deferred.
- `pkg/controllers/app/engine/controllers/sync_images.go:243-249` — `removeImagesByID` matches on `status.id`; a narrow orphan window exists if metadata Apply succeeded but status Apply failed transiently. Self-heals on the next watcher restart.
- `pkg/controllers/app/engine/controller.go:23-29` — Engine controller registration skipped on Windows until WSL2 socket access is wired up. The round-3 fix to `waitForDesiredState` means `rdd set running=true` no longer hangs on Windows, but the mirroring functionality itself is still unavailable there.
- `pkg/controllers/app/engine/controllers/engine_reconciler.go:136-147` — `ContainerEngineReady` is moby-only; to be renamed when containerd mirroring lands.

---

## Unresolved Feedback

- **BATS `$(rdd svc paths docker_socket)` at file scope** (`bats/tests/32-app-controllers/engine.bats:13-14`). Flagged by Claude and Gemini this round, and as S2 in round 3 — declined in all prior rounds on the grounds that `rdd svc paths docker_socket` is a pure computation with no real failure mode. Recording here rather than in Suggestions because the author's position is on the record and unchanged. If the decision is that the project's BATS style rule applies only when the substituted command has a real failure path (socket probes, process queries), that carve-out would be worth documenting in `bats-style.md` so reviewers and agents stop re-flagging it.

---

## Agent Performance Retro

### [Claude]

Claude caught the cold-start race on `rdd set running=true` and filed it as S1, with a correct root-cause description but insufficient urgency — the fallback from `ContainerEngineReady` to `Running=True` is exactly the regression the PR exists to close. Codex filed the same code path at Important with a tighter file:line trace, and we promoted Claude's S1 into the consolidated I1. Claude's unique contribution was the `context.WithoutCancel` shutdown leak, a clean catch on a round-3 fix: it identified that the decoupling is too complete and that no manager-level cancellation path exists. Claude withdrew one self-filed finding (I3, stale-discovery cleanup) mid-review after re-reading the code, which is the right behaviour and worth rewarding. Its Coverage Summary was the most complete of the three, flagging per-file "Trivial" vs "Reviewed, no issues". Calibration was mostly sound though Claude's I1 (WithoutCancel) reads closer to Suggestion-severity; we downgraded.

### [Codex]

Codex delivered the three highest-impact structural findings of the round and the only Important findings that survived consolidation unchanged. The `ReadyAnnotation`-vs-`registerDiscovery` ordering trace crosses four files (`manager.go`, `set.go`, `discovery.go`, `service.go`) and gets the causality right on the first try. The "engine deletes unowned objects" finding correctly links the code behaviour to the visible test-suite carve-out in `containers-mock.bats`, which is the kind of cross-file reasoning that separates a ship-blocking finding from a theoretical one. The `app-controller` import finding contradicts Gemini on the same code path and is right. Codex did not find the shutdown leak Claude caught. It did not run tests this round (prior rounds did). Calibration was correct on all three Important findings — no reconciliation adjustments needed.

### [Gemini]

Gemini's output this round was net-negative. Its Critical finding (C1, invalid `@test` syntax) is a hallucination: the `engine.bats` file uses `@test` correctly, verified directly on disk. Its Important I1 (API group mismatch with app-controller silently omitting engine) is backwards — `pkg/external/main.go` does not filter by `GetAPIGroup()`, and Codex's finding on the same code path has the causality right. Its Important I2 (BATS `$(...)` substitution) is a re-flag of a standing-declined style issue. We dropped C1 and I1 as false positives and routed I2 to Unresolved Feedback. Gemini skipped `git blame` again (consistent with its quota-driven behaviour), so it could not distinguish regressions from pre-existing issues — which might be why it spent its analysis budget on hallucination-prone structural claims rather than on tracing the round-3 fixes. This is the second round where Gemini's calibration has been notably worse than the other two agents; worth watching whether the pattern is model-level or this-diff-specific.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 3m 34s | 7m 26s | 6m 54s |
| Findings | 3S | 3I 1S | none |
| Tool calls | 35 (Read 19, Grep 9, Bash 7) | 78 (shell 77, plan 1) | 21 (grep_search 12, run_shell_command 8, read_file 1) |
| Design observations | 5 | 4 | 3 |
| False positives | 1 | 0 | 2 |
| Unique insights | 1 | 3 | 0 |
| Files reviewed | 39 | 39 | 39 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **3S** | **3I 1S** | **none** |
| Downgraded | 1 (I→S) | 0 | 0 |
| Dropped | 3 | 0 | 3 |


**Reconciliation** — Claude P1 `isEngineControllerEnabled` race: suggestion → important I1 (promoted; shared with Codex and classified as a regression-reintroducing bug rather than a soft concern). Claude P1 `context.WithoutCancel` shutdown leak: important → suggestion S1 (downgraded; narrow shutdown-only impact and process exits soon after). Gemini P1 invalid `@test` syntax: critical → dropped as false positive (file verified to use `@test` correctly). Gemini P1 API-group mismatch: important → dropped as false positive (the causality trace contradicts `pkg/external/main.go`, which does not filter by `GetAPIGroup()`). Gemini P1 BATS `$(...)` substitution: important → dropped to Unresolved Feedback (standing-declined in prior rounds). Claude P1 image-name length (S3 in its review): dropped as false positive (DNS-1123 subdomain, 253-char limit applies). Claude P1 `waitForDesiredState` non-running TODO: dropped (acknowledged limitation, not a finding). Claude P1 stale-discovery cleanup: withdrawn by Claude itself during review.

Overall, Codex carried the round. Claude contributed one valuable unique finding and corroborated the highest-impact regression; Gemini did not usefully improve the output and cost orchestration time on verification. For a round-4 review of a bundle of targeted fixes, two reliable agents is the right spread — Gemini's hallucinations this round are a reminder that agent output must be verified against source before it reaches a report.

---

## Review Process Notes

### Skill improvements

- **When reviewing a follow-up round, treat each prior-round fix as a dedicated review target.** Every fix is a small patch against known-good code, so the reviewer should explicitly ask: "what new code path does this introduce, and what else reads the state this change writes?" Three of the four round-4 findings (I1, I3, S1) are direct consequences of round-3 fixes interacting with unrelated code paths. A review checklist item of the form "for each prior-round fix, list every caller of the modified symbol and verify none depend on the prior timing or shape" would have caught I1 (the new `isEngineControllerEnabled` caller vs the existing `registerDiscovery` ordering) on the first pass rather than promoting a suggestion into an important.

### Repo context updates

- **UPDATE `deep-review-context.md`: add the discovery ConfigMap ordering contract.** The file already documents that CRDs become ready when `rdd.rancherdesktop.io/ready=true` appears on the discovery ConfigMap. Round 4 surfaced a subtler invariant: `EnabledControllers` in `configMap.Data` is populated *after* the ready annotation, because `registerDiscovery` runs later in `SharedControllerManager.Start`. Any client that reads both the annotation and the enabled-controller list races. Add a one-paragraph note in the "Control plane readiness and discovery" subsection stating that the annotation and the enabled-controller list are not consistent checkpoints, and that code depending on both must serialize through something stricter (e.g. a controller-manager-health probe or a precomputed enabled-controller key written before `MarkControlPlaneReady`). Rationale: I1 this round could have been prevented by a repo-context entry that a reviewer would have read before pattern-matching on "read discovery, make decision".

- **ADD to `deep-review-context.md`: engine controller deletes unowned objects in `rancher-desktop` namespace.** Until I2 is fixed, any reviewer looking at a PR that adds a new controller touching `Container`/`Image`/`Volume` kinds needs to know that engine will delete those objects on VM stop/backend switch unless they carry the engine finalizer. Add a short warning in the "Engine controller" subsection saying that `Container`/`Image`/`Volume` mirrors in `rancher-desktop` are co-owned space and a second controller must either pick a different namespace or coordinate with engine's prune logic. Drop this entry once I2 is resolved with an ownership marker.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

See `/tmp/deep-review-6JQq1I/review-claude-pass-1.md` (reproduced below): I1 (context.WithoutCancel leaks watcher on manager shutdown; downgraded to S1), I2 (BATS `$(...)` substitution at file scope; routed to Unresolved Feedback), I3 (stale-discovery cleanup; withdrawn mid-review), S1 (`isEngineControllerEnabled` queries discovery before registration; promoted to consolidated I1), S2 (missing `engine.bats` teardown; consolidated S2), S3 (image/volume mirror name length; dropped as false positive — DNS-1123 subdomain allows 253 chars), S4 (`waitForDesiredState` ignores non-running changes; dropped as acknowledged limitation), S5 (duplicate errs initialization style; consolidated S3). Included a per-file Coverage Summary, which was the most complete of the three.

### [Codex] — Pass 1

See `/tmp/deep-review-6JQq1I/review-codex-pass-1.md`: I1 (`ReadyAnnotation` fires before enabled-controller list is populated; consolidated I1), I2 (engine cleanup/prune deletes unowned objects in `rancher-desktop`; consolidated I2), I3 (`app-controller` imports engine without API-group filter; consolidated I3), S1 (stale `--controllers` help text; consolidated S4). Did not run tests this round. All three Important findings survived consolidation unchanged.

### [Gemini] — Pass 1

See `/tmp/deep-review-6JQq1I/review-gemini-pass-1.md`: C1 (invalid `@test` syntax in `engine.bats`; dropped as false positive — the file uses `@test` correctly), I1 (API group mismatch with `app-controller` silently omitting engine; dropped as false positive — `pkg/external/main.go` does not filter by `GetAPIGroup()`, so engine is incorrectly *included*, not omitted), I2 (BATS `$(...)` substitution at file scope; routed to Unresolved Feedback). No suggestions filed. Gemini skipped `git blame` (consistent with prior-round behaviour) and did not land any unique finding that survived consolidation.

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-15 |
| **Commit** | `f3856c1 engine: mirror Docker container engine state into Kubernetes` (fixup autosquashed into the original) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: `rdd set running=true` cold-start race (ReadyAnnotation ordering) | Fixed (already in HEAD before round 4: `registerDiscovery` moved before `MarkControlPlaneReady`, with doc comments in `discovery.go`/`manager.go`/`service.go` updated) |
| 2 | Important #2: Engine cleanup deletes unowned objects in `rancher-desktop` | Commented (engine is authoritative for these kinds; limitation documented on `deleteAllOfType`. Ownership-label fix deferred to pair with the per-resource reconciler split) |
| 3 | Important #3: `cmd/app-controller` imports engine under the wrong manager | Fixed (dropped the engine import from `cmd/app-controller/main.go`; added an API-group guard in `pkg/external/main.go` that logs an error and skips any foreign-group controller, so a stray import surfaces as a build-time bug instead of silent cross-registration) |
| 4 | Suggestion #1: `context.WithoutCancel` leaks the watcher goroutine on manager shutdown | Fixed (reconciler now owns `watcherCtx`/`watcherCancel` bounded by manager lifetime; `SetupWithManager` registers a `manager.RunnableFunc` that cancels `watcherCtx` and calls `stopWatcher` on shutdown, so `cli.Close()` always runs) |
| 5 | Suggestion #2: `engine.bats` has no `local_teardown_file` | Skipped (the `rdd svc delete` in `local_setup_file` blows away the VM and all Docker state on every run; explicit teardown would duplicate that coverage. Revisit if we ever drop the VM recreate to speed up tests) |
| 6 | Suggestion #3: `errs := []error{applyErr}` pre-seeds with a possibly-nil error | Fixed (replaced with `var errs []error; if applyErr != nil { errs = append(errs, applyErr) }` to match the package's prevailing style) |
| 7 | Suggestion #4: `--controllers` help text missing `containers` group | Fixed (extracted `ControllersFlagUsage` constant in `pkg/service/controllers/options.go` listing all groups and `containers,-engine` as an example; both `options.go` and `cmd/rdd/service.go` reference it so the two cannot drift) |
| 8 | Testing gaps (5 scenarios) | Skipped (batch — covered by the code fixes, blocked on separate design work, or requiring process-level fault injection) |
| 9 | Documentation gap: `docs/design/discovery.md` describes the old ready-annotation contract | Fixed (updated the "Ready annotation" section to document that `ready=true` signals both CRDs installed and controller-manager registration, matching the code comments in commit `334de5a`) |
