# Deep Review: 20260415-103429-pr-309

| | |
|---|---|
| **Date** | 2026-04-15 10:34 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 5 (of pr-309) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#309](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/309) — engine: mirror Docker container engine state into Kubernetes |
| **Branch** | `engine-v3` |
| **Commits** | `f3856c1` engine: mirror Docker container engine state into Kubernetes |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — two real regressions: `containernamespace/moby` is not self-healing after user deletion, and the external `containers-controller` binary never registers the engine controller. Two narrower races around the discovery ready annotation are worth tightening before merge. |
| **Wall-clock time** | `13 min 40 s` |

## Executive Summary

The PR adds the `engine` controller under `pkg/controllers/app/engine/`, plus `rdd set` wait semantics, a discovery ready annotation, and a new `Container.spec.state=unknown` value. The mirroring design (SSA dual field manager, daemon-clock-anchored event replay, `retry.RetryOnConflict` across every App status writer, idempotent `fullSync` on reconnect) is sound and well-documented. The suite finds no critical issues.

Four important findings warrant pre-merge attention: (1) `ContainerNamespace moby` is only applied during `fullSync`, so a user delete is never restored until the watcher cycles — a BATS test even locks in the broken behavior with `refute_output`; (2) the engine controller's blank import is only in the embedded `rdd svc serve` path, not in `cmd/containers-controller/main.go`, so the external binary silently drops the feature; (3) `MarkControlPlaneReady` fires before `mgr.Start()`, creating a narrow window where a client observes the annotation but webhooks and reconcilers have not yet started; (4) a failed `registerDiscovery` is logged-not-fatal but the ready annotation is still written over an empty controller list, which flips `isEngineControllerEnabled` to the wrong wait branch in `rdd set`.

---

## Critical Issues

None.

---

## Important Issues

I1. **`ContainerNamespace moby` is not recreated after user deletion** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:560-567` [Codex GPT 5.4]

```go
return ctrl.NewControllerManagedBy(mgr).
    For(&appv1alpha1.App{}).
    Named("engine-reconciler").
    WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
    Watches(&containersv1alpha1.Container{}, enqueueApp).
    Watches(&containersv1alpha1.Image{}, enqueueApp).
    Watches(&containersv1alpha1.Volume{}, enqueueApp).
    Complete(r)
```

`syncContainerNamespace` is only called from `fullSync` (`docker_watcher.go:436-438`), and the steady-state reconcile path in `Reconcile` never touches the `ContainerNamespace` mirror. The controller graph also does not `Watches` it, so deleting `containernamespace/moby` fires no event, triggers no reconcile, and the mirror stays gone until the watcher cycles (Docker daemon restart or VM restart). The API stops reflecting the running engine's namespace in the meantime.

The new BATS suite actually locks this behavior in: `engine.bats:375-381` deletes `containernamespace/moby` and then asserts `refute_output` on the list, which passes only because the mirror never comes back.

```bash
@test "deleting containernamespace/moby completes without a finalizer hang" {
    rdd ctl delete containernamespace/moby --namespace="${NAMESPACE}" --timeout=10s
    run -0 rdd ctl get containernamespaces --namespace="${NAMESPACE}" --output=name
    refute_output
}
```

Fix: add `ContainerNamespace` to the controller's `Watches` list, and re-apply `containernamespace/moby` as part of the steady-state reconcile (or gate it behind `wantWatcher` the same way `reconcileContainerSpecs` is). Flip the BATS test to `rdd ctl wait --for=create containernamespace/moby --timeout=10s` so the self-heal is verified instead of masked.

I2. **External `containers-controller` binary does not register the engine controller** — `cmd/containers-controller/main.go:8-14` [Codex GPT 5.4]

```go
import (
    "os"

    // Import app controller packages to trigger init() functions.
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
    "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
)
```

The engine controller declares itself part of the `containers` API group (`pkg/controllers/app/engine/controller.go:38`) with the explicit rationale that it watches `Container`, `Image`, and `Volume`. The embedded path in `pkg/service/cmd/service.go:59` blank-imports `pkg/controllers/app/engine`, but `cmd/containers-controller/main.go` only imports `pkg/controllers/containers`, so the external binary never sees the engine controller's `init()` and `base.RegisterController` never runs. `external.RunControllers("containers")` therefore exposes the containers CRDs without any Docker mirroring.

Either the grouping is right and the import is missing, or the grouping should change to `app` and the import should move to `cmd/app-controller/main.go`. Picking one and making the import graph match is the fix.

```diff
 import (
     "os"

-    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
+    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
+    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
     "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
 )
```

I3. **`MarkControlPlaneReady` fires before `mgr.Start()` registers controllers and webhooks** — `pkg/service/controllers/manager.go:198-203` [Claude Opus 4.6]

```go
// Mark the control plane ready after registerDiscovery so clients
// waiting on the ready annotation see both CRDs installed and
// controller registration written, not just CRDs.
if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
    return fmt.Errorf("failed to mark control plane as ready: %w", err)
}
```

`MarkControlPlaneReady` at line 201 runs well before `mgr.Start(ctx)` at line 284. Between those two points, `Start` still iterates `scm.registrations` to call `RegisterWithManager` (line 223-235) and then runs webhook `Setup` in parallel goroutines (line 241-261). `mgr.Start` is where controller-runtime actually starts the reconcile loops and the webhook HTTP server. A client that observes the ready annotation at this point and immediately issues a request can:

- hit an admission webhook whose HTTPS server is not yet listening, returning "connection refused" from the apiserver's webhook call, or
- land on a reconciler that has not registered yet and see a spec change that is never reconciled until the next event fires.

The window is short in practice because `rdd set`'s own `ensureServiceRunning` + `waitForFreshDiscoveryConfigMap` gives the manager time to finish startup, but the annotation's documented contract ("every enabled controller has installed its CRDs") is stronger than what the code currently guarantees, and the whole point of the annotation is to let clients skip ad-hoc waits.

Fix: move `MarkControlPlaneReady` into a `manager.RunnableFunc` added to `mgr` so it runs only after `Start` has initialized controllers and webhooks. A `RunnableFunc` added before `mgr.Start` fires once the manager is up.

I4. **`registerDiscovery` failure is swallowed but `MarkControlPlaneReady` still runs** — `pkg/service/controllers/manager.go:193-203` [Claude Opus 4.6]

```go
if err := scm.registerDiscovery(ctx); err != nil {
    log.Error(err, "Failed to register controller manager for discovery")
    // Don't fail startup for discovery registration errors
}

// Mark the control plane ready after registerDiscovery so clients
// waiting on the ready annotation see both CRDs installed and
// controller registration written, not just CRDs.
if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
    return fmt.Errorf("failed to mark control plane as ready: %w", err)
}
```

If `registerDiscovery` fails at 193, execution falls through and `MarkControlPlaneReady` stamps the ready annotation on a discovery ConfigMap that has no controller entries. `rdd set`'s `isEngineControllerEnabled` (`cmd/rdd/set.go:523-537`) then reads an empty controller list, decides the engine controller is *not* enabled, and falls back to waiting on `Running=True` — which returns before the Docker full sync has finished. The user sees `rdd set running=true` complete while the engine is still catching up.

Fix: either make `registerDiscovery` failure fatal (it already returns an error), or skip `MarkControlPlaneReady` when it fails. The "don't fail startup" comment pre-dates the new ready-annotation contract and is now at odds with it.

```diff
-	if err := scm.registerDiscovery(ctx); err != nil {
-		log.Error(err, "Failed to register controller manager for discovery")
-		// Don't fail startup for discovery registration errors
-	}
-
-	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
+	if err := scm.registerDiscovery(ctx); err != nil {
+		return fmt.Errorf("failed to register controller manager for discovery: %w", err)
+	}
+	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
```

---

## Suggestions

S1. **BATS `DOCKER_HOST` is resolved at file-load time via bare command substitution** — `bats/tests/32-app-controllers/engine.bats:12-14` [Claude Opus 4.6] [Codex GPT 5.4]

```bash
NAMESPACE="rancher-desktop"
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST
```

The assignment on line 13 runs during BATS file load, before `local_setup_file` has a chance to run `rdd svc delete` / `rdd set running=true`. It works today because `rdd svc paths docker_socket` is a static path computation, but (a) it bypasses the repo's BATS convention of capturing command output via `run -0` so failures surface the real stderr instead of an opaque shell-load error, and (b) it couples the test to `rdd svc paths` never needing a running service, which is not promised by the command's contract.

Fix: move the resolution into `local_setup_file` after the service is up, using `run -0` per the BATS style rules.

```diff
 NAMESPACE="rancher-desktop"
-DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
-export DOCKER_HOST

 local_setup_file() {
     skip_on_windows
     rdd svc delete
     rdd set running=true
+    run -0 rdd svc paths docker_socket
+    DOCKER_HOST="unix://${output}"
+    export DOCKER_HOST
 }
```

S2. **`setCondition` in the generic Container reconciler omits `ObservedGeneration`** — `pkg/controllers/containers/container/container_reconciler.go:81-88` [Gemini 3.1 Pro]

```go
func (r *reconciler) setCondition(ctx context.Context, container *containersv1alpha1.Container, conditionType string, status metav1.ConditionStatus, reason, message string) {
    changed := apimeta.SetStatusCondition(&container.Status.Conditions, metav1.Condition{
        Type:    conditionType,
        Status:  status,
        Reason:  reason,
        Message: message,
    })
```

The engine controller and `rdd set` rely on `observedGeneration >= post-write generation` to tell a stale condition from a fresh one, and `EngineReconciler.setEngineCondition` / `AppReconciler` both stamp it. The generic Container reconciler's `setCondition` helper does not, so any future tooling that polls `Container` conditions — or any additional condition the generic controller grows — will silently lack the staleness signal. Cheap to add now; expensive to retrofit once a consumer depends on it.

Fix: pass `container.Generation` into the `metav1.Condition` literal.

S3. **`engineControllerName` duplicates an unexported constant in the engine package** — `cmd/rdd/set.go:48` [Claude Opus 4.6]

```go
// engineControllerName mirrors the unexported controllerName in
// pkg/controllers/app/engine/controller.go.
engineControllerName = "engine"
```

`isEngineControllerEnabled` compares the discovery ConfigMap's controller list against this local copy. If either side drifts, the function silently falls back to the wrong `rdd set` wait path (Running=True instead of ContainerEngineReady=True) — a "successful" command that returns before mirroring is ready. Export the constant from `pkg/controllers/app/engine/controller.go` so drift is a compile error.

S4. **`startWatcher` runs `fullSync` under `watcherCtx`, not Reconcile's ctx** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:216-229` [Claude Opus 4.6]

```go
func (r *EngineReconciler) startWatcher(_ context.Context) error {
    r.watcherMu.Lock()
    defer r.watcherMu.Unlock()

    if r.watcher != nil {
        return nil
    }

    w, err := newDockerWatcher(r.watcherCtx, r.Client, r.reconcileChan)
```

The `_ context.Context` parameter makes the drop explicit and the comment at the callsite explains why, but `newDockerWatcher` runs `fullSync` synchronously before returning. A future `ReconciliationTimeout` on the engine controller would not apply to `fullSync` because `watcherCtx` only cancels on manager shutdown. A short comment on `startWatcher` noting that `fullSync` intentionally runs under `watcherCtx`, and that adding a reconciliation timeout would require decoupling `fullSync` from the `startWatcher` call, would save a reader the trace.

S5. **`reconcileContainerSpecs` and `processFinalizers` read a stale watcher snapshot** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:358-363, 386-391` [Claude Opus 4.6]

```go
func (r *EngineReconciler) reconcileContainerSpecs(ctx context.Context) error {
    r.watcherMu.Lock()
    w := r.watcher
    r.watcherMu.Unlock()
    if w == nil {
        return nil
    }
```

Both methods snapshot `r.watcher` under the lock and then use it without resynchronising. If the watcher dies between the snapshot and the Docker API calls, the Docker client is already closed (`w.stop()` calls `cli.Close()`), so the first attempt fails with `use of closed network connection` instead of a clearer "watcher died" message. The reconcile self-heals on requeue, so the user impact is only a noisier log. Either accept it with a one-line comment, or re-check `w.alive()` at the top of each method.

S6. **Removing `UnregisterControllerManager` from the preamble loses a hint for future readers** — `pkg/service/controllers/manager.go` [Claude Opus 4.6]

The diff drops a stale-cleanup call that used to run at the top of `Start()`. The reason is sound — `InitDiscovery` deletes and recreates the ConfigMap, so the preamble unregister is redundant — but neither the commit message nor the code says so. A one-line comment at the deferred cleanup block (line 206-212) noting that stale entries are handled by `InitDiscovery` would prevent a future reader from reintroducing the call.

---

## Design Observations

**Concerns**

- **One-shot `fullSync` vs steady-state reconcile split** `(in-scope)` [Codex GPT 5.4] — The engine controller divides work between a one-time `fullSync` path that seeds scaffolding (e.g. `ContainerNamespace`, the `rancher-desktop` namespace) and a steady-state reconcile that only maintains `Container`/`Image`/`Volume`. I1 is the first consequence of that split: scaffolding resources exist only while startup or reconnect happens to recreate them. An explicit "ensure mirror scaffolding" step in steady-state reconcile is a small in-scope fix; the longer-term direction is the per-resource reconcilers with watch predicates already hinted in the code.
- **Per-reconcile `List` calls in `reconcileContainerSpecs` and `processFinalizers`** `(in-scope)` [Claude Opus 4.6] — Both methods list every mirror on every reconcile, and every child event enqueues the App singleton, so cost is O(children) per burst. The code's own TODO at `engine_reconciler.go:191-195` calls out the long-term fix (per-resource reconcilers with deletion/spec-change predicates). Fine for a dev workstation, noticeable once a user runs hundreds of containers.
- **Controller grouping is encoded in three places** `(future)` [Codex GPT 5.4] — `GetAPIGroup()`, the embedded blank imports, and the external-binary blank imports all have to agree. I2 is the first divergence in that arrangement. A single per-group registration package imported by both embedded and external entrypoints would remove the whole class of drift.
- **`rdd set` wait only covers `running=true/false`** `(future)` [Claude Opus 4.6] — `cmd/rdd/set.go:469-472` still has a TODO that changes like `containerEngine.name` trigger a VM restart without being waited on. The `Reconciled` condition design in MEMORY.md is the clean path forward.

**Strengths**

- **Dual SSA field manager for finalizer-only applies** [Claude Opus 4.6] [Codex GPT 5.4] [Gemini 3.1 Pro] — Using `engine-controller-finalizer` as a separate field owner for finalizer reassertions is the right way to prevent a finalizer apply from releasing spec ownership, which would otherwise prune spec and fail the CRD's required-field validation. Documented clearly at `sync_containers.go:138-148`.
- **Daemon-clock anchoring for event replay** [Claude Opus 4.6] [Codex GPT 5.4] [Gemini 3.1 Pro] — `dockerEventsSince` reads `Info.SystemTime` from the daemon and uses that as the Docker events `Since` filter, neutralising host/guest clock skew. The fallback to a biased host clock (`now - 2m`) pairs cleanly with an idempotent `fullSync`, so replaying a few extra events is safe.
- **`retry.RetryOnConflict` + re-`Get` across every App status writer** [Claude Opus 4.6] — Applied consistently in `setEngineCondition`, `AppReconciler`, `removeMirrorResource`, `deleteAllOfType`, and every finalizer processor. This is exactly the pattern needed now that two controllers (`EngineReconciler` and `AppReconciler`) both write to `App.Status.Conditions` and controller-runtime does not serialise across controllers.
- **`watcherCtx` scoped to manager lifetime, not Reconcile** [Claude Opus 4.6] [Gemini 3.1 Pro] — Dropping Reconcile's ctx and seeding the watcher with a `manager.RunnableFunc`-managed context prevents the Docker client from dying on every Reconcile return, and the shutdown hook guarantees clean teardown on manager exit.
- **Discovery ready annotation replaces ConfigMap-existence probes** [Gemini 3.1 Pro] — Promoting readiness from "discovery ConfigMap exists" to an explicit `rdd.rancherdesktop.io/ready=true` annotation is the right direction; I3 is a sequencing issue with the *current* fire point, not with the design.
- **Immutable-spec webhook permits state transitions cleanly** [Claude Opus 4.6] — `container_webhooks.go:38-43` copies the old spec, replaces only `State`, and then compares, which is the right way to allow one mutable field without enumerating all immutable ones.

## Testing Assessment

The new `bats/tests/32-app-controllers/engine.bats` suite covers the critical paths well: image pull / tag / untag / dangling / remove, container lifecycle and `spec.state` transitions, volumes with RFC-1123-invalid names, bidirectional deletion, `rdd set` wait semantics including `--timeout=0`, and the containerd no-op path. The unit test at `sync_containers_test.go` covers `parseContainerName` adequately, and `discovery_test.go` covers the `InitDiscovery` → `MarkControlPlaneReady` lifecycle.

Gaps ranked by risk:

1. **Self-heal of `containernamespace/moby`** — not only untested, the test at `engine.bats:375-381` actively locks in the broken behavior (I1). Flipping `refute_output` to `rdd ctl wait --for=create` reverses the mask into a verification.
2. **External `containers-controller` binary registration** — no smoke test verifies that the binary registers the engine controller via discovery, so I2 shipped unnoticed. A smoke test that `containers-controller` exposes `engine` in discovery would catch future import/group mismatches.
3. **Watcher reconnect on Docker daemon restart** — the `w.alive()` detection, `watcherDied` log, and reconnect-triggers-`fullSync` path are only exercised if Docker actually disconnects during the test, which nothing currently does.
4. **Webhook rejects non-state spec changes on existing containers** — the suite exercises `spec.state` transitions but no test tries to change `spec.image` on an existing container and asserts the webhook rejects it.
5. **`processImageFinalizers` on an image Docker refuses to delete** — the suite covers in-use image finalizer retention, but not a permission-denied / structural error path where `ImageRemove` returns non-`NotFound`.
6. **`DOCKER_HOST` resolution failure surfacing** — S1: the current bare substitution hides socket resolution errors.

## Documentation Assessment

The design-doc updates (`api_app.md`, `api_containers.md`, `cmd_app.md`, `discovery.md`) are thorough and accurately describe the new behavior. `api_containers.md`'s terminology section clarifying `Container`/`Image`/`Volume` (the rdd K8s resource) vs `container`/`image`/`volume` (the Docker engine object) is a notable addition that will pay off in future reviews. The `ControllersFlagUsage` constant in `options.go` keeps the `--controllers` help text in sync between embedded and standalone binaries.

Missing: an explicit statement of which standalone binary hosts `engine`. The code says `apiGroup = "containers"`, but the external import graph disagrees (I2); picking one and documenting it closes the loop.

## Acknowledged Limitations

- **Per-reconcile `List` calls** — documented inline at `engine_reconciler.go:191-195`; long-term fix is per-resource reconcilers.
- **`rdd set` does not wait on non-`running` spec changes** — documented at `cmd/rdd/set.go:469-472`; design of a `Reconciled` condition is in MEMORY.md.
- **No containerd mirroring** — documented in the code (`engine_reconciler.go:147-156`) and the design doc; returns `NotApplicable` on containerd.
- **`Container.spec.state` is level-triggered** — documented in MEMORY.md as a known design limitation to be addressed with edge-triggered action annotations after containerd mirroring.
- **Windows Docker socket unsupported** — documented in both the controller `init()` guard and `skip_on_windows` in `engine.bats`.
- **Image mirrors with empty status have no engine reference to forward a delete to** — documented inline at `sync_images.go:248-252` and in `processImageFinalizers`.

---

## Agent Performance Retro

### [Claude]

Claude carried the review. It surfaced I3 and I4 (the `MarkControlPlaneReady` sequencing and the `registerDiscovery` swallow), both of which required reading the full `Start()` flow and noticing that the ready annotation promise is stronger than the current code guarantees. It also raised the useful latent-comment pair (S4 `fullSync`-under-`watcherCtx`, S5 stale-watcher reference) that are low-impact but reward a future reader. A couple of findings were raised as important but are really polish — S4 and S5 were downgraded during consolidation because they are documented acknowledgments rather than behavior gaps. The coverage summary was disciplined: 40 files with per-file dispositions.

### [Codex]

Codex landed the two biggest findings. I1 (ContainerNamespace self-heal) required *both* tracing `syncContainerNamespace` back to its only caller and then noticing that the BATS test at 375-381 locks the broken behavior in — a nice cross-artifact catch. I2 (external-binary wiring) required comparing `cmd/containers-controller/main.go` against `pkg/service/cmd/service.go` and checking what `GetAPIGroup()` returns. Neither of the other two agents caught either issue. Codex's single suggestion (S1 BATS `DOCKER_HOST`) was the sharper framing — "bypasses `run -0` per BATS conventions" — compared to Claude's more generic "moves coupling".

### [Gemini]

Gemini's single finding (S2 missing `ObservedGeneration` on the generic Container `setCondition`) is a valid forward-looking gap. Its design observations were crisp and it correctly flagged the discovery ready annotation as a clean direction, but it did not notice that the annotation fires too early (I3) or that `registerDiscovery` failure desyncs the wait (I4). Gemini marked every manager.go / engine_reconciler.go file "Reviewed, no issues" while those two files between them contain three of the four important findings — a meaningful coverage miss that mostly reflects skipping `git blame` (documented quota behavior) and a lighter trace of the `Start()` call path.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 4m 43s | 7m 45s | 4m 52s |
| Findings | 2I 7S | 2I 1S | 1S |
| Tool calls | 30 (Read 18, Bash 11, Agent 1) | 80 (shell 77, stdin 3) | 7 (grep_search 3, run_shell_command 3, read_file 1) |
| Design observations | 6 | 4 | 4 |
| False positives | 0 | 0 | 0 |
| Unique insights | 2 | 2 | 1 |
| Files reviewed | 40 | 40 | 40 |
| Coverage misses | 0 | 0 | 2 |
| **Totals** | **2I 7S** | **2I 1S** | **1S** |
| Downgraded | 2 (I→S) | 0 | 0 |
| Dropped | 0 | 0 | 0 |


**Reconciliation:** Claude I3 `startWatcher` / `fullSync` comment: important → suggestion S4 (latent, comment-only). Claude I4 stale-watcher error-message clarity: important → suggestion S5 (self-healing on requeue, cosmetic).

Overall, Codex provided the most merge-blocking value this round (both important findings), Claude provided the highest density of correct findings and the best coverage discipline, and Gemini's contribution was limited to one forward-looking suggestion plus the useful design-observation framing.

---

## Review Process Notes

### Repo context updates

- **Discovery ready annotation sequencing** — When reviewing controller-manager startup, trace the fire point of `MarkControlPlaneReady` against `mgr.Start()` and flag any path that sets the ready annotation before the manager has actually registered controllers and started webhooks. The annotation's contract is "every enabled controller has installed its CRDs *and* is reconciling" — clients use it to skip ad-hoc waits, so a premature fire creates a false-ready window in which admission webhooks may not be listening yet. Also flag swallow-then-mark-ready patterns where a `registerDiscovery` (or equivalent prerequisite) failure is logged-not-fatal but the ready annotation still fires over a partially populated ConfigMap.
- **Scaffolding resources must be part of steady-state reconciliation** — When a reconciler splits work between a one-shot `fullSync` path and a steady-state path, verify that every mirror resource created by `fullSync` is either (a) watched and re-applied on steady-state reconcile, or (b) explicitly documented as "startup-only, never self-heals". Singleton scaffolding resources (namespaces, placeholder objects, `ContainerNamespace`-style markers) are the usual casualties because they're applied once and then never touched by steady-state code. Flag any watch list that omits a resource the controller applies anywhere.
- **Controller grouping agreement across entrypoints** — In repos where controllers are registered via `init()` + `base.RegisterController`, verify that every controller's `GetAPIGroup()` return value matches the blank imports in both embedded and external binary entrypoints. A controller declared in group `X` must be blank-imported in the binary that runs `external.RunControllers("X")`. Mismatch is a silent drop — the feature disappears from external mode without any error. Grep for the controller package under both `cmd/*/main.go` and the embedded service package; a missing import in either is a regression.

### Skill improvements

- **Scan BATS suites for test assertions that lock in broken behavior** — When reviewing a test that deletes a resource and then asserts its absence (`refute_output`, `--for=delete`), check whether the controller is *supposed* to recreate that resource. If the expected behavior is self-healing, a one-shot absence check is a false-positive test that hides the bug. Flag when the deletion test's assertion style is tighter than the controller's actual contract suggests.


---

## Appendix: Original Reviews

### [Claude] — Pass 1

---

## Executive Summary

This PR adds a comprehensive engine controller that mirrors Docker containers, images, and volumes into Kubernetes resources, along with `rdd set` wait semantics and a discovery ready annotation. The implementation is well-structured with proper SSA ownership, retry-on-conflict patterns, and thorough BATS coverage. The key findings center on a `MarkControlPlaneReady` call that races against controller readiness, and the BATS test file evaluating `rdd svc paths` at load time.

---

## Findings

### Critical Issues

None.

### Important Issues

I1. **`MarkControlPlaneReady` fires before controllers start running** — `pkg/service/controllers/manager.go:201`

```go
// Mark the control plane ready after registerDiscovery so clients
// waiting on the ready annotation see both CRDs installed and
// controller registration written, not just CRDs.
if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
    return fmt.Errorf("failed to mark control plane as ready: %w", err)
}
```

`MarkControlPlaneReady` at line 201 runs inside `SharedControllerManager.Start()` *before* `mgr.Start()` (the `mgr.Start()` call is further down at approximately line 229). The ready annotation promises that CRDs are installed and discovery is registered, but the manager's controllers — including the engine reconciler and its webhooks — have not started yet. A client that observes the ready annotation and immediately patches the App could hit a webhook that hasn't been registered yet, or trigger a reconcile before the engine reconciler's `SetupWithManager` goroutines are running.

In practice, the engine reconciler's watch loop starts asynchronously within `mgr.Start()`, and `rdd set` does `ensureServiceRunning` + `waitForFreshDiscoveryConfigMap` which gives the manager time to initialize. The window is narrow and mostly theoretical in the single-user embedded context, but the documented contract ("every enabled controller has registered") is not strictly upheld.

Fix: Move `MarkControlPlaneReady` into a `manager.RunnableFunc` that fires after the manager's leader election (no-op here) and cache sync, so the annotation is set only after all controllers are fully registered. Alternatively, document the weaker guarantee.

---

I2. **`registerDiscovery` failure is swallowed but `MarkControlPlaneReady` still fires** — `pkg/service/controllers/manager.go:193-203`

```go
if err := scm.registerDiscovery(ctx); err != nil {
    log.Error(err, "Failed to register controller manager for discovery")
    // Don't fail startup for discovery registration errors
}

// Mark the control plane ready after registerDiscovery ...
if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
    return fmt.Errorf("failed to mark control plane as ready: %w", err)
}
```

When `registerDiscovery` fails at line 193, execution continues and `MarkControlPlaneReady` at line 201 sets the ready annotation on a ConfigMap that has no controller data entries. `rdd set`'s `isEngineControllerEnabled` (at `cmd/rdd/set.go:523-537`) reads the empty controller list, concludes engine is *not* enabled, and falls back to waiting on `Running=True` instead of `ContainerEngineReady=True`. This is the wrong wait when the engine controller is in fact running — it returns too early, before the Docker sync has finished.

Fix: Skip `MarkControlPlaneReady` when `registerDiscovery` fails, or make `registerDiscovery` failure fatal (it already returns an error).

```diff
-	if err := scm.registerDiscovery(ctx); err != nil {
-		log.Error(err, "Failed to register controller manager for discovery")
-		// Don't fail startup for discovery registration errors
-	}
-
-	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
+	if err := scm.registerDiscovery(ctx); err != nil {
+		return fmt.Errorf("failed to register controller manager for discovery: %w", err)
+	}
+	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
```

(important, regression)

---

I3. **`startWatcher` passes `watcherCtx` to `fullSync` but `fullSync`'s K8s client calls may outlive Reconcile's ctx** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:216-229`

```go
func (r *EngineReconciler) startWatcher(_ context.Context) error {
    r.watcherMu.Lock()
    defer r.watcherMu.Unlock()

    if r.watcher != nil {
        return nil
    }

    w, err := newDockerWatcher(r.watcherCtx, r.Client, r.reconcileChan)
```

`startWatcher` discards the Reconcile context (the `_ context.Context` parameter at line 216) and passes `r.watcherCtx` to `newDockerWatcher`. This is documented and intentional for the long-lived watcher goroutine. However, `newDockerWatcher` calls `fullSync` at line 86 of `docker_watcher.go`, which runs synchronously before returning — and `fullSync` uses `r.Client` (the controller-runtime cached client). If a future `ReconciliationTimeout` is added to the controller, `startWatcher` would still block on `fullSync` under the longer-lived `watcherCtx`, defeating the timeout.

This is currently safe because no reconciliation timeout is configured, but it's a latent issue worth a comment.

Fix: Add a brief comment at `startWatcher` line 216 noting that `fullSync` runs under `watcherCtx` intentionally, and that adding a `ReconciliationTimeout` to this controller would require decoupling `fullSync` from the `startWatcher` call.

(important, gap)

---

I4. **`reconcileContainerSpecs` and `processFinalizers` read a stale watcher reference** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:358-363, 386-391`

```go
func (r *EngineReconciler) reconcileContainerSpecs(ctx context.Context) error {
    r.watcherMu.Lock()
    w := r.watcher
    r.watcherMu.Unlock()
    if w == nil {
        return nil
    }
```

Both functions snapshot `r.watcher` under the lock and then use it unsynchronized for the rest of the call. If the watcher dies between the snapshot and the Docker API calls (`reconcileContainerState`, `deleteContainer`, etc.), the Docker client is closed (`w.stop()` calls `cli.Close()`), and the calls fail with closed-connection errors. These errors are retried via `errors.Join` and requeue, so the controller self-heals, but the first attempt always fails with an opaque "use of closed network connection" error rather than a clear "watcher died" message.

Fix: Either wrap the Docker calls in a check of `w.alive()` or accept the current behavior with a clarifying comment noting that stale-watcher errors are benign and self-heal on requeue.

(important, gap)

### Suggestions

S1. **BATS `DOCKER_HOST` evaluated at file load time** — `bats/tests/32-app-controllers/engine.bats:13`

```bash
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST
```

`$(rdd svc paths docker_socket)` at line 13 runs during file load, before `local_setup_file()`. Since `DockerSocket()` is a static path computation (`filepath.Join(ShortDir(), "docker.sock")`), this works today. But it couples the test to an implementation detail — if `rdd svc paths` ever needs a running service, this breaks silently. Moving the assignment into `local_setup_file()` after the service starts would be more robust.

Fix: Move the `DOCKER_HOST` assignment into `local_setup_file()`:
```diff
 local_setup_file() {
     skip_on_windows
     rdd svc delete
     rdd set running=true
+    DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
+    export DOCKER_HOST
 }
```

(suggestion, regression)

---

S2. **`engineControllerName` duplicates an unexported constant** — `cmd/rdd/set.go:48`

```go
// engineControllerName mirrors the unexported controllerName in
// pkg/controllers/app/engine/controller.go.
engineControllerName = "engine"
```

This string is duplicated across `cmd/rdd/set.go:48` and `pkg/controllers/app/engine/controller.go:33`. If either changes independently, `isEngineControllerEnabled` silently falls back to the wrong wait path. The comment acknowledges this, but the fix is to export the constant from the engine package.

Fix: Export `controllerName` from `pkg/controllers/app/engine/controller.go` as `ControllerName`, and import it in `cmd/rdd/set.go`.

(suggestion, regression)

---

S3. **`UnregisterControllerManager` removed from `Start()` preamble without comment** — `pkg/service/controllers/manager.go` (removed lines)

The diff removes the stale-cleanup `UnregisterControllerManager` call that previously ran at the top of `Start()`. The rationale appears to be that `InitDiscovery` already deletes and recreates the ConfigMap, making the unregister redundant. This is correct, but the removal is uncommented in both the code and the commit message. A one-line comment at the `defer` cleanup block (line 206-212) noting that `InitDiscovery` handles stale entries would prevent a future reader from re-adding the call.

(suggestion, regression)

---

S4. **`conditionInfo` uses `unstructured.NestedInt64` which silently returns 0 for float64 values** — `cmd/rdd/set.go:607`

```go
gen, _, _ := unstructured.NestedInt64(cond, "observedGeneration")
```

JSON unmarshaling into `map[string]any` produces `float64` for numbers, not `int64`. `unstructured.NestedInt64` handles the `float64` → `int64` conversion correctly (it has an explicit `float64` case), so this is safe. But the triple-underscore error discard hides a potential future issue. Consider at least logging the error path.

(suggestion, gap)

---

S5. **`applyImage` uses `ForceOwnership` which could steal spec fields from other writers** — `pkg/controllers/app/engine/controllers/sync_images.go:183`

```go
err := w.k8s.Apply(ctx, image,
    client.ForceOwnership, client.FieldOwner(controllerName))
```

Unlike `applyContainer` (which carefully uses a Get-then-Apply-without-ForceOwnership pattern for spec), `applyImage` uses `ForceOwnership` on the full object including finalizers. Images have no user-mutable spec today, so this is safe. But if an Image spec is added later, this will clobber user fields. Add a comment noting the `ForceOwnership` is safe because Image has no spec.

(suggestion, gap)

---

## Design Observations

**Concerns**

- **(in-scope)** The `reconcileContainerSpecs` and `processFinalizers` methods each `List` all resources per reconcile (noted in the code's own TODO at line 191-195 of `engine_reconciler.go`). With N container mirrors, every Container watch event triggers a reconcile that lists all N containers, all N images, and all N volumes — O(N²) total API calls per burst of events. This is acceptable for a dev workstation but will become a bottleneck for enterprise users with hundreds of containers. The acknowledged TODO to split into per-resource reconcilers with watch predicates is the right fix.

- **(future)** The `waitForDesiredState` function at `cmd/rdd/set.go:473` only waits for `running=true/false`. The TODO at line 469-472 acknowledges that `containerEngine.name` changes trigger a VM restart but aren't waited on. The proposed `Reconciled` condition design (already in MEMORY.md) would close this gap cleanly.

**Strengths**

- The dual SSA field manager pattern (`controllerName` for spec+status, `finalizerFieldOwner` for finalizer-only reassertions) at `sync_containers.go:138-148` is a well-thought-out solution to the SSA pruning problem. It prevents a finalizer apply from releasing spec ownership without requiring a separate controller.

- The daemon clock anchoring for Docker events (`dockerEventsSince` at `docker_watcher.go:99-113`) correctly avoids host/guest clock skew. The fallback to a biased host clock on failure is a pragmatic degradation.

- The `retry.RetryOnConflict` + re-Get pattern is applied consistently across all status writers (`setEngineCondition`, `AppReconciler`, `removeMirrorResource`, `deleteAllOfType`, all finalizer processors). This eliminates the 409-loop-through-requeue pattern that would otherwise occur with concurrent App status writers.

- The immutable-spec webhook change at `container_webhooks.go:38-43` is clean: copying the old spec and replacing only `State` before comparing is the right way to allow a single mutable field without enumerating all immutable ones.

---

## Testing Assessment

The BATS suite at `engine.bats` covers the critical paths well: image pull/untag/dangling flows, container lifecycle, volumes with RFC-1123-invalid names, bidirectional deletion, spec.state transitions, and shutdown cleanup. Untested scenarios ranked by risk:

1. **Watcher reconnect on Docker daemon restart** — the reconnect logic (watcher dies → reconciler restarts it) is only exercised if Docker actually disconnects during the test, which isn't triggered.
2. **Concurrent `rdd set` and engine reconciler** — the generation-gating logic in `waitForDesiredState` is never stressed with concurrent writes.
3. **`processImageFinalizers` with in-use image** — the test at "deleting an in-use Image keeps the finalizer" is good, but doesn't verify what happens when Docker's `ImageRemove` returns a non-NotFound error (e.g., permission denied).
4. **`ValidateUpdate` webhook on non-state spec change** — the webhook change allows `spec.state` transitions but the BATS suite only tests `rdd ctl patch`. There's no test confirming that changing a non-state field is rejected.

The unit test at `sync_containers_test.go` covers `parseContainerName` adequately. The discovery test at `discovery_test.go` covers the `InitDiscovery` → `MarkControlPlaneReady` lifecycle.

---

## Documentation Assessment

The design doc updates (`api_app.md`, `api_containers.md`, `cmd_app.md`, `discovery.md`) are thorough and accurately describe the new behavior. The terminology section in `api_containers.md` clarifying capitalization conventions is a valuable addition. The `ControllersFlagUsage` constant at `options.go:15-18` ensures the `--controllers` help text stays in sync between embedded and standalone binaries.

---

## Acknowledged Limitations

- **Per-reconcile List calls**: The code at `engine_reconciler.go:191-195` contains a TODO acknowledging that `reconcileContainerSpecs` and `processFinalizers` each list all resources per reconcile, and that the long-term fix is per-resource reconcilers with watch predicates.

- **No wait for non-`running` changes**: The TODO at `cmd/rdd/set.go:469-472` acknowledges that changes like `containerEngine.name` alone trigger a VM restart but `rdd set` doesn't wait. The proposed `Reconciled` condition is documented in MEMORY.md.

- **No containerd mirroring**: The code at `engine_reconciler.go:147-156` and the design doc explicitly state that containerd reports `NotApplicable` and takes no mirroring action.

- **Container spec.state is level-triggered**: Documented in MEMORY.md as a known design limitation that will be addressed with edge-triggered action annotations after containerd mirroring.

- **Docker socket not wired for Windows/WSL2**: The `init()` guard at `controller.go:23-28` and the `skip_on_windows` in `engine.bats:19` acknowledge this.

- **Orphan window for status-less Image mirrors**: Documented at `sync_images.go:248-252`.

---

## Coverage Summary

| File | Status |
|------|--------|
| `.github/workflows/bats.yaml` | Reviewed, no issues (adds `docker` to brew install) |
| `bats/tests/32-app-controllers/app.bats` | Reviewed, no issues (changes `"app,limavm"` to `"*"`) |
| `bats/tests/32-app-controllers/engine.bats` | **S1** |
| `bats/tests/34-containers-controllers/containers-mock.bats` | Reviewed, no issues |
| `cmd/rdd/limavm_shell.go` | Reviewed, no issues (SSH keepalive settings) |
| `cmd/rdd/service.go` | Reviewed, no issues |
| `cmd/rdd/service_paths.go` | Reviewed, no issues |
| `cmd/rdd/set.go` | **S2, S4** |
| `docs/design/api_app.md` | Reviewed, no issues |
| `docs/design/api_containers.md` | Reviewed, no issues |
| `docs/design/cmd_app.md` | Reviewed, no issues |
| `docs/design/discovery.md` | Reviewed, no issues |
| `go.mod` | Trivial |
| `go.sum` | Trivial |
| `pkg/apis/containers/v1alpha1/applyconfiguration/.../containerspec.go` | Reviewed, no issues |
| `pkg/apis/containers/v1alpha1/container_types.go` | Reviewed, no issues |
| `pkg/apis/containers/v1alpha1/volume_types.go` | Reviewed, no issues |
| `pkg/controllers/app/app/controllers/app_controller.go` | Reviewed, no issues |
| `pkg/controllers/app/app/lima-template.yaml` | Reviewed, no issues |
| `pkg/controllers/app/demo/controllers/demo_controller.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controller.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/docker_watcher.go` | **I4** |
| `pkg/controllers/app/engine/controllers/engine_reconciler.go` | **I3, I4** |
| `pkg/controllers/app/engine/controllers/sync_containers.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/sync_containers_test.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/sync_images.go` | **S5** |
| `pkg/controllers/app/engine/controllers/sync_volumes.go` | Reviewed, no issues |
| `pkg/controllers/containers/container/container_reconciler.go` | Reviewed, no issues |
| `pkg/controllers/containers/container/container_webhooks.go` | Reviewed, no issues |
| `pkg/controllers/containers/container/crd.yaml` | Reviewed, no issues |
| `pkg/controllers/containers/volume/crd.yaml` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | Reviewed, no issues |
| `pkg/external/main.go` | Reviewed, no issues |
| `pkg/instance/instance.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | Reviewed, no issues |
| `pkg/service/controllers/discovery.go` | Reviewed, no issues |
| `pkg/service/controllers/discovery_test.go` | Reviewed, no issues |
| `pkg/service/controllers/manager.go` | **I1, I2, S3** |
| `pkg/service/controllers/options.go` | Reviewed, no issues |
---

### [Codex] — Pass 1

### Executive Summary

This change adds an embedded engine controller that mirrors Docker state into the `containers.rancherdesktop.io` API and teaches `rdd set` to wait on post-write App conditions. The core mirroring logic is solid, but two integration gaps remain: `ContainerNamespace` mirrors are not self-healing after user deletion, and the new controller is only wired into embedded mode, not the external `containers-controller` binary.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **`ContainerNamespace` mirrors are not recreated after user deletion** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:560-567` (important, regression)

```go
return ctrl.NewControllerManagedBy(mgr).
    For(&appv1alpha1.App{}).
    Named("engine-reconciler").
    WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
    Watches(&containersv1alpha1.Container{}, enqueueApp).
    Watches(&containersv1alpha1.Image{}, enqueueApp).
    Watches(&containersv1alpha1.Volume{}, enqueueApp).
    Complete(r)
```

`syncContainerNamespace()` is only called during `fullSync()` at `pkg/controllers/app/engine/controllers/docker_watcher.go:436-438`, and the actual apply lives in `pkg/controllers/app/engine/controllers/sync_volumes.go:61-65`. Once steady state begins, the controller graph at `engine_reconciler.go:560-567` watches `Container`, `Image`, and `Volume`, but not `ContainerNamespace`, and the steady-state reconcile path never re-applies the namespace mirror. If a user deletes `containernamespace/moby`, no watched event fires and no later reconcile restores it, so the API stops reflecting the running engine until the watcher reconnects or the VM restarts.

Fix: make `ContainerNamespace` part of the steady-state reconciliation contract. The minimal fix is to watch `ContainerNamespace` and explicitly `Apply` `containernamespace/moby` during normal reconciles whenever `wantWatcher` is true.

```diff
-    // Map any Container/Image/Volume event to a reconcile of the App singleton.
+    // Map any mirror event to a reconcile of the App singleton.
     enqueueApp := handler.EnqueueRequestsFromMapFunc(
         func(_ context.Context, _ client.Object) []reconcile.Request {
             return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: appName}}}
         },
     )

     return ctrl.NewControllerManagedBy(mgr).
         For(&appv1alpha1.App{}).
         Named("engine-reconciler").
         WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
+        Watches(&containersv1alpha1.ContainerNamespace{}, enqueueApp).
         Watches(&containersv1alpha1.Container{}, enqueueApp).
         Watches(&containersv1alpha1.Image{}, enqueueApp).
         Watches(&containersv1alpha1.Volume{}, enqueueApp).
         Complete(r)
```

I2. **The new engine controller is embedded-only; external `containers-controller` never loads it** — `pkg/controllers/app/engine/controller.go:34-38` (important, regression)

```go
const (
    controllerName = "engine"
    // apiGroup is "containers" because the reconciler watches
    // Container, Image, and Volume from that group and needs
    // their CRDs at startup. Grouping engine with its dependencies
    // keeps --controllers selections from splitting the two apart.
    apiGroup = "containers"
)
```

The new controller declares itself part of the `containers` API group at `pkg/controllers/app/engine/controller.go:34-38`, so `external.RunControllers("containers")` is the binary that should host it. But `cmd/containers-controller/main.go:11-13` only blank-imports `pkg/controllers/containers`, while the new blank import was added only to embedded mode in `pkg/service/cmd/service.go:57-59`. The result is that `rdd svc serve` runs the engine controller, but `containers-controller` never registers it at all, so the feature cannot be exercised in the repo’s supported external-controller mode.

Fix: if `engine` remains in the `containers` group, import it from `cmd/containers-controller/main.go`. If the intent was for `app-controller` to own it, then change `GetAPIGroup()` accordingly and move the import there instead; the grouping and import graph need to agree.

```diff
 import (
     "os"

-    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
+    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
+    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
     "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
 )
```

#### Suggestions

S1. **The new BATS suite uses forbidden command substitution at file scope** — `bats/tests/32-app-controllers/engine.bats:12-14` (suggestion, regression)

```bash
NAMESPACE="rancher-desktop"
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST
```

The suite sets `DOCKER_HOST` via `$(rdd svc paths docker_socket)` at line 13. The repo’s BATS guidance explicitly asks tests to capture command output through `run -0 ...` so failures are surfaced with the real output; command substitution at file scope does the opposite and can turn setup problems into opaque shell-load failures.

Fix: resolve the socket path inside `local_setup_file()` with `run -0`, then export `DOCKER_HOST` from `$output`.

```diff
-NAMESPACE="rancher-desktop"
-DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
-export DOCKER_HOST
+NAMESPACE="rancher-desktop"

 local_setup_file() {
+    run -0 rdd svc paths docker_socket
+    DOCKER_HOST="unix://${output}"
+    export DOCKER_HOST
     # The Docker socket access pattern used by these tests is not yet wired
```

### Design Observations

**Concerns**

1. The engine controller currently splits responsibility between a one-shot `fullSync()` path and a steady-state reconcile path that only handles child resources. That split already produced I1: scaffolding resources like `ContainerNamespace` exist only if startup or reconnect happens to recreate them. A small in-scope refactor would be to add an explicit “ensure mirror scaffolding” step to steady-state reconcile; the longer-term direction is the one already hinted in the code, with per-resource reconcilers and narrow watch predicates. `(in-scope)`

2. Controller grouping is now encoded in multiple places: `GetAPIGroup()`, embedded blank imports, and external binary blank imports. I2 is the first divergence from that arrangement. A single registration package per API group, imported by both embedded and external entrypoints, would remove this entire class of drift. `(future)`

**Strengths**

1. Anchoring Docker event replay to `Info().SystemTime` instead of `time.Now()` is the right fix for host/guest clock skew and avoids a subtle class of lost-event bugs.

2. Using a dedicated SSA field owner for finalizer-only applies is a strong design choice; it avoids accidental spec ownership release and required-field validation failures.

### Testing Assessment

Targeted verification passed with `go test ./pkg/controllers/app/engine/... ./pkg/service/controllers ./pkg/controllers/containers/container/...` and `go test ./cmd/app-controller ./cmd/containers-controller ./pkg/external`. A full `go test ./cmd/...` run is still blocked by the pre-existing missing `pkg/guestagent/lima-guestagent.gz` asset in `cmd/rdd` and `cmd/lima-controller`.

1. Deleting `containernamespace/moby` while the watcher stays connected is not tested as a self-healing path. The new suite currently asserts persistent absence at `bats/tests/32-app-controllers/engine.bats:375-380`, which masks I1 instead of catching it.
2. No test covers external-controller mode for the new engine controller, so I2 ships unnoticed. A small smoke test around `containers-controller` registration/discovery would catch future import/group mismatches.
3. The BATS suite does not exercise failure reporting for `DOCKER_HOST` resolution because it bypasses `run -0` entirely (S1).

### Documentation Assessment

The design docs were updated thoroughly and mostly match the implementation. What is still missing is an explicit statement of which controller group owns `engine` in external mode and therefore which standalone binary is expected to host it; the code currently says `containers`, while the import graph still reflects “embedded only”.

### Acknowledged Limitations

- **Code comment**: “`TODO: changes that trigger a VM restart without changing "running" (for example, containerEngine.name alone) are still not waited on.`” — `cmd/rdd/set.go:469-472`. This limitation remains unchanged by the PR and is accurately documented.
- **Code comment**: “`The long-term fix is to split these into per-resource reconcilers with watch predicates`” — `pkg/controllers/app/engine/controllers/engine_reconciler.go:191-195`. The current O(N)-per-event reconcile is known; this PR makes it more relevant because the engine controller now owns more live-updating resources.
- **Code comment**: “`Windows lacks a Docker socket transport, so the reconciler would hot-loop on connect errors. Skip registration until WSL2 support lands.`” — `pkg/controllers/app/engine/controller.go:24-27`. This PR intentionally leaves Windows unsupported for engine mirroring.
- **PR description**: containerd mirroring is intentionally deferred; `ContainerEngineReady=True,Reason=NotApplicable` is the temporary contract for `containerEngine.name=containerd`. The implementation is consistent with that rationale.

### Coverage Summary

- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/tests/32-app-controllers/app.bats` — Reviewed, no issues
- `bats/tests/32-app-controllers/engine.bats` — I1, S1
- `bats/tests/34-containers-controllers/containers-mock.bats` — Reviewed, no issues
- `cmd/rdd/limavm_shell.go` — Reviewed, no issues
- `cmd/rdd/service.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `cmd/rdd/set.go` — Reviewed, no issues
- `docs/design/api_app.md` — Reviewed, no issues
- `docs/design/api_containers.md` — Reviewed, no issues
- `docs/design/cmd_app.md` — Reviewed, no issues
- `docs/design/discovery.md` — Reviewed, no issues
- `go.mod` — Reviewed, no issues
- `go.sum` — Trivial
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/container_types.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/volume_types.go` — Reviewed, no issues
- `pkg/controllers/app/app/controllers/app_controller.go` — Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml` — Reviewed, no issues
- `pkg/controllers/app/demo/controllers/demo_controller.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controller.go` — I2
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — I1
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I1
- `pkg/controllers/app/engine/controllers/sync_containers.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_containers_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_images.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_volumes.go` — I1
- `pkg/controllers/containers/container/container_reconciler.go` — Reviewed, no issues
- `pkg/controllers/containers/container/container_webhooks.go` — Reviewed, no issues
- `pkg/controllers/containers/container/crd.yaml` — Reviewed, no issues
- `pkg/controllers/containers/volume/crd.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/instance/instance.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/service/controllers/discovery.go` — Reviewed, no issues
- `pkg/service/controllers/discovery_test.go` — Reviewed, no issues
- `pkg/service/controllers/manager.go` — Reviewed, no issues
- `pkg/service/controllers/options.go` — Reviewed, no issues

---

### [Gemini] — Pass 1

### Executive Summary

This change introduces the `engine` controller to securely mirror Docker containers, images, and volumes into Kubernetes resources, establishing a foundation for GUI tools to manage container state via the K8s API. The implementation handles complex orchestration edge cases exceptionally well, including graceful reconciliation of status conditions under concurrent updates, cross-manager discovery coordination, and robust lifecycle management for background processes. I found no regressions or critical bugs; the design defensively guards against known framework pitfalls.

### Findings

- **Critical Issues**
  None.

- **Important Issues**
  None.

- **Suggestions**

S1. **Missing `ObservedGeneration` in Container status update** — `pkg/controllers/containers/container/container_reconciler.go:81-88` (suggestion, gap)

```go
func (r *reconciler) setCondition(ctx context.Context, container *containersv1alpha1.Container, conditionType string, status metav1.ConditionStatus, reason, message string) {
	changed := apimeta.SetStatusCondition(&container.Status.Conditions, metav1.Condition{
		Type:    conditionType,
		Status:  status,
		Reason:  reason,
		Message: message,
	})
```

The `setCondition` helper in the generic `Container` controller updates status conditions but does not capture the object's generation. While the CLI `wait` logic currently only targets the `App` singleton (where `EngineReconciler` and `AppReconciler` both correctly stamp `ObservedGeneration`), omitting it here leaves a subtle gap if future tooling relies on `Container` condition staleness checks.

Fix: Pass the container's generation into the condition.

```diff
 func (r *reconciler) setCondition(ctx context.Context, container *containersv1alpha1.Container, conditionType string, status metav1.ConditionStatus, reason, message string) {
 	changed := apimeta.SetStatusCondition(&container.Status.Conditions, metav1.Condition{
 		Type:               conditionType,
 		Status:             status,
 		Reason:             reason,
 		Message:            message,
+		ObservedGeneration: container.Generation,
 	})
```

### Design Observations

**Strengths**

- **Context scoping for background processes:** Discarding the `Reconcile` context to seed `dockerWatcher` with a manager-bound lifecycle context prevents immediate cancellation of the watcher upon reconcile completion. Hooking `watcherCancel` to `manager.RunnableFunc` ensures clean exit semantics without client leakage.
- **Strict SSA ownership boundaries:** Minting a distinct `engine-controller-finalizer` field manager for finalizer manipulation prevents the `Apply` from stripping `spec` fields. This elegantly solves a common issue with `ForceOwnership` dropping unmanaged fields when patching `ObjectMeta`.
- **Race-free discovery logic:** Moving the CLI wait predicate off the discovery ConfigMap existence and onto `rdd.rancherdesktop.io/ready=true` closes the gap where clients would attempt API requests before the controller manager had fully registered its CRDs.
- **Clock skew mitigation:** Biasing the `dockerEventsSince` host-clock fallback two minutes into the past pairs perfectly with the idempotent `fullSync` loop, neutralizing host-daemon clock drift without complex state tracking.

### Testing Assessment

The new BATS integration suite (`32-app-controllers/engine.bats`) thoroughly exercises bidirectional mirroring, tag pruning, container state manipulation, and timeout conditions. It correctly uses `run -0` combined with `assert_output` according to repository conventions.

### Documentation Assessment

Design documentation (`api_containers.md`, `cmd_app.md`) accurately captures the new behavior, states, finalizer handling, and CLI wait semantics.

### Coverage Summary

- `.github/workflows/bats.yaml` — Trivial
- `bats/tests/32-app-controllers/app.bats` — Trivial
- `bats/tests/32-app-controllers/engine.bats` — Reviewed, no issues
- `bats/tests/34-containers-controllers/containers-mock.bats` — Trivial
- `cmd/rdd/limavm_shell.go` — Reviewed, no issues
- `cmd/rdd/service.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `cmd/rdd/set.go` — Reviewed, no issues
- `docs/design/api_app.md` — Reviewed, no issues
- `docs/design/api_containers.md` — Reviewed, no issues
- `docs/design/cmd_app.md` — Reviewed, no issues
- `docs/design/discovery.md` — Reviewed, no issues
- `go.mod` — Trivial
- `go.sum` — Trivial
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/container_types.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/volume_types.go` — Reviewed, no issues
- `pkg/controllers/app/app/controllers/app_controller.go` — Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml` — Reviewed, no issues
- `pkg/controllers/app/demo/controllers/demo_controller.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controller.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_containers.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_containers_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_images.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_volumes.go` — Reviewed, no issues
- `pkg/controllers/containers/container/container_reconciler.go` — S1
- `pkg/controllers/containers/container/container_webhooks.go` — Reviewed, no issues
- `pkg/controllers/containers/container/crd.yaml` — Reviewed, no issues
- `pkg/controllers/containers/volume/crd.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/instance/instance.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/service/controllers/discovery.go` — Reviewed, no issues
- `pkg/service/controllers/discovery_test.go` — Reviewed, no issues
- `pkg/service/controllers/manager.go` — Reviewed, no issues
- `pkg/service/controllers/options.go` — Reviewed, no issues
