# Deep Review: 20260414-214710-pr-309

| | |
|---|---|
| **Date** | 2026-04-14 21:47 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 (of PR #309) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#309](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/309) — engine: mirror Docker container engine state into Kubernetes |
| **Commits** | `626c789` engine: mirror Docker container engine state into Kubernetes |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — one BATS style fix and a controller-grouping decision |
| **Wall-clock time** | `19 min 21 s` |

---

## Executive Summary

This PR adds the `engine` controller that mirrors Docker containers, images, and volumes into Kubernetes resources, plus supporting plumbing in `rdd set` (wait-for-desired-state with ObservedGeneration gating), a discovery `ready` annotation, and multi-writer safety on App status. The code is well-structured, thoroughly documented, and has strong integration test coverage. Findings are concentrated around one BATS style violation and a controller-grouping design question; the remaining items are minor robustness and ergonomics suggestions.

Structure: 0 critical, 2 important, 5 suggestions, plus design observations and a testing assessment.

---

## Critical Issues

None.

---

## Important Issues

I1. **BATS file-scope command substitution for `DOCKER_HOST`** — `bats/tests/32-app-controllers/engine.bats:13-14` [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

```bash
NAMESPACE="rancher-desktop"
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST
```

The repo's BATS style rules forbid `$(command ...)` substitutions: when the inner command fails, BATS produces opaque errors instead of the command's output. This assignment runs at file-load time, outside the `run`/`$output` machinery, so a failure there aborts the whole suite with no diagnostic. In practice `rdd svc paths docker_socket` is a pure path computation that does not call into the service, so it is unlikely to fail — but the pattern still violates the project's standing rule and sets a bad example for later tests.

Fix: move the assignment into `local_setup_file` and use the canonical pattern.

```diff
 NAMESPACE="rancher-desktop"
-DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
-export DOCKER_HOST

 local_setup_file() {
     skip_on_windows
+    run -0 rdd svc paths docker_socket
+    DOCKER_HOST="unix://${output}"
+    export DOCKER_HOST
     rdd svc delete
     rdd set running=true
 }
```

I2. **Engine controller is excluded from `--controllers=app`** — `pkg/controllers/app/engine/controller.go:25-43` [Codex GPT 5.4]

```go
const (
    controllerName = "engine"
    apiGroup       = "engine"
)
```

The embedded control plane's controller selector in `pkg/service/cmd/service.go:494-554` enables a controller only when the `--controllers` spec matches the controller's `GetAPIGroup()` or `GetName()`. Every other controller under `pkg/controllers/app/` (`app`, `demo`) reports `APIGroup = "app"`, but this new one uses `"engine"`. As a direct consequence `rdd svc start --controllers=app` does not bring up the engine controller: `ContainerEngineReady` is never written and no mirror resources appear. That is why `bats/tests/32-app-controllers/app.bats` had to widen its selector from `"app,limavm"` to `"*"` in this PR.

The `--controllers` help text (`pkg/service/controllers/options.go:32`) advertises `app` as an API group, so a user who narrows to it will hit this gap. This is a design choice, not a strict regression (the engine controller is new), but the current grouping makes selective starts silently incomplete. Codex additionally claims the standalone `cmd/app-controller` binary is affected — that part is incorrect: `external.RunControllers` at `pkg/external/main.go:62` registers every controller returned by `base.GetAllControllers()` without filtering by API group, so the standalone binary does pick up engine.

Fix options (in order of preference):
1. Change `apiGroup` to `"app"` so selective `--controllers=app` starts match the grouping convention used by `app` and `demo`.
2. If engine needs to stay separately selectable, document that users must pass `--controllers=app,engine` and update the `options.go` help text to list `engine` as a known group.
3. Add a dependency/alias mechanism on `base.Controller` so `--controllers=app` can pull engine in transitively.

---

## Suggestions

S1. **`waitForDesiredState` treats any non-"true" `running` value as "false"** — `cmd/rdd/set.go:465-492` [Claude Opus 4.6]

```go
if runningVal == "true" {
    // ...wait for ContainerEngineReady...
}
logrus.Info("Waiting for the VM to stop")
return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
    status, gen, present := conditionInfo(obj, "Running")
    // ...
})
```

Any value that is not `"true"` falls through to the "VM stop" wait path. The CRD enum at `coerceValue` (line 688) already constrains `running` to a boolean, so this is defensive territory rather than a reachable bug — but it contradicts the explicit intent in the surrounding code, and if `coerceValue` is ever bypassed or the schema widened the silent fall-through could mask user error.

Fix:

```diff
 if runningVal == "true" {
     // wait for ContainerEngineReady
     return watchCondition(...)
 }
+if runningVal != "false" {
+    return nil
+}
 logrus.Info("Waiting for the VM to stop")
```

S2. **`syncAllContainers` logs per-item Inspect errors but does not collect them** — `pkg/controllers/app/engine/controllers/sync_containers.go:54-60` [Claude Opus 4.6]

```go
var errs []error
for _, dc := range listResult.Items {
    dockerIDs[dc.ID] = true
    if err := w.syncContainer(ctx, dc.ID); err != nil {
        log.Error(err, "Skipping container during full sync", "id", dc.ID)
    }
}
```

Per-item Inspect failures are logged but never appended to `errs`, while the stale-mirror-removal loop below *does* append. The asymmetry is deliberate (the comment at lines 49-53 explains why startup should not fail on one bad Inspect), but a reader expects parallel structure. The same asymmetry exists in `syncAllImages` and `syncAllVolumes`. Consider a one-line comment at the log call in each helper stating "not collected — see block comment above" so future readers do not "fix" it.

S3. **`fullSync` continues after `syncContainerNamespace` failure** — `pkg/controllers/app/engine/controllers/docker_watcher.go:425-450` [Claude Opus 4.6]

```go
if err := w.ensureNamespace(ctx); err != nil {
    return fmt.Errorf("failed to ensure namespace: %w", err)
}
var errs []error
if err := w.syncContainerNamespace(ctx); err != nil {
    errs = append(errs, fmt.Errorf("failed to sync container namespace: %w", err))
}
if err := w.syncAllContainers(ctx); err != nil {
```

`ensureNamespace` returns early on failure (creating the K8s namespace is a prerequisite for everything that follows), but `syncContainerNamespace` failure is collected and execution continues. If writing the `ContainerNamespace/moby` object fails, the subsequent `syncAllContainers`/`syncAllImages`/`syncAllVolumes` calls will still run and may produce mirrors whose `status.namespace` points at an object that does not yet exist. Returning early on this failure (or collecting the error and skipping the dependent syncs) would produce cleaner failure semantics.

S4. **Custom `hasFinalizer`/`removeFinalizer` helpers duplicate `controllerutil`** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:532-553` [Claude Opus 4.6]

```go
func hasFinalizer(obj metav1.Object, finalizer string) bool {
    for _, f := range obj.GetFinalizers() {
        if f == finalizer {
            return true
        }
    }
    return false
}

func removeFinalizer(obj metav1.Object, finalizer string) bool {
    finalizers := obj.GetFinalizers()
    for i, f := range finalizers {
        if f == finalizer {
            obj.SetFinalizers(append(finalizers[:i], finalizers[i+1:]...))
            return true
        }
    }
    return false
}
```

`sigs.k8s.io/controller-runtime/pkg/controller/controllerutil` already provides `ContainsFinalizer` and `RemoveFinalizer` with identical semantics. The only friction is that `controllerutil.RemoveFinalizer` does not return a bool, but `len(finalizers) != len(obj.GetFinalizers())` is a trivial post-check. Using the upstream helpers would match the pattern used elsewhere in the codebase and reduce surface area to maintain.

S5. **`parseContainerName` produces non-"moby" status.namespace on the dockerd backend** — `pkg/controllers/app/engine/controllers/sync_containers.go:231-238` [Codex GPT 5.4]

```go
func parseContainerName(fullName string) (namespace, name string) {
    fullName = strings.TrimPrefix(fullName, "/")
    namespace, name, found := strings.Cut(fullName, "/")
    if !found {
        return containerNamespace, namespace
    }
    return namespace, name
}
```

The unit test at `sync_containers_test.go:25-29` explicitly expects `/k8s.io/magical_gates` to return `("k8s.io", "magical_gates")`. The engine controller currently only runs when the backend is `moby`, and `docs/design/api_containers.md:33-35` states unambiguously:

> When running `dockerd`, namespaces are not supported and we always use `moby` as the value for that label.

Standard Docker-CLI container names are single-segment, so in practice the `found` branch is dead code on dockerd — but if anything ever produces a multi-segment name, the resulting `status.namespace` would reference a `ContainerNamespace` object that this controller never creates (it only creates `ContainerNamespace/moby` in `syncContainerNamespace`). The unit test bakes the inconsistency in as intended behavior, which suggests the function is being shared with a future containerd implementation. Either: (a) hardcode `status.namespace = "moby"` in `applyContainer` for now and move the multi-segment parsing into the containerd code path when it lands, or (b) leave the parsing and drop the `k8s.io` test case so the function's dockerd behavior matches the design doc.

---

## Design Observations

### Strengths

- **Dual SSA field manager for finalizer-only applies** (in-scope) [Claude Opus 4.6, Gemini 3.1 Pro] — `applyContainer` uses `engine-controller` for the full mirror and `engine-controller-finalizer` for finalizer-only re-applies. Keeping the managers distinct prevents a finalizer-only apply from releasing spec ownership and pruning `spec.state`, which would fail CRD required-field validation. The rationale is documented inline at `sync_containers.go:137-148`.
- **Docker daemon clock for event `since` filter** (in-scope) [Claude Opus 4.6, Gemini 3.1 Pro] — `dockerEventsSince` queries `Info.SystemTime` so the `Since` filter is anchored on the daemon's own clock, sidestepping host/guest skew. The biased host-clock fallback is safe because `fullSync` is idempotent.
- **Stable SSA apply order** (in-scope) [Gemini 3.1 Pro, Claude Opus 4.6] — Ports, bindings, and RepoDigests are sorted before being built into apply configs. Atomic list SSA is position-sensitive, and sorting prevents spurious `resourceVersion` churn on every sync that would otherwise fire watch events for no-op changes.
- **Error-joining across mirror types** (in-scope) [Claude Opus 4.6] — `cleanupMirrorResources`, `processFinalizers`, and `reconcileContainerSpecs` use `errors.Join` and unconditional execution so one stuck resource type cannot starve pending work on sibling types. The comment at `engine_reconciler.go:186-189` explaining why `specsErr` must not early-return is particularly thoughtful.
- **`ObservedGeneration` flow for `rdd set` wait** (in-scope) [Claude Opus 4.6, Codex GPT 5.4] — `setEngineCondition` and the App reconciler's LimaVM condition mirror both stamp `latest.Generation`, and `waitForDesiredState` filters condition snapshots by `gen >= minGen`. This correctly rejects stale snapshots from a previous backend and is the only place that re-asserts `Connected` so its `observedGeneration` tracks stable-state reconciles.
- **Multi-writer safety on App status** (in-scope) [Gemini 3.1 Pro, Claude Opus 4.6] — Both `EngineReconciler.setEngineCondition` and `AppReconciler`'s condition-mirroring loop use `retry.RetryOnConflict` + re-`Get`. Controller-runtime serializes reconciles within a controller but not across controllers, so without the retry they would 409-loop through requeues.

### Concerns

- **No dependency model for controller selection** (in-scope) [Codex GPT 5.4] — `base.Controller` exposes only a free-form API group string. The `--controllers` help text in `pkg/service/controllers/options.go` advertises API groups as user-facing selectors, but there is nothing preventing a cross-group dependency (engine depends on App) from being broken by a selective `--controllers=app` start (see I2). A lightweight `GetDependencies() []string` on `base.Controller` would let the service expand user selections into coherent sets and prevent this class of breakage in the future.
- **O(N) List calls per child event** (future) [Claude Opus 4.6, Gemini 3.1 Pro] — `reconcileContainerSpecs` and `processFinalizers` each issue one List per mirror kind, and every Container/Image/Volume watch event enqueues an App reconcile (`engine_reconciler.go:516-529`). For N mirrors each child update is O(N) work. The TODO at `engine_reconciler.go:180-184` acknowledges this and proposes splitting into per-resource reconcilers with watch predicates; this is the right direction and should land before a machine with hundreds of containers becomes a realistic target.

---

## Testing Assessment

The new `bats/tests/32-app-controllers/engine.bats` suite (409 lines) exercises the end-to-end story thoroughly:

1. Image pull / untag / dangling-image promotion lifecycle, including the `reconcileImageByID` path for events that carry only the image ID.
2. Container create / stop / rm mirroring and bidirectional delete (K8s-side delete → Docker, Docker-side destroy → K8s).
3. Volume create / rm, including names with uppercase and underscores that must be hashed into RFC-1123 subdomains.
4. `spec.state` transitions across `running`, `paused`, `created`, and `unknown`, including the paused-plus-stop and paused-plus-unpause cases that cover the Docker state matrix at `docker_watcher.go:320-367`.
5. In-use-image finalizer retention (`processImageFinalizers` keeps the finalizer until Docker permits the removal).
6. Cleanup-on-VM-stop and containerd backend `NotApplicable` handling.
7. `rdd set` wait semantics: `--timeout=0` bypass, idempotent `running=false` on an already-stopped VM, restart-and-recreate-moby after `running=true`.

The `pkg/service/controllers/discovery_test.go` addition covers the full lifecycle of `InitDiscovery` → `MarkControlPlaneReady` → re-`InitDiscovery` clearing stale annotations. `sync_containers_test.go` unit-tests `parseContainerName` (though see S5 regarding the `k8s.io` cases).

Untested areas worth tracking (by decreasing risk):

1. **Watcher reconnect after a transient Docker disconnect.** The goroutine-death → `enqueueReconcile` → new `newDockerWatcher` → `fullSync` path is only exercised implicitly. A test that kills Docker mid-stream and verifies mirror consistency would validate the most complex lifecycle branch.
2. **Selective `--controllers` startup paths** (ties into I2). A test that runs `rdd svc start --controllers=app` and verifies `ContainerEngineReady` appears would have caught the grouping gap.
3. **`mapDockerContainerState` fall-through** for an unrecognised Docker status string. A trivial unit test would confirm the SSA-safe `unknown` fallback.

---

## Documentation Assessment

The design-doc updates are thorough and consistent with the implementation:

- `docs/design/api_containers.md` gains the terminology section, engine-mirroring overview, `spec.state` semantics, finalizer lifecycle, and the dockerd-always-`moby` namespace rule.
- `docs/design/api_app.md` adds `ContainerEngineReady` to the conditions table with the four reasons (`Connected`, `ConnectFailed`, `Stopped`, `NotApplicable`).
- `docs/design/cmd_app.md` documents the new `--timeout` flag and the `running=true`/`running=false` wait semantics.
- `docs/design/discovery.md` explains the `ready` annotation and the client-contract change.

The one inconsistency is the S5 tension between `api_containers.md`'s "dockerd always uses moby" statement and `parseContainerName`'s ability to return other namespaces on dockerd input. Update whichever side you choose to reconcile.

---

## Commit Structure

Reviewed, no issues. The PR is a single commit, which is appropriate for a feature whose parts (engine controller, `rdd set` wait, discovery readiness, `spec.state=unknown`, App multi-writer safety) are tightly coupled.

---

## Acknowledged Limitations

1. **Transient `handleEvent` failures are dropped pending a periodic full resync.** — `docker_watcher.go:178-184`

   > "image pull and volume create events fire once, so a dropped apply leaves the mirror missing until the next full resync. A periodic fullSync tick is deferred until the rest of the mirroring work lands."

   This change makes the limitation newly observable (image and volume events are newly mirrored), but the comment correctly describes it.

2. **`reconcileContainerSpecs` and `processFinalizers` are O(N) per reconcile.** — `engine_reconciler.go:180-184`

   > "The long-term fix is to split these into per-resource reconcilers with watch predicates."

   Acknowledged; see the Design Observations concern.

3. **Changes that trigger a VM restart without changing `running` are not waited on.** — `cmd/rdd/set.go:460-464`

   > "for example, containerEngine.name alone — a dedicated 'Reconciled' condition on App would let the CLI detect when the full reconcile chain has settled for any property change."

   Matches the existing TODO tracked in memory.

4. **Orphan window for `Image` mirrors with empty `status.id`.** — `sync_images.go:239-243`

   > "if applyImage wrote the metadata Apply but the status SubResource Apply failed transiently, the mirror has an empty status.id and this function will miss it. syncAllImages sweeps the orphan on the next watcher restart."

   Narrow and self-healing via full sync on reconnect.

5. **Windows/WSL2 Docker socket access is not wired up.** — `engine.bats:17-19`, PR summary.

   The new BATS suite is skipped on Windows until WSL2 socket plumbing lands.

6. **`containerEngine.name=containerd` takes no mirroring action.** — `engine_reconciler.go:137-146`

   Reports `ContainerEngineReady=True` with reason `NotApplicable` so `rdd set running=true containerEngine.name=containerd` finishes waiting. The condition will be renamed when containerd mirroring lands.

---

## Agent Performance Retro

### [Claude] — Claude Opus 4.6

Claude delivered the most thorough file-by-file walkthrough and provided solid verification of subtle concerns (the stale `NestedInt64` decoding path, the `alreadyClean` partial-cleanup self-healing, the `running=false` no-op generation flow). Those traces are genuinely useful context even when they did not produce findings. The downside of this approach is that the review document contains several "draft" findings Claude retracted in-place before writing the final version — the reader has to skip past the annotated dead-ends to find the actual conclusions. The surviving findings are all real but on the minor side; the one important item (BATS style) was also caught by the other two agents, and the `waitForDesiredState` fall-through was defensive territory that I downgraded to a suggestion.

### [Codex] — Codex GPT 5.4

Codex was the only agent to identify the `--controllers=app` grouping gap (I2) and the `parseContainerName`/design-doc tension (S5), and both are substantive. Calibration was mixed: Codex also filed the `apiNamespace = "rancher-desktop"` constant as important, which is a direct misread of the design doc (`rancher-desktop` is the documented K8s namespace for mirror resources, not a stand-in for `App.spec.namespace`), and its I1 framing of `cmd/app-controller` being affected was factually wrong because the standalone binary registers every controller via `base.GetAllControllers()`. The signal/noise ratio was still better than any single-issue report — two out of three important claims landed as real findings.

### [Gemini] — Gemini 3.1 Pro

Gemini produced a compact, high-precision review: one finding (the BATS style issue), a clean Coverage Summary with every file accounted for, and a tight set of design strengths. No false positives, no misreads, but also no unique findings: everything it raised was caught by at least one other agent. Gemini continues to skip `git blame` (daily quota), and that is acceptable given Claude and Codex cover regression attribution.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 7m 58s | 9m 51s | 6m 30s |
| Findings | 1I 4S | 2I 1S | 1I |
| Tool calls | 49 (Read 28, Bash 11, Grep 9) | 91 (shell 87, stdin 3, plan 1) | 15 (run_shell_command 9, grep_search 4, write_file 2) |
| Design observations | 6 | 3 | 4 |
| False positives | 0 | 1 | 0 |
| Unique insights | 4 | 2 | 0 |
| Files reviewed | 33 | 33 | 33 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **1I 4S** | **2I 1S** | **1I** |
| Downgraded | 1 (I→S) | 1 (I→S) | 0 |
| Dropped | 0 | 1 | 0 |


All three agents surfaced the same important issue (I1). Codex contributed both of the unique insights that survived consolidation, at the cost of one false positive and one factually-wrong framing. Claude's coverage was the broadest but its contributions clustered in minor/suggestion territory after the retracted drafts were stripped out. Gemini was the most reliable per-word but had no unique finds.

**Reconciliation.** Codex P1 `apiNamespace` constant (important → rejected): `rancher-desktop` is the documented K8s namespace for mirror resources per `docs/design/api_containers.md`, not a stand-in for `App.spec.namespace`. Codex P1 `parseContainerName` (important → S5 suggestion): the "wrong" branch is effectively unreachable for normal Docker container names and the unit test bakes in the current behavior, making this a latent design question rather than an active bug. Codex P1 BATS `$(...)` (suggestion → I1 important): promoted to match Claude and Gemini's rating, reflecting the repo's standing rule against BATS command substitution. Claude P1 `waitForDesiredState` fall-through (important → S1 suggestion): unreachable given CRD enum validation in `coerceValue`, kept as a defensive suggestion.

---

## Review Process Notes

### Repo context updates

- **[repo] Controller `--controllers` selector and `base.Controller.GetAPIGroup()`** — When reviewing any new controller under `pkg/controllers/`, check that its `GetAPIGroup()` value matches the grouping convention used by sibling controllers in the same directory and that the `--controllers` help text in `pkg/service/controllers/options.go` accurately lists known groups. Cross-group dependencies (e.g., engine depending on App lifecycle) need explicit documentation or a dependency model; otherwise selective starts like `--controllers=app` silently drop dependent controllers and the only symptom is missing conditions on downstream resources. *Why this belongs in repo context:* the selector lives in one file but the grouping decisions are spread across every controller, so reviewers cannot catch the mismatch by reading the PR in isolation.

- **[repo] `docs/design/api_containers.md` is the authoritative namespace contract** — When reviewing code in `pkg/controllers/app/engine/` or in the `containers` API group, cross-reference every hardcoded Kubernetes namespace (`rancher-desktop`, `moby`, etc.) against `docs/design/api_containers.md`. The design doc explicitly documents `metadata.namespace: rancher-desktop` as the target K8s namespace for mirror resources and `status.namespace: moby` (dockerd) / `status.namespace: k8s.io` (containerd) as the container-engine namespace label. Reviewers who do not read the doc frequently misread the `apiNamespace` constant as a missing-`App.spec.namespace`-lookup bug. *Why this belongs in repo context:* the naming convention is non-obvious and the first instinct is to assume the constant is a bug.

---

## Appendix: Original Reviews

### Claude Opus 4.6

I'll start by reading the key changed files to build context and verify line numbers for findings.

Now let me read the remaining key files in parallel.

Now let me read the remaining changed files I need for complete coverage.

Let me now read the remaining files I need for complete coverage.

Now let me check a few more files and run git blame on key areas.

Now let me check if there's a removed `UnregisterControllerManager` call in the manager Start method.

The removal of `UnregisterControllerManager` is notable. Let me check where stale discovery cleanup now happens.

Good — it's still called in manager shutdown at line 211. Let me verify the stale-entry cleanup path is still intact.

Good. Now let me check the `rdd set` wait for `running=false` more carefully. The `waitForDesiredState` checks `Running` condition — let me check that the AppReconciler stamps `ObservedGeneration` correctly.

Good. `ObservedGeneration: latest.Generation` is set on Running condition at line 322. Now let me check the `running=false` wait path more carefully — particularly whether a no-op `running=false` (already stopped) bumps generation.

Let me look at the `running=false` no-op case in the BATS test and the patchApp generation logic.

Now let me think about the `running=false` no-op scenario. When a no-op patch is sent (`running=false` on already-stopped VM), the API server does not bump `.metadata.generation` because the spec did not change. So `targetGen` from `patchApp` equals the existing generation. The wait predicate at line 490 requires `gen >= minGen && status != metav1.ConditionTrue`. Since the condition was already `Running=False` with `observedGeneration >= currentGen`, this should pass immediately. This is correct behavior — the BATS test at line 377 confirms this.

Let me now check the `alreadyClean` optimization in the reconciler for a potential issue.

I see a potential issue. Let me check: at line 148, `alreadyClean` uses the condition from `app.Status.Conditions` — the initial Get at line 86. But `setEngineCondition` at line 155 does a re-Get inside its retry loop. The `alreadyClean` check uses possibly-stale data, but that's fine — it's just an optimization: the worst case is an unnecessary cleanup pass, and the condition is correctly set regardless.

Now let me check the `UnregisterControllerManager` removal. The old code called `scm.discovery.UnregisterControllerManager(ctx)` at the start of `Start()` to clean stale entries. The new code removes that call. Let me verify that `InitDiscovery` already handles this case.

Good — `InitDiscovery` at line 398 deletes the entire ConfigMap, which is more thorough than `UnregisterControllerManager` (which only removes one key). So the removal is correct — the old call was redundant.

Let me now check the BATS test file for the `$()` substitution anti-pattern mentioned in the review instructions.

Line 13: `DOCKER_HOST="unix://$(rdd svc paths docker_socket)"` — this uses `$()` substitution at file scope (outside a test), which is the anti-pattern noted in the review instructions. However, this is at file scope, not inside a `run` call, and `rdd svc paths` is expected to work before tests run. Let me check if this is truly problematic — file-scope assignments run during `load`, and failure here would produce an opaque error.

Now let me check the container_types.go for the `ContainerStatusUnknown` value.

Good. Now let me look at the `engine` controller's `GetCRDData()` returning empty string.

Let me check how `GetCRDData` and `GetAPIGroup` are used — specifically whether `apiGroup = "engine"` causes any issues. The engine controller uses CRDs from the `containers` API group but registers with `apiGroup = "engine"`.

Good — `GetCRDData() == ""` causes `installControllerCRDs` to skip the controller. The engine controller uses CRDs from the `containers` package (installed by the containers controller), so returning empty is correct.

Now let me check the `DOCKER_HOST` line in the BATS file more carefully.

Line 13 runs `$(rdd svc paths docker_socket)` at file-scope load time, before `local_setup_file` runs `rdd set running=true`. The `rdd svc paths docker_socket` command computes a path from `instance.DockerSocket()` which is a pure path computation — it doesn't need the service running. Let me verify:

Good — `rdd svc paths docker_socket` is a pure path computation that doesn't need the service running, so the file-scope `$()` is safe. But it's still an anti-pattern per the BATS style rules. Let me check if this pattern is used elsewhere.

Let me now check the `observedGeneration` stamping on `conditionInfo` — specifically, whether `unstructured.NestedInt64` correctly extracts `observedGeneration` from the unstructured JSON.

`unstructured.NestedInt64` at line 561 extracts from a `map[string]any`. JSON numbers from Kubernetes come as `float64` in Go's JSON decoding. `unstructured.NestedInt64` handles `int64` type but may fail with `float64` values from JSON. Let me verify this is handled.

This is a real issue. `NestedInt64` does a strict `int64` type assertion. JSON numbers from the API server are decoded as `float64` by `encoding/json`, not `int64`. So when `conditionInfo` receives unstructured data from the watch/list, `observedGeneration` will be a `float64`, and `NestedInt64` will return `(0, false, error)`. Since line 561 discards the error with `_, _ =`, `gen` will always be `0`, and the `gen >= minGen` check at lines 478 and 490 will only pass when `minGen` is 0 or 1.

Wait — but `watchtools.UntilWithSync` uses unstructured.Unstructured objects decoded from the API server. Let me check how the dynamic client decodes JSON. The k8s dynamic client uses `runtime.UnstructuredJSONScheme` which may handle this differently.

So the Kubernetes JSON decoder *does* convert `json.Number` to `int64` when possible. Since `observedGeneration` is always an integer, `NestedInt64` will work correctly. The value in the unstructured map will be `int64`, not `float64`. This is not a bug.

Now let me check a few more patterns and verify the BATS test for `$()` usage inside test functions.

Good — all container ID captures use `cid=${output}` after `run -0`, following the correct BATS pattern. Let me verify a sample to make sure.

Good BATS patterns. Now let me check `app.bats` for the change.

The `app.bats` change from `"app,limavm"` to `"*"` enables all controllers. Let me check if this causes any issues with the engine controller trying to connect to Docker during app tests.

The `app.bats` now uses `"*"` which will include the engine controller. When app tests create an App with `running=false`, the engine controller will just set `ContainerEngineReady=Stopped`, which is harmless. When tests set `running=true`, the engine will try to connect to Docker — but these tests likely don't start a real VM. Let me check what `app.bats` actually does.

The `app.bats` never sets `running=true`, so the engine controller sees `Running=False` and sets `ContainerEngineReady=Stopped`. This is harmless. The change to `"*"` is necessary because the engine controller needs to be registered so its condition can be tested.

Now let me check the `context.Background()` concern. The `newDockerWatcher` at docker_watcher.go:44 accepts a `ctx` from the caller and derives `watchCtx` from it, which is good.

Let me check one more thing — the `deleteImage` function and whether it correctly handles in-use images.

The `deleteImage` at line 406 uses `ImageRemoveOptions{}` — no `Force` flag. When an image is in use by a container, Docker returns a conflict error (not NotFound), which won't be caught at line 407, so the error propagates and keeps the finalizer. This is the correct behavior (the BATS test at line 198 verifies this).

Now let me look at whether there's a `context.Background()` in any of the new code.

Good. Let me now verify the `lima-template.yaml` change.

Good — `HOST_DOCKER_SOCKET` replaces `HOST_SHORT_DIR` + `/docker.sock`. The template parameter at line 30 is already quoted. The `applySpecToTemplate` at app_controller.go line 58 uses `%q` for the value: `fmt.Sprintf("  HOST_DOCKER_SOCKET: %q", instance.DockerSocket())`.

Now I have enough context to write the review. Let me do a final check on the discovery test.

Good test coverage for the discovery ready annotation lifecycle. Now let me check the `waitForDesiredState` for a subtle case: `rdd set running=true containerEngine.name=containerd`. The `properties` map will have both `running=true` and `containerEngine.name=containerd`. The code at line 466 checks `properties["running"]`. With `running=true`, it waits for `ContainerEngineReady=True`. The engine reconciler will set `ContainerEngineReady=True` with reason `NotApplicable` for containerd. The wait predicate at line 478 checks `status == metav1.ConditionTrue` which matches `NotApplicable`. This is correct by design.

Now let me look at the `running=false` wait path for a potential generation mismatch. If `rdd set running=false` is sent when the VM is already stopped and `Running` condition has `observedGeneration` less than `minGen` (because the no-op patch didn't bump generation), will the wait hang?

Wait — a no-op merge patch (where the spec doesn't change) does NOT bump `metadata.generation`. The API server is smart enough to detect a no-op. So `targetGen` would equal the existing `metadata.generation`. The `Running` condition's `observedGeneration` should already be >= this value since the condition was last updated from the same (unchanged) generation. So the wait should pass immediately.

But there's a subtle edge case: what if the user sends `rdd set running=false` and the VM is currently starting (Running=False/Starting)? The patch changes `running=false`, bumps generation. The condition `Running` has `observedGeneration` from the previous generation and `status=False`. The wait predicate at line 490 requires `gen >= minGen && status != ConditionTrue`. The condition's gen is from the old generation, so `gen >= minGen` fails. The reconciler needs to update the condition with the new generation. This works because the patch triggers a reconcile, the AppReconciler mirrors the LimaVM condition with `ObservedGeneration: latest.Generation` (line 322 of app_controller.go), and the wait predicate succeeds once this update lands. This is correct.

I have thoroughly reviewed all files. Let me write the review now.

---

### Executive Summary

This PR adds a well-structured engine controller that mirrors Docker state into Kubernetes, with supporting changes to `rdd set` (wait-for-state), discovery readiness, and multi-writer safety on App status. The code is clean, thoughtfully documented, and thoroughly tested. Key findings are around test style and a race window during cleanup.

---

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Cleanup-then-condition race can leave stale mirrors** — `engine_reconciler.go:149-155`

```go
if !alreadyClean {
    if err := r.cleanupMirrorResources(ctx); err != nil {
        log.Error(err, "Failed to clean up mirror resources")
        return ctrl.Result{}, err
    }
}
return ctrl.Result{}, r.setEngineCondition(ctx, &app, terminalStatus, terminalReason, terminalMessage)
```

If `cleanupMirrorResources` succeeds at line 150 but `setEngineCondition` fails at line 155, the reconcile returns an error and is requeued. On the next reconcile, `alreadyClean` at line 148 is `false` (condition is stale) so cleanup runs again — this is correct and self-healing. However, there's a subtler gap: if `cleanupMirrorResources` partially succeeds (e.g., deletes Containers but fails on Volumes), the error at line 151 returns early *without* updating the condition. On the next requeue, `alreadyClean` is false, so cleanup reruns from scratch. But `cleanupMirrorResources` at line 268 calls `deleteAllOfType` for `ContainerList` first — the containers are already gone, so this is a no-op — and then proceeds to Volumes, which retry. This is correct behavior: the function is idempotent and all errors are joined. No bug here; the partial-success path self-heals.

After closer tracing, this is not a real issue — the design handles all partial-failure cases through idempotent cleanup + retry. Downgrading to no finding.

*(Removed — self-healing.)*

I1. **`reconcileContainerSpecs` re-dispatches start/stop on every reconcile for active spec.state** — `engine_reconciler.go:337-360` + `docker_watcher.go:337-367`

```go
for i := range containers.Items {
    c := &containers.Items[i]
    if c.DeletionTimestamp != nil {
        continue
    }
    if err := w.reconcileContainerState(ctx, c); err != nil {
        errs = append(errs, fmt.Errorf("container %s: %w", c.Name, err))
    }
}
```

Every Container/Image/Volume watch event triggers a reconcile of the App (via `enqueueApp` at line 516), which runs `reconcileContainerSpecs`. For a container with `spec.state=running` and `status.status=running`, `reconcileContainerState` returns `nil` at line 349 — no Docker API call. For `spec.state=created` and `status.status=exited`, `isStopped()` returns `true` at line 360 — also no call. So the hot path is just a List + iterate with field comparisons. This is already noted in the TODO at line 180-184 as a known O(N) cost, not a bug. (important, gap)

*(Acknowledged by TODO — moving to Acknowledged Limitations.)*

I1. **BATS file-scope `$(rdd svc paths docker_socket)` substitution** — `engine.bats:13`

```bash
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST
```

The BATS style rules for this repo say to use `run -0 <command>` then `$output` instead of `$(command ...)` substitution. If `rdd svc paths docker_socket` fails at file load time, BATS produces an opaque error instead of showing what went wrong.

In practice, `rdd svc paths docker_socket` is a pure path computation that cannot fail, and this runs at file scope before any tests. The risk is low, but it violates the repo's established test pattern. (important, regression)

Fix: Move the `DOCKER_HOST` computation into `local_setup_file`, where it runs after test infrastructure is set up and failures are more diagnosable:

```diff
 NAMESPACE="rancher-desktop"
-DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
-export DOCKER_HOST

 local_setup_file() {
     skip_on_windows
+    run -0 rdd svc paths docker_socket
+    DOCKER_HOST="unix://${output}"
+    export DOCKER_HOST
     rdd svc delete
     rdd set running=true
 }
```

I2. **`waitForDesiredState` treats any non-"true" `running` value as `running=false`** — `set.go:466-491`

```go
runningVal, ok := properties["running"]
if !ok {
    return nil
}
// ...
if runningVal == "true" {
    // ...wait for ContainerEngineReady...
}
logrus.Info("Waiting for the VM to stop")
```

If the user passes `rdd set running=invalid`, the value validation happens earlier in `coerceValue` (line 665), which checks the CRD enum. But if the CRD schema were ever extended with a third `running` value (unlikely but possible), or if `coerceValue` is bypassed, the function falls through to the `running=false` path. This is a minor robustness gap — the `else` branch should explicitly check for `"false"` and return `nil` for any other value, matching the TODO at line 460-464 about incomplete wait semantics. (important, gap)

Fix:

```diff
 if runningVal == "true" {
     logrus.Info("Waiting for container engine to be ready")
     return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
         status, gen, present := conditionInfo(obj, "ContainerEngineReady")
         return present && gen >= minGen && status == metav1.ConditionTrue
     })
 }
+if runningVal != "false" {
+    return nil
+}
 logrus.Info("Waiting for the VM to stop")
```

#### Suggestions

S1. **Image mirror name length exceeds practical K8s name limits** — `sync_images.go:38-48`

```go
func imageMirrorNames(id string, repoTags []string) []string {
    if len(repoTags) == 0 {
        return []string{fmt.Sprintf("img-%x", sha256.Sum256([]byte(id)))}
    }
```

`sha256.Sum256` produces a 64-character hex digest, making each name `"img-"` + 64 = 68 characters. Volume names use the same scheme (`"vol-"` + 64 = 68). K8s allows up to 253 characters for names, so this is well within limits. The `"img-"` and `"vol-"` prefixes make the names recognizable in `kubectl` output. No change needed — just confirming the design is sound. (suggestion, enhancement)

*(Withdrawn — not an issue.)*

S1. **`syncAllContainers` logs per-item Inspect errors but doesn't collect them into `errs`** — `sync_containers.go:54-59`

```go
var errs []error
for _, dc := range listResult.Items {
    dockerIDs[dc.ID] = true
    if err := w.syncContainer(ctx, dc.ID); err != nil {
        log.Error(err, "Skipping container during full sync", "id", dc.ID)
    }
}
```

Per-item Inspect failures are logged (line 58) but not appended to `errs`. This means `syncAllContainers` returns `nil` even when some containers failed to sync, which is correct behavior for the stated goal of not blocking startup. But it's inconsistent with stale-mirror removal at lines 71-73 which *does* append to `errs`. The asymmetry is deliberate (per the comment at lines 49-53), but a reader might expect parallel structure. Consider adding a brief inline comment at line 58. (suggestion, enhancement)

S2. **`fullSync` continues after `ensureNamespace` or `syncContainerNamespace` failure** — `docker_watcher.go:425-450`

```go
func (w *dockerWatcher) fullSync(ctx context.Context) error {
    if err := w.ensureNamespace(ctx); err != nil {
        return fmt.Errorf("failed to ensure namespace: %w", err)
    }
    var errs []error
    if err := w.syncContainerNamespace(ctx); err != nil {
        errs = append(errs, ...)
    }
    if err := w.syncAllContainers(ctx); err != nil {
```

`ensureNamespace` correctly returns early on failure (line 430), but `syncContainerNamespace` failure at line 435 is appended to `errs` and execution continues to `syncAllContainers`. If the K8s namespace doesn't exist, all the `syncAll*` calls will fail too, but they produce separate errors that are all joined — not harmful, just noisy. Consider returning early on `syncContainerNamespace` failure since it's a prerequisite. (suggestion, enhancement)

S3. **Redundant duplicate finalizer helpers vs controller-runtime's `controllerutil`** — `engine_reconciler.go:532-553`

```go
func hasFinalizer(obj metav1.Object, finalizer string) bool {
    for _, f := range obj.GetFinalizers() { ... }
}
func removeFinalizer(obj metav1.Object, finalizer string) bool {
    finalizers := obj.GetFinalizers()
    for i, f := range finalizers { ... }
}
```

`sigs.k8s.io/controller-runtime/pkg/controller/controllerutil` provides `ContainsFinalizer` and `RemoveFinalizer` with the same semantics. Using the upstream helpers would reduce code to maintain and align with the LimaVM controller's patterns. The custom `removeFinalizer` returns a `bool` (changed or not) which `controllerutil.RemoveFinalizer` does not, so the refactor isn't entirely free — but comparing `len(finalizers)` before and after is trivial. (suggestion, enhancement)

### Design Observations

**Strengths**

- The dual-field-manager SSA pattern (`engine-controller` for spec, `engine-controller-finalizer` for finalizers) is a clean solution to the "finalizer-only apply would prune spec" problem. The comment at `sync_containers.go:137-142` explains the rationale clearly.

- The Docker clock skew handling at `docker_watcher.go:71-83` (`dockerEventsSince` + biased fallback) is careful: using the daemon's own `SystemTime` for the `Since` filter avoids the host/guest skew problem that the review prompt specifically calls out.

- Error-joining throughout (e.g., `cleanupMirrorResources`, `processFinalizers`, `reconcileContainerSpecs`) ensures one stuck resource type doesn't starve others. The comment at `engine_reconciler.go:186-189` about not early-returning on `specsErr` is especially thoughtful.

- The `observedGeneration` flow between `rdd set`, `EngineReconciler.setEngineCondition`, and `AppReconciler` is correct and well-documented. The wait predicate at `set.go:478` correctly filters stale snapshots from a prior backend.

- The `applyContainer` function's Get-then-Apply-without-ForceOwnership pattern at `sync_containers.go:119-127` correctly preserves user-initiated `spec.state` changes.

**Concerns**

- **(future)** The reconciler lists all Containers, Images, and Volumes on every reconcile (three List calls for `reconcileContainerSpecs` + three for `processFinalizers`). With the Watches on these types at `engine_reconciler.go:526-528`, every mirror status update fires a reconcile, which fires six List calls. On a machine with hundreds of containers this is O(N) work per event. The TODO at line 180-184 acknowledges this; splitting into per-resource reconcilers with watch predicates (deletionTimestamp or spec.state changes only) is the right direction.

### Testing Assessment

Test coverage is strong. The `engine.bats` suite covers:

1. Image pull/untag/dangling lifecycle
2. Container create/stop/rm mirroring
3. Volume create/rm with RFC-1123-invalid names
4. Bidirectional delete (K8s-side delete → Docker, Docker-side delete → K8s)
5. `spec.state` transitions (running→created, created→running, paused→created, paused→running)
6. In-use image finalizer retention
7. Cleanup on VM stop
8. Containerd backend `NotApplicable` condition
9. `rdd set` wait semantics (`--timeout=0`, already-stopped idempotency)

The `discovery_test.go` addition covers the full lifecycle of `InitDiscovery` → `MarkControlPlaneReady` → re-`InitDiscovery` clearing stale annotations.

Untested scenarios (by risk):

1. **Watcher reconnect after transient Docker disconnect** — the goroutine death → `enqueueReconcile` → new `newDockerWatcher` → `fullSync` path is only tested implicitly through the full integration test. A test that kills the Docker daemon mid-stream and verifies reconnection + mirror consistency would validate the most complex lifecycle path.
2. **Concurrent `rdd set` + engine reconciler writing `ContainerEngineReady`** — the `retry.RetryOnConflict` in both writers is correct but untested at the unit level.
3. **`mapDockerContainerState` with unrecognized Docker status strings** — a unit test for the unknown/fallback case would confirm the SSA validation safety net.

### Documentation Assessment

Documentation is thorough and well-structured:

- `api_containers.md` gains the terminology section, engine mirroring overview, finalizer lifecycle, and `spec.state` semantics.
- `api_app.md` adds `ContainerEngineReady` to the conditions table with clear descriptions.
- `cmd_app.md` documents `--timeout` and the wait semantics with examples.
- `discovery.md` explains the ready annotation lifecycle.

All design doc updates are consistent with the implementation.

### Acknowledged Limitations

1. **O(N) List calls per reconcile** — TODO at `engine_reconciler.go:180-184`:
   > "Both passes issue List calls per reconcile, and every Container/Image/Volume watch event triggers a reconcile — so cost is O(N) per child event. The long-term fix is to split these into per-resource reconcilers with watch predicates."
   
   The change makes this moderately worse by adding three resource types that trigger reconciles, but the fundamental issue predates this PR.

2. **No containerd mirroring** — `engine_reconciler.go:137-142` and `api_containers.md`:
   > "Engine mirroring is only supported with the moby backend"
   
   `NotApplicable` with `Status=True` is a pragmatic workaround that lets `rdd set` finish waiting. Will be addressed when containerd mirroring lands.

3. **Changes that trigger VM restart without changing "running" are not waited on** — TODO at `set.go:460-464`:
   > "changes that trigger a VM restart without changing 'running' (for example, containerEngine.name alone) are still not waited on."
   
   The proposed `Reconciled` condition is noted for future work.

4. **Orphan window for Image mirrors with empty `status.id`** — comment at `sync_images.go:239-243`:
   > "if applyImage wrote the metadata Apply but the status SubResource Apply failed transiently, the mirror has an empty status.id and this function will miss it. syncAllImages sweeps the orphan on the next watcher restart."
   
   Narrow window, self-healing.

5. **Dropped single-fire events** — comment at `docker_watcher.go:178-184`:
   > "image pull and volume create events fire once, so a dropped apply leaves the mirror missing until the next full resync."
   
   Acknowledged as deferring periodic fullSync to future work.

### Coverage Summary

| File | Status |
|------|--------|
| `.github/workflows/bats.yaml` | Reviewed, no issues — adds `docker` to `brew install` for CI |
| `bats/tests/32-app-controllers/app.bats` | Reviewed, no issues — changes `"app,limavm"` to `"*"` |
| `bats/tests/32-app-controllers/engine.bats` | **I1** |
| `cmd/app-controller/main.go` | Trivial — adds engine import |
| `cmd/rdd/service.go` | Reviewed, no issues |
| `cmd/rdd/service_paths.go` | Reviewed, no issues |
| `cmd/rdd/set.go` | **I2** |
| `docs/design/api_app.md` | Reviewed, no issues |
| `docs/design/api_containers.md` | Reviewed, no issues |
| `docs/design/cmd_app.md` | Reviewed, no issues |
| `docs/design/discovery.md` | Reviewed, no issues |
| `go.mod` | Trivial |
| `go.sum` | Trivial |
| `pkg/apis/containers/v1alpha1/applyconfiguration/.../containerspec.go` | Reviewed, no issues |
| `pkg/apis/containers/v1alpha1/container_types.go` | Reviewed, no issues |
| `pkg/apis/containers/v1alpha1/volume_types.go` | Reviewed, no issues |
| `pkg/controllers/app/app/controllers/app_controller.go` | Reviewed, no issues |
| `pkg/controllers/app/app/lima-template.yaml` | Reviewed, no issues |
| `pkg/controllers/app/engine/controller.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/docker_watcher.go` | **S2** |
| `pkg/controllers/app/engine/controllers/engine_reconciler.go` | **S3** |
| `pkg/controllers/app/engine/controllers/sync_containers.go` | **S1** |
| `pkg/controllers/app/engine/controllers/sync_containers_test.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/sync_images.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/sync_volumes.go` | Reviewed, no issues |
| `pkg/controllers/containers/container/container_webhooks.go` | Reviewed, no issues |
| `pkg/controllers/containers/container/crd.yaml` | Reviewed, no issues |
| `pkg/controllers/containers/volume/crd.yaml` | Reviewed, no issues |
| `pkg/instance/instance.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | Reviewed, no issues |
| `pkg/service/controllers/discovery.go` | Reviewed, no issues |
| `pkg/service/controllers/discovery_test.go` | Reviewed, no issues |
| `pkg/service/controllers/manager.go` | Reviewed, no issues |

---

### Codex GPT 5.4

### Executive Summary

This change adds a new Docker-backed engine controller, App readiness/wait plumbing, and discovery readiness gating so `rdd set` can block on engine or VM state. The implementation is close, but I found three merge-blocking regressions: the new controller is not selectable through the existing `app` controller set, it ignores `App.spec.namespace` and always writes mirrors into `rancher-desktop`, and it emits `Container.status.namespace` values that do not correspond to any `ContainerNamespace` object on the dockerd backend.

### Findings

**Critical Issues**

None.

**Important Issues**

I1. **Engine controller is excluded from `--controllers=app` and the app-controller group** — `pkg/controllers/app/engine/controller.go:25-43` (important, regression)

```go
const (
	controllerName = "engine"
	apiGroup       = "engine"
)

func (c *controller) GetAPIGroup() string {
	return apiGroup
}
```

`GetAPIGroup()` at `pkg/controllers/app/engine/controller.go:42` now reports `"engine"`, but the control-plane selector in `pkg/service/cmd/service.go:531-538` only enables a controller when the user’s `--controllers` entry matches `GetAPIGroup()` or the controller name. That means existing targeted starts such as `--controllers=app` no longer bring up the new engine controller even though it is shipped in `cmd/app-controller/main.go:12-15`. In practice this regresses selective/dev setups: `ContainerEngineReady` is never written, mirrors never appear, and the updated app BATS had to widen its fixture from the old app-only setup to `*`. Fix: make the engine controller report the existing `app` API group, or add explicit controller-group alias/dependency logic so `app` pulls in `engine` (and its required containers CRDs) automatically.

I2. **Mirror resources are hardcoded to `rancher-desktop` instead of following `App.spec.namespace`** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:39-40` (important, regression)

```go
// apiNamespace is the Kubernetes namespace for mirror resources.
apiNamespace = "rancher-desktop"
```

The engine controller hardcodes `apiNamespace` at `pkg/controllers/app/engine/controllers/engine_reconciler.go:39-40`, and every mirror create/list/delete path uses that constant (for example `pkg/controllers/app/engine/controllers/sync_volumes.go:61-65` and `pkg/controllers/app/engine/controllers/engine_reconciler.go:285-289`). That conflicts with the App API contract at `pkg/apis/app/v1alpha1/app_types.go:36-40`, which says `spec.namespace` is where the App’s namespaced resources live, and with the App reconciler itself, which consistently uses `app.GetResourceNamespace()` at `pkg/controllers/app/app/controllers/app_controller.go:102` and `:157`. This is observable on the default `rdd set running=true` path too: `cmd/rdd/set.go:344-349` creates an App without a namespace field, so the CRD default at `pkg/controllers/app/app/crd.yaml:80-85` resolves it to `default`, yet the engine still writes mirrors into `rancher-desktop`. Fix: derive the mirror namespace from the current App object, pass it into the watcher/sync helpers, and stop using a package-level constant for per-App state.

I3. **Dockerd containers can reference nonexistent `ContainerNamespace` objects** — `pkg/controllers/app/engine/controllers/sync_containers.go:107,231-237` (important, regression)

```go
namespace, name := parseContainerName(inspect.Name)

func parseContainerName(fullName string) (namespace, name string) {
	fullName = strings.TrimPrefix(fullName, "/")
	namespace, name, found := strings.Cut(fullName, "/")
	if !found {
		return containerNamespace, namespace
	}
	return namespace, name
}
```

`applyContainer()` stores the parsed namespace into status at `pkg/controllers/app/engine/controllers/sync_containers.go:107`, and `parseContainerName()` at `:231-237` now returns the first slash-delimited path component for names like `/k8s.io/...`. On the dockerd backend, though, the engine controller only creates `ContainerNamespace/moby` in `pkg/controllers/app/engine/controllers/sync_volumes.go:56-65`; it never creates `ContainerNamespace/k8s.io`. That violates the `ContainerStatus.Namespace` contract at `pkg/apis/containers/v1alpha1/container_types.go:79-83` (“refers to a `ContainerNamespace` object”), so mirrored containers with slash-prefixed Docker names will point at a namespace object that does not exist. The new unit test at `pkg/controllers/app/engine/controllers/sync_containers_test.go:25-29` bakes this wrong behavior in. Fix: on the dockerd backend always publish `status.namespace=moby`, and treat any slash prefix in the Docker name as name decoration rather than as a real container-engine namespace; update the test to match that backend contract.

**Suggestions**

S1. **Avoid command substitution in the new BATS suite** — `bats/tests/32-app-controllers/engine.bats:12-14` (suggestion, regression)

```bash
NAMESPACE="rancher-desktop"
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST
```

The command substitution at `bats/tests/32-app-controllers/engine.bats:13` runs during file load, outside BATS’ `run`/`$output` machinery. If `rdd svc paths docker_socket` ever fails, the suite dies with a much less useful load-time error than the repo’s standard `run -0 ...` pattern. Fix: resolve the socket path inside `local_setup_file()` with `run -0 rdd svc paths docker_socket`, assign `DOCKER_HOST="unix://$output"`, and export it there.

### Design Observations

**Concerns**

1. **Controller selection has no dependency model** `(in-scope)`. The new engine controller depends on App lifecycle plus the containers CRDs, but the only metadata the framework understands is a single free-form API-group string. That makes cross-group controllers fragile, as shown by I1. A small dependency declaration on `base.Controller` (for example `GetDependencies() []string`) would let the service expand `--controllers=app` into a coherent set and prevent future cross-controller breakage.

**Strengths**

1. The `ObservedGeneration` gating in `cmd/rdd/set.go` and the retry-on-conflict App status writes in both reconcilers are solid; they directly address the stale-condition and concurrent-writer races described in the design notes.
2. The watcher’s use of the Docker daemon clock for `since` and the explicit sorting before SSA in the sync helpers are well-judged details that avoid subtle event-loss and no-op churn bugs.

### Testing Assessment

1. Selective startup paths are untested. There is no coverage for `rdd svc start --controllers=app` or standalone `app-controller`, which would have caught I1 immediately.
2. Non-default and defaulted `App.spec.namespace` flows are untested. A simple `rdd set running=true` plus assertions against the App’s resolved namespace would have exposed I2.
3. Slash-prefixed dockerd container names are only unit-tested for parsing, not for end-to-end namespace validity. An integration test that checks `status.namespace` against an existing `ContainerNamespace` would have caught I3.

### Documentation Assessment

`docs/design/api_containers.md` now says dockerd always uses `moby`, but the examples still show `k8s.io` namespaces later in the document. Update those examples when fixing I3 so the docs match the backend contract.

### Acknowledged Limitations

- **Code comment** — `cmd/rdd/set.go:461-464`: “changes that trigger a VM restart without changing `running` ... are still not waited on.” This remains accurate; the new wait logic still only covers explicit `running` changes.
- **Code comment** — `pkg/controllers/app/engine/controllers/docker_watcher.go:178-184`: dropped image/volume create events can leave a mirror missing until a later full resync because the periodic resync is deferred. This change makes the limitation newly relevant, but the comment correctly describes it.
- **Code comment** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:180-182`: listing every mirror on every child event is a temporary O(N) design and should be split into per-resource reconcilers later. The current code matches that description.
- **Code comment** — `bats/tests/32-app-controllers/engine.bats:17-18`: Windows/WSL2 Docker socket access is not wired yet, so the suite skips there. The new tests increase the importance of that gap but do not hide it.

### Coverage Summary

- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/tests/32-app-controllers/app.bats` — Reviewed, no issues
- `bats/tests/32-app-controllers/engine.bats` — S1
- `cmd/app-controller/main.go` — I1
- `cmd/rdd/service.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Reviewed, no issues
- `cmd/rdd/set.go` — I2
- `docs/design/api_app.md` — Reviewed, no issues
- `docs/design/api_containers.md` — Reviewed, no issues
- `docs/design/cmd_app.md` — Reviewed, no issues
- `docs/design/discovery.md` — Reviewed, no issues
- `go.mod` — Trivial
- `go.sum` — Trivial
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/container_types.go` — I3
- `pkg/apis/containers/v1alpha1/volume_types.go` — Reviewed, no issues
- `pkg/controllers/app/app/controllers/app_controller.go` — Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml` — Reviewed, no issues
- `pkg/controllers/app/engine/controller.go` — I1
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I2
- `pkg/controllers/app/engine/controllers/sync_containers.go` — I3
- `pkg/controllers/app/engine/controllers/sync_containers_test.go` — I3
- `pkg/controllers/app/engine/controllers/sync_images.go` — I2
- `pkg/controllers/app/engine/controllers/sync_volumes.go` — I2, I3
- `pkg/controllers/containers/container/container_webhooks.go` — Reviewed, no issues
- `pkg/controllers/containers/container/crd.yaml` — Reviewed, no issues
- `pkg/controllers/containers/volume/crd.yaml` — Reviewed, no issues
- `pkg/instance/instance.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/service/controllers/discovery.go` — Reviewed, no issues
- `pkg/service/controllers/discovery_test.go` — Reviewed, no issues
- `pkg/service/controllers/manager.go` — Reviewed, no issues


---

### Gemini 3.1 Pro

### Executive Summary

This PR introduces the `engine` controller to mirror Docker engine objects (containers, images, volumes) into the Kubernetes control plane. The controller sets up an event-driven `dockerWatcher`, gracefully handles Docker state transitions, uses atomic list applications, and synchronizes cross-controller operations via `ContainerEngineReady` and `Running` conditions. The design choices, context propagation, and attention to event-stream details are outstanding. Only one testing-guideline violation was found.

### Findings

#### Critical Issues
None.

#### Important Issues

I1. **Command substitution in BATS test file scope** — `bats/tests/32-app-controllers/engine.bats:13-14` (important, regression)

```bash
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST
```

The repository guidelines explicitly forbid `$(command ...)` substitutions in BATS tests because they swallow failures and produce opaque errors if the command fails. Because this assignment is in the file scope rather than inside a `setup` or `@test` block, it will execute before the control plane is configured. If `rdd` is not in the PATH or fails, the test suite will fail cryptically with `unix://` instead of showing the actual command failure.

Fix: Move the assignment inside `local_setup_file()` and use the standard `run -0` pattern:

```diff
-DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
-export DOCKER_HOST
 
 local_setup_file() {
     # The Docker socket access pattern used by these tests is not yet wired
     # up for Windows/WSL2.
     skip_on_windows
+    
+    run -0 rdd svc paths docker_socket
+    export DOCKER_HOST="unix://${output}"
+
     # Deliberately skip setup_rdd_control_plane here: `rdd set` internally
```

#### Suggestions
None.

### Design Observations

**Strengths**
- **Docker API resilience**: The event stream untag handling (which doesn't carry tag names natively) and system time drift correction (`Info.SystemTime`) demonstrate an exceptional understanding of the Moby daemon's quirks.
- **Correct SSA map application**: Sorting lists of maps (like `ContainerPort` and `RepoDigests`) before creating the apply-configs correctly prevents spurious resourceVersion churn that would otherwise trigger constant requeues.
- **Multi-writer Status safety**: The implementation of `retry.RetryOnConflict` paired with a fresh `Get` effectively solves the concurrent controller 409-loop race on the `App` singleton's `Status.Conditions`.
- **Granular Finalizer Updates**: Applying the mirror finalizer using a segregated SSA `FieldOwner` (`engine-controller-finalizer`) prevents required-field validation failures by preserving the primary controller's ownership of the spec. 

### Testing Assessment

The implementation includes rigorous integration tests verifying expected functionality:
- Finalizer lifecycle handling for dangling or referenced images.
- Container state reconciliation across `running`, `paused`, and `created` state combinations.
- Accurate wait semantics for `rdd set running=false`.
- Fallback paths for unsupported container engines (`containerd`).

No additional critical test scenarios are missing.

### Documentation Assessment

The design documentation comprehensively maps out API behavior, field expectations, container engine mirroring responsibilities, and specific limitations. The updates fully match the code implementation logic.

### Commit Structure

Reviewed, no issues. The commits cleanly introduce the required functionality.

### Acknowledged Limitations

- **Event dropping and resync reliance**:
  - *Code comment* (`pkg/controllers/app/engine/controllers/docker_watcher.go:178`): `"Transient handleEvent errors (API 503, SSA conflict past its internal retry) are logged and dropped... so a dropped apply leaves the mirror missing until the next full resync."`
  - *PR comment*: The author explicitly documents that transient write failures rely on scheduled full-sync passes rather than local enqueueing. This intentional design choice accepts temporary staleness in favor of architectural simplicity.
- **O(N) reconciliation / Shared Singleton Queue**:
  - *PR comment*: `"reconcileContainerSpecs and processFinalizers both list all resources per reconcile; splitting into per-resource reconcilers with watch predicates is the long-term fix."`
  - *Code comment* (`pkg/controllers/app/engine/controllers/engine_reconciler.go:154`): `"Both passes issue List calls per reconcile, and every Container/Image/Volume watch event triggers a reconcile — so cost is O(N) per child event."`
  - Because errors are returned and joined up to `Reconcile` from `errors.Join(specsErr, finErr)`, a single perpetually failing resource (e.g. an un-removable stuck Docker container) will put the `App` singleton into `controller-runtime`'s exponential backoff. While watch-event triggers natively bypass the delayed workqueue (keeping the system responsive to state changes), background retry polling for all other items will be capped at the backoff limit. The author correctly identifies splitting this into per-resource reconcilers as the long-term fix.

### Coverage Summary

- `.github/workflows/bats.yaml` — Trivial
- `bats/tests/32-app-controllers/app.bats` — Trivial
- `bats/tests/32-app-controllers/engine.bats` — I1
- `cmd/app-controller/main.go` — Trivial
- `cmd/rdd/service.go` — Reviewed, no issues
- `cmd/rdd/service_paths.go` — Trivial
- `cmd/rdd/set.go` — Reviewed, no issues
- `docs/design/api_app.md` — Reviewed, no issues
- `docs/design/api_containers.md` — Reviewed, no issues
- `docs/design/cmd_app.md` — Reviewed, no issues
- `docs/design/discovery.md` — Reviewed, no issues
- `go.mod` — Trivial
- `go.sum` — Trivial
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/container_types.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/volume_types.go` — Trivial
- `pkg/controllers/app/app/controllers/app_controller.go` — Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml` — Reviewed, no issues
- `pkg/controllers/app/engine/controller.go` — Trivial
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_containers.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_containers_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_images.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_volumes.go` — Reviewed, no issues
- `pkg/controllers/containers/container/container_webhooks.go` — Reviewed, no issues
- `pkg/controllers/containers/container/crd.yaml` — Reviewed, no issues
- `pkg/controllers/containers/volume/crd.yaml` — Trivial
- `pkg/instance/instance.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/service/controllers/discovery.go` — Reviewed, no issues
- `pkg/service/controllers/discovery_test.go` — Reviewed, no issues
- `pkg/service/controllers/manager.go` — Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-14 |
| **Commit** | `587bfe2 engine: mirror Docker container engine state into Kubernetes` (fixup autosquashed into the original `aaf6c213`) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: BATS file-scope command substitution for `DOCKER_HOST` | Skipped |
| 2 | Important #2: Engine controller is excluded from `--controllers=app` | Fixed (changed `apiGroup` from `"engine"` to `"app"` in `pkg/controllers/app/engine/controller.go`) |
| 3 | Suggestion #1: `waitForDesiredState` treats any non-"true" `running` value as "false" | Skipped |
| 4 | Suggestion #2: `syncAllContainers` logs per-item Inspect errors but does not collect them | Skipped |
| 5 | Suggestion #3: `fullSync` continues after `syncContainerNamespace` failure | Skipped |
| 6 | Suggestion #4: Custom `hasFinalizer`/`removeFinalizer` helpers duplicate `controllerutil` | Fixed (replaced local helpers with `controllerutil.ContainsFinalizer` / `RemoveFinalizer` in `engine_reconciler.go` and `docker_watcher.go`) |
| 7 | Suggestion #5: `parseContainerName` produces non-"moby" status.namespace on dockerd | Skipped |
| 8 | Testing gaps (3 scenarios) | Skipped (batch) |