# Deep Review: 20260412-031335-pr-295

| | |
|---|---|
| **Date** | 2026-04-12 03:13 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 (of PR #295) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#295](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/295) — Container Engine watcher |
| **Commits** | `7c8f356` ci: always capture support bundle and log hostagent pgid<br>`9669b4b` bats: close fds 3 and 4 when invoking rdd<br>`50bdc15` ci: capture support bundle when BATS targets time out<br>`762c1e8` Make rdd set wait for the desired state by default<br>`65943da` Add BATS integration tests for the engine controller<br>`ba6caf7` Document engine mirroring and container state transitions<br>`4db3f20` Allow container spec.state transitions via webhook<br>`93359f7` Add engine controller that mirrors Docker state into K8s<br>`87f1df1` Add DockerSocket() helper and Docker client dependency<br>`32a0109` Work around distro proxy bug; update docker socket path<br>`536ccb2` Rosetta is now a vz driver option, not a global option |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — untag handling is broken, container pause/unpause is asymmetric, and several watcher lifecycle races need addressing before merge |
| **Wall-clock time** | `29 min 41 s` |

---

## Executive Summary

PR #295 introduces the engine controller — a new `EngineReconciler` plus long-lived `dockerWatcher` goroutine that mirrors Docker containers, images, and volumes into Kubernetes resources, forwards K8s-initiated deletions and `spec.state` transitions back to Docker, relaxes the container webhook to permit state changes, and adds `rdd set --timeout` wait semantics. It also bundles CI hang-diagnostics (`bats-with-timeout.sh`), Lima template updates (`.rd2` socket path, vz rosetta nesting), a distro proxy workaround, and BATS integration coverage.

The overall architecture is sound — a singleton reconciler owning a watcher goroutine, finalizer-gated reverse deletes, and a Get-then-Apply create pattern to avoid fighting user intent on `spec.state`. But the watcher's Docker↔K8s event handling has several correctness gaps: `ActionUnTag` events never remove stale mirrors (the event payload does not carry the tag name, and the call site passes the image ID into a tag-based lookup), paused containers cannot be unpaused via `spec.state=running` (asymmetric with the tested pause→stop path), and a transient Docker event-stream blip wipes every mirror resource. There is also a bootstrap race where the initial `fullSync` finishes before the event stream subscribes, so mutations in that window are invisible, and a `NotFound` during `Inspect` during `fullSync` aborts the whole sync.

No CRITICAL findings: 11 IMPORTANT issues (1 unique to each of the three reviewers plus a several-agent cluster around untag and error handling) and 10 SUGGESTIONS. Structurally, two commits (`4db3f20` and `762c1e8`) bundle engine-controller behavior changes into commits advertised as webhook and CLI features — rework the commit history before merge. Test coverage is strong on the happy paths (K8s-initiated deletion, state transitions, moby⇄containerd backend switch) but misses paused→running, transient Docker disconnect, and Docker-initiated untag flows.

---

## Critical Issues

None.

## Important Issues

I1. **Docker `ActionUnTag` events never remove the stale tag mirror** — `pkg/controllers/app/engine/controllers/docker_watcher.go:191-193` [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

```go
case events.ActionUnTag:
    log.V(1).Info("Image untagged", "id", msg.Actor.ID)
    return w.removeImageByTag(ctx, msg.Actor.ID)
```

All three reviewers caught this. When Docker emits an `untag` event, `msg.Actor.ID` is the image ID hash, not the tag. `removeImageByTag` searches for `status.repoTag == <hash>` at `pkg/controllers/app/engine/controllers/sync_images.go:185`, which never matches. The per-tag `Image` mirror survives indefinitely, until a watcher restart runs `fullSync` again.

I verified the upstream Moby contract: `daemon/images/image_delete.go:134,191,272` all call `LogImageEvent(ctx, imgID.String(), imgID.String(), events.ActionUnTag)`, so both `Actor.ID` and `Attributes["name"]` are the image ID hash — the tag that was removed is not propagated through the event payload at all. Gemini proposed fixing this by reading `msg.Actor.Attributes["name"]`, which is also wrong for this reason.

Fix (Codex's proposal): treat `untag` as a per-image reconciliation, not a direct tag delete. Re-inspect the image by ID, then delete any K8s `Image` objects whose `status.id` matches but whose `status.repoTag` is no longer in the fresh `RepoTags`. If the inspect returns NotFound, fall back to `removeImagesByID`.

```diff
 case events.ActionUnTag:
-    return w.removeImageByTag(ctx, msg.Actor.ID)
+    return w.syncImageAndPruneStaleTags(ctx, msg.Actor.ID)
```

I2. **`reconcileContainerState` does not unpause a paused container on `spec.state=running`** — `pkg/controllers/app/engine/controllers/docker_watcher.go:272-281` [Claude Opus 4.6]

```go
switch desired {
case containersv1alpha1.ContainerStatusRunning:
    log.Info("Starting container", "id", c.Name)
    _, err := w.cli.ContainerStart(ctx, c.Name, dockerclient.ContainerStartOptions{})
    return err
case containersv1alpha1.ContainerStatusCreated:
    log.Info("Stopping container", "id", c.Name)
    _, err := w.cli.ContainerStop(ctx, c.Name, dockerclient.ContainerStopOptions{})
    return err
}
```

The comment at `docker_watcher.go:268-271` carefully explains that `ContainerStop` handles paused/restarting containers natively, and the BATS test `patching spec.state=created stops a paused container` at `bats/tests/32-app-controllers/engine.bats:200-220` covers that direction. The symmetric case (desired=running, actual=paused) falls into `ContainerStart`, which Docker rejects for paused containers with "container is paused, unpause before starting". The user's intent is silently dropped and there is no test for this direction.

Fix: dispatch `ContainerUnpause` when `actual == paused` and `desired == running`, and add a BATS test mirroring the existing stop-paused test.

```diff
 switch desired {
 case containersv1alpha1.ContainerStatusRunning:
+    if actual == containersv1alpha1.ContainerStatusPaused {
+        log.Info("Unpausing container", "id", c.Name)
+        _, err := w.cli.ContainerUnpause(ctx, c.Name, dockerclient.ContainerUnpauseOptions{})
+        return err
+    }
     log.Info("Starting container", "id", c.Name)
     _, err := w.cli.ContainerStart(ctx, c.Name, dockerclient.ContainerStartOptions{})
     return err
```

I3. **Watcher bootstrap race — mutations during `fullSync` are invisible** — `pkg/controllers/app/engine/controllers/docker_watcher.go:69-76` [Codex GPT 5.4]

```go
// Perform initial full sync before starting the event stream.
if err := w.fullSync(watchCtx); err != nil {
    watchCancel()
    cli.Close()
    return nil, fmt.Errorf("failed to perform initial sync: %w", err)
}

go w.run(watchCtx)
```

`newDockerWatcher()` runs `fullSync` before `run()` subscribes to the Docker event stream at `docker_watcher.go:108-110`. Any container/image/volume mutation in Docker during that window is invisible — it happened after the sync snapshotted state, but before the event stream was open. This is especially plausible during VM startup where dockerd is actively accepting work as the App's `Running` condition flips.

After startup, `Reconcile()` marks the engine connected and returns `ctrl.Result{}, nil` at `engine_reconciler.go:143-146`, so there is no immediate second sync to close the gap.

Fix: open the event stream before or in parallel with `fullSync`, buffer events until the sync finishes, then replay. Alternatively, record a `Since` timestamp before `fullSync` starts and subscribe to the event stream with `Since` set to that timestamp so post-snapshot mutations are replayed after the sync.

I4. **Transient Docker disconnect wipes every mirror resource** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:95-107` [Claude Opus 4.6, Gemini 3.1 Pro (as Design Observation)]

```go
if watcherDied {
    log.Info("Docker watcher died, cleaning up mirror resources")
    if err := r.cleanupMirrorResources(ctx); err != nil {
        log.Error(err, "Failed to clean up mirror resources")
        return ctrl.Result{}, err
    }
    if err := r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "Disconnected", "Container engine connection lost"); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}
```

Any transient error on the Docker event stream (`docker_watcher.go:117-123` — e.g. user ran `systemctl restart docker` inside the guest) causes `run()` to return, flips the watcher to `!alive` on the next reconcile, and then `cleanupMirrorResources` deletes every `Container`/`Image`/`Volume`/`ContainerNamespace` in `rancher-desktop`. The subsequent reconcile restarts the watcher and `fullSync` recreates them, but every downstream client (front end, `rdd ctl get`, anything watching) sees the whole universe blink `DELETED → ADDED` with fresh UIDs. A single Docker blip is as disruptive as an intentional VM shutdown.

Fix: split "VM stopped / engine backend changed" (sweep everything — current behavior) from "transient watcher failure" (stop the watcher, leave mirror resources in place, attempt a bounded retry; on success `fullSync` reconciles drift in place). Only sweep if reconnect fails after a timeout. This preserves resource identity across transient outages and avoids flooding downstream watchers.

I5. **`IsNotFound` during `fullSync` Inspect aborts the whole sync** — `pkg/controllers/app/engine/controllers/sync_containers.go:74-81` [Gemini 3.1 Pro]

```go
func (w *dockerWatcher) syncContainer(ctx context.Context, id string) error {
    result, err := w.cli.ContainerInspect(ctx, id, dockerclient.ContainerInspectOptions{})
    if err != nil {
        return fmt.Errorf("failed to inspect container %s: %w", id, err)
    }
    return w.applyContainer(ctx, result.Container)
}
```

During `fullSync`, the controller lists all containers/images/volumes and then Inspects each by ID. If a resource is deleted in Docker between the `List` and `Inspect` calls (a routine TOCTOU), `Inspect` returns a `NotFound` error. That error propagates through `syncAllContainers` → `fullSync` → `newDockerWatcher`, which then tears down the watcher (`watchCancel()`, `cli.Close()`) and returns `ConnectFailed`. The engine reconciler then flips the App condition to `ConnectFailed` and waits for the next retry — over nothing more than a single concurrent `docker rm`.

Fix: swallow `NotFound` as success in `syncContainer`, `syncImageFromSummary`, and `syncVolume` — the stale K8s mirror will be swept in the "remove stale" step later in the sync.

```diff
 result, err := w.cli.ContainerInspect(ctx, id, dockerclient.ContainerInspectOptions{})
 if err != nil {
+    if cerrdefs.IsNotFound(err) {
+        return nil
+    }
     return fmt.Errorf("failed to inspect container %s: %w", id, err)
 }
```

I6. **`ActionTag` on a dangling image leaks the dangling mirror** — `pkg/controllers/app/engine/controllers/sync_images.go:120-150` [Gemini 3.1 Pro]

```go
if len(inspect.RepoTags) > 0 {
    for _, tag := range inspect.RepoTags {
        name := fmt.Sprintf("%s-%x",
            sanitizeKubernetesObjectName(inspect.ID),
            sha256.Sum256([]byte(tag)))
        // ... creates "<id>-<taghash>" mirror
    }
} else {
    // Dangling image (no tags).
    name := sanitizeKubernetesObjectName(inspect.ID)
    // ... creates "<id>" mirror
}
```

When an image that was dangling (K8s mirror name `<id>`) is subsequently tagged, `applyImageFromInspect` creates the new tagged mirror `<id>-<taghash>` but never deletes the pre-existing dangling mirror. Both survive until the next `fullSync` at watcher restart.

This is the mirror image of I1: tags and the dangling form should transition cleanly in both directions, but neither direction is currently handled incrementally.

Fix: inside the `len(inspect.RepoTags) > 0` branch, proactively remove the dangling resource:

```go
w.removeMirrorResource(ctx, &containersv1alpha1.Image{}, sanitizeKubernetesObjectName(inspect.ID))
```

I7. **Failed `spec.state` transitions stall after one attempt** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:297-307` [Codex GPT 5.4, Claude Opus 4.6 (as S5), Gemini 3.1 Pro (as C4)]

```go
for i := range containers.Items {
    c := &containers.Items[i]
    if c.DeletionTimestamp != nil {
        continue
    }
    if err := w.reconcileContainerState(ctx, c); err != nil {
        logf.FromContext(ctx).Error(err, "Failed to reconcile container state",
            "container", c.Name)
    }
}
return nil
```

`reconcileContainerState` returns raw Docker API errors, but `reconcileContainerSpecs` only logs them and returns success. The reconcile then falls through to `return ctrl.Result{}, nil`, so a failed start/stop gets exactly one attempt: the user patch that changed `spec.state` produced the only watch event, and no requeue is requested when Docker rejects the operation. The mismatch sits forever until some unrelated event triggers a reconcile.

Gemini and Codex both flagged this as a correctness issue (Gemini at Critical, Codex at Important); Claude flagged it as Suggestion S5 pointing out the asymmetry with the per-item `errors.Join` pattern used in `processContainerFinalizers` (`engine_reconciler.go:334-356`). I settled on Important — the retry never happens unless another reconcile fires, but the reconcile loop's overall safety net (watch events from later Docker activity) provides an eventual fallback.

Fix: collect per-container errors with `errors.Join` and return them so controller-runtime requeues with backoff, matching the existing finalizer pattern.

```diff
+    var errs []error
     for i := range containers.Items {
         ...
         if err := w.reconcileContainerState(ctx, c); err != nil {
-            logf.FromContext(ctx).Error(err, "Failed to reconcile container state", "container", c.Name)
+            errs = append(errs, fmt.Errorf("container %s: %w", c.Name, err))
         }
     }
-    return nil
+    return errors.Join(errs...)
```

I8. **`cleanupMirrorResources` runs on every idle reconcile** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:115-134` [Claude Opus 4.6]

```go
wantWatcher := running && engineIsDocker
if !wantWatcher {
    if watcherRunning {
        ...
        r.stopWatcher()
    }
    if err := r.cleanupMirrorResources(ctx); err != nil {
        log.Error(err, "Failed to clean up mirror resources")
        return ctrl.Result{}, err
    }
    ...
```

The comment defends unconditional cleanup ("so transient errors are retried on the next requeue"), but in the common "VM stopped" and "containerd backend" steady states every incoming App event triggers four `List` calls (Container/Image/Volume/ContainerNamespace), each returning empty. Since the reconciler also `Watches` Container/Image/Volume as trigger sources (`engine_reconciler.go:421-423`), any update on any of them in `rancher-desktop` cascades into four list calls per event.

Fix: gate cleanup on a state change — only sweep on transition into the not-wanted state (watcher was running, or a dedicated "CleanupPending" bit is set). If `ContainerEngineReady` already reflects the terminal state and `observedGeneration == app.Generation`, short-circuit with no List calls.

I9. **`ForceOwnership` on initial create races with concurrent user patch** — `pkg/controllers/app/engine/controllers/sync_containers.go:101-112` [Claude Opus 4.6]

```go
var existing containersv1alpha1.Container
err := w.k8s.Get(ctx, client.ObjectKey{Name: inspect.ID, Namespace: apiNamespace}, &existing)
if apierrors.IsNotFound(err) {
    applyConfig := containersv1alpha1apply.Container(inspect.ID, apiNamespace).
        WithFinalizers(mirrorFinalizer).
        WithSpec(containersv1alpha1apply.ContainerSpec().WithState(containersv1alpha1.ContainerStatusUnknown))
    if err := w.k8s.Apply(ctx, applyConfig, client.ForceOwnership, client.FieldOwner(controllerName)); err != nil {
        return fmt.Errorf("failed to create container %s: %w", inspect.ID, err)
    }
}
```

The Get-then-Apply pattern (explained by the comment) is the right idea — spec must be written exactly once to avoid fighting user intent. But the Get is racy: a concurrent user patch `spec.state=running` that lands between the Get (returning NotFound) and the Apply is clobbered by the Apply, because `ForceOwnership` explicitly steals `spec.state` from whoever wrote it last. The window is small but real: in practice the first Docker create event fires very close to when rdd-ctl-driven user patches might arrive for pre-existing resources after a crash recovery.

Fix: drop `ForceOwnership` on the initial create apply, or use a typed `Create` call with explicit `metav1.CreateOptions` instead of SSA. The narrower fix is to remove `client.ForceOwnership` from the create path and leave it on the status-only subsequent apply.

I10. **`watchCondition` conflates user cancellation and timeout** — `cmd/rdd/set.go:506-512` [Claude Opus 4.6]

```go
if _, err := watchtools.UntilWithSync(ctx, lw, &unstructured.Unstructured{}, precondition, condition); err != nil {
    if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
        return errors.New("timed out waiting for App state")
    }
    return fmt.Errorf("failed to watch App: %w", err)
}
```

When the user interrupts `rdd set` mid-wait with Ctrl+C, `ctx` is cancelled and `watchtools` returns `context.Canceled`. The CLI reports "timed out waiting for App state", which is false — the user cancelled. A user debugging why `rdd set` "timed out" after 2 seconds will go hunting in VM logs for the symptom instead of realizing they pressed Ctrl+C.

Fix: distinguish the two cases.

```diff
-    if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
+    if errors.Is(err, context.DeadlineExceeded) {
         return errors.New("timed out waiting for App state")
     }
+    if errors.Is(err, context.Canceled) {
+        return errors.New("wait cancelled")
+    }
     return fmt.Errorf("failed to watch App: %w", err)
```

I11. **`rdd set` with multi-property changes returns on stale `ContainerEngineReady`** — `cmd/rdd/set.go:444-457` [Claude Opus 4.6]

```go
if runningVal == "true" {
    logrus.Info("Waiting for container engine to be ready")
    return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
        return conditionStatus(obj, "ContainerEngineReady") == metav1.ConditionTrue
    })
```

`rdd set containerEngine.name=containerd running=true` on an already-running VM races with the stale cached `ContainerEngineReady=True/Connected` from the moby backend. The precondition check at `set.go:489-495` fires on the stale state and returns immediately, leaving the user with a half-restarted VM even though `rdd set` succeeded.

The test at `bats/tests/32-app-controllers/engine.bats:266-274` explicitly documents this limitation: "Stop first so there is no stale True/Connected from moby to satisfy the CLI wait below before the engine reconciler has run." That comment is the confession — the code works around a race the test harness itself can't avoid without stopping first.

The author's TODO at `set.go:440-443` acknowledges the root cause and proposes a future `Reconciled` condition with `observedGeneration`. Until that lands: either narrow the documented behavior ("waits for engine readiness only when `running` is the only property"), or skip the precondition shortcut when properties include anything besides `running`. At minimum, document the hazard in the TODO itself so future maintainers know it's a real user-visible issue, not just a polish item.

## Suggestions

S1. **`removeMirrorResource` mutates caller-provided `obj` across retries** — `pkg/controllers/app/engine/controllers/docker_watcher.go:232-250` [Claude Opus 4.6]

```go
func (w *dockerWatcher) removeMirrorResource(ctx context.Context, obj client.Object, name string) error {
    key := client.ObjectKey{Name: name, Namespace: apiNamespace}
    retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        if err := w.k8s.Get(ctx, key, obj); err != nil {
            ...
        }
        if !removeFinalizer(obj, mirrorFinalizer) {
            return nil
        }
        return w.k8s.Update(ctx, obj)
    })
```

`obj` is caller-provided and retry iterations reuse the same pointer. That works in practice because controller-runtime's `Get` resets-and-populates, but it depends on an implementation detail. Callers like `syncAllContainers` at `sync_containers.go:60-66` pass `&k8sContainers.Items[i]`, a slice element with prior list state pre-populated. `deleteAllOfType` at `engine_reconciler.go:257` already uses the fresher `DeepCopyObject()` inside the retry closure pattern — apply the same here.

S2. **`DockerSocket()` hardcode of `.rd2` is a silent regression** — `pkg/instance/instance.go:129-137` [Claude Opus 4.6]

```go
var DockerSocket = sync.OnceValue(func() string {
    home, err := os.UserHomeDir()
    if err != nil {
        panic(fmt.Errorf("could not get home directory: %w", err))
    }
    dir := filepath.Join(home, ".rd2")
    _ = os.MkdirAll(dir, 0o700)
    return filepath.Join(dir, "docker.sock")
})
```

Commit `87f1df1` originally used `filepath.Join(ShortDir(), "docker.sock")`, which is the correct per-instance path. Commit `762c1e8` regressed it to a hardcoded `.rd2` to match the (also-hardcoded) Lima template path at `pkg/controllers/app/app/lima-template.yaml:30`. The TODO at `instance.go:127-128` acknowledges this, but the two hardcodes need to be fixed together — either both parameterized on the instance suffix, or a pair of tests that fails the build if they ever diverge.

Also, `os.MkdirAll` silently swallows errors (`_ =`); while the directory is typically pre-created by `rdd svc start`, the silent failure is a trap. The BATS helper at `bats/tests/32-app-controllers/engine.bats:12-14` carries a parallel TODO, confirming the author is aware.

S3. **`docker_watcher.run()` has no panic recovery** — `pkg/controllers/app/engine/controllers/docker_watcher.go:99-131` [Claude Opus 4.6]

Unlike controller-runtime reconcilers (wrapped in `HandleCrash`), this raw goroutine has no panic recovery. A nil-pointer in `handleEvent` — e.g. an unexpected Moby event shape — takes down the whole app-controller process. Add `defer utilruntime.HandleCrash(...)` or at minimum `defer func() { if r := recover(); r != nil { log.Error(...) } }()`.

S4. **Inconsistent `Force` flags across `delete{Container,Image,Volume}`** — `pkg/controllers/app/engine/controllers/docker_watcher.go:289-320` [Claude Opus 4.6]

Containers and volumes use `Force: true`, meaning `rdd ctl delete container/foo` force-stops a running container without giving processes a chance to exit gracefully. Images use no Force, so in-use images are correctly kept — but the inconsistency is surprising and undocumented. Either document the policy in a comment, or consider `Force: false` for containers (the user can patch `spec.state=created` first, then delete, which is the K8s-idiomatic pattern).

S5. **`statusCopy := *statusApply` shallow copy shares slice/map state** — `pkg/controllers/app/engine/controllers/sync_images.go:128-138` [Claude Opus 4.6]

```go
statusCopy := *statusApply
if err := w.applyImage(ctx,
    containersv1alpha1apply.Image(name, apiNamespace).
        WithFinalizers(mirrorFinalizer),
    statusCopy.
        WithRepoTag(tag).
        WithNamespace(containerNamespace),
); err != nil {
```

`statusCopy := *statusApply` is a shallow copy: `RepoDigests` (slice), `Labels` (map), and every `*string`/`*int64` field share backing memory with the original across iterations. It is safe today because the loop only calls `WithRepoTag` and `WithNamespace` — both set-only scalar operations — but a future edit that adds `WithLabels(extraLabels)` or `WithRepoDigests(extra)` inside the loop would corrupt all prior iterations' payloads via in-place `append`/map mutation.

Fix: replace the shallow copy with a helper that returns a fresh `ImageStatusApplyConfiguration` each iteration, building from the common fields explicitly.

S6. **`pgid` log field is misleading on Windows** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:479-485` [Claude Opus 4.6]

```go
// pgid equals pid because SetGroup made the hostagent a new group leader.
logger.Info("Hostagent started, watcher active",
    "instance", limaVM.Name,
    "pid", haCmd.Process.Pid,
    "pgid", haCmd.Process.Pid)
```

On Unix the claim holds (`Setpgid=true → pgid=pid`). On Windows there is no POSIX process group; `SetGroup` sets `CREATE_NEW_PROCESS_GROUP`, which is a different console-control-event concept — logging the value as "pgid" is a misnomer. Harmless today (Windows CI is not exercised), but will bite when it's added. Either read the real pgid via `syscall.Getpgid(haCmd.Process.Pid)` on Unix only, or label the field with a platform-neutral name.

S7. **BATS engine tests use `[ -n "${output}" ]` instead of `refute_output`** — `bats/tests/32-app-controllers/engine.bats:142,152` [Claude Opus 4.6]

```bash
run -0 rdd ctl get image --namespace="${NAMESPACE}" \
    --field-selector "status.repoTag=busybox:latest" -o name
image_ref=${output}
[ -n "${image_ref}" ]
```

Per the repo convention, prefer `refute_output ""` over `[ -n ... ]`, which fails silently on violation. Swap both sites.

```diff
 run -0 rdd ctl get image --namespace="${NAMESPACE}" \
     --field-selector "status.repoTag=busybox:latest" -o name
+refute_output ""
 image_ref=${output}
-[ -n "${image_ref}" ]
```

S8. **Concurrent finalizer removal can log benign NotFound errors** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:349-353` [Gemini 3.1 Pro]

```go
if removeFinalizer(c, mirrorFinalizer) {
    if err := r.Update(ctx, c); err != nil {
        errs = append(errs, fmt.Errorf("failed to remove finalizer from container %s: %w", c.Name, err))
    }
}
```

`r.Update(ctx, c)` can return `NotFound` if the Docker `ActionDestroy` event handler concurrently stripped the finalizer and deleted the mirror object between this reconcile's List and Update. It's a benign race but produces noisy error logs and triggers unnecessary reconciles.

Fix: `client.IgnoreNotFound(r.Update(ctx, c))`. The same pattern applies to `processImageFinalizers` and `processVolumeFinalizers`.

S9. **Design doc overstates containerd mirroring support** — `docs/design/api_containers.md:35-38` [Codex GPT 5.4]

```md
2. Creates the `rancher-desktop` K8s namespace and the appropriate
   `ContainerNamespace` resource (`moby` for dockerd, engine-specific
   namespaces for containerd).
```

The new doc text says the engine controller creates engine-specific namespaces for containerd, but the implementation does the opposite: `engine_reconciler.go:115-133` shuts the watcher down and reports `ContainerEngineReady=True, reason=NotApplicable` whenever `app.Spec.ContainerEngine.Name != "moby"`. That wording will mislead anyone trying to build against the API from the doc.

Fix:

```diff
-2. Creates the `rancher-desktop` K8s namespace and the appropriate
-   `ContainerNamespace` resource (`moby` for dockerd, engine-specific
-   namespaces for containerd).
+2. Creates the `rancher-desktop` K8s namespace and the `moby`
+   `ContainerNamespace` resource when the App uses the moby backend.
+   Containerd mirroring is not implemented yet; the controller reports
+   `ContainerEngineReady` with reason `NotApplicable` in that mode.
```

Separately, the `Image` example at `docs/design/api_containers.md:218-219` still describes the old "random suffix" naming scheme even though the implementation now uses a deterministic hash — that text predates this PR but is stale.

S10. **`docs/design/api_containers.md:306-312` still describes the old finalizer-untag semantics** — [Claude Opus 4.6]

```md
Delete the Image object; the finalizer will untag the image. If all tags of an
image are removed, _and_ it is not in use by a container (running or not), then
the image itself is removed.
```

This contradicts the current code: on a Docker untag the controller deletes the K8s resource directly, without going through the finalizer path. The doc change in `ba6caf7` left this stale. Reconcile the doc with the implementation — the current code better matches the intent, but the description on this line should match what the code does.

## Design Observations

### Concerns

- **Engine controller cleanup policy treats "VM stopped" and "watcher died" identically** `(in-scope)` — A transient event-stream blip triggers the same aggressive sweep as a user-initiated shutdown. See I4 for the specific fix; the broader design question is whether the `Reconcile` state machine should distinguish "intentional stop" (from App condition) from "involuntary failure" (from watcher liveness). [Claude Opus 4.6, Gemini 3.1 Pro]

- **`spec.state: unknown` conflates "no user intent" with a real spec state** `(future)` — A cleaner model would be a nullable `spec.state` (pointer) with `nil = no opinion`, or split `spec.desired.state` from `status.managedBy`. The current shape forces the webhook to enforce "stop trying to be unknown once a user expressed intent" — future contributors will trip over this. Not in-scope for this PR. [Claude Opus 4.6]

- **`cleanupMirrorResources` assumes exclusive ownership of `rancher-desktop`** `(future)` — `deleteAllOfType` iterates every resource in the namespace and strips the mirror finalizer. The mock controller at `pkg/controllers/mock/mock_controller.go:27` uses the same `apiNamespace = "rancher-desktop"`; if the two ever coexist (e.g. mock + real engine in a dev instance), cleanup will wipe the mock data. Filter by a label selector (`engine.rancherdesktop.io/mirror=true`) or by field owner. [Claude Opus 4.6]

- **Engine feature is a single singleton reconciler owning bootstrap sync, event streaming, desired-state convergence, and finalizer processing** `(future)` — Splitting into per-resource reconcilers would remove the global O(N) sweeps, make retries object-scoped instead of best-effort, and eliminate the I3 bootstrap race by giving each object a durable reconcile path. The code already acknowledges this at `engine_reconciler.go:149-155` with an in-place comment. [Codex GPT 5.4]

### Strengths

- The `Get`-then-`Apply` comment at `sync_containers.go:87-93` explaining why spec is written only on create (to avoid SSA fights with user patches) is the kind of insight most code lacks. [Claude Opus 4.6]

- `processContainerFinalizers`, `processImageFinalizers`, and `processVolumeFinalizers` collect per-item errors with `errors.Join` — one stuck object does not block the rest. This is the correct pattern for sweep loops. (The asymmetry with `reconcileContainerSpecs`, which does _not_ do this, is I7.) [Claude Opus 4.6]

- The `docker_watcher.alive()` check paired with the explicit `stop() → <-w.done` handshake avoids the common goroutine-ownership ambiguity. [Claude Opus 4.6]

- The mirror reverse-delete path is designed around the repo's "no K8s GC" constraint: mirror resources carry an explicit engine finalizer, and `cleanupMirrorResources` sweeps all mirrored kinds on shutdown instead of assuming owner-reference garbage collection. [Codex GPT 5.4]

- The new `rdd set` implementation uses a watch-based wait rather than polling — the right control-plane pattern for this codebase. [Codex GPT 5.4]

- The `immutableValidator` webhook correctly prevents drift by forcing explicit transitions; the `equality.Semantic.DeepEqual` comparison with `State` normalized is the right way to verify nothing else changed. [Gemini 3.1 Pro]

- Deterministic RFC 1123 volume names via SHA-256 hashing with the original name preserved in `status.name` elegantly handles Docker names that contain uppercase/underscore characters. [Gemini 3.1 Pro]

## Testing Assessment

Strong happy-path coverage: container lifecycle mirroring, `docker run`/`stop`/`rm`, volume create/delete (including uppercase names), image pull, K8s-initiated container deletion, in-use image finalizer retention, spec-state transitions in the running↔created↔paused(→stopped) direction, VM shutdown cleanup, and the moby↔containerd backend switch. The BATS test at `bats/tests/32-app-controllers/engine.bats` is readable, well-commented, and validates the main contracts end-to-end.

Gaps, in rough order of risk:

1. **Paused → running transition** (I2). No test. High risk of silently-dropped user intent on unpause.
2. **Docker-initiated untag of in-use image** (I1, I6). The existing `deleting an in-use Image keeps the finalizer` test covers the K8s-initiated direction; the Docker-initiated `docker rmi <tag>` path is not exercised.
3. **Transient Docker disconnect** (I4). No test. Would need to kill/restart Docker inside the VM or sever the host socket briefly.
4. **`NotFound` during `fullSync` Inspect** (I5). No test for a container being removed concurrently with the initial sync.
5. **Watcher bootstrap race** (I3). No test for mutations landing between `fullSync` and event-stream subscription.
6. **Multi-property `rdd set` while VM running** (I11). The test at `engine.bats:266-274` documents the stale-condition race but works around it by stopping first.
7. **Concurrent user patch during engine reconcile** (I9). No test for a user patching `spec.state=running` in the window between the engine's Get and Apply on create.
8. **Engine tests only cover a single instance and skip Windows/WSL2** — the hardcoded `.rd2` socket path (S2) means multi-instance tests cannot be added until the Lima template is parameterized on the instance suffix.

## Documentation Assessment

- `docs/design/api_containers.md` is substantially updated with engine mirroring semantics, `unknown` state rationale, and finalizer lifecycle — a net improvement.
- The `State is the desired state of the container` comment at `pkg/apis/containers/v1alpha1/container_types.go:56-61` does not describe what `unknown` means to the reconciler. Update to: "State is the desired state of the container; `unknown` means the engine controller mirrors without expressing intent and takes no start/stop action."
- The `rdd set` help text at `cmd/rdd/set.go:68-72` correctly mentions `--timeout=0`, but commit message `762c1e8` refers to a `--wait=false` flag that does not exist. Fix the commit message (see I7 / Commit Structure).
- S9 and S10 document stale doc text that should be updated with this PR.

## Commit Structure

This PR bundles the core engine-watcher feature with unrelated CI and Lima template work. Two commits in particular are mis-titled:

- **`4db3f20` "Allow container spec.state transitions via webhook"** — the webhook diff is a small part of this commit. It also introduces the `unknown` enum value, changes the CRD default from `running` to `unknown`, adds the `reconcileContainerState` flow with paused-container handling, adds `processContainerFinalizers`/`processImageFinalizers`/`processVolumeFinalizers`, and reworks `sync_containers.go` to gate spec-apply on Get. A reader scanning `git log --oneline` for the `unknown` enum addition would not find it. [Claude Opus 4.6, Codex GPT 5.4]

- **`762c1e8` "Make rdd set wait for the desired state by default"** — same problem, larger. ~300 lines of engine controller churn, the `DockerSocket` regression from `ShortDir()` to hardcoded `.rd2` (S2), and a significant reconciler rework. The commit message also refers to a `--wait=false` flag that does not exist — the flag is `--timeout=0`. [Claude Opus 4.6, Codex GPT 5.4]

- **`50bdc15` and `7c8f356`** — the second immediately flips the first from "on-timeout only" to "always". These should be a single commit. [Claude Opus 4.6]

Crisp/self-contained commits (good examples): `9669b4b` (bats fd close), `ba6caf7` (docs), `32a0109` (distro proxy workaround), `536ccb2` (rosetta vz move).

Fix: split the commits along responsibility lines and correct the `762c1e8` commit message to reflect the actual flag name before the PR lands.

## Acknowledged Limitations

- **`DockerSocket()` hardcodes `.rd2`** — TODO at `pkg/instance/instance.go:127-128`, mirrored by a parallel TODO at `bats/tests/32-app-controllers/engine.bats:12-14`. S2 is in findings because it's a regression from the previously correct state, not because the author is unaware.
- **`rdd set` doesn't wait for non-`running` property changes that trigger a VM restart** — TODO at `cmd/rdd/set.go:440-443` proposes a `Reconciled` condition with `observedGeneration`. I11 is still in findings because the race is user-visible today, not just a future polish item.
- **`reconcileContainerSpecs` + `processFinalizers` run O(N) per child event** — comment at `engine_reconciler.go:149-155` acknowledges the long-term fix (split into per-resource reconcilers). Not promoted to a finding given the typical scale.
- **`moby` ContainerNamespace has no mirror finalizer by design** — comment at `sync_volumes.go:54-59` explains; the BATS test `deleting containernamespace/moby completes without a finalizer hang` verifies.
- **Containerd backend reports `ContainerEngineReady=True/NotApplicable`** — intentional so that `rdd set running=true` returns promptly. The test comment at `engine.bats:266-273` documents this.
- **Windows hostagent `pgid` logging** — the comment at `limavm_lifecycle.go:479-481` assumes Unix Setpgid semantics. S6 is in findings because the wording breaks down on Windows, not because the author is unaware.

## Agent Performance Retro

### [Claude]

Claude produced the most comprehensive review of the three by a wide margin (8 Important, 8 Suggestions, 3 Design observations), and carried the most unique signal: the paused→running asymmetry (I2), the `ForceOwnership` create race (I9), the `watchCondition` cancel/timeout confusion (I10), the multi-property `rdd set` stale-condition race (I11), and the idle-reconcile `cleanupMirrorResources` cost (I8). It also caught the shallow-copy trap (S5), the panic-recovery gap (S3), and the `Force` flag inconsistency (S4), plus the commit-structure critique. It correctly framed I1 as "no dangling mirror is ever created" rather than "the tag is not removed," which is a sharper diagnosis than Codex or Gemini's.

Claude's run time was unusual: it took 22 minutes — well past the nominal 15-minute timeout — but produced the deepest analysis. Worth noting that `--effort max` is paying for quality here.

### [Codex]

Codex produced a tight, well-calibrated review (3 Important, 1 Suggestion) with the single highest-signal unique finding of the round: the **watcher bootstrap race** (I3) between `fullSync` and event-stream subscription. That requires tracing the initialization order carefully — a class of bug where no raw code read would catch it. Codex also reached the correct fix for the untag bug (I1) independently before Claude got there, and correctly identified that the commit-structure and documentation issues needed flagging. It downgraded the error-swallowing from Gemini's "Critical" to Important (correct) and from Claude's "Suggestion" (correct). Zero false positives.

### [Gemini]

Gemini produced 4 "Critical" and 3 "Important" findings but **hallucinated C1** — the "typo in BATS wrapper `$ @` causes immediate CI failure" finding. The actual `scripts/bats-with-timeout.sh:225` reads `"$@" &` with no space. I verified this directly. A CI-failure-level claim should never be made without verifying the literal character sequence in the source, and Gemini's own tool output earlier contained `Error stating path " &` — a garbled state during its own reading of the file that it apparently then committed to as fact.

Gemini also inflated severity on C3 and C4 (early return bypasses finalizers, container state errors swallowed) — both are real concerns but neither is Critical. And its suggested fix for the untag bug (use `msg.Actor.Attributes["name"]`) is also wrong, because Docker sets that attribute to the image ID hash too. That said, Gemini caught two things the others missed: the dangling-image leak on tag (I6) and the `NotFound` during `fullSync` issue (I5). When correctly calibrated, those are valuable finds.

Gemini's review ran in about 6 minutes — fastest of the three.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 22m 15s | 7m 00s | 6m 51s |
| Findings | 8I 8S | 3I 1S | 4I 1S |
| Tool calls | 99 (Bash 36, Read 36, Grep 20) | 49 (shell 47, stdin 2) | 4 (run_shell_command 3, google_web_search 1) |
| Design observations | 3 | 2 | 1 |
| False positives | 0 | 0 | 1 |
| Unique insights | 7 | 1 | 2 |
| Files reviewed | 26 | 26 | 26 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **8I 8S** | **3I 1S** | **4I 1S** |
| Downgraded | 0 | 0 | 3 (I→S) |
| Dropped | 0 | 0 | 1 |


Claude is the clear leader for depth and calibration on this review; Codex's bootstrap-race catch is the single most valuable unique finding; Gemini's hallucinated "critical CI failure" is the most concerning failure mode and reinforces the rule that model claims about literal file contents need to be verified before being trusted.

**Reconciliation**:

- Gemini C1 (shell typo, Critical → rejected as false positive): the claimed `"$ @"` does not exist; the actual script has `"$@"`. Verified by reading `scripts/bats-with-timeout.sh:225`.
- Gemini C2 (untag broken, Critical → Important as I1): the bug is real but not a data-loss-level regression — stale resources survive until the next watcher restart's `fullSync`. Also, Gemini's proposed fix is wrong (reading `Actor.Attributes["name"]` returns the image ID hash per Moby `daemon/images/image_events.go:22`).
- Gemini C3 (early return bypasses finalizers, Critical → rejected): subsequent reconciles from the cleanup's own watch events process the deferred finalizers. Not critical in practice.
- Gemini C4 (state errors swallowed, Critical → Important as I7): real bug but not Critical; retry eventually happens via subsequent Docker events.
- Claude S5 (state errors swallowed, Suggestion → promoted to Important as I7): merged with Codex/Gemini framing.
- Gemini Design Observation (tear-down on disconnect → promoted to Important as I4): the issue is user-visible (UIDs churn, watch clients see DELETED→ADDED), so it warrants Important rather than future-design treatment.

## Review Process Notes

**Skill improvements:**

- None this round. The review dimensions already cover watcher lifecycle, initialization ordering, retry semantics, and Docker event-handling contracts. Gemini's hallucinated CI-failure Critical is a model-specific failure, not a prompt gap — no amount of prompt tweaking removes "verify the literal characters" as a model responsibility.

**Repo context updates:**

- `[repo]` Add to `deep-review-context.md`: "When reviewing watcher or event-stream code, cross-check against upstream source for the event payload contract — Docker's image events in particular do not propagate tag names through `Actor.ID` or `Actor.Attributes`. Agents must trace into `daemon/images/image_events.go` (or equivalent) when proposing a fix that reads event fields."

  Why: this review had two agents (Gemini, Codex initially) propose fixes based on a wrong assumption about what `Actor.ID` and `Actor.Attributes["name"]` contain for `ActionUnTag`. Future reviewers will save time if this is documented up front.

## Appendix: Original Reviews

### [Claude] Opus 4.6

Now I have enough context to write the review. Let me finalize my findings and write the review.

# Code Review — PR #295 Container Engine watcher

## Executive Summary

This PR adds an engine controller that mirrors Docker container/image/volume state into the Kubernetes API, backed by a long-lived `dockerWatcher` goroutine; it also relaxes the Container webhook to allow `spec.state` transitions, adds `rdd set --timeout` wait semantics, ports the Lima template from `.rd` to `.rd2`, and adds BATS integration tests plus CI support-bundle capture. The design is sound, but there are several correctness gaps around image lifecycle, container unpause, cleanup in transient-disconnect scenarios, and commit/scope hygiene.

## Findings

### Critical Issues

None.

### Important Issues

**I1. Last-tag untag of an in-use image leaves no K8s representation of the dangling image — `pkg/controllers/app/engine/controllers/docker_watcher.go:183-201`**

```go
case events.ActionUnTag:
    log.V(1).Info("Image untagged", "id", msg.Actor.ID)
    return w.removeImageByTag(ctx, msg.Actor.ID)
```

When a user runs `docker rmi busybox` on an image that is still referenced by a running container, Docker untags the image and leaves a *dangling* image behind (the layer still exists, just without a tag). The engine controller fires on `ActionUnTag`, calls `removeImageByTag` at docker_watcher.go:191-193, and deletes the per-tag K8s `Image` resource — but it never creates a replacement dangling Image. The design doc (`docs/design/api_containers.md:208-212`) explicitly promises the dangling form: "If an image without any tags exists, that will be represented by an Image object without `.status.repoTag` and `.status.namespace`." `syncAllImages` (sync_images.go:39-78) handles the dangling form at sync_images.go:139-150, but that path runs only during the one-shot `fullSync` at watcher startup — between watcher restarts, the dangling image has no K8s representation at all. A user doing `docker pull foo && docker run foo && docker rmi foo` would see the image vanish from `rdd ctl get images` even though Docker still has it, until the next VM restart.

Fix: after handling `ActionUnTag`, re-inspect the image by ID; if it still exists with an empty `RepoTags`, create a dangling K8s resource. The simplest implementation reuses `syncImage(ctx, msg.Actor.ID)` when `msg.Actor.Attributes["name"]` (or a re-inspect) shows the ID still present:

```diff
 case events.ActionUnTag:
     log.V(1).Info("Image untagged", "id", msg.Actor.ID)
-    return w.removeImageByTag(ctx, msg.Actor.ID)
+    if err := w.removeImageByTag(ctx, msg.Actor.ID); err != nil {
+        return err
+    }
+    // The image may now be dangling (still exists but has no tags).
+    // Re-sync from Docker to create the dangling K8s resource.
+    imageID, ok := msg.Actor.Attributes["imageID"]
+    if !ok {
+        return nil
+    }
+    if err := w.syncImage(ctx, imageID); err != nil && !cerrdefs.IsNotFound(err) {
+        return err
+    }
+    return nil
```

(important, gap)

---

**I2. `reconcileContainerState` does not unpause a paused container on `spec.state=running` — `pkg/controllers/app/engine/controllers/docker_watcher.go:255-283`**

```go
func (w *dockerWatcher) reconcileContainerState(ctx context.Context, c *containersv1alpha1.Container) error {
    desired := c.Spec.State
    if desired == containersv1alpha1.ContainerStatusUnknown {
        return nil
    }
    actual := c.Status.Status
    if desired == actual {
        return nil
    }
    ...
    switch desired {
    case containersv1alpha1.ContainerStatusRunning:
        log.Info("Starting container", "id", c.Name)
        _, err := w.cli.ContainerStart(ctx, c.Name, dockerclient.ContainerStartOptions{})
        return err
```

The comment at docker_watcher.go:268-271 carefully explains that `ContainerStop` handles paused/restarting containers natively — and the test `patching spec.state=created stops a paused container` at engine.bats:200-220 verifies that path. But the symmetric case (desired=running, actual=paused) falls into `ContainerStart`, which Docker rejects for paused containers with an error along the lines of "container is paused, unpause before starting". The user's intent — "I want this container running" — is silently dropped, and there is no test covering this direction. This is asymmetric with the documented "start/stop" semantics in `docs/design/api_containers.md:130-143`.

Fix: dispatch `ContainerUnpause` when `actual == paused` and `desired == running`:

```diff
 switch desired {
 case containersv1alpha1.ContainerStatusRunning:
-    log.Info("Starting container", "id", c.Name)
-    _, err := w.cli.ContainerStart(ctx, c.Name, dockerclient.ContainerStartOptions{})
-    return err
+    if actual == containersv1alpha1.ContainerStatusPaused {
+        log.Info("Unpausing container", "id", c.Name)
+        _, err := w.cli.ContainerUnpause(ctx, c.Name, dockerclient.ContainerUnpauseOptions{})
+        return err
+    }
+    log.Info("Starting container", "id", c.Name)
+    _, err := w.cli.ContainerStart(ctx, c.Name, dockerclient.ContainerStartOptions{})
+    return err
```

Add a test `patching spec.state=running unpauses a paused container` mirroring the existing stop-paused test. (important, gap)

---

**I3. Transient Docker disconnect deletes every mirror resource — `pkg/controllers/app/engine/controllers/engine_reconciler.go:95-107`**

```go
// Watcher died unexpectedly (Docker socket gone, engine restarted, etc.).
// Clean up mirror resources and set the condition to Disconnected.
if watcherDied {
    log.Info("Docker watcher died, cleaning up mirror resources")
    if err := r.cleanupMirrorResources(ctx); err != nil {
        log.Error(err, "Failed to clean up mirror resources")
        return ctrl.Result{}, err
    }
    if err := r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "Disconnected", "Container engine connection lost"); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}
```

Any transient error on the Docker event stream (`docker_watcher.go:117-123`) causes `run()` to return, flips the watcher to !alive on the next reconcile, and then `cleanupMirrorResources` wipes every `Container`/`Image`/`Volume`/`ContainerNamespace` in `rancher-desktop`. The subsequent reconcile starts a new watcher and `fullSync` recreates them — but every downstream client (the front end, `rdd ctl get`, anything watching these resources) sees the whole universe blink out and reappear. UIDs change, watch clients see spurious `DELETED` then `ADDED` events, and any external tool tracking resource lifetimes loses continuity. The same pattern applies on VM restart: there's no attempt to reconcile-in-place.

This is particularly painful given the watcher has no backoff or single-shot retry before declaring defeat — one blip of a Docker daemon restart (e.g., the user ran `systemctl restart docker` inside the guest) triggers a full sweep.

Fix (minimum): add a short retry-in-place path in `Reconcile` before cleanup — re-ping Docker, and if it responds, restart the watcher with a fresh event stream *without* deleting mirror resources. Only sweep if the watcher has been dead for >N seconds or the underlying App condition changed. Alternatively, skip the aggressive cleanup and rely on the next `fullSync` to reconcile mirror state to Docker reality. The current `cleanupMirrorResources` path should remain for the "VM stopped" case but not for the "watcher died, VM still running" case. (important, gap)

---

**I4. `cleanupMirrorResources` runs `deleteAllOfType` on every idle reconcile — `pkg/controllers/app/engine/controllers/engine_reconciler.go:115-134`**

```go
wantWatcher := running && engineIsDocker
if !wantWatcher {
    if watcherRunning {
        ...
        r.stopWatcher()
    }
    if err := r.cleanupMirrorResources(ctx); err != nil {
        log.Error(err, "Failed to clean up mirror resources")
        return ctrl.Result{}, err
    }
    switch {
    case running && !engineIsDocker:
        return ctrl.Result{}, r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "NotApplicable", ...)
```

The in-code comment justifies this: "Cleanup runs on every reconcile in that state so transient errors are retried on the next requeue." That's defensible for the "error happened" case, but in the common "VM is stopped" and "containerd backend" steady states *every* incoming App event triggers four `List` calls (Container/Image/Volume/ContainerNamespace), each returning empty but still hitting the apiserver. Given the engine reconciler also watches Container/Image/Volume as trigger sources (engine_reconciler.go:421-423) and any update on any of them triggers an App reconcile, a single mutating operation on an unrelated resource in `rancher-desktop` (e.g., from a future controller) would cascade into four list calls.

Fix: check the current `ContainerEngineReady` condition — if it's already `False/Stopped` or `True/NotApplicable` AND the App's `ObservedGeneration` hasn't advanced, short-circuit with no list calls. Only sweep when transitioning into the not-wanted state (watcher was running) or on retry after a failed sweep (tracked via a dedicated "CleanupPending" condition). (important, gap)

---

**I5. Container spec never returns to `unknown` after the creator was removed and re-synced — `pkg/controllers/app/engine/controllers/sync_containers.go:101-112`**

```go
var existing containersv1alpha1.Container
err := w.k8s.Get(ctx, client.ObjectKey{Name: inspect.ID, Namespace: apiNamespace}, &existing)
if apierrors.IsNotFound(err) {
    applyConfig := containersv1alpha1apply.Container(inspect.ID, apiNamespace).
        WithFinalizers(mirrorFinalizer).
        WithSpec(containersv1alpha1apply.ContainerSpec().WithState(containersv1alpha1.ContainerStatusUnknown))
    if err := w.k8s.Apply(ctx, applyConfig, client.ForceOwnership, client.FieldOwner(controllerName)); err != nil {
        return fmt.Errorf("failed to create container %s: %w", inspect.ID, err)
    }
} else if err != nil {
    return fmt.Errorf("failed to get container %s: %w", inspect.ID, err)
}
```

The `Get`-then-`Apply` pattern is explained by the comment: spec must be written exactly once to avoid fighting with user intent. That's sound, but the read is racy. Two stream events for the same container ID can land close together: thread A gets NotFound, thread B gets NotFound, both race into `WithSpec(unknown)`, and both `Apply` — harmless here because both writes are identical. But under the same pattern, a concurrent user patch `spec.state=running` (which landed between A's Get and A's Apply) gets clobbered by A's Apply (the engine-controller field owner reasserts `unknown` with `ForceOwnership`). The window is small but real.

The author is protected in the *read-existing* path because `applyContainer` only applies the status subresource on subsequent syncs. But the *creation* path sets spec+finalizer with `ForceOwnership` which explicitly steals ownership of `spec.state` even if it was set by another field owner a moment ago.

Fix: drop `ForceOwnership` on the creation apply, or move the finalizer + spec write into a single `Create` call via the typed client instead of SSA with `ForceOwnership`. The narrower fix: use `client.FieldOwner(controllerName)` without `ForceOwnership` on the initial create; let the second apply fail-safe if a user sneaked a patch in. (important, gap)

---

**I6. `watchCondition` conflates user cancellation and timeout — `cmd/rdd/set.go:506-512`**

```go
if _, err := watchtools.UntilWithSync(ctx, lw, &unstructured.Unstructured{}, precondition, condition); err != nil {
    if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
        return errors.New("timed out waiting for App state")
    }
    return fmt.Errorf("failed to watch App: %w", err)
}
```

When the user sends SIGINT (Ctrl+C) mid-wait, `ctx` is cancelled and `watchtools` returns `context.Canceled`. The CLI then reports "timed out waiting for App state", which is false — the user cancelled. A user debugging why `rdd set` "timed out" after 2 seconds will be misled into investigating the VM instead of realizing they themselves interrupted the command.

Fix: distinguish the two cases, and preserve the actual cause:

```diff
 if _, err := watchtools.UntilWithSync(ctx, lw, &unstructured.Unstructured{}, precondition, condition); err != nil {
-    if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
+    if errors.Is(err, context.DeadlineExceeded) {
         return errors.New("timed out waiting for App state")
     }
+    if errors.Is(err, context.Canceled) {
+        return errors.New("wait cancelled")
+    }
     return fmt.Errorf("failed to watch App: %w", err)
 }
```

(important, regression — introduced by commit 762c1e8)

---

**I7. `rdd set` for property changes that trigger a restart returns before the restart completes — `cmd/rdd/set.go:439-448`**

```go
func waitForDesiredState(ctx context.Context, config *rest.Config, properties map[string]string, timeout time.Duration) error {
    runningVal, ok := properties["running"]
    if !ok {
        return nil
    }
    ...
    if runningVal == "true" {
        logrus.Info("Waiting for container engine to be ready")
        return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
            return conditionStatus(obj, "ContainerEngineReady") == metav1.ConditionTrue
        })
```

`rdd set containerEngine.name=containerd running=true` (VM already running) races with a stale `ContainerEngineReady=True/Connected` (from moby). The precondition check at set.go:489-495 fires on the stale cached state and returns immediately, leaving the user with a half-restarted VM even though `rdd set` succeeded. The engine.bats test for `containerd backend reports ContainerEngineReady=NotApplicable and skips mirroring` at engine.bats:266-274 explicitly documents this: "Stop first so there is no stale True/Connected from moby to satisfy the CLI wait below before the engine reconciler has run." That comment is the confession.

The author's TODO at set.go:440-443 acknowledges this, but the note is for a future `Reconciled` condition. In the meantime, users will hit this: either `rdd set` shouldn't claim to "wait for the desired state by default" when it doesn't, or the wait should be disabled for multi-property sets that include non-`running` properties.

Fix: either narrow the documented behavior ("waits for engine readiness only when `running` is the only property") or, short-term, skip the precondition shortcut when the properties include anything besides `running`:

```diff
 if runningVal == "true" {
-    logrus.Info("Waiting for container engine to be ready")
-    return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
-        return conditionStatus(obj, "ContainerEngineReady") == metav1.ConditionTrue
-    })
+    // If any other property was set alongside running=true, the VM may
+    // be in the middle of reacting to it. Wait a short time for the
+    // condition to drop before waiting for it to come back.
+    ...
 }
```

(The proper fix is the TODO'd `Reconciled` condition.) (important, gap)

---

**I8. Commit 4db3f20 bundles unrelated changes — commit message misrepresents scope**

```
Allow container spec.state transitions via webhook

Relax the container webhook to permit spec.state changes between
"created" and "running". This lets the engine controller start and
stop Docker containers when a user patches the K8s resource.
All other spec fields remain immutable.
```

The diff touches 8 files / 141 insertions and does much more than the title says: it introduces the new `unknown` enum value (container_types.go:56-61), changes the CRD default from `running` to `unknown` (container_types.go:59, crd.yaml:286), adds the `reconcileContainerState` flow with its own bespoke paused-container handling (docker_watcher.go:252-283), adds `processContainerFinalizers`/`processImageFinalizers`/`processVolumeFinalizers` (engine_reconciler.go), reworks `sync_containers.go` to gate spec-apply on Get, and adds the `syncContainerNamespace` no-finalizer comment (sync_volumes.go). The "just a webhook change" framing buries load-bearing behavioral changes in the commit that a reader looking at `git log --oneline` would miss — and `git blame` on those chunks points at a commit whose message does not mention them.

Commit 762c1e8 "Make rdd set wait for the desired state by default" has the same problem, even worse: 198 lines of engine_reconciler.go churn, 104 of docker_watcher.go, and the `DockerSocket` rewrite from `ShortDir()` to hardcoded `.rd2` — all in a commit advertised as a CLI wait feature. Its body also says "Use `--wait=false` to skip waiting" but the flag added is `--timeout=0`, not `--wait=false`.

Fix: in a future revision, split the commits along responsibility lines, and correct the 762c1e8 message to reflect the actual flag name before the PR lands. (important, regression)

---

### Suggestions

**S1. `removeMirrorResource` mutates its caller-provided `obj` across retries — `pkg/controllers/app/engine/controllers/docker_watcher.go:232-250`**

```go
func (w *dockerWatcher) removeMirrorResource(ctx context.Context, obj client.Object, name string) error {
    key := client.ObjectKey{Name: name, Namespace: apiNamespace}
    retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        if err := w.k8s.Get(ctx, key, obj); err != nil {
            ...
        }
        if !removeFinalizer(obj, mirrorFinalizer) {
            return nil
        }
        return w.k8s.Update(ctx, obj)
    })
```

`obj` is passed by the caller; subsequent retry iterations reuse the same pointer. That works in practice because controller-runtime's `Get` resets-and-populates, but it depends on an implementation detail — controller-runtime doesn't formally guarantee zeroing. Callers like `syncAllContainers` at sync_containers.go:60-66 pass a `&k8sContainers.Items[i]` slice element, which now has the previous list state pre-populated. A fresher pattern is to DeepCopyObject() inside the retry closure, matching the one already used by `deleteAllOfType` at engine_reconciler.go:257. (suggestion, gap)

---

**S2. `DockerSocket()` hardcode of `.rd2` is a regression from commit 87f1df1 — `pkg/instance/instance.go:129-137`**

```go
var DockerSocket = sync.OnceValue(func() string {
    home, err := os.UserHomeDir()
    if err != nil {
        panic(fmt.Errorf("could not get home directory: %w", err))
    }
    dir := filepath.Join(home, ".rd2")
    _ = os.MkdirAll(dir, 0o700)
    return filepath.Join(dir, "docker.sock")
})
```

Commit 87f1df1 (as originally authored) used `filepath.Join(ShortDir(), "docker.sock")`, which is the correct per-instance path. Commit 762c1e8 regressed it to a hardcoded `.rd2` to match the (also-hardcoded) Lima template path. The TODO acknowledges this, but the two hardcodes need to be fixed together — either both parameterized on the instance suffix, or a pair of tests that fails the build if they ever diverge. Also, `os.MkdirAll` here silently swallows errors (`_ =`); while the directory is typically pre-created by `rdd svc start`, the silent failure is a trap. (suggestion, regression)

Fix:

```diff
 var DockerSocket = sync.OnceValue(func() string {
-    home, err := os.UserHomeDir()
-    if err != nil {
-        panic(fmt.Errorf("could not get home directory: %w", err))
-    }
-    dir := filepath.Join(home, ".rd2")
-    _ = os.MkdirAll(dir, 0o700)
-    return filepath.Join(dir, "docker.sock")
+    // TODO: switch back to filepath.Join(ShortDir(), "docker.sock")
+    // once lima-template.yaml derives the socket path from the instance suffix.
+    return filepath.Join(ShortDir(), "docker.sock")
 })
```

Reverting to `ShortDir()` plus fixing the lima template together is cleaner than doubling the hardcode.

---

**S3. `docker_watcher.run()` has no panic recovery — `pkg/controllers/app/engine/controllers/docker_watcher.go:99-131`**

```go
func (w *dockerWatcher) run(ctx context.Context) {
    log := logf.FromContext(ctx).WithName("docker-watcher")
    defer close(w.done)
    ...
```

Unlike reconcilers (controller-runtime wraps them in `HandleCrash`), this raw goroutine has no panic recovery. A nil-pointer in `handleEvent` — e.g., an unexpected Moby event shape — takes down the whole app-controller process. Add a `defer utilruntime.HandleCrash(...)` or at minimum `defer func() { if r := recover(); r != nil { log.Error(...) } }()`. (suggestion, gap)

---

**S4. `deleteContainer`/`deleteImage`/`deleteVolume` pass `Force: true` for containers and volumes but not for images — `pkg/controllers/app/engine/controllers/docker_watcher.go:289-320`**

```go
func (w *dockerWatcher) deleteContainer(ctx context.Context, id string) error {
    _, err := w.cli.ContainerRemove(ctx, id, dockerclient.ContainerRemoveOptions{Force: true})
    ...
func (w *dockerWatcher) deleteImage(ctx context.Context, img *containersv1alpha1.Image) error {
    ...
    _, err := w.cli.ImageRemove(ctx, ref, dockerclient.ImageRemoveOptions{})
    ...
func (w *dockerWatcher) deleteVolume(ctx context.Context, name string) error {
    _, err := w.cli.VolumeRemove(ctx, name, dockerclient.VolumeRemoveOptions{Force: true})
```

Containers and volumes use `Force: true`, meaning `rdd ctl delete container/foo` force-stops a running container without giving processes a chance to exit gracefully. Images use no Force, so in-use images are (correctly) kept — but the inconsistency is surprising and undocumented. Document the policy in the comment, or consider `Force: false` for containers (the user can patch `spec.state=created` first, then delete, which is the documented K8s-idiomatic pattern). (suggestion, gap)

---

**S5. `reconcileContainerSpecs` logs but swallows errors — `pkg/controllers/app/engine/controllers/engine_reconciler.go:297-307`**

```go
for i := range containers.Items {
    c := &containers.Items[i]
    if c.DeletionTimestamp != nil {
        continue
    }
    if err := w.reconcileContainerState(ctx, c); err != nil {
        logf.FromContext(ctx).Error(err, "Failed to reconcile container state",
            "container", c.Name)
    }
}
return nil
```

Errors from `reconcileContainerState` (Docker start/stop failures) are logged but not surfaced — the function always returns nil. In contrast, `processContainerFinalizers` at engine_reconciler.go:334-356 collects per-item errors with `errors.Join` and returns them, triggering a requeue. The asymmetry means a user-patched spec.state change that fails (e.g., Docker rejected "unpause" because the container doesn't exist any more) never gets retried — it just ends up in the logs. (suggestion, gap)

---

**S6. `pgid` log field is misleading on Windows — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:479-485`**

```go
// pgid equals pid because SetGroup made the hostagent a new group leader.
// Recording it lets post-mortem leak detection attribute orphaned
// grandchildren (qemu) back to the hostagent that spawned them.
logger.Info("Hostagent started, watcher active",
    "instance", limaVM.Name,
    "pid", haCmd.Process.Pid,
    "pgid", haCmd.Process.Pid)
```

On Unix the claim holds (Setpgid=true → pgid=pid). On Windows there is no POSIX process group; `SetGroup` sets `CREATE_NEW_PROCESS_GROUP` which is a different console-control-event concept — logging the value as "pgid" is a misnomer. Since `bats-with-timeout.sh` is what consumes the log correlation, and it's only exercised on Linux/macOS runners in CI (not Windows), this is harmless today, but it will bite when Windows CI is added. Consider either reading the real pgid via `syscall.Getpgid(haCmd.Process.Pid)` on Unix or labeling the field "groupPID" with a platform-neutral meaning. (suggestion, enhancement)

---

**S7. BATS engine tests use `[ -n "${output}" ]` instead of `assert_output` / `refute_output` — `bats/tests/32-app-controllers/engine.bats:142,152`**

```bash
image_ref=${output}
[ -n "${image_ref}" ]
...
[ -n "${output}" ]
```

Per the review prompt's style rule: "Prefer `assert_output` over manual `[[ "${var1}" == "${var2}" ]]` comparisons — `assert_output` shows both expected and actual values on failure, while `[[ ]]` fails silently." `[ -n ... ]` has the same "fails silently" problem. Swap to `refute_output ""` or the equivalent.

```diff
 run -0 rdd ctl get image --namespace="${NAMESPACE}" \
     --field-selector "status.repoTag=busybox:latest" -o name
+refute_output ""
 image_ref=${output}
-[ -n "${image_ref}" ]
```

(suggestion, gap)

---

**S8. `applyContainer` shallow-copies a struct with shared slice state — `pkg/controllers/app/engine/controllers/sync_images.go:128-138`**

```go
statusCopy := *statusApply
if err := w.applyImage(ctx,
    containersv1alpha1apply.Image(name, apiNamespace).
        WithFinalizers(mirrorFinalizer),
    statusCopy.
        WithRepoTag(tag).
        WithNamespace(containerNamespace),
); err != nil {
```

`statusCopy := *statusApply` is a shallow copy; `statusApply.RepoDigests` (a slice), `statusApply.Labels` (a map), and all the `*string`/`*int64` fields are shared across iterations. It happens to be safe today because the loop only calls `WithRepoTag`/`WithNamespace` — neither mutates shared state — but it's a trap: a future edit that adds `WithLabels(extraLabels)` or `WithRepoDigests(extra)` inside the loop will corrupt all previous iterations' payloads via in-place `append`/map mutation.

Fix: replace the shallow copy with a fresh `ImageStatus()` constructor call inside the loop, duplicating only the fields that are stable, or (better) factor the common fields into a helper that returns a fresh `ImageStatusApplyConfiguration` each time. (suggestion, gap)

---

## Design Observations

### Concerns

**O1. Engine controller cleanup policy is too aggressive — `(in-scope)`**

The current design treats "watcher died" identically to "VM stopped": both sweep every mirror resource. That makes transient Docker restarts (a blip on the event stream) as disruptive as an intentional VM shutdown from the user's perspective. A more resilient approach would split the two concerns:

- **VM stopped / engine backend changed**: sweep everything (current behavior).
- **Transient watcher failure**: stop the watcher, leave mirror resources in place, attempt to reconnect with bounded retries. On successful reconnect, `fullSync` naturally reconciles drift. Only sweep if reconnection fails after a timeout.

This preserves resource identity (UIDs) across transient outages and avoids flooding watchers with spurious delete/add events. The change is local to `Reconcile` in `engine_reconciler.go` — the `watcherDied` branch becomes a retry loop.

**O2. `unknown` is a spec state rather than an absence of intent — `(future)`**

The engine controller uses `spec.state: unknown` to mean "I am just mirroring, I have no opinion". This mixes two concepts into one field: (a) what the user *wants* (running/created) and (b) whether anyone has an opinion at all. A cleaner model would be a nullable `spec.state` (pointer) with `nil = no opinion` and enforced via OpenAPI, or splitting the field into `spec.desired.state` (optional) and a separate `status.managedBy` record. The current shape works but forces the webhook to enforce a state transition rule that is really just "stop trying to be unknown once a user expressed intent" — future contributors will trip over this.

**O3. `cleanupMirrorResources` is brittle against cross-controller coexistence — `(future)`**

`deleteAllOfType` iterates every resource in `rancher-desktop` namespace and strips the mirror finalizer. That's fine as long as the engine controller is the only thing writing `Container`/`Image`/`Volume` in that namespace, but the mock controller (`pkg/controllers/mock/mock_controller.go:27`) uses the exact same `apiNamespace = "rancher-desktop"`. If the two ever coexist (e.g., the developer runs the mock controller against the same embedded service instance for demoing), the engine controller's cleanup will wipe the mock data. Filter `cleanupMirrorResources` by a label selector (`engine.rancherdesktop.io/mirror=true` or the field owner) so it only deletes what it created.

### Strengths

- The comment block at sync_containers.go:87-93 explaining *why* spec is written only on create (preventing SSA fight with user patches) is the kind of insight most code lacks. Keep this level of explanation as the pattern evolves.
- `processXxxFinalizers` collect per-item errors with `errors.Join` instead of short-circuiting — one stuck object doesn't block the rest. This is the correct pattern for sweep loops over lists.
- The `docker_watcher.alive()` check paired with the explicit `stop()` + `<-w.done` handshake avoids the common mistake of goroutine ownership ambiguity.
- The test `patching spec.state=created stops a paused container` explicitly documents the paused-container dispatch logic — it's a great example of tests-as-specification.

---

## Testing Assessment

Untested scenarios, in rough order of risk:

1. **Paused → running transition** (I2): no test. High risk of silently-dropped user intent.
2. **Last-tag untag of in-use image → dangling K8s representation** (I1): no test. `deleting an in-use Image keeps the finalizer` covers a different code path (K8s-initiated deletion).
3. **Transient Docker disconnect** (I3): no test. The cleanup-then-recreate cycle isn't exercised; a test would need to pause Docker briefly via `docker pause <rdd-daemon>` or kill the docker socket.
4. **Multi-property `rdd set` while VM running** (I7): no test. The `containerd backend` test documents the stale-condition race but works around it.
5. **Concurrent user patch + engine reconcile** (I5): no test. The test harness would need to patch `spec.state=running` the moment the engine controller observes a new container via the event stream.
6. **Shared image ID with multiple tags being progressively untagged**: no test. A single tag untag vs. all-tags-untag path has different behavior.
7. **Container state transition during VM shutdown**: no test. If a user patches `spec.state=running` mid-shutdown, what happens?
8. **Engine reconciler with App having stale `ContainerEngineReady` from a prior backend**: the existing test captures this but only for containerd; symmetric moby→containerd→moby isn't covered.

---

## Documentation Assessment

- `docs/design/api_containers.md` is well updated with engine mirroring semantics and finalizer lifecycle. The example at docs/design/api_containers.md:86-128 correctly shows `spec.state: unknown` — good.
- The `State is the desired state of the container` comment at `pkg/apis/containers/v1alpha1/container_types.go:56` no longer describes the `unknown` case. Update to say: "State is the desired state of the container; `unknown` means the engine controller is mirroring only and will take no start/stop action."
- The `rdd set` help text at `cmd/rdd/set.go:68-72` says "Use `--timeout=0` to skip waiting" but the commit message for 762c1e8 says "Use `--wait=false`" — if a user searches the commit history for `--wait=false` they'll be confused.
- The TODO at `cmd/rdd/set.go:440-443` is informative but doesn't flag I7 (the stale-condition race on multi-property sets). Add a note about the immediate hazard, not just the long-term fix.
- `docs/design/api_containers.md:306-312` still refers to the old finalizer-based untag semantics ("Delete the Image object; the finalizer will untag the image. If all tags of an image are removed, _and_ it is not in use by a container (running or not), then the image itself is removed"). This contradicts the current code: the controller deletes the K8s resource directly on Docker untag events. Reconcile the design doc and the implementation — the current code better matches the intent, but the doc change in `ba6caf7` left this stale.

---

## Commit Structure

- **4db3f20** "Allow container spec.state transitions via webhook" → bundles unrelated changes (see I8). Title says "webhook"; diff adds the entire `unknown` enum, defaults change, reconciler state-machine, and finalizer handlers. Should have been split into: (a) add `unknown` enum + default change, (b) webhook relaxation, (c) engine-side reconcile state handling.
- **762c1e8** "Make rdd set wait for the desired state by default" → same problem, larger. ~300 lines of engine controller churn unrelated to the CLI feature. Commit message also says `--wait=false` but the flag is `--timeout=0`.
- **87f1df1** "Add DockerSocket() helper" introduces a correct `ShortDir()`-based implementation that 762c1e8 silently regresses to hardcoded `.rd2`. That's the kind of regression that bisect makes very hard to track down because the diff spans an unrelated-sounding commit.
- **50bdc15** "ci: capture support bundle when BATS targets time out" and **7c8f356** "ci: always capture support bundle and log hostagent pgid" are a pair: the second immediately adjusts the first to always-run (which was the obviously-better design). This pair should have been a single commit — the first one is effectively a fixup of its own specification.
- **9669b4b** "bats: close fds 3 and 4 when invoking rdd" is crisp and self-contained; good example.
- **ba6caf7** "Document engine mirroring ..." is also crisp and self-contained.

---

## Acknowledged Limitations

- **`DockerSocket()` hardcodes `.rd2`** — TODO at `pkg/instance/instance.go:127-128`: "use `ShortDir()` once the Lima template derives the socket path from the instance suffix instead of hardcoding `.rd2`." Paired with the Lima template hardcode at `pkg/controllers/app/app/lima-template.yaml:30`. Tests and author are aware; the BATS test carries a parallel TODO at `bats/tests/32-app-controllers/engine.bats:12-14`. *(S2 is in the findings above because it's a regression from the previous correct state, not because the author didn't acknowledge the hardcode.)*
- **`rdd set` doesn't wait for VM restarts triggered by non-`running` property changes** — TODO at `cmd/rdd/set.go:440-443` proposes a "Reconciled" condition with `observedGeneration` as the long-term fix. *(I7 is still in findings because the race is user-visible today, not just a future polish item.)*
- **`reconcileContainerSpecs` + `processFinalizers` run O(N) per child event** — comment at `engine_reconciler.go:149-155` acknowledges: "The long-term fix is to split these into per-resource reconcilers with watch predicates." Not worth promoting to a finding given the typical scale (tens of containers on dev machines).
- **`moby` ContainerNamespace has no mirror finalizer by design** — comment at `pkg/controllers/app/engine/controllers/sync_volumes.go:54-59` explains the reasoning, and the engine.bats test `deleting containernamespace/moby completes without a finalizer hang` verifies the behavior.
- **Containerd backend reports `ContainerEngineReady=True/NotApplicable`** — by design, so that `rdd set running=true` with containerd returns promptly instead of waiting forever. The test comment at engine.bats:266-273 documents this explicitly.
- **Windows hostagent pgid logging** — the comment at `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:479-481` assumes Unix Setpgid semantics; I flagged it as S6 not because the author is unaware (they're recording a known detail) but because the comment wording breaks down on Windows and the code path may eventually land there.
- **`ContainerNamespace` delete webhook is a no-op with warning** — TODO at `pkg/controllers/containers/containernamespace/container_namespace_webhooks.go:25-28` says "We should validate that... we do not delete the `moby` namespace." Not introduced by this PR; pre-existing.

---

## Coverage Summary

- `.github/workflows/bats.yaml` — Trivial (adds `docker` to `brew install`).
- `bats/Makefile` — Reviewed, no issues. The `BATS = $(if ...)` recursive-vs-simple expansion change to use `=` for `$@` resolution is correct.
- `bats/helpers/commands.bash` — Reviewed, no issues. The `3>&- 4>&-` fix is crisp and well-commented.
- `bats/tests/32-app-controllers/engine.bats` — S7.
- `cmd/app-controller/main.go` — Trivial (adds the engine controller import).
- `cmd/rdd/set.go` — I6, I7.
- `docs/design/api_containers.md` — stale untag semantics documented in Documentation Assessment.
- `go.mod` / `go.sum` — Reviewed, no issues. New dependencies (`moby/client`, `containerd/errdefs`, `docker/go-connections`) are appropriate for the engine controller.
- `pkg/apis/containers/v1alpha1/container_types.go` — I8 (part of bundled commit); doc comment noted in Documentation Assessment.
- `pkg/apis/containers/v1alpha1/volume_types.go` — Reviewed, no issues. The `status.name` selectable field is needed by the engine.bats tests.
- `pkg/controllers/app/app/lima-template.yaml` — S2 (hardcode paired with instance.go); vmOpts.vz.rosetta move is correct.
- `pkg/controllers/app/engine/controller.go` — Reviewed, no issues. Standard controller registration boilerplate.
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — I1, I2, S1, S3, S4.
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I3, I4, I5, S5.
- `pkg/controllers/app/engine/controllers/sync_containers.go` — I5.
- `pkg/controllers/app/engine/controllers/sync_containers_test.go` — Reviewed, no issues. Good coverage of the parseContainerName edge cases.
- `pkg/controllers/app/engine/controllers/sync_images.go` — S8, plus the dangling-image code path referenced in I1.
- `pkg/controllers/app/engine/controllers/sync_volumes.go` — Reviewed, no issues. `syncContainerNamespace` comment about finalizer absence is helpful.
- `pkg/controllers/containers/container/container_webhooks.go` — Reviewed, no issues. The state-transition relaxation is sound; the dead-code `default` in the switch is defense-in-depth.
- `pkg/controllers/containers/container/crd.yaml` — Reviewed, no issues. The `allOf` enum intersection is a consequence of the type-level + field-level annotation combination and is correct.
- `pkg/controllers/containers/volume/crd.yaml` — Reviewed, no issues. The new `.status.name` selectable field matches `volume_types.go`.
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — S6.
- `pkg/instance/instance.go` — S2.
- `pkg/service/cmd/service.go` — Trivial (adds the engine controller import for embedded mode).
- `scripts/bats-with-timeout.sh` — Reviewed, no issues. Nice touch recording process state non-destructively before SIGKILLing. The `matches_our_instance` pattern match is adequate given the known instance suffixes. (Noting: pedantically, if an instance suffix happens to be a substring of another instance's suffix, the scope check is over-broad; in practice suffixes are distinct enough that this is academic.)
---

### [Codex] GPT 5.4

### Executive Summary

This PR adds a new engine controller that mirrors Docker containers, images, and volumes into the Containers API, teaches `rdd set` to wait for readiness by default, and adds BATS coverage plus CI hang-diagnostics helpers. The main issues are in the new watcher lifecycle: image `untag` events are decoded incorrectly, the watcher has a bootstrap race that can miss engine mutations during startup, and failed `spec.state` transitions are logged but never retried.

### Findings

**Critical Issues**

None.

**Important Issues**

I1. **Image `untag` events never remove the stale tag mirror** — `pkg/controllers/app/engine/controllers/docker_watcher.go:191-193` (important, regression)

```go
case events.ActionUnTag:
	log.V(1).Info("Image untagged", "id", msg.Actor.ID)
	return w.removeImageByTag(ctx, msg.Actor.ID)
```

At `pkg/controllers/app/engine/controllers/docker_watcher.go:191-193`, `handleImageEvent()` passes `msg.Actor.ID` into `removeImageByTag()`, but `removeImageByTag()` at `pkg/controllers/app/engine/controllers/sync_images.go:179-186` matches against `.status.repoTag`, not the image ID. I verified the upstream Moby event contract before flagging this: the classic dockerd delete path emits `untag` with the image ID as the event ID, so the comparison at `pkg/controllers/app/engine/controllers/sync_images.go:185` never matches. The practical result is that `docker rmi repo:tag` leaves the old `Image` object behind indefinitely; there is no later full resync to heal it unless the watcher restarts.

Fix: treat `untag` as a per-image reconciliation, not a direct tag delete. Inspect the image by ID, apply the current tag set, then delete any K8s `Image` objects for that same `.status.id` whose `.status.repoTag` is no longer present. If the inspect returns not found, fall back to `removeImagesByID()`.

```diff
case events.ActionUnTag:
-	return w.removeImageByTag(ctx, msg.Actor.ID)
+	return w.syncImageAndPruneStaleTags(ctx, msg.Actor.ID)
```

I2. **The watcher can miss Docker mutations during bootstrap** — `pkg/controllers/app/engine/controllers/docker_watcher.go:69-76` (important, regression)

```go
// Perform initial full sync before starting the event stream.
if err := w.fullSync(watchCtx); err != nil {
	...
}

go w.run(watchCtx)
```

`newDockerWatcher()` does the initial `fullSync()` at `pkg/controllers/app/engine/controllers/docker_watcher.go:69-74` before `run()` even subscribes to Docker events, and `run()` only opens the event stream at `pkg/controllers/app/engine/controllers/docker_watcher.go:108-110`. After startup, `EngineReconciler.Reconcile()` marks the engine connected and returns done at `pkg/controllers/app/engine/controllers/engine_reconciler.go:143-146`, so there is no immediate second sync to close that gap. Any container/image/volume create/delete that happens during the `fullSync()` window is therefore invisible unless that object later changes again. This is especially plausible during VM startup, where dockerd and Kubernetes can create containers immediately after the App’s `Running` condition flips.

Fix: eliminate the gap by subscribing before or at the start of the sync and replaying buffered events afterward, or by recording a `since` timestamp before `fullSync()` and opening the event stream from that point so post-timestamp mutations are replayed after the snapshot.

I3. **Failed `spec.state` transitions stall after one attempt** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:302-305` (important, regression)

```go
if err := w.reconcileContainerState(ctx, c); err != nil {
	logf.FromContext(ctx).Error(err, "Failed to reconcile container state",
		"container", c.Name)
}
```

`reconcileContainerState()` returns raw Docker API errors at `pkg/controllers/app/engine/controllers/docker_watcher.go:273-280`, but `reconcileContainerSpecs()` only logs them at `pkg/controllers/app/engine/controllers/engine_reconciler.go:302-305` and then returns success. Because `Reconcile()` falls through to `return ctrl.Result{}, nil` at `pkg/controllers/app/engine/controllers/engine_reconciler.go:163`, a failed start/stop request gets exactly one attempt: the patch that changed `.spec.state` produced the only watch event, and no requeue is requested when Docker rejects the operation. The resource can sit forever with `spec.state != status.status` until some unrelated event happens.

Fix: accumulate these errors and return them from `reconcileContainerSpecs()` so controller-runtime requeues, or explicitly `RequeueAfter` on failure. If you want best-effort handling across multiple containers, collect per-container errors with `errors.Join()` instead of returning on the first failure.

```diff
+var errs []error
for i := range containers.Items {
	...
	if err := w.reconcileContainerState(ctx, c); err != nil {
-		logf.FromContext(ctx).Error(err, "Failed to reconcile container state", "container", c.Name)
+		errs = append(errs, fmt.Errorf("container %s: %w", c.Name, err))
	}
}
-return nil
+return errors.Join(errs...)
```

**Suggestions**

S1. **The design doc overstates containerd mirroring support** — `docs/design/api_containers.md:35-38` (suggestion, regression)

```md
2. Creates the `rancher-desktop` K8s namespace and the appropriate
   `ContainerNamespace` resource (`moby` for dockerd, engine-specific
   namespaces for containerd).
```

The new doc text at `docs/design/api_containers.md:35-38` says the engine controller creates engine-specific namespaces for containerd, but the implementation does the opposite: `pkg/controllers/app/engine/controllers/engine_reconciler.go:115-133` shuts the watcher down and reports `ContainerEngineReady=True, reason=NotApplicable` whenever `app.Spec.ContainerEngine.Name != "moby"`. That wording will mislead anyone trying to build against the API from the doc.

Fix: document containerd as explicitly unsupported for mirroring today and move the namespace discussion into a future-work note.

```diff
-2. Creates the `rancher-desktop` K8s namespace and the appropriate
-   `ContainerNamespace` resource (`moby` for dockerd, engine-specific
-   namespaces for containerd).
+2. Creates the `rancher-desktop` K8s namespace and the `moby`
+   `ContainerNamespace` resource when the App uses the moby backend.
+   Containerd mirroring is not implemented yet; the controller reports
+   `ContainerEngineReady` with reason `NotApplicable` in that mode.
```

### Design Observations

**Concerns**

1. `(future)` The engine feature is currently split between one singleton reconciler and one in-memory watcher that together own bootstrap sync, event streaming, desired-state convergence, and finalizer processing. Moving image/container/volume convergence into per-resource reconcilers or typed work queues would remove the global O(N) sweeps, make retries object-scoped instead of best-effort, and eliminate the bootstrap race in I2 by giving each object a durable reconcile path.

**Strengths**

1. The reverse-delete path is designed with the repo’s “no GC” constraint in mind: mirror resources carry an explicit engine finalizer, and `cleanupMirrorResources()` sweeps all mirrored kinds on shutdown instead of assuming owner-reference garbage collection.

2. The new `rdd set` implementation uses a watch-based wait rather than polling, which is the right control-plane pattern for this codebase.

### Testing Assessment

1. `docker rmi <tag>` / Docker-side `untag` flows are not covered. The new tests exercise K8s-initiated image deletion (`bats/tests/32-app-controllers/engine.bats:131-162`), but not the Docker-side path that triggers I1.
2. No test stresses concurrent engine mutations during watcher startup. There is no scenario that creates or deletes containers/images while `newDockerWatcher()` is between `fullSync()` and `cli.Events()`, so I2 would not be caught.
3. No test forces `ContainerStart` / `ContainerStop` to fail and then verifies a retry. The current suite only covers happy-path state transitions (`bats/tests/32-app-controllers/engine.bats:167-207`), not the stalled-error path in I3.
4. The engine tests only run against the hardcoded default socket path and skip Windows/WSL2 entirely, so the new mirroring code still lacks cross-instance and cross-platform coverage.

### Documentation Assessment

S1 should be fixed before merge so the design doc matches the actual non-moby behavior. Separately, the `Image` example at `docs/design/api_containers.md:218-219` still describes the old “random suffix” naming scheme even though the implementation now uses a deterministic hash; that text predates this PR, so I’m noting it here rather than as a regression.

### Commit Structure

This PR is broader than the title suggests. The core engine-watcher feature is bundled with unrelated CI timeout/support-bundle work (`50bdc15`, `762c1e8`, `9669b4b`, `7c8f356`) and separate Lima template adjustments (`32a0109`, `536ccb2`), which makes bisecting and reviewing the feature logic harder than it needs to be.

### Acknowledged Limitations

1. **Code comment**: “`TODO: use ShortDir() once the Lima template derives the socket path from the instance suffix instead of hardcoding ".rd2".`” — `pkg/instance/instance.go:127-128`. This PR also mirrors the same limitation in `bats/tests/32-app-controllers/engine.bats:12-14`. The change makes that limitation more likely to matter because the new engine watcher now depends directly on this socket path for all Docker mirroring.
2. **Code comment**: “`TODO: changes that trigger a VM restart without changing "running" (e.g. containerEngine.name) are not waited on.`” — `cmd/rdd/set.go:425-429`. The test workaround at `bats/tests/32-app-controllers/engine.bats:266-274` is consistent with that rationale; the PR does not resolve it.
3. **Code comment**: the O(N)-per-event sweep in `pkg/controllers/app/engine/controllers/engine_reconciler.go:149-155` is already called out in code with a plan to split work into per-resource reconcilers. The change introduces that cost but also documents it accurately.

### Coverage Summary

- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/Makefile` — Reviewed, no issues
- `bats/helpers/commands.bash` — Reviewed, no issues
- `bats/tests/32-app-controllers/engine.bats` — Reviewed, no issues
- `cmd/app-controller/main.go` — Trivial
- `cmd/rdd/set.go` — Reviewed, no issues
- `docs/design/api_containers.md` — S1
- `go.mod` — Reviewed, no issues
- `go.sum` — Trivial
- `pkg/apis/containers/v1alpha1/container_types.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/volume_types.go` — Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml` — Reviewed, no issues
- `pkg/controllers/app/engine/controller.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — I1, I2, I3
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I2, I3
- `pkg/controllers/app/engine/controllers/sync_containers.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_containers_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_images.go` — I1
- `pkg/controllers/app/engine/controllers/sync_volumes.go` — Reviewed, no issues
- `pkg/controllers/containers/container/container_webhooks.go` — Reviewed, no issues
- `pkg/controllers/containers/container/crd.yaml` — Reviewed, no issues
- `pkg/controllers/containers/volume/crd.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/instance/instance.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Trivial
- `scripts/bats-with-timeout.sh` — Reviewed, no issues

---

### [Gemini] 3.1 Pro

### Executive Summary

This PR introduces the engine controller, which mirrors the Docker container engine state into Kubernetes resources and acts as a translation layer for container lifecycle operations. The core architecture effectively synchronizes resources and translates state, but several critical logic flaws exist. The implementation contains a shell syntax error that breaks the CI wrapper, fails to delete untagged images due to a mismatched Docker event payload field, skips finalizer processing on watcher startup, and swallows state transition errors that break the reconciliation loop. 

### Findings

#### Critical Issues

C1. **Typo in BATS wrapper causes immediate CI failure** — `scripts/bats-with-timeout.sh:231` (critical, regression)

```bash
# Start the command in the background.
"$ @" &
cmd_pid=$!
```

The `$ @` variable expansion contains an errant space. Bash attempts to execute a command literally named `$ @`, which produces a `command not found` error and unconditionally fails the BATS test wrapper immediately.

Fix: Remove the space to correctly expand the arguments array:
```diff
- "$ @" &
+ "$@" &
```

C2. **Untagged image mirrors are never deleted** — `pkg/controllers/app/engine/controllers/docker_watcher.go:193-195` (critical, regression)

```go
	case events.ActionUnTag:
		log.V(1).Info("Image untagged", "id", msg.Actor.ID)
		return w.removeImageByTag(ctx, msg.Actor.ID)
```

For Docker `ActionUnTag` events, `msg.Actor.ID` contains the image's SHA256 identifier, not the tag name. The actual tag name that was removed is located in `msg.Actor.Attributes["name"]`. Since `removeImageByTag` explicitly searches for the tag string against `Status.RepoTag` (e.g. `busybox:latest`), passing the SHA256 ID causes the comparison to fail. The untagged image mirror leaks in Kubernetes until the next full sync or App restart.

Fix: Extract the tag name from the attributes map:
```diff
-		return w.removeImageByTag(ctx, msg.Actor.ID)
+		return w.removeImageByTag(ctx, msg.Actor.Attributes["name"])
```

C3. **Early return bypasses finalizer processing** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:142-144` (critical, regression)

```go
		if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{}, nil
```

Returning early after successfully starting the watcher bypasses the `reconcileContainerSpecs` and `processFinalizers` sweeps for that reconcile pass. This logic relies entirely on the side-effect of `setEngineCondition` triggering a subsequent watch event to eventually process finalizers. If the condition was already `Connected` (e.g., across a transient watcher restart), the update is a no-op, no event is fired, and finalizer processing is indefinitely skipped. Terminating resources will hang until an unrelated Docker event wakes up the controller.

Fix: Remove the `return ctrl.Result{}, nil` to allow the reconciler to fall through and process finalizers immediately:
```diff
 		if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {
 			return ctrl.Result{}, err
 		}
-		return ctrl.Result{}, nil
 	}
```

C4. **Container state transition errors are swallowed** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:343-346` (critical, regression)

```go
		if err := w.reconcileContainerState(ctx, c); err != nil {
			logf.FromContext(ctx).Error(err, "Failed to reconcile container state",
				"container", c.Name)
		}
```

If a state transition like `ContainerStart` or `ContainerStop` fails (e.g., due to a temporary Docker daemon timeout), the error is swallowed and not returned by `reconcileContainerSpecs`. This prevents `controller-runtime` from applying exponential backoff and retrying the operation. The container's state will remain mismatched indefinitely until another unrelated event triggers a reconcile.

Fix: Collect errors during the loop and return them so the reconciler requeues:
```diff
+       var errs []error
 		if err := w.reconcileContainerState(ctx, c); err != nil {
 			logf.FromContext(ctx).Error(err, "Failed to reconcile container state",
 				"container", c.Name)
+           errs = append(errs, err)
 		}
 	}
-	return nil
+	return errors.Join(errs...)
```

#### Important Issues

I1. **Serial blocking calls stall the App reconciler** — `pkg/controllers/app/engine/controllers/docker_watcher.go:273-276` (important, regression)

```go
	case containersv1alpha1.ContainerStatusCreated:
		log.Info("Stopping container", "id", c.Name)
		_, err := w.cli.ContainerStop(ctx, c.Name, dockerclient.ContainerStopOptions{})
		return err
```

`w.reconcileContainerState` makes synchronous, blocking calls to the Docker daemon. `ContainerStop` defaults to a 10-second grace period before dispatching `SIGKILL`. Because `reconcileContainerSpecs` processes containers serially within the main `App` reconciler thread, stopping multiple containers delays the processing of all other events (including image and volume mirroring) for `N * 10` seconds. While the author acknowledged the `List()` overhead (O(N)), unbounded serial blocking is a distinct and severe stall hazard.

Fix: Dispatch `ContainerStop` and `ContainerStart` in background goroutines with context timeouts to unblock the main loop, or expedite the planned split into separate per-resource reconcilers.

I2. **Incremental tagging duplicates dangling image mirrors** — `pkg/controllers/app/engine/controllers/sync_images.go:133-145` (important, regression)

```go
	if len(inspect.RepoTags) > 0 {
		for _, tag := range inspect.RepoTags {
			// Deterministic name from image ID + tag hash (same as mock controller).
			name := fmt.Sprintf("%s-%x",
				sanitizeKubernetesObjectName(inspect.ID),
				sha256.Sum256([]byte(tag)))
```

When an image is pulled by digest (dangling), a mirror resource is created using its SHA256 ID as the name. If the image is subsequently tagged, Docker emits an `ActionTag` event. `applyImageFromInspect` successfully creates the new tagged mirror but fails to delete the pre-existing dangling mirror. Both mirrors persist in Kubernetes until the daemon restarts and a full sync prunes the stale dangling one.

Fix: Inside the `if len(inspect.RepoTags) > 0` block, proactively remove the dangling resource:
```go
w.removeMirrorResource(ctx, &containersv1alpha1.Image{}, sanitizeKubernetesObjectName(inspect.ID))
```

I3. **Concurrent deletions fail the entire full sync** — `pkg/controllers/app/engine/controllers/sync_containers.go:76-78` (important, gap)

```go
func (w *dockerWatcher) syncContainer(ctx context.Context, id string) error {
	result, err := w.cli.ContainerInspect(ctx, id, dockerclient.ContainerInspectOptions{})
	if err != nil {
		return fmt.Errorf("failed to inspect container %s: %w", id, err)
	}
```

During `fullSync` (e.g. `syncAllContainers`, `syncAllImages`, `syncAllVolumes`), the controller lists all objects and iteratively inspects each ID. If a resource is deleted in Docker between the `List` and `Inspect` calls, `Inspect` returns a `NotFound` error. This propagates up, failing the entire `fullSync` operation, tearing down the watcher, and setting the App condition to `ConnectFailed`.

Fix: Catch `NotFound` errors and gracefully skip the item instead of aborting the sync:
```go
	if err != nil {
		if cerrdefs.IsNotFound(err) {
			return nil
		}
		return fmt.Errorf("failed to inspect container %s: %w", id, err)
	}
```
*(Apply to `syncContainer`, `syncImageFromSummary`, and `syncVolume`)*

#### Suggestions

S1. **Concurrent finalizer removal logs benign NotFound errors** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:384-386` (suggestion, regression)

```go
		if removeFinalizer(c, mirrorFinalizer) {
			if err := r.Update(ctx, c); err != nil {
				errs = append(errs, fmt.Errorf("failed to remove finalizer from container %s: %w", c.Name, err))
			}
		}
```

In finalizer processing loops, `r.Update(ctx, c)` can return a `NotFound` error if the Docker watcher's `ActionDestroy` event handler concurrently stripped the finalizer and deleted the mirror object. This race condition is benign but leads to unnecessary error logs and reconciler retries.

Fix: Use `client.IgnoreNotFound(r.Update(ctx, c))` to suppress these benign errors.

### Design Observations

**Concerns**
- **Tear-down on watcher disconnect (future)**: If the Docker event stream drops momentarily, `watcherDied` becomes true. The reconciler unconditionally deletes all mirror resources and sets the condition to `Disconnected`, only to rebuild them via a full sync seconds later. This causes significant K8s API churn and disruptions for clients watching these resources. Consider attempting to restart the watcher directly; if the reconnect succeeds, `fullSync` will naturally prune stale resources and preserve existing ones without a complete tear-down.

**Strengths**
- The backend-agnostic `containerEngine.name` handling accurately surfaces `NotApplicable` for containerd, allowing `rdd set running=true` to unblock gracefully.
- `immutableValidator` effectively prevents drift by forcing explicit transitions and rejecting ad-hoc state changes.
- Generating deterministic RFC-1123 compliant names for volumes while preserving the original Docker name in `status` prevents invalid-name API errors.

### Testing Assessment

Test coverage is fundamentally strong via BATS integration tests that assert on the critical path logic. The highest risk untested scenarios are:
1. Docker daemon restarts or transient event stream socket failures (proving state recovery).
2. Handling of images with `RepoDigests` but no `RepoTags` (dangling images).
3. Concurrent modifications or state toggles on the same container (e.g. rapid sequences of `running` to `created` back to `running`).

### Documentation Assessment

The new `api_containers.md` documentation accurately reflects the implemented `unknown` state transitions and correctly details the scope of the engine controller's synchronization boundaries.

### Acknowledged Limitations

- **Code comment**: "TODO: use .rd${RDD_INSTANCE} once the Lima template derives the socket path from the instance suffix instead of hardcoding ".rd2"." `bats/tests/32-app-controllers/engine.bats:12` and `pkg/instance/instance.go:128`. Hardcoded socket path limits parallel execution if instances share a host.
- **Code comment**: "TODO: changes that trigger a VM restart without changing "running" (e.g. containerEngine.name) are not waited on. This requires a "Reconciled" condition with observedGeneration on the App resource so the CLI can detect when the full reconcile chain has settled." `pkg/controllers/app/engine/controllers/engine_reconciler.go:11`.
- **Code comment**: "The long-term fix is to split these into per-resource reconcilers with watch predicates ("deletion timestamp set", "spec.state changed"); until then the sweep runs on every event." `pkg/controllers/app/engine/controllers/engine_reconciler.go:150`.

### Coverage Summary

- `.github/workflows/bats.yaml`: Trivial
- `bats/Makefile`: Trivial
- `bats/helpers/commands.bash`: Reviewed, no issues
- `bats/tests/32-app-controllers/engine.bats`: Reviewed, no issues
- `cmd/app-controller/main.go`: Trivial
- `cmd/rdd/set.go`: Reviewed, no issues
- `docs/design/api_containers.md`: Reviewed, no issues
- `go.mod`: Trivial
- `go.sum`: Trivial
- `pkg/apis/containers/v1alpha1/container_types.go`: Reviewed, no issues
- `pkg/apis/containers/v1alpha1/volume_types.go`: Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml`: Reviewed, no issues
- `pkg/controllers/app/engine/controller.go`: Trivial
- `pkg/controllers/app/engine/controllers/docker_watcher.go`: C2, I1, I3
- `pkg/controllers/app/engine/controllers/engine_reconciler.go`: C3, C4, S1
- `pkg/controllers/app/engine/controllers/sync_containers.go`: I3
- `pkg/controllers/app/engine/controllers/sync_containers_test.go`: Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_images.go`: I2, I3
- `pkg/controllers/app/engine/controllers/sync_volumes.go`: I3
- `pkg/controllers/containers/container/container_webhooks.go`: Reviewed, no issues
- `pkg/controllers/containers/container/crd.yaml`: Trivial
- `pkg/controllers/containers/volume/crd.yaml`: Trivial
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go`: Reviewed, no issues
- `pkg/instance/instance.go`: Reviewed, no issues
- `pkg/service/cmd/service.go`: Trivial
- `scripts/bats-with-timeout.sh`: C1

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-12 |
| **Commit** | `66eb50e Address review round 2 findings for PR #295` |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: Docker `ActionUnTag` events never remove the stale tag mirror | Fixed (new `reconcileImageByID` re-inspects and prunes stale mirrors) |
| 2 | Important #2: `reconcileContainerState` does not unpause a paused container on spec.state=running | Fixed (dispatch `ContainerUnpause` when actual=paused) |
| 3 | Important #3: Watcher bootstrap race — mutations during `fullSync` are invisible | Fixed (capture Since timestamp before fullSync, pass to Events stream) |
| 4 | Important #4: Transient Docker disconnect wipes every mirror resource | Fixed (skip cleanup on watcherDied; next reconcile restarts watcher in place) |
| 5 | Important #5: `IsNotFound` during `fullSync` Inspect aborts the whole sync | Fixed (swallow NotFound in syncContainer, syncImageFromSummary, syncVolume) |
| 6 | Important #6: `ActionTag` on a dangling image leaks the dangling mirror | Fixed (route ActionTag through reconcileImageByID too) |
| 7 | Important #7: Failed `spec.state` transitions stall after one attempt | Fixed (errors.Join return from reconcileContainerSpecs) |
| 8 | Important #8: `cleanupMirrorResources` runs on every idle reconcile | Fixed (gate on terminal ContainerEngineReady reason) |
| 9 | Important #9: `ForceOwnership` on initial create races with concurrent user patch | Fixed (drop ForceOwnership from the create path) |
| 10 | Important #10: `watchCondition` conflates user cancellation and timeout | Fixed (distinguish context.Canceled from context.DeadlineExceeded) |
| 11 | Important #11: `rdd set` with multi-property changes returns on stale ContainerEngineReady | Commented (expanded TODO with user-visible hazard) |
| 12 | Suggestion #1: `removeMirrorResource` mutates caller-provided `obj` across retries | Fixed (DeepCopyObject inside retry closure and delete path) |
| 13 | Suggestion #2: `DockerSocket()` hardcode of `.rd2` is a silent regression | Commented (tightened TODO explaining why both hardcodes must move together) |
| 14 | Suggestion #3: `docker_watcher.run()` has no panic recovery | Fixed (defer recover() in run goroutine) |
| 15 | Suggestion #4: Inconsistent `Force` flags across delete helpers | Commented (deleteContainer comment documents the K8s-idiomatic asymmetry) |
| 16 | Suggestion #5: `statusCopy := *statusApply` shallow copy shares slice/map state | Fixed (imageStatusFromInspect returns a fresh config per call) |
| 17 | Suggestion #6: `pgid` log field is misleading on Windows | Commented (expanded to explain Unix-only Setpgid semantics) |
| 18 | Suggestion #7: BATS engine tests use `[ -n "${output}" ]` instead of `assert_output` | Fixed (use `assert_output` with no args; memory/bats-style.md updated to prevent recurrence) |
| 19 | Suggestion #8: Concurrent finalizer removal can log benign NotFound errors | Fixed (client.IgnoreNotFound on processXxxFinalizers updates) |
| 20 | Suggestion #9: Design doc overstates containerd mirroring support | Fixed (api_containers.md Engine Mirroring section clarified) |
| 21 | Suggestion #10: `docs/design/api_containers.md:306-312` still describes old untag semantics | Fixed (untag section rewritten to match current behavior) |
| 22 | Testing Gap #1: Paused → running transition | Test written (BATS: `patching spec.state=running unpauses a paused container`) |
| 23 | Testing Gap #2: Docker-initiated untag of in-use image | Test written (BATS: `docker rmi of one tag removes only the tag mirror`) |
| 24 | Testing Gaps #3–#8: disconnect / bootstrap race / multi-property set / concurrent patch / single-instance | Skipped (require fault injection or are S2/I11 acknowledged limitations) |
| 25 | Documentation Gap #1: `container_types.go` doc comment doesn't describe `unknown` | Fixed (expanded State comment to describe the three values) |