# Deep Review: 20260412-171333-pr-295

| | |
|---|---|
| **Date** | 2026-04-12 17:13 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 3 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#295](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/295) — Container Engine watcher |
| **Commits** | `c484dab` ci: capture memory pressure and API state in support bundle<br>`3196b24` bats: park spec.state=unknown before pausing test-state<br>`84c0736` engine: apply Container/Image/Volume capitalization convention<br>`c8fa377` generate: refresh containerspec.go and crd.yaml for spec.state doc<br>`62d222a` scripts: resolve rdd binary relative to bats-with-timeout.sh<br>`66eb50e` Address review round 2 findings for PR #295<br>`7c8f356` ci: always capture support bundle and log hostagent pgid<br>`9669b4b` bats: close fds 3 and 4 when invoking rdd<br>`50bdc15` ci: capture support bundle when BATS targets time out<br>`762c1e8` Make rdd set wait for the desired state by default<br>`65943da` Add BATS integration tests for the engine controller<br>`ba6caf7` Document engine mirroring and container state transitions<br>`4db3f20` Allow container spec.state transitions via webhook<br>`93359f7` Add engine controller that mirrors Docker state into K8s<br>`87f1df1` Add DockerSocket() helper and Docker client dependency<br>`32a0109` Work around distro proxy bug; update docker socket path<br>`536ccb2` Rosetta is now a vz driver option, not a global option |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — no critical blockers; four important findings are narrowly-scoped with trivial fixes |
| **Wall-clock time** | `31 min 5 s` |

## Executive Summary

This PR introduces the engine controller that mirrors Docker container engine state (containers, images, volumes) into the `containers.rancherdesktop.io` API group via fullSync-then-watch, forwards user-initiated mirror deletions back to Docker through a finalizer, and dispatches Container start/stop on `spec.state` transitions. It also adds the `rdd set` wait-for-desired-state path, 389 lines of BATS integration coverage, a new CI support-bundle capture wrapper, and Lima template updates (new socket path, rosetta moved under `vmOpts.vz`, distro workaround).

The design is sound: idempotent fullSync, panic-recovery, per-item error aggregation with `errors.Join`, a narrow webhook allowing only `spec.state` transitions, and deliberate reasoning about server-side apply field ownership (documented inline in `sync_containers.go`). Verdict is **merge with fixes**: no critical regressions, four important findings with one-to-three-line fixes, and a cluster of smaller suggestions around condition semantics, clock assumptions, and finalizer edge cases.

Structure: 4 important, 12 suggestions, 0 critical. The important findings are self-healing gaps (panic recovery, watcher-restart early return, sequential finalizer short-circuit, ContainerNamespace resurrection) that the "list-then-watch" architecture would normally handle but for specific early-returns or missing watches.

---

## Critical Issues

None.

## Important Issues

I1. **Panic recovery in watcher goroutine doesn't wake the reconciler** — `pkg/controllers/app/engine/controllers/docker_watcher.go:117-121` [Claude Opus 4.6, Codex GPT 5.4]

```go
defer func() {
    if r := recover(); r != nil {
        log.Error(nil, "panic in Docker watcher goroutine", "recovered", r)
    }
}()
```

The recover block logs the panic and then `defer close(w.done)` fires, marking the watcher dead. Unlike the normal error-stream exit path at `docker_watcher.go:138-144` — which calls `w.enqueueReconcile()` before returning — the panic path emits no wake-up signal. The reconciler only learns the watcher died when another reconcile is triggered by an App spec change, a mirror watch event, or the channel source. While the watcher is dead, though, no Container/Image/Volume change will ever fire (the watcher is the source of those events), and an idle user triggers nothing. A panic can silently leave the engine controller in a degraded state for an arbitrarily long time without `ContainerEngineReady` flipping to false.

The comment at `docker_watcher.go:113-116` explicitly claims "The engine reconciler notices the dead watcher via alive() on the next reconcile and starts a fresh one" — but there is no guaranteed next reconcile.

Fix: enqueue a reconcile from the panic-recovery defer.

```diff
 defer func() {
     if r := recover(); r != nil {
         log.Error(nil, "panic in Docker watcher goroutine", "recovered", r)
+        w.enqueueReconcile()
     }
 }()
```

Claude additionally suggested capturing `debug.Stack()` so the panic is actionable — worth doing.

I2. **`processFinalizers` aborts at the first per-type error** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:330-336` [Claude Opus 4.6, Gemini 3.1 Pro]

```go
if err := r.processContainerFinalizers(ctx, w); err != nil {
    return err
}
if err := r.processImageFinalizers(ctx, w); err != nil {
    return err
}
return r.processVolumeFinalizers(ctx, w)
```

Each per-type helper internally uses `errors.Join` so one stuck Container does not block the rest of the Container list. But the wrapper short-circuits at the *type* level: a single failed `processContainerFinalizers` aborts the function before `processImageFinalizers` or `processVolumeFinalizers` runs. The new test at `engine.bats:195-227` (`deleting an in-use Image keeps the finalizer until the container is removed`) demonstrates the exact scenario where this matters: a stuck image delete legitimately returns an error until the container is removed. While stuck, Volume finalizers on the same reconcile cycle are silently starved. This contradicts the per-item aggregation pattern used throughout the rest of the file (`cleanupMirrorResources` at `engine_reconciler.go:221-238`, `reconcileContainerSpecs` at `engine_reconciler.go:294-318`).

Fix: collect errors across all finalizer types.

```diff
-    if err := r.processContainerFinalizers(ctx, w); err != nil {
-        return err
-    }
-    if err := r.processImageFinalizers(ctx, w); err != nil {
-        return err
-    }
-    return r.processVolumeFinalizers(ctx, w)
+    return errors.Join(
+        r.processContainerFinalizers(ctx, w),
+        r.processImageFinalizers(ctx, w),
+        r.processVolumeFinalizers(ctx, w),
+    )
```

I3. **`ContainerNamespace/moby` is not re-created after user delete** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:420-438` [Claude Opus 4.6, Codex GPT 5.4]

```go
return ctrl.NewControllerManagedBy(mgr).
    For(&appv1alpha1.App{}).
    Named("engine-reconciler").
    WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
    Watches(&containersv1alpha1.Container{}, enqueueApp).
    Watches(&containersv1alpha1.Image{}, enqueueApp).
    Watches(&containersv1alpha1.Volume{}, enqueueApp).
    Complete(r)
```

`syncContainerNamespace` at `sync_volumes.go:62-67` is called only from `fullSync`, which only runs at watcher startup. `SetupWithManager` does not watch `ContainerNamespace`, so user-initiated deletion produces no follow-up reconcile, and the hot loop in `Reconcile` (after line 153) never re-checks the moby mirror. The new test at `engine.bats:355-361` confirms the gap from one side — `kubectl delete containernamespace/moby` succeeds without a finalizer hang — but there is no assertion that the engine re-creates the resource. The adjacent test `engine.bats:349-353` ("restarting VM restores ContainerEngineReady and moby namespace") only validates recovery via a VM bounce.

This also contradicts the "mirror Docker state into K8s resources" contract documented in `docs/design/api_containers.md`.

Fix: watch `ContainerNamespace` so the engine reconciler notices user deletions, then re-call `syncContainerNamespace` in the wantWatcher branch. The Apply is idempotent so the cost is negligible.

```diff
 return ctrl.NewControllerManagedBy(mgr).
     For(&appv1alpha1.App{}).
     Named("engine-reconciler").
     WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
     Watches(&containersv1alpha1.Container{}, enqueueApp).
     Watches(&containersv1alpha1.Image{}, enqueueApp).
+    Watches(&containersv1alpha1.ContainerNamespace{}, enqueueApp).
     Watches(&containersv1alpha1.Volume{}, enqueueApp).
     Complete(r)
```

Alternatively, document `moby` as a protected singleton and reject the delete in the webhook. The design doc should be updated either way to state what the intended behavior is.

I4. **Early return after `startWatcher` skips pending specs/finalizers** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:142-153` [Gemini 3.1 Pro] (downgraded from critical)

```go
if !watcherRunning {
    log.Info("App is running, starting Docker watcher")
    if err := r.startWatcher(ctx); err != nil {
        ...
    }
    if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}
```

After successfully starting a fresh watcher and setting the condition, the reconciler returns without running `reconcileContainerSpecs` or `processFinalizers`. In most cases this is self-healing: `setEngineCondition` flips `ContainerEngineReady` from `Stopped`/`False` to `Connected`/`True`, which triggers a new App reconcile that enters the hot loop. But in the watcher-restart path (the previous condition was already `Connected`/`True` from before the crash), `meta.SetStatusCondition` returns `changed=false`, no status write happens, and no App reconcile fires from that source. `fullSync` then becomes the only trigger: its status Apply calls usually change a timestamp or field and emit watch events, but if Docker state is unchanged during the restart window, the applies can be no-ops and no reconcile follows. Pending user intent (e.g., a `spec.state=running` patch made during the gap) stalls until some unrelated event triggers a reconcile.

Gemini framed this as critical — it is real, but the practical blast radius is narrower than a guaranteed stall. Downgrading to important. The fix is trivial and eliminates the worry entirely.

Fix: remove the early return so the reconciler falls through to the shared code path.

```diff
     if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {
         return ctrl.Result{}, err
     }
-    return ctrl.Result{}, nil
 }

 // reconcileContainerSpecs + processFinalizers each issue one or more
```

## Suggestions

S1. **`since` Docker events filter uses host wall clock, not Docker engine clock** — `pkg/controllers/app/engine/controllers/docker_watcher.go:70-76` [Claude Opus 4.6]

```go
// Capture a Docker-relative timestamp before fullSync begins. The
// run() goroutine passes this to the event stream as the Since
// filter, so any mutation that happens while fullSync is snapshotting
// state is replayed once the stream opens.
since := strconv.FormatInt(time.Now().Unix(), 10)
```

The comment claims "Docker-relative timestamp" but `time.Now()` is the **host** clock. The Docker server (inside the Lima VM) interprets the `since` query parameter against its **own** clock. If the host clock is ahead of the guest clock by N seconds, events generated in the window `[now - N, start of event stream]` are filtered out by the server and never reach the watcher — silently lost. With Lima/Vz the skew is normally millisecond-scale so practical risk is low, but the failure mode is silent rather than detectable.

Two cheap fixes:

```go
// Option A: bias the timestamp into the past so any realistic skew is covered.
// fullSync is idempotent, so duplicate replay is safe.
since := strconv.FormatInt(time.Now().Add(-2*time.Minute).Unix(), 10)
```

```go
// Option B: query the Docker server's view of "now" before fullSync.
if info, err := cli.Info(ctx); err == nil && info.SystemTime != "" {
    if t, perr := time.Parse(time.RFC3339Nano, info.SystemTime); perr == nil {
        since = strconv.FormatInt(t.Unix(), 10)
    }
}
```

At minimum, fix the comment to say "host wall clock" and acknowledge the clock-skew assumption.

S2. **Container mirror finalizer is not re-added after user removes it** — `pkg/controllers/app/engine/controllers/sync_containers.go:117-128` [Claude Opus 4.6]

```go
if apierrors.IsNotFound(err) {
    applyConfig := containersv1alpha1apply.Container(inspect.ID, apiNamespace).
        WithFinalizers(mirrorFinalizer).
        WithSpec(containersv1alpha1apply.ContainerSpec().WithState(containersv1alpha1.ContainerStatusUnknown))
    if err := w.k8s.Apply(ctx, applyConfig, client.FieldOwner(controllerName)); err != nil {
        return fmt.Errorf("failed to create container %s: %w", inspect.ID, err)
    }
} else if err != nil {
    return fmt.Errorf("failed to get container %s: %w", inspect.ID, err)
}
```

The mirror finalizer is set only on the `IsNotFound` (creation) branch. On any subsequent sync of the same Container, the code falls through to the status-only Apply at `sync_containers.go:176` without re-asserting the finalizer. If the user removes the finalizer via `kubectl edit`, the next reconcile won't restore it, and a subsequent user delete bypasses the engine's Docker-side cleanup entirely — the K8s mirror is GC'd instantly while the Docker container survives.

`sync_images.go:112-133` and `sync_volumes.go:128-130` reassert the finalizer on every sync because their applies are unconditional. The Container path is the only outlier because of its special "spec-only-on-create" gate.

Fix: split finalizer reassertion into a separate Apply that always runs:

```go
finalizerOnly := containersv1alpha1apply.Container(inspect.ID, apiNamespace).
    WithFinalizers(mirrorFinalizer)
if err := w.k8s.Apply(ctx, finalizerOnly,
    client.ForceOwnership, client.FieldOwner(controllerName)); err != nil {
    return fmt.Errorf("failed to assert container finalizer %s: %w", inspect.ID, err)
}
```

With SSA, applying only `WithFinalizers` preserves the user-owned `spec.state` field.

S3. **Concurrent `App.Status().Update` from `AppReconciler` and `EngineReconciler`** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:202-214` [Claude Opus 4.6]

```go
func (r *EngineReconciler) setEngineCondition(ctx context.Context, app *appv1alpha1.App, ...) error {
    changed := meta.SetStatusCondition(&app.Status.Conditions, metav1.Condition{...})
    if !changed {
        return nil
    }
    return r.Status().Update(ctx, app)
}
```

The `AppReconciler` at `pkg/controllers/app/app/controllers/app_controller.go:247` already calls `r.Status().Update(ctx, &app)` to mirror LimaVM conditions onto `App.Status.Conditions`. The `EngineReconciler` now also calls `Status().Update` on the same object to set `ContainerEngineReady`. Both reconcilers run in the same manager and are **not** serialized across controllers (controller-runtime serializes within a controller, not across them). When both have a stale local copy and both issue a `Status().Update`, the loser gets a 409 conflict.

In practice conflicts will be infrequent because `meta.SetStatusCondition` short-circuits no-op updates and the App reconciler reconciles less often. But the race is real.

Fix (either of two patterns already used elsewhere in this file):

- Use server-side apply with `client.Apply` + a distinct `FieldOwner` for the engine condition — field-level merge avoids resource-version conflict entirely.
- Wrap the `Status().Update` in `retry.RetryOnConflict` plus a re-Get, mirroring `removeMirrorResource` at `docker_watcher.go:271-283` and `deleteAllOfType` at `engine_reconciler.go:262-274`.

S4. **`instance.DockerSocket()` unconditionally creates `~/.rd2/` and swallows the error** — `pkg/instance/instance.go:136-144` [Claude Opus 4.6]

```go
var DockerSocket = sync.OnceValue(func() string {
    home, err := os.UserHomeDir()
    if err != nil {
        panic(fmt.Errorf("could not get home directory: %w", err))
    }
    dir := filepath.Join(home, ".rd2")
    _ = os.MkdirAll(dir, 0o700)
    return filepath.Join(dir, "docker.sock")
})
```

Two concerns:

1. **Hardcoded directory**: On any instance where `RDD_INSTANCE != "2"` (e.g., `bats-app-controller`, a developer's `dev` instance), this creates `~/.rd2/` instead of `instance.ShortDir()`. The TODO at `instance.go:124-133` already acknowledges the hardcode, but the side effect of creating the directory early means every caller that reads this value (even from code paths that never use the socket) litters the user's home. Lima itself creates the directory when port-forwarding the socket, so the `MkdirAll` here is likely redundant.

2. **Swallowed error**: `_ = os.MkdirAll(...)` ignores any failure (permission, ENOSPC). If the directory cannot be created, the function still returns a path to a non-existent directory, and the later dial fails with an error unrelated to the real cause.

Fix (minimal): drop the `MkdirAll` entirely and let Lima own the directory.

```diff
-    dir := filepath.Join(home, ".rd2")
-    _ = os.MkdirAll(dir, 0o700)
-    return filepath.Join(dir, "docker.sock")
+    return filepath.Join(home, ".rd2", "docker.sock")
```

If Lima does not pre-create it, propagate the error instead of ignoring it.

S5. **`ContainerEngineReady=True/NotApplicable` is a contradiction in terms for containerd** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:123-130` [Claude Opus 4.6]

```go
if running && !engineIsDocker {
    terminalReason = "NotApplicable"
    terminalStatus = metav1.ConditionTrue
    terminalMessage = "Engine mirroring is only supported with the moby backend"
}
```

Setting `ContainerEngineReady=True/NotApplicable` for the containerd backend makes `rdd set running=true` complete immediately (matching the BATS test at `engine.bats:365-389`), but it tells any other consumer of `ContainerEngineReady=True` ("the engine is ready") something misleading: there is no mirroring at all, and any UI relying on `Container`/`Image`/`Volume` resources will see nothing. `Status=True` with `Reason=NotApplicable` conflicts with the condition name.

Two cleaner alternatives:
- Rename the condition to `ContainerEngineMirrored` so `True/NotApplicable` doesn't conflict with the type name.
- Use `Status=Unknown/NotApplicable` — still allows `rdd set` to stop waiting but doesn't claim active readiness.

Option two is cheaper and keeps the condition name descriptive; the CLI wait at `cmd/rdd/set.go:464-466` would need to accept `Unknown` as a terminal state.

S6. **`startWatcher` failure swallows the condition update error** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:144-148` [Claude Opus 4.6]

```go
if err := r.startWatcher(ctx); err != nil {
    log.Error(err, "Failed to start Docker watcher")
    _ = r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "ConnectFailed", err.Error())
    return ctrl.Result{}, err
}
```

The `_ =` discards any error from `setEngineCondition`. If the status update fails (e.g., a 409 conflict against a concurrent App reconciler update — see S3), the App keeps reporting its stale value (often `True/Connected` from the previous startup) until the next reconcile. The reconciler requeues on the original `startWatcher` error so the state eventually converges, but the API-server-visible condition is wrong in the interim.

```diff
     log.Error(err, "Failed to start Docker watcher")
-    _ = r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "ConnectFailed", err.Error())
+    if condErr := r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "ConnectFailed", err.Error()); condErr != nil {
+        log.Error(condErr, "Failed to update ContainerEngineReady=ConnectFailed")
+    }
     return ctrl.Result{}, err
```

S7. **`processVolumeFinalizers` does not guard against empty `Status.Name`** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:395-417` [Claude Opus 4.6]

```go
if err := w.deleteVolume(ctx, v.Status.Name); err != nil {
```

If a `Volume` mirror is created through the K8s API by something other than the engine controller — or in a startup race window where the mirror has the finalizer but the engine has not yet populated its status — `Status.Name` will be empty. `deleteVolume` then calls Docker's `VolumeRemove("")`, which errors out, and the finalizer is never stripped. The Volume mirror is permanently stuck in `Terminating`.

Fix: short-circuit the finalizer strip when there is no Docker name to forward to.

```diff
 if v.DeletionTimestamp == nil || !hasFinalizer(v, mirrorFinalizer) {
     continue
 }
+if v.Status.Name == "" {
+    if removeFinalizer(v, mirrorFinalizer) {
+        if err := client.IgnoreNotFound(r.Update(ctx, v)); err != nil {
+            errs = append(errs, fmt.Errorf("failed to remove finalizer from Volume %s: %w", v.Name, err))
+        }
+    }
+    continue
+}
 if err := w.deleteVolume(ctx, v.Status.Name); err != nil {
```

S8. **Docker client lacks API-version negotiation** — `pkg/controllers/app/engine/controllers/docker_watcher.go:44-50` [Claude Opus 4.6]

```go
cli, err := dockerclient.New(
    dockerclient.WithHost("unix://" + socketPath),
)
```

Without `dockerclient.WithAPIVersionNegotiation()`, the client uses its compiled-in default API version. If the Docker engine inside the Lima VM is newer or older than the client's default, calls will fail with cryptic version-mismatch errors. RDD controls the Lima image today so this is mostly theoretical, but a user who sshes into the VM and runs `zypper up`, or a future Lima image bump, could trigger it unexpectedly.

```diff
 cli, err := dockerclient.New(
     dockerclient.WithHost("unix://" + socketPath),
+    dockerclient.WithAPIVersionNegotiation(),
 )
```

Verify the exact option name in `moby/moby/client@v0.4.0` — moby has historically used `WithAPIVersionNegotiation` and `NegotiateAPIVersion` interchangeably across versions.

S9. **`matches_our_instance` matches on raw substring before destructive cleanup** — `scripts/bats-with-timeout.sh:71-75` [Codex GPT 5.4]

```bash
matches_our_instance() {
    case "$1" in
        *"${instance}"*) return 0 ;;
    esac
    return 1
}
```

The helper is used by `our_leaked_pids` / `sigquit_our_go_leaks` / `sigkill_our_leaks` at lines 306-365 to scope `SIGQUIT` and `SIGKILL` cleanup to the current `RDD_INSTANCE`. A bare substring match is broader than the comment at `bats-with-timeout.sh:68-70` promises: `RDD_INSTANCE=bats` would match every `bats-*` instance name in `bats/Makefile:53-83`, so the cleanup path can target sibling control planes rather than its own. The destructive scoping is the whole point of this helper — subtle over-matching here risks killing a parallel run's daemons.

Fix: anchor the match on actual instance markers that appear in paths and environment assignments.

```diff
 matches_our_instance() {
     case "$1" in
-        *"${instance}"*) return 0 ;;
+        *"RDD_INSTANCE=${instance}"*|*"/.rd${instance}/"*|*"rancher-desktop-${instance}"*) return 0 ;;
     esac
     return 1
 }
```

S10. **Cleanup error messages have an empty resource kind** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:246-286` [Gemini 3.1 Pro] (downgraded from important)

```go
for _, item := range items {
    obj := item.(client.Object)
    key := client.ObjectKeyFromObject(obj)
    kind := obj.GetObjectKind().GroupVersionKind().Kind
    ...
    errs = append(errs, fmt.Errorf("failed to remove finalizer from %s/%s: %w",
        kind, obj.GetName(), retryErr))
```

Items extracted from a List response via `meta.ExtractList` have empty TypeMeta — the K8s API server only populates GVK on the top-level list object, not its items. Error messages log as `failed to remove finalizer from /volume-name` instead of `Volume/volume-name`, making operator debugging harder. Cosmetic (doesn't affect behavior) — downgraded to suggestion.

```diff
-        errs = append(errs, fmt.Errorf("failed to remove finalizer from %s/%s: %w",
-            kind, obj.GetName(), retryErr))
+        errs = append(errs, fmt.Errorf("failed to remove finalizer from %T %s: %w",
+            obj, obj.GetName(), retryErr))
```

S11. **Watcher startup is fragile to individual item inspection failures** — `pkg/controllers/app/engine/controllers/sync_containers.go:37-72`, `sync_images.go:41-80` [Gemini 3.1 Pro]

Each `syncAll*` collects per-item errors with `errors.Join`, `fullSync` joins across all four sync types, and `newDockerWatcher` fails the whole startup on any non-nil error. A single permanently-broken inspect (engine-side corruption, rare race) prevents every other healthy resource from being mirrored, and the `ContainerEngineReady=ConnectFailed` condition stays sticky until the problem resource disappears.

Consider logging individual inspect errors and continuing the sync rather than failing the entire startup on any per-item error. The stale-prune step already runs regardless and will eventually clean up mirrors for items that stay broken.

S12. **`BATS_TIMEOUT=0` bypasses the new support-bundle wrapper entirely** — `bats/Makefile:16` [Claude Opus 4.6]

```makefile
BATS = $(if $(filter-out 0,$(BATS_TIMEOUT)),../scripts/bats-with-timeout.sh $(BATS_TIMEOUT) $@) $(BATS_CORE)
```

When a developer runs `make bats-app BATS_TIMEOUT=0` to disable the timeout (e.g., for debugging a hang), the wrapper is not invoked at all — no support-bundle capture, no leaked-process cleanup. Document this in the Makefile comment, or split the wrapper so `timeout=0` still runs it with the timeout disabled.

---

## Design Observations

### Concerns

- **Reconciler scales linearly per child event (in-scope)** [Claude Opus 4.6]. The in-code comment at `engine_reconciler.go:155-161` already calls this out: every Container/Image/Volume watch event triggers a reconcile via `SetupWithManager`, and each reconcile issues full List() calls across all mirror types. Cost is O(N) per child event. The right long-term fix — per-resource reconcilers with watch predicates on `deletionTimestamp set` / `spec.state changed` — is the correct structural answer. Worth doing before the controller is expected to handle more than a few dozen containers.
- **Two writers on `App.Status.Conditions` (in-scope)** [Claude Opus 4.6]. The App reconciler mirrors LimaVM conditions; the engine reconciler sets `ContainerEngineReady`. They are different field-level concerns but `Update` sends the entire status. See S3 for the immediate impact. A cleaner long-term design has all App.Status writers use SSA with distinct field owners; this eliminates conflicts entirely and makes per-condition ownership explicit.
- **Engine assumes host/guest clock parity in multiple places (in-scope)** [Claude Opus 4.6]. Beyond the `since` filter (S1), the various `time.Parse(time.RFC3339Nano, ...)` calls in `applyContainer`, `imageStatusFromInspect`, and `applyVolume` all trust Docker's timestamps as-is. None is wrong, but they all make the same assumption. A one-shot startup check (compare `Info().SystemTime` to `time.Now()` and warn if skew exceeds N seconds) would surface clock drift early.
- **`ContainerNamespace` is documented as a mirror resource but implemented as a startup-only singleton (in-scope)** [Codex GPT 5.4]. Two coherent models exist: reconcile it like the other mirrors, or reject deletion in the webhook and document it as protected state. Picking one simplifies the system; the current hybrid is the direct cause of I3.

### Strengths

- **Idempotent fullSync + watch model (in-scope)** [Claude Opus 4.6]. The fundamental design — list-then-watch with full reconciliation on every watcher startup — handles every edge case (panic, transient disconnect, Docker errors) by self-healing. This is the right architecture.
- **Captured `since` timestamp before fullSync (in-scope)** [Codex GPT 5.4]. Capturing the event-stream cursor at `docker_watcher.go:70-84` before the initial sync is the right way to close the startup blind spot between snapshot and stream subscription — separate from S1's clock concern.
- **Error aggregation pattern (in-scope)** [Claude Opus 4.6]. The per-resource `errors.Join` pattern in `cleanupMirrorResources`, `reconcileContainerSpecs`, and per-finalizer helpers is a solid choice; only `processFinalizers` itself violates it (see I2).
- **Spec-state-once-on-create reasoning (in-scope)** [Claude Opus 4.6]. The comment at `sync_containers.go:111-116` explaining why `ForceOwnership` is deliberately omitted on the create Apply is excellent — it captures the exact race the author considered and why the chosen design avoids it.
- **Webhook restraint (in-scope)** [Claude Opus 4.6, Gemini 3.1 Pro]. The new validator at `container_webhooks.go:35-62` allows only `spec.state` to change and normalizes it before comparing the rest. As `ContainerSpec` grows, the comparison continues to work correctly.
- **`removeMirrorResource` retry-on-conflict (in-scope)** [Claude Opus 4.6]. The watcher-side finalizer strip uses `retry.RetryOnConflict` at `docker_watcher.go:271` so concurrent reconciles don't require a manual requeue. This matches the pattern used in `deleteAllOfType`.
- **BATS coverage breadth (in-scope)** [Claude Opus 4.6]. `engine.bats` is unusually thorough — it covers dangling/tagged transitions both directions, paused-container state transitions both directions, in-use image deletion, and the "rdd set running=false returns promptly" regression. The inline comments explain each test's design carefully.
- **`watchtools.UntilWithSync` in `rdd set` (in-scope)** [Codex GPT 5.4]. Solid choice for the new CLI wait path — it handles watch restarts without turning transient disconnects into user-visible flakes.

## Testing Assessment

Coverage is broad and intentional. Specific untested scenarios that warrant tests, ranked by risk:

1. **Watcher death and recovery while VM is running** — no test causes a Docker event-stream error or panics the goroutine, so the self-healing claim is unverified end-to-end. Hard to test from BATS but doable with a fault-injection hook. Addresses I1.
2. **`ContainerNamespace/moby` re-creation after user delete** — the existing test at `engine.bats:355-361` verifies the delete succeeds without a finalizer hang, but there is no assertion the engine re-creates it without bouncing the VM. Addresses I3.
3. **Pending user work during watcher restart** — a test that patches `spec.state=running` while the watcher is dying and asserts the patch is eventually processed without an external trigger. Addresses I4.
4. **Concurrent App.Status updates from App + Engine reconcilers** — a stress test that rapidly toggles `running=true/false` and asserts neither reconciler reports `Conflict` errors. Addresses S3.
5. **Empty `Volume.Status.Name` finalizer hang** — create a Volume via the K8s API directly with empty status, set deletionTimestamp, assert finalizer cleanup proceeds. Addresses S7.
6. **`fullSync` stale-prune race** — force a container-removed-between-List-and-Inspect race and verify the stale mirror is pruned in the same sync.
7. **Overlapping RDD_INSTANCE names in `bats-with-timeout.sh`** — a shell test covering the substring-match risk before the wrapper sends destructive signals. Addresses S9.

Unit test coverage gap: the only Go test in the new package is `sync_containers_test.go` for `parseContainerName`. The reconciler state machine, finalizer cleanup, and error-aggregation logic are exercised only by BATS — valuable but slow to iterate on. `envtest`-based unit tests for reconcile transitions, stuck finalizers, and the cleanup per-type error accumulation would shorten the feedback loop significantly.

Local verification: `go test ./pkg/controllers/app/engine/controllers ./pkg/controllers/containers/container` passed in Codex's worktree.

## Documentation Assessment

`docs/design/api_containers.md` is updated substantively — new Terminology section, Engine Mirroring section, finalizer lifecycle, container state transitions, and updated Volume name format. Gaps:

- The `ContainerEngineReady=NotApplicable` reason is introduced in the design doc (lines 62-64) but the implication ("containerd users should not depend on this condition for mirroring readiness") is not stated. See S5.
- The doc does not state whether deleting `containernamespace/moby` is supposed to be rejected or self-healed. Clarifying alongside I3 is the right place.
- `instance.DockerSocket()` has no doc-string explaining the `~/.rd2` hardcode beyond a TODO. Surfacing the assumption in the function comment (or name) would help the next reader.
- `cmd/rdd/set.go`'s new `--timeout` flag is documented in the command help, but the failure mode ("timed out waiting for App state") is not — users may expect a request-level timeout rather than a wait-for-condition timeout.
- `engine.bats:147-149` references `volumeK8sName` but the actual helper is `volumeMirrorName`. Stale comment.

## Commit Structure

17 commits, one explicit "Address review round 2 findings" fixup (`66eb50e`), several small capitalization/comment/script fixups (`84c0736`, `c8fa377`, `3196b24`, `62d222a`), and a cluster of CI infrastructure commits (`50bdc15`, `7c8f356`, `c484dab`, `9669b4b`) intermixed with the feature work. The fixups are review-driven and reasonable per the project convention. The CI support-bundle work is arguably independent of the engine controller — the new script does not depend on engine code, and engine code does not depend on the script — and could have been a separate PR that stood on its own merits. Bundling makes the diff larger and the relationship between the new tooling and the feature less clear, but not a blocker.

## Acknowledged Limitations

- **DockerSocket hardcoded to `~/.rd2`** at `pkg/instance/instance.go:127-135`: TODO explicitly flags that multi-instance BATS runs cannot share a host until the Lima template derives the path from the instance suffix; `engine.bats:13-14` carries the parallel TODO. This PR makes the limitation more impactful but does not worsen it.
- **`rdd set containerEngine.name=<other> running=true` returns immediately on a running VM** at `cmd/rdd/set.go:445-452`: the doc comment describes this hazard and points to the proposed `Reconciled` condition with `observedGeneration` as the long-term fix.
- **per-reconcile O(N) List + Inspect** at `engine_reconciler.go:155-161`: acknowledged with a clear path to per-resource reconcilers with watch predicates.
- **Sequential per-object Inspect during fullSync** at `sync_containers.go:30-36` and `sync_images.go:36-40`: acknowledged with pointers to where an errgroup could go if startup latency becomes a concern.
- **Tag-event handling requires re-inspect** at `docker_watcher.go:218-223`: the untag event payload only carries the image ID hash, so `reconcileImageByID` re-inspects and prunes stale tag mirrors. The asymmetry is understood.
- **`pgid` log field is Unix-only** at `limavm_lifecycle.go:479-490`: the comment flags that it will need renaming when Windows CI arrives.

## Unresolved Feedback

No unresolved review comments on the PR itself — round 2 comments were addressed in commit `66eb50e`.

---

## Agent Performance Retro

**[Claude]** — Claude carried the review. It produced 3 of the 4 important findings (I1, I2, I3) and 10 of the 12 suggestions, including the only one that required cross-component reasoning (S3: concurrent App + Engine status writers, which depended on reading the App reconciler code outside the diff). Its unique finding S1 (clock-skew in the `since` filter) is a quiet bug the other two agents missed — the author's comment explicitly says "Docker-relative timestamp" and both Codex and Gemini accepted it without checking what `time.Now()` actually returns. Claude's S8 (API-version negotiation) is a genuine hedging concern but closer to a code review idiom than a bug. The review's prose wandered slightly in spots (S10, S11 on testing/Makefile are valid but low-value), but calibration across severities was accurate and every claim verified against current code.

**[Codex]** — Codex produced the two highest-signal findings it raised (I1/panic-recovery and I3/ContainerNamespace resurrection, both shared with Claude) plus one sharp suggestion (S9: substring match in the bats-with-timeout helper) that Claude missed entirely despite both agents reading the same script. Its finding density is low — three items total — but every one is load-bearing and the analysis traced through the webhook, `SetupWithManager`, and the new test code to prove the claim. No false positives, no padding. Codex also ran `go test` in its worktree and reported pass; this is the kind of verification step the prompt invites but only Codex exercised.

**[Gemini]** — Gemini produced both a high-value finding (I2/short-circuit finalizers, shared with Claude) and one legitimate critical-severity bug-adjacent catch (I4/early-return after startWatcher, downgraded to important after verification), but it also produced the review's only false positive (C1: claimed a `"$ @"` syntax error with a space in `bats-with-timeout.sh` that does not exist, with line numbers 159 and 349 that do not correspond to the cited function). The hallucinated critical is the single-worst event in this review — it would have blocked merge if trusted. Gemini's I2 was downgraded to a suggestion (S10) because it only affects error message quality. Net: one genuine important, one downgraded-critical, one downgraded-important, one fabricated critical. Calibration was inflated.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 22m 58s | 6m 21s | 5m 48s |
| Findings | 3I 11S | 2I 1S | 1I 3S |
| Tool calls | 77 (Read 39, Bash 30, TodoWrite 7) | 43 (shell 41, plan 1, stdin 1) | 1 (run_shell_command 1) |
| Design observations | 6 | 3 | 2 |
| False positives | 0 | 0 | 1 |
| Unique insights | 9 | 1 | 1 |
| Files reviewed | 27 | 27 | 27 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **3I 11S** | **2I 1S** | **1I 3S** |
| Downgraded | 0 | 0 | 2 (I→S) |
| Dropped | 0 | 0 | 1 |


**Reconciliation.** Gemini pass-1 C1 (bash syntax error) rejected as a false positive after reading `scripts/bats-with-timeout.sh:202` directly and confirming the code is `"$@" &` with no space; cited line numbers 159/349 also do not match the file. Gemini pass-1 C2 (reconciler stalls after startWatcher) downgraded from critical to important I4 — the scenario is real but narrower than framed: `setEngineCondition` triggers a subsequent App reconcile in most cases, and only the watcher-restart path with unchanged condition + status-apply no-op actually stalls. Gemini pass-1 I2 (missing GVK in deleteAllOfType error) downgraded from important to suggestion S10 — cosmetic error-message issue, no behavior impact.

Overall, Claude contributed the most value on this review; Codex contributed the highest signal-to-noise; Gemini contributed one genuine finding per category but lost calibration at the top end.

---

## Review Process Notes

### Skill improvements

- **Reinforce literal verification of short code snippets when an agent claims a bash/shell syntax error.** Gemini fabricated a `"$ @"` critical finding with line numbers that did not correspond to the cited function. Single-character syntax claims are cheap to verify via a direct file read, and this review only caught it because the orchestrator happened to have read the file already. A prompt-level nudge — "before reporting a single-character syntax error as critical, quote the exact line using a direct file read tool call and confirm the line number" — could head this off for future reviews.

### Repo context updates

- **[repo] UPDATE `deep-review-context.md`: engine controller now uses host wall clock for Docker `since` filter.** Add a bullet: "When reviewing engine controller changes, flag any code that uses `time.Now()` for filters passed to the Docker daemon API. The Docker daemon is inside the Lima VM and filters on its own clock; host clock must be translated via `Info().SystemTime` or biased into the past to cover skew." This was Claude's unique S1 finding and an easy miss for agents that don't know Docker runs in the VM rather than the host.
- **[repo] UPDATE `deep-review-context.md`: App.Status has multiple writers.** Add a bullet: "`App.Status.Conditions` is now written by both `AppReconciler` (LimaVM mirror) and `EngineReconciler` (ContainerEngineReady). Controllers run concurrently — flag any non-SSA, non-retry-on-conflict `Status().Update` on the App object." This generalizes S3 and would have helped both Gemini and Codex catch the concurrent-writer concern independently.

---

## Appendix: Original Reviews

### [Claude] — Claude Opus 4.6

Let me start reading the key files in parallel.

Now I have found the key issue with `ContainerCreateRequestSpec` - it still uses `running` as the default instead of `unknown`. But let's check further — this is for new creation requests, not regular containers. Let me continue reading the other files.

I need to check the webhook registration to understand if ValidateCreate actually blocks the engine controller's creation of Container mirrors.

The webhook only validates Update operations, so Container creation by the engine controller is not blocked. Good.

Now let me look at the existing App controller and related files to understand the condition lifecycle.

Now let me check a few more things — specifically the controller registration, the `GetAPIGroup` value, and how the engine controller integrates with the rest.

Now let me check if "engine" as `apiGroup` may collide with anything else, and look at how the App `Reconciler` registers vs the engine reconciler.

Let me verify a few specific line numbers I want to reference in findings.

Now let me check pp where engine controller would conflict with App controller on App status updates by looking at sigs.k8s.io/controller-runtime concurrency handling.

Default is 1, so reconciles within a single controller are serialized but multiple controllers in a manager can run reconciles in parallel. App and Engine reconcilers can update App.Status concurrently. Let me check how the App controller updates status.

Both controllers do raw `Status().Update` with no retry. Now let me verify a couple of remaining details before writing.

Let me check one more potential issue around Apply patches.

I have enough context now. Let me also quickly verify the watcher's `since` parameter in the Docker library to confirm my understanding.

I have enough context. Let me skip the find command and write up the review. The system context is sufficient.

I have enough context. Let me write the review.

# Code Review: PR #295 — Container Engine watcher

## Executive Summary

This PR introduces an engine controller that mirrors Docker engine state (containers, images, volumes) into the `containers.rancherdesktop.io` API group. The design is sound — fullSync-then-watch with finalizer-based deletion forwarding, defensive panic recovery, idempotent state reconciliation, and good test coverage via BATS. The most significant gap is that the panic-recovery path does not enqueue a reconcile, leaving the watcher silently dead until external activity nudges the reconciler. Several smaller issues around event-stream timestamping, finalizer-processing error aggregation, and DockerSocket directory side effects deserve attention.

## Findings

### Critical Issues

None.

### Important Issues

**I1. Panic recovery in watcher goroutine doesn't wake the reconciler — `pkg/controllers/app/engine/controllers/docker_watcher.go:117-121`**

```go
defer func() {
    if r := recover(); r != nil {
        log.Error(nil, "panic in Docker watcher goroutine", "recovered", r)
    }
}()
```

The deferred recover at lines 117-121 catches the panic and logs it, then the deferred `close(w.done)` at line 112 fires, marking the watcher dead. But unlike the normal error path at line 138-144 (which calls `w.enqueueReconcile()` before returning), the panic path does not enqueue any reconcile. The reconciler only learns the watcher died when something *else* triggers a Reconcile — an App spec change, a Container/Image/Volume mirror change. While the watcher is dead, however, no Container/Image/Volume change will fire (the watcher is the source of those events for engine-driven changes), and an idle user will never trigger anything. This means a panic can silently leave the engine controller in a degraded state for an arbitrarily long time, without `ContainerEngineReady` flipping to false and without recovery.

The comment at lines 113-116 explicitly claims "The engine reconciler notices the dead watcher via alive() on the next reconcile and starts a fresh one" — but there is no guaranteed next reconcile.

Fix:

```diff
 defer func() {
     if r := recover(); r != nil {
         log.Error(nil, "panic in Docker watcher goroutine", "recovered", r)
+        w.enqueueReconcile()
     }
 }()
```

Also consider including a stack trace (`debug.Stack()`) so the panic is actionable. (important, regression)

---

**I2. `processFinalizers` aborts at the first per-resource-type error — `pkg/controllers/app/engine/controllers/engine_reconciler.go:330-336`**

```go
if err := r.processContainerFinalizers(ctx, w); err != nil {
    return err
}
if err := r.processImageFinalizers(ctx, w); err != nil {
    return err
}
return r.processVolumeFinalizers(ctx, w)
```

Each per-type helper internally collects errors with `errors.Join` so one stuck Container does not block the rest of the Container list. But `processFinalizers` itself stops at the first errored *type*: a single failed `processContainerFinalizers` aborts the function before `processImageFinalizers` or `processVolumeFinalizers` ever runs. Image and Volume finalizers pending deletion stay stuck for an entire reconcile cycle whenever a Container delete happens to fail. This contradicts the per-resource error-collection pattern used everywhere else in this file (`cleanupMirrorResources` line 226-238, `reconcileContainerSpecs` line 307-317, the per-helper loops at 349-368 / 376-392 / 400-416).

Fix:

```diff
 func (r *EngineReconciler) processFinalizers(ctx context.Context) error {
     r.watcherMu.Lock()
     w := r.watcher
     r.watcherMu.Unlock()
     if w == nil {
         return nil
     }

-    if err := r.processContainerFinalizers(ctx, w); err != nil {
-        return err
-    }
-    if err := r.processImageFinalizers(ctx, w); err != nil {
-        return err
-    }
-    return r.processVolumeFinalizers(ctx, w)
+    return errors.Join(
+        r.processContainerFinalizers(ctx, w),
+        r.processImageFinalizers(ctx, w),
+        r.processVolumeFinalizers(ctx, w),
+    )
 }
```

(important, regression)

---

**I3. Engine reconciler does not re-create `ContainerNamespace/moby` after user delete — `pkg/controllers/app/engine/controllers/engine_reconciler.go:142-153`**

```go
if !watcherRunning {
    log.Info("App is running, starting Docker watcher")
    if err := r.startWatcher(ctx); err != nil {
        ...
    }
    if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}
// reconcileContainerSpecs + processFinalizers each issue one or more
// List() calls...
```

`syncContainerNamespace` at `sync_volumes.go:62-67` is called only from `fullSync`, which only runs at watcher startup. The hot-loop branch in `Reconcile` (after line 153) does `reconcileContainerSpecs` and `processFinalizers` but never re-checks the moby `ContainerNamespace`. Combined with the new test at `engine.bats:355-361` which verifies that user-initiated deletion of `containernamespace/moby` returns promptly, this means: once a user (intentionally or accidentally) `kubectl delete`s `containernamespace/moby`, the resource stays missing until the next watcher restart. There is no `Watches(&containersv1alpha1.ContainerNamespace{}, ...)` either, so the engine reconciler does not even get notified of the deletion to re-trigger sync.

The bats test at `engine.bats:349-353` ("restarting VM restores ContainerEngineReady and moby namespace") confirms this gap by only validating recovery via VM restart — there is no "deleting and restoring" test that does not bounce the VM.

Fix: re-call `syncContainerNamespace` from the wantWatcher branch on each reconcile (it is an idempotent server-side Apply, so the cost is negligible), or add a `Watches(&ContainerNamespace{}, enqueueApp)` plus dedicated re-create logic.

```diff
 if !watcherRunning {
     log.Info("App is running, starting Docker watcher")
     if err := r.startWatcher(ctx); err != nil {
         ...
     }
     ...
     return ctrl.Result{}, nil
 }

+r.watcherMu.Lock()
+w := r.watcher
+r.watcherMu.Unlock()
+if w != nil {
+    if err := w.syncContainerNamespace(ctx); err != nil {
+        return ctrl.Result{}, fmt.Errorf("failed to ensure moby ContainerNamespace: %w", err)
+    }
+}
+
 if err := r.reconcileContainerSpecs(ctx); err != nil {
     ...
 }
```

(important, regression)

### Suggestions

**S1. `watchCondition` waits for `ContainerEngineReady=True` even when running=true was already set on a stopped App — `cmd/rdd/set.go:462-466`**

```go
if runningVal == "true" {
    logrus.Info("Waiting for container engine to be ready")
    return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
        return conditionStatus(obj, "ContainerEngineReady") == metav1.ConditionTrue
    })
}
```

For a fresh App (no resource exists yet, no conditions yet), the watch fires on creation, then on every status update from the App reconciler and Engine reconciler. Once the engine reconciler starts the watcher and runs `setEngineCondition(... True, "Connected" ...)`, the wait completes. Good.

But a known hazard is documented in the doc comment at lines 444-452 that I want to highlight in code: the precondition function at line 498-505 reads from the cache *list* before waiting on events — if a stale `ContainerEngineReady=True` from a previous run is already in store (e.g., the engine reconciler has not yet observed the new spec), the precondition fires immediately and the wait short-circuits. This is exactly the pre-existing inconsistency the author called out, and the code carries the hazard without mitigation. The comment is honest about this; consider whether the wait should additionally require `observedGeneration >= patchedGeneration` once that condition exists.

No code change needed right now — the comment correctly documents the limitation, and the proposed `Reconciled` condition described in `bug-fix` notes is the right long-term answer. (suggestion, gap)

---

**S2. `since` Docker events filter uses host wall clock, not Docker engine clock — `pkg/controllers/app/engine/controllers/docker_watcher.go:70-76`**

```go
// Capture a Docker-relative timestamp before fullSync begins. The
// run() goroutine passes this to the event stream as the Since
// filter, so any mutation that happens while fullSync is snapshotting
// state is replayed once the stream opens. Without this, the window
// between the List-based snapshot and the event subscription is a
// blind spot during VM startup.
since := strconv.FormatInt(time.Now().Unix(), 10)
```

The comment claims a "Docker-relative timestamp" but `time.Now()` is the **host** clock. The Docker server (running in the Lima VM) interprets the `since` query parameter against its **own** clock (verified in `moby/moby/api/types/time/timestamp.go`'s `GetTimestamp` and the way moby's daemon filters events). If the host clock is ahead of the guest clock by N seconds, events generated in the (last N seconds before fullSync) + (during fullSync) window are filtered out by the server and never reach the watcher — they are silently lost. With Lima/Vz the skew is normally millisecond-scale, so the practical risk is small, but the comment is misleading and the failure mode is silent rather than detectable.

Two cheap fixes:

```go
// Option A: bias the timestamp into the past so any realistic skew is covered.
// fullSync is idempotent, so duplicate replay is safe.
since := strconv.FormatInt(time.Now().Add(-2*time.Minute).Unix(), 10)
```

or

```go
// Option B: query the Docker server's view of "now" before fullSync.
info, err := cli.Info(ctx)
if err == nil && info.SystemTime != "" {
    if t, perr := time.Parse(time.RFC3339Nano, info.SystemTime); perr == nil {
        since = strconv.FormatInt(t.Unix(), 10)
    }
}
```

At minimum, fix the comment to say "host wall clock" and acknowledge the clock-skew assumption. (suggestion, regression)

---

**S3. Connect-failed condition update silently swallowed — `pkg/controllers/app/engine/controllers/engine_reconciler.go:144-148`**

```go
if err := r.startWatcher(ctx); err != nil {
    log.Error(err, "Failed to start Docker watcher")
    _ = r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "ConnectFailed", err.Error())
    return ctrl.Result{}, err
}
```

The `_ =` discards any error from `setEngineCondition`. If the status update fails (e.g., a 409 conflict against a concurrent App update from the App reconciler — see S6), the App will continue reporting whatever stale value it had (often `True/Connected` from the previous startup), masking the connect failure. The reconciler returns the original `startWatcher` error, so it does requeue and retry, but the API-server-visible state is wrong in the meantime.

Fix:

```diff
 if err := r.startWatcher(ctx); err != nil {
     log.Error(err, "Failed to start Docker watcher")
-    _ = r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "ConnectFailed", err.Error())
+    if condErr := r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "ConnectFailed", err.Error()); condErr != nil {
+        log.Error(condErr, "Failed to update ContainerEngineReady=ConnectFailed")
+    }
     return ctrl.Result{}, err
 }
```

(suggestion, regression)

---

**S4. `ContainerEngineReady=True/NotApplicable` for containerd may mislead consumers — `pkg/controllers/app/engine/controllers/engine_reconciler.go:126-130`**

```go
if running && !engineIsDocker {
    terminalReason = "NotApplicable"
    terminalStatus = metav1.ConditionTrue
    terminalMessage = "Engine mirroring is only supported with the moby backend"
}
```

Setting `ContainerEngineReady=True/NotApplicable` for containerd makes `rdd set running=true` complete immediately (matching the BATS test at `engine.bats:365-389`), but it also tells any other consumer of `ContainerEngineReady=True` ("the engine is ready") something that is misleading: there is no engine mirroring at all, and any UI relying on `Container`/`Image`/`Volume` resources will see nothing. Status=True with reason=NotApplicable is a contradiction in terms — the condition's name is `ContainerEngineReady`, not `ContainerEngineMirroringSupported`.

Two alternatives that are clearer:
- Rename the condition to `ContainerEngineMirrored` (or similar) so True/NotApplicable doesn't conflict with the type name.
- Use status=Unknown/NotApplicable, which still allows `rdd set` to know "you can stop waiting" but doesn't claim active readiness.

The CLI wait at `cmd/rdd/set.go:464-466` would need to be updated to also accept `Unknown` if you go with the second option. (suggestion, regression)

---

**S5. `instance.DockerSocket()` unconditionally creates `~/.rd2`, even on instances where the socket is not used — `pkg/instance/instance.go:136-144`**

```go
var DockerSocket = sync.OnceValue(func() string {
    home, err := os.UserHomeDir()
    if err != nil {
        panic(fmt.Errorf("could not get home directory: %w", err))
    }
    dir := filepath.Join(home, ".rd2")
    _ = os.MkdirAll(dir, 0o700)
    return filepath.Join(dir, "docker.sock")
})
```

Two concerns:

1. **Wrong directory**: For instances where `RDD_INSTANCE != "2"` (e.g., `bats-app-controller`, `bats-lima-controller`, the Jan-the-developer's interactive `dev` instance), this hardcodes `~/.rd2` rather than using `instance.ShortDir()`. The TODO comment acknowledges this, but the side-effect of creating `~/.rd2` on a host whose primary instance is not `2` clutters the user's home directory and pollutes paths the user did not authorize. At minimum, defer the `MkdirAll` to a place where it is actually about to be used.

2. **Silently swallowed error**: `_ = os.MkdirAll(...)` ignores any failure (permission denied, ENOSPC, etc.). If the directory cannot be created, the function still returns the path, the watcher then tries to dial a socket in a non-existent directory, and the failure looks unrelated to the underlying mkdir issue.

Fix (minimal):

```diff
 var DockerSocket = sync.OnceValue(func() string {
     home, err := os.UserHomeDir()
     if err != nil {
         panic(fmt.Errorf("could not get home directory: %w", err))
     }
-    dir := filepath.Join(home, ".rd2")
-    _ = os.MkdirAll(dir, 0o700)
-    return filepath.Join(dir, "docker.sock")
+    return filepath.Join(home, ".rd2", "docker.sock")
 })
```

Lima creates the directory itself when port-forwarding the socket; the MkdirAll here is redundant. If it is *not* redundant for some reason (please verify against Lima behavior), then propagate the error rather than ignoring it. (suggestion, regression)

---

**S6. Concurrent `App.Status().Update` from `AppReconciler` and `EngineReconciler` — `pkg/controllers/app/engine/controllers/engine_reconciler.go:202-214`**

```go
func (r *EngineReconciler) setEngineCondition(ctx context.Context, app *appv1alpha1.App, status metav1.ConditionStatus, reason, message string) error {
    changed := meta.SetStatusCondition(&app.Status.Conditions, metav1.Condition{
        ...
    })
    if !changed {
        return nil
    }
    return r.Status().Update(ctx, app)
}
```

The `AppReconciler` at `pkg/controllers/app/app/controllers/app_controller.go:247` already does `r.Status().Update(ctx, &app)` to mirror `LimaVM` conditions onto `App.Status.Conditions`. Now the `EngineReconciler` also calls `r.Status().Update(ctx, app)` to set `ContainerEngineReady` on the same `App.Status.Conditions`. Both reconcilers run in the same controller manager and can be in flight concurrently (controller-runtime serializes per controller, not across controllers). When they both have a stale local copy of the App and both call `Status().Update`, the loser gets a 409 conflict and must retry from a re-fetch.

In practice, conflicts will be infrequent because (a) `meta.SetStatusCondition` returns false when nothing changed, gating most reconciles to no-ops, and (b) the App reconciler reconciles less often than the Engine reconciler. But the scenario is real. Either:

- Use `r.Status().Patch(ctx, app, client.Apply, client.FieldOwner("engine-controller"), client.ForceOwnership)` for the engine condition (server-side apply, field-level merge — no resource-version conflict).
- Or wrap the `Status().Update` in `retry.RetryOnConflict` plus a re-Get, mirroring the pattern already used in `removeMirrorResource` (`docker_watcher.go:271`) and `deleteAllOfType` (`engine_reconciler.go:262`).

(suggestion, regression)

---

**S7. `processVolumeFinalizers` does not guard against an empty `Status.Name` — `pkg/controllers/app/engine/controllers/engine_reconciler.go:406`**

```go
if err := w.deleteVolume(ctx, v.Status.Name); err != nil {
```

If a `Volume` mirror is created via the K8s API by something other than the engine controller (no `ValidateCreate` webhook for volumes — see `pkg/controllers/containers/volume/`), `Status.Name` may be empty when the user later deletes it. `deleteVolume` would call Docker's `VolumeRemove("")`, which fails with an error, and the finalizer would never be stripped — the Volume mirror is permanently stuck in `Terminating`. Likewise during a startup race window where the mirror has the finalizer but the engine has not yet written its status.

Fix:

```diff
 if v.DeletionTimestamp == nil || !hasFinalizer(v, mirrorFinalizer) {
     continue
 }
+if v.Status.Name == "" {
+    // Mirror created without an engine-managed status; nothing to
+    // forward to Docker. Strip our finalizer so the user delete
+    // can proceed.
+    if removeFinalizer(v, mirrorFinalizer) {
+        if err := client.IgnoreNotFound(r.Update(ctx, v)); err != nil {
+            errs = append(errs, fmt.Errorf("failed to remove finalizer from Volume %s: %w", v.Name, err))
+        }
+    }
+    continue
+}
 if err := w.deleteVolume(ctx, v.Status.Name); err != nil {
```

(suggestion, gap)

---

**S8. Mirror finalizer not re-added after user removes it — `pkg/controllers/app/engine/controllers/sync_containers.go:117-128`**

```go
if apierrors.IsNotFound(err) {
    applyConfig := containersv1alpha1apply.Container(inspect.ID, apiNamespace).
        WithFinalizers(mirrorFinalizer).
        WithSpec(containersv1alpha1apply.ContainerSpec().WithState(containersv1alpha1.ContainerStatusUnknown))
    if err := w.k8s.Apply(ctx, applyConfig, client.FieldOwner(controllerName)); err != nil {
        return fmt.Errorf("failed to create container %s: %w", inspect.ID, err)
    }
} else if err != nil {
    return fmt.Errorf("failed to get container %s: %w", inspect.ID, err)
}
```

The mirror finalizer is set only on the `IsNotFound` (creation) branch. On any subsequent sync of the same Container, the code falls through to the status-only Apply at line 176 without ever re-adding the finalizer. If the user removes the finalizer via `kubectl edit`, the next reconcile won't restore it, and a subsequent user delete bypasses the engine's Docker-side cleanup entirely — the K8s mirror is GC'd instantly while the Docker container survives.

This is also true for `Image` mirrors at `sync_images.go:113-114` and `Volume` mirrors at `sync_volumes.go:128-130`, but for those the finalizer is re-asserted via `WithFinalizers(mirrorFinalizer)` on every sync because the apply is unconditional. So the Container path is the only outlier, because it has the special `WithSpec(WithState(unknown))`-only-on-create gate.

Fix: extract finalizer reassertion to a separate Apply that runs on every sync:

```go
finalizerOnly := containersv1alpha1apply.Container(inspect.ID, apiNamespace).
    WithFinalizers(mirrorFinalizer)
if err := w.k8s.Apply(ctx, finalizerOnly,
    client.ForceOwnership, client.FieldOwner(controllerName)); err != nil {
    return fmt.Errorf("failed to assert container finalizer %s: %w", inspect.ID, err)
}
```

(Note: with server-side apply, applying just `WithFinalizers(mirrorFinalizer)` adds the finalizer but does not remove other field managers' fields, so the user-set spec.state is preserved.)

(suggestion, regression)

---

**S9. Docker client lacks API-version negotiation — `pkg/controllers/app/engine/controllers/docker_watcher.go:44-50`**

```go
socketPath := instance.DockerSocket()
cli, err := dockerclient.New(
    dockerclient.WithHost("unix://" + socketPath),
)
```

Without an explicit `dockerclient.WithAPIVersionNegotiation()` (or equivalent for moby/moby/client v0.4.0), the client uses its compiled-in default API version. If the Docker engine in the Lima VM is significantly newer or older, calls may fail with cryptic version-mismatch errors. RDD currently controls the Lima image and the moby version installed inside it, so this is mostly theoretical — but if the VM is upgraded out of band (e.g., a user sshes in and runs `zypper`), or if a newer Docker is bundled in a future Lima image, this could break unexpectedly.

Fix: enable negotiation at construction.

```diff
 cli, err := dockerclient.New(
     dockerclient.WithHost("unix://" + socketPath),
+    dockerclient.WithAPIVersionNegotiation(),
 )
```

(verify the exact option name in `moby/moby/client@v0.4.0`; the moby client has historically used `WithAPIVersionNegotiation` and `NegotiateAPIVersion` interchangeably across versions.) (suggestion, gap)

---

**S10. Engine reconciler has no unit tests — `pkg/controllers/app/engine/controllers/`**

The only Go test in the new package is `sync_containers_test.go` which exercises `parseContainerName`. The reconciler logic itself — the watcher lifecycle state machine, finalizer cleanup, reconcile-on-Container-event loops, condition setting, error aggregation — is exercised only by BATS integration tests. Those are valuable but slow and harder to debug.

Consider adding `envtest`-based unit tests for at least:
- Reconcile transitions (`running=true → false`, `engine=moby → containerd`, `watcher dies → restarts`).
- Finalizer processing for stuck Docker deletes (image-in-use, volume-in-use).
- The `cleanupMirrorResources` per-type error accumulation.

(suggestion, gap)

---

**S11. The `bats-with-timeout.sh` `BATS_TIMEOUT=0` escape skips the support bundle entirely — `bats/Makefile:16`**

```makefile
BATS = $(if $(filter-out 0,$(BATS_TIMEOUT)),../scripts/bats-with-timeout.sh $(BATS_TIMEOUT) $@) $(BATS_CORE)
```

When `BATS_TIMEOUT=0`, the wrapper script is bypassed entirely — no support bundle, no leaked-process cleanup. A developer who runs `make bats-app BATS_TIMEOUT=0` to disable the timeout (e.g., for debugging) loses all of the post-mortem capture this PR adds. Consider documenting this in the Makefile comment, or alternatively splitting the wrapper so that "timeout=0" still runs the wrapper but with the timeout disabled.

(suggestion, enhancement)

## Design Observations

### Concerns

**Reconciler scales linearly per child event (in-scope).** The comment at `engine_reconciler.go:155-161` already calls this out: `reconcileContainerSpecs` and `processFinalizers` each issue full `List()` calls across all mirrored objects on every reconcile, and every Container/Image/Volume watch event triggers a reconcile via `SetupWithManager` at lines 434-436. Cost is O(N) per child event. With hundreds of containers and frequent Docker activity, the engine reconciler will dominate API server time. The right long-term fix described in the comment — per-resource reconcilers with watch predicates (`deletionTimestamp set`, `spec.state changed`) — is the right structural answer. This refactor is worth doing before this controller hits a workload of more than a few dozen containers.

**Two writers on `App.Status.Conditions` (in-scope).** The `App` reconciler mirrors `LimaVM` conditions (Running, Reconciling, …) and the `Engine` reconciler sets `ContainerEngineReady`, both via `r.Status().Update`. They are different field-level concerns, but Update sends the entire status. This is the same root cause as S6. A cleaner long-term design is to have *all* App.Status writers use server-side apply with distinct field owners; that eliminates conflicts entirely and makes the per-condition ownership explicit.

**The `since` Docker filter is one of three places where the engine assumes host/guest clock parity (in-scope).** The other two are the various `time.Parse(time.RFC3339Nano, ...)` calls in `applyContainer` / `imageStatusFromInspect` / `applyVolume` — they parse Docker timestamps as-is, with no clock-correction. None of these are wrong, but they all assume the same thing: the Docker engine clock and the host clock agree. The PR doesn't validate this assumption. Adding a one-shot startup check (compare `Info().SystemTime` to `time.Now()` and warn if skew exceeds N seconds) would surface clock issues early.

### Strengths

- **Idempotent fullSync + watch model**: the fundamental design — list-then-watch with full reconciliation on each watcher startup — handles every edge case (panic, transient disconnect, intermittent Docker errors) by self-healing. This is the right architecture.
- **Error aggregation pattern**: the per-resource `errors.Join` pattern in `cleanupMirrorResources`, `reconcileContainerSpecs`, and the per-finalizer helpers is a solid choice that ensures one stuck object does not stall the rest. (Modulo I2 above.)
- **Spec-state-once-on-create**: the comment at `sync_containers.go:111-116` explaining why `ForceOwnership` is omitted on the create-Apply is excellent — it captures the exact race the author considered and why the chosen design avoids it.
- **Webhook restraint**: the new validator at `container_webhooks.go:35-62` allows only `spec.state` to change, normalizing it to compare other fields. As `ContainerSpec` grows, the comparison will continue to work correctly.
- **`removeMirrorResource` retry-on-conflict**: the watcher-side finalizer strip uses `retry.RetryOnConflict` (`docker_watcher.go:271`) so concurrent reconciles don't require a manual requeue. This matches the pattern used in `deleteAllOfType` and is the right choice.
- **BATS coverage**: the engine.bats file is unusually thorough — it covers the dangling/tagged transitions both directions, paused-container state transitions both directions, in-use image deletion, and the "rdd set running=false returns promptly" regression. The comments are likewise thorough about the test design.

## Testing Assessment

BATS coverage is broad. Specific untested scenarios that warrant tests, ranked by risk:

1. **Watcher death and recovery while VM is running.** No test SIGKILL's the watcher goroutine, no test causes a Docker event-stream error, and no test panics it. The "self-healing" claim is unverified end-to-end. Hard to test from BATS but doable with a fault-injection hook.
2. **`ContainerNamespace/moby` re-creation after user delete (I3).** Test 20 deletes moby and verifies it stays gone. There is no test that asserts the engine controller re-creates it without a VM bounce — because, per I3, the engine doesn't.
3. **Concurrent `App.Status` updates from App + Engine reconcilers (S6).** A stress test that rapidly toggles `running=true/false` and asserts neither reconciler reports `Conflict` errors.
4. **Empty `Volume.Status.Name` finalizer hang (S7).** Create a Volume via the K8s API directly (with empty status), set deletionTimestamp, assert finalizer cleanup happens promptly.
5. **Image with multiple tags pointing at the same ID through tag/untag chains.** Existing test at `engine.bats:41-64` covers two-tag flow but not larger fan-outs.
6. **Container/Image/Volume mirror stale-prune races during `fullSync`.** Specifically: a container removed by Docker between `ContainerList` and `ContainerInspect` should result in NotFound, and the stale-prune at line 61-69 should clean up the mirror later in the same sync. There is no test that specifically forces this race.
7. **The `engine` apiGroup filter.** Whether `--controllers=app,-engine` correctly disables the engine controller while keeping the rest of the app group running.

## Documentation Assessment

`docs/design/api_containers.md` is updated thoroughly to reflect the new behavior — terminology section, engine mirroring section, finalizer lifecycle, container state transitions, and updated Volume name format. Good.

Gaps:
- The `ContainerEngineReady` condition's `NotApplicable` reason for containerd is documented in the design doc (lines 62-64) but the implication ("containerd users should not depend on this condition for mirroring readiness") is not stated. See S4.
- `instance.DockerSocket()` has no docstring explaining the `~/.rd2` hardcode beyond a TODO. Surface the hardcode in the function name or in the package doc.
- The new `--timeout` flag on `rdd set` is documented in the command help but the failure mode (`timed out waiting for App state`) is not — users may be surprised that this is a wait-for-condition timeout rather than a request timeout.
- `engine.bats:147-149` says "see volumeK8sName in sync_volumes.go" but the actual function is named `volumeMirrorName`. Update the comment.

## Commit Structure

The 17-commit history includes one explicit "Address review round 2 findings" commit (`66eb50e`), several small fix-up commits (`84c0736 engine: apply Container/Image/Volume capitalization convention`, `c8fa377 generate: refresh containerspec.go and crd.yaml for spec.state doc`, `3196b24 bats: park spec.state=unknown before pausing test-state`, `62d222a scripts: resolve rdd binary relative to bats-with-timeout.sh`), and several CI/infrastructure commits intermixed with feature commits (`50bdc15`, `7c8f356`, `c484dab`, `9669b4b`).

The fix-up commits are review-driven and reasonable to keep separate per the project convention. The CI infrastructure commits (`50bdc15 ci: capture support bundle`, `7c8f356 ci: always capture support bundle`, `c484dab ci: capture memory pressure`) are arguably independent of the engine controller and could have been a separate PR — they would be reviewable on their own merits and useful even without the engine work. Bundling them into PR #295 makes the diff larger and the relationship between the new `bats-with-timeout.sh` script and the engine work less clear (the support bundle script does not depend on engine code; engine code does not depend on the script).

This is a soft observation, not a blocker. (suggestion)

## Acknowledged Limitations

- **DockerSocket hardcoded to `~/.rd2`** — `pkg/instance/instance.go:127-135`: TODO comment explicitly notes that this must be parameterized once the Lima template derives the path from the instance suffix, and that "multi-instance BATS runs cannot share the same host" until then. The engine.bats file at lines 13-14 carries a parallel TODO. This PR makes the limitation more impactful (because the engine controller now actively uses the socket) but the limitation is acknowledged.
- **`rdd set containerEngine.name=<other> running=true` returns immediately on a running VM** — `cmd/rdd/set.go:445-452`: the doc comment explicitly calls out this as a known hazard pending the proposed `Reconciled` condition with `observedGeneration`, and tells users to either stop first or re-query. This PR does not change the underlying race; it just inherits it.
- **per-reconcile O(N) List + Inspect** — `engine_reconciler.go:155-161`: the comment acknowledges this is O(N) per child event and outlines the intended long-term fix (split into per-resource reconcilers with watch predicates). Acceptable as written.
- **Sequential per-object Inspect during fullSync** — `sync_containers.go:30-36` and `sync_images.go:36-40`: comments acknowledge this and suggest where to add an errgroup if startup latency becomes a concern. Acceptable.
- **Tag-event handling produces dangling mirrors** — `docker_watcher.go:218-223`: the code re-inspects the image because Docker's untag event payload only carries the image ID hash. The author has clearly thought about this and the design is sound.
- **`pgid` log field is Unix-only** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:479-490`: the comment explicitly notes this needs renaming when Windows CI arrives.

## Coverage Summary

| File | Status |
|---|---|
| `.github/workflows/bats.yaml` | Trivial (adds `docker` brew package on macOS) |
| `bats/Makefile` | S11 |
| `bats/helpers/commands.bash` | Reviewed, no issues |
| `bats/tests/32-app-controllers/engine.bats` | Reviewed (note in Documentation: stale `volumeK8sName` reference at line 147-149); see Testing Assessment |
| `cmd/app-controller/main.go` | Trivial (adds `_ "..."/engine"` import) |
| `cmd/rdd/set.go` | S1 |
| `docs/design/api_containers.md` | Reviewed (gaps noted in Documentation Assessment) |
| `go.mod` | Reviewed, no issues (S9 addresses client construction) |
| `go.sum` | Trivial (generated) |
| `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` | Trivial (generated; doc-only) |
| `pkg/apis/containers/v1alpha1/container_types.go` | Reviewed (note: `ContainerCreateRequestSpec.State` still defaults to `running` with enum `created;running` — this is intentional, for ContainerCreateRequest the user is expressing intent at creation time, and the asymmetry with `Container.Spec.State` is meaningful). No issues. |
| `pkg/apis/containers/v1alpha1/volume_types.go` | Trivial (adds selectableField annotation) |
| `pkg/controllers/app/app/lima-template.yaml` | Reviewed (the `vmOpts.vz.rosetta` move is a Lima API change; the `if $ALLOWED_IMAGES` workaround is a distro patch). No issues. |
| `pkg/controllers/app/engine/controller.go` | Reviewed. The engine controller uses `apiGroup = "engine"` rather than `"app"` — this means `--controllers=app` will not enable the engine controller; users must specify `engine` (or both). Worth documenting in `docs/`. |
| `pkg/controllers/app/engine/controllers/docker_watcher.go` | I1, S2, S9 |
| `pkg/controllers/app/engine/controllers/engine_reconciler.go` | I2, I3, S3, S4, S6 |
| `pkg/controllers/app/engine/controllers/sync_containers.go` | S8 |
| `pkg/controllers/app/engine/controllers/sync_containers_test.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/sync_images.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/sync_volumes.go` | Reviewed (S7 references `processVolumeFinalizers` not this file) |
| `pkg/controllers/containers/container/container_webhooks.go` | Reviewed, no issues |
| `pkg/controllers/containers/container/crd.yaml` | Trivial (generated) |
| `pkg/controllers/containers/volume/crd.yaml` | Trivial (generated) |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | Reviewed, no issues (the `pgid` log field is acknowledged as Unix-only). |
| `pkg/instance/instance.go` | S5 |
| `pkg/service/cmd/service.go` | Trivial (adds `_ "..."/engine"` import) |
| `scripts/bats-with-timeout.sh` | Reviewed, no issues. The script is well-commented, handles cross-platform process inspection, and properly scopes destructive cleanup to the current `RDD_INSTANCE`. |

Review complete. The review covers all 27 changed files and surfaces 3 important findings (panic recovery doesn't enqueue reconcile, processFinalizers stops at first error, moby ContainerNamespace not re-created after user delete) plus 11 suggestions across design, error handling, observability, and testing. All findings include file:line references, code snippets, and concrete fix suggestions.

---

### [Codex] — Codex GPT 5.4

### Executive Summary

This PR adds the Docker-backed engine controller, extends `rdd set` to wait for state transitions, adds engine BATS coverage, and introduces CI support-bundle capture. The two substantive regressions are that the `moby` `ContainerNamespace` mirror is no longer self-healing after API deletion, and the new panic-recovery path can leave the Docker watcher dead until some unrelated event happens. The timeout helper also uses overly broad instance matching for destructive cleanup.

### Findings

**Critical Issues**

None.

**Important Issues**

I1. **`moby` `ContainerNamespace` can be deleted permanently until the next VM bounce** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:430-437` (important, regression)

```go
return ctrl.NewControllerManagedBy(mgr).
    For(&appv1alpha1.App{}).
    Named("engine-reconciler").
    WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
    Watches(&containersv1alpha1.Container{}, enqueueApp).
    Watches(&containersv1alpha1.Image{}, enqueueApp).
    Watches(&containersv1alpha1.Volume{}, enqueueApp).
    Complete(r)
```

`syncContainerNamespace()` only recreates `containernamespace/moby` during watcher startup/full sync at `pkg/controllers/app/engine/controllers/sync_volumes.go:62-66`, while `ValidateDelete()` explicitly allows the delete at `pkg/controllers/containers/containernamespace/container_namespace_webhooks.go:24-29`. Because `SetupWithManager()` never watches `ContainerNamespace` at `pkg/controllers/app/engine/controllers/engine_reconciler.go:430-437`, deleting `containernamespace/moby` while the engine is running produces no follow-up reconcile, so the mirror stays missing until some unrelated App reconcile or a VM restart. That breaks the “mirror Docker state into Kubernetes resources” contract in `docs/design/api_containers.md`.

Fix: either make `moby` undeletable while the watcher is running, or watch/reconcile `ContainerNamespace` just like the other mirror resources. The minimal self-healing fix is to watch `ContainerNamespace` and call `syncContainerNamespace()` whenever `wantWatcher` is true.

```diff
 return ctrl.NewControllerManagedBy(mgr).
     For(&appv1alpha1.App{}).
     Named("engine-reconciler").
     WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
     Watches(&containersv1alpha1.Container{}, enqueueApp).
     Watches(&containersv1alpha1.Image{}, enqueueApp).
+    Watches(&containersv1alpha1.ContainerNamespace{}, enqueueApp).
     Watches(&containersv1alpha1.Volume{}, enqueueApp).
     Complete(r)
```

I2. **Watcher panic recovery can stall mirroring indefinitely** — `pkg/controllers/app/engine/controllers/docker_watcher.go:117-121` (important, regression)

```go
defer func() {
    if r := recover(); r != nil {
        log.Error(nil, "panic in Docker watcher goroutine", "recovered", r)
    }
}()
```

The new recover block at `pkg/controllers/app/engine/controllers/docker_watcher.go:117-121` only logs and exits. Unlike the normal event-stream failure path at `pkg/controllers/app/engine/controllers/docker_watcher.go:138-144`, it does not call `enqueueReconcile()`. The reconciler’s restart logic at `pkg/controllers/app/engine/controllers/engine_reconciler.go:86-105` only notices a dead watcher on a later reconcile, and `SetupWithManager()` only gets those reconciles from App/Container/Image/Volume watches or `reconcileChan` at `pkg/controllers/app/engine/controllers/engine_reconciler.go:430-437`. If the panic happens while the system is otherwise idle, no later event is guaranteed, so the watcher can stay dead forever.

Fix: enqueue a reconcile from the panic-recovery defer before returning.

```diff
 defer func() {
     if r := recover(); r != nil {
         log.Error(nil, "panic in Docker watcher goroutine", "recovered", r)
+        w.enqueueReconcile()
     }
 }()
```

**Suggestions**

S1. **Timeout cleanup matches instances by raw substring before sending signals** — `scripts/bats-with-timeout.sh:71-75` (suggestion, regression)

```bash
matches_our_instance() {
    case "$1" in
        *"${instance}"*) return 0 ;;
    esac
    return 1
}
```

`matches_our_instance()` at `scripts/bats-with-timeout.sh:71-75` treats any occurrence of `${instance}` as a match, and `our_leaked_pids()` / `sigquit_our_go_leaks()` / `sigkill_our_leaks()` use that result to send `SIGQUIT`/`SIGKILL` at `scripts/bats-with-timeout.sh:306-365`. That is broader than the comment promises: a default helper run with `RDD_INSTANCE=bats` from `bats/helpers/defaults.bash:2-3` would match every `bats-*` instance name used in `bats/Makefile:53-83`, so the cleanup path can target sibling control planes rather than just its own.

Fix: match only on exact instance markers such as `RDD_INSTANCE=<instance>`, `/.rd<instance>/`, or `rancher-desktop-<instance>`, rather than any substring.

```diff
 matches_our_instance() {
     case "$1" in
-        *"${instance}"*) return 0 ;;
+        *"RDD_INSTANCE=${instance}"*|*"/.rd${instance}/"*|*"rancher-desktop-${instance}"*) return 0 ;;
     esac
     return 1
 }
```

### Design Observations

**Concerns**

1. `(in-scope)` `ContainerNamespace` is documented as a mirror resource, but the implementation treats it as a startup-only singleton. Picking one model would simplify the system: either reconcile it like the other mirrors, or reject deletion and document it as protected state.

**Strengths**

1. Capturing the Docker event `Since` timestamp before `fullSync()` in `pkg/controllers/app/engine/controllers/docker_watcher.go:70-84` is the right way to close the startup blind spot between snapshot and stream subscription.
2. `watchtools.UntilWithSync` in `cmd/rdd/set.go:476-523` is a solid choice for the new CLI wait path; it handles watch restarts without turning transient disconnects into user-visible flakes.

### Testing Assessment

Local verification: `go test ./pkg/controllers/app/engine/controllers ./pkg/controllers/containers/container` passed. `go test ./cmd/rdd/...` could not run in this checkout because `pkg/guestagent/lima-guestagent.gz` is missing.

1. No test covers deleting `containernamespace/moby` while the watcher is active and asserting that it is recreated or rejected; the new suite only checks that deletion completes.
2. No test covers the watcher-panic path and verifies that recovery enqueues a reconcile and restarts mirroring.
3. No shell test exercises overlapping instance names for `bats-with-timeout.sh` before it sends destructive signals.

### Documentation Assessment

`docs/design/api_containers.md` says `ContainerNamespace` reflects engine namespaces, but it does not state whether deleting `moby` should be rejected or immediately self-healed. That behavior should be clarified alongside I1.

### Acknowledged Limitations

- **Code comment**: “TODO: changes that trigger a VM restart without changing `running` ... are not waited on” and the “Known hazard” note in [cmd/rdd/set.go](/private/var/folders/s2/9q03h0c11_s4fdw169wqz3qc0000gn/T/review-codex-pass-1-658437135/cmd/rdd/set.go:440). This PR already documents that backend switches on a running VM can satisfy the wait from stale `ContainerEngineReady`.
- **Code comment**: “TODO: use ShortDir() once the Lima template derives the socket path from the instance suffix instead of hardcoding `.rd2`” in [pkg/instance/instance.go](/private/var/folders/s2/9q03h0c11_s4fdw169wqz3qc0000gn/T/review-codex-pass-1-658437135/pkg/instance/instance.go:127), mirrored by the test TODO in [bats/tests/32-app-controllers/engine.bats](/private/var/folders/s2/9q03h0c11_s4fdw169wqz3qc0000gn/T/review-codex-pass-1-658437135/bats/tests/32-app-controllers/engine.bats:13). The engine watcher depends on that hardcoded socket too, so the multi-instance limitation is now more visible but already acknowledged.
- **Code comment**: “The namespace to be deleted must be empty ... when using moby backend: we do not delete the `moby` namespace” in [pkg/controllers/containers/containernamespace/container_namespace_webhooks.go](/private/var/folders/s2/9q03h0c11_s4fdw169wqz3qc0000gn/T/review-codex-pass-1-658437135/pkg/controllers/containers/containernamespace/container_namespace_webhooks.go:25). This PR’s new engine test makes that omission easier to hit.

### Coverage Summary

- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/Makefile` — S1
- `bats/helpers/commands.bash` — Reviewed, no issues
- `bats/tests/32-app-controllers/engine.bats` — Reviewed, no issues
- `cmd/app-controller/main.go` — Trivial
- `cmd/rdd/set.go` — Reviewed, no issues
- `docs/design/api_containers.md` — Reviewed, no issues
- `go.mod` — Trivial
- `go.sum` — Trivial
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/container_types.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/volume_types.go` — Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml` — Reviewed, no issues
- `pkg/controllers/app/engine/controller.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — I2
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I1, I2
- `pkg/controllers/app/engine/controllers/sync_containers.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_containers_test.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_images.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_volumes.go` — I1
- `pkg/controllers/containers/container/container_webhooks.go` — Reviewed, no issues
- `pkg/controllers/containers/container/crd.yaml` — Reviewed, no issues
- `pkg/controllers/containers/volume/crd.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/instance/instance.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Trivial
- `scripts/bats-with-timeout.sh` — S1


---

### [Gemini] — Gemini 3.1 Pro

### Executive Summary

This PR introduces the engine controller to mirror Docker container engine state (containers, images, and volumes) into Kubernetes resources within the embedded `rdd` control plane, enabling bidirectional sync for lifecycle operations. The review identified two critical regressions: a bash syntax error in the new CI timeout wrapper that will prevent all BATS tests from running, and a stalled reconcile loop when the Docker watcher restarts, which can cause user commands (like deletions or state changes) to be permanently ignored. 

### Findings

#### Critical Issues

C1. **Syntax error in bash script causes timeout wrapper to fail** — `scripts/bats-with-timeout.sh:159`, `scripts/bats-with-timeout.sh:349` (critical, regression)

```bash
run_with_timeout() {
    local seconds=$1
    shift
    "$ @" &
```

The bash variable expansion `"$ @"` contains an errant space between the `$` and `@`. In bash, variable expansions cannot contain spaces; this causes the script to evaluate `"$ @"` literally, resulting in a fatal `command not found: $ @` error instead of executing the wrapped target command. Consequently, all BATS test targets running through this wrapper will fail immediately.

Fix: Remove the space to correctly expand the positional parameters:
```diff
-    "$ @" &
+    "$@" &
```

C2. **Engine reconciler stalls indefinitely on watcher restart** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:148` (critical, regression)

```go
	if !watcherRunning {
		log.Info("App is running, starting Docker watcher")
		if err := r.startWatcher(ctx); err != nil {
			log.Error(err, "Failed to start Docker watcher")
			_ = r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "ConnectFailed", err.Error())
			return ctrl.Result{}, err
		}
		if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{}, nil
	}
```

When the Docker watcher is restarted (e.g., after the VM starts or the watcher goroutine crashes), the reconciler initializes the new watcher, performs a full sync, updates the `App` status, and then returns `ctrl.Result{}, nil`. If the user submitted a deletion request or a `spec.state` change while the watcher was offline, this event triggered the current reconcile, but the early return skips the `reconcileContainerSpecs` and `processFinalizers` phases. If the `App` condition was already `Connected`, `setEngineCondition` performs no update, meaning no subsequent watch event is emitted. The reconciler will then stall indefinitely, permanently ignoring the user's operation until an unrelated event triggers another reconcile.

Fix: Remove the early return so the reconciler falls through and processes pending specs and finalizers using the newly started watcher:
```diff
 		if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {
 			return ctrl.Result{}, err
 		}
-		return ctrl.Result{}, nil
 	}
 
 	// reconcileContainerSpecs + processFinalizers each issue one or more
```

#### Important Issues

I1. **Sequential finalizer processing blocks subsequent resource cleanup** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:281-289` (important, regression)

```go
	if err := r.processContainerFinalizers(ctx, w); err != nil {
		return err
	}
	if err := r.processImageFinalizers(ctx, w); err != nil {
		return err
	}
	return r.processVolumeFinalizers(ctx, w)
```

The finalizer processing logic short-circuits on the first error encountered. As demonstrated by the integration test (`deleting an in-use Image keeps the finalizer until the container is removed`), deleting an in-use image correctly returns an error and keeps the finalizer in place. However, because `processFinalizers` returns immediately upon this error, `processVolumeFinalizers` is skipped entirely. This means one stuck container or an in-use image will perpetually starve volume deletions across the entire namespace until the blockage is resolved.

Fix: Collect errors across all finalizer types rather than short-circuiting:
```diff
 	if err := r.processContainerFinalizers(ctx, w); err != nil {
-		return err
+		errs = append(errs, err)
 	}
 	if err := r.processImageFinalizers(ctx, w); err != nil {
-		return err
+		errs = append(errs, err)
 	}
-	return r.processVolumeFinalizers(ctx, w)
+	if err := r.processVolumeFinalizers(ctx, w); err != nil {
+		errs = append(errs, err)
+	}
+	return errors.Join(errs...)
```

I2. **Missing GVK causes empty resource kinds in cleanup error logs** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:226` (important, regression)

```go
		kind := obj.GetObjectKind().GroupVersionKind().Kind

		retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
// ...
		if retryErr != nil {
			errs = append(errs, fmt.Errorf("failed to remove finalizer from %s/%s: %w",
				kind, obj.GetName(), retryErr))
			continue
		}
```

The `deleteAllOfType` function attempts to extract the resource kind using `GroupVersionKind()`. However, items extracted from a `List` response via client-go `meta.ExtractList` typically have an empty `TypeMeta` (and thus an empty `GroupVersionKind`) because the Kubernetes API server only populates these fields for the top-level list object, not the individual items. This causes the error log to format as `failed to remove finalizer from /volume-name` instead of `Volume/volume-name`, making debugging confusing.

Fix: Derive the type information directly from the concrete object type:
```diff
 		if retryErr != nil {
-			errs = append(errs, fmt.Errorf("failed to remove finalizer from %s/%s: %w",
-				kind, obj.GetName(), retryErr))
+			errs = append(errs, fmt.Errorf("failed to remove finalizer from %T %s: %w",
+				obj, obj.GetName(), retryErr))
 			continue
 		}
```

#### Suggestions

S1. **Watcher startup is fragile to individual container inspection failures** — `pkg/controllers/app/engine/controllers/sync_containers.go` (suggestion, regression)

In `syncAllContainers` (and similar image/volume syncs), inspection errors are aggregated but ultimately returned by `fullSync`, which aborts watcher startup and sets `ContainerEngineReady=False`. If a single object permanently fails inspection due to engine-side corruption or race conditions, the watcher will never successfully start, preventing mirroring for all other healthy resources. Consider logging individual inspection errors and skipping the problematic item rather than failing the entire sync loop.

### Design Observations

**Strengths**
- The recovery and graceful retry mechanism for unexpected Docker event stream panics (`defer func() { if r := recover() ... }`) handles third-party data stream fragility well.
- Validating state transitions within `immutableValidator` effectively prevents drift between declarative specs and imperative Docker commands. 

### Testing Assessment

The integration tests accurately reflect real-world scenarios but contain a gap matching the I1 finding:
1. **Resource Deletion Starvation (High Risk):** There are no BATS tests verifying that deleting a volume successfully completes while an image is stuck in the `Terminating` state.

### Acknowledged Limitations

- **O(N) reconciliation performance cost**: The author acknowledges that polling every mirrored object via `List()` during specs and finalizer processing scales poorly and identifies a long-term goal of migrating to per-resource reconcilers with specific watch predicates. (`pkg/controllers/app/engine/controllers/engine_reconciler.go:154-159`)
- **CLI waits miss non-running state changes**: The author notes that properties triggering a VM restart without changing the `running` state (e.g., `containerEngine.name`) do not appropriately block, pending a fully observed "Reconciled" condition to land in `App`. (`cmd/rdd/set.go:440-449`)
- **Hardcoded Docker socket path limits multi-tenancy**: The implementation acknowledges that the path `~/.rd2/docker.sock` is hardcoded instead of being derived dynamically via `ShortDir()`, currently blocking multi-instance parallel BATS runs from safely coexisting on the same host. (`pkg/instance/instance.go:127-133`)

### Coverage Summary

- `.github/workflows/bats.yaml`: Trivial
- `bats/Makefile`: C1
- `bats/helpers/commands.bash`: Trivial
- `bats/tests/32-app-controllers/engine.bats`: Reviewed, no issues
- `cmd/app-controller/main.go`: Trivial
- `cmd/rdd/set.go`: Reviewed, no issues
- `docs/design/api_containers.md`: Reviewed, no issues
- `go.mod`: Trivial
- `go.sum`: Trivial
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go`: Trivial
- `pkg/apis/containers/v1alpha1/container_types.go`: Reviewed, no issues
- `pkg/apis/containers/v1alpha1/volume_types.go`: Trivial
- `pkg/controllers/app/app/lima-template.yaml`: Reviewed, no issues
- `pkg/controllers/app/engine/controller.go`: Trivial
- `pkg/controllers/app/engine/controllers/docker_watcher.go`: Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go`: C2, I1, I2
- `pkg/controllers/app/engine/controllers/sync_containers.go`: S1
- `pkg/controllers/app/engine/controllers/sync_containers_test.go`: Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_images.go`: Reviewed, no issues
- `pkg/controllers/app/engine/controllers/sync_volumes.go`: Reviewed, no issues
- `pkg/controllers/containers/container/container_webhooks.go`: Reviewed, no issues
- `pkg/controllers/containers/container/crd.yaml`: Trivial
- `pkg/controllers/containers/volume/crd.yaml`: Trivial
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go`: Reviewed, no issues
- `pkg/instance/instance.go`: Reviewed, no issues
- `pkg/service/cmd/service.go`: Trivial
- `scripts/bats-with-timeout.sh`: C1

---

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-12 |
| **Commit** | `a4ee48f Address review round 3 findings for PR #295` |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: Panic recovery in watcher goroutine doesn't wake the reconciler | Fixed |
| 2 | Important #2: `processFinalizers` aborts at the first per-type error | Fixed |
| 3 | Important #3: `ContainerNamespace/moby` is not re-created after user delete | Skipped |
| 4 | Important #4: Early return after `startWatcher` skips pending specs/finalizers | Fixed |
| 5 | Suggestion #1: `since` Docker events filter uses host wall clock | Fixed |
| 6 | Suggestion #2: Container mirror finalizer is not re-added after user removes it | Fixed |
| 7 | Suggestion #3: Concurrent `App.Status().Update` from AppReconciler and EngineReconciler | Fixed |
| 8 | Suggestion #4: `instance.DockerSocket()` unconditionally creates `~/.rd2/` | Skipped |
| 9 | Suggestion #5: `ContainerEngineReady=True/NotApplicable` is semantically misleading | Commented |
| 10 | Suggestion #6: `startWatcher` failure swallows the `setEngineCondition` error | Fixed |
| 11 | Suggestion #7: `processVolumeFinalizers` does not guard against empty `Status.Name` | Fixed |
| 12 | Suggestion #8: Docker client lacks API-version negotiation | Skipped |
| 13 | Suggestion #9: `matches_our_instance` matches on raw substring before destructive cleanup | Fixed |
| 14 | Suggestion #10: Cleanup error messages have an empty resource kind | Fixed |
| 15 | Suggestion #11: Watcher startup is fragile to individual item inspection failures | Fixed |
| 16 | Suggestion #12: `BATS_TIMEOUT=0` bypasses the new support-bundle wrapper entirely | Commented |
| 17 | Testing Gap #1-8: Watcher death, ContainerNamespace re-creation, watcher restart, concurrent Status, empty Volume, fullSync race, RDD_INSTANCE overlap, envtest unit tests | Skipped |
| 18 | Documentation Gap #5: engine.bats stale `volumeK8sName` reference | Documentation updated |

Notes:
- **I3 skipped**: The review's options (watch+self-heal or webhook+protect) both ran into complexity around letting the controller delete moby while blocking user deletes. Decision: defer namespace lifecycle management until containerd mirroring lands, since that's when we need real multi-namespace support and the design can be planned holistically.
- **S4 skipped**: The underlying `~/.rd2` hardcode is already tracked by a TODO and shared with the Lima template; fixing it properly is part of that effort.
- **S8 skipped**: `WithAPIVersionNegotiation()` is deprecated in moby client v0.4.0 because negotiation is now enabled by default — the option is a no-op.
- **Testing gaps skipped**: Most require new infrastructure (fault injection, envtest, shell unit tests). None are regressions; the fixes they would cover are straightforward enough to verify by inspection.
