# Deep Review: 20260316-134751-pr-236

| | |
|---|---|
| **Date** | 2026-03-16 13:47 |
| **PR** | [#236](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/236) — Implement App controller lifecycle for LimaVM |
| **Branch** | `app-controller-create-delete-limavm` |
| **Commits** | `9284b6c` Implement App controller lifecycle for LimaVM<br>`e300882` Add Bats test to App Controller |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Rework** — the `/owned` finalizer on the LimaVM is never removed during deletion, deadlocking the entire teardown. The embedded template also hard-codes amd64, breaking Apple Silicon hosts. |
| **Wall-clock time** | 30 min 16 s |

---

## Consolidated Review

### Executive Summary

This PR adds the App controller's core reconciliation loop: creating a LimaVM from an embedded template, propagating `spec.running`, mirroring status conditions, and deleting owned resources via a finalizer. The design is well-structured, with correct use of existing patterns for finalizer management and owner references. However, a critical deadlock prevents App deletion from ever completing: the App controller sets `OwnedFinalizerName` on the LimaVM at creation but never removes it during teardown. The embedded template also hard-codes amd64, which breaks Apple Silicon.

---

### Critical Issues

1. **LimaVM stuck in Terminating: `/owned` finalizer never removed** — `pkg/controllers/app/app/controllers/app_controller.go:75-84` [Gemini 3.1 Pro]

```go
case !base.IsBeingDeleted(limaVM):
    if err := r.Delete(ctx, limaVM); err != nil && !apierrors.IsNotFound(err) {
        return ctrl.Result{}, fmt.Errorf("failed to delete LimaVM: %w", err)
    }
    log.Info("Requested LimaVM deletion, waiting for teardown")
    return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
default:
    // LimaVM deletion is in progress; keep requeueing until it is gone.
    log.Info("Waiting for LimaVM deletion to complete")
    return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
```

The App controller creates the LimaVM with `OwnedFinalizerName` at line 132. During App deletion, the App controller calls `r.Delete(ctx, limaVM)` (line 76) and polls until the LimaVM disappears. But it never strips `OwnedFinalizerName` from the LimaVM. The LimaVM controller's `handleDeletion` (`limavm_lifecycle.go:32-78`) removes only `CleanupFinalizerName` (line 71) — its `DeleteOwnedResources` call (line 65) strips `/owned` from resources owned *by* the LimaVM (ConfigMaps), not from the LimaVM itself. With `/owned` still present, the LimaVM remains in Terminating permanently, and the App controller polls forever.

The BATS test at `app.bats:177-179` (`rdd ctl wait --for=delete app/"${APP_NAME}" --timeout=90s`) will hang and time out.

Fix: Either call `base.RemoveOwnedFinalizer(ctx, r.Client, limaVM)` before issuing the delete, or replace the manual find-and-delete with `base.DeleteOwnedResources(ctx, r.Client, &app, r.Manager)`, which strips `/owned` from all owned resources and deletes them in one pass. The second option also handles the orphan race in finding #3.

---

### Important Issues

1. **Embedded template hard-codes amd64 — arm64 hosts will fail** — `pkg/controllers/app/app/lima.yaml:3` [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

```yaml
images:
- location: https://github.com/rancher-sandbox/rancher-desktop-opensuse/releases/download/v0.1.0/distro.v0.1.0.amd64.qcow2
```

The template points to `distro.v0.1.0.amd64.qcow2` unconditionally. On arm64 hosts (Apple Silicon), Lima will fail to match a suitable image, and `limainstance.Prepare()` will fail. Rosetta is also disabled (`rosetta: enabled: false`). Issue #227 explicitly requires "Select correct arch (amd64/arm64) and image format (qcow2/tar) from host."

The template is marked temporary (`controller.go:18`), so this may be intentional as a first pass. But the BATS test cannot pass on arm64 as-is.

Fix: Generate the template dynamically (e.g., substitute the URL based on `runtime.GOARCH`), or add both architecture entries to the image list (Lima selects the matching one), or gate the test with a skip guard.

2. **Orphaned `rd-template` ConfigMap in pre-adoption deletion race** — `pkg/controllers/app/app/controllers/app_controller.go:62-84` [Codex GPT 5.4]

```go
case apierrors.IsNotFound(err):
    // LimaVM is gone. Clean up the input ConfigMap if it still exists
    inputCM := &corev1.ConfigMap{}
    if cmErr := r.Get(ctx, client.ObjectKey{Name: inputConfigMapName, Namespace: namespace}, inputCM); cmErr == nil {
        if cmErr := r.Delete(ctx, inputCM); cmErr != nil && !apierrors.IsNotFound(cmErr) {
            return ctrl.Result{}, fmt.Errorf("failed to delete input ConfigMap during cleanup: %w", cmErr)
        }
    }
```

The LimaVM webhook creates `rd-template` with `OwnedFinalizerName` but without an ownerReference (line 210-212 of `limavm/controller.go`). The reconciler adds the ownerReference in a subsequent reconcile (line 232-242). If the App is deleted during this window, the App controller's cleanup (lines 62-71) only looks for the input `rd` ConfigMap, not `rd-template`. The LimaVM's `DeleteOwnedResources` won't find `rd-template` either, since it has no ownerReference yet. The ConfigMap validator (`controller.go:276`) rejects deleting ConfigMaps that carry the `/owned` finalizer, so `rd-template` becomes undeletable.

The window is narrow (one reconcile cycle), but the result is permanent: an orphan resource that resists manual cleanup.

Fix: Have the App controller's cleanup also delete `rd-template` by deterministic name, or adopt `DeleteOwnedResources` for the entire App cleanup path (it would handle both the input ConfigMap and the LimaVM in one pass).

---

### Suggestions

1. **`ObservedGeneration` forwarded from LimaVM context to App** — `pkg/controllers/app/app/controllers/app_controller.go:181` [Claude Opus 4.6, Gemini 3.1 Pro]

```go
statusChanged = apimeta.SetStatusCondition(&app.Status.Conditions, metav1.Condition{
    // ...
    ObservedGeneration: cond.ObservedGeneration,
    LastTransitionTime: cond.LastTransitionTime,
}) || statusChanged
```

`ObservedGeneration` on a condition conventionally refers to the `.metadata.generation` of the resource that owns the condition. Copying the LimaVM's value into the App makes it reference the LimaVM's generation, which is meaningless on the App resource. Currently harmless because the LimaVM controller does not set `ObservedGeneration` (defaults to 0), but becomes incorrect if it ever does.

Fix: Set `ObservedGeneration: app.Generation`, or omit the field.

2. **Swallowed API errors during ConfigMap cleanup** — `pkg/controllers/app/app/controllers/app_controller.go:160-162` [Gemini 3.1 Pro]

```go
} else if !apierrors.IsNotFound(err) {
    log.Error(err, "Failed to fetch input ConfigMap")
}
```

If `r.Get` returns a transient error (e.g., API timeout), the controller logs it but continues to `ctrl.Result{}, nil`, foregoing retry. The input ConfigMap has an ownerReference to the App, so Kubernetes GC handles cleanup eventually. The App controller also re-reconciles on LimaVM status changes, so the window is small. Still, returning the error would be more correct.

Fix: Return `ctrl.Result{}, err` on transient Get failures.

3. **Swallowed API errors during teardown prematurely drop App finalizer** — `pkg/controllers/app/app/controllers/app_controller.go:68-71` [Gemini 3.1 Pro]

```go
} else if !apierrors.IsNotFound(cmErr) {
    log.Error(cmErr, "Failed to fetch input ConfigMap during cleanup")
}
return ctrl.Result{}, base.RemoveCleanupFinalizer(ctx, r.Client, &app)
```

Same pattern as above, but during deletion. The input ConfigMap has an ownerReference, so GC provides a safety net. But the controller should not proceed to remove the cleanup finalizer before verifying cleanup succeeded.

Fix: Return `ctrl.Result{}, cmErr` when `cmErr` is not nil and not NotFound.

4. **Condition mirroring never prunes stale conditions** — `pkg/controllers/app/app/controllers/app_controller.go:175-184` [Claude Opus 4.6, Gemini 3.1 Pro]

The loop over `limaVM.Status.Conditions` adds or updates conditions on the App but never removes conditions that no longer exist on the LimaVM. In practice, LimaVM conditions transition between states rather than disappearing, so this risk is low. If the LimaVM is deleted and recreated with a different condition set, stale conditions would linger.

Fix: Before the loop, diff condition types and remove any App conditions absent from the LimaVM. Or accept this as a known limitation with a comment.

5. **Missing `local_teardown_file` leaves Lima instance directories behind** — `bats/tests/32-app-controllers/app.bats` [Claude Opus 4.6]

The test file has no `local_teardown_file`. The existing LimaVM tests (e.g., `limavm-instance.bats`) define one to clean up `$RDD_LIMA_HOME/<instance>`. Without it, the default teardown just runs `rdd svc stop`, which kills the control plane mid-deletion, potentially leaving orphaned instance directories.

Fix: Add a `local_teardown_file` that cleans up `${RDD_LIMA_HOME}/${VM_NAME}`, matching the pattern in `limavm-instance.bats`.

6. **Dropped EventRecorder removes Kubernetes Events from App** — `pkg/controllers/app/app/controllers/app_controller.go` [Gemini 3.1 Pro]

The PR removes the `Recorder events.EventRecorder` field and the `setCondition` helper that emitted `ConditionChanged` events. The old code was a stub, and the App controller is now a thin orchestration layer — the LimaVM controller emits the real lifecycle events. Still, emitting events on App status transitions would improve observability.

Fix: Restore the `Recorder` and emit an event when `statusChanged` is true after condition mirroring.

7. **Typo in comment** — `pkg/controllers/app/app/controllers/app_controller.go:164` [Claude Opus 4.6]

```go
// Propagate spec.running from App into to the LimaVM.
```

Fix: `// Propagate spec.running from App to the LimaVM.`

---

### Design Observations

1. **The deletion path should use `DeleteOwnedResources` instead of manual find-and-delete** (in-scope)

The App controller manually finds the LimaVM by name and deletes it. This bypasses the established pattern where `DeleteOwnedResources` strips `/owned` finalizers and deletes owned resources in one pass. Adopting `DeleteOwnedResources` for App deletion eliminates the `/owned` deadlock (Critical #1), handles the `rd-template` orphan race (Important #2) via transitive ownership, and reduces the code to a single function call. It also means any future owned resources are automatically covered without modifying the deletion path.

2. **Bootstrap handoff through two ConfigMaps creates avoidable lifecycle coupling** (future)

The current flow creates an input `rd` ConfigMap, has the LimaVM webhook copy it into `rd-template`, then has the reconciler adopt `rd-template`. This split ownership creates a window where `rd-template` has no owner. A simpler design would have the App controller create the final template directly, or have the LimaVM webhook set the ownerReference atomically. Either approach eliminates the orphan window and makes deletion semantics easier to reason about.

3. **Embedded template is a temporary bootstrapping mechanism** (future)

The template is embedded via `//go:embed lima.yaml` and marked temporary (`controller.go:18`). When this moves to a proper template source (e.g., the TwoCows project per issue #227), the `LimaTemplateData string` field will need to support runtime resolution (e.g., host architecture selection, dynamic URLs). The current string injection design makes this transition straightforward.

---

### Testing Assessment

The BATS test suite covers the happy path well: singleton creation, finalizer setup, ownership, template lifecycle, condition mirroring, running state propagation, validation webhooks, and deletion cascade.

Untested scenarios, ranked by risk:

1. **App deletion** — the test exercises this path (`app.bats:173-179`) but will deadlock due to the `/owned` finalizer bug. The test validates the wrong thing: it proves the bug exists.
2. **Running condition mirroring** — only the `Created` condition is verified on the App (tests 15-16). After toggling `spec.running`, the tests check `LimaVM.spec.running` but never verify that the App mirrors the `Running` condition back.
3. **arm64 architecture** — no test verifies that the template works on the review machine's architecture. The `Created` condition wait (`app.bats:114-116`) will fail on Apple Silicon.
4. **Early-deletion orphan race** — no test deletes the App immediately after LimaVM creation but before the reconciler sets the ownerReference on `rd-template`.
5. **Error paths** — no test covers webhook rejection of invalid template data or instance preparation failure.

---

### Documentation Assessment

Code comments are clear and accurate. The comment on `controller.go:18` ("This is temporary") documents the intended evolution. The `app_types.go` immutability comment correctly explains why `spec.namespace` is immutable. No documentation gaps for the current scope.

---

## Agent Performance Retro

### Claude Opus 4.6

- **Unique contributions**: Missing `local_teardown_file` in BATS, typo in comment
- **Accuracy**: Good — no false positives. Correctly noted the polling deletion was "belt-and-suspenders" with the Owns watch, but did not identify that the manual deletion path bypasses `RemoveOwnedFinalizer`, missing the critical deadlock.
- **Depth**: Moderate. Traced into the webhook and ConfigMap lifecycle. Did not trace into `handleDeletion` in the LimaVM controller to verify finalizer removal.
- **Signal-to-noise**: High — all findings were actionable, though none were critical.

### Codex GPT 5.4

- **Unique contributions**: Orphaned `rd-template` in pre-adoption deletion race
- **Accuracy**: Good — the orphan finding was correct and well-reasoned with specific code path citations. No false positives.
- **Depth**: Good cross-controller tracing. Identified the ownership gap between webhook creation and reconciler adoption. Did not discover the `/owned` deadlock despite examining the same deletion path.
- **Signal-to-noise**: High — concise report, every finding was substantive.

### Gemini 3.1 Pro

- **Unique contributions**: `/owned` finalizer deadlock (the only critical finding), swallowed API errors, dropped EventRecorder
- **Accuracy**: Mixed. The critical finding was correct and the most valuable insight across all three reviews. However, the amd64 gap was rated CRITICAL (should be IMPORTANT — it's a missing feature, not a runtime bug in the controller logic). The swallowed error findings were rated IMPORTANT but their practical impact is low (ownerReference GC provides a safety net). The dropped EventRecorder was also overrated as IMPORTANT.
- **Depth**: Deep — traced the full finalizer lifecycle from creation through deletion across both controllers. This is what enabled the critical finding.
- **Signal-to-noise**: Moderate — severity inflation diluted the signal, but the critical finding alone justified the review.

### Summary Table

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 9:33 | 4:11 | 11:15 |
| Critical | 0 | 0 | 1 |
| Important | 1 | 2 | 4 |
| Suggestion | 4 | 1 | 2 |
| Design observations | 3 | 1 | 0 |
| False positives | 0 | 0 | 0 |
| Unique insights | 2 | 1 | 3 |

**Overall assessment**: Gemini 3.1 Pro provided the most value by a wide margin. It was the only agent to find the critical `/owned` finalizer deadlock — the one bug that would prevent this PR from ever working correctly. Both Claude and Codex examined the same deletion code and missed it. Codex contributed the second-most-valuable unique finding (the orphan race). Claude's contributions were limited to lower-severity items. Gemini's severity inflation is a recurring issue but is easy to calibrate for; its depth of cross-controller tracing is what makes it stand out.

---

## Skill Improvement Recommendations

1. The review dimensions are comprehensive. No gaps identified.
2. The prompt's instruction to trace into called functions and related controllers worked well for Gemini (which found the deadlock) but was not followed deeply enough by Claude or Codex. Consider adding an explicit checklist item: "For every finalizer set in the change, verify which code path removes it."
3. The repo-specific context about single-user kube-apiserver helped calibrate severity — Codex did not inflate the orphan race to CRITICAL. However, Gemini still rated the amd64 gap as CRITICAL despite the note about calibrating severity.
4. All three agents missed one scenario: what happens if the App is deleted while the LimaVM is still in the "preparing" phase (sentinel file present). The LimaVM controller's `handleDeletion` does not check for the sentinel, which could leave a stale sentinel file on disk.
5. No process improvements identified. The three-agent parallel approach and the clarification round worked as designed.

---

## Appendix: Original Reviews

### Claude Opus 4.6

Now I have all the information I need. Let me compile the review.

---

## Executive Summary

This PR implements the App controller's core lifecycle: creating a LimaVM from an embedded template, propagating `spec.running`, mirroring status conditions, and cleaning up via a finalizer on deletion. The reconciliation loop is well-structured with correct error handling and proper use of the existing finalizer patterns. The main gap is that the embedded template only supports amd64, which will cause instance preparation to fail on arm64 (Apple Silicon) hosts — including the developer's own machine.

## Findings

### Critical Issues

None.

### Important Issues

1. **Embedded template only provides an amd64 image — arm64 hosts will fail** — `pkg/controllers/app/app/lima.yaml:3` (important, gap)

```yaml
images:
- location: https://github.com/rancher-sandbox/rancher-desktop-opensuse/releases/download/v0.1.0/distro.v0.1.0.amd64.qcow2
```

The template has a single amd64 qcow2 image. On arm64 hosts (this machine is arm64), `limainstance.Prepare()` at `limavm_controller.go:318` will fail because Lima has no matching image for the host architecture. The LimaVM will get `Created=False/CreateFailed`, and the BATS test will timeout at the "wait for LimaVM Created condition" step (line 115 of `app.bats`). Issue #227 explicitly requires "Select correct arch (amd64/arm64) and image format (qcow2/tar) from host."

The template is marked temporary (`controller.go:18`), so this may be intentional as a first pass. But the BATS test can't pass on arm64 as-is.

Fix: Add an arm64 image entry (the rancher-desktop-opensuse releases page likely has `distro.v0.1.0.aarch64.qcow2` or `.tar`), or gate the BATS test with a `skip_on_arm64` guard.

### Suggestions

2. **Condition mirroring is add/update only — conditions removed from LimaVM persist stale on App** — `pkg/controllers/app/app/controllers/app_controller.go:175-184` (suggestion, gap)

```go
for _, cond := range limaVM.Status.Conditions {
    statusChanged = apimeta.SetStatusCondition(&app.Status.Conditions, metav1.Condition{
        ...
    }) || statusChanged
}
```

The loop at lines 175–184 iterates over LimaVM conditions and sets/updates them on App, but never removes App conditions that no longer exist on LimaVM. In practice this is low risk — LimaVM conditions (`Created`, `Running`) transition between states rather than being removed. But if the LimaVM is deleted and recreated with a different condition set, stale conditions would linger on the App.

Fix: Before the loop, diff the App's condition types against the LimaVM's and remove any App conditions not present on the LimaVM. Or accept this as a known limitation with a comment.

3. **`ObservedGeneration` forwarded from LimaVM generation context to App** — `pkg/controllers/app/app/controllers/app_controller.go:181` (suggestion, regression)

```go
ObservedGeneration: cond.ObservedGeneration,
```

`ObservedGeneration` on a condition conventionally refers to the `.metadata.generation` of the resource that owns the condition. Copying it from LimaVM to App makes it reference the LimaVM's generation, which is meaningless on the App resource. Currently harmless because the LimaVM controller's `setCondition` at `limavm_controller.go:428-433` doesn't set `ObservedGeneration` (defaults to 0), but this becomes incorrect if the LimaVM controller ever starts using it.

Fix: Set `ObservedGeneration: app.Generation` or omit the field (let it default to 0).

4. **Missing `local_teardown_file` leaves Lima instance directories behind** — `bats/tests/32-app-controllers/app.bats` (suggestion, gap)

The test file has no `local_teardown_file`. The existing LimaVM tests (e.g., `limavm-instance.bats:32-39`) define `local_teardown_file` to remove Lima instance directories under `$RDD_LIMA_HOME`. Without this, the last test ("verify App can be recreated after deletion") calls `delete_app` but doesn't wait for full teardown. The default `teardown_file` in `load.bash:87-93` just runs `rdd svc stop`, which kills the control plane mid-deletion, potentially leaving orphaned instance files on disk.

Fix: Add a `local_teardown_file` that cleans up `${RDD_LIMA_HOME}/${VM_NAME}`, matching the pattern in `limavm-instance.bats`.

5. **Typo in comment** — `pkg/controllers/app/app/controllers/app_controller.go:164` (suggestion, regression)

```go
// Propagate spec.running from App into to the LimaVM.
```

"into to" should be "to".

Fix: `// Propagate spec.running from App to the LimaVM.`

### Design Observations

1. **Input ConfigMap lifecycle is cleanly separated** (in-scope)

The pattern of creating a transient input ConfigMap, letting the LimaVM webhook copy its data into the managed `<name>-template` ConfigMap, and then deleting the input is well-designed. The App controller deletes the input ConfigMap at lines 154–162 on every reconcile where LimaVM exists, which is idempotent (the Get returns NotFound on subsequent passes). The webhook reads the input synchronously during admission (at `controller.go:250`), so there's no race between the copy and the deletion.

2. **Polling-based deletion wait is appropriate for cross-controller coordination** (in-scope)

The deletion path at lines 55–85 uses `RequeueAfter: 2s` to poll for LimaVM deletion rather than trying to watch across controllers. This is the right choice: the App controller already `Owns()` the LimaVM (line 199), so the Owns handler will also fire when the LimaVM is deleted, giving belt-and-suspenders coverage. The 2s poll ensures forward progress even if a watch event is missed.

3. **Consider future template source** (future)

The template is embedded via `//go:embed lima.yaml` and marked temporary (`controller.go:18`). When this moves to a proper template source (e.g., the TwoCows project per issue #227), the `LimaTemplateData string` field on `AppReconciler` will need to change to something that supports runtime resolution (e.g., reading from a ConfigMap or using the TwoCows API). The current string injection design makes this transition straightforward.

### Testing Assessment

The BATS test suite provides good coverage of the happy path:

- Singleton creation, finalizer setup, and LimaVM ownership (tests 1–9)
- Template lifecycle: webhook copy, input ConfigMap deletion (tests 10–13)
- Condition mirroring with timestamp preservation (tests 14–16)
- Running state propagation in both directions (tests 17–18)
- Validation webhooks: singleton name and namespace immutability (tests 19–20)
- Deletion cascade and re-creatability (tests 21–24)

Untested scenarios, ranked by risk:

1. **Running condition mirroring** — only the `Created` condition is verified on the App (test 15–16). The `Running` condition is never checked after `spec.running` is toggled, so the mirroring loop is only partially validated.
2. **Error paths** — no test for what happens when the LimaVM webhook rejects the template (e.g., invalid YAML) or when instance preparation fails. The App would be stuck retrying without any status condition indicating the problem.
3. **Re-creation after external LimaVM deletion** — the test covers App deletion triggering LimaVM deletion, but not the scenario where someone deletes the LimaVM directly. The App controller would recreate it, but this path is untested.

### Documentation Assessment

The code comments are clear and accurate. The comment on `controller.go:18` ("This is temporary") is appropriate. The `app_types.go` immutability comment at line 18–19 correctly explains why `spec.namespace` is immutable. No documentation gaps for the current scope.

---

### Codex GPT 5.4

**Verdict**

Needs fixes before merge.

### Executive Summary

This PR adds the App singleton reconciler for LimaVM lifecycle management: it bootstraps a template ConfigMap, creates and deletes the owned LimaVM, propagates `spec.running`, and mirrors LimaVM conditions back onto the App. The main issues are that the embedded template is hardcoded to the amd64 image, so it does not satisfy the host-selection requirement on arm64 systems, and the new App deletion path introduces a race that can leave `rd-template` orphaned and undeletable if the App is removed before the LimaVM controller adopts that ConfigMap.

### Findings

**Critical Issues**

None.

**Important Issues**

1. **Embedded Lima template is hardcoded to the amd64 image** — `pkg/controllers/app/app/lima.yaml:1-3` (important, regression)

```yaml
vmType: qemu
images:
- location: https://github.com/rancher-sandbox/rancher-desktop-opensuse/releases/download/v0.1.0/distro.v0.1.0.amd64.qcow2
```

The new bootstrap template at `pkg/controllers/app/app/lima.yaml:3` always points at `distro.v0.1.0.amd64.qcow2`, and the App reconciler copies that file verbatim into the input ConfigMap at `pkg/controllers/app/app/controllers/app_controller.go:111-118`. On this review machine `uname -m` is `arm64`, and the release actually publishes separate `arm64` assets, so the implementation never selects the correct architecture on Apple Silicon. That misses the stated requirement from issue #227 to select the correct arch/image from the host, and it means arm64 users will create a VM from the wrong artifact.

Fix: generate the template dynamically from host facts instead of embedding a single static file. At minimum, branch on architecture (`amd64` vs `arm64`) before writing `images[].location`; if image format is also host-dependent, derive that at the same point and cover both cases in tests.

2. **Deleting the App before the LimaVM reconciler adopts `rd-template` can orphan an undeletable ConfigMap** — `pkg/controllers/app/app/controllers/app_controller.go:62-84` (important, regression)

```go
case apierrors.IsNotFound(err):
	// LimaVM is gone. Clean up the input ConfigMap if it still exists
	inputCM := &corev1.ConfigMap{}
	if cmErr := r.Get(ctx, client.ObjectKey{Name: inputConfigMapName, Namespace: namespace}, inputCM); cmErr == nil {
		if cmErr := r.Delete(ctx, inputCM); cmErr != nil && !apierrors.IsNotFound(cmErr) {
			return ctrl.Result{}, fmt.Errorf("failed to delete input ConfigMap during cleanup: %w", cmErr)
		}
	}
	return ctrl.Result{}, base.RemoveCleanupFinalizer(ctx, r.Client, &app)
```

The App controller now creates the LimaVM and relies on the LimaVM webhook to create `rd-template` first, with the LimaVM reconciler attaching the owner reference later at `pkg/controllers/lima/limavm/controllers/limavm_controller.go:231-242`. If the App is deleted in that window, the new App cleanup path at `pkg/controllers/app/app/controllers/app_controller.go:62-84` only waits for the LimaVM to disappear and deletes the bootstrap `rd` ConfigMap; it never looks for `rd-template`. LimaVM deletion also will not clean up that ConfigMap, because `base.DeleteOwnedResources()` only touches objects already owned by the LimaVM UID at `pkg/controllers/base/finalizer.go:154-163`, and the webhook-created template ConfigMap is created without that owner reference at `pkg/controllers/lima/limavm/controller.go:202-237`. The leftover `rd-template` still carries `OwnedFinalizerName`, and the ConfigMap validator rejects deleting any template ConfigMap that still has that finalizer at `pkg/controllers/lima/limavm/controller.go:272-279`, so this race leaves an orphan that is effectively stuck.

Fix: make cleanup resilient to the pre-adoption window. Practical options are: delete `rd-template` by deterministic name during App cleanup, teach LimaVM deletion to fall back to `GetTemplateConfigMapName()` when owner discovery finds nothing, or eliminate the two-phase ownership gap entirely by creating/adopting the final template ConfigMap atomically.

**Suggestions**

1. **The new bats coverage stops short of validating real VM start/stop behavior** — `bats/tests/32-app-controllers/app.bats:114-151` (suggestion, regression)

```bash
@test "wait for LimaVM Created condition to be set" {
    rdd ctl wait --for=condition=Created \
        limavm/"${VM_NAME}" --namespace "${RDD_NAMESPACE}" --timeout=60s
}

@test "propagating running=true updates LimaVM spec.running" {
    rdd ctl patch app "${APP_NAME}" --type='merge' -p='{"spec":{"running":true}}'
    rdd ctl wait --for=jsonpath='{.spec.running}'=true \
        limavm/"${VM_NAME}" --namespace "${RDD_NAMESPACE}" --timeout=30s
}
```

The new tests verify that `Created=True` appears and that App patches reach `LimaVM.spec.running` at `bats/tests/32-app-controllers/app.bats:114-151`, but they never wait for the LimaVM `Running` condition or for the App to mirror `Running`. That means the suite would not catch the architecture bug above, nor any breakage in the actual start/stop lifecycle; it only proves metadata propagation.

Fix: extend the bats flow to set `running: true`, wait for `limavm.status.conditions[type=Running].status=True`, and verify the App mirrors that condition back. Add a fast delete-after-create case as well to cover the orphan race.

### Design Observations

1. **The bootstrap handoff through two ConfigMaps creates avoidable lifecycle coupling** (in-scope)

The current flow creates a temporary `rd` ConfigMap in the App controller, asks the LimaVM webhook to copy it into `rd-template`, then asks the LimaVM reconciler to adopt `rd-template` later. That split ownership is what creates the orphan window above. A simpler design is to have the App controller render the final template directly once it knows the host facts, or to have LimaVM cleanup always own deletion of `GetTemplateConfigMapName()` by name even before owner adoption. Either approach removes a class of timing bugs and makes deletion semantics much easier to reason about.

### Testing Assessment

1. No test deletes the App immediately after LimaVM creation but before the LimaVM reconciler sets the owner reference on `rd-template`; that is the highest-risk missing scenario because it can strand an undeletable resource.
2. No test covers host-dependent template selection across `amd64` and `arm64`, even though the feature requirement explicitly depends on host architecture/image choice.
3. The `running=true` / `running=false` tests only assert `LimaVM.spec.running`, not the actual `Running` condition on LimaVM or App, so they do not validate the real start/stop lifecycle.

### Documentation Assessment

The PR description and code comments still imply host-specific template selection from issue #227, but the implementation is currently a single embedded template. When this is fixed, document the selection rules explicitly and record which controller is responsible for deleting the final template ConfigMap during early-delete races.

---

### Gemini 3.1 Pro

### Executive Summary

This PR implements the core `App` controller reconciliation loop, successfully managing the creation, state propagation, and teardown of the owned `LimaVM` singleton. However, the PR contains critical regressions: it permanently deadlocks the deletion of the `App` resource by failing to remove a blocking finalizer from the child `LimaVM`, and it entirely misses the architectural requirement to dynamically select the correct host architecture (e.g., `amd64` vs `arm64`), breaking support for Apple Silicon. Additionally, there are several edge cases where transient API errors are swallowed, leading to potential resource leaks, and observable generation numbers are mirrored incorrectly.

### Critical Issues

1. **LimaVM stuck in Terminating state forever during App deletion** — `pkg/controllers/app/app/controllers/app_controller.go:74-78` `(critical, regression)`

```go
                case !base.IsBeingDeleted(limaVM):
                        if err := r.Delete(ctx, limaVM); err != nil && !apierrors.IsNotFound(err) {
                                return ctrl.Result{}, fmt.Errorf("failed to delete LimaVM: %w", err)
                        }
                        log.Info("Requested LimaVM deletion, waiting for teardown")
                        return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
```

The `App` controller sets the `base.OwnedFinalizerName` (`rdd.rancherdesktop.io/owned`) finalizer on the `LimaVM` at creation to block unmanaged cascade deletion. During teardown, the controller issues `r.Delete(ctx, limaVM)` and polls until the resource is gone. However, it never strips the `OwnedFinalizerName` from the `LimaVM`. As a result, the `LimaVM` becomes permanently stuck in the `Terminating` state, the App controller loops infinitely, and the `App` object hangs indefinitely on deletion (timing out operations like `rdd ctl delete app`).

Fix: Explicitly remove the cascade-blocking finalizer before or alongside issuing the delete:
```go
                case !base.IsBeingDeleted(limaVM):
                        if err := base.RemoveOwnedFinalizer(ctx, r.Client, limaVM); err != nil {
                                return ctrl.Result{}, fmt.Errorf("failed to remove owned finalizer: %w", err)
                        }
                        // ... then delete
```

2. **Hardcoded architecture and image format breaks Apple Silicon support** — `pkg/controllers/app/app/lima.yaml:3` and `pkg/controllers/app/app/controllers/app_controller.go:116-118` `(critical, gap)`

```yaml
images:
- location: https://github.com/rancher-sandbox/rancher-desktop-opensuse/releases/download/v0.1.0/distro.v0.1.0.amd64.qcow2
```

The linked issue explicitly requires the controller to "Select correct arch (amd64/arm64) and image format (qcow2/tar) from host". This PR embeds a static `lima.yaml` that hardcodes the image to `distro.v0.1.0.amd64.qcow2` and makes no attempt to substitute the value during ConfigMap creation. Since the template also disables Rosetta (`rosetta: enabled: false`), booting this configuration on Apple Silicon (M1/M2/M3, which are `aarch64`) will fail entirely or degrade to unusably slow QEMU instruction translation.

Fix: Use `text/template` or `strings.ReplaceAll` to dynamically substitute `amd64.qcow2` with the correct architecture string (e.g. via `runtime.GOARCH`) before writing `r.LimaTemplateData` to the ConfigMap.

### Important Issues

1. **Incorrect `ObservedGeneration` mapping corrupts status conditions** — `pkg/controllers/app/app/controllers/app_controller.go:181` `(important, regression)`

```go
               statusChanged = apimeta.SetStatusCondition(&app.Status.Conditions, metav1.Condition{
                       // ...
                       Message:            cond.Message,
                       ObservedGeneration: cond.ObservedGeneration,
                       LastTransitionTime: cond.LastTransitionTime,
               }) || statusChanged
```

The reconciler copies the `ObservedGeneration` directly from the child `LimaVM`'s conditions onto the parent `App`'s conditions. This is an anti-pattern. `ObservedGeneration` on a status condition must reflect the generation of the object hosting the status (the `App`), not the generation of the child resource. If `App` and `LimaVM` generation numbers ever diverge (e.g., a metadata or unrelated spec update increments `App`'s generation but not `LimaVM`'s), consumers will see a mismatched `ObservedGeneration` and incorrectly assume the `App` controller has permanently stalled on an older spec.

Fix: Change this line to `ObservedGeneration: app.Generation`.

2. **Swallowed API errors during ConfigMap creation cause permanent leak** — `pkg/controllers/app/app/controllers/app_controller.go:163-165` `(important, regression)`

```go
       } else if !apierrors.IsNotFound(err) {
               log.Error(err, "Failed to fetch input ConfigMap")
       }
```

If `r.Get` fails with a transient error (e.g., API timeout or rate-limiting) when trying to clean up the bootstrap ConfigMap, the controller logs the error and proceeds to update the status, ultimately returning `ctrl.Result{}, nil`. Because the App controller does not Watch `ConfigMap` resources, it will never be requeued to retry this deletion unless an unrelated spec/status change triggers another reconcile. This quietly leaks the bootstrap ConfigMap for the entire lifetime of the `App` object.

Fix: Return `ctrl.Result{}, err` to ensure the controller requeues and retries the cleanup.

3. **Swallowed API errors during teardown prematurely drop App finalizer** — `pkg/controllers/app/app/controllers/app_controller.go:68-71` `(important, regression)`

```go
                        } else if !apierrors.IsNotFound(cmErr) {
                                log.Error(cmErr, "Failed to fetch input ConfigMap during cleanup")
                        }
                        return ctrl.Result{}, base.RemoveCleanupFinalizer(ctx, r.Client, &app)
```

Similar to the leak above, if `r.Get` fails with a transient error during App deletion, the error is swallowed and the controller proceeds to unconditionally remove the `App`'s cleanup finalizer. While Kubernetes GC will eventually clean up the ConfigMap in the background since it has an OwnerReference, this circumvents the controller's intended design of synchronously verifying cleanup before unblocking the `App`'s deletion.

Fix: Return `ctrl.Result{}, cmErr` instead of proceeding to remove the finalizer if `cmErr` is not `IsNotFound`.

4. **Dropped Kubernetes Events degrades observability** — `pkg/controllers/app/app/controllers/app_controller.go:195-206` `(important, regression)`

The PR completely removes the `Recorder: events.EventRecorder` from the `AppReconciler` struct and deletes the `setCondition` helper that previously emitted standard Kubernetes Events (e.g., `ConditionChanged`) when the `App` status changed. Since `App` is a top-level singleton resource heavily monitored by administrators/users, silently dropping these events significantly degrades cluster observability and debugging capabilities.

Fix: Restore the `Recorder` in the reconciler struct and emit a `corev1.EventTypeNormal` event whenever `statusChanged` evaluates to true after mirroring conditions.

### Suggestions

1. **Unnecessary active polling during LimaVM teardown** — `pkg/controllers/app/app/controllers/app_controller.go:78-82` `(suggestion, enhancement)`

```go
                        return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
                default:
                        // LimaVM deletion is in progress; keep requeueing until it is gone.
                        log.Info("Waiting for LimaVM deletion to complete")
                        return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
```

The App controller uses a 2-second `RequeueAfter` loop while waiting for the `LimaVM` to be fully deleted. However, the reconciler has `Owns(&limav1alpha1.LimaVM{})` configured in its `SetupWithManager`. The controller-runtime machinery will automatically and instantaneously wake the reconciler the moment the `LimaVM`'s finalizer is stripped and it is deleted from the API server. Using `RequeueAfter` forces redundant active polling in an otherwise purely event-driven architecture.

Fix: Return `ctrl.Result{}, nil` and rely on the watch event to immediately trigger the final cleanup step.

2. **Dangling conditions on App if dropped from LimaVM** `(suggestion, gap)`

`apimeta.SetStatusCondition` exclusively updates or appends conditions. The condition-mirroring loop does not prune old conditions. If the `LimaVM` controller ever drops a condition type entirely rather than transitioning it to `False`, the `App` resource will retain the stale, orphaned condition indefinitely.

### Testing Assessment

The Bats test suite (`app.bats`) covers the happy path effectively. However, due to the critical deadlock issue outlined in Finding #1, the final step of the test file (`delete App resource`) will hang indefinitely on execution unless `BATS_TIMEOUT` kills the runner.

1. **Missing test for Architecture Substitution:** (Highest Risk) There are no tests asserting that the newly generated `ConfigMap` data contains the correctly substituted `amd64`/`aarch64` architectures.
2. **Missing test for transient API failures:** No tests simulate or mock API errors during the deletion of the `inputCM` to ensure the controller actually retries.

### Documentation Assessment

No additional documentation gaps beyond what was identified in the findings.