## Deep Code Review: PR #236 — Implement App controller lifecycle for LimaVM

| | |
|---|---|
| **Verdict** | Approve with required changes |
| **PR** | #236 |
| **Branch** | app-controller-create-delete-limavm |
| **Date** | 2026-03-20 |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |

### Executive Summary

This PR adds the App controller's reconciliation loop: it bootstraps a LimaVM from an embedded template, propagates `spec.running`, mirrors LimaVM conditions onto the App, and handles deletion through a cleanup finalizer. The implementation is clean — the priority chain returns after every mutation to ensure fresh state, and the dual-finalizer pattern correctly delegates VM teardown to the LimaVM controller. Two issues need fixes: transient API errors during cleanup can orphan ConfigMaps, and the LimaVM resource lacks a validating webhook to block deletion while owned.

### Findings

#### Critical Issues

None.

#### Important Issues

1. **Transient API errors during cleanup bypass finalizer removal** — `pkg/controllers/app/app/controllers/app_controller.go:62-86` (important, regression) [Gemini 3.1 Pro]

If the API server returns a transient error (timeout, connection reset) when fetching either ConfigMap during App deletion, the error is logged at lines 69-70 and 82-84 but execution falls through to `RemoveCleanupFinalizer` at line 86. This strips the App's finalizer and allows the App to be permanently deleted, orphaning the ConfigMaps.

In RDD's embedded single-user apiserver transient errors are rare, which is why this is important rather than critical — but the fix is trivial.

Fix: Return the error to force a retry:
```go
} else if !apierrors.IsNotFound(cmErr) {
    return ctrl.Result{}, fmt.Errorf("failed to fetch input ConfigMap during cleanup: %w", cmErr)
}
```

2. ~~**Manual LimaVM deletion deadlocks**~~ **INVALID** — `pkg/controllers/app/app/controllers/app_controller.go:149,89-101` (important, regression) [Gemini 3.1 Pro]

**This finding is invalid.** The owned finalizer is not a deadlock — it is an access-control gate. A validating webhook checks `HasOwnedFinalizer` and rejects DELETE requests while the finalizer is present. The owner removes the finalizer when it wants to allow deletion. The LimaVM controller already uses this pattern for template ConfigMaps (`ConfigMapValidator.ValidateDelete` at `pkg/controllers/lima/limavm/controller.go:272-279`). See Important #3 for the actual gap.

3. **LimaVM resource lacks a validating webhook to block deletion while owned** — `pkg/controllers/app/app/controllers/app_controller.go:149` (important, gap)

The App controller adds `OwnedFinalizerName` to the LimaVM at creation (line 149), matching the pattern the LimaVM controller uses for template ConfigMaps. The ConfigMap has a corresponding `ValidateDelete` webhook (`pkg/controllers/lima/limavm/controller.go:272-279`) that rejects DELETE requests when the owned finalizer is present, giving the user a clear error. The LimaVM resource lacks an equivalent webhook. Without it, `rdd ctl delete limavm rd` is accepted by the API server, the resource enters Terminating, and the owned finalizer blocks actual deletion — leaving a stuck resource with no error explaining why.

Fix: Add a `ValidateDelete` to the LimaVM resource webhook that checks `HasOwnedFinalizer` and rejects the DELETE with a message like *"cannot delete LimaVM 'rd': it is owned by the App controller; delete the App resource instead."*

#### Suggestions

1. **Incomplete comment** — `pkg/controllers/app/app/controllers/app_controller.go:114` (suggestion, regression) [Claude Opus 4.6]

The sentence trails off mid-thought. Complete it, e.g.: `// Check whether the LimaVM already exists. If not, create the input ConfigMap and LimaVM.`

2. **Typo: doubled "to"** — `pkg/controllers/app/app/controllers/app_controller.go:182` (suggestion, regression) [Claude Opus 4.6]

Fix: `// Propagate spec.running from App into the LimaVM.`

3. **ObservedGeneration copied from wrong resource** — `pkg/controllers/app/app/controllers/app_controller.go:201` (suggestion, regression) [Claude Opus 4.6, Gemini 3.1 Pro]

`ObservedGeneration` on a condition refers to the generation of the resource that owns the condition. Copying the LimaVM's value onto the App is semantically wrong — the two resources have independent generation sequences. Currently harmless because the LimaVM controller doesn't set `ObservedGeneration` (it stays 0), but any tooling that compares `condition.observedGeneration >= resource.metadata.generation` would be misled.

Fix: Use `app.Generation` instead, or omit it (zero value means "not tracking").

4. **Misleading conditionMessageMaxLen comment** — `pkg/controllers/lima/limavm/controllers/limavm_controller.go:426-430` (suggestion, regression) [Claude Opus 4.6, Gemini 3.1 Pro]

The 32768 limit comes from the CRD's `maxLength: 32768` on the condition `.message` field, not from a whole-object size limit. The etcd object limit is ~1.5 MB.

Fix: `// conditionMessageMaxLen matches the standard Kubernetes maxLength (32768) for condition messages defined in the CRD schema.`

5. **Input ConfigMap deletion returns without watch guarantee** — `pkg/controllers/app/app/controllers/app_controller.go:177` (suggestion, regression) [Gemini 3.1 Pro]

The `return ctrl.Result{}, nil` at line 177 stops the reconcile. The reconciler `Owns(&limav1alpha1.LimaVM{})` but not `Owns(&corev1.ConfigMap{})`, so deleting the ConfigMap produces no watch event. The controller relies on the LimaVM emitting a future status update. In practice the LimaVM controller will still be processing (downloading images, starting the VM), so a watch event will arrive — but the coupling is implicit and fragile.

Fix: Fall through to propagate `spec.running` and mirror conditions in the same pass, or add `Requeue: true`.

6. **Mirrored messages bypass truncation** — `pkg/controllers/app/app/controllers/app_controller.go:200` (suggestion, gap) [Claude Opus 4.6]

The LimaVM controller truncates messages via `truncateConditionMessage()`, so messages should already be within limits. But the App controller has no independent guard. If a future LimaVM code path bypasses truncation, the App's status update would fail CRD validation.

Fix: Apply truncation defensively, or extract `truncateConditionMessage` to the `base` package so both controllers can share it.

### Design Observations

**Strengths**

- The reconciler's priority chain (deletion → finalizer → create → cleanup → propagate → mirror) returns after every mutation, ensuring each re-entry works with fresh state. The comment at lines 192-193 explains why the final `Status().Update` is safe without a re-read.
- The dual-finalizer pattern (CleanupFinalizer on App, OwnedFinalizer on LimaVM) correctly delegates VM teardown to the LimaVM controller instead of duplicating Lima-specific cleanup logic.
- Making `spec.namespace` immutable via CRD-level `XValidation` prevents orphaning owned resources by namespace change.
- Using `Owns(&limav1alpha1.LimaVM{})` with `SetControllerReference` gives the App reconciler correct trigger events without manual watches.

### Testing Assessment

The BATS tests cover the full happy-path lifecycle: creation, finalizer verification, ownership, template propagation, input ConfigMap cleanup, condition mirroring (including `LastTransitionTime` preservation), `spec.running` propagation, CRD validation (singleton name, namespace immutability), deletion cascade, and recreation after deletion.

Untested scenarios, ranked by risk:

1. **Attempted LimaVM deletion while owned** — Highest risk. Once the validating webhook from Important #3 is added, a test should verify that `rdd ctl delete limavm rd` is rejected with a clear error while the App owns it.
2. **No `local_teardown_file`** — Unlike `limavm-instance.bats`, this test file has no teardown to clean up Lima instance directories from disk. If a test fails before "delete App" at line 182, leftover disk state persists at `LIMA_HOME`.
3. **App deletion while LimaVM is partially created** — e.g., the input ConfigMap exists but the LimaVM webhook hasn't processed the template yet. The cleanup path at lines 62-84 handles this, but no test exercises it.

### Documentation Assessment

No documentation gaps. The inline comments explain the reconciler's structure well, aside from the two minor issues flagged above.

### Commit Structure

Clean. The implementation commit (`6a94f5d`) contains the controller logic, CRD changes, and template. The test commit (`dd300d6`) adds integration tests.
