Deep Review: 20260320-225736-pr-236
| Date | 2026-03-20 22:57 |
| Author | @Nino-K |
| PR | #236 — Implement App controller lifecycle for LimaVM |
| Branch | app-controller-create-delete-limavm |
| Commits | 6a94f5d Implement App controller lifecycle for LimaVMdd300d6 Add Bats test to App Controller |
| Reviewers | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| Verdict | Merge with fixes — Owned-finalizer deadlock needs fixing; template portability is a known gap but the template is temporary |
| Wall-clock time | 31 min 23 s |
Consolidated Review ¶
Executive Summary ¶
This PR introduces the App controller's core reconciliation loop: it creates a LimaVM from an embedded template, propagates spec.running, mirrors VM status conditions onto the App, and handles deletion through a cleanup finalizer. The priority-chain reconcile pattern is clean and the test coverage is thorough for the happy path. The main issues are a finalizer deadlock when a LimaVM is deleted externally, a fragile early-return after ConfigMap cleanup, and hardcoded instance-specific values in the embedded template.
Critical Issues ¶
None.
Important Issues ¶
The default case requeues without removing the owned finalizer. If the LimaVM reaches Terminating through any path other than the App controller's deletion flow (manual delete, CLI tool), the rdd.rancherdesktop.io/owned finalizer blocks deletion indefinitely. The LimaVM controller's handleDeletion (at limavm_lifecycle.go:32–78) removes only the cleanup finalizer and owned finalizers on children — it never touches the owned finalizer on the LimaVM itself. The result: the LimaVM stays Terminating, the App stays Terminating waiting for it, and both resources deadlock.
The same gap exists during normal operation (App not being deleted): if a LimaVM is deleted externally, the reconcile at lines 116–210 proceeds through spec.running propagation and condition mirroring without ever removing the owned finalizer or acknowledging the terminating state.
Verified via git blame: lines 89–101 are introduced in commit 6a94f5d.
}
}
return ctrl.Result{}, base.RemoveCleanupFinalizer(ctx, r.Client, &app)
case err != nil:
return ctrl.Result{}, err
case !base.IsBeingDeleted(limaVM):
if err := base.RemoveOwnedFinalizer(ctx, r.Client, limaVM); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to remove owned finalizer from LimaVM: %w", err)
}
if err := r.Delete(ctx, limaVM); err != nil && !apierrors.IsNotFound(err) {
return ctrl.Result{}, fmt.Errorf("failed to delete LimaVM: %w", err)
}
log.Info("Requested LimaVM deletion, waiting for teardown")
return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
default:
// LimaVM deletion is in progress; keep requeueing until it is gone.
log.Info("Waiting for LimaVM deletion to complete")
return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
}
Fix: Always check for the owned finalizer when the LimaVM is terminating. In the deletion switch, consolidate the err == nil cases:
case err == nil:
if base.HasOwnedFinalizer(limaVM) {
if err := base.RemoveOwnedFinalizer(ctx, r.Client, limaVM); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to remove owned finalizer: %w", err)
}
}
if !base.IsBeingDeleted(limaVM) {
if err := r.Delete(ctx, limaVM); err != nil && !apierrors.IsNotFound(err) {
return ctrl.Result{}, fmt.Errorf("failed to delete LimaVM: %w", err)
}
}
return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
During normal reconcile, add an early check after the Get at line 116:
if base.IsBeingDeleted(limaVM) {
if err := base.RemoveOwnedFinalizer(ctx, r.Client, limaVM); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to remove owned finalizer from terminating LimaVM: %w", err)
}
return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
}
Returning ctrl.Result{}, nil after deleting the ConfigMap relies on the LimaVM controller producing subsequent status updates to re-trigger the App controller. The App controller does not watch or own ConfigMaps, so the deletion itself generates no watch event. In practice the LimaVM always has more lifecycle steps, so this works today. But the coupling is implicit: if any future change reduces LimaVM controller chattiness, the App controller stalls until an external event arrives.
Gemini rated this CRITICAL; after tracing the LimaVM controller lifecycle (limavm_controller.go:176–268), a newly created LimaVM always produces multiple status updates (initial condition at line 176, template ConfigMap at line 248, Created condition). The stall scenario requires all those updates to complete before the App controller processes the ConfigMap deletion — unlikely in practice.
}
// LimaVM exists — clean up the input ConfigMap if it still lingers from the
// creation phase and return to requeueing on LimaVM updates.
inputConfigMap := &corev1.ConfigMap{}
if err := r.Get(ctx, client.ObjectKey{Name: inputConfigMapName, Namespace: namespace}, inputConfigMap); err == nil {
if err := r.Delete(ctx, inputConfigMap); err != nil && !apierrors.IsNotFound(err) {
return ctrl.Result{}, fmt.Errorf("failed to delete input ConfigMap: %w", err)
}
log.Info("Deleted input ConfigMap after LimaVM created its own copy")
return ctrl.Result{}, nil
} else if !apierrors.IsNotFound(err) {
log.Error(err, "Failed to fetch input ConfigMap")
}
Fix: Replace return ctrl.Result{}, nil with return ctrl.Result{Requeue: true}, nil to guarantee the reconciler continues to spec.running propagation and condition mirroring without depending on external events.
Test 123–126 waits for the LimaVM's Created condition, then test 128 reads the App status without waiting for the App controller to mirror it. The Owns watch triggers a reconcile on LimaVM status changes, but the reconcile may not have completed by the time the BATS test reads the App. The same race applies to test 139–149 ("verify App conditions preserve LimaVM LastTransitionTime").
@test "wait for LimaVM Created condition to be set" {
rdd ctl wait --for=condition=Created \
limavm/"${VM_NAME}" --namespace "${RDD_NAMESPACE}" --timeout=60s
}
@test "verify App status mirrors LimaVM Created condition" {
run -0 rdd ctl get app "${APP_NAME}" -o json
local json="${output}"
run -0 jq -r '.status | has("conditions")' <<<"${json}"
assert_output "true"
run -0 jq -r '.status.conditions[] | select(.type == "Created") | .status' <<<"${json}"
assert_output "True"
}
@test "verify App conditions preserve LimaVM LastTransitionTime" {
run -0 rdd ctl get limavm "${VM_NAME}" --namespace "${RDD_NAMESPACE}" -o json
local limavm_ts
limavm_ts=$(jq -r '.status.conditions[] | select(.type == "Created") | .lastTransitionTime' <<<"${output}")
run -0 rdd ctl get app "${APP_NAME}" -o json
local app_ts
app_ts=$(jq -r '.status.conditions[] | select(.type == "Created") | .lastTransitionTime' <<<"${output}")
[[ "${limavm_ts}" == "${app_ts}" ]]
}
Fix: Add rdd ctl wait --for=condition=Created app/"${APP_NAME}" --timeout=30s before the assertion at line 129.
During App deletion, if fetching either ConfigMap fails with a transient error (connection timeout, rate limit), the code logs the error and falls through to RemoveCleanupFinalizer at line 86. Without a garbage collector, the orphaned ConfigMaps persist indefinitely.
case apierrors.IsNotFound(err):
// LimaVM is gone. Clean up any ConfigMaps that may have been left behind.
inputCM := &corev1.ConfigMap{}
if cmErr := r.Get(ctx, client.ObjectKey{Name: inputConfigMapName, Namespace: namespace}, inputCM); cmErr == nil {
if cmErr := r.Delete(ctx, inputCM); cmErr != nil && !apierrors.IsNotFound(cmErr) {
return ctrl.Result{}, fmt.Errorf("failed to delete input ConfigMap during cleanup: %w", cmErr)
}
} else if !apierrors.IsNotFound(cmErr) {
log.Error(cmErr, "Failed to fetch input ConfigMap during cleanup")
}
templateCM := &corev1.ConfigMap{}
templateCMName := limaVMName + "-template"
if cmErr := r.Get(ctx, client.ObjectKey{Name: templateCMName, Namespace: namespace}, templateCM); cmErr == nil {
if cmErr := base.RemoveOwnedFinalizer(ctx, r.Client, templateCM); cmErr != nil {
return ctrl.Result{}, fmt.Errorf("failed to remove finalizer from template ConfigMap: %w", cmErr)
}
if cmErr := r.Delete(ctx, templateCM); cmErr != nil && !apierrors.IsNotFound(cmErr) {
return ctrl.Result{}, fmt.Errorf("failed to delete template ConfigMap during cleanup: %w", cmErr)
}
} else if !apierrors.IsNotFound(cmErr) {
log.Error(cmErr, "Failed to fetch template ConfigMap during cleanup")
}
return ctrl.Result{}, base.RemoveCleanupFinalizer(ctx, r.Client, &app)
Fix: Return the error so the deletion retries:
} else if !apierrors.IsNotFound(cmErr) {
return ctrl.Result{}, fmt.Errorf("failed to fetch input ConfigMap during cleanup: %w", cmErr)
}
r.Update(ctx, limaVM) sends the entire LimaVM object fetched at line 116. The LimaVM controller frequently patches its status concurrently, incrementing the resourceVersion between the Get and Update. Each collision produces a 409 Conflict and a wasted reconcile cycle. Worse, a full Update can overwrite concurrent metadata changes (annotations, labels) made by the LimaVM controller.
}
// Propagate spec.running from App into to the LimaVM.
if limaVM.Spec.Running != app.Spec.Running {
limaVM.Spec.Running = app.Spec.Running
if err := r.Update(ctx, limaVM); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to propagate running state to LimaVM: %w", err)
}
return ctrl.Result{}, nil
}
Fix: Use a merge patch targeting only spec.running:
patch := client.MergeFrom(limaVM.DeepCopy())
limaVM.Spec.Running = app.Spec.Running
if err := r.Patch(ctx, limaVM, patch); err != nil {
The SSH port at line 14 is fixed at 51422, ignoring instance.Index() which derives per-instance ports. The host socket path at line 37 uses .rd instead of .rd<suffix> (the default suffix is "2", so the runtime path is ~/.rd2). Both break parallel instances and the socket path is wrong even for the default instance.
The controller.go:18 comment acknowledges the template is temporary ("This is temporary and will be removed once the app controller is fully implemented"). The BATS tests never boot the VM to Running, so these paths are untested.
images:
- location: https://github.com/rancher-sandbox/rancher-desktop-opensuse/releases/download/v0.1.2/distro.v0.1.2.amd64.qcow2
arch: x86_64
digest: "sha256:45de5c158399d5bc31e93f42d9c7fcfaf8dd3279ee157c702e9bd86bcaf9e088"
- location: https://github.com/rancher-sandbox/rancher-desktop-opensuse/releases/download/v0.1.2/distro.v0.1.2.arm64.qcow2
arch: aarch64
digest: "sha256:39598a41ebe5c8da0baf428716a63c50598e8306e85bb85043877407eecc8991"
cpus: 2
memory: "6442450944"
mounts:
- location: "~"
writable: true
ssh:
localPort: 51422
loadDotSSHPubKeys: false
firmware:
legacyBIOS: false
containerd:
system: false
user: false
portForwards:
- guestIPMustBeZero: true
guestPortRange:
- 1
- 65535
hostIP: 0.0.0.0
hostPortRange:
- 0
- 0
- guestPortRange:
- 0
- 0
guestSocket: /var/run/docker.sock
hostPortRange:
- 0
- 0
hostSocket: "{{.Home}}/.rd/docker.sock"
hostResolver:
hosts:
host.docker.internal: host.lima.internal
host.rancher-desktop.internal: host.lima.internal
lima-rancher-desktop: lima-0
rosetta:
enabled: false
binfmt: false
Fix: Render dynamic fields at reconcile time from instance.ShortDir() and instance.Index() instead of embedding literal values.
This sentence stops mid-thought.
} else if added {
return ctrl.Result{}, nil
}
// Check LimaVM first. If it exists and we can skip ConfigMap
limaVM := &limav1alpha1.LimaVM{}
limaVMErr := r.Get(ctx, client.ObjectKey{Name: limaVMName, Namespace: namespace}, limaVM)
if limaVMErr != nil && !apierrors.IsNotFound(limaVMErr) {
return ctrl.Result{}, limaVMErr
}
Fix: // Check whether the LimaVM already exists. If so, skip ConfigMap creation.
Suggestions ¶
}
// Propagate spec.running from App into to the LimaVM.
if limaVM.Spec.Running != app.Spec.Running {
limaVM.Spec.Running = app.Spec.Running
if err := r.Update(ctx, limaVM); err != nil {
Fix: // Propagate spec.running from App to the LimaVM.
Per Kubernetes convention, ObservedGeneration should reference the containing resource's generation (the App), not the source resource's (the LimaVM). The LimaVM controller does not currently set this field, so the value is always 0. If anyone adds generation tracking to the LimaVM controller, the mirrored value on the App becomes silently wrong.
statusChanged := false
for _, cond := range limaVM.Status.Conditions {
statusChanged = apimeta.SetStatusCondition(&app.Status.Conditions, metav1.Condition{
Type: cond.Type,
Status: cond.Status,
Reason: cond.Reason,
Message: cond.Message,
ObservedGeneration: cond.ObservedGeneration,
LastTransitionTime: cond.LastTransitionTime,
}) || statusChanged
}
if statusChanged {
if err := r.Status().Update(ctx, &app); err != nil {
log.Error(err, "Unable to update App status")
return ctrl.Result{}, err
}
}
Fix: Use ObservedGeneration: app.Generation or omit the field.
The canonical name computation is limaVM.GetTemplateConfigMapName(). On the deletion path (LimaVM already gone), there is no object to call the helper on. The hardcoded "-template" suffix could drift from the helper.
} else if !apierrors.IsNotFound(cmErr) {
log.Error(cmErr, "Failed to fetch input ConfigMap during cleanup")
}
templateCM := &corev1.ConfigMap{}
templateCMName := limaVMName + "-template"
if cmErr := r.Get(ctx, client.ObjectKey{Name: templateCMName, Namespace: namespace}, templateCM); cmErr == nil {
if cmErr := base.RemoveOwnedFinalizer(ctx, r.Client, templateCM); cmErr != nil {
return ctrl.Result{}, fmt.Errorf("failed to remove finalizer from template ConfigMap: %w", cmErr)
}
Fix: Extract a shared constant (e.g., templateConfigMapSuffix) or add a comment referencing GetTemplateConfigMapName() as the source of truth.
The test verifies the LimaVM is gone after App deletion but does not verify that both ConfigMaps (rd and rd-template) are also cleaned up.
@test "wait for App to be fully deleted" {
# The App finalizer stays until the LimaVM controller finishes teardown.
rdd ctl wait --for=delete app/"${APP_NAME}" --timeout=90s
}
@test "verify LimaVM rd is deleted when App is deleted" {
run -1 rdd ctl get limavm "${VM_NAME}" --namespace "${RDD_NAMESPACE}"
assert_output --partial "not found"
}
Fix: Add a test that verifies rdd ctl get configmap "rd-template" --namespace "${RDD_NAMESPACE}" returns not found.
The comment says "32KB limit for the whole object," but 32768 matches the per-field maxLength in the CRD schema (crd.yaml:86). The comment and the intent (CRD validation, not object size) are misaligned.
}
// conditionMessageMaxLen caps how long a condition message can be.
// The API server has a 32KB limit for the whole object, so we keep
// messages short to avoid failures, especially when errors get long
// (like repeated image download retries).
const conditionMessageMaxLen = 32768
Fix: // conditionMessageMaxLen matches the CRD maxLength for condition messages. Truncation prevents validation errors when messages grow large.
Design Observations ¶
Concerns ¶
Terminating LimaVM handling is absent from normal reconcile
(in-scope)GeminiThe normal reconcile path (lines 116–210) assumes the LimaVM is either missing or alive. A terminating LimaVM passes through spec.running propagation and condition mirroring without acknowledgment. The controller should detect a terminating LimaVM, remove the owned finalizer, and either wait for deletion or recreate. This would also eliminate the deadlock in Finding 1.
Template rendering should replace template embedding
(in-scope)CodexThe embedded
lima.yamlis fully static, but several fields depend on runtime state: socket paths oninstance.ShortDir(), SSH port oninstance.Index(), and image format on the host OS. A Go template or struct-based rendering would eliminate the portability gaps in Finding 6 without complicating the reconciler. Thecontroller.go:18comment already flags this as temporary.
Strengths ¶
The priority-chain pattern (return after each action, let the next reconcile advance the state machine) makes the reconcile function easy to read and debug. Each code block has a single responsibility, and the flow is linear.
The defensive ConfigMap cleanup during App deletion (lines 63–84) handles the case where the LimaVM's
DeleteOwnedResourcesdid not run — belt-and-suspenders cleanup for a system without garbage collection.SetControllerReferencefor both the input ConfigMap and LimaVM, combined withOwns(&limav1alpha1.LimaVM{}), provides correct ownership tracking. The immutability validation onspec.namespaceprevents orphaning owned resources after creation.LastTransitionTimemirroring preserves the original LimaVM timestamp rather than generating a new one, so condition timelines remain accurate across the resource hierarchy.
Testing Assessment ¶
Coverage is thorough for the happy-path lifecycle (create, own, propagate, mirror, delete, recreate). Untested scenarios ranked by risk:
- Externally deleted LimaVM — no test verifies the controller handles a LimaVM deleted by a user or CLI tool during normal operation or during App deletion. This is the deadlock path from Finding 1.
- ConfigMap cleanup after deletion — no test verifies that
rd-templateandrdConfigMaps are removed when the App is deleted. - Condition mirroring race — tests 128 and 139 read App conditions without waiting for the mirror reconcile (Finding 3).
- App deletion while LimaVM is mid-startup — the test always uses
running: false. Deleting during startup exercises the full cleanup flow more thoroughly.
Documentation Assessment ¶
No gaps. The CRD YAML includes the immutability description, and the embedded lima.yaml is self-documenting. The controller.go:18 comment marks the template as temporary.
Commit Structure ¶
The two commits are well-scoped: 6a94f5d implements the controller logic and dd300d6 adds the BATS tests. Separating implementation from tests makes review easier, though the test commit does not stand alone (it requires the controller from the first commit).
Agent Performance Retro ¶
Claude Claude Opus 4.6 ¶
Claude produced the most findings (5 important, 4 suggestions) with strong signal quality. Unique contributions: the test race condition (Finding 3), the full-object Update concern (Finding 5), the incomplete comment (Finding 7), and several suggestions (typo, hardcoded suffix, ConfigMap cleanup test, event recorder removal). Claude correctly identified no critical issues and maintained appropriate severity calibration. It traced into the LimaVM controller for the Owns() watch analysis and verified ObservedGeneration is unused via grep. The only weakness: Claude did not catch the finalizer deadlock or the early-return stall, the two most impactful findings.
Codex Codex GPT 5.4 ¶
Codex produced only two findings but both were substantive. The embedded template portability analysis (Finding 6) was the most thorough cross-reference of any agent — it traced through instance.ShortDir(), instance.Index(), the BATS helpers, and the design docs to demonstrate the path/port mismatch. Codex also identified the missing image format selection from #227's requirements. No false positives. The review was short (66 lines) with no suggestions or design observations beyond what the findings covered. Duration was the shortest at 253s.
Gemini Gemini 3.1 Pro ¶
Gemini raised two findings as CRITICAL: the early-return stall and the finalizer deadlock. The deadlock finding (Finding 1) was the most valuable individual contribution across all three agents — it identified a real bug that the other two missed. The early-return stall was overrated at CRITICAL (downgraded to IMPORTANT after tracing the LimaVM lifecycle), but the core observation was valid. Gemini also caught the transient-error ConfigMap leak (Finding 4), which the other agents missed. False positive: the "redundant template ConfigMap cleanup" suggestion incorrectly assumed DeleteOwnedResources would handle it — the App controller's cleanup runs after the LimaVM is already gone, so there is no owner to drive DeleteOwnedResources.
Summary ¶
| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 10:01 | 4:13 | 4:54 |
| Critical | 0 | 0 | 0 (2 downgraded) |
| Important | 5 | 2 | 3 |
| Suggestion | 4 | 0 | 2 |
| Design observations | 1 | 2 | 0 |
| False positives | 0 | 0 | 1 |
| Unique insights | 4 | 1 | 2 |
Gemini provided the highest-impact single finding (the finalizer deadlock). Claude provided the broadest coverage and most actionable suggestions. Codex was the most efficient — two precise, well-researched findings in the shortest time.