Deep Review: PR #236

Date2026-03-16
PR#236 — Implement App controller lifecycle for LimaVM
Branchapp-controller-create-delete-limavm
Commits9284b6c Implement App controller lifecycle for LimaVM / e300882 Add Bats test to App Controller
ReviewersClaude Codex Gemini
VerdictRework — the /owned finalizer on the LimaVM is never removed during deletion, deadlocking the entire teardown. The embedded template also hard-codes amd64, breaking Apple Silicon hosts.
Wall-clock time30 min 16 s

Executive Summary

This PR adds the App controller's core reconciliation loop: creating a LimaVM from an embedded template, propagating spec.running, mirroring status conditions, and deleting owned resources via a finalizer. The design follows established patterns for finalizer management and owner references. However, a critical deadlock prevents App deletion from completing: the App controller sets OwnedFinalizerName on the LimaVM at creation but never removes it during teardown. The embedded template also hard-codes amd64, which breaks Apple Silicon.


Critical Issues

1. LimaVM stuck in Terminating: /owned finalizer never removed Gemini

The App controller creates the LimaVM with OwnedFinalizerName at line 132. During App deletion, the controller calls r.Delete(ctx, limaVM) at line 76 and polls until the LimaVM disappears. But it never strips OwnedFinalizerName from the LimaVM. The LimaVM controller's handleDeletion (limavm_lifecycle.go:32-78) removes only CleanupFinalizerName at line 71. Its DeleteOwnedResources call at line 65 strips /owned from resources owned by the LimaVM (ConfigMaps), not from the LimaVM itself. With /owned still present, the LimaVM remains in Terminating permanently, and the App controller polls forever.

			log.Error(cmErr, "Failed to fetch input ConfigMap during cleanup")
		}
		return ctrl.Result{}, base.RemoveCleanupFinalizer(ctx, r.Client, &app)
	case err != nil:
		return ctrl.Result{}, err
	case !base.IsBeingDeleted(limaVM):
		if err := r.Delete(ctx, limaVM); err != nil && !apierrors.IsNotFound(err) {
			return ctrl.Result{}, fmt.Errorf("failed to delete LimaVM: %w", err)
		}
		log.Info("Requested LimaVM deletion, waiting for teardown")
		return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
	default:
		// LimaVM deletion is in progress; keep requeueing until it is gone.
		log.Info("Waiting for LimaVM deletion to complete")
		return ctrl.Result{RequeueAfter: requeueAfterDeletion}, nil
	}

The LimaVM is created with /owned at lines 128-141:

	limaVM = &limav1alpha1.LimaVM{
		ObjectMeta: metav1.ObjectMeta{
			Name:       limaVMName,
			Namespace:  namespace,
			Finalizers: []string{base.OwnedFinalizerName},
		},
		Spec: limav1alpha1.LimaVMSpec{
			TemplateRef: limav1alpha1.TemplateReference{
				Name:      inputConfigMapName,
				Namespace: namespace,
			},
			Running: app.Spec.Running,
		},
	}

The LimaVM controller's handleDeletion removes /cleanup but not /owned:

	// Delete owned resources and remove the finalizer in one pass. This is safe
	// because owned resources (ConfigMaps) delete instantly and have no finalizers.
	// If we later own resources with complex teardown, split this into two reconcile
	// cycles: delete owned resources, then verify they are gone before removing
	// the finalizer.
	if err := base.DeleteOwnedResources(ctx, r.Client, limaVM, r.Manager); err != nil {
		logger.Error(err, "Failed to delete owned resources")
		return ctrl.Result{}, err
	}

	// Remove finalizer
	if err := base.RemoveCleanupFinalizer(ctx, r.Client, limaVM); err != nil {
		logger.Error(err, "Failed to remove finalizer")
		return ctrl.Result{}, err
	}

	logger.Info("Deleted Lima instance, owned resources, and finalizer")
	return ctrl.Result{}, nil

The BATS test at app.bats:177-179 (rdd ctl wait --for=delete app/"${APP_NAME}" --timeout=90s) will hang and time out.

Fix: Either call base.RemoveOwnedFinalizer(ctx, r.Client, limaVM) before issuing the delete, or replace the manual find-and-delete with base.DeleteOwnedResources(ctx, r.Client, &app, r.Manager), which strips /owned from all owned resources and deletes them in one pass. The second option also handles the orphan race in Important Issue #2.


Important Issues

1. Embedded template hard-codes amd64 — arm64 hosts will fail Claude Codex Gemini

The template points to distro.v0.1.0.amd64.qcow2 unconditionally. On arm64 hosts (Apple Silicon), Lima will fail to match a suitable image, and limainstance.Prepare() will fail. Rosetta is also disabled (rosetta: enabled: false). #227 explicitly requires "Select correct arch (amd64/arm64) and image format (qcow2/tar) from host."

vmType: qemu
images:
- location: https://github.com/rancher-sandbox/rancher-desktop-opensuse/releases/download/v0.1.0/distro.v0.1.0.amd64.qcow2
cpus: 2
memory: "6442450944"

The template is marked temporary at controller.go:18, so this may be intentional as a first pass. But the BATS test cannot pass on arm64 as-is.


// This is temporary and will be removed once the app controller is fully implemented.
//
//go:embed lima.yaml
var limaTemplateData string

Fix: Generate the template dynamically (e.g., substitute the URL based on runtime.GOARCH), add both architecture entries to the image list (Lima selects the matching one), or gate the test with a skip guard.

2. Orphaned rd-template ConfigMap in pre-adoption deletion race Codex

The LimaVM webhook creates rd-template with OwnedFinalizerName but without an ownerReference at controller.go:202-217. The reconciler adds the ownerReference in a subsequent reconcile at limavm_controller.go:231-242. If the App is deleted during this window, the App controller's cleanup at app_controller.go:62-71 only looks for the input rd ConfigMap, not rd-template. The LimaVM's DeleteOwnedResources won't find rd-template either, since it has no ownerReference yet. The ConfigMap validator at controller.go:272-279 rejects deleting ConfigMaps that carry the /owned finalizer, so rd-template becomes undeletable.

	// Create template ConfigMap with label
	templateConfigMap := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      limavm.GetTemplateConfigMapName(),
			Namespace: limavm.Namespace,
			Labels: map[string]string{
				controllers.TemplateConfigMapLabel: "true",
			},
			Finalizers: []string{
				base.OwnedFinalizerName,
			},
		},
		Data: map[string]string{
			v1alpha1.TemplateConfigMapKey: templateData,
		},
	}

The reconciler adds the ownerReference one reconcile later:

	// Set owner reference (the webhook already set the finalizer)
	if !base.IsOwnedByUID(templateConfigMap, limaVM.GetUID()) {
		if err := ctrl.SetControllerReference(&limaVM, templateConfigMap, r.Scheme); err != nil {
			logger.Error(err, "Failed to set owner reference on template ConfigMap")
			return ctrl.Result{}, err
		}
		if err := r.Update(ctx, templateConfigMap); err != nil {
			logger.Error(err, "Failed to update template ConfigMap")
			return ctrl.Result{}, err
		}
		logger.Info("Set owner reference on template ConfigMap", "ConfigMap.Name", configMapName)
		return ctrl.Result{}, nil
	}

The window is narrow (one reconcile cycle), but the result is permanent: an orphan resource that resists manual cleanup.

Fix: Have the App controller's cleanup also delete rd-template by deterministic name, or adopt DeleteOwnedResources for the entire App cleanup path (it handles both the input ConfigMap and the LimaVM in one pass).


Suggestions

1. ObservedGeneration forwarded from LimaVM context to App Claude Gemini

ObservedGeneration on a condition conventionally refers to the .metadata.generation of the resource that owns the condition. Copying the LimaVM's value into the App makes it reference the LimaVM's generation, which is meaningless on the App resource. Currently harmless because the LimaVM controller does not set ObservedGeneration (defaults to 0), but this becomes incorrect if it ever does.

	statusChanged := false
	for _, cond := range limaVM.Status.Conditions {
		statusChanged = apimeta.SetStatusCondition(&app.Status.Conditions, metav1.Condition{
			Type:               cond.Type,
			Status:             cond.Status,
			Reason:             cond.Reason,
			Message:            cond.Message,
			ObservedGeneration: cond.ObservedGeneration,
			LastTransitionTime: cond.LastTransitionTime,
		}) || statusChanged
	}

The LimaVM controller's setCondition at limavm_controller.go:427-437 does not set ObservedGeneration:

// setCondition updates or adds a condition in the LimaVM status.
func (r *LimaVMReconciler) setCondition(limaVM *v1alpha1.LimaVM, conditionType string, status metav1.ConditionStatus, reason, message string) {
	changed := apimeta.SetStatusCondition(&limaVM.Status.Conditions, metav1.Condition{
		Type:    conditionType,
		Status:  status,
		Reason:  reason,
		Message: message,
	})
	if changed && r.Recorder != nil {
		r.Recorder.Eventf(limaVM, nil, corev1.EventTypeNormal, "ConditionChanged", conditionType, message)
	}
}

Fix: Set ObservedGeneration: app.Generation, or omit the field.

2. Swallowed API errors during ConfigMap cleanup Gemini

If r.Get returns a transient error (e.g., API timeout), the controller logs it but continues to ctrl.Result{}, nil, forgoing retry. The input ConfigMap has an ownerReference to the App, so Kubernetes GC handles cleanup eventually. The App controller also re-reconciles on LimaVM status changes, so the window is small. Still, returning the error would be more correct.

	inputCM := &corev1.ConfigMap{}
	if err := r.Get(ctx, client.ObjectKey{Name: inputConfigMapName, Namespace: namespace}, inputCM); err == nil {
		if err := r.Delete(ctx, inputCM); err != nil && !apierrors.IsNotFound(err) {
			return ctrl.Result{}, fmt.Errorf("failed to delete input ConfigMap: %w", err)
		}
		log.Info("Deleted input ConfigMap after LimaVM created its own copy")
	} else if !apierrors.IsNotFound(err) {
		log.Error(err, "Failed to fetch input ConfigMap")
	}

Fix: Return ctrl.Result{}, err on transient Get failures.

3. Swallowed API errors during teardown prematurely drop App finalizer Gemini

Same pattern as above, but during deletion. The input ConfigMap has an ownerReference, so GC provides a safety net. But the controller should not proceed to remove the cleanup finalizer before verifying cleanup succeeded.

	case apierrors.IsNotFound(err):
		// LimaVM is gone. Clean up the input ConfigMap if it still exists
		inputCM := &corev1.ConfigMap{}
		if cmErr := r.Get(ctx, client.ObjectKey{Name: inputConfigMapName, Namespace: namespace}, inputCM); cmErr == nil {
			if cmErr := r.Delete(ctx, inputCM); cmErr != nil && !apierrors.IsNotFound(cmErr) {
				return ctrl.Result{}, fmt.Errorf("failed to delete input ConfigMap during cleanup: %w", cmErr)
			}
		} else if !apierrors.IsNotFound(cmErr) {
			log.Error(cmErr, "Failed to fetch input ConfigMap during cleanup")
		}
		return ctrl.Result{}, base.RemoveCleanupFinalizer(ctx, r.Client, &app)

Fix: Return ctrl.Result{}, cmErr when cmErr is not nil and not NotFound.

4. Condition mirroring never prunes stale conditions Claude Gemini

The loop over limaVM.Status.Conditions adds or updates conditions on the App but never removes conditions that no longer exist on the LimaVM. In practice, LimaVM conditions transition between states rather than disappearing, so this risk is low. If the LimaVM is deleted and recreated with a different condition set, stale conditions would linger.

	// Mirror LimaVM status conditions onto the App status.
	statusChanged := false
	for _, cond := range limaVM.Status.Conditions {
		statusChanged = apimeta.SetStatusCondition(&app.Status.Conditions, metav1.Condition{
			Type:               cond.Type,
			Status:             cond.Status,
			Reason:             cond.Reason,
			Message:            cond.Message,
			ObservedGeneration: cond.ObservedGeneration,
			LastTransitionTime: cond.LastTransitionTime,
		}) || statusChanged
	}
	if statusChanged {
		if err := r.Status().Update(ctx, &app); err != nil {
			log.Error(err, "Unable to update App status")
			return ctrl.Result{}, err
		}
	}

Fix: Before the loop, diff condition types and remove any App conditions absent from the LimaVM. Or accept this as a known limitation with a comment.

5. Missing local_teardown_file leaves Lima instance directories behind Claude

The test file has no local_teardown_file. The existing LimaVM tests (e.g., limavm-instance.bats) define one to clean up $RDD_LIMA_HOME/<instance>. Without it, the default teardown runs rdd svc stop, which kills the control plane mid-deletion, potentially leaving orphaned instance directories.

local_setup_file() {
    setup_rdd_control_plane "app,limavm"
}

Fix: Add a local_teardown_file that cleans up ${RDD_LIMA_HOME}/${VM_NAME}, matching the pattern in limavm-instance.bats.

6. Dropped EventRecorder removes Kubernetes Events from App Gemini

The PR removes the Recorder events.EventRecorder field and the setCondition helper that emitted ConditionChanged events. The old code was a stub, and the App controller is now a thin orchestration layer — the LimaVM controller emits the real lifecycle events. Still, emitting events on App status transitions would improve observability.

Fix: Restore the Recorder and emit an event when statusChanged is true after condition mirroring.

7. Typo in comment Claude
	// Propagate spec.running from App into to the LimaVM.

Fix: // Propagate spec.running from App to the LimaVM.


Design Observations

1. The deletion path should use DeleteOwnedResources instead of manual find-and-delete (in-scope)

The App controller manually finds the LimaVM by name and deletes it. This bypasses the established pattern where DeleteOwnedResources strips /owned finalizers and deletes owned resources in one pass. Adopting DeleteOwnedResources for App deletion eliminates the /owned deadlock (Critical #1), handles the rd-template orphan race (Important #2) via transitive ownership, and reduces the code to a single function call. Any future owned resources are automatically covered without modifying the deletion path.

func DeleteOwnedResources(ctx context.Context, c client.Client, owner client.Object, mgr ctrl.Manager) error {
	logger := log.FromContext(ctx)

	var namespace string
	if ns := owner.GetNamespace(); ns != "" {
		namespace = ns
2. Bootstrap handoff through two ConfigMaps creates avoidable lifecycle coupling (future)

The current flow creates an input rd ConfigMap, has the LimaVM webhook copy it into rd-template, then has the reconciler adopt rd-template. This split ownership creates a window where rd-template has no owner. A simpler design would have the App controller create the final template directly, or have the LimaVM webhook set the ownerReference atomically. Either approach eliminates the orphan window and simplifies deletion semantics.

3. Embedded template is a temporary bootstrapping mechanism (future)

The template is embedded via //go:embed lima.yaml and marked temporary (controller.go:18). When this moves to a proper template source (e.g., the TwoCows project per #227), the LimaTemplateData string field will need to support runtime resolution (e.g., host architecture selection, dynamic URLs). The current string injection design makes this transition straightforward.


Testing Assessment

The BATS test suite covers the happy path well: singleton creation, finalizer setup, ownership, template lifecycle, condition mirroring, running state propagation, validation webhooks, and deletion cascade.

Untested scenarios, ranked by risk:

  1. App deletion: The test exercises this path (app.bats:173-179) but will deadlock due to the /owned finalizer bug. The test validates the wrong thing: it proves the bug exists.
  2. Running condition mirroring: Only the Created condition is verified on the App (tests 15-16). After toggling spec.running, the tests check LimaVM.spec.running but never verify that the App mirrors the Running condition back.
  3. arm64 architecture: No test verifies that the template works on Apple Silicon. The Created condition wait (app.bats:114-116) will fail on arm64.
  4. Early-deletion orphan race: No test deletes the App immediately after LimaVM creation but before the reconciler sets the ownerReference on rd-template.
  5. Error paths: No test covers webhook rejection of invalid template data or instance preparation failure.

Documentation Assessment

Code comments are clear and accurate. The comment on controller.go:18 ("This is temporary") documents the intended evolution. The app_types.go immutability comment correctly explains why spec.namespace is immutable. No documentation gaps for the current scope.


Agent Performance Retro

Claude Opus 4.6

Codex GPT 5.4

Gemini 3.1 Pro

Summary

MetricClaude Opus 4.6Codex GPT 5.4Gemini 3.1 Pro
Duration9:334:1111:15
Critical001
Important124
Suggestion412
Design observations310
False positives000
Unique insights213

Gemini 3.1 Pro provided the most value by a wide margin. It was the only agent to find the critical /owned finalizer deadlock — the one bug that prevents this PR from working. Both Claude and Codex examined the same deletion code and missed it. Codex contributed the second-most-valuable unique finding (the orphan race). Claude's contributions were limited to lower-severity items. Gemini's severity inflation is a recurring issue but easy to calibrate for; its depth of cross-controller tracing is what makes it stand out.


Skill Improvement Recommendations

  1. The review dimensions are comprehensive. No gaps identified.
  2. The prompt's instruction to trace into called functions and related controllers worked well for Gemini (which found the deadlock) but Claude and Codex did not follow it deeply enough. Consider adding an explicit checklist item: "For every finalizer set in the change, verify which code path removes it."
  3. The repo-specific context about single-user kube-apiserver helped calibrate severity — Codex did not inflate the orphan race to CRITICAL. However, Gemini still rated the amd64 gap as CRITICAL despite the note about calibrating severity.
  4. All three agents missed one scenario: what happens if the App is deleted while the LimaVM is still in the "preparing" phase (sentinel file present). The LimaVM controller's handleDeletion does not check for the sentinel, which could leave a stale sentinel file on disk.
  5. No process improvements identified. The three-agent parallel approach and the clarification round worked as designed.