Deep Review: 20260415-184413-pr-315

Date2026-04-15 18:44
Reporancher-sandbox/rancher-desktop-daemon
Round1
Author@jandubois
PR#315 — app: aggregate Ready and Settled conditions, wait for Settled in rdd set
Branchcontainer-actions
Commits52c05a7 app: aggregate Ready and Settled conditions, wait for Settled in rdd set
ReviewersClaude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
VerdictMerge with fixesEngineEnabled derives from the init-time registry instead of the runtime-enabled controller set
Wall-clock time15 min 13 s


Executive Summary

This commit replaces per-property wait logic in rdd set with a single Settled condition on the App resource, computed by the App controller from the mirrored LimaVM Running condition and the engine controller's ContainerEngineReady. The design is sound: ObservedGeneration gating prevents stale-snapshot races, and the Ready/Settled split serves different consumers cleanly (Ready for liveness, Settled for CLI waits).

One important issue: hasRegisteredController("engine") checks the compile-time controller registry (populated by init()) instead of the runtime-enabled controller set (filtered by --controllers). When the engine controller is compiled in but excluded at runtime, no one writes ContainerEngineReady, and Settled hangs indefinitely. The prior code queried the discovery ConfigMap, which reflects actually-running controllers.

Structure: 8 files, 219 additions, 107 deletions across the App types, App controller, rdd set CLI, BATS tests, and design docs.


Critical Issues

None.


Important Issues

I1. hasRegisteredController checks registration, not runtime enablement Claude Codex Gemini

	if err := (&controllers.AppReconciler{
		Client:           mgr.GetClient(),
		Scheme:           mgr.GetScheme(),
		LimaTemplateData: limaTemplateData(),
		EngineEnabled:    base.IsControllerEnabled("engine"),
	}).SetupWithManager(mgr); err != nil {
		return err
	}

	return c.setupWebhook(mgr)
func hasRegisteredController(name string) bool {
	for _, c := range base.GetAllControllers() {
		if c.GetName() == name {
			return true
		}
	}
	return false
}

base.GetAllControllers() returns every controller whose init() called RegisterController, regardless of whether --controllers subsequently excluded it. The engine controller registers on all non-Windows platforms, so EngineEnabled is always true on macOS and Linux. When the service starts with --controllers=app,lima (or *,-engine), no one writes ContainerEngineReady. The Settled switch falls through to engineCond == nil at app_controller.go:440 and stays False/WaitingForEngine permanently.

}

// SetupWithManager sets up the controller with the Manager.
func (r *AppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.App{}).
		Owns(&limav1alpha1.LimaVM{}).
		Complete(r)
}

The prior code avoided this: isEngineControllerEnabled (removed from set.go in this commit) queried the discovery ConfigMap, which reflects actually-running controllers. This change regresses that capability.

Fix: plumb the actual enabled-controller set from the shared controller manager startup (service.go:757-764) into RegisterWithManager, or query the discovery ConfigMap from the reconciler.


Suggestions

S1. Stale help text in rdd set cobra command Claude Codex

			Valid property names and types are derived from the App CRD at
			runtime. If the App resource does not exist, it is created with
			default settings before applying the specified values.

			By default, rdd set waits for the App's Settled condition
			to reach True at the new generation. The wait covers every
			property change and returns only after the full reconcile
			chain catches up. Pass --wait=false to return as soon as
			the patch is accepted, or --timeout=0 to wait indefinitely.

			Examples:
			  rdd set running=true
			  rdd set running=true containerEngine.name=containerd
			  rdd set kubernetes.enabled=true kubernetes.version=1.32.2

This help text describes the old per-property wait behavior. The implementation now waits for Settled=True unconditionally, and the design docs (docs/design/cmd_app.md:13) reflect the new behavior. The cobra help was not updated.


Valid property names and types are derived from the App CRD's OpenAPI schema at runtime, so the command automatically supports new properties as they are added to the CRD.

If the App resource does not exist, it is created with default settings before the specified values are applied.

By default, `rdd set` waits for the reconcile chain to settle before returning. Every property change — `running`, `containerEngine.name`, `kubernetes.enabled`, or any combination — waits for the App's `Settled` condition to reach `True` with `ObservedGeneration` matching the post-patch generation.

`Settled` goes `False` whenever the App controller sees a generation it has not yet caught up with, and back to `True` once the LimaVM has reached a terminal phase (Started or Stopped) and the engine controller has processed the current generation.

The `ObservedGeneration` filter prevents a stale `Settled=True` from a previous reconcile from prematurely satisfying the wait.

Fix: update the Long help to describe the Settled-based wait.

S2. Narrow race: premature Settled=True after in-place config change Gemini

When a spec change updates the template ConfigMap (e.g. kubernetes.version), the App controller patches the ConfigMap and requeues. On the requeue it mirrors LimaVM conditions, which still carry Running=True/Started because the LimaVM controller has not yet detected the template change. If the engine controller has already stamped ContainerEngineReady at the new generation (it just re-confirms its existing connection), computeAppConditions sees a terminal Running reason with a current-generation engine condition and sets Settled=True.

This can cause rdd set to return before the VM restarts with the new config.

Mitigating factors: the previous code never waited for non-running property changes (the explicit TODO this commit addresses), so the new behavior is strictly better. The window is narrow — the LimaVM controller detects template changes quickly via handleTemplateUpdate and sets restartNeeded, which changes Running and causes a subsequent App reconcile to correct Settled. In practice, this race requires the engine reconcile and App requeue to both complete before the LimaVM controller processes the ConfigMap watch event.

Fix: the App controller could check whether the LimaVM's observedTemplateResourceVersion matches the current ConfigMap's resourceVersion before allowing Settled=True, but this adds cross-controller coupling. A simpler approach: return ctrl.Result{RequeueAfter: ...} instead of Requeue: true after the template patch, giving the LimaVM controller time to process.

S3. Settled message misleading for terminal error states Claude
	case engineCond.Status != metav1.ConditionTrue:
		settled.Status = metav1.ConditionFalse
		settled.Reason = engineCond.Reason
		settled.Message = engineCond.Message
	default:
		settled.Status = metav1.ConditionTrue
		settled.Reason = "Settled"
		settled.Message = "App has reached the desired state"
	}
	return settled
}

// runningLimaVMMessage builds the Settled message when LimaVM's
// Running reason does not match the desired state. Failure reasons
// (ending in "Failed") propagate LimaVM's diagnostic message; other
// reasons get a concise "has not yet reached <desired>" text.
func runningLimaVMMessage(runningCond *metav1.Condition, desired string) string {
	if strings.HasSuffix(runningCond.Reason, "Failed") && runningCond.Message != "" {

When the reason is a terminal failure like StartFailed, the message "has not yet reached Started" implies progress that will not happen. The Reason field carries the correct signal, but rdd set users see only the timeout error and need to inspect the condition separately.

Fix: propagate the upstream message for failure reasons:

-		settled.Message = "LimaVM has not yet reached Started"
+		if strings.HasSuffix(runningCond.Reason, "Failed") {
+			settled.Message = runningCond.Message
+		} else {
+			settled.Message = "LimaVM has not yet reached Started"
+		}

Design Observations

Strengths

  1. Single Settled condition replaces per-property mapping (in-scope) Claude Codex Gemini — The old waitForDesiredState hardcoded a running → ContainerEngineReady/Running mapping, making non-running changes unwaiteable. Moving the aggregation server-side means every property change gets correct wait semantics without CLI changes.
  2. ObservedGeneration gating prevents stale-snapshot races (in-scope) Codex — Stamping ObservedGeneration on synthesized conditions and filtering on gen >= minGen in the CLI ensures a stale Settled=True from a prior reconcile cannot prematurely satisfy the wait.
  3. Clean Ready/Settled separation (in-scope) Claude — Ready tracks live operational state (flips on degradation without a spec change), Settled tracks reconcile convergence (gates on generation and terminal reasons). The two serve different consumers — Ready for dashboards, Settled for CLI waits — and the code keeps them independent.
  4. Stopped-app engine bypass (in-scope) Claude — The !desiredRunning case at app_controller.go:431-438 correctly short-circuits the engine check: a stopped VM cannot produce a fresh ContainerEngineReady, so gating on it would cause a permanent hang.

Concerns

  1. Controller dependency source-of-truth split (in-scope) Codex — The CLI used to discover controller presence from the ConfigMap (runtime truth); this commit moves that decision into the App controller but uses the compile-time registry. Reusing the discovery ConfigMap or passing the enabled set from manager startup would keep controller-selection semantics in one place.

Testing Assessment

  1. No test for --controllers=-engine with rdd set running=true: No integration test exercises the EngineEnabled=true, engine-not-running configuration that triggers I1. A test with setup_rdd_control_plane "app,lima" followed by rdd set running=true would cover it.
  2. No test for StartFailed interaction with Settled: The Settled condition stays False with a non-descriptive message when the VM fails to start (S3). A test forcing a start failure and verifying the timeout exit code (and the Settled condition's reason) would add confidence.
  3. Existing coverage is solid: engine-docker.bats exercises the full Settled wait for running=true, running=false, and the containerd backend swap. set.bats correctly uses --wait=false because it runs without lima or engine controllers. The test header comment at set.bats:7-15 explains the rationale.

Documentation Assessment

The design docs in api_app.md and cmd_app.md accurately reflect the new behavior. The condition table in api_app.md comprehensively covers all Ready/Settled state transitions. The computeAppConditions docstring (lines 358-380) is thorough. The remaining gap is the cobra help text in set.go:70-74 (S1).


Acknowledged Limitations


Agent Performance Retro

Claude

Claude traced the hasRegisteredControllershouldEnableController discrepancy through both the controller registration path and the service startup path, citing specific lines in service.go. It was the only agent to flag the misleading Settled message for terminal errors (S3), a detail the others passed over despite reviewing the same switch cases. Its severity calibration was the most conservative — it classified the stale help text as Important rather than Suggestion.

Codex

Codex produced the shortest review but every finding landed. Its analysis of I1 was the most thorough, tracing the issue through the discovery design doc and noting the contradiction with the documented external-controller discovery pattern. It was the only agent to cross-reference docs/design/discovery.md:56. Coverage summary was file-level precise.

Gemini

Gemini independently found I1 and was the only agent to identify the template-mirroring race (S2). Its analysis of the ObservedGeneration stamping was mechanically correct but conflated two issues: the stamping on mirrored conditions (cosmetic) and the Settled switch logic not distinguishing "Started with old config" from "Started with current config" (functional). Its C2 severity (Critical) was too high — the prior code never waited for non-running property changes, making the new behavior strictly better despite the narrow race. Its S1 (empty reason fallback) was dropped as a false positive: CRD validation and controller logic guarantee non-empty reasons. Gemini's stderr shows a "File not found" error from an attempted file read; this did not affect output quality.

Summary

Claude Opus 4.6Codex GPT 5.4Gemini 3.1 Pro
Duration6m 52s3m 28s4m 36s
Findings1I 1S1I 1S1I 1S
Tool calls28 (Read 14, Bash 8, Agent 3)36 (shell 36)28 (grepsearch 11, runshellcommand 8, readfile 7)
Design observations321
False positives001
Unique insights101
Files reviewed888
Coverage misses000
Totals1I 1S1I 1S1I 1S
Downgraded1 (I→S)02 (I→S)
Dropped001

Reconciliation: Gemini C1 (EngineEnabled): critical → important I1 (non-default configuration, the primary use case works correctly). Gemini C2 (premature Settled): critical → suggestion S2 (narrow race, strictly better than prior behavior). Claude I2 (stale help text): important → suggestion S1 (documentation mismatch, behavior is correct). Gemini S1 (empty reason fallback): dropped as false positive.


Review Process Notes

No suggestions.