Deep Review: 20260415-184413-pr-315
| Date | 2026-04-15 18:44 |
| Repo | rancher-sandbox/rancher-desktop-daemon |
| Round | 1 |
| Author | @jandubois |
| PR | #315 — app: aggregate Ready and Settled conditions, wait for Settled in rdd set |
| Branch | container-actions |
| Commits | 52c05a7 app: aggregate Ready and Settled conditions, wait for Settled in rdd set |
| Reviewers | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| Verdict | Merge with fixes — EngineEnabled derives from the init-time registry instead of the runtime-enabled controller set |
| Wall-clock time | 15 min 13 s |
Executive Summary ¶
This commit replaces per-property wait logic in rdd set with a single Settled condition on the App resource, computed by the App controller from the mirrored LimaVM Running condition and the engine controller's ContainerEngineReady. The design is sound: ObservedGeneration gating prevents stale-snapshot races, and the Ready/Settled split serves different consumers cleanly (Ready for liveness, Settled for CLI waits).
One important issue: hasRegisteredController("engine") checks the compile-time controller registry (populated by init()) instead of the runtime-enabled controller set (filtered by --controllers). When the engine controller is compiled in but excluded at runtime, no one writes ContainerEngineReady, and Settled hangs indefinitely. The prior code queried the discovery ConfigMap, which reflects actually-running controllers.
Structure: 8 files, 219 additions, 107 deletions across the App types, App controller, rdd set CLI, BATS tests, and design docs.
Critical Issues ¶
None.
Important Issues ¶
if err := (&controllers.AppReconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),
LimaTemplateData: limaTemplateData(),
EngineEnabled: base.IsControllerEnabled("engine"),
}).SetupWithManager(mgr); err != nil {
return err
}
return c.setupWebhook(mgr)
func hasRegisteredController(name string) bool {
for _, c := range base.GetAllControllers() {
if c.GetName() == name {
return true
}
}
return false
}
base.GetAllControllers() returns every controller whose init() called RegisterController, regardless of whether --controllers subsequently excluded it. The engine controller registers on all non-Windows platforms, so EngineEnabled is always true on macOS and Linux. When the service starts with --controllers=app,lima (or *,-engine), no one writes ContainerEngineReady. The Settled switch falls through to engineCond == nil at app_controller.go:440 and stays False/WaitingForEngine permanently.
}
// SetupWithManager sets up the controller with the Manager.
func (r *AppReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&v1alpha1.App{}).
Owns(&limav1alpha1.LimaVM{}).
Complete(r)
}
The prior code avoided this: isEngineControllerEnabled (removed from set.go in this commit) queried the discovery ConfigMap, which reflects actually-running controllers. This change regresses that capability.
Fix: plumb the actual enabled-controller set from the shared controller manager startup (service.go:757-764) into RegisterWithManager, or query the discovery ConfigMap from the reconciler.
Suggestions ¶
Valid property names and types are derived from the App CRD at
runtime. If the App resource does not exist, it is created with
default settings before applying the specified values.
By default, rdd set waits for the App's Settled condition
to reach True at the new generation. The wait covers every
property change and returns only after the full reconcile
chain catches up. Pass --wait=false to return as soon as
the patch is accepted, or --timeout=0 to wait indefinitely.
Examples:
rdd set running=true
rdd set running=true containerEngine.name=containerd
rdd set kubernetes.enabled=true kubernetes.version=1.32.2
This help text describes the old per-property wait behavior. The implementation now waits for Settled=True unconditionally, and the design docs (docs/design/cmd_app.md:13) reflect the new behavior. The cobra help was not updated.
Valid property names and types are derived from the App CRD's OpenAPI schema at runtime, so the command automatically supports new properties as they are added to the CRD.
If the App resource does not exist, it is created with default settings before the specified values are applied.
By default, `rdd set` waits for the reconcile chain to settle before returning. Every property change — `running`, `containerEngine.name`, `kubernetes.enabled`, or any combination — waits for the App's `Settled` condition to reach `True` with `ObservedGeneration` matching the post-patch generation.
`Settled` goes `False` whenever the App controller sees a generation it has not yet caught up with, and back to `True` once the LimaVM has reached a terminal phase (Started or Stopped) and the engine controller has processed the current generation.
The `ObservedGeneration` filter prevents a stale `Settled=True` from a previous reconcile from prematurely satisfying the wait.
Fix: update the Long help to describe the Settled-based wait.
When a spec change updates the template ConfigMap (e.g. kubernetes.version), the App controller patches the ConfigMap and requeues. On the requeue it mirrors LimaVM conditions, which still carry Running=True/Started because the LimaVM controller has not yet detected the template change. If the engine controller has already stamped ContainerEngineReady at the new generation (it just re-confirms its existing connection), computeAppConditions sees a terminal Running reason with a current-generation engine condition and sets Settled=True.
This can cause rdd set to return before the VM restarts with the new config.
Mitigating factors: the previous code never waited for non-running property changes (the explicit TODO this commit addresses), so the new behavior is strictly better. The window is narrow — the LimaVM controller detects template changes quickly via handleTemplateUpdate and sets restartNeeded, which changes Running and causes a subsequent App reconcile to correct Settled. In practice, this race requires the engine reconcile and App requeue to both complete before the LimaVM controller processes the ConfigMap watch event.
Fix: the App controller could check whether the LimaVM's observedTemplateResourceVersion matches the current ConfigMap's resourceVersion before allowing Settled=True, but this adds cross-controller coupling. A simpler approach: return ctrl.Result{RequeueAfter: ...} instead of Requeue: true after the template patch, giving the LimaVM controller time to process.
case engineCond.Status != metav1.ConditionTrue:
settled.Status = metav1.ConditionFalse
settled.Reason = engineCond.Reason
settled.Message = engineCond.Message
default:
settled.Status = metav1.ConditionTrue
settled.Reason = "Settled"
settled.Message = "App has reached the desired state"
}
return settled
}
// runningLimaVMMessage builds the Settled message when LimaVM's
// Running reason does not match the desired state. Failure reasons
// (ending in "Failed") propagate LimaVM's diagnostic message; other
// reasons get a concise "has not yet reached <desired>" text.
func runningLimaVMMessage(runningCond *metav1.Condition, desired string) string {
if strings.HasSuffix(runningCond.Reason, "Failed") && runningCond.Message != "" {
When the reason is a terminal failure like StartFailed, the message "has not yet reached Started" implies progress that will not happen. The Reason field carries the correct signal, but rdd set users see only the timeout error and need to inspect the condition separately.
Fix: propagate the upstream message for failure reasons:
- settled.Message = "LimaVM has not yet reached Started"
+ if strings.HasSuffix(runningCond.Reason, "Failed") {
+ settled.Message = runningCond.Message
+ } else {
+ settled.Message = "LimaVM has not yet reached Started"
+ }
Design Observations ¶
Strengths ¶
- Single Settled condition replaces per-property mapping
(in-scope)Claude Codex Gemini — The oldwaitForDesiredStatehardcoded arunning → ContainerEngineReady/Runningmapping, making non-running changes unwaiteable. Moving the aggregation server-side means every property change gets correct wait semantics without CLI changes. - ObservedGeneration gating prevents stale-snapshot races
(in-scope)Codex — StampingObservedGenerationon synthesized conditions and filtering ongen >= minGenin the CLI ensures a staleSettled=Truefrom a prior reconcile cannot prematurely satisfy the wait. - Clean Ready/Settled separation
(in-scope)Claude — Ready tracks live operational state (flips on degradation without a spec change), Settled tracks reconcile convergence (gates on generation and terminal reasons). The two serve different consumers — Ready for dashboards, Settled for CLI waits — and the code keeps them independent. - Stopped-app engine bypass
(in-scope)Claude — The!desiredRunningcase atapp_controller.go:431-438correctly short-circuits the engine check: a stopped VM cannot produce a freshContainerEngineReady, so gating on it would cause a permanent hang.
Concerns ¶
- Controller dependency source-of-truth split
(in-scope)Codex — The CLI used to discover controller presence from the ConfigMap (runtime truth); this commit moves that decision into the App controller but uses the compile-time registry. Reusing the discovery ConfigMap or passing the enabled set from manager startup would keep controller-selection semantics in one place.
Testing Assessment ¶
- No test for
--controllers=-enginewithrdd set running=true: No integration test exercises theEngineEnabled=true, engine-not-runningconfiguration that triggers I1. A test withsetup_rdd_control_plane "app,lima"followed byrdd set running=truewould cover it. - No test for
StartFailedinteraction with Settled: The Settled condition stays False with a non-descriptive message when the VM fails to start (S3). A test forcing a start failure and verifying the timeout exit code (and the Settled condition's reason) would add confidence. - Existing coverage is solid:
engine-docker.batsexercises the full Settled wait forrunning=true,running=false, and the containerd backend swap.set.batscorrectly uses--wait=falsebecause it runs without lima or engine controllers. The test header comment atset.bats:7-15explains the rationale.
Documentation Assessment ¶
The design docs in api_app.md and cmd_app.md accurately reflect the new behavior. The condition table in api_app.md comprehensively covers all Ready/Settled state transitions. The computeAppConditions docstring (lines 358-380) is thorough. The remaining gap is the cobra help text in set.go:70-74 (S1).
Acknowledged Limitations ¶
set.batsrequires--wait=false: The test header comment atset.bats:7-15acknowledges that this suite runs with only the app controller (no lima, no engine), so the App reconciler fails to start and Settled is never written. The wait semantics are covered byengine-docker.bats.
Agent Performance Retro ¶
Claude ¶
Claude traced the hasRegisteredController → shouldEnableController discrepancy through both the controller registration path and the service startup path, citing specific lines in service.go. It was the only agent to flag the misleading Settled message for terminal errors (S3), a detail the others passed over despite reviewing the same switch cases. Its severity calibration was the most conservative — it classified the stale help text as Important rather than Suggestion.
Codex ¶
Codex produced the shortest review but every finding landed. Its analysis of I1 was the most thorough, tracing the issue through the discovery design doc and noting the contradiction with the documented external-controller discovery pattern. It was the only agent to cross-reference docs/design/discovery.md:56. Coverage summary was file-level precise.
Gemini ¶
Gemini independently found I1 and was the only agent to identify the template-mirroring race (S2). Its analysis of the ObservedGeneration stamping was mechanically correct but conflated two issues: the stamping on mirrored conditions (cosmetic) and the Settled switch logic not distinguishing "Started with old config" from "Started with current config" (functional). Its C2 severity (Critical) was too high — the prior code never waited for non-running property changes, making the new behavior strictly better despite the narrow race. Its S1 (empty reason fallback) was dropped as a false positive: CRD validation and controller logic guarantee non-empty reasons. Gemini's stderr shows a "File not found" error from an attempted file read; this did not affect output quality.
Summary ¶
| Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro | |
|---|---|---|---|
| Duration | 6m 52s | 3m 28s | 4m 36s |
| Findings | 1I 1S | 1I 1S | 1I 1S |
| Tool calls | 28 (Read 14, Bash 8, Agent 3) | 36 (shell 36) | 28 (grepsearch 11, runshellcommand 8, readfile 7) |
| Design observations | 3 | 2 | 1 |
| False positives | 0 | 0 | 1 |
| Unique insights | 1 | 0 | 1 |
| Files reviewed | 8 | 8 | 8 |
| Coverage misses | 0 | 0 | 0 |
| Totals | 1I 1S | 1I 1S | 1I 1S |
| Downgraded | 1 (I→S) | 0 | 2 (I→S) |
| Dropped | 0 | 0 | 1 |
Reconciliation: Gemini C1 (EngineEnabled): critical → important I1 (non-default configuration, the primary use case works correctly). Gemini C2 (premature Settled): critical → suggestion S2 (narrow race, strictly better than prior behavior). Claude I2 (stale help text): important → suggestion S1 (documentation mismatch, behavior is correct). Gemini S1 (empty reason fallback): dropped as false positive.
Review Process Notes ¶
No suggestions.