Deep Review: 20260417-113401-pr-315

Date	2026-04-17 11:34
Repo	rancher-sandbox/rancher-desktop-daemon
Round	2 (of target)
Author	@jandubois
PR	#315 — app: aggregate Ready and Settled conditions, wait for Settled in `rdd set`
Commits	`27d7dd3` app: add Settled condition, wait for it in `rdd set`
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — replace the now-stale "Ready/Settled" mentions with "Settled" only (PR title, body, and code comment), and decide whether to address or accept the documented template-CM race that can let `rdd set containerEngine.name=…` return before the VM has restarted.
Wall-clock time	`13 min 50 s`

Executive Summary ¶

This PR introduces an AppConditionSettled aggregate that the App reconciler computes from the mirrored LimaVM Running condition and the engine-written ContainerEngineReady, then switches rdd set to a single wait predicate against Settled=True && observedGeneration >= post-write. The implementation collapses the old per-property branching cleanly, the unit-test matrix covers the state machine well, and the EngineEnabled flag correctly gates the engine dependency in builds where it is not registered.

Three issues stand out. (1) The PR title and body promise both Ready and Settled, and a new in-code comment also says "aggregate Ready/Settled" — but only Settled was implemented; this needs reconciling before merge. (2) A template-CM update race acknowledged by the author is more impactful than the comment suggests: when rdd set containerEngine.name=… triggers a backend swap, the App reconciler can mirror LimaVM's stale Running=Started into Settled at the new generation before LimaVM observes the ConfigMap change, so the CLI returns before the VM has restarted — directly defeating the PR's stated motivation. (3) EngineEnabled is captured from a process-local registry at startup, which works for today's binary topologies but silently breaks the moment engine ships in a separate manager.

Structure: 0 critical, 3 important, 3 suggestions, plus design observations on the convergence model.

Critical Issues ¶

None.

Important Issues ¶

I1. Template-CM update race lets rdd set containerEngine.name=… return before the VM restarts Codex Gemini Claude¶

pkg/controllers/app/app/controllers/app_controller.go

		log.Info("Deleted input ConfigMap after LimaVM created its own copy")
		// ConfigMaps are not watched (no Owns(&corev1.ConfigMap{})), so deleting
		// one produces no watch event. Requeue explicitly to guarantee the next
		// reconcile runs rather than relying on implicit LimaVM status activity.
		return ctrl.Result{Requeue: true}, nil
	} else if !apierrors.IsNotFound(err) {
		return ctrl.Result{}, fmt.Errorf("failed to fetch input ConfigMap: %w", err)
	}

	if limaVM.Spec.Running && !app.Spec.Running {
		limaVM.Spec.Running = false
		if err := r.Update(ctx, limaVM); err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to propagate running state to LimaVM: %w", err)
		}
		return ctrl.Result{}, nil
	}

	// Propagate app spec.containerEngine and spec.kubernetes into the LimaVM's
	// template ConfigMap.
	if limaVM.Status.TemplateConfigMap != "" {
		templateCM := &corev1.ConfigMap{}
		if err := r.Get(ctx, client.ObjectKey{Name: limaVM.Status.TemplateConfigMap, Namespace: namespace}, templateCM); err != nil {
			if !apierrors.IsNotFound(err) {
				return ctrl.Result{}, fmt.Errorf("failed to fetch LimaVM template ConfigMap: %w", err)
			}
		} else {
			desired, err := applySpecToTemplate(r.LimaTemplateData, app.Spec)
			if err != nil {
				return ctrl.Result{}, fmt.Errorf("failed to apply spec to template: %w", err)
			}
			if templateCM.Data[limav1alpha1.TemplateConfigMapKey] != desired {

The PR's stated motivation is rdd set containerEngine.name=containerd waiting for the VM restart. After the App reconciler patches the template ConfigMap and requeues, the next pass copies LimaVM's still-Running=Started condition into the App and re-stamps it with the new App.Generation at line 334. computeSettledCondition then sees a current-generation Running=Started and, on engine-disabled paths (Windows, --controllers=-engine) immediately returns Settled=True. On engine-enabled paths the engine condition's ObservedGeneration is also stale, which holds Settled at False via the EngineStale branch — but that masking depends on the engine controller running in the same process, not on a real Lima convergence signal.

pkg/controllers/app/app/controllers/app_controller.go

	}

	if !limaVM.Spec.Running && app.Spec.Running {
		limaVM.Spec.Running = true
		if err := r.Update(ctx, limaVM); err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to propagate running state to LimaVM: %w", err)
		}
		return ctrl.Result{}, nil
	}

	// Mirror LimaVM status conditions and compute Settled. The engine

The new comment at lines 297-300 acknowledges the race as a brief blip, but rdd set's waitForDesiredState returns the moment it sees Settled=True at the new generation. There is no minimum dwell, no integration test exercising the running-VM backend swap (engine-docker.bats:534-559 starts from a stopped VM), and no Lima-side condition the App can synchronise on. The result is exactly the early-return behaviour the PR removes for running=true/false but reintroduces for backend / kubernetes spec changes on a running VM.

bats/tests/32-app-controllers/engine-docker.bats

    refute_output
}

# --- containerd backend ---

@test "containerd backend reports ContainerEngineReady=NotApplicable and skips mirroring" {
    # Stop first so there is no stale True/Connected from moby to
    # satisfy the Settled wait below before the engine reconciler has
    # processed the containerd switch.
    rdd set running=false

    # Start with containerd. rdd set waits for Settled=True, which
    # requires ContainerEngineReady=True. The engine reconciler
    # satisfies that immediately with reason NotApplicable because
    # engine mirroring only supports the moby backend.
    rdd set containerEngine.name=containerd running=true

    run -0 rdd ctl get app app \
        -o jsonpath='{.status.conditions[?(@.type=="ContainerEngineReady")].reason}'
    assert_output "NotApplicable"

    # No mirror resources should exist in containerd mode.
    run -0 rdd ctl get containers --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get images --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get volumes --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
}

Fix: gate Settled on a Lima-owned convergence signal rather than the restamped Running mirror. Either (a) add LimaVMStatus.ObservedTemplateResourceVersion and require limaVM.Status.ObservedTemplateResourceVersion == templateCM.ResourceVersion before computing Settled=True, or (b) have LimaVM publish a dedicated "reconciled with current template" condition that the App controller aggregates. As an interim, set Settled to False with reason RestartPending on the same reconcile that patches the template CM, and persist a marker so the next pass does not flip it back to True until LimaVM has reacted.

I2. EngineEnabled captured from process-local registry at startup ignores cluster-wide topology Codex Gemini¶

pkg/controllers/app/app/controller.go

	if err := v1alpha1.AddToScheme(mgr.GetScheme()); err != nil {
		return err
	}
	if err := limav1alpha1.AddToScheme(mgr.GetScheme()); err != nil {
		return err
	}

	discovery, err := servicecontrollers.NewControllerManagerDiscovery(mgr.GetConfig())
	if err != nil {
		return fmt.Errorf("failed to create controller-manager discovery: %w", err)
	}

	if err := (&controllers.AppReconciler{
		Client:           mgr.GetClient(),
		Scheme:           mgr.GetScheme(),
		LimaTemplateData: limaTemplateData(),

base.IsControllerEnabled is documented as "started in this OS process" (controller.go:122-157), populated only from each manager's local set in service.go:761-772 and external/main.go:86-92. The repo ships separate cmd/app-controller and cmd/containers-controller standalone binaries; docs/design/discovery.md:56-64 allows external controllers to attach later. Whenever the engine controller runs in a different manager from the App controller — including the future where engine ships in cmd/containers-controller — EngineEnabled stays false, computeSettledCondition skips the ContainerEngineReady gate at lines 393-405, and cmd/rdd/set.go:477-486 returns before the engine has connected. This regresses the prior CLI-side behaviour, which queried the discovery ConfigMap to decide whether to wait on the engine condition.

pkg/controllers/app/app/controller.go

	validator, err := controllers.NewAppValidator(k3sVersionsData)
	if err != nil {
		return err
	}
	validatingConfig := base.WebhookConfig[*v1alpha1.App]{
		Name:        appValidatorConfigName,
		WebhookName: appValidatorWebhookName,
		WebhookPort: c.webhookPort,
		Operations: []admissionregistrationv1.OperationType{
			admissionregistrationv1.Create,
			admissionregistrationv1.Update,
		},
		Validator: validator,
	}

	managers, err := base.SetupWebhookForResource(mgr, &v1alpha1.App{}, validatingConfig)
	if err != nil {
		return err
	}
	c.webhookManagers = append(c.webhookManagers, managers...)
	return nil
}

// RegisterWithManager implements the complete controller registration for both embedded and external modes.
// It registers the CRD types with the scheme and sets up the controller with the manager.
// It also registers Lima types, which allows App controller to create and watch LimaVM resources.
func (c *controller) RegisterWithManager(mgr ctrl.Manager) error {
	if err := v1alpha1.AddToScheme(mgr.GetScheme()); err != nil {
		return err
	}
	if err := limav1alpha1.AddToScheme(mgr.GetScheme()); err != nil {
		return err
	}

	discovery, err := servicecontrollers.NewControllerManagerDiscovery(mgr.GetConfig())
	if err != nil {
		return fmt.Errorf("failed to create controller-manager discovery: %w", err)
	}

	if err := (&controllers.AppReconciler{
		Client:           mgr.GetClient(),
		Scheme:           mgr.GetScheme(),
		LimaTemplateData: limaTemplateData(),
		Discovery:        discovery,
	}).SetupWithManager(mgr); err != nil {
		return err

pkg/external/main.go

	if len(controllersToStart) == 0 {
		setupLog.Info("All controllers are already running in RDD controller manager, exiting")
		return 0
	}

	// Expose the enabled set to reconcilers that need to know which
	// siblings share this process (see pkg/controllers/base docs).
	enabledNames := make([]string, 0, len(controllersToStart))
	for _, controller := range controllersToStart {
		enabledNames = append(enabledNames, controller.GetName())
	}
	base.SetInProcessControllers(enabledNames)

	// Create a cancellable context for control plane monitoring
	monitorCtx, cancelMonitor := context.WithCancel(ctx)
	defer cancelMonitor()

docs/design/discovery.md


Ordering both steps before the annotation keeps the "ready = clients may proceed" contract consistent: any client that waits for the annotation sees both CRDs and the enabled-controller list, not just one of the two.

The ConfigMap is recreated on every control plane startup, so a stale `ready` from a previous crash is always cleared before CRD installation begins.

The annotation only covers controllers known at control plane startup. External controllers (for example, the mock controller, or a controller manager started later via `rdd svc create --controllers=...`) can attach at any time, so a client waiting for the ready annotation must not assume that every controller it cares about is already registered. Discover external controllers by reading the ConfigMap directly.

## Consumers

**Service readiness** -- The control plane queries the ConfigMap to discover which controllers are running across all managers. It uses this to decide when external controllers have finished registering.

**Passthrough proxy** -- The control plane exposes a `/passthrough/<controller>/<endpoint>/...` HTTP route that looks up the target manager via the ConfigMap and reverse-proxies the request.

**External controller startup** -- An external controller checks the ConfigMap to see whether its controllers are already running (to avoid conflicts) and monitors it to detect when the control plane has stopped.

**Health gating** -- `IsControllerRunning` performs an HTTP health check against the registered `healthEndpoint` before reporting a controller as running.

cmd/rdd/set.go

// current generation.
//
// Filtering on ObservedGeneration >= minGen blocks a stale
// Settled=True from a prior reconcile from satisfying the wait
// before the App controller observes our write.
func waitForDesiredState(ctx context.Context, config *rest.Config, timeout time.Duration, minGen int64) error {
	// timeout == 0 means "no deadline", matching kubectl wait.
	ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
	defer cancel()

	logrus.Info("Waiting for App to settle")
	return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
		status, gen, present := conditionInfo(obj, appv1alpha1.AppConditionSettled)
		return present && gen >= minGen && status == metav1.ConditionTrue
	})
}

// watchCondition watches the App singleton until the predicate
// returns true or the context expires. watchtools.UntilWithSync
// re-lists transparently on benign watch-channel closures (API

The split-manager scenario is not exploitable today because engine is not blank-imported by any standalone binary (only pkg/service/cmd/service.go's embedded path imports it). That makes this latent rather than active — but the architectural assumption "engine is enabled iff it's in this process" is built into the wait contract just as engine is being prepared for external execution.

Fix: derive engine availability from the shared discovery ConfigMap (controllers.NewControllerManagerDiscovery(mgr.GetConfig())) and check it during reconciliation, so late-attaching managers can take effect without a controller restart. Cache the result with a short TTL if the lookup cost is a concern.

I3. PR title, body, and a new in-code comment claim a Ready condition that does not exist Claude¶

pkg/controllers/app/app/controllers/app_controller.go

				if templateCM.Data == nil {
					templateCM.Data = make(map[string]string)
				}
				templateCM.Data[limav1alpha1.TemplateConfigMapKey] = desired
				if err := r.Patch(ctx, templateCM, patch); err != nil {
					return ctrl.Result{}, fmt.Errorf("failed to update template ConfigMap: %w", err)
				}
				log.Info("Updated template ConfigMap",
					"containerEngine", app.Spec.ContainerEngine.Name,
					"kubernetesEnabled", app.Spec.Kubernetes.Enabled,
					"kubernetesVersion", app.Spec.Kubernetes.Version)

The PR title is "app: aggregate Ready and Settled conditions, wait for Settled in rdd set" and the body has a two-paragraph "Ready" section explaining the contract. But app_types.go defines only AppConditionSettled (line 38); there is no AppConditionReady, nothing in computeSettledCondition produces a Ready entry, and the unit tests only assert the Settled path. The stale comment on app_controller.go:314 was introduced by this commit and will mislead readers into looking for a Ready aggregator.

pkg/controllers/app/app/controllers/app_controller.go

				if templateCM.Data == nil {
					templateCM.Data = make(map[string]string)
				}
				templateCM.Data[limav1alpha1.TemplateConfigMapKey] = desired
				if err := r.Patch(ctx, templateCM, patch); err != nil {
					return ctrl.Result{}, fmt.Errorf("failed to update template ConfigMap: %w", err)
				}
				log.Info("Updated template ConfigMap",
					"containerEngine", app.Spec.ContainerEngine.Name,
					"kubernetesEnabled", app.Spec.Kubernetes.Enabled,
					"kubernetesVersion", app.Spec.Kubernetes.Version)

pkg/controllers/app/app/controllers/app_controller.go

)

// ControllerDiscovery enumerates controllers enabled across all controller
// managers in this control plane. AppReconciler queries it during
// reconciliation to decide whether Settled must wait for
// ContainerEngineReady. The interface mirrors a small subset of
// pkg/service/controllers.ControllerManagerDiscovery so tests can substitute
// a fake without taking on the full dependency.
type ControllerDiscovery interface {
	GetEnabledControllers(ctx context.Context) ([]string, error)
}

This matters for the review itself: a reviewer cannot tell from the PR whether Ready was an intentional scope reduction or an implementation oversight — Gemini's executive summary in this round explicitly described the PR as "introduces the Ready and Settled conditions", taking the PR description at face value.

Fix: decide whether Ready is in scope. If not, drop "Ready" from the PR title and body, and correct the comment:

-	// Mirror LimaVM status conditions and aggregate Ready/Settled. The
-	// engine reconciler writes ContainerEngineReady on the same object,
+	// Mirror LimaVM status conditions and compute Settled. The
+	// engine reconciler writes ContainerEngineReady on the same object,

If Ready is still intended, add the constant in app_types.go, compute it alongside Settled, and extend the test matrix.

Suggestions ¶

S1. Hardcoded "engine" controller name creates silent coupling Claude¶

pkg/controllers/app/app/controller.go

		return err
	}

	discovery, err := servicecontrollers.NewControllerManagerDiscovery(mgr.GetConfig())
	if err != nil {
		return fmt.Errorf("failed to create controller-manager discovery: %w", err)
	}

	if err := (&controllers.AppReconciler{
		Client:           mgr.GetClient(),
		Scheme:           mgr.GetScheme(),

The App controller queries the engine controller by literal name. Engine's own name constant — controllerName at pkg/controllers/app/engine/controller.go:33 — is unexported, so a rename on the engine side would silently turn EngineEnabled into a permanent false, regressing Settled to the pre-engine path with no build or test failure. The literal is repeated in the doc comment at app_controller.go:53 and in service.go:771.

pkg/controllers/app/engine/controller.go

	}
	base.RegisterController(newController())
}

// apiGroup is "containers" because the reconciler watches
// Container, Image, and Volume from that group and needs
// their CRDs at startup. Grouping engine with its dependencies
// keeps --controllers selections from splitting the two apart.
const apiGroup = "containers"

type controller struct{}

pkg/controllers/app/app/controllers/app_controller.go


	// requeueAfterDeletion is how long to wait between checks while the LimaVM
	// controller is running its teardown (stopping the VM, removing disk files).
	requeueAfterDeletion = 2 * time.Second
)

// AppReconciler reconciles the singleton App resource and manages its LimaVM lifecycle.
type AppReconciler struct {
	client.Client
	Scheme           *runtime.Scheme
	LimaTemplateData string

Fix: define EngineControllerName = "engine" in pkg/apis/app/v1alpha1/app_types.go (alongside AppConditionContainerEngineReady) and reference it from both sides. Importing pkg/controllers/app/engine directly from pkg/controllers/app/app is undesirable because it would trigger engine's init() registration in the app-controller binary, which targets a different API group.

S2. Settled Reason table entry says "(varies)" but the values are well-defined Claude¶

docs/design/api_app.md

  | `ContainerEngineReady` | `True`    | `Connected`      | Engine controller has connected to Docker and completed full sync |
  | `ContainerEngineReady` | `True`    | `NotApplicable`  | Mirroring is not implemented for the current backend (e.g. `containerd`); forced `True` so `rdd set` can finish waiting |
  | `ContainerEngineReady` | `False`   | `ConnectFailed`  | Engine controller failed to connect to Docker                     |
  | `ContainerEngineReady` | `False`   | `Stopped`        | The VM is stopped; the engine watcher is not running              |
  | `Settled`              | `True`    | `Settled`        | Reconcile chain has caught up with the current spec               |
  | `Settled`              | `False`   | `WaitingForLimaVM` | LimaVM has not yet reported a `Running` condition               |
  | `Settled`              | `False`   | `WaitingForEngine` | Engine controller has not yet written `ContainerEngineReady`    |
  | `Settled`              | `False`   | `EngineStale`    | Engine controller has not yet observed the current generation     |
  | `Settled`              | `False`   | *(forwarded)*    | Forwarded from the blocking `Running` or `ContainerEngineReady` reason |

  `Running=True` means the Lima guest has finished booting and SSH is reachable. It says nothing about the container engine socket; consumers that depend on the engine (container/image/volume mirrors, `docker` against the forwarded socket) must also check `ContainerEngineReady`, which flips to `True` only after the engine controller has connected to the socket and completed its initial full sync.

computeSettledCondition emits an enumerable set of fixed reasons: WaitingForLimaVM, WaitingForEngine, EngineStale, plus the forwarded LimaVM Running reason or the engine ContainerEngineReady reason. Documenting them lets consumers (GUI, tests, other controllers) branch on specific values without reading the source.

Fix: expand the row into explicit rows for WaitingForLimaVM, WaitingForEngine, EngineStale, and a catch-all "forwarded from Running/ContainerEngineReady" row.

S3. runningLimaVMMessage uses string-suffix matching for failure detection Claude¶

pkg/controllers/app/app/controllers/app_controller.go

		// While desiredRunning is false, ContainerEngineReady may be
		// absent or stale: the engine controller writes it once per
		// run and stops writing it once the watcher has been torn
		// down. A stopped VM is settled regardless of what
		// ContainerEngineReady currently says.
		settled.Status = metav1.ConditionTrue
		settled.Reason = "Settled"
		settled.Message = "App has reached the desired state"
	case engineCond == nil:
		settled.Status = metav1.ConditionFalse
		settled.Reason = "WaitingForEngine"
		settled.Message = "Waiting for container engine condition"
	case engineCond.ObservedGeneration < app.Generation:
		settled.Status = metav1.ConditionFalse
		settled.Reason = "EngineStale"
		settled.Message = "Container engine needs to be synchronized"

The heuristic couples Settled's behaviour to a LimaVM naming convention. A new failure reason that doesn't end in Failed would silently lose its diagnostic message; a non-failure reason ending in Failed (none today, but nothing prevents one) would leak a confusing operational message into Settled.

Fix: switch on the known failure reasons, or check runningCond.Status == metav1.ConditionFalse && runningCond.Message != "". The LimaVM reason constants would need to be exported or mirrored; a unit test pinning the App-side strings to the LimaVM values keeps drift visible.

Design Observations ¶

Concerns ¶

Settled infers downstream convergence from mirrored status instead of controller-owned signals (in-scope) Codex Claude. The current design lets the App controller guess whether Lima and engine have caught up by mirroring their status into the App and combining that with a process-topology hint. A stronger structure would have Lima and engine each publish an explicit "observed App generation X and finished applying it" condition that App aggregates. This removes both the split-manager dependency in I2 and the stale-restamp race in I1, and gives rdd set a true end-to-end synchronisation contract instead of a heuristic.
Settled absence hangs rdd set for the full timeout when the App reconciler is not running (future) Claude. If the App controller is excluded by --controllers=-app or fails to start, no one writes Settled and rdd set blocks until the 5-minute default timeout expires. set.bats papers over this by passing --wait=false, but a real deployment that disabled the app controller would hang. The CLI could check for the App controller's presence via the discovery ConfigMap (the same data path the pre-PR isEngineControllerEnabled used) and degrade gracefully.

Strengths ¶

ObservedGeneration as part of the wait contract Codex. Once the writers stamp the right generation at the right time, this cleanly prevents stale success from a previous spec from satisfying the wait. The primitive is right; the open issues are about which writer stamps it and when.
Conflict-retry on shared status object Codex Gemini. The multi-writer App status updates at app_controller.go:320-344 continue to use retry.RetryOnConflict plus re-Get, consistent with EngineReconciler.setEngineCondition. This avoids reintroducing the 409 loop that the prior architecture had between concurrent writers.
Stopped-VM short-circuit handles engine teardown contract correctly Claude. Short-circuiting !desiredRunning when runningCond.Reason == "Stopped" (app_controller.go:397-405) handles the "engine stops writing ContainerEngineReady after watcher teardown" contract that would otherwise deadlock a stopped app.
Pure-function condition computation with broad unit coverage Claude. computeSettledCondition is a pure function over the App's status plus one bool; app_controller_test.go covers each branch with 11 cases. This is the right shape for the logic and the right way to test it.

Testing Assessment ¶

Unit coverage of computeSettledCondition is thorough. Integration gaps, ranked by risk:

No integration test for rdd set containerEngine.name=… while the VM is running. The new Settled semantics explicitly target this case, but engine-docker.bats:534-559 starts from a stopped VM, sidestepping the I1 race. An end-to-end test that sets the backend on a running VM and asserts that rdd set returns only after the restart has actually completed would both validate the PR and surface the race.
No integration test exercises EngineEnabled=false against a real control plane. The Windows / --controllers=-engine branch is unit-tested, but no integration test runs against a service started with --controllers=-engine to exercise the full base.IsControllerEnabled("engine") discovery path.
No test verifying Settled re-fires on engine-only condition changes. The App reconciler must re-stamp Settled when engine writes ContainerEngineReady. ctrl.NewControllerManagedBy(mgr).For(&v1alpha1.App{}) lacks an explicit predicate, so it relies on controller-runtime's default of firing on any event. A test that writes ContainerEngineReady at a new generation and asserts Settled.ObservedGeneration catches up would guard against a GenerationChangedPredicate creeping in and silently breaking this PR's contract.
No coverage for split-manager / late external-registration topologies Codex. Even though no standalone binary ships engine today, docs/design/discovery.md:56-64 documents support for external controllers attaching after startup. Without a test, the I2 latent regression cannot fail closed.

go test ./pkg/controllers/app/app/controllers passes locally Codex.

Documentation Assessment ¶

docs/design/api_app.md:97-119 and docs/design/cmd_app.md:13-17 now promise stronger semantics than the code provides: every property change waits until the full reconcile chain has caught up, but I1 and I2 show that is not yet true on the template-update path or in split-manager deployments. The doc should either narrow the contract or note the known race until I1 lands.
docs/design/api_app.md:97 does not mention that Settled is the single signal rdd set waits on; a reader has to cross-reference cmd_app.md to infer this.
docs/design/cmd_app.md loses the previous documentation about what "stops waiting" meant for each property class. A sentence clarifying that the wait now covers backend swaps (the previously-open TODO) would make the contract change explicit.
See S2 for the Settled-reason table gap.

Commit Structure ¶

Clean. Single commit, message accurately describes the change, stacks correctly on #316 and #309 as advertised. The PR title and body need updating per I3 to match what was implemented.

Acknowledged Limitations ¶

app_controller.go:297-300 — comment on the template-CM update path acknowledges the race in I1. The author has flagged it; the PR does not attempt to close the window. Reviewers note the comment understates user impact: the CLI can return during the stale window, not merely transit through Settled=True briefly.
set.bats:10-15 — --wait=false on every rdd set because the suite runs with only the app controller registered, so Settled is never written. Acknowledged and consistent with routing meaningful wait coverage through engine-docker.bats.
PR body — author notes Windows / EngineEnabled=false is unit-tested but not exercised end-to-end on Windows; trusting CI. The engineEnabled=false branch in the unit test matrix is the path Windows takes.

Unresolved Feedback ¶

PR description has no review comments listed; no inline reviewer feedback to reconcile.

Agent Performance Retro ¶

Claude ¶

Best executive-summary calibration of the three: caught that the PR title/body promise Ready while only Settled was implemented (I3), which Gemini swallowed and Codex only mentioned in passing. Strongest on local code quality (S1 hardcoded "engine" string, S2 docs table, S3 string-suffix matching) — these are the kind of fix-now-cheaply suggestions that compound over a codebase. Treated the template-CM race as a future design observation rather than promoting it; with hindsight, "Important regression" is the right calibration since the PR's stated purpose is the backend-swap wait. One miss: did not raise the process-local EngineEnabled concern that both Codex and Gemini caught.

Codex ¶

Tied with Gemini on identifying the two structural issues (template-CM race, process-local EngineEnabled), and gave the cleanest write-up of both — including the right callouts to cmd/containers-controller/main.go and docs/design/discovery.md. Correctly classified the template race as Important regression rather than Critical, which matched the merged severity. Mentioned the missing Ready once in the Documentation Assessment but did not promote it to a finding. Ran go test for live verification, which strengthened the testing-assessment section.

Gemini ¶

Found the same two structural issues as Codex (race condition, process-local EngineEnabled), with independent diagnoses. Promoted the template-CM race to Critical, which is defensible given the PR's stated goal but harsher than the author's acknowledged-limitation framing warrants. Two miscalibrations: (a) its I2 (engine missing from cmd/app-controller/main.go) is misdirected — engine's API group is containers, not app, so adding it to cmd/app-controller would panic at startup via the API-group check in external/main.go:65-68; the underlying observation about engine missing from cmd/containers-controller is true but pre-existing. (b) Took the PR description at face value and described the PR as "introduces the Ready and Settled conditions" in its executive summary, missing the Ready-vs-Settled discrepancy that Claude caught. As expected per the repo memory, did not run git blame.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	9m 15s	6m 37s	5m 22s
Findings	1I 3S	2I	2I
Tool calls	46 (Read 23, Grep 15, Bash 8)	64 (shell 61, plan 2, stdin 1)	22 (grepsearch 12, readfile 7, runshellcommand 3)
Design observations	5	3	2
False positives	0	0	1
Unique insights	5	1	0
Files reviewed	12	12	12
Coverage misses	1	1	1
Totals	1I 3S	2I	2I
Downgraded	0	0	1 (I→S)
Dropped	0	0	1

Claude and Codex both delivered actionable reports; Codex's findings were the most precisely scoped and Claude's local-code suggestions added the most readability/maintainability value. Gemini's two structural findings overlap fully with Codex's, and its misdirected I2 plus its missed I3 reduce its marginal contribution on this PR. For a single-agent review of this kind of change (controller boundaries + condition contract), Codex would have been sufficient; Claude added meaningful local-quality suggestions that Codex missed.

Review Process Notes ¶

Repo context updates ¶

REMOVE the rule "When reviewing a new controller under pkg/controllers/, verify its GetAPIGroup() matches the grouping convention of sibling controllers and that the --controllers help text in pkg/service/controllers/options.go lists the group" — currently in deep-review-context.md. The point is correct in principle but agents have started over-applying it: in this round, Gemini used it to recommend adding the engine controller to cmd/app-controller/main.go, which would actually panic at startup because engine's API group is containers. Replace with a tighter rule: "When checking that every blank-imported controller appears in the matching standalone binary, verify the controller's GetAPIGroup() matches the binary's group argument first — external.RunControllers panics on a group mismatch." This keeps the import-coverage check while preventing the wrong-binary fix recommendation.
ADD a rule about cross-controller convergence signals: "When a controller publishes an aggregate condition that depends on another controller having reacted to a recent change (Settled, Ready, Reconciled), verify the aggregate is gated on a signal the dependent controller writes — not on data the aggregating controller computes from its own observation. A reconciler-restamped mirror of someone else's status is not a convergence signal: between (a) the change being requested and (b) the dependent controller observing it, the aggregator can produce a fresh True from stale data." This codifies the I1 pattern; agents should look for it whenever a PR adds a new aggregate condition.
ADD a rule about process-local registry queries in cross-controller logic: "Flag base.IsControllerEnabled(...) (or any init()-time registry query) used to gate a cluster-wide contract. The result is process-local — it answers 'is this controller in my OS process,' not 'is this controller running anywhere in the cluster.' Cluster-wide questions belong on the discovery ConfigMap. Process-local checks are appropriate only for in-process plumbing (e.g., wiring a watch or skipping a setup step)." Two of three agents flagged this in the current PR; making the heuristic explicit in the repo context will make future flags consistent.

Skill improvements ¶

None this round.