Deep Review: 20260423-182700-pr-344

Date	2026-04-23 18:27
Repo	rancher-sandbox/rancher-desktop-daemon
Round	1 (of target)
Author	@jandubois
PR	#344 — cli: standardize --wait and --timeout across svc and limavm
Branch	`wait-timeout-cleanup-v2`
Commits	`b59e796` cli: standardize --wait and --timeout across svc and limavm
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — minor doc/test polish plus an under-bounded readiness wait that survives an in-flight shutdown; all findings are low-risk or pre-existing.
Wall-clock time	`18 min 43 s`

Executive Summary ¶

The PR delivers what it advertises: a uniform --wait/--timeout contract across rdd svc and rdd limavm, with cliexit.Classify centralizing context.DeadlineExceeded → exit code 4. StopWithWait/Delete now accept caller context and timeout, the discovery-ConfigMap freshness check anchors on a per-start PID-file mtime, and the manual watcher.ResultChan() loops in cmd/rdd/limavm.go correctly distinguish a satisfied predicate from a dropped channel.

No agent flagged the change set as merge-blocking. The one Important finding (Gemini I1) is a pre-existing race between in-flight shutdown and the readiness poll; the PR's new --timeout=0 option for svc start --wait widens its worst-case slightly. Suggestions cluster around (a) sibling CLI paths the uniform-contract refactor missed, (b) error/diagnostic loss in the new Delete predicate, and (c) doc and test polish.

Structure: 0 Critical, 1 Important (gap, pre-existing), 6 Suggestions.

Critical Issues ¶

None.

Important Issues ¶

I1. Readiness wait can hang against an in-flight shutdown Gemini¶

pkg/service/cmd/service.go


	return readiness.WaitForReadyWithCRDs(ctx, config, enabledControllers, false)
}

// Wait until the control plane is ready or the context is cancelled.
func Wait(ctx context.Context) error {
	if err := checkReadiness(ctx); err == nil {
		return nil
	}

	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := checkReadiness(ctx); err == nil {
				return nil
			}
		}
	}
}

// StopWithWait sends a shutdown signal to the service. When wait is true, it
// blocks until the process exits, ctx is cancelled, or timeout elapses; pass
// timeout 0 to wait indefinitely (bounded only by ctx).
func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {

If rdd svc start (or any command transiting ensureServiceRunning) launches while rdd svc stop --wait or rdd svc delete is draining, the start-side path takes the already-running branch (Running() returns true while the daemon drains), then enters service.Wait. When the daemon exits and the shutdown path removes instance.Config(), checkReadiness keeps failing — the loop has no Running() check — and the wait hangs until ctx expires (default 90s on the autostart path, or whatever --timeout the user passed to svc start).

waitForFreshDiscoveryConfigMap at cmd/rdd/service.go:151 has the same shape: it polls until the ConfigMap is fresh enough or ctx expires, with no fast-fail when the daemon is gone.

cmd/rdd/service.go

// The serve command recreates the ConfigMap on every startup, so a
// creationTimestamp at or after beforeStart identifies the current
// instance. The ready annotation is set after CRDs are installed and
// every controller manager has registered, so waiting for it lets
// clients use both CRDs and discovery data without racing startup.
func waitForFreshDiscoveryConfigMap(ctx context.Context, beforeStart time.Time) error {
	restConfig, err := service.GetKubeRestConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(restConfig)

Reconciliation: Gemini labeled this "regression"; the underlying Wait loop has no Running() check at the merge-base either (git show 81135357:pkg/service/cmd/service.go confirms an identical loop), so the bug is pre-existing (gap), not introduced by this PR. The PR does widen exposure marginally — the freshness check on the already-running path is new (post-PR commits to creationTimestamp >= startTime), and svc start --wait --timeout=0 now allows an unbounded hang where the pre-PR WaitWithTimeout capped at 90s. Repo context line 51 explicitly calls out this pattern as a known concern, which is exactly what surfaced here.

pkg/service/cmd/service.go

	_ "k8s.io/component-base/metrics/prometheus/workqueue"
	"k8s.io/component-base/term"
	"k8s.io/component-base/version"
	"k8s.io/component-base/version/verflag"
	"k8s.io/klog/v2"
	aggregatorapiserver "k8s.io/kube-aggregator/pkg/apiserver"
	controlplaneapiserver "k8s.io/kubernetes/pkg/controlplane/apiserver"
	// Justify blank import.
	_ "k8s.io/kubernetes/pkg/features"

	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/cli/help"

Fix: short-circuit both polls when the service process is gone:

 case <-ticker.C:
+    if !Running() {
+        return fmt.Errorf("control plane exited while waiting for readiness")
+    }
     if err := checkReadiness(ctx); err == nil {
         return nil
     }

 return wait.PollUntilContextCancel(ctx, 500*time.Millisecond, true, func(ctx context.Context) (bool, error) {
+    if !service.Running() {
+        return false, fmt.Errorf("control plane exited while waiting for discovery ConfigMap")
+    }
     cm, err := client.CoreV1().ConfigMaps(controllers.RDDSystemNamespace).Get(...)

Suggestions ¶

S1. rdd ctl wait-condition skips the uniform exit-4 contract Gemini¶

cmd/rdd/ctl_wait_condition.go

		WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
			opts.FieldSelector = fieldSelector
			return resource.Watch(ctx, opts)
		},
	}
	_, err = watchtools.UntilWithSync(ctx, lw, &unstructured.Unstructured{}, nil,
		func(event watch.Event) (bool, error) {
			if event.Type == watch.Deleted {
				return false, errors.New("resource was deleted while waiting")
			}
			obj, ok := event.Object.(*unstructured.Unstructured)
			if !ok {
				return false, nil
			}
			return checker.check(obj), nil
		},
	)
	return err
}

// parseConditionArg parses "TYPE[=STATUS]" into condition type and status.
// STATUS defaults to "True" when omitted.
func parseConditionArg(arg string) (condType, condStatus string, err error) {

ctl wait-condition carries its own --timeout flag (default 30s) and creates a context with that deadline at line 97. When the deadline expires, watchtools.UntilWithSync returns a wrapped context.DeadlineExceeded, which exits the CLI with code 1 instead of the intended cliexit.CodeTimeout (4). The PR does not touch this file, but the stated goal of a uniform CLI contract leaves this sibling path on the wrong exit code. This matches the warning at repo-context line 48 about uniform-contract refactors that wire only the outermost entry points.

cmd/rdd/ctl_wait_condition.go


	if err := flags.Parse(rawArgs); err != nil {
		return err
	}
	positional := flags.Args()
	if len(positional) != 2 {
		return fmt.Errorf("expected TYPE/NAME and CONDITION[=STATUS] arguments, got %d arguments", len(positional))
	}

	resourceType, resourceName, err := parseResourceArg(positional[0])
	if err != nil {

cmd/rdd/ctl_wait_condition.go

	dynClient, err := dynamic.NewForConfig(restConfig)
	if err != nil {
		return fmt.Errorf("failed to create dynamic client: %w", err)
	}

	ctx, cancel := context.WithTimeout(cmd.Context(), *timeout)
	defer cancel()

	checker := conditionChecker{
		condType:   condType,
		condStatus: condStatus,

Fix:

-    return err
+    return cliexit.Classify(err)

S2. Delete silently drops non-deadline StopWithWait errors when Running() happens to be false at re-check Claude¶

pkg/service/cmd/service.go

// assumes the instance exists. If the service is running, Delete waits up to
// timeout for it to exit before removing files; removing instance.Dir() while
// the serve process holds rdd.pid, rdd.sqlite3, and log files corrupts the
// directory on Windows and breaks PID-file mutual exclusion on Unix.
// Pass timeout 0 to wait indefinitely (bounded only by ctx).
func Delete(ctx context.Context, timeout time.Duration) error {
	if Running() {
		if err := StopWithWait(ctx, true, timeout); err != nil {
			// Return the error whenever the process is alive or the
			// deadline expired, so deletion never races a live daemon
			// and the CLI exits 4 on timeout per docs/design/cmd_service.md.
			// The predicate also absorbs signal-delivery failures and
			// context cancellation when the process has already exited;
			// the invariant is "the directory survives unless the daemon
			// is confirmed gone." context.Canceled is unreachable today
			// because the CLI runs on context.Background(); wiring SIGINT
			// into cmd.Context() would let Ctrl-C fall through to deletion
			// during a live stop, and needs an explicit decision here.
			if errors.Is(err, context.DeadlineExceeded) || Running() {
				return err
			}
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())
	return os.RemoveAll(instance.Dir())
}

// preserveAllInstanceLogs moves .log files from each Lima instance directory
// to the service log directory before the instance directories are deleted.
// This is a no-op unless RDD_KEEP_LOGS is set.
//

StopWithWait can return three non-deadline errors: "control plane is not running" (line 350; benign race), the "failed to stop ... pid" wrapper when both Interrupt and Kill fail (line 367; rare on Unix, possible on Windows when no shared console), and "wait for ... cancelled" when the parent context is canceled (line 400; today unreachable per the in-code comment). When any of these fire and the daemon happens to have exited between the error and the re-check, the predicate proceeds with deletion and the diagnostic is lost. The behavior is still strictly better than the pre-PR _ = Stop() (which dropped every error unconditionally), so the outcome is correct; what is lost is the user's only signal that something unexpected happened — most relevantly a Windows console-fallback that bypassed graceful hostagent shutdown.

pkg/service/cmd/service.go

// StopWithWait sends a shutdown signal to the service. When wait is true, it
// blocks until the process exits, ctx is cancelled, or timeout elapses; pass
// timeout 0 to wait indefinitely (bounded only by ctx).
func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	pid := PID()
	// Try graceful shutdown first. On Unix, Kill already sends SIGTERM which
	// triggers the Go signal handler. On Windows, Kill uses TerminateProcess
	// which bypasses all handlers, so we send Interrupt (CTRL_BREAK_EVENT)
	// first to let the service run its shutdown path (shutdownAllHostagents).
	//
	// On Unix, Interrupt (SIGINT) always succeeds for a valid PID, so the
	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

	if wait {
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)

pkg/service/cmd/service.go

				_ = process.Kill(pid)
				err := ctx.Err()
				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}

Reconciliation: Claude labeled this "regression"; the pre-existing _ = Stop() discarded errors more aggressively, so the post-PR predicate is an improvement, not a regression. Reclassified as a design-quality concern in newly-introduced code.

Fix: log the dropped error at warn level so the daemon-state surprise is preserved without changing the outcome:

 if Running() {
     if err := StopWithWait(ctx, true, timeout); err != nil {
         if errors.Is(err, context.DeadlineExceeded) || Running() {
             return err
         }
+        logrus.WithError(err).Warn("service exited despite shutdown error; proceeding with delete")
     }
 }

S3. watchUntil / watchUntilDeleted give up after a single dropped watch Claude¶

cmd/rdd/limavm.go

	if err != nil {
		return err
	}
	defer watcher.Stop()

	for ev := range watcher.ResultChan() {
		switch ev.Type {
		case k8swatch.Error:
			return apierrors.FromObject(ev.Object)
		case k8swatch.Deleted:
			return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
		}
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}
	}
	// ResultChan closed. If ctx ended, return its error. Otherwise the
	// watch dropped (server close, proxy timeout, watch expiry); re-read
	// and re-evaluate to distinguish a satisfied predicate from a drop.
	if err := ctx.Err(); err != nil {
		return err
	}
	if err := c.Get(ctx, key, &vm); err != nil {
		if apierrors.IsNotFound(err) {
			return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
		}
		return err
	}
	if check(&vm) {
		return nil
	}
	return fmt.Errorf("watch closed before LimaVM %q reached the desired state", key.Name)
}

// watchUntilDeleted watches a LimaVM until it is deleted.
func watchUntilDeleted(ctx context.Context, c client.WithWatch, limaVM *limav1alpha1.LimaVM) error {
	key := client.ObjectKeyFromObject(limaVM)

The fall-through correctly fixes the "dropped watch → spurious success" trap. But when the predicate still isn't satisfied after the re-read, the code returns a hard error rather than reopening the watch from vm.ResourceVersion. For rdd limavm restart --wait --timeout=5m, a benign Kubernetes server-side watch expiry (~5–10 min) would surface as a hard failure even though the caller's ctx is still valid. The repo already has a retry-aware pattern: cmd/rdd/set.go:490-535 uses watchtools.UntilWithSync, which transparently re-lists on watch-channel closure.

cmd/rdd/set.go


// watchCondition watches the App singleton until the predicate
// returns true or the context expires. watchtools.UntilWithSync
// re-lists transparently on benign watch-channel closures (API
// timeouts, proxy disconnects, resource-version compaction).
func watchCondition(ctx context.Context, config *rest.Config, satisfied func(*unstructured.Unstructured) bool) error {
	dynClient, err := dynamic.NewForConfig(config)
	if err != nil {
		return fmt.Errorf("failed to create dynamic client: %w", err)
	}

	// Defensive filter; the webhook enforces metadata.name=app on the singleton.
	lw := &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			options.FieldSelector = "metadata.name=app"
			return dynClient.Resource(appGVR).List(ctx, options)
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			options.FieldSelector = "metadata.name=app"
			return dynClient.Resource(appGVR).Watch(ctx, options)
		},
	}

	precondition := func(store cache.Store) (bool, error) {
		for _, item := range store.List() {
			if obj, ok := item.(*unstructured.Unstructured); ok && satisfied(obj) {
				return true, nil
			}
		}
		return false, nil
	}

	condition := func(event watch.Event) (bool, error) {
		obj, ok := event.Object.(*unstructured.Unstructured)
		if !ok {
			return false, nil
		}
		return satisfied(obj), nil
	}

	if _, err := watchtools.UntilWithSync(ctx, lw, &unstructured.Unstructured{}, precondition, condition); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return fmt.Errorf("timed out waiting for App state: %w", err)
		}
		if errors.Is(err, context.Canceled) {
			return fmt.Errorf("wait cancelled: %w", err)
		}
		return fmt.Errorf("failed to watch App: %w", err)
	}
	return nil
}

// conditionInfo returns the status, observedGeneration, and presence
// of the named condition on an App. Callers must check present
// before trusting the other return values.
func conditionInfo(obj *unstructured.Unstructured, condType string) (status metav1.ConditionStatus, observedGen int64, present bool) {

Fix: replace the hand-rolled loop with watchtools.UntilWithSync (preferred — matches the set implementation), or wrap the existing loop in a retry that opens a fresh watch from the current ResourceVersion until ctx is done.

S4. docs/design/api_service.md omits the Windows "first signal fails" case Claude¶

docs/design/api_service.md


* It starts any additional controllers configured via the `config` ConfigMap.

## Service shutdown

The control plane is notified to shut down, either by `rdd service stop`
(SIGINT on Unix, `CTRL_BREAK_EVENT` on Windows; if the wait deadline expires
the wait force-terminates with SIGTERM on Unix or `TerminateProcess` on
Windows), or by the OS preparing to shut down the host.

Some controllers need to be notified of pending shutdown so they can gracefully terminate, e.g. shut down virtual machines.

Any controller can request shutdown notification by creating a ConfigMap in the `rdd-system` namespace:

The in-code comment at pkg/service/cmd/service.go:356-364 documents that on Windows, GenerateConsoleCtrlEvent "fails when caller and target lack a shared console; Kill (TerminateProcess) then bypasses graceful shutdown." Since rdd.exe typically launches the daemon as a detached background process, this fallback is the common Windows path, not a corner case — graceful hostagent shutdown then relies on killOrphanedHostagent at next start. The doc currently captures only the deadline-expiry fallback.

pkg/service/cmd/service.go

	}

	pid := PID()
	// Try graceful shutdown first. On Unix, Kill already sends SIGTERM which
	// triggers the Go signal handler. On Windows, Kill uses TerminateProcess
	// which bypasses all handlers, so we send Interrupt (CTRL_BREAK_EVENT)
	// first to let the service run its shutdown path (shutdownAllHostagents).
	//
	// On Unix, Interrupt (SIGINT) always succeeds for a valid PID, so the
	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

Fix: extend the parenthetical, e.g.:

> On Windows, when the CLI and daemon do not share a console, CTRL_BREAK_EVENT cannot be delivered and the stop falls back to TerminateProcess immediately; graceful hostagent shutdown does not run, and orphaned hostagents are reaped at the next service start.

S5. svc delete --timeout flag help wording diverges from the rest of the PR Claude¶

cmd/rdd/service.go

	command := &cobra.Command{
		Use:  "delete",
		Long: "Delete RDD control plane",
		RunE: serviceDeleteAction,
	}
	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion (0 means wait indefinitely)")
	return command
}

func serviceDeleteAction(cmd *cobra.Command, _ []string) error {
	if !service.Exists() {

Every other --timeout flag in this PR uses the shape "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)". Because svc delete deliberately omits --wait, the wording correctly differs — but it then drops the exit-4 contract that cmd_service.md:121 explicitly documents.

docs/design/cmd_service.md


Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place. See [`rdd service stop`](#rdd-service-stop)
for the signal and fallback details of the shutdown itself.

### `rdd service reset`

Deletes the datastore, but create a new (empty) one with the same settings.

Fix: align the wording to surface the exit-4 contract:

-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion; on expiry rdd exits with code 4 and preserves the instance directory (0 means wait indefinitely)")

S6. Cross-test daemon-state coupling in the new 6-timeout.bats Claude¶

bats/tests/20-service/6-timeout.bats

@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Hard reset: the prior test sent SIGTERM but does not wait for the
    # process to exit, so a plain `rdd svc start` could still see
    # Running()=true and take the already-running branch.
    rdd svc delete --timeout=10s || :
    rdd svc create
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    run -0 rdd svc status
    assert_output --partial 'has been created: true'
}

@test 'svc start --timeout=1ms exits with code 4' {
    # Reset to a known state; prior tests left the control plane half-dead.
    rdd svc delete --timeout=120s
    rdd svc create

The hard-reset uses --timeout=10s, while every other reset in the suite (e.g. setup_rdd_control_plane) uses --timeout=120s. If the prior test's SIGTERM ever takes longer than 10s on slow CI, the reset times out, rdd svc create errors with "instance already exists", and run -4 on the subsequent line misreports the failure as "expected exit 4 but got 1". The comment also reads slightly off: the prior test does wait (1ms) before the synchronous Kill fallback fires.

Fix: bump the reset to --timeout=120s for parity with the rest of the suite, or move the reset into a local_setup so each @test starts from a known daemon state independent of the prior test's force-exit timing.

Design Observations ¶

Concerns

(future) ensureServiceRunning's 90-second cap is correct (the comment defends it well: a broken service must fail fast on a long deadline), but the user-visible error is identical to the one a genuinely slow VM boot would produce. A user running rdd limavm start --wait --timeout=30m on a flaky control plane will see exit-4 "context deadline exceeded" after 90s with no signal that the failure was the autostart cap, not the VM. Consider wrapping the autostart-phase error with a prefix like "control plane not ready after %s (autostart cap): %w" so the diagnostic distinguishes "broken daemon" from "slow VM". Claude

Strengths

(in-scope) Centralizing cliexit.Classify removes duplicated errors.Is(err, context.DeadlineExceeded) branches and makes the exit-4 contract inspectable in one place. Wired through ensureServiceRunning, every transitive caller (rdd set, every rdd limavm *, rdd kubectl, rdd service config) inherits the contract without per-callsite plumbing — the PR's central design move. Claude Codex Gemini
(in-scope) Anchoring waitForFreshDiscoveryConfigMap on service.StartTime() (PID-file mtime) correctly handles daemon restarts without a cross-process atomic clock; the truncation comment in startAndWaitForReady (time.Now().Truncate(time.Second)) shows the right level of attention to metav1.Time precision. Claude
(in-scope) watchUntil / watchUntilDeleted fall-through after a closed ResultChan correctly distinguishes "watch dropped" from "predicate satisfied" via the post-loop re-read — a real class of spurious successes that the repo steering note specifically calls out. Claude Gemini
(in-scope) StopWithWait's on-deadline Running() re-check before falling through to SIGTERM/TerminateProcess cleanly handles the "daemon just exited" race; pre-PR it would have returned a timeout error even when graceful shutdown had completed. Claude

Testing Assessment ¶

The new BATS coverage anchors the exit-4 contract for every primary entry point: svc start/stop/delete and limavm create/start/stop/restart/delete. The --timeout=1ms pattern is a reliable trigger for the deadline path, and assert_output --partial 'context deadline exceeded' ties exit code to the wrapped error.

Gaps worth tracking (none merge-blocking):

svc start --wait=false followed by svc start — exercises the new already-running + startTime-anchored freshness check; not yet covered.
svc delete --timeout=0 (indefinite) — verifies that watchtools.ContextWithOptionalTimeout(ctx, 0) is honored on the shutdown path; not covered.
limavm create --start --wait=false --timeout=5m — verifies that --timeout really is ignored when --wait=false; a regression here would silently waste wall-clock with no failure signal.
Retry after a --timeout=1ms delete — 6-timeout.bats confirms the directory survives but doesn't close the loop by retrying with a reasonable timeout.
Autostart timeout via a transitive caller (e.g. rdd ctl get or rdd service config against a wedged daemon) — confirms the cliexit.Classify wiring inside ensureServiceRunning reaches commands that don't surface --timeout themselves.

Codex live-verified make build-rdd and rdd svc start --wait --timeout=1ms (exit code 4); the BATS suite reproduces the same path under deterministic timing.

Documentation Assessment ¶

cmd_service.md and cmd_lima.md track the new defaults, the exit-4 contract, and the svc delete "preserves the instance directory on timeout" behavior consistently — these read cleanly. api_service.md is the one doc miss (S4) — it captures the timeout-fallback signal path but omits the more common Windows "no shared console → immediate TerminateProcess" path. The dense rationale comment at pkg/service/cmd/service.go:384-395 is technically accurate but worth splitting into shorter paragraphs on a subsequent edit.

The stopWaitTimeout = limaLongWaitTimeout cross-file aliasing at cmd/rdd/service.go:43 is acceptable (it makes the per-VM ceiling the single tuning knob) but fragile; the existing comment on the alias is the right size and should not be removed.

Acknowledged Limitations ¶

pkg/service/cmd/service.go:356-364 — Windows GenerateConsoleCtrlEvent failure with no shared console falls back to TerminateProcess, bypassing graceful hostagent shutdown; cleanup deferred to killOrphanedHostagent at next start. PR does not regress this; only the doc still under-describes it (see S4).
pkg/service/cmd/service.go:429-434 — context.Canceled is unreachable today because the CLI runs on context.Background(); wiring SIGINT into cmd.Context() would change deletion semantics during a live stop and needs an explicit decision. Comment is accurate.
pkg/service/cmd/service.go:410-413 — --wait=false deliberately leaves instance.Config() in place so concurrent rdd ctl / rdd kubectl calls keep working while the daemon drains. Intentional; documented inline.

Agent Performance Retro ¶

Claude ¶

Drove the bulk of the report with six findings spread across code, docs, and tests. Strongest contribution: the watchUntil no-retry concern (S3) and the dense Windows-shutdown doc gap (S4) — both required reading the in-code comment block and reasoning about cross-platform behavior, not just the diff. Mislabeled S2 as a "regression" — the new Delete predicate is actually an improvement over the pre-PR _ = Stop() that discarded every error; reclassified to a design-quality concern. Did not catch the readiness-wait race that Gemini flagged.

Codex ¶

Returned a clean review with zero findings and the highest-value live verification of the round (make build-rdd, rdd svc start --wait --timeout=1ms exit 4). The empty-handed result is honest — Codex correctly ranked the change set as low-risk — but it missed both the readiness-wait race that the repo context explicitly warns about (Gemini I1) and the ctl wait-condition contract gap (Gemini S1). Useful as a calibration signal that the PR has no obvious latent bugs in the changed paths; less useful for tracing into sibling code.

Gemini ¶

Found the highest-impact issue of the round (I1, the readiness-wait race) by following exactly the pattern repo-context line 51 warns about. Also caught the only sibling-path miss (S1, ctl wait-condition skipping cliexit.Classify) that the uniform-contract refactor leaves behind — repo-context line 48 in action. Mislabeled I1 as a "regression"; the underlying Wait() loop has the same shape at the merge-base. Two findings, both valid, both downgraded by reconciliation but kept in the report.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.5	Gemini 3.1 Pro
Duration	12m 47s	5m 54s	9m 41s
Findings	5S	none	1I 1S
Tool calls	38 (Read 19, Grep 14, Bash 5)	51 (shell 49, stdin 2)	30 (grepsearch 17, readfile 10, runshellcommand 2)
Design observations	5	1	3
False positives	0	0	0
Unique insights	5	0	2
Files reviewed	11	11	11
Coverage misses	0	1	0
Totals	5S	none	1I 1S
Downgraded	1 (I→S)	0	1 (I→S)
Dropped	0	0	0

Reconciliation

Gemini C1 readiness race: severity downgraded from Critical to Important, label changed from regression to gap. Underlying Wait() loop has no Running() check at merge-base SHA 81135357 (verified via git show). PR widens worst-case slightly via the new freshness check on the already-running path and the --timeout=0 option on svc start --wait.
Claude I1 Delete predicate: severity downgraded from Important to Suggestion (S2), label changed from regression to design concern. Pre-PR _ = Stop() discarded every error; the post-PR predicate is a strict improvement that loses only the diagnostic, not correctness.

Best value of the round: Gemini provided the only finding tied to repo-specific context (race against in-flight shutdown), which Codex missed entirely and Claude did not surface. Claude provided the broadest coverage and the most actionable suggestions for the changed code itself. Codex contributed the only live verification.

Review Process Notes ¶

None — repo context already encodes the patterns that surfaced this round (lines 48 and 51). Skill prompt and dimensions covered everything that landed.