Deep Review: 20260422-115550-pr-336

Date	2026-04-22 11:55
Repo	rancher-sandbox/rancher-desktop-daemon
Round	4
Author	@jandubois
PR	#336 — cli: standardize --wait and --timeout across svc and limavm
Branch	`wait-timeout-cleanup`
Commits	`4bf0ec5` cli: standardize --wait and --timeout across svc and limavm
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — Round 3's I1 (`Delete` timeout swallow) and I2 (`StopWithWait`/`Delete` caller context) are both cleanly resolved — `errors.Is(err, context.DeadlineExceeded) \\|\\| Running()` narrows the recovery, and both functions now accept `ctx`. Five of six Round 3 Suggestions landed (S1 → `cliexit.Classify`, S2/S6 → help-text updates, S4 → accurate SIGINT/SIGTERM wording, S5 → multi-VM caveat in comment); R3 S3 (`run -0` on the cleanup `rdd limavm delete "start-vm"`) turned out to be a false recommendation — the repo BATS style rules say to drop `run` on a command whose output is not inspected, so the current code is correct. Two new Important findings: (I1) `watchUntil` / `watchUntilDeleted` in `cmd/rdd/limavm.go` return `nil` when the watch channel closes before the context fires, so a dropped API-server watch silently reports success on any `limavm start/stop/restart/create/delete --wait`; (I2) the already-running `svc start` branch calls `waitForDiscoveryConfigMap` without the freshness anchor the cold-start path uses, so a stale `ReadyAnnotation=true` from a crashed prior instance can satisfy the wait. Nine small suggestions round out the list.
Wall-clock time	`37 min 38 s`

Executive Summary ¶

Round 4 addresses every Round 3 Important finding and five of six Round 3 Suggestions. Delete's over-broad recovery is now gated on errors.Is(err, context.DeadlineExceeded) || Running(), preserving the "exit 4, directory in place" contract on deadline expiry (R3 I1 resolved). StopWithWait(ctx, wait, timeout) and Delete(ctx, timeout) both accept the caller's context and derive their timed child via watchtools.ContextWithOptionalTimeout, closing the R3 I2 "--timeout=0 is uninterruptible" exposure. The timeoutError helper is moved to pkg/cli/exit/exit.go as Classify and used from set.go, service.go, and limavm.go (R3 S1). --timeout help text on every --wait-capable command now says "ignored if --wait=false" (R3 S2), the api_service.md wording is now technically accurate ("SIGTERM as fallback" on Unix, R3 S4), and the stopWaitTimeout comment acknowledges the multi-VM ceiling (R3 S5).

Two new Important findings surface. First, cmd/rdd/limavm.go:485-497 and :527-535 (watchUntil / watchUntilDeleted) both return ctx.Err() when the watch channel closes — and ctx.Err() is nil whenever the context is still valid at that moment. If the API server closes the watch before the desired state arrives (proxy timeout, server restart, graceful close), the CLI returns exit 0 and logs "LimaVM started/stopped/…" without the status ever matching. This code is pre-existing (776b8f77, 2026-02-26) and not modified by this PR, but the PR widens its exposure: every limavm wait path now has a user-documented --timeout=0 mode, and the new exit-code-4 contract depends on timeouts actually firing. Second, cmd/rdd/service.go:271 still calls waitForDiscoveryConfigMap(ctx) in the already-running svc start branch, which skips the freshness anchor (beforeStart) that the cold-start path uses at :122-128. A stale ReadyAnnotation=true ConfigMap from a crashed prior serve can satisfy service.Wait() + waitForDiscoveryConfigMap() before the new serve has reached InitDiscovery() — the user sees exit 0 while controllers are still initializing. The R3 strength ("already-running branch now matches the cold-start path") is narrower than consolidated then.

Suggestions span diagnostics and polish: service.Wait and waitForFreshDiscoveryConfigMap surface bare ctx.Err(), so users see only "context deadline exceeded" with no hint of which wait timed out; ensureServiceRunning still uses context.WithTimeout instead of the watchtools.ContextWithOptionalTimeout the rest of the PR standardizes on; BATS coverage for svc start --timeout / svc delete --timeout exit-code-4 remains an intentional R2/R3 skip but a regression there would ship green; cliexit.Classify can double-wrap already-classified errors; the already-running svc start branch has no "is ready" log to match the cold-start; and the svc delete --timeout help text reads "stop-and-wait" (jargon) where the siblings use plain "Timeout for --wait". Three more minor polish items round out the list.

No Critical issues. Codex's Important I1 ("new 5-minute svc stop / svc delete default is shorter than healthy multi-VM shutdown, regression") is dropped as a mis-labeled regression: pre-merge-base StopWithWait hardcoded a 60-second timeout in pkg/service/cmd/service.go:373, and svc stop exposed no --timeout flag. The PR raises that default to 5 minutes and exposes --timeout=0 for indefinite waits — strictly a widening, not a shortening. The underlying "multi-VM sequential drain may exceed 5m once shutdownAllHostagents is parallelized" concern is real but already documented as an acknowledged limitation in the stopWaitTimeout comment (R3 S5 resolution).

Critical Issues ¶

None.

Important Issues ¶

I1. watchUntil / watchUntilDeleted return nil when the watch channel closes, silently succeeding on a dropped watch Gemini¶

cmd/rdd/limavm.go

	if err != nil {
		return err
	}
	defer watcher.Stop()

	for ev := range watcher.ResultChan() {
		switch ev.Type {
		case k8swatch.Error:
			return apierrors.FromObject(ev.Object)
		case k8swatch.Deleted:
			return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
		}
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}
	}
	return ctx.Err()
}

// watchUntilDeleted watches a LimaVM until it is deleted.
func watchUntilDeleted(ctx context.Context, c client.WithWatch, limaVM *limav1alpha1.LimaVM) error {
	key := client.ObjectKeyFromObject(limaVM)

When watcher.ResultChan() closes without an Error event — a graceful connection close, a proxy/load-balancer timeout, a server restart, or the Kubernetes watch timeout that fires every 5–10 minutes by default — the for loop exits, and ctx.Err() is consulted. Inside watchUntil the context passed in is the caller's waitCtx; if the timeout has not yet expired, ctx.Err() returns nil. Every limavm action then interprets that as success:

if err := watchUntil(waitCtx, c, key, func(vm *limav1alpha1.LimaVM) bool {
    return vm.Status.RestartCount > beforeCount
}); err != nil {
    return cliexit.Classify(fmt.Errorf("failed waiting for restart to complete: %w", err))
}
logrus.Infof("LimaVM %q restarted in namespace %q", name, limaVM.Namespace)

cmd/rdd/limavm.go:233-238, 413-417, 453-457 and the matching watchUntilDeleted call at :556-558 all propagate the nil as success. The user sees LimaVM "foo" started / deleted / restarted with exit 0, but the status field was never observed to change.

This is pre-existing (776b8f77, 2026-02-26) but the PR materially widens the exposure:

Every limavm wait-capable subcommand now defaults to --wait=true with a finite --timeout (this PR), so more CLI users land on these code paths than before.
--timeout=0 is now a documented user-facing option (docs/design/cmd_lima.md). In that mode the context never cancels on its own, so ctx.Err() returns nil on every channel close — the false-success path is the only path a dropped watch can take.
The PR's new exit-code-4 contract advertises "the wait timed out" as a crisp signal; the watch-drop path launders it into "the wait succeeded."

Fix: distinguish "channel closed before condition met" from "context cancelled first." A one-line guard at the return ctx.Err() site suffices:

 for ev := range watcher.ResultChan() {
     ...
 }
-return ctx.Err()
+if err := ctx.Err(); err != nil {
+    return err
+}
+return fmt.Errorf("watch for LimaVM %q closed before condition was observed", key.Name)

Apply the symmetric change at watchUntilDeleted:535 with "... closed before deletion was observed".

(important, gap — pre-existing behavior widened by this PR's user-documented --timeout=0 and the exit-code-4 contract that depends on timeouts being the authoritative "gave up" signal.)

I2. Already-running svc start branch skips the freshness anchor, so a stale ReadyAnnotation can satisfy the wait Codex¶

cmd/rdd/service.go

		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !wait {
			return nil
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {

waitForDiscoveryConfigMap calls waitForFreshDiscoveryConfigMap(ctx, time.Time{}), and the zero beforeStart makes the cm.CreationTimestamp.Time.Before(beforeStart) check vacuous:

func waitForDiscoveryConfigMap(ctx context.Context) error {
    return waitForFreshDiscoveryConfigMap(ctx, time.Time{})
}

The cold-start path avoids stale state by anchoring beforeStart := time.Now().Truncate(time.Second) before service.Start, then passing that to waitForFreshDiscoveryConfigMap (cmd/rdd/service.go:117-128). The already-running branch has no such anchor.

cmd/rdd/service.go

// the apiserver may become ready before the serve command creates the ConfigMap.
func startAndWaitForReady(ctx context.Context, serveArgs []string, timeout time.Duration) error {
	// Truncate to second precision because metav1.Time drops sub-seconds
	// during JSON serialization. Without this, a server that starts in the
	// same second would appear to have a startTime *before* beforeStart.
	beforeStart := time.Now().Truncate(time.Second)
	if err := service.Start(ctx, serveArgs); err != nil {
		return err
	}

	ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
	defer cancel()

	if err := service.Wait(ctx); err != nil {
		return err
	}
	return waitForFreshDiscoveryConfigMap(ctx, beforeStart)
}

// waitForDiscoveryConfigMap polls until the discovery ConfigMap exists.
func waitForDiscoveryConfigMap(ctx context.Context) error {
	return waitForFreshDiscoveryConfigMap(ctx, time.Time{})

The race: serve writes rdd.pid very early in its startup sequence, so service.Running() returns true from that moment on. If a prior serve process crashed with its discovery ConfigMap still carrying ReadyAnnotation=true in the sqlite3 datastore, and a new serve process has just started but not yet reached InitDiscovery(), a concurrent rdd svc start enters the already-running branch, service.Wait returns once /readyz + CRDs respond (which predates discovery), and waitForDiscoveryConfigMap finds the stale annotation immediately. The CLI returns exit 0 and the user proceeds to run commands that depend on controllers that haven't registered yet.

The cold-start path's beforeStart blocks exactly this: the ConfigMap must have been created at or after the current startup to satisfy the wait. The already-running branch needs the same contract — but there's no natural beforeStart value, because the branch doesn't own the startup moment. Workable options:

Take beforeStart from the rdd.pid file's modification time (serve writes the file on startup, so the mtime bounds when the current instance began). Truncate to seconds to match metav1.Time serialization.
Have serve record a lastStart timestamp in its on-disk state when it comes up; both paths read from it.

Fix (pid-file mtime approach):

 if service.Running() {
     ...
-    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+    beforeStart, err := pidFileStartTime()
+    if err != nil {
+        return cliexit.Classify(err)
+    }
+    return cliexit.Classify(waitForFreshDiscoveryConfigMap(ctx, beforeStart))
 }

Round 3 consolidated "already-running branch now waits for ReadyAnnotation, matching the cold-start path" as a strength. It waits for the annotation, but not for a fresh one — the match is narrower than it looked. Round 4 carries this through, so it lands here.

(important, regression — the already-running wait is new in this PR; the freshness-bypass ships with it.)

Suggestions ¶

S1. service.Wait and waitForFreshDiscoveryConfigMap surface bare ctx.Err(), so the user-visible timeout message has no scoping context Claude¶

pkg/service/cmd/service.go


	return readiness.WaitForReadyWithCRDs(ctx, config, enabledControllers, false)
}

// Wait until the control plane is ready or the context is cancelled.
func Wait(ctx context.Context) error {
	if err := checkReadiness(ctx); err == nil {
		return nil
	}

	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := checkReadiness(ctx); err == nil {
				return nil
			}
		}
	}
}

// StopWithWait sends a shutdown signal to the service. When wait is true, it
// blocks until the process exits, ctx is cancelled, or timeout elapses; pass
// timeout 0 to wait indefinitely (bounded only by ctx).
func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {

service.Wait returns context.DeadlineExceeded directly; waitForFreshDiscoveryConfigMap (via wait.PollUntilContextCancel) does the same. The CLI feeds both through cliexit.Classify, so exit code 4 is correct, but the error message printed to stderr is just context deadline exceeded — no indication of which wait fired. StopWithWait wraps as timed out waiting for %q control plane with pid %d to stop: %w (pkg/service/cmd/service.go:382), which is the shape a user can act on.

pkg/service/cmd/service.go

				// that are children of the service but run in their own
				// process groups. On Unix, SIGTERM suffices because the
				// service is responsive to signals (it's just slow shutting
				// down hostagents sequentially).
				_ = process.Kill(pid)
				return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}

Fix: wrap at the call site with a one-line scope. Example:

 if service.Running() {
     ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
     defer cancel()
     if err := service.Wait(ctx); err != nil {
-        return cliexit.Classify(err)
+        return cliexit.Classify(fmt.Errorf("waiting for %q control plane to be ready: %w", instance.Name(), err))
     }
-    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+    if err := waitForDiscoveryConfigMap(ctx); err != nil {
+        return cliexit.Classify(fmt.Errorf("waiting for discovery ConfigMap: %w", err))
+    }
+    return nil
 }

The PR already threads %w through the limavm.go call sites (failed waiting for LimaVM to start: %w, etc.); service.go's three cliexit.Classify(err) sites are the outliers.

(suggestion, gap — diagnostic parity with the limavm paths.)

S2. ensureServiceRunning uses context.WithTimeout; every sibling wait uses watchtools.ContextWithOptionalTimeout Claude¶

cmd/rdd/service.go

		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForDiscoveryConfigMap(ctx))

Every other timeout-bound block this PR touches — startAndWaitForReady:122, serviceStartAction already-running branch :266, StopWithWait:366, and all five limavm.go wait sites — uses watchtools.ContextWithOptionalTimeout, which degrades to context.WithCancel when the duration is 0. startWaitTimeout is a positive constant today, so the observable behavior is the same, but a future refactor that exposes this constant via a flag would silently fail on --timeout=0 (an immediately-expired context). Drop-in change:

-        ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
+        ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
         defer cancel()

(suggestion, enhancement — cross-site consistency.)

S3. No BATS coverage for svc start --timeout / svc delete --timeout exit-code-4 contract Codex¶

bats/tests/20-service/6-timeout.bats

load '../../helpers/load'

# Assert `rdd svc stop --wait` exits with code 4 (cliexit.CodeTimeout) when
# --timeout expires. A 1ms deadline fires before the graceful-shutdown loop
# can observe the service exit, so the wait sends SIGTERM (or TerminateProcess
# on Windows) and returns context.DeadlineExceeded wrapped in cliexit.Timeout.

The PR adds exit-code-4 behavior on three svc paths (stop, start already-running branch, delete) and four limavm paths, but the new 6-timeout.bats asserts only svc stop. svc delete is the highest-risk unrepresented path because its I1 predecessor in Round 3 (over-broad error swallow) would have shipped green under a run -4 rdd svc delete --timeout=1ms && assert_dir_exist "${RDD_DIR}" test. svc start's already-running branch is also new and has no integration coverage. The author's Round 2 resolution table lists these two as intentional skips; Round 4 I2 demonstrates that the cost of that skip can be a non-obvious correctness bug.

Fix: two new cases in 6-timeout.bats, shaped like the existing one. For svc delete, also assert the instance directory survives:

@test 'svc delete --timeout=1ms exits with code 4 and preserves directory' {
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    assert_dir_exist "$(rdd svc paths dir)"
}

(suggestion, gap — carries forward R2 testing gaps #1–#2 listed as intentional skips; R4 I2 makes the coverage argument sharper.)

S4. Already-running svc start branch has no symmetric "is ready" log on success Claude¶

The cold-start branch logs "%q control plane is ready" after the wait succeeds (line 297). The already-running branch waits for the same readiness signal but logs nothing on success — a user running rdd svc start against an already-running-but-still-initializing service sees the "waiting for ... to be ready" line then silence. Mirror the cold-start log:

cmd/rdd/service.go


	if wait {
		if err := startAndWaitForReady(cmd.Context(), serveArgs, timeout); err != nil {
			return cliexit.Classify(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
	} else {
		if err := service.Start(cmd.Context(), serveArgs); err != nil {
			return err
		}
		logrus.Infof("%q control plane is starting", instance.Name())

         if err := service.Wait(ctx); err != nil {
             return cliexit.Classify(err)
         }
-        return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+        if err := waitForDiscoveryConfigMap(ctx); err != nil {
+            return cliexit.Classify(err)
+        }
+        logrus.Infof("%q control plane is ready", instance.Name())
+        return nil
     }

Pairs well with S1's wrapping — both amount to "make the already-running branch as chatty on success and failure as the cold-start branch."

(suggestion, enhancement.)

S5. cliexit.Classify can double-wrap a *cliexit.Error{Code: CodeTimeout} Gemini¶

pkg/cli/exit/exit.go

// Timeout wraps err with [CodeTimeout].
func Timeout(err error) *Error {
	return &Error{Code: CodeTimeout, Err: err}
}

// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Other errors pass through unchanged.
func Classify(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}

*cliexit.Error.Unwrap() exposes the inner error, so errors.Is(cliexit.Timeout(deadlineErr), context.DeadlineExceeded) is true. If a call site passes an already-classified error through Classify again, the result is *cliexit.Error{Code: 4, Err: *cliexit.Error{Code: 4, Err: deadlineErr}} — two layers, same exit code. Today's call graph does not double-classify in practice (every Classify is a top-level, per-command call), but the pattern of wrap-on-inner-call-then-classify-on-outer-call is easy to slip into and the helper should be idempotent:

 func Classify(err error) error {
+    var cliErr *Error
+    if errors.As(err, &cliErr) {
+        return err
+    }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }

(suggestion, enhancement — defensive idempotence.)

S6. Delete's deadline-branch recovery elides that the process has already been SIGTERM'd Claude¶

pkg/service/cmd/service.go

// Pass timeout 0 to wait indefinitely (bounded only by ctx).
func Delete(ctx context.Context, timeout time.Duration) error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	if Running() {
		if err := StopWithWait(ctx, true, timeout); err != nil {
			// Suppress only the outer/inner Running() race, where the
			// process exits between Delete's check and StopWithWait's.
			// Deadline expiry must surface so the CLI exits 4 and leaves
			// the directory in place, per docs/design/cmd_service.md.
			if errors.Is(err, context.DeadlineExceeded) || Running() {
				return err
			}
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())

The comment at :430-433 correctly describes the R3 I1 fix. One observation worth a second comment line: when the deadline branch fires, StopWithWait has already called _ = process.Kill(pid) (:381). On Unix that is SIGTERM, which the service catches — but the k8s SetupSignalContext second-signal handler routes the second SIGTERM through os.Exit(1) within microseconds, so the process is usually already dead when Delete returns err. A user who runs rdd svc status right after exit 4 will see no running PID, which reads as a contradiction of the "directory left in place because the process is still alive" framing in the docs. The contract is still correct (the original reason for preserving the directory was "do not remove it under a live writer," which holds during the wait; the post-kill moment is a bonus side effect), but the comment understates the post-deadline process state:

         if errors.Is(err, context.DeadlineExceeded) || Running() {
+            // At deadline expiry, StopWithWait has already SIGTERM'd the
+            // process, which usually exits within microseconds. Directory
+            // removal still waits for the next `rdd svc delete` invocation,
+            // per docs/design/cmd_service.md.
             return err
         }

(suggestion, enhancement — comment accuracy.)

S7. rdd limavm create --timeout help text understates when the flag is ignored Claude¶

cmd/rdd/limavm.go

	}
	command.Flags().StringVarP(&namespace, "namespace", "n", metav1.NamespaceDefault, "Namespace for the LimaVM resource")
	command.Flags().BoolVar(&dryRun, "dry-run", false, "If set, do not commit any changes to the cluster")
	command.Flags().BoolVar(&start, "start", false, "Start the VM after creation")
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the operation to complete (only applies with --start)")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
	return command
}

func newLimaVMStartCommand() *cobra.Command {
	var wait bool

The create wait condition is start && wait && !dryRun (:409), so --timeout is also ignored when --start is false or --dry-run is set. The other limavm commands' help text is correct because their wait condition is just wait (or equivalent). Tighten the create text:

-    command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
+    command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored without --start, with --wait=false, or with --dry-run (0 means wait indefinitely)")

(suggestion, enhancement — only create has this extra condition.)

S8. svc delete --timeout help text reads as jargon ("stop-and-wait") Claude¶

cmd/rdd/service.go

	command := &cobra.Command{
		Use:  "delete",
		Long: "Delete RDD control plane",
		RunE: serviceDeleteAction,
	}
	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
	return command
}

func serviceDeleteAction(cmd *cobra.Command, _ []string) error {
	if !service.Exists() {

Round 3 S6 pushed the author to divorce the wording from the Timeout for --wait sibling (since delete has no --wait flag). The landing text "stop-and-wait" is terse but reads as a noun of art; "stop and wait" is not a named concept in the docs, and the siblings use plain Timeout for --wait. A more readable form:

-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for waiting for the control plane to stop (0 means wait indefinitely)")

(suggestion, enhancement — polish on R3 S6 resolution; the Round 3 fix was correct in spirit, the specific word choice is what jars.)

S9. cliexit.Classify's docstring under-describes potential growth and the single-case implementation is minimal enough to inline Claude¶

pkg/cli/exit/exit.go

// Timeout wraps err with [CodeTimeout].
func Timeout(err error) *Error {
	return &Error{Code: CodeTimeout, Err: err}
}

// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Other errors pass through unchanged.
func Classify(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}

The name Classify suggests a polymorphic classifier (admission errors → CodeRejected, context.Canceled → some user-cancel code, etc.), but the body handles exactly one case. Two paths:

If the intent is "extension point for more classes later," add a doc line naming that intent so future readers don't think it's incomplete.
If the intent is "just this one classifier for now," the caller pattern cliexit.Classify(fmt.Errorf("waiting for ...: %w", err)) is two lines (check + wrap) that could be inlined at each of the 9 call sites, at the cost of a package-level helper.

Either direction keeps the signal clear. A two-line doc comment is the cheapest form:

-// Classify wraps err with [CodeTimeout] when err wraps
-// [context.DeadlineExceeded]. Other errors pass through unchanged.
+// Classify tags known error classes with their CLI exit code. Today it
+// only handles [context.DeadlineExceeded] → [CodeTimeout]; add new
+// classifications here as they become CLI-observable.
 func Classify(err error) error {

(suggestion, enhancement — documentation clarity for a shared helper.)

Design Observations ¶

Strengths ¶

Round 3's two Important findings both land cleanly in a single commit Claude Codex Gemini. Delete's recovery now gates on errors.Is(err, context.DeadlineExceeded) || Running(), the documented "exit 4, directory in place" contract holds, and both StopWithWait(ctx, ...) and Delete(ctx, ...) accept the caller's context and derive their timed child via watchtools.ContextWithOptionalTimeout. The latter pattern now appears at every wait site in the PR.
cliexit.Classify at the CLI boundary is the right shape (in-scope) Codex Claude. The helper lives in pkg/cli/exit, every entry point already imports it, and set.go's previously-inline errors.Is(...) → cliexit.Timeout block was migrated to use it. The helper also keeps lower-level packages returning ordinary wrapped errors — no coupling from pkg/service/cmd back up to the CLI.
Unix/Windows shutdown signal divergence is documented at the point of use Gemini Claude. pkg/service/cmd/service.go:348-358 walks a reader through why Interrupt is tried first (lets the service run its shutdown path on Windows, where Kill is TerminateProcess and bypasses handlers), and why the Kill fallback never fires on Unix. The docs/design/api_service.md:21 wording ("SIGTERM as fallback" on Unix, "TerminateProcess as fallback" on Windows) now matches the implementation — the R3 S4 "forced termination" claim is gone.
Single source of truth for long waits Claude. limaLongWaitTimeout = 5 * time.Minute drives every limavm subcommand and stopWaitTimeout, and the stopWaitTimeout comment explicitly names the future LimaVM stop-orchestration work so the R3 S5 multi-VM caveat is no longer implicit.

Concerns ¶

cliexit.Classify only handles context.DeadlineExceeded, not context.Canceled (future) Claude. A user who presses Ctrl-C during rdd svc stop --wait --timeout=5m gets exit code 1, indistinguishable from a real failure. A separate "user cancelled" exit code would let scripts distinguish intentional cancellation from command failure. Not actionable this round — the user-cancellation path is not exercised in this PR.
Two context.WithTimeout call sites still bypass watchtools.ContextWithOptionalTimeout (in-scope) Claude. cmd/rdd/service.go:95 uses the plain form for startWaitTimeout; pkg/service/cmd/service.go:202 uses it for an internal 15-second bound on getRuntimeControllersFromCluster. Both are fine today (non-zero positive constants), but they are the two stragglers in a PR whose theme is "0 means infinite everywhere." S2 proposes the ensureServiceRunning fix.

Testing Assessment ¶

Coverage of the exit-code-4 contract is broad on the limavm side: every wait-capable subcommand has a run -4 regression test, and 6-timeout.bats asserts svc stop --timeout=1ms. Unaddressed gaps, carried forward from earlier rounds plus new:

rdd svc delete --timeout=… deadline expiry returns exit code 4 and leaves the directory in place. round 2 S2 unaddressed (intentional skip). The R3 I1 regression around this exact code path would have been caught by a test. Priority elevated by R4 I2.
rdd svc start --timeout=… against an already-running-but-not-yet-ready service returns exit code 4. round 2 S2 unaddressed (intentional skip). Would exercise the new already-running branch plus R4 I2's freshness-skip.
Dropped watch during rdd limavm start/stop/restart/create/delete --wait. New in R4. Ties directly to I1. Hard to provoke without a custom mock API server; a unit-level test against a fake.NewWithWatch client that closes its watcher channel prematurely would be the cheapest.
ensureServiceRunning deadline expiry during rdd set / rdd limavm * / rdd kubectl / rdd service config returns exit code 4. round 2 testing gap #3 unaddressed (intentional skip). The fix at cmd/rdd/service.go:97-100 is correct; no test exercises it.
--timeout=0 (indefinite wait) succeeds when the desired state eventually arrives. round 2 testing gap #4 unaddressed (intentional skip). A regression that accidentally used context.WithTimeout(ctx, 0) (which fires immediately) would ship green.
Ctrl-C during svc stop --timeout=0 / svc delete --timeout=0. round 3 testing gap #5 unaddressed. Now that both functions accept ctx, cancellation via cmd.Context() should abort the wait; a test would lock the behavior in.

Items 1, 2, 4, 5, 6 remain intentional skips per the author's R2/R3 resolution; item 3 is new and ties to I1.

Documentation Assessment ¶

docs/design/cmd_service.md:104-109 now says "Sends SIGINT to the control plane process (rdd.pid) ... If the deadline expires, the wait sends SIGTERM on Unix (or calls TerminateProcess on Windows) as a fallback and rdd exits with code 4." Matches the code. R3 S4 resolved.
docs/design/cmd_service.md:117-122 promises "exit 4, directory left in place" on deadline expiry — the code now honors this (R3 I1 resolved).
docs/design/cmd_lima.md — defaults (--wait=true, --timeout=5m) consistent with the code.
docs/design/api_service.md:21 — "SIGINT on Unix with SIGTERM as fallback; CTRL_BREAK_EVENT on Windows with TerminateProcess as fallback" matches the implementation.
Pre-existing test asserts "successfully started" log line at bats/tests/20-service/2-create.bats:40-44, but the cold-start code path emits "... is ready" instead (code stopped emitting "successfully started" months ago). Unaffected by this PR but worth a separate pass — the test must currently be failing or skipped. Out of scope for this review.

Commit Structure ¶

Single commit, single concept. The commit body accurately enumerates the landed defaults and references the R3 resolution. Clean.

Acknowledged Limitations ¶

cmd/rdd/service.go:36-42 — the stopWaitTimeout = limaLongWaitTimeout comment now includes "A multi-VM instance may still exceed this budget while shutdownAllHostagents drains sequentially." Matches the implementation and absorbs Codex's I1 concern as a known-future-work item rather than a present regression.
pkg/service/cmd/service.go:348-358 — the Unix/Windows Interrupt → Kill fallback comment is accurate and explains the hostagent-preservation trade-off. Unaffected by this PR.
pkg/service/cmd/service.go:373-382 — the process.Kill on deadline is intentional and accepted; hostagents survive in their own process groups and self-heal via killOrphanedHostagent on next start.
R3-acknowledged StopWithWait PID-recycling race — Running() returns true, then pid := PID() finds the file stale, process.Interrupt(0) on Unix signals the calling process group. Pre-existing; not widened by this PR.

Unresolved Feedback ¶

No prior PR review comments on this PR.

Agent Performance Retro ¶

Claude ¶

Claude carried the majority of the Suggestion coverage again: six of nine Suggestions, including the four that capture the "not-quite-symmetric" polish gaps (context.WithTimeout vs watchtools.ContextWithOptionalTimeout, missing ready log in already-running branch, --timeout help text precision, cliexit.Classify docstring). Claude's I1 (bare ctx.Err() messages) and I2 (Delete post-deadline comment) were both cosmetic enough that consolidation downgraded them to S1 and S6 — Important severity was not the right pitch for a diagnostic-wrapping ask. One miss worth naming: Claude's S10 (add local_teardown_file to 6-timeout.bats) directly contradicted the repo's BATS style rule "Don't write local_teardown_file() — leave the system running after tests so failures can be inspected manually." The orchestrator missed it during the first consolidation pass too — the BATS rules lived in bats-style.md but were not inlined into the deep-review context the agents receive, so no agent had the rule in prompt. Dropped during a second pass; repo-context update below captures the fix. Claude also noted a pre-existing test-code-drift in 2-create.bats that this PR does not touch, scoped as out-of-scope correctly. No Critical false alarms and no hallucinated file contents.

Codex ¶

Codex delivered the review's single largest-impact finding in I2 (stale ConfigMap in the already-running svc start branch). That catch reconciles what R3 consolidated as a strength — "already-running branch now matches the cold-start path" — against the actual code, which matches only up to the freshness anchor. The scope-level realization that the R3 S7 consolidation was incomplete is exactly what Round 4's re-read needs to surface. Codex's I1 (5-minute default too short for multi-VM shutdown) is dropped as a mis-labeled regression — pre-PR svc stop hardcoded 60s, so 5m is strictly longer — but the underlying concern about future-parallel shutdown is real and is captured as an acknowledged limitation via the R3 S5 comment. Codex skipped the git blame/merge-base check on I1, which is the gap behind the mis-label. S1 (BATS coverage gaps) correctly carries R2/R3 testing skips forward with new scaffolding from R4 I2.

Gemini ¶

Gemini produced its single Important finding (watch-close false success) directly — no Critical mis-calibrations, no hallucinated @test headers (the repeat false positive of the last two rounds), and the fix shape is correct. Gemini noted the finding was pre-existing but did not explicitly tie it to this PR's widening of the exposure (user-documented --timeout=0 + exit-code-4 contract); consolidation added that link. Gemini's S1 (Classify double-wrap) is a valid defensive-idempotence suggestion with no active bug today. Gemini continues to skip git blame, so the regression-vs-gap labeling falls to consolidation — but with only one finding, the labeling cost was minimal this round.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	—	—	5m 09s
Findings	6S	1I 1S	1I 1S
Tool calls	—	—	20 (readfile 8, grepsearch 7, runshellcommand 5)
Design observations	4	1	2
False positives	1	1	0
Unique insights	6	2	2
Files reviewed	10	10	10
Coverage misses	1	0	0
Totals	6S	1I 1S	1I 1S
Downgraded	2 (I→S)	0	0
Dropped	1	1	0

Reconciliation. Four severity changes:

Codex I1 (5m default too short, regression-labeled): dropped as false positive. Pre-merge-base (47b55ec0) StopWithWait hardcoded a 60-second timeout with no user override; post-PR default is 5 minutes with --timeout=0 for indefinite. Strictly a widening. The underlying multi-VM concern is real but already documented in the stopWaitTimeout comment per R3 S5. false_positives += 1.
Claude I1 (bare ctx.Err() messages, important): downgraded to S1. Exit code is already correct; the finding is a diagnostic-wrapping ask, not a correctness bug. Important is reserved for behavior that misleads users about state, not for terse error strings.
Claude I2 (Delete post-deadline comment, important): downgraded to S6. Claude's own fix says "light touch — a comment, not a code change." Comment-only changes belong under Suggestions.
Claude S10 (add local_teardown_file to 6-timeout.bats): dropped as false positive. Contradicts the repo BATS style rule in bats-style.md: "Don't write local_teardown_file() — leave the system running after tests so failures can be inspected manually." The orchestrator also missed this during the first consolidation pass; the BATS rules had not been inlined into the deep-review context, so the agents never saw them. Fix captured under Repo context updates below. false_positives += 1.

A second retracted item — an orchestrator-added re-raise of Round 3 S3 (run -0 on the cleanup rdd limavm delete "start-vm") — turned out to be built on a Round 3 mistake. The repo BATS style rule states: "Only use run -0 when you need to inspect $output afterwards. If you just want to verify a command succeeds, run it directly without run." The current code (no run -0 on that cleanup line) matches the convention; Round 3 S3 was itself a false positive carried forward from Round 2's --wait scrub commentary. Removed from this round's Suggestions and noted as a Round 3 error.

Review Process Notes ¶

Skill improvements ¶

When a prior round consolidates two agents' findings into a claim about parity between two code paths ("X now matches Y"), the next round's first action on that claim should be a paired read of both paths side by side. Round 3 consolidated "already-running svc start branch now waits for ReadyAnnotation, matching the cold-start path." That consolidation was correct at the service.Wait + ReadyAnnotation level and incorrect at the waitForFresh* level. The miss persisted into R4 and was caught only because one agent independently re-read both paths. A reviewer recognises the pattern when a prior-round strength or resolution uses comparison phrasing ("matches the cold-start", "same contract as", "symmetric with"); verify the comparison at every level the unified helper operates on, not just at the top level. Parity claims silently decay as one side changes.

Repo context updates ¶

Add: When a long-lived watch loop ends with return ctx.Err() after the result channel closes, distinguish "context cancelled" from "channel closed while context still valid." A closed watch channel without an explicit Error event can fire on graceful server close, proxy timeout, or Kubernetes watch-expiry; returning ctx.Err() (which is nil in that moment) converts a dropped watch into a spurious success. A reviewer recognises the pattern in any for ev := range watcher.ResultChan() { ... } return ctx.Err() shape — flag the need to return a distinct "watch closed before condition met" error on the fall-through. Reason: cmd/rdd/limavm.go's watchUntil / watchUntilDeleted shipped this shape in 2026-02 and it was not caught until round 4 of a PR that widened the wait-timeout exposure.
Add (applied this round): Inline the repo's BATS style rules into deep-review-context.md so review agents apply them. The rules live in bats-style.md (variable scoping, run -0 only when ${output} / ${status} is read on the next line, assert_output / refute_output idioms, no local_teardown_file, blank-line spacing). Agents cannot follow links out of the review prompt, so a rule kept only in bats-style.md is invisible at review time — Round 3 S3 and Round 4 S10 both landed as false positives because of this. bats-style.md carries a sync comment requiring any change to mirror into deep-review-context.md.