Deep Review: 20260422-104141-pr-336

Date	2026-04-22 10:41
Repo	rancher-sandbox/rancher-desktop-daemon
Round	3
Author	@jandubois
PR	#336 — cli: standardize --wait and --timeout across svc and limavm
Branch	`wait-timeout-cleanup`
Commits	`0b82624` cli: standardize --wait and --timeout across svc and limavm (squashed)
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — Round 2's I1, S1, S3, S4, S6, S7, S8 are all resolved in the squashed `0b82624`. Two new Important findings surface: (I1) the new `Delete` TOCTOU recovery is too broad and silently drops the timeout error on the force-kill race, directly contradicting the documented "exit 4, directory left in place" contract; and (I2) the new `StopWithWait` / `Delete` library surface does not thread `cmd.Context()`, so `--timeout=0` is uninterruptible — a larger exposure now that `0 = indefinite` is a documented user-facing option. Six small suggestions round out the list, including a `timeoutError` helper that still has an inline twin in `cmd/rdd/set.go`.
Wall-clock time	`40 min 39 s`

Executive Summary ¶

Round 3 squashes the two earlier commits and resolves every Round 2 finding except the testing-gap suggestion svc start / svc delete timeout coverage, which the author's resolution table lists as an intentional skip. The new 6-timeout.bats asserts the svc stop --timeout=1ms exit-code-4 path; ensureServiceRunning now routes through timeoutError, closing the cold-start gap on every autostart command; serviceStartAction's already-running branch now waits for ReadyAnnotation like the cold-start path; the watchtools alias is consistent; and the architecture doc says SIGINT / CTRL_BREAK_EVENT.

Two Round-2 findings rebounded into Round 3 at Important severity. First, the Round-2 S1 recovery in Delete (re-check Running() after StopWithWait fails) is too broad: when StopWithWait hits its deadline it calls process.Kill(pid) and then returns the deadline error, but the subsequent Running() check usually sees the process already dead (the SIGTERM-via-Kill + k8s second-signal handler lands fast, and TerminateProcess is synchronous on Windows). The code then suppresses the timeout error, continues past preserveAllInstanceLogs, and removes the instance directory — exit code 0, directory gone — the exact opposite of what docs/design/cmd_service.md:120-121 promises. Second, StopWithWait builds its own context.Background() instead of accepting the caller's context, so the newly documented --timeout=0 ("wait indefinitely") is genuinely uninterruptible from the CLI's command context. Round 2 flagged this as a design concern; documenting --timeout=0 for svc stop and svc delete in Round 3 is what elevates it to Important.

Suggestions: the new timeoutError helper duplicates the inline cliexit.Timeout pattern already in cmd/rdd/set.go:325-328; --timeout=N is silently ignored when --wait=false (flag help says "Timeout for --wait"); the cleanup run -0 rdd limavm delete "start-vm" lost its run -0 wrapper in the --wait scrub; stopWaitTimeout's comment ties the budget to one VM's ceiling without reconciling the sequential-shutdown multi-VM case; one --timeout help string uses "embedded shutdown wait" while the rest say "Timeout for --wait"; and the new SIGINT on Unix … with forced termination as fallback wording in both docs and a test comment overstates Unix behavior — process.Kill on Unix is SIGTERM, and forced termination there is the k8s second-signal os.Exit(1) path rather than SIGKILL.

No Critical issues. Gemini's three Criticals were either a repeat false positive (BATS @test header "corrupted by find-and-replace" — same hallucination as Round 2) or rightly belong one tier lower (the Delete race, consolidated with Claude's I1 at Important) or describe a pre-existing design trade-off (process.Kill(pid) mid-shutdown leaves hostagents the orchestrator hadn't reached yet — the comment at pkg/service/cmd/service.go:377-383 explicitly calls this out, and pre-PR code already used the same process.Kill).

Critical Issues ¶

None.

Important Issues ¶

I1. Delete swallows timeout errors once the force-kill wins the race Claude Gemini¶

pkg/service/cmd/service.go

// the directory on Windows and breaks PID-file mutual exclusion on Unix.
func Delete(timeout time.Duration) error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	if Running() {
		if err := StopWithWait(true, timeout); err != nil {
			// The process may exit between the outer Running check and
			// StopWithWait's inner one; treat that as success.
			if Running() {
				return err
			}
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())

The comment frames the swallow as the outer/inner Running() race that Round 2 S1 reported: the serve subprocess exits between Delete's outer Running() check and StopWithWait's inner one, StopWithWait returns a "%q control plane is not running" error, and Delete treats that as success. The code, however, swallows every StopWithWait error whenever Running() is false at the second check — including the deadline error.

Trace: when the deadline expires inside StopWithWait (pkg/service/cmd/service.go:376-385), the function calls _ = process.Kill(pid) and returns fmt.Errorf("timed out … to stop: %w", …, ctx.Err()). On Windows process.Kill is TerminateProcess, which is synchronous; on Unix it is SIGTERM and the k8s SetupSignalContext second-signal handler calls os.Exit(1) within microseconds. By the time Delete re-reads Running(), the process is almost always gone.

pkg/service/cmd/service.go

		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()

		for {
			select {
			case <-ctx.Done():
				// Graceful shutdown timed out; terminate so we don't leave
				// a hung service process. Kill targets only the service; on
				// Windows (TerminateProcess) this avoids killing hostagents
				// that are children of the service but run in their own
				// process groups. On Unix, SIGTERM suffices because the
				// service is responsive to signals (it's just slow shutting
				// down hostagents sequentially).
				_ = process.Kill(pid)
				return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}

Delete then drops the wrapped context.DeadlineExceeded, runs preserveAllInstanceLogs, and calls os.RemoveAll(instance.Dir()). Exit code 0, directory gone, hostagents that the orchestrator had not yet reached now orphaned.

If the deadline expires, `rdd` exits with code 4 and the directory is left in place.

docs/design/cmd_service.md:120-121 promises the opposite: on deadline expiry the CLI is supposed to exit 4 and leave the directory intact. The documented safety story — the whole reason --wait=false was removed from svc delete in Round 1 — is "do not remove instance.Dir() while a writer is alive." Force-killing the writer with Kill then immediately deleting the directory does not restore that safety; the hostagents still in flight have not been drained, and on Windows the kill races any still-open file handles the service held. The behavior is also non-deterministic: identical invocations can flip between exit 0 / dir-removed and exit 4 / dir-intact based on signal-handler latency.

docs/design/cmd_service.md

directory.

Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place. See [`rdd service stop`](#rdd-service-stop)
for the signal and fallback details of the shutdown itself.

### `rdd service reset`

Deletes the datastore, but create a new (empty) one with the same settings.

Fix: narrow the swallow so it only suppresses the "not running" error that R2 S1 asked for, and always propagate deadline expiry:

 if Running() {
     if err := StopWithWait(true, timeout); err != nil {
-        // The process may exit between the outer Running check and
-        // StopWithWait's inner one; treat that as success.
-        if Running() {
+        // Suppress only the outer/inner Running() race. A deadline
+        // expiry must surface so the CLI exits 4 and leaves the
+        // directory in place, per docs/design/cmd_service.md.
+        if errors.Is(err, context.DeadlineExceeded) || Running() {
             return err
         }
     }
 }

Or, more robustly, have StopWithWait return a typed sentinel for the "not running" case so the recovery matches on identity rather than on the negation of Running().

(important, regression — the squash's R2 S1 fix introduced the over-broad recovery. Consolidated from Claude I1 and Gemini C2; Gemini's Critical classification is consolidated to Important because the scenario requires a deadline to actually fire on svc delete, and the 5m default makes that rare in practice.)

I2. StopWithWait / Delete ignore the caller's context, so --timeout=0 is uninterruptible Claude Codex Gemini¶

pkg/service/cmd/service.go

}

// StopWithWait sends a shutdown signal to the service. When wait is true, it
// blocks until the process exits or timeout elapses; pass timeout 0 to wait
// indefinitely.
func StopWithWait(wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Clean up discovery configmap while cluster is still accessible

cmd/rdd/service.go:336 calls service.StopWithWait(wait, timeout); cmd/rdd/service.go:377 calls service.Delete(timeout). Both entry points discard cmd.Context() and build a new context.Background() inside the wait loop. With timeout == 0 (documented in cmd/rdd/service.go:354 and :364 as "wait indefinitely"), the wait loop's select only fires on the 100 ms ticker and the never-cancelled ctx.Done() — there is no parent channel a caller could close to abort. A user running rdd svc stop --timeout=0 against a wedged service has no in-band cancellation path; the only exit is Go's default SIGINT handler terminating the CLI process itself, which leaves the half-killed service behind.

cmd/rdd/service.go

	timeout, err := cmd.Flags().GetDuration("timeout")
	if err != nil {
		return err
	}

	if err := service.StopWithWait(wait, timeout); err != nil {
		return timeoutError(err)
	}
	if wait {
		logrus.Infof("%q control plane has stopped", instance.Name())
	} else {
		logrus.Infof("%q control plane is stopping", instance.Name())
	}
	return nil
}

func newServiceStopCommand() *cobra.Command {
	command := &cobra.Command{
		Use:  "stop",
		Long: "Stop RDD control plane",
		RunE: serviceStopAction,
	}
	command.Flags().Bool("wait", true, "Wait for control plane to actually stop")
	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
	return command
}

func newServiceDeleteCommand() *cobra.Command {
	command := &cobra.Command{

cmd/rdd/service.go

	}
	timeout, err := cmd.Flags().GetDuration("timeout")
	if err != nil {
		return err
	}
	if err := service.Delete(timeout); err != nil {
		return timeoutError(err)
	}
	logrus.Infof("%q control plane has been deleted", instance.Name())
	return nil
}

Every other wait site this PR touches threads cmd.Context() into the wait — startAndWaitForReady, limaVMSetRunningAction, limaVMRestartAction, limaVMCreateAction, limaVMDeleteAction, the already-running serviceStartAction branch. StopWithWait is the only outlier, and the fact that it is the one place where --timeout=0 is a user-visible knob is what makes the gap visible.

Round 2 flagged this as a design concern (Claude's) and also as an Important finding (Gemini's, which we consolidated to a design observation because --timeout=0 was not then documented). Round 3 documents --timeout=0 on svc stop, svc delete, and every limavm wait path, which enlarges the exposure from "theoretical future signal handler" to "a user-facing option that cannot be cancelled from user-facing code."

Fix: thread the context through and use the same watchtools.ContextWithOptionalTimeout helper every other wait site uses:

-func StopWithWait(wait bool, timeout time.Duration) error {
+func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
     ...
     if wait {
-        ctx := context.Background()
-        if timeout > 0 {
-            var cancel context.CancelFunc
-            ctx, cancel = context.WithTimeout(ctx, timeout)
-            defer cancel()
-        }
+        ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
+        defer cancel()

-func Delete(timeout time.Duration) error {
+func Delete(ctx context.Context, timeout time.Duration) error {
     ...
     if Running() {
-        if err := StopWithWait(true, timeout); err != nil {
+        if err := StopWithWait(ctx, true, timeout); err != nil {

Update the two callers in cmd/rdd/service.go:336, 377 to pass cmd.Context().

(important, regression — the new --timeout=0 knob is what widens the pre-existing design concern. Consolidated from Claude I2, Codex I1, Gemini I1.)

Suggestions ¶

S1. timeoutError duplicates the inline cliexit.Timeout pattern already in set.go Claude¶

cmd/rdd/service.go

const stopWaitTimeout = limaLongWaitTimeout

// timeoutError wraps a deadline-expiry error with [cliexit.Timeout] so the CLI
// exits with [cliexit.CodeTimeout], matching `rdd set`. Other errors pass
// through unchanged.
func timeoutError(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return cliexit.Timeout(err)
	}
	return err
}

// logrusLevelToKlog converts current logrus level to klog level.
func logrusLevelToKlog() string {
	switch logrus.GetLevel() {
	case logrus.DebugLevel:

The new helper exists to factor exactly the set.go pattern, but set.go itself was not rewired to use it. The helper also lives in cmd/rdd/service.go, so set.go would have to import it from a file it does not otherwise depend on.

Fix: move the helper to pkg/cli/exit/exit.go (where every CLI entry point already imports it) and call it from all three sites. Naming it exit.Classify(err) or similar frees the service.go filename from owning a package-cross-cutting utility.

(suggestion, gap — the PR's own refactoring opportunity, missed.)

S2. --timeout=N is silently ignored when --wait=false Claude¶

cmd/rdd/service.go

		Use:  "stop",
		Long: "Stop RDD control plane",
		RunE: serviceStopAction,
	}
	command.Flags().Bool("wait", true, "Wait for control plane to actually stop")
	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
	return command
}

func newServiceDeleteCommand() *cobra.Command {
	command := &cobra.Command{

rdd svc stop --wait=false --timeout=10s, and the equivalent on every limavm wait-capable subcommand, accepts both flags and then ignores --timeout. The help text "Timeout for --wait" hints at the dependency but does not tell a user their value is dead. Users routinely combine both flags when scripting (a common pattern: "don't block me, but still error out after N seconds").

Fix: either reject the combination at parse time, or make the dependency explicit in the flag help:

-command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
+command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")

(suggestion, enhancement)

S3. Cleanup rdd limavm delete lost its run -0 wrapper Claude¶

-    run -0 rdd limavm delete --wait "start-vm"
+    rdd limavm delete "start-vm"

The Round 2 fix to drop the now-redundant --wait flag (R2 S4) also dropped the run -0 wrapper. BATS still fails the test via set -e if the command exits non-zero, but the scoped exit-code assertion and captured $output / $status are gone, and the file no longer tells a reader that the exit code matters. Repo BATS style (BATS style rules) prefers run -0.

Fix:

-    rdd limavm delete "start-vm"
+    run -0 rdd limavm delete "start-vm"

The earlier cleanup at :78 kept run -0, so the asymmetry is just this one site.

(suggestion, regression — Round 2 S4 fix collateral.)

S4. SIGINT on Unix … with forced termination as fallback overstates what Unix actually does Codex¶

docs/design/api_service.md


* It starts any additional controllers configured via the `config` ConfigMap.

## Service shutdown

The control plane is notified to shut down, either by `rdd service stop` (SIGINT on Unix, `CTRL_BREAK_EVENT` on Windows, with forced termination as fallback), or by the OS preparing to shut down the host.

Some controllers need to be notified of pending shutdown so they can gracefully terminate, e.g. shut down virtual machines.

Any controller can request shutdown notification by creating a ConfigMap in the `rdd-system` namespace:

Readers interpret "forced termination" as SIGKILL on Unix. pkg/util/process/process_unix.go:30-32 defines:

pkg/util/process/process_unix.go

// Interrupt sends SIGINT to the process with the given PID.
func Interrupt(pid int) error {
	return unix.Kill(pid, unix.SIGINT)
}

// Kill sends SIGTERM to the process with the given PID.
func Kill(pid int) error {
	return unix.Kill(pid, unix.SIGTERM)
}

// KillTree terminates the process and all its descendants.
// The target must have been started with SetGroup so it leads its own group.
// On Unix, this sends SIGKILL to the process group. On Windows, this uses

// Kill sends SIGTERM to the process with the given PID.
func Kill(pid int) error {
    return unix.Kill(pid, unix.SIGTERM)
}

The fallback process.Kill(pid) sends another SIGTERM, which the service catches. Forced termination on Unix happens only because the k8s SetupSignalContext handler routes the second SIGTERM through os.Exit(1) — not because Kill itself is forceful. Users who rely on the "forced" wording to reason about hostagent cleanup or unkillable writers will be surprised when rdd svc stop --timeout=1ms returns exit 4 and svc status still shows the pid alive for a moment afterwards.

Fix: either match the Unix implementation ("SIGINT on Unix with SIGTERM as fallback; on Windows, CTRL_BREAK_EVENT with TerminateProcess as fallback") or strengthen the Unix path to genuinely force-kill (process.KillTree or unix.SIGKILL) — but not without reconciling with the pre-existing "leave hostagents in their own process groups" decision.

(suggestion, regression — the new doc wording is what makes the Unix contract ambiguous.)

S5. stopWaitTimeout's rationale covers one VM, not the multi-VM worst case Claude¶

cmd/rdd/service.go

// startWaitTimeout bounds the wait for the API server, CRDs, and
// controller-manager registration to land. Cold start can take well over 30s
// in CI on slower runners.
const startWaitTimeout = 90 * time.Second

// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration
// lands, the drain will sequence each VM's shutdown, so the CLI deadline must
// match the per-VM ceiling in [limaLongWaitTimeout].
const stopWaitTimeout = limaLongWaitTimeout

// timeoutError wraps a deadline-expiry error with [cliexit.Timeout] so the CLI
// exits with [cliexit.CodeTimeout], matching `rdd set`. Other errors pass
// through unchanged.
func timeoutError(err error) error {

The comment states the budget "must match the per-VM ceiling" — a reader leaves thinking 5m is sized for the worst case. In a multi-VM deployment, shutdownAllHostagents drains hostagents sequentially (per MEMORY.md: "shutdownAllHostagents waits sequentially — TODO"), so N VMs at the per-VM ceiling can exceed 5m, and the Run drain hard-caps at 45s (pkg/service/cmd/service.go:832) regardless of the CLI's patience.

pkg/service/cmd/service.go

		mgrWg.Wait()
		close(mgrDone)
	}()
	select {
	case <-mgrDone:
	case <-time.After(45 * time.Second):
		klog.Warning("Controller manager did not shut down within 45s, exiting anyway")
	}

	return nil
}

Fix: reconcile the comment with the sequential-drain reality, or raise the budget when the parallel-shutdown TODO is resolved. Either:

-// match the per-VM ceiling in [limaLongWaitTimeout].
+// match the per-VM ceiling in [limaLongWaitTimeout]; a multi-VM
+// instance may still exceed this budget while shutdownAllHostagents
+// drains sequentially.

(suggestion, gap — comment accuracy; does not block the merge.)

S6. svc delete --timeout help text diverges from the other --timeout flags Claude¶

cmd/rdd/service.go

	command := &cobra.Command{
		Use:  "delete",
		Long: "Delete RDD control plane",
		RunE: serviceDeleteAction,
	}
	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the embedded shutdown wait (0 means wait indefinitely)")
	return command
}

func serviceDeleteAction(cmd *cobra.Command, _ []string) error {
	if !service.Exists() {

Every other --timeout flag in the PR uses "Timeout for --wait (0 means wait indefinitely)". svc delete has no --wait flag, so the wording diverges deliberately, but "embedded shutdown wait" reads oddly next to the docs, which describe the delete path as "always waits for the control plane to exit before removing files." Minor polish:

-command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the embedded shutdown wait (0 means wait indefinitely)")
+command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")

(suggestion, enhancement.)

Design Observations ¶

Concerns ¶

process.Kill(pid) mid-shutdown orphans any hostagent the orchestrator had not yet reached (future) Gemini. The comment at pkg/service/cmd/service.go:377-383 acknowledges the trade-off and MEMORY.md tracks the sequential-shutdown TODO. Gemini raised this at Critical; it is pre-existing (pre-PR code used the same process.Kill) and the author's comment explicitly chose this path over KillTree to avoid killing hostagents that survive in their own process groups and self-heal via killOrphanedHostagent on next start. Flagged here because I1's narrower errors.Is(…DeadlineExceeded) fix still leaves the underlying "hostagents not drained on timeout" exposure; a future ordered-shutdown design (see [shutdown-ordering-design.md](shutdown-ordering-design.md)) is the right place to address it.
Two service.Wait callers still construct a plain context.WithTimeout instead of watchtools.ContextWithOptionalTimeout (in-scope) Claude. cmd/rdd/service.go:105-106 still uses context.WithTimeout(ctx, startWaitTimeout), and pkg/service/cmd/service.go:202-203 uses context.WithTimeout(ctx, 15*time.Second). The first is fine because startWaitTimeout is a non-zero constant; the second is internal to getRuntimeControllersFromCluster and is bounded by design. Mentioned only because the "zero means infinite" pattern elsewhere makes the two stragglers look inconsistent at a glance; no fix needed.

Strengths ¶

limaLongWaitTimeout is the single source of truth for the long wait Claude. cmd/rdd/limavm.go:41 defines it once; every limavm subcommand and stopWaitTimeout in cmd/rdd/service.go:42 reference it by name. Tuning the per-VM ceiling flows automatically through the CLI. Codex also flagged the boundary shape as a strength: timeoutError lives in cmd/rdd/service.go so the service package stays reusable while CLI-facing commands get consistent exit-code behavior.
watchtools.ContextWithOptionalTimeout handles the "0 means infinite" contract at every wait site Claude. Five-plus call sites in limavm.go plus the two in serviceStartAction. Hand-rolling if timeout == 0 would have been easy to miss at one site; the helper keeps the contract in one place.
serviceStartAction's already-running branch now waits for ReadyAnnotation Codex Claude. Round 2 S3 fix; the cold-start and warm-start paths now share the same readiness contract, so the documented --wait=false → rdd svc start workflow delivers on its promise.
timeoutError at the CLI boundary plus errors.Is(err, context.DeadlineExceeded) Claude. The %w wrap chain propagates through fmt.Errorf("failed waiting …: %w", err) wrappers so each call site can add context freely without losing the classification.
Removing --wait=false from svc delete is the right shape Claude. There is no safe "non-blocking delete" while the writer still holds the directory; the doc now explains the reason.

Testing Assessment ¶

Coverage of the exit-code-4 contract is broad: each wait-capable limavm subcommand has a run -4 regression test, and 6-timeout.bats asserts svc stop --timeout=1ms. Unaddressed gaps, carried forward with round 2 context:

rdd svc start --wait --timeout=… deadline expiry returns exit code 4. R2 S2 unaddressed (intentional skip per R2 resolution table). Coverage gap remains; a --timeout=1ms against a slow-starting control plane would prove the contract.
rdd svc delete --timeout=… deadline expiry returns exit code 4 and leaves the directory in place. R2 S2 unaddressed (intentional skip). Also directly ties to I1 — a test here would have caught the over-broad swallow.
ensureServiceRunning deadline expiry during rdd set / rdd limavm * / rdd kubectl / rdd service config returns exit code 4. R2 testing gap #5 unaddressed (intentional skip). The fix at cmd/rdd/service.go:107-112 is correct; no test exercises it.
--timeout=0 (indefinite wait) succeeds when the desired state eventually arrives. R2 testing gap #4 unaddressed (intentional skip). A regression that accidentally used context.WithTimeout(ctx, 0) (which fires immediately) would ship green.
Ctrl-C (cmd.Context() cancellation) during svc stop --timeout=0 / svc delete --timeout=0. New. Ties to I2; a cancelled caller context should abort the wait.

The author's R2 resolution table marks items 1-4 as "Skipped"; these remain gaps, not re-raises. Item 5 is new and follows I2.

Documentation Assessment ¶

docs/design/cmd_service.md:117-122 promises "exit 4, directory left in place" on deadline expiry — see I1; the code does not currently deliver this in the common case.
docs/design/cmd_service.md:93-99 correctly replaces --no-wait with --wait=false and adds --timeout. Good.
docs/design/cmd_service.md:104-109, docs/design/api_service.md:21, and the 6-timeout.bats header comment all say "forced termination as fallback." On Unix the fallback is a second SIGTERM; forced termination there is the k8s second-signal os.Exit(1) rather than SIGKILL. See S4.
docs/design/cmd_lima.md — defaults consistent with the code.
Comment at cmd/rdd/service.go:120-122 ("the API server readiness poll and the ConfigMap freshness poll share the single deadline") is correct when timeout > 0 and vacuously true when timeout == 0. Fine.

Commit Structure ¶

Single commit, single concept — the squash is clean and the commit body accurately enumerates the shipped defaults. Round 2 S5 (commit body lists pre-fixup defaults) is resolved.

Acknowledged Limitations ¶

Comment at pkg/service/cmd/service.go:377-383 documents the deliberate trade-off of process.Kill leaving hostagents the orchestrator had not yet drained. Accurate; Gemini's Critical C3 is consolidated with this.
Comment at pkg/service/cmd/service.go:432-433 describes the outer/inner Running() race that R2 S1 fixed; the comment understates the swallow's actual scope — promoted to I1 rather than left here.
cmd/rdd/service.go:32-42 justifies startWaitTimeout (90s) and ties stopWaitTimeout to the per-VM ceiling. The latter's multi-VM caveat is S5.
Comment at pkg/service/cmd/service.go:444-454 documents the Windows os.Rename / FILE_SHARE_DELETE limitation in preserveAllInstanceLogs. Accurate; unaffected by this PR.
The R2-acknowledged PID-recycling race in StopWithWait (Running() returns true, then pid := PID() finds the file stale, process.Interrupt(0) on Unix signals the calling process group) is still present. Pre-existing; not widened by this PR.

Unresolved Feedback ¶

No prior PR review comments on this PR.

Agent Performance Retro ¶

Claude ¶

Claude carried the round again: both consolidated Important findings are Claude's, and it produced five of the six suggestions. The I1 framing ("R2 S1 fix is too broad; it drops every StopWithWait error, not just the 'not running' race") is the clearest of the three agents' takes on the same code — Codex didn't raise it and Gemini raised it at Critical without tracing why the swallow fires in the force-kill path. Unique suggestions: timeoutError duplicating the inline cliexit.Timeout in set.go (S1), --timeout=N silently ignored when --wait=false (S2), the collateral loss of run -0 on the cleanup delete (S3), the stopWaitTimeout comment overstating multi-VM safety (S5), and the svc delete --timeout help-text wording divergence (S6). The one design observation worth noting: Claude correctly demoted its own R2 "StopWithWait ignores cmd.Context()" concern to Important this round because --timeout=0 is now a documented user-facing option — calibrated severity movement.

Codex ¶

Codex delivered one Important (the StopWithWait/Delete context-ignoring surface, consolidated with Claude's and Gemini's matching finds into I2) and one Suggestion — but the Suggestion came with a live-verified demonstration. Codex actually built the CLI, ran ./bin/rdd svc stop --wait --timeout=1ms on its macOS worktree, and observed that svc status still reported the daemon PID as running immediately afterward, which is exactly what S4's doc-accuracy claim rests on. That is the kind of evidence no amount of code-reading replaces. No false positives; one tight Important that both other agents also caught; one unique doc-accuracy find with real artifacts.

Gemini ¶

Gemini produced three Criticals, two of which do not hold up. C1 claimed the BATS @test decorator has been corrupted to @pkg/util/tail/tail_test.go in both 6-timeout.bats and limavm-cli.bats — identical hallucination to Round 2, where it was also consolidated as a false positive. The worktree grep '^@test\|^@pkg' shows every @test line intact. C3 (Kill orphans hostagents) is pre-existing (pre-PR code used the same process.Kill) and directly contradicted by the author's intent comment at pkg/service/cmd/service.go:377-383 — Gemini read the code but not the comment that explains why the call is Kill and not KillTree. C2 is real but Gemini rated it Critical; we consolidate at Important alongside Claude's matching finding, because a svc delete deadline with the 5m default is a rare hit and the consequence (data removed + exit 0) is bad but not unrecoverable. I1 is a genuine concurring catch on the caller-context gap. Gemini continues to skip git blame, so pre-existing-vs-regression calibration falls to Claude and Codex; the C3 mislabel as "regression" is one concrete downstream cost of that skip.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	35m 08s	6m 33s	6m 37s
Findings	2I 6S	1I 1S	1I
Tool calls	70 (Bash 34, Read 22, Grep 14)	39 (shell 35, stdin 4)	11 (grepsearch 8, runshellcommand 2, readfile 1)
Design observations	5	3	1
False positives	0	0	1
Unique insights	5	1	0
Files reviewed	8	8	8
Coverage misses	0	0	0
Totals	2I 6S	1I 1S	1I
Downgraded	0	0	1 (I→S)
Dropped	0	0	2

Reconciliation. Three severity changes:

Gemini C1 (BATS @test header corrupted): dropped as false positive. The file in $WORKTREE_lead contains valid @test declarations at 6-timeout.bats:14 and limavm-cli.bats:196, 208, 219, 232; Gemini hallucinated the corruption — same false positive as Round 2. false_positives += 1.
Gemini C2 (Delete race data loss): downgraded to I1. The regression is real and important, but svc delete with the 5m default makes the race rare in practice; Critical is reserved for issues with broader impact or more certain triggers. Consolidated with Claude's matching Important.
Gemini C3 (process.Kill orphans hostagents): demoted to design observation (pre-existing). Pre-PR code already used process.Kill(pid) in the timeout-kill path; the author's comment at pkg/service/cmd/service.go:377-383 names the trade-off explicitly. Not introduced by this PR, and Gemini's suggested fix (KillTree) contradicts the author's intent of letting hostagents survive in their own process groups for killOrphanedHostagent to reclaim on next start.

Overall: Codex's live-verified S4 and Claude's I1 framing are the two findings that materially improve the review this round. Gemini's single valid find (I1 caller-context, already caught by the others) does not offset two mislabeled Criticals and a repeat hallucination; with the PR's small scope, Codex + Claude alone would have produced the same consolidated set.

Review Process Notes ¶

Skill improvements ¶

Before accepting a severity change that is "narrower than the originating agent claimed," verify the narrowing against the code path, not the abstract concept. When multiple agents raise the same structural concern across rounds (e.g., "this wait ignores the caller's context"), check whether the PR under review changes what that concern means for users. A design concern about an internal library call becomes an Important finding the moment the PR documents a user-facing knob that would exercise the concern. A reviewer recognises the pattern when a prior-round design observation covers the same code path as a current-round finding: compare the round's user-facing contract, not just the code. Going the other way, an under-labeled "regression" for a pre-existing line is a common miscalibration that costs a consolidation round — drop the regression label on any finding whose cited lines existed before the PR's merge-base.

Repo context updates ¶

Add: When the CLI adds a new --timeout=0 ("wait indefinitely") user-facing option, verify that the underlying wait loop accepts a caller's context.Context and derives its timed child from it. A context.Background() wait loop plus a documented indefinite mode combine to produce a genuinely uninterruptible CLI. Reason: StopWithWait in this review shipped both in the same PR without the connection being noticed during Round 1 or Round 2; the exposure surfaced only in Round 3 after the user-facing doc explicitly promoted --timeout=0.