Deep Review: 20260422-115550-pr-336

Date2026-04-22 11:55
Reporancher-sandbox/rancher-desktop-daemon
Round4
Author@jandubois
PR#336 — cli: standardize --wait and --timeout across svc and limavm
Branchwait-timeout-cleanup
Commits4bf0ec5 cli: standardize --wait and --timeout across svc and limavm
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
VerdictMerge with fixes — Round 3's I1 (Delete timeout swallow) and I2 (StopWithWait/Delete caller context) are both cleanly resolved — errors.Is(err, context.DeadlineExceeded) \|\| Running() narrows the recovery, and both functions now accept ctx. Five of six Round 3 Suggestions landed (S1 → cliexit.Classify, S2/S6 → help-text updates, S4 → accurate SIGINT/SIGTERM wording, S5 → multi-VM caveat in comment); R3 S3 (run -0 on the cleanup rdd limavm delete "start-vm") turned out to be a false recommendation — the repo BATS style rules say to drop run on a command whose output is not inspected, so the current code is correct. Two new Important findings: (I1) watchUntil / watchUntilDeleted in cmd/rdd/limavm.go return nil when the watch channel closes before the context fires, so a dropped API-server watch silently reports success on any limavm start/stop/restart/create/delete --wait; (I2) the already-running svc start branch calls waitForDiscoveryConfigMap without the freshness anchor the cold-start path uses, so a stale ReadyAnnotation=true from a crashed prior instance can satisfy the wait. Nine small suggestions round out the list.
Wall-clock time37 min 38 s


Executive Summary

Round 4 addresses every Round 3 Important finding and five of six Round 3 Suggestions. Delete's over-broad recovery is now gated on errors.Is(err, context.DeadlineExceeded) || Running(), preserving the "exit 4, directory in place" contract on deadline expiry (R3 I1 resolved). StopWithWait(ctx, wait, timeout) and Delete(ctx, timeout) both accept the caller's context and derive their timed child via watchtools.ContextWithOptionalTimeout, closing the R3 I2 "--timeout=0 is uninterruptible" exposure. The timeoutError helper is moved to pkg/cli/exit/exit.go as Classify and used from set.go, service.go, and limavm.go (R3 S1). --timeout help text on every --wait-capable command now says "ignored if --wait=false" (R3 S2), the api_service.md wording is now technically accurate ("SIGTERM as fallback" on Unix, R3 S4), and the stopWaitTimeout comment acknowledges the multi-VM ceiling (R3 S5).

Two new Important findings surface. First, cmd/rdd/limavm.go:485-497 and :527-535 (watchUntil / watchUntilDeleted) both return ctx.Err() when the watch channel closes — and ctx.Err() is nil whenever the context is still valid at that moment. If the API server closes the watch before the desired state arrives (proxy timeout, server restart, graceful close), the CLI returns exit 0 and logs "LimaVM started/stopped/…" without the status ever matching. This code is pre-existing (776b8f77, 2026-02-26) and not modified by this PR, but the PR widens its exposure: every limavm wait path now has a user-documented --timeout=0 mode, and the new exit-code-4 contract depends on timeouts actually firing. Second, cmd/rdd/service.go:271 still calls waitForDiscoveryConfigMap(ctx) in the already-running svc start branch, which skips the freshness anchor (beforeStart) that the cold-start path uses at :122-128. A stale ReadyAnnotation=true ConfigMap from a crashed prior serve can satisfy service.Wait() + waitForDiscoveryConfigMap() before the new serve has reached InitDiscovery() — the user sees exit 0 while controllers are still initializing. The R3 strength ("already-running branch now matches the cold-start path") is narrower than consolidated then.

Suggestions span diagnostics and polish: service.Wait and waitForFreshDiscoveryConfigMap surface bare ctx.Err(), so users see only "context deadline exceeded" with no hint of which wait timed out; ensureServiceRunning still uses context.WithTimeout instead of the watchtools.ContextWithOptionalTimeout the rest of the PR standardizes on; BATS coverage for svc start --timeout / svc delete --timeout exit-code-4 remains an intentional R2/R3 skip but a regression there would ship green; cliexit.Classify can double-wrap already-classified errors; the already-running svc start branch has no "is ready" log to match the cold-start; and the svc delete --timeout help text reads "stop-and-wait" (jargon) where the siblings use plain "Timeout for --wait". Three more minor polish items round out the list.

No Critical issues. Codex's Important I1 ("new 5-minute svc stop / svc delete default is shorter than healthy multi-VM shutdown, regression") is dropped as a mis-labeled regression: pre-merge-base StopWithWait hardcoded a 60-second timeout in pkg/service/cmd/service.go:373, and svc stop exposed no --timeout flag. The PR raises that default to 5 minutes and exposes --timeout=0 for indefinite waits — strictly a widening, not a shortening. The underlying "multi-VM sequential drain may exceed 5m once shutdownAllHostagents is parallelized" concern is real but already documented as an acknowledged limitation in the stopWaitTimeout comment (R3 S5 resolution).


Critical Issues

None.


Important Issues

I1. watchUntil / watchUntilDeleted return nil when the watch channel closes, silently succeeding on a dropped watch Gemini
	if err != nil {
		return err
	}
	defer watcher.Stop()

	for ev := range watcher.ResultChan() {
		switch ev.Type {
		case k8swatch.Error:
			return apierrors.FromObject(ev.Object)
		case k8swatch.Deleted:
			return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
		}
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}
	}
	return ctx.Err()
}

// watchUntilDeleted watches a LimaVM until it is deleted.
func watchUntilDeleted(ctx context.Context, c client.WithWatch, limaVM *limav1alpha1.LimaVM) error {
	key := client.ObjectKeyFromObject(limaVM)

When watcher.ResultChan() closes without an Error event — a graceful connection close, a proxy/load-balancer timeout, a server restart, or the Kubernetes watch timeout that fires every 5–10 minutes by default — the for loop exits, and ctx.Err() is consulted. Inside watchUntil the context passed in is the caller's waitCtx; if the timeout has not yet expired, ctx.Err() returns nil. Every limavm action then interprets that as success:

if err := watchUntil(waitCtx, c, key, func(vm *limav1alpha1.LimaVM) bool {
    return vm.Status.RestartCount > beforeCount
}); err != nil {
    return cliexit.Classify(fmt.Errorf("failed waiting for restart to complete: %w", err))
}
logrus.Infof("LimaVM %q restarted in namespace %q", name, limaVM.Namespace)

cmd/rdd/limavm.go:233-238, 413-417, 453-457 and the matching watchUntilDeleted call at :556-558 all propagate the nil as success. The user sees LimaVM "foo" started / deleted / restarted with exit 0, but the status field was never observed to change.

This is pre-existing (776b8f77, 2026-02-26) but the PR materially widens the exposure:

Fix: distinguish "channel closed before condition met" from "context cancelled first." A one-line guard at the return ctx.Err() site suffices:

 for ev := range watcher.ResultChan() {
     ...
 }
-return ctx.Err()
+if err := ctx.Err(); err != nil {
+    return err
+}
+return fmt.Errorf("watch for LimaVM %q closed before condition was observed", key.Name)

Apply the symmetric change at watchUntilDeleted:535 with "... closed before deletion was observed".

(important, gap — pre-existing behavior widened by this PR's user-documented --timeout=0 and the exit-code-4 contract that depends on timeouts being the authoritative "gave up" signal.)

I2. Already-running svc start branch skips the freshness anchor, so a stale ReadyAnnotation can satisfy the wait Codex
		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !wait {
			return nil
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {

waitForDiscoveryConfigMap calls waitForFreshDiscoveryConfigMap(ctx, time.Time{}), and the zero beforeStart makes the cm.CreationTimestamp.Time.Before(beforeStart) check vacuous:

func waitForDiscoveryConfigMap(ctx context.Context) error {
    return waitForFreshDiscoveryConfigMap(ctx, time.Time{})
}

The cold-start path avoids stale state by anchoring beforeStart := time.Now().Truncate(time.Second) before service.Start, then passing that to waitForFreshDiscoveryConfigMap (cmd/rdd/service.go:117-128). The already-running branch has no such anchor.

// the apiserver may become ready before the serve command creates the ConfigMap.
func startAndWaitForReady(ctx context.Context, serveArgs []string, timeout time.Duration) error {
	// Truncate to second precision because metav1.Time drops sub-seconds
	// during JSON serialization. Without this, a server that starts in the
	// same second would appear to have a startTime *before* beforeStart.
	beforeStart := time.Now().Truncate(time.Second)
	if err := service.Start(ctx, serveArgs); err != nil {
		return err
	}

	ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
	defer cancel()

	if err := service.Wait(ctx); err != nil {
		return err
	}
	return waitForFreshDiscoveryConfigMap(ctx, beforeStart)
}

// waitForDiscoveryConfigMap polls until the discovery ConfigMap exists.
func waitForDiscoveryConfigMap(ctx context.Context) error {
	return waitForFreshDiscoveryConfigMap(ctx, time.Time{})

The race: serve writes rdd.pid very early in its startup sequence, so service.Running() returns true from that moment on. If a prior serve process crashed with its discovery ConfigMap still carrying ReadyAnnotation=true in the sqlite3 datastore, and a new serve process has just started but not yet reached InitDiscovery(), a concurrent rdd svc start enters the already-running branch, service.Wait returns once /readyz + CRDs respond (which predates discovery), and waitForDiscoveryConfigMap finds the stale annotation immediately. The CLI returns exit 0 and the user proceeds to run commands that depend on controllers that haven't registered yet.

The cold-start path's beforeStart blocks exactly this: the ConfigMap must have been created at or after the current startup to satisfy the wait. The already-running branch needs the same contract — but there's no natural beforeStart value, because the branch doesn't own the startup moment. Workable options:

Fix (pid-file mtime approach):

 if service.Running() {
     ...
-    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+    beforeStart, err := pidFileStartTime()
+    if err != nil {
+        return cliexit.Classify(err)
+    }
+    return cliexit.Classify(waitForFreshDiscoveryConfigMap(ctx, beforeStart))
 }

Round 3 consolidated "already-running branch now waits for ReadyAnnotation, matching the cold-start path" as a strength. It waits for the annotation, but not for a fresh one — the match is narrower than it looked. Round 4 carries this through, so it lands here.

(important, regression — the already-running wait is new in this PR; the freshness-bypass ships with it.)


Suggestions

S1. service.Wait and waitForFreshDiscoveryConfigMap surface bare ctx.Err(), so the user-visible timeout message has no scoping context Claude

	return readiness.WaitForReadyWithCRDs(ctx, config, enabledControllers, false)
}

// Wait until the control plane is ready or the context is cancelled.
func Wait(ctx context.Context) error {
	if err := checkReadiness(ctx); err == nil {
		return nil
	}

	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := checkReadiness(ctx); err == nil {
				return nil
			}
		}
	}
}

// StopWithWait sends a shutdown signal to the service. When wait is true, it
// blocks until the process exits, ctx is cancelled, or timeout elapses; pass
// timeout 0 to wait indefinitely (bounded only by ctx).
func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {

service.Wait returns context.DeadlineExceeded directly; waitForFreshDiscoveryConfigMap (via wait.PollUntilContextCancel) does the same. The CLI feeds both through cliexit.Classify, so exit code 4 is correct, but the error message printed to stderr is just context deadline exceeded — no indication of which wait fired. StopWithWait wraps as timed out waiting for %q control plane with pid %d to stop: %w (pkg/service/cmd/service.go:382), which is the shape a user can act on.

				// that are children of the service but run in their own
				// process groups. On Unix, SIGTERM suffices because the
				// service is responsive to signals (it's just slow shutting
				// down hostagents sequentially).
				_ = process.Kill(pid)
				return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}

Fix: wrap at the call site with a one-line scope. Example:

 if service.Running() {
     ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
     defer cancel()
     if err := service.Wait(ctx); err != nil {
-        return cliexit.Classify(err)
+        return cliexit.Classify(fmt.Errorf("waiting for %q control plane to be ready: %w", instance.Name(), err))
     }
-    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+    if err := waitForDiscoveryConfigMap(ctx); err != nil {
+        return cliexit.Classify(fmt.Errorf("waiting for discovery ConfigMap: %w", err))
+    }
+    return nil
 }

The PR already threads %w through the limavm.go call sites (failed waiting for LimaVM to start: %w, etc.); service.go's three cliexit.Classify(err) sites are the outliers.

(suggestion, gap — diagnostic parity with the limavm paths.)

S2. ensureServiceRunning uses context.WithTimeout; every sibling wait uses watchtools.ContextWithOptionalTimeout Claude
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForDiscoveryConfigMap(ctx))

Every other timeout-bound block this PR touches — startAndWaitForReady:122, serviceStartAction already-running branch :266, StopWithWait:366, and all five limavm.go wait sites — uses watchtools.ContextWithOptionalTimeout, which degrades to context.WithCancel when the duration is 0. startWaitTimeout is a positive constant today, so the observable behavior is the same, but a future refactor that exposes this constant via a flag would silently fail on --timeout=0 (an immediately-expired context). Drop-in change:

-        ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
+        ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
         defer cancel()

(suggestion, enhancement — cross-site consistency.)

S3. No BATS coverage for svc start --timeout / svc delete --timeout exit-code-4 contract Codex
load '../../helpers/load'

# Assert `rdd svc stop --wait` exits with code 4 (cliexit.CodeTimeout) when
# --timeout expires. A 1ms deadline fires before the graceful-shutdown loop
# can observe the service exit, so the wait sends SIGTERM (or TerminateProcess
# on Windows) and returns context.DeadlineExceeded wrapped in cliexit.Timeout.

The PR adds exit-code-4 behavior on three svc paths (stop, start already-running branch, delete) and four limavm paths, but the new 6-timeout.bats asserts only svc stop. svc delete is the highest-risk unrepresented path because its I1 predecessor in Round 3 (over-broad error swallow) would have shipped green under a run -4 rdd svc delete --timeout=1ms && assert_dir_exist "${RDD_DIR}" test. svc start's already-running branch is also new and has no integration coverage. The author's Round 2 resolution table lists these two as intentional skips; Round 4 I2 demonstrates that the cost of that skip can be a non-obvious correctness bug.

Fix: two new cases in 6-timeout.bats, shaped like the existing one. For svc delete, also assert the instance directory survives:

@test 'svc delete --timeout=1ms exits with code 4 and preserves directory' {
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    assert_dir_exist "$(rdd svc paths dir)"
}

(suggestion, gap — carries forward R2 testing gaps #1–#2 listed as intentional skips; R4 I2 makes the coverage argument sharper.)

S4. Already-running svc start branch has no symmetric "is ready" log on success Claude

The cold-start branch logs "%q control plane is ready" after the wait succeeds (line 297). The already-running branch waits for the same readiness signal but logs nothing on success — a user running rdd svc start against an already-running-but-still-initializing service sees the "waiting for ... to be ready" line then silence. Mirror the cold-start log:


	if wait {
		if err := startAndWaitForReady(cmd.Context(), serveArgs, timeout); err != nil {
			return cliexit.Classify(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
	} else {
		if err := service.Start(cmd.Context(), serveArgs); err != nil {
			return err
		}
		logrus.Infof("%q control plane is starting", instance.Name())
         if err := service.Wait(ctx); err != nil {
             return cliexit.Classify(err)
         }
-        return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+        if err := waitForDiscoveryConfigMap(ctx); err != nil {
+            return cliexit.Classify(err)
+        }
+        logrus.Infof("%q control plane is ready", instance.Name())
+        return nil
     }

Pairs well with S1's wrapping — both amount to "make the already-running branch as chatty on success and failure as the cold-start branch."

(suggestion, enhancement.)

S5. cliexit.Classify can double-wrap a *cliexit.Error{Code: CodeTimeout} Gemini
// Timeout wraps err with [CodeTimeout].
func Timeout(err error) *Error {
	return &Error{Code: CodeTimeout, Err: err}
}

// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Other errors pass through unchanged.
func Classify(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}

*cliexit.Error.Unwrap() exposes the inner error, so errors.Is(cliexit.Timeout(deadlineErr), context.DeadlineExceeded) is true. If a call site passes an already-classified error through Classify again, the result is *cliexit.Error{Code: 4, Err: *cliexit.Error{Code: 4, Err: deadlineErr}} — two layers, same exit code. Today's call graph does not double-classify in practice (every Classify is a top-level, per-command call), but the pattern of wrap-on-inner-call-then-classify-on-outer-call is easy to slip into and the helper should be idempotent:

 func Classify(err error) error {
+    var cliErr *Error
+    if errors.As(err, &cliErr) {
+        return err
+    }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }

(suggestion, enhancement — defensive idempotence.)

S6. Delete's deadline-branch recovery elides that the process has already been SIGTERM'd Claude
// Pass timeout 0 to wait indefinitely (bounded only by ctx).
func Delete(ctx context.Context, timeout time.Duration) error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	if Running() {
		if err := StopWithWait(ctx, true, timeout); err != nil {
			// Suppress only the outer/inner Running() race, where the
			// process exits between Delete's check and StopWithWait's.
			// Deadline expiry must surface so the CLI exits 4 and leaves
			// the directory in place, per docs/design/cmd_service.md.
			if errors.Is(err, context.DeadlineExceeded) || Running() {
				return err
			}
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())

The comment at :430-433 correctly describes the R3 I1 fix. One observation worth a second comment line: when the deadline branch fires, StopWithWait has already called _ = process.Kill(pid) (:381). On Unix that is SIGTERM, which the service catches — but the k8s SetupSignalContext second-signal handler routes the second SIGTERM through os.Exit(1) within microseconds, so the process is usually already dead when Delete returns err. A user who runs rdd svc status right after exit 4 will see no running PID, which reads as a contradiction of the "directory left in place because the process is still alive" framing in the docs. The contract is still correct (the original reason for preserving the directory was "do not remove it under a live writer," which holds during the wait; the post-kill moment is a bonus side effect), but the comment understates the post-deadline process state:

         if errors.Is(err, context.DeadlineExceeded) || Running() {
+            // At deadline expiry, StopWithWait has already SIGTERM'd the
+            // process, which usually exits within microseconds. Directory
+            // removal still waits for the next `rdd svc delete` invocation,
+            // per docs/design/cmd_service.md.
             return err
         }

(suggestion, enhancement — comment accuracy.)

S7. rdd limavm create --timeout help text understates when the flag is ignored Claude
	}
	command.Flags().StringVarP(&namespace, "namespace", "n", metav1.NamespaceDefault, "Namespace for the LimaVM resource")
	command.Flags().BoolVar(&dryRun, "dry-run", false, "If set, do not commit any changes to the cluster")
	command.Flags().BoolVar(&start, "start", false, "Start the VM after creation")
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the operation to complete (only applies with --start)")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
	return command
}

func newLimaVMStartCommand() *cobra.Command {
	var wait bool

The create wait condition is start && wait && !dryRun (:409), so --timeout is also ignored when --start is false or --dry-run is set. The other limavm commands' help text is correct because their wait condition is just wait (or equivalent). Tighten the create text:

-    command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
+    command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored without --start, with --wait=false, or with --dry-run (0 means wait indefinitely)")

(suggestion, enhancement — only create has this extra condition.)

S8. svc delete --timeout help text reads as jargon ("stop-and-wait") Claude
	command := &cobra.Command{
		Use:  "delete",
		Long: "Delete RDD control plane",
		RunE: serviceDeleteAction,
	}
	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
	return command
}

func serviceDeleteAction(cmd *cobra.Command, _ []string) error {
	if !service.Exists() {

Round 3 S6 pushed the author to divorce the wording from the Timeout for --wait sibling (since delete has no --wait flag). The landing text "stop-and-wait" is terse but reads as a noun of art; "stop and wait" is not a named concept in the docs, and the siblings use plain Timeout for --wait. A more readable form:

-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for waiting for the control plane to stop (0 means wait indefinitely)")

(suggestion, enhancement — polish on R3 S6 resolution; the Round 3 fix was correct in spirit, the specific word choice is what jars.)

S9. cliexit.Classify's docstring under-describes potential growth and the single-case implementation is minimal enough to inline Claude
// Timeout wraps err with [CodeTimeout].
func Timeout(err error) *Error {
	return &Error{Code: CodeTimeout, Err: err}
}

// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Other errors pass through unchanged.
func Classify(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}

The name Classify suggests a polymorphic classifier (admission errors → CodeRejected, context.Canceled → some user-cancel code, etc.), but the body handles exactly one case. Two paths:

Either direction keeps the signal clear. A two-line doc comment is the cheapest form:

-// Classify wraps err with [CodeTimeout] when err wraps
-// [context.DeadlineExceeded]. Other errors pass through unchanged.
+// Classify tags known error classes with their CLI exit code. Today it
+// only handles [context.DeadlineExceeded] → [CodeTimeout]; add new
+// classifications here as they become CLI-observable.
 func Classify(err error) error {

(suggestion, enhancement — documentation clarity for a shared helper.)


Design Observations

Strengths

Concerns


Testing Assessment

Coverage of the exit-code-4 contract is broad on the limavm side: every wait-capable subcommand has a run -4 regression test, and 6-timeout.bats asserts svc stop --timeout=1ms. Unaddressed gaps, carried forward from earlier rounds plus new:

  1. rdd svc delete --timeout=… deadline expiry returns exit code 4 and leaves the directory in place. round 2 S2 unaddressed (intentional skip). The R3 I1 regression around this exact code path would have been caught by a test. Priority elevated by R4 I2.
  2. rdd svc start --timeout=… against an already-running-but-not-yet-ready service returns exit code 4. round 2 S2 unaddressed (intentional skip). Would exercise the new already-running branch plus R4 I2's freshness-skip.
  3. Dropped watch during rdd limavm start/stop/restart/create/delete --wait. New in R4. Ties directly to I1. Hard to provoke without a custom mock API server; a unit-level test against a fake.NewWithWatch client that closes its watcher channel prematurely would be the cheapest.
  4. ensureServiceRunning deadline expiry during rdd set / rdd limavm * / rdd kubectl / rdd service config returns exit code 4. round 2 testing gap #3 unaddressed (intentional skip). The fix at cmd/rdd/service.go:97-100 is correct; no test exercises it.
  5. --timeout=0 (indefinite wait) succeeds when the desired state eventually arrives. round 2 testing gap #4 unaddressed (intentional skip). A regression that accidentally used context.WithTimeout(ctx, 0) (which fires immediately) would ship green.
  6. Ctrl-C during svc stop --timeout=0 / svc delete --timeout=0. round 3 testing gap #5 unaddressed. Now that both functions accept ctx, cancellation via cmd.Context() should abort the wait; a test would lock the behavior in.

Items 1, 2, 4, 5, 6 remain intentional skips per the author's R2/R3 resolution; item 3 is new and ties to I1.


Documentation Assessment


Commit Structure

Single commit, single concept. The commit body accurately enumerates the landed defaults and references the R3 resolution. Clean.


Acknowledged Limitations


Unresolved Feedback

No prior PR review comments on this PR.


Agent Performance Retro

Claude

Claude carried the majority of the Suggestion coverage again: six of nine Suggestions, including the four that capture the "not-quite-symmetric" polish gaps (context.WithTimeout vs watchtools.ContextWithOptionalTimeout, missing ready log in already-running branch, --timeout help text precision, cliexit.Classify docstring). Claude's I1 (bare ctx.Err() messages) and I2 (Delete post-deadline comment) were both cosmetic enough that consolidation downgraded them to S1 and S6 — Important severity was not the right pitch for a diagnostic-wrapping ask. One miss worth naming: Claude's S10 (add local_teardown_file to 6-timeout.bats) directly contradicted the repo's BATS style rule "Don't write local_teardown_file() — leave the system running after tests so failures can be inspected manually." The orchestrator missed it during the first consolidation pass too — the BATS rules lived in bats-style.md but were not inlined into the deep-review context the agents receive, so no agent had the rule in prompt. Dropped during a second pass; repo-context update below captures the fix. Claude also noted a pre-existing test-code-drift in 2-create.bats that this PR does not touch, scoped as out-of-scope correctly. No Critical false alarms and no hallucinated file contents.

Codex

Codex delivered the review's single largest-impact finding in I2 (stale ConfigMap in the already-running svc start branch). That catch reconciles what R3 consolidated as a strength — "already-running branch now matches the cold-start path" — against the actual code, which matches only up to the freshness anchor. The scope-level realization that the R3 S7 consolidation was incomplete is exactly what Round 4's re-read needs to surface. Codex's I1 (5-minute default too short for multi-VM shutdown) is dropped as a mis-labeled regression — pre-PR svc stop hardcoded 60s, so 5m is strictly longer — but the underlying concern about future-parallel shutdown is real and is captured as an acknowledged limitation via the R3 S5 comment. Codex skipped the git blame/merge-base check on I1, which is the gap behind the mis-label. S1 (BATS coverage gaps) correctly carries R2/R3 testing skips forward with new scaffolding from R4 I2.

Gemini

Gemini produced its single Important finding (watch-close false success) directly — no Critical mis-calibrations, no hallucinated @test headers (the repeat false positive of the last two rounds), and the fix shape is correct. Gemini noted the finding was pre-existing but did not explicitly tie it to this PR's widening of the exposure (user-documented --timeout=0 + exit-code-4 contract); consolidation added that link. Gemini's S1 (Classify double-wrap) is a valid defensive-idempotence suggestion with no active bug today. Gemini continues to skip git blame, so the regression-vs-gap labeling falls to consolidation — but with only one finding, the labeling cost was minimal this round.

Summary

Claude Opus 4.7Codex GPT 5.4Gemini 3.1 Pro
Duration5m 09s
Findings6S1I 1S1I 1S
Tool calls20 (readfile 8, grepsearch 7, runshellcommand 5)
Design observations412
False positives110
Unique insights622
Files reviewed101010
Coverage misses100
Totals6S1I 1S1I 1S
Downgraded2 (I→S)00
Dropped110

Reconciliation. Four severity changes:

A second retracted item — an orchestrator-added re-raise of Round 3 S3 (run -0 on the cleanup rdd limavm delete "start-vm") — turned out to be built on a Round 3 mistake. The repo BATS style rule states: "Only use run -0 when you need to inspect $output afterwards. If you just want to verify a command succeeds, run it directly without run." The current code (no run -0 on that cleanup line) matches the convention; Round 3 S3 was itself a false positive carried forward from Round 2's --wait scrub commentary. Removed from this round's Suggestions and noted as a Round 3 error.


Review Process Notes

Skill improvements

Repo context updates