Deep Review: 20260422-150402-pr-336

Date2026-04-22 15:04
Reporancher-sandbox/rancher-desktop-daemon
Round6
Author@jandubois
PR#336 — cli: standardize --wait and --timeout across svc and limavm
Branchwait-timeout-cleanup
Commitsfca9713 cli: standardize --wait and --timeout across svc and limavm
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
VerdictMerge with fixes — Round 5's three Importants (I1 watchUntil silent success on closed watch, I2 StopWithWait mis-labeling cancellation as timeout, I3 already-running branch skipping freshness anchor) are all fixed on fca9713. Two new Importants surface: (I1) cliexit.Classify strips any non-CodeTimeout classification when the wrapped error is context.DeadlineExceeded, so a future cliexit.Rejected(ctx.Err()) would silently become exit 4 despite the docstring's "callers may Classify repeatedly" promise; and (I2) cleanupDiscoveryConfigMap still deletes the ConfigMap before the stop wait, so a timed-out svc stop --wait followed by a quick svc start now takes the already-running branch, finds the CM gone, and hangs 90s waiting for a ready annotation the dying service will never write. The new BATS test 6-timeout.bats exercises exactly this sequence and relies on the service exiting fast enough to take the cold-start path instead. No Critical.
Wall-clock time21 min 57 s


Executive Summary

Round 6 (fca9713) closes every Round 5 Important: watchUntil / watchUntilDeleted now re-Get on watch close and distinguish a satisfied predicate from a dropped watch (cmd/rdd/limavm.go:497-509, 547-563); StopWithWait's deadline branch now splits on ctx.Err(), so context.Canceled is no longer reported as a timeout (pkg/service/cmd/service.go:383-397); and both ensureServiceRunning and serviceStartAction's already-running branches now anchor waitForFreshDiscoveryConfigMap on service.StartTime() (cmd/rdd/service.go:98-110, 262-282). The library-surface refactor — StopWithWait(ctx, wait, timeout), Delete(ctx, timeout), the new StartTime() — and the centralization of context.DeadlineExceeded → exit-4 routing in cliexit.Classify are clean and well-documented.

Two new Importants surface in this round. I1: cliexit.Classify (pkg/cli/exit/exit.go:57-66) short-circuits only when the existing *Error already has Code == CodeTimeout. Any future call site that wraps a deadline-bearing error with a different code (cliexit.Rejected(ctx.Err()), or a not-yet-defined code) will have its classification silently stripped and replaced with Timeout. No current caller produces that combination, so the bug is latent, but the docstring promises "callers may Classify repeatedly" — the fix is a one-line broadening of the short-circuit. I2: StopWithWait deletes the discovery ConfigMap before sending the signal (pkg/service/cmd/service.go:355-356), and on a short timeout process.Kill returns before the process actually exits. A user who runs rdd svc start in that window now takes the already-running branch, calls waitForFreshDiscoveryConfigMap with the current PID-file mtime as anchor, and waits 90s (or until startWaitTimeout) for a ReadyAnnotation that the dying service cannot write. The new test bats/tests/20-service/6-timeout.bats:22-29 exercises this exact sequence and relies on the service exiting fast enough to take the cold-start path instead.

Ten Suggestions follow. S1 captures Codex's observation that the second-precision freshness anchor (PID-file mtime) could theoretically accept a stale ConfigMap when a prior unclean shutdown lands in the same wall-clock second as the new startup — demoted from Important because the pre-PR already-running branch had no freshness check at all, so Round 6 is still strictly better, and the race window is sub-second. S2 and S3 capture Claude's two demoted Importants: watchUntil's watch-close + NotFound fall-through returns a raw apierror instead of the symmetric "deleted while waiting" message, and the Running()StartTime() call pair has a theoretical ENOENT race that surfaces as an opaque stat … no such file or directory error. Both are one-line fixes. The remaining suggestions cover a missing exit-4 contract test for the shared autostart helpers (svc config, set, ctl), a wait local variable that shadows the k8s.io/apimachinery/pkg/util/wait import, a comment/predicate mismatch in Delete, the BATS setup_rdd_control_plane helper's now-5-minute potential hang, a timing dependency in the new 6-timeout test, and two doc/flag-help polish items.

No Critical issues. Claude and Codex split their unique finds: Claude surfaced the Classify stripping bug and the cleanupDiscoveryConfigMap ordering hazard (independent finds), while Codex traced the shared-helper test gap and the second-precision freshness race. Gemini cleared the entire diff — zero findings — which is consistent with its skip on git blame and a diff where all Round-5 Importants are genuinely closed.


Critical Issues

None.


Important Issues

I1. Classify strips a non-timeout classification when the wrapped error is context.DeadlineExceeded Claude

// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Already-classified timeout errors and
// unrelated errors pass through unchanged, so callers may Classify
// repeatedly.
func Classify(err error) error {
	var exitErr *Error
	if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
		return err
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}

The errors.As short-circuit only fires when the existing *Error already carries Code == CodeTimeout. If any caller ever wraps a deadline-bearing error with a different exit code — cliexit.Rejected(ctx.Err()), or any future classification — Classify strips the original code and replaces it with Timeout. Today no call site produces that combination (classifyAPIError at cmd/rdd/set.go:332-337 only fires on admission rejections, which never wrap DeadlineExceeded), so the bug is latent. The docstring at pkg/cli/exit/exit.go:53-56 promises "callers may Classify repeatedly," which reads as a commitment to idempotency across every code path — a future helper that returns cliexit.Rejected(ctx.Err()) through Classify would silently lose its 3 → 4 (rejected → timeout) classification.

	return nil
}

// classifyAPIError tags admission-controller rejections with
// [cliexit.CodeRejected]. Other API errors pass through unchanged.
func classifyAPIError(err error) error {
	if apierrors.IsInvalid(err) || apierrors.IsBadRequest(err) || apierrors.IsForbidden(err) {
		return cliexit.Rejected(err)
	}
	return err
}

// createAndPatchApp creates the App using the dynamic client so that
// only user-specified fields (plus required fields with zero values)
// are sent; the API server fills in CRD defaults. The returned
// generation is metadata.generation after the write, used to filter
// Timeout wraps err with [CodeTimeout].
func Timeout(err error) *Error {
	return &Error{Code: CodeTimeout, Err: err}
}

// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Already-classified timeout errors and
// unrelated errors pass through unchanged, so callers may Classify
// repeatedly.
func Classify(err error) error {
	var exitErr *Error
	if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
		return err
	}

Fix: short-circuit on any already-classified *Error, not only on CodeTimeout:

 func Classify(err error) error {
     var exitErr *Error
-    if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
+    if errors.As(err, &exitErr) {
         return err
     }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }

*(important, regression — Classify is new in this PR.)*

I2. StopWithWait deletes the discovery ConfigMap before the wait, so a timed-out stop plus a quick restart hangs the new start 90 s on a missing CM Claude
func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Clean up discovery configmap while cluster is still accessible
	_ = cleanupDiscoveryConfigMap() // Clean up discovery configmap to prevent stale controller info

	pid := PID()
	// Try graceful shutdown first. On Unix, Kill already sends SIGTERM which
	// triggers the Go signal handler. On Windows, Kill uses TerminateProcess
	// which bypasses all handlers, so we send Interrupt (CTRL_BREAK_EVENT)

StopWithWait deletes the discovery ConfigMap (line 356) before sending SIGINT, so on a short timeout (--timeout=1ms) the CM is gone well before the service exits. process.Kill at line 392 is fire-and-forget on Unix; Running() may still report the process alive when StopWithWait returns. A user who runs rdd svc start in that window takes the already-running branch at cmd/rdd/service.go:262-282, reads service.StartTime(), and calls waitForFreshDiscoveryConfigMap(ctx, startTime). The CM was just deleted and the dying service never recreates it — only a fresh rdd service serve re-runs InitDiscovery (pkg/service/cmd/service.go:766-768). The poll at cmd/rdd/service.go:156-170 runs until startWaitTimeout (90 s) and exits 4.

	client, err := kubernetes.NewForConfig(restConfig)
	if err != nil {
		return fmt.Errorf("failed to create kubernetes client: %w", err)
	}

	return wait.PollUntilContextCancel(ctx, 500*time.Millisecond, true, func(ctx context.Context) (bool, error) {
		cm, err := client.CoreV1().ConfigMaps(controllers.RDDSystemNamespace).Get(
			ctx, controllers.ControllerManagerConfigMapName, metav1.GetOptions{},
		)
		if apierrors.IsNotFound(err) {
			return false, nil // ConfigMap not created yet; retry.
		}
		if err != nil {
			return false, err
		}
		if cm.CreationTimestamp.Time.Before(beforeStart) {
			return false, nil // Stale ConfigMap from a previous run; wait for the new one.
		}
		return cm.Annotations[controllers.ReadyAnnotation] == "true", nil
	})
}

func serviceConfigAction(cmd *cobra.Command, _ []string) error {
	if err := ensureServiceRunning(cmd.Context()); err != nil {
		return err
		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !wait {
			return nil
		}
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		if err := waitForFreshDiscoveryConfigMap(ctx, startTime); err != nil {
			return cliexit.Classify(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
		return nil
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {
		controllers, err := cmd.Flags().GetString("controllers")
				// service alone: on Windows (TerminateProcess) it spares
				// hostagents, which live in their own process groups. On
				// Unix, SIGTERM suffices because the service handles
				// signals promptly; its slowness comes from draining
				// hostagents sequentially.
				_ = process.Kill(pid)
				err := ctx.Err()
				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
	// Its creationTimestamp serves as the control plane start time.
	initClient, err := kubernetes.NewForConfig(completed.ControlPlane.Generic.LoopbackClientConfig)
	if err != nil {
		return fmt.Errorf("failed to create kubernetes client for discovery init: %w", err)
	}
	if err := controllers.InitDiscovery(ctx, initClient); err != nil {
		return fmt.Errorf("failed to initialize discovery: %w", err)
	}

	// Start shared controller manager for enabled controllers
	var enabledControllers []base.Controller

	// Get all registered controllers from the registry

Before Round 6, the already-running branch used the non-freshness waitForDiscoveryConfigMap, which accepted any surviving CM. The Round-6 freshness anchor closes a real staleness gap but promotes this pre-existing CM ordering hazard into a user-visible hang. The new test at bats/tests/20-service/6-timeout.bats:22-29 exercises precisely this sequence and relies on the service exiting fast enough for rdd svc start to take the cold-start path instead of the already-running one.

@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Restart the service; the prior test killed it.
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    run -0 rdd svc status
    assert_output --partial 'has been created: true'
}

@test 'svc start --timeout=1ms exits with code 4' {
    # Reset to a known state; prior tests left the control plane half-dead.
    rdd svc delete || :
    rdd svc create
# Restart the service; the prior test killed it.
rdd svc start
run -4 rdd svc delete --timeout=1ms

Fix options, in increasing scope:

  1. Move cleanupDiscoveryConfigMap() to the success branch of the wait loop (line 399-402), so a timed-out stop leaves the prior CM in place for the next start to consume. The already-running branch then sees a fresh-enough CM (until the service exits, at which point the cold-start path takes over).
  2. On the already-running path, detect a missing ConfigMap via the loopback client and fall through to the cold-start path.
  3. Document the limitation: after a timed-out stop, the user must run rdd svc delete && rdd svc create && rdd svc start.
				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}
			}
		}
	}

	_ = os.Remove(instance.Config()) // Ignore error as file might not exist

(important, regression — the freshness anchor in the already-running branch is new in this PR and it is what makes the pre-existing CM ordering hazard observable on the CLI.)


Suggestions

S1. Second-precision freshness anchor accepts a stale ConfigMap when a prior crash lands in the same wall-clock second as the new startup Codex
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForFreshDiscoveryConfigMap(ctx, startTime))
	}
	return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
}

// startAndWaitForReady starts the service, waits for the API server and the

Both the cold-start path (beforeStart := time.Now().Truncate(time.Second) at cmd/rdd/service.go:125) and the new already-running path (service.StartTime() at :99, :267) anchor freshness on a second-truncated timestamp. The predicate CreationTimestamp.Before(beforeStart) accepts a stale CM whose CT and the new startup fall in the same wall-clock second. The race requires a prior unclean shutdown that left a ReadyAnnotation=true CM behind, a new service startup whose PID-file mtime lands in the same wall-clock second as that stale CT, and the caller entering the already-running branch before InitDiscovery deletes and recreates the CM. The window is sub-second.

// the apiserver may become ready before the serve command creates the ConfigMap.
func startAndWaitForReady(ctx context.Context, serveArgs []string, timeout time.Duration) error {
	// Truncate to second precision because metav1.Time drops sub-seconds
	// during JSON serialization. Without this, a server that starts in the
	// same second would appear to have a startTime *before* beforeStart.
	beforeStart := time.Now().Truncate(time.Second)
	if err := service.Start(ctx, serveArgs); err != nil {
		return err
	}

	ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)

Demoted from Codex's Important: pre-Round-6 the already-running branch skipped the freshness check entirely and would have accepted any surviving CM unconditionally, so Round 6 is strictly better. The cold-start path has carried this second-precision limit since before the PR. A durable fix replaces the timestamp anchor with a startup nonce copied from InitDiscovery into the CM and matched in waitForFreshDiscoveryConfigMap, and would also let the docs at docs/design/cmd_service.md describe readiness identity without hedging on timestamp precision.

(suggestion, gap — pre-existing precision limit; Round 6 extends it to a new path.)

S2. watchUntil returns raw NotFound on watch-close fall-through, asymmetric with the in-loop "deleted while waiting" message Claude
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}
	}
	// ResultChan closed. If ctx ended, return its error. Otherwise the
	// watch dropped (server close, proxy timeout, watch expiry); re-read
	// and re-evaluate to distinguish a satisfied predicate from a drop.
	if err := ctx.Err(); err != nil {
		return err
	}
	if err := c.Get(ctx, key, &vm); err != nil {
		return err
	}
	if check(&vm) {
		return nil
	}
	return fmt.Errorf("watch closed before LimaVM %q reached the desired state", key.Name)
}

// watchUntilDeleted watches a LimaVM until it is deleted.
func watchUntilDeleted(ctx context.Context, c client.WithWatch, limaVM *limav1alpha1.LimaVM) error {
	key := client.ObjectKeyFromObject(limaVM)

When a Deleted event arrives on the live channel, line 489-490 returns the descriptive "LimaVM %q was deleted while waiting". When the channel closes before that event lands and the resource is gone, the fall-through c.Get at line 503 returns an unwrapped apierrors.NewNotFound(...); the user sees "limavms.lima.rancherdesktop.io \"foo\" not found" instead of the consistent message. Same asymmetry applies to terminal-state predicates — limavm start --wait against a VM deleted mid-wait reports a stale NotFound that does not tell the user why the wait ended.


	for ev := range watcher.ResultChan() {
		switch ev.Type {
		case k8swatch.Error:
			return apierrors.FromObject(ev.Object)
		case k8swatch.Deleted:
			return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
		}
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}

Demoted from Claude's Important: the fall-through does surface an error rather than report a silent success, so the Round-5 regression class is closed. The remaining gap is one of message symmetry.

Fix:

     if err := c.Get(ctx, key, &vm); err != nil {
+        if apierrors.IsNotFound(err) {
+            return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
+        }
         return err
     }

(suggestion, regression — the fall-through code is new in Round 6.)

S3. Running()StartTime() race surfaces as an opaque stat … no such file or directory error Claude
	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}

Running() walks PID(), which removes the PID file when the process no longer responds to signal 0 (pkg/service/cmd/service.go:158). A concurrent CLI invocation that passes between our Running() and StartTime() can remove the file, so os.Stat(PIDFile) in StartTime() returns ENOENT and the user sees the bare stat error with no context.

			}
			_ = process.Release()
		}
	}
	if err != nil {
		_ = os.Remove(instance.PIDFile())
		return PIDNotFound
	}
	return pid
}

Demoted from Claude's Important: the only path that removes the file is another PID() call, which requires a second process, and the gap is microseconds. Still worth handling: on os.IsNotExist, fall through to the cold-start path instead of propagating the raw error.

     startTime, err := service.StartTime()
-    if err != nil {
+    if os.IsNotExist(err) {
+        // Service exited between Running() and StartTime(); treat as cold start.
+        return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
+    } else if err != nil {
         return err
     }

*(suggestion, regression — StartTime is new; prior paths had no such dependency.)*

S4. No exit-4 BATS coverage for the shared autostart helpers (svc config, set, ctl) Codex

The new 6-timeout.bats asserts exit 4 for svc stop, svc start, and svc delete, and limavm-cli.bats covers lima create --start, lima start, and lima restart. The same timeout contract now flows through ensureServiceRunning for rdd svc config, rdd set, and rdd ctl (at cmd/rdd/kubectl.go), but no test asserts exit 4 on any of those call sites. A caller-specific regression that unwraps *Error before returning from the autostart path would ship green.

Fix: add one run -4 case per representative shared-helper caller — e.g. rdd svc config and rdd set running=false against a deliberately unready control plane.

(suggestion, gap.)

S5. Local wait bool shadows the imported k8s.io/apimachinery/pkg/util/wait package in serviceStartAction and serviceStopAction Claude
	"github.com/sirupsen/logrus"
	"github.com/spf13/cobra"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	watchtools "k8s.io/client-go/tools/watch"

	cliexit "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/cli/exit"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/developer"

Neither function references wait.PollUntilContextCancel, so no bug lands today. A future edit that adds a wait.Poll… call in either function would compile-fail on the shadow and the fix would be obscure. Renaming the local to waitFlag (or hoisting the branch selection into a helper) removes the trap. The shadow is pre-existing in serviceStopAction; this PR introduces the same shadow to serviceStartAction.

(suggestion, gap.)

S6. Delete's comment claims it suppresses only the outer/inner Running() race, but the predicate also absorbs signal-delivery failures and context.Canceled Claude
// Pass timeout 0 to wait indefinitely (bounded only by ctx).
func Delete(ctx context.Context, timeout time.Duration) error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	if Running() {
		if err := StopWithWait(ctx, true, timeout); err != nil {
			// Suppress only the outer/inner Running() race, where the
			// process exits between Delete's check and StopWithWait's.
			// Deadline expiry must surface so the CLI exits 4 and leaves
			// the directory in place, per docs/design/cmd_service.md.
			if errors.Is(err, context.DeadlineExceeded) || Running() {
				return err
			}
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())

The predicate errors.Is(err, context.DeadlineExceeded) || Running() suppresses any StopWithWait error whose post-call Running() returns false — including the process.Interrupt / process.Kill failure branch at lines 370-373 and the context.Canceled branch at lines 393-397. The context.Canceled path is unreachable today because the CLI uses context.Background() for RunE, but a future commit that wires SIGINT into the CLI context would silently proceed to RemoveAll(instance.Dir()) on Ctrl-C. Either tighten the predicate to errors.Is(err, ErrAlreadyStopped) (once StopWithWait returns a typed sentinel for that race), or expand the comment to enumerate every case the predicate absorbs.

	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

	if wait {
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()

		for {
			select {
			case <-ctx.Done():
				// Deadline fired or caller cancelled; terminate either way
				// to ensure the service process exits. Kill targets the
				// service alone: on Windows (TerminateProcess) it spares
				// hostagents, which live in their own process groups. On
				// Unix, SIGTERM suffices because the service handles
				// signals promptly; its slowness comes from draining
				// hostagents sequentially.
				_ = process.Kill(pid)
				err := ctx.Err()
				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}

(suggestion, regression.)

S7. setup_rdd_control_plane can now block up to 5 minutes when a prior test leaves the service stuck Claude
    local patch=$3

    rdd ctl patch "${resource_type}" "${name}" -n "${RDD_NAMESPACE}" --type='merge' -p="${patch}"
}

setup_rdd_control_plane() {
    local controllers=${1:-"*"}

    rdd svc delete
    rdd svc create --controllers="${controllers}"
    rdd svc start
}

Pre-PR, rdd svc delete waited 60 s and always removed the directory (old Stop() errors were ignored). After this PR, delete waits up to 5 minutes (stopWaitTimeout) and fails with exit 4 if the service does not exit in time, leaving the directory in place. A test file whose prior run left a hostagent in a pathological state would now hang the helper for 5 minutes and then fail, rather than recovering.

Fix: bound the helper with a shorter timeout and tolerate a failed delete so a partial teardown still lets create run.

-    rdd svc delete
+    rdd svc delete --timeout=120s || :

(suggestion, gap — pre-existing helper, behaviour shift introduced by this PR.)

S8. 6-timeout.bats Test 2 depends on the dying service exiting before the next rdd svc start evaluates Running() Claude
@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Restart the service; the prior test killed it.
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    run -0 rdd svc status
    assert_output --partial 'has been created: true'
}

@test 'svc start --timeout=1ms exits with code 4' {
    # Reset to a known state; prior tests left the control plane half-dead.
    rdd svc delete || :
    rdd svc create

Test 1 sends SIGINT then SIGTERM via process.Kill and returns. process.Kill is async on Unix: the service may still report Running() == true when Test 2's rdd svc start runs, and Test 2 then takes the already-running branch — see I2 for why that can hang 90 s on a missing CM. Test 3 already resets with rdd svc delete || :; rdd svc create; rdd svc start; applying the same reset before Test 2's rdd svc start (or a wait_for_dead polling helper) removes the timing dependency.

(suggestion, regression — new test.)

S9. --timeout help strings disagree between rdd set and every other wait-capable command Claude

The PR's new flags say "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)". The pre-existing rdd set --timeout says "Timeout for waiting (0 to wait indefinitely)". The PR's stated goal is to standardize this surface — aligning set's help string in this PR (or as a targeted follow-up) keeps --help output uniform. The hardcoded 300*time.Second at cmd/rdd/set.go:92 also duplicates limaLongWaitTimeout = 5 * time.Minute; promoting a single named constant prevents drift.

	}
	command.Flags().BoolVar(&dryRun, "dry-run", false,
		"Validate changes against the API server without persisting them")
	command.Flags().BoolVar(&wait, "wait", true,
		"Wait for the desired state after the patch is accepted")
	command.Flags().DurationVar(&timeout, "timeout", 300*time.Second,
		"Timeout for waiting (0 to wait indefinitely)")

	// Override help to append live property descriptions from the CRD schema.
	defaultHelp := command.HelpFunc()
	command.SetHelpFunc(func(cmd *cobra.Command, args []string) {

(suggestion, enhancement.)

S10. limavm create's --timeout help omits the --start dependency Claude
		},
	}
	command.Flags().StringVarP(&namespace, "namespace", "n", metav1.NamespaceDefault, "Namespace for the LimaVM resource")
	command.Flags().BoolVar(&dryRun, "dry-run", false, "If set, do not commit any changes to the cluster")
	command.Flags().BoolVar(&start, "start", false, "Start the VM after creation")
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the operation to complete (only applies with --start)")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
	return command
}

func newLimaVMStartCommand() *cobra.Command {
	var wait bool

--wait mentions "only applies with --start"; --timeout's help mentions only --wait=false. The wait branch at line 409 gates on start && wait && !dryRun, so --timeout is also a no-op without --start. Mention --start in the timeout help on create for parity with --wait.

			// Keep createdConfigMap until limaVM itself is deleted.
			createdConfigMap = nil
		}
	}

	if start && wait && !dryRun {
		waitCtx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		key := client.ObjectKeyFromObject(limaVM)
		if err := watchUntil(waitCtx, c, key, func(vm *limav1alpha1.LimaVM) bool {
			return apimeta.IsStatusConditionPresentAndEqual(vm.Status.Conditions, "Running", metav1.ConditionTrue)

(suggestion, enhancement.)


Design Observations

Strengths

Concerns


Testing Assessment

New exit-code-4 coverage hits svc stop, svc delete, svc start, lima create --start, lima start, and lima restart — good breadth against the new contract. Codex built bin/rdd and ran go test ./cmd/rdd/... ./pkg/cli/... ./pkg/service/cmd/... ./pkg/service/controllers/...; all passed.

Untested scenarios, ranked by risk:

  1. Shared autostart helpers (S4). rdd svc config, rdd set, and rdd ctl now carry the same exit-4 contract through ensureServiceRunning, but no test asserts it.
  2. svc delete after the service has already exited cleanly (S3 / S6). The outer/inner Running() race-suppression branch at pkg/service/cmd/service.go:449 is not exercised; a test that starts the service, delivers SIGINT directly, then runs rdd svc delete would cover it.
  3. --timeout=0 (wait forever). Documented for svc stop, svc delete, svc start, and every limavm command, but never asserted. A regression that treats 0 as "return immediately" would ship green.
  4. svc start --wait=false. A TODO at bats/tests/20-service/2-create.bats:3 predates this PR; the new flag handling makes regressions slightly easier to introduce.
  5. limavm create --start --wait=false. Guarded by start && wait && !dryRun at cmd/rdd/limavm.go:409, but not exercised.
  6. I1 / I2 regression tests. Neither Important has a regression test today: I1 needs a synthetic Rejected-wrapping-Deadline error through Classify; I2 needs a timed-out stop followed immediately by an already-running svc start with a deterministic process-exit barrier.

Documentation Assessment


Commit Structure

Single commit, clean scope, message accurately describes the change.


Acknowledged Limitations


Agent Performance Retro

[Claude] Carried the round: four of five consolidated Importants/Suggestions with behavioural teeth (the Classify stripping bug, the cleanupDiscoveryConfigMap ordering hazard, the watchUntil NotFound asymmetry, and the Running() / StartTime() race). The Classify finding is the kind of latent-bug spot that requires reading the docstring against the implementation; no other agent surfaced it. Two Importants (I3, I4) were borderline and consolidated as Suggestions — a calibration note rather than a miss, since the findings themselves are real.

[Codex] One sharp Important (second-precision freshness race) and one well-chosen test-gap Suggestion. The freshness finding demotes cleanly to Suggestion because Round 6 still improves strictly over Round 5 — Codex correctly identified the residual race but did not weigh the pre-PR baseline. Ran the Go unit tests and reported them green, which neither other agent did. Missed the Classify latent bug and the cleanupDiscoveryConfigMap ordering hazard.

[Gemini] Zero findings. With every Round-5 Important fixed and the remaining issues either latent (I1) or dependent on process-lifecycle timing (I2), a clean bill from the third agent is plausible — the three recorded design observations read the diff honestly. As expected, Gemini skipped git blame (quota), so regression/gap labels came entirely from Claude and Codex. Its single design concern (background-daemon lifecycle tied to CLI context) sits outside the PR scope but is worth keeping on file.

Summary

Claude Opus 4.7Codex GPT 5.4Gemini 3.1 Pro
Duration15m 09s7m 32s6m 14s
Findings2I 8S2Snone
Tool calls49 (Bash 28, Read 21)57 (shell 53, plan 2, stdin 2)41 (grepsearch 18, readfile 12, runshellcommand 10)
Design observations213
False positives000
Unique insights420
Files reviewed101010
Coverage misses000
Totals2I 8S2Snone
Downgraded2 (I→S)1 (I→S)0
Dropped000

Overall, Claude produced the most reviewable output this round; Codex added one durable suggestion and the only live build-and-test signal; Gemini's clean read corroborates that the Round-5 Importants genuinely closed.

Reconciliation


Review Process Notes

Repo context updates

Skill improvements

None. Both findings above are specific to this repo's shutdown / readiness protocol and do not generalise.