Deep Review: 20260422-150402-pr-336

Date	2026-04-22 15:04
Repo	rancher-sandbox/rancher-desktop-daemon
Round	6
Author	@jandubois
PR	#336 — cli: standardize --wait and --timeout across svc and limavm
Branch	`wait-timeout-cleanup`
Commits	`fca9713` cli: standardize --wait and --timeout across svc and limavm
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — Round 5's three Importants (I1 `watchUntil` silent success on closed watch, I2 `StopWithWait` mis-labeling cancellation as timeout, I3 already-running branch skipping freshness anchor) are all fixed on `fca9713`. Two new Importants surface: (I1) `cliexit.Classify` strips any non-`CodeTimeout` classification when the wrapped error is `context.DeadlineExceeded`, so a future `cliexit.Rejected(ctx.Err())` would silently become exit 4 despite the docstring's "callers may Classify repeatedly" promise; and (I2) `cleanupDiscoveryConfigMap` still deletes the ConfigMap before the stop wait, so a timed-out `svc stop --wait` followed by a quick `svc start` now takes the already-running branch, finds the CM gone, and hangs 90s waiting for a ready annotation the dying service will never write. The new BATS test `6-timeout.bats` exercises exactly this sequence and relies on the service exiting fast enough to take the cold-start path instead. No Critical.
Wall-clock time	`21 min 57 s`

Executive Summary ¶

Round 6 (fca9713) closes every Round 5 Important: watchUntil / watchUntilDeleted now re-Get on watch close and distinguish a satisfied predicate from a dropped watch (cmd/rdd/limavm.go:497-509, 547-563); StopWithWait's deadline branch now splits on ctx.Err(), so context.Canceled is no longer reported as a timeout (pkg/service/cmd/service.go:383-397); and both ensureServiceRunning and serviceStartAction's already-running branches now anchor waitForFreshDiscoveryConfigMap on service.StartTime() (cmd/rdd/service.go:98-110, 262-282). The library-surface refactor — StopWithWait(ctx, wait, timeout), Delete(ctx, timeout), the new StartTime() — and the centralization of context.DeadlineExceeded → exit-4 routing in cliexit.Classify are clean and well-documented.

Two new Importants surface in this round. I1: cliexit.Classify (pkg/cli/exit/exit.go:57-66) short-circuits only when the existing *Error already has Code == CodeTimeout. Any future call site that wraps a deadline-bearing error with a different code (cliexit.Rejected(ctx.Err()), or a not-yet-defined code) will have its classification silently stripped and replaced with Timeout. No current caller produces that combination, so the bug is latent, but the docstring promises "callers may Classify repeatedly" — the fix is a one-line broadening of the short-circuit. I2: StopWithWait deletes the discovery ConfigMap before sending the signal (pkg/service/cmd/service.go:355-356), and on a short timeout process.Kill returns before the process actually exits. A user who runs rdd svc start in that window now takes the already-running branch, calls waitForFreshDiscoveryConfigMap with the current PID-file mtime as anchor, and waits 90s (or until startWaitTimeout) for a ReadyAnnotation that the dying service cannot write. The new test bats/tests/20-service/6-timeout.bats:22-29 exercises this exact sequence and relies on the service exiting fast enough to take the cold-start path instead.

Ten Suggestions follow. S1 captures Codex's observation that the second-precision freshness anchor (PID-file mtime) could theoretically accept a stale ConfigMap when a prior unclean shutdown lands in the same wall-clock second as the new startup — demoted from Important because the pre-PR already-running branch had no freshness check at all, so Round 6 is still strictly better, and the race window is sub-second. S2 and S3 capture Claude's two demoted Importants: watchUntil's watch-close + NotFound fall-through returns a raw apierror instead of the symmetric "deleted while waiting" message, and the Running() → StartTime() call pair has a theoretical ENOENT race that surfaces as an opaque stat … no such file or directory error. Both are one-line fixes. The remaining suggestions cover a missing exit-4 contract test for the shared autostart helpers (svc config, set, ctl), a wait local variable that shadows the k8s.io/apimachinery/pkg/util/wait import, a comment/predicate mismatch in Delete, the BATS setup_rdd_control_plane helper's now-5-minute potential hang, a timing dependency in the new 6-timeout test, and two doc/flag-help polish items.

No Critical issues. Claude and Codex split their unique finds: Claude surfaced the Classify stripping bug and the cleanupDiscoveryConfigMap ordering hazard (independent finds), while Codex traced the shared-helper test gap and the second-precision freshness race. Gemini cleared the entire diff — zero findings — which is consistent with its skip on git blame and a diff where all Round-5 Importants are genuinely closed.

Critical Issues ¶

None.

Important Issues ¶

I1. Classify strips a non-timeout classification when the wrapped error is context.DeadlineExceeded Claude¶

pkg/cli/exit/exit.go


// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Already-classified timeout errors and
// unrelated errors pass through unchanged, so callers may Classify
// repeatedly.
func Classify(err error) error {
	var exitErr *Error
	if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
		return err
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}

The errors.As short-circuit only fires when the existing *Error already carries Code == CodeTimeout. If any caller ever wraps a deadline-bearing error with a different exit code — cliexit.Rejected(ctx.Err()), or any future classification — Classify strips the original code and replaces it with Timeout. Today no call site produces that combination (classifyAPIError at cmd/rdd/set.go:332-337 only fires on admission rejections, which never wrap DeadlineExceeded), so the bug is latent. The docstring at pkg/cli/exit/exit.go:53-56 promises "callers may Classify repeatedly," which reads as a commitment to idempotency across every code path — a future helper that returns cliexit.Rejected(ctx.Err()) through Classify would silently lose its 3 → 4 (rejected → timeout) classification.

cmd/rdd/set.go

	return nil
}

// classifyAPIError tags admission-controller rejections with
// [cliexit.CodeRejected]. Other API errors pass through unchanged.
func classifyAPIError(err error) error {
	if apierrors.IsInvalid(err) || apierrors.IsBadRequest(err) || apierrors.IsForbidden(err) {
		return cliexit.Rejected(err)
	}
	return err
}

// createAndPatchApp creates the App using the dynamic client so that
// only user-specified fields (plus required fields with zero values)
// are sent; the API server fills in CRD defaults. The returned
// generation is metadata.generation after the write, used to filter

pkg/cli/exit/exit.go

// Timeout wraps err with [CodeTimeout].
func Timeout(err error) *Error {
	return &Error{Code: CodeTimeout, Err: err}
}

// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Already-classified timeout errors and
// unrelated errors pass through unchanged, so callers may Classify
// repeatedly.
func Classify(err error) error {
	var exitErr *Error
	if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
		return err
	}

Fix: short-circuit on any already-classified *Error, not only on CodeTimeout:

 func Classify(err error) error {
     var exitErr *Error
-    if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
+    if errors.As(err, &exitErr) {
         return err
     }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }

*(important, regression — Classify is new in this PR.)*

I2. StopWithWait deletes the discovery ConfigMap before the wait, so a timed-out stop plus a quick restart hangs the new start 90 s on a missing CM Claude¶

pkg/service/cmd/service.go

func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Clean up discovery configmap while cluster is still accessible
	_ = cleanupDiscoveryConfigMap() // Clean up discovery configmap to prevent stale controller info

	pid := PID()
	// Try graceful shutdown first. On Unix, Kill already sends SIGTERM which
	// triggers the Go signal handler. On Windows, Kill uses TerminateProcess
	// which bypasses all handlers, so we send Interrupt (CTRL_BREAK_EVENT)

StopWithWait deletes the discovery ConfigMap (line 356) before sending SIGINT, so on a short timeout (--timeout=1ms) the CM is gone well before the service exits. process.Kill at line 392 is fire-and-forget on Unix; Running() may still report the process alive when StopWithWait returns. A user who runs rdd svc start in that window takes the already-running branch at cmd/rdd/service.go:262-282, reads service.StartTime(), and calls waitForFreshDiscoveryConfigMap(ctx, startTime). The CM was just deleted and the dying service never recreates it — only a fresh rdd service serve re-runs InitDiscovery (pkg/service/cmd/service.go:766-768). The poll at cmd/rdd/service.go:156-170 runs until startWaitTimeout (90 s) and exits 4.

cmd/rdd/service.go

	client, err := kubernetes.NewForConfig(restConfig)
	if err != nil {
		return fmt.Errorf("failed to create kubernetes client: %w", err)
	}

	return wait.PollUntilContextCancel(ctx, 500*time.Millisecond, true, func(ctx context.Context) (bool, error) {
		cm, err := client.CoreV1().ConfigMaps(controllers.RDDSystemNamespace).Get(
			ctx, controllers.ControllerManagerConfigMapName, metav1.GetOptions{},
		)
		if apierrors.IsNotFound(err) {
			return false, nil // ConfigMap not created yet; retry.
		}
		if err != nil {
			return false, err
		}
		if cm.CreationTimestamp.Time.Before(beforeStart) {
			return false, nil // Stale ConfigMap from a previous run; wait for the new one.
		}
		return cm.Annotations[controllers.ReadyAnnotation] == "true", nil
	})
}

func serviceConfigAction(cmd *cobra.Command, _ []string) error {
	if err := ensureServiceRunning(cmd.Context()); err != nil {
		return err

cmd/rdd/service.go

		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !wait {
			return nil
		}
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		if err := waitForFreshDiscoveryConfigMap(ctx, startTime); err != nil {
			return cliexit.Classify(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
		return nil
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {
		controllers, err := cmd.Flags().GetString("controllers")

pkg/service/cmd/service.go

				// service alone: on Windows (TerminateProcess) it spares
				// hostagents, which live in their own process groups. On
				// Unix, SIGTERM suffices because the service handles
				// signals promptly; its slowness comes from draining
				// hostagents sequentially.
				_ = process.Kill(pid)
				err := ctx.Err()
				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)

pkg/service/cmd/service.go

	// Its creationTimestamp serves as the control plane start time.
	initClient, err := kubernetes.NewForConfig(completed.ControlPlane.Generic.LoopbackClientConfig)
	if err != nil {
		return fmt.Errorf("failed to create kubernetes client for discovery init: %w", err)
	}
	if err := controllers.InitDiscovery(ctx, initClient); err != nil {
		return fmt.Errorf("failed to initialize discovery: %w", err)
	}

	// Start shared controller manager for enabled controllers
	var enabledControllers []base.Controller

	// Get all registered controllers from the registry

Before Round 6, the already-running branch used the non-freshness waitForDiscoveryConfigMap, which accepted any surviving CM. The Round-6 freshness anchor closes a real staleness gap but promotes this pre-existing CM ordering hazard into a user-visible hang. The new test at bats/tests/20-service/6-timeout.bats:22-29 exercises precisely this sequence and relies on the service exiting fast enough for rdd svc start to take the cold-start path instead of the already-running one.

bats/tests/20-service/6-timeout.bats

@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Restart the service; the prior test killed it.
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    run -0 rdd svc status
    assert_output --partial 'has been created: true'
}

@test 'svc start --timeout=1ms exits with code 4' {
    # Reset to a known state; prior tests left the control plane half-dead.
    rdd svc delete || :
    rdd svc create

# Restart the service; the prior test killed it.
rdd svc start
run -4 rdd svc delete --timeout=1ms

Fix options, in increasing scope:

Move cleanupDiscoveryConfigMap() to the success branch of the wait loop (line 399-402), so a timed-out stop leaves the prior CM in place for the next start to consume. The already-running branch then sees a fresh-enough CM (until the service exits, at which point the cold-start path takes over).
On the already-running path, detect a missing ConfigMap via the loopback client and fall through to the cold-start path.
Document the limitation: after a timed-out stop, the user must run rdd svc delete && rdd svc create && rdd svc start.

pkg/service/cmd/service.go

				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}
			}
		}
	}

	_ = os.Remove(instance.Config()) // Ignore error as file might not exist

(important, regression — the freshness anchor in the already-running branch is new in this PR and it is what makes the pre-existing CM ordering hazard observable on the CLI.)

Suggestions ¶

S1. Second-precision freshness anchor accepts a stale ConfigMap when a prior crash lands in the same wall-clock second as the new startup Codex¶

cmd/rdd/service.go

		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForFreshDiscoveryConfigMap(ctx, startTime))
	}
	return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
}

// startAndWaitForReady starts the service, waits for the API server and the

Both the cold-start path (beforeStart := time.Now().Truncate(time.Second) at cmd/rdd/service.go:125) and the new already-running path (service.StartTime() at :99, :267) anchor freshness on a second-truncated timestamp. The predicate CreationTimestamp.Before(beforeStart) accepts a stale CM whose CT and the new startup fall in the same wall-clock second. The race requires a prior unclean shutdown that left a ReadyAnnotation=true CM behind, a new service startup whose PID-file mtime lands in the same wall-clock second as that stale CT, and the caller entering the already-running branch before InitDiscovery deletes and recreates the CM. The window is sub-second.

cmd/rdd/service.go

// the apiserver may become ready before the serve command creates the ConfigMap.
func startAndWaitForReady(ctx context.Context, serveArgs []string, timeout time.Duration) error {
	// Truncate to second precision because metav1.Time drops sub-seconds
	// during JSON serialization. Without this, a server that starts in the
	// same second would appear to have a startTime *before* beforeStart.
	beforeStart := time.Now().Truncate(time.Second)
	if err := service.Start(ctx, serveArgs); err != nil {
		return err
	}

	ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)

Demoted from Codex's Important: pre-Round-6 the already-running branch skipped the freshness check entirely and would have accepted any surviving CM unconditionally, so Round 6 is strictly better. The cold-start path has carried this second-precision limit since before the PR. A durable fix replaces the timestamp anchor with a startup nonce copied from InitDiscovery into the CM and matched in waitForFreshDiscoveryConfigMap, and would also let the docs at docs/design/cmd_service.md describe readiness identity without hedging on timestamp precision.

(suggestion, gap — pre-existing precision limit; Round 6 extends it to a new path.)

S2. watchUntil returns raw NotFound on watch-close fall-through, asymmetric with the in-loop "deleted while waiting" message Claude¶

cmd/rdd/limavm.go

		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}
	}
	// ResultChan closed. If ctx ended, return its error. Otherwise the
	// watch dropped (server close, proxy timeout, watch expiry); re-read
	// and re-evaluate to distinguish a satisfied predicate from a drop.
	if err := ctx.Err(); err != nil {
		return err
	}
	if err := c.Get(ctx, key, &vm); err != nil {
		return err
	}
	if check(&vm) {
		return nil
	}
	return fmt.Errorf("watch closed before LimaVM %q reached the desired state", key.Name)
}

// watchUntilDeleted watches a LimaVM until it is deleted.
func watchUntilDeleted(ctx context.Context, c client.WithWatch, limaVM *limav1alpha1.LimaVM) error {
	key := client.ObjectKeyFromObject(limaVM)

When a Deleted event arrives on the live channel, line 489-490 returns the descriptive "LimaVM %q was deleted while waiting". When the channel closes before that event lands and the resource is gone, the fall-through c.Get at line 503 returns an unwrapped apierrors.NewNotFound(...); the user sees "limavms.lima.rancherdesktop.io \"foo\" not found" instead of the consistent message. Same asymmetry applies to terminal-state predicates — limavm start --wait against a VM deleted mid-wait reports a stale NotFound that does not tell the user why the wait ended.

cmd/rdd/limavm.go


	for ev := range watcher.ResultChan() {
		switch ev.Type {
		case k8swatch.Error:
			return apierrors.FromObject(ev.Object)
		case k8swatch.Deleted:
			return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
		}
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}

Demoted from Claude's Important: the fall-through does surface an error rather than report a silent success, so the Round-5 regression class is closed. The remaining gap is one of message symmetry.

Fix:

     if err := c.Get(ctx, key, &vm); err != nil {
+        if apierrors.IsNotFound(err) {
+            return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
+        }
         return err
     }

(suggestion, regression — the fall-through code is new in Round 6.)

S3. Running() → StartTime() race surfaces as an opaque stat … no such file or directory error Claude¶

cmd/rdd/service.go

	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}

Running() walks PID(), which removes the PID file when the process no longer responds to signal 0 (pkg/service/cmd/service.go:158). A concurrent CLI invocation that passes between our Running() and StartTime() can remove the file, so os.Stat(PIDFile) in StartTime() returns ENOENT and the user sees the bare stat error with no context.

pkg/service/cmd/service.go

			}
			_ = process.Release()
		}
	}
	if err != nil {
		_ = os.Remove(instance.PIDFile())
		return PIDNotFound
	}
	return pid
}

Demoted from Claude's Important: the only path that removes the file is another PID() call, which requires a second process, and the gap is microseconds. Still worth handling: on os.IsNotExist, fall through to the cold-start path instead of propagating the raw error.

     startTime, err := service.StartTime()
-    if err != nil {
+    if os.IsNotExist(err) {
+        // Service exited between Running() and StartTime(); treat as cold start.
+        return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
+    } else if err != nil {
         return err
     }

*(suggestion, regression — StartTime is new; prior paths had no such dependency.)*

S4. No exit-4 BATS coverage for the shared autostart helpers (svc config, set, ctl) Codex¶

The new 6-timeout.bats asserts exit 4 for svc stop, svc start, and svc delete, and limavm-cli.bats covers lima create --start, lima start, and lima restart. The same timeout contract now flows through ensureServiceRunning for rdd svc config, rdd set, and rdd ctl (at cmd/rdd/kubectl.go), but no test asserts exit 4 on any of those call sites. A caller-specific regression that unwraps *Error before returning from the autostart path would ship green.

Fix: add one run -4 case per representative shared-helper caller — e.g. rdd svc config and rdd set running=false against a deliberately unready control plane.

(suggestion, gap.)

S5. Local wait bool shadows the imported k8s.io/apimachinery/pkg/util/wait package in serviceStartAction and serviceStopAction Claude¶

cmd/rdd/service.go

	"github.com/sirupsen/logrus"
	"github.com/spf13/cobra"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	watchtools "k8s.io/client-go/tools/watch"

	cliexit "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/cli/exit"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/developer"

Neither function references wait.PollUntilContextCancel, so no bug lands today. A future edit that adds a wait.Poll… call in either function would compile-fail on the shadow and the fix would be obscure. Renaming the local to waitFlag (or hoisting the branch selection into a helper) removes the trap. The shadow is pre-existing in serviceStopAction; this PR introduces the same shadow to serviceStartAction.

(suggestion, gap.)

S6. Delete's comment claims it suppresses only the outer/inner Running() race, but the predicate also absorbs signal-delivery failures and context.Canceled Claude¶

pkg/service/cmd/service.go

// Pass timeout 0 to wait indefinitely (bounded only by ctx).
func Delete(ctx context.Context, timeout time.Duration) error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	if Running() {
		if err := StopWithWait(ctx, true, timeout); err != nil {
			// Suppress only the outer/inner Running() race, where the
			// process exits between Delete's check and StopWithWait's.
			// Deadline expiry must surface so the CLI exits 4 and leaves
			// the directory in place, per docs/design/cmd_service.md.
			if errors.Is(err, context.DeadlineExceeded) || Running() {
				return err
			}
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())

The predicate errors.Is(err, context.DeadlineExceeded) || Running() suppresses any StopWithWait error whose post-call Running() returns false — including the process.Interrupt / process.Kill failure branch at lines 370-373 and the context.Canceled branch at lines 393-397. The context.Canceled path is unreachable today because the CLI uses context.Background() for RunE, but a future commit that wires SIGINT into the CLI context would silently proceed to RemoveAll(instance.Dir()) on Ctrl-C. Either tighten the predicate to errors.Is(err, ErrAlreadyStopped) (once StopWithWait returns a typed sentinel for that race), or expand the comment to enumerate every case the predicate absorbs.

pkg/service/cmd/service.go

	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

	if wait {
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()

		for {
			select {
			case <-ctx.Done():
				// Deadline fired or caller cancelled; terminate either way
				// to ensure the service process exits. Kill targets the
				// service alone: on Windows (TerminateProcess) it spares
				// hostagents, which live in their own process groups. On
				// Unix, SIGTERM suffices because the service handles
				// signals promptly; its slowness comes from draining
				// hostagents sequentially.
				_ = process.Kill(pid)
				err := ctx.Err()
				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}

(suggestion, regression.)

S7. setup_rdd_control_plane can now block up to 5 minutes when a prior test leaves the service stuck Claude¶

bats/helpers/controller.bash

    local patch=$3

    rdd ctl patch "${resource_type}" "${name}" -n "${RDD_NAMESPACE}" --type='merge' -p="${patch}"
}

setup_rdd_control_plane() {
    local controllers=${1:-"*"}

    rdd svc delete
    rdd svc create --controllers="${controllers}"
    rdd svc start
}

Pre-PR, rdd svc delete waited 60 s and always removed the directory (old Stop() errors were ignored). After this PR, delete waits up to 5 minutes (stopWaitTimeout) and fails with exit 4 if the service does not exit in time, leaving the directory in place. A test file whose prior run left a hostagent in a pathological state would now hang the helper for 5 minutes and then fail, rather than recovering.

Fix: bound the helper with a shorter timeout and tolerate a failed delete so a partial teardown still lets create run.

-    rdd svc delete
+    rdd svc delete --timeout=120s || :

(suggestion, gap — pre-existing helper, behaviour shift introduced by this PR.)

S8. 6-timeout.bats Test 2 depends on the dying service exiting before the next rdd svc start evaluates Running() Claude¶

bats/tests/20-service/6-timeout.bats

@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Restart the service; the prior test killed it.
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    run -0 rdd svc status
    assert_output --partial 'has been created: true'
}

@test 'svc start --timeout=1ms exits with code 4' {
    # Reset to a known state; prior tests left the control plane half-dead.
    rdd svc delete || :
    rdd svc create

Test 1 sends SIGINT then SIGTERM via process.Kill and returns. process.Kill is async on Unix: the service may still report Running() == true when Test 2's rdd svc start runs, and Test 2 then takes the already-running branch — see I2 for why that can hang 90 s on a missing CM. Test 3 already resets with rdd svc delete || :; rdd svc create; rdd svc start; applying the same reset before Test 2's rdd svc start (or a wait_for_dead polling helper) removes the timing dependency.

(suggestion, regression — new test.)

S9. --timeout help strings disagree between rdd set and every other wait-capable command Claude¶

The PR's new flags say "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)". The pre-existing rdd set --timeout says "Timeout for waiting (0 to wait indefinitely)". The PR's stated goal is to standardize this surface — aligning set's help string in this PR (or as a targeted follow-up) keeps --help output uniform. The hardcoded 300*time.Second at cmd/rdd/set.go:92 also duplicates limaLongWaitTimeout = 5 * time.Minute; promoting a single named constant prevents drift.

cmd/rdd/set.go

	}
	command.Flags().BoolVar(&dryRun, "dry-run", false,
		"Validate changes against the API server without persisting them")
	command.Flags().BoolVar(&wait, "wait", true,
		"Wait for the desired state after the patch is accepted")
	command.Flags().DurationVar(&timeout, "timeout", 300*time.Second,
		"Timeout for waiting (0 to wait indefinitely)")

	// Override help to append live property descriptions from the CRD schema.
	defaultHelp := command.HelpFunc()
	command.SetHelpFunc(func(cmd *cobra.Command, args []string) {

(suggestion, enhancement.)

S10. limavm create's --timeout help omits the --start dependency Claude¶

cmd/rdd/limavm.go

		},
	}
	command.Flags().StringVarP(&namespace, "namespace", "n", metav1.NamespaceDefault, "Namespace for the LimaVM resource")
	command.Flags().BoolVar(&dryRun, "dry-run", false, "If set, do not commit any changes to the cluster")
	command.Flags().BoolVar(&start, "start", false, "Start the VM after creation")
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the operation to complete (only applies with --start)")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
	return command
}

func newLimaVMStartCommand() *cobra.Command {
	var wait bool

--wait mentions "only applies with --start"; --timeout's help mentions only --wait=false. The wait branch at line 409 gates on start && wait && !dryRun, so --timeout is also a no-op without --start. Mention --start in the timeout help on create for parity with --wait.

cmd/rdd/limavm.go

			// Keep createdConfigMap until limaVM itself is deleted.
			createdConfigMap = nil
		}
	}

	if start && wait && !dryRun {
		waitCtx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		key := client.ObjectKeyFromObject(limaVM)
		if err := watchUntil(waitCtx, c, key, func(vm *limav1alpha1.LimaVM) bool {
			return apimeta.IsStatusConditionPresentAndEqual(vm.Status.Conditions, "Running", metav1.ConditionTrue)

(suggestion, enhancement.)

Design Observations ¶

Strengths

watchtools.ContextWithOptionalTimeout replaces the bespoke time.After channel in StopWithWait, unifying cancellation and timeout on the same primitive and inheriting client-go's "0 = forever" idiom. (in-scope) Claude
The watchUntil / watchUntilDeleted fall-through — re-Get after channel close, distinguishing a satisfied predicate from a dropped watch — closes Round 5's silent-success regression. (in-scope) Gemini
Centralizing context.DeadlineExceeded → exit 4 in cliexit.Classify keeps exit-code policy at the CLI boundary instead of scattering per-command checks. (in-scope) Codex Gemini
Anchoring waitForFreshDiscoveryConfigMap to service.StartTime() in the already-running path closes Round 5's staleness gap and makes the already-running and cold-start paths use the same predicate. (in-scope) Claude
Delete(ctx, timeout) preserves instance.Dir() on a timed-out stop, per the documented Windows directory-corruption and Unix PID-file mutual-exclusion hazards. (in-scope) Gemini

Concerns

Delete leaks StopWithWait's internal error vocabulary into the caller via errors.Is(err, context.DeadlineExceeded) || Running(). A typed sentinel from StopWithWait — or returning a *cliexit.Error directly — would let Delete express its contract in one check and also retire I1's double-classification risk by construction. (future) Claude
Readiness freshness models identity across two storage systems (filesystem mtime and Kubernetes CreationTimestamp). An explicit startup nonce copied from InitDiscovery into the ConfigMap, and matched in waitForFreshDiscoveryConfigMap, would remove clock-precision coupling and let the docs describe readiness identity exactly rather than via timestamp ordering. (future) Codex
service.Start(ctx, …) launches the serve subprocess via exec.CommandContext(ctx, …) where ctx is the CLI command context. A SIGINT during a long limavm start --wait would cancel that context and kill the just-started daemon; background daemons typically need context.Background() (or a detached context) so parent cancellation does not tear them down. Unchanged by this PR. (future) Gemini

Testing Assessment ¶

New exit-code-4 coverage hits svc stop, svc delete, svc start, lima create --start, lima start, and lima restart — good breadth against the new contract. Codex built bin/rdd and ran go test ./cmd/rdd/... ./pkg/cli/... ./pkg/service/cmd/... ./pkg/service/controllers/...; all passed.

Untested scenarios, ranked by risk:

Shared autostart helpers (S4). rdd svc config, rdd set, and rdd ctl now carry the same exit-4 contract through ensureServiceRunning, but no test asserts it.
svc delete after the service has already exited cleanly (S3 / S6). The outer/inner Running() race-suppression branch at pkg/service/cmd/service.go:449 is not exercised; a test that starts the service, delivers SIGINT directly, then runs rdd svc delete would cover it.
--timeout=0 (wait forever). Documented for svc stop, svc delete, svc start, and every limavm command, but never asserted. A regression that treats 0 as "return immediately" would ship green.
svc start --wait=false. A TODO at bats/tests/20-service/2-create.bats:3 predates this PR; the new flag handling makes regressions slightly easier to introduce.
limavm create --start --wait=false. Guarded by start && wait && !dryRun at cmd/rdd/limavm.go:409, but not exercised.
I1 / I2 regression tests. Neither Important has a regression test today: I1 needs a synthetic Rejected-wrapping-Deadline error through Classify; I2 needs a timed-out stop followed immediately by an already-running svc start with a deterministic process-exit barrier.

Documentation Assessment ¶

docs/design/cmd_service.md accurately reflects the new --wait / --timeout defaults and the exit-4 contract; the cross-link from delete to stop is clean.
docs/design/cmd_lima.md updates all four affected commands' defaults.
docs/design/api_service.md now names the SIGINT / SIGTERM / CTRL_BREAK_EVENT / TerminateProcess fallback chain exactly.
The Classify docstring at pkg/cli/exit/exit.go:53-56 reads as an idempotency guarantee. If I1 is not adopted, tighten it to say the function only preserves CodeTimeout classifications.
The comment at pkg/service/cmd/service.go:389-391 ("the service handles signals promptly; its slowness comes from draining hostagents sequentially") is accurate but should cross-reference the 45-second controller-manager drain in Run at line 847, since both bound the same wait.

Commit Structure ¶

Single commit, clean scope, message accurately describes the change.

Acknowledged Limitations ¶

Windows Interrupt fallback. pkg/service/cmd/service.go:359-369 notes that on Windows, Interrupt (GenerateConsoleCtrlEvent) fails when caller and target lack a shared console; Kill (TerminateProcess) then bypasses graceful shutdown and hostagents self-heal on the next service start via killOrphanedHostagent. Pre-existing, unchanged.
Windows log rename. pkg/service/cmd/service.go:466-470 notes that on Windows, os.Rename of log files can fail when a non-Go process (QEMU) holds a lock without FILE_SHARE_DELETE; those logs are lost. Pre-existing, unchanged.
bats/tests/20-service/2-create.bats:3 TODO. Explicit svc start --wait=false coverage still missing; this PR adds surrounding timeout coverage but not this one.

Agent Performance Retro ¶

[Claude] Carried the round: four of five consolidated Importants/Suggestions with behavioural teeth (the Classify stripping bug, the cleanupDiscoveryConfigMap ordering hazard, the watchUntil NotFound asymmetry, and the Running() / StartTime() race). The Classify finding is the kind of latent-bug spot that requires reading the docstring against the implementation; no other agent surfaced it. Two Importants (I3, I4) were borderline and consolidated as Suggestions — a calibration note rather than a miss, since the findings themselves are real.

[Codex] One sharp Important (second-precision freshness race) and one well-chosen test-gap Suggestion. The freshness finding demotes cleanly to Suggestion because Round 6 still improves strictly over Round 5 — Codex correctly identified the residual race but did not weigh the pre-PR baseline. Ran the Go unit tests and reported them green, which neither other agent did. Missed the Classify latent bug and the cleanupDiscoveryConfigMap ordering hazard.

[Gemini] Zero findings. With every Round-5 Important fixed and the remaining issues either latent (I1) or dependent on process-lifecycle timing (I2), a clean bill from the third agent is plausible — the three recorded design observations read the diff honestly. As expected, Gemini skipped git blame (quota), so regression/gap labels came entirely from Claude and Codex. Its single design concern (background-daemon lifecycle tied to CLI context) sits outside the PR scope but is worth keeping on file.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	15m 09s	7m 32s	6m 14s
Findings	2I 8S	2S	none
Tool calls	49 (Bash 28, Read 21)	57 (shell 53, plan 2, stdin 2)	41 (grepsearch 18, readfile 12, runshellcommand 10)
Design observations	2	1	3
False positives	0	0	0
Unique insights	4	2	0
Files reviewed	10	10	10
Coverage misses	0	0	0
Totals	2I 8S	2S	none
Downgraded	2 (I→S)	1 (I→S)	0
Dropped	0	0	0

Overall, Claude produced the most reviewable output this round; Codex added one durable suggestion and the only live build-and-test signal; Gemini's clean read corroborates that the Round-5 Importants genuinely closed.

Reconciliation ¶

Claude I3 (watchUntil NotFound asymmetry) → Suggestion S2. The fall-through closes the Round-5 silent-success regression; the remaining issue is message symmetry.
Claude I4 (Running() / StartTime() race) → Suggestion S3. The race requires a concurrent second CLI removing the PID file between our two calls; microsecond window.
Codex I1 (second-precision freshness) → Suggestion S1. Round 6 is strictly better than Round 5's no-freshness-check baseline; the sub-second race is a pre-existing precision limit, not a new regression.

Review Process Notes ¶

Repo context updates ¶

[repo] ADD a verification heuristic for latent idempotency guarantees in wrappers. When a new helper function's docstring says "callers may X repeatedly" or "X is idempotent," read the implementation against every existing caller that could violate the guarantee, then trace any plausible future caller the signature invites. Latent bugs of this shape rarely surface until the invited caller lands, at which point the behaviour regresses quietly. Flag when the implementation's short-circuit is narrower than the docstring claims. Rationale: the Classify finding this round existed because the docstring promised more than the implementation delivered, and no caller today exercises the gap.
[repo] ADD pairwise verification of CLI state cleanup and CLI readiness waits. When a CLI command both (a) deletes shared cluster state during shutdown and (b) another CLI command waits on that state for readiness, trace the sequence shutdown-with-timeout → state-deleted → restart-quick → readiness-wait-against-missing-state. This pattern produces user-visible hangs the individual code paths look clean. Rationale: I2 this round is exactly this shape — the stop deletes the discovery CM, the start waits for the same CM with a freshness anchor, and the window between signal send and actual process exit is the failure mode.

Skill improvements ¶

None. Both findings above are specific to this repo's shutdown / readiness protocol and do not generalise.