Deep Review: 20260422-125913-pr-336

Date	2026-04-22 12:59
Repo	rancher-sandbox/rancher-desktop-daemon
Round	5
Author	@jandubois
PR	#336 — cli: standardize --wait and --timeout across svc and limavm
Branch	`wait-timeout-cleanup`
Commits	`4bf0ec5` cli: standardize --wait and --timeout across svc and limavm
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — same SHA as Round 4 (`4bf0ec5`), so every Round 4 Important remains. Round 5 re-confirms both of them — (I1) `watchUntil` / `watchUntilDeleted` returning `nil` on a closed watch, which the new default `--wait=true` on `limavm delete` now exposes on the happy path, and (I3) `svc start`'s already-running branch calling `waitForDiscoveryConfigMap` without the `beforeStart` anchor the cold-start path uses — and surfaces one new Important: (I2) `StopWithWait`'s deadline branch now fires on parent-context cancellation too, but the error message still reads "timed out waiting for …" and `cliexit.Classify` routes it to exit 1, so a Ctrl-C during `svc stop --wait` prints a timeout message with a non-timeout exit code. Eight Suggestions follow, six of which restate unaddressed Round 4 items. No Critical.
Wall-clock time	`18 min 38 s`

Executive Summary ¶

Round 5 reviews the same commit (4bf0ec5) as Round 4, so every finding carries over. The two Round 4 Importants land again from independent agent sessions: watchUntil / watchUntilDeleted at cmd/rdd/limavm.go:485-497, 527-535 still return ctx.Err() when the watch channel closes before the condition is met — ctx.Err() is nil while the context is still valid, so every limavm create/start/stop/restart/delete --wait path silently reports success on a dropped watch (Codex + Gemini); and cmd/rdd/service.go:271 still calls waitForDiscoveryConfigMap(ctx) in the already-running branch, which resolves to waitForFreshDiscoveryConfigMap(ctx, time.Time{}) at :132-134, so the freshness check the cold-start path uses at :122-128 is skipped (Codex). The PR description's "matching the cold-start path" claim is therefore partially incorrect as written.

The new Round 5 finding is StopWithWait's deadline branch at pkg/service/cmd/service.go:373-382. Round 4 moved the wait onto watchtools.ContextWithOptionalTimeout(ctx, timeout), which means ctx.Done() now fires on parent-context cancellation (context.Canceled) as well as timeout. The branch still kills the service and returns "timed out waiting for … : context canceled" — misleading wording, and cliexit.Classify only matches context.DeadlineExceeded, so Ctrl-C during a wait produces exit 1 with a "timed out" message. Delete at :434 happens to route around this by gating its recovery on errors.Is(err, context.DeadlineExceeded) (cancellation falls through to the Running() check), but that safety is incidental.

Eight Suggestions follow, six restating unaddressed Round 4 items: Delete abandons the process it just SIGTERM'd; no BATS coverage for svc delete --timeout; ensureServiceRunning hardcodes a 90s ceiling regardless of the caller's --timeout; svc delete --timeout's help text still reads "stop-and-wait"; cliexit.Classify can double-wrap on any path that passes an already-classified error through a second Classify call; and the already-running svc start branch has no "is ready" log on success. Two new Suggestions — stopWaitTimeout's comment couples the default to unwritten LimaVM-drain orchestration, and bats/tests/33-lima-controllers/limavm-cli.bats:193's run -0 rdd limavm delete "restart-vm" is now the only BATS run -0 with no subsequent ${output} / ${status} check since the PR removed the trailing rdd ctl wait line — round out the list.

No Critical issues. Claude missed Codex/Gemini's I1 (watch fall-through) despite the pattern being spelled out in deep-review-context.md line 47 ("for ev := range watcher.ResultChan() { ... } return ctx.Err() loop converts a dropped watch into a spurious success"). Codex missed Claude's I2 (cancellation mislabel) but found the complementary I3 (discovery freshness).

Critical Issues ¶

None.

Important Issues ¶

I1. watchUntil / watchUntilDeleted return nil on a closed watch, silently succeeding Codex Gemini¶

cmd/rdd/limavm.go

	if err != nil {
		return err
	}
	defer watcher.Stop()

	for ev := range watcher.ResultChan() {
		switch ev.Type {
		case k8swatch.Error:
			return apierrors.FromObject(ev.Object)
		case k8swatch.Deleted:
			return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
		}
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}
	}
	return ctx.Err()
}

// watchUntilDeleted watches a LimaVM until it is deleted.
func watchUntilDeleted(ctx context.Context, c client.WithWatch, limaVM *limav1alpha1.LimaVM) error {
	key := client.ObjectKeyFromObject(limaVM)

When the API server closes the watch before the condition holds — proxy timeout, graceful server shutdown, or a standard watch-expiry without an Error event — ResultChan() drains and the loop falls through. ctx.Err() is nil while the context is still valid, so the caller sees success and limaVMDeleteAction at :553-557 logs "LimaVM %q deleted" though the resource may still exist.

Pre-PR, limavm delete defaulted to --wait=false, so the delete-wait path rarely ran. The PR flips the default to --wait=true at cmd/rdd/limavm.go:254-255 and removes the test's rdd ctl wait --for=delete safety nets at bats/tests/33-lima-controllers/limavm-cli.bats:129-131, 194-197, which puts watchUntilDeleted on the happy path without a backup check. The exit-code-4 contract also relies on the timeout actually firing; a silent success converts that to exit 0.

cmd/rdd/limavm.go

		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			return limaVMDeleteAction(cmd.Context(), args[0], wait, timeout)
		},
	}
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the VM to be deleted before returning")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
	return command
}

// getKubeClient returns a controller-runtime client configured for the RDD control plane.
// The returned client supports Watch for event-driven waiting on resource changes.

Fix: return a distinct "watch closed before condition met" error on fall-through. For watchUntilDeleted, re-Get by name and UID before declaring success so NotFound or a changed UID confirms deletion; for watchUntil, re-Get once and re-evaluate check before returning success.

     if ok && check(updated) {
         return nil
     }
 }
-return ctx.Err()
+if err := ctx.Err(); err != nil {
+    return err
+}
+return fmt.Errorf("watch closed before condition met for LimaVM %q", key.Name)

(important, regression)

I2. StopWithWait's deadline branch mis-labels parent-context cancellation as a timeout Claude¶

pkg/service/cmd/service.go

		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

	if wait {
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()

		for {
			select {
			case <-ctx.Done():
				// Graceful shutdown timed out; terminate so we don't leave
				// a hung service process. Kill targets only the service; on
				// Windows (TerminateProcess) this avoids killing hostagents
				// that are children of the service but run in their own
				// process groups. On Unix, SIGTERM suffices because the
				// service is responsive to signals (it's just slow shutting
				// down hostagents sequentially).
				_ = process.Kill(pid)
				return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}

Pre-PR, the wait selected on time.After(60 * time.Second) with no context channel, so the <-timeout case only fired on genuine deadline expiry. The PR unifies the parent context with the timeout via watchtools.ContextWithOptionalTimeout (line 366), which makes ctx.Done() fire on parent-context cancellation as well. A user who presses Ctrl-C during rdd svc stop --wait now gets:

the literal message "timed out waiting for … : context canceled" — the wording contradicts the cause;
exit code 1, because cliexit.Classify only matches context.DeadlineExceeded.

Delete at cmd/rdd/service.go:428-438 sidesteps the same ambiguity by gating its recovery on errors.Is(err, context.DeadlineExceeded) || Running(), so cancellation falls through to the Running() check. That safety is coincidental — Delete's current call path happens to tolerate it.

Fix: branch on ctx.Err() to pick the right wording and classify the right code:

             case <-ctx.Done():
                 _ = process.Kill(pid)
-                return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
+                err := ctx.Err()
+                if errors.Is(err, context.DeadlineExceeded) {
+                    return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
+                }
+                return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)

(important, regression)

I3. Already-running svc start --wait skips the discovery ConfigMap freshness check Codex¶

cmd/rdd/service.go

		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !wait {
			return nil
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {
		controllers, err := cmd.Flags().GetString("controllers")

The PR description frames this branch as "matching the cold-start path". The cold-start path records beforeStart := time.Now().Truncate(time.Second) and waits for a ConfigMap whose creationTimestamp is at or after beforeStart (cmd/rdd/service.go:117, 128). The already-running branch instead calls waitForDiscoveryConfigMap(ctx), which resolves to waitForFreshDiscoveryConfigMap(ctx, time.Time{}) at :132-134 — the zero time skips the freshness check per the comment at :143.

The serve process deletes and recreates the discovery ConfigMap inside Run at pkg/service/cmd/service.go:751-753, but that happens only after readiness.WaitForReady returns. If a prior serve exited uncleanly (no CleanupDiscovery), the stale ConfigMap survives with ReadyAnnotation=true, and a concurrent rdd svc start --wait that races the in-progress new serve can observe the stale CM (via both service.Wait's runtimeControllers lookup and waitForDiscoveryConfigMap) before the new controllers register.

pkg/service/cmd/service.go

	// Its creationTimestamp serves as the control plane start time.
	initClient, err := kubernetes.NewForConfig(completed.ControlPlane.Generic.LoopbackClientConfig)
	if err != nil {
		return fmt.Errorf("failed to create kubernetes client for discovery init: %w", err)
	}
	if err := controllers.InitDiscovery(ctx, initClient); err != nil {
		return fmt.Errorf("failed to initialize discovery: %w", err)
	}

	// Start shared controller manager for enabled controllers
	var enabledControllers []base.Controller

	// Get all registered controllers from the registry

The race window is narrow, but the PR's stated goal is symmetry with the cold-start path, and the symmetry is incomplete.

Fix: anchor the freshness check to a marker the running service can expose — PID-file mtime, the process's Start() time via os.Stat(instance.PIDFile()), or a small startup timestamp file written by the serve process — and pass that to waitForFreshDiscoveryConfigMap. A small helper shared with cold-start keeps both paths honest.

(important, gap)

Suggestions ¶

S1. Delete abandons the process it already SIGTERM'd Claude¶

pkg/service/cmd/service.go

// Delete removes all instance data. If the service is running, Delete waits
// up to timeout for it to exit before removing files; removing instance.Dir()
// while the serve process holds rdd.pid, rdd.sqlite3, and log files corrupts
// the directory on Windows and breaks PID-file mutual exclusion on Unix.
// Pass timeout 0 to wait indefinitely (bounded only by ctx).
func Delete(ctx context.Context, timeout time.Duration) error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	if Running() {
		if err := StopWithWait(ctx, true, timeout); err != nil {
			// Suppress only the outer/inner Running() race, where the
			// process exits between Delete's check and StopWithWait's.
			// Deadline expiry must surface so the CLI exits 4 and leaves
			// the directory in place, per docs/design/cmd_service.md.
			if errors.Is(err, context.DeadlineExceeded) || Running() {
				return err
			}
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())

When the deadline expires, StopWithWait at pkg/service/cmd/service.go:381 already calls process.Kill(pid), so the process is almost always dead before the error returns. Delete still short-circuits on the DeadlineExceeded arm and leaves the directory in place; the user runs rdd svc delete a second time, which succeeds because Running() is now false. The two-step UX is awkward given the kill usually took effect.

pkg/service/cmd/service.go

				// Windows (TerminateProcess) this avoids killing hostagents
				// that are children of the service but run in their own
				// process groups. On Unix, SIGTERM suffices because the
				// service is responsive to signals (it's just slow shutting
				// down hostagents sequentially).
				_ = process.Kill(pid)
				return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil

Consider re-checking Running() after the error and continuing with cleanup if the service has exited. Keep docs/design/cmd_service.md:117-122 in sync if you change the behaviour.

docs/design/cmd_service.md


Stops the control plane and removes the instance directory, the short directory
(which contains the Lima home), and — unless `RDD_KEEP_LOGS` is set — the log
directory.

Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place. See [`rdd service stop`](#rdd-service-stop)
for the signal and fallback details of the shutdown itself.

### `rdd service reset`

Deletes the datastore, but create a new (empty) one with the same settings.

(suggestion, enhancement)

S2. No BATS coverage for svc delete --timeout exit-code-4 Claude¶

6-timeout.bats covers svc stop --timeout=1ms. svc delete --timeout=1ms has no equivalent, so the new branching in Delete at pkg/service/cmd/service.go:428-438 (the errors.Is(err, context.DeadlineExceeded) || Running() guard and the "directory stays on timeout" contract) is asserted only in docs. A test that runs rdd svc delete --timeout=1ms, asserts exit 4, and confirms the instance directory survives (e.g. via rdd svc status reporting created: true) would pin the contract down.

(suggestion, gap)

S3. ensureServiceRunning hardcodes a 90s ceiling regardless of caller --timeout Claude¶

cmd/rdd/service.go

		RunE:  serviceConfigAction,
	}
	return command
}

func ensureServiceRunning(ctx context.Context) error {
	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
	}
	return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
}

// startAndWaitForReady starts the service, waits for the API server and the
// discovery ConfigMap to be ready. After an unclean shutdown the ConfigMap may
// survive with stale data, so the freshness check waits for a ConfigMap whose
// creationTimestamp is at or after the current startup.

ensureServiceRunning is called transitively by rdd set, every rdd limavm *, rdd kubectl, and rdd service config. It always applies the 90s startWaitTimeout, so rdd limavm create --timeout=0 (user asks for indefinite wait) can still fail with exit 4 at 90s if the service is slow coming up. The PR framing that --timeout now flows uniformly through the autostart helper is not quite accurate — the user's timeout does not propagate through the start phase.

This is not a regression (the old code had the same 90s cap) but the function is already being reshaped. At minimum, the docstring at cmd/rdd/service.go:31-33 should note that the 90s cap is independent of the caller's --timeout. Standardising on watchtools.ContextWithOptionalTimeout (Round 4 S1) would also align the style with every sibling wait in this PR.

cmd/rdd/service.go

	service "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/cmd"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/controllers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/util/tail"
)

// startWaitTimeout bounds the wait for the API server, CRDs, and
// controller-manager registration to land. Cold start can take well over 30s
// in CI on slower runners.
const startWaitTimeout = 90 * time.Second

// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration

(suggestion, enhancement)

S4. svc delete --timeout help text reads "stop-and-wait" — jargon mismatched with siblings Claude¶

cmd/rdd/service.go

	command := &cobra.Command{
		Use:  "delete",
		Long: "Delete RDD control plane",
		RunE: serviceDeleteAction,
	}
	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
	return command
}

func serviceDeleteAction(cmd *cobra.Command, _ []string) error {
	if !service.Exists() {

Every other Duration("timeout", …) in this PR uses "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)". svc delete drops --wait, which is why the wording diverges, but "stop-and-wait" reads as a noun phrase rather than a knob description. The docs at docs/design/cmd_service.md:117-122 phrase it clearly.

docs/design/cmd_service.md


Stops the control plane and removes the instance directory, the short directory
(which contains the Lima home), and — unless `RDD_KEEP_LOGS` is set — the log
directory.

Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place. See [`rdd service stop`](#rdd-service-stop)
for the signal and fallback details of the shutdown itself.

### `rdd service reset`

Deletes the datastore, but create a new (empty) one with the same settings.

-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for waiting for the control plane to exit before removing files (0 means wait indefinitely)")

(suggestion, code clarity)

S5. stopWaitTimeout comment couples the default to unwritten future behaviour Claude¶

cmd/rdd/service.go

// startWaitTimeout bounds the wait for the API server, CRDs, and
// controller-manager registration to land. Cold start can take well over 30s
// in CI on slower runners.
const startWaitTimeout = 90 * time.Second

// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration
// lands, the drain will sequence each VM's shutdown, so the CLI deadline must
// match the per-VM ceiling in [limaLongWaitTimeout]. A multi-VM instance may
// still exceed this budget while shutdownAllHostagents drains sequentially.
const stopWaitTimeout = limaLongWaitTimeout

// logrusLevelToKlog converts current logrus level to klog level.
func logrusLevelToKlog() string {
	switch logrus.GetLevel() {
	case logrus.DebugLevel:

Today the service drains for 45s at pkg/service/cmd/service.go:832, so 5m is roughly 6× the actual worst case. The rationale is tied to "when LimaVM stop orchestration lands" — unwritten work whose shape may differ when it arrives. Rot risk is real: if that change never lands, or lands with a different budget, this comment points future readers at a policy that was never implemented.

pkg/service/cmd/service.go

		mgrWg.Wait()
		close(mgrDone)
	}()
	select {
	case <-mgrDone:
	case <-time.After(45 * time.Second):
		klog.Warning("Controller manager did not shut down within 45s, exiting anyway")
	}

	return nil
}

A shorter rationale that states the current policy ("the CLI waits longer than the service drain so slow disks or misconfigured hosts don't trip the default") survives code drift. The coupling-to-limaLongWaitTimeout constant can stay; the comment just needs to justify today's value.

(suggestion, code clarity)

S6. BATS run -0 with no output/status check on limavm delete "restart-vm" Gemini¶

bats/tests/33-lima-controllers/limavm-cli.bats

    run -0 rdd ctl get limavm "restart-vm" --namespace "${LIMA_TEST_NS}" \
        --output jsonpath='{.spec.running}'
    assert_output "true"

    # Delete the VM
    run -0 rdd limavm delete "restart-vm"
}

@test "lima start --timeout fails when VM cannot reach desired state" {
    rdd ctl create configmap "timeout-vm" --namespace "${LIMA_TEST_NS}" \
        --from-literal="template=${TEMPLATE}"

The PR removed the trailing rdd ctl wait --for=delete "limavm/restart-vm" … line (diff hunk at -@@ -194,7 +191,6 @@), leaving run -0 followed only by the closing brace. Per the repo BATS style rules in deep-review-context.md:

> Use run -0 only when the next line reads ${output} or ${status}. For commands whose success is the entire assertion (setup, cleanup, fixture calls), invoke them directly.

The symmetric fix at line 94 (run -0 rdd limavm delete --wait "start-vm" → rdd limavm delete "start-vm") was applied; the restart-vm case below was missed.

bats/tests/33-lima-controllers/limavm-cli.bats

    assert_created "start-vm" "${LIMA_TEST_NS}" "start-vm"

    run -0 rdd ctl get limavm "start-vm" --namespace "${LIMA_TEST_NS}" -o jsonpath='{.spec.running}'
    assert_output "true"

    rdd limavm delete "start-vm"
}

@test "lima create with file template" {
    # Create a temporary template file
    local template_file="${BATS_TEST_TMPDIR}/test-template.yaml"

     # Delete the VM
-    run -0 rdd limavm delete "restart-vm"
+    rdd limavm delete "restart-vm"
 }

(suggestion, style)

S7. cliexit.Classify can double-wrap when called on an already-classified error Gemini¶

pkg/cli/exit/exit.go

// Timeout wraps err with [CodeTimeout].
func Timeout(err error) *Error {
	return &Error{Code: CodeTimeout, Err: err}
}

// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Other errors pass through unchanged.
func Classify(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}

errors.Is unwraps, so Classify(*Error{Code: CodeTimeout, Err: someDeadlineErr}) returns *Error{Code: CodeTimeout, Err: *Error{Code: CodeTimeout, Err: someDeadlineErr}}. The CLI driver still resolves the correct exit code via errors.As, but the nested Error structure is noisy in logs and defeats idempotence.

No current call path triggers the double-wrap (audited ensureServiceRunning → getAppKubeClient → setAction and every svc * / limavm * caller), but the call sites are growing, and the helper is a natural target for future refactors that may chain. A short-circuit on an existing *Error{Code: CodeTimeout} is cheap insurance:

 func Classify(err error) error {
+    var exitErr *Error
+    if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
+        return err
+    }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }

(suggestion, enhancement)

S8. Already-running svc start branch has no "is ready" log on success¶

The cold-start branch logs "%q control plane is ready" at :297 after the wait succeeds. The already-running branch waits on the same readiness signal (service.Wait + waitForDiscoveryConfigMap) but returns silently on success — a user running rdd svc start against a running-but-still-initialising service sees the "waiting for … to be ready" line and then nothing. Mirror the cold-start log so both paths print the same terminator.

(suggestion, observation)

Design Observations ¶

Concerns ¶

Readiness contract is split between service.Wait and CLI-side discovery freshness checks (in-scope) Codex. service.Wait inspects the discovery ConfigMap via getRuntimeControllersFromCluster without checking creationTimestamp; the CLI then layers waitForFreshDiscoveryConfigMap on top. Folding "current startup generation" into one library helper would remove the branch skew behind I3 and give every caller a single definition of "ready".
limaVMCreateAction leaks the managed ConfigMap when --start --wait hits the deadline (in-scope) Claude. After takeOwnership runs, createdConfigMap is cleared and the deferred cleanup does not fire; if the watch then hits DeadlineExceeded, the ConfigMap stays and only rdd limavm delete removes it. The behaviour is intended (the LimaVM owns the CM), but the new 5m default on --start --wait makes this path hit more often than the pre-PR implicit --timeout=0. The BATS test at bats/tests/33-lima-controllers/limavm-cli.bats:208-217 already relies on namespace teardown to reap the LimaVM; no change required, but worth a note in the user-facing docs that a timed-out --start --wait leaves the template ConfigMap behind.

Strengths ¶

Centralising deadline-to-exit-code translation in cliexit.Classify Claude. Collapses three copies of the if errors.Is(err, context.DeadlineExceeded) { return cliexit.Timeout(err) } boilerplate (cmd/rdd/set.go:322-326 is the clearest example) and makes every wait path self-consistent.
watchtools.ContextWithOptionalTimeout standardisation Claude. Renaming toolswatch → watchtools across cmd/rdd/limavm.go, cmd/rdd/service.go, and pkg/service/cmd/service.go gives timeout=0 a single documented meaning and makes text-search across the codebase reliable.
Threading caller context through StopWithWait(ctx, wait, timeout) and Delete(ctx, timeout) Claude. The old API could not honour Ctrl-C or a caller deadline. Exposing I2 was the cost of making the right architectural choice.
Keeping pkg/cli/exit library-like and applying Classify only at the CLI boundary Codex. pkg/service stays free of CLI exit-code semantics; the CLI gets a uniform contract.
svc delete stops the service before removing state Codex. Correct for both Windows file semantics and Unix PID-file mutual exclusion, and the comment at pkg/service/cmd/service.go:419-422 documents why.

Testing Assessment ¶

The new timeout contract still lacks direct entry-point coverage for rdd svc start --timeout, rdd svc delete --timeout, and autostart callers such as rdd svc config, rdd kubectl, or rdd set. The assertion that every wait-capable command exits 4 on deadline holds only for the paths the BATS suite exercises, and only rdd set (existing) and the three new limavm tests cover it.
No test exercises watchUntil / watchUntilDeleted with a closed watch channel before the condition fires (I1). The bug can ship green because the default test templates reach Running=True quickly enough that the watch stays open.
No test for the I3 startup race: one process starting the service while another runs rdd svc start --wait against a stale discovery ConfigMap. Hard to reproduce deterministically without harness-level hooks, but a unit test that drives waitForFreshDiscoveryConfigMap with a pre-seeded stale ConfigMap would at least pin the freshness contract.
No test forces a Ctrl-C through rdd svc stop --wait to cover I2. An acceptable substitute: a unit test on the StopWithWait loop body with a cancelled context that asserts the error message and the classified exit code.

Documentation Assessment ¶

docs/design/cmd_service.md and docs/design/cmd_lima.md reflect the new defaults and the exit-4 contract cleanly.
docs/design/api_service.md:21 now names both SIGINT (primary) and SIGTERM (fallback) on Unix, matching pkg/service/cmd/service.go:359-363.
cmd/rdd/service.go:131 still reads "waitForDiscoveryConfigMap polls until the discovery ConfigMap exists" — the helper at :144-168 also requires ReadyAnnotation == "true". Minor doc drift; tighten the sentence to match the implementation.

Commit Structure ¶

Single commit (4bf0ec5), self-contained, message accurate. Round 5 reviews the same SHA as Round 4 — no amend or rebase since.

Acknowledged Limitations ¶

cmd/rdd/service.go:40-41 comment acknowledges that a multi-VM instance may exceed the 5m stopWaitTimeout. S5 argues for rewording but not changing the value.
bats/tests/20-service/2-create.bats:3 carries a pre-existing # TODO: test rdd svc start --wait=false comment that the PR does not address.

Unresolved Feedback ¶

None — the PR carries no inline review comments.

Agent Performance Retro ¶

Claude ¶

Strongest on narrow, nuanced correctness: the StopWithWait cancellation mislabel (I2) was a Round 5 original that neither other agent raised, and required noticing the semantic shift when Round 4 moved the wait onto watchtools.ContextWithOptionalTimeout. Six Suggestions — most restate Round 4 items that remain unaddressed — plus a useful design note on the limaVMCreateAction ConfigMap leak under --start --wait deadline expiry. Missed I1 (watchUntil fall-through) despite deep-review-context.md line 47 spelling out the exact for ev := range ... return ctx.Err() anti-pattern. The repo-context rule was loaded into the prompt; Claude did not pattern-match to it. Also missed I3 (Codex's discovery-freshness gap).

Codex ¶

Tight, high-signal output: two Important findings and zero Suggestions. Both Importants are real — I1 (watch fall-through) and I3 (discovery freshness) — and I3 was an original Round 5 contribution grounded in reading the PR description against the code ("matches the cold-start path" vs the zero-time call). The Testing Assessment was the most useful of the three: named the I1 and I3 test gaps and called out the missing rdd svc start/delete --timeout and autostart-caller coverage. Tried to run BATS but found bats/lib/bats-core empty in the worktree — no agent can run BATS against a fresh worktree until the helper populates submodules; this is a deep-review tooling gap, not a Codex bug.

Gemini ¶

Three findings — two Important and one Suggestion — with the coverage-summary format the retro table expects. Caught I1 cleanly alongside Codex and correctly labelled it a regression based on the default flip. Flagged the BATS run -0 style violation at bats/tests/33-lima-controllers/limavm-cli.bats:193 (downgraded here to a Suggestion since run -0 already asserts exit 0, so the functional behaviour is equivalent — only the repo style rule is violated). The Classify double-wrap Suggestion restates Round 4 S5. Gemini's strength this round was the coverage grid: short, scanner-friendly, and both I1/I2 (its own numbering) mapped to files without hedging.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	8m 20s	8m 12s	3m 01s
Findings	1I 5S	2I	1I 2S
Tool calls	31 (Read 12, Bash 11, Grep 8)	53 (shell 51, stdin 2)	14 (grepsearch 7, readfile 4, runshellcommand 3)
Design observations	1	1	0
False positives	0	0	0
Unique insights	2	1	1
Files reviewed	10	10	10
Coverage misses	1	0	0
Totals	1I 5S	2I	1I 2S
Downgraded	0	0	1 (I→S)
Dropped	1	0	0

Reconciliation — Gemini raised the BATS run -0 issue at Important severity; downgraded to Suggestion (S6) in consolidation. run -0 without a follow-up ${output} / ${status} check is a style violation, not a functional failure: run -0 already asserts exit 0, so the command still fails the test if it fails. The repo BATS style rules document the convention for discoverability and consistency — important to keep, but not the same weight as a correctness bug.

Overall: Codex produced the highest value density — two valid Importants from a short output. Claude carried the most volume and a valuable original in I2 but missed a finding the prompt had pre-armed it for. Gemini's coverage grid and regression labelling were the best of the three. No agent made a false positive that survived consolidation.

Review Process Notes ¶

Repo context updates ¶

[repo] Re-reviewing the same commit is a common operating mode. Five rounds against the same SHA (4bf0ec5) on this PR — Round 5 is a re-review of Round 4's target. A note in deep-review-context.md steering agents to open the prior round's .reviews/*pr-N*.md (if any) and list unaddressed findings before generating new ones would cut prompt-to-answer latency on re-reviews and reduce duplicate Suggestions. The orchestrator already does this in the prior-round gate; arming agents to do it in parallel would produce richer first-pass output.
[repo] Pre-populate BATS submodules in agent worktrees. Codex attempted bin/rdd + BATS validation but hit an empty bats/lib/bats-core. The worktree setup in deep-review-setup does not init submodules; a future enhancement would let any agent actually run the suite against its own code read. Until then, the live-verification note in deep-review-context.md should explicitly say BATS is not available in agent worktrees and agents should use unit tests or the built rdd binary only.