Deep Review: 20260422-170229-pr-336

Date	2026-04-22 17:02
Repo	rancher-sandbox/rancher-desktop-daemon
Round	8 (of target)
Author	@jandubois
PR	#336 — cli: standardize --wait and --timeout across svc and limavm
Commits	`7b08089` cli: standardize --wait and --timeout across svc and limavm
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — address I1 (cleanupDiscoveryConfigMap on `--wait=false`) either by gating the call or by documenting the trade-off; the other findings are polish and need not block merge.
Wall-clock time	`36 min 36 s`

Executive Summary ¶

Round 8 of the CLI standardization PR. All round-7 Important findings were addressed (api_service.md fix, controller.bash hard-fail on timed delete, 120s local_setup_file bound); deferred round-7 Suggestions (context.TODO, Running/StartTime race, autostart entry-point coverage) resurface here as expected.

The core change remains sound. One new Important finding: cleanupDiscoveryConfigMap runs unconditionally on --wait=false, so a rdd svc stop --wait=false deletes the discovery ConfigMap while the daemon is still fully up and serving — pessimizing any concurrent rdd ctl/rdd kubectl caller that then polls a missing ConfigMap until startWaitTimeout (90s). The documented trade-off covers the timed-out-wait case; the --wait=false case is not called out and is strictly worse for concurrent callers.

The persistent Important gap (I2) is the autostart-entry-point BATS coverage for rdd set, rdd kubectl, rdd service config — raised in rounds 6, 7, and 8 and deferred each time. Author's call whether to add now or track as pending work.

Critical Issues ¶

None.

Important Issues ¶

I1. cleanupDiscoveryConfigMap runs unconditionally on --wait=false, pessimizing concurrent CLI callers Claude¶

pkg/service/cmd/service.go

func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Delete the discovery configmap before signaling so the kube-API
	// delete still has a live apiserver. Trade-off: a timed-out --wait
	// leaves the configmap deleted while the daemon drains, so a quick
	// `rdd svc start` polls the missing configmap until startWaitTimeout.
	// Moving cleanup into the daemon's own shutdown path is the durable fix.
	_ = cleanupDiscoveryConfigMap()

	pid := PID()
	// Try graceful shutdown first. On Unix, Kill already sends SIGTERM which
	// triggers the Go signal handler. On Windows, Kill uses TerminateProcess
	// which bypasses all handlers, so we send Interrupt (CTRL_BREAK_EVENT)

The comment calls out the timed-out-wait case, but the same cleanup fires on --wait=false, where it is strictly pessimal. The CLI returns immediately, the daemon stays up and serving, and any concurrent rdd ctl/rdd kubectl/rdd set caller that hits ensureServiceRunning then polls the missing ConfigMap in waitForFreshDiscoveryConfigMap until startWaitTimeout (90s) expires. The daemon itself never recreates the discovery ConfigMap after MarkControlPlaneReady (pkg/service/cmd/service.go:803), so the window is not self-healing — it closes only when the daemon exits.

pkg/service/cmd/service.go

	// (including terminating hostagent processes) before Run returns.
	var mgrWg sync.WaitGroup
	if len(enabledControllers) == 0 {
		// No controllers means no CRDs to install, so signal
		// readiness immediately for waiting clients.
		if err := controllers.MarkControlPlaneReady(ctx, initClient); err != nil {
			return fmt.Errorf("failed to mark control plane as ready: %w", err)
		}
	} else {
		mgrWg.Add(1)
		go func() {

The call is pre-existing (was unconditional before the PR), not a regression. The PR surfaces it by documenting --wait=false as a supported stop mode in docs/design/cmd_service.md:93-95 and removing --wait=false from svc delete.

docs/design/cmd_service.md


Runs `rdd service serve` to actually start the control plane in the background

Additional options:

*   `--wait=false`
    Return immediately, without waiting for all controllers to be ready. Run `rdd service start`
    again with no options to wait.

*   `--timeout=90s`
    How long to wait for the control plane to become ready. Pass `0` to wait indefinitely.
    If the deadline expires, `rdd` exits with code 4.

Fix: gate the call on wait, or (per the code's own TODO) move cleanup into the daemon's graceful-shutdown path so it runs only after the apiserver quiesces:

-    _ = cleanupDiscoveryConfigMap()
+    if wait {
+        _ = cleanupDiscoveryConfigMap()
+    }

If neither fix lands, document the trade-off in cmd_service.md alongside the existing --wait=false paragraph, so users know a --wait=false stop can slow subsequent auto-starting CLI calls by up to 90s.

I2. Autostart entry-point exit-code-4 contract still untested Codex¶

ensureServiceRunning (cmd/rdd/service.go:93-111) now routes context.DeadlineExceeded through cliexit.Classify, and every caller that autostarts the service inherits the exit-code-4 contract: rdd set, rdd ctl, rdd service config, and every rdd limavm * command. Round 8 adds explicit run -4 coverage for limavm create --start, start, stop, restart, and delete, closing most of the round-7 gaps. Still missing:

cmd/rdd/service.go

// ensureServiceRunning readies the control plane before a CLI command
// relies on it. It bounds the already-running and cold-start paths by
// startWaitTimeout (90s), independent of the caller's --timeout. The
// cap lets a broken service fail fast so rdd limavm *, rdd set, rdd
// kubectl, and rdd service config stay responsive on a long deadline.
func ensureServiceRunning(ctx context.Context) error {
	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForFreshDiscoveryConfigMap(ctx, startTime))
	}
	return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
}

// startAndWaitForReady starts the service, waits for the API server and the
// discovery ConfigMap to be ready. After an unclean shutdown the ConfigMap may
// survive with stale data, so the freshness check waits for a ConfigMap whose

rdd set --timeout=… exit-code-4. cmd/rdd/set.go:324-325 is the original site of the contract; no BATS test asserts rdd set returns 4 specifically.
rdd ctl --… exit-code-4 when ensureServiceRunning itself times out.
rdd service config exit-code-4 (autostart-only caller; no own wait loop).

cmd/rdd/set.go

	}

	if dryRun || !wait {
		return nil
	}
	if err := waitForDesiredState(ctx, restConfig, timeout, targetGen); err != nil {
		return cliexit.Classify(err)
	}
	return nil
}

// classifyAPIError tags admission-controller rejections with

A missing cliexit.Classify on any of these would ship green because the "uniform contract" now leans on the shared helper. Forcing ensureServiceRunning past its 90s cap in CI is expensive, so a focused Go-level test around ensureServiceRunning — asserting the returned error unwraps to *cliexit.Error{Code: 4} when the ctx deadline expires — is cheaper and equally diagnostic.

Fix: add a Go test in cmd/rdd/ that exercises ensureServiceRunning (or cliexit.Classify at the helper boundary) with a forced-timeout ctx; skip the end-to-end BATS path unless svc config is cheap to cover.

Suggestions ¶

S1. Delete returns exit 4 even when the fallback process.Kill already brought the daemon down Claude¶

pkg/service/cmd/service.go

// assumes the instance exists. If the service is running, Delete waits up to
// timeout for it to exit before removing files; removing instance.Dir() while
// the serve process holds rdd.pid, rdd.sqlite3, and log files corrupts the
// directory on Windows and breaks PID-file mutual exclusion on Unix.
// Pass timeout 0 to wait indefinitely (bounded only by ctx).
func Delete(ctx context.Context, timeout time.Duration) error {
	if Running() {
		if err := StopWithWait(ctx, true, timeout); err != nil {
			// Return the error whenever the process is alive or the
			// deadline expired, so deletion never races a live daemon
			// and the CLI exits 4 on timeout per docs/design/cmd_service.md.
			// The predicate also absorbs signal-delivery failures and
			// context cancellation when the process has already exited;
			// the invariant is "the directory survives unless the daemon
			// is confirmed gone." context.Canceled is unreachable today
			// because the CLI runs on context.Background(); wiring SIGINT
			// into cmd.Context() would let Ctrl-C fall through to deletion
			// during a live stop, and needs an explicit decision here.
			if errors.Is(err, context.DeadlineExceeded) || Running() {
				return err
			}
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())

When the deadline fires inside StopWithWait, the wait loop's ctx.Done branch calls process.Kill(pid) and returns DeadlineExceeded. The Kill often takes effect immediately (Unix apiserver's SetupSignalContext exits on the second signal; Windows TerminateProcess blocks until the process is gone). Delete's outer check then short-circuits on errors.Is(DeadlineExceeded) before Running() has a chance to reflect the post-Kill state, so a user running rdd svc delete --timeout=30s against a slow-to-drain daemon sees exit 4 and an intact directory even though the daemon is gone — they must re-run to finish.

Trade-off: Claude's proposed swap (check Running() only, drop the DeadlineExceeded short-circuit) silently succeeds on a timed-out-but-Kill-succeeded delete, contradicting the documented "deadline expires → exit 4" contract. The cleaner fix is inside StopWithWait's wait loop (see S2) — if the wait loop returns nil when the process is already gone by the time ctx.Done fires, Delete's outer predicate never sees DeadlineExceeded in this scenario and the contract stays intact.

S2. Spurious timeout when the service exits in the same scheduling tick as the deadline Claude¶

pkg/service/cmd/service.go

		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()

		for {
			select {
			case <-ctx.Done():
				// Deadline fired or caller cancelled; terminate either way
				// to ensure the service process exits. Kill targets the
				// service alone: on Windows (TerminateProcess) it spares
				// hostagents, which live in their own process groups. On
				// Unix, Kill delivers SIGTERM, which is the second
				// shutdown signal after the earlier SIGINT; the apiserver's
				// SetupSignalContext handler responds to a second signal
				// with os.Exit(1), forcing immediate exit. The daemon also
				// caps its own drain at 45s before returning from Run, so
				// timeout values shorter than stopWaitTimeout (5m) routinely
				// race a still-draining daemon.
				_ = process.Kill(pid)
				err := ctx.Err()
				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}
			}
		}
	}

	// --wait=false: the daemon may still be serving for tens of seconds.
	// Leave instance.Config() alone so concurrent `rdd ctl`/`rdd kubectl`
	// calls keep working; the next `Start` rewrites it.

Go selects randomly when both ctx.Done() and ticker.C are ready in the same tick. A graceful shutdown that lands right on the deadline can take the ctx.Done branch even though !Running() is already true, returning DeadlineExceeded (exit 4). Re-running the command succeeds, so the symptom is minor user friction — but the race also feeds S1.

Fix: double-check Running() in the ctx.Done branch before declaring a timeout:

     case <-ctx.Done():
+        if !Running() {
+            _ = os.Remove(instance.Config())
+            return nil
+        }
         _ = process.Kill(pid)
         err := ctx.Err()

This also resolves S1 without changing Delete's outer predicate.

S3. ensureServiceRunning and serviceStartAction race Running() against StartTime() Claude¶

cmd/rdd/service.go

	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}

Running() calls PID(), which removes a stale PID file when the process check fails (pkg/service/cmd/service.go:158). Between Running() returning true and StartTime() calling os.Stat(instance.PIDFile()), the daemon can exit and a second PID() invocation can remove the file. StartTime() then returns fs.ErrNotExist, which propagates as an opaque error (not Classify-able, not actionable) from an unrelated-looking call like rdd ctl get. The window is microseconds, but once a user hits it the diagnosis is hard.

pkg/service/cmd/service.go

			}
			_ = process.Release()
		}
	}
	if err != nil {
		_ = os.Remove(instance.PIDFile())
		return PIDNotFound
	}
	return pid
}

Round 7 S2 flagged the race in serviceStartAction only; this round notes the same pattern in ensureServiceRunning too.

Fix: treat a missing PID file as "daemon exited between the two checks" at both call sites. Two options:

 startTime, err := service.StartTime()
-if err != nil {
+if errors.Is(err, fs.ErrNotExist) {
+    return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
+}
+if err != nil {
     return err
 }

Or fold the retry inside StartTime() so every caller gets consistent handling.

S4. cleanupDiscoveryConfigMap still uses context.TODO() despite StopWithWait now carrying a ctx Codex Claude¶

pkg/service/cmd/service.go

	if err != nil {
		logrus.WithError(err).Debug("Could not create kubernetes client for discovery cleanup")
		return nil // Not a critical error during shutdown
	}

	err = controllers.CleanupDiscovery(context.TODO(), client)
	if err != nil {
		logrus.WithError(err).Debug("Failed to delete discovery configmap")
	}

	return nil // Don't fail stop operation due to discovery cleanup issues

The --timeout envelope does not bound the pre-shutdown discovery-cleanup RPC. If the apiserver is wedged or a connection stalls, rdd svc stop --timeout=1m / rdd svc delete --timeout=5m can hang indefinitely before the timed wait begins, contradicting docs/design/cmd_service.md:117-121 which frames --timeout as bounding the whole operation.

docs/design/cmd_service.md


Stops the control plane and removes the instance directory, the short directory
(which contains the Lima home), and — unless `RDD_KEEP_LOGS` is set — the log
directory.

Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place. See [`rdd service stop`](#rdd-service-stop)
for the signal and fallback details of the shutdown itself.

### `rdd service reset`

Deletes the datastore, but create a new (empty) one with the same settings.

Round 7 S6 raised the same three concerns (unbounded RPC, context.TODO, deferred durable fix) as a Suggestion, and the author skipped it pending the discovery-CM-cleanup pending-work item. The code comment at lines 353-357 documents moving cleanup into the daemon as the durable fix, so this stays a Suggestion rather than an Important issue — but threading the caller's ctx through cleanupDiscoveryConfigMap(ctx) is a one-line interim fix that makes the --timeout contract honest today:

pkg/service/cmd/service.go

func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Delete the discovery configmap before signaling so the kube-API
	// delete still has a live apiserver. Trade-off: a timed-out --wait
	// leaves the configmap deleted while the daemon drains, so a quick
	// `rdd svc start` polls the missing configmap until startWaitTimeout.
	// Moving cleanup into the daemon's own shutdown path is the durable fix.
	_ = cleanupDiscoveryConfigMap()

	pid := PID()
	// Try graceful shutdown first. On Unix, Kill already sends SIGTERM which
	// triggers the Go signal handler. On Windows, Kill uses TerminateProcess

-func cleanupDiscoveryConfigMap() error {
+func cleanupDiscoveryConfigMap(ctx context.Context) error {
     ...
-    err = controllers.CleanupDiscovery(context.TODO(), client)
+    err = controllers.CleanupDiscovery(ctx, client)

Codex raised this as Important. Per the repo-context rule about deliberate code-comment trade-offs, demoted to Suggestion.

S5. local_setup_file in the new timeout file duplicates setup_rdd_control_plane Claude¶

bats/tests/20-service/6-timeout.bats

# SIGTERM (or TerminateProcess on Windows) and returns
# context.DeadlineExceeded wrapped in cliexit.Timeout. svc delete also
# preserves the instance directory on timeout (see docs/design/cmd_service.md)
# so the user can retry.

local_setup_file() {
    rdd svc delete --timeout=120s
    rdd svc create
    rdd svc start
}

@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

bats/helpers/controller.bash:setup_rdd_control_plane does exactly this (and its 120s bound carries a detailed comment explaining why). Using the helper keeps the bounded-delete rationale in one place and inherits future updates automatically.

 local_setup_file() {
-    rdd svc delete --timeout=120s
-    rdd svc create
-    rdd svc start
+    setup_rdd_control_plane
 }

S6. Test 3's hard-reset rdd svc delete --timeout=120s can fail with a confusing error Claude¶

bats/tests/20-service/6-timeout.bats

    run -0 rdd svc status
    assert_output --partial 'has been created: true'
}

@test 'svc start --timeout=1ms exits with code 4' {
    # Reset to a known state; prior tests left the control plane half-dead.
    rdd svc delete --timeout=120s
    rdd svc create
    run -4 rdd svc start --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

If the prior svc delete --timeout=1ms test left a daemon that refuses to die within 120s, this line exits 4 and BATS's set -e aborts the test with the same message context deadline exceeded — indistinguishable from the failure being tested. Test 2 uses || : on its reset for exactly this reason.

Fix: add || : to match test 2, or skip the teardown entirely by reusing test 2's verified post-state.

S7. limavm delete --wait help text omits the "ignored if --wait=false" companion phrasing Claude¶

cmd/rdd/limavm.go

		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			return limaVMDeleteAction(cmd.Context(), args[0], wait, timeout)
		},
	}
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the VM to be deleted before returning")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
	return command
}

// getKubeClient returns a controller-runtime client configured for the RDD control plane.

Every --timeout flag added by this PR carries the "ignored if --wait=false" nudge. The --wait flag itself stays terse across all limavm subcommands. Not a bug; minor wording polish for symmetry.

S8. stopWaitTimeout doc comment could name the specific function that caps the drain Claude¶

cmd/rdd/service.go

// controller-manager registration to land. Cold start can take well over 30s
// in CI on slower runners.
const startWaitTimeout = 90 * time.Second

// stopWaitTimeout bounds the CLI wait for the control plane to shut down.
// Run caps the internal drain at 45s (see pkg/service/cmd/service.go).
// The CLI deadline exceeds that cap to absorb slow disks, misconfigured
// hosts, and sequential multi-VM drains. Aliased to limaLongWaitTimeout
// (defined in limavm.go) so the per-VM ceiling stays the single tuning
// knob for long CLI waits.
const stopWaitTimeout = limaLongWaitTimeout

The "see pkg/service/cmd/service.go" pointer is accurate but vague — Run lives in that file at line 692, and the 45s cap is at line 859. Naming the function (Run) in the comment helps a reader find the cross-reference in one step.

S9. cmd_service.md rdd service delete section omits the discovery-ConfigMap side-effect on timeout Claude¶

The paragraph documents the directory-preservation invariant ("the directory is left in place") but not the parallel effect: a timed-out delete (like a timed-out stop) wipes the discovery ConfigMap while the daemon continues draining, so a follow-up command that auto-starts the service polls the missing ConfigMap until startWaitTimeout (90s). The code comment at pkg/service/cmd/service.go:353-357 calls this out; the user-facing doc does not.

pkg/service/cmd/service.go

func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Delete the discovery configmap before signaling so the kube-API
	// delete still has a live apiserver. Trade-off: a timed-out --wait
	// leaves the configmap deleted while the daemon drains, so a quick
	// `rdd svc start` polls the missing configmap until startWaitTimeout.
	// Moving cleanup into the daemon's own shutdown path is the durable fix.
	_ = cleanupDiscoveryConfigMap()

	pid := PID()
	// Try graceful shutdown first. On Unix, Kill already sends SIGTERM which
	// triggers the Go signal handler. On Windows, Kill uses TerminateProcess

Fix: add one sentence to the delete section noting the discovery-ConfigMap side-effect, or cross-reference the svc stop doc if that gains a matching note.

Design Observations ¶

Concerns ¶

cleanupDiscoveryConfigMap placement is load-bearing — pkg/service/cmd/service.go:353-358 (future) Claude

The CLI-side cleanup (instead of daemon-side) is the root of both I1 and S4: --wait=false stop, timed-out stop, and timed-out delete all leave the ConfigMap deleted while the daemon is still serving. The TODO at lines 353-357 names the durable fix — delete the ConfigMap from the daemon's ctx-cancellation handler, stop touching it in StopWithWait. The move also removes the "CLI needs kubeconfig access to cleanly stop the service" coupling, which complicates permissions and error paths. Not in scope for this PR; listed here so the pending-work item carries the context.

Strengths ¶

cliexit.Classify is the right abstraction — pkg/cli/exit/exit.go:56-65 (in-scope) Claude Gemini. A single choke-point at the CLI boundary, idempotent under errors.As (re-Classifying a *cliexit.Error is a no-op), and every call site uses it consistently.
ensureServiceRunning as the hidden fan-out — cmd/rdd/service.go:93-111 (in-scope) Claude Gemini. Routing context.DeadlineExceeded through cliexit.Classify here propagates the exit-code-4 contract to every autostart caller (rdd set, rdd kubectl, rdd service config, all rdd limavm *) without repeating the classification at each site. Exactly the right place to centralize the contract.
Watch-drop handling distinguishes drop from unmet predicate — cmd/rdd/limavm.go:497-512, 550-566 (in-scope) Claude. The post-watch re-read separates "watch closed before predicate satisfied" from "context deadline" from "satisfied predicate," addressing the silent-success-on-watch-drop class of bugs.
svc delete preserves the directory on timeout — pkg/service/cmd/service.go:448-465 (in-scope) Gemini. Dropping --wait=false from svc delete and preserving the directory on timed-out wait correctly addresses Windows directory-corruption and Unix PID-file mutual-exclusion risks under a live daemon.

Testing Assessment ¶

The new BATS coverage is solid for the primary contract (exit-code-4 with context deadline exceeded). Round 8 adds limavm create --start, stop, and delete timeout assertions, closing most of the round-7 coverage gaps. Remaining gaps, in priority order:

Autostart entry-point exit-code-4 (I2) — rdd set, rdd ctl, rdd service config all route through ensureServiceRunning for classification, with no dedicated coverage. A Go-level test against ensureServiceRunning/Classify is cheaper than BATS.
svc start --wait=false — not asserted to return immediately. A regression that reinstates waiting by default would ship green. The existing TODO at bats/tests/20-service/2-create.bats:3 still tracks this.
svc stop --wait=false — not asserted to return immediately. Especially relevant given I1 (the pre-shutdown cleanup fires regardless).
--timeout=0 indefinite wait — no test asserts 0 disables the deadline. A regression that treats 0 as "immediate timeout" would silently break the documented contract.
CM disappears mid-poll — waitForFreshDiscoveryConfigMap is not tested against a concurrent rdd svc stop --wait=false that deletes the CM while the poll loop is running (the I1 scenario).

Documentation Assessment ¶

docs/design/cmd_service.md and cmd_lima.md accurately describe the new defaults, flags, and exit-code-4 contract. The round-7 api_service.md imprecision is fixed and the signal-fallback wording now matches StopWithWait precisely.

Two gaps:

cmd_service.md rdd service delete section — missing the discovery-ConfigMap side-effect on timed delete (S9 above). Parallel to the stop case; needs a matching note.
Minor wording in cmd_service.md:94-95 — "Run rdd service start again with no options to wait" is slightly ambiguous (the user can still pass --timeout). "Run rdd service start again to wait" reads cleaner.

Pre-existing test bats/tests/20-service/2-create.bats:37-44 asserts log strings ("successfully started", "waiting for ... to be ready") that no current code path emits — flagged in round 7, not introduced by this PR.

Commit Structure ¶

One commit, one concept, descriptive message. Clean. Round 8 of iteration has converged on the same single-commit shape as round 7.

Acknowledged Limitations ¶

Discovery ConfigMap deleted before signaling, leaving a startup window for new clients. pkg/service/cmd/service.go:353-359 documents the trade-off for the timed-out case; I1 extends the scope to --wait=false and S4 names the context.TODO gap. The durable fix is moving cleanup into the daemon's own shutdown path.
context.Canceled in Delete's predicate. pkg/service/cmd/service.go:455-461 notes that context.Canceled is unreachable today because the CLI runs on context.Background(). Verified: cmd/rdd/main.go passes context.Background() to newServiceCommand, and nothing else wires SIGINT into cobra's command context. Wiring SIGINT into cmd.Context() would let Ctrl-C fall through to deletion during a live stop, which needs an explicit decision.
Timeout values shorter than stopWaitTimeout race the daemon's 45s internal drain cap. pkg/service/cmd/service.go:395-398 documents this; the BATS suite avoids the race by using --timeout=1ms (deterministic loss) or --timeout=120s/5m (always wins).
Second-precision freshness anchor accepts a stale ConfigMap when a prior crash lands in the same wall-clock second as the new startup. Round 6 S1 (Codex). The truncation aligns with metav1.Time JSON serialization (a real constraint); finer-grained anchoring would require carrying sub-second precision through the ConfigMap path.
--timeout scope is intentionally not extended over ensureServiceRunning — cmd/rdd/service.go:88-91 documents the 90s cap as independent of the caller's --timeout so a broken service fails fast on long-deadline commands. Design tradeoff flagged in multiple prior rounds (round 5 S3, round 7 Design Observation), consistently treated as intentional.
Deadline-expiry path sends SIGTERM (graceful), not SIGKILL (forced) — pkg/service/cmd/service.go:389-401 documents that SIGTERM is the second shutdown signal (SIGINT was sent earlier), which the apiserver's SetupSignalContext handler responds to with os.Exit(1). The genuinely-deadlocked-daemon case accepts the hang for log flush and proper cleanup.

Agent Performance Retro ¶

Claude ¶

Deepest review again: 1 Important + 9 Suggestions + 1 Design Observation + 3 Strengths. Three are genuinely new this round: I1 (the --wait=false cleanup angle that Claude alone saw, requiring a trace through ensureServiceRunning → waitForFreshDiscoveryConfigMap → MarkControlPlaneReady to confirm the daemon doesn't recreate the CM), S1/S2 (the paired Delete exit-code 4 / wait-loop race). Five are re-raises of deferred round-7 items (S3 race, S4 context.TODO, plus test-file and doc polish) that Claude correctly tracked. One false positive: I2 clock-skew between CLI and serve subprocess — the subprocess runs on the same host kernel (exec.CommandContext, not a VM or container), so the clock-skew scenario does not apply. Claude also mislabeled I1 as "regression"; the cleanupDiscoveryConfigMap call was unconditional pre-PR, so it's a pre-existing gap surfaced by new docs.

Codex ¶

Tight and calibrated: 2 Important, no Suggestions, no false positives. I2 (autostart entry-point coverage) is the same gap Codex raised in rounds 6 and 7, framed here with sharper accounting of which entry points remain uncovered after round 8's limavm additions. I1 (context.TODO / unbounded pre-wait) is a re-raise of round 7's S6, which the author had deferred to the discovery-CM-cleanup pending-work item; demoted to Suggestion S4 per the repo-context rule about deliberate code-comment tradeoffs. Codex did not catch the --wait=false cleanup angle (Claude's I1) despite their shared focus on the same function.

Gemini ¶

Thinnest review: 1 Suggestion, dropped as a false positive. The run_e -0 nit is the exact same finding Gemini raised in round 7 (also dropped there for the same reason: assert_created → assert_info → assert_stderr_line reads $stderr, so run_e is required). Gemini did not track that this was rejected last round and repeated it verbatim. The two Strengths observations (Centralized Contract Enforcement, Graceful Deletion Handling) are useful framings but not unique. Also did not run git blame per the known daily-quota limitation.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	22m 25s	8m 33s	5m 53s
Findings	1I 9S	1I 1S	none
Tool calls	42 (Bash 25, Read 17)	52 (shell 49, stdin 3)	18 (grepsearch 17, readfile 1)
Design observations	4	0	2
False positives	1	0	1
Unique insights	3	1	0
Files reviewed	11	11	11
Coverage misses	0	0	0
Totals	1I 9S	1I 1S	none
Downgraded	0	1 (I→S)	0
Dropped	1	0	1

Claude carried this round: I1 is the single new finding with real user-visible consequence, S1/S2 together close the exit-4 race in Delete, and the doc/test polish is concrete and targeted. Codex stayed sharp on the coverage gap. Gemini added little this round — its Suggestion was a deferred-to-pending-work repeat and its Strengths observations were reductive summaries of what Claude and Codex already framed more precisely.

Review Process Notes ¶

Repo context updates ¶

[repo] When the CLI spawns the daemon as a local subprocess (exec.CommandContext), discount clock-skew findings between CLI and daemon clocks. The rdd control plane runs as a direct subprocess on the host, not inside a container or VM, so CLI-side time.Now() and subprocess-side time.Now() share the same host kernel and the same clock source. Findings that posit a clock-skew between the two processes require a VM/container boundary that does not exist here. Flag only if a code path crosses into a Lima VM or a container.