# Deep Review: 20260422-104141-pr-336

| | |
|---|---|
| **Date** | 2026-04-22 10:41 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 3 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#336](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/336) — cli: standardize --wait and --timeout across svc and limavm |
| **Branch** | `wait-timeout-cleanup` |
| **Commits** | `0b82624` cli: standardize --wait and --timeout across svc and limavm (squashed) |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — Round 2's I1, S1, S3, S4, S6, S7, S8 are all resolved in the squashed `0b82624`. Two new Important findings surface: (I1) the new `Delete` TOCTOU recovery is too broad and silently drops the timeout error on the force-kill race, directly contradicting the documented "exit 4, directory left in place" contract; and (I2) the new `StopWithWait` / `Delete` library surface does not thread `cmd.Context()`, so `--timeout=0` is uninterruptible — a larger exposure now that `0 = indefinite` is a documented user-facing option. Six small suggestions round out the list, including a `timeoutError` helper that still has an inline twin in `cmd/rdd/set.go`. |
| **Wall-clock time** | `40 min 39 s` |

## Executive Summary

Round 3 squashes the two earlier commits and resolves every Round 2 finding except the testing-gap suggestion `svc start` / `svc delete` timeout coverage, which the author's resolution table lists as an intentional skip. The new `6-timeout.bats` asserts the `svc stop --timeout=1ms` exit-code-4 path; `ensureServiceRunning` now routes through `timeoutError`, closing the cold-start gap on every autostart command; `serviceStartAction`'s already-running branch now waits for `ReadyAnnotation` like the cold-start path; the `watchtools` alias is consistent; and the architecture doc says `SIGINT` / `CTRL_BREAK_EVENT`.

Two Round-2 findings rebounded into Round 3 at Important severity. First, the Round-2 S1 recovery in `Delete` (re-check `Running()` after `StopWithWait` fails) is too broad: when `StopWithWait` hits its deadline it calls `process.Kill(pid)` and then returns the deadline error, but the subsequent `Running()` check usually sees the process already dead (the SIGTERM-via-`Kill` + k8s second-signal handler lands fast, and `TerminateProcess` is synchronous on Windows). The code then suppresses the timeout error, continues past `preserveAllInstanceLogs`, and removes the instance directory — exit code 0, directory gone — the exact opposite of what `docs/design/cmd_service.md:120-121` promises. Second, `StopWithWait` builds its own `context.Background()` instead of accepting the caller's context, so the newly documented `--timeout=0` ("wait indefinitely") is genuinely uninterruptible from the CLI's command context. Round 2 flagged this as a design concern; documenting `--timeout=0` for `svc stop` and `svc delete` in Round 3 is what elevates it to Important.

Suggestions: the new `timeoutError` helper duplicates the inline `cliexit.Timeout` pattern already in `cmd/rdd/set.go:325-328`; `--timeout=N` is silently ignored when `--wait=false` (flag help says "Timeout for --wait"); the cleanup `run -0 rdd limavm delete "start-vm"` lost its `run -0` wrapper in the `--wait` scrub; `stopWaitTimeout`'s comment ties the budget to one VM's ceiling without reconciling the sequential-shutdown multi-VM case; one `--timeout` help string uses "embedded shutdown wait" while the rest say "Timeout for --wait"; and the new `SIGINT on Unix … with forced termination as fallback` wording in both docs and a test comment overstates Unix behavior — `process.Kill` on Unix is `SIGTERM`, and forced termination there is the k8s second-signal `os.Exit(1)` path rather than `SIGKILL`.

No Critical issues. Gemini's three Criticals were either a repeat false positive (BATS `@test` header "corrupted by find-and-replace" — same hallucination as Round 2) or rightly belong one tier lower (the `Delete` race, consolidated with Claude's I1 at Important) or describe a pre-existing design trade-off (`process.Kill(pid)` mid-shutdown leaves hostagents the orchestrator hadn't reached yet — the comment at `pkg/service/cmd/service.go:377-383` explicitly calls this out, and pre-PR code already used the same `process.Kill`).

---

## Critical Issues

None.

## Important Issues

I1. **`Delete` swallows timeout errors once the force-kill wins the race** — `pkg/service/cmd/service.go:430-438`, `docs/design/cmd_service.md:117-122` [Claude Opus 4.7] [Gemini 3.1 Pro]

```go
if Running() {
    if err := StopWithWait(true, timeout); err != nil {
        // The process may exit between the outer Running check and
        // StopWithWait's inner one; treat that as success.
        if Running() {
            return err
        }
    }
}
```

The comment frames the swallow as the outer/inner `Running()` race that Round 2 S1 reported: the serve subprocess exits between `Delete`'s outer `Running()` check and `StopWithWait`'s inner one, `StopWithWait` returns a `"%q control plane is not running"` error, and `Delete` treats that as success. The code, however, swallows *every* `StopWithWait` error whenever `Running()` is false at the second check — including the deadline error.

Trace: when the deadline expires inside `StopWithWait` (`pkg/service/cmd/service.go:376-385`), the function calls `_ = process.Kill(pid)` and returns `fmt.Errorf("timed out … to stop: %w", …, ctx.Err())`. On Windows `process.Kill` is `TerminateProcess`, which is synchronous; on Unix it is `SIGTERM` and the k8s `SetupSignalContext` second-signal handler calls `os.Exit(1)` within microseconds. By the time `Delete` re-reads `Running()`, the process is almost always gone.

`Delete` then drops the wrapped `context.DeadlineExceeded`, runs `preserveAllInstanceLogs`, and calls `os.RemoveAll(instance.Dir())`. Exit code 0, directory gone, hostagents that the orchestrator had not yet reached now orphaned.

```md
If the deadline expires, `rdd` exits with code 4 and the directory is left in place.
```

`docs/design/cmd_service.md:120-121` promises the opposite: on deadline expiry the CLI is supposed to exit 4 and leave the directory intact. The documented safety story — the whole reason `--wait=false` was removed from `svc delete` in Round 1 — is "do not remove `instance.Dir()` while a writer is alive." Force-killing the writer with `Kill` then immediately deleting the directory does not restore that safety; the hostagents still in flight have not been drained, and on Windows the kill races any still-open file handles the service held. The behavior is also non-deterministic: identical invocations can flip between exit 0 / dir-removed and exit 4 / dir-intact based on signal-handler latency.

Fix: narrow the swallow so it only suppresses the "not running" error that R2 S1 asked for, and always propagate deadline expiry:

```diff
 if Running() {
     if err := StopWithWait(true, timeout); err != nil {
-        // The process may exit between the outer Running check and
-        // StopWithWait's inner one; treat that as success.
-        if Running() {
+        // Suppress only the outer/inner Running() race. A deadline
+        // expiry must surface so the CLI exits 4 and leaves the
+        // directory in place, per docs/design/cmd_service.md.
+        if errors.Is(err, context.DeadlineExceeded) || Running() {
             return err
         }
     }
 }
```

Or, more robustly, have `StopWithWait` return a typed sentinel for the "not running" case so the recovery matches on identity rather than on the negation of `Running()`.

(important, regression — the squash's R2 S1 fix introduced the over-broad recovery. Consolidated from Claude I1 and Gemini C2; Gemini's Critical classification is consolidated to Important because the scenario requires a deadline to actually fire on `svc delete`, and the 5m default makes that rare in practice.)

I2. **`StopWithWait` / `Delete` ignore the caller's context, so `--timeout=0` is uninterruptible** — `pkg/service/cmd/service.go:338, 365-370, 426` [Claude Opus 4.7] [Codex GPT 5.4] [Gemini 3.1 Pro]

```go
func StopWithWait(wait bool, timeout time.Duration) error {
    ...
    if wait {
        ctx := context.Background()
        if timeout > 0 {
            var cancel context.CancelFunc
            ctx, cancel = context.WithTimeout(ctx, timeout)
            defer cancel()
        }
```

`cmd/rdd/service.go:336` calls `service.StopWithWait(wait, timeout)`; `cmd/rdd/service.go:377` calls `service.Delete(timeout)`. Both entry points discard `cmd.Context()` and build a new `context.Background()` inside the wait loop. With `timeout == 0` (documented in `cmd/rdd/service.go:354` and `:364` as "wait indefinitely"), the wait loop's `select` only fires on the 100 ms ticker and the never-cancelled `ctx.Done()` — there is no parent channel a caller could close to abort. A user running `rdd svc stop --timeout=0` against a wedged service has no in-band cancellation path; the only exit is Go's default SIGINT handler terminating the CLI process itself, which leaves the half-killed service behind.

Every other wait site this PR touches threads `cmd.Context()` into the wait — `startAndWaitForReady`, `limaVMSetRunningAction`, `limaVMRestartAction`, `limaVMCreateAction`, `limaVMDeleteAction`, the already-running `serviceStartAction` branch. `StopWithWait` is the only outlier, and the fact that it is the one place where `--timeout=0` is a user-visible knob is what makes the gap visible.

Round 2 flagged this as a design concern (Claude's) and also as an Important finding (Gemini's, which we consolidated to a design observation because `--timeout=0` was not then documented). Round 3 documents `--timeout=0` on `svc stop`, `svc delete`, and every `limavm` wait path, which enlarges the exposure from "theoretical future signal handler" to "a user-facing option that cannot be cancelled from user-facing code."

Fix: thread the context through and use the same `watchtools.ContextWithOptionalTimeout` helper every other wait site uses:

```diff
-func StopWithWait(wait bool, timeout time.Duration) error {
+func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
     ...
     if wait {
-        ctx := context.Background()
-        if timeout > 0 {
-            var cancel context.CancelFunc
-            ctx, cancel = context.WithTimeout(ctx, timeout)
-            defer cancel()
-        }
+        ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
+        defer cancel()
```

```diff
-func Delete(timeout time.Duration) error {
+func Delete(ctx context.Context, timeout time.Duration) error {
     ...
     if Running() {
-        if err := StopWithWait(true, timeout); err != nil {
+        if err := StopWithWait(ctx, true, timeout); err != nil {
```

Update the two callers in `cmd/rdd/service.go:336, 377` to pass `cmd.Context()`.

(important, regression — the new `--timeout=0` knob is what widens the pre-existing design concern. Consolidated from Claude I2, Codex I1, Gemini I1.)

## Suggestions

S1. **`timeoutError` duplicates the inline `cliexit.Timeout` pattern already in `set.go`** — `cmd/rdd/service.go:47-52`, `cmd/rdd/set.go:325-328` [Claude Opus 4.7]

```go
// cmd/rdd/service.go
func timeoutError(err error) error {
    if errors.Is(err, context.DeadlineExceeded) {
        return cliexit.Timeout(err)
    }
    return err
}

// cmd/rdd/set.go
if err := waitForDesiredState(ctx, restConfig, timeout, targetGen); err != nil {
    if errors.Is(err, context.DeadlineExceeded) {
        return cliexit.Timeout(err)
    }
    return err
}
```

The new helper exists to factor exactly the `set.go` pattern, but `set.go` itself was not rewired to use it. The helper also lives in `cmd/rdd/service.go`, so `set.go` would have to import it from a file it does not otherwise depend on.

Fix: move the helper to `pkg/cli/exit/exit.go` (where every CLI entry point already imports it) and call it from all three sites. Naming it `exit.Classify(err)` or similar frees the `service.go` filename from owning a package-cross-cutting utility.

(suggestion, gap — the PR's own refactoring opportunity, missed.)

S2. **`--timeout=N` is silently ignored when `--wait=false`** — `cmd/rdd/service.go:354`, `cmd/rdd/limavm.go:151, 164, 180, 196, 254` [Claude Opus 4.7]

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
```

`rdd svc stop --wait=false --timeout=10s`, and the equivalent on every `limavm` wait-capable subcommand, accepts both flags and then ignores `--timeout`. The help text "Timeout for --wait" hints at the dependency but does not tell a user their value is dead. Users routinely combine both flags when scripting (a common pattern: "don't block me, but still error out after N seconds").

Fix: either reject the combination at parse time, or make the dependency explicit in the flag help:

```diff
-command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
+command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
```

(suggestion, enhancement)

S3. **Cleanup `rdd limavm delete` lost its `run -0` wrapper** — `bats/tests/33-lima-controllers/limavm-cli.bats:94` [Claude Opus 4.7]

```diff
-    run -0 rdd limavm delete --wait "start-vm"
+    rdd limavm delete "start-vm"
```

The Round 2 fix to drop the now-redundant `--wait` flag (R2 S4) also dropped the `run -0` wrapper. BATS still fails the test via `set -e` if the command exits non-zero, but the scoped exit-code assertion and captured `$output` / `$status` are gone, and the file no longer tells a reader that the exit code matters. Repo BATS style ([BATS style rules](bats-style.md)) prefers `run -0`.

Fix:

```diff
-    rdd limavm delete "start-vm"
+    run -0 rdd limavm delete "start-vm"
```

The earlier cleanup at `:78` kept `run -0`, so the asymmetry is just this one site.

(suggestion, regression — Round 2 S4 fix collateral.)

S4. **`SIGINT on Unix … with forced termination as fallback` overstates what Unix actually does** — `docs/design/api_service.md:21`, `docs/design/cmd_service.md:104-109`, `bats/tests/20-service/6-timeout.bats:3-6` [Codex GPT 5.4]

```md
The control plane is notified to shut down, either by `rdd service stop` (SIGINT on Unix, `CTRL_BREAK_EVENT` on Windows, with forced termination as fallback), or by the OS preparing to shut down the host.
```

Readers interpret "forced termination" as `SIGKILL` on Unix. `pkg/util/process/process_unix.go:30-32` defines:

```go
// Kill sends SIGTERM to the process with the given PID.
func Kill(pid int) error {
    return unix.Kill(pid, unix.SIGTERM)
}
```

The fallback `process.Kill(pid)` sends another `SIGTERM`, which the service catches. Forced termination on Unix happens only because the k8s `SetupSignalContext` handler routes the second `SIGTERM` through `os.Exit(1)` — not because `Kill` itself is forceful. Users who rely on the "forced" wording to reason about hostagent cleanup or unkillable writers will be surprised when `rdd svc stop --timeout=1ms` returns exit 4 and `svc status` still shows the pid alive for a moment afterwards.

Fix: either match the Unix implementation ("SIGINT on Unix with SIGTERM as fallback; on Windows, `CTRL_BREAK_EVENT` with `TerminateProcess` as fallback") or strengthen the Unix path to genuinely force-kill (`process.KillTree` or `unix.SIGKILL`) — but not without reconciling with the pre-existing "leave hostagents in their own process groups" decision.

(suggestion, regression — the new doc wording is what makes the Unix contract ambiguous.)

S5. **`stopWaitTimeout`'s rationale covers one VM, not the multi-VM worst case** — `cmd/rdd/service.go:37-42` [Claude Opus 4.7]

```go
// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration
// lands, the drain will sequence each VM's shutdown, so the CLI deadline must
// match the per-VM ceiling in [limaLongWaitTimeout].
```

The comment states the budget "must match the per-VM ceiling" — a reader leaves thinking 5m is sized for the worst case. In a multi-VM deployment, `shutdownAllHostagents` drains hostagents sequentially (per `MEMORY.md`: "`shutdownAllHostagents waits sequentially — TODO`"), so N VMs at the per-VM ceiling can exceed 5m, and the `Run` drain hard-caps at 45s (`pkg/service/cmd/service.go:832`) regardless of the CLI's patience.

Fix: reconcile the comment with the sequential-drain reality, or raise the budget when the parallel-shutdown TODO is resolved. Either:

```diff
-// match the per-VM ceiling in [limaLongWaitTimeout].
+// match the per-VM ceiling in [limaLongWaitTimeout]; a multi-VM
+// instance may still exceed this budget while shutdownAllHostagents
+// drains sequentially.
```

(suggestion, gap — comment accuracy; does not block the merge.)

S6. **`svc delete --timeout` help text diverges from the other `--timeout` flags** — `cmd/rdd/service.go:364` [Claude Opus 4.7]

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the embedded shutdown wait (0 means wait indefinitely)")
```

Every other `--timeout` flag in the PR uses "Timeout for --wait (0 means wait indefinitely)". `svc delete` has no `--wait` flag, so the wording diverges deliberately, but "embedded shutdown wait" reads oddly next to the docs, which describe the delete path as "always waits for the control plane to exit before removing files." Minor polish:

```diff
-command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the embedded shutdown wait (0 means wait indefinitely)")
+command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
```

(suggestion, enhancement.)

---

## Design Observations

### Concerns

- **`process.Kill(pid)` mid-shutdown orphans any hostagent the orchestrator had not yet reached** (future) [Gemini 3.1 Pro]. The comment at `pkg/service/cmd/service.go:377-383` acknowledges the trade-off and `MEMORY.md` tracks the sequential-shutdown TODO. Gemini raised this at Critical; it is pre-existing (pre-PR code used the same `process.Kill`) and the author's comment explicitly chose this path over `KillTree` to avoid killing hostagents that survive in their own process groups and self-heal via `killOrphanedHostagent` on next start. Flagged here because I1's narrower `errors.Is(…DeadlineExceeded)` fix still leaves the underlying "hostagents not drained on timeout" exposure; a future ordered-shutdown design (see [`shutdown-ordering-design.md`](shutdown-ordering-design.md)) is the right place to address it.
- **Two `service.Wait` callers still construct a plain `context.WithTimeout` instead of `watchtools.ContextWithOptionalTimeout`** (in-scope) [Claude Opus 4.7]. `cmd/rdd/service.go:105-106` still uses `context.WithTimeout(ctx, startWaitTimeout)`, and `pkg/service/cmd/service.go:202-203` uses `context.WithTimeout(ctx, 15*time.Second)`. The first is fine because `startWaitTimeout` is a non-zero constant; the second is internal to `getRuntimeControllersFromCluster` and is bounded by design. Mentioned only because the "zero means infinite" pattern elsewhere makes the two stragglers look inconsistent at a glance; no fix needed.

### Strengths

- **`limaLongWaitTimeout` is the single source of truth for the long wait** [Claude Opus 4.7]. `cmd/rdd/limavm.go:41` defines it once; every `limavm` subcommand and `stopWaitTimeout` in `cmd/rdd/service.go:42` reference it by name. Tuning the per-VM ceiling flows automatically through the CLI. [Codex GPT 5.4] also flagged the boundary shape as a strength: `timeoutError` lives in `cmd/rdd/service.go` so the service package stays reusable while CLI-facing commands get consistent exit-code behavior.
- **`watchtools.ContextWithOptionalTimeout` handles the "0 means infinite" contract at every wait site** [Claude Opus 4.7]. Five-plus call sites in `limavm.go` plus the two in `serviceStartAction`. Hand-rolling `if timeout == 0` would have been easy to miss at one site; the helper keeps the contract in one place.
- **`serviceStartAction`'s already-running branch now waits for `ReadyAnnotation`** [Codex GPT 5.4] [Claude Opus 4.7]. Round 2 S3 fix; the cold-start and warm-start paths now share the same readiness contract, so the documented `--wait=false` → `rdd svc start` workflow delivers on its promise.
- **`timeoutError` at the CLI boundary plus `errors.Is(err, context.DeadlineExceeded)`** [Claude Opus 4.7]. The `%w` wrap chain propagates through `fmt.Errorf("failed waiting …: %w", err)` wrappers so each call site can add context freely without losing the classification.
- **Removing `--wait=false` from `svc delete` is the right shape** [Claude Opus 4.7]. There is no safe "non-blocking delete" while the writer still holds the directory; the doc now explains the reason.

## Testing Assessment

Coverage of the exit-code-4 contract is broad: each wait-capable `limavm` subcommand has a `run -4` regression test, and `6-timeout.bats` asserts `svc stop --timeout=1ms`. Unaddressed gaps, carried forward with round 2 context:

1. **`rdd svc start --wait --timeout=…` deadline expiry returns exit code 4.** R2 S2 unaddressed (intentional skip per R2 resolution table). Coverage gap remains; a `--timeout=1ms` against a slow-starting control plane would prove the contract.
2. **`rdd svc delete --timeout=…` deadline expiry returns exit code 4 and leaves the directory in place.** R2 S2 unaddressed (intentional skip). Also directly ties to I1 — a test here would have caught the over-broad swallow.
3. **`ensureServiceRunning` deadline expiry during `rdd set` / `rdd limavm *` / `rdd kubectl` / `rdd service config` returns exit code 4.** R2 testing gap #5 unaddressed (intentional skip). The fix at `cmd/rdd/service.go:107-112` is correct; no test exercises it.
4. **`--timeout=0` (indefinite wait) succeeds when the desired state eventually arrives.** R2 testing gap #4 unaddressed (intentional skip). A regression that accidentally used `context.WithTimeout(ctx, 0)` (which fires immediately) would ship green.
5. **Ctrl-C (`cmd.Context()` cancellation) during `svc stop --timeout=0` / `svc delete --timeout=0`.** New. Ties to I2; a cancelled caller context should abort the wait.

The author's R2 resolution table marks items 1-4 as "Skipped"; these remain gaps, not re-raises. Item 5 is new and follows I2.

## Documentation Assessment

- `docs/design/cmd_service.md:117-122` promises "exit 4, directory left in place" on deadline expiry — see I1; the code does not currently deliver this in the common case.
- `docs/design/cmd_service.md:93-99` correctly replaces `--no-wait` with `--wait=false` and adds `--timeout`. Good.
- `docs/design/cmd_service.md:104-109`, `docs/design/api_service.md:21`, and the `6-timeout.bats` header comment all say "forced termination as fallback." On Unix the fallback is a second `SIGTERM`; forced termination there is the k8s second-signal `os.Exit(1)` rather than `SIGKILL`. See S4.
- `docs/design/cmd_lima.md` — defaults consistent with the code.
- Comment at `cmd/rdd/service.go:120-122` ("the API server readiness poll and the ConfigMap freshness poll share the single deadline") is correct when `timeout > 0` and vacuously true when `timeout == 0`. Fine.

## Commit Structure

Single commit, single concept — the squash is clean and the commit body accurately enumerates the shipped defaults. Round 2 S5 (commit body lists pre-fixup defaults) is resolved.

## Acknowledged Limitations

- Comment at `pkg/service/cmd/service.go:377-383` documents the deliberate trade-off of `process.Kill` leaving hostagents the orchestrator had not yet drained. Accurate; Gemini's Critical C3 is consolidated with this.
- Comment at `pkg/service/cmd/service.go:432-433` describes the outer/inner `Running()` race that R2 S1 fixed; the comment understates the swallow's actual scope — promoted to I1 rather than left here.
- `cmd/rdd/service.go:32-42` justifies `startWaitTimeout` (90s) and ties `stopWaitTimeout` to the per-VM ceiling. The latter's multi-VM caveat is S5.
- Comment at `pkg/service/cmd/service.go:444-454` documents the Windows `os.Rename` / `FILE_SHARE_DELETE` limitation in `preserveAllInstanceLogs`. Accurate; unaffected by this PR.
- The R2-acknowledged PID-recycling race in `StopWithWait` (`Running()` returns true, then `pid := PID()` finds the file stale, `process.Interrupt(0)` on Unix signals the calling process group) is still present. Pre-existing; not widened by this PR.

## Unresolved Feedback

No prior PR review comments on this PR.

---

## Agent Performance Retro

### [Claude]

Claude carried the round again: both consolidated Important findings are Claude's, and it produced five of the six suggestions. The I1 framing ("R2 S1 fix is too broad; it drops *every* StopWithWait error, not just the 'not running' race") is the clearest of the three agents' takes on the same code — Codex didn't raise it and Gemini raised it at Critical without tracing why the swallow fires in the force-kill path. Unique suggestions: `timeoutError` duplicating the inline `cliexit.Timeout` in `set.go` (S1), `--timeout=N` silently ignored when `--wait=false` (S2), the collateral loss of `run -0` on the cleanup delete (S3), the `stopWaitTimeout` comment overstating multi-VM safety (S5), and the `svc delete --timeout` help-text wording divergence (S6). The one design observation worth noting: Claude correctly demoted its own R2 "StopWithWait ignores cmd.Context()" concern to Important this round because `--timeout=0` is now a documented user-facing option — calibrated severity movement.

### [Codex]

Codex delivered one Important (the `StopWithWait`/`Delete` context-ignoring surface, consolidated with Claude's and Gemini's matching finds into I2) and one Suggestion — but the Suggestion came with a live-verified demonstration. Codex actually built the CLI, ran `./bin/rdd svc stop --wait --timeout=1ms` on its macOS worktree, and observed that `svc status` still reported the daemon PID as running immediately afterward, which is exactly what S4's doc-accuracy claim rests on. That is the kind of evidence no amount of code-reading replaces. No false positives; one tight Important that both other agents also caught; one unique doc-accuracy find with real artifacts.

### [Gemini]

Gemini produced three Criticals, two of which do not hold up. C1 claimed the BATS `@test` decorator has been corrupted to `@pkg/util/tail/tail_test.go` in both `6-timeout.bats` and `limavm-cli.bats` — identical hallucination to Round 2, where it was also consolidated as a false positive. The worktree `grep '^@test\|^@pkg'` shows every `@test` line intact. C3 (Kill orphans hostagents) is pre-existing (pre-PR code used the same `process.Kill`) and directly contradicted by the author's intent comment at `pkg/service/cmd/service.go:377-383` — Gemini read the code but not the comment that explains why the call is `Kill` and not `KillTree`. C2 is real but Gemini rated it Critical; we consolidate at Important alongside Claude's matching finding, because a `svc delete` deadline with the 5m default is a rare hit and the consequence (data removed + exit 0) is bad but not unrecoverable. I1 is a genuine concurring catch on the caller-context gap. Gemini continues to skip `git blame`, so pre-existing-vs-regression calibration falls to Claude and Codex; the C3 mislabel as "regression" is one concrete downstream cost of that skip.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 35m 08s | 6m 33s | 6m 37s |
| Findings | 2I 6S | 1I 1S | 1I |
| Tool calls | 70 (Bash 34, Read 22, Grep 14) | 39 (shell 35, stdin 4) | 11 (grep_search 8, run_shell_command 2, read_file 1) |
| Design observations | 5 | 3 | 1 |
| False positives | 0 | 0 | 1 |
| Unique insights | 5 | 1 | 0 |
| Files reviewed | 8 | 8 | 8 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **2I 6S** | **1I 1S** | **1I** |
| Downgraded | 0 | 0 | 1 (I→S) |
| Dropped | 0 | 0 | 2 |


**Reconciliation.** Three severity changes:

- Gemini C1 (BATS `@test` header corrupted): **dropped as false positive.** The file in `$WORKTREE_lead` contains valid `@test` declarations at `6-timeout.bats:14` and `limavm-cli.bats:196, 208, 219, 232`; Gemini hallucinated the corruption — same false positive as Round 2. `false_positives += 1`.
- Gemini C2 (`Delete` race data loss): **downgraded to I1.** The regression is real and important, but `svc delete` with the 5m default makes the race rare in practice; Critical is reserved for issues with broader impact or more certain triggers. Consolidated with Claude's matching Important.
- Gemini C3 (`process.Kill` orphans hostagents): **demoted to design observation (pre-existing).** Pre-PR code already used `process.Kill(pid)` in the timeout-kill path; the author's comment at `pkg/service/cmd/service.go:377-383` names the trade-off explicitly. Not introduced by this PR, and Gemini's suggested fix (`KillTree`) contradicts the author's intent of letting hostagents survive in their own process groups for `killOrphanedHostagent` to reclaim on next start.

Overall: Codex's live-verified S4 and Claude's I1 framing are the two findings that materially improve the review this round. Gemini's single valid find (I1 caller-context, already caught by the others) does not offset two mislabeled Criticals and a repeat hallucination; with the PR's small scope, Codex + Claude alone would have produced the same consolidated set.

## Review Process Notes

### Skill improvements

- **Before accepting a severity change that is "narrower than the originating agent claimed," verify the narrowing against the code path, not the abstract concept.** When multiple agents raise the same structural concern across rounds (e.g., "this wait ignores the caller's context"), check whether the PR under review changes what that concern means for users. A design concern about an internal library call becomes an Important finding the moment the PR documents a user-facing knob that would exercise the concern. A reviewer recognises the pattern when a prior-round design observation covers the same code path as a current-round finding: compare the round's user-facing contract, not just the code. Going the other way, an under-labeled "regression" for a pre-existing line is a common miscalibration that costs a consolidation round — drop the regression label on any finding whose cited lines existed before the PR's merge-base.

### Repo context updates

- **Add:** When the CLI adds a new `--timeout=0` ("wait indefinitely") user-facing option, verify that the underlying wait loop accepts a caller's `context.Context` and derives its timed child from it. A `context.Background()` wait loop plus a documented indefinite mode combine to produce a genuinely uninterruptible CLI. Reason: `StopWithWait` in this review shipped both in the same PR without the connection being noticed during Round 1 or Round 2; the exposure surfaced only in Round 3 after the user-facing doc explicitly promoted `--timeout=0`.

---

## Appendix: Original Reviews

### [Claude Opus 4.7]

# Code Review: PR #336 — cli: standardize --wait and --timeout across svc and limavm

## Executive Summary

A focused refactor that standardizes `--wait`/`--timeout` flags across `rdd svc` and `rdd limavm`, routes `context.DeadlineExceeded` through `cliexit.Timeout` for exit code 4, and tightens the safety story for `svc delete` (no more `--wait=false`). The behavioral change in `serviceStartAction`'s already-running branch (now also waits for `ReadyAnnotation`) is a real bug fix. One important documentation/behavior contradiction in `Delete`'s timeout handling, plus a few suggestions.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `Delete` swallows timeout errors when force-kill races ahead of the second `Running()` check — contradicts docs** — `pkg/service/cmd/service.go:430-438`, `docs/design/cmd_service.md:117-122`

```go
if Running() {
    if err := StopWithWait(true, timeout); err != nil {
        // The process may exit between the outer Running check and
        // StopWithWait's inner one; treat that as success.
        if Running() {
            return err
        }
    }
}
```

The comment frames the swallow as "outer/inner `Running()` race", but the code swallows *any* `StopWithWait` error whenever `Running()` is false at the second check, including the timeout error. Trace: when the deadline expires inside `StopWithWait` (line 376-385), the function force-kills via `process.Kill(pid)` (SIGTERM on Unix, `TerminateProcess` on Windows) and then returns the timeout error. By the time `Delete` runs the second `Running()` check (line 434), the process is usually dead — Go's signal handler runs in microseconds, and `TerminateProcess` is synchronous. The error is silently dropped, `Delete` continues, removes the directory, and returns nil → exit 0.

Meanwhile `docs/design/cmd_service.md:120-121` claims:

> If the deadline expires, `rdd` exits with code 4 and the directory is left in place.

This is observably false in the common case. With `--timeout=1ms` against a healthy service, the kill almost always wins the race, the user gets exit 0 and a removed directory — the opposite of what the docs promise.

The behavior is also racy: identical invocations can flip between exit 0/dir-removed and exit 4/dir-intact based on signal-handler latency.

Fix: pick one. Either always honor the timeout exit code when the deadline fires (simpler, predictable), or update the docs to acknowledge the race. The first option matches the user-visible contract and the rest of the PR's "exit 4 on deadline" theme:

```diff
 if Running() {
     if err := StopWithWait(true, timeout); err != nil {
-        // The process may exit between the outer Running check and
-        // StopWithWait's inner one; treat that as success.
-        if Running() {
-            return err
+        // Only the "control plane is not running" race (process exited between
+        // the outer Running check and StopWithWait's inner one) should be
+        // treated as success. Timeout / kill failures must surface.
+        if !errors.Is(err, errNotRunning) {
+            return err
         }
     }
 }
```

(introducing a sentinel for the "not running" error, or matching the message). Alternatively, scope the swallow to *only* when `StopWithWait` returned its own "not running" error — never to timeout errors. (important, regression)

---

**I2. `--timeout=0` on `svc stop`/`svc delete` permanently blocks the CLI even when the user Ctrl+Cs** — `pkg/service/cmd/service.go:365-370`

```go
if wait {
    ctx := context.Background()
    if timeout > 0 {
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, timeout)
        defer cancel()
    }
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()
```

`StopWithWait` constructs its wait context from `context.Background()` rather than accepting one from the caller (line 338 signature: `StopWithWait(wait bool, timeout time.Duration)`). With `--timeout=0` (documented as "wait indefinitely"), the wait loop has no parent cancellation path. If the user hits Ctrl+C during `rdd svc stop --timeout=0` against a wedged service, Go's default SIGINT handler still terminates the CLI process, so this is rarely visible — but Cobra command contexts already exist (`cmd.Context()`) and the rest of the PR is plumbing them through. `StopWithWait` is the only wait helper in this PR that ignores them.

Fix: take a context parameter and derive the timed child from it.

```diff
-func StopWithWait(wait bool, timeout time.Duration) error {
+func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
     ...
     if wait {
-        ctx := context.Background()
-        if timeout > 0 {
-            var cancel context.CancelFunc
-            ctx, cancel = context.WithTimeout(ctx, timeout)
-            defer cancel()
-        }
+        ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
+        defer cancel()
```

Update callers in `cmd/rdd/service.go:336` and `pkg/service/cmd/service.go:431` (Delete) to pass `cmd.Context()` / a derived context. (important, gap)

### Suggestions

**S1. `timeoutError` helper duplicates the inline `cliexit.Timeout` pattern in `cmd/rdd/set.go:325-328`** — `cmd/rdd/service.go:47-52`

```go
func timeoutError(err error) error {
    if errors.Is(err, context.DeadlineExceeded) {
        return cliexit.Timeout(err)
    }
    return err
}
```

`cmd/rdd/set.go:324-329` still has the inline form:

```go
if err := waitForDesiredState(ctx, restConfig, timeout, targetGen); err != nil {
    if errors.Is(err, context.DeadlineExceeded) {
        return cliexit.Timeout(err)
    }
    return err
}
```

This PR introduces the helper but doesn't rewire the existing call site. The helper currently lives in `cmd/rdd/service.go` alongside other service-specific helpers, but the helper itself is not service-specific. Move it into `pkg/cli/exit/exit.go` (e.g. `exit.Classify(err)`) and switch both `set.go` and the new call sites in `service.go`/`limavm.go`. (suggestion, gap)

---

**S2. `--timeout` is silently ignored when `--wait=false`** — `cmd/rdd/service.go:336`, `cmd/rdd/limavm.go:228, 408, 448, 552`

The flag combination `--wait=false --timeout=10s` is accepted without warning. Across `svc stop`, `limavm start/stop/restart/create --start/delete`, the timeout is read but unused when wait is false. The flag help text says "Timeout for --wait" so the user gets no signal that their value is dead. Either reject the combination at parse time, or note it in `--timeout`'s help text:

```diff
-command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
+command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
```

(suggestion, enhancement)

---

**S3. `stopWaitTimeout` rationale comment overstates current safety with multiple VMs** — `cmd/rdd/service.go:37-42`

```go
// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration
// lands, the drain will sequence each VM's shutdown, so the CLI deadline must
// match the per-VM ceiling in [limaLongWaitTimeout].
```

The comment justifies 5m as "match[ing] the per-VM ceiling" so the deadline accommodates *one* slow VM. But existing memory/code shows hostagent shutdown is already sequential (`shutdownAllHostagents waits sequentially — TODO`), so with N VMs a real-world stop can comfortably exceed `5m` even today, and the inner `Run` drain is hard-capped at 45s (line 832). The comment reads as if the 5m budget is sized for the worst case, but it's only sized for *one* worst-case VM. Either make it explicit ("matches the per-VM ceiling on the assumption that a single-VM worst case dominates") or pick a budget that covers the common multi-VM case. Not actionable for this PR — flag for follow-up. (suggestion, gap)

---

**S4. Cleanup `rdd limavm delete` in test loses exit-code check** — `bats/tests/33-lima-controllers/limavm-cli.bats:94`

```diff
-    run -0 rdd limavm delete --wait "start-vm"
+    rdd limavm delete "start-vm"
```

Dropping `run -0` means BATS now relies on `set -e` to surface a non-zero exit. That works, but loses the explicit assertion and the captured `$output`/`$status`. BATS style guidance in this repo (per memory: "Variable scoping, run -0, refute_output") prefers `run -0`. Keep it:

```diff
-    rdd limavm delete "start-vm"
+    run -0 rdd limavm delete "start-vm"
```

(suggestion, regression)

---

**S5. Doc wording inconsistency for `svc delete --timeout` vs other commands** — `cmd/rdd/service.go:364`

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the embedded shutdown wait (0 means wait indefinitely)")
```

Other commands (svc start/stop, all limavm) use `"Timeout for --wait (0 means wait indefinitely)"`. `svc delete` doesn't expose `--wait`, so the wording differs for clarity, but "the embedded shutdown wait" is awkward. Suggest `"Timeout for the implicit stop-and-wait (0 means wait indefinitely)"` to mirror the docs language. (suggestion, enhancement)

---

**S6. Pre-existing assertion in `2-create.bats:40-43` doesn't match emitted log lines (out of scope but adjacent)** — `bats/tests/20-service/2-create.bats:38-44`

```bash
@test 'start instance' {
    run -0 rdd svc start
    run -0 extract_msg
    assert_output <<EOT
successfully started "rancher-desktop-${RDD_INSTANCE}" control plane
waiting for "rancher-desktop-${RDD_INSTANCE}" control plane to be ready
EOT
}
```

`cmd/rdd/service.go` emits `"%q control plane is ready"` (line 307) and `"%q control plane is starting"` (line 312) on the cold-start path; neither "successfully started" nor "waiting for ... to be ready" is reachable without taking the already-running branch (line 275). git blame shows the message strings predate this PR, so this is pre-existing — but flagging because it sits in the same suite that the PR adds 6-timeout.bats to, and suggests this test may not be exercising the assertion path the author intended. Worth checking whether it has been silently passing because BATS is comparing differently than expected, or has been marked skipped somewhere. (suggestion, gap — pre-existing)

## Design Observations

### Strengths

- **Single source of truth for the long-wait constant.** `limaLongWaitTimeout` (5m) is defined once in `cmd/rdd/limavm.go:41` and reused by every limavm subcommand and by `stopWaitTimeout` in `cmd/rdd/service.go:42`. Changing the per-VM ceiling now flows through automatically.
- **`watchtools.ContextWithOptionalTimeout` for the 0-means-infinite contract.** Reusing a stdlib (k8s.io/client-go) helper instead of hand-rolling the `if timeout == 0` case avoids the easy-to-misget edge case across 5+ call sites.
- **The `serviceStartAction` already-running branch now waits for `ReadyAnnotation`.** Real fix — pre-PR, `service.WaitWithTimeout` only checked discovery presence, not readiness, so `rdd svc start` against a still-initializing service could return success too early.

### Concerns

- **`StopWithWait` doesn't accept a parent context** (covered in I2). Every other wait helper this PR touches threads through `cmd.Context()`; this is the only outlier. (in-scope)
- **Asymmetric flag set on `svc delete`** (no `--wait`) is documented and intentional, but the codebase now has three different flag combinations for very similar operations: `start/stop` have both, `delete` has only `--timeout`, `serve` has neither. Future readers will look at `delete` and ask "where's `--wait`?". The doc text addresses it ("Delete always waits ... because removing the instance directory under a live process corrupts it on Windows"), but the flag's missing-ness in `--help` is not self-documenting. Consider adding a short `Long:` description on the delete command itself, or a hidden `--wait` flag that errors out with the safety reason if set to false. (future)

## Testing Assessment

Coverage of the new exit-code-4 contract is good: each wait-capable lima subcommand now has a `run -4` regression test, and `6-timeout.bats` covers `svc stop`. Notable gaps:

1. **`rdd svc start --timeout=...` exit code 4 is untested.** The change adds the flag and the `timeoutError` wrap on lines 304-305 and 278-281 of `cmd/rdd/service.go`, but nothing exercises the deadline path. A `--timeout=1ms` test against a slow start would prove the contract.
2. **`rdd svc delete --timeout=...` exit code 4 is untested.** Tied to I1: the documented behavior "exits with code 4 and the directory is left in place" has no test. A test would have caught the swallow-on-race issue.
3. **`ensureServiceRunning` timeout path is untested.** The `timeoutError` wraps on lines 108, 110, 112 are reached when `rdd set`, `rdd ctl`, `rdd kubectl`, etc. autostart a stuck service. No test forces this path.
4. **`--wait=false --timeout=N` combination is untested** — would surface S2 as a UX issue.

## Documentation Assessment

- `docs/design/cmd_service.md:117-122` — see I1; the "directory is left in place" claim is racy in practice.
- `docs/design/cmd_service.md:93-99` — good update (replacing `--no-wait` with the `--wait=false` form, adding `--timeout`).
- `docs/design/api_service.md:21` — accurate, matches the SIGINT/CTRL_BREAK_EVENT implementation in `pkg/util/process/process_{unix,windows}.go`.
- `docs/design/cmd_lima.md:14-57` — defaults updated consistently. The `--timeout` description for each command is identical except for the `--start`-conditional caveat on `create` — consider extracting the boilerplate so future flag additions stay aligned.

## Commit Structure

Single commit, single concept (CLI flag standardization). Message accurately summarizes the change. Clean.

## Acknowledged Limitations

- **Comment at `pkg/service/cmd/service.go:432-433`** acknowledges the outer/inner `Running()` race in `Delete`. As noted in I1, the comment understates the swallow's actual scope (it covers more than just that one race), so this is promoted to a finding rather than left here.
- **Comment at `pkg/service/cmd/service.go:377-383`** explains the SIGTERM-on-timeout fallback rationale for the platform asymmetry. Accurate.
- **Comment at `cmd/rdd/service.go:32-42`** explains why `startWaitTimeout` is 90s and why `stopWaitTimeout` matches the per-VM ceiling. The latter is partially misleading under multi-VM scenarios — see S3.

## Coverage Summary

- `bats/tests/20-service/6-timeout.bats` — Reviewed, covers I1-adjacent territory but not the deadline-on-delete case.
- `bats/tests/33-lima-controllers/limavm-cli.bats` — S4 (cleanup deletes dropped `run -0`); two new exit-code-4 tests are well-placed.
- `cmd/rdd/limavm.go` — Reviewed, no issues. Changes are mechanical: rename `toolswatch`→`watchtools`, add `limaLongWaitTimeout` constant, route `context.DeadlineExceeded` through `timeoutError`.
- `cmd/rdd/service.go` — I1 (Delete swallow), S1 (helper duplication), S3 (stopWaitTimeout comment), S5 (delete --timeout help text).
- `docs/design/api_service.md` — Reviewed, no issues.
- `docs/design/cmd_lima.md` — Reviewed, defaults documented correctly.
- `docs/design/cmd_service.md` — I1 (delete behavior contradiction).
- `pkg/service/cmd/service.go` — I1 (Delete logic), I2 (StopWithWait ignores parent ctx).

The review is complete. The background grep finished but its results aren't needed — I already located `ContextWithOptionalTimeout` via the Go module cache at `/Users/jan/go/pkg/mod/k8s.io/client-go@v0.33.0/tools/watch/until.go` and verified the negative-clamp behavior. No further action.
---

### [Codex GPT 5.4]

### Executive Summary

This PR does the intended flag and exit-code standardization: `svc` and `limavm` now default to waiting, and deadline expiry is consistently routed to exit code 4. The main issue I found is in the new service wait API: `svc stop` / `svc delete` discard the CLI context, so the newly introduced `--timeout=0` path can hang indefinitely even after the user interrupts the command. The service shutdown docs and the new timeout test comment also overstate Unix behavior by describing a forced kill that the implementation does not actually perform.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **`svc stop` / `svc delete` can become uninterruptible when `--timeout=0` is used** — `pkg/service/cmd/service.go:364-369` (important, regression)

```go
if wait {
	ctx := context.Background()
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
```

`serviceStopAction()` at `cmd/rdd/service.go:336` and `serviceDeleteAction()` at `cmd/rdd/service.go:377` call into `StopWithWait()` / `Delete()` with only `wait` and `timeout`, and the new flags at `cmd/rdd/service.go:354` and `cmd/rdd/service.go:364` explicitly advertise `--timeout=0` as “wait indefinitely”. The new code at `pkg/service/cmd/service.go:365` replaces the caller’s `cmd.Context()` with `context.Background()`, so once the wait loop at `pkg/service/cmd/service.go:374-391` starts, Ctrl-C on the CLI can no longer cancel it. Before this change the stop wait was at least bounded to 60s; after this change, the new indefinite-wait mode can block forever from the caller’s point of view.

Fix: thread the caller context through the new library surface so timeout is derived from the parent context instead of replacing it.

```diff
-func StopWithWait(wait bool, timeout time.Duration) error {
+func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
 	…
 	if wait {
-		ctx := context.Background()
 		if timeout > 0 {
 			var cancel context.CancelFunc
 			ctx, cancel = context.WithTimeout(ctx, timeout)
 			defer cancel()
 		}
```

Apply the same change to `Delete()`, and pass `cmd.Context()` from `cmd/rdd/service.go`.

#### Suggestions

S1. **The new shutdown docs and timeout test comment promise a forced kill that Unix never sends** — `docs/design/api_service.md:21; docs/design/cmd_service.md:104-109; bats/tests/20-service/6-timeout.bats:3-6` (suggestion, regression)

```md
The control plane is notified to shut down, either by `rdd service stop`
(SIGINT on Unix, `CTRL_BREAK_EVENT` on Windows, with forced termination as fallback)
```

The new text at `docs/design/api_service.md:21` and the comment at `bats/tests/20-service/6-timeout.bats:3-6` say the timeout path force-kills the daemon, but the only fallback used by `pkg/service/cmd/service.go:384` is `process.Kill(pid)`, and on Unix `pkg/util/process/process_unix.go:30-32` implements that as another `SIGTERM`, not `SIGKILL`. In a local run on this macOS host, `./bin/rdd svc stop --wait --timeout=1ms` exited 4 while `./bin/rdd svc status` still reported the daemon PID as running immediately afterward, so the new docs/comments are describing a stronger guarantee than the implementation provides.

Fix: either make the Unix timeout fallback truly forceful, or rewrite the docs/test comment to say that the CLI wait timed out and the daemon may still be shutting down.

### Design Observations

**Concerns**

1. The new service API now separates “how long to wait” from “who can cancel the wait”, but only models the timeout half. Passing `context.Context` through `StopWithWait()` / `Delete()` would make the contract explicit and remove a whole class of caller-cancellation bugs. (in-scope)

**Strengths**

1. Keeping `cliexit.Timeout` wrapping in `cmd/rdd/service.go` via `timeoutError()` is a good boundary: the service package stays reusable, while CLI-facing commands get consistent exit-code behavior.

2. Waiting for the discovery ConfigMap’s `ReadyAnnotation` in both the cold-start and already-running `svc start` paths tightens readiness semantics and matches the PR’s stated goal.

### Testing Assessment

1. `rdd svc delete --timeout=<short>` has no integration test covering exit code 4 and the “directory is left in place” safety guarantee this PR documents.
2. None of the autostarting commands called out in the PR description (`rdd set`, `rdd kubectl`, `rdd service config`) has a timeout-path test proving `ensureServiceRunning()` now returns exit code 4 instead of 1.
3. The new `svc stop --timeout=1ms` test only checks exit code and stderr text; it does not assert whether the daemon is still running or eventually stops, which is the exact behavior the new docs/commentary now describe.

### Documentation Assessment

The `limavm` docs are aligned with the new defaults. The only documentation issue I found is S1: the service shutdown docs now overstate the Unix timeout fallback.

### Coverage Summary

- `bats/tests/20-service/6-timeout.bats` — S1
- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues
- `cmd/rdd/limavm.go` — Reviewed, no issues
- `cmd/rdd/service.go` — I1
- `docs/design/api_service.md` — S1
- `docs/design/cmd_lima.md` — Reviewed, no issues
- `docs/design/cmd_service.md` — S1
- `pkg/service/cmd/service.go` — I1

---

### [Gemini 3.1 Pro]

## Executive Summary

This PR standardizes the `--wait` and `--timeout` behavior across the `rdd svc` and `rdd limavm` subcommands, defaulting to finite timeouts and exiting with code 4 when the deadline expires. However, the implementation introduces several critical regressions: a race condition in `rdd svc delete` that swallows the timeout error and deletes data out from under running VMs, an orchestration flaw that orphans hostagent processes instead of forcefully terminating them, and a botched search/replace that corrupts the BATS test suite.

## Findings

### Critical Issues

C1. **BATS test syntax corrupted by botched global replace** — `bats/tests/20-service/6-timeout.bats:10`, `bats/tests/33-lima-controllers/limavm-cli.bats:201` (critical, regression)

```bash
-@test "lima start --timeout fails when VM cannot reach desired state" {
+@pkg/util/tail/tail_test.go "lima start --timeout fails when VM cannot reach desired state" {
```

The `@test` decorator in the BATS files has been erroneously replaced with `@pkg/util/tail/tail_test.go` (likely from an unintended global search/replace in the editor). This syntax error completely breaks the BATS test suite.

Fix: Revert all instances of `@pkg/util/tail/tail_test.go` back to `@test`.

C2. **`service.Delete` race condition swallows timeout errors, causing data loss** — `pkg/service/cmd/service.go:428-435` (critical, regression)

```go
	if Running() {
		if err := StopWithWait(true, timeout); err != nil {
			// The process may exit between the outer Running check and
			// StopWithWait's inner one; treat that as success.
			if Running() {
				return err
			}
		}
	}
```

When `StopWithWait` hits the deadline, it kills the service process and returns `context.DeadlineExceeded` wrapped in a timeout error. `Delete` then evaluates `Running()`. Because the service was just killed by `StopWithWait`, `Running()` returns `false` (especially on Windows where dead processes instantly fail `FindProcess` and trigger PID file cleanup). As a result, `Delete` ignores the timeout error and proceeds to delete all instance directories. This explicitly violates the PR's design contract ("If the deadline expires, `rdd` exits with code 4 and the directory is left in place") and deletes files while orphaned hostagents are still running.

Fix: Explicitly check for the timeout error and return it so the CLI exits with code 4 and leaves the directory intact.

```diff
 		if err := StopWithWait(true, timeout); err != nil {
 			// The process may exit between the outer Running check and
 			// StopWithWait's inner one; treat that as success.
-			if Running() {
+			if errors.Is(err, context.DeadlineExceeded) || Running() {
 				return err
 			}
 		}
```

C3. **`StopWithWait` timeout orphans hostagents** — `pkg/service/cmd/service.go:384-385` (critical, regression)

```go
				// Graceful shutdown timed out; terminate so we don't leave
				// a hung service process. Kill targets only the service; on
				// Windows (TerminateProcess) this avoids killing hostagents
				// that LimaVM stop orchestration hasn't reached yet. The
				// service is responsive to signals (it's just slow shutting
				// down hostagents sequentially).
				_ = process.Kill(pid)
```

The CLI uses `process.Kill(pid)` to terminate the service on timeout. The comment incorrectly assumes that avoiding `KillTree` is desirable because the service is "slow shutting down hostagents sequentially." However, by killing the service (the orchestrator), the graceful shutdown is aborted entirely! The hostagents that haven't been reached will *never* be shut down, leaving them orphaned indefinitely. Per the repository instructions, `process.KillTree` must be called while the parent is still alive to ensure all process group members are terminated.

Fix: Use `process.KillTree` to ensure the entire tree is terminated if the orchestrator fails to complete its graceful shutdown.

```diff
-				_ = process.Kill(pid)
+				_ = process.KillTree(context.Background(), pid)
```

### Important Issues

I1. **`StopWithWait` ignores caller context and SIGINT** — `pkg/service/cmd/service.go:365-370` (important, regression)

```go
func StopWithWait(wait bool, timeout time.Duration) error {
// ...
	if wait {
		ctx := context.Background()
		if timeout > 0 {
			var cancel context.CancelFunc
			ctx, cancel = context.WithTimeout(ctx, timeout)
			defer cancel()
		}
```

`StopWithWait` creates a new `context.Background()` instead of accepting the caller's context. If the user passes `--timeout=0`, the timeout check is bypassed and `ctx` remains `context.Background()`. This means `<-ctx.Done()` will never fire (nil channel), and the `select` loop will block indefinitely polling `<-ticker.C`. The CLI command becomes entirely un-cancellable by the user — pressing Ctrl+C will not propagate cancellation to the wait loop. This violates the repository convention ("Functions should accept a `context.Context` from their caller so shutdown can propagate").

Fix: Accept `ctx context.Context` as the first argument, and derive the timeout context from it. Update callers in `cmd/rdd/service.go` to pass `cmd.Context()`.

```diff
-func StopWithWait(wait bool, timeout time.Duration) error {
+func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
// ...
-	if wait {
-		ctx := context.Background()
-		if timeout > 0 {
-			var cancel context.CancelFunc
-			ctx, cancel = context.WithTimeout(ctx, timeout)
-			defer cancel()
-		}
+	if wait {
+		if timeout > 0 {
+			var cancel context.CancelFunc
+			ctx, cancel = context.WithTimeout(ctx, timeout)
+			defer cancel()
+		}
```

### Suggestions

None.

## Design Observations

### Concerns
None.

### Strengths
None.

## Testing Assessment

1. The new BATS tests accurately assert the timeout behavior and exit codes for both `rdd svc` and `rdd limavm` operations, assuming the syntax error (C1) is resolved. 
2. The use of `run -4` perfectly aligns with the new exit-code-4 contract.

## Documentation Assessment

The design documentation updates correctly reflect the new defaults, behaviors, and error codes. No gaps identified.

## Coverage Summary

- `bats/tests/20-service/6-timeout.bats`: C1
- `bats/tests/33-lima-controllers/limavm-cli.bats`: C1
- `cmd/rdd/limavm.go`: Reviewed, no issues
- `cmd/rdd/service.go`: I1
- `docs/design/api_service.md`: Reviewed, no issues
- `docs/design/cmd_lima.md`: Reviewed, no issues
- `docs/design/cmd_service.md`: Reviewed, no issues
- `pkg/service/cmd/service.go`: C2, C3, I1

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-22 |
| **Commit** | `4bf0ec5 cli: standardize --wait and --timeout across svc and limavm` (amended) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: `Delete` swallows timeout errors once the force-kill wins the race | Fixed |
| 2 | Important #2: `StopWithWait` / `Delete` ignore the caller's context | Fixed |
| 3 | Suggestion #1: `timeoutError` duplicates the inline `cliexit.Timeout` pattern in `set.go` | Fixed |
| 4 | Suggestion #2: `--timeout=N` is silently ignored when `--wait=false` | Fixed |
| 5 | Suggestion #3: Cleanup `rdd limavm delete` lost its `run -0` wrapper | Skipped (finding misread repo BATS rule; current code matches `bats-style.md`) |
| 6 | Suggestion #4: "Forced termination as fallback" overstates Unix behavior | Fixed |
| 7 | Suggestion #5: `stopWaitTimeout` rationale covers one VM, not the multi-VM worst case | Fixed |
| 8 | Suggestion #6: `svc delete --timeout` help text diverges from other `--timeout` flags | Fixed |
| 9 | Testing Gap #1: `svc start --timeout` exit-code-4 | Skipped (R2 intentional skip) |
| 10 | Testing Gap #2: `svc delete --timeout` exit-code-4 + directory-left-in-place | Skipped (R2 intentional skip) |
| 11 | Testing Gap #3: `ensureServiceRunning` deadline during autostart commands | Skipped (R2 intentional skip) |
| 12 | Testing Gap #4: `--timeout=0` indefinite wait succeeds | Skipped (R2 intentional skip) |
| 13 | Testing Gap #5: Ctrl-C during `svc stop --timeout=0` / `svc delete --timeout=0` | Skipped (depends on signal-to-context plumbing not in this PR) |
