# Deep Review: 20260421-225153-pr-336

| | |
|---|---|
| **Date** | 2026-04-21 22:51 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#336](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/336) — cli: standardize --wait and --timeout across svc and limavm |
| **Branch** | `wait-timeout-cleanup` |
| **Commits** | `8aa55e4` cli: address deep-review findings for #336<br>`99aa4e3` cli: standardize --wait and --timeout across svc and limavm |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with minor fixes** — Round 1's Important findings (I1 delete-under-live-process race, I2 shared default timeout) are correctly resolved in `8aa55e4`. One new Important gap remains: `ensureServiceRunning` timeouts still exit with code 1 instead of 4, so the advertised uniform exit-code-4 contract is not uniform for the service-autostart path invoked from `rdd set`, `rdd limavm *`, and `rdd kubectl`. The rest is documentation polish and still-missing BATS coverage for the new `svc *` timeout paths. |
| **Wall-clock time** | `2 h 4 min 21 s` |

## Executive Summary

`8aa55e4` correctly addresses the Round 1 Important findings: `--wait=false` is removed from `svc delete` (the directory-removal-under-live-process race is gone), the single `defaultWaitTimeout = 30s` is replaced with two named constants (`startWaitTimeout = 90s`, `stopWaitTimeout = limaLongWaitTimeout = 5m`) each anchored to a concrete invariant in its doc comment, and the SIGTERM/SIGINT signal-name mismatch in `docs/design/cmd_service.md` is fixed. `docs/design/cmd_lima.md` is now consistent with the code, and two new BATS tests cover `lima create --start --timeout` and `lima restart --timeout` exit-code-4 paths.

The remaining items split cleanly. One Important gap: `ensureServiceRunning` (called by `rdd set`, every `rdd limavm *` path via `getKubeClient`, and `rdd kubectl`) returns raw `context.DeadlineExceeded` without routing through `timeoutError`, so a service-start timeout during those commands exits with code 1 instead of the advertised code 4. The rest is documentation polish: the PR description and commit body still describe pre-`8aa55e4` defaults, `docs/design/api_service.md` still says `svc stop` delivers `SIGTERM` (the companion doc `cmd_service.md` was fixed but not this one), and `docs/design/cmd_service.md`'s `svc delete` section lacks a cross-reference to `svc stop` for the signal/fallback details. The new `svc start`/`svc stop`/`svc delete` timeout paths still have no BATS coverage — `8aa55e4`'s new tests cover the `lima` paths only, so Round 1's I3 is partially addressed but leaves the `svc *` side of the contract unasserted.

A small `Delete` TOCTOU (self-heals on retry) and a subtle readiness-contract gap on the documented `--wait=false` → `rdd svc start` workflow round out the suggestion list.

---

## Critical Issues

None.

## Important Issues

I1. **`ensureServiceRunning` timeouts bypass the `cliexit.Timeout` envelope** — `cmd/rdd/service.go:98-113`, `cmd/rdd/limavm.go:260-280`, `cmd/rdd/set.go:110, 566`, `cmd/rdd/kubectl.go:34` [Codex GPT 5.4] [Gemini 3.1 Pro]

```go
func ensureServiceRunning(ctx context.Context) error {
    if !service.Exists() {
        if err := service.Create(nil); err != nil {
            return err
        }
    }
    if service.Running() {
        ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
        defer cancel()
        if err := service.Wait(ctx); err != nil {
            return err
        }
        return waitForDiscoveryConfigMap(ctx)
    }
    return startAndWaitForReady(ctx, nil, startWaitTimeout)
}
```

`ensureServiceRunning` now enforces the new `startWaitTimeout` (90s) on every service-autostart path, but it returns raw `context.DeadlineExceeded` without routing through `timeoutError`. Callers — `getKubeClient` in `cmd/rdd/limavm.go:261` (used by every `rdd limavm *` subcommand), `rdd set` at `cmd/rdd/set.go:110` and `:566`, `rdd kubectl` at `cmd/rdd/kubectl.go:34`, and `rdd service config` at `cmd/rdd/service.go:182` — all propagate the error without wrapping. `cmd/rdd/main.go`'s `errors.As(err, &cliexit.Error)` check then sees a bare deadline error and exits with code 1, contradicting the PR's stated uniform exit-code-4 contract. The `svc start` path itself is fine (line 302 wraps `startAndWaitForReady` directly with `timeoutError`), but the shared `ensureServiceRunning` entry point five other commands use is not.

Fix: wrap the error at the source so every caller inherits the contract without a separate `timeoutError` call at each site:

```diff
 func ensureServiceRunning(ctx context.Context) error {
     if !service.Exists() {
         if err := service.Create(nil); err != nil {
             return err
         }
     }
     if service.Running() {
         ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
         defer cancel()
-        if err := service.Wait(ctx); err != nil {
-            return err
-        }
-        return waitForDiscoveryConfigMap(ctx)
+        if err := service.Wait(ctx); err != nil {
+            return timeoutError(err)
+        }
+        return timeoutError(waitForDiscoveryConfigMap(ctx))
     }
-    return startAndWaitForReady(ctx, nil, startWaitTimeout)
+    return timeoutError(startAndWaitForReady(ctx, nil, startWaitTimeout))
 }
```

(important, gap — consolidated from Codex I1 and Gemini I2; both agents independently traced the unwrapped error path.)

## Suggestions

S1. **`Delete` leaves the directory in place when the process exits between outer and inner `Running()` checks** — `pkg/service/cmd/service.go:338-340, 426-434` [Claude Opus 4.7]

```go
// Delete (outer check)
if Running() {
    if err := StopWithWait(true, timeout); err != nil {
        return err
    }
}
// ...
return os.RemoveAll(instance.Dir())

// StopWithWait (inner check, line 338-340)
func StopWithWait(wait bool, timeout time.Duration) error {
    if !Running() {
        return fmt.Errorf("%q control plane is not running", instance.Name())
    }
```

Pre-PR, `Delete()` called `_ = Stop()` and discarded the error, so a process that exited between any two checks did not block deletion. Post-PR, `Delete(timeout)` propagates the `StopWithWait` error verbatim. A narrow window exists between `Delete`'s outer `Running()` and `StopWithWait`'s inner one in which the serve subprocess can exit (crash, OOM-kill, external `kill`, concurrent `rdd svc stop`). When that happens, `StopWithWait` returns `"… control plane is not running"`, `Delete` propagates it, and `os.RemoveAll(instance.Dir())` never runs. The user sees exit code 1 and a confusing message, even though the desired end state is one `os.RemoveAll` away. The condition self-heals on retry — the next invocation sees `!Running()`, skips the stop branch, and proceeds to delete.

Fix: treat "already not running" as success in `Delete` by re-checking after the stop failure:

```diff
     if Running() {
         if err := StopWithWait(true, timeout); err != nil {
-            return err
+            // The process may have exited between our Running check and
+            // StopWithWait's; in that case its work is already done.
+            if Running() {
+                return err
+            }
         }
     }
```

(suggestion, regression)

S2. **Exit-code-4 contract remains asserted only on `lima` paths; the new `svc *` timeout paths have no coverage** — `bats/tests/33-lima-controllers/limavm-cli.bats:196-230` [Claude Opus 4.7]

```bash
@test "lima start --timeout fails when VM cannot reach desired state" { … run -4 … }
@test "lima create --start --timeout fails when VM cannot boot" { … run -4 … }
@test "lima restart --timeout fails when VM cannot restart" { … run -4 … }
```

`8aa55e4` adds two new exit-code-4 tests on top of the one `99aa4e3` updated, all inside `lima` subcommands. None of the four new `svc *` timeout paths the PR introduces carry coverage:

- `rdd svc start --wait --timeout=…` (wraps `startAndWaitForReady` at `cmd/rdd/service.go:302`)
- `rdd svc start` against an already-running service (wraps `service.Wait` at `cmd/rdd/service.go:278`)
- `rdd svc stop --wait --timeout=…` (wraps `service.StopWithWait` at `cmd/rdd/service.go:334`)
- `rdd svc delete --timeout=…` (wraps `service.Delete` at `cmd/rdd/service.go:375`)
- `--timeout=0` (indefinite) on any of these paths

A regression that returns the wrapped error directly without going through `timeoutError` will ship green. This restates Round 1 I3; the fixup commit partially addressed it (lima paths now covered) but left the svc side unasserted. The simplest provoke for `svc start` is creating the control plane with a broken controller manifest; the simplest for `svc stop` is the setup from `bats/tests/33-lima-controllers/limavm-running.bats:327` ("graceful service stop terminates hostagent"). At minimum, add one `svc *` timeout assertion so the contract is not lima-only.

(suggestion, gap — restates Round 1 I3 partially addressed.)

S3. **`rdd svc start` on an already-running service returns before the `ReadyAnnotation` is set** — `cmd/rdd/service.go:270-278` [Claude Opus 4.7] [Codex GPT 5.4]

```go
if service.Running() {
    logrus.Infof("%q control plane is already running", instance.Name())
    if !wait {
        return nil
    }
    logrus.Infof("waiting for %q control plane to be ready", instance.Name())
    ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
    defer cancel()
    return timeoutError(service.Wait(ctx))
}
```

`docs/design/cmd_service.md:93-95` explicitly advertises the flow "`--wait=false` … then run `rdd service start` again with no options to wait." The cold-start path (`startAndWaitForReady`) waits for both `service.Wait(ctx)` and `waitForFreshDiscoveryConfigMap(ctx, beforeStart)` — the latter requires `ReadyAnnotation == "true"`, which the controller manager sets only after every expected controller has registered. The already-running path calls only `service.Wait(ctx)`, which via `checkReadiness` returns success as soon as the discovery ConfigMap lists registered controllers — before the ready annotation. A concurrent `--wait=false` followed by `rdd svc start` during a still-initializing daemon therefore returns success under a narrower contract than the cold-start path guarantees.

Codex raised this at Important because the PR's new documentation names the workflow explicitly; Claude raised it at Suggestion because the race window is narrow (controllers finish registering quickly, and a steady-state instance is always fine). We consolidate at Suggestion: the practical exposure is a narrow mid-startup race, but the documented workflow deserves a one-line fix to deliver the advertised guarantee.

Fix: mirror `ensureServiceRunning`'s pattern and wait for the discovery ConfigMap on the already-running branch too:

```diff
-    return timeoutError(service.Wait(ctx))
+    if err := service.Wait(ctx); err != nil {
+        return timeoutError(err)
+    }
+    return timeoutError(waitForDiscoveryConfigMap(ctx))
```

(suggestion, gap — pre-existing; the PR's new docs make it newly relevant.)

S4. **BATS tests still pass `--wait` explicitly on `limavm delete`, which is now the default** — `bats/tests/33-lima-controllers/limavm-cli.bats:78, 94` [Claude Opus 4.7]

```bash
run -0 rdd limavm delete --wait "test-vm"
run -0 rdd limavm delete --wait "start-vm"
```

`limavm delete` now defaults to `--wait=true` (`cmd/rdd/limavm.go:253`). The explicit flag is redundant and signals to a future reader that the flag is load-bearing — when the opposite is now true. Drop the flag:

```diff
-run -0 rdd limavm delete --wait "test-vm"
+run -0 rdd limavm delete "test-vm"
```

(suggestion, gap)

S5. **PR description and `99aa4e3` commit body describe the pre-`8aa55e4` default timeouts** — commit `99aa4e3`, PR description [Claude Opus 4.7]

The PR description lists:

> Default timeouts:
> - 30s: `svc start`, `svc stop`, `svc delete`, `limavm create`
> - 5m: `limavm start`, `limavm stop`, `limavm restart`, `limavm delete`

After `8aa55e4`:

- `svc start` → `startWaitTimeout = 90s`
- `svc stop`, `svc delete` → `stopWaitTimeout = limaLongWaitTimeout = 5m`
- All `limavm` paths → `limaLongWaitTimeout = 5m`

The "30s bucket" no longer exists. Update the PR description, and either amend `99aa4e3` or extend `8aa55e4`'s body with the final contract so a future `git log` reader sees the shipped defaults. This also picks up Round 1 S7 (the `service.Stop()` / `service.WaitWithTimeout()` removals are still unnamed in either commit body) and Round 1 S8 (the stale `limavm create` 30s line).

(suggestion, gap — re-raises Round 1 S7 and S8 unaddressed.)

S6. **`docs/design/cmd_service.md` does not cross-reference `svc delete` back to `svc stop`** — `docs/design/cmd_service.md:111-121` [Claude Opus 4.7]

```
### `rdd service delete`

Stops the control plane and removes the instance directory, the short directory
(which contains the Lima home), and — unless `RDD_KEEP_LOGS` is set — the log
directory.

Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place.
```

The `99aa4e3` commit body says the change folds the shutdown-wait semantics of `svc delete` into a reference to `svc stop`, but the actual doc carries no cross-reference. A reader must know that `svc delete` uses `svc stop`'s mechanism to understand which signal is sent (SIGINT / CTRL_BREAK_EVENT, with Kill fallback). A one-line "See [svc stop](#rdd-service-stop) for the signal/fallback details" closes the gap without duplicating content.

(suggestion, gap)

S7. **`docs/design/api_service.md` still describes `svc stop` as delivering `SIGTERM`** — `docs/design/api_service.md:21` [Codex GPT 5.4]

```md
The control plane will be notified to shut down, either by receiving a SIGTERM signal from the user (via `rdd service stop`), or by the OS preparing to shut down the host.
```

`cmd_service.md` was updated to say `SIGINT` (with the Windows `CTRL_BREAK_EVENT` mapping) and the implementation calls `process.Interrupt(pid)` first at `pkg/service/cmd/service.go:358-362`, but the architecture doc `api_service.md` still names `SIGTERM`. The two docs now disagree on the shutdown contract — exactly what this PR is standardizing.

Fix: update `docs/design/api_service.md:21` to match: `rdd service stop` sends `SIGINT` on Unix or `CTRL_BREAK_EVENT` on Windows, then falls back to forced termination if graceful shutdown cannot be initiated or does not complete in time.

(suggestion, gap — pre-existing; the PR's companion doc update made the divergence worse.)

S8. **Inconsistent alias for the client-go watch tools import** — `cmd/rdd/limavm.go:28`, `cmd/rdd/service.go:22` [Gemini 3.1 Pro]

`cmd/rdd/service.go` imports `k8s.io/client-go/tools/watch` as `watchtools`, while `cmd/rdd/limavm.go` imports the same package as `toolswatch`. Standardize on one alias across the CLI to keep a quick text search from missing call sites.

(suggestion, gap)

---

## Design Observations

### Concerns

- **Two readiness definitions coexist in `pkg/service/cmd/service.go`** (in-scope) [Codex GPT 5.4]. `service.Wait()` proves API readiness plus best-effort controller discovery; `startAndWaitForReady()` adds the `ReadyAnnotation` + freshness check. Unifying the two into a single readiness helper removes the mismatch behind S3 and gives every future caller the same contract. Matches the shape of S3's fix.
- **`StopWithWait` constructs `context.Background()` instead of accepting the caller's context** (future) [Claude Opus 4.7] [Gemini 3.1 Pro]. The function signature is `StopWithWait(wait bool, timeout time.Duration)` rather than the idiomatic `StopWithWait(ctx context.Context, wait bool, timeout time.Duration)`. Today this is a no-op — `cmd/rdd/main.go:153` passes `context.Background()` and never installs signal-to-context plumbing — but if the CLI ever grows a signal handler that cancels `cmd.Context()`, the shutdown wait will not observe it. Plumbing the context through costs one parameter and matches every other wait site in the CLI.
- **Pre-existing PID-recycling race in `StopWithWait`** (future) [Claude Opus 4.7]. Between `Running()` at line 339 and `pid := PID()` at line 346, the service can exit and `PID()` can find the PID file stale and return `PIDNotFound` (0). `process.Interrupt(0)` on Unix becomes `unix.Kill(0, SIGINT)`, which signals every process in the calling process's process group — including the `rdd` CLI itself and its parent shell. The race is narrow and existed before this PR, but the refactor did not close it.

### Strengths

- **Removing `--wait=false` from `svc delete` is the correct fix for Round 1 I1** [Claude Opus 4.7] [Codex GPT 5.4]. No safe semantics exist for "delete without waiting" when the writer still owns the directory, and `docs/design/cmd_service.md` now names the safety reason explicitly.
- **`startWaitTimeout` and `stopWaitTimeout` each tie their value to a concrete invariant in the doc comment** [Claude Opus 4.7]. `startWaitTimeout` cites CRD establishment and controller-manager registration; `stopWaitTimeout` cites the 45s drain in `Run` and the per-VM shutdown ceiling. Future tuning now anchors on the underlying sizing constraint, not a handwave.
- **`timeoutError` at the CLI boundary plus `errors.Is(err, context.DeadlineExceeded)` is the right classification primitive** [Claude Opus 4.7] [Codex GPT 5.4]. The helper keeps `cliexit` concerns out of the lower-level service package, and the `%w` wrap chain propagates through every layer.
- **`watchtools.ContextWithOptionalTimeout` is used at every wait site** [Claude Opus 4.7]. The "timeout=0 → no deadline" semantics now live in one place instead of six independent timer constructions.
- **`service.Delete(timeout)` is a safer API than the previous `service.Delete()`** [Codex GPT 5.4]. Waiting for shutdown before removing instance state is the right direction for both Windows file-handle semantics and Unix PID-file correctness.

## Testing Assessment

Untested paths, in priority order:

1. **`svc start --wait --timeout=…` deadline expiry returns exit code 4.** Most novel path in the PR; the `--timeout` flag on `svc start` is new. See S2.
2. **`svc stop --wait --timeout=…` deadline expiry returns exit code 4.** `StopWithWait` was rewritten to use a context-driven loop; no test exercises the timeout-expiry branch. See S2.
3. **`svc delete --timeout=…` deadline expiry returns exit code 4 and leaves the directory in place.** The doc explicitly promises this; no test asserts it. See S2.
4. **`--timeout=0` (indefinite wait) succeeds when the desired state is eventually reached.** A regression to `context.WithTimeout(ctx, 0)` (which times out immediately) would ship green. Round 1 testing gap, still unaddressed.
5. **`ensureServiceRunning` timeout during `rdd limavm *` / `rdd set` / `rdd kubectl` returns exit code 4.** Directly ties to I1; the exit-code contract is not exercised for service-autostart paths.
6. **The documented `--wait=false` → `rdd svc start` readiness workflow.** S3's narrow race window and the docs' advertisement of this workflow; no BATS test covers it.
7. **The Round 1 TODO at `bats/tests/20-service/2-create.bats:3` (`# TODO: test rdd svc start --wait=false`) remains unaddressed.**

## Documentation Assessment

- `docs/design/cmd_lima.md` now matches the code (Round 1 S1 resolved).
- `docs/design/cmd_service.md` correctly names SIGINT (Round 1 S3 resolved) and explains the safety reason for removing `--wait=false` on `svc delete` (Round 1 I1 resolved).
- `--timeout=0` semantics are documented at every site.
- **Missing:** cross-reference from `svc delete` to `svc stop` for signal/fallback details (S6).
- **Stale:** `docs/design/api_service.md:21` still says `SIGTERM` (S7).
- **Stale comment** at `cmd/rdd/service.go:120-122`: "the API server readiness poll and the ConfigMap freshness poll share the single deadline" holds only when `timeout > 0`. Round 1 doc gap, still unaddressed; minor.

## Commit Structure

`99aa4e3` and `8aa55e4` split correctly — the first is the feature, the second is the fixup for Round 1 findings. Both commit bodies understate the final contract (S5): `99aa4e3` lists pre-fixup defaults, and neither names the `service.Stop()` / `service.WaitWithTimeout()` library-surface removals. Consider squashing before merge, or amending either commit message to match the shipped code.

## Acknowledged Limitations

- `pkg/service/cmd/service.go:444-451` (in `preserveAllInstanceLogs`) names the Windows `os.Rename` / `FILE_SHARE_DELETE` limitation. The always-wait-before-delete semantics make the exposure rarer, but the QEMU-holds-the-handle case the comment describes is independent of this PR.
- `cmd/rdd/limavm.go:39-41` justifies `limaLongWaitTimeout = 5 * time.Minute` with a VM boot/shutdown comment; `8aa55e4` extends the same convention to `startWaitTimeout` and `stopWaitTimeout` in `cmd/rdd/service.go:32-42`.
- `pkg/service/cmd/service.go:381` comments that `process.Kill` (TerminateProcess on Windows) leaves child hostagents orphaned if they are slow — an explicit trade-off rather than an oversight.

## Unresolved Feedback

No prior PR review comments on this PR.

---

## Agent Performance Retro

### [Claude]

Claude produced the most thorough pass again: six suggestions and six design observations, with the deepest technical detail per finding. Its unique material dominated: the `Delete` TOCTOU around the outer/inner `Running()` checks (S1), the PID-recycling race that turns `process.Interrupt(PIDNotFound)` into a process-group self-signal (design concern), the PR-description-now-lists-pre-fixup-defaults observation (S5), and the BATS `--wait` redundancy on `limavm delete` (S4). It independently identified the same `serviceStartAction` readiness weakness as Codex, but at Suggestion severity with a more careful framing of the race window — which matches the practical exposure better than Codex's Important classification. Its verdict ("Merge") was a hair more optimistic than the consolidated "Merge with minor fixes," because Claude did not flag the `ensureServiceRunning` exit-code gap that both Codex and Gemini caught.

### [Codex]

Codex caught the Important finding the consolidation led with: `ensureServiceRunning` errors bypass `timeoutError` on every service-autostart path (`rdd set`, `rdd limavm *`, `rdd kubectl`), leaving five commands outside the advertised uniform exit-code-4 contract. It also uniquely caught the `docs/design/api_service.md` SIGTERM stale-doc (S7), which Claude and Gemini both missed despite reviewing the SIGINT/SIGTERM area in Round 1. Its call on the already-running-path readiness mismatch (I2 in its own report) was sharp on the documentation angle but over-weighted on severity; we consolidated at Suggestion alongside Claude's matching finding. No false positives; one clean Important and one unique stale-doc insight.

### [Gemini]

Gemini produced its shortest report of this PR yet and delivered a mixed bag. It independently raised the `ensureServiceRunning` exit-code-4 gap (its I2, consolidated into our I1) — valuable corroboration of Codex. It caught a clean cosmetic suggestion (S8: `watchtools` vs `toolswatch` alias mismatch) that neither other agent noticed. But its critical-severity `@test` header claim was a false positive: Gemini parsed its own reading of the diff as the file contents and hallucinated a corrupted BATS syntax in lines that are in fact valid `@test "…"` declarations at `limavm-cli.bats:208, 219`. Its Important finding on `StopWithWait` ignoring `cmd.Context()` was technically accurate but practically a no-op today (the CLI has no signal-to-context plumbing, so `cmd.Context()` is itself `context.Background()`); we demoted it to a design observation alongside Claude's matching concern. Gemini continues to skip `git blame`, so regression vs. pre-existing calls come from the other two agents.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 54m 15s | 6m 47s | 5m 05s |
| Findings | 6S | 1I 2S | 1I 1S |
| Tool calls | 59 (Bash 35, Grep 12, Read 12) | 62 (shell 59, stdin 3) | 18 (grep_search 10, run_shell_command 5, read_file 3) |
| Design observations | 6 | 3 | 0 |
| False positives | 0 | 0 | 1 |
| Unique insights | 5 | 1 | 1 |
| Files reviewed | 6 | 6 | 6 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **6S** | **1I 2S** | **1I 1S** |
| Downgraded | 0 | 1 (I→S) | 1 (I→S) |
| Dropped | 0 | 0 | 1 |


**Reconciliation.** Three severity changes:

- Gemini C1 (BATS `@test` header corrupted): **dropped as false positive.** The file in `$WORKTREE_lead` contains valid `@test` declarations at 208 and 219; Gemini hallucinated the corruption. `false_positives += 1`.
- Gemini I1 (StopWithWait ignores `cmd.Context()`): **downgraded to design observation.** No practical impact today because `main.go` passes `context.Background()` without signal-to-context plumbing; Claude's matching design concern captures the substance at the right severity.
- Codex I2 (already-running readiness mismatch): **downgraded to S3.** The documented workflow is a real gap but the race window is narrow; Claude's Suggestion classification matches the practical exposure.

Overall: Codex contributed the single most valuable finding (the `ensureServiceRunning` exit-code-4 gap) and the SIGTERM stale-doc. Claude carried the long tail of suggestions and unique design observations. Gemini corroborated the Important finding and caught the import-alias cosmetic, but its critical-severity call was spurious; with this PR's size and Round-2 narrow scope, Codex + Claude would have been sufficient.

## Review Process Notes

### Skill improvements

- **Add: flag a finding's critical/important severity as suspect when the evidence is a quoted block from a "diff view," not from the file in the worktree.** Severity escalation on purely textual evidence (especially where the claim requires the file to be malformed) should be cross-checked by reading the file at the cited line before accepting the severity. Agents sometimes synthesize a plausible diff and then reason from it as if it were the file; the consolidator must catch this. A reviewer recognises the pattern when the cited text begins with `-`/`+` markers, references "find-and-replace," or claims a syntax-level corruption that would break tooling the PR passed through.

### Repo context updates

- **Add:** When a PR introduces a uniform CLI exit-code contract that applies to "every wait-capable subcommand," enumerate every caller of the shared autostart/autowait helper (e.g., `ensureServiceRunning`) before accepting the PR as "uniform." Cross-cutting contracts that are only wired in at the outermost entry points miss every command that invokes the helper transitively (`rdd set`, every `rdd limavm *`, `rdd kubectl`, `rdd service config` in this PR). Reason: deep-review agents catch the obvious call sites but often miss the transitive ones.

---

## Appendix: Original Reviews

### [Claude Opus 4.7]

Let me verify the build is clean:

Now I have enough context to write the review.

# Deep Review: PR #336 — `cli: standardize --wait and --timeout across svc and limavm`

| | |
|---|---|
| **Verdict** | **Merge** — the round-1 critical findings (I1 directory-removal race, I2 inadequate stop budget) are all resolved in `8aa55e4`. Remaining items are documentation polish, untested timeout paths, and one residual TOCTOU in `Delete` that self-heals on retry. |

## Executive Summary

This PR unifies the `--wait` / `--timeout` surface across `rdd svc` and `rdd limavm`, routing `context.DeadlineExceeded` through a single `timeoutError` helper that tags the error with `cliexit.Timeout` so deadline expiry exits with code 4 (matching `rdd set`). The follow-up commit `8aa55e4` correctly addresses the round-1 findings: `--wait=false` was removed from `svc delete` (the dangerous "remove `instance.Dir()` under a live serve subprocess" mode is gone), the single 30s `defaultWaitTimeout` was split into `startWaitTimeout` (90s, sized for CRD establishment) and `stopWaitTimeout` (5m, exceeding the service's own 45s controller-manager drain), and the SIGTERM/SIGINT signal-name mismatch in the design doc was fixed. The remaining gaps are tests for the new `svc *` exit-code-4 paths and a small TOCTOU window in the new `Delete` that returns an error if the process exits between the outer `Running()` check and the inner one, leaving the directory in place until the user retries.

## Findings

### Critical Issues

None.

### Important Issues

None.

### Suggestions

**S1. `Delete` returns "control plane is not running" error if the process exits between the outer and inner `Running()` checks** — `pkg/service/cmd/service.go:430-434`, `pkg/service/cmd/service.go:338-340`

```go
// Delete (caller)
if Running() {
    if err := StopWithWait(true, timeout); err != nil {
        return err
    }
}

// StopWithWait (callee), line 338-340:
func StopWithWait(wait bool, timeout time.Duration) error {
    if !Running() {
        return fmt.Errorf("%q control plane is not running", instance.Name())
    }
```

Before this PR, `Delete()` called `_ = Stop()` and discarded the error, so a process that exited between any two checks did not block deletion. After this PR, `Delete(timeout)` propagates the `StopWithWait` error verbatim. There is a small window between `Delete`'s `Running()` check at line 430 and `StopWithWait`'s `Running()` check at line 339 in which the serve subprocess can exit (crash, OOM-killed, manually `kill`'ed, or stopped by another `rdd svc stop`). When that happens, `StopWithWait` returns `"… control plane is not running"`, `Delete` propagates it, and the instance directory is **not** removed. The `cliexit.Timeout` helper in `serviceDeleteAction` does not match this error, so the user sees exit code 1 and a confusing message even though the desired state ("no process, no directory") is exactly one `os.RemoveAll` away.

The condition self-heals on retry — the second `rdd svc delete` invocation sees `!Running()`, skips the stop branch entirely, and proceeds to delete. So this is a UX/observability concern, not data loss.

Fix: treat "already not running" as success in `Delete`. The cleanest version re-checks `Running()` after the failure to distinguish "stop genuinely failed" from "process raced ahead":

```diff
     if Running() {
         if err := StopWithWait(true, timeout); err != nil {
+            // The process may have exited between our Running check and
+            // StopWithWait's; in that case its job is already done.
+            if Running() {
                 return err
+            }
         }
     }
```

(suggestion, regression)

---

**S2. Exit-code-4 contract is asserted only for `lima` paths; the new `svc *` `--timeout` paths have no test coverage** — `bats/tests/33-lima-controllers/limavm-cli.bats:196-230`

```bash
@test "lima start --timeout fails when VM cannot reach desired state" { … run -4 … }
@test "lima create --start --timeout fails when VM cannot boot" { … run -4 … }
@test "lima restart --timeout fails when VM cannot restart" { … run -4 … }
```

`8aa55e4` adds two new exit-code-4 tests on top of the one updated by `99aa4e3`, all inside `lima` subcommands. None of the four new `svc *` timeout paths the PR introduces are covered:

- `rdd svc start --wait --timeout=…` — `cmd/rdd/service.go:301-303` (wraps `startAndWaitForReady`)
- `rdd svc start` against an already-running service — `cmd/rdd/service.go:276-278` (wraps `service.Wait`)
- `rdd svc stop --wait --timeout=…` — `cmd/rdd/service.go:333-335` (wraps `service.StopWithWait`)
- `rdd svc delete --timeout=…` — `cmd/rdd/service.go:374-376` (wraps `service.Delete`)
- `lima stop --timeout=…` and `lima delete --timeout=…` — `cmd/rdd/limavm.go:455, 556`
- `--timeout=0` (indefinite) on any of these

A regression that returns the wrapped error directly without going through `timeoutError` will ship green. The simplest provoke for `svc start` is `rdd svc create --controllers="*"` followed by deleting the controller-manager discovery ConfigMap mid-start; the simplest for `svc stop` is starting an instance with an unkillable hostagent (the existing `bats/tests/33-lima-controllers/limavm-running.bats:327` "graceful service stop terminates hostagent" test has the right setup). At minimum, add one `svc *` timeout assertion so the contract is not lima-only.

(suggestion, gap)

---

**S3. Stale `--wait` flag in BATS tests** — `bats/tests/33-lima-controllers/limavm-cli.bats:78,94`

```bash
run -0 rdd limavm delete --wait "test-vm"
run -0 rdd limavm delete --wait "start-vm"
```

`limavm delete` now defaults to `--wait=true` (`cmd/rdd/limavm.go:253`), so the explicit flag is redundant. Harmless, but signals "this test cares about waiting" when in fact the new default makes it the only mode. Drop the redundant `--wait` so a future reader does not infer the flag is required.

```diff
-run -0 rdd limavm delete --wait "test-vm"
+run -0 rdd limavm delete "test-vm"
```

(suggestion, gap)

---

**S4. PR description and commit body still describe pre-`8aa55e4` defaults** — commit `99aa4e3` body, PR description

The PR description says:

> Default timeouts:
> - 30s: `svc start`, `svc stop`, `svc delete`, `limavm create`
> - 5m:  `limavm start`, `limavm stop`, `limavm restart`, `limavm delete`

After `8aa55e4` the actual code uses:

- `svc start` → `startWaitTimeout` = **90s** (`cmd/rdd/service.go:35`)
- `svc stop`, `svc delete` → `stopWaitTimeout = limaLongWaitTimeout` = **5m** (`cmd/rdd/service.go:42`)
- `limavm create/start/stop/restart/delete` → `limaLongWaitTimeout` = **5m** (`cmd/rdd/limavm.go:41`)

The "30s bucket" no longer exists. Update the PR description (and either the `99aa4e3` commit message via amend or a note in `8aa55e4`'s body) so reviewers and `git log` readers see the current contract. Also: `8aa55e4`'s body lists the constant rename but does not name `service.Stop()` and `service.WaitWithTimeout()` as removed library-surface changes (raised in round-1 S7).

(suggestion, gap)

---

**S5. `docs/design/cmd_service.md` does not link `svc delete`'s wait semantics back to `svc stop`** — `docs/design/cmd_service.md:111-121`

```
### `rdd service delete`

Stops the control plane and removes the instance directory, the short directory
(which contains the Lima home), and — unless `RDD_KEEP_LOGS` is set — the log
directory.

Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place.
```

The `99aa4e3` commit body says it "folds the shutdown-wait semantics of `svc delete` into a reference to `svc stop`," but the actual doc has no cross-reference — the user has to know that `svc delete` calls into `svc stop`'s mechanism to understand which signal is sent (SIGINT vs CTRL_BREAK_EVENT, fallback to Kill, etc.). A one-line "See [svc stop](#rdd-service-stop) for the signal/fallback details" would close the gap without duplicating the content.

(suggestion, gap)

---

**S6. `serviceStartAction` start-while-already-running path skips the discovery-ConfigMap freshness wait** — `cmd/rdd/service.go:270-279`

```go
if service.Running() {
    logrus.Infof("%q control plane is already running", instance.Name())
    if !wait {
        return nil
    }
    logrus.Infof("waiting for %q control plane to be ready", instance.Name())
    ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
    defer cancel()
    return timeoutError(service.Wait(ctx))
}
```

`ensureServiceRunning` (line 104-110) calls both `service.Wait(ctx)` **and** `waitForDiscoveryConfigMap(ctx)` when the service is running, but `serviceStartAction`'s analogous branch only calls `service.Wait(ctx)`. The implicit discovery check inside `service.Wait` (via `getRuntimeControllersFromCluster`) does not verify the `ReadyAnnotation` that `MarkControlPlaneReady` sets after every controller has registered — it only checks that the discovery ConfigMap exists with a controllers list. In a steady-state instance this is usually fine, but during an in-progress startup race (two CLI processes calling `rdd svc start` concurrently while the daemon is still registering controllers) `serviceStartAction` can return success while `ensureServiceRunning` would still wait.

This pre-dates `99aa4e3` (the wait-flag refactor only changed the timeout plumbing), so it is a pre-existing gap rather than a regression. Worth a one-line fix while the surrounding code is being touched:

```diff
-    return timeoutError(service.Wait(ctx))
+    if err := service.Wait(ctx); err != nil {
+        return timeoutError(err)
+    }
+    return timeoutError(waitForDiscoveryConfigMap(ctx))
```

(suggestion, gap)

## Design Observations

### Strengths

- Wrapping deadline detection in a single `timeoutError(err)` helper at the CLI boundary keeps the `cliexit.Timeout` contract in one place, and `errors.Is(err, context.DeadlineExceeded)` correctly traverses the `fmt.Errorf("…: %w", err)` chain so callers can wrap freely without losing the classification.
- `watchtools.ContextWithOptionalTimeout` is used consistently at every wait site, so the "timeout=0 → no deadline" semantics live in one place.
- `8aa55e4` does the right tuning move: instead of one shared `defaultWaitTimeout` constant, it ties each constant to a concrete invariant in the doc comment (`startWaitTimeout` cites CRD establishment, `stopWaitTimeout` cites the 45s drain in `Run`). Future tuning is anchored to the right thing.
- Removing `--wait=false` from `svc delete` is the correct fix for round-1 I1 — there is no useful "non-blocking delete" with a writer holding the directory, and the doc explicitly explains the safety reason.

### Concerns

- **`StopWithWait` ignores the caller's context** — `pkg/service/cmd/service.go:365-370` constructs `context.Background()` internally. The function signature is `StopWithWait(wait bool, timeout time.Duration)` rather than the idiomatic `StopWithWait(ctx context.Context, wait bool, timeout time.Duration)`. In the current CLI this is a no-op (the parent `cmd.Context()` is itself `context.Background()` because `main.go` does not install signal handling), but if signal handling is ever added to the CLI, the wait will not be Ctrl-C interruptible. Future-proofing this would add a `ctx` parameter; the caller in `cmd/rdd/service.go:333` would then pass `cmd.Context()`. (future)
- **Pre-existing PID-recycling race in `StopWithWait`** — `pkg/service/cmd/service.go:339-362`: `Running()` returns true, then `pid := PID()` reads the file again. If the process exits and the PID file is removed between the two calls, `PID()` returns `PIDNotFound` (0), and `process.Interrupt(0)` on Unix becomes `unix.Kill(0, SIGINT)` — which sends SIGINT to **every process in the calling process's process group**, including the `rdd` CLI itself. Not introduced by this PR (the same race exists in the pre-PR code) but worth a separate fix. (future)

## Testing Assessment

Untested scenarios in priority order:

1. **`svc start --wait --timeout=…` deadline expiry returns exit code 4** — most novel path in the PR (the `--timeout` flag on `svc start` is brand new). See S2.
2. **`svc stop --wait --timeout=…` deadline expiry returns exit code 4** — the underlying `StopWithWait` was completely rewritten to use a context-driven loop (`pkg/service/cmd/service.go:364-393`); no test exercises the timeout-expiry branch.
3. **`svc delete --timeout=…` deadline expiry returns exit code 4 and leaves the directory in place** — the doc explicitly promises this behavior; there is no test.
4. **`lima delete --timeout=…` deadline expiry** — `cmd/rdd/limavm.go:556` wraps `watchUntilDeleted` errors through `timeoutError`; not exercised.
5. **`--timeout=0` (indefinite wait) succeeds when the desired state is eventually reached** — the entire `watchtools.ContextWithOptionalTimeout(ctx, 0)` branch is untested; a regression to `context.WithTimeout(ctx, 0)` (which times out immediately) would not be caught.
6. **`Delete` race when service exits between outer and inner `Running()` checks** (S1) — not testable without artificial process injection; mention in a comment instead.

## Documentation Assessment

- `docs/design/cmd_lima.md` is now consistent with the code (round-1 S1 resolved).
- `docs/design/cmd_service.md` correctly states SIGINT (round-1 S3 resolved) and explains the safety reason for dropping `--wait=false` on `svc delete` (round-1 I1 resolved).
- `--timeout=0` semantics ("waits indefinitely") are documented at every site. Good.
- Missing: cross-reference from `svc delete` to `svc stop` for the signal/fallback details (S5).
- Stale comment in `cmd/rdd/service.go:120-122`: "Otherwise the API server readiness poll and the ConfigMap freshness poll share the single deadline" — this only holds when `timeout > 0`. The first sentence ("Pass timeout 0 to wait indefinitely") covers the `0` case, but the "Otherwise" clause is correct, so this is a minor wording quibble.

## Commit Structure

`99aa4e3` and `8aa55e4` are correctly split: the first is the feature, the second is the round-1 review-feedback fix. Per the repo's convention (`MEMORY.md`: review-feedback commits may stay separate), this is the right structure for review. Before merging, consider squashing them into one commit since the second is purely a fix-up of the first; if squashed, the combined message should call out the `service.Stop()` / `service.WaitWithTimeout()` removals (round-1 S7, still unaddressed).

## Acknowledged Limitations

- `pkg/service/cmd/service.go:444-451` (in `preserveAllInstanceLogs`) already documents the Windows `os.Rename` / `FILE_SHARE_DELETE` limitation. The new "always wait before delete" semantics make this case rarer (the serve subprocess must be gone before `preserveAllInstanceLogs` runs), but the underlying QEMU-holds-the-handle case the comment names is independent of this PR.
- `cmd/rdd/limavm.go:39-41`: the `limaLongWaitTimeout` constant carries a justification comment; `8aa55e4` extends this convention to the new `startWaitTimeout` and `stopWaitTimeout` constants in `cmd/rdd/service.go:32-42`. Consistent with the round-1 S2 ask.
- Round-1 review (`b569217`) Verdict was "Merge with fixes" — the two fix items (I1, I2) are resolved in `8aa55e4`. I3 (test coverage) is partially resolved (lima paths covered; svc paths not — restated here as S2).

## Coverage Summary

- `bats/tests/33-lima-controllers/limavm-cli.bats` — **S2, S3** (added timeout tests cover lima paths only; `--wait` flag on delete is now redundant in two places)
- `cmd/rdd/limavm.go` — Reviewed, no issues (consistent flag definitions, all wait sites route through `timeoutError`)
- `cmd/rdd/service.go` — **S2, S6** (svc paths route correctly through `timeoutError` but lack test coverage; running-already branch skips discovery-ConfigMap freshness wait)
- `docs/design/cmd_lima.md` — Reviewed, no issues (matches code after `8aa55e4`)
- `docs/design/cmd_service.md` — **S5** (no cross-reference from `svc delete` to `svc stop` for signal details)
- `pkg/service/cmd/service.go` — **S1** (race between `Delete`'s outer `Running()` check and `StopWithWait`'s inner one returns an error and leaves the directory in place; self-heals on retry)
---

### [Codex GPT 5.4]

### Executive Summary

This PR pushes `rdd svc` and `rdd limavm` toward a consistent `--wait` / `--timeout` UX and the `cliexit.CodeTimeout` exit code, and the core timeout plumbing mostly lines up with that goal. The remaining problems are at the boundaries: `limavm` commands still leak raw startup timeouts as exit code `1`, and the documented “run `rdd service start` later to wait” flow can still return before controller registration is finished.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **`limavm` cold-start timeouts still bypass `cliexit.CodeTimeout`** — `cmd/rdd/limavm.go:260-266` `(important, gap)`

```go
func getKubeClient(ctx context.Context) (client.WithWatch, error) {
	if err := ensureServiceRunning(ctx); err != nil {
		return nil, err
	}
	config, err := service.GetKubeRestConfig()
```

Every wait-capable Lima command calls `getKubeClient()` at `cmd/rdd/limavm.go:260-266`; for example, `limaVMCreateAction()` returns the error directly at `cmd/rdd/limavm.go:347-350`, and `limaVMSetRunningAction()` does the same at `cmd/rdd/limavm.go:422-425`. `ensureServiceRunning()` now enforces a 90s startup deadline at `cmd/rdd/service.go:105-112`, but it returns the raw `context.DeadlineExceeded` from `service.Wait()` / `startAndWaitForReady()` instead of passing it through `timeoutError()` from `cmd/rdd/service.go:47-50`. That means `cmd/rdd/main.go:157-164` still treats a cold-start timeout as a generic error and exits `1`, so `rdd limavm ...` is not actually standardized to exit `4` when the control plane itself misses its startup deadline.

Fix: route `ensureServiceRunning()` failures through `timeoutError()` before they cross the CLI boundary. The smallest change is to wrap the `ensureServiceRunning()` error in `getKubeClient()`. A better fix is to centralize the wrapping inside `ensureServiceRunning()` / `startAndWaitForReady()` so every caller, including future commands, gets the same timeout semantics automatically.

I2. **`rdd svc start` can still report “ready” before controllers finish registering** — `cmd/rdd/service.go:270-278` `(important, gap)`

```go
if service.Running() {
	logrus.Infof("%q control plane is already running", instance.Name())
	if !wait {
		return nil
	}
	logrus.Infof("waiting for %q control plane to be ready", instance.Name())
	ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
	defer cancel()
	return timeoutError(service.Wait(ctx))
}
```

The new docs explicitly tell users to start with `--wait=false` and “Run `rdd service start` again with no options to wait” at `docs/design/cmd_service.md:93-95`, but the second call only invokes `service.Wait()` at `cmd/rdd/service.go:270-278`. `service.Wait()` at `pkg/service/cmd/service.go:315-331` delegates to `checkReadiness()`, and `checkReadiness()` treats “discovery ConfigMap exists but no controllers are registered yet” as ready-enough by returning `readiness.WaitForReadyWithCRDs(..., []base.Controller{}, false)` at `pkg/service/cmd/service.go:277-280`. That is weaker than the cold-start path in `cmd/rdd/service.go:123-139`, which also waits for the discovery ConfigMap freshness / ready annotation. As a result, the documented follow-up `rdd service start` can still return before the control plane has finished controller-manager registration and before the discovery data is marked ready.

Fix: make the “already running, now wait” path use the same readiness contract as cold start. For example, after `service.Wait(ctx)` succeeds, call `waitForDiscoveryConfigMap(ctx)` (or better, extract a single helper that both paths share) so both code paths require the discovery ready annotation before returning success.

#### Suggestions

S1. **Service shutdown documentation now contradicts itself on the delivered signal** — `docs/design/api_service.md:21` `(suggestion, gap)`

```md
The control plane will be notified to shut down, either by receiving a SIGTERM signal from the user (via `rdd service stop`), or by the OS preparing to shut down the host.
```

This architecture doc still says `rdd service stop` delivers `SIGTERM` at `docs/design/api_service.md:21`, but the updated command doc says `SIGINT` / `CTRL_BREAK_EVENT` at `docs/design/cmd_service.md:104-107`, and the implementation really does call `process.Interrupt(pid)` first at `pkg/service/cmd/service.go:358-362`. That contradiction matters because the shutdown contract is exactly what this PR is standardizing.

Fix: update `docs/design/api_service.md` to match the real control flow: `rdd service stop` sends `SIGINT` on Unix or `CTRL_BREAK_EVENT` on Windows, then falls back to forced termination if graceful shutdown cannot be initiated or does not complete in time.

### Design Observations

**Concerns**

1. **Two different “service ready” definitions are now encoded in two helpers** `(in-scope)`  
`service.Wait()` only proves API readiness plus best-effort controller discovery, while `startAndWaitForReady()` adds the discovery ready annotation check. Unifying those into one readiness helper would eliminate the mismatch behind I2 and make every future caller inherit the same semantics.

**Strengths**

1. **Timeout tagging is being moved to the CLI boundary instead of scattered through low-level helpers.**  
`timeoutError()` is a good shape for this problem; once the remaining paths also use it, the exit-code behavior will be much easier to keep consistent.

2. **`service.Delete(timeout)` is a safer API than the old unconditional delete.**  
Waiting for shutdown before removing instance state is the right direction for both Windows file-handle semantics and Unix PID-file correctness.

### Testing Assessment

1. No test exercises a `limavm` command that has to auto-start the service and then times out during that startup, so the exit-code path in I1 is currently uncovered.
2. No BATS test covers the documented `rdd svc start --wait=false` followed by `rdd svc start` flow, so the premature-success path in I2 is untested.
3. The PR changes `rdd svc start`, `rdd svc stop`, and `rdd svc delete` timeout behavior, but the added tests only cover `limavm` timeout cases. There is still no integration coverage for `svc --timeout=<finite>` or `svc --timeout=0`.

### Documentation Assessment

`docs/design/cmd_lima.md` and `docs/design/cmd_service.md` were updated consistently with the new flag defaults, but the repo still has a stale shutdown contract in `docs/design/api_service.md:21` (S1).

### Acknowledged Limitations

- **Code comment** — `bats/tests/20-service/2-create.bats:3`  
  `# TODO: test \`rdd svc start --wait=false\``  
  This limitation is now more important, because the PR leans harder on the `--wait=false` / “wait later” workflow without adding coverage for it.

### Coverage Summary

- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues
- `cmd/rdd/limavm.go` — I1
- `cmd/rdd/service.go` — I1, I2
- `docs/design/cmd_lima.md` — Reviewed, no issues
- `docs/design/cmd_service.md` — I2
- `pkg/service/cmd/service.go` — I2

---

### [Gemini 3.1 Pro]

### Executive Summary

This change successfully standardizes the `--wait` and `--timeout` semantics across the `rdd svc` and `rdd limavm` commands, ensuring predictable deadlines and an explicit exit code 4 when operations fail to settle. However, a broken test header currently corrupts the BATS suite, and the refactored shutdown sequence ignores the CLI's command context, breaking SIGINT (Ctrl+C) handling and causing the CLI to hang uninterruptibly during long waits. 

### Findings

#### Critical Issues

**C1. BATS test syntax corrupted by bad find-and-replace** — `bats/tests/33-lima-controllers/limavm-cli.bats:200`, `209`, `219` (critical, regression)

```bash
-    run -1 rdd limavm start --wait --timeout=3s "timeout-vm"
+    # with exit code 4 (cliexit.CodeTimeout) to match `rdd set`.
+    run -4 rdd limavm start --wait --timeout=3s "timeout-vm"
+    assert_output --partial 'context deadline exceeded'
+}
+
+ @pkg/util/tail/tail_test.go "lima create --start --timeout fails when VM cannot boot" {
```

The newly added tests use `@pkg/util/tail/tail_test.go` instead of the BATS `@test` header. This breaks the BATS parser, likely resulting from an accidental global find-and-replace in the author's editor. The new test blocks will either cause a syntax error or be completely skipped by the runner.

Fix: Revert the headers back to the standard BATS format.

```diff
- @pkg/util/tail/tail_test.go "lima create --start --timeout fails when VM cannot boot" {
+ @test "lima create --start --timeout fails when VM cannot boot" {
```

#### Important Issues

**I1. `StopWithWait` ignores CLI context, blocking SIGINT (Ctrl+C)** — `pkg/service/cmd/service.go:362-385` (important, regression)

```go
	if wait {
		ctx := context.Background()
		if timeout > 0 {
			var cancel context.CancelFunc
			ctx, cancel = context.WithTimeout(ctx, timeout)
			defer cancel()
		}
```

By creating its own `context.Background()`, `StopWithWait` ignores the caller's context (`cmd.Context()`). When a user runs `rdd svc stop --wait` or `rdd svc delete`, pressing Ctrl+C will be ignored by the wait loop, and the CLI will hang unresponsively until the 5-minute timeout expires. As noted in the system guidelines, `context.Background()` must be flagged in new code when a parent context is available.

Furthermore, if the wait loop receives a SIGINT via the caller's context, it should gracefully abort the wait without forcefully killing the background service (`process.Kill(pid)`). `Kill` should only trigger on actual deadline expiry.

Fix: Modify `StopWithWait` and `Delete` to accept the caller's context, wrapping it with the timeout in the CLI layer:
```diff
-func StopWithWait(wait bool, timeout time.Duration) error {
+func StopWithWait(ctx context.Context, wait bool) error {
```
Distinguish between user cancellation and timeout in the select block:
```go
			case <-ctx.Done():
				if errors.Is(ctx.Err(), context.DeadlineExceeded) {
					_ = process.Kill(pid)
				}
				return fmt.Errorf("wait aborted for %q control plane with pid %d: %w", instance.Name(), pid, ctx.Err())
```
In `cmd/rdd/service.go`, wrap `cmd.Context()` with the timeout before calling `StopWithWait` or `Delete`.

**I2. API server wait timeouts return code 1 instead of 4** — `cmd/rdd/service.go:104-111` (important, gap)

```go
	if service.Running() {
		ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return err
		}
		return waitForDiscoveryConfigMap(ctx)
	}
	return startAndWaitForReady(ctx, nil, startWaitTimeout)
```

The PR standardizes deadline expiries to code 4 (`cliexit.CodeTimeout`). However, `ensureServiceRunning` hardcodes `startWaitTimeout` (90s) and returns raw `context.DeadlineExceeded` errors instead of wrapping them in `timeoutError`. If the embedded API server fails to start during a `limavm start` or `limavm create` command, the CLI exits with code 1 instead of 4.

Fix: Wrap the returns in `timeoutError` to ensure the `cliexit.Timeout` wrapper correctly classifies the failure:

```diff
 		if err := service.Wait(ctx); err != nil {
-			return err
+			return timeoutError(err)
 		}
-		return waitForDiscoveryConfigMap(ctx)
+		return timeoutError(waitForDiscoveryConfigMap(ctx))
 	}
-	return startAndWaitForReady(ctx, nil, startWaitTimeout)
+	return timeoutError(startAndWaitForReady(ctx, nil, startWaitTimeout))
```

#### Suggestions

**S1. Inconsistent watch tool alias** — `cmd/rdd/limavm.go:28`, `cmd/rdd/service.go:22` (suggestion, enhancement)

`cmd/rdd/service.go` imports `k8s.io/client-go/tools/watch` as `watchtools`, while `cmd/rdd/limavm.go` imports it as `toolswatch`.

Fix: Standardize on `watchtools` across the CLI commands.

### Design Observations

None.

### Testing Assessment

1. **High Risk (Untested)**: The CLI's responsiveness to SIGINT (Ctrl+C) while waiting for the service to gracefully shut down. (Addressed by I1).
2. **Medium Risk (Untested)**: A cold boot where the API server fails to come up within the `ensureServiceRunning` timeout window, verifying the exit code behaves consistently with the VM timeouts. (Addressed by I2).
3. The new timeout tests correctly assert exit code 4 and "context deadline exceeded" for LimaVM operations that cannot progress.

### Documentation Assessment

The CLI documentation updates in `cmd_service.md` and `cmd_lima.md` correctly reflect the implemented flag defaults, timeout lengths, and exit code changes. 

### Acknowledged Limitations

- **Code comment**: "Kill targets only the service; on Windows (TerminateProcess) this avoids killing hostagents (child processes) if they're slow." (`pkg/service/cmd/service.go:381`)
  - The author explicitly acknowledges that `process.Kill(pid)` (which maps to `TerminateProcess` on Windows) leaves child `hostagent` processes orphaned if they are still running. This intentionally trades off the potential for orphaned hostagents against forcefully killing them.

### Coverage Summary

- `bats/tests/33-lima-controllers/limavm-cli.bats` — C1
- `cmd/rdd/limavm.go` — S1
- `cmd/rdd/service.go` — I2, S1
- `docs/design/cmd_lima.md` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `pkg/service/cmd/service.go` — I1

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-22 |
| **Commit** | `0b82624 cli: standardize --wait and --timeout across svc and limavm` (squashed `99aa4e3` + `8aa55e4` + round-2 fixes) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: `ensureServiceRunning` timeouts bypass the `cliexit.Timeout` envelope | Fixed |
| 2 | Suggestion #1: `Delete` leaves the directory in place when the process exits between outer and inner `Running()` checks | Fixed |
| 3 | Suggestion #2: Exit-code-4 contract remains asserted only on `lima` paths | Test written |
| 4 | Suggestion #3: `rdd svc start` on an already-running service returns before the `ReadyAnnotation` is set | Fixed |
| 5 | Suggestion #4: BATS tests still pass `--wait` explicitly on `limavm delete` | Fixed |
| 6 | Suggestion #5: PR description and `99aa4e3` commit body describe pre-`8aa55e4` default timeouts | Fixed (squash + new message) |
| 7 | Suggestion #6: `docs/design/cmd_service.md` does not cross-reference `svc delete` back to `svc stop` | Documentation updated |
| 8 | Suggestion #7: `docs/design/api_service.md` still describes `svc stop` as delivering `SIGTERM` | Documentation updated |
| 9 | Suggestion #8: Inconsistent alias for the client-go watch tools import | Fixed |
| 10 | Documentation Gap: stale "share the single deadline" comment at `cmd/rdd/service.go:120-122` | Documentation updated |
| 11 | Testing Gap #1: `svc start --wait --timeout` exit-code-4 | Skipped |
| 12 | Testing Gap #3: `svc delete --timeout` exit-code-4 | Skipped |
| 13 | Testing Gap #4: `--timeout=0` indefinite wait | Skipped |
| 14 | Testing Gap #5: `ensureServiceRunning` timeout during `rdd limavm *` | Skipped |
| 15 | Testing Gap #6: `--wait=false` → `rdd svc start` readiness workflow | Skipped |
| 16 | Testing Gap #7: pre-existing TODO at `bats/tests/20-service/2-create.bats:3` | Skipped |
