# Deep Review: 20260422-125913-pr-336

| | |
|---|---|
| **Date** | 2026-04-22 12:59 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 5 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#336](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/336) — cli: standardize --wait and --timeout across svc and limavm |
| **Branch** | `wait-timeout-cleanup` |
| **Commits** | `4bf0ec5` cli: standardize --wait and --timeout across svc and limavm |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — same SHA as Round 4 (`4bf0ec5`), so every Round 4 Important remains. Round 5 re-confirms both of them — (I1) `watchUntil` / `watchUntilDeleted` returning `nil` on a closed watch, which the new default `--wait=true` on `limavm delete` now exposes on the happy path, and (I3) `svc start`'s already-running branch calling `waitForDiscoveryConfigMap` without the `beforeStart` anchor the cold-start path uses — and surfaces one new Important: (I2) `StopWithWait`'s deadline branch now fires on parent-context cancellation too, but the error message still reads "timed out waiting for …" and `cliexit.Classify` routes it to exit 1, so a Ctrl-C during `svc stop --wait` prints a timeout message with a non-timeout exit code. Eight Suggestions follow, six of which restate unaddressed Round 4 items. No Critical. |
| **Wall-clock time** | `18 min 38 s` |

## Executive Summary

Round 5 reviews the same commit (`4bf0ec5`) as Round 4, so every finding carries over. The two Round 4 Importants land again from independent agent sessions: `watchUntil` / `watchUntilDeleted` at `cmd/rdd/limavm.go:485-497, 527-535` still return `ctx.Err()` when the watch channel closes before the condition is met — `ctx.Err()` is `nil` while the context is still valid, so every `limavm create/start/stop/restart/delete --wait` path silently reports success on a dropped watch (Codex + Gemini); and `cmd/rdd/service.go:271` still calls `waitForDiscoveryConfigMap(ctx)` in the already-running branch, which resolves to `waitForFreshDiscoveryConfigMap(ctx, time.Time{})` at `:132-134`, so the freshness check the cold-start path uses at `:122-128` is skipped (Codex). The PR description's "matching the cold-start path" claim is therefore partially incorrect as written.

The new Round 5 finding is `StopWithWait`'s deadline branch at `pkg/service/cmd/service.go:373-382`. Round 4 moved the wait onto `watchtools.ContextWithOptionalTimeout(ctx, timeout)`, which means `ctx.Done()` now fires on parent-context cancellation (`context.Canceled`) as well as timeout. The branch still kills the service and returns `"timed out waiting for … : context canceled"` — misleading wording, and `cliexit.Classify` only matches `context.DeadlineExceeded`, so Ctrl-C during a wait produces exit 1 with a "timed out" message. `Delete` at `:434` happens to route around this by gating its recovery on `errors.Is(err, context.DeadlineExceeded)` (cancellation falls through to the `Running()` check), but that safety is incidental.

Eight Suggestions follow, six restating unaddressed Round 4 items: `Delete` abandons the process it just SIGTERM'd; no BATS coverage for `svc delete --timeout`; `ensureServiceRunning` hardcodes a 90s ceiling regardless of the caller's `--timeout`; `svc delete --timeout`'s help text still reads "stop-and-wait"; `cliexit.Classify` can double-wrap on any path that passes an already-classified error through a second `Classify` call; and the already-running `svc start` branch has no "is ready" log on success. Two new Suggestions — `stopWaitTimeout`'s comment couples the default to unwritten LimaVM-drain orchestration, and `bats/tests/33-lima-controllers/limavm-cli.bats:193`'s `run -0 rdd limavm delete "restart-vm"` is now the only BATS `run -0` with no subsequent `${output}` / `${status}` check since the PR removed the trailing `rdd ctl wait` line — round out the list.

No Critical issues. Claude missed Codex/Gemini's I1 (watch fall-through) despite the pattern being spelled out in `deep-review-context.md` line 47 ("`for ev := range watcher.ResultChan() { ... } return ctx.Err()` loop converts a dropped watch into a spurious success"). Codex missed Claude's I2 (cancellation mislabel) but found the complementary I3 (discovery freshness).

---

## Critical Issues

None.

## Important Issues

I1. **`watchUntil` / `watchUntilDeleted` return `nil` on a closed watch, silently succeeding** — `cmd/rdd/limavm.go:485-497, 527-535` [Codex GPT 5.4] [Gemini 3.1 Pro]

```go
for ev := range watcher.ResultChan() {
    switch ev.Type {
    case k8swatch.Error:
        return apierrors.FromObject(ev.Object)
    case k8swatch.Deleted:
        return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
    }
    updated, ok := ev.Object.(*limav1alpha1.LimaVM)
    if ok && check(updated) {
        return nil
    }
}
return ctx.Err()
```

When the API server closes the watch before the condition holds — proxy timeout, graceful server shutdown, or a standard watch-expiry without an `Error` event — `ResultChan()` drains and the loop falls through. `ctx.Err()` is `nil` while the context is still valid, so the caller sees success and `limaVMDeleteAction` at `:553-557` logs `"LimaVM %q deleted"` though the resource may still exist.

Pre-PR, `limavm delete` defaulted to `--wait=false`, so the delete-wait path rarely ran. The PR flips the default to `--wait=true` at `cmd/rdd/limavm.go:254-255` and removes the test's `rdd ctl wait --for=delete` safety nets at `bats/tests/33-lima-controllers/limavm-cli.bats:129-131, 194-197`, which puts `watchUntilDeleted` on the happy path without a backup check. The exit-code-4 contract also relies on the timeout actually firing; a silent success converts that to exit 0.

Fix: return a distinct "watch closed before condition met" error on fall-through. For `watchUntilDeleted`, re-`Get` by name and UID before declaring success so `NotFound` or a changed UID confirms deletion; for `watchUntil`, re-`Get` once and re-evaluate `check` before returning success.

```diff
     if ok && check(updated) {
         return nil
     }
 }
-return ctx.Err()
+if err := ctx.Err(); err != nil {
+    return err
+}
+return fmt.Errorf("watch closed before condition met for LimaVM %q", key.Name)
```

*(important, regression)*

I2. **`StopWithWait`'s deadline branch mis-labels parent-context cancellation as a timeout** — `pkg/service/cmd/service.go:365-382` [Claude Opus 4.7]

```go
if wait {
    ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
    defer cancel()
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            _ = process.Kill(pid)
            return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
```

Pre-PR, the wait selected on `time.After(60 * time.Second)` with no context channel, so the `<-timeout` case only fired on genuine deadline expiry. The PR unifies the parent context with the timeout via `watchtools.ContextWithOptionalTimeout` (line 366), which makes `ctx.Done()` fire on parent-context cancellation as well. A user who presses Ctrl-C during `rdd svc stop --wait` now gets:

- the literal message `"timed out waiting for … : context canceled"` — the wording contradicts the cause;
- exit code 1, because `cliexit.Classify` only matches `context.DeadlineExceeded`.

`Delete` at `cmd/rdd/service.go:428-438` sidesteps the same ambiguity by gating its recovery on `errors.Is(err, context.DeadlineExceeded) || Running()`, so cancellation falls through to the `Running()` check. That safety is coincidental — `Delete`'s current call path happens to tolerate it.

Fix: branch on `ctx.Err()` to pick the right wording and classify the right code:

```diff
             case <-ctx.Done():
                 _ = process.Kill(pid)
-                return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
+                err := ctx.Err()
+                if errors.Is(err, context.DeadlineExceeded) {
+                    return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
+                }
+                return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
```

*(important, regression)*

I3. **Already-running `svc start --wait` skips the discovery ConfigMap freshness check** — `cmd/rdd/service.go:260-272` [Codex GPT 5.4]

```go
if service.Running() {
    logrus.Infof("%q control plane is already running", instance.Name())
    if !wait {
        return nil
    }
    logrus.Infof("waiting for %q control plane to be ready", instance.Name())
    ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
    defer cancel()
    if err := service.Wait(ctx); err != nil {
        return cliexit.Classify(err)
    }
    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
}
```

The PR description frames this branch as "matching the cold-start path". The cold-start path records `beforeStart := time.Now().Truncate(time.Second)` and waits for a ConfigMap whose `creationTimestamp` is at or after `beforeStart` (`cmd/rdd/service.go:117, 128`). The already-running branch instead calls `waitForDiscoveryConfigMap(ctx)`, which resolves to `waitForFreshDiscoveryConfigMap(ctx, time.Time{})` at `:132-134` — the zero time skips the freshness check per the comment at `:143`.

The serve process deletes and recreates the discovery ConfigMap inside `Run` at `pkg/service/cmd/service.go:751-753`, but that happens only after `readiness.WaitForReady` returns. If a prior serve exited uncleanly (no `CleanupDiscovery`), the stale ConfigMap survives with `ReadyAnnotation=true`, and a concurrent `rdd svc start --wait` that races the in-progress new serve can observe the stale CM (via both `service.Wait`'s `runtimeControllers` lookup and `waitForDiscoveryConfigMap`) before the new controllers register.

The race window is narrow, but the PR's stated goal is symmetry with the cold-start path, and the symmetry is incomplete.

Fix: anchor the freshness check to a marker the running service can expose — PID-file mtime, the process's `Start()` time via `os.Stat(instance.PIDFile())`, or a small startup timestamp file written by the serve process — and pass that to `waitForFreshDiscoveryConfigMap`. A small helper shared with cold-start keeps both paths honest.

*(important, gap)*

## Suggestions

S1. **`Delete` abandons the process it already SIGTERM'd** — `pkg/service/cmd/service.go:424-438` [Claude Opus 4.7] (Round 4 S6 unaddressed)

```go
if Running() {
    if err := StopWithWait(ctx, true, timeout); err != nil {
        if errors.Is(err, context.DeadlineExceeded) || Running() {
            return err
        }
    }
}
```

When the deadline expires, `StopWithWait` at `pkg/service/cmd/service.go:381` already calls `process.Kill(pid)`, so the process is almost always dead before the error returns. `Delete` still short-circuits on the `DeadlineExceeded` arm and leaves the directory in place; the user runs `rdd svc delete` a second time, which succeeds because `Running()` is now false. The two-step UX is awkward given the kill usually took effect.

Consider re-checking `Running()` after the error and continuing with cleanup if the service has exited. Keep `docs/design/cmd_service.md:117-122` in sync if you change the behaviour.

*(suggestion, enhancement)*

S2. **No BATS coverage for `svc delete --timeout` exit-code-4** — `bats/tests/20-service/` [Claude Opus 4.7] (Round 4 S3 partially unaddressed)

`6-timeout.bats` covers `svc stop --timeout=1ms`. `svc delete --timeout=1ms` has no equivalent, so the new branching in `Delete` at `pkg/service/cmd/service.go:428-438` (the `errors.Is(err, context.DeadlineExceeded) || Running()` guard and the "directory stays on timeout" contract) is asserted only in docs. A test that runs `rdd svc delete --timeout=1ms`, asserts exit 4, and confirms the instance directory survives (e.g. via `rdd svc status` reporting `created: true`) would pin the contract down.

*(suggestion, gap)*

S3. **`ensureServiceRunning` hardcodes a 90s ceiling regardless of caller `--timeout`** — `cmd/rdd/service.go:88-103` [Claude Opus 4.7] (Round 4 S1 related)

```go
if service.Running() {
    ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
    defer cancel()
    ...
}
return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
```

`ensureServiceRunning` is called transitively by `rdd set`, every `rdd limavm *`, `rdd kubectl`, and `rdd service config`. It always applies the 90s `startWaitTimeout`, so `rdd limavm create --timeout=0` (user asks for indefinite wait) can still fail with exit 4 at 90s if the service is slow coming up. The PR framing that `--timeout` now flows uniformly through the autostart helper is not quite accurate — the user's timeout does not propagate through the start phase.

This is not a regression (the old code had the same 90s cap) but the function is already being reshaped. At minimum, the docstring at `cmd/rdd/service.go:31-33` should note that the 90s cap is independent of the caller's `--timeout`. Standardising on `watchtools.ContextWithOptionalTimeout` (Round 4 S1) would also align the style with every sibling wait in this PR.

*(suggestion, enhancement)*

S4. **`svc delete --timeout` help text reads "stop-and-wait" — jargon mismatched with siblings** — `cmd/rdd/service.go:354` [Claude Opus 4.7] (Round 4 S8 unaddressed)

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
```

Every other `Duration("timeout", …)` in this PR uses `"Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)"`. `svc delete` drops `--wait`, which is why the wording diverges, but "stop-and-wait" reads as a noun phrase rather than a knob description. The docs at `docs/design/cmd_service.md:117-122` phrase it clearly.

```diff
-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for waiting for the control plane to exit before removing files (0 means wait indefinitely)")
```

*(suggestion, code clarity)*

S5. **`stopWaitTimeout` comment couples the default to unwritten future behaviour** — `cmd/rdd/service.go:36-42` [Claude Opus 4.7]

```go
// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration
// lands, the drain will sequence each VM's shutdown, so the CLI deadline must
// match the per-VM ceiling in [limaLongWaitTimeout]. A multi-VM instance may
// still exceed this budget while shutdownAllHostagents drains sequentially.
```

Today the service drains for 45s at `pkg/service/cmd/service.go:832`, so `5m` is roughly 6× the actual worst case. The rationale is tied to "when LimaVM stop orchestration lands" — unwritten work whose shape may differ when it arrives. Rot risk is real: if that change never lands, or lands with a different budget, this comment points future readers at a policy that was never implemented.

A shorter rationale that states the current policy ("the CLI waits longer than the service drain so slow disks or misconfigured hosts don't trip the default") survives code drift. The coupling-to-`limaLongWaitTimeout` constant can stay; the comment just needs to justify today's value.

*(suggestion, code clarity)*

S6. **BATS `run -0` with no output/status check on `limavm delete "restart-vm"`** — `bats/tests/33-lima-controllers/limavm-cli.bats:193` [Gemini 3.1 Pro]

```bash
    # Delete the VM
    run -0 rdd limavm delete "restart-vm"
}
```

The PR removed the trailing `rdd ctl wait --for=delete "limavm/restart-vm" …` line (diff hunk at `-@@ -194,7 +191,6 @@`), leaving `run -0` followed only by the closing brace. Per the repo BATS style rules in `deep-review-context.md`:

> Use `run -0` only when the next line reads `${output}` or `${status}`. For commands whose success is the entire assertion (setup, cleanup, fixture calls), invoke them directly.

The symmetric fix at line 94 (`run -0 rdd limavm delete --wait "start-vm"` → `rdd limavm delete "start-vm"`) was applied; the `restart-vm` case below was missed.

```diff
     # Delete the VM
-    run -0 rdd limavm delete "restart-vm"
+    rdd limavm delete "restart-vm"
 }
```

*(suggestion, style)*

S7. **`cliexit.Classify` can double-wrap when called on an already-classified error** — `pkg/cli/exit/exit.go:53-60` [Gemini 3.1 Pro] (Round 4 S5 unaddressed)

```go
func Classify(err error) error {
    if errors.Is(err, context.DeadlineExceeded) {
        return Timeout(err)
    }
    return err
}
```

`errors.Is` unwraps, so `Classify(*Error{Code: CodeTimeout, Err: someDeadlineErr})` returns `*Error{Code: CodeTimeout, Err: *Error{Code: CodeTimeout, Err: someDeadlineErr}}`. The CLI driver still resolves the correct exit code via `errors.As`, but the nested Error structure is noisy in logs and defeats idempotence.

No current call path triggers the double-wrap (audited `ensureServiceRunning` → `getAppKubeClient` → `setAction` and every `svc *` / `limavm *` caller), but the call sites are growing, and the helper is a natural target for future refactors that may chain. A short-circuit on an existing `*Error{Code: CodeTimeout}` is cheap insurance:

```diff
 func Classify(err error) error {
+    var exitErr *Error
+    if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
+        return err
+    }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }
```

*(suggestion, enhancement)*

S8. **Already-running `svc start` branch has no "is ready" log on success** — `cmd/rdd/service.go:260-272, 293-297` (Round 4 S2/S4 unaddressed)

The cold-start branch logs `"%q control plane is ready"` at `:297` after the wait succeeds. The already-running branch waits on the same readiness signal (`service.Wait` + `waitForDiscoveryConfigMap`) but returns silently on success — a user running `rdd svc start` against a running-but-still-initialising service sees the `"waiting for … to be ready"` line and then nothing. Mirror the cold-start log so both paths print the same terminator.

*(suggestion, observation)*

## Design Observations

### Concerns

- **Readiness contract is split between `service.Wait` and CLI-side discovery freshness checks** `(in-scope)` [Codex GPT 5.4]. `service.Wait` inspects the discovery ConfigMap via `getRuntimeControllersFromCluster` without checking `creationTimestamp`; the CLI then layers `waitForFreshDiscoveryConfigMap` on top. Folding "current startup generation" into one library helper would remove the branch skew behind I3 and give every caller a single definition of "ready".
- **`limaVMCreateAction` leaks the managed ConfigMap when `--start --wait` hits the deadline** `(in-scope)` [Claude Opus 4.7]. After `takeOwnership` runs, `createdConfigMap` is cleared and the deferred cleanup does not fire; if the watch then hits `DeadlineExceeded`, the ConfigMap stays and only `rdd limavm delete` removes it. The behaviour is intended (the LimaVM owns the CM), but the new 5m default on `--start --wait` makes this path hit more often than the pre-PR implicit `--timeout=0`. The BATS test at `bats/tests/33-lima-controllers/limavm-cli.bats:208-217` already relies on namespace teardown to reap the LimaVM; no change required, but worth a note in the user-facing docs that a timed-out `--start --wait` leaves the template ConfigMap behind.

### Strengths

- **Centralising deadline-to-exit-code translation in `cliexit.Classify`** [Claude Opus 4.7]. Collapses three copies of the `if errors.Is(err, context.DeadlineExceeded) { return cliexit.Timeout(err) }` boilerplate (`cmd/rdd/set.go:322-326` is the clearest example) and makes every wait path self-consistent.
- **`watchtools.ContextWithOptionalTimeout` standardisation** [Claude Opus 4.7]. Renaming `toolswatch` → `watchtools` across `cmd/rdd/limavm.go`, `cmd/rdd/service.go`, and `pkg/service/cmd/service.go` gives `timeout=0` a single documented meaning and makes text-search across the codebase reliable.
- **Threading caller context through `StopWithWait(ctx, wait, timeout)` and `Delete(ctx, timeout)`** [Claude Opus 4.7]. The old API could not honour Ctrl-C or a caller deadline. Exposing I2 was the cost of making the right architectural choice.
- **Keeping `pkg/cli/exit` library-like and applying `Classify` only at the CLI boundary** [Codex GPT 5.4]. `pkg/service` stays free of CLI exit-code semantics; the CLI gets a uniform contract.
- **`svc delete` stops the service before removing state** [Codex GPT 5.4]. Correct for both Windows file semantics and Unix PID-file mutual exclusion, and the comment at `pkg/service/cmd/service.go:419-422` documents why.

## Testing Assessment

- The new timeout contract still lacks direct entry-point coverage for `rdd svc start --timeout`, `rdd svc delete --timeout`, and autostart callers such as `rdd svc config`, `rdd kubectl`, or `rdd set`. The assertion that every wait-capable command exits 4 on deadline holds only for the paths the BATS suite exercises, and only `rdd set` (existing) and the three new `limavm` tests cover it.
- No test exercises `watchUntil` / `watchUntilDeleted` with a closed watch channel before the condition fires (I1). The bug can ship green because the default test templates reach `Running=True` quickly enough that the watch stays open.
- No test for the I3 startup race: one process starting the service while another runs `rdd svc start --wait` against a stale discovery ConfigMap. Hard to reproduce deterministically without harness-level hooks, but a unit test that drives `waitForFreshDiscoveryConfigMap` with a pre-seeded stale ConfigMap would at least pin the freshness contract.
- No test forces a Ctrl-C through `rdd svc stop --wait` to cover I2. An acceptable substitute: a unit test on the `StopWithWait` loop body with a cancelled context that asserts the error message and the classified exit code.

## Documentation Assessment

- `docs/design/cmd_service.md` and `docs/design/cmd_lima.md` reflect the new defaults and the exit-4 contract cleanly.
- `docs/design/api_service.md:21` now names both SIGINT (primary) and SIGTERM (fallback) on Unix, matching `pkg/service/cmd/service.go:359-363`.
- `cmd/rdd/service.go:131` still reads "waitForDiscoveryConfigMap polls until the discovery ConfigMap exists" — the helper at `:144-168` also requires `ReadyAnnotation == "true"`. Minor doc drift; tighten the sentence to match the implementation.

## Commit Structure

Single commit (`4bf0ec5`), self-contained, message accurate. Round 5 reviews the same SHA as Round 4 — no amend or rebase since.

## Acknowledged Limitations

- `cmd/rdd/service.go:40-41` comment acknowledges that a multi-VM instance may exceed the 5m `stopWaitTimeout`. S5 argues for rewording but not changing the value.
- `bats/tests/20-service/2-create.bats:3` carries a pre-existing `# TODO: test rdd svc start --wait=false` comment that the PR does not address.

## Unresolved Feedback

None — the PR carries no inline review comments.

## Agent Performance Retro

### [Claude]

Strongest on narrow, nuanced correctness: the `StopWithWait` cancellation mislabel (I2) was a Round 5 original that neither other agent raised, and required noticing the semantic shift when Round 4 moved the wait onto `watchtools.ContextWithOptionalTimeout`. Six Suggestions — most restate Round 4 items that remain unaddressed — plus a useful design note on the `limaVMCreateAction` ConfigMap leak under `--start --wait` deadline expiry. Missed I1 (`watchUntil` fall-through) despite `deep-review-context.md` line 47 spelling out the exact `for ev := range ... return ctx.Err()` anti-pattern. The repo-context rule was loaded into the prompt; Claude did not pattern-match to it. Also missed I3 (Codex's discovery-freshness gap).

### [Codex]

Tight, high-signal output: two Important findings and zero Suggestions. Both Importants are real — I1 (watch fall-through) and I3 (discovery freshness) — and I3 was an original Round 5 contribution grounded in reading the PR description against the code ("matches the cold-start path" vs the zero-time call). The Testing Assessment was the most useful of the three: named the I1 and I3 test gaps and called out the missing `rdd svc start/delete --timeout` and autostart-caller coverage. Tried to run BATS but found `bats/lib/bats-core` empty in the worktree — no agent can run BATS against a fresh worktree until the helper populates submodules; this is a deep-review tooling gap, not a Codex bug.

### [Gemini]

Three findings — two Important and one Suggestion — with the coverage-summary format the retro table expects. Caught I1 cleanly alongside Codex and correctly labelled it a regression based on the default flip. Flagged the BATS `run -0` style violation at `bats/tests/33-lima-controllers/limavm-cli.bats:193` (downgraded here to a Suggestion since `run -0` already asserts exit 0, so the functional behaviour is equivalent — only the repo style rule is violated). The Classify double-wrap Suggestion restates Round 4 S5. Gemini's strength this round was the coverage grid: short, scanner-friendly, and both I1/I2 (its own numbering) mapped to files without hedging.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 8m 20s | 8m 12s | 3m 01s |
| Findings | 1I 5S | 2I | 1I 2S |
| Tool calls | 31 (Read 12, Bash 11, Grep 8) | 53 (shell 51, stdin 2) | 14 (grep_search 7, read_file 4, run_shell_command 3) |
| Design observations | 1 | 1 | 0 |
| False positives | 0 | 0 | 0 |
| Unique insights | 2 | 1 | 1 |
| Files reviewed | 10 | 10 | 10 |
| Coverage misses | 1 | 0 | 0 |
| **Totals** | **1I 5S** | **2I** | **1I 2S** |
| Downgraded | 0 | 0 | 1 (I→S) |
| Dropped | 1 | 0 | 0 |


**Reconciliation** — Gemini raised the BATS `run -0` issue at Important severity; downgraded to Suggestion (S6) in consolidation. `run -0` without a follow-up `${output}` / `${status}` check is a style violation, not a functional failure: `run -0` already asserts exit 0, so the command still fails the test if it fails. The repo BATS style rules document the convention for discoverability and consistency — important to keep, but not the same weight as a correctness bug.

Overall: Codex produced the highest value density — two valid Importants from a short output. Claude carried the most volume and a valuable original in I2 but missed a finding the prompt had pre-armed it for. Gemini's coverage grid and regression labelling were the best of the three. No agent made a false positive that survived consolidation.

## Review Process Notes

### Repo context updates

- **[repo] Re-reviewing the same commit is a common operating mode.** Five rounds against the same SHA (`4bf0ec5`) on this PR — Round 5 is a re-review of Round 4's target. A note in `deep-review-context.md` steering agents to open the prior round's `.reviews/*pr-N*.md` (if any) and list *unaddressed* findings before generating new ones would cut prompt-to-answer latency on re-reviews and reduce duplicate Suggestions. The orchestrator already does this in the prior-round gate; arming agents to do it in parallel would produce richer first-pass output.
- **[repo] Pre-populate BATS submodules in agent worktrees.** Codex attempted `bin/rdd` + BATS validation but hit an empty `bats/lib/bats-core`. The worktree setup in `deep-review-setup` does not init submodules; a future enhancement would let any agent actually run the suite against its own code read. Until then, the live-verification note in `deep-review-context.md` should explicitly say BATS is not available in agent worktrees and agents should use unit tests or the built `rdd` binary only.

## Appendix: Original Reviews

### [Claude] — Pass 1

Let me check a few more specific items.

Now I have enough context. Let me write the review.

# Deep Review: PR #336 — standardize `--wait` and `--timeout`

## Header

| | |
|--|--|
| **PR** | #336 — cli: standardize --wait and --timeout across svc and limavm |
| **Commit under review** | 4bf0ec5 |
| **Verdict** | Approve with one important fix (I1). Minor observations below. |

## Executive Summary

The PR uniformly adds `--wait=true` and `--timeout` defaults across `rdd svc` and `rdd limavm` and routes `context.DeadlineExceeded` through a new `cliexit.Classify` helper so every command now exits 4 on deadline expiry. The refactor is largely correct; the only meaningful issue is a misleading error-message shape in `StopWithWait` when the parent context is cancelled (not timed out), and a behaviour subtlety in `Delete` where a real kill on deadline expiry still leaves stale files behind.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `StopWithWait` mis-labels context cancellation as a timeout** — `pkg/service/cmd/service.go:371-382`

```go
for {
    select {
    case <-ctx.Done():
        // Graceful shutdown timed out; terminate so we don't leave
        // a hung service process. ...
        _ = process.Kill(pid)
        return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
```

Previously the wait listened only to a `time.After(60s)` channel (no context selector), so the `<-...` case could only fire on genuine deadline expiry. The new code unifies the parent context with the timeout via `watchtools.ContextWithOptionalTimeout` (line 366), which means `ctx.Done()` now also fires on parent cancellation (`context.Canceled`). When that happens the error still says "timed out waiting for ... : context canceled", and — more importantly — `cliexit.Classify` only wraps `DeadlineExceeded`, so a genuine cancellation now exits 1 with the message claiming a timeout. The user-visible wording contradicts the cause.

This is the same issue that the unwrap logic in `serviceDeleteAction` quietly sidesteps (line 434 uses `errors.Is(err, context.DeadlineExceeded)` to decide whether to surface, so cancellation takes the `Running()` fallback path) — that is only safe by coincidence of `Delete`'s current call pattern.

Fix: branch on `ctx.Err()` to produce the right message, and make the error classifiable so `Delete` and `Classify` behave predictably.

```diff
             case <-ctx.Done():
-                _ = process.Kill(pid)
-                return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
+                _ = process.Kill(pid)
+                err := ctx.Err()
+                if errors.Is(err, context.DeadlineExceeded) {
+                    return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
+                }
+                return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)
```

(important, regression)

### Suggestions

**S1. `Delete` abandons a process it has already killed** — `pkg/service/cmd/service.go:424-438`

```go
if Running() {
    if err := StopWithWait(ctx, true, timeout); err != nil {
        if errors.Is(err, context.DeadlineExceeded) || Running() {
            return err
        }
    }
}
```

When the deadline expires, `StopWithWait` has already sent SIGINT then SIGTERM (line 381 `_ = process.Kill(pid)`), so the process is usually dead by the time the error is returned. `Delete` then short-circuits on the `DeadlineExceeded` check at line 434 and returns `err` without removing any files. The user sees exit 4, re-runs `rdd svc delete`, and succeeds on the second try because `Running()` is now false.

Documented behaviour per `docs/design/cmd_service.md:117-122`, but the two-step UX is awkward given the process is almost always already dead. Consider re-checking `Running()` after the SIGTERM — if the service actually exited, proceed with cleanup. Note that the current comment in `Delete` ("Deadline expiry must surface ... per docs/design/cmd_service.md") describes the docs' contract; keep both consistent if you change the behaviour.

(suggestion, enhancement)

**S2. No coverage for `svc delete --timeout` timeout branch** — `bats/tests/20-service/`

`6-timeout.bats` covers `svc stop --timeout=1ms` but there is no test for `svc delete --timeout=1ms`. `Delete` has new branching logic (the `errors.Is(err, context.DeadlineExceeded) || Running()` guard at `pkg/service/cmd/service.go:434`) that no BATS test exercises, and the contract "deadline expiry leaves the directory in place" is asserted only in docs.

Fix: add a BATS test mirroring `6-timeout.bats` that runs `rdd svc delete --timeout=1ms`, asserts exit 4, and also asserts that the instance directory still exists afterwards (e.g. via `rdd svc status` reporting `created: true`).

(suggestion, gap)

**S3. `ensureServiceRunning` hardcodes a 90 s ceiling regardless of user `--timeout`** — `cmd/rdd/service.go:88-103`

```go
if service.Running() {
    ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
    defer cancel()
    ...
}
return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
```

`ensureServiceRunning` is called transitively by `rdd set`, every `rdd limavm *`, `rdd kubectl`, and `rdd service config`. It always applies the 90 s `startWaitTimeout`, so `rdd limavm create ... --timeout=0` (user asks for indefinite wait) can still fail with exit 4 at 90 s if the service is slow to come up. The PR description claims uniform `--timeout` routing through this helper, but the user-configured timeout is not actually propagated here.

This is not a regression (the old code had the identical 90 s hard cap) but it is an opportunity the PR could address while it is already reshaping this function. At minimum, the docstring at `cmd/rdd/service.go:31-33` could note explicitly that the 90 s cap is independent of any caller's `--timeout`.

(suggestion, enhancement)

**S4. `newServiceDeleteCommand` timeout flag description diverges from the `stop` wording** — `cmd/rdd/service.go:354`

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
```

Every other `Duration("timeout", ...)` definition in this PR uses the wording `"Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)"`. `delete` intentionally drops `--wait`, but "stop-and-wait" reads awkwardly. A clearer phrasing matches the docs:

```diff
-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for waiting for the control plane to exit before removing files (0 means wait indefinitely)")
```

(suggestion, code clarity)

**S5. `stopWaitTimeout` comment names unwritten future behaviour** — `cmd/rdd/service.go:36-42`

```go
// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration
// lands, the drain will sequence each VM's shutdown, so the CLI deadline must
// match the per-VM ceiling in [limaLongWaitTimeout]. ...
```

Today the service drains for 45 s (`pkg/service/cmd/service.go:832`), so `5m` is ~6× the actual worst case. Tying the CLI default to unwritten LimaVM-drain work ("when LimaVM stop orchestration lands") couples this constant to a future change. If that change never lands, or lands with a different budget, the rationale here rots. A briefer comment that states the current policy ("the CLI waits longer than the service drain so misconfigured hosts or slow disk I/O don't trip the default") would survive code drift.

(suggestion, code clarity)

**S6. `limaVMCreateAction` can stall a cleanup goroutine when the wait hits `DeadlineExceeded`** — `cmd/rdd/limavm.go:369-373, 409-420`

```go
defer func() {
    if createdConfigMap != nil {
        logrus.Warnf("Cleaning up created ConfigMap %q", createdConfigMap.Name)
        _ = c.Delete(ctx, createdConfigMap)
    }
}()
...
if start && wait && !dryRun {
    waitCtx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
    defer cancel()
    ...
    if err := watchUntil(waitCtx, c, key, func(vm *limav1alpha1.LimaVM) bool {
```

Not a regression, but worth noting while the wait is being touched: after `takeOwnership` succeeds (line 405), `createdConfigMap` is set to nil, so the deferred cleanup is disabled. But if `watchUntil` later returns a `DeadlineExceeded` error (new code path under the new default timeout), the LimaVM owns the ConfigMap and only LimaVM deletion will remove it. In the `--start --wait --timeout=3s` failure case the user sees exit 4; the ConfigMap stays; `rdd limavm delete` is required to clean up. That is the intended behaviour, but the new 5m default means users will now hit this path much more often than before, when the implicit `--timeout=0` rarely expired in practice.

No change required — just verify the BATS test at `bats/tests/33-lima-controllers/limavm-cli.bats:208-217` leaves the test namespace clean via the existing `local_teardown()` namespace teardown (it does, since the LimaVM is reaped with the namespace).

(suggestion, observation)

## Design Observations

### Strengths

- Centralising deadline → exit-code-4 translation in `cliexit.Classify` is a clean refactor; it collapses three instances of the `if errors.Is(err, context.DeadlineExceeded) { return cliexit.Timeout(err) }` boilerplate (see `cmd/rdd/set.go:322-326` simplification) and makes every wait path self-consistent.
- Renaming the import alias from `toolswatch` to `watchtools` and standardising on `watchtools.ContextWithOptionalTimeout` across `cmd/rdd/limavm.go`, `cmd/rdd/service.go`, and `pkg/service/cmd/service.go` gives `timeout=0` a single documented meaning.
- Moving the stop wait into the caller's context (`StopWithWait(ctx, wait, timeout)`) is the right direction even though it exposes finding I1 above — the old API had no hope of handling Ctrl-C or propagating a caller's deadline.

## Testing Assessment

1. **No test for `svc delete --timeout` expiry** (S2 above): the `Delete` function has new "suppress race vs. surface timeout" branching that no BATS test exercises.
2. **No test for `svc stop --timeout` without `--wait`**: ensures `--timeout` is silently ignored when `--wait=false`. Low priority — consistent with the shared flag description — but would pin down the documented contract.
3. **No test for `rdd set` or `rdd limavm *` exit-code-4 paths from `ensureServiceRunning` timeout**: the PR claims uniformity across every autostart-capable command, but only `rdd set` (existing `bats/tests/32-app-controllers/`) and the three new `33-lima-controllers/limavm-cli.bats` tests cover this. A test that deliberately stalls the service start (e.g. by stubbing the discovery ConfigMap) would exercise the `cliexit.Classify(startAndWaitForReady(...))` path.

## Documentation Assessment

- `docs/design/cmd_service.md` and `docs/design/cmd_lima.md` reflect the new defaults and the exit-4 contract. Good.
- `docs/design/api_service.md:21` now names both SIGINT (primary) and SIGTERM (fallback) — matches `pkg/service/cmd/service.go:359-363`.
- `cmd/rdd/service.go:105-112` doc-comment for `startAndWaitForReady` is accurate with the new `timeout` parameter.

No gaps.

## Commit Structure

Single commit, self-contained, message is accurate. Good.

## Acknowledged Limitations

None — there are no TODO comments or author responses to prior findings in this PR.

## Coverage Summary

| File | Status |
| --- | --- |
| `bats/tests/20-service/6-timeout.bats` | Reviewed, no issues |
| `bats/tests/33-lima-controllers/limavm-cli.bats` | S6 (observation) |
| `cmd/rdd/limavm.go` | S6 (observation) |
| `cmd/rdd/service.go` | S3, S4, S5 |
| `cmd/rdd/set.go` | Reviewed, no issues (simplification via `cliexit.Classify`) |
| `docs/design/api_service.md` | Reviewed, no issues |
| `docs/design/cmd_lima.md` | Reviewed, no issues |
| `docs/design/cmd_service.md` | Reviewed, no issues |
| `pkg/cli/exit/exit.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | I1, S1, S2 |
### [Codex] — Pass 1

### Executive Summary

This PR standardizes `--wait` / `--timeout` handling and exit-code-4 timeout classification across `rdd svc` and `rdd limavm`, and it updates the design docs and BATS coverage to match. The two material issues are that `limavm` still treats a dropped watch as a successful wait, which this change now exposes on the default `delete` path, and that the new already-running `svc start --wait` branch still accepts stale discovery state during startup races.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Dropped Lima watch channels are treated as successful waits** — `cmd/rdd/limavm.go:254-255,485-497,527-535,553-557` (important, regression)

```go
for ev := range watcher.ResultChan() {
    switch ev.Type {
    case k8swatch.Deleted:
        return nil
    case k8swatch.Error:
        return apierrors.FromObject(ev.Object)
    }
}
return ctx.Err()
```

`watchUntilDeleted()` at `cmd/rdd/limavm.go:527-535` and `watchUntil()` at `cmd/rdd/limavm.go:485-497` both fall through to `ctx.Err()`. When the watch channel closes before the condition is met but the context is still valid, `ctx.Err()` is `nil`, so callers treat the wait as success. That was already latent for the other waitable Lima commands (`cmd/rdd/limavm.go:233-236`, `413-416`, `453-456`), but this PR makes `delete` wait by default at `cmd/rdd/limavm.go:254-255`, and `limaVMDeleteAction()` at `cmd/rdd/limavm.go:553-557` now logs a successful deletion on that false `nil`. In practice, a watch expiry / proxy close / server shutdown can make plain `rdd limavm delete NAME` report success while the object still exists.

Fix: return a distinct error when `ResultChan()` closes before the condition is satisfied. For `watchUntilDeleted()`, re-`Get` the object by name and UID before returning success; only `NotFound` or a different UID should count as deleted. For `watchUntil()`, re-`Get` once on channel close and re-evaluate the predicate before deciding whether to return success or a `"watch closed before condition met"` error.

I2. **Already-running `svc start --wait` still trusts stale discovery data** — `cmd/rdd/service.go:265-271` (important, gap)

```go
logrus.Infof("waiting for %q control plane to be ready", instance.Name())
ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
defer cancel()
if err := service.Wait(ctx); err != nil {
    return cliexit.Classify(err)
}
return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
```

The cold-start path at `cmd/rdd/service.go:113-128` records `beforeStart` and waits for a fresh discovery ConfigMap, but the new already-running branch at `cmd/rdd/service.go:265-271` calls `waitForDiscoveryConfigMap()` instead. That helper is implemented as `waitForFreshDiscoveryConfigMap(ctx, time.Time{})` at `cmd/rdd/service.go:132-144`, and line 143 explicitly says zero time skips the freshness check. The service only deletes and recreates discovery after the apiserver is ready in `pkg/service/cmd/service.go:741-753`, so a second `rdd svc start --wait` racing with an in-progress startup can still observe the previous run’s ready ConfigMap and return before the current controllers register. That means the new branch does not actually “match the cold-start path” yet.

Fix: derive a startup marker for the currently running service and pass it into `waitForFreshDiscoveryConfigMap()`. The PID file mtime, process start time, or a startup timestamp persisted alongside the PID would all work; the important part is that the already-running path must reject discovery ConfigMaps older than the current service process.

#### Suggestions

None.

### Design Observations

**Concerns**

1. The readiness contract is now split between `service.Wait()` and CLI-side discovery freshness checks. Folding the “current startup generation” logic into a single service-layer helper would remove the branch skew behind I2 and make `svc start`, `ensureServiceRunning()`, and any future callers share one definition of “ready”. (in-scope)

**Strengths**

1. Keeping timeout-to-exit-code mapping in `pkg/cli/exit` and applying it at the CLI boundary is a good separation: `pkg/service` stays library-like, while the CLI still gets a uniform contract.

2. Making `svc delete` stop the service before removing instance state is the right safety direction for both Windows file semantics and Unix PID-file mutual exclusion.

### Testing Assessment

1. There is no regression test for the startup-race in I2: one command starting the service while another runs `rdd svc start --wait` against a stale discovery ConfigMap.
2. There is no test that forces `watchUntil()` / `watchUntilDeleted()` to see a closed watch channel before the condition is met, so I1 can ship green.
3. The new timeout contract still lacks direct entry-point coverage for `rdd svc start --timeout`, `rdd svc delete --timeout`, and an autostart caller such as `rdd svc config`, `rdd kubectl`, or `rdd set`.

I built `bin/rdd`, but I could not rerun the BATS suites from this checkout because `bats/lib/bats-core` is empty here.

### Documentation Assessment

The user-facing design docs added in this PR are consistent with the intended CLI contract. The remaining doc drift is internal: `cmd/rdd/service.go:131` still says `waitForDiscoveryConfigMap` only waits for the ConfigMap to exist, but `cmd/rdd/service.go:144-167` also requires `ReadyAnnotation == "true"`.

### Acknowledged Limitations

- **Code comment**: "`A multi-VM instance may still exceed this budget while shutdownAllHostagents drains sequentially.`" `cmd/rdd/service.go:40-41`. This change makes that limitation more likely to matter, because `5m` is now the default deadline for both `rdd svc stop` and `rdd svc delete`.
- **Code comment**: "`# TODO: test `rdd svc start --wait=false``" `bats/tests/20-service/2-create.bats:3`. This PR expands `svc start` semantics further (`--timeout` plus discovery readiness), so the existing start-command coverage debt remains relevant.

### Coverage Summary

- `bats/tests/20-service/6-timeout.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues
- `cmd/rdd/limavm.go` — I1
- `cmd/rdd/service.go` — I2
- `cmd/rdd/set.go` — Reviewed, no issues
- `docs/design/api_service.md` — Reviewed, no issues
- `docs/design/cmd_lima.md` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `pkg/cli/exit/exit.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — I2

### [Gemini] — Pass 1

### Executive Summary

This PR successfully unifies `--wait` and `--timeout` behavior across the `rdd svc` and `rdd limavm` subcommands. It consistently bounds shutdown, ensures code 4 exit on deadline expiration, and eliminates dangerous file-removal races during deletion. There are a few important gaps remaining, notably around watch loop fall-throughs producing spurious successes, and a BATS test style violation.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **BATS style violation for uninspected `run -0`** — `bats/tests/33-lima-controllers/limavm-cli.bats:193`

```bash
    # Delete the VM
    run -0 rdd limavm delete "restart-vm"
}
```

The diff removed the `rdd ctl wait` command immediately following this line, leaving `run -0` without an assertion on `${output}` or `${status}`. As mandated by the repository instructions, commands whose success is the entire assertion must be invoked directly to allow BATS's `set -e` to fail the test on non-zero exit. `run` swallows the exit code and allows the test to pass even if the command fails. You applied this fix correctly on line 94 (`rdd limavm delete "start-vm"`); it needs to be applied here as well.

Fix: invoke the command directly without `run -0`:
```diff
     # Delete the VM
-    run -0 rdd limavm delete "restart-vm"
+    rdd limavm delete "restart-vm"
 }
```
*(important, regression)*

I2. **Spurious success on dropped watch** — `cmd/rdd/limavm.go:485-497`

```go
	for ev := range watcher.ResultChan() {
        // ...
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}
	}
	return ctx.Err()
```

When a watch connection drops before the condition is met (e.g. due to proxy timeout, Kubernetes watch-expiry, or graceful server close), the channel closes. The loop falls through and returns `ctx.Err()`. Because the context itself has not expired, `ctx.Err()` evaluates to `nil`. As a result, wait operations (`rdd limavm create/start/stop --wait`) will incorrectly return exit code 0 when they silently drop the watch before the VM operation actually finishes.

Fix: Return a distinct error if the channel closes but the context is still valid. This also applies identically to `watchUntilDeleted` on line 538.
```diff
 		if ok && check(updated) {
 			return nil
 		}
 	}
-	return ctx.Err()
+	if err := ctx.Err(); err != nil {
+		return err
+	}
+	return fmt.Errorf("watch closed before condition met")
 }
```
*(important, gap)*

#### Suggestions

S1. **Redundant classification wrapping** — `pkg/cli/exit/exit.go:54-56`

```go
func Classify(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}
```

When `Classify` receives an error that is already an `*exit.Error` wrapping `context.DeadlineExceeded` (such as the error returned from `ensureServiceRunning`, which is then classified *again* in `setAction` and `serviceStartAction`), `errors.Is` will evaluate to `true` through `Unwrap`. This results in `Timeout(err)` double-wrapping the error into `*Error{Code: CodeTimeout, Err: *Error{...}}`. While `errors.As` handles recursive unwrapping perfectly fine later in the main CLI entrypoint, checking the root type prevents noisy nested structures if the error happens to be logged or printed.

Fix: Return the error unmodified if it is already an `exit.Error` with the correct code.
```diff
 func Classify(err error) error {
+	var exitErr *Error
+	if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
+		return err
+	}
 	if errors.Is(err, context.DeadlineExceeded) {
 		return Timeout(err)
 	}
```
*(suggestion, enhancement)*

### Design Observations

**Strengths**
- The file deletion sequence in `service.Delete()` robustly handles concurrency bounds. Skipping `os.RemoveAll` while `Running()` gracefully protects against the "zombie file-handle lock" on Windows while correctly surfacing the deadline expiry error to the CLI layer.
- `StopWithWait` successfully avoids falling back to `KillTree` when shutting down the service, relying on the instance lifecycle logic instead of risking ungraceful orphans via Windows process groupings.

### Testing Assessment

The test updates correctly assert on the new exit code 4 contract using `cliexit.CodeTimeout` during VM creation/restart timeouts and service stop expirations.

### Coverage Summary

- `bats/tests/20-service/6-timeout.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-cli.bats` — I1
- `cmd/rdd/limavm.go` — I2
- `cmd/rdd/service.go` — Reviewed, no issues
- `cmd/rdd/set.go` — Reviewed, no issues
- `docs/design/api_service.md` — Trivial
- `docs/design/cmd_lima.md` — Trivial
- `docs/design/cmd_service.md` — Trivial
- `pkg/cli/exit/exit.go` — S1
- `pkg/service/cmd/service.go` — Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-22 |
| **Commit** | `fca9713` cli: standardize --wait and --timeout across svc and limavm (amended) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: `watchUntil` / `watchUntilDeleted` return `nil` on a closed watch | Fixed |
| 2 | Important #2: `StopWithWait` mis-labels cancellation as timeout | Fixed |
| 3 | Important #3: already-running `svc start` skips discovery freshness check | Fixed |
| 4 | Suggestion #1: `Delete` abandons the SIGTERM'd process | Skipped |
| 5 | Suggestion #2: no BATS coverage for `svc delete --timeout` | Test written |
| 6 | Suggestion #3: `ensureServiceRunning` hardcodes 90s ceiling | Fixed |
| 7 | Suggestion #4: `svc delete --timeout` help text "stop-and-wait" | Fixed |
| 8 | Suggestion #5: `stopWaitTimeout` comment references unwritten future | Fixed |
| 9 | Suggestion #6: BATS `run -0` style violation on `limavm delete "restart-vm"` | Fixed |
| 10 | Suggestion #7: `cliexit.Classify` can double-wrap | Fixed |
| 11 | Suggestion #8: already-running `svc start` missing "is ready" log | Fixed |
| 12 | Testing Gap #1: entry-point coverage for `svc start --timeout` | Test written |
| 13 | Testing Gap #2: `watchUntil` / `watchUntilDeleted` closed-channel test | Skipped |
| 14 | Testing Gap #3: I3 startup race regression-guard | Skipped |
| 15 | Testing Gap #4: Ctrl-C through `StopWithWait` regression-guard | Skipped |
