# Deep Review: 20260422-115550-pr-336

| | |
|---|---|
| **Date** | 2026-04-22 11:55 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 4 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#336](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/336) — cli: standardize --wait and --timeout across svc and limavm |
| **Branch** | `wait-timeout-cleanup` |
| **Commits** | `4bf0ec5` cli: standardize --wait and --timeout across svc and limavm |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — Round 3's I1 (`Delete` timeout swallow) and I2 (`StopWithWait`/`Delete` caller context) are both cleanly resolved — `errors.Is(err, context.DeadlineExceeded) \|\| Running()` narrows the recovery, and both functions now accept `ctx`. Five of six Round 3 Suggestions landed (S1 → `cliexit.Classify`, S2/S6 → help-text updates, S4 → accurate SIGINT/SIGTERM wording, S5 → multi-VM caveat in comment); R3 S3 (`run -0` on the cleanup `rdd limavm delete "start-vm"`) turned out to be a false recommendation — the repo BATS style rules say to drop `run` on a command whose output is not inspected, so the current code is correct. Two new Important findings: (I1) `watchUntil` / `watchUntilDeleted` in `cmd/rdd/limavm.go` return `nil` when the watch channel closes before the context fires, so a dropped API-server watch silently reports success on any `limavm start/stop/restart/create/delete --wait`; (I2) the already-running `svc start` branch calls `waitForDiscoveryConfigMap` without the freshness anchor the cold-start path uses, so a stale `ReadyAnnotation=true` from a crashed prior instance can satisfy the wait. Nine small suggestions round out the list. |
| **Wall-clock time** | `37 min 38 s` |

## Executive Summary

Round 4 addresses every Round 3 Important finding and five of six Round 3 Suggestions. `Delete`'s over-broad recovery is now gated on `errors.Is(err, context.DeadlineExceeded) || Running()`, preserving the "exit 4, directory in place" contract on deadline expiry (R3 I1 resolved). `StopWithWait(ctx, wait, timeout)` and `Delete(ctx, timeout)` both accept the caller's context and derive their timed child via `watchtools.ContextWithOptionalTimeout`, closing the R3 I2 "`--timeout=0` is uninterruptible" exposure. The `timeoutError` helper is moved to `pkg/cli/exit/exit.go` as `Classify` and used from `set.go`, `service.go`, and `limavm.go` (R3 S1). `--timeout` help text on every `--wait`-capable command now says "ignored if --wait=false" (R3 S2), the `api_service.md` wording is now technically accurate ("SIGTERM as fallback" on Unix, R3 S4), and the `stopWaitTimeout` comment acknowledges the multi-VM ceiling (R3 S5).

Two new Important findings surface. First, `cmd/rdd/limavm.go:485-497` and `:527-535` (`watchUntil` / `watchUntilDeleted`) both return `ctx.Err()` when the watch channel closes — and `ctx.Err()` is `nil` whenever the context is still valid at that moment. If the API server closes the watch before the desired state arrives (proxy timeout, server restart, graceful close), the CLI returns exit 0 and logs "LimaVM started/stopped/…" without the status ever matching. This code is pre-existing (`776b8f77`, 2026-02-26) and not modified by this PR, but the PR widens its exposure: every `limavm` wait path now has a user-documented `--timeout=0` mode, and the new exit-code-4 contract depends on timeouts actually firing. Second, `cmd/rdd/service.go:271` still calls `waitForDiscoveryConfigMap(ctx)` in the already-running `svc start` branch, which skips the freshness anchor (`beforeStart`) that the cold-start path uses at `:122-128`. A stale `ReadyAnnotation=true` ConfigMap from a crashed prior serve can satisfy `service.Wait()` + `waitForDiscoveryConfigMap()` before the new serve has reached `InitDiscovery()` — the user sees exit 0 while controllers are still initializing. The R3 strength ("already-running branch now matches the cold-start path") is narrower than consolidated then.

Suggestions span diagnostics and polish: `service.Wait` and `waitForFreshDiscoveryConfigMap` surface bare `ctx.Err()`, so users see only "context deadline exceeded" with no hint of which wait timed out; `ensureServiceRunning` still uses `context.WithTimeout` instead of the `watchtools.ContextWithOptionalTimeout` the rest of the PR standardizes on; BATS coverage for `svc start --timeout` / `svc delete --timeout` exit-code-4 remains an intentional R2/R3 skip but a regression there would ship green; `cliexit.Classify` can double-wrap already-classified errors; the already-running `svc start` branch has no "is ready" log to match the cold-start; and the `svc delete --timeout` help text reads "stop-and-wait" (jargon) where the siblings use plain "Timeout for --wait". Three more minor polish items round out the list.

No Critical issues. Codex's Important I1 ("new 5-minute `svc stop` / `svc delete` default is shorter than healthy multi-VM shutdown, regression") is **dropped as a mis-labeled regression**: pre-merge-base `StopWithWait` hardcoded a 60-second timeout in `pkg/service/cmd/service.go:373`, and `svc stop` exposed no `--timeout` flag. The PR raises that default to 5 minutes *and* exposes `--timeout=0` for indefinite waits — strictly a widening, not a shortening. The underlying "multi-VM sequential drain may exceed 5m once `shutdownAllHostagents` is parallelized" concern is real but already documented as an acknowledged limitation in the `stopWaitTimeout` comment (R3 S5 resolution).

---

## Critical Issues

None.

## Important Issues

I1. **`watchUntil` / `watchUntilDeleted` return `nil` when the watch channel closes, silently succeeding on a dropped watch** — `cmd/rdd/limavm.go:485-497`, `cmd/rdd/limavm.go:525-535` [Gemini 3.1 Pro]

```go
for ev := range watcher.ResultChan() {
    switch ev.Type {
    case k8swatch.Error:
        return apierrors.FromObject(ev.Object)
    case k8swatch.Deleted:
        return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
    }
    updated, ok := ev.Object.(*limav1alpha1.LimaVM)
    if ok && check(updated) {
        return nil
    }
}
return ctx.Err()
```

When `watcher.ResultChan()` closes without an `Error` event — a graceful connection close, a proxy/load-balancer timeout, a server restart, or the Kubernetes `watch` timeout that fires every 5–10 minutes by default — the `for` loop exits, and `ctx.Err()` is consulted. Inside `watchUntil` the context passed in is the caller's `waitCtx`; if the timeout has not yet expired, `ctx.Err()` returns `nil`. Every `limavm` action then interprets that as success:

```go
if err := watchUntil(waitCtx, c, key, func(vm *limav1alpha1.LimaVM) bool {
    return vm.Status.RestartCount > beforeCount
}); err != nil {
    return cliexit.Classify(fmt.Errorf("failed waiting for restart to complete: %w", err))
}
logrus.Infof("LimaVM %q restarted in namespace %q", name, limaVM.Namespace)
```

`cmd/rdd/limavm.go:233-238, 413-417, 453-457` and the matching `watchUntilDeleted` call at `:556-558` all propagate the `nil` as success. The user sees `LimaVM "foo" started` / `deleted` / `restarted` with exit 0, but the status field was never observed to change.

This is pre-existing (`776b8f77`, 2026-02-26) but the PR materially widens the exposure:

- Every `limavm` wait-capable subcommand now defaults to `--wait=true` with a finite `--timeout` (this PR), so more CLI users land on these code paths than before.
- `--timeout=0` is now a documented user-facing option (`docs/design/cmd_lima.md`). In that mode the context never cancels on its own, so `ctx.Err()` returns `nil` on *every* channel close — the false-success path is the only path a dropped watch can take.
- The PR's new exit-code-4 contract advertises "the wait timed out" as a crisp signal; the watch-drop path launders it into "the wait succeeded."

Fix: distinguish "channel closed before condition met" from "context cancelled first." A one-line guard at the `return ctx.Err()` site suffices:

```diff
 for ev := range watcher.ResultChan() {
     ...
 }
-return ctx.Err()
+if err := ctx.Err(); err != nil {
+    return err
+}
+return fmt.Errorf("watch for LimaVM %q closed before condition was observed", key.Name)
```

Apply the symmetric change at `watchUntilDeleted:535` with `"... closed before deletion was observed"`.

(important, gap — pre-existing behavior widened by this PR's user-documented `--timeout=0` and the exit-code-4 contract that depends on timeouts being the authoritative "gave up" signal.)

I2. **Already-running `svc start` branch skips the freshness anchor, so a stale `ReadyAnnotation` can satisfy the wait** — `cmd/rdd/service.go:260-271`, `cmd/rdd/service.go:131-168` [Codex GPT 5.4]

```go
if service.Running() {
    logrus.Infof("%q control plane is already running", instance.Name())
    if !wait {
        return nil
    }
    logrus.Infof("waiting for %q control plane to be ready", instance.Name())
    ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
    defer cancel()
    if err := service.Wait(ctx); err != nil {
        return cliexit.Classify(err)
    }
    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
}
```

`waitForDiscoveryConfigMap` calls `waitForFreshDiscoveryConfigMap(ctx, time.Time{})`, and the zero `beforeStart` makes the `cm.CreationTimestamp.Time.Before(beforeStart)` check vacuous:

```go
func waitForDiscoveryConfigMap(ctx context.Context) error {
    return waitForFreshDiscoveryConfigMap(ctx, time.Time{})
}
```

The cold-start path avoids stale state by anchoring `beforeStart := time.Now().Truncate(time.Second)` before `service.Start`, then passing that to `waitForFreshDiscoveryConfigMap` (`cmd/rdd/service.go:117-128`). The already-running branch has no such anchor.

The race: serve writes `rdd.pid` very early in its startup sequence, so `service.Running()` returns `true` from that moment on. If a prior serve process crashed with its discovery ConfigMap still carrying `ReadyAnnotation=true` in the sqlite3 datastore, and a new serve process has just started but not yet reached `InitDiscovery()`, a concurrent `rdd svc start` enters the already-running branch, `service.Wait` returns once `/readyz` + CRDs respond (which predates discovery), and `waitForDiscoveryConfigMap` finds the stale annotation immediately. The CLI returns exit 0 and the user proceeds to run commands that depend on controllers that haven't registered yet.

The cold-start path's `beforeStart` blocks exactly this: the ConfigMap must have been created at or after the current startup to satisfy the wait. The already-running branch needs the same contract — but there's no natural `beforeStart` value, because the branch doesn't own the startup moment. Workable options:

- Take `beforeStart` from the `rdd.pid` file's modification time (serve writes the file on startup, so the mtime bounds when the current instance began). Truncate to seconds to match `metav1.Time` serialization.
- Have `serve` record a `lastStart` timestamp in its on-disk state when it comes up; both paths read from it.

Fix (pid-file mtime approach):

```diff
 if service.Running() {
     ...
-    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+    beforeStart, err := pidFileStartTime()
+    if err != nil {
+        return cliexit.Classify(err)
+    }
+    return cliexit.Classify(waitForFreshDiscoveryConfigMap(ctx, beforeStart))
 }
```

Round 3 consolidated "already-running branch now waits for `ReadyAnnotation`, matching the cold-start path" as a strength. It waits for the annotation, but *not* for a fresh one — the match is narrower than it looked. Round 4 carries this through, so it lands here.

(important, regression — the already-running wait is new in this PR; the freshness-bypass ships with it.)

## Suggestions

S1. **`service.Wait` and `waitForFreshDiscoveryConfigMap` surface bare `ctx.Err()`, so the user-visible timeout message has no scoping context** — `pkg/service/cmd/service.go:316-334`, `cmd/rdd/service.go:97-100, 125-128, 268-271` [Claude Opus 4.7]

```go
// pkg/service/cmd/service.go:325-327
case <-ctx.Done():
    return ctx.Err()
```

`service.Wait` returns `context.DeadlineExceeded` directly; `waitForFreshDiscoveryConfigMap` (via `wait.PollUntilContextCancel`) does the same. The CLI feeds both through `cliexit.Classify`, so exit code 4 is correct, but the error message printed to stderr is just `context deadline exceeded` — no indication of *which* wait fired. `StopWithWait` wraps as `timed out waiting for %q control plane with pid %d to stop: %w` (`pkg/service/cmd/service.go:382`), which is the shape a user can act on.

Fix: wrap at the call site with a one-line scope. Example:

```diff
 if service.Running() {
     ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
     defer cancel()
     if err := service.Wait(ctx); err != nil {
-        return cliexit.Classify(err)
+        return cliexit.Classify(fmt.Errorf("waiting for %q control plane to be ready: %w", instance.Name(), err))
     }
-    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+    if err := waitForDiscoveryConfigMap(ctx); err != nil {
+        return cliexit.Classify(fmt.Errorf("waiting for discovery ConfigMap: %w", err))
+    }
+    return nil
 }
```

The PR already threads `%w` through the `limavm.go` call sites (`failed waiting for LimaVM to start: %w`, etc.); `service.go`'s three `cliexit.Classify(err)` sites are the outliers.

(suggestion, gap — diagnostic parity with the `limavm` paths.)

S2. **`ensureServiceRunning` uses `context.WithTimeout`; every sibling wait uses `watchtools.ContextWithOptionalTimeout`** — `cmd/rdd/service.go:95` [Claude Opus 4.7]

```go
ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
```

Every other timeout-bound block this PR touches — `startAndWaitForReady:122`, `serviceStartAction` already-running branch `:266`, `StopWithWait:366`, and all five `limavm.go` wait sites — uses `watchtools.ContextWithOptionalTimeout`, which degrades to `context.WithCancel` when the duration is 0. `startWaitTimeout` is a positive constant today, so the observable behavior is the same, but a future refactor that exposes this constant via a flag would silently fail on `--timeout=0` (an immediately-expired context). Drop-in change:

```diff
-        ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
+        ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
         defer cancel()
```

(suggestion, enhancement — cross-site consistency.)

S3. **No BATS coverage for `svc start --timeout` / `svc delete --timeout` exit-code-4 contract** — `bats/tests/20-service/6-timeout.bats` [Codex GPT 5.4]

```bash
@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}
```

The PR adds exit-code-4 behavior on three `svc` paths (`stop`, `start` already-running branch, `delete`) and four `limavm` paths, but the new `6-timeout.bats` asserts only `svc stop`. `svc delete` is the highest-risk unrepresented path because its I1 predecessor in Round 3 (over-broad error swallow) would have shipped green under a `run -4 rdd svc delete --timeout=1ms && assert_dir_exist "${RDD_DIR}"` test. `svc start`'s already-running branch is also new and has no integration coverage. The author's Round 2 resolution table lists these two as intentional skips; Round 4 I2 demonstrates that the cost of that skip can be a non-obvious correctness bug.

Fix: two new cases in `6-timeout.bats`, shaped like the existing one. For `svc delete`, also assert the instance directory survives:

```bash
@test 'svc delete --timeout=1ms exits with code 4 and preserves directory' {
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    assert_dir_exist "$(rdd svc paths dir)"
}
```

(suggestion, gap — carries forward R2 testing gaps #1–#2 listed as intentional skips; R4 I2 makes the coverage argument sharper.)

S4. **Already-running `svc start` branch has no symmetric "is ready" log on success** — `cmd/rdd/service.go:260-272, 293-297` [Claude Opus 4.7]

The cold-start branch logs `"%q control plane is ready"` after the wait succeeds (line 297). The already-running branch waits for the same readiness signal but logs nothing on success — a user running `rdd svc start` against an already-running-but-still-initializing service sees the `"waiting for ... to be ready"` line then silence. Mirror the cold-start log:

```diff
         if err := service.Wait(ctx); err != nil {
             return cliexit.Classify(err)
         }
-        return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+        if err := waitForDiscoveryConfigMap(ctx); err != nil {
+            return cliexit.Classify(err)
+        }
+        logrus.Infof("%q control plane is ready", instance.Name())
+        return nil
     }
```

Pairs well with S1's wrapping — both amount to "make the already-running branch as chatty on success and failure as the cold-start branch."

(suggestion, enhancement.)

S5. **`cliexit.Classify` can double-wrap a `*cliexit.Error{Code: CodeTimeout}`** — `pkg/cli/exit/exit.go:53-60` [Gemini 3.1 Pro]

```go
func Classify(err error) error {
    if errors.Is(err, context.DeadlineExceeded) {
        return Timeout(err)
    }
    return err
}
```

`*cliexit.Error.Unwrap()` exposes the inner error, so `errors.Is(cliexit.Timeout(deadlineErr), context.DeadlineExceeded)` is `true`. If a call site passes an already-classified error through `Classify` again, the result is `*cliexit.Error{Code: 4, Err: *cliexit.Error{Code: 4, Err: deadlineErr}}` — two layers, same exit code. Today's call graph does not double-classify in practice (every `Classify` is a top-level, per-command call), but the pattern of wrap-on-inner-call-then-classify-on-outer-call is easy to slip into and the helper should be idempotent:

```diff
 func Classify(err error) error {
+    var cliErr *Error
+    if errors.As(err, &cliErr) {
+        return err
+    }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }
```

(suggestion, enhancement — defensive idempotence.)

S6. **`Delete`'s deadline-branch recovery elides that the process has already been SIGTERM'd** — `pkg/service/cmd/service.go:428-438` [Claude Opus 4.7]

```go
if Running() {
    if err := StopWithWait(ctx, true, timeout); err != nil {
        // Suppress only the outer/inner Running() race, where the
        // process exits between Delete's check and StopWithWait's.
        // Deadline expiry must surface so the CLI exits 4 and leaves
        // the directory in place, per docs/design/cmd_service.md.
        if errors.Is(err, context.DeadlineExceeded) || Running() {
            return err
        }
    }
}
```

The comment at `:430-433` correctly describes the R3 I1 fix. One observation worth a second comment line: when the deadline branch fires, `StopWithWait` has already called `_ = process.Kill(pid)` (`:381`). On Unix that is `SIGTERM`, which the service catches — but the k8s `SetupSignalContext` second-signal handler routes the second `SIGTERM` through `os.Exit(1)` within microseconds, so the process is usually already dead when `Delete` returns `err`. A user who runs `rdd svc status` right after exit 4 will see no running PID, which reads as a contradiction of the "directory left in place because the process is still alive" framing in the docs. The contract is still correct (the *original* reason for preserving the directory was "do not remove it under a live writer," which holds during the wait; the post-kill moment is a bonus side effect), but the comment understates the post-deadline process state:

```diff
         if errors.Is(err, context.DeadlineExceeded) || Running() {
+            // At deadline expiry, StopWithWait has already SIGTERM'd the
+            // process, which usually exits within microseconds. Directory
+            // removal still waits for the next `rdd svc delete` invocation,
+            // per docs/design/cmd_service.md.
             return err
         }
```

(suggestion, enhancement — comment accuracy.)

S7. **`rdd limavm create --timeout` help text understates when the flag is ignored** — `cmd/rdd/limavm.go:151` [Claude Opus 4.7]

```go
command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
```

The `create` wait condition is `start && wait && !dryRun` (`:409`), so `--timeout` is also ignored when `--start` is false or `--dry-run` is set. The other `limavm` commands' help text is correct because their wait condition is just `wait` (or equivalent). Tighten the `create` text:

```diff
-    command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
+    command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored without --start, with --wait=false, or with --dry-run (0 means wait indefinitely)")
```

(suggestion, enhancement — only `create` has this extra condition.)

S8. **`svc delete --timeout` help text reads as jargon ("stop-and-wait")** — `cmd/rdd/service.go:354` [Claude Opus 4.7]

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
```

Round 3 S6 pushed the author to divorce the wording from the `Timeout for --wait` sibling (since `delete` has no `--wait` flag). The landing text "stop-and-wait" is terse but reads as a noun of art; "stop and wait" is not a named concept in the docs, and the siblings use plain `Timeout for --wait`. A more readable form:

```diff
-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for waiting for the control plane to stop (0 means wait indefinitely)")
```

(suggestion, enhancement — polish on R3 S6 resolution; the Round 3 fix was correct in spirit, the specific word choice is what jars.)

S9. **`cliexit.Classify`'s docstring under-describes potential growth and the single-case implementation is minimal enough to inline** — `pkg/cli/exit/exit.go:53-60` [Claude Opus 4.7]

```go
// Classify wraps err with [CodeTimeout] when err wraps
// [context.DeadlineExceeded]. Other errors pass through unchanged.
func Classify(err error) error {
    if errors.Is(err, context.DeadlineExceeded) {
        return Timeout(err)
    }
    return err
}
```

The name `Classify` suggests a polymorphic classifier (admission errors → `CodeRejected`, `context.Canceled` → some user-cancel code, etc.), but the body handles exactly one case. Two paths:

- If the intent is "extension point for more classes later," add a doc line naming that intent so future readers don't think it's incomplete.
- If the intent is "just this one classifier for now," the caller pattern `cliexit.Classify(fmt.Errorf("waiting for ...: %w", err))` is two lines (check + wrap) that could be inlined at each of the 9 call sites, at the cost of a package-level helper.

Either direction keeps the signal clear. A two-line doc comment is the cheapest form:

```diff
-// Classify wraps err with [CodeTimeout] when err wraps
-// [context.DeadlineExceeded]. Other errors pass through unchanged.
+// Classify tags known error classes with their CLI exit code. Today it
+// only handles [context.DeadlineExceeded] → [CodeTimeout]; add new
+// classifications here as they become CLI-observable.
 func Classify(err error) error {
```

(suggestion, enhancement — documentation clarity for a shared helper.)

---

## Design Observations

### Strengths

- **Round 3's two Important findings both land cleanly in a single commit** [Claude Opus 4.7] [Codex GPT 5.4] [Gemini 3.1 Pro]. `Delete`'s recovery now gates on `errors.Is(err, context.DeadlineExceeded) || Running()`, the documented "exit 4, directory in place" contract holds, and both `StopWithWait(ctx, ...)` and `Delete(ctx, ...)` accept the caller's context and derive their timed child via `watchtools.ContextWithOptionalTimeout`. The latter pattern now appears at every wait site in the PR.
- **`cliexit.Classify` at the CLI boundary is the right shape** (in-scope) [Codex GPT 5.4] [Claude Opus 4.7]. The helper lives in `pkg/cli/exit`, every entry point already imports it, and `set.go`'s previously-inline `errors.Is(...) → cliexit.Timeout` block was migrated to use it. The helper also keeps lower-level packages returning ordinary wrapped errors — no coupling from `pkg/service/cmd` back up to the CLI.
- **Unix/Windows shutdown signal divergence is documented at the point of use** [Gemini 3.1 Pro] [Claude Opus 4.7]. `pkg/service/cmd/service.go:348-358` walks a reader through why `Interrupt` is tried first (lets the service run its shutdown path on Windows, where `Kill` is `TerminateProcess` and bypasses handlers), and why the `Kill` fallback never fires on Unix. The `docs/design/api_service.md:21` wording ("SIGTERM as fallback" on Unix, "`TerminateProcess` as fallback" on Windows) now matches the implementation — the R3 S4 "forced termination" claim is gone.
- **Single source of truth for long waits** [Claude Opus 4.7]. `limaLongWaitTimeout = 5 * time.Minute` drives every `limavm` subcommand and `stopWaitTimeout`, and the `stopWaitTimeout` comment explicitly names the future `LimaVM` stop-orchestration work so the R3 S5 multi-VM caveat is no longer implicit.

### Concerns

- **`cliexit.Classify` only handles `context.DeadlineExceeded`, not `context.Canceled`** (future) [Claude Opus 4.7]. A user who presses Ctrl-C during `rdd svc stop --wait --timeout=5m` gets exit code 1, indistinguishable from a real failure. A separate "user cancelled" exit code would let scripts distinguish intentional cancellation from command failure. Not actionable this round — the user-cancellation path is not exercised in this PR.
- **Two `context.WithTimeout` call sites still bypass `watchtools.ContextWithOptionalTimeout`** (in-scope) [Claude Opus 4.7]. `cmd/rdd/service.go:95` uses the plain form for `startWaitTimeout`; `pkg/service/cmd/service.go:202` uses it for an internal 15-second bound on `getRuntimeControllersFromCluster`. Both are fine today (non-zero positive constants), but they are the two stragglers in a PR whose theme is "0 means infinite everywhere." S2 proposes the `ensureServiceRunning` fix.

## Testing Assessment

Coverage of the exit-code-4 contract is broad on the `limavm` side: every wait-capable subcommand has a `run -4` regression test, and `6-timeout.bats` asserts `svc stop --timeout=1ms`. Unaddressed gaps, carried forward from earlier rounds plus new:

1. **`rdd svc delete --timeout=…` deadline expiry returns exit code 4 and leaves the directory in place.** `round 2 S2 unaddressed` (intentional skip). The R3 I1 regression around this exact code path would have been caught by a test. Priority elevated by R4 I2.
2. **`rdd svc start --timeout=…` against an already-running-but-not-yet-ready service returns exit code 4.** `round 2 S2 unaddressed` (intentional skip). Would exercise the new already-running branch plus R4 I2's freshness-skip.
3. **Dropped watch during `rdd limavm start/stop/restart/create/delete --wait`.** New in R4. Ties directly to I1. Hard to provoke without a custom mock API server; a unit-level test against a `fake.NewWithWatch` client that closes its watcher channel prematurely would be the cheapest.
4. **`ensureServiceRunning` deadline expiry during `rdd set` / `rdd limavm *` / `rdd kubectl` / `rdd service config` returns exit code 4.** `round 2 testing gap #3 unaddressed` (intentional skip). The fix at `cmd/rdd/service.go:97-100` is correct; no test exercises it.
5. **`--timeout=0` (indefinite wait) succeeds when the desired state eventually arrives.** `round 2 testing gap #4 unaddressed` (intentional skip). A regression that accidentally used `context.WithTimeout(ctx, 0)` (which fires immediately) would ship green.
6. **Ctrl-C during `svc stop --timeout=0` / `svc delete --timeout=0`.** `round 3 testing gap #5 unaddressed`. Now that both functions accept `ctx`, cancellation via `cmd.Context()` should abort the wait; a test would lock the behavior in.

Items 1, 2, 4, 5, 6 remain intentional skips per the author's R2/R3 resolution; item 3 is new and ties to I1.

## Documentation Assessment

- `docs/design/cmd_service.md:104-109` now says "Sends SIGINT to the control plane process (`rdd.pid`) ... If the deadline expires, the wait sends SIGTERM on Unix (or calls `TerminateProcess` on Windows) as a fallback and `rdd` exits with code 4." Matches the code. R3 S4 resolved.
- `docs/design/cmd_service.md:117-122` promises "exit 4, directory left in place" on deadline expiry — the code now honors this (R3 I1 resolved).
- `docs/design/cmd_lima.md` — defaults (`--wait=true`, `--timeout=5m`) consistent with the code.
- `docs/design/api_service.md:21` — "SIGINT on Unix with SIGTERM as fallback; `CTRL_BREAK_EVENT` on Windows with `TerminateProcess` as fallback" matches the implementation.
- Pre-existing test asserts `"successfully started"` log line at `bats/tests/20-service/2-create.bats:40-44`, but the cold-start code path emits `"... is ready"` instead (code stopped emitting `"successfully started"` months ago). Unaffected by this PR but worth a separate pass — the test must currently be failing or skipped. *Out of scope for this review.*

## Commit Structure

Single commit, single concept. The commit body accurately enumerates the landed defaults and references the R3 resolution. Clean.

## Acknowledged Limitations

- **`cmd/rdd/service.go:36-42`** — the `stopWaitTimeout = limaLongWaitTimeout` comment now includes "A multi-VM instance may still exceed this budget while shutdownAllHostagents drains sequentially." Matches the implementation and absorbs Codex's I1 concern as a known-future-work item rather than a present regression.
- **`pkg/service/cmd/service.go:348-358`** — the Unix/Windows `Interrupt → Kill` fallback comment is accurate and explains the hostagent-preservation trade-off. Unaffected by this PR.
- **`pkg/service/cmd/service.go:373-382`** — the `process.Kill` on deadline is intentional and accepted; hostagents survive in their own process groups and self-heal via `killOrphanedHostagent` on next start.
- **R3-acknowledged `StopWithWait` PID-recycling race** — `Running()` returns true, then `pid := PID()` finds the file stale, `process.Interrupt(0)` on Unix signals the calling process group. Pre-existing; not widened by this PR.

## Unresolved Feedback

No prior PR review comments on this PR.

---

## Agent Performance Retro

### [Claude]

Claude carried the majority of the Suggestion coverage again: six of nine Suggestions, including the four that capture the "not-quite-symmetric" polish gaps (`context.WithTimeout` vs `watchtools.ContextWithOptionalTimeout`, missing ready log in already-running branch, `--timeout` help text precision, `cliexit.Classify` docstring). Claude's I1 (bare `ctx.Err()` messages) and I2 (`Delete` post-deadline comment) were both cosmetic enough that consolidation downgraded them to S1 and S6 — Important severity was not the right pitch for a diagnostic-wrapping ask. One miss worth naming: Claude's S10 (add `local_teardown_file` to `6-timeout.bats`) directly contradicted the repo's BATS style rule "Don't write `local_teardown_file()` — leave the system running after tests so failures can be inspected manually." The orchestrator missed it during the first consolidation pass too — the BATS rules lived in `bats-style.md` but were not inlined into the deep-review context the agents receive, so no agent had the rule in prompt. Dropped during a second pass; repo-context update below captures the fix. Claude also noted a pre-existing test-code-drift in `2-create.bats` that this PR does not touch, scoped as out-of-scope correctly. No Critical false alarms and no hallucinated file contents.

### [Codex]

Codex delivered the review's single largest-impact finding in I2 (stale ConfigMap in the already-running `svc start` branch). That catch reconciles what R3 consolidated as a strength — "already-running branch now matches the cold-start path" — against the actual code, which matches only up to the freshness anchor. The scope-level realization that the R3 S7 consolidation was incomplete is exactly what Round 4's re-read needs to surface. Codex's I1 (5-minute default too short for multi-VM shutdown) is dropped as a mis-labeled regression — pre-PR `svc stop` hardcoded 60s, so 5m is strictly longer — but the underlying concern about future-parallel shutdown is real and is captured as an acknowledged limitation via the R3 S5 comment. Codex skipped the `git blame`/merge-base check on I1, which is the gap behind the mis-label. S1 (BATS coverage gaps) correctly carries R2/R3 testing skips forward with new scaffolding from R4 I2.

### [Gemini]

Gemini produced its single Important finding (watch-close false success) directly — no Critical mis-calibrations, no hallucinated `@test` headers (the repeat false positive of the last two rounds), and the fix shape is correct. Gemini noted the finding was pre-existing but did not explicitly tie it to this PR's widening of the exposure (user-documented `--timeout=0` + exit-code-4 contract); consolidation added that link. Gemini's S1 (`Classify` double-wrap) is a valid defensive-idempotence suggestion with no active bug today. Gemini continues to skip `git blame`, so the regression-vs-gap labeling falls to consolidation — but with only one finding, the labeling cost was minimal this round.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | — | — | 5m 09s |
| Findings | 6S | 1I 1S | 1I 1S |
| Tool calls | — | — | 20 (read_file 8, grep_search 7, run_shell_command 5) |
| Design observations | 4 | 1 | 2 |
| False positives | 1 | 1 | 0 |
| Unique insights | 6 | 2 | 2 |
| Files reviewed | 10 | 10 | 10 |
| Coverage misses | 1 | 0 | 0 |
| **Totals** | **6S** | **1I 1S** | **1I 1S** |
| Downgraded | 2 (I→S) | 0 | 0 |
| Dropped | 1 | 1 | 0 |



**Reconciliation.** Four severity changes:

- Codex I1 (5m default too short, regression-labeled): **dropped as false positive.** Pre-merge-base (`47b55ec0`) `StopWithWait` hardcoded a 60-second timeout with no user override; post-PR default is 5 minutes with `--timeout=0` for indefinite. Strictly a widening. The underlying multi-VM concern is real but already documented in the `stopWaitTimeout` comment per R3 S5. `false_positives += 1`.
- Claude I1 (bare `ctx.Err()` messages, important): **downgraded to S1.** Exit code is already correct; the finding is a diagnostic-wrapping ask, not a correctness bug. Important is reserved for behavior that misleads users about state, not for terse error strings.
- Claude I2 (`Delete` post-deadline comment, important): **downgraded to S6.** Claude's own fix says "light touch — a comment, not a code change." Comment-only changes belong under Suggestions.
- Claude S10 (add `local_teardown_file` to `6-timeout.bats`): **dropped as false positive.** Contradicts the repo BATS style rule in `bats-style.md`: "Don't write `local_teardown_file()` — leave the system running after tests so failures can be inspected manually." The orchestrator also missed this during the first consolidation pass; the BATS rules had not been inlined into the deep-review context, so the agents never saw them. Fix captured under Repo context updates below. `false_positives += 1`.

A second retracted item — an orchestrator-added re-raise of Round 3 S3 (`run -0` on the cleanup `rdd limavm delete "start-vm"`) — turned out to be built on a Round 3 mistake. The repo BATS style rule states: "Only use `run -0` when you need to inspect `$output` afterwards. If you just want to verify a command succeeds, run it directly without `run`." The current code (no `run -0` on that cleanup line) matches the convention; Round 3 S3 was itself a false positive carried forward from Round 2's `--wait` scrub commentary. Removed from this round's Suggestions and noted as a Round 3 error.

## Review Process Notes

### Skill improvements

- **When a prior round consolidates two agents' findings into a claim about parity between two code paths ("X now matches Y"), the next round's first action on that claim should be a paired read of both paths side by side.** Round 3 consolidated "already-running `svc start` branch now waits for `ReadyAnnotation`, matching the cold-start path." That consolidation was correct at the `service.Wait` + `ReadyAnnotation` level and incorrect at the `waitForFresh*` level. The miss persisted into R4 and was caught only because one agent independently re-read both paths. A reviewer recognises the pattern when a prior-round strength or resolution uses comparison phrasing ("matches the cold-start", "same contract as", "symmetric with"); verify the comparison at every level the unified helper operates on, not just at the top level. Parity claims silently decay as one side changes.

### Repo context updates

- **Add:** When a long-lived watch loop ends with `return ctx.Err()` after the result channel closes, distinguish "context cancelled" from "channel closed while context still valid." A closed watch channel without an explicit `Error` event can fire on graceful server close, proxy timeout, or Kubernetes watch-expiry; returning `ctx.Err()` (which is `nil` in that moment) converts a dropped watch into a spurious success. A reviewer recognises the pattern in any `for ev := range watcher.ResultChan() { ... } return ctx.Err()` shape — flag the need to return a distinct "watch closed before condition met" error on the fall-through. Reason: `cmd/rdd/limavm.go`'s `watchUntil` / `watchUntilDeleted` shipped this shape in 2026-02 and it was not caught until round 4 of a PR that widened the wait-timeout exposure.
- **Add (applied this round):** Inline the repo's BATS style rules into `deep-review-context.md` so review agents apply them. The rules live in `bats-style.md` (variable scoping, `run -0` only when `${output}` / `${status}` is read on the next line, `assert_output` / `refute_output` idioms, no `local_teardown_file`, blank-line spacing). Agents cannot follow links out of the review prompt, so a rule kept only in `bats-style.md` is invisible at review time — Round 3 S3 and Round 4 S10 both landed as false positives because of this. `bats-style.md` carries a sync comment requiring any change to mirror into `deep-review-context.md`.

---

## Appendix: Original Reviews

### [Claude Opus 4.7]

Now I have enough context. Let me verify a few more critical pieces before writing the review.

I have enough context. Writing the review now.

# Deep Code Review — PR #336: cli: standardize --wait and --timeout across svc and limavm

## Executive Summary

This PR consolidates `--wait`/`--timeout` semantics across `rdd svc` and `rdd limavm`, routes `context.DeadlineExceeded` through a new `cliexit.Classify` helper so all wait-capable commands return exit code 4 on timeout, and removes the rigid `--wait=false` shortcut from `svc delete` to prevent corruption of the instance directory. The implementation is largely careful, but a few subtleties around error messages, the dual-Running-check race-handling in `Delete`, and a pre-existing test-fixture mismatch are worth noting.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `service.Wait` returns bare `ctx.Err()` — user-facing timeout messages lose context** — `pkg/service/cmd/service.go:316-334`, `cmd/rdd/service.go:97-103,267-271,294-296`

```go
// pkg/service/cmd/service.go:325-327
case <-ctx.Done():
    return ctx.Err()
```

`Wait` and `waitForFreshDiscoveryConfigMap` (via `wait.PollUntilContextCancel`) both surface `context.DeadlineExceeded` directly. The CLI now feeds this through `cliexit.Classify`, so the user gets exit code 4, but the message printed to stderr is just `context deadline exceeded` — no hint of *what* timed out. Compare with `StopWithWait` (line 382), which wraps as `timed out waiting for %q control plane with pid %d to stop: %w`, giving the operator something to act on.

The new code paths in `cmd/rdd/service.go` that exhibit this:
- `ensureServiceRunning` → `service.Wait` → `cliexit.Classify` (line 97–98)
- `ensureServiceRunning` → `waitForDiscoveryConfigMap` → `cliexit.Classify` (line 100)
- `serviceStartAction` (already-running branch) (line 268–271)
- `startAndWaitForReady` → `service.Wait` / `waitForFreshDiscoveryConfigMap` (line 122–128)

A user who runs `rdd svc start --timeout=10s` against a struggling apiserver sees only `level=error msg="context deadline exceeded"` and has to guess whether it's the apiserver poll or the discovery ConfigMap that's stuck.

Fix: wrap at the call site with a one-line context. Example:

```diff
 if service.Running() {
     ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
     defer cancel()
     if err := service.Wait(ctx); err != nil {
-        return cliexit.Classify(err)
+        return cliexit.Classify(fmt.Errorf("waiting for %q control plane to be ready: %w", instance.Name(), err))
     }
-    return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+    if err := waitForDiscoveryConfigMap(ctx); err != nil {
+        return cliexit.Classify(fmt.Errorf("waiting for discovery ConfigMap: %w", err))
+    }
+    return nil
 }
```

(important, gap — the timeout-classification work is an opportunity to also improve the diagnostics.)

---

**I2. `Delete` race-handling re-checks `Running()` after `StopWithWait` returned a non-deadline error, but never re-checks after the deadline branch** — `pkg/service/cmd/service.go:428-438`

```go
if Running() {
    if err := StopWithWait(ctx, true, timeout); err != nil {
        if errors.Is(err, context.DeadlineExceeded) || Running() {
            return err
        }
    }
}
```

The intent (per the comment) is to suppress *only* the "process exited between our `Running()` check at line 428 and `StopWithWait`'s `Running()` check at line 340". In practice the predicate also matches: signal-send failures (`process.Kill` returned an error and the process actually died), partial cleanup errors, etc. — anything that makes `Running()` return false now suppresses the error. That's fine, but the inverse case is subtle: when `errors.Is(err, context.DeadlineExceeded)` matches, `StopWithWait` already issued a synchronous `process.Kill(pid)` (line 381). On Unix, `process.Kill` is `unix.Kill(pid, SIGTERM)` which returns immediately — the process is *probably* dead but may take a few ms to actually exit. We return the deadline error and leave the directory in place, so the user retries `rdd svc delete`. The retry catches the now-dead process via `if Running()` → false at line 428 and proceeds to cleanup. Acceptable but worth a comment, since the design contract is "deadline expiry leaves the directory in place" — not "deadline expiry leaves the directory in place *and* may already have killed the process". A user who sees exit code 4 and runs `rdd svc status` immediately after may find the process gone.

Fix (light touch — a comment, not a code change):

```diff
         if errors.Is(err, context.DeadlineExceeded) || Running() {
+            // Deadline branch: StopWithWait already SIGTERM'd the process,
+            // which usually exits within a few ms. The next `rdd svc delete`
+            // will find Running()=false at line 428 and proceed to cleanup.
             return err
         }
```

(important, gap — the suppression logic is correct but subtle; a one-line comment about the post-deadline state of the process clarifies the contract.)

### Suggestions

**S1. `ensureServiceRunning` uses `context.WithTimeout`, sibling code paths use `watchtools.ContextWithOptionalTimeout`** — `cmd/rdd/service.go:95`

```go
ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
```

Every other timeout-bound block in this PR (lines 122, 266, in `pkg/service/cmd/service.go:366`, in all of `cmd/rdd/limavm.go`) uses `watchtools.ContextWithOptionalTimeout`, which transparently degrades to `context.WithCancel` when the duration is 0. `startWaitTimeout` is hardcoded to a positive `90 * time.Second`, so the difference doesn't matter today — but a future refactor that lets users override this constant via a flag would silently fail on `--timeout=0` (an immediately-expired context). Single-line consistency change:

```diff
-		ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
+		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
 		defer cancel()
```

(suggestion, enhancement.)

---

**S2. `serviceStartAction`'s "already running" branch never logs the symmetric "is ready"** — `cmd/rdd/service.go:260-272,293-297`

The cold-start branch logs `"%q control plane is ready"` after the wait succeeds (line 297). The already-running branch waits for the same readiness signal but logs nothing on success — only `"waiting for %q control plane to be ready"` before the wait. A user running `rdd svc start` against an already-running but still-initializing service sees the "waiting" line then silence, with no confirmation that the wait completed.

```diff
         if err := service.Wait(ctx); err != nil {
             return cliexit.Classify(err)
         }
-        return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
+        if err := waitForDiscoveryConfigMap(ctx); err != nil {
+            return cliexit.Classify(err)
+        }
+        logrus.Infof("%q control plane is ready", instance.Name())
+        return nil
     }
```

(suggestion, enhancement.)

---

**S3. `--timeout` help text on `rdd limavm create` understates when it's ignored** — `cmd/rdd/limavm.go:151`

```go
command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
```

For `create`, the wait condition is `start && wait && !dryRun` (line 409), so `--timeout` is also ignored when `--start=false` or `--dry-run` is set. The current text only mentions `--wait=false`, which can mislead users into thinking a `5m` deadline is in effect for `--dry-run` runs.

```diff
-	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
+	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored without --start, with --wait=false, or with --dry-run (0 means wait indefinitely)")
```

(suggestion, enhancement.)

---

**S4. `--timeout` help text on `rdd svc delete` reads as a noun, not a knob** — `cmd/rdd/service.go:354`

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
```

"the stop-and-wait" is jargon. The siblings say `Timeout for --wait`. Since `delete` *always* waits, the most readable form is just:

```diff
-	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the stop-and-wait (0 means wait indefinitely)")
+	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for waiting for the control plane to stop (0 means wait indefinitely)")
```

(suggestion, enhancement.)

---

**S5. Test 6-timeout.bats lacks isolation cleanup between runs** — `bats/tests/20-service/6-timeout.bats:8-12`

```bash
local_setup_file() {
    rdd svc delete || :
    rdd svc create
    rdd svc start
}
```

After the timeout test, the service is force-killed via `process.Kill` (SIGTERM, second signal → `os.Exit(1)`), leaving an instance directory with stale `rdd.pid` and `rdd.sqlite3`. The global `teardown_file` only runs `rdd svc stop || true`. The next run of this same test cleans up via `rdd svc delete || :`, so it's self-correcting, but a `local_teardown_file` that runs `rdd svc delete` would make the contract explicit and avoid wasted log space. Since every other 33-/34- test uses `setup_rdd_control_plane` (which always starts from a clean `rdd svc delete`), this divergence is also worth a comment explaining why this file inlines the setup.

(suggestion, enhancement.)

---

**S6. `cliexit.Classify`'s docstring says "Other errors pass through unchanged" but it returns `error`, not the original concrete type** — `pkg/cli/exit/exit.go:53-60`

```go
func Classify(err error) error {
    if errors.Is(err, context.DeadlineExceeded) {
        return Timeout(err)
    }
    return err
}
```

Minor: the function's name suggests it could grow to handle other classifications (`context.Canceled`, admission errors via `apierrors.IsInvalid`, etc.). Today it handles exactly one case. If the intent is "extension point", a one-line comment would help; if the intent is "single classifier", consider naming it `TimeoutOrPassThrough` or inlining the two-line check at each call site (only 9 call sites). As written, future readers will probably re-read the body to confirm there's no hidden behavior.

(suggestion, enhancement.)

## Design Observations

### Strengths

- **Symmetric defaults across `lima*` subcommands.** A single `limaLongWaitTimeout` constant drives all five lima commands. Easy to tune in one place.
- **Re-checking `Running()` in `Delete`.** The race between `Delete`'s outer check and `StopWithWait`'s inner check is real and the suppression handles it cleanly (modulo the comment improvement in I2).
- **`watchtools.ContextWithOptionalTimeout`.** Using this for the user-supplied timeout (instead of hand-rolling `if timeout > 0` branches) gets `--timeout=0` semantics for free and matches what `kubectl wait` does.
- **`stopWaitTimeout = limaLongWaitTimeout`.** Symbolic alias makes the dependency explicit and the comment forecasting the LimaVM-stop-orchestration sequencing is forward-looking without over-abstracting.

### Concerns

- **`cliexit.Classify` only matches `context.DeadlineExceeded`, not `context.Canceled`.** A user pressing Ctrl+C during `rdd svc stop --wait --timeout=5m` gets exit code 1, indistinguishable from a real failure. Today's behavior is reasonable (the user knows they cancelled), but if scripting wants to distinguish "user-cancelled" from "command failed", you may want a separate exit code. Not actionable now — flagging for the future. *(future)*

## Testing Assessment

The new tests cover the three highest-risk paths: `svc stop --timeout`, `lima create --start --timeout`, and `lima restart --timeout`. The existing `lima start --timeout` test was correctly updated from `run -1` to `run -4`. Untested scenarios, ranked by risk:

1. **`rdd svc delete --timeout=Xms` exit code 4** — the entire safety claim of `Delete` (leaves the directory in place on timeout, per `docs/design/cmd_service.md:117-122`) has no integration test. A regression that swallowed the deadline error and proceeded to wipe the directory under a live process would ship green. Realistic test: same shape as 6-timeout.bats but `run -4 rdd svc delete --timeout=1ms` and `assert_dir_exist "${RDD_DIR}"`.
2. **`rdd svc start --timeout=Xms` exit code 4** — harder to provoke, but the `serviceStartAction` already-running branch (line 260–272) is also new code paths that have no integration coverage. A test that starts the service, then runs `rdd svc start --timeout=1ms` against the running instance, would exercise the already-running readiness wait.
3. **`rdd limavm stop --timeout=Xms`** — would need a VM that ignores stop. The existing pattern (non-functional template) doesn't quite work for stop because the controller can mark a never-started VM as stopped immediately. Lower priority.

## Documentation Assessment

- `docs/design/cmd_service.md` — accurate. The default values, the SIGINT/CTRL_BREAK_EVENT detail, and the safety rationale for omitting `--wait=false` on delete all match the code.
- `docs/design/cmd_lima.md` — accurate. All five subcommands now show `--timeout=5m` default and exit-code-4 contract.
- `docs/design/api_service.md:21` — accurate. Updated to match the SIGINT-with-SIGTERM-fallback (Unix) and CTRL_BREAK_EVENT-with-TerminateProcess-fallback (Windows) sequence.

Pre-existing documentation mismatch worth noting (not in PR scope): `bats/tests/20-service/2-create.bats:40-44` asserts the start command emits `successfully started "..."` and `waiting for "..." to be ready`, but the actual code (`cmd/rdd/service.go`) hasn't emitted `"successfully started"` since commit `c762b5f0` (2026-03-06), three months before this PR. The cold-start path now emits `"%q control plane is ready"` instead. This test must be currently broken; a side audit before the next CI green sweep would be worthwhile. *Pre-existing, not flagged as a finding.*

## Commit Structure

A single commit; message accurately describes the change. Clean.

## Acknowledged Limitations

- **`cmd/rdd/service.go:36-41`** — comment names the future LimaVM stop-orchestration work as the reason `stopWaitTimeout` aliases `limaLongWaitTimeout`. Author has explicitly noted this is provisional and a multi-VM instance may exceed the budget. Aligned with the implementation.
- **`pkg/service/cmd/service.go:348-358`** — comment block enumerates the Unix/Windows divergence in `process.Interrupt` reliability and explicitly accepts that on Windows-without-shared-console, hostagents survive and self-heal via `killOrphanedHostagent`. Matches the code path.

## Coverage Summary

- `bats/tests/20-service/6-timeout.bats` — Reviewed, no issues (S5 is enhancement).
- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues. The three new `--timeout` tests use the correct `run -4` and the deletion-wait removals reflect the new default.
- `cmd/rdd/limavm.go` — Reviewed, no issues. `cliexit.Classify` is applied to all four wait paths; `limaLongWaitTimeout` constant centralizes the default.
- `cmd/rdd/service.go` — **I1** (Wait error context), **S1** (`watchtools` consistency), **S2** (missing "is ready" log), **S4** (delete help text).
- `cmd/rdd/set.go` — Reviewed, no issues. The migration from inline `errors.Is` to `cliexit.Classify` is correct.
- `docs/design/api_service.md` — Reviewed, no issues.
- `docs/design/cmd_lima.md` — Reviewed, no issues. **S3** (create help text precision is a code-side issue).
- `docs/design/cmd_service.md` — Reviewed, no issues.
- `pkg/cli/exit/exit.go` — **S6** (docstring and naming clarity).
- `pkg/service/cmd/service.go` — **I2** (Delete post-deadline state comment).
### [Codex GPT 5.4]

### Executive Summary

This PR standardizes `--wait` / `--timeout` handling across `rdd svc` and `rdd limavm`, routes timeout failures to exit code 4, and updates the design docs and BATS coverage accordingly. The main regression is that the new fixed 5 minute default for `svc stop` / `svc delete` is shorter than the current healthy multi-VM shutdown path, and the new already-running `svc start` wait path still accepts a stale discovery ConfigMap after an unclean restart.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **`svc stop` / `svc delete` now time out on healthy multi-VM shutdowns** — `cmd/rdd/service.go:36-42` (important, regression)

```go
// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration
// lands, the drain will sequence each VM's shutdown, so the CLI deadline must
// match the per-VM ceiling in [limaLongWaitTimeout]. A multi-VM instance may
// still exceed this budget while shutdownAllHostagents drains sequentially.
const stopWaitTimeout = limaLongWaitTimeout
```

The new default at `cmd/rdd/service.go:36-42` sets both `svc stop` and `svc delete` to a fixed 5 minute timeout, and that budget is enforced for the whole service wait in `pkg/service/cmd/service.go:366-382`. That is shorter than the current shutdown algorithm: `Run()` only gives the controller manager 45 seconds after cancellation at `pkg/service/cmd/service.go:819-833`, while Lima shutdown still waits hostagents sequentially and explicitly documents `N × gracefulShutdownTimeout` at `pkg/controllers/lima/limavm/controllers/limavm_controller.go:520-526`. With multiple running VMs, a healthy shutdown can legitimately exceed 5 minutes, so the new default turns a successful stop/delete into exit code 4 and falls back to `process.Kill` at `pkg/service/cmd/service.go:373-382`.

Fix: do not ship a fixed 5 minute default until `shutdownAllHostagents()` is parallelized. Either keep `svc stop` / `svc delete` at an indefinite default for now, or derive the default from the number of running LimaVMs before calling `StopWithWait()`.

I2. **The already-running `svc start` path still accepts a stale ready annotation** — `cmd/rdd/service.go:265-271` (important, regression)

```go
logrus.Infof("waiting for %q control plane to be ready", instance.Name())
ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
defer cancel()
if err := service.Wait(ctx); err != nil {
	return cliexit.Classify(err)
}
return cliexit.Classify(waitForDiscoveryConfigMap(ctx))
```

The new branch at `cmd/rdd/service.go:265-271` is meant to match the cold-start path, but it calls `waitForDiscoveryConfigMap()` at `cmd/rdd/service.go:131-144`, which intentionally skips the freshness check. That matters because the serve process writes `rdd.pid` very early at `pkg/service/cmd/service.go:584-587`, so a second `rdd svc start` can enter the "already running" branch before startup has reached `InitDiscovery()` at `pkg/service/controllers/discovery.go:360-389`. The cold-start path avoids stale discovery data by capturing `beforeStart` and calling `waitForFreshDiscoveryConfigMap()` at `cmd/rdd/service.go:113-128`, but this branch does not. Since `service.Wait()` only checks `/readyz`, CRDs, and webhooks via `pkg/service/cmd/service.go:256-312`, a stale ConfigMap from the crashed prior run can still satisfy the final wait and make `svc start` return before the current controller manager has registered.

Fix: record the current service start time for both paths and use `waitForFreshDiscoveryConfigMap()` here too. The simplest source is a startup marker written by `serve`; a weaker fallback is the current PID file mtime truncated to seconds.

#### Suggestions

S1. **The new service timeout coverage only exercises `svc stop`** — `bats/tests/20-service/6-timeout.bats:14-17` (suggestion, regression)

```bash
@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}
```

The new BATS file at `bats/tests/20-service/6-timeout.bats:14-17` covers only `svc stop`, but this PR also adds new timeout behavior in the already-running `svc start` branch at `cmd/rdd/service.go:260-271` and in `svc delete` at `cmd/rdd/service.go:358-370`. Those are the two service paths most likely to regress silently: `svc start` now depends on discovery freshness, and `svc delete` now has its own timeout/error contract plus the outer/inner `Running()` race handling in `pkg/service/cmd/service.go:428-437`.

Fix: add BATS coverage for `svc delete --timeout=...` exiting 4 without removing the instance directory, and for `svc start --timeout=...` when another service instance is already running but not yet discovery-ready.

### Design Observations

**Strengths**

Centralizing timeout-to-exit-code mapping in `pkg/cli/exit/exit.go:53-60` is a good boundary choice. It keeps lower-level packages returning ordinary wrapped errors while making the CLI contract consistent across `set`, `svc`, and `limavm`.

### Testing Assessment

1. There is no coverage for `svc stop` / `svc delete` with multiple running LimaVMs, which is the exact case where the new fixed 5 minute default in I1 fails.
2. There is no coverage for `svc start` against an already-running replacement service after stale discovery data survives an unclean shutdown, which is the case behind I2.
3. I ran `go test ./pkg/cli/exit ./pkg/service/cmd ./cmd/rdd/...`; the `cmd/rdd` package did not build in this checkout because `pkg/guestagent/guestagent.go` expects the generated `lima-guestagent.gz` asset.

### Documentation Assessment

`docs/design/cmd_service.md` now accurately reflects the current code, but it also bakes in the fixed 5 minute `svc stop` / `svc delete` default. If I1 is fixed by reverting to an indefinite or scaled timeout, those sections at `docs/design/cmd_service.md:102-122` need to change with the code.

### Coverage Summary

- `bats/tests/20-service/6-timeout.bats` — S1
- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues
- `cmd/rdd/limavm.go` — Reviewed, no issues
- `cmd/rdd/service.go` — I1, I2, S1
- `cmd/rdd/set.go` — Trivial
- `docs/design/api_service.md` — Reviewed, no issues
- `docs/design/cmd_lima.md` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `pkg/cli/exit/exit.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — I1

### [Gemini 3.1 Pro]

## Executive Summary

This PR successfully standardizes the wait and timeout behavior across `rdd svc` and `rdd limavm` commands, introducing consistent default wait times and structured exit codes (code 4 on deadline expiry). The implementation safely handles process shutdown synchronization and leverages `context.Context` effectively. The review found no new regressions; however, it identified one important pre-existing gap where API server watch disconnections can falsely report successful waits, bypassing the new timeout mechanisms.

## Findings

### Critical Issues

None.

### Important Issues

I1. **Watch disconnection falsely signals success** — `cmd/rdd/limavm.go:497-498`, `cmd/rdd/limavm.go:527-528` (important, gap)

```go
	for ev := range watcher.ResultChan() {
		// ...
		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
		if ok && check(updated) {
			return nil
		}
	}
	return ctx.Err()
```

If the API server closes the watch prematurely (e.g., due to a proxy timeout, network error, or standard Kubernetes watch expiry), `watcher.ResultChan()` closes and the loop terminates. Because the context itself has not timed out or been explicitly cancelled, `ctx.Err()` returns `nil`. This causes `watchUntil` and `watchUntilDeleted` to exit with a success code `0` even if the desired state (e.g. `Running == True`) was never reached. The CLI's wait completes falsely, skipping the new timeout logic entirely.

**Fix:** Return an error if the loop exits without `ctx.Err()` being populated, so the caller knows the watch dropped before the condition was satisfied. 

```diff
 		updated, ok := ev.Object.(*limav1alpha1.LimaVM)
 		if ok && check(updated) {
 			return nil
 		}
 	}
-	return ctx.Err()
+	if err := ctx.Err(); err != nil {
+		return err
+	}
+	return fmt.Errorf("watch closed before condition was met")
```
*(Apply the same fix to `watchUntilDeleted` around line 527, with the message `"watch closed before deletion was observed"`).*

### Suggestions

S1. **Prevent double-wrapping of Timeout errors** — `pkg/cli/exit/exit.go:55-59` (suggestion, enhancement)

```go
func Classify(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return Timeout(err)
	}
	return err
}
```

If `err` is already wrapped by `cliexit.Timeout(err)` (which implements `Unwrap() error`), `errors.Is(err, context.DeadlineExceeded)` evaluates to `true`, causing `Classify` to wrap the error a second time. While not currently breaking functionality, it adds unnecessary layers to the error chain and could complicate future error inspections.

**Fix:** Check if the error is already a `*cliexit.Error` of the correct type before classifying it.

```diff
 func Classify(err error) error {
+	var cliErr *Error
+	if errors.As(err, &cliErr) && cliErr.Code == CodeTimeout {
+		return err
+	}
 	if errors.Is(err, context.DeadlineExceeded) {
 		return Timeout(err)
 	}
 	return err
 }
```

## Design Observations

### Strengths

- **Robust signal escalation during graceful shutdown timeout:** In `StopWithWait`, if the `ctx` deadline is exceeded, `process.Kill(pid)` is called. On Unix, this sends a second `SIGTERM`. Because `controller-runtime`'s `SetupSignalHandler` intercepts this, the first signal (`SIGINT` from earlier) initiates graceful shutdown, and the second `SIGTERM` accurately forces an immediate hard exit. This natively integrates with Go's signal handling ecosystem without needing a specialized `SIGKILL` or `KillTree`.
- **Safe Directory Cleanup:** The removal of `--wait=false` from `svc delete` to address directory corruption and PID file mutual exclusion races correctly reflects that persistent state removals *must* happen synchronously after the process holding handles fully exits. The race-handling inside `Delete` seamlessly preserves state when timeouts occur, matching the documented expectation.

## Testing Assessment

The new BATS tests specifically target `svc stop` and `limavm start/create/restart` with `--timeout` boundaries. This is well covered.

1. **Missing test for `limavm delete` timeout (Low Risk):** While `limavm-cli.bats` verifies default wait behavior using `rdd limavm delete "start-vm"`, it does not explicitly test that `rdd limavm delete --timeout=1s` exits with code 4 for a slow-to-delete instance (e.g., heavily populated WSL VM). The behavior is identical to `limavm start`, so the risk is low, but an explicit test for consistency could be added.

## Documentation Assessment

Documentation updates in `docs/design/` correctly match the implemented behavior, defaults, and API server wait-conditions. No gaps identified.

## Coverage Summary

- `bats/tests/20-service/6-timeout.bats`: Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-cli.bats`: Reviewed, no issues
- `cmd/rdd/limavm.go`: I1
- `cmd/rdd/service.go`: Reviewed, no issues
- `cmd/rdd/set.go`: Reviewed, no issues
- `docs/design/api_service.md`: Reviewed, no issues
- `docs/design/cmd_lima.md`: Reviewed, no issues
- `docs/design/cmd_service.md`: Reviewed, no issues
- `pkg/cli/exit/exit.go`: S1
- `pkg/service/cmd/service.go`: Reviewed, no issues
