# Deep Review: 20260423-182700-pr-344

| | |
|---|---|
| **Date** | 2026-04-23 18:27 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 1 (of target) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#344](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/344) — cli: standardize --wait and --timeout across svc and limavm |
| **Branch** | `wait-timeout-cleanup-v2` |
| **Commits** | `b59e796` cli: standardize --wait and --timeout across svc and limavm |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — minor doc/test polish plus an under-bounded readiness wait that survives an in-flight shutdown; all findings are low-risk or pre-existing. |
| **Wall-clock time** | `18 min 43 s` |

---

## Executive Summary

The PR delivers what it advertises: a uniform `--wait`/`--timeout` contract across `rdd svc` and `rdd limavm`, with `cliexit.Classify` centralizing `context.DeadlineExceeded` → exit code 4. `StopWithWait`/`Delete` now accept caller context and timeout, the discovery-ConfigMap freshness check anchors on a per-start PID-file mtime, and the manual `watcher.ResultChan()` loops in `cmd/rdd/limavm.go` correctly distinguish a satisfied predicate from a dropped channel.

No agent flagged the change set as merge-blocking. The one Important finding (Gemini I1) is a pre-existing race between in-flight shutdown and the readiness poll; the PR's new `--timeout=0` option for `svc start --wait` widens its worst-case slightly. Suggestions cluster around (a) sibling CLI paths the uniform-contract refactor missed, (b) error/diagnostic loss in the new Delete predicate, and (c) doc and test polish.

Structure: 0 Critical, 1 Important (gap, pre-existing), 6 Suggestions.

---

## Critical Issues

None.

## Important Issues

I1. **Readiness wait can hang against an in-flight shutdown** — `pkg/service/cmd/service.go:325-343`, `cmd/rdd/service.go:151-176` [Gemini 3.1 Pro]

```go
func Wait(ctx context.Context) error {
    if err := checkReadiness(ctx); err == nil {
        return nil
    }
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            if err := checkReadiness(ctx); err == nil {
                return nil
            }
        }
    }
}
```

If `rdd svc start` (or any command transiting `ensureServiceRunning`) launches while `rdd svc stop --wait` or `rdd svc delete` is draining, the start-side path takes the already-running branch (`Running()` returns true while the daemon drains), then enters `service.Wait`. When the daemon exits and the shutdown path removes `instance.Config()`, `checkReadiness` keeps failing — the loop has no `Running()` check — and the wait hangs until `ctx` expires (default 90s on the autostart path, or whatever `--timeout` the user passed to `svc start`).

`waitForFreshDiscoveryConfigMap` at `cmd/rdd/service.go:151` has the same shape: it polls until the ConfigMap is fresh enough or `ctx` expires, with no fast-fail when the daemon is gone.

Reconciliation: Gemini labeled this "regression"; the underlying `Wait` loop has no `Running()` check at the merge-base either (`git show 81135357:pkg/service/cmd/service.go` confirms an identical loop), so the bug is **pre-existing** (gap), not introduced by this PR. The PR does widen exposure marginally — the freshness check on the already-running path is new (post-PR commits to `creationTimestamp >= startTime`), and `svc start --wait --timeout=0` now allows an unbounded hang where the pre-PR `WaitWithTimeout` capped at 90s. Repo context line 51 explicitly calls out this pattern as a known concern, which is exactly what surfaced here.

Fix: short-circuit both polls when the service process is gone:

```diff
 case <-ticker.C:
+    if !Running() {
+        return fmt.Errorf("control plane exited while waiting for readiness")
+    }
     if err := checkReadiness(ctx); err == nil {
         return nil
     }
```

```diff
 return wait.PollUntilContextCancel(ctx, 500*time.Millisecond, true, func(ctx context.Context) (bool, error) {
+    if !service.Running() {
+        return false, fmt.Errorf("control plane exited while waiting for discovery ConfigMap")
+    }
     cm, err := client.CoreV1().ConfigMaps(controllers.RDDSystemNamespace).Get(...)
```

## Suggestions

S1. **`rdd ctl wait-condition` skips the uniform exit-4 contract** — `cmd/rdd/ctl_wait_condition.go:122-134` [Gemini 3.1 Pro]

```go
_, err = watchtools.UntilWithSync(ctx, lw, &unstructured.Unstructured{}, nil,
    func(event watch.Event) (bool, error) { ... },
)
return err
```

`ctl wait-condition` carries its own `--timeout` flag (default 30s) and creates a context with that deadline at line 97. When the deadline expires, `watchtools.UntilWithSync` returns a wrapped `context.DeadlineExceeded`, which exits the CLI with code 1 instead of the intended `cliexit.CodeTimeout` (4). The PR does not touch this file, but the stated goal of a uniform CLI contract leaves this sibling path on the wrong exit code. This matches the warning at repo-context line 48 about uniform-contract refactors that wire only the outermost entry points.

Fix:

```diff
-    return err
+    return cliexit.Classify(err)
```

S2. **`Delete` silently drops non-deadline `StopWithWait` errors when `Running()` happens to be false at re-check** — `pkg/service/cmd/service.go:422-446` [Claude Opus 4.7]

```go
if Running() {
    if err := StopWithWait(ctx, true, timeout); err != nil {
        if errors.Is(err, context.DeadlineExceeded) || Running() {
            return err
        }
    }
}
preserveAllInstanceLogs()
```

`StopWithWait` can return three non-deadline errors: `"control plane is not running"` (line 350; benign race), the `"failed to stop ... pid"` wrapper when both `Interrupt` and `Kill` fail (line 367; rare on Unix, possible on Windows when no shared console), and `"wait for ... cancelled"` when the parent context is canceled (line 400; today unreachable per the in-code comment). When any of these fire and the daemon happens to have exited between the error and the re-check, the predicate proceeds with deletion and the diagnostic is lost. The behavior is still strictly better than the pre-PR `_ = Stop()` (which dropped every error unconditionally), so the *outcome* is correct; what is lost is the user's only signal that something unexpected happened — most relevantly a Windows console-fallback that bypassed graceful hostagent shutdown.

Reconciliation: Claude labeled this "regression"; the pre-existing `_ = Stop()` discarded errors more aggressively, so the post-PR predicate is an **improvement, not a regression**. Reclassified as a design-quality concern in newly-introduced code.

Fix: log the dropped error at warn level so the daemon-state surprise is preserved without changing the outcome:

```diff
 if Running() {
     if err := StopWithWait(ctx, true, timeout); err != nil {
         if errors.Is(err, context.DeadlineExceeded) || Running() {
             return err
         }
+        logrus.WithError(err).Warn("service exited despite shutdown error; proceeding with delete")
     }
 }
```

S3. **`watchUntil` / `watchUntilDeleted` give up after a single dropped watch** — `cmd/rdd/limavm.go:485-512`, `cmd/rdd/limavm.go:542-566` [Claude Opus 4.7]

```go
for ev := range watcher.ResultChan() { ... }
if err := ctx.Err(); err != nil {
    return err
}
if err := c.Get(ctx, key, &vm); err != nil { ... }
if check(&vm) {
    return nil
}
return fmt.Errorf("watch closed before LimaVM %q reached the desired state", key.Name)
```

The fall-through correctly fixes the "dropped watch → spurious success" trap. But when the predicate still isn't satisfied after the re-read, the code returns a hard error rather than reopening the watch from `vm.ResourceVersion`. For `rdd limavm restart --wait --timeout=5m`, a benign Kubernetes server-side watch expiry (~5–10 min) would surface as a hard failure even though the caller's `ctx` is still valid. The repo already has a retry-aware pattern: `cmd/rdd/set.go:490-535` uses `watchtools.UntilWithSync`, which transparently re-lists on watch-channel closure.

Fix: replace the hand-rolled loop with `watchtools.UntilWithSync` (preferred — matches the `set` implementation), or wrap the existing loop in a retry that opens a fresh watch from the current `ResourceVersion` until `ctx` is done.

S4. **`docs/design/api_service.md` omits the Windows "first signal fails" case** — `docs/design/api_service.md:21-24` [Claude Opus 4.7]

```
The control plane is notified to shut down, either by `rdd service stop`
(SIGINT on Unix, `CTRL_BREAK_EVENT` on Windows; if the wait deadline expires
the wait force-terminates with SIGTERM on Unix or `TerminateProcess` on
Windows), or by the OS preparing to shut down the host.
```

The in-code comment at `pkg/service/cmd/service.go:356-364` documents that on Windows, `GenerateConsoleCtrlEvent` "fails when caller and target lack a shared console; Kill (TerminateProcess) then bypasses graceful shutdown." Since `rdd.exe` typically launches the daemon as a detached background process, this fallback is the **common** Windows path, not a corner case — graceful hostagent shutdown then relies on `killOrphanedHostagent` at next start. The doc currently captures only the deadline-expiry fallback.

Fix: extend the parenthetical, e.g.:

> On Windows, when the CLI and daemon do not share a console, `CTRL_BREAK_EVENT` cannot be delivered and the stop falls back to `TerminateProcess` immediately; graceful hostagent shutdown does not run, and orphaned hostagents are reaped at the next service start.

S5. **`svc delete --timeout` flag help wording diverges from the rest of the PR** — `cmd/rdd/service.go:372` [Claude Opus 4.7]

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion (0 means wait indefinitely)")
```

Every other `--timeout` flag in this PR uses the shape `"Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)"`. Because `svc delete` deliberately omits `--wait`, the wording correctly differs — but it then drops the exit-4 contract that `cmd_service.md:121` explicitly documents.

Fix: align the wording to surface the exit-4 contract:

```diff
-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion; on expiry rdd exits with code 4 and preserves the instance directory (0 means wait indefinitely)")
```

S6. **Cross-test daemon-state coupling in the new `6-timeout.bats`** — `bats/tests/20-service/6-timeout.bats:22-33` [Claude Opus 4.7]

```bash
@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Hard reset: the prior test sent SIGTERM but does not wait for the
    # process to exit, so a plain `rdd svc start` could still see
    # Running()=true and take the already-running branch.
    rdd svc delete --timeout=10s || :
    rdd svc create
    rdd svc start
    ...
```

The hard-reset uses `--timeout=10s`, while every other reset in the suite (e.g. `setup_rdd_control_plane`) uses `--timeout=120s`. If the prior test's SIGTERM ever takes longer than 10s on slow CI, the reset times out, `rdd svc create` errors with `"instance already exists"`, and `run -4` on the subsequent line misreports the failure as "expected exit 4 but got 1". The comment also reads slightly off: the prior test does wait (1ms) before the synchronous Kill fallback fires.

Fix: bump the reset to `--timeout=120s` for parity with the rest of the suite, or move the reset into a `local_setup` so each `@test` starts from a known daemon state independent of the prior test's force-exit timing.

## Design Observations

**Concerns**

- `(future)` `ensureServiceRunning`'s 90-second cap is correct (the comment defends it well: a broken service must fail fast on a long deadline), but the user-visible error is identical to the one a genuinely slow VM boot would produce. A user running `rdd limavm start --wait --timeout=30m` on a flaky control plane will see exit-4 "context deadline exceeded" after 90s with no signal that the failure was the autostart cap, not the VM. Consider wrapping the autostart-phase error with a prefix like `"control plane not ready after %s (autostart cap): %w"` so the diagnostic distinguishes "broken daemon" from "slow VM". [Claude Opus 4.7]

**Strengths**

- `(in-scope)` Centralizing `cliexit.Classify` removes duplicated `errors.Is(err, context.DeadlineExceeded)` branches and makes the exit-4 contract inspectable in one place. Wired through `ensureServiceRunning`, every transitive caller (`rdd set`, every `rdd limavm *`, `rdd kubectl`, `rdd service config`) inherits the contract without per-callsite plumbing — the PR's central design move. [Claude Opus 4.7, Codex GPT 5.5, Gemini 3.1 Pro]
- `(in-scope)` Anchoring `waitForFreshDiscoveryConfigMap` on `service.StartTime()` (PID-file mtime) correctly handles daemon restarts without a cross-process atomic clock; the truncation comment in `startAndWaitForReady` (`time.Now().Truncate(time.Second)`) shows the right level of attention to `metav1.Time` precision. [Claude Opus 4.7]
- `(in-scope)` `watchUntil` / `watchUntilDeleted` fall-through after a closed `ResultChan` correctly distinguishes "watch dropped" from "predicate satisfied" via the post-loop re-read — a real class of spurious successes that the repo steering note specifically calls out. [Claude Opus 4.7, Gemini 3.1 Pro]
- `(in-scope)` `StopWithWait`'s on-deadline `Running()` re-check before falling through to SIGTERM/TerminateProcess cleanly handles the "daemon just exited" race; pre-PR it would have returned a timeout error even when graceful shutdown had completed. [Claude Opus 4.7]

## Testing Assessment

The new BATS coverage anchors the exit-4 contract for every primary entry point: `svc start/stop/delete` and `limavm create/start/stop/restart/delete`. The `--timeout=1ms` pattern is a reliable trigger for the deadline path, and `assert_output --partial 'context deadline exceeded'` ties exit code to the wrapped error.

Gaps worth tracking (none merge-blocking):

1. `svc start --wait=false` followed by `svc start` — exercises the new already-running + `startTime`-anchored freshness check; not yet covered.
2. `svc delete --timeout=0` (indefinite) — verifies that `watchtools.ContextWithOptionalTimeout(ctx, 0)` is honored on the shutdown path; not covered.
3. `limavm create --start --wait=false --timeout=5m` — verifies that `--timeout` really is ignored when `--wait=false`; a regression here would silently waste wall-clock with no failure signal.
4. Retry after a `--timeout=1ms` delete — `6-timeout.bats` confirms the directory survives but doesn't close the loop by retrying with a reasonable timeout.
5. Autostart timeout via a transitive caller (e.g. `rdd ctl get` or `rdd service config` against a wedged daemon) — confirms the `cliexit.Classify` wiring inside `ensureServiceRunning` reaches commands that don't surface `--timeout` themselves.

Codex live-verified `make build-rdd` and `rdd svc start --wait --timeout=1ms` (exit code 4); the BATS suite reproduces the same path under deterministic timing.

## Documentation Assessment

`cmd_service.md` and `cmd_lima.md` track the new defaults, the exit-4 contract, and the `svc delete` "preserves the instance directory on timeout" behavior consistently — these read cleanly. `api_service.md` is the one doc miss (S4) — it captures the timeout-fallback signal path but omits the more common Windows "no shared console → immediate `TerminateProcess`" path. The dense rationale comment at `pkg/service/cmd/service.go:384-395` is technically accurate but worth splitting into shorter paragraphs on a subsequent edit.

The `stopWaitTimeout = limaLongWaitTimeout` cross-file aliasing at `cmd/rdd/service.go:43` is acceptable (it makes the per-VM ceiling the single tuning knob) but fragile; the existing comment on the alias is the right size and should not be removed.

## Acknowledged Limitations

- `pkg/service/cmd/service.go:356-364` — Windows `GenerateConsoleCtrlEvent` failure with no shared console falls back to `TerminateProcess`, bypassing graceful hostagent shutdown; cleanup deferred to `killOrphanedHostagent` at next start. PR does not regress this; only the doc still under-describes it (see S4).
- `pkg/service/cmd/service.go:429-434` — `context.Canceled` is unreachable today because the CLI runs on `context.Background()`; wiring SIGINT into `cmd.Context()` would change deletion semantics during a live stop and needs an explicit decision. Comment is accurate.
- `pkg/service/cmd/service.go:410-413` — `--wait=false` deliberately leaves `instance.Config()` in place so concurrent `rdd ctl` / `rdd kubectl` calls keep working while the daemon drains. Intentional; documented inline.


---

## Agent Performance Retro

### [Claude]

Drove the bulk of the report with six findings spread across code, docs, and tests. Strongest contribution: the `watchUntil` no-retry concern (S3) and the dense Windows-shutdown doc gap (S4) — both required reading the in-code comment block and reasoning about cross-platform behavior, not just the diff. Mislabeled S2 as a "regression" — the new `Delete` predicate is actually an improvement over the pre-PR `_ = Stop()` that discarded every error; reclassified to a design-quality concern. Did not catch the readiness-wait race that Gemini flagged.

### [Codex]

Returned a clean review with zero findings and the highest-value live verification of the round (`make build-rdd`, `rdd svc start --wait --timeout=1ms` exit 4). The empty-handed result is honest — Codex correctly ranked the change set as low-risk — but it missed both the readiness-wait race that the repo context explicitly warns about (Gemini I1) and the `ctl wait-condition` contract gap (Gemini S1). Useful as a calibration signal that the PR has no obvious latent bugs in the changed paths; less useful for tracing into sibling code.

### [Gemini]

Found the highest-impact issue of the round (I1, the readiness-wait race) by following exactly the pattern repo-context line 51 warns about. Also caught the only sibling-path miss (S1, `ctl wait-condition` skipping `cliexit.Classify`) that the uniform-contract refactor leaves behind — repo-context line 48 in action. Mislabeled I1 as a "regression"; the underlying `Wait()` loop has the same shape at the merge-base. Two findings, both valid, both downgraded by reconciliation but kept in the report.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 12m 47s | 5m 54s | 9m 41s |
| Findings | 5S | none | 1I 1S |
| Tool calls | 38 (Read 19, Grep 14, Bash 5) | 51 (shell 49, stdin 2) | 30 (grep_search 17, read_file 10, run_shell_command 2) |
| Design observations | 5 | 1 | 3 |
| False positives | 0 | 0 | 0 |
| Unique insights | 5 | 0 | 2 |
| Files reviewed | 11 | 11 | 11 |
| Coverage misses | 0 | 1 | 0 |
| **Totals** | **5S** | **none** | **1I 1S** |
| Downgraded | 1 (I→S) | 0 | 1 (I→S) |
| Dropped | 0 | 0 | 0 |


**Reconciliation**

- Gemini C1 readiness race: severity downgraded from Critical to Important, label changed from regression to gap. Underlying `Wait()` loop has no `Running()` check at merge-base SHA `81135357` (verified via `git show`). PR widens worst-case slightly via the new freshness check on the already-running path and the `--timeout=0` option on `svc start --wait`.
- Claude I1 Delete predicate: severity downgraded from Important to Suggestion (S2), label changed from regression to design concern. Pre-PR `_ = Stop()` discarded every error; the post-PR predicate is a strict improvement that loses only the diagnostic, not correctness.

Best value of the round: Gemini provided the only finding tied to repo-specific context (race against in-flight shutdown), which Codex missed entirely and Claude did not surface. Claude provided the broadest coverage and the most actionable suggestions for the changed code itself. Codex contributed the only live verification.

## Review Process Notes

None — repo context already encodes the patterns that surfaced this round (lines 48 and 51). Skill prompt and dimensions covered everything that landed.

---

## Appendix: Original Reviews

### [Claude]

Now let me look at other pieces of the code—in particular, `process.Interrupt`, `process.Kill`, and related helpers for signal handling. Let me also understand what `ContextWithOptionalTimeout` does and what commands invoke ensureServiceRunning.

Now I have enough context to write up the review. Let me also sanity-check one subtle area: the `watchUntil` re-read fallthrough logic and whether the implementation of `watchUntilDeleted` is consistent.

Now I have enough context. Let me write up the review.

# Code Review — PR #344: cli: standardize --wait and --timeout across svc and limavm

## Header

| | |
|---|---|
| Verdict | Approve with minor follow-ups |
| Risk | Low |

## Executive Summary

The PR delivers a clean, consistent `--wait`/`--timeout` contract across `rdd svc` and `rdd limavm` commands, centralizes `context.DeadlineExceeded` → `CodeTimeout` classification, and correctly anchors discovery-ConfigMap freshness on a per-start PID-file mtime. The core Go changes are technically sound — `StopWithWait` / `Delete` signatures propagate caller context, watch-drop fall-throughs now return a distinct error instead of a spurious success, and all callers ensure the new `Exists()` precondition. Findings are minor: doc phrasing, BATS fragility, and a couple of possible followups around error classification boundaries.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `Delete` swallows non-deadline `StopWithWait` errors when the process is no longer running** — `pkg/service/cmd/service.go:423-438` (important, regression)

```go
if Running() {
    if err := StopWithWait(ctx, true, timeout); err != nil {
        if errors.Is(err, context.DeadlineExceeded) || Running() {
            return err
        }
    }
}
preserveAllInstanceLogs()
```

`StopWithWait` can return non-deadline errors: `"control plane is not running"` (line 350 — possible if the daemon raced Delete's own `Running()` check), the `"failed to stop ... control plane with pid"` signal-failure wrapper (line 367, from both `Interrupt` and `Kill` failing), or `"wait for ... cancelled"` when the parent context is canceled (line 400). The predicate `errors.Is(err, DeadlineExceeded) || Running()` silently drops any of those if the process happens to be gone at the moment `Running()` is re-queried — deletion then proceeds and the original error is lost to the user. A signal-delivery failure is mildly concerning (rare but nontrivial), and swallowing the `"control plane is not running"` race means the user loses the only diagnostic signal that something unexpected happened.

In practice none of these non-deadline errors are likely to fire, but the predicate trades a loud failure (tell the user the daemon state was surprising) for a silent continuation. The author's in-code comment acknowledges the reasoning but doesn't address why a signal-send failure shouldn't surface.

Fix: narrow the pass-through to the "not running" race that Delete actually wants to absorb:

```diff
     if Running() {
         if err := StopWithWait(ctx, true, timeout); err != nil {
-            if errors.Is(err, context.DeadlineExceeded) || Running() {
+            // Absorb only the benign race where StopWithWait's own
+            // Running() check lost to a daemon that exited on its own
+            // between our check and its check.
+            if Running() || errors.Is(err, context.DeadlineExceeded) {
                 return err
             }
+            logrus.WithError(err).Warn("service stopped despite shutdown error; proceeding with delete")
         }
     }
```

If the narrower intent is what's wanted, a debug/warn log preserves the diagnostic. If the author prefers the current behavior, the comment should explicitly call out which non-deadline errors are intentionally swallowed.

### Suggestions

**S1. `svc delete` flag help is inconsistent with the rest of the PR** — `cmd/rdd/service.go:372` (suggestion, enhancement)

```go
command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion (0 means wait indefinitely)")
```

Every other `--timeout` in the PR uses the phrase "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)" or the same shape with "ignored without --start". Because `svc delete` has no `--wait` flag (the author deliberately removed it, per the PR description), the inconsistency is accurate — but the absence of any mention of "exits with code 4 on expiry" or a `--wait` reference still reads differently. Either align or explain.

Fix: append the exit-4 note for parity with `cmd_service.md` (line 121):

```diff
-    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion; on expiry rdd exits with code 4 and preserves the instance directory (0 means wait indefinitely)")
```

**S2. `docs/design/api_service.md` omits the Windows "first signal fails" case** — `docs/design/api_service.md:21-24` (suggestion, gap)

```
The control plane is notified to shut down, either by `rdd service stop`
(SIGINT on Unix, `CTRL_BREAK_EVENT` on Windows; if the wait deadline expires
the wait force-terminates with SIGTERM on Unix or `TerminateProcess` on
Windows), or by the OS preparing to shut down the host.
```

The in-code comment at `pkg/service/cmd/service.go:360-364` documents that on Windows `GenerateConsoleCtrlEvent` "fails when caller and target lack a shared console; Kill (TerminateProcess) then bypasses graceful shutdown." Given that `rdd.exe` typically invokes the daemon as a detached background process, this first-path `TerminateProcess` fallback is the common Windows behavior, not a rare corner case — graceful shutdown of hostagents on Windows relies on the `killOrphanedHostagent` path at next start. The doc update captures the timeout path but not this non-timeout fallback.

Fix: expand the parenthetical to mention the Windows console-attachment prerequisite, e.g. "On Windows, when the CLI and daemon lack a shared console, `CTRL_BREAK_EVENT` cannot be delivered and the stop immediately falls back to `TerminateProcess`; graceful hostagent shutdown does not run and cleanup is deferred to the next service start."

**S3. BATS test in `6-timeout.bats` depends on cross-test daemon state** — `bats/tests/20-service/6-timeout.bats:22-33` (suggestion, regression)

```bash
@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Hard reset: the prior test sent SIGTERM but does not wait for the
    # process to exit, so a plain `rdd svc start` could still see
    # Running()=true and take the already-running branch.
    rdd svc delete --timeout=10s || :
    rdd svc create
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    ...
```

The `rdd svc delete --timeout=10s || :` on line 26 relies on the prior test having brought the daemon close to exit via the SIGINT→SIGTERM force-exit path. If the previous test ever fails to deliver both signals (unlikely but possible on slow CI), this 10s bound is strictly less than the `stopWaitTimeout=5m` configured elsewhere, so a stuck daemon survives and `rdd svc create` on the next line errors (`instance already exists`) — which `run -4` on line 29 will then misreport as "timeout expected but got exit 1." The comment also claims "the prior test sent SIGTERM but does not wait" — actually the prior test *does* wait (up to 1ms) and then the code sends the fallback Kill synchronously, so the diagnostic is slightly misleading.

Fix: either bump the hard-reset timeout to `--timeout=120s` (matching `local_setup_file`), or make the test independent by wrapping the reset in its own `local_setup` so each `@test` starts from a known daemon state.

**S4. `watchUntil` / `watchUntilDeleted` never retry a dropped watch** — `cmd/rdd/limavm.go:485-512, 542-566` (suggestion, enhancement)

```go
for ev := range watcher.ResultChan() {
    ...
}
// ResultChan closed. If ctx ended, return its error. Otherwise the
// watch dropped (server close, proxy timeout, watch expiry); re-read
// and re-evaluate to distinguish a satisfied predicate from a drop.
if err := ctx.Err(); err != nil {
    return err
}
if err := c.Get(ctx, key, &vm); err != nil { ... }
if check(&vm) {
    return nil
}
return fmt.Errorf("watch closed before LimaVM %q reached the desired state", key.Name)
```

The fix fully addresses the "dropped watch → spurious success" trap (good). But when the predicate still isn't satisfied after the re-read, the code gives up rather than reopening the watch from the latest `vm.ResourceVersion`. For `rdd limavm restart --wait --timeout=5m`, a benign server-side watch expiry (Kubernetes does this every ~5-10 minutes) would surface as a hard failure even though the caller's ctx is still valid. This contradicts how `rdd set` handles the same class via `watchtools.UntilWithSync`, which retries transparently.

Fix: replace the hand-rolled watch loop with `watchtools.UntilWithSync`, or wrap the current loop in a for-over-retry with a fresh watch from the current ResourceVersion. The `rdd set` implementation at `cmd/rdd/set.go:490-535` is the pattern already in the repo.

## Design Observations

**Concerns**

- **`ensureServiceRunning`'s fixed 90s cap is documented but surprises users on long timeouts.** A user running `rdd limavm start --wait --timeout=30m` will see exit-4 "context deadline exceeded" after only 90s if the autostart phase stalls. The comment captures intent, but the user-facing error is identical to the one a genuine 30m-timed-out VM boot would produce — there's no diagnostic distinction. Consider wrapping the autostart-phase error with a prefix like `"control plane not ready after 90s (autostart cap): %w"` so users can disambiguate "my daemon is broken" from "my VM is slow." `(future)`

**Strengths**

- Centralizing `cliexit.Classify` eliminates duplicated `errors.Is(err, context.DeadlineExceeded)` branches and makes the exit-4 contract inspectable in one place.
- The `watchUntil` / `watchUntilDeleted` fall-through re-read properly anchors the "watch closed" vs "ctx ended" distinction — this fixes a real class of spurious successes and the repo steering note calls this out specifically.
- Anchoring `waitForFreshDiscoveryConfigMap` on PID-file mtime (via `service.StartTime()`) correctly handles daemon restarts without needing a cross-process atomic clock.
- `StopWithWait`'s on-deadline pre-check of `Running()` before sending SIGTERM handles the "process just exited" race cleanly — the pre-PR code would have returned a timeout error even when the graceful shutdown had in fact completed.

## Testing Assessment

The PR adds integration coverage for every new exit-4 path across `svc` and `limavm`. Missing scenarios ranked by risk:

1. `svc start --wait=false` (no timeout wait) followed by `svc start` (default wait) — exercises the already-running + startTime-based freshness check that didn't exist before the PR. Not covered.
2. `svc delete --timeout=0` (indefinite wait) — verifies that `watchtools.ContextWithOptionalTimeout(ctx, 0)` is honored in the shutdown path. Not covered.
3. `limavm create --start --wait=false --timeout=5m` — verifies that `--timeout` really is ignored when `--wait=false`. Not covered; a regression here would silently waste wall-clock.
4. A retry scenario after a `timeout=1ms` delete — does the preserved instance dir actually allow `rdd svc delete` (with a reasonable timeout) to succeed? The test confirms `status` reports the dir intact but doesn't close the retry loop.
5. Windows-specific: `svc stop --timeout=1ms` on a daemon without a shared console — does the CTRL_BREAK fallback to TerminateProcess still produce exit 4, and does the state survive? No Windows BATS coverage yet (pre-existing gap, not PR-specific).

## Documentation Assessment

- `cmd_service.md` and `cmd_lima.md` are updated consistently with the new defaults and the exit-4 contract — good.
- `api_service.md` has the signal-path asymmetry flagged in S2.
- The in-code comment block in `pkg/service/cmd/service.go:384-394` (StopWithWait deadline branch) is *very* dense but technically accurate; consider splitting the Windows / Unix rationale into two short paragraphs.
- `stopWaitTimeout` aliased to `limaLongWaitTimeout` across files (`cmd/rdd/service.go:43`) is fine, but the cross-file const aliasing gets fragile quickly — a future edit to `limaLongWaitTimeout` would silently retune `svc stop` in a way no one diff-reading `service.go` would see. A short comment near the `const stopWaitTimeout = limaLongWaitTimeout` line already exists; keep it.

## Commit Structure

Single commit, focused, message accurately describes the change. No concerns.

## Acknowledged Limitations

- **`pkg/service/cmd/service.go:356-364`** — in-code comment explains that on Windows `Interrupt` (GenerateConsoleCtrlEvent) fails when caller and target lack a shared console, then falls back to `Kill` (TerminateProcess) bypassing graceful shutdown. Hostagent cleanup deferred to `killOrphanedHostagent` on next start. The PR does not regress this; only the docs still under-describe it (S2).
- **`pkg/service/cmd/service.go:429-434`** — in-code comment explains that `context.Canceled` is unreachable today because the CLI runs on `context.Background()`; wiring signal-handling into `cmd.Context()` would require an explicit design decision. Verified accurate: `cli.RunNoErrOutput` (from `k8s.io/component-base`) does not install signal handlers, and cobra's default `Execute()` uses `context.Background()`.
- **`pkg/service/cmd/service.go:410-412`** — `--wait=false` deliberately leaves `instance.Config()` in place so concurrent `rdd ctl` / `rdd kubectl` calls keep working while the daemon drains. Intentional; documented.

## Coverage Summary

| File | Status |
|---|---|
| `bats/helpers/controller.bash` | Reviewed, no issues |
| `bats/tests/20-service/6-timeout.bats` | S3 |
| `bats/tests/33-lima-controllers/limavm-cli.bats` | Reviewed, no issues |
| `cmd/rdd/limavm.go` | S4 |
| `cmd/rdd/service.go` | S1 |
| `cmd/rdd/set.go` | Reviewed, no issues |
| `docs/design/api_service.md` | S2 |
| `docs/design/cmd_lima.md` | Reviewed, no issues |
| `docs/design/cmd_service.md` | Reviewed, no issues |
| `pkg/cli/exit/exit.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | I1 |
### [Codex]

### Executive Summary

This PR standardizes wait defaults and timeout classification across `svc`, `limavm`, and shared autostart paths. I did not find merge-blocking regressions in the changed code; the main remaining risk is test coverage around already-running/autostart readiness paths rather than the direct timeout paths.

### Findings

#### Critical Issues

None.

#### Important Issues

None.

#### Suggestions

None.

### Design Observations

**Strengths** — The explicit `context.Context` and timeout plumbing through `StopWithWait(ctx, wait, timeout)` and `Delete(ctx, timeout)` is a good cleanup. Centralizing `context.DeadlineExceeded` classification in `cliexit.Classify` also reduces the chance that sibling CLI paths drift on exit-code behavior.

### Testing Assessment

1. The new BATS tests cover direct `svc start/stop/delete` timeout exits and direct `limavm create/start/stop/restart/delete` timeout exits.
2. Still worth adding later: an autostart timeout assertion through a transitive caller such as `rdd ctl` or `rdd service config`, since those depend on `ensureServiceRunning`.
3. Still worth adding later: the already-running `rdd svc start --timeout=...` readiness path, especially the fresh discovery ConfigMap check.

I verified `make build-rdd` succeeds and a live `rdd svc start --wait --timeout=1ms` exits with code `4`.

### Documentation Assessment

The updated service and Lima command docs match the new defaults and timeout behavior. No documentation gaps found in the changed docs.

### Acknowledged Limitations

- **Code comment**: `pkg/service/cmd/service.go:431` documents that `context.Canceled` is currently unreachable because CLI command contexts are still rooted in `context.Background()`, and that wiring Ctrl-C into `cmd.Context()` needs an explicit deletion decision. This is a deliberate tradeoff and not worsened by the PR.

### Coverage Summary

- `bats/helpers/controller.bash` — Reviewed, no issues
- `bats/tests/20-service/6-timeout.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues
- `cmd/rdd/limavm.go` — Reviewed, no issues
- `cmd/rdd/service.go` — Reviewed, no issues
- `cmd/rdd/set.go` — Reviewed, no issues
- `docs/design/api_service.md` — Reviewed, no issues
- `docs/design/cmd_lima.md` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `pkg/cli/exit/exit.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues

### [Gemini]

### Executive Summary

This PR standardizes `--wait` and `--timeout` behavior across the `rdd svc` and `rdd limavm` subcommands, centralizing `context.DeadlineExceeded` translation into exit code 4 via `cliexit.Classify`. The new default timeouts (`90s` for service start, `5m` for VM lifecycle and service shutdown) eliminate indefinite hangs, and `svc delete` now safely waits for the daemon to exit before tearing down the instance directory. The test suite correctly covers these timeouts. However, the delayed teardown introduced for `svc delete` and `svc stop` creates a critical race condition where a concurrent autostart can hang waiting for readiness against a deleted API state.

### Findings

**Critical Issues**

C1. **Readiness wait hangs if concurrent shutdown deletes `instance.Config()`** — `pkg/service/cmd/service.go:341`, `cmd/rdd/service.go:161` (critical, regression)

The PR makes `svc delete` and `svc stop --wait` block until the daemon exits before removing state (`instance.Config()` and `instance.Dir()`). This delayed teardown creates a race condition: if a user runs `rdd svc start` (or any command calling `ensureServiceRunning`) while a shutdown is in-flight, `ensureServiceRunning` evaluates `service.Running() == true` and enters the already-running branch, calling `service.Wait(ctx)`. When the shutting-down daemon finally exits, the shutdown command deletes `instance.Config()`. `service.Wait(ctx)` then repeatedly fails to build the kubeconfig, hanging for the full 90-second `startWaitTimeout` against a dead daemon because it never checks if the process is still alive. The same hang occurs in `waitForFreshDiscoveryConfigMap` if the ConfigMap wait races the teardown.

Fix: Abort the poll loops immediately if `!Running()`.

```diff
 // pkg/service/cmd/service.go
 func Wait(ctx context.Context) error {
 	if err := checkReadiness(ctx); err == nil {
 		return nil
 	}
 
 	ticker := time.NewTicker(500 * time.Millisecond)
 	defer ticker.Stop()
 
 	for {
 		select {
 		case <-ctx.Done():
 			return ctx.Err()
 		case <-ticker.C:
+			if !Running() {
+				return fmt.Errorf("control plane exited while waiting for readiness")
+			}
 			if err := checkReadiness(ctx); err == nil {
 				return nil
 			}
 		}
 	}
 }
```

```diff
 // cmd/rdd/service.go
 func waitForFreshDiscoveryConfigMap(ctx context.Context, beforeStart time.Time) error {
 	restConfig, err := service.GetKubeRestConfig()
 	if err != nil {
 		return err
 	}
 	client, err := kubernetes.NewForConfig(restConfig)
 	if err != nil {
 		return fmt.Errorf("failed to create kubernetes client: %w", err)
 	}
 
 	return wait.PollUntilContextCancel(ctx, 500*time.Millisecond, true, func(ctx context.Context) (bool, error) {
+		if !service.Running() {
+			return false, fmt.Errorf("control plane exited while waiting for discovery ConfigMap")
+		}
 		cm, err := client.CoreV1().ConfigMaps(controllers.RDDSystemNamespace).Get(
 			ctx, controllers.ControllerManagerConfigMapName, metav1.GetOptions{},
 		)
```

**Important Issues**

None.

**Suggestions**

S1. **Missing `cliexit.Classify` for `rdd ctl wait-condition`** — `cmd/rdd/ctl_wait_condition.go:149` (suggestion, gap)

While `ctl` is an internal developer command, `wait-condition` performs a wait governed by a context deadline. By returning `watchtools.UntilWithSync`'s error directly instead of wrapping it in `cliexit.Classify(err)`, a deadline expiry skips the centralized `CodeTimeout` envelope and exits with code 1 instead of 4, deviating from the rest of the CLI's uniform timeout contract.

Fix: Wrap the return value.

```diff
-	_, err = watchtools.UntilWithSync(ctx, lw, &unstructured.Unstructured{}, nil,
+	_, err := watchtools.UntilWithSync(ctx, lw, &unstructured.Unstructured{}, nil,
 		func(event watch.Event) (bool, error) {
 			if event.Type == watch.Deleted {
 				return false, errors.New("resource was deleted while waiting")
 			}
 			obj, ok := event.Object.(*unstructured.Unstructured)
 			if !ok {
 				return false, nil
 			}
 			return checker.check(obj), nil
 		},
 	)
-	return err
+	return cliexit.Classify(err)
```

### Design Observations

**Strengths**

- **Uniform CLI Contract**: The PR successfully centralizes the timeout contract by wiring `cliexit.Classify` directly into the `ensureServiceRunning` helper. This ensures that all commands that transitively trigger an autostart or autowait — `rdd set`, every `rdd limavm *`, `rdd kubectl`, and `rdd service config` — properly inherit the `CodeTimeout` (exit code 4) envelope without needing redundant error handling at each outermost entry point.
- **Robust Channel Closure Handling**: The manual loop around `watcher.ResultChan()` in `cmd/rdd/limavm.go` now correctly distinguishes between a satisfied wait and a dropped watch (e.g., from a proxy disconnect or watch expiry), surfacing a distinct error instead of converting the channel closure into a spurious success.
- **Intentional Ceilings**: The PR accurately preserves the independent `startWaitTimeout` ceiling for the autostart phases (`ensureServiceRunning`) without improperly inheriting `--timeout` from the user's command context.

### Testing Assessment

The BATS coverage for timeout behaviors correctly asserts exit code 4 for the primary entry points (`rdd svc start/stop/delete` and `rdd limavm create/start/stop/restart/delete`). The use of dummy templates and very low timeout values (`1ms`) is an effective way to trigger the fallback paths reliably. No critical coverage gaps remain for the timeout mechanics.

### Documentation Assessment

The documentation updates correctly capture the fallback mechanism for `svc stop` and `svc delete`, specifically noting that the wait leaves the instance directory intact upon timeout. The default values listed in the Markdown directly correspond to the constant definitions (`limaLongWaitTimeout` and `startWaitTimeout`).

### Coverage Summary

- `bats/helpers/controller.bash`: Reviewed, no issues
- `bats/tests/20-service/6-timeout.bats`: Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-cli.bats`: Reviewed, no issues
- `cmd/rdd/limavm.go`: Reviewed, no issues
- `cmd/rdd/service.go`: C1
- `cmd/rdd/set.go`: Reviewed, no issues
- `docs/design/api_service.md`: Reviewed, no issues
- `docs/design/cmd_lima.md`: Reviewed, no issues
- `docs/design/cmd_service.md`: Reviewed, no issues
- `pkg/cli/exit/exit.go`: Reviewed, no issues
- `pkg/service/cmd/service.go`: C1
