# Deep Review: 20260422-160047-pr-336

| | |
|---|---|
| **Date** | 2026-04-22 16:00 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 7 (of target) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#336](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/336) — cli: standardize --wait and --timeout across svc and limavm |
| **Commits** | `47b71db` cli: standardize --wait and --timeout across svc and limavm |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — fix the doc-imprecision (I3), add the missing `--timeout=120s` on `local_setup_file` (S1), and consider the autostart-entry-point coverage gap (I2). The other suggestions are polish that need not block merge. |
| **Wall-clock time** | `20 min 19 s` |

---

## Executive Summary

Round 7 of a CLI standardization PR (`--wait` / `--timeout` flags across `rdd svc` and `rdd limavm`, exit code 4 contract, `service.Delete` always-waits-before-removing-files). Most prior-round findings are addressed in this iteration: `waitFlag` rename, `watchUntil` watch-close error symmetry, expanded `Delete` predicate comment, `--timeout` help-string consistency, and the `setup_rdd_control_plane` 5m block. What remains is mostly documentation and test-coverage polish.

The core change is sound. Three findings worth attention before merge: a doc imprecision in `api_service.md`, a missing `--timeout=120s` on the new test file's `local_setup_file` that mirrors the same risk the helper already mitigates, and persistent absence of exit-code-4 BATS coverage for the autostart entry points (`rdd set`, `rdd kubectl`, `rdd svc config`) plus `limavm delete/stop`. Two findings raised as "Important" by Gemini are deliberate design tradeoffs documented in code comments and have been demoted to Design Observations.

---

## Critical Issues

None.

---

## Important Issues

I1. **`setup_rdd_control_plane` silently reuses the prior daemon's controller set after a timed-out delete** — `bats/helpers/controller.bash:78-83` [Codex GPT 5.4]

```bash
# Bound delete and tolerate failure so a stuck prior daemon cannot
# hang the helper for the full 5m stopWaitTimeout. `create` then
# proceeds against whatever state survives.
rdd svc delete --timeout=120s || :
rdd svc create --controllers="${controllers}"
rdd svc start
```

The new helper intentionally swallows `rdd svc delete` failure at line 81. If the timed-out delete leaves the instance directory behind, `rdd svc create` at line 82 becomes a no-op because `serviceCreateAction` returns success when the instance already exists (`cmd/rdd/service.go:202-205`). Then `rdd svc start` at line 83 either waits on the still-running old daemon (already-running branch, `cmd/rdd/service.go:262-281`) or restarts from the previously saved args (`pkg/service/cmd/service.go:511-518`) instead of the requested `controllers` set. A cleanup timeout therefore turns into cross-suite contamination: the next test runs against whatever controllers the previous daemon happened to preserve.

Note that even Codex's suggested fix only partially addresses this — `rdd svc start --controllers=...` is silently ignored when `Running()` is true. The deeper question is whether to fail loudly when the precondition cannot be established, or to detect and recover (`Stop` + retry `Delete`).

Fix: keep the helper's "fresh control plane with this controller set" invariant. The simplest fix is to fail fast when the bounded delete fails, and pass `--controllers` on `start` so a surviving stopped instance is at least restarted with the intended set:

```diff
-    rdd svc delete --timeout=120s || :
+    rdd svc delete --timeout=120s
     rdd svc create --controllers="${controllers}"
-    rdd svc start
+    rdd svc start --controllers="${controllers}"
```

I2. **The new exit-code-4 contract is not asserted on every entry point the PR changes** — `bats/tests/20-service/6-timeout.bats:17-40`, `bats/tests/33-lima-controllers/limavm-cli.bats:196-229` [Codex GPT 5.4] (round 6 S1 unaddressed)

The new BATS coverage hits `svc start/stop/delete` and `lima create --start/restart/start`. The implementation change is broader: `ensureServiceRunning` (`cmd/rdd/service.go:92-110`) now classifies autostart deadlines for `rdd set`, `rdd kubectl`, and `rdd service config`; `cmd/rdd/set.go:324-325` was updated specifically for this contract; and `cmd/rdd/limavm.go:584-589` gives `limavm delete` the same timeout classification. None of those paths have a `run -4` assertion today, and `limavm stop` (`cmd/rdd/limavm.go:449-456`) is also uncovered.

A missing `cliexit.Classify` on any one of these would still ship green even though the PR claims a uniform CLI contract. Round 6 raised the same gap; this round adds coverage for some commands but the autostart and `limavm delete/stop` gaps persist.

Fix: add one explicit exit-code-4 BATS assertion per entry point whose observable contract changed. For the autostart callers (`rdd set`, `rdd kubectl`, `rdd svc config`), force `ensureServiceRunning` past its 90s readiness deadline (e.g., delete the discovery ConfigMap or block readiness via a controller toggle). For the LimaVM commands, reuse the non-booting template fixture and assert `run -4`.

I3. **`api_service.md` shutdown-signal description conflates two distinct fallback paths** — `docs/design/api_service.md:21` [Claude Opus 4.7]

```text
The control plane is notified to shut down, either by `rdd service stop` (SIGINT on Unix with SIGTERM as fallback; `CTRL_BREAK_EVENT` on Windows with `TerminateProcess` as fallback), or by the OS preparing to shut down the host.
```

This sentence implies a single "if SIGINT fails, send SIGTERM" fallback. The implementation actually has two distinct fallbacks (`pkg/service/cmd/service.go:374-378` and `pkg/service/cmd/service.go:399`):

1. **Signal-delivery failure fallback** (`StopWithWait` line 374-378): if `process.Interrupt` returns an error, immediately call `process.Kill`. On Unix, `Interrupt` (SIGINT via `unix.Kill`) only fails when the PID is invalid (the comment at line 369-371 says so), so this path is essentially Windows-only — it covers the "no shared console" failure of `GenerateConsoleCtrlEvent`.
2. **Deadline-expiry fallback** (`StopWithWait` line 388-404): if `--timeout` expires, call `process.Kill` once. On Unix that delivers SIGTERM, which the daemon's signal handler treats as the second shutdown signal.

`docs/design/cmd_service.md:104-107` describes the deadline-expiry path correctly. The api-doc as written contradicts it: on Unix the `Interrupt`-failed fallback never fires in practice, and the deadline-expiry path is the meaningful Unix fallback. The text was modified in this PR (the previous wording was outright wrong about the primary signal too), so this is an opportunity to land the precise version rather than a partially-corrected one.

Fix:

```diff
-The control plane is notified to shut down, either by `rdd service stop` (SIGINT on Unix with SIGTERM as fallback; `CTRL_BREAK_EVENT` on Windows with `TerminateProcess` as fallback), or by the OS preparing to shut down the host.
+The control plane is notified to shut down, either by `rdd service stop`
+(SIGINT on Unix, `CTRL_BREAK_EVENT` on Windows; if the wait deadline expires
+the wait force-terminates with SIGTERM on Unix or `TerminateProcess` on
+Windows), or by the OS preparing to shut down the host.
```

---

## Suggestions

S1. **`local_setup_file` in the new timeout test uses default 5m for cleanup, while the per-test reset uses 10s** — `bats/tests/20-service/6-timeout.bats:11-15` [Claude Opus 4.7]

```bash
local_setup_file() {
    rdd svc delete || :
    rdd svc create
    rdd svc start
}
```

The other deletes in the same file pass `--timeout=10s` (line 26) or rely on a known-clean state (line 37). The file-level `rdd svc delete || :` on line 12 has no explicit timeout, so it inherits the new 5-minute default. If a prior test file left a half-dead daemon (the comment on line 23 acknowledges that tests exit while the daemon is still draining), this setup can hang the entire test file for 5 minutes before BATS reports anything.

The `bats/helpers/controller.bash:81` change explicitly bounds its delete at 120s for exactly this reason. The same reasoning applies here.

Fix:

```diff
 local_setup_file() {
-    rdd svc delete || :
+    rdd svc delete --timeout=120s || :
     rdd svc create
     rdd svc start
 }
```

S2. **`serviceStartAction` already-running branch reads `StartTime()` after `Running()` without guarding the TOCTOU** — `cmd/rdd/service.go:262-282` [Claude Opus 4.7] (round 6 S3 unaddressed)

```go
if service.Running() {
    logrus.Infof("%q control plane is already running", instance.Name())
    if !waitFlag {
        return nil
    }
    startTime, err := service.StartTime()
    if err != nil {
        return err
    }
```

`Running()` (line 262) reads the PID file, validates the process is alive, and removes the PID file if the process is gone (`pkg/service/cmd/service.go:158`). `StartTime()` (line 267) calls `os.Stat(instance.PIDFile())`. If the process exits between line 262 and line 267 (e.g., a concurrent `rdd svc stop`) and the PID file is removed, `StartTime()` returns a raw `os.Stat: ENOENT` error. The window is microseconds and harmless in practice, but the surfaced error is opaque and not `Classify`-able.

The same race exists in `ensureServiceRunning` (line 99). Round 6 raised this; this round has not addressed it.

Fix: when `StartTime()` returns `os.IsNotExist`, retry the `Running()` check or return a clearer "control plane exited" message:

```go
startTime, err := service.StartTime()
if err != nil {
    if os.IsNotExist(err) {
        return fmt.Errorf("%q control plane exited while waiting", instance.Name())
    }
    return err
}
```

S3. **`stopWaitTimeout` aliases a constant defined in a sibling file** — `cmd/rdd/service.go:41` [Claude Opus 4.7]

```go
// stopWaitTimeout bounds the CLI wait for the control plane to shut down.
// ... Aliasing to limaLongWaitTimeout
// makes the per-VM ceiling the single tuning knob for long CLI waits.
const stopWaitTimeout = limaLongWaitTimeout
```

`limaLongWaitTimeout` is declared in `cmd/rdd/limavm.go:42`. The two files share the `main` package, so this compiles, but a reader looking at `service.go` only has the comment to discover where `limaLongWaitTimeout` lives. The "single tuning knob" rationale is fine; the implementation makes the file dependency invisible.

Either move both constants to a small shared file, or qualify the comment with the file location:

```diff
-// hosts, and sequential multi-VM drains. Aliasing to limaLongWaitTimeout
-// makes the per-VM ceiling the single tuning knob for long CLI waits.
+// hosts, and sequential multi-VM drains. Aliased to limaLongWaitTimeout
+// (defined in limavm.go) so the per-VM ceiling is the single tuning knob
+// for long CLI waits.
```

S4. **`service.Delete` never reaches its `if !Exists()` guard** — `pkg/service/cmd/service.go:447-449` [Claude Opus 4.7]

```go
func Delete(ctx context.Context, timeout time.Duration) error {
    if !Exists() {
        return fmt.Errorf("%q control plane does not exist", instance.Name())
    }
```

The only caller is `serviceDeleteAction` (`cmd/rdd/service.go:368-372`), which short-circuits on `!service.Exists()` before calling `service.Delete`. The library-side guard is dead code, and if it ever fired, it would propagate as a CLI error (exit 1) rather than the action's "info + success".

Either remove the inner check (and document `Delete`'s precondition that the instance exists) or have `Delete` log+return-nil to match the action. Removal is the smallest change.

S5. **`serviceStopAction` removes `instance.Config()` after `--wait=false` even though the daemon is still alive** — `pkg/service/cmd/service.go:412-415` [Claude Opus 4.7]

```go
}

_ = os.Remove(instance.Config()) // Ignore error as file might not exist
return nil
```

The `!wait` branch falls through to remove `instance.Config()` (the kubeconfig the CLI uses to talk to the apiserver) immediately after sending SIGINT. The daemon may still be alive and serving for tens of seconds. Any concurrent `rdd ctl get` / `rdd kubectl` call between stop's return and the daemon's actual exit will fail with "could not read config" (`pkg/service/cmd/service.go:95`), masking the real state ("daemon is still up").

This predates the PR but the new contract makes `--wait=false` more attractive (per the docs), so the symptom will get more exposure. The fix is to gate the removal on a confirmed-exited check, or to leave the kubeconfig in place when `!wait` (it will be re-created on next start).

S6. **`cleanupDiscoveryConfigMap` uses `context.TODO()` and is called outside the `--timeout` envelope** — `pkg/service/cmd/service.go:355-360, 433` [Claude Opus 4.7]

```go
_ = cleanupDiscoveryConfigMap()
```

Three concerns, all pre-existing but in scope of this PR's "wire context through" theme:

1. The function is not bounded by the user's `--timeout` (it runs before `watchtools.ContextWithOptionalTimeout`).
2. `CleanupDiscovery` is called with `context.TODO()` (`pkg/service/cmd/service.go:433`), so it has no deadline at all.
3. The author already documented the trade-off ("Moving cleanup into the daemon's own shutdown path is the durable fix").

Listed because the PR did the bulk of the context plumbing for `Stop`/`Delete` and this single TODO is now the loudest holdout.

Fix: thread `ctx` into `cleanupDiscoveryConfigMap`, or move the cleanup into the daemon's own shutdown path as the comment proposes.

S7. **The deadline-path comment understates what SIGTERM actually does to the daemon** — `pkg/service/cmd/service.go:392-398` [Claude Opus 4.7]

The deadline-expiry SIGTERM is the second shutdown signal (the first SIGINT was sent earlier in the same call), and the apiserver's `SetupSignalContext` handler responds to a second signal by calling `os.Exit(1)`. The current comment reads as if SIGTERM lets the daemon "handle signals" gracefully, but the second signal forces an abrupt termination — which is the desired behavior, just not what the comment describes.

```diff
-                // On Unix, SIGTERM suffices because the service handles
-                // signals promptly; its slowness comes from draining
-                // hostagents sequentially.
+                // On Unix, SIGTERM is the *second* shutdown signal
+                // (SIGINT was sent at line 374); the apiserver signal
+                // handler responds to a second signal by calling
+                // os.Exit(1), forcing immediate termination — which is
+                // exactly what the deadline path needs.
```

S8. **`run_e -0` wrapper is unnecessary on a setup command whose output is not inspected** — `bats/tests/33-lima-controllers/limavm-cli.bats:222` [Gemini 3.1 Pro]

```bash
run_e -0 rdd limavm create "restart-timeout-vm" "restart-timeout-vm" --namespace "${LIMA_TEST_NS}"
assert_created "restart-timeout-vm" "${LIMA_TEST_NS}" "restart-timeout-vm"
```

`run_e -0` invokes `rdd limavm create`, but `assert_created` reads `${stderr}` set by the previous `run_e`. If the helper relies on `${stderr}`, the wrapper is justified; otherwise per the BATS style notes, commands whose success is the entire assertion should be invoked directly. Verify by reading `assert_created`: if it inspects `$stderr`, leave the call as-is; if not, drop the wrapper.

---

## Design Observations

### Concerns

**`--timeout` flag scope is intentionally not extended over `ensureServiceRunning`** — `cmd/rdd/service.go:88-91` *(in-scope)* [Gemini 3.1 Pro]

The code comment at lines 88-91 explicitly states: "It bounds the already-running and cold-start paths by `startWaitTimeout` (90s), independent of the caller's `--timeout`. The cap lets a broken service fail fast so `rdd limavm *`, `rdd set`, `rdd kubectl`, and `rdd service config` stay responsive on a long deadline." Gemini raised this as Important but the behavior is deliberate and documented. The asymmetry is real — a user passing `--timeout=2s` would expect the whole command to fail in 2s, not 90s — but this is a design tradeoff that has been raised in multiple prior rounds (round 5 S3) and consistently treated as intentional. A `min(userTimeout, startWaitTimeout)` would address both directions; flagged for design discussion rather than as a bug.

**Deadline-expiry path sends SIGTERM (graceful), not SIGKILL (forced)** — `pkg/service/cmd/service.go:389-401` *(in-scope)* [Gemini 3.1 Pro]

Gemini raised this as Important. The code comment at lines 392-398 documents the rationale: "On Unix, SIGTERM suffices because the service handles signals promptly; its slowness comes from draining hostagents sequentially. Run caps that drain at 45s before exiting anyway, so timeout values shorter than `stopWaitTimeout` (5m) routinely race a still-draining daemon." Combined with the apiserver's "second signal triggers `os.Exit(1)`" behavior (S7), the deadline-expiry path does force termination on Unix — just via the second-signal path rather than SIGKILL. The genuinely-deadlocked-daemon case (where Run's 45s cap also fails) accepts the hang in exchange for log flush and proper cleanup.

### Strengths

- **`cliexit.Classify` is clean and idempotent** — `pkg/cli/exit/exit.go:56-65` *(in-scope)* [Claude Opus 4.7] [Gemini 3.1 Pro]. Correctly handles wrapped `context.DeadlineExceeded`, short-circuits already-classified errors. Replacing the explicit `errors.Is` block in `set.go` with `Classify` is a real readability win.
- **Watch-drop handling is now sound** — `cmd/rdd/limavm.go:497-512, 550-566` *(in-scope)* [Claude Opus 4.7] [Codex GPT 5.4] [Gemini 3.1 Pro]. The post-watch re-read distinguishes "watch closed before predicate satisfied" from "context deadline" and from a satisfied predicate, addressing the silent-success-on-watch-drop class of bugs.
- **`service.Delete`'s "directory survives unless the daemon is confirmed gone" invariant is well-chosen** — `pkg/service/cmd/service.go:452-465` *(in-scope)* [Claude Opus 4.7]. The expanded comment now spells out the predicate's coverage of timeout, signal-delivery failure, and `context.Canceled`, which addresses round 6's S6.
- **Already-running `svc start` branch now waits for fresh discovery ConfigMap** — `cmd/rdd/service.go:267-279` *(in-scope)* [Codex GPT 5.4]. Aligns the cold-start and already-running paths so both wait for `ReadyAnnotation` rather than just process liveness.
- **`StartTime()` truncates to seconds to match `metav1.Time` serialization** — `pkg/service/cmd/service.go:172-178` *(in-scope)* [Claude Opus 4.7]. Prevents the "ConfigMap startTime appears before beforeStart" edge case the round 6 review surfaced.

---

## Testing Assessment

The new BATS coverage hits the new exit-code-4 contract through `svc start/stop/delete` and `lima create --start/restart/start`. Untested scenarios, in priority order (mostly the same as round 6, with `limavm delete` newly missing):

1. **`rdd set --timeout=...` exit-code-4 contract.** `set.go:325` is the original site of the timeout exit code; nothing in `bats/tests/` asserts that `rdd set` returns 4 specifically.
2. **`rdd svc start --wait=false`.** The TODO at `bats/tests/20-service/2-create.bats:3` still says "test `rdd svc start --wait=false`" and remains unimplemented.
3. **`rdd kubectl --timeout-via-ensureServiceRunning` exit code.** `ensureServiceRunning` now routes `DeadlineExceeded` through `Classify`, but no BATS test exercises the path where `ensureServiceRunning` itself times out and the caller surfaces 4.
4. **`limavm delete --timeout` exit-code-4.** New tests cover `start`/`create --start`/`restart` timeouts, but `limavm delete --timeout=…` (the most-changed default — flipped from `--wait=false` to `--wait=true`) has no explicit timeout-exits-4 assertion.
5. **`limavm stop --timeout` exit-code-4.** Same gap, smaller blast radius because the existing tests use `--wait=false`.
6. **Concurrent `rdd svc stop` while `rdd svc start` is running.** The Discovery-ConfigMap deletion in `StopWithWait` creates a window where `ensureServiceRunning` would fail; no test exercises this race.

I2 above tracks (1), (3), (4), (5) as a single Important finding.

---

## Documentation Assessment

`cmd_service.md:93-99` and `cmd_service.md:102-122` correctly describe the new defaults, signal, and exit-code contract for `svc start/stop/delete`. `cmd_lima.md:17/24/31/38/57` updates each subcommand's `--timeout` default to `5m` and notes exit code 4. The new comments on `startWaitTimeout`, `stopWaitTimeout`, and `limaLongWaitTimeout` justify the values and tie them back to the daemon's internal 45s drain cap.

I3 above covers the `api_service.md` imprecision.

Pre-existing test `bats/tests/20-service/2-create.bats:37-44` asserts log strings ("successfully started", "waiting for ... to be ready") that no current code path emits. Not introduced by this PR.

---

## Commit Structure

Single commit, message accurately describes the diff (defaults, exit codes, library surface, removed callers, doc updates, new BATS coverage). No fixup commits, no scope creep. Round 7 of iteration on this PR has converged to a clean single-commit shape.

---

## Acknowledged Limitations

- **Discovery ConfigMap deleted before signaling, leaving a startup window for new clients.** `pkg/service/cmd/service.go:355-359` documents the trade-off: "a timed-out `--wait` leaves the configmap deleted while the daemon drains, so a quick `rdd svc start` polls the missing configmap until startWaitTimeout. Moving cleanup into the daemon's own shutdown path is the durable fix." S6 above offers an interim improvement.
- **`context.Canceled` in `Delete`'s predicate.** `pkg/service/cmd/service.go:455-461` notes that `context.Canceled` is unreachable today because the CLI runs on `context.Background()`; wiring SIGINT into `cmd.Context()` would let Ctrl-C fall through to deletion during a live stop, and needs an explicit decision.
- **Timeout values shorter than `stopWaitTimeout` race the daemon's 45s internal drain cap.** `pkg/service/cmd/service.go:395-398` documents this; the BATS suite avoids the race by using `--timeout=1ms` (deterministic loss) or `--timeout=120s`/`5m` (always wins).
- **Second-precision freshness anchor accepts a stale ConfigMap when a prior crash lands in the same wall-clock second as the new startup.** Round 6 S1 (Codex). The truncation aligns with `metav1.Time` serialization (a real constraint), so finer-grained anchoring would require carrying sub-second precision through the ConfigMap path.

---

## Agent Performance Retro

### [Claude]

Produced the deepest review (1 important + 7 suggestions + 4 design observations), most of them concrete code-comment or test-helper polish that would be hard for the other agents to surface without the same depth of file reading. Caught the api_service.md doc imprecision (I3) — a finding only Claude raised, requiring careful cross-referencing between the doc and two distinct fallback paths in `StopWithWait`. Mislabeled I3 as a "regression" (the prior text was already wrong, just differently); demoted in consolidation. The TOCTOU finding (S2) is a re-raise of round 6 S3 that Claude correctly tracked as still unaddressed.

### [Codex]

Two important findings, both well-grounded and complementary to Claude's coverage. I1 (controller.bash silently reuses wrong controller set) is a fresh observation on the new helper change that no other agent caught — required tracing through `serviceCreateAction`'s early-exit and the `ServeArgs` reuse path. I2 (autostart entry-point coverage gap) re-raises round 6's S1 with sharper framing, correctly identifying every entry point lacking `run -4` coverage. Codex held the bar high: zero suggestions, zero design observations beyond strengths, no false positives.

### [Gemini]

Two important findings, both contradicted by code comments documenting the intentional tradeoff — both demoted to Design Observations. Gemini missed the `cmd/rdd/service.go:88-91` comment when raising I1 (`--timeout` scope) and the `pkg/service/cmd/service.go:392-398` comment when raising I2 (SIGKILL). The single Suggestion (run_e -0 wrapper) is a valid style nit. As expected per the known-behavior note, Gemini also did not run `git blame` for regression labeling.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 13m 21s | 7m 42s | 8m 21s |
| Findings | 1I 7S | 2I | 1S |
| Tool calls | 51 (Bash 29, Read 22) | 46 (shell 42, plan 2, stdin 2) | 33 (grep_search 14, read_file 13, run_shell_command 6) |
| Design observations | 4 | 2 | 2 |
| False positives | 0 | 0 | 0 |
| Unique insights | 5 | 1 | 1 |
| Files reviewed | 11 | 11 | 11 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **1I 7S** | **2I** | **1S** |
| Downgraded | 0 | 0 | 2 (I→S) |
| Dropped | 0 | 0 | 0 |


Claude provided the most value this round through depth and breadth of suggestions. Codex was the most calibrated (no false positives, both findings survived consolidation). Gemini's findings were less reliable this round — both Important claims missed code comments that explained the design — but the run_e -0 catch is genuinely useful.

---

## Review Process Notes

### Repo context updates

- **[repo] When reviewing CLI commands that auto-start the service, distinguish `cmd.Context()` (unbounded) from a user `--timeout`-derived context.** RDD CLI commands routinely call an autostart helper (e.g., `ensureServiceRunning`) before applying the user's `--timeout` to the operation context. Reviewers should check whether the autostart phase is intentionally bounded by its own ceiling (e.g., `startWaitTimeout`) or whether it should inherit the user's deadline. The current convention in this repo is the former — autostart caps independent of `--timeout` — and that intent is documented in the helper's comment. Flag deviations from that convention, but do not flag the convention itself.

- **[repo] When a code comment documents a deliberate design tradeoff, treat findings that re-litigate the tradeoff as Design Observations, not Important Issues.** Multiple rounds of this PR have surfaced the same "user `--timeout` not propagated through implicit phases" and "deadline path sends SIGTERM not SIGKILL" claims, each backed by reasonable user-facing concern but each contradicted by an existing code comment that documents the intentional choice (`cmd/rdd/service.go:88-91`, `pkg/service/cmd/service.go:392-398`). Treat the comment as evidence of conscious design and downgrade unless the reviewer can articulate why the documented rationale is wrong.

### Skill improvements

- **When a finding is contradicted by a same-file code comment, demote it before consolidation rather than after.** Reviewers who raise a concern as Important should first grep the cited file for a comment that explicitly addresses the concern. If one exists and documents the choice as deliberate, the finding belongs in Design Observations (with the comment quoted) unless the reviewer can refute the documented rationale. This prevents repeated re-litigation across review rounds and keeps the Important section reserved for genuinely-unaddressed concerns.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

```
Reading the rest of the changed files to be thorough.

# Deep Review — PR #336: cli: standardize --wait and --timeout across svc and limavm

| | |
|---|---|
| **Verdict** | Approve with one important documentation fix and a few suggestions |
| **Scope** | 11 files, ~328 insertions, ~115 deletions, single squashed commit |
| **Risk** | Low. Behavior-preserving for the most part; new defaults are conservative |

## Executive Summary

This PR unifies `--wait` and `--timeout` semantics across `rdd svc` and `rdd limavm`, introduces `cliexit.Classify` to standardize the exit-code-4 contract on `context.DeadlineExceeded`, makes `svc delete` always wait (preserving the instance directory on timeout), and tightens watch-drop handling in `watchUntil`/`watchUntilDeleted`. The design is sound and matches the doc updates. Findings are mostly minor; one is a documentation accuracy regression introduced by this PR.

[Full text in /tmp/deep-review-39c5yP/review-claude-pass-1.md — see I1 (api_service.md doc) plus S1-S7 above.]
```

### [Codex] — Pass 1

```
### Executive Summary

This PR correctly pushes the CLI toward a consistent wait/timeout contract, and the core service/LimaVM code paths I traced are coherent. The review found one concrete regression in the shared BATS control-plane helper and one important test-coverage gap: the new exit-code-4 contract is only asserted on a subset of the entry points the PR changes.

[Full text in /tmp/deep-review-39c5yP/review-codex-pass-1.md — both Important findings reproduced verbatim above as I1 and I2.]
```

### [Gemini] — Pass 1

```
### Executive Summary

This PR introduces a consistent `wait` and `timeout` CLI contract for `rdd svc` and `rdd limavm` commands, defaulting most wait-capable commands to `--wait=true` with a 5-minute timeout. It centralizes the wait logic and gracefully surfaces `context.DeadlineExceeded` as exit code 4 across the CLI. The implementation is solid, though the timeout scoping creates a blind spot where the implicit control plane startup phase ignores the caller's explicit timeout flag.

[Full text in /tmp/deep-review-39c5yP/review-gemini-pass-1.md — I1 and I2 demoted to Design Observations above; S1 reproduced as S8.]
```

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-22 |
| **Commit** | `7b08089` cli: standardize --wait and --timeout across svc and limavm (amended) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: `setup_rdd_control_plane` silently reuses prior daemon's controller set after timed-out delete | Fixed |
| 2 | Important #2: exit-code-4 contract not asserted on every entry point | Fixed (limavm delete/stop tests added; autostart entry points deferred) |
| 3 | Important #3: `api_service.md` shutdown-signal description conflates fallback paths | Fixed |
| 4 | Suggestion #1: `local_setup_file` uses default 5m for cleanup | Fixed (both 6-timeout.bats sites) |
| 5 | Suggestion #2: `Running()` → `StartTime()` race surfaces opaque stat error | Skipped (deferred to discovery-CM-cleanup pending-work) |
| 6 | Suggestion #3: `stopWaitTimeout` aliases cross-file constant without reference | Fixed |
| 7 | Suggestion #4: `service.Delete` never reaches `!Exists()` guard | Fixed (same-pattern extension: also removed from `Start`) |
| 8 | Suggestion #5: `StopWithWait` removes `instance.Config()` after `--wait=false` | Fixed |
| 9 | Suggestion #6: `cleanupDiscoveryConfigMap` uses `context.TODO()` | Skipped (deferred to discovery-CM-cleanup pending-work) |
| 10 | Suggestion #7: deadline-path comment understates SIGTERM behavior | Fixed |
| 11 | Suggestion #8: `run_e -0` wrapper unnecessary on setup command | Skipped (false positive — `assert_created` reads `$stderr`) |
