# Deep Review: 20260421-173630-pr-336

| | |
|---|---|
| **Date** | 2026-04-21 17:36 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 1 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#336](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/336) — cli: standardize --wait and --timeout across svc and limavm |
| **Branch** | `wait-timeout-cleanup` |
| **Commits** | `99aa4e3` cli: standardize --wait and --timeout across svc and limavm |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — one important correctness bug (`svc delete --wait=false` races with the still-running daemon) and the shared 30s default is below the daemon's own 45s graceful-shutdown ceiling. Fix those two before merging; the remaining items are doc/testing polish. |
| **Wall-clock time** | `37 min 22 s` |

## Executive Summary

This PR unifies the `--wait` / `--timeout` surface across `rdd svc` and `rdd limavm`, and routes deadline expiry through `cliexit.Timeout` so timeouts exit with code 4 (matching `rdd set`). The refactor is clean and the wrap-once-at-the-CLI-boundary pattern is right. Two behavior changes warrant attention before merging: `svc delete --wait=false` now lets `service.Delete()` proceed to `os.RemoveAll(instance.Dir())` while the serve subprocess is still holding PID/SQLite/log handles; and the shared `defaultWaitTimeout = 30s` is below both the service's own 45s controller-manager drain (`svc stop`/`svc delete`) and the previous 90s CRD-establishment budget (`svc start`, still 90s inside `ensureServiceRunning`). Documentation in `docs/design/cmd_lima.md` was not updated to match the new defaults, the new exit-code-4 contract is asserted in exactly one test, and the wrapper `service.Stop()` removal is unmentioned in the commit body.

---

## Critical Issues

None.

## Important Issues

I1. **`svc delete --wait=false` removes the instance directory while the serve subprocess is still alive** — `pkg/service/cmd/service.go:424-438`, `cmd/rdd/service.go:347-376` [Claude Opus 4.7] [Codex GPT 5.4] [Gemini 3.1 Pro]

```go
func Delete(wait bool, timeout time.Duration) error {
    if !Exists() {
        return fmt.Errorf("%q control plane does not exist", instance.Name())
    }
    if Running() {
        if err := StopWithWait(wait, timeout); err != nil {
            return err
        }
    }
    preserveAllInstanceLogs()
    if os.Getenv("RDD_KEEP_LOGS") == "" {
        _ = os.RemoveAll(instance.LogDir())
    }
    _ = os.RemoveAll(instance.ShortDir())
    return os.RemoveAll(instance.Dir())
}
```

Before the PR, `Delete()` always called `Stop()`, which always waited. This PR exposes `--wait=false` on `svc delete` and threads it through to `StopWithWait`. When the user passes `--wait=false`, `StopWithWait` signals the process and returns immediately (`pkg/service/cmd/service.go:358-395`); `Delete` then calls `os.RemoveAll(instance.LogDir())`, `os.RemoveAll(instance.ShortDir())`, and `os.RemoveAll(instance.Dir())` while the serve subprocess still holds open handles to `rdd.pid`, `rdd.sqlite3`, `rdd.stderr.log`, `rdd.stdout.log`, and has its working directory pinned to `instance.Dir()` (set in `service.go:607`).

On Windows, `os.RemoveAll` fails with a sharing violation mid-removal because the handles are still open; the user sees an error and a partially-deleted tree. On Unix, removal succeeds, but then `Running()` starts returning false (PID file is gone; `PID()` returns `PIDNotFound` at `pkg/service/cmd/service.go:138-161`) even though the serve process remains alive. A subsequent `rdd svc start` for the same instance now sees `!Exists()`, recreates the directory, and starts a second serve subprocess that races the first for the secure port (`6443 + instance.Index()`).

Gemini flagged this as critical; Codex and Claude as important. We consolidate at important: the data being removed is itself destined for deletion, but the PID-file-based mutual exclusion breaks and the Windows partial-deletion path leaves the user with an error.

Fix: always wait for the process to exit before deleting state. Either drop `--wait=false` from `svc delete` entirely (deletion cannot usefully be non-blocking since the writer must be gone) or pass `true` unconditionally to the internal `StopWithWait` call while still exposing `--timeout` so users can tune how long to wait:

```diff
     if Running() {
-        if err := StopWithWait(wait, timeout); err != nil {
+        // Data deletion requires the writer to be gone before the directory
+        // is removed; --wait=false on svc delete is not safe, so ignore it.
+        if err := StopWithWait(true, timeout); err != nil {
             return err
         }
     }
```

If `--wait=false` is preserved for delete, `docs/design/cmd_service.md:112` should document the race explicitly.

I2. **Shared `defaultWaitTimeout = 30s` is below the service's own graceful-shutdown budget and below the previous start-readiness budget** — `cmd/rdd/service.go:32-34`, `cmd/rdd/service.go:234, 343, 354`, `pkg/service/cmd/service.go:815-827` [Claude Opus 4.7] [Codex GPT 5.4] [Gemini 3.1 Pro]

```go
// defaultWaitTimeout is the default --timeout for subcommands whose wait
// normally settles within seconds.
const defaultWaitTimeout = 30 * time.Second
```

The new default wires into three subcommands with different sizing constraints:

*Stop / delete path.* The serve subprocess allows up to 45s for its controller-manager goroutine to drain (`pkg/service/cmd/service.go:826`: `case <-time.After(45 * time.Second)`), and Lima hostagents get 30s of that budget for graceful shutdown. When the CLI's 30s deadline fires first, `StopWithWait` calls `process.Kill(pid)` at line 384, aborting the subprocess's own 45s window. The now-removed comment inside `StopWithWait` captured exactly this: `// The service needs up to 30s for graceful hostagent shutdown plus overhead, so allow 60s total.` With the new default, a healthy shutdown of a service running a Lima VM will routinely force-kill instead of draining cleanly.

*Start path.* The replaced `WaitWithTimeout` used 90s with the comment `// Increased timeout for CRD establishment`. Both `ensureServiceRunning` call sites (`cmd/rdd/service.go:97, 104`) still use 90s, which is the correct budget for CRD establishment + controller-manager registration + discovery ConfigMap readiness. The explicit `rdd svc start` path at line 234 now uses 30s. Either 90s is right in both places or 30s is right in both places — the split is itself evidence that 30s is too tight for the readiness path.

*Test exposure.* `bats/tests/20-service/2-create.bats:65`, `4-start-parameters.bats:45-63`, and `5-port-fallback.bats:72, 105, 147` all run `run -0 rdd svc stop` against services that may have a live Lima VM. Those now risk exit-code-4 flakes when the hostagent drain exceeds 30s.

Fix: split the one constant into named constants that each match a concrete invariant. For example:

```diff
-// defaultWaitTimeout is the default --timeout for subcommands whose wait
-// normally settles within seconds.
-const defaultWaitTimeout = 30 * time.Second
+// startWaitTimeout covers CRD establishment plus controller-manager registration.
+const startWaitTimeout = 90 * time.Second
+
+// stopWaitTimeout must exceed the service's own 45s controller-manager drain
+// (see Run in pkg/service/cmd/service.go) plus overhead.
+const stopWaitTimeout = 60 * time.Second
```

Then use `startWaitTimeout` in both `newServiceStartCommand` and `ensureServiceRunning`, and `stopWaitTimeout` in `newServiceStopCommand` and `newServiceDeleteCommand`.

I3. **Exit-code-4 contract is asserted in exactly one place** — `bats/tests/33-lima-controllers/limavm-cli.bats:208` [Claude Opus 4.7]

The PR introduces a uniform CLI contract: every wait-capable subcommand exits with code 4 on `--timeout` expiry. The only assertion of that contract is the mechanically updated `limavm start --timeout` test (`run -1` → `run -4`). None of the new paths are exercised:

- `rdd svc start --wait --timeout=…` (uses `timeoutError` at `cmd/rdd/service.go:294`)
- `rdd svc stop --wait --timeout=…` (at `cmd/rdd/service.go:326`)
- `rdd svc delete --wait --timeout=…` (at `cmd/rdd/service.go:372`)
- `rdd limavm stop|restart|delete|create --start --timeout=…` (at `cmd/rdd/limavm.go:235, 415, 455, 556`)
- `--timeout=0` (indefinite wait) on any command

A regression that returns the underlying error directly instead of the `cliexit.Timeout` envelope will ship green. The pre-existing TODO at `bats/tests/20-service/2-create.bats:3` (`# TODO: test rdd svc start --wait=false`) also becomes newly relevant: the PR formalizes that flag's semantics but the test is still missing.

Fix: add at least one test per subcommand that forces a timeout and asserts exit code 4. The `svc start --timeout` case is the simplest to provoke — run against an instance whose controllers cannot reach ready.

## Suggestions

S1. **`docs/design/cmd_lima.md` still documents the pre-PR defaults** — `docs/design/cmd_lima.md:17, 24, 31, 37, 38` [Codex GPT 5.4]

```md
- `--timeout` (default `0`): Maximum time to wait.
...
- `--wait` (default `false`): Wait until the resource is fully deleted before returning.
- `--timeout` (default `0`): Maximum time to wait.
```

The implementation now uses `limaLongWaitTimeout = 5 * time.Minute` for `limavm create/start/stop/restart/delete` and flips `limavm delete` to `--wait=true` by default (`cmd/rdd/limavm.go:150, 166, 182, 199, 253-254`). The design doc still says `--timeout` defaults to `0` and `limavm delete` defaults to `--wait=false`, so the documented blocking behavior now contradicts the CLI.

Fix: update `docs/design/cmd_lima.md` with the new defaults and the exit-code-4 semantics, mirroring the update already made to `docs/design/cmd_service.md`.

S2. **`defaultWaitTimeout` constant has no rationale** — `cmd/rdd/service.go:32-34` [Claude Opus 4.7]

```go
// defaultWaitTimeout is the default --timeout for subcommands whose wait
// normally settles within seconds.
const defaultWaitTimeout = 30 * time.Second
```

The previous code carried a sizing comment at the wait site explaining why 60s was the right number ("service needs up to 30s for graceful hostagent shutdown plus overhead"). After this PR the constant is shared but the rationale is gone. Whoever next tunes this number will not know which path it has to cover. Tie the constant's value to a concrete invariant (e.g., "must exceed the 45s controller-manager drain in `pkg/service/cmd/service.go` Run"). This pairs with I2.

S3. **`docs/design/cmd_service.md` says "Sends SIGTERM" but the Unix path sends SIGINT** — `docs/design/cmd_service.md:104` [Claude Opus 4.7]

```
### `rdd service stop`

Sends SIGTERM to the control plane process (`rdd.pid`) and waits up to 30s for it to exit.
```

`StopWithWait` calls `process.Interrupt(pid)` first (`pkg/service/cmd/service.go:358`), and `Interrupt` on Unix is `unix.Kill(pid, unix.SIGINT)` (`pkg/util/process/process_unix.go:26-28`). `process.Kill` (which does send SIGTERM) is only used as a fallback when `Interrupt` fails or on the timeout path. The PR edited this doc section for the new flag behavior but did not fix the pre-existing signal name:

```diff
-Sends SIGTERM to the control plane process (`rdd.pid`) and waits up to 30s for it to exit.
+Sends SIGINT to the control plane process (`rdd.pid`) and waits up to 30s for it to exit.
+On Windows this is delivered as `CTRL_BREAK_EVENT`.
```

S4. **`limavm delete` default flipped from `--wait=false` to `--wait=true` with a 5-minute default** — `cmd/rdd/limavm.go:253-254` [Claude Opus 4.7]

```go
command.Flags().BoolVar(&wait, "wait", true, "Wait for the VM to be deleted before returning")
command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
```

This is a behavior change in addition to a flag rename. Scripts that ran `rdd limavm delete <name>` and then moved on will now block up to 5 minutes (where they used to return in milliseconds). The PR description mentions the default-to-true move, but the design doc and commit body do not call out that `limavm delete` in particular changed. `bats/tests/33-lima-controllers/limavm-cli.bats:129, 196` already call `rdd limavm delete <name>` followed by `rdd ctl wait --for=delete`; under the new default those two waits are redundant.

Fix: call out the behavior change in the design doc (see S1) and consider whether the existing BATS tests should be updated to reflect the new contract.

S5. **`svc delete` log message lies when the deletion races a still-running serve** — `cmd/rdd/service.go:374` [Claude Opus 4.7]

```go
if err := service.Delete(wait, timeout); err != nil {
    return timeoutError(err)
}
logrus.Infof("%q control plane has been deleted", instance.Name())
return nil
```

If I1 is fixed by always waiting on delete, this is moot. If `--wait=false` stays, the log line lies — the process is mid-shutdown and the directory may be partially gone. Replace with something like `"%q control plane delete initiated"` when `wait=false`.

S6. **Typo: "err" → "error" in `timeoutError` comment** — `cmd/rdd/service.go:36-44` [Claude Opus 4.7]

```go
// timeoutError wraps a deadline-expiry err with [cliexit.Timeout] so the CLI
// exits with [cliexit.CodeTimeout], matching `rdd set`. Other errors pass
// through unchanged.
```

```diff
-// timeoutError wraps a deadline-expiry err with [cliexit.Timeout] so the CLI
+// timeoutError wraps a deadline-expiry error with [cliexit.Timeout] so the CLI
```

S7. **`service.Stop()` removal is not called out in the commit body** — `pkg/service/cmd/service.go` [Claude Opus 4.7]

The deleted `service.Stop()` wrapper was labeled "for backward compatibility" but had only one in-tree caller. Removing it is correct, but the commit body should mention that the public surface narrowed so a future reader does not go looking for an external consumer.

S8. **PR description lists `limavm create` under the 30s timeout bucket, but the code assigns it 5m** — `cmd/rdd/limavm.go:150` [Gemini 3.1 Pro]

The PR description says `limavm create` gets 30s; the code correctly uses `limaLongWaitTimeout = 5 * time.Minute` because the wait covers the VM boot cycle when `--start` is passed. Align the PR description / release notes with the code.

---

## Design Observations

### Concerns

- **One `defaultWaitTimeout` collapses three distinct sizing constraints** (in-scope) [Claude Opus 4.7] [Codex GPT 5.4]. `svc start` covers CRD establishment + controller-manager registration, `svc stop` covers the serve subprocess's own 45s controller-manager drain, and `svc delete` covers stop-then-cleanup. Those invariants have different right answers; the PR uses one 30s constant for all three and still needs a separate 5-minute `limaLongWaitTimeout` for VM boot. Three named constants — each tied to an invariant in its doc comment — would make tuning any one of them safer. See I2.
- **`svc delete` mixes process-shutdown with durable-state removal** (in-scope) [Claude Opus 4.7] [Codex GPT 5.4]. The `--wait=false` footgun in I1 is structural: deleting `instance.Dir()` while the writer still owns it cannot be made safe with better timeouts or doc warnings. The simplest design is "delete always waits for the process to exit." If a non-blocking delete is needed, it needs a two-phase design (e.g., a "pending deletion" marker) that defers filesystem cleanup until after the daemon is confirmed dead.

### Strengths

- `errors.Is(err, context.DeadlineExceeded)` plus `fmt.Errorf("…%w", err)` is the right classification primitive. It traverses the wrap chain so both the `limavm.go` wrap and the `service.go` wrap flow through `errors.As(err, &cliexit.Error)` in `main.go` without a sentinel string match. [Claude Opus 4.7]
- `watchtools.ContextWithOptionalTimeout` is used consistently at every wait site, so the "timeout=0 → no deadline" semantics live in one place instead of six independent timer constructions. [Claude Opus 4.7]
- `StopWithWait` now blocks on `<-ctx.Done()` instead of `<-time.After(60 * time.Second)`, so a parent-context cancellation (e.g., Ctrl-C) actually interrupts the wait. [Claude Opus 4.7]
- Routing `cliexit.Timeout` through a small helper at the CLI boundary (`timeoutError`) keeps `cliexit` concerns out of the lower-level service package. [Codex GPT 5.4]

## Testing Assessment

- Exit-code-4 is asserted in one place (`limavm-cli.bats:208`); see I3.
- `--timeout=0` (indefinite wait) is not exercised by any `svc` or `limavm` test.
- `rdd svc delete --wait=false` is untested — the most dangerous new mode, per I1, has zero coverage.
- `rdd svc stop` against a live hostagent is not covered; this is exactly the scenario where the new 30s default (I2) is too tight.
- The pre-existing TODO at `bats/tests/20-service/2-create.bats:3` (`# TODO: test rdd svc start --wait=false`) becomes newly relevant but is unchanged.

## Documentation Assessment

- `docs/design/cmd_service.md` was updated for the flag rename (correct) but still misstates the signal name (S3) and does not flag the `svc delete --wait=false` race (I1).
- `docs/design/cmd_lima.md` was not touched and now contradicts the CLI defaults (S1).
- Neither doc names the relationship between the client-side timeout and the serve subprocess's 45s controller-manager drain — the actual sizing constraint for `svc stop` (I2, S2).
- The comment at `cmd/rdd/service.go:107-114` still says "the API server readiness poll and the ConfigMap freshness poll share the single deadline", which holds only when `timeout > 0`. Minor.

## Commit Structure

Single commit, single concept, message accurately describes the change. The commit body could mention the `limavm delete --wait` default flip (S4) and the `service.Stop()` / `service.WaitWithTimeout()` removals (S7).

## Acknowledged Limitations

- `pkg/service/cmd/service.go:444-449` already comments the Windows `os.Rename` / `FILE_SHARE_DELETE` limitation in `preserveAllInstanceLogs`; that failure mode becomes more likely if I1 is left as-is but the underlying limitation is already named.
- `cmd/rdd/limavm.go:39-41` justifies `limaLongWaitTimeout = 5 * time.Minute` with a comment — consistent with the VM boot/shutdown invariant.

## Unresolved Feedback

No prior PR review comments on this PR.

---

## Agent Performance Retro

### [Claude]

Claude produced the most thorough pass: 5 important + 5 suggestion findings with the deepest technical reasoning. Its analysis of I1 traced the PID-file removal through `PID()` to uncover that `Running()` starts lying on the next invocation — the subtlest and most important aspect of that finding, which Codex and Gemini identified but did not explain. It also caught the SIGINT/SIGTERM doc drift that the other two missed, and separated the start-readiness-budget regression (I3 in its report) from the stop-drain-budget regression that all three agents flagged. One finding (`limavm delete` default flip) was downgraded to a suggestion because the PR description already documents the change; the substance of the observation remains useful for the commit body and design docs.

### [Codex]

Codex was the only agent to catch the stale `docs/design/cmd_lima.md` — a real doc drift that survives because the PR updated only the sibling file. Its two important findings overlap with the other agents' but its writeup on I2 pinned the sizing argument concisely to the service's advertised shutdown budgets (45s controller-manager drain, 30s hostagent) and its design observation on "shutdown and state removal are separate responsibilities" sharpened the fix direction. No false positives; one unique finding.

### [Gemini]

Gemini produced a compact report (1 critical + 1 important + 1 suggestion) and raced through the review quickly. It flagged the `svc delete --wait=false` race at critical severity — we downgraded to important after analysis, because the data being deleted is itself destined for removal and the concurrency exposure (though real) falls short of the critical bar. Its I1 bundled the start and stop timeout-default regressions into one finding; Claude and Codex separated them. One PR-description-vs-code suggestion (`limavm create` timeout category) that no other agent raised. It did not flag the SIGTERM/SIGINT doc drift or the doc drift in `cmd_lima.md`.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 19m 49s | 6m 43s | 3m 13s |
| Findings | 3I 6S | 2I 1S | 2I 1S |
| Tool calls | 35 (Bash 25, Read 10) | 57 (shell 52, stdin 5) | 10 (run_shell_command 6, read_file 3, grep_search 1) |
| Design observations | 5 | 2 | 0 |
| False positives | 0 | 0 | 0 |
| Unique insights | 6 | 1 | 1 |
| Files reviewed | 5 | 5 | 5 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **3I 6S** | **2I 1S** | **2I 1S** |
| Downgraded | 1 (I→S) | 0 | 1 (I→S) |
| Dropped | 0 | 0 | 0 |


**Reconciliation.** Gemini's C1 (`svc delete --wait=false` corrupts data) was downgraded to I1 (important) during consolidation: the data being removed is destined for deletion, so "data corruption" overstates the impact; the PID-file / mutual-exclusion break (Claude's angle) is real but does not meet the critical bar. Claude's I4 (`limavm delete` default flipped to wait) was downgraded to S4 (suggestion): the PR description documents the default change, so the finding is a doc/commit-body polish rather than a behavioral regression.

Claude found the most unique material (SIGINT/SIGTERM, missing test coverage of the exit-code-4 contract, lost rationale comment, typo, PID-file race in I1). Codex's coverage of `docs/design/cmd_lima.md` was the clearest single-agent contribution. Gemini's compact output matches its lower-effort setting; for this PR, the other two agents covered what Gemini surfaced.

---

## Review Process Notes

### Repo context updates

- **Add:** When a CLI change moves a hardcoded timeout to a user-configurable flag default, flag any other code path that uses the same semantic budget. Previously-correlated hardcoded timeouts often survive the refactor (see `ensureServiceRunning` still using 90s while `rdd svc start --timeout` defaults to 30s) and produce behavior splits that no test will catch. Reason: reviewers who compare the diff line-by-line miss these; the pattern requires a global search for the old value after the refactor.
- **Add:** RDD CLI exit codes are defined in `pkg/cli/exit/exit.go` (currently `CodeRejected=3`, `CodeTimeout=4`). When a PR introduces or broadens a contract that uses these codes, expect an assertion in the BATS suite using `run -N`. A contract asserted in only one place is a gap — a later regression that returns the raw error without the `cliexit.Error` envelope ships green.

---

## Appendix: Original Reviews

### [Claude Opus 4.7]

Let me verify the timeout regressions by checking the previous version and a few more details.

I have enough context to write the review.

# Deep Review — PR #336

## Executive Summary

This PR standardizes `--wait` and `--timeout` flags across `rdd svc` and `rdd limavm` subcommands and routes deadline expiry through `cliexit.Timeout` so timeouts exit with code 4 (matching `rdd set`). The refactor is mostly clean, but the new defaults shorten previously-hardcoded timeouts in two paths (60s→30s for `svc stop`, 90s→30s for `svc start` already-running branch), and the new `--wait=false` semantics for `svc delete` introduce a race where the instance directory is removed while the service is still running.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `svc stop --timeout=30s` default is below the service's own ~45s shutdown ceiling — `cmd/rdd/service.go:343`, `pkg/service/cmd/service.go:338-393`** *(important, regression)*

```go
// cmd/rdd/service.go:342-343
command.Flags().Bool("wait", true, "Wait for control plane to actually stop")
command.Flags().Duration("timeout", defaultWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
```

`defaultWaitTimeout` is `30 * time.Second` (line 34). The previous `StopWithWait` used a hardcoded 60s for exactly this scenario, and the now-removed comment explained why:

> "Wait for the process to actually terminate. The service needs up to 30s for graceful hostagent shutdown plus overhead, so allow 60s total."

The serve subprocess itself waits up to 45s for the controller manager goroutine to drain (`pkg/service/cmd/service.go:826`: `case <-time.After(45 * time.Second):`). With a 30s client-side timeout, `StopWithWait` will reach `<-ctx.Done()` and call `process.Kill(pid)` (line 384), aborting the subprocess's own 45s graceful window. Hostagents that would have completed shutdown in time are now SIGTERM'd by the parent kill — exactly the symptom the previous 60s default was sized to avoid. CI suites that rely on `run -0 rdd svc stop` (`bats/tests/20-service/2-create.bats:65`, `4-start-parameters.bats:45-63`, `5-port-fallback.bats:72`) will see a new flake mode: exit code 4 instead of 0, with leftover hostagents.

Fix: keep 60s (or longer) for `svc stop`, since the existing comment captures a sizing constraint the new default no longer satisfies.

```diff
-// defaultWaitTimeout is the default --timeout for subcommands whose wait
-// normally settles within seconds.
-const defaultWaitTimeout = 30 * time.Second
+// defaultWaitTimeout applies to start/delete waits that normally settle in
+// seconds.
+const defaultWaitTimeout = 30 * time.Second
+
+// stopWaitTimeout matches the service's own 45s controller-manager drain
+// (pkg/service/cmd/service.go Run) plus overhead. The serve subprocess will
+// SIGKILL itself if the manager is still running after 45s, so a 60s client
+// budget covers the worst case without orphaning hostagents.
+const stopWaitTimeout = 60 * time.Second
```

…and use `stopWaitTimeout` for `svc stop` and `svc delete`.

---

**I2. `svc delete --wait=false` removes the instance directory while the serve process is still running — `cmd/rdd/service.go:358-376`, `pkg/service/cmd/service.go:424-439`** *(important, regression)*

```go
// pkg/service/cmd/service.go:424-432
func Delete(wait bool, timeout time.Duration) error {
    if !Exists() {
        return fmt.Errorf("%q control plane does not exist", instance.Name())
    }
    if Running() {
        if err := StopWithWait(wait, timeout); err != nil {
            return err
        }
    }
    preserveAllInstanceLogs()
    ...
    return os.RemoveAll(instance.Dir())
}
```

Previously `Delete` always called `Stop()`, which always waited (`return StopWithWait(true)`). The new `--wait=false` path on `svc delete` calls `StopWithWait(false, ...)`, which sends a single SIGINT and returns without waiting. `Delete` then proceeds to `os.RemoveAll(instance.LogDir())` and `os.RemoveAll(instance.Dir())` while the serve subprocess is still alive and has open handles to `rdd.pid`, `rdd.sqlite3`, `rdd.stderr.log`, `rdd.stdout.log`, plus its CWD pinned to `instance.Dir()` (set at `service.go:607`).

- On Windows, `os.RemoveAll` returns an access-denied error mid-removal because those handles are still open; the user gets a partial deletion plus an error, and the still-running service keeps writing to a half-removed tree.
- On Unix the removal succeeds, but the now-orphan service keeps writing to deleted FDs. Worse, since `instance.PIDFile()` is gone, `Running()` returns false on the next invocation (`PID()` returns `PIDNotFound` when the PIDFile is unreadable) — `rdd svc start` will happily spawn a *second* serve subprocess for the same RDD_INSTANCE while the original is still alive, since the PIDFile-based mutual exclusion is gone.

The design doc (`docs/design/cmd_service.md:112`) advertises this flag as safe (`--wait` and `--timeout` behave as in `rdd service stop`), so users will reach for it.

Fix: either reject `--wait=false` for `svc delete` (delete must always wait, since data removal requires the writer to be gone), or skip the directory removal when `wait=false` (turning the flag into "send shutdown and return"). The first option is simpler and matches what `Delete()` always did before the PR:

```diff
 func newServiceDeleteCommand() *cobra.Command {
     command := &cobra.Command{
         Use:  "delete",
         Long: "Delete RDD control plane",
         RunE: serviceDeleteAction,
     }
-    command.Flags().Bool("wait", true, "Wait for the control plane to stop before deleting")
-    command.Flags().Duration("timeout", defaultWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
+    command.Flags().Duration("timeout", stopWaitTimeout,
+        "Timeout for the embedded shutdown wait (0 means wait indefinitely)")
     return command
 }
```

…and pass `true` unconditionally to `service.Delete`.

---

**I3. `svc start --timeout=30s` is too tight for the readiness path it inherits — `cmd/rdd/service.go:233-234, 268, 293`** *(important, regression)*

The two callers that used to share the 90 s readiness budget now use the 30 s default:

```go
// cmd/rdd/service.go:267-270
logrus.Infof("waiting for %q control plane to be ready", instance.Name())
ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
defer cancel()
return timeoutError(service.Wait(ctx))
```

The replaced `WaitWithTimeout` carried this comment: `90*time.Second // Increased timeout for CRD establishment`. CRD establishment, controller manager registration, and discovery ConfigMap publication can legitimately take more than 30 s on a cold start, especially in CI on slower runners (the existing BATS suites in `20-service/` exercise exactly this path with `run -0 rdd svc start`). A user-visible exit-4 on a perfectly healthy startup is a bigger UX regression than a longer default wait.

Note also that `ensureServiceRunning` at lines 97 and 104 still hardcodes `90 * time.Second` for the same readiness check — consistent with the prior behavior but now contradicting the new 30 s default in `svc start`. Either both paths are right at 90 s, or both are wrong at 30 s; the inconsistency is itself evidence the 30 s default is too aggressive.

Fix: lift `svc start`'s default to match the `ensureServiceRunning` budget (and use a single named constant for both):

```diff
-const defaultWaitTimeout = 30 * time.Second
+const startWaitTimeout = 90 * time.Second  // covers CRD establishment + controller manager registration
```

…then use it both in `newServiceStartCommand` and in `ensureServiceRunning`.

---

**I4. `limavm delete` default flipped from no-wait to wait, with a 5-minute default — undocumented behavior break — `cmd/rdd/limavm.go:253-254`** *(important, regression)*

```go
// cmd/rdd/limavm.go:253-254
command.Flags().BoolVar(&wait, "wait", true, "Wait for the VM to be deleted before returning")
command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
```

Previously this defaulted to `--wait=false, --timeout=0`. Any external script that runs `rdd limavm delete vm && rdd limavm create vm ...` without `--wait=false` will now block up to 5 minutes (whereas it returned in milliseconds before). The commit subject is "standardize --wait and --timeout" but this is a behavior change, not just a flag rename. The PR description does mention the new defaults, but `bats/tests/33-lima-controllers/limavm-cli.bats:129, 196` already pass `rdd limavm delete <name>` without `--wait=false`, then call `rdd ctl wait --for=delete` immediately after — under the new default, the inner `delete --wait` and the outer `ctl wait --for=delete` are now redundant (the test still passes, but the second wait is dead).

Fix is partly editorial (call out the behavior change in the commit body and design doc), partly testing — see I5.

---

**I5. New exit-code-4 contract is asserted in only one place — `bats/tests/33-lima-controllers/limavm-cli.bats:208`** *(important, gap)*

The PR introduces a CLI-wide contract: every wait-capable subcommand exits with code 4 on `--timeout` expiry. The only test asserting this is the existing `lima start --timeout` test, mechanically updated from `run -1` to `run -4`. None of the following are exercised:

- `rdd svc start --wait --timeout=...` (uses `timeoutError` at `cmd/rdd/service.go:294`)
- `rdd svc stop --wait --timeout=...` (uses `timeoutError` at `cmd/rdd/service.go:326`)
- `rdd svc delete --wait --timeout=...` (uses `timeoutError` at `cmd/rdd/service.go:372`)
- `rdd limavm stop|restart|delete|create --start --timeout=...` (uses `timeoutError` at `cmd/rdd/limavm.go:235, 415, 455, 556`)
- `--timeout=0` (indefinite wait) on any of the above

A regression where any of these paths returns the underlying error directly instead of the wrapped `cliexit.Timeout` would not be caught.

Fix: add at minimum one test per subcommand verifying the exit code on a forced timeout. The `svc start --timeout` case is the easiest to provoke: run the test against an instance whose controllers can't reach ready (`--controllers=app` against a broken App template, or invoke after artificially hanging the discovery ConfigMap).

---

### Suggestions

**S1. Removed comment explained the constant — restore the rationale at the call site — `cmd/rdd/service.go:32-34`** *(suggestion, regression)*

```go
// defaultWaitTimeout is the default --timeout for subcommands whose wait
// normally settles within seconds.
const defaultWaitTimeout = 30 * time.Second
```

The previous code carried a sizing comment in `StopWithWait` explaining why `60s` was the right number ("service needs up to 30s for graceful hostagent shutdown plus overhead"). After this PR the constant is shared but the rationale is gone. Whoever next tunes this number won't know which path it has to cover. Tie the constant's value to a concrete invariant in the comment (e.g., "must exceed `Run`'s 45 s controller-manager drain in `pkg/service/cmd/service.go`"). This goes hand-in-hand with I1.

---

**S2. `serviceDeleteAction` logs "control plane has been deleted" even when the directory removal raced a still-running serve — `cmd/rdd/service.go:374`** *(suggestion, regression)*

```go
if err := service.Delete(wait, timeout); err != nil {
    return timeoutError(err)
}
logrus.Infof("%q control plane has been deleted", instance.Name())
return nil
```

If I2 is fixed by always waiting on delete, this is moot. If `--wait=false` stays, the log message lies — the process is mid-shutdown and the directory may be partially gone. Replace with `"%q control plane delete initiated"` when `wait=false`.

---

**S3. Doc still says "Sends SIGTERM" but Unix path actually sends SIGINT — `docs/design/cmd_service.md:104`** *(suggestion, gap)*

```
### `rdd service stop`

Sends SIGTERM to the control plane process (`rdd.pid`) and waits up to 30s for it to exit.
```

`pkg/util/process/process_unix.go:26-28`:
```go
func Interrupt(pid int) error {
    return unix.Kill(pid, unix.SIGINT)
}
```

`StopWithWait` calls `process.Interrupt(pid)` first (`pkg/service/cmd/service.go:358`) and only falls back to `process.Kill` (which *does* send SIGTERM) if Interrupt fails. The doc was wrong before this PR but was edited in the PR — a one-word fix would have closed it:

```diff
-Sends SIGTERM to the control plane process (`rdd.pid`) and waits up to 30s for it to exit.
+Sends SIGINT to the control plane process (`rdd.pid`) and waits up to 30s for it to exit. On Windows this is delivered as `CTRL_BREAK_EVENT`.
```

---

**S4. `timeoutError` doc comment has a small typo — `cmd/rdd/service.go:36-44`** *(suggestion, regression)*

```go
// timeoutError wraps a deadline-expiry err with [cliexit.Timeout] so the CLI
// exits with [cliexit.CodeTimeout], matching `rdd set`. Other errors pass
// through unchanged.
func timeoutError(err error) error {
```

"deadline-expiry err" reads as a stray abbreviation in a place where you spelled it out twice ("error" elsewhere). Read better:

```diff
-// timeoutError wraps a deadline-expiry err with [cliexit.Timeout] so the CLI
+// timeoutError wraps a deadline-expiry error with [cliexit.Timeout] so the CLI
 // exits with [cliexit.CodeTimeout], matching `rdd set`. Other errors pass
 // through unchanged.
```

Note also that the function is unexported but `serviceStartAction` calls `timeoutError(startAndWaitForReady(...))` after `startAndWaitForReady` already received the same `timeout` and ran its own `service.Wait(ctx)` whose `ctx.Err()` is `DeadlineExceeded` — so two layers wrap the same error path. Not a bug (`errors.Is` traverses both), but it adds a step for the next reader.

---

**S5. `service.Stop()` shim was the only backward-compat surface — its removal is correct but unmentioned in the commit body** *(suggestion, gap)*

```go
// Stop the service and wait for the process to exit.
func Stop() error {
    // For backward compatibility, always wait
    return StopWithWait(true)
}
```

The comment said "for backward compatibility" but, as `grep -rn "service\.Stop\b"` confirms, the only caller was the in-tree CLI. Removing the wrapper is correct but the commit message should say so explicitly so a future reader doesn't think the public surface was intentionally narrowed for an external consumer.

## Design Observations

### Concerns

- **Single `defaultWaitTimeout` collapses three distinct sizing constraints — *(in-scope)*.** `svc start` has to cover CRD establishment + controller-manager registration (~tens of seconds in CI), `svc stop` has to cover the serve subprocess's own 45 s controller-manager drain, and `svc delete` has to cover stop-then-cleanup. They legitimately have *different* right answers; the PR collapses them into one 30 s constant for ergonomic uniformity, then has to make `limavm` defaults a separate 5-minute constant anyway. If you keep three named constants (`startReadyTimeout`, `stopShutdownTimeout`, `limaLongWaitTimeout`), each can be sized to the actual underlying ceiling and the comment can name the invariant. This is the change implied by I1+I3.

- **`--wait=false` on `svc delete` is structurally unsafe — *(in-scope)*.** As I2 documents, removing `instance.Dir()` while the writer still holds it is a footgun on every OS. Rather than papering over it with a longer timeout or a doc warning, the cleanest design is: `svc delete` must wait for the process to exit before it removes anything. `--timeout` still tunes that wait, but `--wait=false` for delete simply doesn't make sense — it would be "asynchronously corrupt my data store."

### Strengths

- Routing deadline expiry through `errors.Is(err, context.DeadlineExceeded)` (rather than a sentinel string match) is the right call; the wrapping in `fmt.Errorf("...%w", err)` propagates correctly through both layers (`limavm.go` and `service.go`).
- `watchtools.ContextWithOptionalTimeout` for the `timeout=0 → no deadline` semantics is a small thing but it's the right primitive — every wait site uses it consistently, so a future reviewer can grep one helper instead of auditing six independent timer constructions.
- `StopWithWait` correctly switches from `time.After(60 * time.Second)` to `<-ctx.Done()`, which means the wait can now be cancelled by the parent context (a Ctrl-C during a long stop). The previous code blocked the goroutine until the 60 s timer fired regardless of the user's signal.

## Testing Assessment

1. **No test asserts exit code 4 for any svc subcommand.** The PR's stated contract is uniform across `svc start/stop/delete` and `limavm create/start/stop/restart/delete`, but the only updated assertion is for `limavm start`. (See I5.) A regression that returns the wrapped error without the `cliexit.Timeout` envelope ships green.
2. **No test asserts `--timeout=0` semantics.** The "0 means wait indefinitely" branch in `watchtools.ContextWithOptionalTimeout` is exercised neither by `svc` nor by `limavm` tests.
3. **No test exercises `svc delete --wait=false`.** Combined with I2, this means a release shipping the new flag would have zero coverage of its most dangerous mode.
4. **No test exercises `svc stop` against a live hostagent**, which is exactly the scenario where the new 30 s default (I1) starts to matter. `bats/tests/20-service/` covers `svc stop` only against a service that isn't running anything heavy.
5. **The pre-existing TODO at `bats/tests/20-service/2-create.bats:3` (`# TODO: test rdd svc start --wait=false`) is now newly relevant** because the PR formalizes that flag's semantics, but the TODO is untouched.

## Documentation Assessment

- `docs/design/cmd_service.md` was updated for the flag rename but still misstates the signal name (S3) and doesn't mention that `svc delete --wait=false` is hazardous (I2).
- Neither the design doc nor the help text explains the relationship between the client-side timeout and the serve subprocess's own 45 s controller-manager drain, which is the actual sizing constraint for `svc stop` (I1, S1).
- The flag default change for `limavm delete` (I4) deserves a one-line mention in the commit body and ideally a brief note in `docs/design/api_lima.md` (or wherever the limavm CLI surface is documented; the PR doesn't update an equivalent doc).
- Comment at `cmd/rdd/service.go:107-114` correctly updated for the timeout parameter, but it still says "the API server readiness poll and the ConfigMap freshness poll share the single deadline" — that's true only when `timeout > 0`. For `timeout = 0` the share-the-deadline framing doesn't apply because there is no deadline. Minor.

## Commit Structure

Single commit, single concept, message accurately describes the change. No concerns.

## Acknowledged Limitations

- `pkg/service/cmd/service.go:444-449` (`preserveAllInstanceLogs`) carries a comment about Windows `os.Rename` requiring `FILE_SHARE_DELETE` and the failure mode when non-Go processes hold a lock. This limitation becomes more likely to fire with `svc delete --wait=false` (I2), but the existing comment already names it. Not flagged as a separate finding.
- `cmd/rdd/limavm.go:39-41` (`limaLongWaitTimeout = 5 * time.Minute`) carries a comment justifying the long default for VM boot/shutdown. Consistent with the documented operation cost.

## Coverage Summary

| File | Status |
| --- | --- |
| `bats/tests/33-lima-controllers/limavm-cli.bats` | I5 (insufficient coverage of new exit-code-4 contract). |
| `cmd/rdd/limavm.go` | I4 (default flip on `limavm delete`); otherwise reviewed, no further issues. |
| `cmd/rdd/service.go` | I1, I2, I3, S1, S2, S4 — main surface of the PR's behavior changes. |
| `docs/design/cmd_service.md` | S3 (SIGTERM/SIGINT mismatch); doc updates for the flag rename otherwise correct. |
| `pkg/service/cmd/service.go` | I1, I2, S5 — `StopWithWait` and `Delete` rewrite, plus removal of `Stop`/`WaitWithTimeout`. |
---

### [Codex GPT 5.4]

### Executive Summary

This PR standardizes `--wait` / `--timeout` across `rdd svc` and `rdd limavm`, changes timeout expiry to exit code 4, and updates part of the service documentation. The main regressions are in the new service shutdown semantics: the new default `30s` timeout is shorter than the daemon’s own graceful-shutdown budget, and `svc delete --wait=false` can now remove the instance state before the daemon has actually exited. The Lima CLI docs also remain out of sync with the new defaults.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **`svc stop` / `svc delete` now default to a timeout shorter than the daemon’s own graceful-shutdown budget** — `cmd/rdd/service.go:32-34` (important, regression)

```go
// defaultWaitTimeout is the default --timeout for subcommands whose wait
// normally settles within seconds.
const defaultWaitTimeout = 30 * time.Second
```

`cmd/rdd/service.go:34` sets the shared default timeout to 30s, and that value is wired into `svc stop` at `cmd/rdd/service.go:343` and `svc delete` at `cmd/rdd/service.go:354`. That budget is lower than the shutdown path the daemon itself advertises: Lima hostagents are allowed 30s for graceful exit at `pkg/controllers/lima/limavm/controllers/limavm_controller.go:84-86`, and the service still allows 45s total for controller-manager shutdown at `pkg/service/cmd/service.go:815-827`. When the new 30s CLI deadline fires, `StopWithWait` force-kills the service at `pkg/service/cmd/service.go:376-385`. In practice, a healthy shutdown with active Lima VMs can now turn into an avoidable timeout/forced kill under the default flags.

Fix: give `svc stop` / `svc delete` a default timeout that is at least as large as the service’s own graceful-shutdown budget (45s minimum; the previous 60s was safer), or stop sharing one “seconds-scale” default across commands with different teardown costs.

I2. **`svc delete --wait=false` can delete the instance directory while the daemon is still alive** — `pkg/service/cmd/service.go:424-438` (important, regression)

```go
if Running() {
	if err := StopWithWait(wait, timeout); err != nil {
		return err
	}
}
preserveAllInstanceLogs()
if os.Getenv("RDD_KEEP_LOGS") == "" {
	_ = os.RemoveAll(instance.LogDir())
}
```

`cmd/rdd/service.go:353-354` now exposes `--wait` / `--timeout` on `svc delete`, and `cmd/rdd/service.go:371` passes that choice through to `service.Delete()`. When `wait=false`, `StopWithWait` returns immediately after signaling the process (`pkg/service/cmd/service.go:358-395`), but `Delete()` still goes on to remove `instance.ShortDir()` and `instance.Dir()` at `pkg/service/cmd/service.go:433-438`. Those paths contain `rdd.pid`, `args.json`, and the rest of the control-plane metadata (`pkg/instance/instance.go:46-97`). Because liveness is derived from `rdd.pid` in `pkg/service/cmd/service.go:138-165`, deleting the directory can make a still-shutting-down daemon disappear from the CLI’s view while it still owns ports, files, or hostagent cleanup.

Fix: do not remove persistent state until the PID has actually exited. The simplest safe fix is to reject `svc delete --wait=false` while the service is running, or to ignore `--wait=false` for delete. If non-blocking delete is required, it needs a two-phase design that defers filesystem cleanup until after the daemon is confirmed dead.

#### Suggestions

S1. **`docs/design/cmd_lima.md` still documents the pre-PR Lima defaults** — `docs/design/cmd_lima.md:15-38` (suggestion, regression)

```md
- `--timeout` (default `0`): Maximum time to wait.
...
- `--wait` (default `false`): Wait until the resource is fully deleted before returning.
- `--timeout` (default `0`): Maximum time to wait.
```

The implementation changed the default timeout for `rdd limavm create/start/stop/restart/delete` to 5m at `cmd/rdd/limavm.go:150`, `cmd/rdd/limavm.go:166`, `cmd/rdd/limavm.go:182`, `cmd/rdd/limavm.go:199`, and `cmd/rdd/limavm.go:254`, and changed `rdd limavm delete` to default `--wait=true` at `cmd/rdd/limavm.go:253`. `docs/design/cmd_lima.md:17`, `:24`, `:31`, `:37`, and `:38` still say the timeout default is `0` and delete defaults to `--wait=false`, so the design docs now describe different blocking behavior than the CLI.

Fix: update `docs/design/cmd_lima.md` to match the new defaults and timeout semantics, the same way `docs/design/cmd_service.md` was updated in this PR.

### Design Observations

**Concerns**

1. `svc delete --wait=false` mixes two separate responsibilities: requesting process shutdown and deleting durable state. Keeping deletion synchronous, or moving cleanup behind an explicit “pending deletion” state, would eliminate the entire class of “live daemon, missing metadata” failures. (in-scope)

**Strengths**

1. Routing timeout classification through `timeoutError()` at the CLI boundary keeps `cliexit` concerns out of the lower-level service package and makes the exit-code policy consistent across commands.

### Testing Assessment

1. `rdd svc stop --timeout=...` and `rdd svc delete --timeout=...` do not have an explicit timeout-path test. The only exit-code-4 coverage added here is `bats/tests/33-lima-controllers/limavm-cli.bats:200-209`.
2. `rdd svc delete --wait=false` with a running control plane is untested. There is no test asserting that state is preserved until the daemon exits, or that the flag is rejected.
3. The new Lima default values are not covered by any help-text or docs-consistency test, which is how the stale `docs/design/cmd_lima.md` drifted through.

### Documentation Assessment

`docs/design/cmd_service.md` was updated coherently, but the parallel Lima command reference was missed entirely; see S1.

### Acknowledged Limitations

- **Code comment**: `# TODO: test \`rdd svc start --wait=false\`` — `bats/tests/20-service/2-create.bats:3`. This PR expands the `svc` wait/timeout surface, so the missing service CLI coverage matters more now, but the gap is already acknowledged in-tree.

### Coverage Summary

- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues
- `cmd/rdd/limavm.go` — S1
- `cmd/rdd/service.go` — I1, I2
- `docs/design/cmd_service.md` — Reviewed, no issues
- `pkg/service/cmd/service.go` — I1, I2

---

### [Gemini 3.1 Pro]

### Executive Summary

This PR standardizes the `--wait` and `--timeout` flags across all wait-capable subcommands in the CLI, routing deadline expirations to a unified `cliexit.CodeTimeout` (exit code 4) to match `rdd set`. However, the implementation introduces two severe regressions: it allows `rdd svc delete` to proceed to file deletion while the service is still running, which causes data corruption, and it shrinks the default timeouts for `svc start` and `svc stop` below the known minimum bounds required for CRD establishment and graceful hostagent shutdown. 

### Findings

#### Critical Issues

C1. **Data corruption on `svc delete --wait=false`** — `pkg/service/cmd/service.go:423-435`, `cmd/rdd/service.go:343-375`, `docs/design/cmd_service.md:106` (critical, regression)

```go
func Delete(wait bool, timeout time.Duration) error {
	// ...
	if Running() {
		if err := StopWithWait(wait, timeout); err != nil {
			return err
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.Dir())
	return os.RemoveAll(instance.ShortDir())
}
```

By parameterizing `Delete` with `wait`, the CLI permits the user to bypass the shutdown wait. If a user invokes `rdd svc delete --wait=false` (as explicitly documented as valid in `cmd_service.md`), `StopWithWait` sends the SIGTERM signal and returns `nil` immediately. `Delete` then instantly proceeds to invoke `os.RemoveAll` on the instance directories while the service process is still actively running. 

On Windows, this deterministically fails with a sharing violation because the running process holds open handles to its SQLite database and klog files, leaving the directories partially deleted and corrupted. On Unix, it unlinks the files out from under the process, meaning the shutting-down process continues to write to orphaned data that prevents disk space reclamation until it finally exits, while the CLI falsely reports success.

Fix: Data deletion inherently requires the process to be fully stopped first. `Delete` must always enforce a wait for the shutdown phase. Either override the `wait` parameter internally or reject `--wait=false` at the CLI boundary.

```diff
 	if Running() {
-		if err := StopWithWait(wait, timeout); err != nil {
+		// Data deletion requires the process to be fully stopped.
+		if err := StopWithWait(true, timeout); err != nil {
 			return err
 		}
 	}
```

#### Important Issues

I1. **Timeout regressions for service start and stop** — `cmd/rdd/service.go:234`, `cmd/rdd/service.go:343` (important, regression)

```diff
-	ctx, cancel := context.WithTimeout(ctx, 90*time.Second) // Increased timeout for CRD establishment
...
-		// Wait for the process to actually terminate. The service needs up to
-		// 30s for graceful hostagent shutdown plus overhead, so allow 60s total.
-		timeout := time.After(60 * time.Second)
```

The PR standardizes both `svc start` and `svc stop` to use `defaultWaitTimeout` (30s). However, the previous hardcoded timeouts were deliberately higher and documented their constraints. The deleted `WaitWithTimeout` function explicitly noted that 90s was the `// Increased timeout for CRD establishment`. Similarly, the old `StopWithWait` allotted 60s because `// The service needs up to 30s for graceful hostagent shutdown plus overhead`. 

Shrinking the `svc stop` timeout to 30s guarantees that the CLI will frequently expire the deadline before hostagents can shut down, triggering an ungraceful `process.Kill(pid)`. Shrinking the `svc start` timeout to 30s will cause spurious exit-code-4 failures on slow machines while the control plane is still initializing normally.

Fix: Do not use a single 30s default for all service commands. Restore the required minimums by customizing the default flag values.

```diff
 func newServiceStartCommand() *cobra.Command {
 	// ...
-	command.Flags().Duration("timeout", defaultWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
+	command.Flags().Duration("timeout", 90*time.Second, "Timeout for --wait (0 means wait indefinitely)")
 	// ...
 }

 func newServiceStopCommand() *cobra.Command {
 	// ...
-	command.Flags().Duration("timeout", defaultWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
+	command.Flags().Duration("timeout", 60*time.Second, "Timeout for --wait (0 means wait indefinitely)")
 	// ...
 }
```

#### Suggestions

S1. **Mismatched timeout in PR description vs code for `limavm create`** — `cmd/rdd/limavm.go:150` (suggestion, gap)

The PR description categorizes `limavm create` under the 30s timeout bucket. However, the code correctly assigns it `limaLongWaitTimeout` (5m) at line 150 because `--start` wait times include the VM boot cycle. Ensure release notes or documentation generated from the commit message reflect the actual 5-minute timeout.

### Design Observations

None.

### Testing Assessment

1. **Delete Wait Semantics**: `rdd svc delete --wait=false` is completely untested. An integration test that invokes this and then attempts to restart the control plane would likely catch the corrupted state left behind on Windows. 
2. **Timeout Boundaries**: The test suite covers timeout firing (e.g., `limavm-cli.bats` using a non-functional VM to force a timeout) but does not assert that normal startup and shutdown can consistently succeed within the newly reduced 30s bounds under load.

### Documentation Assessment

The update to `docs/design/cmd_service.md` accurately describes the new flag behavior, but documenting `--wait=false` for `svc delete` creates a dangerous expectation. If C1 is fixed by overriding `--wait=false`, the documentation should explicitly clarify that `svc delete` will always block until the service is stopped to ensure safe data removal.

### Acknowledged Limitations

None.

### Coverage Summary

- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues
- `cmd/rdd/limavm.go` — Reviewed, no issues
- `cmd/rdd/service.go` — C1, I1
- `docs/design/cmd_service.md` — C1
- `pkg/service/cmd/service.go` — C1, I1

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-21 |
| **Commit** | `8aa55e4 cli: address deep-review findings for #336` |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: `svc delete --wait=false` removes the instance directory while the serve subprocess is still alive | Fixed |
| 2 | Important #2: Shared `defaultWaitTimeout = 30s` is below the service's own graceful-shutdown budget and below the previous start-readiness budget | Fixed |
| 3 | Important #3: Exit-code-4 contract is asserted in exactly one place | Tests written |
| 4 | Suggestion #1: `docs/design/cmd_lima.md` still documents the pre-PR defaults | Documentation updated |
| 5 | Suggestion #2: `defaultWaitTimeout` constant has no rationale | Auto-resolved by I2 |
| 6 | Suggestion #3: `docs/design/cmd_service.md` says "Sends SIGTERM" but the Unix path sends SIGINT | Auto-resolved by I2 |
| 7 | Suggestion #4: `limavm delete` default flipped from `--wait=false` to `--wait=true` with a 5-minute default | Tests cleaned up |
| 8 | Suggestion #5: `svc delete` log message lies when deletion races a still-running serve | Auto-resolved by I1 |
| 9 | Suggestion #6: Typo "err" → "error" in `timeoutError` comment | Fixed |
| 10 | Suggestion #7: `service.Stop()` removal not called out in commit body | Skipped |
| 11 | Suggestion #8: PR description lists `limavm create` under the 30s timeout bucket but code uses 5m | Skipped |
| 12 | Testing Gap #1: `--timeout=0` (indefinite wait) not exercised | Skipped |
| 13 | Testing Gap #2: `svc delete --wait=false` untested | N/A (flag removed by I1) |
| 14 | Testing Gap #3: `svc stop` against a live hostagent not covered | Skipped |
| 15 | Testing Gap #4: pre-existing TODO at `bats/tests/20-service/2-create.bats:3` | Skipped |
| 16 | Documentation Gap: `cmd/rdd/service.go:107-114` "share the single deadline" comment | Skipped |
