# Deep Review: 20260422-150402-pr-336

| | |
|---|---|
| **Date** | 2026-04-22 15:04 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 6 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#336](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/336) — cli: standardize --wait and --timeout across svc and limavm |
| **Branch** | `wait-timeout-cleanup` |
| **Commits** | `fca9713` cli: standardize --wait and --timeout across svc and limavm |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — Round 5's three Importants (I1 `watchUntil` silent success on closed watch, I2 `StopWithWait` mis-labeling cancellation as timeout, I3 already-running branch skipping freshness anchor) are all fixed on `fca9713`. Two new Importants surface: (I1) `cliexit.Classify` strips any non-`CodeTimeout` classification when the wrapped error is `context.DeadlineExceeded`, so a future `cliexit.Rejected(ctx.Err())` would silently become exit 4 despite the docstring's "callers may Classify repeatedly" promise; and (I2) `cleanupDiscoveryConfigMap` still deletes the ConfigMap before the stop wait, so a timed-out `svc stop --wait` followed by a quick `svc start` now takes the already-running branch, finds the CM gone, and hangs 90s waiting for a ready annotation the dying service will never write. The new BATS test `6-timeout.bats` exercises exactly this sequence and relies on the service exiting fast enough to take the cold-start path instead. No Critical. |
| **Wall-clock time** | `21 min 57 s` |

## Executive Summary

Round 6 (`fca9713`) closes every Round 5 Important: `watchUntil` / `watchUntilDeleted` now re-`Get` on watch close and distinguish a satisfied predicate from a dropped watch (`cmd/rdd/limavm.go:497-509, 547-563`); `StopWithWait`'s deadline branch now splits on `ctx.Err()`, so `context.Canceled` is no longer reported as a timeout (`pkg/service/cmd/service.go:383-397`); and both `ensureServiceRunning` and `serviceStartAction`'s already-running branches now anchor `waitForFreshDiscoveryConfigMap` on `service.StartTime()` (`cmd/rdd/service.go:98-110, 262-282`). The library-surface refactor — `StopWithWait(ctx, wait, timeout)`, `Delete(ctx, timeout)`, the new `StartTime()` — and the centralization of `context.DeadlineExceeded` → exit-4 routing in `cliexit.Classify` are clean and well-documented.

Two new Importants surface in this round. I1: `cliexit.Classify` (`pkg/cli/exit/exit.go:57-66`) short-circuits only when the existing `*Error` already has `Code == CodeTimeout`. Any future call site that wraps a deadline-bearing error with a different code (`cliexit.Rejected(ctx.Err())`, or a not-yet-defined code) will have its classification silently stripped and replaced with `Timeout`. No current caller produces that combination, so the bug is latent, but the docstring promises "callers may Classify repeatedly" — the fix is a one-line broadening of the short-circuit. I2: `StopWithWait` deletes the discovery ConfigMap before sending the signal (`pkg/service/cmd/service.go:355-356`), and on a short timeout `process.Kill` returns before the process actually exits. A user who runs `rdd svc start` in that window now takes the already-running branch, calls `waitForFreshDiscoveryConfigMap` with the current PID-file mtime as anchor, and waits 90s (or until `startWaitTimeout`) for a ReadyAnnotation that the dying service cannot write. The new test `bats/tests/20-service/6-timeout.bats:22-29` exercises this exact sequence and relies on the service exiting fast enough to take the cold-start path instead.

Ten Suggestions follow. S1 captures Codex's observation that the second-precision freshness anchor (PID-file mtime) could theoretically accept a stale ConfigMap when a prior unclean shutdown lands in the same wall-clock second as the new startup — demoted from Important because the pre-PR already-running branch had no freshness check at all, so Round 6 is still strictly better, and the race window is sub-second. S2 and S3 capture Claude's two demoted Importants: `watchUntil`'s watch-close + `NotFound` fall-through returns a raw apierror instead of the symmetric "deleted while waiting" message, and the `Running()` → `StartTime()` call pair has a theoretical ENOENT race that surfaces as an opaque `stat … no such file or directory` error. Both are one-line fixes. The remaining suggestions cover a missing exit-4 contract test for the shared autostart helpers (`svc config`, `set`, `ctl`), a `wait` local variable that shadows the `k8s.io/apimachinery/pkg/util/wait` import, a comment/predicate mismatch in `Delete`, the BATS `setup_rdd_control_plane` helper's now-5-minute potential hang, a timing dependency in the new 6-timeout test, and two doc/flag-help polish items.

No Critical issues. Claude and Codex split their unique finds: Claude surfaced the `Classify` stripping bug and the `cleanupDiscoveryConfigMap` ordering hazard (independent finds), while Codex traced the shared-helper test gap and the second-precision freshness race. Gemini cleared the entire diff — zero findings — which is consistent with its skip on `git blame` and a diff where all Round-5 Importants are genuinely closed.

---

## Critical Issues

None.

## Important Issues

I1. **`Classify` strips a non-timeout classification when the wrapped error is `context.DeadlineExceeded`** — `pkg/cli/exit/exit.go:57-66` [Claude Opus 4.7]

```go
func Classify(err error) error {
    var exitErr *Error
    if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
        return err
    }
    if errors.Is(err, context.DeadlineExceeded) {
        return Timeout(err)
    }
    return err
}
```

The `errors.As` short-circuit only fires when the existing `*Error` already carries `Code == CodeTimeout`. If any caller ever wraps a deadline-bearing error with a different exit code — `cliexit.Rejected(ctx.Err())`, or any future classification — `Classify` strips the original code and replaces it with `Timeout`. Today no call site produces that combination (`classifyAPIError` at `cmd/rdd/set.go:332-337` only fires on admission rejections, which never wrap `DeadlineExceeded`), so the bug is latent. The docstring at `pkg/cli/exit/exit.go:53-56` promises "callers may Classify repeatedly," which reads as a commitment to idempotency across every code path — a future helper that returns `cliexit.Rejected(ctx.Err())` through `Classify` would silently lose its 3 → 4 (rejected → timeout) classification.

Fix: short-circuit on any already-classified `*Error`, not only on `CodeTimeout`:

```diff
 func Classify(err error) error {
     var exitErr *Error
-    if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
+    if errors.As(err, &exitErr) {
         return err
     }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }
```

*(important, regression — `Classify` is new in this PR.)*

---

I2. **`StopWithWait` deletes the discovery ConfigMap before the wait, so a timed-out stop plus a quick restart hangs the new start 90 s on a missing CM** — `pkg/service/cmd/service.go:355-356`, `cmd/rdd/service.go:92-111, 262-282` [Claude Opus 4.7]

```go
// pkg/service/cmd/service.go
func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
    if !Running() { ... }

    // Clean up discovery configmap while cluster is still accessible
    _ = cleanupDiscoveryConfigMap()

    pid := PID()
    if err := process.Interrupt(pid); err != nil { ... }

    if wait {
        ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
        ...
        for {
            select {
            case <-ctx.Done():
                _ = process.Kill(pid)  // async on Unix; process may still be alive
                ...
                return fmt.Errorf("timed out waiting for ...")
```

`StopWithWait` deletes the discovery ConfigMap (line 356) before sending `SIGINT`, so on a short timeout (`--timeout=1ms`) the CM is gone well before the service exits. `process.Kill` at line 392 is fire-and-forget on Unix; `Running()` may still report the process alive when `StopWithWait` returns. A user who runs `rdd svc start` in that window takes the already-running branch at `cmd/rdd/service.go:262-282`, reads `service.StartTime()`, and calls `waitForFreshDiscoveryConfigMap(ctx, startTime)`. The CM was just deleted and the dying service never recreates it — only a fresh `rdd service serve` re-runs `InitDiscovery` (`pkg/service/cmd/service.go:766-768`). The poll at `cmd/rdd/service.go:156-170` runs until `startWaitTimeout` (90 s) and exits 4.

Before Round 6, the already-running branch used the non-freshness `waitForDiscoveryConfigMap`, which accepted any surviving CM. The Round-6 freshness anchor closes a real staleness gap but promotes this pre-existing CM ordering hazard into a user-visible hang. The new test at `bats/tests/20-service/6-timeout.bats:22-29` exercises precisely this sequence and relies on the service exiting fast enough for `rdd svc start` to take the cold-start path instead of the already-running one.

```bats
# Restart the service; the prior test killed it.
rdd svc start
run -4 rdd svc delete --timeout=1ms
```

Fix options, in increasing scope:

1. Move `cleanupDiscoveryConfigMap()` to the success branch of the wait loop (line 399-402), so a timed-out stop leaves the prior CM in place for the next start to consume. The already-running branch then sees a fresh-enough CM (until the service exits, at which point the cold-start path takes over).
2. On the already-running path, detect a missing ConfigMap via the loopback client and fall through to the cold-start path.
3. Document the limitation: after a timed-out stop, the user must run `rdd svc delete && rdd svc create && rdd svc start`.

*(important, regression — the freshness anchor in the already-running branch is new in this PR and it is what makes the pre-existing CM ordering hazard observable on the CLI.)*


## Suggestions

S1. **Second-precision freshness anchor accepts a stale ConfigMap when a prior crash lands in the same wall-clock second as the new startup** — `cmd/rdd/service.go:99-108, 267-278`, `pkg/service/cmd/service.go:172-178` [Codex GPT 5.4]

```go
// pkg/service/cmd/service.go
func StartTime() (time.Time, error) {
    fi, err := os.Stat(instance.PIDFile())
    ...
    return fi.ModTime().Truncate(time.Second), nil
}

// cmd/rdd/service.go
if cm.CreationTimestamp.Time.Before(beforeStart) {
    return false, nil // Stale ConfigMap from a previous run; wait for the new one.
}
```

Both the cold-start path (`beforeStart := time.Now().Truncate(time.Second)` at `cmd/rdd/service.go:125`) and the new already-running path (`service.StartTime()` at `:99, :267`) anchor freshness on a second-truncated timestamp. The predicate `CreationTimestamp.Before(beforeStart)` accepts a stale CM whose CT and the new startup fall in the same wall-clock second. The race requires a prior unclean shutdown that left a `ReadyAnnotation=true` CM behind, a new service startup whose PID-file mtime lands in the same wall-clock second as that stale CT, and the caller entering the already-running branch before `InitDiscovery` deletes and recreates the CM. The window is sub-second.

Demoted from Codex's Important: pre-Round-6 the already-running branch skipped the freshness check entirely and would have accepted any surviving CM unconditionally, so Round 6 is strictly better. The cold-start path has carried this second-precision limit since before the PR. A durable fix replaces the timestamp anchor with a startup nonce copied from `InitDiscovery` into the CM and matched in `waitForFreshDiscoveryConfigMap`, and would also let the docs at `docs/design/cmd_service.md` describe readiness identity without hedging on timestamp precision.

*(suggestion, gap — pre-existing precision limit; Round 6 extends it to a new path.)*

---

S2. **`watchUntil` returns raw `NotFound` on watch-close fall-through, asymmetric with the in-loop "deleted while waiting" message** — `cmd/rdd/limavm.go:497-509` [Claude Opus 4.7]

```go
// ResultChan closed. If ctx ended, return its error. Otherwise the
// watch dropped (server close, proxy timeout, watch expiry); re-read
// and re-evaluate to distinguish a satisfied predicate from a drop.
if err := ctx.Err(); err != nil {
    return err
}
if err := c.Get(ctx, key, &vm); err != nil {
    return err
}
```

When a `Deleted` event arrives on the live channel, line 489-490 returns the descriptive `"LimaVM %q was deleted while waiting"`. When the channel closes before that event lands and the resource is gone, the fall-through `c.Get` at line 503 returns an unwrapped `apierrors.NewNotFound(...)`; the user sees `"limavms.lima.rancherdesktop.io \"foo\" not found"` instead of the consistent message. Same asymmetry applies to terminal-state predicates — `limavm start --wait` against a VM deleted mid-wait reports a stale `NotFound` that does not tell the user why the wait ended.

Demoted from Claude's Important: the fall-through does surface an error rather than report a silent success, so the Round-5 regression class is closed. The remaining gap is one of message symmetry.

Fix:

```diff
     if err := c.Get(ctx, key, &vm); err != nil {
+        if apierrors.IsNotFound(err) {
+            return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
+        }
         return err
     }
```

*(suggestion, regression — the fall-through code is new in Round 6.)*

---

S3. **`Running()` → `StartTime()` race surfaces as an opaque `stat … no such file or directory` error** — `cmd/rdd/service.go:98-102, 266-270`, `pkg/service/cmd/service.go:139-162, 172-178` [Claude Opus 4.7]

```go
if service.Running() {
    startTime, err := service.StartTime()
    if err != nil {
        return err
    }
```

`Running()` walks `PID()`, which removes the PID file when the process no longer responds to signal 0 (`pkg/service/cmd/service.go:158`). A concurrent CLI invocation that passes between our `Running()` and `StartTime()` can remove the file, so `os.Stat(PIDFile)` in `StartTime()` returns `ENOENT` and the user sees the bare stat error with no context.

Demoted from Claude's Important: the only path that removes the file is another `PID()` call, which requires a second process, and the gap is microseconds. Still worth handling: on `os.IsNotExist`, fall through to the cold-start path instead of propagating the raw error.

```diff
     startTime, err := service.StartTime()
-    if err != nil {
+    if os.IsNotExist(err) {
+        // Service exited between Running() and StartTime(); treat as cold start.
+        return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
+    } else if err != nil {
         return err
     }
```

*(suggestion, regression — `StartTime` is new; prior paths had no such dependency.)*

---

S4. **No exit-4 BATS coverage for the shared autostart helpers (`svc config`, `set`, `ctl`)** — `bats/tests/20-service/6-timeout.bats:1-37`, `cmd/rdd/service.go:92-110` [Codex GPT 5.4]

The new `6-timeout.bats` asserts exit 4 for `svc stop`, `svc start`, and `svc delete`, and `limavm-cli.bats` covers `lima create --start`, `lima start`, and `lima restart`. The same timeout contract now flows through `ensureServiceRunning` for `rdd svc config`, `rdd set`, and `rdd ctl` (at `cmd/rdd/kubectl.go`), but no test asserts exit 4 on any of those call sites. A caller-specific regression that unwraps `*Error` before returning from the autostart path would ship green.

Fix: add one `run -4` case per representative shared-helper caller — e.g. `rdd svc config` and `rdd set running=false` against a deliberately unready control plane.

*(suggestion, gap.)*

---

S5. **Local `wait` bool shadows the imported `k8s.io/apimachinery/pkg/util/wait` package in `serviceStartAction` and `serviceStopAction`** — `cmd/rdd/service.go:19, 247, 327` [Claude Opus 4.7]

```go
import "k8s.io/apimachinery/pkg/util/wait"

func serviceStartAction(cmd *cobra.Command, args []string) error {
    wait, err := cmd.Flags().GetBool("wait")
```

Neither function references `wait.PollUntilContextCancel`, so no bug lands today. A future edit that adds a `wait.Poll…` call in either function would compile-fail on the shadow and the fix would be obscure. Renaming the local to `waitFlag` (or hoisting the branch selection into a helper) removes the trap. The shadow is pre-existing in `serviceStopAction`; this PR introduces the same shadow to `serviceStartAction`.

*(suggestion, gap.)*

---

S6. **`Delete`'s comment claims it suppresses only the outer/inner `Running()` race, but the predicate also absorbs signal-delivery failures and `context.Canceled`** — `pkg/service/cmd/service.go:443-453` [Claude Opus 4.7]

```go
if Running() {
    if err := StopWithWait(ctx, true, timeout); err != nil {
        // Suppress only the outer/inner Running() race, where the
        // process exits between Delete's check and StopWithWait's.
        // Deadline expiry must surface so the CLI exits 4 and leaves
        // the directory in place, per docs/design/cmd_service.md.
        if errors.Is(err, context.DeadlineExceeded) || Running() {
            return err
        }
    }
}
```

The predicate `errors.Is(err, context.DeadlineExceeded) || Running()` suppresses any `StopWithWait` error whose post-call `Running()` returns false — including the `process.Interrupt` / `process.Kill` failure branch at lines 370-373 and the `context.Canceled` branch at lines 393-397. The `context.Canceled` path is unreachable today because the CLI uses `context.Background()` for RunE, but a future commit that wires `SIGINT` into the CLI context would silently proceed to `RemoveAll(instance.Dir())` on Ctrl-C. Either tighten the predicate to `errors.Is(err, ErrAlreadyStopped)` (once `StopWithWait` returns a typed sentinel for that race), or expand the comment to enumerate every case the predicate absorbs.

*(suggestion, regression.)*

---

S7. **`setup_rdd_control_plane` can now block up to 5 minutes when a prior test leaves the service stuck** — `bats/helpers/controller.bash:75-81` [Claude Opus 4.7]

```bash
setup_rdd_control_plane() {
    local controllers=${1:-"*"}

    rdd svc delete
    rdd svc create --controllers="${controllers}"
    rdd svc start
}
```

Pre-PR, `rdd svc delete` waited 60 s and always removed the directory (old `Stop()` errors were ignored). After this PR, `delete` waits up to 5 minutes (`stopWaitTimeout`) and fails with exit 4 if the service does not exit in time, leaving the directory in place. A test file whose prior run left a hostagent in a pathological state would now hang the helper for 5 minutes and then fail, rather than recovering.

Fix: bound the helper with a shorter timeout and tolerate a failed delete so a partial teardown still lets `create` run.

```diff
-    rdd svc delete
+    rdd svc delete --timeout=120s || :
```

*(suggestion, gap — pre-existing helper, behaviour shift introduced by this PR.)*

---

S8. **`6-timeout.bats` Test 2 depends on the dying service exiting before the next `rdd svc start` evaluates `Running()`** — `bats/tests/20-service/6-timeout.bats:22-29` [Claude Opus 4.7]

```bats
@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Restart the service; the prior test killed it.
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    ...
}
```

Test 1 sends `SIGINT` then `SIGTERM` via `process.Kill` and returns. `process.Kill` is async on Unix: the service may still report `Running() == true` when Test 2's `rdd svc start` runs, and Test 2 then takes the already-running branch — see I2 for why that can hang 90 s on a missing CM. Test 3 already resets with `rdd svc delete || :; rdd svc create; rdd svc start`; applying the same reset before Test 2's `rdd svc start` (or a `wait_for_dead` polling helper) removes the timing dependency.

*(suggestion, regression — new test.)*

---

S9. **`--timeout` help strings disagree between `rdd set` and every other wait-capable command** — `cmd/rdd/limavm.go:151, 167, 183, 200, 255`, `cmd/rdd/service.go:234, 354, 364`, `cmd/rdd/set.go:92-93` [Claude Opus 4.7]

The PR's new flags say `"Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)"`. The pre-existing `rdd set --timeout` says `"Timeout for waiting (0 to wait indefinitely)"`. The PR's stated goal is to standardize this surface — aligning `set`'s help string in this PR (or as a targeted follow-up) keeps `--help` output uniform. The hardcoded `300*time.Second` at `cmd/rdd/set.go:92` also duplicates `limaLongWaitTimeout = 5 * time.Minute`; promoting a single named constant prevents drift.

*(suggestion, enhancement.)*

---

S10. **`limavm create`'s `--timeout` help omits the `--start` dependency** — `cmd/rdd/limavm.go:150-151, 409-419` [Claude Opus 4.7]

```go
command.Flags().BoolVar(&wait, "wait", true, "Wait for the operation to complete (only applies with --start)")
command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
```

`--wait` mentions "only applies with --start"; `--timeout`'s help mentions only `--wait=false`. The wait branch at line 409 gates on `start && wait && !dryRun`, so `--timeout` is also a no-op without `--start`. Mention `--start` in the timeout help on `create` for parity with `--wait`.

*(suggestion, enhancement.)*


## Design Observations

**Strengths**

- `watchtools.ContextWithOptionalTimeout` replaces the bespoke `time.After` channel in `StopWithWait`, unifying cancellation and timeout on the same primitive and inheriting client-go's "0 = forever" idiom. `(in-scope)` [Claude Opus 4.7]
- The `watchUntil` / `watchUntilDeleted` fall-through — re-`Get` after channel close, distinguishing a satisfied predicate from a dropped watch — closes Round 5's silent-success regression. `(in-scope)` [Gemini 3.1 Pro]
- Centralizing `context.DeadlineExceeded` → exit 4 in `cliexit.Classify` keeps exit-code policy at the CLI boundary instead of scattering per-command checks. `(in-scope)` [Codex GPT 5.4] [Gemini 3.1 Pro]
- Anchoring `waitForFreshDiscoveryConfigMap` to `service.StartTime()` in the already-running path closes Round 5's staleness gap and makes the already-running and cold-start paths use the same predicate. `(in-scope)` [Claude Opus 4.7]
- `Delete(ctx, timeout)` preserves `instance.Dir()` on a timed-out stop, per the documented Windows directory-corruption and Unix PID-file mutual-exclusion hazards. `(in-scope)` [Gemini 3.1 Pro]

**Concerns**

- `Delete` leaks `StopWithWait`'s internal error vocabulary into the caller via `errors.Is(err, context.DeadlineExceeded) || Running()`. A typed sentinel from `StopWithWait` — or returning a `*cliexit.Error` directly — would let `Delete` express its contract in one check and also retire I1's double-classification risk by construction. `(future)` [Claude Opus 4.7]
- Readiness freshness models identity across two storage systems (filesystem `mtime` and Kubernetes `CreationTimestamp`). An explicit startup nonce copied from `InitDiscovery` into the ConfigMap, and matched in `waitForFreshDiscoveryConfigMap`, would remove clock-precision coupling and let the docs describe readiness identity exactly rather than via timestamp ordering. `(future)` [Codex GPT 5.4]
- `service.Start(ctx, …)` launches the serve subprocess via `exec.CommandContext(ctx, …)` where `ctx` is the CLI command context. A SIGINT during a long `limavm start --wait` would cancel that context and kill the just-started daemon; background daemons typically need `context.Background()` (or a detached context) so parent cancellation does not tear them down. Unchanged by this PR. `(future)` [Gemini 3.1 Pro]

## Testing Assessment

New exit-code-4 coverage hits `svc stop`, `svc delete`, `svc start`, `lima create --start`, `lima start`, and `lima restart` — good breadth against the new contract. Codex built `bin/rdd` and ran `go test ./cmd/rdd/... ./pkg/cli/... ./pkg/service/cmd/... ./pkg/service/controllers/...`; all passed.

Untested scenarios, ranked by risk:

1. **Shared autostart helpers (S4).** `rdd svc config`, `rdd set`, and `rdd ctl` now carry the same exit-4 contract through `ensureServiceRunning`, but no test asserts it.
2. **`svc delete` after the service has already exited cleanly (S3 / S6).** The outer/inner `Running()` race-suppression branch at `pkg/service/cmd/service.go:449` is not exercised; a test that starts the service, delivers `SIGINT` directly, then runs `rdd svc delete` would cover it.
3. **`--timeout=0` (wait forever).** Documented for `svc stop`, `svc delete`, `svc start`, and every `limavm` command, but never asserted. A regression that treats `0` as "return immediately" would ship green.
4. **`svc start --wait=false`.** A TODO at `bats/tests/20-service/2-create.bats:3` predates this PR; the new flag handling makes regressions slightly easier to introduce.
5. **`limavm create --start --wait=false`.** Guarded by `start && wait && !dryRun` at `cmd/rdd/limavm.go:409`, but not exercised.
6. **I1 / I2 regression tests.** Neither Important has a regression test today: I1 needs a synthetic Rejected-wrapping-Deadline error through `Classify`; I2 needs a timed-out stop followed immediately by an already-running `svc start` with a deterministic process-exit barrier.

## Documentation Assessment

- `docs/design/cmd_service.md` accurately reflects the new `--wait` / `--timeout` defaults and the exit-4 contract; the cross-link from `delete` to `stop` is clean.
- `docs/design/cmd_lima.md` updates all four affected commands' defaults.
- `docs/design/api_service.md` now names the SIGINT / SIGTERM / `CTRL_BREAK_EVENT` / `TerminateProcess` fallback chain exactly.
- The `Classify` docstring at `pkg/cli/exit/exit.go:53-56` reads as an idempotency guarantee. If I1 is not adopted, tighten it to say the function only preserves `CodeTimeout` classifications.
- The comment at `pkg/service/cmd/service.go:389-391` ("the service handles signals promptly; its slowness comes from draining hostagents sequentially") is accurate but should cross-reference the 45-second controller-manager drain in `Run` at line 847, since both bound the same wait.

## Commit Structure

Single commit, clean scope, message accurately describes the change.

## Acknowledged Limitations

- **Windows `Interrupt` fallback.** `pkg/service/cmd/service.go:359-369` notes that on Windows, `Interrupt` (`GenerateConsoleCtrlEvent`) fails when caller and target lack a shared console; `Kill` (`TerminateProcess`) then bypasses graceful shutdown and hostagents self-heal on the next service start via `killOrphanedHostagent`. Pre-existing, unchanged.
- **Windows log rename.** `pkg/service/cmd/service.go:466-470` notes that on Windows, `os.Rename` of log files can fail when a non-Go process (QEMU) holds a lock without `FILE_SHARE_DELETE`; those logs are lost. Pre-existing, unchanged.
- **`bats/tests/20-service/2-create.bats:3` TODO.** Explicit `svc start --wait=false` coverage still missing; this PR adds surrounding timeout coverage but not this one.

## Agent Performance Retro

**[Claude]** Carried the round: four of five consolidated Importants/Suggestions with behavioural teeth (the `Classify` stripping bug, the `cleanupDiscoveryConfigMap` ordering hazard, the `watchUntil` NotFound asymmetry, and the `Running()` / `StartTime()` race). The `Classify` finding is the kind of latent-bug spot that requires reading the docstring against the implementation; no other agent surfaced it. Two Importants (I3, I4) were borderline and consolidated as Suggestions — a calibration note rather than a miss, since the findings themselves are real.

**[Codex]** One sharp Important (second-precision freshness race) and one well-chosen test-gap Suggestion. The freshness finding demotes cleanly to Suggestion because Round 6 still improves strictly over Round 5 — Codex correctly identified the residual race but did not weigh the pre-PR baseline. Ran the Go unit tests and reported them green, which neither other agent did. Missed the `Classify` latent bug and the `cleanupDiscoveryConfigMap` ordering hazard.

**[Gemini]** Zero findings. With every Round-5 Important fixed and the remaining issues either latent (I1) or dependent on process-lifecycle timing (I2), a clean bill from the third agent is plausible — the three recorded design observations read the diff honestly. As expected, Gemini skipped `git blame` (quota), so regression/gap labels came entirely from Claude and Codex. Its single design concern (background-daemon lifecycle tied to CLI context) sits outside the PR scope but is worth keeping on file.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 15m 09s | 7m 32s | 6m 14s |
| Findings | 2I 8S | 2S | none |
| Tool calls | 49 (Bash 28, Read 21) | 57 (shell 53, plan 2, stdin 2) | 41 (grep_search 18, read_file 12, run_shell_command 10) |
| Design observations | 2 | 1 | 3 |
| False positives | 0 | 0 | 0 |
| Unique insights | 4 | 2 | 0 |
| Files reviewed | 10 | 10 | 10 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **2I 8S** | **2S** | **none** |
| Downgraded | 2 (I→S) | 1 (I→S) | 0 |
| Dropped | 0 | 0 | 0 |


Overall, Claude produced the most reviewable output this round; Codex added one durable suggestion and the only live build-and-test signal; Gemini's clean read corroborates that the Round-5 Importants genuinely closed.

### Reconciliation

- Claude I3 (`watchUntil` NotFound asymmetry) → Suggestion S2. The fall-through closes the Round-5 silent-success regression; the remaining issue is message symmetry.
- Claude I4 (`Running()` / `StartTime()` race) → Suggestion S3. The race requires a concurrent second CLI removing the PID file between our two calls; microsecond window.
- Codex I1 (second-precision freshness) → Suggestion S1. Round 6 is strictly better than Round 5's no-freshness-check baseline; the sub-second race is a pre-existing precision limit, not a new regression.

## Review Process Notes

### Repo context updates

- **[repo] ADD a verification heuristic for latent idempotency guarantees in wrappers.** When a new helper function's docstring says "callers may X repeatedly" or "X is idempotent," read the implementation against every existing caller that could violate the guarantee, then trace any plausible future caller the signature invites. Latent bugs of this shape rarely surface until the invited caller lands, at which point the behaviour regresses quietly. Flag when the implementation's short-circuit is narrower than the docstring claims. Rationale: the `Classify` finding this round existed because the docstring promised more than the implementation delivered, and no caller today exercises the gap.

- **[repo] ADD pairwise verification of CLI state cleanup and CLI readiness waits.** When a CLI command both (a) deletes shared cluster state during shutdown and (b) another CLI command waits on that state for readiness, trace the sequence `shutdown-with-timeout → state-deleted → restart-quick → readiness-wait-against-missing-state`. This pattern produces user-visible hangs the individual code paths look clean. Rationale: I2 this round is exactly this shape — the stop deletes the discovery CM, the start waits for the same CM with a freshness anchor, and the window between signal send and actual process exit is the failure mode.

### Skill improvements

None. Both findings above are specific to this repo's shutdown / readiness protocol and do not generalise.

## Appendix: Original Reviews

### [Claude] — Pass 1

Based on my thorough review, I'll now write the structured findings.

# Deep Code Review — PR #336: cli: standardize --wait and --timeout across svc and limavm

## Header

| Field | Value |
| --- | --- |
| Verdict | Approve with comments |
| Files reviewed | 10 / 10 |
| Critical | 0 |
| Important | 4 |
| Suggestions | 6 |

## Executive Summary

This PR delivers a clean, well-documented standardization of `--wait` / `--timeout` semantics across `rdd svc` and `rdd limavm`, with consistent CLI exit code 4 on deadline expiry. The library-surface refactor (`StopWithWait(ctx, wait, timeout)`, `Delete(ctx, timeout)`, `StartTime()`) is sensible, the new BATS tests cover the contract, and the design docs are updated to match. The main concerns are: (1) a `cleanupDiscoveryConfigMap` ordering hazard is now more visible in the timeout-as-exit-4 flow; (2) `Classify` overrides earlier classifications; (3) a couple of subtle race windows in the `Running()` → `StartTime()`/`StopWithWait` sequence; and (4) a fall-through error message asymmetry in `watchUntil`/`watchUntilDeleted`.

## Findings

### Critical Issues

None.

### Important Issues

**I1. `Classify` re-wraps `Rejected` errors as `Timeout` when they wrap `context.DeadlineExceeded`** — `pkg/cli/exit/exit.go:57-66`

```go
func Classify(err error) error {
    var exitErr *Error
    if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
        return err
    }
    if errors.Is(err, context.DeadlineExceeded) {
        return Timeout(err)
    }
    return err
}
```

The check at line 59 only short-circuits when the existing `*Error` already has `Code == CodeTimeout`. If a future caller wraps a deadline-bearing error with a different exit code (`Rejected`, or a not-yet-defined code), `Classify` strips that classification and replaces it with `Timeout`. Today no `Rejected` site can reach Classify because `classifyAPIError` (cmd/rdd/set.go:332) only fires on admission errors — so the bug is latent — but the function's docstring promises "callers may Classify repeatedly," which suggests it is intended as a safe top-level convenience. A future helper that returns `cliexit.Rejected(ctx.Err())` would silently lose its 3 → 4 (rejected → timeout) classification.

Fix: short-circuit on any `*Error`, not just `Code == CodeTimeout`.

```diff
 func Classify(err error) error {
     var exitErr *Error
-    if errors.As(err, &exitErr) && exitErr.Code == CodeTimeout {
+    if errors.As(err, &exitErr) {
         return err
     }
     if errors.Is(err, context.DeadlineExceeded) {
         return Timeout(err)
     }
     return err
 }
```

(important, regression — Classify is new in this PR.)

---

**I2. `cleanupDiscoveryConfigMap` deletes the CM before timeout, leaving a "running but unrecoverable" window** — `pkg/service/cmd/service.go:355-356, 376-405`

```go
// Clean up discovery configmap while cluster is still accessible
_ = cleanupDiscoveryConfigMap() // Clean up discovery configmap to prevent stale controller info

pid := PID()
// Try graceful shutdown first.
if err := process.Interrupt(pid); err != nil {
    if err := process.Kill(pid); err != nil { ... }
}

if wait {
    ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
    ...
    case <-ctx.Done():
        _ = process.Kill(pid)  // SIGTERM on Unix — async; process may still be alive
        ...
        return fmt.Errorf("timed out waiting for ...")
```

`StopWithWait` deletes the discovery ConfigMap before sending the signal. On a short timeout (e.g., the new `--timeout=1ms` test), the wait fires before the service actually exits, the function fires SIGTERM and returns. On Unix, `process.Kill` is fire-and-forget — `Running()` may still be true on return. If the user immediately runs `rdd svc start` (or any command that hits `ensureServiceRunning`, line 92-111), control flow enters the *already-running* branch (`service.go:98`), reads `service.StartTime()`, then calls `waitForFreshDiscoveryConfigMap(ctx, startTime)`. That CM no longer exists and will never be recreated by the dying service — only a fresh `rdd service serve` recreates it (`service.go:766`). The poll runs until `startWaitTimeout` (90s) and exits 4.

Before this PR the same hazard existed, but `Delete()` always succeeded and `serviceStartAction`'s "already running" branch used the *non-freshness* `waitForDiscoveryConfigMap`, which would have accepted any pre-existing CM. After this PR, the strict freshness check makes the failure mode user-visible: timeout → exit 4 with no automatic recovery. The new test `bats/tests/20-service/6-timeout.bats:22-29` actually exercises this race (Test 2 calls `rdd svc start` after Test 1 terminated the service via SIGTERM); it relies on the service exiting fast enough that Test 2 takes the cold-start path.

Fix options:
- Move `cleanupDiscoveryConfigMap` to *after* the wait succeeds (only on confirmed exit), so a timed-out stop leaves the previous CM in place for the next start to consume.
- Or, on the already-running path, fall back to recreating the discovery ConfigMap via the loopback client when it is missing.
- Or, accept the limitation and document it: the user must `rdd svc delete && rdd svc create && rdd svc start` after a timed-out stop.

(important, regression — the freshness check in the already-running path is added by this PR; the CM cleanup ordering is pre-existing.)

---

**I3. `watchUntil` returns raw `NotFound` on watcher fall-through, masking the "deleted while waiting" path** — `cmd/rdd/limavm.go:497-509`

```go
// ResultChan closed. If ctx ended, return its error. Otherwise the
// watch dropped (server close, proxy timeout, watch expiry); re-read
// and re-evaluate to distinguish a satisfied predicate from a drop.
if err := ctx.Err(); err != nil {
    return err
}
if err := c.Get(ctx, key, &vm); err != nil {
    return err
}
if check(&vm) {
    return nil
}
return fmt.Errorf("watch closed before LimaVM %q reached the desired state", key.Name)
```

When the watch loop is alive and a `Deleted` event arrives, line 489-490 returns the descriptive `"LimaVM %q was deleted while waiting"`. When the channel closes before that event lands and the resource is gone, the fall-through `c.Get` at line 503 returns `apierrors.NewNotFound(...)`, which is propagated unwrapped. The user gets `"limavms.lima.rancherdesktop.io \"foo\" not found"` instead of the consistent `"... was deleted while waiting"` message. Same asymmetry exists for terminal-state predicates (e.g., the `start --wait` watcher returning a stale NotFound from a parallel delete) — the user cannot tell whether the timeout fired or the resource vanished.

Fix:

```diff
     if err := c.Get(ctx, key, &vm); err != nil {
+        if apierrors.IsNotFound(err) {
+            return fmt.Errorf("LimaVM %q was deleted while waiting", key.Name)
+        }
         return err
     }
```

(important, regression — the fall-through code is new.)

---

**I4. Race between `service.Running()` and `service.StartTime()` produces an opaque error** — `cmd/rdd/service.go:98-102, 262-270`

```go
if service.Running() {
    startTime, err := service.StartTime()
    if err != nil {
        return err
    }
    ...
}
```

`Running()` walks `PID()`, which calls `os.Remove(instance.PIDFile())` (`pkg/service/cmd/service.go:158`) when the process is dead. So `Running()` is mutating: a true-then-false transition during a single CLI invocation can yield `service.Running() == true` followed by `os.Stat(PIDFile)` returning `ENOENT`. The user then sees the bare `stat /path/rdd.pid: no such file or directory` error with no context, both in `ensureServiceRunning` (line 99) and in `serviceStartAction` (line 267).

The probability is low (the gap is microseconds and only one process is reading the PID file in CLI) but the failure mode is silent. Fix: wrap the error or fall back to the cold-start path on `os.IsNotExist`.

```diff
     startTime, err := service.StartTime()
-    if err != nil {
+    if os.IsNotExist(err) {
+        // Service exited between Running() and StartTime(); treat as cold start.
+        return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
+    } else if err != nil {
         return err
     }
```

(important, regression — `StartTime` is new, and the prior code paths did not have this dependency.)

### Suggestions

**S1. Local `wait` variable shadows imported `k8s.io/apimachinery/pkg/util/wait`** — `cmd/rdd/service.go:19, 247, 327`

```go
"k8s.io/apimachinery/pkg/util/wait"
...
func serviceStartAction(cmd *cobra.Command, args []string) error {
    wait, err := cmd.Flags().GetBool("wait")
```

Inside `serviceStartAction` and `serviceStopAction`, the local `wait` variable shadows the imported `wait` package. Neither function references `wait.PollUntilContextCancel`, so no functional bug, but it makes the code surprising for a reader who searches for `wait.` and sees the import. Renaming the local to `doWait` (or refactoring out the boolean by inlining the caller's intent) avoids the trap. Pre-existing for `serviceStopAction`; this PR adds the same shadow to `serviceStartAction`.

(suggestion, gap.)

---

**S2. `Delete` comment understates the suppression scope** — `pkg/service/cmd/service.go:443-453`

```go
if Running() {
    if err := StopWithWait(ctx, true, timeout); err != nil {
        // Suppress only the outer/inner Running() race, where the
        // process exits between Delete's check and StopWithWait's.
        // Deadline expiry must surface so the CLI exits 4 and leaves
        // the directory in place, per docs/design/cmd_service.md.
        if errors.Is(err, context.DeadlineExceeded) || Running() {
            return err
        }
    }
}
```

The comment claims "only the outer/inner Running() race" is suppressed, but the predicate `errors.Is(err, context.DeadlineExceeded) || Running()` actually suppresses *any* error from `StopWithWait` whose post-call `Running()` is false — including signal-delivery failures (lines 370-373) and the `context.Canceled` path (lines 393-397). The cancellation path is unreachable in practice because the CLI uses `context.Background()` (no signal handler wired in `cmd/rdd/main.go`), but a future commit that wires SIGINT into the CLI ctx would silently delete state on Ctrl-C. Either tighten the predicate to match the comment, or expand the comment to enumerate every suppressed case.

```diff
-        // Suppress only the outer/inner Running() race, where the
-        // process exits between Delete's check and StopWithWait's.
-        // Deadline expiry must surface so the CLI exits 4 and leaves
-        // the directory in place, per docs/design/cmd_service.md.
+        // Suppress only the harmless TOCTOU between Delete's outer
+        // Running() check and StopWithWait's inner one (which yields
+        // "control plane is not running"). All other errors,
+        // including DeadlineExceeded and any case where the process
+        // is still alive, must surface so the CLI exits 4 and leaves
+        // the directory in place, per docs/design/cmd_service.md.
         if errors.Is(err, context.DeadlineExceeded) || Running() {
             return err
         }
```

(suggestion, regression.)

---

**S3. `setup_rdd_control_plane` may now hang up to 5 minutes when previous tests left the service in a bad state** — `bats/helpers/controller.bash:75-81`

```bash
setup_rdd_control_plane() {
    local controllers=${1:-"*"}

    rdd svc delete
    rdd svc create --controllers="${controllers}"
    rdd svc start
}
```

Pre-existing helper, but its assumed behaviour changed. Before this PR, `rdd svc delete` waited up to 60s and *always* removed the directory (old `Stop()` errors were ignored). After this PR it waits up to 5m (default `stopWaitTimeout`) and *fails* with exit 4 if the service does not exit in time, leaving the directory in place. A test file whose previous run left a stuck hostagent would now hang for 5m and then fail loudly instead of recovering. This is a CI durability regression for any future test that exits with a hostagent in pathological state.

Fix: pass an explicit shorter timeout (e.g., `--timeout=120s`) and `|| :` so the helper is robust against partial cleanup.

```diff
 setup_rdd_control_plane() {
     local controllers=${1:-"*"}

-    rdd svc delete
+    rdd svc delete --timeout=120s || :
     rdd svc create --controllers="${controllers}"
     rdd svc start
 }
```

(suggestion, gap — pre-existing helper, behaviour shift introduced by this PR.)

---

**S4. `bats/tests/20-service/6-timeout.bats` second test depends on the dying service exiting fast enough** — `bats/tests/20-service/6-timeout.bats:22-29`

```bats
@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Restart the service; the prior test killed it.
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    ...
}
```

The first test (`svc stop --timeout=1ms`) sends SIGINT then SIGTERM and returns. On Unix, `process.Kill` is async — the service may still be alive when bash starts test 2. `rdd svc start` then takes the *already-running* branch (`cmd/rdd/service.go:262`), calls `service.Wait` for up to 90s on a service that is exiting. With no controllers configured, exit is fast in practice, but on a slow CI runner the test could take 90s and then fail. Add a `wait_for_dead` polling helper before `rdd svc start`, or call `rdd svc delete || :; rdd svc create; rdd svc start` like Test 3 does, to deterministically reset state.

(suggestion, regression — new test.)

---

**S5. `--timeout` help text inconsistent with `rdd set`** — `cmd/rdd/limavm.go:151,167,183,200,255`, `cmd/rdd/service.go:234,354,364`, `cmd/rdd/set.go:92`

The new flags use `"Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)"`. `rdd set` (pre-existing, unchanged) uses `"Timeout for waiting (0 to wait indefinitely)"`. Since the PR is explicitly about standardizing this surface, aligning the `set` help string in this PR (or as a follow-up) keeps `--help` output consistent. Same for the hardcoded `300*time.Second` in `cmd/rdd/set.go:92` versus the named `limaLongWaitTimeout = 5 * time.Minute` constant — both happen to be the same value today.

(suggestion, enhancement.)

---

**S6. `rdd limavm create --start --wait` docstring slightly imprecise** — `cmd/rdd/limavm.go:150-151`

```go
command.Flags().BoolVar(&wait, "wait", true, "Wait for the operation to complete (only applies with --start)")
command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait; ignored if --wait=false (0 means wait indefinitely)")
```

`--wait` says "only applies with --start", but `--timeout`'s help only says "ignored if --wait=false" — it does not mention that the timeout is also ignored when `--start=false` (because the wait branch at line 409 requires `start && wait`). A user reading `--timeout=5m` on `rdd limavm create` without `--start` would expect the create to be bounded; in fact the create returns as soon as the API call succeeds. Mention `--start` in the timeout help on the `create` command for parity.

(suggestion, enhancement.)

## Design Observations

**Strengths**

- Replacing the bespoke `time.After` channel in `StopWithWait` with `watchtools.ContextWithOptionalTimeout` unifies the cancellation/timeout story and inherits client-go's "0 = forever" idiom. This is exactly the right primitive.
- The `watchUntil` / `watchUntilDeleted` fall-through (re-Get after channel close, distinguishing satisfied predicate vs. dropped watch) is a real correctness improvement over the prior `return ctx.Err()`, which could silently report "deleted" on a benign watch close.
- Anchoring `waitForFreshDiscoveryConfigMap` to the PID-file mtime in the *already-running* path closes a subtle staleness window where a CM left over from a prior, partially-cleaned run could satisfy the wait against the *new* server. The new path reuses the same predicate the cold-start path uses, removing a behavioral asymmetry.
- The `Delete(ctx, timeout)` change to *not* destroy state on a partial stop is the right safety call given the Windows directory-corruption / Unix PID-file-MX risks documented in the design doc.

**Concerns** *(in-scope)*

- `Delete` defines the user contract via the predicate `errors.Is(err, context.DeadlineExceeded) || Running()` — this leaks `StopWithWait`'s internal error vocabulary into `Delete`. A cleaner contract is for `StopWithWait` to return a typed sentinel (`ErrTimedOut`) or a `*cliexit.Error` directly, so `Delete` only has to check one thing. This also fixes I1 by construction.

## Testing Assessment

- The new exit-code-4 BATS coverage hits `svc stop`, `svc delete`, `svc start`, `lima create --start`, `lima start`, and `lima restart`. Good breadth.

Untested scenarios, ranked by risk:

1. **`svc stop --timeout=0` (wait forever)** — explicitly documented but not tested. A regression that converts the 0-sentinel to "stop immediately" would ship green.
2. **`svc delete` after the service has already exited cleanly** — the `Running()` → `StopWithWait` race (S2 / I4) needs a test that starts the service, sends SIGINT directly via `kill -INT`, then runs `rdd svc delete`. Currently no test exercises the "outer Running()=true, inner Running()=false" suppression branch (line 449).
3. **`svc start --wait=false`** — there's a TODO at `bats/tests/20-service/2-create.bats:3` from a prior PR. This PR is the natural place to fix it; the new flag handling makes `--wait=false` slightly easier to break.
4. **`limavm create` with `--start --wait=false`** — guarded by `start && wait` at limavm.go:409, but not exercised. A regression making `--wait=false` block would ship green.
5. **`lima start --timeout=0 "vm"`** — same wait-forever sentinel as #1, untested.

## Documentation Assessment

- `docs/design/cmd_service.md` accurately reflects the new behavior; the cross-link to `rdd service stop` from `delete` is clean.
- `docs/design/cmd_lima.md` updates all four affected commands' defaults. ✓
- `docs/design/api_service.md` correctly documents the SIGINT/SIGTERM/CTRL_BREAK_EVENT/TerminateProcess fallback chain. ✓
- The docstring on `Classify` (`pkg/cli/exit/exit.go:53-56`) describes the function's intent but understates its limitation — see I1. Update if the fix is not adopted.
- Comment "the service handles signals promptly; its slowness comes from draining hostagents sequentially" in `pkg/service/cmd/service.go:389-391` is accurate but worth cross-referencing the 45s controller-manager drain in `Run` (line 847) since they bound the same wait.

## Commit Structure

Single commit, well-scoped, message accurately describes the change. ✓

## Acknowledged Limitations

- `pkg/service/cmd/service.go:359-369` — Author acknowledges in-line that on Windows, `Interrupt` (GenerateConsoleCtrlEvent) fails when caller and target lack a shared console, falling back to TerminateProcess and bypassing graceful shutdown. The comment notes "Hostagents survive in their own process groups and self-heal on the next service start via killOrphanedHostagent." Pre-existing limitation, unchanged by this PR.
- `pkg/service/cmd/service.go:466-470` — Author acknowledges that on Windows, `os.Rename` of log files can fail when a non-Go process (QEMU) holds a lock without `FILE_SHARE_DELETE`, in which case logs are lost. Unchanged.

## Coverage Summary

| File | Status |
| --- | --- |
| `bats/tests/20-service/6-timeout.bats` | S4 |
| `bats/tests/33-lima-controllers/limavm-cli.bats` | Reviewed, no issues |
| `cmd/rdd/limavm.go` | I3, S6 |
| `cmd/rdd/service.go` | I2, I4, S1 |
| `cmd/rdd/set.go` | S5 |
| `docs/design/api_service.md` | Reviewed, no issues |
| `docs/design/cmd_lima.md` | Reviewed, no issues |
| `docs/design/cmd_service.md` | Reviewed, no issues |
| `pkg/cli/exit/exit.go` | I1 |
| `pkg/service/cmd/service.go` | I2, S2 |
### [Codex] — Pass 1

### Executive Summary

This PR standardizes wait defaults, exposes timeout flags on `svc`/`limavm`, and centralizes timeout-to-exit-code-4 classification. The implementation is mostly coherent, but the new already-running readiness path now treats second-truncated PID-file mtimes as a freshness token, which can accept a stale discovery ConfigMap after a rapid restart and report the control plane ready too early. The remaining gap is test coverage: the new timeout contract is asserted for `svc` and a few `limavm` paths, but not for the shared autostart entry points that now depend on the same helper.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Second-granularity freshness check can report a restarted control plane ready against stale discovery data** — `cmd/rdd/service.go:267-278` (important, regression)

```go
startTime, err := service.StartTime()
if err != nil {
    return err
}
...
if err := waitForFreshDiscoveryConfigMap(ctx, startTime); err != nil {
    return cliexit.Classify(err)
}
```

`serviceStartAction()` now feeds `service.StartTime()` into `waitForFreshDiscoveryConfigMap()` at `cmd/rdd/service.go:267-278`, and `ensureServiceRunning()` does the same at `cmd/rdd/service.go:99-108`. The problem is that `StartTime()` at `pkg/service/cmd/service.go:172-177` derives that value from `rdd.pid`'s `mtime` and truncates it to whole seconds, while `waitForFreshDiscoveryConfigMap()` accepts any ConfigMap whose `CreationTimestamp` is not *before* that second at `cmd/rdd/service.go:166-169`. If an unclean shutdown leaves `rdd-controller-manager` behind and the next service process writes `rdd.pid` in the same second, the stale ConfigMap passes the freshness check and the CLI can return success before the current run has recreated discovery state and re-registered controllers. Because the new path is shared through `ensureServiceRunning()` at `cmd/rdd/service.go:92-110`, the false-ready result can leak into `rdd svc config`, `rdd set`, and `rdd limavm *` when they attach to a just-started service.

Fix: stop using second-truncated timestamps as the freshness identity. Persist a startup nonce/UUID alongside the PID file (or in another instance-local state file), copy that nonce into the discovery ConfigMap during `InitDiscovery`, and have `waitForFreshDiscoveryConfigMap()` wait for a matching nonce instead of comparing `CreationTimestamp` values.

#### Suggestions

S1. **The new timeout contract is still not asserted for the shared autostart entry points** — `bats/tests/20-service/6-timeout.bats:17-37` (suggestion, gap)

```bash
@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}
...
@test 'svc start --timeout=1ms exits with code 4' {
    run -4 rdd svc start --wait --timeout=1ms
```

The new BATS coverage at `bats/tests/20-service/6-timeout.bats:17-37` and `bats/tests/33-lima-controllers/limavm-cli.bats:196-229` exercises `svc` and selected `limavm` commands, but the shared autostart helper in `cmd/rdd/service.go:92-110` now also defines timeout behavior for `rdd svc config`, `rdd set` (`cmd/rdd/set.go:324-325`), and `rdd ctl` (`cmd/rdd/kubectl.go:33-35`). Existing tests such as `bats/tests/20-service/3-config.bats:3-7` and the `rdd set` setup path in `bats/tests/32-app-controllers/engine-docker.bats:16-21` only cover success. A caller-specific regression in timeout classification would still ship green.

Fix: add `run -4` coverage for `rdd svc config` and one representative shared-helper caller (`rdd set` or `rdd ctl`) against a deliberately not-ready control plane, so the exit-code-4 contract is asserted where the helper is actually consumed.

### Design Observations

**Concerns**

1. **Discovery freshness is modeled as timestamp ordering across two storage systems** `(in-scope)`

The current design compares filesystem `mtime` from `rdd.pid` with Kubernetes `CreationTimestamp` from the discovery ConfigMap. Replacing that with an explicit startup token would remove precision and clock-coupling issues, simplify the readiness predicate, and make both cold-start and already-running waits deterministic.

**Strengths**

1. Centralizing timeout classification in `cliexit.Classify()` keeps exit-code policy at the CLI boundary instead of scattering `context.DeadlineExceeded` checks across commands.

2. The `watchUntil()` / `watchUntilDeleted()` fall-through changes in `cmd/rdd/limavm.go:497-563` correctly stop treating dropped watches as implicit success.

### Testing Assessment

I built `bin/rdd` and ran `go test ./cmd/rdd/... ./pkg/cli/... ./pkg/service/cmd/... ./pkg/service/controllers/...`; those passed. The checkout does not include a runnable `bats` binary, so I could not execute the modified BATS files directly.

1. I1 has no regression test. There is no scenario that leaves a stale discovery ConfigMap behind, restarts the service within the same timestamp bucket, and verifies that `rdd svc start` / `ensureServiceRunning()` still waits for the current registration.
2. S1 remains untested. No BATS case currently asserts exit code 4 for `rdd svc config`, `rdd set`, or `rdd ctl` even though they now depend on the same autostart timeout path.

### Documentation Assessment

The user-facing docs updated in this PR are consistent with the intended CLI contract. If I1 is fixed by replacing timestamp freshness with a startup token, `docs/design/cmd_service.md` and `docs/design/api_service.md` should be updated to describe the stronger readiness identity instead of implying timestamp-based freshness is exact.

### Acknowledged Limitations

- **Code comment**: `# TODO: test \`rdd svc start --wait=false\`` — `bats/tests/20-service/2-create.bats:3`. This PR adds timeout coverage, but the explicit non-wait `svc start` path is still knowingly untested; the current change does not make that limitation materially worse.

### Coverage Summary

- `bats/tests/20-service/6-timeout.bats` — S1
- `bats/tests/33-lima-controllers/limavm-cli.bats` — Reviewed, no issues
- `cmd/rdd/limavm.go` — Reviewed, no issues
- `cmd/rdd/service.go` — I1, S1
- `cmd/rdd/set.go` — S1
- `docs/design/api_service.md` — Reviewed, no issues
- `docs/design/cmd_lima.md` — Reviewed, no issues
- `docs/design/cmd_service.md` — Reviewed, no issues
- `pkg/cli/exit/exit.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — I1

### [Gemini] — Pass 1

### Executive Summary

This PR successfully standardizes the `--wait` and `--timeout` CLI flags across the `svc` and `limavm` command groups. The changes correctly enforce finite timeouts, introduce code 4 on deadline expiry via robust error classification, and elegantly handle race conditions during instance deletion. The implementation is highly consistent, robust, and correctly covers transitive callers that autostart the service.

### Findings

- **Critical Issues**
None.

- **Important Issues**
None.

- **Suggestions**
None.

### Design Observations

**Concerns**
- **Background Daemon Lifecycle tied to CLI Context (pre-existing):** In `pkg/service/cmd/service.go`, `service.Start(ctx, ...)` uses `exec.CommandContext(ctx, ...)` to launch the background control plane daemon, where `ctx` is the CLI command's context (passed down through `ensureServiceRunning` -> `startAndWaitForReady`). Because `cli.RunNoErrOutput` cancels `cmd.Context()` on SIGINT, if a user hits `Ctrl-C` to abort a long-running wait (e.g., `rdd limavm start --wait`), the context cancellation propagates to `exec.CommandContext`, which then forcefully kills the newly-started daemon process via `cmd.Process.Kill()`. Background processes that must outlive their launcher should use `context.Background()` (or a detached context) to prevent the CLI's signal handler from tearing them down. `(future)`

**Strengths**
- **Robust race condition handling in Delete:** The logic in `Delete(ctx, timeout)` correctly recognizes that if `StopWithWait` fails and `Running()` subsequently returns `false`, the process exited in the race window between the check and the wait. Suppressing the error in this case ensures the deletion completes without falsely blocking the user.
- **Consistent Timeout Routing:** Consolidating the `cliexit.Classify(err)` logic directly into `ensureServiceRunning` efficiently ensures all transitive callers that trigger an auto-start (like `kubectl`, `set`, and `config`) seamlessly inherit the exit-code-4 contract on deadline expiry.
- **Watch Drop Recovery:** The updates to `watchUntil` and `watchUntilDeleted` correctly identify that a natural watch expiration (with a nil `ctx.Err()`) is not a success state, preventing silent false-successes and reliably surfacing drops.

### Testing Assessment

The test additions directly cover the new timeout boundaries and properly assert the exit code 4 behavior for both service lifecycle events and LimaVM provisioning scenarios.

### Documentation Assessment

The documentation strictly mirrors the implementation changes. The descriptions of `Delete` preserving the directory on timeout and `Stop` utilizing SIGTERM (Unix) / TerminateProcess (Windows) fallbacks accurately document the underlying codebase execution.

### Coverage Summary

- `bats/tests/20-service/6-timeout.bats`: Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-cli.bats`: Reviewed, no issues
- `cmd/rdd/limavm.go`: Reviewed, no issues
- `cmd/rdd/service.go`: Reviewed, no issues
- `cmd/rdd/set.go`: Reviewed, no issues
- `docs/design/api_service.md`: Reviewed, no issues
- `docs/design/cmd_lima.md`: Reviewed, no issues
- `docs/design/cmd_service.md`: Reviewed, no issues
- `pkg/cli/exit/exit.go`: Reviewed, no issues
- `pkg/service/cmd/service.go`: Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-22 |
| **Commit** | `47b71db` cli: standardize --wait and --timeout across svc and limavm (autosquashed into the PR commit) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: `Classify` strips a non-timeout classification | Fixed |
| 2 | Important #2: `cleanupDiscoveryConfigMap` ordering can hang the next start | Commented (architectural follow-up) |
| 3 | Suggestion #1: Second-precision freshness anchor | Skipped (architectural follow-up) |
| 4 | Suggestion #2: `watchUntil` raw NotFound on watch-close | Fixed |
| 5 | Suggestion #3: `Running()` → `StartTime()` race | Skipped |
| 6 | Suggestion #4: No exit-4 coverage for shared autostart helpers | Skipped |
| 7 | Suggestion #5: Local `wait` shadows `wait` package | Fixed |
| 8 | Suggestion #6: `Delete` comment understates suppression scope | Commented (expanded) |
| 9 | Suggestion #7: `setup_rdd_control_plane` 5-minute hang risk | Fixed |
| 10 | Suggestion #8: 6-timeout.bats Test 2 timing dependency | Fixed |
| 11 | Suggestion #9: `--timeout` help string disagreement | Fixed |
| 12 | Suggestion #10: `limavm create --timeout` doc imprecision | Fixed |
| 13 | Testing Gap 1–6 | Skipped |
| 14 | Documentation Gap: 45s drain cross-reference | Documentation updated |
