Deep Review: 20260421-173630-pr-336

Date	2026-04-21 17:36
Repo	rancher-sandbox/rancher-desktop-daemon
Round	1
Author	@jandubois
PR	#336 — cli: standardize --wait and --timeout across svc and limavm
Branch	`wait-timeout-cleanup`
Commits	`99aa4e3` cli: standardize --wait and --timeout across svc and limavm
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — one important correctness bug (`svc delete --wait=false` races with the still-running daemon) and the shared 30s default is below the daemon's own 45s graceful-shutdown ceiling. Fix those two before merging; the remaining items are doc/testing polish.
Wall-clock time	`37 min 22 s`

Executive Summary ¶

This PR unifies the --wait / --timeout surface across rdd svc and rdd limavm, and routes deadline expiry through cliexit.Timeout so timeouts exit with code 4 (matching rdd set). The refactor is clean and the wrap-once-at-the-CLI-boundary pattern is right. Two behavior changes warrant attention before merging: svc delete --wait=false now lets service.Delete() proceed to os.RemoveAll(instance.Dir()) while the serve subprocess is still holding PID/SQLite/log handles; and the shared defaultWaitTimeout = 30s is below both the service's own 45s controller-manager drain (svc stop/svc delete) and the previous 90s CRD-establishment budget (svc start, still 90s inside ensureServiceRunning). Documentation in docs/design/cmd_lima.md was not updated to match the new defaults, the new exit-code-4 contract is asserted in exactly one test, and the wrapper service.Stop() removal is unmentioned in the commit body.

Critical Issues ¶

None.

Important Issues ¶

I1. svc delete --wait=false removes the instance directory while the serve subprocess is still alive Claude Codex Gemini¶

pkg/service/cmd/service.go

	return nil // Don't fail stop operation due to discovery cleanup issues
}

// Delete removes all instance data. If the service is running, Delete waits
// up to timeout for it to exit before removing files; removing instance.Dir()
// while the serve process holds rdd.pid, rdd.sqlite3, and log files corrupts
// the directory on Windows and breaks PID-file mutual exclusion on Unix.
func Delete(timeout time.Duration) error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	if Running() {
		if err := StopWithWait(true, timeout); err != nil {
			return err
		}
	}
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())
	return os.RemoveAll(instance.Dir())
}

// preserveAllInstanceLogs moves .log files from each Lima instance directory

Before the PR, Delete() always called Stop(), which always waited. This PR exposes --wait=false on svc delete and threads it through to StopWithWait. When the user passes --wait=false, StopWithWait signals the process and returns immediately (pkg/service/cmd/service.go:358-395); Delete then calls os.RemoveAll(instance.LogDir()), os.RemoveAll(instance.ShortDir()), and os.RemoveAll(instance.Dir()) while the serve subprocess still holds open handles to rdd.pid, rdd.sqlite3, rdd.stderr.log, rdd.stdout.log, and has its working directory pinned to instance.Dir() (set in service.go:607).

pkg/service/cmd/service.go

	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

	if wait {
		ctx := context.Background()
		if timeout > 0 {
			var cancel context.CancelFunc
			ctx, cancel = context.WithTimeout(ctx, timeout)
			defer cancel()
		}
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()

		for {
			select {
			case <-ctx.Done():
				// Graceful shutdown timed out; terminate so we don't leave
				// a hung service process. Kill targets only the service; on
				// Windows (TerminateProcess) this avoids killing hostagents
				// that are children of the service but run in their own
				// process groups. On Unix, SIGTERM suffices because the
				// service is responsive to signals (it's just slow shutting
				// down hostagents sequentially).
				_ = process.Kill(pid)
				return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, ctx.Err())
			case <-ticker.C:
				if !Running() {
					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}
			}
		}
	}

	_ = os.Remove(instance.Config()) // Ignore error as file might not exist
	return nil
}

// cleanupDiscoveryConfigMap removes the discovery configmap to prevent readiness check confusion.
func cleanupDiscoveryConfigMap() error {

On Windows, os.RemoveAll fails with a sharing violation mid-removal because the handles are still open; the user sees an error and a partially-deleted tree. On Unix, removal succeeds, but then Running() starts returning false (PID file is gone; PID() returns PIDNotFound at pkg/service/cmd/service.go:138-161) even though the serve process remains alive. A subsequent rdd svc start for the same instance now sees !Exists(), recreates the directory, and starts a second serve subprocess that races the first for the secure port (6443 + instance.Index()).

pkg/service/cmd/service.go


// PIDNotFound indicates no running service process was found.
const PIDNotFound = 0

// PID returns the process ID of the running service, or PIDNotFound if it is not running.
func PID() int {
	pidStr, err := os.ReadFile(instance.PIDFile())
	if err != nil {
		return PIDNotFound
	}
	pid, err := strconv.Atoi(string(pidStr))
	if err == nil {
		var process *os.Process
		process, err = os.FindProcess(pid)
		if err == nil {
			// on non-Windows, FindProcess may return without the process being
			// alive; on Windows, the result encapsulates a valid handle.
			if runtime.GOOS != "windows" {
				err = process.Signal(syscall.Signal(0))
			}
			_ = process.Release()
		}
	}
	if err != nil {
		_ = os.Remove(instance.PIDFile())
		return PIDNotFound
	}
	return pid
}

// Running reports whether the service process is alive.
func Running() bool {
	return PID() != PIDNotFound
}

Gemini flagged this as critical; Codex and Claude as important. We consolidate at important: the data being removed is itself destined for deletion, but the PID-file-based mutual exclusion breaks and the Windows partial-deletion path leaves the user with an error.

Fix: always wait for the process to exit before deleting state. Either drop --wait=false from svc delete entirely (deletion cannot usefully be non-blocking since the writer must be gone) or pass true unconditionally to the internal StopWithWait call while still exposing --timeout so users can tune how long to wait:

     if Running() {
-        if err := StopWithWait(wait, timeout); err != nil {
+        // Data deletion requires the writer to be gone before the directory
+        // is removed; --wait=false on svc delete is not safe, so ignore it.
+        if err := StopWithWait(true, timeout); err != nil {
             return err
         }
     }

If --wait=false is preserved for delete, docs/design/cmd_service.md:112 should document the race explicitly.

I2. Shared defaultWaitTimeout = 30s is below the service's own graceful-shutdown budget and below the previous start-readiness budget Claude Codex Gemini¶

cmd/rdd/service.go

	service "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/cmd"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/controllers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/util/tail"
)

// startWaitTimeout bounds the wait for the API server, CRDs, and
// controller-manager registration to land. Cold start can take well over 30s
// in CI on slower runners.
const startWaitTimeout = 90 * time.Second

// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration

The new default wires into three subcommands with different sizing constraints:

Stop / delete path. The serve subprocess allows up to 45s for its controller-manager goroutine to drain (pkg/service/cmd/service.go:826: case <-time.After(45 * time.Second)), and Lima hostagents get 30s of that budget for graceful shutdown. When the CLI's 30s deadline fires first, StopWithWait calls process.Kill(pid) at line 384, aborting the subprocess's own 45s window. The now-removed comment inside StopWithWait captured exactly this: // The service needs up to 30s for graceful hostagent shutdown plus overhead, so allow 60s total. With the new default, a healthy shutdown of a service running a Lima VM will routinely force-kill instead of draining cleanly.

pkg/service/cmd/service.go

	mgrDone := make(chan struct{})
	go func() {
		mgrWg.Wait()
		close(mgrDone)
	}()
	select {
	case <-mgrDone:
	case <-time.After(45 * time.Second):
		klog.Warning("Controller manager did not shut down within 45s, exiting anyway")
	}

cmd/rdd/service.go

}

func newServiceStatusCommand() *cobra.Command {
	command := &cobra.Command{
		Use:  "status",
		Long: "Show control plane status",
		RunE: func(*cobra.Command, []string) error {
			logrus.SetLevel(logrus.InfoLevel)
			logrus.Infof("%q control plane has been created: %v", instance.Name(), service.Exists())
			logrus.Infof("%q control plane has been started: %v", instance.Name(), service.Running())
			logrus.Infof("%q control plane PID is: %v", instance.Name(), service.PID())

Start path. The replaced WaitWithTimeout used 90s with the comment // Increased timeout for CRD establishment. Both ensureServiceRunning call sites (cmd/rdd/service.go:97, 104) still use 90s, which is the correct budget for CRD establishment + controller-manager registration + discovery ConfigMap readiness. The explicit rdd svc start path at line 234 now uses 30s. Either 90s is right in both places or 30s is right in both places — the split is itself evidence that 30s is too tight for the readiness path.

cmd/rdd/service.go

		return err
	}
	logrus.Infof("successfully created %q control plane", instance.Name())
	return nil
}

func newServiceStartCommand() *cobra.Command {
	command := &cobra.Command{
		Use:  "start",
		Long: "Start RDD control plane. When called without parameters, uses default parameters from create. When called with parameters, those override the defaults for this session only.",
		RunE: serviceStartAction,

Test exposure. bats/tests/20-service/2-create.bats:65, 4-start-parameters.bats:45-63, and 5-port-fallback.bats:72, 105, 147 all run run -0 rdd svc stop against services that may have a live Lima VM. Those now risk exit-code-4 flakes when the hostagent drain exceeds 30s.

bats/tests/20-service/2-create.bats

    # stdout may be empty, just verify the command succeeds
    rdd svc logs --stdout
}

@test 'stop instance' {
    run -0 rdd svc stop
    run -0 extract_msg
    assert_output "\"rancher-desktop-${RDD_INSTANCE}\" control plane has stopped"
    assert_dir_exist "${RDD_DIR}"
}

bats/tests/20-service/4-start-parameters.bats

    assert_line configmapreplicaset
    refute_line demo
}

@test 'start instance with parameter override' {
    run -0 rdd svc stop
    run -0 rdd svc start --controllers="*"
    get_enabled_controllers
    assert_line notary
    assert_line configmapreplicaset
    assert_line demo
}

@test 'start instance with controller parameter override' {
    run -0 rdd svc stop
    run -0 rdd svc start --controllers="app"
    get_enabled_controllers
    assert_line demo
    refute_line notary
    refute_line configmapreplicaset
}

@test 'start instance returns to default parameters after override' {
    run -0 rdd svc stop
    # Start with default parameters (should use saved rdd controllers)
    run -0 rdd svc start
    get_enabled_controllers
    assert_line notary
    assert_line configmapreplicaset

Fix: split the one constant into named constants that each match a concrete invariant. For example:

-// defaultWaitTimeout is the default --timeout for subcommands whose wait
-// normally settles within seconds.
-const defaultWaitTimeout = 30 * time.Second
+// startWaitTimeout covers CRD establishment plus controller-manager registration.
+const startWaitTimeout = 90 * time.Second
+
+// stopWaitTimeout must exceed the service's own 45s controller-manager drain
+// (see Run in pkg/service/cmd/service.go) plus overhead.
+const stopWaitTimeout = 60 * time.Second

Then use startWaitTimeout in both newServiceStartCommand and ensureServiceRunning, and stopWaitTimeout in newServiceStopCommand and newServiceDeleteCommand.

I3. Exit-code-4 contract is asserted in exactly one place Claude¶

The PR introduces a uniform CLI contract: every wait-capable subcommand exits with code 4 on --timeout expiry. The only assertion of that contract is the mechanically updated limavm start --timeout test (run -1 → run -4). None of the new paths are exercised:

rdd svc start --wait --timeout=… (uses timeoutError at cmd/rdd/service.go:294)
rdd svc stop --wait --timeout=… (at cmd/rdd/service.go:326)
rdd svc delete --wait --timeout=… (at cmd/rdd/service.go:372)
rdd limavm stop|restart|delete|create --start --timeout=… (at cmd/rdd/limavm.go:235, 415, 455, 556)
--timeout=0 (indefinite wait) on any command

cmd/rdd/service.go

	}
	if cmd.Flags().Changed("secure-port") {
		securePort, err := cmd.Flags().GetInt("secure-port")
		if err != nil {
			return err
		}
		serveArgs = append(serveArgs, "--secure-port", strconv.Itoa(securePort))
	}
	serveArgs = append(serveArgs, "-v", logrusLevelToKlog())
	serveArgs = append(serveArgs, args...)

cmd/rdd/service.go

		return nil
	}

	wait, err := cmd.Flags().GetBool("wait")
	if err != nil {
		return err
	}
	timeout, err := cmd.Flags().GetDuration("timeout")
	if err != nil {
		return err
	}

cmd/rdd/service.go

		logrus.Infof("%q control plane does not exist", instance.Name())
		return nil
	}
	timeout, err := cmd.Flags().GetDuration("timeout")
	if err != nil {
		return err
	}
	if err := service.Delete(timeout); err != nil {
		return timeoutError(err)
	}
	logrus.Infof("%q control plane has been deleted", instance.Name())

A regression that returns the underlying error directly instead of the cliexit.Timeout envelope will ship green. The pre-existing TODO at bats/tests/20-service/2-create.bats:3 (# TODO: test rdd svc start --wait=false) also becomes newly relevant: the PR formalizes that flag's semantics but the test is still missing.

bats/tests/20-service/2-create.bats

load '../../helpers/load'

# TODO: test `rdd svc start --wait=false`

@test 'Make sure instance does not exist' {
    rdd svc delete || :
    assert_dir_not_exist "${RDD_DIR}"
}

Fix: add at least one test per subcommand that forces a timeout and asserts exit code 4. The svc start --timeout case is the simplest to provoke — run against an instance whose controllers cannot reach ready.

Suggestions ¶

S1. docs/design/cmd_lima.md still documents the pre-PR defaults Codex¶

docs/design/cmd_lima.md


`TEMPLATE` will be treated as a local filename or template URL if it contains a `/` or a `:`.  In that case the `create` command will create a ConfigMap in `NAMESPACE` called `NAME`. It will store the "fully embedded" template from the file or URL inside that ConfigMap and use it as the `spec.templateRef`. If a ConfigMap of this name already exists, then the `create` command will fail. If `LimaVM` creation succeeds then ownership of this ConfigMap is transferred to the `LimaVM` resource, and it will be deleted when the `LimaVM` instance is deleted. But when the command fails, then any ConfigMap created from a file or URL is deleted immediately.

- `--start` (default `false`): Set `spec.running=true` to start the VM immediately after creation.
- `--wait` (default `true`): When used with `--start`, wait until the Running condition becomes True before returning.
- `--timeout` (default `5m`): Maximum time to wait. A value of `0` waits indefinitely. Accepts Go duration strings (e.g., `30s`, `5m`). If the deadline expires, `rdd` exits with code 4.

### `rdd limavm start NAME`

Set `spec.running` of the specified instance to `true`. There is no `--namespace` option because `LimaVM` names are globally unique (within the control plane).

The implementation now uses limaLongWaitTimeout = 5 * time.Minute for limavm create/start/stop/restart/delete and flips limavm delete to --wait=true by default (cmd/rdd/limavm.go:150, 166, 182, 199, 253-254). The design doc still says --timeout defaults to 0 and limavm delete defaults to --wait=false, so the documented blocking behavior now contradicts the CLI.

Fix: update docs/design/cmd_lima.md with the new defaults and the exit-code-4 semantics, mirroring the update already made to docs/design/cmd_service.md.

S2. defaultWaitTimeout constant has no rationale Claude¶

cmd/rdd/service.go

	service "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/cmd"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/controllers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/util/tail"
)

// startWaitTimeout bounds the wait for the API server, CRDs, and
// controller-manager registration to land. Cold start can take well over 30s
// in CI on slower runners.
const startWaitTimeout = 90 * time.Second

// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration

The previous code carried a sizing comment at the wait site explaining why 60s was the right number ("service needs up to 30s for graceful hostagent shutdown plus overhead"). After this PR the constant is shared but the rationale is gone. Whoever next tunes this number will not know which path it has to cover. Tie the constant's value to a concrete invariant (e.g., "must exceed the 45s controller-manager drain in pkg/service/cmd/service.go Run"). This pairs with I2.

S3. docs/design/cmd_service.md says "Sends SIGTERM" but the Unix path sends SIGINT Claude¶

docs/design/cmd_service.md

    If the deadline expires, `rdd` exits with code 4.


### `rdd service stop`

Sends SIGINT to the control plane process (`rdd.pid`) and waits up to 5 minutes for
it to exit. On Windows the signal is delivered as `CTRL_BREAK_EVENT`. Pass
`--wait=false` to return immediately, or `--timeout=0` to wait indefinitely. If the
deadline expires, `rdd` exits with code 4. The default matches the per-VM ceiling
because graceful shutdown will eventually stop every running LimaVM before the
service exits.

StopWithWait calls process.Interrupt(pid) first (pkg/service/cmd/service.go:358), and Interrupt on Unix is unix.Kill(pid, unix.SIGINT) (pkg/util/process/process_unix.go:26-28). process.Kill (which does send SIGTERM) is only used as a fallback when Interrupt fails or on the timeout path. The PR edited this doc section for the new flag behavior but did not fix the pre-existing signal name:

pkg/service/cmd/service.go

	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

pkg/util/process/process_unix.go

	}
	cmd.SysProcAttr.Setpgid = true
}

// Interrupt sends SIGINT to the process with the given PID.
func Interrupt(pid int) error {
	return unix.Kill(pid, unix.SIGINT)
}

// Kill sends SIGTERM to the process with the given PID.
func Kill(pid int) error {
	return unix.Kill(pid, unix.SIGTERM)
}

-Sends SIGTERM to the control plane process (`rdd.pid`) and waits up to 30s for it to exit.
+Sends SIGINT to the control plane process (`rdd.pid`) and waits up to 30s for it to exit.
+On Windows this is delivered as `CTRL_BREAK_EVENT`.

S4. limavm delete default flipped from --wait=false to --wait=true with a 5-minute default Claude¶

cmd/rdd/limavm.go

		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			return limaVMDeleteAction(cmd.Context(), args[0], wait, timeout)
		},
	}
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the VM to be deleted before returning")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
	return command
}

// getKubeClient returns a controller-runtime client configured for the RDD control plane.
// The returned client supports Watch for event-driven waiting on resource changes.

This is a behavior change in addition to a flag rename. Scripts that ran rdd limavm delete <name> and then moved on will now block up to 5 minutes (where they used to return in milliseconds). The PR description mentions the default-to-true move, but the design doc and commit body do not call out that limavm delete in particular changed. bats/tests/33-lima-controllers/limavm-cli.bats:129, 196 already call rdd limavm delete <name> followed by rdd ctl wait --for=delete; under the new default those two waits are redundant.

Fix: call out the behavior change in the design doc (see S1) and consider whether the existing BATS tests should be updated to reflect the new contract.

S5. svc delete log message lies when the deletion races a still-running serve Claude¶

cmd/rdd/service.go

	}
	timeout, err := cmd.Flags().GetDuration("timeout")
	if err != nil {
		return err
	}
	if err := service.Delete(timeout); err != nil {
		return timeoutError(err)
	}
	logrus.Infof("%q control plane has been deleted", instance.Name())
	return nil
}

If I1 is fixed by always waiting on delete, this is moot. If --wait=false stays, the log line lies — the process is mid-shutdown and the directory may be partially gone. Replace with something like "%q control plane delete initiated" when wait=false.

S6. Typo: "err" → "error" in timeoutError comment Claude¶

cmd/rdd/service.go


// startWaitTimeout bounds the wait for the API server, CRDs, and
// controller-manager registration to land. Cold start can take well over 30s
// in CI on slower runners.
const startWaitTimeout = 90 * time.Second

// stopWaitTimeout bounds the wait for the control plane to shut down. The
// service itself drains its controller-manager goroutines for up to 45s
// (see Run in pkg/service/cmd/service.go); when LimaVM stop orchestration
// lands, the drain will sequence each VM's shutdown, so the CLI deadline must
// match the per-VM ceiling in [limaLongWaitTimeout].
const stopWaitTimeout = limaLongWaitTimeout

// timeoutError wraps a deadline-expiry error with [cliexit.Timeout] so the CLI
// exits with [cliexit.CodeTimeout], matching `rdd set`. Other errors pass
// through unchanged.
func timeoutError(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return cliexit.Timeout(err)

-// timeoutError wraps a deadline-expiry err with [cliexit.Timeout] so the CLI
+// timeoutError wraps a deadline-expiry error with [cliexit.Timeout] so the CLI

S7. service.Stop() removal is not called out in the commit body Claude¶

The deleted service.Stop() wrapper was labeled "for backward compatibility" but had only one in-tree caller. Removing it is correct, but the commit body should mention that the public surface narrowed so a future reader does not go looking for an external consumer.

S8. PR description lists limavm create under the 30s timeout bucket, but the code assigns it 5m Gemini¶

The PR description says limavm create gets 30s; the code correctly uses limaLongWaitTimeout = 5 * time.Minute because the wait covers the VM boot cycle when --start is passed. Align the PR description / release notes with the code.

Design Observations ¶

Concerns ¶

One defaultWaitTimeout collapses three distinct sizing constraints (in-scope) Claude Codex. svc start covers CRD establishment + controller-manager registration, svc stop covers the serve subprocess's own 45s controller-manager drain, and svc delete covers stop-then-cleanup. Those invariants have different right answers; the PR uses one 30s constant for all three and still needs a separate 5-minute limaLongWaitTimeout for VM boot. Three named constants — each tied to an invariant in its doc comment — would make tuning any one of them safer. See I2.
svc delete mixes process-shutdown with durable-state removal (in-scope) Claude Codex. The --wait=false footgun in I1 is structural: deleting instance.Dir() while the writer still owns it cannot be made safe with better timeouts or doc warnings. The simplest design is "delete always waits for the process to exit." If a non-blocking delete is needed, it needs a two-phase design (e.g., a "pending deletion" marker) that defers filesystem cleanup until after the daemon is confirmed dead.

Strengths ¶

errors.Is(err, context.DeadlineExceeded) plus fmt.Errorf("…%w", err) is the right classification primitive. It traverses the wrap chain so both the limavm.go wrap and the service.go wrap flow through errors.As(err, &cliexit.Error) in main.go without a sentinel string match. Claude
watchtools.ContextWithOptionalTimeout is used consistently at every wait site, so the "timeout=0 → no deadline" semantics live in one place instead of six independent timer constructions. Claude
StopWithWait now blocks on <-ctx.Done() instead of <-time.After(60 * time.Second), so a parent-context cancellation (e.g., Ctrl-C) actually interrupts the wait. Claude
Routing cliexit.Timeout through a small helper at the CLI boundary (timeoutError) keeps cliexit concerns out of the lower-level service package. Codex

Testing Assessment ¶

Exit-code-4 is asserted in one place (limavm-cli.bats:208); see I3.
--timeout=0 (indefinite wait) is not exercised by any svc or limavm test.
rdd svc delete --wait=false is untested — the most dangerous new mode, per I1, has zero coverage.
rdd svc stop against a live hostagent is not covered; this is exactly the scenario where the new 30s default (I2) is too tight.
The pre-existing TODO at bats/tests/20-service/2-create.bats:3 (# TODO: test rdd svc start --wait=false) becomes newly relevant but is unchanged.

Documentation Assessment ¶

docs/design/cmd_service.md was updated for the flag rename (correct) but still misstates the signal name (S3) and does not flag the svc delete --wait=false race (I1).
docs/design/cmd_lima.md was not touched and now contradicts the CLI defaults (S1).
Neither doc names the relationship between the client-side timeout and the serve subprocess's 45s controller-manager drain — the actual sizing constraint for svc stop (I2, S2).
The comment at cmd/rdd/service.go:107-114 still says "the API server readiness poll and the ConfigMap freshness poll share the single deadline", which holds only when timeout > 0. Minor.

Commit Structure ¶

Single commit, single concept, message accurately describes the change. The commit body could mention the limavm delete --wait default flip (S4) and the service.Stop() / service.WaitWithTimeout() removals (S7).

Acknowledged Limitations ¶

pkg/service/cmd/service.go:444-449 already comments the Windows os.Rename / FILE_SHARE_DELETE limitation in preserveAllInstanceLogs; that failure mode becomes more likely if I1 is left as-is but the underlying limitation is already named.
cmd/rdd/limavm.go:39-41 justifies limaLongWaitTimeout = 5 * time.Minute with a comment — consistent with the VM boot/shutdown invariant.

Unresolved Feedback ¶

No prior PR review comments on this PR.

Agent Performance Retro ¶

Claude ¶

Claude produced the most thorough pass: 5 important + 5 suggestion findings with the deepest technical reasoning. Its analysis of I1 traced the PID-file removal through PID() to uncover that Running() starts lying on the next invocation — the subtlest and most important aspect of that finding, which Codex and Gemini identified but did not explain. It also caught the SIGINT/SIGTERM doc drift that the other two missed, and separated the start-readiness-budget regression (I3 in its report) from the stop-drain-budget regression that all three agents flagged. One finding (limavm delete default flip) was downgraded to a suggestion because the PR description already documents the change; the substance of the observation remains useful for the commit body and design docs.

Codex ¶

Codex was the only agent to catch the stale docs/design/cmd_lima.md — a real doc drift that survives because the PR updated only the sibling file. Its two important findings overlap with the other agents' but its writeup on I2 pinned the sizing argument concisely to the service's advertised shutdown budgets (45s controller-manager drain, 30s hostagent) and its design observation on "shutdown and state removal are separate responsibilities" sharpened the fix direction. No false positives; one unique finding.

Gemini ¶

Gemini produced a compact report (1 critical + 1 important + 1 suggestion) and raced through the review quickly. It flagged the svc delete --wait=false race at critical severity — we downgraded to important after analysis, because the data being deleted is itself destined for removal and the concurrency exposure (though real) falls short of the critical bar. Its I1 bundled the start and stop timeout-default regressions into one finding; Claude and Codex separated them. One PR-description-vs-code suggestion (limavm create timeout category) that no other agent raised. It did not flag the SIGTERM/SIGINT doc drift or the doc drift in cmd_lima.md.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	19m 49s	6m 43s	3m 13s
Findings	3I 6S	2I 1S	2I 1S
Tool calls	35 (Bash 25, Read 10)	57 (shell 52, stdin 5)	10 (runshellcommand 6, readfile 3, grepsearch 1)
Design observations	5	2	0
False positives	0	0	0
Unique insights	6	1	1
Files reviewed	5	5	5
Coverage misses	0	0	0
Totals	3I 6S	2I 1S	2I 1S
Downgraded	1 (I→S)	0	1 (I→S)
Dropped	0	0	0

Reconciliation. Gemini's C1 (svc delete --wait=false corrupts data) was downgraded to I1 (important) during consolidation: the data being removed is destined for deletion, so "data corruption" overstates the impact; the PID-file / mutual-exclusion break (Claude's angle) is real but does not meet the critical bar. Claude's I4 (limavm delete default flipped to wait) was downgraded to S4 (suggestion): the PR description documents the default change, so the finding is a doc/commit-body polish rather than a behavioral regression.

Claude found the most unique material (SIGINT/SIGTERM, missing test coverage of the exit-code-4 contract, lost rationale comment, typo, PID-file race in I1). Codex's coverage of docs/design/cmd_lima.md was the clearest single-agent contribution. Gemini's compact output matches its lower-effort setting; for this PR, the other two agents covered what Gemini surfaced.

Review Process Notes ¶

Repo context updates ¶

Add: When a CLI change moves a hardcoded timeout to a user-configurable flag default, flag any other code path that uses the same semantic budget. Previously-correlated hardcoded timeouts often survive the refactor (see ensureServiceRunning still using 90s while rdd svc start --timeout defaults to 30s) and produce behavior splits that no test will catch. Reason: reviewers who compare the diff line-by-line miss these; the pattern requires a global search for the old value after the refactor.
Add: RDD CLI exit codes are defined in pkg/cli/exit/exit.go (currently CodeRejected=3, CodeTimeout=4). When a PR introduces or broadens a contract that uses these codes, expect an assertion in the BATS suite using run -N. A contract asserted in only one place is a gap — a later regression that returns the raw error without the cliexit.Error envelope ships green.