Deep Review: 20260421-225153-pr-336

Date	2026-04-21 22:51
Repo	rancher-sandbox/rancher-desktop-daemon
Round	2
Author	@jandubois
PR	#336 — cli: standardize --wait and --timeout across svc and limavm
Branch	`wait-timeout-cleanup`
Commits	`8aa55e4` cli: address deep-review findings for #336 `99aa4e3` cli: standardize --wait and --timeout across svc and limavm
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with minor fixes — Round 1's Important findings (I1 delete-under-live-process race, I2 shared default timeout) are correctly resolved in `8aa55e4`. One new Important gap remains: `ensureServiceRunning` timeouts still exit with code 1 instead of 4, so the advertised uniform exit-code-4 contract is not uniform for the service-autostart path invoked from `rdd set`, `rdd limavm `, and `rdd kubectl`. The rest is documentation polish and still-missing BATS coverage for the new `svc ` timeout paths.
Wall-clock time	`2 h 4 min 21 s`

Executive Summary ¶

8aa55e4 correctly addresses the Round 1 Important findings: --wait=false is removed from svc delete (the directory-removal-under-live-process race is gone), the single defaultWaitTimeout = 30s is replaced with two named constants (startWaitTimeout = 90s, stopWaitTimeout = limaLongWaitTimeout = 5m) each anchored to a concrete invariant in its doc comment, and the SIGTERM/SIGINT signal-name mismatch in docs/design/cmd_service.md is fixed. docs/design/cmd_lima.md is now consistent with the code, and two new BATS tests cover lima create --start --timeout and lima restart --timeout exit-code-4 paths.

The remaining items split cleanly. One Important gap: ensureServiceRunning (called by rdd set, every rdd limavm * path via getKubeClient, and rdd kubectl) returns raw context.DeadlineExceeded without routing through timeoutError, so a service-start timeout during those commands exits with code 1 instead of the advertised code 4. The rest is documentation polish: the PR description and commit body still describe pre-8aa55e4 defaults, docs/design/api_service.md still says svc stop delivers SIGTERM (the companion doc cmd_service.md was fixed but not this one), and docs/design/cmd_service.md's svc delete section lacks a cross-reference to svc stop for the signal/fallback details. The new svc start/svc stop/svc delete timeout paths still have no BATS coverage — 8aa55e4's new tests cover the lima paths only, so Round 1's I3 is partially addressed but leaves the svc * side of the contract unasserted.

A small Delete TOCTOU (self-heals on retry) and a subtle readiness-contract gap on the documented --wait=false → rdd svc start workflow round out the suggestion list.

Critical Issues ¶

None.

Important Issues ¶

I1. ensureServiceRunning timeouts bypass the cliexit.Timeout envelope Codex Gemini¶

cmd/rdd/service.go

		RunE:  serviceConfigAction,
	}
	return command
}

func ensureServiceRunning(ctx context.Context) error {
	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return err
		}
		return waitForDiscoveryConfigMap(ctx)
	}
	return startAndWaitForReady(ctx, nil, startWaitTimeout)
}

// startAndWaitForReady starts the service, waits for the API server and the
// discovery ConfigMap to be ready. After an unclean shutdown the ConfigMap may
// survive with stale data, so the freshness check waits for a ConfigMap whose
// creationTimestamp is at or after the current startup.

ensureServiceRunning now enforces the new startWaitTimeout (90s) on every service-autostart path, but it returns raw context.DeadlineExceeded without routing through timeoutError. Callers — getKubeClient in cmd/rdd/limavm.go:261 (used by every rdd limavm * subcommand), rdd set at cmd/rdd/set.go:110 and :566, rdd kubectl at cmd/rdd/kubectl.go:34, and rdd service config at cmd/rdd/service.go:182 — all propagate the error without wrapping. cmd/rdd/main.go's errors.As(err, &cliexit.Error) check then sees a bare deadline error and exits with code 1, contradicting the PR's stated uniform exit-code-4 contract. The svc start path itself is fine (line 302 wraps startAndWaitForReady directly with timeoutError), but the shared ensureServiceRunning entry point five other commands use is not.

cmd/rdd/limavm.go

}

// getKubeClient returns a controller-runtime client configured for the RDD control plane.
// The returned client supports Watch for event-driven waiting on resource changes.
func getKubeClient(ctx context.Context) (client.WithWatch, error) {
	if err := ensureServiceRunning(ctx); err != nil {
		return nil, err
	}
	config, err := service.GetKubeRestConfig()
	if err != nil {
		return nil, fmt.Errorf("failed to get kubeconfig: %w", err)

cmd/rdd/set.go

}

// fetchPropertyHelp starts the service if needed, fetches the CRD schema, and
// returns a formatted list of available properties.
func fetchPropertyHelp(ctx context.Context) string {
	if err := ensureServiceRunning(ctx); err != nil {
		return ""
	}
	config, err := service.GetKubeRestConfig()
	if err != nil {
		return ""

cmd/rdd/kubectl.go

	}
	return command
}

func ctlAction(cmd *cobra.Command, args []string) error {
	if err := ensureServiceRunning(cmd.Context()); err != nil {
		return err
	}
	if len(args) > 0 && args[0] == "wait-condition" {
		return ctlWaitConditionAction(cmd, args[1:])
	}

cmd/rdd/service.go

		return cm.Annotations[controllers.ReadyAnnotation] == "true", nil
	})
}

func serviceConfigAction(cmd *cobra.Command, _ []string) error {
	if err := ensureServiceRunning(cmd.Context()); err != nil {
		return err
	}
	contents, err := service.GetKubeconfig()
	if err != nil {
		return err

cmd/rdd/service.go

	serveArgs = append(serveArgs, "-v", logrusLevelToKlog())
	serveArgs = append(serveArgs, args...)

	if wait {
		if err := startAndWaitForReady(cmd.Context(), serveArgs, timeout); err != nil {
			return timeoutError(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
	} else {
		if err := service.Start(cmd.Context(), serveArgs); err != nil {
			return err

Fix: wrap the error at the source so every caller inherits the contract without a separate timeoutError call at each site:

 func ensureServiceRunning(ctx context.Context) error {
     if !service.Exists() {
         if err := service.Create(nil); err != nil {
             return err
         }
     }
     if service.Running() {
         ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
         defer cancel()
-        if err := service.Wait(ctx); err != nil {
-            return err
-        }
-        return waitForDiscoveryConfigMap(ctx)
+        if err := service.Wait(ctx); err != nil {
+            return timeoutError(err)
+        }
+        return timeoutError(waitForDiscoveryConfigMap(ctx))
     }
-    return startAndWaitForReady(ctx, nil, startWaitTimeout)
+    return timeoutError(startAndWaitForReady(ctx, nil, startWaitTimeout))
 }

(important, gap — consolidated from Codex I1 and Gemini I2; both agents independently traced the unwrapped error path.)

Suggestions ¶

S1. Delete leaves the directory in place when the process exits between outer and inner Running() checks Claude¶

pkg/service/cmd/service.go

}

// StopWithWait sends a shutdown signal to the service. When wait is true, it
// blocks until the process exits or timeout elapses; pass timeout 0 to wait
// indefinitely.
func StopWithWait(wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Clean up discovery configmap while cluster is still accessible
	_ = cleanupDiscoveryConfigMap() // Clean up discovery configmap to prevent stale controller info

Pre-PR, Delete() called _ = Stop() and discarded the error, so a process that exited between any two checks did not block deletion. Post-PR, Delete(timeout) propagates the StopWithWait error verbatim. A narrow window exists between Delete's outer Running() and StopWithWait's inner one in which the serve subprocess can exit (crash, OOM-kill, external kill, concurrent rdd svc stop). When that happens, StopWithWait returns "… control plane is not running", Delete propagates it, and os.RemoveAll(instance.Dir()) never runs. The user sees exit code 1 and a confusing message, even though the desired end state is one os.RemoveAll away. The condition self-heals on retry — the next invocation sees !Running(), skips the stop branch, and proceeds to delete.

Fix: treat "already not running" as success in Delete by re-checking after the stop failure:

     if Running() {
         if err := StopWithWait(true, timeout); err != nil {
-            return err
+            // The process may have exited between our Running check and
+            // StopWithWait's; in that case its work is already done.
+            if Running() {
+                return err
+            }
         }
     }

(suggestion, regression)

S2. Exit-code-4 contract remains asserted only on lima paths; the new svc * timeout paths have no coverage Claude¶

bats/tests/33-lima-controllers/limavm-cli.bats


    # Delete the VM
    run -0 rdd limavm delete "restart-vm"
}

@test "lima start --timeout fails when VM cannot reach desired state" {
    rdd ctl create configmap "timeout-vm" --namespace "${LIMA_TEST_NS}" \
        --from-literal="template=${TEMPLATE}"
    run_e -0 rdd limavm create "timeout-vm" "timeout-vm" --namespace "${LIMA_TEST_NS}"
    assert_created "timeout-vm" "${LIMA_TEST_NS}" "timeout-vm"

    # Non-functional template cannot boot, so --wait --timeout should fail
    # with exit code 4 (cliexit.CodeTimeout) to match `rdd set`.
    run -4 rdd limavm start --wait --timeout=3s "timeout-vm"
    assert_output --partial 'context deadline exceeded'
}

@test "lima create --start --timeout fails when VM cannot boot" {
    rdd ctl create configmap "create-timeout-vm" --namespace "${LIMA_TEST_NS}" \
        --from-literal="template=${TEMPLATE}"

    # Create with --start: the VM never reaches Running=True because the
    # template is non-functional, so --wait --timeout exits with code 4.
    run -4 rdd limavm create "create-timeout-vm" "create-timeout-vm" \
        --namespace "${LIMA_TEST_NS}" --start --wait --timeout=3s
    assert_output --partial 'context deadline exceeded'
}

@test "lima restart --timeout fails when VM cannot restart" {
    rdd ctl create configmap "restart-timeout-vm" --namespace "${LIMA_TEST_NS}" \
        --from-literal="template=${TEMPLATE}"
    run_e -0 rdd limavm create "restart-timeout-vm" "restart-timeout-vm" --namespace "${LIMA_TEST_NS}"
    assert_created "restart-timeout-vm" "${LIMA_TEST_NS}" "restart-timeout-vm"

    # Restart waits for status.restartCount to increment, which requires the
    # VM to come back up. The non-functional template never boots, so the
    # counter stays at zero and --wait --timeout exits with code 4.
    run -4 rdd limavm restart --wait --timeout=3s "restart-timeout-vm"
    assert_output --partial 'context deadline exceeded'
}

@test "lima help text is displayed" {
    # lima is an alias of the limavm command
    run -0 rdd lima --help
    assert_output --partial "LimaVM virtual machines"

8aa55e4 adds two new exit-code-4 tests on top of the one 99aa4e3 updated, all inside lima subcommands. None of the four new svc * timeout paths the PR introduces carry coverage:

rdd svc start --wait --timeout=… (wraps startAndWaitForReady at cmd/rdd/service.go:302)
rdd svc start against an already-running service (wraps service.Wait at cmd/rdd/service.go:278)
rdd svc stop --wait --timeout=… (wraps service.StopWithWait at cmd/rdd/service.go:334)
rdd svc delete --timeout=… (wraps service.Delete at cmd/rdd/service.go:375)
--timeout=0 (indefinite) on any of these paths

cmd/rdd/service.go

			return nil
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		return timeoutError(service.Wait(ctx))
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {

cmd/rdd/service.go

	serveArgs = append(serveArgs, "-v", logrusLevelToKlog())
	serveArgs = append(serveArgs, args...)

	if wait {
		if err := startAndWaitForReady(cmd.Context(), serveArgs, timeout); err != nil {
			return timeoutError(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
	} else {
		if err := service.Start(cmd.Context(), serveArgs); err != nil {
			return err

cmd/rdd/service.go

	if err != nil {
		return err
	}

	if err := service.StopWithWait(wait, timeout); err != nil {
		return timeoutError(err)
	}
	if wait {
		logrus.Infof("%q control plane has stopped", instance.Name())
	} else {
		logrus.Infof("%q control plane is stopping", instance.Name())

cmd/rdd/service.go

	timeout, err := cmd.Flags().GetDuration("timeout")
	if err != nil {
		return err
	}
	if err := service.Delete(timeout); err != nil {
		return timeoutError(err)
	}
	logrus.Infof("%q control plane has been deleted", instance.Name())
	return nil
}

A regression that returns the wrapped error directly without going through timeoutError will ship green. This restates Round 1 I3; the fixup commit partially addressed it (lima paths now covered) but left the svc side unasserted. The simplest provoke for svc start is creating the control plane with a broken controller manifest; the simplest for svc stop is the setup from bats/tests/33-lima-controllers/limavm-running.bats:327 ("graceful service stop terminates hostagent"). At minimum, add one svc * timeout assertion so the contract is not lima-only.

bats/tests/33-lima-controllers/limavm-running.bats


# Graceful shutdown test
# Verifies that stopping the service terminates all running hostagents.
# The controller's shutdown runnable calls shutdownAllHostagents() on exit.

@test "graceful service stop terminates hostagent" {
    local ha_pid
    ha_pid=$(get_hostagent_pid "${VM_NAME}")
    assert [ -n "${ha_pid}" ]
    assert_process_alive "${ha_pid}"

(suggestion, gap — restates Round 1 I3 partially addressed.)

S3. rdd svc start on an already-running service returns before the ReadyAnnotation is set Claude Codex¶

cmd/rdd/service.go

		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !wait {
			return nil
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		return timeoutError(service.Wait(ctx))
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {

docs/design/cmd_service.md:93-95 explicitly advertises the flow "--wait=false … then run rdd service start again with no options to wait." The cold-start path (startAndWaitForReady) waits for both service.Wait(ctx) and waitForFreshDiscoveryConfigMap(ctx, beforeStart) — the latter requires ReadyAnnotation == "true", which the controller manager sets only after every expected controller has registered. The already-running path calls only service.Wait(ctx), which via checkReadiness returns success as soon as the discovery ConfigMap lists registered controllers — before the ready annotation. A concurrent --wait=false followed by rdd svc start during a still-initializing daemon therefore returns success under a narrower contract than the cold-start path guarantees.

docs/design/cmd_service.md


Runs `rdd service serve` to actually start the control plane in the background

Additional options:

*   `--wait=false`
    Return immediately, without waiting for all controllers to be ready. Run `rdd service start`
    again with no options to wait.

*   `--timeout=90s`
    How long to wait for the control plane to become ready. Pass `0` to wait indefinitely.
    If the deadline expires, `rdd` exits with code 4.

Codex raised this at Important because the PR's new documentation names the workflow explicitly; Claude raised it at Suggestion because the race window is narrow (controllers finish registering quickly, and a steady-state instance is always fine). We consolidate at Suggestion: the practical exposure is a narrow mid-startup race, but the documented workflow deserves a one-line fix to deliver the advertised guarantee.

Fix: mirror ensureServiceRunning's pattern and wait for the discovery ConfigMap on the already-running branch too:

-    return timeoutError(service.Wait(ctx))
+    if err := service.Wait(ctx); err != nil {
+        return timeoutError(err)
+    }
+    return timeoutError(waitForDiscoveryConfigMap(ctx))

(suggestion, gap — pre-existing; the PR's new docs make it newly relevant.)

S4. BATS tests still pass --wait explicitly on limavm delete, which is now the default Claude¶

bats/tests/33-lima-controllers/limavm-cli.bats

    run -0 rdd limavm stop --wait=false "test-vm"
    assert_output --partial 'stopped'
    assert_running "false"

    # Delete the VM
    run -0 rdd limavm delete --wait "test-vm"
    assert_output --partial 'deleted'
    run -1 rdd ctl get limavm "test-vm" --namespace "${LIMA_TEST_NS}"
    assert_output --partial "not found"
}

limavm delete now defaults to --wait=true (cmd/rdd/limavm.go:253). The explicit flag is redundant and signals to a future reader that the flag is load-bearing — when the opposite is now true. Drop the flag:

cmd/rdd/limavm.go

		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			return limaVMDeleteAction(cmd.Context(), args[0], wait, timeout)
		},
	}
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the VM to be deleted before returning")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
	return command
}

// getKubeClient returns a controller-runtime client configured for the RDD control plane.

-run -0 rdd limavm delete --wait "test-vm"
+run -0 rdd limavm delete "test-vm"

(suggestion, gap)

S5. PR description and 99aa4e3 commit body describe the pre-8aa55e4 default timeouts Claude¶

The PR description lists:

> Default timeouts: > - 30s: svc start, svc stop, svc delete, limavm create > - 5m: limavm start, limavm stop, limavm restart, limavm delete

After 8aa55e4:

svc start → startWaitTimeout = 90s
svc stop, svc delete → stopWaitTimeout = limaLongWaitTimeout = 5m
All limavm paths → limaLongWaitTimeout = 5m

The "30s bucket" no longer exists. Update the PR description, and either amend 99aa4e3 or extend 8aa55e4's body with the final contract so a future git log reader sees the shipped defaults. This also picks up Round 1 S7 (the service.Stop() / service.WaitWithTimeout() removals are still unnamed in either commit body) and Round 1 S8 (the stale limavm create 30s line).

(suggestion, gap — re-raises Round 1 S7 and S8 unaddressed.)

S6. docs/design/cmd_service.md does not cross-reference svc delete back to svc stop Claude¶

docs/design/cmd_service.md

`--wait=false` to return immediately, or `--timeout=0` to wait indefinitely. If the
deadline expires, `rdd` exits with code 4. The default matches the per-VM ceiling
because graceful shutdown will eventually stop every running LimaVM before the
service exits.

### `rdd service delete`

Stops the control plane and removes the instance directory, the short directory
(which contains the Lima home), and — unless `RDD_KEEP_LOGS` is set — the log
directory.

Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place.

### `rdd service reset`

Deletes the datastore, but create a new (empty) one with the same settings.

The 99aa4e3 commit body says the change folds the shutdown-wait semantics of svc delete into a reference to svc stop, but the actual doc carries no cross-reference. A reader must know that svc delete uses svc stop's mechanism to understand which signal is sent (SIGINT / CTRLBREAKEVENT, with Kill fallback). A one-line "See svc stop for the signal/fallback details" closes the gap without duplicating content.

(suggestion, gap)

S7. docs/design/api_service.md still describes svc stop as delivering SIGTERM Codex¶

docs/design/api_service.md


* It starts any additional controllers configured via the `config` ConfigMap.

## Service shutdown

The control plane will be notified to shut down, either by receiving a SIGTERM signal from the user (via `rdd service stop`), or by the OS preparing to shut down the host.

Some controllers need to be notified of pending shutdown so they can gracefully terminate, e.g. shut down virtual machines.

Any controller can request shutdown notification by creating a ConfigMap in the `rdd-system` namespace:

cmd_service.md was updated to say SIGINT (with the Windows CTRL_BREAK_EVENT mapping) and the implementation calls process.Interrupt(pid) first at pkg/service/cmd/service.go:358-362, but the architecture doc api_service.md still names SIGTERM. The two docs now disagree on the shutdown contract — exactly what this PR is standardizing.

pkg/service/cmd/service.go

	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

	if wait {
		ctx := context.Background()
		if timeout > 0 {
			var cancel context.CancelFunc

Fix: update docs/design/api_service.md:21 to match: rdd service stop sends SIGINT on Unix or CTRL_BREAK_EVENT on Windows, then falls back to forced termination if graceful shutdown cannot be initiated or does not complete in time.

(suggestion, gap — pre-existing; the PR's companion doc update made the divergence worse.)

S8. Inconsistent alias for the client-go watch tools import Gemini¶

cmd/rdd/service.go imports k8s.io/client-go/tools/watch as watchtools, while cmd/rdd/limavm.go imports the same package as toolswatch. Standardize on one alias across the CLI to keep a quick text search from missing call sites.

(suggestion, gap)

Design Observations ¶

Concerns ¶

Two readiness definitions coexist in pkg/service/cmd/service.go (in-scope) Codex. service.Wait() proves API readiness plus best-effort controller discovery; startAndWaitForReady() adds the ReadyAnnotation + freshness check. Unifying the two into a single readiness helper removes the mismatch behind S3 and gives every future caller the same contract. Matches the shape of S3's fix.
StopWithWait constructs context.Background() instead of accepting the caller's context (future) Claude Gemini. The function signature is StopWithWait(wait bool, timeout time.Duration) rather than the idiomatic StopWithWait(ctx context.Context, wait bool, timeout time.Duration). Today this is a no-op — cmd/rdd/main.go:153 passes context.Background() and never installs signal-to-context plumbing — but if the CLI ever grows a signal handler that cancels cmd.Context(), the shutdown wait will not observe it. Plumbing the context through costs one parameter and matches every other wait site in the CLI.
Pre-existing PID-recycling race in StopWithWait (future) Claude. Between Running() at line 339 and pid := PID() at line 346, the service can exit and PID() can find the PID file stale and return PIDNotFound (0). process.Interrupt(0) on Unix becomes unix.Kill(0, SIGINT), which signals every process in the calling process's process group — including the rdd CLI itself and its parent shell. The race is narrow and existed before this PR, but the refactor did not close it.

Strengths ¶

Removing --wait=false from svc delete is the correct fix for Round 1 I1 Claude Codex. No safe semantics exist for "delete without waiting" when the writer still owns the directory, and docs/design/cmd_service.md now names the safety reason explicitly.
startWaitTimeout and stopWaitTimeout each tie their value to a concrete invariant in the doc comment Claude. startWaitTimeout cites CRD establishment and controller-manager registration; stopWaitTimeout cites the 45s drain in Run and the per-VM shutdown ceiling. Future tuning now anchors on the underlying sizing constraint, not a handwave.
timeoutError at the CLI boundary plus errors.Is(err, context.DeadlineExceeded) is the right classification primitive Claude Codex. The helper keeps cliexit concerns out of the lower-level service package, and the %w wrap chain propagates through every layer.
watchtools.ContextWithOptionalTimeout is used at every wait site Claude. The "timeout=0 → no deadline" semantics now live in one place instead of six independent timer constructions.
service.Delete(timeout) is a safer API than the previous service.Delete() Codex. Waiting for shutdown before removing instance state is the right direction for both Windows file-handle semantics and Unix PID-file correctness.

Testing Assessment ¶

Untested paths, in priority order:

svc start --wait --timeout=… deadline expiry returns exit code 4. Most novel path in the PR; the --timeout flag on svc start is new. See S2.
svc stop --wait --timeout=… deadline expiry returns exit code 4. StopWithWait was rewritten to use a context-driven loop; no test exercises the timeout-expiry branch. See S2.
svc delete --timeout=… deadline expiry returns exit code 4 and leaves the directory in place. The doc explicitly promises this; no test asserts it. See S2.
--timeout=0 (indefinite wait) succeeds when the desired state is eventually reached. A regression to context.WithTimeout(ctx, 0) (which times out immediately) would ship green. Round 1 testing gap, still unaddressed.
ensureServiceRunning timeout during rdd limavm * / rdd set / rdd kubectl returns exit code 4. Directly ties to I1; the exit-code contract is not exercised for service-autostart paths.
The documented --wait=false → rdd svc start readiness workflow. S3's narrow race window and the docs' advertisement of this workflow; no BATS test covers it.
The Round 1 TODO at bats/tests/20-service/2-create.bats:3 (# TODO: test rdd svc start --wait=false) remains unaddressed.

Documentation Assessment ¶

docs/design/cmd_lima.md now matches the code (Round 1 S1 resolved).
docs/design/cmd_service.md correctly names SIGINT (Round 1 S3 resolved) and explains the safety reason for removing --wait=false on svc delete (Round 1 I1 resolved).
--timeout=0 semantics are documented at every site.
Missing: cross-reference from svc delete to svc stop for signal/fallback details (S6).
Stale: docs/design/api_service.md:21 still says SIGTERM (S7).
Stale comment at cmd/rdd/service.go:120-122: "the API server readiness poll and the ConfigMap freshness poll share the single deadline" holds only when timeout > 0. Round 1 doc gap, still unaddressed; minor.

Commit Structure ¶

99aa4e3 and 8aa55e4 split correctly — the first is the feature, the second is the fixup for Round 1 findings. Both commit bodies understate the final contract (S5): 99aa4e3 lists pre-fixup defaults, and neither names the service.Stop() / service.WaitWithTimeout() library-surface removals. Consider squashing before merge, or amending either commit message to match the shipped code.

Acknowledged Limitations ¶

pkg/service/cmd/service.go:444-451 (in preserveAllInstanceLogs) names the Windows os.Rename / FILE_SHARE_DELETE limitation. The always-wait-before-delete semantics make the exposure rarer, but the QEMU-holds-the-handle case the comment describes is independent of this PR.
cmd/rdd/limavm.go:39-41 justifies limaLongWaitTimeout = 5 * time.Minute with a VM boot/shutdown comment; 8aa55e4 extends the same convention to startWaitTimeout and stopWaitTimeout in cmd/rdd/service.go:32-42.
pkg/service/cmd/service.go:381 comments that process.Kill (TerminateProcess on Windows) leaves child hostagents orphaned if they are slow — an explicit trade-off rather than an oversight.

Unresolved Feedback ¶

No prior PR review comments on this PR.

Agent Performance Retro ¶

Claude ¶

Claude produced the most thorough pass again: six suggestions and six design observations, with the deepest technical detail per finding. Its unique material dominated: the Delete TOCTOU around the outer/inner Running() checks (S1), the PID-recycling race that turns process.Interrupt(PIDNotFound) into a process-group self-signal (design concern), the PR-description-now-lists-pre-fixup-defaults observation (S5), and the BATS --wait redundancy on limavm delete (S4). It independently identified the same serviceStartAction readiness weakness as Codex, but at Suggestion severity with a more careful framing of the race window — which matches the practical exposure better than Codex's Important classification. Its verdict ("Merge") was a hair more optimistic than the consolidated "Merge with minor fixes," because Claude did not flag the ensureServiceRunning exit-code gap that both Codex and Gemini caught.

Codex ¶

Codex caught the Important finding the consolidation led with: ensureServiceRunning errors bypass timeoutError on every service-autostart path (rdd set, rdd limavm *, rdd kubectl), leaving five commands outside the advertised uniform exit-code-4 contract. It also uniquely caught the docs/design/api_service.md SIGTERM stale-doc (S7), which Claude and Gemini both missed despite reviewing the SIGINT/SIGTERM area in Round 1. Its call on the already-running-path readiness mismatch (I2 in its own report) was sharp on the documentation angle but over-weighted on severity; we consolidated at Suggestion alongside Claude's matching finding. No false positives; one clean Important and one unique stale-doc insight.

Gemini ¶

Gemini produced its shortest report of this PR yet and delivered a mixed bag. It independently raised the ensureServiceRunning exit-code-4 gap (its I2, consolidated into our I1) — valuable corroboration of Codex. It caught a clean cosmetic suggestion (S8: watchtools vs toolswatch alias mismatch) that neither other agent noticed. But its critical-severity @test header claim was a false positive: Gemini parsed its own reading of the diff as the file contents and hallucinated a corrupted BATS syntax in lines that are in fact valid @test "…" declarations at limavm-cli.bats:208, 219. Its Important finding on StopWithWait ignoring cmd.Context() was technically accurate but practically a no-op today (the CLI has no signal-to-context plumbing, so cmd.Context() is itself context.Background()); we demoted it to a design observation alongside Claude's matching concern. Gemini continues to skip git blame, so regression vs. pre-existing calls come from the other two agents.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	54m 15s	6m 47s	5m 05s
Findings	6S	1I 2S	1I 1S
Tool calls	59 (Bash 35, Grep 12, Read 12)	62 (shell 59, stdin 3)	18 (grepsearch 10, runshellcommand 5, readfile 3)
Design observations	6	3	0
False positives	0	0	1
Unique insights	5	1	1
Files reviewed	6	6	6
Coverage misses	0	0	0
Totals	6S	1I 2S	1I 1S
Downgraded	0	1 (I→S)	1 (I→S)
Dropped	0	0	1

Reconciliation. Three severity changes:

Gemini C1 (BATS @test header corrupted): dropped as false positive. The file in $WORKTREE_lead contains valid @test declarations at 208 and 219; Gemini hallucinated the corruption. false_positives += 1.
Gemini I1 (StopWithWait ignores cmd.Context()): downgraded to design observation. No practical impact today because main.go passes context.Background() without signal-to-context plumbing; Claude's matching design concern captures the substance at the right severity.
Codex I2 (already-running readiness mismatch): downgraded to S3. The documented workflow is a real gap but the race window is narrow; Claude's Suggestion classification matches the practical exposure.

Overall: Codex contributed the single most valuable finding (the ensureServiceRunning exit-code-4 gap) and the SIGTERM stale-doc. Claude carried the long tail of suggestions and unique design observations. Gemini corroborated the Important finding and caught the import-alias cosmetic, but its critical-severity call was spurious; with this PR's size and Round-2 narrow scope, Codex + Claude would have been sufficient.

Review Process Notes ¶

Skill improvements ¶

Add: flag a finding's critical/important severity as suspect when the evidence is a quoted block from a "diff view," not from the file in the worktree. Severity escalation on purely textual evidence (especially where the claim requires the file to be malformed) should be cross-checked by reading the file at the cited line before accepting the severity. Agents sometimes synthesize a plausible diff and then reason from it as if it were the file; the consolidator must catch this. A reviewer recognises the pattern when the cited text begins with -/+ markers, references "find-and-replace," or claims a syntax-level corruption that would break tooling the PR passed through.

Repo context updates ¶

Add: When a PR introduces a uniform CLI exit-code contract that applies to "every wait-capable subcommand," enumerate every caller of the shared autostart/autowait helper (e.g., ensureServiceRunning) before accepting the PR as "uniform." Cross-cutting contracts that are only wired in at the outermost entry points miss every command that invokes the helper transitively (rdd set, every rdd limavm *, rdd kubectl, rdd service config in this PR). Reason: deep-review agents catch the obvious call sites but often miss the transitive ones.