Deep Review: 20260422-160047-pr-336

Date	2026-04-22 16:00
Repo	rancher-sandbox/rancher-desktop-daemon
Round	7 (of target)
Author	@jandubois
PR	#336 — cli: standardize --wait and --timeout across svc and limavm
Commits	`47b71db` cli: standardize --wait and --timeout across svc and limavm
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — fix the doc-imprecision (I3), add the missing `--timeout=120s` on `local_setup_file` (S1), and consider the autostart-entry-point coverage gap (I2). The other suggestions are polish that need not block merge.
Wall-clock time	`20 min 19 s`

Executive Summary ¶

Round 7 of a CLI standardization PR (--wait / --timeout flags across rdd svc and rdd limavm, exit code 4 contract, service.Delete always-waits-before-removing-files). Most prior-round findings are addressed in this iteration: waitFlag rename, watchUntil watch-close error symmetry, expanded Delete predicate comment, --timeout help-string consistency, and the setup_rdd_control_plane 5m block. What remains is mostly documentation and test-coverage polish.

The core change is sound. Three findings worth attention before merge: a doc imprecision in api_service.md, a missing --timeout=120s on the new test file's local_setup_file that mirrors the same risk the helper already mitigates, and persistent absence of exit-code-4 BATS coverage for the autostart entry points (rdd set, rdd kubectl, rdd svc config) plus limavm delete/stop. Two findings raised as "Important" by Gemini are deliberate design tradeoffs documented in code comments and have been demoted to Design Observations.

Critical Issues ¶

None.

Important Issues ¶

I1. setup_rdd_control_plane silently reuses the prior daemon's controller set after a timed-out delete Codex¶

bats/helpers/controller.bash

}

setup_rdd_control_plane() {
    local controllers=${1:-"*"}

    # Bound delete and tolerate failure so a stuck prior daemon cannot
    # hang the helper for the full 5m stopWaitTimeout. `create` then
    # proceeds against whatever state survives.
    rdd svc delete --timeout=120s || :
    rdd svc create --controllers="${controllers}"
    rdd svc start
}

The new helper intentionally swallows rdd svc delete failure at line 81. If the timed-out delete leaves the instance directory behind, rdd svc create at line 82 becomes a no-op because serviceCreateAction returns success when the instance already exists (cmd/rdd/service.go:202-205). Then rdd svc start at line 83 either waits on the still-running old daemon (already-running branch, cmd/rdd/service.go:262-281) or restarts from the previously saved args (pkg/service/cmd/service.go:511-518) instead of the requested controllers set. A cleanup timeout therefore turns into cross-suite contamination: the next test runs against whatever controllers the previous daemon happened to preserve.

cmd/rdd/service.go

	}
	return command
}

func serviceCreateAction(cmd *cobra.Command, args []string) error {
	if service.Exists() {
		logrus.Infof("%q control plane already exists", instance.Name())
		return nil
	}
	controllers, err := cmd.Flags().GetString("controllers")
	if err != nil {
		return err
	}
	args = append(args, "--controllers", controllers)

cmd/rdd/service.go

		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !waitFlag {
			return nil
		}
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		if err := waitForFreshDiscoveryConfigMap(ctx, startTime); err != nil {
			return cliexit.Classify(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
		return nil
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {

pkg/service/cmd/service.go

			logrus.WithFields(logrus.Fields{"instance": entry.Name(), "count": count}).Debug("Preserved instance logs")
		}
	}
}

// ServeArgs returns the saved command-line arguments written by Create.
func ServeArgs() []string {
	data, err := os.ReadFile(instance.ArgsFile())
	if err == nil {
		var args []string
		if err := json.Unmarshal(data, &args); err == nil {
			return args
		}
	}
	return nil
}

// shouldEnableController determines if a controller should be enabled based on the controller's specification.

Note that even Codex's suggested fix only partially addresses this — rdd svc start --controllers=... is silently ignored when Running() is true. The deeper question is whether to fail loudly when the precondition cannot be established, or to detect and recover (Stop + retry Delete).

Fix: keep the helper's "fresh control plane with this controller set" invariant. The simplest fix is to fail fast when the bounded delete fails, and pass --controllers on start so a surviving stopped instance is at least restarted with the intended set:

-    rdd svc delete --timeout=120s || :
+    rdd svc delete --timeout=120s
     rdd svc create --controllers="${controllers}"
-    rdd svc start
+    rdd svc start --controllers="${controllers}"

I2. The new exit-code-4 contract is not asserted on every entry point the PR changes Codex¶

The new BATS coverage hits svc start/stop/delete and lima create --start/restart/start. The implementation change is broader: ensureServiceRunning (cmd/rdd/service.go:92-110) now classifies autostart deadlines for rdd set, rdd kubectl, and rdd service config; cmd/rdd/set.go:324-325 was updated specifically for this contract; and cmd/rdd/limavm.go:584-589 gives limavm delete the same timeout classification. None of those paths have a run -4 assertion today, and limavm stop (cmd/rdd/limavm.go:449-456) is also uncovered.

cmd/rdd/service.go

// ensureServiceRunning readies the control plane before a CLI command
// relies on it. It bounds the already-running and cold-start paths by
// startWaitTimeout (90s), independent of the caller's --timeout. The
// cap lets a broken service fail fast so rdd limavm *, rdd set, rdd
// kubectl, and rdd service config stay responsive on a long deadline.
func ensureServiceRunning(ctx context.Context) error {
	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		return cliexit.Classify(waitForFreshDiscoveryConfigMap(ctx, startTime))
	}
	return cliexit.Classify(startAndWaitForReady(ctx, nil, startWaitTimeout))
}

// startAndWaitForReady starts the service, waits for the API server and the
// discovery ConfigMap to be ready. After an unclean shutdown the ConfigMap may
// survive with stale data, so the freshness check waits for a ConfigMap whose

cmd/rdd/set.go

	}

	if dryRun || !wait {
		return nil
	}
	if err := waitForDesiredState(ctx, restConfig, timeout, targetGen); err != nil {
		return cliexit.Classify(err)
	}
	return nil
}

// classifyAPIError tags admission-controller rejections with

cmd/rdd/limavm.go

		action = "started"
		desiredStatus = metav1.ConditionTrue
	}
	logrus.Infof("LimaVM %q %s in namespace %q", name, action, limaVM.Namespace)

	if wait {
		waitCtx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		key := client.ObjectKeyFromObject(limaVM)
		if err := watchUntil(waitCtx, c, key, func(vm *limav1alpha1.LimaVM) bool {
			return apimeta.IsStatusConditionPresentAndEqual(vm.Status.Conditions, "Running", desiredStatus)
		}); err != nil {
			return cliexit.Classify(fmt.Errorf("failed waiting for LimaVM to be %s: %w", action, err))
		}
	}
	return nil
}

cmd/rdd/limavm.go


	if err := c.Delete(ctx, limaVM); err != nil {
		return fmt.Errorf("failed to delete LimaVM: %w", err)
	}

	if wait {
		waitCtx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		if err := watchUntilDeleted(waitCtx, c, limaVM); err != nil {
			return cliexit.Classify(fmt.Errorf("failed to wait for LimaVM deletion: %w", err))
		}
	}

	logrus.Infof("LimaVM %q deleted from namespace %q", name, limaVM.Namespace)
	return nil
}

A missing cliexit.Classify on any one of these would still ship green even though the PR claims a uniform CLI contract. Round 6 raised the same gap; this round adds coverage for some commands but the autostart and limavm delete/stop gaps persist.

Fix: add one explicit exit-code-4 BATS assertion per entry point whose observable contract changed. For the autostart callers (rdd set, rdd kubectl, rdd svc config), force ensureServiceRunning past its 90s readiness deadline (e.g., delete the discovery ConfigMap or block readiness via a controller toggle). For the LimaVM commands, reuse the non-booting template fixture and assert run -4.

I3. api_service.md shutdown-signal description conflates two distinct fallback paths Claude¶

docs/design/api_service.md


* It starts any additional controllers configured via the `config` ConfigMap.

## Service shutdown

The control plane is notified to shut down, either by `rdd service stop` (SIGINT on Unix with SIGTERM as fallback; `CTRL_BREAK_EVENT` on Windows with `TerminateProcess` as fallback), or by the OS preparing to shut down the host.

Some controllers need to be notified of pending shutdown so they can gracefully terminate, e.g. shut down virtual machines.

Any controller can request shutdown notification by creating a ConfigMap in the `rdd-system` namespace:

This sentence implies a single "if SIGINT fails, send SIGTERM" fallback. The implementation actually has two distinct fallbacks (pkg/service/cmd/service.go:374-378 and pkg/service/cmd/service.go:399):

pkg/service/cmd/service.go

	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

	if wait {
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, timeout)
		defer cancel()
		ticker := time.NewTicker(100 * time.Millisecond)

pkg/service/cmd/service.go

				// signals promptly; its slowness comes from draining
				// hostagents sequentially. Run caps that drain at 45s
				// before exiting anyway, so timeout values shorter than
				// stopWaitTimeout (5m) routinely race a still-draining
				// daemon.
				_ = process.Kill(pid)
				err := ctx.Err()
				if errors.Is(err, context.DeadlineExceeded) {
					return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop: %w", instance.Name(), pid, err)
				}
				return fmt.Errorf("wait for %q control plane with pid %d to stop cancelled: %w", instance.Name(), pid, err)

Signal-delivery failure fallback (StopWithWait line 374-378): if process.Interrupt returns an error, immediately call process.Kill. On Unix, Interrupt (SIGINT via unix.Kill) only fails when the PID is invalid (the comment at line 369-371 says so), so this path is essentially Windows-only — it covers the "no shared console" failure of GenerateConsoleCtrlEvent.
Deadline-expiry fallback (StopWithWait line 388-404): if --timeout expires, call process.Kill once. On Unix that delivers SIGTERM, which the daemon's signal handler treats as the second shutdown signal.

docs/design/cmd_service.md:104-107 describes the deadline-expiry path correctly. The api-doc as written contradicts it: on Unix the Interrupt-failed fallback never fires in practice, and the deadline-expiry path is the meaningful Unix fallback. The text was modified in this PR (the previous wording was outright wrong about the primary signal too), so this is an opportunity to land the precise version rather than a partially-corrected one.

docs/design/cmd_service.md

    If the deadline expires, `rdd` exits with code 4.


### `rdd service stop`

Sends SIGINT to the control plane process (`rdd.pid`) and waits up to 5 minutes for
it to exit. On Windows the signal is delivered as `CTRL_BREAK_EVENT`. If the deadline
expires, the wait sends SIGTERM on Unix (or calls `TerminateProcess` on Windows)
as a fallback and `rdd` exits with code 4. Pass `--wait=false` to return immediately,
or `--timeout=0` to wait indefinitely. The default matches the per-VM ceiling because
graceful shutdown will eventually stop every running LimaVM before the service exits.

### `rdd service delete`

Fix:

-The control plane is notified to shut down, either by `rdd service stop` (SIGINT on Unix with SIGTERM as fallback; `CTRL_BREAK_EVENT` on Windows with `TerminateProcess` as fallback), or by the OS preparing to shut down the host.
+The control plane is notified to shut down, either by `rdd service stop`
+(SIGINT on Unix, `CTRL_BREAK_EVENT` on Windows; if the wait deadline expires
+the wait force-terminates with SIGTERM on Unix or `TerminateProcess` on
+Windows), or by the OS preparing to shut down the host.

Suggestions ¶

S1. local_setup_file in the new timeout test uses default 5m for cleanup, while the per-test reset uses 10s Claude¶

bats/tests/20-service/6-timeout.bats

# SIGTERM (or TerminateProcess on Windows) and returns
# context.DeadlineExceeded wrapped in cliexit.Timeout. svc delete also
# preserves the instance directory on timeout (see docs/design/cmd_service.md)
# so the user can retry.

local_setup_file() {
    rdd svc delete || :
    rdd svc create
    rdd svc start
}

@test 'svc stop --timeout=1ms exits with code 4' {
    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

The other deletes in the same file pass --timeout=10s (line 26) or rely on a known-clean state (line 37). The file-level rdd svc delete || : on line 12 has no explicit timeout, so it inherits the new 5-minute default. If a prior test file left a half-dead daemon (the comment on line 23 acknowledges that tests exit while the daemon is still draining), this setup can hang the entire test file for 5 minutes before BATS reports anything.

bats/tests/20-service/6-timeout.bats

    run -4 rdd svc stop --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

@test 'svc delete --timeout=1ms exits with code 4 and preserves the instance directory' {
    # Hard reset: the prior test sent SIGTERM but does not wait for the
    # process to exit, so a plain `rdd svc start` could still see
    # Running()=true and take the already-running branch.
    rdd svc delete --timeout=10s || :
    rdd svc create
    rdd svc start
    run -4 rdd svc delete --timeout=1ms
    assert_output --partial 'context deadline exceeded'
    run -0 rdd svc status
    assert_output --partial 'has been created: true'
}

@test 'svc start --timeout=1ms exits with code 4' {
    # Reset to a known state; prior tests left the control plane half-dead.
    rdd svc delete || :
    rdd svc create
    run -4 rdd svc start --wait --timeout=1ms
    assert_output --partial 'context deadline exceeded'
}

The bats/helpers/controller.bash:81 change explicitly bounds its delete at 120s for exactly this reason. The same reasoning applies here.

bats/helpers/controller.bash

    local controllers=${1:-"*"}

    # Bound delete and tolerate failure so a stuck prior daemon cannot
    # hang the helper for the full 5m stopWaitTimeout. `create` then
    # proceeds against whatever state survives.
    rdd svc delete --timeout=120s || :
    rdd svc create --controllers="${controllers}"
    rdd svc start
}

Fix:

 local_setup_file() {
-    rdd svc delete || :
+    rdd svc delete --timeout=120s || :
     rdd svc create
     rdd svc start
 }

S2. serviceStartAction already-running branch reads StartTime() after Running() without guarding the TOCTOU Claude¶

cmd/rdd/service.go

		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !waitFlag {
			return nil
		}
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return cliexit.Classify(err)
		}
		if err := waitForFreshDiscoveryConfigMap(ctx, startTime); err != nil {
			return cliexit.Classify(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
		return nil
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {
		controllers, err := cmd.Flags().GetString("controllers")

Running() (line 262) reads the PID file, validates the process is alive, and removes the PID file if the process is gone (pkg/service/cmd/service.go:158). StartTime() (line 267) calls os.Stat(instance.PIDFile()). If the process exits between line 262 and line 267 (e.g., a concurrent rdd svc stop) and the PID file is removed, StartTime() returns a raw os.Stat: ENOENT error. The window is microseconds and harmless in practice, but the surfaced error is opaque and not Classify-able.

pkg/service/cmd/service.go

			}
			_ = process.Release()
		}
	}
	if err != nil {
		_ = os.Remove(instance.PIDFile())
		return PIDNotFound
	}
	return pid
}

The same race exists in ensureServiceRunning (line 99). Round 6 raised this; this round has not addressed it.

cmd/rdd/service.go

		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		startTime, err := service.StartTime()
		if err != nil {
			return err
		}
		ctx, cancel := watchtools.ContextWithOptionalTimeout(ctx, startWaitTimeout)
		defer cancel()

Fix: when StartTime() returns os.IsNotExist, retry the Running() check or return a clearer "control plane exited" message:

startTime, err := service.StartTime()
if err != nil {
    if os.IsNotExist(err) {
        return fmt.Errorf("%q control plane exited while waiting", instance.Name())
    }
    return err
}

S3. stopWaitTimeout aliases a constant defined in a sibling file Claude¶

cmd/rdd/service.go

// stopWaitTimeout bounds the CLI wait for the control plane to shut down.
// Run caps the internal drain at 45s (see pkg/service/cmd/service.go).
// The CLI deadline exceeds that cap to absorb slow disks, misconfigured
// hosts, and sequential multi-VM drains. Aliasing to limaLongWaitTimeout
// makes the per-VM ceiling the single tuning knob for long CLI waits.
const stopWaitTimeout = limaLongWaitTimeout

// logrusLevelToKlog converts current logrus level to klog level.
func logrusLevelToKlog() string {
	switch logrus.GetLevel() {
	case logrus.DebugLevel:

limaLongWaitTimeout is declared in cmd/rdd/limavm.go:42. The two files share the main package, so this compiles, but a reader looking at service.go only has the comment to discover where limaLongWaitTimeout lives. The "single tuning knob" rationale is fine; the implementation makes the file dependency invisible.

cmd/rdd/limavm.go

	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/util/tail"
)

// limaLongWaitTimeout is the default --timeout for VM boot and shutdown waits,
// which routinely take minutes.
const limaLongWaitTimeout = 5 * time.Minute

func newLimaVMCommand() *cobra.Command {
	command := &cobra.Command{
		Use:     "limavm",
		Short:   "Manage LimaVM resources",

Either move both constants to a small shared file, or qualify the comment with the file location:

-// hosts, and sequential multi-VM drains. Aliasing to limaLongWaitTimeout
-// makes the per-VM ceiling the single tuning knob for long CLI waits.
+// hosts, and sequential multi-VM drains. Aliased to limaLongWaitTimeout
+// (defined in limavm.go) so the per-VM ceiling is the single tuning knob
+// for long CLI waits.

S4. service.Delete never reaches its if !Exists() guard Claude¶

pkg/service/cmd/service.go

// up to timeout for it to exit before removing files; removing instance.Dir()
// while the serve process holds rdd.pid, rdd.sqlite3, and log files corrupts
// the directory on Windows and breaks PID-file mutual exclusion on Unix.
// Pass timeout 0 to wait indefinitely (bounded only by ctx).
func Delete(ctx context.Context, timeout time.Duration) error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	if Running() {
		if err := StopWithWait(ctx, true, timeout); err != nil {
			// Return the error whenever the process is alive or the
			// deadline expired, so deletion never races a live daemon
			// and the CLI exits 4 on timeout per docs/design/cmd_service.md.

The only caller is serviceDeleteAction (cmd/rdd/service.go:368-372), which short-circuits on !service.Exists() before calling service.Delete. The library-side guard is dead code, and if it ever fired, it would propagate as a CLI error (exit 1) rather than the action's "info + success".

cmd/rdd/service.go

	}
	command.Flags().Duration("timeout", stopWaitTimeout, "Timeout for the control plane to exit before deletion (0 means wait indefinitely)")
	return command
}

func serviceDeleteAction(cmd *cobra.Command, _ []string) error {
	if !service.Exists() {
		logrus.Infof("%q control plane does not exist", instance.Name())
		return nil
	}
	timeout, err := cmd.Flags().GetDuration("timeout")
	if err != nil {
		return err
	}
	if err := service.Delete(cmd.Context(), timeout); err != nil {

Either remove the inner check (and document Delete's precondition that the instance exists) or have Delete log+return-nil to match the action. Removal is the smallest change.

S5. serviceStopAction removes instance.Config() after --wait=false even though the daemon is still alive Claude¶

pkg/service/cmd/service.go

					_ = os.Remove(instance.Config()) // Ignore error as file might not exist
					return nil
				}
			}
		}
	}

	_ = os.Remove(instance.Config()) // Ignore error as file might not exist
	return nil
}

// cleanupDiscoveryConfigMap removes the discovery configmap to prevent readiness check confusion.
func cleanupDiscoveryConfigMap() error {
	// Try to get kubeconfig, but ignore errors since control plane might be stopped

The !wait branch falls through to remove instance.Config() (the kubeconfig the CLI uses to talk to the apiserver) immediately after sending SIGINT. The daemon may still be alive and serving for tens of seconds. Any concurrent rdd ctl get / rdd kubectl call between stop's return and the daemon's actual exit will fail with "could not read config" (pkg/service/cmd/service.go:95), masking the real state ("daemon is still up").

pkg/service/cmd/service.go

	if !Running() {
		return nil, fmt.Errorf("control plane %q is not running", instance.Name())
	}
	data, err := os.ReadFile(instance.Config())
	if err != nil {
		return nil, fmt.Errorf("could not read config from %s: %w (control plane may still be starting)", instance.Config(), err)
	}
	return data, nil
}

// GetKubeRestConfig generates and returns the kubeconfig as a *rest.Config.

This predates the PR but the new contract makes --wait=false more attractive (per the docs), so the symptom will get more exposure. The fix is to gate the removal on a confirmed-exited check, or to leave the kubeconfig in place when !wait (it will be re-created on next start).

S6. cleanupDiscoveryConfigMap uses context.TODO() and is called outside the --timeout envelope Claude¶

pkg/service/cmd/service.go

func StopWithWait(ctx context.Context, wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Delete the discovery configmap before signaling so the kube-API
	// delete still has a live apiserver. Trade-off: a timed-out --wait
	// leaves the configmap deleted while the daemon drains, so a quick
	// `rdd svc start` polls the missing configmap until startWaitTimeout.
	// Moving cleanup into the daemon's own shutdown path is the durable fix.
	_ = cleanupDiscoveryConfigMap()

	pid := PID()
	// Try graceful shutdown first. On Unix, Kill already sends SIGTERM which
	// triggers the Go signal handler. On Windows, Kill uses TerminateProcess
	// which bypasses all handlers, so we send Interrupt (CTRL_BREAK_EVENT)

Three concerns, all pre-existing but in scope of this PR's "wire context through" theme:

The function is not bounded by the user's --timeout (it runs before watchtools.ContextWithOptionalTimeout).
CleanupDiscovery is called with context.TODO() (pkg/service/cmd/service.go:433), so it has no deadline at all.
The author already documented the trade-off ("Moving cleanup into the daemon's own shutdown path is the durable fix").

pkg/service/cmd/service.go

	if err != nil {
		logrus.WithError(err).Debug("Could not create kubernetes client for discovery cleanup")
		return nil // Not a critical error during shutdown
	}

	err = controllers.CleanupDiscovery(context.TODO(), client)
	if err != nil {
		logrus.WithError(err).Debug("Failed to delete discovery configmap")
	}

	return nil // Don't fail stop operation due to discovery cleanup issues

Listed because the PR did the bulk of the context plumbing for Stop/Delete and this single TODO is now the loudest holdout.

Fix: thread ctx into cleanupDiscoveryConfigMap, or move the cleanup into the daemon's own shutdown path as the comment proposes.

S7. The deadline-path comment understates what SIGTERM actually does to the daemon Claude¶

The deadline-expiry SIGTERM is the second shutdown signal (the first SIGINT was sent earlier in the same call), and the apiserver's SetupSignalContext handler responds to a second signal by calling os.Exit(1). The current comment reads as if SIGTERM lets the daemon "handle signals" gracefully, but the second signal forces an abrupt termination — which is the desired behavior, just not what the comment describes.

-                // On Unix, SIGTERM suffices because the service handles
-                // signals promptly; its slowness comes from draining
-                // hostagents sequentially.
+                // On Unix, SIGTERM is the *second* shutdown signal
+                // (SIGINT was sent at line 374); the apiserver signal
+                // handler responds to a second signal by calling
+                // os.Exit(1), forcing immediate termination — which is
+                // exactly what the deadline path needs.

S8. run_e -0 wrapper is unnecessary on a setup command whose output is not inspected Gemini¶

bats/tests/33-lima-controllers/limavm-cli.bats

}

@test "lima restart --timeout fails when VM cannot restart" {
    rdd ctl create configmap "restart-timeout-vm" --namespace "${LIMA_TEST_NS}" \
        --from-literal="template=${TEMPLATE}"
    run_e -0 rdd limavm create "restart-timeout-vm" "restart-timeout-vm" --namespace "${LIMA_TEST_NS}"
    assert_created "restart-timeout-vm" "${LIMA_TEST_NS}" "restart-timeout-vm"

    # Restart waits for status.restartCount to increment, which requires the
    # VM to come back up. The non-functional template never boots, so the
    # counter stays at zero and --wait --timeout exits with code 4.

run_e -0 invokes rdd limavm create, but assert_created reads ${stderr} set by the previous run_e. If the helper relies on ${stderr}, the wrapper is justified; otherwise per the BATS style notes, commands whose success is the entire assertion should be invoked directly. Verify by reading assert_created: if it inspects $stderr, leave the call as-is; if not, drop the wrapper.

Design Observations ¶

Concerns ¶

--timeout flag scope is intentionally not extended over ensureServiceRunning — cmd/rdd/service.go:88-91 (in-scope) Gemini

The code comment at lines 88-91 explicitly states: "It bounds the already-running and cold-start paths by startWaitTimeout (90s), independent of the caller's --timeout. The cap lets a broken service fail fast so rdd limavm *, rdd set, rdd kubectl, and rdd service config stay responsive on a long deadline." Gemini raised this as Important but the behavior is deliberate and documented. The asymmetry is real — a user passing --timeout=2s would expect the whole command to fail in 2s, not 90s — but this is a design tradeoff that has been raised in multiple prior rounds (round 5 S3) and consistently treated as intentional. A min(userTimeout, startWaitTimeout) would address both directions; flagged for design discussion rather than as a bug.

Deadline-expiry path sends SIGTERM (graceful), not SIGKILL (forced) — pkg/service/cmd/service.go:389-401 (in-scope) Gemini

Gemini raised this as Important. The code comment at lines 392-398 documents the rationale: "On Unix, SIGTERM suffices because the service handles signals promptly; its slowness comes from draining hostagents sequentially. Run caps that drain at 45s before exiting anyway, so timeout values shorter than stopWaitTimeout (5m) routinely race a still-draining daemon." Combined with the apiserver's "second signal triggers os.Exit(1)" behavior (S7), the deadline-expiry path does force termination on Unix — just via the second-signal path rather than SIGKILL. The genuinely-deadlocked-daemon case (where Run's 45s cap also fails) accepts the hang in exchange for log flush and proper cleanup.

Strengths ¶

cliexit.Classify is clean and idempotent — pkg/cli/exit/exit.go:56-65 (in-scope) Claude Gemini. Correctly handles wrapped context.DeadlineExceeded, short-circuits already-classified errors. Replacing the explicit errors.Is block in set.go with Classify is a real readability win.
Watch-drop handling is now sound — cmd/rdd/limavm.go:497-512, 550-566 (in-scope) Claude Codex Gemini. The post-watch re-read distinguishes "watch closed before predicate satisfied" from "context deadline" and from a satisfied predicate, addressing the silent-success-on-watch-drop class of bugs.
service.Delete's "directory survives unless the daemon is confirmed gone" invariant is well-chosen — pkg/service/cmd/service.go:452-465 (in-scope) Claude. The expanded comment now spells out the predicate's coverage of timeout, signal-delivery failure, and context.Canceled, which addresses round 6's S6.
Already-running svc start branch now waits for fresh discovery ConfigMap — cmd/rdd/service.go:267-279 (in-scope) Codex. Aligns the cold-start and already-running paths so both wait for ReadyAnnotation rather than just process liveness.
StartTime() truncates to seconds to match metav1.Time serialization — pkg/service/cmd/service.go:172-178 (in-scope) Claude. Prevents the "ConfigMap startTime appears before beforeStart" edge case the round 6 review surfaced.

Testing Assessment ¶

The new BATS coverage hits the new exit-code-4 contract through svc start/stop/delete and lima create --start/restart/start. Untested scenarios, in priority order (mostly the same as round 6, with limavm delete newly missing):

rdd set --timeout=... exit-code-4 contract. set.go:325 is the original site of the timeout exit code; nothing in bats/tests/ asserts that rdd set returns 4 specifically.
rdd svc start --wait=false. The TODO at bats/tests/20-service/2-create.bats:3 still says "test rdd svc start --wait=false" and remains unimplemented.
rdd kubectl --timeout-via-ensureServiceRunning exit code. ensureServiceRunning now routes DeadlineExceeded through Classify, but no BATS test exercises the path where ensureServiceRunning itself times out and the caller surfaces 4.
limavm delete --timeout exit-code-4. New tests cover start/create --start/restart timeouts, but limavm delete --timeout=… (the most-changed default — flipped from --wait=false to --wait=true) has no explicit timeout-exits-4 assertion.
limavm stop --timeout exit-code-4. Same gap, smaller blast radius because the existing tests use --wait=false.
Concurrent rdd svc stop while rdd svc start is running. The Discovery-ConfigMap deletion in StopWithWait creates a window where ensureServiceRunning would fail; no test exercises this race.

I2 above tracks (1), (3), (4), (5) as a single Important finding.

Documentation Assessment ¶

cmd_service.md:93-99 and cmd_service.md:102-122 correctly describe the new defaults, signal, and exit-code contract for svc start/stop/delete. cmd_lima.md:17/24/31/38/57 updates each subcommand's --timeout default to 5m and notes exit code 4. The new comments on startWaitTimeout, stopWaitTimeout, and limaLongWaitTimeout justify the values and tie them back to the daemon's internal 45s drain cap.

I3 above covers the api_service.md imprecision.

Pre-existing test bats/tests/20-service/2-create.bats:37-44 asserts log strings ("successfully started", "waiting for ... to be ready") that no current code path emits. Not introduced by this PR.

Commit Structure ¶

Single commit, message accurately describes the diff (defaults, exit codes, library surface, removed callers, doc updates, new BATS coverage). No fixup commits, no scope creep. Round 7 of iteration on this PR has converged to a clean single-commit shape.

Acknowledged Limitations ¶

Discovery ConfigMap deleted before signaling, leaving a startup window for new clients. pkg/service/cmd/service.go:355-359 documents the trade-off: "a timed-out --wait leaves the configmap deleted while the daemon drains, so a quick rdd svc start polls the missing configmap until startWaitTimeout. Moving cleanup into the daemon's own shutdown path is the durable fix." S6 above offers an interim improvement.
context.Canceled in Delete's predicate. pkg/service/cmd/service.go:455-461 notes that context.Canceled is unreachable today because the CLI runs on context.Background(); wiring SIGINT into cmd.Context() would let Ctrl-C fall through to deletion during a live stop, and needs an explicit decision.
Timeout values shorter than stopWaitTimeout race the daemon's 45s internal drain cap. pkg/service/cmd/service.go:395-398 documents this; the BATS suite avoids the race by using --timeout=1ms (deterministic loss) or --timeout=120s/5m (always wins).
Second-precision freshness anchor accepts a stale ConfigMap when a prior crash lands in the same wall-clock second as the new startup. Round 6 S1 (Codex). The truncation aligns with metav1.Time serialization (a real constraint), so finer-grained anchoring would require carrying sub-second precision through the ConfigMap path.

Agent Performance Retro ¶

Claude ¶

Produced the deepest review (1 important + 7 suggestions + 4 design observations), most of them concrete code-comment or test-helper polish that would be hard for the other agents to surface without the same depth of file reading. Caught the api_service.md doc imprecision (I3) — a finding only Claude raised, requiring careful cross-referencing between the doc and two distinct fallback paths in StopWithWait. Mislabeled I3 as a "regression" (the prior text was already wrong, just differently); demoted in consolidation. The TOCTOU finding (S2) is a re-raise of round 6 S3 that Claude correctly tracked as still unaddressed.

Codex ¶

Two important findings, both well-grounded and complementary to Claude's coverage. I1 (controller.bash silently reuses wrong controller set) is a fresh observation on the new helper change that no other agent caught — required tracing through serviceCreateAction's early-exit and the ServeArgs reuse path. I2 (autostart entry-point coverage gap) re-raises round 6's S1 with sharper framing, correctly identifying every entry point lacking run -4 coverage. Codex held the bar high: zero suggestions, zero design observations beyond strengths, no false positives.

Gemini ¶

Two important findings, both contradicted by code comments documenting the intentional tradeoff — both demoted to Design Observations. Gemini missed the cmd/rdd/service.go:88-91 comment when raising I1 (--timeout scope) and the pkg/service/cmd/service.go:392-398 comment when raising I2 (SIGKILL). The single Suggestion (run_e -0 wrapper) is a valid style nit. As expected per the known-behavior note, Gemini also did not run git blame for regression labeling.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	13m 21s	7m 42s	8m 21s
Findings	1I 7S	2I	1S
Tool calls	51 (Bash 29, Read 22)	46 (shell 42, plan 2, stdin 2)	33 (grepsearch 14, readfile 13, runshellcommand 6)
Design observations	4	2	2
False positives	0	0	0
Unique insights	5	1	1
Files reviewed	11	11	11
Coverage misses	0	0	0
Totals	1I 7S	2I	1S
Downgraded	0	0	2 (I→S)
Dropped	0	0	0

Claude provided the most value this round through depth and breadth of suggestions. Codex was the most calibrated (no false positives, both findings survived consolidation). Gemini's findings were less reliable this round — both Important claims missed code comments that explained the design — but the run_e -0 catch is genuinely useful.

Review Process Notes ¶

Repo context updates ¶

[repo] When reviewing CLI commands that auto-start the service, distinguish cmd.Context() (unbounded) from a user --timeout-derived context. RDD CLI commands routinely call an autostart helper (e.g., ensureServiceRunning) before applying the user's --timeout to the operation context. Reviewers should check whether the autostart phase is intentionally bounded by its own ceiling (e.g., startWaitTimeout) or whether it should inherit the user's deadline. The current convention in this repo is the former — autostart caps independent of --timeout — and that intent is documented in the helper's comment. Flag deviations from that convention, but do not flag the convention itself.
[repo] When a code comment documents a deliberate design tradeoff, treat findings that re-litigate the tradeoff as Design Observations, not Important Issues. Multiple rounds of this PR have surfaced the same "user --timeout not propagated through implicit phases" and "deadline path sends SIGTERM not SIGKILL" claims, each backed by reasonable user-facing concern but each contradicted by an existing code comment that documents the intentional choice (cmd/rdd/service.go:88-91, pkg/service/cmd/service.go:392-398). Treat the comment as evidence of conscious design and downgrade unless the reviewer can articulate why the documented rationale is wrong.

Skill improvements ¶

When a finding is contradicted by a same-file code comment, demote it before consolidation rather than after. Reviewers who raise a concern as Important should first grep the cited file for a comment that explicitly addresses the concern. If one exists and documents the choice as deliberate, the finding belongs in Design Observations (with the comment quoted) unless the reviewer can refute the documented rationale. This prevents repeated re-litigation across review rounds and keeps the Important section reserved for genuinely-unaddressed concerns.