# Deep Review: 20260323-003519-pr-242

| | |
|---|---|
| **Date** | 2026-03-23 00:35 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `74f3442` Fix Windows process signaling for Lima hostagent lifecycle<br>`f99d44e` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`f525d62` Enable Lima BATS tests on Windows<br>`cec19bc` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`f4a83e7` Run Windows BATS tests via MSYS2 instead of WSL2<br>`a224b19` Honor RDD_LOG_LEVEL in external controllers |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — one important edge case in the deletion path for broken instances on Windows |
| **Wall-clock time** | `21 min 27 s` |

---

## Executive Summary

This PR isolates Lima hostagents into their own Windows process groups (CREATE_NEW_PROCESS_GROUP + targeted CTRL_BREAK_EVENT), replaces Lima's console-wide StopForcibly with a KillTree-based implementation, pre-generates SSH keys to avoid MSYS2 path conversion issues, and switches CI from WSL2-based BATS to MSYS2. The implementation handles Windows process semantics and WSL2 deadlock prevention with care. One important gap remains: the deletion path force-stops broken instances using PIDs that may have been recycled on Windows.

**Structure:** 0 critical, 1 important, 5 suggestions. Design observations highlight the sequential shutdown timeout concern (pre-existing TODO) and several well-executed patterns.

---

## Critical Issues

None.

---

## Important Issues

1. **Deletion of broken instances can kill recycled PIDs on Windows** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:47-48` [Codex GPT 5.4] (important, regression)

```go
if existingInst.Status == limatype.StatusRunning || existingInst.Status == limatype.StatusBroken {
    stopInstanceForcibly(ctx, logger, existingInst)
}
```

This PR adds `StatusBroken` to the force-stop guard at line 47. On Windows, Lima's `ReadPIDFile` does not clean stale PID files (unlike Unix, where it removes the file and returns 0 for dead processes). A stopped instance whose PID files survive can appear as `StatusBroken` if its hostagent PID was recycled to an unrelated process. `stopInstanceForcibly` (line 596) calls `KillTree` on `inst.HostAgentPID` and `inst.DriverPID` at lines 598-603, terminating the unrelated process.

The PID zeroing at lines 55-58 protects against Lima's internal `StopForcibly` during `limainstance.Delete`, but runs *after* `stopInstanceForcibly` has already acted on the stale PIDs.

Before this PR, `handleDeletion` called `StopForcibly` only for `StatusRunning`. Adding `StatusBroken` introduces a new path where stale PIDs reach `KillTree`.

Fix: Restrict PID-based force-stop to `StatusRunning`; for `StatusBroken`, use `terminateWSL2Distro` (on WSL2) and stale-file cleanup without PID killing:

```diff
- if existingInst.Status == limatype.StatusRunning || existingInst.Status == limatype.StatusBroken {
+ if existingInst.Status == limatype.StatusRunning {
      stopInstanceForcibly(ctx, logger, existingInst)
  } else if existingInst.VMType == limatype.WSL2 {
```

---

## Suggestions

1. **`StopWithWait` sends SIGINT instead of SIGTERM on Unix — add clarifying comment** — `pkg/service/cmd/service.go:351` [Claude Opus 4.6] (suggestion, regression)

```go
if err := process.Interrupt(pid); err != nil {
    if err := process.Kill(pid); err != nil {
        return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
    }
}
```

The old code sent SIGTERM via `process.Kill`. The new code sends SIGINT via `process.Interrupt`, falling back to `Kill` only on error. Both signals are handled identically by the apiserver signal handler (which registers both `os.Interrupt` and `syscall.SIGTERM`), so this is functionally correct. On Unix, `process.Interrupt` (which sends SIGINT via `syscall.Kill`) always succeeds for a valid PID, so the `Kill` fallback never fires. The fallback exists for Windows, where `GenerateConsoleCtrlEvent` can fail cross-console. The asymmetry deserves a comment.

Fix:

```diff
+	// On Unix, Interrupt (SIGINT) always succeeds for a valid PID; the Kill
+	// fallback fires only on Windows when caller and target lack a shared console.
 	if err := process.Interrupt(pid); err != nil {
```

2. **Misleading log message when no watcher exists** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:528-529` [Claude Opus 4.6] (suggestion, regression)

```go
} else {
    logger.Info("Could not signal hostagent, killing process directly")
    r.killHostagent(name)
```

When `signalHostagent` returns false (no watcher), `killHostagent` is also a no-op (same nil check). The message "killing process directly" is misleading; the actual work happens in `forceStop()` via `stopInstanceForcibly`.

Fix:

```diff
-		logger.Info("Could not signal hostagent, killing process directly")
-		r.killHostagent(name)
+		logger.Info("No watcher for hostagent, forcing stop via stored PIDs")
+		r.killHostagent(name) // no-op if watcher absent; forceStop uses stored PIDs
```

3. **`vm_template_running` comment overstates `RDD_VM_TYPE` support** — `bats/helpers/vm_template.bash:9` [Claude Opus 4.6] (suggestion, regression)

```bash
#   VM_TEMPLATE=$(vm_template_running) # for limavm-running tests (supports RDD_VM_TYPE)
```

`vm_template_running` passes through `RDD_VM_TYPE` only on Unix (line 88). On Windows, `_wsl2_template` always sets `vmType: wsl2` and ignores the variable.

Fix:

```diff
-#   VM_TEMPLATE=$(vm_template_running) # for limavm-running tests (supports RDD_VM_TYPE)
+#   VM_TEMPLATE=$(vm_template_running) # for limavm-running tests (supports RDD_VM_TYPE on Unix)
```

4. **SSH key pre-generation lacks inter-process lock** — `pkg/controllers/lima/limavm/controllers/ssh_keys.go:41-60` [Codex GPT 5.4] (suggestion, enhancement)

```go
_ = os.Remove(pubPath)
configDir := filepath.Dir(privPath)
if err := os.MkdirAll(configDir, 0o700); err != nil {
    return fmt.Errorf("could not create %q: %w", configDir, err)
}
tmpPath := privPath + ".tmp"
```

The helper uses a fixed temp path (`user.tmp`) without an inter-process lock, unlike Lima's upstream `DefaultPubKeys()` which wraps generation in a config-dir lock. If two controller processes race through startup, one can delete the other's temp files. In practice, only one lima-controller runs per instance, so this race is unlikely. The code already handles crash recovery (orphaned `.tmp` files) correctly.

Fix: Consider wrapping the generate/rename section in `lockutil.WithDirLock(configDir, ...)`, or document why single-controller-per-instance makes locking unnecessary.

5. **`setVerbosityFromEnv` warns on valid logrus levels** — `pkg/external/main.go:278-283` [Gemini 3.1 Pro] (suggestion, enhancement)

```go
case "", "info":
    // info is the default klog verbosity; nothing to change.
default:
    fmt.Fprintf(os.Stderr, "RDD_LOG_LEVEL=%q is not supported for klog; valid values are trace, debug, info\n", level)
```

The main RDD service uses logrus, which supports `warn`, `warning`, `error`, and `fatal`. Setting `RDD_LOG_LEVEL=error` (valid for the service) triggers four warning messages to stderr — one per spawned external controller. Since klog has no equivalent granularity, mapping these levels to the default `info` verbosity is correct, but the warning is noise.

Fix: Accept valid logrus levels silently:

```diff
-	case "", "info":
+	case "", "info", "warn", "warning", "error", "fatal":
 		// info is the default klog verbosity; nothing to change.
+		// klog cannot suppress info-level messages, so higher logrus levels
+		// fall through to the default verbosity (v=0).
 	default:
```

---

## Design Observations

### Concerns

**Sequential hostagent shutdown can exceed `StopWithWait` timeout** (future) [Claude Opus 4.6, Gemini 3.1 Pro] — The TODO at `limavm_controller.go:504` acknowledges this. The sequential loop takes N x 30s in the worst case. `StopWithWait` enforces a 60s total timeout (`service.go:360`). With N > 2 hanging hostagents, the service is force-killed before cleanup finishes, leaving hostagents orphaned. This PR adds `stopInstanceForcibly` to the non-graceful path (lines 519-528), which adds WSL2 termination time per hostagent. Parallelizing the wait loop with a `sync.WaitGroup` would bound total time to a single timeout window regardless of N.

### Strengths

**Platform-appropriate signal semantics** — Using CTRL_BREAK_EVENT targeted at process groups rather than Lima's console-wide GenerateConsoleCtrlEvent fixes the root cause of the service self-destructing when stopping a hostagent. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

**Crash-safe SSH key generation** — `ensureSSHKeysAt` generates keys to a temp path, then renames. It handles all partial-state combinations (missing pub key, corrupt private key, stale temps) and cleans up on every failure path. Unit tests cover each state comprehensively. [Claude Opus 4.6, Gemini 3.1 Pro]

**Conditional file cleanup** — `stopInstanceForcibly` removes PID/socket/tmp files only when all `KillTree` calls succeed. Since Lima's `store.Inspect` derives `StatusStopped` from missing PID files, removing them while a process is alive would mask a running process. [Claude Opus 4.6, Codex GPT 5.4]

**PID recycling defense in deletion** — `handleDeletion` zeroes `DriverPID` and `HostAgentPID` before passing to `limainstance.Delete`, preventing Lima's internal `StopForcibly` from killing recycled processes. [Claude Opus 4.6]

---

## Testing Assessment

Test coverage is strong. The BATS tests exercise VM lifecycle, crash recovery, orphaned hostagent recovery, graceful shutdown, template changes, and cleanup. The SSH key unit tests cover all partial-state combinations. Untested scenarios, ranked by risk:

1. **StatusBroken deletion with recycled PIDs on Windows** — The highest-risk gap (see Important Issue 1). No test exercises deletion of a broken instance with stale PID files.
2. **`shutdownHostagent` force-stop timeout path** — Acknowledged in a comment at line 520. Requires a hostagent that ignores shutdown signals.
3. **`StopWithWait` cross-console failure on Windows** — Comment at lines 346-350 describes the scenario. Self-heals via `killOrphanedHostagent` on next start.
4. **Concurrent SSH key generation** — Unit tests cover crash recovery but not concurrent callers (low risk: single controller per instance).

---

## Documentation Assessment

No gaps. Comments explain platform-specific behavior clearly. The `process_windows.go` and `process_unix.go` docstrings mirror each other's platform notes, aiding cross-platform understanding.

---

## Acknowledged Limitations

1. `limavm_controller.go:504-505` — *"TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N x gracefulShutdownTimeout in the worst case."* This PR adds force-stop to the timeout path, making the worst case slightly worse per-hostagent.

2. `process_windows.go:61-75` — *"Platform asymmetry: if the target process is already dead, taskkill /T returns exit code 128 (treated as success), but surviving children (e.g., SSH port forwarders) are not killed because taskkill cannot traverse the tree from a dead parent."* Acceptable: `KillTree` runs while the parent is alive; orphaned port forwarders are harmless.

3. `limavm_lifecycle.go:520-522` — *"Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals."* The orphaned hostagent test exercises forced stop indirectly.

4. `service.go:346-350` — *"On Windows, Interrupt uses GenerateConsoleCtrlEvent, which only works when caller and target share a console."* Falls back to `TerminateProcess`; hostagents self-heal on next start.

---

## Agent Performance Retro

### [Claude]

Claude Opus 4.6 completed in 4 min 17 s and produced the most detailed review (173 lines). It identified a valid but low-severity SIGINT-vs-SIGTERM asymmetry, raised several code-clarity suggestions, and delivered thorough design observations. Claude traced through callers and callees (e.g., confirming the apiserver signal handler accepts both SIGINT and SIGTERM). It correctly downgraded findings where the code is functionally correct but documentation-worthy.

Claude did not flag the StatusBroken PID recycling issue (Codex's unique insight) or the klog verbosity warning spam (Gemini's unique insight).

### [Codex]

Codex GPT 5.4 completed in 6 min 13 s with a concise 93-line review. It raised the strongest finding: the StatusBroken deletion path killing recycled PIDs, with a well-argued trace through `ReadPIDFile` behavior asymmetry on Windows. Codex also noted the SSH key locking gap.

Codex missed the SIGINT/SIGTERM asymmetry in `service.go` (which Claude caught) and the klog verbosity issue (which Gemini caught). Its coverage summary marks `service.go` and `process_windows.go` as "Reviewed, no issues" — the former is debatable given Claude's suggestion about the same file.

### [Gemini]

Gemini 3.1 Pro completed in 6 min 14 s with 124 lines. It raised a practical suggestion about klog verbosity warnings for valid logrus levels — a quality-of-life improvement that both other agents missed. Its important finding about sequential shutdown is valid but flags a pre-existing TODO rather than a regression.

Gemini marked `limavm_lifecycle.go` as "Reviewed, no issues" despite the StatusBroken PID recycling gap that Codex found — a coverage miss.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 4:17 | 6:13 | 6:14 |
| Critical | 0 | 0 | 0 |
| Important | 1 (downgraded to suggestion) | 1 | 1 (pre-existing TODO) |
| Suggestion | 4 | 1 | 1 |
| Design observations | 2 concerns, 4 strengths | 2 strengths | 3 strengths |
| False positives | 0 | 0 | 0 |
| Unique insights | SIGINT asymmetry, log message, comment accuracy | StatusBroken PID recycling | klog verbosity mapping |
| Files reviewed | 20/20 | 20/20 | 20/20 |
| Coverage misses | StatusBroken gap, klog gap | SIGINT asymmetry | StatusBroken gap |

Codex provided the highest-value unique insight (the StatusBroken PID recycling finding). Claude delivered the broadest coverage with the most suggestions and design observations. Gemini caught a practical usability issue the others missed. The multi-agent approach justified itself: no single agent found all three unique insights.

---

## Appendix: Original Reviews

### Claude Opus 4.6

Good — on both POSIX and Windows, the apiserver's signal handler handles `os.Interrupt`. The change from `Kill` (SIGTERM) to `Interrupt` (SIGINT) is safe on both platforms. On Unix both are caught; on Windows only `os.Interrupt` is caught, and the old `Kill` (TerminateProcess) would have bypassed signal handlers entirely.

Now I have comprehensive understanding of all changes. Let me compile the review.

---

## Executive Summary

This PR makes Lima BATS tests run on Windows by fixing cross-platform process signaling (CREATE_NEW_PROCESS_GROUP + CTRL_BREAK_EVENT), replacing Lima's StopForcibly with a Windows-safe implementation, pre-generating SSH keys to avoid MSYS2 path conversion issues, and switching CI from WSL2-based BATS to MSYS2-based BATS. The implementation is thorough, with careful handling of Windows process semantics and WSL2 deadlock prevention. I found one important issue with the graceful-stop codepath on Windows and a few suggestions.

**Verdict: Approve with minor changes** — the important issue is in a Windows-specific edge case that has a self-healing fallback. The overall design is solid.

## Findings

### Critical Issues

None.

### Important Issues

1. **`StopWithWait` sends SIGINT to the RDD service on Unix, changing shutdown semantics** — `pkg/service/cmd/service.go:351`

```go
if err := process.Interrupt(pid); err != nil {
    if err := process.Kill(pid); err != nil {
        return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
    }
}
```

On Unix, the old code sent SIGTERM (via `process.Kill`), which is the conventional process termination signal. The new code sends SIGINT first. Both signals are handled identically by the apiserver signal handler (`signal_posix.go:27` registers both `os.Interrupt` and `syscall.SIGTERM`), so this is functionally correct. However, if `Interrupt` (SIGINT) *succeeds* on Unix, `Kill` (SIGTERM) is never sent, meaning `StopWithWait` now only sends SIGINT on Unix. This works because the signal handler catches it, but there's a subtle issue: on Unix, `process.Interrupt` always succeeds (returns nil) for any valid PID, so the `Kill` fallback never triggers. The fallback was designed for Windows where `GenerateConsoleCtrlEvent` can fail cross-console. On Unix this is fine — just noting the asymmetry is intentional. (important, regression)

Fix: Consider adding a comment clarifying that on Unix the SIGINT always succeeds (so the Kill fallback is Windows-only), or restructure to be explicit about per-platform behavior:

```diff
+	// On Unix, Interrupt (SIGINT) always succeeds for a valid PID; the Kill
+	// fallback only fires on Windows when caller and target lack a shared console.
 	if err := process.Interrupt(pid); err != nil {
```

### Suggestions

1. **`shutdownHostagent` calls `forceStop` and `waitAfterKill` when `signalHostagent` returns false, but `forceStop` re-inspects the instance via `store.Inspect` even when `inst` is already available** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:528-533`

```go
	} else {
		logger.Info("Could not signal hostagent, killing process directly")
		r.killHostagent(name)
		forceStop()
		waitAfterKill()
	}
```

When `signalHostagent` returns false (no watcher exists), `killHostagent` also returns immediately (same nil check at line 223). The code then falls through to `forceStop()` which does the real work via `stopInstanceForcibly`. This is correct behavior — if no watcher exists, the process handle isn't available, so the fallback to PID-from-store is the right approach. The comment "killing process directly" is slightly misleading since `killHostagent` is a no-op in the same condition. (suggestion, regression)

Fix: Adjust the log message:

```diff
-		logger.Info("Could not signal hostagent, killing process directly")
-		r.killHostagent(name)
+		logger.Info("No watcher for hostagent, forcing stop via stored PIDs")
+		r.killHostagent(name) // no-op if watcher absent; forceStop uses stored PIDs
```

2. **`terminateWSL2Distro` inherits parent context deadline, potentially shortening the 10s timeout** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:658-659`

```go
func terminateWSL2Distro(ctx context.Context, logger logr.Logger, instName string) {
	wslCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
```

When called from `stopInstanceForcibly` within `shutdownHostagent`'s `forceStop` closure (line 505), the parent context has a 30-second timeout. After `KillTree` runs, the remaining time for `terminateWSL2Distro`'s 10-second timeout may be less than 10 seconds. In practice this is unlikely to matter since `wsl.exe --terminate` is fast when it works, but it's worth noting the nested timeout may be capped. (suggestion, gap)

Fix: No code change needed — the 30s outer timeout minus KillTree time will almost always exceed 10s. A comment noting the dependency would suffice.

3. **`_wsl2_template` comment in `vm_template.bash` says "RDD_WSL_DISTRO" but the header also mentions "RDD_VM_TYPE"** — `bats/helpers/vm_template.bash:9`

```bash
#   VM_TEMPLATE=$(vm_template_running) # for limavm-running tests (supports RDD_VM_TYPE)
```

`vm_template_running` only passes through `RDD_VM_TYPE` for the Unix template (line 88: `${RDD_VM_TYPE:+vmType: ${RDD_VM_TYPE}}`). On Windows, `_wsl2_template` always sets `vmType: wsl2` and ignores `RDD_VM_TYPE`. The comment's mention of RDD_VM_TYPE support is accurate only for non-Windows platforms, which could confuse someone testing on Windows. (suggestion, regression)

Fix: Clarify the comment:

```diff
-#   VM_TEMPLATE=$(vm_template_running) # for limavm-running tests (supports RDD_VM_TYPE)
+#   VM_TEMPLATE=$(vm_template_running) # for limavm-running tests (supports RDD_VM_TYPE on Unix)
```

4. **`KillTree` on Windows returns `nil` for success but also returns `nil` for exit code 128 (process already dead) — callers cannot distinguish** — `pkg/util/process/process_windows.go:78-86`

```go
func KillTree(ctx context.Context, pid int) error {
	err := exec.CommandContext(ctx, "taskkill", "/F", "/T", "/PID", strconv.Itoa(pid)).Run()
	if err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
			return nil
		}
	}
	return err
}
```

The doc says "Returns nil if the process no longer exists" which is the correct intent. However, `stopInstanceForcibly` (line 600) sets `allKilled = false` on error and skips file cleanup. This means if one PID is alive (killed successfully) and another is already dead (exit 128 → nil), both return nil and `allKilled` stays true, which is the correct behavior. The design is sound. (suggestion, enhancement)

Fix: No action needed. The docstring is clear about the semantics.

## Design Observations

### Concerns

**Sequential `shutdownAllHostagents` with added `stopInstanceForcibly`** (future) — The TODO at line 504 already notes that the sequential loop can take N × `gracefulShutdownTimeout` in the worst case. This PR adds `stopInstanceForcibly` (WSL2 termination + file cleanup) to the non-graceful path (lines 519-528), adding another ~30 seconds per failed hostagent. With N hostagents that all timeout, worst case is now N × (30s graceful + 30s force + 10s wsl terminate). Parallelizing the wait loop (as the TODO suggests) would bound this to a single timeout window regardless of N.

### Strengths

**Careful PID recycling defense** — The `handleDeletion` code (lines 55-58) zeroes out `DriverPID` and `HostAgentPID` before passing to `limainstance.Delete`, preventing Lima's internal `StopForcibly` from killing unrelated processes. This is a subtle Windows concern handled proactively.

**Conditional file cleanup** — `stopInstanceForcibly` only removes PID/socket/tmp files when all `KillTree` calls succeed (lines 624-630). Since Lima's `store.Inspect` derives `StatusStopped` from missing PID files, removing them while a process is still alive would mask a running process. This is a well-considered invariant.

**SSH key generation with crash-safe temp-and-rename** — `ensureSSHKeysAt` generates keys to a temp path, then renames. It handles all partial-state combinations (missing pub key, corrupt private key, stale temps, orphaned pub key) and cleans up on every failure path. The unit tests comprehensively cover each state.

**Platform-appropriate signal semantics** — Using CTRL_BREAK_EVENT targeted at process groups (rather than Lima's console-wide GenerateConsoleCtrlEvent) solves the root cause of the service being killed when stopping a hostagent. The `CREATE_NEW_PROCESS_GROUP` flag in `SetGroup` is the correct Windows mechanism.

## Testing Assessment

Test coverage is good. The BATS tests cover the key scenarios: VM lifecycle, crash recovery, orphaned hostagent recovery, graceful shutdown, template changes, and cleanup. Specific untested scenarios, ranked by risk:

1. **`shutdownHostagent` force-stop timeout path** — Acknowledged in a comment at line 520-522. Would require a hostagent that ignores shutdown signals.
2. **`KillTree` on Windows when parent is already dead** — The documented taskkill /T limitation where orphaned children survive. Acknowledged in the `KillTree` docstring.
3. **`StopWithWait` on Windows when `Interrupt` fails cross-console** — The comment at lines 346-350 describes this, but no integration test exercises it. Self-healing via `killOrphanedHostagent` on next start.

The SSH key unit tests (`ssh_keys_test.go`) are thorough, covering all partial-state combinations.

## Documentation Assessment

Comments throughout are comprehensive and explain platform-specific behavior well. The `process_windows.go` and `process_unix.go` docstrings mirror each other's platform notes, which aids cross-platform understanding. No external documentation updates appear needed.

## Acknowledged Limitations

1. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.` — This PR adds force-stop to the timeout path, making the worst case slightly worse per-hostagent.

2. **`taskkill /T` cannot traverse dead parents** — `process_windows.go:70-75`: `// Platform asymmetry: if the target process is already dead, taskkill /T returns exit code 128 (treated as success), but surviving children (e.g., SSH port forwarders) are not killed...` — Acknowledged as acceptable; orphaned port forwarders are harmless.

3. **`shutdownHostagent` force-stop path not directly tested** — `limavm_lifecycle.go:520-522`: `// Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals.` — The orphaned hostagent test exercises forced stop indirectly.

4. **`StopWithWait` cross-console limitation on Windows** — `service.go:346-350`: `// On Windows, Interrupt uses GenerateConsoleCtrlEvent, which only works when caller and target share a console.` — Falls back to TerminateProcess; hostagents self-heal on next start.

## Coverage Summary

| File | Status |
|------|--------|
| `.github/actions/spelling/allow.txt` | Trivial |
| `.github/workflows/bats.yaml` | Reviewed, no issues |
| `bats/helpers/load.bash` | Trivial |
| `bats/helpers/os.bash` | Reviewed, no issues |
| `bats/helpers/utils.bash` | Reviewed, no issues |
| `bats/helpers/vm_template.bash` | **Finding S3** |
| `bats/tests/33-lima-controllers/limavm-instance.bats` | Reviewed, no issues |
| `bats/tests/33-lima-controllers/limavm-running.bats` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | **Findings S1, S2** |
| `pkg/controllers/lima/limavm/controllers/ssh_keys.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` | Reviewed, no issues |
| `pkg/external/main.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | **Finding I1** |
| `pkg/util/process/process_unix.go` | Reviewed, no issues |
| `pkg/util/process/process_windows.go` | **Finding S4** |
| `scripts/collect-bats-logs.sh` | Reviewed, no issues |

### Codex GPT 5.4

**Verdict**: Needs changes

### Executive Summary
This PR substantially improves Windows Lima lifecycle handling: isolating hostagents into their own process groups, moving BATS to MSYS2, and adding Windows-specific SSH-key/template handling all make the design more robust. I found one important correctness gap in the new delete path on Windows-stale-PID states, plus one lower-risk suggestion around the new SSH-key pre-generation helper.

### Findings

**Critical Issues**  
None.

**Important Issues**

1. **Broken-instance deletion can still kill recycled Windows PIDs** — [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:47](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L47) (important, gap)

```go
if existingInst.Status == limatype.StatusRunning || existingInst.Status == limatype.StatusBroken {
    stopInstanceForcibly(ctx, logger, existingInst)
}
// Clear PIDs so Lima's Delete → StopForcibly does not kill
existingInst.DriverPID = 0
existingInst.HostAgentPID = 0
```

The new guard at lines 47-48 calls `stopInstanceForcibly()` before the defensive PID clearing at lines 55-58, but `stopInstanceForcibly()` immediately kills `inst.DriverPID` / `inst.HostAgentPID` at lines 598-603. On Windows, Lima's `store.ReadPIDFile` treats any live recycled PID as valid, so a stopped instance with stale `.pid` files can surface as `StatusBroken`; in that state, deletion now still targets those recycled PIDs before zeroing them. That undermines the new protection comment at lines 55-56 and can terminate an unrelated host process during `limavm` deletion.

Fix: only use PID-based force-stop on the `StatusRunning` path, or validate that the PIDs still belong to this instance before killing them. For the Windows `StatusBroken` path, the safe fallback is to `terminateWSL2Distro()`, remove stale PID/socket files, zero the PIDs, and then call `limainstance.Delete()`.

```diff
- if existingInst.Status == limatype.StatusRunning || existingInst.Status == limatype.StatusBroken {
+ if existingInst.Status == limatype.StatusRunning {
     stopInstanceForcibly(ctx, logger, existingInst)
- } else if existingInst.VMType == limatype.WSL2 {
+ } else if existingInst.VMType == limatype.WSL2 {
     terminateWSL2Distro(ctx, logger, existingInst.Name)
   }
```

**Suggestions**

1. **Windows SSH-key pre-generation dropped Lima’s startup lock** — [pkg/controllers/lima/limavm/controllers/ssh_keys.go:41](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys.go#L41) (suggestion, enhancement)

```go
_ = os.Remove(pubPath)
configDir := filepath.Dir(privPath)
if err := os.MkdirAll(configDir, 0o700); err != nil {
    return fmt.Errorf("could not create %q: %w", configDir, err)
}
tmpPath := privPath + ".tmp"
```

The new helper in lines 41-60 uses a fixed temp path (`user.tmp`) with no inter-process lock, unlike Lima's upstream `DefaultPubKeys()` path which wraps generation in a config-dir lock. If two controller processes race through startup, one can delete the other's temp files or fail on rename, producing a spurious startup failure or an inconsistent keypair state.

Fix: wrap the generate/rename section in Lima's `lockutil.WithDirLock(configDir, ...)`, or switch to a per-process temp file and retry on rename conflicts.

### Design Observations

**Strengths**
- Using `CREATE_NEW_PROCESS_GROUP` plus a targeted `process.Interrupt()` is the right abstraction boundary; it fixes the console-wide `CTRL_BREAK_EVENT` blast radius instead of papering over symptoms.
- Replacing Lima's `filepath.Glob` tmp-file cleanup with `os.ReadDir` in [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:631](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L631) is a good Windows-specific hardening step because it removes the bracket-in-path bug class entirely.

### Testing Assessment
1. No test covers the highest-risk stale/recycled-PID state on Windows: a `StatusBroken` instance with leftover `.pid` files during delete/restart.
2. The forced-stop fallback in `shutdownHostagent()` is explicitly untested; current BATS only proves recovery, not that the timeout path itself works.
3. The new SSH-key helper has unit coverage for crash leftovers and corrupt keys, but not for concurrent callers.

### Documentation Assessment
None.

### Acknowledged Limitations
- “Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals...” in [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:520](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L520). This limitation matters more now because more stop paths funnel through `shutdownHostagent()`.
- “TODO: Wait on all hostagents in parallel instead of sequentially...” in [pkg/controllers/lima/limavm/controllers/limavm_controller.go:504](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_controller.go#L504). This change does not worsen it materially.

### Coverage Summary
- `.github/actions/spelling/allow.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Reviewed, no issues
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Important 1
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Suggestion 1
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Reviewed, no issues
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/util/process/process_unix.go` — Reviewed, no issues
- `pkg/util/process/process_windows.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues

### Gemini 3.1 Pro

### Executive Summary

The PR comprehensively adapts the Lima BATS tests to run on Windows using MSYS2 and addresses multiple platform-specific lifecycle bugs. Key changes include isolating hostagents in their own process groups, gracefully handling WSL2 shutdown deadlocks, and pre-generating SSH keys to sidestep cygpath conversion issues. The codebase demonstrates high quality, robust handling of partial states, and excellent context management. A few minor edge cases regarding timeout coordination during RDD service shutdown and klog verbosity mapping remain.

### Critical Issues
None.

### Important Issues
1. **Sequential hostagent shutdown exceeds `StopWithWait` timeout** — `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504-526`

```go
	// TODO: Wait on all hostagents in parallel instead of sequentially; with the
	// current loop, the total wait is N × gracefulShutdownTimeout in the worst case.
	for name, state := range states {
		graceful := true
		select {
		case <-state.procExited:
		case <-time.After(gracefulShutdownTimeout):
```

In `shutdownAllHostagents`, the controller sends an interrupt signal to all hostagents simultaneously but waits for their termination sequentially. The `time.After(gracefulShutdownTimeout)` (30 seconds) is recreated for each hostagent in the loop. If multiple hostagents ignore the interrupt and require a forced kill, the total wait time becomes `N × 30s`. This conflicts with `StopWithWait` in `pkg/service/cmd/service.go:364`, which enforces a strict 60-second timeout for the entire RDD service shutdown. If the controller takes 90 seconds to shut down 3 hostagents, the 60-second limit will forcefully kill the RDD service via `TerminateProcess`, leaving the remaining hostagents orphaned.

Fix: Wait for all hostagents concurrently using a `sync.WaitGroup` to ensure the total elapsed time never exceeds a single `gracefulShutdownTimeout` (resolving the TODO comment).

```go
	var wg sync.WaitGroup
	for name, state := range states {
		wg.Add(1)
		go func(name string, state *instanceState) {
			defer wg.Done()
			graceful := true
			select {
			case <-state.procExited:
			case <-time.After(gracefulShutdownTimeout):
				graceful = false
				if state.cmd != nil && state.cmd.Process != nil {
					_ = process.KillTree(context.Background(), state.cmd.Process.Pid)
				}
				<-state.procExited
			}
			if !graceful {
				logger := ctrl.Log.WithName("shutdownAllHostagents")
				cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
				inst, err := store.Inspect(cleanupCtx, name)
				if err == nil && inst != nil {
					stopInstanceForcibly(cleanupCtx, logger, inst)
				}
				cancel()
			}
			state.cancel()
			r.instanceStatesMu.Lock()
			delete(r.instanceStates, name)
			r.instanceStatesMu.Unlock()
		}(name, state)
	}
	wg.Wait()
```
(important, gap)

### Suggestions
1. **Prevent stderr spam for valid `logrus` log levels** — `pkg/external/main.go:277-280`

```go
	case "", "info":
		// info is the default klog verbosity; nothing to change.
	default:
		// klog has no warn/error levels; unsupported values fall through to
		// the default verbosity (v=0), which is equivalent to info-only.
		fmt.Fprintf(os.Stderr, "RDD_LOG_LEVEL=%q is not supported for klog; valid values are trace, debug, info\n", level)
```

The main RDD service uses `logrus`, which natively supports levels like `warn`, `warning`, `error`, and `fatal`. However, external controllers use `klog` and execute `setVerbosityFromEnv()`. If a user configures a valid logrus level such as `RDD_LOG_LEVEL=error` for the main service, the `default` case will execute for all spawned controllers, falling back to `info` verbosity and printing a warning to `stderr`. Because the main service spawns four external controllers, this results in four duplicate warning messages in the logs every time the service starts, despite the configuration being perfectly valid for the system as a whole.

Fix: Silently accept other valid `logrus` levels and map them to `info` without a warning.

```diff
-	case "", "info":
+	case "", "info", "warn", "warning", "error", "fatal":
		// info is the default klog verbosity; nothing to change.
+		// klog doesn't support suppressing info logs, so higher levels fall through to info.
	default:
```
(suggestion, enhancement)

### Design Observations

**Strengths**
- **Robust Context Usage:** The author strictly adheres to context propagation guidelines. New functions correctly accept contexts, and `context.Background()` is utilized exclusively for post-shutdown cleanup loops or setup phases, complete with clear explanatory comments justifying the detachment.
- **Handling of Partial File System State:** The `ensureSSHKeysAt` implementation handles file system edge cases well, appropriately mitigating the lack of atomic file replacements on Windows and effectively cleaning up stale or corrupt files from previously interrupted attempts.
- **Platform Resilience:** Re-architecting the hostagent lifecycle to rely on dedicated process groups cleanly prevents the RDD service from self-destructing on Windows when executing system interrupts across shared console sessions.

### Testing Assessment
1. The BATS framework modifications exhibit strong platform resilience, specifically intercepting and preventing WSL2 daemon deadlocks (`wsl.exe --unregister`) using isolated bash timeouts.
2. `killOrphanedHostagent` graceful shutdown pathways. The test suite proves that hostagents safely survive a forceful service termination and recover via the lifecycle reconciliation routines upon restarting. 

### Documentation Assessment
No gaps identified. The code comments clearly articulate OS-specific anomalies (`GenerateConsoleCtrlEvent`, `taskkill` traversing limitations).

### Acknowledged Limitations
- `pkg/util/process/process_windows.go:61` — "Platform asymmetry: if the target process is already dead, taskkill /T returns exit code 128 (treated as success), but surviving children (e.g., SSH port forwarders) are not killed because taskkill cannot traverse the tree from a dead parent." (This does not manifest as a bug here since the hostagent lifecycle executes `KillTree` before the parent process decays).
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504` — "TODO: Wait on all hostagents in parallel instead of sequentially..." (Addressed by Finding 1).

### Coverage Summary

- `.github/actions/spelling/allow.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Trivial
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Finding 1
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Reviewed, no issues
- `pkg/external/main.go` — Finding 2
- `pkg/service/cmd/service.go` — Finding 1
- `pkg/util/process/process_unix.go` — Reviewed, no issues
- `pkg/util/process/process_windows.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues
