# Deep Review: 20260324-010506-pr-242

| | |
|---|---|
| **Date** | 2026-03-24 01:05 |
| **Round** | 11 (of PR #242) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `8c3ed7b` Fix Windows process signaling for Lima hostagent lifecycle<br>`16c73c0` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`3ab20a2` Enable Lima BATS tests on Windows<br>`e3685b8` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`1f174c6` Run Windows BATS tests via MSYS2 instead of WSL2<br>`aea03ce` Honor RDD_LOG_LEVEL in external controllers<br>`8b7e51c` Address PR review feedback<br>`ae2aa0e` Address deep-review findings for PR #242<br>`911b57e` Address review #8 findings for PR #242<br>`26d07c0` Address review #9 findings for PR #242<br>`6993651` Address review #10 findings for PR #242 |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — one important regression in SSH key recovery can destroy a valid private key on transient ssh-keygen failure |
| **Wall-clock time** | `16 min 31 s` |

---

## Executive Summary

This PR migrates Windows CI from WSL2-based BATS to MSYS2, fixes Windows process signaling so hostagent shutdown no longer kills the RDD service, adds WSL2 distro lifecycle management to prevent `wslservice.exe` deadlocks, and pre-generates SSH keys to avoid MSYS2 path conversion issues. The code is well-structured with thorough comments explaining platform-specific behavior. One important regression exists in the SSH key recovery path where a transient `ssh-keygen` failure deletes the only copy of the private key, breaking SSH access to existing VMs.

## Critical Issues

None.

---

## Important Issues

1. **SSH key recovery deletes valid private key on transient ssh-keygen failure** — `pkg/controllers/lima/limavm/controllers/ssh_keys.go:37-40` [Codex GPT 5.4] (important, regression)

```go
// Public key missing — try to derive it from the existing private key
// to preserve SSH access to existing VMs. Fall back to full regeneration
// if the private key is corrupt.
if pub, err := exec.CommandContext(ctx, "ssh-keygen", "-y", "-f", privPath).Output(); err == nil {
	return os.WriteFile(pubPath, pub, 0o644)
}
_ = os.Remove(privPath)
```

When the public key is missing but the private key is valid, `ensureSSHKeysAt` tries to derive the public key via `ssh-keygen -y` (line 37). If that command fails for *any* reason — `ssh-keygen` missing from `PATH`, a transient execution error, or the 30-second context timeout — the code deletes the valid private key at line 40 before falling through to full regeneration. Because `SetupWithManager()` calls `ensureSSHKeys()` on every controller startup (line 453 of `limavm_controller.go`), a single transient failure destroys the only identity that existing VMs trust.

The regeneration at lines 52–74 creates a fresh keypair, but VMs provisioned with the old key become unreachable. In this pre-release project VMs can be recreated, which limits the blast radius, but the fix is straightforward.

Fix: Do not delete the private key until a replacement is confirmed in place. Remove the eager `os.Remove(privPath)` at line 40, and let the rename sequence at lines 65–74 atomically replace the old key only after generation succeeds.

```diff
 	if pub, err := exec.CommandContext(ctx, "ssh-keygen", "-y", "-f", privPath).Output(); err == nil {
 		return os.WriteFile(pubPath, pub, 0o644)
 	}
-	_ = os.Remove(privPath)
 }
```

The generate-to-temp-then-rename pattern already handles the case where `privPath` exists on Windows (line 44 removes the orphaned pub key, and the rename at line 70 overwrites the private key only after the new one is ready). The only additional consideration: on Windows `os.Rename` fails if the destination exists, so add an `os.Remove(privPath)` immediately before the rename at line 70, matching the pattern already used for `pubPath` at line 44.

---

## Suggestions

1. **Dead `killHostagent` call in shutdownHostagent fallback** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:556-560` [Claude Opus 4.6] (suggestion, regression)

```go
} else {
	logger.Info("No watcher for hostagent, forcing stop via stored PIDs")
	r.killHostagent(name) // no-op if watcher absent; forceStop uses stored PIDs
	forceStop()
	waitAfterKill()
}
```

The `else` branch at line 556 runs when `signalHostagent` returns false (no watcher or no process handle). `killHostagent` (line 558) checks the same watcher map and is guaranteed to be a no-op in this branch. The inline comment acknowledges this, but the dead call confuses readers.

Fix: Remove the `killHostagent` call.

```diff
 } else {
 	logger.Info("No watcher for hostagent, forcing stop via stored PIDs")
-	r.killHostagent(name) // no-op if watcher absent; forceStop uses stored PIDs
 	forceStop()
 	waitAfterKill()
 }
```

2. **`taskkill` exit code 128 is a magic number** — `pkg/util/process/process_windows.go:80-82` [Claude Opus 4.6] (suggestion, regression)

```go
var exitErr *exec.ExitError
if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
	return nil
}
```

The exit code 128 from `taskkill` for a nonexistent process is empirically observed, not documented by Microsoft. A named constant conveys intent at every use site.

Fix: Extract a named constant.

```diff
+// taskkillExitNotFound is the exit code taskkill returns when the target
+// process does not exist. Not officially documented by Microsoft but
+// consistently observed across Windows 10 and 11.
+const taskkillExitNotFound = 128
+
 func KillTree(ctx context.Context, pid int) error {
 	err := exec.CommandContext(ctx, "taskkill", "/F", "/T", "/PID", strconv.Itoa(pid)).Run()
 	var exitErr *exec.ExitError
-	if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
+	if errors.As(err, &exitErr) && exitErr.ExitCode() == taskkillExitNotFound {
 		return nil
 	}
```

3. **`waitForInstanceStopped` delays 500ms before first check** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:594-601` [Claude Opus 4.6] (suggestion, regression)

```go
func waitForInstanceStopped(ctx context.Context, name string) bool {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
```

The function waits for the first ticker tick before polling. If the hostagent exits quickly after receiving the signal, this adds 500ms of unnecessary latency. Not a correctness issue (the 30-second context is generous), but checking once before entering the loop makes the function more responsive.

Fix: Add an initial check before the ticker loop.

4. **Unquoted heredoc in `_unix_template` vs quoted in `_wsl2_template`** — `bats/helpers/vm_template.bash:57` [Claude Opus 4.6] (suggestion, regression)

`_unix_template` uses `<<YAML` (unquoted, enables `${RDD_VM_TYPE:+...}` expansion at line 65) while `_wsl2_template` uses `<<'YAML'` (quoted, no expansion). The difference is intentional but a reader comparing the two functions might think the quoting inconsistency is accidental. A brief comment would clarify.

---

## Design Observations

### Concerns

**Stale `inst` in `shutdownHostagent` forceStop closure** — (future) [Claude Opus 4.6, Gemini 3.1 Pro]

When `shutdownHostagent` receives a non-nil `inst` (e.g., from `stopInstance` at line 478), the `forceStop` closure captures that pointer at line 521. After the 30-second graceful timeout, the hostagent's PID may have been freed and recycled (especially on Windows). `stopInstanceForcibly` then kills by PID without identity verification.

Gemini proposed a fresh `store.Inspect` in the force path as a fix. However, `store.Inspect` reads from the same on-disk PID files the hostagent wrote. If the hostagent crashed without cleaning up (the reason we're in the force path), `store.Inspect` returns the same stale PIDs. The fresh inspect helps only in a narrow race window where the hostagent exits cleanly between the timeout check and the force-kill call.

The code already acknowledges this at lines 215–224 and identifies the real fix: process identity validation or Windows Job Objects. The self-healing restart limits blast radius.

### Strengths

- Replacing Lima's `SysKill` (which broadcasts `CTRL_BREAK_EVENT` to the entire console) with targeted `process.Interrupt` + `process.KillTree` eliminates the root cause of hostagent shutdown killing the RDD service.
- The SSH key pre-generation in `ensureSSHKeysAt` handles every partial-state combination with clean error recovery. The generate-to-temp-then-rename pattern prevents blocked retries from partial keys.
- `stopInstanceForcibly` conditions disk cleanup (lock and tmp file removal) on `allKilled` success, preventing status desynchronization where removing PID files while the process is still alive would make `store.Inspect` falsely report `StatusStopped`.
- The WSL2 distro termination before `--unregister` with the BATS-level `taskkill /F /IM wslservice.exe` fallback is a practical defense against a kernel-level deadlock that cannot be fixed from userspace.
- `setVerbosityFromEnv` cleanly bridges the service's logrus string log levels to the external controllers' klog numeric flags without adding abstraction overhead.

---

## Testing Assessment

SSH key generation has thorough unit tests covering all six on-disk state combinations. BATS integration tests were updated for cross-platform process management (`assert_process_alive`, `process_kill`). Untested scenarios ranked by risk:

1. **Valid private key + missing public key + `ssh-keygen -y` failure** — the path that deletes the private key (Finding 1). No test exercises this.
2. **`shutdownHostagent` graceful-timeout to forceStop fallback** — acknowledged in a comment at line 548. Requires a hostagent that ignores shutdown signals.
3. **PID recycling on Windows during `stopInstanceForcibly`** — acknowledged at lines 215–224. Requires precise timing control.
4. **`wslservice.exe` deadlock during `wsl --unregister`** — the BATS cleanup test recovers from this, but the controller-side `terminateWSL2Distro` timeout path is not directly tested.

---

## Documentation Assessment

No gaps. The PR description explains the MSYS2 migration rationale. Code comments document platform-specific behavior and failure modes. `environment.md` already covers `RDD_LOG_LEVEL`.

---

## Commit Structure

The 11 commits follow a logical progression: process signaling fix, WSL2 lifecycle, SSH keys, BATS enablement, CI migration, log propagation, then review feedback. The five "Address review" fixup commits are acceptable — they address PR feedback and keep review history readable across rebases.

---

## Acknowledged Limitations

1. `limavm_controller.go:504` — `"TODO: Wait on all hostagents in parallel instead of sequentially"`. Pre-existing, not worsened.
2. `limavm_lifecycle.go:216-224` — Windows PID recycling risk for StatusBroken instances. Acknowledged; the proper fix is process identity validation or Job Objects.
3. `process_windows.go:70-75` — `taskkill /T` cannot traverse children of dead parents. Acknowledged as acceptable: orphaned port forwarders cannot rebind their ports.
4. `limavm_lifecycle.go:472-473` — `"TODO: Non-blocking stop"`. Pre-existing.
5. `service/cmd/service.go:347` — On Windows, `process.Interrupt` via `GenerateConsoleCtrlEvent` fails when processes lack a shared console, falling back to `Kill()`.

---

## Agent Performance Retro

### [Claude] Opus 4.6

- **Duration**: 4 min 23 s
- **Unique contributions**: Identified the dead `killHostagent` call (Suggestion 1), the `taskkill` exit code magic number (Suggestion 2), the `waitForInstanceStopped` initial delay (Suggestion 3), and the heredoc quoting inconsistency (Suggestion 4). Raised the stale PID concern as a design observation with correct analysis of why a fresh `store.Inspect` would not help.
- **Accuracy**: All findings verified. No false positives.
- **Depth**: Good — traced into callers and callees, checked coverage of all changed files. Correctly analyzed the `signalHostagent`/`killHostagent` state interaction.
- **Coverage**: Complete. All 21 files reviewed.
- **Signal-to-noise**: High. Four targeted suggestions, one important finding, one design observation. No filler.

### [Codex] GPT 5.4

- **Duration**: 7 min 4 s
- **Unique contributions**: Found the SSH key private key deletion regression (Important 1) — the most significant finding in this review. No other agent caught this.
- **Accuracy**: The finding is verified and correct. No false positives.
- **Depth**: Focused — traced the `ensureSSHKeys` call chain from `SetupWithManager` through the failure paths. Did not explore lifecycle or process management code as deeply as the other agents.
- **Coverage**: Complete. All 21 files reviewed.
- **Coverage misses**: None — but marked `limavm_lifecycle.go` as "Reviewed, no issues" while both other agents found issues or concerns there.
- **Signal-to-noise**: Very high. One finding, zero noise, and it was the most important one.

### [Gemini] 3.1 Pro

- **Duration**: 4 min 32 s
- **Unique contributions**: None — the stale PID concern was also raised by Claude. Gemini rated it as an IMPORTANT regression while Claude correctly assessed it as a future design concern.
- **Accuracy**: The stale PID finding is technically correct in identifying the concern, but the proposed fix (fresh `store.Inspect`) is ineffective because the PID files on disk contain the same stale PIDs. This makes the fix suggestion a false positive.
- **Depth**: Moderate — traced the forceStop closure but did not verify whether `store.Inspect` would return different data.
- **Coverage**: Complete. All 21 files reviewed.
- **Coverage misses**: Marked `ssh_keys.go` as "Reviewed, no issues" — missed the most important finding in the review (Finding 1).
- **Signal-to-noise**: Moderate. One finding with an ineffective fix, good strength observations.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 4:23 | 7:04 | 4:32 |
| Critical | 0 | 0 | 0 |
| Important | 1 (demoted to suggestion) | 1 | 1 (demoted to design obs) |
| Suggestion | 4 | 0 | 0 |
| Design observations | 1 | 2 | 3 |
| False positives | 0 | 0 | 1 (ineffective fix) |
| Unique insights | 4 | 1 | 0 |
| Files reviewed | 21 | 21 | 21 |
| Coverage misses | 0 | 1 (lifecycle) | 1 (ssh_keys) |

Codex found the single most important issue (SSH key deletion) that both other agents missed, validating the multi-agent approach. Claude provided the broadest set of actionable suggestions and the most accurate severity calibration. Gemini's stale PID concern was directionally correct but the proposed fix was not viable.

---

## Appendix: Original Reviews

### Claude Opus 4.6 Review

The environment variable is inherited naturally by child processes via the OS. Since `RDD_LOG_LEVEL` is set in the parent process's environment and external controllers are spawned as subprocesses, they inherit it. The `setVerbosityFromEnv` function reads it on startup.

Now I have enough context to write the complete review. Let me compile my findings.

---

## Executive Summary

This PR migrates Windows CI from WSL2-based BATS to MSYS2, fixes Windows process signaling so that stopping hostagents no longer kills the RDD service, and adds WSL2 distro lifecycle management to prevent `wslservice.exe` deadlocks. The code is well-structured with thorough comments explaining platform-specific behavior and failure modes. Key findings are around a minor correctness issue in the `shutdownHostagent` fallback path and the documented-but-real PID recycling gap on Windows.

**Verdict: Approve with suggestions** — no critical issues, one important item worth addressing.

---

## Findings

### Critical Issues

None.

### Important Issues

1. **`shutdownHostagent` fallback path may double-kill the hostagent via different mechanisms** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:556-560` (important, regression)

```go
	} else {
		logger.Info("No watcher for hostagent, forcing stop via stored PIDs")
		r.killHostagent(name) // no-op if watcher absent; forceStop uses stored PIDs
		forceStop()
		waitAfterKill()
	}
```

When `signalHostagent` returns false (no watcher), the code calls `killHostagent(name)` (line 558), which is a no-op when the watcher is absent (the guard at line 223 checks `!ok`). The comment already says "no-op if watcher absent" — but if `signalHostagent` returned false because `state.cmd.Process` was nil (not because the watcher was absent), `killHostagent` will also hit the same nil guard at line 223 and skip. Then `forceStop` (line 559) calls `stopInstanceForcibly` which kills by stored PIDs from `store.Inspect`. This is fine functionally, but the `killHostagent` call is dead code in all reachable states of the else branch, since `signalHostagent` returning false means either no watcher or no process handle — the same conditions `killHostagent` checks.

Fix: Remove the `killHostagent` call or add a comment explaining it's a defensive no-op. The code is not wrong, but the dead call is confusing.

```diff
 	} else {
 		logger.Info("No watcher for hostagent, forcing stop via stored PIDs")
-		r.killHostagent(name) // no-op if watcher absent; forceStop uses stored PIDs
 		forceStop()
 		waitAfterKill()
 	}
```

### Suggestions

1. **`KillTree` on Windows treats exit code 128 as "process not found" but this exit code is not officially documented** — `pkg/util/process/process_windows.go:80-82` (suggestion, regression)

```go
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
		return nil
	}
```

The exit code 128 from `taskkill` when the target process doesn't exist is an observed behavior, not documented by Microsoft. If a future Windows update changes this code, `KillTree` would start returning errors for already-dead processes. A comment documenting the source of the 128 value (empirically observed, consistent across Windows 10/11) would help future maintainers.

Fix: Add a named constant with a comment.

```diff
+// taskkillExitNotFound is the exit code taskkill returns when the target
+// process does not exist. Not officially documented by Microsoft but
+// consistently observed across Windows 10 and 11.
+const taskkillExitNotFound = 128
+
 func KillTree(ctx context.Context, pid int) error {
 	err := exec.CommandContext(ctx, "taskkill", "/F", "/T", "/PID", strconv.Itoa(pid)).Run()
 	var exitErr *exec.ExitError
-	if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
+	if errors.As(err, &exitErr) && exitErr.ExitCode() == taskkillExitNotFound {
 		return nil
 	}
```

2. **`waitForInstanceStopped` does not do an initial check before waiting for the first ticker tick** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:594-611` (suggestion, regression)

```go
func waitForInstanceStopped(ctx context.Context, name string) bool {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
```

There's a 500ms delay before the first status check. If the hostagent exits quickly after receiving the signal, this adds unnecessary latency. The function is called from `killOrphanedHostagent` (line 580) with a 30-second context, so this doesn't cause correctness issues, but checking immediately on entry would make the function more responsive.

Fix: Check once before entering the loop.

```diff
 func waitForInstanceStopped(ctx context.Context, name string) bool {
+	// Check immediately before waiting for the first tick.
+	inst, err := store.Inspect(ctx, name)
+	if err == nil && (inst == nil || inst.Status == limatype.StatusStopped) {
+		return true
+	}
 	ticker := time.NewTicker(500 * time.Millisecond)
```

3. **`stopInstanceForcibly` iterates DriverPID before HostAgentPID** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:626` (suggestion, regression)

```go
	for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
```

On Unix, killing the driver first is fine. On Windows (WSL2 backend), the driver PID is a `wsl.exe` process. According to the repo-specific review context, `taskkill /F /T /PID` on a dead parent cannot traverse to its children. If `wsl.exe` is a child of the hostagent (spawned by Lima's WSL2 driver), killing the hostagent first with `/T` would naturally kill `wsl.exe` as well. Killing the driver (wsl.exe) first is harmless since it's typically a separate process, but the ordering deserves a comment explaining the choice, since Lima's own `StopForcibly` kills the driver first too.

Fix: Add a brief comment about the ordering.

4. **`_unix_template` uses unquoted heredoc allowing variable expansion** — `bats/helpers/vm_template.bash:57` (suggestion, regression)

```bash
_unix_template() {
    cat <<YAML
```

The `_unix_template` function uses an unquoted heredoc (`<<YAML`) which enables shell variable expansion (for `${RDD_VM_TYPE:+...}` on line 65). The `_wsl2_template` function uses `<<'YAML'` (quoted, no expansion). This is intentional — `_unix_template` needs expansion for `RDD_VM_TYPE` — but a reader comparing the two functions might think the quoting difference is accidental. A comment on line 57 explaining the intentional difference would help.

---

## Design Observations

### Concerns

**Stale `inst` in `shutdownHostagent` forceStop closure** — (future). When `shutdownHostagent` is called with a non-nil `inst` (e.g., from `stopInstance` at line 478), the `forceStop` closure captures that inst pointer. After the 30-second graceful timeout, the PIDs in the captured `inst` may be stale (process exited and PID recycled, especially on Windows). `stopInstanceForcibly` will call `KillTree` on those PIDs. The deletion path (`handleDeletion`) guards against this for stopped/broken instances by clearing PIDs, but `shutdownHostagent` does not. On Unix, PID recycling at 32768+ makes this extremely unlikely. On Windows, the risk is real but mitigated by the short window (30 seconds) and the comment block at lines 215-223 acknowledging it. A fresh `store.Inspect` in `forceStop` when `inst != nil` would eliminate this class of bugs, at the cost of one extra filesystem read.

### Strengths

- The replacement of Lima's `SysKill` (which broadcasts `CTRL_BREAK_EVENT` to the entire console, killing the RDD service) with targeted `process.Interrupt` + `process.KillTree` is the correct fix. The root cause analysis of Lima's `osutil_windows.go:38-39` (`GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT, pid)` ignoring the signal type) is precise.

- The SSH key pre-generation (`ssh_keys.go`) handles every partial-state combination (both keys, only private, only public, corrupt private, stale temps, no keys) with clean error recovery. The generate-to-temp-then-rename pattern prevents blocked retries.

- The `stopInstanceForcibly` implementation carefully conditions disk unlocking and tmp file cleanup on `allKilled` success, preventing status desynchronization (removing PID files while the process is still alive would make `store.Inspect` report `StatusStopped` falsely).

- The WSL2 distro termination before `--unregister` (to prevent `wslservice.exe` deadlocks) with the fallback `taskkill.exe /F /IM wslservice.exe` in the BATS cleanup test is a practical defense against a kernel-level issue that can't be fixed from userspace.

---

## Testing Assessment

The SSH key generation has thorough unit tests covering all 6 on-disk state combinations. The BATS integration tests were systematically updated for cross-platform process management (`assert_process_alive`, `process_kill`). Untested scenarios ranked by risk:

1. **`shutdownHostagent` graceful-timeout → forceStop fallback** — acknowledged in a comment at line 548. Requires a hostagent that ignores shutdown signals, which is hard to simulate in integration tests.
2. **PID recycling on Windows during `stopInstanceForcibly`** — acknowledged in comments at lines 215-223. Requires precise timing control that BATS cannot reproduce.
3. **`wslservice.exe` deadlock during `wsl --unregister`** — the BATS cleanup test (line 696-701) recovers from this, but the controller-side `terminateWSL2Distro` timeout path is not directly tested.

---

## Documentation Assessment

No documentation gaps. The `environment.md` already documents `RDD_LOG_LEVEL`. The PR description explains the MSYS2 migration rationale. Code comments are comprehensive, especially around platform-specific behavior and failure modes.

---

## Commit Structure

The 11 commits follow a logical progression: process signaling fix → WSL2 lifecycle → SSH keys → BATS enablement → CI migration → log propagation → review responses. The 5 "Address review" fixup commits are acceptable since they address PR feedback and keep the review history readable. If this were squash-merged, the history would be clean.

---

## Acknowledged Limitations

1. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: `TODO: Wait on all hostagents in parallel instead of sequentially`. Pre-existing, not worsened by this change.

2. **Windows PID recycling** — `limavm_lifecycle.go:215-223`: "The proper fix is to validate process identity (e.g., check executable name) before killing, or use Windows Job Objects." Acknowledged explicitly; the self-healing restart limits blast radius.

3. **`taskkill /T` dead parent limitation** — `process_windows.go:70-75`: Acknowledged that orphaned children survive when the parent is already dead. Noted as "acceptable: orphaned port forwarders cannot rebind their ports and are harmless."

4. **Non-blocking stop** — `limavm_lifecycle.go:472-473`: `TODO: Non-blocking stop: send SIGINT and return immediately`. Pre-existing.

5. **Windows Job Objects** — `process_windows.go:75`: "Windows Job Objects would fix this if needed." Mentioned in mook-as's review comment and acknowledged as a future improvement.

---

## Unresolved Feedback

The review comments provided in the PR description all appear addressed:
- mook-as's spelling suggestion (allow → expect) — addressed
- mook-as's pacman command format — addressed (line 111 matches suggestion)
- mook-as's openssh/cygwin concern — replied to with explanation
- mook-as's Job Objects suggestion — acknowledged in code comment
- mook-as's PR summary suggestion — not directly verifiable but the PR description is detailed

---

## Coverage Summary

| File | Status |
|---|---|
| `.github/actions/spelling/expect/tech.txt` | **Trivial** — spelling list additions/reordering |
| `.github/actions/spelling/expect/tools.txt` | **Trivial** — spelling list addition |
| `.github/workflows/bats.yaml` | **Reviewed, no issues** — WSL2→MSYS2 migration, wrapper script, correct env setup |
| `bats/helpers/load.bash` | **Trivial** — added `vm_template.bash` source |
| `bats/helpers/os.bash` | **Reviewed, no issues** — cross-platform process helpers are correct |
| `bats/helpers/utils.bash` | **Reviewed, no issues** — `is_msys` guard for `create_file`, empty-contents fix |
| `bats/helpers/vm_template.bash` | **Finding 4** — unquoted heredoc difference worth documenting |
| `bats/tests/33-lima-controllers/limavm-instance.bats` | **Reviewed, no issues** — template extraction, `skip_on_windows` removal, WSL cleanup |
| `bats/tests/33-lima-controllers/limavm-running.bats` | **Reviewed, no issues** — comprehensive cross-platform updates |
| `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` | **Reviewed, no issues** — `signalHostagent` signature change, `killHostagent` addition |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | **Reviewed, no issues** — `shutdownAllHostagents` rewrite with proper forced cleanup |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | **Finding 1, 2, 3** — main lifecycle changes, `shutdownHostagent`, `stopInstanceForcibly` |
| `pkg/controllers/lima/limavm/controllers/ssh_keys.go` | **Reviewed, no issues** — thorough state handling |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` | **Reviewed, no issues** — good coverage of all on-disk states |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` | **Reviewed, no issues** — correct no-op |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` | **Reviewed, no issues** — correct Lima path usage |
| `pkg/external/main.go` | **Reviewed, no issues** — `setVerbosityFromEnv`, timeout/failure count tuning |
| `pkg/service/cmd/service.go` | **Reviewed, no issues** — graceful Interrupt→Kill fallback, timeout kill |
| `pkg/util/process/process_unix.go` | **Reviewed, no issues** — `Interrupt`, `KillTree` with ESRCH fallback |
| `pkg/util/process/process_windows.go` | **Finding 1** — `KillTree` exit code 128 magic number |
| `scripts/collect-bats-logs.sh` | **Reviewed, no issues** — MSYS2 cygpath support added |

---

### Codex GPT 5.4 Review

**Executive Summary**

This PR moves Windows BATS execution from WSL to MSYS2 and reworks Lima hostagent shutdown so Windows signaling no longer takes down the RDD service. The overall direction is sound and most of the Windows-specific lifecycle handling is well thought through, but I found one important regression in the new SSH-key bootstrap path that can destroy the only private key for existing Lima VMs on startup.

**Critical Issues**

None.

**Important Issues**

1. **Missing `user.pub` can destroy the existing private key on startup** — [pkg/controllers/lima/limavm/controllers/ssh_keys.go:34-40](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys.go#L34) (important, regression)

```go
// Public key missing — try to derive it from the existing private key
// to preserve SSH access to existing VMs. Fall back to full regeneration
// if the private key is corrupt.
if pub, err := exec.CommandContext(ctx, "ssh-keygen", "-y", "-f", privPath).Output(); err == nil {
	return os.WriteFile(pubPath, pub, 0o644)
}
_ = os.Remove(privPath)
```

The fallback at [ssh_keys.go:37-40](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys.go#L37) deletes `privPath` after any `ssh-keygen -y` failure, not just a corrupt-key failure. That includes ordinary environment problems such as `ssh-keygen` missing from `PATH`, a transient execution failure, or the same Windows path/tooling mismatch this PR is trying to work around. Because [SetupWithManager() calls `ensureSSHKeys()` during every controller startup at line 453](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_controller.go#L453), a machine that has a valid `user` key but a missing `user.pub` can lose the private key before the controller returns an error. Existing VMs then become unreachable because the original identity is gone.

Fix: keep the old private key until a replacement keypair has been generated and renamed into place successfully. In practice, remove the eager delete at [ssh_keys.go:40](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys.go#L40), generate the new pair at `tmpPath`, move the old private key aside only after generation succeeds, and restore it if either rename fails. Also add a test that simulates `ssh-keygen -y` failure while a valid private key exists.

**Suggestions**

None.

**Design Observations**

**Strengths**

- Switching Windows BATS from WSL interop to MSYS2 addresses the root cause described in the PR, instead of layering more retries on top of flaky `binfmt_misc` behavior.
- Using `CREATE_NEW_PROCESS_GROUP` plus a targeted interrupt for hostagents is the right direction for Windows; it narrows shutdown scope and makes the service/hostagent relationship much easier to reason about.

**Testing Assessment**

1. There is no regression test for the highest-risk SSH-key path: valid existing `user`, missing `user.pub`, and `ssh-keygen -y` failing. That is the case that currently deletes the only private key.
2. The forced-stop timeout path in `shutdownHostagent()` / `shutdownAllHostagents()` is still untested; the current BATS coverage exercises outcomes, but not the branch where graceful signaling fails and cleanup falls back to `KillTree`.
3. The Windows orphaned-hostagent tests do not distinguish graceful shutdown from forced termination, so they would not catch a regression that always falls back to the harsher path.

**Documentation Assessment**

None.

**Acknowledged Limitations**

- [pkg/controllers/lima/limavm/controllers/limavm_controller.go:504](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_controller.go#L504): `"the total wait is N × gracefulShutdownTimeout"` for `shutdownAllHostagents()`. This change makes that more relevant on Windows because graceful `CTRL_BREAK` delivery can fail and force the slow path.
- [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:216](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L216): the new broken-instance recovery still relies on Windows PID files and notes that Job Objects or process-identity validation would be safer. This PR improves behavior overall, but the residual PID-recycling risk remains in that path.
- [bats/tests/33-lima-controllers/limavm-running.bats:261](/Users/jan/git/rancher-desktop-daemon/bats/tests/33-lima-controllers/limavm-running.bats#L261): `"does not distinguish the shutdown mechanism"` for orphaned-hostagent recovery. The change adds useful coverage, but this specific distinction is still unverified.

**Coverage Summary**

- `.github/actions/spelling/expect/tech.txt` — Trivial
- `.github/actions/spelling/expect/tools.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Trivial
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Finding 1
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Finding 1
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Finding 1
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/util/process/process_unix.go` — Reviewed, no issues
- `pkg/util/process/process_windows.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues

---

### Gemini 3.1 Pro Review

### Executive Summary

This PR successfully migrates Windows BATS testing from WSL2 interop to native MSYS2, mitigating flaky test runs caused by GitHub Actions runners. It isolates the RDD service and Lima hostagents into separate process groups to prevent RDD from crashing when VMs restart, and it hardens WSL2 instance lifecycle handling against deadlocks. However, an important regression was introduced during forced instance termination where stale PIDs can result in arbitrary host processes being killed on Windows due to aggressive PID recycling.

### Critical Issues

None.

### Important Issues

1. **Stale instance state used for forced termination after graceful timeout** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:514-528` and `569-587` (important, regression)

```go
	forceStop := func() {
		// Use a background context: the parent reconciler context may be
		// nearing its deadline after the graceful shutdown wait.
		forceCtx, forceCancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer forceCancel()
		forceInst := inst
		if forceInst == nil {
			var err error
			forceInst, err = store.Inspect(forceCtx, name)
```

In `shutdownHostagent` and `killOrphanedHostagent`, the `inst` parameter passed from the caller is often captured *before* the 30-second graceful shutdown wait (e.g., at line 478 in `stopInstance`). If the wait times out, or if the process exits right around the timeout boundary, the PID contained in `inst.HostAgentPID` will be freed and potentially recycled by Windows. `forceStop` then blindly passes this stale PID to `stopInstanceForcibly`, which executes `taskkill /F /T /PID`, risking termination of an unrelated process tree. The `shutdownAllHostagents` function properly avoids this by performing a fresh `store.Inspect` *after* the wait, but these paths do not.

Fix: Ignore the passed `inst` and perform a fresh `store.Inspect` in the failure path before killing to ensure you have the most up-to-date PIDs:

```diff
 	forceStop := func() {
 		// Use a background context: the parent reconciler context may be
 		// nearing its deadline after the graceful shutdown wait.
 		forceCtx, forceCancel := context.WithTimeout(context.Background(), 30*time.Second)
 		defer forceCancel()
-		forceInst := inst
-		if forceInst == nil {
-			var err error
-			forceInst, err = store.Inspect(forceCtx, name)
-			if err != nil {
-				logger.Error(err, "Failed to inspect Lima instance for forceful stop")
-				return
-			}
-			if forceInst == nil {
-				return
-			}
+		forceInst, err := store.Inspect(forceCtx, name)
+		if err != nil {
+			logger.Error(err, "Failed to inspect Lima instance for forceful stop")
+			return
+		}
+		if forceInst == nil {
+			return
 		}
 		stopInstanceForcibly(forceCtx, logger, forceInst)
 	}
```

Apply the same refresh logic in `killOrphanedHostagent` before its call to `stopInstanceForcibly` on line 586:

```diff
 			if waitForInstanceStopped(stopCtx, inst.Name) {
 				logger.Info("Orphaned hostagent exited gracefully")
 				return nil
 			}
 			logger.Info("Orphaned hostagent did not exit gracefully, forcing stop")
 		}
 	}
 
+	var err error
+	inst, err = store.Inspect(ctx, inst.Name)
+	if err != nil || inst == nil {
+		return err
+	}
 	stopInstanceForcibly(ctx, logger, inst)
 	return nil
```

### Suggestions

None.

### Design Observations

**Strengths**
- The SSH key recovery in `ensureSSHKeysAt` (`pkg/controllers/lima/limavm/controllers/ssh_keys.go`) handles partial or corrupt state gracefully (re-deriving pubkeys from privkeys and removing stale temp files). Limiting it with a 30-second context timeout prevents blocking setup if `ssh-keygen` hangs waiting for a TTY prompt.
- `setVerbosityFromEnv` seamlessly ties the service's `logrus` string log levels into the external controllers' `klog` numeric flags without introducing complicated abstractions.
- Implementing the Windows process tree kill via `taskkill /T` ensures SSH port forwarders and hidden host agent dependencies are successfully reaped during forced termination.

### Acknowledged Limitations

- `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504` — "TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case."
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:216` — The limitation that StatusBroken instances on Windows might hold stale PID files that lead to `taskkill` destroying unrelated processes.
- `pkg/service/cmd/service.go:347` — On Windows, `process.Interrupt` via `GenerateConsoleCtrlEvent` fails to gracefully stop the RDD service if called via CLI because the processes lack a shared console, resulting in an immediate fallback to `Kill()`.
- `pkg/util/process/process_windows.go:61` — `taskkill` cannot traverse and kill surviving children of a process that is already dead, unlike `kill(-pgid)` on Unix.

### Coverage Summary

- `.github/actions/spelling/expect/tech.txt` — Trivial
- `.github/actions/spelling/expect/tools.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Trivial
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Finding 1
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Reviewed, no issues
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/util/process/process_unix.go` — Reviewed, no issues
- `pkg/util/process/process_windows.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues
