# Deep Review: 20260322-232014-pr-242

| | |
|---|---|
| **Date** | 2026-03-22 23:20 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `56cbd9c` Fix Windows process signaling for Lima hostagent lifecycle<br>`e4688cf` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`d299fb7` Enable Lima BATS tests on Windows<br>`222d56c` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`d632612` Run Windows BATS tests via MSYS2 instead of WSL2<br>`4e73c1a` Honor RDD_LOG_LEVEL in external controllers |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — one important regression in the `StopWithWait` timeout path defeats process isolation on Windows |
| **Wall-clock time** | `21 min 9 s` |

---

## Consolidated Review

### Executive Summary

This PR fixes Windows process signaling so hostagents run in their own process groups, replaces Lima's broken `StopForcibly` with a correct `stopInstanceForcibly` that terminates WSL2 distros and cleans up stale files, pre-generates SSH keys to avoid MSYS2 path conversion issues, and migrates CI from WSL2-based BATS to MSYS2. The process isolation design is well-executed, but `StopWithWait`'s 60-second timeout path uses `KillTree` which on Windows kills the entire parent-child tree — undoing the isolation this PR establishes.

---

### Critical Issues

None.

---

### Important Issues

1. **`StopWithWait` timeout kills hostagents on Windows via tree-kill** — `pkg/service/cmd/service.go:366-372` [Claude Opus 4.6] (important, regression)

```go
case <-timeout:
	// Graceful shutdown timed out; force-kill so we don't leave
	// a hung service process. Use KillTree for SIGKILL on Unix
	// (process.Kill sends SIGTERM, which is catchable like the
	// SIGINT already sent) and taskkill /F /T on Windows.
	_ = process.KillTree(context.Background(), pid)
	return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop", instance.Name(), pid)
```

On Unix, `KillTree` sends `SIGKILL` to the service's process group. Hostagents have their own groups (via `SetGroup`/`Setpgid`) and are unaffected. On Windows, `KillTree` uses `taskkill /F /T /PID` which walks the parent-child tree regardless of process groups, killing hostagents that are children of the service. This defeats the process isolation the rest of the PR establishes.

The scenario: `Interrupt(pid)` triggers graceful shutdown (line 351), but `shutdownAllHostagents` takes longer than 60 seconds (sequential N × 30s wait, per the acknowledged TODO). The timeout fires and `KillTree` kills the service and all hostagents, interrupting their graceful shutdown. On Windows, this leaves WSL2 distros potentially un-terminated.

Fix: use a single-process kill instead of tree-kill. On Windows, `process.Kill` uses `TerminateProcess` which affects only the target process. On Unix, `Kill` sends `SIGTERM` — if SIGKILL is needed for a truly hung process, add a `KillForce(pid)` that sends `SIGKILL` to a single PID (not the process group):

```diff
 case <-timeout:
-	_ = process.KillTree(context.Background(), pid)
+	_ = process.Kill(pid)
 	return fmt.Errorf(...)
```

---

### Suggestions

1. **Missing `context.Background()` comment in `killHostagent`** — `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go:226` [Claude Opus 4.6] (suggestion, regression)

```go
func (r *LimaVMReconciler) killHostagent(name string) {
	// ...
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = process.KillTree(ctx, state.cmd.Process.Pid)
}
```

Per the project convention, `context.Background()` in new code should have a comment explaining why. Other uses in this PR (e.g., `shutdownHostagent.forceStop` at line 497, `waitAfterKill` at line 518) include explanatory comments. This one is missing.

Fix: add a comment, e.g.: `// No parent context: killHostagent runs during shutdown when the reconciler context may already be cancelled.`

2. **WSL2 distro termination logic duplicated** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:49-59,627-635` [Claude Opus 4.6] (suggestion, regression)

The `wsl.exe --terminate` + 10-second timeout pattern appears in both `handleDeletion` (stopped WSL2 instances, line 54) and `stopInstanceForcibly` (running instances, line 633). Both construct `"lima-" + inst.Name` and use `context.WithTimeout(ctx, 10*time.Second)`.

Fix: extract a helper:

```go
func terminateWSL2Distro(ctx context.Context, logger logr.Logger, instName string) {
	distroName := "lima-" + instName
	wslCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	if err := exec.CommandContext(wslCtx, "wsl.exe", "--terminate", distroName).Run(); err != nil {
		logger.V(1).Info("Failed to terminate WSL2 distro", "distro", distroName, "error", err)
	}
}
```

3. **`taskkill /T` on dead parent leaves orphaned children on Windows** — `pkg/util/process/process_windows.go:70-76` [Codex GPT 5.4] (suggestion, gap)

```go
func KillTree(ctx context.Context, pid int) error {
	err := exec.CommandContext(ctx, "taskkill", "/F", "/T", "/PID", strconv.Itoa(pid)).Run()
	if err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
			return nil
		}
	}
	return err
}
```

When `stopInstanceForcibly` calls `KillTree` on a dead hostagent (crash recovery path at `limavm_lifecycle.go:190-208`), `taskkill /T` returns exit code 128 because the parent no longer exists. The code treats this as success (`allKilled` stays true), but children like `ssh.exe` port forwarders may still be alive. On Unix, `kill(-pgid, SIGKILL)` kills all group members regardless of the leader's state — this asymmetry is documented in the review context but not guarded against here.

The practical impact is low: orphaned port forwarders cannot bind the same ports as the new hostagent and are harmless. A comment documenting this asymmetry would prevent future confusion. A future fix could use Windows Job Objects to ensure automatic child cleanup.

---

### Design Observations

**Concerns**

- **`exec.CommandContext` races with `shutdownAllHostagents`** (future) [Claude Opus 4.6] — The hostagent is started with `exec.CommandContext(ctx, ...)` where `ctx` derives from the manager context. When the manager shuts down, Go's `CommandContext` calls `Process.Kill()` immediately (SIGKILL on Unix, TerminateProcess on Windows). Concurrently, `shutdownAllHostagents` tries graceful Interrupt + wait. If `CommandContext`'s kill fires first, the hostagent dies ungracefully and the `!graceful` cleanup block (WSL2 distro termination, disk unlock) is skipped. This is pre-existing and self-healing (next start recovers via `killOrphanedHostagent`), but the window is more significant on Windows where cleanup includes WSL2 distro management. A future fix could set `haCmd.Cancel` to send `Interrupt` instead of `Kill`, with `haCmd.WaitDelay = gracefulShutdownTimeout`.

- **Windows child process cleanup via Job Objects** (future) [Gemini 3.1 Pro] — When a hostagent crashes on Windows, its descendants (SSH port forwarders) survive and `taskkill /T` on the dead parent's PID cannot traverse them. Launching the hostagent inside a Windows Job Object would cause the OS to terminate all child processes when the hostagent handle closes, regardless of how it died.

**Strengths**

- The replacement of Lima's `SysKill` (which broadcasts `CTRL_BREAK_EVENT` to the entire console, killing the RDD service) with targeted `process.Interrupt` + `process.KillTree` is exactly the right approach. The PR works around the upstream limitation cleanly without modifying the dependency. [Claude Opus 4.6]
- The SSH key pre-generation (`ssh_keys.go`) handles every partial-state combination (missing pubkey, corrupt privkey, stale temps, orphaned pubkey) with correct crash-recovery behavior. The generate-to-tmp-then-rename pattern includes the Windows-specific `os.Remove` before `os.Rename`. [Claude Opus 4.6, Gemini 3.1 Pro]
- Clearing `existingInst.DriverPID` and `existingInst.HostAgentPID` before calling `limainstance.Delete` prevents Lima's internal `StopForcibly` from killing recycled PIDs on Windows. [Gemini 3.1 Pro]
- The `process` package centralizes Unix/Windows signal semantics instead of scattering them through the controller code. [Codex GPT 5.4]

---

### Testing Assessment

Test coverage is thorough. The BATS tests exercise cross-platform process management, crash recovery, orphaned hostagent recovery, graceful shutdown, and WSL2 deadlock recovery. SSH key generation has comprehensive unit tests covering all partial-state scenarios.

Untested scenarios, ranked by risk:
1. **`shutdownHostagent` graceful timeout → forceStop fallback** — requires a hostagent that ignores shutdown signals. Acknowledged in code comment at `limavm_lifecycle.go:526-528`.
2. **`StopWithWait` 60-second timeout path** — the `KillTree` call on timeout is not exercised by any test.
3. **Windows crash recovery with surviving hostagent children** — current BATS coverage validates eventual recovery, not that stale children are reaped.
4. **`Interrupt` failure on Windows from a different console** — the `Kill` (TerminateProcess) fallback is documented but not tested in CI.

---

### Documentation Assessment

Code comments are thorough and explain the "why" for each platform-specific behavior. CI workflow comments explain MSYS2 package choices. No documentation gaps.

---

### Acknowledged Limitations

1. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`:
   > `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.`

   The PR adds WSL2 cleanup to the `!graceful` block within this loop, making the sequential wait slightly longer per hostagent. The root issue is unchanged.

2. **`shutdownHostagent` forceStop not directly tested** — `limavm_lifecycle.go:526-528`:
   > `// Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals.`

   Pre-existing gap. The orphaned-hostagent integration test exercises forced stop indirectly.

3. **Orphaned hostagent shutdown mechanism not distinguished** — `limavm-running.bats:262-264`:
   > `// Not tested: whether the orphaned hostagent shuts down gracefully (via CTRL_BREAK/SIGINT) or is force-killed.`

   New acknowledgment. The test validates recovery regardless of mechanism.

4. **Graceful shutdown bypass from different console** — `service.go:346-350`:
   > `// On Windows, Interrupt uses GenerateConsoleCtrlEvent, which only works when caller and target share a console.`

   The change reduces blast radius by isolating hostagents into their own process groups, but `rdd svc stop` from a different terminal still falls back to `TerminateProcess`, bypassing graceful shutdown.

---

### Coverage Summary

| File | Status |
|------|--------|
| `.github/actions/spelling/allow.txt` | **Trivial** |
| `.github/workflows/bats.yaml` | **Reviewed, no issues** |
| `bats/helpers/load.bash` | **Trivial** |
| `bats/helpers/os.bash` | **Reviewed, no issues** |
| `bats/helpers/utils.bash` | **Reviewed, no issues** |
| `bats/helpers/vm_template.bash` | **Reviewed, no issues** |
| `bats/tests/33-lima-controllers/limavm-instance.bats` | **Reviewed, no issues** |
| `bats/tests/33-lima-controllers/limavm-running.bats` | **Reviewed, no issues** |
| `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` | **Suggestion 1** |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | **Reviewed, no issues** |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | **Suggestion 2** |
| `pkg/controllers/lima/limavm/controllers/ssh_keys.go` | **Reviewed, no issues** |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` | **Reviewed, no issues** |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` | **Reviewed, no issues** |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` | **Reviewed, no issues** |
| `pkg/external/main.go` | **Reviewed, no issues** |
| `pkg/service/cmd/service.go` | **Important 1** |
| `pkg/util/process/process_unix.go` | **Reviewed, no issues** |
| `pkg/util/process/process_windows.go` | **Suggestion 3** |
| `scripts/collect-bats-logs.sh` | **Reviewed, no issues** |

---

## Agent Performance Retro

### [Claude] Opus 4.6

- **Unique contributions**: Found the `StopWithWait` tree-kill regression (Important 1) — the most actionable finding in this review. Also identified the WSL2 termination duplication (Suggestion 2) and the missing `context.Background()` comment (Suggestion 1).
- **Accuracy**: All findings verified as correct. No false positives.
- **Depth**: Traced `KillTree` semantics across Unix/Windows, understood the interaction between `SetGroup`/`Setpgid` and `taskkill /T` vs `kill(-pgid)`. Traced the `exec.CommandContext` race with `shutdownAllHostagents` as a pre-existing concern.
- **Coverage**: Reviewed all 20 files. No coverage misses.
- **Signal-to-noise**: Excellent — 1 important finding, 3 suggestions, all actionable.

### [Codex] GPT 5.4

- **Unique contributions**: Identified the `taskkill /T` exit-code-128 asymmetry (Suggestion 3). Also ran unit tests independently.
- **Accuracy**: The finding is valid but was rated IMPORTANT when the practical impact is low (orphaned port forwarders are harmless). Downgraded to SUGGESTION in consolidation.
- **Depth**: Traced the crash recovery path through the watcher, `stopInstanceForcibly`, and `KillTree`. Good understanding of the Windows process lifecycle.
- **Coverage**: Reviewed all 20 files. Listed `limavm-running.bats` as "Finding 1" but the finding is about `process_windows.go`/`limavm_lifecycle.go`, not the test file. Minor attribution issue.
- **Signal-to-noise**: Good — 1 finding, no noise. Output was concise (86 lines).

### [Gemini] 3.1 Pro

- **Unique contributions**: Proposed Windows Job Objects as a future design improvement.
- **Accuracy**: One false positive: rated `stopInstanceForcibly`'s use of `KillTree` on `DriverPID` as CRITICAL, claiming the driver is not a process group leader. Verification shows the QEMU driver IS started with `BackgroundSysProcAttr` (which sets `Setpgid: true` on Unix and `CREATE_NEW_PROCESS_GROUP` on Windows), the VZ driver runs in-process (no PID), and the WSL2 driver has no `DriverPID`. The claim does not hold on any platform.
- **Depth**: Did not trace into `BackgroundSysProcAttr` in the vendor tree to verify whether the driver process gets `Setpgid`. The false positive could have been avoided by one grep.
- **Coverage**: Reviewed all 20 files. Marked `service.go` as "Reviewed, no issues" where Claude found the Important 1 finding — a coverage miss.
- **Signal-to-noise**: Low — only finding was a false positive.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|--------|-----------------|----------------|----------------|
| Duration | 11:31 | 7:33 | 5:49 |
| Critical | 0 | 0 | 1 (false positive) |
| Important | 1 | 1 (downgraded) | 0 |
| Suggestion | 3 | 0 | 0 |
| Design observations | 2 | 2 | 2 |
| False positives | 0 | 0 | 1 |
| Unique insights | 3 | 1 | 1 |
| Files reviewed | 20 | 20 | 20 |
| Coverage misses | 0 | 0 | 1 |

Claude provided the most value this review: it found the only important regression, produced three actionable suggestions with no false positives, and traced cross-platform semantics thoroughly. Codex was efficient and identified a real asymmetry, though it over-rated the severity. Gemini's sole finding was a false positive caused by not verifying an assumption one hop into the vendor tree.

---

## Appendix: Original Reviews

### Claude Opus 4.6

Now I have all the context I need. Let me write the review.

---

## Executive Summary

This PR enables Lima BATS tests on Windows by switching the CI runner from WSL2-based bash to MSYS2, fixing Windows process signaling so hostagents run in their own process groups, implementing safe `stopInstanceForcibly` that doesn't kill the RDD service, adding WSL2 distro lifecycle management to prevent `wslservice.exe` deadlocks, and pre-generating SSH keys to work around MSYS2 path conversion issues. The changes are well-designed with thorough crash-recovery handling. One important finding in the `StopWithWait` timeout path and a handful of suggestions.

---

### Findings

#### Critical Issues

None.

#### Important Issues

1. **`StopWithWait` timeout kills hostagents on Windows via tree-kill** — `pkg/service/cmd/service.go:371` (important, regression)

```go
case <-timeout:
	// Graceful shutdown timed out; force-kill so we don't leave
	// a hung service process.
	_ = process.KillTree(context.Background(), pid)
	return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop", instance.Name(), pid)
```

On Unix, `KillTree` sends `SIGKILL` to the service's process group. Hostagents have their own process groups (via `SetGroup`/`Setpgid`) and are unaffected. On Windows, `KillTree` uses `taskkill /F /T /PID` which walks the parent-child tree, killing all descendants — including hostagents, which are children of the service regardless of their process group. This defeats the process isolation that the rest of this PR carefully establishes.

The scenario: `Interrupt(pid)` succeeds (line 351), the service begins graceful shutdown, but `shutdownAllHostagents` takes >60s (sequential N × 30s wait, per the acknowledged TODO). The timeout fires and `KillTree` kills the service and all hostagents, interrupting their graceful shutdown. On Windows, this leaves WSL2 distros potentially un-terminated, which the next start must recover via `killOrphanedHostagent`.

Fix: Use a single-process kill instead of tree-kill. On Unix, send `SIGKILL` to just the PID (not the group). On Windows, `TerminateProcess` (which `process.Kill` already uses) only affects the target process. Add a `KillForce` function or reuse `Kill`:

```diff
 			case <-timeout:
-				_ = process.KillTree(context.Background(), pid)
+				_ = process.Kill(pid)
 				return fmt.Errorf(...)
```

On Unix, `Kill` sends `SIGTERM` which might be insufficient for a truly hung process. If that's a concern, add a `KillForce(pid)` that sends `SIGKILL` to a single PID on Unix and calls `TerminateProcess` on Windows.

#### Suggestions

1. **Missing `context.Background()` comment in `killHostagent`** — `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go:226` (suggestion, regression)

```go
func (r *LimaVMReconciler) killHostagent(name string) {
	// ...
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = process.KillTree(ctx, state.cmd.Process.Pid)
}
```

Per the project convention (CLAUDE.md review context), `context.Background()` in new code should have a comment explaining why. Other uses in this PR (e.g., `shutdownHostagent.forceStop` at line 497, `waitAfterKill` at line 518) correctly include explanatory comments. This one is missing.

Fix: Add a one-line comment, e.g.:
```go
// No parent context: killHostagent is called from shutdown paths where the
// reconciler context may already be cancelled.
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
```

2. **WSL2 distro termination logic duplicated** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:49-59,627-635` (suggestion, regression)

The `wsl.exe --terminate` + 10-second timeout pattern appears both inline in `handleDeletion` (for stopped WSL2 instances, line 54) and inside `stopInstanceForcibly` (for running instances, line 633). Both construct `"lima-" + inst.Name` and use `context.WithTimeout(ctx, 10*time.Second)`.

Fix: Extract a helper:
```go
func terminateWSL2Distro(ctx context.Context, logger logr.Logger, instName string) {
	distroName := "lima-" + instName
	wslCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	if err := exec.CommandContext(wslCtx, "wsl.exe", "--terminate", distroName).Run(); err != nil {
		logger.V(1).Info("Failed to terminate WSL2 distro", "distro", distroName, "error", err)
	}
}
```

3. **`vm_template_running` doesn't support `RDD_VM_TYPE` on Windows** — `bats/helpers/vm_template.bash:20-24` (suggestion, gap)

```bash
vm_template_running() {
    if is_windows; then
        _wsl2_template
    else
        _unix_template_running
    fi
}
```

`_unix_template_running` interpolates `${RDD_VM_TYPE:+vmType: ${RDD_VM_TYPE}}`, but `_wsl2_template` hardcodes `vmType: wsl2`. If someone sets `RDD_VM_TYPE` on Windows (unlikely but possible for Hyper-V testing), it's silently ignored. A comment noting this would prevent future confusion.

---

### Design Observations

**Concerns**

- **`exec.CommandContext` races with `shutdownAllHostagents`** (future) — The hostagent is started with `exec.CommandContext(ctx, rddPath, ...)` where `ctx` is derived from the manager context. When the manager shuts down, Go's `CommandContext` calls `Process.Kill()` on the hostagent (SIGKILL on Unix, TerminateProcess on Windows) immediately. Concurrently, `shutdownAllHostagents` tries graceful Interrupt + wait. If `CommandContext`'s kill fires first, the hostagent dies ungracefully and `shutdownAllHostagents` sees `<-state.procExited` immediately, skipping the `!graceful` cleanup block (WSL2 distro termination, disk unlock, tmp files). This is pre-existing and self-healing (next start recovers via `killOrphanedHostagent`), but the window is now more significant on Windows where the cleanup includes WSL2 distro management. A future fix could set `haCmd.Cancel` to send `Interrupt` instead of Kill, with `haCmd.WaitDelay = gracefulShutdownTimeout`, aligning `CommandContext`'s behavior with the explicit shutdown path.

**Strengths**

- The replacement of Lima's `SysKill` (which on Windows broadcasts `CTRL_BREAK_EVENT` to the entire console, killing the RDD service) with targeted `process.Interrupt` + `process.KillTree` is exactly the right approach. Lima's Windows `SysKill` at `vendor/.../osutil_windows.go:38` ignores the signal parameter and always sends `CTRL_BREAK_EVENT` — even for "SIGKILL". The PR works around this cleanly without modifying the upstream dependency.
- The SSH key pre-generation (`ssh_keys.go`) handles every partial-state combination (missing pubkey, corrupt privkey, stale temps, orphaned pubkey) with correct crash-recovery behavior. The generate-to-tmp-then-rename pattern is well-implemented, including the Windows-specific `os.Remove` before `os.Rename`.
- The `stopInstanceForcibly` replacement of `limainstance.StopForcibly` is well-structured: kill processes, unlock disks only if kills succeeded, terminate WSL2 distros, clean up tmp files only if kills succeeded. The guard on `allKilled` prevents masking still-running processes by removing their PID files.
- The PID-zeroing in `handleDeletion` (line 63-64) before passing to `limainstance.Delete` prevents Lima's internal `StopForcibly` from killing recycled PIDs. This shows careful attention to Windows PID recycling behavior.

---

### Testing Assessment

Test coverage is good. The BATS tests exercise key scenarios on Windows:
- Cross-platform process management (`assert_process_alive`, `process_kill`)
- Crash recovery (hostagent kill → reconcile → new hostagent)
- Orphaned hostagent recovery (service kill → service restart → orphan killed)
- Graceful shutdown (hostagent terminated during `rdd svc stop`)
- WSL2 deadlock recovery (timeout + `taskkill wslservice.exe`)
- SSH key generation (comprehensive unit tests for all partial-state scenarios)

Untested scenarios, ranked by risk:
1. **`shutdownHostagent` graceful timeout → forceStop fallback** — The test comment at line 530 acknowledges this. Requires a hostagent that ignores shutdown signals.
2. **`StopWithWait` 60-second timeout path** — The new `KillTree` call on timeout is not exercised by any test.
3. **`Interrupt` failure on Windows from a different console** — The fallback to `Kill` in `StopWithWait` is documented but not tested in CI.

---

### Documentation Assessment

The code comments are thorough and explain the "why" for each platform-specific behavior. The CI workflow comments explain MSYS2 package choices (e.g., why `openssh` is needed). No documentation gaps.

---

### Acknowledged Limitations

1. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`:
   > `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.`
   
   The PR adds WSL2 cleanup to the `!graceful` block within this loop, making the sequential wait slightly longer per hostagent. The limitation is slightly more impactful now but the root issue is unchanged.

2. **`shutdownHostagent` forceStop not directly tested** — `limavm_lifecycle.go:530-532`:
   > `// Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals.`
   
   Pre-existing gap. The orphaned-hostagent integration test exercises forced stop indirectly.

3. **Orphaned hostagent shutdown mechanism not distinguished** — `limavm-running.bats:262-264`:
   > `// Not tested: whether the orphaned hostagent shuts down gracefully (via CTRL_BREAK/SIGINT) or is force-killed.`
   
   New acknowledgment. The test verifies recovery succeeds regardless of mechanism.

---

### Coverage Summary

| File | Status |
|------|--------|
| `.github/actions/spelling/allow.txt` | **Trivial** — spelling allowlist additions |
| `.github/workflows/bats.yaml` | **Reviewed, no issues** — clean WSL2→MSYS2 migration |
| `bats/helpers/load.bash` | **Trivial** — single source line for vm_template.bash |
| `bats/helpers/os.bash` | **Reviewed, no issues** — correct cross-platform process helpers |
| `bats/helpers/utils.bash` | **Reviewed, no issues** — correct MSYS2 vs WSL file creation fix |
| `bats/helpers/vm_template.bash` | **Finding 3** |
| `bats/tests/33-lima-controllers/limavm-instance.bats` | **Reviewed, no issues** — clean template extraction and WSL2 cleanup |
| `bats/tests/33-lima-controllers/limavm-running.bats` | **Reviewed, no issues** — thorough cross-platform test adaptation |
| `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` | **Finding 1 (suggestion)** |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | **Reviewed, no issues** — correct SSH key init and shutdown cleanup |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | **Finding 2 (suggestion)** |
| `pkg/controllers/lima/limavm/controllers/ssh_keys.go` | **Reviewed, no issues** — thorough crash-recovery handling |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` | **Reviewed, no issues** — covers all partial-state scenarios |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` | **Reviewed, no issues** — correct no-op |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` | **Reviewed, no issues** — correct Lima config dir usage |
| `pkg/external/main.go` | **Reviewed, no issues** — clean klog verbosity mapping |
| `pkg/service/cmd/service.go` | **Finding 1 (important)** |
| `pkg/util/process/process_unix.go` | **Reviewed, no issues** — correct SIGKILL-to-group semantics |
| `pkg/util/process/process_windows.go` | **Reviewed, no issues** — correct CREATE_NEW_PROCESS_GROUP and taskkill /F /T |
| `scripts/collect-bats-logs.sh` | **Reviewed, no issues** — correct cygpath addition |

### Codex GPT 5.4

**Executive Summary**

This PR moves Windows Lima/BATS coverage onto MSYS2, adds Windows-specific SSH-key/bootstrap handling, and reworks hostagent shutdown so VM stop/restart no longer tears down the RDD service. I found one important regression in the new Windows force-stop path: crash recovery can treat a dead hostagent PID as “fully cleaned up” even when child processes are still alive, because `taskkill /T` returns success-by-convention for exit code 128 in a case where it does not actually traverse descendants.

**Critical Issues**

None.

**Important Issues**

1. **Crash recovery can falsely treat a dead Windows hostagent tree as fully cleaned up** — [`pkg/util/process/process_windows.go:69-78`](/Users/jan/git/rancher-desktop-daemon/pkg/util/process/process_windows.go#L69) (`important, regression`)

```go
func KillTree(ctx context.Context, pid int) error {
	err := exec.CommandContext(ctx, "taskkill", "/F", "/T", "/PID", strconv.Itoa(pid)).Run()
	if err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
			return nil
		}
	}
	return err
}
```

The new crash-recovery path calls `stopInstanceForcibly()` after the watcher has already observed `phaseStopped` and torn down the watcher state ([`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:190-208`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L190)). In that path, `stopInstanceForcibly()` iterates `inst.HostAgentPID` and `inst.DriverPID` ([`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:602-610`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L602)). On Windows, though, `taskkill /F /T /PID <pid>` does not kill children of a dead parent; it exits 128 instead. Because `process.KillTree()` now converts exit 128 into `nil` ([`pkg/util/process/process_windows.go:69-75`](/Users/jan/git/rancher-desktop-daemon/pkg/util/process/process_windows.go#L69)), `stopInstanceForcibly()` keeps `allKilled=true` and proceeds to unlock disks and remove `.pid` / `.sock` files ([`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:612-640`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L612)). That leaves leaked hostagent children such as `ssh.exe` forwarders treated as gone, so the next start can race with stale child processes and masked process state.

Fix: do not treat Windows exit code 128 as unconditional success. Either keep it as a failure in `KillTree()` unless the caller has independently proved there are no descendants, or add a Windows-specific fallback that enumerates and kills children when the leader is already dead. In this reconcile path, prefer killing the hostagent tree while the hostagent is still alive, before switching to the stale-PID cleanup path.

**Suggestions**

None.

**Design Observations**

**Strengths**

- The `process` abstraction is a good refactor: `Interrupt`, `SetGroup`, and `KillTree` centralize the Unix/Windows split instead of spreading signal semantics through the controller code.
- Pre-generating Lima SSH keys during controller setup is the right layer for the MSYS2/OpenSSH path-conversion workaround; it fixes the root cause without baking test-runner assumptions into runtime start logic.

**Testing Assessment**

1. The highest-risk untested case is the Windows crash-recovery path where the hostagent leader has already exited but descendants are still alive; the current BATS coverage only checks that recovery eventually succeeds, not that stale children are actually reaped.
2. The forced-stop timeout branch in `shutdownHostagent()` is still untested; the new comment in [`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:526-528`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L526) is accurate, but this is exactly the path most likely to regress on Windows.
3. The “different console” Windows service-stop fallback in [`pkg/service/cmd/service.go:346-350`](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L346) is not covered; current tests stop the service from the same session, so they do not exercise the acknowledged `TerminateProcess` fallback.

I also ran `go test ./pkg/controllers/lima/limavm/controllers -run TestEnsureSSHKeysAt -count=1` and `go test ./pkg/util/process -count=1`.

**Documentation Assessment**

No documentation changes look mandatory for merge.

**Commit Structure**

The commit history is clean. Each commit is a coherent step in the Windows/MSYS2 migration and the later commits read as follow-on fixes, not unrelated scope creep.

**Acknowledged Limitations**

- [`pkg/service/cmd/service.go:346-350`](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L346): “If invoked from a different terminal, the call fails and we fall back to Kill (TerminateProcess)…” This limitation is still real; the change reduces blast radius by isolating hostagents into their own process groups, but `rdd svc stop` can still leave VMs running until the next service start.
- [`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:526-528`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L526): “Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals.” This PR adds the fallback, so the lack of direct coverage matters more now than before.
- [`bats/tests/33-lima-controllers/limavm-running.bats:262-264`](/Users/jan/git/rancher-desktop-daemon/bats/tests/33-lima-controllers/limavm-running.bats#L262): “Not tested: whether the orphaned hostagent shuts down gracefully…” Accurate; the tests validate eventual recovery, not the shutdown mechanism.

**Coverage Summary**

- [`.github/actions/spelling/allow.txt`](/Users/jan/git/rancher-desktop-daemon/.github/actions/spelling/allow.txt) — Trivial
- [`.github/workflows/bats.yaml`](/Users/jan/git/rancher-desktop-daemon/.github/workflows/bats.yaml) — Reviewed, no issues
- [`bats/helpers/load.bash`](/Users/jan/git/rancher-desktop-daemon/bats/helpers/load.bash) — Reviewed, no issues
- [`bats/helpers/os.bash`](/Users/jan/git/rancher-desktop-daemon/bats/helpers/os.bash) — Reviewed, no issues
- [`bats/helpers/utils.bash`](/Users/jan/git/rancher-desktop-daemon/bats/helpers/utils.bash) — Reviewed, no issues
- [`bats/helpers/vm_template.bash`](/Users/jan/git/rancher-desktop-daemon/bats/helpers/vm_template.bash) — Reviewed, no issues
- [`bats/tests/33-lima-controllers/limavm-instance.bats`](/Users/jan/git/rancher-desktop-daemon/bats/tests/33-lima-controllers/limavm-instance.bats) — Reviewed, no issues
- [`bats/tests/33-lima-controllers/limavm-running.bats`](/Users/jan/git/rancher-desktop-daemon/bats/tests/33-lima-controllers/limavm-running.bats) — Finding 1
- [`pkg/controllers/lima/limavm/controllers/hostagent_watcher.go`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/hostagent_watcher.go) — Finding 1
- [`pkg/controllers/lima/limavm/controllers/limavm_controller.go`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_controller.go) — Reviewed, no issues
- [`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go) — Finding 1
- [`pkg/controllers/lima/limavm/controllers/ssh_keys.go`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys.go) — Reviewed, no issues
- [`pkg/controllers/lima/limavm/controllers/ssh_keys_test.go`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys_test.go) — Reviewed, no issues
- [`pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go) — Reviewed, no issues
- [`pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go) — Reviewed, no issues
- [`pkg/external/main.go`](/Users/jan/git/rancher-desktop-daemon/pkg/external/main.go) — Reviewed, no issues
- [`pkg/service/cmd/service.go`](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go) — Reviewed, no issues
- [`pkg/util/process/process_unix.go`](/Users/jan/git/rancher-desktop-daemon/pkg/util/process/process_unix.go) — Reviewed, no issues
- [`pkg/util/process/process_windows.go`](/Users/jan/git/rancher-desktop-daemon/pkg/util/process/process_windows.go) — Finding 1
- [`scripts/collect-bats-logs.sh`](/Users/jan/git/rancher-desktop-daemon/scripts/collect-bats-logs.sh) — Reviewed, no issues

I could not inspect PR review threads with `gh` from this environment: `gh pr view` / `gh repo view` against `rancher-sandbox/rancher-desktop-daemon` returned repository/PR resolution errors, so I could not verify unresolved review comments.

### Gemini 3.1 Pro

### Executive Summary

This PR addresses Windows process lifecycle issues by isolating Lima hostagents in their own process groups, mitigating `wslservice.exe` deadlocks, and cleaning up stale PID files. It also reliably migrates the BATS testing environment on Windows to MSYS2 to bypass WSL path translation issues. The overarching architecture of these changes is extremely robust, but a critical regression was introduced in `stopInstanceForcibly` on Unix, where treating the QEMU/VZ driver process as a process group leader causes silent failures to kill the driver.

### Critical Issues

1. **`stopInstanceForcibly` silently fails to kill orphaned driver processes on Unix** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:125-132` (approximate, inside `stopInstanceForcibly`) (critical, regression)

```go
        for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
                if pid > 0 {
                        if err := process.KillTree(ctx, pid); err != nil {
                                logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
                                allKilled = false
                        }
                }
        }
```

The function iterates over `inst.DriverPID` and `inst.HostAgentPID` and calls `process.KillTree` on both. The `KillTree` contract explicitly requires the target to be started with `SetGroup` so it leads its own process group. While `rdd` starts the hostagent with `SetGroup`, Lima starts the driver process (e.g., QEMU) without creating a new process group.

On Unix, `KillTree` translates to `unix.Kill(-pid, unix.SIGKILL)`. Since `DriverPID` is not a process group leader, sending a signal to a process group with ID `DriverPID` will likely return `ESRCH` (No such process). `KillTree` deliberately suppresses `ESRCH` (assuming the process group already exited) and returns `nil`.

If the hostagent crashes (`HostAgentPID` is 0 or dead) but the driver survives, `stopInstanceForcibly` will silently fail to kill the driver, set `allKilled = true`, and proceed to delete the instance's PID files. This orphans the driver process and permanently hides it from Lima's status checks. Furthermore, if the OS recycled `DriverPID` and assigned it to an unrelated process group leader, this call will kill that unrelated process tree.

Fix: `process.KillTree` must only be used on `HostAgentPID`. For `DriverPID`, introduce a standard `KillForce` function in the `process` package that kills a specific process without assuming it leads a process group.

```go
// In pkg/util/process/process_unix.go
func KillForce(_ context.Context, pid int) error {
	err := unix.Kill(pid, unix.SIGKILL)
	if errors.Is(err, unix.ESRCH) {
		return nil
	}
	return err
}

// In pkg/util/process/process_windows.go
func KillForce(ctx context.Context, pid int) error {
	err := exec.CommandContext(ctx, "taskkill", "/F", "/PID", strconv.Itoa(pid)).Run()
	if err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
			return nil
		}
	}
	return err
}
```

Then, explicitly differentiate them in `stopInstanceForcibly`:

```diff
-       allKilled := true
-       for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
-               if pid > 0 {
-                       if err := process.KillTree(ctx, pid); err != nil {
-                               logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
-                               allKilled = false
-                       }
-               }
-       }
+       allKilled := true
+       if inst.HostAgentPID > 0 {
+               if err := process.KillTree(ctx, inst.HostAgentPID); err != nil {
+                       logger.V(1).Info("Failed to kill hostagent process tree", "pid", inst.HostAgentPID, "error", err)
+                       allKilled = false
+               }
+       }
+       if inst.DriverPID > 0 {
+               // DriverPID does not lead its own process group; KillTree would fail on Unix.
+               if err := process.KillForce(ctx, inst.DriverPID); err != nil {
+                       logger.V(1).Info("Failed to kill driver process", "pid", inst.DriverPID, "error", err)
+                       allKilled = false
+               }
+       }
```

### Important Issues

None.

### Suggestions

None.

### Design Observations

**Concerns**
- **Windows process tree cleanup (future):** When a parent process crashes or is forcefully terminated on Windows (`TerminateProcess`), its descendants (such as SSH port forwarders spawned by the hostagent) survive. Because `taskkill /T` on the dead parent's PID will return 128 (Not Found) without traversing the children, those children remain permanently orphaned. A future refactor could launch the Windows hostagent inside a Job Object (`CREATE_BREAKAWAY_FROM_JOB`), ensuring the OS automatically terminates all child processes when the hostagent handle closes, regardless of how it died.

**Strengths**
- The SSH key generation logic (`ensureSSHKeysAt`) is exceptionally robust. Removing the potentially corrupt public key before generation and explicitly handling partial cleanup on `os.Rename` failures gracefully avoids Windows' strict file-locking and atomic replacement limitations.
- Clearing `existingInst.DriverPID` and `existingInst.HostAgentPID` before calling `limainstance.Delete` is a clever and highly effective way to prevent Lima from erroneously attempting to kill recycled PIDs on Windows.
- Extracting process lifecycle management (`process.KillTree` and `process.Interrupt`) into an OS-aware abstraction prevents complex process signaling logic from polluting the reconciler's domain logic.

### Testing Assessment

Test coverage is thorough and thoughtfully adapted for the new environment. Moving the tests to MSYS2 avoids the chronic issues associated with WSL path conversions when interacting with Windows-native executables. The new `assert_process_alive` and `process_kill` helpers correctly bridge the gap between MSYS2 `kill` and Windows native APIs, and wrapping `rdd ctl wait --for=delete` with a `wslservice.exe` deadlock recovery block significantly hardens the pipeline against flakiness.

### Documentation Assessment

The code is well-commented. The explanations concerning `GenerateConsoleCtrlEvent` caveats and the expected self-healing loop for hostagents across service restarts provide essential context for future maintainers. 

### Acknowledged Limitations

- **Graceful shutdown bypass on Windows:** The author explicitly acknowledges in `pkg/service/cmd/service.go:346`: *"On Windows, Interrupt uses GenerateConsoleCtrlEvent, which only works when caller and target share a console. If invoked from a different terminal, the call fails and we fall back to Kill (TerminateProcess), bypassing graceful shutdown. Hostagents survive in their own process groups and self-heal on the next service start via killOrphanedHostagent."* This change leverages the intentional self-healing mechanism and relies on the new `stopInstanceForcibly` fallback.
- **Sequential parallel wait inefficiency:** Documented as a TODO in `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504`: *"Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case."* This limitation remains unaffected.

### Coverage Summary

- **.github/actions/spelling/allow.txt** — Trivial
- **.github/workflows/bats.yaml** — Reviewed, no issues
- **bats/helpers/load.bash** — Trivial
- **bats/helpers/os.bash** — Reviewed, no issues
- **bats/helpers/utils.bash** — Reviewed, no issues
- **bats/helpers/vm_template.bash** — Reviewed, no issues
- **bats/tests/33-lima-controllers/limavm-instance.bats** — Reviewed, no issues
- **bats/tests/33-lima-controllers/limavm-running.bats** — Reviewed, no issues
- **pkg/controllers/lima/limavm/controllers/hostagent_watcher.go** — Reviewed, no issues
- **pkg/controllers/lima/limavm/controllers/limavm_controller.go** — Reviewed, no issues
- **pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go** — Finding 1
- **pkg/controllers/lima/limavm/controllers/ssh_keys.go** — Reviewed, no issues
- **pkg/controllers/lima/limavm/controllers/ssh_keys_test.go** — Reviewed, no issues
- **pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go** — Reviewed, no issues
- **pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go** — Reviewed, no issues
- **pkg/external/main.go** — Reviewed, no issues
- **pkg/service/cmd/service.go** — Reviewed, no issues
- **pkg/util/process/process_unix.go** — Finding 1
- **pkg/util/process/process_windows.go** — Reviewed, no issues
- **scripts/collect-bats-logs.sh** — Trivial

